5 Structured Thinking Techniques for Data Scientists


Structured thinking is a framework for solving unstructured problems — which covers just about all data science problems. A structured approach not only helps you solve problems faster but also helps you identify the parts of the problem that may need extra attention.

Think of structured thinking as a map of a city you’re visiting for the first time. Without a map, you’ll probably find it difficult to reach your destination, and even if you eventually get there, it will likely take you at least twice as long.

What Is Structured Thinking?

Here’s where the analogy breaks down: structured thinking is a framework, not a fixed mindset; you can modify these techniques based on the problem you’re trying to solve. Let’s look at five structured thinking techniques to use in your next data science project.

  • Six Step Problem Solving Model
  • Eight Disciplines of Problem Solving
  • The Drill Down Technique
  • The Cynefin Framework
  • The 5 Whys Technique


1. Six Step Problem Solving Model

This technique is the simplest and easiest to use. As the name suggests, this technique uses six steps to solve a problem, which are:

  • Have a clear and concise problem definition.
  • Study the roots of the problem.
  • Brainstorm possible solutions to the problem.
  • Examine the possible solutions and choose the best one.
  • Implement the chosen solution effectively.
  • Evaluate the results.

This model follows the mindset of continuous development and improvement. So, on step six, if your results didn’t turn out the way you wanted, go back to step four and choose another solution (or to step one and try to define the problem differently).

My favorite part about this simple technique is how easy it is to alter based on the specific problem you’re attempting to solve. 


2. Eight Disciplines of Problem Solving

The eight disciplines of problem solving offer a practical plan for solving a problem using an eight-step process. You can think of this technique as an extended, more detailed version of the six-step problem-solving model.

Each of the eight disciplines in this process should move you a step closer to finding the optimal solution to your problem. So, once you’ve established the prerequisites of your problem, you can work through disciplines D1 through D8.

  • D1: Put together your team. Having a team with the skills needed to solve the problem makes moving forward much easier.
  • D2: Define the problem. Describe the problem using quantifiable terms: the who, what, where, when, why and how.
  • D3: Develop a working plan.
  • D4: Determine and identify root causes. Identify the root causes of the problem using cause-and-effect diagrams to map causes against their effects.
  • D5: Choose and verify permanent corrections. Based on the root causes, assess the working plan you developed earlier and edit as needed.
  • D6: Implement the corrected action plan.
  • D7: Assess your results.
  • D8: Congratulate your team. At the end of a project, it’s essential to take a step back and appreciate the work you’ve all done before jumping into a new project.

3. The Drill Down Technique

The drill down technique is more suitable for large, complex problems with multiple collaborators. The whole purpose of using this technique is to break down a problem to its roots to make finding solutions that much easier. To use the drill down technique, you first need to create a table. The first column of the table will contain the outlined definition of the problem, followed by a second column containing the factors causing this problem. Finally, the third column will contain the cause of the second column's contents, and you’ll continue to drill down on each column until you reach the root of the problem.

Once you reach the root causes of the symptoms, you can begin developing solutions for the bigger problem.
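
To make the idea concrete, here is a small, invented drill-down for a hypothetical churn problem, written as a Python structure so the chain from problem to root cause is explicit. All of the factors and causes below are assumptions made up for illustration, not part of the original technique.

```python
# Illustrative only: a hypothetical drill-down table for a churn problem.
# Each level answers "what is causing the level above it?"
drill_down = {
    "problem": "Monthly churn rate has doubled",
    "factors": [
        {
            "factor": "More cancellations right after the first invoice",
            "causes": [
                {"cause": "Billed amount differs from the advertised price",
                 "root_cause": "Promotional discount not applied in the billing system"},
            ],
        },
        {
            "factor": "Fewer users finish onboarding",
            "causes": [
                {"cause": "Tutorial crashes on older devices",
                 "root_cause": "Unsupported graphics call introduced in the last release"},
            ],
        },
    ],
}

# Walk the table from the problem definition down to each root cause.
for factor in drill_down["factors"]:
    for cause in factor["causes"]:
        print(f'{drill_down["problem"]} <- {factor["factor"]} <- '
              f'{cause["cause"]} <- {cause["root_cause"]}')
```

Reading each printed line from left to right traces the problem back to a root cause the team can actually act on.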


4. The Cynefin Framework

The Cynefin framework, like the rest of the techniques, works by breaking down a problem into its root causes to reach an efficient solution. We consider the Cynefin framework a higher-level approach because it requires you to place your problem into one of five contexts.

  • Obvious Contexts. In this context, your options are clear, and the cause-and-effect relationships are apparent and easy to point out.
  • Complicated Contexts. In this context, the problem might have several correct solutions. In this case, a clear relationship between cause and effect may exist, but it’s not equally apparent to everyone.
  • Complex Contexts. If it’s impossible to find a direct answer to your problem, then you’re looking at a complex context. Complex contexts are problems that have unpredictable answers. The best approach here is to follow a trial and error approach.
  • Chaotic Contexts. In this context, there is no apparent relationship between cause and effect and our main goal is to establish a correlation between the causes and effects.
  • Disorder. The final context is disorder, the most difficult of the contexts to categorize. The only way to diagnose disorder is to eliminate the other contexts and gather further information.


5. The 5 Whys Technique

Our final technique is the 5 Whys or, as I like to call it, the curious child approach. I think this is the most well-known and natural approach to problem solving.

This technique follows the simple approach of asking “why” five times — like a child would. First, you start with the main problem and ask why it occurred. Then you keep asking why until you reach the root cause of said problem. (Fair warning, you may need to ask more than five whys to find your answer.)
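
As an illustration, here is a hypothetical 5 Whys chain for a broken reporting pipeline, sketched in Python. The questions and answers are invented; in practice you would capture a chain like this in a document or on a whiteboard rather than in code.

```python
# Hypothetical 5 Whys chain; every question and answer here is invented
# purely to illustrate the technique.
five_whys = [
    ("Why did the dashboard show no sales for Tuesday?", "The nightly ETL job failed."),
    ("Why did the ETL job fail?", "A schema change broke the load step."),
    ("Why did the schema change break the load?", "The job assumes fixed column positions."),
    ("Why does the job assume fixed column positions?", "It was written against a one-off CSV export."),
    ("Why was it written against a one-off export?", "There is no agreed data contract with the source team."),
]

for why, answer in five_whys:
    print(f"{why}\n  -> {answer}")

print("Root cause: no data contract between the pipeline and its source.")
```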



Framing Data Science Problems the Right Way From the Start

Data science project failure can often be attributed to poor problem definition, but early intervention can prevent it.


The failure rate of data science initiatives — often estimated at over 80% — is way too high. We have spent years researching the reasons contributing to companies’ low success rates and have identified one underappreciated issue: Too often, teams skip right to analyzing the data before agreeing on the problem to be solved. This lack of initial understanding guarantees that many projects are doomed to fail from the very beginning.

Of course, this issue is not a new one. Albert Einstein is often quoted as having said, “If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.”


Consider how often data scientists need to “clean up the data” on data science projects, often as quickly and cheaply as possible. This may seem reasonable, but it ignores the critical “why” question: Why is there bad data in the first place? Where did it come from? Does it represent blunders, or are there legitimate data points that are just surprising? Will they occur in the future? How does the bad data impact this particular project and the business? In many cases, we find that a better problem statement is to find and eliminate the root causes of bad data .

Too often, we see examples where people either assume that they understand the problem and rush to define it, or they don’t build the consensus needed to actually solve it. We argue that a key to successful data science projects is to recognize the importance of clearly defining the problem and to adhere to proven principles in doing so. This problem is not limited to technology teams; we find that many business, political, management, and media projects, at all levels, also suffer from poor problem definition.

Toward Better Problem Definition

Data science uses the scientific method to solve often complex (or multifaceted) and unstructured problems using data and analytics. In analytics, the term fishing expedition refers to a project that was never framed correctly to begin with and involves trolling the data for unexpected correlations. This type of data fishing does not meet the spirit of effective data science but is prevalent nonetheless. Consequently, defining the problem correctly needs to be step one. We previously proposed an organizational “bridge” between data science teams and business units, to be led by an innovation marshal — someone who speaks the language of both the data and management teams and can report directly to the CEO. This marshal would be an ideal candidate to assume overall responsibility to ensure that the following proposed principles are utilized.

Get the right people involved. To ensure that your problem framing has the correct inputs, you have to involve all the key people whose contributions are needed to complete the project successfully from the beginning. After all, data science is an interdisciplinary, transdisciplinary team sport. This team should include those who “own” the problem, those who will provide data, those responsible for the analyses, and those responsible for all aspects of implementation. Think of the RACI matrix — those responsible , accountable , to be consulted , and to be informed — for each aspect of the project.
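
To make the RACI idea concrete, here is a sketch of how assignments might look for a hypothetical churn-modeling project. The tasks and roles are assumptions for illustration, not a prescription.

```python
# Hypothetical RACI assignments: R = responsible, A = accountable,
# C = consulted, I = informed. All tasks and roles are invented.
raci = {
    "Define the problem statement":  {"R": "Product manager", "A": "VP of analytics",
                                      "C": "Data scientist",  "I": "Customer support lead"},
    "Prepare and document the data": {"R": "Data engineer",   "A": "Data scientist",
                                      "C": "IT and security", "I": "Product manager"},
    "Build and validate the model":  {"R": "Data scientist",  "A": "VP of analytics",
                                      "C": "Data engineer",   "I": "Business stakeholders"},
    "Act on the model's findings":   {"R": "Retention team",  "A": "Product manager",
                                      "C": "Data scientist",  "I": "Executive sponsor"},
}

for task, roles in raci.items():
    print(task, roles)
```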

Recognize that rigorously defining the problem is hard work. We often find that the problem statement changes as people work to nail it down. Leaders of data science projects should encourage debate, allow plenty of time, and document the problem statement in detail as they go. This ensures broad agreement on the statement before moving forward.

Don’t confuse the problem and its proposed solution. Consider a bank that is losing market share in consumer loans and whose leadership team believes that competitors are using more advanced models. It would be easy to jump to a problem statement that looks something like “Build more sophisticated loan risk models.” But that presupposes that a more sophisticated model is the solution to market share loss, without considering other possible options, such as increasing the number of loan officers, providing better training, or combating new entrants with more effective marketing. Confusing the problem and proposed solution all but ensures that the problem is not well understood, limits creativity, and keeps potential problem solvers in the dark. A better statement in this case would be “Research root causes of market share loss in consumer loans, and propose viable solutions.” This might lead to more sophisticated models, or it might not.

Understand the distinction between a proximate problem and a deeper root cause. In our first example, the unclean data is a proximate problem, whereas the root cause is whatever leads to the creation of bad data in the first place. Importantly, “We don’t know enough to fully articulate the root cause of the bad data problem” is a legitimate state of affairs, demanding a small-scale subproject.

Do not move past problem definition until it meets the following criteria:


  • It does no harm. It may not be clear how to solve the defined problem, but it should be clear that solving it will lead to a good business result. If it’s not clear, more refinement may be needed. Consider the earlier bank example. While it might be easy enough to adjust models in ways that grant more loans, this might significantly increase risk — an unacceptable outcome. So the real goal should be to improve market share without creating additional risk, hence the inclusion of “propose viable solutions” in the problem statement above.
  • It considers necessary constraints. Using the bank example, we can recognize that more sophisticated models might require hiring additional highly skilled loan officers — something the bank might be unwilling to do. All constraints, including those involving time, budget, technology, and people, should be clearly articulated to avoid a problem statement misaligned with business goals.
  • It has an accountability matrix (or its equivalent). Alignment is key for success, so ensure that those who are responsible for solving the problem understand their various roles and responsibilities. Again, think RACI matrix.
  • It receives buy-in from stakeholders. Poorly defined or controversial problem statements often produce resistors within the organization. In extreme cases, they may become “snipers,” attempting to ensure project failure. Work to develop a general (not necessarily unanimous) consensus from leadership, those involved in the solution, and the ultimate customers (those who will be affected) on the problem definition.

Taking the time needed to properly define the problem can feel uncomfortable. After all, we live and work in cultures that demand results and are eager to “get on with it.” But shortchanging this step is akin to putting the cart before the horse — it simply doesn’t work. There is no substitute for probing more deeply, getting the right people involved, and taking the time to understand the real problem. All of us — data scientists, business leaders, and politicians alike — need to get better at defining the right problem the right way.

About the Authors

Roger W. Hoerl ( @rogerhoerl ) teaches statistics at Union College in Schenectady, New York. Previously, he led the applied statistics lab at GE Global Research. Diego Kuonen ( @diegokuonen ) is head of Bern, Switzerland-based Statoo Consulting and a professor of data science at the Geneva School of Economics and Management at the University of Geneva. Thomas C. Redman ( @thedatadoc1 ) is president of New Jersey-based consultancy Data Quality Solutions and coauthor of The Real Work of Data Science: Turning Data Into Information, Better Decisions, and Stronger Organizations (Wiley, 2019).



Data Science Solutions: Applications and Use Cases


Data Science is a broad field with many potential applications. It’s not just about analyzing data and modeling algorithms, but it also reinvents the way businesses operate and how different departments interact. Data scientists solve complex problems every day, leveraging a variety of Data Science solutions to tackle issues like processing unstructured data, finding patterns in large datasets, and building recommendation engines using advanced statistical methods, artificial intelligence, and machine learning techniques. 


Data Science helps analyze and extract patterns from corporate data, and those patterns can be organized to guide corporate decisions. Data analysis using Data Science techniques helps companies figure out which trends are the best fit for the business during various parts of the year.

Through data patterns, Data Science professionals can use tools and techniques to forecast future customer needs for a specific product or service. Data Science and businesses can work together closely to understand consumer preferences across a wide range of items and run better marketing campaigns.

To enhance the scope of predictive analytics, Data Science now employs other advanced technologies such as machine learning and deep learning to improve decision-making and create better models for predicting financial risks, customer behaviors, or market trends.

Data Science helps with making future-proofing decisions, supply chain predictions, understanding market trends, planning better pricing for products, deciding which data-driven tasks to automate, and so on.

For example, in sales and marketing, Data Science is mainly used to predict markets, determine new customer segments, optimize pricing structures, and analyze the customer portfolio. Businesses frequently use sentiment analysis and behavior analytics to determine purchase and usage patterns, and to understand how people view products and services. Some businesses, such as Lowe’s, Home Depot, and Netflix, use “hyper-personalization” techniques to match offers to customers accurately via their recommendation engines.

E-commerce companies use recommendation engines, pricing algorithms, predictive customer segmentation, personalized product image search, and artificially intelligent chatbots to offer a transformational customer experience.

In recent times, deep learning, through its use of “artificial neural networks,” has empowered data scientists to perform unstructured data analytics, such as image recognition, object categorization, and sound mapping.

Data Science Solutions by Industry Applications

Now let’s take a look at how Data Science is powering different industry sectors with its cross-disciplinary platforms and tools:

Data Science Solutions in Banking:  Banking and financial sectors are highly dependent on Data Science solutions powered with big data tools for risk analytics, risk management, KYC, and fraud mitigation. Large banks, hedge funds, stock exchanges, and other financial institutions use advanced Data Science (powered by big data, AI, ML) for trading analytics, pre-trade decision-support analytics, sentiment measurements, predictive analytics, and more. 

Data Science Solutions in Marketing:  Marketing departments often use Data Science to build recommendation systems and to analyze customer behavior. When we talk about Data Science in marketing, we are primarily concerned with what we call “retail marketing.” The retail marketing process involves analyzing customer data to inform business decisions and drive revenue. Common data used in retail marketing include customer data, product data, sales data, and competitor data. Customer transactional data is used extensively in AI-powered data analytics systems to increase sales and provide better marketing services. Chatbot analytics and sales representative response data are used together to improve sales efficiency.

The retailer can use this data to build customer-targeted marketing campaigns, optimize prices based on demand, and decide on product assortment. The retail marketing process is rarely automated; it involves making business decisions based on the data. Data scientists working in retail marketing are primarily concerned with deriving insights from the data and applying statistical and machine learning methods to inform these decisions.

Data Science Solutions in Finance and Trading:  Finance departments use Data Science to build trading algorithms, manage risk, and improve compliance. A data scientist working in finance will primarily use data about the financial markets. This includes data about the companies whose stocks are traded on the market, the trading activity of the investors, and the stock prices. The financial data is unstructured and messy; it’s collected from different sources using different formats. The data scientist’s first task, therefore, is to process the data and convert it into a structured format. This is necessary for building algorithms and other models. For example, the data scientist might build a trading algorithm that exploits market inefficiencies and generates profits for the company.

Data Science Solutions in Human Resources:  HR departments use Data Science to hire the best talent, manage employee data, and predict employee performance. The data scientist working in HR will primarily use employee data collected from different sources. This data could be structured or unstructured depending on how it’s collected. The most common source is an HR database such as Workday. The data scientist’s first task is to process and clean the data; this is necessary before insights can be drawn from it. The data scientist might then use methods like machine learning to predict an employee’s performance, for example by training a model on historical employee data and the features it contains.
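
As a rough sketch of the kind of model described above, assuming a small, invented employee dataset and scikit-learn (neither of which comes from the original example):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical historical employee data; the column names are assumptions.
employees = pd.DataFrame({
    "tenure_years":       [1, 3, 5, 2, 7, 4, 6, 1, 8, 3],
    "training_hours":     [10, 40, 25, 5, 60, 30, 45, 8, 70, 20],
    "projects_delivered": [2, 6, 7, 3, 12, 5, 9, 1, 14, 4],
    "performance_score":  [62, 75, 80, 64, 92, 73, 85, 58, 95, 70],
})

X = employees.drop(columns="performance_score")
y = employees["performance_score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on historical records, then check the error on held-out employees.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Mean absolute error:", mean_absolute_error(y_test, model.predict(X_test)))
```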

Data Science in Logistics and Warehousing:  Logistics and operations departments use Data Science to manage supply chains and predict demand. The data scientist working in logistics and warehousing will primarily use data about customer orders, inventory, and product prices. The data scientist will use data from sensors and IoT devices deployed in the supply chain to track the product’s journey. The data scientist might use methods like machine learning to predict demand.

Data Science Solutions in Customer Service:  Customer service departments use Data Science to answer customer queries, manage tickets, and improve the end-to-end customer experience. The data scientist working in customer service will primarily use data about customer tickets, customers, and the support team. The most common source is the ticket management system. In this case, the data scientist might use methods like machine learning to predict when a customer will stop engaging with the brand, for example by training a model on historical customer data.

Big Data with Data Science Solutions Use Cases

While Data Science solutions can be used to gain insights into behaviors and processes, big data analytics refers to the convergence of several cutting-edge technologies working together to help enterprise organizations extract better value from the data that they have.

In biomedical research and health, advanced Data Science and big data analytics techniques are used for increasing online revenue, reducing customer complaints, and enhancing customer experience through personalized services. In the hospitality and food services industries, once again big data analytics is used for studying customers’ behavior through shopping data, such as wait times at the checkout. Statistics show that 38% of companies use big data to improve organizational effectiveness. 

In the insurance sector, big data-powered predictive analytics is frequently used for analyzing large volumes of data at high speed during the underwriting stage. Insurance claims analysts now have access to algorithms that help identify fraudulent behaviors. Across all industry sectors, organizations are harnessing the predictive powers of Data Science to enhance their business forecasting capabilities. 

Big data coupled with Data Science enables enterprise businesses to leverage their own organizational data, rather than relying on market studies or third-party tools. Data Science practitioners work closely with RPA industry professionals to identify data sources for a company, as well as to build dashboards and visuals for exploring various forms of data analytics in real time. Data Science teams can now train deep learning systems to identify contracts and invoices from a stack of documents, as well as to identify different types of information within them.

Big data analytics has the potential to unlock great insights into data across social media channels and platforms, enabling marketing, customer support, and advertising to improve and become more aligned with corporate goals. Big data analytics also makes research results better and helps organizations use research more effectively by allowing them to identify specific test cases and user settings.

Specialized Data Science Use Cases with Examples

Data Science applications can be used for any industry or area of study, but the majority of examples involve data analytics for business use cases. In this section, some specific use cases are presented with examples to help you better understand the potential of Data Science in your organization.

Data cleansing:  In Data Science, the first step is data cleansing, which involves identifying and cleaning up any incorrect or incomplete data sets. Data cleansing is critical to identify errors and inconsistencies that can skew your data analysis and lead to poor business decisions. The most important thing about data cleansing is that it’s an ongoing process. Business data is always changing, which means the data you have today might not be correct tomorrow. The best data scientists know that data cleansing isn’t done just once; it’s an ongoing process that starts with the very first data set you collect. 
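
A minimal pandas sketch of what one cleansing pass might involve; the order data and the specific fixes are invented for illustration:

```python
import pandas as pd

# Hypothetical raw order data with the usual problems: duplicate rows,
# missing values, inconsistent types and labels.
raw = pd.DataFrame({
    "order_id": [101, 101, 102, 103, 104],
    "amount":   ["49.99", "49.99", None, "150", "75.5"],
    "country":  ["US", "US", "usa", "U.S.", "CA"],
})

clean = (
    raw.drop_duplicates(subset="order_id")                   # remove duplicate orders
       .assign(amount=lambda d: pd.to_numeric(d["amount"]))  # enforce a numeric type
       .assign(country=lambda d: d["country"].str.upper()
               .replace({"USA": "US", "U.S.": "US"}))        # standardize labels
       .dropna(subset=["amount"])                            # drop rows we cannot repair
)
print(clean)
```

Because business data keeps changing, a pass like this would typically run every time new data arrives rather than once.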

Prediction and forecasting:  The next step in Data Science is data analysis, prediction, and forecasting. You can do this on an individual level or on a larger scale for your entire customer base. Prediction and forecasting helps you understand how your customers behave and what they may do next. You can use these insights to create better products, marketing campaigns, and customer support. Normally, the techniques used for prediction and forecasting include regression, time series analysis, and artificial neural networks. 
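
As a minimal illustration of the regression approach mentioned above, the sketch below fits a linear trend to invented monthly sales figures and extrapolates it three months ahead. Real forecasting work would also evaluate seasonal and autoregressive models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales figures for one year.
sales = np.array([112, 118, 121, 130, 128, 140, 145, 151, 149, 160, 166, 171])
months = np.arange(len(sales)).reshape(-1, 1)   # 0, 1, ..., 11 as the single feature

trend = LinearRegression().fit(months, sales)

# Extrapolate the fitted trend three months into the future.
future = np.arange(len(sales), len(sales) + 3).reshape(-1, 1)
print("Forecast for the next three months:", trend.predict(future).round(1))
```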

Fraud detection:  Fraud detection is a highly specialized use of Data Science that relies on many techniques to identify inconsistencies. With fraud detection, you’re trying to find any transactions that are incorrect or fraudulent. It’s an important use case because it can significantly reduce the costs of business operations. The best fraud detection systems are wide-ranging. They use many different techniques to identify inconsistencies and unusual data points that suggest fraud. Because fraud detection is such a specialized use case, it’s best to work with a Data Science professional. 
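
One common way to surface unusual transactions is an unsupervised anomaly detector. The sketch below uses scikit-learn's IsolationForest on invented transaction data; a production fraud system would combine several such techniques with business rules and human review.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Invented transactions described by amount and hour of day; the last two
# rows are deliberately unusual (large purchases in the middle of the night).
rng = np.random.default_rng(0)
normal = np.column_stack([
    rng.normal(60, 15, 200),   # typical purchase amounts
    rng.normal(14, 3, 200),    # typical purchase hours
])
suspicious = np.array([[2500, 3], [1800, 4]])
transactions = np.vstack([normal, suspicious])

# IsolationForest labels points that are easy to isolate as anomalies (-1).
detector = IsolationForest(contamination=0.01, random_state=0).fit(transactions)
labels = detector.predict(transactions)
print("Flagged transactions:\n", transactions[labels == -1])
```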

Data Science for business growth:  Every business wants to grow, and this is a natural outcome of doing business. Yet many businesses struggle to keep up with their competitors. Data Science can help you understand your potential customers and improve your services. It can also help you identify new opportunities and explore different areas you can expand into. Use Data Science to identify your target audience and their needs. Then create products and services that serve those needs better than your competitors can. You can also use Data Science to identify new markets, explore new areas for growth, and expand into new industries. 

Data Science is an interdisciplinary field that uses mathematics, engineering, statistics, machine learning, and other fields of study to analyze data and identify patterns. Data Science applications can be used for any industry or area of study, but most examples involve data analytics for business use cases. Data Science often helps you understand your potential customers and their buying needs.


Data Analytics with R

1 Problem Solving with Data

1.1 Introduction

This chapter will introduce you to a general approach to solving problems and answering questions using data. Throughout the rest of the module, we will reference back to this chapter as you work your way through your own data analysis exercises.

The approach is applicable to actuaries, data scientists, general data analysts, or anyone who intends to critically analyze data and develop insights from data.

This framework, which some may refer to as The Data Science Process, includes the following five main components:

  • Data Collection
  • Data Cleaning
  • Exploratory Data Analysis
  • Model Building
  • Inference and Communication


Note that all five steps may not be applicable in every situation, but these steps should guide you as you think about how to approach each analysis you perform.

In the subsections below, we’ll dive into each of these in more detail.

1.2 Data Collection

In order to solve a problem or answer a question using data, it seems obvious that you need some sort of data to start with. Obtaining data may mean working with pre-existing sources or generating new data (think surveys). As an actuary, your data will often come from pre-existing sources within your company. This could include querying data from databases or APIs, being sent Excel files, text files, etc. You may also find supplemental data online to assist you with your project.

For example, let’s say you work for a health insurance company and you are interested in determining the average drive time for your insured population to the nearest in-network primary care providers to see if it would be prudent to contract with additional doctors in the area. You would need to collect at least three pieces of data:

  • Addresses of your insured population (internal company source/database)
  • Addresses of primary care provider offices (internal company source/database)
  • Google Maps travel time API to calculate drive times between addresses (external data source)

In summary, data collection provides the fundamental pieces needed to solve your problem or answer your question.

1.3 Data Cleaning

We’ll discuss data cleaning in a little more detail in later chapters, but this phase generally refers to the process of taking the data you collected in step 1 and turning it into a usable format for your analysis. This phase can often be the most time-consuming, as it may involve handling missing data as well as pre-processing the data to be as error-free as possible.

Where you source your data will have major implications for how long this phase takes. For example, many of us actuaries benefit from dedicated data engineers and resources within our companies who exert much effort to make our data as clean as possible for us to use. However, if you are sourcing your data from raw files on the internet, you may find this phase to be exceptionally difficult and time-intensive.

1.4 Exploratory Data Analysis

Exploratory Data Analysis , or EDA, is an entire subject itself. In short, EDA is an iterative process whereby you:

  • Generate questions about your data
  • Search for answers, patterns, and characteristics of your data by transforming, visualizing, and summarizing your data
  • Use learnings from step 2 to generate new questions and insights about your data

We’ll cover some basics of EDA in Chapter 4 on Data Manipulation and Chapter 5 on Data Visualization, but we’ll only be able to scratch the surface of this topic.

A successful EDA approach will allow you to better understand your data and the relationships between variables within your data. Sometimes, you may be able to answer your question or solve your problem after the EDA step alone. Other times, you may apply what you learned in the EDA step to help build a model for your data.

1.5 Model Building

In this step, we build a model, often using machine learning algorithms, in an effort to make sense of our data and gain insights that can be used for decision making or communicating to an audience. Examples of models could include regression approaches, classification algorithms, tree-based models, time-series applications, neural networks, and many, many more. Later in this module, we will practice building our own models using introductory machine learning algorithms.

It’s important to note that while model building gets a lot of attention (because it’s fun to learn and apply new types of models), it typically encompasses a relatively small portion of your overall analysis from a time perspective.

It’s also important to note that building a model doesn’t have to mean applying machine learning algorithms. In fact, in actuarial science, you may find more often than not that the actuarial models you create are Microsoft Excel-based models that blend together historical data, assumptions about the business, and other factors that allow you to make projections or understand the business better.

1.6 Inference and Communication

The final phase of the framework is to use everything you’ve learned about your data up to this point to draw inferences and conclusions about the data, and to communicate those out to an audience. Your audience may be your boss, a client, or perhaps a group of actuaries at an SOA conference.

In any instance, it is critical for you to be able to condense what you’ve learned into clear and concise insights and convince your audience why your insights are important. In some cases, these insights will lend themselves to actionable next steps, or perhaps recommendations for a client. In other cases, the results will simply help you to better understand the world, or your business, and to make more informed decisions going forward.

1.7 Wrap-Up

As we conclude this chapter, take a few minutes to look at a couple of alternative visualizations that others have used to describe the processes and components of performing analyses. What do they have in common?

  • Karl Rohe - Professor of Statistics at the University of Wisconsin-Madison
  • Chanin Nantasenamat - Associate Professor of Bioinformatics and Youtuber at the “Data Professor” channel




Solving Problems with Data Science


Aakash Tandel, Former Data Scientist

Article Categories: #Strategy, #Data & Analytics

Posted on December 3, 2018

There is a systematic approach to solving data science problems and it begins with asking the right questions. This article covers some of the many questions we ask when solving data science problems at Viget.


A challenge that I’ve been wrestling with is the lack of a widely adopted framework or systematic approach to solving data science problems. In our analytics work at Viget, we use a framework inspired by Avinash Kaushik’s Digital Marketing and Measurement Model. We use this framework on almost every project we undertake at Viget. I believe data science could use a similar framework that organizes and structures the data science process.

As a start, I want to share the questions we like to ask when solving a data science problem. Even though some of the questions are not specific to the data science domain, they help us efficiently and effectively solve problems with data science.

Business Problem

What is the problem we are trying to solve?

That’s the most logical first step to solving any question, right? We have to be able to articulate exactly what the issue is. Start by writing down the problem without going into the specifics, such as how the data is structured or which algorithm we think could effectively solve the problem.

Then try explaining the problem to your niece or nephew, who is a freshman in high school. It is easier than explaining the problem to a third-grader, but you still can’t dive into statistical uncertainty or convolutional versus recurrent neural networks. The act of explaining the problem at a high school stats and computer science level makes your problem, and the solution, accessible to everyone within your or your client’s organization, from the junior data scientists to the Chief Legal Officer.

Clearly defining our business problem showcases how data science is used to solve real-world problems. This high-level thinking provides us with a foundation for solving the problem. Here are a few other business problem definitions we should think about.

  • Who are the stakeholders for this project?
  • Have we solved similar problems before?
  • Has someone else documented solutions to similar problems?
  • Can we reframe the problem in any way?

And don’t be fooled by these deceptively simple questions. Sometimes more generalized questions can be very difficult to answer. But we believe answering these framing questions is the first, and possibly most important, step in the process, because it makes the rest of the effort actionable.

Say we work at a video game company — let’s call the company Rocinante. Our business is built on customers subscribing to our massive online multiplayer game. Users are billed monthly. We have data about users who have cancelled their subscription and those who have continued to renew month after month. Our management team wants us to analyze our customer data.

Well, as a company, Rocinante wants to be able to predict whether or not customers will cancel their subscription. We want to be able to predict which customers will churn, in order to address the core reasons why customers unsubscribe. Additionally, we need a plan to target specific customers with more proactive retention strategies.

Churn is the turnover of customers, also referred to as customer death. In a contractual setting - such as when a user signs a contract to join a gym - a customer “dies” when they cancel their gym membership. In a non-contractual setting, customer death is not observed and is more difficult to model. For example, Amazon does not know when you have decided to never-again purchase Adidas. Your customer death as an Amazon or Adidas customer is implied.

problem solving data science problems

Possible Solutions

What are the approaches we can use to solve this problem?

There are many instances when we shouldn’t be using machine learning to solve a problem. Remember, data science is one of many tools in the toolbox. There could be a simpler, and maybe cheaper, solution out there. Maybe we could answer a question by looking at descriptive statistics around web analytics data from Google Analytics. Maybe we could solve the problem with user interviews and hear what the users think in their own words. This question aims to see if spinning up EC2 instances on Amazon Web Services is worth it. If the answer to “Is there a simple solution?” is no, then we can ask, “Can we use data science to solve this problem?” This yes-or-no question brings about two follow-up questions:

  • “Is the data available to solve this problem?” A data scientist without data is not a very helpful individual. Many of the data science techniques that are highlighted in media today — such as deep learning with artificial neural networks — require a massive amount of data. A hundred data points is unlikely to provide enough data to train and test a model. If the answer to this question is no, then we can consider acquiring more data and pipelining that data to warehouses, where it can be accessed at a later date.
  • “Who are the team members we need in order to solve this problem?” Your initial answer to this question will be, “The data scientist, of course!” The vast majority of the problems we face at Viget can’t or shouldn’t be solved by a lone data scientist because we are solving business problems. Our data scientists team up with UXers, designers, developers, project managers, and hardware developers to develop digital strategies, and solving data science problems is one part of that strategy. Siloing your problem and siloing your data scientists isn’t helpful for anyone.

We want to predict when a customer will unsubscribe from Rocinante’s flagship game. One simple approach to solving this problem would be to take the average customer life, meaning how long a gamer remains subscribed, and predict that all customers will churn after that amount of time. Say our data showed that, on average, customers churned after 72 months of subscription. Then we could predict a new customer would churn after 72 months of subscription. We test out this hypothesis on new data and learn that it is wildly inaccurate. The average customer lifetime for our previous data was 72 months, but our new batch of data had an average customer lifetime of 2 months. Users in the second batch of data churned much faster than those in the first batch. Our prediction of 72 months didn’t generalize well. Let’s try a more sophisticated approach using data science.
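
For concreteness, the naive baseline described above might look like the following sketch; the subscription lengths are invented to mirror the 72-month and 2-month cohorts in the example.

```python
import numpy as np

# Hypothetical subscription lengths (in months) for two batches of customers.
first_batch = np.array([70, 74, 71, 73, 72, 72])    # older cohort
second_batch = np.array([1, 2, 3, 2, 1, 3])         # newer cohort

# Naive rule: "every customer churns after the historical average lifetime."
baseline_prediction = first_batch.mean()
print("Baseline prediction:", baseline_prediction, "months")

# The rule is off by roughly 70 months per customer on the new batch,
# which is why a model that uses customer features is worth building.
error = np.abs(second_batch - baseline_prediction).mean()
print("Mean absolute error on the new batch:", round(error, 1), "months")
```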

  • Is the data available to solve this problem? The dataset contains 12,043 rows of data and 49 features. We determine that this sample of data is large enough for our use case. We don’t need to deploy Rocinante’s data engineering team for this project.
  • Who are the team members we need in order to solve this problem? Let’s talk with Rocinante’s data engineering team to learn more about their data collection process. We could learn about biases in the data from the data collectors themselves. Let’s also chat with the customer retention and acquisitions team and hear about their tactics to reduce churn. Our job is to analyze data that will ultimately impact their work. Our project team will consist of the data scientist to lead the analysis, a project manager to keep the project team on task, and a UX designer to help facilitate research efforts we plan to conduct before and after the data analysis.

problem solving data science problems

How do we know if we have successfully solved the problem?

At Viget, we aim to be data-informed, which means we aren’t blindly driven by our data, but we are still focused on quantifiable measures of success. Our data science problems are held to the same standard.  What are the ways in which this problem could be a success? What are the ways in which this problem could be a complete and utter failure?  We often have specific success metrics and Key Performance Indicators (KPIs) that help us answer these questions.

Our UX coworker has interviewed some of the other stakeholders at Rocinante and some of the gamers who play our game. Our team believes that if our analysis is inconclusive and we simply continue the status quo, the project will be a failure. The project will be a success if we are able to predict a churn risk score for each subscriber. A churn risk score, coupled with our monthly churn rate (the rate at which customers leave the subscription service per month), will be useful information: the customer acquisition team will have a better idea of how many new users they need to acquire in order to keep the number of customers the same, and how many new users they need in order to grow the customer base.


Data Science-ing

What do we need to learn about the data, and what analysis do we need to conduct?

At the heart of solving a data science problem are hundreds of questions. I attempted to ask these and similar questions last year in a blog post,  Data Science Workflow . Below are some of the most crucial — they’re not the only questions you could face when solving a data science problem, but are ones that our team at Viget thinks about on nearly every data problem.

  • What do we need to learn about the data?
  • What type of exploratory data analysis do we need to conduct?
  • Where is our data coming from?
  • What is the current state of our data?
  • Is this a supervised or unsupervised learning problem?
  • Is this a regression, classification, or clustering problem?
  • What biases could our data contain?
  • What type of data cleaning do we need to do?
  • What type of feature engineering could be useful?
  • What algorithms or types of models have been proven to solve similar problems well?
  • What evaluation metric are we using for our model?
  • What is our training and testing plan?
  • How can we tweak the model to make it more accurate, increase the ROC/AUC, decrease log-loss, etc. ?
  • Have we optimized the various parameters of the algorithm? Try grid search here.
  • Is this ethical?

That last question raises the conversation about ethics in data science. Unfortunately, there is no Hippocratic oath for data scientists, but that doesn’t give the data science industry license to act unethically. We should apply ethical considerations to our standard data science workflow. Ethics in data science as a topic deserves more than a paragraph in this article, but I wanted to highlight that we should be cognizant of it and practice only ethical data science.

Let’s get started with the analysis. It’s time to answer the data science questions. Because this is an example, the answers to these data science questions are entirely hypothetical.

  • We need to learn more about the time series nature of our data, as well as the format.
  • We should look into average customer lifetime durations and summary statistics around some of the features we believe could be important.
  • Our data came from login data and customer data, compiled by Rocinante’s data engineering team.
  • The data needs to be cleaned, but it is conveniently in a PostgreSQL database.
  • This is a supervised learning problem because we know which customers have churned.
  • This is a binary classification problem.
  • After conducting exploratory data analysis and speaking with the data engineering team, we do not see any biases in the data.
  • We need to reformat some of the data and use missing data imputation for features we believe are important but have some missing data points.
  • With 49 good features, we don’t believe we need to do any feature engineering.
  • We have used random forests, XGBoost, and standard logistic regressions to solve classification problems.
  • We will use ROC-AUC score as our evaluation metric.
  • We are going to use a training-test split (80% training, 20% test) to evaluate our model; a minimal sketch of this step appears after the list.
  • Let’s remove features that are statistically insignificant from our model to improve the ROC-AUC score.
  • Let’s optimize the parameters within our random forests model to improve the ROC-AUC score.
  • Our team believes we are acting ethically.
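
To make the modeling plan above concrete, here is a minimal, hedged sketch using scikit-learn: an 80/20 split, a random forest tuned by a small grid search, and ROC-AUC as the evaluation metric. The data here is synthetic; the real project would load Rocinante’s 12,043-row, 49-feature dataset from PostgreSQL instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn dataset (12,043 rows, 49 features, imbalanced classes).
X, y = make_classification(n_samples=12_043, n_features=49, weights=[0.8, 0.2], random_state=0)

# 80% training, 20% test, stratified so both splits keep the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Optimize a couple of random forest parameters with cross-validated grid search, scored by ROC-AUC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

test_scores = search.best_estimator_.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print("Test ROC-AUC:", round(roc_auc_score(y_test, test_scores), 3))
```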

This process may look deceptively linear, but data science is often a nonlinear practice. After doing all of the work in our example above, we could still end up with a model that doesn’t generalize well: it could be bad at predicting churn in new customers. Maybe we shouldn’t have assumed this problem was a binary classification problem and instead used survival regression to solve the problem. This part of the project will be filled with experimentation, and that’s totally normal.


Communication

What is the best way to communicate and circulate our results?

Our job is typically to bring our findings to the client, explain how the process was a success or failure, and explain why. Communicating technical details to non-technical audiences is important because not all of our clients have degrees in statistics. There are four ways in which communication of technical details can be advantageous:

  • It can be used to inspire confidence that the work is thorough and multiple options have been considered.
  • It can highlight technical considerations or caveats that stakeholders and decision-makers should be aware of.  
  • It can offer resources to learn more about specific techniques applied.
  • It can provide supplemental materials to allow the findings to be replicated where possible.

We often use blog posts and articles to circulate our work. They help spread our knowledge and the lessons we learned while working on a project to peers. I encourage every data scientist to engage with the data science community by attending and speaking at meetups and conferences, publishing their work online, and extending a helping hand to other curious data scientists and analysts.

Our method of binary classification was in fact incorrect, so we ended up using survival regression to determine there are four features that impact churn: gaming platform, geographical region, days since last update, and season. Our team aggregates all of our findings into one report, detailing the specific techniques we used, caveats about the analysis, and the multiple recommendations from our team to the customer retention and acquisition team. This report is full of the nitty-gritty details that the more technical folks, such as the data engineering team, may appreciate. Our team also creates a slide deck for the less-technical audience. This deck glosses over many of the technical details of the project and focuses on recommendations for the customer retention and acquisition team.
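
For readers curious what the survival-regression step might look like in code, here is a hedged sketch using the lifelines library (one common choice; the project does not prescribe a specific tool). The DataFrame, column names, and synthetic effect sizes are hypothetical stand-ins for the features named above.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "days_since_last_update": rng.integers(0, 90, n),
    "platform_mobile": rng.integers(0, 2, n),   # hypothetical one-hot flags for platform,
    "region_eu": rng.integers(0, 2, n),         # region, and season
    "season_winter": rng.integers(0, 2, n),
})
# Hypothetical lifetimes: gamers who haven't updated recently tend to churn sooner.
df["tenure_months"] = rng.exponential(24 / (1 + df["days_since_last_update"] / 30))
df["churned"] = rng.integers(0, 2, n)           # 1 = churn observed, 0 = still subscribed (censored)

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="churned")
cph.print_summary()

# Relative churn-risk scores per subscriber (higher = expected to churn sooner).
risk_scores = cph.predict_partial_hazard(df)
print(risk_scores.head())
```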

We give a talk at a local data science meetup, going over the trials, tribulations, and triumphs of the project and sharing them with the data science community at large.


Why are we doing all of this?

I ask myself this question daily — and not in the metaphysical sense, but in the value-driven sense. Is there value in the work we have done and in the end result? I hope the answer is yes. But, let’s be honest, this is business. We don’t have three years to put together a PhD thesis-like paper. We have to move quickly and cost-effectively. Critically evaluating the value ultimately created will help you refine your approach to the next project. And, if you didn’t produce the value you’d originally hoped, then at the very least, I hope you were able to learn something and sharpen your data science skills. 

Rocinante has a better idea of how long our users will remain active on the platform based on user characteristics, and can now launch preemptive strikes in order to retain those users who look like they are about to churn. Our team eventually develops a system that alerts the customer retention and acquisition team when a user may be about to churn, and they know to reach out to that user, via email, encouraging them to try out a new feature we recently launched. Rocinante is making better data-informed decisions based on this work, and that’s great!

I hope this article will help guide your next data science project and get the wheels turning in your own mind. Maybe you will be the creator of a data science framework the world adopts! Let me know what you think about the questions, or whether I’m missing anything, in the comments below.


5 Steps on How to Approach a New Data Science Problem

Many companies struggle to reorganize their decision making around data and implement a coherent data strategy. The problem certainly isn’t lack of data but inability to transform it into actionable insights. Here's how to do it right.


Introduction

Data has become the new gold. 85 percent of companies are trying to be data-driven, according to last year’s survey by NewVantage Partners, and the global data science platform market is expected to reach $128.21 billion by 2022, up from $19.75 billion in 2016.

Clearly, data science is not just another buzzword with limited real-world use cases. Yet, many companies struggle to reorganize their decision making around data and implement a coherent data strategy. The problem certainly isn’t lack of data.

In the past few years alone, 90 percent of all of the world’s data has been created, and our current daily data output has reached 2.5 quintillion bytes, which is such a mind-bogglingly large number that it’s difficult to fully appreciate the break-neck pace at which we generate new data.

The real problem is the inability of companies to transform the data they have at their disposal into actionable insights that can be used to make better business decisions, stop threats, and mitigate risks.

In fact, there’s often too much data available to make a clear decision, which is why it’s crucial for companies to know how to approach a new data science problem and understand what types of questions data science can answer.

What types of questions can data science answer?

“Data science and statistics are not magic. They won’t magically fix all of a company’s problems. However, they are useful tools to help companies make more accurate decisions and automate repetitive work and choices that teams need to make,” writes Seattle Data Guy, a data-driven consulting agency.

The questions that can be answered with the help of data science fall under the following categories:

  • Identifying themes in large data sets: Which server in my server farm needs maintenance the most?
  • Identifying anomalies in large data sets: Is this combination of purchases different from what this customer has ordered in the past?
  • Predicting the likelihood of something happening: How likely is this user to click on my video?
  • Showing how things are connected to one another: What is the topic of this online article?
  • Categorizing individual data points: Is this an image of a cat or a mouse?

Of course, this is by no means a complete list of all questions that data science can answer. Even if it were, data science is evolving at such a rapid pace that it would most likely be completely outdated within a year or two from its publication.

Now that we’ve established the types of questions that can be reasonably expected to be answered with the help of data science, it’s time to lay down the steps most data scientists would take when approaching a new data science problem.

Step 1: Define the problem

First, it’s necessary to accurately define the data problem that is to be solved. The problem should be clear, concise, and measurable. Many companies are too vague when defining data problems, which makes it difficult or even impossible for data scientists to translate them into machine code.

Here are some basic characteristics of a well-defined data problem:

  • The solution to the problem is likely to have enough positive impact to justify the effort.
  • Enough data is available in a usable format.
  • Stakeholders are interested in applying data science to solve the problem.

Step 2: Decide on an approach

There are many data science algorithms that can be applied to data, and they can be roughly grouped into the following families (a brief sketch of a few of them follows the list):

  • Two-class classification: useful for any question that has just two possible answers.
  • Multi-class classification: answers a question that has multiple possible answers.
  • Anomaly detection: identifies data points that are not normal.
  • Regression: gives a real-valued answer and is useful when looking for a number instead of a class or category.
  • Multi-class classification as regression: useful for questions that occur as rankings or comparisons.
  • Two-class classification as regression: useful for binary classification problems that can also be reformulated as regression.
  • Clustering: answers questions about how data is organized by seeking to separate a data set into intuitive chunks.
  • Dimensionality reduction: reduces the number of random variables under consideration by obtaining a set of principal variables.
  • Reinforcement learning algorithms: focus on taking action in an environment so as to maximize some notion of cumulative reward.
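
As a quick illustration, here is a hedged sketch of how three of these families map onto scikit-learn estimators (one of many possible toolkits); the blob data is synthetic and purely illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Synthetic data with three natural groups and five features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # clustering
anomalies = IsolationForest(random_state=0).fit_predict(X)                 # anomaly detection (-1 = anomaly)
reduced = PCA(n_components=2).fit_transform(X)                             # dimensionality reduction

print("Cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
print("Anomalies flagged:", int((anomalies == -1).sum()))
print("Reduced shape:", reduced.shape)
```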

Step 3: Collect data

With the problem clearly defined and a suitable approach selected, it’s time to collect data. All collected data should be organized in a log along with collection dates and other helpful metadata.

It’s important to understand that collected data is seldom ready for analysis right away. Most data scientists spend much of their time on data cleaning, which includes removing missing values, identifying duplicate records, and correcting incorrect values.
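
A minimal sketch of those cleaning tasks in pandas, using a tiny hypothetical collection log, might look like this:

```python
import pandas as pd

raw = pd.DataFrame({
    "collected_on": ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07"],
    "sensor_id":    ["A1", "A1", "A2", "A3"],
    "reading":      [10.2, 10.2, None, -999.0],   # -999 is a known bad sentinel value
})

clean = (
    raw.drop_duplicates()                                      # identify and drop duplicate records
       .replace({"reading": {-999.0: float("nan")}})           # correct known-incorrect values
       .dropna(subset=["reading"])                             # remove rows with missing readings
       .assign(collected_on=lambda d: pd.to_datetime(d["collected_on"]))  # normalize collection dates
)
print(clean)
```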

Step 4: Analyze data

The next step after data collection and cleanup is data analysis. At this stage, there’s a certain chance that the selected data science approach won’t work. This is to be expected and accounted for. Generally, it’s recommended to start with trying all the basic machine learning approaches as they have fewer parameters to alter.

There are many excellent open source data science libraries that can be used to analyze data. Most data science tools are written in Python, Java, or C++.

“Tempting as these cool toys are, for most applications the smart initial choice will be to pick a much simpler model, for example using scikit-learn and modeling techniques like simple logistic regression,” advises Francine Bennett, the CEO and co-founder of Mastodon C.
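
Following that advice, a minimal sketch of a simple logistic regression baseline in scikit-learn, on synthetic data, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in dataset; a real project would load its own cleaned data here.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Scale the features, then fit a plain logistic regression; evaluate with 5-fold cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Cross-validated accuracy:", scores.mean().round(3))
```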

Step 5: Interpret results

After data analysis, it’s finally time to interpret the results. The most important thing to consider is whether the original problem has been solved. You might discover that your model is working but producing subpar results. One way to deal with this is to add more data and keep retraining the model until you are satisfied with it.

Most companies today are drowning in data. The global leaders are already using the data they generate to gain competitive advantage, and others are realizing that they must do the same or perish. While transforming an organization to become data-driven is no easy task, the reward is more than worth the effort.

The 5 steps on how to approach a new data science problem we’ve described in this article are meant to illustrate the general problem-solving mindset companies must adopt to successfully face the challenges of our current data-centric era.


Common Data Science Challenges of 2024 [with Solution]


Data is the new oil for companies, and it has become a standard input to every decision they make. Increasingly, businesses rely on analytics and data to strengthen their brand's position in the market and boost revenue.

Information now has more value than physical metals. According to a poll conducted by NewVantage Partners in 2017, 85% of businesses are making an effort to become data-driven, and the worldwide data science platform market is projected to grow to $128.21 billion by 2022, from only $19.75 billion in 2016.

Data science is not a meaningless term with no practical applications. Yet, many businesses have difficulty reorganizing their decision-making around data and implementing a consistent data strategy. Lack of information is not the issue. 

Our daily data production has reached 2.5 quintillion bytes, which is so huge that it is impossible to completely understand the breakneck speed at which we produce new data. Ninety percent of all global data was generated in the previous few years. 

The actual issue is that businesses aren't able to properly use the data they already collect to get useful insights that can be utilized to improve decision-making, counteract risks, and protect against threats. 

It is vital for businesses to know how to approach a new data science challenge and understand what kinds of questions data science can answer, since there is frequently too much data accessible to make a clear choice.

What Are Data Science Challenges?

Data science is an application of the scientific method that utilizes data and analytics to address issues that are often difficult, multi-part, and unstructured. The phrase "fishing expedition" comes from the field of analytics and refers to a project that was never structured appropriately to begin with and entails searching through the data for unanticipated connections. This particular kind of "data fishing" does not adhere to the principles of efficient data science; nonetheless, it is still rather common. Therefore, the first thing that needs to be done is to clearly define the issue.

"The study of statistics and data is not a kind of witchcraft. They will not, by any means, solve all of the issues that plague a corporation. According to Seattle Data Guy, a data-driven consulting service, "but, they are valuable tools that assist organizations make more accurate judgments and automate repetitious labor and choices that teams need to make." 

The following are some of the categories that may be used to classify the problems that can be solved with the assistance of data science:

  • Finding patterns in massive data sets: Which of the servers in my server farm need the most maintenance?
  • Detecting deviations from the norm in huge data sets: Is this particular mix of acquisitions distinct from what this particular consumer has previously ordered?
  • Estimating the possibility of something occurring: What are the chances that this person will click on my video?
  • Illustrating the ways in which things are related to one another: What exactly is the focus of this article that I saw online?
  • Categorizing specific data points: Which animal does this picture depict, a kitty or a mouse?

Of course, the aforementioned is in no way a comprehensive list of all the questions that can be answered by data science. Even if it were, the field of data science is advancing at such a breakneck speed that it is quite possible that it would be rendered entirely irrelevant within a year or two of its release. 

Now that we have determined the categories of questions that may reasonably be expected to be solved with the assistance of data science, it is time to write out the stages that the majority of data scientists would follow when tackling a new data science challenge.

Common Data Science Problems Faced by Data Scientists

1. Preparation of Data for Smart Enterprise AI

Finding and cleaning up the proper data is a data scientist's priority. Nearly 80% of a data scientist's day is spent on cleaning, organizing, mining, and gathering data, according to a CrowdFlower poll. In this stage, the data is double-checked before undergoing additional analysis and processing. Most data scientists (76%) agree that this is one of the most tedious elements of their work. As part of the data wrangling process, data scientists must efficiently sort through terabytes of data stored in a wide variety of formats and codes on a wide variety of platforms, all while keeping track of changes to such data to avoid data duplication. 

Adopting AI-based tools that help data scientists maintain their edge and increase their efficacy is the best method to deal with this issue. Another flexible workplace AI technology that aids in data preparation and sheds light on the topic at hand is augmented learning. 

2. Generation of Data from Multiple Sources

Data is obtained by organizations in a broad variety of forms from the many programs, software, and tools that they use. Managing voluminous amounts of data is a significant obstacle for data scientists. This process calls for manual data entry and compilation, both of which are time-consuming and can result in unnecessary repetition or erroneous choices. Data is most valuable when exploited effectively for maximum usefulness in enterprise artificial intelligence.

Companies can now build sophisticated virtual data warehouses equipped with a centralized platform to combine all of their data sources in a single location. The data stored in the central repository can be modified or manipulated to satisfy the needs of a company and increase its efficiency. This easy-to-implement change can significantly reduce the amount of time and labor required of data scientists.

3. Identification of Business Issues

Identifying issues is a crucial component of running a solid organization. Before constructing data sets and analyzing data, data scientists should concentrate on identifying enterprise-critical challenges. Before establishing the data collection, it is crucial to determine the source of the problem rather than immediately resorting to a mechanical solution.

Before commencing analytical operations, data scientists should have a structured workflow in place. The process must consider all company stakeholders and important parties. Using specialized dashboard software that provides an assortment of visualization widgets, the enterprise's data can be rendered more understandable.

4. Communication of Results to Non-Technical Stakeholders

The primary objective of a data scientist is to enhance the organization's capacity for decision-making, in line with the business plan that their function supports. The most difficult obstacle for data scientists to overcome is effectively communicating their findings and interpretations to business leaders and managers. Because the majority of managers and stakeholders are unfamiliar with the tools and technologies that data scientists use, it is vital to give them the conceptual foundation they need to apply the model in business AI.

To provide an effective narrative for their analysis and visualizations, data scientists need to incorporate concepts such as "data storytelling."

5. Data Security

Due to the need to scale quickly, businesses have turned to cloud management for the safekeeping of their sensitive information. Cyberattacks and online spoofing have left sensitive data stored in the cloud exposed to the outside world. Strict measures have been enacted to protect data in central repositories against hackers, and data scientists now face additional challenges as they attempt to work within the restrictions brought about by the new rules.

Organizations must use cutting-edge encryption methods and machine learning security solutions to counteract the security threat. In order to maximize productivity, it is essential that the systems be compliant with all applicable safety regulations and designed to deter lengthy audits. 

6. Efficient Collaboration

It is common practice for data scientists and data engineers to collaborate on the same projects for a company. Maintaining strong lines of communication is necessary to avoid potential conflicts. To guarantee that the workflows of both teams are comparable, the organization should make the effort to establish clear communication channels. The organization may also choose to establish a chief officer position to monitor whether both departments are working along the same lines.

7. Selection of Non-Specific KPI Metrics

It is a common misunderstanding that data scientists can handle the majority of the job on their own and come prepared with answers to all of the challenges that are encountered by the company. Data scientists are put under a great deal of strain as a result of this, which results in decreased productivity. 

It is vital for any company to have a certain set of metrics to measure the analyses that a data scientist presents. In addition, they have the responsibility of analyzing the effects that these indicators have on the operation of the company. 

The many responsibilities and duties of a data scientist make for a demanding work environment. Nevertheless, it is one of the occupations that are in most demand in the market today. The challenges that are experienced by data scientists are simply solvable difficulties that may be used to increase the functionality and efficiency of workplace AI in high-pressure work situations.

Types of Data Science Challenges/Problems

1. Data Science Business Challenges

Listening to important words and phrases is one of the responsibilities of a data scientist during an interview with a line-of-business expert who is discussing a business issue. The data scientist breaks the issue down into a procedural flow that always involves a grasp of the business challenge, a comprehension of the data that is necessary, as well as the many forms of artificial intelligence (AI) and data science approaches that can address the problem. This information, when taken as a whole, serves as the impetus behind an iterative series of thought experiments, modeling methodologies, and assessment of the business objectives. 

The company itself has to remain the primary focus. When technology is used too early in a process, it may lead to the solution focusing on the technology itself, while the original business challenge may be ignored or only partially addressed. 

Artificial intelligence and data science demand a degree of accuracy that must be captured from the beginning: 

  • Describe the issue that needs to be addressed. 
  • Provide as much detail as you can on each of the business questions. 
  • Determine any additional business needs, such as maintaining existing client relationships while expanding potential for upselling and cross-selling. 
  • Specify the predicted advantages in terms of how they will affect the company, such as a 10% reduction in the customer turnover rate among high-value clients. 

2. Real Life Data Science Problems

Data science is the use of hybrid mathematical and computer science models to address real-world business challenges in order to get actionable insights. It is willing to take the risk of venturing into the unknown domain of 'unstructured' data in order to get significant insights that assist organizations in improving their decision-making. 

  • Managing the placement of digital advertisements using computerized processes.
  • Improving the search function with data science and sophisticated analytics.
  • Using data science to produce data-driven crime predictions.
  • Utilizing data science to avoid breaking tax laws.

3. Data Science Challenges In Healthcare And Example

It has been calculated that each human being creates around 2 gigabytes of data per day. These measurements include brain activity, tension, heart rate, blood sugar, and many more. These days we have more sophisticated tools to deal with such a massive data volume, and Data Science is one of them. It aids in keeping tabs on a patient's health by recording relevant information.

The use of Data Science in medicine has made it feasible to spot the first signs of illness in otherwise healthy people. Doctors may now check up on their patients from afar thanks to a host of cutting-edge technology. 

Historically, hospitals and their staffs have struggled to care for large numbers of patients simultaneously. The patients' ailments used to worsen because of a lack of adequate care.

A) Medical Image Analysis:  Focusing on the efforts connected to the applications of computer vision, virtual reality, and robotics to biomedical imaging challenges, Medical Image Analysis offers a venue for the dissemination of new research discoveries in the area of medical and biological image analysis. It publishes high-quality, original research articles that advance our understanding of how to best process, analyze, and use medical and biological pictures in these contexts. Methods that make use of molecular/cellular imaging data as well as tissue/organ imaging data are of interest to the journal. Among the most common sources of interest for biomedical image databases are those gathered from: 

  • Magnetic resonance 
  • Ultrasound 
  • Computed tomography 
  • Nuclear medicine 
  • X-ray 
  • Optical and Confocal Microscopy 
  • Video and range data images 

Procedures such as identifying cancers, artery stenosis, and organ delineation use a variety of different approaches and frameworks like MapReduce to determine ideal parameters for tasks such as lung texture categorization. Examples of these procedures include: 

  • The categorization of solid textures is accomplished through machine learning techniques such as support vector machines (SVM), content-based medical picture indexing, and wavelet analysis (a minimal sketch follows).
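
For illustration only, here is a hedged sketch of the SVM classification step with scikit-learn; random vectors stand in for the wavelet or texture features a real medical-imaging pipeline would extract.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 32))                        # stand-in for wavelet/texture features
labels = (features[:, 0] + features[:, 1] > 0).astype(int)   # stand-in tissue classes

# Train an RBF-kernel SVM on scaled features and report held-out accuracy.
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print("Held-out accuracy:", round(clf.score(X_test, y_test), 3))
```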

B) Drug Research and Development:  The ever-increasing human population brings a plethora of new health concerns. Possible causes include insufficient nutrition, stress, environmental hazards, disease, etc. Medical research facilities are now under pressure to rapidly discover treatments or vaccinations for many illnesses. It may take millions of test cases to uncover a medicine's formula, since scientists need to learn about the properties of the causal agent. Then, once they have a recipe, researchers must put it through its paces in a battery of experiments.

Previously, it took a team of researchers 10–12 years to sift through the information from the millions of test instances mentioned above. However, with the aid of Data Science's many medical applications, this process is now simplified. It is possible to process data from millions of test cases in a matter of months, if not weeks, which is useful for analyzing the data that shows how well a medicine works. So, the vaccine or drug may be available to the public in less than a year if all tests go well. Data Science and machine learning make this a reality, and both have been game-changing for the pharmaceutical industry's R&D departments. As we go forward, we shall see Data Science's use in genomics. Data analytics also played a crucial part in the rapid development of a vaccine against the coronavirus pandemic.

C) Genomics and Bioinformatics:  One of the most fascinating parts of modern medicine is genomics. Human genomics focuses on the sequencing and analysis of genomes, which are made up of the genetic material of living organisms. Genomic studies pave the way for cutting-edge medical interventions. Investigating DNA for its peculiarities and quirks is what genomics is all about. It also aids in determining the link between a disease's symptoms and the patient's actual health. Drug response analysis for a certain DNA type is also a component of genomics research.

Before the development of effective data analysis methods, studying genomes was a laborious and inefficient process. The human genome contains billions of base pairs that together encode a unique set of instructions. However, recent Data Science advancements in medicine and genetics have simplified this process. Analyzing human genomes now takes much less time and energy thanks to the many Data Science and Big Data techniques available. These methods aid scientists in identifying the underlying genetic problem and the corresponding medication.

D) Virtual Assistance:  One excellent illustration of how Data Science may be put to use is the development of apps with virtual assistants. The work of data scientists has resulted in complete platforms that provide patients with individualized experiences. Medical apps that make use of data science analyze the patient's symptoms in order to aid in the diagnosis of a condition. Simply having the patient input his or her symptoms into the program allows it to make an accurate diagnosis of the patient's ailment and current status. Depending on the state of the patient, it will provide recommendations for any necessary precautions, medications, and treatments.

In addition, the software does an analysis on the patient's data and generates a checklist of the treatment methods that must be adhered to at all times. After that, it reminds the patient to take their medication at regular intervals. This helps to prevent the scenario of neglect, which might potentially make the illness much worse. 

Patients suffering from Alzheimer's disease, anxiety, depression, and other psychological problems have also benefited from the use of virtual assistance. Because the application consistently reminds these patients to carry out necessary actions, such as taking the appropriate medicine, staying active, and eating well, their therapy is bearing fruit. Woebot, which was created at Stanford University, is one example: it is a chatbot that assists individuals suffering from psychiatric conditions in obtaining the appropriate therapy to improve their mental health.

4. Data Science Problems In Retail

Although the phrase "customer analytics" is relatively new to the retail sector, the practice of analyzing data collected from consumers to provide them with tailored products and services is centuries old. The development of data science has made it simple to manage a growing number of customers. With the use of data science software, reductions and sales may be managed in real-time, which might boost sales of previously discontinued items and generate buzz for forthcoming releases. One further use of data science is to analyze the whole social media ecosystem to foresee which items will be popular in the near future so that they may be promoted to the market at the same time. 

Data science is far from being complete, but it is loaded with actual uses in the world today. Data science is still in its infancy, yet its applications are already being felt throughout the globe. We have a long way to go before we reach saturation.

Steps on How to Approach and Address a Solution to Data Science Problems

Step 1: Define the Problem

First things first: it is essential to precisely characterize the data issue that has to be addressed. The issue at hand needs to be comprehensible, succinct, and quantifiable. When identifying data challenges, many businesses are far too general with their language, which makes it difficult, if not impossible, for data scientists to translate such problems into machine code. Below we will discuss a few of the most common data science problem statements and challenges.

The following is a list of fundamental qualities that describe a data issue as well-defined: 

  • It seems probable that the solution to the issue will have a sufficient amount of positive effect to warrant the effort. 
  • There is sufficient data accessible in a format that can be used. 
  • The use of data science as a means of resolving the issue has garnered the attention of stakeholders. 

Step 2: Types of Data Science Problem

There is a wide variety of data science algorithms that can be implemented on data, and they can be classified, to a certain extent, within the following families. Below are the most common examples of data science problems:

  • Two-class classification: useful for any issue that can only have two responses.
  • Multi-class classification: answers a question that might have many different responses.
  • Anomaly detection: the process of locating data points that deviate from the norm.
  • Regression: helpful when searching for a number as opposed to a class or category, since it provides a real-valued answer (a brief regression sketch follows this list).
  • Multi-class classification as regression: useful when questions are posed in the form of rankings or comparisons.
  • Two-class classification as regression: useful for binary classification problems that can also be reformulated as regression.
  • Clustering: answers questions about the organization of data by attempting to partition a data set into understandable chunks.
  • Dimensionality reduction: acquires a set of principal variables in order to lower the number of random variables being taken into account.
  • Reinforcement learning: algorithms whose goal is to take actions within an environment so as to maximize some notion of cumulative reward.
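
As promised above, here is a brief, hedged regression sketch with scikit-learn on synthetic data, predicting a real-valued number rather than a class:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data; a real project would substitute its own features and target.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a plain linear regression and report the held-out error in the target's own units.
reg = LinearRegression().fit(X_train, y_train)
print("Mean absolute error:", round(mean_absolute_error(y_test, reg.predict(X_test)), 2))
```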

Step 3: Data Collection

Now that the issue has been articulated in its entirety and an appropriate solution has been chosen, it is time to gather data. It is important to record all of the data that has been gathered in a log, along with the date of each collection and any other pertinent information. 

It is essential to understand that the data gathered are rarely ready for analysis right away. The majority of a data scientist's day is dedicated to cleaning the data, which involves tasks such as eliminating records with missing values, locating duplicate records, and correcting incorrect values. This is one of the most prominent problems data scientists face.

Step 4: Data Analysis

Data analysis comes after data gathering and cleansing. At this point, there is a risk that the chosen data science strategy will fail. This is to be expected and planned for. In general, it is advisable to begin by experimenting with the fundamental machine learning algorithms, since they have fewer parameters to adjust.

There are several good open-source data science libraries available for use in data analysis. The vast majority of data science tools are developed in Python, Java, or C++. Apart from this, many data science practice problems are available for free on the web.

Step 5: Result Interpretation

Following the completion of the data analysis, the next step is to interpret the findings. Consideration of whether or not the primary issue has been resolved should take precedence over anything else. It's possible that you'll find out that your model works but generates results that aren't very good. Adding new data and continually retraining the model until one is pleased with it is one strategy for dealing with this situation.

Finalizing the Problem Statement

After identifying the precise issue type, you should be able to formulate a refined problem statement that includes the model's predictions. For instance: 

This is a multi-class classification problem that predicts if a picture belongs to one of four classes: "vehicle," "traffic," "sign," and "human." 

Additionally, you should be able to describe a desired result or intended use for the model's predictions. Making a model accurate is one of the most crucial problems data scientists face.

The optimal result is to offer quick notice to end users when a target class is predicted. One may practice such data science hackathon problem statements on Kaggle. 
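
To make the example problem statement above concrete, here is a hedged sketch of a four-class classifier in scikit-learn; the feature vectors and labels are random stand-ins, not a real image dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

classes = np.array(["vehicle", "traffic", "sign", "human"])
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64))                 # stand-in image feature vectors
y = classes[rng.integers(0, 4, size=400)]      # random stand-in labels for the four classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The "quick notice to end users" step would key off these predicted labels.
print(clf.predict(X_test[:5]))
```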


When professionals are working toward their analytics objectives, they may run across a variety of data science challenges, all of which slow down their progress. The steps that we've discussed in this article on how to tackle a new data science issue are designed to highlight the general problem-solving attitude that businesses need to adopt in order to effectively meet the problems of our present data-centric era.

A competent data science solution will not only seek to make predictions but will also aim to support judgments. Always keep this overarching goal in mind while you think about the many challenges you are facing. A detailed approach will help you combat the blues of data science. In addition, engaging with professionals in the field of data science gives you additional insights, which ultimately results in the effective execution of the project.

Frequently Asked Questions (FAQs)

The discipline of data science aims to provide answers to actual challenges faced by businesses by using data in the construction of algorithms and the development of programs that assist in demonstrating that certain issues have ideal solutions. Data science is the use of hybrid mathematical and computer science models to address real-world business challenges in order to get actionable insights. 

There are many platforms available for practicing data science problems, such as Kaggle, KnowledgeHut, HackerEarth, MachineHack, Colab by Google, and Datacamp.

Statistics, Coding, Business Intelligence, Data Structures, Mathematics, Machine Learning, and Algorithms are only a few of the primary subjects that are covered in the Data Science curriculum.

Aspects of this profession may be stressful, but I imagine that's true of most jobs. Data science is R&D, and there's always enough time to get everything done.

Stressful? No. Frustrating? Absolutely, yes. It's really annoying that we often get trapped on an error for three days or have to ponder the question of what metrics to use a thousand times.


Ritesh Pratap Arjun Singh

RiteshPratap A. Singh is an AI & DeepTech Data Scientist. His research interests include machine vision and cognitive intelligence. He is known for leading innovative AI projects for large corporations and PSUs. Collaborate with him in the fields of AI/ML/DL, machine vision, bioinformatics, molecular genetics, and psychology.


Center for Data Innovation

Solving Data Science Problems


Researchers at the University of Hong Kong, Peking University, Stanford University, the University of California, Berkeley, the University of Washington, Carnegie Mellon University, and Meta have created a dataset of 1,000 data science questions from 451 problems found on Stack Overflow, a collective knowledge platform for programmers. Researchers can use the dataset to train AI systems to solve data science problems. 

Get the data.  


Morgan Stevens

Morgan Stevens is a Research Assistant at the Center for Data Innovation. She holds a J.D. from the Sandra Day O'Connor College of Law at Arizona State University and a B.A. in Economics and Government from the University of Texas at Austin.


Data Science Problems

Data are elements or information, usually numerical, that are collected by observation. Data can also be defined as a set of values (qualitative or quantitative) related to people or things.

Mahalle, P.N., Ambritta P., N., Sakhare, S.R., Kulkarni, A.P. (2023). Data Science Problems. In: Foundations of Mathematical Modelling for Engineering Problem Solving. Studies in Autonomic, Data-driven and Industrial Computing. Springer, Singapore. https://doi.org/10.1007/978-981-19-8828-8_6



Data Science Central


33 unusual problems that can be solved with data science

Vincent Granville

  • August 28, 2014 at 5:00 pm

Here is a non-exhaustive list of curious problems that could greatly benefit from data analysis. If you think you can’t get a job as a data scientist (because you only apply to jobs at Facebook, LinkedIn, Twitter or Apple), here’s a way to find or create new jobs, broaden your horizons, and make Earth a better world not just for human beings, but for all living creatures. Even beyond Earth, indeed. Help us grow this list of 33 problems to 100+.

The actual number is higher than 33, as I’m adding new entries.

Figure 1: related to problem #33

  • Automated translation, including translating one programming language into another one (for instance, SQL to Python – the converse is not possible)
  • Spell checks, especially for people writing in multiple languages – lots of progress to be made here, including automatically recognizing the language when you type, and stopping attempts to correct the same word every single time (some browsers have tried to change Ning to Nong hundreds of times, and I have no idea why after 50 failures they continue to try – I call this machine unlearning)
  • Detection of earth-like planets – focus on planetary systems with many planets to increase odds of finding inhabitable planets, rather than stars and planets matching our Sun and Earth
  • Distinguishing between noise and signal on millions of NASA pictures or videos, to identify patterns
  • Automated piloting (drones, cars without pilots)
  • Customized, patient-specific medications and diets
  • Predicting and legally manipulating elections
  • Predicting oil demand, oil reserves, oil price, impact of coal usage
  • Predicting chances that a container in a port contains a nuclear bomb
  • Assessing the probability that a convict is really the culprit, especially when a chain of events resulted in a crime or accident (think about a civil airplane shot down by a missile)
  • Computing correct average time-to-crime statistics for an average gun (using censored models to compensate for the bias caused by new guns not having a criminal history attached to them)
  • Predicting iceberg paths: this occasionally requires icebergs to be towed to avoid collisions
  • Oil well drilling optimization: how to dig as few test wells as possible to detect the entire area where oil can be found
  • Predicting solar flares: timing, duration, intensity and localization
  • Predicting Earthquakes
  • Predicting very local weather (short-term) or global weather (long-term); reconstructing past weather (like 200 million years old)
  • Predicting weather on Mars to identify best time and spots for a landing
  • Predict riots based on tweets
  • Designing metrics to predict student success, or employee attrition
  • Predicting book sales, determining correct price, price elasticity and whether a specific book should be accepted or rejected by a publisher, based on projected ROI
  • Predicting volcano risk, to evacuate populations or cancel flights, while minimizing expenses caused by these decisions
  • Predicting 500-year floods, to build dams
  • Actuarial science: predict your death, and health expenditures, to compute your premiums (based on which population segment you belong to)
  • Predicting reproduction rate in animal populations
  • Predicting food reserves each year (fish, meat, crops including crop failures caused by diseases or other problems). Same with electricity and water consumption, as well as rare metals or elements that are critical to build computers and other modern products.
  • Predicting longevity of a product, or a customer
  • Asteroid risks
  • Predicting duration, extent and severity of drought or fires
  • Predicting racial and religious mix in a population, detecting change point (e.g. when more people speak Spanish than English, in California) to adapt policies accordingly
  • Attribution modeling to optimize advertising mix, branding efforts and organic traffic
  • Predicting new flu viruses to design efficient vaccines each year
  • Explaining hexagonal patterns in this Death Valley picture (see Figure 1)
  • Road constructions, HOV lanes, and traffic lights designed to optimize highway traffic. Major bottlenecks are caused by three-lane highways suddenly narrowing down to two lanes on a short section and for no reason, usually less than 100 yards long. No need for big data to understand and fix this, though if you don’t know basic physics (fluids theory) and your job is traffic planning / optimization / engineering, then big data – if used smartly – will help you find the cause and compensate for your lack of good judgement. These bottlenecks should be your top priority, and they are not expensive to fix.
  • Google algorithm to predict duration of a road trip, doing much better than GPS systems not connected to the Internet. Potential improvement: when Google tells me that I will arrive in Portland at 5pm when I’m currently in Seattle at 2pm, it should incorporate forecasted traffic in Portland at 5pm: that is, congestion due to peak commuting time, rather than making computations based on Portland traffic at 2pm.

Problem-Solving Strategies for Data Engineers


Instructor: Andreas Kretz

Data engineers face a wide variety of problems every day, and often variations of the same problems. In this course, data engineer Andreas Kretz takes you through a variety of common problems you may face and shares his problem-solving strategies for typical problems within all phases of engineering projects. Andreas teaches you how to recognize which phase of a data project you’re in (planning, design, implementation, or operations) and shares solutions targeted to problems you may encounter in each phase. He shows how to identify key performance indicators (KPIs) during planning, how to predict costs and scale better in the design phase, explains why and how to do a risk assessment, and shares tips on bug fixing and ways you can improve your process. If you’re looking for better ways to deal with data engineering issues, join Andreas in this course to take your problem-solving skills to the next level.

Data Science

The proximal distance principle for constrained estimation.

Statistical methods often involve solving an optimization problem, such as in maximum likelihood estimation and regression. The addition of constraints, either to enforce a hard requirement in estimation or to regularize solutions, complicates matters. Fortunately, the rich theory of convex optimization provides ample tools for devising novel methods. In this talk, I present applications of distance-to-set penalties to statistical learning problems. Specifically, I will focus on proximal distance algorithms, based on the MM principle, tailored to various applications such as regression and discriminant analysis. Special emphasis is given to sparsity set constraints as a compromise between exhaustive combinatorial searches and lasso penalization methods that induce shrinkage.

Dr. Alfonso Landeros


Data Science Interview Questions



Introduction:

Data science is an interdisciplinary field that mines raw data, analyses it, and comes up with patterns that are used to extract valuable insights from it. Statistics, computer science, machine learning, deep learning, data analysis, data visualization, and various other technologies form the core foundation of data science.

Over the years, data science has gained widespread importance due to the importance of data. Data is considered the new oil, which when analyzed and harnessed properly can prove to be very beneficial to the stakeholders. Moreover, a data scientist gets the exposure of working in diverse domains, solving real-life practical problems with modern technologies. A common real-time application is fast food delivery in apps such as Uber Eats, which aid the delivery person by showing the fastest possible route from the restaurant to the destination.

Data science is also used in item recommendation systems on e-commerce sites like Amazon, Flipkart, etc., which recommend items to users based on their search history. Beyond recommendation systems, data science is becoming increasingly popular in fraud detection applications that detect fraud in credit-based financial applications. A successful data scientist can interpret data, innovate and bring out creativity while solving problems that help drive business and strategic goals. This is why it is often described as one of the most lucrative jobs of the 21st century.


In this article, we will explore what are the most commonly asked Data Science Technical Interview Questions which will help both aspiring and experienced data scientists.

Data Science Interview Questions for Freshers

1. What is Data Science?

An interdisciplinary field that constitutes various scientific processes, algorithms, tools, and machine learning techniques working to help find common patterns and gather sensible insights from the given raw input data using statistical and mathematical analysis is called Data Science.

The following figure represents the life cycle of data science.


  • It starts with gathering the business requirements and relevant data.
  • Once the data is acquired, it is maintained by performing data cleaning, data warehousing, data staging, and data architecture.
  • Data processing does the task of exploring the data, mining it, and analyzing it which can be finally used to generate the summary of the insights extracted from the data.
  • Once the exploratory steps are completed, the cleansed data is subjected to various algorithms like predictive analysis, regression, text mining, recognition patterns, etc depending on the requirements.
  • In the final stage, the results are communicated to the business in a visually appealing manner. This is where the skills of data visualization and reporting and different business intelligence tools come into the picture.

2. What is the difference between data analytics and data science?

  • Data science involves transforming data using various technical analysis methods to extract meaningful insights, which a data analyst can then apply to business scenarios.
  • Data analytics deals with checking existing hypotheses and information and answers questions for a better and more effective business-related decision-making process.
  • Data science drives innovation by answering questions that build connections and answers for future problems. Data analytics focuses on extracting present meaning from existing historical context, whereas data science focuses on predictive modeling.
  • Data science can be considered a broad subject that makes use of various mathematical and scientific tools and algorithms for solving complex problems, whereas data analytics can be considered a specific field dealing with concentrated problems using fewer tools of statistics and visualization.

The following Venn diagram depicts the difference between data science and data analytics clearly:


3. What are some of the techniques used for sampling? What is the main advantage of sampling?

Data analysis often cannot be carried out on the whole volume of data at once, especially when it involves large datasets. It becomes crucial to take data samples that represent the whole population and perform the analysis on them. While doing this, it is very important to draw the sample carefully so that it truly represents the entire dataset (a small sketch of simple versus stratified sampling follows the list below).


There are two major categories of sampling techniques, based on whether statistical (random) selection is used:

  • Probability Sampling techniques: Clustered sampling, Simple random sampling, Stratified sampling.
  • Non-Probability Sampling techniques: Quota sampling, Convenience sampling, snowball sampling, etc.
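As a small illustration (the column names and numbers here are made up for the sketch), simple random sampling and stratified sampling could look like this in pandas:

```python
import pandas as pd

# Hypothetical population with a categorical "segment" column used as the stratum.
df = pd.DataFrame({
    "segment": ["A"] * 70 + ["B"] * 20 + ["C"] * 10,
    "spend": range(100),
})

# Simple random sampling: every row has the same chance of being selected.
simple_sample = df.sample(frac=0.2, random_state=0)

# Stratified sampling: draw 20% from each segment so the sample
# preserves the population's segment proportions.
stratified_sample = df.groupby("segment").sample(frac=0.2, random_state=0)

print(stratified_sample["segment"].value_counts())
```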

4. List down the conditions for Overfitting and Underfitting.

Overfitting: The model performs well only on the sample training data. When new data is given as input, it produces poor results, i.e. it fails to generalize. This condition occurs due to low bias and high variance in the model. Decision trees are more prone to overfitting.


Underfitting: Here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it performs poorly even on the training data, let alone the test data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.


5. Differentiate between the long and wide format data.

In wide-format data, each subject's repeated measurements sit in a single row, with one column per variable (for example, one row per patient with separate columns for each measurement). In long-format data, each row holds a single observation, typically with one column identifying the variable and another holding its value. Wide data is convenient for human reading and simple summaries, while long data is easier to filter, aggregate and plot programmatically.
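As a minimal sketch (the column names are illustrative), pandas can convert between the two formats with melt and pivot:

```python
import pandas as pd

# Wide format: one row per subject, one column per measurement.
wide = pd.DataFrame({"name": ["A", "B"], "height": [170, 165], "weight": [70, 60]})

# Long format: one row per (subject, variable) pair.
long = wide.melt(id_vars="name", var_name="variable", value_name="value")

# Back to wide format.
wide_again = long.pivot(index="name", columns="variable", values="value").reset_index()
print(long)
```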


6. What are eigenvectors and eigenvalues?

Eigenvectors are column vectors (usually normalized to unit length, i.e. magnitude 1) that a matrix maps onto scaled versions of themselves; they are also called right vectors. Eigenvalues are the coefficients applied to the eigenvectors, i.e. the scaling factors that give these vectors their length or magnitude.


A matrix can be decomposed into Eigenvectors and Eigenvalues and this process is called Eigen decomposition. These are then eventually used in machine learning methods like PCA (Principal Component Analysis) for gathering valuable insights from the given matrix.
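A minimal NumPy sketch of an eigendecomposition (the matrix here is an arbitrary example):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Eigendecomposition: A @ v = w * v for each eigenvalue w and eigenvector v.
eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are returned unit-length

v0 = eigenvectors[:, 0]                        # each column is one eigenvector
print(eigenvalues)
print(np.allclose(A @ v0, eigenvalues[0] * v0))  # True: A scales v0 by its eigenvalue
```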

7. What does it mean when the p-values are high and low?

A p-value is the measure of the probability of having results equal to or more than the results achieved under a specific hypothesis assuming that the null hypothesis is correct. This represents the probability that the observed difference occurred randomly by chance.

  • A low p-value (≤ 0.05) means that the null hypothesis can be rejected: the data is unlikely under a true null.
  • A high p-value (≥ 0.05) indicates strength in favor of the null hypothesis: the data is likely under a true null.
  • A p-value of exactly 0.05 is marginal, and the hypothesis could go either way.

8. When is resampling done?

Resampling is a methodology used to sample data for improving accuracy and quantify the uncertainty of population parameters. It is done to ensure the model is good enough by training the model on different patterns of a dataset to ensure variations are handled. It is also done in the cases where models need to be validated using random subsets or when substituting labels on data points while performing tests.

9. What do you understand by Imbalanced Data?

Data is said to be highly imbalanced if it is distributed unequally across different categories. These datasets result in an error in model performance and result in inaccuracy.

10. Are there any differences between the expected value and mean value?

There are not many differences between these two, but it is to be noted that these are used in different contexts. The mean value generally refers to the probability distribution whereas the expected value is referred to in the contexts involving random variables.

11. What do you understand by Survivorship Bias?

This bias refers to the logical error while focusing on aspects that survived some process and overlooking those that did not work due to lack of prominence. This bias can lead to deriving wrong conclusions.

12. Define the terms KPI, lift, model fitting, robustness and DOE.

  • KPI: KPI stands for Key Performance Indicator that measures how well the business achieves its objectives.
  • Lift: This is a performance measure of the target model measured against a random choice model. Lift indicates how good the model is at prediction versus if there was no model.
  • Model fitting: This indicates how well the model under consideration fits given observations.
  • Robustness: This represents the system’s capability to handle differences and variances effectively.
  • DOE: Stands for Design of Experiments, the design of a task that aims to describe and explain how information varies under conditions hypothesized to reflect the variables of interest.

13. Define confounding variables.

Confounding variables are also known as confounders. They are extraneous variables that influence both the independent and dependent variables, causing spurious associations and mathematical relationships between variables that are associated but not causally related to each other.

14. Define and explain selection bias?

Selection bias occurs when the researcher has to decide which participants to study and the selection of participants is not random. It is also called the selection effect, and it is caused by the method of sample collection.

Four types of selection bias are explained below:

  • Sampling Bias: As a result of a population that is not random at all, some members of a population have fewer chances of getting included than others, resulting in a biased sample. This causes a systematic error known as sampling bias.
  • Time interval: Trials may be stopped early if an extreme value is reached, but if all variables are otherwise similar, the variable with the highest variance has a higher chance of achieving that extreme value first.
  • Data: It is when specific data is selected arbitrarily and the generally agreed criteria are not followed.
  • Attrition: Attrition in this context means the loss of the participants. It is the discounting of those subjects that did not complete the trial.

15. Define bias-variance trade-off?

Let us first understand the meaning of bias and variance in detail:

Bias: It is a kind of error in a machine learning model when an ML Algorithm is oversimplified. When a model is trained, at that time it makes simplified assumptions so that it can easily understand the target function. Some algorithms that have low bias are Decision Trees, SVM, etc. On the other hand, logistic and linear regression algorithms are the ones with a high bias.

Variance: Variance is also a kind of error. It is introduced into an ML model when an ML algorithm is made highly complex. Such a model also learns noise from the training data set and, as a result, performs badly on the test data set. This leads to overfitting and high sensitivity to noise.

When the complexity of a model is increased, a reduction in the error is seen. This is caused by the lower bias in the model. But this only holds until we reach a particular point, called the optimal point. After this point, if we keep increasing the complexity of the model, it will be overfitted and will suffer from the problem of high variance. We can represent this situation with the help of a graph as shown below:


As you can see from the image above, before the optimal point, increasing the complexity of the model reduces the error (bias). However, after the optimal point, we see that the increase in the complexity of the machine learning model increases the variance.

Trade-off Of Bias And Variance: So, as we know that bias and variance, both are errors in machine learning models, it is very essential that any machine learning model has low variance as well as a low bias so that it can achieve good performance.

Let us see some examples. The K-Nearest Neighbor Algorithm is a good example of an algorithm with low bias and high variance. This trade-off can easily be reversed by increasing the k value which in turn results in increasing the number of neighbours. This, in turn, results in increasing the bias and reducing the variance.

Another example can be the algorithm of a support vector machine. This algorithm also has a high variance and obviously, a low bias and we can reverse the trade-off by increasing the value of parameter C. Thus, increasing the C parameter increases the bias and decreases the variance.

So, the trade-off is simple. If we increase the bias, the variance will decrease and vice versa.

16. Define the confusion matrix?

For a binary classifier, it is a matrix with 2 rows and 2 columns that summarizes the 4 possible outcomes of the classifier's predictions. It is used to derive various measures like specificity, error rate, accuracy, precision, sensitivity, and recall.


The test data set should contain the correct (observed) and predicted labels. If the binary classifier performed perfectly, the predicted labels would exactly match the observed labels; in real-world scenarios they only partially match. The four outcomes in the confusion matrix mean the following:

  • True Positive: This means that the positive prediction is correct.
  • False Positive: This means that the positive prediction is incorrect.
  • True Negative: This means that the negative prediction is correct.
  • False Negative: This means that the negative prediction is incorrect.

The formulas for calculating basic measures that comes from the confusion matrix are:

  • Error rate : (FP + FN)/(P + N)
  • Accuracy : (TP + TN)/(P + N)
  • Sensitivity = TP/P
  • Specificity = TN/N
  • Precision = TP/(TP + FP)
  • F-Score = (1 + b²)(Precision × Recall) / (b² × Precision + Recall). Here, b is commonly 0.5, 1 or 2.

In these formulas:

FP = false positive, FN = false negative, TP = true positive, TN = true negative

Sensitivity is the measure of the True Positive Rate. It is also called recall. Specificity is the measure of the true negative rate. Precision is the measure of a positive predicted value. F-score is the harmonic mean of precision and recall.
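A short sketch of how these measures can be derived from a confusion matrix with scikit-learn (the labels below are made-up examples):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # observed labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])   # predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
error_rate  = (fp + fn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)            # recall / true positive rate
specificity = tn / (tn + fp)            # true negative rate
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, sensitivity, specificity, precision, f1)
```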

17. What is logistic regression? State an example where you have recently used logistic regression.

Logistic Regression is also known as the logit model. It is a technique to predict the binary outcome from a linear combination of variables (called the predictor variables). 

For example , let us say that we want to predict the outcome of elections for a particular political leader. So, we want to find out whether this leader is going to win the election or not. So, the result is binary i.e. win (1) or loss (0). However, the input is a combination of linear variables like the money spent on advertising, the past work done by the leader and the party, etc. 

18. What is Linear Regression? What are some of the major drawbacks of the linear model?

Linear regression is a technique in which the score of a variable Y is predicted using the score of a predictor variable X. Y is called the criterion variable. Some of the drawbacks of Linear Regression are as follows:

  • The assumption of linearity of errors is a major drawback.
  • It cannot be used for binary outcomes. We have Logistic Regression for that.
  • It is prone to overfitting problems that cannot easily be resolved.

19. What is a random forest? Explain it’s working.

Classification is very important in machine learning, since it is often essential to know to which class an observation belongs. Hence, we have various classification algorithms in machine learning like logistic regression, support vector machines, decision trees, Naive Bayes classifiers, etc. One such classification technique that sits near the top of the classification hierarchy is the random forest classifier.

So, first we need to understand a decision tree before we can understand the random forest classifier and how it works. Let us say that we have a string as given below:


So, we have the string with 5 ones and 4 zeroes and we want to classify the characters of this string using their features. These features are colour (red or green in this case) and whether the observation (i.e. character) is underlined or not. Now, let us say that we are only interested in red and underlined observations. So, the decision tree would look something like this:


So, we started with the colour first as we are only interested in the red observations and we separated the red and the green-coloured characters. After that, the “No” branch i.e. the branch that had all the green coloured characters was not expanded further as we want only red-underlined characters. So, we expanded the “Yes” branch and we again got a “Yes” and a “No” branch based on the fact whether the characters were underlined or not. 

So, this is how we draw a typical decision tree. However, the data in real life is not this clean but this was just to give an idea about the working of the decision trees. Let us now move to the random forest.

Random Forest

It consists of a large number of decision trees that operate as an ensemble. Basically, each tree in the forest gives a class prediction and the one with the maximum number of votes becomes the prediction of our model. For instance, in the example shown below, 4 decision trees predict 1, and 2 predict 0. Hence, prediction 1 will be considered.


The underlying principle of a random forest is that several weak learners combine to form a keen learner. The steps to build a random forest are as follows:

  • Build several decision trees on the samples of data and record their predictions.
  • Each time a split is considered for a tree, choose a random sample of m predictors as the split candidates out of all p predictors. This happens for every tree in the random forest.
  • Apply the rule of thumb: at each split use m ≈ √p predictors (see the sketch after this list).
  • Apply the predictions to the majority rule.
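As a minimal scikit-learn sketch (the dataset and hyperparameter values are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of decision trees in the ensemble; max_features="sqrt"
# corresponds to the m ≈ √p rule of thumb for the predictors tried at each split.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)

# The forest's prediction is the majority vote of its trees.
print(model.score(X_test, y_test))
```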

20. In a time interval of 15-minutes, the probability that you may see a shooting star or a bunch of them is 0.2. What is the percentage chance of you seeing at least one star shooting from the sky if you are under it for about an hour?

Let us say that Prob is the probability that we may see a minimum of one shooting star in 15 minutes.

So, Prob = 0.2

Now, the probability that we may not see any shooting star in the time duration of 15 minutes is = 1 - Prob

1-0.2 = 0.8

The probability that we may not see any shooting star for an hour is: 

= (1 − Prob)⁴ = 0.8 × 0.8 × 0.8 × 0.8 = (0.8)⁴ ≈ 0.41

So, the probability that we will see at least one shooting star in the time interval of an hour is 1 − 0.41 ≈ 0.59

So, there is roughly a 59% (about 60%) chance that we may see a shooting star in the span of an hour.

21. What is deep learning? What is the difference between deep learning and machine learning?

Deep learning is a paradigm of machine learning in which multiple layers of processing are used to extract progressively higher-level features from the data. The neural networks are designed in such a way that they try to simulate the human brain.

Deep learning has shown incredible performance in recent years, partly because of this loose analogy with the human brain.

The difference between machine learning and deep learning is that deep learning is a part of machine learning that is inspired by the structure and functions of the human brain, in the form of artificial neural networks.

22. What is a Gradient and Gradient Descent?

Gradient: The gradient measures how much the output of a function changes with respect to a small change in its input. In other words, it is a measure of the change in the weights with respect to the change in the error. The gradient can be mathematically represented as the slope of a function.


Gradient Descent: Gradient descent is a minimization algorithm. In deep learning it is usually applied to the loss (cost) function, although it can minimize any differentiable function given to it.

Gradient descent, as the name suggests means descent or a decrease in something. The analogy of gradient descent is often taken as a person climbing down a hill/mountain. The following is the equation describing what gradient descent means:
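In symbols, the update being described is the standard gradient descent step:

b = a − γ ∇F(a)

where a is the current position, b is the next position, γ (gamma) is the weighting factor (learning rate), and ∇F(a) is the gradient of the function being minimized.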

So, if a person is climbing down the hill, the next position that the climber has to come to is denoted by “b” in this equation. There is a minus sign because it denotes minimization (gradient descent is a minimization algorithm). Gamma (γ) is a weighting factor (the learning rate), and the remaining gradient term gives the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.

This situation can be represented in a graph as follows:


Here, we are somewhere at the “Initial Weights” and we want to reach the Global minimum. So, this minimization algorithm will help us do that.
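A tiny sketch of this idea in Python, minimizing the one-dimensional function f(w) = (w − 3)², whose gradient is 2(w − 3); the starting point and learning rate are arbitrary choices for illustration:

```python
def gradient(w):
    # Gradient of f(w) = (w - 3)^2
    return 2 * (w - 3)

w = 10.0        # "initial weights"
gamma = 0.1     # learning rate (the weighting factor)

for _ in range(100):
    w = w - gamma * gradient(w)   # b = a - gamma * gradient(a)

print(round(w, 4))  # converges towards the minimum at w = 3
```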

Data Science Interview Questions for Experienced

1. How are time series problems different from other regression problems?

  • Time series data can be thought of as an extension of linear regression that uses terms like autocorrelation and moving averages to summarize historical data of the y-axis variable in order to predict the future more accurately.
  • Forecasting and prediction are the main goals of time series problems, where accurate predictions can be made even though the underlying reasons may not always be known.
  • Having time in the problem does not necessarily make it a time series problem; there should be a relationship between the target and time for a problem to become a time series problem.
  • Observations close to one another in time are expected to be more similar than observations far apart, which accounts for seasonality. For instance, today’s weather is likely to be similar to tomorrow’s weather but not to the weather four months from now. Hence, weather prediction based on past data becomes a time series problem.

2. What are RMSE and MSE in a linear regression model?

RMSE: RMSE stands for Root Mean Square Error. In a linear regression model, RMSE is used to test the performance of the machine learning model. It is used to evaluate the data spread around the line of best fit. So, in simple words, it is used to measure the deviation of the residuals.

RMSE is calculated using the formula:

RMSE = √( (1/N) Σ (Yᵢ − Ŷᵢ)² ), where:

  • Yi is the actual value of the output variable.
  • Y(Cap) is the predicted value and,
  • N is the number of data points.

MSE: Mean Squared Error is used to find how close the line is to the actual data. We take the difference (distance) of each data point from the line and square it. This is done for all the data points, and the sum of the squared differences divided by the total number of data points gives us the Mean Squared Error (MSE).

So, if we are taking the squared difference of N data points and dividing the sum by N, what does it mean? Yes, it represents the average of the squared difference of a data point from the line i.e. the average of the squared difference between the actual and the predicted values. The formula for finding MSE is given below:

MSE = (1/N) Σ (Yᵢ − Ŷᵢ)², where:

  • Yi is the actual value of the output variable (the ith data point)
  • Y(cap) is the predicted value and,
  • N is the total number of data points.

So, RMSE is the square root of MSE .
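A minimal NumPy sketch of both metrics (the actual and predicted values are invented for the example):

```python
import numpy as np

y_actual = np.array([3.0, 5.0, 2.5, 7.0])   # observed values
y_pred   = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions

mse  = np.mean((y_actual - y_pred) ** 2)    # average squared residual
rmse = np.sqrt(mse)                         # RMSE is the square root of MSE

print(mse, rmse)
```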

3. What are Support Vectors in SVM (Support Vector Machine)?


In the above diagram, we can see that the thin lines mark the distance from the classifier to the closest data points (darkened data points). These are called support vectors. So, we can define the support vectors as the data points or vectors that are nearest (closest) to the hyperplane. They affect the position of the hyperplane. Since they support the hyperplane, they are known as support vectors.

4. Suppose you have done some projects in machine learning and data science and are a bit experienced in the field. Let’s say your laptop’s RAM is only 4 GB and you want to train your model on a 10 GB data set. What will you do? Have you experienced such an issue before?

In such types of questions, we first need to ask what ML model we have to train. After that, it depends on whether we have to train a model based on Neural Networks or SVM.

The steps for Neural Networks are given below:

  • A NumPy memory-mapped array can be used to “load” the entire data set. It never stores the entire data in RAM; rather, it just creates a mapping to the data on disk.
  • Now, in order to get some desired data, pass the index into the NumPy Array.
  • This data can be used to pass as an input to the neural network maintaining a small batch size.

The steps for SVM are given below:

  • For SVM, small data sets can be obtained. This can be done by dividing the big data set.
  • Each subset of the data set can then be given as input to the model using the partial_fit function.
  • Repeat the step of using the partial fit method for other subsets as well.

Now, you may describe the situation if you have faced such an issue in your projects or working in machine learning/ data science.
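A minimal sketch combining the two ideas above, using a memory-mapped file plus incremental fitting with scikit-learn's SGDClassifier (with hinge loss it behaves like a linear SVM). The file names, sizes and model choice are illustrative; a small synthetic file is created first so the sketch runs end to end:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stand-in for a dataset that is too large for RAM and lives on disk.
n_rows, n_cols, batch_size = 100_000, 20, 10_000
rng = np.random.default_rng(0)
X_disk = np.memmap("features.dat", dtype="float32", mode="w+", shape=(n_rows, n_cols))
y_disk = np.memmap("labels.dat", dtype="int8", mode="w+", shape=(n_rows,))
X_disk[:] = rng.normal(size=(n_rows, n_cols))
y_disk[:] = (X_disk[:, 0] > 0)

# np.memmap only maps the file; each slice pulls one batch into RAM on demand.
model = SGDClassifier(loss="hinge")          # linear SVM trained with SGD
classes = np.array([0, 1])
for start in range(0, n_rows, batch_size):
    batch = slice(start, start + batch_size)
    model.partial_fit(X_disk[batch], y_disk[batch], classes=classes)

print(model.score(X_disk[:1000], y_disk[:1000]))
```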

5. Explain Neural Network Fundamentals.

In the human brain, different neurons are present. These neurons combine and perform various tasks. The Neural Network in deep learning tries to imitate human brain neurons. The neural network learns the patterns from the data and uses the knowledge that it gains from various patterns to predict the output for new data, without any human assistance.

A perceptron is the simplest neural network that contains a single neuron that performs 2 functions. The first function is to perform the weighted sum of all the inputs and the second is an activation function.
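A minimal sketch of a perceptron's forward pass (the weights, bias and inputs are arbitrary illustrative numbers):

```python
import numpy as np

def perceptron(inputs, weights, bias):
    # Step 1: weighted sum of the inputs; Step 2: activation (a step function here).
    weighted_sum = np.dot(inputs, weights) + bias
    return 1 if weighted_sum > 0 else 0

print(perceptron(np.array([1.0, 0.5]), np.array([0.4, -0.2]), bias=0.1))  # -> 1
```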


There are some other neural networks that are more complicated. Such networks consist of the following three layers:

  • Input Layer: The neural network has the input layer to receive the input.
  • Hidden Layer: There can be multiple hidden layers between the input layer and the output layer. The initially hidden layers are used for detecting the low-level patterns whereas the further layers are responsible for combining output from previous layers to find more patterns.
  • Output Layer: This layer outputs the prediction.

An example neural network image is shown below:


6. What is Generative Adversarial Network?

This approach can be understood with the famous example of the wine seller. Let us say that there is a wine seller who has his own shop. This wine seller purchases wine from the dealers who sell him the wine at a low cost so that he can sell the wine at a high cost to the customers. Now, let us say that the dealers whom he is purchasing the wine from, are selling him fake wine. They do this as the fake wine costs way less than the original wine and the fake and the real wine are indistinguishable to a normal consumer (customer in this case). The shop owner has some friends who are wine experts and he sends his wine to them every time before keeping the stock for sale in his shop. So, his friends, the wine experts, give him feedback that the wine is probably fake. Since the wine seller has been purchasing the wine for a long time from the same dealers, he wants to make sure that their feedback is right before he complains to the dealers about it. Now, let us say that the dealers also have got a tip from somewhere that the wine seller is suspicious of them.

So, in this situation, the dealers will try their best to sell the fake wine whereas the wine seller will try his best to identify the fake wine. Let us see this with the help of a diagram shown below:


From the image above, it is clear that a noise vector is entering the generator (dealer) and he generates the fake wine and the discriminator has to distinguish between the fake wine and real wine. This is a Generative Adversarial Network (GAN).

In a GAN, there are 2 main components, viz. the Generator and the Discriminator. The generator is a CNN that keeps producing images, and the discriminator tries to identify the real images from the fake ones.

7. What is a computational graph?

A computational graph is also known as a “Dataflow Graph”. Everything in the famous deep learning library TensorFlow is based on the computational graph. The computational graph in TensorFlow is a network of nodes where each node performs an operation: the nodes of the graph represent operations and the edges represent the tensors flowing between them.
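As a small illustration, a Python function decorated with tf.function is traced by TensorFlow into such a graph (the computation itself is an arbitrary example):

```python
import tensorflow as tf

@tf.function            # traces the Python function into a TensorFlow dataflow graph
def f(x, y):
    return x * y + y    # nodes: multiply and add; edges: the tensors x and y

out = f(tf.constant(2.0), tf.constant(3.0))
print(out.numpy())      # 9.0
```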

8. What are auto-encoders?

Auto-encoders are learning networks. They transform inputs into outputs with the minimum possible error, so the output should be almost equal to, or as close as possible to, the input.

Multiple hidden layers are added between the input and the output layer, and these intermediate layers are smaller than the input layer, forming a bottleneck. The network receives unlabelled input, encodes it into this compressed representation, and later reconstructs the input from it.

9. What are Exploding Gradients and Vanishing Gradients?

  • Exploding Gradients: Let us say that you are training an RNN. Say, you saw exponentially growing error gradients that accumulate, and as a result of this, very large updates are made to the neural network model weights. These exponentially growing error gradients that update the neural network weights to a great extent are called Exploding Gradients .
  • Vanishing Gradients: Let us say again, that you are training an RNN. Say, the slope became too small. This problem of the slope becoming too small is called Vanishing Gradient . It causes a major increase in the training time and causes poor performance and extremely low accuracy.

10. What is the p-value and what does it indicate in the Null Hypothesis?

P-value is a number that ranges from 0 to 1. In a hypothesis test in statistics, the p-value helps in telling us how strong the results are. The claim that is kept for experiment or trial is called Null Hypothesis.

  • A low p-value i.e. p-value less than or equal to 0.05 indicates the strength of the results against the Null Hypothesis which in turn means that the Null Hypothesis can be rejected. 
  • A high p-value i.e. p-value greater than 0.05 indicates the strength of the results in favour of the Null Hypothesis i.e. for the Null Hypothesis which in turn means that the Null Hypothesis can be accepted.

11. Since you have experience in the deep learning field, can you tell us why TensorFlow is the most preferred library in deep learning?

TensorFlow is a very famous library in deep learning. The reason is pretty simple: it provides C++ as well as Python APIs, which makes it easier to work with. TensorFlow also has a faster compilation speed compared to other deep learning libraries such as Keras and Torch. Apart from that, TensorFlow supports both GPU and CPU computing devices. Hence, it has been a major success and a very popular library for deep learning.

12. Suppose there is a dataset having variables with missing values of more than 30%, how will you deal with such a dataset?

Depending on the size of the dataset, we follow the below ways:

  • In case the datasets are small, the missing values are substituted with the mean or average of the remaining data. In pandas, this can be done by using mean = df.mean() where df represents the pandas dataframe representing the dataset and mean() calculates the mean of the data. To substitute the missing values with the calculated mean, we can use df.fillna(mean) .
  • For larger datasets, the rows with missing values can be removed and the remaining data can be used for data prediction.
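A compact, runnable version of both strategies in pandas (the toy DataFrame is invented for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 31],
                   "income": [50_000, 62_000, np.nan, 58_000, 54_000]})

# Small dataset: substitute missing values with the column means.
df_small = df.fillna(df.mean(numeric_only=True))

# Larger dataset: simply drop the rows that contain missing values.
df_large = df.dropna()

print(df_small)
print(df_large)
```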

13. What is Cross-Validation?

Cross-validation is a statistical technique used for assessing and improving a model’s performance. Here, the model is trained and tested in rotation on different samples of the training dataset to ensure that it performs well on unknown data. The training data is split into several groups and the model is run and validated against each of these groups in rotation.


The most commonly used techniques are:

  • K- Fold method
  • Leave p-out method
  • Leave-one-out method
  • Holdout method
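A minimal sketch of K-fold cross-validation with scikit-learn (the dataset and model are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: the data is split into 5 groups and the model is trained/validated
# 5 times, each time holding out a different fold for validation.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```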

14. What are the differences between correlation and covariance?

Although these two terms are used for establishing a relationship and dependency between any two random variables, the following are the differences between them:

  • Correlation: This technique is used to measure and estimate the quantitative relationship between two variables, expressed in terms of how strongly the variables are related.
  • Covariance: It represents the extent to which the variables change together in a cycle. This explains the systematic relationship between pair of variables where changes in one affect changes in another variable.

Mathematically, consider two random variables X and Y with means μX and μY, standard deviations σX and σY, and let E denote the expected value operator. Then:

  • covariance(X, Y) = E[(X − μX)(Y − μY)]
  • correlation(X, Y) = E[(X − μX)(Y − μY)] / (σX σY)

Based on the above formula, we can deduce that the correlation is dimensionless whereas covariance is represented in units that are obtained from the multiplication of units of two variables.
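A quick NumPy check of this difference (the two series below are made up):

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.0, 14.0, 16.5])

cov_xy  = np.cov(x, y)[0, 1]        # covariance: carries the units of x times the units of y
corr_xy = np.corrcoef(x, y)[0, 1]   # correlation: dimensionless, bounded in [-1, 1]

print(cov_xy, corr_xy)
```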

The following image graphically shows the difference between correlation and covariance:


15. How do you approach solving any data analytics based project?

Generally, we follow the below steps:

  • The first step is to thoroughly understand the business requirement/problem
  • Next, explore the given data and analyze it carefully. If you find any data missing, get the requirements clarified from the business.
  • Data cleanup and preparation step is to be performed next which is then used for modelling. Here, the missing values are found and the variables are transformed.
  • Run your model against the data, build meaningful visualization and analyze the results to get meaningful insights.
  • Release the model implementation, and track the results and performance over a specified period to analyze the usefulness.
  • Perform cross-validation of the model.



16. How regularly must we update an algorithm in the field of machine learning?

We do not want to make changes to an algorithm on a regular basis: an algorithm is a well-defined step-by-step procedure for solving a problem, and if the steps keep changing, it can no longer be called well defined. Frequent changes also cause problems for the systems that already implement the algorithm, as it becomes difficult to accommodate continuous updates. So, we should update an algorithm only in any of the following cases:

  • If you want the model to evolve as data streams through infrastructure, it is fair to make changes to an algorithm and update it accordingly.
  • If the underlying data source is changing, it almost becomes necessary to update the algorithm accordingly.
  • If there is a case of non-stationarity, we may update the algorithm.
  • One of the most important reasons for updating any algorithm is its underperformance and lack of efficiency. So, if an algorithm lacks efficiency or underperforms it should be either replaced by some better algorithm or it must be updated.

17. Why do we need selection bias?

Selection Bias happens in cases where there is no randomization specifically achieved while picking a part of the dataset for analysis. This bias tells that the sample analyzed does not represent the whole population meant to be analyzed.

  • For example, in the below image, we can see that the sample that we selected does not entirely represent the whole population that we have. This helps us to question whether we have selected the right data for analysis or not.


18. Why is data cleaning crucial? How do you clean the data?

While running an algorithm on any data, to gather proper insights, it is very much necessary to have correct and clean data that contains only relevant information. Dirty data most often results in poor or incorrect insights and predictions which can have damaging effects.

For example, while launching any big campaign to market a product, if our data analysis tells us to target a product that in reality has no demand and if the campaign is launched, it is bound to fail. This results in a loss of the company’s revenue. This is where the importance of having proper and clean data comes into the picture.

  • Data cleaning of the data coming from different sources helps in data transformation and results in data that data scientists can work with.
  • Properly cleaned data increases the accuracy of the model and provides very good predictions.
  • If the dataset is very large, it becomes cumbersome to run models on it. The data cleanup step takes a lot of time (around 80% of a project’s time) when the data is huge, so it cannot simply be lumped together with running the model. Cleaning the data before running the model results in increased speed and efficiency of the model.
  • Data cleaning helps to identify and fix any structural issues in the data. It also helps in removing any duplicates and helps to maintain the consistency of the data.

The following diagram represents the advantages of data cleaning:


19. What are the available feature selection methods for selecting the right variables for building efficient predictive models?

While using a dataset in data science or machine learning algorithms, it so happens that not all the variables are necessary and useful to build a model. Smarter feature selection methods are required to avoid redundant models to increase the efficiency of our model. Following are the three main methods in feature selection:

  • Filter Methods: These methods pick up only the intrinsic properties of features that are measured via univariate statistics and not cross-validated performance. They are straightforward, generally faster, and require fewer computational resources when compared to wrapper methods.
  • There are various filter methods such as the Chi-Square test, Fisher’s Score method, Correlation Coefficient, Variance Threshold, Mean Absolute Difference (MAD) method, Dispersion Ratios, etc.


  • Wrapper Methods: These methods need some strategy to search greedily over possible feature subsets, assessing the quality of each subset by training and evaluating a classifier with it.
  • The selection technique is built upon the machine learning algorithm on which the given dataset needs to fit.
  • Forward Selection: Here, one feature is tested at a time and new features are added until a good fit is obtained.
  • Backward Selection: Here, all the features are tested and the non-fitting ones are eliminated one by one to see while checking which works better.
  • Recursive Feature Elimination: The features are recursively checked and evaluated how well they perform.
  • These methods are generally computationally intensive and require high-end resources for analysis. But these methods usually lead to better predictive models having higher accuracy than filter methods.


  • Embedded Methods: Embedded methods combine the advantages of both filter and wrapper methods by including feature interactions while maintaining reasonable computational costs.
  • These methods are iterative as they take each model iteration and carefully extract features contributing to most of the training in that iteration.
  • Examples of embedded methods: LASSO Regularization (L1), Random Forest Importance.
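A short scikit-learn sketch of a filter method and a wrapper method (the dataset, the estimator and k = 10 are illustrative choices; an embedded example with Lasso appears under the regularization question below):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Filter method: rank features with a univariate statistic (ANOVA F-score).
X_filtered = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper method: recursive feature elimination around an estimator.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)
```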


20. During analysis, how do you treat the missing values?

To identify the extent of missing values, we first have to identify the variables with the missing values. Let us say a pattern is identified. The analyst should now concentrate on them as it could lead to interesting and meaningful insights. However, if there are no patterns identified, we can substitute the missing values with the median or mean values or we can simply ignore the missing values. 

If the variable is categorical, the common strategies for handling missing values include:

  • Assigning a New Category: You can assign a new category, such as "Unknown" or "Other," to represent the missing values.
  • Mode imputation: You can replace missing values with the mode, which represents the most frequent category in the variable.
  • Using a Separate Category: If the missing values carry significant information, you can create a separate category to indicate missing values.

It's important to select an appropriate strategy based on the nature of the data and the potential impact on subsequent analysis or modelling.

If 80% of the values are missing for a particular variable, then we would drop the variable instead of treating the missing values.
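A small pandas sketch of the categorical strategies above (the column and values are invented for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Paris", np.nan, "Delhi", "Paris", np.nan]})

# Option 1: assign a new category to represent the missing values.
df["city_unknown"] = df["city"].fillna("Unknown")

# Option 2: mode imputation - replace missing values with the most frequent category.
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```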

21. Will treating categorical variables as continuous variables result in a better predictive model?

Yes! A categorical variable is a variable that can be assigned to two or more categories with no definite ordering of the categories. Ordinal variables are similar to categorical variables but with a proper and clear ordering defined. So, if the variable is ordinal, then treating the categorical values as continuous will result in better predictive models.

22. How will you treat missing values during data analysis?

The impact of missing values can be known after identifying what type of variables have missing values.

  • If the data analyst finds any pattern in these missing values, then there are chances of finding meaningful insights.
  • If no patterns are found, then the missing values can either be ignored or be replaced with default values such as the mean, minimum, maximum, or median value.
  • Assigning a new category: You can assign a new category, such as "Unknown" or "Other," to represent the missing values.
  • Using a separate category : If the missing values carry significant information, you can create a separate category to indicate the missing values. It's important to select an appropriate strategy based on the nature of the data and the potential impact on subsequent analysis or modelling.
  • If 80% of values are missing, then it depends on the analyst to either replace them with default values or drop the variables.

23. What does the ROC Curve represent and how to create it?

ROC (Receiver Operating Characteristic) curve is a graphical representation of the contrast between false-positive rates and true positive rates at different thresholds. The curve is used as a proxy for a trade-off between sensitivity and specificity.

The ROC curve is created by plotting values of true positive rates (TPR, or sensitivity) against false-positive rates (FPR, or 1 − specificity). TPR represents the proportion of observations correctly predicted as positive out of all positive observations, while FPR represents the proportion of negative observations incorrectly predicted as positive. In the example of medical testing, the TPR represents the rate at which people are correctly tested positive for a particular disease.
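A minimal scikit-learn sketch of computing the points of an ROC curve (the dataset and model are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]      # predicted probability of the positive class

# fpr and tpr are evaluated at every threshold; plotting tpr against fpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(roc_auc_score(y_test, scores))            # area under that curve
```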


24. What are the differences between univariate, bivariate and multivariate analysis?

Statistical analyses are classified based on the number of variables processed at a given time: univariate analysis describes a single variable at a time (e.g., a histogram of sales), bivariate analysis studies the relationship between exactly two variables (e.g., a scatter plot of temperature versus ice-cream sales), and multivariate analysis involves more than two variables simultaneously (e.g., multiple regression with several predictors).

25. What is the difference between the Test set and validation set?

The test set is used to test or evaluate the performance of the trained model. It evaluates the predictive power of the model. The validation set is part of the training set that is used to select parameters for avoiding model overfitting.

26. What do you understand by a kernel trick?

Kernel functions are generalized dot product functions used for computing the dot product of vectors x and y in a high-dimensional feature space. The kernel trick is a method for solving a non-linear problem with a linear classifier by implicitly transforming linearly inseparable data into a higher-dimensional space where it becomes separable.


27. Differentiate between box plot and histogram.

Box plots and histograms are both visualizations used for showing data distributions and for efficient communication of information. Histograms are bar-chart representations of the frequency of numerical variable values and are useful for estimating the probability distribution, variation and outliers. Boxplots communicate different aspects of the data distribution: the exact shape of the distribution is not visible, but insights such as the median, spread and outliers can still be gathered. They are also useful for comparing multiple distributions at the same time, as they take less space than histograms.


28. How will you balance/correct imbalanced data?

There are different techniques to correct/balance imbalanced data. It can be done by increasing the sample numbers for minority classes. The number of samples can be decreased for those classes with extremely high data points. Following are some approaches followed to balance data:

  • Precision (positive predictive value): Indicates what proportion of the selected instances are actually relevant.
  • Sensitivity: Indicates the number of relevant instances that are selected.
  • F1 score: It represents the harmonic mean of precision and sensitivity.
  • MCC (Matthews correlation coefficient): It represents the correlation coefficient between observed and predicted binary classifications.
  • AUC (Area Under the Curve): This represents a relation between the true positive rates and false-positive rates.

For example, consider the below graph that illustrates training data:

Here, if we measure the accuracy of the model in terms of getting "0"s, then the accuracy of the model would be very high -> 99.9%, but the model does not guarantee any valuable information. In such cases, we can apply different evaluation metrics as stated above.


  • Under-sampling: This balances the data by reducing the size of the abundant class and is used when the data quantity is sufficient. By doing this, a new, balanced dataset can be retrieved and used for further modeling.
  • Over-sampling: This is used when the data quantity is not sufficient. It balances the dataset by increasing the size of the rare class: instead of getting rid of extra samples, new samples are generated and introduced by employing methods such as repetition and bootstrapping.
  • Perform K-fold cross-validation correctly: Cross-validation needs to be applied properly while using over-sampling. The cross-validation split should be done before over-sampling, because doing it afterwards would amount to overfitting the model to a specific result. To avoid this, resampling of the data is done repeatedly with different ratios (a small resampling sketch follows this list).
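A small sketch of over- and under-sampling with scikit-learn's resample utility (the toy imbalanced DataFrame is invented for the example):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(1000),
                   "label": [0] * 950 + [1] * 50})   # 95% vs 5%: imbalanced

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Over-sampling: draw (with replacement) extra minority samples.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_over = pd.concat([majority, minority_up])

# Under-sampling: keep only a subset of the majority class.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_under = pd.concat([majority_down, minority])

print(balanced_over["label"].value_counts())
print(balanced_under["label"].value_counts())
```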

29. What is better - random forest or multiple decision trees?

Random forest is better than multiple independent decision trees: random forests are more robust, more accurate, and less prone to overfitting because they are an ensemble method in which many weak decision trees are combined into a strong learner.

30. Consider a case where you know the probability of finding at least one shooting star in a 15-minute interval is 30%. Evaluate the probability of finding at least one shooting star in a one-hour duration?

The probability of not seeing a shooting star in a 15-minute interval is 1 − 0.3 = 0.7. An hour consists of four independent 15-minute intervals, so the probability of seeing no shooting star in an hour is 0.7⁴ ≈ 0.24. Therefore, the probability of finding at least one shooting star in an hour is 1 − 0.24 ≈ 0.76, i.e. about 76%.

31. Toss the selected coin 10 times from a jar of 1000 coins. Out of 1000 coins, 999 coins are fair and 1 coin is double-headed, assume that you see 10 heads. Estimate the probability of getting a head in the next coin toss.

We know that there are two types of coins - fair and double-headed. Hence, there are two possible ways of choosing a coin. The first is to choose a fair coin and the second is to choose a coin having 2 heads.

P(selecting fair coin) = 999/1000 = 0.999 P(selecting double headed coin) = 1/1000 = 0.001

Using Bayes’ rule, the probability that the chosen coin is the fair one given 10 heads is:

P(fair | 10 heads) = [0.999 × (0.5)¹⁰] / [0.999 × (0.5)¹⁰ + 0.001 × 1] ≈ 0.4939

so P(double-headed | 10 heads) ≈ 0.5061. The probability of getting a head on the next toss is then:

P(head) = 0.4939 × 0.5 + 0.5061 × 1 ≈ 0.7531

So, the answer is 0.7531 or 75.31%.

32. What are some examples when false positive has proven important than false negative?

Before citing instances, let us understand what are false positives and false negatives.

  • False Positives are those cases that were wrongly identified as an event even if they were not. They are called Type I errors.
  • False Negatives are those cases that were wrongly identified as non-events despite being an event. They are called Type II errors.

Some examples where false positives were more important than false negatives are:

  • In the medical field: Consider a lab report that predicts cancer for a patient who does not actually have cancer. This is an example of a false positive error. Starting chemotherapy for that patient would be dangerous: it would damage healthy cells even though he does not have the disease.
  • In the e-commerce field: Suppose a company launches a campaign that gives $100 gift vouchers to customers it believes have purchased at least $10,000 worth of items, assuming the campaign will yield at least a 20% profit on such sales. If the vouchers are mistakenly given to customers who haven’t purchased anything but have been wrongly marked as having purchased $10,000 worth of products, that is a false-positive error.

33. Give one example where false positives and false negatives are equally important.

In banking: Lending is one of the main sources of income for a bank, but if the repayment rate is poor, the bank risks large losses instead of profit. Issuing loans is therefore a gamble: the bank cannot afford to turn away good customers, yet it also cannot afford to take on bad ones. This is a classic case where false positives and false negatives carry equal importance.

34. Is it good to do dimensionality reduction before fitting a Support Vector Machine?

Yes, it generally helps. When the number of features is larger than the number of observations, performing dimensionality reduction before fitting the SVM (Support Vector Machine) usually improves its performance and speeds up training.
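
A minimal sketch of this idea, using synthetic data with far more features than observations and PCA before the SVM (the feature counts and component number are illustrative):

```python
# Comparing an SVM with and without PCA when features outnumber observations.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 100 samples, 500 features: far more features than observations
X, y = make_classification(n_samples=100, n_features=500, n_informative=20, random_state=0)

svm_only = make_pipeline(StandardScaler(), SVC())
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=20), SVC())

print("SVM alone:", cross_val_score(svm_only, X, y, cv=5).mean().round(3))
print("PCA + SVM:", cross_val_score(svm_pca, X, y, cv=5).mean().round(3))
```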

35. What are various assumptions used in linear regression? What would happen if they are violated?

Linear regression is done under the following assumptions:

  • The sample data used for modeling represents the entire population.
  • There exists a linear relationship between the independent variable X and the mean of the dependent variable Y.
  • The residual variance is the same for any value of X. This is called homoscedasticity.
  • The observations are independent of one another.
  • For any fixed value of X, Y is normally distributed.

Extreme violations of these assumptions make the results unreliable or meaningless; smaller violations increase the bias or variance of the estimates. A small sketch for checking some of the assumptions follows.
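
A minimal sketch of checking some of these assumptions with statsmodels and scipy on synthetic data (the data and interpretation thresholds are illustrative):

```python
# Fitting OLS and running basic diagnostics for the linear regression assumptions.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(size=200)

X_const = sm.add_constant(X)                 # add the intercept term
residuals = sm.OLS(y, X_const).fit().resid

# Normality of residuals (Shapiro-Wilk): large p-value -> normality not rejected
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Homoscedasticity (Breusch-Pagan): large p-value -> constant variance not rejected
print("Breusch-Pagan p-value:", het_breuschpagan(residuals, X_const)[1])

# Independence (Durbin-Watson): values near 2 suggest no autocorrelation
print("Durbin-Watson statistic:", durbin_watson(residuals))
```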

36. How is feature selection performed using the regularization method?

Regularization adds penalties on the model's parameters to reduce its freedom and thereby avoid overfitting. Common methods include Ridge/L2 regularization, which penalizes the size of the coefficients that multiply the predictors, and Lasso/L1 regularization, which can shrink some coefficients exactly to zero; any feature whose coefficient is driven to zero can be removed from the model, which is how L1 regularization performs feature selection.
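
A minimal sketch of L1-based feature selection on synthetic data (the dataset and the alpha value are illustrative):

```python
# Lasso (L1) feature selection: keep only features whose coefficients survive the penalty.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # scaling matters for L1 penalties

lasso = Lasso(alpha=1.0).fit(X_scaled, y)
selected = np.flatnonzero(lasso.coef_ != 0)   # indices of features with non-zero coefficients
print("Selected feature indices:", selected)
```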

37. How do you identify if a coin is biased?

To identify this, we perform a hypothesis test as follows. Under the null hypothesis, the coin is unbiased and the probability of flipping heads is 50%. Under the alternative hypothesis, the coin is biased and the probability is not equal to 50%. Then perform the steps below (a short sketch using a binomial test follows the list):

  • Flip the coin 500 times and count the number of heads.
  • Calculate the p-value (e.g., with a binomial test against p = 0.5).
  • If p-value > alpha: the null hypothesis holds and the coin is considered unbiased.
  • If p-value < alpha: the null hypothesis is rejected and the coin is considered biased.
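
A minimal sketch using an exact binomial test from scipy (the observed head count and alpha are illustrative):

```python
# Testing whether a coin is biased with an exact binomial test.
from scipy.stats import binomtest

n_flips, heads, alpha = 500, 280, 0.05
result = binomtest(heads, n_flips, p=0.5, alternative="two-sided")
print("p-value:", result.pvalue)

if result.pvalue < alpha:
    print("Reject the null hypothesis: the coin appears biased.")
else:
    print("Fail to reject the null hypothesis: no evidence the coin is biased.")
```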

38. What is the importance of dimensionality reduction?

Dimensionality reduction means reducing the number of features in a dataset to avoid overfitting and reduce variance. Its main advantages are as follows (a short PCA sketch follows the list):

  • This reduces the storage space and time for model execution.
  • Removes the issue of multi-collinearity thereby improving the parameter interpretation of the ML model.
  • Makes it easier for visualizing data when the dimensions are reduced.
  • Avoids the curse of dimensionality.
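
A minimal sketch of dimensionality reduction with PCA on a built-in dataset, keeping just enough components to explain about 95% of the variance:

```python
# Reducing 64 pixel features to the components that explain ~95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 pixel features per sample
pca = PCA(n_components=0.95)                 # keep 95% of the variance
X_reduced = pca.fit_transform(X)

print("Original dimensions:", X.shape[1])
print("Reduced dimensions:", X_reduced.shape[1])
print("Variance explained:", pca.explained_variance_ratio_.sum().round(3))
```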

39. How is the grid search parameter different from the random search tuning strategy?

Tuning strategies are used to find the right set of hyperparameters. Hyperparameters are model-specific settings that are fixed before the model is trained on the dataset. Both grid search and random search are optimization strategies for finding efficient hyperparameters.

  • Grid search: Every combination in a preset list of hyperparameter values is tried out and evaluated.
  • The search pattern resembles searching a grid: the values form a matrix, each parameter combination is tried, and its accuracy is tracked. After every combination has been evaluated, the model with the highest accuracy is chosen as the best one.
  • The main drawback is that the technique suffers as the number of hyperparameters grows: the number of evaluations increases exponentially with each additional hyperparameter. This is called the curse of dimensionality in grid search.


  • Random search: Random combinations of hyperparameter values are tried and evaluated to find the best solution; the objective is tested at randomly sampled configurations in the parameter space.
  • Because the sampling is random, there is a good chance of landing on near-optimal parameters without having to evaluate every possible combination.
  • Random search works best when the number of dimensions is low, as it then takes less time to find a good set. A short sketch contrasting the two strategies follows.
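
A minimal sketch contrasting the two strategies on a small built-in dataset (the parameter ranges and evaluation budget are illustrative):

```python
# Grid search vs. random search for tuning an SVM's hyperparameters.
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: every combination in the preset grid is evaluated (3 x 3 = 9 settings)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print("Grid search best params:  ", grid.best_params_)

# Random search: a fixed budget of configurations is sampled from continuous distributions
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=9,
    cv=5,
    random_state=0,
)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)
```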


Conclusion:

Data science is a very vast field comprising many topics such as data mining, data analysis, data visualization, machine learning, and deep learning, and it rests on a foundation of mathematical concepts such as linear algebra and statistical analysis. Because becoming a good professional data scientist requires so many prerequisites, the perks and benefits are substantial, and data scientist has become one of the most sought-after job roles today.

Looking for a comprehensive course on data science? Check out Scaler's Data Science Course.

Useful Resources:

  • Best Data Science Courses
  • Python Data Science Interview Questions
  • Google Data Scientist Salary
  • Spotify Data Scientist Salary
  • Data Scientist Salary
  • Data Science Resume
  • Data Analyst: Career Guide
  • Tableau Interview
  • Additional Technical Interview Questions

1. How do I prepare for a data science interview?

Some of the preparation tips for data science interviews are as follows:

  • Resume building: Prepare your resume well. A one-page resume is preferable, especially for a fresher, and the format deserves real thought because it matters a lot. Data science interviews tend to focus on topics such as linear and logistic regression, SVM, root cause analysis, and random forest, so prepare well for data science-specific questions like those discussed in this article, mention those important topics on your resume, and make sure you actually know them well. Include some data science projects as well: group projects or internship experience in the field are ideal, but personal projects also make a good impression, so aim for at least 2-3 data science projects that demonstrate your skill and knowledge. Do not list any skill you do not possess; if you are only familiar with a technology and have not studied it at an advanced level, mark it as a beginner-level skill.
  • Prepare well: Apart from data science-specific questions, freshers in particular can be asked about core subjects such as Database Management Systems (DBMS), Operating Systems (OS), Computer Networks (CN), and Object-Oriented Programming (OOPS), so prepare for those as well.
  • Data structures and algorithms are the basic building blocks of programming, so you should be well versed in them too.
  • Research the company: This is the tip most people miss, and it is very important. Before interviewing with any company, read about it; for data science roles in particular, learn which libraries the company uses, what kind of models they build, and so on. This gives you an edge over most other candidates.

2. Are data science interviews hard?

An honest reply is "yes", because the field is still emerging and will keep evolving. In almost every interview you have to answer tough, challenging questions with confidence, and your concepts need to be strong enough to satisfy the interviewer. However, with enough practice anything can be achieved, so follow the tips discussed above and keep practising and learning; you will definitely succeed.

3. What are the top 3 technical skills of a data scientist?

The top 3 skills of a data scientist are:

  • Mathematics: Data science requires a lot of mathematics, and a good data scientist is strong in it. It is very hard to become a good data scientist while being weak in mathematics.
  • Machine learning and deep learning: A data scientist should be skilled in AI techniques such as machine learning and deep learning. Good projects and plenty of hands-on practice help build excellence in this area.
  • Programming: An obvious yet crucial skill. Being able to solve complex problems is not the same thing as being good at programming; programming here means the ability to write clean, industry-standard code. This is the skill most freshers lack because of limited exposure to industry-level code, and it improves with practice and experience.

4. Is data science a good career?

Yes, data science is one of the most future-proof and rewarding career fields, and it will only keep expanding in the years ahead. The reason is simple: data is often compared to gold because it underpins how almost everything is sold today, and data scientists know how to turn that data into insights and outcomes that would otherwise be unimaginable, which makes it a great career.

5. Are coding questions asked in data science interviews?

Yes, coding questions are asked in data science interviews. Interviewers also expect data scientists to be strong problem solvers, since the work involves a great deal of rigorous, mathematics-based thinking, so candidates should know data structures and algorithms and be able to come up with solutions to most of the problems they are given.

6. Is python and SQL enough for data science?

Yes, Python and SQL are sufficient for most data science roles. Knowing the R programming language can also be an advantage; if you know all three, you have an edge over most competitors. For data science interviews, however, Python and SQL are enough.

7. What are Data Science tools?

There are many data science tools available today, and several can be of great importance. TensorFlow is one of the most famous; other well-known tools include BigML, SAS (Statistical Analysis System), KNIME, scikit-learn, and PyTorch.


  • Privacy Policy

instagram-icon

  • Practice Questions
  • Programming
  • System Design
  • Fast Track Courses
  • Online Interviewbit Compilers
  • Online C Compiler
  • Online C++ Compiler
  • Online Java Compiler
  • Online Javascript Compiler
  • Online Python Compiler
  • Interview Preparation
  • Java Interview Questions
  • Sql Interview Questions
  • Python Interview Questions
  • Javascript Interview Questions
  • Angular Interview Questions
  • Networking Interview Questions
  • Selenium Interview Questions
  • Data Structure Interview Questions
  • System Design Interview Questions
  • Hr Interview Questions
  • Html Interview Questions
  • C Interview Questions
  • Amazon Interview Questions
  • Facebook Interview Questions
  • Google Interview Questions
  • Tcs Interview Questions
  • Accenture Interview Questions
  • Infosys Interview Questions
  • Capgemini Interview Questions
  • Wipro Interview Questions
  • Cognizant Interview Questions
  • Deloitte Interview Questions
  • Zoho Interview Questions
  • Hcl Interview Questions
  • Highest Paying Jobs In India
  • Exciting C Projects Ideas With Source Code
  • Top Java 8 Features
  • Angular Vs React
  • 10 Best Data Structures And Algorithms Books
  • Best Full Stack Developer Courses
  • Python Commands List
  • Maximum Subarray Sum Kadane’s Algorithm
  • Python Cheat Sheet
  • C++ Cheat Sheet
  • Javascript Cheat Sheet
  • Git Cheat Sheet
  • Java Cheat Sheet
  • Data Structure Mcq
  • C Programming Mcq
  • Javascript Mcq

1 Million +

Help | Advanced Search

Computer Science > Machine Learning

Title: paired autoencoders for inverse problems.

Abstract: We consider the solution of nonlinear inverse problems where the forward problem is a discretization of a partial differential equation. Such problems are notoriously difficult to solve in practice and require minimizing a combination of a data-fit term and a regularization term. The main computational bottleneck of typical algorithms is the direct estimation of the data misfit. Therefore, likelihood-free approaches have become appealing alternatives. Nonetheless, difficulties in generalization and limitations in accuracy have hindered their broader utility and applicability. In this work, we use a paired autoencoder framework as a likelihood-free estimator for inverse problems. We show that the use of such an architecture allows us to construct a solution efficiently and to overcome some known open problems when using likelihood-free estimators. In particular, our framework can assess the quality of the solution and improve on it if needed. We demonstrate the viability of our approach using examples from full waveform inversion and inverse electromagnetic imaging.

Submission history

Access paper:.

  • HTML (experimental)
  • Other Formats

license icon

References & Citations

  • Google Scholar
  • Semantic Scholar

BibTeX formatted citation

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

  • Trending Now
  • Foundational Courses
  • Data Science
  • Practice Problem
  • Machine Learning
  • System Design
  • DevOps Tutorial

Welcome to the daily solving of our PROBLEM OF THE DAY with Nitin Kaplas . We will discuss the entire problem step-by-step and work towards developing an optimized solution. This will not only help you brush up on your concepts of Dynamic Programming but also build up problem-solving skills. Given an array a [] of size n , find the length of the longest subsequence such that the absolute difference between adjacent elements is 1.

Input:   n = 7 a[] = {10, 9, 4, 5, 4, 8, 6} Output:   3 Explaination:   The three possible subsequences of length 3 are {10, 9, 8}, {4, 5, 4}, and {4, 5, 6}, where adjacent elements have a absolute difference of 1. No valid subsequence of greater length could be formed.

Give the problem a try before going through the video. All the best!!! Problem Link: https://practice.geeksforgeeks.org/problems/longest-subsequence-such-that-difference-between-adjacents-is-one4724/1

Video Thumbnail

IMAGES

  1. Problems Solved by Data Science

    problem solving data science problems

  2. Data Problem Solving

    problem solving data science problems

  3. How to Manage a Data Science Project for Successful Delivery

    problem solving data science problems

  4. Data Problem Solving

    problem solving data science problems

  5. 39 Best Problem-Solving Examples (2024)

    problem solving data science problems

  6. Data Science problem solving visualization

    problem solving data science problems

VIDEO

  1. SQL Interview Questions- Part 7 by Data Analyst Duo

  2. Python For Data Science || Exam Preparation Part 2 || My Swayam || July 2023

  3. Problem-Solving Skills for Data Scientist✅🔥 #datascience #datascientist

  4. Python For Data Science Week 3 || NPTEL Answers || My Swayam || Jan 2024

  5. Python For Data Science Week 4 || NPTEL Answers || My Swayam || Jan 2024

  6. Python For Data Science Week 1 || NPTEL Answers || My Swayam || July 2023

COMMENTS

  1. The Art of Solving Any Data Science Problem

    Most people interested in data science learn about tools and technology to solve data science problems. They are absolutely necessary to build a solution. But, remember, it is just not enough. ... Problem Definition: The very first step in solving a data science problem is understanding the problem. A framework like First-Principle Thinking and ...

  2. Data Science Case Studies: Solved and Explained

    4 min read. ·. Feb 21, 2021. 1. Solving a Data Science case study means analyzing and solving a problem statement intensively. Solving case studies will help you show unique and amazing data ...

  3. 5 Structured Thinking Techniques for Data Scientists

    Structured thinking is a framework for solving unstructured problems — which covers just about all data science problems. Using a structured approach to solve problems not only only helps solve problems faster but also helps identify the parts of the problem that may need some extra attention. ... The eight disciplines of problem solving ...

  4. Framing Data Science Problems the Right Way From the Start

    Toward Better Problem Definition. Data science uses the scientific method to solve often complex (or multifaceted) and unstructured problems using data and analytics. In analytics, the term fishing expedition refers to a project that was never framed correctly to begin with and involves trolling the data for unexpected correlations.

  5. Doing Data Science: A Framework and Case Study

    Without applications (problems), doing data science would not exist. Our data science framework and research processes are fundamentally tied to practical problem solving and can be used in diverse settings. We provide a case study of using local data to address questions raised by county officials.

  6. Data Science Solutions: Applications and Use Cases

    Data Science is a broad field with many potential applications. It's not just about analyzing data and modeling algorithms, but it also reinvents the way businesses operate and how different departments interact. ... Data scientists solve complex problems every day, leveraging a variety of Data Science solutions to tackle issues like ...

  7. Chapter 1 Problem Solving with Data

    1.1 Introduction. This chapter will introduce you to a general approach to solving problems and answering questions using data. Throughout the rest of the module, we will reference back to this chapter as you work your way through your own data analysis exercises. The approach is applicable to actuaries, data scientists, general data analysts ...

  8. Free Practice Exams

    A selection of practice exams that will test your current data science knowledge. Identify key areas of improvement to strengthen your theoretical preparation, critical thinking, and practical problem-solving skills so you can get one step closer to realizing your professional goals.

  9. Solving Problems with Data Science

    The vast majority of the problems we face at Viget can't or shouldn't be solved by a lone data scientist because we are solving business problems. Our data scientists team up with UXers, designers, developers, project managers, and hardware developers to develop digital strategies and solving data science problems is one part of that ...

  10. 5 Steps on How to Approach a New Data Science Problem

    Step 1: Define the problem. First, it's necessary to accurately define the data problem that is to be solved. The problem should be clear, concise, and measurable. Many companies are too vague when defining data problems, which makes it difficult or even impossible for data scientists to translate them into machine code.

  11. Common Data Science Challenges of 2024 [with Solution]

    Steps on How to Approach and Address a Solution to Data Science Problems. Step 1: Define the Problem. First things first, it is essential to precisely characterize the data issue that has to be addressed. The issue at hand need to be comprehensible, succinct, and quantifiable.

  12. Medium: Problem solving and data analysis

    Unit test. Level up on all the skills in this unit and collect up to 1,000 Mastery points! This unit tackles the medium-difficulty problem solving and data analysis questions on the SAT Math test. Work through each skill, taking quizzes and the unit test to level up your mastery progress.

  13. Solving Data Science Problems

    Solving Data Science Problems. by Morgan Stevens December 16, 2022. Researchers at the University of Hong Kong, Peking University, Stanford University, the University of California, Berkeley, the University of Washington, Carnegie Mellon University, and Meta have created a dataset of 1,000 data science questions from 451 problems found on Stack ...

  14. Advanced: Problem solving and data analysis

    Unit test. Level up on all the skills in this unit and collect up to 1,000 Mastery points! Ready for a challenge? This unit covers the hardest problem solving and data analysis questions on the SAT Math test. Work through each skill, taking quizzes and the unit test to level up your mastery progress.

  15. Data Science Problems

    This will provide students with a real-life data science problem. The chapter presents an in-depth description of the required statistics for data science researchers and IT professionals. A step-by-step approach to problem-solving is presented clearly and accurately to the reader with examples for better understanding. Case studies will assist ...

  16. 33 unusual problems that can be solved with data science

    Help us grow this list of 33 problems, to 100+. The actual number is higher than 33, as I'm adding new entries. Figure 1: related to problem #33. 33 unusual problems that can be solved with data science. Predicting food reserves each year (fish, meat, crops including crop failures caused by diseases or other problems).

  17. Problem-Solving Strategies for Data Engineers

    In this course, data engineer Andreas Kretz takes you through a variety of common problems you may face and shares her problem-solving strategies for typical problems within all phases of engineering projects. Andreas teaches you how to recognize which phase of a data project you're in—planning, design, implementation, and operations—and ...

  18. The Proximal Distance Principle for Constrained Estimation

    May 24, 2024. Statistical methods often involve solving an optimization problem, such as in maximum likelihood estimation and regression. The addition of constraints, either to enforce a hard requirement in estimation or to regularize solutions, complicates matters. Fortunately, the rich theory of convex optimization provides ample tools for ...

  19. Top Data Science Interview Questions and Answers (2024)

    Introduction: Data science is an interdisciplinary field that mines raw data, analyses it, and comes up with patterns that are used to extract valuable insights from it. Statistics, computer science, machine learning, deep learning, data analysis, data visualization, and various other technologies form the core foundation of data science. Over ...

  20. [2405.13220] Paired Autoencoders for Inverse Problems

    We consider the solution of nonlinear inverse problems where the forward problem is a discretization of a partial differential equation. Such problems are notoriously difficult to solve in practice and require minimizing a combination of a data-fit term and a regularization term. The main computational bottleneck of typical algorithms is the direct estimation of the data misfit. Therefore ...

  21. PROBLEM OF THE DAY : 27/05/2024

    Welcome to the daily solving of our PROBLEM OF THE DAY with Nitin Kaplas.We will discuss the entire problem step-by-step and work towards developing an optimized solution. This will not only help you brush up on your concepts of Dynamic Programming but also build up problem-solving skills. Given an array a[] of size n, find the length of the longest subsequence such that the absolute ...