
Data Analysis in Research: Types & Methods

Content Index

  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis
  • What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. The second is summarization and categorization, which together constitute data reduction and help identify patterns and themes in the data for easy identification and linking. The third is the data analysis itself, which researchers perform in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that data analysis and data interpretation together represent the application of deductive and inductive logic to the research data.

Researchers rely heavily on data, as they have a story to tell or research problems to solve. Analysis starts with a question, and data is nothing but the answer to that question. But what if there is no question to ask? It is still possible to explore data without a problem in mind; we call this ‘data mining,’ and it often reveals interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience’s vision guide them in finding the patterns that shape the story they want to tell. One essential expectation of researchers analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, data analysis sometimes tells the most unforeseen yet exciting stories that no one anticipated when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Types of data in research

Every kind of data describes things once a specific value is assigned to it. For analysis, you need to organize these values and process and present them in a given context to make them useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data presented consists of words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion counts as qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be distinguished into categories, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, and the like all yield this type of data. You can present such data in graphical formats or charts, or apply statistical analysis methods to it. Outcomes Measurement Systems (OMS) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: Data presented in groups, where an item included in the categorical data cannot belong to more than one group. Example: a person responding to a survey with their lifestyle, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data; a minimal sketch follows this list.
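As a rough illustration, a chi-square test of independence on categorical survey data might look like the sketch below in Python; the habit categories and contingency-table counts are invented for illustration:

    # Hypothetical example: is smoking habit independent of drinking habit?
    from scipy.stats import chi2_contingency

    observed = [[45, 30],   # smokers: drinkers / non-drinkers
                [25, 50]]   # non-smokers: drinkers / non-drinkers

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
    # A small p-value (e.g., < 0.05) suggests the habits are not independent.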


Data analysis in qualitative research

Data analysis in qualitative research works a little differently than with numerical data, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complex information is an involved process; hence, qualitative data is typically used for exploratory research and data analysis.

Finding patterns in the qualitative data

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual: researchers usually read the available data and identify repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find “food” and “hunger” to be the most commonly used words and will highlight them for further analysis.
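A minimal sketch of this word-counting step follows; the open-ended responses and the stop-word list are invented, not real survey data:

    import re
    from collections import Counter

    responses = [
        "Hunger and food insecurity are the biggest problems here.",
        "Access to food is difficult; hunger affects the children most.",
        "Clean water and food shortages worry everyone.",
    ]

    stop_words = {"and", "are", "the", "to", "is", "most", "here"}

    words = []
    for text in responses:
        tokens = re.findall(r"[a-z]+", text.lower())  # lowercase alphabetic tokens
        words.extend(t for t in tokens if t not in stop_words)

    # "food" and "hunger" surface as the most common words for follow-up
    print(Counter(words).most_common(5))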


Keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which participants use a particular keyword.

For example, researchers conducting research and data analysis on the concept of ‘diabetes’ among respondents might analyze the context in which respondents use or refer to the word ‘diabetes.’
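A rough keyword-in-context sketch is shown below; the interview excerpts and the helper function are invented for illustration:

    def keyword_in_context(texts, keyword, window=4):
        """Print a window of words around each occurrence of the keyword."""
        for text in texts:
            tokens = text.lower().split()
            for i, token in enumerate(tokens):
                if keyword in token:
                    left = " ".join(tokens[max(0, i - window):i])
                    right = " ".join(tokens[i + 1:i + 1 + window])
                    print(f"...{left} [{token}] {right}...")

    interviews = [
        "My mother was diagnosed with diabetes two years ago.",
        "I worry that diabetes runs in the family.",
    ]
    keyword_in_context(interviews, "diabetes")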

The scrutiny-based technique is another highly recommended text analysis method used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique, differentiating how a specific text is similar to or different from others.

For example, to gauge the “importance of a resident doctor in a company,” the collected data is divided into responses from people who think it is necessary to hire a resident doctor and from those who think it is unnecessary. Compare and contrast is the best method for analyzing polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable partitioning is another technique used to split variables so that researchers can derive more coherent descriptions and explanations from enormous amounts of data.


Methods used for data analysis in qualitative research

There are several techniques for analyzing data in qualitative research; here are some commonly used methods.

  • Content Analysis: This is the most widely accepted and frequently employed technique for data analysis in research methodology. It can be used to analyze documented information in text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources, such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions shared by people are examined to find answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. However, this particular method considers the social context in which the communication between researcher and respondent takes place. Discourse analysis also factors in the respondent’s lifestyle and day-to-day environment when deriving conclusions.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, grounded theory is the best resort for analyzing qualitative data. It is applied to study data about a host of similar cases occurring in different settings. Researchers using this method may alter their explanations or produce new ones until they arrive at a conclusion.


Data analysis in quantitative research

Preparing data for analysis

The first stage of quantitative research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets pre-set standards or is a biased sample. It is divided into four stages (a minimal completeness check is sketched after this list):

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent answered all the questions in an online survey or, in an interview, that the interviewer asked every question devised in the questionnaire.
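A minimal sketch of the completeness stage, assuming the responses sit in a pandas DataFrame with one row per respondent (the column names are invented):

    import pandas as pd

    responses = pd.DataFrame({
        "respondent_id": [1, 2, 3],
        "q1_age": [34, 51, None],
        "q2_satisfaction": [4, 5, 3],
    })

    required = ["q1_age", "q2_satisfaction"]

    # Flag respondents who skipped any required question
    incomplete = responses[responses[required].isna().any(axis=1)]
    print(f"{len(incomplete)} incomplete response(s):",
          incomplete["respondent_id"].tolist())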

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in fields incorrectly or skip them accidentally. Data editing is the process wherein researchers confirm that the provided data is free of such errors. They conduct the necessary range and outlier checks to edit the raw data and make it ready for analysis.
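The range and outlier checks mentioned above might look like this minimal pandas sketch; the columns, values, and thresholds are invented:

    import pandas as pd

    df = pd.DataFrame({
        "age": [25, 34, 210, 41, 38, 29, 55, 47, 33, 62, 58, 44],
        "income": [48_000, 52_000, 50_500, 49_000, 51_200, 47_800,
                   50_000, 53_000, 49_500, 52_500, 48_700, 1_000_000],
    })

    # Range check: ages outside a plausible interval need review or correction
    print("implausible ages:",
          df.loc[(df["age"] < 0) | (df["age"] > 120), "age"].tolist())

    # Outlier check: incomes more than 3 standard deviations from the mean
    z = (df["income"] - df["income"].mean()) / df["income"].std()
    print("income outliers:", df.loc[z.abs() > 3, "income"].tolist())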

Phase III: Data Coding

Out of all three phases, this is the most critical one, associated with grouping and assigning values to the survey responses. If a survey is completed with a sample size of 1,000, the researcher might create age brackets to distinguish respondents by age; it then becomes easier to analyze small data buckets rather than deal with the massive data pile (a bracketing sketch follows).
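A minimal sketch of such bracketing with pandas; the ages, bracket edges, and labels are invented:

    import pandas as pd

    ages = pd.DataFrame({"age": [19, 25, 33, 41, 58, 67]})
    ages["bracket"] = pd.cut(
        ages["age"],
        bins=[17, 24, 34, 44, 54, 64, 120],
        labels=["18-24", "25-34", "35-44", "45-54", "55-64", "65+"],
    )
    # Count respondents per bracket instead of working with raw ages
    print(ages.groupby("bracket", observed=True).size())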


Methods used for data analysis in quantitative research

After the data is prepared for analysis, researchers are open to using different research and data analysis methods to derive meaningful insights. Statistical analysis is by far the most favored approach for numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. Statistical methods fall into two groups: descriptive statistics, used to describe the data, and inferential statistics, which help in comparing the data and generalizing beyond it.

Descriptive statistics

This method is used to describe the basic features of many types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not allow conclusions beyond the data at hand; any conclusions remain tied to the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to demonstrate the central point of a distribution.
  • Researchers use this method when they want to showcase the most common or the average response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • Range = the difference between the highest and lowest scores.
  • Variance and standard deviation capture the average difference between observed scores and the mean.
  • It is used to identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is and how strongly that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores that help researchers identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.
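All four families of descriptive measures can be computed in a few lines; the sketch below does so on an invented set of survey scores:

    import numpy as np
    import pandas as pd

    scores = pd.Series([72, 85, 90, 85, 64, 78, 85, 91, 70, 88])

    # Measures of frequency: how often each score occurs
    print(scores.value_counts().head())

    # Measures of central tendency
    print("mean:", scores.mean(), "median:", scores.median(),
          "mode:", scores.mode().tolist())

    # Measures of dispersion or variation
    print("range:", scores.max() - scores.min(),
          "variance:", round(scores.var(), 1),
          "std dev:", round(scores.std(), 1))

    # Measures of position
    print("quartiles:", np.percentile(scores, [25, 50, 75]))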

In quantitative research, descriptive analysis often gives absolute numbers, but those numbers alone are rarely sufficient to explain the rationale behind them. Nevertheless, it is necessary to choose the method of research and data analysis best suited to your survey questionnaire and to the story researchers want to tell. For example, the mean is the best way to demonstrate students’ average scores in a school. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided sample without generalizing it: for example, when you want to compare the average votes cast in two different cities, descriptive statistics are enough.

Descriptive analysis is also called ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample that represents that population. For example, you can ask a hundred-odd audience members at a movie theater whether they like the movie they are watching. Researchers would then use inferential statistics on the collected sample to reason that about 80-90% of people like the movie.

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and uses them to demonstrate something about the population parameter.
  • Hypothesis testing: It is about using sample research data to answer the survey research questions. For example, researchers might be interested in understanding whether a newly launched shade of lipstick is doing well, or whether multivitamin capsules help children perform better at games. A minimal sketch of both areas follows this list.
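The sketch below reuses the movie-theater example, with an invented count of 87 positive responses out of 100:

    from statsmodels.stats.proportion import proportion_confint, proportions_ztest

    liked, n = 87, 100

    # Estimating parameters: 95% confidence interval for the population share
    low, high = proportion_confint(liked, n, alpha=0.05)
    print(f"estimated share: {liked / n:.0%} (95% CI {low:.0%} to {high:.0%})")

    # Hypothesis test: is the true share greater than 80%?
    stat, p_value = proportions_ztest(liked, n, value=0.80, alternative="larger")
    print(f"z = {stat:.2f}, p = {p_value:.4f}")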

Inferential methods are sophisticated analysis techniques used to showcase the relationship between different variables rather than to describe a single variable. They are used when researchers want something beyond absolute numbers: an understanding of how variables relate to one another.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but want to understand the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables, cross-tabulation is used to analyze the relationship between multiple variables. Suppose the provided data has age and gender categories presented in rows and columns; a two-dimensional cross-tabulation makes for seamless data analysis and research by showing the number of males and females in each age category (see the sketch after this list).
  • Regression analysis: For understanding how strongly two variables are related, researchers rarely look beyond the primary and most commonly used regression analysis method, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable and one or more independent variables, and you work out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to be ascertained in an error-free, random manner (see the sketch after this list).
  • Frequency tables: A frequency table records how often each value or category of a variable occurs, making it easy to identify the most and least common responses before applying further tests.
  • Analysis of variance (ANOVA): This statistical procedure tests the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation between groups suggests that the research findings are significant. In many contexts, ANOVA testing and variance analysis are used as similar terms.
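Minimal sketches of cross-tabulation, regression, and one-way ANOVA follow; all values are invented, simulated data:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from scipy.stats import f_oneway

    # Cross-tabulation: respondents by age bracket and gender
    survey = pd.DataFrame({
        "gender": ["M", "F", "F", "M", "F", "M"],
        "bracket": ["18-24", "18-24", "25-34", "25-34", "25-34", "35-44"],
    })
    print(pd.crosstab(survey["bracket"], survey["gender"]))

    # Regression: impact of an independent variable (ad spend) on a
    # dependent variable (sales), using noisy simulated data
    rng = np.random.default_rng(0)
    ad_spend = rng.uniform(10, 100, size=50)
    sales = 3.0 * ad_spend + rng.normal(0, 15, size=50)
    model = sm.OLS(sales, sm.add_constant(ad_spend)).fit()
    print(model.params)  # [intercept, slope]; slope should be near 3.0

    # Analysis of variance: do three groups differ in mean score?
    group_a = rng.normal(70, 5, 20)
    group_b = rng.normal(75, 5, 20)
    group_c = rng.normal(71, 5, 20)
    f_stat, p_value = f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")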
Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data, and they should be trained to demonstrate a high standard of research practice. Ideally, researchers should possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Research and data analytics projects usually differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps in designing the survey questionnaire, selecting data collection methods, and choosing samples.


  • The primary aim of research data analysis is to derive unbiased, ultimate insights. Any mistake in, or bias while, collecting data, selecting an analysis method, or choosing an audience sample will lead to a biased inference.
  • No amount of sophistication in research data and analysis can rectify poorly defined objectives or outcome measurements. Whether the design is at fault or the intentions are unclear, a lack of clarity can mislead readers, so avoid that practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find ways to deal with everyday challenges such as outliers, missing data, data alteration, data mining, and graphical representation.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018 alone, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in the hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and gives them a medium to collect data by creating appealing surveys.

Data Analysis: Recently Published Documents


Introduce a Survival Model with Spatial Skew Gaussian Random Effects and its Application in Covid-19 Data Analysis

Futuristic Prediction of Missing Value Imputation Methods Using Extended ANN

Missing data is a universal complication across most research fields, introducing uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simple gaps in the study. The nourishment area is no exception to the difficulty of missing data. Most frequently, the difficulty is addressed by imputing means or medians from the existing datasets, an approach that needs improvement. The paper proposes a hybrid scheme of MICE and ANN, known as extended ANN, to search for and analyze missing values and perform imputations in a given dataset. The proposed mechanism efficiently analyzes blank entries and fills them by properly examining neighboring records in order to improve the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against various recent algorithms and mechanisms to analyze both the efficiency and the accuracy of the results.
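The abstract's extended-ANN hybrid is not publicly specified here, but the MICE half it builds on can be illustrated with scikit-learn's IterativeImputer; the small numeric dataset below is invented:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Invented dataset with gaps (np.nan marks missing entries)
    X = np.array([
        [1.0, 2.0, 3.0],
        [4.0, np.nan, 6.0],
        [7.0, 8.0, np.nan],
        [10.0, 11.0, 12.0],
    ])

    # MICE-style imputation: each column is modeled from the others
    imputer = IterativeImputer(random_state=0)
    print(imputer.fit_transform(X))  # gaps filled with model-based estimates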

Applications of Multivariate Data Analysis in Shelf Life Studies of Edible Vegetal Oils: A Review of the Past Few Years

Hypothesis Formalization: Empirical Findings, Software Limitations, and Design Implications

Data analysis requires translating higher-level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization. In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation and discuss implications for future tools.

The Complexity and Expressive Power of Limit Datalog

Motivated by applications in declarative data analysis, in this article, we study DatalogZ, an extension of Datalog with stratified negation and arithmetic functions over integers. This language is known to be undecidable, so we present the fragment of limit DatalogZ programs, which is powerful enough to naturally capture many important data analysis tasks. In limit DatalogZ, all intensional predicates with a numeric argument are limit predicates that keep maximal or minimal bounds on numeric values. We show that reasoning in limit DatalogZ is decidable if a linearity condition restricting the use of multiplication is satisfied. In particular, limit-linear DatalogZ is complete for Δ2EXP and captures Δ2P over ordered datasets in the sense of descriptive complexity. We also provide a comprehensive study of several fragments of limit-linear DatalogZ. We show that semi-positive limit-linear programs (i.e., programs where negation is allowed only in front of extensional atoms) capture coNP over ordered datasets; furthermore, reasoning becomes coNEXP-complete in combined and coNP-complete in data complexity, where the lower bounds hold already for negation-free programs. In order to satisfy the requirements of data-intensive applications, we also propose an additional stability requirement, which causes the complexity of reasoning to drop to EXP in combined and to P in data complexity, thus obtaining the same bounds as for usual Datalog. Finally, we compare our formalisms with the languages underpinning existing Datalog-based approaches for data analysis and show that core fragments of these languages can be encoded as limit programs; this allows us to transfer decidability and complexity upper bounds from limit programs to other formalisms. Therefore, our article provides a unified logical framework for declarative data analysis which can be used as a basis for understanding the impact on expressive power and computational complexity of the key constructs available in existing languages.

An Empirical Study on Cross-Border E-commerce Talent Cultivation Based on Skill Gap Theory and Big Data Analysis

To resolve the dilemma between the increasing demand for cross-border e-commerce talent and students’ incompatible skill levels, Industry-University-Research cooperation, an essential pillar of the interdisciplinary talent cultivation model adopted by colleges and universities, brings out synergy from the relevant parties and builds a bridge between knowledge and practice. Nevertheless, Industry-University-Research cooperation developed late in the cross-border e-commerce field and faces several problems, such as unstable collaboration relationships and vague training plans.

The Effects of Cross-border e-Commerce Platforms on Transnational Digital Entrepreneurship

This research examines the important concept of transnational digital entrepreneurship (TDE). The paper integrates the host- and home-country entrepreneurial ecosystems with the digital ecosystem into the framework of the transnational digital entrepreneurial ecosystem. The authors argue that cross-border e-commerce platforms provide critical foundations in the digital entrepreneurial ecosystem, and entrepreneurs who count on this ecosystem are defined as transnational digital entrepreneurs. Interview data were analyzed as case studies to develop an understanding of twelve Chinese immigrant entrepreneurs living in Australia and New Zealand. The results of the data analysis reveal that cross-border entrepreneurs do in fact rely on the framework of the transnational digital ecosystem. Cross-border e-commerce platforms not only play a bridging role between home- and host-country ecosystems but also provide the entrepreneurial capital promised by the digital ecosystem.

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources

A Trajectory Evaluator by Sub-tracks for Detecting VOT-Based Anomalous Trajectory

With the popularization of visual object tracking (VOT), more and more trajectory data are obtained and have begun to gain widespread attention in fields such as mobile robots and intelligent video surveillance. How to clean the anomalous trajectories hidden in massive data has become one of the research hotspots. Anomalous trajectories should be detected and cleaned before the trajectory data can be effectively used. In this article, a Trajectory Evaluator by Sub-tracks (TES) for detecting VOT-based anomalous trajectories is proposed. A Feature of Anomalousness is defined and described as the eigenvector of a classifier to filter tracklet anomalous trajectories and identity-switch anomalous trajectories; it includes a Feature of Anomalous Pose and a Feature of Anomalous Sub-tracks (FAS). In comparative experiments, TES achieves better results across different scenes than state-of-the-art methods. Moreover, FAS performs better than point flow, least-squares fitting, and Chebyshev polynomial fitting. The results verify that TES is more accurate and effective and is conducive to sub-track trajectory data analysis.



Open Access

Principles for data analysis workflows

Sara Stoudt, Váleri N. Vásquez, and Ciera C. Martinez

Sara Stoudt and Váleri N. Vásquez contributed equally to this work.

Affiliations: Berkeley Institute for Data Science, University of California Berkeley, Berkeley, California, United States of America (all authors); Statistical & Data Sciences Program, Smith College, Northampton, Massachusetts, United States of America (SS); Energy and Resources Group, University of California Berkeley (VNV); Department of Molecular and Cellular Biology, University of California Berkeley (CCM)

* E-mail: [email protected]

Published: March 18, 2021

  • https://doi.org/10.1371/journal.pcbi.1008770


A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students new to research and current researchers who are new to data-intensive work.

Citation: Stoudt S, Vásquez VN, Martinez CC (2021) Principles for data analysis workflows. PLoS Comput Biol 17(3): e1008770. https://doi.org/10.1371/journal.pcbi.1008770

Editor: Patricia M. Palagi, SIB Swiss Institute of Bioinformatics, SWITZERLAND

Copyright: © 2021 Stoudt et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: SS was supported by the National Physical Sciences Consortium ( https://stemfellowships.org/ ) fellowship. SS, VNV, and CCM were supported by the Gordon & Betty Moore Foundation ( https://www.moore.org/ ) (GBMF3834) and Alfred P. Sloan Foundation ( https://sloan.org/ ) (2013-10-27) as part of the Moore-Sloan Data Science Environments. CCM holds a Postdoctoral Enrichment Program Award from the Burroughs Wellcome Fund ( https://www.bwfund.org/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Both traditional science fields and the humanities are becoming increasingly data driven and computational. Researchers who may not identify as data scientists are working with large and complex data on a regular basis. A systematic and reproducible research workflow —the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of data-intensive research practice in any academic discipline. The importance and effective development of a workflow should, in turn, be a cornerstone of the data science education designed to prepare researchers across disciplinary specializations.

Data science education tends to review foundational statistical analysis methods [ 1 ] and furnish training in computational tools , software, and programming languages. In scientific fields, education and training includes a review of domain-specific methods and tools, but generally omits guidance on the coding practices relevant to developing new analysis software—a skill of growing relevance in data-intensive scientific fields [ 2 ]. Meanwhile, the holistic discussion of how to develop and pursue a research workflow is often left out of introductions to both data science and disciplinary science. Too frequently, students and academic practitioners of data-intensive research are left to learn these essential skills on their own and on the job. Guidance on the breadth of potential products that can emerge from research is also lacking. In the interest of both reproducible science (providing the necessary data and code to recreate the results) and effective career building, researchers should be primed to regularly generate outputs over the course of their workflow.

The goal of this paper is to deconstruct an academic data-intensive research project, demonstrating how both design principles and software development methods can motivate the creation and standardization of practices for reproducible data and code. The implementation of such practices generates research products that can be effectively communicated, in addition to constituting a scientific contribution. Here, “data-intensive” research is used interchangeably with “data science” in a recognition of the breadth of domain applications that draw upon computational analysis methods and workflows. (We define other terms we’ve bolded throughout this paper in Box 1 ). To be useful, let alone high impact, research analyses should be contextualized in the data processing decisions that led to their creation and accompanied by a narrative that explains why the rest of the world should be interested. One way of thinking about this is that the scientific method should be tangibly reflected, and feasibly reproducible, in any data-intensive research project.

Box 1. Terminology

This box provides definitions for terms in bold throughout the text. Terms are sorted alphabetically and cross referenced where applicable.

Agile: An iterative software development framework which adheres to the principles described in the Manifesto for Agile software development [ 35 ] (e.g., breaks up work into small increments).

Accessor function: A function that returns the value of a variable (synonymous term: getter function).

Assertion: An expression that is expected to be true at a particular point in the code.

Computational tool: May include libraries, packages, collections of functions, and/or data structures that have been consciously designed to facilitate the development and pursuit of data-intensive questions (synonymous term: software tool).

Continuous integration: Automatic tests that are run against code each time it is updated.

Gut check: Also “data gut check.” Quick, broad, and shallow testing [ 48 ] before and during data analysis. Although this is usually described in the context of software development, the concept of a data-specific gut check can include checking the dimensions of data structures after merging or assessing null values/missing values, zero values, negative values, and ranges of values to see if they make sense (synonymous words: smoke test, sanity check [ 49 ], consistency check, sniff test, soundness check).

Data-intensive research: Research that is centrally based on the analysis of data and its structural or statistical properties. May include but is not limited to research that hinges on large volumes of data or a wide variety of data types requiring computational skills to approach such research (synonymous term: data science research). “Data science” as a stand-alone term may also refer more broadly to the use of computational tools and statistical methods to gain insights from digitized information.

Data structure: A format for storing data values and definition of operations that can be applied to data of a particular type.

Defensive programming: Strategies to guard against failures or bugs in code; this includes the use of tests and assertions.

Design thinking: The iterative process of defining a problem then identifying and prototyping potential solutions to that problem, with an emphasis on solutions that are empathetic to the particular needs of the target user.

Docstring: A code comment for a particular line of code that describes what a function does, as opposed to how the function performs that operation.

DOI: A digital object identifier or DOI is a unique handle, standardized by the International Organization for Standardization (ISO), that can be assigned to different types of information objects.

Extensibility: The flexibility to be extended or repurposed in a new scenario.

Function: A piece of more abstracted code that can be reused to perform the same operation on different inputs of the same type and has a standardized output [ 50 – 52 ].

Getter function: Another term for an accessor function.

Integrated Development Environment (IDE): A software application that facilitates software development and minimally consists of a source code editor, build automation tools, and a debugger.

Modularity: An ability to separate different functionality into stand-alone pieces.

Mutator method: A function used to control changes to variables. See “setter function” and “accessor function.”

Notebook: A computational or physical place to store details of a research process including decisions made.

Mechanistic code: Code used to perform a task as opposed to conduct an analysis. Examples include processing functions and plotting functions.

Overwrite: The process, intentional or accidental, of assigning new values to existing variables.

Package manager: A system used to automate the installation and configuration of software.

Pipeline: A series of programmatic processes during data analysis and data cleaning, usually linear in nature, that can be automated and usually be described in the context of inputs and outputs.

Premature optimization: Focusing on details before the general scheme is decided upon.

Refactoring: A change in code, such as file renaming, to make it more organized without changing the overall output or behavior.

Replicable: A new study arrives at the same scientific findings as a previous study, collecting new data (with the same or different methods) and completes new analyses [ 53 – 55 ].

Reproducible: Authors provide all the necessary data, and the computer codes to run the analysis again, recreating the results [ 53 – 55 ].

Script: A collection of code, ideally related to one particular step in the data analysis.

Setter function: A type of function that controls changes to variables. It is used to directly access and alter specific values (synonymous term: mutator method).

Serialization: The process of saving data structures, inputs and outputs, and experimental setups generally in a storable, shareable format. Serialized information can be reconstructed in different computer environments for the purpose of replicating or reproducing experiments.

Software development: A process of writing and documenting code in pursuit of an end goal, typically focused on process over analysis.

Source code editor: A program that facilitates changes to code by an author.

Technical debt: The extra work you defer by pursuing an easier, yet not ideal solution, early on in the coding process.

Test-driven development: Each change in code should be verified against tests to prove its functionality.

Unit test: A code test for the smallest chunk of code that is actually testable.

Version control: A way of managing changes to code or documentation that maintains a record of changes over time.

White paper: An informative, at least semiformal document that explains a particular issue but is not peer reviewed.

Workflow: The process that moves a scientific investigation from raw data to coherent research question to insightful contribution. This often involves a complex series of processes and includes a mixture of machine automation and human intervention. It is a nonlinear and iterative exercise.

Discussions of “workflow” in data science can take on many different meanings depending on the context. For example, the term “workflow” often gets conflated with the term “pipeline” in the context of software development and engineering. Pipelines are often described as a series of processes that can be programmatically defined and automated and explained in the context of inputs and outputs. However, in this paper, we offer an important distinction between pipelines and workflows: The former refers to what a computer does, for example, when a piece of software automatically runs a series of Bash or R scripts. For the purpose of this paper, a workflow describes what a researcher does to make advances on scientific questions: developing hypotheses, wrangling data, writing code, and interpreting results.

Data analysis workflows can culminate in a number of outcomes that are not restricted to the traditional products of software engineering (software tools and packages) or academia (research papers). Rather, the workflow that a researcher defines and iterates over the course of a data science project can lead to intellectual contributions as varied as novel data sets, new methodological approaches, or teaching materials in addition to the classical tools, packages, and papers. While the workflow should be designed to serve the researcher and their collaborators, maintaining a structured approach throughout the process will inform results that are replicable (see replicable versus reproducible in Box 1 ) and easily translated into a variety of products that furnish scientific insights for broader consumption.

In the following sections, we explain the basic principles of a constructive and productive data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Where relevant, we draw analogies to the realm of design thinking and software development . While the 3 phases described here are not intended to be a strict rulebook, we hope that the many references to additional resources—and suggestions for nontraditional research products—provide guidance and support for both students new to research and current researchers who are new to data-intensive work.

The Explore, Refine, Produce (ERP) workflow for data-intensive research

We partition the workflow of a data-intensive research process into 3 phases: Explore, Refine, and Produce. These phases, collectively the ERP workflow, are visually described in Fig 1A and 1B . In the Explore Phase, researchers “meet” their data: process it, interrogate it, and sift through potential solutions to a problem of interest. In the Refine Phase, researchers narrow their focus to a particularly promising approach, develop prototypes, and organize their code into a clearer narrative. The Produce Phase happens concurrently with the Explore and Refine Phases. In this phase, researchers prepare their work for broader consumption and critique.


(A) We deconstruct a data-intensive research project into 3 phases, visualizing this process as a tree structure. Each branch in the tree represents a decision that needs to be made about the project, such as data cleaning, refining the scope of the research, or using a particular tool or model. Throughout the natural life of a project, there are many dead ends (yellow Xs). These may include choices that do not work, such as experimentation with a tool that is ultimately not compatible with our data. Dead ends can result in informal learning or procedural fine-tuning. Some dead ends that lie beyond the scope of our current project may turn into a new project later on (open turquoise circles). Throughout the Explore and Refine Phases, we are concurrently in the Produce Phase because research products (closed turquoise circles) can arise at any point throughout the workflow. Products, regardless of the phase that generates their content, contribute to scientific understanding and advance the researcher’s career goals. Thus, the data-intensive research portfolio and corresponding academic CV can be grown at any point in the workflow. (B) The ERP workflow as a nonlinear cycle. Although the tree diagram displayed in Fig 1A accurately depicts the many choices and dead ends that a research project contains, it does not as easily reflect the nonlinearity of the process; Fig 1B’s representation aims to fill this gap. We often iterate between the Explore and Refine Phases while concurrently contributing content to the Produce Phase. The time spent in each phase can vary significantly across different types of projects. For example, hypothesis generation in the Explore Phase might be the biggest hurdle in one project, while effectively communicating a result to a broader audience in the Produce Phase might be the most challenging aspect of another project.

https://doi.org/10.1371/journal.pcbi.1008770.g001

Each phase has an immediate audience—the researcher themselves, their collaborative groups, or the public—that broadens progressively and guides priorities. Each of the 3 phases can benefit from standards that the software development community uses to streamline their code-based pipelines, as well as from principles the design community uses to generate and carry out ideas; many such practices can be adapted to help structure a data-intensive researcher’s workflow. The Explore and Refine Phases provide fodder for the concurrent Produce Phase. We hope that the potential to produce a variety of research products throughout a data-intensive research process, rather than merely at the end of a project, motivates researchers to apply the ERP workflow.

Phase 1: Explore

Data-intensive research projects typically start with a domain-specific question or a particular data set to explore [ 3 ]. There is no fixed, cross-disciplinary rule that defines the point in a workflow by which a hypothesis must be established. This paper adopts an open-minded approach concerning the timing of hypothesis generation [ 4 ], assuming that data-intensive research projects can be motivated by either an explicit, preexisting hypothesis or a new data set about which no strong preconceived assumptions or intuitions exist. The often messy Explore Phase is rarely discussed as an explicit step of the methodological process, but it is an essential component of research: It allows us to gain intuition about our data, informing future phases of the workflow. As we explore our data, we refine our research question and work toward the articulation of a well-defined problem. The following section will address how to reap the benefits of data set and problem space exploration and provide pointers on how to impose structure and reproducibility during this inherently creative phase of the research workflow.

Designing data analysis: Goals and standards of the Explore Phase

Trial and error is the hallmark of the Explore Phase (note the density of “dead ends” and decisions made in this phase in Fig 1A). In “Designerly Ways of Knowing” [ 5 ], the design process is described as a “co-evolution of solution and problem spaces.” Like designers, data-intensive researchers explore the problem space, learn about the potential structure of the solution space, and iterate between the 2 spaces. Importantly, the difficulties we encounter in this phase help us build empathy for an eventual audience beyond ourselves. It is here that we experience firsthand the challenges of processing our data set, framing domain research questions appropriate to it, and structuring the beginnings of a workflow. Documenting our trial and error helps our own work stay on track in addition to assisting future researchers facing similar challenges.

One end goal of the Explore Phase is to determine whether new questions of interest might be answered by leveraging existing software tools (either off the shelf or with minor adjustments), rather than building new computational capabilities ourselves. For example, during this phase, a common activity includes surveying the software available for our data set or problem space and estimating its utility for the unique demands of our current analysis. Through exploration, we learn about relevant computational and analysis tools while concurrently building an understanding of our data.

A second important goal of the Explore Phase is data cleaning and developing a strategy to analyze our data. This is a dynamic process that often goes hand in hand with improving our understanding of the data. During the Explore Phase, we redesign and reformat data structures, identify important variables, remove redundancies, take note of missing information, and ponder outliers in our data set. Once we have established the software tools—the programming language, data analysis packages, and a handful of the useful functions therein—that are best suited to our data and domain area, we also start putting those tools to use [ 6 ]. In addition, during the Explore Phase, we perform initial tests, build a simple model, or create some basic visualizations to better grasp the contents of our data set and check for expected outputs. Our research is underway in earnest now, and this effort will help us to identify what questions we might be able to ask of our data.

The Explore Phase is often a solo endeavor; as shown in Fig 1A , our audience is typically our current or future self. This can make navigating the phase difficult, especially for new researchers. It also complicates a third goal of this phase: documentation. In this phase, we ourselves are our only audience, and if we are not conscientious documenters, we can easily end up concluding the phase without the ability to coherently describe our research process up to that point. Record keeping in the Explore Phase is often subject to our individual style of approaching problems. Some styles work in real time, subsetting or reconfiguring data as ideas occur. More methodical styles tend to systematically plan exploratory steps, recording them before taking action. These natural tendencies impact the state of our analysis code, affecting its readability and reproducibility.

However, there are strategies—inspired by analogous software development principles—that can help set us up for success in meeting the standards of reproducibility [ 7 ] relevant to a scientifically sound research workflow. These strategies impose a semblance of order on the Explore Phase. To avoid concerns of premature optimization [ 8 ] while we are iterating during this phase, documentation is the primary goal, rather than fine-tuning the code structure and style. Documentation enables the traceability of a researcher’s workflow, such that all efforts are replicable and final outcomes are reproducible.

Analogies to software development in the Explore Phase

Documentation: Code and process.

Software engineers typically value formal documentation that is readable by software users. While the audience for our data analysis code may not be defined as a software user per se, documentation is still vital for workflow development. Documentation for data analysis workflows can come in many forms, including comments describing individual lines of code, README files orienting a reader within a code repository, descriptive commit history logs tracking the progress of code development, docstrings detailing function capabilities, and vignettes providing example applications. Documentation provides both a user manual for particular tools within a project (for example, data cleaning functions), and a reference log describing scientific research decisions and their rationale (for example, the reasons behind specific parameter choices).

In the Explore Phase, we may identify with the type of programmer described by Brandt and colleagues as “opportunistic” [ 9 ]. This type of programmer finds it challenging to prioritize documenting and organizing code that they see as impermanent or a work in progress. “Opportunistic” programmers tend to build code using others’ tools, focusing on writing “glue” code that links preexisting components, and iterating quickly. Hartmann and colleagues also describe this mash-up approach [ 10 ]. Rather than “opportunistic programmers,” their study focuses on “opportunistic designers.” This style of design “search[es] for bridges,” finding connections between what first appear to be different fields. Data-intensive researchers often use existing tools to answer questions of interest; we tend to build our own only when needed.

Even if the code that is used for data exploration is not developed into a software-based final research product, the exploratory process as a whole should exist as a permanent record: Future scientists should be able to rerun our analysis and work from where we left off, beginning from raw, unprocessed data. Therefore, documenting choices and decisions we make along the way is crucial to making sure we do not forget any aspect of the analysis workflow, because each choice may ultimately impact the final results. For example, if we remove some data points from our analyses, we should know which data points we removed—and our reason for removing them—and be able to communicate those choices when we start sharing our work with others. This is an important argument against ephemerally conducting our data analysis work via the command line.

Instead of the command line, tools like a computational notebook [ 11 ] can help capture a researcher’s decision-making process in real time [ 12 ]. A computational notebook where we never delete code, and—to avoid overwriting named variables—only move forward in our document, could act as “version control designed for a 10-minute scale” that Brandt and colleagues found might help the “opportunistic” programmer. More recent advances in this area include the reactive notebook [ 13 – 14 ]. Such tools assist documentation while potentially enhancing our creativity during the Explore Phase. The bare minimum documentation of our Explore Phase might therefore include such a notebook or an annotated script [ 15 ] to record all analyses that we perform and code that we write.

To go a step beyond annotated scripts or notebooks, researchers might employ a version control system such as Git. With its issues, branches, and informative commit messages, Git is another useful way to maintain a record of our trial-and-error process and track which files are progressing toward which goals of the overall project. Using Git together with a public online hosting service such as GitHub allows us to share our work with collaborators and the public in real time, if we so choose.

A researcher dedicated to conducting an even more thoroughly documented Explore Phase may take Ford’s advice and include notes that explicitly document our stream of consciousness [ 16 ]. Our notes should be able to efficiently convey what failed, what worked but was uninteresting or beyond scope of the project, and what paths of inquiry we will continue forward with in more depth ( Fig 1A ). In this way, as we transition from the Explore Phase to the Refine Phase, we will have some signposts to guide our way.

Testing: Comparing expectations to output.

As Ford [ 16 ] explains, we face competing goals in the Explore Phase: We want to get results quickly, but we also want to be confident in our answers. Her strategy is to focus on documentation over tests for one-off analyses that will not form part of a larger research project. However, the complete absence of formal tests may raise a red flag for some data scientists used to the concept of test-driven development . This is a tension between the code-based work conducted in scientific research versus software development: Tests help build confidence in analysis code and convince users that it is reliable or accurate, but tests also imply finality and take time to write that we may not be willing to allocate in the experimental Explore Phase. However, software development style tests do have useful analogs in data analysis efforts: We can think of tests, in the data analysis sense, as a way of checking whether our expectations match the reality of a piece of code’s output.

Imagine we are looking at a data set for the first time. What weird things can happen? The type of variable might not be what we expect (for example, the integer 4 instead of the float 4.0). The data set could also include unexpected aspects (for example, dates formatted as strings instead of numbers). The amount of missing data may be larger than we thought, and this missingness could be coded in a variety of ways (for example, as a NaN, NULL, or −999). Finally, the dimensions of a data frame after merging or subsetting it for data cleaning may not match our expectations. Such gaps in expectation versus reality are “silent faults” [ 17 ]. Without checking for them explicitly, we might proceed with our analysis unaware that anything is amiss and encode that error in our results.

For these reasons, every data exploration should include quantitative and qualitative “gut checks” [ 18 ] that can help us diagnose an expectation mismatch as we go about examining and manipulating our data. We may check assumptions about data quality such as the proportion of missing values, verify that a joined data set has the expected dimensions, or ascertain the statistical distributions of well-known data categories. In this latter case, having domain knowledge can help us understand what to expect. We may want to compare 2 data sets (for example, pre- and post-processed versions) to ensure they are the same [ 19 ]; we may also evaluate diagnostic plots to assess a model’s goodness of fit. Each of the elements that gut checks help us monitor will impact the accuracy and direction of our future analyses.
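For instance, a few such gut checks might be written as follows in Julia; here `df` and `merged` stand in for data frames produced by our own pipeline, and the column name `dose` is hypothetical:

```julia
using DataFrames

# Gut check 1: is the proportion of missing values in line with expectations?
prop_missing = count(ismissing, df.dose) / nrow(df)
prop_missing < 0.1 || @warn "More missingness than expected" prop_missing

# Gut check 2: are sentinel codes such as -999 hiding among the values?
any(==(-999), skipmissing(df.dose)) && @warn "Found -999 sentinel values"

# Gut check 3: did a join silently duplicate or drop rows?
nrow(merged) == nrow(df) || @warn "Unexpected row count after join" nrow(merged)
```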

We perform these manual checks to reassure ourselves that our actions at each step of data cleaning, processing, or preliminary analysis worked as expected. However, these types of checks often rely on us as researchers visually assessing output and deciding if we agree with it. As we move toward convincing users beyond ourselves of the correctness of our work, we may consider employing defensive programming techniques that help guard against specific mistakes. An example of defensive programming in the Julia language is the use of assertions, such as the @assert macro to validate values or function outputs. Another option is writing "chatty functions" [ 20 ] that signal a user to pause, examine the output, and decide if they agree with it.
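A minimal sketch of both techniques in Julia (the function names and specific checks are our own illustrative inventions, not prescriptions):

```julia
using DataFrames

# Defensive programming: assertions make hidden assumptions explicit and
# halt the analysis the moment an expectation is violated.
function normalize_weights(w::AbstractVector{<:Real})
    @assert !isempty(w) "weight vector must be non-empty"
    @assert all(>=(0), w) "weights must be non-negative"
    @assert sum(w) > 0 "weights must not sum to zero"
    return w ./ sum(w)
end

# A "chatty function" narrates what it did, inviting the user to pause,
# examine the output, and decide whether they agree with it.
function drop_incomplete(df::DataFrame)
    kept = dropmissing(df)
    @info "drop_incomplete: removed $(nrow(df) - nrow(kept)) of $(nrow(df)) rows"
    return kept
end
```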

When to transition from the Explore Phase: Balancing breadth and depth

A researcher in the Explore Phase experiments with a variety of potential data configurations, analysis tools, and research directions. Not all of these may bear fruit in the form of novel questions or promising preliminary findings. Learning how to find a balance between the breadth and depth of data exploration helps us understand when to transition to the Refine Phase of data-intensive research. Specific questions to ask ourselves as we prepare to transition between the Explore Phase and the Refine Phase can be found in Box 2 .

Box 2. Questions

This box provides guiding questions to assist readers in navigating through each workflow phase. Questions pertain to planning, organization, and accountability over the course of workflow iteration.

Questions to ask in the Explore Phase

  • Who needs to be able to understand our work at this stage?
    • Good: Ourselves (e.g., code includes signposts refreshing our memory of what is happening where).
    • Better: Our small team, which has specialized knowledge about the context of the problem.
    • Best: Anyone with experience using tools similar to ours.
  • How is our exploratory material organized?
    • Good: Dead ends marked differently than relevant and working code.
    • Better: Material connected to a handful of promising leads.
    • Best: Material connected to a clearly defined scope.
  • Where does our work live?
    • Good: Backed up in a second location in addition to our computer.
    • Better: Within a shared space among our team (e.g., Google Drive, Box, etc.).
    • Best: Within a version control system (e.g., GitHub) that furnishes a complete timeline of actions taken.
  • Where are our decisions and expectations recorded?
    • Good: Noted in a separate place from our code (e.g., a physical notebook).
    • Better: Noted in comments throughout the code itself, with expectations informally checked.
    • Best: Noted systematically throughout code as part of a narrative, with expectations formally checked.

Questions to ask in the Refine Phase

  • Who is on our team?
    • Consider career level, computational experience, and domain-specific experience.
  • How do we communicate methodology with our teammates' skills in mind?
  • What reproducibility tools can be agreed upon?
  • How can our work be packaged into impactful research products?
    • Can we explain the same important results across different platforms (e.g., blog post in addition to white paper)?
  • Who beyond our team might be interested in our findings? How can we alert these people and make our work accessible?
  • How can we use narrative to make this clear?

Questions to ask in the Produce Phase

  • Do we have more than 1 audience?
  • What is the next step in our research?
  • Can we turn our work into more than 1 publishable product?
    • Consider products throughout the entire workflow.
    • See suggestions in the Tool development guide ( Box 4 ).

Imposing structure at certain points throughout the Explore Phase can help to balance our wide search for solutions with our deep dives into particular options. In an analogy to the software development world, we can treat our exploratory code as a code release: the marker of a stable version of a piece of software. For example, we can take stock of the code we have written at set intervals, decide what aspects of the analysis conducted using it seem most promising, and focus our attention on more formally tuning those parts of the code. At this point, we can also note the presence of research "dead ends" and perhaps record where they fit into our thought process. Some trains of thought may not continue into the next phase or become a formal research product, but they can still contribute to our understanding of the problem or eliminate a potential solution from consideration. As the project matures and computational pipelines are established, tools such as Snakemake and Nextflow can be introduced to improve the flexibility and reproducibility of the project [ 21 – 23 ]. As we make decisions about which research direction we are going to pursue, we can also adjust our file structure and organize files into directories with more informative names.

Just as Cross [ 5 ] finds that a “reasonably-structured process” leads to design success where “rigid, over-structured approaches” find less success, a balance between the formality of documentation and testing and the informality of creative discovery is key to the Explore Phase of data-intensive research. By taking inspiration from software development and adapting the principles of that arena to fit our data analysis work, we add enough structure to this phase to ease transition into the next phase of the research workflow.

Phase 2: Refine

Inevitably, we reach a point in the Explore Phase when we have acquainted ourselves with our data set, processed and cleaned it, identified interesting research questions that might be asked using it, and found the analysis tools that we prefer to apply. Having reached this important juncture, we may also wish to expand our audience from ourselves to a team of research collaborators. It is at this point that we are ready to transition to the Refine Phase. However, we should keep in mind that new insights may bring us back to the Explore Phase: Over the lifetime of a given research project, we are likely to cycle through each workflow phase multiple times.

In the Refine Phase, the extension of our target audience demands a higher standard for communicating our research decisions as well as a more formal approach to organizing our workflow and documenting and testing our code. In this section, we will discuss principles for structuring our data analysis in the Refine Phase. This phase will ultimately prepare our work for polishing into more traditional research products, including peer-reviewed academic papers.

Designing data analysis: Goals and standards of the Refine Phase

The Refine Phase encompasses many critical aspects of a data-intensive research project. Additional data cleaning may be conducted, analysis methodologies are chosen, and the final experimental design is decided upon. Experimental design may include identifying case studies for variables of interest within our data. If applicable, it is during this phase that we determine the details of simulations. Preliminary results from the Explore Phase inform how we might improve upon or scale up prototypes in the Refine Phase. Data management is essential during this phase and can be expanded to include the serialization of experimental setups. Finally, standards of reproducibility should be maintained throughout. Each of these aspects constitutes an important goal of the Refine Phase as we determine the most promising avenues for focusing our research workflow en route to the polished research products that will emerge from this phase and demand even higher reproducibility standards.

All of these goals are developed in conjunction with our research team. Therefore, decisions should be documented and communicated in a way that is reproducible and constructive within that group. Just as the solitary nature of the Explore Phase can be daunting, the collaboration that may happen in the Refine Phase brings its own set of challenges as we figure out how to best work together. Our team can be defined as the people who participate in developing the research question, preparing the data set it is applied to, coding the analysis, or interpreting the results. It might also include individuals who offer feedback about the progress of our work. In the context of academia, our team usually includes our laboratory or research group. Like most other aspects of data-intensive research, our team may evolve as the project evolves. But however we define our team, its members inform how our efforts proceed during the Refine Phase: Thus, another primary goal of the Refine Phase is establishing group-based standards for the research workflow. Specific questions to ask ourselves during this phase can be found in Box 2 .

In recent years, the conversation on standards within academic data science and scientific computing has shifted from “best” practices [ 24 ] to “good enough” practices [ 25 ]. This is an important distinction when establishing team standards during the Refine Phase: Reproducibility is a spectrum [ 26 ], and collaborative work in data-intensive research carries unique demands on researchers as scholars and coworkers [ 27 ]. At this point in the research workflow, standards should be adopted according to their appropriateness for our team. This means talking among ourselves not only about scientific results, but also about the computational experimental design that led to those results and the role that each team member plays in the research workflow. Establishing methods for effective communication is therefore another important goal in the Refine Phase, as we cannot develop group-based standards for the research workflow without it.

Analogies to software development in the Refine Phase

Documentation as a driver of reproducibility.

The concept of literate programming [ 8 ] is at the core of an effective Refine Phase. This philosophy brings together code with human-readable explanations, allowing scientists to demonstrate the functionality of their code in the context of words and visualizations that describe the rationale for and results of their analysis. The computational notebooks that were useful in the Explore Phase are also applicable here, where they can assist with team-wide discussions, research development, prototyping, and idea sharing. Jupyter Notebooks [ 28 ] are agnostic to choice of programming language and so provide a good option for research teams that may be working with a diverse code base or different levels of comfort with a particular programming language. Language-specific interfaces, such as R's RMarkdown functionality [ 29 ] or, in the Julia programming language, Literate.jl and the reactive notebook put forward by Pluto.jl, furnish additional options for literate programming.

The same strategies that promote scientific reproducibility for traditional laboratory notebooks can be applied to the computational notebook [ 30 ]. After all, our data-intensive research workflow can be considered a sort of scientific experiment—we develop a hypothesis, query our data, support or reject our hypothesis, and state our insights. A central tenet of scientific reproducibility is recording inputs relevant to a given analysis, such as parameter choices, and explaining any calculation used to obtain them so that our outputs can later be verifiably replicated. Methodological details—for example, the decision to develop a dynamic model in continuous time versus discrete time or the choice of a specific statistical analysis over alternative options—should also be fully explained in computational notebooks developed during the Refine Phase. Domain knowledge may inform such decisions, making this an important part of proper notebook documentation; such details should also be elaborated in the final research product. Computational research descriptions in academic journals generally include a narrative relevant to their final results, but these descriptions often do not include enough methodological detail to enable replicability, much less reproducibility. However, this is changing with time [ 31 , 32 ].

As scientists, we should keep a record of the tools we use to obtain our results in addition to our methodological process. In a data-intensive research workflow, this includes documenting the specific version of any software that we used, as well as its relevant dependencies and compatibility constraints. Recording this information at the top of the computational notebook that details our data science experiment allows future researchers—including ourselves and our teams—to establish the precise computational environment that was used to run the original research analysis. Our chosen programming language may supply automated approaches for doing this, such as a package manager , simplifying matters and painlessly raising the standards of reproducibility in a research team. The unprecedented levels of reproducibility possible in modern computational environments have produced some variance in the expectations of different research communities; it behooves the research team to investigate the community-level standards applicable to our specific domain science and chosen programming language.
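In Julia, for instance, the top of a notebook might capture this information along the following lines (a sketch; other languages offer analogous facilities):

```julia
using Pkg, InteractiveUtils

versioninfo()   # Julia version, OS, and hardware details
Pkg.status()    # exact versions of every package in the active project

# The active project's Project.toml and Manifest.toml pin all dependencies,
# so a collaborator can later recreate the environment with:
#   Pkg.activate("."); Pkg.instantiate()
```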

A notebook can include more than a deep dive into a full-fledged data science experiment. It can also involve exploring and communicating basic properties of the data, whether for purposes of training team members new to the project or for brainstorming alternative possible approaches to a piece of research. In the Explore Phase, we discovered characteristics of our data that we want our research team to know about, for example, outliers or unexpected distributions, and created preliminary visualizations to better understand their presence. In the Refine Phase, we may choose to improve these initial plots and reprise our data processing decisions with team members to ensure that the logic we applied still holds.

Computational notebooks can live in private or public repositories to ensure accessibility and transparency among team members. A version control system such as Git continues to be broadly useful for documentation purposes in the Refine Phase, beyond acting as a storage site for computational notebooks. Especially as our team and code base grows larger, a history of commits and pull requests helps keep track of responsibilities, coding or data issues, and general workflow.

Importantly, however, all tools have their appropriate use cases. Researchers should not develop an over-reliance on any one tool and should learn to recognize when different tools are required. For example, computational notebooks may quickly become unwieldy for certain projects and large teams, incurring technical debt in the form of duplications or overwritten variables. As our research project grows in complexity and size, or gains team members, we may want to transition to an Integrated Development Environment (IDE) or a source code editor , which interact easily with container environments like Docker and version control systems such as GitHub, to help scale our data analysis while retaining important properties like reproducibility.

Testing and establishing code modularity.

Code in data-intensive research is generally written as a means to an end, the end being a scientific result from which researchers can draw conclusions. This stands in stark contrast to the purpose of code developed by data engineers or computer scientists, which is generally written to optimize a mechanistic function for maximum efficiency. During the Refine Phase, we may find ourselves with both analysis-relevant and mechanistic code , especially in “big data” statistical analyses or complex dynamic simulations where optimized computation becomes a concern. Keeping the immediate audience of this workflow phase, our research team, at the forefront of our mind can help us take steps to structure both mechanistic and analysis code in a useful way.

Mechanistic code, which is designed for repeated use, often employs abstractions by wrapping code into functions that apply the same action repeatedly or stringing together multiple scripts into a computational pipeline. Unit tests and so-called accessor functions or getter and setter functions that extract parameter values from data structures or set new values are examples of mechanistic code that might be included in a data-intensive research analysis. Meanwhile, code that is designed to gain statistical insight into distributions or to model scientific dynamics using mathematical equations are 2 examples of analysis code. Sometimes, the line between mechanistic code and analysis code can be a blurry one. For example, we might write a looping function to sample our data set repeatedly, and that would classify as mechanistic code. But that sampling may be designed to occur according to an algorithm such as Markov Chain Monte Carlo that is directly tied to our desire to sample from a specific probability distribution; therefore, this could be labeled both analysis and mechanistic code. We should keep our audience, and the reproducibility of our experiment, in mind when considering how to present our code.
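To make the mechanistic side concrete, a minimal accessor pair in Julia might look like the following; the `SimulationConfig` type is a hypothetical stand-in for one of our own data structures:

```julia
# Mechanistic code: small, reusable functions that get and set parameter
# values without exposing the underlying data structure to analysis code.
struct SimulationConfig
    params::Dict{Symbol,Float64}
end

get_param(cfg::SimulationConfig, name::Symbol) = cfg.params[name]

function set_param(cfg::SimulationConfig, name::Symbol, value::Float64)
    updated = copy(cfg.params)
    updated[name] = value
    return SimulationConfig(updated)   # returns a new config; no side effects
end
```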

It is common practice to wrap code that we use repeatedly into functions to increase readability and modularity while reducing the propensity for user-induced error. However, the scripts and programming notebooks so useful to establishing a narrative and documenting work in the Refine Phase are set up to be read in a linear fashion. Embedding mechanistic functions in the midst of the research narrative obscures the utility of the notebooks in telling the research story and generally clutters up the analysis with a lot of extra code. For example, if we develop a function to eliminate the redundancy of repeatedly restructuring our data to produce a particular type of plot, we do not need to showcase that function in the middle of a computational notebook analyzing the implications of the plot that is created—the point is the research implications of the image, not the code that made the plot. Then where do we keep the data-reshaping, plot-generating code?

Strategies to structure the more mechanistic aspects of our analysis can be drawn from common software development practices. As our team grows or changes, we may require the same mechanistic code. For example, the same data-reshaping, plot-generating function described earlier might be pulled into multiple computational experiments that are set up in different locations, computational notebooks, scripts, or Git branches. Therefore, a useful approach would be to start collecting those mechanistic functions into their own script or file, sometimes called “helpers” or “utils,” that acts as a supplement to the various ongoing experiments, wherever they may be conducted. This separate script or file can be referenced or “called” at the beginning of the individual data analyses. Doing so allows team members to benefit from collaborative improvements to the mechanistic code without having to reinvent the wheel themselves. It also preserves the narrative properties of team members’ analysis-centric computational notebooks or scripts while maintaining transparency in basic methodologies that ensure project-wide reproducibility. The need to begin collecting mechanistic functions into files separate from analysis code is a good indicator that it may be time for the research team to supplement computational notebooks by using a code editor or IDE for further code development.
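As a sketch, such a helpers file and its use might look like this in Julia (the file name, function, and column are illustrative):

```julia
# helpers.jl -- shared mechanistic code, kept out of the analysis narrative
using DataFrames

"Reshape `df` from wide format to the long format our plotting recipes expect."
function plotready(df::DataFrame; id::Symbol = :sample)
    return stack(df, Not(id))
end
```

Each analysis notebook or script then begins with `include("helpers.jl")`, so every team member calls the same data-reshaping code without cluttering their own narrative.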

Testing scientific software is not always perfectly analogous to testing typical software development projects, where automated continuous integration is often employed [ 17 ]. However, as we start to modularize our code, breaking it into functions and from there into separate scripts or files that serve specific purposes, principles from software engineering become more readily applicable to our data-intensive analysis. Unit tests can now help us ensure that our mechanistic functions are working as expected, formalizing the “gut checks” that we performed in the Explore Phase. Among other applications, these tests should verify that our functions return the appropriate value, object type, or error message as needed [ 33 ]. Formal tests can also provide a more extensive investigation of how “trustworthy” the performance of a particular analysis method might be, affording us an opportunity to check the correctness of our scientific inferences. For example, we could use control data sets where we know the result of a particular analysis to make sure our analysis code is functioning as we expect. Alternatively, we could also use a regression test to compare computational outputs before and after changes in the code to make sure we haven’t introduced any unanticipated behavior.
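With Julia's built-in Test standard library, for example, a first unit test for the hypothetical `plotready` helper sketched above might check the type and dimensions of its output:

```julia
using Test, DataFrames

@testset "plotready" begin
    df = DataFrame(sample = ["a", "b"], x = [1.0, 2.0], y = [3.0, 4.0])
    long = plotready(df)
    @test long isa DataFrame                      # expected object type
    @test nrow(long) == 4                         # 2 samples x 2 variables
    @test Set(long.variable) == Set(["x", "y"])   # expected variable labels
    # A regression test would additionally compare `long` against a saved
    # reference output to catch unanticipated changes in behavior.
end
```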

When to transition from the Refine Phase: Going backwards and forwards

Workflows in data science are rarely linear; it is often necessary for researchers to iterate between the Refine and Explore Phases ( Fig 1B ). For example, while our research team may decide on a computational experimental design to pursue in the Refine Phase, the scope of that design may require us to revisit decisions made during the data processing that was conducted in the Explore Phase. This might mean including additional information from supplementary data sets to help refine our hypothesis or research question. In returning to the Explore Phase, we investigate these potential new data sets and decide if it makes sense to merge them with our original data set.

Iteration between the Refine and Explore Phases is a careful balance. On the one hand, we should be careful not to allow “scope creep” to expand our problem space beyond an area where we are able to develop constructive research contributions. On the other hand, if we are too rigid about decisions made over the course of our workflow and refuse to look backwards as well as forwards, we may risk cutting ourselves off from an important part of the potential solution space.

Data-intensive researchers can once more look to principles within the software development community, such as Agile frameworks, to help guide the careful balancing act required to conduct research that is both comprehensive and achievable [ 34 , 35 ]. How a team organizes itself, and how it documents that organization process, can itself serve as a research product, as we describe in the next phase of the workflow: the Produce Phase.

Phase 3: Produce

In the previous sections of this paper, we discussed how to progress from the exploration of raw data through the refinement of a research question and selection of an analytical methodology. We also described how the details of that workflow are guided by the breadth of the immediately relevant audience: ourselves in the Explore Phase and our research team in the Refine Phase. In the Produce Phase, it becomes time to make our data analysis camera-ready for a much broader group, bringing our research results into a state that can be understood and built upon by others. This may translate to developing a variety of research products in addition to, or instead of, traditional academic outputs like peer-reviewed publications and typical software development products such as computational tools.

Beyond data analysis: Goals and standards of the Produce Phase

The main goal of the Produce Phase is to prepare our analysis to enter the public realm as a set of products ready for external use, reflection, and improvement. The Produce Phase encompasses the cleanup that happens prior to initially sharing our results to a broader community beyond our team, for example, ahead of submitting our work to peer review. It also includes the process of incorporating suggestions for improvement prior to finalization, for example, adjustments to address reviewer comments ahead of publication. The research products that emerge from a given workflow may vary in both their form and their formality—indeed, some research products, like a code base, might continually evolve without ever assuming “final” status—but each product constitutes valuable contributions that push our field’s scientific boundaries in their own way.

Importantly, producing public-facing products over the course of an entire workflow ( Fig 2 ) rather than just at the end of a project can help researchers progressively build their data science research portfolios and fulfill a second goal of the Produce Phase: gaining credit, and credibility, in our domain area. This is especially relevant for junior scientists who are just starting research careers or who wish to become industry data scientists [ 3 ]. Developing polished products at several intervals along a single workflow is also instructional for the researcher themselves. Researchers who prepare their work for public assessment from the earliest phases of an analysis become acquainted with the pertinent problem and solution spaces from multiple perspectives. This additional understanding, together with the feedback that polished products generate from people outside ourselves and our immediate team, may furnish insights that improve our approach in other phases of the research workflow.


Fig 2. Research products can build off of content generated in either the Explore or the Refine Phase. As in Fig 1A , turquoise circles represent potential research products generated as the project develops. Closed circles represent research products within the scope of the current project, while open circles represent products beyond its scope. This figure emphasizes how those research products project onto a timeline and represent elements in our portfolio of work or lines on a CV. The ERP workflow emphasizes and encourages production, beyond traditional academic research products, throughout the lifecycle of a data-intensive project rather than just at the very end.

https://doi.org/10.1371/journal.pcbi.1008770.g002

Building our data science research portfolio requires a method for tracking and attributing the many products that we might develop. One important method for tracking and attribution is the digital object identifier or DOI. It is a unique handle, standardized by the International Organization for Standardization (ISO), that can be assigned to different types of information objects. DOIs are usually connected to metadata, for example, they might include a URL pointing to where the object they are associated with can be found online. Academic researchers are used to thinking of DOIs as persistent identifiers for peer-reviewed publications. However, DOIs can also be generated for data sets, GitHub repositories, computational notebooks, teaching materials, management plans, reports, white papers , and preprints. Researchers would also be well advised to register for a unique and persistent digital identifier to be associated with their name, called an ORCID iD ( https://orcid.org ), as an additional method of tracking and attributing their personal outputs over the course of their career.

A third, longer-term goal of the Produce Phase involves establishing a researcher's professional trajectory. Every individual needs to gauge how their compendium of research products contributes to their career and how intentional portfolio building might, in turn, drive the research that they ultimately conduct. For example, researchers who wish to work in academia might feel obliged to obtain "academic value" from less traditional research products by essentially reprising them as peer-reviewed papers. But judging a researcher's productivity by the metric of paper authorship can alter how and even whether research is performed [ 36 ]. Increasingly, academic journals are revisiting their publishing requirements [ 37 ] and raising their standards of reproducibility. This shift is bringing the data and programming methodologies that underpin our written analyses closer to center stage. Data-intensive research, and the people who produce it, stand to benefit. Scientists, now encouraged and even required by some academic journals to share both data and code, can publish and receive credit as well as feedback for the multiple research products that support their publications. Questions to ask ourselves as we consider possible research products can be found in Box 2 .

Produce: Products of the Explore Phase

The old adage that one person’s trash is another’s treasure is relevant to the Explore Phase of a data science analysis: Of the many potential applications for a particular data set, there is often only time to explore a small subset. Those applications which fall outside the scope of the current analysis can nonetheless be valuable to our future selves or to others seeking to conduct their own analyses. To that end, the documentation that accompanies data exploration can furnish valuable guidance for later projects. Further, the cleaned and processed data set that emerges from the Explore Phase is itself a valuable outcome that can be assigned a DOI and rendered a formal product of this portion of the data analysis workflow, using outlets like Dryad ( http://www.datadryad.org ) and Figshare ( https://figshare.com/ ) among others.

Publicly sharing the data set, along with its metadata, is an essential component of scientific transparency and reproducibility, and it is of fundamental importance to the scientific community. Data associated with a research outcome should follow the "FAIR" principles of findability, accessibility, interoperability, and reusability. Importantly, discipline-specific data standards should be followed when preparing data, whether the data are being refined for public-facing or personal use. Data-intensive researchers should familiarize themselves with the standards relevant to their field of study and recognize that meeting these standards increases the likelihood of their work being both reusable and reproducible. In addition to enabling future scientists to use the data set as it was developed, adhering to a standard also facilitates the creation of synthetic data sets for later research projects. Examples of discipline-specific data standards in the natural sciences are Darwin Core ( https://dwc.tdwg.org ) for biodiversity data and EML ( https://eml.ecoinformatics.org ) for ecological data. To maximize the utility of a publicly accessible data set, during the Produce Phase, researchers should confirm that it includes descriptive README files and field descriptions and also ensure that all abbreviations and coded entries are defined. In addition, an appropriate license should be assigned to the data set prior to publication: The license indicates whether, or under what circumstances, the data require attribution.

The Git repositories or computational notebooks that archive a data scientist's approach, record the process of uncovering coding bugs, redundancies, or inconsistencies, and note the rationale for focusing on specific aspects of the data are also useful research products in their own right. These items, which emerge from software development practices, can provide a touchstone for alternative explorations of the same data set at a later time. In addition to documenting valuable lessons learned, contributions of this kind can formally augment a data-intensive researcher's registered body of work: Code used to actively clean data or record an Explore Phase process can be made citable by employing services like Zenodo to add a DOI to the applicable Git commit. Smaller code snippets or data excerpts can be shared, publicly or privately, using the more lightweight GitHub Gists ( https://gist.github.com/ ). Tools such as DrWatson ( https://github.com/JuliaDynamics/DrWatson.jl ) and Snakemake [ 23 ] are designed to assist researchers with organization and reproducibility and can inform the polishing process for products emerging from any phase of the analysis (see [ 22 ] for more discussion of reproducible workflow design and tools). As with data products, in the Produce Phase, researchers should license their code repositories such that other scientists know how they can use, augment, or redistribute the contents. The Produce Phase is also the time for researchers to include descriptive README files and clear guidelines for future code contributors in their repository.

Alternative mechanisms for crediting the time and talent that researchers invest in the Explore Phase include relatively informal products. For example, blog posts can detail problem space exploration for a specific research question or lessons learned about data analysis training and techniques. White papers that describe the raw data set and the steps taken to clean it, together with an explanation of why and how these decisions were taken, might constitute another such informal product. Versions of these blog posts or white papers can be uploaded to open-access websites such as arXiv.org as preprints and receive a DOI.

The familiar academic route of a peer-reviewed publication is also available for products emerging from the Explore Phase. For example, depending on the domain area of interest, journals such as Nature Scientific Data and IEEE Transactions are especially suited to papers that document the methods of data set development or simply reproduce the data set itself. Pedagogical contributions that were learned or applied over the course of a research workflow can be written up for submission to training-focused journals such as the Journal of Statistics Education . For a list of potential research product examples for the Explore Phase, see Box 3 .

Box 3. Products

Research products can be developed throughout the ERP workflow. This box helps identify some options for each phase, including products less traditional to academia. Those that can be labeled with a digital object identifier (DOI) are marked as such.

Potential Products in the Explore Phase

  • Publication of cleaned and processed data set (DOI)
  • Citable GitHub repository and/or computational notebook that shows data cleaning/processing and exploratory data analysis (e.g., Jupyter Notebook, Knitr, Literate, Pluto, etc.) (DOI)
  • GitHub Gists (e.g., particular piece of processing code)
  • White paper (e.g., explaining a data set)
  • Blog post (e.g., detailing exploratory process)
  • Teaching/training materials (e.g., data wrangling)
  • Preprint (e.g., about a data set or its creation) (DOI)
  • Peer-reviewed publication (e.g., about a curated data set) (DOI)

Potential Products in the Refine Phase

  • White paper (e.g., explaining preliminary findings)
  • Citable GitHub repository and/or computational notebook showing methodology and results (DOI)
  • Blog post (e.g., explaining findings informally)
  • Teaching/training materials (e.g., using your work as an example to teach a computational method)
  • Preprint (e.g., preliminary paper before being submitted to a journal) (DOI)
  • Peer-reviewed publication (e.g., formal description of your findings) (DOI)
  • Grant application incorporating the data management procedure
  • Methodology (e.g., writing a methods paper) (DOI)
  • Computational tool (DOI)
    • This might include a package, a library, or an interactive web application.
    • See Box 4 for further discussion of this potential research product.

Produce: Products of the Refine Phase

In the Refine Phase, documentation and the ability to communicate both methods and results become essential to daily management of the project. Happily, the implementation of these basic practices can also provide benefits beyond the immediate team of research collaborators: They can be standardized as a Data Management Plan or Protocol (DMP). DMPs are a valuable product that can emerge from the Refine Phase as a formal version of lessons learned concerning both research and team management. This product records the strategies and approaches used to, for example, describe, share, store, analyze, and preserve data.

While DMPs are often living documents over the course of a research project, evolving dynamically with the needs or restrictions that are encountered along the way, there is great utility to codifying them either for our team’s later use or for others conducting similar projects. DMPs can also potentially be leveraged into new research grants for our team, as these protocols are now a common mandate by many funders [ 38 ]. The group discussions that contribute to developing a DMP can be difficult and encompass considerations relevant to everything from team building to research design. The outcome of these discussions is often directly tied to the constructiveness of a research team and its robustness to potential turnover [ 38 ]. Sharing these standards and lessons learned in the form of polished research products can propel a proactive discussion of data management and sharing practices within our research domain. This, in turn, bolsters the creation or enhancement of community standards beyond our team and provides training materials for those new to the field.

As with the research products that are generated by the Explore Phase, DMPs can lead to polished blog posts, training materials, white papers, and preprints that enable researchers to both spread the word about their valuable findings and be credited for their work. In addition, peer-reviewed journals are beginning to allow the publication of DMPs as a formal outcome of the data analysis workflow (e.g., Rio Journal ). Importantly, when new members join a research team, they should receive a copy of the group’s DMP. If any additional training pertinent to plans or protocols is furnished to help get new members up to speed, these materials too can be polished into research products that contribute to scientific advancement. For a list of potential research product examples for the Refine Phase, see Box 3 .

Produce: Traditional research products and scientific software

By polishing our work, we finalize and format it to receive critiques beyond ourselves and our immediate team. The scientific analysis and results that are born of the full research workflow—once documented and linked appropriately to the code and data used to conduct it—are most frequently packaged into the traditional academic research product: a peer-reviewed publication. Even this product, however, can be improved upon in terms of its reproducibility and transparency thanks to software development tools and practices. For example, papers that employ literate programming notebooks enable researchers to augment the real-time evolution of a written draft with the code that informs it. A well-kept notebook can be used to outline the motivations for a manuscript and select the figures best suited to conveying the intended narrative, because it shows the evolution of ideas and the mathematics behind each analysis along with—ideally—brief textual explanations.

Peer-reviewed papers are of primary importance to the career and reputation of academic researchers [ 39 ], but the traditional format for such publications often does not take into account essential aspects of data-intensive analysis such as computational reproducibility [ 40 ]. Where strict requirements for reproducibility are not enforced by a given journal, researchers should nonetheless compile the supporting products that made our submitted manuscript possible—including relevant code and data, as well as the documentation of our computational tools and methodologies as described in the earlier sections of this paper—into a research compendium [ 37 , 41 – 43 ]. The objective is to provide transparency to those who read or wish to replicate our academic publication and reproduce the workflow that led to our results.

In addition to peer-reviewed publications and the various alternative research products described above, some scientists may choose to revisit the scripts developed during the Explore or Refine Phases and polish that code into a traditional software development product: a computational tool, also called a software tool . A computational tool can include libraries, packages, collections of functions, or data structures designed to help with a specific class of problem. Such products might be accompanied by repository documentation or a full-fledged methodological paper that can be categorized as additional research products beyond the tool itself. Each of these items can augment a researcher's body of citable work and contribute to advances in our domain science.

One very simple example of a tool might be an interactive web application built in RShiny ( https://shiny.rstudio.com/ ) that allows the easy exploration of cleaned data sets or demonstrates the outcomes of alternative research questions. More complex examples include a software package that builds an open-source analysis pipeline or a data structure that formally standardizes the problem space of a domain-specific research area. In all cases, the README files, docstrings, example vignettes, and appropriate licensing relevant to the Explore Phase are also a necessity for open-source software. Developers should also specify contributing guidelines for future researchers who might seek to improve or extend the capabilities of the original tool. Where applicable, the dynamic equations that inform simulations should be cited alongside the original scientific literature in which they were derived.

The effort to translate reproducible scripts into reusable software and then to maintain the software and support users is often a massive undertaking. While the software engineering literature furnishes a rich suite of resources for researchers seeking to develop their own computational tools, this existing body of work is generally directed toward trained programmers and software engineers. The design decisions that are crucial to scientists, who are primarily interested in data analysis, experiment extensibility , and result reporting and inference, can be obscured by concepts that are either out of scope or described in overly technical jargon. Box 4 furnishes a basic guide to highlight the decision points and architectural choices relevant to creating a tool for data-intensive research. Domain scientists seeking to wade into computational tool development are well advised to review the guidelines described in Grüning and colleagues [ 2 ] in addition to more traditional software development resources and texts such as Clean Code [ 44 ], Refactoring [ 45 ], and Best Practices for Scientific Computing [ 24 ].

Box 4. Tool development guide

Creating a new software tool as the polished product of a research workflow is nontrivial. This box furnishes a series of guiding questions to help researchers think through whether tool creation is appropriate to project goals, domain science needs, and team member skill sets.

  • Does a tool in this space already exist that can be used to provide the functionality/answer the research question of interest?
  • What would creating a new tool accomplish?
    • Does it formalize our research question?
    • Does it extend/allow extension of investigative capabilities beyond the research question that our existing script was developed to ask?
    • Does creating a tool advance our personal career goals or augment a desired/necessary skill set?
  • What resources do we have for tool development?
    • Funding (if applicable)?
    • Domain expertise?
    • Programming expertise?
    • Collaborative research partners with either time, funding, or relevant expertise?
  • Will the process of creating the new tool be valued/helpful for our career goals?
  • Should we build on an existing tool or make a new one?
  • Who is the envisioned user, and what do they need?
    • What research area is it designed for?
    • Who is the envisioned end user? (e.g., scientist inside our domain, scientist outside our domain, policy maker, member of the public)
    • What is the goal of the end user? (e.g., analysis of raw inputs, explanation of results, creation of inputs for the next step of a larger analysis)
    • What are field norms?
    • Is it accessible (free, open source)?
  • What are the tool's inputs and outputs?
    • What is the likely form and type of data input to our tool?
    • What is the desired form and type of data output from our tool?
    • Are there preexisting structures that are useful to emulate, or should we develop our own?
    • Is there an existing package that provides basic structure or building block functionalities necessary or useful for our tool, such that we do not need to reinvent the wheel?

Conclusions

Defining principles for data analysis workflows is important for scientific accuracy, efficiency, and the effective communication of results, regardless of whether researchers are working alone or in a team. Establishing standards, such as for documentation and unit testing, both improves the quality of work produced by practicing data scientists and sets a proactive example for fledgling researchers to do the same. There is no single set of principles for performing data-intensive research. Each computational project carries its own context—from the scientific domain in which it is conducted, to the software and methodological analysis tools we use to pursue our research questions, to the dynamics of our particular research team. Therefore, this paper has outlined general concepts for designing a data analysis such that researchers may incorporate the aspects of the ERP workflow that work best for them. It has also put forward suggestions for specific tools to facilitate that workflow and for a selection of nontraditional research products that could emerge throughout a given data analysis project.

Aiming for full reproducibility when communicating research results is a noble pursuit, but it is imperative to understand that there is a balance between generating a complete analysis and furnishing a 100% reproducible product. Researchers have competing motivations: finishing their work in a timely fashion versus having a perfectly documented final product, while balancing how these trade-offs might strengthen their careers. Despite various calls for the creation of a standard framework [ 7 , 46 ], achieving complete reproducibility may require changes that go far beyond the individual researcher, from a culture-wide shift in the expectations of consumers of scientific research products to the realistic capacities of version control software. The first of these advancements is particularly challenging and unlikely to manifest quickly across data-intensive research areas, although it is underway in a number of scientific domains [ 26 ]. By reframing what a formal research product can be, and noting that polished contributions can constitute much more than the academic publications previously held forth as the benchmark for career advancement, we motivate structural change to data analysis workflows.

In addition to amassing outputs beyond the peer-reviewed academic publication, venues are increasingly available for writing less traditional papers that describe or consist solely of a novel data set, a software tool, a particular methodology, or training materials. As the professional landscape for data-intensive research evolves, these novel publications and research products are extremely valuable for distinguishing applicants to academic and nonacademic jobs, grants, and teaching positions. Data scientists and researchers should possess numerous and multifaceted skills to perform scientifically robust and computationally effective data analysis. Therefore, potential research collaborators or hiring entities both inside and outside the academy should take into account a variety of research products, from every phase of the data analysis workflow, when evaluating the career performance of data-intensive researchers [ 47 ].

Acknowledgments

We thank the Best Practices Working Group (UC Berkeley) for the thoughtful conversations and feedback that greatly informed the content of this paper. We thank the Berkeley Institute for Data Science for hosting meetings that brought together data scientists, biologists, statisticians, computer scientists, and software engineers to discuss how data-intensive research is performed and evaluated. We especially thank Stuart Geiger (UC Berkeley) for his leadership of the Best Practices in Data Science Group and Rebecca Barter (UC Berkeley) for her helpful feedback.

  • 3. Robinson E, Nolis J. Build a Career in Data Science. Simon and Schuster; 2020.
  • 6. Terence S. An Extensive Step by Step Guide to Exploratory Data Analysis. 2020 [cited 2020 Jun 15]. https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e .
  • 13. Bostock MA. Better Way to Code—Mike Bostock—Medium. 2017 [cited 2020 Jun 15]. https://medium.com/@mbostock/a-better-way-to-code-2b1d2876a3a0 .
  • 14. van der Plas F. Pluto.jl. Github. https://github.com/fonsp/Pluto.jl .
  • 15. Best Practices for Writing R Code–Programming with R. [cited 15 Jun 2020]. https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/
  • 16. PyCon 2019. Jes Ford—Getting Started Testing in Data Science—PyCon 2019. Youtube; 5 May 2019 [cited 2020 Feb 20]. https://www.youtube.com/watch?v=0ysyWk-ox-8
  • 17. Hook D, Kelly D. Testing for trustworthiness in scientific software. 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. 2009. pp. 59–64.
  • 18. Oh J-H. Check Yo’ Data Before You Wreck Yo’ Results. In: Medium [Internet]. ACLU Tech & Analytics; 24 Jan 2020 [cited 2020 Apr 9]. https://medium.com/aclu-tech-analytics/check-yo-data-before-you-wreck-yo-results-53f0e919d0b9 .
  • 19. Gelfand S. comparing two data frames: one #rstats, many ways! | Sharla Gelfand. In: Sharla Gelfand [Internet]. Sharla Gelfand; 17 Feb 2020 [cited 2020 Apr 20]. https://sharla.party/post/comparing-two-dfs/ .
  • 20. Gelfand S. Don’t repeat yourself, talk to yourself! Repeated reporting in the R universe | Sharla Gelfand. In: Sharla Gelfand [Internet]. 30 Jan 2020 [cited 2020 Apr 20]. https://sharla.party/talk/2020-01-01-rstudio-conf/ .
  • 27. Geiger RS, Sholler D, Culich A, Martinez C, Hoces de la Guardia F, Lanusse F, et al. Challenges of Doing Data-Intensive Research in Teams, Labs, and Groups: Report from the BIDS Best Practices in Data Science Series. 2018.
  • 29. Xie Y. Dynamic Documents with R and knitr. Chapman and Hall/CRC; 2017.
  • 33. Wickham H. R Packages: Organize, Test, Document, and Share Your Code. “O’Reilly Media, Inc.”; 2015.
  • 34. Abrahamsson P, Salo O, Ronkainen J, Warsta J. Agile Software Development Methods: Review and Analysis. arXiv [cs.SE]. 2017. http://arxiv.org/abs/1709.08439 .
  • 35. Beck K, Beedle M, Van Bennekum A, Cockburn A, Cunningham W, Fowler M, et al. Manifesto for agile software development. 2001. https://moodle2019-20.ua.es/moodle/pluginfile.php/2213/mod_resource/content/2/agile-manifesto.pdf .
  • 38. Sholler D, Das D, Hoces de la Guardia F, Hoffman C, Lanusse F, Varoquaux N, et al. Best Practices for Managing Turnover in Data Science Groups, Teams, and Labs. 2019.
  • 44. Martin RC. Clean Code: A Handbook of Agile Software Craftsmanship. Pearson Education; 2009.
  • 45. Fowler M. Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional; 2018.
  • 47. Geiger RS, Cabasse C, Cullens CY, Norén L, Fiore-Gartland B, Das D, et al. Career Paths and Prospects in Academic Data Science: Report of the Moore-Sloan Data Science Environments Survey. 2018.
  • 49. Jorgensen PC, editor. About the International Software Testing Qualification Board. In: The Craft of Model-Based Testing. 1st ed. Boca Raton: Auerbach Publications, Taylor & Francis Group; 2017. pp. 231–240.
  • 51. Wikipedia contributors. Functional design. In: Wikipedia, The Free Encyclopedia [Internet]. 4 Feb 2020 [cited 21 Feb 2020]. https://en.wikipedia.org/w/index.php?title=Functional_design&oldid=939128138
  • 52. 7 Essential Guidelines For Functional Design—Smashing Magazine. In: Smashing Magazine [Internet]. 5 Aug 2008 [cited 21 Feb 2020]. https://www.smashingmagazine.com/2008/08/7-essential-guidelines-for-functional-design/
  • 53. Claerbout JF, Karrenbach M. Electronic documents give reproducible research a new meaning. SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists; 1992. pp. 601–604.
  • 54. Heroux MA, Barba L, Parashar M, Stodden V, Taufer M. Toward a Compatible Reproducibility Taxonomy for Computational and Computing Sciences. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States); 2018. https://www.osti.gov/biblio/1481626 .


Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

  • Review Article
  • Published: 12 July 2021
  • Volume 2, article number 377 (2021)



  • Iqbal H. Sarker, ORCID: orcid.org/0000-0003-1740-5517


The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on "Data Science", including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.


Introduction

We are living in the age of "data science and advanced analytics", where almost everything in our daily lives is digitally recorded as data [ 17 ]. Thus the current electronic world is a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc. [ 112 ]. The data can be structured, semi-structured, or unstructured, and it grows day by day [ 105 ]. Data science is typically a "concept to unify statistics, data analysis, and their related methods" to understand and analyze actual phenomena with data. According to Cao et al. [ 17 ], "data science is the science of data" or "data science is the study of data", where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of "data science" is increasing day by day, as shown in Fig 1 according to Google Trends data over the last 5 years [ 36 ]. In addition to data science, we have also shown the popularity trends of the relevant areas "Data analytics", "Data mining", "Big data", and "Machine learning" in the figure. According to Fig 1 , the popularity indication values for these data-driven domains, particularly "Data science" and "Machine learning", are increasing day by day. This statistical information and the applicability of data-driven smart decision-making in various real-world application areas motivate us to briefly study "Data science" and machine-learning-based "Advanced analytics" in this paper.

Figure 1: The worldwide popularity score of data science compared with relevant areas, on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score.

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. The emphasis of advanced analytics is on using data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of data, while advanced analytics goes a step further by offering a deeper understanding of data and supporting the analysis of granular data, which is our interest here. In the field of data science, several types of analytics are popular: “Descriptive analytics”, which answers the question of what happened; “Diagnostic analytics”, which answers the question of why it happened; “Predictive analytics”, which predicts what will happen in the future; and “Prescriptive analytics”, which prescribes what action should be taken, discussed briefly in “ Advanced analytics methods and smart computing ”. Such advanced analytics and decision-making based on machine learning techniques [ 105 ], a major part of artificial intelligence (AI) [ 102 ], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to their learning capability for smart computing as well as automation [ 121 ].

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data, and can thus be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originates from artificial neural networks, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in “ Advanced analytics methods and smart computing ”. It is thus important to understand the principles of these advanced analytics methods and their applicability in various real-world application areas. For instance, in our earlier paper Sarker et al. [ 114 ], we discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account the data science application areas and real-world problems in ten potential domains, including business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in “ Real-world application domains ”.

Given the importance of machine learning modeling for extracting useful insights from the data mentioned above, and of data-driven smart decision-making, in this paper we present a comprehensive view of “Data Science”, including the various types of advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application. The key contribution of this study is thus to explain data science modeling, the different analytics methods from a solution perspective, and their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is to provide a basic guide or reference for those in academia and industry who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision-making within the area of data science.

The main contributions of this paper are summarized as follows:

To define the scope of our study toward data-driven smart computing and decision-making in real-world settings. We also briefly discuss the concept of data science modeling, from business problems to data products and automation, to understand its applicability for providing intelligent services in real-world scenarios.

To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.

To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.

To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain the different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are closely related and easily confused. In the following, we define these terms and differentiate them from the term “Data Science” according to our goal.

The term “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [ 17 ]. The term “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight [ 17 ]. Statistical and mathematical analysis of the data is the major concern in this process. “Data mining” is another popular term over the last decade, which carries a similar meaning to several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [ 38 ], it should have been more appropriately named “knowledge mining from data”. Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [ 38 ]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. “Big data” is another popular term nowadays, which may change statistical and data analysis approaches, as it has the unique features of being “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous” [ 74 ]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [ 129 ]. Several unique features, including volume, velocity, variety, veracity, value (5Vs), and complexity, are used to understand and describe big data [ 69 ].

In terms of analytics, basic analytics provides a summary of data, whereas “Advanced Analytics” takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights and to predict or generate recommendations, typically beyond traditional business intelligence or analytics. “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics and can automate analytical model building [ 112 ]. It is based on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement [ 38 , 115 ]. “Deep learning” is a subfield of machine learning concerning algorithms inspired by the structure and function of the human brain, called artificial neural networks [ 38 , 139 ].

Unlike the above data-related terms, “Data science” is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [ 17 ], Cao et al. defined data science from the disciplinary perspective as “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. In “ Understanding data science modeling ”, we briefly discuss data science modeling from a practical perspective, starting from business problems to data products, which can assist data scientists in thinking and working within a particular real-world problem domain in the area of data science and analytics.

Related Work

Several papers have reviewed data science and its significance in the area. For example, the authors in [ 19 ] identify the evolving field of data science and its importance in the broader knowledge environment, along with some issues that differentiate data science and informatics issues from conventional approaches in the information sciences. Donoho et al. [ 27 ] present 50 years of data science, including recent commentary on data science in the mass media and on how, or whether, data science differs from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [ 53 ] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [ 17 ], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [ 20 ], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [ 61 ] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [ 62 ], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [ 87 ] provide a thorough understanding of computational optimal transport with application to data science. In [ 97 ], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling starting from business problems to data product and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [ 17 , 112 , 114 ]. The data can be of different types, such as (i) Structured—data that has a well-defined structure and follows a standard order; examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—data with no pre-defined format or organization; examples are sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—data with elements of both structured and unstructured data, containing certain organizational properties; examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—data that represents data about the data; examples are author, file type, file size, creation date and time, last modification date and time, etc. [ 38 , 105 ].
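To make the distinction concrete, the following minimal sketch (with hypothetical field names chosen purely for illustration) shows the same customer record as a structured tabular row, as a semi-structured JSON document, and with typical metadata:

```python
import json

# Structured: well-defined schema, fixed fields (e.g., a database row)
structured_row = ("Alice", "2023-05-01", "New York")  # name, date, address

# Semi-structured: organizational properties but a flexible schema (JSON)
semi_structured = json.loads("""
{
  "name": "Alice",
  "purchases": [{"item": "book", "price": 12.5}],
  "note": "prefers e-mail contact"
}
""")

# Metadata: data about the data, not the record's content itself
metadata = {"author": "crm-export", "file_type": "json",
            "created": "2023-05-01T10:00:00"}

print(semi_structured["purchases"][0]["item"])  # -> book
```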

In the area of data science, researchers use various widely-used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 127 ], UNSW-NB15 [ 79 ], Bot-IoT [ 59 ], ISCX’12 [ 15 ], CIC-DDoS2019 [ 22 ], etc., smartphone datasets such as phone call logs [ 88 , 110 ], mobile application usages logs [ 124 , 149 ], SMS Log [ 28 ], mobile phone notification logs [ 77 ] etc., IoT data [ 56 , 11 , 64 ], health data such as heart disease [ 99 ], diabetes mellitus [ 86 , 147 ], COVID-19 [ 41 , 78 ], etc., agriculture and e-commerce data [ 128 , 150 ], and many more in various application domains. In “ Real-world application domains ”, we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision making, which can help the data scientists and application developers to explore more in various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in “ Background and related work ”. In this section, we briefly discuss how data science can play a significant role in the real-world business process. Figure 2 shows an example of data science modeling starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

Figure 2: An example of data science modeling from real-world data to a data-driven system and decision-making.

Understanding business problems: This involves getting a clear understanding of the problem that needs to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, whether the behavior is unrealistic/abnormal, which option should be taken, what action, etc., could be relevant questions depending on the nature of the problem. This helps to form a better idea of what the business needs and what should be extracted from the data. Such business knowledge, which enables organizations to enhance their decision-making process, is known as “Business Intelligence” [ 65 ]. Identifying the relevant data sources that can help answer the formulated questions, and what kinds of actions should be taken from the trends that the data shows, are other important tasks associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve it.

Understanding data: Data science is largely driven by the availability of data [ 114 ]. Thus, a sound understanding of the data is needed for a data-driven model or system. The reason is that real-world datasets are often noisy, contain missing values, and have inconsistencies or other data issues, which need to be handled effectively [ 101 ]. To gain actionable insights, the appropriate data, of adequate quality, must be sourced and cleansed, which is fundamental to any data science engagement. For this, a data assessment that evaluates what data is available and how it aligns with the business problem could be the first step in data understanding. Several aspects, such as data type/format, whether the quantity of data is sufficient to extract useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, and the important metrics for reporting the data, need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would be best needed and the best ways to acquire it.

Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [ 135 ]. This examines a broad data collection to discover initial trends, attributes, points of interest, etc., in an unstructured manner to construct meaningful summaries of the data. Thus, data exploration is typically used to figure out the gist of the data and to develop a first assessment of its quality, quantity, and characteristics. A statistical model may or may not be used, but primarily this step offers tools for creating hypotheses by visualizing and interpreting the data through graphical representations such as charts, plots, histograms, etc. [ 72 , 91 ]. Before the data is ready for modeling, it is necessary to use data summarization and visualization to audit its quality and provide the information needed to process it. To ensure data quality, the data pre-processing technique, which is typically the process of cleaning and transforming raw data [ 107 ] before processing and analysis, is important. It also involves reformatting information, making data corrections, and merging datasets to enrich the data. Thus, several aspects, such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, searching for outliers or anomalies in the data and dealing with them, and ensuring data quality, could be the key considerations in this step.

Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building depends on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in “ Advanced analytics methods and smart computing ”. To best fit the data according to the type of analytics, different types of data-driven or machine learning models, summarized in our earlier paper Sarker et al. [ 105 ], can be built to achieve the goal. Data scientists typically separate the given dataset into training and test subsets, usually dividing it in a ratio such as 80:20, or splitting the data with the popular k -fold method [ 38 ]. This is to observe whether the model performs well on unseen data and to maximize model performance (a minimal end-to-end sketch covering pre-processing, splitting, and evaluation is given at the end of this section). Various model validation and assessment metrics, such as error rate, accuracy, true positives, false positives, true negatives, false negatives, precision, recall, f-score, ROC (receiver operating characteristic curve) analysis, applicability analysis, etc. [ 38 , 115 ], are used to measure model performance, which can guide data scientists in choosing or designing the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced techniques, such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms, to improve the ultimate data-driven model for solving a particular business problem through smart decision-making.

Data product and automation: A data product is typically the output of any data science activity [ 17 ]. A data product, in general terms, is a data deliverable, or a data-enabled or data-guided product, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in “ Real-world application domains ”, where various data products can play a significant role in relevant business problems to make them smart and automated.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. A central part of the data science process is developing a deep understanding of the business problem to be solved; without it, it would be much harder to gather the right data and extract the most useful information from the data for making decisions to solve the problem. In terms of role, “Data Scientists” typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms, focused on advanced analytics, which can make today’s computing process smarter and more intelligent, as discussed briefly in the following section.
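As a minimal end-to-end illustration of the modeling steps above (data pre-processing, an 80:20 train/test split, model building, and evaluation), the following sketch uses the widely available pandas and scikit-learn libraries on synthetic data; the column names and the churn rule used to generate labels are hypothetical assumptions for demonstration only, not a prescription:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw business data with a common quality issue (missing values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, 500).astype(float),
    "monthly_spend": rng.normal(100, 30, 500),
})
df["churn"] = ((df["monthly_spend"] > 120) | (df["age"] < 25)).astype(int)
df.loc[df.sample(frac=0.05, random_state=0).index, "monthly_spend"] = np.nan

# Data pre-processing: impute missing values before modeling
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Train/test split in an 80:20 ratio, as is common practice
X, y = df[["age", "monthly_spend"]], df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Model building and evaluation with standard metrics
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("f1-score:", f1_score(y_test, pred))
```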

Advanced Analytics Methods and Smart Computing

As mentioned earlier in “ Background and related work ”, basic analytics provides a summary of data, whereas advanced analytics takes a step forward in offering a deeper understanding of data and helps in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, “advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following, we first summarize the various types of analytics and the outcomes needed to solve the associated business problems, and then we briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions, such as “What happened?”, “Why did it happen?”, “What will happen in the future?”, and “What action should be taken?”, are common and important. Based on these questions, in this paper we categorize analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

Descriptive analytics: This is the interpretation of historical data to better understand the changes that have occurred in a business. Thus, descriptive analytics answers the question, “what happened in the past?”, by summarizing past data such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, LinkedIn, or Facebook. For instance, by analyzing trends, patterns, and anomalies through descriptive analytics, customers’ historical shopping data can be used to assess the likelihood of a customer purchasing a product. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business, and eventually adopt more effective management strategies and business decisions.

Diagnostic analytics: This is a form of advanced analytics that examines data or content to answer the question, “why did it happen?” The goal of diagnostic analytics is to help find the root cause of a problem. For example, the human resource management department of a business organization may use diagnostic analytics to find the best applicant for a position, select them, and compare them to applicants for other similar positions to see how well they perform. In a healthcare example, it might help to figure out whether patients’ symptoms, such as high fever, dry cough, headache, fatigue, etc., are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.

Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes, such as assessing business risks, anticipating potential market patterns, and deciding when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question, “what will happen in the future?” Thus, the primary goal of predictive analytics is to answer this question with a high degree of probability. Data scientists can use historical data as a source from which to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. Companies, for example, can use predictive analytics to minimize costs by better anticipating future demand and adjusting output and inventory; banks and other financial institutions can reduce fraud and risks by predicting suspicious activity; medical specialists can make effective decisions by predicting patients who are at risk of diseases; retailers can increase sales and customer satisfaction through understanding and predicting customer preferences; manufacturers can optimize production capacity by predicting maintenance requirements; and many more. Thus, predictive analytics can be considered the core analytical method within the area of data science.

Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, typically answering the question, “what action should be taken?” In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive analytics and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations make more informed decisions and produce results that drive the most successful business decisions.

In summary, to clarify what happened and why it happened, both descriptive analytics and diagnostic analytics look at the past. Historical data is used by predictive analytics and prescriptive analytics to forecast what will happen in the future and what steps should be taken to impact those effects. In Table 1 , we have summarized these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes in business processes and improvements. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from the data.

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of a machine learning-based predictive model covering both the training and testing phases. In the following, we discuss a wide range of methods, such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on, within the scope of our study.

Figure 3: A general structure of a machine learning-based predictive model considering both the training and testing phases.

Regression Analysis

In data science, one of the most common statistical approaches used for predictive modeling and data mining tasks is regression [ 38 ]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict continuous-valued output [ 105 , 117 ]. The following equations, Eqs. 1, 2, and 3 [ 85 , 105 ], represent simple, multiple or multivariate, and polynomial regression, respectively, where x represents an independent variable, y is the predicted/target output mentioned above, a is the intercept, the b terms are regression coefficients, and e is the error term:

$$y = a + bx + e \tag{1}$$

$$y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + e \tag{2}$$

$$y = a + b_1 x + b_2 x^2 + \cdots + b_n x^n + e \tag{3}$$

Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable, i.e., to find a relationship of causal influence between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem. In that case, polynomial regression performs better, though it increases model complexity. Regularization techniques such as Ridge, Lasso, Elastic-Net, etc. [ 85 , 105 ] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression techniques [ 85 , 105 ] can be used for building effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc., are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
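To illustrate, the sketch below fits the simple linear and polynomial forms of Eqs. (1) and (3) with scikit-learn on synthetic non-linear data; the polynomial degree and the Ridge regularization strength are illustrative assumptions rather than tuned choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y depends on x and x^2 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, (200, 1))
y = 2.0 + 1.5 * x[:, 0] + 0.8 * x[:, 0] ** 2 + rng.normal(0, 0.5, 200)

# Simple linear regression (Eq. 1) underfits the quadratic trend
linear = LinearRegression().fit(x, y)
print("linear R^2:", r2_score(y, linear.predict(x)))

# Polynomial regression (Eq. 3) with Ridge regularization to
# control the added model complexity
poly = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0)).fit(x, y)
print("poly   R^2:", r2_score(y, poly.predict(x)))
```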

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning that refers to a predictive modeling problem in which a class label is predicted for a given example [ 38 ]. Spam identification, such as ‘spam’ and ‘not spam’ in email service providers, is an example of a classification problem. Several forms of classification analysis are available in the area, such as binary classification—the prediction of one of two classes; multi-class classification—the prediction of one of more than two classes; and multi-label classification—a generalization of multi-class classification in which an example may be assigned multiple class labels simultaneously [ 105 ].

Figure 4: An example of a random forest structure considering multiple decision trees.

Several popular classification techniques, such as k-nearest neighbors [ 5 ], support vector machines [ 55 ], naive Bayes [ 49 ], adaptive boosting [ 32 ], extreme gradient boosting [ 85 ], logistic regression [ 66 ], the decision trees ID3 [ 92 ] and C4.5 [ 93 ], and random forests [ 13 ], exist to solve classification problems. Tree-based classification techniques, e.g., random forests built from multiple decision trees, often perform better than others on real-world problems due to their capability of producing logic rules [ 103 , 115 ]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [ 109 ], and IntrudTree [ 106 ] can be used for building effective classification or prediction models in relevant tasks within the domain of data science and analytics.
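As a small illustration of the random forest classifier discussed above, the following sketch trains a forest of decision trees on scikit-learn's bundled Iris dataset; the number of trees is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Bundled multi-class dataset: three species of iris flowers
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An ensemble of 100 decision trees voting on the class label
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Precision, recall, and f-score per class, as discussed above
print(classification_report(y_test, clf.predict(X_test)))
```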

Cluster Analysis

Clustering is a form of unsupervised machine learning technique and is well known in many data science application areas for statistical data analysis [ 38 ]. Usually, clustering techniques search for structures within a dataset and, where no classification is identified in advance, group homogeneous cases together. This means that data points are similar to each other within a cluster and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort data points into groups (or clusters) that are internally homogeneous and externally heterogeneous [ 105 ]. Clustering is often used to gain insight into how data is distributed in a given dataset, or as a preprocessing phase for other algorithms. Data clustering, for example, assists with analyzing customer shopping behavior, sales campaigns, and customer retention for retail businesses, anomaly detection, etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 98 , 138 , 141 ]. In our earlier paper Sarker et al. [ 105 ], we summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical methods, model-based methods, etc. In the literature, the popular K-means [ 75 ], K-medoids [ 84 ], CLARA [ 54 ], etc. are known as partitioning methods; DBSCAN [ 30 ], OPTICS [ 8 ], etc. are known as density-based methods; and single linkage [ 122 ], complete linkage [ 123 ], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [ 134 ], CLIQUE [ 2 ], etc.; model-based clustering methods, such as neural network learning [ 141 ], GMM [ 94 ], SOM [ 18 , 104 ], etc.; and constraint-based methods, such as COP K-means [ 131 ], CMWK-Means [ 25 ], etc., are used in the area. Recently, Sarker et al. [ 111 ] proposed a hierarchical clustering method, BOTS, based on a bottom-up agglomerative technique for capturing users’ similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas of data science.
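The following sketch contrasts a partitioning method (K-means) with a density-based method (DBSCAN), both named above, on synthetic blob data; the number of clusters and the DBSCAN parameters are illustrative assumptions:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three internally homogeneous groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=7)

# Partitioning method: K-means requires the number of clusters up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("k-means cluster labels:", set(kmeans.labels_))

# Density-based method: DBSCAN infers the cluster count and flags noise (-1)
dbscan = DBSCAN(eps=0.6, min_samples=5).fit(X)
print("dbscan cluster labels:", set(dbscan.labels_))
```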

Association Rule Analysis

Association rule learning is a rule-based machine learning approach; this unsupervised learning method is typically used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets to discover interesting relationships or patterns. The association learning technique’s main strength is its comprehensiveness, as it produces all associations that meet user-specified constraints, including minimum support and confidence values [ 138 ].

Association rules allow a data scientist to identify trends, associations, and co-occurrences between datasets within large data collections. In a supermarket, for example, associations capture knowledge about the buying behavior of consumers for different items, which helps to adjust the marketing and sales plan. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in the data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent-pattern based [ 4 , 47 , 73 ], logic-based [ 31 ], tree-based [ 39 ], fuzzy rules [ 126 ], belief rules [ 148 ], etc. Rule learning techniques such as AIS [ 3 ], Apriori [ 4 ], Apriori-TID and Apriori-Hybrid [ 4 ], FP-Tree [ 39 ], Eclat [ 144 ], and RARM [ 24 ] exist to solve the relevant business problems. Among the association rule learning techniques, Apriori [ 4 ] is the most commonly used algorithm for discovering association rules from a given dataset [ 145 ]. The recent association rule-learning technique ABC-RuleMiner, proposed in our earlier paper by Sarker et al. [ 113 ], can give significant results in terms of generating non-redundant rules that can be used for smart decision-making according to human preferences, within the area of data science applications.
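As a small illustration of Apriori-style rule mining, the sketch below uses the third-party mlxtend library (an assumption: it is a separate package, not part of scikit-learn) on a toy market-basket dataset; the support and confidence thresholds are illustrative:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Toy supermarket transactions
transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"],
                ["bread", "milk", "beer"]]

# One-hot encode the transactions for the Apriori algorithm
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets meeting a user-specified minimum support
frequent = apriori(df, min_support=0.4, use_colnames=True)

# Rules meeting a minimum confidence constraint
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```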

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [ 111 ]. Depending on the frequency, a time series can be of different types, such as annual (e.g., annual budget), quarterly (e.g., expenditure), monthly (e.g., air traffic), weekly (e.g., sales quantity), daily (e.g., weather), hourly (e.g., stock price), minute-wise (e.g., inbound calls in a call center), and even second-wise (e.g., web traffic), in the relevant domains.

A mathematical method dealing with such time-series data, i.e., the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [ 130 ] learns the behavioral trends or patterns of past data. The moving average (MA) [ 40 ] is another simple and common form of smoothing used in time-series analysis and forecasting, which uses past forecast errors in a regression-like model to elaborate an averaged trend across the data. The autoregressive moving average (ARMA) [ 12 , 120 ] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [ 12 , 120 ]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model covers the non-stationary case as well. Similarly, the seasonal autoregressive integrated moving average (SARIMA), the autoregressive fractionally integrated moving average (ARFIMA), and the autoregressive moving average model with exogenous inputs (ARMAX) are also used in time-series modeling [ 120 ].
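To illustrate the ARIMA model named above, the following sketch fits the statsmodels implementation to a synthetic monthly series; the (1, 1, 1) order is an illustrative assumption rather than a tuned choice:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with an upward trend plus noise (non-stationary)
rng = np.random.default_rng(3)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.linspace(100, 160, 48) + rng.normal(0, 3, 48),
                   index=idx)

# ARIMA(p=1, d=1, q=1): the differencing term (d=1) handles non-stationarity
model = ARIMA(series, order=(1, 1, 1))
fit = model.fit()

# Forecast the next six months
print(fit.forecast(steps=6))
```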

Figure 5: An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics.

In addition to the stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [ 111 ] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users mentioned above [ 111 ]. The authors in [ 118 ] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for time-series forecasting that outperforms traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any applied science and engineering domain with temporal measurements. Thus, it covers a wide range of application areas in data science.

Opinion Mining and Sentiment Analysis

Sentiment analysis or opinion mining is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, and topics, and their attributes [ 71 ]. There are three basic kinds of sentiment: positive, negative, and neutral, along with more specific feelings such as angry, happy, and sad, or interested versus not interested, etc. More refined sentiments for evaluating the feelings of individuals in various situations can also be defined according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain opinions from the public or customers about its products and services to refine business policy and make better business decisions. It can thus benefit a business to understand the social perception of its brand, product, or service. Besides, potential customers want to know what existing consumers think of a service or product before they use or purchase it. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [ 45 ].

Several popular techniques, such as lexicon-based methods (including dictionary-based and corpus-based approaches), machine learning (both supervised and unsupervised), deep learning, and hybrid methods, are used in sentiment analysis-related tasks [ 70 ]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, sentiment analysis incorporates the use of statistics, natural language processing (NLP), machine learning, and deep learning methods. It is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus, sentiment analysis has a big influence in many data science applications where public sentiment is involved in various real-world issues.
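As a minimal illustration of the lexicon-based approach mentioned above, the sketch below uses NLTK's VADER sentiment analyzer (an assumption: the NLTK package must be installed and its lexicon downloaded once); the review texts and the compound-score cutoffs are illustrative:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

reviews = ["The product is excellent and arrived early!",
           "Terrible support, I am very disappointed.",
           "It works."]

# The compound score ranges from -1 (most negative) to +1 (most positive)
for text in reviews:
    scores = sia.polarity_scores(text)
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05 else "neutral")
    print(f"{label:8s} {scores['compound']:+.2f}  {text}")
```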

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more [ 112 ]. Behavioral analysis aims to understand how and why consumers or users behave, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics, covering traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness, uses the large quantities of raw user event information gathered during sessions in which people use apps, games, or websites. In our earlier papers, Sarker et al. [ 101 , 111 , 113 ], we discussed how to extract users’ phone usage behavioral patterns utilizing real-life phone log data for various purposes.

In real-world scenarios, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications to find opportunities for optimization toward particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given dataset (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques, such as behavioral data clustering [ 111 ], behavioral decision tree classification [ 109 ], and behavioral association rules [ 113 ], can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [ 108 ], which takes into account recent behavioral patterns, can be effective when analyzing behavioral data, as such data is not static and changes over time in the real world.

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or observations that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, and exceptions [ 63 , 114 ]. Anomaly detection techniques may flag new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraudulent or irregular transactions in finance is an example of anomaly detection.

It is often used in preprocessing tasks for the removal of anomalous or inconsistent records from real-world data collected from various sources, including user logs, devices, networks, and servers. For anomaly detection, several machine learning techniques can be used, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [ 105 ]. The exclusion of anomalous data from the dataset can also result in a statistically significant improvement in accuracy during supervised learning [ 101 ]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior or irregularities, the sparse occurrence of abnormal events, environmental variations, etc., can be challenging in the process of anomaly detection. Anomaly detection can be applied in a variety of domains, such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more. It can thus be considered a significant task for building effective systems with higher accuracy within the area of data science.
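To illustrate, the sketch below applies scikit-learn's isolation forest, one of the techniques named above, to synthetic transaction amounts with a few injected outliers; the contamination rate is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly normal transaction amounts plus a few injected anomalies
rng = np.random.default_rng(5)
normal = rng.normal(100, 10, (500, 1))
outliers = np.array([[400.0], [5.0], [350.0]])
X = np.vstack([normal, outliers])

# Isolation forest labels inliers as +1 and anomalies as -1
iso = IsolationForest(contamination=0.01, random_state=5)
labels = iso.fit_predict(X)

print("flagged as anomalies:", X[labels == -1].ravel())
```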

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [ 23 ]. It is usually used to organize variables into a small number of clusters based on their common variance, using mathematical or statistical procedures. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, calculate the degree to which each variable is associated with the factors, and learn more about the existence of the factors by examining which factors contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [ 143 ].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex patterns by exploring the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [ 143 ]. Factor analysis is one of the unsupervised machine learning algorithms used for dimensionality reduction. The most common methods for factor analysis are principal components analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [ 48 ]. Correlation analysis methods, such as Pearson correlation and canonical correlation, may also be useful in the field, as they quantify the statistical relationship, or association, between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
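As a small illustration, the sketch below applies scikit-learn's FactorAnalysis (with PCA for comparison, both mentioned above) to synthetic data driven by two latent factors; the data-generating setup is an assumption for demonstration:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# Six observed variables generated from two hidden (latent) factors
rng = np.random.default_rng(9)
factors = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + rng.normal(0, 0.3, size=(300, 6))

# Recover two underlying factors from the observed variables
fa = FactorAnalysis(n_components=2).fit(X)
print("factor loadings shape:", fa.components_.shape)  # (2, 6)

# PCA as a comparison: variance explained by two components
pca = PCA(n_components=2).fit(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)
```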

Log Analysis

Logs are commonly used in system management, as logs are often the only data available that record detailed system runtime activities or behaviors in production [ 44 ]. Log analysis can thus be considered the method of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [ 88 , 110 ], SMS logs [ 28 ], mobile app usage logs [ 124 , 149 ], notification logs [ 77 ], game logs [ 82 ], context logs [ 16 , 149 ], web logs [ 37 ], smartphone life logs [ 95 ], etc., are some examples of log data for smartphone devices. The main characteristic of such log data is that it captures users’ actual behavioral activities with their devices. Other similar log data include search logs [ 50 , 133 ], application logs [ 26 ], server logs [ 33 ], network logs [ 57 ], event logs [ 83 ], and network and security logs [ 142 ].

Several techniques, such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, and machine learning modeling [ 105 ], can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by facilitating the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors, and Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take into account machine learning modeling can play a significant role in extracting insightful patterns from such log data, which can be used for building automated and smart applications; log analysis can thus be considered a key working area in data science.
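As a minimal illustration of log analysis, the sketch below parses web-server log lines in the common Apache format with Python's standard re module and counts HTTP status codes; the sample lines are fabricated for demonstration:

```python
import re
from collections import Counter

# Fabricated lines in the Apache common log format
logs = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.5 - - [10/Oct/2023:13:55:40 +0000] "POST /login HTTP/1.1" 401 512',
    '10.0.0.5 - - [10/Oct/2023:13:55:42 +0000] "POST /login HTTP/1.1" 401 512',
]

pattern = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                     r'"(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)')

# Count status codes, e.g., to spot repeated failed logins (401)
status_counts = Counter(m.group("status")
                        for line in logs if (m := pattern.match(line)))
print(status_counts)  # e.g., Counter({'401': 2, '200': 1})
```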

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [ 38 ]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [ 114 , 140 ].

The most common deep learning algorithms are the multi-layer perceptron (MLP) [ 85 ], the convolutional neural network (CNN or ConvNet) [ 67 ], and the long short-term memory recurrent neural network (LSTM-RNN) [ 34 ]. Figure 6 shows the structure of an artificial neural network model with multiple processing layers. The backpropagation technique [ 38 ] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [ 67 ] improve on the design of traditional artificial neural networks (ANNs) by including convolutional layers, pooling layers, and fully connected layers. They are commonly used in a variety of fields, including natural language processing, speech recognition, and image processing, and with other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [ 60 ], Xception [ 21 ], Inception [ 125 ], Visual Geometry Group (VGG) [ 42 ], ResNet [ 43 ], and other advanced deep learning models based on the CNN are also used in the field.

In addition to the CNN, the recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well suited for analyzing and learning from sequential data, such as classifying, processing, and making predictions based on time-series data. Therefore, when the data is in a sequential format, such as time, sentences, etc., LSTM can be used, and it is widely used in the areas of time-series analysis, natural language processing, speech recognition, and so on.

Figure 6: A structure of an artificial neural network model with multiple processing layers.

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [ 104 ] exist in the field for various purposes. The self-organizing map (SOM) [ 58 ], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, thereby reducing dimensionality. Another learning technique commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [ 10 ]. Restricted Boltzmann machines (RBMs) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling [ 46 ]. A deep belief network (DBN) is usually made up of unsupervised networks like restricted Boltzmann machines (RBMs) or autoencoders, together with a backpropagation neural network (BPNN) [ 136 ]. A generative adversarial network (GAN) [ 35 ] is a deep learning network that can produce data with characteristics similar to the input data. Transfer learning, which is usually the re-use of a pre-trained model on a new problem, is currently common because it can train deep neural networks with a small amount of data [ 137 ]. These deep learning methods can perform well, particularly when learning from large-scale datasets [ 105 , 140 ]. In our previous article, Sarker et al. [ 104 ], we summarized a brief discussion of the various artificial neural network (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.
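To illustrate the multi-layer structure (input, hidden, and output layers) discussed above, the following sketch builds a small MLP with TensorFlow's Keras API (an assumption: TensorFlow must be installed); the layer sizes, the synthetic labeling rule, and the training settings are illustrative:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# Synthetic binary classification data
rng = np.random.default_rng(11)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype("float32")

# Input layer, two hidden layers, and a sigmoid output layer;
# weights are adjusted internally via backpropagation during fit()
model = Sequential([
    Input(shape=(8,)),
    Dense(16, activation="relu"),
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)
print(f"training accuracy: {acc:.2f}")
```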

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “Data Science”, including advanced analytics with machine learning modeling, can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policy, and every possible industry where data is generated. In the following, we discuss the ten most popular application areas based on data science and analytics.

Business or financial data science: In general, business data science can be considered the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making as well as taking high-quality actions [ 90 ]. Data scientists can develop algorithms or data-driven models that predict customer behavior and identify patterns and trends based on historical business data, which can help companies to reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [ 76 ], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [ 105 ]. In terms of finance, historical data enables financial institutions to make high-stakes business decisions and is mostly used for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, etc. Overall, data science methodologies can play a key role in the next generation of the business and finance industry, particularly in terms of business automation, intelligence, and smart decision-making and systems.

Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [14]. The latest, fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus industrial data science, the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in this revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [6, 68]. The main categories of industrial data include large-scale device data, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [132]. This data needs to be processed, analyzed, and secured to help improve a system's efficiency, safety, and scalability. Data science modeling can thus be used to maximize production, reduce costs, and raise profits in manufacturing industries.
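
One common analysis of industrial sensor data is unsupervised anomaly detection, which can flag possible equipment faults before they disrupt production. The minimal sketch below, assuming scikit-learn, applies an isolation forest to simulated temperature readings; the data and the contamination rate are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=50.0, scale=2.0, size=(980, 1))  # e.g., routine temperature readings
faults = rng.normal(loc=75.0, scale=5.0, size=(20, 1))   # injected sensor faults
readings = np.vstack([normal, faults])

detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit_predict(readings)  # -1 marks anomalies, 1 marks normal points

print("flagged anomalies:", int((labels == -1).sum()))
```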

Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves extracting actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be analyzed from several sources, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys. In practice, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, prevent avoidable diseases, and generally improve the quality of life [81, 119]. Across the global population, the average human lifespan is growing, presenting new challenges to today's methods of care delivery. Thus health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and better monitor the spread of diseases. Eventually, it may lead to new approaches for improving patient care, clinical expertise, diagnosis, and management.
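
As a small, hedged example of health predictive modeling, the sketch below fits a logistic regression classifier, assuming scikit-learn; the bundled breast cancer dataset merely stands in for real electronic health records, which would require proper governance and clinical validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling followed by a linear classifier: a common, interpretable baseline
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```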

IoT data science: The internet of things (IoT) [9] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [112]. One of the IoT's main fields of application is the smart city, which uses technology to improve city services and citizens' living experiences. For example, given the relevant data, data science methods can be used for traffic prediction in smart cities, or to estimate citizens' total energy usage for a particular period. Deep learning-based models in data science can be built on large-scale IoT datasets [7, 104]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, industry, and many others.
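
To illustrate the energy-usage estimation mentioned above, the following sketch forecasts hourly consumption from simple lag features, assuming scikit-learn; the simulated series with a daily cycle is a stand-in for real smart-meter or IoT data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
hours = np.arange(24 * 60)  # 60 days of hourly readings
usage = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 3, hours.size)

# Predict the next hour from the previous 24 hours (lag features)
lags = 24
X = np.array([usage[i : i + lags] for i in range(usage.size - lags)])
y = usage[lags:]

split = int(0.8 * len(X))  # train on the earlier portion, test on the later one
model = GradientBoostingRegressor().fit(X[:split], y[:split])
print("test R^2:", model.score(X[split:], y[split:]))
```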

Cybersecurity data science: Cybersecurity, or the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [114, 121]. Data science techniques, particularly machine learning, have become a crucial cybersecurity technology that continually learns to identify trends by analyzing data: better detecting malware in encrypted traffic, finding insider threats, predicting where bad neighborhoods are online, keeping people safe while browsing, or protecting information in the cloud by uncovering suspicious user activity [114]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [103, 106]. To generate security policy rules, association rule learning can play a significant role in building rule-based systems [102]. Deep learning-based security models can perform better when utilizing large-scale security datasets [140]. Thus data science modeling can enable cybersecurity professionals to be more proactive in preventing threats and to react in real time to active attacks, by extracting actionable insights from security datasets.
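
A minimal sketch of such machine learning-based attack detection is given below, assuming scikit-learn; the flow features, the labeling rule, and the data itself are simulated stand-ins for a real intrusion dataset such as those cited in this paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
# Hypothetical flow features: [duration, bytes_sent, bytes_received, failed_logins]
X = rng.random((1000, 4))
y = (X[:, 3] > 0.9).astype(int)  # placeholder rule: many failed logins => attack

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```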

Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of internet-connected devices, such as a PC, tablet, or smartphone [112]. Websites, mobile applications, marketing automation systems, call centers, help desks, and billing systems are all common sources of behavioral data. Unlike conventional records, behavioral data is not static; it evolves over time [108]. Advanced analytics of this data, including machine learning modeling, can help in several areas: predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences in future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; detecting compromised credentials and insider threats by locating anomalous behavior; making suggestions; and so on. Overall, behavioral data science modeling typically makes it possible to present the right offers to the right consumers at the right time on various common platforms such as e-commerce platforms, online games, web and mobile applications, and the IoT. In a social context, analyzing human behavioral data using advanced analytics methods, and the insights extracted from social data, can support data-driven intelligent social services, which can be considered social data science.
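
As one hedged example of the cohort analysis mentioned above, the sketch below clusters users into behavioral segments with k-means, assuming scikit-learn; the per-user features and their value ranges are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Hypothetical per-user features: [sessions_per_week, avg_basket_value, days_since_last_visit]
X = rng.random((300, 3)) * [20, 200, 60]

X_scaled = StandardScaler().fit_transform(X)  # put features on a common scale
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

# Each cluster is a behavioral cohort that can receive targeted offers
for c in range(4):
    print(f"cohort {c}: {int((kmeans.labels_ == c).sum())} users")
```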

Mobile data science: Today's smart mobile phones are considered "next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity" [146]. In our earlier paper [112], we showed that users' interest in "Mobile Phones" has grown beyond that in platforms like "Desktop Computer", "Laptop Computer", or "Tablet Computer" in recent years. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, internet surfing, entertainment, social media such as Facebook, LinkedIn, and Twitter, and various IoT services such as smart city, health, and transportation services, among many others. Intelligent apps are built on insights extracted from relevant datasets and share characteristics such as being action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-aware, and capable of cross-platform operation [112]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.

Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as images, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter are considered valuable sources of multimedia big data [89]. People, particularly younger generations, spend a lot of time on the internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image or video processing, computer vision, audio or speech processing, and database management are among the solutions available for a range of applications including healthcare, education, entertainment, and mobile devices.

Smart cities or urban data science: Today, more than half of the world's population lives in urban areas or cities [80], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [96, 116]. In addition to cities, "urban area" can refer to surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting the daily events, perceptions, thoughts, and emotions of citizens is recorded. This data can be loosely categorized into personal data, e.g., household, education, employment, health, immigration, crime, etc.; proprietary data, e.g., banking, retail, online platform data, etc.; government data, e.g., citywide crime statistics or data from government institutions; open and public data, e.g., data.gov and the Ordnance Survey; and organic and crowdsourced data, e.g., user-generated web data, social media, Wikipedia, etc. [29]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective, through extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [105] can facilitate the efficient management of urban areas, including real-time management, e.g., traffic flow management; evidence-based planning decisions, which pertain to the longer-term strategic role of forecasting for urban planning, e.g., crime prevention, public safety, and security; or framing the future, e.g., political decision-making [29]. Overall, it can contribute to government and public planning, as well as relevant sectors including retail, financial services, mobility, health, policing, and utilities within a data-rich urban environment, through data-driven smart decision-making and policies that lead to smart cities and improve the quality of human life.

Smart villages or rural data science: Rural areas or the countryside are the opposite of urban areas and include villages, hamlets, and agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions, including protecting public safety, providing critical health services, supporting agriculture, and fostering economic development, from a data-driven perspective, through extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data, including machine learning modeling [105], can provide new opportunities for rural communities to build insights and capacity to meet current needs and prepare for their futures. For instance, machine learning modeling [105] can help farmers enhance their decisions to adopt sustainable agriculture, utilizing the increasing amount of data captured by emerging technologies, e.g., the internet of things (IoT), mobile technologies and devices, etc. [1, 51, 52]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, and other services, leading to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector of real-world life, wherever the relevant data is available to analyze. Gathering the right data and extracting useful knowledge or actionable insights from it for making smart decisions is the key to data science modeling in any application domain. Based on our discussion of the above ten potential real-world application domains, taking into account data-driven smart computing and decision making, we can say that the prospects of data science and the role of data scientists are huge for the future world. "Data scientists" typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical models, algorithms, data-driven tools, or solutions, focused on advanced analytics, which can make today's computing processes smarter, automated, and intelligent.

Challenges and Research Directions

Our study on data science and analytics, particularly data science modeling in "Understanding data science modeling", advanced analytics methods and smart computing in "Advanced analytics methods and smart computing", and real-world application areas in "Real-world application domains", opens several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced, as well as the potential research opportunities and future directions for building data-driven products.

Understanding the real-world business problems and the associated data, including their nature, e.g., form, type, size, labels, etc., is the first challenge in data science modeling, discussed briefly in "Understanding data science modeling". This means identifying, specifying, representing, and quantifying the domain-specific business problems and data according to the requirements. For a data-driven effective business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, such as structured or unstructured data, related to a specific business issue with legal access, which varies from application to application, is challenging. Moreover, data annotation, typically the process of categorizing, tagging, or labeling raw data for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.

The next challenge is the extraction of relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [101]. The performance of the advanced analytics methods, including the machine and deep learning modeling discussed in "Advanced analytics methods and smart computing", depends heavily on the quality and availability of the data. Thus, it is important to understand the real-world business scenario and the associated data, including whether, how, and why they are insufficient, missing, or problematic, and then to extend or redevelop existing methods, such as large-scale hypothesis testing and learning under inconsistency and uncertainty, to address the complexities in the data and business problems. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.

Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making for a particular business problem is the main issue in the area of data science. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offer a general description of data, while advanced analytics is a step forward, offering a deeper understanding of data and enabling granular data analysis. Thus, understanding the advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in "Advanced analytics methods and smart computing" may not be directly applicable for the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [4] may produce redundant rules from the data, which makes the decision-making process complex and ineffective [113]. Thus, a scientific understanding of the learning algorithms and their mathematical properties, and of how robust or fragile the techniques are to input data, is needed. Therefore, a deeper understanding of the strengths and drawbacks of the existing machine and deep learning methods [38, 105] for solving a particular business problem is needed; consequently, improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms/techniques with higher accuracy, becomes a significant challenge for future generations of data scientists.
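
To make the redundancy issue concrete, the following self-contained sketch (plain Python, no external libraries) mines association rules from a toy transaction set and prunes them with simple confidence and lift thresholds; the transactions and thresholds are illustrative assumptions, not a reconstruction of any specific method cited above.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / n

items = sorted(set().union(*transactions))
all_rules, kept_rules = [], []
for size in (2, 3):
    for combo in combinations(items, size):
        s = support(set(combo))
        if s < 0.4:  # minimum support threshold
            continue
        for k in range(1, size):
            for lhs in combinations(combo, k):
                rhs = tuple(i for i in combo if i not in lhs)
                conf = s / support(set(lhs))
                lift = conf / support(set(rhs))
                all_rules.append((lhs, rhs))
                if conf >= 0.7 and lift > 1.0:  # prune weak or redundant rules
                    kept_rules.append((lhs, rhs, round(conf, 2), round(lift, 2)))

print(f"{len(all_rules)} candidate rules, {len(kept_rules)} kept after pruning")
for lhs, rhs, conf, lift in kept_rules:
    print(f"{set(lhs)} -> {set(rhs)} (conf={conf}, lift={lift})")
```

Even on five baskets, most candidate rules fail the pruning step, which hints at how quickly redundant rules accumulate on real data.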

Traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, recent trends are often more interesting and useful for modeling and predicting the future than older ones, for example, in smartphone user behavior modeling, IoT services, stock market forecasting, health or transport services, job market analysis, and other areas where time series and actual human interests or preferences are involved over time. Thus, rather than relying on traditional data analysis, the concept of RecencyMiner, i.e., insight or knowledge extracted from recent patterns, proposed in our earlier paper Sarker et al. [108], might be effective. Therefore, proposing new techniques that take recent data patterns into account, and consequently building recency-based data-driven models for solving real-world problems, is another significant challenge in the area.
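
As one simple illustration of favoring recent patterns (a generic recency weighting, not the RecencyMiner algorithm itself), the sketch below, assuming scikit-learn, down-weights older training samples with an exponential decay; the data, features, and half-life are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 500
X = rng.random((n, 3))
y = rng.integers(0, 2, n)
age_days = np.linspace(365, 0, n)  # oldest sample first, newest last

# Exponential decay: a sample one half-life old counts half as much
half_life = 30.0
weights = 0.5 ** (age_days / half_life)

model = LogisticRegression().fit(X, y, sample_weight=weights)
print("weight of newest vs. oldest sample:", weights[-1], "vs.", round(weights[0], 6))
```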

The most crucial task for a data-driven smart system is to create a framework that supports the data science modeling discussed in "Understanding data science modeling". Advanced analytical methods based on machine learning or deep learning techniques can then be considered in such a system to make the framework capable of resolving the issues. Besides, incorporating contextual information such as temporal, spatial, social, and environmental context [100] can help build an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, together with experimental evaluation, is a very important direction for effectively solving a business problem in a particular domain, as well as a big challenge for data scientists.

In several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, decisions made by models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [104]. If we can explain a result in a meaningful way, then the model can be better trusted by the end-user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.

In the above, we have summarized and discussed several challenges and the potential research opportunities and directions, within the scope of our study in the area of data science and advanced analytics. The data scientists in academia/industry and the researchers in the relevant area have the opportunity to contribute to each issue identified above and build effective data-driven models or systems, to make smart decisions in the corresponding business domains.

Conclusion

In this paper, we have presented a comprehensive view of data science, including various types of advanced analytical methods that can be applied to enhance the intelligence and capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from related terms used in the area, to position this paper. We have provided a thorough study of data science modeling and the various processing modules needed to extract actionable insights from data for a particular business problem and the eventual data product. Thus, according to our goal, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. For this, we have also summarized various types of advanced analytical methods and outcomes, as well as machine learning modeling, that are needed to solve the associated business problems. This study's key contribution has thus been identified as the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, urban and rural data science, and so on, taking into account data-driven smart computing and decision making.

Finally, within the scope of our study, we have outlined and discussed the challenges we faced, as well as possible research opportunities and future directions. The challenges identified provide promising research opportunities in the field that can be explored with effective solutions to improve data-driven models and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods points in a positive direction and can be used as a reference guide for future research and applications in the field of data science and its real-world applications by both academia and industry professionals.

References

Adnan N, Nordin SM, Rahman I, Noor A. The effects of knowledge transfer on farmers decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data. 1998. p. 94–105.

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD record, vol 22. ACM. 1993. p. 207–16.

Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the international joint conference on very large data bases, Santiago, Chile, vol 1215. 1994. p. 487–99.

Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Al-Abassi A, Karimipour H, HaddadPajouh H, Dehghantanha A, Parizi RM. Industrial big data analytics: challenges and opportunities. In: Handbook of big data privacy. Springer; 2020. p. 37–61.

Al-Garadi MA, Mohamed A, Al-Ali AK, Du X, Ali I, Guizani M. A survey of machine and deep learning methods for internet of things (iot) security. IEEE Commun Surv Tutor. 2020;22(3):1646–85.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Rec. 1999;28(2):49–60.

Atzori L, Iera A, Morabito G. The internet of things: a survey. Comput Netw. 2010;54(15):2787–805.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning. 2012. p. 37–49.

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Box GEP, Jenkins GM, Reinsel GC, Ljung GM. Time series analysis: forecasting and control. New York: Wiley; 2015.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Brettel M, Friederichsen N, Keller M, Rosenberg M. How virtualization, decentralization and network building change the manufacturing landscape: an industry 4.0 perspective. FormaMente 2017;12.

Canadian institute of cybersecurity. University of new Brunswick, iscx dataset. http://www.unb.ca/cic/datasets/index.html/ . Accessed 20 Oct 2019.

Cao H, Bao T, Yang Q, Chen E, Tian J. An effective approach for mining mobile user habits. In: Proceedings of the international conference on information and knowledge management, Toronto, ON, Canada, 26–30 October. New York: ACM; 2010. p. 1677–80.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):1–42.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Cervone HF. Informatics and data science: an overview for the information professional. Digital Library Perspectives. 2016.

Chessel A. An overview of data science uses in bioimage informatics. Methods. 2017;115:110–8.

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. p. 1251–58.

Cic-ddos2019 [online]. https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2020.

Cudeck R. Exploratory factor analysis. In: Handbook of applied multivariate statistics and mathematical modeling. Elsevier; 2000. p. 265–96.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management. ACM; 2001. p. 474–481.

de Amorim V. Constrained clustering with Minkowski weighted k-means. In: 2012 IEEE 13th international symposium on computational intelligence and informatics (CINTI). IEEE. 2012. p. 13–17.

Dev H, Liu Z. Identifying frequent user tasks from application logs. In: Proceedings of the 22nd international conference on intelligent user interfaces. 2017. p. 263–73.

Donoho D. 50 years of data science. J Comput Graph Stat. 2017;26(4):745–66.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Pers Ubiquitous Comput. 2006;10(4):255–68.

Engin Z, van Dijk J, Lan T, Longley PA, Treleaven P, Batty M, Penn A. Data-driven urban management: mapping the landscape. J Urban Manag. 2020;9(2):140–50.

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, vol 96. Citeseer; 1996. p. 148–156.

Ghavare P, Ahire P. Big data classification of users navigation and behavior using web server logs. In: 2018 fourth international conference on computing communication control and automation (ICCUBEA). IEEE. 2018. p. 1–6.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning, vol. 1. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014. p. 2672–80.

Google trends. 2019. https://trends.google.com/trends/ .

Halvey M, Keane MT, Smyth B. Time based segmentation of log data for user navigation prediction in personalization. In: Proceedings of the international conference on web intelligence, Compiegne, France, 19–22 September. Washington, DC: IEEE Computer Society; 2005. p. 636–40.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, vol 29. ACM; 2000. p. 1–12.

Hansun S. A new approach of moving average method in time series analysis. In: 2013 conference on new media studies (CoNMedia). IEEE; 2013. p. 1–4.

Harmon SA, Sanford TH, Xu S, Turkbey EB, Holger R, Ziyue X, Dong Y, Andriy M, Victoria A, Amel A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–78.

He P, Zhu J, He S, Li J, Lyu MR. Towards automated log parsing for large-scale log data analysis. IEEE Trans Dependable Secure Comput. 2017;15(6):931–44.

Hemmatian F, Sohrabi MK. A survey on classification techniques for opinion mining and sentiment analysis. In: Artificial intelligence review. 2019. p. 1–51.

Hinton GE. A practical guide to training restricted Boltzmann machines. In: Neural networks: tricks of the trade. Springer; 2012. p. 599–619.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the eleventh international conference on data engineering. IEEE; 1995. p. 25–33.

Howard MC. A review of exploratory factor analysis decisions and overview of current practices: what we are doing and how can we improve? Int J Hum Comput Interact. 2016;32(1):51–62.

John GH, Langley P. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–45.

Kacprzak E, Koesten L, Ibá nez L-D, Blount T, Tennison J, Simperl E. Characterising dataset search-an analysis of search logs and data requests. J Web Semant. 2019;55:37–55.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Prot. 2018;117:408–425.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Karpatne A, Atluri G, Faghmous JH, Steinbach M, Banerjee A, Ganguly A, Shekhar S, Samatova N, Kumar V. Theory-guided data science: a new paradigm for scientific discovery from data. IEEE Trans Knowl Data Eng. 2017;29(10):2318–31.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. New York: Wiley; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to Platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 fourth international conference on computing communication control and automation (ICCUBEA). IEEE; 2018. p. 1–6.

Kimura T, Watanabe A, Toyono T, Ishibashi K. Proactive failure detection learning generation patterns of large-scale network logs. IEICE Trans Commun. 2018.

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Future Gener Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. 2012. p. 1097–1105.

Krukovets D, et al. Data science opportunities at central banks: overview. Visnyk Natl Bank Ukr. 2020;249:13–24.

Kulin M, Fortuna C, De Poorter E, Deschrijver D, Moerman I. Data-driven design of intelligent wireless networks: an overview and tutorial. Sensors. 2016;16(6):790.

Kwon D, Kim H, Kim J, Suh SC, Kim I, Kim KJ. A survey of deep learning-based network anomaly detection. Cluster Comput. 2019;22(1):949–61.

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Larson D, Chang V. A review and future direction of agile, business intelligence, analytics and data science. Int J Inf Manag. 2016;36(5):700–10.

Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Applied Statistics). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Lee J, Bagheri B, Kao H-A. Recent advances and trends of cyber-physical systems and big data analytics in industrial informatics. In: International proceeding of int conference on industrial informatics (INDIN). 2014. p. 1–6.

Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.

Li Z, Fan Y, Jiang B, Lei T, Liu W. A survey on sentiment analysis and opinion mining for social multimedia. Multimed Tools Appl. 2019;78(6):6939–67.

Liu B. Sentiment analysis: mining opinions, sentiments, and emotions. Cambridge: Cambridge University Press; 2020.

Liu J, Tang T, Wang W, Bo X, Kong X, Xia F. A survey of scholarly data visualization. IEEE Access. 2018;6:19205–21.

Ma B, Liu W, Hsu Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining. 1998.

Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19(12):798–808.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, vol 1. 1967. p. 281–297.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the international joint conference on pervasive and ubiquitous computing, Heidelberg, 12–16 September, ACM, New York. 2016. p. 1223–1234.

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS). IEEE. 2015. p. 1–6.

Nations U. Revision of world urbanization prospects. New York: United Nations; 2018.

Nilashi M, Ibrahim O, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Paireekreng W, Rapeepisarn K, Wong KW. Time-based personalised mobile game downloading. In: Transactions on edutainment II. 2009. p. 59–69.

Pan Y, Zhang L, Li Z. Mining event logs for knowledge discovery based on adaptive efficient fuzzy Kohonen clustering network. Knowl Based Syst. 2020:209.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Peyré G, Cuturi M, et al. Computational optimal transport: with applications to data science. Found Trends Mach Learn. 2019;11(5–6):355–607.

Phithakkitnukoon S, Dantu R, Claxton R, Eagle N. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Pouyanfar S, Yang Y, Chen S-C, Shyu M-L, Iyengar SS. Multimedia big data analytics: a survey. ACM Comput Surv (CSUR). 2018;51(1):1–34.

Provost F, Fawcett T. Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc.; 2013.

Qin X, Luo Y, Tang N, Li G. Making data visualization more efficient and effective: a survey. VLDB J. 2020;29(1):93–117.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite Gaussian mixture model. Adv Neural Inf Process Syst. 1999;12:554–60.

Rawassizadeh R, Tomitsch M, Wac K, Tjoa AM. Ubiqlog: a generic mobile phone-based life-log framework. Pers Ubiquitous Comput. 2013;17(4):621–37.

Resch B, Szell M. Human-centric data science for urban studies. 2019.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook. Springer; 2010. p. 269–298.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Cyberlearning: effectiveness analysis of machine learning security modeling to detect cyber-anomalies and multi-attacks. Internet Things. 2021:100393.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2(3):1–21.

Sarker IH, Abushark YB, Alsolami F, Khan AI. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Alqahtani H, Alsolami F, Khan AI, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Colman A, Han J. Recencyminer: mining recency-based personalized behavior from contextual smartphone data. J Big Data. 2019;6(1):1–21.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2020;25(3):1151–61.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing (Ubicomp): adjunct, Germany. ACM. 2016. p. 630–634.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J. 2018;61(3):349–68.

Sarker IH, Hoque MM, Uddin MK, Alsanoosy T. Mobile data science and intelligent apps: Concepts, ai-based modeling and research directions. Mob Netw Appl. 2020:1–19.

Sarker IH, Kayes ASM. Abc-ruleminer: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020:102762.

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Kayes ASM, Watters P. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Schläpfer M, Bettencourt LMA, Grauwin S, Raschke M, Claxton R, Smoreda Z, West GB, Ratti C. The scaling of human interactions with city size. J R Soc Interface. 2014;11(98):20130789.

Shukla N, Fricklas K. Machine learning with TensorFlow. Greenwich: Manning; 2018.

Siami-Namini S, Tavakoli N, Namin AS. A comparison of arima and lstm in forecasting time series. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE. 2018. p. 1394–1401.

Silahtaroğlu G, Yılmaztürk N. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10.

Silvestrini A, Veredas D. Temporal aggregation of univariate and multivariate time series models: a survey. J Econ Surv. 2008;22(3):458–97.

Ślusarczyk B. Industry 4.0: are we ready? Pol J Manag Stud. 2018:17.

Sneath PHA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol. Skr. 1948:5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the international joint conference on pervasive and ubiquitous computing, Seattle, WA, USA, 13–17 September. New York: ACM; 2014. p. 389–400

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. p. 1–9.

Tajbakhsh A, Rahmati M, Mirzaei A. Intrusion detection using fuzzy association rules. Appl Soft Comput. 2009;9(2):462–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE. 2009. p. 1–6.

Tsagkias M, Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR forum, vol 54. New York: ACM; 2021. p. 1–23.

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big Data. 2015;2(1):1–32.

Tuncel KS, Baydogan MG. Autoregressive forests for multivariate time series modeling. Pattern Recognit. 2018;73:202–15.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. ICML. 2001;1:577–84.

Wang J, Zhang W, Shi Y, Duan S, Liu J. Industrial big data analytics: challenges, methodologies, and applications. 2018. arXiv:1807.01016 .

Wang L, Zhang J, Chen G, Qiao D. Identifying comparable entities with indirectly associative relations and word embeddings from web search logs. Decis Support Syst. 2021:141.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Waskom ML. Seaborn: statistical data visualization. J Open Source Softw. 2021;6(60):3021.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big Data. 2016;3(1):9.

Witten IH, Frank E. Data mining: practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Ya J, Liu T, Li Q, Shi J, Zhang H, Lv P, Guo L. Mining host behavior patterns from massive network and security logs. Proc Comput Sci. 2017;108:38–47.

Yong AG, Pearce S, et al. A beginner’s guide to factor analysis: Focusing on exploratory factor analysis. Tutor Quant Methods Psychol. 2013;9(2):79–94.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng P, Ni LM. Spotlight: the rise of the smart phone. IEEE Distrib Syst Online. 2006;7(3):3.

Zheng T, Xie W, Liling X, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zhou Z-J, Hu G-Y, Hu C-H, Wen C-L, Chang L-L. A survey of belief rule-base expert system. IEEE Trans Syst Man Cybern Syst. 2019.

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 international conference on internet of things and intelligent applications (ITIA). IEEE. 2020. p. 1–7.

Author information

Authors and affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349, Bangladesh

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

About this article

Sarker, I.H. Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective. SN COMPUT. SCI. 2 , 377 (2021). https://doi.org/10.1007/s42979-021-00765-8

Download citation

Received : 09 August 2019

Accepted : 02 July 2021

Published : 12 July 2021

DOI : https://doi.org/10.1007/s42979-021-00765-8

Keywords

  • Data science
  • Advanced analytics
  • Machine learning
  • Deep learning
  • Smart computing
  • Decision-making
  • Predictive analytics
  • Data science applications

The use of Big Data Analytics in healthcare

Kornelia Batko

1 Department of Business Informatics, University of Economics in Katowice, Katowice, Poland

Andrzej Ślęzak

2 Department of Biomedical Processes and Systems, Institute of Health and Nutrition Sciences, Częstochowa University of Technology, Częstochowa, Poland

Associated Data

The datasets for this study are available on request to the corresponding author.

Abstract

The introduction of Big Data Analytics (BDA) in healthcare will allow the use of new technologies both in the treatment of patients and in health management. The paper aims at analyzing the possibilities of using Big Data Analytics in healthcare. The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities. The direct research was carried out based on a research questionnaire and conducted on a sample of 217 medical facilities in Poland. Literature studies have shown that the use of Big Data Analytics can bring many benefits to medical facilities, while direct research has shown that medical facilities in Poland are moving towards data-based healthcare: they use structured and unstructured data and reach for analytics in the administrative, business, and clinical areas. The research positively confirmed that medical facilities work on both structured and unstructured data. The following kinds and sources of data can be distinguished: data from databases, transaction data, unstructured content of emails and documents, and data from devices and sensors; the use of data from social media is lower. In their activity, facilities reach for analytics not only in the administrative and business areas but also in the clinical area. This clearly shows that the decisions made in medical facilities are highly data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare, together with its benefits.

Introduction

The main contribution of this paper is to present an analytical overview of the use of structured and unstructured data (Big Data) analytics in medical facilities in Poland. Medical facilities use both structured and unstructured data in their practice. Structured data has a predetermined schema [27]. In contrast, unstructured data, referred to as Big Data (BD), is extensive and freeform, comes in a variety of forms, and does not fit into the typical data processing format. Big Data is a massive amount of data sets that cannot be stored, processed, or analyzed using traditional tools; it remains stored but not analyzed. Due to the lack of a well-defined schema, it is difficult to search and analyze such data, and it therefore requires specific technologies and methods to transform it into value [20, 68]. Integrating data stored in both structured and unstructured formats can add significant value to an organization [27]. Organizations must approach unstructured data in a different way, and the potential is therefore seen in Big Data Analytics (BDA). Big Data Analytics comprises the techniques and tools used to analyze and extract information from Big Data. The results of Big Data analysis can be used to predict the future and to identify trends in past data. In healthcare, it allows large datasets from thousands of patients to be analyzed, identifying clusters and correlations between datasets, as well as developing predictive models using data mining techniques [60].

This paper is the first study to consolidate and characterize the use of Big Data from different perspectives. The first part consists of a brief literature review of studies on Big Data (BD) and Big Data Analytics (BDA), while the second part presents results of direct research aimed at diagnosing the use of big data analyses in medical facilities in Poland.

Healthcare is a complex system with varied stakeholders: patients, doctors, hospitals, pharmaceutical companies, and healthcare decision-makers. The sector is also bound by strict rules and regulations. However, worldwide one may observe a departure from the traditional doctor-patient approach: the doctor becomes a partner and the patient is involved in the therapeutic process [14]. Healthcare is no longer focused solely on the treatment of patients. The priority for decision-makers should be to promote proper health attitudes and prevent diseases that can be avoided [81]. This became visible and especially important during the Covid-19 pandemic [44].

The next challenges that healthcare will have to face are the growing number of elderly people and a decline in fertility. Fertility rates in the country are below the reproductive minimum necessary to keep the population stable [10]. Both effects, rising age and lower fertility rates, are reflected in the demographic load indicator, which is constantly growing. Forecasts show that providing healthcare in the form it is provided today will become impossible within the next 20 years [70]. This is especially visible now, during the Covid-19 pandemic, when healthcare has faced a considerable challenge related to analyzing huge amounts of data and the need to identify trends and predict the spread of the coronavirus. The pandemic made it even clearer that patients should have access to information about their health condition, the possibility of digital analysis of this data, and access to reliable medical support online. Health monitoring and cooperation with doctors in order to prevent diseases can actually revolutionize the healthcare system. One of the most important aspects of the necessary change in healthcare is putting the patient at the center of the system.

Technology alone is not enough to achieve these goals. Therefore, changes should be made not only at the technological level but also in the management and design of complete healthcare processes; moreover, they should affect the business models of service providers. The use of Big Data Analytics is becoming more and more common in enterprises [17, 54]. However, medical enterprises still cannot keep up with the information needs of patients, clinicians, administrators, and policymakers. The adoption of a Big Data approach would allow the implementation of personalized and precise medicine based on personalized information, delivered in real time and tailored to individual patients.

To achieve this goal, it is necessary to implement systems that can learn quickly from the data generated by people within clinical care and everyday life. This will enable data-driven decision making; better personalized predictions about prognosis and responses to treatments; a deeper understanding of the complex factors, and their interactions, that influence health at the level of the patient, the health system, and society; enhanced approaches to detecting safety problems with drugs and devices; and more effective methods of comparing prevention, diagnostic, and treatment options [40].

In the literature, there is a lot of research showing what opportunities big data analysis can offer companies and what data can be analyzed. However, there are few studies showing how data analysis is performed in the area of healthcare, what data is used by medical facilities, and what analyses they carry out and in which areas. This paper aims to fill this gap by presenting the results of research carried out in medical facilities in Poland. The goal is to analyze the possibilities of using Big Data Analytics in healthcare, especially in Polish conditions. In particular, the paper is aimed at determining what data is processed by medical facilities in Poland, what analyses they perform and in what areas, and how they assess their analytical maturity. In order to achieve this goal, a critical analysis of the literature was performed, and the direct research was based on a research questionnaire conducted on a sample of 217 medical facilities in Poland. It was hypothesized that medical facilities in Poland work on both structured and unstructured data and are moving towards data-based healthcare and its benefits. Examining the maturity of healthcare facilities in the use of Big Data and Big Data Analytics is crucial in determining the potential future benefits that the healthcare sector can gain from Big Data Analytics. There is also a pressing need to predict whether, in the coming years, healthcare will be able to cope with the threats and challenges it faces.

This paper is divided into eight parts. The first is the introduction, which provides the background and general problem statement of this research. The second part discusses considerations on the use of Big Data and Big Data Analytics in healthcare, and the third moves on to the challenges and potential benefits of using Big Data Analytics in healthcare. The next part explains the proposed method. The results of the direct research and the discussion are presented in the fifth part, while the following part is the conclusion. The seventh part presents practical implications. The final section provides limitations and directions for future research.

Considerations on the use of Big Data and Big Data Analytics in healthcare

In recent years one can observe a constantly increasing demand for solutions offering effective analytical tools. This trend is also noticeable in the analysis of large volumes of data (Big Data, BD). Organizations are looking for ways to use the power of Big Data to improve their decision making, competitive advantage or business performance [ 7 , 54 ]. Big Data is considered to offer potential solutions to public and private organizations; however, still not much is known about the outcomes of the practical use of Big Data in different types of organizations [ 24 ].

As already mentioned, in recent years, healthcare management worldwide has been changing from a disease-centered model to a patient-centered model, and even to a value-based healthcare delivery model [ 68 ]. In order to meet the requirements of this model and provide effective patient-centered care, it is necessary to manage and analyze healthcare Big Data.

The issue often raised when it comes to the use of data in healthcare is the appropriate use of Big Data. Healthcare has always generated huge amounts of data and nowadays, the introduction of electronic medical records, as well as the huge amount of data sent by various types of sensors or generated by patients in social media causes data streams to constantly grow. Also, the medical industry generates significant amounts of data, including clinical records, medical images, genomic data and health behaviors. Proper use of the data will allow healthcare organizations to support clinical decision-making, disease surveillance, and public health management. The challenge posed by clinical data processing involves not only the quantity of data but also the difficulty in processing it.

In the literature one can find many different definitions of Big Data. This concept has evolved in recent years; however, it is still not clearly understood. Nevertheless, despite the range of and differences in definitions, Big Data can be treated as: a large amount of digital data, large data sets, a tool, a technology, or a phenomenon (cultural or technological).

Big Data can be considered as massive and continually generated digital datasets that are produced via interactions with online technologies [ 53 ]. Big Data can be defined as datasets that are of such large sizes that they pose challenges in traditional storage and analysis techniques [ 28 ]. A similar opinion about Big Data was presented by Ohlhorst, who sees Big Data as extremely large data sets that are impossible to manage or analyze with traditional data processing tools [ 57 ]. In his opinion, the bigger the data set, the more difficult it is to gain any value from it.

In turn, Knapp perceived Big Data as tools, processes and procedures that allow an organization to create, manipulate and manage very large data sets and storage facilities [ 38 ]. From this point of view, Big Data is identified as a tool to gather information from different databases and processes, allowing users to manage large amounts of data.

Similar perception of the term ‘Big Data’ is shown by Carter. According to him, Big Data technologies refer to a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high velocity capture, discovery and/or analysis [ 13 ].

Jordan combines these two approaches by identifying Big Data as a complex system, as it needs databases for data to be stored in, programs and tools for it to be managed, as well as expertise and personnel able to retrieve useful information, and visualization for it to be understood [ 37 ].

Following Laney's definition of Big Data, it can be stated that it is a large amount of data generated in very fast motion and containing a lot of content [ 43 ]. Such data comes from unstructured sources, such as streams of clicks on the web, social networks (Twitter, blogs, Facebook), video recordings from shops, recordings of calls in call centers, real-time information from various kinds of sensors, RFID, GPS devices, mobile phones and other devices that identify and monitor something [ 8 ]. Big Data is a powerful digital data silo: raw, collected from all sorts of sources, unstructured and difficult, or even impossible, to analyze using the conventional techniques applied so far to relational databases.

While describing Big Data, it cannot be overlooked that the term refers more to a phenomenon than to a specific technology. Therefore, instead of defining this phenomenon, many authors describe Big Data by giving it a set of characteristics known as the V's, related to its nature [ 2 , 3 , 23 , 25 , 58 ]:

  • Volume (refers to the amount of data and is one of the biggest challenges in Big Data Analytics),
  • Velocity (speed with which new data is generated, the challenge is to be able to manage data effectively and in real time),
  • Variety (heterogeneity of data, many different types of healthcare data, the challenge is to derive insights by looking at all available heterogenous data in a holistic manner),
  • Variability (inconsistency of data, the challenge is to correct the interpretation of data that can vary significantly depending on the context),
  • Veracity (how trustworthy the data is, quality of the data),
  • Visualization (ability to interpret data and resulting insights, challenging for Big Data due to its other features as described above),
  • Value (the goal of Big Data Analytics is to discover the hidden knowledge from huge amounts of data).

Big Data is defined as an information asset with high volume, velocity, and variety, which requires specific technology and methods for its transformation into value [ 21 , 77 ]. Big Data is also a collection of information characterized by high volume, high volatility or high diversity, requiring new forms of processing in order to support decision-making, the discovery of new phenomena and process optimization [ 5 , 7 ]. Big Data is too large for traditional data-processing systems and software tools to capture, store, manage and analyze, therefore it requires new technologies [ 28 , 50 , 61 ] to manage (capture, aggregate, process) its volume, velocity and variety [ 9 ].

Undoubtedly, Big Data differs from the data sources used so far by organizations. Therefore, organizations must approach this type of unstructured data in a different way. First of all, organizations must start to see data as flows and not stocks—this entails the need to implement the so-called streaming analytics [ 48 ]. The mentioned features make it necessary to use new IT tools that allow the fullest use of new data [ 58 ]. The Big Data idea, inseparable from the huge increase in data available to various organizations or individuals, creates opportunities for access to valuable analyses, conclusions and enables making more accurate decisions [ 6 , 11 , 59 ].

The Big Data concept is constantly evolving and currently it does not focus on huge amounts of data, but rather on the process of creating value from this data [ 52 ]. Big Data is collected from various sources that have different data properties and is processed by different organizational units, resulting in the creation of a Big Data chain [ 36 ]. The aim of organizations is to manage, process and analyze Big Data. In the healthcare sector, Big Data streams consist of various types of data, namely [ 8 , 51 ]:

  • clinical data, i.e. data obtained from electronic medical records, data from hospital information systems, image centers, laboratories, pharmacies and other organizations providing health services, patient generated health data, physician’s free-text notes, genomic data, physiological monitoring data [ 4 ],
  • biometric data provided from various types of devices that monitor weight, pressure, glucose level, etc.,
  • financial data, constituting a full record of economic operations reflecting the conducted activity,
  • data from scientific research activities, i.e. results of research, including drug research, design of medical devices and new methods of treatment,
  • data provided by patients, including descriptions of preferences, level of satisfaction, information from systems for self-monitoring of their activity: exercises, sleep, meals consumed, etc.,
  • data from social media.

These data are provided not only by patients but also by organizations and institutions, as well as by various types of monitoring devices, sensors or instruments [ 16 ]. Data generated so far in the healthcare sector is stored in both paper and digital form. Thus, the essence and specificity of the process of Big Data analysis mean that organizations need to face new technological and organizational challenges [ 67 ]. The healthcare sector has always generated huge amounts of data, connected, among others, with the need to store patients' medical records. However, the problem with Big Data in healthcare is not limited to an overwhelming volume but also includes an unprecedented diversity in terms of types and data formats, and the speed with which it should be analyzed in order to provide the necessary information on an ongoing basis [ 3 ]. It is also difficult to apply traditional tools and methods to the management of unstructured data [ 67 ]. Due to the diversity and quantity of data sources, which are growing all the time, advanced analytical tools and technologies, as well as Big Data analysis methods that can meet and exceed the possibilities of managing healthcare data, are needed [ 3 , 68 ].

Therefore, great potential is seen in Big Data analyses, especially in the aspects of improving the quality of medical care, saving lives and reducing costs [ 30 ]. Extracting association rules, patterns and trends from this tangle of data will allow health service providers and other stakeholders in the healthcare sector to offer more accurate and more insightful diagnoses of patients, personalized treatment, monitoring of patients, preventive medicine, support for medical research and population health, as well as better quality of medical services and patient care while, at the same time, reducing costs (Fig. 1).

Fig. 1 Healthcare Big Data Analytics applications (Source: own elaboration)

The main challenge with Big Data is how to handle such a large amount of information and use it to make data-driven decisions in plenty of areas [ 64 ]. In the context of healthcare data, another major challenge is to adapt Big Data storage, analysis, the presentation of analysis results and inference based on them to a clinical setting. Data analytics systems implemented in healthcare are designed to describe, integrate and present complex data in an appropriate way so that it can be understood better (Fig. 2). This would improve the efficiency of acquiring, storing, analyzing and visualizing Big Data from healthcare [ 71 ].

Fig. 2 Process of Big Data Analytics

The result of data processing with the use of Big Data Analytics is appropriate data storytelling, which may contribute to making decisions with both lower risk and data support. This, in turn, can benefit healthcare stakeholders. To take advantage of the massive amounts of data potentially available in healthcare and to ensure that the right intervention for the right patient is properly timed, personalized, and potentially beneficial to all components of the healthcare system, such as the payer, patient, and management, analytics of large datasets must connect the communities involved in data analytics and healthcare informatics [ 49 ]. Big Data Analytics can provide insight into clinical data and thus facilitate informed decision-making about the diagnosis and treatment of patients, the prevention of diseases and more. Big Data Analytics can also improve the efficiency of healthcare organizations by realizing the potential of data [ 3 , 62 ].

Big Data Analytics in medicine and healthcare refers to the integration and analysis of a large amount of complex heterogeneous data, such as various omics data (genomics, epigenomics, transcriptomics, proteomics, metabolomics, interactomics, pharmacogenetics, diseasomics), biomedical data, telemedicine data (sensors, medical equipment data) and electronic health records data [ 46 , 65 ].

When analyzing the phenomenon of Big Data in the healthcare sector, it should be noted that it can be considered from the point of view of three areas: epidemiological, clinical and business.

From a clinical point of view, Big Data analysis aims to improve the health and condition of patients, enable long-term predictions about their health status and the implementation of appropriate therapeutic procedures. Ultimately, the use of data analysis in medicine is meant to allow the adaptation of therapy to a specific patient, that is, personalized (precision) medicine.

From an epidemiological point of view, it is desirable to obtain an accurate prognosis of morbidity in order to implement preventive programs in advance.

In the business context, Big Data analysis may enable offering personalized packages of commercial services or determining the probability of individual disease and infection occurrence. It is worth noting that Big Data means not only the collection and processing of data but, most of all, the inference and visualization of data necessary to obtain specific business benefits.

In order to introduce new management methods and new solutions in terms of effectiveness and transparency, it becomes necessary to make data more accessible, digital, searchable, as well as analyzed and visualized.

Erickson and Rothberg state that the information and data do not reveal their full value until insights are drawn from them. Data becomes useful when it enhances decision making and decision making is enhanced only when analytical techniques are used and an element of human interaction is applied [ 22 ].

Thus, healthcare has experienced much progress in the usage and analysis of data. Large-scale digitalization and transparency in this sector are a key element of almost all governments' policies. For centuries, the treatment of patients was based on the judgment of doctors who made treatment decisions. In recent years, however, Evidence-Based Medicine has become more and more important, as it relies on the systematic analysis of clinical data and treatment decisions based on the best available information [ 42 ]. In the healthcare sector, Big Data Analytics is expected to improve the quality of life and reduce operational costs [ 72 , 82 ]. Big Data Analytics enables organizations to improve and increase their understanding of the information contained in data. It also helps identify data that provides valuable insights for current as well as future decisions [ 28 ].

Big Data Analytics refers to technologies that are grounded mostly in data mining: text mining, web mining, process mining, audio and video analytics, statistical analysis, network analytics, social media analytics and web analytics [ 16 , 25 , 31 ]. Different data mining techniques can be applied on heterogeneous healthcare data sets, such as: anomaly detection, clustering, classification, association rules as well as summarization and visualization of those Big Data sets [ 65 ]. Modern data analytics techniques explore and leverage unique data characteristics even from high-speed data streams and sensor data [ 15 , 16 , 31 , 55 ]. Big Data can be used, for example, for better diagnosis in the context of comprehensive patient data, disease prevention and telemedicine (in particular when using real-time alerts for immediate care), monitoring patients at home, preventing unnecessary hospital visits, integrating medical imaging for a wider diagnosis, creating predictive analytics, reducing fraud and improving data security, better strategic planning and increasing patients’ involvement in their own health.
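To make these mining techniques more concrete, the sketch below shows, on purely synthetic data, how two of the listed techniques (clustering and anomaly detection) might be applied to biometric readings. It is a minimal illustration assuming Python with scikit-learn; the features, parameters and interpretations are invented for the example, not taken from any study cited here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic biometric readings: [systolic blood pressure, glucose (mg/dL), BMI]
patients = rng.normal(loc=[125, 100, 26], scale=[15, 20, 4], size=(500, 3))

X = StandardScaler().fit_transform(patients)

# Clustering: group patients into broad risk profiles
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Anomaly detection: flag outlying readings (possible data errors or acute cases)
flags = IsolationForest(random_state=0).fit_predict(X)  # -1 marks an anomaly

print("cluster sizes:", np.bincount(clusters))
print("flagged anomalies:", int((flags == -1).sum()))
```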

Big Data Analytics in healthcare can be divided into [ 33 , 73 , 74 ]:

  • descriptive analytics in healthcare is used to understand past and current healthcare decisions, converting data into useful information for understanding and analyzing healthcare decisions, outcomes and quality, as well as making informed decisions [ 33 ]. It can be used to create reports (e.g. about patients’ hospitalizations, physicians’ performance, utilization management), visualizations, customized reports and drill-down tables, or to run queries on the basis of historical data.
  • predictive analytics operates on past performance in an effort to predict the future by examining historical or summarized health data, detecting patterns of relationships in these data, and then extrapolating these relationships to forecast. It can be used, for example, to predict the response of different patient groups to different drugs (dosages) or reactions (clinical trials), to anticipate risk, to find relationships in health data and to detect hidden patterns [ 62 ]. In this way, it is possible to predict the spread of epidemics, anticipate service contracts and plan healthcare resources. Predictive analytics is used in proper diagnosis and for appropriate treatments to be given to patients suffering from certain diseases [ 39 ].
  • prescriptive analytics—occurs when health problems involve too many choices or alternatives. It uses health and medical knowledge in addition to data or information. Prescriptive analytics is used in many areas of healthcare, including drug prescriptions and treatment alternatives. Personalized medicine and evidence-based medicine are both supported by prescriptive analytics.
  • discovery analytics—utilizes knowledge about knowledge to discover new “inventions” like drugs (drug discovery), previously unknown diseases and medical conditions, alternative treatments, etc.

Although the models and tools used in descriptive, predictive, prescriptive, and discovery analytics are different, many applications involve all four of them [ 62 ]. Big Data Analytics in healthcare can help enable personalized medicine by identifying optimal patient-specific treatments. This can improve living standards, reduce the waste of healthcare resources and save healthcare costs [ 56 , 63 , 71 ]. The introduction of large-scale data analysis gives new analytical possibilities in terms of scope, flexibility and visualization. Techniques such as data mining (the computational process of discovering patterns in large data sets) facilitate inductive reasoning and the analysis of exploratory data, enabling scientists to identify data patterns that are independent of specific hypotheses. As a result, predictive analysis and real-time analysis become possible, making it easier for medical staff to start early treatments and reduce potential morbidity and mortality. In addition, document analysis, statistical modeling, the discovery of patterns and topics in document collections and EHR data, as well as an inductive approach, can help identify and discover relationships between health phenomena.
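The distinction between descriptive and predictive analytics drawn above can be illustrated with a short sketch. It is a toy example on simulated patient records, assuming Python with pandas and scikit-learn; the variables (age, prior visits, readmission) and the coefficients generating the outcome are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "prior_visits": rng.poisson(2, n),
})
# Simulated outcome: readmission risk rises with age and prior visits
p = 1 / (1 + np.exp(-(0.03 * df["age"] + 0.4 * df["prior_visits"] - 4)))
df["readmitted"] = rng.random(n) < p

# Descriptive analytics: summarize what already happened, by age band
bands = pd.cut(df["age"], bins=[19, 40, 60, 90])
print(df.groupby(bands, observed=True)["readmitted"].mean())

# Predictive analytics: fit a model to forecast risk for unseen patients
X_tr, X_te, y_tr, y_te = train_test_split(
    df[["age", "prior_visits"]], df["readmitted"], random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```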

Advanced analytical techniques can be applied to the large amount of existing (but not yet analyzed) data on patient health and related medical data to achieve a better understanding of the information and results obtained, as well as to design optimal clinical pathways [ 62 ]. Big Data Analytics in healthcare integrates the analysis of several scientific areas such as bioinformatics, medical imaging, sensor informatics, medical informatics and health informatics [ 65 ]. Big Data Analytics in healthcare makes it possible to analyze large datasets from thousands of patients, identifying clusters and correlations between datasets, as well as developing predictive models using data mining techniques [ 65 ]. Discussing all the techniques used for Big Data Analytics goes beyond the scope of a single article [ 25 ].

The success of Big Data analysis and its accuracy depend heavily on the tools and techniques used for analysis and on their ability to provide reliable, up-to-date and meaningful information to various stakeholders [ 12 ]. It is believed that the implementation of Big Data Analytics by healthcare organizations could bring many benefits in the upcoming years, including lowering healthcare costs, better diagnosis and prediction of diseases and their spread, improving patient care and developing protocols to prevent re-hospitalization, optimizing staff and equipment, forecasting the need for hospital beds, operating rooms and treatments, and improving the drug supply chain [ 71 ].

Challenges and potential benefits of using Big Data Analytics in healthcare

Modern analytics offers the possibility not only of gaining insight into historical data, but also of generating the information necessary to anticipate what may happen in the future, including the prediction of evidence-based actions. The emphasis on reform has prompted payers and suppliers to pursue data analysis to reduce risk, detect fraud, improve efficiency and save lives. Everyone, whether payers, providers or patients, is focusing on doing more with fewer resources. Thus, some areas in which enhanced data and analytics can yield the greatest results involve various healthcare stakeholders (Table 1).

Table 1 The use of analytics by various healthcare stakeholders

Source: own elaboration on the basis of [ 19 , 20 ]

Healthcare organizations see the opportunity to grow through investments in Big Data Analytics. In recent years, by collecting patients’ medical data, converting it into Big Data and applying appropriate algorithms, reliable information has been generated that helps patients, physicians and stakeholders in the health sector to identify value and opportunities [ 31 ]. It is worth noting that there are many changes and challenges in the structure of the healthcare sector. Digitization and the effective use of Big Data in healthcare can bring benefits to every stakeholder in this sector: a single doctor stands to benefit just as much as the entire healthcare system. Potential opportunities to achieve benefits and effects from Big Data in healthcare can be divided into four groups [ 8 ]:

  • assessment of diagnoses made by doctors and the manner of treatment of diseases indicated by them based on the decision support system working on Big Data collections,
  • detection of more effective, from a medical point of view, and more cost-effective ways to diagnose and treat patients,
  • analysis of large volumes of data to reach practical information useful for identifying needs, introducing new health services, preventing and overcoming crises,
  • prediction of the incidence of diseases,
  • detecting trends that lead to an improvement in health and lifestyle of the society,
  • analysis of the human genome for the introduction of personalized treatment.
  • doctors’ comparison of current medical cases to cases from the past for better diagnosis and treatment adjustment,
  • detection of diseases at earlier stages when they can be more easily and quickly cured,
  • detecting epidemiological risks and improving control of pathogenic spots and reaction rates,
  • identification of patients who are predicted to be at the highest risk of specific, life-threatening diseases, by collating data on the history of the most common diseases with reports submitted to insurance companies,
  • health management of each patient individually (personalized medicine) and health management of the whole society,
  • capturing and analyzing large amounts of data from hospitals and homes in real time, life monitoring devices to monitor safety and predict adverse events,
  • analysis of patient profiles to identify people for whom prevention should be applied, lifestyle change or preventive care approach,
  • the ability to predict the occurrence of specific diseases or worsening of patients’ results,
  • predicting disease progression and its determinants, estimating the risk of complications,
  • detecting drug interactions and their side effects.
  • supporting work on new drugs and clinical trials thanks to the possibility of analyzing “all data” instead of selecting a test sample,
  • the ability to identify patients with specific, biological features that will take part in specialized clinical trials,
  • selecting a group of patients for which the tested drug is likely to have the desired effect and no side effects,
  • using modeling and predictive analysis to design better drugs and devices.
  • reduction of costs and counteracting abuse and counseling practices,
  • faster and more effective identification of incorrect or unauthorized financial operations in order to prevent abuse and eliminate errors,
  • increase in profitability by detecting patients generating high costs or identifying doctors whose work, procedures and treatment methods cost the most and offering them solutions that reduce the amount of money spent,
  • identification of unnecessary medical activities and procedures, e.g. duplicate tests.

According to research conducted by Wang, Kung and Byrd, Big Data Analytics benefits can be classified into five categories [ 73 ]:

  • IT infrastructure benefits (reducing system redundancy, avoiding unnecessary IT costs, transferring data quickly among healthcare IT systems, better use of healthcare systems, processing standardization among various healthcare IT systems, reducing IT maintenance costs regarding data storage),
  • operational benefits (improving the quality and accuracy of clinical decisions, processing a large number of health records in seconds, reducing the time of patient travel, immediate access to clinical data to analyze, shortening the time of diagnostic tests, reductions in surgery-related hospitalizations, exploring inconceivable new research avenues),
  • organizational benefits (detecting interoperability problems much more quickly than traditional manual methods, improving cross-functional communication and collaboration among administrative staff, researchers, clinicians and IT staff, enabling data sharing with other institutions and adding new services, content sources and research partners),
  • managerial benefits (gaining quick insights about changing healthcare trends in the market, providing members of the board and heads of departments with sound decision-support information on the daily clinical setting, optimizing business growth-related decisions),
  • strategic benefits (providing a big-picture view of treatment delivery for meeting future needs, creating highly competitive healthcare services).

The above specification does not constitute a full list of potential areas of use of Big Data Analysis in healthcare because the possibilities of using analysis are practically unlimited. In addition, advanced analytical tools allow to analyze data from all possible sources and conduct cross-analyses to provide better data insights [ 26 ]. For example, a cross-analysis can refer to a combination of patient characteristics, as well as costs and care results that can help identify the best, in medical terms, and the most cost-effective treatment or treatments and this may allow a better adjustment of the service provider’s offer [ 62 ].

In turn, the analysis of patient profiles (e.g. segmentation and predictive modeling) allows the identification of people who should be subject to prophylaxis or prevention, or who should change their lifestyle [ 8 ]. A shortened list of the benefits of Big Data Analytics in healthcare is presented in [ 3 ] and consists of: better performance, day-to-day guides, detection of diseases at early stages, predictive analytics, cost effectiveness, Evidence-Based Medicine and effectiveness in patient treatment.

Summarizing, healthcare Big Data represents a huge potential for the transformation of healthcare: improvement of patients’ results, prediction of outbreaks of epidemics, valuable insights, avoidance of preventable diseases, reduction of the cost of healthcare delivery and improvement of the quality of life in general [ 1 ]. Big Data also generates many challenges, such as difficulties in data capture, data storage, data analysis and data visualization [ 15 ]. The main challenges are connected with the issues of: data structure (Big Data should be user-friendly, transparent and menu-driven, but it is fragmented, dispersed, rarely standardized and difficult to aggregate and analyze), security (data security, privacy and the sensitivity of healthcare data raise significant concerns related to confidentiality), data standardization (data is stored in formats that are not compatible with all applications and technologies), storage and transfers (especially the costs associated with securing, storing and transferring unstructured data), managerial skills such as data governance, the lack of appropriate analytical skills, and problems with real-time analytics (healthcare needs to be able to utilize Big Data in real time) [ 4 , 34 , 41 ].

Methods

The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities in Poland.

The presented research results are part of a larger questionnaire on Big Data Analytics. The direct research was based on an interview questionnaire which contained 100 questions with a 5-point Likert scale (1—strongly disagree, 2—rather disagree, 3—neither agree nor disagree, 4—rather agree, 5—definitely agree) and 4 metric questions. The study was conducted in December 2018 on a sample of 217 medical facilities (110 private, 107 public). The research was conducted by a specialized market research agency: the Center for Research and Expertise of the University of Economics in Katowice.

When it comes to the direct research, the selected entities included entities financed from public sources, i.e. the National Health Fund (23.5%), and entities operating commercially (11.5%). In the surveyed group of entities, more than half (64.9%) are hybrid financed, both from public and commercial sources. The diversity of the research sample also applies to the size of the entities, defined by the number of employees. Taking into account the proportions of the surveyed entities, it should be noted that medium-sized (10–50 employees—34% of the sample) and large (51–250 employees—27%) entities dominate the sector structure. The research was of an all-Poland nature, and the entities included in the research sample come from all of the voivodships. The largest groups were entities from the Łódzkie (32%), Śląskie (18%) and Mazowieckie (18%) voivodships, as these voivodships have the largest number of medical institutions. Other regions of the country were represented by single units. The selection of the research sample was random and stratified: within the medical facilities database, groups of private and public medical facilities were identified, and the facilities to which the questionnaire was targeted were drawn from each of these groups. The analyses were performed using the GNU PSPP 0.10.2 software.
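The stratified ("layered") random selection described above can be sketched in a few lines. The sampling frame below is a hypothetical stand-in, and the quotas simply mirror the realized sample of 107 public and 110 private facilities; this is not the agency's actual procedure.

```python
import pandas as pd

# Hypothetical sampling frame of facilities with an ownership label
frame = pd.DataFrame({
    "facility_id": range(2000),
    "ownership": ["public", "private"] * 1000,
})

# Per-stratum quotas mirroring the realized sample
quota = {"public": 107, "private": 110}

# Draw each stratum independently, then combine
sample = pd.concat(
    frame[frame["ownership"] == stratum].sample(n=size, random_state=0)
    for stratum, size in quota.items()
)
print(sample["ownership"].value_counts())
```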

The aim of the study was to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. The characteristics of the research sample are presented in Table 2.

Table 2 Characteristics of the research sample

The research is non-exhaustive due to the incomplete and uneven regional distribution of the sample, which overrepresents three voivodeships (Łódzkie, Mazowieckie and Śląskie). Nevertheless, the size of the research sample (217 entities) allows the authors to formulate specific conclusions on the use of Big Data in the management of medical facilities.

For the purpose of this paper, the following research hypotheses were formulated: (1) medical facilities in Poland are working on both structured and unstructured data; (2) medical facilities in Poland are moving towards data-based healthcare and its benefits.

The paper poses the following research questions and statements that coincide with the selected questions from the research questionnaire:

  • What types of data are used by the particular organization, whether structured or unstructured, and to what extent?
  • From what sources do medical facilities obtain data?
  • In which areas are organizations using data and analytical systems (clinical or business)?
  • Is data analytics performed based on historical data, or are predictive analyses also performed?
  • Do administrative and medical staff receive complete, accurate and reliable data in a timely manner?
  • Are real-time analyses performed to support the particular organization’s activities?

Results and discussion

On the basis of the literature analysis and the research study, a set of questions and statements related to the researched area was formulated. The survey results show that medical facilities use a variety of data sources in their operations, covering both structured and unstructured data (Table 3).

Table 3 Type of data sources used in medical facility (%)

1—strongly disagree, 2—rather disagree, 3—neither agree nor disagree, 4—rather agree, 5—strongly agree

According to the data provided by the respondents, considering the first statement in the questionnaire, almost half of the medical institutions (47.58%) rather agree that they collect and use structured data (e.g. databases and data warehouses, reports to external entities) and 10.57% entirely agree with this statement. As much as 23.35% of representatives of medical institutions chose “neither agree nor disagree”. The remaining medical facilities rather do not collect and use structured data (7.93%) or strongly disagree with the first statement (6.17%). The median calculated on the basis of the obtained results (median: 4) also confirms that medical facilities in Poland collect and use structured data (Table 4).

Table 4 Collection and use of data determined by the size of medical facility (number of employees)

In turn, 28.19% of the medical institutions rather agree that they collect and use unstructured data and as much as 9.25% entirely agree with this statement. The share of representatives of medical institutions who chose “neither agree nor disagree” was 27.31%. The remaining medical facilities rather do not collect and use unstructured data (17.18%) or strongly disagree with this statement (13.66%). In the case of unstructured data the median is 3, which means that the collection and use of this type of data by medical facilities in Poland is lower.
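Distributions and medians of this kind are straightforward to reproduce. The sketch below uses hypothetical response counts chosen only to roughly match the reported shares for the structured-data statement; it is not the raw survey data.

```python
import numpy as np

# Hypothetical 5-point Likert answers, with counts roughly matching
# the reported percentages (out of about 217 respondents)
responses = np.repeat([1, 2, 3, 4, 5], [13, 17, 51, 103, 23])

values, counts = np.unique(responses, return_counts=True)
for v, c in zip(values, counts):
    print(f"answer {v}: {100 * c / responses.size:.2f}%")
print("median:", np.median(responses))  # 4, matching the reported median
```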

In the further part of the analysis, it was checked whether the size of the medical facility and the form of ownership have an impact on whether it analyzes structured and unstructured data (Tables 4 and 5). In order to find this out, correlation coefficients were calculated.

Table 5 Collection and use of data determined by the form of ownership of medical facility

Based on the calculations, it can be concluded that there is a weak but statistically significant monotonic correlation between the size of the medical facility and its collection and use of structured data (p < 0.001; τ = 0.16). This means that the use of structured data increases slightly with the size of the medical facility. The size of the medical facility matters even more for the use of unstructured data (p < 0.001; τ = 0.23) (Table 4).

To determine whether the form of medical facility ownership affects data collection, the Mann–Whitney U test was used. The calculations show that the form of ownership does not affect what data the organization collects and uses (Table 5).
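Both tests are standard nonparametric procedures; a minimal sketch with SciPy shows how they would be computed on survey-style variables. The data below is randomly generated for illustration only, so the resulting statistics will not match those reported in the tables.

```python
import numpy as np
from scipy.stats import kendalltau, mannwhitneyu

rng = np.random.default_rng(0)
n = 217
size_class = rng.integers(1, 5, n)                           # facility size category
likert = np.clip(size_class + rng.integers(-1, 3, n), 1, 5)  # 1-5 answers
ownership = rng.integers(0, 2, n)                            # 0 = public, 1 = private

# Monotonic association between facility size and declared data use
tau, p_tau = kendalltau(size_class, likert)
print(f"Kendall's tau = {tau:.2f} (p = {p_tau:.4f})")

# Difference in answers between the two ownership forms
u, p_u = mannwhitneyu(likert[ownership == 0], likert[ownership == 1])
print(f"Mann-Whitney U = {u:.0f} (p = {p_u:.4f})")
```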

Detailed information on the sources from which medical facilities collect and use data is presented in Table 6.

Table 6 Data sources used in medical facility

1—we do not use at all, 5—we use extensively

The questionnaire results show that medical facilities mostly use information published in databases, reports to external units and transaction data, but they also use unstructured data from e-mails, medical devices, sensors, phone calls, and audio and video recordings (Table 6). Data from social media, RFID and geolocation data are used to a small extent. Similar findings are reported in the literature.

From the analysis of the answers given by the respondents, more than half of the medical facilities have an integrated hospital information system (HIS) implemented: 43.61% use an integrated hospital system and 16.30% use it extensively (Table 7), while 19.38% of the examined medical facilities do not use one at all. Moreover, most of the examined medical facilities keep medical documentation in electronic form (34.80% use it, 32.16% use it extensively), which gives an opportunity to use data analytics. Only 4.85% of medical facilities do not use it at all.

Table 7 The use of HIS and electronic documentation in medical facilities (%)

Other problems that needed to be investigated were whether medical facilities in Poland use data analytics and, if so, in what form and in what areas (Table 8). The analysis of the answers given by the respondents about the potential of data analytics in medical facilities shows that a similar number of medical facilities use data analytics in administration and business (31.72% agreed with statement no. 5 and 12.33% strongly agreed) as in the clinical area (33.04% agreed with statement no. 6 and 12.33% strongly agreed). When considering decision-making issues, 35.24% agree with the statement "the organization uses data and analytical systems to support business decisions” and 8.37% of respondents strongly agree. As many as 40.09% agree with the statement that “the organization uses data and analytical systems to support clinical decisions (in the field of diagnostics and therapy)” and 15.42% of respondents strongly agree. The examined medical facilities use analytics based both on historical data (33.48% agree with statement 7 and 12.78% strongly agree) and predictive analytics (33.04% agree with statement 8 and 15.86% strongly agree). Detailed results are presented in Table 8.

Table 8 Conditions of using Big Data Analytics in medical facilities (%)

Medical facilities focus on development in the field of data processing, as they confirm that they conduct analytical planning processes systematically and analyze new opportunities for the strategic use of analytics in business and clinical activities (38.33% rather agree and 10.57% strongly agree with this statement). The situation is different with real-time data analysis, where the picture is less optimistic: only 28.19% rather agree and 14.10% strongly agree with the statement that real-time analyses are performed to support the organization’s activities.

When considering whether a facility’s performance in the clinical area depends on the form of ownership, the averages and the Mann–Whitney U test indicate that it does. A higher degree of use of analyses in the clinical area can be observed in public institutions.

Whether a medical facility performs descriptive or predictive analyses does not depend on the form of ownership (p > 0.05). Analyzing the means and medians, it can be concluded that they are higher in public facilities than in private ones. What is more, the Mann–Whitney U test shows that these variables are dependent on each other (p < 0.05) (Table 9).

Table 9 Conditions of using Big Data Analytics in medical facilities determined by the form of ownership of medical facility

When considering whether a facility’s performance in the clinical area depends on its size, Kendall’s tau (τ) indicates that it does (p < 0.001; τ = 0.22); the correlation is weak but statistically significant. This means that the use of data and analytical systems to support clinical decisions (in the field of diagnostics and therapy) increases with the size of the medical facility. A similar, but even weaker, relationship can be found for the use of descriptive and predictive analyses (Table 10).

Table 10 Conditions of using Big Data Analytics in medical facilities determined by the size of medical facility (number of employees)

Considering the results of research in the area of analytical maturity of medical facilities, 8.81% of medical facilities stated that they are at the first level of maturity, i.e. the organization has not developed analytical skills and does not perform analyses. As much as 13.66% of medical facilities confirmed that they have poor analytical skills, while 38.33% of the medical facilities located themselves at level 3, meaning that “there is a lot to do in analytics”. On the other hand, 28.19% believe that their analytical capabilities are well developed and 6.61% stated that analytics are at the highest level and the analytical capabilities are very well developed. Detailed data is presented in Table 11. The average amounts to 3.11 and the median to 3.

Table 11 Analytical maturity of examined medical facilities (%)

The results of the research have enabled the formulation of the following conclusions. Medical facilities in Poland are working on both structured and unstructured data. This data comes from databases, transactions, the unstructured content of emails and documents, and from devices and sensors. However, the use of data from social media is smaller. In their activities, facilities reach for analytics in the administrative and business area, as well as in the clinical one. Also, the decisions made are largely data-driven.

In summary, the analysis of the literature shows that the benefits that medical facilities can obtain by using Big Data Analytics in their activities relate primarily to patients, physicians and medical facilities. It can be confirmed that patients will be better informed, will receive treatments that work for them, and will be prescribed medications that work for them rather than unnecessary ones [ 78 ]. Physicians’ roles will likely change to that of a consultant rather than a decision maker: they will advise, warn and help individual patients, and have more time to form positive and lasting relationships with their patients in order to help people. Medical facilities will see changes as well, for example fewer unnecessary hospitalizations, resulting initially in less revenue but, once the market adjusts, in overall improvement [ 78 ]. The use of Big Data Analytics can literally revolutionize the way healthcare is practiced for better health and disease reduction.

The analysis of the latest data reveals that data analytics increases the accuracy of diagnoses. Physicians can use predictive algorithms to help them make more accurate diagnoses [ 45 ]. Moreover, it can be helpful in preventive medicine and public health because, with early intervention, many diseases can be prevented or ameliorated [ 29 ]. Predictive analytics also makes it possible to identify risk factors for a given patient, and with this knowledge patients will be able to change their lives, which, in turn, may mean that population disease patterns change dramatically, resulting in savings in medical costs. Moreover, personalized medicine is the best solution for an individual patient seeking treatment, as it can help doctors decide on the exact treatments for those individuals. Better diagnoses and more targeted treatments will naturally lead to increases in good outcomes and fewer resources used, including doctors’ time.

The quantitative analysis of the research carried out and presented in this article made it possible to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. Thanks to the results obtained, it was possible to formulate the following conclusions. Medical facilities are working on both structured and unstructured data, which comes from databases, transactions, the unstructured content of emails and documents, and from devices and sensors. They reach for analytics in the administrative and business area, as well as in the clinical one. The results clearly showed that the decisions made are largely data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare and its benefits.

In conclusion, Big Data Analytics has the potential for positive impact and global implications in healthcare. Future research on the use of Big Data in medical facilities will concern the strategies adopted by medical facilities to promote and implement such solutions, the benefits they gain from the use of Big Data analysis, and how they perceive the prospects in this area.

Practical implications

This work sought to narrow the gap that exists in analyzing the possibility of using Big Data Analytics in healthcare. Showing how medical facilities in Poland perform in this respect forms part of the global research carried out in this area, including [ 29 , 32 , 60 ].

Limitations and future directions

The research described in this article does not fully exhaust the questions related to the use of Big Data Analytics in Polish healthcare facilities. Only some of the dimensions characterizing the use of data by medical facilities in Poland have been examined. In order to get the full picture, it would be necessary to examine the results of using structured and unstructured data analytics in healthcare. Future research may examine the benefits that medical institutions achieve as a result of the analysis of structured and unstructured data in the clinical and management areas, and the limitations they encounter in these areas. For this purpose, it is planned to conduct in-depth interviews with chosen medical facilities in Poland; these interviews could provide additional data for empirical analyses based on the facilities’ own suggestions. Further research should also include medical institutions from beyond the borders of Poland, enabling international comparative analyses.

Future research in the healthcare field has virtually endless possibilities. These regard the use of Big Data Analytics to diagnose specific conditions [ 47 , 66 , 69 , 76 ], propose approaches that can be used in other healthcare applications and create mechanisms to identify “patients like me” [ 75 , 80 ]. Big Data Analytics could also be used for studies related to the spread of pandemics, the efficacy of COVID-19 treatment [ 18 , 79 ], or psychology and psychiatry studies, e.g. emotion recognition [ 35 ].

Acknowledgements

We would like to thank those who have shaped our scientific paths.

Authors’ contributions

KB proposed the concept of the research and its design. The manuscript was prepared by KB in consultation with AŚ. AŚ reviewed the manuscript and refined its final shape. KB prepared the manuscript with respect to the definition of intellectual content, literature search, data acquisition and data analysis. AŚ obtained research funding. Both authors read and approved the final manuscript.

Funding

This research was fully funded as statutory activity—subsidy of the Ministry of Science and Higher Education granted to the Technical University of Czestochowa for maintaining research potential in 2018. Research Number: BS/PB–622/3020/2014/P. The publication fee for the paper was financed by the University of Economics in Katowice.

Availability of data and materials

Declarations.

Not applicable.

The authors declare no conflict of interest.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Kornelia Batko, Email: [email protected] .

Andrzej Ślęzak, Email: aslezak52@gmail.com.

AI Index Report

Welcome to the seventh edition of the AI Index report. The 2024 Index is our most comprehensive to date and arrives at an important moment when AI’s influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development. Featuring more original data than ever before, this edition introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and an entirely new chapter dedicated to AI’s impact on science and medicine.


The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI). Our mission is to provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI.

The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on artificial intelligence. Previous editions have been cited in major newspapers, including The New York Times, Bloomberg, and The Guardian; have amassed hundreds of academic citations; and have been referenced by high-level policymakers in the United States, the United Kingdom, and the European Union, among other places. This year’s edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.

Steering Committee Co-Directors

Jack Clark

Ray Perrault

Steering Committee Members

Erik Brynjolfsson

John Etchemendy

Katrina Ligett

Terah Lyons

James Manyika

Juan Carlos Niebles

Vanessa Parli

Yoav Shoham

Russell Wald

Staff Members

Loredana Fattorini

Nestor Maslej

Letter from the Co-Directors

A decade ago, the best AI systems in the world were unable to classify objects in images at a human level. AI struggled with language comprehension and could not solve math problems. Today, AI systems routinely exceed human performance on standard benchmarks.

Progress accelerated in 2023. New state-of-the-art systems like GPT-4, Gemini, and Claude 3 are impressively multimodal: They can generate fluent text in dozens of languages, process audio, and even explain memes. As AI has improved, it has increasingly forced its way into our lives. Companies are racing to build AI-based products, and AI is increasingly being used by the general public. But current AI technology still has significant problems. It cannot reliably deal with facts, perform complex reasoning, or explain its conclusions.

AI faces two interrelated futures. First, technology continues to improve and is increasingly used, having major consequences for productivity and employment. It can be put to both good and bad uses. In the second future, the adoption of AI is constrained by the limitations of the technology. Regardless of which future unfolds, governments are increasingly concerned. They are stepping in to encourage the upside, such as funding university R&D and incentivizing private investment. Governments are also aiming to manage the potential downsides, such as impacts on employment, privacy concerns, misinformation, and intellectual property rights.

As AI rapidly evolves, the AI Index aims to help the AI community, policymakers, business leaders, journalists, and the general public navigate this complex landscape. It provides ongoing, objective snapshots tracking several key areas: technical progress in AI capabilities, the community and investments driving AI development and deployment, public opinion on current and potential future impacts, and policy measures taken to stimulate AI innovation while managing its risks and challenges. By comprehensively monitoring the AI ecosystem, the Index serves as an important resource for understanding this transformative technological force.

On the technical front, this year’s AI Index reports that the number of new large language models released worldwide in 2023 doubled over the previous year. Two-thirds were open-source, but the highest-performing models came from industry players with closed systems. Gemini Ultra became the first LLM to reach human-level performance on the Massive Multitask Language Understanding (MMLU) benchmark; performance on the benchmark has improved by 15 percentage points since last year. Additionally, GPT-4 achieved an impressive 0.97 mean win rate score on the comprehensive Holistic Evaluation of Language Models (HELM) benchmark, which includes MMLU among other evaluations.

Although global private investment in AI decreased for the second consecutive year, investment in generative AI skyrocketed. More Fortune 500 earnings calls mentioned AI than ever before, and new studies show that AI tangibly boosts worker productivity. On the policymaking front, global mentions of AI in legislative proceedings have never been higher. U.S. regulators passed more AI-related regulations in 2023 than ever before. Still, many expressed concerns about AI’s ability to generate deepfakes and impact elections. The public became more aware of AI, and studies suggest that they responded with nervousness.

Ray Perrault, Co-Director, AI Index



  • Open access
  • Published: 17 April 2024

The economic commitment of climate change

  • Maximilian Kotz   ORCID: orcid.org/0000-0003-2564-5043 1 , 2 ,
  • Anders Levermann   ORCID: orcid.org/0000-0003-4432-4704 1 , 2 &
  • Leonie Wenz   ORCID: orcid.org/0000-0002-8500-1568 1 , 3  

Nature volume 628, pages 551–557 (2024)

  • Environmental economics
  • Environmental health
  • Interdisciplinary studies
  • Projection and prediction

Global projections of macroeconomic climate-change damages typically consider impacts from average annual and national temperatures over long time horizons 1 , 2 , 3 , 4 , 5 , 6 . Here we use recent empirical findings from more than 1,600 regions worldwide over the past 40 years to project sub-national damages from temperature and precipitation, including daily variability and extremes 7 , 8 . Using an empirical approach that provides a robust lower bound on the persistence of impacts on economic growth, we find that the world economy is committed to an income reduction of 19% within the next 26 years independent of future emission choices (relative to a baseline without climate impacts, likely range of 11–29% accounting for physical climate and empirical uncertainty). These damages already outweigh the mitigation costs required to limit global warming to 2 °C by sixfold over this near-term time frame and thereafter diverge strongly dependent on emission choices. Committed damages arise predominantly through changes in average temperature, but accounting for further climatic components raises estimates by approximately 50% and leads to stronger regional heterogeneity. Committed losses are projected for all regions except those at very high latitudes, at which reductions in temperature variability bring benefits. The largest losses are committed at lower latitudes in regions with lower cumulative historical emissions and lower present-day income.


Projections of the macroeconomic damage caused by future climate change are crucial to informing public and policy debates about adaptation, mitigation and climate justice. On the one hand, adaptation against climate impacts must be justified and planned on the basis of an understanding of their future magnitude and spatial distribution 9 . This is also of importance in the context of climate justice 10 , as well as to key societal actors, including governments, central banks and private businesses, which increasingly require the inclusion of climate risks in their macroeconomic forecasts to aid adaptive decision-making 11 , 12 . On the other hand, climate mitigation policy such as the Paris Climate Agreement is often evaluated by balancing the costs of its implementation against the benefits of avoiding projected physical damages. This evaluation occurs formally through cost–benefit analyses 1 , 4 , 5 , 6 , as well as informally through public perception of mitigation and damage costs 13 .

Projections of future damages meet challenges when informing these debates, in particular the human biases relating to uncertainty and remoteness that are raised by long-term perspectives 14 . Here we aim to overcome such challenges by assessing the extent of economic damages from climate change to which the world is already committed by historical emissions and socio-economic inertia (the range of future emission scenarios that are considered socio-economically plausible 15 ). Such a focus on the near term limits the large uncertainties about diverging future emission trajectories, the resulting long-term climate response and the validity of applying historically observed climate–economic relations over long timescales during which socio-technical conditions may change considerably. As such, this focus aims to simplify the communication and maximize the credibility of projected economic damages from future climate change.

In projecting the future economic damages from climate change, we make use of recent advances in climate econometrics that provide evidence for impacts on sub-national economic growth from numerous components of the distribution of daily temperature and precipitation 3 , 7 , 8 . Using fixed-effects panel regression models to control for potential confounders, these studies exploit within-region variation in local temperature and precipitation in a panel of more than 1,600 regions worldwide, comprising climate and income data over the past 40 years, to identify the plausibly causal effects of changes in several climate variables on economic productivity 16 , 17 . Specifically, macroeconomic impacts have been identified from changing daily temperature variability, total annual precipitation, the annual number of wet days and extreme daily rainfall that occur in addition to those already identified from changing average temperature 2 , 3 , 18 . Moreover, regional heterogeneity in these effects based on the prevailing local climatic conditions has been found using interaction terms. The selection of these climate variables follows micro-level evidence for mechanisms related to the impacts of average temperatures on labour and agricultural productivity 2 , of temperature variability on agricultural productivity and health 7 , as well as of precipitation on agricultural productivity, labour outcomes and flood damages 8 (see Extended Data Table 1 for an overview, including more detailed references). References  7 , 8 contain a more detailed motivation for the use of these particular climate variables and provide extensive empirical tests about the robustness and nature of their effects on economic output, which are summarized in Methods . By accounting for these extra climatic variables at the sub-national level, we aim for a more comprehensive description of climate impacts with greater detail across both time and space.

Constraining the persistence of impacts

A key determinant and source of discrepancy in estimates of the magnitude of future climate damages is the extent to which the impact of a climate variable on economic growth rates persists. The two extreme cases in which these impacts persist indefinitely or only instantaneously are commonly referred to as growth or level effects 19 , 20 (see Methods section ‘Empirical model specification: fixed-effects distributed lag models’ for mathematical definitions). Recent work shows that future damages from climate change depend strongly on whether growth or level effects are assumed 20 . Following refs.  2 , 18 , we provide constraints on this persistence by using distributed lag models to test the significance of delayed effects separately for each climate variable. Notably, and in contrast to refs.  2 , 18 , we use climate variables in their first-differenced form following ref.  3 , implying a dependence of the growth rate on a change in climate variables. This choice means that a baseline specification without any lags constitutes a model prior of purely level effects, in which a permanent change in the climate has only an instantaneous effect on the growth rate 3 , 19 , 21 . By including lags, one can then test whether any effects may persist further. This is in contrast to the specification used by refs.  2 , 18 , in which climate variables are used without taking the first difference, implying a dependence of the growth rate on the level of climate variables. In this alternative case, the baseline specification without any lags constitutes a model prior of pure growth effects, in which a change in climate has an infinitely persistent effect on the growth rate. Consequently, including further lags in this alternative case tests whether the initial growth impact is recovered 18 , 19 , 21 . Both of these specifications suffer from the limiting possibility that, if too few lags are included, one might falsely accept the model prior. The limitations of including a very large number of lags, including loss of data and increasing statistical uncertainty with an increasing number of parameters, mean that such a possibility is likely. By choosing a specification in which the model prior is one of level effects, our approach is therefore conservative by design, avoiding assumptions of infinite persistence of climate impacts on growth and instead providing a lower bound on this persistence based on what is observable empirically (see Methods section ‘Empirical model specification: fixed-effects distributed lag models’ for further exposition of this framework). The conservative nature of such a choice is probably the reason that ref.  19 finds much greater consistency between the impacts projected by models that use the first difference of climate variables, as opposed to their levels.

We begin our empirical analysis of the persistence of climate impacts on growth using ten lags of the first-differenced climate variables in fixed-effects distributed lag models. We detect substantial effects on economic growth at time lags of up to approximately 8–10 years for the temperature terms and up to approximately 4 years for the precipitation terms (Extended Data Fig. 1 and Extended Data Table 2 ). Furthermore, evaluation by means of information criteria indicates that the inclusion of all five climate variables and the use of these numbers of lags provide a preferable trade-off between best-fitting the data and including further terms that could cause overfitting, in comparison with model specifications excluding climate variables or including more or fewer lags (Extended Data Fig. 3 , Supplementary Methods Section  1 and Supplementary Table 1 ). We therefore remove statistically insignificant terms at later lags (Supplementary Figs. 1 – 3 and Supplementary Tables 2 – 4 ). Further tests using Monte Carlo simulations demonstrate that the empirical models are robust to autocorrelation in the lagged climate variables (Supplementary Methods Section  2 and Supplementary Figs. 4 and 5 ), that information criteria provide an effective indicator for lag selection (Supplementary Methods Section  2 and Supplementary Fig. 6 ), that the results are robust to concerns of imperfect multicollinearity between climate variables and that including several climate variables is actually necessary to isolate their separate effects (Supplementary Methods Section  3 and Supplementary Fig. 7 ). We provide a further robustness check using a restricted distributed lag model to limit oscillations in the lagged parameter estimates that may result from autocorrelation, finding that it provides similar estimates of cumulative marginal effects to the unrestricted model (Supplementary Methods Section 4 and Supplementary Figs. 8 and 9 ). Finally, to explicitly account for any outstanding uncertainty arising from the precise choice of the number of lags, we include empirical models with marginally different numbers of lags in the error-sampling procedure of our projection of future damages. On the basis of the lag-selection procedure (the significance of lagged terms in Extended Data Fig. 1 and Extended Data Table 2 , as well as information criteria in Extended Data Fig. 3 ), we sample from models with eight to ten lags for temperature and four for precipitation (models shown in Supplementary Figs. 1 – 3 and Supplementary Tables 2 – 4 ). In summary, this empirical approach to constrain the persistence of climate impacts on economic growth rates is conservative by design in avoiding assumptions of infinite persistence, but nevertheless provides a lower bound on the extent of impact persistence that is robust to the numerous tests outlined above.
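The lag-selection logic can be illustrated with a minimal sketch. The study itself ran fixed-effects regressions with the fixest package in R; the Python version below, with illustrative names (`df`, `dlgrp`, `dT`) and simple dummy-variable fixed effects, is a hedged stand-in rather than the paper's code, shown only for a single climate variable.

```python
import pandas as pd
import statsmodels.formula.api as smf

# `df` is a hypothetical panel DataFrame with columns: region, year,
# dlgrp (growth rate) and dT (first-differenced annual mean temperature).
df = df.sort_values(["region", "year"])
for lag in range(11):
    df[f"dT_l{lag}"] = df.groupby("region")["dT"].shift(lag)

# Restrict to one common sample so information criteria are comparable.
common = df.dropna(subset=["dlgrp"] + [f"dT_l{lag}" for lag in range(11)])

bic = {}
for n_lags in range(11):
    terms = " + ".join(f"dT_l{lag}" for lag in range(n_lags + 1))
    # Region and year fixed effects enter as dummy variables here.
    fit = smf.ols(f"dlgrp ~ {terms} + C(region) + C(year)", data=common).fit()
    bic[n_lags] = fit.bic

best_n_lags = min(bic, key=bic.get)  # lag count minimizing the BIC
```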

Committed damages until mid-century

We combine these empirical economic response functions (Supplementary Figs. 1 – 3 and Supplementary Tables 2 – 4 ) with an ensemble of 21 climate models (see Supplementary Table 5 ) from the Coupled Model Intercomparison Project Phase 6 (CMIP-6) 22 to project the macroeconomic damages from these components of physical climate change (see Methods for further details). Bias-adjusted climate models that provide a highly accurate reproduction of observed climatological patterns with limited uncertainty (Supplementary Table 6 ) are used to avoid introducing biases in the projections. Following a well-developed literature 2 , 3 , 19 , these projections do not aim to provide a prediction of future economic growth. Instead, they are a projection of the exogenous impact of future climate conditions on the economy relative to the baselines specified by socio-economic projections, based on the plausibly causal relationships inferred by the empirical models and assuming ceteris paribus. Other exogenous factors relevant for the prediction of economic output are purposefully assumed constant.

A Monte Carlo procedure that samples from climate model projections, empirical models with different numbers of lags and model parameter estimates (obtained by 1,000 block-bootstrap resamples of each of the regressions in Supplementary Figs. 1 – 3 and Supplementary Tables 2 – 4 ) is used to estimate the combined uncertainty from these sources. Given these uncertainty distributions, we find that projected global damages are statistically indistinguishable across the two most extreme emission scenarios until 2049 (at the 5% significance level; Fig. 1 ). As such, the climate damages occurring before this time constitute those to which the world is already committed owing to the combination of past emissions and the range of future emission scenarios that are considered socio-economically plausible 15 . These committed damages comprise a permanent income reduction of 19% on average globally (population-weighted average) in comparison with a baseline without climate-change impacts (with a likely range of 11–29%, following the likelihood classification adopted by the Intergovernmental Panel on Climate Change (IPCC); see caption of Fig. 1 ). Even though levels of income per capita generally still increase relative to those of today, this constitutes a permanent income reduction for most regions, including North America and Europe (each with median income reductions of approximately 11%) and with South Asia and Africa being the most strongly affected (each with median income reductions of approximately 22%; Fig. 1 ). Under a middle-of-the road scenario of future income development (SSP2, in which SSP stands for Shared Socio-economic Pathway), this corresponds to global annual damages in 2049 of 38 trillion in 2005 international dollars (likely range of 19–59 trillion 2005 international dollars). Compared with empirical specifications that assume pure growth or pure level effects, our preferred specification that provides a robust lower bound on the extent of climate impact persistence produces damages between these two extreme assumptions (Extended Data Fig. 3 ).
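As an illustration of how a divergence year can be read off such an uncertainty ensemble, consider the following sketch; the arrays `low` and `high` are hypothetical stand-ins for the sampled global damage trajectories under the two scenarios, with draws paired across scenarios as in the procedure described above.

```python
import numpy as np

# `low`, `high`: hypothetical (n_samples, n_years) arrays of global damage
# fractions under the low- and high-emission scenarios, with Monte Carlo
# draws paired across scenarios (same climate model, lag choice and
# bootstrap sample for both).
diff = high - low
lower_tail = np.percentile(diff, 5, axis=0)  # 5th percentile of the difference per year

years = np.arange(2020, 2020 + diff.shape[1])
diverged = lower_tail > 0                    # scenarios statistically distinguishable
divergence_year = years[diverged.argmax()] if diverged.any() else None
```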

Figure 1

Estimates of the projected reduction in income per capita from changes in all climate variables based on empirical models of climate impacts on economic output with a robust lower bound on their persistence (Extended Data Fig. 1 ) under a low-emission scenario compatible with the 2 °C warming target and a high-emission scenario (SSP2-RCP2.6 and SSP5-RCP8.5, respectively) are shown in purple and orange, respectively. Shading represents the 34% and 10% confidence intervals reflecting the likely and very likely ranges, respectively (following the likelihood classification adopted by the IPCC), having estimated uncertainty from a Monte Carlo procedure, which samples the uncertainty from the choice of physical climate models, empirical models with different numbers of lags and bootstrapped estimates of the regression parameters shown in Supplementary Figs. 1 – 3 . Vertical dashed lines show the time at which the climate damages of the two emission scenarios diverge at the 5% and 1% significance levels based on the distribution of differences between emission scenarios arising from the uncertainty sampling discussed above. Note that uncertainty in the difference of the two scenarios is smaller than the combined uncertainty of the two respective scenarios because samples of the uncertainty (climate model and empirical model choice, as well as model parameter bootstrap) are consistent across the two emission scenarios, hence the divergence of damages occurs while the uncertainty bounds of the two separate damage scenarios still overlap. Estimates of global mitigation costs from the three IAMs that provide results for the SSP2 baseline and SSP2-RCP2.6 scenario are shown in light green in the top panel, with the median of these estimates shown in bold.

Damages already outweigh mitigation costs

We compare the damages to which the world is committed over the next 25 years to estimates of the mitigation costs required to achieve the Paris Climate Agreement. Taking estimates of mitigation costs from the three integrated assessment models (IAMs) in the IPCC AR6 database 23 that provide results under comparable scenarios (SSP2 baseline and SSP2-RCP2.6, in which RCP stands for Representative Concentration Pathway), we find that the median committed climate damages are larger than the median mitigation costs in 2050 (six trillion in 2005 international dollars) by a factor of approximately six (note that estimates of mitigation costs are only provided every 10 years by the IAMs and so a comparison in 2049 is not possible). This comparison simply aims to compare the magnitude of future damages against mitigation costs, rather than to conduct a formal cost–benefit analysis of transitioning from one emission path to another. Formal cost–benefit analyses typically find that the net benefits of mitigation only emerge after 2050 (ref.  5 ), which may lead some to conclude that physical damages from climate change are simply not large enough to outweigh mitigation costs until the second half of the century. Our simple comparison of their magnitudes makes clear that damages are actually already considerably larger than mitigation costs and the delayed emergence of net mitigation benefits results primarily from the fact that damages across different emission paths are indistinguishable until mid-century (Fig. 1 ).

Although these near-term damages constitute those to which the world is already committed, we note that damage estimates diverge strongly across emission scenarios after 2049, conveying the clear benefits of mitigation from a purely economic point of view that have been emphasized in previous studies 4 , 24 . As well as the uncertainties assessed in Fig. 1 , these conclusions are robust to structural choices, such as the timescale with which changes in the moderating variables of the empirical models are estimated (Supplementary Figs. 10 and 11 ), as well as the order in which one accounts for the intertemporal and international components of currency comparison (Supplementary Fig. 12 ; see Methods for further details).

Damages from variability and extremes

Committed damages primarily arise through changes in average temperature (Fig. 2 ). This reflects the fact that projected changes in average temperature are larger than those in other climate variables when expressed as a function of their historical interannual variability (Extended Data Fig. 4 ). Because the historical variability is that on which the empirical models are estimated, larger projected changes in comparison with this variability probably lead to larger future impacts in a purely statistical sense. From a mechanistic perspective, one may plausibly interpret this result as implying that future changes in average temperature are the most unprecedented from the perspective of the historical fluctuations to which the economy is accustomed and therefore will cause the most damage. This insight may prove useful in terms of guiding adaptation measures to the sources of greatest damage.

Figure 2

Estimates of the median projected reduction in sub-national income per capita across emission scenarios (SSP2-RCP2.6 and SSP2-RCP8.5) as well as climate model, empirical model and model parameter uncertainty in the year in which climate damages diverge at the 5% level (2049, as identified in Fig. 1 ). a , Impacts arising from all climate variables. b – f , Impacts arising separately from changes in annual mean temperature ( b ), daily temperature variability ( c ), total annual precipitation ( d ), the annual number of wet days (>1 mm) ( e ) and extreme daily rainfall ( f ) (see Methods for further definitions). Data on national administrative boundaries are obtained from the GADM database version 3.6 and are freely available for academic use ( https://gadm.org/ ).

Nevertheless, future damages based on empirical models that consider changes in annual average temperature only and exclude the other climate variables constitute income reductions of only 13% in 2049 (Extended Data Fig. 5a , likely range 5–21%). This suggests that accounting for the other components of the distribution of temperature and precipitation raises net damages by nearly 50%. This increase arises through the further damages that these climatic components cause, but also because their inclusion reveals a stronger negative economic response to average temperatures (Extended Data Fig. 5b ). The latter finding is consistent with our Monte Carlo simulations, which suggest that the magnitude of the effect of average temperature on economic growth is underestimated unless accounting for the impacts of other correlated climate variables (Supplementary Fig. 7 ).

In terms of the relative contributions of the different climatic components to overall damages, we find that accounting for daily temperature variability causes the largest increase in overall damages relative to empirical frameworks that only consider changes in annual average temperature (4.9 percentage points, likely range 2.4–8.7 percentage points, equivalent to approximately 10 trillion international dollars). Accounting for precipitation causes smaller increases in overall damages, which are nevertheless equivalent to approximately 1.2 trillion international dollars: 0.01 percentage points (−0.37 to 0.33 percentage points), 0.34 percentage points (0.07–0.90 percentage points) and 0.36 percentage points (0.13–0.65 percentage points) from total annual precipitation, the number of wet days and extreme daily precipitation, respectively. Moreover, climate models seem to underestimate future changes in temperature variability 25 and extreme precipitation 26 , 27 in response to anthropogenic forcing as compared with that observed historically, suggesting that the true impacts from these variables may be larger.

The distribution of committed damages

The spatial distribution of committed damages (Fig. 2a ) reflects a complex interplay between the patterns of future change in several climatic components and those of historical economic vulnerability to changes in those variables. Damages resulting from increasing annual mean temperature (Fig. 2b ) are negative almost everywhere globally, and larger at lower latitudes in regions in which temperatures are already higher and economic vulnerability to temperature increases is greatest (see the response heterogeneity to mean temperature embodied in Extended Data Fig. 1a ). This occurs despite the amplified warming projected at higher latitudes 28 , suggesting that regional heterogeneity in economic vulnerability to temperature changes outweighs heterogeneity in the magnitude of future warming (Supplementary Fig. 13a ). Economic damages owing to daily temperature variability (Fig. 2c ) exhibit a strong latitudinal polarisation, primarily reflecting the physical response of daily variability to greenhouse forcing in which increases in variability across lower latitudes (and Europe) contrast decreases at high latitudes 25 (Supplementary Fig. 13b ). These two temperature terms are the dominant determinants of the pattern of overall damages (Fig. 2a ), which exhibits a strong polarity with damages across most of the globe except at the highest northern latitudes. Future changes in total annual precipitation mainly bring economic benefits except in regions of drying, such as the Mediterranean and central South America (Fig. 2d and Supplementary Fig. 13c ), but these benefits are opposed by changes in the number of wet days, which produce damages with a similar pattern of opposite sign (Fig. 2e and Supplementary Fig. 13d ). By contrast, changes in extreme daily rainfall produce damages in all regions, reflecting the intensification of daily rainfall extremes over global land areas 29 , 30 (Fig. 2f and Supplementary Fig. 13e ).

The spatial distribution of committed damages implies considerable injustice along two dimensions: culpability for the historical emissions that have caused climate change and pre-existing levels of socio-economic welfare. Spearman’s rank correlations indicate that committed damages are significantly larger in countries with smaller historical cumulative emissions, as well as in regions with lower current income per capita (Fig. 3 ). This implies that those countries that will suffer the most from the damages already committed are those that are least responsible for climate change and which also have the least resources to adapt to it.

Figure 3

Estimates of the median projected change in national income per capita across emission scenarios (RCP2.6 and RCP8.5) as well as climate model, empirical model and model parameter uncertainty in the year in which climate damages diverge at the 5% level (2049, as identified in Fig. 1 ) are plotted against cumulative national emissions per capita in 2020 (from the Global Carbon Project) and coloured by national income per capita in 2020 (from the World Bank) in a and vice versa in b . In each panel, the size of each scatter point is weighted by the national population in 2020 (from the World Bank). Inset numbers indicate the Spearman’s rank correlation ρ and P -values for a hypothesis test whose null hypothesis is of no correlation, as well as the Spearman’s rank correlation weighted by national population.

To further quantify this heterogeneity, we assess the difference in committed damages between the upper and lower quartiles of regions when ranked by present income levels and historical cumulative emissions (using a population weighting to both define the quartiles and estimate the group averages). On average, the quartile of countries with lower income are committed to an income loss that is 8.9 percentage points (or 61%) greater than the upper quartile (Extended Data Fig. 6 ), with a likely range of 3.8–14.7 percentage points across the uncertainty sampling of our damage projections (following the likelihood classification adopted by the IPCC). Similarly, the quartile of countries with lower historical cumulative emissions are committed to an income loss that is 6.9 percentage points (or 40%) greater than the upper quartile, with a likely range of 0.27–12 percentage points. These patterns reemphasize the prevalence of injustice in climate impacts 31 , 32 , 33 in the context of the damages to which the world is already committed by historical emissions and socio-economic inertia.
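A population-weighted quartile comparison of this kind can be sketched as follows; `dmg`, `pop` and `inc` are hypothetical arrays over regions, not the study's data, and the same construction applies when ranking by cumulative emissions instead of income.

```python
import numpy as np

# `dmg`, `pop`, `inc`: hypothetical arrays over regions of committed income
# loss (percentage points), population and income per capita, respectively.
order = np.argsort(inc)                  # rank regions by income
cum = np.cumsum(pop[order]) / pop.sum()  # population-weighted cumulative share

low_q = order[cum <= 0.25]               # poorest population quartile
high_q = order[cum >= 0.75]              # richest population quartile

# Population-weighted gap in committed damages between the two quartiles.
gap = (np.average(dmg[low_q], weights=pop[low_q])
       - np.average(dmg[high_q], weights=pop[high_q]))
```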

Contextualizing the magnitude of damages

The magnitude of projected economic damages exceeds previous literature estimates 2 , 3 , arising from several methodological developments on previous approaches. Our estimates are larger than those of ref.  2 (see first row of Extended Data Table 3 ), primarily because sub-national estimates typically show a steeper temperature response (see also refs.  3 , 34 ) and because accounting for other climatic components raises damage estimates (Extended Data Fig. 5 ). However, we note that our empirical approach using first-differenced climate variables is conservative compared with that of ref.  2 in regard to the persistence of climate impacts on growth (see introduction and Methods section ‘Empirical model specification: fixed-effects distributed lag models’), an important determinant of the magnitude of long-term damages 19 , 21 . Using a similar empirical specification to ref.  2 , which assumes infinite persistence while maintaining the rest of our approach (sub-national data and further climate variables), produces considerably larger damages (purple curve of Extended Data Fig. 3 ). Compared with studies that do take the first difference of climate variables 3 , 35 , our estimates are also larger (see second and third rows of Extended Data Table 3 ). The inclusion of further climate variables (Extended Data Fig. 5 ) and a sufficient number of lags to more adequately capture the extent of impact persistence (Extended Data Figs. 1 and 2 ) are the main sources of this difference, as is the use of specifications that capture nonlinearities in the temperature response when compared with ref.  35 . In summary, our estimates develop on previous studies by incorporating the latest data and empirical insights 7 , 8 , as well as in providing a robust empirical lower bound on the persistence of impacts on economic growth, which constitutes a middle ground between the extremes of the growth-versus-levels debate 19 , 21 (Extended Data Fig. 3 ).

Compared with the fraction of variance explained by the empirical models historically (<5%), the projection of reductions in income of 19% may seem large. This arises owing to the fact that projected changes in climatic conditions are much larger than those that were experienced historically, particularly for changes in average temperature (Extended Data Fig. 4 ). As such, any assessment of future climate-change impacts necessarily requires an extrapolation outside the range of the historical data on which the empirical impact models were evaluated. Nevertheless, these models constitute the most state-of-the-art methods for inference of plausibly causal climate impacts based on observed data. Moreover, we take explicit steps to limit out-of-sample extrapolation by capping the moderating variables of the interaction terms at the 95th percentile of the historical distribution (see Methods ). This avoids extrapolating the marginal effects outside what was observed historically. Given the nonlinear response of economic output to annual mean temperature (Extended Data Fig. 1 and Extended Data Table 2 ), this is a conservative choice that limits the magnitude of damages that we project. Furthermore, back-of-the-envelope calculations indicate that the projected damages are consistent with the magnitude and patterns of historical economic development (see Supplementary Discussion Section  5 ).

Missing impacts and spatial spillovers

Despite assessing several climatic components from which economic impacts have recently been identified 3 , 7 , 8 , this assessment of aggregate climate damages should not be considered comprehensive. Important channels such as impacts from heatwaves 31 , sea-level rise 36 , tropical cyclones 37 and tipping points 38 , 39 , as well as non-market damages such as those to ecosystems 40 and human health 41 , are not considered in these estimates. Sea-level rise is unlikely to be feasibly incorporated into empirical assessments such as this because historical sea-level variability is mostly small. Non-market damages are inherently intractable within our estimates of impacts on aggregate monetary output and estimates of these impacts could arguably be considered as extra to those identified here. Recent empirical work suggests that accounting for these channels would probably raise estimates of these committed damages, with larger damages continuing to arise in the global south 31 , 36 , 37 , 38 , 39 , 40 , 41 , 42 .

Moreover, our main empirical analysis does not explicitly evaluate the potential for impacts in local regions to produce effects that ‘spill over’ into other regions. Such effects may further mitigate or amplify the impacts we estimate, for example, if companies relocate production from one affected region to another or if impacts propagate along supply chains. The current literature indicates that trade plays a substantial role in propagating spillover effects 43 , 44 , making their assessment at the sub-national level challenging without available data on sub-national trade dependencies. Studies accounting for only spatially adjacent neighbours indicate that negative impacts in one region induce further negative impacts in neighbouring regions 45 , 46 , 47 , 48 , suggesting that our projected damages are probably conservative by excluding these effects. In Supplementary Fig. 14 , we assess spillovers from neighbouring regions using a spatial-lag model. For simplicity, this analysis excludes temporal lags, focusing only on contemporaneous effects. The results show that accounting for spatial spillovers can amplify the overall magnitude, and also the heterogeneity, of impacts. Consistent with previous literature, this indicates that the overall magnitude (Fig. 1 ) and heterogeneity (Fig. 3 ) of damages that we project in our main specification may be conservative without explicitly accounting for spillovers. We note that further analysis that addresses both spatially and trade-connected spillovers, while also accounting for delayed impacts using temporal lags, would be necessary to address this question fully. These approaches offer fruitful avenues for further research but are beyond the scope of this manuscript, which primarily aims to explore the impacts of different climate conditions and their persistence.

Policy implications

We find that the economic damages resulting from climate change until 2049 are those to which the world economy is already committed and that these greatly outweigh the costs required to mitigate emissions in line with the 2 °C target of the Paris Climate Agreement (Fig. 1 ). This assessment is complementary to formal analyses of the net costs and benefits associated with moving from one emission path to another, which typically find that net benefits of mitigation only emerge in the second half of the century 5 . Our simple comparison of the magnitude of damages and mitigation costs makes clear that this is primarily because damages are indistinguishable across emissions scenarios—that is, committed—until mid-century (Fig. 1 ) and that they are actually already much larger than mitigation costs. For simplicity, and owing to the availability of data, we compare damages to mitigation costs at the global level. Regional estimates of mitigation costs may shed further light on the national incentives for mitigation to which our results already hint, of relevance for international climate policy. Although these damages are committed from a mitigation perspective, adaptation may provide an opportunity to reduce them. Moreover, the strong divergence of damages after mid-century reemphasizes the clear benefits of mitigation from a purely economic perspective, as highlighted in previous studies 1 , 4 , 6 , 24 .

Historical climate data

Historical daily 2-m temperature and precipitation totals (in mm) are obtained for the period 1979–2019 from the W5E5 database. The W5E5 dataset comes from ERA-5, a state-of-the-art reanalysis of historical observations, but has been bias-adjusted by applying version 2.0 of the WATCH Forcing Data to ERA-5 reanalysis data and precipitation data from version 2.3 of the Global Precipitation Climatology Project to better reflect ground-based measurements 49 , 50 , 51 . We obtain these data on a 0.5° × 0.5° grid from the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) database. Notably, these historical data have been used to bias-adjust future climate projections from CMIP-6 (see the following section), ensuring consistency between the distribution of historical daily weather on which our empirical models were estimated and the climate projections used to estimate future damages. These data are publicly available from the ISIMIP database. See refs.  7 , 8 for robustness tests of the empirical models to the choice of climate data reanalysis products.

Future climate data

Daily 2-m temperature and precipitation totals (in mm) are taken from 21 climate models participating in CMIP-6 under a high (RCP8.5) and a low (RCP2.6) greenhouse gas emission scenario from 2015 to 2100. The data have been bias-adjusted and statistically downscaled to a common half-degree grid to reflect the historical distribution of daily temperature and precipitation of the W5E5 dataset using the trend-preserving method developed by the ISIMIP 50 , 52 . As such, the climate model data reproduce observed climatological patterns exceptionally well (Supplementary Table 5 ). Gridded data are publicly available from the ISIMIP database.

Historical economic data

Historical economic data come from the DOSE database of sub-national economic output 53 . We use a recent revision to the DOSE dataset that provides data for 1,660 sub-national regions across 83 countries, with varying temporal coverage from 1960 to 2019. Sub-national units constitute the first administrative division below national, for example, states for the USA and provinces for China. Data come from measures of gross regional product per capita (GRPpc) or income per capita in local currencies, reflecting the values reported in national statistical agencies, yearbooks and, in some cases, academic literature. We follow previous literature 3 , 7 , 8 , 54 and assess real sub-national output per capita by first converting values from local currencies to US dollars to account for diverging national inflationary tendencies and then accounting for US inflation using a US deflator. Alternatively, one might first account for national inflation and then convert between currencies. Supplementary Fig. 12 demonstrates that our conclusions are consistent when accounting for price changes in the reversed order, although the magnitude of estimated damages varies. See the documentation of the DOSE dataset for further discussion of these choices. Conversions between currencies are conducted using exchange rates from the FRED database of the Federal Reserve Bank of St. Louis 55 and the national deflators from the World Bank 56 .
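The ordering of the two deflation steps can be made concrete with a small sketch; `grp_lcu`, `fx` and `us_deflator` are hypothetical annual series, not the study's data.

```python
# `grp_lcu` (local-currency GRPpc), `fx` (LCU per USD) and `us_deflator`
# (US price index, base 2005 = 1) are hypothetical pandas Series by year.
grp_usd = grp_lcu / fx            # step 1: convert to nominal US dollars
grp_real = grp_usd / us_deflator  # step 2: deflate to real 2005 US dollars
# The reversed order (deflate nationally first, then convert) is the
# robustness check reported in Supplementary Fig. 12.
```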

Future socio-economic data

Baseline gridded gross domestic product (GDP) and population data for the period 2015–2100 are taken from the middle-of-the-road scenario SSP2 (ref.  15 ). Population data have been downscaled to a half-degree grid by the ISIMIP following the methodologies of refs.  57 , 58 , which we then aggregate to the sub-national level of our economic data using the spatial aggregation procedure described below. Because current methodologies for downscaling the GDP of the SSPs use downscaled population to do so, per-capita estimates of GDP with a realistic distribution at the sub-national level are not readily available for the SSPs. We therefore use national-level GDP per capita (GDPpc) projections for all sub-national regions of a given country, assuming homogeneity within countries in terms of baseline GDPpc. Here we use projections that have been updated to account for the impact of the COVID-19 pandemic on the trajectory of future income, while remaining consistent with the long-term development of the SSPs 59 . The choice of baseline SSP alters the magnitude of projected climate damages in monetary terms, but when assessed in terms of percentage change from the baseline, the choice of socio-economic scenario is inconsequential. Gridded SSP population data and national-level GDPpc data are publicly available from the ISIMIP database. Sub-national estimates as used in this study are available in the code and data replication files.

Climate variables

Following recent literature 3 , 7 , 8 , we calculate an array of climate variables for which substantial impacts on macroeconomic output have been identified empirically, supported by further evidence at the micro level for plausible underlying mechanisms. See refs.  7 , 8 for an extensive motivation for the use of these particular climate variables and for detailed empirical tests on the nature and robustness of their effects on economic output. To summarize, these studies have found evidence for independent impacts on economic growth rates from annual average temperature, daily temperature variability, total annual precipitation, the annual number of wet days and extreme daily rainfall. Assessments of daily temperature variability were motivated by evidence of impacts on agricultural output and human health, as well as macroeconomic literature on the impacts of volatility on growth when manifest in different dimensions, such as government spending, exchange rates and even output itself 7 . Assessments of precipitation impacts were motivated by evidence of impacts on agricultural productivity, metropolitan labour outcomes and conflict, as well as damages caused by flash flooding 8 . See Extended Data Table 1 for detailed references to empirical studies of these physical mechanisms. Marked impacts of daily temperature variability, total annual precipitation, the number of wet days and extreme daily rainfall on macroeconomic output were identified robustly across different climate datasets, spatial aggregation schemes, specifications of regional time trends and error-clustering approaches. They were also found to be robust to the consideration of temperature extremes 7 , 8 . Furthermore, these climate variables were identified as having independent effects on economic output 7 , 8 , which we further explain here using Monte Carlo simulations to demonstrate the robustness of the results to concerns of imperfect multicollinearity between climate variables (Supplementary Methods Section  2 ), as well as by using information criteria (Supplementary Table 1 ) to demonstrate that including several lagged climate variables provides a preferable trade-off between optimally describing the data and limiting the possibility of overfitting.

We calculate these variables from the distribution of daily, d , temperature, \(T_{x,d}\) , and precipitation, \(P_{x,d}\) , at the grid-cell, x , level for both the historical and future climate data. As well as annual mean temperature, \({\bar{T}}_{x,y}\) , and annual total precipitation, \(P_{x,y}\) , we calculate annual, y , measures of daily temperature variability, \({\widetilde{T}}_{x,y}\) :

$${\widetilde{T}}_{x,y}=\frac{1}{12}\sum_{m=1}^{12}\sqrt{\frac{1}{D_{m}}\sum_{d=1}^{D_{m}}{\left(T_{x,d,m,y}-{\bar{T}}_{x,m,y}\right)}^{2}},\qquad (1)$$

the number of wet days, \({\mathrm{Pwd}}_{x,y}\) :

$${\mathrm{Pwd}}_{x,y}=\sum_{d=1}^{D_{y}}H\left(P_{x,d}-1\,{\mathrm{mm}}\right),\qquad (2)$$

and extreme daily rainfall:

$${\mathrm{Pext}}_{x,y}=\sum_{d=1}^{D_{y}}P_{x,d}\,H\left(P_{x,d}-P_{99.9x}\right),\qquad (3)$$

in which \(T_{x,d,m,y}\) is the grid-cell-specific daily temperature in month m and year y , \({\bar{T}}_{x,m,y}\) is the year- and grid-cell-specific monthly, m , mean temperature, \(D_{m}\) and \(D_{y}\) the number of days in a given month m or year y , respectively, H the Heaviside step function, 1 mm the threshold used to define wet days and \(P_{99.9x}\) is the 99.9th percentile of historical (1979–2019) daily precipitation at the grid-cell level. Units of the climate measures are degrees Celsius for annual mean temperature and daily temperature variability, millimetres for total annual precipitation and extreme daily precipitation, and simply the number of days for the annual number of wet days.
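A sketch of how these annual measures could be computed from daily series for a single grid cell follows; the pandas inputs are hypothetical and the paper's own pipeline may differ in implementation details.

```python
import pandas as pd

def annual_measures(temp: pd.Series, precip: pd.Series, p999: float) -> pd.DataFrame:
    """Annual climate measures for one grid cell from daily series indexed
    by a DatetimeIndex; p999 is the cell's historical 99.9th percentile of
    daily precipitation."""
    yr, mo = temp.index.year, temp.index.month
    t_var = (temp.groupby([yr, mo]).std(ddof=0)  # within-month std of daily T
                 .groupby(level=0).mean())       # averaged over the 12 months
    return pd.DataFrame({
        "Tmean": temp.groupby(yr).mean(),                        # annual mean temperature
        "Tvar": t_var,                                           # daily temperature variability
        "Ptot": precip.groupby(precip.index.year).sum(),         # total annual precipitation
        "Pwd": (precip > 1.0).groupby(precip.index.year).sum(),  # wet days (>1 mm)
        "Pext": precip.where(precip > p999, 0.0)
                      .groupby(precip.index.year).sum(),         # extreme daily rainfall
    })
```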

We also calculated weighted standard deviations of monthly rainfall totals, as used in ref.  8 , but do not include them in our projections because, when accounting for delayed effects, their effect becomes statistically indistinguishable from zero and is better captured by changes in total annual rainfall.

Spatial aggregation

We aggregate grid-cell-level historical and future climate measures, as well as grid-cell-level future GDPpc and population, to the level of the first administrative unit below national level of the GADM database, using an area-weighting algorithm that estimates the portion of each grid cell falling within an administrative boundary. We use this as our baseline specification following previous findings that the effect of area or population weighting at the sub-national level is negligible 7 , 8 .
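Schematically, such an area-weighted aggregation reduces to a single matrix product; `W` and `x` below are hypothetical containers standing in for the precomputed overlap weights and a gridded field.

```python
import numpy as np

# `W`: hypothetical (n_regions, n_cells) matrix, W[r, c] = fraction of region
# r's area falling in grid cell c (rows sum to 1); `x`: gridded values for
# one year, flattened to shape (n_cells,).
regional = W @ x  # area-weighted average of the gridded field per region
```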

Empirical model specification: fixed-effects distributed lag models

Following a wide range of climate econometric literature 16 , 60 , we use panel regression models with a selection of fixed effects and time trends to isolate plausibly exogenous variation with which to maximize confidence in a causal interpretation of the effects of climate on economic growth rates. The use of region fixed effects, μ r , accounts for unobserved time-invariant differences between regions, such as prevailing climatic norms and growth rates owing to historical and geopolitical factors. The use of yearly fixed effects, η y , accounts for regionally invariant annual shocks to the global climate or economy such as the El Niño–Southern Oscillation or global recessions. In our baseline specification, we also include region-specific linear time trends, k r y , to exclude the possibility of spurious correlations resulting from common slow-moving trends in climate and growth.

The persistence of climate impacts on economic growth rates is a key determinant of the long-term magnitude of damages. Methods for inferring the extent of persistence in impacts on growth rates have typically used lagged climate variables to evaluate the presence of delayed effects or catch-up dynamics 2 , 18 . For example, consider starting from a model in which a climate condition, \(C_{r,y}\) (for example, annual mean temperature), affects the growth rate, \(\Delta{\mathrm{lgrp}}_{r,y}\) (the first difference of the logarithm of gross regional product), of region r in year y :

$$\Delta{\mathrm{lgrp}}_{r,y}=\alpha\,C_{r,y},\qquad (4)$$

which we refer to as a ‘pure growth effects’ model in the main text. Typically, further lags are included,

$$\Delta{\mathrm{lgrp}}_{r,y}=\sum_{L=0}^{NL}\alpha_{L}\,C_{r,y-L},\qquad (5)$$

and the cumulative effect of all lagged terms is evaluated to assess the extent to which climate impacts on growth rates persist. Following ref.  18 , in the case that

$$\sum_{L=0}^{NL}\alpha_{L}\ne 0,\qquad (6)$$

the implication is that impacts on the growth rate persist up to NL years after the initial shock (possibly to a weaker or a stronger extent), whereas if

$$\sum_{L=0}^{NL}\alpha_{L}=0,\qquad (7)$$

then the initial impact on the growth rate is recovered after NL years and the effect is only one on the level of output. However, we note that such approaches are limited by the fact that, when including an insufficient number of lags to detect a recovery of the growth rates, one may find equation ( 6 ) to be satisfied and incorrectly assume that a change in climatic conditions affects the growth rate indefinitely. In practice, given a limited record of historical data, including too few lags, and therefore erroneously concluding that impacts on the growth rate are infinitely persistent, is likely, particularly over the long timescales over which future climate damages are often projected 2 , 24 . To avoid this issue, we instead begin our analysis with a model for which the level of output, \({\mathrm{lgrp}}_{r,y}\) , depends on the level of a climate variable, \(C_{r,y}\) :

$${\mathrm{lgrp}}_{r,y}=\alpha\,C_{r,y}.\qquad (8)$$

Given the non-stationarity of the level of output, we follow the literature 19 and estimate such an equation in first-differenced form as

$$\Delta{\mathrm{lgrp}}_{r,y}=\alpha\,\Delta C_{r,y},$$

which we refer to as a model of ‘pure level effects’ in the main text. This model constitutes a baseline specification in which a permanent change in the climate variable produces an instantaneous impact on the growth rate and a permanent effect only on the level of output. By including lagged variables in this specification,

$$\Delta{\mathrm{lgrp}}_{r,y}=\sum_{L=0}^{NL}\alpha_{L}\,\Delta C_{r,y-L},\qquad (9)$$

we are able to test whether the impacts on the growth rate persist any further than instantaneously by evaluating whether the \(\alpha_{L}\) for L > 0 are statistically significantly different from zero. Even though this framework is also limited by the possibility of including too few lags, the choice of a baseline model specification in which impacts on the growth rate do not persist means that, in the case of including too few lags, the framework reverts to the baseline specification of level effects. As such, this framework is conservative with respect to the persistence of impacts and the magnitude of future damages. It naturally avoids assumptions of infinite persistence and we are able to interpret any persistence that we identify with equation ( 9 ) as a lower bound on the extent of climate impact persistence on growth rates. See the main text for further discussion of this specification choice, in particular about its conservative nature compared with previous literature estimates, such as refs.  2 , 18 .
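A toy simulation makes the difference between the two priors concrete; the coefficient value and horizon below are purely illustrative and not taken from the study.

```python
import numpy as np

years = np.arange(30)
C = (years >= 10).astype(float)  # permanent +1 shift in the climate variable at t = 10
alpha = -0.02                    # illustrative marginal effect

# Growth-effects prior (dlgrp = alpha * C): growth is depressed every year
# after the shift, so output diverges from the baseline without bound.
growth_prior = alpha * C
# Level-effects prior (dlgrp = alpha * dC): a one-off hit to the growth rate,
# so output drops once and then runs parallel to the baseline.
level_prior = alpha * np.diff(C, prepend=0.0)

lgrp_growth = np.cumsum(growth_prior)  # log output under the growth prior
lgrp_level = np.cumsum(level_prior)    # log output under the level prior
```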

We allow the response to climatic changes to vary across regions, using interactions of the climate variables with historical average (1979–2019) climatic conditions reflecting heterogenous effects identified in previous work 7 , 8 . Following this previous work, the moderating variables of these interaction terms constitute the historical average of either the variable itself or of the seasonal temperature difference, \({\hat{T}}_{r}\) , or annual mean temperature, \({\bar{T}}_{r}\) , in the case of daily temperature variability 7 and extreme daily rainfall, respectively 8 .

The resulting regression equation with N and M lagged variables for the temperature and precipitation terms, respectively, reads (written schematically as a sum over the five climate variables \(C_{i}\) , each interacted with its moderating variable \({\bar{C}}_{i,r}\) ):

$$\Delta{\mathrm{lgrp}}_{r,y}=\mu_{r}+\eta_{y}+k_{r}\,y+\sum_{i}\sum_{L=0}^{N_{i}}\left(\alpha_{i,L}+\alpha_{i,L}^{I}\,{\bar{C}}_{i,r}\right)\Delta C_{i,r,y-L}+\epsilon_{r,y},\qquad (10)$$

in which \(N_{i}=N\) for the temperature terms, \(N_{i}=M\) for the precipitation terms and \(\alpha_{i,L}^{I}\) are the interaction coefficients. Here \(\Delta{\mathrm{lgrp}}_{r,y}\) is the annual, regional GRPpc growth rate, measured as the first difference of the logarithm of real GRPpc, following previous work 2 , 3 , 7 , 8 , 18 , 19 . Fixed-effects regressions were run using the fixest package in R (ref.  61 ).

Estimates of the coefficients of interest \(\alpha_{i,L}\) are shown in Extended Data Fig. 1 for N  =  M  = 10 lags and for our preferred choice of the number of lags in Supplementary Figs. 1 – 3 . In Extended Data Fig. 1 , errors are shown clustered at the regional level, but for the construction of damage projections, we block-bootstrap the regressions by region 1,000 times to provide a range of parameter estimates with which to sample the projection uncertainty (following refs.  2 , 31 ).
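A block bootstrap of this kind can be sketched as follows; `df` and its column names are illustrative, and the regression re-estimation step on each sample is omitted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def one_bootstrap_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Resample whole regions with replacement, preserving each region's
    time series so that within-region temporal dependence is kept intact."""
    regions = df["region"].unique()
    drawn = rng.choice(regions, size=len(regions), replace=True)
    parts = []
    for i, r in enumerate(drawn):
        block = df[df["region"] == r].copy()
        block["region"] = f"bs_{i}"  # relabel so repeated regions stay distinct
        parts.append(block)
    return pd.concat(parts, ignore_index=True)

# Re-estimating the regression on, say, 1,000 such samples yields the
# distribution of parameter estimates used to sample projection uncertainty.
```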

Spatial-lag model

In Supplementary Fig. 14 , we present the results from a spatial-lag model that explores the potential for climate impacts to ‘spill over’ into spatially neighbouring regions. We measure the distance between centroids of each pair of sub-national regions and construct spatial lags that take the average of the first-differenced climate variables and their interaction terms over neighbouring regions that are at distances of 0–500, 500–1,000, 1,000–1,500 and 1,500–2,000 km (spatial lags, ‘SL’, 1 to 4). For simplicity, we then assess a spatial-lag model without temporal lags to assess spatial spillovers of contemporaneous climate impacts. Written schematically over the climate variables \(C_{i}\) and their moderating variables \({\bar{C}}_{i,r}\) , this model takes the form:

$$\Delta{\mathrm{lgrp}}_{r,y}=\mu_{r}+\eta_{y}+k_{r}\,y+\sum_{i}\left(\alpha_{i}+\alpha_{i}^{I}\,{\bar{C}}_{i,r}\right)\Delta C_{i,r,y}+\sum_{i}\sum_{k=1}^{4}\left(\alpha_{i}^{{\mathrm{SL}}k}+\alpha_{i}^{I,{\mathrm{SL}}k}\,{\bar{C}}_{i,r}\right){\mathrm{SL}}k\left(\Delta C_{i,r,y}\right)+\epsilon_{r,y},$$

in which SL indicates the spatial lag of each climate variable and interaction term. In Supplementary Fig. 14 , we plot the cumulative marginal effect of each climate variable at different baseline climate conditions by summing the coefficients for each climate variable and interaction term, for example, for average temperature impacts as:

$$\left(\alpha_{\bar{T}}+\alpha_{\bar{T}}^{I}\,{\bar{T}}_{r}\right)+\sum_{k=1}^{4}\left(\alpha_{\bar{T}}^{{\mathrm{SL}}k}+\alpha_{\bar{T}}^{I,{\mathrm{SL}}k}\,{\bar{T}}_{r}\right).$$

These cumulative marginal effects can be regarded as the overall spatially dependent impact to an individual region given a one-unit shock to a climate variable in that region and all neighbouring regions at a given value of the moderating variable of the interaction term.
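The construction of such distance-banded spatial lags can be sketched as follows; `lat`, `lon` and `dC` are hypothetical inputs standing in for region centroids and one year's first-differenced climate variable.

```python
import numpy as np

def haversine_km(lat, lon):
    """Pairwise great-circle distances (km) between region centroids."""
    la, lo = np.radians(lat), np.radians(lon)
    dla = la[:, None] - la[None, :]
    dlo = lo[:, None] - lo[None, :]
    a = (np.sin(dla / 2) ** 2
         + np.cos(la)[:, None] * np.cos(la)[None, :] * np.sin(dlo / 2) ** 2)
    return 2.0 * 6371.0 * np.arcsin(np.sqrt(a))

# `lat`, `lon`: centroid coordinates (degrees); `dC`: shape (n_regions,).
D = haversine_km(lat, lon)
spatial_lags = {}
bands = [(0, 500), (500, 1000), (1000, 1500), (1500, 2000)]
for k, (d0, d1) in enumerate(bands, start=1):
    M = (D > d0) & (D <= d1)          # neighbours within this distance band
    n = M.sum(axis=1)                 # number of neighbours per region
    spatial_lags[f"SL{k}"] = M.astype(float) @ dC / np.maximum(n, 1)
```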

Constructing projections of economic damage from future climate change

We construct projections of future climate damages by applying the coefficients estimated in equation ( 10 ) and shown in Supplementary Tables 2 – 4 (when including only lags with statistically significant effects in specifications that limit overfitting; see Supplementary Methods Section  1 ) to projections of future climate change from the CMIP-6 models. Year-on-year changes in each primary climate variable of interest are calculated to reflect the year-to-year variations used in the empirical models. 30-year moving averages of the moderating variables of the interaction terms are calculated to reflect the long-term average of climatic conditions that were used for the moderating variables in the empirical models. By using moving averages in the projections, we account for the changing vulnerability to climate shocks based on the evolving long-term conditions (Supplementary Figs. 10 and 11 show that the results are robust to the precise choice of the window of this moving average). Although these climate variables are not differenced, the fact that the bias-adjusted climate models reproduce observed climatological patterns across regions for these moderating variables very accurately (Supplementary Table 6 ) with limited spread across models (<3%) precludes the possibility that any considerable bias or uncertainty is introduced by this methodological choice. However, we impose caps on these moderating variables at the 95th percentile at which they were observed in the historical data to prevent extrapolation of the marginal effects outside the range in which the regressions were estimated. This is a conservative choice that limits the magnitude of our damage projections.
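The moving-average and capping steps for a moderating variable amount to the following one-liner; `tbar` and `hist_cap` are hypothetical stand-ins for a region's annual mean temperature series and its historical 95th percentile.

```python
# `tbar`: hypothetical pandas Series of a region's annual mean temperature
# indexed by year; `hist_cap`: its 95th percentile over 1979-2019.
moderator = (tbar.rolling(window=30).mean()  # 30-year moving average
                 .clip(upper=hist_cap))      # cap to avoid extrapolating marginal effects
```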

Time series of primary climate variables and moderating climate variables are then combined with estimates of the empirical model parameters to evaluate the regression coefficients in equation ( 10 ), producing a time series of annual GRPpc growth-rate reductions for a given emission scenario, climate model and set of empirical model parameters. The resulting time series of growth-rate impacts reflects those occurring owing to future climate change. By contrast, a future scenario with no climate change would be one in which climate variables do not change (other than with random year-to-year fluctuations) and hence the time-averaged evaluation of equation ( 10 ) would be zero. Our approach therefore implicitly compares the future climate-change scenario to this no-climate-change baseline scenario.

The time series of growth-rate impacts owing to future climate change in region r and year y , \(\delta_{r,y}\) , are then added to the future baseline growth rates, \(\pi_{r,y}\) (in log-diff form), obtained from the SSP2 scenario to yield trajectories of damaged GRPpc growth rates, \(\rho_{r,y}\) . These trajectories are aggregated over time to estimate the future trajectory of GRPpc with future climate impacts:

$${\mathrm{GRPpc}}_{r,y}={\mathrm{GRPpc}}_{r,y=2020}+\sum_{y^{\prime}=2021}^{y}\rho_{r,y^{\prime}},\qquad \rho_{r,y^{\prime}}=\pi_{r,y^{\prime}}+\delta_{r,y^{\prime}},$$

in which \({\mathrm{GRPpc}}_{r,y=2020}\) is the initial log level of GRPpc. We begin damage estimates in 2020 to reflect the damages occurring since the end of the period for which we estimate the empirical models (1979–2019) and to match the timing of mitigation-cost estimates from most IAMs (see below).
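In code, this accumulation is a cumulative sum; `pi`, `delta` and `lgrp0` are hypothetical inputs for a single region.

```python
import numpy as np

# `pi`, `delta`: hypothetical (n_years,) arrays of baseline log growth rates
# and climate-induced growth-rate impacts from 2021 onwards; `lgrp0`: the log
# level of GRPpc in 2020.
rho = pi + delta               # damaged growth rate (log-difference form)
lgrp = lgrp0 + np.cumsum(rho)  # log GRPpc trajectory with climate impacts
grppc = np.exp(lgrp)           # level of GRPpc with climate impacts
```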

For each emission scenario, this procedure is repeated 1,000 times while randomly sampling from the selection of climate models, the selection of empirical models with different numbers of lags (shown in Supplementary Figs. 1 – 3 and Supplementary Tables 2 – 4 ) and bootstrapped estimates of the regression parameters. The result is an ensemble of future GRPpc trajectories that reflect uncertainty from both physical climate change and the structural and sampling uncertainty of the empirical models.
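The sampling loop can be sketched as follows; all containers and the `project_grppc` function are hypothetical stand-ins for the paper's machinery, shown only to illustrate that each draw is applied to both scenarios.

```python
import numpy as np

rng = np.random.default_rng(42)

# `climate_models`, `lag_specs`, `boot_params[spec]` and `project_grppc` are
# hypothetical containers/functions; boot_params[spec] holds the 1,000
# block-bootstrap parameter sets for a given lag specification.
trajectories = {"SSP2-RCP2.6": [], "SSP5-RCP8.5": []}
for _ in range(1000):
    cm = climate_models[rng.integers(len(climate_models))]
    spec = lag_specs[rng.integers(len(lag_specs))]
    params = boot_params[spec][rng.integers(len(boot_params[spec]))]
    for scen in trajectories:  # the same draw is used for both scenarios
        trajectories[scen].append(project_grppc(cm, scen, spec, params))
```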

Estimates of mitigation costs

We obtain IPCC estimates of the aggregate costs of emission mitigation from the AR6 Scenario Explorer and Database hosted by IIASA 23 . Specifically, we search the AR6 Scenarios Database World v1.1 for IAMs that provided estimates of global GDP and population under both a SSP2 baseline and a SSP2-RCP2.6 scenario to maintain consistency with the socio-economic and emission scenarios of the climate damage projections. We find five IAMs that provide data for these scenarios, namely, MESSAGE-GLOBIOM 1.0, REMIND-MAgPIE 1.5, AIM/CGE 2.0, GCAM 4.2 and WITCH-GLOBIOM 3.1. Of these five IAMs, we use the results only from the first three that passed the IPCC vetting procedure for reproducing historical emission and climate trajectories. We then estimate global mitigation costs as the percentage difference in global per capita GDP between the SSP2 baseline and the SSP2-RCP2.6 emission scenario. In the case of one of these IAMs, estimates of mitigation costs begin in 2020, whereas in the case of two others, mitigation costs begin in 2010. The mitigation cost estimates before 2020 in these two IAMs are mostly negligible, and our choice to begin comparison with damage estimates in 2020 is conservative with respect to the relative weight of climate damages compared with mitigation costs for these two IAMs.

Data availability

Data on economic production and ERA-5 climate data are publicly available at https://doi.org/10.5281/zenodo.4681306 (ref. 62 ) and https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5 , respectively. Data on mitigation costs are publicly available at https://data.ene.iiasa.ac.at/ar6/#/downloads . Processed climate and economic data, as well as all other necessary data for reproduction of the results, are available at the public repository https://doi.org/10.5281/zenodo.10562951  (ref. 63 ).

Code availability

All code necessary for reproduction of the results is available at the public repository https://doi.org/10.5281/zenodo.10562951  (ref. 63 ).

References

1. Glanemann, N., Willner, S. N. & Levermann, A. Paris Climate Agreement passes the cost-benefit test. Nat. Commun. 11, 110 (2020).
2. Burke, M., Hsiang, S. M. & Miguel, E. Global non-linear effect of temperature on economic production. Nature 527, 235–239 (2015).
3. Kalkuhl, M. & Wenz, L. The impact of climate conditions on economic production. Evidence from a global panel of regions. J. Environ. Econ. Manag. 103, 102360 (2020).
4. Moore, F. C. & Diaz, D. B. Temperature impacts on economic growth warrant stringent mitigation policy. Nat. Clim. Change 5, 127–131 (2015).
5. Drouet, L., Bosetti, V. & Tavoni, M. Net economic benefits of well-below 2°C scenarios and associated uncertainties. Oxf. Open Clim. Change 2, kgac003 (2022).
6. Ueckerdt, F. et al. The economically optimal warming limit of the planet. Earth Syst. Dyn. 10, 741–763 (2019).
7. Kotz, M., Wenz, L., Stechemesser, A., Kalkuhl, M. & Levermann, A. Day-to-day temperature variability reduces economic growth. Nat. Clim. Change 11, 319–325 (2021).
8. Kotz, M., Levermann, A. & Wenz, L. The effect of rainfall changes on economic production. Nature 601, 223–227 (2022).
9. Kousky, C. Informing climate adaptation: a review of the economic costs of natural disasters. Energy Econ. 46, 576–592 (2014).
10. Harlan, S. L. et al. in Climate Change and Society: Sociological Perspectives (eds Dunlap, R. E. & Brulle, R. J.) 127–163 (Oxford Univ. Press, 2015).
11. Bolton, P. et al. The Green Swan (BIS Books, 2020).
12. Alogoskoufis, S. et al. ECB Economy-wide Climate Stress Test: Methodology and Results (European Central Bank, 2021).
13. Weber, E. U. What shapes perceptions of climate change? Wiley Interdiscip. Rev. Clim. Change 1, 332–342 (2010).
14. Markowitz, E. M. & Shariff, A. F. Climate change and moral judgement. Nat. Clim. Change 2, 243–247 (2012).
15. Riahi, K. et al. The shared socioeconomic pathways and their energy, land use, and greenhouse gas emissions implications: an overview. Glob. Environ. Change 42, 153–168 (2017).
16. Auffhammer, M., Hsiang, S. M., Schlenker, W. & Sobel, A. Using weather data and climate model output in economic analyses of climate change. Rev. Environ. Econ. Policy 7, 181–198 (2013).
17. Kolstad, C. D. & Moore, F. C. Estimating the economic impacts of climate change using weather observations. Rev. Environ. Econ. Policy 14, 1–24 (2020).
18. Dell, M., Jones, B. F. & Olken, B. A. Temperature shocks and economic growth: evidence from the last half century. Am. Econ. J. Macroecon. 4, 66–95 (2012).
19. Newell, R. G., Prest, B. C. & Sexton, S. E. The GDP-temperature relationship: implications for climate change damages. J. Environ. Econ. Manag. 108, 102445 (2021).
20. Kikstra, J. S. et al. The social cost of carbon dioxide under climate-economy feedbacks and temperature variability. Environ. Res. Lett. 16, 094037 (2021).
21. Bastien-Olvera, B. & Moore, F. Persistent effect of temperature on GDP identified from lower frequency temperature variability. Environ. Res. Lett. 17, 084038 (2022).
22. Eyring, V. et al. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev. 9, 1937–1958 (2016).
23. Byers, E. et al. AR6 scenarios database. Zenodo https://zenodo.org/records/7197970 (2022).
24. Burke, M., Davis, W. M. & Diffenbaugh, N. S. Large potential reduction in economic damages under UN mitigation targets. Nature 557, 549–553 (2018).
25. Kotz, M., Wenz, L. & Levermann, A. Footprint of greenhouse forcing in daily temperature variability. Proc. Natl Acad. Sci. 118, e2103294118 (2021).
26. Myhre, G. et al. Frequency of extreme precipitation increases extensively with event rareness under global warming. Sci. Rep. 9, 16063 (2019).
27. Min, S.-K., Zhang, X., Zwiers, F. W. & Hegerl, G. C. Human contribution to more-intense precipitation extremes. Nature 470, 378–381 (2011).
28. England, M. R., Eisenman, I., Lutsko, N. J. & Wagner, T. J. The recent emergence of Arctic Amplification. Geophys. Res. Lett. 48, e2021GL094086 (2021).
29. Fischer, E. M. & Knutti, R. Anthropogenic contribution to global occurrence of heavy-precipitation and high-temperature extremes. Nat. Clim. Change 5, 560–564 (2015).
30. Pfahl, S., O’Gorman, P. A. & Fischer, E. M. Understanding the regional pattern of projected future changes in extreme precipitation. Nat. Clim. Change 7, 423–427 (2017).
31. Callahan, C. W. & Mankin, J. S. Globally unequal effect of extreme heat on economic growth. Sci. Adv. 8, eadd3726 (2022).
32. Diffenbaugh, N. S. & Burke, M. Global warming has increased global economic inequality. Proc. Natl Acad. Sci. 116, 9808–9813 (2019).
33. Callahan, C. W. & Mankin, J. S. National attribution of historical climate damages. Clim. Change 172, 40 (2022).
34. Burke, M. & Tanutama, V. Climatic constraints on aggregate economic output. National Bureau of Economic Research, Working Paper 25779. https://doi.org/10.3386/w25779 (2019).
35. Kahn, M. E. et al. Long-term macroeconomic effects of climate change: a cross-country analysis. Energy Econ. 104, 105624 (2021).
36. Desmet, K. et al. Evaluating the economic cost of coastal flooding. National Bureau of Economic Research, Working Paper 24918. https://doi.org/10.3386/w24918 (2018).
37. Hsiang, S. M. & Jina, A. S. The causal effect of environmental catastrophe on long-run economic growth: evidence from 6,700 cyclones. National Bureau of Economic Research, Working Paper 20352. https://doi.org/10.3386/w20352 (2014).
38. Ritchie, P. D. et al. Shifts in national land use and food production in Great Britain after a climate tipping point. Nat. Food 1, 76–83 (2020).
39. Dietz, S., Rising, J., Stoerk, T. & Wagner, G. Economic impacts of tipping points in the climate system. Proc. Natl Acad. Sci. 118, e2103081118 (2021).
40. Bastien-Olvera, B. A. & Moore, F. C. Use and non-use value of nature and the social cost of carbon. Nat. Sustain. 4, 101–108 (2021).
41. Carleton, T. et al. Valuing the global mortality consequences of climate change accounting for adaptation costs and benefits. Q. J. Econ. 137, 2037–2105 (2022).
42. Bastien-Olvera, B. A. et al. Unequal climate impacts on global values of natural capital. Nature 625, 722–727 (2024).
43. Malik, A. et al. Impacts of climate change and extreme weather on food supply chains cascade across sectors and regions in Australia. Nat. Food 3, 631–643 (2022).
44. Kuhla, K., Willner, S. N., Otto, C., Geiger, T. & Levermann, A. Ripple resonance amplifies economic welfare loss from weather extremes. Environ. Res. Lett. 16, 114010 (2021).
45. Schleypen, J. R., Mistry, M. N., Saeed, F. & Dasgupta, S. Sharing the burden: quantifying climate change spillovers in the European Union under the Paris Agreement. Spat. Econ. Anal. 17, 67–82 (2022).
46. Dasgupta, S., Bosello, F., De Cian, E. & Mistry, M. Global temperature effects on economic activity and equity: a spatial analysis. European Institute on Economics and the Environment, Working Paper 22-1 (2022).
47. Neal, T. The importance of external weather effects in projecting the macroeconomic impacts of climate change. UNSW Economics Working Paper 2023-09 (2023).
48. Deryugina, T. & Hsiang, S. M. Does the environment still matter? Daily temperature and income in the United States. National Bureau of Economic Research, Working Paper 20750. https://doi.org/10.3386/w20750 (2014).
49. Hersbach, H. et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 146, 1999–2049 (2020).
50. Cucchi, M. et al. WFDE5: bias-adjusted ERA5 reanalysis data for impact studies. Earth Syst. Sci. Data 12, 2097–2120 (2020).
51. Adler, R. et al. The New Version 2.3 of the Global Precipitation Climatology Project (GPCP) Monthly Analysis Product 1072–1084 (University of Maryland, 2016).
52. Lange, S. Trend-preserving bias adjustment and statistical downscaling with ISIMIP3BASD (v1.0). Geosci. Model Dev. 12, 3055–3070 (2019).
53. Wenz, L., Carr, R. D., Kögel, N., Kotz, M. & Kalkuhl, M. DOSE – global data set of reported sub-national economic output. Sci. Data 10, 425 (2023).
54. Gennaioli, N., La Porta, R., Lopez De Silanes, F. & Shleifer, A. Growth in regions. J. Econ. Growth 19, 259–309 (2014).
55. Board of Governors of the Federal Reserve System (US). U.S. dollars to euro spot exchange rate. https://fred.stlouisfed.org/series/AEXUSEU (2022).
56. World Bank. GDP deflator. https://data.worldbank.org/indicator/NY.GDP.DEFL.ZS (2022).
57. Jones, B. & O’Neill, B. C. Spatially explicit global population scenarios consistent with the Shared Socioeconomic Pathways. Environ. Res. Lett. 11, 084003 (2016).
58. Murakami, D. & Yamagata, Y. Estimation of gridded population and GDP scenarios with spatially explicit statistical downscaling. Sustainability 11, 2106 (2019).
59. Koch, J. & Leimbach, M. Update of SSP GDP projections: capturing recent changes in national accounting, PPP conversion and Covid 19 impacts. Ecol. Econ. 206 (2023).
60. Carleton, T. A. & Hsiang, S. M. Social and economic impacts of climate. Science 353, aad9837 (2016).
61. Bergé, L. Efficient estimation of maximum likelihood models with multiple fixed-effects: the R package FENmlm. DEM Discussion Paper Series 18-13 (2018).
62. Kalkuhl, M., Kotz, M. & Wenz, L. DOSE – The MCC-PIK Database Of Subnational Economic output. Zenodo https://zenodo.org/doi/10.5281/zenodo.4681305 (2021).
63. Kotz, M., Wenz, L. & Levermann, A. Data and code for “The economic commitment of climate change”. Zenodo https://zenodo.org/doi/10.5281/zenodo.10562951 (2024).
64. Dasgupta, S. et al. Effects of climate change on combined labour productivity and supply: an empirical, multi-model study. Lancet Planet. Health 5, e455–e465 (2021).
65. Lobell, D. B. et al. The critical role of extreme heat for maize production in the United States. Nat. Clim. Change 3, 497–501 (2013).
66. Zhao, C. et al. Temperature increase reduces global yields of major crops in four independent estimates. Proc. Natl Acad. Sci. 114, 9326–9331 (2017).
67. Wheeler, T. R., Craufurd, P. Q., Ellis, R. H., Porter, J. R. & Prasad, P. V. Temperature variability and the yield of annual crops. Agric. Ecosyst. Environ. 82, 159–167 (2000).
68. Rowhani, P., Lobell, D. B., Linderman, M. & Ramankutty, N. Climate variability and crop production in Tanzania. Agric. For. Meteorol. 151, 449–460 (2011).
69. Ceglar, A., Toreti, A., Lecerf, R., Van der Velde, M. & Dentener, F. Impact of meteorological drivers on regional inter-annual crop yield variability in France. Agric. For. Meteorol. 216, 58–67 (2016).
70. Shi, L., Kloog, I., Zanobetti, A., Liu, P. & Schwartz, J. D. Impacts of temperature and its variability on mortality in New England. Nat. Clim. Change 5, 988–991 (2015).
71. Xue, T., Zhu, T., Zheng, Y. & Zhang, Q. Declines in mental health associated with air pollution and temperature variability in China. Nat. Commun. 10, 2165 (2019).
72. Liang, X.-Z. et al. Determining climate effects on US total agricultural productivity. Proc. Natl Acad. Sci. 114, E2285–E2292 (2017).
73. Desbureaux, S. & Rodella, A.-S. Drought in the city: the economic impact of water scarcity in Latin American metropolitan areas. World Dev. 114, 13–27 (2019).
74. Damania, R. The economics of water scarcity and variability. Oxf. Rev. Econ. Policy 36, 24–44 (2020).
75. Davenport, F. V., Burke, M. & Diffenbaugh, N. S. Contribution of historical precipitation change to US flood damages. Proc. Natl Acad. Sci. 118, e2017524118 (2021).
76. Dave, R., Subramanian, S. S. & Bhatia, U. Extreme precipitation induced concurrent events trigger prolonged disruptions in regional road networks. Environ. Res. Lett. 16, 104050 (2021).

Acknowledgements

We gratefully acknowledge financing from the Volkswagen Foundation and the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH on behalf of the Government of the Federal Republic of Germany and Federal Ministry for Economic Cooperation and Development (BMZ).

Open access funding provided by Potsdam-Institut für Klimafolgenforschung (PIK) e.V.

Author information

Authors and Affiliations

Research Domain IV, Potsdam Institute for Climate Impact Research, Potsdam, Germany

Maximilian Kotz, Anders Levermann & Leonie Wenz

Institute of Physics, Potsdam University, Potsdam, Germany

Maximilian Kotz & Anders Levermann

Mercator Research Institute on Global Commons and Climate Change, Berlin, Germany

Leonie Wenz

Contributions

All authors contributed to the design of the analysis. M.K. conducted the analysis and produced the figures. All authors contributed to the interpretation and presentation of the results. M.K. and L.W. wrote the manuscript.

Corresponding author

Correspondence to Leonie Wenz .

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Xin-Zhong Liang, Chad Thackeray and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Constraining the persistence of historical climate impacts on economic growth rates.

The results of a panel-based fixed-effects distributed lag model for the effects of annual mean temperature (a), daily temperature variability (b), total annual precipitation (c), the number of wet days (d) and extreme daily precipitation (e) on sub-national economic growth rates. Point estimates show the effects of a 1 °C or one standard deviation increase (for temperature and precipitation variables, respectively) at the lower quartile, median and upper quartile of the relevant moderating variable (green, orange and purple, respectively) at different lagged periods after the initial shock (note that these are not cumulative effects). Climate variables are used in their first-differenced form (see main text for discussion) and the moderating climate variables are the annual mean temperature, seasonal temperature difference, total annual precipitation, number of wet days and annual mean temperature, respectively, in panels a–e (see Methods for further discussion). Error bars show the 95% confidence intervals with standard errors clustered by region. The within-region R², Bayesian and Akaike information criteria for the model are shown at the top of the figure. This figure shows results with ten lags for each variable to demonstrate the observed levels of persistence, but our preferred specifications remove later lags based on the statistical significance of the terms shown above and the information criteria shown in Extended Data Fig. 2. The resulting models without later lags are shown in Supplementary Figs. 1–3.
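As a rough guide to the model structure this caption describes, the specification can be written schematically as below. This is a simplified sketch with illustrative notation (one climate variable shown; the full model includes all five), not the paper's exact equation.

```latex
g_{r,t} = \mu_r + \eta_t + \zeta_r\, t
        + \sum_{L=0}^{N}\left(\alpha_L + \beta_L\,\bar{C}_r\right)\Delta X_{r,t-L}
        + \varepsilon_{r,t}
```

Here g_{r,t} is the growth rate of region r in year t, \mu_r and \eta_t are region and year fixed effects, \zeta_r t is a region-specific linear trend, \Delta X is a first-differenced climate variable, \bar{C}_r is its moderating variable and the standard errors on the \alpha_L and \beta_L estimates are clustered by region.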

Extended Data Fig. 2 Incremental lag-selection procedure using information criteria and within-region R².

Starting from a panel-based fixed-effects distributed lag model estimating the effects of climate on economic growth using the real historical data (as in equation (4)) with ten lags for all climate variables (as shown in Extended Data Fig. 1), lags are incrementally removed for one climate variable at a time. The resulting Bayesian and Akaike information criteria are shown in a–e and f–j, respectively, and the within-region R² and number of observations in k–o and p–t, respectively. Different rows show the results when removing lags from different climate variables, ordered from top to bottom as annual mean temperature, daily temperature variability, total annual precipitation, the number of wet days and extreme annual precipitation. Information criteria show minima at approximately four lags for precipitation variables and between eight and ten for temperature variables, indicating that including these numbers of lags does not lead to overfitting. See Supplementary Table 1 for an assessment using information criteria to determine whether including further climate variables causes overfitting.
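The lag-selection logic lends itself to a simple loop: refit the model with one fewer lag each time and record the information criteria. The sketch below illustrates the idea on synthetic data with a plain OLS stand-in from statsmodels rather than the paper's fixed-effects estimator.

```python
# Incremental lag selection: fit models with 10, 9, ..., 0 lags of a single
# regressor and track BIC/AIC; the preferred lag count minimizes them.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + 0.2 * np.roll(x, 1) + rng.normal(size=n)  # true model has 1 lag

def lagged_design(x, n_lags):
    cols = [np.roll(x, L) for L in range(n_lags + 1)]  # column L = lag L of x
    X = np.column_stack(cols)[n_lags:]                 # drop wrapped-around rows
    return sm.add_constant(X)

for n_lags in range(10, -1, -1):
    res = sm.OLS(y[n_lags:], lagged_design(x, n_lags)).fit()
    print(f"lags={n_lags:2d}  BIC={res.bic:8.1f}  AIC={res.aic:8.1f}")
```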

Extended Data Fig. 3 Damages in our preferred specification that provides a robust lower bound on the persistence of climate impacts on economic growth versus damages in specifications of pure growth or pure level effects.

Estimates of future damages as shown in Fig. 1 but under the emission scenario RCP8.5 for three separate empirical specifications: in orange, our preferred specification, which provides an empirical lower bound on the persistence of climate impacts on economic growth rates while avoiding assumptions of infinite persistence (see main text for further discussion); in purple, a specification of ‘pure growth effects’ in which the first difference of climate variables is not taken and no lagged climate variables are included (the baseline specification of ref. 2); and in pink, a specification of ‘pure level effects’ in which the first difference of climate variables is taken but no lagged terms are included.

Extended Data Fig. 4 Climate changes in different variables as a function of historical interannual variability.

Changes in each climate variable of interest from 1979–2019 to 2035–2065 under the high-emission scenario SSP5-RCP8.5, expressed as a percentage of the historical variability of each measure. Historical variability is estimated as the standard deviation of each detrended climate variable over the period 1979–2019 during which the empirical models were identified (detrending is appropriate because of the inclusion of region-specific linear time trends in the empirical models). See Supplementary Fig. 13 for changes expressed in standard units. Data on national administrative boundaries are obtained from the GADM database version 3.6 and are freely available for academic use (https://gadm.org/).
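The normalization described here is straightforward to reproduce. The sketch below detrends a synthetic historical series with a linear fit, takes the standard deviation of the residuals and expresses a projected change as a percentage of that variability; all numbers are made up.

```python
# Express a projected change as a percentage of detrended historical
# interannual variability (synthetic data for illustration).
import numpy as np

years = np.arange(1979, 2020)
rng = np.random.default_rng(1)
hist = 14.0 + 0.02 * (years - 1979) + rng.normal(0.0, 0.4, years.size)

# Linear detrending mirrors the region-specific linear time trends included
# in the empirical models.
trend = np.polyval(np.polyfit(years, hist, 1), years)
sigma_hist = (hist - trend).std(ddof=1)

projected_change = 1.1  # e.g. change in annual mean temperature, in deg C
print(f"{100 * projected_change / sigma_hist:.0f}% of historical variability")
```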

Extended Data Fig. 5 Contribution of different climate variables to overall committed damages.

a, Climate damages in 2049 when using empirical models that account for all climate variables, changes in annual mean temperature only or changes in both annual mean temperature and one other climate variable (daily temperature variability, total annual precipitation, the number of wet days and extreme daily precipitation, respectively). b, The cumulative marginal effects of an increase in annual mean temperature of 1 °C, at different baseline temperatures, estimated from empirical models including all climate variables or annual mean temperature only. Estimates and uncertainty bars represent the median and 95% confidence intervals obtained from 1,000 block-bootstrap resamples from each of three different empirical models using eight, nine or ten lags of temperature terms.
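A block bootstrap of the kind mentioned in this caption resamples whole regions with replacement so that within-region dependence is preserved. The sketch below shows the resampling skeleton with a trivial stand-in for the model refit; in the actual analysis, the fitted quantity would be a cumulative marginal effect from the empirical model.

```python
# Block bootstrap over regions: resample regions with replacement, refit,
# and summarize the sampling distribution of the statistic of interest.
import numpy as np
import pandas as pd

def block_bootstrap(panel, fit, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    regions = panel["region"].unique()
    stats = []
    for _ in range(n_boot):
        draw = rng.choice(regions, size=regions.size, replace=True)
        resample = pd.concat([panel[panel["region"] == r] for r in draw])
        stats.append(fit(resample))
    return np.percentile(stats, [2.5, 50, 97.5])  # median and 95% interval

panel = pd.DataFrame({"region": np.repeat(list("ABCD"), 5),
                      "growth": np.random.default_rng(2).normal(size=20)})
print(block_bootstrap(panel, lambda df: df["growth"].mean(), n_boot=200))
```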

Extended Data Fig. 6 The difference in committed damages between the upper and lower quartiles of countries when ranked by GDP and cumulative historical emissions.

Quartiles are defined using a population weighting, as are the average committed damages across each quartile group. The violin plots indicate the distribution of differences between quartiles across the two extreme emission scenarios (RCP2.6 and RCP8.5) and the uncertainty sampling procedure outlined in Methods, which accounts for uncertainty arising from the choice of lags in the empirical models, uncertainty in the empirical model parameter estimates and the climate model projections. Bars indicate the median, as well as the 10th and 90th percentiles and the upper and lower sixths of the distribution, reflecting the ‘very likely’ and ‘likely’ ranges following the likelihood classification adopted by the IPCC.
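Population-weighted quartiles can be formed by sorting countries on the ranking variable and cutting the cumulative population into four equal shares, as in the sketch below (made-up data, hypothetical column names).

```python
# Population-weighted quartiles: each quartile holds ~a quarter of the
# population, and quartile means are population-weighted averages.
import numpy as np
import pandas as pd

def weighted_quartile_means(df, rank_col, value_col="damage", w_col="pop"):
    df = df.sort_values(rank_col).reset_index(drop=True)
    w = df[w_col] / df[w_col].sum()
    cum_mid = w.cumsum() - w / 2  # population midpoint of each country
    df = df.assign(quartile=np.minimum((cum_mid * 4).astype(int), 3))
    return df.groupby("quartile").apply(
        lambda g: np.average(g[value_col], weights=g[w_col]))

countries = pd.DataFrame({"gdp_pc": [1, 2, 5, 10, 20, 40],
                          "damage": [-25, -22, -18, -15, -11, -9],
                          "pop": [3, 2, 1, 1, 2, 3]})
means = weighted_quartile_means(countries, "gdp_pc")
print(means.iloc[0] - means.iloc[-1])  # poorest-quartile minus richest-quartile
```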

Supplementary information

  • Supplementary Information
  • Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Kotz, M., Levermann, A. & Wenz, L. The economic commitment of climate change. Nature 628, 551–557 (2024). https://doi.org/10.1038/s41586-024-07219-0

Received: 25 January 2023

Accepted: 21 February 2024

Published: 17 April 2024

Issue Date: 18 April 2024

DOI: https://doi.org/10.1038/s41586-024-07219-0


How Pew Research Center will report on generations moving forward

Journalists, researchers and the public often look at society through the lens of generation, using terms like Millennial or Gen Z to describe groups of similarly aged people. This approach can help readers see themselves in the data and assess where we are and where we’re headed as a country.

Pew Research Center has been at the forefront of generational research over the years, telling the story of Millennials as they came of age politically and as they moved more firmly into adult life. In recent years, we’ve also been eager to learn about Gen Z as the leading edge of this generation moves into adulthood.

But generational research has become a crowded arena. The field has been flooded with content that’s often sold as research but is more like clickbait or marketing mythology. There’s also been a growing chorus of criticism about generational research and generational labels in particular.

Recently, as we were preparing to embark on a major research project related to Gen Z, we decided to take a step back and consider how we can study generations in a way that aligns with our values of accuracy, rigor and providing a foundation of facts that enriches the public dialogue.

We set out on a yearlong process of assessing the landscape of generational research. We spoke with experts from outside Pew Research Center, including those who have been publicly critical of our generational analysis, to get their take on the pros and cons of this type of work. We invested in methodological testing to determine whether we could compare findings from our earlier telephone surveys to the online ones we’re conducting now. And we experimented with higher-level statistical analyses that would allow us to isolate the effect of generation.

What emerged from this process was a set of clear guidelines that will help frame our approach going forward. Many of these are principles we’ve always adhered to, but others will require us to change the way we’ve been doing things in recent years.

Here’s a short overview of how we’ll approach generational research in the future:

We’ll only do generational analysis when we have historical data that allows us to compare generations at similar stages of life. When comparing generations, it’s crucial to control for age. In other words, researchers need to look at each generation or age cohort at a similar point in the life cycle. (“Age cohort” is a fancy way of referring to a group of people who were born around the same time.)

When doing this kind of research, the question isn’t whether young adults today are different from middle-aged or older adults today. The question is whether young adults today are different from young adults at some specific point in the past.

To answer this question, it’s necessary to have data that’s been collected over a considerable amount of time – think decades. Standard surveys don’t allow for this type of analysis. We can look at differences across age groups, but we can’t compare age groups over time.

Another complication is that the surveys we conducted 20 or 30 years ago aren’t usually comparable enough to the surveys we’re doing today. Our earlier surveys were done over the phone, and we’ve since transitioned to our nationally representative online survey panel, the American Trends Panel. Our internal testing showed that on many topics, respondents answer questions differently depending on the way they’re being interviewed. So we can’t use most of our surveys from the late 1980s and early 2000s to compare Gen Z with Millennials and Gen Xers at a similar stage of life.

This means that most generational analysis we do will use datasets that have employed similar methodologies over a long period of time, such as surveys from the U.S. Census Bureau. A good example is our 2020 report on Millennial families, which used census data going back to the late 1960s. The report showed that Millennials are marrying and forming families at a much different pace than the generations that came before them.
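In code, the comparison looks less like a single cross-tab and more like aligning the same age group across survey years. A toy sketch with made-up numbers:

```python
# Compare young adults today with young adults decades ago, rather than
# comparing young and old respondents within a single survey.
import pandas as pd

surveys = pd.DataFrame({
    "year": [1995, 1995, 2023, 2023],
    "age_group": ["18-29", "30-49", "18-29", "30-49"],
    "share_agree": [0.41, 0.38, 0.52, 0.45],  # hypothetical survey estimates
})

young = surveys[surveys["age_group"] == "18-29"].set_index("year")["share_agree"]
print(young.loc[2023] - young.loc[1995])  # change among young adults over time
```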

Even when we have historical data, we will attempt to control for other factors beyond age in making generational comparisons. If we accept that there are real differences across generations, we’re basically saying that people who were born around the same time share certain attitudes or beliefs – and that their views have been influenced by external forces that uniquely shaped them during their formative years. Those forces may have been social changes, economic circumstances, technological advances or political movements.

The tricky part is isolating those forces from events or circumstances that have affected all age groups, not just one generation. These are often called “period effects.” An example of a period effect is the Watergate scandal, which drove down trust in government among all age groups. Differences in trust across age groups in the wake of Watergate shouldn’t be attributed to the outsize impact that event had on one age group or another, because the change occurred across the board.

Changing demographics also may play a role in patterns that might at first seem like generational differences. We know that the United States has become more racially and ethnically diverse in recent decades, and that race and ethnicity are linked with certain key social and political views. When we see that younger adults have different views than their older counterparts, it may be driven by their demographic traits rather than the fact that they belong to a particular generation.

Controlling for these factors can involve complicated statistical analysis that helps determine whether the differences we see across age groups are indeed due to generation or not. This additional step adds rigor to the process. Unfortunately, it’s often absent from current discussions about Gen Z, Millennials and other generations.

When we can’t do generational analysis, we still see value in looking at differences by age and will do so where it makes sense. Age is one of the most common predictors of differences in attitudes and behaviors. And even if age gaps aren’t rooted in generational differences, they can still be illuminating. They help us understand how people across the age spectrum are responding to key trends, technological breakthroughs and historical events.

Each stage of life comes with a unique set of experiences. Young adults are often at the leading edge of changing attitudes on emerging social trends. Take views on same-sex marriage, for example, or attitudes about gender identity.

Many middle-aged adults, in turn, face the challenge of raising children while also providing care and support to their aging parents. And older adults have their own obstacles and opportunities. All of these stories – rooted in the life cycle, not in generations – are important and compelling, and we can tell them by analyzing our surveys at any given point in time.

When we do have the data to study groups of similarly aged people over time, we won’t always default to using the standard generational definitions and labels. While generational labels are simple and catchy, there are other ways to analyze age cohorts. For example, some observers have suggested grouping people by the decade in which they were born. This would create narrower cohorts in which the members may share more in common. People could also be grouped relative to their age during key historical events (such as the Great Recession or the COVID-19 pandemic) or technological innovations (like the invention of the iPhone).

By choosing not to use the standard generational labels when they’re not appropriate, we can avoid reinforcing harmful stereotypes or oversimplifying people’s complex lived experiences.

Existing generational definitions also may be too broad and arbitrary to capture differences that exist among narrower cohorts. A typical generation spans 15 to 18 years. As many critics of generational research point out, there is great diversity of thought, experience and behavior within generations. The key is to pick a lens that’s most appropriate for the research question that’s being studied. If we’re looking at political views and how they’ve shifted over time, for example, we might group people together according to the first presidential election in which they were eligible to vote.
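As a small illustration of that last alternative, the helper below assigns a birth year to the first U.S. presidential election in which a person was old enough to vote (assuming a voting age of 18 and ignoring birth month and day).

```python
# Map a birth year to the first presidential election year (divisible by 4)
# in which the person was at least 18. Simplification: ignores birth month/day
# and assumes the modern voting age of 18.
def first_eligible_election(birth_year: int) -> int:
    year = birth_year + 18
    return year + (-year) % 4  # round up to the next election year

print(first_eligible_election(1996))  # 2016
print(first_eligible_election(1998))  # 2016
print(first_eligible_election(1999))  # 2020
```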

With these considerations in mind, our audiences should not expect to see a lot of new research coming out of Pew Research Center that uses the generational lens. We’ll only talk about generations when it adds value, advances important national debates and highlights meaningful societal trends.

Kim Parker is director of social trends research at Pew Research Center.



Study reveals more than half of branded global plastic waste linked to just 56 companies

Dal researcher co-authors paper on five-year international effort.

Alison Auld - April 24, 2024

For more than five years, citizen scientists in dozens of countries combed beaches, waterways, parks, busy city streets and other public areas in an ambitious bid to quantify the amount of plastic waste in the environment and track its source.

They carefully recorded the brand or trademark on each plastic item and the number of items with those brands wherever possible, also noting the location, date, type of plastic, type of item, number of plastic layers and time of each audit event, which ran from 2018 to 2022. 

Now, researchers have synthesized those results in a new paper that found a clear link between plastic production and plastic pollution, such that a one-per-cent increase in plastic production was associated with a one-per-cent increase in plastic pollution in the environment. 
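A relationship of that form corresponds to an elasticity of about one in a log-log regression. The sketch below shows the idea on synthetic data; the numbers are invented and not from the study.

```python
# Fit log(pollution) ~ log(production); a slope near 1 means a 1% rise in
# production is associated with a 1% rise in observed pollution.
import numpy as np

production = np.array([100.0, 150.0, 220.0, 300.0, 360.0, 400.0])
noise = np.random.default_rng(3).lognormal(0.0, 0.05, production.size)
pollution = 0.8 * production * noise  # synthetic data with unit elasticity

slope, _ = np.polyfit(np.log(production), np.log(pollution), 1)
print(round(slope, 2))  # ~1.0 for these synthetic data
```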

The team, including co-author Dr. Tony Walker of Dal's School for Resource and Environmental Studies, also determined that companies producing single-use consumer goods contributed disproportionately more to the problem than household and retail companies did, and that most collected items had no discernible brand.

"We were surprised to find that the direct relationship between plastic production and plastic pollution was consistent around the world, irrespective of whether the litter audits were conducted in the global north or global south," says Dr. Walker, noting that plastic production doubled to about 400 metric tons from 2000 to 2019. 

"This confirms that companies responsible for omnipresent plastic pollution is consistent no matter where you live." 

Data 'speaks for itself'

The study, published Wednesday (April 24) in Science Advances, marks the first robust quantification of the global relationship between production and pollution, and comes at a time when world leaders are meeting in Ottawa to hammer out a Global Plastics Treaty at the fourth session of the Intergovernmental Negotiating Committee, or INC-4.

They also discovered that about 52 per cent of the more than two million inventoried plastic items had no identifiable brand, highlighting the need for better transparency about production and labeling of plastic products to enhance traceability and accountability. The researchers suggest creating an international, open-access database into which companies are obliged to quantitatively track and report their products, packaging and brands. 

"When I first saw the relationship between production and pollution, I was shocked," says co-author Win Cowger of the Moore Institute for Plastic Pollution Research. "Despite all the things big brands say they are doing, we see no positive impact from their efforts. But on the other hand, it gives me hope that reducing plastic production by fast-moving consumer goods companies will have a strong positive impact on the environment.” 

The research, led by scientists at Dalhousie and a dozen different universities in the United States, Australia, the Philippines, New Zealand, Estonia, Chile, Sweden and the U.K., found that 56 global companies are responsible for more than half of all branded plastic pollution. The paper states that the top five producers of branded plastic pollution were Coca-Cola Company, which was responsible for 11 per cent of roughly 910,000 branded items, followed by PepsiCo (5%), Nestlé (3%), Danone (3%), and Altria/Philip Morris International (2%). The top companies produce food, beverage or tobacco products. 

"This global branded plastic pollution data speaks for itself and demonstrates unequivocally that the world's top global producers are the biggest plastics polluters,” says Dr. Walker.

Paradigm shift needed

The five-year analysis used data from 1,576 audit events in 84 countries. Brand audits are citizen science initiatives in which volunteers conduct waste cleanups and document the brands collected. More than 100,000 volunteers submitted data through Break Free from Plastic or the 5 Gyres’ TrashBlitz app. 
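From audit records like these, company shares of branded pollution reduce to a groupby over item counts. A minimal sketch with hypothetical column names and made-up counts:

```python
# Company shares of branded plastic items from audit records.
import pandas as pd

audits = pd.DataFrame({
    "company": ["Coca-Cola", "PepsiCo", "Unbranded", "Nestle", "Coca-Cola"],
    "items": [120, 60, 400, 30, 90],
})

branded = audits[audits["company"] != "Unbranded"]
shares = 100 * branded.groupby("company")["items"].sum() / branded["items"].sum()
print(shares.round(1).sort_values(ascending=False))
```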

The authors state that the strong relationship between plastic production and pollution, across geographies and different waste management systems, suggests that reducing the production of single-use plastic consumer goods could curb global plastic pollution. 

"Findings from this study suggest we need a paradigm shift in how we regulate plastic producers, especially the top branded producers that are responsible for half of branded plastic pollution," says Dr. Walker. 

For world leaders, this research serves as a tool to support a legally binding treaty that includes provisions on corporate accountability, prioritizes plastic-production reduction measures and promotes reuse and refill systems.

"Our study underscores the critical role of corporate accountability in tackling plastic pollution," says Dr. Lisa Erdle, director of Science and Innovation at the 5 Gyres Institute . "I urge world leaders at INC-4 to listen to the science, and to consider the clear link between plastic production and pollution during negotiations for a Global Plastics Treaty." 


Computer Science > Computation and Language

Title: "a good pun is its own reword": can large language models understand puns.

Abstract: Puns play a vital role in academic research due to their distinct structure and clear definition, which aid in the comprehensive analysis of linguistic humor. However, the understanding of puns in large language models (LLMs) has not been thoroughly examined, limiting their use in creative writing and humor creation. In this paper, we leverage three popular tasks, i.e., pun recognition, explanation and generation to systematically evaluate the capabilities of LLMs in pun understanding. In addition to adopting the automated evaluation metrics from prior research, we introduce new evaluation methods and metrics that are better suited to the in-context learning paradigm of LLMs. These new metrics offer a more rigorous assessment of an LLM's ability to understand puns and align more closely with human cognition than previous metrics. Our findings reveal the "lazy pun generation" pattern and identify the primary challenges LLMs encounter in understanding puns.

