Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

Abstract

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data enables smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on "Data Science", including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision-making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of "data science and advanced analytics", where almost everything in our daily lives is digitally recorded as data [17]. Thus, the current electronic world holds a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc. [112]. These data can be structured, semi-structured, or unstructured, and their volume increases day by day [105]. Data science is typically a "concept to unify statistics, data analysis, and their related methods" to understand and analyze actual phenomena with data. According to Cao et al. [17], "data science is the science of data" or "data science is the study of data", where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of "Data science" is increasing day by day, as shown in Fig. 1 according to Google Trends data over the last 5 years [36]. In addition to data science, the figure also shows the popularity trends of the relevant areas "Data analytics", "Data mining", "Big data", and "Machine learning". According to Fig. 1, the popularity indication values for these data-driven domains, particularly "Data science" and "Machine learning", are increasing day by day. This statistical information and the applicability of data-driven smart decision-making in various real-world application areas motivate us to briefly study "Data science" and machine-learning-based "Advanced analytics" in this paper.

Fig. 1: The worldwide popularity score of data science compared with relevant areas, in a range of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. Advanced analytics places more emphasis on using data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of data, while advanced analytics is a step forward, offering a deeper understanding of data and helping to analyze granular data. In the field of data science, several types of analytics are popular: "Descriptive analytics", which answers the question "What happened?"; "Diagnostic analytics", which answers "Why did it happen?"; "Predictive analytics", which predicts what will happen in the future; and "Prescriptive analytics", which prescribes what action should be taken, discussed briefly in "Advanced Analytics Methods and Smart Computing". Such advanced analytics and decision-making based on machine learning techniques [105], a major part of artificial intelligence (AI) [102], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to their learning capability for smart computing as well as automation [121].

Although the area of "data science" is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data and thus can be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from artificial neural networks, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in "Advanced Analytics Methods and Smart Computing". Thus, it is important to understand the principles of the various advanced analytics methods mentioned above and their applicability in various real-world application areas. For instance, in our earlier paper, Sarker et al. [114], we discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account data science application areas and real-world problems in ten potential domains, including business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in "Real-World Application Domains".

Based on the importance of machine learning modeling for extracting useful insights from the data mentioned above and data-driven smart decision-making, in this paper we present a comprehensive view on "Data Science", including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application. The key contribution of this study is thus understanding data science modeling, explaining different analytics methods from a solution perspective, and their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is, therefore, to provide a basic guide or reference for academics and industry practitioners who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision-making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in real-world life. We also briefly discuss the concept of data science modeling, from business problems to data products and automation, to understand its applicability and the provision of intelligent services in real-world scenarios.
  • To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are closely related and easily confused. In the following, we define these terms and differentiate them from the term "Data Science" according to our goal.

The term "Data analysis" refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [17]. The term "Data analytics", on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insights [17]. Statistical and mathematical analysis of the data is the major concern in this process. "Data mining" is another popular term of the last decade, which has a similar meaning to several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [38], it should have been more appropriately named "knowledge mining from data". Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [38]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. "Big data" is another popular term nowadays, which may change the statistical and data analysis approaches as it has the unique features of being "massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous" [74]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [129]. Several unique features, including volume, velocity, variety, veracity, value (5Vs), and complexity, are used to understand and describe big data [69].

In terms of analytics, basic analytics provides a summary of data, whereas "Advanced Analytics" takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights and make predictions or generate recommendations, typically beyond traditional business intelligence or analytics. "Machine learning", a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics, and it can automate analytical model building [112]. It is based on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement [38, 115]. "Deep learning" is a subfield of machine learning concerned with algorithms inspired by the structure and function of the human brain, called artificial neural networks [38, 139].

Unlike the above data-related terms, "Data science" is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [17], Cao et al. defined data science from the disciplinary perspective as "data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology". In "Understanding Data Science Modeling", we briefly discuss data science modeling from a practical perspective, starting from business problems to data products, which can assist data scientists in thinking and working on a particular real-world problem domain within the area of data science and analytics.

Related Work

Several papers in the area have reviewed data science and its significance. For example, the authors in [19] identify the evolving field of data science and its importance in the broader knowledge environment, along with some issues that differentiate data science and informatics from conventional approaches in the information sciences. Donoho [27] presents 50 years of data science, including recent commentary on data science in the mass media and on how and whether data science differs from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [53] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [17], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [20], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [61] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time; this research contributes to the creation of a research vector on the role of data science in central banking. In [62], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [87] provide a thorough understanding of computational optimal transport with applications to data science. In [97], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling, starting from business problems to data products and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [17, 112, 114]. The data can be of different types, such as: (i) Structured—data that has a well-defined structure and follows a standard order; examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—data that has no pre-defined format or organization; examples are sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—data that has elements of both structured and unstructured data, containing certain organizational properties; examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—data about the data; examples are author, file type, file size, creation date and time, last modification date and time, etc. [38, 105].

In the area of data science, researchers use various widely used datasets for different purposes. These include, for example, cybersecurity datasets such as NSL-KDD [127], UNSW-NB15 [79], Bot-IoT [59], ISCX'12 [15], and CIC-DDoS2019 [22]; smartphone datasets such as phone call logs [88, 110], mobile application usage logs [124, 149], SMS logs [28], and mobile phone notification logs [77]; IoT data [11, 56, 64]; health data such as heart disease [99], diabetes mellitus [86, 147], and COVID-19 [41, 78]; agriculture and e-commerce data [128, 150]; and many more in various application domains. In "Real-World Application Domains", we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision-making, which can help data scientists and application developers explore various real-world issues in more depth.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in "Background and Related Work". In this section, we briefly discuss how data science can play a significant role in the real-world business process. Figure 2 shows an example of data science modeling, starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

  • Understanding business problems: This involves getting a clear understanding of the problem that needs to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, whether the behavior is unrealistic/abnormal, which option should be taken, what action, etc., could be relevant questions depending on the nature of the problem. This helps to get a better idea of what the business needs and what should be extracted from the data. Such business knowledge, which can enable organizations to enhance their decision-making process, is known as "Business Intelligence" [65]. Identifying the relevant data sources that can help answer the formulated questions, and what kinds of actions should be taken based on the trends the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem.
  • Understanding data: Data science is largely driven by the availability of data [114]. Thus, a sound understanding of the data is needed for a data-driven model or system. The reason is that real-world datasets are often noisy, contain missing values and inconsistencies, or have other data issues, which need to be handled effectively [101]. To gain actionable insights, the appropriate data of the required quality must be sourced and cleansed, which is fundamental to any data science engagement. For this, data assessment, which evaluates what data is available and how it aligns with the business problem, could be the first step in data understanding. Several aspects, such as data type/format, whether the quantity of data is sufficient to extract useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, important metrics to report the data, etc., need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would best be needed and the best ways to acquire it.
  • Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [135]. This examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct meaningful summaries of the data. Thus, data exploration is typically used to figure out the gist of the data and to develop a first-step assessment of its quality, quantity, and characteristics. A statistical model may or may not be used, but primarily this step offers tools for creating hypotheses by visualizing and interpreting the data through graphical representations such as charts, plots, histograms, etc. [72, 91]. Before the data is ready for modeling, it is necessary to use data summarization and visualization to audit its quality and provide the information needed to process it. To ensure data quality, the data pre-processing technique, which is typically the process of cleaning and transforming raw data [107] before processing and analysis, is important. It also involves reformatting information, making data corrections, and merging datasets to enrich the data. Thus, several aspects, such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, searching for outliers or anomalies in the data and dealing with them, ensuring data quality, etc., could be the key considerations in this step.
  • Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building depends on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in "Advanced Analytics Methods and Smart Computing". To best fit the data according to the type of analytics, different types of data-driven or machine learning models, which have been summarized in our earlier paper Sarker et al. [105], can be built to achieve the goal. Data scientists typically separate training and test subsets of the given dataset, usually dividing it in a ratio such as 80:20, or use the popular k-fold data splitting method [38]. This is to observe whether the model performs well on the data and to maximize the model performance; a minimal illustrative sketch follows this list. Various model validation and assessment metrics, such as error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, f-score, ROC (receiver operating characteristic) curve analysis, applicability analysis, etc. [38, 115], are used to measure model performance, which can guide data scientists in choosing or designing the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced techniques, such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms, to improve the ultimate data-driven model for solving a particular business problem through smart decision-making.
  • Data product and automation: A data product is typically the output of any data science activity [17]. A data product, in general terms, is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in "Real-World Application Domains", where various data products can play a significant role in relevant business problems, making them smart and automated.
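
To make the modeling and evaluation step concrete, the following minimal sketch (our illustration, not part of the original study) uses scikit-learn with the common 80:20 train/test split and several of the metrics listed above. The built-in breast-cancer dataset and the random forest model are assumptions chosen only for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Example dataset (an assumption for illustration)
X, y = load_breast_cancer(return_X_y=True)

# 80:20 train/test split, as commonly used in data science modeling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Build and fit a data-driven model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Assess the model on the held-out test subset with several metrics
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
```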

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. The interesting part of the data science process is having a deeper understanding of the business problem to solve; without that, it would be much harder to gather the right data and extract the most useful information from it for making decisions to solve the problem. In terms of role, "Data Scientists" typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms, focused on advanced analytics, which can make today's computing process smarter and more intelligent, as discussed briefly in the following section.

Fig. 2: An example of data science modeling from real-world data to data-driven system and decision making

Advanced Analytics Methods and Smart Computing

As mentioned earlier in "Background and Related Work", basic analytics provides a summary of data, whereas advanced analytics takes a step forward in offering a deeper understanding of data and helps in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, "advanced analytics" can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following, we first summarize the various types of analytics and the outcomes needed to solve the associated business problems, and then we briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions, such as "What happened?", "Why did it happen?", "What will happen in the future?", and "What action should be taken?", are common and important. Based on these questions, in this paper we categorize and highlight the analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

  • Descriptive analytics: This is the interpretation of historical data to better understand the changes that have occurred in a business. Thus, descriptive analytics answers the question "What happened in the past?" by summarizing past data, such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, LinkedIn, or Facebook. For instance, analyzing trends, patterns, and anomalies in customers' historical shopping data with descriptive analytics can be used to estimate the probability of a customer purchasing a product. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business and eventually adopt more effective management strategies and business decisions.
  • Diagnostic analytics: This is a form of advanced analytics that examines data or content to answer the question "Why did it happen?" The goal of diagnostic analytics is to help find the root cause of a problem. For example, the human resource management department of a business organization may use diagnostic analytics to find the best applicant for a position, select them, and compare them to others in similar positions to see how well they perform. In a healthcare example, it might help to figure out whether patients' symptoms, such as high fever, dry cough, headache, fatigue, etc., are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
  • Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes, such as assessing business risks, anticipating potential market patterns, and deciding when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question "What will happen in the future?" Thus, the primary goal of predictive analytics is to identify and typically answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. For example, companies can use predictive analytics to minimize costs by better anticipating future demand and adjusting output and inventory; banks and other financial institutions can reduce fraud and risks by predicting suspicious activity; medical specialists can make effective decisions by predicting patients who are at risk of disease; retailers can increase sales and customer satisfaction by understanding and predicting customer preferences; manufacturers can optimize production capacity by predicting maintenance requirements; and many more. Thus, predictive analytics can be considered the core analytical method within the area of data science.
  • Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, typically answering the question "What action should be taken?" In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive analytics and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations make more informed choices that drive the most successful business outcomes.

In summary, to clarify what happened and why it happened, both descriptive analytics and diagnostic analytics look at the past. Predictive analytics and prescriptive analytics use historical data to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1, we summarize these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes and improvements in business processes. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from the data.

Table 1: Various types of analytical methods with examples

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of machine learning-based predictive modeling, considering both the training and testing phases. In the following, we discuss a wide range of methods, such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on, within the scope of our study.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases

Regression Analysis

In data science, regression techniques are among the most common statistical approaches used for predictive modeling and data mining tasks [38]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict a continuous-valued output [105, 117]. Equations 1, 2, and 3 [85, 105] represent simple, multiple (or multivariate), and polynomial regression respectively, where $x$ represents an independent variable and $y$ the predicted/target output:

$$y = \beta_0 + \beta_1 x \qquad (1)$$

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \qquad (2)$$

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n \qquad (3)$$

Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable, i.e., to find the causal relationship between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem. In that case, polynomial regression performs better; however, it increases model complexity. Regularization techniques such as Ridge, Lasso, Elastic-Net, etc. [85, 105] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression techniques [85, 105] can be used to build effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc. are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
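
A brief illustrative sketch of Eqs. 1 and 3 (ours, not from the original paper) is given below: it fits a simple linear and a degree-2 polynomial regression to synthetic data with scikit-learn, where the data-generating function is an assumption for demonstration. The polynomial model captures the curvature that the linear model underfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data (an assumption for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50).reshape(-1, 1)
y = 2.0 + 1.5 * x.ravel() + 0.8 * x.ravel() ** 2 + rng.normal(0, 1, 50)

# Simple linear regression (Eq. 1) may underfit this curved data
linear = LinearRegression().fit(x, y)

# Polynomial regression (Eq. 3) fits the curvature at higher model complexity
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print("linear R^2    :", linear.score(x, y))
print("polynomial R^2:", poly.score(x, y))
```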

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning that refers to a predictive modeling problem in which a class label is predicted for a given example [38]. Spam identification, such as 'spam' and 'not spam' in email service providers, is an example of a classification problem. There are several forms of classification analysis in the area: binary classification, which refers to the prediction of one of two classes; multi-class classification, which involves the prediction of one of more than two classes; and multi-label classification, a generalization of multi-class classification in which an example may be assigned more than one class label [105].

Several popular classification techniques exist to solve classification problems, such as k-nearest neighbors [5], support vector machines [55], naive Bayes [49], adaptive boosting [32], extreme gradient boosting [85], logistic regression [66], decision trees such as ID3 [92] and C4.5 [93], and random forests [13]. Tree-based classification techniques, e.g., random forests considering multiple decision trees, perform better than others at solving real-world problems in many cases due to their capability of producing logic rules [103, 115]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [109], and IntrudTree [106] can be used to build effective classification or prediction models for the relevant tasks within the domain of data science and analytics.

Fig. 4: An example of a random forest structure considering multiple decision trees
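
Building on the spam identification example above, the following minimal sketch (our illustration; the tiny corpus is an assumption for demonstration) trains a naive Bayes text classifier with scikit-learn for the binary spam/not-spam case.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus (an assumption for illustration): 1 = spam, 0 = not spam
texts = ["win a free prize now", "meeting agenda attached",
         "free lottery win click now", "project report due friday"]
labels = [1, 0, 1, 0]

# Bag-of-words features feeding a naive Bayes classifier (binary classification)
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["claim your free prize"]))   # likely [1] (spam)
print(clf.predict(["agenda for the meeting"]))  # likely [0] (not spam)
```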

Cluster Analysis

Clustering is an unsupervised machine learning technique that is well-known in many data science application areas for statistical data analysis [38]. Usually, clustering techniques search for structures inside a dataset and, if the classification is not previously known, group homogeneous cases together. This means that data points are similar to each other within a cluster and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort various data points into groups (or clusters) that are homogeneous internally and heterogeneous externally [105]. Clustering is often used to gain insight into how data is distributed in a given dataset or as a preprocessing phase for other algorithms. Data clustering, for example, assists retail businesses with understanding customer shopping behavior, sales campaigns, and customer retention, as well as anomaly detection, etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [98, 138, 141]. In our earlier paper, Sarker et al. [105], we summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical-based methods, model-based methods, etc. In the literature, the popular K-means [75], K-medoids [84], CLARA [54], etc. are known as partitioning methods; DBSCAN [30], OPTICS [8], etc. are known as density-based methods; and single linkage [122], complete linkage [123], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [134] and CLIQUE [2]; model-based clustering methods, such as neural network learning [141], GMM [94], and SOM [18, 104]; and constraint-based methods, such as COP K-means [131] and CMWK-Means [25], are used in the area. Recently, Sarker et al. [111] proposed a hierarchical clustering method, BOTS [111], based on a bottom-up agglomerative technique for capturing users' similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas in data science.
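
As a hedged illustration of the partitioning methods above (ours, not from the paper), the following sketch groups synthetic two-dimensional points into two clusters with scikit-learn's K-means; the generated data are an assumption for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points forming two loose groups (an assumption for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

# Partition the data into k = 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

print("cluster labels :", kmeans.labels_[:5], "...")
print("cluster centers:", kmeans.cluster_centers_)
```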

Association Rule Analysis

Association rule learning is a rule-based machine learning approach, typically an unsupervised learning method, used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets for discovering interesting relationships or patterns. The association rule learning technique's main strength is its comprehensiveness, as it produces all associations that meet user-specified constraints, including minimum support and confidence values [138].

Association rules allow a data scientist to identify trends, associations, and co-occurrences between items inside large data collections. In a supermarket, for example, associations infer knowledge about the buying behavior of consumers for different items, which helps to adjust the marketing and sales plan. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in the data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent pattern-based [4, 47, 73], logic-based [31], tree-based [39], fuzzy rules [126], and belief rules [148]. Rule learning techniques such as AIS [3], Apriori [4], Apriori-TID and Apriori-Hybrid [4], FP-Tree [39], Eclat [144], and RARM [24] exist to solve the relevant business problems. Among the association rule learning techniques, Apriori [4] is the most commonly used algorithm for discovering association rules from a given dataset [145]. The recent association rule learning technique ABC-RuleMiner, proposed in our earlier paper by Sarker et al. [113], can give significant results in terms of generating non-redundant rules that can be used for smart decision-making according to human preferences, within the area of data science applications.
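
To make the support and confidence constraints concrete, below is a simplified, pure-Python sketch of the counting behind Apriori-style rule mining (our illustration; the toy basket table and thresholds are assumptions). Production work would typically rely on a dedicated implementation of the algorithms cited above.

```python
from itertools import combinations
import pandas as pd

# Toy one-hot encoded market-basket data (an assumption for illustration)
baskets = pd.DataFrame(
    [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 0], [1, 1, 1, 1]],
    columns=["bread", "milk", "butter", "eggs"],
).astype(bool)

min_support = 0.5
n = len(baskets)

# Apriori-style pass: keep itemsets whose support meets the threshold
frequent = {}
for size in (1, 2):
    for items in combinations(baskets.columns, size):
        support = baskets[list(items)].all(axis=1).sum() / n
        if support >= min_support:
            frequent[items] = support

# A rule X -> Y has confidence = support(X and Y) / support(X)
confidence = frequent[("bread", "milk")] / frequent[("bread",)]
print("frequent itemsets:", frequent)
print("rule bread -> milk, confidence =", confidence)
```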

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [111]. Depending on the frequency, a time series can be of different types, such as annual (e.g., annual budget), quarterly (e.g., expenditure), monthly (e.g., air traffic), weekly (e.g., sales quantity), daily (e.g., weather), hourly (e.g., stock price), minute-wise (e.g., inbound calls in a call center), and even second-wise (e.g., web traffic), in relevant domains.

A mathematical method dealing with such time-series data, or the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [130] learns the behavioral trends or patterns of past data. The moving average (MA) [40] is another simple and common form of smoothing used in time-series analysis and forecasting, which uses past forecast errors in a regression-like model to elaborate an averaged trend across the data. The autoregressive moving average (ARMA) [12, 120] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [12, 120]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model also covers the non-stationary case. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average model with exogenous inputs (ARMAX) are also used as time-series models [120].
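
As a hedged sketch (ours, not from the paper), the snippet below fits an ARIMA model with statsmodels to a synthetic trending series, where differencing (d = 1) handles the non-stationarity discussed above; the series and the (1, 1, 1) order are assumptions for demonstration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with a trend plus noise (an assumption for illustration)
rng = np.random.default_rng(2)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.linspace(100, 148, 48) + rng.normal(0, 2, 48), index=idx)

# Fit ARIMA(p=1, d=1, q=1); the differencing term (d=1) removes the trend
model = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next 6 time steps
print(model.forecast(steps=6))
```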

In addition to these stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [111] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users mentioned above [111]. The authors in [118] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for time-series forecasting that outperforms traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any temporal measurement domain of applied science and engineering. Thus, it covers a wide range of application areas in data science.

Fig. 5: An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics

Opinion Mining and Sentiment Analysis

Sentiment analysis, or opinion mining, is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [71]. There are three basic kinds of sentiment: positive, negative, and neutral, along with more specific feelings such as angry, happy, and sad, or interested versus not interested, etc. More refined sentiments for evaluating the feelings of individuals in various situations can also be defined according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain opinions from the public or its customers about its products and services to refine its business policy and make better business decisions. It can thus benefit a business to understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think about a service or product before they use or purchase it. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [45].

Several popular techniques, such as lexicon-based methods (including dictionary-based and corpus-based methods), machine learning (including supervised and unsupervised learning), deep learning, and hybrid methods, are used in sentiment analysis-related tasks [70]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, sentiment analysis incorporates the use of statistics, natural language processing (NLP), machine learning, and deep learning methods. It is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus, sentiment analysis has a big influence in many data science applications where public sentiment is involved in various real-world issues.
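
As one concrete, hedged example of a lexicon-based method (ours, not from the paper), the following sketch scores a few hypothetical reviews with NLTK's VADER analyzer, which maps a text to negative, neutral, and positive components plus an overall compound score.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# The VADER lexicon is downloaded on first use
nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

# Score a few hypothetical product reviews (assumptions for illustration)
for review in ["Absolutely love this phone, great battery!",
               "Terrible service, I want a refund.",
               "The package arrived on Tuesday."]:
    scores = analyzer.polarity_scores(review)
    # 'compound' lies in [-1, 1]: > 0 positive, < 0 negative, near 0 neutral
    print(f"{scores['compound']:+.3f}  {review}")
```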

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more areas [112]. Behavioral analysis aims to understand how and why consumers or users behave, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness, uses the large quantities of raw user event information gathered during sessions in which people use apps, games, or websites. In our earlier papers, Sarker et al. [101, 111, 113], we discussed how to extract users' phone usage behavioral patterns from real-life phone log data for various purposes.

In the real-world scenario, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications to find opportunities for optimization towards particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given dataset (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques, such as behavioral data clustering [111], behavioral decision tree classification [109], behavioral association rules [113], etc., can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [108], which takes into account recent behavioral patterns, could be effective when analyzing behavioral data, as behavior may not be static in the real world and may change over time.
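
To illustrate the cohort idea (our sketch; the toy event table is an assumption), the following pandas snippet assigns each user to the cohort of their first active month and counts active users per cohort per month, a simple retention-style view.

```python
import pandas as pd

# Hypothetical user-event data: one row per purchase (an assumption)
events = pd.DataFrame({
    "user": ["a", "a", "b", "b", "c", "c", "c"],
    "month": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-01-01",
                             "2023-03-01", "2023-02-01", "2023-03-01",
                             "2023-04-01"]),
})

# Cohort = the month of each user's first activity
events["cohort"] = events.groupby("user")["month"].transform("min")

# Count distinct active users per cohort per month (a simple retention view)
retention = (events.groupby(["cohort", "month"])["user"]
             .nunique()
             .unstack(fill_value=0))
print(retention)
```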

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or observations that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, and exceptions [63, 114]. Anomaly detection techniques may discover new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraudulent or irregular transactions in finance is an example of anomaly detection.

It is often used in preprocessing tasks for the removal of anomalous or inconsistent records in real-world data collected from various data sources, including user logs, devices, networks, and servers. For anomaly detection, several machine learning techniques can be used, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [105]. The exclusion of anomalous data from a dataset can also result in a statistically significant improvement in accuracy during supervised learning [101]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior or irregularities, the sparse occurrence of abnormal events, environmental variations, etc. can be challenging in the anomaly detection process. Anomaly detection is applicable in a variety of domains, such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more, and can be considered a significant task for building effective systems with higher accuracy within the area of data science.
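
As a hedged sketch of one of the techniques named above (ours, not from the paper), the following snippet flags outliers in synthetic data with scikit-learn's isolation forest; the data and the injected anomalies are assumptions for demonstration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly normal 2-D points plus two injected outliers (an assumption)
rng = np.random.default_rng(3)
normal = rng.normal(0, 1, (100, 2))
outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])
X = np.vstack([normal, outliers])

# Isolation forests score points by how easily they can be isolated
detector = IsolationForest(contamination=0.02, random_state=3).fit(X)
labels = detector.predict(X)  # +1 = normal, -1 = anomaly

print("detected anomalies:", X[labels == -1])  # likely the injected points
```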

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [23]. It is usually used to organize variables into a small number of clusters based on their common variance, using mathematical or statistical procedures. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, to calculate the degree to which each variable is associated with the factors, and to learn more about the nature of the factors by examining which of them contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [143].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex patterns by exploring the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [143]. Factor analysis is one of the unsupervised machine learning algorithms used for dimensionality reduction. The most common methods for factor analysis are principal component analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [48]. Correlation analysis methods such as Pearson correlation, canonical correlation, etc. may also be useful in the field, as they can quantify the statistical relationship, or association, between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
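
A hedged sketch of factor analysis as unsupervised dimensionality reduction (ours, not from the paper) is shown below; the synthetic data, generated from two hidden factors, is an assumption for demonstration, and the fitted loadings should approximately recover that two-factor structure.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Six observed variables driven by two hidden factors (an assumption)
rng = np.random.default_rng(4)
factors = rng.normal(size=(200, 2))
loading = rng.normal(size=(2, 6))
X = factors @ loading + rng.normal(0, 0.3, (200, 6))

# Recover a two-factor structure from the observed variables
fa = FactorAnalysis(n_components=2, random_state=4).fit(X)
print("estimated loadings (variables x factors):")
print(fa.components_.T.round(2))
```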

Log Analysis

Logs are commonly used in system management, as they are often the only data available that record detailed system runtime activities or behaviors in production [44]. Log analysis can thus be considered the method of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [88, 110], SMS logs [28], mobile app usage logs [124, 149], notification logs [77], game logs [82], context logs [16, 149], web logs [37], smartphone life logs [95], etc. are some examples of log data for smartphone devices. The main characteristic of such log data is that it contains users' actual behavioral activities with their devices. Other similar log data include search logs [50, 133], application logs [26], server logs [33], network logs [57], event logs [83], network and security logs [142], etc.

Several techniques, such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, machine learning modeling, etc. [105], can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by supporting the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors, and Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take into account machine learning modeling can play a significant role in extracting insightful patterns from log data, which can be used for building automated and smart applications; log analysis can thus be considered a key working area in data science.
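
As a small, hedged illustration of log analysis (ours; the access-log format and lines are assumptions), the following standard-library snippet parses hypothetical web-server log entries and counts HTTP status codes, a simple descriptive analysis that could feed the techniques above.

```python
import re
from collections import Counter

# A hypothetical web-server access-log format (an assumption for illustration)
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})')

sample_logs = [
    '10.0.0.1 - - [12/Mar/2024:10:01:02 +0000] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [12/Mar/2024:10:01:05 +0000] "GET /admin HTTP/1.1" 403',
    '10.0.0.2 - - [12/Mar/2024:10:01:06 +0000] "GET /admin HTTP/1.1" 403',
]

# Parse each line and count HTTP status codes
status_counts = Counter()
for line in sample_logs:
    match = LOG_LINE.match(line)
    if match:
        status_counts[match.group("status")] += 1

print(status_counts)  # e.g., Counter({'403': 2, '200': 1})
```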

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [38]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [114, 140].

The most common deep learning algorithms are: multi-layer perceptron (MLP) [ 85 ], convolutional neural network (CNN or ConvNet) [ 67 ], and long short-term memory recurrent neural network (LSTM-RNN) [ 34 ]. Figure 6 shows a structure of an artificial neural network model with multiple processing layers. The backpropagation technique [ 38 ] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [ 67 ] improve on the design of traditional artificial neural networks (ANNs) and include convolutional layers, pooling layers, and fully connected layers. CNNs are commonly used in a variety of fields, including natural language processing, speech recognition, image processing, and other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [ 60 ], Xception [ 21 ], Inception [ 125 ], Visual Geometry Group (VGG) [ 42 ], ResNet [ 43 ], and other advanced deep learning models based on CNN are also used in the field.

Fig. 6: A structure of an artificial neural network model with multiple processing layers
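As a minimal illustration of the layered structure in Fig. 6, the following sketch builds a small multi-layer perceptron with Keras; the synthetic data, layer sizes, and hyperparameters are illustrative assumptions (and TensorFlow is assumed to be installed).

```python
# A minimal MLP sketch with Keras: input layer, two hidden layers, output layer.
# Backpropagation adjusts the weights internally during model.fit().
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(500, 20)                 # 500 samples, 20 features (synthetic)
y = (X.sum(axis=1) > 10).astype(int)        # a simple binary target

model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),    # hidden layer 1
    layers.Dense(32, activation="relu"),    # hidden layer 2
    layers.Dense(1, activation="sigmoid"),  # output layer for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```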

In addition to CNN, the recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, processing, and making predictions based on time-series data. Therefore, when the data is in a sequential format, such as time, sentences, etc., LSTM can be used, and it is widely applied in the areas of time-series analysis, natural language processing, speech recognition, and so on.
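A minimal LSTM sketch for sequential data, again with Keras, is shown below; the noisy sine-wave series, the sliding-window construction, and all hyperparameters are illustrative assumptions.

```python
# A minimal LSTM sketch for sequence data with Keras: predict the next value of
# a noisy sine wave from a sliding window of past observations.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

series = np.sin(np.linspace(0, 20, 400)) + 0.1 * np.random.rand(400)
window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                      # shape: (samples, timesteps, features)

model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(32),                        # feedback connections capture sequence order
    layers.Dense(1),                        # regression output: the next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
```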

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [ 104 ] exist in the field for various purposes. The self-organizing map (SOM) [ 58 ], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, reducing dimensionality. Another learning technique that is commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [ 10 ]. Restricted Boltzmann machines (RBM) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling, according to [ 46 ]. A deep belief network (DBN) is usually made up of a backpropagation neural network (BPNN) and unsupervised networks like restricted Boltzmann machines (RBMs) or autoencoders [ 136 ]. A generative adversarial network (GAN) [ 35 ] is a deep learning network that can produce data with characteristics similar to the input data. Transfer learning, which is usually the re-use of a pre-trained model on a new problem, is widely used at present because it can train deep neural networks with a small amount of data [ 137 ]. These deep learning methods can perform well, particularly when learning from large-scale datasets [ 105 , 140 ]. In our previous article Sarker et al. [ 104 ], we have summarized a brief discussion of the various artificial neural network (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.
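As one example from this family, the following is a minimal autoencoder sketch for unsupervised dimensionality reduction; the bottleneck size and synthetic data are assumptions for illustration.

```python
# A minimal autoencoder sketch with Keras: compress 20-dimensional inputs into a
# 3-dimensional bottleneck and reconstruct them (unsupervised feature learning).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 20)                # synthetic data for illustration

autoencoder = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(8, activation="relu"),     # encoder
    layers.Dense(3, activation="relu"),     # bottleneck: compressed representation
    layers.Dense(8, activation="relu"),     # decoder
    layers.Dense(20, activation="sigmoid"), # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, verbose=0)  # the input is also the target
```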

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “Data Science”, including advanced analytics with machine learning modeling, can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policies, and every possible industry where data gets generated. In the following, we discuss the ten most popular application areas based on data science and analytics.

  • Business or financial data science: In general, business data science can be considered as the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making as well as taking high-quality actions [ 90 ]. Data scientists can develop algorithms or data-driven models predicting customer behavior and identifying patterns and trends based on historical business data, which can help companies to reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [ 76 ], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [ 105 ]. In terms of finance, historical data enables financial institutions to make high-stakes business decisions, and is mostly used for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, etc. Overall, data science methodologies can play a key role in the next generation of the business and finance industry, particularly in terms of business automation, intelligence, and smart decision-making and systems.
  • Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [ 14 ]. The latest fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus industrial data science, which is the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in such revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale data devices, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. The data needs to be processed, analyzed, and secured to help improve the system’s efficiency, safety, and scalability. Data science modeling thus can be used to maximize production, reduce costs and raise profits in manufacturing industries.
  • Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves the extrapolation of actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be obtained and analyzed from several sources, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys. In reality, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, avoid preventable diseases, and generally improve the quality of life [ 81 , 119 ]. Across the global population, the average human lifespan is growing, presenting new challenges to today’s methods of care delivery. Thus health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and even better monitor the spread of diseases. Eventually, it may lead to new approaches to improve patient care, clinical expertise, diagnosis, and management.
  • IoT data science: Internet of things (IoT) [ 9 ] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered to be the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [ 112 ]. One of the IoT’s main fields of application is a smart city, which uses technology to improve city services and citizens’ living experiences. For example, using the relevant data, data science methods can be used for traffic prediction in smart cities, to estimate the total usage of energy of the citizens for a particular period. Deep learning-based models in data science can be built based on a large scale of IoT datasets [ 7 , 104 ]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, and industry, and many others.
  • Cybersecurity data science: Cybersecurity, or the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [ 114 , 121 ]. Data science techniques, particularly machine learning, have become a crucial cybersecurity technology that continually learns to identify trends by analyzing data, better detecting malware in encrypted traffic, finding insider threats, predicting where bad neighborhoods are online, keeping people safe while surfing, or protecting information in the cloud by uncovering suspicious user activity [ 114 ]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [ 103 , 106 ]. To generate security policy rules, association rule learning can play a significant role to build rule-based systems [ 102 ]. Deep learning-based security models can perform better when utilizing the large scale of security datasets [ 140 ]. Thus data science modeling can enable professionals in cybersecurity to be more proactive in preventing threats and reacting in real-time to active attacks, through extracting actionable insights from the security datasets.
  • Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of Internet-connected devices, such as a PC, tablet, or smartphone [ 112 ]. Websites, mobile applications, marketing automation systems, call centers, help desks, billing systems, etc. are all common sources of behavioral data. Behavioral data is not static; it changes as user activities change over time [ 108 ]. Advanced analytics of these data, including machine learning modeling, can facilitate several areas, such as predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences in future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; and detecting compromised credentials and insider threats by locating anomalous behavior, or making suggestions, etc. Overall, behavioral data science modeling typically enables making the right offers to the right consumers at the right time on various common platforms such as e-commerce platforms, online games, web and mobile applications, and IoT. In a social context, analyzing human behavioral data using advanced analytics methods, together with the insights extracted from social data, can be used for data-driven intelligent social services, which can be considered social data science.
  • Mobile data science: Today’s smart mobile phones are considered as “next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity” [ 146 ]. In our earlier paper [ 112 ], we have shown that users’ interest in “Mobile Phones” is much higher than in other platforms like “Desktop Computer”, “Laptop Computer” or “Tablet Computer” in recent years. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, Linkedin, and Twitter, and various IoT services such as smart cities, health, and transportation services, and many others. Intelligent apps are based on insights extracted from the relevant datasets, and share characteristics such as being action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-aware, and cross-platform [ 112 ]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.
  • Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as image, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered as valuable sources of multimedia big data [ 89 ]. People, particularly younger generations, spend a lot of time on the Internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image or video processing, computer vision, audio or speech processing, and database management are among the solutions available for a range of applications including healthcare, education, entertainment, and mobile devices.
  • Smart cities or urban data science: Today, more than half of the world’s population live in urban areas or cities [ 80 ], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [ 96 , 116 ]. In addition to cities, “urban area” can refer to surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting the daily events, perceptions, thoughts, and emotions of citizens is recorded, loosely categorized into personal data, e.g., household, education, employment, health, immigration, crime, etc.; proprietary data, e.g., banking, retail, online platform data, etc.; government data, e.g., citywide crime statistics or data from government institutions; open and public data, e.g., data.gov and ordnance survey; and organic and crowdsourced data, e.g., user-generated web data, social media, Wikipedia, etc. [ 29 ]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective, through extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [ 105 ] can facilitate the efficient management of urban areas, including real-time management, e.g., traffic flow management; evidence-based planning decisions which pertain to the longer-term strategic role of forecasting for urban planning, e.g., crime prevention, public safety, and security; or framing the future, e.g., political decision-making [ 29 ]. Overall, it can contribute to government and public planning, as well as relevant sectors including retail, financial services, mobility, health, policing, and utilities within a data-rich urban environment, through data-driven smart decision-making and policies, which lead to smart cities and improve the quality of human life.
  • Smart villages or rural data science: Rural areas, or the countryside, are the opposite of urban areas and include villages, hamlets, and agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions, including protecting public safety, providing critical health services, supporting agriculture, and fostering economic development from a data-driven perspective, through extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data, including machine learning modeling [ 105 ], can provide new opportunities for rural communities to build the insights and capacity needed to meet current needs and prepare for their futures. For instance, machine learning modeling [ 105 ] can help farmers to enhance their decisions to adopt sustainable agriculture, utilizing the increasing amount of data captured by emerging technologies, e.g., the internet of things (IoT), mobile technologies and devices, etc. [ 1 , 51 , 52 ]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, and other services, leading to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector in our real-world life, where the relevant data is available to analyze. To gather the right data and extract useful knowledge or actionable insights from the data for making smart decisions is the key to data science modeling in any application domain. Based on our discussion on the above ten potential real-world application domains by taking into account data-driven smart computing and decision making, we can say that the prospects of data science and the role of data scientists are huge for the future world. The “Data Scientists” typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical modeling or algorithms, or data-driven tools, or solutions, focused on advanced analytics, which can make today’s computing process smarter, automated, and intelligent.

Challenges and Research Directions

Our study on data science and analytics, particularly data science modeling in “ Understanding data science modeling ”, advanced analytics methods and smart computing in “ Advanced analytics methods and smart computing ”, and real-world application areas in “ Real-world application domains ”, opens several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions to build data-driven products.

  • Understanding the real-world business problems and the associated data, including their nature, e.g., forms, types, size, labels, etc., is the first challenge in data science modeling, discussed briefly in “ Understanding data science modeling ”. This actually means identifying, specifying, representing, and quantifying the domain-specific business problems and data according to the requirements. For a data-driven, effective business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, such as structured or unstructured data, related to a specific business issue with legal access, which varies from application to application, is challenging. Moreover, data annotation, which is typically the process of categorizing, tagging, or labeling raw data for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.
  • The next challenge is the extraction of relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [ 101 ]. The performance of the advanced analytics methods, including machine and deep learning modeling, discussed in “ Advanced analytics methods and smart computing ”, is highly impacted by the quality and availability of the data. Thus, understanding the real-world business scenario and the associated data, as to whether, how, and why they are insufficient, missing, or problematic, and then extending or redeveloping the existing methods, such as large-scale hypothesis testing, learning inconsistency, and uncertainty, etc., to address the complexities in data and business problems is important. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.
  • Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making for a particular business problem is the main issue in the area of data science. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns to determine what is likely to occur in the future. Basic analytics offer a description of data in general, while advanced analytics is a step forward in offering a deeper understanding of data and helping with granular data analysis. Thus, understanding the advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in “ Advanced analytics methods and smart computing ” may not be directly applicable for the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [ 4 ] may produce redundant rules from the data that make the decision-making process complex and ineffective [ 113 ]. Thus, a scientific understanding of the learning algorithms, their mathematical properties, and how robust or fragile the techniques are to input data is needed. Therefore, a deeper understanding of the strengths and drawbacks of the existing machine and deep learning methods [ 38 , 105 ] to solve a particular business problem is needed; consequently, improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms/techniques with higher accuracy, becomes a significant challenge for the next generation of data scientists.
  • The traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, the new trends are more likely to be interesting and useful for modeling and predicting the future than older ones. For example, smartphone user behavior modeling, IoT services, stock market forecasting, health or transport service, job market analysis, and other related areas where time-series and actual human interests or preferences are involved over time. Thus, rather than considering the traditional data analysis, the concept of RecencyMiner, i.e., recent pattern-based extracted insight or knowledge proposed in our earlier paper Sarker et al. [ 108 ] might be effective. Therefore, to propose the new techniques by taking into account the recent data patterns, and consequently to build a recency-based data-driven model for solving real-world problems, is another significant challenging issue in the area.
  • The most crucial task for a data-driven smart system is to create a framework that supports data science modeling discussed in “ Understanding data science modeling ”. As a result, advanced analytical methods based on machine learning or deep learning techniques can be considered in such a system to make the framework capable of resolving the issues. Besides, incorporating contextual information such as temporal context, spatial context, social context, environmental context, etc. [ 100 ] can be used for building an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, as well as experimental evaluation, is a very important direction to effectively solve a business problem in a particular domain, as well as a big challenge for the data scientists.
  • In several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, decisions made by data-driven models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [ 104 ]. If we can explain the result in a meaningful way, then the model can be better trusted by the end-user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.

In the above, we have summarized and discussed several challenges and the potential research opportunities and directions, within the scope of our study in the area of data science and advanced analytics. The data scientists in academia/industry and the researchers in the relevant area have the opportunity to contribute to each issue identified above and build effective data-driven models or systems, to make smart decisions in the corresponding business domains.

In this paper, we have presented a comprehensive view on data science, including various types of advanced analytical methods that can be applied to enhance the intelligence and the capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from the relevant terms used in the area, to establish the position of this paper. We have further provided a thorough study of data science modeling with its various processing modules, which are needed to extract actionable insights from the data for a particular business problem and the eventual data product. Thus, according to our goal, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. For this, we have also summarized various types of advanced analytical methods and outcomes, as well as machine learning modeling, that are needed to solve the associated business problems. Thus, this study’s key contribution has been identified as the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision making.

Finally, within the scope of our study, we have outlined and discussed the challenges we faced, as well as possible research opportunities and future directions. As a result, the challenges identified provide promising research opportunities in the field that can be explored with effective solutions to improve the data-driven model and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods, leads in a positive direction and can be used as a reference guide for future research and applications in the field of data science and its real-world applications by both academia and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Data Analysis in Research: Types & Methods


Content Index

  • What is data analysis in research?
  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis

What is data analysis in research?

Definition of research in data analysis: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. The second is summarization and categorization, which together contribute to data reduction and help find patterns and themes in the data for easy identification and linking. The third is the analysis itself, which researchers perform in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that data analysis and data interpretation represent the application of deductive and inductive logic to the research data.

Why analyze data in research?

Researchers rely heavily on data, as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but an answer to that question. But what if there is no question to ask? Well, it is possible to explore data even without a problem – we call it ‘Data Mining’, which often reveals some interesting patterns within the data that are worth exploring.

Irrespective of the type of data researchers explore, their mission and their audience’s vision guide them to find the patterns that shape the story they want to tell. One of the essential things expected from researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, sometimes data analysis tells the most unforeseen yet exciting stories that were not expected when initiating the analysis. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Types of data in research

Every kind of data has the quality of describing things once a specific value is assigned to it. For analysis, you need to organize these values, processed and presented in a given context, to make them useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data presented has words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be distinguished into categories, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, etc. all produce this type of data. You can present such data in graphical formats or charts, or apply statistical analysis methods to it. The Outcomes Measurement Systems (OMS) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: It is data presented in groups. However, an item included in the categorical data cannot belong to more than one group. Example: A person responding to a survey by telling his living style, marital status, smoking habit, or drinking habit comes under the categorical data. A chi-square test is a standard method used to analyze this data.
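As a minimal illustration of the chi-square test mentioned above, the following sketch uses SciPy on a hypothetical contingency table of marital status versus smoking habit; the counts are invented for illustration.

```python
# A minimal sketch of a chi-square test of independence for categorical data,
# applied to an invented marital-status vs. smoking-habit contingency table.
from scipy.stats import chi2_contingency

observed = [[30, 10],   # married: smokers, non-smokers
            [20, 40]]   # single:  smokers, non-smokers

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
# A small p-value suggests the two categorical variables are not independent.
```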


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from that of numerical data, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complex information is an involved process; hence it is typically used for exploratory research and data analysis.

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual: researchers usually read the available data and find repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find  “food”  and  “hunger” are the most commonly used words and will highlight them for further analysis.
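A minimal sketch of this word-based method, assuming a small invented list of free-text responses and a hand-picked stop-word list, could look like the following.

```python
# A minimal word-frequency sketch: count the most frequent words in free-text
# responses (the responses and the stop-word list are illustrative assumptions).
from collections import Counter
import re

responses = [
    "Access to food is the biggest problem here",
    "Hunger and food prices worry everyone",
    "Clean water and food security come first",
]
stopwords = {"to", "is", "the", "and", "here", "come", "first"}

words = []
for text in responses:
    words += [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]

print(Counter(words).most_common(3))   # e.g., [('food', 3), ...]
```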


The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example , researchers conducting research and data analysis for studying the concept of ‘diabetes’ amongst respondents might analyze the context of when and how the respondent has used or referred to the word ‘diabetes.’

The scrutiny-based technique is also one of the highly recommended text analysis methods used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique, used to differentiate how a specific text is similar to or different from another.

For example: to find out the “importance of a resident doctor in a company,” the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method for analyzing polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable Partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations from the enormous data.


Methods used for data analysis in qualitative research

There are several techniques to analyze the data in qualitative research, but here are some commonly used methods:

  • Content Analysis: It is widely accepted and the most frequently employed technique for data analysis in research methodology. It can be used to analyze documented information from text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions shared by people are focused on finding answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. Nevertheless, this particular method considers the social context under which, or within which, the communication between the researcher and respondent takes place. In addition, discourse analysis also focuses on the lifestyle and day-to-day environment while deriving any conclusion.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, using grounded theory to analyze qualitative data is the best resort. Grounded theory is applied to study data about a host of similar cases occurring in different settings. When researchers use this method, they might alter explanations or produce new ones until they arrive at some conclusion.


Data analysis in quantitative research

Preparing data for analysis

The first stage in research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to determine whether the collected data sample meets the pre-set standards or is a biased sample. It is divided into four stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent has answered all the questions in an online survey, or that the interviewer asked all the questions devised in the questionnaire.

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in some fields incorrectly or skip them accidentally. Data editing is a process wherein the researchers have to confirm that the provided data is free of such errors. They need to conduct the necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to the survey responses. If a survey is completed with a sample size of 1000, the researcher will create age brackets to distinguish the respondents based on their age. Thus, it becomes easier to analyze small data buckets rather than deal with the massive data pile.
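A minimal data-coding sketch with pandas, assuming a hypothetical age column and invented bracket labels, might look like this.

```python
# A minimal data-coding sketch: group respondents into age brackets with pandas
# so analysis can work on small buckets instead of raw ages.
import pandas as pd

df = pd.DataFrame({"age": [19, 25, 34, 47, 52, 61, 38, 29]})
df["age_group"] = pd.cut(df["age"],
                         bins=[17, 25, 35, 50, 65],
                         labels=["18-25", "26-35", "36-50", "51-65"])
print(df["age_group"].value_counts())
```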


Methods used for data analysis in quantitative research

After the data is prepared for analysis, researchers are open to using different research and data analysis methods to derive meaningful insights. For sure, statistical analysis is the most favored way to analyze numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. The method is again classified into two groups: first, ‘Descriptive statistics’, used to describe data; second, ‘Inferential statistics’, which helps in comparing the data.

Descriptive statistics

This method is used to describe the basic features of versatile types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not go beyond describing the data; the conclusions drawn are based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to demonstrate distribution by various points.
  • Researchers use this method when they want to showcase the most commonly or averagely indicated response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • Range: the difference between the highest and lowest scores.
  • Standard deviation: the average distance between observed scores and the mean.
  • It is used to identify the spread of scores by stating intervals.
  • Researchers use this method to showcase how spread out the data is; the extent of the spread directly affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores helping researchers to identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.

For quantitative research, the use of descriptive analysis often gives absolute numbers, but these alone are never sufficient to demonstrate the rationale behind those numbers. Nevertheless, it is necessary to think of the best method for research and data analysis suiting your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students’ average scores in schools. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided sample without generalizing it. For example, when you want to compare the average voting done in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.
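The following minimal sketch computes the four groups of descriptive measures above with pandas; the survey scores are invented for illustration.

```python
# A minimal descriptive-statistics sketch for a hypothetical column of survey scores.
import pandas as pd

scores = pd.Series([4, 5, 3, 4, 2, 5, 4, 3, 4, 5])

print(scores.value_counts())                                   # measures of frequency
print(scores.mean(), scores.median(), scores.mode().tolist())  # central tendency
print(scores.max() - scores.min(), scores.var(), scores.std()) # dispersion
print(scores.rank(pct=True))                                   # percentile ranks (position)
```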

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a representative sample of that population. For example, you can ask some 100-odd audience members at a movie theater if they like the movie they are watching. Researchers then use inferential statistics on the collected sample to reason that about 80–90% of people like the movie.

Here are two significant areas of inferential statistics; a minimal worked example follows the list.

  • Estimating parameters: It takes statistics from the sample research data and demonstrates something about the population parameter.
  • Hypothesis test: It’s about sampling research data to answer the survey research questions. For example, researchers might be interested to understand if the new shade of lipstick recently launched is good or not, or if the multivitamin capsules help children to perform better at games.
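As a minimal worked example of estimating a population parameter from a sample, the sketch below computes a normal-approximation 95% confidence interval for the movie-theater scenario, assuming 85 of 100 sampled viewers liked the film (an invented count).

```python
# A minimal inferential-statistics sketch: a normal-approximation confidence
# interval for a sample proportion (85 of 100 viewers liked the film).
import math
from scipy.stats import norm

n, liked = 100, 85
p_hat = liked / n
z = norm.ppf(0.975)                      # z-value for a 95% confidence level
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat - margin:.2f} to {p_hat + margin:.2f}")   # roughly 0.78 to 0.92
```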

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. It is often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research; a short worked sketch follows the list.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but are interested in understanding the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables, cross-tabulation is used to analyze the relationship between multiple variables. Suppose the provided data has age and gender categories presented in rows and columns. A two-dimensional cross-tabulation helps for seamless data analysis and research by showing the number of males and females in each age category.
  • Regression analysis: For understanding the strong relationship between two variables, researchers do not look beyond the primary and commonly used regression analysis method, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable, along with multiple independent variables. You undertake efforts to find out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to be ascertained in an error-free random manner.
  • Frequency tables: This statistical procedure is used to summarize how often each value of a variable occurs, which helps reveal the distribution of responses before applying further tests.
  • Analysis of variance (ANOVA): This statistical procedure is used for testing the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation means research findings were significant. In many contexts, ANOVA testing and variance analysis are similar.
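The following minimal sketch illustrates three of the methods listed above (correlation, regression, and one-way ANOVA) with SciPy; all sample values are invented for illustration.

```python
# A minimal sketch of correlation, simple linear regression, and one-way ANOVA.
from scipy import stats

ad_spend = [10, 20, 30, 40, 50]
sales    = [12, 24, 33, 41, 55]

r, p = stats.pearsonr(ad_spend, sales)            # correlation between two variables
reg = stats.linregress(ad_spend, sales)           # regression: sales ~ ad_spend
print(f"r={r:.2f}, slope={reg.slope:.2f}, intercept={reg.intercept:.2f}")

# One-way ANOVA: do three groups differ in their mean scores?
g1, g2, g3 = [3, 4, 5, 4], [5, 6, 6, 7], [8, 7, 9, 8]
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"F={f_stat:.2f}, p={p_anova:.4f}")
```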
Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data, and should be trained to demonstrate a high standard of research practice. Ideally, researchers must possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Usually, research and data analytics projects differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps design a survey questionnaire, select data collection methods, and choose samples.


  • The primary aim of data research and analysis is to derive ultimate insights that are unbiased. Any mistake in collecting data, selecting an analysis method, or choosing an audience sample, or approaching any of these with a biased mind, will lead to a biased inference.
  • No level of sophistication in research data and analysis can rectify poorly defined objective outcome measurements. Whether the design is at fault or the intentions are not clear, a lack of clarity can mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find ways to deal with everyday challenges like outliers, missing data, data altering, data mining, or developing graphical representations.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in the hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them a medium to collect data by creating appealing surveys.


Data analysis - List of Essay Samples And Topic Ideas

Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information and support decision-making. Essays on data analysis could delve into various techniques and tools used in data analysis, its application in different fields like business or science, or the ethical considerations in handling data. Discussions might also explore the role of big data and analytics in modern society, the challenges posed by data privacy concerns, or the future of data analysis with the advent of technologies like artificial intelligence and machine learning.

Impact of Technology on Privacy

The 21st Century is characterized by the heavy impact technology has on us as a society while it continues to develop new devices and modernize technology. Millions of individuals around the world are now connected digitally, in other words, people globally rely heavily on smartphones tablets, and/ or computers that store or save a majority of their personal information. Critical and extremely personal data is available and collected in these smart technology such as credit card details, fingerprint layout, and […]

What the Future Life Might Look Like with Blockchain

 Just ten short years ago, blockchain was just an idea in a handful of forward-thinking developer’s heads. Even today, it’s widely misunderstood and only a small portion of the population knows exactly what it is. Most technology experts agree, however, that blockchain will transform many of the ways we do business. In just a short time, it is already having an impact on many business segments and is starting to touch our lives in many meaningful ways that we may […]

Sexual Orientation on Helping Behaviors Among African American College Students

Introduction Seeking help from friends and family members are much easier as opposed to strangers. Many considerations run in the mind of an individual when seeking help from persons they are not aware of. Factors such as the sexual orientation of the person expected to provide help and the time of the day are some of the considerations made before determining the chances of the help being granted. Sexual orientation and time of the day are therefore important factors that […]


Alcohol Abuse Among Women in Salford

Introduction Nowadays, our society found it difficult to understand or restrain their use of alcohol to women because this alcoholic beverage has been a part of social life for ages. Alcohol abuse has destroyed many lives and considerably damaged many families, patients and women. Also it can bring a massive consequence in the entire society. The shocking effects of too much alcohol drinking are generally well-known. The alcohol use in women in Salford have been brought sharply into focus over […]

Beyond Numbers: the Significance of Qualitative Data Analysis

In the ever-evolving landscape of research and analysis, the spotlight has historically shone brightly on quantitative data, offering a seemingly unassailable foundation for drawing conclusions. However, a paradigm shift is underway, acknowledging the irreplaceable role played by qualitative data analysis. Beyond the stark numerical arrays inhabiting spreadsheets and statistical models, a treasure trove of nuanced insights awaits discovery through a more interpretive and thorough exploration. Qualitative data analysis, fundamentally, is a meticulous and rigorous expedition into non-numeric information. It entails […]

Quantifying Impact: Data Analysis in Assessing Social Phenomena

In the contemporary tapestry of knowledge, the quest to fathom the reverberations of social phenomena has evolved into a fascinating expedition at the crossroads of data analysis and human interaction. This symbiotic relationship has birthed a vibrant discipline, where the intricate dance of numbers meets the enigma of social dynamics. Unraveling the complexities and quantifying the profound consequences of various phenomena on individuals and communities now unfolds in the realm of data-driven exploration. In an age where data reigns supreme, […]

Insights from Data Analysis: Methodologies, Applications, and Implications

Multifaceted domain of data analysis, examining its methodologies, applications, and implications within diverse fields of inquiry. Drawing from statistical techniques, machine learning algorithms, and computational methods, it explores the process of deriving meaningful insights from data and making informed decisions based on empirical evidence. Through an examination of case studies and theoretical frameworks, this essay elucidates the role of data analysis in informing decision-making, driving innovation, and advancing knowledge across disciplines. Data analysis serves as a cornerstone of empirical inquiry, […]

Quantum Data Analysis: Bridging Gaps in Incomplete Datasets

In the ever-evolving landscape of data analytics, the challenge of handling incomplete datasets remains a persistent stumbling block. As we delve into the intricacies of quantum data interpolation, a groundbreaking approach emerges—one that transcends conventional methods and introduces a paradigm shift in the way we address data gaps. Traditional data interpolation methods often fall short when confronted with the complexity of incomplete datasets. The inadequacies of linear and polynomial interpolations become evident, especially in scenarios where quantum data is involved. […]


5 Steps to Write a Great Analytical Essay

Do you need to write an analytical essay for school? What sets this kind of essay apart from other types, and what must you include when you write your own analytical essay? In this guide, we break down the process of writing an analytical essay by explaining the key factors your essay needs to have, providing you with an outline to help you structure your essay, and analyzing a complete analytical essay example so you can see what a finished essay looks like.

What Is an Analytical Essay?

Before you begin writing an analytical essay, you must know what this type of essay is and what it includes. Analytical essays analyze something, often (but not always) a piece of writing or a film.

An analytical essay is more than just a synopsis of the issue though; in this type of essay you need to go beyond surface-level analysis and look at what the key arguments/points of this issue are and why. If you’re writing an analytical essay about a piece of writing, you’ll look into how the text was written and why the author chose to write it that way. Instead of summarizing, an analytical essay typically takes a narrower focus and looks at areas such as major themes in the work, how the author constructed and supported their argument, how the essay used literary devices to enhance its messages, etc.

While you certainly want people to agree with what you’ve written, unlike with persuasive and argumentative essays, your main purpose when writing an analytical essay isn’t to try to convert readers to your side of the issue. Therefore, you won’t be using strong persuasive language like you would in those essay types. Rather, your goal is to have enough analysis and examples that the strength of your argument is clear to readers.

Besides typical essay components like an introduction and conclusion, a good analytical essay will include:

  • A thesis that states your main argument
  • Analysis that relates back to your thesis and supports it
  • Examples to support your analysis and allow a more in-depth look at the issue

In the rest of this article, we’ll explain how to include each of these in your analytical essay.

How to Structure Your Analytical Essay

Analytical essays are structured similarly to many other essays you’ve written, with an introduction (including a thesis), several body paragraphs, and a conclusion. Below is an outline you can follow when structuring your essay, and in the next section we go into more detail on how to write an analytical essay.

Introduction

Your introduction will begin with some sort of attention-grabbing sentence to get your audience interested, then you’ll give a few sentences setting up the topic so that readers have some context, and you’ll end with your thesis statement. Your introduction will include:

  • Brief background information explaining the issue/text
  • Your thesis

Body Paragraphs

Your analytical essay will typically have three or four body paragraphs, each covering a different point of analysis. Begin each body paragraph with a sentence that sets up the main point you’ll be discussing. Then you’ll give some analysis on that point, backing it up with evidence to support your claim. Continue analyzing and giving evidence for your analysis until you’re out of strong points for the topic. At the end of each body paragraph, you may choose to have a transition sentence that sets up what the next paragraph will be about, but this isn’t required. Body paragraphs will include:

  • Introductory sentence explaining what you’ll cover in the paragraph (sort of like a mini-thesis)
  • Analysis point
  • Evidence (either passages from the text or data/facts) that supports the analysis
  • (Repeat analysis and evidence until you run out of examples)

Conclusion

You won’t be making any new points in your conclusion; at this point you’re just reiterating key points you’ve already made and wrapping things up. Begin by rephrasing your thesis and summarizing the main points you made in the essay. Someone who reads just your conclusion should be able to come away with a basic idea of what your essay was about and how it was structured. After this, you may choose to make some final concluding thoughts, potentially by connecting your essay topic to larger issues to show why it’s important. A conclusion will include:

  • Paraphrase of thesis
  • Summary of key points of analysis
  • Final concluding thought(s)


5 Steps for Writing an Analytical Essay

Follow these five tips to break down writing an analytical essay into manageable steps. By the end, you’ll have a fully-crafted analytical essay with both in-depth analysis and enough evidence to support your argument. All of these steps use the completed analytical essay in the next section as an example.

#1: Pick a Topic

You may have already had a topic assigned to you, and if that’s the case, you can skip this step. However, if you haven’t, or if the topic you’ve been assigned is broad enough that you still need to narrow it down, then you’ll need to decide on a topic for yourself. Choosing the right topic can mean the difference between an analytical essay that’s easy to research (and gets you a good grade) and one that takes hours just to find a few decent points to analyze.

Before you decide on an analytical essay topic, do a bit of research to make sure you have enough examples to support your analysis. If you choose a topic that’s too narrow, you’ll struggle to find enough to write about.

For example, say your teacher assigns you to write an analytical essay about the theme in John Steinbeck’s The Grapes of Wrath of exposing injustices against migrants. For it to be an analytical essay, you can’t just recount the injustices characters in the book faced; that’s only a summary and doesn’t include analysis. You need to choose a topic that allows you to analyze the theme. One of the best ways to explore a theme is to analyze how the author made his/her argument. One example here is that Steinbeck used literary devices in the intercalary chapters (short chapters that didn’t relate to the plot or contain the main characters of the book) to show what life was like for migrants as a whole during the Dust Bowl.

You could write about how Steinbeck used literary devices throughout the whole book, but, in the essay below, I chose to just focus on the intercalary chapters since they gave me enough examples. Having a narrower focus will nearly always result in a tighter and more convincing essay (and can make compiling examples less overwhelming).

#2: Write a Thesis Statement

Your thesis statement is the most important sentence of your essay; a reader should be able to read just your thesis and understand what the entire essay is about and what you’ll be analyzing. When you begin writing, remember that each sentence in your analytical essay should relate back to your thesis.

In the analytical essay example below, the thesis is the final sentence of the first paragraph (the traditional spot for it). The thesis is: “In The Grapes of Wrath’s intercalary chapters, John Steinbeck employs a variety of literary devices and stylistic choices to better expose the injustices committed against migrants in the 1930s.” So what will this essay analyze? How Steinbeck used literary devices in the intercalary chapters to show how rough migrants could have it. Crystal clear.

#3: Do Research to Find Your Main Points

This is where you determine the bulk of your analysis--the information that makes your essay an analytical essay. My preferred method is to list every idea that I can think of, then research each of those and use the three or four strongest ones for your essay. Weaker points may be those that don’t relate back to the thesis, that you don’t have much analysis to discuss, or that you can’t find good examples for. A good rule of thumb is to have one body paragraph per main point.

This essay has four main points, each of which analyzes a different literary device Steinbeck uses to better illustrate how difficult life was for migrants during the Dust Bowl. The four literary devices and their impact on the book are:

  • Lack of individual names in intercalary chapters to illustrate the scope of the problem
  • Parallels to the Bible to induce sympathy for the migrants
  • Non-showy, often grammatically-incorrect language so the migrants are more realistic and relatable to readers
  • Nature-related metaphors to affect the mood of the writing and reflect the plight of the migrants

#4: Find Excerpts or Evidence to Support Your Analysis

Now that you have your main points, you need to back them up. If you’re writing a paper about a text or film, use passages/clips from it as your main source of evidence. If you’re writing about something else, your evidence can come from a variety of sources, such as surveys, experiments, quotes from knowledgeable sources etc. Any evidence that would work for a regular research paper works here.

In this example, I quoted multiple passages from The Grapes of Wrath  in each paragraph to support my argument. You should be able to back up every claim you make with evidence in order to have a strong essay.

#5: Put It All Together

Now it's time to begin writing your essay, if you haven’t already. Create an introductory paragraph that ends with the thesis, make a body paragraph for each of your main points, including both analysis and evidence to back up your claims, and wrap it all up with a conclusion that recaps your thesis and main points and potentially explains the big picture importance of the topic.


Analytical Essay Example + Analysis

So that you can see for yourself what a completed analytical essay looks like, here’s an essay I wrote back in my high school days. It’s followed by analysis of how I structured my essay, what its strengths are, and how it could be improved.

One way Steinbeck illustrates the connections all migrant people possessed and the struggles they faced is by refraining from using specific titles and names in his intercalary chapters. While The Grapes of Wrath focuses on the Joad family, the intercalary chapters show that all migrants share the same struggles and triumphs as the Joads. No individual names are used in these chapters; instead the people are referred to as part of a group. Steinbeck writes, “Frantic men pounded on the doors of the doctors; and the doctors were busy.  And sad men left word at country stores for the coroner to send a car,” (555). By using generic terms, Steinbeck shows how the migrants are all linked because they have gone through the same experiences. The grievances committed against one family were committed against thousands of other families; the abuse extends far beyond what the Joads experienced. The Grapes of Wrath frequently refers to the importance of coming together; how, when people connect with others their power and influence multiplies immensely. Throughout the novel, the goal of the migrants, the key to their triumph, has been to unite. While their plans are repeatedly frustrated by the government and police, Steinbeck’s intercalary chapters provide a way for the migrants to relate to one another because they have encountered the same experiences. Hundreds of thousands of migrants fled to the promised land of California, but Steinbeck was aware that numbers alone were impersonal and lacked the passion he desired to spread. Steinbeck created the intercalary chapters to show the massive numbers of people suffering, and he created the Joad family to evoke compassion from readers.  Because readers come to sympathize with the Joads, they become more sensitive to the struggles of migrants in general. However, John Steinbeck frequently made clear that the Joads were not an isolated incident; they were not unique. Their struggles and triumphs were part of something greater. Refraining from specific names in his intercalary chapters allows Steinbeck to show the vastness of the atrocities committed against migrants.

Steinbeck also creates significant parallels to the Bible in his intercalary chapters in order to enhance his writing and characters. By using simple sentences and stylized writing, Steinbeck evokes Biblical passages. The migrants despair, “No work till spring. No work,” (556).  Short, direct sentences help to better convey the desperateness of the migrants’ situation. Throughout his novel, John Steinbeck makes connections to the Bible through his characters and storyline. Jim Casy’s allusions to Christ and the cycle of drought and flooding are clear biblical references.  By choosing to relate The Grapes of Wrath to the Bible, Steinbeck’s characters become greater than themselves. Starving migrants become more than destitute vagrants; they are now the chosen people escaping to the promised land. When a forgotten man dies alone and unnoticed, it becomes a tragedy. Steinbeck writes, “If [the migrants] were shot at, they did not run, but splashed sullenly away; and if they were hit, they sank tiredly in the mud,” (556). Injustices committed against the migrants become greater because they are seen as children of God through Steinbeck’s choice of language. Referencing the Bible strengthens Steinbeck’s novel and purpose: to create understanding for the dispossessed.  It is easy for people to feel disdain for shabby vagabonds, but connecting them to such a fundamental aspect of Christianity induces sympathy from readers who might have otherwise disregarded the migrants as so many other people did.

The simple, uneducated dialogue Steinbeck employs also helps to create a more honest and meaningful representation of the migrants, and it makes the migrants more relatable to readers. Steinbeck chooses to accurately represent the language of the migrants in order to more clearly illustrate their lives and make them seem more like real people than just characters in a book. The migrants lament, “They ain’t gonna be no kinda work for three months,” (555). There are multiple grammatical errors in that single sentence, but it vividly conveys the despair the migrants felt better than a technically perfect sentence would. The Grapes of Wrath is intended to show the severe difficulties facing the migrants so Steinbeck employs a clear, pragmatic style of writing. Steinbeck shows the harsh, truthful realities of the migrants’ lives and he would be hypocritical if he chose to give the migrants a more refined voice and not portray them with all their shortcomings. The depiction of the migrants as imperfect through their language also makes them easier to relate to. Steinbeck’s primary audience was the middle class, the less affluent of society. Repeatedly in The Grapes of Wrath, the wealthy make it obvious that they scorn the plight of the migrants. The wealthy, not bad luck or natural disasters, were the prominent cause of the suffering of migrant families such as the Joads. Thus, Steinbeck turns to the less prosperous for support in his novel. When referring to the superior living conditions barnyard animals have, the migrants remark, “Them’s horses-we’re men,” (556). The perfect simplicity of this quote expresses the absurdness of the migrants’ situation better than any flowery expression could.

In The Grapes of Wrath, John Steinbeck uses metaphors, particularly about nature, in order to illustrate the mood and the overall plight of migrants. Throughout most of the book, the land is described as dusty, barren, and dead. Towards the end, however, floods come and the landscape begins to change. At the end of chapter twenty-nine, Steinbeck describes a hill after the floods saying, “Tiny points of grass came through the earth, and in a few days the hills were pale green with the beginning year,” (556). This description offers a stark contrast from the earlier passages which were filled with despair and destruction. Steinbeck’s tone from the beginning of the chapter changes drastically. Early in the chapter, Steinbeck had used heavy imagery in order to convey the destruction caused by the rain, “The streams and the little rivers edged up to the bank sides and worked at willows and tree roots, bent the willows deep in the current, cut out the roots of cottonwoods and brought down the trees,” (553). However, at the end of the chapter the rain has caused new life to grow in California. The new grass becomes a metaphor representing hope. When the migrants are at a loss over how they will survive the winter, the grass offers reassurance. The story of the migrants in the intercalary chapters parallels that of the Joads. At the end of the novel, the family is breaking apart and has been forced to flee their home. However, both the book and final intercalary chapter end on a hopeful note after so much suffering has occurred. The grass metaphor strengthens Steinbeck’s message because it offers a tangible example of hope. Through his language Steinbeck’s themes become apparent at the end of the novel. Steinbeck affirms that persistence, even when problems appear insurmountable, leads to success. These metaphors help to strengthen Steinbeck’s themes in The Grapes of Wrath because they provide a more memorable way to recall important messages.

John Steinbeck’s language choices help to intensify his writing in his intercalary chapters and allow him to more clearly show how difficult life for migrants could be. Refraining from using specific names and terms allows Steinbeck to show that many thousands of migrants suffered through the same wrongs. Imitating the style of the Bible strengthens Steinbeck’s characters and connects them to the Bible, perhaps the most famous book in history. When Steinbeck writes in the imperfect dialogue of the migrants, he creates a more accurate portrayal and makes the migrants easier to relate to for a less affluent audience. Metaphors, particularly relating to nature, strengthen the themes in The Grapes of Wrath by enhancing the mood Steinbeck wants readers to feel at different points in the book. Overall, the intercalary chapters that Steinbeck includes improve his novel by making it more memorable and reinforcing the themes Steinbeck embraces throughout the novel. Exemplary stylistic devices further persuade readers of John Steinbeck’s personal beliefs. Steinbeck wrote The Grapes of Wrath to bring to light cruelties against migrants, and by using literary devices effectively, he continuously reminds readers of his purpose. Steinbeck’s impressive language choices in his intercalary chapters advance the entire novel and help to create a classic work of literature that people still are able to relate to today. 

This essay sticks pretty closely to the standard analytical essay outline. It starts with an introduction, where I chose to use a quote to start off the essay. (This became my favorite way to start essays in high school because, if I wasn’t sure what to say, I could outsource the work and find a quote that related to what I’d be writing about.) The quote in this essay doesn’t relate to the themes I’m discussing quite as much as it could, but it’s still a slightly different way to start an essay and can intrigue readers. I then give a bit of background on The Grapes of Wrath and its themes before ending the intro paragraph with my thesis: that Steinbeck used literary devices in intercalary chapters to show how rough migrants had it.

Each of my four body paragraphs is formatted in roughly the same way: an intro sentence that explains what I’ll be discussing, analysis of that main point, and at least two quotes from the book as evidence.

My conclusion restates my thesis, summarizes each of four points I discussed in my body paragraphs, and ends the essay by briefly discussing how Steinbeck’s writing helped introduce a world of readers to the injustices migrants experienced during the dust bowl.

What does this analytical essay example do well? For starters, it contains everything that a strong analytical essay should, and it makes that easy to find. The thesis clearly lays out what the essay will be about, the first sentence of each of the body paragraph introduces the topic it’ll cover, and the conclusion neatly recaps all the main points. Within each of the body paragraphs, there’s analysis along with multiple excerpts from the book in order to add legitimacy to my points.

Additionally, the essay does a good job of taking an in-depth look at the issue introduced in the thesis. Four ways Steinbeck used literary devices are discussed, and for each of the examples are given and analysis is provided so readers can understand why Steinbeck included those devices and how they helped shaped how readers viewed migrants and their plight.

Where could this essay be improved? I believe the weakest body paragraph is the third one, the one that discusses how Steinbeck used plain, grammatically incorrect language to both accurately depict the migrants and make them more relatable to readers. The paragraph tries to touch on both of those reasons and ends up being somewhat unfocused as a result. It would have been better for it to focus on just one of those reasons (likely how it made the migrants more relatable) in order to be clearer and more effective. It’s a good example of how adding more ideas to an essay often doesn’t make it better if they don’t work with the rest of what you’re writing. This essay also could explain the excerpts that are included more and how they relate to the points being made. Sometimes they’re just dropped in the essay with the expectation that the readers will make the connection between the example and the analysis. This is perhaps especially true in the second body paragraph, the one that discusses similarities to Biblical passages. Additional analysis of the quotes would have strengthened it.


Summary: How to Write an Analytical Essay

What is an analytical essay? A critical analytical essay analyzes a topic, often a text or film. The analysis paper uses evidence to support the argument, such as excerpts from the piece of writing. All analytical papers include a thesis, analysis of the topic, and evidence to support that analysis.

When developing an analytical essay outline and writing your essay, follow these five steps:

  • #1: Pick a Topic
  • #2: Write a Thesis Statement
  • #3: Do Research to Find Your Main Points
  • #4: Find Excerpts or Evidence to Support Your Analysis
  • #5: Put It All Together

Reading analytical essay examples can also give you a better sense of how to structure your essay and what to include in it.

What's Next?

Learning about different writing styles in school? There are four main writing styles, and it’s important to understand each of them. Learn about them in our guide to writing styles, complete with examples.

Writing a research paper for school but not sure what to write about? Our guide to research paper topics has over 100 topics in ten categories so you can be sure to find the perfect topic for you.

Literary devices can be used to enhance both your writing and your communication. Check out this list of 31 literary devices to learn more!


Christine graduated from Michigan State University with degrees in Environmental Biology and Geography and received her Master's from Duke University. In high school she scored in the 99th percentile on the SAT and was named a National Merit Finalist. She has taught English and biology in several countries.



Writing a Good Data Analysis Report: 7 Steps

As a data analyst, you feel most comfortable when you’re alone with all the numbers and data. You’re able to analyze them with confidence and reach the results you were asked to find. But this is not the end of the road for you. You still need to write a data analysis report explaining your findings to laypeople: your clients or coworkers.

That means you need to think about your target audience, that is the people who’ll be reading your report.

They don’t have nearly as much knowledge about data analysis as you do. So, your report needs to be straightforward and informative. The article below will help you learn how to do it. Let’s take a look at some practical tips you can apply to your data analysis report writing and the benefits of doing so.


Data Analysis Report Writing: 7 Steps

The process of writing a data analysis report is far from simple, but you can master it quickly, with the right guidance and examples of similar reports.

This is why we've prepared a step-by-step guide that will cover everything you need to know about this process, as simply as possible. Let’s get to it.

Consider Your Audience

You are writing your report for a certain target audience, and you need to keep them in mind while writing. Depending on their level of expertise, you’ll need to adjust your report and ensure it speaks to them. So, before you go any further, ask yourself:

Who will be reading this report? How well do they understand the subject?

Let’s say you’re explaining the methodology you used to reach your conclusions and find the data in question. If the reader isn’t familiar with these tools and software, you’ll have to simplify it for them and provide additional explanations.

So, you won't be writing the same type of report for a coworker who's been on your team for years or a client who's seeing data analysis for the first time. Based on this determining factor, you'll think about:

  • the language and vocabulary you’re using
  • abbreviations and level of technicality
  • the depth you’ll go into to explain something
  • the type of visuals you’ll add

Your readers’ expertise dictates the tone of your report and you need to consider it before writing even a single word.

Draft Out the Sections

The next thing you need to do is create a draft of your data analysis report. This is just a skeleton of what your report will be once you finish. But, you need a starting point.

So, think about the sections you'll include and what each section is going to cover. Typically, your report should be divided into the following sections:

  • Introduction
  • Body (Data, Methods, Analysis, Results)
  • Conclusion

For each section, write down several short bullet points regarding the content to cover. Below, we'll discuss each section more elaborately.

Develop The Body

The body of your report is the most important section. You need to organize it into subsections and present all the information your readers will be interested in.

We suggest the following subsections.

Data

Explain what data you used to conduct your analysis. Be specific and explain how you gathered the data, what your sample was, what tools and resources you’ve used, and how you’ve organized your data. This will give the reader a deeper understanding of your data sample and make your report more solid.

Also, explain why you choose the specific data for your sample. For instance, you may say “ The sample only includes data of the customers acquired during 2021, in the peak of the pandemic.”

Methods

Next, you need to explain what methods you’ve used to analyze the data. This simply means you need to explain why and how you chose specific methods. You also need to explain why these methods are the best fit for the goals you’ve set and the results you’re trying to reach.

Back up your methodology section with background information on each method or tool used. Explain how these resources are typically used in data analysis.

Analysis

After you’ve explained the data and methods you’ve used, this next section brings those two together. The analysis section shows how you’ve analyzed the specific data using the specific methods.

This means you’ll show your calculations, charts, and analyses, step by step. Add descriptions and explain each of the steps. Try making it as simple as possible so that even the most inexperienced of your readers understand every word.
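For instance, here is a minimal Python sketch of what "step by step" can look like in an analysis section, with hypothetical column names (region, revenue); the actual calculations will depend on your data and methods:

```python
# A sketch of step-by-step calculations for a report's analysis section.
# The dataset and column names (region, revenue) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East", "East"],
    "revenue": [120, 135, 98, 110, 150, 142],
})

# Step 1: descriptive statistics for the whole sample
print(df["revenue"].describe())

# Step 2: break the figure down by group, so readers can follow the logic
by_region = df.groupby("region")["revenue"].agg(["mean", "std", "count"])
print(by_region)

# Step 3: derive the quantity the report actually discusses
by_region["share_of_total"] = (
    df.groupby("region")["revenue"].sum() / df["revenue"].sum()
)
print(by_region)
```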

Results

This final section of the body can be considered the most important section of your report. Most of your clients will skim the rest of the report to reach this section, because it answers the questions the analysis set out to address. It shares the results that were reached and gives the reader new findings, facts, and evidence.

So, explain and describe the results using numbers. Then, add a written description of what each of the numbers stands for and what it means for the entire analysis. Summarize your results and finalize the report on a strong note. 

Write the Introduction

Yes, it may seem strange to write the introduction section at the end, but it’s the smartest way to do it. This section briefly explains what the report will cover. That’s why you should write it after you’ve finished writing the Body.

In your introduction, explain:

  • the question you’ve raised and answered with the analysis
  • the context of the analysis and background information
  • a short outline of the report

Simply put, you’re telling your audience what to expect.

Add a Short Conclusion

Finally, the last section of your paper is a brief conclusion. It restates what you described in the Body, but highlights only the most important details.

It should be less than a page long and use straightforward language to deliver the most important findings. It should also include a paragraph about the implications and importance of those findings for the client, customer, business, or company that hired you.

Include Data Visualization Elements

You have all the data and numbers in your mind and find it easy to understand what the data is saying. But, to a layman or someone less experienced than yourself, it can be quite a puzzle. All the information that your data analysis has found can create a mess in the head of your reader.

So, you should simplify it by using data visualization elements.

First, familiarize yourself with the most common and useful data visualization elements you can use in your report.

There are subcategories to each of these elements, and you should explore them all to decide which will do the best job for your specific case. For instance, you’ll find different types of charts, including pie charts, bar charts, area charts, and spider charts.

For each data visualization element, add a brief description to tell the readers what information it contains. You can also add a title to each element and create a table of contents for visual elements only.
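As an illustration, here is a minimal Python sketch of one such element, a bar chart with an indexable title and a one-line description; the values and labels are hypothetical:

```python
# One visualization element with a numbered title and a short description.
# Data values and labels are hypothetical.
import matplotlib.pyplot as plt

categories = ["Q1", "Q2", "Q3", "Q4"]
values = [23, 31, 28, 40]  # hypothetical quarterly figures

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(categories, values, color="steelblue")
ax.set_title("Figure 1. Quarterly results")      # title the reader can index
ax.set_xlabel("Quarter")
ax.set_ylabel("Units sold (thousands)")

# Brief description telling readers what information the chart contains
fig.text(0.5, -0.05, "Units sold per quarter; Q4 shows the strongest growth.",
         ha="center", fontsize=8)
fig.savefig("figure_1_quarterly_results.png", bbox_inches="tight")
```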

Proofread & Edit Before Submission

All the hard work you’ve invested in writing a good data analysis report might go to waste if you don’t edit and proofread. Proofreading and editing will help you eliminate potential mistakes, but also take another objective look at your report.

First, do the editing part. It includes:

  • reading the whole report objectively, as if you’re seeing it for the first time
  • keeping an open mind about changes
  • adding or removing information
  • rearranging sections
  • finding better words to say something

You should repeat the editing phase a couple of times until you're completely happy with the result. Once you're certain the content is all tidied up, you can move on to the proofreading stage. It includes:

  • finding and removing grammar and spelling mistakes
  • rethinking vocabulary choices
  • improving clarity
  • improving readability

You can use an online proofreading tool to make things faster. If you really want professional help, Grab My Essay is a great choice. Their professional writers can edit and rewrite your entire report, to make sure it’s impeccable before submission.

Whatever you choose to do, proofread yourself or get some help with it, make sure your report is well-organized and completely error-free.

Benefits of Writing Well-Structured Data Analysis Reports

Yes, writing a good data analysis report is a lot of hard work. But once you understand how it can help you in different segments of your professional journey, you’ll be more motivated and willing to invest the time and effort it takes to learn how to do it well.

Below are the main benefits a data analysis report brings to the table.

Improved Collaboration

When you’re writing a data analysis report, you need to be aware more than one end user is going to use it. Whether it’s your employer, customer, or coworker - you need to make sure they’re all on the same page. And when you write a data analysis report that is easy to understand and learn from, you’re creating a bridge between all these people.

Simply, all of them are given accurate data they can rely on and you’re thus removing the potential misunderstandings that can happen in communication. This improves the overall collaboration level and makes everyone more open and helpful.

Increased Efficiency

People who are reading your data analysis report need the information it contains for some reason. They might use it to do their part of the job, to make decisions, or to report further to someone else. Either way, the better your report, the more efficiently they can work. And since you rely on those people as well, you’ll benefit from this increased productivity too.

Data tells a story about a business, project, or venture. It's able to show how well you've performed, what turned out to be a great move, and what needs to be reimagined. This means that a data analysis report provides valuable insight and measurable KPIs (key performance indicators) that you’re able to use to grow and develop. 

Clear Communication

Information is key regardless of the industry you're in or the type of business you're doing. Data analysis finds that information and proves its accuracy and importance. But, if those findings and the information itself aren't communicated clearly, it's like you haven't even found them.

This is why a data analysis report is crucial. It will present the information less technically and bring it closer to the readers.

Final Thoughts

As you can see, it takes some skill and a bit more practice to write a good data analysis report. But, all the effort you invest in writing it will be worth it once the results kick in. You’ll improve the communication between you and your clients, employers, or coworkers. People will be able to understand, rely on, and use the analysis you’ve conducted.

So, don’t be afraid and start writing your first data analysis report. Just follow the 7 steps we’ve listed and use a tool such as ProWebScraper to help you with website data analysis. You’ll be surprised when you see the result of your hard work.

Jessica Fender

Jessica Fender is a business analyst and a blogger. She writes about business and data analysis, networking in this sector, and acquiring new skills. Her goal is to provide fresh and accurate information that readers can apply instantly.


Importance of Data Analysis Essay

The data analysis process takes place after all the necessary information has been obtained and structured appropriately. This forms the basis for the initial stage of the process: primary data processing. It is important to analyze the results of each study as soon as possible after its completion, while the researcher’s memory can still supply details that, for some reason, were not recorded but are of interest for understanding the essence of the matter. When processing the collected data, it may turn out that they are either insufficient or contradictory and therefore do not provide grounds for final conclusions.

In this case, the study must be continued with the required additions. After collecting information from various sources, it is necessary to determine what exactly is needed for the initial analysis in accordance with the task at hand. In most cases, it is advisable to start processing by compiling tables (pivot tables) of the data obtained (Simplilearn, 2021). For both manual and computer processing, the initial data are most often entered into an original pivot table. Computer processing has recently become the predominant form of mathematical and statistical processing.
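As an illustration of this first step, here is a minimal pandas sketch that compiles a pivot table from raw records; the survey-style columns are hypothetical:

```python
# Compiling a pivot table from raw records as the first step of primary
# data processing. The survey-style columns are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "respondent": [1, 2, 3, 4, 5, 6],
    "group":      ["A", "A", "B", "B", "A", "B"],
    "item":       ["score", "score", "score", "score", "time", "time"],
    "value":      [4, 5, 3, 4, 12, 15],
})

# One row per group, one column per measured item, cell = mean value
pivot = raw.pivot_table(index="group", columns="item",
                        values="value", aggfunc="mean")
print(pivot)
```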

The second stage is mathematical data processing, which requires thorough preparation. In order to determine the methods of mathematical and statistical processing, it is first of all important to assess the nature of the distribution for all the parameters used. For parameters that are normally distributed or close to normal, parametric statistical methods can be used, which in many cases are more powerful than nonparametric ones (Ali & Bhaskar, 2016). The advantage of the latter is that they allow testing statistical hypotheses regardless of the shape of the distribution.
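A minimal sketch of this distribution check, assuming Python with SciPy and a simulated sample, might use the Shapiro-Wilk test:

```python
# Assessing whether a parameter is close to normally distributed before
# choosing parametric methods. The sample here is simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=200)  # hypothetical measurements

stat, p_value = stats.shapiro(sample)            # Shapiro-Wilk normality test
print(f"W = {stat:.3f}, p = {p_value:.3f}")

if p_value > 0.05:
    print("No evidence against normality: parametric methods are reasonable.")
else:
    print("Distribution deviates from normal: prefer nonparametric methods.")
```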

One of the most common tasks in data processing is assessing the reliability of differences between two or more series of values. There are a number of ways in mathematical statistics to solve it. The computer version of data processing has become the most widespread today. Many statistical applications have procedures for evaluating the differences between the parameters of the same sample or different samples (Tyagi, 2020). With fully computerized processing of the material, it is not difficult to use the appropriate procedure at the right time and assess the differences of interest.
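As a sketch of such a procedure, the following compares two simulated samples with both a parametric t-test and its nonparametric counterpart; the data are illustrative only:

```python
# Assessing the reliability of a difference between two series of values:
# a parametric t-test and a nonparametric Mann-Whitney U test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(100, 15, size=60)   # simulated sample A
group_b = rng.normal(108, 15, size=60)   # simulated sample B

t_stat, t_p = stats.ttest_ind(group_a, group_b)        # parametric
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)     # nonparametric
print(f"t-test:        t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney:  U = {u_stat:.1f}, p = {u_p:.4f}")
```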

The following stage may be called the formulation of conclusions. Conclusions are statements expressing, in concise form, the meaningful results of the study; in thesis form, they reflect the new findings obtained by the author. A common mistake is for the author to include in the conclusions propositions that are generally accepted in science and no longer need proof. The conclusions should respond, in a definite way, to each of the objectives listed in the introduction.

The format for presenting the results after completing the analysis is of no small importance (Tyagi, 2020). The main content needs to be translated into an easy-to-read format based on the audience’s requirements. At the same time, you should provide easy access to additional background data for those who are interested and want to understand the topic more thoroughly. These basic rules apply regardless of the format in which the information is presented.

In order to successfully solve this problem, special methods of analysis and information processing are required. Classical information technologies make it possible to efficiently store, structure, and quickly retrieve information in a user-friendly form. The main strength of SPSS Statistics is that it provides a vast range of instruments for statistical work (Allen et al., 2014). For all the complexity of modern methods of statistical analysis, which draw on the latest achievements of mathematical science, the SPSS program allows one to focus on the peculiarities of their application in each specific case. The program’s capabilities significantly exceed the scope of functions provided by standard business programs such as Excel.

The SPSS program provides the user with ample opportunities for statistical processing of experimental data, for the formation of databases (SPSS data files), for their modification. SPSS may be considered a complex and flexible statistical analysis tool (Allen et al., 2014). SPSS can take data from virtually any file type and use it to create tabular reports, graphs and distribution maps, descriptive statistics, and sophisticated statistical analysis.

At this point, it seems reasonable to define the sequence of the analysis using the SPSS tools. First, it is essential to draw up a questionnaire with the questions necessary for the researcher. Next, a survey is carried out. To process the received data, you need to draw up a coding table. The coding table establishes the correspondence between individual questions of the questionnaire and the variables used in computer data processing (Allen et al., 2014). This solves two tasks: first, a correspondence is established between the individual questions of the questionnaire and the variables; second, a correspondence is established between the possible values of the variables and code numbers.

Next, one needs to enter the data into the data editor according to the defined variables. After that, depending on the task, it is necessary to select the desired function and chart. Then, you should analyze the resulting tabular output. All the necessary statistical functions that will be directly used in exploring and analyzing data are located in the Analysis menu. A very important analysis can be done with multiple responses; it is called the dichotomous method. This approach is used in cases where the questionnaire invites respondents to mark several answer options for a single question (Allen et al., 2014).
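SPSS performs this through its own menus and multiple-response sets; as a rough analog, the following pandas sketch applies the dichotomous method to a hypothetical multiple-response question, where each answer option is coded as a 0/1 variable:

```python
# A rough pandas analog of the dichotomous method for multiple responses:
# each answer option becomes a 0/1 variable. Options are hypothetical.
import pandas as pd

# One column per answer option of a "Which sources do you use?" question
responses = pd.DataFrame({
    "uses_tv":       [1, 0, 1, 1, 0],
    "uses_internet": [1, 1, 1, 0, 1],
    "uses_radio":    [0, 0, 1, 0, 0],
})

counts = responses.sum()                        # how many picked each option
pct_of_respondents = counts / len(responses) * 100
pct_of_responses = counts / counts.sum() * 100

summary = pd.DataFrame({
    "count": counts,
    "% of respondents": pct_of_respondents.round(1),
    "% of responses": pct_of_responses.round(1),
})
print(summary)
```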

Comparison of the means of different samples is one of the most commonly used methods of statistical analysis. In this case, it must always be clarified whether an observed difference in mean values can be explained by statistical fluctuation or not. This method seems appropriate here, as the study will involve participants from all over the state, and their responses will need to be compared.
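When more than two groups are involved, a one-way ANOVA is the standard extension of this comparison. A minimal sketch with simulated groups (the group names and values are hypothetical):

```python
# Comparing mean responses across more than two groups with one-way ANOVA.
# Group data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
north = rng.normal(3.8, 0.6, size=40)   # hypothetical survey scores
south = rng.normal(3.5, 0.6, size=40)
coast = rng.normal(4.1, 0.6, size=40)

f_stat, p_value = stats.f_oneway(north, south, coast)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p suggests the group means differ beyond statistical fluctuation.
```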

It should be stressed that SPSS is the most widely used statistical software. The main advantage of the SPSS software package, as one of the most advanced attainments in the area of automated data analysis, is its broad coverage of modern statistical approaches, successfully combined with a large number of convenient visualization tools for processing results (Allen et al., 2014). The latest version offers notable possibilities not only within the scope of psychology, sociology, and biology but also in the field of medicine, which is crucial for the aims of the future research. This greatly expands the applicability of the package, which will serve as a significant basis for ensuring the validity of the study.

Ali, Z., & Bhaskar, S. B. (2016). Basic statistical tools in research and data analysis. Indian Journal of Anaesthesia, 60(9), 662–669.

Allen, P., Bennett, K., & Heritage, B. (2014). SPSS Statistics version 22: A practical guide. Cengage.

Simplilearn. (2021). What is data analysis: Methods, process and types explained. Web.

Tyagi, N. (2020). Introduction to statistical data analysis. Analytic Steps. Web.


Introductory essay

Written by the educators who created Visualizing Data, a brief look at the key facts, tough questions and big ideas in their field. Begin this TED Study with a fascinating read that gives context and clarity to the material.

The reality of today

All of us now are being blasted by information design. It's being poured into our eyes through the Web, and we're all visualizers now; we're all demanding a visual aspect to our information...And if you're navigating a dense information jungle, coming across a beautiful graphic or a lovely data visualization, it's a relief, it's like coming across a clearing in the jungle. David McCandless

In today's complex 'information jungle,' David McCandless observes that "Data is the new soil." McCandless, a data journalist and information designer, celebrates data as a ubiquitous resource providing a fertile and creative medium from which new ideas and understanding can grow. McCandless's inspiration, statistician Hans Rosling, builds on this idea in his own TEDTalk with his compelling image of flowers growing out of data/soil. These 'flowers' represent the many insights that can be gleaned from effective visualization of data.

We're just learning how to till this soil and make sense of the mountains of data constantly being generated. As Gary King, Director of Harvard's Institute for Quantitative Social Science, says in the New York Times article "The Age of Big Data":

It's a revolution. We're really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.

How do we deal with all this data without getting information overload? How do we use data to gain real insight into the world? Finding ways to pull interesting information out of data can be very rewarding, both personally and professionally. The managing editor of the Financial Times observed on CNN's Your Money: "The people who are able to in a sophisticated and practical way analyze that data are going to have terrific jobs." Those who learn how to present data in effective ways will be valuable in every field.

Many people, when they think of data, think of tables filled with numbers. But this long-held notion is eroding. Today, we're generating streams of data that are often too complex to be presented in a simple "table." In his TEDTalk, Blaise Aguera y Arcas explores images as data, while Deb Roy uses audio, video, and the text messages in social media as data.

Some may also think that only a few specialized professionals can draw insights from data. When we look at data in the right way, however, the results can be fun, insightful, even whimsical — and accessible to everyone! Who knew, for example, that there are more relationship break-ups on Monday than on any other day of the week, or that the most break-ups (at least those discussed on Facebook) occur in mid-December? David McCandless discovered this by analyzing thousands of Facebook status updates.

Data, data, everywhere

There is more data available to us now than we can possibly process. Every minute, Internet users add the following to the big data pool (i):

  • 204,166,667 email messages sent
  • More than 2,000,000 Google searches
  • 684,478 pieces of content added on Facebook
  • $272,070 spent by consumers via online shopping
  • More than 100,000 tweets on Twitter
  • 47,000 app downloads from Apple
  • 34,722 "likes" on Facebook for different brands and organizations
  • 27,778 new posts on Tumblr blogs
  • 3,600 new photos on Instagram
  • 3,125 new photos on Flickr
  • 2,083 check-ins on Foursquare
  • 571 new websites created
  • 347 new blog posts published on Wordpress
  • 217 new mobile web users
  • 48 hours of new video on YouTube

These numbers are almost certainly higher now, as you read this. And this just describes a small piece of the data being generated and stored by humanity. We're all leaving data trails — not just on the Internet, but in everything we do. This includes reams of financial data (from credit cards, businesses, and Wall Street), demographic data on the world's populations, meteorological data on weather and the environment, retail sales data that records everything we buy, nutritional data on food and restaurants, sports data of all types, and so on.

Governments are using data to search for terrorist plots, retailers are using it to maximize marketing strategies, and health organizations are using it to track outbreaks of the flu. But did you ever think of collecting data on every minute of your child's life? That's precisely what Deb Roy did. He recorded 90,000 hours of video and 140,000 hours of audio during his son's first years. That's a lot of data! He and his colleagues are using the data to understand how children learn language, and they're now extending this work to analyze publicly available conversations on social media, allowing them to take "the real-time pulse of a nation."

Data can provide us with new and deeper insight into our world. It can help break stereotypes and build understanding. But the sheer quantity of data, even in just any one small area of interest, is overwhelming. How can we make sense of some of this data in an insightful way?

The power of visualizing data

Visualization can help transform these mountains of data into meaningful information. In his TEDTalk, David McCandless comments that the sense of sight has by far the fastest and biggest bandwidth of any of the five senses. Indeed, about 80% of the information we take in is by eye. Data that seems impenetrable can come alive if presented well in a picture, graph, or even a movie. Hans Rosling tells us that "Students get very excited — and policy-makers and the corporate sector — when they can see the data."

It makes sense that, if we can effectively display data visually, we can make it accessible and understandable to more people. Should we worry, however, that by condensing data into a graph, we are simplifying too much and losing some of the important features of the data? Let's look at a fascinating study conducted by researchers Emre Soyer and Robin Hogarth. The study was conducted on economists, who are certainly no strangers to statistical analysis. Three groups of economists were asked the same question concerning a dataset:

  • One group was given the data and a standard statistical analysis of the data; 72% of these economists got the answer wrong.
  • Another group was given the data, the statistical analysis, and a graph; still 61% of these economists got the answer wrong.
  • A third group was given only the graph, and only 3% got the answer wrong.

Visualizing data can sometimes be less misleading than using the raw numbers and statistics!
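Anscombe's quartet makes the same point with a different dataset: its series share nearly identical summary statistics yet look completely different when plotted. A short sketch using two of the published quartet sets:

```python
# Anscombe's classic quartet (two of its four sets): identical summary
# statistics, very different data. Published Anscombe values, not the
# Soyer-Hogarth study's dataset.
import numpy as np
import matplotlib.pyplot as plt

x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in (("I", y1), ("II", y2)):
    r = np.corrcoef(x, y)[0, 1]
    print(f"Set {name}: mean(y)={y.mean():.2f}, var(y)={y.var(ddof=1):.2f}, r={r:.3f}")

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
for ax, y, title in zip(axes, (y1, y2), ("Set I (linear)", "Set II (curved)")):
    ax.scatter(x, y)
    ax.set_title(title)
plt.show()  # near-identical statistics, visibly different shapes
```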

What about all the rest of us, who may not be professional economists or statisticians? Nathalie Miebach finds that making art out of data allows people an alternative entry into science. She transforms mountains of weather data into tactile physical structures and musical scores, adding both touch and hearing to the sense of sight to build even greater understanding of data.

Another artist, Chris Jordan, is concerned about our ability to comprehend big numbers. As citizens of an ever-more connected global world, we have an increased need to get useable information from big data — big in terms of the volume of numbers as well as their size. Jordan's art is designed to help us process such numbers, especially numbers that relate to issues of addiction and waste. For example, Jordan notes that the United States has the largest percentage of its population in prison of any country on earth: 2.3 million people were in prison in the United States in 2005, and the number continues to rise. Jordan uses art, in this case a super-sized image of 2.3 million prison jumpsuits, to help us see that number and to help us begin to process the societal implications of that single data value. Because our brains can't truly process such a large number, his artwork makes it real.

The role of technology in visualizing data

The TEDTalks in this collection depend to varying degrees on sophisticated technology to gather, store, process, and display data. Handling massive amounts of data (e.g., David McCandless tracking 10,000 changes in Facebook status, Blaise Aguera y Arcas synching thousands of online images of the Notre Dame Cathedral, or Deb Roy searching for individual words in 90,000 hours of video tape) requires cutting-edge computing tools that have been developed specifically to address the challenges of big data. The ability to manipulate color, size, location, motion, and sound to discover and display important features of data in a way that makes it readily accessible to ordinary humans is a challenging task that depends heavily on increasingly sophisticated technology.

The importance of good visualization

There are good ways and bad ways of presenting data. Many examples of outstanding presentations of data are shown in the TEDTalks. However, sometimes visualizations of data can be ineffective or downright misleading. For example, an inappropriate scale might make a relatively small difference look much more substantial than it should be, or an overly complicated display might obfuscate the main relationships in the data. Statistician Kaiser Fung's blog Junk Charts offers many examples of poor representations of data (and some good ones) with descriptions to help the reader understand what makes a graph effective or ineffective. For more examples of both good and bad representations of data, see data visualization architect Andy Kirk's blog at visualisingdata.com. Both consistently have very current examples from up-to-date sources and events.

Creativity, even artistic ability, helps us see data in new ways. Magic happens when interesting data meets effective design: when statistician meets designer (sometimes within the same person). We are fortunate to live in a time when interactive and animated graphs are becoming commonplace, and these tools can be incredibly powerful. Other times, simpler graphs might be more effective. The key is to present data in a way that is visually appealing while allowing the data to speak for itself.

Changing perceptions through data

While graphs and charts can lead to misunderstandings, there is ultimately "truth in numbers." As Steven Levitt and Stephen Dubner say in Freakonomics , "[T]eachers and criminals and real-estate agents may lie, and politicians, and even C.I.A. analysts. But numbers don't." Indeed, consideration of data can often be the easiest way to glean objective insights. Again from Freakonomics : "There is nothing like the sheer power of numbers to scrub away layers of confusion and contradiction."

Data can help us understand the world as it is, not as we believe it to be. As Hans Rosling demonstrates, it's often not ignorance but our preconceived ideas that get in the way of understanding the world as it is. Publicly-available statistics can reshape our world view: Rosling encourages us to "let the dataset change your mindset."

Chris Jordan's powerful images of waste and addiction make us face, rather than deny, the facts. It's easy to hear and then ignore that we use and discard 1 million plastic cups every 6 hours on airline flights alone. When we're confronted with his powerful image, we engage with that fact on an entirely different level (and may never see airline plastic cups in the same way again).

The ability to see data expands our perceptions of the world in ways that we're just beginning to understand. Computer simulations allow us to see how diseases spread, how forest fires might be contained, how terror networks communicate. We gain understanding of these things in ways that were unimaginable only a few decades ago. When Blaise Aguera y Arcas demonstrates Photosynth, we feel as if we're looking at the future. By linking together user-contributed digital images culled from all over the Internet, he creates navigable "immensely rich virtual models of every interesting part of the earth" created from the collective memory of all of us. Deb Roy does somewhat the same thing with language, pulling in publicly available social media feeds to analyze national and global conversation trends.

Roy sums it up with these powerful words: "What's emerging is an ability to see new social structures and dynamics that have previously not been seen. ...The implications here are profound, whether it's for science, for commerce, for government, or perhaps most of all, for us as individuals."

Let's begin with the TEDTalk from David McCandless, a self-described "data detective" who describes how to highlight hidden patterns in data through its artful representation.


David McCandless

The beauty of data visualization.

i. Data obtained June 2012 from “How Much Data Is Created Every Minute?” on http://mashable.com/2012/06/22/data-created-every-minute/ .

Relevant talks


Hans Rosling

The magic washing machine.


Nathalie Miebach

Art made of storms.


Chris Jordan

Turning powerful stats into art.


Blaise Agüera y Arcas

How Photosynth can connect the world's images.


Deb Roy

The birth of a word.

Data Analysis: Recently Published Documents


Introduce a Survival Model with Spatial Skew Gaussian Random Effects and its Application in Covid-19 Data Analysis

Futuristic Prediction of Missing Value Imputation Methods Using Extended ANN

Missing data is a universal problem across most research fields, and it introduces uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simply gaps in the study. The nutrition domain is no exception to the problem of missing data. Most frequently, the problem is addressed by imputing means or medians computed from the existing dataset, an approach that leaves room for improvement. This paper proposes a hybrid scheme combining MICE and an ANN, called extended ANN, to locate missing values and impute them in a given dataset. The proposed mechanism efficiently analyzes blank entries and fills them by examining neighboring records, improving the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against several recent algorithms and mechanisms in terms of efficiency and the accuracy of the results.
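As a minimal sketch of the MICE step in such a hybrid scheme, assuming a small hypothetical numeric dataset, scikit-learn's IterativeImputer performs MICE-style round-robin regression imputation (the ANN extension described in the abstract is not reproduced here):

```python
# MICE-style iterative imputation sketch. The data matrix is hypothetical;
# IterativeImputer stands in for the MICE step of the abstract's hybrid
# MICE + ANN scheme (the ANN part is not reproduced).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [7.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [10.0, 5.0, 9.0],
    [np.nan, 8.0, 12.0],
])

# Each feature with missing values is regressed on the other features,
# cycling through features for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```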

Applications of multivariate data analysis in shelf life studies of edible vegetal oils – A review of the past few years

Hypothesis Formalization: Empirical Findings, Software Limitations, and Design Implications

Data analysis requires translating higher level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization . In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation and discuss implications for future tools.
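To make the formalization step concrete, here is a minimal sketch, assuming an invented hypothesis and dataset, of translating a verbal hypothesis ("study time predicts exam score, controlling for prior GPA") into a computable statistical model via statsmodels' formula API; this illustrates the general notion of hypothesis formalization, not the authors' own tooling:

```python
# Hypothesis formalization sketch: a verbal hypothesis becomes a formula.
# All variable names and data values below are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score":     [62, 71, 78, 85, 90, 66, 74, 88],
    "study_hrs": [2, 4, 5, 7, 9, 3, 5, 8],
    "prior_gpa": [2.8, 3.0, 3.2, 3.5, 3.8, 2.9, 3.1, 3.6],
})

# The formula is the formalized hypothesis: score as a linear function of
# study hours, with prior GPA included as a control (proxy) variable.
model = smf.ols("score ~ study_hrs + prior_gpa", data=df).fit()
print(model.summary())
```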

The Complexity and Expressive Power of Limit Datalog

Motivated by applications in declarative data analysis, in this article, we study Datalog_Z, an extension of Datalog with stratified negation and arithmetic functions over integers. This language is known to be undecidable, so we present the fragment of limit Datalog_Z programs, which is powerful enough to naturally capture many important data analysis tasks. In limit Datalog_Z, all intensional predicates with a numeric argument are limit predicates that keep maximal or minimal bounds on numeric values. We show that reasoning in limit Datalog_Z is decidable if a linearity condition restricting the use of multiplication is satisfied. In particular, limit-linear Datalog_Z is complete for Δ₂EXP and captures Δ₂P over ordered datasets in the sense of descriptive complexity. We also provide a comprehensive study of several fragments of limit-linear Datalog_Z. We show that semi-positive limit-linear programs (i.e., programs where negation is allowed only in front of extensional atoms) capture coNP over ordered datasets; furthermore, reasoning becomes coNEXP-complete in combined and coNP-complete in data complexity, where the lower bounds hold already for negation-free programs. In order to satisfy the requirements of data-intensive applications, we also propose an additional stability requirement, which causes the complexity of reasoning to drop to EXP in combined and to P in data complexity, thus obtaining the same bounds as for usual Datalog. Finally, we compare our formalisms with the languages underpinning existing Datalog-based approaches for data analysis and show that core fragments of these languages can be encoded as limit programs; this allows us to transfer decidability and complexity upper bounds from limit programs to other formalisms. Therefore, our article provides a unified logical framework for declarative data analysis, which can be used as a basis for understanding the impact on expressive power and computational complexity of the key constructs available in existing languages.

An Empirical Study on Cross-Border E-commerce Talent Cultivation Based on Skill Gap Theory and Big Data Analysis

To resolve the mismatch between the increasing demand for cross-border e-commerce talent and students' incompatible skill levels, Industry-University-Research cooperation, an essential pillar of the interdisciplinary talent cultivation model adopted by colleges and universities, draws synergy from the relevant parties and builds a bridge between knowledge and practice. Nevertheless, industry-university-research cooperation developed late in the cross-border e-commerce field and faces several problems, such as unstable collaboration relationships and vague training plans.

The Effects of Cross-border e-Commerce Platforms on Transnational Digital Entrepreneurship

This research examines the important concept of transnational digital entrepreneurship (TDE). The paper integrates the host- and home-country entrepreneurial ecosystems with the digital ecosystem into a framework of the transnational digital entrepreneurial ecosystem. The authors argue that cross-border e-commerce platforms provide critical foundations in the digital entrepreneurial ecosystem. Entrepreneurs who rely on this ecosystem are defined as transnational digital entrepreneurs. Interview data from twelve Chinese immigrant entrepreneurs living in Australia and New Zealand were analyzed as case studies. The results of the data analysis reveal that cross-border entrepreneurs do in fact rely on the framework of the transnational digital ecosystem. Cross-border e-commerce platforms not only play a bridging role between home- and host-country ecosystems but also provide the entrepreneurial capital promised by the digital ecosystem.

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources

A Trajectory Evaluator by Sub-tracks for Detecting VOT-based Anomalous Trajectory

With the popularization of visual object tracking (VOT), more and more trajectory data are being collected, attracting widespread attention in fields such as mobile robotics and intelligent video surveillance. How to clean the anomalous trajectories hidden in this massive data has become a research hotspot: anomalous trajectories should be detected and removed before the trajectory data can be used effectively. In this article, a Trajectory Evaluator by Sub-tracks (TES) for detecting VOT-based anomalous trajectories is proposed. A Feature of Anomalousness is defined and used as the eigenvector of a classifier to filter Track-Let anomalous trajectories and IDentity-Switch anomalous trajectories; it comprises a Feature of Anomalous Pose and a Feature of Anomalous Sub-tracks (FAS). In comparative experiments, TES achieves better results across different scenes than state-of-the-art methods. Moreover, FAS performs better than point flow, least-squares fitting, and Chebyshev polynomial fitting. The results verify that TES is more accurate and effective and is well suited to sub-track trajectory data analysis.



Research of Data Analysis and Different Types of Analysis

  • Categories: Data Analysis, Data Mining

Published: Sep 20, 2018 | Words: 1171 | Pages: 2 | 6 min read

Table of contents

  • Introduction
  • Types of Analysis
  • Data Requirements
  • Data Processing
  • Data Cleaning

  • To think in terms of significant tables that the data permit.
  • To examine carefully the statement of the problem and earlier analyses, and to study the original records of the data.
  • To get away from the data and think about the problem in layman's terms, or to actually discuss the problem with others.
  • To attack the data by making various statistical calculations (a minimal example of this starting point is sketched below).

Any of these approaches can be used to start the analysis of data. The data analysis strategy is influenced by factors such as the type of data, the research design, the researcher's qualifications, and the assumptions underlying a statistical technique.
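As a minimal sketch of the "statistical calculations" approach, assuming a small hypothetical dataset, one might begin with simple descriptive summaries before committing to a formal technique:

```python
# Descriptive-statistics starting point. The dataset is hypothetical;
# pandas provides the summaries.
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "value": [3.1, 2.7, 5.6, 6.0, 5.2],
})

print(df["value"].describe())                                  # count, mean, std, quartiles
print(df.groupby("group")["value"].agg(["mean", "median", "std"]))  # per-group summaries
```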





Multilevel Functional Data Analysis Modeling of Human Glucose Response to Meal Intake

23 May 2024 · Marcos Matabuena, Joe Sartini, Francisco Gude

Glucose meal response information collected via Continuous Glucose Monitoring (CGM) is relevant to the assessment of individual metabolic status and the support of personalized diet prescriptions. However, the complexity of the data produced by CGM monitors pushes the limits of existing analytic methods. CGM data often exhibits substantial within-person variability and has a natural multilevel structure. This research is motivated by the analysis of CGM data from individuals without diabetes in the AEGIS study. The dataset includes detailed information on meal timing and nutrition for each individual over different days. The primary focus of this study is to examine CGM glucose responses following patients' meals and explore the time-dependent associations with dietary and patient characteristics. Motivated by this problem, we propose a new analytical framework based on multilevel functional models, including a new functional mixed R-square coefficient. The use of these models illustrates 3 key points: (i) The importance of analyzing glucose responses across the entire functional domain when making diet recommendations; (ii) The differential metabolic responses between normoglycemic and prediabetic patients, particularly with regards to lipid intake; (iii) The importance of including random, person-level effects when modelling this scientific problem.
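The authors' multilevel functional models are beyond a short sketch, but the core ingredient, person-level random effects on repeated glucose responses, can be illustrated with a plain random-intercept mixed model on simulated data. Everything here (variable names, effect sizes, the CGM-style outcome) is invented for illustration and is not the paper's actual model:

```python
# Random-intercept mixed model sketch on simulated, hypothetical CGM-style
# data: repeated meal responses per subject, with a person-level shift.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_meals = 20, 8
subj = np.repeat(np.arange(n_subj), n_meals)
lipid = rng.uniform(0, 30, n_subj * n_meals)         # grams of lipid per meal
subj_effect = rng.normal(0, 8, n_subj)[subj]         # person-level shift
peak = 120 + 0.9 * lipid + subj_effect + rng.normal(0, 5, subj.size)

df = pd.DataFrame({"subject": subj, "lipid": lipid, "peak": peak})

# A random intercept per subject captures within-person correlation of meals,
# mirroring the paper's emphasis on person-level random effects.
model = smf.mixedlm("peak ~ lipid", df, groups=df["subject"]).fit()
print(model.summary())
```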


Essays articulate a specific perspective on a topic of broad interest to scientists.


Integrating phylogenies into single-cell RNA sequencing analysis allows comparisons across species, genes, and cells

  • Samuel H. Church,
  • Jasmine L. Mah,
  • Casey W. Dunn

Affiliation: Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America

Published: May 24, 2024

  • https://doi.org/10.1371/journal.pbio.3002633

Comparisons of single-cell RNA sequencing (scRNA-seq) data across species can reveal links between cellular gene expression and the evolution of cell functions, features, and phenotypes. These comparisons evoke evolutionary histories, as depicted by phylogenetic trees, that define relationships between species, genes, and cells. This Essay considers each of these in turn, laying out challenges and solutions derived from a phylogenetic comparative approach and relating these solutions to previously proposed methods for the pairwise alignment of cellular dimensional maps. This Essay contends that species trees, gene trees, cell phylogenies, and cell lineages can all be reconciled as descriptions of the same concept—the tree of cellular life. By integrating phylogenetic approaches into scRNA-seq analyses, challenges for building informed comparisons across species can be overcome, and hypotheses about gene and cell evolution can be robustly tested.

Citation: Church SH, Mah JL, Dunn CW (2024) Integrating phylogenies into single-cell RNA sequencing analysis allows comparisons across species, genes, and cells. PLoS Biol 22(5): e3002633. https://doi.org/10.1371/journal.pbio.3002633

Copyright: © 2024 Church et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: SHC was funded in part by National Science Foundation, https://nsf.org , grant 2109502 and by the Yale Institute of Biospheric Studies, https://yibs.yale.edu . The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Single-cell RNA sequencing (scRNA-seq) generates high-dimensional gene expression data from thousands of cells from an organ, tissue, or body [ 1 ]. Single-cell expression data are increasingly common, with new animal cell atlases being released every year [ 2 – 6 ]. The next steps will be to compare such atlases across species [ 2 ], identifying the dimensions in which these results differ and associating these differences with other features of interest [ 7 ]. Because all cross-species comparisons are inherently evolutionary comparisons, such analyses present an opportunity to integrate approaches from the field of evolutionary biology, and especially phylogenetic biology [ 8 ]. Drawing concepts, models, and methods from these fields will help to overcome central challenges with comparative scRNA-seq analysis, especially in how to draw coherent comparisons over thousands of genes and cells across species. In addition, this synthesis of concepts will help avoid the unnecessary reinvention of analytical methods that have already been rigorously tested in evolutionary biology for other types of data, such as morphological and molecular data.

Comparative gene expression analysis has been used for decades to answer evolutionary questions such as how changes in gene expression are associated with the evolution of novel functions and phenotypes [ 9 ]. The introduction of scRNA-seq technology has led to a massive increase in the scale of these experiments [ 1 ], from working with a few genes or a few tissues, to assays that cover the entire transcriptome, across thousands of cells in a dissociation experiment. Comparative scRNA-seq analysis therefore enables evolutionary questions to be scaled up, for example: how has the genetic basis of differentiation evolved across cell populations and over time; what kinds of cells and gene expression patterns were likely present in the most recent common ancestor; what changes in cell transcriptomes are associated with the evolution of new ecologies, life-histories, or other features; how much variation in cellular gene expression do we observe over evolutionary time; which changes in gene expression are significant (i.e., larger or smaller than we expect by chance); which genes show patterns of correlated expression evolution; and can evolutionary screens detect novel interactions between genes?

In comparative scRNA-seq studies, the results of individual experiments are analyzed across species. These scRNA-seq experiments usually generate matrices of count data with measurements along 2 axes: cells and genes ( Fig 1 ). Comparative scRNA-seq analysis adds a third axis: species. At first glance, it might make sense to try and align scRNA-seq matrices across species, thereby creating a 3D tensor of cellular gene expression. But neither genes nor cells are expected to share a one-to-one correspondence across species. In the case of genes, gene duplication (leading to paralogous relationships) and gene loss are rampant [ 10 ]. In the case of cells, there is rarely justification for equating 2 individual cells across species; instead, populations of cells (“cell types”) are typically compared [ 11 ]. Therefore to align matrices, an appropriate system of grouping both dimensions must first be found. This is essentially a question of homology [ 12 ]: which genes and cell types are homologous, based on their relationship to predicted genes and cell types in the common ancestor.

Fig 1.

scRNA-seq experiments generate count matrices, shown here with columns as cells and rows as genes. Higher expression counts for a given gene in a given cell are depicted with darker shading. In an idealized comparison, count matrices across species would be aligned to form a 3D tensor of expression across cells, genes, and species. In reality, there is no expectation of one-to-one correspondence or independence for any of the 3 axes. Instead, relationships between species, genes, and cells are described by their respective evolutionary histories, as depicted with phylogenies.

https://doi.org/10.1371/journal.pbio.3002633.g001

Questions about homology can be answered using phylogenies [ 12 ]. Species relationships are defined by their shared ancestry, as depicted using a phylogeny of speciation events ( Fig 2 ). Gene homology is also defined by shared ancestry, depicted using gene trees that contain nodes corresponding to either speciation and gene duplication events. Cell homology inference requires assessing the evolutionary relationships between cell types [ 12 , 13 ], defined here as populations of cells related via the process of cellular differentiation and distinguishable from one another (e.g., by using molecular markers) [ 14 ]. Relationships between cell types can be represented with cell phylogenies that, like gene trees, contain both speciation and duplication nodes [ 13 ]. As with genes, the evolutionary relationships between cell types may be complex, as differentiation trajectories drift, split, or are lost over evolutionary time [ 7 , 13 , 15 ].

Fig 2.

Species phylogenies contain speciation events as nodes in a bifurcating tree. Gene phylogenies contain both gene duplication events (black hexagons) and speciation events (unmarked) at nodes. Cell phylogenies also include both speciation and duplication events; here, duplication events represent a split in the program of cellular development that leads to differentiated cell types [ 13 ]. Branches from the species phylogeny (numbered branches) can be found within gene and cell phylogenies. Note that gene families are strictly defined by ancestry, but cell types have historically been defined by form, function, or patterns of gene expression [ 15 ]. This means that groups of cells identified as the same “type” across species may reflect paraphyletic groups [ 11 ], as depicted in the second cell type in this tree.

https://doi.org/10.1371/journal.pbio.3002633.g002

In this Essay, we illustrate a tree-based framework for comparing scRNA-seq data and contrast this framework with existing methods. We describe how we can use trees to identify homologous and comparable groups of genes and cells, based on their predicted relationship to genes and cells present in the common ancestor. We advocate for mapping data to branches of phylogenetic trees to test hypotheses about the evolution of cellular gene expression, describing the kinds of data that can be compared and the types of questions that each comparison has the potential to address. Finally, we reconcile species phylogenies, gene phylogenies, cell phylogenies, and cell lineages as different representations of the same concept—the tree of cellular life.

Comparisons across species

Shared ancestry between species will impact the results of all cross-species analyses and should therefore influence expectations and interpretations [ 16 ]. For scRNA-seq data, this has several implications. First, species are expected to be different from one another, given that they have continued evolving since diverging from their common ancestor. Therefore, by default, many differences in cellular gene expression are expected across the thousands of measurements in an scRNA-seq dataset. Second, the degree of difference is expected to correlate with time since the last common ancestor. The null expectation is that closely related species will have more cell types in common, and that those cells will have more similar patterns of gene expression than more distantly related species. The structure of this similarity can be approximated with a species phylogeny calibrated to time.

Methods for the evolutionary comparison of scRNA-seq data have already been proposed in packages such as SAMap [ 7 ]. These packages have overcome significant challenges, such as how to account for non-orthologous genes (see the section Comparisons across genes). However, up to now these methods have relied on pairwise comparisons of species, rather than phylogenetic relationships. The problems with pairwise comparisons have been well-described elsewhere [ 17 ]; briefly, they result in pseudo-replication of evolutionary events. This pseudo-replication is of increasing concern as comparisons are drawn across a greater number of taxa and across more closely related species. By contrast, an evolutionary comparative approach maps evolutionary changes to branches in the phylogeny [ 8 , 10 ]. With this approach, data are assigned to the tips of a tree, and ancestral states are reconstructed using an evolutionary model. Evolutionary changes are then calculated as differences between ancestral and descendant states, and the distribution of evolutionary changes along branches are analyzed and compared [ 18 ].
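To illustrate the branch-mapping idea, the following is a deliberately naive sketch: the tree, the tip expression values, and the rule of averaging descendant states to estimate ancestral states are all simplifying assumptions (real analyses use maximum-likelihood or Bayesian models of trait evolution, such as Brownian motion). What it shows is the core move of assigning evolutionary changes to branches rather than to pairs of species:

```python
# Naive branch-mapping sketch. The tree topology and tip values are
# hypothetical; ancestral states are crude post-order averages.

# Tree as (name, children); tips carry an expression value.
tree = ("root", [
    ("anc1", [("speciesA", []), ("speciesB", [])]),
    ("speciesC", []),
])
tip_expr = {"speciesA": 5.0, "speciesB": 7.0, "speciesC": 2.0}

def reconstruct(node, states):
    """Post-order pass: ancestral state = mean of child states (a crude
    stand-in for model-based ancestral state reconstruction)."""
    name, children = node
    if not children:
        states[name] = tip_expr[name]
    else:
        for child in children:
            reconstruct(child, states)
        states[name] = sum(states[c[0]] for c in children) / len(children)
    return states

def branch_changes(node, states, changes):
    """Change on each branch = descendant state minus ancestral state."""
    name, children = node
    for child in children:
        changes[(name, child[0])] = states[child[0]] - states[name]
        branch_changes(child, states, changes)
    return changes

states = reconstruct(tree, {})
print(states)                          # ancestral and tip states
print(branch_changes(tree, states, {}))  # per-branch evolutionary changes
```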

Shifting toward a phylogenetic approach to comparative scRNA-seq analysis unlocks new avenues of discovery, including tests of coevolution of cellular gene expression and other features of interest [ 9 ], as well as evolutionary screens for signatures of correlated gene and cell modules [ 19 ]. In phylogenetic analyses, statistical power depends on the number of independent evolutionary events rather than on the absolute number of taxa [ 8 ]. Therefore, the choice of which species to compare is critical, especially when comparisons can be constructed to capture potential convergence.

One consideration when comparing species is the degree to which the history of scientific study has favored certain organisms (e.g., model organisms) [ 20 ]. This is especially relevant to single-cell comparisons, as more information about cell and gene function is available for some species (e.g., mice and humans) than for others. This creates a risk of bias toward observing described biological phenomena, while missing the hidden biology in less well-studied organisms [ 20 ]. Consider the identification of “novel” cell types based on the absence of canonical marker genes: because most canonical marker genes were originally described in well-studied species, cell type definitions that rely on these will necessarily be less useful in the context of other species [ 2 ].

Technologies such as scRNA-seq have great potential to democratize the types of data collected [ 2 ]. For example, scRNA-seq allows all genes and thousands of cells to be assayed, rather than a curated list of candidates. To leverage this to full effect, researchers need to acknowledge the filtering steps in their analyses, including how orthologous gene sequences are identified and how cell types are labeled.

Comparisons across genes

Due to gene duplication and loss, there is usually not a one-to-one correspondence between genes across species [ 21 ]. Instead, evolutionary histories of genes are depicted using gene trees ( Fig 2 ). Pairs of tips in gene trees may be labeled as “orthologs” or “paralogs,” based on whether they descend from a node corresponding to a speciation or gene duplication event [ 22 ]. Gene duplication happens both at the individual gene level and in bulk, via whole or partial genome duplication [ 21 ]. Gene loss means that comparative scRNA-seq matrices may be sparse, not only due to a failure to detect a gene, but also because genes in one species often do not exist in another.

The authors of many cross-species comparisons have confronted the challenge of finding equivalent genes across species [ 23 ], and often start by restricting analyses to sets of one-to-one orthologs [ 24 ]. However, there are several problems with this approach [ 22 ]: one-to-one orthologs are only well-described for a small set of very well-annotated genomes [ 23 ]; the number of one-to-one orthologs decreases rapidly as species are added to the comparison, and as comparisons are made across deeper evolutionary distances [ 7 ]; and the subset of genes that can be described by one-to-one orthologs is not randomly drawn from across the genome, they are enriched for indispensable genes under single-copy control [ 25 ]. New tools like SAMap are expanding the analytical approach beyond one-to-one orthologs to the set of all homologs across species [ 7 ]. Homolog groups are identified with a clustering algorithm, by which genes are separated into groups with strong sequence or expression similarity. These may include more than 1 representative gene per species. Gene trees can then be inferred for these gene families, and duplication events mapped to individual nodes in the gene tree.

But how can cellular expression measures be compared across groups of homologous genes? One option is to use summary statistics, such as the sum or average expression per species for genes within a homology group [ 26 ]. However, these statistics might obscure or average over real biological variation in expression that arose subsequent to a duplication event (among paralogs) [ 19 ]. An alternative approach is to connect genes via a similarity matrix, and then make all-by-all comparisons that are weighted on the basis of putative homology [ 7 ]. A third approach is to reconstruct changes in cellular expression along gene trees, rather than along the species tree [ 10 , 27 ]. Here, evolutionary changes are associated with branches descending from either speciation or duplication events. Such an approach has been demonstrated for bulk RNA sequencing, in which gene trees were inferred from gene sequence data and cellular expression data were assigned to tips of a gene tree. In this approach, ancestral states and evolutionary changes are calculated and equivalent branches between trees are identified using “species branch filtering” [ 27 ]. Branches between speciation events can be unambiguously equated across trees based on the composition of their descendant tips (see numbered branches in Fig 2 ) and changes across equivalent branches of a cell tree analyzed (e.g., to identify significant changes, signatures of correlation).
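As a minimal sketch of the first option above (summary statistics per species within a homology group), assuming a hypothetical table of counts with invented gene and group labels:

```python
# Collapse expression to a per-species summary within each homology group.
# Gene names, group labels, and counts are hypothetical.
import pandas as pd

expr = pd.DataFrame({
    "species":  ["A", "A", "A", "B", "B"],
    "gene":     ["g1a", "g1b", "g2", "g1", "g2"],
    "homgroup": ["HG1", "HG1", "HG2", "HG1", "HG2"],  # HG1 has paralogs in A
    "counts":   [14, 6, 3, 18, 5],
})

# Sum (or mean) expression of all paralogs per species within each group.
summary = expr.groupby(["homgroup", "species"])["counts"].agg(["sum", "mean"])
print(summary)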

Mapping cellular gene expression data to branches of a gene tree sidesteps the problem of finding sets of orthologs by incorporating the history of gene duplication and loss into the analytical framework. One technical limitation is that the ability to accurately reconstruct gene trees depends on the phylogenetic signal of gene sequences, which in turn depends on the length of the gene, the mutation rate, and the evolutionary distance in question [ 28 ]. These dynamics are such that, for some genes, it may not be possible to robustly reconstruct the topology, although targeted taxon sampling can improve gene tree inference across a wider range of histories.

Comparisons across cells

As with genes, there is usually not an expectation of a one-to-one correspondence between cells across species. Individual cells can rarely be equated, with notable exceptions such as the zygote or the cells of certain eutelic species (which have a fixed number of cells). Instead, the homology of groups of cells (cell types) are usually considered, with the hypothesis being that the cell developmental programs that give rise to these groups are derived from a program present in the shared ancestor [ 11 ].

Similarly, a one-to-one correspondence between cell types across species is also not expected, as cell types may be gained or lost over evolutionary time. The relationships between cell types across species can be described using phylogenetic trees. These cell phylogenies are distinct from cell lineages (the bifurcating trees that describe cellular divisions within an individual developmental history). Nodes in cell lineages represent cell divisions, whereas nodes in cell phylogenies represent either speciation events or splits in differentiation programs that lead to novel cell types ( Fig 2 ). The evolutionary histories of cell types may not follow a strict bifurcating pattern of evolution, as elements of differentiation programs are mixed and combined. However, evidence from inference on sequence data shows that the majority of relationships between cell types can be represented as trees [ 15 ].

The term “cell type” has been used for several distinct concepts [ 15 ], including cells that are defined and distinguished by their position in a tissue, their form, function, or in the case of scRNA-seq data, their relative expression profiles, which fall into distinct clusters [ 14 ]. Homology of structures across species is often inferred using many of the same criteria: position, form, function, and gene expression patterns [ 29 ]. The fact that the same principles are used for inferring cell types and cell homologies presents both an opportunity and obstacles for comparative scRNA-seq analysis. The same methods that are used for identifying clusters of cells within species can potentially be leveraged to identify clusters of cells across species. This could be done simultaneously, inferring a joint cell atlas in a shared expression space [ 7 ], or it could be done individually for each species and subsequently merged [ 2 , 23 ]. In either case, this inference requires contending with the evolutionary histories between genes and species, described above.

One obstacle is that, because cell types are not typically defined according to evolutionary relationships [ 15 ], cells labeled as the same type across species may constitute paraphyletic groups [ 11 ]. A solution to this problem is to use methods for reconstructing evolutionary relationships to infer the cell tree [ 15 , 30 ] ( Fig 2 ). This method is distinct from an approach in which cell types are organized into a taxonomy on the basis of morphological or functional similarity [ 14 ]; instead, this approach uses an evolutionary model to infer the evolutionary history, including potential duplication and loss. It has the additional advantage of generating a tree, comparable to a species or gene tree, onto which cellular characters can be mapped and their evolution described [ 15 ]. Methods for inferring cell trees from expression data have been described in detail elsewhere [ 31 – 33 ]. Using this approach, cell trees are inferred (e.g., using expression of orthologous genes as characters in an evolutionary model) and gene expression data are assigned to the tips of the cell tree. Ancestral states and evolutionary changes are then calculated and changes along branches are analyzed (e.g., to identify changes in gene expression associated with the evolution of novel cell types).
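As a rough illustration only, the sketch below builds a tree-like grouping of cell types from expression distances using hierarchical clustering. This is a crude stand-in for the model-based phylogenetic inference cited above [31-33], and the profiles and labels are invented:

```python
# Tree-like grouping of cell types from expression distances. Hierarchical
# clustering here is only a stand-in for phylogenetic inference; profiles
# and labels are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Rows: cell types (across 2 species); columns: orthologous genes.
profiles = np.array([
    [9.0, 1.0, 0.5],   # neuron-like, species A
    [8.5, 1.2, 0.4],   # neuron-like, species B
    [0.8, 7.0, 6.5],   # muscle-like, species A
    [1.0, 6.5, 7.2],   # muscle-like, species B
])
labels = ["neuron_A", "neuron_B", "muscle_A", "muscle_B"]

dists = pdist(profiles, metric="euclidean")
tree = linkage(dists, method="average")   # UPGMA-style joining
print(dendrogram(tree, labels=labels, no_plot=True)["ivl"])  # leaf order
```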

As with genes, the ability to infer cell trees depends on the phylogenetic signal of cellular traits such as cellular gene expression profiles. Although the phylogenetic signal of expression data has been demonstrated in various contexts [ 32 – 35 ], certain cell types, such as cancer cells that follow a distinct mode of evolution, may exhibit less tree-like structures [ 34 ]. Species-specific effects and signals from correlated evolution may also obscure cell phylogenetic signals. Given the low-rank nature of cell gene expression, dimensional reduction techniques such as principal component analysis have been employed to extract and clarify phylogenetic signals [ 33 ]. Other complexities, such as naturally occurring instances of cellular reprogramming or transdifferentiation could also potentially obscure phylogenetic signals, although cellular identity is thought to be stable under most circumstances [ 36 ].

Another obstacle to comparing single-cell datasets is reported batch effects [ 26 ] across experiments, which may need to be accounted for via integration [ 23 ]. When considering these effects, it is critical to remember that the null expectation is that species are different from one another. Naive batch integration practices have no method for distinguishing technical effects from the real biological differences that are the target of study in comparative scRNA-seq analysis [ 23 ]. Other approaches (e.g., LIGER [ 37 ] or Seurat [ 38 ]) are reportedly able to distinguish and characterize species-specific differences [ 23 ]. Given that null hypotheses are still being developed [ 16 ] for how much variation in expression is expected to be observed across species [ 19 ], we hold that cross-species integration should be treated with caution until it is clear that the approach can robustly target and strictly remove technical batch effects.

A final obstacle is that cell identities and homologies may be more complex than can accurately be captured by categorization into discrete clusters or cell types, particularly when considering multiple cell states along a differentiation trajectory [ 14 , 15 ]. Single-cell experiments that include both progenitor and differentiated cells can reveal the limits of clustering algorithms [ 39 ]. In these experiments, there may or may not be obvious boundaries for distinguishing cell states. In cases where boundaries are arbitrary, the number of clusters, and therefore the abundance of cells within a cluster will depend on technical and not biological inputs, such as the resolution parameter that the user predetermines for the clustering algorithm. A solution here is to define homology for the differentiation trajectory, rather than for individual clusters of cells [ 26 ]. This can be accomplished by defining anchor points where trajectories overlap in the expression of homologous genes while allowing for trajectories to have drifted or split over evolutionary time, such that sections of the trajectories no longer overlap [ 15 ]. Cellular homologies within a trajectory may be more difficult to infer, as this requires contending with potential heterochronic changes to differentiation (e.g., as cell differentiation evolves, genes may be expressed relatively earlier or later in the process) [ 26 ].

Constructing comparisons of scRNA-seq data

Single-cell comparisons potentially draw on a broad range of phylogenetic comparative methods for different data types, including binary, discrete, continuous, and categorical data [ 40 ] ( Fig 3 ). The primary data structure of scRNA-seq is a matrix of integers, representing counts of transcripts or unique molecular identifiers for a given gene within a given cell [ 41 ]. In a typical scRNA-seq analysis, this count matrix is passed through a pipeline of normalization, transformation, dimensional reduction, and clustering [ 42 , 43 ]. The decisions of when during this pipeline to draw a comparison determines the data type, questions that can be addressed, and caveats that must be considered.

Fig 3.

Several types of scRNA-seq data could potentially be mapped onto a phylogeny. Nine types of data are shown, along with example questions that can be addressed and caveats to be considered.

https://doi.org/10.1371/journal.pbio.3002633.g003

Gene expression data

Unlike bulk RNA sequencing, where counts are typically distributed across a few to dozens of samples, scRNA-seq counts are distributed across thousands of cells. The result is that scRNA-seq count matrices are often shallow and sparse [ 44 ]. The vast majority of counts (often >95%) in standard scRNA-seq datasets are either 0, 1, or 2 [ 41 ]. These count values, representing the number of unique molecular identifiers that encode unique transcripts in cells, are discrete, low integer numbers, and not continuous measurements. The high dimensionality and sparse nature of single-cell data therefore present a unique challenge when considering cross-species comparisons [ 2 ].

In a standard scRNA-seq approach, expression values are analyzed after depth normalization and other transformations. With depth normalization, counts are converted from discrete, absolute measures to continuous, relative ones (although currently available instruments do not actually quantify relative expression). There is a growing concern that this, and other transformations, are inappropriate for the sparse and shallow sequencing data produced by scRNA-seq [ 45 , 46 ]. Further transformations of the data, such as log transformation or variance rescaling, introduce additional distortions that may obscure real biological differences between species.

Alternatively, counts can be compared across species directly, without normalization or transformation [ 41 ]. There are 2 potential drawbacks to this approach. First, count values are influenced by stochasticity due to the shallow nature of sequencing, resulting in uncertainty around integer values. Second, cells are not sequenced to a standard depth. Comparing raw counts does not take this heterogeneity into account, although this can be accomplished using a restricted algebra to analyze counts [ 41 ]. Another option is to transform count values to a binary or categorical trait [ 47 ]; for example, binning counts into “on” and “off” based on a threshold value and then modeling the evolution of these states on a tree. Analyzing expression as a binary or categorical trait eliminates some of the quantitative power of scRNA-seq, but still allows interesting questions about the evolution of expression dynamics within and across cell types to be addressed.
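A minimal sketch of the binary-trait option, with a hypothetical count matrix and an arbitrary threshold; the resulting 0/1 states are the kind of discrete characters that could be handed to an evolutionary model on a tree:

```python
# Threshold counts into "on"/"off" states for discrete-trait modeling.
# The count matrix and threshold are hypothetical.
import numpy as np

counts = np.array([
    [0, 1, 7],
    [2, 0, 0],
    [5, 3, 1],
])  # rows: genes; columns: cell types (or species)

threshold = 2                        # counts >= threshold count as "on"
on_off = (counts >= threshold).astype(int)
print(on_off)
```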

Models of expression

A promising avenue for scRNA-seq data is using generalized linear models to analyze expression [ 46 , 48 , 49 ]. These models describe expression as a continuous trait and incorporate the sampling process using a Poisson or other distribution, avoiding normalization and transformation, and returning fitted estimates of relative expression. These estimates can be compared using models that describe continuous trait evolution. One feature of generalized linear models is that they can report uncertainty values for estimates of relative expression, which can then be passed along to phylogenetic methods to assess confidence in the evolutionary conclusions drawn.
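As a minimal sketch of this idea, on simulated counts with an invented depth distribution, a Poisson GLM with sequencing depth as an offset returns a fitted relative expression estimate together with an uncertainty value, avoiding ad hoc normalization:

```python
# Poisson GLM sketch for one gene: depth enters as an offset, so
# exp(intercept) estimates relative expression per unit depth.
# All data are simulated and hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
depth = rng.integers(1_000, 5_000, size=50)   # total counts per cell
true_rate = 0.002                              # relative expression
y = rng.poisson(true_rate * depth)             # observed gene counts per cell

X = np.ones((depth.size, 1))                   # intercept-only design
fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(depth)).fit()
print(np.exp(fit.params[0]), fit.bse[0])       # rate estimate; SE on log scale
```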

Cell diversity

In a standard scRNA-seq approach, cells are analyzed in a reduced dimensional space and clustered by patterns of gene expression [ 43 ]. There are several types of cellular data that can be compared. The evolution of the presence or absence of cell types can be modeled as a binary trait. When cell type labels are unambiguously assigned, this approach can answer questions about when cell types evolved and are lost. Such a comparison is hampered; however, when cells do not fall into discrete categories [ 14 ] or when equivalent cell types cannot be identified across species due to substantial divergence in gene expression patterns. An alternative is to model the evolution of cell differentiation pathways as a binary trait on a tree to ask when pathways, rather than cell types, evolved and have been lost. As with other comparative methods, this approach must contend with complex evolutionary histories, including the potential for convergence as pathways independently evolve to generate cell types with similar functions and expression profiles.

Similarly, the abundance of cells of a given type might be compared across species (for example, to ask how dynamics of cell proliferation have evolved). However, the number of cells within a cluster can be influenced by technical features of the experiment such as the total number of clusters identified (often influenced by user-supplied parameters), as well as where cluster boundaries are defined. An alternative is to compare relative cell abundance values, which may account for experimental factors but is still unreliable as it is susceptible to bias from technical aspects of how cells are dissociated and how clusters are determined.

Cellular manifolds

One area for further development is methods that can model the evolution of the entire cellular expression manifold—the space that defines cell-to-cell similarity and cellular differentiation—on an evolutionary tree. Practically, this might be accomplished by parameterizing the manifold, for example, by calculating measures of manifold shape and structure such as distances between cells in a reduced dimensional space. The evolution of such parameters could then be studied by analyzing them as characters on a phylogenetic tree.
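A minimal sketch of such a parameterization, on a simulated and entirely hypothetical expression matrix: reduce cells to a low-dimensional space and compute scalar descriptors of its shape, which could then be treated as continuous characters on a tree:

```python
# Parameterize the cellular manifold: PCA followed by simple shape
# descriptors. The expression matrix is simulated and hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
expr = rng.poisson(1.0, size=(200, 50)).astype(float)   # cells x genes

coords = PCA(n_components=2).fit_transform(expr)

# Scalar descriptors of manifold shape that could be mapped onto a
# phylogeny as continuous characters.
d = pdist(coords)
print({"mean_dist": d.mean(), "max_dist": d.max()})
```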

Alternatively, we can envision a method in which entire ancestral landscapes of cellular gene expression are reconstructed, and then the way this landscape has been reshaped over evolutionary time is described. Such an approach would require an expansion of existing phylogenetic comparative models to ones that can incorporate many thousands of dimensions. It would also likely require dense taxonomic sampling to build robust reconstructions.

Future directions and conclusion

Comparative scRNA-seq analysis spans the fields of evolutionary, developmental, and cellular biology. Trees depicting relationships across time are the common denominator of these fields. Taking a step back reveals that many of the trees that are typically encountered, such as species phylogenies, gene phylogenies, cell phylogenies, and cell fate maps, can be reconciled as part of a larger whole ( Fig 4 ). Because all cellular life is related via an unbroken chain of cellular divisions, species phylogenies and cell fate maps are 2 representations of the same larger phenomenon, visualized at vastly different scales. Gene trees and cell trees (i.e., cell phylogenies) depict the evolution of specific characters (genes and cells) across populations within a species tree. These characters may have discordant evolutionary histories with each other, and with the overall species phylogeny, due to patterns of gene and cell duplication, loss, and incomplete sorting across populations.

Fig 4.

All cellular life is related by an unbroken chain of cell divisions. Species phylogenies describe the relationship between populations. Populations are themselves a description of the genealogical relationships between individuals. Peering even closer reveals that each individual consists of a lineage of cells, connected to other individuals via reproductive cells. Therefore, species trees, genealogies, and cell lineages are all descriptions of the same concept—the tree of life—but at different scales. Gene trees and cell trees (i.e., cellular phylogenies) describe the evolutionary histories of specific characters within the tree of life. These trees may be discordant with species trees due to duplication, loss, and incomplete sorting in populations.

https://doi.org/10.1371/journal.pbio.3002633.g004

The synthesis of species, gene, and cell trees makes 2 points clear. First, phylogenetic trees are essential for testing hypotheses about cellular gene expression evolution. Mapping single-cell data to trees, whether gene trees, cell trees, or species trees, allows for statistical tests of coevolution, diversification, and convergence. The choice of which trees to use for mapping data will be determined by the questions that need to be answered. For example, mapping cellular expression data to gene trees would allow one to test whether expression evolves differently following gene duplication events (i.e., the ortholog conjecture [ 50 ]). Second, because the fields of evolutionary, developmental, and cellular biology study the same phenomena at different scales, there is a potential benefit from sharing methods. In the case of scRNA-seq, building evolutionary context around data can prove essential for understanding the fundamental biology, including how to interpret cell types and cellular differentiation trajectories, and how to reconcile gene relationships. An evolutionary perspective is also critical for building robust null expectations of how much variation might be expected across species [ 16 ], which will allow the significance of results to be interpreted as new species atlases come to light. Methods that infer and incorporate trees are essential not only for evolutionary biology, but also for developmental and cellular biology. As single-cell data become increasingly available, rather than reinvent methods for building cell trees or comparing across cellular network diagrams, we can draw approaches from the extensive and robust fields of phylogenetic inference and phylogenetic comparative methods. These approaches include Bayesian and Maximum Likelihood inference of trees, evolutionary models, ancestral state reconstruction, character state matrices, and phylogenetic hypothesis testing, among many others [ 51 – 53 ].

Biology has benefited in the past from the synthesis of disparate fields of study, including the modern synthesis of Darwinian evolution and Mendelian genetics [ 54 ], and the synthesis of evolution and development in the field of evo-devo [ 55 ]. With the advent and commercialization of technologies like scRNA-seq, there is a broadened opportunity for new syntheses [ 56 ]. Rich and complex datasets are increasingly available from understudied branches on the tree of life, and comparisons between species will invariably invoke evolutionary questions. By integrating phylogenetic thinking across fields, we can start to answer these questions and raise new ones.

Acknowledgments

We thank Daniel Stadtmauer, Namrata Ahuja, Seth Donoughe, and other members of the Dunn lab for helpful conversation and comments on an initial version of the manuscript.

  • 12. Wagner GP. Homology, Genes, and Evolutionary Innovation. Princeton University Press; 2014.
  • 51. Swofford D, Olsen G, Waddell P, Hillis D. Phylogenetic inference. In: Molecular Systematics. Sinauer; 1996. p. 407–514.
  • 54. Huxley J. Evolution: The Modern Synthesis. George Allen & Unwin Ltd.; 1942.



  20. Data Analysis Essays

    Data Collection: A Critical Analysis. Example essay. Last modified: 18th Oct 2021. This critical essay will investigate the moral implications of data collection and how data should inform marketing activities. Example cases will be examined to draw an informed conclusion....

  21. Research of Data Analysis and Different Types of Analysis: [Essay

    Data analysis is known as 'analysis of data 'or 'data analytics', is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of ...

  22. Short Essay On Data Analysis

    Short Essay On Data Analysis. 763 Words4 Pages. DATA ANALYTICS. In the world of technology growing at lightning speed data plays an important role in all runs of life. Be it a grocery shop or a multinational company, everywhere data is a game changer and helping everyone grow exponentially. So, the question arises what exactly is data analytics?

  23. Papers with Code

    However, the complexity of the data produced by CGM monitors pushes the limits of existing analytic methods. CGM data often exhibits substantial within-person variability and has a natural multilevel structure. This research is motivated by the analysis of CGM data from individuals without diabetes in the AEGIS study.

  24. Integrating phylogenies into single-cell RNA sequencing analysis allows

    Comparisons of single-cell RNA sequencing (scRNA-seq) data across species can reveal links between cellular gene expression and the evolution of cell functions, features, and phenotypes. This Essay contends that, by integrating phylogenetic approaches into scRNA-seq analyses, hypotheses about gene and cell evolution can be robustly tested.

  25. Improving Language Models Trained with Translated Data via Continual

    To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality stories, representing 1\% of the original training data, using a capable LLM in Arabic. We show using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability that the suggested approach is a practical means to ...

  26. Systematic review and meta-analysis of hepatitis E seroprevalence in

    The burden of hepatitis E in Southeast Asia is substantial, influenced by its distinct socio-economic and environmental factors, as well as variations in healthcare systems. The aim of this study was to assess the pooled seroprevalence of hepatitis E across countries within the Southeast Asian region by the UN division.The study analyzed 66 papers across PubMed, Web of Science, and Scopus ...

  27. IMF Working Papers

    This paper discusses connections between female economic empowerment and government spending. It is an abbreviated overview for non-gender-experts on how fiscal expenditure may support female economic empowerment as an interim step toward advancing gender equality. From this perspective, it offers a preliminary exploration of key factors and indicators associated with gender-differentiated ...