Machine Learning: Algorithms, Real-World Applications and Research Directions

  • Review Article
  • Published: 22 March 2021
  • Volume 2, article number 160 (2021)


  • Iqbal H. Sarker (ORCID: orcid.org/0000-0003-1740-5517)


Abstract

In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world holds a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, this study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for academia and industry professionals as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.



Introduction

We live in the age of data, where everything around us is connected to a data source and everything in our lives is digitally recorded [21, 103]. For instance, the current electronic world has a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. These data can be structured, semi-structured, or unstructured, as discussed briefly in Sect. "Types of Real-World Data and Machine Learning Techniques", and their volume is increasing day by day. Insights extracted from these data can be used to build various intelligent applications in the relevant domains. For instance, the relevant cybersecurity data can be used to build a data-driven automated and intelligent cybersecurity system [105]; the relevant mobile data can be used to build personalized context-aware smart mobile applications [103], and so on. Thus, data management tools and techniques that can extract insights or useful knowledge from data in a timely and intelligent way, on which real-world applications are based, are urgently needed.

Figure 1: The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) over time, on a scale of 0 (min) to 100 (max); the x-axis represents the timestamp and the y-axis the corresponding score.

Artificial intelligence (AI), particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [95]. ML usually provides systems with the ability to learn and improve from experience automatically without being explicitly programmed, and is generally considered one of the most popular technologies of the fourth industrial revolution (4IR or Industry 4.0) [103, 105]. "Industry 4.0" [114] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are the key. Learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [75], discussed briefly in Sect. "Types of Real-World Data and Machine Learning Techniques". The popularity of these learning approaches is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [4] over the last five years. The x-axis of the figure indicates the specific dates, and the y-axis shows the corresponding popularity score within the range of \(0 \; (minimum)\) to \(100 \; (maximum)\). According to Fig. 1, the popularity scores for these learning types were low in 2015 and have been increasing since then. These statistics motivate us to study machine learning in this paper, as it can play an important role in the real world through Industry 4.0 automation.

In general, the effectiveness and efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [41, 125]. Besides, deep learning, which originated from the artificial neural network and is part of a wider family of machine learning approaches, can be used to intelligently analyze data [96]. Thus, selecting a learning algorithm that is suitable for the target application in a particular domain is challenging: different learning algorithms serve different purposes, and even the outcomes of learning algorithms in the same category may vary depending on the data characteristics [106]. It is therefore important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. "Applications of Machine Learning".

Based on the importance and potential of "Machine Learning" to analyze the data mentioned above, in this paper we provide a comprehensive view of various types of machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for those in academia and industry who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.

To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

To discuss the applicability of machine learning-based solutions in various real-world application domains.

To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section followed by which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [103, 105]. Data can take various forms, such as structured, semi-structured, or unstructured [41, 72]. Besides, "metadata" is another type that typically represents data about data. In the following, we briefly discuss these types of data.

Structured: Structured data have a well-defined structure and conform to a data model following a standard order; they are highly organized, easily accessed, and readily used by an entity or a computer program. Structured data are typically stored in well-defined schemas such as relational databases, i.e., in a tabular format. Names, dates, addresses, credit card numbers, stock information, geolocation, etc. are examples of structured data.

Unstructured: On the other hand, unstructured data have no pre-defined format or organization, which makes them much more difficult to capture, process, and analyze; they mostly consist of text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.

Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc. are some examples of semi-structured data.

Metadata: Metadata are not a normal form of data, but "data about data". The primary difference between "data" and "metadata" is that data are simply the material that can classify, measure, or even document something relative to an organization's data properties, whereas metadata describe the relevant information about those data, giving them more significance for data users. Basic examples of a document's metadata are the author, file size, the date the document was generated, keywords describing the document, etc.

In the area of machine learning and data science, researchers use various widely adopted datasets for different purposes. Examples include cybersecurity datasets such as NSL-KDD [119], UNSW-NB15 [76], ISCX'12 [1], CIC-DDoS2019 [2], and Bot-IoT [59]; smartphone datasets such as phone call logs [84, 101], SMS logs [29], mobile application usage logs [117, 137], and mobile phone notification logs [73]; IoT data [16, 57, 62]; agriculture and e-commerce data [120, 138]; health data such as heart disease [92], diabetes mellitus [83, 134], and COVID-19 [43, 74]; and many more in various application domains. These data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain and extract the insights or useful knowledge needed to build real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2 . In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Figure 2: Various types of machine learning techniques.

Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification” that separates the data, and “regression” that fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.

Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. This is widely used for extracting generative features, identifying meaningful trends and structures, groupings in results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.

Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [41, 105]. Thus, it falls between learning "without supervision" and learning "with supervision". Semi-supervised learning is useful in real-world contexts where labeled data are scarce and unlabeled data are plentiful [75]. The ultimate goal of a semi-supervised learning model is to provide a better prediction outcome than what could be produced using the labeled data alone. Some application areas where semi-supervised learning is used include machine translation, fraud detection, data labeling, and text classification.

Reinforcement: Reinforcement learning is a type of machine learning that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [52], i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from interacting with the environment to take actions that increase the reward or minimize the risk [75]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms that include classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model has been shown in Fig. 3 , where the model is trained from historical data in phase 1 and the outcome is generated in phase 2 for the new test data.

Figure 3: A general structure of a machine learning-based predictive model, considering both the training and testing phases.

Classification Analysis

Classification is regarded as a supervised learning method in machine learning; it also refers to a predictive modeling problem in which a class label is predicted for a given example [41]. Mathematically, it approximates a mapping function (f) from input variables (X) to output variables (Y), known as targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, distinguishing "spam" from "not spam" in email service providers, is a classification problem. In the following, we summarize the common classification problems.

Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.

Multiclass classification: Traditionally, this refers to those classification tasks having more than two class labels [ 41 ]. The multiclass classification does not have the principle of normal and abnormal outcomes, unlike binary classification tasks. Instead, within a range of specified classes, examples are classified as belonging to one. For example, it can be a multiclass classification task to classify various types of network attacks in the NSL-KDD [ 119 ] dataset, where the attack categories are classified into four class labels, such as DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.

Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class in each hierarchical level, e.g., multi-level text classification. For instance, Google news can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification includes advanced machine learning algorithms that support predicting various mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

Naive Bayes (NB): The naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between each pair of features [51]. It works well in many real-world situations, such as document or text classification and spam filtering, and can be used for both binary and multi-class categories. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [94]. Its key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [82]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [82].
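To make this concrete, the following minimal Python sketch, assuming the scikit-learn library is available (the dataset and split ratio are illustrative choices, not from this study), trains and evaluates a Gaussian NB classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
# Hold out 30% of the samples to estimate generalization accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()         # Gaussian variant: assumes normally distributed features per class
model.fit(X_train, y_train)  # parameter estimation is fast even on small training sets
print("Accuracy:", model.score(X_test, y_test))
```

For text classification or spam filtering with word-count features, the Multinomial variant (MultinomialNB) would be the more natural choice.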

Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a linear decision boundary classifier created by fitting class-conditional densities to data and applying Bayes' rule [51, 82]. This method is also known as a generalization of Fisher's linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a dimensionality reduction that minimizes the complexity of the model or reduces the resulting model's computational costs. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [82]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.

Logistic regression (LR): Another common probabilistic, statistics-based model used to solve classification problems in machine learning is logistic regression (LR) [64]. Logistic regression typically estimates probabilities using the logistic function, also referred to as the sigmoid function, defined in Eq. 1 as \(g(z) = \frac{1}{1 + e^{-z}}\). It works well when the dataset is linearly separable, but it may overfit high-dimensional datasets; regularization (L1 and L2) techniques [82] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.
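As an illustration, the sketch below, again assuming scikit-learn (the dataset and hyperparameters are illustrative), defines the sigmoid of Eq. 1 explicitly and fits an L2-regularized logistic regression classifier:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sigmoid(z):
    # The logistic (sigmoid) function of Eq. 1: maps any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty="l2" applies the regularization mentioned above to reduce over-fitting.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=5000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```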

K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based learning” or non-generalizing learning, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in n -dimensional space. KNN uses data and classifies new data points based on similarity measures (e.g., Euclidean distance function) [ 82 ]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is to choose the optimal number of neighbors to be considered. KNN can be used both for classification as well as regression.
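Since choosing the number of neighbors is the main practical issue, a simple sketch (scikit-learn assumed; the values of k are illustrative) is to compare a few candidates by cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Scan a few candidate values of k; the best choice is data-dependent.
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {score:.3f}")
```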

Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is a support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane, which has the greatest distance from the nearest training data points in any class, achieves a strong separation since, in general, the greater the margin, the lower the classifier’s generalization error. It is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as the kernel. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in SVM classifier [ 82 ]. However, when the data set contains more noise, such as overlapping target classes, SVM does not perform well.
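The kernel choice mentioned above can be explored with a short sketch (scikit-learn assumed; the dataset and C value are illustrative). Because SVMs are sensitive to feature scale, the features are standardized first:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a linear kernel with the popular RBF kernel.
for kernel in ("linear", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    clf.fit(X_train, y_train)
    print(kernel, "accuracy:", clf.score(X_test, y_test))
```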

Decision tree (DT): Decision tree (DT) [88] is a well-known non-parametric supervised learning method. DT learning methods are used for both classification and regression tasks [82]. ID3 [87], C4.5 [88], and CART [20] are well-known DT algorithms. Moreover, the recently proposed BehavDT [100] and IntrudTree [97] by Sarker et al. are effective in the relevant application domains, such as user behavior analytics and cybersecurity analytics, respectively. A DT classifies instances by sorting them down the tree from the root to some leaf node, as shown in Fig. 4. Instances are classified by checking the attribute defined by each node, starting at the root node of the tree, and then moving down the tree branch corresponding to the attribute value. The most popular splitting criteria are "gini" for the Gini impurity and "entropy" for the information gain, which can be expressed mathematically as [82] \(\mathrm{Gini}(E) = 1 - \sum_{i=1}^{c} p_i^2\) and \(H(E) = -\sum_{i=1}^{c} p_i \log_2 p_i\), where \(p_i\) is the proportion of instances belonging to class i.

Figure 4: An example of a decision tree structure.

Figure 5: An example of a random forest structure considering multiple decision trees.

Random forest (RF): A random forest classifier [ 19 ] is well known as an ensemble classification technique that is used in the field of machine learning and data science in various application areas. This method uses “parallel ensembling” which fits several decision tree classifiers in parallel, as shown in Fig. 5 , on different data set sub-samples and uses majority voting or averages for the outcome or final result. It thus minimizes the over-fitting problem and increases the prediction accuracy and control [ 82 ]. Therefore, the RF learning model with multiple decision trees is typically more accurate than a single decision tree based model [ 106 ]. To build a series of decision trees with controlled variation, it combines bootstrap aggregation (bagging) [ 18 ] and random feature selection [ 11 ]. It is adaptable to both classification and regression problems and fits well for both categorical and continuous values.
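The claim that an ensemble of trees is typically more accurate than a single tree can be checked with a minimal sketch (scikit-learn assumed; the dataset and tree count are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(criterion="gini", random_state=0)    # a single decision tree
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees fit on bootstrap samples

for name, model in (("decision tree", tree), ("random forest", forest)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```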

Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Freund et al. [35] and is also known as "meta-learning". Unlike the random forest that uses parallel ensembling, AdaBoost uses "sequential ensembling". It creates a powerful classifier by combining many poorly performing classifiers into a single classifier of high accuracy. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, but in some instances it can cause over-fitting. AdaBoost is best used to boost the performance of decision trees, its base estimator [82], on binary classification problems; however, it is sensitive to noisy data and outliers.

Extreme gradient boosting (XGBoost): Gradient boosting, like random forests [19] above, is an ensemble learning algorithm that generates a final model from a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [41] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [82]. It computes second-order gradients of the loss function to minimize loss and applies advanced regularization (L1 and L2) [82], which reduces over-fitting and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.
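A minimal usage sketch, assuming the separate xgboost Python package is installed (the hyperparameter values are illustrative only), looks as follows; note the explicit L2 regularization term discussed above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # requires: pip install xgboost

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(
    n_estimators=200,   # number of boosting rounds (trees built sequentially)
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # shallow trees act as weak learners
    reg_lambda=1.0,     # L2 regularization to reduce over-fitting
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```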

Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [41] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word 'stochastic' refers to random probability. It reduces the computational burden, particularly in high-dimensional optimization problems, allowing for faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function, calculating a variable's degree of change in response to another variable's changes; gradient descent follows the partial derivatives of the objective with respect to its input parameters. Let \(\alpha\) be the learning rate and \(J_i\) the cost of the \(i\mathrm{th}\) training example; then Eq. (4) represents the stochastic gradient descent weight update at the \(j\mathrm{th}\) iteration: \(w_{j+1} := w_j - \alpha \frac{\partial J_i}{\partial w_j}\). In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [82]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations.
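The weight update of Eq. (4) can be written out directly. The following sketch (NumPy assumed; the synthetic data, squared-error cost, and learning rate are illustrative) recovers the weights of a linear model one random example at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)  # initial weights
alpha = 0.01     # learning rate

for epoch in range(50):
    for i in rng.permutation(len(y)):  # one random example per update: "stochastic"
        error = X[i] @ w - y[i]        # residual of the i-th training example
        grad = error * X[i]            # gradient of J_i = 0.5 * error**2 w.r.t. w
        w -= alpha * grad              # Eq. (4): w := w - alpha * dJ_i/dw

print("Recovered weights:", w)  # should approach [2.0, -1.0, 0.5]
```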

Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with the ability to generate rules exist, such as Zero-R [125], One-R [47], decision trees [87, 88], DTNB [110], Ripple Down Rule learner (RIDOR) [125], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [126]. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret, the ability to handle high-dimensional data, simplicity and speed, good accuracy, and the capability to produce rules that are clear and understandable to humans [127, 128]. Decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [106]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including its entities and their relationships.

Figure 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables.

Regression Analysis

Regression analysis includes several machine learning methods that allow one to predict a continuous (y) result variable based on the value of one or more (x) predictor variables [41]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows how classification differs from regression models. Some overlap is often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, etc., which are explained briefly in the following.

Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable ( Y ) and one or more independent variables ( X ) (also known as regression line) using the best fit straight line [ 41 ]. It is defined by the following equations:

\(y = a + bx + e \qquad (5)\)

\(y = a + b_1x_1 + b_2x_2 + \cdots + b_nx_n + e \qquad (6)\)

where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable, y, as a linear function, as defined in Eq. 6, whereas simple linear regression has only one independent variable, as defined in Eq. 5.

Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear but is modeled as an \(n\mathrm{th}\)-degree polynomial in x [82]. The equation for polynomial regression is derived from the linear regression equation (polynomial regression of degree 1) and is defined as below:

\(y = b_0 + b_1x + b_2x^2 + \cdots + b_nx^n \qquad (7)\)

Here, y is the predicted/target output and \(b_0, b_1, \ldots, b_n\) are the regression coefficients, where x is an independent/input variable. In simple words, if the data are not distributed linearly but instead follow an \(n\mathrm{th}\)-degree polynomial, then polynomial regression is used to obtain the desired output.
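In practice, polynomial regression is often implemented by expanding x into the powers \(x, x^2, \ldots, x^n\) and then fitting an ordinary linear model, as in this sketch (scikit-learn assumed; the cubic data and degree are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(scale=0.5, size=100)  # cubic data

# Expanding x into [x, x^2, x^3] reduces polynomial regression to linear regression.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print("R^2 on training data:", model.score(x, y))
```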

LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, due to their capability to prevent over-fitting and reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the L1 regularization technique [82], which applies shrinkage by penalizing the "absolute value of the magnitude of coefficients" (L1 penalty). As a result, LASSO can shrink coefficients to exactly zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L2 regularization [82], which penalizes the "squared magnitude of coefficients" (L2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient to exactly zero, producing a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, while ridge regression is useful when a dataset has "multicollinearity", i.e., predictors that are correlated with other predictors.
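The different behavior of the two penalties can be seen in a small sketch (scikit-learn assumed; the synthetic data and alpha values are illustrative): LASSO zeroes out most of the uninformative coefficients, while ridge only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, of which only 5 actually influence the response.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: drives many coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients, none become zero

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```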

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [41]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [69], K-Medoids [80], CLARA [55], etc.

Density-based methods: To identify distinct groups or clusters, it uses the concept that a cluster in the data space is a contiguous region of high point density isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered as noise. The typical clustering algorithms based on density are DBSCAN [ 32 ], OPTICS [ 12 ] etc. The density-based methods typically struggle with clusters of similar density and high dimensionality data.

Hierarchical-based methods: Hierarchical clustering typically seeks to construct a hierarchy of clusters, i.e., a tree structure. Strategies for hierarchical clustering generally fall into two types: (i) agglomerative, a "bottom-up" approach in which each observation begins in its own cluster and pairs of clusters are merged as one moves up the hierarchy; and (ii) divisive, a "top-down" approach in which all observations begin in one cluster and splits are performed recursively as one moves down the hierarchy, as shown in Fig. 7. Our earlier proposed BOTS technique [102] is an example of a hierarchical, particularly bottom-up, clustering algorithm.

Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.

Model-based methods: There are mainly two types of model-based clustering algorithms: those that use statistical learning and those based on neural network learning [130]. For instance, GMM [89] is an example of a statistical learning method, and SOM [22, 96] is an example of a neural network learning method.

Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application- or user-oriented constraints are incorporated to perform the clustering. Typical algorithms of this kind are COP K-means [121], CMWK-Means [27], etc.

Figure 7: A graphical interpretation of the widely used hierarchical clustering (bottom-up and top-down) technique.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [41, 125]. In the following, we summarize the popular methods that are used widely in various application areas.

K-means clustering: K-means clustering [69] is a fast, robust, and simple algorithm that provides reliable results when datasets are well separated from each other. The data points are allocated to clusters in such a way that the sum of the squared distances between the data points and the centroids is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest cluster, keeping the clusters as compact as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [91] is a variant of K-means that is more robust to noise and outliers.
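A minimal sketch (scikit-learn assumed; the blob data and k are illustrative) also shows the standard remedy for the initialization sensitivity noted above, namely re-running the algorithm from several random starts:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# n_init=10 restarts K-means from 10 random centroid seeds and keeps the best run,
# offsetting the inconsistency caused by random initialization.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", [list(km.labels_).count(c) for c in range(4)])
print("Inertia (sum of squared distances):", km.inertia_)
```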

Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing are examples of application domains. Mean Shift has the disadvantage of being computationally expensive. Moreover, in cases of high dimension, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.

DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [32] is a base algorithm for density-based clustering that is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters. DBSCAN's main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in vast volumes of data that are noisy and contain outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions and is robust to outliers.
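The following sketch (scikit-learn assumed; the two-moons data and parameter values are illustrative) shows the two parameters that define a dense region and how noise points are reported:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)  # arbitrarily shaped clusters

# eps: neighborhood radius; min_samples: points required to form a dense region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # the label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```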

GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.

Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.
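A short sketch (SciPy and Matplotlib assumed; the data and linkage criterion are illustrative) builds the bottom-up merge tree and draws the dendrogram described above:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# linkage() performs the bottom-up merging; "complete" linkage uses the
# farthest-pair distance between clusters at each merge step.
Z = linkage(X, method="complete")
dendrogram(Z)  # the tree-based representation of the merge hierarchy
plt.show()
```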

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lower computational costs, and the avoidance of overfitting and redundancy by simplifying models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between the two is that "feature selection" keeps a subset of the original features [97], while "feature extraction" creates brand-new ones [98]. In the following, we briefly discuss these techniques.

Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building a machine learning and data science model. It decreases a model's complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of the selected features in a problem domain can minimize the overfitting problem by simplifying and generalizing the model, as well as increase the model's accuracy [97]. Thus, "feature selection" [66, 99] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson's correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.

Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a reduction in computational cost or training time. The aim of "feature extraction" [66, 99] is to reduce the number of features in a dataset by generating new features from the existing ones and then discarding the original features. The majority of the information found in the original set of features can then be summarized using this new, reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality reduction technique to extract a lower-dimensional space by creating brand-new components from the existing features in a dataset [98].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

Variance threshold: A simple basic approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. It eliminates all zero-variance characteristics by default, i.e., characteristics that have the same value in all samples. This feature selection algorithm looks only at the ( X ) features, not the ( y ) outputs needed, and can, therefore, be used for unsupervised learning.

Pearson correlation: Pearson's correlation is another method for understanding a feature's relation to the response variable and can be used for feature selection [99]. This method is also used for finding the association between features in a dataset. The resulting value lies in \([-1, 1]\), where \(-1\) means perfect negative correlation, \(+1\) means perfect positive correlation, and 0 means that the two variables have no linear correlation. If two random variables are X and Y, then the correlation coefficient between X and Y is defined as [41] \(r(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}\), where \(\mathrm{cov}(X, Y)\) is the covariance and \(\sigma_X\), \(\sigma_Y\) are the standard deviations of X and Y.

ANOVA: Analysis of variance (ANOVA) is a statistical tool used to verify whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, as well as the variables' normal distribution. To statistically test the equality of means, the ANOVA method utilizes F-tests. For feature selection, the resulting 'ANOVA F-value' [82] of this test can be used to omit features that are independent of the target variable.

Chi square: The chi-square \({\chi }^2\) [82] statistic is an estimate of the difference between the observed and expected frequencies of a series of events or variables. \({\chi }^2\) depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square \({\chi }^2\) test is commonly used for testing relationships between categorical variables. If \(O_i\) represents an observed value and \(E_i\) an expected value, then \({\chi }^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}\).

Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [82] fits the model and removes the weakest feature repeatedly until the specified number of features is reached. Features are ranked by the model's coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to eliminate dependencies and collinearity in the model.
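As an example, the sketch below (scikit-learn assumed; the estimator and the number of features to keep are illustrative) recursively drops the weakest feature by coefficient magnitude until ten remain:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Drop one weakest feature per iteration until 10 features are left.
selector = RFE(estimator=LogisticRegression(max_iter=5000),
               n_features_to_select=10, step=1)
selector.fit(X, y)
print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = selected):", selector.ranking_)
```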

Model-based selection: To reduce the dimensionality of the data, linear models penalized with L1 regularization can be used. Least absolute shrinkage and selection operator (LASSO) regression is a type of linear regression that has the property of shrinking some coefficients to zero [82]; such features can then be removed from the model. Thus, the penalized lasso regression method is often used in machine learning to select a subset of variables. The Extra Trees classifier [82] is an example of a tree-based estimator that can be used to compute impurity-based feature importances, which can then be used to discard irrelevant features.

Principal component analysis (PCA): Principal component analysis (PCA) is a well-known unsupervised learning approach in the field of machine learning and data science. PCA is a mathematical technique that transforms a set of correlated variables into a set of uncorrelated variables known as principal components [48, 81]. Figure 8 shows an example of the effect of PCA on different dimensional spaces: the original features in 3D space, and the created principal components PC1 and PC2 projected onto a 2D plane and onto a 1D line with the principal component PC1, respectively. Thus, PCA can be used as a feature extraction technique that reduces the dimensionality of a dataset in order to build an effective machine learning model [98]. Technically, PCA identifies the directions (eigenvectors) of a covariance matrix with the highest eigenvalues and then uses those to project the data into a new subspace of equal or fewer dimensions [82].
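A minimal sketch (scikit-learn assumed; the dataset and number of components are illustrative) projects a four-dimensional dataset onto its first two principal components, mirroring the reduction shown in Fig. 8:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 4-dimensional feature space

pca = PCA(n_components=2)    # keep the two components with the largest eigenvalues
X_2d = pca.fit_transform(X)  # project the data onto the new subspace (PC1, PC2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_2d.shape)
```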

Figure 8: An example of principal component analysis (PCA) and the created principal components PC1 and PC2 in different dimensional spaces.

Association Rule Learning

Association rule learning is a rule-based machine learning approach to discover interesting relationships, "IF-THEN" statements, between variables in large datasets [7]. One example is: "if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time". Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of an association rule is to use its parameters 'support' and 'confidence', which were introduced in [7].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. This algorithm calls for too many passes over the entire dataset to produce the rules. Another approach SETM [ 49 ] exhibits good performance and stable behavior with execution time; however, it suffers from the same flaw as the AIS algorithm.

Apriori: For generating association rules from a given dataset, Agrawal et al. [8] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These latter algorithms outperform the AIS and SETM mentioned above due to the Apriori property of frequent itemsets [8]. The term 'Apriori' usually refers to having prior knowledge of frequent itemset properties. Apriori uses a "bottom-up" approach, where it generates candidate itemsets. To reduce the search space, Apriori uses the property that "all subsets of a frequent itemset must be frequent, and if an itemset is infrequent, then all its supersets must also be infrequent". Another approach, predictive Apriori [108], can also generate rules; however, it may produce unexpected results as it combines both support and confidence. Apriori [8] is the most widely applicable technique for mining association rules.
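A minimal usage sketch, assuming the separate mlxtend package is installed (the toy transactions and thresholds are illustrative only), mines frequent itemsets and derives rules with support and confidence:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules  # requires: pip install mlxtend

# One-hot encoded transactions: rows are shopping baskets, columns are items.
df = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 1]],
    columns=["computer", "anti-virus", "mouse"],
).astype(bool)

frequent = apriori(df, min_support=0.5, use_colnames=True)  # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```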

ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.

FP-Growth: Another common association rule learning technique, based on the frequent-pattern tree (FP-tree) proposed by Han et al. [42], is Frequent Pattern Growth, known as FP-Growth. The key difference from Apriori is that while generating rules, the Apriori algorithm [8] generates frequent candidate itemsets, whereas the FP-Growth algorithm [42] avoids candidate generation and instead builds a tree using a successful 'divide and conquer' strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [133]. Moreover, the FP-tree may not fit into memory for massive datasets, making it challenging to process big data as well. Another solution is RARM (Rapid Association Rule Mining), proposed by Das et al. [26], but it faces a related FP-tree issue [133].

ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, recently proposed in our earlier paper by Sarker et al. [104], that discovers interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features and discovers a set of non-redundant association rules. It first constructs an association generation tree (AGT) in a top-down manner and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment using feedback from its actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning is defined as a Markov decision process (MDP) [86], i.e., it is all about making decisions sequentially. An RL problem typically includes four elements: agent, environment, rewards, and policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [85]. AlphaZero and AlphaGo [113] are examples of model-based approaches. On the other hand, a model-free approach does not use the transition probability distribution and the reward function associated with the MDP. Q-learning, Deep Q Network, Monte Carlo Control, SARSA (State-Action-Reward-State-Action), etc. are some examples of model-free algorithms [52]. The use of a model of the environment, required for model-based RL but not for model-free RL, is the key difference between the two. In the following, we discuss the popular RL algorithms.

Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a broad category of computational algorithms that rely on repeated random sampling to obtain numerical results [52]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and generating draws from a probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.

Q-learning: Q-learning is a model-free reinforcement learning algorithm for learning the quality of behaviors that tell an agent what action to take under what conditions [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without the need for adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm calculates the maximum expected rewards for a given behavior in a given state.
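The core of the algorithm is a single update rule over a table of Q-values. The sketch below (NumPy assumed; the state/action sizes, learning rate, and transition are illustrative) shows one such update:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # Q-table: expected quality of each (state, action)
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

def q_update(s, a, r, s_next):
    # Q-learning rule: move Q(s, a) toward the observed reward plus the
    # discounted best value achievable from the next state.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# One hypothetical transition: in state 0, action 1 yields reward 1.0 and leads to state 2.
q_update(s=0, a=1, r=1.0, s_next=2)
print(Q)
```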

Deep Q-learning: The basic working step in deep Q-learning [52] is that the initial state is fed into a neural network, which returns the Q-value of all possible actions as output. Plain Q-learning works well when we have a reasonably simple setting; however, when the number of states and actions becomes more complex, deep learning can be used as a function approximator.

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations research, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture by combining several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning relative to conventional machine learning as the amount of data increases. However, the picture may vary depending on the data characteristics and the experimental setup.

Figure 9: Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are the Multi-layer Perceptron (MLP), the Convolutional Neural Network (CNN, or ConvNet), and the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

Figure 10: A structure of an artificial neural network modeling with multiple processing layers

MLP: The base architecture of deep learning, which is also known as the feed-forward artificial neural network, is called a multilayer perceptron (MLP) [ 82 ]. A typical MLP is a fully connected network consisting of an input layer, one or more hidden layers, and an output layer, as shown in Fig. 10 . Each node in one layer connects to each node in the following layer with a certain weight. MLP utilizes the “backpropagation” technique [ 41 ], the most fundamental building block of a neural network, to adjust the weight values internally while building the model. MLP is sensitive to feature scaling and allows a variety of hyperparameters to be tuned, such as the number of hidden layers, neurons, and iterations, which can make the model computationally costly.
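
As a minimal sketch of the above, the following code trains an MLP classifier with scikit-learn, including the feature scaling that MLP is sensitive to; the dataset, hidden-layer sizes, and iteration count are illustrative choices rather than recommendations.

```python
# A minimal sketch of an MLP classifier in scikit-learn; all choices
# (dataset, layer sizes, iterations) are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# MLP is sensitive to feature scaling, so standardize the features first.
scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```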

CNN or ConvNet: The convolutional neural network (CNN) [ 65 ] enhances the design of the standard ANN with convolutional layers, pooling layers, and fully connected layers, as shown in Fig. 11 . As it takes advantage of the two-dimensional (2D) structure of the input data, it is broadly used in areas such as image and video recognition, image processing and classification, medical image analysis, and natural language processing. Although CNN carries a greater computational burden, it has the advantage of automatically detecting the important features without any manual intervention, and hence CNN is considered more powerful than a conventional ANN. A number of advanced deep learning models based on CNN can be used in the field, such as AlexNet [ 60 ], Xception [ 24 ], Inception [ 118 ], Visual Geometry Group (VGG) [ 44 ], ResNet [ 45 ], etc.
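
The following minimal sketch defines a small CNN with convolutional, pooling, and fully connected layers using the Keras API (assuming TensorFlow is installed); the architecture and the 28x28 grayscale input shape are illustrative assumptions, not a prescribed model.

```python
# A minimal sketch of a small CNN in Keras; the architecture is
# illustrative, not a model from the cited works.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # e.g., grayscale images
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolutional layer
    layers.MaxPooling2D((2, 2)),                   # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```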

LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. Unlike normal feed-forward neural networks, LSTM has feedback connections. LSTM networks are well suited for analyzing and learning sequential data, such as classifying, processing, and predicting data based on time series, which differentiates them from other conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time series or sentences, and is commonly applied in time-series analysis, natural language processing, speech recognition, etc.
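
A minimal LSTM for sequence data might look like the following Keras sketch; the sequence length, feature dimension, and binary output are illustrative assumptions.

```python
# A minimal sketch of an LSTM for sequence classification in Keras;
# input shape and output layer are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100, 8)),           # 100 time steps, 8 features each
    layers.LSTM(32),                        # recurrent layer with feedback
    layers.Dense(1, activation="sigmoid"),  # e.g., a binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```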

Figure 11: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique widely used for dimensionality reduction and feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, which typically re-uses a model pre-trained on one problem for a new problem, is currently very popular because it allows deep neural networks to be trained with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is summarized in our earlier paper, Sarker et al. [ 96 ].
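
As one concrete example from this family, the following sketch defines a simple autoencoder for dimensionality reduction in Keras; the input dimension and the 32-dimensional bottleneck are illustrative assumptions.

```python
# A minimal sketch of an autoencoder for dimensionality reduction in
# Keras; input and bottleneck sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, latent_dim = 784, 32
encoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(latent_dim, activation="relu"),    # compressed representation
])
decoder = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(input_dim, activation="sigmoid"),  # reconstruction
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=10)  # trained to reproduce its own input
```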

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its ability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict the unknown outcome [ 41 ]. Typical examples are identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms such as decision trees, support vector machines, and artificial neural networks [ 106 , 125 ] are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.

Cybersecurity and threat intelligence: Cybersecurity, typically the practice of protecting networks, systems, hardware, and data from digital attacks, is one of the most essential areas of Industry 4.0 [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where “bad neighborhoods” are online, keep people safe while browsing, or secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be used on large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [ 105 ]. Thus, the various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.

Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ], which turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart homes, education, communication, transportation, retail, agriculture, health care, business, and many more [ 70 ]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage for a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to the current needs of the people.

Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO \(_2\) pollution, accidents, emergencies, and a decline in modern society’s quality of life [ 40 ]. Thus, an intelligent transportation system that predicts future traffic is important and an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [ 17 , 30 , 31 ]. For example, based on the travel history and the trends of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.

Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. In the COVID-19 pandemic, learning techniques are used to classify patients at high risk, predict mortality rates, and detect other anomalies [ 61 ]. They can also be used to better understand the virus’s origin, to predict COVID-19 outbreaks, and for disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions to make the required arrangements. Deep learning also provides exciting solutions to problems in medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help to fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the healthcare domain.

E-commerce and product recommendations: Product recommendation is one of the most well-known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to retain existing customers while attracting new ones.

NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. Thus, NLP helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, where machine learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text through blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more intense emotions such as very happy, happy, sad, very sad, angry, interested, or not interested.

Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, which can identify an object in a digital image. For instance, labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [ 23 ], which typically uses sound and linguistic models, is also very popular, e.g., in Google Assistant, Cortana, Siri, Alexa, etc. [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques such as classification, feature selection, clustering, and sequence labeling methods are used in the area.

Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to improve their decisions to adopt sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.

User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify its behavior accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their ability to learn from contextual data [ 103 , 136 ]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior, and support and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable for building various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making systems that intelligently assist end mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful in capturing users’ diverse behavioral activities by taking into account time-series data [ 102 ]. To predict future events in various contexts, classification methods can be used [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.

In addition to these application areas, machine learning-based models can also be applied to several other domains, such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and the efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in a relevant domain, such as cybersecurity, IoT, healthcare, or agriculture, discussed in Sect. “ Applications of Machine Learning ”, is not straightforward, although the current cyberspace enables the production of a huge amount of data at very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing those data are important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working on real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless data. The machine learning algorithms discussed in Sect. “ Machine Learning Tasks and Algorithms ” are highly affected by the quality and availability of the data for training, which in turn affects the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to use the learning algorithms effectively in the associated application domain.

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “ Machine Learning Tasks and Algorithms ”. Thus, selecting a proper learning algorithm that is suitable for the target application is challenging. The reason is that the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm would produce unexpected outcomes, leading to wasted effort and a loss of the model’s effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can directly be used to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “ Applications of Machine Learning ”. However, hybrid learning models, e.g., ensembles of methods, modifications or enhancements of existing learning techniques, or the design of new learning methods, could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and the corresponding applications mainly depends on both the data and the learning algorithms. If the data are bad for learning, such as non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. In line with our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. The sophisticated learning algorithms then need to be trained through the collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we have summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The challenges that we identified create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can serve as a reference guide for potential research and applications for both academia and industry professionals, as well as for decision-makers, from a technical point of view.

Canadian Institute of Cybersecurity, University of New Brunswick, ISCX dataset. http://www.unb.ca/cic/datasets/index.html/ (Accessed 20 Oct 2019).

CIC-DDoS2019 [online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html/ (Accessed 28 Mar 2020).

World Health Organization (WHO). http://www.who.int/ .

Google Trends. https://trends.google.com/trends/ , 2019.

Adnan N, Nordin Shahrina Md, Rahman I, Noor A. The effects of knowledge transfer on farmers decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998; 94–105

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM. 1993;22: 207–216

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Fast algorithms for mining association rules. In: Proceedings of the International Joint Conference on Very Large Data Bases, Santiago Chile. 1994; 1215: 487–499.

Aha DW, Kibler D, Albert M. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict covid-19 infection. Chaos Solit Fract. 2020;140:

Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record. 1999;28(2):49–60.

Anzai Y. Pattern recognition and machine learning. Elsevier; 2012.

Ardabili SF, Mosavi A, Ghamisi P, Ferdinand F, Varkonyi-Koczy AR, Reuter U, Rabczuk T, Atkinson PM. Covid-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, 2012; 37–49 .

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Boukerche A, Wang J. Machine learning-based traffic prediction models for intelligent transportation systems. Comput Netw. 2020;181

Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):43.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Chiu C-C, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018 pages 4774–4778. IEEE .

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

Cobuloglu H, Büyüktahtakın IE. A stochastic multi-criteria decision analysis for sustainable biomass crop selection. Expert Syst Appl. 2015;42(15–16):6065–74.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management, pages 474–481. ACM, 2001.

de Amorim RC. Constrained clustering with minkowski weighted k-means. In: 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI), pages 13–17. IEEE, 2012.

Dey AK. Understanding and using context. Person Ubiquit Comput. 2001;5(1):4–7.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Person Ubiquit Comput. 2006;10(4):255–68.

Essien A, Petrounias I, Sampaio P, Sampaio S. Improving urban traffic speed prediction using data source fusion and deep learning. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE. 2019: 1–8. .

Essien A, Petrounias I, Sampaio P, Sampaio S. A deep-learning model for urban traffic flow prediction with traffic events mined from twitter. In: World Wide Web, 2020: 1–24 .

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, Citeseer. 1996; 96: 148–156

Fujiyoshi H, Hirakawa T, Yamashita T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019;43(4):244–52.

Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inform Theory. 1975;21(1):32–40.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014: 2672–2680.

Guerrero-Ibáñez J, Zeadally S, Contreras-Castillo J. Sensor technologies for intelligent transportation systems. Sensors. 2018;18(4):1212.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, ACM. 2000;29: 1–12.

Harmon SA, Sanford TH, Sheng X, Turkbey EB, Roth H, Ziyue X, Yang D, Myronenko A, Anderson V, Amalou A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016: 770–778.

Hinton GE. A practical guide to training restricted boltzmann machines. In: Neural networks: Tricks of the trade. Springer. 2012; 599-619

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Edu Psychol. 1933;24(6):417.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Data Engineering, 1995. Proceedings of the Eleventh International Conference on, IEEE.1995:25–33.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and covid-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 1995; 338–345

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Protect. 2018;117:408–25.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Radha Krishna MK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE. 2018; 1–6

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Fut Gen Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012: 1097–1105

Kushwaha S, Bahl S, Bagha AK, Parmar KS, Javaid M, Haleem A, Singh RP. Significant applications of machine learning for covid-19 pandemic. J Ind Integr Manag. 2020;5(4).

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review. Chaos Sol Fract. 2020:110059 .

LeCessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Liu H, Motoda H. Feature extraction, construction and selection: A data mining perspective, vol. 453. Springer Science & Business Media; 1998.

López G, Quesada L, Guerrero LA. Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces. In: International Conference on Applied Human Factors and Ergonomics, Springer. 2017; 241–250.

Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967;volume 1, pages 281–297. Oakland, CA, USA.

Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

McCallum A. Information extraction: distilling structured data from unstructured text. Queue. 2005;3(9):48–57.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September, 2016; pp. 1223–1234. ACM, New York, USA. .

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Mohammed M, Khan MB, Bashier Mohammed BE. Machine learning: algorithms and applications. CRC Press; 2016.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), 2015;pages 1–6. IEEE .

Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Yujin O, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Otter DW, Medina JR , Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.

Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.

Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH, Abushark YB, Alsolami F, Khan A. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Abushark YB, Khan A. Contextpca: predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Sarker IH, Alqahtani H, Alsolami F, Khan A, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Alan C, Jun H, Khan AI, Abushark YB, Khaled S. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2019; 1–11.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp): Adjunct, Germany, pages 630–634. ACM, 2016.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J Oxf Univ UK. 2018;61(3):349–68.

Sarker IH, Hoque MM, MdK Uddin, Tawfeeq A. Mobile data science and intelligent apps: concepts, ai-based modeling and research directions. Mob Netw Appl, pages 1–19, 2020.

Sarker IH, Kayes ASM. Abc-ruleminer: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020; page 102762

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Watters P, Kayes ASM. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Sarker IH, Salah K. Appspred: predicting context-aware smartphone apps using random forest learning. Internet Things. 2019;8:

Scheffer T. Finding association rules that trade support optimally against confidence. Intell Data Anal. 2005;9(4):381–95.

Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A. A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res. 2020;119:

Shengli S, Ling CX. Hybrid cost-sensitive decision tree, knowledge discovery in databases. In: PKDD 2005, Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Computer Science, volume 3721, 2005.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for covid-19. J Big Data. 2021;8(1):1–54.

Gökhan S, Nevin Y. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. Mastering the game of go with deep neural networks and tree search. nature. 2016;529(7587):484–9.

Ślusarczyk B. Industry 4.0: Are we ready? Polish J Manag Stud. 17, 2018.

Sneath Peter HA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948; 5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 13-17 September, pp. 389–400. ACM, New York, USA. 2014.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015; pages 1–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the KDD CUP 99 data set. In: IEEE Symposium on Computational Intelligence for Security and Defense Applications. IEEE; 2009. p. 1–6.

Tsagkias M. Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR Forum. volume 54. NY, USA: ACM New York; 2021. p. 1–23.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. Icml. 2001;1:577–84.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big data. 2016;3(1):9.

Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Wu C-C, Yen-Liang C, Yi-Hung L, Xiang-Yu Y. Decision tree induction with a constrained number of leaf nodes. Appl Intell. 2016;45(3):673–85.

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inform Syst. 2008;14(1):1–37.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M. Internet of things for smart cities. IEEE Internet Things J. 2014;1(1):22–32.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zheng Y, Rajasegarar S, Leckie C. Parking availability prediction for sensor-enabled car parks in smart cities. In: Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on. IEEE, 2015; pages 1–6.

Zhu H, Cao H, Chen E, Xiong H, Tian J. Exploiting enriched contextual information for mobile app classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012; pages 1617–1621

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), IEEE, 2020; pages 1–7

Zulkernain S, Madiraju P, Ahamed SI. A context aware interruption management system for mobile devices. In: Mobile Wireless Middleware, Operating Systems, and Applications. Springer. 2010; pages 221–234

Zulkernain S, Madiraju P, Ahamed S, Stamm K. A mobile intelligent interruption management system. J UCS. 2010;16(15):2060–80.

Author information

Authors and affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349, Chattogram, Bangladesh

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

About this article

Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN COMPUT. SCI. 2 , 160 (2021). https://doi.org/10.1007/s42979-021-00592-x

Received : 27 January 2021

Accepted : 12 March 2021

Published : 22 March 2021

DOI : https://doi.org/10.1007/s42979-021-00592-x

Keywords

  • Machine learning
  • Deep learning
  • Artificial intelligence
  • Data science
  • Data-driven decision-making
  • Predictive analytics
  • Intelligent applications
  • Open access
  • Published: 16 October 2023

Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network

  • Mario Krenn   ORCID: orcid.org/0000-0003-1620-9207 1 ,
  • Lorenzo Buffoni 2 ,
  • Bruno Coutinho 2 ,
  • Sagi Eppel 3 ,
  • Jacob Gates Foster 4 ,
  • Andrew Gritsevskiy   ORCID: orcid.org/0000-0001-8138-8796 3 , 5 , 6 ,
  • Harlin Lee   ORCID: orcid.org/0000-0001-6128-9942 4 ,
  • Yichao Lu   ORCID: orcid.org/0009-0001-2005-1724 7 ,
  • João P. Moutinho 2 ,
  • Nima Sanjabi   ORCID: orcid.org/0009-0000-6342-5231 8 ,
  • Rishi Sonthalia   ORCID: orcid.org/0000-0002-0928-392X 4 ,
  • Ngoc Mai Tran 9 ,
  • Francisco Valente   ORCID: orcid.org/0000-0001-6964-9391 10 ,
  • Yangxinyu Xie   ORCID: orcid.org/0000-0002-1532-6746 11 ,
  • Rose Yu 12 &
  • Michael Kopp 6  

Nature Machine Intelligence volume  5 ,  pages 1326–1335 ( 2023 ) Cite this article

26k Accesses

3 Citations

1055 Altmetric

Metrics details

  • Complex networks
  • Computer science
  • Research data

A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could profoundly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially over recent years, making it challenging for human researchers to keep track of the progress. Here we use AI techniques to predict the future research directions of AI itself. We introduce a graph-based benchmark based on real-world data—the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 143,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from pure statistical to pure learning methods. Surprisingly, the most powerful methods use a carefully curated set of network features, rather than an end-to-end AI approach. These results indicate a great potential that can be unleashed for purely ML approaches without human knowledge. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools.

Similar content being viewed by others

Accurate structure prediction of biomolecular interactions with AlphaFold 3

Highly accurate protein structure prediction with AlphaFold

Augmenting large language models with chemistry tools

The corpus of scientific literature grows at an ever-increasing speed. Specifically, in the field of artificial intelligence (AI) and machine learning (ML), the number of papers every month is growing exponentially with a doubling rate of roughly 23 months (Fig. 1 ). Simultaneously, the AI community is embracing diverse ideas from many disciplines such as mathematics, statistics and physics, making it challenging to organize different ideas and uncover new scientific connections. We envision a computer program that can automatically read, comprehend and act on AI literature. It can predict and suggest meaningful research ideas that transcend individual knowledge and cross-domain boundaries. If successful, it could greatly improve the productivity of AI researchers, open up new avenues of research and help drive progress in the field.

Figure 1: The doubling rate of papers per month is roughly 23 months, which might lead to problems for publishing in these fields at some point. The categories are cs.AI, cs.LG, cs.NE and stat.ML.

In this work, we address the ambitious vision of developing a data-driven approach to predict future research directions 1 . As new research ideas often emerge from connecting seemingly unrelated concepts 2 , 3 , 4 , we model the evolution of AI literature as a temporal network. We construct an evolving semantic network that encapsulates the content and development of AI research since 1994, with approximately 64,000 nodes (representing individual concepts) and 18 million edges (connecting jointly investigated concepts).

We use the semantic network as an input to ten diverse statistical and ML methods to predict the future evolution of the semantic network with high accuracy. That is, we can predict which combinations of concepts AI researchers will investigate in the future. Being able to predict what scientists will work on is a first crucial step for suggesting new topics that might have a high impact.

Several methods were contributions to the Science4Cast competition hosted by the 2021 IEEE International Conference on Big Data (IEEE BigData 2021). Broadly, we can divide the methods into two classes: methods that use hand-crafted network-theoretical features and those that automatically learn features. We found that models using carefully hand-crafted features outperform methods that attempt to learn features autonomously. This (somewhat surprising) finding indicates a great potential for improvements of models free of human priors.

Our paper introduces a real-world graph benchmark for AI, presents ten methods for solving it, and discusses how this task contributes to the larger goal of AI-driven research suggestions in AI and other disciplines. All methods are available at GitHub 5 .

Semantic networks

The goal here is to extract knowledge from the scientific literature that can subsequently be processed by computer algorithms. At first glance, a natural first step would be to use large language models (such as GPT3 6 , Gopher 7 , MegaTron 8 or PaLM 9 ) on each article to extract concepts and their relations automatically. However, these methods still struggle with reasoning 10 , 11 ; thus, it is not yet clear how these models can be used for identifying and suggesting new ideas and concept combinations.

Rzhetsky et al. 12 pioneered an alternative approach, creating semantic networks in biochemistry from co-occurring concepts in scientific papers. There, nodes represent scientific concepts, specifically biomolecules, and are linked when a paper mentions both in its title or abstract. This evolving network captures the field’s history and, using supercomputer simulations, provides insights into scientists’ collective behaviour and suggests more efficient research strategies 13 . Although creating semantic networks from concept co-occurrences extracts only a small amount of knowledge from each paper, it captures non-trivial and actionable content when applied to large datasets 2 , 4 , 13 , 14 , 15 . PaperRobot extends this approach by predicting new links from large medical knowledge graphs and formulating new ideas in human language as paper drafts 16 .

This approach was applied and extended to quantum physics 17 by building a semantic network of over 6,000 concepts. There, the authors (including one of us) formulated the prediction of new research trends and connections as an ML task, with the goal of identifying concept pairs not yet jointly discussed in the literature but likely to be investigated in the future. This prediction task was one component for personalized suggestions of new research ideas.

Link prediction in semantic networks

We formulate the prediction of future research topics as a link-prediction task in an exponentially growing semantic network in the AI field. The goal is to predict which unconnected nodes, representing scientific concepts not yet jointly researched, will be connected in the future.

Link prediction is a common problem in computer science, addressed with classical metrics and features, as well as ML techniques. Network theory-based methods include local motif-based approaches 18 , 19 , 20 , 21 , 22 , linear optimization 23 , global perturbations 24 and stochastic block models 25 . ML works optimized a combination of predictors 26 , with further discussion in a recent review 27 .

In ref. 17 , 17 hand-crafted features were used for this task. The goal of the Science4Cast competition was to find more precise methods for link prediction in semantic networks, using a semantic network of AI that is ten times larger than the one in ref. 17 .

Potential for idea generation in science

The long-term goal of predictions and suggestions in semantic networks is to provide new ideas to individual researchers. In a way, we hope to build a creative artificial muse in science 28 . The model can be biased or constrained to give topic suggestions related to the research interests of an individual scientist, or of a pair of scientists, suggesting topics for interdisciplinary collaboration.

Generation and analysis of the dataset

Dataset construction.

We create a dynamic semantic network using papers published on arXiv from 1992 to 2020 in the categories cs.AI, cs.LG, cs.NE and stat.ML. The 64,719 nodes represent AI concepts extracted from 143,000 paper titles and abstracts using Rapid Automatic Keyword Extraction (RAKE) and normalized via natural language processing (NLP) techniques and custom methods 29 . Although high-quality taxonomies such as the Computer Science Ontology (CSO) exist 30 , 31 , we choose not to use them for two reasons: the rapid growth of AI and ML may result in new concepts not yet in the CSO, and not all scientific domains have high-quality taxonomies like CSO. Our goal is to build a scalable approach applicable to any domain of science. However, future research could investigate merging these approaches (see ‘Extensions and future work’).

Concepts form the nodes of the semantic network, and edges are drawn when concepts co-appear in a paper title or abstract. Edges have time stamps based on the paper’s publication date, and multiple time-stamped edges between concepts are common. The network is edge-weighted, and the weight of an edge stands for the number of papers that connect two concepts. In total, this creates a time-evolving semantic network, depicted in Fig. 2 .
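To make the construction concrete, the following is a minimal Python sketch of such a time-stamped, edge-weighted co-occurrence network; the paper list, dates and concepts are illustrative stand-ins, not the actual pipeline from our repository.

```python
from itertools import combinations

# Minimal sketch: build a time-stamped, edge-weighted co-occurrence network.
# Each paper is a (date, concepts) pair; concept extraction is assumed to
# have happened already (for example, via RAKE). All data are illustrative.
papers = [
    ("2016-03", {"neural network", "reinforcement learning"}),
    ("2017-11", {"neural network", "transfer learning", "medical domain"}),
    ("2019-06", {"reinforcement learning", "transfer learning"}),
]

edges = {}  # (concept_a, concept_b) -> list of time stamps
for date, concepts in papers:
    for a, b in combinations(sorted(concepts), 2):
        edges.setdefault((a, b), []).append(date)

# The weight of an edge is the number of papers connecting the two concepts.
for (a, b), stamps in edges.items():
    print(f"{a} -- {b}: weight={len(stamps)}, first seen {min(stamps)}")
```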

Figure 2: Utilizing 143,000 AI and ML papers on arXiv from 1992 to 2020, we create a list of concepts using RAKE and other NLP tools, which form nodes in a semantic network. Edges connect concepts that co-occur in titles or abstracts, resulting in an evolving network that expands as more concepts are jointly investigated. The task involves predicting which unconnected nodes (concepts not yet studied together) will connect within a few years. We present ten diverse statistical and ML methods to address this challenge.

Network-theoretical analysis

The published semantic network has 64,719 nodes and 17,892,352 unique undirected edges, with a mean node degree of 553. Many hub nodes greatly exceed this mean degree, as shown in Fig. 3 . For example, the highest node degrees are 466,319 (neural network), 198,050 (deep learning), 195,345 (machine learning), 169,555 (convolutional neural network), 159,403 (real world), 150,227 (experimental result), 127,642 (deep neural network) and 115,334 (large scale). We fit a power-law curve to the degree distribution p ( k ) using ref. 32 and obtained p ( k )  ∝   k ^−2.28 for degree k  ≥ 1,672. However, real complex network degree distributions often follow power laws with exponential cut-offs 33 . Recent work 34 has indicated that lognormal distributions fit most real-world networks better than power laws. Likelihood ratio tests from ref. 32 suggest that truncated power law ( P  = 0.0031), lognormal ( P  = 0.0045) and lognormal positive ( P  = 0.015) fit better than a pure power law, while exponential ( P  = 3 × 10^−10 ) and stretched exponential ( P  = 6 × 10^−5 ) fit worse. We could not conclusively determine the best fit at the P  ≤ 0.1 level.

Figure 3: Degree distribution of the semantic network (log scale). Nodes with the highest (466,319) and lowest (2) non-zero degrees are neural network and video compression technique, respectively. The most frequent non-zero degree is 64 (which occurs 313 times). The plot omits 1,247 nodes with zero degree.
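The heavy-tail analysis above can be sketched with the powerlaw package of ref. 32 ; here, synthetic degrees stand in for the real degree sequence, so the printed numbers are illustrative only.

```python
import numpy as np
import powerlaw

# Minimal sketch of the heavy-tail analysis, using the powerlaw package
# (ref. 32). Synthetic Zipf-distributed degrees stand in for the real ones.
rng = np.random.default_rng(0)
degrees = rng.zipf(2.3, size=10_000)

fit = powerlaw.Fit(degrees, discrete=True)  # estimates xmin and alpha
print(f"alpha = {fit.power_law.alpha:.2f}, xmin = {fit.power_law.xmin}")

# Likelihood-ratio test: R > 0 favours the first distribution, and p
# indicates whether the comparison is statistically significant.
R, p = fit.distribution_compare("power_law", "lognormal")
print(f"power law vs lognormal: R = {R:.3f}, p = {p:.4f}")
```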

We observe changes in network connectivity over time. Although degree distributions remained heavy-tailed, the ordering of nodes within the tail changed due to popularity trends. The most connected nodes and the years they became so include decision tree (1994), machine learning (1996), logic program (2000), neural network (2005), experimental result (2011), machine learning (2013, for a second time) and neural network (2015).

Connected-component analysis in Fig. 4 reveals that the network grew more connected over time, with the largest component expanding and the number of connected components decreasing. The trajectories of mid-sized connected components may expose trends, as in image processing: a connected component with four nodes appeared in 1999 (brightness change, planar curve, local feature, differential invariant), and three more nodes joined in 2000 (similarity transformation, template matching, invariant representation). In 2006, a paper discussing support vector machine and local feature merged this mid-sized group with the largest connected component.

Figure 4: Primary (left, blue) vertical axis: number of connected components with more than one node. Secondary (right, orange) vertical axis: number of nodes in the largest connected component. For example, the network in 2019 comprises one large connected component with 63,472 nodes and 1,247 isolated nodes, that is, nodes with no edges. The 2001 network, in contrast, has 19 connected components with size greater than one, the largest of which has 2,733 nodes.

The semantic network reveals increasing centralization over time, with a smaller percentage of nodes (concepts) contributing to a larger fraction of edges (concept combinations). Figure 5 shows that the fraction of edges for high-degree nodes rises, while it decreases for low-degree nodes. The decreasing average clustering coefficient over time supports this trend, suggesting nodes are more likely to connect to high-degree central nodes. This could be due to the AI community’s focus on a few dominating methods or more consistent terminology use.

Figure 5: Cumulative histogram illustrating the fraction of nodes (concepts) corresponding to the fraction of edges (connections) for given years (1999, 2003, 2007, 2011, 2015 and 2019). The graph was generated by adding edges and nodes dated before each year. Nodes are sorted by increasing degree. The y value at x  = 80 represents the fraction of edges contributed by all nodes in and below the 80th percentile of degrees.

Problem formulation

Broadly, we aim to make predictions in an exponentially growing semantic network. The specific task is to predict which two nodes v1 and v2, each with degree d(v1), d(v2) ≥  c and lacking an edge in the year (2021 −  δ ), will have w edges in 2021. We use δ  = 1, 3, 5, c  = 0, 5, 25 and w  = 1, 3, where c is a minimal degree. Note that c  = 0 is an intriguing special case where the nodes may not have any associated edge in the initial year, requiring the model to predict which nodes will form entirely new connections. The task w  = 3 goes beyond simple link prediction and seeks to identify uninvestigated concept pairs that will appear together in at least three papers. An interesting alternative task could be predicting the fastest-growing links, denoted as ‘trend’ prediction.

In this task, we provide a list of 10 million unconnected node pairs (each node having a degree ≥ c ) for the year (2021 −  δ ), with the goal of sorting this list by descending probability that they will have at least w edges in 2021.

For evaluation, we employ the receiver operating characteristic (ROC) curve 35 , which plots the true-positive rate against the false-positive rate at various threshold settings. We use the area under the ROC curve (AUC) as our evaluation metric. The advantage of AUC over mean square error is its independence from the data distribution. Specifically, in our case, where the two classes are highly asymmetric (only about 1–3% of pairs become newly connected edges) and the distribution changes over time, AUC offers a meaningful interpretation. Perfect predictions yield AUC = 1, whereas random predictions result in AUC = 0.5. AUC is the probability that a randomly chosen true element is ranked higher than a randomly chosen false one. For other metrics, see ref. 36 .
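A minimal sketch of this evaluation protocol, with synthetic scores and labels standing in for a model's ranking and the 2021 ground truth:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Minimal sketch of the evaluation: rank unconnected node pairs by predicted
# probability and compute the ROC AUC. Labels and scores are synthetic.
rng = np.random.default_rng(0)
n = 1_000_000
y_true = rng.random(n) < 0.02            # ~2% positives, as in the text
y_score = 0.8 * rng.random(n) + y_true * rng.random(n)  # imperfect ranking

print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")  # 1.0 perfect, 0.5 random
```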

To tackle this task, models can use the complete information of the semantic network from the year (2021 −  δ ) in any way possible. In our case, all presented models generate a dataset for learning to make predictions from (2021 − 2 δ ) to (2021 −  δ ). Once the models successfully complete this task, they are applied to the test dataset to make predictions from (2021 −  δ ) to 2021. All reported AUCs are based on the test dataset. Note that solving the test dataset is especially challenging due to the δ -year shift, causing systematic changes such as the number of papers and density of the semantic network.

AI-based solutions

We demonstrate various methods to predict new links in a semantic network, ranging from pure statistical approaches and neural networks with hand-crafted features (NF) to ML models without NF. The results are shown in Fig. 6 , with the highest AUC scores achieved by methods using NF as ML model inputs. Pure network features without ML are competitive, while pure ML methods have yet to outperform those with NF. Predicting links generated at least three times can achieve a quasi-deterministic AUC > 99.5%, suggesting an interesting target for computational sociology and science of science research. We have performed numerous tests to exclude data leakage in the benchmark dataset, overfitting, and data duplication in both the set of articles and the set of concepts. We rank methods by their performance, with model M1 the best performing and model M8 the least effective (for the prediction of a new edge with δ  = 3, c  = 0). Models M4 and M7 are subdivided into M4A, M4B, M7A and M7B, differing in their focus on feature or embedding selection (more details in Methods ).

Figure 6: AUC values for different models that use machine learning techniques (ML), hand-crafted network features (NF) or a combination thereof. The left plot shows results for the prediction of a single new link (that is, w  = 1) and the right plot shows results for the prediction of new triple links ( w  = 3). The task is to predict δ  = [1, 3, 5] years into the future, with cut-off values c  = [0, 5, 25]. We sort the models by the results for the task ( w  = 1,  δ  = 3,  c  = 0), which was the task in the Science4Cast competition. Data points that are not shown have an AUC below 0.6 or were not computed due to computational costs. All AUC values reported are computed on a validation dataset δ years ahead of the training dataset that the models have never seen. Note that the prediction of new triple edges can be performed nearly deterministically. It will be interesting to understand the origin of this quasi-deterministic pattern in AI research, for example, by connecting it to the research interests of scientists 88 .

Model M1: NF + ML. This approach combines tree-based gradient boosting with graph neural networks, using extensive feature engineering to capture node centralities, proximity and temporal evolution 37 . The Light Gradient Boosting Machine (LightGBM) model 38 is employed with heavy regularization to combat overfitting due to the scarcity of positive examples, while a time-aware graph neural network learns dynamic node representations.
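As a rough illustration of the gradient-boosting half of M1, the following sketch trains a heavily regularized LightGBM classifier on stand-in pair features; the feature matrix and all hyperparameter values are illustrative assumptions, not the configuration of ref. 37 .

```python
import numpy as np
from lightgbm import LGBMClassifier

# Minimal sketch of M1's tree-based half: gradient boosting over hand-crafted
# pair features (for example, degrees, PageRank, Jaccard index), regularized
# against the scarcity of positive examples. All values are illustrative.
rng = np.random.default_rng(0)
X = rng.random((50_000, 10))
y = (X[:, 0] * X[:, 1] + 0.1 * rng.random(50_000)) > 0.55

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    reg_alpha=1.0,                                   # L1 regularization
    reg_lambda=5.0,                                  # L2 regularization
    min_child_samples=100,
    scale_pos_weight=(len(y) - y.sum()) / y.sum(),   # class-imbalance weight
)
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]                # ranking scores for AUC
```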

Model M2: NF + ML. This method utilizes node and edge features (as well as their first and second derivatives) to predict link formation probabilities 39 . Node features capture popularity, and edge features measure similarity. A multilayer perceptron with rectified linear unit (ReLU) activation is used for learning. Cold start issues are addressed with feature imputation.

Model M3: NF + ML. This method captures hand-crafted node features over multiple time snapshots and employs a long short-term memory (LSTM) to learn time dependencies 40 . The features were selected to be highly informative while having a low computational cost. The final configuration uses degree centrality, degree of neighbours and common neighbours as features. The LSTM outperforms fully connected neural networks.

Model M4: pure NF. Two purely statistical methods, preferential attachment 41 and common neighbours 27 , are used 42 . Preferential attachment is based on node degrees, while common neighbours relies on the number of shared neighbours. Both methods are computationally inexpensive and perform competitively with some learning-based models.

Model M5: NF + ML. Here, ten groups of first-order graph features are extracted to obtain neighbourhood and similarity properties, with principal component analysis 43 applied for dimensionality reduction 44 . A random forest classifier is trained on the balanced dataset to predict new links.

Model M6: NF + ML. The baseline solution uses 15 hand-crafted features as input to a four-layer neural network, predicting the probability of link formation between node pairs 17 .

Model M7: end-to-end ML (auto node embedding). The baseline solution is modified to use node2vec 45 and ProNE embeddings 46 instead of hand-crafted features. The embeddings are input to a neural network with two hidden layers for link prediction.

Model M8: end-to-end ML (transformers). This method learns features in an unsupervised manner using transformers 47 . Node2vec embeddings 45 , 48 are generated for various snapshots of the adjacency matrix, and a transformer model 49 is pre-trained as a feature extractor. A two-layer ReLU network is used for classification.

Extensions and future work

Developing an AI that suggests research topics to scientists is a complex task, and our link-prediction approach in temporal networks is just the beginning. We highlight key extensions and future work directly related to the ultimate goal of AI for AI.

High-quality predictions without feature engineering. Interestingly, the most effective methods utilized carefully crafted features on a graph with extracted concepts as nodes and edges representing their joint publication history. Investigating whether end-to-end deep learning can solve tasks without feature engineering will be a valuable next step.

Fully automated concept extraction. Current concept lists, generated by RAKE’s statistical text analysis, demand time-consuming code development to address irrelevant term extraction (for example, verbs, adjectives). A fully automated NLP technique that accurately extracts meaningful concepts without manual code intervention would greatly enhance the process.

Leveraging ontology taxonomies. Alongside fully automated concept extraction, utilizing established taxonomies such as the CSO 30 , 31 , Wikipedia-extracted concepts, book indices 17 or PhySH key phrases is crucial. Although not comprehensive for all domains, these curated datasets often contain hierarchical and relational concept information, greatly improving prediction tasks.

Incorporating relation extraction. Future work could explore relation extraction techniques for constructing more accurate, sparser semantic networks. By discerning and classifying meaningful concept relationships in abstracts 50 , 51 , a refined AI literature representation is attainable. Using NLP tools for entity recognition, relationship identification and classification, this approach may enhance prediction performance and novel research direction identification.

Generation of new concepts. Our work predicts links between known concepts, but generating new concepts using AI remains a challenge. This unsupervised task, as explored in refs. 52 , 53 , involves detecting concept clusters with dynamics that signal new concept formation. Incorporating emerging concepts into the current framework for suggesting research topics is an intriguing future direction.

Semantic information beyond concept pairs. Currently, abstracts and titles are compressed into concept pairs, but more comprehensive information extraction could yield meaningful predictions. Exploring complex data structures such as hypergraphs 54 may be computationally demanding, but clever tricks could reduce complexity, as shown in ref. 55 . Investigating sociological factors or drawing inspiration from material science approaches 56 may also improve prediction tasks. A recent dataset for the study of the science of science also includes more complex data structures than the ones used in our paper, including data from social networks such as Twitter 57 .

Predictions of scientific success. While predicting new links between concepts is valuable, assessing their potential impact is essential for high-quality suggestions. Introducing a metric of success, like estimated citation numbers or citation growth rate, can help gauge the importance of these connections. Adapting citation prediction techniques from the science of science 58 , 59 , 60 , 61 to semantic networks offers a promising research direction.

Anomaly detections. Predicting likely connections may not align with finding surprising research directions. One method for identifying surprising suggestions involves constraining cosine similarity between vertices 62 , which measures shared neighbours and can be associated with semantic (dis)similarity. Another approach is detecting anomalies in semantic networks, which are potential links with extreme properties 63 , 64 . While scientists often focus on familiar topics 3 , 4 , greater impact results from unexpected combinations of distant domains 12 , encouraging the search for surprising associations.

End-to-end formulation. Our method breaks down the goal of extracting knowledge from scientific literature into subtasks, contrasting with end-to-end deep learning that tackles problems directly without subproblems 65 , 66 . End-to-end approaches have shown great success in various domains 67 , 68 , 69 . Investigating whether such an end-to-end solution can achieve similar success in our context would be intriguing.

Our method represents a crucial step towards developing a tool that can assist scientists in uncovering novel avenues for exploration. We are confident that our outlined ideas and extensions pave the way for achieving practical, personalized, interdisciplinary AI-based suggestions for new impactful discoveries. We firmly believe that such a tool holds the potential to become an influential catalyst, transforming the way scientists approach research questions and collaborate in their respective fields.

Details on concept set generation and application

In this section, we provide details on the generation of our list of 64,719 concepts. For more information, the code is accessible on GitHub . The entire approach is designed for immediate scalability to other domains.

Initially, we utilized approximately 143,000 arXiv papers from the categories cs.AI, cs.LG, cs.NE and stat.ML spanning 1992 to 2020. The omission of earlier data has a negligible effect on our research question, as we show below. We then iterated over each individual article, employing RAKE (with an extended stopword list) to suggest concept candidates, which were subsequently stored.

Following the iteration, we retained concepts composed of at least two words (for example, neural network) appearing in six or more articles, as well as concepts comprising a minimum of three words (for example, recurrent neural network) appearing in three or more articles. This initial filter substantially reduced noise generated by RAKE, resulting in a list of 104,948 concepts.

Lastly, we developed an automated filtering tool to further enhance the quality of the concept list. This tool identified common, domain-independent errors made by RAKE, which primarily included phrases that were not concepts (for example, dataset provided or discuss open challenge). We compiled a list of 543 words not part of meaningful concepts, including verbs, ordinal numbers, conjunctions and adverbials. Ultimately, this process produced our final list of 64,719 concepts employed in our study. No further semantic concept/entity linking is applied.
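A minimal sketch of this two-stage filter follows; the candidate lists and the two-word stopword set are illustrative stand-ins, and the real 543-word list lives in the repository.

```python
from collections import Counter

# Minimal sketch of the frequency filter. `candidates_per_paper` holds the
# RAKE-suggested phrases per paper; all data here are illustrative.
candidates_per_paper = [
    ["neural network", "dataset provided", "recurrent neural network"],
    ["neural network", "recurrent neural network", "discuss open challenge"],
    ["neural network", "recurrent neural network", "image classification"],
]

counts = Counter(c for paper in candidates_per_paper for c in set(paper))

def keep(concept: str) -> bool:
    n_words, n_papers = len(concept.split()), counts[concept]
    # Keep >= 2-word phrases in >= 6 papers, or >= 3-word phrases in >= 3.
    return (n_words >= 2 and n_papers >= 6) or (n_words >= 3 and n_papers >= 3)

bad_words = {"provided", "discuss"}   # stand-in for the 543-word filter list
concepts = [c for c in counts if keep(c) and not bad_words & set(c.split())]
print(concepts)                       # ['recurrent neural network']
```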

By this construction, the test sets with c  = 0 could lead to very rare contamination of the dataset. That is because each concept will have at least one edge in the final dataset. The effects, however, are negligible.

The distribution of concepts in the articles can be seen in Extended Data Fig. 1 . As an example, we show the extraction of concepts from five randomly chosen papers:

Memristor hardware-friendly reinforcement learning 70 : ‘actor critic algorithm’, ‘neuromorphic hardware implementation’, ‘hardware neural network’, ‘neuromorphic hardware system’, ‘neural network’, ‘large number’, ‘reinforcement learning’, ‘case study’, ‘pre training’, ‘training procedure’, ‘complex task’, ‘high performance’, ‘classical problem’, ‘hardware implementation’, ‘synaptic weight’, ‘energy efficient’, ‘neuromorphic hardware’, ‘control theory’, ‘weight update’, ‘training technique’, ‘actor critic’, ‘nervous system’, ‘inverted pendulum’, ‘explicit supervision’, ‘hardware friendly’, ‘neuromorphic architecture’, ‘hardware system’.

Automated deep learning analysis of angiography video sequences for coronary artery disease 71 : ‘deep learning approach’, ‘coronary artery disease’, ‘deep learning analysis’, ‘traditional image processing’, ‘deep learning’, ‘image processing’, ‘f1 score’, ‘video sequence’, ‘error rate’, ‘automated analysis’, ‘coronary artery’, ‘vessel segmentation’, ‘key frame’, ‘visual assessment’, ‘analysis method’, ‘analysis pipeline’, ‘coronary angiography’, ‘geometrical analysis’.

Demographic influences on contemporary art with unsupervised style embeddings 72 : ‘classification task’, ‘social network’, ‘data source’, ‘visual content’, ‘graph network’, ‘demographic information’, ‘social connection’, ‘visual style’, ‘historical dataset’, ‘novel information’

The utility of general domain transfer learning for medical language tasks 73 : ‘natural language processing’, ‘long short term memory’, ‘logistic regression model’, ‘transfer learning technique’, ‘short term memory’, ‘average f1 score’, ‘class classification model’, ‘domain transfer learning’, ‘weighted average f1 score’, ‘medical natural language processing’, ‘natural language process’, ‘transfer learning’, ‘f1 score’, ‘natural language’, ‘deep model’, ‘logistic regression’, ‘model performance’, ‘classification model’, ‘text classification’, ‘regression model’, ‘nlp task’, ‘short term’, ‘medical domain’, ‘weighted average’, ‘class classification’, ‘bert model’, ‘language processing’, ‘biomedical domain’, ‘domain transfer’, ‘nlp model’, ‘main model’, ‘general domain’, ‘domain model’, ‘medical text’.

Fast neural architecture construction using envelopenets 74 : ‘neural network architecture’, ‘neural architecture search’, ‘deep network architecture’, ‘image classification problem’, ‘neural architecture search method’, ‘neural network’, ‘reinforcement learning’, ‘deep network’, ‘image classification’, ‘objective function’, ‘network architecture’, ‘classification problem’, ‘evolutionary algorithm’, ‘neural architecture’, ‘base network’, ‘architecture search’, ‘training epoch’, ‘search method’, ‘image class’, ‘full training’, ‘automated search’, ‘generated network’, ‘constructed network’, ‘gpu day’.

Time gap between the generation of edges

We use articles from arXiv, which only goes back to the year 1992. The field of AI, however, has existed since at least the 1960s 75 . This raises the question of whether the omission of the first 30–40 years of research has a crucial impact on the prediction task we formulate; specifically, whether edges that we consider new might not be so new after all. In Extended Data Fig. 2, we therefore compute the time between the formation of edges between the same concepts, taking into account all edges or just the first. We see that the vast majority of edges are formed within short time periods; thus, the omission of early publications has a negligible effect on our question. Of course, different questions might be crucially affected by the early data; a careful choice of the data source is therefore crucial 61 .

Positive examples in the test dataset

Table 1 shows the number of positive cases within the 10 million examples in the 18 test datasets that are used for evaluation.

Publication rates in quantum physics

Another field of research that has gained a lot of attention in recent years is quantum physics, which is also a strong adopter of arXiv. We therefore analyse it in the same way as AI in Fig. 1. In Extended Data Fig. 3 we find no obvious exponential increase in papers per month. A detailed analysis of other domains is beyond the current scope. It will be interesting to investigate the growth rates of different scientific disciplines in more detail, especially given that exponential increases have been observed in several aspects of the science of science 3 , 76 .

Details on models M1–M8

What follows are more detailed explanations of the models presented in the main text. All code is available on GitHub. The feature importance of the best model, M1, is shown here; those of the other models are analysed in the respective workshop contributions (cited in the subsections).

Details on M1

The best-performing solution is based on a blend of a tree-based gradient boosting approach and a graph neural network approach 37 . Extensive feature engineering was conducted to capture the centralities of the nodes, the proximity between node pairs and their evolution over time. The centrality of a node is captured by the number of neighbours and the PageRank score 77 , while the proximity between a node pair is derived using the Jaccard index. We refer the reader to ref. 37 for the list of all features and their feature importance.

The tree-based gradient boosting approach uses LightGBM 38 and applies heavy regularization to combat overfitting due to the scarcity of positive samples. The graph neural network approach employs a time-aware graph neural network to learn node representations on dynamic semantic networks. The feature importance of model M1, averaged over the 18 datasets, is shown in Table 2 . It shows that the temporal features contribute substantially to the model performance, but the model remains strong even when they are removed. An example of the evolution of the training (from 2016 to 2019) and test set (2019 to 2021) AUC for δ  = 3, c  = 25, w  = 1 is shown in Extended Data Fig. 4 .

Details on M2

The second method assumes that the probability that nodes u and v form an edge in the future is a function of the node features f ( u ), f ( v ) and some edge feature h ( u ,  v ). We chose node features f that capture popularity at the current time t 0 (such as degree, clustering coefficient 78 , 79 and PageRank 77 ). We also use these features’ first and second time derivatives to capture the evolution of the node’s popularity over time. After variable selection during training, we chose h to consist of the HOP-rec score (high-order proximity for implicit recommendation) 80 , 81 and a variation of the Dice similarity score 82 as a measure of similarity between nodes. In summary, we use 31 node features for each node, and two edge features, which gives 31 × 2 + 2 = 64 features in total. These features are then fed into a small multilayer perceptron (5 layers, each with 13 neurons) with ReLU activation.

Cold start is the problem that some nodes in the test set do not appear in the training set. Our strategy for a cold start is imputation. We say a node v is seen if it appeared in the training data, and unseen otherwise; similarly, we say that a node is born at time t if t is the first time stamp where an edge linking this node has appeared. The idea is that an unseen node is simply a node born in the future, so its features should look like a recently born node in the training set. If a node is unseen, then we impute its features as the average of the features of the nodes born recently. We found that with imputation during training, the test AUC scores across all models consistently increased by about 0.02. For a complete description of this method, we refer the reader to ref. 39 .
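A minimal sketch of this imputation, with illustrative feature vectors, birth years and cutoff:

```python
import numpy as np

# Minimal sketch of cold-start imputation: an unseen node is treated as one
# "born in the future", so its features are imputed as the average features
# of recently born nodes. All data here are illustrative.
features = {"node_a": np.array([3.0, 0.2]), "node_b": np.array([9.0, 0.5])}
birth_year = {"node_a": 2019, "node_b": 2014}
recent_cutoff = 2018

recent = [features[v] for v, year in birth_year.items() if year >= recent_cutoff]
imputed = np.mean(recent, axis=0)

def get_features(v: str) -> np.ndarray:
    return features.get(v, imputed)

print(get_features("unseen_node"))    # falls back to the imputed average
```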

Details on M3

This approach, detailed in ref. 40 , uses hand-crafted node features that have been captured in multiple time snapshots (for example, every year) and then uses an LSTM to benefit from learning the time dependencies of these features. The final configuration uses two main types of feature: node features including degree and degree of neighbours, and edge features including common neighbours. In addition, to balance the training data, the same number of positive and negative instances have been randomly sampled and combined.

One of the goals was to identify features that are very informative with a very low computational cost. We found that the degree centrality of the nodes is the most important feature, and the degree centrality of the neighbouring nodes and the degree of mutual neighbours gave us the best trade-off. As all of the extracted features’ distributions are highly skewed to the right, meaning most of the features take near zero values, using a power transform such as Yeo–Johnson 83 helps to make the distributions more Gaussian, which boosts the learning. Finally, for the link-prediction task, we saw that LSTMs perform better than fully connected neural networks.
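A minimal sketch of this transformation step with scikit-learn's PowerTransformer; the synthetic right-skewed features stand in for the real ones.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Minimal sketch of the Yeo-Johnson step: right-skewed, near-zero features
# are transformed to be more Gaussian before learning. Data are synthetic.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=0.1, size=(10_000, 3))

pt = PowerTransformer(method="yeo-johnson", standardize=True)
gaussianized = pt.fit_transform(skewed)

# Compare third central moments (a rough skew indicator) before and after.
print(np.mean((skewed - skewed.mean(0)) ** 3, axis=0))
print(np.mean((gaussianized - gaussianized.mean(0)) ** 3, axis=0))
```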

Details on M4

The following two methods are based on a purely statistical analysis of the test data and are explained in detail in ref. 42 .

Preferential attachment. In the network analysis, we concluded that the growth of this dataset tends to maintain a heavy-tailed degree distribution, often associated with scale-free networks. As mentioned before, the γ value of the degree distribution is very close to 2, suggesting that preferential attachment 41 is probably the main organizational principle of the network. As such, we implemented a simple prediction model following this procedure. Preferential attachment scores in link prediction are often quantified as

s(i, j) = k_i k_j,

with k_i and k_j the degrees of nodes i and j . However, this assumes the scoring of links between nodes that are already connected to the network, that is, k_i, k_j > 0, which is not the case for all the links we must score in the dataset. As a result, we define our preferential attachment model as

s(i, j) = (k_i + 1)(k_j + 1),

which is well defined even for zero-degree nodes. Using this simple model with no free parameters, we could score new links and compare them with the other models. We immediately note that preferential attachment outperforms some learning-based models; although it never reaches the top AUC, it is extremely simple and has negligible computational cost.
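A minimal sketch of this shifted preferential-attachment score on a toy adjacency matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Minimal sketch of the shifted preferential-attachment score: the +1 keeps
# the score well defined for zero-degree nodes. The graph is a toy example.
A = csr_matrix(np.array([[0, 1, 1, 0, 0],
                         [1, 0, 1, 1, 0],
                         [1, 1, 0, 0, 0],
                         [0, 1, 0, 0, 0],
                         [0, 0, 0, 0, 0]]))
degree = np.asarray(A.sum(axis=1)).ravel()   # [2, 3, 2, 1, 0]

def pa_score(i: int, j: int) -> int:
    return (degree[i] + 1) * (degree[j] + 1)

candidates = [(0, 3), (0, 4), (3, 4)]         # unconnected pairs
ranked = sorted(candidates, key=lambda p: -pa_score(*p))
print(ranked)                                 # highest-degree pairs first
```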

Common neighbours. We explore another network-based approach to score the links. Although the preferential attachment model we derived performed well, it uses no information about the distance between i and j , which is a popular feature in link-prediction methods 27 . As such, we decided to test a method known as common neighbours 18 . We define Γ ( i ) as the set of neighbours of node i and Γ ( i ) ∩  Γ ( j ) as the set of common neighbours between nodes i and j . We can then score the node pairs with

s(i, j) = |Γ(i) ∩ Γ(j)|,

the intuition being that nodes that share a larger number of neighbours are more likely to be connected than distant nodes that do not share any.

This score can be computed for every pair ( i ,  j ) as the corresponding entry of the second power of the adjacency matrix, A^2. Evaluating it on the dataset of unconnected pairs, we obtained an AUC that is sometimes higher and sometimes lower than that of preferential attachment, but consistently close to the best learning-based models.
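A minimal sketch of this computation via the squared adjacency matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Minimal sketch of the common-neighbours score: (A @ A)[i, j] counts the
# length-two walks between i and j, i.e. their shared neighbours.
A = csr_matrix(np.array([[0, 1, 1, 0],
                         [1, 0, 1, 1],
                         [1, 1, 0, 0],
                         [0, 1, 0, 0]]))
A2 = (A @ A).toarray()

i, j = 0, 3                       # an unconnected pair
print("common neighbours:", A2[i, j])   # 1 (node 1 is shared)
```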

Details on M5

This method is based on ref. 44 . First, ten groups of first-order graph features are extracted to get some neighbourhood and similarity properties from each pair of nodes: degree centrality of nodes, pair’s total number of neighbours, common neighbours index, Jaccard coefficient, Simpson coefficient, geometric coefficient, cosine coefficient, Adamic–Adar index, resource allocation index and preferential attachment index. They are obtained for three consecutive years to capture the temporal dynamics of the semantic network, leading to a total of 33 features. Second, principal component analysis 43 is applied to reduce the correlation between features, speed up the learning process and improve generalization, which results in a final set of seven latent variables. Lastly, a random forest classifier is trained (using a balanced dataset) to estimate the likelihood of new links between the AI concepts.
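A minimal sketch of this pipeline with scikit-learn; the feature matrix and labels are random stand-ins for the real balanced dataset of 33 features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Minimal sketch of M5: correlated first-order graph features are compressed
# with PCA to seven latent variables, then a random forest predicts links.
rng = np.random.default_rng(0)
X = rng.random((20_000, 33))          # stand-in for the 33 features
y = rng.random(20_000) < 0.5          # balanced labels, as in the text

clf = make_pipeline(PCA(n_components=7),
                    RandomForestClassifier(n_estimators=200, random_state=0))
clf.fit(X, y)
link_probability = clf.predict_proba(X[:5])[:, 1]
```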

In this paper, a modification was made relative to the original formulation of the method 44 : two of the original features, average neighbour degree and clustering coefficient, were infeasible to extract for some of the tasks covered here, as their computation can be heavy for such a large network, and they were discarded. Due to computational memory issues, it was not possible to run the model for some of the tasks covered in this study, and so those results are missing.

Details on M6

The baseline solution for the Science4Cast competition was closely related to the model presented in ref. 17 . It uses 15 hand-crafted features of a pair of nodes v1 and v2: the degrees of v1 and v2 in the current year and the previous two years (six properties); the total numbers of neighbours of v1 and v2 in the current year and the previous two years (six properties); and the number of shared neighbours between v1 and v2 in the current year and the previous two years (three properties). These 15 features are the input of a neural network with four layers (15, 100, 10 and 1 neurons), intended to predict whether the nodes v1 and v2 will have w edges in the future. After training, the model computes the probability for all 10 million evaluation examples; this list is sorted and the AUC is computed.
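A minimal sketch of this baseline; scikit-learn's MLPClassifier stands in for the original four-layer network, and the features and labels are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Minimal sketch of the M6 baseline: 15 hand-crafted pair features feed a
# small network (15 -> 100 -> 10 -> 1) that outputs a link probability.
rng = np.random.default_rng(0)
X = rng.random((30_000, 15))          # stand-in for the 15 features
y = (X[:, :3].sum(axis=1) + 0.2 * rng.random(30_000)) > 1.8

net = MLPClassifier(hidden_layer_sizes=(100, 10), activation="relu",
                    max_iter=50, random_state=0)
net.fit(X, y)

# Score all evaluation pairs and sort by descending probability for the AUC.
probabilities = net.predict_proba(X)[:, 1]
ranking = np.argsort(-probabilities)
```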

Details on M7

The solution M7 was not part of the Science4Cast competition and was therefore not described in the corresponding proceedings; thus, we add more details here.

The most immediate way to apply ML to this problem is by automating the detection of features. Quite simply, the baseline solution M6 is modified such that, instead of 15 hand-crafted features, the neural network is trained on features extracted from a graph embedding. We use two different embedding approaches. The first employs node2vec (M7A) 45 , for which we use the implementation provided in the nodevectors Python package 84 . The second uses the ProNE embedding (M7B) 46 , which is based on sparse matrix factorizations modulated by the higher-order Cheeger inequality 85 .

The embeddings generate a 32-dimensional representation for each node, resulting in edge representations in [0, 1]^64. These features are input into a neural network with two hidden layers of size 1,000 and 30. Like M6, the model computes the probability for evaluation examples to determine the ROC. We compare ProNE to node2vec, a common graph embedding method using a biased random walk procedure with return and in–out parameters, which greatly affect network encoding. Initial experiments used default values for a 64-dimensional encoding before inputting into the neural network. The higher variance in node2vec predictions is probably due to its sensitivity to hyperparameters. While ProNE is better suited for general multi-dataset link prediction, node2vec’s sensitivity may help identify crucial network features for predicting temporal evolution.
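A minimal sketch of the M7B embedding step with the nodevectors package mentioned above; the graph is synthetic, and the exact API usage (fit/predict) follows the package documentation as we understand it.

```python
import networkx as nx
import numpy as np
from nodevectors import ProNE

# Minimal sketch of M7B: embed every node with ProNE, then concatenate the
# two node vectors into a 64-dimensional edge representation. The graph is
# a synthetic stand-in for the semantic network.
G = nx.barabasi_albert_graph(1_000, 5, seed=0)

embedder = ProNE(n_components=32)
embedder.fit(G)

def edge_features(u: int, v: int) -> np.ndarray:
    return np.concatenate([embedder.predict(u), embedder.predict(v)])

print(edge_features(0, 1).shape)      # (64,) -> input to the neural network
```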

Details on M8

This model, which is detailed in ref. 47 , does not use any hand-crafted features but learns them in a completely unsupervised manner. To do so, we extract various snapshots of the adjacency matrix through time, capturing graphs in the form of A_t for t  = 1994, …, 2019. We then embed each of these graphs into 128-dimensional Euclidean space via node2vec 45 , 48 . For each node u in the semantic graph, we extract different 128-dimensional vector embeddings n_u(A_1994), …, n_u(A_2019).

Transformers have performed extremely well in NLP tasks 49 ; thus, we apply them to learn the dynamics of the embedding vectors. We pre-train a transformer to help classify node pairs. For the transformer, the encoder and decoder had 6 layers each; we used 128 as the embedding dimension, 2,048 as the feed-forward dimension and 8-headed attention. This transformer acts as our feature extractor. Once we pre-train our transformer, we add a two-layer ReLU network with hidden dimension 128 as a classifier on top.
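A minimal sketch of the M8 architecture in PyTorch: the transformer dimensions follow the text, whereas the way the pair of yearly embedding sequences is fed to the model (and the pooling) is our own illustrative assumption.

```python
import torch
import torch.nn as nn

# Minimal sketch of M8: a transformer over sequences of yearly node2vec
# embeddings, followed by a two-layer ReLU classification head. Training
# and the unsupervised pre-training step are omitted.
transformer = nn.Transformer(
    d_model=128, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, batch_first=True,
)
head = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))

# A batch of 4 node pairs: each node has 26 yearly embeddings (1994-2019).
src = torch.randn(4, 26, 128)         # embeddings of node u over time
tgt = torch.randn(4, 26, 128)         # embeddings of node v over time

features = transformer(src, tgt)      # (4, 26, 128)
logits = head(features.mean(dim=1))   # pool over time, then classify
print(torch.sigmoid(logits).shape)    # (4, 1) link probabilities
```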

Data availability

All 18 datasets tested in this paper are available via Zenodo at https://doi.org/10.5281/zenodo.7882892 ref. 86 .

Code availability

All of the models and codes described above can be found via GitHub at https://github.com/artificial-scientist-lab/FutureOfAIviaAI ref. 5 and a permanent Zenodo record at https://zenodo.org/record/8329701 ref. 87 .

Clauset, A., Larremore, D. B. & Sinatra, R. Data-driven predictions in the science of science. Science 355 , 477–480 (2017).


Evans, J. A. & Foster, J. G. Metaknowledge. Science 331 , 721–725 (2011).


Fortunato, S. et al. Science of science. Science 359 , eaao0185 (2018).

Wang, D. & Barabási, A.-L. The Science of Science (Cambridge Univ. Press, 2021).

Krenn, M. et al. FutureOfAIviaAI. GitHub https://github.com/artificial-scientist-lab/FutureOfAIviaAI (2023).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33 , 1877–1901 (2020).


Rae, J. W. et al. Scaling language models: methods, analysis & insights from training gopher. Preprint at https://arxiv.org/abs/2112.11446 (2021).

Smith, S. et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. Preprint at https://arxiv.org/abs/2201.11990 (2022).

Chowdhery, A. et al. Palm: scaling language modeling with pathways. Preprint at https://arxiv.org/abs/2204.02311 (2022).

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Preprint at https://arxiv.org/abs/2205.11916 (2022).

Zhang, H., Li, L. H., Meng, T., Chang, K.-W. & Broeck, G. V. d. On the paradox of learning to reason from data. Preprint at https://arxiv.org/abs/2205.11502 (2022).

Rzhetsky, A., Foster, J. G., Foster, I. T. & Evans, J. A. Choosing experiments to accelerate collective discovery. Proc. Natl Acad. Sci. USA 112 , 14569–14574 (2015).

Foster, J. G., Rzhetsky, A. & Evans, J. A. Tradition and innovation in scientists’ research strategies. Am. Sociol. Rev. 80 , 875–908 (2015).

Van Eck, N. J. & Waltman, L. Text mining and visualization using vosviewer. Preprint at https://arxiv.org/abs/1109.2058 (2011).

Van Eck, N. J. & Waltman, L. in Measuring Scholarly Impact: Methods and Practice (eds Ding, Y. et al.) 285–320 (Springer, 2014).

Wang, Q. et al. Paperrobot: Incremental draft generation of scientific ideas. Preprint at https://arxiv.org/abs/1905.07870 (2019).

Krenn, M. & Zeilinger, A. Predicting research trends with semantic and neural networks with an application in quantum physics. Proc. Natl Acad. Sci. USA 117 , 1910–1916 (2020).

Liben-Nowell, D. & Kleinberg, J. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58 , 1019–1031 (2007).

Albert, I. & Albert, R. Conserved network motifs allow protein–protein interaction prediction. Bioinformatics 20 , 3346–3352 (2004).

Zhou, T., Lü, L. & Zhang, Y.-C. Predicting missing links via local information. Eur. Phys. J. B 71 , 623–630 (2009).


Kovács, I. A. et al. Network-based prediction of protein interactions. Nat. Commun. 10 , 1240 (2019).

Muscoloni, A., Abdelhamid, I. & Cannistraci, C. V. Local-community network automata modelling based on length-three-paths for prediction of complex network structures in protein interactomes, food webs and more. Preprint at bioRxiv https://doi.org/10.1101/346916 (2018).

Pech, R., Hao, D., Lee, Y.-L., Yuan, Y. & Zhou, T. Link prediction via linear optimization. Physica A 528 , 121319 (2019).

Lü, L., Pan, L., Zhou, T., Zhang, Y.-C. & Stanley, H. E. Toward link predictability of complex networks. Proc. Natl Acad. Sci. USA 112 , 2325–2330 (2015).

Guimerà, R. & Sales-Pardo, M. Missing and spurious interactions and the reconstruction of complex networks. Proc. Natl Acad. Sci. USA 106 , 22073–22078 (2009).

Ghasemian, A., Hosseinmardi, H., Galstyan, A., Airoldi, E. M. & Clauset, A. Stacking models for nearly optimal link prediction in complex networks. Proc. Natl Acad. Sci. USA 117 , 23393–23400 (2020).

Zhou, T. Progresses and challenges in link prediction. iScience 24 , 103217 (2021).

Krenn, M. et al. On scientific understanding with artificial intelligence. Nat. Rev. Phys. 4 , 761–769 (2022).

Rose, S., Engel, D., Cramer, N. & Cowley, W. in Text Mining: Applications and Theory (eds Berry, M. W. & Kogan, J.) Ch. 1 (Wiley, 2010).

Salatino, A. A., Thanapalasingam, T., Mannocci, A., Osborne, F. & Motta, E. The computer science ontology: a large-scale taxonomy of research areas. In Proc. Semantic Web–ISWC 2018: 17th International Semantic Web Conference Part II Vol. 17, 187–205 (Springer, 2018).

Salatino, A. A., Osborne, F., Thanapalasingam, T. & Motta, E. The CSO classifier: ontology-driven detection of research topics in scholarly articles. In Proc. Digital Libraries for Open Knowledge: 23rd International Conference on Theory and Practice of Digital Libraries Vol. 23, 296–311 (Springer, 2019).

Alstott, J., Bullmore, E. & Plenz, D. powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE 9 , e85777 (2014).

Fenner, T., Levene, M. & Loizou, G. A model for collaboration networks giving rise to a power-law distribution with an exponential cutoff. Soc. Netw. 29 , 70–80 (2007).

Broido, A. D. & Clauset, A. Scale-free networks are rare. Nat. Commun. 10 , 1017 (2019).

Fawcett, T. ROC graphs: notes and practical considerations for researchers. Pattern Recognit. Lett. 31 , 1–38 (2004).

Sun, Y., Wong, A. K. & Kamel, M. S. Classification of imbalanced data: a review. Int. J. Pattern Recognit. Artif. Intell. 23 , 687–719 (2009).

Lu, Y. Predicting research trends in artificial intelligence with gradient boosting decision trees and time-aware graph neural networks. In 2021 IEEE International Conference on Big Data (Big Data) 5809–5814 (IEEE, 2021).

Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In Proc. 31st International Conference on Neural Information Processing Systems 3149–3157 (Curran Associates Inc., 2017).

Tran, N. M. & Xie, Y. Improving random walk rankings with feature selection and imputation Science4Cast competition, team Hash Brown. In 2021 IEEE International Conference on Big Data (Big Data) 5824–5827 (IEEE, 2021).

Sanjabi, N. Efficiently predicting scientific trends using node centrality measures of a science semantic network. In 2021 IEEE International Conference on Big Data (Big Data) 5820–5823 (IEEE, 2021).

Barabási, A.-L. Network science. Phil. Trans. R. Soci. A 371 , 20120375 (2013).

Moutinho, J. P., Coutinho, B. & Buffoni, L. Network-based link prediction of scientific concepts—a Science4Cast competition entry. In 2021 IEEE International Conference on Big Data (Big Data) 5815–5819 (IEEE, 2021).

Jolliffe, I. T. & Cadima, J. Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374 , 20150202 (2016).

Valente, F. Link prediction of artificial intelligence concepts using low computational power. In 2021 IEEE International Conference on Big Data (Big Data) 5828–5832 (IEEE, 2021).

Grover, A. & Leskovec, J. node2vec: scalable feature learning for networks. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 855–864 (ACM, 2016).

Zhang, J., Dong, Y., Wang, Y., Tang, J. & Ding, M. ProNE: fast and scalable network representation learning. In Proc. Twenty-Eighth International Joint Conference on Artificial Intelligence 4278–4284 (International Joint Conferences on Artificial Intelligence Organization, 2019).

Lee, H., Sonthalia, R. & Foster, J. G. Dynamic embedding-based methods for link prediction in machine learning semantic network. In 2021 IEEE International Conference on Big Data (Big Data) 5801–5808 (IEEE, 2021).

Liu, R. & Krishnan, A. PecanPy: a fast, efficient and parallelized python implementation of node2vec. Bioinformatics 37 , 3377–3379 (2021).

Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates Inc., 2017).

Zelenko, D., Aone, C. & Richardella, A. Kernel methods for relation extraction. J. Mach. Learn. Res. 3 , 1083–1106 (2003).


Bach, N. & Badaskar, S. A review of relation extraction. Literature Review for Language and Statistics II 2 , 1–15 (2007).

Salatino, A. A., Osborne, F. & Motta, E. How are topics born? Understanding the research dynamics preceding the emergence of new areas. PeerJ Comput. Sc. 3 , e119 (2017).

Salatino, A. A., Osborne, F. & Motta, E. AUGUR: forecasting the emergence of new research topics. In Proc. 18th ACM/IEEE on Joint Conference on Digital Libraries 303–312 (IEEE, 2018).

Battiston, F. et al. The physics of higher-order interactions in complex systems. Nat. Phys. 17 , 1093–1098 (2021).

Coutinho, B. C., Wu, A.-K., Zhou, H.-J. & Liu, Y.-Y. Covering problems and core percolations on hypergraphs. Phys. Rev. Lett. 124 , 248301 (2020).


Olivetti, E. A. et al. Data-driven materials research enabled by natural language processing and information extraction. Appl. Phys. Rev. 7 , 041317 (2020).

Lin, Z., Yin, Y., Liu, L. & Wang, D. SciSciNet: a large-scale open data lake for the science of science research. Sci. Data 10 , 315 (2023).

Azoulay, P. et al. Toward a more scientific science. Science 361 , 1194–1197 (2018).

Liu, H., Kou, H., Yan, C. & Qi, L. Link prediction in paper citation network to construct paper correlation graph. EURASIP J. Wirel. Commun. Netw. 2019 , 1–12 (2019).

Reisz, N. et al. Loss of sustainability in scientific work. New J. Phys. 24 , 053041 (2022).

Frank, M. R., Wang, D., Cebrian, M. & Rahwan, I. The evolution of citation graphs in artificial intelligence research. Nat. Mach. Intell. 1 , 79–85 (2019).

Newman, M. Networks (Oxford Univ. Press, 2018).

Kwon, D. et al. A survey of deep learning-based network anomaly detection. Cluster Comput. 22 , 949–961 (2019).

Pang, G., Shen, C., Cao, L. & Hengel, A. V. D. Deep learning for anomaly detection: a review. ACM Comput. Surv. 54 , 1–38 (2021).

Collobert, R. et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12 , 2493–2537 (2011).


LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).

Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60 , 84–90 (2017).

Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518 , 529–533 (2015).

Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529 , 484–489 (2016).

Wu, N., Vincent, A., Strukov, D. & Xie, Y. Memristor hardware-friendly reinforcement learning. Preprint at https://arxiv.org/abs/2001.06930 (2020).

Zhou, C. et al. Automated deep learning analysis of angiography video sequences for coronary artery disease. Preprint at https://arxiv.org/abs/2101.12505 (2021).

Huckle, N., Garcia, N. & Nakashima, Y. Demographic influences on contemporary art with unsupervised style embeddings. In Proc. Computer Vision–ECCV 2020 Workshops Part II Vol. 16, 126–142 (Springer, 2020).

Ranti, D. et al. The utility of general domain transfer learning for medical language tasks. Preprint at https://arxiv.org/abs/2002.06670 (2020).

Kamath, P., Singh, A. & Dutta, D. Fast neural architecture construction using envelopenets. Preprint at https://arxiv.org/abs/1803.06744 (2018).

Minsky, M. Steps toward artificial intelligence. Proc. IRE 49 , 8–30 (1961).

Bornmann, L., Haunschild, R. & Mutz, R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanit. Soc. Sci. Commun. 8 , 224 (2021).

Brin, S. & Page, L. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30 , 107–117 (1998).

Holland, P. W. & Leinhardt, S. Transitivity in structural models of small groups. Comp. Group Studies 2 , 107–124 (1971).

Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393 , 440–442 (1998).

Yang, J.-H., Chen, C.-M., Wang, C.-J. & Tsai, M.-F. HOP-rec: high-order proximity for implicit recommendation. In Proc. 12th ACM Conference on Recommender Systems 140–144 (2018).

Lin, B.-Y. OGB_collab_project. GitHub https://github.com/brucenccu/OGB_collab_project (2021).

Sorensen, T. A. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on danish commons. Biol. Skar. 5 , 1–34 (1948).

Yeo, I.-K. & Johnson, R. A. A new family of power transformations to improve normality or symmetry. Biometrika 87 , 954–959 (2000).

Ranger, M. nodevectors. GitHub https://github.com/VHRanger/nodevectors (2021).

Bandeira, A. S., Singer, A. & Spielman, D. A. A Cheeger inequality for the graph connection Laplacian. SIAM J. Matrix Anal. Appl. 34 , 1611–1630 (2013).

Krenn, M. et al. Predicting the future of AI with AI. Zenodo https://doi.org/10.5281/zenodo.7882892 (2023).

Krenn, M. et al. FutureOfAIviaAI code. Zenodo https://zenodo.org/record/8329701 (2023).

Jia, T., Wang, D. & Szymanski, B. K. Quantifying patterns of research-interest evolution. Nat. Hum. Behav. 1 , 0078 (2017).


Acknowledgements

We thank IARAI Vienna and IEEE for supporting and hosting the IEEE BigData Competition Science4Cast. We are specifically grateful to D. Kreil, M. Neun, C. Eichenberger, M. Spanring, H. Martin, D. Geschke, D. Springer, P. Herruzo, M. McCutchan, A. Mihai, T. Furdui, G. Fratica, M. Vázquez, A. Gruca, J. Brandstetter and S. Hochreiter for helping to set up and successfully execute the competition and the corresponding workshop. We thank X. Gu for creating Fig. 2 , and M. Aghajohari and M. Sadegh Akhondzadeh for helpful comments on the paper. The work of H.L., R.S. and J.G.F. was supported by grant TWCF0333 from the Templeton World Charity Foundation. H.L. is additionally supported by NSF grant DMS-1952339. J.P.M. acknowledges the support of FCT (Portugal) through scholarship SFRH/BD/144151/2019. B.C. thanks the support from FCT/MCTES through national funds and when applicable co-funded EU funds under the project UIDB/50008/2020, and FCT through the project CEECINST/00117/2018/CP1495/CT0001. N.M.T. and Y.X. are supported by NSF grant DMS-2113468, the NSF IFML 2019844 award to the University of Texas at Austin, and the Good Systems Research Initiative, part of University of Texas at Austin Bridging Barriers.

Open access funding provided by Max Planck Society.

Author information

Authors and affiliations.

Max Planck Institute for the Science of Light (MPL), Erlangen, Germany

Mario Krenn

Instituto de Telecomunicações, Lisbon, Portugal

Lorenzo Buffoni, Bruno Coutinho & João P. Moutinho

University of Toronto, Toronto, Ontario, Canada

Sagi Eppel & Andrew Gritsevskiy

University of California Los Angeles, Los Angeles, CA, USA

Jacob Gates Foster, Harlin Lee & Rishi Sonthalia

Cavendish Laboratories, Cavendish, VT, USA

Andrew Gritsevskiy

Institute of Advanced Research in Artificial Intelligence (IARAI), Vienna, Austria

Andrew Gritsevskiy & Michael Kopp

Alpha 8 AI, Toronto, Ontario, Canada

Independent Researcher, Barcelona, Spain

Nima Sanjabi

University of Texas at Austin, Austin, TX, USA

Ngoc Mai Tran

Independent Researcher, Leiria, Portugal

Francisco Valente

University of Pennsylvania, Philadelphia, PA, USA

Yangxinyu Xie

University of California, San Diego, CA, USA


Contributions

M. Krenn and R.Y. initiated the research. M. Krenn and M. Kopp organized the Science4Cast competition. M. Krenn generated the datasets and initial codes. S.E. and H.L. analysed the network-theoretical properties of the semantic network. M. Krenn, L.B., B.C., J.G.F., A.G., H.L., Y.L., J.P.M., N.S., R.S., N.M.T., F.V., Y.X. and M. Kopp provided codes for the ten models. M. Krenn wrote the paper with input from all co-authors.

Corresponding author

Correspondence to Mario Krenn .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Machine Intelligence thanks Alexander Belikov, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Mirko Pieropan, in collaboration with the Nature Machine Intelligence team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1

Number of concepts per article.

Extended Data Fig. 2

Time gap between the generation of edges. Left: the time it takes to create a new edge between two vertices. Right: the time between the first and the second edge.

Extended Data Fig. 3

Publications in Quantum Physics.

Extended Data Fig. 4

Evolution of the AUC during training for Model M1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Krenn, M., Buffoni, L., Coutinho, B. et al. Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network. Nat Mach Intell 5, 1326–1335 (2023). https://doi.org/10.1038/s42256-023-00735-0

Download citation

Received : 21 January 2023

Accepted : 11 September 2023

Published : 16 October 2023

Issue Date : November 2023

DOI: https://doi.org/10.1038/s42256-023-00735-0


Machine Intelligence

Google is at the forefront of innovation in Machine Intelligence, with active research exploring virtually all aspects of machine learning, including deep learning and more classical algorithms. Exploring theory as well as application, much of our work on language, speech, translation, visual processing, ranking and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, applying learning algorithms to understand and generalize.

Machine Intelligence at Google raises deep scientific and engineering challenges, allowing us to contribute to the broader academic research community through technical talks and publications in major conferences and journals. Contrary to much of current theory and practice, the statistics of the data we observe shift rapidly, the features of interest change as well, and the volume of data often requires enormous computation capacity. When learning systems are placed at the core of interactive services in a fast-changing and sometimes adversarial environment, techniques such as deep learning and statistical models must be combined with ideas from control and game theory.


Some of our teams

Algorithms & optimization

Applied science

Climate and sustainability

Graph mining

Impact-Driven Research, Innovation and Moonshots

Learning theory

Market algorithms

Operations research

Security, privacy and abuse

System performance




Machine Learning: Recently Published Documents


An explainable machine learning model for identifying geographical origins of sea cucumber Apostichopus japonicus based on multi-element profile

A comparison of machine learning- and regression-based models for predicting ductility ratio of RC beam-column joints

Alexa, is this a historical record?

Digital transformation in government has brought an increase in the scale, variety, and complexity of records and greater levels of disorganised data. Current practices for selecting records for transfer to The National Archives (TNA) were developed to deal with paper records and are struggling to deal with this shift. This article examines the background to the problem and outlines a project that TNA undertook to research the feasibility of using commercially available artificial intelligence tools to aid selection. The project AI for Selection evaluated a range of commercial solutions varying from off-the-shelf products to cloud-hosted machine learning platforms, as well as a benchmarking tool developed in-house. Suitability of tools depended on several factors, including requirements and skills of transferring bodies as well as the tools’ usability and configurability. This article also explores questions around trust and explainability of decisions made when using AI for sensitive tasks such as selection.

Automated Text Classification of Maintenance Data of Higher Education Buildings Using Text Mining and Machine Learning Techniques

Data-driven analysis and machine learning for energy prediction in distributed photovoltaic generation plants: a case study in Queensland, Australia

Modeling nutrient removal by membrane bioreactor at a sewage treatment plant using machine learning models

Big Five personality prediction based on Indonesian tweets using machine learning methods

The popularity of social media has drawn the attention of researchers who have conducted cross-disciplinary studies examining the relationship between personality traits and behavior on social media. Most current work focuses on personality prediction from English texts, but Indonesian has received scant attention. Therefore, this research aims to predict users' personalities based on Indonesian text from social media using machine learning techniques. This paper evaluates several machine learning techniques, including naive Bayes (NB), K-nearest neighbors (KNN), and support vector machine (SVM), based on semantic features including emotion, sentiment, and publicly available Twitter profiles. We predict personality based on the Big Five personality model, the most appropriate model for predicting user personality in social media. We examine the relationships between the semantic features and the Big Five personality dimensions. The experimental results indicate that the Big Five personality traits exhibit distinct emotional, sentimental, and social characteristics and that SVM outperformed NB and KNN for Indonesian. In addition, we observe several terms in Indonesian that specifically refer to each personality type, each of which has distinct emotional, sentimental, and social features.
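As an illustrative sketch of this kind of classifier comparison (the feature matrix and labels below are random placeholders, not the paper's dataset or features):

```python
# Minimal sketch: comparing NB, KNN and SVM with cross-validation.
# X would hold numeric feature vectors (e.g., emotion/sentiment/profile
# features extracted from tweets); here it is a synthetic placeholder.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))      # placeholder feature matrix
y = rng.integers(0, 5, size=500)    # placeholder trait labels (5 classes)

models = {
    "NB": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```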

Compressive strength of concrete with recycled aggregate: a machine learning-based evaluation

Temperature prediction of flat steel box girders of long-span bridges utilizing in situ environmental parameters and machine learning

Computer-assisted cohort identification in practice

The standard approach to expert-in-the-loop machine learning is active learning, where, repeatedly, an expert is asked to annotate one or more records and the machine finds a classifier that respects all annotations made until that point. We propose an alternative approach, IQRef, in which the expert iteratively designs a classifier and the machine helps him or her determine how well it is performing and, importantly, when to stop, by reporting statistics on a fixed, hold-out sample of annotated records. We justify our approach based on prior work giving a theoretical model of how to re-use hold-out data. We compare the two approaches in the context of identifying a cohort of EHRs and examine their strengths and weaknesses through a case study arising from an optometric research problem. We conclude that both approaches are complementary, and we recommend that they be employed in conjunction to address the problem of cohort identification in health research.
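For concreteness, a minimal version of the standard uncertainty-sampling active-learning loop described above might look like this (the data and the "expert" oracle are placeholders):

```python
# Minimal uncertainty-sampling active-learning loop (the "standard
# approach" described above). Data, budget, and oracle are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))
true_labels = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in "expert"

labeled = list(rng.choice(len(X), size=20, replace=False))  # seed annotations
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression()
for _ in range(50):  # annotation budget
    # Refit so the classifier respects all annotations made so far.
    model.fit(X[labeled], true_labels[labeled])
    probs = model.predict_proba(X[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]  # most uncertain
    labeled.append(query)      # the expert annotates the queried record
    unlabeled.remove(query)
print("records annotated:", len(labeled))
```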



Machine Design: Modern Techniques and Innovative Technologies

Hussein Younus Razzaq 1 , Hussein Mohammed Hasan 1 and Kadhim Raheem Abbas 1

Published under licence by IOP Publishing Ltd in Journal of Physics: Conference Series, Volume 1897, Sixth International Scientific Conference for Iraqi Al Khwarizmi Society (FISCAS) 2021, 22–23 November 2020, Cairo, Egypt. Citation: Hussein Younus Razzaq et al 2021 J. Phys.: Conf. Ser. 1897 012072. DOI: 10.1088/1742-6596/1897/1/012072


Author affiliations

1 Karbala Technical Institute, Al-Furat Al-Awsat Technical University, 56001, Karbala, Iraq


The manufacturing sector faces a challenge in the 21st century: continuing to develop its business by applying new and innovative production technologies and systems. These help novel manufacturing processes move forward, where machine design features and compiles the newest product lines with inventive technology to keep modernized techniques top of mind for OEMs, end users, integrators, and the entire supply community. This research paper explores how a simulation-derived mechatronic model can manage the most complex machinery profiles through a systematic approach: understanding the concept with precise machine design actions, dynamic behavior, and effective interaction among the various components of the machine. Mechatronic engineers unite mechanical and electronic principles and compute them to generate a more economical, simpler, and more reliable system.


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence . Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.


Statistics > Machine Learning

Title: Estimating a Function and Its Derivatives Under a Smoothness Condition

Abstract: We consider the problem of estimating an unknown function f* and its partial derivatives from a noisy data set of n observations, where we make no assumptions about f* except that it is smooth in the sense that it has square integrable partial derivatives of order m. A natural candidate for the estimator of f* in such a case is the best fit to the data set that satisfies a certain smoothness condition. This estimator can be seen as a least squares estimator subject to an upper bound on some measure of smoothness. Another useful estimator is the one that minimizes the degree of smoothness subject to an upper bound on the average of squared errors. We prove that these two estimators are computable as solutions to quadratic programs, establish the consistency of these estimators and their partial derivatives, and study the convergence rate as n increases to infinity. The effectiveness of the estimators is illustrated numerically in a setting where the value of a stock option and its second derivative are estimated as functions of the underlying stock price.
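In assumed notation (not the paper's own), the two estimators described in the abstract can be written as constrained optimization problems, with J_m a measure of smoothness (e.g., the integral of squared partial derivatives of order m) and L, e user-chosen bounds:

```latex
% Requires amsmath. Notation assumed here, not taken from the paper.
\begin{align*}
  \hat f_{\mathrm{LS}} &= \operatorname*{arg\,min}_{f \,:\, J_m(f) \le L}
      \; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2, \\
  \hat f_{\mathrm{SM}} &= \operatorname*{arg\,min}_{f \,:\, \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \le e}
      \; J_m(f).
\end{align*}
```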



AI Is Everybody’s Business

This briefing presents three principles to guide business leaders when making AI investments: invest in practices that build capabilities required for AI, involve all your people in your AI journey, and focus on realizing value from your AI projects. The principles are supported by the MIT CISR data monetization research, and the briefing illustrates them using examples from the Australian Taxation Office and CarMax. The three principles apply to any kind of AI, defined as technology that performs human-like cognitive tasks; subsequent briefings will present management advice distinct to machine learning and generative tools, respectively.


Today, everybody across the organization is hungry to know more about AI. What is it good for? Should I trust it? Will it take my job? Business leaders are investing in massive training programs, partnering with promising vendors and consultants, and collaborating with peers to identify ways to benefit from AI and avoid the risk of AI missteps. They are trying to understand how to manage AI responsibly and at scale.

Our book Data Is Everybody’s Business: The Fundamentals of Data Monetization describes how organizations make money using their data.[foot]Barbara H. Wixom, Cynthia M. Beath, and Leslie Owens, Data Is Everybody's Business: The Fundamentals of Data Monetization , (Cambridge: The MIT Press, 2023), https://mitpress.mit.edu/9780262048217/data-is-everybodys-business/ .[/foot] We wrote the book to clarify what data monetization is (the conversion of data into financial returns) and how to do it (by using data to improve work, wrap products and experiences, and sell informational solutions). AI technology’s role in this is to help data monetization project teams use data in ways that humans cannot, usually because of big complexity or scope or required speed. In our data monetization research, we have regularly seen leaders use AI effectively to realize extraordinary business goals. In this briefing, we explain how such leaders achieve big AI wins and maximize financial returns.

Using AI in Data Monetization

AI refers to the ability of machines to perform human-like cognitive tasks.[foot]See Hind Benbya, Thomas H. Davenport, and Stella Pachidi, “Special Issue Editorial: Artificial Intelligence in Organizations: Current State and Future Opportunities , ” MIS Quarterly Executive 19, no. 4 (December 2020), https://aisel.aisnet.org/misqe/vol19/iss4/4 .[/foot] Since 2019, MIT CISR researchers have been studying deployed data monetization initiatives that rely on machine learning and predictive algorithms, commonly referred to as predictive AI.[foot]This research draws on a Q1 to Q2 2019 asynchronous discussion about AI-related challenges with fifty-three data executives from the MIT CISR Data Research Advisory Board; more than one hundred structured interviews with AI professionals regarding fifty-two AI projects from Q3 2019 to Q2 2020; and ten AI project narratives published by MIT CISR between 2020 and 2023.[/foot] Such initiatives use large data repositories to recognize patterns across time, draw inferences, and predict outcomes and future trends. For example, the Australian Taxation Office (ATO) used machine learning, neural nets, and decision trees to understand citizen tax-filing behaviors and produce respectful nudges that helped citizens abide by Australia’s work-related expense policies. In 2018, the nudging resulted in AUD$113 million in changed claim amounts.[foot]I. A. Someh, B. H. Wixom, and R. W. Gregory, “The Australian Taxation Office: Creating Value with Advanced Analytics,” MIT CISR Working Paper No. 447, November 2020, https://cisr.mit.edu/publication/MIT_CISRwp447_ATOAdvancedAnalytics_SomehWixomGregory .[/foot]

In 2023, we began exploring data monetization initiatives that rely on generative AI.[foot]This research draws on two asynchronous generative AI discussions (Q3 2023, N=35; Q1 2024, N=34) regarding investments and capabilities and roles and skills, respectively, with data executives from the MIT CISR Data Research Advisory Board. It also draws on in-progress case studies with large organizations in the publishing, building materials, and equipment manufacturing industries.[/foot] This type of AI analyzes vast amounts of text or image data to discern patterns in them. Using these patterns, generative AI can create new text, software code, images, or videos, usually in response to user prompts. Organizations are now beginning to openly discuss data monetization initiative deployments that include generative AI technologies. For example, used vehicle retailer CarMax reported using OpenAI’s ChatGPT chatbot to help aggregate customer reviews and other car information from multiple data sets to create helpful, easy-to-read summaries about individual used cars for its online shoppers. At any point in time, CarMax has on average 50,000 cars on its website, so to produce such content without AI the company would require hundreds of content writers and years of time; using ChatGPT, the company’s content team can generate summaries in hours.[foot]Paula Rooney, “CarMax drives business value with GPT-3.5,” CIO , May 5, 2023, https://www.cio.com/article/475487/carmax-drives-business-value-with-gpt-3-5.html ; Hayete Gallot and Shamim Mohammad, “Taking the car-buying experience to the max with AI,” January 2, 2024, in Pivotal with Hayete Gallot, produced by Larj Media, podcast, MP3 audio, https://podcasts.apple.com/us/podcast/taking-the-car-buying-experience-to-the-max-with-ai/id1667013760?i=1000640365455 .[/foot]

Big advancements in machine learning, generative tools, and other AI technologies inspire big investments when leaders believe the technologies can help satisfy pent-up demand for solutions that previously seemed out of reach. However, there is a lot to learn about novel technologies before we can properly manage them. In this year’s MIT CISR research, we are studying predictive and generative AI from several angles. This briefing is the first in a series; in future briefings we will present management advice specific to machine learning and generative tools. For now, we present three principles supported by our data monetization research to guide business leaders when making AI investments of any kind: invest in practices that build capabilities required for AI, involve all your people in your AI journey, and focus on realizing value from your AI projects.

Principle 1: Invest in Practices That Build Capabilities Required for AI

Succeeding with AI depends on having deep data science skills that help teams successfully build and validate effective models. In fact, organizations need deep data science skills even when the models they are using are embedded in tools and partner solutions, including to evaluate their risks; only then can their teams make informed decisions about how to incorporate AI effectively into work practices. We worry that some leaders view buying AI products from providers as an opportunity to use AI without deep data science skills; we do not advise this.

But deep data science skills are not enough. Leaders often hire new talent and offer AI literacy training without making adequate investments in building complementary skills that are just as important. Our research shows that an organization’s progress in AI is dependent on having not only an advanced data science capability, but on having equally advanced capabilities in data management, data platform, acceptable data use, and customer understanding.[foot]In the June 2022 MIT CISR research briefing, we described why and how organizations build the five advanced data monetization capabilities for AI. See B. H. Wixom, I. A. Someh, and C. M. Beath, “Building Advanced Data Monetization Capabilities for the AI-Powered Organization,” MIT CISR Research Briefing, Vol. XXII, No. 6, June 2022, https://cisr.mit.edu/publication/2022_0601_AdvancedAICapabilities_WixomSomehBeath .[/foot] Think about it. Without the ability to curate data (an advanced data management capability), teams cannot effectively incorporate a diverse set of features into their models. Without the ability to oversee the legality and ethics of partners’ data use (an advanced acceptable data use capability), teams cannot responsibly deploy AI solutions into production.

It’s no surprise that ATO’s AI journey evolved in conjunction with the organization’s Smarter Data Program, which ATO established to build world-class data analytics capabilities, and that CarMax emphasizes that its governance, talent, and other data investments have been core to its generative AI progress.

Capabilities come mainly from learning by doing, so they are shaped by new practices in the form of training programs, policies, processes, or tools. As organizations undertake more and more sophisticated practices, their capabilities get more robust. Do invest in AI training—but also invest in practices that will boost the organization’s ability to manage data (such as adopting a data cataloging tool), make data accessible cost effectively (such as adopting cloud policies), improve data governance (such as establishing an ethical oversight committee), and solidify your customer understanding (such as mapping customer journeys). In particular, adopt policies and processes that will improve your data governance, so that data is only used in AI initiatives in ways that are consonant with your organization's values and its regulatory environment.

Principle 2: Involve All Your People in Your AI Journey

Data monetization initiatives require a variety of stakeholders—people doing the work, developing products, and offering solutions—to inform project requirements and to ensure the adoption and confident use of new data tools and behaviors.[foot]Ida Someh, Barbara Wixom, Michael Davern, and Graeme Shanks, “Configuring Relationships between Analytics and Business Domain Groups for Knowledge Integration, ” Journal of the Association for Information Systems 24, no. 2 (2023): 592-618, https://cisr.mit.edu/publication/configuring-relationships-between-analytics-and-business-domain-groups-knowledge .[/foot] With AI, involving a variety of stakeholders in initiatives helps non-data scientists become knowledgeable about what AI can and cannot do, how long it takes to deliver certain kinds of functionality, and what AI solutions cost. This, in turn, helps organizations in building trustworthy models, an important AI capability we call AI explanation (AIX).[foot]Ida Someh, Barbara H. Wixom, Cynthia M. Beath, and Angela Zutavern, “Building an Artificial Intelligence Explanation Capability,” MIS Quarterly Executive 21, no. 2 (2022), https://cisr.mit.edu/publication/building-artificial-intelligence-explanation-capability .[/foot]

For example, at ATO, data scientists educated business colleagues on the mechanics and results of models they created. Business colleagues provided feedback on the logic used in the models and helped to fine-tune them, and this interaction helped everyone understand how the AI made decisions. The data scientists provided their model results to ATO auditors, who also served as a feedback loop to the data scientists for improving the model. The data scientists regularly reported on initiative progress to senior management, regulators, and other stakeholders, which ensured that the AI team was proactively creating positive benefits without neglecting negative external factors that might surface.

Given the consumerization of generative AI tools, we believe that pervasive worker involvement in ideating, building, refining, using, and testing AI models and tools will become even more crucial to deploying fruitful AI projects—and building trust that AI will do the right thing in the right way at the right time.

Principle 3: Focus on Realizing Value From Your AI Projects

AI is costly—just add up your organization’s expenses in tools, talent, and training. AI needs to pay off, yet some organizations become distracted with endless experimentation. Others get caught up in finding the sweet spot of the technology, ignoring the sweet spot of their business model. For example, it is easy to become enamored of using generative AI to improve worker productivity, rolling out tools for employees to write better emails and capture what happened in meetings. But unless those activities materially impact how your organization makes money, there likely are better ways to spend your time and money.

Leaders with data monetization experience will make sure their AI projects realize value in the form of increased revenues or reduced expenses by backing initiatives that are clearly aligned with real challenges and opportunities. That is step one. In our research, the leaders that realize value from their data monetization initiatives measure and track their outcomes, especially their financial outcomes, and they hold someone accountable for achieving the desired financial returns. At CarMax, a cross-functional team owned the mission to provide better website information for used car shoppers, a mission important to the company’s sales goals. Starting with sales goals in mind, the team experimented with and then chose a generative AI solution that would enhance the shopper experience and increase sales.

Figure 1: Three Principles for Getting Value from AI Investments


The three principles are based on the following concepts from MIT CISR data research:

  • Data liquidity: the ease of data asset recombination and reuse
  • Data democracy: an organization that empowers employees in the access and use of data
  • Data monetization: the generation of financial returns from data assets

Managing AI Using a Data Monetization Mindset

AI has and always will play a big role in data monetization. It’s not a matter of whether to incorporate AI, but a matter of how to best use it. To figure this out, quantify the outcomes of some of your organization’s recent AI projects. How much money has the organization realized from them? If the answer disappoints, then make sure the AI technology value proposition is a fit for your organization’s most important goals. Then assign accountability for ensuring that AI technology is applied in use cases that impact your income statements. If the AI technology is not a fit for your organization, then don’t be distracted by media reports of the AI du jour.

Understanding your AI technology investments can be hard if your organization is using AI tools that are bundled in software you purchase or are built for you by a consultant. To set yourself up for success, ask your partners to be transparent with you about the quality of data they used to train their AI models and the data practices they relied on. Do their answers persuade you that their tools are trustworthy? Is it obvious that your partner is using data compliantly and is safeguarding the model from producing bad or undesired outcomes? If so, make sure this good news is shared with the people in your organization and those your organization serves. If not, rethink whether to break with your partner and find another way to incorporate the AI technology into your organization, such as by hiring people to build it in-house.

To paraphrase our book’s conclusion: When people actively engage in data monetization initiatives using AI , they learn, and they help their organization learn. Their engagement creates momentum that initiates a virtuous cycle in which people’s engagement leads to better data and more bottom-line value, which in turn leads to new ideas and more engagement, which further improves data and delivers more value, and so on. Imagine this happening across your organization as all people everywhere make it their business to find ways to use AI to monetize data.

This is why AI, like data, is everybody’s business.

© 2024 MIT Center for Information Systems Research, Wixom and Beath. MIT CISR Research Briefings are published monthly to update the center’s member organizations on current research projects.

Related Publications


Talking Points

AI, Like Data, Is Everybody's Business


Working Paper: Vignette

The Australian Taxation Office: Creating Value with Advanced Analytics


Research Briefing

Building Advanced Data Monetization Capabilities for the AI-Powered Organization


Building AI Explanation Capability for the AI-Powered Organization


What is Data Monetization?

About the Researchers


Barbara H. Wixom, Principal Research Scientist, MIT Center for Information Systems Research (CISR)


Cynthia M. Beath, Professor Emerita, University of Texas and Academic Research Fellow, MIT CISR

MIT Center for Information Systems Research (CISR)

Founded in 1974 and grounded in MIT's tradition of combining academic knowledge and practical purpose, MIT CISR helps executives meet the challenge of leading increasingly digital and data-driven organizations. We work directly with digital leaders, executives, and boards to develop our insights. Our consortium forms a global community that comprises more than seventy-five organizations.



Jamshidi earns recognition for most influential paper

Pooyan Jamshidi

When someone in academia publishes a research paper, one of the goals is to have the paper cited by other professors and researchers. A paper published 10 years ago by Computer Science and Engineering Assistant Professor Pooyan Jamshidi was recently recognized for its significant impact.

Jamshidi received the Most Influential Paper Award in April at the 19th International Conference on Software Engineering for Adaptive and Self-Managing Systems (SEAMS) in Lisbon, Portugal. Jamshidi’s paper, “ Autonomic Resource Provision for Cloud-based Software ,” was submitted, accepted and published just prior to earning his Ph.D. from Dublin City University in Ireland in 2014. It was presented at the 2014 SEAMS Conference in India.

For the Most Influential Paper Award, a select committee considers conference publications from approximately 10 years earlier and short-lists those that have made the most impact according to several criteria, including number of citations, practical applications and industry adoption, and influence on subsequent research. The winner is then selected from this short list.

“I wanted to publish the most important part of my Ph.D. research at SEAMS because it was a special community, and their work was close to mine,” Jamshidi says. “Receiving this award is important because this was my first paper with the community. I kept publishing with SEAMS and remained engaged.” 

The paper introduced a groundbreaking approach to fundamentally transform how resources are managed and allocated in cloud environments. The key innovation was enabling multiple tenants to describe their adaptation rules for cloud and multi-cloud resource provisioning in a dedicated language that supports reasoning about, inference over, and resolution of conflicting adaptation rules.
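As a purely hypothetical illustration of what a tenant-declared adaptation rule with conflict resolution might look like (this is not the paper's actual rule language or syntax):

```python
# Hypothetical illustration only -- not the paper's actual rule language.
# Each tenant declares threshold-based adaptation rules; conflicting
# decisions are resolved here by a simple priority policy.
from dataclasses import dataclass

@dataclass
class Rule:
    tenant: str
    metric: str
    threshold: float
    action: int        # +1 scale out, -1 scale in
    priority: int

def decide(rules, metrics):
    """Return the scaling action of the highest-priority rule that fires."""
    fired = [r for r in rules if metrics[r.metric] > r.threshold]
    if not fired:
        return 0
    return max(fired, key=lambda r: r.priority).action

rules = [
    Rule("tenant-a", "cpu_util", 0.80, +1, priority=2),
    Rule("tenant-b", "cost_per_hour", 12.0, -1, priority=1),
]
print(decide(rules, {"cpu_util": 0.91, "cost_per_hour": 14.0}))  # -> 1
```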

Since the paper was published, it has received 188 citations according to Google Scholar . In addition, the autonomic resource provision technique has been integrated with Microsoft Azure and OpenStack . The concepts and methods introduced in the paper have also led to follow-up research in cloud autoscaling, Edge-and-Internet of Things resource scaling, and networking and autonomous driving.

The paper has influenced the field of software engineering, especially adaptive and self-managing systems in the cloud, as well as academic research, industry practice, and the broader technological landscape.

While Jamshidi admits that autonomous autoscaling for cloud-based software is not as hot a topic as it was when his paper was published, it remains a relevant research area that continues to generate new ideas, methods, and approaches.

“The most exciting direction in cloud auto-scaling and resource provisioning overall is sustainability-aware approaches to enable sustainable computer usage for modern applications, such as AI systems,” Jamshidi says. “We plan to continue this line of research. For example, thanks to funds provided by the National Science Foundation and collaborators from Carnegie Mellon University and Rochester Institute of Technology, we are investigating software-driven sustainability.” 



Taking Measure

Just a Standard Blog

From Trash to Cash: How AI and Machine Learning Can Help Make Recycling Less Expensive for Local Governments

Brad Sutliff wears safety glasses as he stands in the lab in front of a computer monitor.

Recycling costs local governments a lot of money, but AI can help make that process less costly — potentially leading to more recycling. NIST research is looking to make recycling more efficient and less expensive.

What happens to your plastic recycling after you toss it in “the bin”?

This question seems to be in the news a lot lately (see here, here, or here).

In truth, the answer is complicated. It depends on where you live and what that plastic is.

Collecting recycling costs local governments a lot of money. They need to maintain facilities to handle the plastics as well as the trucks and bins to collect them. Governments also need to hire people to do the work. It can be a lot cheaper to just put everything in a landfill.

However, when local governments recycle, they can turn trash into cash if they have the correct infrastructure. They can offset some of the costs by selling the collected plastic back to manufacturers. Most manufacturers want recycled plastic that is almost as good as brand-new plastic, but that requires careful sorting by the recyclers to provide a consistent product.

For many people, all plastic seems the same. However, those with a keen eye know that there are seven types of common plastic. You can identify them by the little recycling symbol on the bottom of almost all plastic containers. These numbers help identify the chemistry behind those plastics. You may have noticed them when sorting your own recycling.

Below is an explanation of some of these materials:
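(The standard resin identification codes, stated here from general knowledge as a stand-in for the post's original graphic:)

```python
# The seven standard resin identification codes -- general knowledge,
# not copied from the original post's graphic.
RESIN_CODES = {
    1: "PET  - polyethylene terephthalate (soda and water bottles)",
    2: "HDPE - high-density polyethylene (milk bottles)",
    3: "PVC  - polyvinyl chloride (pipes, window blinds)",
    4: "LDPE - low-density polyethylene (plastic bags)",
    5: "PP   - polypropylene (takeout containers)",
    6: "PS   - polystyrene (foam cups, packaging)",
    7: "Other (mixed or less common plastics)",
}
```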

Sorting these plastics is incredibly important. Different plastics with some similar characteristics often can’t be mixed because they require different melting procedures.

Take PVC, for example. Used in everything from plumbing to window blinds, PVC creates a strong acid with many industrial uses when melted. But like many other acids, it’s not something you want to make when you’re not expecting it.

Polyolefins, the group of plastics that includes HDPE (used in milk bottles), LDPE (used in plastic bags) and PP (used in takeout containers), provide a much tamer example. This group of plastics makes up about 40% of the world’s plastic production . They’re also some of the hardest to sort.

The type of plastic used in milk bottles requires high temperatures to melt and reprocess due to its crystalline structure. However, if plastic bag contaminants are in the mix, the bags will degrade at those high temperatures. So, if a plastic bag gets into the recycling mix with a milk bottle, it could lead to a gross, yellow milk bottle that no one wants to drink from. This processing risk is one of the many reasons you’re unlikely to see any milk bottles made of recycled plastics.

Additionally, if some high-temperature stable materials from takeout containers end up in a plastic bag process line, you’re likely to see clogged-up machinery.

Workers at a recycling facility sort and separate recycled plastics into individually-labeled yellow bins.

Theoretically, you can easily sort plastic waste by using the little recycling symbol. Then, you can sell those sorted plastics to secondary recyclers, who will turn that sorted waste into products.

The price depends on the assumed purity of the plastic. A bale of large, orange laundry detergent bottles is likely to sell for a high price because it is easy to pick those items out. However, a bunch of takeout containers can easily have a mix of plastics with various colors or additives.

At the local recycling facility in Montgomery County, Maryland, people hand-sort laundry detergent bottles, food containers and more. However, human hands and eyes can only move so fast, and mistakes are easy at that speed. So, recycling facilities focus on sorting high-value or easy-to-identify plastics to have consistency in what they are selling to secondary recyclers. That means detergent bottles and beverage containers are recycled at high rates. Your plastic “silverware” and old children’s toys are probably not.

To help facilitate sorting, our work at NIST has focused on using near-infrared (NIR) light, a technology that can look at plastics and rapidly tell us what they are. Some top-of-the-line recycling facilities already use lights or cameras that “see” using this approach and sort soda bottles from PVC piping.

But these systems can't sort everything, and the focus of my research is to create a method to help sort the trickiest of plastics in a way that will be profitable for recyclers.

How We’re Making Recycling More Efficient

With that in mind, our team looked at this NIR approach and decided to improve it with some help from machine learning algorithms and other scientific techniques.

In infrared spectroscopy, you shine light from several different wavelengths on some molecules. Those molecules absorb some of the energy from that light based on the wavelength and reflect or transmit the rest.

One way to think about it is with flowers and color. When the many wavelengths of light in sunlight shine on a red rose, for example, the rose is excellent at absorbing every wavelength/color except for red. The red light reflects off the petals, and that is why roses look red to us.

If we know what colors and light intensity we are shining on a flower or plastic bottle and what colors/intensity we get back, we can use the difference like a fingerprint to identify more of those flowers or bottles.

Brad Sutliff displays small circles of plastic on his gloved palm as he stands in the lab wearing safety glasses.

Using machine learning, we can find the NIR fingerprints for many plastic materials. We then “train” the computer to identify the plastic based on how similar the new NIR signal is to the NIR signals of other plastics. This training helps the technology identify the material in a soda bottle, know that it’s different from the makeup of a takeout container, and separate them accordingly.
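As a rough sketch of that training step (synthetic placeholder spectra and labels, not NIST's actual pipeline), one could fit a standard classifier to labeled NIR fingerprints:

```python
# Minimal sketch: learn to map NIR spectra ("fingerprints") to plastic
# types. Spectra and labels are synthetic placeholders, not NIST data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n_samples, n_wavelengths = 600, 200
X = rng.normal(size=(n_samples, n_wavelengths))  # reflectance spectra
y = rng.integers(0, 3, size=n_samples)           # e.g. 0=HDPE, 1=LDPE, 2=PP

model = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="rbf"))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```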

In our first paper, we used machine learning to connect our plastic signals with certain properties, like how dense and crystalline the polyethylene was. Normally, you measure density by weighing the plastic in different liquids and comparing the difference. It’s a very slow and tedious process.

However, we show that you can use reflected light to find out almost the same information — much more quickly. And on the recycling line, time is of the essence.  

You can apply this approach to both large and small samples, too. It’s cool because it demonstrates that we can pull a lot more information out of these light-based measurements if we set things up carefully.
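A minimal sketch of that property-prediction idea, assuming placeholder data and a common chemometrics model (partial least squares, a standard choice for spectra, though not necessarily the one used in the paper):

```python
# Sketch: regress density directly on the NIR spectrum instead of
# measuring it by weighing in liquids. Data are synthetic placeholders.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 200))                    # spectra
density = 0.94 + 0.01 * X[:, :5].sum(axis=1) / 5   # synthetic target (g/cm^3)

pls = PLSRegression(n_components=10)
print("CV R^2:", cross_val_score(pls, X, density, cv=5).mean())
```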

This is still very preliminary work and doesn’t yet apply to all types of plastics. So, it’s not like we can shine a light on any plastic and know its exact characteristics, but it’s an exciting start. It could save both recyclers and manufacturers a lot of time and effort with quality control steps if we can scale it up.

With that work published, I’ve been digging into the best ways to handle all the data we get from these measurements. You end up having very different-looking data based on the shape of the plastic, whether the sample is in pellets, a powder or a bottle.

This is because the light still gets reflected, but it gets reflected in different directions depending on the shape of the plastic. Think of the reflection on a clear pond versus on a pond with many ripples. Then, you can add colors and preservatives that have the potential to really change the signal. It doesn’t make the data wrong, but it can influence the sorting. Think of it as the difference between sorting black-and-white photographs of people and sorting black-and-white photographs, color photographs, caricatures and paintings of those same people.

To combat this, the team has been trying to expand our dataset, while I look at mathematical corrections to put powders, pellets and colored plastics on the same playing field. If we can accomplish that, it will become significantly easier to use machine learning to identify which plastic is which.
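One standard correction of this kind in chemometrics is the standard normal variate (SNV) transform, which centers and scales each spectrum individually so pellets, powders, and bottles become more comparable. Whether this is the exact correction used at NIST is an assumption; it simply illustrates the idea:

```python
# Standard normal variate (SNV): a common scatter correction for NIR
# spectra. Shown as an illustration, not necessarily NIST's method.
import numpy as np

def snv(spectra: np.ndarray) -> np.ndarray:
    """Row-wise SNV: (x - mean) / std, computed per spectrum."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

raw = np.random.default_rng(3).normal(loc=5.0, scale=2.0, size=(4, 200))
corrected = snv(raw)
print(corrected.mean(axis=1).round(6), corrected.std(axis=1).round(6))
```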

To make this research more broadly useful, I’m working to show that we can sort those troublesome polyolefins. With my current methods, we are achieving 95% to 98% accuracy when sorting these plastics. We’re doing this with a process that almost any NIR-equipped recycling facility could start using very rapidly if it wanted to.

Many recycling facilities are probably already using similar algorithms, but this work provides an extra level of refinement that really focuses on the tough-to-sort polyolefins.

If we can sort those effectively, we can reuse them with fewer processing problems, and recycling will become much more profitable. Then, hopefully, the profits can drive even better recycling habits, and we can start bending our linear economy into a circular one.

Recycling as a Puzzle to Solve

I’m a problem solver, and I hop from one puzzle to the next.

In addition to polymer research, I’ve worked on drug delivery systems for ovarian cancer, and now I’m working with artificial intelligence (AI) and machine learning.

I love the good that I get to do while solving complex puzzles. Sustainability and bio-friendly materials have been a nice thread throughout my research career.

You may not initially realize the connection between biomedical research and plastics. But drug delivery systems can help make really cool materials with applications outside the medical space. The plastics work can also enhance our understanding of our bodies in areas such as DNA, proteins and collagen.

And now, with the explosion of AI, we have new tools to conduct materials research faster and more effectively. It’s a very exciting time to be in this sustainable materials space!

Future of Sorting Research

I’m currently finishing up my two-year contract at NIST, and I’m looking for the next puzzle to solve.

However, I plan to stay connected to NIST as an associate to help other researchers use my techniques.

I hope to help the greater recycling community use data analysis to improve our recycling and help clean up our planet.

World Metrology Day is coming up on May 20, with the theme of sustainability. Using AI to help us recycle more efficiently is one of many NIST contributions to creating a healthier planet. Learn more:

  • A Polymer Scientist Wrestles — Literally and Figuratively — With the Frustrations of Plastic Packaging
  • Riding the Wind: How Applied Geometry and Artificial Intelligence Can Help Us Win the Renewable Energy Race
  • Microplastics Mystery: Sampling Airborne Microplastics at a Recycling Facility  

About the author


Bradley Sutliff

Brad has been studying sustainable materials since 2012. He has looked at DNA for drug delivery purposes, worked on bacteria for synthesizing new bioplastics, and used wood pulp to reinforce our plastics. He joined NIST as an NRC postdoctoral fellow in 2022 and started using his material characterization knowledge to help make recycling a more industrially viable process. He currently lives in Rockville, Maryland, with his fiancée, Deb, and their beloved cat, Kiki.

Related posts


NIST Research Is Setting the Standard to Help Buildings Withstand Tornadoes


Houston, We Have the Engine Remnants: How My Lab Pieced Together a Forgotten Part of an Apollo Rocket


From Pokémon to Physics: My Journey of Perseverance Into Research


ICML 2024: AAII Paper Acceptance Success

Two papers by AAII researchers have been accepted to flagship machine learning conference ICML 2024.


The 41st International Conference on Machine Learning will take place in Vienna, Austria in July 2024.

The International Conference on Machine Learning (ICML) is the leading international academic conference dedicated to the advancement of the branch of artificial intelligence known as machine learning. This year, ICML will be held in Vienna, Austria, from 21 to 27 July 2024, and the Australian Artificial Intelligence Institute (AAII) has two publications accepted for presentation.

ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, and robotics.

AAII publications accepted to ICML 2024 are as follows:

  • 'Knowledge Distillation with Auxiliary Variable,' Bo Peng, Zhen Fang, Guangquan Zhang, and Jie Lu.
  • 'Adaptive Stabilization Based on Machine Learning for Column Generation,' Yunzhuang Shen, Yuan Sun, Xiaodong Li, Zhiguang Cao, Andrew Eberhard, and Guangquan Zhang.



GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Recent advances in large language models (LLMs) have propelled the development of multilingual speech and machine translation through reduced representation errors and the incorporation of external knowledge. However, both translation tasks typically rely on beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less than optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely GenTranslate, which builds upon LLMs to generate better results from the diverse translation versions in the N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.
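A hedged sketch of this paradigm (the prompt wording and the `llm_generate` hook are illustrative placeholders, not the paper's implementation):

```python
# Sketch: instead of keeping only the top-1 beam-search hypothesis, hand
# the full N-best list to an LLM and ask for one integrated translation.
# `llm_generate` is a placeholder for whatever model endpoint is used.
def build_prompt(nbest: list[str], target_lang: str) -> str:
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        f"Below are {len(nbest)} candidate translations into {target_lang}.\n"
        f"{numbered}\n"
        "Combine their information into one best translation:"
    )

def llm_generate(prompt: str) -> str:  # placeholder LLM call
    raise NotImplementedError

nbest = ["Das ist ein Test.", "Dies ist ein Test.", "Das ist eine Prüfung."]
prompt = build_prompt(nbest, "German")
# translation = llm_generate(prompt)
```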

Huck Yang

  • Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
  • HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
  • It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
  • Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition


NeurIPS 2024

Conference Dates: (In person) 9 December - 15 December, 2024

Homepage: https://neurips.cc/Conferences/2024/

Call For Papers 

Abstract submission deadline: May 15, 2024

Author notification: Sep 25, 2024

Camera-ready, poster, and video submission: Oct 30, 2024 AOE

Submit at: https://openreview.net/group?id=NeurIPS.cc/2024/Conference  

The site will start accepting submissions on Apr 22, 2024 

Subscribe to these and other dates on the 2024 dates page .

The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024) is an interdisciplinary conference that brings together researchers in machine learning, neuroscience, statistics, optimization, computer vision, natural language processing, life sciences, natural sciences, social sciences, and other adjacent fields. We invite submissions presenting new and original research on topics including but not limited to the following:

  • Applications (e.g., vision, language, speech and audio, Creative AI)
  • Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
  • Evaluation (e.g., methodology, meta studies, replicability and validity, human-in-the-loop)
  • General machine learning (supervised, unsupervised, online, active, etc.)
  • Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
  • Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)
  • Neuroscience and cognitive science (e.g., neural coding, brain-computer interfaces)
  • Optimization (e.g., convex and non-convex, stochastic, robust)
  • Probabilistic methods (e.g., variational inference, causal inference, Gaussian processes)
  • Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
  • Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
  • Theory (e.g., control theory, learning theory, algorithmic game theory)

Machine learning is a rapidly evolving field, and so we welcome interdisciplinary submissions that do not fit neatly into existing categories.

Authors are asked to confirm that their submissions accord with the NeurIPS code of conduct .

Formatting instructions:   All submissions must be in PDF format, and in a single PDF file include, in this order:

  • The submitted paper
  • Technical appendices that support the paper with additional proofs, derivations, or results 
  • The NeurIPS paper checklist  

Other supplementary materials such as data and code can be uploaded as a ZIP file

The main text of a submitted paper is limited to nine content pages, including all figures and tables. Additional pages containing references don’t count as content pages. If your submission is accepted, you will be allowed an additional content page for the camera-ready version.

The main text and references may be followed by technical appendices, for which there is no page limit.

The maximum file size for a full submission, which includes technical appendices, is 50MB.

Authors are encouraged to submit a separate ZIP file that contains further supplementary material like data or source code, when applicable.

You must format your submission using the NeurIPS 2024 LaTeX style file, which includes a “preprint” option for non-anonymous preprints posted online. Submissions that violate the NeurIPS style (e.g., by decreasing margins or font sizes) or page limits may be rejected without further review. Papers may be rejected without consideration of their merits if they fail to meet the submission requirements, as described in this document.
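As an illustration, a minimal submission skeleton using the style file might look like the following. This is a sketch only, assuming the official neurips_2024.sty is available on the LaTeX path; consult the official template for authoritative usage.

    \documentclass{article}

    % No option: anonymized formatting for the submission under review.
    % Use \usepackage[preprint]{neurips_2024} for a non-anonymous preprint
    % posted online, or \usepackage[final]{neurips_2024} for the camera-ready.
    \usepackage{neurips_2024}

    \title{Title of the Submission}
    % In submission mode the style file is expected to anonymize the author
    % block, so the same source stays double-blind.
    \author{Firstname Lastname}

    \begin{document}
    \maketitle

    \begin{abstract}
      Abstract text goes here.
    \end{abstract}

    % Main text: at most nine content pages, including all figures and tables.
    % References follow the main text and do not count toward the page limit;
    % technical appendices and the paper checklist come after the references.

    \end{document}

One advantage of this arrangement is that the same source can produce the anonymized submission, the online preprint, and the camera-ready version by changing only the package option.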

Paper checklist: In order to improve the rigor and transparency of research submitted to and published at NeurIPS, authors are required to complete a paper checklist. The paper checklist is intended to help authors reflect on a wide variety of issues relating to responsible machine learning research, including reproducibility, transparency, research ethics, and societal impact. The checklist forms part of the paper submission, but does not count towards the page limit.

Please join the NeurIPS 2024 Checklist Assistant Study, which will provide you with free verification of your checklist performed by an LLM. Please see details in our blog.

Supplementary material: While all technical appendices should be included as part of the main paper submission PDF, authors may submit up to 100MB of supplementary material, such as data or source code, in ZIP format. Supplementary material should be material created by the authors that directly supports the submission content. Like submissions, supplementary material must be anonymized. Looking at supplementary material is at the discretion of the reviewers.

We encourage authors to upload their code and data as part of their supplementary material in order to help reviewers assess the quality of the work. Check the policy as well as code submission guidelines and templates for further details.
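For instance, a supplementary ZIP might be organized along the following lines. This is an illustrative layout only, not one prescribed by the guidelines; every file must be anonymized.

    supplementary.zip
      README.md   -- instructions for reproducing the reported results
      code/       -- anonymized source code (no author names in paths or headers)
      data/       -- data files that directly support the submission

Keeping a short README at the top level makes it easier for reviewers, who examine supplementary material at their discretion, to assess the work quickly.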

Use of Large Language Models (LLMs): We welcome authors to use any tool that is suitable for preparing high-quality papers and research. However, we ask authors to keep in mind two important criteria. First, we expect papers to fully describe their methodology, and any tool that is important to that methodology, including the use of LLMs, should be described also. For example, authors should mention tools (including LLMs) that were used for data processing or filtering, visualization, facilitating or running experiments, and proving theorems. It may also be advisable to describe the use of LLMs in implementing the method (if this corresponds to an important, original, or non-standard component of the approach). Second, authors are responsible for the entire content of the paper, including all text and figures, so while authors are welcome to use any tool they wish for writing the paper, they must ensure that all text is correct and original.

Double-blind reviewing: All submissions must be anonymized and may not contain any identifying information that may violate the double-blind reviewing policy. This policy applies to any supplementary or linked material as well, including code. If you are including links to any external material, it is your responsibility to guarantee anonymous browsing. Please do not include acknowledgements at submission time. If you need to cite one of your own papers, you should do so with adequate anonymization to preserve double-blind reviewing. For instance, write “In the previous work of Smith et al. [1]…” rather than “In our previous work [1]…”. If you need to cite one of your own papers that is in submission to NeurIPS and not available as a non-anonymous preprint, then include a copy of the cited anonymized submission in the supplementary material and write “Anonymous et al. [1] concurrently show…”. Any papers found to be violating this policy will be rejected.

OpenReview: We are using OpenReview to manage submissions. The reviews and author responses will not be public initially (but may be made public later; see below). As in previous years, submissions under review will be visible only to their assigned program committee. We will not be soliciting comments from the general public during the reviewing process. Anyone who plans to submit a paper as an author or a co-author will need to create (or update) their OpenReview profile by the full paper submission deadline. Your OpenReview profile can be edited by logging in and clicking on your name at https://openreview.net/. This takes you to a URL "https://openreview.net/profile?id=~[Firstname]_[Lastname][n]" where the last part is your profile name, e.g., ~Wei_Zhang1. OpenReview profiles must be up to date, listing all of the authors’ publications and their current affiliations. The easiest way to import publications is through DBLP, but this is not required; see the FAQ. Submissions without updated OpenReview profiles will be desk rejected. The information entered in the profile is critical for ensuring that conflicts of interest and reviewer matching are handled properly. Because of the rapid growth of NeurIPS, we request that all authors help with reviewing papers if asked to do so. We need everyone’s help in maintaining the high scientific quality of NeurIPS.

Please be aware that OpenReview has a moderation policy for newly created profiles: New profiles created without an institutional email will go through a moderation process that can take up to two weeks. New profiles created with an institutional email will be activated automatically.

Venue home page: https://openreview.net/group?id=NeurIPS.cc/2024/Conference

If you have any questions, please refer to the FAQ: https://openreview.net/faq

Abstract Submission: There is a mandatory abstract submission deadline on May 15, 2024, six days before full paper submissions are due. While it will be possible to edit the title and abstract until the full paper submission deadline, submissions with “placeholder” abstracts that are rewritten for the full submission risk being removed without consideration. This includes titles and abstracts that either provide little or no semantic information (e.g., "We provide a new semi-supervised learning method.") or describe a substantively different claimed contribution.  The author list cannot be changed after the abstract deadline. After that, authors may be reordered, but any additions or removals must be justified in writing and approved on a case-by-case basis by the program chairs only in exceptional circumstances. 

Ethics review: Reviewers and ACs may flag submissions for ethics review. Flagged submissions will be sent to an ethics review committee for comments. Comments from ethics reviewers will be considered by the primary reviewers and AC as part of their deliberation. They will also be visible to authors, who will have an opportunity to respond. Ethics reviewers do not have the authority to reject papers, but in extreme cases papers may be rejected by the program chairs on ethical grounds, regardless of scientific quality or contribution.

Preprints: The existence of non-anonymous preprints (on arXiv or other online repositories, personal websites, social media) will not result in rejection. If you choose to use the NeurIPS style for the preprint version, you must use the “preprint” option rather than the “final” option. Reviewers will be instructed not to actively look for such preprints, but encountering them will not constitute a conflict of interest. Authors may submit anonymized work to NeurIPS that is already available as a preprint (e.g., on arXiv) without citing it. Note that public versions of the submission should not say "Under review at NeurIPS" or similar.

Dual submissions: Submissions that are substantially similar to papers that the authors have previously published or submitted in parallel to other peer-reviewed venues with proceedings or journals may not be submitted to NeurIPS. Papers previously presented at workshops are permitted, so long as they did not appear in a conference proceedings (e.g., CVPRW proceedings), a journal or a book.  NeurIPS coordinates with other conferences to identify dual submissions.  The NeurIPS policy on dual submissions applies for the entire duration of the reviewing process.  Slicing contributions too thinly is discouraged.  The reviewing process will treat any other submission by an overlapping set of authors as prior work. If publishing one would render the other too incremental, both may be rejected.

Anti-collusion: NeurIPS does not tolerate any collusion whereby authors secretly cooperate with reviewers, ACs or SACs to obtain favorable reviews. 

Author responses:   Authors will have one week to view and respond to initial reviews. Author responses may not contain any identifying information that may violate the double-blind reviewing policy. Authors may not submit revisions of their paper or supplemental material, but may post their responses as a discussion in OpenReview. This is to reduce the burden on authors to have to revise their paper in a rush during the short rebuttal period.

After the initial response period, authors will be able to respond to any further reviewer/AC questions and comments by posting on the submission’s forum page. The program chairs reserve the right to solicit additional reviews after the initial author response period.  These reviews will become visible to the authors as they are added to OpenReview, and authors will have a chance to respond to them.

After the notification deadline, accepted and opted-in rejected papers will be made public and open for non-anonymous public commenting. Their anonymous reviews, meta-reviews, author responses and reviewer responses will also be made public. Authors of rejected papers will have two weeks after the notification deadline to opt in to make their deanonymized rejected papers public in OpenReview.  These papers are not counted as NeurIPS publications and will be shown as rejected in OpenReview.

Publication of accepted submissions: Reviews, meta-reviews, and any discussion with the authors will be made public for accepted papers (but reviewer, area chair, and senior area chair identities will remain anonymous). Camera-ready papers will be due in advance of the conference. All camera-ready papers must include a funding disclosure. We strongly encourage accompanying code and data to be submitted with accepted papers when appropriate, as per the code submission policy. Authors will be allowed to make minor changes for a short period of time after the conference.

Contemporaneous Work: For the purpose of the reviewing process, papers that appeared online within two months of a submission will generally be considered "contemporaneous" in the sense that the submission will not be rejected on the basis of the comparison to contemporaneous work. Authors are still expected to cite and discuss contemporaneous work and perform empirical comparisons to the degree feasible. Any paper that influenced the submission is considered prior work and must be cited and discussed as such. Submissions that are very similar to contemporaneous work will undergo additional scrutiny to prevent cases of plagiarism and missing credit to prior work.

Plagiarism is prohibited by the NeurIPS Code of Conduct.

Other Tracks: As in earlier years, we will host multiple tracks, such as datasets, competitions, tutorials, and workshops, in addition to the main track for which this call for papers is intended. See the conference homepage for updates and calls for participation in these tracks.

Experiments: As in past years, the program chairs will be measuring the quality and effectiveness of the review process via randomized controlled experiments. All experiments are independently reviewed and approved by an Institutional Review Board (IRB).

Financial Aid: Each paper may designate up to one (1) NeurIPS.cc account email address of a corresponding student author who confirms that they would need the support to attend the conference and agrees to volunteer if selected. To be considered for financial aid, the student will also need to fill out the Financial Aid application when it becomes available.


Guest Essay

Press Pause on the Silicon Valley Hype Machine


By Julia Angwin

Ms. Angwin is a contributing Opinion writer and an investigative journalist.

It’s a little hard to believe that just over a year ago, a group of leading researchers asked for a six-month pause in the development of larger systems of artificial intelligence, fearing that the systems would become too powerful. “Should we risk loss of control of our civilization?” they asked.

There was no pause. But now, a year later, the question isn’t really whether A.I. is too smart and will take over the world. It’s whether A.I. is too stupid and unreliable to be useful. Consider this week’s announcement from OpenAI’s chief executive, Sam Altman, who promised he would unveil “new stuff” that “feels like magic to me.” But it was just a rather routine update that makes ChatGPT cheaper and faster.

It feels like another sign that A.I. is not even close to living up to its hype. In my eyes, it’s looking less like an all-powerful being and more like a bad intern whose work is so unreliable that it’s often easier to do the task yourself. That realization has real implications for the way we, our employers and our government should deal with Silicon Valley’s latest dazzling new, new thing. Acknowledging A.I.’s flaws could help us invest our resources more efficiently and also allow us to turn our attention toward more realistic solutions.

Others voice similar concerns. “I find my feelings about A.I. are actually pretty similar to my feelings about blockchains: They do a poor job of much of what people try to do with them, they can’t do the things their creators claim they one day might, and many of the things they are well suited to do may not be altogether that beneficial,” wrote Molly White, a cryptocurrency researcher and critic, in her newsletter last month.

Let’s look at the research.

In the past 10 years, A.I. has conquered many tasks that were previously unimaginable, such as successfully identifying images, writing complete coherent sentences and transcribing audio. A.I. enabled a singer who had lost his voice to release a new song using A.I. trained with clips from his old songs.

But some of A.I.’s greatest accomplishments seem inflated. Some of you may remember that the A.I. model GPT-4 aced the uniform bar exam a year ago. Turns out that it scored in the 48th percentile, not the 90th as claimed by OpenAI, according to a re-examination by the M.I.T. researcher Eric Martínez. Or what about Google’s claim that it used A.I. to discover more than two million new chemical compounds? A re-examination by experimental materials chemists at the University of California, Santa Barbara, found “scant evidence for compounds that fulfill the trifecta of novelty, credibility and utility.”

Meanwhile, researchers in many fields have found that A.I. often struggles to answer even simple questions, whether about the law, medicine or voter information. Researchers have even found that A.I. does not always improve the quality of computer programming, the task it is supposed to excel at.

I don’t think we’re in cryptocurrency territory, where the hype turned out to be a cover story for a number of illegal schemes that landed a few big names in prison. But it’s also pretty clear that we’re a long way from Mr. Altman’s promise that A.I. will become “the most powerful technology humanity has yet invented.”

Take Devin, a recently released “A.I. software engineer” that was breathlessly touted by the tech press. A flesh-and-blood software developer named Carl Brown decided to take on Devin. A task that took the generative A.I.-powered agent over six hours took Mr. Brown just 36 minutes. Devin also executed poorly, running a slower, outdated programming language through a complicated process. “Right now the state of the art of generative A.I. is it just does a bad, complicated, convoluted job that just makes more work for everyone else,” Mr. Brown concluded in his YouTube video.

Cognition, Devin’s maker, responded by acknowledging that Devin did not complete the requested output and added that it was eager for more feedback so it could keep improving its product. Of course, A.I. companies are always promising that an actually useful version of their technology is just around the corner. “GPT-4 is the dumbest model any of you will ever have to use again, by a lot,” Mr. Altman said while talking up GPT-5 at a recent event at Stanford University.

The reality is that A.I. models can often prepare a decent first draft. But I find that when I use A.I., I have to spend almost as much time correcting and revising its output as it would have taken me to do the work myself.

And consider for a moment the possibility that perhaps A.I. isn’t going to get that much better anytime soon. After all, the A.I. companies are running out of new data on which to train their models, and they are running out of energy to fuel their power-hungry A.I. machines. Meanwhile, authors and news organizations (including The New York Times) are contesting the legality of having their data ingested into the A.I. models without their consent, which could end up forcing quality data to be withdrawn from the models.

Given these constraints, it seems just as likely to me that generative A.I. could end up like the Roomba, the mediocre vacuum robot that does a passable job when you are home alone but not if you are expecting guests.

Companies that can get by with Roomba-quality work will, of course, still try to replace workers. But in workplaces where quality matters — and where workforces such as screenwriters and nurses are unionized — A.I. may not make significant inroads.

And if the A.I. models are relegated to producing mediocre work, they may have to compete on price rather than quality, which is never good for profit margins. In that scenario, skeptics such as Jeremy Grantham, an investor known for correctly predicting market crashes, could be right that the A.I. investment bubble is very likely to deflate soon.

The biggest question raised by a future populated by unexceptional A.I., however, is existential. Should we as a society be investing tens of billions of dollars, our precious electricity that could be used toward moving away from fossil fuels, and a generation of the brightest math and science minds on incremental improvements in mediocre email writing?

We can’t abandon work on improving A.I. The technology, however middling, is here to stay, and people are going to use it. But we should reckon with the possibility that we are investing in an ideal future that may not materialize.


Julia Angwin, a contributing Opinion writer and the founder of Proof News, writes about tech policy. You can follow her on Twitter or Mastodon, or subscribe to her personal newsletter.

