82 Data Mining Essay Topic Ideas & Examples

  • Disadvantages of Using Web 2.0 for Data Mining Applications This data can be confusing to the readers and may not be reliable. Lastly, with the use of Web 2.
  • Data Mining Classifiers: The Advantages and Disadvantages One of the major disadvantages of this algorithm is the fact that it has to generate distance measures for all the recorded attributes.
  • Data Warehouse and Data Mining in Business The circumstances leading to the establishment and development of the concept of data warehousing was attributed to the fact that failure to have a data warehouse led to the need of putting in place large […]
  • Ethnography and Data Mining in Anthropology The study of cultures is of great importance under normal circumstances to enhance the understanding of the same. Data mining is the success secret of ethnography.
  • Summary of C4.5 Algorithm: Data Mining C4.5 algorithm: each record in the data set should be associated with one of the offered classes; that is, one of the attributes should be treated as the class label.
  • The Data Mining Method in Healthcare and Education Thus, I would use data mining in both cases; however, before that, I would discover a way to improve the algorithms used for it.
  • Data Mining Tools and Data Mining Myths The first problem is correlated with keeping the identity of the person involved in data mining secret. One of the major myths regarding data mining is that it can replace domain knowledge.
  • Hybrid Data Mining Approach in Healthcare One of the healthcare projects that will call for the use of data mining is treatment evaluation. In this case, it is essential to realize that the main aim of health data mining is to […]
  • Terrorism and Data Mining Algorithms However, this is a necessary evil as the nation’s security has to be prioritized since these attacks lead to harm to a larger population compared to the infringements.
  • Data Mining and Its Major Advantages Thus, it is possible to conclude that data mining is a convenient and effective way of processing information, which has many advantages.
  • Transforming Coded and Text Data Before Data Mining However, to complete data mining, it is necessary to transform the data according to the techniques that are to be used in the process.
  • Data Mining and Machine Learning Algorithms The shortest distance of string between two instances defines the distance of measure. However, this is also not very clear as to which transformations are summed, and thus it aims to a probability with the […]
  • Data Mining in Social Networks: Linkedin.com One of the ways to achieve the aim is to understand how users view data mining of their data on LinkedIn.
  • Issues With Data Mining It is necessary to note that the usage of data mining helps the FBI to have access to the necessary information for terrorism and crime tracking.
  • Large Volume Data Handling: An Efficient Data Mining Solution Data mining is the process of sorting huge amounts of data and finding the relevant data. Data mining is widely used for the maintenance of data, which greatly helps an organization in […]
  • Data Mining and Analytical Developments In this era, where there is a lot of information to be handled at a go and with little time available, it is useful and wise to analyze data from different viewpoints and summarize […]
  • Levi’s Company’s Data Mining & Customer Analytics Levi, the renowned name in jeans is feeling the heat of competition from a number of other brands, which have come upon the scene well after Levi’s but today appear to be approaching Levi’s market […]
  • Cryptocurrency Exchange Market Prediction and Analysis Using Data Mining and Artificial Intelligence This paper aims to review the application of A.I. in the context of blockchain finance by examining scholarly articles to determine whether the A.I. algorithm can be used to analyze this financial market.
  • Data Mining in Healthcare: Applications and Big Data Analyze Big data analysis is among the most influential modern trends in informatics and it has applications in virtually every sphere of human life.
  • “Data Mining and Customer Relationship Marketing in the Banking Industry” by Chye & Gerry First of all, the article generally elaborates on the notion of customer relationship management, which is defined as “the process of predicting customer behavior and selecting actions to influence that behavior to benefit the company”.
  • Data Mining Techniques and Applications The use of data mining to detect disturbances in the ecosystem can help to avert problems that are destructive to the environment and to society.
  • Ethical Data Mining in the UAE Traffic Department The research question identified in the assignment two is considered to be the following, namely whether the implementation of the business intelligence into the working process will beneficially influence the work of the Traffic Department […]
  • Canadian University Dubai and Data Mining The aim of mining data in the education environment is to enhance the quality of education for the mass through proactive and knowledge-based decision-making approaches.
  • Data Mining and Customer Relationship Management As such, CRM not only entails the integration of marketing, sales, customer service, and supply chain capabilities of the firm to attain elevated efficiencies and effectiveness in conveying customer value, but it obliges the organization […]
  • E-Commerce: Mining Data for Better Business Intelligence The method allowed the use of Intel and an example to build the study and the literature on data mining for business intelligence to analyze the findings.
  • Ethical Implications of Data Mining by Government Institutions Critics of personal data mining insist that it infringes on the rights of an individual and results in the loss of sensitive information.
  • Data Mining Role in Companies The increasing adoption of data mining in various sectors illustrates the potential of the technology regarding the analysis of data by entities that seek information crucial to their operations.
  • Data Mining: Concepts and Methods Speed of data mining process is important as it has a role to play in the relevance of the data mined. The accuracy of data is also another factor that can be used to measure […]
  • Data Mining Technologies According to Han & Kamber, data mining is the process of discovering correlations, patterns, trends or relationships by searching through a large amount of data that in most circumstances is stored in repositories, business databases […]
  • Data Mining: A Critical Discussion In recent times, the relatively new discipline of data mining has been a subject of widely published debate in mainstream forums and academic discourses, not only due to the fact that it forms a critical […]
  • Commercial Uses of Data Mining The data mining process entails the use of a large relational database to identify the correlations that exist in given data. The principal role of the applications is to sift the data to identify correlations.
  • A Discussion on the Acceptability of Data Mining Today, more than ever before, individuals, organizations and governments have access to seemingly endless amounts of data that has been stored electronically on the World Wide Web and the Internet, and thus it makes much […]
  • Applying Data Mining Technology for Insurance Rate Making: Automobile Insurance Example
  • Applebee’s, Travelocity and Others: Data Mining for Business Decisions
  • Applying Data Mining Procedures to a Customer Relationship
  • Business Intelligence as Competitive Tool of Data Mining
  • Overview of Accounting Information System Data Mining
  • Applying Data Mining Technique to Disassembly Sequence Planning
  • Approach for Image Data Mining Cultural Studies
  • Apriori Algorithm for the Data Mining of Global Cyberspace Security Issues
  • Database Data Mining: The Silent Invasion of Privacy
  • Data Management: Data Warehousing and Data Mining
  • Constructive Data Mining: Modeling Consumers’ Expenditure in Venezuela
  • Data Mining and Its Impact on Healthcare
  • Innovations and Perspectives in Data Mining and Knowledge Discovery
  • Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection
  • Linking Data Mining and Anomaly Detection Techniques
  • Data Mining and Pattern Recognition Models for Identifying Inherited Diseases
  • Credit Card Fraud Detection Through Data Mining
  • Data Mining Approach for Direct Marketing of Banking Products
  • Constructive Data Mining: Modeling Argentine Broad Money Demand
  • Data Mining-Based Dispatching System for Solving the Pickup and Delivery Problem
  • Commercially Available Data Mining Tools Used in the Economic Environment
  • Data Mining Climate Variability as an Indicator of U.S. Natural Gas
  • Analysis of Data Mining in the Pharmaceutical Industry
  • Data Mining-Driven Analysis and Decomposition in Agent Supply Chain Management Networks
  • Credit Evaluation Model for Banks Using Data Mining
  • Data Mining for Business Intelligence: Multiple Linear Regression
  • Cluster Analysis for Diabetic Retinopathy Prediction Using Data Mining Techniques
  • Data Mining for Fraud Detection Using Invoicing Data
  • Jaeger Uses Data Mining to Reduce Losses From Crime and Waste
  • Data Mining for Industrial Engineering and Management
  • Business Intelligence and Data Mining – Decision Trees
  • Data Mining for Traffic Prediction and Intelligent Traffic Management System
  • Building Data Mining Applications for CRM
  • Data Mining Optimization Algorithms Based on the Swarm Intelligence
  • Big Data Mining: Challenges, Technologies, Tools, and Applications
  • Data Mining Solutions for the Business Environment
  • Overview of Big Data Mining and Business Intelligence Trends
  • Data Mining Techniques for Customer Relationship Management
  • Classification-Based Data Mining Approach for Quality Control in Wine Production
  • Data Mining With Local Model Specification Uncertainty
  • Employing Data Mining Techniques in Testing the Effectiveness of Modernization Theory
  • Enhancing Information Management Through Data Mining Analytics
  • Evaluating Feature Selection Methods for Learning in Data Mining Applications
  • Extracting Formations From Long Financial Time Series Using Data Mining
  • Financial and Banking Markets and Data Mining Techniques
  • Fraudulent Financial Statements and Detection Through Techniques of Data Mining
  • Harmful Impact Internet and Data Mining Have on Society
  • Informatics, Data Mining, Econometrics, and Financial Economics: A Connection
  • Integrating Data Mining Techniques Into Telemedicine Systems
  • Investigating Tobacco Usage Habits Using Data Mining Approach

IvyPanda. (2024, March 2). 82 Data Mining Essay Topic Ideas & Examples. https://ivypanda.com/essays/topic/data-mining-essay-topics/

data mining Recently Published Documents

Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detecting outliers, or anomalies, is one of the central problems in pattern-driven data mining. Outlier detection identifies inconsistent behavior in individual objects. It is an important area of data mining with many applications, such as detecting credit card fraud, discovering hacking, and uncovering criminal activity. Tools are needed to uncover the critical information buried in large volumes of data. This paper investigated a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying the clusters and outliers for datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, like instant irregular sets of clusters (C) and outliers (O), to boost the results. The results obtained after applying the algorithm to the dataset improved in terms of several parameters. For the comparative analysis, the accurate average value and the recall value are computed. The accurate average value is 74.05% for the existing COID algorithm and 77.21% for the proposed algorithm. The average recall value is 81.19% for the existing algorithm and 89.51% for the proposed one, which indicates that the proposed method performs better than the existing COID algorithm.
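To make the general idea of distance-based outlier detection concrete, here is a minimal sketch. It is not the paper's method and not the COID algorithm; it assumes a simple rule in which a point is scored by the distance to its k-th nearest neighbour and flagged when that score exceeds a multiple of the median score. The function names, the sample data, and the parameters `k` and `factor` are all hypothetical.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_outliers(points, k=2, factor=3.0):
    """Flag points whose score exceeds `factor` times the median score.

    The median-based threshold is an illustrative assumption, not a rule
    taken from the abstract above.
    """
    scores = knn_distance_scores(points, k)
    median = sorted(scores)[len(scores) // 2]
    return [score > factor * median for score in scores]


# Two tight groups of three points each, plus one isolated point.
data = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
    (20.0, 20.0),
]
flags = flag_outliers(data)
print(flags)  # only the isolated point (20.0, 20.0) is flagged
```

This brute-force version compares every pair of points, so it only suits small datasets; methods aimed at high-dimensional big data typically rely on clustering or indexing to avoid that cost.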

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For taxed goods, the actual freight is generally determined by multiplying the allocated freight per kilogram by the actual outgoing weight, based on the outgoing order number on the outgoing bill. Because conventional logistics cannot keep up with the rapid-response requirements that e-commerce orders place on logistics, this work discusses the implementation of data mining technology in bonded warehouse inbound and outbound goods trade. Specifically, a bonded warehouse decision-making system was developed, comprising a data warehouse, a conceptual model, an online analytical processing system, a human-computer interaction module, and a web data-sharing platform. The statistical query module can be used to perform statistics and queries on warehousing operations. After optimization of the whole warehousing business process, it takes only 19.1 hours to obtain the actual freight, nearly one third less than before optimization. This study could create a better environment for the development of China's processing trade.

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has become prevalent among users across numerous domains, in most countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes by exposing open APIs, which makes it the most suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is attracting growing interest. The most prominent problem in social media analytics is automatically identifying and ranking influencers. This research aims to detect users' topics of interest in social media and rank the users based on specific topics, domains, etc. A few hybrid parameters are also distinguished in this research, based on the post's content, the post's metadata, the user's profile, and the user's network features, to capture different aspects of being influential; these are used in the ranking algorithm. The results show that the proposed approach is effective in both the classification and the ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their harmful effects. They may negatively affect economies, businesses, the environment, people, and the workforce. In this paper, some of the factors that are interrelated with the COVID-19 pandemic are examined using data mining methodologies and approaches. As a result of the analysis, some rules and insights were discovered, and the performance of the data mining algorithms was evaluated. According to the results, the JRip algorithm had the highest correct classification rate and the lowest root mean squared error (RMSE). Considering classification rate and RMSE, JRip can be considered an effective method for understanding the factors related to deaths caused by the coronavirus.
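The two evaluation measures named above can be sketched as follows. This is only an illustration with made-up labels and predicted probabilities; it does not reproduce the paper's data, its JRip models, or the exact way its tooling computes RMSE. The function names and the 0.5 decision threshold are assumptions.

```python
import math


def classification_rate(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rmse(y_true, y_prob):
    """Root mean squared error between true labels and predicted probabilities."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_prob)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical binary labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 0.8]
# Turn probabilities into hard labels with an assumed 0.5 threshold.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print(classification_rate(y_true, y_pred))  # 0.75
print(round(rmse(y_true, y_prob), 4))       # 0.3354
```

A higher classification rate and a lower RMSE both point to a better classifier, which is the sense in which the abstract ranks JRip first.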

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to identify the writer’s feelings, expressed as positive or negative, by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given content as Positive, Negative, or Neutral. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out in the context of the English language. However, Bengali sentiment expression has varying degrees of sentiment labels, which can plausibly be distinct from those of English. Therefore, it is important that sentiment assessment for Bengali be developed and executed properly. In sentiment analysis, the predictive potential of an automatic model is completely dependent on the quality of dataset annotation. Bengali sentiment annotation is a challenging task due to the diversified structures (syntax) of the language and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for researchers, linguistic experts, and referees to annotate Bengali sentences accurately, with a view to building effective datasets for efficient automatic sentiment prediction.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networks Service) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. And then, a new taxonomy adopting hybrid philosophy (i.e., granularity and techniques) is proposed, and we made a series of comparative studies on elementary diffusion models under our taxonomy from the aspects of assumptions, methods, and pros and cons. We further summarized representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, open issues in this field following the methodology of diffusion modeling are discussed.

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation for learning law, compares the teaching effectiveness of two teaching methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of learning law using big data analysis. From the perspective of law student psychology, e-book teaching can attract students' attention, stimulate their interest in learning, deepen their impression of the material, expand their knowledge, and ultimately improve their performance in practical assessment. Because the sample size is small, the results may not be fully representative. The study has referential significance for stimulating motivation to learn law, as well as other theoretical disciplines, in colleges and universities, and provides ideas for reforming teaching modes there. This paper uses a decision tree algorithm from data mining for the analysis and identifies, from the students' perspective, the factors that influence law students' learning motivation and effectiveness.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education has helped improve the quality of traditional English teaching greatly. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and extract the grammatical and syntactic features of the English text. Then, a decision-tree-based method is proposed to predict grammar or syntax issues in the English text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar or syntax recognition.

Data mining articles within Scientific Reports

Article 10 April 2024 | Open Access

A decision support system based on recurrent neural networks to predict medication dosage for patients with Parkinson's disease

  • Atiye Riasi, Mehdi Delrobaei & Mehri Salari

Article 03 April 2024 | Open Access

A distributed feature selection pipeline for survival analysis using radiomics in non-small cell lung cancer patients

  • Benedetta Gottardelli, Varsha Gouthamchand & Andrea Damiani

Article 02 April 2024 | Open Access

Characterization of a putative orexin receptor in Ciona intestinalis sheds light on the evolution of the orexin/hypocretin system in chordates

  • Maiju K. Rinne, Lauri Urvas & Henri Xhaard

Multiomics analysis to explore blood metabolite biomarkers in an Alzheimer’s Disease Neuroimaging Initiative cohort

  • [...], Yuki Matsuzawa & Balebail Ashok Raj

Article 01 April 2024 | Open Access

Information heterogeneity between progress notes by physicians and nurses for inpatients with digestive system diseases

  • Yukinori Mashima, Masatoshi Tanigawa & Hideto Yokoi

Article 25 March 2024 | Open Access

Integrated image and location analysis for wound classification: a deep learning approach

  • [...], Tirth Shah & Zeyun Yu

Article 19 March 2024 | Open Access

Persistence of collective memory of corporate bankruptcy events discussed on X (Twitter) is influenced by pre-bankruptcy public attention

  • Kathleen M. Jagodnik, Sharon Dekel & Alon Bartal

Article 18 March 2024 | Open Access

Clustering analysis for the evolutionary relationships of SARS-CoV-2 strains

  • Xiangzhong Chen, Mingzhao Wang & Juanying Xie

Article 15 March 2024 | Open Access

Development of phenotyping algorithms for hypertensive disorders of pregnancy (HDP) and their application in more than 22,000 pregnant women

  • Satoshi Mizuno, Maiko Wagata & Soichi Ogishima

Article 13 March 2024 | Open Access

Predicting early Alzheimer’s with blood biomarkers and clinical features

  • Muaath Ebrahim AlMansoori, Sherlyn Jemimah & Aamna AlShehhi

Article 09 March 2024 | Open Access

Sentiment analysis of video danmakus based on MIBE-RoBERTa-FF-BiLSTM

  • Jianbo Zhao, Huailiang Liu & Shanzhuang Zhang

Article 05 March 2024 | Open Access

A new R package to parse plant species occurrence records into unique collection events efficiently reduces data redundancy

  • Pablo Hendrigo Alves de Melo, Nadia Bystriakova & Alexandre K. Monro

Article 02 March 2024 | Open Access

Prediction of lncRNA and disease associations based on residual graph convolutional networks with attention mechanism

  • Shengchang Wang, Jiaqing Qiao & Shou Feng

Article 01 March 2024 | Open Access

Cluster analysis and visualisation of electronic health records data to identify undiagnosed patients with rare genetic diseases

  • Daniel Moynihan, Sean Monaco & Saumya Shekhar Jamuar

Article 21 February 2024 | Open Access

Tuning attention based long-short term memory neural networks for Parkinson’s disease detection using modified metaheuristics

  • [...], Timea Bezdan & Nebojsa Bacanin

Article 19 February 2024 | Open Access

Effects of different KRAS mutants and Ki67 expression on diagnosis and prognosis in lung adenocarcinoma

  • [...], Liwen Dong & Pan Li

Article 15 February 2024 | Open Access

Identification of SLC40A1, LCN2, CREB5, and SLC7A11 as ferroptosis-related biomarkers in alopecia areata through machine learning

  • [...], Dongfan Wei & Xiuzu Song

Article 07 February 2024 | Open Access

Unsupervised analysis of whole transcriptome data from human pluripotent stem cells cardiac differentiation

  • Sofia P. Agostinho, Mariana A. Branco & Carlos A. V. Rodrigues

Article 03 February 2024 | Open Access

AI models for automated segmentation of engineered polycystic kidney tubules

  • Simone Monaco, Nicole Bussola & Daniele Apiletti

Article 02 February 2024 | Open Access

Development and validation of a cuproptosis-related prognostic model for acute myeloid leukemia patients using machine learning with stacking

  • Xichao Wang & Suning Chen

Article 30 January 2024 | Open Access

Assessing the feasibility of applying machine learning to diagnosing non-effusive feline infectious peritonitis

  • Dawn Dunbar, Simon A. Babayan & William Weir

Article 29 January 2024 | Open Access

Survival prediction of glioblastoma patients using modern deep learning and machine learning techniques

  • Samin Babaei Rikan, Amir Sorayaie Azar & Uffe Kock Wiil

Article 25 January 2024 | Open Access

Identification of gene signatures and molecular mechanisms underlying the mutual exclusion between psoriasis and leprosy

  • You-Wang Lu, Rong-Jing Dong & Yu-Ye Li

Article 24 January 2024 | Open Access

Identification of shared pathogenetic mechanisms between COVID-19 and IC through bioinformatics and system biology

  • Zhenpeng Sun & Jiangang Gao

Article 18 January 2024 | Open Access

Integrated image and sensor-based food intake detection in free-living

  • Tonmoy Ghosh & Edward Sazonov

Article 17 January 2024 | Open Access

Global characterization of biosynthetic gene clusters in non-model eukaryotes using domain architectures

  • Taehyung Kwon & Blake T. Hovde

Article 16 January 2024 | Open Access

Parkinson’s disease detection based on features refinement through L1 regularized SVM and deep neural network

  • [...], Ashir Javeed & Amir H. Gandomi

Article 12 January 2024 | Open Access

Identification of important genes related to HVSMC proliferation and migration in graft restenosis based on WGCNA

  • Xiankun Liu, Mingzhen Qin & Zhigang Guo

Article 06 January 2024 | Open Access

Acute ischemic stroke prediction and predictive factors analysis using hematological indicators in elderly hypertensives post-transient ischemic attack

  • [...], Chenguang Zheng & Le Ge

Article 04 January 2024 | Open Access

Immune, metabolic landscapes of prognostic signatures for lung adenocarcinoma based on a novel deep learning framework

  • [...], Shibin Sun & Lina Chen

Article 02 January 2024 | Open Access

Integrated whole transcriptome profiling revealed a convoluted circular RNA-based competing endogenous RNAs regulatory network in colorectal cancer

  • Hasan Mollanoori, Yaser Ghelmani & Mohammadreza Dehghani

Article 28 December 2023 | Open Access

Quantitative gait analysis and prediction using artificial intelligence for patients with gait disorders

  • Nawel Ben Chaabane, Pierre-Henri Conze & Mathieu Lamard

Article 27 December 2023 | Open Access

Statistical analysis of synonymous and stop codons in pseudo-random and real sequences as a function of GC content

  • Valentin Wesp, Günter Theißen & Stefan Schuster

StackER: a novel SMILES-based stacked approach for the accelerated and efficient discovery of ERα and ERβ antagonists

  • Nalini Schaduangrat, Nutta Homdee & Watshara Shoombuatong

Article 18 December 2023 | Open Access

Analyzing the impact of feature selection methods on machine learning algorithms for heart disease prediction

  • Zeinab Noroozi, Azam Orooji & Leila Erfannia

Article 13 December 2023 | Open Access

surviveR: a flexible shiny application for patient survival analysis

  • Tamas Sessler, Gerard P. Quinn & Simon S. McDade

Article 08 December 2023 | Open Access

Dictionary-based matching graph network for biomedical named entity recognition

  • [...] & Kai Tan

Article 23 November 2023 | Open Access

Computer-aided diagnosis of keratoconus through VAE-augmented images using deep learning

  • Zhila Agharezaei, Reza Firouzi & Saeid Eslami

Article 14 November 2023 | Open Access

Node embedding-based graph autoencoder outlier detection for adverse pregnancy outcomes

  • [...], Nazar Zaki & Luai A. Ahmed

Article 13 November 2023 | Open Access

Emerging infectious disease surveillance using a hierarchical diagnosis model and the Knox algorithm

  • Mengying Wang, Bingqing Yang & Cheng Yang

Article 10 November 2023 | Open Access

Toward MR protocol-agnostic, unbiased brain age predicted from clinical-grade MRIs

  • Pedro A. Valdes-Hernandez, Chavier Laffitte Nodarse & Yenisel Cruz-Almeida

Article 06 November 2023 | Open Access

Prognostic and immunotherapeutic significance of immunogenic cell death-related genes in colon adenocarcinoma patients

  • [...] & Jian Wang

Article 05 November 2023 | Open Access

Predicting dengue transmission rates by comparing different machine learning models with vector indices and meteorological data

  • Song Quan Ong, Pradeep Isawasan & Gomesh Nair

Article 02 November 2023 | Open Access

Anoikis-related genes signature development for clear cell renal cell carcinoma prognosis and tumor microenvironment

  • Yinglei Jiang, Ying Wang & Xukai Wang

Article 30 October 2023 | Open Access

Seven chromatin regulators as immune cell infiltration characteristics, potential diagnostic biomarkers and drugs prediction in hepatocellular carcinoma

  • Jin-wen Chai
  • , Xi-wen Hu
  •  &  Yu-na Dong

Article 28 October 2023 | Open Access

Exploring new subgroups for irritable bowel syndrome using a machine learning algorithm

  • Elahe Mousavi
  • , Ammar Hassanzadeh Keshteli
  •  &  Peyman Adibi

Article 26 October 2023 | Open Access

A text mining approach to categorize patient safety event reports by medication error type

  • Christian Boxley
  • , Mari Fujimoto
  •  &  Allan Fong

Article 21 October 2023 | Open Access

PetBERT: automated ICD-11 syndromic disease coding for outbreak detection in first opinion veterinary electronic health records

  • Sean Farrell
  • , Charlotte Appleton
  •  &  Noura Al Moubayed

Article 20 October 2023 | Open Access

Identifying prognostic genes related PANoptosis in lung adenocarcinoma and developing prediction model based on bioinformatics analysis

  • , Jiangnan Xia
  •  &  Kaiwen Hu

Article 18 October 2023 | Open Access

Oral manifestations in patients with coronavirus disease 2019 (COVID-19) identified using text mining: an observational study

  • Sandra Guauque-Olarte
  • , Laura Cifuentes-C
  •  &  Cristian Fong


A Systematic Review on Data Mining for Mathematics and Science Education

  • Published: 14 May 2020
  • Volume 19, pages 639–659 (2021)

  • Dongjo Shin 1 &
  • Jaekwoun Shim 1  

Educational data mining is used to discover significant phenomena and resolve educational issues occurring in the context of teaching and learning. This study provides a systematic literature review of educational data mining in mathematics and science education. A total of 64 articles were reviewed in terms of the research topics and data mining techniques used. This review revealed that data mining in mathematics and science education has been commonly used to understand students’ behavior and thinking process, identify factors affecting student achievements, and provide automated assessment of students’ written work. Recently, researchers have tended to use such data mining techniques as text mining to develop learning systems for supporting teachers’ instruction and students’ learning. We also found that classification, text mining, and clustering are major data mining techniques researchers have used. Studies using data mining were more likely to be conducted in the field of science education than in the field of mathematics education. We discuss the main results of our review in comparison with the previous reviews of educational data mining (EDM) literature and with EDM studies conducted in the context of science and mathematics education. Finally, we provide implications for research and teaching and learning of science and mathematics and suggest potential research directions.

Abidi, S., Hussain, M., Xu, Y., & Zhang, W. (2019). Prediction of confusion attempting algebra homework in an intelligent tutoring system through machine learning techniques for educational sustainable development. Sustainability . Advance online publication. https://doi.org/10.3390/su11010105 .

Aiken, J. M., Henderson, R., & Caballero, M. D. (2019). Modeling student pathways in a physics bachelor’s degree program. Physical Review Physics Education Research, Advance online publication . https://doi.org/10.1103/PhysRevPhysEducRes.15.010128 .

Akgün, E., & Demir, M. (2018). Modeling course achievements of elementary education teacher candidates with artificial neural networks. International Journal of Assessment Tools in Education, 5 (3), 491–509.

Article   Google Scholar  

Aksoy, E., Narli, S., & Idil, F. H. (2016). Using data mining techniques examination of the middle school students’ attitude towards mathematics in the context of some variables. International Journal of Education in Mathematics Science and Technology, 4 (3), 210–228.

Aldowah, H., Al-Samarraie, H., & Fauzy, W. M. (2019). Educational data mining and learning analytics for 21stcentury higher education: A review and synthesis. Telematics and Informatics, 37 , 13–46.

Araya, R., Jiménez, A., Bahamondez, M., Calfucura, P., Dartnell, P., & Soto-Andrade, J. (2014). Teaching modeling skills using a massively multiplayer online mathematics game. World Wide Web, 17 (2), 213–227.

Bağ, H., & Çalık, M. (2017). A thematic review of argumentation studies at the K-8 level. Education and Science, 42 (190), 281–303.

Google Scholar  

Barnhart, T., & van Es, E. (2015). Studying teacher noticing: Examining the relationship among pre-service science teachers’ ability to attend, analyze and respond to student thinking. Teaching and Teacher Education, 45 , 83–93.

Beggrow, E. P., Ha, M., Nehm, R. H., Pearl, D., & Boone, W. J. (2014). Assessing scientific practices using machine-learning methods: How closely do they match clinical interview performance? Journal of Science Education and Technology, 23 (1), 160–182.

Bywater, J. P., Chiu, J. L., Hong, J., & Sankaranarayanan, V. (2019). The teacher responding tool: Scaffolding the teacher practice of responding to student ideas in mathematics classrooms. Computers & Education, 139 , 16–30.

Cai, W., Grossman, J., Lin, Z., Sheng, H., Wei, J. T. Z., Williams, J. J., & Goel, S. (2019). MathBot: A personalized conversational agent for learning math . Retrieved from https://footprints.stanford.edu/papers/mathbot.pdf . Accessed 16 Jan 2020.

Çalık, M., & Sözbilir, M. (2014). Parameters of content analysis. Education and Science, 39 (174), 33–38.

Chen, C. T., & Chang, K. Y. (2017). A study on the rare factors exploration of learning effectiveness by using fuzzy data mining. EURASIA Journal of Mathematics, Science and Technology Education, 13 (6), 2235–2253.

Chen, J., Zhang, Y., Wei, Y., & Hu, J. (2019). Discrimination of the contextual features of top performers in scientific literacy using a machine learning approach. Research in Science Education . Advanced online publication. https://doi.org/10.1007/s11165-019-9835-y .

Cheon, J., Lee, S., Smith, W., Song, J., & Kim, Y. (2013). The determination of children’s knowledge of global lunar patterns from online essays using text mining analysis. Research in Science Education, 43 (2), 667–686.

Choi, Y., Lim, Y., & Son, D. (2017). A semantic network analysis on the recognition of STEAM by middle school students in South Korea. EURASIA Journal of Mathematics, Science and Technology Education, 13 (10), 6457–6469.

Cooper, C. I., & Pearson, P. T. (2012). A genetically optimized predictive system for success in general chemistry using a diagnostic algebra test. Journal of Science Education and Technology, 21 (1), 197–205.

Depren, S. K. (2018). Prediction of students’ science achievement: An application of multivariate adaptive regression splines and regression trees. Journal of Baltic Science Education, 17 (5), 887–903.

Depren, S. K., Aşkın, Ö. E., & Öz, E. (2017). Identifying the classification performances of educational data mining methods: A case study for TIMSS. Educational Sciences: Theory & Practice, 17 (5), 1605–1623.

Dutt, A., Ismail, M. A., & Herawan, T. (2017). A systematic review on educational data mining. IEEE Access, 5 , 15991–16005.

Duzhin, F., & Gustafsson, A. (2018). Machine learning-based app for self-evaluation of teacher-specific instructional style and tools. Education in Science, 8 (1), 15. https://doi.org/10.3390/educsci9040263 .

English, L. D., & King, D. (2019). STEM integration in sixth grade: Desligning and constructing paper bridges. International Journal of Science and Mathematics Education, 17 (5), 863–884.

Figueiredo, M., Esteves, L., Neves, J., & Vicente, H. (2016). A data mining approach to study the impact of the methodology followed in chemistry lab classes on the weight attributed by the students to the lab work on learning and motivation. Chemistry Education Research and Practice, 17 (1), 156–171.

Filiz, E., & Oz, E. (2019). Finding the best algorithms and effective factors in classification of Turkish science student success. Journal of Baltic Science Education, 18 (2), 239–253.

Gabriel, F., Signolet, J., & Westwell, M. (2018). A machine learning approach to investigating the effects of mathematics dispositions on mathematical literacy. International Journal of Research & Method in Education, 41 (3), 306–327.

Gobert, J. D., Kim, Y. J., Sao Pedro, M. A., Kennedy, M., & Betts, C. G. (2015). Using educational data mining to assess students’ skills at designing and conducting experiments within a complex systems microworld. Thinking Skills and Creativity, 18 , 81–90.

Goggins, S. P., Xing, W., Chen, X., Chen, B., & Wadholm, B. (2015). Learning analytics at “small” scale: Exploring a complexity-grounded model for assessment automation. Journal of Universal Computer Science, 21 (1), 66–92.

Gorostiaga, A., & Rojo-Álvarez, J. L. (2016). On the use of conventional and statistical-learning techniques for the analysis of PISA results in Spain. Neurocomputing, 171 , 625–637.

Günel, K., Polat, R., & Kurt, M. (2016). Analyzing learning concepts in intelligent tutoring systems. International Arab Journal of Information Technology, 13 (2), 281–286.

Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: A case study of scientific explanations. Journal of Science Education and Technology, 25 (3), 358–374.

Ha, M., Nehm, R. H., Urban-Lurain, M., & Merrill, J. E. (2011). Applying computerized-scoring models of written biological explanations across courses and colleges: Prospects and limitations. CBE Life Sciences Education, 10 (4), 379–393.

Hershkovitz, A., de Baker, R. S. J., Gobert, J., Wixon, M., & Pedro, M. S. (2013). Discovery with models: A case study on carelessness in computer-based science inquiry. American Behavioral Scientist, 57 (10), 1480–1499.

Hodgen, J., Küchemann, D., Brown, M., & Coe, R. (2009). Children’s understandings of algebra 30 years on. Research in Mathematics Education, 11 (2), 193–194.

Hossain, Z., Bumbacher, E., Brauneis, A., Diaz, M., Saltarelli, A., Blikstein, P., & Riedel-Kruse, I. H. (2018). Design guidelines and empirical case study for scaling authentic inquiry-based science learning via open online courses and interactive biology cloud labs. International Journal of Artificial Intelligence in Education, 28 (4), 478–507.

Howard, E., Meehan, M., & Parnell, A. (2018). Live lectures or online videos: Students’ resource choices in a first-year university mathematics module. International Journal of Mathematical Education in Science and Technology, 49 (4), 530–553.

Huang, C. J., Wang, Y. W., Huang, T. H., Chen, Y. C., Chen, H. M., & Chang, S. C. (2011). Performance evaluation of an online argumentation learning assistance agent. Computers & Education, 57 (1), 1270–1280.

Ismail, S., & Abdulla, S. (2015). Design and implementation of an intelligent system to predict the student graduation AGPA. Australian Educational Computing, 30 (2). Retrieved from http://journal.acce.edu.au/index.php/AEC/article/view/53 . Accessed 16 Jan 2020.

Jacobs, V. R., Lamb, L. L., & Philipp, R. A. (2010). Professional noticing of children’s mathematical thinking. Journal for Research in Mathematics Education, 41 (2), 169–202.

Kilic, H. (2018). Pre-service mathematics teachers’ noticing skills and scaffolding practices. International Journal of Science and Mathematics Education, 16 (2), 377–400.

Kim, D., Yoon, M., Jo, I. H., & Branch, R. M. (2018). Learning analytics to support self-regulated learning in asynchronous online courses: A case study at a women’s university in South Korea. Computers & Education, 127 , 233–251.

Kinnebrew, J. S., Killingsworth, S. S., Clark, D. B., Biswas, G., Sengupta, P., Minstrell, J., . . . Krinks, K. (2016). Contextual markup and mining in digital games for science learning: Connecting player behaviors to learning goals. IEEE Transactions on Learning Technologies, 10 (1), 93–103.

Kirby, N., & Dempster, E. (2015). Not the norm: The potential of tree analysis of performance data from students in a foundation mathematics module. African Journal of Research in Mathematics, Science and Technology Education, 19 (2), 131–142.

Kitchenham, B., & Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering (Version 2.3) . Keele University and Durham University.

Lamb, R., Annetta, L., Vallett, D., & Sadler, T. (2014). Cognitive diagnostic like approaches using neural-network analysis of serious educational videogames. Computers & Education, 70 , 92–104.

Lamb, R., Cavagnetto, A., & Akmal, T. (2016). Examination of the nonlinear dynamic systems associated with science student cognition while engaging in science information processing. International Journal of Science and Mathematics Education, 14 (1), 187–205.

Lavie Alon, N., & Tal, T. (2015). Student self-reported learning outcomes of field trips: The pedagogical impact. International Journal of Science Education, 37 (8), 1279–1298.

Lee, H. S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103 (3), 590–622.

Lee, Y. (2019). Using self-organizing map and clustering to investigate problem-solving patterns in the massive open online course: An exploratory study. Journal of Educational Computing Research, 57 (2), 471–490.

Levy, S. T., & Wilensky, U. (2011). Mining students’ inquiry actions for understanding of complex systems. Computers & Education, 56 (3), 556–573.

Liu, S. H., & Lee, G. G. (2013). Using a concept map knowledge management system to enhance the learning of biology. Computers & Education, 68 , 105–116.

Liu, O. L., Rios, J. A., Heilman, M., Gerard, L., & Linn, M. C. (2016). Validation of automated scoring of science assessments. Journal of Research in Science Teaching, 53 (2), 215–233.

Liu, X., & Whitford, M. (2011). Opportunities-to-learn at home: Profiles of students with and without reaching science proficiency. Journal of Science Education and Technology, 20 (4), 375–387.

Magana, A. J., Elluri, S., Dasgupta, C., Seah, Y. Y., Madamanchi, A., & Boutin, M. (2019). The role of simulation-enabled design learning experiences on middle school students’ self-generated inherence heuristics. Journal of Science Education and Technology, 28 (4), 1–17.

Malmberg, J., Järvenoja, H., & Järvelä, S. (2013). Patterns in elementary school students’ strategic actions in varying learning situations. Instructional Science, 41 (5), 933–954.

Martin, T., Petrick Smith, C., Forsgren, N., Aghababyan, A., Janisiewicz, P., & Baker, S. (2015). Learning fractions by splitting: Using learning analytics to illuminate the development of mathematical understanding. Journal of the Learning Sciences, 24 (4), 593–637.

Masci, C., Johnes, G., & Agasisti, T. (2018). Student and school performance across countries: A machine learning approach. European Journal of Operational Research, 269 (3), 1072–1085.

McConney, A., & Perry, L. B. (2010). Science and mathematics achievement in Australia: The role of school socioeconomic composition in educational equity and effectiveness. International Journal of Science and Mathematics Education, 8 (3), 429–452.

National Council of Teachers of Mathematics. (2014). Principles to actions: Ensuring mathematical success for all . Reston, VA: Author.

National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas . Washington, DC: National Academies Press.

Nehm, R. H., Ha, M., & Mayfield, E. (2012). Transforming biology assessment with machine learning: Automated scoring of written evolutionary explanations. Journal of Science Education and Technology, 21 (1), 183–196.

Nehm, R. H., & Haertig, H. (2012). Human vs. computer diagnosis of students’ natural selection knowledge: Testing the efficacy of text analytic software. Journal of Science Education and Technology, 21 (1), 56–73.

NGSS Lead States. (2013). Next generation science standards: For states, by states . Washington, DC: National Academies Press.

Northcutt, C. G., Ho, A. D., & Chuang, I. L. (2016). Detecting and preventing “multiple-account” cheating in massive open online courses. Computers & Education, 100 , 71–80.

Owens, M. T., Seidel, S. B., Wong, M., Bejines, T. E., Lietz, S., Perez, J. R., . . . Balukjian, B. (2017). Classroom sound can be used to classify teaching practices in college science courses. Proceedings of the National Academy of Sciences, 114 (12), 3085–3090.

Pantziara, M., & Philippou, G. N. (2015). Students’ motivation in the mathematics classroom. Revealing causes and consequences. International Journal of Science and Mathematics Education, 13 (2), 385–411.

Papamitsiou, Z., & Economides, A. A. (2014). Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence. Journal of Educational Technology & Society, 17 (4), 49–64.

Peña-Ayala, A. (2014). Educational data mining: A survey and a data mining-based analysis of recent works. Expert Systems with Applications, 41 (4), 1432–1462.

Prevost, L. B., Smith, M. K., & Knight, J. K. (2016). Using student writing and lexical analysis to reveal student thinking about the role of stop codons in the central dogma. CBE Life Sciences Education, 15 (4), ar65. https://doi.org/10.1187/cbe.15-12-0267 .

Rao, D. C., & Saha, S. K. (2019). An immersive learning platform for efficient biology learning of secondary school-level students. Journal of Educational Computing Research . Advanced online publication. https://doi.org/10.1177/0735633119854031 .

Reitsma, R., Marshall, B., & Chart, T. (2012). Can intermediary-based science standards crosswalking work? Some evidence from mining the standard alignment tool (SAT). Journal of the American Society for Information Science and Technology, 63 (9), 1843–1858.

Roberts, J. D., Chung, G. K., & Parks, C. B. (2016). Supporting children’s progress through the PBS KIDS learning analytics platform. Journal of Children and Media, 10 (2), 257–266.

Rodrigues, M. W., Isotani, S., & Zárate, L. E. (2018). Educational data mining: A review of evaluation process in the e-learning. Telematics and Informatics, 35 (6), 1701–1717.

Romero, C., & Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications, 33 (1), 135–146.

Romero, C., & Ventura, S. (2010). Educational data mining: A review of the state-of-the-art. IEEE Transaction on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40 (6), 601–618.

Romero, C., & Ventura, S. (2013). Data mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3 (1), 12–27.

Saa, A. A., Al-Emran, M., & Shaalan, K. (2019). Factors affecting students’ performance in higher education: A systematic review of predictive data mining techniques. Technology, knowledge and learning . Advanced online publication. doi: https://doi.org/10.1007/s10758-019-09408-7 .

Sánchez-Matamoros, G., Fernández, C., & Llinares, S. (2015). Developing pre-service teacher’ noticing of students’ understanding of the derivative concept. International Journal of Science and Mathematics Education, 13 (6), 1305–1329.

Scarpello, G. (2007). Helping students get past math anxiety. Techniques: Connecting Education and Careers, 82 (6), 34–35.

Schwarz, B. B., Prusak, N., Swidan, O., Livny, A., Gal, K., & Segal, A. (2018). Orchestrating the emergence of conceptual learning: A case study in a geometry class. International Journal of Computer-Supported Collaborative Learning, 13 (2), 189–211.

Sergis, S., Sampson, D. G., Rodríguez-Triana, M. J., Gillet, D., Pelliccione, L., & de Jong, T. (2019). Using educational data from teaching and learning to inform teachers’ reflective educational design in inquiry-based STEM education. Computers in Human Behavior, 92 , 724–738.

Shahiri, A. M., Husain, W., & Rashid, N. A. (2015). A review on predicting student’s performance using data mining techniques. Procedia Computer Science, 72 , 414–422.

She, H.-C., Lin, H.-s., & Huang, L.-Y. (2019). Reflections on and implications of the programme for international student assessment 2015 performance of students in Taiwan: The role of epistemic beliefs about science in scientific literacy. Journal of Research in Science Teaching . Advanced online publication. https://doi.org/10.1002/tea.21553 .

Sieke, S. A., McIntosh, B. B., Steele, M. M., & Knight, J. K. (2019). Characterizing students’ ideas about the effects of a mutation in a noncoding region of DNA. CBE Life Sciences Education, 18 (2), ar18. https://doi.org/10.1187/cbe.18-09-0173 .

Suh, S. C., Upadhyaya, A., & Nadig, A. (2019). Analyzing personality traits and external factors for stem education awareness using machine learning. International Journal of Advanced Computer Science and Applications, 10 (5), 1–4.

Tawfik, A. A., Reeves, T. D., Stich, A. E., Gill, A., Hong, C., McDade, J., . . . Giabbanelli, P. J. (2017). The nature and level of learner–learner interaction in a chemistry massive open online course (MOOC). Journal of Computing in Higher Education, 29 (3), 411–431.

Tissenbaum, M., & Slotta, J. D. (2019). Developing a smart classroom infrastructure to support real-time student collaboration and inquiry: A 4-year design study. Instructional Science . Advanced online publication , 47 , 423–462. https://doi.org/10.1007/s11251-019-09486-1 .

Wahlberg, S. J., & Gericke, N. M. (2018). Conceptual demography in upper secondary chemistry and biology textbooks’ descriptions of protein synthesis: A matter of context ? CBE Life Sciences Education, 17 (3), ar51. https://doi.org/10.1187/cbe.17-12-0274 .

Wang, X. (2016). Course-taking patterns of community college students beginning in STEM: Using data mining techniques to reveal viable STEM transfer pathways. Research in Higher Education, 57 (5), 544–569.

Wiley, J., Hastings, P., Blaum, D., Jaeger, A. J., Hughes, S., Wallace, P., ... & Britt, M. A. (2017). Different approaches to assessing the quality of explanations following a multiple-document inquiry activity in science. International Journal of Artificial Intelligence in Education, 27 (4), 758–790.

Zhang, W., Qin, S., Jin, H., Deng, J., & Wu, L. (2017). An empirical study on student evaluations of teaching based on data mining. EURASIA Journal of Mathematics, Science and Technology Education, 13 (8), 5837–5845.

Download references

Author information

Authors and affiliations.

Gifted Education Center, Korea University, 315 Lyceum, 145 Anam-ro, Seongbuk-gu, Seoul, South Korea

Dongjo Shin & Jaekwoun Shim

Corresponding author

Correspondence to Jaekwoun Shim.

About this article

Shin, D., Shim, J. A Systematic Review on Data Mining for Mathematics and Science Education. Int J of Sci and Math Educ 19, 639–659 (2021). https://doi.org/10.1007/s10763-020-10085-7

Received: 12 November 2019

Accepted: 18 March 2020

Published: 14 May 2020

Issue Date: April 2021

DOI: https://doi.org/10.1007/s10763-020-10085-7

  • Educational data mining
  • Literature review
  • Mathematics education
  • Science education

Data Mining and Modeling

The proliferation of machine learning means that learned classifiers lie at the core of many products across Google. However, questions in practice are rarely so clean that an out-of-the-box algorithm can simply be applied. A big challenge is in developing metrics, designing experimental methodologies, and modeling the space to create parsimonious representations that capture the fundamentals of the problem. These problems cut across Google’s products and services, from designing experiments for testing new auction algorithms to developing automated metrics to measure the quality of a road map.

Data mining lies at the heart of many of these questions, and the research done at Google is at the forefront of the field. Whether it is finding more efficient algorithms for working with massive data sets, developing privacy-preserving methods for classification, or designing new machine learning approaches, our group continues to push the boundary of what is possible.

Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information from large data sets.

Given the evolution of data warehousing technology and the growth of big data, adoption of data mining techniques has rapidly accelerated over the last couple of decades, assisting companies by transforming their raw data into useful knowledge. However, although technology continuously evolves to handle data at a large scale, leaders still face challenges with scalability and automation.

Data mining has improved organizational decision-making through insightful data analyses. The data mining techniques that underpin these analyses serve two main purposes: they can either describe the target dataset or predict outcomes through the use of machine learning algorithms. These methods are used to organize and filter data, surfacing the most interesting information, from fraud detection to user behaviors, bottlenecks and even security breaches.

Combined with data analytics and visualization tools such as Apache Spark, data mining has become easier to approach, and extracting relevant insights has become faster. Advances within artificial intelligence continue to expedite adoption across industries.

The data mining process involves a number of steps, from data collection to visualization, to extract valuable information from large data sets. As mentioned above, data mining techniques are used to generate descriptions and predictions about a target data set. Data scientists describe data through their observations of patterns, associations and correlations. They also group data through classification, regression and clustering methods, and identify outliers for use cases such as spam detection.

Data mining usually consists of four main steps: setting objectives, data gathering and preparation, applying data mining algorithms and evaluating results.

1. Set the business objectives:  This can be the hardest part of the data mining process, and many organizations spend too little time on this important step. Data scientists and business stakeholders need to work together to define the business problem, which helps inform the data questions and parameters for a given project. Analysts may also need to do additional research to understand the business context appropriately.

2. Data preparation: Once the scope of the problem is defined, it is easier for data scientists to identify which set of data will help answer the questions pertinent to the business. Once they collect the relevant data, it is cleaned to remove noise such as duplicates, missing values and outliers. Depending on the dataset, an additional step may be taken to reduce the number of dimensions, as too many features can slow down any subsequent computation. Data scientists will look to retain the most important predictors to ensure optimal accuracy within any models.

3. Model building and pattern mining:  Depending on the type of analysis, data scientists may investigate any interesting data relationships, such as sequential patterns, association rules or correlations. While high-frequency patterns have broader applications, sometimes the deviations in the data can be more interesting, highlighting areas of potential fraud.

Deep learning algorithms may also be applied to classify or cluster a data set depending on the available data. If the input data is labelled (i.e. supervised learning), a classification model may be used to categorize data, or alternatively, a regression may be applied to predict the likelihood of a particular assignment. If the dataset isn’t labelled (i.e. unsupervised learning), the individual data points in the training set are compared with one another to discover underlying similarities, clustering them based on those characteristics.

4. Evaluation of results and implementation of knowledge: Once the data is aggregated, the results need to be evaluated and interpreted. Final results should be valid, novel, useful and understandable. When these criteria are met, organizations can use this knowledge to implement new strategies, achieving their intended objectives.
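The cleaning part of the data preparation step above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `clean`, the field names `id` and `amount`, the sample rows, and the 1.5 × IQR outlier fence are all hypothetical choices, not something the text prescribes, and dimensionality reduction is left out.

```python
from statistics import quantiles

def clean(rows, key, field):
    """Drop duplicate keys, missing values, and IQR outliers (illustrative only)."""
    seen, kept = set(), []
    for row in rows:
        if row[key] in seen or row.get(field) is None:
            continue  # skip repeats of an already-kept key and rows with no value
        seen.add(row[key])
        kept.append(row)
    if len(kept) < 4:
        return kept  # too few rows to estimate quartiles meaningfully
    q1, _, q3 = quantiles([row[field] for row in kept], n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [row for row in kept if low <= row[field] <= high]

# Hypothetical sample: one duplicate id, one missing amount, one extreme value.
rows = [
    {"id": 1, "amount": 10}, {"id": 2, "amount": 11}, {"id": 2, "amount": 11},
    {"id": 3, "amount": None}, {"id": 4, "amount": 11}, {"id": 5, "amount": 12},
    {"id": 6, "amount": 12}, {"id": 7, "amount": 12}, {"id": 8, "amount": 13},
    {"id": 9, "amount": 500},
]
cleaned_ids = [row["id"] for row in clean(rows, "id", "amount")]
```

Whether outliers should be dropped at all depends on the objective: as step 3 notes, deviations can be the interesting part when the goal is fraud detection.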

Data mining works by using various algorithms and techniques to turn large volumes of data into useful information. Here are some of the most common ones:

Association rules:  An association rule is a rule-based method for finding relationships between variables in a given dataset. These methods are frequently used for market basket analysis, allowing companies to better understand relationships between different products. Understanding consumption habits of customers enables businesses to develop better cross-selling strategies and recommendation engines.
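As a rough illustration of market basket analysis, the sketch below scores rules between pairs of items using support and confidence, two measures commonly used for association rules. The text does not name them, so treat them as an assumption here, along with the function name `pair_rules`, the thresholds, and the sample baskets.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for item pairs that pass both thresholds."""
    n = len(baskets)
    item_counts, pair_counts = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore repeated items within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # share of lhs baskets that also hold rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical transactions.
baskets = [["bread", "butter"], ["bread", "butter", "milk"], ["bread", "milk"], ["butter"]]
rules = pair_rules(baskets)
```

In this toy data every basket containing milk also contains bread, so the rule milk → bread has confidence 1.0, while the butter/milk pair is filtered out for low support.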

Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting their weights based on the loss function through the process of gradient descent. When the loss function is at or near zero, the model reproduces the training data closely; performance on held-out data is still needed to confirm that it yields the correct answer in general.
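A single node of the kind described above can be sketched as follows. The sigmoid activation, the cross-entropy loss, the learning rate, and the toy OR data are assumptions made for illustration; real networks stack many such nodes in layers and are normally built with a dedicated library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, lr=0.5, epochs=5000):
    """Fit one sigmoid node by batch gradient descent on cross-entropy loss."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_inputs, 0.0
        for inputs, target in samples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = output - target  # gradient of the loss w.r.t. the weighted sum
            for i, x in enumerate(inputs):
                grad_w[i] += error * x
            grad_b += error
        m = len(samples)
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]  # step against the gradient
        bias -= lr * grad_b / m
    return weights, bias

def fires(weights, bias, inputs, threshold=0.5):
    """The node 'fires' when its output exceeds the threshold."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias) > threshold

# Hypothetical training set: the logical OR of two binary inputs.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(or_data)
predictions = [fires(weights, bias, inputs) for inputs, _ in or_data]
```

After training, the node fires for every input pair except (0, 0), matching the labels it was trained on.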

Decision tree:  This data mining technique uses classification or regression methods to classify or predict potential outcomes based on a set of decisions. As the name suggests, it uses a tree-like visualization to represent the potential outcomes of these decisions.
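To make the tree structure concrete, here is a small hand-built example. The features (`income`, `debt_ratio`), thresholds, and outcomes are hypothetical; a real decision tree would learn its splits from training data rather than have them written by hand.

```python
# A hand-built tree for illustration; a real tree would be learned from data.
# Internal nodes test one feature against a threshold; leaves hold an outcome.
tree = {
    "feature": "income",
    "threshold": 50_000,
    "below": {"leaf": "decline"},
    "at_or_above": {
        "feature": "debt_ratio",
        "threshold": 0.4,
        "below": {"leaf": "approve"},
        "at_or_above": {"leaf": "review"},
    },
}


def predict(node, record):
    """Follow the decisions from the root until a leaf is reached."""
    while "leaf" not in node:
        below = record[node["feature"]] < node["threshold"]
        node = node["below"] if below else node["at_or_above"]
    return node["leaf"]


print(predict(tree, {"income": 30_000, "debt_ratio": 0.1}))  # decline
print(predict(tree, {"income": 80_000, "debt_ratio": 0.2}))  # approve
print(predict(tree, {"income": 80_000, "debt_ratio": 0.6}))  # review
```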

K-nearest neighbor (KNN):  K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. This algorithm assumes that similar data points can be found near each other. As a result, it calculates the distance between data points, usually the Euclidean distance, and then assigns a category based on the most frequent category among the nearest neighbors (for classification) or predicts their average value (for regression).
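The classification case can be sketched with the standard library alone. The training points, labels, and the function name `knn_classify` are illustrative assumptions; the sketch uses Euclidean distance and a simple majority vote.

```python
import math
from collections import Counter


def knn_classify(query, labelled_points, k=3):
    """Label `query` by majority vote among its k nearest neighbours."""
    by_distance = sorted(
        labelled_points, key=lambda item: math.dist(query, item[0])
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training points with made-up labels.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 5.5), "B"),
]
print(knn_classify((1.2, 1.4), training))       # A
print(knn_classify((6.5, 6.0), training, k=1))  # B
```

Sorting every point by distance is fine for a toy dataset; larger datasets would normally call for a spatial index or an approximate search.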

Data mining techniques are widely adopted among business intelligence and data analytics teams, helping them extract knowledge for their organization and industry. Some data mining use cases include:

Sales and marketing  

Companies collect a massive amount of data about their customers and prospects. By observing consumer demographics and online user behavior, companies can use data to optimize their marketing campaigns, improving segmentation, cross-sell offers and customer loyalty programs, yielding higher ROI on marketing efforts. Predictive analyses can also help teams to set expectations with their stakeholders, providing yield estimates from any increases or decreases in marketing investment.

Education  

Educational institutions have started to collect data to understand their student populations as well as which environments are conducive to success. As courses continue to transfer to online platforms, they can use a variety of dimensions and metrics to observe and evaluate performance, such as keystrokes, student profiles, classes, universities, time spent, etc.

Operational optimization  

Process mining  leverages data mining techniques to reduce costs across operational functions, enabling organizations to run more efficiently. This practice has helped to identify costly bottlenecks and improve decision-making among business leaders.

Fraud detection  

While frequently occurring patterns in data can provide teams with valuable insight, observing data anomalies is also beneficial, assisting companies in detecting fraud. While this is a well-known use case within banking and other financial institutions, SaaS-based companies have also started to adopt these practices to eliminate fake user accounts from their datasets.
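One simple way to surface anomalies is to flag values that lie far from the mean in units of standard deviation. The sketch below is only an illustration of that idea, with made-up transaction amounts and an assumed threshold of 2.0; production fraud detection typically relies on far richer features and models.

```python
from statistics import mean, pstdev


def flag_anomalies(values, z_threshold=2.0):
    """Return values lying more than `z_threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


# Hypothetical transaction amounts; the last one is deliberately unusual.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
print(flag_anomalies(amounts))  # [500]
```

With small samples a single extreme value also inflates the standard deviation, which is why the threshold here is lower than the value of 3 often quoted for large datasets.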



PeerJ Comput Sci

Adaptations of data mining methodologies: a systematic literature review

Associated data.

The following information was supplied regarding data availability:

The SLR protocol (also shared via an online repository) and the corpus with definitions and mappings are provided as a Supplemental File.

The use of end-to-end data mining methodologies such as CRISP-DM, KDD process, and SEMMA has grown substantially over the past decade. However, little is known as to how these methodologies are used in practice. In particular, the question of whether data mining methodologies are used ‘as-is’ or adapted for specific purposes, has not been thoroughly investigated. This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and ‘grey’ publications. We find that data mining methodologies are primarily applied ‘as-is’. At the same time, we also identify various adaptations of data mining methodologies and we note that their number is growing rapidly. The dominant adaptations pattern is related to methodology adjustments at a granular level (modifications) followed by extensions of existing methodologies with additional elements. Further, we identify two recurrent purposes for adaptation: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). The study suggests that standard data mining methodologies do not pay sufficient attention to deployment issues, which play a prominent role when turning data mining models into software products that are integrated into the IT architectures and business processes of organizations. We conclude that refinements of existing methodologies aimed at combining data, technological, and organizational aspects, could help to mitigate these gaps.

Introduction

The availability of Big Data has stimulated widespread adoption of data mining and data analytics in research and in business settings (Columbus, 2017). Over the years, a number of data mining methodologies have been proposed, and these are being used extensively in practice and in research. However, little is known about which data mining methodologies are applied and how, and the question has been neither widely researched nor discussed. Further, there is no consolidated view on what constitutes quality of methodological process in data mining and data analytics, how data mining and data analytics are applied in organizational settings, and how application practices relate to each other. This motivates the need for a comprehensive survey of the field.

There have been surveys or quasi-surveys and summaries conducted in related fields. Notably, there have been two systematic literature reviews (the Systematic Literature Review, hereinafter SLR, is the most suitable and widely used research method for identifying, evaluating and interpreting research on a particular research question, topic or phenomenon (Kitchenham, Budgen & Brereton, 2015)). These reviews concerned Big Data Analytics, but not general-purpose data mining methodologies. Adrian et al. (2004) conducted an SLR on the implementation of Big Data Analytics (BDA), specifically the capability components necessary for BDA value discovery and realization. The authors identified BDA implementation studies, determined their main focus areas, and discussed BDA applications and capability components in detail. Saltz & Shamshurin (2016) published an SLR on Big Data team process methodologies. The authors identified a lack of standards for how Big Data projects are executed, and highlighted growing research in this area and the potential benefits of such a process standard. Additionally, the authors synthesized a list of the 33 most important success factors for executing Big Data activities. Finally, there are studies that surveyed data mining techniques and applications across domains, yet they focus on data mining process artifacts and outcomes (Madni, Anwar & Shah, 2017; Liao, Chu & Hsiao, 2012), not on end-to-end process methodology.

There have been a number of surveys conducted in domain-specific settings such as hospitality, accounting, education, manufacturing, and banking. Mariani et al. (2018) conducted an SLR on Business Intelligence (BI) and Big Data in the hospitality and tourism context. Amani & Fadlalla (2017) explored the application of data mining methods in accounting, while Romero & Ventura (2013) investigated educational data mining. Similarly, Hassani, Huang & Silva (2018) addressed data mining application case studies in banking and explored them along three dimensions: topics, applied techniques and software. All of these studies were performed by means of systematic literature reviews. Lastly, Bi & Cochran (2014) undertook a standard literature review of Big Data Analytics and its applications in manufacturing.

Apart from domain-specific studies, there have been very few general-purpose surveys that provide a comprehensive overview of existing data mining methodologies, classifying and contextualizing them. A valuable synthesis was presented by Kurgan & Musilek (2006) as a comparative study of the state of the art of data mining methodologies. The study was not an SLR and focused on a comprehensive comparison of the phases, processes, and activities of data mining methodologies; the application aspect was summarized briefly as application statistics by industries and citations. Three more comparative, non-SLR studies were undertaken by Marban, Mariscal & Segovia (2009), Mariscal, Marbán & Fernández (2010), and, most recently and most closely related, Martínez-Plumed et al. (2017). They followed the same pattern, systematizing existing data mining frameworks through comparative analysis. There, the purpose and context of consolidation was even more practical: to support the derivation and proposal of a new artifact, that is, a novel data mining methodology. The majority of these general surveys are more than a decade old and have natural limitations due to being (1) non-SLR studies and (2) so far restricted to comparing methodologies in terms of phases, activities, and other elements.

The key common characteristic of all these studies is that data mining methodologies are treated as normative and standardized (‘one-size-fits-all’) processes. A complementary perspective, not considered in the above studies, is that data mining methodologies are not normative standardized processes but frameworks that need to be specialized to different industry domains, organizational contexts, and business objectives. In the last few years, a number of extensions and adaptations of data mining methodologies have emerged, which suggests that existing methodologies are not sufficient to cover the needs of all application domains. In particular, extensions of data mining methodologies have been proposed in the medical domain (Niaksu, 2015), the educational domain (Tavares, Vieira & Pedro, 2017), the industrial engineering domain (Huber et al., 2019; Solarte, 2002), and software engineering (Marbán et al., 2007, 2009). However, little attention has been given to studying how data mining methodologies are applied and used in industry settings; so far, only non-scientific practitioners’ surveys provide such evidence.

Given this research gap, the central objective of this article is to investigate how data mining methodologies are applied by researchers and practitioners, both in their generic (standardized) form and in specialized settings. This is achieved by investigating if data mining methodologies are applied ‘as-is’ or adapted, and for what purposes such adaptations are implemented.

Guided by the Systematic Literature Review method, we first identified a corpus of primary studies covering both peer-reviewed and ‘grey’ literature from 1997 to 2018. An analysis of these studies led us to a taxonomy of uses of data mining methodologies, focusing on the distinction between ‘as-is’ usage and various types of methodology adaptations. By analyzing different types of methodology adaptations, this article identifies potential gaps in standard data mining methodologies at both the technological and the organizational levels.

The rest of the article is organized as follows. The Background section provides an overview of key concepts of data mining and associated methodologies. Next, Research Design describes the research methodology. The Findings and Discussion section presents the study results and their associated interpretation. Finally, threats to validity are addressed in Threats to Validity while the Conclusion summarizes the findings and outlines directions for future work.

Background

This section introduces the main data mining concepts and provides an overview of existing data mining methodologies and their evolution.

Data mining is defined as a set of rules, processes, and algorithms that are designed to generate actionable insights, extract patterns, and identify relationships from large datasets (Morabito, 2016). Data mining incorporates automated data extraction, processing, and modeling by means of a range of methods and techniques. In contrast, data analytics refers to techniques used to analyze and acquire intelligence from data (including ‘big data’) (Gandomi & Haider, 2015) and is positioned as a broader field, encompassing a wider spectrum of methods that includes both statistical methods and data mining (Chen, Chiang & Storey, 2012). A number of algorithms have been developed in the statistics, machine learning, and artificial intelligence domains to support and enable data mining. While statistical approaches precede them, they inherently come with limitations, the best known being rigid conditions on data distribution. Machine learning techniques gained popularity because they impose fewer restrictions while deriving understandable patterns from data (Bose & Mahapatra, 2001).

Data mining projects commonly follow a structured process or methodology as exemplified by Mariscal, Marbán & Fernández (2010) , Marban, Mariscal & Segovia (2009) . A data mining methodology specifies tasks, inputs, outputs, and provides guidelines and instructions on how the tasks are to be executed ( Mariscal, Marbán & Fernández, 2010 ). Thus, data mining methodology provides a set of guidelines for executing a set of tasks to achieve the objectives of a data mining project ( Mariscal, Marbán & Fernández, 2010 ).

The foundations of structured data mining methodologies were first proposed by Fayyad, Piatetsky-Shapiro & Smyth (1996a, 1996b, 1996c), and were initially related to Knowledge Discovery in Databases (KDD). KDD presents a conceptual process model of computational theories and tools that support the extraction of information (knowledge) from data (Fayyad, Piatetsky-Shapiro & Smyth, 1996a). In KDD, the overall approach to knowledge discovery includes data mining as a specific step. As such, KDD, with its nine main steps (exhibited in Fig. 1), has the advantage of considering data storage and access, algorithm scaling, interpretation and visualization of results, and human-computer interaction (Fayyad, Piatetsky-Shapiro & Smyth, 1996a, 1996c). The introduction of KDD also formalized a clearer distinction between data mining and data analytics, as formulated, for example, in Tsai et al. (2015): “…by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining”.

[Figure 1: image peerj-cs-06-267-g001.jpg]

The main steps of KDD are as follows:

  • Step 1: Learning application domain: The first step is to develop an understanding of the application domain and relevant prior knowledge, and then to identify the goal of the KDD process from the customer’s viewpoint.
  • Step 2: Dataset creation: The second step involves selecting a dataset, or focusing on a subset of variables or data samples, on which discovery is to be performed.
  • Step 3: Data cleaning and processing: In the third step, basic operations to remove noise or outliers are performed. This step also covers collecting the information needed to model or account for noise, deciding on strategies for handling missing data fields, and accounting for data types, schema, and the mapping of missing and unknown values.
  • Step 4: Data reduction and projection: Here, useful features to represent the data are identified, depending on the goal of the task, and transformation methods are applied to find an optimal feature set for the data.
  • Step 5: Choosing the function of data mining: In the fifth step, the target outcome (e.g., summarization, classification, regression, clustering) is defined.
  • Step 6: Choosing data mining algorithm: The sixth step concerns selecting method(s) to search for patterns in the data, deciding which models and parameters are appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
  • Step 7: Data mining: In the seventh step, the data is mined, that is, searched for patterns of interest in a particular representational form or a set of such representations: classification rules or trees, regression, clustering.
  • Step 8: Interpretation: In this step, redundant and irrelevant patterns are filtered out, and relevant patterns are interpreted and visualized in such a way as to make the result understandable to the users.
  • Step 9: Using discovered knowledge: In the last step, the results are incorporated into the performance system, documented and reported to stakeholders, and used as a basis for decisions.

The KDD process became dominant in industrial and academic domains (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Also, as the timeline-based evolution of data mining methodologies and process models shows (Fig. 2 below), the original KDD data mining model served as the basis for other methodologies and process models, which addressed various gaps and deficiencies of the original KDD process. These approaches extended the initial KDD framework, yet the degree of extension varied, ranging from process restructuring to a complete change in focus. For example, Brachman & Anand (1996) and later Gertosio & Dussauchoy (2004) (in the form of a case study) introduced practical adjustments to the process based on its iterative nature and on interactivity. In their view, the complete KDD process was enhanced with supplementary tasks and the focus shifted to the user’s point of view (a human-centered approach), highlighting decisions that the user needs to make in the course of the data mining process. In contrast, Cabena et al. (1997) proposed a different number of steps, emphasizing and detailing data processing and discovery tasks. Similarly, in a series of works, Anand & Büchner (1998), Anand et al. (1998), and Buchner et al. (1999) presented additional data mining process steps by concentrating on the adaptation of the data mining process to practical settings. They focused on cross-sales (entire life-cycles of online customers), with further incorporation of an internet data discovery process (web-based mining). Further, the Two Crows data mining process model is a consultancy-originated framework that defines the steps differently but is still close to the original KDD. Finally, SEMMA (Sample, Explore, Modify, Model and Assess), based on KDD, was developed by the SAS Institute in 2005 (SAS Institute Inc., 2017). It is defined as a logical organization of the functional toolset of SAS Enterprise Miner for carrying out the core tasks of data mining. Compared to KDD, it is a vendor-specific process model, which limits its application in different environments. It also skips two steps of the original KDD process (‘Learning Application Domain’ and ‘Using of Discovered Knowledge’) that are regarded as essential for the success of a data mining project (Mariscal, Marbán & Fernández, 2010). In terms of adoption, the new KDD-based proposals received limited attention across academia and industry (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Subsequently, most of these methodologies converged into the CRISP-DM methodology.

[Figure 2: image peerj-cs-06-267-g002.jpg]

Additionally, only two non-KDD-based approaches have been proposed alongside extensions to KDD. The first is the 5A’s approach presented by De Pisón Ascacbar (2003) and used by the vendor SPSS. The key contribution of this approach was the addition of an ‘Automate’ step, while its disadvantage was the omission of a ‘Data Understanding’ step. The second approach is 6-Sigma, an industry-originated method to improve quality and customer satisfaction (Pyzdek & Keller, 2003). It has been successfully applied to data mining projects in conjunction with the DMAIC performance improvement model (Define, Measure, Analyze, Improve, Control).

In 2000, as a response to common issues and needs (Marban, Mariscal & Segovia, 2009), an industry-driven methodology called the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced as an alternative to KDD. It also consolidated the original KDD model and its various extensions. While CRISP-DM builds upon KDD, it consists of six phases that are executed in iterations (Marban, Mariscal & Segovia, 2009). The iterative execution of CRISP-DM is its most distinguishing feature compared to the initial KDD, which assumes a sequential execution of its steps. CRISP-DM, much like KDD, aims to provide practitioners with guidelines for performing data mining on large datasets. However, CRISP-DM, with its six main steps and a total of 24 tasks and outputs, is more refined than KDD. The main steps of CRISP-DM, as depicted in Fig. 3 below, are as follows:

  • Phase 1: Business understanding: The focus of the first step is to gain an understanding of the project objectives and requirements from a business perspective, and then to convert these into data mining problem definitions. Presentation of a preliminary plan to achieve the objectives is also included in this first step.
  • Phase 2: Data understanding: This step begins with an initial data collection and proceeds with activities in order to get familiar with the data, identify data quality issues, discover first insights into the data, and potentially detect and form hypotheses.
  • Phase 3: Data preparation: The third step covers activities required to construct the final dataset from the initial raw data. Data preparation tasks are performed repeatedly.
  • Phase 4: Modeling phase: In this step, various modeling techniques are selected and applied followed by calibrating their parameters. Typically, several techniques are used for the same data mining problem.
  • Phase 5: Evaluation of the model(s): The fifth step begins with the quality perspective and then, before proceeding to final model deployment, ascertains that the model(s) achieves the business objectives. At the end of this phase, a decision should be reached on how to use data mining results.
  • Phase 6: Deployment phase: In the final step, the models are deployed to enable end-customers to use the data as basis for decisions, or support in the business process. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized, presented, distributed in a way that the end-user can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.
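The iterative character of these phases can be sketched as a loop that repeats the first five phases until the evaluation succeeds. This is a deliberate simplification for illustration: the callback `meets_objectives`, the function name `run_project`, and the iteration limit are assumptions, and the sketch only loops back from evaluation to the start, whereas practitioners may move back and forth between other phases as well.

```python
PHASES = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


def run_project(meets_objectives, max_iterations=5):
    """Walk the phases in order, looping back after a failed evaluation.

    `meets_objectives` is a hypothetical callback taking the iteration
    number and returning True once the model satisfies the business goals.
    """
    log = []
    for iteration in range(1, max_iterations + 1):
        for phase in PHASES[:-1]:  # everything up to and including Evaluation
            log.append((iteration, phase))
        if meets_objectives(iteration):
            log.append((iteration, PHASES[-1]))  # Deployment
            break
    return log


log = run_project(lambda iteration: iteration >= 2)
print(log[-1])   # (2, 'Deployment')
print(len(log))  # 11
```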

[Figure 3: image peerj-cs-06-267-g003.jpg]

The development of CRISP-DM was led by an industry consortium. It is designed to be domain-agnostic (Mariscal, Marbán & Fernández, 2010) and, as such, is now widely used by industry and research communities (Marban, Mariscal & Segovia, 2009). These distinctive characteristics have led CRISP-DM to be considered the ‘de-facto’ standard data mining methodology and a reference framework against which other methodologies are benchmarked (Mariscal, Marbán & Fernández, 2010).

Similarly to KDD, a number of refinements and extensions of the CRISP-DM methodology have been proposed, in two main directions: extensions of the process model itself, and adaptations or mergers with process models and methodologies from other domains. The first direction is exemplified by Cios & Kurgan (2005), who proposed an integrated Data Mining & Knowledge Discovery (DMKD) process model. It contains several explicit feedback mechanisms, modifies the last step to incorporate the application of discovered knowledge and insights, and relies on technologies for results deployment. In the same vein, Moyle & Jorge (2001) and Blockeel & Moyle (2002) proposed the Rapid Collaborative Data Mining System (RAMSYS) framework, which is both a data mining methodology and a system for remote collaborative data mining projects. RAMSYS attempted to combine a problem-solving methodology, knowledge sharing, and ease of communication. It was intended to allow remotely located data miners to work collaboratively in a disciplined manner as regards information flow, while allowing the free flow of ideas for problem solving (Moyle & Jorge, 2001). CRISP-DM modifications and integrations with other specific domains were proposed in Industrial Engineering (Data Mining for Industrial Engineering by Solarte (2002)) and in Software Engineering by Marbán et al. (2007, 2009). Both approaches enhanced CRISP-DM and contributed additional phases, activities and tasks typical of engineering processes, addressing on-going support (Solarte, 2002) as well as project management, organizational and quality assurance tasks (Marbán et al., 2009).

Finally, a limited number of attempts to create independent or semi-dependent data mining frameworks were undertaken after the creation of CRISP-DM. These efforts were driven by industry players and comprised the KDD Roadmap by Debuse et al. (2001) for a proprietary predictive toolkit (Lanner Group), and a recent effort by IBM, the Analytics Solutions Unified Method for Data Mining (ASUM-DM), in 2015 (IBM Corporation, 2016: https://developer.ibm.com/technologies/artificial-intelligence/articles/architectural-thinking-in-the-wild-west-of-data-science/). Both frameworks contributed additional tasks, for example, resourcing in the KDD Roadmap, or the hybrid approach assumed in ASUM, which combines agile and traditional implementation principles.

Table 1 summarizes the reviewed data mining process models and methodologies by their origin, basis and key concepts.

Research Design

The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use the systematic literature review (SLR) as our scientific method, for two reasons. Firstly, a systematic review is based on a trustworthy, rigorous, and auditable methodology. Secondly, an SLR supports the structured synthesis of existing evidence and the identification of research gaps, and provides a framework for positioning new research activities (Kitchenham, Budgen & Brereton, 2015). For our SLR, we followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). All SLR details have been documented in a separate, peer-reviewed SLR protocol (available at https://figshare.com/articles/Systematic-Literature-Review-Protocol/10315961 ).

Research questions

As suggested by Kitchenham, Budgen & Brereton (2015), we formulated research questions and motivate them as follows. In the preliminary phase of the research, we found a very limited number of studies investigating the application practices of data mining methodologies as such. Further, we found a number of surveys conducted in domain-specific settings and very few general-purpose surveys, but none of them considered application practices either. As a contrasting trend, the recent emergence of a limited number of adaptation studies has clearly pinpointed the research gap in the area of application practices. Given this research gap, an in-depth investigation of this phenomenon led us to ask: “How are data mining methodologies applied (‘as-is’ vs adapted)?” (RQ1). Further, as we intended to investigate the universe of adaptation scenarios in depth, this naturally led us to RQ2: “How have existing data mining methodologies been adapted?” Finally, if adaptations are made, we wish to explore the associated reasons and purposes, which in turn led us to RQ3: “For what purposes are data mining methodologies adapted?”

Thus, for this review, there are three research questions defined:

  • Research Question 1: How are data mining methodologies applied (‘as-is’ versus adapted)? This question aims to identify patterns and trends in the application and usage of data mining methodologies.
  • Research Question 2: How have existing data mining methodologies been adapted? This question aims to identify and classify adaptation patterns and scenarios for data mining methodologies.
  • Research Question 3: For what purposes have existing data mining methodologies been adapted? This question aims to identify, explain, and classify the reasons for adapting existing data mining methodologies and the benefits achieved. Specifically, what gaps do these adaptations seek to fill, and what have been their benefits? Such systematic evidence and insights will be valuable input to a potentially new, refined data mining methodology, and will be of interest to practitioners and researchers.

Data collection strategy

Our data collection and search strategy followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015) . It defined the scope of the search, selection of literature and electronic databases, search terms and strings as well as screening procedures.

Primary search

The primary search aimed to identify an initial set of papers. To this end, the search strings were derived from the research objective and research questions. The term ‘data mining’ was the key term, but we also included ‘data analytics’ to be consistent with observed research practices. The terms ‘methodology’ and ‘framework’ were also included. Thus, the following search strings were developed and validated in accordance with the guidelines suggested by Kitchenham, Budgen & Brereton (2015) :

(‘data mining methodology’) OR (‘data mining framework’) OR (‘data analytics methodology’) OR (‘data analytics framework’)
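For illustration, the same boolean query can be generated by combining the two subject terms with the two artifact terms. The variable names are assumptions, and the sketch uses straight quotes; the exact quoting and operator syntax expected by each database may differ.

```python
from itertools import product

# Terms taken from the text; the variable names are illustrative.
subjects = ["data mining", "data analytics"]
kinds = ["methodology", "framework"]

# Combine every subject with every kind, then join the phrases with OR.
phrases = [f"{subject} {kind}" for subject, kind in product(subjects, kinds)]
query = " OR ".join(f"('{phrase}')" for phrase in phrases)
print(query)
```

Building the string from term lists makes it easy to see that the query is the full cross-product of the two vocabularies, and to extend either list consistently.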

The search strings were applied to the indexed scientific databases Scopus and Web of Science (for peer-reviewed, academic literature) and to the non-indexed Google Scholar (for non-peer-reviewed, so-called ‘grey’ literature). The decision to cover ‘grey’ literature in this research was motivated as follows. As proposed in a number of publications in the information systems and software engineering domains (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019), an SLR as a stand-alone method may not provide sufficient insight into the ‘state of practice’. It has also been found (Garousi, Felderer & Mäntylä, 2016) that ‘grey’ literature can give substantial benefits in certain areas of software engineering, in particular when the topic of research is related to industrial and practical settings. Taking into consideration the research objective, which is to investigate the application practices of data mining methodologies, we opted to include elements of the Multivocal Literature Review (MLR) in our study. Also, Kitchenham, Budgen & Brereton (2015) recommend including ‘grey’ literature to minimize publication bias, as positive results and research outcomes are more likely to be published than negative ones. Following MLR practices, we also designed inclusion criteria for types of ‘grey’ literature, reported below.

The selection of databases is motivated as follows. For peer-reviewed literature sources, we sought to avoid potential omission bias, which is discussed in IS research ( Levy & Ellis, 2006 ) as a risk when a search is concentrated in a limited set of disciplinary data sources. Thus, a broad selection of scientific electronic databases was evaluated, including multidisciplinary (Scopus, Web of Science, Wiley Online Library) and domain-oriented ones (ACM Digital Library, IEEE Xplore Digital Library). Multidisciplinary databases were selected because of their wider domain coverage, and it was validated and confirmed that they include publications originating from domain-oriented databases such as ACM and IEEE. Among the multidisciplinary databases, Scopus was selected for its widest possible coverage (it is the world’s largest database, covering approximately 80% of all international peer-reviewed journals), while Web of Science was selected for its longer temporal range. Thus, the two databases complement each other. The selected non-indexed source for ‘grey’ literature is Google Scholar, as it is a comprehensive source of both academic and ‘grey’ literature publications and is referred to as such extensively ( Garousi, Felderer & Mäntylä, 2019 ; Neto et al., 2019 ).

Further, Garousi, Felderer & Mäntylä (2019) presented a three-tier categorization framework for types of ‘grey’ literature. In our study, we restricted ourselves to 1st tier ‘grey’ literature publications from a limited number of ‘grey’ literature producers. In particular, from the list of producers ( Neto et al., 2019 ) we focused on government departments and agencies, non-profit economic and trade organizations (‘think-tanks’) and professional associations, academic and research institutions, and businesses and corporations (consultancy companies and established private companies). The selected 1st tier ‘grey’ literature items include: (1) government, academic, and private sector consultancy reports 2 , (2) theses (not lower than Master level) and PhD dissertations, (3) research reports, (4) working papers, (5) conference proceedings and preprints. By restricting inclusion to 1st tier ‘grey’ literature, we mitigate the quality assessment challenge that is especially relevant to, and widely reported for, such literature ( Garousi, Felderer & Mäntylä, 2019 ; Neto et al., 2019 ).

Scope and domains inclusion

As recommended by Kitchenham, Budgen & Brereton (2015) it is necessary to initially define research scope. To clarify the scope, we defined what is not included and is out of scope of this research. The following aspects are not included in the scope of our study:

  • Context of technology and infrastructure for data mining/data analytics tasks and projects.
  • Granular methods application in data mining process itself or their application for data mining tasks, for example, constructing business queries or applying regression or neural networks modeling techniques to solve classification problems. Studies with granular methods are included in primary texts corpus as long as method application is part of overall methodological approach.
  • Technological aspects in data mining, for example, data engineering, dataflows and workflows.
  • Traditional statistical methods not associated with data mining directly including statistical control methods.

Similarly to Budgen et al. (2006) and Levy & Ellis (2006) , initial piloting revealed that search engines retrieved literature from all major scientific domains, including ones outside the authors’ area of expertise (e.g., medicine). Even though such studies could be retrieved, we would not be able to analyze and correctly interpret literature published outside our area of expertise. The search strategy was therefore adjusted by retaining only domains closely associated with Information Systems and Software Engineering research. Thus, for the Scopus database the final set of included domains was limited to nine and comprised Computer Science, Engineering, Mathematics, Business, Management and Accounting, Decision Science, Economics, Econometrics and Finance, and Multidisciplinary as well as Undefined studies. Excluded domains covered 11.5% of publications (106 out of 925); the validation process confirmed that they primarily focused on specific case studies in fundamental sciences and medicine 3 . The included Scopus domains were mapped to Web of Science to ensure a consistent approach across databases, and the correctness of the mapping was validated.

Screening criteria and procedures

Based on SLR practices (as in Kitchenham, Budgen & Brereton (2015) , Brereton et al. (2007) ) and the defined SLR scope, we designed multi-step screening procedures (quality and relevancy) with an associated set of Screening Criteria and a Scoring System . The purpose of relevancy screening is to find relevant primary studies in an unbiased way ( Vanwersch et al., 2011 ). Quality screening, on the other hand, aims to assess the quality of relevant primary studies in an unbiased way.

Screening Criteria consisted of two subsets— Exclusion Criteria applied for initial filtering and Relevance Criteria , also known as Inclusion Criteria .

Exclusion Criteria were initial threshold quality controls aimed at eliminating studies with limited or no scientific contribution. The exclusion criteria also address issues of understandability, accessibility and availability. The Exclusion Criteria were as follows:

  • Quality 1: The publication item is not in English (understandability).
  • Quality 2: The publication item is a duplicate, which occurs when:
      • either the same document is retrieved from two or all three databases,
      • or different versions of the same publication are retrieved (i.e., the same study published in different sources). Based on best practices, the decision rule is that the most recent paper is retained, as well as the one with the highest score ( Kofod-Petersen, 2014 ).
      • If a publication is published both as a conference proceeding and as a journal article with the same name and same authors, or as an extended version of a conference paper, the latter is selected.
  • Quality 3: Length of the publication is less than 6 pages; short papers do not have the space to develop and discuss the presented ideas in sufficient depth for us to examine.
  • Quality 4: The paper is not accessible in full length online through the university subscription of databases and via Google Scholar; lack of full availability prevents us from assessing and analyzing the text.
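A minimal sketch of how these exclusion criteria could be applied to a list of retrieved records is given below. The `Record` fields, the duplicate key (normalized title plus authors) and the tie-breaking order (journal version first, then most recent year) are simplifying assumptions made for illustration; they are not the tooling used in the study, and the score-based part of the duplicate rule is omitted.

```python
from dataclasses import dataclass

MIN_PAGES = 6  # Quality 3 threshold stated in the criteria above


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal metadata for one retrieved publication."""
    title: str
    authors: tuple
    year: int
    language: str
    pages: int
    full_text_available: bool
    is_journal: bool = False


def dedup_key(rec):
    # Assumption: same normalized title and same authors means same study.
    return (rec.title.strip().lower(), rec.authors)


def preferred(a, b):
    """Choose which duplicate to keep: journal version first, then most recent."""
    return max(a, b, key=lambda r: (r.is_journal, r.year))


def apply_exclusion_criteria(records):
    best = {}
    for rec in records:  # Quality 2: collapse duplicates to one record
        key = dedup_key(rec)
        best[key] = preferred(best[key], rec) if key in best else rec
    return [
        r for r in best.values()
        if r.language == "English"      # Quality 1
        and r.pages >= MIN_PAGES        # Quality 3
        and r.full_text_available       # Quality 4
    ]
```

Duplicates are collapsed before the other checks so that the retained version of a study, rather than an arbitrary copy, is the one tested for length and availability.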

The initially retrieved list of papers was filtered based on Exclusion Criteria . Only papers that passed all criteria were retained in the final studies corpus. Mapping of criteria towards screening steps is exhibited in Fig. 4 .

Relevance Criteria were designed to identify relevant publications and are presented in Table 2 below while mapping to respective process steps is presented in Fig. 4 . These criteria were applied iteratively.

As a final SLR step, a quality assessment of the full texts was performed using the constructed Scoring Metrics (in line with Kitchenham & Charters (2007) ), which are presented in Table 3 below.

Data extraction and screening process

The data extraction and screening process is presented in Fig. 4 . In Step 1, initial publication lists were retrieved from the pre-defined databases: Scopus, Web of Science, and Google Scholar. The lists were merged and duplicates eliminated in Step 2. Afterwards, texts shorter than 6 pages were excluded (Step 3). Steps 1–3 were guided by Exclusion Criteria . In the next stage (Step 4), publications were screened by title based on the pre-defined Relevance Criteria . Those that passed were evaluated for availability (Step 5). If a study was available, it was evaluated again by the same pre-defined Relevance Criteria applied to the Abstract, Conclusion and, if necessary, Introduction (Step 6). Those that passed this threshold formed the primary publications corpus, which was extracted from the databases in full. These primary texts were evaluated again based on the full text (Step 7), applying Relevance Criteria first and then Scoring Metrics .

Results and quantitative analysis

In Step 1, 1,715 publications were extracted from the relevant databases with the following composition: Scopus (819), Web of Science (489), Google Scholar (407). In terms of scientific publication domains, Computer Science (42.4%), Engineering (20.6%), and Mathematics (11.1%) accounted for approximately 74% of the texts originating from Scopus. The same applies to the Web of Science harvest. Applying the Exclusion Criteria produced the following results. In Step 2, after eliminating duplicates, 1,186 texts were passed on for minimum length evaluation, and 767 reached assessment by Relevance Criteria .
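The counts reported for Steps 1–3 can be cross-checked with a short script. The sketch below uses only the figures stated in this paragraph; the helper name `retention` and the stage labels are assumptions introduced for illustration.

```python
# Counts reported in the text for the retrieval and exclusion steps.
retrieved = {"Scopus": 819, "Web of Science": 489, "Google Scholar": 407}

funnel = [
    ("Step 1: retrieved", sum(retrieved.values())),
    ("Step 2: after duplicate removal", 1186),
    ("Step 3: after minimum-length check", 767),
]


def retention(stages):
    """Yield (label, count, share of the previous stage that was kept)."""
    previous = None
    for label, count in stages:
        share = None if previous is None else count / previous
        yield label, count, share
        previous = count


for label, count, share in retention(funnel):
    suffix = "" if share is None else f" ({share:.1%} of previous step)"
    print(f"{label}: {count}{suffix}")
```

Summing the per-database counts reproduces the reported total of 1,715, and the retention shares show how much each exclusion step reduced the candidate list.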

As mentioned, Relevance Criteria were applied iteratively (Steps 4–6) and in conjunction with the availability assessment. As a result, only 298 texts were retained for full evaluation, with 241 originating from scientific databases and 57 being ‘grey’. These studies formed the primary texts corpus, which was extracted, read in full and evaluated by Relevance Criteria combined with Scoring Metrics . The decision rule was set as follows: studies that scored “1” or “0” were rejected, while texts scoring “3” or “2” were admitted to the final primary studies corpus. As an outcome of this SLR-based, broad, cross-domain publication collection and screening, we identified 207 relevant publications from peer-reviewed (156 texts) and ‘grey’ literature (51 texts). Figure 5 below shows the yearly number of published studies, broken down by ‘peer-reviewed’ and ‘grey’ literature, starting from 1997.
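The decision rule on scores can be expressed as a simple threshold. The following sketch assumes each publication has a single integer score between 0 and 3; the identifiers and the names `admit` and `final_corpus` are hypothetical and serve only to illustrate the rule.

```python
ADMIT_THRESHOLD = 2  # scores of 2 or 3 are admitted, scores of 0 or 1 are rejected
VALID_SCORES = (0, 1, 2, 3)


def admit(score):
    """Return True if a full-text score admits the study to the final corpus."""
    if score not in VALID_SCORES:
        raise ValueError(f"score must be one of {VALID_SCORES}, got {score!r}")
    return score >= ADMIT_THRESHOLD


def final_corpus(scored):
    """Keep the identifiers of admitted studies; `scored` maps id -> score."""
    return [pub_id for pub_id, score in scored.items() if admit(score)]


# Hypothetical identifiers and scores, for illustration only.
example = {"pub-a": 3, "pub-b": 1, "pub-c": 2, "pub-d": 0}
print(final_corpus(example))
```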

In terms of composition, the ‘peer-reviewed’ studies corpus is well balanced, with 72 journal articles and 82 conference papers, while book chapters account for only 4 instances. In contrast, in the ‘grey’ literature subset, articles in moderated and non-peer-reviewed journals are dominant ( n = 34) compared to the overall number of conference papers ( n = 13), followed by a small number of technical reports and pre-prints ( n = 4).

Temporal analysis of the texts corpus (as per Fig. 5 ) resulted in two observations. Firstly, we note that stable and significant research interest (in terms of numbers) in the application of data mining methodologies started around a decade ago, in 2007. Research efforts prior to 2007 were relatively limited, with the number of publications below 10. Secondly, we note that research on data mining methodologies has grown substantially since 2007, an observation supported by the constructed 3-year and 10-year mean trendlines. In particular, the number of publications has roughly tripled over the past decade, hitting an all-time high with 24 texts released in 2017.

Further, there are also two distinct spike sub-periods, in the years 2007–2009 and 2014–2017, followed by a stable pattern with an overall higher number of publications released annually. This observation is in line with the trend of increased penetration of methodologies, tools, cross-industry applications and academic research of data mining.

Findings and Discussion

In this section, we address the research questions of the paper. Initially, as part of RQ1, we present an overview of data mining methodologies applied ‘as-is’ and of adaptation trends. In addressing RQ2, we further classify the adaptations identified. Then, in the RQ3 subsection, each category identified under RQ2 is analyzed with particular focus on the goals of the adaptations.

RQ1: How data mining methodologies are applied (‘as-is’ vs. adapted)?

The first research question examines the extent to which data mining methodologies are used ‘as-is’ versus adapted. Our review based on 207 publications identified two distinct paradigms on how data mining methodologies are applied. The first is ‘as-is’ where the data mining methodologies are applied as stipulated. The second is with ‘adaptations’; that is, methodologies are modified by introducing various changes to the standard process model when applied.

We have aggregated research by decades to differentiate application patterns between two time periods: 1997–2007, with limited data mining application, and 2008–2018, with more intensive application. This cut has been guided not only by the extracted publications corpus but also by earlier surveys. In particular, ten new methodologies were proposed in the pre-2007 research, but only two new methodologies have been proposed since then. Thus, a distinct trend observed over the last decade is the large number of extensions and adaptations proposed, as opposed to entirely new methodologies.

We note that during the first decade of our time scope (1997–2007), the ratio of data mining methodologies applied ‘as-is’ was 40% (as presented in Fig. 6A ). However, the same ratio for the following decade is 32% ( Fig. 6B ). Thus, in terms of relative shares, we note a clear decrease in using data mining methodologies ‘as-is’ in favor of adapting them to cater to specific needs. The trend is even more pronounced when comparing absolute numbers: adaptations more than tripled (from 30 to 106), while the ‘as-is’ scenario increased modestly (from 20 to 51). Given this finding, we continue with analyzing how data mining methodologies have been adapted under RQ2.
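The shares and growth factors quoted above follow directly from the four reported counts. The sketch below recomputes them; the dictionary layout and function names are assumptions made for illustration.

```python
# Publication counts per decade, as reported in the text.
counts = {
    "1997-2007": {"as_is": 20, "adapted": 30},
    "2008-2018": {"as_is": 51, "adapted": 106},
}


def as_is_share(period):
    """Share of studies in a period that applied a methodology 'as-is'."""
    c = counts[period]
    return c["as_is"] / (c["as_is"] + c["adapted"])


def growth(kind):
    """Ratio of the second-decade count to the first-decade count."""
    return counts["2008-2018"][kind] / counts["1997-2007"][kind]


for period in counts:
    print(f"{period}: as-is share = {as_is_share(period):.0%}")
print(f"adaptations grew by a factor of {growth('adapted'):.2f}")
print(f"'as-is' grew by a factor of {growth('as_is'):.2f}")
```

The four counts also sum to the 207 publications in the final corpus, which gives a quick consistency check on the breakdown.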

RQ2: How have existing data mining methodologies been adapted?

We identified that data mining methodologies have been adapted to cater to specific needs. In order to categorize adaptation scenarios, we applied a two-level dichotomy, specifically, the following decision tree:

  • Level 1 Decision: Has the methodology been combined with another methodology? If yes, the resulting methodology was classified in the ‘integration’ category. Otherwise, we posed the next question.
  • Level 2 Decision: Are any new elements (phases, tasks, deliverables) added to the methodology? If yes, we designate the resulting methodology as an ‘extension’ of the original one. Otherwise, we classify the resulting methodology as a modification of the original one.
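The two-level decision tree above can be written as a small classification function. This is a sketch of the rule as stated; the enum, the function name and the two boolean parameters are naming assumptions, and in the study the answers to both questions were determined by reading each publication.

```python
from enum import Enum


class Scenario(Enum):
    INTEGRATION = "Integration"
    EXTENSION = "Extension"
    MODIFICATION = "Modification"


def classify_adaptation(combined_with_other_methodology, adds_new_elements):
    """Apply the Level 1 and Level 2 decisions in order."""
    # Level 1: combination with another methodology means 'Integration'.
    if combined_with_other_methodology:
        return Scenario.INTEGRATION
    # Level 2: new phases, tasks or deliverables mean 'Extension'.
    if adds_new_elements:
        return Scenario.EXTENSION
    # Otherwise the adaptation is a 'Modification' of the original.
    return Scenario.MODIFICATION


print(classify_adaptation(False, True).value)
```

Because Level 1 is evaluated first, a methodology that is both combined with another one and given new elements is still classified as ‘Integration’.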

Thus, when methodologies are adapted, three distinct types of adaptation scenarios can be distinguished:

  • Scenario ‘Modification’: introduces specialized sub-tasks and deliverables in order to address specific use cases or business problems. Modifications typically concentrate on granular adjustments to the methodology at the level of sub-phases, tasks or deliverables within the stages of existing reference frameworks (e.g., CRISP-DM or KDD). For example, Chernov et al. (2014) , in a study in the mobile network domain, proposed an automated decision-making enhancement in the deployment phase. In addition, the evaluation phase was modified by using both conventional and own-developed performance metrics. Further, in a study performed within the financial services domain, Yang et al. (2016) present feature transformation and feature selection as sub-phases, thereby enhancing the data mining modeling stage.
  • Scenario ‘Extension’: primarily proposes significant extensions to reference data mining methodologies. Such extensions result in either integrated data mining solutions, data mining frameworks serving as a component or tool for automated IS systems, or their transformations to fit specialized environments. The main purposes of extensions are to integrate fully-scaled data mining solutions into IS/IT systems and business processes and provide broader context with useful architectures, algorithms, etc. Adaptations, where extensions have been made, elicit and explicitly present various artifacts in the form of system and model architectures, process views, workflows, and implementation aspects. A number of soft goals are also achieved, providing holistic perspective on data mining process, and contextualizing with organizational needs. Also, there are extensions in this scenario where data mining process methodologies are substantially changed and extended in all key phases to enable execution of data mining life-cycle with the new (Big) Data technologies, tools and in new prototyping and deployment environments (e.g., Hadoop platforms or real-time customer interfaces). For example, Kisilevich, Keim & Rokach (2013) presented extensions to traditional CRISP-DM data mining outcomes with fully fledged Decision Support System (DSS) for hotel brokerage business. Authors ( Kisilevich, Keim & Rokach, 2013 ) have introduced spatial/non-spatial data management (extending data preparation), analytical and spatial modeling capabilities (extending modeling phase), provided spatial display and reporting capabilities (enhancing deployment phase). In the same work domain knowledge was introduced in all phases of data mining process, and usability and ease of use were also addressed.
  • Scenario ‘Integration’: combines a reference methodology, for example, CRISP-DM, with: (1) data mining methodologies originating from other domains (e.g., software engineering development methodologies), (2) organizational frameworks (Balanced Scorecard, Analytics Canvass, etc.), or (3) adjustments to accommodate Big Data technologies and tools. Also, adaptations in the form of ‘Integration’ typically introduce various types of ontologies and ontology-based tools, domain knowledge, software engineering, and BI-driven framework elements. Fundamental data mining process adjustments to new types of data and IS architectures (e.g., real-time data, multi-layer IS) are also presented. Key gaps addressed with such adjustments are the prescriptive nature and low degree of formalization of CRISP-DM, the obsolete nature of CRISP-DM with respect to tools, and the lack of CRISP-DM integration with other organizational frameworks. For example, Brisson & Collard (2008) developed the KEOPS data mining methodology (CRISP-DM based), centered on domain knowledge integration. An ontology-driven information system was proposed, with integration and enhancements to all steps of the data mining process. Further, integrated expert knowledge used in all data mining phases was shown to produce value in the data mining process.

To examine how the application scenarios of data mining methodologies have developed over time, we mapped peer-reviewed texts and ‘grey’ literature to the respective adaptation scenarios, aggregated by decades (as presented in Fig. 7 for peer-reviewed and Fig. 8 for ‘grey’ literature).

For peer-reviewed research, such temporal analysis resulted in three observations. Firstly, research efforts in each adaptation scenario have been growing, and the number of publications more than quadrupled (128 vs. 28). Secondly, as noted above, the relative proportion of ‘as-is’ studies is diluted (from 39% to 33%) and primarily replaced by the ‘Extension’ paradigm (from 25% to 30%). In contrast, in relative terms the gains of the ‘Modification’ and ‘Integration’ paradigms are modest. Further, this finding is reinforced by another observation: the most notable gap, in terms of a modest number of publications, remains in the ‘Integration’ category where, excluding the 2008–2009 spike, research efforts are limited and the number of texts is just 13. This is in stark contrast with the prolific research in the ‘Extension’ category, though it is concentrated in recent years. We can hypothesize that existing reference methodologies do not accommodate and support the increasing complexity of data mining projects and IS/IT infrastructure, as well as the specifics of certain domains, and as such need to be adapted.

In ‘grey’ literature, in contrast to peer-reviewed research, growth in the number of publications is less pronounced: 29 vs. 22 publications, or 32%, comparing across the two decades (as per Fig. 8 ). The growth is solely driven by the application of ‘Integration’ scenarios (13 vs. 4 publications), while both ‘as-is’ and the other adaptation scenarios are stagnating or in decline.

RQ3: For what purposes have existing data mining methodologies been adapted?

We address the third research question by analyzing what gaps the data mining methodology adaptations seek to fill and the benefits of such adaptations. We identified three adaptation scenarios, namely ‘Modification’, ‘Extension’, and ‘Integration’. Here, we analyze each of them.

Modification

Modifications of data mining methodologies are present in 30 peer-reviewed and 4 ‘grey’ literature studies. The analysis shows that modifications overwhelmingly consist of specific case studies. However, the major differentiating point compared to ‘as-is’ case studies is the clear presence of specific adjustments to standard data mining process methodologies. Yet, the proposed modifications and their purposes do not go beyond the phases of traditional data mining methodologies. They are granular, specialized and executed at the level of tasks, sub-tasks, and deliverables. With modifications, authors describe potential business applications and deployment scenarios at a conceptual level, but typically do not report or present real implementations in IS/IT systems and business processes.

Further, this research subcategory can best be classified by the domains in which the case studies were performed and the modification scenarios executed. We have identified four distinct domain-driven applications, presented in Fig. 9 .

IT, IS domain

The largest number of publications (14, or approximately 40%) concerned IT, IS security, software development, and specific data mining and processing topics. Authors address the intrusion detection problem in Hossain, Bridges & Vaughn (2003) , Fan, Ye & Chen (2016) , Lee, Stolfo & Mok (1999) , specialized algorithms for processing a variety of data types in Yang & Shi (2010) , Chen et al. (2001) , Yi, Teng & Xu (2016) , Pouyanfar & Chen (2016) , and effective and efficient management of computer and mobile networks in Guan & Fu (2010) , Ertek, Chi & Zhang (2017) , Zaki & Sobh (2005) , Chernov, Petrov & Ristaniemi (2015) , Chernov et al. (2014) .

Manufacturing and engineering

The next most popular research area is manufacturing/engineering, with 10 case studies. The central topic here is high-technology manufacturing, for example, the semiconductor-related study of Chien, Diaz & Lan (2014) , and various complex prognostics case studies in the rail and aerospace domains ( Létourneau et al., 2005 ; Zaluski et al., 2011 ) concentrated on failure predictions. These are complemented by studies on equipment fault and failure prediction and maintenance ( Kumar, Shankar & Thakur, 2018 ; Kang et al., 2017 ; Wang, 2017 ) as well as a monitoring system ( García et al., 2017 ).

Sales and services, incl. financial industry

The third category is represented by seven business application papers concerning customer service, targeting and advertising ( Karimi-Majd & Mahootchi, 2015 ; Reutterer et al., 2017 ; Wang, 2017 ), financial services credit risk assessments ( Smith, Willis & Brooks, 2000 ), supply chain management ( Nohuddin et al., 2018 ), property management ( Yu, Fung & Haghighat, 2013 ), and similar topics.

As a consequence of specialization, these studies concentrate on developing a ‘state-of-the-art’ solution to the respective domain-specific problem.

Extension

The ‘Extension’ scenario was identified in 46 peer-reviewed and 12 ‘grey’ publications. We noted that ‘Extensions’ to existing data mining methodologies were executed with four major purposes:

  • Purpose 1: To implement a fully scaled, integrated data mining solution and a regular, repeatable knowledge discovery process: address model and algorithm deployment and implementation design (including architecture, workflows and corresponding IS integration). A complementary goal is to tackle changes to business processes in order to incorporate data mining into the organization’s activities.
  • Purpose 2: To implement complex, specifically designed systems and integrated business applications with data mining model/solution as component or tool. Typically, this adaptation is also oriented towards Big Data specifics, and is complemented by proposed artifacts such as Big Data architectures, system models, workflows, and data flows.
  • Purpose 3: To implement data mining as part of integrated/combined specialized infrastructure, data environments and types (e.g., IoT, cloud, mobile networks).
  • Purpose 4: To incorporate context-awareness aspects.

The specific list of studies mapped to each of the given purposes is presented in the Appendix ( Table A1 ). The main purposes of adaptations, the associated gaps and/or benefits, along with observations and artifacts, are documented in Fig. 10 below.

In the ‘Extension’ category, studies executed with Purpose 1 propose fully scaled, integrated data mining solutions comprising specific data mining models and the associated frameworks and processes. The distinctive trait of this research subclass is that it ensures repeatability and reproducibility of the delivered data mining solution in different organizational and industry settings. Both the results of the data mining use case and its deployment and integration into IS/IT systems and the associated business process(es) are presented explicitly. Thus, this ‘Extension’ subclass is geared towards specific solution design, tackling a concrete business or industrial problem or addressing specific research gaps, and thus resembles a comprehensive case study.

This direction is well exemplified by the expert finder system for research social network services proposed by Sun et al. (2015) , the data mining solution for functional test content optimization by Wang (2015) and the time-series mining framework for estimating unobservable time-series by Hu et al. (2010) . Similarly, Du et al. (2017) tackle online log anomaly detection, while automated association rule mining is addressed by Çinicioğlu et al. (2011) , software effort estimation by Deng, Purvis & Purvis (2011) , and visual discovery of network patterns by Simoff & Galloway (2008) . A number of studies address solutions in IS security ( Shin & Jeong, 2005 ), manufacturing ( Güder et al., 2014 ; Chee, Baharudin & Karkonasasi, 2016 ), materials engineering ( Doreswamy, 2008 ), and business domains ( Xu & Qiu, 2008 ; Ding & Daniel, 2007 ).

In contrast, ‘Extension’ studies executed for the Purpose 2 concentrate on design of complex, multi-component information systems and architectures. These are holistic, complex systems and integrated business applications with data mining framework serving as component or tool. Moreover, data mining methodology in these studies is extended with systems integration phases.

For example, Mobasher (2007) presents a data mining application in a Web personalization system and the associated process; here, the data mining cycle is extended in all phases with the ultimate goal of leveraging multiple data sources and using the discovered models and corresponding algorithms in an automatic personalization system. The authors comprehensively address data processing, algorithm and design adjustments, and their integration into an automated system. Similarly, Haruechaiyasak, Shyu & Chen (2004) tackle the improvement of a Webpage recommender system by presenting an extended data mining methodology, including the design and implementation of a data mining model. A holistic view of web mining, with support for all data sources, integration of data warehousing and data mining techniques, and multiple problem-oriented analytical outcomes with rich business application scenarios (personalization, adaptation, profiling, and recommendations) in the e-commerce domain, was proposed and discussed by Büchner & Mulvenna (1998) . Further, Singh et al. (2014) tackled a scalable implementation of a Network Threat Intrusion Detection System. In this study, the data mining methodology and resulting model are extended, scaled and deployed as a module of a quasi-real-time system for capturing Peer-to-Peer Botnet attacks. A similarly complex solution was presented in a series of publications by Lee et al. (2000 , 2001) , who designed a real-time data mining-based Intrusion Detection System (IDS). These works are complemented by the comprehensive study of Barbará et al. (2001) , who constructed an experimental testbed for intrusion detection with data mining methods. A detection model combining data fusion and mining, with respective components for Botnet identification, was also developed by Kiayias et al. (2009) . A similar approach is presented in Alazab et al. (2011) , who proposed and implemented a zero-day malware detection system with an associated machine-learning based framework. Finally, Ahmed, Rafique & Abulaish (2011) presented a multi-layer framework for fuzzy attack in 3G cellular IP networks.

A number of authors have considered data mining methodologies in the context of Decision Support Systems and other systems that generate information for decision-making, across a variety of domains. For example, Kisilevich, Keim & Rokach (2013) executed significant extension of data mining methodology by designing and presenting integrated Decision Support System (DSS) with six components acting as supporting tool for hotel brokerage business to increase deal profitability. Similar approach is undertaken by Capozzoli et al. (2017) focusing on improving energy management of properties by provision of occupancy pattern information and reconfiguration framework. Kabir (2016) presented data mining information service providing improved sales forecasting that supported solution of under/over-stocking problem while Lau, Zhang & Xu (2018) addressed sales forecasting with sentiment analysis on Big Data. Kamrani, Rong & Gonzalez (2001) proposed GA-based Intelligent Diagnosis system for fault diagnostics in manufacturing domain. The latter was tackled further in Shahbaz et al. (2010) with complex, integrated data mining system for diagnosing and solving manufacturing problems in real time.

Lenz, Wuest & Westkämper (2018) propose a framework for capturing data analytics objectives and creating holistic, cross-departmental data mining systems in the manufacturing domain. This work is representative of a cohort of studies that aim at extending data mining methodologies in order to support the design and implementation of enterprise-wide data mining systems. In this same research cohort, we classify Luna, Castro & Romero (2017) , which presents a data mining toolset integrated into the Moodle learning management system, with the aim of supporting university-wide learning analytics.

One study addresses the concept of multi-agent based data mining. Khan, Mohamudally & Babajee (2013) have developed a unified theoretical framework for data mining by formulating a unified data mining theory. The framework is tested by means of agent programming, proposing integration into a multi-agent system, which is useful due to its scalability, robustness and simplicity.

The subcategory of ‘Extension’ research executed with Purpose 3 is devoted to data mining methodologies and solutions in specialized IT/IS, data and process environments which emerged recently as consequence of Big Data associated technologies and tools development. Exemplary studies include IoT associated environment research, for example, Smart City application in IoT presented by Strohbach et al. (2015) . In the same domain, Bashir & Gill (2016) addressed IoT-enabled smart buildings with the additional challenge of large amount of high-speed real time data and requirements of real-time analytics. Authors proposed integrated IoT Big Data Analytics framework. This research is complemented by interdisciplinary study of Zhong et al. (2017) where IoT and wireless technologies are used to create RFID-enabled environment producing analysis of KPIs to improve logistics.

A significant number of studies address various mobile environments, sometimes complemented by cloud-based environments, or cloud-based environments on their own. Gomes, Phua & Krishnaswamy (2013) addressed mobile data mining executed on the mobile device itself; their framework proposes an innovative approach that extends all aspects of data mining, including contextual data, end-user privacy preservation, data management and scalability. Yuan, Herbert & Emamian (2014) and Yuan & Herbert (2014) introduced a cloud-based mobile data analytics framework with an application case study for a smart-home-based monitoring system. Cuzzocrea, Psaila & Toccu (2016) presented the innovative FollowMe suite, which implements a data mining framework for mobile social media analytics with several tools and their respective architecture and functionalities. An interesting paper was presented by Torres et al. (2017), who addressed a data mining methodology and its implementation for congestion prediction in mobile LTE networks, also tackling feedback reaction through triggers for network reconfiguration.

Further, Biliri et al. (2014) presented the cloud-based Future Internet Enabler, an automated social data analytics solution that also addresses Social Network Interoperability, supporting enterprises in interconnecting and using social networks for collaboration. Real-time streamed social media data and the resulting data mining methodology and application were extensively discussed by Zhang, Lau & Li (2014). The authors proposed the design of a comprehensive ABIGDAD framework with seven main components that implements data-mining-based deceptive review identification. An interdisciplinary study tackling both these topics was developed by Puthal et al. (2016), who proposed an integrated framework and architecture for a disaster management system based on streamed data in a cloud environment that ensures end-to-end security. Additionally, key extensions to the data mining framework have been proposed, merging a variety of data sources and types, security verification and data flow access controls. Finally, cloud-based manufacturing was addressed in the context of fault diagnostics by Kumar et al. (2016).

Also, Mahmood et al. (2013) tackled Wireless Sensor Networks and the extensions they require in the associated data mining framework. Interesting work was carried out by Nestorov & Jukic (2003), who addressed the rare topic of integrating data mining solutions within traditional data warehouses and actively mining the data repositories themselves.

Supported by a new generation of visualization technologies (including Virtual Reality environments), Wijayasekara, Linda & Manic (2011) proposed and implemented CAVE-SOM, a 3D visual data mining framework that offers interactive, immersive visual data mining with multiple visualization modes supported by a plethora of methods. An earlier visual data mining framework was developed and presented by Ganesh et al. (1996).

Large-scale social media data is tackled by Lemieux (2016) with a comprehensive framework accompanied by a set of data mining tools and an interface. Real-time data analytics was addressed by Shrivastava & Pal (2017) in the domain of enterprise service ecosystems. Image data was addressed by Huang et al. (2002), who proposed a multimedia data mining framework and its implementation with user relevance feedback integration and instance learning. Further, the explosion in data diversity and the associated need to extend standard data mining is addressed by Singh et al. (2016) in a study devoted to object detection in video surveillance systems supporting real-time video analysis.

Finally, there is also a limited number of studies that address context awareness (Purpose 4) and extend data mining methodology with context elements and adjustments. In comparison with research in the ‘Integration’ category, these studies operate at a lower abstraction level, capturing and presenting lists of adjustments. Singh, Vajirkar & Lee (2003) generate a taxonomy of context factors, develop an extended data mining framework and propose a deployment including a detailed IS architecture. The context-awareness aspect is also addressed in papers reviewed above, for example, Lenz, Wuest & Westkämper (2018), Kisilevich, Keim & Rokach (2013), Sun et al. (2015), and other studies.

Integration

The ‘Integration’ scenario for data mining methodologies was identified in 27 ‘peer-reviewed’ and 17 ‘grey’ studies. Our analysis revealed that this adaptation scenario, at a higher abstraction level, is typically executed with five key purposes:

  • Purpose 1: to integrate/combine with various ontologies existing in the organization.
  • Purpose 2: to introduce context-awareness and incorporate domain knowledge.
  • Purpose 3: to integrate/combine with frameworks, process methodologies and concepts from other research or industry domains.
  • Purpose 4: to integrate/combine with other well-known organizational governance frameworks, process methodologies and concepts.
  • Purpose 5: to accommodate and/or leverage newly available Big Data technologies, tools and methods.

The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A2). The main purposes of adaptations, associated gaps and/or benefits, along with observations and artifacts, are documented in Fig. 11.

As mentioned, a number of studies concentrate on proposing ontology-based integrated data mining frameworks accompanied by various types of ontologies (Purpose 1). For example, Sharma & Osei-Bryson (2008) focus on an ontology-based organizational view with Actors, Goals and Objectives, which supports execution of the Business Understanding phase. Brisson & Collard (2008) propose the KEOPS framework, which is CRISP-DM compliant and integrates a knowledge base and an ontology in order to build an ontology-driven information system (OIS) for the business and data understanding phases, while the knowledge base is used for the post-processing step of model interpretation. Park et al. (2017) propose and design IRIS, a comprehensive ontology-based data analytics tool intended to align analytics and business. IRIS is based on the concept of connecting dots, analytics methods or transforming insights into business value, and supports a standardized process for applying the ontology to match business problems and solutions.

Further, Ying et al. (2014) propose a domain-specific data mining framework oriented to the business problem of customer demand discovery. They construct an ontology for customer demand and for the customer demand discovery task, which allows structured knowledge extraction in the form of knowledge patterns and rules. Here, the purpose is to facilitate business value realization and support the actionability of extracted knowledge via marketing strategies and tactics. In the same vein, Cannataro & Comito (2003) presented an ontology for the Data Mining domain whose main goal is to simplify the development of distributed knowledge discovery applications. The authors offered domain experts a reference model for different kinds of data mining tasks, methodologies, and software capable of solving the given business problem and finding the most appropriate solution.

Apart from ontologies, Sharma & Osei-Bryson (2009), in another study, propose an IS-inspired data mining methodology, driven by an Input-Output model, which supports formal implementation of the Business Understanding phase. This research exemplifies studies executed with Purpose 2. The goal of the paper is to tackle the prescriptive nature of CRISP-DM and address how the entire process can be implemented. The study by Cao, Schurmann & Zhang (2005) is also exemplary in aggregating and introducing several fundamental concepts into the traditional CRISP-DM data mining cycle: context awareness, in-depth pattern mining, human–machine cooperative knowledge discovery (in essence, following the human-centricity paradigm in data mining), and a loop-closed iterative refinement process (similar to Agile-based methodologies in software development). Several further concepts, such as data, domain, interestingness and rules, are proposed to tackle a number of fundamental constraints identified in CRISP-DM. They have been discussed and further extended by Cao & Zhang (2007, 2008) and Cao (2010) into an integrated domain-driven data mining concept, resulting in the fully fledged D3M (domain-driven) data mining framework. Interestingly, the same concepts are investigated and presented individually by other authors; for example, context-aware data mining methodology is tackled by Xiang (2009a, 2009b) in the context of the financial sector. Pournaras et al. (2016) addressed the crucial topic of privacy preservation in the context of achieving an effective data analytics methodology. The authors introduced metrics and a self-regulatory (reconfigurable) information sharing mechanism that gives customers control over information disclosure.

A number of studies have proposed CRISP-DM adjustments based on existing frameworks, process models or concepts originating in other domains (Purpose 3), for example, software engineering (Marbán et al., 2007, 2009; Marban, Mariscal & Segovia, 2009) and industrial engineering (Solarte, 2002; Zhao et al., 2005).

Meanwhile, Mariscal, Marbán & Fernández (2010) proposed a new refined data mining process based on a global comparative analysis of existing frameworks, while Angelov (2014) outlined a data analytics framework based on statistical concepts. Following a similar approach, some researchers suggest explicit integration with other areas and organizational functions, for example, BI-driven Data Mining by Hang & Fong (2009). Similarly, Chen, Kazman & Haziyev (2016) developed an architecture-centric agile Big Data analytics methodology, and an architecture-centric agile analytics and DevOps model. Alternatively, several authors tackled data mining methodology adaptations in other domains, for example, educational data mining by Tavares, Vieira & Pedro (2017), decision support in learning management systems (Murnion & Helfert, 2011), and in accounting systems (Amani & Fadlalla, 2017).

Other studies are concerned with actionability of data mining and closer integration with business processes and organizational management frameworks (Purpose 4). In particular, there is a recurrent focus on embedding data mining solutions into knowledge-based decision making processes in organizations, and supporting fast and effective knowledge discovery (Bohanec, Robnik-Sikonja & Borstnar, 2017).

Examples of adaptations made for this purpose include: (1) integration of CRISP-DM with the Balanced Scorecard framework used for strategic performance management in organizations (Yun, Weihua & Yang, 2014); (2) integration with a strategic decision-making framework for revenue management (Segarra et al., 2016); (3) integration with a strategic analytics methodology (Van Rooyen & Simoff, 2008); and (4) integration with a so-called ‘Analytics Canvas’ for management of portfolios of data analytics projects (Kühn et al., 2018). Finally, Ahangama & Poo (2015) explored methodological attributes important for adoption of data mining methodology by novice users. This latter study uncovered factors that could support the reduction of resistance to the use of data mining methodologies. Conversely, Lawler & Joseph (2017) comprehensively evaluated factors that may increase the benefits of Big Data Analytics projects in an organization.

Lastly, a number of studies have proposed adaptations of data mining frameworks (e.g., CRISP-DM) to cater for new technological architectures, new types of datasets and applications (Purpose 5). For example, Lu et al. (2017) proposed a data mining system based on a Service-Oriented Architecture (SOA), Zaghloul, Ali-Eldin & Salem (2013) developed a concept of self-service data analytics, Osman, Elragal & Bergvall-Kåreborn (2017) blended CRISP-DM into a Big Data Analytics framework for Smart Cities, and Niesen et al. (2016) proposed a data-driven risk management framework for Industry 4.0 applications.

Our analysis of RQ3, regarding the purposes of adaptations of existing data mining methodologies, revealed the following key findings. Firstly, adaptations of type ‘Modification’ are predominantly targeted at addressing problems that are specific to a given case study. The majority of modifications were made within the domain of IS security, followed by case studies in the domains of manufacturing and financial services. Secondly, and in clear contrast, adaptations of type ‘Extension’ are primarily aimed at customizing the methodology to take into account specialized development environments and deployment infrastructures, and to incorporate context-awareness aspects. Thirdly, a recurrent purpose of adaptations of type ‘Integration’ is to combine a data mining methodology with either existing ontologies in an organization or with other domain frameworks, methodologies, and concepts. ‘Integration’ is also used to instill context-awareness and domain knowledge into a data mining methodology, or to adapt it to specialized methods and tools, such as Big Data. The distinctive outcome and value (gaps filled in) of ‘Integrations’ stems from improved knowledge discovery, better actionability of results, improved combination with key organizational processes and domain-specific methodologies, and improved usage of Big Data technologies.

We discovered that the adaptations of existing data mining methodologies found in the literature can be classified into three categories: modification, extension, or integration.

We also noted that adaptations are executed either to address deficiencies and missing elements or aspects in the reference methodology (chiefly CRISP-DM), or to improve certain phases, deliverables or process outcomes.

In short, adaptations are made to:

  • improve key phases of reference data mining methodologies; in the case of CRISP-DM, these are primarily the business understanding and deployment phases.
  • support knowledge discovery and actionability.
  • introduce context-awareness and a higher degree of formalization.
  • integrate data mining solutions more closely with key organizational processes and frameworks.
  • significantly update CRISP-DM with respect to Big Data technologies, tools, environments and infrastructure.
  • incorporate a broader, explicit context of architectures, algorithms and toolsets as integral deliverables or supporting tools for executing the data mining process.
  • expand and accommodate a broader, unified perspective for incorporating and implementing data mining solutions in organizations, IT infrastructure and business processes.

Threats to Validity

Systematic literature reviews have inherent limitations that must be acknowledged. These threats to validity include subjective bias (internal validity) and incompleteness of search results (external validity).

The internal validity threat stems from the subjective screening and rating of studies, particularly when assessing the studies with respect to relevance and quality criteria. We have mitigated these effects by documenting the survey protocol (SLR Protocol), strictly adhering to the inclusion criteria, and performing significant validation procedures, as documented in the Protocol.

The external validity threat relates to the extent to which the findings of the SLR reflect the actual state of the art in the field of data mining methodologies, given that the SLR only considers published studies that can be retrieved using specific search strings and databases. We have addressed this threat to validity by conducting trial searches to validate our search strings in terms of their ability to identify relevant papers that we knew about beforehand. Also, the fact that the searches led to 1,700 hits overall suggests that a significant portion of the relevant literature has been covered.

In this study, we have examined the use of data mining methodologies by means of a systematic literature review covering both peer-reviewed and ‘grey’ literature. We have found that the use of data mining methodologies, as reported in the literature, has grown substantially since 2007 (four-fold increase relative to the previous decade). Also, we have observed that data mining methodologies were predominantly applied ‘as-is’ from 1997 to 2007. This trend was reversed from 2008 onward, when the use of adapted data mining methodologies gradually started to replace ‘as-is’ usage.

The most frequent adaptations have been in the ‘Extension’ category. This category refers to adaptations that imply significant changes to key phases of the reference methodology (chiefly CRISP-DM). These adaptations particularly target the business understanding, deployment and implementation phases of CRISP-DM (or other methodologies). Moreover, we have found that the most frequent purposes of adaptations are: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). A key finding is that standard data mining methodologies do not pay sufficient attention to deployment aspects required to scale and transform data mining models into software products integrated into large IT/IS systems and business processes.

Apart from the adaptations in the ‘Extension’ category, we have also identified an increasing number of studies focusing on the ‘Integration’ of data mining methodologies with other domain-specific and organizational methodologies, frameworks, and concepts. These adaptations are aimed at embedding the data mining methodology into broader organizational aspects.

Overall, the findings of the study highlight the need to develop refinements of existing data mining methodologies that would allow them to seamlessly interact with IT development platforms and processes (technological adaptation) and with organizational management frameworks (organizational adaptation). In other words, there is a need to frame existing data mining methodologies as being part of a broader ecosystem of methodologies, as opposed to the traditional view where data mining methodologies are defined in isolation from broader IT systems engineering and organizational management methodologies.

Supplemental Information

Supplemental information 1.

Unfortunately, we were not able to upload the graphs (original PNG files). We constructed the graph files from the examples in the PeerJ template on Overleaf, but could not determine why they did not fit; converting them to new formats would change the text flow and the generated PDF file. We therefore submit the graphs in an archived file as part of the supplementary material and will redo them according to further instructions.

Supplemental Information 2

The file starts with a Definitions page, which lists and explains all column definitions as well as the SLR scoring metrics. The second page contains the "Peer reviewed" texts, while the next one contains the "grey" literature corpus.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare that they have no competing interests.

Veronika Plotnikova conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Marlon Dumas conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Fredrik Milani conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Primary Sources

  • Open access
  • Published: 11 August 2021

Data mining in clinical big data: the frequently used databases, steps, and methodological models

  • Wen-Tao Wu,
  • Yuan-Jie Li,
  • Ao-Zi Feng,
  • Tao Huang,
  • An-Ding Xu &
  • Jun Lyu (ORCID: orcid.org/0000-0002-2237-8771)

Military Medical Research volume 8, Article number: 44 (2021)

Many high-quality studies have emerged from public databases, such as Surveillance, Epidemiology, and End Results (SEER), National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, irregularity, and other characteristics, resulting in the value of these data not being fully utilized. Data-mining technology has been a frontier field in medical research, as it demonstrates excellent performance in evaluating patient risks and assisting clinical decision-making in building disease-prediction models. Therefore, data mining has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduced the main medical public databases and described the steps, tasks, and models of data mining in simple language. Additionally, we described data-mining methods along with their practical applications. The goal of this work was to aid clinical researchers in gaining a clear and intuitive understanding of the application of data-mining technology on clinical big-data in order to promote the production of research results that are beneficial to doctors and patients.

With the rapid development of computer software/hardware and internet technology, the amount of data has increased at an amazing speed. “Big data” as an abstract concept currently affects all walks of life [1], and although its importance has been recognized, its definition varies slightly from field to field. In the field of computer science, big data refers to a dataset that cannot be perceived, acquired, managed, processed, or served within a tolerable time by using traditional IT and software and hardware tools. Generally, big data refers to a dataset that exceeds the scope of the simple databases and data-processing architectures used in the early days of computing; it is characterized by high-volume, high-dimensional data that is rapidly updated, and it represents a phenomenon or feature that has emerged in the digital age. Across the medical industry, various types of medical data are generated at a high speed, and trends indicate that applying big data in the medical field helps improve the quality of medical care and optimizes medical processes and management strategies [2, 3]. Currently, this trend is shifting from civilian medicine to military medicine. For example, the United States is exploring the potential use of one of its largest healthcare systems (the Military Healthcare System) to provide healthcare to eligible veterans in order to potentially benefit > 9 million eligible personnel [4]. Another data-management system has been developed to assess the physical and mental health of active-duty personnel, with this expected to yield significant economic benefits to the military medical system [5]. However, in medical research, the wide variety of clinical data and the differences between medical concepts across classification standards result in a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity in existing clinical data [6, 7]. Furthermore, new data analysis techniques have yet to be popularized in medical research [8]. These reasons hinder the full realization of the value of existing data, and the intensive exploration of the value of clinical data remains a challenging problem.

Computer scientists have made outstanding contributions to the application of big data and introduced the concept of data mining to solve difficulties associated with such applications. Data mining (also known as knowledge discovery in databases) refers to the process of extracting potentially useful information and knowledge hidden in a large amount of incomplete, noisy, fuzzy, and random practical application data [ 9 ]. Unlike traditional research methods, several data-mining technologies mine information to discover knowledge based on the premise of unclear assumptions (i.e., they are directly applied without prior research design). The obtained information should have previously unknown, valid, and practical characteristics [ 9 ]. Data-mining technology does not aim to replace traditional statistical analysis techniques, but it does seek to extend and expand statistical analysis methodologies. From a practical point of view, machine learning (ML) is the main analytical method in data mining, as it represents a method of training models by using data and then using those models for predicting outcomes. Given the rapid progress of data-mining technology and its excellent performance in other industries and fields, it has introduced new opportunities and prospects to clinical big-data research [ 10 ]. Large amounts of high quality medical data are available to researchers in the form of public databases, which enable more researchers to participate in the process of medical data mining in the hope that the generated results can further guide clinical practice.

This article provided a valuable overview to medical researchers interested in studying the application of data mining on clinical big data. To allow a clearer understanding of the application of data-mining technology on clinical big data, the second part of this paper introduced the concept of public databases and summarized those commonly used in medical research. In the third part of the paper, we offered an overview of data mining, including introducing an appropriate model, tasks, and processes, and summarized the specific methods of data mining. In the fourth and fifth parts of this paper, we introduced data-mining algorithms commonly used in clinical practice along with specific cases in order to help clinical researchers clearly and intuitively understand the application of data-mining technology on clinical big data. Finally, we discussed the advantages and disadvantages of data mining in clinical analysis and offered insight into possible future applications.

Overview of common public medical databases

A public database is a data repository used for research and dedicated to housing data related to scientific research on an open platform. Such databases collect and store heterogeneous, multi-dimensional health, medical, and scientific research data in a structured form, and are characterized by large volume, multiple ownership, complexity, and security requirements. These databases cover a wide range of data, including those related to cancer research, disease burden, nutrition and health, and genetics and the environment. Table 1 summarizes the main public medical databases [11–26]. Researchers can apply for access to data based on the scope of the database and the application procedures required to perform relevant medical research.

Data mining: an overview

Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines [ 27 ]. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians make clinical decisions [ 28 , 29 , 30 , 31 ].

Data-mining models

Data-mining has two kinds of models: descriptive and predictive. Predictive models are used to predict unknown or future values of other variables of interest, whereas descriptive models are often used to find patterns that describe data that can be interpreted by humans [ 32 ].

Data-mining tasks

A model is usually implemented by a task, with the goal of description being to generalize patterns of potential associations in the data. Therefore, using a descriptive model usually results in a few collections with the same or similar attributes. Prediction mainly refers to estimation of the variable value of a specific attribute based on the variable values of other attributes, including classification and regression [ 33 ].

Data-mining methods

After the data-mining model and task have been defined, the data-mining methods needed to build the approach are chosen according to the discipline involved. The choice of method depends on whether or not dependent variables (labels) are present in the analysis. Predictions with dependent variables (labels) are generated through supervised learning, which can be performed by the use of linear regression, generalized linear regression, a proportional hazards model (the Cox regression model), a competing risk model, decision trees, the random forest (RF) algorithm, and support vector machines (SVMs). In contrast, unsupervised learning involves no labels; the learning model instead infers some internal structure of the data. Common unsupervised learning methods include principal component analysis (PCA), association analysis, and clustering analysis.
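The distinction between supervised and unsupervised learning can be illustrated with a minimal sketch, using only the Python standard library. The function names `fit_line` and `two_means` and the toy numbers are assumptions made for illustration and are not taken from the studies cited here: the first function fits a simple linear regression from labelled pairs, whereas the second groups unlabelled values into two clusters (a one-dimensional k-means with k = 2).

```python
from statistics import mean

def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = a + b*x, where ys are the labels."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b  # (intercept, slope)

def two_means(values, iterations=10):
    """Unsupervised: 1-D k-means with k=2; no labels are used.

    Assumes the input contains at least two distinct values.
    """
    lo, hi = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(left), mean(right)  # move each centre to its cluster mean
    return lo, hi

# Hypothetical toy data, for illustration only.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
centres = two_means([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
```

In the supervised case the labels `ys` drive the fit; in the unsupervised case the structure (two groups) is inferred from the values alone.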

Data-mining algorithms for clinical big data

Data mining based on clinical big data can produce effective and valuable knowledge, which is essential for accurate clinical decision-making and risk assessment [ 34 ]. Data-mining algorithms enable realization of these goals.

Supervised learning

A concept often mentioned in supervised learning is the partitioning of datasets. To prevent overfitting of a model, a dataset is generally divided into two or three parts: a training set, a validation set, and a test set. Ripley [35] defined these parts, respectively, as a set of examples used for learning, that is, to fit the parameters (i.e., weights) of the classifier; a set of examples used to tune the parameters (i.e., architecture) of a classifier; and a set of examples used only to assess the performance (generalization) of a fully specified classifier. Briefly, the training set is used to train the model or determine the model parameters, the validation set is used to perform model selection, and the test set is used to verify model performance. In practice, data are generally divided into training and test sets, whereas a validation set is used less often. It should be emphasized that good results on the test set do not guarantee model correctness but only show that the model can obtain similar results on similar data. Therefore, the applicability of a model should be analysed in combination with specific problems in the research. Classical statistical methods, such as linear regression, generalized linear regression, and the proportional hazards model, have been widely used in medical research. Notably, most of these classical statistical methods have certain data requirements or assumptions; however, in the face of complicated clinical data, assumptions about data distribution are difficult to make. In contrast, some ML methods (algorithmic models) make no assumptions about the data and cross-validate the results; thus, they are likely to be favoured by clinical researchers [36]. For these reasons, this chapter focuses on ML methods that do not require assumptions about data distribution and classical statistical methods that are used in specific situations.
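The partitioning described above can be sketched in a few lines of standard-library Python. The function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are illustrative assumptions, not values prescribed by the source; any remaining records after the training and validation portions form the test set.

```python
import random

def split_dataset(records, train=0.6, validation=0.2, seed=0):
    """Shuffle the records and cut them into training, validation and test sets.

    The proportions are illustrative; whatever is left after the training
    and validation portions becomes the test set.
    """
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # fit model parameters
        shuffled[n_train:n_train + n_val],  # model selection / tuning
        shuffled[n_train + n_val:],         # final performance check only
    )

# Hypothetical record identifiers, for illustration only.
train_set, validation_set, test_set = split_dataset(range(100))
```

Keeping the test set out of both fitting and tuning is what allows it to serve as an estimate of performance on similar, unseen data.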

Decision tree

A decision tree is a basic classification and regression method that generates a result similar to the tree structure of a flowchart, where each internal node represents a test on an attribute, each branch represents an outcome of that test, each leaf node represents a class or class distribution, and the topmost node of the tree is the root node [37]. The decision tree model is called a classification tree when used for classification and a regression tree when used for regression. Studies have demonstrated the utility of the decision tree model in clinical applications. In a study on the prognosis of breast cancer patients, a decision tree model and a classical logistic regression model were both constructed, and a comparison of their predictive performance indicated that the decision tree model showed stronger predictive power when using real clinical data [38]. Similarly, the decision tree model has been applied to other areas of clinical medicine, including diagnosis of kidney stones [39], predicting the risk of sudden cardiac arrest [40], and exploration of the risk factors of type II diabetes [41]. A common feature of these studies is the use of a decision tree model to explore the interaction between variables and classify subjects into homogeneous categories based on their observed characteristics. In fact, because the decision tree accounts for the strong interaction between variables, it is more suitable for use with decision algorithms that follow the same structure [42]. In the construction of clinical prediction models and exploration of disease risk factors and patient prognosis, the decision tree model might offer more advantages and practical application value than some classical algorithms. Although the decision tree has many advantages, it recursively separates observations into branches to construct a tree; therefore, when data are imbalanced, the precision of decision tree models needs improvement.
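To make the notion of a node "test on an attribute" concrete, the following minimal sketch selects a single split threshold for one numeric attribute by minimizing the weighted Gini impurity of the two resulting branches. Gini impurity is one common splitting criterion and is used here as an assumption; the cited studies may rely on other criteria. The names `gini` and `best_split` and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test 'value <= threshold'.

    Returns (None, impurity of the unsplit node) if no split improves on it.
    """
    best = (None, gini(labels))
    for threshold in sorted(set(values))[:-1]:  # the largest value leaves no right branch
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical measurements and outcome classes, for illustration only.
threshold, impurity = best_split([1, 2, 3, 10, 11, 12],
                                 ["low", "low", "low", "high", "high", "high"])
```

A full tree would apply the same search recursively to each branch, which is the recursive separation of observations referred to above.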

The RF method

The RF algorithm was developed as an application of an ensemble-learning method based on a collection of decision trees. The bootstrap method [ 43 ] is used to randomly draw sample sets, with replacement, from the training set; a decision tree is grown on each bootstrap sample, these trees together constitute a “random forest”, and predictions are derived from an ensemble average (for regression) or a majority vote (for classification). The biggest advantage of the RF method is that the random sampling of predictor variables at each decision tree node decreases the correlation among the trees in the forest, thereby improving the precision of ensemble predictions [ 44 ]. Given that a single decision tree model might encounter the problem of overfitting [ 45 ], RF was initially applied to minimize overfitting in classification and regression and to improve predictive accuracy [ 44 ]. Taylor et al. [ 46 ] highlighted the potential of RF in correctly differentiating in-hospital mortality in patients experiencing sepsis after admission to the emergency department. Nowhere in the healthcare system is the need more pressing to find methods to reduce uncertainty than in the fast, chaotic environment of the emergency department. The authors demonstrated that the predictive performance of the RF method was superior to that of traditional emergency medicine methods and that the method enabled evaluation of more clinical variables than traditional modelling methods, which subsequently allowed the discovery of clinical variables not expected to be of predictive value or which otherwise would have been omitted as a rare predictor [ 46 ]. Another study based on the Medical Information Mart for Intensive Care (MIMIC) II database [ 47 ] found that RF had excellent predictive power regarding intensive care unit (ICU) mortality [ 48 ]. These studies showed that the application of RF to big data stored in the hospital healthcare system provided a new data-driven method for predictive analysis in critical care.
Additionally, random survival forests have recently been developed to analyse survival data, especially right-censored survival data [ 49 , 50 ], which can help researchers conduct survival analyses in clinical oncology and help develop personalized treatment regimens that benefit patients [ 51 ].
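
The two sources of randomness described above (bootstrap sampling of observations and random selection of predictors) and the majority vote can be sketched as follows. This is a deliberately reduced illustration, not a reference implementation: each "tree" is a single-split stump, one predictor is drawn per tree, and all names and data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def fit_stump(rows, labels, feature):
    """One-split tree on a single feature: (feature, threshold, left_class, right_class)."""
    majority = Counter(labels).most_common(1)[0][0]
    best, best_score = (feature, None, majority, majority), gini(labels)
    for t in sorted({r[feature] for r in rows})[:-1]:
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score = score
            best = (feature, t, Counter(left).most_common(1)[0][0],
                    Counter(right).most_common(1)[0][0])
    return best

def predict_stump(stump, row):
    feature, t, left, right = stump
    return left if t is None or row[feature] <= t else right

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, forest = len(rows), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample with replacement
        feature = rng.randrange(len(rows[0]))        # random predictor for this tree's node
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]                # majority vote across trees

# Hypothetical toy data: two predictors, two well-separated risk groups.
rows = [(1, 10), (2, 11), (3, 12), (4, 13), (7, 20), (8, 21), (9, 22), (10, 23)]
labels = ["low"] * 4 + ["high"] * 4
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 5)), predict_forest(forest, (12, 30)))
```

For regression, the vote would be replaced by the average of the individual tree predictions.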

The SVM is a relatively new classification or prediction method developed by Cortes and Vapnik and represents a data-driven approach that does not require assumptions about data distribution [ 52 ]. The core purpose of an SVM is to identify a separation boundary (called a hyperplane) to help classify cases; thus, the advantages of SVMs are obvious when classifying and predicting cases based on high dimensional data or data with a small sample size [ 53 , 54 ].

In a study of drug compliance in patients with heart failure, researchers used an SVM to build a predictive model for patient compliance in order to overcome the problem of a large number of input variables relative to the number of available observations [ 55 ]. Additionally, the mechanisms of certain chronic and complex diseases observed in clinical practice remain unclear, and many risk factors, including gene–gene interactions and gene-environment interactions, must be considered in the research of such diseases [ 55 , 56 ]. SVMs are capable of addressing these issues. Yu et al. [ 54 ] applied an SVM for predicting diabetes onset based on data from the National Health and Nutrition Examination Survey (NHANES). Furthermore, these models have strong discrimination ability, making SVMs a promising classification approach for detecting individuals with chronic and complex diseases. However, a disadvantage of SVMs is that when the number of observation samples is large, the method becomes time- and resource-intensive, which is often highly inefficient.
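
How a separating hyperplane is fitted can be shown with a small sketch. It assumes a linear soft-margin SVM trained by full-batch subgradient descent on the hinge loss; this is a simplification chosen for brevity (practical SVM software typically solves the dual problem and supports kernels), and the hyperparameters and data are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by full-batch subgradient descent.
    Labels in y must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]          # gradient of the regularisation term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                       # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data.
X = [(-2, -2), (-3, -1), (-1, -3), (2, 2), (3, 1), (1, 3)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
print(predict(w, b, (4, 4)), predict(w, b, (-4, -4)))   # 1 -1
```

Each pass loops over all observations, which hints at why training becomes expensive when the number of samples is large.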

Competitive risk model

Kaplan–Meier marginal regression and the Cox proportional hazards model are widely used in survival analysis in clinical studies. Classical survival analysis usually considers only one endpoint, such as the impact of patient survival time. However, in clinical medical research, multiple endpoints usually coexist, and these endpoints compete with one another to generate competitive risk data [ 57 ]. In the case of multiple endpoint events, the use of a single endpoint-analysis method can lead to a biased estimation of the probability of endpoint events due to the existence of competitive risks [ 58 ]. The competitive risk model is a classical statistical model based on the hypothesis of data distribution. Its main advantage is its accurate estimation of the cumulative incidence of outcomes for right-censored survival data with multiple endpoints [ 59 ]. In data analysis, the cumulative risk rate is estimated using the cumulative incidence function in single-factor analysis, and Gray’s test is used for between-group comparisons [ 60 ].
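
The bias introduced by a single-endpoint analysis can be illustrated numerically. The sketch below assumes the nonparametric Aalen-Johansen estimator of the cumulative incidence function and compares it with the complement of a Kaplan-Meier estimate that treats competing events as censoring; the event coding and the five-patient data set are hypothetical.

```python
def cumulative_incidence(times, events, cause):
    """Aalen-Johansen estimate of the cumulative incidence of `cause`.
    events: 0 = right-censored, any other integer = type of endpoint."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0                  # probability of no event of any type before current time
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for (tt, e) in data if tt == t]
        d_cause = sum(1 for e in tied if e == cause)
        d_any = sum(1 for e in tied if e != 0)
        cif += surv * d_cause / at_risk
        surv *= 1.0 - d_any / at_risk
        curve.append((t, cif))
        at_risk -= len(tied)
        i += len(tied)
    return curve

def naive_one_minus_km(times, events, cause):
    """1 - Kaplan-Meier with competing events recoded as censoring.
    With a single event type the estimator above reduces to 1 - KM."""
    recoded = [1 if e == cause else 0 for e in events]
    return cumulative_incidence(times, recoded, 1)

# Hypothetical data: 1 = endpoint of interest, 2 = competing endpoint, 0 = censored.
times = [1, 2, 3, 4, 5]
events = [2, 1, 2, 1, 0]
print(cumulative_incidence(times, events, cause=1)[-1])   # final estimate about 0.4
print(naive_one_minus_km(times, events, cause=1)[-1])     # final estimate 0.625
```

In this toy example, the naive estimate (0.625) exceeds the competing-risk estimate (about 0.4) because patients removed by the competing endpoint are treated as if they could still experience the endpoint of interest.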

Multifactor analysis uses the Fine-Gray and cause-specific (CS) risk models to explore the cumulative risk rate [ 61 ]. The difference between the Fine-Gray and CS models is that the former is applicable to establishing a clinical prediction model and predicting the risk of a single endpoint of interest [ 62 ], whereas the latter is suitable for answering etiological questions, where the regression coefficient reflects the relative effect of covariates on the increased incidence of the main endpoint in the target event-free risk set [ 63 ]. Currently, in databases with CS records, such as Surveillance, Epidemiology, and End Results (SEER), competitive risk models exhibit good performance in exploring disease-risk factors and prognosis [ 64 ]. A study of prognosis in patients with oesophageal cancer from SEER showed that Cox proportional risk models might misestimate the effects of age and disease location on patient prognosis, whereas competitive risk models provide more accurate estimates of factors affecting patient prognosis [ 65 ]. In another study of the prognosis of penile cancer patients, researchers found that using a competitive risk model was more helpful in developing personalized treatment plans [ 66 ].
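
For reference, the two hazards underlying these models can be written as follows. The notation is introduced here for illustration and is not taken from the cited studies: T denotes the time to the first event, D the type of that event, and k the endpoint of interest.

```latex
% Cause-specific hazard: the risk set contains only subjects free of any event before t
\lambda_k^{\mathrm{CS}}(t) = \lim_{\Delta t \to 0}
  \frac{P(t \le T < t + \Delta t,\ D = k \mid T \ge t)}{\Delta t}

% Subdistribution hazard (Fine-Gray): subjects with a competing event stay in the risk set
\lambda_k^{\mathrm{SD}}(t) = \lim_{\Delta t \to 0}
  \frac{P\bigl(t \le T < t + \Delta t,\ D = k \mid T \ge t \ \text{or}\ (T < t,\ D \ne k)\bigr)}{\Delta t}

% The subdistribution hazard maps directly onto the cumulative incidence function
F_k(t) = P(T \le t,\ D = k) = 1 - \exp\!\Bigl(-\int_0^t \lambda_k^{\mathrm{SD}}(s)\,ds\Bigr)
```

The direct link in the last line is what makes the Fine-Gray model convenient for predicting the risk of a single endpoint, whereas the cause-specific hazard is defined on the event-free risk set and therefore suits etiological questions.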

Unsupervised learning

In many data-analysis processes, the amount of usable labelled data is small, and labelling data is a tedious process [ 67 ]. Unsupervised learning is necessary to judge and categorize data according to similarities, characteristics, and correlations and has three main applications: data clustering, association analysis, and dimensionality reduction. Therefore, the unsupervised learning methods introduced in this section include clustering analysis, association rules, and PCA.

Clustering analysis

The classification algorithm needs to “know” information concerning each category in advance, with all of the data to be classified having corresponding categories. When the above conditions cannot be met, cluster analysis can be applied to solve the problem [ 68 ]. Clustering places similar objects into the same category or subset through the process of static classification. Consequently, objects in the same subset have similar properties. Many kinds of clustering techniques exist. Here, we introduce the four most commonly used clustering techniques.

Partition clustering

The core idea of this clustering method is to regard the centre of a group of data points as the centre of the cluster. The k-means method [ 69 ] is a representative example of this technique. The k-means method takes n observations and an integer, k , and outputs a partition of the n observations into k sets such that each observation belongs to the cluster with the nearest mean [ 70 ]. The k-means method exhibits low time complexity and high computing efficiency but performs poorly on high dimensional data and cannot identify nonspherical clusters.
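
The alternating assignment and update steps of k-means can be sketched as follows. This is a minimal version of Lloyd's algorithm that assumes Euclidean distance and caller-supplied initial centres; the data are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm; `centroids` are the initial cluster centres."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])   # keep the centre of an empty cluster
        if new_centroids == centroids:               # converged: no centre moved
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data with two compact groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
```

Because every point is assigned to its nearest mean, the resulting clusters are convex, which is consistent with the limitation regarding nonspherical clusters noted above.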

Hierarchical clustering

The hierarchical clustering algorithm decomposes a dataset hierarchically to facilitate the subsequent clustering [ 71 ]. Common algorithms for hierarchical clustering include BIRCH [ 72 ], CURE [ 73 ], and ROCK [ 74 ]. In its bottom-up form, the algorithm starts by treating every point as its own cluster and then repeatedly merges the clusters that are closest to one another. The merging process ends when further merging would produce undesirable results or when only one cluster remains. This method has wide applicability, and the relationship between clusters is easy to detect; however, the time complexity is high [ 75 ].
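
The bottom-up merging procedure can be sketched as follows. The example assumes plain agglomerative clustering with single linkage and a target number of clusters as the stopping rule; it is not an implementation of BIRCH, CURE, or ROCK, and the data are hypothetical.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage (distance between closest members)."""
    clusters = [[i] for i in range(len(points))]      # every point starts as a cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]       # merge the two closest clusters
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: two groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
print(agglomerative(points, 3))   # [[0, 1, 2], [3, 4], [5]]
```

The nested loops over all pairs of clusters and all pairs of members make the high time complexity mentioned above easy to see.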

Clustering according to density

The density algorithm takes areas presenting a high degree of data density and defines these as belonging to the same cluster [ 76 ]. This method aims to find arbitrarily-shaped clusters, with the most representative algorithm being DBSCAN [ 77 ]. In practice, DBSCAN does not require the number of clusters to be specified in advance and can handle clusters of various shapes; however, the time complexity of the algorithm is high. Furthermore, the quality of the clusters decreases when data density is irregular, and DBSCAN struggles with high dimensional data [ 75 ].
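
A compact sketch of DBSCAN is given below. It assumes Euclidean distance and a brute-force neighbour search, and the parameter names `eps` and `min_pts` follow common usage rather than any specific library; the data are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisional noise; may become a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:         # j is also a core point, so keep expanding
                queue.extend(nb)
    return labels

# Hypothetical toy data: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (20, 20)]
print(dbscan(points, eps=1.5, min_pts=3))   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is never passed in; it emerges from the density parameters, and the brute-force neighbour search accounts for the quadratic cost of this naive version.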

Clustering according to a grid

Neither partition nor hierarchical clustering can identify clusters with nonconvex shapes. Although a density-based algorithm can accomplish this task, the time complexity is high. To address this problem, data-mining researchers proposed grid-based algorithms that change the original data space into a grid structure of a certain size. A representative algorithm is STING, which divides the data space into several square cells according to different resolutions and clusters the data of different structure levels [ 78 ]. The main advantage of this method is its high processing speed, which depends only on the number of cells in each dimension of the quantized space.
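
The idea of clustering cells rather than individual points can be sketched as follows. This is a strongly simplified, single-resolution illustration for two-dimensional data and is not an implementation of STING, which maintains a hierarchy of cells with summary statistics; the parameters and data are hypothetical.

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Bin 2-D points into square cells, keep dense cells, join adjacent dense cells."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    labels = [-1] * len(points)            # -1 marks points that fall in sparse cells
    seen, cluster = set(), 0
    for start in sorted(dense):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        while stack:                        # flood fill over neighbouring dense cells
            cx, cy = stack.pop()
            for idx in cells[(cx, cy)]:
                labels[idx] = cluster
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        cluster += 1
    return labels

# Hypothetical toy data.
points = [(0.2, 0.3), (0.6, 0.7), (1.2, 0.4), (1.5, 0.8), (7.1, 7.2), (7.5, 7.9), (4.0, 4.0)]
print(grid_cluster(points, cell_size=1.0, min_count=2))   # [0, 0, 0, 0, 1, 1, -1]
```

After the single binning pass, all remaining work is done on cells, so the cost of the clustering step depends on the number of occupied cells rather than on the number of points.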

In clinical studies, subjects tend to be actual patients. Although researchers adopt complex inclusion and exclusion criteria before determining the subjects to be included in the analyses, heterogeneity among different patients cannot be avoided [ 79 , 80 ]. The most common application of cluster analysis in clinical big data is in classifying heterogeneous mixed groups into homogeneous groups according to the characteristics of existing data (i.e., “subgroups” of patients or observed objects are identified) [ 81 , 82 ]. This new information can then be used in the future to develop patient-oriented medical-management strategies. Docampo et al. [ 81 ] used hierarchical clustering to reduce heterogeneity and identify subgroups of clinical fibromyalgia, which aided the evaluation and management of fibromyalgia. Additionally, Guo et al. [ 83 ] used k-means clustering to divide patients with essential hypertension into four subgroups, which revealed that the potential risk of coronary heart disease differed between different subgroups. On the other hand, density- and grid-based clustering algorithms have mostly been used to process large numbers of images generated in basic research and clinical practice, with current studies focused on developing new tools to help clinical research and practices based on these technologies [ 84 , 85 ]. Cluster analysis will continue to have extensive application prospects along with the increasing emphasis on personalized treatment.

Association rules

Association rules discover interesting associations and correlations between item sets in large amounts of data. These rules were first proposed by Agrawal et al. [ 86 ] and applied to analyse customer buying habits to help retailers create sales plans. Data mining based on association rules proceeds in two steps: 1) all frequent item sets in the collection are listed and 2) association rules are generated from these frequent item sets [ 87 ]. Therefore, before association rules can be obtained, the frequent item sets must be calculated using certain algorithms. The Apriori algorithm is based on the a priori principle and finds all item sets in the database transactions that meet a minimum support threshold and any other specified restrictions [ 88 ]. Other algorithms are mostly variants of the Apriori algorithm [ 64 ]. The Apriori algorithm must scan the entire database on every pass; therefore, algorithm performance deteriorates as database size increases [ 89 ], making it potentially unsuitable for analysing large databases. The frequent pattern (FP) growth algorithm was proposed to improve efficiency. After the first scan, the FP algorithm compresses the frequent item sets in the database into an FP tree while retaining the associated information and then mines the conditional pattern bases separately [ 90 ]. Association-rule technology is often used in medical research to identify association rules between disease risk factors (i.e., exploration of the joint effects of disease risk factors and combinations of other risk factors). For example, Li et al. [ 91 ] used the association-rule algorithm to identify the most important stroke risk factor as atrial fibrillation, followed by diabetes and a family history of stroke. Based on the same principle, association rules can also be used to evaluate treatment effects and other aspects. For example, Guo et al. [ 92 ] used the FP algorithm to generate association rules and evaluate individual characteristics and treatment effects of patients with diabetes, thereby reducing the readmission rate of patients with diabetes. Association rules reveal a connection between premises and conclusions; however, the reasonable and reliable application of information can only be achieved through validation by experienced medical professionals and through extensive causal research [ 92 ].
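
The two-step process (frequent item sets first, rules second) can be sketched with a small Apriori implementation. The sketch assumes support and confidence as the only thresholds; the function names and the five toy "patient records" are hypothetical and carry no clinical meaning.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every item set with support >= min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # One pass over the whole database for each item-set size.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step, then prune candidates that have an infrequent k-subset.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting min_confidence."""
    out = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = support / frequent[lhs]   # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

# Hypothetical toy records.
transactions = [{"AF", "diabetes", "stroke"}, {"AF", "stroke"}, {"diabetes"},
                {"AF", "diabetes", "stroke"}, {"hypertension"}]
freq = apriori(transactions, min_support=0.4)
for lhs, rhs, conf in rules(freq, min_confidence=0.9):
    print(lhs, "->", rhs, conf)
```

The repeated database pass inside the `while` loop is the cost that FP growth avoids by compressing the transactions into a tree after the first scan.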

PCA is a widely used data-mining method that aims to reduce data dimensionality in an interpretable way while retaining most of the information present in the data [ 93 , 94 ]. The main purpose of PCA is descriptive, as it requires no assumptions about data distribution and is, therefore, an adaptive and exploratory method. During the process of data analysis, the main steps of PCA include standardization of the original data, calculation of a correlation coefficient matrix, calculation of eigenvalues and eigenvectors, selection of principal components, and calculation of the comprehensive evaluation value. PCA seldom appears as a standalone method and is usually combined with other statistical methods [ 95 ]. In practical clinical studies, the existence of multicollinearity often biases the results of multivariate analysis. A feasible solution is to construct a regression model by PCA, which replaces the original independent variables with the principal components as new independent variables for regression analysis; this approach is most commonly seen in the analysis of dietary patterns in nutritional epidemiology [ 96 ]. In a study of socioeconomic status and child-developmental delays, PCA was used to derive a new variable (the household wealth index) from a series of household property reports and incorporate this new variable as the main analytical variable into the logistic regression model [ 97 ]. Additionally, PCA can be combined with cluster analysis. Burgel et al. [ 98 ] used PCA to transform clinical data to address the lack of independence between existing variables used to explore the heterogeneity of different subtypes of chronic obstructive pulmonary disease. Therefore, in the study of subtypes and heterogeneity of clinical diseases, PCA can eliminate noisy variables that can potentially corrupt the cluster structure, thereby increasing the accuracy of the results of clustering analysis [ 98 , 99 ].
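
The first steps listed above (standardization, correlation matrix, leading eigenvector) can be sketched without external libraries. The example assumes power iteration to obtain only the first principal component; a full analysis would use a complete eigendecomposition, and the data are hypothetical.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Standardise the columns, build the correlation matrix, and return
    (loading vector, share of total variance explained) for the first component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(d)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]
    corr = [[sum(zi[a] * zi[b] for zi in z) / (n - 1) for b in range(d)] for a in range(d)]

    # Power iteration; assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, eigenvalue / d                # the trace of a correlation matrix equals d

# Hypothetical toy data: the first two variables are strongly correlated.
rows = [(1, 2.1, 5), (2, 3.9, 3), (3, 6.2, 8), (4, 7.8, 1), (5, 10.1, 4)]
loadings, explained = first_principal_component(rows)
print([round(x, 2) for x in loadings], round(explained, 2))
```

In this toy data set, the two correlated variables receive similar loadings on the first component, which is how PCA collapses collinear predictors into a single new variable for subsequent regression or clustering.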

The data-mining process and examples of its application using common public databases

Open-access databases offer large volumes of data, wide data coverage, rich data information, and a cost-efficient means of research, making them beneficial to medical researchers. In this section, we introduce the data-mining process and methods and their application in research based on examples of utilizing public databases and data-mining algorithms.

The data-mining process

Figure  1 shows a series of research concepts. The data-mining process is divided into several steps: (1) database selection according to the research purpose; (2) data extraction and integration, including downloading the required data and combining data from multiple sources; (3) data cleaning and transformation, including removal of incorrect data, filling in missing data, generating new variables, converting data format, and ensuring data consistency; (4) data mining, involving extraction of implicit relational patterns through traditional statistics or ML; (5) pattern evaluation, which focuses on the validity parameters and values of the relationship patterns of the extracted data; and (6) assessment of the results, involving translation of the extracted data-relationship model into comprehensible knowledge made available to the public.

Figure 1. The steps of data mining in medical public database

Examples of data-mining applied using public databases

Establishment of warning models for the early prediction of disease.

A previous study identified sepsis as a major cause of death in ICU patients [ 100 ]. The authors noted that the predictive model developed previously used a limited number of variables, and that model performance required improvement. The data-mining process applied to address these issues was as follows: (1) data selection using the MIMIC III database; (2) extraction and integration of three types of data, including multivariate features (demographic information and clinical biochemical indicators), time series data (temperature, blood pressure, and heart rate), and clinical latent features (various scores related to disease); (3) data cleaning and transformation, including fixing irregular time series measurements, estimating missing values, deleting outliers, and addressing data imbalance; (4) data mining through the use of logistic regression, generation of a decision tree, application of the RF algorithm, an SVM, and an ensemble algorithm (a combination of multiple classifiers) to establish the prediction model; (5) pattern evaluation using sensitivity, precision, and the area under the receiver operating characteristic curve to evaluate model performance; and (6) evaluation of the results, in this case the potential to predict the prognosis of patients with sepsis and whether the model outperformed current scoring systems.
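
The three evaluation measures named in step (5) can be computed as in the sketch below. It assumes binary labels coded 1 (event) and 0 (no event), at least one case in each class, and the rank-based definition of the area under the curve; the scores and the 0.5 cut-off are hypothetical.

```python
def sensitivity_precision(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tp / (tp + fp)

def auc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    case scores higher than a random negative case (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(sensitivity_precision(y_true, y_pred))   # (0.75, 0.75)
print(auc(y_true, scores))                     # 0.875
```

Sensitivity and precision depend on the chosen cut-off, whereas the area under the curve summarizes discrimination across all cut-offs.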

Exploring prognostic risk factors in cancer patients

Wu et al. [ 101 ] noted that traditional survival-analysis methods often ignored the influence of competitive risk events, such as suicide and car accidents, on outcomes, leading to deviations and misjudgements in estimating the effect of risk factors. They used the SEER database, which offers cause-of-death data for cancer patients, and a competitive risk model to address this problem according to the following process: (1) data were obtained from the SEER database; (2) demographic characteristics, clinical characteristics, treatment modality, and cause of death of cecum cancer patients were extracted from the database; (3) patients with missing demographic, clinical, therapeutic, or cause-of-death variables were excluded; (4) Cox regression and two kinds of competitive risk models were applied for survival analysis; (5) the results were compared between the three models; and (6) the results revealed that for survival data with multiple endpoints, the competitive risk model was more favourable.

Derivation of dietary patterns

A study by Martínez Steele et al. [ 102 ] applied PCA for nutritional epidemiological analysis to determine dietary patterns and evaluate the overall nutritional quality of the population based on those patterns. Their process involved the following: (1) data were extracted from the NHANES database covering the years 2009–2010; (2) demographic characteristics and two 24 h dietary recall interviews were obtained; (3) data were weighted and excluded based on subjects not meeting specific criteria; (4) PCA was used to determine dietary patterns in the United States population, and Gaussian regression and restricted cubic splines were used to assess associations between ultra-processed foods and nutritional balance; (5) eigenvalues, scree plots, and the interpretability of the principal components were reviewed to screen and evaluate the results; and (6) the results revealed a negative association between ultra-processed food intake and overall dietary quality. Their findings indicated that a nutritionally balanced eating pattern was characterized by a diet high in fibre, potassium, magnesium, and vitamin C intake along with low sugar and saturated fat consumption.

The use of “big data” has changed multiple aspects of modern life, with its use combined with data-mining methods capable of improving the status quo [ 86 ]. The aim of this study was to aid clinical researchers in understanding the application of data-mining technology on clinical big data and public medical databases to further their research goals in order to benefit clinicians and patients. The examples provided offer insight into the data-mining process applied for the purposes of clinical research. Notably, researchers have raised concerns that big data and data-mining methods were not a perfect fit for adequately replicating actual clinical conditions, with the results potentially capable of misleading doctors and patients [ 86 ]. Therefore, given the rate at which new technologies and trends progress, it is necessary to maintain a positive attitude concerning their potential impact while remaining cautious in examining the results provided by their application.

In the future, the healthcare system will need to utilize increasingly larger volumes of big data with higher dimensionality. The tasks and objectives of data analysis will also have higher demands, including higher degrees of visualization, results with increased accuracy, and stronger real-time performance. As a result, the methods used to mine and process big data will continue to improve. Furthermore, to increase the formality and standardization of data-mining methods, it is possible that a new programming language specifically for this purpose will need to be developed, as well as novel methods capable of addressing unstructured data, such as graphics, audio, and text represented by handwriting. In terms of application, the development of data-management and disease-screening systems for large-scale populations, such as the military, will help determine the best interventions and formulation of auxiliary standards capable of benefitting both cost-efficiency and personnel. Data-mining technology can also be applied to hospital management in order to improve patient satisfaction, detect medical-insurance fraud and abuse, and reduce costs and losses while improving management efficiency. Currently, this technology is being applied for predicting patient disease, with further improvements resulting in the increased accuracy and speed of these predictions. Moreover, it is worth noting that technological development will concomitantly require higher quality data, which will be a prerequisite for accurate application of the technology.

Finally, the ultimate goal of this study was to explain the methods associated with data mining and commonly used to process clinical big data. This review will potentially promote further study and aid doctors and patients.

Abbreviations

BioLINCC: Biologic Specimen and Data Repositories Information Coordinating Center

CHARLS: China Health and Retirement Longitudinal Study

CHNS: China Health and Nutrition Survey

CKB: China Kadoorie Biobank

CS: Cause-specific risk

CTD: Comparative Toxicogenomics Database

eICU-CRD: eICU Collaborative Research Database

FP: Frequent pattern

GBD: Global burden of disease

GEO: Gene expression omnibus

HRS: Health and Retirement Study

ICGC: International Cancer Genome Consortium

MIMIC: Medical Information Mart for Intensive Care

ML: Machine learning

NHANES: National Health and Nutrition Examination Survey

PCA: Principal component analysis

PIC: Paediatric intensive care

RF: Random forest

SEER: Surveillance, epidemiology, and end results

SVM: Support vector machine

TCGA: The Cancer Genome Atlas

Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):1–35.

Wang F, Zhang P, Wang X, Hu J. Clinical risk prediction by exploring high-order feature correlations. AMIA Annu Symp Proc. 2014;2014:1170–9.

Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinform. 2014;15:105. https://doi.org/10.1186/1471-2105-15-105 .

Ramachandran S, Erraguntla M, Mayer R, Benjamin P, Editors. Data mining in military health systems-clinical and administrative applications. In: 2007 IEEE international conference on automation science and engineering; 2007. https://doi.org/10.1109/COASE.2007.4341764 .

Vie LL, Scheier LM, Lester PB, Ho TE, Labarthe DR, Seligman MEP. The US army person-event data environment: a military-civilian big data enterprise. Big Data. 2015;3(2):67–79. https://doi.org/10.1089/big.2014.0055 .

Mohan A, Blough DM, Kurc T, Post A, Saltz J. Detection of conflicts and inconsistencies in taxonomy-based authorization policies. IEEE Int Conf Bioinform Biomed. 2012;2011:590–4. https://doi.org/10.1109/BIBM.2011.79 .

Luo J, Wu M, Gopukumar D, Zhao Y. Big data application in biomedical research and health care: a literature review. Biomed Inform Insights. 2016;8:1–10. https://doi.org/10.4137/BII.S31559 .

Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.

Sahu H, Shrma S, Gondhalakar S. A brief overview on data mining survey. Int J Comput Technol Electron Eng. 2011;1(3):114–21.

Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.

Doll KM, Rademaker A, Sosa JA. Practical guide to surgical data sets: surveillance, epidemiology, and end results (SEER) database. JAMA Surg. 2018;153(6):588–9.

Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. https://doi.org/10.1038/sdata.2016.35 .

Ahluwalia N, Dwyer J, Terry A, Moshfegh A, Johnson C. Update on NHANES dietary data: focus on collection, release, analytical considerations, and uses to inform public policy. Adv Nutr. 2016;7(1):121–34.

Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1204–22. https://doi.org/10.1016/S0140-6736(20)30925-9 .

Palmer LJ. UK Biobank: Bank on it. Lancet. 2007;369(9578):1980–2. https://doi.org/10.1016/S0140-6736(07)60924-6 .

Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20. https://doi.org/10.1038/ng.2764 .

Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23(14):1846–7.

Zhang J, Bajari R, Andric D, Gerthoffert F, Lepsa A, Nahal-Bose H, et al. The international cancer genome consortium data portal. Nat Biotechnol. 2019;37(4):367–9.

Chen Z, Chen J, Collins R, Guo Y, Peto R, Wu F, et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int J Epidemiol. 2011;40(6):1652–66.

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, et al. The comparative toxicogenomics database: update 2019. Nucleic Acids Res. 2019;47(D1):D948–54. https://doi.org/10.1093/nar/gky868 .

Zeng X, Yu G, Lu Y, Tan L, Wu X, Shi S, et al. PIC, a paediatric-specific intensive care database. Sci Data. 2020;7(1):14.

Giffen CA, Carroll LE, Adams JT, Brennan SP, Coady SA, Wagner EL. Providing contemporary access to historical biospecimen collections: development of the NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). Biopreserv Biobank. 2015;13(4):271–9.

Zhang B, Zhai FY, Du SF, Popkin BM. The China Health and Nutrition Survey, 1989–2011. Obes Rev. 2014;15(Suppl 1):2–7. https://doi.org/10.1111/obr.12119 .

Zhao Y, Hu Y, Smith JP, Strauss J, Yang G. Cohort profile: the China Health and Retirement Longitudinal Study (CHARLS). Int J Epidemiol. 2014;43(1):61–8.

Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU collaborative research database, a freely available multi-centre database for critical care research. Sci Data. 2018;5:180178. https://doi.org/10.1038/sdata.2018.178 .

Fisher GG, Ryan LH. Overview of the health and retirement study and introduction to the special issue. Work Aging Retire. 2018;4(1):1–9.

Iavindrasana J, Cohen G, Depeursinge A, Müller H, Meyer R, Geissbuhler A. Clinical data mining: a review. Yearb Med Inform. 2009:121–33.

Zhang Y, Guo SL, Han LN, Li TL. Application and exploration of big data mining in clinical medicine. Chin Med J. 2016;129(6):731–8. https://doi.org/10.4103/0366-6999.178019 .

Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019;20(5):e262–73.

Huang C, Murugiah K, Mahajan S, Li S-X, Dhruva SS, Haimovich JS, et al. Enhancing the prediction of acute kidney injury risk after percutaneous coronary intervention using machine learning techniques: a retrospective cohort study. PLoS Med. 2018;15(11):e1002703.

Rahimian F, Salimi-Khorshidi G, Payberah AH, Tran J, Ayala Solares R, Raimondi F, et al. Predicting the risk of emergency admission with machine learning: development and validation using linked electronic health records. PLoS Med. 2018;15(11):e1002695.

Kantardzic M. Data Mining: concepts, models, methods, and algorithms. Technometrics. 2003;45(3):277.

Jothi N, Husain W. Data mining in healthcare—a review. Procedia Comput Sci. 2015;72:306–13.

Piatetsky-Shapiro G, Tamayo P. Microarray data mining: facing the challenges. SIGKDD. 2003;5(2):1–5. https://doi.org/10.1145/980972.980974 .

Ripley BD. Pattern recognition and neural networks. Cambridge: Cambridge University Press; 1996.

Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4:40–79. https://doi.org/10.1214/09-SS054 .

Shouval R, Bondi O, Mishan H, Shimoni A, Unger R, Nagler A. Application of machine learning algorithms for clinical predictive modelling: a data-mining approach in SCT. Bone Marrow Transp. 2014;49(3):332–7.

Momenyan S, Baghestani AR, Momenyan N, Naseri P, Akbari ME. Survival prediction of patients with breast cancer: comparisons of decision tree and logistic regression analysis. Int J Cancer Manag. 2018;11(7):e9176.

Topaloğlu M, Malkoç G. Decision tree application for renal calculi diagnosis. Int J Appl Math Electron Comput. 2016. https://doi.org/10.18100/ijamec.281134.

Li H, Wu TT, Yang DL, Guo YS, Liu PC, Chen Y, et al. Decision tree model for predicting in-hospital cardiac arrest among patients admitted with acute coronary syndrome. Clin Cardiol. 2019;42(11):1087–93.

Ramezankhani A, Hadavandi E, Pournik O, Shahrabi J, Azizi F, Hadaegh F. Decision tree-based modelling for identification of potential interactions between type 2 diabetes risk factors: a decade follow-up in a Middle East prospective cohort study. BMJ Open. 2016;6(12):e013336.

Carmona-Bayonas A, Jiménez-Fonseca P, Font C, Fenoy F, Otero R, Beato C, et al. Predicting serious complications in patients with cancer and pulmonary embolism using decision tree modelling: the EPIPHANY Index. Br J Cancer. 2017;116(8):994–1001.

Efron B. Bootstrap methods: another look at the jackknife. In: Kotz S, Johnson NL, editors. Breakthroughs in statistics. New York: Springer; 1992. p. 569–93.

Chapter   Google Scholar  

Breima L. Random forests. Mach Learn. 2010;1(45):5–32. https://doi.org/10.1023/A:1010933404324 .

Franklin J. The elements of statistical learning: data mining, inference and prediction. Math Intell. 2005;27(2):83–5.

Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of in-hospital mortality in emergency department patients with sepsis: a local big data-driven, machine learning approach. Acad Emerg Med. 2016;23(3):269–78.

Lee J, Scott DJ, Villarroel M, Clifford GD, Saeed M, Mark RG. Open-access MIMIC-II database for intensive care research. Annu Int Conf IEEE Eng Med Biol Soc. 2011:8315–8. https://doi.org/10.1109/IEMBS.2011.6092050 .

Lee J. Patient-specific predictive modelling using random forests: an observational study for the critically Ill. JMIR Med Inform. 2017;5(1):e3.

Wongvibulsin S, Wu KC, Zeger SL. Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis. BMC Med Res Methodol. 2019;20(1):1.

Taylor JMG. Random survival forests. J Thorac Oncol. 2011;6(12):1974–5.

Hu C, Steingrimsson JA. Personalized risk prediction in clinical oncology research: applications and practical issues using survival trees and random forests. J Biopharm Stat. 2018;28(2):333–49.

Dietrich R, Opper M, Sompolinsky H. Statistical mechanics of support vector networks. Phys Rev Lett. 1999;82(14):2975.

Verplancke T, Van Looy S, Benoit D, Vansteelandt S, Depuydt P, De Turck F, et al. Support vector machine versus logistic regression modelling for prediction of hospital mortality in critically ill patients with haematological malignancies. BMC Med Inform Decis Mak. 2008;8:56. https://doi.org/10.1186/1472-6947-8-56 .

Yu W, Liu T, Valdez R, Gwinn M, Khoury MJ. Application of support vector machine modelling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Med Inform Decis Mak. 2010;10:16. https://doi.org/10.1186/1472-6947-10-16 .

Son YJ, Kim HG, Kim EH, Choi S, Lee SK. Application of support vector machine for prediction of medication adherence in heart failure patients. Healthc Inform Res. 2010;16(4):253–9.

Schadt EE, Friend SH, Shaywitz DA. A network view of disease and compound screening. Nat Rev Drug Discov. 2009;8(4):286–95.

Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601–9.

Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Stat Med. 2007;26(11):2389–430. https://doi.org/10.1002/sim.2712 .

Klein JP. Competing risks. WIREs Comp Stat. 2010;2(3):333–9. https://doi.org/10.1002/wics.83 .

Haller B, Schmidt G, Ulm K. Applying competing risks regression models: an overview. Lifetime Data Anal. 2013;19(1):33–58. https://doi.org/10.1007/s10985-012-9230-8 .

Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94(446):496–509.

Koller MT, Raatz H, Steyerberg EW, Wolbers M. Competing risks and the clinical community: irrelevance or ignorance? Stat Med. 2012;31(11–12):1089–97.

Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol. 2009;170(2):244–56.

Yang J, Li Y, Liu Q, Li L, Feng A, Wang T, et al. Brief introduction of medical database and data mining technology in big data era. J Evid Based Med. 2020;13(1):57–69.

Yu Z, Yang J, Gao L, Huang Q, Zi H, Li X. A competing risk analysis study of prognosis in patients with esophageal carcinoma 2006–2015 using data from the surveillance, epidemiology, and end results (SEER) database. Med Sci Monit. 2020;26:e918686.

Yang J, Pan Z, He Y, Zhao F, Feng X, Liu Q, et al. Competing-risks model for predicting the prognosis of penile cancer based on the SEER database. Cancer Med. 2019;8(18):7881–9.

Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform. 2018;19(6):1236–46.

Alashwal H, El Halaby M, Crouse JJ, Abdalla A, Moustafa AA. The application of unsupervised clustering methods to Alzheimer’s disease. Front Comput Neurosci. 2019;13:31.

Macqueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA: University of California Press;1967.

Forgy EW. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics. 1965;21:768–9.

Johnson SC. Hierarchical clustering schemes. Psychometrika. 1967;32(3):241–54.

Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Rec. 1996;25(2):103–14.

Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Rec. 1998;27(2):73–84.

Guha S, Rastogi R, Shim K. ROCK: a robust clustering algorithm for categorical attributes. Inf Syst. 2000;25(5):345–66.

Xu D, Tian Y. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Kriegel HP, Kröger P, Sander J, Zimek A. Density-based clustering. WIRES Data Min Knowl. 2011;1(3):231–40. https://doi.org/10.1002/widm.30 .

Ester M, Kriegel HP, Sander J, Xu X, editors. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of 2nd international conference on knowledge discovery and data mining Portland, Oregon: AAAI Press; 1996. p. 226–31.

Wang W, Yang J, Muntz RR. STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd international conference on very large data bases, Morgan Kaufmann Publishers Inc.; 1997. p. 186–95.

Iwashyna TJ, Burke JF, Sussman JB, Prescott HC, Hayward RA, Angus DC. Implications of heterogeneity of treatment effect for reporting and analysis of randomized trials in critical care. Am J Respir Crit Care Med. 2015;192(9):1045–51.

Ruan S, Lin H, Huang C, Kuo P, Wu H, Yu C. Exploring the heterogeneity of effects of corticosteroids on acute respiratory distress syndrome: a systematic review and meta-analysis. Crit Care. 2014;18(2):R63.

Docampo E, Collado A, Escaramís G, Carbonell J, Rivera J, Vidal J, et al. Cluster analysis of clinical data identifies fibromyalgia subgroups. PLoS ONE. 2013;8(9):e74873.

Sutherland ER, Goleva E, King TS, Lehman E, Stevens AD, Jackson LP, et al. Cluster analysis of obesity and asthma phenotypes. PLoS ONE. 2012;7(5):e36631.

Guo Q, Lu X, Gao Y, Zhang J, Yan B, Su D, et al. Cluster analysis: a new approach for identification of underlying risk factors for coronary artery disease in essential hypertensive patients. Sci Rep. 2017;7:43965.

Hastings S, Oster S, Langella S, Kurc TM, Pan T, Catalyurek UV, et al. A grid-based image archival and analysis system. J Am Med Inform Assoc. 2005;12(3):286–95.

Celebi ME, Aslandogan YA, Bergstresser PR. Mining biomedical images with density-based clustering. In: International conference on information technology: coding and computing (ITCC’05), vol II. Washington, DC, USA: IEEE; 2005. https://doi.org/10.1109/ITCC.2005.196 .

Agrawal R, Imieliński T, Swami A, editors. Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD conference on management of data. Washington, DC, USA: Association for Computing Machinery; 1993. p. 207–16. https://doi.org/10.1145/170035.170072 .

Sethi A, Mahajan P. Association rule mining: A review. TIJCSA. 2012;1(9):72–83.

Kotsiantis S, Kanellopoulos D. Association rules mining: a recent overview. GESTS Int Trans Comput Sci Eng. 2006;32(1):71–82.

Narvekar M, Syed SF. An optimized algorithm for association rule mining using FP tree. Procedia Computer Sci. 2015;45:101–10.

Verhein F. Frequent pattern growth (FP-growth) algorithm. Sydney: The University of Sydney; 2008. p. 1–16.

Li Q, Zhang Y, Kang H, Xin Y, Shi C. Mining association rules between stroke risk factors based on the Apriori algorithm. Technol Health Care. 2017;25(S1):197–205.

Guo A, Zhang W, Xu S. Exploring the treatment effect in diabetes patients using association rule mining. Int J Inf Pro Manage. 2016;7(3):1–9.

Pearson K. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417.

Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374(2065):20150202.

Zhang Z, Castelló A. Principal components analysis in clinical studies. Ann Transl Med. 2017;5(17):351.

Apio BRS, Mawa R, Lawoko S, Sharma KN. Socio-economic inequality in stunting among children aged 6–59 months in a Ugandan population based cross-sectional study. Am J Pediatri. 2019;5(3):125–32.

Burgel PR, Paillasseur JL, Caillaud D, Tillie-Leblond I, Chanez P, Escamilla R, et al. Clinical COPD phenotypes: a novel approach using principal component and cluster analyses. Eur Respir J. 2010;36(3):531–9.

Vogt W, Nagel D. Cluster analysis in diagnosis. Clin Chem. 1992;38(2):182–98.

Layeghian Javan S, Sepehri MM, Layeghian Javan M, Khatibi T. An intelligent warning model for early prediction of cardiac arrest in sepsis patients. Comput Methods Programs Biomed. 2019;178:47–58. https://doi.org/10.1016/j.cmpb.2019.06.010 .

Wu W, Yang J, Li D, Huang Q, Zhao F, Feng X, et al. Competitive risk analysis of prognosis in patients with cecum cancer: a population-based study. Cancer Control. 2021;28:1073274821989316. https://doi.org/10.1177/1073274821989316 .

Martínez Steele E, Popkin BM, Swinburn B, Monteiro CA. The share of ultra-processed foods and the overall nutritional quality of diets in the US: evidence from a nationally representative cross-sectional study. Popul Health Metr. 2017;15(1):6.

Download references

This study was supported by the National Social Science Foundation of China (No. 16BGL183).

Author information

Wen-Tao Wu and Yuan-Jie Li have contributed equally to this work

Authors and Affiliations

Department of Clinical Research, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China

Wen-Tao Wu, Ao-Zi Feng, Li Li, Tao Huang & Jun Lyu

School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Yuan-Jie Li

Department of Neurology, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China


Contributions

WTW, YJL and JL designed the review. JL, AZF, TH, LL and ADX reviewed and criticized the original paper. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to An-Ding Xu or Jun Lyu.

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Wu, WT., Li, YJ., Feng, AZ. et al. Data mining in clinical big data: the frequently used databases, steps, and methodological models. Military Med Res 8 , 44 (2021). https://doi.org/10.1186/s40779-021-00338-z

Received : 24 January 2020

Accepted : 03 August 2021

Published : 11 August 2021

DOI : https://doi.org/10.1186/s40779-021-00338-z

  • Clinical big data
  • Data mining
  • Medical public database

Military Medical Research

ISSN: 2054-9369

  • Frontiers in Computational Neuroscience
  • Research Topics

Medical Data Mining and Medical Intelligence Services

About this Research Topic

In the age of digital healthcare, the confluence of data science, artificial intelligence, and healthcare services has ushered in a new era of medical discovery and patient care. The sheer volume and complexity of medical data generated daily presents both a challenge and an extraordinary opportunity. This ...

Keywords : Medical Data, Machine Learning, Artificial Intelligence, Digital Healthcare

Important Note : All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.

About Frontiers Research Topics

With their unique mixes of varied contributions from Original Research to Review Articles, Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author.

16 Data Mining Projects Ideas & Topics For Beginners [2024]

Introduction

A career in Data Science necessitates hands-on experience, and what better way to obtain it than by working on real-world data mining projects? This post provides a wide range of data mining project ideas for beginners. Whether you’re looking at data mining in database management systems, data mining projects in Java, or creative data mining project ideas, this list has you covered.

Today, data mining has become strategically important to organizations across industries. It helps not only in predicting outcomes and trends but also in removing bottlenecks and improving existing processes. Data mining research topics were already a popular search among millions of users in 2020, and the trend looks set to continue in 2024 and beyond. So, if you are a beginner, the best thing you can do is work on some real-world data mining projects.

If you are just getting started in data science, making sense of advanced data mining techniques can seem daunting. To complement the plethora of data mining research topics available online, we have compiled some useful data mining project topics to support you in your learning journey.

At upGrad, we believe in a practical approach: theoretical knowledge alone won’t help in a real-world work environment unless you work on data mining projects yourself. In this article, we explore 16 fun and exciting data mining projects and research topics that beginners can work on to put their data mining knowledge to the test.

But first, let’s address an important and frequently asked question: why build data mining projects?

But before we begin, let us look at an example to decode what data mining is all about. Suppose you have a data set containing login logs of a web application. It can include things like the username, login timestamp, activities performed, time spent on the site before logging out, etc.

Such raw data would not serve any purpose by itself unless it is organized systematically and analyzed to extract relevant information for the business. By applying the different techniques of data mining, you can discover user habits, preferences, peak usage timings, etc. These insights can further increase the software system’s efficiency and boost its user-friendliness.
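To make the login-log example concrete, here is a minimal Python sketch of two such insights: the peak login hour and the average session length per user. The records, field layout, and function names are made up for illustration; a real application log will look different.

```python
from collections import Counter
from datetime import datetime

# Hypothetical login records: (username, login timestamp, minutes on site).
LOGS = [
    ("alice", "2024-01-08 09:15", 30),
    ("bob", "2024-01-08 09:40", 12),
    ("alice", "2024-01-08 20:05", 45),
    ("carol", "2024-01-09 09:02", 20),
    ("bob", "2024-01-09 14:30", 5),
]


def peak_login_hours(logs, top=1):
    """Return the `top` most common login hours as (hour, count) pairs."""
    hours = Counter(
        datetime.strptime(timestamp, "%Y-%m-%d %H:%M").hour
        for _, timestamp, _ in logs
    )
    return hours.most_common(top)


def average_session_minutes(logs):
    """Return the mean time spent on the site for each user."""
    totals, counts = Counter(), Counter()
    for user, _, minutes in logs:
        totals[user] += minutes
        counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}


print(peak_login_hours(LOGS))         # busiest login hour and its count
print(average_session_minutes(LOGS))  # mean session length per user
```

Even this tiny aggregation shows the idea: the raw rows say little on their own, while the grouped counts point to when the system is busiest and who stays longest.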

In today’s digital era, the computing processes of collecting, cleaning, analyzing, and interpreting data make up an integral part of business strategies. So, data scientists are required to have adequate knowledge of methods like pattern tracking, classification, cluster analysis, prediction, neural networks, etc. The more you experiment with different data mining projects, the more knowledge you gain.

Data Mining Project Ideas & Topics for Beginners

This list of data mining projects for students is suited for beginners, and those just starting out with Data Science in general. These data mining projects will get you going with all the practicalities you need to succeed in your career.

Further, if you’re looking for a data mining project for your final year, this list should get you going. So, without further ado, let’s jump straight into some data mining projects that will strengthen your base and allow you to climb up the ladder.

1. iBCM: interesting Behavioral Constraint Miner

One of the best ways to start experimenting with hands-on data mining projects for students is to work on iBCM. A sequence classification problem deals with the prediction of sequential patterns in data sets. It discovers the underlying order in the database based on specific labels. In doing so, it applies the simple mathematical tool of partial orders. However, you would require a better representation to achieve more accurate, concise, and scalable classification. A sequence classification technique with a behavioral constraint template can address this need.

With the iBCM project, you can delve into the field of sequence categorization. Using behavioral constraint templates, this venture predicts sequential patterns inside datasets. This method employs mathematical tools such as partial orders to reveal underlying data patterns in an accurate and simple manner. Beyond traditional sequence mining, iBCM finds a wide range of patterns, making it a good starting point for inexperienced data miners.

The interesting Behavioral Constraint Miner (iBCM) project can express a variety of patterns over a sequence, such as simple occurrence, looping, and position-based behavior. It can also mine negative information, i.e., the absence of a particular behavior. So, the iBCM approach goes much beyond the typical sequence mining representations and is a perfect starting point for those looking for data mining projects for students.

2. GERF: Group Event Recommendation Framework

This is one of the simpler data mining projects, yet an exciting one. It is an intelligent solution for recommending social events, such as exhibitions, book launches, concerts, etc. Most research focuses on suggesting upcoming attractions to individuals, so the Group Event Recommendation Framework (GERF) was developed to propose events to a group of users.

GERF addresses group social event recommendations by utilizing learning-to-rank algorithms for reliable choices. This project provides efficient event recommendations for a varied user population by extracting group preferences and environmental impacts, with applications ranging from exhibitions to travel services.

This model uses a learning-to-rank algorithm to extract group preferences and can incorporate additional contextual influences with ease, accuracy, and time-efficiency.

Learning to rank, also known as machine-learned ranking (MLR), is the process of building ranking models for systems needing information retrieval using machine learning techniques such as supervised learning, semi-supervised learning, and reinforcement learning.

The objects used for training are organized into lists, with a partial order specified between the items in each list. In most cases, a numerical or ordinal score is assigned to each item, or a binary judgment is made (such as “relevant”, encoded as 1, or “not relevant”, encoded as 0).

The objective of the ranking model is to rank new, unseen lists in a way that is consistent with the rankings in the training data.

Also, it can be conveniently applied to other group recommendation scenarios like location-based travel services. 
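As a rough illustration of learning to rank, the sketch below trains a linear scoring function with a pairwise perceptron-style update: whenever a more relevant item does not score higher than a less relevant one, the weights are nudged toward the difference of their feature vectors. This is a generic textbook-style technique, not the algorithm GERF actually uses, and the two features (group interest match, travel effort) and the relevance scores are invented for the example.

```python
def score(weights, features):
    """Linear score of one item."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(items, epochs=20, learning_rate=0.1):
    """items: list of (feature_vector, relevance_score) pairs.

    For every pair where item i is more relevant than item j, require
    score(i) > score(j); on a violation, move the weights toward the
    feature difference (a perceptron-style pairwise update).
    """
    weights = [0.0] * len(items[0][0])
    for _ in range(epochs):
        for features_i, relevance_i in items:
            for features_j, relevance_j in items:
                if relevance_i <= relevance_j:
                    continue
                diff = [a - b for a, b in zip(features_i, features_j)]
                if score(weights, diff) <= 0:
                    weights = [w + learning_rate * d for w, d in zip(weights, diff)]
    return weights


def rank(weights, candidates):
    """Order unseen candidates from highest to lowest predicted score."""
    return sorted(candidates, key=lambda f: score(weights, f), reverse=True)


# Hypothetical training events: [group interest match, travel effort], relevance.
TRAINING = [([0.9, 0.1], 2), ([0.5, 0.5], 1), ([0.1, 0.9], 0)]
weights = train_pairwise_ranker(TRAINING)
print(rank(weights, [[0.2, 0.8], [0.8, 0.2]]))
```

The trained weights reward interest match and penalize travel effort, so the unseen event with the higher match is ranked first.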

3. Efficient similarity search for dynamic data streams

Online applications use similarity search systems for tasks like pattern recognition, recommendations, plagiarism detection, etc. Typically, the algorithm answers nearest-neighbor queries with the Locality-Sensitive Hashing (LSH) approach, a method related to min-hashing. It can be implemented in several computational models for large data sets, including the MapReduce architecture and streaming.

For a variety of functions, online apps rely on similarity search engines. This research focuses on effective similarity search strategies for dynamic data streams, with a special emphasis on scalability in huge datasets. Its novel features, such as the use of the Jaccard index as a similarity measure and estimating techniques based on sketching, improve accuracy in pattern recognition and recommendation tasks.

Dynamic data streams, however, require scalable LSH-based filtering and design. To this end, the efficient similarity search project outperforms previous algorithms. Here are some of its main features:

  • Relies on the Jaccard index as a similarity measure
  • Suggests a nearest-neighbor data structure feasible for dynamic data streams
  • Proposes a sketching algorithm for similarity estimation 
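The features above can be illustrated with a small Python sketch that computes the exact Jaccard index and then estimates it from MinHash signatures, a classic sketching technique. This is a generic static MinHash, not the project's dynamic-stream data structure, and the seeded BLAKE2 hashing scheme and function names are assumptions made for the example.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """Seeded 64-bit hash; each seed acts as one independent hash function."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=256):
    """Sketch a non-empty set as the minimum hash value under each function."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(signature_a, signature_b):
    """The fraction of matching signature positions estimates the Jaccard index."""
    matches = sum(1 for x, y in zip(signature_a, signature_b) if x == y)
    return matches / len(signature_a)


doc_a = {"data", "mining", "stream", "hash"}
doc_b = {"data", "mining", "graph", "hash"}
print(jaccard(doc_a, doc_b))
print(estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The signatures have a fixed length regardless of set size, which is what makes sketch-based similarity estimation attractive for large or streaming data.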

4. Frequent pattern mining on uncertain graphs

Application domains like bioinformatics, social networks, and privacy enforcement often encounter uncertainty due to the presence of interrelated, real-life data archives. This uncertainty permeates the graph data as well.

Frequent pattern mining on uncertain graphs is critical in settings involving uncertain data, such as bioinformatics and social networks. This project addresses the issue of transitive interactions in uncertain graph data. It manages real-world data archives efficiently by combining enumeration-evaluation methods with approximation techniques.

This problem calls for innovative data mining techniques that can capture the transitive interactions between graph nodes. One such technique is frequent subgraph and pattern mining on a single uncertain graph. This beginner-level data mining project will also help you build a strong foundation in fundamental programming concepts. The solution consists of the following components:

  • An enumeration-evaluation algorithm to support computation under probabilistic semantics
  • An approximation algorithm to enable efficient problem-solving
  • Computation sharing techniques to drive mining performance
  • Integration of check-point based and pruning approaches to extend the algorithm to expected semantics
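To get a feel for why both exact evaluation and approximation appear in the list above, consider a toy uncertain graph in which every edge exists independently with some probability. The sketch below computes the probability that a simple pattern (two edges sharing a node) occurs, first by enumerating every possible world and then by random sampling. The graph, the pattern, and the independence assumption are all illustrative; this is not the project's actual algorithm.

```python
import itertools
import random

# Hypothetical uncertain graph: edge -> probability that the edge exists.
EDGES = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.8}


def contains_two_edge_path(present_edges):
    """Pattern check: do any two present edges share a node?"""
    return any(
        set(first) & set(second)
        for first, second in itertools.combinations(present_edges, 2)
    )


def exact_probability(edges, pattern):
    """Enumerate all 2^n possible worlds and sum the mass of the matching ones."""
    items = list(edges.items())
    total = 0.0
    for mask in itertools.product([False, True], repeat=len(items)):
        world_probability = 1.0
        present = []
        for (edge, p), keep in zip(items, mask):
            world_probability *= p if keep else 1.0 - p
            if keep:
                present.append(edge)
        if pattern(present):
            total += world_probability
    return total


def sampled_probability(edges, pattern, samples=10_000, seed=0):
    """Approximate the same quantity by sampling possible worlds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        present = [edge for edge, p in edges.items() if rng.random() < p]
        hits += pattern(present)
    return hits / samples


print(exact_probability(EDGES, contains_two_edge_path))
print(sampled_probability(EDGES, contains_two_edge_path))
```

Exact enumeration doubles in cost with every added edge, so it is only feasible on tiny graphs; sampling trades a small error for a cost that grows with the number of samples instead.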

5. Cleaning data with forbidden itemsets or FBIs

Data cleaning methods typically involve detecting data errors and systematically repairing them by specifying constraints (illegal values, domain restrictions, logical rules, etc.).

Data cleansing thus frequently entails defining constraints to correct inaccuracies. This project introduces a repair method based on forbidden itemsets (FBIs), which discovers constraints in dirty data automatically and improves the precision of error detection, a critical capability in big data scenarios.

In the real-life big data universe, we are inundated with dirty data that comes without any known constraints. In such a scenario, an algorithm automatically discovers constraints on the dirty data and then uses them to identify and repair errors. But when the discovery algorithm is run again on the repaired data, it may find new constraint violations introduced by the repairs, so the data is still erroneous. This is one of the excellent data mining projects for beginners.

Hence, a repairing method based on forbidden itemsets (FBIs) was devised to record unlikely co-occurrences of values and detect errors with more precision. And empirical evaluations establish the credibility and reliability of this mechanism. 
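One simple way to picture an "unlikely co-occurrence" is through lift: how often two values appear together compared with what independence would predict. The sketch below flags value pairs whose lift falls below a threshold. Using lift as the likelihood measure, the 0.5 threshold, and the sample rows are assumptions for illustration, and only pairs of values are considered rather than larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical dirty table; the last row holds a suspicious combination.
ROWS = (
    [{"city": "Paris", "country": "France"}] * 4
    + [{"city": "Rome", "country": "Italy"}] * 4
    + [{"city": "Paris", "country": "Italy"}]
)


def pair_lifts(rows):
    """Lift of each co-occurring value pair: P(x, y) / (P(x) * P(y))."""
    n = len(rows)
    single, pair = Counter(), Counter()
    for row in rows:
        items = sorted(row.items())
        single.update(items)
        pair.update(combinations(items, 2))
    return {
        (x, y): (count / n) / ((single[x] / n) * (single[y] / n))
        for (x, y), count in pair.items()
    }


def forbidden_pairs(rows, max_lift=0.5):
    """Value pairs that occur together far less often than expected."""
    return [p for p, lift in pair_lifts(rows).items() if lift < max_lift]


print(forbidden_pairs(ROWS))
```

Here the Paris/Italy combination has a lift of 0.36, well below the pairs that match, so it is the only one flagged as a likely error.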

6. Protecting user data in profile-matching social networks

This is one of the more practical data mining projects and is likely to remain useful in the future. Consider the user profile database maintained by the providers of social networking services, such as online dating sites. The querying users specify certain criteria based on which their profiles are matched with those of other users. This process has to be secure enough to protect against any kind of data breach. Some solutions on the market today use homomorphic encryption and multiple servers to match user profiles while preserving user privacy.

7. PrivRank for social media

Social media sites mine their users’ preferences from their online activities to offer personalized recommendations. However, user activity data contains information that can be used to infer private details about an individual (for example, gender, age, etc.), and any leak or release of such user-specified data can increase the risk of inference attacks.

8. Practical PEKS scheme over encrypted email in cloud server

In light of recent high-profile public events related to email leaks, the security of such sensitive messages has emerged as a primary concern for users worldwide. To that end, Public-key Encryption with Keyword Search (PEKS) technology offers a viable solution. This is one of the useful data mining projects, as it combines security protection with efficient search functionality.

When searching over a sizable encrypted email database in a cloud server, we would want the email receivers to perform quick multi-keyword and boolean searches without revealing additional information to the server.

9. Sentiment analysis and opinion mining for mobile networks

This project concerns post-publishing applications where a registered user can share text posts or images and also leave comments on posts. Under the prevailing system, users have to go through all the comments manually to filter out verified comments, positive comments, negative remarks, and so on.

With the sentiment analysis and opinion mining system, users can check the status of their post without dedicating much time and effort. It provides an opinion on the comments made on a post and also gives the option to view a graph. 
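A very small lexicon-based classifier is enough to sketch the idea: count positive and negative words in each comment, label the comment, and tally the labels for the post. The word lists and function names below are made up for the example; a real system would use a trained model or a much larger lexicon.

```python
import re
from collections import Counter

# Tiny hypothetical sentiment lexicons.
POSITIVE = {"great", "love", "nice", "awesome", "good"}
NEGATIVE = {"bad", "hate", "awful", "boring", "poor"}


def classify_comment(text):
    """Label a comment by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(word in POSITIVE for word in words) - sum(
        word in NEGATIVE for word in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize_post(comments):
    """Count how many comments on a post fall into each sentiment class."""
    return Counter(classify_comment(comment) for comment in comments)


comments = ["Love this, great shot!", "So boring", "Where was this taken?"]
print(summarize_post(comments))
```

The resulting counts are exactly the kind of data the graph view mentioned above could plot for each post.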

10. Mining the k most frequent negative patterns via learning

In behavior informatics, negative sequential patterns (NSPs) can be more revealing than positive sequential patterns (PSPs). For instance, in a disease or illness-related study, data on missing a medical treatment can be more useful than data on attending a medical procedure. To this day, however, NSP mining is still at a nascent stage. The ‘Topk-NSP+’ algorithm presents a reliable solution for overcoming the obstacles in the current mining landscape. This is one of the trending data mining projects, and the algorithm it proposes works as follows:

  • Mining the top-k PSPs with the existing method
  • Mining the top-k NSPs from these PSPs by using an idea similar to top-k PSP mining
  • Employing three optimization strategies to select useful NSPs and reduce computational costs

11. Automated personality classification project

The automatic system analyzes the characteristics and behaviors of participants. And after observing the past patterns of data classification, it predicts a personality type and stores its own patterns in a dataset. This project idea can be summarized as follows:

  • Store personality-related data in a database
  • Collect associated characteristics for each user
  • Extract relevant features from the text entered by the participant
  • Examine and display the personality traits 
  • Interlink personality and user behavior (There can be varying degrees of behavior for a particular personality type)

Such models are commonplace in career guidance services where a student’s personality is matched with suitable career paths. This can be an interesting and useful data mining project.

12. Social-Aware social influence modeling

This is one of the most popular data mining mini projects. This project deals with big social data and leverages deep learning for sequential modeling of user interests. The stepwise process is described below:

  • A preliminary analysis of two real datasets (Yelp and Epinions)
  • Discovery of statistically sequential actions of users and their social circles, including temporal autocorrelation and social influence on decision-making
  • Presentation of a novel deep learning model called Social-Aware Long Short-Term Memory (SA-LSTM), which can predict the type of items or Points of Interest that a particular user will buy or visit next. Long short-term memory (LSTM) is a kind of neural network used in deep learning and artificial intelligence. In contrast to traditional feedforward neural networks, LSTM networks have feedback connections, which let them carry an internal state from one step of a sequence to the next. LSTM is a kind of recurrent neural network (RNN), capable of processing not just individual data points but also complete data sequences.
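To show what carrying state through a sequence means in an LSTM, here is a sketch of a single-unit LSTM step in plain Python, using the standard forget, input, and output gates plus a candidate value. It is a generic LSTM cell with scalar state, not the SA-LSTM model from the project, and the parameter values are arbitrary rather than learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM (scalar input, hidden and cell state).

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # updated cell state (the "memory")
    h = o * math.tanh(c)               # updated hidden state (the output)
    return h, c


def run_sequence(xs, params):
    """Feed a whole sequence through the cell, carrying (h, c) between steps."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, params)
    return h


# Arbitrary illustrative parameters (a trained model would learn these).
PARAMS = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, -0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}
print(run_sequence([1.0, 0.0, -1.0], PARAMS))
```

The feedback connection is visible in `run_sequence`: the hidden and cell states produced at one step are fed back in at the next, so the output depends on the whole sequence rather than on the last input alone.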

Experimental results reveal that the structure of this proposed solution enables higher prediction accuracy as compared to other baseline methods.

This is one of the data mining mini projects that will definitely help you get some real-world exposure.

13. Predicting consumption patterns with a mixture approach

Individuals consume a large selection of items in the digital world today, for example, while making purchases online, listening to music, using online navigation, or exploring virtual environments. Applications in these contexts employ predictive modeling techniques to recommend new items to users. However, in many situations, we also want to account for previously consumed items and past user behavior, and this is where the baseline approach of matrix factorization-based prediction falls short. This is one of the creative data mining projects.

A mixture model with repeated and novel events offers a suitable alternative for such problems. It aims to deliver accurate consumption predictions by balancing individual preferences in terms of exploration and exploitation. Also, it is one of those data mining project topics that include an experimental analysis using real-world datasets. The study’s results show that the new approach works efficiently across different settings, from social media and music listening to location-based data. 
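The balance between repeated and novel consumption can be sketched as a two-component mixture. The example below is a simplified illustration, not the model from the study: the function name, the fixed `repeat_weight`, and the use of global popularity for the novel component are all assumptions made for the sketch.

```python
from collections import Counter


def mixture_probability(item, history, global_popularity, repeat_weight):
    """P(item) = w * P_repeat(item | history) + (1 - w) * P_novel(item).

    history           -- items this user consumed before (hypothetical data)
    global_popularity -- item -> popularity score across all users (hypothetical)
    repeat_weight     -- w in [0, 1]: how strongly the user exploits known items
    """
    counts = Counter(history)

    # Repeat component: relative frequency of the item in the user's own history.
    p_repeat = counts[item] / len(history) if history else 0.0

    # Novel component: global popularity, renormalised over items the user has not seen.
    unseen_total = sum(p for it, p in global_popularity.items() if it not in counts)
    p_novel = 0.0
    if item not in counts and unseen_total > 0:
        p_novel = global_popularity.get(item, 0.0) / unseen_total

    return repeat_weight * p_repeat + (1.0 - repeat_weight) * p_novel


if __name__ == "__main__":
    history = ["a", "a", "b"]
    popularity = {"a": 0.5, "b": 0.2, "c": 0.2, "d": 0.1}
    for item in popularity:
        print(item, round(mixture_probability(item, history, popularity, 0.7), 4))
```

A `repeat_weight` close to 1 models a user who mostly returns to familiar items (exploitation), while a value close to 0 models a user who mostly explores. In a real system this weight would be estimated per user from data rather than fixed by hand.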

14. GMC: Graph-based Multi-view Clustering 

The existing clustering methods for multi-view data require an extra step to produce the final clusters, as they do not pay much attention to the weights of different views. Moreover, they operate on fixed graph similarity matrices for all views. This makes it a good idea for your next data mining project, and it can also be considered a graph mining project.

A novel Graph-based Multi-view Clustering (GMC) method can tackle this issue and deliver better results than the previous alternatives. It is a fusion technique that weights the data graph matrices of all views and derives a unified matrix, directly generating the final clusters. Other features of this graph mining project include:

  • Partition of data points into the desired number of clusters without using a tuning parameter. For this, a rank constraint is imposed on the Laplacian matrix of the unified matrix.
  • Optimization of the objective function with an iterative optimization algorithm 
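The rank constraint works because of a standard graph fact: for a graph on n nodes, the Laplacian has rank n - c exactly when the graph has c connected components. Once the unified graph satisfies the constraint, the clusters can be read off as its connected components. The sketch below shows only that final read-off step on a hand-made matrix; it is not the GMC optimization itself, and the function name and example matrix are assumptions.

```python
def connected_components(adjacency):
    """Label each node with the index of its connected component.

    adjacency -- symmetric list-of-lists; adjacency[i][j] > 0 means i and j are linked.
    """
    n = len(adjacency)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of this component
            node = stack.pop()
            for neighbour, weight in enumerate(adjacency[node]):
                if weight > 0 and labels[neighbour] == -1:
                    labels[neighbour] = current
                    stack.append(neighbour)
        current += 1
    return labels


if __name__ == "__main__":
    # Hypothetical unified similarity matrix with two blocks: {0, 1, 2} and {3, 4}.
    unified = [
        [0.0, 0.8, 0.5, 0.0, 0.0],
        [0.8, 0.0, 0.6, 0.0, 0.0],
        [0.5, 0.6, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.9],
        [0.0, 0.0, 0.0, 0.9, 0.0],
    ]
    labels = connected_components(unified)
    clusters = max(labels) + 1
    print(labels, "clusters:", clusters, "Laplacian rank:", len(unified) - clusters)
```

Because the component structure already gives the partition, no separate k-means or thresholding step is needed afterwards, which is the "directly generating the final clusters" property described above.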

15. ITS: Intelligent Transportation System

A multi-purpose traffic solution generally aims to ensure the following aspects:

  • Transport service’s efficiency
  • Transport safety
  • Reduction in traffic congestion
  • Forecast of potential passengers
  • Adequate allocation of resources

Consider a project that uses the above system to optimize bus scheduling in a city. ITS is one of the interesting data mining projects for beginners. You can take the past three years' data from a well-known bus service company and apply multiple linear regression (one response variable, the passenger count, modeled on several predictors) to forecast passenger numbers.

Further, you can use a Genetic Algorithm to calculate the minimum number of buses required. Finally, you validate your results using statistical measures such as mean absolute percentage error (MAPE) and mean absolute deviation (MAD).

Mean Absolute Percentage Error (MAPE): MAPE quantifies the accuracy of a forecasting system. For each time period, the absolute error is divided by the actual value; these ratios are then averaged across all periods and expressed as a percentage, which shows how close the estimates are to the true values.

MAPE is one of the most popular ways to quantify forecast error, perhaps because it is expressed as a percentage and is therefore easy to interpret and compare across series. It works best when the data contain no extreme values, and it is undefined when any actual value is zero. It is also frequently used as a loss function in regression analysis and model assessment.
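As a small worked sketch of the definition, MAPE = (100 / n) × Σ |actual − forecast| / |actual|. The function name and the passenger figures below are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Raises ValueError for empty or mismatched inputs, or when any actual
    value is zero (the ratio is undefined in that case).
    """
    if not actual or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical daily passenger counts and the model's forecasts.
    actual = [100, 200, 400]
    forecast = [110, 180, 400]
    print(round(mape(actual, forecast), 2))  # errors of 10%, 10% and 0% average to 6.67
```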

Mean Absolute Deviation (MAD): MAD measures how far, on average, each data point lies from the dataset's mean value, which gives a sense of the data's overall dispersion. To find the MAD of a dataset, first calculate the mean, then take the absolute value of the difference between each data point and the mean.

Each absolute deviation measures the gap between the mean and one data point. Finally, add up all these deviations and divide the sum by the total number of data points in the dataset.
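The same steps can be written out directly. This is a minimal sketch following the definition above (deviation from the dataset's own mean); the function name and sample values are illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean of the values."""
    if not values:
        raise ValueError("values must be non-empty")
    mean = sum(values) / len(values)               # step 1: the mean
    deviations = [abs(v - mean) for v in values]   # step 2: absolute deviations
    return sum(deviations) / len(values)           # step 3: average them


if __name__ == "__main__":
    # Mean is 5; deviations are 3, 1, 1, 3; their average is 2.
    print(mean_absolute_deviation([2, 4, 6, 8]))
```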

16. TourSense for city tourism

City-scale transport data about buses, subways, etc. could also be used for tourist identification and preference analytics. But relying on traditional data sources, such as surveys and social media, can result in inadequate coverage and information delay.

The TourSense project demonstrates how to overcome such shortcomings and provide more valuable insights. This tool would be useful for a wide range of stakeholders, from transport operators and tour agencies to tourists themselves. This is one of the excellent data mining projects for beginners. Here are the main steps involved in its design:

  • A graph-based iterative propagation learning algorithm to identify tourists from other public commuters
  • A tourist preference analytics model (utilizing the tourists’ trace data) to learn and predict their next tour
  • An interactive UI to serve easy information access from the analytics

Data Mining Projects: Conclusion

In this article, we have covered 16 data mining projects. If you wish to improve your data mining skills, you need to get your hands on these data mining projects.

Diving into data science involves more than academic understanding; it also requires practical experience. These data mining project ideas are designed for novices, with options to investigate sequence classification, group suggestions, similarity search, graph mining, and data cleaning. As you work on these projects, you'll lay a solid foundation in data science and prepare for future challenges in this ever-changing area.

Data mining and related fields have seen a surge in hiring demand over the last few years, and searches for data mining research topics, already common years ago, remain so today. With the above data mining project topics, you can keep up with market trends and developments. So, stay curious and keep updating your knowledge!

If you are curious to learn about data science, check out IIIT-B & upGrad’s Executive PG Program in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Frequently Asked Questions (FAQs)

As the name suggests, data mining refers to the process of mining, or extracting, patterns from large data sets. Its methods combine knowledge from machine learning, statistics, and database systems. Before applying data mining techniques, you need to assemble a dataset large enough to contain the patterns to be mined. Data mining involves six prominent classes of tasks: anomaly detection, association rule learning, clustering, classification, regression, and summarization.

Classification in data mining allows enterprises to arrange large sets of data according to target categories. Once the data is ordered in this manner, enterprises can see it clearly and analyze risks and profits more easily, which in turn helps the business grow. Classification can also be understood as a way to generalize known structures so that they apply to new data. The analysis is based on patterns found in the data, and these patterns help to sort the data into different groups.
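One simple way to picture "generalizing known structures to new data" is a nearest-centroid classifier: learn the average point of each known category, then assign a new record to the category whose average is closest. This is only one of many classification techniques, chosen here for brevity; the function names and the toy data are assumptions.

```python
def fit_centroids(samples, labels):
    """Compute the mean feature vector (centroid) of each labelled group."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Assign a new record to the category with the nearest centroid."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


if __name__ == "__main__":
    # Hypothetical customer records: (monthly spend, visits), labelled by risk category.
    samples = [(1, 1), (2, 1), (8, 9), (9, 8)]
    labels = ["low", "low", "high", "high"]
    centroids = fit_centroids(samples, labels)
    print(predict(centroids, (2, 2)), predict(centroids, (7, 9)))
```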

Projects are all about experimenting and testing your skills. They let you use all of your creativity and develop a useful product out of it. Building data mining projects will not only give you hands-on experience but will also enhance your knowledge pool. You can add these amazing projects to your resume to showcase your skills to potential employers. These projects will help you to implement your theoretical knowledge into action and gain practical benefits from it.

Efficient Online Stream Clustering Based on Fast Peeling of Boundary Micro-Cluster

edugate

Research Topics on Data Mining

Research Topics on Data Mining offers you creative ideas to give your research a bright future. We have 100+ world-class professionals who bring their innovative ideas to your research project. We have conducted 500+ workshops throughout the world, and a large number of researchers and students have benefited from our research. We also provide high-quality topics and ideas through our online services for researchers and students. Our experienced programmers have developed nearly 10000+ projects to date based on current techniques in data mining.

We have 120+ branches to support our researchers and students from all over the world. We also have tie-ups with authorized universities and colleges to guide projects and research. Our alumni share ideas about the most recent concepts, which helps us maintain a leading position in research. We are here for you, so feel free to approach us for further relevant details.

Topics on Data Mining

Research Topics on Data Mining presents the latest trends and new ideas for your research topic. We update ourselves frequently with the most recent topics in data mining. Data mining is the computing process of discovering patterns in large datasets and establishing relationships to solve problems. You can approach us with any topic, and we can deliver your project within the time limit you give us. We offer a list of issues, along with many new machine learning approaches, for research scholars in data mining.

Recent Issues in Data-Mining

  • User interaction
      - Interactive mining
      - Visualization and presentation of data mining results
      - Incorporation of background knowledge
  • Mining methodology
      - Mining various and new kinds of knowledge
      - Mining knowledge in multidimensional space
      - Data mining as an interdisciplinary effort
      - Boosting the power of discovery in a networked environment
      - Handling noise, uncertainty, and incompleteness of data
      - Pattern evaluation and pattern- or constraint-guided mining
  • Performance
      - Scalability and efficiency of data mining algorithms
      - Incremental, parallel, and distributed mining algorithms
  • Data mining and society
      - Social impacts of data mining
      - Privacy-preserving data mining
      - Invisible data mining
  • Efficiency and scalability
      - Incremental, stream, distributed, and parallel mining methods
  • Diversity of data types
      - Mining dynamic, networked, and global data repositories
      - Handling complex types of data
  • Mining multi-agent data and distributed data mining
  • Dealing with cost-sensitive, non-static, and unbalanced data
  • Process-related problems in data mining
  • Scaling up for high-speed data streams and high-dimensional data
  • Creating a unifying theory of data mining
  • Environmental and biological problems in data mining
  • Privacy and accuracy
  • Side effects (data sanitization)
  • Biological and environmental
  • Data integrity and security
  • Mining time series and sequence data
  • Network setting

Most Advanced Concepts in Data-Mining

  • Multimedia data mining
  • High performance distributed data mining
  • Online data mining
  • Spatial and spatiotemporal data mining
  • Information retrieval and web data mining
  • Scientific data mining
  • Dependable real-time data mining
  • Symbolic data mining
  • Geospatial contrast mining
  • Bio-inspired data mining
  • Mining sensor data in healthcare
  • Knowledge discovery
  • Architecture conscious data mining
  • Tunnel ventilation concepts
  • Sustainable mining
  • Mining gene sample time microarray data
  • Biomarker discovery
  • Intelligent statistical data mining
  • Computational data mining

New Machine Learning Approach in Data-Mining

  • Online transactional processing (OLTP)
  • Online analytical processing (OLAP)
  • Cross-industry standard process for data mining (CRISP-DM)
  • Deep neural network learning
  • Efficient ML and DM techniques
  • Planet enlists machine learning
  • Quantum machine learning
  • SAP Machine Learning
  • NeuroRule: connectionist approach
  • Joao Gama machine learning
  • Adaptive synthetic sampling approach
  • Integrated and cross-disciplinary approach
  • One-class SVM approach
  • Data Mining: Practical Machine Learning Tools and Techniques
  • Learning analytics and machine learning techniques
  • Kernel-based learning methods
  • Human mental models and machine-learned models
  • Data fusion approach

Recent Real Time Applications

  • Pragmatic application of data mining in healthcare
  • Credit card purchase analysis using a data mining approach
  • Data mining in design and manufacturing
  • Data mining and its future scope: a brief survey
  • Intrusion detection systems using data mining techniques
  • Banking and finance applications using data mining techniques
  • Biological data analysis with the help of a data mining approach
  • Data mining applications in bioinformatics
  • Fraud detection using data analysis techniques

Latest Research Topics

  • Performance evaluation of Mahout clustering algorithms using a Twitter streaming dataset
  • Data mining and analytics with data analytics and web insights
  • Feature selection approach from RNA-seq based on detection of differentially expressed genes
  • Future IoT applications in healthcare, with exploration of IoT industry applications
  • Overview of visual lifelogging: toward storytelling
  • Transfer learning and deep feature extraction for planktonic image datasets
  • Cyber security with machine learning
  • Geometric entities extraction using a conformal geometric algebra voting scheme implemented in reconfigurable devices
  • Real-time online hot topic prediction on Sina Weibo for earlier news reporting
  • Jointly modelling multi-grain aspects and opinions for large-scale online reviews
  • Building a common ontology using community knowledge: CODE+
  • Privacy-preserving multiple linear regression on vertically partitioned real medical datasets
  • Opinion mining for analysing cloud services reviews
  • Searching data cubes for submerging and emerging cuboids
  • Process mining for middleware adaptation
  • LLR-based sentiment analysis for kernel event sequences
  • Sensing and mining urban qualities in smart cities
  • Novel continuous pressure estimation approach using data mining techniques
  • Effective mapping of urban areas using ENVISAT ASAR, Sentinel-1A, and HJ-1-C data
  • Design of an educational big data application using Spark

We hope that the information above is enough to give you a crisp idea about research in data mining. We are ready to assist you, so feel free to contact us through our online and offline services. We provide online support 24 x 7, and our tutors respond promptly to clarify your research queries.

Data Mining Research Topics for MS PhD

Data Mining Research Topics

I am sharing with you some research topics in data mining that you can choose for the research proposal of your MS or Ph.D. thesis.

This tutorial groups the research into the following categories:

Industry-based research in data mining, problem-based research in data mining, and topic-based research in data mining.

  • 900+ research ideas in data mining

List of some famous Industries in the world for industry-based research in data mining

  • Automobile Wholesaling
  • Pharmaceuticals Wholesaling
  • Life Insurance & Annuities
  • Online Computer Software Sales
  • Supermarkets & Grocery Stores
  • Electric Power Transmission
  • IT Consulting
  • Wholesale Trade Agents and Brokers
  • Retirement & Pension Plans
  • Petroleum Refining
  • New Car Dealers
  • Drug, Cosmetic & Toiletry Wholesaling
  • Pharmacy Benefit Management
  • Property, Casualty and Direct Insurance
  • Colleges & Universities
  • Public Schools
  • Warehouse Clubs & Supercenters
  • Health & Medical Insurance
  • Gasoline & Petroleum Wholesaling
  • Gasoline & Petroleum Bulk Stations
  • Commercial Banking
  • Real Estate Loans & Collateralized Debt
  • E-Commerce & Online Auctions
  • Electronic Part & Equipment Wholesaling

List of some problems for research in data mining.

  • Crime Rate Prediction
  • Fraud Detection
  • Website Evaluation
  • Market Analysis
  • Financial Analysis
  • Customer trend analysis
  • Data Warehouse and DBMS
  • Multidimensional data model
  • OLAP operations
  • Example: loan data set
  • Data cleaning
  • Data transformation
  • Data reduction
  • Discretization and generating concept hierarchies
  • Installing Weka 3 Data Mining System
  • Experiments with Weka – filters, discretization
  • Task relevant data
  • Background knowledge
  • Interestingness measures
  • Representing input data and output knowledge
  • Visualization techniques
  • Experiments with Weka – visualization
  • Attribute generalization
  • Attribute relevance
  • Class comparison
  • Statistical measures
  • Experiments with Weka – using filters and statistics
  • Motivation and terminology
  • Example: mining weather data
  • Basic idea: item sets
  • Generating item sets and rules efficiently
  • Correlation analysis
  • Experiments with Weka – mining association rules
  • Basic learning/mining tasks
  • Inferring rudimentary rules: 1R algorithm
  • Decision trees
  • Covering rules
  • Experiments with Weka – decision trees, rules
  • The prediction task
  • Statistical (Bayesian) classification
  • Bayesian networks
  • Instance-based methods (nearest neighbor)
  • Linear models
  • Experiments with Weka – Prediction
  • Basic issues in clustering
  • First conceptual clustering system: Cluster/2
  • Partitioning methods: k-means, expectation-maximization (EM)
  • Hierarchical methods: distance-based agglomerative and divisive clustering
  • Conceptual clustering: Cobweb
  • Experiments with Weka – k-means, EM, Cobweb
  • Text mining: extracting attributes (keywords), structural approaches (parsing, soft parsing).
  • Bayesian approach to classifying text
  • Web mining: classifying web pages, extracting knowledge from the web
  • Data Mining software and applications

PHD PRIME

List of Research Topics in Data Mining for PhD

Data mining is the extraction of useful data from large amounts of data drawn from heterogeneous sources. Data mining techniques are used to acquire the data needed for data analysis and future prediction. If you are looking for a list of research topics in data mining for a PhD, read on.

Introduction to Data Mining

Data mining is a logical process used to find useful data. Once patterns and information have been identified, they are used to make decisions. The data mining process enables the following functions:

  • Speeding up informed decision-making
  • Examining repetitive and chaotic noise in the data
  • Accessing the relevant data

Similarly, the rise of IoT widens the scope for real-time data mining over billions of data points, for instance drug detection in the medical field.

How does it work?

Notable uses of data mining include measuring the opinions and sentiment of users, fraud detection, spam email filtering, database marketing, and credit risk management. It is deployed to analyze and explore large quantities of data in order to derive meaningful patterns.

If you are looking for reliable and trustworthy research guidance in data mining projects, in addition to on-time project delivery, then reach out to us and team up with our research experts for the best results. We provide 24/7 support and in-depth research knowledge for research scholars, who can contact us for more references in data mining. Next, we discuss the components of data mining.

15+ Latest List of Research Topics in Data Mining for PhD

Components of Data Mining

  • Data must exist in a usable format, such as a table or graph
  • Application software is used for data analysis
  • The data is stored and managed in a multidimensional database system
  • Data is extracted, transformed, and loaded into the data warehouse system
  • Data access is provided to business analysts and information technology professionals

With the help of all these components of data mining, you may proceed with your data mining PhD projects. We have many recent research techniques, tools, and protocols with which to provide the finest list of research topics in data mining for PhD. In addition, here we offer a list of real-time applications in data mining for your reference. Let us check out the novel applications based on data mining.

Applications in Data Mining

  • Predictive agriculture to track the crop’s health
  • Sentiment analysis for the intention prevention
  • Network intrusion detection and prevention
  • Online transaction fraud detection system
  • Opinion mining from social network

In addition, every research field has its own research issues and challenges. Our research experts highlight the research problems in data mining below.

Challenges in Data Mining

  • Information must be integrated from heterogeneous databases and global information systems
  • Data mining results are inaccurate when the data set is not diverse
  • Business practices may need to change before the uncovered information can be put to use
  • Data mining requires large databases, which are often hard to manage
  • Overfitting: when the training database is small, the model may not fit future states
  • Data mining queries have to be formulated by skilled experts

Research Solutions in Data Mining

Predictive analytics is a collection of statistical techniques that analyze current and historical data in order to predict future events. The techniques of predictive analytics are listed below.

  • Data mining
  • Predictive modeling
  • Machine learning

Oracle Data Mining (ODM) is a component of Oracle's Advanced Analytics database option. It provides powerful data mining algorithms that help data analysts gain valuable insights from data and make predictions. It can also be used to predict customer behavior, which helps in targeting the best customers and identifying cross-selling opportunities. The algorithms are exposed as SQL functions that mine the data tables.

Types and Taxonomy of Data Mining

The data mining process uses various techniques depending on the type of mining, pattern detection, data recovery operation, and knowledge discovery involved. Techniques commonly implemented in a data mining thesis are listed below.

  • Weighted hierarchical clustering
  • Hierarchical clustering
  • Logistic regression
  • K-Nearest neighbor
  • Artificial neural network (ANN)
  • Support vector machine (SVM)
  • Decision tree
  • Naive Bayes

We have successfully delivered several project topics based on data mining with the best quality and novelty. Our research team and developers are highly qualified and aim to establish effective, authentic research ideas. Research scholars can contact our research experts at any time with doubts and requirements related to data mining. Below, we describe the process of data mining.

Process of Data Mining

The process of data mining is to understand data through models drawn from database systems, machine learning, and statistics, by cleaning the raw data and finding patterns. The data mining research concepts are listed below.

  • Data warehousing
  • Data Analytics
  • Artificial intelligence
  • Data preparation and cleansing

We have in-depth knowledge of all the areas related to this field. We will make your work stress-free by guiding your research through the list of research topics in data mining for PhD, and our smart work makes hard topics easy. You can count on our keen help for your PhD research. Research scholars can now refer to the following research areas based on data mining.

Research Areas in Data Mining

  • Market basket analysis
  • Intrusion detection
  • Future healthcare

Although the above information is easy to find, it is hard to choose significant research topics in data mining. Thus, we have put together a list of research topics in data mining for PhD, which research scholars can use to develop their current research.

Research Topics in Data Mining

  • Research on data mining of physical examination for risk factors of chronic diseases based on classification decision tree
  • Empowerment of digital technology to improve the level of agricultural economic development based on data mining
  • A quality evaluation scheme for curriculum in ideological and political education based on data mining
  • Massive AI-based cloud environment for smart online education with data mining
  • In-depth data mining method of network shared resources based on k-means clustering
  • Data analysis on the performance of students based on health status using genetic algorithm and clustering algorithms
  • A Markov chain model to analyze the entry and stay states of frequent visitors to Taiwan
  • Optimization of the average travel time of passengers in the Tehran metro using data mining methods
  • Collaborative learning for improving the intellectual skills of dropout students using data mining techniques
  • Towards a machine learning and data mining approach to identify customer satisfaction factors on Airbnb

If you need a longer list of research topics in data mining for PhD to discuss and to shape your research knowledge, you can approach our research experts. Above, we have discussed the major topics in data mining. Our experienced research and development experts have listed some research trends below to support innovative research projects. In addition, we assist with your ideas to obtain better results.

Research Trends in Data Mining

  • Privacy protection and information security in data mining
  • Multi-databases data mining
  • Biological data mining
  • Visual data mining
  • Standardization of data mining query language
  • Integration of data mining with database systems, data warehouse systems, and web database systems
  • Scalable and interactive data mining methods
  • Application exploration

So far, we have discussed recent developments in data mining that can guide the choice of a novel research project. All of the above trends help in selecting an appropriate research topic, and we cover each of them in the list of research topics in data mining for PhD. Here, we have listed some of the methods and approaches used in data mining.

Algorithms in Data Mining

  • Locally estimated scatterplot smoothing (LOESS)
  • Logistic and stepwise regression
  • Multivariate adaptive regression splines
  • Ordinary least squares regression
  • Generalized linear models
  • Computational learning theory
  • Grammar induction
  • Meta-learning
  • Soft computing
  • Dynamic programming
  • Sparse dictionary learning
  • Inductive logic programming
  • Association rule learning
  • Genetic algorithm
  • Bayesian networks
  • Reinforcement learning
  • Deep learning
  • FCM, FPCM and SPCM
  • Possibilistic C-means algorithm
  • Ordering points to identify the clustering structure (OPTICS)
  • Farthest first algorithm
  • Expectation maximization (EM)
  • K-Means clustering
  • Cobweb clustering algorithm
  • Density-based spatial clustering algorithm
  • Deep convolutional networks
  • Deep belief networks
  • Recurrent neural networks
  • Feed-forward artificial neural network
  • Learning vector quantization
  • Self-organizing map
  • Clonal selection algorithm
  • Artificial immune recognition system
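
As one worked instance from the list above, the following is a minimal Python sketch of K-Means clustering using the standard assign-then-update loop (Lloyd's algorithm). The sample points, the fixed starting centroids, and the function name `kmeans` are assumptions chosen so the example is deterministic; practical implementations normally add random initialisation and a convergence check.

```python
def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds and return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
```

With this data each group of three points ends up in its own cluster, and each centroid sits at the mean of its group.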

The following is the list of research protocols used in implementing data mining research projects. Several more protocols are available in this field, so research scholars can contact us for more information about data mining protocols.

Notable Protocols for Data Mining

  • It is deployed for the homomorphic encryption scheme for the ElGamal encryption
  • Privacy, effectiveness, and efficiency are the three notable parameters used to determine the performance of the PPDDM protocol
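
The first item above mentions homomorphic encryption with ElGamal. The text does not name the protocol involved, so the Python sketch below only illustrates the well-known multiplicative property of textbook ElGamal: multiplying two ciphertexts component by component yields a ciphertext of the product of the plaintexts. The prime `P = 467`, the base `G = 2`, and all function names are toy assumptions; parameters this small are not secure and the code is not a protocol implementation.

```python
import random

P = 467  # small prime, toy value for illustration only
G = 2    # base used for key generation and encryption


def keygen():
    """Return (private key x, public key h = G^x mod P)."""
    x = random.randrange(1, P - 1)
    return x, pow(G, x, P)


def encrypt(h, m):
    """Encrypt message m (1 <= m < P) under public key h."""
    r = random.randrange(1, P - 1)
    return pow(G, r, P), (m * pow(h, r, P)) % P


def decrypt(x, ciphertext):
    """Recover m using c2 * c1^(-x); the inverse comes from Fermat's little theorem."""
    c1, c2 = ciphertext
    return (c2 * pow(c1, P - 1 - x, P)) % P


def multiply(ca, cb):
    """Component-wise product of two ciphertexts."""
    return (ca[0] * cb[0]) % P, (ca[1] * cb[1]) % P


x, h = keygen()
product = decrypt(x, multiply(encrypt(h, 7), encrypt(h, 6)))
print(product)
```

The decrypted value equals 7 × 6 modulo `P`, even though neither plaintext was revealed to the party doing the multiplication.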

So far, we have seen the protocols used in data mining projects and their most important uses. For more details on the functions of data mining, research scholars can take a look at our website. The following is the list of simulation tools used in data mining projects.

Simulation Tools in Data Mining

  • Oracle data mining

Performance Metrics in Data Mining

The above-mentioned are notable parameters based on the performance metrics in the data mining process. In addition, our experienced research professionals in data mining have highlighted the following datasets, which are essential for implementing data mining research projects.

Datasets in Data Mining

  • Disease diagnosis and recommended remedy
  • Annotated Arabic extremism tweets

We hope you now have a clear picture of data mining research projects. Our teams of experts are also developing more ideas in data mining. We are willing to assist you in producing an excellent research project topic in data mining for your Ph.D. research within a stipulated period. Research scholars can contact us for additional information about the list of research topics in data mining for PhD.

IMAGES

  1. Top 140 Interesting Big Data Research Topics for Students

  2. Trending Research Topics in Data Mining (PhD Guidance)

  3. Trending Top 10 Data Mining Thesis Topics [How to Choose Novel Idea]

  4. The Ultimate Guide to Understand Data Mining & Machine Learning

  5. Here’s What You Need to Know about Data Mining and Predictive Analytics

  6. Exploring the Essential Five Stages of Data Mining

VIDEO

  1. Data Mining

  2. Data Mining Short Revision. BSC Computational Course

  3. Business Analytics

  4. Data Mining Lecture 4

  5. Data Mining Lecture 5

  6. Data Mining #education #technology #engineering #audio #shorts

COMMENTS

  1. Data mining

    Data mining is the process of extracting potentially useful information from data sets. It uses a suite of methods to organise, examine and combine large data sets, including machine learning ...

  2. 82 Data Mining Essay Topic Ideas & Examples

    Commercial Uses of Data Mining. Data mining process entails the use of large relational database to identify the correlation that exists in a given data. The principal role of the applications is to sift the data to identify correlations. A Discussion on the Acceptability of Data Mining.

  3. 345193 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATA MINING. Find methods information, sources, references or conduct a literature review on DATA MINING

  4. data mining Latest Research Papers

    Find the latest published documents for data mining, related hot topics, top authors, the most cited documents, and related journals. ... This research is aimed to detect the user's topics of interest in social media and rank them based on specific topics, domains, etc. Few ...

  5. (PDF) Trends in data mining research: A two-decade review using topic

    The research direction related to practical Applications of data mining also shows a tendency to grow. The last two topics, Text Mining and Data Streams have attracted steady interest from ...

  6. Data Mining Research

    Data mining research has led to the development of useful techniques for analyzing time series data, including dynamic time warping [10] and Discrete Fourier Transforms (DFT) in combination with spatial queries [ 5 ]. To date, this work has paid little attention to query specification or interactive systems.

  7. Recent Advances in Data Mining

    Data mining is the procedure of identifying valid, potentially suitable, and understandable information; detecting patterns; building knowledge graphs; and finding anomalies and relationships in big data with Artificial-Intelligence-enabled IoT (AIoT). This process is essential for advancing knowledge in various fields dealing with raw data ...

  8. Data mining in clinical big data: the frequently used databases, steps

    Therefore, data mining has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduced the main medical public database and described the steps, tasks, and models of data mining in simple language. Additionally, we described data-mining methods along with their practical applications.

  9. Data mining

    Read the latest Research articles in Data mining from Scientific Reports

  10. A comprehensive survey of data mining

    Data mining plays an important role in various human activities because it extracts unknown useful patterns (or knowledge). Due to its capabilities, data mining has become an essential task in a large number of application domains such as banking, retail, medical, insurance, bioinformatics, etc. To take a holistic view of the research trends in the area of data mining, a comprehensive survey is ...

  11. Recent advances in domain-driven data mining

    Data mining research has been significantly motivated by and benefited from real-world applications in novel domains. This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. Along with this special issue, we also organized a related ...

  12. Efficient Deep Learning Techniques for Big Data Mining

    The goal of this research topic is to bring together theories and applications of efficient deep learning techniques to big-data mining problems. The proposed research theme will focus on efficient deep learning techniques for big data mining. The topics of interest include but are not limited to the following areas: • Neural Network Pruning.

  13. A Systematic Review on Data Mining for Mathematics and ...

    Educational data mining is used to discover significant phenomena and resolve educational issues occurring in the context of teaching and learning. This study provides a systematic literature review of educational data mining in mathematics and science education. A total of 64 articles were reviewed in terms of the research topics and data mining techniques used. This review revealed that data ...

  14. Data Mining and Modeling

    Data Mining and Modeling. The proliferation of machine learning means that learned classifiers lie at the core of many products across Google. However, questions in practice are rarely so clean as to just to use an out-of-the-box algorithm. A big challenge is in developing metrics, designing experimental methodologies, and modeling the space to ...

  15. Data Mining for the Internet of Things: Literature Review and

    Nowadays, big data is a hot topic for data mining and IoT; we also discuss the new characteristics of big data and analyze the challenges in data extracting, data mining algorithms, and data mining system area. Based on the survey of the current research, a suggested big data mining system is proposed.

  16. Frontiers in Big Data

    Part of an innovative multidisciplinary journal, exploring a wide range of topics, such as intelligent data management, information retrieval, privacy-preserving data mining, and data visual analyt...

  17. What Is Data Mining?

    The data mining process involves a number of steps from data collection to visualization to extract valuable information from large data sets. As mentioned above, data mining techniques are used to generate descriptions and predictions about a target data set. Data scientists describe data through their observations of patterns, associations ...

  18. Adaptations of data mining methodologies: a systematic literature

    The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use systematic literature review (SLR) as scientific method for two reasons. Firstly, systematic review is based on trustworthy, rigorous, and auditable methodology.

  19. Mining Big Data in Medical and Health Informatics

    The goal of this Research Topic is to present the latest research regarding reliable innovative solutions that are applied to healthcare to enhance the quality of life, as well as related issues and challenges. ... • Recent advancements of machine learning and/or data mining methods to facilitate medical informatics and health data analytics ...

  20. Data mining in clinical big data: the frequently used databases, steps

    Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians ...

  21. Editorial: Application of data mining in pharmaceutical research

    Therefore, this Research Topic focuses on the application of data mining in pharmaceutical research. Below we introduce and comment on the 11 research articles comprising this Research Topic. Chen et al. investigated the relationships of low- and high-dose aspirin use with the risks of death from all causes, cardiovascular disease (CVD), and ...

  22. Medical Data Mining and Medical Intelligence Services

    This Research Topic on "Medical Data Mining and Medical Intelligence Services" is dedicated to exploring the multifaceted landscape where advanced data mining techniques meet the evolving needs of modern healthcare. This Research Topic serves as a platform to unite researchers, healthcare practitioners, data scientists, and industry experts to ...

  23. 16 Data Mining Projects Ideas & Topics For Beginners [2024]

    2. GERF: Group Event Recommendation Framework. This is one of the simple data mining projects yet an exciting one. It is an intelligent solution for recommending social events, such as exhibitions, book launches, concerts, etc. A majority of the research focuses on suggesting upcoming attractions to individuals.

  24. Efficient Online Stream Clustering Based on Fast Peeling of Boundary

    A growing number of applications generate streaming data, making data stream mining a popular research topic. Classification-based streaming algorithms require pre-training on labeled data. Manually labeling a large number of samples in the data stream is impractical and cost-prohibitive. Stream clustering algorithms rely on unsupervised learning. They have been widely studied for their ...

  25. Innovative Research Topics on Data Mining (Latest Titles)

    Research Topics on Data Mining Research Topics on Data Mining offer you creative ideas to prime your future brightly in research. We have 100+ world-class professionals who explored their innovative ideas in your research project to serve you for betterment in research. So We have conducted 500+ workshops throughout the world, and a large ...

  26. Data Mining Research Topics for MS PhD

    Applying data mining to telecom churn management. A data mining approach to the prediction of corporate failure. Algorithms and applications for spatial data mining. Mining educational data to analyze students' performance. An attacker's view of distance preserving maps for privacy preserving data mining.

  27. List of Research Topics in Data Mining for PhD

    The process of data mining is to understand the data via the models such as database systems, machine learning, and statistics, finding patterns, and cleaning the raw data. In the following, we have enlisted the data mining research concepts. Regression. Machine learning. Data warehousing.

  28. Latest Research and Thesis topics in Data Mining

    Topics to study in data mining. Data mining is a relatively new thing and many are not aware of this technology. This can also be a good topic for M.Tech thesis and for presentations. Following are the topics under data mining to study: Fraud Detection. Crime Rate Prediction.
