82 Data Mining Essay Topic Ideas & Examples

  • Disadvantages of Using Web 2.0 for Data Mining Applications This data can be confusing to the readers and may not be reliable. Lastly, with the use of Web 2.
  • Data Mining and Its Major Advantages Thus, it is possible to conclude that data mining is a convenient and effective way of processing information, which has many advantages.
  • The Data Mining Method in Healthcare and Education Thus, I would use data mining in both cases; however, before that, I would discover a way to improve the algorithms used for it.
  • Data Mining Tools and Data Mining Myths The first problem is correlated with keeping the identity of the person involved in data mining secret. One of the major myths regarding data mining is that it can replace domain knowledge.
  • Hybrid Data Mining Approach in Healthcare One of the healthcare projects that will call for the use of data mining is treatment evaluation. In this case, it is essential to realize that the main aim of health data mining is to […]
  • Terrorism and Data Mining Algorithms However, this is a necessary evil as the nation’s security has to be prioritized since these attacks lead to harm to a larger population compared to the infringements.
  • Transforming Coded and Text Data Before Data Mining However, to complete data mining, it is necessary to transform the data according to the techniques that are to be used in the process.
  • Data Mining and Machine Learning Algorithms The shortest distance of string between two instances defines the distance of measure. However, this is also not very clear as to which transformations are summed, and thus it aims to a probability with the […]
  • Summary of C4.5 Algorithm: Data Mining C4.5 algorithm: Each record in the data set should be associated with one of the offered classes, meaning that one of the attributes should be treated as the class label.
  • Data Mining in Social Networks: Linkedin.com One of the ways to achieve the aim is to understand how users view data mining of their data on LinkedIn.
  • Ethnography and Data Mining in Anthropology The study of cultures is of great importance under normal circumstances to enhance the understanding of the same. Data mining is the success secret of ethnography.
  • Issues With Data Mining It is necessary to note that the usage of data mining helps the FBI to have access to the necessary information for terrorism and crime tracking.
  • Large Volume Data Handling: An Efficient Data Mining Solution Data mining is the process of sorting huge amounts of data and finding the relevant data. Data mining is widely used for the maintenance of data which helps a lot to an organization in […]
  • Data Mining and Analytical Developments In this era where there is a lot of information to be handled at a go and with little available time, it is useful and wise to analyze data from different viewpoints and summarize […]
  • Levi’s Company’s Data Mining & Customer Analytics Levi, the renowned name in jeans is feeling the heat of competition from a number of other brands, which have come upon the scene well after Levi’s but today appear to be approaching Levi’s market […]
  • Cryptocurrency Exchange Market Prediction and Analysis Using Data Mining and Artificial Intelligence This paper aims to review the application of A.I. in the context of blockchain finance by examining scholarly articles to determine whether the A.I. algorithm can be used to analyze this financial market.
  • “Data Mining and Customer Relationship Marketing in the Banking Industry” by Chye & Gerry First of all, the article generally elaborates on the notion of customer relationship management, which is defined as “the process of predicting customer behavior and selecting actions to influence that behavior to benefit the company”.
  • Data Mining Techniques and Applications The use of data mining to detect disturbances in the ecosystem can help to avert problems that are destructive to the environment and to society.
  • Ethical Data Mining in the UAE Traffic Department The research question identified in the assignment two is considered to be the following, namely whether the implementation of the business intelligence into the working process will beneficially influence the work of the Traffic Department […]
  • Canadian University Dubai and Data Mining The aim of mining data in the education environment is to enhance the quality of education for the mass through proactive and knowledge-based decision-making approaches.
  • Data Mining and Customer Relationship Management As such, CRM not only entails the integration of marketing, sales, customer service, and supply chain capabilities of the firm to attain elevated efficiencies and effectiveness in conveying customer value, but it obliges the organization […]
  • E-Commerce: Mining Data for Better Business Intelligence The method allowed the use of Intel and an example to build the study and the literature on data mining for business intelligence to analyze the findings.
  • Ethical Implications of Data Mining by Government Institutions Critics of personal data mining insist that it infringes on the rights of an individual and results in the loss of sensitive information.
  • Data Mining Role in Companies The increasing adoption of data mining in various sectors illustrates the potential of the technology regarding the analysis of data by entities that seek information crucial to their operations.
  • Data Warehouse and Data Mining in Business The circumstances leading to the establishment and development of the concept of data warehousing was attributed to the fact that failure to have a data warehouse led to the need of putting in place large […]
  • Data Mining: Concepts and Methods Speed of data mining process is important as it has a role to play in the relevance of the data mined. The accuracy of data is also another factor that can be used to measure […]
  • Data Mining Technologies According to Han & Kamber, data mining is the process of discovering correlations, patterns, trends or relationships by searching through a large amount of data that in most circumstances is stored in repositories, business databases […]
  • Data Mining: A Critical Discussion In recent times, the relatively new discipline of data mining has been a subject of widely published debate in mainstream forums and academic discourses, not only due to the fact that it forms a critical […]
  • Commercial Uses of Data Mining Data mining process entails the use of large relational database to identify the correlation that exists in a given data. The principal role of the applications is to sift the data to identify correlations.
  • A Discussion on the Acceptability of Data Mining Today, more than ever before, individuals, organizations and governments have access to seemingly endless amounts of data that has been stored electronically on the World Wide Web and the Internet, and thus it makes much […]
  • Applying Data Mining Technology for Insurance Rate Making: Automobile Insurance Example
  • Applebee’s, Travelocity and Others: Data Mining for Business Decisions
  • Applying Data Mining Procedures to a Customer Relationship
  • Business Intelligence as Competitive Tool of Data Mining
  • Overview of Accounting Information System Data Mining
  • Applying Data Mining Technique to Disassembly Sequence Planning
  • Approach for Image Data Mining Cultural Studies
  • Apriori Algorithm for the Data Mining of Global Cyberspace Security Issues
  • Database Data Mining: The Silent Invasion of Privacy
  • Data Management: Data Warehousing and Data Mining
  • Constructive Data Mining: Modeling Consumers’ Expenditure in Venezuela
  • Data Mining and Its Impact on Healthcare
  • Innovations and Perspectives in Data Mining and Knowledge Discovery
  • Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection
  • Linking Data Mining and Anomaly Detection Techniques
  • Data Mining and Pattern Recognition Models for Identifying Inherited Diseases
  • Credit Card Fraud Detection Through Data Mining
  • Data Mining Approach for Direct Marketing of Banking Products
  • Constructive Data Mining: Modeling Argentine Broad Money Demand
  • Data Mining-Based Dispatching System for Solving the Pickup and Delivery Problem
  • Commercially Available Data Mining Tools Used in the Economic Environment
  • Data Mining Climate Variability as an Indicator of U.S. Natural Gas
  • Analysis of Data Mining in the Pharmaceutical Industry
  • Data Mining-Driven Analysis and Decomposition in Agent Supply Chain Management Networks
  • Credit Evaluation Model for Banks Using Data Mining
  • Data Mining for Business Intelligence: Multiple Linear Regression
  • Cluster Analysis for Diabetic Retinopathy Prediction Using Data Mining Techniques
  • Data Mining for Fraud Detection Using Invoicing Data
  • Jaeger Uses Data Mining to Reduce Losses From Crime and Waste
  • Data Mining for Industrial Engineering and Management
  • Business Intelligence and Data Mining – Decision Trees
  • Data Mining for Traffic Prediction and Intelligent Traffic Management System
  • Building Data Mining Applications for CRM
  • Data Mining Optimization Algorithms Based on the Swarm Intelligence
  • Big Data Mining: Challenges, Technologies, Tools, and Applications
  • Data Mining Solutions for the Business Environment
  • Overview of Big Data Mining and Business Intelligence Trends
  • Data Mining Techniques for Customer Relationship Management
  • Classification-Based Data Mining Approach for Quality Control in Wine Production
  • Data Mining With Local Model Specification Uncertainty
  • Employing Data Mining Techniques in Testing the Effectiveness of Modernization Theory
  • Enhancing Information Management Through Data Mining Analytics
  • Evaluating Feature Selection Methods for Learning in Data Mining Applications
  • Extracting Formations From Long Financial Time Series Using Data Mining
  • Financial and Banking Markets and Data Mining Techniques
  • Fraudulent Financial Statements and Detection Through Techniques of Data Mining
  • Harmful Impact Internet and Data Mining Have on Society
  • Informatics, Data Mining, Econometrics, and Financial Economics: A Connection
  • Integrating Data Mining Techniques Into Telemedicine Systems
  • Investigating Tobacco Usage Habits Using Data Mining Approach

IvyPanda. (2024, March 2). 82 Data Mining Essay Topic Ideas & Examples. https://ivypanda.com/essays/topic/data-mining-essay-topics/

Data mining articles from across Nature Portfolio

Data mining is the process of extracting potentially useful information from data sets. It uses a suite of methods to organise, examine and combine large data sets, including machine learning, visualisation methods and statistical analyses. Data mining is used in computational biology and bioinformatics to detect trends or patterns without knowledge of the meaning of the data.

Discrete latent embeddings illuminate cellular diversity in single-cell epigenomics

CASTLE, a deep learning approach, extracts interpretable discrete representations from single-cell chromatin accessibility data, enabling accurate cell type identification, effective data integration, and quantitative insights into gene regulatory mechanisms.

Latest Research and Reviews

Publication, funding, and experimental data in support of Human Reference Atlas construction and usage

  • Yongxin Kong
  • Katy Börner

Depression recognition using voice-based pre-training model

  • Xiangsheng Huang
  • Zhenrong Xu

SOFB is a comprehensive ensemble deep learning approach for elucidating and characterizing protein-nucleic-acid-binding residues

We implement SOFB, an ensemble deep learning model-based nucleic-acid-binding residues on proteins identification method to characterize protein sequences by learning the semantics of biological dynamics contexts.

  • Xiangtao Li

Mitochondrial RNA modification-based signature to predict prognosis of lower grade glioma: a multi-omics exploration and verification study

  • Xingwang Zhou
  • Yuanguo Ling
  • Liangzhao Chu

Decoding intelligence via symmetry and asymmetry

  • Jianjing Fu
  • Ching-an Hsiao

Long non-coding RNAs expression and regulation across different brain regions in primates

  • Mohit Navandar
  • Constance Vennin
  • Susanne Gerber

News and Comment

Discovering cryptic natural products by substrate manipulation

Cryptic halogenation reactions result in natural products with diverse structural motifs and bioactivities. However, these halogenated species are difficult to detect with current analytical methods because the final products are often not halogenated. An approach to identify products of cryptic halogenation using halide depletion has now been discovered, opening up space for more effective natural product discovery.

  • Ludek Sehnal
  • Libera Lo Presti
  • Nadine Ziemert

Chroma is a generative model for protein design

  • Arunima Singh

Efficient computation reveals rare CRISPR–Cas systems

A study published in Science develops an efficient mining algorithm to identify and then experimentally characterize many rare CRISPR systems.

SEVtras characterizes cell-type-specific small extracellular vesicle secretion

Although single-cell RNA-sequencing has revolutionized biomedical research, exploring cell states from an extracellular vesicle viewpoint has remained elusive. We present an algorithm, SEVtras, that accurately captures signals from small extracellular vesicles and determines source cell-type secretion activity. SEVtras unlocks an extracellular dimension for single-cell analysis with diagnostic potential.

Protein structural alignment using deep learning

data mining Recently Published Documents

Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detection of outliers or anomalies is one of the vital issues in pattern-driven data mining. Outlier detection detects the inconsistent behavior of individual objects. It is an important sector in the data mining field with several different applications such as detecting credit card fraud, hacking discovery and discovering criminal activities. It is necessary to develop tools used to uncover the critical information established in the extensive data. This paper investigated a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying the clusters and outliers for datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, like instant irregular sets of clusters (C) and outliers (O), to boost the results. The results obtained after applying the algorithm to the dataset improved in terms of several parameters. For the comparative analysis, the accurate average value and the recall value parameters are computed. The accurate average value is 74.05% of the existing COID algorithm, and our proposed algorithm has 77.21%. The average recall value is 81.19% and 89.51% of the existing and proposed algorithm, which shows that the proposed work efficiency is better than the existing COID algorithm.

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For the taxed goods, the actual freight is generally determined by multiplying the allocated freight for each KG and actual outgoing weight based on the outgoing order number on the outgoing bill. Considering the conventional logistics is insufficient to cope with the rapid response of e-commerce orders to logistics requirements, this work discussed the implementation of data mining technology in bonded warehouse inbound and outbound goods trade. Specifically, a bonded warehouse decision-making system with data warehouse, conceptual model, online analytical processing system, human-computer interaction module and WEB data sharing platform was developed. The statistical query module can be used to perform statistics and queries on warehousing operations. After the optimization of the whole warehousing business process, it only takes 19.1 hours to get the actual freight, which is nearly one third less than the time before optimization. This study could create a better environment for the development of China's processing trade.

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has gained significant prevalence among users across numerous domains, in the majority of countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes by exposing open APIs, which makes it a highly suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is gaining more and more interest. The most prominent problem in social media analytics is to automatically identify and rank influencers. This research aims to detect users' topics of interest in social media and rank the users based on specific topics, domains, etc. A few hybrid parameters are also distinguished in this research based on the post's content, the post’s metadata, the user’s profile, and the user's network features to capture different aspects of being influential; these are used in the ranking algorithm. The results indicate that the proposed approach is effective in both the classification and ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their hazardous effects. They may have negative effects on economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors that are interrelated with the COVID-19 pandemic have been examined using data mining methodologies and approaches. As a result of the analysis, some rules and insights have been discovered and the performance of the data mining algorithms has been evaluated. According to the analysis results, the JRip algorithm had the highest correct classification rate and the lowest root mean squared error (RMSE). Considering classification rate and RMSE, JRip can be considered an effective method for understanding the factors related to coronavirus-caused deaths.

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and an Information Extraction (IE) task that primarily aims to obtain the writer’s feelings expressed in positive or negative by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given content as Positive, Negative, or Neutral . Although extensive research has been conducted in this area of computational linguistics, most of the research work has been carried out in the context of English language. However, Bengali sentiment expression has varying degree of sentiment labels, which can be plausibly distinct from English language. Therefore, sentiment assessment of Bengali language is undeniably important to be developed and executed properly. In sentiment analysis, the prediction potential of an automatic modeling is completely dependent on the quality of dataset annotation. Bengali sentiment annotation is a challenging task due to diversified structures (syntax) of the language and its different degrees of innate sentiments (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for the researchers, linguistic experts, and referees to annotate Bengali sentences immaculately with a view to building effective datasets for automatic sentiment prediction efficiently.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networks Service) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. And then, a new taxonomy adopting hybrid philosophy (i.e., granularity and techniques) is proposed, and we made a series of comparative studies on elementary diffusion models under our taxonomy from the aspects of assumptions, methods, and pros and cons. We further summarized representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, open issues in this field following the methodology of diffusion modeling are discussed.

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation of learning law, compares the teaching effectiveness of two different teaching methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of law by using big data analysis. From the perspective of law student psychology, e-book teaching can attract students' attention, stimulate students' interest in learning, deepen knowledge impression while learning, expand knowledge, and ultimately improve the performance of practical assessment. With a small sample size, there may be some deficiencies in the research results' representativeness. To stimulate the learning motivation of law as well as some other theoretical disciplines in colleges and universities has particular referential significance and provides ideas for the reform of teaching mode at colleges and universities. This paper uses a decision tree algorithm in data mining for the analysis and finds out the influencing factors of law students' learning motivation and effectiveness in the learning process from students' perspective.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education helps improving the traditional English teaching quality greatly. However, it only moves the teaching process from offline to online, which does not really change the essence of traditional English teaching. In this work, we mainly study an intelligent English teaching method to further improve the quality of English teaching. Specifically, the random forest is firstly used to analyze and excavate the grammatical and syntactic features of the English text. Then, the decision tree based method is proposed to make a prediction about the English text in terms of its grammar or syntax issues. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar or syntax recognition.

16 Data Mining Projects Ideas & Topics For Beginners [2024]

Introduction

A career in Data Science necessitates hands-on experience, and what better way to obtain it than by working on real-world data mining projects? This post provides a wide range of data mining project ideas for beginners. Whether you’re looking at data mining in database management systems, data mining projects in Java, or creative data mining project ideas, this list has you covered.

Today, data mining has become strategically important to organizations across industries. It not only helps in predicting outcomes and trends but also in removing bottlenecks and improving existing processes. “Data mining research topics 2020” was already a search typed by millions of users a few years ago, and the trend looks set to continue in 2024 and beyond. So, if you are a beginner, the best thing you can do is work on some real-world data mining projects.

If you are just getting started in data science, making sense of advanced data mining techniques can seem daunting. To complement the plethora of data mining research topics available online, we have compiled some useful data mining project topics to support you in your learning journey.

We, here at upGrad, believe in a practical approach, as theoretical knowledge alone won’t help in a real work environment unless you work on data mining projects yourself. In this article, we will explore 16 fun and exciting data mining projects and research topics that beginners can work on to put their data mining knowledge to the test.

But first, let’s address the more important and frequently asked question that must be lurking in your mind: why build data mining projects?

But before we begin, let us look at an example to decode what data mining is all about. Suppose you have a data set containing login logs of a web application. It can include things like the username, login timestamp, activities performed, time spent on the site before logging out, etc.

Such unstructured data in itself would not serve any purpose unless it is organized systematically and analyzed to extract relevant information for the business. By applying the different techniques of data mining, you can discover user habits, preferences, peak usage timings, etc. These insights can further increase the software system’s efficiency and boost its user-friendliness. Learn more about data mining with our data science programs.
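As a rough illustration of the login-log example, the sketch below counts logins per hour and averages session length per user. The records, field layout, and function names are hypothetical and chosen only for this example; a real application log would need its own parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical login-log records: (username, login timestamp, minutes on site).
LOGS = [
    ("alice", "2024-03-01 09:15", 12),
    ("bob", "2024-03-01 09:40", 30),
    ("alice", "2024-03-02 21:05", 45),
    ("carol", "2024-03-02 09:55", 8),
    ("bob", "2024-03-03 14:20", 20),
]


def peak_login_hours(logs, top_n=1):
    """Return the top_n (hour, login count) pairs, busiest hour first."""
    hours = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M").hour for _, ts, _ in logs
    )
    return hours.most_common(top_n)


def average_session_minutes(logs):
    """Return the mean session length in minutes for each user."""
    totals, counts = Counter(), Counter()
    for user, _, minutes in logs:
        totals[user] += minutes
        counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}


if __name__ == "__main__":
    print(peak_login_hours(LOGS))
    print(average_session_minutes(LOGS))
```

Simple aggregates like these are often the first step before applying heavier techniques such as clustering or classification to the same logs.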

In today’s digital era, the computing processes of collecting, cleaning, analyzing, and interpreting data make up an integral part of business strategies. So, data scientists are required to have adequate knowledge of methods like pattern tracking, classification, cluster analysis, prediction, neural networks, etc. The more you experiment with different data mining projects, the more knowledge you gain.

Data Mining Project Ideas & Topics for Beginners

This list of data mining projects for students is suited for beginners, and those just starting out with Data Science in general. These data mining projects will get you going with all the practicalities you need to succeed in your career.

Further, if you’re looking for a data mining project for your final year, this list should get you going, as it also contains data mining projects for students. So, without further ado, let’s jump straight into some data mining projects that will strengthen your base and allow you to climb up the ladder.

1. iBCM: interesting Behavioral Constraint Miner

One of the best ways to start getting hands-on experience with data mining projects for students is working on iBCM. A sequence classification problem deals with the prediction of sequential patterns in data sets. It discovers the underlying order in the database based on specific labels. In doing so, it applies the simple mathematical tool of partial orders. However, you would require a better representation to achieve more accurate, concise, and scalable classification. And a sequence classification technique with a behavioral constraint template can address this need.

With the iBCM project, you can delve into the field of sequence categorization. Using behavioral constraint templates, this venture predicts sequential patterns inside datasets. This method employs mathematical tools such as partial orders to reveal underlying data patterns in an accurate and simple manner. Beyond traditional sequence mining, iBCM finds a wide range of patterns, making it a good starting point for inexperienced data miners.

The interesting Behavioral Constraint Miner (iBCM) project can express a variety of patterns over a sequence, such as simple occurrence, looping, and position-based behavior. It can also mine negative information, i.e., the absence of a particular behavior. So, the iBCM approach goes much beyond the typical sequence mining representations and is a perfect starting point for those looking for data mining projects for students.
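To make the idea of behavioral constraint templates concrete, here is a minimal sketch that turns a sequence into a binary feature vector, one feature per constraint. It is not the iBCM implementation; the four templates, the function names, and the event names are assumptions picked to mirror the pattern types mentioned above (occurrence, absence, and position-based behavior).

```python
def occurs(seq, a):
    """Simple occurrence: a appears at least once."""
    return a in seq


def absent(seq, a):
    """Negative information: a never appears."""
    return a not in seq


def starts_with(seq, a):
    """Position-based behavior: the sequence begins with a."""
    return len(seq) > 0 and seq[0] == a


def precedes(seq, a, b):
    """Every b is preceded by some earlier a (true if b never occurs)."""
    seen_a = False
    for item in seq:
        if item == a:
            seen_a = True
        elif item == b and not seen_a:
            return False
    return True


def featurize(seq, constraints):
    """Map a sequence to a 0/1 vector, one entry per (template, args) pair."""
    return [int(check(seq, *args)) for check, args in constraints]


# Hypothetical constraint set over hypothetical web-session events.
CONSTRAINTS = [
    (occurs, ("login",)),
    (absent, ("error",)),
    (starts_with, ("login",)),
    (precedes, ("login", "purchase")),
]

if __name__ == "__main__":
    print(featurize(["login", "browse", "purchase"], CONSTRAINTS))
    print(featurize(["browse", "purchase", "error"], CONSTRAINTS))
```

The resulting vectors could then be passed to any ordinary classifier, which is the general shape of a constraint-based sequence classification pipeline.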

2. GERF: Group Event Recommendation Framework

This is one of the simple data mining projects yet an exciting one. It is an intelligent solution for recommending social events, such as exhibitions, book launches, concerts, etc. A majority of the research focuses on suggesting upcoming attractions to individuals. So, a Group Event Recommendation Framework (GERF) was developed to propose events to a group of users.

GERF addresses group social event recommendations by utilizing learning-to-rank algorithms for reliable choices. This project provides efficient event recommendations for a varied user population by extracting group preferences and environmental impacts, with applications ranging from exhibitions to travel services.

This model uses a learning-to-rank algorithm to extract group preferences and can incorporate additional contextual influences with ease, accuracy, and time-efficiency.

Learning to rank, also known as machine-learned ranking (MLR), is the process of building ranking models for systems needing information retrieval using machine learning techniques such as supervised learning, semi-supervised learning, and reinforcement learning.

The objects used for training are organized into lists, with the relative order between the items in each list being partially specified. In most cases, a numerical or ordinal score is assigned to each item, or a binary judgment is made (such as “relevant”, encoded as 1, or “not relevant”, encoded as 0).

The objective of the ranking model is to rank fresh, unseen lists in a way that is consistent with the rankings in the training data.
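To make the idea concrete, here is a minimal sketch of a pairwise, perceptron-style ranker. It is not the GERF model: the function names, the two-feature toy data, and the update rule are assumptions chosen only to illustrate how a model learned from ordered training lists can be applied to an unseen list.

```python
# Pairwise learning-to-rank sketch (hypothetical helper names, toy data).

def train_pairwise_ranker(lists, n_features, epochs=10):
    """lists: iterable of (feature_vectors, relevance_labels), one per training list."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for features, labels in lists:
            for i in range(len(features)):
                for j in range(len(features)):
                    if labels[i] > labels[j]:  # item i should rank above item j
                        diff = [a - b for a, b in zip(features[i], features[j])]
                        margin = sum(wk * dk for wk, dk in zip(w, diff))
                        if margin <= 0:  # pair is mis-ordered: nudge the weights
                            w = [wk + dk for wk, dk in zip(w, diff)]
    return w


def rank(w, features):
    """Return item indices ordered from highest to lowest predicted score."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda idx: -scores[idx])


# One training list with three items; a higher label means more relevant.
training_lists = [([(3, 1), (2, 5), (1, 2)], [2, 1, 0])]
weights = train_pairwise_ranker(training_lists, n_features=2)
print(rank(weights, [(1, 0), (4, 0), (2, 1)]))
```

Only comparisons within the same list are used for training, which matches the partially ordered lists described above.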

Also, it can be conveniently applied to other group recommendation scenarios like location-based travel services. 

3. Efficient similarity search for dynamic data streams

Online applications use similarity search systems for tasks like pattern recognition, recommendations, plagiarism detection, etc. Typically, the algorithm answers nearest-neighbor queries with the Locality-Sensitive Hashing (LSH) approach, a method related to min-hashing. It can be implemented in several computational models for large data sets, including the MapReduce architecture and streaming. Mentioning data mining projects like this one can also make your resume stand out.

For a variety of functions, online apps rely on similarity search engines. This research focuses on effective similarity search strategies for dynamic data streams, with a special emphasis on scalability in huge datasets. Its novel features, such as the use of the Jaccard index as a similarity measure and estimating techniques based on sketching, improve accuracy in pattern recognition and recommendation tasks.

Dynamic data streams, however, require scalable LSH-based filtering and design. To this end, the efficient similarity search project outperforms previous algorithms. Here are some of its main features:

  • Relies on the Jaccard index as a similarity measure
  • Suggests a nearest-neighbor data structure feasible for dynamic data streams
  • Proposes a sketching algorithm for similarity estimation 
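The connection between the Jaccard index and sketching can be illustrated with plain MinHash. The sketch below is a generic textbook construction, not the project's own algorithm; the function names, the CRC32-based item encoding, and the choice of 256 hash functions are assumptions made for illustration, and the item collections are assumed to be non-empty.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # large Mersenne prime used by the linear hash family


def jaccard(a, b):
    """Exact Jaccard index: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def make_hashes(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(k) = (a * k + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, hashes):
    """Keep, for every hash function, the smallest hash value over all items."""
    keys = [zlib.crc32(str(item).encode("utf-8")) for item in items]
    return [min((a * k + b) % PRIME for k in keys) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard index."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


hashes = make_hashes(256)
doc_a = ["data", "mining", "stream", "hash"]
doc_b = ["data", "mining", "graph", "hash"]
print(jaccard(doc_a, doc_b))
print(estimate_jaccard(minhash_signature(doc_a, hashes), minhash_signature(doc_b, hashes)))
```

A signature like this can be updated when an item is inserted (take the minimum with the new item's hash values), but plain MinHash cannot undo a deletion, which is one reason dynamic streams call for more specialized sketches.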

4. Frequent pattern mining on uncertain graphs

Application domains like bioinformatics, social networks, and privacy enforcement often encounter uncertainty due to the presence of interrelated, real-life data archives. This uncertainty permeates the graph data as well.

Frequent pattern mining on uncertain graphs is critical in settings involving uncertain data, such as bioinformatics and social networks. This project addresses the issue of transitive interactions in uncertain graph data. It manages real-world data archives efficiently by utilizing enumeration-evaluation methods and approximation techniques.

This problem calls for innovative data mining projects that can catch the transitive interactions between graph nodes. This beginner-level data mining project will help build a strong foundation in fundamental programming concepts. One such technique is frequent subgraph and pattern mining on a single uncertain graph. The solution is presented in the following format:

  • An enumeration-evaluation algorithm to support computation under probabilistic semantics
  • An approximation algorithm to enable efficient problem-solving
  • Computation sharing techniques to drive mining performance
  • Integration of check-point based and pruning approaches to extend the algorithm to expected semantics
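The contrast between exact evaluation and approximation can be sketched on a toy uncertain graph. The example below is only an illustration under stated assumptions: edges are assumed to exist independently with the given probabilities, the pattern is a fixed set of edges rather than a mined subgraph, and all names and numbers are hypothetical.

```python
import itertools
import random


def pattern_present(edges_present, pattern_edges):
    """True if every edge of the pattern exists in this possible world."""
    return all(e in edges_present for e in pattern_edges)


def exact_probability(edge_probs, pattern_edges):
    """Enumerate every possible world (exponential in the number of edges)."""
    edges = list(edge_probs)
    total = 0.0
    for mask in itertools.product([False, True], repeat=len(edges)):
        world = {e for e, keep in zip(edges, mask) if keep}
        weight = 1.0
        for e, keep in zip(edges, mask):
            weight *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        if pattern_present(world, pattern_edges):
            total += weight
    return total


def sampled_probability(edge_probs, pattern_edges, samples=20000, seed=0):
    """Monte Carlo approximation: sample possible worlds instead of enumerating them."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = {e for e, p in edge_probs.items() if rng.random() < p}
        hits += pattern_present(world, pattern_edges)
    return hits / samples


# Hypothetical uncertain graph: edge -> probability that the edge exists.
edge_probs = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.5, ("c", "d"): 0.3}
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(exact_probability(edge_probs, triangle))
print(sampled_probability(edge_probs, triangle))
```

Exact enumeration doubles in cost with every additional edge, which is why approximation and computation sharing matter on real graphs.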

5. Cleaning data with forbidden itemsets or FBIs

Data cleaning methods typically involve removing data errors and systematically fixing them by specifying constraints (illegal values, domain restrictions, logical rules, etc.).

Data cleansing frequently entails defining constraints to correct inaccuracies. The forbidden itemsets (FBI) approach introduces a repair method based on forbidden itemsets, discovering constraints in dirty data automatically and improving error detection precision. Empirical evaluations establish the mechanism’s trustworthiness and dependability, which is critical in big data scenarios.

In the real-life big data universe, we are inundated with dirty data that comes without any known constraints. In such a scenario, an algorithm automatically discovers constraints on the dirty data and then uses them to identify and repair errors. But repairs can backfire: when the discovery algorithm runs on the repaired data again, new constraint violations may appear, leaving the data erroneous. This is one of the excellent data mining projects for beginners.

Hence, a repairing method based on forbidden itemsets (FBIs) was devised to record unlikely co-occurrences of values and detect errors with more precision. And empirical evaluations establish the credibility and reliability of this mechanism. 
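One way to picture "unlikely co-occurrences of values" is to flag value pairs that appear together far less often than their individual frequencies would suggest. The sketch below uses lift as that measure; the helper name, the threshold of 0.5, and the toy records are assumptions for illustration and are not taken from the project itself.

```python
from collections import Counter
from itertools import combinations


def low_lift_pairs(rows, max_lift=0.5):
    """Return value pairs that co-occur, but much less often than independence predicts."""
    n = len(rows)
    single = Counter()
    pair = Counter()
    for row in rows:
        # An item is a (column index, value) pair, so equal values in different columns stay distinct.
        items = [(col, val) for col, val in enumerate(row)]
        single.update(items)
        pair.update(combinations(items, 2))
    flagged = []
    for (a, b), count in pair.items():
        lift = count * n / (single[a] * single[b])
        if lift < max_lift:
            flagged.append((a, b, round(lift, 3)))
    return sorted(flagged)


# Hypothetical dirty table: (city, country) with one suspicious record.
rows = [("Paris", "France")] * 5 + [("Rome", "Italy")] * 5 + [("Paris", "Italy")]
print(low_lift_pairs(rows))
```

A repair step would then replace one of the flagged values, ideally checking that the fix does not create a new low-lift combination.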

6. Protecting user data in profile-matching social networks

This is a practical data mining project with many potential future uses. Consider the user profile database maintained by the providers of social networking services, such as online dating sites. The querying users specify certain criteria based on which their profiles are matched with those of other users. This process has to be secure enough to protect against any kind of data breach. There are some solutions in the market today that use homomorphic encryption and multiple servers for matching user profiles to preserve user privacy.

7. PrivRank for social media

Social media sites mine their users’ preferences from their online activities to offer personalized recommendations. However, user activity data contains information that can be used to infer private details about an individual (for example, gender, age, etc.). Any leak or release of such user-specified data can increase the risk of inference attacks.

8. Practical PEKS scheme over encrypted email in cloud server

In light of recent high-profile public events related to email leaks, the security of such sensitive messages has emerged as a primary concern for users worldwide. To that end, Public-key Encryption with Keyword Search (PEKS) technology offers a viable solution. This is one of the useful data mining projects, as it combines security protection with efficient search operability.

When searching over a sizable encrypted email database in a cloud server, we would want the email receivers to perform quick multi-keyword and boolean searches without revealing additional information to the server.

9. Sentiment analysis and opinion mining for mobile networks

This project concerns post-publishing applications where a registered user can share text posts or images and also leave comments on posts. Under the prevailing system, users have to go through all the comments manually to filter out verified comments, positive comments, negative remarks, and so on.

With the sentiment analysis and opinion mining system, users can check the status of their post without dedicating much time and effort. It provides an opinion on the comments made on a post and also gives the option to view a graph. 
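As a rough illustration of how comments could be sorted automatically, the following sketch scores each comment against a tiny word list and counts the labels per post. The lexicon, the function names, and the three labels are hypothetical; a real system would more likely rely on a trained classifier or a full sentiment dictionary.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lexicon used only for this illustration.
POSITIVE = {"great", "love", "nice", "helpful", "awesome"}
NEGATIVE = {"bad", "hate", "boring", "awful", "spam"}


def classify_comment(text):
    """Label a comment by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize_post(comments):
    """Count comments per sentiment label, e.g. as input for a bar chart."""
    return Counter(classify_comment(c) for c in comments)


print(summarize_post(["Love this, great shot!", "So boring", "Where was this taken?"]))
```

The per-label counts are the kind of summary that could feed the graph view mentioned above.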

10. Mining the k most frequent negative patterns via learning

In behavior informatics, negative sequential patterns (NSPs) can be more revealing than positive sequential patterns (PSPs). For instance, in a disease or illness-related study, data on missing a medical treatment can be more useful than data on attending a medical procedure. But to the present day, NSP mining is still at a nascent stage. The ‘Topk-NSP+’ algorithm presents a reliable solution for overcoming the obstacles in the current mining landscape. This is one of the trending data mining projects, and the algorithm works as follows:

  • Mining the top-k PSPs with the existing method
  • Mining the top-k NSPs from these PSPs by using an idea similar to top-k PSP mining
  • Employing three optimization strategies to select useful NSPs and reduce computational costs

11. Automated personality classification project

The automatic system analyzes the characteristics and behaviors of participants. And after observing the past patterns of data classification, it predicts a personality type and stores its own patterns in a dataset. This project idea can be summarized as follows:

  • Store personality-related data in a database
  • Collect associated characteristics for each user
  • Extract relevant features from the text entered by the participant
  • Examine and display the personality traits 
  • Interlink personality and user behavior (There can be varying degrees of behavior for a particular personality type)

Such models are commonplace in career guidance services where a student’s personality is matched with suitable career paths. This can be an interesting and useful data mining project.

12. Social-Aware social influence modeling

This is one of the most popular data mining mini projects. This project deals with big social data and leverages deep learning for sequential modeling of user interests. The stepwise process is described below:

  • A preliminary analysis of two real datasets (Yelp and Epinions)
  • Discovery of statistically sequential actions of users and their social circles, including temporal autocorrelation and social influence on decision-making
  • Presentation of a novel deep learning model called Social-Aware Long Short-Term Memory (SA-LSTM), which can predict the type of items or Points of Interest that a particular user will buy or visit next. Long short-term memory (LSTM) is a kind of recurrent neural network (RNN) used in deep learning. Unlike traditional feedforward networks, an LSTM has feedback connections: its hidden state and cell state are carried from one time step to the next, and gates control what is kept, updated, and output. This lets it process complete data sequences rather than just individual data points.
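To show what the feedback connections of an LSTM look like in practice, here is a minimal sketch of one time step of a standard LSTM cell in plain Python. It is not the SA-LSTM model; the parameter layout, the gate names, and the example weights are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell; params maps gate name -> (W, b)."""
    v = list(h_prev) + list(x)  # gates see the previous hidden state and the new input
    f = [sigmoid(z) for z in affine(*params["forget"], v)]       # what to keep from c_prev
    i = [sigmoid(z) for z in affine(*params["input"], v)]        # how much new content to add
    o = [sigmoid(z) for z in affine(*params["output"], v)]       # how much of the cell to expose
    g = [math.tanh(z) for z in affine(*params["candidate"], v)]  # proposed new content
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Hidden size 1, input size 2; the weights are arbitrary illustrative values.
gate = ([[0.5, 0.1, -0.3]], [0.0])
params = {name: gate for name in ("forget", "input", "output", "candidate")}
h, c = [0.0], [0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):  # a short input sequence
    h, c = lstm_step(x, h, c, params)  # the state is fed back into the next step
print(h, c)
```

The loop at the end is the feedback connection: the hidden and cell states produced at one step become inputs to the next.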

Experimental results reveal that the structure of this proposed solution enables higher prediction accuracy as compared to other baseline methods.

This is one of the data mining mini projects that will definitely help you get some real-world exposure.

13. Predicting consumption patterns with a mixture approach

Individuals consume a large selection of items in the digital world today, for example, while making purchases online, listening to music, using online navigation, or exploring virtual environments. Applications in these contexts employ predictive modeling techniques to recommend new items to users. However, in many situations, we want to know the additional details of previously-consumed items and past user behavior. And this is where the baseline approach of matrix factorization-based prediction falls short. This is one of the creative data mining projects.

A mixture model with repeated and novel events offers a suitable alternative for such problems. It aims to deliver accurate consumption predictions by balancing individual preferences in terms of exploration and exploitation. Also, it is one of those data mining project topics that include an experimental analysis using real-world datasets. The study’s results show that the new approach works efficiently across different settings, from social media and music listening to location-based data. 

14. GMC: Graph-based Multi-view Clustering 

The existing clustering methods for multi-view data require an extra step to produce the final clusters, as they do not pay much attention to the weights of different views. Moreover, they operate on fixed graph similarity matrices of all views. This makes it a good idea for your next data mining project, and it can also be considered a graph mining project.

A novel Graph-based Multi-view Clustering (GMC) can tackle this issue and deliver better results than the previous alternatives. It is a fusion technique that weights data graph matrices for all views and derives a unified matrix, directly generating the final clusters. Other features of this graph mining project include:

  • Partition of data points into the desired number of clusters without using a tuning parameter. For this, a rank constraint is imposed on the Laplacian matrix of the unified matrix.
  • Optimization of the objective function with an iterative optimization algorithm 
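The role of the rank constraint rests on a standard fact from spectral graph theory: for a nonnegative symmetric similarity matrix with n nodes, the graph has exactly c connected components when its Laplacian has rank n - c. The sketch below checks this on a toy matrix with exact arithmetic; the function names and the example matrix are assumptions, and the code does not implement GMC's optimization.

```python
from fractions import Fraction


def laplacian(similarity):
    """Graph Laplacian L = D - S, with D the diagonal matrix of row sums (self-loops ignored)."""
    n = len(similarity)
    L = [[-similarity[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(similarity[i][j] for j in range(n) if j != i)
    return L


def matrix_rank(matrix):
    """Rank via Gauss-Jordan elimination on exact fractions (no rounding issues)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def implied_clusters(similarity):
    """Number of connected components, computed as n - rank(L)."""
    return len(similarity) - matrix_rank(laplacian(similarity))


# Hypothetical unified similarity matrix: nodes 0-2 form one group, nodes 3-4 another.
S = [
    [0, 2, 1, 0, 0],
    [2, 0, 3, 0, 0],
    [1, 3, 0, 0, 0],
    [0, 0, 0, 0, 4],
    [0, 0, 0, 4, 0],
]
print(implied_clusters(S))
```

Forcing the Laplacian of the unified matrix to have rank n - c therefore yields exactly c clusters without a separate clustering step.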

15. ITS: Intelligent Transportation System

A multi-purpose traffic solution generally aims to ensure the following aspects:

  • Transport service’s efficiency
  • Transport safety
  • Reduction in traffic congestion
  • Forecast of potential passengers
  • Adequate allocation of resources

Consider a project that uses the above system to optimize the process of bus scheduling in a city. ITS is one of the interesting data mining projects for beginners. You can take the past three years’ data from a renowned bus service company, and apply uni-variate multi-linear regression to forecast passenger numbers.

Further, you can calculate the minimum number of buses required for optimization using a Genetic Algorithm. Finally, you validate your results using statistical techniques like mean absolute percentage error (MAPE) and mean absolute deviation (MAD).

Mean Absolute Percentage Error (MAPE): The accuracy of a forecasting system may be quantified by calculating the MAPE. For each time period, the absolute error is divided by the actual value; these ratios are then averaged across all periods and expressed as a percentage, giving a reading on how close the estimates are to the true values.

MAPE is one of the most popular ways to quantify forecast errors, perhaps because it expresses the error as a percentage and is therefore independent of the variable’s units. It works best when the data contain no extreme values and no zeros, since it is undefined when an actual value is zero. In regression analysis and model assessment, it is frequently used as a loss function.

Mean Absolute Deviation (MAD): It measures how far, on average, each data point is from the dataset’s mean value. It helps us get a sense of the data’s overall dispersion. To find the MAD for a data set, we first calculate the mean and then the distance of each data point from the mean, taking the absolute value of each difference to obtain the absolute deviations.

Each absolute deviation measures the gap between the mean and one data point. We then add up all these deviations and divide the total by the number of data points in the data set.
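Both measures are short enough to write out directly. The following sketch follows the definitions given in this section; the function names and the sample numbers are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


def mad(values):
    """Mean absolute deviation of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Hypothetical passenger counts for three periods and the model's forecasts.
actual = [100, 200, 400]
forecast = [110, 180, 400]
print(mape(actual, forecast))  # about 6.67 (percent)
print(mad([2, 4, 6, 8]))       # 2.0
```

Note that forecasting texts often apply the mean absolute deviation to forecast errors rather than to raw data points; the version here follows the data-dispersion definition described above.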

16. TourSense for city tourism

City-scale transport data about buses, subways, etc. could also be used for tourist identification and preference analytics. But relying on traditional data sources, such as surveys and social media, can result in inadequate coverage and information delay.

The TourSense project demonstrates how to overcome such shortcomings and provide more valuable insights. This tool would be useful for a wide range of stakeholders, from transport operators and tour agencies to tourists themselves. This is one of the excellent data mining projects for beginners. Here are the main steps involved in its design:

  • A graph-based iterative propagation learning algorithm to identify tourists from other public commuters
  • A tourist preference analytics model (utilizing the tourists’ trace data) to learn and predict their next tour
  • An interactive UI to serve easy information access from the analytics

Data Mining Projects: Conclusion

In this article, we have covered 16 data mining projects. If you wish to improve your data mining skills, you need to get your hands on these data mining projects.

Diving into data science involves more than just academic understanding; it also requires practical experience. These data mining project ideas are designed for novices, with options to investigate sequence classification, group recommendations, similarity search, graph mining, and data cleaning. As you work on these projects, you’ll lay a solid foundation in data science and prepare for future challenges in this ever-changing area.

Data mining and related fields have experienced a surge in hiring demand in the last few years, and data mining research topics, already a popular search in 2020, continue to attract interest. With the above data mining project topics, you can keep up with the market trends and developments. So, stay curious and keep updating your knowledge!

Frequently Asked Questions (FAQs)

As the name suggests, data mining refers to the process of mining or extracting patterns from large data sets. Its methods combine knowledge from machine learning, statistics, and database systems. Before applying data mining techniques, you need to assemble a dataset large enough to contain the patterns to be mined. Data mining involves six common classes of tasks: anomaly detection, association rule learning, clustering, classification, regression, and summarization.

Classification in data mining allows enterprises to arrange large sets of data according to the target categories. Once ordered in this manner, the enterprises can see the data clearly and analyze the risks and profits easily, which in turn helps the businesses to grow. Classification can also be understood as a way to generalize known structures to apply to new data. The analysis is based on several patterns that are found in the data. These patterns help to sort the data into different groups.

Projects are all about experimenting and testing your skills. They let you use all of your creativity and develop a useful product out of it. Building data mining projects will not only give you hands-on experience but will also enhance your knowledge pool. You can add these amazing projects to your resume to showcase your skills to potential employers. These projects will help you to implement your theoretical knowledge into action and gain practical benefits from it.

Recent advances in domain-driven data mining

  • Published: 27 December 2022
  • Volume 15, pages 1–7 (2023)

  • Chuanren Liu 1 ,
  • Ehsan Fakharizadi 2 ,
  • Tong Xu 3 &
  • Philip S. Yu 4  

Data mining research has been significantly motivated by and benefited from real-world applications in novel domains. This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. Along with this special issue, we also organized a related workshop to continue the previous efforts on promoting advances in domain-driven data mining. This editorial report will first summarize the selected papers in the special issue, then discuss various industrial trends in the context of the selected papers, and finally document the keynote talks presented at the workshop. Although many scholars have made prominent contributions with the theme of domain-driven data mining, there are still various new research problems and challenges calling for more research investigations in the future. We hope this special issue is helpful for scholars working along this critically important line of research.

1 Summary of research contributions

Data mining has been a trending research area with contributions from diverse communities including computer scientists, statisticians, mathematicians, as well as other researchers and engineers working on data-intensive problems. While many researchers focus on general data mining methodologies for standardized problem settings, such as unsupervised learning and supervised learning, applying general solutions to specific problems may still be a nontrivial challenge. This is mainly due to the need to incorporate domain knowledge in implementing data mining solutions for novel real-world applications. Oftentimes standardized solutions must be significantly revised to accommodate unique characteristics of input data and deliver actionable results in novel application domains. Essentially, data mining research is highly applied. Many classic research problems are motivated by real-world applications and results of data mining research are expected to provide practical implications to business managers, government agencies, and all members of our society.

1.1 Overview of domain-driven data mining

Domain-driven data mining aims to bridge the gaps between theoretical research and practical applications in data mining and transform data intelligence to business value and impact [ 11 , 12 ]. Domain-driven data mining has been proposed as a research framework for discovering actionable knowledge and intelligence in a complex environment to directly transform data to decisions or enable decision-making actions [ 3 , 16 ].

Domain-driven data mining handles ubiquitous X-complexities and X-intelligences surrounding domain-driven actionable intelligence discovery. Examples of X-complexities and X-intelligences are related to domain complexity and intelligence, data complexity and intelligence, behavior complexity and intelligence, network complexity and intelligence, social complexity and intelligence, organizational complexity and intelligence, human complexity and intelligence, and their integration and meta-synthesis [ 8 , 16 ]. Analyzing and learning X-complexities and X-intelligences result in X-analytics [ 8 ] in various domains and for specific purposes. Examples are business analytics, behavior analytics, social analytics, operational analytics, risk analytics, customer analytics, insurance analytics, learning analytics, cybersecurity analytics, and financial analytics [ 15 , 21 , 24 , 26 , 28 , 29 , 31 , 38 , 40 , 41 , 42 , 43 , 51 ]. One prominent example of learning data complexities for in-depth data intelligence is the research on non-IID learning, which learns interactions and couplings (including correlation and dependency) involved in heterogeneous data, behaviors, and systems. Non-IID learning is applicable to many real-world applications such as non-IID outlier detection, non-IID recommendation, non-IID multimedia and multimodal analytics, and non-IID federated learning [ 5 , 6 , 17 ].

Domain-driven data mining also handles typical research issues and gaps in the existing body of knowledge for domain-driven and actionable intelligence delivery. The research on domain-driven actionable intelligence discovery includes but is not limited to: quantifying knowledge actionability (rather than just interestingness) of data mining results [ 14 ], domain knowledge representation and domain generalization [ 30 ], domain-driven actionable knowledge discovery process [ 3 , 16 ], context-aware analytics and learning [ 46 ], discovering actionable patterns by combined mining [ 4 , 54 ] and high-utility mining [ 27 ], pattern relation analysis [ 4 ], cross-domain and transfer learning [ 24 , 36 , 45 , 51 ], data-to-decision transformation [ 8 ], personalized learning and recommendation [ 49 ], next-best action learning and recommendation [ 13 , 23 ], reflective learning with explicit and implicit feedback [ 32 , 50 ], explainable and interpretable analytics and learning [ 18 ], unbiased and fair analytics and learning [ 1 , 25 , 32 ], privacy and security-preserved analytics [ 52 ], and ethical analytics [ 34 ].

To better understand the challenges, recent advances, and new opportunities in domain-driven data mining, this special issue, along with other related activities, was proposed to call for the latest theoretical and practical developments, expert opinions on the open challenges, lessons learned, and best practices in domain-driven data mining. The special issue received submissions from researchers with different backgrounds, but all focusing on data-intensive research topics with novel applications. The papers accepted in this special issue explored novel factors and challenges such as socioeconomic, organizational, human-centered, and cultural aspects in different data mining tasks. In the following, we first provide a summary of the selected papers in the special issue.

1.2 Applied and flexible deep learning

Deep representation learning has attracted much attention in recent years. For chronic disease diagnosis, Zhang et al. [ 48 ] designed an unsupervised representation learning method to obtain informative correlation-aware signals from multivariate time series data. The key idea was a contrastive learning framework with a graph neural network (GNN) encoder to capture inter- and intra-correlation of multiple longitudinal variables. The work also considered modeling uncertainty quantification with evidential theory to assist the decision-making process in detecting chronic diseases. Also based on deep learning models, Sun et al. [ 37 ] adopted the sequential long short-term memory (LSTM) models in the domain of sports analytics for the baseball industry. With the numbers of home runs as the predictive target, the authors applied their models to data from Major League Baseball (MLB) to support important decisions in managing players and teams. The results showed that the deep learning models could perform better and bring valuable information to meet users’ needs. Focusing on more fundamental deep learning techniques, Zhao et al. [ 53 ] developed a flexible approach to compact architecture search for deep multitask learning (MTL) problems. Though sharing model architectures is a popular method for MTL problems, identifying the appropriate components to be shared by multiple tasks is still a challenge. Based on the expressive reinforcement learning framework, this paper proposed to discover flexible and compact MTL architectures with efficient search space and cost.

1.3 Interpretable and actionable predictions

A critical challenge facing data mining research is to discover actionable knowledge that can directly support decision-making tasks. In the domain of agricultural business and ecosystem management, Basak et al. [ 2 ] applied machine learning methods for a novel problem of soil moisture forecasting. The two modeling challenges were accurate long-term prediction and interpretable hydrological parameters. The proposed domain-driven solution was rooted in deterministic and physically based hydrological redistribution processes of gravity and suction.

As another example of actionable knowledge discovery, Dey et al. [ 19 ] proposed a systematic approach for fire station location planning. As urban fires could adversely affect the socioeconomic growth and ecosystem health of our communities, the authors applied various data mining and machine learning models in working with the Victoria Fire Department to make important decisions in selecting the location of a new fire station. The key idea in their approach was to develop effective models for demand prediction and utilize the models to define a generalized index to measure the quality of fire service in urban settings. The paper integrated multiple data sources and important domain knowledge/requirements in the modeling process. The final decision task was formulated as an integer programming problem to select the optimal location with maximum service coverage.
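For a single new station, the maximum-coverage decision can be pictured with a brute-force search over candidate sites, which on such a tiny instance gives the same answer an integer program would. The sketch below is a simplified stand-in and not the authors' formulation; the coordinates, demand weights, coverage radius, and function names are all hypothetical.

```python
import math


def covered_demand(site, demand_points, radius):
    """Total demand weight lying within the coverage radius of a candidate site."""
    return sum(
        w for (x, y, w) in demand_points
        if math.hypot(x - site[0], y - site[1]) <= radius
    )


def best_site(candidates, demand_points, radius):
    """Pick the candidate site that covers the most predicted demand."""
    return max(candidates, key=lambda s: covered_demand(s, demand_points, radius))


# Hypothetical predicted demand: (x, y, expected incidents per year).
demand_points = [(0, 0, 5), (1, 0, 3), (5, 5, 4), (6, 5, 1)]
candidates = [(0, 0), (5, 5), (3, 3)]
print(best_site(candidates, demand_points, radius=2))
```

With several stations to place or additional side constraints, enumeration stops scaling and an integer programming solver becomes the natural tool.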

For sequential e-commerce product recommendation, Nasir and Ezeife [ 33 ] proposed the Semantic Enabled Markov Model Recommendation system to address long-standing challenges such as model complexity, data sparsity, and ambiguous predictions. Their system was proposed to extract and integrate sequential and semantic knowledge as well as contextual features. The new system showed improved recommendation performance for multiple e-commerce recommendation tasks.

1.4 Unsupervised learning with domain knowledge

Incorporating domain knowledge for unsupervised learning is particularly challenging due to the lack of clearly defined learning target. In the domain of health care, Jasinska-Piadlo et al. [ 22 ] explored the advantages and the challenges of a “domain-led” approach versus a data-driven approach to K -means clustering analysis. The authors compared expert opinions and principal component analysis for selecting the most useful variables to be used for the K -means clustering. The paper discussed comparative advantages of each approach and illustrated that domain knowledge played an important role at the interpretation stage of the clustering results. The authors developed a practical checklist guiding how to enable the integration of domain knowledge into a data mining project.

Similarly, text mining and natural language processing are important research tools in many areas. However, many state-of-the-art text and language models are developed for general contexts, and careful adaptation is often needed when applying such techniques to domain-specific data. In this special issue, Villanes and Healey [ 39 ] investigated the use of sentiment dictionaries to estimate sentiment for large document collections. The authors presented a semiautomatic method for extending a general sentiment dictionary for a specific target domain. To minimize manual effort, the authors combined statistical term identification with term evaluation using Amazon Mechanical Turk in a study on dengue fever. The same approach could potentially be applied to construct similar term-based sentiment dictionaries in other target domains.
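A small sketch shows why extending a general dictionary matters for domain text. The scoring rule (mean valence of matched terms), the terms, and their valence values are all hypothetical choices made for illustration; they are not the dictionary or the method of Villanes and Healey.

```python
def score(text, lexicon):
    """Mean valence of tokens found in the lexicon, or None if nothing matches."""
    hits = [lexicon[t] for t in text.lower().split() if t in lexicon]
    return sum(hits) / len(hits) if hits else None

# General-purpose dictionary (hypothetical entries).
general = {"good": 1.0, "bad": -1.0, "severe": -0.5}

# Domain terms with assumed crowd-rated valence (hypothetical values).
domain_terms = {"outbreak": -0.75, "recovered": 0.5}
extended = {**general, **domain_terms}

doc = "dengue outbreak reported but most patients recovered"
print(score(doc, general))   # no general term matches this document
print(score(doc, extended))  # the domain terms now contribute
```

With the general dictionary alone the document cannot be scored at all, which is the coverage gap that a domain-specific extension is meant to close.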

2 New trends from the industry perspective

A continuing trend in the data mining field has been the proliferation of its applications to new domains. This is partly due to the advancements in machine learning technologies evidenced by and promoted through frequent reports of new performance records on benchmark tasks. Another contributor to this proliferation is the increase in the quantity of data collected, stored, and appropriately documented for mining, as the benefits of leveraging these data have become more apparent. Some of the works in this special issue demonstrated how data mining techniques can be applied in agriculture [ 2 ], health care and medicine [ 22 , 48 ], and city planning [ 19 ].

One aspect of data quality at the core of this expansion is the growing use of rich data formats. Image, audio, video, and raw text can now be almost directly fed into models that process them to extract meaningful features, patterns, and insights. These formats now often supplement the tabular data structures of the past as shown by Nasir and Ezeife [ 33 ]. To accommodate using these new formats, data mining and machine learning models have adapted to support multi-channel, multimodal, and sequential inputs [ 33 , 37 ].

As more domains employ novel data mining techniques, there have been more opportunities for cross-domain spillovers. We now see more examples of transfer learning, where models trained on one (source) domain are applied in another (target) domain suffering from data scarcity. However, learning generalized models that perform well on multiple tasks could be a challenging process [ 53 ]. These models are often trained with self-supervision on large data and contain millions or billions of learned parameters, such as models for language processing (e.g., BERT, GPT-3, XLNet) and image classification (ResNet, EfficientNet, Inception). A fundamental property of many generalized models is their ability to encode the input data into a vectorized representation, as evidenced by Zhang et al. [ 48 ].

Another recent challenge in data mining, one that is especially amplified in the case of transfer learning involving large models, is the issue of compactness. In many domains where scalable, low-latency inference is needed and where the cost of training and deploying new models can be high, it becomes necessary to restrict the model size. Several techniques serve this objective, including pruning, distillation, and training with constraints, as Zhao et al. [ 53 ] demonstrated in this special issue.
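Of the techniques listed, pruning is the simplest to illustrate. The sketch below shows generic magnitude pruning on a flat list of weights; it is not the architecture search method of Zhao et al., and the weights and the `sparsity` level are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02]  # hypothetical layer weights
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

In practice the zeroed weights only save memory or latency when the storage format or hardware exploits sparsity, and pruned models are usually fine-tuned afterwards to recover accuracy.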

Along with these trends, there have been several key developments in the structures used for data mining. One that has drastically improved the ability to digest sequential data is the invention of transformer structures. Transformers have effectively revolutionized the deep learning field by enabling models to understand the internal relationships between interdependent data points. These structures are the primary building blocks of some of the large generalized models mentioned above. Another recent advance is the improved ability of generative models, which learn not to score or classify but to create rich outputs such as images, text, or audio. We also continue to see expansion in the field of graph neural networks, where models learn and reproduce attributes of a graph data structure [ 48 ].
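The mechanism that lets transformers relate interdependent data points is scaled dot-product attention. The sketch below implements it in plain Python for a toy sequence; it omits the learned query, key, and value projections, the multiple heads, and the positional encodings that real transformer layers use, and the input vectors are hypothetical.

```python
from math import exp, sqrt

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence of three 2-d tokens
y = attention(x, x, x)  # self-attention without learned projections
print(y)
```

Each output row is a weighted average of all input rows, so every position can draw on information from every other position in a single step.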

The sophistication of data mining methods has resulted in improved performance, but it comes at a cost. Models that use larger and richer input data, capture complex interactions between data points, and map the inputs to abstract representation spaces are very hard, if not impossible, to interpret. In many domains, it is important for the model outputs to be explainable to decision makers. Explainability matters for three reasons. First, explainable results are more powerful at both convincing decision makers and educating them with insights from the data [ 2 ]. Second, explainability is a safeguard against models learning human biases and learning to discriminate. Finally, in some applications, it is necessary to understand not just the predicted value but also the uncertainty of the predictions. Uncertainty modeling and quantification may be necessary in order to know when to rely on the machine and when to rely on the human. A recently popularized concept in this area is the human-in-the-loop approach, where models continuously receive and learn from the input of human experts and decision makers, while experts use model predictions in their decision making on a regular basis. Our authors in this special issue have demonstrated the great potential of domain-driven data mining in addressing the aforementioned challenges, and more work is needed in the future through collaboration between academia and industry.
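One simple way to decide when to rely on the machine and when to rely on the human is to defer whenever an ensemble of models disagrees. The sketch below is an illustration under assumed inputs: the ensemble predictions and the threshold `max_std` are hypothetical, and ensemble spread is only one of several possible uncertainty estimates.

```python
from statistics import mean, pstdev

def predict_or_defer(ensemble_predictions, max_std):
    """Return a routing decision plus the ensemble mean and spread."""
    avg = mean(ensemble_predictions)
    spread = pstdev(ensemble_predictions)  # disagreement between ensemble members
    decision = "defer_to_human" if spread > max_std else "use_model"
    return decision, avg, spread

# Members agree closely: the model output is used.
print(predict_or_defer([0.90, 0.92, 0.88], max_std=0.1))

# Members disagree strongly: the case is routed to a human expert.
print(predict_or_defer([0.10, 0.90, 0.50], max_std=0.1))
```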

3 Domain-driven data mining workshop

To facilitate the exchange of recent advances in domain-driven data mining, the Domain-Driven Data Mining Workshop was organized as a part of the 2021 SIAM International Conference on Data Mining. The workshop invited three keynote speakers and received paper submissions from multiple institutions. The papers accepted by the workshop were later invited for potential publication in this special issue. In the following, we review the invited keynote talks at the Domain-Driven Data Mining Workshop.

3.1 Actionable intelligence discovery

We first invited Dr. Longbing Cao for his keynote talk, “Domain-Driven and Actionable Intelligence Discovery.” In 2004, Dr. Cao proposed the concept of “domain-driven data mining,” and he has since led many large enterprise data science projects on actionable knowledge discovery for governments and businesses, involving over 10 domains including capital markets, banking, insurance, telecommunication, transport, education, smart cities, online business, and public sectors (e.g., financial service, taxation, social welfare, IP, regulation, immigration).

Dr. Cao led a series of activities and proposed “domain-driven data mining” for “actionable knowledge discovery” in complex domains and problems, at a time when discovering “actionable intelligence” was not a trivial task. The significant developments in data science, new-generation AI, and deep neural learning make domain-driven actionable intelligence discovery possible, with progress made in areas such as representing and learning various complexities and intelligences in complex systems, data, and behaviors. In his talk, Dr. Cao first reviewed the aims, progress, and gaps of conventional data mining/knowledge discovery and machine learning, domain-driven actionable knowledge discovery, and the challenges and opportunities in domain-driven actionable intelligence discovery. Then, Dr. Cao discussed related strategic issues in data science thinking [ 8 ], new-generation AI [ 9 ], and actionable deep learning. Dr. Cao shared many thought-provoking illustrations, case studies, and theoretical and practical challenges in industry and government data sciences.

In particular, Dr. Cao has made broad and in-depth contributions to understanding data complexities and data intelligence. One of his recent foci is learning from non-IID data, forming the research on non-IID learning [ 10 , 17 ]. Non-IID learning goes beyond the classic analytical and learning systems based on the common independent and identically distributed (IID) assumption widely taken in existing science, technology, and engineering. It studies the comprehensive non-IIDnesses [ 5 ], i.e., coupling relationships and interactions (including but beyond correlation and dependency) [ 6 ], and heterogeneities (including but beyond nonidentical distribution) in data, behaviors, and systems. The research on non-IID learning has extended to almost all areas in data mining, analytics, and learning [ 17 ], such as non-IID data preparation, non-IID feature engineering, non-IID representation learning, non-IID similarity and metric learning, non-IID statistical learning, non-IID learning architecture, non-IID ensemble learning, non-IID federated learning, non-IID transfer learning, non-IID evaluation and validation, and various non-IID learning applications, such as non-IID recommender systems, non-IID outlier detection, non-IID information retrieval, and non-IID image and vision learning [ 5 , 20 , 35 , 47 , 55 ].

For instance, Cao [ 7 ] emphasized the critical issues arising from the intrinsic assumption in existing recommender systems that users and items are IID, which leads to false, misleading, or incorrect recommendations and to poor performance in cold-start, sparse, and dynamic recommendation settings. Therefore, a non-IID theoretical framework is needed in order to build a deep and comprehensive understanding of the intrinsic nature of recommendation problems, from the perspective of both couplings and heterogeneities. Such research investigations led by Dr. Cao have triggered the paradigm shift from IID to non-IID recommendation research and can hopefully deliver informed, relevant, personalized, and actionable recommendations. Altogether, these contributions led to exciting new directions and fundamental solutions to address various challenges including cold-start, sparse data-based, cross-domain, group-based, and shilling attack-related issues in recommender systems.

3.2 A deep learning framework

We invited Dr. Balaji Padmanabhan for his keynote talk titled “Domain-Driven Data Mining: Examples and a Deep Learning Framework.” Dr. Padmanabhan is the Anderson Professor of Global Management and Professor of Information Systems at the University of South Florida’s Muma College of Business, where he is also the director of the Center for Analytics and Creativity. He has worked in data science, AI/machine learning, and business analytics for over two decades in the areas of research, teaching, business management, mentoring graduate students, and designing academic programs. He has also worked with over twenty firms on machine learning and data science initiatives in a variety of sectors. He has published extensively in data science and related areas at premier journals and conferences in the field and has served on the editorial board of leading journals including Management Science, MIS Quarterly, INFORMS Journal on Computing, Information Systems Research, Big Data, ACM Transactions on MIS, and the Journal of Business Analytics.

Dr. Padmanabhan witnessed and led the development of data mining. “I did my PhD at that time when the term of data mining first came up,” he shared with the workshop audience before reviewing the history of domain-driven data mining research. He then presented a series of examples from the last two decades of his work. In generalizing from these examples, he emphasized that the extent to which “domain” matters often differs across data mining endeavors. Dr. Padmanabhan encouraged the workshop audience to “think domain-driven,” which often motivates novel domain-driven methods that can also be applied more broadly (or “domain free”). Dr. Padmanabhan also shared a general framework for domain-driven deep learning in business research and used this framework to show how researchers can highlight significant contributions and position their own papers and ideas. Dr. Padmanabhan’s insightful cases and valuable research advice were greatly appreciated by the workshop audience from the research communities of both computer science and management information systems.

In his talk, Dr. Padmanabhan also shared that his department has completed 100 projects in 7 years with about 30 companies and has funded postdoctoral research in analytics. His department has several outreach initiatives, such as the Economic Analytics Initiative and the Florida Business Analytics Forum. Dr. Padmanabhan highlighted that such industrial collaborations and initiatives have greatly rewarded research activities, particularly in domain-driven data mining projects. Dr. Padmanabhan encouraged researchers to actively reach out to industry not only to find data but also to identify new research questions.

3.3 Human resource management

We invited Dr. Hui Xiong for his keynote talk, “Artificial Intelligence in Human Resource Management.” Dr. Hui Xiong is a Distinguished Professor at Rutgers, the State University of New Jersey. He also served as the Smart City Chief Scientist and the Deputy Dean of Baidu Research Institute in charge of several research laboratories. He is a co-Editor-in-Chief of the Encyclopedia of GIS and an Associate Editor of IEEE Transactions on Big Data (TBD), ACM Transactions on Knowledge Discovery from Data (TKDD), and ACM Transactions on Management Information Systems (TMIS). Dr. Xiong has chaired many international conferences in data mining, serving as a Program Co-Chair (2013) and a General Co-Chair (2015) of the IEEE International Conference on Data Mining (ICDM), and as a Program Co-Chair of the Research Track (2018) and the Industry Track (2012) of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Dr. Xiong’s research has generated substantive impact beyond academia. He is an ACM Distinguished Scientist and has been honored with the ICDM-2011 Best Research Paper Award, the 2017 IEEE ICDM Outstanding Service Award, and the 2018 Ram Charan Management Practice Award as the Grand Prix winner from the Harvard Business Review. In 2020, he was named an AAAS Fellow and an IEEE Fellow.

Dr. Xiong shared a success story in leveraging big data technology for human resource management. Indeed, the availability of large-scale human resource (HR) data has created unparalleled opportunities for business leaders to understand talent behaviors and generate useful talent knowledge, which in turn delivers intelligence for real-time decision making and effective people management at work. In his talk, Dr. Xiong introduced a set of innovative artificial intelligence (AI) techniques developed for intelligent human resource management tasks, such as recruiting, performance evaluation, talent retention, talent development, job matching, team management, leadership development, and organization culture analysis. Drawing on his rich experience and close collaborations with industry, Dr. Xiong demonstrated how the results of talent analytics can be used for other business applications, such as market trend analysis and financial investment.

4 Concluding remarks

This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. This special issue and related activities on recent advances in domain-driven data mining continued the previous efforts including the workshop series on the same topic during 2007–2014 with the IEEE International Conference on Data Mining and a special issue published by the IEEE Transactions on Knowledge and Data Engineering [ 44 ]. Although many scholars have made significant contributions with the theme of domain-driven data mining, there are still various new research problems and challenges calling for more research investigations in the coming years. We hope this special issue is helpful for scholars working along this critically important line of research.

1. Alves, G., Amblard, M., Bernier, F., Couceiro, M., Napoli, A.: Reducing unintended bias of ML models on tabular and textual data. In: DSAA, pp. 1–10 (2021)

2. Basak, A., Schmidt, K.M., Mengshoel, O.J.: From data to interpretable models: machine learning for soil moisture forecasting. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00347-8

3. Cao, L.: Domain-driven data mining: challenges and prospects. IEEE Trans. Knowl. Data Eng. 22 (6), 755–769 (2010)

4. Cao, L.: Combined mining: analyzing object and pattern relations for discovering and constructing complex yet actionable patterns. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 3 (2), 140–155 (2013)

5. Cao, L.: Non-iidness learning in behavioral and social data. Comput. J. 57 (9), 1358–1370 (2014)

6. Cao, L.: Coupling learning of complex interactions. Inf. Process. Manag. 51 (2), 167–186 (2015)

7. Cao, L.: Non-iid recommender systems: a review and framework of recommendation paradigm shifting. Engineering 2 (2), 212–224 (2016)

8. Cao, L.: Data Science Thinking: The Next Scientific, Technological and Economic Revolution. Data Analytics. Springer, Berlin (2018)

9. Cao, L.: A new age of AI: features and futures. IEEE Intell. Syst. 37 (1), 25–37 (2022)

10. Cao, L.: Beyond i.i.d.: non-iid thinking, informatics, and learning. IEEE Intell. Syst. 37 (4), 5–17 (2022)

11. Cao, L., Zhang, C.: Domain-driven actionable knowledge discovery in the real world. In: PAKDD 2006, pp. 821–830 (2006)

12. Cao, L., Zhang, C.: The evolution of kdd: towards domain-driven data mining. IJPRAI 21 (4), 677–692 (2007)

13. Cao, L., Zhu, C.: Personalized next-best action recommendation with multi-party interaction learning for automated decision-making. PLoS ONE 17, 1–22 (2022)

14. Cao, L., Luo, D., Zhang, C.: Knowledge actionability: satisfying technical and business interestingness. IJBIDM 2 (4), 496–514 (2007)

15. Cao, L., Zhang, C., Yang, Q., Bell, D.A., Vlachos, M., Taneri, B., Keogh, E.J., Yu, P.S., Zhong, N., Ashrafi, M.Z., Taniar, D., Dubossarsky, E., Graco, W.: Domain-driven, actionable knowledge discovery. IEEE Intell. Syst. 22 (4), 78–88 (2007)

16. Cao, L., Yu, P.S., Zhang, C., Zhao, Y.: Domain Driven Data Mining. Springer, Berlin (2010)

17. Cao, L., Philip, S.Y., Zhao, Z.: Shallow and deep non-iid learning on complex data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022)

18. Carlevaro, A., Mongelli, M.: A new SVDD approach to reliable and explainable AI. IEEE Intell. Syst. 37 (2), 55–68 (2022)

19. Dey, A., Heger, A., England, D.: Urban fire station location planning using predicted demand and service quality index. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00328-x

20. Do, T.D.T., Cao, L.: Gamma-Poisson dynamic matrix factorization embedded with metadata influence. In: NeurIPS 2018, pp. 5829–5840 (2018)

21. He, F., Li, Y., Xu, T., Yin, L., Zhang, W., Zhang, X.: A data-analytics approach for risk evaluation in peer-to-peer lending platforms. IEEE Intell. Syst. 35 (3), 85–95 (2020)

22. Jasinska-Piadlo, A., Bond, R., Biglarbeigi, P., Brisk, R., Campbell, P., Browne, F., McEneaneny, D.: Data-driven versus a domain-led approach to k-means clustering on an open heart failure dataset. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00346-9

23. Jin, B., Yang, H., Sun, L., Liu, C., Qu, Y., Tong, J.: A treatment engine by predicting next-period prescriptions. In: KDD, pp. 1608–1616 (2018)

24. Kanter, J.M., Gillespie, O., Veeramachaneni, K.: Label, segment, featurize: a cross domain framework for prediction engineering. In: DSAA, pp. 430–439 (2016)

25. Ke, W., Liu, C., Shi, X., Dai, Y., Yu, P.S., Zhu, X.: Addressing exposure bias in uplift modeling for large-scale online advertising. In: ICDM, pp. 1156–1161 (2021)

26. Kompan, M., Gaspar, P., Macina, J., Cimerman, M., Bieliková, M.: Exploring customer price preference and product profit role in recommender systems. IEEE Intell. Syst. 37 (1), 89–98 (2022)

27. Lin, J.C.-W., Gan, W., Fournier-Viger, P., Hong, T.-P., Tseng, V.S.: Mining high-utility itemsets with various discount strategies. In: DSAA, pp. 1–10 (2015)

28. Liu, C., Zhu, W.: Precision coupon targeting with dynamic customer triage. In: DSAA, pp. 420–428 (2020)

29. Liu, Q., Zeng, X., Liu, C., Zhu, H., Chen, E., Xiong, H., Xie, X.: Mining indecisiveness in customer behaviors. In: ICDM, pp. 281–290 (2015)

30. Long, M., Wang, J., Sun, J.-G., Yu, P.S.: Domain invariant transfer kernel learning. IEEE Trans. Knowl. Data Eng. 27 (6), 1519–1532 (2015)

31. Ma, D., Narayanan, V.K., Liu, C., Fakharizadi, E.: Boundary salience: the interactive effect of organizational status distance and geographical proximity on coauthorship tie formation. Soc. Netw. 63, 162–173 (2020)

32. Melucci, M.: Investigating sample selection bias in the relevance feedback algorithm of the vector space model for information retrieval. In: DSAA, pp. 83–89 (2014)

33. Nasir, M., Ezeife, C.I.: Semantic enhanced Markov model for sequential e-commerce product recommendation. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00343-y

34. O’Leary, D.E.: Ethics for big data and analytics. IEEE Intell. Syst. 31 (4), 81–84 (2016)

35. Pang, G., Cao, L., Chen, L.: Homophily outlier detection in non-iid categorical data. Data Min. Knowl. Discov. 35 (4), 1163–1224 (2021)

36. Ruiz-Dolz, R., Alemany, J., Barberá, S.H., García-Fornes, A.: Transformer-based models for automatic identification of argument relations: a cross-domain evaluation. IEEE Intell. Syst. 36 (6), 62–70 (2021)

37. Sun, H.-C., Lin, T.-Y., Tsai, Y.-L.: Performance prediction in major league baseball by long short-term memory networks. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00313-4

38. Teng, M., Zhu, H., Liu, C., Xiong, H.: Exploiting network fusion for organizational turnover prediction. ACM Trans. Manag. Inf. Syst. 12 (2), 16:1–16:18 (2021)

39. Villanes, A., Healey, C.G.: Domain-specific text dictionaries for text analytics. Int. J. Data Sci. Anal., Special Issue on Domain-Driven Data Mining (2022)

40. Xiang, H., Lin, J., Chen, C.-H., Kong, Y.: Asymptotic meta learning for cross validation of models for financial data. IEEE Intell. Syst. 35 (2), 16–24 (2020)

41. Xu, L., Wei, X., Cao, J., Yu, P.S.: Multiple social role embedding. In: DSAA, pp. 581–589. IEEE (2017)

42. Yang, D., Bingqing, Q., Cudré-Mauroux, P.: Location-centric social media analytics: challenges and opportunities for smart cities. IEEE Intell. Syst. 36 (5), 3–10 (2021)

43. Yang, J., Liu, C., Teng, M., Xiong, H., Liao, M., Zhu, V.: Exploiting temporal and social factors for B2B marketing campaign recommendations. In: ICDM, pp. 499–508 (2015)

44. Zhang, C., Yu, P., Bell, D.: Introduction to the domain-driven data mining special section. IEEE Trans. Knowl. Data Eng. 22 (6), 753–754 (2010)

45. Zhang, J., He, M.: CRTL: context restoration transfer learning for cross-domain recommendations. IEEE Intell. Syst. 36 (4), 65–72 (2021)

46. Zhang, K., Chen, E., Liu, Q., Liu, C., Lv, G.: A context-enriched neural network method for recognizing lexical entailment. In: AAAI, pp. 3127–3134 (2017)

47. Zhang, Q., Cao, L., Zhu, C., Li, Z., Sun, J.: Coupledcf: learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering. In: IJCAI 2018, pp. 3662–3668 (2018)

48. Zhang, X., Wang, Y., Zhang, L., Jin, B., Zhang, H.: Exploring unsupervised multivariate time series representation learning for chronic disease diagnosis. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-021-00290-0

49. Zhang, Y., Liu, G., Liu, A., Zhang, Y., Li, Z., Zhang, X., Li, Q.: Personalized geographical influence modeling for POI recommendation. IEEE Intell. Syst. 35 (5), 18–27 (2020)

50. Zhang, Y., Bai, G., Zhong, M., Li, X., Ryan, K.L.K.: Differentially private collaborative coupling learning for recommender systems. IEEE Intell. Syst. 36 (1), 16–24 (2021)

51. Zhang, Y., Zhang, X., Shen, T., Zhou, Y., Wang, Z.: Feature-option-action: a domain adaption transfer reinforcement learning framework. In: DSAA, pp. 1–12 (2021)

52. Zhang, Z., Liu, Q., Huang, Z., Wang, H., Lu, C., Liu, C., Chen, E.: Graphmi: extracting private graph data from graph neural networks. In: IJCAI, pp. 3749–3755 (2021)

53. Zhao, J., Lv, W., Du, B., Ye, J., Sun, L., Xiong, G.: Deep multi-task learning with flexible and compact architecture search. Int. J. Data Sci. Anal., Special Issue on Domain-Driven Data Mining (2022)

54. Zhao, Y., Zhang, H., Cao, L., Zhang, C., Bohlscheid, H.: Combined pattern mining: from learned rules to actionable knowledge. In: AI 2008, pp. 393–403 (2008)

55. Zhu, C., Cao, L., Yin, J.: Unsupervised heterogeneous coupling learning for categorical representation. IEEE Trans. Pattern Anal. Mach. Intell. 44 (1), 533–549 (2022)

Author information

Authors and affiliations.

The University of Tennessee, Knoxville, USA

Chuanren Liu

Snap Inc., Seattle, WA, USA

Ehsan Fakharizadi

University of Science and Technology of China, Hefei, China

University of Illinois Chicago, Chicago, USA

Philip S. Yu

Corresponding author

Correspondence to Chuanren Liu.

About this article

Liu, C., Fakharizadi, E., Xu, T. et al. Recent advances in domain-driven data mining. Int J Data Sci Anal 15, 1–7 (2023). https://doi.org/10.1007/s41060-022-00378-1

Published: 27 December 2022

Issue Date: January 2023

DOI: https://doi.org/10.1007/s41060-022-00378-1

Try Machine Learning

Data Mining Research Topics

  • By Alex Wilson
  • Published January 31, 2020
  • Updated January 31, 2020

Data mining is a rapidly growing field that involves extracting useful patterns and knowledge from large datasets. Researchers in this field study various techniques and algorithms to mine and analyze data for effective decision-making. If you are interested in pursuing research in data mining, this article explores some of the current and emerging research topics in the field.

Key Takeaways:

  • Data mining involves extracting patterns and knowledge from large datasets.
  • Researchers study various techniques and algorithms for effective decision-making.
  • Current and emerging research topics in data mining include deep learning, anomaly detection, and social network analysis.

1. Deep Learning for Data Mining

Deep learning has gained significant attention in recent years as a powerful approach for data mining. By leveraging deep neural networks, researchers can tackle complex problems such as image recognition, natural language processing, and sentiment analysis with improved accuracy and efficiency. *Deep learning has revolutionized many areas, including computer vision and natural language processing.* Investigating novel deep learning models and architectures for data mining tasks is an exciting research avenue.

2. Anomaly Detection

Detecting anomalies in data is crucial for identifying outliers, fraud, and unusual patterns. Researchers in data mining are focused on developing robust anomaly detection techniques that can handle noisy and dynamic datasets. *Anomaly detection has applications in cybersecurity, finance, and healthcare.* Exploring novel algorithms and approaches to detect and classify anomalies is an ongoing area of research.
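As a small illustration of a technique that tolerates noisy data, the sketch below flags outliers with the modified z-score, which is built on the median and the median absolute deviation (MAD) rather than the mean and standard deviation. The readings are hypothetical, and the 3.5 cutoff is a commonly used convention rather than a universal rule.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    # 0.6745 rescales the MAD to be comparable with a standard deviation.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sensor readings
print(mad_outliers(readings))
```

Because the median barely moves when one extreme value is added, the outlier cannot mask itself the way it can when a mean and standard deviation are used.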

3. Social Network Analysis

Social network analysis involves studying the relationships, interactions, and structure of social networks. With the exponential growth of online social platforms, mining and analyzing social network data has become essential for understanding social dynamics, influence propagation, and community detection. *Social network analysis can help organizations understand their target audience and design effective marketing strategies.* Researchers are actively working on developing advanced algorithms to analyze large-scale social network datasets.
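One of the most basic structural measures is degree centrality, the share of other members a person is directly connected to. The sketch below computes it for a tiny undirected network; the names and friendships are hypothetical, and larger analyses would normally use a dedicated graph library.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Hypothetical friendship edges in an undirected network.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
centrality = degree_centrality(edges)
print(centrality)
```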

4. Privacy-Preserving Data Mining

Privacy is a major concern as data mining techniques become more powerful and data availability increases. Privacy-preserving data mining aims to develop algorithms and practices that allow for effective data analysis while ensuring the protection of individual privacy. *Privacy-preserving techniques can enable collaboration between organizations without compromising sensitive information.* Investigating privacy-preserving methods for data mining is an important research direction.
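As one concrete illustration, differential privacy releases aggregate statistics with calibrated random noise. The sketch below applies the Laplace mechanism to a count query; the privacy budget `epsilon`, the seed, and the helper names are illustrative assumptions, and production systems should rely on a vetted library rather than this toy sampler.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; a count query has sensitivity 1."""
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(42)  # fixed seed only so the example is repeatable
released = [private_count(100, epsilon=0.5, rng=rng) for _ in range(2000)]
average = sum(released) / len(released)
print(round(average, 2))
```

Each individual release is perturbed, yet the noise has zero mean, so aggregate analysis over many releases stays close to the true value. A smaller `epsilon` gives stronger privacy at the price of noisier answers.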

5. Stream Mining

Traditional data mining techniques often assume that the entire dataset is available upfront. However, in many real-world scenarios, data arrives continuously as streams. Stream mining deals with deriving useful insights and patterns from rapidly changing and potentially infinite data streams. *Stream mining is relevant in applications such as real-time monitoring and dynamic data analysis.* Developing efficient algorithms for stream mining is an ongoing research challenge.
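A defining constraint of stream mining is that each item is seen once and cannot be stored. The sketch below uses Welford's online algorithm to maintain the mean and variance of a stream in constant memory; the class name and the sample values are illustrative.

```python
class RunningStats:
    """Welford's online algorithm: one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the items seen so far."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

The same update pattern, a small summary refreshed per item, underlies many stream algorithms, including sketches and sliding-window statistics.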

6. Time Series Analysis

Time series analysis involves studying datasets that are collected and recorded over time. Understanding the patterns and trends in time series data is essential for forecasting, anomaly detection, and trend analysis. *Time series analysis is used in domains such as finance, meteorology, and healthcare.* Researchers are actively exploring new algorithms and techniques for effective time series analysis and prediction.
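A classic baseline for forecasting is simple exponential smoothing, which tracks a level that weights recent observations more heavily than older ones. The sketch below is a minimal version; the series and the smoothing factor `alpha` are hypothetical, and the method ignores trend and seasonality.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation (0 < alpha <= 1)."""
    level = series[0]
    smoothed = [level]
    for x in series[1:]:
        # Blend the new observation with the previous level.
        level = alpha * x + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

demand = [10, 12, 14]  # hypothetical observations
levels = exponential_smoothing(demand, alpha=0.5)
forecast = levels[-1]  # one-step-ahead forecast is the latest level
print(levels, forecast)
```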

7. Unsupervised Learning

Unsupervised learning is a branch of machine learning where the algorithms learn patterns and relationships in data without any labeled training samples. Researchers in data mining are focused on developing efficient unsupervised learning algorithms for tasks such as clustering, dimensionality reduction, and outlier detection. *Unsupervised learning can help uncover hidden insights and structures in data.* Investigating novel unsupervised learning techniques is an interesting research area.

8. Educational Data Mining

Educational institutions generate vast amounts of data, including student records, learning activities, and performance metrics. Educational data mining aims to extract valuable knowledge from these datasets to understand student behavior, identify at-risk students, and improve learning outcomes. *Educational data mining has the potential to transform the field of education.* Researchers are exploring new techniques and models to analyze educational data effectively.

9. Big Data Analytics

The advent of big data has necessitated the development of efficient analytics techniques. Big data analytics involves processing and analyzing large volumes of diverse data to extract valuable insights and patterns. *Big data analytics has transformed industries such as healthcare, marketing, and finance.* Researchers are actively investigating scalable algorithms and tools to handle the challenges posed by big data analytics.

In conclusion, data mining is a field with a broad range of research topics and applications. Researchers are constantly exploring new techniques and algorithms to extract useful knowledge from large datasets. Key topics include deep learning, anomaly detection, social network analysis, privacy-preserving data mining, stream mining, time series analysis, unsupervised learning, educational data mining, and big data analytics. These research areas present exciting opportunities for advancing data mining capabilities and addressing real-world challenges.


Common Misconceptions

Misconception 1: Data Mining is Just About Collecting Data

One common misconception about data mining is that it is only about collecting data. While data collection is an essential aspect of data mining, it is just the starting point. Data mining involves analyzing and extracting valuable insights from the collected data to make informed decisions or predictions.

  • Data mining involves analyzing and interpreting collected data.
  • Data collection is just the first step in the data mining process.
  • Data mining helps businesses gain valuable insights and improve decision-making.

Misconception 2: Data Mining is Invasive and Violates Privacy

Another misconception is that data mining is invasive and violates privacy. While it is true that data mining requires access to large amounts of data, ethical data mining practices prioritize the protection of individual privacy. Privacy guidelines and regulations typically call for personal information to be anonymized or aggregated before analysis.

  • Ethical data mining practices protect individual privacy.
  • Data mining can be done in compliance with privacy regulations.
  • Data can be anonymized or aggregated before analysis to ensure privacy.

Misconception 3: Data Mining is Only for Big Companies

Many people believe that data mining is only relevant and accessible to big companies with vast resources. However, data mining techniques can be beneficial for businesses of all sizes. With advancements in technology and the availability of user-friendly tools, even small businesses can leverage data mining to understand customer preferences and optimize their operations.

  • Data mining techniques benefit businesses of all sizes.
  • Advancements in technology have made data mining accessible to small businesses.
  • Data mining helps small businesses understand customer preferences and improve operations.

Misconception 4: Data Mining is the Same as Machine Learning

People often confuse data mining with machine learning, thinking that both terms refer to the same thing. While they are related concepts, they have distinct differences. Data mining focuses on discovering patterns, relationships, and insights from data, while machine learning deals with creating algorithms that can learn from and make predictions based on data.

  • Data mining discovers patterns and insights from data.
  • Machine learning creates algorithms that learn from data.
  • Data mining and machine learning are related but have distinct differences.

Misconception 5: Data Mining Predicts Future with 100% Accuracy

One misconception is that data mining can predict the future with 100% accuracy. While data mining can provide valuable insights and make predictions based on historical data patterns, it is not infallible. The accuracy of predictions depends on various factors such as data quality, model accuracy, and external influences. Data mining should be seen as a tool to assist decision-making rather than a crystal ball.

  • Data mining makes predictions based on historical data patterns.
  • Prediction accuracy depends on various factors.
  • Data mining is not a foolproof method for predicting the future.


Data mining is a rapidly evolving field that combines statistical analysis, machine learning, and database management to uncover valuable patterns and knowledge from vast amounts of data. In this article, we explore ten intriguing research topics in data mining. The tables below provide insightful information about each topic, showcasing their relevance and potential impact.

Topic 1: Fraud Detection

Data mining plays a crucial role in detecting fraudulent activities across various industries. This table highlights the percentage of successful fraud detections in different sectors.

Topic 2: Customer Segmentation

Understanding customer behavior is vital for businesses. This table demonstrates the most common customer segmentation techniques and their respective impact on customer satisfaction.

Topic 3: Social Media Analysis

Data mining enables extracting valuable insights from social media platforms. The following table showcases the most discussed topics on Twitter and their associated sentiment scores.

Topic 4: Predictive Analytics

Predictive analytics utilizes historical data to make future predictions. This table depicts the accuracy of various predictive models in predicting stock market trends.

Topic 5: Text Mining

Text mining explores large text collections to uncover meaningful patterns. This table demonstrates sentiment analysis performance on customer reviews for different product categories.

Topic 6: Anomaly Detection

Anomaly detection helps identify unusual patterns or outliers in datasets. The following table displays the top industries that have benefited from anomaly detection techniques.

Topic 7: Recommender Systems

Recommender systems suggest relevant items to users based on their preferences. This table presents the success rate of different collaborative filtering algorithms.

Topic 8: Image Mining

Image mining focuses on extracting meaningful information from images. This table highlights the accuracies of different image classification algorithms.

Topic 9: Healthcare Analytics

Data mining is revolutionizing healthcare by improving patient care and reducing costs. This table presents the percentage of hospitals implementing data mining techniques.

Topic 10: Privacy-Preserving Data Mining

Preserving privacy is crucial when dealing with sensitive data. This table showcases the privacy protection levels of different privacy-preserving data mining methods.

By exploring different research topics in data mining, we can witness its broad applicability in various domains. As data continues to grow exponentially, data mining will continue to evolve, offering endless possibilities for extracting valuable insights and enhancing decision-making processes.

Frequently Asked Questions

What is data mining?

Data mining is the process of extracting knowledge or insights from large datasets. It involves the use of various techniques, algorithms, and tools to discover patterns, correlations, and hidden information within the data.

How is data mining different from data analysis?

Data mining and data analysis are closely related but different. Data analysis focuses on examining and interpreting existing data to gain insights, while data mining involves exploring data to discover new patterns and knowledge.

Why is data mining important in research?

Data mining plays a crucial role in research as it enables researchers to analyze and interpret large datasets in order to identify trends, relationships, and patterns that might not be noticeable through traditional analysis methods. It also helps in making data-driven decisions and predictions.

What are some common data mining techniques?

Common data mining techniques include association rule mining, classification, clustering, regression, and anomaly detection. These techniques utilize algorithms such as Apriori, decision trees, k-means, and neural networks, among others.
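As a rough illustration of association rule mining, the sketch below computes the two basic measures, support and confidence, over a handful of made-up shopping baskets. It is not the Apriori algorithm itself (there is no level-wise candidate pruning); the `transactions` data and the function names are assumptions chosen for the example.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support is at least `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            result[pair] = s
    return result

pairs = frequent_pairs(transactions, min_support=0.6)
rule_conf = confidence({"bread"}, {"milk"}, transactions)
print(pairs)
print("confidence(bread -> milk) =", round(rule_conf, 2))
```

A full Apriori implementation would extend this by growing itemsets level by level and discarding any candidate that has an infrequent subset.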

What are some popular research topics in data mining?

Some popular research topics in data mining are text mining, social network analysis, recommendation systems, big data analytics, privacy-preserving data mining, and stream mining. These areas present significant challenges and opportunities for researchers.

How can data mining help in healthcare research?

Data mining can aid healthcare research by analyzing vast amounts of patient data to identify patterns, predict disease outcomes, improve diagnosis accuracy, detect adverse events, and optimize treatment plans. It has the potential to enhance patient care and contribute to medical advancements.

What are the ethical considerations in data mining research?

Data mining raises ethical concerns regarding data privacy, data ownership, informed consent, data anonymization, and potential biases in the algorithms. Researchers must ensure compliance with ethical guidelines and take steps to protect individuals’ privacy and rights.

What are the challenges faced in data mining research?

Data mining research faces challenges such as handling high-dimensionality data, dealing with noisy and incomplete data, scalability of algorithms, interpretability of results, and ethical implications. Addressing these challenges requires continuous advancements in algorithms and techniques.

How can one get started with data mining research?

To get started with data mining research, one should gain a solid understanding of data mining concepts and techniques. This can be achieved through studying relevant literature, attending conferences and workshops, and taking courses on data mining and machine learning. Hands-on experience with data mining tools and datasets is also crucial.

What are some influential data mining research papers?

There are several influential data mining publications, including the paper “Fast Algorithms for Mining Association Rules” by Rakesh Agrawal and Ramakrishnan Srikant, the textbook “Data Mining: Concepts and Techniques” by Jiawei Han and Micheline Kamber, and the article “A Few Useful Things to Know about Machine Learning” by Pedro Domingos. These works have made significant contributions to the field.


Data Mining and Modeling

The proliferation of machine learning means that learned classifiers lie at the core of many products across Google. However, questions in practice are rarely so clean as to just use an out-of-the-box algorithm. A big challenge is in developing metrics, designing experimental methodologies, and modeling the space to create parsimonious representations that capture the fundamentals of the problem. These problems cut across Google’s products and services, from designing experiments for testing new auction algorithms to developing automated metrics to measure the quality of a road map.

Data mining lies at the heart of many of these questions, and the research done at Google is at the forefront of the field. Whether it is finding more efficient algorithms for working with massive data sets, developing privacy-preserving methods for classification, or designing new machine learning approaches, our group continues to push the boundary of what is possible.


20 Interesting Data Mining Projects in 2024 (for Students)

  • Feb 07, 2024
  • By Apurva Sharma


Data is the most powerful weapon in today’s world. With technological advancement in the field of data science and artificial intelligence, machines can now make decisions that benefit a firm. Here we present 20 interesting data mining project ideas that students can also build for their final year. So let’s get started!

What is Data Mining?

Data mining is the process of extracting useful information from huge sets of data to identify patterns and trends, which allows businesses and large firms to analyze that data and make decisions.

In layman’s terms, Data Mining is the process of recognizing hidden patterns in the information extracted from the user or data that is relevant to the company’s business. This is passed through various data-wrangling techniques.

The useful data is collected and stored in places such as data warehouses, where efficient analysis and data mining algorithms support decision-making and other data requirements, which helps firms cut costs and generate revenue.

It is not an easy subject to understand in university when there is always so much more work to be done. You can get expert data mining help online now for instant doubt-solving.

According to Glassdoor, the average salary of a Data Mining Engineer in the US is around $120,000. But what is the best way to practice? By making some amazing data mining projects.

20 Data Mining Project Ideas for Students

While there are many beginner-level data science projects available, we select some of the best project ideas for students that they can build to either showcase it on their resume or make it for their final year submission:

1) Fake news detection

With the advent of the technological revolution, it is easier for users to have access to the internet which increases the probability of fake news spreading like wildfire.

In the fake news detection project, you will learn how to classify news as Real or Fake. It is one of the newer ideas for data mining projects and is quite popular among students.

You will use PassiveAggressiveClassifier to perform this classification.
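To give a feel for what a passive-aggressive classifier does, here is a minimal from-scratch sketch of the PA-I update rule on bag-of-words features. It does not call the library class named above; the toy headlines, the `C` value, and all function names are assumptions made for illustration, and a real project would more likely pair TF-IDF features with a library implementation.

```python
from collections import defaultdict

def featurize(text):
    """Bag-of-words term counts for a lower-cased, whitespace-split text."""
    counts = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    return counts

def score(weights, features):
    """Dot product between the weight vector and a sparse feature dict."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())

def train_passive_aggressive(samples, C=1.0, epochs=5):
    """PA-I training loop; labels are +1 (real) or -1 (fake)."""
    weights = {}
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text)
            loss = max(0.0, 1.0 - label * score(weights, x))  # hinge loss
            norm_sq = sum(v * v for v in x.values())
            if loss == 0.0 or norm_sq == 0.0:
                continue  # "passive": the margin is already satisfied
            tau = min(C, loss / norm_sq)  # "aggressive": capped step size
            for tok, val in x.items():
                weights[tok] = weights.get(tok, 0.0) + tau * label * val
    return weights

def predict(weights, text):
    """Return "REAL" for a non-negative score, otherwise "FAKE"."""
    return "REAL" if score(weights, featurize(text)) >= 0 else "FAKE"

# Hypothetical toy headlines, invented for illustration only.
train = [
    ("scientists publish peer reviewed climate study", +1),
    ("city council approves new budget", +1),
    ("miracle pill cures everything overnight", -1),
    ("shocking secret celebrities do not want you to know", -1),
]
weights = train_passive_aggressive(train)
print(predict(weights, "miracle pill cures everything overnight"))
print(predict(weights, "city council approves new budget"))
```

The model only changes its weights when an example is misclassified or falls inside the margin, which is what makes this family of algorithms suitable for streams of incoming articles.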


2) Detecting Phishing Websites

Technological advancement has driven the growth of e-commerce sites, and most users now shop online, which requires them to provide sensitive information such as bank details, usernames, and passwords.

Fraudsters and cybercriminals use this opportunity and create fake sites that look similar to the original to collect sensitive user data. In this data mining project, you will develop an algorithm to detect phishing sites based on characteristics like security and encryption criteria, URL, domain identity, etc. 
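A common first step is turning each URL into numeric or boolean features that a classifier can consume. The sketch below uses only the Python standard library; the particular features (`uses_https`, `host_is_ip`, and so on) are hand-picked assumptions for illustration, not a vetted phishing feature set.

```python
import ipaddress
from urllib.parse import urlparse

def url_features(url):
    """Extract a few illustrative features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
        "url_length": len(url),
        "has_at_symbol": "@" in url,
        # Dots beyond the one separating domain and top-level domain.
        "num_subdomains": 0 if host_is_ip else max(host.count(".") - 1, 0),
        "has_hyphen_in_host": "-" in host,
    }

# Hypothetical URLs used only to exercise the function.
for example in [
    "https://www.example.com/",
    "http://192.168.0.1/login",
    "http://paypal.com@evil-site.example/login",
]:
    print(example, url_features(example))
```

Each dictionary can then be converted into a feature vector and fed to any of the classifiers discussed in this article.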

3) Diabetes prediction

Diabetes is one of the most common and hazardous diseases on the planet. It requires a lot of care and proper medication to keep the disease in control. This data mining project teaches you to develop a classification system to detect whether the patient has diabetes or not.

As part of this project, you will learn about decision trees, Naive Bayes, SVMs, etc. Find the dataset here.
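At the heart of a decision tree is the search for a split that makes the resulting groups as pure as possible. The following sketch finds the best threshold on a single feature using Gini impurity; the glucose readings and labels are made up for illustration and are not taken from the project dataset.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Find the threshold on one feature that minimises weighted Gini impurity."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    n = len(labels)
    # Try the midpoint between each pair of neighbouring distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = len(left) / n * gini(left) + len(right) / n * gini(right)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical glucose readings (mg/dL) and diabetes labels.
glucose = [85, 90, 99, 110, 140, 155, 160, 180]
has_diabetes = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(glucose, has_diabetes)
print("best split at", threshold, "with impurity", impurity)
```

A full decision tree repeats this search over every feature and then recurses into the left and right groups.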


4) House price prediction

In this data mining project, you will utilize data science techniques like machine learning to predict the house price at a particular location. This project finds applications in real estate industries to predict house prices based on previous data.

The data can include the location and size of the house and the facilities near it. This data mining project is an evergreen topic in the USA. Find the dataset here.
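The simplest baseline for this kind of project is a one-variable linear regression of price on size. The sketch below fits it with the closed-form least-squares formulas; the sizes and prices are invented numbers, and a realistic model would add location and other features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical sizes (square feet) and prices (thousands of dollars).
sizes = [1000, 1500, 2000, 2500]
prices = [200, 275, 350, 425]
intercept, slope = fit_line(sizes, prices)
predicted = intercept + slope * 1800
print("predicted price for 1800 sq ft:", round(predicted, 1))
```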

5) Credit Card Fraud Detection

With the increase in online transactions, credit card fraud has also increased. Banks are trying to handle this issue using data mining techniques. In this data mining project, we use Python to create a classification problem to detect credit card fraud by analyzing the previously available data.

We have made this credit card fraud detection project using machine learning here.

6) Detecting Parkinson’s disease

Data mining techniques are widely utilized in the healthcare industry to provide quality treatment by analyzing the patient’s medical records.

In the Parkinson's disease detection project for data mining, you will learn to predict Parkinson’s disease using Python. The project works with UCI ML Parkinson’s dataset.

Find more information about the project dataset here.

7) Anime recommendation system

This is one of the favorite data mining project ideas among students. An enthusiast in this field can easily get involved and excited by such topics.

This dataset contains preference data from 73,516 users on 12,294 anime. Each user can add anime to their list and give a rating, and this dataset is a compilation of those ratings. The aim is to create an efficient anime recommendation system based only on user viewing history. Find the dataset here.
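One possible approach is item-based collaborative filtering: titles are considered similar when the same users rate them similarly. The sketch below uses cosine similarity over a tiny, invented ratings table; the user IDs, titles, and scoring rule are assumptions for illustration only.

```python
import math

# Hypothetical ratings: user -> {title: rating}. Invented for illustration.
ratings = {
    "u1": {"A": 9, "B": 8, "C": 2},
    "u2": {"A": 8, "B": 9},
    "u3": {"A": 2, "C": 9, "D": 8},
    "u4": {"C": 8, "D": 9},
}

def item_vector(title):
    """Ratings for one title, keyed by the users who rated it."""
    return {user: r[title] for user, r in ratings.items() if title in r}

def cosine(title_a, title_b):
    """Cosine similarity between two titles' rating vectors."""
    va, vb = item_vector(title_a), item_vector(title_b)
    shared = set(va) & set(vb)
    if not shared:
        return 0.0
    dot = sum(va[u] * vb[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b)

def recommend(user, top_n=1):
    """Rank unseen titles by similarity to what the user already rated."""
    seen = ratings[user]
    titles = {t for r in ratings.values() for t in r}
    scores = {
        cand: sum(cosine(cand, t) * rating for t, rating in seen.items())
        for cand in titles - set(seen)
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

print(recommend("u2"))
print(recommend("u4"))
```

With tens of thousands of users, the similarity matrix would normally be computed once with sparse matrix operations rather than on demand as shown here.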

8) Mushroom Classification

This dataset contains details of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each mushroom species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended.

This latter category is combined with the poisonous one. The guide states that there is no simple rule to determine if a mushroom is edible; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy. Find more information about the data here.


9) Solar Power Generation Data

This data has been extracted from two solar power plants in India over 34 days. It has two pairs of files: each pair has one power generation dataset and one sensor reading dataset. The power generation datasets are extracted from the inverter level; each inverter has multiple lines of solar panels attached to it.

The sensor data is extracted from a plant level; a single array of sensors is optimally located at the plant. These are concerns at the solar power plant:

  • Can we predict the power generation for the next couple of days?
  • Can we identify the importance of panel cleaning/maintenance?
  • Can we identify faulty or suboptimally performing equipment?

Find the dataset here.
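For the last question, a simple starting point is to compare each inverter's average output with the plant-wide median and flag the ones that fall well below it. The inverter names, yields, and the 0.8 tolerance below are all assumptions for illustration, not values from the dataset.

```python
from statistics import mean, median

def flag_underperformers(daily_yield_by_inverter, tolerance=0.8):
    """Return inverters whose mean daily yield is below tolerance * plant median."""
    means = {inv: mean(vals) for inv, vals in daily_yield_by_inverter.items()}
    plant_median = median(means.values())
    return sorted(inv for inv, m in means.items() if m < tolerance * plant_median)

# Hypothetical daily yields in kWh per inverter (not taken from the dataset).
yields = {
    "INV-A": [510, 495, 505],
    "INV-B": [500, 490, 510],
    "INV-C": [320, 310, 300],  # noticeably lower than its neighbours
    "INV-D": [505, 500, 498],
}
flagged = flag_underperformers(yields)
print("inverters to inspect:", flagged)
```

A fuller analysis would also normalise by irradiation from the sensor data, since a cloudy day lowers every inverter's output at once.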

10) Heart Disease Prediction

Heart disease is one of the most common diseases. It needs a lot of care from the doctor to get diagnosed. In this data mining project, you will learn to develop a system to detect whether the patient is suffering from heart disease or not. In this project, you will learn about the Decision tree, Naive Bayes, SVM calculations, etc. 

This data mining project is more difficult than the others, but it will surely add a lot of credibility to your knowledge of the subject. Find the dataset here.

11) Fraud Detection in Monetary Transactions

Detecting fraudulent transactions is a very significant use case in today’s world of digitized monetary transactions. To address this problem, synthetic data is generated using the PaySim simulator and made available on Kaggle.

The data contains transaction details such as the transaction type, the amount, the customer initiating the transaction, the old and new balances of the origin account (before and after the transaction, respectively), the same balances for the destination account, and the target label indicating whether the transaction is fraud.

So, based on the transaction details, a classification model can be developed that detects fraudulent transactions.

12) Adult Census Income Prediction

The US Census Data is made available at the UCI Machine Learning Repository. The dataset contains variables like age, work class, hours per week, sex, etc., including other variables that can foretell whether the annual income of an individual is greater than 50K dollars or not.

This is a Classification Problem for which a Machine Learning model can be trained to predict the Income Level of an individual.

13) Titanic Survival Prediction

To get started with data mining, this is the go-to project. The Titanic dataset is provided by Kaggle, which also hosts a competition based on it at this link. The data contains explanatory variables describing each passenger, such as Class, Gender, Age, and Fare.

These variables are used to predict whether a passenger survived the Titanic disaster, with Survived (0/1) as the target variable. So, the expectation for this project is to build a classification model that predicts the probable survival of a passenger on the Titanic.

14) Airbnb Market Analysis

Analyzing the Airbnb market is pretty important for the company to figure out where the demand is and how to advertise to people. Using data mining algorithms, they can take a look at where customers are coming from, where properties are located, and how much they cost.

15) NBA Shooting Analysis

If you're just starting in data analysis, looking at NBA shooting stats is a great way to practice. The stats include information about where players shoot from, where they're most likely to score, and how the defender affects the shot.

By using data mining algorithms, you can analyze all of this data to help coaches and players improve their games. Students will love to make this data mining project because everyone likes NBA.

16) Movie Recommendation System

If you watch movies regularly, you must have also spent hours just finding a movie to watch. To save you time, this project is gonna help you a lot. The Movie Recommendation System aims to suggest movies to us based on our preferences, viewing history, ratings, and similarities with other users.

We can structure this project in different ways:

  • Collaborative Filtering: Utilizes user-item interactions to recommend items. It can be implemented using techniques like User-based or Item-based collaborative filtering.
  • Content-Based Filtering: Recommends items similar to those you have liked before based on content attributes like genre, actors, director, etc.
  • Hybrid Approaches: Combines collaborative and content-based filtering for more accurate recommendations.

First, use a dataset containing user ratings, movie metadata, and user interactions. Second, preprocess the data by handling missing values, normalizing ratings, or encoding categorical variables. Then, build recommendation models (such as matrix factorization and k-nearest neighbors) using libraries like Surprise, Scikit-learn, or custom implementations.

Finally, evaluate the models using metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), or precision/recall.
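The two error metrics mentioned above are easy to compute by hand, which helps when checking what a library reports. The ratings below are made-up values used only to exercise the functions.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: penalises large errors more than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical held-out ratings and a model's predictions for them.
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]
print("MAE :", mae(actual_ratings, predicted_ratings))
print("RMSE:", rmse(actual_ratings, predicted_ratings))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions.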

17) Customer Segmentation

Customer Segmentation is also one of the projects based on data mining. It involves grouping customers based on similar characteristics, behaviors, or preferences to tailor marketing strategies or services.

Let’s take a brief look at the approach we have to use:

  • RFM Analysis: It segments customers based on the recency, frequency, and monetary value of their purchases.
  • Clustering Algorithms: Utilizes techniques like k-means clustering or hierarchical clustering to group customers based on features such as demographics, purchase history, or preferences.
  • RFM and Demographic Fusion: Combines RFM analysis with demographic data for more refined segmentation.

It is also an amazing idea for Data Science projects that students can make.
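The RFM part of this approach can be sketched in a few lines: for each customer, compute days since the last purchase, the number of purchases, and the total spend. The purchase log and the reference date below are hypothetical.

```python
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date, amount).
purchases = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 1, 28), 80.0),
    ("c2", date(2023, 10, 2), 40.0),
    ("c3", date(2024, 1, 30), 300.0),
    ("c3", date(2023, 12, 15), 150.0),
    ("c3", date(2023, 11, 1), 50.0),
]

def rfm_table(purchases, today):
    """Build recency (days), frequency (count), and monetary (sum) per customer."""
    table = {}
    for customer, day, amount in purchases:
        row = table.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0}
        )
        row["last"] = max(row["last"], day)
        row["frequency"] += 1
        row["monetary"] += amount
    return {
        c: {
            "recency_days": (today - r["last"]).days,
            "frequency": r["frequency"],
            "monetary": r["monetary"],
        }
        for c, r in table.items()
    }

rfm = rfm_table(purchases, today=date(2024, 2, 1))
for customer, row in sorted(rfm.items()):
    print(customer, row)
```

These three numbers per customer can be bucketed into scores directly, or used as input features for a clustering algorithm such as k-means.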

18) Predicting Loan Defaulters

All the banks and organizations that lend money need to first assess the risk of loan default based on customer’s past data. To automate this task and save time, we can build a model to assess the risk of loan default based on applicant data and historical loan performance.

It is a simple model, and we can create it in a few simple steps:

  • Collect and preprocess historical loan data including applicant details, loan amount, repayment status, etc.
  • Split the dataset into training and testing sets.
  • Train classification models on historical data and evaluate their performance using metrics like accuracy, precision, recall, or ROC-AUC.
  • Use the trained model to predict the likelihood of default for new loan applications.
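The splitting and evaluation steps above can be sketched without any library. In this example the model itself is left out: `y_pred` is a hypothetical list standing in for whatever a trained classifier would output, and the helper names are assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical loan IDs, split 75/25.
train_rows, test_rows = train_test_split(range(8))

# Hypothetical true outcomes and model predictions (1 = defaulted).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(len(train_rows), len(test_rows), metrics)
```

For loan default, recall on the default class usually deserves particular attention, since a missed defaulter tends to cost more than a wrongly rejected applicant.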

19) Web Click Prediction

Web Click Prediction involves using data mining techniques to predict or forecast user behavior on websites, particularly predicting what links or content a user is likely to click on. 

First collect the data on user behavior such as clickstreams, timestamps, referral sources, etc. Now, preprocess the data by cleaning it and extracting relevant features from the data that could be used for prediction (e.g., user demographics, browsing history, time of day, device used).

Employ machine learning algorithms (such as decision trees, logistic regression, and neural networks) to build predictive models, and train the models using historical click data and relevant features.
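As one concrete instance, the sketch below trains a logistic regression click model with plain batch gradient descent. The two features (`is_mobile`, `is_evening`), the tiny dataset, and the learning rate are all assumptions made for illustration; real click logs would have far more rows and features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the log-loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            error = sigmoid(z) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def click_probability(w, b, x):
    """Predicted probability that a page view with features x ends in a click."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features per page view: [is_mobile, is_evening]; label 1 = clicked.
X = [[1, 1], [0, 1], [1, 1], [0, 1], [1, 0], [0, 0], [1, 0], [0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = train_logistic(X, y)
print("evening view:", round(click_probability(w, b, [0, 1]), 3))
print("daytime view:", round(click_probability(w, b, [0, 0]), 3))
```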

20) Social Network Analysis

Everyone is very active on social media nowadays, and their behavior on these websites tells a lot about their preferences. We can utilize these data to identify communities, influencers, or patterns.

Social Network Analysis involves analyzing the relationships and connections among individuals or entities in a network. This project requires the following things:

  • Graph Theory and Algorithms: Utilizes graph-based algorithms such as PageRank, community detection algorithms (like Louvain or Girvan-Newman), or centrality measures (like betweenness or closeness centrality).
  • Network Visualization: Visualizes the network structure to understand the relationships and patterns visually.
  • Influencer Identification: Identifies influential nodes or users in the network based on their connections and interactions.

Here, we will perform network analysis using libraries like NetworkX (in Python) or custom implementations in C++. After that, apply graph algorithms to detect communities, find influential nodes, or analyze network properties.
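To show what an influence score such as PageRank computes, here is a small custom power-iteration sketch in plain Python rather than a library call. The follower graph is invented, and the damping factor of 0.85 and the iteration count are conventional choices assumed for the example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank for a dict of node -> list of outgoing neighbours."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for nb in neighbours:
                    new_rank[nb] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical follower graph: an edge u -> v means "u follows v".
follows = {
    "ana": ["dev"],
    "ben": ["dev"],
    "cat": ["dev", "ana"],
    "dev": ["ana"],
}
ranks = pagerank(follows)
top = max(ranks, key=ranks.get)
print("most influential:", top)
print({name: round(score, 3) for name, score in ranks.items()})
```

Every neighbour listed in the graph must also appear as a key, which keeps the sketch short; a library implementation handles such details and scales to much larger networks.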

Applications of Data Mining

Here are some major applications:

  • Financial Analysis: The banking and finance industry relies on high-quality, processed, and reliable data. In the finance industry, user data can be used for a variety of purposes, like portfolio management, predicting loan payments, and determining credit ratings.
  • Telecommunication Industry: With the advent of the internet the telecommunication industry is expanding and growing at a fast pace. Data mining can help important industry players to improve their service quality to compete with other businesses.
  • Intrusion Detection: Network resources can face threats, and the actions of cybercriminals can compromise their confidentiality. Intrusion detection has therefore proved to be a crucial data mining application. It draws on association and correlation analysis, aggregation techniques, visualization, and query tools, which can efficiently detect anomalies or deviations from normal behavior.
  • Retail Industry: The established retail business owner maintains sizable quantities of data points covering sales, purchasing history, delivery of goods, consumption, and customer service. Database management has improved with the arrival of e-commerce marketplaces and emerging new technologies.
  • Spatial Data Mining: Geographic Information Systems and many other navigation applications utilize data mining techniques to create a secure system for vital information and understand its implications. This new emerging technology includes the extraction of geographical, environmental, and astronomical data, extracting images from outer space.

How do I Start a Data Mining Project?

The first thing you would need to do is define a problem statement. Your project is only as good as your problem statement. Once you have defined a problem statement, gather data to solve the problem statement.

The data needs to be properly cleaned and in the format that you require it to be. After you have the data, run the data mining algorithms and visualize the results. This can help you gain insights from the data and help in choosing appropriate models to train the data on.

Best Ideas for Final Year Projects

You can choose ideas like Social Network Analysis, Web Click Prediction, and Airbnb Market Analysis for your final year data mining project, since most students build these projects for their final year submission. These are quite complex and require a lot of data and algorithms.

Not only will these projects expand your understanding, but your teachers or supervisors will also favor topics that are relevant to current times.

Now you have the list of data mining projects for beginners. So what are you waiting for? Select one and start working on it. Data mining is a composite discipline that brings together a variety of methods and techniques used in different kinds of analysis.


edugate

Research Topics on Data Mining

Research Topics on Data Mining offers creative ideas to give your research a strong start. We have 100+ world-class professionals who bring innovative ideas to your research project. We have conducted 500+ workshops throughout the world, and a large number of researchers and students have benefited from them. We also provide high-quality topics and ideas to researchers and students through our online services. Our experienced programmers have developed nearly 10,000 projects so far based on current techniques in data mining.

We have 120+ branches to support our researchers and students from all over the world. We also have tie-ups with authorized universities and colleges to guide projects and research. Our alumni share ideas about the most recent concepts, which helps us maintain a leading position in research. We are here for you, so feel free to approach us for further details.

Topics on Data Mining

Research Topics on Data Mining presents the latest trends and new ideas for your research topic. We update ourselves frequently with the most recent topics in data mining. Data mining is the computing process of discovering patterns in large datasets and establishing relationships to solve problems. You can approach us with any topic, and we will deliver your project within the time limit you give us. We offer a list of issues along with many new machine learning approaches for research scholars in data mining.

Recent Issues in Data-Mining

  • User interaction
                -Interactive mining
                -Visualization and presentation of data mining results
                -Incorporation of background knowledge
  • Mining methodology
                -Mining various and new kinds of knowledge
                -Mining knowledge in multi-dimensional space
                -Data mining as an interdisciplinary effort
                -Boosting the power of discovery in a networked environment
                -Handling noise, uncertainty, and incompleteness of data
                -Pattern evaluation and pattern- or constraint-guided mining
  • Performance
                -Scalability and efficiency of data mining algorithms
                -Incremental, parallel, and distributed mining algorithms
  • Data mining and society
                -Social impacts of data mining
                -Privacy-preserving data mining
                -Invisible data mining
  • Efficiency and scalability
                -Incremental, stream, distributed, and parallel mining methods
  • Diversity of data types
                -Mining dynamic, networked, and global data repositories
                -Handling complex types of data

  • Mining multi-agent data and distributed data mining
  • Dealing with cost-sensitive, non-static, and unbalanced data
  • Process-related problems in data mining
  • Scaling up for high-speed data streams and high-dimensional data
  • Creating a unifying theory of data mining
  • Environmental and biological problems in data mining
  • Privacy and accuracy
  • Side-effects (Data Sanitization)
  • Biological and environmental
  • Data integrity and security
  • Mining time series and sequence data
  • Network setting

Most Advanced Concepts in Data-Mining

  • Multimedia data mining
  • High performance distributed data mining
  • Online data mining
  • Spatial and spatiotemporal data mining
  • Information retrieval and web data mining
  • Scientific data mining
  • Dependable real-time data mining
  • Symbolic data mining
  • Geospatial contrast mining
  • Bio-inspired data mining
  • Mining sensor data in healthcare
  • Knowledge discovery
  • Architecture conscious data mining
  • Tunnel ventilation concepts
  • Sustainable mining
  • Mining gene sample time microarray data
  • Biomarker discovery
  • Intelligent statistical data mining
  • Computational data mining

New Machine Learning Approach in Data-Mining

  • Online transactional processing (OLTP)
  • Online analytical processing (OLAP)
  • Cross-industry standard process for data mining (CRISP-DM)
  • Deep neural network learning
  • Efficient ML and DM techniques
  • Planet enlists machine learning
  • Quantum machine learning
  • SAP Machine Learning
  • NeuroRule: a connectionist approach
  • Joao Gama machine learning
  • Adaptive synthetic sampling approach
  • Integrated and cross-disciplinary approach
  • One-class SVM approach
  • Data Mining: Practical Machine Learning Tools and Techniques
  • Learning analytics and machine learning techniques
  • Kernel-based learning methods
  • Human mental models and machine-learned models
  • Data fusion approach

Recent Real Time Applications

  • Pragmatic application of data mining in healthcare
  • Credit card purchase analysis using a data mining approach
  • Data mining in design and manufacturing
  • Data mining and feature scope, with a brief survey
  • Intrusion detection systems using data mining techniques
  • Banking and finance applications using data mining techniques
  • Biological data analysis with a data mining approach
  • Data mining applications in bioinformatics
  • Fraud detection using data analysis techniques

Latest Research Topics

  • Performance evaluation of Mahout clustering algorithms on a Twitter streaming dataset
  • Data mining and analytics with web insights
  • Feature selection approach for RNA-seq data based on detection of differentially expressed genes
  • Future IoT applications in healthcare, with an exploration of IoT industry applications
  • Overview of visual lifelogging: toward storytelling
  • Transfer learning and deep feature extraction for planktonic image datasets
  • Cyber security with machine learning
  • Geometric entity extraction using a conformal geometric algebra voting scheme implemented in reconfigurable devices
  • Real-time online hot topic prediction for earlier news reporting on Sina Weibo
  • Jointly modelling multi-grain aspects and opinions for large-scale online reviews
  • Building a common ontology using community knowledge: CODE+
  • Privacy-preserving multiple linear regression on vertically partitioned real medical datasets
  • Opinion mining for analysing cloud service reviews
  • Searching data cubes for submerging and emerging cuboids
  • Process mining for middleware adaptation
  • LLR-based sentiment analysis for kernel event sequences
  • Sensing and mining urban qualities in smart cities
  • Novel continuous pressure estimation approach using data mining techniques
  • ENVISAT ASAR, Sentinel-1A, and HJ-1-C data for effective mapping of urban areas
  • Design of an educational big data application using Spark

We hope that the information above is enough to give you a clear idea of research in data mining. We are ready to assist you, and you can contact us without hassle through our online and offline services. Our online support is available 24 x 7, and our tutors can promptly help you and clarify your research queries.


Trending Data Mining Thesis Topics

Data mining is the practice of analyzing large amounts of data in order to uncover business insights that can help firms fix issues, reduce risks, and embrace new possibilities. This article provides a complete picture of data mining thesis topics, where you can find information on data mining research.

How to Implement Data Mining Thesis Topics

How does data mining work?

  • A standard data mining project begins with a clearly stated business question; the appropriate data is then collected to address it and prepared for analysis.
  • What happens in the earlier stages determines how successful the later stages are.
  • Data miners should verify the quality of the data they use as input, because poor data quality leads to poor outcomes.
  • Establish a detailed understanding of the project factors, such as the present business scenario, the project’s main business goal, and the performance objectives.
  • Identify the data required to address the problem and collect it from all relevant sources.
  • Address errors such as incomplete or duplicate data, and convert the data into a format suitable for answering the research questions.
  • Apply algorithms to find patterns in the data.
  • Assess whether and how a model’s output will contribute to achieving a business objective.
  • Use an iterative process to identify the best method and obtain the optimum outcome.
  • Make the project’s findings ready for real-time decision making.
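The data preparation step in the list above (handling incomplete or duplicate data) can be sketched in a few lines of Python. This is a minimal illustration under assumed conditions: the record layout, the field names `customer` and `item`, and the `prepare` helper are all hypothetical and not taken from any particular project.

```python
from typing import Optional

# Hypothetical record type: field name -> text value (or None when missing).
Record = dict[str, Optional[str]]

def prepare(records: list[Record], required: list[str]) -> list[Record]:
    """Drop incomplete records and duplicates, keeping first occurrences."""
    seen: set[tuple] = set()
    cleaned: list[Record] = []
    for rec in records:
        # Incomplete: a required field is absent, None, or blank.
        if any(not (rec.get(f) or "").strip() for f in required):
            continue
        # Duplicate: same required fields after trimming and lowercasing.
        key = tuple((rec.get(f) or "").strip().lower() for f in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"customer": "Ana", "item": "bread"},
    {"customer": "ana ", "item": "Bread"},   # duplicate of the first record
    {"customer": "Ben", "item": None},       # incomplete record
    {"customer": "Cy", "item": "milk"},
]
clean = prepare(raw, ["customer", "item"])
print(clean)
```

In a real project the definition of a duplicate (here, a case-insensitive match on the required fields) is itself a design decision that should follow from the business question.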

The steps listed above are repeated until the best outcomes are achieved. Our engineers and developers have extensive knowledge of the tools, techniques, and approaches used in the processes described above. We guarantee that we will provide the best research advice on data mining thesis topics and complete your project on schedule. What are the important data mining tasks?

Data Mining Tasks 

  • Data mining is applied in many ways, including description, analysis, and summarization of data, and clarifying conceptual understanding through data description
  • Prediction, classification, dependency analysis, segmentation, and case-based reasoning are some other important data mining tasks
  • Regression – prediction of numerical data (stock prices, temperatures, and total sales)
  • Data warehousing – business decision making and large-scale data mining
  • Classification – accurate prediction of target classes and their categorization
  • Association rule learning – market basket analysis tools used to establish relationships between variables in a data set
  • Machine learning – decision making based on statistical probability, without explicit programming
  • Data analytics – evaluation of digital data for business purposes
  • Clustering – partitioning a dataset into clusters and subclasses to analyze its natural structure
  • Artificial intelligence – human-like data analytics for reasoning, solving problems, learning, and planning
  • Data preparation and cleansing – conversion of raw data into a processed form, with identification and removal of errors
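To make one of the tasks above concrete, the following sketch computes the usual association rule measures (support, confidence, and lift) on a toy set of market baskets. The basket contents and the function names are illustrative assumptions, not part of any specific tool.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    base = support(transactions, consequent)
    if base == 0:
        return 0.0
    return confidence(transactions, antecedent, consequent) / base

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})
c = confidence(baskets, {"bread"}, {"milk"})
l = lift(baskets, {"bread"}, {"milk"})
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift below 1, as in this toy data, suggests that buying bread makes milk slightly less likely than its baseline rate, so the rule would not be worth acting on.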

You can visit our website for a more in-depth look at all of these operations. We supply you with the data you need for your data mining thesis topics, as well as any additional data you may require. We provide plagiarism-free data mining thesis assistance on any fresh idea of your choice. Let us now discuss the stages in data mining that should be included in your thesis topics.

How to work on a data mining thesis topic? 

 The following are the important stages or phases in developing data mining thesis topics.

  • First of all, you need to identify the present demand and address the question
  • The next step is defining or specifying the problem
  • Collection of data is the third step
  • Alternative solutions and designs have to be analyzed in the next step
  • The proposed methodology has to be designed
  • The system is then to be implemented

Our experts usually help in writing code and implementing it successfully without hassle. By consistently following the above steps, you can develop one of the best data mining thesis topics of recent days. It is also important to have a good technical grasp of the tasks and techniques involved in data mining, which are listed below.

  • Data visualization
  • Neural networks
  • Statistical modeling
  • Genetic algorithms and neural networks
  • Decision trees and induction
  • Discriminant analysis
  • Induction techniques
  • Association rules and data visualization
  • Bayesian networks
  • Correlation
  • Regression analysis
  • Regression analysis and regression trees

If you are looking forward to selecting the best tool for your data mining project then evaluating its consistency and efficiency stands first. For this, you need to gain enough technical data from real-time executed projects for which you can directly contact us. Since we have delivered an ample number of data mining thesis topics successfully we can help you in finding better solutions to all your research issues. What are the points to be remembered about the data mining strategy?

  • Data mining strategies must be chosen before tools, in order to avoid using strategies that do not align with the project’s true purposes.
  • The typical data mining strategy is to evaluate a variety of methods in order to select the one that best fits the situation.
  • Some principles can be used to choose effective strategies for data mining projects. Decision trees, for example, are a common choice for the following reasons:
  • They are easy to handle and comprehend
  • They can work with categorical and numerical data
  • They are unaffected by extreme values and can function with incomplete information
  • They can expose interactions and nonlinear relationships
  • They can handle noise in records
  • They can process huge amounts of data
  • Decision trees, on the other hand, have significant drawbacks.
  • Many rules are frequently necessary for continuous dependent variables or regression, and tiny changes in the data can result in very different tree structures.
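The sensitivity of decision trees to small changes in the data can be illustrated with a single split. The sketch below picks the threshold on one numeric feature that minimizes weighted Gini impurity, which is one common splitting criterion among several. The `ages` and `buys` values are made-up sample data, and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'x <= threshold'."""
    pairs = list(zip(values, labels))
    xs = sorted(set(values))
    best = (None, gini(labels))  # no split: impurity of the whole node
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2  # midpoint between neighbouring distinct values
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best

ages = [22, 25, 30, 35, 40, 45]
buys = ["no", "no", "no", "yes", "yes", "yes"]
split_a = best_split(ages, buys)

# Change a single label (the 30-year-old now buys) and split again.
buys_b = ["no", "no", "yes", "yes", "yes", "yes"]
split_b = best_split(ages, buys_b)

print(split_a, split_b)
```

Changing one label out of six moves the chosen threshold from 32.5 to 27.5. In a full tree, every split below that node would be recomputed on different subsets, which is how a small change can produce a very different structure.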

The pros and cons of various data mining techniques are discussed on our website. We provide high-quality research and thesis writing assistance. You can see evidence of our skill and approach in the sample theses available on our website. We also offer an internal review to help you feel more confident. Let us now discuss recent data mining methodologies.

Current methods in Data Mining

  • Prediction of data (time series data mining)
  • Discriminant and cluster analysis
  • Logistic regression and segmentation

Our technical specialists and technicians usually give adequate accurate data, a thorough and detailed explanation, and technical notes for all of these processes and algorithms. As a result, you can get all of your questions answered in one spot. Our technical team is also well-versed in current trends, allowing us to provide realistic explanations for all new developments. We will now talk about the latest data mining trends

Latest Trending Data Mining Thesis Topics

  • Visual data mining and data mining software engineering
  • Interaction and scalability in data mining
  • Exploring applications of data mining
  • Biological and visual data mining
  • Cloud computing and big data integration
  • Data security and protecting privacy in data mining
  • Novel methodologies in complex data mining
  • Data mining in multiple databases and rationalities
  • Query language standardization in data mining
  • Integration of MapReduce, Amazon EC2, S3, Apache Spark, and Hadoop into data mining

These are the recent trends in data mining. We suggest that you choose the topic that interests you the most. Having an appropriate content structure or template is essential when writing a thesis, so we design the plan in a chronological order relevant to the study assessment. The incorporation of citations is one of the most important aspects of the thesis, and we focus not only on writing but also on citing essential sources in the text. Students frequently struggle to prepare an appropriate proposal when starting their thesis. We have years of experience in providing research and data mining thesis writing services to the scientific community. We will now talk about future research directions in various data mining thesis topics.

Future Research Directions of Data Mining

  • The potential of data mining and data science seems promising, as the volume of data continues to grow.
  • It is expected that the total amount of data in our digital cosmos will have grown from 4.4 zettabytes to 44 zettabytes.
  • We’ll also generate 1.7 megabytes of new data for every human being on this planet each second.
  • Mining algorithms have been completely transformed as technology has advanced, and so have the tools for obtaining useful insights from data.
  • Once upon a time, only organizations like NASA could use their powerful computers to examine data, because the cost of producing and processing data was simply too high.
  • Organizations are now using cloud-based data warehouses to accomplish all kinds of tasks with machine learning, artificial intelligence, and deep learning.

The Internet of Things and wearable electronics, for instance, have turned connected devices into data-generating engines that provide limitless perspectives into people and organizations, provided that firms can gather, store, and analyze the data quickly enough. What should you keep in mind when choosing the best data mining thesis topics?

  • An excellent thesis topic is a broad concept that has to be developed, verified, or refuted.
  • Your thesis topic must capture your curiosity, as well as the interest of your supervisor and other academics.
  • Your thesis topic must be relevant to your studies and should be able to withstand examination.

Our engineers and experts can provide you with any type of research assistance on any of these data mining development tools. We satisfy the criteria of your university by providing several revisions, appropriate formatting and editing of your thesis, a comprehensive grammar check, and so on. As a result, you can contact us with confidence for complete assistance with your data mining thesis. What are the important data mining thesis topics?

Trending Data Mining Research Thesis Topics

Research Topics in Data Mining

  • Handling cost-sensitive, unbalanced, non-static data
  • Issues related to data mining and their solutions
  • Network settings in data mining and ensuring privacy, security, and integrity of data
  • Environmental and biological issues in data mining
  • Complex data mining and sequential data mining (time series data)
  • Data mining at higher dimensions
  • Multi-agent data mining and distributed data mining
  • High-speed data mining
  • Development of unified data mining theory

We currently provide full support for all parts of research study, development, investigation, including project planning, technical advice, legitimate scientific data, thesis writing, paper publication, assignments and project planning, internal review, and many other services. As a result, you can contact us for any kind of help with your data mining thesis topics.

Why Work With Us ?

9 big reasons to select us:

Senior research member: Our Editor-in-Chief owns the website and controls and delivers all aspects of PhD Direction to scholars and students, while overseeing the management of all our clients.

Research experience: Our world-class certified experts have 18+ years of experience in Research & Development programs (industrial research) and have guided as many scholars as possible in developing strong PhD research projects.

Journal member: We are associated with 200+ reputed SCI and SCOPUS indexed journals (SJR ranking) so that research work can be published in standard journals (your first-choice journal).

Book publisher: PhDdirection.com is the world’s largest book publishing platform. It works predominantly in subject-wise categories to assist scholars and students in writing their books and placing them in the University Library.

Research ethics: Our researchers uphold the required research ethics, such as confidentiality and privacy, novelty (valuable research), plagiarism-free work, and timely delivery. Our customers are free to examine their current research activities.

Business ethics: Our organization prioritizes customer satisfaction, online and offline support, and professional delivery of work, since these are the factors that drive our business.

Valid references: Solid work is delivered by a young, qualified global research team. References are the key to evaluating work more easily, because we carefully assess scholars’ findings.

Explanations: Detailed videos, readme files, and screenshots are provided for all research projects. We provide TeamViewer support and other online channels for project explanation.

Paper publication: Publication in worthy journals such as IEEE, ACM, Springer, IET, and Elsevier is our main goal. We substantially reduce the scholar’s burden on the publication side and carry scholars from initial submission to final acceptance.


Machine Learning and Data Mining


Our research focus is on methodologies and frameworks for deriving insights into businesses and services from the huge volumes of data now available from maturing IT infrastructures, and linking these insights to actions. We are studying fundamental analysis methods such as anomaly detection and risk-sensitive data analytics, and also obtaining many results by applying these methods to time series data in manufacturing and CRM data, leveraging the merits of our proximity to advanced companies and markets in Japan.

The theory of association rules in databases proposed in 1993 by IBM Research was one of the first successful studies that introduced a scientific approach to marketing research. Since then, the research area has come to be called data mining. IBM research has been one of the leaders in this field so far.

As a member of the world-wide IBM Research, the IBM Tokyo Research Laboratory has played a crucial role in the area of data mining. In the late '90s, we were recognized for research accomplishments in extending the classical association rule discovery algorithm. In the first years after 2000, we initiated a new research area of graph mining by proposing the AGM (a-priori-based graph mining) algorithm, as well as the notion of a graph kernel. Since then, machine learning for structured data has become one of the major research areas in data mining and machine learning.
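For readers unfamiliar with the classical association rule discovery algorithm mentioned above, here is a minimal sketch of level-wise (Apriori-style) frequent itemset mining on plain item sets. It is not the AGM graph mining algorithm and not IBM's implementation; the baskets, the `apriori` function name, and the use of an absolute count for `min_support` are assumptions made for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}  # 1-item candidates
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the search tractable: the three-item candidate is discarded without being counted because one of its two-item subsets is not frequent.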

Proud of our successes, we are actively tackling the frontiers in machine learning and data mining, and applying the results to the real world, taking full advantage of our proximity to advanced companies and markets in Japan. For instance, some Japanese manufacturing industries are known to have the world's highest quality standards. Data analytics for sensor data will play an essential role in the next-generation quality control systems in manufacturing industries. Also in the area of service businesses, we have an active research team for data analytics for business data, contributing to the world's highest service quality standards in Japan.

In 2007, the IEEE ICDM (International Conference on Data Mining), one of the premier data mining conferences, announced that our team won the first prize in a data mining contest at the ICDM 2007. This result demonstrates our leading-edge machine learning skills and deep insights into real-world problems coming from our data analysis engagements with clients.

Research efforts

  • Data Analytics for Sensor Data
  • GeoSpatial Temporal Analytics
  • Agent-based Simulation Technology
  • Unsupervised Bayesian inference

Contributors

Tsuyoshi "Ide-san" Ide

Related projects

Text analytics.

  • Natural Language Processing


Optimization and Algorithms


Artificial Intelligence and Machine Learning and Data Mining

Computer scientists introduce innovative new work at annual conferences.  The Artificial Intelligence and Machine Learning and Data Mining research community expands the state of the art at these, the field's most prestigious and selective conferences:


Artificial Intelligence (AI) researchers now predict that computers will be able to perform tasks that were once considered the prerogative of human beings.

They include tasks such as driving trucks, translating languages, writing high school essays, creating art, analyzing forensic evidence, and even working as a surgeon. Although some of these goals are predicted to happen over several decades, AI is concerned with principles and algorithms that allow researchers to make such bold predictions. Current methods focus on variants of deep learning, such as convolutional nets, recurrent nets, autoencoders and adversarial networks, as well as on the methods of probabilistic graphical models.

School/University Centers and Institutes

  • Center for Unified Biometrics and Sensors (CUBS)
  • Center of Excellence for Document Analysis and Recognition (CEDAR)
  • UB Artificial Intelligence Institute (AII)
  • UB Center for Cognitive Science (CCS)

CSE Research Labs and Groups

  • Artificial Intelligence Innovation Lab (A2IL)
  • UB Data Science Research Group
  • UB Media Forensic Lab
  • Visual Computing Lab
  • X-Lab@UB: Accelerating AI Systems & Solutions

Affiliated Faculty

Roshan Ayyalasomayajula.

113I Davis Hall

Phone: (716) 645-1590

[email protected]

Research Topics: Wireless systems; mobile computing; Internet of Things (IoT); wireless sensing; machine learning

Varun Chandola.

213 Capen Hall

Phone: (716) 645-4747

[email protected]

Research Topics: Big data analytics; anomaly detection

Changyou Chen.

338L Davis Hall

Phone: (716) 645-4750

[email protected]

Research Topics: Large-scale Bayesian sampling and inference; deep generative models such as VAE and GAN; deep reinforcement learning with Bayesian methods

Sreyasse Das Bhattacharjee.

349 Davis Hall

Phone: (716) 645-4769

[email protected]

Research Topics: Computer vision; machine learning; multimodal data analytics; pattern recognition; large-scale visual search and mining; big data analytics

David Doermann.

338P Davis Hall

Phone: Department Chair: (716) 645-4730, Faculty Office: (716) 645-1557

[email protected]

Research Topics: Document image understanding; video analysis; pattern recognition; computer vision; media forensics; artificial intelligence

Mingchen Gao.

347 Davis Hall

Phone: (716) 645-2834

[email protected]

Research Topics: Big healthcare data; medical imaging informatics; computer vision; machine learning

Venu Govindaraju.

516 Capen Hall, 113 Davis Hall

Phone: (716) 645-3321, (716) 645-1558

[email protected]

Research Topics: Pattern recognition; digital libraries; biometrics

Kaiyi Ji.

338G Davis Hall

Phone: (716) 645-0306

[email protected]

Research Topics: Optimization algorithms; machine learning; big data analytics; federated learning and networks

Tevfik Kosar.

338J Davis Hall

Phone: (716) 645-2323

[email protected]

Research Topics: Data clouds; data-intensive computing; petascale distributed systems; storage and I/O optimization

Vishnu Lokhande.

332 Davis Hall

Phone: (716) 645-4754

[email protected]

Research Topics: Optimization; deep learning; foundation models; computer vision and machine learning

Siwei Lyu.

317 Davis Hall

Phone: (716) 645-1587

[email protected]

Research Topics: digital media forensics; computer vision; machine learning

Ifeoma Nwogu.

305 Davis Hall

Phone: (716) 645-1588

[email protected]

Research Topics: Human behavior modeling; sign language understanding; probabilistic modeling

Shamsad Parvin.

351 Davis Hall

Phone: (716) 645-4757

[email protected]

Research Topics: Computer science education; wireless communications; wireless sensor network; routing protocol; cognitive radio network; software-defined radio; machine learning

Nalini Ratha.

113K Davis Hall

Phone: (716) 645-1564

[email protected]

Research Topics: Computer vision; artificial intelligence; biometrics and fairness; and trust in AI

Ken Regan.

326 Davis Hall

Phone: (716) 645-4738

[email protected]

Research Topics: Mathematical logic; theoretical computer science

Atri Rudra.

319 Davis Hall

Phone: (716) 645-2464

[email protected]

Research Topics: Structured linear algebra; society and computing; coding theory; database algorithms

A. Erdem Sariyuce.

323 Davis Hall

Phone: (716) 645-1592

[email protected]

Research Topics: Graph mining; social network analysis; network science; temporal network analysis; combinatorial scientific computing; stream processing; distributed and parallel computing

Rohini Srihari.

338C Davis Hall

Phone: (716) 645-1602

[email protected]

Research Topics: Information extraction; information retrieval; multimedia information retrieval; text mining

Alina Vereshchaka.

350 Davis Hall

Phone: (716) 645-1586

[email protected]

Research Topics: Optimal control in complex systems, including social behavior modeling, deep reinforcement learning, multi-agent settings, deep learning, adversarial machine learning, transportation and large-scale social system dynamics

Jinjun Xiong.

316 Davis Hall

Phone: (716) 645-4760

[email protected]

Research Topics: Cognitive computing, big data analytics, deep learning, smarter energy, application of cognitive computing for industrial solutions

Jinhui Xu.

315 Davis Hall

Phone: (716) 645-4734

[email protected]

Research Topics: Algorithms; computational geometry; machine learning; differential privacy; geometric computing in deep learning and biomedical applications

Junsong Yuan.

338H Davis Hall

Phone: (716) 645-0562

[email protected]

Research Topics: Computer vision; pattern recognition; video analytics; large-scale visual search and mining

Research Ranking

UB's institutional reputation in the field of computer science has improved dramatically over the last decade. By the most valid measure, our national ranking has risen from 50th to 29th.

The Computing Research Association (CRA) is a leading computer science advocacy organization whose mission is to unite industry, academia, and government. The CRA recommends CSRankings: Computer Science Rankings as the best institutional ranking agency, preferring it over the traditional standard, the US News and World Report Best Graduate Schools report.

The CRA supports the CSRankings report because its evaluative criteria meet the 'GOTO' standard:

Good data. Data have been cleaned and curated.

Open. Data available, regarding attributes measured, at least for verification.

Transparent. Process and methodologies are entirely transparent.

Objective. Based on measurable attributes.

For more details, see Department Rankings, by H.V. Jagadish.

According to CSRankings (2008-2018), UB's 10-year computer science institutional ranking is #50 in the nation, tied with the University of Central Florida and the University of North Carolina.

According to CSRankings (2015-2018), UB's three-year computer science institutional ranking is #34 in the nation, placing us alongside our peer institution, the University of Virginia.

According to CSRankings (2017-2018), UB's one-year computer science institutional ranking is #29 in the nation, putting us in company with Harvard, Johns Hopkins, Ohio State, and Penn State.

Research Highlights

An article on PhysOrg reports UB has received a $584,469 grant from the National Science Foundation to create a tool designed to work with the existing computing infrastructure to boost data transfer speeds by more than 10 times, and quotes Tevfik Kosar, associate professor of computer science.

Ken Regan develops algorithms that detect cheating in chess games.  His software compares a player's moves to a database of the player's typical gameplay, then makes an assessment of the statistical likelihood of cheating.  Dr. Regan frequently consults at international chess matches.

Giving Vision to Robot Bees.

Karthik Dantu owns the vision component of the RoboBee Initiative, led by the National Science Foundation and Harvard University. The "eyes" that Dr. Dantu is integrating are laser-powered sensors that enable the mechanical bees to orient themselves in space.


Proposed software solution could extend battery life, reduce energy consumption.

Autodietary.

Wenyao Xu created AutoDietary — software that tracks the unique sounds produced by food as people chew it.  AutoDietary, placed near the throat by a necklace delivery system developed at China's Northeastern University, helps users measure their caloric intake.

iCAVE2 and Motion Simulator Lab.

Professor and Chair Chunming Qiao leads Instrument for Connected and Autonomous Vehicle Evaluation and Experimentation (iCAVE2) —a multidisciplinary academic-industrial partnership that's helping to make self-driving cars safer, cleaner, and more efficient.

3Dprinting security.

Wenyao Xu leads an NSF-funded program that detects 3D printing data security vulnerabilities by using smart phones to analyze electromagnetic and acoustic waves.  Kui Ren and Chi Zhou are co-authors.

Recognitions

UB Chancellors Award medal.

Three SEAS faculty members have been named recipients of the 2024 SUNY Chancellor’s Award for Excellence in Scholarship and Creative Activities.

Multiple faculty members and students from SEAS were nominated by students, faculty and staff for Pillar of Leadership Awards.

UB President's medal.

Deborah Chung and Venu Govindaraju will receive the UB President’s Medal, recognizing extraordinary service to the university.

Jinjun Xiong.

Jinjun Xiong, SUNY Empire Innovation Professor in the Department of Computer Science and Engineering, has been elevated to fellow in the Institute of Electrical and Electronics Engineers. 

Wenyao Xu and Kristen Moore.

Awards acknowledge and provide system-wide recognition for consistently superior professional achievement and the ongoing pursuit of excellence.

Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information from large data sets.

Given the evolution of data warehousing technology and the growth of big data, adoption of data mining techniques has rapidly accelerated over the last couple of decades, assisting companies by transforming their raw data into useful knowledge. However, although the technology continuously evolves to handle data at a large scale, leaders still face challenges with scalability and automation.

Data mining has improved organizational decision-making through insightful data analyses. The data mining techniques that underpin these analyses can be divided into two main purposes; they can either describe the target dataset or they can predict outcomes through the use of  machine learning  algorithms. These methods are used to organize and filter data, surfacing the most interesting information, from fraud detection to user behaviors, bottlenecks and even security breaches.

When combined with data analytics and visualization tools, like  Apache Spark , delving into the world of data mining has never been easier and extracting relevant insights has never been faster. Advances within  artificial intelligence  only continue to expedite adoption across industries.

The data mining process involves a number of steps, from data collection to visualization, to extract valuable information from large data sets. As mentioned above, data mining techniques are used to generate descriptions and predictions about a target data set. Data scientists describe data through their observations of patterns, associations and correlations. They also group data through classification and clustering methods, predict values through regression, and identify outliers for use cases like spam detection.

Data mining usually consists of four main steps: setting objectives, data gathering and preparation, applying data mining algorithms and evaluating results.

1. Set the business objectives:  This can be the hardest part of the data mining process, and many organizations spend too little time on this important step. Data scientists and business stakeholders need to work together to define the business problem, which helps inform the data questions and parameters for a given project. Analysts may also need to do additional research to understand the business context appropriately.

2. Data preparation:  Once the scope of the problem is defined, it is easier for data scientists to identify which set of data will help answer the pertinent questions to the business. Once they collect the relevant data, it will be cleaned, removing any noise, such as duplicates, missing values and outliers. Depending on the dataset, an additional step may be taken to reduce the number of dimensions as too many features can slow down any subsequent computation. Data scientists will look to retain the most important predictors to ensure optimal accuracy within any models.

3. Model building and pattern mining:  Depending on the type of analysis, data scientists may investigate any interesting data relationships, such as sequential patterns, association rules or correlations. While high-frequency patterns have broader applications, sometimes the deviations in the data can be more interesting, highlighting areas of potential fraud.

Deep learning  algorithms may also be applied to classify or cluster a data set depending on the available data. If the input data is labelled (i.e.  supervised learning ), a classification model may be used to categorize data, or alternatively, a regression may be applied to predict the likelihood of a particular assignment. If the dataset isn’t labelled (i.e.  unsupervised learning ), the individual data points in the training set are compared with one another to discover underlying similarities, clustering them based on those characteristics.

4. Evaluation of results and implementation of knowledge:  Once the data is aggregated, the results need to be evaluated and interpreted. Finalized results should be valid, novel, useful and understandable. When these criteria are met, organizations can use this knowledge to implement new strategies, achieving their intended objectives.
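As an illustration of the data preparation step (step 2 above), the following is a minimal sketch in plain Python. The function name `prepare` and the tuple-based row layout are assumptions made for this example, and dropping constant columns stands in for dimensionality reduction only in its simplest form.

```python
def prepare(rows):
    """Clean a list of equal-length tuples and drop uninformative columns.

    Returns (cleaned_rows, kept_column_indexes).
    """
    # Remove rows that contain a missing value (represented here as None).
    complete = [r for r in rows if None not in r]
    # Remove exact duplicates while preserving the original order.
    unique = list(dict.fromkeys(complete))
    if not unique:
        return [], []
    # Keep only columns that take more than one distinct value; a constant
    # column carries no information for a later model.
    keep = [j for j in range(len(unique[0]))
            if len({r[j] for r in unique}) > 1]
    return [tuple(r[j] for j in keep) for r in unique], keep


raw = [(1, "a", 5.0), (1, "a", 5.0), (2, "a", None), (3, "a", 7.0)]
cleaned, kept = prepare(raw)
print(cleaned, kept)
```

In practice a library such as pandas would usually handle these steps; the sketch only makes the order of operations explicit.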

Data mining works by using various algorithms and techniques to turn large volumes of data into useful information. Here are some of the most common ones:

Association rules:  An association rule is a rule-based method for finding relationships between variables in a given dataset. These methods are frequently used for market basket analysis, allowing companies to better understand relationships between different products. Understanding consumption habits of customers enables businesses to develop better cross-selling strategies and recommendation engines.
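Two standard measures used to score an association rule are support (how often an itemset appears) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for a toy market basket; the basket contents and function names are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))       # share of baskets with both items
print(confidence(baskets, {"bread"}, {"milk"}))  # share of bread baskets that also hold milk
```

A rule such as "bread implies milk" is typically kept only if both values exceed thresholds chosen by the analyst.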

Neural networks:  Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting their weights based on the loss function through the process of gradient descent. When the cost function is at or near zero on the training data, the model fits that data closely; its accuracy on new data should still be checked on a held-out set.
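To make the node description concrete, here is a minimal sketch of a single node trained by gradient descent. The sigmoid activation, the log-loss objective, the OR-gate training data and the hyperparameters are all assumptions chosen for the example, not details given in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, lr=0.5, epochs=5000):
    """Fit the weights and bias of one sigmoid node by batch gradient descent."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_inputs, 0.0
        for x, y in samples:
            # For log loss with a sigmoid output, the gradient with respect
            # to the pre-activation is simply (prediction - label).
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def fires(weights, bias, x, threshold=0.5):
    """The node 'fires' when its output exceeds the threshold."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) > threshold


OR_GATE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(OR_GATE)
print([fires(weights, bias, x) for x, _ in OR_GATE])
```

A real network stacks many such nodes in layers and computes the gradients with backpropagation, usually through a dedicated library.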

Decision tree:  This data mining technique uses classification or regression methods to classify or predict potential outcomes based on a set of decisions. As the name suggests, it uses a tree-like visualization to represent the potential outcomes of these decisions.

K-nearest neighbor (KNN):  K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. This algorithm assumes that similar data points can be found near each other. As a result, it calculates the distance between data points, usually the Euclidean distance, and then assigns a category based on the most frequent category among the nearest neighbors (for classification) or their average value (for regression).
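A minimal sketch of KNN classification with Euclidean distance and majority voting follows. The function name `knn_predict` and the toy points are illustrative assumptions; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label `query` with the most frequent label among its k nearest points.

    `train` is a list of (point, label) pairs.
    """
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 9)))
```

Sorting every training point is fine for small data; larger data sets generally call for a spatial index or an approximate search.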

Data mining techniques are widely adopted among business intelligence and data analytics teams, helping them extract knowledge for their organization and industry. Some data mining use cases include:

Sales and marketing  

Companies collect a massive amount of data about their customers and prospects. By observing consumer demographics and online user behavior, companies can use data to optimize their marketing campaigns, improving segmentation, cross-sell offers and customer loyalty programs, yielding higher ROI on marketing efforts. Predictive analyses can also help teams to set expectations with their stakeholders, providing yield estimates from any increases or decreases in marketing investment.

Education  

Educational institutions have started to collect data to understand their student populations as well as which environments are conducive to success. As courses continue to move to online platforms, institutions can use a variety of dimensions and metrics to observe and evaluate performance, such as keystrokes, student profiles, classes, universities and time spent.

Operational optimization  

Process mining  leverages data mining techniques to reduce costs across operational functions, enabling organizations to run more efficiently. This practice has helped to identify costly bottlenecks and improve decision-making among business leaders.

Fraud detection  

While frequently occurring patterns in data can provide teams with valuable insight, observing data anomalies is also beneficial, assisting companies in detecting fraud. While this is a well-known use case within banking and other financial institutions, SaaS-based companies have also started to adopt these practices to eliminate fake user accounts from their datasets.
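The text does not name a specific anomaly detection method, so the following sketch uses one common choice as an assumption: the modified z-score based on the median absolute deviation (MAD), which is less distorted by extreme values than a mean-based score. The threshold of 3.5 is a conventional default, and the transaction amounts are invented.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half of the values equal the median; the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # when the data is roughly normal.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


amounts = [20, 22, 19, 21, 23, 20, 500]
print(mad_outliers(amounts))
```

Flagged values are candidates for review rather than confirmed fraud; production systems typically combine several signals.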

PeerJ Comput Sci

Adaptations of data mining methodologies: a systematic literature review

Associated data.

The following information was supplied regarding data availability:

SLR Protocol (also shared via online repository), corpus with definitions and mappings are provided as a Supplemental File .

The use of end-to-end data mining methodologies such as CRISP-DM, KDD process, and SEMMA has grown substantially over the past decade. However, little is known as to how these methodologies are used in practice. In particular, the question of whether data mining methodologies are used ‘as-is’ or adapted for specific purposes, has not been thoroughly investigated. This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and ‘grey’ publications. We find that data mining methodologies are primarily applied ‘as-is’. At the same time, we also identify various adaptations of data mining methodologies and we note that their number is growing rapidly. The dominant adaptations pattern is related to methodology adjustments at a granular level (modifications) followed by extensions of existing methodologies with additional elements. Further, we identify two recurrent purposes for adaptation: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). The study suggests that standard data mining methodologies do not pay sufficient attention to deployment issues, which play a prominent role when turning data mining models into software products that are integrated into the IT architectures and business processes of organizations. We conclude that refinements of existing methodologies aimed at combining data, technological, and organizational aspects, could help to mitigate these gaps.

Introduction

The availability of Big Data has stimulated widespread adoption of data mining and data analytics in research and in business settings (Columbus, 2017). Over the years, a number of data mining methodologies have been proposed, and these are being used extensively in practice and in research. However, little is known about which data mining methodologies are applied and how, and the question has been neither widely researched nor discussed. Further, there is no consolidated view on what constitutes quality of the methodological process in data mining and data analytics, how data mining and data analytics are applied in organizational settings, and how application practices relate to each other. This motivates the need for a comprehensive survey in the field.

There have been surveys, quasi-surveys and summaries conducted in related fields. Notably, there have been two systematic literature reviews. The Systematic Literature Review (hereinafter, SLR) is the most suitable and widely used research method for identifying, evaluating and interpreting research on a particular research question, topic or phenomenon (Kitchenham, Budgen & Brereton, 2015). These reviews concerned Big Data Analytics, but not general-purpose data mining methodologies. Adrian et al. (2004) executed an SLR with respect to the implementation of Big Data Analytics (BDA), specifically the capability components necessary for BDA value discovery and realization. The authors identified BDA implementation studies, determined their main focus areas, and discussed in detail BDA applications and capability components. Saltz & Shamshurin (2016) published an SLR paper on Big Data Team Process Methodologies. The authors identified a lack of standards regarding how Big Data projects are executed, and highlighted growing research in this area and the potential benefits of such a process standard. Additionally, the authors synthesized a list of the 33 most important success factors for executing Big Data activities. Finally, there are studies that surveyed data mining techniques and applications across domains, yet they focus on data mining process artifacts and outcomes (Madni, Anwar & Shah, 2017; Liao, Chu & Hsiao, 2012), not on end-to-end process methodology.

There have been a number of surveys conducted in domain-specific settings such as the hospitality, accounting, education, manufacturing, and banking fields. Mariani et al. (2018) focused on a Business Intelligence (BI) and Big Data SLR in the hospitality and tourism context. Amani & Fadlalla (2017) explored the application of data mining methods in accounting, while Romero & Ventura (2013) investigated educational data mining. Similarly, Hassani, Huang & Silva (2018) addressed data mining application case studies in banking and explored them along three dimensions: topics, applied techniques and software. All of these studies were performed by means of systematic literature reviews. Lastly, Bi & Cochran (2014) undertook a standard literature review of Big Data Analytics and its applications in manufacturing.

Apart from domain-specific studies, there have been very few general-purpose surveys that provide a comprehensive overview of existing data mining methodologies, classifying and contextualizing them. A valuable synthesis was presented by Kurgan & Musilek (2006) as a comparative study of the state of the art of data mining methodologies. The study was not an SLR, and focused on a comprehensive comparison of the phases, processes and activities of data mining methodologies; the application aspect was summarized briefly as application statistics by industries and citations. Three more comparative, non-SLR studies were undertaken by Marban, Mariscal & Segovia (2009), Mariscal, Marbán & Fernández (2010), and, most recently and most closely related, Martínez-Plumed et al. (2017). They followed the same pattern, systematizing existing data mining frameworks based on comparative analysis. There, the purpose and context of consolidation was even more practical: to support the derivation and proposal of a new artifact, that is, a novel data mining methodology. The majority of these general surveys in the field are more than a decade old, and have natural limitations due to being: (1) non-SLR studies, and (2) so far restricted to comparing methodologies in terms of phases, activities, and other elements.

The key common characteristic behind all the given studies is that data mining methodologies are treated as normative and standardized (‘one-size-fits-all’) processes. A complementary perspective, not considered in the above studies, is that data mining methodologies are not normative standardized processes, but instead frameworks that need to be specialized to different industry domains, organizational contexts, and business objectives. In the last few years, a number of extensions and adaptations of data mining methodologies have emerged, which suggests that existing methodologies are not sufficient to cover the needs of all application domains. In particular, extensions of data mining methodologies have been proposed in the medical domain (Niaksu, 2015), the educational domain (Tavares, Vieira & Pedro, 2017), the industrial engineering domain (Huber et al., 2019; Solarte, 2002), and software engineering (Marbán et al., 2007, 2009). However, little attention has been given to studying how data mining methodologies are applied and used in industry settings; so far, only non-scientific practitioners’ surveys provide such evidence.

Given this research gap, the central objective of this article is to investigate how data mining methodologies are applied by researchers and practitioners, both in their generic (standardized) form and in specialized settings. This is achieved by investigating if data mining methodologies are applied ‘as-is’ or adapted, and for what purposes such adaptations are implemented.

Guided by the Systematic Literature Review method, we first identified a corpus of primary studies covering both peer-reviewed and ‘grey’ literature from 1997 to 2018. An analysis of these studies led us to a taxonomy of uses of data mining methodologies, focusing on the distinction between ‘as-is’ usage and various types of methodology adaptations. By analyzing different types of methodology adaptations, this article identifies potential gaps in standard data mining methodologies at both the technological and the organizational levels.

The rest of the article is organized as follows. The Background section provides an overview of key concepts of data mining and associated methodologies. Next, Research Design describes the research methodology. The Findings and Discussion section presents the study results and their associated interpretation. Finally, threats to validity are addressed in Threats to Validity while the Conclusion summarizes the findings and outlines directions for future work.

Background

This section introduces the main data mining concepts and provides an overview of existing data mining methodologies and their evolution.

Data mining is defined as a set of rules, processes, and algorithms that are designed to generate actionable insights, extract patterns, and identify relationships from large datasets (Morabito, 2016). Data mining incorporates automated data extraction, processing, and modeling by means of a range of methods and techniques. In contrast, data analytics refers to techniques used to analyze and acquire intelligence from data (including ‘big data’) (Gandomi & Haider, 2015) and is positioned as a broader field, encompassing a wider spectrum of methods that includes both statistical and data mining methods (Chen, Chiang & Storey, 2012). A number of algorithms have been developed in the statistics, machine learning, and artificial intelligence domains to support and enable data mining. While statistical approaches precede the others, they inherently come with limitations, the best known being rigid conditions on the data distribution. Machine learning techniques gained popularity because they impose fewer restrictions while deriving understandable patterns from data (Bose & Mahapatra, 2001).

Data mining projects commonly follow a structured process or methodology as exemplified by Mariscal, Marbán & Fernández (2010) , Marban, Mariscal & Segovia (2009) . A data mining methodology specifies tasks, inputs, outputs, and provides guidelines and instructions on how the tasks are to be executed ( Mariscal, Marbán & Fernández, 2010 ). Thus, data mining methodology provides a set of guidelines for executing a set of tasks to achieve the objectives of a data mining project ( Mariscal, Marbán & Fernández, 2010 ).

The foundations of structured data mining methodologies were first proposed by Fayyad, Piatetsky-Shapiro & Smyth (1996a, 1996b, 1996c), and were initially related to Knowledge Discovery in Databases (KDD). KDD presents a conceptual process model of computational theories and tools that support the extraction of information (knowledge) from data (Fayyad, Piatetsky-Shapiro & Smyth, 1996a). In KDD, the overall approach to knowledge discovery includes data mining as a specific step. As such, KDD, with its nine main steps (exhibited in Fig. 1), has the advantage of considering data storage and access, algorithm scaling, interpretation and visualization of results, and human-computer interaction (Fayyad, Piatetsky-Shapiro & Smyth, 1996a, 1996c). The introduction of KDD also formalized a clearer distinction between data mining and data analytics, as formulated, for example, in Tsai et al. (2015): “…by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining”.

Figure 1 (image file peerj-cs-06-267-g001.jpg).

The main steps of KDD are as follows:

  • Step 1: Learning application domain: In the first step, an understanding of the application domain and relevant prior knowledge is developed, followed by identifying the goal of the KDD process from the customer’s viewpoint.
  • Step 2: Dataset creation: The second step involves selecting a dataset, focusing on a subset of variables or data samples on which discovery is to be performed.
  • Step 3: Data cleaning and processing: In the third step, basic operations to remove noise or outliers are performed. Collecting the information necessary to model or account for noise, deciding on strategies for handling missing data fields, and accounting for data types, schema, and the mapping of missing and unknown values are also considered.
  • Step 4: Data reduction and projection: Here, useful features to represent the data are found, depending on the goal of the task, and transformation methods are applied to find the optimal feature set for the data.
  • Step 5: Choosing the function of data mining: In the fifth step, the target outcome (e.g., summarization, classification, regression, clustering) is defined.
  • Step 6: Choosing data mining algorithm: The sixth step concerns selecting method(s) to search for patterns in the data, deciding which models and parameters are appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
  • Step 7: Data mining: In the seventh step, the data is mined, that is, searched for patterns of interest in a particular representational form or a set of such representations: classification rules or trees, regression, clustering.
  • Step 8: Interpretation: In this step, redundant and irrelevant patterns are filtered out, and relevant patterns are interpreted and visualized in such a way as to make the result understandable to the users.
  • Step 9: Using discovered knowledge: In the last step, the results are incorporated into the performance system, documented and reported to stakeholders, and used as a basis for decisions.
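The sequential character of these steps can be sketched as a chain of small functions, each consuming the output of the previous one. Everything in the example below is hypothetical: the record fields (`age`, `spend`), the choice of summarization as the mining function, and the minimum group size used during interpretation. Steps 1 and 9 are organizational and are not modeled.

```python
from collections import defaultdict


def select(records, fields):
    """Step 2: keep only the variables on which discovery is performed."""
    return [{f: r.get(f) for f in fields} for r in records]


def clean(records):
    """Step 3: drop records with missing fields."""
    return [r for r in records if all(v is not None for v in r.values())]


def project(records):
    """Step 4: replace the raw age by a coarser feature, the decade band."""
    return [{"band": r["age"] // 10 * 10, "spend": r["spend"]} for r in records]


def mine(records):
    """Steps 5-7: summarization, here (count, mean spend) per band."""
    groups = defaultdict(list)
    for r in records:
        groups[r["band"]].append(r["spend"])
    return {band: (len(v), sum(v) / len(v)) for band, v in groups.items()}


def interpret(patterns, min_count=2):
    """Step 8: filter out patterns supported by too few records."""
    return {band: avg for band, (n, avg) in patterns.items() if n >= min_count}


def run_kdd(records):
    return interpret(mine(project(clean(select(records, ["age", "spend"])))))


raw = [
    {"age": 23, "spend": 10.0, "id": 1},
    {"age": 27, "spend": 20.0, "id": 2},
    {"age": 35, "spend": None, "id": 3},
    {"age": 41, "spend": 50.0, "id": 4},
    {"age": 45, "spend": 70.0, "id": 5},
    {"age": 62, "spend": 5.0, "id": 6},
]
print(run_kdd(raw))
```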

The KDD process became dominant in industrial and academic domains (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Also, as the timeline-based evolution of data mining methodologies and process models shows (Fig. 2 below), the original KDD data mining model served as the basis for other methodologies and process models, which addressed various gaps and deficiencies of the original KDD process. These approaches extended the initial KDD framework, yet the degree of extension has varied, ranging from process restructuring to a complete change in focus. For example, Brachman & Anand (1996) and later Gertosio & Dussauchoy (2004) (in the form of a case study) introduced practical adjustments to the process based on its iterative nature as well as interactivity. The complete KDD process in their view was enhanced with supplementary tasks, and the focus was changed to the user’s point of view (human-centered approach), highlighting decisions that need to be made by the user in the course of the data mining process. In contrast, Cabena et al. (1997) proposed a different number of steps, emphasizing and detailing data processing and discovery tasks. Similarly, in a series of works, Anand & Büchner (1998), Anand et al. (1998), and Buchner et al. (1999) presented additional data mining process steps by concentrating on the adaptation of the data mining process to practical settings. They focused on cross-sales (entire life-cycles of the online customer), with further incorporation of an internet data discovery process (web-based mining). Further, the Two Crows data mining process model is a consultancy-originated framework that defines the steps differently, but is still close to the original KDD. Finally, SEMMA (Sample, Explore, Modify, Model and Assess), based on KDD, was developed by the SAS Institute in 2005 (SAS Institute Inc., 2017). It is defined as a logical organization of the functional toolset of SAS Enterprise Miner for carrying out the core tasks of data mining. Compared to KDD, it is a vendor-specific process model, which limits its application in different environments. Also, it skips two steps of the original KDD process (‘Learning Application Domain’ and ‘Using of Discovered Knowledge’) which are regarded as essential for the success of a data mining project (Mariscal, Marbán & Fernández, 2010). In terms of adoption, new KDD-based proposals received limited attention across academia and industry (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Subsequently, most of these methodologies converged into the CRISP-DM methodology.

Figure 2 (image file peerj-cs-06-267-g002.jpg).

Additionally, only two non-KDD-based approaches have been proposed alongside the extensions to KDD. The first is the 5A’s approach presented by De Pisón Ascacíbar (2003) and used by the vendor SPSS. The key contribution of this approach was the addition of an ‘Automate’ step, while its disadvantage was the omission of a ‘Data Understanding’ step. The second approach is 6-Sigma, an industry-originated method to improve quality and customer satisfaction (Pyzdek & Keller, 2003). It has been successfully applied to data mining projects in conjunction with the DMAIC performance improvement model (Define, Measure, Analyze, Improve, Control).

In 2000, in response to common issues and needs (Marban, Mariscal & Segovia, 2009), an industry-driven methodology called the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced as an alternative to KDD. It also consolidated the original KDD model and its various extensions. While CRISP-DM builds upon KDD, it consists of six phases that are executed in iterations (Marban, Mariscal & Segovia, 2009). The iterative execution of CRISP-DM is its most distinguishing feature compared to the initial KDD, which assumes a sequential execution of its steps. CRISP-DM, much like KDD, aims at providing practitioners with guidelines to perform data mining on large datasets. However, CRISP-DM, with its six main steps comprising a total of 24 tasks and outputs, is more refined than KDD. The main steps of CRISP-DM, as depicted in Fig. 3 below, are as follows:

  • Phase 1: Business understanding: The focus of the first step is to gain an understanding of the project objectives and requirements from a business perspective, followed by converting these into data mining problem definitions. Presentation of a preliminary plan to achieve the objectives is also included in this first step.
  • Phase 2: Data understanding: This step begins with an initial data collection and proceeds with activities to get familiar with the data, identify data quality issues, discover first insights into the data, and potentially detect and form hypotheses.
  • Phase 3: Data preparation: The third step covers the activities required to construct the final dataset from the initial raw data. Data preparation tasks are performed repeatedly.
  • Phase 4: Modeling phase: In this step, various modeling techniques are selected and applied, followed by calibrating their parameters. Typically, several techniques are used for the same data mining problem.
  • Phase 5: Evaluation of the model(s): The fifth step begins with the quality perspective and then, before proceeding to final model deployment, ascertains that the model(s) achieve the business objectives. At the end of this phase, a decision should be reached on how to use the data mining results.
  • Phase 6: Deployment phase: In the final step, the models are deployed to enable end-customers to use the data as a basis for decisions, or as support in the business process. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized, presented, and distributed in a way that the end-user can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.
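The iterative execution that distinguishes CRISP-DM from the sequential KDD steps can be sketched as a loop that repeats the first five phases until evaluation succeeds. This is a deliberately simplified illustration: the `evaluate` callback and the `max_iterations` limit are hypothetical, and the sketch only loops back from evaluation to the first phase rather than modeling every possible transition between phases.

```python
PHASES = [
    "business understanding",
    "data understanding",
    "data preparation",
    "modeling",
    "evaluation",
    "deployment",
]


def run_crisp_dm(evaluate, max_iterations=5):
    """Return the list of (iteration, phase) pairs that were executed.

    `evaluate(iteration)` is a caller-supplied check of whether the model
    meets the business objectives in the given iteration.
    """
    trace = []
    for iteration in range(1, max_iterations + 1):
        # Phases 1 to 5 run in order within every iteration.
        for phase in PHASES[:5]:
            trace.append((iteration, phase))
        if evaluate(iteration):
            # Deployment happens only once evaluation is satisfied.
            trace.append((iteration, PHASES[5]))
            return trace
    return trace  # objectives never met, so nothing was deployed


# Example: the model is judged good enough in the second iteration.
for step in run_crisp_dm(lambda iteration: iteration >= 2):
    print(step)
```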

The development of CRISP-DM was led by an industry consortium. It is designed to be domain-agnostic (Mariscal, Marbán & Fernández, 2010) and, as such, is now widely used by industry and research communities (Marban, Mariscal & Segovia, 2009). These distinctive characteristics have led CRISP-DM to be considered the ‘de-facto’ standard data mining methodology and a reference framework against which other methodologies are benchmarked (Mariscal, Marbán & Fernández, 2010).

Similarly to KDD, a number of refinements and extensions of the CRISP-DM methodology have been proposed, in two main directions: extensions of the process model itself, and adaptations of or mergers with process models and methodologies from other domains. The first direction is exemplified by Cios & Kurgan (2005), who proposed the integrated Data Mining & Knowledge Discovery (DMKD) process model. It contains several explicit feedback mechanisms, modifies the last step to incorporate the application of discovered knowledge and insights, and relies on technologies for results deployment. In the same vein, Moyle & Jorge (2001) and Blockeel & Moyle (2002) proposed the Rapid Collaborative Data Mining System (RAMSYS) framework, which is both a data mining methodology and a system for remote collaborative data mining projects. RAMSYS attempted to combine a problem-solving methodology, knowledge sharing, and ease of communication. It was intended to let remotely located data miners work collaboratively in a disciplined manner as regards information flow, while allowing the free flow of ideas for problem solving (Moyle & Jorge, 2001). CRISP-DM modifications and integrations with other specific domains were proposed in Industrial Engineering (Data Mining for Industrial Engineering by Solarte (2002)) and in Software Engineering by Marbán et al. (2007, 2009). Both approaches enhanced CRISP-DM with additional phases, activities and tasks typical of engineering processes, addressing on-going support (Solarte, 2002) as well as project management, organizational and quality assurance tasks (Marbán et al., 2009).

Finally, a limited number of attempts to create independent or semi-dependent data mining frameworks were undertaken after the creation of CRISP-DM. These efforts were driven by industry players and comprised the KDD Roadmap by Debuse et al. (2001) for a proprietary predictive toolkit (Lanner Group), and a more recent effort by IBM, the Analytics Solutions Unified Method for Data Mining (ASUM-DM), in 2015 (IBM Corporation, 2016: https://developer.ibm.com/technologies/artificial-intelligence/articles/architectural-thinking-in-the-wild-west-of-data-science/ ). Both frameworks contributed additional tasks (for example, resourcing in the KDD Roadmap) or a hybrid approach (for example, the combination of agile and traditional implementation principles assumed in ASUM).

Table 1 above summarizes the reviewed data mining process models and methodologies by their origin, basis and key concepts.

Research Design

The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use a systematic literature review (SLR) as the scientific method, for two reasons. Firstly, a systematic review is based on a trustworthy, rigorous, and auditable methodology. Secondly, an SLR supports structured synthesis of existing evidence and identification of research gaps, and provides a framework in which to position new research activities (Kitchenham, Budgen & Brereton, 2015). For our SLR, we followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). All SLR details have been documented in a separate, peer-reviewed SLR protocol (available at https://figshare.com/articles/Systematic-Literature-Review-Protocol/10315961 ).

Research questions

As suggested by Kitchenham, Budgen & Brereton (2015), we have formulated research questions and motivate them as follows. In the preliminary phase of the research, we discovered a very limited number of studies investigating data mining methodologies application practices as such. Further, we discovered a number of surveys conducted in domain-specific settings and very few general-purpose surveys, but none of them considered application practices either. In contrast, the recent emergence of a limited number of adaptation studies has clearly pinpointed the research gap in the area of application practices. Given this research gap, in-depth investigation of this phenomenon led us to ask: “How are data mining methodologies applied (‘as-is’ vs adapted)?” (RQ1). Further, as we intended to investigate in depth the universe of adaptation scenarios, this naturally led us to RQ2: “How have existing data mining methodologies been adapted?” Finally, if adaptations are made, we wish to explore the associated reasons and purposes, which in turn led us to RQ3: “For what purposes are data mining methodologies adapted?”

Thus, for this review, there are three research questions defined:

  • Research Question 1: How are data mining methodologies applied (‘as-is’ versus adapted)? This question aims to identify application and usage patterns and trends of data mining methodologies.
  • Research Question 2: How have existing data mining methodologies been adapted? This question aims to identify and classify adaptation patterns and scenarios of data mining methodologies.
  • Research Question 3: For what purposes have existing data mining methodologies been adapted? This question aims to identify, explain, classify and produce insights on the reasons for adapting existing data mining methodologies and the benefits achieved. Specifically, what gaps do these adaptations seek to fill, and what have been their benefits? Such systematic evidence and insights will be valuable input to a potentially new, refined data mining methodology, and will be of interest to practitioners and researchers.

Data collection strategy

Our data collection and search strategy followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015) . It defined the scope of the search, selection of literature and electronic databases, search terms and strings as well as screening procedures.

Primary search

The primary search aimed to identify an initial set of papers. To this end, the search strings were derived from the research objective and research questions. The term ‘data mining’ was the key term, but we also included ‘data analytics’ to be consistent with observed research practices. The terms ‘methodology’ and ‘framework’ were also included. Thus, the following search strings were developed and validated in accordance with the guidelines suggested by Kitchenham, Budgen & Brereton (2015) :

(‘data mining methodology’) OR (‘data mining framework’) OR (‘data analytics methodology’) OR (‘data analytics framework’)
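
As an illustration of how the four phrases follow from the two subject terms and the two artifact terms, the search string can be assembled programmatically. This is only a sketch: the names `SUBJECTS`, `ARTIFACTS` and `build_search_string` are hypothetical, straight quotes stand in for the typographic ones, and each database has its own query syntax into which the string would still need to be translated.

```python
from itertools import product

# Terms named in the text: two subject terms and two artifact terms.
SUBJECTS = ["data mining", "data analytics"]
ARTIFACTS = ["methodology", "framework"]


def build_search_string(subjects, artifacts):
    """Combine every subject with every artifact and join the phrases with OR."""
    phrases = [f"{subject} {artifact}" for subject, artifact in product(subjects, artifacts)]
    return " OR ".join(f"('{phrase}')" for phrase in phrases)


query = build_search_string(SUBJECTS, ARTIFACTS)
print(query)
```

Keeping the term lists separate from the joining logic makes it straightforward to see which phrase combinations a search covers.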

The search strings were applied to the indexed scientific databases Scopus and Web of Science (for ‘peer-reviewed’, academic literature) and to the non-indexed Google Scholar (for non-peer-reviewed, so-called ‘grey’ literature). The decision to cover ‘grey’ literature in this research was motivated as follows. As proposed in a number of publications in the information systems and software engineering domains (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019), SLR as a stand-alone method may not provide sufficient insight into the ‘state of practice’. It was also identified (Garousi, Felderer & Mäntylä, 2016) that ‘grey’ literature can give substantial benefits in certain areas of software engineering, in particular when the topic of research is related to industrial and practical settings. Taking into consideration the research objective, which is to investigate data mining methodologies application practices, we have opted to include elements of Multivocal Literature Review (MLR)¹ in our study. Also, Kitchenham, Budgen & Brereton (2015) recommend including ‘grey’ literature to minimize publication bias, as positive results and research outcomes are more likely to be published than negative ones. Following MLR practices, we also designed inclusion criteria for types of ‘grey’ literature, reported below.

The selection of databases is motivated as follows. For peer-reviewed literature sources, we sought to avoid potential omission bias. This bias is discussed in IS research (Levy & Ellis, 2006) and arises when research is concentrated in a limited set of disciplinary data sources. Thus, a broad selection of data sources was evaluated, including multidisciplinary (Scopus, Web of Science, Wiley Online Library) and domain-oriented (ACM Digital Library, IEEE Xplore Digital Library) scientific electronic databases. Multidisciplinary databases were selected due to their wider domain coverage, and it was validated and confirmed that they include publications originating from domain-oriented databases such as ACM and IEEE. Among the multidisciplinary databases, Scopus was selected for its widest possible coverage (it is the world's largest database, covering approx. 80% of all international peer-reviewed journals), while Web of Science was selected for its longer temporal range. Thus, the two databases complement each other. The non-indexed database selected for ‘grey’ literature is Google Scholar, as it is a comprehensive source of both academic and ‘grey’ literature publications and is extensively referred to as such (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Further, Garousi, Felderer & Mäntylä (2019) presented a three-tier categorization framework for types of ‘grey’ literature. In our study, we restricted ourselves to 1st tier ‘grey’ literature publications from a limited number of ‘grey’ literature producers. In particular, from the list of producers (Neto et al., 2019), we adopted and focused on government departments and agencies; non-profit economic and trade organizations (‘think-tanks’) and professional associations; academic and research institutions; and businesses and corporations (consultancy companies and established private companies). The selected 1st tier ‘grey’ literature items include: (1) government, academic, and private sector consultancy reports², (2) theses (not lower than Master level) and PhD dissertations, (3) research reports, (4) working papers, and (5) conference proceedings and preprints. By restricting inclusion to 1st tier ‘grey’ literature, we mitigate the quality assessment challenge that is especially relevant to, and reported for, such literature (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Scope and domains inclusion

As recommended by Kitchenham, Budgen & Brereton (2015), it is necessary to define the research scope at the outset. To clarify the scope, we defined what is excluded from this research. The following aspects are not included in the scope of our study:

  • Context of technology and infrastructure for data mining/data analytics tasks and projects.
  • Application of granular methods within the data mining process itself or to data mining tasks, for example, constructing business queries or applying regression or neural network modeling techniques to solve classification problems. Studies with granular methods are included in the primary texts corpus as long as the method application is part of an overall methodological approach.
  • Technological aspects of data mining, for example, data engineering, dataflows and workflows.
  • Traditional statistical methods not associated with data mining directly including statistical control methods.

Similarly to Budgen et al. (2006) and Levy & Ellis (2006), initial piloting revealed that search engines retrieved literature from all major scientific domains, including ones outside the authors’ area of expertise (e.g., medicine). Even though such studies could be retrieved, it would be impossible for us to analyze and correctly interpret literature published outside our area of expertise. The search strategy was therefore adjusted by retaining domains closely associated with Information Systems and Software Engineering research. Thus, for the Scopus database the final set of included domains was limited to nine and included Computer Science; Engineering; Mathematics; Business, Management and Accounting; Decision Science; Economics, Econometrics and Finance; and Multidisciplinary as well as Undefined studies. Excluded domains covered 11.5%, or 106 out of 925, publications; it was confirmed in the validation process that they primarily focused on specific case studies in fundamental sciences and medicine³. The included domains from the Scopus database were mapped to Web of Science to ensure a consistent approach across databases, and the correctness of the mapping was validated.
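
The reported share of excluded publications can be checked with a few lines of Python. The variable names are illustrative; the two counts are the ones given above. The remainder equals the 819 Scopus publications reported for Step 1 in the results, which suggests, although the text does not state it explicitly, that the 925 figure refers to Scopus alone.

```python
# Counts reported in the text for the domain restriction.
retrieved = 925   # publications retrieved before restricting domains
excluded = 106    # publications belonging to excluded domains

excluded_share = excluded / retrieved
retained = retrieved - excluded

print(f"excluded share: {excluded_share:.1%}")
print(f"retained after domain restriction: {retained}")
```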

Screening criteria and procedures

Based on SLR practices (as in Kitchenham, Budgen & Brereton (2015) and Brereton et al. (2007)) and the defined SLR scope, we designed multi-step screening procedures (quality and relevancy) with an associated set of Screening Criteria and a Scoring System. The purpose of relevancy screening is to find relevant primary studies in an unbiased way (Vanwersch et al., 2011). Quality screening, on the other hand, aims to assess the relevant primary studies in terms of quality, also in an unbiased way.

Screening Criteria consisted of two subsets: Exclusion Criteria, applied for initial filtering, and Relevance Criteria, also known as Inclusion Criteria.

Exclusion Criteria were initial threshold quality controls aimed at eliminating studies with limited or no scientific contribution. The exclusion criteria also address issues of understandability, accessibility and availability. The Exclusion Criteria were as follows:

  • Quality 1: The publication item is not in English (understandability).
  • Quality 2: The publication item is a duplicate, which occurs when:
      ◦ the same document is retrieved from two or all three databases;
      ◦ different versions of the same publication are retrieved (i.e., the same study published in different sources); based on best practices, the decision rule is that the most recent paper is retained as well as the one with the highest score (Kofod-Petersen, 2014);
      ◦ a publication is published both as a conference proceeding and as a journal article with the same name and same authors, or as an extended version of a conference paper, in which case the latter is selected.
  • Quality 3: The length of the publication is less than 6 pages; short papers do not have the space to expand and discuss the presented ideas in sufficient depth for us to examine.
  • Quality 4: The paper is not accessible in full length online through the university subscription of databases and via Google Scholar; without full availability we cannot assess and analyze the text.

The initially retrieved list of papers was filtered based on the Exclusion Criteria. Only papers that passed all criteria were retained in the final studies corpus. The mapping of criteria to screening steps is shown in Fig. 4.
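
The filtering logic can be sketched in Python as follows. This is a simplified illustration rather than the procedure actually used: the `Publication` record and its field names are hypothetical, all criteria are applied in a single pass (whereas the review spreads them over several screening steps), and duplicates are resolved only by keeping the most recent version, leaving aside the score-based and journal-over-conference rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """Hypothetical record; the field names are illustrative only."""
    title: str
    authors: tuple
    year: int
    language: str
    pages: int
    full_text_available: bool


def apply_exclusion_criteria(publications, min_pages=6):
    """Return the publications that pass every exclusion criterion."""
    newest = {}
    for pub in publications:
        # Non-English items are excluded (understandability).
        if pub.language != "English":
            continue
        # Items shorter than the minimum length are excluded.
        if pub.pages < min_pages:
            continue
        # Items whose full text cannot be accessed are excluded.
        if not pub.full_text_available:
            continue
        # Duplicates: keep only the most recent version of the same study.
        key = (pub.title.casefold(), pub.authors)
        if key not in newest or pub.year > newest[key].year:
            newest[key] = pub
    return list(newest.values())


# Toy input, invented solely to exercise the function.
sample = [
    Publication("A DM Framework", ("Doe",), 2010, "English", 12, True),
    Publication("A DM framework", ("Doe",), 2012, "English", 14, True),
    Publication("Short Note", ("Roe",), 2015, "English", 4, True),
    Publication("Ein Rahmenwerk", ("Muster",), 2011, "German", 20, True),
]
print([(p.title, p.year) for p in apply_exclusion_criteria(sample)])
```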

Relevance Criteria were designed to identify relevant publications and are presented in Table 2 below, while their mapping to the respective process steps is presented in Fig. 4. These criteria were applied iteratively.

As a final SLR step, a quality assessment of the full texts was performed with purpose-built Scoring Metrics (in line with Kitchenham & Charters (2007)), presented in Table 3 below.

Data extraction and screening process

The data extraction and screening process is presented in Fig. 4. In Step 1, the initial publication lists were retrieved from the pre-defined databases: Scopus, Web of Science, and Google Scholar. The lists were merged and duplicates eliminated in Step 2. Afterwards, texts shorter than 6 pages were excluded (Step 3). Steps 1–3 were guided by the Exclusion Criteria. In the next stage (Step 4), publications were screened by title based on the pre-defined Relevance Criteria. Those that passed were evaluated for availability (Step 5). If a study was available, it was evaluated again by the same pre-defined Relevance Criteria applied to the abstract, conclusion and, if necessary, introduction (Step 6). Those that passed this threshold formed the primary publications corpus, extracted from the databases in full. These primary texts were evaluated again based on the full text (Step 7), applying the Relevance Criteria first and then the Scoring Metrics.

Results and quantitative analysis

In Step 1, 1,715 publications were extracted from the relevant databases with the following composition: Scopus (819), Web of Science (489), and Google Scholar (407). In terms of scientific publication domains, Computer Science (42.4%), Engineering (20.6%), and Mathematics (11.1%) accounted for approx. 74% of the texts originating from Scopus. The same applies to the Web of Science harvest. Application of the Exclusion Criteria produced the following results. In Step 2, after eliminating duplicates, 1,186 texts were passed on for minimum length evaluation, and 767 reached assessment by the Relevance Criteria.

As mentioned, the Relevance Criteria were applied iteratively (Steps 4–6) and in conjunction with the availability assessment. As a result, only 298 texts were retained for full evaluation, with 241 originating from scientific databases while 57 were ‘grey’. These studies formed the primary texts corpus, which was extracted, read in full and evaluated by the Relevance Criteria combined with the Scoring Metrics. The decision rule was set as follows: studies that scored “1” or “0” were rejected, while texts scoring “3” or “2” were admitted to the final primary studies corpus. Thus, as an outcome of the SLR-based, broad, cross-domain collection and screening of publications, we identified 207 relevant publications from peer-reviewed (156 texts) and ‘grey’ literature (51 texts). Figure 5 shows the number of studies published per year, broken down by ‘peer-reviewed’ and ‘grey’ literature, starting from 1997.
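
The screening funnel and the scoring decision rule can be summarized in a short Python sketch. All counts are those reported in the text; the stage labels, the helper name `is_admitted`, and the assumption that scores are integers from 0 to 3 are illustrative.

```python
# Counts reported in the text at each stage of the screening process.
sources = {"Scopus": 819, "Web of Science": 489, "Google Scholar": 407}
funnel = [
    ("retrieved (Step 1)", sum(sources.values())),
    ("after duplicate removal (Step 2)", 1186),
    ("after minimum-length check (Step 3)", 767),
    ("after relevance and availability screening (Steps 4-6)", 298),
    ("final corpus after full-text scoring (Step 7)", 207),
]


def is_admitted(score):
    """Decision rule from the text: scores 2 and 3 pass, 0 and 1 are rejected."""
    if score not in (0, 1, 2, 3):
        raise ValueError(f"score must be an integer from 0 to 3, got {score!r}")
    return score >= 2


# Share of publications surviving each stage relative to the previous one.
for (stage, count), (_, previous) in zip(funnel[1:], funnel):
    print(f"{stage}: {count} ({count / previous:.0%} of previous stage)")
```

Summing the three sources reproduces the 1,715 publications retrieved in Step 1, and the reported splits (241 + 57 and 156 + 51) add up to the 298 and 207 texts, respectively.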

In terms of composition, the ‘peer-reviewed’ studies corpus is well-balanced, with 72 journal articles and 82 conference papers, while book chapters account for only 4 instances. In contrast, in the ‘grey’ literature subset, articles in moderated and non-peer-reviewed journals are dominant (n = 34) compared to the overall number of conference papers (n = 13), followed by a small number of technical reports and pre-prints (n = 4).

Temporal analysis of the texts corpus (as per Fig. 5) resulted in two observations. Firstly, we note that stable and significant research interest (in terms of numbers) in the application of data mining methodologies started around a decade ago, in 2007. Research efforts made prior to 2007 were relatively limited, with the number of publications below 10. Secondly, we note that research on data mining methodologies has grown substantially since 2007, an observation supported by the constructed 3-year and 10-year mean trendlines. In particular, the number of publications has roughly tripled over the past decade, reaching an all-time high of 24 texts released in 2017.

Further, there are two distinct spike sub-periods, in the years 2007–2009 and 2014–2017, each followed by a stable pattern with an overall higher number of publications released annually. This observation is in line with the trend of increased penetration of data mining methodologies, tools, cross-industry applications and academic research.

Findings and Discussion

In this section, we address the research questions of the paper. Initially, as part of RQ1, we present an overview of data mining methodologies ‘as-is’ and adaptation trends. In addressing RQ2, we further classify the adaptations identified. Then, in the RQ3 subsection, each category identified under RQ2 is analyzed with a particular focus on the goals of the adaptations.

RQ1: How are data mining methodologies applied (‘as-is’ vs. adapted)?

The first research question examines the extent to which data mining methodologies are used ‘as-is’ versus adapted. Our review based on 207 publications identified two distinct paradigms on how data mining methodologies are applied. The first is ‘as-is’ where the data mining methodologies are applied as stipulated. The second is with ‘adaptations’; that is, methodologies are modified by introducing various changes to the standard process model when applied.

We have aggregated research by decades to differentiate the application pattern between two time periods: 1997–2007, with limited data mining application, and 2008–2018, with more intensive application. This cut has been guided not only by the extracted publications corpus but also by earlier surveys. In particular, ten new methodologies were proposed in the pre-2007 research, but only two new methodologies have been proposed since then. Thus, a distinct trend observed over the last decade is a large number of extensions and adaptations being proposed rather than entirely new methodologies.

We note that during the first decade of our time scope (1997–2007), the ratio of data mining methodologies applied ‘as-is’ was 40% (as presented in Fig. 6A). However, the same ratio for the following decade is 32% (Fig. 6B). Thus, in terms of relative shares, we note a clear decrease in using data mining methodologies ‘as-is’ in favor of adapting them to cater to specific needs. The trend is even more pronounced when comparing absolute numbers: adaptations more than tripled (from 30 to 106), while the ‘as-is’ scenario increased modestly (from 20 to 51). Given this finding, we continue by analyzing how data mining methodologies have been adapted under RQ2.
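
The shares and growth factors quoted above follow directly from the four reported counts, as the following Python sketch shows. The dictionary layout and the helper `as_is_share` are illustrative only.

```python
# Study counts per period as reported in the text.
periods = {
    "1997-2007": {"as_is": 20, "adapted": 30},
    "2008-2018": {"as_is": 51, "adapted": 106},
}


def as_is_share(counts):
    """Fraction of studies in a period that apply a methodology 'as-is'."""
    return counts["as_is"] / (counts["as_is"] + counts["adapted"])


for name, counts in periods.items():
    print(f"{name}: 'as-is' share {as_is_share(counts):.0%}")

first, second = periods["1997-2007"], periods["2008-2018"]
print(f"adaptations grew by a factor of {second['adapted'] / first['adapted']:.2f}")
print(f"'as-is' studies grew by a factor of {second['as_is'] / first['as_is']:.2f}")
```

The four counts also add up to the 207 publications of the final corpus.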

RQ2: How have existing data mining methodologies been adapted?

We identified that data mining methodologies have been adapted to cater to specific needs. In order to categorize adaptations scenarios, we applied a two-level dichotomy, specifically, by applying the following decision tree:

  • Level 1 Decision: Has the methodology been combined with another methodology? If yes, the resulting methodology was classified in the ‘integration’ category. Otherwise, we posed the next question.
  • Level 2 Decision: Are any new elements (phases, tasks, deliverables) added to the methodology? If yes, we designate the resulting methodology as an ‘extension’ of the original one. Otherwise, we classify the resulting methodology as a modification of the original one.
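
The two-level decision tree translates into a small function. The sketch below assumes that each study has already been assessed on the two yes/no questions; the function and parameter names are hypothetical.

```python
def classify_adaptation(combined_with_other_methodology, adds_new_elements):
    """Classify an adapted methodology using the two-level decision tree.

    Level 1: combined with another methodology -> 'integration'.
    Level 2: new phases, tasks or deliverables added -> 'extension';
             otherwise -> 'modification'.
    """
    if combined_with_other_methodology:
        return "integration"
    if adds_new_elements:
        return "extension"
    return "modification"


print(classify_adaptation(True, False))
print(classify_adaptation(False, True))
print(classify_adaptation(False, False))
```

Because Level 1 is evaluated first, a methodology that is both combined with another one and extended with new elements is classified as an ‘integration’.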

Thus, three distinct types of adaptation scenarios can be distinguished:

  • Scenario ‘Modification’: introduces specialized sub-tasks and deliverables in order to address specific use cases or business problems. Modifications typically concentrate on granular adjustments to the methodology at the level of sub-phases, tasks or deliverables within the stages of existing reference frameworks (e.g., CRISP-DM or KDD). For example, Chernov et al. (2014), in a study of the mobile network domain, proposed an automated decision-making enhancement in the deployment phase. In addition, the evaluation phase was modified by using both conventional and own-developed performance metrics. Further, in a study performed within the financial services domain, Yang et al. (2016) present feature transformation and feature selection as sub-phases, thereby enhancing the data mining modeling stage.
  • Scenario ‘Extension’: primarily proposes significant extensions to reference data mining methodologies. Such extensions result in integrated data mining solutions, in data mining frameworks serving as a component or tool for automated IS systems, or in transformations of the methodologies to fit specialized environments. The main purposes of extensions are to integrate fully scaled data mining solutions into IS/IT systems and business processes and to provide a broader context with useful architectures, algorithms, etc. Adaptations in which extensions have been made elicit and explicitly present various artifacts in the form of system and model architectures, process views, workflows, and implementation aspects. A number of soft goals are also achieved, such as providing a holistic perspective on the data mining process and contextualizing it with organizational needs. This scenario also includes extensions in which data mining process methodologies are substantially changed and extended in all key phases to enable execution of the data mining life-cycle with new (Big) Data technologies and tools and in new prototyping and deployment environments (e.g., Hadoop platforms or real-time customer interfaces). For example, Kisilevich, Keim & Rokach (2013) extended traditional CRISP-DM data mining outcomes with a fully fledged Decision Support System (DSS) for the hotel brokerage business. The authors introduced spatial/non-spatial data management (extending data preparation) and analytical and spatial modeling capabilities (extending the modeling phase), and provided spatial display and reporting capabilities (enhancing the deployment phase). In the same work, domain knowledge was introduced in all phases of the data mining process, and usability and ease of use were also addressed.
  • Scenario ‘Integration’: combines a reference methodology, for example, CRISP-DM, with: (1) data mining methodologies originating from other domains (e.g., software engineering development methodologies), (2) organizational frameworks (Balanced Scorecard, Analytics Canvas, etc.), or (3) adjustments to accommodate Big Data technologies and tools. Also, adaptations in the form of ‘Integration’ typically introduce various types of ontologies and ontology-based tools, domain knowledge, software engineering, and BI-driven framework elements. Fundamental adjustments of the data mining process to new types of data and IS architectures (e.g., real-time data, multi-layer IS) are also presented. The key gaps addressed by such adjustments are the prescriptive nature and low degree of formalization of CRISP-DM, its obsolescence with respect to tools, and its lack of integration with other organizational frameworks. For example, Brisson & Collard (2008) developed the KEOPS data mining methodology (CRISP-DM based), centered on domain knowledge integration. An ontology-driven information system was proposed, with integration and enhancements to all steps of the data mining process. Further, integrated expert knowledge used in all data mining phases was shown to produce value in the data mining process.

To examine how the application scenarios of data mining methodologies have developed over time, we mapped peer-reviewed texts and ‘grey’ literature to the respective adaptation scenarios, aggregated by decades (as presented in Fig. 7 for peer-reviewed and Fig. 8 for ‘grey’ literature).

For peer-reviewed research, this temporal analysis resulted in three observations. Firstly, research efforts in each adaptation scenario have been growing, and the number of publications more than quadrupled (128 vs. 28). Secondly, as noted above, the relative proportion of ‘as-is’ studies has been diluted (from 39% to 33%) and primarily replaced by the ‘Extension’ paradigm (from 25% to 30%). In contrast, in relative terms, the gains of the ‘Modification’ and ‘Integration’ paradigms are modest. This finding is reinforced by a third observation: the most notable gap, in terms of a modest number of publications, remains in the ‘Integration’ category, where, excluding the 2008–2009 spike, research efforts are limited and the number of texts is just 13. This is in stark contrast with the prolific research in the ‘Extension’ category, though the latter is concentrated in recent years. We can hypothesize that existing reference methodologies do not accommodate and support the increasing complexity of data mining projects and IS/IT infrastructure, as well as the specifics of certain domains, and as such need to be adapted.

In ‘grey’ literature, in contrast to peer-reviewed research, growth in the number of publications is less pronounced: 29 vs. 22 publications, or 32%, comparing across the two decades (as per Fig. 8). The growth is solely driven by the application of ‘Integration’ scenarios (13 vs. 4 publications), while both ‘as-is’ and the other adaptation scenarios are stagnating or in decline.
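
The growth figures for the two literature types can be recomputed from the reported counts. In the sketch below the dictionary layout is illustrative, and the assignment of the counts to the 1997–2007 and 2008–2018 periods follows the decade split introduced under RQ1.

```python
# Publication counts per decade as reported in the text.
counts = {
    "peer-reviewed": {"1997-2007": 28, "2008-2018": 128},
    "grey": {"1997-2007": 22, "2008-2018": 29},
}

growth = {}
for kind, by_decade in counts.items():
    earlier, later = by_decade["1997-2007"], by_decade["2008-2018"]
    growth[kind] = (later - earlier) / earlier
    print(f"{kind}: factor {later / earlier:.2f}, relative growth {growth[kind]:.0%}")
```

Peer-reviewed publications grow by a factor of more than four, while ‘grey’ literature grows by roughly a third; the per-type totals (156 and 51) match the composition of the final corpus.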

RQ3: For what purposes have existing data mining methodologies been adapted?

We address the third research question by analyzing what gaps the data mining methodology adaptations seek to fill and the benefits of such adaptations. We identified three adaptation scenarios, namely ‘Modification’, ‘Extension’, and ‘Integration’. Here, we analyze each of them.

Modification

Modifications of data mining methodologies are present in 30 peer-reviewed and 4 ‘grey’ literature studies. The analysis shows that modifications overwhelmingly consist of specific case studies. However, the major point of difference from ‘as-is’ case studies is the clear presence of specific adjustments to standard data mining process methodologies. Yet the proposed modifications and their purposes do not go beyond the phases of traditional data mining methodologies. They are granular, specialized and executed at the level of tasks, sub-tasks, and deliverables. With modifications, authors describe potential business applications and deployment scenarios at a conceptual level, but typically do not report or present real implementations in IS/IT systems and business processes.

Further, this research subcategory can best be classified by the domains in which the case studies were performed and the modification scenarios executed. We have identified four distinct domain-driven applications, presented in Fig. 9.

IT, IS domain

The largest number of publications (14, or approx. 40%) concerned IT and IS security, software development, and specific data mining and processing topics. Authors address the intrusion detection problem in Hossain, Bridges & Vaughn (2003), Fan, Ye & Chen (2016), and Lee, Stolfo & Mok (1999); specialized algorithms for processing a variety of data types in Yang & Shi (2010), Chen et al. (2001), Yi, Teng & Xu (2016), and Pouyanfar & Chen (2016); and effective and efficient management of computer and mobile networks in Guan & Fu (2010), Ertek, Chi & Zhang (2017), Zaki & Sobh (2005), Chernov, Petrov & Ristaniemi (2015), and Chernov et al. (2014).

Manufacturing and engineering

The next most popular research area is manufacturing/engineering, with 10 case studies. The central topic here is high-technology manufacturing, for example, the semiconductor-related study of Chien, Diaz & Lan (2014), and various complex prognostics case studies in the rail and aerospace domains (Létourneau et al., 2005; Zaluski et al., 2011) concentrated on failure predictions. These are complemented by studies on equipment fault and failure prediction and maintenance (Kumar, Shankar & Thakur, 2018; Kang et al., 2017; Wang, 2017) as well as on monitoring systems (García et al., 2017).

Sales and services, incl. financial industry

The third category is represented by seven business application papers concerning customer service, targeting and advertising (Karimi-Majd & Mahootchi, 2015; Reutterer et al., 2017; Wang, 2017), credit risk assessment in financial services (Smith, Willis & Brooks, 2000), supply chain management (Nohuddin et al., 2018), property management (Yu, Fung & Haghighat, 2013), and similar topics.

As a consequence of this specialization, these studies concentrate on developing a ‘state-of-the-art’ solution to the respective domain-specific problem.

Extension

The ‘Extension’ scenario was identified in 46 peer-reviewed and 12 ‘grey’ publications. We noted that extensions to existing data mining methodologies were made for four major purposes:

  • Purpose 1: To implement a fully scaled, integrated data mining solution and a regular, repeatable knowledge discovery process, addressing model and algorithm deployment and implementation design (including architecture, workflows and corresponding IS integration). A complementary goal is to tackle the changes to business processes needed to incorporate data mining into organizational activities.
  • Purpose 2: To implement complex, specifically designed systems and integrated business applications with a data mining model/solution as a component or tool. Typically, this adaptation is also oriented towards Big Data specifics, and is complemented by proposed artifacts such as Big Data architectures, system models, workflows, and data flows.
  • Purpose 3: To implement data mining as part of integrated/combined specialized infrastructure, data environments and types (e.g., IoT, cloud, mobile networks).
  • Purpose 4: To incorporate context-awareness aspects.

The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A1). The main purposes of adaptations, the associated gaps and/or benefits, and the observations and artifacts are documented in Fig. 10 below.

[Figure 10: Main purposes of ‘Extension’ adaptations, with associated gaps and/or benefits, observations and artifacts.]

In the ‘Extension’ category, studies executed with Purpose 1 propose fully scaled, integrated data mining solutions comprising specific data mining models and associated frameworks and processes. The distinctive trait of this research subclass is that it ensures repeatability and reproducibility of the delivered data mining solution in different organizational and industry settings. Both the results of the data mining use case and its deployment and integration into IS/IT systems and associated business processes are presented explicitly. Thus, this ‘Extension’ subclass is geared towards specific solution design, tackling a concrete business or industrial problem or addressing specific research gaps, and in this respect resembles a comprehensive case study.

This direction is well exemplified by the expert finder system for research social network services proposed by Sun et al. (2015), the data mining solution for functional test content optimization by Wang (2015), and the time-series mining framework for estimating unobservable time series by Hu et al. (2010). Similarly, Du et al. (2017) tackle online log anomaly detection, while automated association rule mining is addressed by Çinicioğlu et al. (2011), software effort estimation by Deng, Purvis & Purvis (2011), and visual discovery of network patterns by Simoff & Galloway (2008). A number of studies address solutions in IS security (Shin & Jeong, 2005), manufacturing (Güder et al., 2014; Chee, Baharudin & Karkonasasi, 2016), materials engineering (Doreswamy, 2008), and business domains (Xu & Qiu, 2008; Ding & Daniel, 2007).

In contrast, ‘Extension’ studies executed for Purpose 2 concentrate on the design of complex, multi-component information systems and architectures. These are holistic, complex systems and integrated business applications with a data mining framework serving as a component or tool. Moreover, the data mining methodology in these studies is extended with systems integration phases.

For example, Mobasher (2007) presents a data mining application in a Web personalization system and its associated process; here, the data mining cycle is extended in all phases with the ultimate goal of leveraging multiple data sources and using the discovered models and corresponding algorithms in an automatic personalization system. The author comprehensively addresses data processing, algorithm and design adjustments, and their integration into the automated system. Similarly, Haruechaiyasak, Shyu & Chen (2004) tackle the improvement of a Webpage recommender system by presenting an extended data mining methodology that includes the design and implementation of a data mining model. A holistic view of web mining, with support for all data sources, integration of data warehousing and data mining techniques, and multiple problem-oriented analytical outcomes with rich business application scenarios (personalization, adaptation, profiling, and recommendations) in the e-commerce domain, was proposed and discussed by Büchner & Mulvenna (1998). Further, Singh et al. (2014) tackled a scalable implementation of a Network Threat Intrusion Detection System. In this study, the data mining methodology and resulting model are extended, scaled and deployed as a module of a quasi-real-time system for capturing Peer-to-Peer Botnet attacks. A similarly complex solution was presented in a series of publications by Lee et al. (2000, 2001), who designed a real-time data mining-based Intrusion Detection System (IDS). These works are complemented by the comprehensive study of Barbará et al. (2001), who constructed an experimental testbed for intrusion detection with data mining methods. A detection model combining data fusion and mining, with respective components for Botnet identification, was also developed by Kiayias et al. (2009). A similar approach is presented in Alazab et al. (2011), who proposed and implemented a zero-day malware detection system with an associated machine-learning based framework. Finally, Ahmed, Rafique & Abulaish (2011) presented a multi-layer framework for fuzzy attack in 3G cellular IP networks.

A number of authors have considered data mining methodologies in the context of Decision Support Systems and other systems that generate information for decision-making, across a variety of domains. For example, Kisilevich, Keim & Rokach (2013) significantly extended the data mining methodology by designing and presenting an integrated Decision Support System (DSS) with six components, which acts as a supporting tool for the hotel brokerage business to increase deal profitability. A similar approach is taken by Capozzoli et al. (2017), who focus on improving the energy management of properties by providing occupancy pattern information and a reconfiguration framework. Kabir (2016) presented a data mining information service providing improved sales forecasting that supported a solution to the under/over-stocking problem, while Lau, Zhang & Xu (2018) addressed sales forecasting with sentiment analysis on Big Data. Kamrani, Rong & Gonzalez (2001) proposed a GA-based Intelligent Diagnosis system for fault diagnostics in the manufacturing domain. The latter topic was tackled further in Shahbaz et al. (2010) with a complex, integrated data mining system for diagnosing and solving manufacturing problems in real time.

Lenz, Wuest & Westkämper (2018) propose a framework for capturing data analytics objectives and creating holistic, cross-departmental data mining systems in the manufacturing domain. This work is representative of a cohort of studies that aim at extending data mining methodologies in order to support the design and implementation of enterprise-wide data mining systems. In this same research cohort, we classify Luna, Castro & Romero (2017) , which presents a data mining toolset integrated into the Moodle learning management system, with the aim of supporting university-wide learning analytics.

One study addresses the concept of multi-agent based data mining. Khan, Mohamudally & Babajee (2013) developed a unified theoretical framework for data mining by formulating a unified data mining theory. The framework is tested by means of agent programming, proposing integration into a multi-agent system, which is useful due to its scalability, robustness and simplicity.

The subcategory of ‘Extension’ research executed with Purpose 3 is devoted to data mining methodologies and solutions in specialized IT/IS, data and process environments that emerged recently as a consequence of the development of Big Data technologies and tools. Exemplary studies include research on IoT environments, for example, the Smart City application in IoT presented by Strohbach et al. (2015). In the same domain, Bashir & Gill (2016) addressed IoT-enabled smart buildings, with the additional challenge of large amounts of high-speed real-time data and requirements for real-time analytics. The authors proposed an integrated IoT Big Data Analytics framework. This research is complemented by the interdisciplinary study of Zhong et al. (2017), where IoT and wireless technologies are used to create an RFID-enabled environment that produces analysis of KPIs to improve logistics.

A significant number of studies address various mobile environments, sometimes complemented by cloud-based environments, or cloud-based environments on their own. Gomes, Phua & Krishnaswamy (2013) addressed mobile data mining executed on the mobile device itself; the framework proposes an innovative approach that extends all aspects of data mining, including contextual data, end-user privacy preservation, data management and scalability. Yuan, Herbert & Emamian (2014) and Yuan & Herbert (2014) introduced a cloud-based mobile data analytics framework with an application case study for a smart home based monitoring system. Cuzzocrea, Psaila & Toccu (2016) presented the innovative FollowMe suite, which implements a data mining framework for mobile social media analytics comprising several tools with their respective architecture and functionalities. An interesting paper was presented by Torres et al. (2017), who addressed a data mining methodology and its implementation for congestion prediction in mobile LTE networks, also tackling feedback reaction by triggering network reconfigurations.

Further, Biliri et al. (2014) presented a cloud-based Future Internet Enabler, an automated social data analytics solution that also addresses the Social Network Interoperability aspect, supporting enterprises in interconnecting and utilizing social networks for collaboration. Real-time streamed social media data, and the resulting data mining methodology and application, were extensively discussed by Zhang, Lau & Li (2014). The authors proposed the design of a comprehensive ABIGDAD framework with seven main components implementing data mining based deceptive review identification. An interdisciplinary study tackling both these topics was developed by Puthal et al. (2016), who proposed an integrated framework and architecture for a disaster management system based on streamed data in a cloud environment, ensuring end-to-end security. Additionally, key extensions to the data mining framework were proposed, merging a variety of data sources and types, security verification and data flow access controls. Finally, cloud-based manufacturing was addressed in the context of fault diagnostics by Kumar et al. (2016).

Also, Mahmood et al. (2013) tackled Wireless Sensor Networks and the extensions required to the associated data mining framework. Interesting work was done by Nestorov & Jukic (2003), who addressed the rarely studied topic of integrating data mining solutions within traditional data warehouses and actively mining the data repositories themselves.

Supported by a new generation of visualization technologies (including Virtual Reality environments), Wijayasekara, Linda & Manic (2011) proposed and implemented CAVE-SOM (a 3D visual data mining framework), which offers interactive, immersive visual data mining with multiple visualization modes supported by a plethora of methods. An earlier visual data mining framework was successfully developed and presented by Ganesh et al. (1996).

Large-scale social media data is successfully tackled by Lemieux (2016) with a comprehensive framework accompanied by a set of data mining tools and an interface. Real-time data analytics was addressed by Shrivastava & Pal (2017) in the domain of enterprise service ecosystems. Image data was addressed in Huang et al. (2002), who proposed a multimedia data mining framework and its implementation with user relevance feedback integration and instance learning. Further, the explosion in data diversity and the associated need to extend standard data mining are addressed by Singh et al. (2016) in a study devoted to object detection in video surveillance systems supporting real-time video analysis.

Finally, a limited number of studies address context awareness (Purpose 4) and extend the data mining methodology with context elements and adjustments. In comparison with research in the ‘Integration’ category, these studies operate at a lower abstraction level, capturing and presenting a list of adjustments. Singh, Vajirkar & Lee (2003) generate a taxonomy of context factors, develop an extended data mining framework and propose a deployment that includes a detailed IS architecture. The context-awareness aspect is also addressed in papers reviewed above, for example, Lenz, Wuest & Westkämper (2018), Kisilevich, Keim & Rokach (2013), and Sun et al. (2015), among other studies.

Integration

The ‘Integration’ scenario for data mining methodologies was identified in 27 peer-reviewed and 17 ‘grey’ studies. Our analysis revealed that this adaptation scenario, which operates at a higher abstraction level, is typically executed with five key purposes:

  • Purpose 1: to integrate/combine with various ontologies existing in the organization.
  • Purpose 2: to introduce context-awareness and incorporate domain knowledge.
  • Purpose 3: to integrate/combine with frameworks, process methodologies and concepts from other research or industry domains.
  • Purpose 4: to integrate/combine with other well-known organizational governance frameworks, process methodologies and concepts.
  • Purpose 5: to accommodate and/or leverage newly available Big Data technologies, tools and methods.

The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A2). The main purposes of adaptations, the associated gaps and/or benefits, and the observations and artifacts are documented in Fig. 11 below.

[Figure 11: Main purposes of ‘Integration’ adaptations, with associated gaps and/or benefits, observations and artifacts.]

As mentioned, a number of studies concentrate on proposing ontology-based integrated data mining frameworks accompanied by various types of ontologies (Purpose 1). For example, Sharma & Osei-Bryson (2008) focus on an ontology-based organizational view with Actors, Goals and Objectives, which supports execution of the Business Understanding phase. Brisson & Collard (2008) propose the KEOPS framework, which is CRISP-DM compliant and integrates a knowledge base and an ontology with the purpose of building an ontology-driven information system (OIS) for the business and data understanding phases, while the knowledge base is used for the post-processing step of model interpretation. Park et al. (2017) propose and design a comprehensive ontology-based data analytics tool, IRIS, with the purpose of aligning analytics and business. IRIS is based on the concept of connecting dots, analytics methods or transforming insights into business value, and supports a standardized process for applying ontology to match business problems and solutions.

Further, Ying et al. (2014) propose a domain-specific data mining framework oriented to the business problem of customer demand discovery. They construct an ontology for customer demand and for the customer demand discovery task, which allows structured knowledge extraction in the form of knowledge patterns and rules. Here, the purpose is to facilitate business value realization and support the actionability of extracted knowledge via marketing strategies and tactics. In the same vein, Cannataro & Comito (2003) presented an ontology for the Data Mining domain whose main goal is to simplify the development of distributed knowledge discovery applications. The authors offer domain experts a reference model of the different kinds of data mining tasks, methodologies, and software capable of solving a given business problem, helping them find the most appropriate solution.

Apart from ontologies, Sharma & Osei-Bryson (2009), in another study, propose an IS-inspired data mining methodology driven by an Input-Output model, which supports formal implementation of the Business Understanding phase. This research exemplifies studies executed with Purpose 2. The goal of the paper is to tackle the prescriptive nature of CRISP-DM and address how the entire process can be implemented. The study of Cao, Schurmann & Zhang (2005) is also exemplary in aggregating and introducing several fundamental concepts into the traditional CRISP-DM data mining cycle: context awareness, in-depth pattern mining, human–machine cooperative knowledge discovery (in essence, following the human-centricity paradigm in data mining), and a loop-closed iterative refinement process (similar to Agile-based methodologies in software development). Several further concepts, such as data, domain, interestingness, and rules, are proposed to tackle a number of fundamental constraints identified in CRISP-DM. They have been discussed and further extended by Cao & Zhang (2007, 2008) and Cao (2010) into an integrated domain-driven data mining concept, resulting in the fully fledged D3M (domain-driven) data mining framework. Interestingly, the same concepts are investigated individually by other authors; for example, context-aware data mining methodology is tackled by Xiang (2009a, 2009b) in the context of the financial sector. Pournaras et al. (2016) addressed the crucial topic of privacy preservation in the context of achieving an effective data analytics methodology. The authors introduced metrics and a self-regulatory (reconfigurable) information sharing mechanism that provides customers with controls over information disclosure.

A number of studies have proposed CRISP-DM adjustments based on existing frameworks, process models or concepts originating in other domains (Purpose 3), for example, software engineering (Marbán et al., 2007, 2009; Marban, Mariscal & Segovia, 2009) and industrial engineering (Solarte, 2002; Zhao et al., 2005).

Meanwhile, Mariscal, Marbán & Fernández (2010) proposed a new refined data mining process based on a global comparative analysis of existing frameworks while Angelov (2014) outlined a data analytics framework based on statistical concepts. Following a similar approach, some researchers suggest explicit integration with other areas and organizational functions, for example, BI-driven Data Mining by Hang & Fong (2009) . Similarly, Chen, Kazman & Haziyev (2016) developed an architecture-centric agile Big Data analytics methodology, and an architecture-centric agile analytics and DevOps model. Alternatively, several authors tackled data mining methodology adaptations in other domains, for example, educational data mining by Tavares, Vieira & Pedro (2017) , decision support in learning management systems ( Murnion & Helfert, 2011 ), and in accounting systems ( Amani & Fadlalla, 2017 ).

Other studies are concerned with actionability of data mining and closer integration with business processes and organizational management frameworks (Purpose 4). In particular, there is a recurrent focus on embedding data mining solutions into knowledge-based decision making processes in organizations, and supporting fast and effective knowledge discovery ( Bohanec, Robnik-Sikonja & Borstnar, 2017 ).

Examples of adaptations made for this purpose include: (1) integration of CRISP-DM with the Balanced Scorecard framework used for strategic performance management in organizations (Yun, Weihua & Yang, 2014); (2) integration with a strategic decision-making framework for revenue management (Segarra et al., 2016); (3) integration with a strategic analytics methodology (Van Rooyen & Simoff, 2008); and (4) integration with a so-called ‘Analytics Canvas’ for management of portfolios of data analytics projects (Kühn et al., 2018). Finally, Ahangama & Poo (2015) explored methodological attributes important for adoption of data mining methodology by novice users. This latter study uncovered factors that could help reduce resistance to the use of data mining methodologies. In a complementary direction, Lawler & Joseph (2017) comprehensively evaluated factors that may increase the benefits of Big Data Analytics projects in an organization.

Lastly, a number of studies have proposed data mining frameworks (e.g., CRISP-DM) adaptations to cater for new technological architectures, new types of datasets and applications (Purpose 5). For example, Lu et al. (2017) proposed a data mining system based on a Service-Oriented Architecture (SOA), Zaghloul, Ali-Eldin & Salem (2013) developed a concept of self-service data analytics, Osman, Elragal & Bergvall-Kåreborn (2017) blended CRISP-DM into a Big Data Analytics framework for Smart Cities, and Niesen et al. (2016) proposed a data-driven risk management framework for Industry 4.0 applications.

Our analysis of RQ3, regarding the purposes of adaptations of existing data mining methodologies, revealed the following key findings. Firstly, adaptations of type ‘Modification’ are predominantly targeted at addressing problems that are specific to a given case study. The majority of modifications were made within the domain of IS security, followed by case studies in the domains of manufacturing and financial services. Secondly, and in clear contrast, adaptations of type ‘Extension’ are primarily aimed at customizing the methodology to take into account specialized development environments and deployment infrastructures, and to incorporate context-awareness aspects. Thirdly, a recurrent purpose of adaptations of type ‘Integration’ is to combine a data mining methodology with either existing ontologies in an organization or with other domain frameworks, methodologies, and concepts. ‘Integration’ is also used to instill context-awareness and domain knowledge into a data mining methodology, or to adapt it to specialized methods and tools, such as Big Data. The distinctive outcome and value (gaps filled) of ‘Integrations’ stem from improved knowledge discovery, better actionability of results, improved combination with key organizational processes and domain-specific methodologies, and improved usage of Big Data technologies.

We discovered that the adaptations of existing data mining methodologies found in the literature can be classified into three categories: modification, extension, or integration.

We also noted that adaptations are executed either to address deficiencies and missing elements or aspects in the reference methodology (chiefly CRISP-DM), or to improve certain phases, deliverables or process outcomes.

In short, adaptations are made to:

  • improve key phases of reference data mining methodologies; in the case of CRISP-DM, these are primarily the business understanding and deployment phases.
  • support knowledge discovery and actionability.
  • introduce context-awareness and a higher degree of formalization.
  • integrate the data mining solution more closely with key organizational processes and frameworks.
  • significantly update CRISP-DM with respect to Big Data technologies, tools, environments and infrastructure.
  • incorporate the broader, explicit context of architectures, algorithms and toolsets as integral deliverables or supporting tools for executing the data mining process.
  • expand towards and accommodate a broader, unified perspective for incorporating and implementing data mining solutions in the organization, IT infrastructure and business processes.

Threats to Validity

Systematic literature reviews have inherent limitations that must be acknowledged. These threats to validity include subjective bias (internal validity) and incompleteness of search results (external validity).

The internal validity threat stems from the subjective screening and rating of studies, particularly when assessing the studies with respect to relevance and quality criteria. We have mitigated these effects by documenting the survey protocol (SLR Protocol), strictly adhering to the inclusion criteria, and performing significant validation procedures, as documented in the Protocol.

The external validity threat relates to the extent to which the findings of the SLR reflect the actual state of the art in the field of data mining methodologies, given that the SLR only considers published studies that can be retrieved using specific search strings and databases. We have addressed this threat to validity by conducting trial searches to validate our search strings in terms of their ability to identify relevant papers that we knew about beforehand. Also, the fact that the searches led to 1,700 hits overall suggests that a significant portion of the relevant literature has been covered.

In this study, we have examined the use of data mining methodologies by means of a systematic literature review covering both peer-reviewed and ‘grey’ literature. We have found that the use of data mining methodologies, as reported in the literature, has grown substantially since 2007 (four-fold increase relative to the previous decade). Also, we have observed that data mining methodologies were predominantly applied ‘as-is’ from 1997 to 2007. This trend was reversed from 2008 onward, when the use of adapted data mining methodologies gradually started to replace ‘as-is’ usage.

The most frequent adaptations have been in the ‘Extension’ category. This category refers to adaptations that imply significant changes to key phases of the reference methodology (chiefly CRISP-DM). These adaptations particularly target the business understanding, deployment and implementation phases of CRISP-DM (or other methodologies). Moreover, we have found that the most frequent purposes of adaptations are: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). A key finding is that standard data mining methodologies do not pay sufficient attention to the deployment aspects required to scale and transform data mining models into software products integrated into large IT/IS systems and business processes.

Apart from the adaptations in the ‘Extension’ category, we have also identified an increasing number of studies focusing on the ‘Integration’ of data mining methodologies with other domain-specific and organizational methodologies, frameworks, and concepts. These adaptations are aimed at embedding the data mining methodology into broader organizational aspects.

Overall, the findings of the study highlight the need to develop refinements of existing data mining methodologies that would allow them to seamlessly interact with IT development platforms and processes (technological adaptation) and with organizational management frameworks (organizational adaptation). In other words, there is a need to frame existing data mining methodologies as being part of a broader ecosystem of methodologies, as opposed to the traditional view where data mining methodologies are defined in isolation from broader IT systems engineering and organizational management methodologies.

Supplemental Information

Supplemental information 1.

Unfortunately, we were not able to upload the graphs (original PNG files). We constructed the graph files following the examples in the PeerJ template on Overleaf, but could not determine why they did not fit; converting them to new formats would change the text flow and the generated PDF file. We therefore submit the graphs in an archive file as part of the supplementary material, and will do our best to redo them based on your instructions.

Supplemental Information 2

The file starts with a Definitions page, which lists and explains all column definitions as well as the SLR scoring metrics. The second page contains the "peer-reviewed" texts, while the next one contains the "grey" literature corpus.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare that they have no competing interests.

Veronika Plotnikova conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Marlon Dumas conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Fredrik Milani conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Medical Data Mining and Medical Intelligence Services

About this Research Topic

In the age of digital healthcare, the confluence of data science, artificial intelligence, and healthcare services has ushered in a new era of medical discovery and patient care. The sheer volume and complexity of medical data generated daily presents both a challenge and an extraordinary opportunity. This ...

Keywords : Medical Data, Machine Learning, Artificial Intelligence, Digital Healthcare

Important Note : All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.

COMMENTS

  1. 82 Data Mining Essay Topic Ideas & Examples

    Commercial Uses of Data Mining. Data mining process entails the use of large relational database to identify the correlation that exists in a given data. The principal role of the applications is to sift the data to identify correlations. A Discussion on the Acceptability of Data Mining.

  2. Data mining

    Data mining is the process of extracting potentially useful information from data sets. It uses a suite of methods to organise, examine and combine large data sets, including machine learning ...

  3. data mining Latest Research Papers

    Find the latest published documents for data mining, related hot topics, top authors, the most cited documents, and related journals. ... This research aims to detect the user's topics of interest in social media and rank them based on specific topics, domains, etc. Few ...

  4. 16 Data Mining Projects Ideas & Topics For Beginners [2024]

    2. GERF: Group Event Recommendation Framework. This is one of the simpler data mining projects, yet an exciting one. It is an intelligent solution for recommending social events, such as exhibitions, book launches, concerts, etc. A majority of the research focuses on suggesting upcoming attractions to individuals.

  5. 345193 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATA MINING. Find methods information, sources, references or conduct a literature review on DATA MINING

  6. Recent Advances in Data Mining

    Data mining is the procedure of identifying valid, potentially suitable, and understandable information; detecting patterns; building knowledge graphs; and finding anomalies and relationships in big data with Artificial-Intelligence-enabled IoT (AIoT). This process is essential for advancing knowledge in various fields dealing with raw data ...

  7. Recent advances in domain-driven data mining

    Data mining research has been significantly motivated by and benefited from real-world applications in novel domains. This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. Along with this special issue, we also organized a related ...

  8. (PDF) Trends in data mining research: A two-decade review using topic

    The research direction related to practical Applications of data mining also shows a tendency to grow. The last two topics, Text Mining and Data Streams have attracted steady interest from ...

  9. Data Mining Research Topics

    Data Mining Research Topics. Data mining is a rapidly growing field that involves extracting useful patterns and knowledge from large datasets. Researchers in this field study various techniques and algorithms to mine and analyze data for effective decision-making. If you are interested in pursuing research in data mining, this article explores ...

  10. Data Mining and Modeling

    Data Mining and Modeling. The proliferation of machine learning means that learned classifiers lie at the core of many products across Google. However, questions in practice are rarely so clean as to just use an out-of-the-box algorithm. A big challenge is in developing metrics, designing experimental methodologies, and modeling the space to ...

  11. Data Mining Research

    Data mining research has led to the development of useful techniques for analyzing time series data, including dynamic time warping [10] and Discrete Fourier Transforms (DFT) in combination with spatial queries [5]. To date, this work has paid little attention to query specification or interactive systems.

  12. 20 Interesting Data Mining Projects in 2024 (for Students)

    7) Anime recommendation system. This is one of the favorite data mining project ideas among students. An enthusiast in this field can easily get involved and excited by such topics. This data set contains information on user preference data from 73,516 users on 12,294 anime.

  13. Frontiers in Big Data

    John S Kimball. Nitesh V Chawla. Murat Kantarcioglu. Elena Ferrari. Dongwon Lee. Jean-Roch Vlimant. 19,637 views. 4 articles. Part of an innovative multidisciplinary journal, exploring a wide range of topics, such as intelligent data management, information retrieval, privacy-preserving data mining, and data visual analyt...

  14. Innovative Research Topics on Data Mining (Latest Titles)

    Research Topics on Data Mining offer you creative ideas to brighten your future in research. We have 100+ world-class professionals who apply their innovative ideas to your research project to improve your research. We have conducted 500+ workshops throughout the world, and a large ...

  15. Advances in Artificial Intelligence (AI)-Driven Data Mining

    AI-driven data mining explores algorithms and techniques that can handle large volumes of data and extract useful patterns with little human intervention. This Special Issue seeks new ideas, methods and achievements at the intersection of artificial intelligence and data mining. Topics of interest include, but are not limited to, the ...

  16. Trending Data Mining Thesis Topics

    Integration of MapReduce, Amazon EC2, S3, Apache Spark, and Hadoop into data mining. These are the recent trends in data mining. We insist that you choose one of the topics that interest you the most. Having an appropriate content structure or template is essential while writing a thesis.

  17. Machine Learning and Data Mining

    IBM Research has been one of the leaders in this field so far. As a member of the worldwide IBM Research organization, the IBM Tokyo Research Laboratory has played a crucial role in the area of data mining. In the late '90s, we were recognized for research accomplishments in extending the classical association rule discovery algorithm.

  18. Data mining in clinical big data: the frequently used databases, steps

    Therefore, data mining has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduced the main medical public database and described the steps, tasks, and models of data mining in simple language. Additionally, we described data-mining methods along with their practical applications.

  19. Artificial Intelligence and Machine Learning and Data Mining

    The Artificial Intelligence and Machine Learning and Data Mining research community expands the state of the art at these, the field's most prestigious and selective conferences: ... Research Topics: Graph mining; social network analysis; network science; temporal network analysis; combinatorial scientific computing; stream processing; ...

  20. What Is Data Mining?

    What is data mining? Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information from large data sets. Given the evolution of data warehousing technology and the growth of big data, adoption of data mining techniques has rapidly accelerated over the last couple of decades ...

  21. Adaptations of data mining methodologies: a systematic literature

    The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use systematic literature review (SLR) as scientific method for two reasons. Firstly, systematic review is based on trustworthy, rigorous, and auditable methodology.

  22. Medical Data Mining and Medical Intelligence Services

    This Research Topic on "Medical Data Mining and Medical Intelligence Services" is dedicated to exploring the multifaceted landscape where advanced data mining techniques meet the evolving needs of modern healthcare. This Research Topic serves as a platform to unite researchers, healthcare practitioners, data scientists, and industry experts to ...

  23. Latest Research and Thesis topics in Data Mining

    Topics to study in data mining. Data mining is a relatively new thing and many are not aware of this technology. This can also be a good topic for M.Tech thesis and for presentations. Following are the topics under data mining to study: Fraud Detection. Crime Rate Prediction.

  24. Research on Data Mining Technology Based on Artificial Neural Networks

    Data mining (DM) is a multidisciplinary field that utilizes knowledge from various disciplines such as statistics, machine learning, and database theory to extract and obtain data. Time series are among the most important kinds of temporal data, and learning their historical patterns to predict future trends has always been a key research topic for ...

  25. CDC

    Use EXAMiner to practice and teach hazard recognition skills for mining operations in any sector.

  26. IJGI

    The construction of new towns is one of the main measures to evacuate urban populations and promote regional coordination and urban-rural integration in China. Mining the spatio-temporal pattern of new town hot spots based on multivariate data and analyzing the influencing factors of new town construction hot spots can provide a strategic basis for new town construction, but few researchers ...