• Search Menu
  • Editor's Choice
  • Author Guidelines
  • Submission Site
  • Open Access
  • About Journal of Cybersecurity
  • Editorial Board
  • Advertising and Corporate Services
  • Journals Career Network
  • Self-Archiving Policy
  • Journals on Oxford Academic
  • Books on Oxford Academic

Issue Cover

Editors-in-Chief

Tyler Moore

About the journal

Journal of Cybersecurity publishes accessible articles describing original research in the inherently interdisciplinary world of computer, systems, and information security …

Latest articles

Cybersecurity Month

Call for Papers

Journal of Cybersecurity is soliciting papers for a special collection on the philosophy of information security. This collection will explore research at the intersection of philosophy, information security, and philosophy of science.

Find out more

CYBERS High Impact 480x270.png

High-Impact Research Collection

Explore a collection of freely available high-impact research from 2020 and 2021 published in the Journal of Cybersecurity .

Browse the collection here

submit

Submit your paper

Join the conversation moving the science of security forward. Visit our Instructions to Authors for more information about how to submit your manuscript.

Read and publish

Read and Publish deals

Authors interested in publishing in Journal of Cybersecurity may be able to publish their paper Open Access using funds available through their institution’s agreement with OUP.

Find out if your institution is participating

Related Titles

cybersecurityandcyberwar

Affiliations

  • Online ISSN 2057-2093
  • Print ISSN 2057-2085
  • Copyright © 2024 Oxford University Press
  • About Oxford Academic
  • Publish journals with us
  • University press partners
  • What we publish
  • New features  
  • Open access
  • Institutional account management
  • Rights and permissions
  • Get help with access
  • Accessibility
  • Advertising
  • Media enquiries
  • Oxford University Press
  • Oxford Languages
  • University of Oxford

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

  • Copyright © 2024 Oxford University Press
  • Cookie settings
  • Cookie policy
  • Privacy policy
  • Legal notice

This Feature Is Available To Subscribers Only

Sign In or Create an Account

This PDF is available to Subscribers Only

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

A Study of Cyber Security Issues and Challenges

Ieee account.

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

  • Survey Paper
  • Open access
  • Published: 01 July 2020

Cybersecurity data science: an overview from machine learning perspective

  • Iqbal H. Sarker   ORCID: orcid.org/0000-0003-1740-5517 1 , 2 ,
  • A. S. M. Kayes 3 ,
  • Shahriar Badsha 4 ,
  • Hamed Alqahtani 5 ,
  • Paul Watters 3 &
  • Alex Ng 3  

Journal of Big Data volume  7 , Article number:  41 ( 2020 ) Cite this article

140k Accesses

234 Citations

51 Altmetric

Metrics details

In a computing context, cybersecurity is undergoing massive shifts in technology and its operations in recent days, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building corresponding data-driven model , is the key to make a security system automated and intelligent. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. In this paper, we focus and briefly discuss on cybersecurity data science , where the data is being gathered from relevant cybersecurity sources, and the analytics complement the latest data-driven patterns for providing more effective security solutions. The concept of cybersecurity data science allows making the computing process more actionable and intelligent as compared to traditional ones in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions . Furthermore, we provide a machine learning based multi-layered framework for the purpose of cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to focus the applicability towards data-driven intelligent decision making for protecting the systems from cyber-attacks.

Introduction

Due to the increasing dependency on digitalization and Internet-of-Things (IoT) [ 1 ], various security incidents such as unauthorized access [ 2 ], malware attack [ 3 ], zero-day attack [ 4 ], data breach [ 5 ], denial of service (DoS) [ 2 ], social engineering or phishing [ 6 ] etc. have grown at an exponential rate in recent years. For instance, in 2010, there were less than 50 million unique malware executables known to the security community. By 2012, they were double around 100 million, and in 2019, there are more than 900 million malicious executables known to the security community, and this number is likely to grow, according to the statistics of AV-TEST institute in Germany [ 7 ]. Cybercrime and attacks can cause devastating financial losses and affect organizations and individuals as well. It’s estimated that, a data breach costs 8.19 million USD for the United States and 3.9 million USD on an average [ 8 ], and the annual cost to the global economy from cybercrime is 400 billion USD [ 9 ]. According to Juniper Research [ 10 ], the number of records breached each year to nearly triple over the next 5 years. Thus, it’s essential that organizations need to adopt and implement a strong cybersecurity approach to mitigate the loss. According to [ 11 ], the national security of a country depends on the business, government, and individual citizens having access to applications and tools which are highly secure, and the capability on detecting and eliminating such cyber-threats in a timely way. Therefore, to effectively identify various cyber incidents either previously seen or unseen, and intelligently protect the relevant systems from such cyber-attacks, is a key issue to be solved urgently.

figure 1

Popularity trends of data science, machine learning and cybersecurity over time, where x-axis represents the timestamp information and y axis represents the corresponding popularity values

Cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attack, damage, or unauthorized access [ 12 ]. In recent days, cybersecurity is undergoing massive shifts in technology and its operations in the context of computing, and data science (DS) is driving the change, where machine learning (ML), a core part of “Artificial Intelligence” (AI) can play a vital role to discover the insights from data. Machine learning can significantly change the cybersecurity landscape and data science is leading a new scientific paradigm [ 13 , 14 ]. The popularity of these related technologies is increasing day-by-day, which is shown in Fig.  1 , based on the data of the last five years collected from Google Trends [ 15 ]. The figure represents timestamp information in terms of a particular date in the x-axis and corresponding popularity in the range of 0 (minimum) to 100 (maximum) in the y-axis. As shown in Fig.  1 , the popularity indication values of these areas are less than 30 in 2014, while they exceed 70 in 2019, i.e., more than double in terms of increased popularity. In this paper, we focus on cybersecurity data science (CDS), which is broadly related to these areas in terms of security data processing techniques and intelligent decision making in real-world applications. Overall, CDS is security data-focused, applies machine learning methods to quantify cyber risks, and ultimately seeks to optimize cybersecurity operations. Thus, the purpose of this paper is for those academia and industry people who want to study and develop a data-driven smart cybersecurity model based on machine learning techniques. Therefore, great emphasis is placed on a thorough description of various types of machine learning methods, and their relations and usage in the context of cybersecurity. This paper does not describe all of the different techniques used in cybersecurity in detail; instead, it gives an overview of cybersecurity data science modeling based on artificial intelligence, particularly from machine learning perspective.

The ultimate goal of cybersecurity data science is data-driven intelligent decision making from security data for smart cybersecurity solutions. CDS represents a partial paradigm shift from traditional well-known security solutions such as firewalls, user authentication and access control, cryptography systems etc. that might not be effective according to today’s need in cyber industry [ 16 , 17 , 18 , 19 ]. The problems are these are typically handled statically by a few experienced security analysts, where data management is done in an ad-hoc manner [ 20 , 21 ]. However, as an increasing number of cybersecurity incidents in different formats mentioned above continuously appear over time, such conventional solutions have encountered limitations in mitigating such cyber risks. As a result, numerous advanced attacks are created and spread very quickly throughout the Internet. Although several researchers use various data analysis and learning techniques to build cybersecurity models that are summarized in “ Machine learning tasks in cybersecurity ” section, a comprehensive security model based on the effective discovery of security insights and latest security patterns could be more useful. To address this issue, we need to develop more flexible and efficient security mechanisms that can respond to threats and to update security policies to mitigate them intelligently in a timely manner. To achieve this goal, it is inherently required to analyze a massive amount of relevant cybersecurity data generated from various sources such as network and system sources, and to discover insights or proper security policies with minimal human intervention in an automated manner.

Analyzing cybersecurity data and building the right tools and processes to successfully protect against cybersecurity incidents goes beyond a simple set of functional requirements and knowledge about risks, threats or vulnerabilities. For effectively extracting the insights or the patterns of security incidents, several machine learning techniques, such as feature engineering, data clustering, classification, and association analysis, or neural network-based deep learning techniques can be used, which are briefly discussed in “ Machine learning tasks in cybersecurity ” section. These learning techniques are capable to find the anomalies or malicious behavior and data-driven patterns of associated security incidents to make an intelligent decision. Thus, based on the concept of data-driven decision making, we aim to focus on cybersecurity data science , where the data is being gathered from relevant cybersecurity sources such as network activity, database activity, application activity, or user activity, and the analytics complement the latest data-driven patterns for providing corresponding security solutions.

The contributions of this paper are summarized as follows.

We first make a brief discussion on the concept of cybersecurity data science and relevant methods to understand its applicability towards data-driven intelligent decision making in the domain of cybersecurity. For this purpose, we also make a review and brief discussion on different machine learning tasks in cybersecurity, and summarize various cybersecurity datasets highlighting their usage in different data-driven cyber applications.

We then discuss and summarize a number of associated research issues and future directions in the area of cybersecurity data science, that could help both the academia and industry people to further research and development in relevant application areas.

Finally, we provide a generic multi-layered framework of the cybersecurity data science model based on machine learning techniques. In this framework, we briefly discuss how the cybersecurity data science model can be used to discover useful insights from security data and making data-driven intelligent decisions to build smart cybersecurity systems.

The remainder of the paper is organized as follows. “ Background ” section summarizes background of our study and gives an overview of the related technologies of cybersecurity data science. “ Cybersecurity data science ” section defines and discusses briefly about cybersecurity data science including various categories of cyber incidents data. In “  Machine learning tasks in cybersecurity ” section, we briefly discuss various categories of machine learning techniques including their relations with cybersecurity tasks and summarize a number of machine learning based cybersecurity models in the field. “ Research issues and future directions ” section briefly discusses and highlights various research issues and future directions in the area of cybersecurity data science. In “  A multi-layered framework for smart cybersecurity services ” section, we suggest a machine learning-based framework to build cybersecurity data science model and discuss various layers with their roles. In “  Discussion ” section, we highlight several key points regarding our studies. Finally,  “ Conclusion ” section concludes this paper.

In this section, we give an overview of the related technologies of cybersecurity data science including various types of cybersecurity incidents and defense strategies.

  • Cybersecurity

Over the last half-century, the information and communication technology (ICT) industry has evolved greatly, which is ubiquitous and closely integrated with our modern society. Thus, protecting ICT systems and applications from cyber-attacks has been greatly concerned by the security policymakers in recent days [ 22 ]. The act of protecting ICT systems from various cyber-threats or attacks has come to be known as cybersecurity [ 9 ]. Several aspects are associated with cybersecurity: measures to protect information and communication technology; the raw data and information it contains and their processing and transmitting; associated virtual and physical elements of the systems; the degree of protection resulting from the application of those measures; and eventually the associated field of professional endeavor [ 23 ]. Craigen et al. defined “cybersecurity as a set of tools, practices, and guidelines that can be used to protect computer networks, software programs, and data from attack, damage, or unauthorized access” [ 24 ]. According to Aftergood et al. [ 12 ], “cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attacks and unauthorized access, alteration, or destruction”. Overall, cybersecurity concerns with the understanding of diverse cyber-attacks and devising corresponding defense strategies that preserve several properties defined as below [ 25 , 26 ].

Confidentiality is a property used to prevent the access and disclosure of information to unauthorized individuals, entities or systems.

Integrity is a property used to prevent any modification or destruction of information in an unauthorized manner.

Availability is a property used to ensure timely and reliable access of information assets and systems to an authorized entity.

The term cybersecurity applies in a variety of contexts, from business to mobile computing, and can be divided into several common categories. These are - network security that mainly focuses on securing a computer network from cyber attackers or intruders; application security that takes into account keeping the software and the devices free of risks or cyber-threats; information security that mainly considers security and the privacy of relevant data; operational security that includes the processes of handling and protecting data assets. Typical cybersecurity systems are composed of network security systems and computer security systems containing a firewall, antivirus software, or an intrusion detection system [ 27 ].

Cyberattacks and security risks

The risks typically associated with any attack, which considers three security factors, such as threats, i.e., who is attacking, vulnerabilities, i.e., the weaknesses they are attacking, and impacts, i.e., what the attack does [ 9 ]. A security incident is an act that threatens the confidentiality, integrity, or availability of information assets and systems. Several types of cybersecurity incidents that may result in security risks on an organization’s systems and networks or an individual [ 2 ]. These are:

Unauthorized access that describes the act of accessing information to network, systems or data without authorization that results in a violation of a security policy [ 2 ];

Malware known as malicious software, is any program or software that intentionally designed to cause damage to a computer, client, server, or computer network, e.g., botnets. Examples of different types of malware including computer viruses, worms, Trojan horses, adware, ransomware, spyware, malicious bots, etc. [ 3 , 26 ]; Ransom malware, or ransomware , is an emerging form of malware that prevents users from accessing their systems or personal files, or the devices, then demands an anonymous online payment in order to restore access.

Denial-of-Service is an attack meant to shut down a machine or network, making it inaccessible to its intended users by flooding the target with traffic that triggers a crash. The Denial-of-Service (DoS) attack typically uses one computer with an Internet connection, while distributed denial-of-service (DDoS) attack uses multiple computers and Internet connections to flood the targeted resource [ 2 ];

Phishing a type of social engineering , used for a broad range of malicious activities accomplished through human interactions, in which the fraudulent attempt takes part to obtain sensitive information such as banking and credit card details, login credentials, or personally identifiable information by disguising oneself as a trusted individual or entity via an electronic communication such as email, text, or instant message, etc. [ 26 ];

Zero-day attack is considered as the term that is used to describe the threat of an unknown security vulnerability for which either the patch has not been released or the application developers were unaware [ 4 , 28 ].

Beside these attacks mentioned above, privilege escalation [ 29 ], password attack [ 30 ], insider threat [ 31 ], man-in-the-middle [ 32 ], advanced persistent threat [ 33 ], SQL injection attack [ 34 ], cryptojacking attack [ 35 ], web application attack [ 30 ] etc. are well-known as security incidents in the field of cybersecurity. A data breach is another type of security incident, known as a data leak, which is involved in the unauthorized access of data by an individual, application, or service [ 5 ]. Thus, all data breaches are considered as security incidents, however, all the security incidents are not data breaches. Most data breaches occur in the banking industry involving the credit card numbers, personal information, followed by the healthcare sector and the public sector [ 36 ].

Cybersecurity defense strategies

Defense strategies are needed to protect data or information, information systems, and networks from cyber-attacks or intrusions. More granularly, they are responsible for preventing data breaches or security incidents and monitoring and reacting to intrusions, which can be defined as any kind of unauthorized activity that causes damage to an information system [ 37 ]. An intrusion detection system (IDS) is typically represented as “a device or software application that monitors a computer network or systems for malicious activity or policy violations” [ 38 ]. The traditional well-known security solutions such as anti-virus, firewalls, user authentication, access control, data encryption and cryptography systems, however might not be effective according to today’s need in the cyber industry

[ 16 , 17 , 18 , 19 ]. On the other hand, IDS resolves the issues by analyzing security data from several key points in a computer network or system [ 39 , 40 ]. Moreover, intrusion detection systems can be used to detect both internal and external attacks.

Intrusion detection systems are different categories according to the usage scope. For instance, a host-based intrusion detection system (HIDS), and network intrusion detection system (NIDS) are the most common types based on the scope of single computers to large networks. In a HIDS, the system monitors important files on an individual system, while it analyzes and monitors network connections for suspicious traffic in a NIDS. Similarly, based on methodologies, the signature-based IDS, and anomaly-based IDS are the most well-known variants [ 37 ].

Signature-based IDS : A signature can be a predefined string, pattern, or rule that corresponds to a known attack. A particular pattern is identified as the detection of corresponding attacks in a signature-based IDS. An example of a signature can be known patterns or a byte sequence in a network traffic, or sequences used by malware. To detect the attacks, anti-virus software uses such types of sequences or patterns as a signature while performing the matching operation. Signature-based IDS is also known as knowledge-based or misuse detection [ 41 ]. This technique can be efficient to process a high volume of network traffic, however, is strictly limited to the known attacks only. Thus, detecting new attacks or unseen attacks is one of the biggest challenges faced by this signature-based system.

Anomaly-based IDS : The concept of anomaly-based detection overcomes the issues of signature-based IDS discussed above. In an anomaly-based intrusion detection system, the behavior of the network is first examined to find dynamic patterns, to automatically create a data-driven model, to profile the normal behavior, and thus it detects deviations in the case of any anomalies [ 41 ]. Thus, anomaly-based IDS can be treated as a dynamic approach, which follows behavior-oriented detection. The main advantage of anomaly-based IDS is the ability to identify unknown or zero-day attacks [ 42 ]. However, the issue is that the identified anomaly or abnormal behavior is not always an indicator of intrusions. It sometimes may happen because of several factors such as policy changes or offering a new service.

In addition, a hybrid detection approach [ 43 , 44 ] that takes into account both the misuse and anomaly-based techniques discussed above can be used to detect intrusions. In a hybrid system, the misuse detection system is used for detecting known types of intrusions and anomaly detection system is used for novel attacks [ 45 ]. Beside these approaches, stateful protocol analysis can also be used to detect intrusions that identifies deviations of protocol state similarly to the anomaly-based method, however it uses predetermined universal profiles based on accepted definitions of benign activity [ 41 ]. In Table 1 , we have summarized these common approaches highlighting their pros and cons. Once the detecting has been completed, the intrusion prevention system (IPS) that is intended to prevent malicious events, can be used to mitigate the risks in different ways such as manual, providing notification, or automatic process [ 46 ]. Among these approaches, an automatic response system could be more effective as it does not involve a human interface between the detection and response systems.

  • Data science

We are living in the age of data, advanced analytics, and data science, which are related to data-driven intelligent decision making. Although, the process of searching patterns or discovering hidden and interesting knowledge from data is known as data mining [ 47 ], in this paper, we use the broader term “data science” rather than data mining. The reason is that, data science, in its most fundamental form, is all about understanding of data. It involves studying, processing, and extracting valuable insights from a set of information. In addition to data mining, data analytics is also related to data science. The development of data mining, knowledge discovery, and machine learning that refers creating algorithms and program which learn on their own, together with the original data analysis and descriptive analytics from the statistical perspective, forms the general concept of “data analytics” [ 47 ]. Nowadays, many researchers use the term “data science” to describe the interdisciplinary field of data collection, preprocessing, inferring, or making decisions by analyzing the data. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. According to Cao et al. [ 47 ] “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments, to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. As a high-level statement in the context of cybersecurity, we can conclude that it is the study of security data to provide data-driven solutions for the given security problems, as known as “the science of cybersecurity data”. Figure 2 shows the typical data-to-insight-to-decision transfer at different periods and general analytic stages in data science, in terms of a variety of analytics goals (G) and approaches (A) to achieve the data-to-decision goal [ 47 ].

figure 2

Data-to-insight-to-decision analytic stages in data science [ 47 ]

Based on the analytic power of data science including machine learning techniques, it can be a viable component of security strategies. By using data science techniques, security analysts can manipulate and analyze security data more effectively and efficiently, uncovering valuable insights from data. Thus, data science methodologies including machine learning techniques can be well utilized in the context of cybersecurity, in terms of problem understanding, gathering security data from diverse sources, preparing data to feed into the model, data-driven model building and updating, for providing smart security services, which motivates to define cybersecurity data science and to work in this research area.

Cybersecurity data science

In this section, we briefly discuss cybersecurity data science including various categories of cyber incidents data with the usage in different application areas, and the key terms and areas related to our study.

Understanding cybersecurity data

Data science is largely driven by the availability of data [ 48 ]. Datasets typically represent a collection of information records that consist of several attributes or features and related facts, in which cybersecurity data science is based on. Thus, it’s important to understand the nature of cybersecurity data containing various types of cyberattacks and relevant features. The reason is that raw security data collected from relevant cyber sources can be used to analyze the various patterns of security incidents or malicious behavior, to build a data-driven security model to achieve our goal. Several datasets exist in the area of cybersecurity including intrusion analysis, malware analysis, anomaly, fraud, or spam analysis that are used for various purposes. In Table 2 , we summarize several such datasets including their various features and attacks that are accessible on the Internet, and highlight their usage based on machine learning techniques in different cyber applications. Effectively analyzing and processing of these security features, building target machine learning-based security model according to the requirements, and eventually, data-driven decision making, could play a role to provide intelligent cybersecurity services that are discussed briefly in “ A multi-layered framework for smart cybersecurity services ” section.

Defining cybersecurity data science

Data science is transforming the world’s industries. It is critically important for the future of intelligent cybersecurity systems and services because of “security is all about data”. When we seek to detect cyber threats, we are analyzing the security data in the form of files, logs, network packets, or other relevant sources. Traditionally, security professionals didn’t use data science techniques to make detections based on these data sources. Instead, they used file hashes, custom-written rules like signatures, or manually defined heuristics [ 21 ]. Although these techniques have their own merits in several cases, it needs too much manual work to keep up with the changing cyber threat landscape. On the contrary, data science can make a massive shift in technology and its operations, where machine learning algorithms can be used to learn or extract insight of security incident patterns from the training data for their detection and prevention. For instance, to detect malware or suspicious trends, or to extract policy rules, these techniques can be used.

In recent days, the entire security industry is moving towards data science, because of its capability to transform raw data into decision making. To do this, several data-driven tasks can be associated, such as—(i) data engineering focusing practical applications of data gathering and analysis; (ii) reducing data volume that deals with filtering significant and relevant data to further analysis; (iii) discovery and detection that focuses on extracting insight or incident patterns or knowledge from data; (iv) automated models that focus on building data-driven intelligent security model; (v) targeted security  alerts focusing on the generation of remarkable security alerts based on discovered knowledge that minimizes the false alerts, and (vi) resource optimization that deals with the available resources to achieve the target goals in a security system. While making data-driven decisions, behavioral analysis could also play a significant role in the domain of cybersecurity [ 81 ].

Thus, the concept of cybersecurity data science incorporates the methods and techniques of data science and machine learning as well as the behavioral analytics of various security incidents. The combination of these technologies has given birth to the term “cybersecurity data science”, which refers to collect a large amount of security event data from different sources and analyze it using machine learning technologies for detecting security risks or attacks either through the discovery of useful insights or the latest data-driven patterns. It is, however, worth remembering that cybersecurity data science is not just about a collection of machine learning algorithms, rather,  a process that can help security professionals or analysts to scale and automate their security activities in a smart way and in a timely manner. Therefore, the formal definition can be as follows: “Cybersecurity data science is a research or working area existing at the intersection of cybersecurity, data science, and machine learning or artificial intelligence, which is mainly security data-focused, applies machine learning methods, attempts to quantify cyber-risks or incidents, and promotes inferential techniques to analyze behavioral patterns in security data. It also focuses on generating security response alerts, and eventually seeks for optimizing cybersecurity solutions, to build automated and intelligent cybersecurity systems.”

Table  3 highlights some key terms associated with cybersecurity data science. Overall, the outputs of cybersecurity data science are typically security data products, which can be a data-driven security model, policy rule discovery, risk or attack prediction, potential security service and recommendation, or the corresponding security system depending on the given security problem in the domain of cybersecurity. In the next section, we briefly discuss various machine learning tasks with examples within the scope of our study.

Machine learning tasks in cybersecurity

Machine learning (ML) is typically considered as a branch of “Artificial Intelligence”, which is closely related to computational statistics, data mining and analytics, data science, particularly focusing on making the computers to learn from data [ 82 , 83 ]. Thus, machine learning models typically comprise of a set of rules, methods, or complex “transfer functions” that can be applied to find interesting data patterns, or to recognize or predict behavior [ 84 ], which could play an important role in the area of cybersecurity. In the following, we discuss different methods that can be used to solve machine learning tasks and how they are related to cybersecurity tasks.

Supervised learning

Supervised learning is performed when specific targets are defined to reach from a certain set of inputs, i.e., task-driven approach. In the area of machine learning, the most popular supervised learning techniques are known as classification and regression methods [ 129 ]. These techniques are popular to classify or predict the future for a particular security problem. For instance, to predict denial-of-service attack (yes, no) or to identify different classes of network attacks such as scanning and spoofing, classification techniques can be used in the cybersecurity domain. ZeroR [ 83 ], OneR [ 130 ], Navies Bayes [ 131 ], Decision Tree [ 132 , 133 ], K-nearest neighbors [ 134 ], support vector machines [ 135 ], adaptive boosting [ 136 ], and logistic regression [ 137 ] are the well-known classification techniques. In addition, recently Sarker et al. have proposed BehavDT [ 133 ], and IntruDtree [ 106 ] classification techniques that are able to effectively build a data-driven predictive model. On the other hand, to predict the continuous or numeric value, e.g., total phishing attacks in a certain period or predicting the network packet parameters, regression techniques are useful. Regression analyses can also be used to detect the root causes of cybercrime and other types of fraud [ 138 ]. Linear regression [ 82 ], support vector regression [ 135 ] are the popular regression techniques. The main difference between classification and regression is that the output variable in the regression is numerical or continuous, while the predicted output for classification is categorical or discrete. Ensemble learning is an extension of supervised learning while mixing different simple models, e.g., Random Forest learning [ 139 ] that generates multiple decision trees to solve a particular security task.

Unsupervised learning

In unsupervised learning problems, the main task is to find patterns, structures, or knowledge in unlabeled data, i.e., data-driven approach [ 140 ]. In the area of cybersecurity, cyber-attacks like malware stays hidden in some ways, include changing their behavior dynamically and autonomously to avoid detection. Clustering techniques, a type of unsupervised learning, can help to uncover the hidden patterns and structures from the datasets, to identify indicators of such sophisticated attacks. Similarly, in identifying anomalies, policy violations, detecting, and eliminating noisy instances in data, clustering techniques can be useful. K-means [ 141 ], K-medoids [ 142 ] are the popular partitioning clustering algorithms, and single linkage [ 143 ] or complete linkage [ 144 ] are the well-known hierarchical clustering algorithms used in various application domains. Moreover, a bottom-up clustering approach proposed by Sarker et al. [ 145 ] can also be used by taking into account the data characteristics.

Besides, feature engineering tasks like optimal feature selection or extraction related to a particular security problem could be useful for further analysis [ 106 ]. Recently, Sarker et al. [ 106 ] have proposed an approach for selecting security features according to their importance score values. Moreover, Principal component analysis, linear discriminant analysis, pearson correlation analysis, or non-negative matrix factorization are the popular dimensionality reduction techniques to solve such issues [ 82 ]. Association rule learning is another example, where machine learning based policy rules can prevent cyber-attacks. In an expert system, the rules are usually manually defined by a knowledge engineer working in collaboration with a domain expert [ 37 , 140 , 146 ]. Association rule learning on the contrary, is the discovery of rules or relationships among a set of available security features or attributes in a given dataset [ 147 ]. To quantify the strength of relationships, correlation analysis can be used [ 138 ]. Many association rule mining algorithms have been proposed in the area of machine learning and data mining literature, such as logic-based [ 148 ], frequent pattern based [ 149 , 150 , 151 ], tree-based [ 152 ], etc. Recently, Sarker et al. [ 153 ] have proposed an association rule learning approach considering non-redundant generation, that can be used to discover a set of useful security policy rules. Moreover, AIS [ 147 ], Apriori [ 149 ], Apriori-TID and Apriori-Hybrid [ 149 ], FP-Tree [ 152 ], and RARM [ 154 ], and Eclat [ 155 ] are the well-known association rule learning algorithms that are capable to solve such problems by generating a set of policy rules in the domain of cybersecurity.

Neural networks and deep learning

Deep learning is a part of machine learning in the area of artificial intelligence, which is a computational model that is inspired by the biological neural networks in the human brain [ 82 ]. Artificial Neural Network (ANN) is frequently used in deep learning and the most popular neural network algorithm is backpropagation [ 82 ]. It performs learning on a multi-layer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. The main difference between deep learning and classical machine learning is its performance on the amount of security data increases. Typically deep learning algorithms perform well when the data volumes are large, whereas machine learning algorithms perform comparatively better on small datasets [ 44 ]. In our earlier work, Sarker et al. [ 129 ], we have illustrated the effectiveness of these approaches considering contextual datasets. However, deep learning approaches mimic the human brain mechanism to interpret large amount of data or the complex data such as images, sounds and texts [ 44 , 129 ]. In terms of feature extraction to build models, deep learning reduces the effort of designing a feature extractor for each problem than the classical machine learning techniques. Beside these characteristics, deep learning typically takes a long time to train an algorithm than a machine learning algorithm, however, the test time is exactly the opposite [ 44 ]. Thus, deep learning relies more on high-performance machines with GPUs than classical machine-learning algorithms [ 44 , 156 ]. The most popular deep neural network learning models include multi-layer perceptron (MLP) [ 157 ], convolutional neural network (CNN) [ 158 ], recurrent neural network (RNN) or long-short term memory (LSTM) network [ 121 , 158 ]. In recent days, researchers use these deep learning techniques for different purposes such as detecting network intrusions, malware traffic detection and classification, etc. in the domain of cybersecurity [ 44 , 159 ].

Other learning techniques

Semi-supervised learning can be described as a hybridization of supervised and unsupervised techniques discussed above, as it works on both the labeled and unlabeled data. In the area of cybersecurity, it could be useful, when it requires to label data automatically without human intervention, to improve the performance of cybersecurity models. Reinforcement techniques are another type of machine learning that characterizes an agent by creating its own learning experiences through interacting directly with the environment, i.e., environment-driven approach, where the environment is typically formulated as a Markov decision process and take decision based on a reward function [ 160 ]. Monte Carlo learning, Q-learning, Deep Q Networks, are the most common reinforcement learning algorithms [ 161 ]. For instance, in a recent work [ 126 ], the authors present an approach for detecting botnet traffic or malicious cyber activities using reinforcement learning combining with neural network classifier. In another work [ 128 ], the authors discuss about the application of deep reinforcement learning to intrusion detection for supervised problems, where they received the best results for the Deep Q-Network algorithm. In the context of cybersecurity, genetic algorithms that use fitness, selection, crossover, and mutation for finding optimization, could also be used to solve a similar class of learning problems [ 119 ].

Various types of machine learning techniques discussed above can be useful in the domain of cybersecurity, to build an effective security model. In Table  4 , we have summarized several machine learning techniques that are used to build various types of security models for various purposes. Although these models typically represent a learning-based security model, in this paper, we aim to focus on a comprehensive cybersecurity data science model and relevant issues, in order to build a data-driven intelligent security system. In the next section, we highlight several research issues and potential solutions in the area of cybersecurity data science.

Research issues and future directions

Our study opens several research issues and challenges in the area of cybersecurity data science to extract insight from relevant data towards data-driven intelligent decision making for cybersecurity solutions. In the following, we summarize these challenges ranging from data collection to decision making.

Cybersecurity datasets : Source datasets are the primary component to work in the area of cybersecurity data science. Most of the existing datasets are old and might insufficient in terms of understanding the recent behavioral patterns of various cyber-attacks. Although the data can be transformed into a meaningful understanding level after performing several processing tasks, there is still a lack of understanding of the characteristics of recent attacks and their patterns of happening. Thus, further processing or machine learning algorithms may provide a low accuracy rate for making the target decisions. Therefore, establishing a large number of recent datasets for a particular problem domain like cyber risk prediction or intrusion detection is needed, which could be one of the major challenges in cybersecurity data science.

Handling quality problems in cybersecurity datasets : The cyber datasets might be noisy, incomplete, insignificant, imbalanced, or may contain inconsistency instances related to a particular security incident. Such problems in a data set may affect the quality of the learning process and degrade the performance of the machine learning-based models [ 162 ]. To make a data-driven intelligent decision for cybersecurity solutions, such problems in data is needed to deal effectively before building the cyber models. Therefore, understanding such problems in cyber data and effectively handling such problems using existing algorithms or newly proposed algorithm for a particular problem domain like malware analysis or intrusion detection and prevention is needed, which could be another research issue in cybersecurity data science.

Security policy rule generation : Security policy rules reference security zones and enable a user to allow, restrict, and track traffic on the network based on the corresponding user or user group, and service, or the application. The policy rules including the general and more specific rules are compared against the incoming traffic in sequence during the execution, and the rule that matches the traffic is applied. The policy rules used in most of the cybersecurity systems are static and generated by human expertise or ontology-based [ 163 , 164 ]. Although, association rule learning techniques produce rules from data, however, there is a problem of redundancy generation [ 153 ] that makes the policy rule-set complex. Therefore, understanding such problems in policy rule generation and effectively handling such problems using existing algorithms or newly proposed algorithm for a particular problem domain like access control [ 165 ] is needed, which could be another research issue in cybersecurity data science.

Hybrid learning method : Most commercial products in the cybersecurity domain contain signature-based intrusion detection techniques [ 41 ]. However, missing features or insufficient profiling can cause these techniques to miss unknown attacks. In that case, anomaly-based detection techniques or hybrid technique combining signature-based and anomaly-based can be used to overcome such issues. A hybrid technique combining multiple learning techniques or a combination of deep learning and machine-learning methods can be used to extract the target insight for a particular problem domain like intrusion detection, malware analysis, access control, etc. and make the intelligent decision for corresponding cybersecurity solutions.

Protecting the valuable security information : Another issue of a cyber data attack is the loss of extremely valuable data and information, which could be damaging for an organization. With the use of encryption or highly complex signatures, one can stop others from probing into a dataset. In such cases, cybersecurity data science can be used to build a data-driven impenetrable protocol to protect such security information. To achieve this goal, cyber analysts can develop algorithms by analyzing the history of cyberattacks to detect the most frequently targeted chunks of data. Thus, understanding such data protecting problems and designing corresponding algorithms to effectively handling these problems, could be another research issue in the area of cybersecurity data science.

Context-awareness in cybersecurity : Existing cybersecurity work mainly originates from the relevant cyber data containing several low-level features. When data mining and machine learning techniques are applied to such datasets, a related pattern can be identified that describes it properly. However, a broader contextual information [ 140 , 145 , 166 ] like temporal, spatial, relationship among events or connections, dependency can be used to decide whether there exists a suspicious activity or not. For instance, some approaches may consider individual connections as DoS attacks, while security experts might not treat them as malicious by themselves. Thus, a significant limitation of existing cybersecurity work is the lack of using the contextual information for predicting risks or attacks. Therefore, context-aware adaptive cybersecurity solutions could be another research issue in cybersecurity data science.

Feature engineering in cybersecurity : The efficiency and effectiveness of a machine learning-based security model has always been a major challenge due to the high volume of network data with a large number of traffic features. The large dimensionality of data has been addressed using several techniques such as principal component analysis (PCA) [ 167 ], singular value decomposition (SVD) [ 168 ] etc. In addition to low-level features in the datasets, the contextual relationships between suspicious activities might be relevant. Such contextual data can be stored in an ontology or taxonomy for further processing. Thus how to effectively select the optimal features or extract the significant features considering both the low-level features as well as the contextual features, for effective cybersecurity solutions could be another research issue in cybersecurity data science.

Remarkable security alert generation and prioritizing : In many cases, the cybersecurity system may not be well defined and may cause a substantial number of false alarms that are unexpected in an intelligent system. For instance, an IDS deployed in a real-world network generates around nine million alerts per day [ 169 ]. A network-based intrusion detection system typically looks at the incoming traffic for matching the associated patterns to detect risks, threats or vulnerabilities and generate security alerts. However, to respond to each such alert might not be effective as it consumes relatively huge amounts of time and resources, and consequently may result in a self-inflicted DoS. To overcome this problem, a high-level management is required that correlate the security alerts considering the current context and their logical relationship including their prioritization before reporting them to users, which could be another research issue in cybersecurity data science.

Recency analysis in cybersecurity solutions : Machine learning-based security models typically use a large amount of static data to generate data-driven decisions. Anomaly detection systems rely on constructing such a model considering normal behavior and anomaly, according to their patterns. However, normal behavior in a large and dynamic security system is not well defined and it may change over time, which can be considered as an incremental growing of dataset. The patterns in incremental datasets might be changed in several cases. This often results in a substantial number of false alarms known as false positives. Thus, a recent malicious behavioral pattern is more likely to be interesting and significant than older ones for predicting unknown attacks. Therefore, effectively using the concept of recency analysis [ 170 ] in cybersecurity solutions could be another issue in cybersecurity data science.

The most important work for an intelligent cybersecurity system is to develop an effective framework that supports data-driven decision making. In such a framework, we need to consider advanced data analysis based on machine learning techniques, so that the framework is capable to minimize these issues and to provide automated and intelligent security services. Thus, a well-designed security framework for cybersecurity data and the experimental evaluation is a very important direction and a big challenge as well. In the next section, we suggest and discuss a data-driven cybersecurity framework based on machine learning techniques considering multiple processing layers.

A multi-layered framework for smart cybersecurity services

As discussed earlier, cybersecurity data science is data-focused, applies machine learning methods, attempts to quantify cyber risks, promotes inferential techniques to analyze behavioral patterns, focuses on generating security response alerts, and eventually seeks for optimizing cybersecurity operations. Hence, we briefly discuss a multiple data processing layered framework that potentially can be used to discover security insights from the raw data to build smart cybersecurity systems, e.g., dynamic policy rule-based access control or intrusion detection and prevention system. To make a data-driven intelligent decision in the resultant cybersecurity system, understanding the security problems and the nature of corresponding security data and their vast analysis is needed. For this purpose, our suggested framework not only considers the machine learning techniques to build the security model but also takes into account the incremental learning and dynamism to keep the model up-to-date and corresponding response generation, which could be more effective and intelligent for providing the expected services. Figure 3 shows an overview of the framework, involving several processing layers, from raw security event data to services. In the following, we briefly discuss the working procedure of the framework.

figure 3

A generic multi-layered framework based on machine learning techniques for smart cybersecurity services

Security data collecting

Collecting valuable cybersecurity data is a crucial step, which forms a connecting link between security problems in cyberinfrastructure and corresponding data-driven solution steps in this framework, shown in Fig.  3 . The reason is that cyber data can serve as the source for setting up ground truth of the security model that affect the model performance. The quality and quantity of cyber data decide the feasibility and effectiveness of solving the security problem according to our goal. Thus, the concern is how to collect valuable and unique needs data for building the data-driven security models.

The general step to collect and manage security data from diverse data sources is based on a particular security problem and project within the enterprise. Data sources can be classified into several broad categories such as network, host, and hybrid [ 171 ]. Within the network infrastructure, the security system can leverage different types of security data such as IDS logs, firewall logs, network traffic data, packet data, and honeypot data, etc. for providing the target security services. For instance, a given IP is considered malicious or not, could be detected by performing data analysis utilizing the data of IP addresses and their cyber activities. In the domain of cybersecurity, the network source mentioned above is considered as the primary security event source to analyze. In the host category, it collects data from an organization’s host machines, where the data sources can be operating system logs, database access logs, web server logs, email logs, application logs, etc. Collecting data from both the network and host machines are considered a hybrid category. Overall, in a data collection layer the network activity, database activity, application activity, and user activity can be the possible security event sources in the context of cybersecurity data science.

Security data preparing

After collecting the raw security data from various sources according to the problem domain discussed above, this layer is responsible to prepare the raw data for building the model by applying various necessary processes. However, not all of the collected data contributes to the model building process in the domain of cybersecurity [ 172 ]. Therefore, the useless data should be removed from the rest of the data captured by the network sniffer. Moreover, data might be noisy, have missing or corrupted values, or have attributes of widely varying types and scales. High quality of data is necessary for achieving higher accuracy in a data-driven model, which is a process of learning a function that maps an input to an output based on example input-output pairs. Thus, it might require a procedure for data cleaning, handling missing or corrupted values. Moreover, security data features or attributes can be in different types, such as continuous, discrete, or symbolic [ 106 ]. Beyond a solid understanding of these types of data and attributes and their permissible operations, its need to preprocess the data and attributes to convert into the target type. Besides, the raw data can be in different types such as structured, semi-structured, or unstructured, etc. Thus, normalization, transformation, or collation can be useful to organize the data in a structured manner. In some cases, natural language processing techniques might be useful depending on data type and characteristics, e.g., textual contents. As both the quality and quantity of data decide the feasibility of solving the security problem, effectively pre-processing and management of data and their representation can play a significant role to build an effective security model for intelligent services.

Machine learning-based security modeling

This is the core step where insights and knowledge are extracted from data through the application of cybersecurity data science. In this section, we particularly focus on machine learning-based modeling as machine learning techniques can significantly change the cybersecurity landscape. The security features or attributes and their patterns in data are of high interest to be discovered and analyzed to extract security insights. To achieve the goal, a deeper understanding of data and machine learning-based analytical models utilizing a large number of cybersecurity data can be effective. Thus, various machine learning tasks can be involved in this model building layer according to the solution perspective. These are - security feature engineering that mainly responsible to transform raw security data into informative features that effectively represent the underlying security problem to the data-driven models. Thus, several data-processing tasks such as feature transformation and normalization, feature selection by taking into account a subset of available security features according to their correlations or importance in modeling, or feature generation and extraction by creating new brand principal components, may be involved in this module according to the security data characteristics. For instance, the chi-squared test, analysis of variance test, correlation coefficient analysis, feature importance, as well as discriminant and principal component analysis, or singular value decomposition, etc. can be used for analyzing the significance of the security features to perform the security feature engineering tasks [ 82 ].

Another significant module is security data clustering that uncovers hidden patterns and structures through huge volumes of security data, to identify where the new threats exist. It typically involves the grouping of security data with similar characteristics, which can be used to solve several cybersecurity problems such as detecting anomalies, policy violations, etc. Malicious behavior or anomaly detection module is typically responsible to identify a deviation to a known behavior, where clustering-based analysis and techniques can also be used to detect malicious behavior or anomaly detection. In the cybersecurity area, attack classification or prediction is treated as one of the most significant modules, which is responsible to build a prediction model to classify attacks or threats and to predict future for a particular security problem. To predict denial-of-service attack or a spam filter separating tasks from other messages, could be the relevant examples. Association learning or policy rule generation module can play a role to build an expert security system that comprises several IF-THEN rules that define attacks. Thus, in a problem of policy rule generation for rule-based access control system, association learning can be used as it discovers the associations or relationships among a set of available security features in a given security dataset. The popular machine learning algorithms in these categories are briefly discussed in “  Machine learning tasks in cybersecurity ” section. The module model selection or customization is responsible to choose whether it uses the existing machine learning model or needed to customize. Analyzing data and building models based on traditional machine learning or deep learning methods, could achieve acceptable results in certain cases in the domain of cybersecurity. However, in terms of effectiveness and efficiency or other performance measurements considering time complexity, generalization capacity, and most importantly the impact of the algorithm on the detection rate of a system, machine learning models are needed to customize for a specific security problem. Moreover, customizing the related techniques and data could improve the performance of the resultant security model and make it better applicable in a cybersecurity domain. The modules discussed above can work separately and combinedly depending on the target security problems.

Incremental learning and dynamism

In our framework, this layer is concerned with finalizing the resultant security model by incorporating additional intelligence according to the needs. This could be possible by further processing in several modules. For instance, the post-processing and improvement module in this layer could play a role to simplify the extracted knowledge according to the particular requirements by incorporating domain-specific knowledge. As the attack classification or prediction models based on machine learning techniques strongly rely on the training data, it can hardly be generalized to other datasets, which could be significant for some applications. To address such kind of limitations, this module is responsible to utilize the domain knowledge in the form of taxonomy or ontology to improve attack correlation in cybersecurity applications.

Another significant module recency mining and updating security model is responsible to keep the security model up-to-date for better performance by extracting the latest data-driven security patterns. The extracted knowledge discussed in the earlier layer is based on a static initial dataset considering the overall patterns in the datasets. However, such knowledge might not be guaranteed higher performance in several cases, because of incremental security data with recent patterns. In many cases, such incremental data may contain different patterns which could conflict with existing knowledge. Thus, the concept of RecencyMiner [ 170 ] on incremental security data and extracting new patterns can be more effective than the existing old patterns. The reason is that recent security patterns and rules are more likely to be significant than older ones for predicting cyber risks or attacks. Rather than processing the whole security data again, recency-based dynamic updating according to the new patterns would be more efficient in terms of processing and outcome. This could make the resultant cybersecurity model intelligent and dynamic. Finally, response planning and decision making module is responsible to make decisions based on the extracted insights and take necessary actions to prevent the system from the cyber-attacks to provide automated and intelligent services. The services might be different depending on particular requirements for a given security problem.

Overall, this framework is a generic description which potentially can be used to discover useful insights from security data, to build smart cybersecurity systems, to address complex security challenges, such as intrusion detection, access control management, detecting anomalies and fraud, or denial of service attacks, etc. in the area of cybersecurity data science.

Although several research efforts have been directed towards cybersecurity solutions, discussed in “ Background ” , “ Cybersecurity data science ”, and “ Machine learning tasks in cybersecurity ” sections in different directions, this paper presents a comprehensive view of cybersecurity data science. For this, we have conducted a literature review to understand cybersecurity data, various defense strategies including intrusion detection techniques, different types of machine learning techniques in cybersecurity tasks. Based on our discussion on existing work, several research issues related to security datasets, data quality problems, policy rule generation, learning methods, data protection, feature engineering, security alert generation, recency analysis etc. are identified that require further research attention in the domain of cybersecurity data science.

The scope of cybersecurity data science is broad. Several data-driven tasks such as intrusion detection and prevention, access control management, security policy generation, anomaly detection, spam filtering, fraud detection and prevention, various types of malware attack detection and defense strategies, etc. can be considered as the scope of cybersecurity data science. Such tasks based categorization could be helpful for security professionals including the researchers and practitioners who are interested in the domain-specific aspects of security systems [ 171 ]. The output of cybersecurity data science can be used in many application areas such as Internet of things (IoT) security [ 173 ], network security [ 174 ], cloud security [ 175 ], mobile and web applications [ 26 ], and other relevant cyber areas. Moreover, intelligent cybersecurity solutions are important for the banking industry, the healthcare sector, or the public sector, where data breaches typically occur [ 36 , 176 ]. Besides, the data-driven security solutions could also be effective in AI-based blockchain technology, where AI works with huge volumes of security event data to extract the useful insights using machine learning techniques, and block-chain as a trusted platform to store such data [ 177 ].

Although in this paper, we discuss cybersecurity data science focusing on examining raw security data to data-driven decision making for intelligent security solutions, it could also be related to big data analytics in terms of data processing and decision making. Big data deals with data sets that are too large or complex having characteristics of high data volume, velocity, and variety. Big data analytics mainly has two parts consisting of data management involving data storage, and analytics [ 178 ]. The analytics typically describe the process of analyzing such datasets to discover patterns, unknown correlations, rules, and other useful insights [ 179 ]. Thus, several advanced data analysis techniques such as AI, data mining, machine learning could play an important role in processing big data by converting big problems to small problems [ 180 ]. To do this, the potential strategies like parallelization, divide-and-conquer, incremental learning, sampling, granular computing, feature or instance selection, can be used to make better decisions, reducing costs, or enabling more efficient processing. In such cases, the concept of cybersecurity data science, particularly machine learning-based modeling could be helpful for process automation and decision making for intelligent security solutions. Moreover, researchers could consider modified algorithms or models for handing big data on parallel computing platforms like Hadoop, Storm, etc. [ 181 ].

Based on the concept of cybersecurity data science discussed in the paper, building a data-driven security model for a particular security problem and relevant empirical evaluation to measure the effectiveness and efficiency of the model, and to asses the usability in the real-world application domain could be a future work.

Motivated by the growing significance of cybersecurity and data science, and machine learning technologies, in this paper, we have discussed how cybersecurity data science applies to data-driven intelligent decision making in smart cybersecurity systems and services. We also have discussed how it can impact security data, both in terms of extracting insight of security incidents and the dataset itself. We aimed to work on cybersecurity data science by discussing the state of the art concerning security incidents data and corresponding security services. We also discussed how machine learning techniques can impact in the domain of cybersecurity, and examine the security challenges that remain. In terms of existing research, much focus has been provided on traditional security solutions, with less available work in machine learning technique based security systems. For each common technique, we have discussed relevant security research. The purpose of this article is to share an overview of the conceptualization, understanding, modeling, and thinking about cybersecurity data science.

We have further identified and discussed various key issues in security analysis to showcase the signpost of future research directions in the domain of cybersecurity data science. Based on the knowledge, we have also provided a generic multi-layered framework of cybersecurity data science model based on machine learning techniques, where the data is being gathered from diverse sources, and the analytics complement the latest data-driven patterns for providing intelligent security services. The framework consists of several main phases - security data collecting, data preparation, machine learning-based security modeling, and incremental learning and dynamism for smart cybersecurity systems and services. We specifically focused on extracting insights from security data, from setting a research design with particular attention to concepts for data-driven intelligent security solutions.

Overall, this paper aimed not only to discuss cybersecurity data science and relevant methods but also to discuss the applicability towards data-driven intelligent decision making in cybersecurity systems and services from machine learning perspectives. Our analysis and discussion can have several implications both for security researchers and practitioners. For researchers, we have highlighted several issues and directions for future research. Other areas for potential research include empirical evaluation of the suggested data-driven model, and comparative analysis with other security systems. For practitioners, the multi-layered machine learning-based model can be used as a reference in designing intelligent cybersecurity systems for organizations. We believe that our study on cybersecurity data science opens a promising path and can be used as a reference guide for both academia and industry for future research and applications in the area of cybersecurity.

Availability of data and materials

Not applicable.

Abbreviations

  • Machine learning

Artificial Intelligence

Information and communication technology

Internet of Things

Distributed Denial of Service

Intrusion detection system

Intrusion prevention system

Host-based intrusion detection systems

Network Intrusion Detection Systems

Signature-based intrusion detection system

Anomaly-based intrusion detection system

Li S, Da Xu L, Zhao S. The internet of things: a survey. Inform Syst Front. 2015;17(2):243–59.

Google Scholar  

Sun N, Zhang J, Rimba P, Gao S, Zhang LY, Xiang Y. Data-driven cybersecurity incident prediction: a survey. IEEE Commun Surv Tutor. 2018;21(2):1744–72.

McIntosh T, Jang-Jaccard J, Watters P, Susnjak T. The inadequacy of entropy-based ransomware detection. In: International conference on neural information processing. New York: Springer; 2019. p. 181–189

Alazab M, Venkatraman S, Watters P, Alazab M, et al. Zero-day malware detection based on supervised learning algorithms of api call signatures (2010)

Shaw A. Data breach: from notification to prevention using pci dss. Colum Soc Probs. 2009;43:517.

Gupta BB, Tewari A, Jain AK, Agrawal DP. Fighting against phishing attacks: state of the art and future challenges. Neural Comput Appl. 2017;28(12):3629–54.

Av-test institute, germany, https://www.av-test.org/en/statistics/malware/ . Accessed 20 Oct 2019.

Ibm security report, https://www.ibm.com/security/data-breach . Accessed on 20 Oct 2019.

Fischer EA. Cybersecurity issues and challenges: In brief. Congressional Research Service (2014)

Juniper research. https://www.juniperresearch.com/ . Accessed on 20 Oct 2019.

Papastergiou S, Mouratidis H, Kalogeraki E-M. Cyber security incident handling, warning and response system for the european critical information infrastructures (cybersane). In: International Conference on Engineering Applications of Neural Networks, p. 476–487 (2019). New York: Springer

Aftergood S. Cybersecurity: the cold war online. Nature. 2017;547(7661):30.

Hey AJ, Tansley S, Tolle KM, et al. The fourth paradigm: data-intensive scientific discovery. 2009;1:

Cukier K. Data, data everywhere: A special report on managing information, 2010.

Google trends. In: https://trends.google.com/trends/ , 2019.

Anwar S, Mohamad Zain J, Zolkipli MF, Inayat Z, Khan S, Anthony B, Chang V. From intrusion detection to an intrusion response system: fundamentals, requirements, and future directions. Algorithms. 2017;10(2):39.

MATH   Google Scholar  

Mohammadi S, Mirvaziri H, Ghazizadeh-Ahsaee M, Karimipour H. Cyber intrusion detection by combined feature selection algorithm. J Inform Sec Appl. 2019;44:80–8.

Tapiador JE, Orfila A, Ribagorda A, Ramos B. Key-recovery attacks on kids, a keyed anomaly detection system. IEEE Trans Depend Sec Comput. 2013;12(3):312–25.

Tavallaee M, Stakhanova N, Ghorbani AA. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(5), 516–524 (2010)

Foroughi F, Luksch P. Data science methodology for cybersecurity projects. arXiv preprint arXiv:1803.04219 , 2018.

Saxe J, Sanders H. Malware data science: Attack detection and attribution, 2018.

Rainie L, Anderson J, Connolly J. Cyber attacks likely to increase. Digital Life in. 2014, vol. 2025.

Fischer EA. Creating a national framework for cybersecurity: an analysis of issues and options. LIBRARY OF CONGRESS WASHINGTON DC CONGRESSIONAL RESEARCH SERVICE, 2005.

Craigen D, Diakun-Thibault N, Purse R. Defining cybersecurity. Technology Innovation. Manag Rev. 2014;4(10):13–21.

Council NR. et al. Toward a safer and more secure cyberspace, 2007.

Jang-Jaccard J, Nepal S. A survey of emerging threats in cybersecurity. J Comput Syst Sci. 2014;80(5):973–93.

MathSciNet   MATH   Google Scholar  

Mukkamala S, Sung A, Abraham A. Cyber security challenges: Designing efficient intrusion detection systems and antivirus tools. Vemuri, V. Rao, Enhancing Computer Security with Smart Technology.(Auerbach, 2006), 125–163, 2005.

Bilge L, Dumitraş T. Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM conference on computer and communications security. ACM; 2012. p. 833–44.

Davi L, Dmitrienko A, Sadeghi A-R, Winandy M. Privilege escalation attacks on android. In: International conference on information security. New York: Springer; 2010. p. 346–60.

Jovičić B, Simić D. Common web application attack types and security using asp .net. ComSIS, 2006.

Warkentin M, Willison R. Behavioral and policy issues in information systems security: the insider threat. Eur J Inform Syst. 2009;18(2):101–5.

Kügler D. “man in the middle” attacks on bluetooth. In: International Conference on Financial Cryptography. New York: Springer; 2003, p. 149–61.

Virvilis N, Gritzalis D. The big four-what we did wrong in advanced persistent threat detection. In: 2013 International Conference on Availability, Reliability and Security. IEEE; 2013. p. 248–54.

Boyd SW, Keromytis AD. Sqlrand: Preventing sql injection attacks. In: International conference on applied cryptography and network security. New York: Springer; 2004. p. 292–302.

Sigler K. Crypto-jacking: how cyber-criminals are exploiting the crypto-currency boom. Comput Fraud Sec. 2018;2018(9):12–4.

2019 data breach investigations report, https://enterprise.verizon.com/resources/reports/dbir/ . Accessed 20 Oct 2019.

Khraisat A, Gondal I, Vamplew P, Kamruzzaman J. Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity. 2019;2(1):20.

Johnson L. Computer incident response and forensics team management: conducting a successful incident response, 2013.

Brahmi I, Brahmi H, Yahia SB. A multi-agents intrusion detection system using ontology and clustering techniques. In: IFIP international conference on computer science and its applications. New York: Springer; 2015. p. 381–93.

Qu X, Yang L, Guo K, Ma L, Sun M, Ke M, Li M. A survey on the development of self-organizing maps for unsupervised intrusion detection. In: Mobile networks and applications. 2019;1–22.

Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y. Intrusion detection system: a comprehensive review. J Netw Comput Appl. 2013;36(1):16–24.

Alazab A, Hobbs M, Abawajy J, Alazab M. Using feature selection for intrusion detection system. In: 2012 International symposium on communications and information technologies (ISCIT). IEEE; 2012. p. 296–301.

Viegas E, Santin AO, Franca A, Jasinski R, Pedroni VA, Oliveira LS. Towards an energy-efficient anomaly-based intrusion detection engine for embedded systems. IEEE Trans Comput. 2016;66(1):163–77.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Dutt I, Borah S, Maitra IK, Bhowmik K, Maity A, Das S. Real-time hybrid intrusion detection system using machine learning techniques. 2018, p. 885–94.

Ragsdale DJ, Carver C, Humphries JW, Pooch UW. Adaptation techniques for intrusion detection and intrusion response systems. In: Smc 2000 conference proceedings. 2000 IEEE international conference on systems, man and cybernetics.’cybernetics evolving to systems, humans, organizations, and their complex interactions’(cat. No. 0). IEEE; 2000. vol. 4, p. 2344–2349.

Cao L. Data science: challenges and directions. Commun ACM. 2017;60(8):59–68.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, et al. Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation. In: Proceedings DARPA information survivability conference and exposition. DISCEX’00. IEEE; 2000. vol. 2, p. 12–26.

Kdd cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html . Accessed 20 Oct 2019.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE; 2009. p. 1–6.

Caida ddos attack 2007 dataset. http://www.caida.org/data/ passive/ddos-20070804-dataset.xml/ . Accessed 20 Oct 2019.

Caida anonymized internet traces 2008 dataset. https://www.caida.org/data/passive/passive-2008-dataset . Accessed 20 Oct 2019.

Isot botnet dataset. https://www.uvic.ca/engineering/ece/isot/ datasets/index.php/ . Accessed 20 Oct 2019.

The honeynet project. http://www.honeynet.org/chapters/france/ . Accessed 20 Oct 2019.

Canadian institute of cybersecurity, university of new brunswick, iscx dataset, http://www.unb.ca/cic/datasets/index.html/ . Accessed 20 Oct 2019.

Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur. 2012;31(3):357–74.

The ctu-13 dataset. https://stratosphereips.org/category/datasets-ctu13 . Accessed 20 Oct 2019.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 Military Communications and Information Systems Conference (MilCIS). IEEE; 2015. p. 1–6.

Cse-cic-ids2018 [online]. available: https://www.unb.ca/cic/ datasets/ids-2018.html/ . Accessed 20 Oct 2019.

Cic-ddos2019 [online]. available: https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2019.

Jing X, Yan Z, Jiang X, Pedrycz W. Network traffic fusion and analysis against ddos flooding attacks with a novel reversible sketch. Inform Fusion. 2019;51:100–13.

Xie M, Hu J, Yu X, Chang E. Evaluating host-based anomaly detection systems: application of the frequency-based algorithms to adfa-ld. In: International conference on network and system security. New York: Springer; 2015. p. 542–49.

Lindauer B, Glasser J, Rosen M, Wallnau KC, ExactData L. Generating test data for insider threat detectors. JoWUA. 2014;5(2):80–94.

Glasser J, Lindauer B. Bridging the gap: A pragmatic approach to generating insider threat data. In: 2013 IEEE Security and Privacy Workshops. IEEE; 2013. p. 98–104.

Enronspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/ . Accessed 20 Oct 2019.

Spamassassin. http://www.spamassassin.org/publiccorpus/ . Accessed 20 Oct 2019.

Lingspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/lingspampublic.tar.gz/ . Accessed 20 Oct 2019.

Alexa top sites. https://aws.amazon.com/alexa-top-sites/ . Accessed 20 Oct 2019.

Bambenek consulting—master feeds. available online: http://osint.bambenekconsulting.com/feeds/ . Accessed 20 Oct 2019.

Dgarchive. https://dgarchive.caad.fkie.fraunhofer.de/site/ . Accessed 20 Oct 2019.

Zago M, Pérez MG, Pérez GM. Umudga: A dataset for profiling algorithmically generated domain names in botnet detection. Data in Brief. 2020;105400.

Zhou Y, Jiang X. Dissecting android malware: characterization and evolution. In: 2012 IEEE Symposium on security and privacy. IEEE; 2012. p. 95–109.

Virusshare. http://virusshare.com/ . Accessed 20 Oct 2019.

Virustotal. https://virustotal.com/ . Accessed 20 Oct 2019.

Comodo. https://www.comodo.com/home/internet-security/updates/vdp/database . Accessed 20 Oct 2019.

Contagio. http://contagiodump.blogspot.com/ . Accessed 20 Oct 2019.

Kumar R, Xiaosong Z, Khan RU, Kumar J, Ahad I. Effective and explainable detection of android malware based on machine learning algorithms. In: Proceedings of the 2018 international conference on computing and artificial intelligence. ACM; 2018. p. 35–40.

Microsoft malware classification (big 2015). arXiv:org/abs/1802.10135/ . Accessed 20 Oct 2019.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Future Gen Comput Syst. 2019;100:779–96.

McIntosh TR, Jang-Jaccard J, Watters PA. Large scale behavioral analysis of ransomware attacks. In: International conference on neural information processing. New York: Springer; 2018. p. 217–29.

Han J, Pei J, Kamber M. Data mining: concepts and techniques, 2011.

Witten IH, Frank E. Data mining: Practical machine learning tools and techniques, 2005.

Dua S, Du X. Data mining and machine learning in cybersecurity, 2016.

Kotpalliwar MV, Wajgi R. Classification of attacks using support vector machine (svm) on kddcup’99 ids database. In: 2015 Fifth international conference on communication systems and network technologies. IEEE; 2015. p. 987–90.

Pervez MS, Farid DM. Feature selection and intrusion classification in nsl-kdd cup 99 dataset employing svms. In: The 8th international conference on software, knowledge, information management and applications (SKIMA 2014). IEEE; 2014. p. 1–6.

Yan M, Liu Z. A new method of transductive svm-based network intrusion detection. In: International conference on computer and computing technologies in agriculture. New York: Springer; 2010. p. 87–95.

Li Y, Xia J, Zhang S, Yan J, Ai X, Dai K. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Syst Appl. 2012;39(1):424–30.

Raman MG, Somu N, Jagarapu S, Manghnani T, Selvam T, Krithivasan K, Sriram VS. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review. 2019, p. 1–32.

Kokila R, Selvi ST, Govindarajan K. Ddos detection and analysis in sdn-based environment using support vector machine classifier. In: 2014 Sixth international conference on advanced computing (ICoAC). IEEE; 2014. p. 205–10.

Xie M, Hu J, Slay J. Evaluating host-based anomaly detection systems: Application of the one-class svm algorithm to adfa-ld. In: 2014 11th international conference on fuzzy systems and knowledge discovery (FSKD). IEEE; 2014. p. 978–82.

Saxena H, Richariya V. Intrusion detection in kdd99 dataset using svm-pso and feature reduction with information gain. Int J Comput Appl. 2014;98:6.

Chandrasekhar A, Raghuveer K. Confederation of fcm clustering, ann and svm techniques to implement hybrid nids using corrected kdd cup 99 dataset. In: 2014 international conference on communication and signal processing. IEEE; 2014. p. 672–76.

Shapoorifard H, Shamsinejad P. Intrusion detection using a novel hybrid method incorporating an improved knn. Int J Comput Appl. 2017;173(1):5–9.

Vishwakarma S, Sharma V, Tiwari A. An intrusion detection system using knn-aco algorithm. Int J Comput Appl. 2017;171(10):18–23.

Meng W, Li W, Kwok L-F. Design of intelligent knn-based alarm filter using knowledge-based alert verification in intrusion detection. Secur Commun Netw. 2015;8(18):3883–95.

Dada E. A hybridized svm-knn-pdapso approach to intrusion detection system. In: Proc. Fac. Seminar Ser., 2017, p. 14–21.

Sharifi AM, Amirgholipour SK, Pourebrahimi A. Intrusion detection based on joint of k-means and knn. J Converg Inform Technol. 2015;10(5):42.

Lin W-C, Ke S-W, Tsai C-F. Cann: an intrusion detection system based on combining cluster centers and nearest neighbors. Knowl Based Syst. 2015;78:13–21.

Koc L, Mazzuchi TA, Sarkani S. A network intrusion detection system based on a hidden naïve bayes multiclass classifier. Exp Syst Appl. 2012;39(18):13492–500.

Moon D, Im H, Kim I, Park JH. Dtb-ids: an intrusion detection system based on decision tree using behavior analysis for preventing apt attacks. J Supercomput. 2017;73(7):2881–95.

Ingre, B., Yadav, A., Soni, A.K.: Decision tree based intrusion detection system for nsl-kdd dataset. In: International conference on information and communication technology for intelligent systems. New York: Springer; 2017. p. 207–18.

Malik AJ, Khan FA. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection. Cluster Comput. 2018;21(1):667–80.

Relan NG, Patil DR. Implementation of network intrusion detection system using variant of decision tree algorithm. In: 2015 international conference on nascent technologies in the engineering field (ICNTE). IEEE; 2015. p. 1–5.

Rai K, Devi MS, Guleria A. Decision tree based algorithm for intrusion detection. Int J Adv Netw Appl. 2016;7(4):2828.

Sarker IH, Abushark YB, Alsolami F, Khan AI. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Puthran S, Shah K. Intrusion detection using improved decision tree algorithm with binary and quad split. In: International symposium on security in computing and communication. New York: Springer; 2016. p. 427–438.

Balogun AO, Jimoh RG. Anomaly intrusion detection using an hybrid of decision tree and k-nearest neighbor, 2015.

Azad C, Jha VK. Genetic algorithm to solve the problem of small disjunct in the decision tree based intrusion detection system. Int J Comput Netw Inform Secur. 2015;7(8):56.

Jo S, Sung H, Ahn B. A comparative study on the performance of intrusion detection using decision tree and artificial neural network models. J Korea Soc Dig Indus Inform Manag. 2015;11(4):33–45.

Zhan J, Zulkernine M, Haque A. Random-forests-based network intrusion detection systems. IEEE Trans Syst Man Cybern C. 2008;38(5):649–59.

Tajbakhsh A, Rahmati M, Mirzaei A. Intrusion detection using fuzzy association rules. Appl Soft Comput. 2009;9(2):462–9.

Mitchell R, Chen R. Behavior rule specification-based intrusion detection for safety critical medical cyber physical systems. IEEE Trans Depend Secure Comput. 2014;12(1):16–30.

Alazab M, Venkataraman S, Watters P. Towards understanding malware behaviour by the extraction of api calls. In: 2010 second cybercrime and trustworthy computing Workshop. IEEE; 2010. p. 52–59.

Yuan Y, Kaklamanos G, Hogrefe D. A novel semi-supervised adaboost technique for network anomaly detection. In: Proceedings of the 19th ACM international conference on modeling, analysis and simulation of wireless and mobile systems. ACM; 2016. p. 111–14.

Ariu D, Tronci R, Giacinto G. Hmmpayl: an intrusion detection system based on hidden markov models. Comput Secur. 2011;30(4):221–41.

Årnes A, Valeur F, Vigna G, Kemmerer RA. Using hidden markov models to evaluate the risks of intrusions. In: International workshop on recent advances in intrusion detection. New York: Springer; 2006. p. 145–64.

Hansen JV, Lowry PB, Meservy RD, McDonald DM. Genetic programming for prevention of cyberterrorism through dynamic and evolving intrusion detection. Decis Supp Syst. 2007;43(4):1362–74.

Aslahi-Shahri B, Rahmani R, Chizari M, Maralani A, Eslami M, Golkar MJ, Ebrahimi A. A hybrid method consisting of ga and svm for intrusion detection system. Neural Comput Appl. 2016;27(6):1669–76.

Alrawashdeh K, Purdy C. Toward an online anomaly intrusion detection system based on deep learning. In: 2016 15th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2016. p. 195–200.

Yin C, Zhu Y, Fei J, He X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. 2017;5:21954–61.

Kim J, Kim J, Thu HLT, Kim H. Long short term memory recurrent neural network classifier for intrusion detection. In: 2016 international conference on platform technology and service (PlatCon). IEEE; 2016. p. 1–5.

Almiani M, AbuGhazleh A, Al-Rahayfeh A, Atiewi S, Razaque A. Deep recurrent neural network for iot intrusion detection system. Simulation Modelling Practice and Theory. 2019;102031.

Kolosnjaji B, Zarras A, Webster G, Eckert C. Deep learning for classification of malware system call sequences. In: Australasian joint conference on artificial intelligence. New York: Springer; 2016. p. 137–49.

Wang W, Zhu M, Zeng X, Ye X, Sheng Y. Malware traffic classification using convolutional neural network for representation learning. In: 2017 international conference on information networking (ICOIN). IEEE; 2017. p. 712–17.

Alauthman M, Aslam N, Al-kasassbeh M, Khan S, Al-Qerem A, Choo K-KR. An efficient reinforcement learning-based botnet detection approach. J Netw Comput Appl. 2020;150:102479.

Blanco R, Cilla JJ, Briongos S, Malagón P, Moya JM. Applying cost-sensitive classifiers with reinforcement learning to ids. In: International conference on intelligent data engineering and automated learning. New York: Springer; 2018. p. 531–38.

Lopez-Martin M, Carro B, Sanchez-Esguevillas A. Application of deep reinforcement learning to intrusion detection for supervised problems. Exp Syst Appl. 2020;141:112963.

Sarker IH, Kayes A, Watters P. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–45.

Quinlan JR. C4.5: Programs for machine learning. Machine Learning, 1993.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mobile Networks and Applications. 2019, p. 1–11.

Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Freund Y, Schapire RE, et al: Experiments with a new boosting algorithm. In: Icml, vol. 96, p. 148–156 (1996). Citeseer

Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J Royal Stat Soc C. 1992;41(1):191–201.

Watters PA, McCombie S, Layton R, Pieprzyk J. Characterising and predicting cyber attacks using the cyber attacker model profile (camp). J Money Launder Control. 2012.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):95.

MacQueen J. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematical statistics and probability, vol. 1, 1967.

Rokach L. A survey of clustering algorithms. In: Data Mining and Knowledge Discovery Handbook. New York: Springer; 2010. p. 269–98.

Sneath PH. The application of computers to taxonomy. J Gen Microbiol. 1957;17:1.

Sorensen T. method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948;5.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J. 2018;61(3):349–68.

Kim G, Lee S, Kim S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Exp Syst Appl. 2014;41(4):1690–700.

MathSciNet   Google Scholar  

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM; 1993. vol. 22, p. 207–16.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Agrawal R, Srikant R, et al: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994, vol. 1215, p. 487–99.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the eleventh international conference on data engineering. IEEE; 1995. p. 25–33.

Ma BLWHY. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record. ACM; 2000. vol. 29, p. 1–12.

Sarker IH, Salim FD. Mining user behavioral rules from smartphone data through association analysis. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, Australia. New York: Springer; 2018. p. 450–61.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on information and knowledge management. ACM; 2001. p. 474–81.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Coelho IM, Coelho VN, Luz EJS, Ochi LS, Guimarães FG, Rios E. A gpu deep learning metaheuristic based model for time series forecasting. Appl Energy. 2017;201:412–8.

Van Efferen L, Ali-Eldin AM. A multi-layer perceptron approach for flow-based anomaly detection. In: 2017 International symposium on networks, computers and communications (ISNCC). IEEE; 2017. p. 1–6.

Liu H, Lang B, Liu M, Yan H. Cnn and rnn based payload classification methods for attack detection. Knowl Based Syst. 2019;163:332–41.

Berman DS, Buczak AL, Chavis JS, Corbett CL. A survey of deep learning methods for cyber security. Information. 2019;10(4):122.

Bellman R. A markovian decision process. J Math Mech. 1957;1:679–84.

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet of Things. 2019;5:180–93.

Kayes ASM, Han J, Colman A. OntCAAC: an ontology-based approach to context-aware access control for software services. Comput J. 2015;58(11):3000–34.

Kayes ASM, Rahayu W, Dillon T. An ontology-based approach to dynamic contextual role for pervasive access control. In: AINA 2018. IEEE Computer Society, 2018.

Colombo P, Ferrari E. Access control technologies for big data management systems: literature review and future trends. Cybersecurity. 2019;2(1):1–13.

Aleroud A, Karabatis G. Contextual information fusion for intrusion detection: a survey and taxonomy. Knowl Inform Syst. 2017;52(3):563–619.

Sarker IH, Abushark YB, Khan AI. Contextpca: Predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Madsen RE, Hansen LK, Winther O. Singular value decomposition and principal component analysis. Neural Netw. 2004;1:1–5.

Qiao L-B, Zhang B-F, Lai Z-Q, Su J-S. Mining of attack models in ids alerts from network backbone by a two-stage clustering method. In: 2012 IEEE 26th international parallel and distributed processing symposium workshops & Phd Forum. IEEE; 2012. p. 1263–9.

Sarker IH, Colman A, Han J. Recencyminer: mining recency-based personalized behavior from contextual smartphone data. J Big Data. 2019;6(1):49.

Ullah F, Babar MA. Architectural tactics for big data cybersecurity analytics systems: a review. J Syst Softw. 2019;151:81–118.

Zhao S, Leftwich K, Owens M, Magrone F, Schonemann J, Anderson B, Medhi D. I-can-mama: Integrated campus network monitoring and management. In: 2014 IEEE network operations and management symposium (NOMS). IEEE; 2014. p. 1–7.

Abomhara M, et al. Cyber security and the internet of things: vulnerabilities, threats, intruders and attacks. J Cyber Secur Mob. 2015;4(1):65–88.

Helali RGM. Data mining based network intrusion detection system: A survey. In: Novel algorithms and techniques in telecommunications and networking. New York: Springer; 2010. p. 501–505.

Ryoo J, Rizvi S, Aiken W, Kissell J. Cloud security auditing: challenges and emerging approaches. IEEE Secur Priv. 2013;12(6):68–74.

Densham B. Three cyber-security strategies to mitigate the impact of a data breach. Netw Secur. 2015;2015(1):5–8.

Salah K, Rehman MHU, Nizamuddin N, Al-Fuqaha A. Blockchain for ai: review and open research challenges. IEEE Access. 2019;7:10127–49.

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inform Manag. 2015;35(2):137–44.

Golchha N. Big data-the information revolution. Int J Adv Res. 2015;1(12):791–4.

Hariri RH, Fredericks EM, Bowers KM. Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data. 2019;6(1):44.

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big data. 2015;2(1):21.

Download references

Acknowledgements

The authors would like to thank all the reviewers for their rigorous review and comments in several revision rounds. The reviews are detailed and helpful to improve and finalize the manuscript. The authors are highly grateful to them.

Author information

Authors and affiliations.

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Chittagong University of Engineering and Technology, Chittagong, 4349, Bangladesh

La Trobe University, Melbourne, VIC, 3086, Australia

A. S. M. Kayes, Paul Watters & Alex Ng

University of Nevada, Reno, USA

Shahriar Badsha

Macquarie University, Sydney, NSW, 2109, Australia

Hamed Alqahtani

You can also search for this author in PubMed   Google Scholar

Contributions

This article provides not only a discussion on cybersecurity data science and relevant methods but also to discuss the applicability towards data-driven intelligent decision making in cybersecurity systems and services. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Sarker, I.H., Kayes, A.S.M., Badsha, S. et al. Cybersecurity data science: an overview from machine learning perspective. J Big Data 7 , 41 (2020). https://doi.org/10.1186/s40537-020-00318-5

Download citation

Received : 26 October 2019

Accepted : 21 June 2020

Published : 01 July 2020

DOI : https://doi.org/10.1186/s40537-020-00318-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Decision making
  • Cyber-attack
  • Security modeling
  • Intrusion detection
  • Cyber threat intelligence

research paper about cyber security

Advertisement

Advertisement

Artificial intelligence in cyber security: research advances, challenges, and opportunities

  • Published: 13 March 2021
  • Volume 55 , pages 1029–1053, ( 2022 )

Cite this article

  • Zhimin Zhang   ORCID: orcid.org/0000-0002-9065-8724 1 ,
  • Huansheng Ning   ORCID: orcid.org/0000-0001-6413-193X 1 , 2 ,
  • Feifei Shi 1 ,
  • Fadi Farha 1 ,
  • Yang Xu 1 ,
  • Jiabo Xu 3 ,
  • Fan Zhang 1 &
  • Kim-Kwang Raymond Choo   ORCID: orcid.org/0000-0001-9208-5336 4  

15k Accesses

73 Citations

24 Altmetric

Explore all metrics

In recent times, there have been attempts to leverage artificial intelligence (AI) techniques in a broad range of cyber security applications. Therefore, this paper surveys the existing literature (comprising 54 papers mainly published between 2016 and 2020) on the applications of AI in user access authentication, network situation awareness, dangerous behavior monitoring, and abnormal traffic identification. This paper also identifies a number of limitations and challenges, and based on the findings, a conceptual human-in-the-loop intelligence cyber security model is presented.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

research paper about cyber security

Similar content being viewed by others

research paper about cyber security

Artificial Intelligence and Cybersecurity: Past, Presence, and Future

research paper about cyber security

Artificial Intelligence for Cybersecurity: Recent Advancements, Challenges and Opportunities

research paper about cyber security

A Comprehensive Review on Various Artificial Intelligence Based Techniques and Approaches for Cyber Security

Cloud adoption risk report 2019 (pdf). https://mscdss.ds.unipi.gr/wp-content/uploads/2018/10/Cloud-Adoption-Risk-Report-2019.pdf (2019).

What’s the difference between network security & cyber security? https://www.ecpi.edu/blog/whats-difference-between-network-security-cyber-security (2020).

Ai in cybersecurity-capgemini worldwide. https://www.capgemini.com/news/ai-in-cybersecurity/ (2020).

Ai index 2019 report (pdf). https://hai.stanford.edu/sites/g/files/sbiybj10986/f/ai_index_2019_report.pdf (2020).

Enterprise immune system-darktrace. https://www.darktrace.com/en/products/enterprise/ (2019).

Invincea launches x-as-a-service managed security. https://www.eweek.com/security/invincea-launches-x-as-a-service-managed-security (2020).

Congnigo-infosecurity magazine. https://www.infosecurity-magazine.com/directory/cognigo/ (2019).

Speech emotion recognition using semi-supervised learning with ladder networks. In: 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), pp. 1–5 (2018).

Knowledge-directed artificial intelligence reasoning over schemas (kairos). https://www.darpa.mil/program/knowledge-directed-artificial-intelligence-reasoning-over-schemas (2020).

Darpa robotics challenge (DRC) using human-machine teamwork to perform disasterresponse with a humanoid robot. https://apps.dtic.mil/docs/citations/AD1027886 (2020).

Training ai to win a dogfight. https://www.darpa.mil/news-events/2019-05-08 (2020).

Cyborg super soldiers: Us army report reveals vision for deadly ‘machine humans’ with infrared sight, boosted strength and mind-controlled weapons by 2050. https://www.dailymail.co.uk/sciencetech/article-7738669/US-Military-scientists-create-plan-cyborg-super-soldier-future.html (2019).

Adekunle YA, Okolie SO, Adebayo AO, Ebiesuwa S, Ehiwe DD (2019) Holistic exploration of gaps vis-à-vis artificial intelligence in automated teller machine and internet banking. In: International journal of applied information systems (IJAIS), vol 12

Abdallah AE (2016) Detection and prediction of insider threats to cyber security: a systematic literature review and meta-analysis. Big Data Anal 1(1):6

Article   Google Scholar  

Ahmed M, Mahmood AN, Hu J (2015) A survey of network anomaly detection techniques. J. Netw. Comput. Appl. 60:19–31

Aljamal I, Tekeoğlu A, Bekiroglu K, Sengupta S (2019) Hybrid intrusion detection system using machine learning techniques in cloud computing environments. In: 2019 IEEE 17th international conference on software engineering research, management and applications (SERA), pp 84–89

Aljurayban NS, Emam A (2015) Framework for cloud intrusion detection system service. In: 2015 2nd world symposium on web applications and networking (WSWAN), pp 1–5

Amberkar A, Awasarmol P, Deshmukh G, Dave P (2018) Speech recognition using recurrent neural networks. In: 2018 international conference on current trends towards converging technologies (ICCTCT), pp 1–4

Bakhshi B, Veisi H (2019) End to end fingerprint verification based on convolutional neural network. In: 2019 27th Iranian conference on electrical engineering (ICEE), pp 1994–1998

Bao H, He H, Liu Z, Liu Z (2019) Research on information security situation awareness system based on big data and artificial intelligence technology. In: 2019 international conference on robots intelligent system (ICRIS), pp 318–322

Benias N, Markopoulos AP (2017) A review on the readiness level and cyber-security challenges in industry 4.0. In: 2017 south eastern European design automation, computer engineering, computer networks and social media conference (SEEDA-CECNSM), pp 1–5

Chang C, Eude T, Obando Carbajal LE (2016) Biometric authentication by keystroke dynamics for remote evaluation with one-class classification. In: Khoury R, Drummond C (eds) Advances in artificial intelligence. Springer, Cham, pp 21–32

Chapter   Google Scholar  

Deng M, Yang H, Cao J, Feng X (2019) View-invariant gait recognition based on deterministic learning and knowledge fusion. In: 2019 international joint conference on neural networks (IJCNN), pp 1–8

Ding C, Tao D (2018) Trunk-branch ensemble convolutional neural networks for video-based face recognition. IEEE Trans Pattern Anal Mach Intell 40(4):1002–1014

Dongmei Z, Jinxing L (2018) Study on network security situation awareness based on particle swarm optimization algorithm. Comput Ind Eng 125:764–775. https://doi.org/10.1016/j.cie.2018.01.006

Fairuz S, Habaebi MH, Elsheikh EMA (2018) Finger vein identification based on transfer learning of alexnet. In: 2018 7th international conference on computer and communication engineering (ICCCE), pp 465–469

Fernández Maimó L, Perales Gómez AL, García Clemente FJ, Gil Pérez M, Martínez Pérez G (2018) A self-adaptive deep learning-based system for anomaly detection in 5g networks. IEEE Access 6:7700–7712

Gangwar A, Joshi A (2016) Deepirisnet: deep iris representation with applications in iris recognition and cross-sensor iris recognition. In: 2016 IEEE international conference on image processing (ICIP), pp 2301–2305

Gu T, Dolan-Gavitt B, Garg S (2017) BadNets: identifying vulnerabilities in the machine learning model supply chain. ArXiv e-prints arXiv:1708.06733

Guan Y, Ge X (2018) Distributed attack detection and secure estimation of networked cyber-physical systems against false data injection attacks and jamming attacks. IEEE Trans Signal Inf Process Netw 4(1):48–59

MathSciNet   Google Scholar  

Han Z, Wang J (2019) Speech emotion recognition based on deep learning and kernel nonlinear PSVM. In: 2019 Chinese control and decision conference (CCDC), pp. 1426–1430

Hariyanto, Sudiro SA, Lukman S (2015) Minutiae matching algorithm using artificial neural network for fingerprint recognition. In: 2015 3rd international conference on artificial intelligence, modelling and simulation (AIMS), pp 37–41

Holzinger A, Plass M, Holzinger K, Crişan GC, Pintea CM, Palade V (2016) Towards interactive machine learning (IML): applying ant colony algorithms to solve the traveling salesman problem with the human-in-the-loop approach. In: Buccafurri F, Holzinger A, Kieseberg P, Tjoa AM, Weippl E (eds) Availability, reliability, and security in information systems. Springer, Cham, pp 81–95

Hong H, Lee M, Park K (2017) Convolutional neural network-based finger-vein recognition using nir image sensors. Sensors (Switzerland) 17:1297. https://doi.org/10.3390/s17061297

Hsieh C, Chan T (2016) Detection ddos attacks based on neural-network using apache spark. In: 2016 international conference on applied system innovation (ICASI), pp 1–4

Hu W, Tan Y (2017) Generating adversarial malware examples for black-box attacks based on gan. CoRR. http://arxiv.org/abs/1702.05983

Jenab K, Moslehpour S (2016) Cyber security management: a review. Soc. Bus. Manag. Dyn. 5(11):16–39

Google Scholar  

Ji Y, Bowman B, Huang HH (2019) Securing malware cognitive systems against adversarial attacks. In: 2019 IEEE international conference on cognitive computing (ICCC), pp 1–9

Jyothi V, Wang X, Addepalli SK, Karri R (2016) Brain: behavior based adaptive intrusion detection in networks: Using hardware performance counters to detect ddos attacks. In: 2016 29th international conference on VLSI design and 2016 15th international conference on embedded systems (VLSID), pp 587–588

Sugandhi K, Raju G (2019) An efficient hog-centroid descriptor for human gait recognition. In: 2019 amity international conference on artificial intelligence (AICAI), pp 355–360

Kanimozhi, V, Jacob TP (2019) Artificial intelligence based network intrusion detection with hyper-parameter optimization tuning on the realistic cyber dataset cse-cic-ids2018 using cloud computing. In: 2019 international conference on communication and signal processing (ICCSP), pp 0033–0036

Kolosnjaji B, Demontis A, Biggio B, Maiorca D, Giacinto G, Eckert C, Roli F (2018) Adversarial malware binaries: evading deep learning for malware detection in executables. In: 2018 26th European signal processing conference (EUSIPCO), pp 533–537

Kong L, Huang G, Wu K (2017) Identification of abnormal network traffic using support vector machine. In: 2017 18th international conference on parallel and distributed computing, applications and technologies (PDCAT), pp 288–292

Kong L, Huang G, Wu K, Tang Q, Ye S (2018) Comparison of internet traffic identification on machine learning methods. In: 2018 international conference on big data and artificial intelligence (BDAI), pp 38–41

Kong L, Huang G, Zhou Y, Ye J (2018) Fast abnormal identification for large scale internet traffic. In: Proceedings of the 8th international conference on communication and network security, ICCNS 2018. Association for Computing Machinery, New York, pp 117–120 (2018). https://doi.org/10.1145/3290480.3290498

Korkmaz Y (2016) Developing password security system by using artificial neural networks in user log in systems. In: 2016 electric electronics, computer science, biomedical engineerings’ meeting (EBBT), pp 1–4

Kowert W (2017) The foreseeability of human-artificial intelligence interactions. Texas Law Rev 96:181–204

Kruse C, Frederick B, Jacobson T, Monticone D (2016) Cybersecurity in healthcare: a systematic review of modern threats and trends. Technol Health Care 25:1–10. https://doi.org/10.3233/THC-161263

Li C, Li XM (2017) Cyber performance situation awareness on fuzzy correlation analysis. In: 2017 3rd IEEE international conference on computer and communications (ICCC), pp 424–428

Li X, Zhang X, Wang D (2018) Spatiotemporal cyberspace situation awareness mechanism for backbone networks. In: 2018 4th international conference on big data computing and communications (BIGCOM), pp 168–173

Liu W, Li W, Sun L, Zhang L, Chen P (2017) Finger vein recognition based on deep learning. In: 2017 12th IEEE conference on industrial electronics and applications (ICIEA), pp 205–210

Lu X, Xiao L, Xu T, Zhao Y, Tang Y, Zhuang W (2020) Reinforcement learning based PHY authentication for Vanets. IEEE Trans Veh Technol 69(3):3068–3079

Lu Y, Xu LD (2019) Internet of things (IoT) cybersecurity research: a review of current research topics. IEEE Internet Things J 6(2):2103–2115

Mahmood T, Afzal U (2013) Security analytics: big data analytics for cybersecurity: a review of trends, techniques and tools. In: 2013 2nd national conference on information assurance (NCIA), pp 129–134

Marir N, Wang H, Feng G, Li B, Jia M (2018) Distributed abnormal behavior detection approach based on deep belief network and ensemble SVM using spark. IEEE Access 6:59657–59671

McIntire JP, McIntire LK, Havig PR (2009) A variety of automated turing tests for network security: Using ai-hard problems in perception and cognition to ensure secure collaborations. In: 2009 international symposium on collaborative technologies and systems, pp 155–162

Naderpour M, Lu J, Zhang G (2014) An intelligent situation awareness support system for safety-critical environments. Decis Support Syst 59:325–340. https://doi.org/10.1016/j.dss.2014.01.004

Nithyakani P, Shanthini A, Ponsam G (2019) Human gait recognition using deep convolutional neural network. In: 2019 3rd international conference on computing and communications technologies (ICCCT), pp 208–211

Nunes DS, Zhang P, Sá Silva J (2015) A survey on human-in-the-loop applications towards an internet of all. IEEE Commun Surv Tutor 17(2):944–965

Ozsen S, Gunes S, Kara S, Latifoglu F (2009) Use of kernel functions in artificial immune systems for the nonlinear classification problems. IEEE Trans Inf Technol Biomed 13(4):621–628

Pandeeswari N, Kumar G (2016) Anomaly detection system in cloud environment using fuzzy clustering based ANN. Mob Netw Appl 21(3):494–505

Parthasarathy S, Busso C (2019) Semi-supervised speech emotion recognition with ladder networks. IEEE/ACM Trans Audio, Speech, Lang Process 28:2697–2709

Păvăloi L, Niţă CD (2018) Iris recognition using sift descriptors with different distance measures. In: 2018 10th international conference on electronics, computers and artificial intelligence (ECAI), pp 1–4

Qiu M (2017) Keystroke biometric systems for user authentication. J Signal Process Syst 86(2–3):175–190

Saeed F, Hussain M, Aboalsamh HA (2018) Classification of live scanned fingerprints using histogram of gradient descriptor. In: 2018 21st Saudi computer society national computer conference (NCC), pp 1–5

Salyut J, Kurnaz C (2018) Profile face recognition using local binary patterns with artificial neural network. In: 2018 international conference on artificial intelligence and data processing (IDAP), pp 1–4

Santhanam GR, Holland B, Kothari S, Ranade N (2017) Human-on-the-loop automation for detecting software side-channel vulnerabilities. In: Shyamasundar RK, Singh V, Vaidya J (eds) Information systems security. Springer, Cham, pp 209–230

Schlegel U, Arnout H, El-Assady M, Oelke D, Keim DA (2019) Towards a rigorous evaluation of xai methods on time series. In: 2019 IEEE/CVF international conference on computer vision workshop (ICCVW), pp 4197–4201

Shelton J, Jenkins J, Roy K (2016) Micro-dimensional feature extraction for multispectral iris recognition. SoutheastCon 2016:1–5

Shi Y, Li T, Renfa L, Peng X, Tang P (2017) An immunity-based iot environment security situation awareness model. J Comput Commun 5:182–197. https://doi.org/10.4236/jcc.2017.57016

Shoufan A (2017) Continuous authentication of uav flight command data using behaviometrics. In: 2017 IFIP/IEEE international conference on very large scale integration (VLSI-SoC), pp 1–6

Singh K, Kumar J, Tripathi G, Chullai GA (2017) Sparse proximity based robust fingerprint recognition. In: 2017 international conference on computing, communication and automation (ICCCA), pp 1025–1028

Sliti M, Abdallah W, Boudriga N (2018) Jamming attack detection in optical uav networks. In: 2018 20th international conference on transparent optical networks (ICTON), pp 1–5

Su J, Vargas DV, Sakurai K (2019) One pixel attack for fooling deep neural networks. IEEE Trans Evolut Comput 23(5):828–841

Taylor PJ, Dargahi T, Dehghantanha A, Parizi RM, Choo KKR (2019) A systematic literature review of blockchain cyber security. Digit Commun Netw 6(2):147–156

Thongsook A, Nunthawarasilp T, Kraypet P, Lim J, Ruangpayoongsak N (2019) C4.5 decision tree against neural network on gait phase recognition for lower limp exoskeleton. In: 2019 1st international symposium on instrumentation, control, artificial intelligence, and robotics (ICA-SYMP), pp 69–72

Tyworth M, Giacobe NA, Mancuso VF, McNeese MD, Hall DL (2013) A human-in-the-loop approach to understanding situation awareness in cyber defence analysis. EAI End Trans Secur Saf. https://doi.org/10.4108/trans.sesa.01-06.2013.e6

Uddin MZ, Khaksar W, Torresen J (2017) A robust gait recognition system using spatiotemporal features and deep learning. In: 2017 IEEE international conference on multisensor fusion and integration for intelligent systems (MFI), pp 156–161

Veeramachaneni K, Arnaldo I, Korrapati V, Bassias C, Li K (2016) Ai \(\hat{2}\) : Training a big data machine to defend. In: 2016 IEEE 2nd international conference on big data security on cloud (BigDataSecurity), IEEE international conference on high performance and smart computing (HPSC), and IEEE international conference on intelligent data and security (IDS), pp 49–54

Verma M, Vipparthi SK, Singh G (2019) Hinet: hybrid inherited feature learning network for facial expression recognition. IEEE Lett Comput Soc 2(4):36–39

Wang Z, Fang B (2019) Application of combined kernel function artificial intelligence algorithm in mobile communication network security authentication mechanism. J Supercomput 75(9):5946–5964

Wang ZJ, Turko R, Shaikh O, Park H, Das N, Hohman F, Kahng M, Chau DH (2020) CNN explainer: learning convolutional neural networks with interactive visualization. IEEE Trans Vis Comput Gr. https://doi.org/10.1109/TVCG.2020.3030418

Xiao R, Zhu H, Song C, Liu X, Dong J, Li H (2018) Attacking network isolation in software-defined networks: New attacks and countermeasures. In: 2018 27th international conference on computer communication and networks (ICCCN), pp 1–9

Yang H, Jia Y, Han WH, Nie YP, Li SD, Zhao XJ (2019) Calculation of network security index based on convolution neural networks, pp 530–540. https://doi.org/10.1007/978-3-030-24271-8_47

Yang W, Wang S, Hu J, Zheng G, Yang J, Valli C (2019) Securing deep learning based edge finger vein biometrics with binary decision diagram. IEEE Trans Ind Inform 15(7):4244–4253

Yavanoglu O, Aydos M (2017) A review on cyber security datasets for machine learning algorithms. In: 2017 IEEE international conference on big data (big data), pp 2186–2193

Young Park C, Blackmond Laskey K, Costa PCG, Matsumoto S (2016) A process for human-aided multi-entity bayesian networks learning in predictive situation awareness. In: 2016 19th international conference on information fusion (FUSION), pp 2116–2124

Yuan X, Li C, Li X (2017) Deepdefense: identifying ddos attack via deep learning. In: 2017 IEEE international conference on smart computing (SMARTCOMP), pp 1–8

Yunhu Jin, Shen Y, Zhang G, Hua Zhi (2016) The model of network security situation assessment based on random forest. In: 2016 7th IEEE international conference on software engineering and service science (ICSESS), pp 977–980

Zeng J, Wang F, Deng J, Qin C, Zhai Y, Gan J, Piuri V (2020) Finger vein verification algorithm based on fully convolutional neural network and conditional random field. IEEE Access 8:65402–65419

Zeng Y, Qi Z, Chen W, Huang Y, Zheng X, Qiu H (2019) Test: an end-to-end network traffic examination and identification framework based on spatio-temporal features extraction. CoRR. http://arxiv.org/abs/1908.10271

Zhang W, Lu X, Gu Y, Liu Y, Meng X, Li J (2019) A robust iris segmentation scheme based on improved u-net. IEEE Access 7:85082–85089

Zhang Y, Chen X, Guo D, Song M, Teng Y, Wang X (2019) PCCN: Parallel cross convolutional neural network for abnormal network traffic flows detection in multi-class imbalanced network traffic flows. IEEE Access 7:119904–119916

Zhang Y, Li W, Zhang L, Ning X, Sun L, Lu Y (2019) Adaptive learning Gabor filter for finger-vein recognition. IEEE Access 7:159821–159830

Zhang Z, Shi F, Wan Y, Xu Y, Zhang F, Ning H (2020) Application progress of artificial intelligence in military confrontation. Chin J Eng 42(9):1106–1118. https://doi.org/10.13374/j.issn2095-9389.2019.11.19.001

Download references

Acknowledgements

This work was funded by the National Natural Science Foundation of China (Grant No. 61872038). This work of K.-K. R. Choo was supported only by the Cloud Technology Endowed Professorship.

Author information

Authors and affiliations.

School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, 100083, China

Zhimin Zhang, Huansheng Ning, Feifei Shi, Fadi Farha, Yang Xu & Fan Zhang

Beijing Engineering Research Center for Cyberspace Data Analysis and Applications, Beijing, 100083, China

Huansheng Ning

School of Information Engineering, Xinjiang Institute of Engineering, Xinjiang, China

Department of Information Systems and Cyber Security, University of Texas at San Antonio, San Antonio, TX, 78249-0631, USA

Kim-Kwang Raymond Choo

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Huansheng Ning .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Zhang, Z., Ning, H., Shi, F. et al. Artificial intelligence in cyber security: research advances, challenges, and opportunities. Artif Intell Rev 55 , 1029–1053 (2022). https://doi.org/10.1007/s10462-021-09976-0

Download citation

Published : 13 March 2021

Issue Date : February 2022

DOI : https://doi.org/10.1007/s10462-021-09976-0

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Cyber Security
  • Artificial Intelligence
  • Security Methods
  • Human-in-the-Loop
  • Find a journal
  • Publish with us
  • Track your research

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Sensors (Basel)

Logo of sensors

Cybersecurity of Critical Infrastructures: Challenges and Solutions

Leandros maglaras.

1 School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, UK

Helge Janicke

2 Cyber Security Cooperative Research Centre, Edith Cowan University, Perth 6027, Australia; [email protected]

Mohamed Amine Ferrag

3 Department of Computer Science, Guelma University, Guelma 24000, Algeria; [email protected]

Associated Data

Not applicable.

People’s lives are becoming more and more dependent on information and computer technology. This is accomplished by the enormous benefits that the ICT offers for everyday life. Digital technology creates an avenue for communication and networking, which is characterized by the exchange of data, some of which are considered sensitive or private. There have been many reports recently of data being hijacked or leaked, often for malicious purposes. Maintaining security and privacy of information and systems has become a herculean task. It is therefore imperative to understand how an individual’s or organization’s personal data can be protected. Moreover, critical infrastructures are vital resources for the public safety, economic well-being and national security.

The major target of cyber attacks can be a country’s Critical National Infrastructures (CNIs) like ports, hospitals, water, gas or electricity producers, that use and rely on Industrial Control Systems but are affected by threats to any part of the supply chain. Cyber attacks are increasing at rate and pace, forming a major trend. The widespread use of computers and the Internet, coupled with the threat of activities of cyber criminals, has made it necessary to pay more attention to the detection or improve the technologies behind information security. The rapid reliance on cloud-based data storage and third-party technologies makes it difficult for industries to provide security for their data systems. Cyber attacks against critical systems are now common and recognized as one of the greatest risks facing today’s world [ 1 ].

This editorial presents the manuscripts accepted, after a careful peer-review process, for publication in the topic “Cyber Security and Critical Infrastructures” of the MDPI journals Applied Sciences, Electronics, Future Internet, Sensors and Smart Cities. The first volume includes sixteen articles: one editorial article, fifteen original research papers describing current challenges, innovative solutions, and real-world experiences involving critical infrastructures and one review paper focusing on the security and privacy challenges on Cloud, Edge, and Fog computing.

Many companies have recently decided to use cloud, edge and fog computing in order to achieve high storage capacity and efficient scalability. The work presented in [ 2 ] mainly focuses on how security in Cloud, Edge, and Fog Computing systems is achieved and how users’ privacy can be protected from attackers. The authors mention that there is a huge potential for vulnerabilities in security and privacy of such system. One good way of screening systems for possible vulnerabilities is by performing auditing of the systems based on security standards.

The recent EU Directive on security of network and information systems (the NIS Directive) has identified transport as one of the critical sectors that need to be secured in a European level. Smart cars is changing the transport landscape by introducing new capabilities along with new threats. Focusing on vehicle security, the authors in [ 3 ] examine the bit-level CAN bus reverse framework using a multiple linear regression model. The increasingly diverse features in today’s vehicles offer drivers and passengers a more relaxed driving experience and greater convenience along with new security threats. The reverse capability of the proposed system can help automotive security researchers to describe vehicle behavior using CAN messages when DBC files are not available.

Vulnerabilities in computer programs have always been a serious threat to software security, which may cause denial of service, information leakage and other attacks. The authors in [ 4 ] propose a new framework of fuzzy testing sample generation called CVDF DYNAMIC. which consists of three parts: Sample generation based on a genetic algorithm, sample generation based on a bi-LSTM neural network and sample reduction based on a heuristic genetic algorithm.

The transformation of cities into smart cities is on the rise. Through the use of innovative technologies such as the Internet of Things (IoT) and cyber–physical systems (CPS) that are connected through networks, smart cities offer better services to the citizens. The authors in propose a novel machine learning solution for threat detection in a smart city [ 5 ].The proposed hybrid Deep learning model that consists of QRNN and CNN improves cyber threat analysis accuracy, loweres False Postitive rate, and provides real-time analysis. The authors evaluated the proposed model on two datasets that were simulated to represent a realistic IoT environment and proved its superiority.

The next article in this collection [ 6 ] proposes a novel framework for few-shot network intrusion detection. Based on the fact that DL methods have been widely successful as network-based IDSs but require sizeable volumes of datasets which are not always feasible, the authors focus on few-shot solutions. Their proposed method is suitable for detecting specific classes of attacks. This model could be very helpful for deploying novel IDSs for Industrial Control Systems, which are the core of Critical Infrastructures, where there is a general lack of datasets.

In [ 7 ] the authors propose a novel reversible data hiding (RDH) scheme that can be applied to either remote medical diagnosis or even military secret transmission. The authors utilize a trained multi-layer perception neural network in order to be able to predict pixel values and then combining those with prediction error expansion techniques (PEE) to achieve (RDH). The proposed method although efficient is very time consuming and the authors propose in the future to implement novel solution to improve this aspect.

Focusing on Industrial components that are the main parts of critical infrastructures the authors in [ 8 ] propose a model for vulnerability analysis through the their entire life-cycle. The model can Identify the root causes and nature of vulnerabilities for the industrial components. This information is useful extracting new requirements and test cases, support the prioritization of patching and track vulnerabilities during the whole life-cycle of industrial components. The proposed model is applicable to existing systems and can be a good source of information for defining patching, training and security needs.

Android mobile devices are becoming the targets of several attacks nowadays since they support many of the everyday digital needs of the users. Since many sensitive applications are offered in these smart devices, like e-banking, adversaries have launched a number of new attacks. IoT enhances the power of malicious entities or people to perform attacks on critical systems or services. A lot of connected devices additionally mean a bigger attack surface for attacks and greater risk. Hackers using infected devices can generate many frequent, organized and complex malicious attacks. The authors in [ 9 ] propose novel IDS for malware in android devices combining several machine learning techniques. The proposed classifiers achieved good accuracy outperforming existing state-of-the-art models.

Having identified a lack of studies related to security in microservices architecture and especially for for authentication and authorization to such systems, the authors in [ 10 ] perform an analysis about this open issue. Microservices can increase scalability, availability and reliability of the system but come with an increase in the attack surface and new threats in the communication between them. Since microservices can become an integral part of critical systems, a thorough research on the attacks and defence against them is crucial. The article concludes that several existing solutions can be applied to make the systems robust but also novel methods need to be proposed that are tailored to the new architectures.

In another article that deals with machine learning as a defence mechanism for smart systems, the authors in [ 11 ] focus on the correct feature selection. Feature selection is the process of correctly identifying those features that help the machine learning algorithm be robust against an adversary. The article proposes a smart feature selection process and a novel feature engineering process which are proven to be more precise in terms of manipulated data while maintaining good results on clean data. The proposed solutions can be easily adopted in real environments in order to deal with sophisticated attacks against critical infrastructures.

Information Security Awareness Training is used to raise awareness of the users against cyber attacks and help them build a responsible behavior. In [ 12 ] the authors try to answer the question whether game-based training and Context-Based Micro-Training (CBMT) can help users correctly identify phishing against legitimate emails. IN order to answer this question the authors conducted a simulated experiment with 41 participants and the results showed that both methods managed to improve user behavior in relation to phishing emails. The paper concludes that training is a strong tool against cyber attacks but must be combined with other security solutions.

A vital challenge faced nowadays by federal and business decision-makers for choosing cost-efficient mitigations to scale back risks from supply chain attacks, particularly those from adversarial attacks that are complex, hard to detect and can lead to severe consequences. Focusing on adversarial attacks and how these can alter the performance of AI based detection systems, the authors in [ 13 ] propose a novel robust solution. Their proposed model was evaluated in both Enterprise and Internet of Things (IoT) networks and is proven to be efficient against adversarial classification attacks and adversarial training attacks.

There are many reasons why it’s vital to know what users can perceive as believable. It is crucial for service suppliers to grasp their vulnerabilities so as to assess their exposure to risks and also the associated problems. moreover, recognizing what the vulnerabilities are interprets into knowing from wherever the attacks are likely to come which leads for appropriate technical security measures to be deployed to protect against attacks. In [ 14 ] the authors present a solution that combines deep neural network and frequency domain pre-processing in order to detect images with embedded spam in social networks. The proposed method is proven to be superior against state-of-the-art detection models in terms of detection accuracy and efficiency. One of the major contributions of the authors is the creation of a novel dataset that contains images with embedded spam, which will be expanded in the near future.

Finding the correct sources that include vital information about securing critical systems is very important. Unfortunately, the lack of a fully functioning semantic web or text-based solutions to formalize security data sources limits the exploitation of existing cyber intelligence data sources. In [ 15 ] the authors aim to empower ontology-based cyber intelligence solutions by presenting a security ontology framework for storing data in an ontology from various textual data sources, supporting knowledge traceability and evaluating relationships between different security documents.

Ransomware has become one of the major threats against critical systems the latest years. The recent report from ENISA has ranked ransomware attacks first in terms of severity and frequency. Current solutions against ransomware do not cover all possible risks of data loss. In this article [ 16 ], the authors try to address this aspect and provide an effective solution that ensures efficient recovery of XML documents after ransomware attacks.

Funding Statement

This research received no external funding.

Author Contributions

All the authors contributed equally to this editorial. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Informed consent statement, data availability statement, conflicts of interest.

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

U.S. flag

An official website of the United States government

Here’s how you know

Official websites use .gov A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS A lock ( Lock A locked padlock ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

https://www.nist.gov/publications/post-quantum-cryptography-and-quantum-future-cybersecurity

Post-Quantum Cryptography, and the Quantum Future of Cybersecurity

Download paper, additional citation formats.

  • Google Scholar
  • Share full article

research paper about cyber security

Did One Guy Just Stop a Huge Cyberattack?

A Microsoft engineer noticed something was off on a piece of software he worked on. He soon discovered someone was probably trying to gain access to computers all over the world.

Credit... Jon Han

Supported by

Kevin Roose

By Kevin Roose

Reporting from San Francisco

  • April 3, 2024

The internet, as anyone who works deep in its trenches will tell you, is not a smooth, well-oiled machine.

It’s a messy patchwork that has been assembled over decades, and is held together with the digital equivalent of Scotch tape and bubble gum. Much of it relies on open-source software that is thanklessly maintained by a small army of volunteer programmers who fix the bugs, patch the holes and ensure the whole rickety contraption, which is responsible for trillions of dollars in global G.D.P., keeps chugging along.

Last week, one of those programmers may have saved the internet from huge trouble.

His name is Andres Freund. He’s a 38-year-old software engineer who lives in San Francisco and works at Microsoft. His job involves developing a piece of open-source database software known as PostgreSQL, whose details would probably bore you to tears if I could explain them correctly, which I can’t.

Recently, while doing some routine maintenance, Mr. Freund inadvertently found a backdoor hidden in a piece of software that is part of the Linux operating system. The backdoor was a possible prelude to a major cyberattack that experts say could have caused enormous damage, if it had succeeded.

Now, in a twist fit for Hollywood, tech leaders and cybersecurity researchers are hailing Mr. Freund as a hero. Satya Nadella, the chief executive of Microsoft, praised his “curiosity and craftsmanship.” An admirer called him “the silverback gorilla of nerds.” Engineers have been circulating an old, famous-among-programmers web comic about how all modern digital infrastructure rests on a project maintained by some random guy in Nebraska . (In their telling, Mr. Freund is the random guy from Nebraska.)

In an interview this week, Mr. Freund — who is actually a soft-spoken, German-born coder who declined to have his photo taken for this story — said that becoming an internet folk hero had been disorienting.

“I find it very odd,” he said. “I’m a fairly private person who just sits in front of the computer and hacks on code.”

The saga began earlier this year, when Mr. Freund was flying back from a visit to his parents in Germany. While reviewing a log of automated tests, he noticed a few error messages he didn’t recognize. He was jet-lagged, and the messages didn’t seem urgent, so he filed them away in his memory.

But a few weeks later, while running some more tests at home, he noticed that an application called SSH, which is used to log into computers remotely, was using more processing power than normal. He traced the issue to a set of data compression tools called xz Utils, and wondered if it was related to the earlier errors he’d seen.

(Don’t worry if these names are Greek to you. All you really need to know is that these are all small pieces of the Linux operating system, which is probably the most important piece of open-source software in the world. The vast majority of the world’s servers — including those used by banks, hospitals, governments and Fortune 500 companies — run on Linux, which makes its security a matter of global importance.)

Like other popular open-source software, Linux gets updated all the time, and most bugs are the result of innocent mistakes. But when Mr. Freund looked closely at the source code for xz Utils, he saw clues that it had been intentionally tampered with.

In particular, he found that someone had planted malicious code in the latest versions of xz Utils. The code, known as a backdoor, would allow its creator to hijack a user’s SSH connection and secretly run their own code on that user’s machine.

In the cybersecurity world, a database engineer inadvertently finding a backdoor in a core Linux feature is a little like a bakery worker who smells a freshly baked loaf of bread, senses something is off and correctly deduces that someone has tampered with the entire global yeast supply. It’s the kind of intuition that requires years of experience and obsessive attention to detail, plus a healthy dose of luck.

At first, Mr. Freund doubted his own findings. Had he really discovered a backdoor in one of the world’s most heavily scrutinized open-source programs?

“It felt surreal,” he said. “There were moments where I was like, I must have just had a bad night of sleep and had some fever dreams.”

But his digging kept turning up new evidence, and last week, Mr. Freund sent his findings to a group of open-source software developers. The news set the tech world on fire. Within hours, a fix was developed and some researchers were crediting him with preventing a potentially historic cyberattack.

“This could have been the most widespread and effective backdoor ever planted in any software product,” said Alex Stamos, the chief trust officer at SentinelOne, a cybersecurity research firm.

If it had gone undetected, Mr. Stamos said, the backdoor would have “given its creators a master key to any of the hundreds of millions of computers around the world that run SSH.” That key could have allowed them to steal private information, plant crippling malware, or cause major disruptions to infrastructure — all without being caught.

(The New York Times has sued Microsoft and its partner OpenAI on claims of copyright infringement involving artificial intelligence systems that generate text.)

Nobody knows who planted the backdoor. But the plot appears to have been so elaborate that some researchers believe only a nation with formidable hacking chops, such as Russia or China, could have attempted it.

According to some researchers who have gone back and looked at the evidence, the attacker appears to have used a pseudonym, “Jia Tan,” to suggest changes to xz Utils as far back as 2022. (Many open-source software projects are governed via hierarchy; developers suggest changes to a program’s code, then more experienced developers known as “maintainers” have to review and approve the changes.)

The attacker, using the Jia Tan name, appears to have spent several years slowly gaining the trust of other xz Utils developers and getting more control over the project, eventually becoming a maintainer, and finally inserting the code with the hidden backdoor earlier this year. (The new, compromised version of the code had been released, but was not yet in widespread use.)

Mr. Freund declined to guess who might have been behind the attack. But he said that whoever it was had been sophisticated enough to try to cover their tracks, including by adding code that made the backdoor harder to spot.

“It was very mysterious,” he said. “They clearly spent a lot of effort trying to hide what they were doing.”

Since his findings became public, Mr. Freund said, he had been helping the teams who are trying to reverse-engineer the attack and identify the culprit. But he’s been too busy to rest on his laurels. The next version of PostgreSQL, the database software he works on, is coming out later this year, and he’s trying to get some last-minute changes in before the deadline.

“I don’t really have time to go and have a celebratory drink,” he said.

Kevin Roose is a Times technology columnist and a host of the podcast " Hard Fork ." More about Kevin Roose

Advertisement

code on a black computer screen

SEI and OpenAI Recommend Ways To Evaluate Large Language Models for Cybersecurity Applications

  • Share on Facebook (opens in new window)
  • Share on X (opens in new window)
  • Share on LinkedIn (opens in new window)
  • Print this page
  • Share by email

Carnegie Mellon University’s  Software Engineering Institute (opens in new window) (SEI) and OpenAI published a  white paper (opens in new window) that found that large language models (LLMs) could be an asset for cybersecurity professionals, but should be evaluated using real and complex scenarios to better understand the technology’s capabilities and risks. LLMs underlie today’s generative artificial intelligence (AI) platforms, such as Google’s Gemini, Microsoft’s Bing AI, and ChatGPT, released in November 2022 by OpenAI. These platforms take prompts from human users, use deep learning on large datasets, and produce plausible text, images or code. Applications for LLMs have exploded in the past year in industries including creative arts, medicine, law and  software engineering and acquisition (opens in new window) .

While in its  early days (opens in new window) , the prospect of using LLMs for cybersecurity is increasingly tempting. The burgeoning technology seems a fitting force multiplier for the data-heavy, deeply technical and often laborious field of cybersecurity. Add the pressure to stay ahead of LLM-wielding cyber attackers, including  state-affiliated actors (opens in new window) , and the lure grows even brighter.

However, it is hard to know how capable LLMs might be at cyber operations or how risky if used by defenders. The conversation around evaluating LLMs’ capability in any professional field seems to focus on their theoretical knowledge, such as answers to standard exam questions. One  preliminary study (opens in new window) found that GPT-3.5 Turbo aced a common penetration testing exam.

LLMs may be excellent at factual recall, but it is not sufficient, according to the SEI and OpenAI paper "Considerations for Evaluating Large Language Models for Cybersecurity Tasks."

 “An LLM might know a lot,” said  Sam Perl (opens in new window) , a senior cybersecurity analyst in the SEI’s  CERT Division (opens in new window) and coauthor of the paper, “but does it know how to deploy it correctly in the right order and how to make tradeoffs?”

Focusing on theoretical knowledge ignores the complexity and nuance of real-world cybersecurity tasks. As a result, cybersecurity professionals cannot know how or when to incorporate LLMs into their operations.

The solution, according to the paper, is to evaluate LLMs on the same branches of knowledge on which a human cybersecurity operator would be tested: theoretical knowledge, or foundational, textbook information; practical knowledge, such as solving self-contained cybersecurity problems; and applied knowledge, or achievement of higher-level objectives in open-ended situations.

Testing a human this way is hard enough. Testing an artificial neural network presents a unique set of hurdles. Even defining the tasks is hard in a field as diverse as cybersecurity. “Attacking something is a lot different than doing forensics or evaluating a log file,” said  Jeff Gennari (opens in new window) , team lead and senior engineer in the SEI CERT division and coauthor of the paper. “Each task must be thought about carefully, and the appropriate evaluation should be designed.”

Once the tasks are defined, an evaluation must ask thousands or even millions of questions. LLMs need that many to mimic the human mind’s gift for semantic accuracy. Automation will be needed to generate the required volume of questions. That is already doable for theoretical knowledge. But the tooling needed to generate enough practical or applied scenarios — and to let an LLM interact with an executable system — does not exist. Finally, computing the metrics on all those responses to practical and applied tests will take new rubrics of correctness.

While the technology catches up, the white paper provides a framework for designing realistic cybersecurity evaluations of LLMs that starts with four overarching recommendations:

  • Define the real-world task for the evaluation to capture.
  • Represent tasks appropriately.
  • Make the evaluation robust.
  • Frame results appropriately.

Shing-hon Lau (opens in new window) , a senior AI security researcher in the SEI’s CERT division and one of the paper’s coauthors, notes that this guidance encourages a shift away from focusing exclusively on the LLMs, for cybersecurity or any field. “We need to stop thinking about evaluating the model itself and move towards evaluating the larger system that contains the model or how using a model enhances human capability.”

The SEI authors believe LLMs will eventually enhance human cybersecurity operators in a supporting role, rather than work autonomously. Even so, LLMs will still need to be evaluated, said Gennari. “Cyber professionals will need to figure out how to best use an LLM to support a task, then assess the risk of that use. Right now it's hard to answer either of those questions if your evidence is an LLM’s ability to answer fact-based questions.”

The SEI has long applied engineering rigor to  cybersecurity (opens in new window) and  AI (opens in new window) . Combining the two disciplines in the study of LLM evaluations is one way the SEI is leading AI cybersecurity research. Last year, the SEI also launched the  AI Security Incident Response Team (AISIRT) (opens in new window) to provide the United States with a capability to address the risks from the rapid growth and widespread use of AI.

OpenAI approached the SEI about LLM cybersecurity evaluations last year seeking to better understand the safety of the models underlying its generative AI platforms. OpenAI coauthors of the paper Joel Parish and Girish Sastry contributed first-hand knowledge of LLM cybersecurity and relevant policies. Ultimately, all the authors hope the paper starts a movement toward practices that can inform those deciding when to fold LLMs into cyber operations.

“Policymakers need to understand how to best use this technology on mission,” said Gennari. “If they have accurate evaluations of capabilities and risks, then they'll be better positioned to actually use them effectively.”

Download the paper " Considerations for Evaluating Large Language Models for Cybersecurity Tasks (opens in new window) " for all 14 recommendations and more information. Read Gennari, Lau, and Perl’s SEI Blog post on the paper, “ OpenAI Collaboration Yields 14 Recommendations for Evaluating LLMs for Cybersecurity (opens in new window) .” Learn more about the  SEI’s research on LLMs (opens in new window) in the SEI Digital Library.

— Related Content —

Paul Nielsen

SEI Director and CEO Paul Nielsen Receives AIAA Award

Researchers' adversarial prompts can elicit arbitrary harmful behaviors from state-of-the-art commercial LLMs with high probability, demonstrating potentials for misuse.

Researchers Discover New Vulnerability in Large Language Models

Gov. Shapiro visits Carnegie Mellon.

CMU Advances Trustworthy AI as Part of AI Safety Institute Consortium

  • The Piper: Campus & Community News (opens in new window)
  • Official Events Calendar (opens in new window)

University of South Florida

Main Navigation

Two study leaders speak to participants in a computer lab

USF research reveals language barriers limit effectiveness of cybersecurity resources

  • April 1, 2024

Research and Innovation

By: John Dudley , University Communications & Marketing

The idea for Fawn Ngo’s latest research came from a television interview.

Ngo, a University of South Florida criminologist, had spoken with a Vietnamese language network in California about her interest in better understanding how people become victims of cybercrime.

Afterward, she began receiving phone calls from viewers recounting their own experiences of victimization.

Fawn Ngo

Fawn Ngo, associate professor in the USF College of Behavioral and Community Sciences

“Some of the stories were unfortunate and heartbreaking,” said Ngo, an associate professor in the USF College of Behavioral and Community Sciences. “They made me wonder about the availability and accessibility of cybersecurity information and resources for non-English speakers. Upon investigating further, I discovered that such information and resources were either limited or nonexistent.”

The result is what’s believed to be the first study to explore the links among demographic characteristics, cyber hygiene practices and cyber victimization using a sample of limited English proficiency internet users.

Ngo is the lead author of an article, “Cyber Hygiene and Cyber Victimization Among Limited English Proficiency (LEP) Internet Users: A Mixed-Method Study,” which just published in the journal Victims & Offenders. The article’s co-authors are Katherine Holman, a USF graduate student and former Georgia state prosecutor, and Anurag Agarwal, professor of information systems, analytics and supply chain at Florida Gulf Coast University. 

Their research, which focused on Spanish and Vietnamese speakers, led to two closely connected main takeaways:

  • LEP internet users share the same concern about cyber threats and the same desire for online safety as any other individual. However, they are constrained by a lack of culturally and linguistically appropriate resources, which also hampers accurate collection of cyber victimization data among vulnerable populations.
  • Online guidance that provides the most effective educational tools and reporting forms is only available in English. The most notable example is the website for the Internet Crime Complaint Center, which serves as the FBI’s primary apparatus for combatting cybercrime.

As a result, the study showed that many well-intentioned LEP users still engage in such risky online behaviors as using unsecured networks and sharing passwords. For example, only 29 percent of the study’s focus group participants avoided using public Wi-Fi over the previous 12 months, and only 17 percent said they had antivirus software installed on their digital devices.

Previous research cited in Ngo’s paper has shown that underserved populations exhibit poorer cybersecurity knowledge and outcomes, most commonly in the form of computer viruses and hacked accounts, including social media accounts. Often, it’s because they lack awareness and understanding and isn’t a result of disinterest, Ngo said.

“According to cybersecurity experts, humans are the weakest link in the chain of cybersecurity,” Ngo said. “If we want to secure our digital borders, we must ensure that every member in society, regardless of their language skills, is well-informed about the risks inherent in the cyber world.”

The study’s findings point to a need for providing cyber hygiene information and resources in multiple formats, including visual aids and audio guides, to accommodate diverse literacy levels within LEP communities, Ngo said. She added that further research is needed to address the current security gap and ensure equitable access to cybersecurity resources for all internet users.

In the meantime, Ngo is working to create a website that will include cybersecurity information and resources in different languages and a link to report victimization.

“It’s my hope that cybersecurity information and resources will become as readily accessible in other languages as other vital information, such as information related to health and safety,” Ngo said. “I also want LEP victims to be included in national data and statistics on cybercrime and their experiences accurately represented and addressed in cybersecurity initiatives.” 

Return to article listing

Cybersecurity , John Dudley , USF Sarasota-Manatee

News Archive

Learn more about USF's journey to Preeminence by viewing Newsroom articles from past years.

USF in the News

Business observer: usf embraces opportunities to connect with more businesses.

April 10, 2024

NPR: A professor worried no one would read an algae study. So she had it put to music

April 4, 2024

Tampa Bay Times: Tampa doctors fit device that restores arm movement to stroke victims

March 27, 2024

Associated Press: Debate emerges over whether modern protections could have saved Baltimore bridge

More USF in the News

You are using an outdated browser. Upgrade your browser today or install Google Chrome Frame to better experience this site.

IMF Live

  • IMF at a Glance
  • Surveillance
  • Capacity Development
  • IMF Factsheets List
  • IMF Members
  • IMF Timeline
  • Senior Officials
  • Job Opportunities
  • Archives of the IMF
  • Climate Change
  • Fiscal Policies
  • Income Inequality

Flagship Publications

Other publications.

World Economic Outlook

Global Financial Stability Report

Fiscal Monitor

  • External Sector Report
  • Staff Discussion Notes
  • Working Papers
  • IMF Research Perspectives
  • Economic Review
  • Global Housing Watch
  • Commodity Prices
  • Commodities Data Portal
  • IMF Researchers
  • Annual Research Conference
  • Other IMF Events

IMF reports and publications by country

Regional offices.

  • IMF Resident Representative Offices
  • IMF Regional Reports
  • IMF and Europe
  • IMF Members' Quotas and Voting Power, and Board of Governors
  • IMF Regional Office for Asia and the Pacific
  • IMF Capacity Development Office in Thailand (CDOT)
  • IMF Regional Office in Central America, Panama, and the Dominican Republic
  • Eastern Caribbean Currency Union (ECCU)
  • IMF Europe Office in Paris and Brussels
  • IMF Office in the Pacific Islands
  • How We Work
  • IMF Training
  • Digital Training Catalog
  • Online Learning
  • Our Partners
  • Country Stories
  • Technical Assistance Reports
  • High-Level Summary Technical Assistance Reports
  • Strategy and Policies

For Journalists

  • Country Focus
  • Chart of the Week
  • Communiqués
  • Mission Concluding Statements
  • Press Releases
  • Statements at Donor Meetings
  • Transcripts
  • Views & Commentaries
  • Article IV Consultations
  • Financial Sector Assessment Program (FSAP)
  • Seminars, Conferences, & Other Events
  • E-mail Notification

Press Center

The IMF Press Center is a password-protected site for working journalists.

  • Login or Register
  • Information of interest
  • About the IMF
  • Conferences
  • Press briefings
  • Special Features
  • Middle East and Central Asia
  • Economic Outlook
  • Annual and spring meetings
  • Most Recent
  • Most Popular
  • IMF Finances
  • Additional Data Sources
  • World Economic Outlook Databases
  • Climate Change Indicators Dashboard
  • IMF eLibrary-Data
  • International Financial Statistics
  • G20 Data Gaps Initiative
  • Public Sector Debt Statistics Online Centralized Database
  • Currency Composition of Official Foreign Exchange Reserves
  • Financial Access Survey
  • Government Finance Statistics
  • Publications Advanced Search
  • IMF eLibrary
  • IMF Bookstore
  • Publications Newsletter
  • Essential Reading Guides
  • Regional Economic Reports
  • Country Reports
  • Departmental Papers
  • Policy Papers
  • Selected Issues Papers
  • All Staff Notes Series
  • Analytical Notes
  • Fintech Notes
  • How-To Notes
  • Staff Climate Notes

Global Financial Stability Report, April 2024

research paper about cyber security

Chapter 2: The Rise and Risks of Private Credit

Chapter 2 assesses vulnerabilities and potential risks to financial stability in private credit, a rapidly growing asset class—traditionally focused on providing loans to mid-sized firms outside the realms of either commercial banks or public debt markets—that now rivals other major credit markets in size. The chapter identifies important vulnerabilities arising from relatively fragile borrowers, a growing share of semi-liquid investment vehicles, multiple layers of leverage, stale and potentially subjective valuations, and unclear connections between participants. If private credit remains opaque and continues to grow exponentially under limited prudential oversight, these vulnerabilities could become systemic.

Given the potential risks posed by this fast-growing and interconnected asset class, authorities could consider a more proactive supervisory and regulatory approach to private credit. It is key to close data gaps and enhance reporting requirements to comprehensively assess risks. Authorities should closely monitor and address liquidity and conduct risks in funds—especially retail—that may be faced with higher redemption risks.

research paper about cyber security

Chapter 3: Cyber Risk: A Growing Concern for Macrofinancial Stability 

Against a backdrop of growing digitalization, evolving technologies, and rising geopolitical tensions, cyber risks are on the rise. Chapter 3 shows that while cyber incidents have thus far not been systemic, the risk of extreme losses from such incidents has increased. The financial sector is highly exposed, and a severe cyber incident could pose macro-financial stability risks through a loss of confidence, disruption of critical services, and spillovers to other institutions through technological and financial linkages. While better cyber legislation and cyber-related governance arrangements at firms can help mitigate these risks, cyber policy frameworks remain generally inadequate, especially in emerging market and developing economies. Thus, the cyber resilience of the financial sector needs to be strengthened by developing adequate national cybersecurity strategies, appropriate regulatory and supervisory frameworks, a capable cybersecurity workforce, and domestic and international information-sharing arrangements. To allow for more effective monitoring of cyber risks, reporting of cyber incidents should be strengthened. Supervisors should hold board members responsible for managing the cybersecurity of financial firms and promoting a conducive risk culture, cyber hygiene, and cyber training and awareness. To limit potential disruptions, financial firms should develop and test response and recovery procedures. National authorities should develop effective response protocols and crisis management frameworks.

COMING SOON 

<img src=

The report will be available for download on this page on April 16.

Press Briefing:  Global Financial Stability Report, April 2024

  • Tobias Adrian , Director, Monetary and Capital Markets Department, IMF
  • Fabio Natalucci , Deputy Director, Monetary and Capital Markets Department, IMF
  • Jason Wu , Assistant Director, Monetary and Capital Markets Department, IMF
  • Moderator:  Meera Louis , Communications Officer, IMF

Publications

World Economic Outlook

  • Latest Issue

Global Financial Stability Report

Finance & Development

  • ECONOMICS How should it change?

research paper about cyber security

September 2023

Annual Report

  • COMMITTED TO COLLABORATION

Regional Economic Outlook Reports, All Regions

Regional Economic Outlooks

  • Latest Issues

The Federal Register

The daily journal of the united states government, request access.

Due to aggressive automated scraping of FederalRegister.gov and eCFR.gov, programmatic access to these sites is limited to access to our extensive developer APIs.

If you are human user receiving this message, we can add your IP address to a set of IPs that can access FederalRegister.gov & eCFR.gov; complete the CAPTCHA (bot test) below and click "Request Access". This process will be necessary for each IP address you wish to access the site from, requests are valid for approximately one quarter (three months) after which the process may need to be repeated.

An official website of the United States government.

If you want to request a wider IP range, first request access for your current IP, and then use the "Site Feedback" button found in the lower left-hand side to make the request.

IMAGES

  1. Example Of Cyber Security Research Paper

    research paper about cyber security

  2. (PDF) AN EMPIRICAL STUDY ON CYBER SECURITY THREATS AND ATTACKS

    research paper about cyber security

  3. (PDF) Cybersecurity and Ethics

    research paper about cyber security

  4. Cyber Security Issues Essay Example

    research paper about cyber security

  5. (PDF) The Role of Artificial Intelligence in Cyber Security

    research paper about cyber security

  6. (PDF) The role of cyber-security in information technology education

    research paper about cyber security

VIDEO

  1. Cyber Security Question Paper 2022 |SPPU Question Paper 2022 |March 2022 Question Paper |TYBBA(CA)

  2. CYBER SECURITY QUESTION PAPER || BCOM 6TH SEM || JUNE/JULY 2023 ||

  3. cyber security NEP question paper

  4. Previous year question paper Cyber Defence Bsc 2nd year VBU FYUGP NEP VFYUG2

  5. Cybersecurity Responsibility

  6. cyber security question paper aktu

COMMENTS

  1. Journal of Cybersecurity

    About the journal. Journal of Cybersecurity publishes accessible articles describing original research in the inherently interdisciplinary world of computer, systems, and information security …. Journal of Cybersecurity is soliciting papers for a special collection on the philosophy of information security. This collection will explore ...

  2. A comprehensive review study of cyber-attacks and cyber security

    In addition, five scenarios can be considered for cyber warfare: (1) Government-sponsored cyber espionage to gather information to plan future cyber-attacks, (2) a cyber-attack aimed at laying the groundwork for any unrest and popular uprising, (3) Cyber-attack aimed at disabling equipment and facilitating physical aggression, (4) Cyber-attack as a complement to physical aggression, and (5 ...

  3. Artificial intelligence for cybersecurity: Literature review and future

    The article is a full research paper (i.e., not a presentation or supplement to a poster). ... Cyber supply chain security. Cyber supply chain security requires a secure integrated network between the incoming and outgoing chain's subsystems. Therefore, it is essential to understand and predict threats using both internal and threat ...

  4. Cyber security: State of the art, challenges and future directions

    This article is organized as Section 1) Introduction to Cyber security, Section 2) Application area of Cyber-security, Section 3) State-of-the-art in Cyber Security, Section 4) Related Work, Section 5) Challenges of Cyber Security, Section 6) Opportunities and future research direction of cyber security, and Section 7) conclusion. 2.

  5. (PDF) A Systematic Literature Review on the Cyber Security

    This paper offers a comprehensive overview of current research into cyber security. We commence, section 2 provides the cyber security related work, in section 3, by introducing about cyber security.

  6. Cyber risk and cybersecurity: a systematic review of data ...

    Cybercrime is estimated to have cost the global economy just under USD 1 trillion in 2020, indicating an increase of more than 50% since 2018. With the average cyber insurance claim rising from USD 145,000 in 2019 to USD 359,000 in 2020, there is a growing necessity for better cyber information sources, standardised databases, mandatory reporting and public awareness. This research analyses ...

  7. Cyber Security Threats and Vulnerabilities: A Systematic ...

    There has been a tremendous increase in research in the area of cyber security to support cyber applications and to avoid key security threats faced by these applications. The goal of this study is to identify and analyze the common cyber security vulnerabilities. To achieve this goal, a systematic mapping study was conducted, and in total, 78 primary studies were identified and analyzed ...

  8. AI-Driven Cybersecurity: An Overview, Security Intelligence ...

    Artificial intelligence (AI) is one of the key technologies of the Fourth Industrial Revolution (or Industry 4.0), which can be used for the protection of Internet-connected systems from cyber threats, attacks, damage, or unauthorized access. To intelligently solve today's various cybersecurity issues, popular AI techniques involving machine learning and deep learning methods, the concept of ...

  9. Full article: Cybersecurity Deep: Approaches, Attacks Dataset, and

    The remaining paper is structured as follows. Section 2 outlines the relevant works, ... The research should highlight building a real-time setup to validate deep learning approaches so they can counter new types of cybersecurity attacks. Researchers should primarily focus on issues where a criminal uses the DL Technique method to break into ...

  10. Cybersecurity: Past, Present and Future

    The term Cyber Security starts gaining popularity from the year 2017 and achieves peak popularity in the years 2019 and 2020. Whereas, the other two terms show a steady ... To keep pace with the advancements in the new digital technologies like IoTs and cloud, there is a need to expand research and develop novel cybersecurity methods and tools to

  11. A Systematic Literature Review on Cyber Threat Intelligence for ...

    Cybersecurity is a significant concern for businesses worldwide, as cybercriminals target business data and system resources. Cyber threat intelligence (CTI) enhances organizational cybersecurity resilience by obtaining, processing, evaluating, and disseminating information about potential risks and opportunities inside the cyber domain. This research investigates how companies can employ CTI ...

  12. Artificial intelligence in cyber security: research advances

    This paper also identifies a number of limitations and challenges, and based on the findings, a conceptual human-in-the-loop intelligence cyber security model is presented. References Adekunle YA, Okolie SO, Adebayo AO, Ebiesuwa S, Ehiwe DD (2019) Holistic exploration of gaps vis-à-vis artificial intelligence in automated teller machine and ...

  13. Cyber security: Current threats, challenges, and prevention methods

    Cyber Security is a blend of innovative headways, process cycles and practices. The goal of cyber security is to ensure protection of applications, networks, PCs, and critical information from attack. ... This paper reviews research work done in cybersecurity including the types of cybersecurity. The paper also discusses threats and prevention ...

  14. (PDF) Cybersecurity: trends, issues, and challenges

    The cyber security model in organizations needs a security analyst to protect the systems. The security analyst faces many challenges related to cyber security attack detection, prediction, and ...

  15. (PDF) Artificial Intelligence in Cyber Security

    1. Artificial Intelligence in Cyber Security. Rammanohar Das and Raghav Sandhane*. Symbiosis Centre for Information Technology, Symbiosis International (Deemed. University), Pune, Maharashtra ...

  16. Full article: Cyber Security and Emerging Technologies

    Melissa K. Griffith. Cyber Persistence Theory: Redefining National Security in Cyberspace. Michael P. Fischerkeller, Emily O. Goldman and Richard J. Harknett. Oxford and New York: Oxford University Press, 2022. £19.99/$29.95. 272 pp. Offensive Cyber Operations: Understanding Intangible Warfare.

  17. A Study of Cyber Security Issues and Challenges

    The paper first explains what cyber space and cyber security is. Then the costs and impact of cyber security are discussed. The causes of security vulnerabilities in an organization and the challenging factors of protecting an organization from cybercrimes are discussed in brief. Then a few common cyber-attacks and the ways to protect from them ...

  18. Cybersecurity data science: an overview from machine learning

    In a computing context, cybersecurity is undergoing massive shifts in technology and its operations in recent days, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building corresponding data-driven model, is the key to make a security system automated and intelligent. To understand and analyze the actual phenomena with data ...

  19. A STUDY OF CYBER SECURITY AND ITS CHALLENGES IN THE SOCIETY

    This paper mainly focuses on challenges faced by cyber security on the latest technologies .It also focuses on latest about the cyber security techniques, ethics and the trends changing the face of cyber security. Keywords: cyber security, cyber crime, cyber ethics, social media, cloud computing, android apps. 1. INTRODUCTION.

  20. Artificial intelligence in cyber security: research advances ...

    In recent times, there have been attempts to leverage artificial intelligence (AI) techniques in a broad range of cyber security applications. Therefore, this paper surveys the existing literature (comprising 54 papers mainly published between 2016 and 2020) on the applications of AI in user access authentication, network situation awareness, dangerous behavior monitoring, and abnormal traffic ...

  21. Cybersecurity of Critical Infrastructures: Challenges and Solutions

    The paper concludes that training is a strong tool against cyber attacks but must be combined with other security solutions. A vital challenge faced nowadays by federal and business decision-makers for choosing cost-efficient mitigations to scale back risks from supply chain attacks, particularly those from adversarial attacks that are complex ...

  22. Post-Quantum Cryptography, and the Quantum Future of Cybersecurity

    Abstract. We review the current status of efforts to develop and deploy post-quantum cryptography on the Internet. Then we suggest specific ways in which quantum technologies might be used to enhance cybersecurity in the near future and beyond. We focus on two goals: protecting the secret keys that are used in classical cryptography, and ...

  23. Did One Guy Just Stop a Huge Cyberattack?

    The backdoor was a possible prelude to a major cyberattack that experts say could have caused enormous damage, if it had succeeded. Now, in a twist fit for Hollywood, tech leaders and ...

  24. SEI and OpenAI Recommend Ways To Evaluate Large Language Models for

    Carnegie Mellon University's Software Engineering Institute (opens in new window) (SEI) and OpenAI published a white paper (opens in new window) that found that large language models (LLMs) could be an asset for cybersecurity professionals, but should be evaluated using real and complex scenarios to better understand the technology's capabilities and risks.

  25. USF research reveals language barriers limit effectiveness of

    She added that further research is needed to address the current security gap and ensure equitable access to cybersecurity resources for all internet users. In the meantime, Ngo is working to create a website that will include cybersecurity information and resources in different languages and a link to report victimization.

  26. (PDF) Research Paper on Cyber Security

    I.C.S. College, Khed, Ratnagri. Abstract: In the current world that is run by technology and network connections, it is crucial to know what cyber security is. and to be able to use it effectively ...

  27. Global Financial Stability Report

    Chapter 3: Cyber Risk: A Growing Concern for Macrofinancial Stability Against a backdrop of growing digitalization, evolving technologies, and rising geopolitical tensions, cyber risks are on the rise. Chapter 3 shows that while cyber incidents have thus far not been systemic, the risk of extreme losses from such incidents has increased.

  28. Apple teases transformative new iPhone experience

    The Cupertino, Calif., company recently published a study that explores its efforts with a multi-modal LLM called Feret-UI that might be a step along the path to an AI-enabled Siri. Unlike the LLM ...

  29. Federal Register :: Cyber Incident Reporting for Critical

    For example, 6 U.S.C. 681a(a)(7) requires CISA to immediately review Covered Cyber Incident Reports or voluntary reports submitted to CISA pursuant to 6 U.S.C. 681c to the extent they involve ongoing cyber threats or security vulnerabilities for cyber threat indicators that can be anonymized and disseminated, with defensive measures, to ...