Journal of Cloud Computing

Advances, Systems and Applications


Special Issues - Guidelines for Guest Editors

For more information, Guest Editors should consult our Guidelines.

Special Issues - Call for Papers

We welcome submissions for the upcoming special issues of the Journal of Cloud Computing

Advanced Blockchain and Federated Learning Techniques Towards Secure Cloud Computing
Guest Editors: Yuan Liu, Jie Zhang, Athirai A. Irissappane, Zhu Sun
Submission deadline: 30 April 2024

Mobile Edge Computing Meets AI
Guest Editors: Lianyong Qi, Maqbool Khan, Qiang He, Shui Yu, Wajid Rafique
Submission deadline: 3 May 2024

Blockchain-enabled Decentralized Cloud/Edge Computing
Guest Editors: Qingqi Pei, Kaoru Ota, Martin Gilje Jaatun, Jie Feng, Shen Su
Submission deadline: 31 March 2023

  • Most recent

Deep Reinforcement Learning techniques for dynamic task offloading in the 5G edge-cloud continuum

Authors: Gorka Nieto, Idoia de la Iglesia, Unai Lopez-Novoa and Cristina Perfecto

Enhancing patient healthcare with mobile edge computing and 5G: challenges and solutions for secure online health tools

Authors: Yazeed Yasin Ghadi, Syed Faisal Abbas Shah, Tehseen Mazhar, Tariq Shahzad, Khmaies Ouahada and Habib Hamam

Online dynamic multi-user computation offloading and resource allocation for HAP-assisted MEC: an energy efficient approach

Authors: Sihan Chen and Wanchun Jiang

Enhancing lung cancer diagnosis with data fusion and mobile edge computing using DenseNet and CNN

Authors: Chengping Zhang, Muhammad Aamir, Yurong Guan, Muna Al-Razgan, Emad Mahrous Awwad, Rizwan Ullah, Uzair Aslam Bhatti and Yazeed Yasin Ghadi

Cross-chain asset trading scheme for notaries based on edge cloud storage

Authors: Lang Chen, Yuling Chen, Chaoyue Tan, Yun Luo, Hui Dou and Yuxiang Yang

  • Most accessed

A quantitative analysis of current security concerns and solutions for cloud computing

Authors: Nelson Gonzalez, Charles Miers, Fernando Redígolo, Marcos Simplício, Tereza Carvalho, Mats Näslund and Makan Pourzandi

Critical analysis of vendor lock-in and its impact on cloud computing migration: a business perspective

Authors: Justice Opara-Martins, Reza Sahandi and Feng Tian

Future of industry 5.0 in society: human-centric solutions, challenges and prospective research areas

Authors: Amr Adel

Intrusion detection systems for IoT-based smart environments: a survey

Authors: Mohamed Faisal Elrawy, Ali Ismail Awad and Hesham F. A. Hamed

Load balancing in cloud computing – A hierarchical taxonomical classification

Authors: Shahbaz Afzal and G. Kavitha


Aims and scope

The Journal of Cloud Computing: Advances, Systems and Applications (JoCCASA) will publish research articles on all aspects of Cloud Computing. Principally, articles will address topics that are core to Cloud Computing, focusing on the Cloud applications, the Cloud systems, and the advances that will lead to the Clouds of the future. Comprehensive review and survey articles that offer up new insights, and lay the foundations for further exploratory and experimental work, are also relevant.

Published articles will impart advanced theoretical grounding and practical application of Clouds and related systems, as offered up by the numerous possible combinations of internet-based software, development stacks and database availability, and virtualized hardware for storing, processing, analysing and visualizing data. Where relevant, Clouds should be scrutinized alongside other paradigms such as Peer-to-Peer (P2P) computing, Cluster computing, Grid computing, and so on. Thorough examination of Clouds with respect to issues of management, governance, trust and privacy, and interoperability is also in scope. The Journal of Cloud Computing is indexed by the Science Citation Index Expanded (SCIE), into which the SCI has been merged.

Cloud Computing is now a topic of significant impact and, while it may represent an evolution in technology terms, it is revolutionising the ways in which both academia and industry are thinking and acting. The Journal of Cloud Computing: Advances, Systems and Applications (JoCCASA) has been launched to offer a high-quality journal geared entirely towards the research that will shape future generations of Clouds. The journal publishes research that addresses the entire Cloud stack, as well as research that relates Clouds to wider paradigms and topics.

Chunming Rong, Editor-in-Chief, University of Stavanger, Norway

  • Editorial Board

Annual Journal Metrics

2022 Citation Impact
  • 4.0 - 2-year Impact Factor
  • 4.4 - 5-year Impact Factor
  • 1.711 - SNIP (Source Normalized Impact per Paper)
  • 0.976 - SJR (SCImago Journal Rank)

2023 Speed
  • 10 days from submission to first editorial decision for all manuscripts (median)
  • 116 days from submission to acceptance (median)

2023 Usage
  • 733,672 downloads
  • 49 Altmetric mentions

  • ISSN: 2192-113X (electronic)

Benefit from our free funding service


We offer a free open access support service to make it easier for you to discover and apply for article-processing charge (APC) funding. 


  • Survey paper
  • Open access
  • Published: 11 February 2019

A survey on data storage and placement methodologies for Cloud-Big Data ecosystem

  • Somnath Mazumdar 1 (ORCID: orcid.org/0000-0002-1751-2569)
  • Daniel Seybold 2
  • Kyriakos Kritikos 3 (ORCID: orcid.org/0000-0001-9633-1610)
  • Yiannis Verginadis 4

Journal of Big Data, volume 6, Article number: 15 (2019)

22k Accesses · 62 Citations · 8 Altmetric

Abstract

Currently, the data to be explored and exploited by computing systems increases at an exponential rate. This massive amount of data, the so-called "Big Data", puts pressure on existing technologies to provide scalable, fast and efficient support. Recent applications and the current user support from multi-domain computing have assisted the migration from data-centric to knowledge-centric computing. However, it remains a challenge to optimally store and place or migrate such huge data sets across data centers (DCs). In particular, due to the frequent change of application and DC behaviour (i.e., resources or latencies), data access and usage patterns need to be analysed as well. The main objective is to find a better data storage location that improves the overall data placement cost as well as the application performance (such as throughput). In this survey paper, we provide a state-of-the-art overview of Cloud-centric Big Data placement together with data storage methodologies. It is an attempt to highlight the actual correlation between the two in terms of better supporting Big Data management. Our focus is on management aspects seen under the prism of non-functional properties. In the end, readers can appreciate the deep analysis of the respective technologies related to the management of Big Data and be guided towards their selection in the context of satisfying their non-functional application requirements. Furthermore, challenges are supplied, highlighting the current gaps in Big Data management and marking the way it needs to evolve in the near future.

Introduction

Over time, applications have evolved from batch, compute- or memory-intensive workloads to streaming or even interactive applications. As a result, applications are getting more complex and long-running. Such applications might require frequent access to multiple distributed data sources. During application deployment and provisioning, the user can face various issues, such as (i) where to effectively place both the data and the computation; and (ii) how to achieve the required objectives while reducing the overall application running cost. Data can be generated from various sources, including a multitude of devices in IoT environments, while the applications are running; an application can itself produce a large amount of data. Data of such size is usually referred to as Big Data. In general, Big Data is characterised by five properties [ 1 , 2 ]: volume, velocity (the rapid update and propagation of data), variety (different kinds of data parts), veracity (the trustworthiness, authenticity and degree of protection of the data) and value (the main added value and the importance of the data to the business). A large set of different data types generated from various sources can hold enormous information (in the form of relationships [ 3 ], system access logs, and also the quality of services (QoS)). Such knowledge can be critical for improving both products and services. Thus, to retrieve the underlying knowledge from such big-sized data sets, an efficient data processing ecosystem and knowledge filtering methodologies are needed.

In general, Cloud-based technology offers different solutions at different levels of abstraction to build and dynamically provision user applications. The Cloud offers suitable frameworks for the clustering of Big Data as well as efficient distributed databases for their storage and placement. However, native Cloud facilities lack guidance on how to combine and integrate services into holistic frameworks which could enable users to properly manage both their applications and their data. While there exist some promising efforts that fit well under the term Big Data-as-a-service (BDaaS), most of them still lack adequate support for data privacy [ 4 , 5 , 6 ], query optimisation [ 7 ], robust data analytics [ 8 ] and data-related service level objective management for increased (Big Data) application quality [ 9 ]. Application placement and management over multiple or cross-Clouds is currently being researched; however, the additional dimension of Big Data management significantly raises the complexity of finding adequate and realistic solutions.

The primary goal of this survey is to present the current state of affairs in Cloud computing with respect to Big Data management (mainly storage and placement) from the application administration point of view. To this end, we have thoroughly reviewed the proposed solutions for the placement and storage of Big Data through a carefully designed set of criteria. Such criteria were devised under the prism of non-functional properties, in an attempt to unveil those solutions which can be deemed suitable for the better management of different kinds of applications (while taking non-functional aspects into consideration). In the end, prospective readers (such as Big Data application owners and DevOps engineers) can be guided towards selecting, in each Big Data management lifecycle phase focused on in this article, those solutions that best satisfy their non-functional application requirements. The analysis finally concludes with the identification of certain gaps. Based on the latter, a set of challenges for the two covered Big Data management phases, as well as for Big Data management as a whole, is supplied towards assisting the evolution of respective solutions and paving the way for the actual directions that research should follow.

Based on the above analysis, it is clear that this article aims at providing guidance to potential adopters concerning the most appropriate solution for both placing and storing Big Data (according to the distinctive requirements of the application domain). To this end, our work can be considered as complementary to other relevant surveys that attempt to review Big Data technologies. In particular, the past surveys have focused on the deployment of data-intensive applications in the Cloud [ 10 ], on assessing various database management tools for storing Big Data [ 11 ], on evaluating the technologies developed for Big Data applications [ 12 ], on Cloud-centric distributed database management systems (primarily on NoSQL storage models) [ 13 ], on design principles for in-memory Big Data management and processing [ 14 ] and on research challenges related to Big Data in the Cloud ecosystem [ 15 ]. However, the primary focus of these surveys is mainly on functional aspects examined under the prism of analysing different dimensions and technology types related to Big Data. Further, there is no clear discussion on management aspects in the context of the whole Big Data management lifecycle as usually the focus seems to be merely on the Big Data storage phase. Interestingly, our survey deeply analyses those phases in the Big Data management lifecycle that are the most crucial in the context of satisfying application non-functional requirements.

The remaining part of this manuscript is structured as follows: "Data lifecycle management (DLM)" section explicates how data modelling can be performed, analyses various data management lifecycle models and comes up with an ideal one, which is presented along with a proper architecture to support it. Next, "Methodology" section explains this survey's main methodology. "Non-functional data management features" section details the main non-functional features of focus in this article. Based on these features, the review of Big Data storage systems and distributed file systems is supplied in "Data storage systems" section. Similarly, the review of state-of-the-art data placement techniques is performed in "Data placement techniques" section. Next, "Lessons learned and future research directions" section presents relevant lessons learned as well as certain directions for future research work, and finally "Concluding remarks" section concludes the survey paper.

Data lifecycle management (DLM)

Data lifecycle models

There exist two types of data lifecycle models focusing on either general data or Big Data management. The generic data management lifecycles usually cover activities such as generation, collection (curation), storage, publishing, discovery, processing and analysis of data [ 16 ].

In general, Big Data lifecycle models primarily comprise activities such as data collection, data loading, data processing, data analysis and data visualisation [ 17 , 18 ]. It is worth noting that, apart from data visualisation, they share many identical activities with the generic ones. However, such models do not mention the value of data.

To address this, the NIST reference model [ 19 ] suggests four data management phases: collection, preparation, analysis and action, where the action phase is related to using synthesised knowledge to create value (representing analytics and visualisation of knowledge). Furthermore, focusing more on the data value, the OECD [ 20 ] has proposed a data value cycle model comprising six phases: datafication and data collection, Big Data, data analytics, knowledge base, decision making and value-added for growth and well-being. The model forms an iterative, closed feedback loop where results from Big Data analytics are fed back to the respective database. Later, the work in [ 21 ] exposed the main drawbacks of the OECD model and proposed a new reference model that adds two additional components, the business intelligence (BI) system and the environment, to the OECD model. Data interaction and analysis form a short closed loop in the model, while a greater loop is endorsed via the BI's iterative interaction with, and observation of, its environment. Finally, it is claimed that the management of Big Data for value creation is also linked to BI management. In this way, Big Data management is related directly to the activities of data integration, analysis, interaction and effectuation, along with the successful management of the emergent knowledge via data intelligence.

Data modelling

The data needs to be described in an appropriate form prior to any kind of usage. The information used for the data description is termed metadata (i.e., data about data) [ 22 , 23 , 24 ]. The use of metadata enriches data management so that it can properly support and improve any data management activity. Two major issues related to metadata management are:

How should metadata be described (or characterised)? This direction concerns the description of a metadata schema which can be exploited to efficiently place a certain Big Data application in multiple Clouds while respecting both user constraints and requirements. Such a metadata schema has been proposed partially in [ 25 ] and completely in [ 26 ].

How should metadata be efficiently managed and stored for better retrieval and exploitation? This direction concerns the design of appropriate languages [ 27 , 28 ] that focus on describing how Big Data applications and data should be placed and migrated across multiple Cloud resources.

For a better description of metadata, the authors in [ 22 ] identify available Cloud services and analyse some of their main characteristics following a tree-structured taxonomy. Another relevant effort is the DICE project [ 25 ] that focuses on the quality-driven development of Big Data applications. It offers a UML profile along with the appropriate tools that may assist software designers to reason about the reliability, safety and efficiency of data-intensive applications. Specifically, it has introduced a metamodel for describing certain aspects of Big Data-intensive applications.

Most of these efforts do not offer direct support for expressing significant aspects of Big Data, such as data origin, location, volume, transfer rates, or even aspects of the operations that transfer data between Cloud resources. One effort that tries to cover the requirements for a proper and complete metadata description is the Melodic metadata schema [ 26 ]. This schema refers to a taxonomy of concepts, properties and relationships that can be exploited for supporting Big Data management as well as application deployment reasoning. The schema is clustered into three parts: (i) one focusing on specifying Cloud service requirements and capabilities to support application deployment reasoning; (ii) another focusing on defining Big Data features and constraints to support Big Data management; (iii) a final one concentrating on supplying Big Data security-related concepts to drive the data access control.
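
To make the schema's structure concrete, the following sketch shows how a dataset description could be organised along those three parts; the property names are illustrative, not the actual Melodic vocabulary:

```python
# Sketch of a dataset description organised along the three schema parts;
# all property names are illustrative, not the actual Melodic vocabulary.
dataset_metadata = {
    "deployment": {                # Cloud service requirements/capabilities
        "min_storage_gb": 500,
        "allowed_regions": ["eu-west", "eu-central"],  # e.g., a legal constraint
    },
    "big_data_features": {         # data features/constraints for management
        "origin": "iot-gateway",
        "volume_gb": 350,
        "transfer_rate_mbps": 200,
        "format": "avro",
    },
    "security": {                  # concepts driving data access control
        "classification": "confidential",
        "access_roles": ["data-engineer", "analyst"],
    },
}
```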

With respect to the second direction of work, although several languages are currently used for capturing application placement and reconfiguration requirements (e.g., TOSCA [ 27 ]), a lack of distinct support for describing placement and management requirements for Big Data can be observed. However, if such languages are extended through the possible use of a metadata schema, then they could be able to achieve this purpose. This has been performed in [ 26 ], where a classical, state-of-the-art Cloud description language called CAMEL [ 29 ] has been extended to enable the description of Big Data placement and management requirements by following a feature-model-based approach where requirements are expressed as features or attributes that are annotated via elements from the metadata schema.

Data lifecycle management systems

Traditional data lifecycle management systems (DLMSs) focus more on the way data is managed and not on how it is processed. In particular, the main services that they offer are data storage planning (and provisioning) and data placement (and execution support) via efficient data management policies. On the other hand, data processing is covered by other tools or systems, as it is regarded as application-specific. Traditionally in the Cloud, Big Data processing is offered as a separate service, while resource management is usually handled by other tools, such as Apache Mesos or YARN. Figure  1 depicts the architecture of a system that completely addresses the data management lifecycle, as described in the previous sub-section. This system comprises six primary components.

Metadata management takes care of maintaining information which concerns both the static and dynamic characteristics of data. It is the cornerstone for enabling efficient data management.

Data placement encapsulates the main methods for efficient data placement and data replication while satisfying user requirements.

Data storage is responsible for proper (transactional) storage and efficient data retrieval support.

Data ingestion enables importing and exporting the data over the respective system.

Big Data processing supports the efficient and clustered processing of Big Data by executing the main logic of the user application(s).

Resource management is responsible for the proper and efficient management of computational resources.

In this article, our focus is mainly on the Data storage and Data placement parts of the above architecture. Our rationale is that the integration of these parts (or Big Data lifecycle management phases) covers the core of a DLMS. An application's data access workflow in the Cloud is presented in Fig.  2 . As a first step, the application checks the availability of the input data. In general, the data needs to be known by the system so that the system can handle it optimally. Two main cases arise: (i) the data already exist and have been registered; (ii) the data do not exist and must be registered. In the latter case, metadata is needed to register the data into the system (the data-registration process, sketched below). During data modelling (see " Data modelling " sub-section), the metadata are maintained via a data catalogue (i.e., a special realisation of the Metadata management component). Such an approach can guarantee the efficient maintenance of application data throughout the application's lifecycle by both knowing and dynamically altering the values of data features (such as data type, size, location, data format, user preference, data replica numbers and cost constraints) whenever needed. In the next phase, based on the employed data placement methodology, the data is placed/migrated next to the application, or the data and application code are collocated. Here, the underlying scheduler (realising the Data placement component) acquires up-to-date data knowledge to achieve an efficient data placement during both the initial application deployment and its runtime. Such an approach can restrain unnecessary data movement and reduce cost at runtime [ 30 , 31 , 32 ]. Next, during the application execution, two situations may arise: (i) new data sets are generated; (ii) data sets are transformed into another form (such as by data compression). Furthermore, temporary data may also need to be handled. Finally, once the application execution ends, the generated or transformed data needs to be stored (or backed up) as per user instructions.

Figure 1: A high-level block diagram of a Big Data management system

Figure 2: Standard workflow of application data lifecycle
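
A minimal sketch of the registration step of this workflow (all class, field and dataset names are hypothetical; a real data catalogue would persist such records inside the Metadata management component):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class DatasetRecord:
    """Metadata kept for one dataset; the fields mirror the data features
    listed above (type, size, location, format, replicas, cost)."""
    name: str
    data_type: str
    size_gb: float
    location: str
    data_format: str
    replicas: int = 1
    cost_constraint: Optional[float] = None

class DataCatalogue:
    """Toy realisation of the Metadata management component."""
    def __init__(self) -> None:
        self._records: Dict[str, DatasetRecord] = {}

    def is_registered(self, name: str) -> bool:
        return name in self._records

    def register(self, record: DatasetRecord) -> None:
        self._records[record.name] = record          # data-registration process

    def update_location(self, name: str, location: str) -> None:
        self._records[name].location = location      # placement/migration step

catalogue = DataCatalogue()
if not catalogue.is_registered("clickstream-2019"):   # step 1: availability check
    catalogue.register(DatasetRecord(
        name="clickstream-2019", data_type="log", size_gb=120.0,
        location="dc-eu-1", data_format="parquet"))
```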

In general, hierarchical storage management [ 33 ] can be considered a DLMS tool. In recent times, cognitive data management (CDM) has gained industrial support for automated data management with high-grade efficiency. CDM (e.g., Stronglink Footnote 1 ) is generally the amalgamation of intelligent (artificial intelligence Footnote 2 /machine learning-based) distributed storage, including resource management, together with a more sophisticated DLMS component. CDM works at the database-as-a-service (DBaaS) layer, instructing how the data is to be used by the scheduler via an efficient management approach that includes the exploitation of the data catalogue built during data modelling.

Methodology

We have conducted a systematic literature review (SLR) on Big Data placement and storage methods in the Cloud, following the guidelines proposed in [ 34 ]. Such an SLR comprises three main phases: (i) SLR planning, (ii) SLR conduction and (iii) SLR reporting. In this section, we briefly discuss the first two phases, while the remaining part of this manuscript focuses on presenting the survey itself, identifying the remaining research issues and discussing the potential challenges for current and future work.

SLR planning

This phase comprises three main steps: (i) SLR need identification, (ii) research questions identification and (iii) SLR protocol formation.

SLR need identification

Here, we advocate adding more focus on the Big Data storage and placement phases of the respective Big Data management lifecycle, so as to confront the challenges that Big Data places on them. Such phases are also the most crucial in the attempt to satisfy the non-functional requirements of Big Data applications. The primary focus of this survey is therefore on the storage and placement phases, in an attempt to examine whether they are efficiently and effectively realised by current solutions and approaches. The twofold advantage of identifying efficient ways to manage and store Big Data is that: (i) practitioners can select the most suitable Big Data management solutions for satisfying both their functional and non-functional needs; (ii) researchers can fully comprehend the research area and identify the most interesting directions to follow. To this end, we address both data placement and storage issues within the Big Data management lifecycle and Cloud computing under the prism of non-functional aspects, in contrast to previous surveys that have concentrated mainly on Big Data storage issues in the context of functional aspects.

Research questions identification

This survey has the ambition to supply suitable and convincing answers to:

What are the most suitable (big) data storage technologies and how do they compete with each other according to certain criteria related to non-functional aspects?

What are the most suitable and sophisticated (big) data placement methods that can be followed to (optimally) place and/or migrate Big Data?

SLR protocol formation

This composite step concerns the identification of (i) the (data) sources—here we have primarily consulted the Web of Science and Scopus—and (ii) the actual terms for querying these sources—here, we focus on population, intervention and outcome, as mentioned in [ 34 ]. It is worth noting that these data sources supply well-structured search capabilities, which enabled us to pose the respective query terms more precisely. The population mainly concerns target user groups in the research area or certain application domains. The intervention means the specific method employed to address a certain issue (used terms include: methodology, method, algorithm, approach, survey and study). Lastly, the outcome relates to the final result of the application of the respective approach (such as management, placement, positioning, allocation, storage). Based on these terms, the abstract query concretised in the context of the two data sources can be seen in Table  1 .
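
For illustration, a short script that concretises such population/intervention/outcome terms into a boolean query string; this is not the authors' exact query (which is given in Table 1), and the term lists are assembled from the description above:

```python
# Illustrative only: building a boolean search string from the abstract
# population/intervention/outcome term groups described in the text.
population = ["big data", "cloud"]
intervention = ["methodology", "method", "algorithm", "approach", "survey", "study"]
outcome = ["management", "placement", "positioning", "allocation", "storage"]

def or_group(terms):
    """Join a term group into an OR-ed, quoted sub-expression."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = " AND ".join(or_group(g) for g in (population, intervention, outcome))
print(query)  # paste into the Scopus / Web of Science advanced search
```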

SLR conduction

Systematic literature review conduction includes the following steps: (i) study selection criteria; (ii) quality assessment criteria; (iii) study selection procedure. All these steps are analysed in the following three paragraphs.

Study selection

The study selection was performed via a certain set of inclusion and exclusion criteria. The inclusion criteria included the following:

Peer-reviewed articles.

Latest articles only (last 8 years).

In case of equivalent studies, only the one published in the highest rated journal or conference is selected to sustain only a high-quality set of articles on which the review is conducted.

Articles which supply methodologies, methods or approaches for Big Data management.

Articles which study or propose Big Data storage management systems or databases.

Articles which propose Big Data placement methodologies or algorithms.

While the exclusion criteria were the following:

Inaccessible articles.

Articles in a different language than English.

Short papers, posters or other kinds of small in contribution articles.

Articles which deal with the management of data in general and do not focus on Big Data.

Articles that focus on studying or proposing normal database management systems.

Articles that focus on studying or proposing normal file management systems.

Articles that focus on the supply of Big Data processing techniques or algorithms, as the focus of this article is mainly on how to manage the data and not on how to process them to achieve a certain result.

Quality assessment criteria

Apart from the above criteria, quality assessment criteria were also employed to enable prioritising the review as well as possibly excluding some articles not reaching certain quality standards. In the context of this work, the following criteria were considered:

Presentation of the article is clear and there is no great effort needed to comprehend it.

Any kind of validation is offered especially in the context of the proposal of certain algorithms, methods, systems or databases.

The advancement over the state-of-the-art is clarified as well as the main limitations of the proposed work.

The objectives of the study are well covered by the approach that is being employed.

Study selection procedure

It was decided to employ two surveyors for each main article topic, each given a different portion of the respective reviewing work depending on their expertise. In each topic, the selection results of one author were assessed by the other one. In case of disagreement, a respective discussion was conducted. If this discussion did not reach a positive outcome, the decision was delegated to the principal author, who had been unanimously selected by all authors from the very beginning.

Non-functional data management features

For effective Big Data management, current data management systems (DMSs), including distributed file systems (DFSs) and distributed database management systems (DDBMSs), need to provide a set of non-functional features to cater for the storage, management and access of continuously growing data. This section introduces a classification of the non-functional features (see Fig.  3 ) of DMSs in the Big Data domain, extracted from [ 10 , 13 , 35 , 36 , 37 ].

Figure 3: Non-functional features of data management systems

Figure  3 provides an overview of the relevant non-functional features while the following subsections attempt to analyse each of them.

Performance

Performance is typically referred to as one of the most important non-functional features. It directly relates to the execution of requests by the DMSs [ 38 , 39 ]. Typical performance metrics are throughput and latency .

Scalability

Scalability focuses on the general ability to process arbitrary workloads. A definition of scalability for distributed systems in general, and with respect to DDBMSs in particular, is provided by Agrawal et al. [ 40 ], where the terms scale-up, scale-down, scale-out and scale-in are defined with a focus on the management of growing workloads. Vertical as well as horizontal scaling techniques are applied to distributed DBMSs and can also be applied to DFSs. Vertical scaling adds more computing resources to a single node, while horizontal scaling adds nodes to a cluster (or, in general, instances of a certain application component).

Elasticity is tightly coupled to the horizontal scaling and helps to overcome the sudden workload fluctuations by scaling the respective cluster without any downtime. Agrawal et al. [ 40 ] formally define it by focusing on DDBMSs as follows “Elasticity, i.e. the ability to deal with load variations by adding more resources during high load or consolidating the tenants to fewer nodes when the load decreases, all in a live system without service disruption, is therefore critical for these systems” . While elasticity has become a common feature for DDBMSs, it is still in an early stage for DFSs [ 41 ].

Availability

The availability tier builds upon scalability and elasticity, as these tiers are exploited to handle request fluctuations [ 42 ]. Availability represents the degree to which a system is operational and accessible when required. The availability of a DMS can be affected by overloading at the DMS layer and/or failures at the resource layer. During overloading, a high number of concurrent client requests overloads the system such that these requests are either handled with non-acceptable latency or not handled at all. On the other hand, a node can fail due to a resource failure (such as a network outage or disk failure). An intuitive way to deal with overload is to scale out the system, while distributed DMSs apply data replication to handle resource failures.

Consistency

To support high availability (HA), consistency becomes an even more important and challenging non-functional feature. However, there is a trade-off among consistency, availability and partitioning guarantees, described by the well-known CAP theorem [ 43 ]. This means that different kinds of consistency guarantees can be offered by a DMS. According to [ 44 ], consistency can be considered from both the client and the data perspective (i.e., from the DMS administrator perspective). Client-centric consistency can be further classified into staleness and ordering [ 44 ]. Staleness defines how far a replica lags behind its master; it can be measured either in time or in versions. Ordering requires that all requests be executed on all replicas in the same chronological order. Data-centric consistency focuses on the synchronisation processes among replicas and the internal ordering of operations.

Big Data processing

The need for native integration of (big) data processing frameworks into DMSs arises with the growing number of advanced Big Data processing frameworks, such as Hadoop MapReduce and Apache Spark, and their specific internal data models. Hence, DMSs need to provide native drivers for Big Data processing frameworks which can automate the transformation of DMS data models into the respective Big Data processing framework storage models. Further, these native drivers can exploit the data locality features of the DMSs as well. Please note that such a feature is also needed according to the DLMS architecture presented in " Data lifecycle management (DLM) " section, as a Big Data processing framework needs to be placed on top of the data management component.
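
As an illustration of such an adapter, a PySpark job could read a Cassandra table through the DataStax Spark–Cassandra connector roughly as follows (keyspace, table and host are hypothetical, and the connector package is assumed to be on the Spark classpath):

```python
from pyspark.sql import SparkSession

# The connector translates Cassandra's storage model into Spark DataFrames
# and can exploit data locality between the two clusters.
spark = (
    SparkSession.builder
    .appName("dms-adapter-example")
    .config("spark.cassandra.connection.host", "127.0.0.1")  # assumed host
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="sensors", table="readings")  # hypothetical names
    .load()
)
df.groupBy("device_id").count().show()
```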

Data storage systems

A DLMS in the Big Data domain requires both the storage and the management of heterogeneous data structures. Consequently, a sophisticated DLMS would need to support a diverse set of DMSs. DMSs can be classified into file systems for storing unstructured data and DBMSs (database management systems) for storing semi-structured and structured data. However, the variety of semi-structured and structured data requires suitable data models (see Fig.  4 ) to increase the flexibility of DBMSs. Following these requirements, the DBMS landscape is constantly evolving and becoming more heterogeneous. Footnote 3 The following sub-sections provide (i) an overview of related work on DBMS classifications; (ii) a holistic and up-to-date classification of current DBMS data models; (iii) a qualitative analysis of selected DBMSs; (iv) a classification and analysis of relevant DFSs.

Figure 4: DBMS data model classification

Database management systems

The classification of the different data models (see Fig.  4 ) for semi-structured data has been in focus over the last decade [ 37 ], as heterogeneous systems (such as Dynamo, Cassandra [ 45 ] and BigTable [ 46 ]) appeared on the DBMS landscape. Consequently, the term NoSQL evolved, which summarises the heterogeneous data models for semi-structured data. Similarly, the structured data model evolved with the NewSQL DBMSs [ 13 , 47 ].

Several surveys have reviewed NoSQL and NewSQL data models over recent years, analysing the existing DBMSs with respect to their data models and specific non-functional features [ 11 , 13 , 35 , 36 , 37 , 48 , 49 ]. In addition, dedicated surveys focus explicitly on specific data models (such as the time-series data model [ 50 , 51 ]) or specific DBMS architectures (such as in-memory DBMSs [ 14 ]).

Cloud-centric challenges for operating distributed DBMSs are analysed in [ 13 ], which considers horizontal scaling, handling elastic workload patterns and fault tolerance. It also classifies nineteen DDBMSs against features such as partitioning, replication, consistency and security.

Recent surveys on NoSQL-based systems [ 35 , 49 ] derive both the functional and the non-functional NoSQL and NewSQL features and correlate them with distribution mechanisms (such as sharding, replication, storage management and query processing). However, the implications of Cloud resources and the challenges of Big Data applications were not considered. Another conceptual analysis of NoSQL DBMSs is carried out in [ 48 ]. It outlines several storage models (such as key-value, document, column-oriented and graph-based) and also analyses current NoSQL implementations against persistence, replication, sharding, consistency and query capability. However, recent DDBMSs (such as time-series or NewSQL DBMSs) are not analysed in the Big Data or Cloud context. A survey on DBMS support for Big Data, focusing on data storage models, architectures and consistency models, is presented in [ 11 ]. Here, the relevant DBMSs are analysed towards their suitability for Big Data applications, but the Cloud service models and evolving DBMSs (such as time-series databases) are again not considered.

An analysis of the challenges and opportunities for DBMSs in the Cloud is presented in [ 52 ]. Here, the relaxed consistency guarantees (for DDBMSs) and heterogeneity, as well as the different levels of Cloud resource failures, are explained. Moreover, it is explicated that an HA mechanism is needed to overcome failures. However, HA and horizontal scalability come at the cost of a weaker consistency model (e.g., BASE [ 53 ]) compared to ACID [ 43 ].

In the following, we distil and join existing data model classifications (refer to Fig.  4 ) into an up-to-date classification of the still-evolving DBMS landscape. Here, we select the details relevant to the DLMS of Big Data applications, while we refer the interested reader to the presented surveys for an in-depth analysis of specific data models. Analogously, we apply a qualitative analysis of currently relevant DBMSs based on the general DLMS features (see " Non-functional data management features " section), while in-depth analyses of specific features can be found in the presented surveys. For our analysis, we select two common DBMSs Footnote 4 of each data model.

Relational data models

The relational data model stores data as tuples, each forming an ordered set of attributes, which can be extended to extract more meaningful information [ 54 ]. A relation forms a table, and tables are defined using a static, normalised data schema. SQL is a generic data definition, manipulation and query language for relational data. Popular representative DBMSs with a relational data model are MySQL and PostgreSQL.

The traditional relational data model provides limited data partitioning, horizontal scalability and elasticity support. NewSQL DBMSs [ 55 ] aim at bridging this gap and build upon the relational data model and SQL. However, NewSQL relaxes relational features to enable horizontal scalability and elasticity [ 13 ]. It is worth noting that only a few NewSQL DBMSs, such as VoltDB Footnote 5 and CockroachDB, Footnote 6 are built upon such architectures with a focus on scalability and elasticity, as most NewSQL DBMSs are constructed out of existing DBMSs [ 47 ].

Key-value

The key-value data model resembles the hash tables of programming languages. The data records are tuples consisting of key-value pairs. While the key uniquely identifies an entry, the value is an arbitrary chunk of data. Operations are usually limited to simple put or get operations. Popular key-value DBMSs are Riak Footnote 7 and Redis. Footnote 8
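
A minimal put/get interaction with Redis through the redis-py client might look as follows (the key name is hypothetical and a local server is assumed):

```python
import redis

# The key uniquely identifies the entry; the value is an opaque chunk of data.
r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis server
r.set("user:42:avatar", b"<binary blob>")     # simple put
print(r.get("user:42:avatar"))                # simple get
```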

Document

The document data model is similar to the key-value data model. However, it defines a structure on the values in certain formats, such as XML or JSON. These values are referred to as documents, usually without fixed schema definitions. Compared to key-value stores, the document data model allows for more complex queries, as document properties can be used for indexing and querying. MongoDB Footnote 9 and Couchbase Footnote 10 represent common DBMSs with a document data model.
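
A small PyMongo sketch illustrating that, unlike opaque key-value blobs, document properties can be indexed and queried (database, collection and field names are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local server
books = client["shop"]["books"]                    # hypothetical db/collection

# Documents are schema-free JSON-like values...
books.insert_one({"title": "Designing Data Apps", "year": 2017, "tags": ["db"]})

# ...but their properties can be indexed and used in queries.
books.create_index("year")
for doc in books.find({"year": {"$gte": 2015}}):
    print(doc["title"])
```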

Wide-column

The column-oriented data model stores data by columns rather than by rows. It enables both storing large amounts of data in bulk and efficiently querying over very large structured data sets. A column-oriented data model does not rely on a fixed schema. It provides nestable, map-like structures for data items which improve flexibility over fixed schema [ 46 ]. The common representatives of column-oriented DBMSs are Apache Cassandra Footnote 11 and Apache HBase. Footnote 12
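
A sketch of the wide-column model using the DataStax Python driver for Cassandra (keyspace, table and host are hypothetical):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()  # assumes a local Cassandra node

# Rows are grouped by partition key; no fixed schema is required beyond
# the primary key definition, and columns are stored column-wise.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        device_id text, ts timestamp, value double,
        PRIMARY KEY (device_id, ts))
""")
session.execute(
    "INSERT INTO demo.readings (device_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)
rows = session.execute(
    "SELECT * FROM demo.readings WHERE device_id = %s", ("sensor-1",)
)
```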

The graph data model primarily uses graph structures, usually including elements like nodes and edges, for data modelling. Nodes are often used for the main data entities, while edges between nodes are used to describe relationships between entities. Querying is typically executed by traversing the graph. Typical graph-based DBMS are Neo4J Footnote 13 and JanusGraph. Footnote 14
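
A short sketch with the official Neo4j Python driver, creating two nodes and traversing relationships in Cypher (URI, credentials, labels and names are hypothetical):

```python
from neo4j import GraphDatabase

# Nodes model entities, edges model relationships; queries traverse the graph.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    # Traversal: friends and friends-of-friends of Alice.
    result = session.run(
        "MATCH (a:Person {name: $a})-[:KNOWS*1..2]->(fof) RETURN fof.name",
        a="Alice",
    )
    print([record["fof.name"] for record in result])
```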

Time-series

The time-series data model [ 50 ] is driven by the needs of sensor storage within the Cloud and Big Data context. Time-series DBMSs are typically built upon existing non-relational data models (preferably key-value or column-oriented) and add a dedicated time-series data model on top. The data model is built upon data points which comprise a time stamp, an associated numeric value and a customisable set of metadata. Time-series DBMSs offer analytical query capabilities, which cover statistical functions and aggregations. Well-known time-series DBMSs are InfluxDB Footnote 15 and Prometheus. Footnote 16
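
A sketch using the (1.x) InfluxDB Python client to write one data point (numeric field plus tag metadata; the server assigns the time stamp) and run an aggregation query; host, database and measurement names are hypothetical:

```python
from influxdb import InfluxDBClient  # the 1.x client library, assumed here

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

# A data point = time stamp + numeric value(s) + customisable metadata (tags).
client.write_points([{
    "measurement": "cpu_load",
    "tags": {"host": "node-1"},     # metadata
    "fields": {"value": 0.64},      # numeric value
}])

# Analytical queries cover statistical functions and aggregations.
result = client.query(
    'SELECT MEAN("value") FROM "cpu_load" WHERE time > now() - 1h GROUP BY "host"'
)
print(list(result.get_points()))
```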

Multi-model

Multi-model DBMSs address the problem of polyglot persistence [ 56 ], which signifies that each of the existing non-relational data models addresses a specific use case. Hence, multi-model DBMSs combine different data models into a single DBMS built upon one storage backend to improve flexibility (e.g., providing the document and graph data models via a unified query interface). Common multi-model DBMSs are ArangoDB Footnote 17 and OrientDB. Footnote 18

Comparison of selected DBMSs

In this section, we analyse the previously mentioned DBMSs in the context of Big Data applications (see Table  2 ). To perform this, we first analyse these DBMSs (of the previously introduced data models) with respect to their features and supported Cloud service models. Next, we provide a qualitative analysis with respect to the non-functional features of the DMSs (refer to " Non-functional data management features " section). For a quantitative analysis of these non-functional requirements, we refer the interested reader to existing work on DBMS evaluation frameworks [ 44 , 57 , 58 , 59 , 60 ] and evaluation results [ 42 , 61 , 62 ].

Qualitative criteria

In Table  2 , the first three columns present each DBMS and its data model, followed by the technical features and the supported service models. The analysis only considers the standard version of each DBMS.

In the following, we explicate each of the technical features considered. The DBMS architecture is classified into single, master–slave and multi-master architectures [ 56 ]. The sharding strategies are analysed based on the DBMS architectures; they can be supported manually as well as automatically, in a hash- or range-based manner. The elasticity feature relies on a distributed architecture and relates to whether a DBMS supports adding and/or removing nodes from the cluster at runtime without downtime. For consistency and availability guarantees, each DBMS is analysed with respect to its consistency (C), availability (A) and partition tolerance (P) properties within the CAP theorem (i.e., CA, CP or AP) [ 43 ]. However, it should be highlighted that we did not consider fine-grained configuration options that might be offered by a DBMS to vary the CAP properties. Next, the replication mechanisms are analysed in terms of both cluster and cross-cluster replication (also known as geo-distribution); consequently, a DBMS supporting cross-cluster replication implicitly supports cluster replication. The interested reader may consult [ 63 ] for a more fine-grained analysis of the replication mechanisms of DDBMSs. The Big Data adapter is analysed by evaluating native and/or third-party drivers for Big Data processing frameworks. Finally, the DDBMSs are classified based on their offering as community editions, enterprise commercial editions or managed DBaaS; one exemplary provider is given where a DBMS is offered as a DBaaS.

Qualitative analysis

The resulting Table  2 represents the evolving landscape of the DBMSs. The implemented features of existing DBMSs significantly differ (except the RDBMSs) even within one data model. The heterogeneity of analysed DBMSs is even more obvious across data models. Further, the heterogeneous DBMS landscape offers a variety of potential DBMS solutions for Big Data.

The feature analysis provides a baseline for the qualitative analysis of the non-functional features. From the (horizontal) scalability point-of-view, a DBMS with a multi-master architecture is supposed to provide scalability for write and read workloads, while a master–slave architecture is supposed to provide read scalability. Due to the differences between the DBMSs, the impact of elasticity requires additional qualitative evaluations [ 42 ].

The consistency guarantees correlate with the classification in the CAP theorem. Table  2 clearly shows the heterogeneity of the offered consistency guarantees. Generally, the single-master or master–slave architectures provide strong consistency guarantees. Multi-master architectures cannot be exactly classified within the CAP theorem, as their consistency guarantees heavily depend on the DBMS runtime configuration [ 64 ]. Additional consistency evaluations are required for the selected DBMSs that aim to combine strong consistency with scalability, elasticity and availability [ 44 , 62 ].

Providing HA directly relates to the supported replication mechanisms for overcoming failures. The analysis shows that all DBMSs support cluster-level replication, while cross-cluster replication is supported by ten out of the sixteen DBMSs. Big Data processing relates to the technical feature of Big Data adapters; Table  2 clearly shows that seven DBMSs provide native adapters, while nine DBMSs enable Big Data processing via third-party adapters. All the DBMSs are available as self-hosted community or enterprise versions. In addition, both RDBMSs and six NoSQL DBMSs are offered as managed DBaaS. While the DBaaS offerings abstract away all operational aspects of the DBMS, an additional analysis might be required with respect to their non-functional features and cost models [ 65 ].

Cloudification of DMS

Traditional on-premise DBMS offerings are still popular, but the current trend shows that DDBMSs running in the Cloud are also well accepted, especially as Big Data imposes new challenges, such as scalability, the diversity of data management and the usage of Cloud resources for the massive storage of data [ 66 ]. In general, the distributed architecture of DMSs has evolved towards exploiting Cloud features and catering for the 5Vs of Big Data [ 67 ]. Data-as-a-service (DaaS) mostly handles data aggregation and management via appropriate web services, such as RESTful APIs, while database-as-a-service (DBaaS) offers a database as a service, which can be a (distributed) relational database or a non-relational one. In most cases, storage-as-a-service (STaaS) includes both DaaS and DBaaS [ 68 ]. Furthermore, BDaaS [ 3 ] is a Cloud service (such as Hadoop-as-a-service) where traditional applications are migrated from local installations to the Cloud. BDaaS wraps three primary services: (i) IaaS (for the underlying resources), (ii) STaaS (a sub-domain of platform-as-a-service (PaaS)) for managing the data via dynamic scaling and (iii) data management (such as data placement and replica management).

Distributed file systems

A distributed file system (DFS) is an extended networked file system that allows multiple distributed nodes to internally share data/files without using remote call methods or procedures [ 69 ]. A DFS offers scalability, fault tolerance, concurrent file access and metadata support. However, the design challenges (independent of data size and storage type) of a DFS are transparency, reliability, performance, scalability and security. In general, DFSs do not share storage access at the block level but rather work at the network level. In DFSs, security relies on either access control lists (ACLs) or respectively defined capabilities, depending on how the network is designed. DFSs can be broadly classified into three models and respective groups (see Fig.  5 ). First, client–server architecture based file systems, which supply a standardised view of a local file system. Second, clustered-distributed file systems, which offer multiple nodes concurrent access to the same block device. Third, symmetric file systems, where all nodes have a complete view of the disk structure. Below, we briefly analyse each category in a separate sub-section and also supply some strictly open-source members of each.

Figure 5: Distributed file systems classification

Client–server model

In the client–server architecture based file system, all communications between servers and clients are conducted via remote procedure calls. The clients maintain the status of current operations on a remote file system. Each file server provides a standardized view of its local file system. Here, the file read-operations are not mutually exclusive but the write operations are. File sharing is based on mounting operations. Only the servers can mount directories exported from other servers. Network File System (NFS) and GlusterFS Footnote 19 are two popular open source implementations of the client–server model.

Clustered-distributed model

Clustered-distributed based systems organise the clusters in an application-specific manner and are ideal for DCs. The model supports a huge amount of data; the data is stored/partitioned across several servers for parallel access. By design, this DFS model is fault tolerant, as it enables the hosting of a number of replicas. Due to the huge volume of data, data are appended instead of overwritten. In general, DNS servers map (commonly in a round-robin fashion) access requests to the clusters for load-balancing purposes. The Hadoop distributed file system (HDFS) and CephFS Footnote 20 are two popular implementations of such a DFS model.

Symmetric model

A symmetric DFS supports a masterless architecture, where each node has the same set of roles; it mainly resembles a peer-to-peer system. In general, the symmetric model employs a distributed hash table approach for data distribution and replication across systems. Such a model offers higher availability but reduced performance. Ivy [ 70 ] and the parallel virtual file system (PVFS) [ 71 ] are examples of the symmetric DFS model.

DFS evaluation

Similar to the DDBMSs, we also compare the open-source implementations of DFSs according to the same set of technical features or criteria. A summary of this comparison is depicted in Table  3 . In general, most of the file systems are distributed in nature (except NFS and GlusterFS). However, they do exhibit some architectural differences: NFS and GlusterFS are both developed following a master–slave approach, while Ivy and PVFS are based on the masterless model. Data partitioning (or sharding) is supported either dynamically (featured by Ivy and PVFS) or statically via a fixed size (as in the case of HDFS). Elasticity, or support for data scaling, is a very important feature for many Big Data applications (especially those hosted in the Cloud); we can observe that, except for NFS, all the mentioned DFSs support scalability. Further, HDFS, CephFS, Ivy and PVFS are fault tolerant as well. Replication, essential for not losing data, is well supported by all DFSs, although its granularity differs from the block to the cluster level. Finally, these DFSs also offer some form of hooks (either native or third-party supplied) to be used with Big Data frameworks.

Data placement techniques

In the Cloud ecosystem, traditional placement algorithms incur a high cost (including time) for storing and transferring data [ 72 ]. Placing data that is partitioned and distributed across multiple locations is a challenge [ 23 , 73 ]. Runtime data migration is an expensive affair [ 30 , 31 , 32 ], and the complexity increases due to the frequent change of application as well as DC behaviour (i.e., resources or latencies) [ 74 ]. Placing a large amount of data across the Cloud is complex due to issues such as (i) data storage and transfer cost optimisation while maintaining data dependencies; (ii) data availability and replication; (iii) privacy policies, such as restricted data storage based on geo-locations. Data replication can influence consistency, while it also enhances the scalability and availability of data. In general, the existing data placement strategies can be grouped based on user-imposed constraints, such as data access latency [ 75 ], fault tolerance [ 76 ], energy-cost awareness [ 77 ], data dependency [ 78 , 79 ] and robustness or reliability [ 80 , 81 ].

Formal definition

Data placement in a distributed computing domain is an NP-hard problem [ 82 ]; it can be reduced to an instance of the bin-packing problem. Informally, the data placement problem can be described as follows: given a certain workflow, the current data placement and a particular infrastructure, find the right position(s) of data within the infrastructure so as to optimise one or more criteria, such as the cost of data transfer.

A formal representation of this problem is as follows: suppose that there are N datasets, represented as \(d_i\) (where \(i=1,\ldots ,N\) ), each with a certain size \(s_i\) . Further, suppose that there are M computational elements, represented as \(V_j\) (where \(j=1,\ldots ,M\) ), each with a certain storage capacity \(c_j\) . Finally, suppose that there is a workflow W with T tasks, represented as \(t_k\) (where \(k=1,\ldots ,T\) ). Each task has a certain input \(t_k.input\) and output \(t_k.output\) , where each maps to a set of datasets.

The main set of decision variables is \(x_{ij}\) , representing the decision (e.g., based on privacy or legal issues) of whether a certain dataset i should be stored in a certain computational element j (we write \(x_{ij}\) to avoid a clash with the capacity symbol \(c_j\) ). Thus, for each i there must be a certain j with \(x_{ij}=1\) . Two hard constraints need to hold: (i) each dataset should be stored in exactly one computational element, represented as \(\sum _j x_{ij}=1\) for each i . It is worth noting that this constraint holds only when no dataset replication is allowed; otherwise, it takes the form \(\sum _j x_{ij}\ge r\) , where r is the replication factor; (ii) the capacity of a computational element should be sufficient for hosting the respective dataset(s) assigned to it, represented as \(\sum _i x_{ij}\, s_i \le c_j\) for each j .

Finally, suppose that the primary aim is to reduce the total amount of data transfers for the whole workflow. Assuming that each task is placed on the computational element hosting one of its input datasets, this optimisation objective can be expressed as follows:

\(\min \sum _{k=1}^{T} \; \min _{d_l \in t_k.input} \; \sum _{d_i \in t_k.input,\ \text {m}(d_i) \ne \text {m}(d_l)} s_i\)

where \(\text {m}\left( d_i\right)\) (which supplies as output a value in [1,  M ]) indicates the index of the computational element that has been selected for a certain dataset, i.e., the j for which \(x_{ij}=1\) . This objective sums the amount of data transfers per workflow task: each task will certainly be placed on a specific resource mapping to one of its required input datasets, so during its execution the data mapping to the rest of the input datasets will need to be moved in order to support the respective computation.
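
The following toy sketch illustrates this formulation (all sizes, capacities and task inputs are hypothetical); it enumerates feasible placements exhaustively, which is viable only for tiny instances given the NP-hardness noted above:

```python
from itertools import product

# Toy (hypothetical) instance: dataset sizes s_i, element capacities c_j and
# workflow tasks t_k.input given as dataset indices.
sizes = [40, 25, 10]
capacities = [60, 60]
tasks = [[0, 1], [1, 2]]

def feasible(placement):
    # Capacity constraint: datasets assigned to element j must fit into c_j.
    return all(
        sum(s for s, p in zip(sizes, placement) if p == j) <= cap
        for j, cap in enumerate(capacities)
    )

def transfer_cost(placement):
    # Each task runs on the element hosting one of its inputs; the remaining
    # inputs stored elsewhere must be transferred (the objective above).
    total = 0
    for inputs in tasks:
        candidates = {placement[i] for i in inputs}
        total += min(
            sum(sizes[i] for i in inputs if placement[i] != j)
            for j in candidates
        )
    return total

# Exhaustive search over all assignments m(d_i).
best = min(
    (p for p in product(range(len(capacities)), repeat=len(sizes)) if feasible(p)),
    key=transfer_cost,
)
print("placement:", best, "data transferred:", transfer_cost(best))
```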

Data placement methodologies

We broadly classify the proposed data placement methods into data dependency methods, holistic task and data scheduling methods, and graph-based methods. The methods in each category are analysed in the following subsections.

Data dependency methods

A data-group-aware placement scheme is proposed in [83], which employs the bond energy algorithm (BEA) [84] to cluster the original data-dependency matrix for a Hadoop cluster. It exploits access patterns to find an optimal data grouping that achieves better parallelism and workload balancing. In [85], a data placement algorithm is proposed for solving the data inter-dependency issue at the VM level. Scalia [86] proposes a Cloud storage brokerage scheme that optimises the storage cost by exploiting real-time data access patterns. Zhao et al. [87] proposed data placement strategies for both initial data placement and relocation using a genetic algorithm. For fixed data set placement, this method relies on hierarchical data correlation, and it performs data re-allocation upon storage saturation. Yuan et al. [78] propose a k-means-based dataset clustering algorithm that constructs a data-dependency matrix by exploiting data dependency and the locality of computation. The dependency matrix is then transformed by applying the BEA, and items are clustered based on their dependencies following a recursive binary partitioning algorithm. In general, preserving time locality can significantly improve caching performance, while efficient re-ordering of jobs can improve resource usage. In [79], the authors propose a file grouping policy for pre-staging data that preserves time locality and enforces the role of job re-ordering by extracting access patterns.
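As an illustration of this family of methods, the sketch below builds a data-dependency matrix from task inputs and then greedily merges the most strongly dependent groups. This is a simplified agglomerative stand-in for BEA/k-means-style clustering, not any specific surveyed algorithm, and the toy workflow is an assumption:

```python
from itertools import combinations
from collections import defaultdict

# Toy workflow: each task lists the datasets it reads (illustrative).
tasks = {"t1": ["d1", "d2"], "t2": ["d2", "d3"], "t3": ["d1", "d2"]}

# Dependency matrix: dep[a][b] counts how often datasets a and b are
# used together by the same task (the kind of matrix that BEA-based
# methods cluster).
dep = defaultdict(lambda: defaultdict(int))
for inputs in tasks.values():
    for a, b in combinations(sorted(inputs), 2):
        dep[a][b] += 1
        dep[b][a] += 1

def cluster(datasets, k):
    """Greedy agglomeration: repeatedly merge the pair of groups with
    the strongest cross-dependency until k groups remain."""
    groups = [{d} for d in datasets]
    def bond(g1, g2):
        return sum(dep[a][b] for a in g1 for b in g2)
    while len(groups) > k:
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda p: bond(groups[p[0]], groups[p[1]]))
        groups[i] |= groups.pop(j)  # j > i, so pop(j) is safe
    return groups

print(cluster(["d1", "d2", "d3"], k=2))  # [{'d1', 'd2'}, {'d3'}]
```

Each resulting group would then be placed on the same node or DC, so that tasks touching correlated datasets avoid cross-node transfers.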

Task and data scheduling methods

In [88], the authors propose an adaptive data management middleware (based on a multi-objective optimisation model) which collects system-state information and abstracts away the complexities of multiple Cloud storage systems. For internet-of-things (IoT) data streaming support, Lan et al. [89] proposed a data stream partitioning mechanism that exploits statistical feature extraction. Zhang et al. [90] propose a mixed-integer linear programming model for the data placement problem. It considers both the data access cost and the storage limitations of DCs. Hsu et al. [91] proposed a Hadoop extension that adds dynamic data re-distribution (via VM profiling) before the map phase, together with VM mapping for reducers based on partition size and VM availability; here, high-capacity VMs are assigned to high-workload reducers. Xu et al. [92] propose a genetic algorithm-based approach to optimise the overall number of data transfers. However, this approach considers neither the DCs' capacity constraints nor the non-replication constraints of data sets. In [93], a policy engine is proposed for managing both the number of parallel streams (between origin and destination nodes) and the priorities of data staging jobs in scientific workflows. The policy engine also considers data transfers, storage allocation and network resources.
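The reducer-to-VM mapping idea attributed to [91] can be illustrated with a simple rank-and-match rule: sort reduce partitions by size and VMs by capacity, then pair them off. The sketch below is a hedged simplification; partition sizes, capacity scores and the matching rule are illustrative assumptions, not the authors' exact technique:

```python
# Hypothetical reduce-partition sizes (GB) and VM capacity scores.
partitions = {"p1": 12.0, "p2": 48.0, "p3": 30.0}
vms = {"vmA": 8, "vmB": 32, "vmC": 16}

def assign_reducers(partitions, vms):
    """Pair the k-th largest partition with the k-th most capable VM,
    so high-workload reducers land on high-capacity VMs."""
    ranked_parts = sorted(partitions, key=partitions.get, reverse=True)
    ranked_vms = sorted(vms, key=vms.get, reverse=True)
    return dict(zip(ranked_parts, ranked_vms))

print(assign_reducers(partitions, vms))
# {'p2': 'vmB', 'p3': 'vmC', 'p1': 'vmA'}
```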

The storage resource broker [23] provides seamless access to different distributed data sources (interfacing multiple storages) via its APIs. It works as a middleware between multiple distributed data storages and applications. BitDew [94] offers a programmable environment for data management via metadata exploitation; its data scheduling (DS) service takes care of implicit data movement. Pegasus [95] provides a framework that maps complex scientific applications onto distributed resources. It stores newly generated data and also registers them in the metadata catalogue. The replica location service [96] is a distributed, scalable data management service that maps logical data names to target names. It supports both centralised and distributed resource mapping. Kosar and Livny [81] propose a data placement sub-system that consists of a scheduler, a planner and a resource broker. The resource broker is responsible for matching resources, data identification and decisions related to data movement. The scheduling of data placement jobs relies on information given by the workflow manager, the resource broker and the data miner. A very interesting feature of the proposed sub-system is its support for failure recovery through the application of retry semantics.

Graph-based data placement

Yu and Pan [72] propose the use of sketches to construct a hypergraph sparsifier of data traffic in order to lower the data placement cost; such sketches are data structures that approximate properties of a data stream. LeBeane et al. [97] proposed multiple on-line graph-partitioning strategies to optimise data ingress across heterogeneous clusters. SWORD [98] handles partitioning and placement for OLTP workloads: the workload is represented as a hypergraph, and a hypergraph compression technique is employed to reduce the data partitioning overhead. An incremental data re-partitioning technique is also proposed that modifies data placement in multiple steps to support workload changes. Kayyoor et al. [99] show how to map data to a subset of cluster nodes while satisfying user constraints. Their approach minimises the query span for query workloads by applying replica selection and data placement algorithms; the query workload is represented as a hypergraph and processed with a hypergraph partitioning algorithm. Kaya et al. [100] model the workflow as a hypergraph and employ a partitioning algorithm to reduce the computational and storage load while trying to minimise the total amount of file transfers.
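The common thread of these methods is representing co-accessed data as hyperedges and minimising the weight of the edges cut by a partitioning. The following sketch shows the cut-weight computation on a toy workload; the hyperedges, weights and fixed partition are illustrative assumptions (real systems such as SWORD use dedicated partitioners rather than a hand-picked partition):

```python
# Tiny hypergraph model of a workload: each hyperedge is the set of
# data items touched by one transaction/query, with a frequency weight.
hyperedges = {
    "q1": ({"a", "b"}, 5),
    "q2": ({"b", "c", "d"}, 2),
    "q3": ({"d"}, 7),
}
partition = {"a": 0, "b": 0, "c": 1, "d": 1}  # item -> partition id

def cut_weight(hyperedges, partition):
    """Sum the weights of hyperedges spanning more than one partition;
    each such query/transaction becomes distributed, which hypergraph
    partitioners try to minimise."""
    return sum(w for items, w in hyperedges.values()
               if len({partition[i] for i in items}) > 1)

print(cut_weight(hyperedges, partition))  # only q2 is cut -> 2
```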

Comparative evaluation

In this section, we have carefully selected a set of criteria to evaluate the methods analysed in the "Data placement methodologies" section. The curated criteria are: (i) fixed data sets: whether the placement of data can be fixed a priori in sight of, e.g., regulations; (ii) constraint satisfaction: which constraint solving technique is used; (iii) granularity: the granularity of the resources considered; (iv) intermediate data handling: whether intermediate data, produced by, e.g., a running workflow, can also be handled; (v) multiple application handling: whether data placement over multiple applications can be supported; (vi) increasing data size: whether the growth rate of data is taken into account; (vii) replication: whether data replication is supported; (viii) optimisation criteria: which optimisation criteria are exploited; (ix) additional system-related information: whether additional knowledge is captured which could enable the production of better data placement solutions. An overview of the evaluation based on these criteria is given in Table 4. First of all, we can clearly see that no approach covers all the criteria considered. Three approaches (Yuan et al. [78], BitDew [94] and Kosar [81]) can be distinguished, and can also be considered complementary to each other. However, only in [78] has a suitable optimisation/scheduling algorithm for data placement been realised.

Considering each criterion in isolation, we can observe in Table 4 that very few approaches consider the existence of a fixed or semi-fixed location of data sets. Further, such approaches tend to prescribe a fixed, a-priori solution to the data placement problem, which can lead to sub-optimal solutions, as optimisation opportunities are lost when more flexible, semi-fixed location constraints would suffice. For instance, fixing the placement of a dataset to a certain DC might be sub-optimal in case multiple DCs exist in the same location.

Three main classes of data placement optimisation techniques can be observed: (i) meta-heuristic search (such as PSO or genetic algorithms), used to inspect the available solution space more flexibly and efficiently find a near-optimal solution; (ii) hierarchical partitioning algorithms (based on the BEA) that attempt to group data recursively based on data dependencies in order to reduce either the number or the cost of data transfers. The BEA is used as the baseline for many of these algorithms, and it also supports dynamicity: new data sets are handled by initially encoding them in a reduced table-based form before applying the BEA, and after an initial solution is found, it can be modified by adding cluster/VM capacity constraints into the model; (iii) hypergraph encodings of the Big Data placement problem, where nodes represent data and machines while hyperedges connect them. Through such modelling, traditional or extended hypergraph partitioning techniques can be applied to find the best possible partitions. There can be a trade-off between different parameters or metrics that should be explored by all data placement algorithms, irrespective of the constraint solving technique used. However, such a trade-off is not usually explored, as in most cases only one metric is employed for optimisation.

Granularity constitutes the criterion with the least versatility, as most of the approaches have selected a fine-grained approach for data-to-resource mapping, which is suitable for the Cloud ecosystem.

Real-world applications are dynamic and can have varying load at different points in time. Furthermore, applications can produce additional data which can be used in subsequent computation steps. Thus, data placement should be a continuous process that re-validates decisions taken at different points in time. However, most data placement approaches focus mainly on the initial positioning of Big Data and do not interfere with the actual runtime of the applications.

There also seems to be a dependency between the intermediate-data-handling criterion and the fixed data sets one: the majority of the proposed approaches satisfying the former also satisfy the latter. This looks like a logical outcome, as dynamicity is highly correlated with the need to better handle some inherent data characteristics. Further, a large volume of intermediate data can also have a certain gravity effect that resembles the one concerning fixed data.

The multi-application criterion is hardly supported at all. This can be due to the following facts: (i) multi-application support can increase the complexity and the size of the problem; (ii) it can also impact the solution quality and solution time, which can be undesirable especially for approaches that already supply sub-optimal solutions.

Only the approach in [78] caters for data growth, by reserving additional space in already allocated nodes based on statically specified margins. However, such an approach is static in nature and faces two unmet challenges: support for dynamic data growth monitoring, especially suitable where data can grow fast, and dynamic storage capacity determination, through, e.g., data growth prediction, to better support pro-active data allocation. Nevertheless, if we consider all dynamicity criteria together, we can nominate the approach in [78] as the one with the highest level of dynamicity, which is another indication of why it can be considered prominent.

Data replication has been widely researched in the context of distributed systems but has not been extensively employed in data placement. Thus, we believe that a research gap exists here, especially as the few approaches that do support replication (such as SWORD [98], Kosar [81], Kayyoor [99] and BitDew [94]) still lack suitable details or rely on very simple policies driven by user input.

We can observe that minimising the number or cost of data transfers is a well-accepted optimisation criterion. Furthermore, data-partitioning-related criteria, such as skew factor and cut weight, have mostly been employed in the hypergraph-based methods. In some cases, multiple criteria are considered, which are: (i) either reduced to a single overall one; or (ii) not handled through any kind of optimisation but just enforced as policies. Overall, the state of the art does not perform convincingly against this comparison criterion, so there is considerable room for improvement here.

Finally, many of the methods also consider additional input to achieve a better solution. The most commonly exploited extra information comprises data access patterns and node (VM or PM) profiles used, e.g., to inspect their (data) processing speed. However, while both are important, usually only one of the two is exploited in these methods.

Lessons learned and future research directions

To conclude our survey, in this section we discuss the issues of the current state of the art and the research gaps and opportunities related to data storage and placement. Further, we supply research directions towards a complete DLMS for the Big Data-Cloud ecosystem.

Data lifecycle management

Challenges and issues

This subsection discusses how the data storage and placement challenges presented can be combined and viewed from the perspective of a holistic DLMS of the future. Such a DLMS should be able to cope with optimal data storage and placement in a way that considers the Big Data processing required, along with the functional and non-functional variability space of the given Cloud resources at hand, in each application scenario. This implies the ability to consider both private and public Clouds, offered by one or several Cloud vendors, according to the specifics of each use case, while making the appropriate decisions on how the data should be stored, placed, processed and eventually managed.

Considering cross-Cloud application deployment alone as the means of fully exploiting the benefits of the Cloud paradigm obscures the important challenge of data-awareness. Data-awareness refers to the need to support an application deployment process that considers the locations of data sources, their volume and velocity characteristics, as well as any applicable security and privacy constraints. From the DLM perspective, this means that the dependencies between application components and all data sources should also be considered. A reasonable implication is that components requiring frequent access to data artefacts, found at rest in certain data stores, cannot be placed in a different Cloud, or even at a significant physical and network distance from the actual storage location. If such aspects are ignored, application performance will certainly degrade, as expensive data migrations may be incurred, while legislation conformance issues might also arise.

Future research directions

Among the most prominent research directions, we highlight the design and implementation of a holistic DLMS, able to cope with all of the above-mentioned aspects of data management while employing the appropriate strategies for benefiting from the multi-Cloud paradigm. It is important to note that data placement in virtualised resources is generally subject to long-term decisions, as potential data migrations generally incur immense costs, which may be amplified by data gravity aspects that can trigger subsequent changes in the application placement. Based on this, we consider the following aspects that should sketch the main functionality of the DLMS of the future, one able to cope with Big Data management and processing by truly taking advantage of the abundance of resources in the Cloud computing world:

Use advanced modelling techniques that consider metadata schemas for setting the scope of truly exploitable data modelling artefacts. This refers to managing the modelling task in a way that covers the description of all the V's (e.g., velocity, volume, value, variety and veracity) characterising the Big Data to be processed. Proper, multi-dimensional data modelling will allow for an adequate description of the data placement problem.

Perform optimal data placement across multiple Cloud resources based on the data modelling and user-defined goals, requirements and constraints.

Use efficient, distributed monitoring functionalities for observing the status of the Big Data stored or processed and for detecting any migration or reconfiguration opportunities.

Employ the appropriate replication, fail-over and backup techniques, considering and exploiting at the same time the functionalities already offered by public Cloud providers.

Based on such opportunities, continuously make reconfiguration and migration decisions, consistently weighing the real penalty of an overall application reconfiguration, always in sight of the user constraints, goals and requirements that should drive the configuration of computational resources and the scheduling of application tasks.

Design and implement security policies in order to guarantee that certain regulations (e.g., General Data Protection Regulation) are constantly and firmly respected (e.g., data artefacts should not be stored or processed outside the European Union) while at the same time the available Cloud providers’ offerings are exploited according to the data owners’ privacy needs (e.g., exploit the data sanitization service when migrating or just removing data from a certain Cloud provider).

Data storage

In this section, we highlight the challenges for holistic data lifecycle management with respect to both the current DBMS and DFS systems and propose future research directions to overcome such challenges.

In the past decade, the DBMS landscape has evolved significantly with respect to data models and supported non-functional features, driven by Big Data and the related requirements of Big Data applications (see the "Non-functional data management features" section). The resulting heterogeneous DBMS landscape provides many new opportunities for Big Data management while simultaneously imposing new challenges. The variety of data models offers domain-specific solutions for different kinds of data structures. Yet, the vast number of existing DBMSs per data model leads to a complex DBMS selection process. Here, the functional features of potential DBMSs need to be carefully evaluated (e.g., NoSQL DBMSs do not offer a common query interface even within the same data model). For the non-functional features, the decision process is twofold: (i) a qualitative analysis (as carried out in the "Comparison of selected DBMSs" section) should be conducted to narrow down the potential DBMSs; (ii) quantitative evaluations should be performed over the major non-functional features based on existing evaluation frameworks.

While collecting data from many distributed and diverse data sources is a challenge [8], modern Big Data applications are typically built upon multiple different data structures. Current DBMSs cater for domain-specific data structures thanks to the variety of data models supported (as shown in our analysis in Table 2). However, exploiting this variety of data models typically leads to the integration of multiple different DBMSs in modern Big Data applications. Consequently, the operation of a DBMS needs to be abstracted to ease the integration of different DBMSs into Big Data applications and to fully exploit the required features (such as scalability or elasticity). Here, research approaches in Cloud-based application orchestration can be exploited [101, 102]. While the current DBMS landscape already moves towards the Big Data domain, the optimal operation of large-scale or even geo-distributed DBMSs remains a challenge, as the non-functional features differ significantly across DBMSs (especially when using Cloud resources [42, 61, 103]).

In general, a DFS provides scalability, network transparency, fault tolerance, concurrent data (I/O) access and data protection [104]. It is worth noting that, in the Big Data domain, scalability must be achieved without increasing the degree of replication of stored data (particularly in the Cloud ecosystem when combined with private/local data storage systems): the storage system must increase user data availability but not the overheads. Resource sharing is a complex task whose severity can increase many-fold when managing Big Data. In today's Cloud ecosystem, we lack a unified model that offers a single interface connecting multiple Cloud-based storage models (such as Amazon S3 objects) and DFSs. Apart from that, synchronization in DFSs is a well-known issue, and as the degree of data access concurrency increases, synchronization can certainly become a performance bottleneck. Moreover, in some cases it has been observed that the performance of DFSs is low compared to local file systems [105, 106]. Furthermore, network transparency is a crucial, performance-related concern, especially when handling Big Data (which is now distributed across multiple Clouds). Most DFSs use the transmission control protocol or the user datagram protocol for communication, but smarter transport mechanisms need to be devised. In a DFS, fault tolerance is achieved via lineage, checkpointing and the replication of metadata (and data objects) [104]. While stateless DFSs have lower overheads in managing file states when reconnecting after failures, stateful approaches are also in use. For DFSs, failures must be handled quickly and seamlessly across the Big Data management infrastructure. On the other hand, there is no well-accepted approach to data access optimization: methods such as data locality and multi-level caching are used case by case. Finally, securing data in the DFS-Cloud ecosystem is a challenge due to the interconnection of so many diverse hardware and software components.

To address the identified challenges for data storage in Big Data lifecycle management, novel Big Data-centric evaluations are required that ease the selection and operation of large-scale DBMSs.

The growing domain of hybrid transaction/analytical processing workloads needs to be considered for the existing data models. Moreover, comparable benchmarks for different data models need to be established [ 107 ] and qualitative evaluations need to be performed across all data model domains as well.

To select an optimal combination of a distributed DBMS and Cloud resources, evaluation frameworks across different DBMS, Cloud resource and workload domains are required [ 108 ]. Such frameworks ease the DBMS selection and operation for Big Data lifecycle management.

Holistic DBMS evaluation frameworks are required to enable the qualitative analysis across all non-functional features in a comparable manner. In order to achieve this, frameworks need to support complex DBMS adaptation scenarios, including scaling and failure injection.

DBMS adaptation strategies need to be derived and integrated into the orchestration frameworks to enable the automated operation (to cope with workload fluctuations) of a distributed DBMS.

Qualitative DBMS selection guidelines need to be extended with respect to operational and adaptation features of current DBMS (i.e., support for orchestration frameworks to enable automated operation and adaptation and the integration support into Big Data frameworks).

Similar to the above research directions for DBMSs, we also mention below the research directions for DFSs.

For efficient resource sharing among multiple Cloud service providers/components, a single unified interface must handle complex issues such as seamless workload distribution, an improved data access experience and faster read-write synchronizations, together with an increased level of data serialization for DFSs.

We also advocate for using smarter replica-assignment policies to achieve better workload balance and efficient storage space management.

To counter the synchronization issue in DFSs, a generic solution could be to cache the data on the client or local server side, but such an approach can itself become a bottleneck in Big Data management scenarios. Thus, exploratory research is needed in this direction.

As data diversity and network heterogeneity increase, an abstract communication layer must be put in place to address the issue of network transparency. Such an abstraction can handle different types of communication easily and efficiently.

Standard security mechanisms (such as ACLs) are in place for data security. However, with the Cloudification of the file system, data become more vulnerable due to the interconnection of diverse, distributed, heterogeneous computing components. Thus, proper security measures must be built-in features of tomorrow's DFSs.

Data placement

The following data placement challenges and corresponding research directions are in line with our analysis in the "Comparative evaluation" section.

Fixed data set size

We have observed data placement methods able to fix the location of data sets based on respective (privacy) regulations, laws or user requirements. Such requirements indicate that data placement should be restrained within a certain country, set of countries or even continent. However, this kind of semi-fixed constraint is handled in a rather static way, by pre-selecting the place for such data sets.

Constraint solving

Exhaustive solution techniques reach optimal solutions but suffer from scalability issues and high execution times (especially for medium- and big-sized problem instances). On the other hand, meta-heuristics (such as PSO) seem more promising, as they can produce near-optimal solutions faster while also achieving better scalability. However, they need proper configuration and modelling, which can be time-consuming, and it is not always guaranteed that near-optimal solutions will be produced.

Granularity

Most of the evaluated methods support a fine-grained approach to dataset placement. However, all such methods assume that resources are fixed in number. Such assumptions are inflexible in light of the following issues: (i) gradual data growth can saturate the resources assigned to data; in fact, a whole private storage infrastructure could be saturated for this reason; (ii) data should be flexibly (re-)partitioned to tackle workload variability.

Multiple applications

Only three of the evaluated methods (see Table 4) can handle multiple applications, and only in a very limited fashion. Such handling is challenging, especially when different applications come with conflicting requirements. It must also be dynamic, due to the changes brought by application execution as well as by other factors (e.g., application requirement and Cloud infrastructure changes).

Data growth

Data sets can grow over time. Only one method [78] in the previous analysis is able to handle changes in data size. It employs a threshold-based approach to check when data need to be moved or whether the allocated resources remain adequate for storing the data as they grow. However, no detailed explanation is supplied concerning how the threshold is computed.

Data replication

It is usually challenging to find the best possible trade-off between cost and replication degree to enable cost-effective data replication.

Optimisation criteria

Data transfer and replication management is a complex process [109] due to the completely distributed nature of the Cloud ecosystem, and it is further complicated by unequal data access speeds. The number or cost of data transfers is a well-accepted criterion for optimising data placement. However, it can also be quite restrictive. First, there are cases where both of these metrics need to be considered. For instance, suppose that we need to place two datasets, initially situated in one VM, onto other VMs, as this VM will soon become unavailable. If we only consider the transfer number, the movement may be performed in an arbitrary way, even migrating data to another DC although there is certainly space in the current one. Conversely, there are cases where cost could be minimised at the price of an increased number of transfers, which could impact application performance. Second, data placement has mainly been considered in an isolated manner, without examining user requirements; yet it can greatly affect application performance and cost.

Additional information

Apart from extracting data access patterns and node profiles, we believe that more information is needed for a better data placement solution.

Fixed data set size: To guarantee the true, optimal satisfaction of user requirements and optimisation objectives, we suggest treating semi-fixed constraints in a more suitable and flexible manner, as a respective non-static part of the location-aware optimisation problem to be solved.

Constraint solving: We propose the use of hybrid approaches (i.e., combining exhaustive and meta-heuristic search techniques) so as to rapidly obtain (within an acceptable and practically employable execution time) optimal or near-optimal results in a scalable fashion. For instance, constraint programming could be combined with local search: the former could be used to find a good initial solution, while the latter could perform neighbourhood search to find a better result, as sketched below. In addition, a different and more scalable modelling of the optimisation problem might enable standard exhaustive solution techniques to run even on medium-sized problem instances. Finally, solution learning from history could be adopted to fix parts of the optimisation problem and thus substantially reduce the solution space to be examined.
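The following sketch illustrates the hybrid idea in miniature: a greedy first-fit pass stands in for the exact solver that produces the initial feasible solution, and a single-move local search then tries to lower the transfer cost. The instance data, the run-at-first-input rule and the neighbourhood definition are all illustrative assumptions:

```python
import itertools

# Illustrative instance: sizes (GB), capacities (GB) and task inputs.
sizes = {"d1": 40, "d2": 25, "d3": 60, "d4": 10}
capacity = {"v1": 80, "v2": 80}
tasks = [["d1", "d2"], ["d2", "d3"], ["d3", "d4"]]

def cost(place):
    # Count inter-element transfers; assume each task runs where its
    # first input lives (a simplifying assumption).
    return sum(sizes[d]
               for inputs in tasks
               for d in inputs
               if place[d] != place[inputs[0]])

def feasible(place):
    load = {v: 0 for v in capacity}
    for d, v in place.items():
        load[v] += sizes[d]
    return all(load[v] <= capacity[v] for v in capacity)

# Phase 1 (stand-in for an exact/CP solver): greedy first-fit by
# decreasing size; assumes the instance admits a feasible placement.
place = {}
for d in sorted(sizes, key=sizes.get, reverse=True):
    place[d] = next(v for v in capacity
                    if sum(sizes[x] for x, y in place.items() if y == v)
                       + sizes[d] <= capacity[v])

# Phase 2: local search; try single-dataset moves that keep
# feasibility and strictly lower the transfer cost.
improved = True
while improved:
    improved = False
    for d, v in itertools.product(sizes, capacity):
        cand = dict(place); cand[d] = v
        if feasible(cand) and cost(cand) < cost(place):
            place, improved = cand, True

print(place, cost(place))
```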

Granularity: There is a need for dynamic approaches for data placement which do take into account the workload fluctuation and the data growth to both partition data as well as optimally place them in a set of resources with a size that is dynamically identified .

Multiple applications: To handle applications' conflicting requirements and the dynamicity of the context (e.g., changes of infrastructure or application requirements), different techniques to solve the (combined) optimisation problem are required. First, soft constraints could be used to solve this problem even if it is over-constrained (e.g., producing a solution that violates the least number of these preferences). Second, we could prioritise the applications and/or their tasks. Third, distributed solving techniques could be used to produce application-specific optimisation problems of reduced complexity; this would require transforming the overall problem into sub-problems which retain as much as possible the main constraints and requirements of each relevant application. Finally, complementary to these distributed solving techniques, replication could also be employed: by giving each application its own copy of the originally shared data, applications become completely independent, which would then allow us to solve data placement individually for each of them.

Data growth: There is a need for a more sophisticated approach which exploits the data (execution) history as well as data size prediction and data (type) similarity techniques to solve the data growth issue. Similarity can be learned by knowing the context of data (e.g., by assuming the same context has been employed for similar data over time by multiple users), while statistical methods can predict the data growth. Such an approach can also be used for new data sets for which no prior knowledge exists (known as the cold-start problem).
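As a minimal illustration of the statistical-prediction part, the sketch below fits a linear trend to a dataset's size history and extrapolates it to decide how much capacity to reserve. The history values, the weekly granularity and the linear model are all assumptions; real predictors would likely use richer models:

```python
# Observed dataset sizes over time: (week, GB); values are made up.
history = [(0, 100.0), (1, 112.0), (2, 125.0), (3, 139.0)]

def linear_fit(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

slope, intercept = linear_fit(history)
horizon = 8  # weeks ahead
predicted = intercept + slope * horizon
print(f"~{slope:.1f} GB/week; reserve about {predicted:.0f} GB "
      f"at week {horizon}")
```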

Data replication: For data replication, we suggest dynamically computing the replication degree by considering the application size, data size, data access pattern, data growth rate, user requirements and the capabilities of Cloud services. Such a solution could also rely on a weight calculation method for determining the relative importance of each of these factors.
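A weighted calculation of this kind could look like the following sketch, where the factor names, the weights and the mapping of the score to a replica count are purely illustrative assumptions:

```python
# Normalised factor values in [0, 1]; names and values are assumed.
factors = {
    "data_size": 0.7,
    "access_rate": 0.9,
    "growth_rate": 0.4,
    "availability_req": 0.8,
}
weights = {
    "data_size": -0.2,        # bigger data: replication costs more
    "access_rate": 0.4,       # hot data benefits from more replicas
    "growth_rate": 0.1,
    "availability_req": 0.5,
}

def replication_degree(factors, weights, r_min=1, r_max=5):
    """Map a weighted factor score to an integer replica count."""
    score = sum(weights[k] * factors[k] for k in factors)
    score = max(0.0, min(1.0, score))  # clamp to [0, 1]
    return r_min + round(score * (r_max - r_min))

print(replication_degree(factors, weights))  # 4 for these values
```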

Optimisation criteria: An interesting research direction is to explore ways in which data placement and task scheduling could be solved either in conjunction or in a clever but independent manner, such that both take into account the same set of (high-level) user requirements. This could lead to solutions which are in concert and optimal with respect to both the data and the computation aspects.

Additional information: We advocate that the additional information to be collected or derived should include: (i) co-location of frequently interacting tasks and data; (ii) data dependencies, exploited to achieve effective data partitioning. A similar approach is employed by Wang et al. [83], where data are grouped together at a finer granularity, with precautions taken not to store different data blocks of the same data on the same node; (iii) data variability: data can take different forms, and each form might require a different machine configuration for optimal storage and processing. In this case, profiling should be extended to also capture this kind of machine performance variation, which could be quite beneficial for more data-form-focused placement. In fact, whole approaches are dedicated to dealing with particular data forms; for instance, graph-analytics-oriented data placement algorithms exploit the fact that data are stored in the form of graphs to more effectively select the right techniques and algorithms for solving the data placement problem. While special-purpose approaches might be suitable for individual data forms, they are not the right choice for handling many kinds of data at once. As such, we believe an important future direction is the ability to optimally handle data of multiple forms, so as to enhance the applicability of a data placement algorithm and make it suitable for different kinds of applications instead of a single one.

Concluding remarks

The primary aim of this survey is to provide a holistic overview of the state of the art related to both data storage and placement in the Cloud ecosystem. We acknowledge that some surveys on various aspects of Big Data do exist, focusing on the functional aspect and mainly on Big Data storage issues; this survey plays a complementary role with respect to them. In particular, we cover multiple parts of the Big Data management architecture (such as DLM, data storage systems and data placement techniques), which were neglected in the other surveys, under the prism of non-functional properties. Further, our contribution on Big Data placement is quite unique. In addition, the in-depth analysis in each main article section is driven by a well-designed set of evaluation criteria. Such an analysis also assists in a better categorisation of the respective approaches (or technologies) involved in each part.

Our survey enables readers to better understand which solution could be utilised under which non-functional requirements, thus assisting in the construction of user-specific Big Data management systems according to the non-functional requirements posed. Subsequently, we have described relevant challenges that can pave the way for the proper evolution of such systems in the future; each challenge prescribed in the "Lessons learned and future research directions" section has been drawn from the conducted analysis. Lastly, we have supplied a set of interesting and emerging future research directions concerning both the functionalities related to Big Data management (i.e., Big Data storage and placement) and Big Data lifecycle management as a whole, in order to address the identified challenges.

https://strongboxdata.com/products/stronglink/ .

https://www.ibm.com/services/artificial-intelligence .

http://nosql-database.org/ lists over 225 DBMS for semi-structured data.

https://db-engines.com/en/ranking .

https://www.voltdb.com/ .

https://www.cockroachlabs.com/ .

http://basho.com/products/riak-kv/ .

https://redis.io/ .

https://www.mongodb.com/ .

https://www.couchbase.com/ .

http://cassandra.apache.org/ .

https://hbase.apache.org/ .

https://neo4j.com/ .

http://janusgraph.org/ .

https://www.influxdata.com/

https://prometheus.io/ .

https://www.arangodb.com/ .

https://orientdb.com/ .

https://docs.gluster.org/en/latest/ .

http://docs.ceph.com/docs/mimic/cephfs/ .

Abbreviations

ACL: access control list

BDaaS: Big Data-as-a-service

BEA: bond energy algorithm

BI: business intelligence

CDM: cognitive data management

DaaS: data-as-a-service

DBaaS: database-as-a-service

DC: data center

DDBMS: distributed database management system

DFS: distributed file system

DLMS: data lifecycle management system

DMS: data management system

HA: high availability

HDFS: Hadoop distributed file system

IoT: internet-of-things

NFS: Network File System

PaaS: platform-as-a-service

PSO: particle swarm optimization

PVFS: parallel virtual file system

QoS: quality of service

SLR: systematic literature review

StaaS: storage-as-a-service

Khan N, Yaqoob I, Hashem IAT, et al. Big data: survey, technologies, opportunities, and challenges. Sci World J. 2014;2014:712826.


Kaisler S, Armour F, Espinosa JA, Money W. Big data: issues and challenges moving forward. In: System sciences (HICSS), 2013 46th Hawaii international conference on, IEEE. 2013. pp. 995–1004.

Zheng Z, Zhu J, Lyu MR. Service-generated big data and big data-as-a-service: an overview. In: Big Data (BigData Congress), 2013 IEEE international congress on, IEEE. 2013. pp. 403–10.

Chen M, Mao S, Liu Y. Big data: a survey. Mob Netw Appl. 2014;19(2):171–209.


Inukollu VN, Arsi S, Ravuri SR. Security issues associated with big data in cloud computing. Int J Netw Secur Appl. 2014;6(3):45.

Wang C, Wang Q, Ren K, Lou W. Privacy-preserving public auditing for data storage security in cloud computing. In: Infocom, 2010 proceedings IEEE, IEEE. 2010. pp. 1–9.

Chaudhuri S. What next?: a half-dozen data management research goals for big data and the cloud. In: PODS, Scottsdale, AZ, USA. 2012. pp. 1–4.

Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1.

Verma D. Supporting service level agreements on IP networks. Indianapolis: Macmillan Technical Publishing; 1999.

Sakr S, Liu A, Batista DM, Alomari M. A survey of large scale data management approaches in cloud environments. IEEE Commun Surv Tutor. 2011;13(3):311–36.

Wu L, Yuan L, You J. Survey of large-scale data management systems for big data applications. J Comput Sci Technol. 2015;30(1):163.

Oussous A, Benjelloun FZ, Lahcen AA, Belfkih S. Big data technologies: a survey. J King Saud Univ Comput Inf Sci. 2017;30(4):431–48.

Grolinger K, Higashino WA, Tiwari A, Capretz MA. Data management in cloud environments: NoSQL and NewSQL data stores. J Cloud Comput Adv Syst Appl. 2013;2(1):22.

Zhang H, Chen G, Ooi BC, Tan KL, Zhang M. In-memory big data management and processing: a survey. IEEE Trans Knowl Data Eng. 2015;27(7):1920–48.

Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU. The rise of “big data” on cloud computing: review and open research issues. Inf Syst. 2015;47:98–115.

Ball A. Review of data management lifecycle models. Bath: University of Bath; 2012.

Demchenko Y, de Laat C, Membrey P. Defining architecture components of the big data ecosystem. In: International conference on collaboration technologies and systems. 2014. pp. 104–12.

Pääkkönen P, Pakkala D. Reference architecture and classification of technologies, products and services for big data systems. Big Data Res. 2015;2(4):166–86.

NBD-PWG. NIST big data interoperability framework: volume 2, big data taxonomies. Tech. rep., NIST, USA 2015. Special Publication 1500-2.

Organisation for Economic Co-operation and Development. Data-driven innovation: big data for growth and well-being. Paris: OECD Publishing; 2015.

Kaufmann M. Towards a reference model for big data management. Research report, University of Hagen. 2016. Retrieved from https://ub-deposit.fernuni-hagen.de/receive/mir_mods_00000583 . Retrieved 15 July 2016.

Höfer C, Karagiannis G. Cloud computing services: taxonomy and comparison. J Internet Serv Appl. 2011;2(2):81–94.

Baru C, Moore R, Rajasekar A, Wan M. The sdsc storage resource broker. In: CASCON first decade high impact papers, IBM Corp.; 2010. pp. 189–200.

Chasen JM, Wyman CN. System and method of managing metadata data 2004. US Patent 6,760,721.

Gómez A, Merseguer J, Di Nitto E, Tamburri DA. Towards a uml profile for data intensive applications. In: Proceedings of the 2nd international workshop on quality-aware DevOps, ACM. 2016. pp. 18–23.

Verginadis Y, Pationiotakis I, Mentzas G. Metadata schema for data-aware multi-cloud computing. In: Proceedings of the 14th international conference on INnovations in Intelligent SysTems and Applications (INISTA). IEEE. 2018.

Binz T, Breitenbücher U, Kopp O, Leymann F. Tosca: portable automated deployment and management of cloud applications. In: Advanced web services. Springer; 2014. pp. 527–49.

Kritikos K, Domaschka J, Rossini A. Srl: a scalability rule language for multi-cloud environments. In: Cloud computing technology and science (CloudCom), 2014 IEEE 6th international conference on, IEEE. 2014. pp. 1–9.

Rossini A, Kritikos K, Nikolov N, Domaschka J, Griesinger F, Seybold D, Romero D, Orzechowski M, Kapitsaki G, Achilleos A. The cloud application modelling and execution language (camel). Tech. rep., Universität Ulm 2017.

Das S, Nishimura S, Agrawal D, El Abbadi A. Albatross: lightweight elasticity in shared storage databases for the cloud using live data migration. Proc VLDB Endow. 2011;4(8):494–505.

Lu C, Alvarez GA, Wilkes J. Aqueduct: online data migration with performance guarantees. In: Proceedings of the 1st USENIX conference on file and storage technologies, FAST ’02. USENIX Association 2002.

Stonebraker M, Devine R, Kornacker M, Litwin W, Pfeffer A, Sah A, Staelin C. An economic paradigm for query processing and data migration in mariposa. In: Parallel and distributed information systems, 1994., proceedings of the third international conference on, IEEE. 1994. pp. 58–67.

Brubeck DW, Rowe LA. Hierarchical storage management in a distributed VOD system. IEEE Multimedia. 1996;3(3):37–47.

Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J. Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng. 2002;28(8):721–34.

Gessert F, Wingerath W, Friedrich S, Ritter N. NoSQL database systems: a survey and decision guidance. Comput Sci Res Dev. 2017;32(3–4):353–65.

Sakr S. Cloud-hosted databases: technologies, challenges and opportunities. Clust Comput. 2014;17(2):487–502.

Cattell R. Scalable SQL and NoSQL data stores. Acm Sigmod Rec. 2011;39(4):12–27.

Gray J. Database and transaction processing performance handbook. In: The benchmark handbook for database and transaction systems. 2nd ed. Digital Equipment Corp. 1993.

Traeger A, Zadok E, Joukov N, Wright CP. A nine year study of file system and storage benchmarking. ACM Trans Storage. 2008;4(2):5.

Agrawal D, El Abbadi A, Das S, Elmore AJ. Database scalability, elasticity, and autonomy in the cloud. In: International conference on database systems for advanced applications. Springer. 2011. pp. 2–15.

Séguin C, Le Mahec G, Depardon B. Towards elasticity in distributed file systems. In: Cluster, cloud and grid computing (CCGrid), 2015 15th IEEE/ACM international symposium on, IEEE. 2015. pp. 1047–56.

Seybold D, Wagner N, Erb B, Domaschka J. Is elasticity of scalable databases a myth? In: Big Data (Big Data), 2016 IEEE international conference on, IEEE. 2016. pp. 2827–36.

Gilbert S, Lynch N. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. Acm Sigact News. 2002;33(2):51–9.

Bermbach D, Kuhlenkamp J. Consistency in distributed storage systems. In: Networked systems. Springer. 2013. pp. 175–89.

Lakshman A, Malik P. Cassandra: a decentralized structured storage system. ACM SIGOPS Oper Syst Rev. 2010;44(2):35–40.

Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE. Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst. 2008;26(2):4.

Pavlo A, Aslett M. What’s really new with newsql? ACM Sigmod Rec. 2016;45(2):45–55.

Corbellini A, Mateos C, Zunino A, Godoy D, Schiaffino S. Persisting big-data: the NoSQL landscape. Inf Syst. 2017;63:1–23.

Davoudian A, Chen L, Liu M. A survey on NoSQL stores. ACM Comput Surv. 2018;51(2):40.

Jensen SK, Pedersen TB, Thomsen C. Time series management systems: a survey. IEEE Trans Knowl Data Eng. 2017;29(11):2581–600.

Bader A, Kopp O, Falkenthal M. Survey and comparison of open source time series databases. In: BTW (Workshops). 2017. pp. 249–68.

Abadi JD. Data management in the cloud: limitations and opportunities. IEEE Data Eng Bull. 2009;32:3–12.

Pritchett D. Base: an acid alternative. Queue. 2008;6(3):48–55.

Codd EF. Extending the database relational model to capture more meaning. ACM Trans Database Syst. 1979;4(4):397–434.

Aslett M. How will the database incumbents respond to nosql and newsql. The San Francisco. 2011;451:1–5.

Sadalage PJ, Fowler M. NoSQL distilled. Addison-Wesley; 2012. ISBN-10 0321826620.

Seybold D, Hauser CB, Volpert S, Domaschka J. Gibbon: an availability evaluation framework for distributed databases. In: OTM confederated international conferences “On the Move to Meaningful Internet Systems”. Springer. 2017. pp. 31–49.

Seybold D, Domaschka J. Is distributed database evaluation cloud-ready? In: Advances in databases and information systems. Springer. 2017. pp. 100–8.

Barahmand S, Ghandeharizadeh S. BG: a benchmark to evaluate interactive social networking actions. In: CIDR. Citeseer. 2013.

Cooper BF., Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with ycsb. In: Proceedings of the 1st ACM symposium on Cloud computing, ACM. 2010. pp. 143–54.

Kuhlenkamp J, Klems M, Röss O. Benchmarking scalability and elasticity of distributed database systems. Proc VLDB Endow. 2014;7(12):1219–30.

Bermbach D, Tai S. Benchmarking eventual consistency: lessons learned from long-term experimental studies. In: Cloud engineering (IC2E), 2014 IEEE international conference on, IEEE. 2014. pp. 47–56.

Domaschka J, Hauser CB, Erb B. Reliability and availability properties of distributed database systems. In: Enterprise distributed object computing conference (EDOC), 2014 IEEE 18th international, IEEE. 2014. pp. 226–33.

Brewer E. Cap twelve years later: How the “rules” have changed. Computer. 2012;45(2):23–9.

Klems M, Bermbach D, Weinert R. A runtime quality measurement framework for cloud database service systems. In: Quality of information and communications technology (QUATIC), 2012 eighth international conference on the, IEEE. 2012. pp. 38–46.

Abadi D, Agrawal R, Ailamaki A, Balazinska M, Bernstein PA, Carey MJ, Chaudhuri S, Chaudhuri S, Dean J, Doan A. The beckman report on database research. Commun ACM. 2016;59(2):92–9.

NBD-PWG. NIST big data interoperability framework. Tech. rep., NIST, USA 2015. Special Publication 1500-6.

Kachele S, Spann C, Hauck FJ, Domaschka J. Beyond iaas and paas: an extended cloud taxonomy for computation, storage and networking. In: Utility and cloud computing (UCC), 2013 IEEE/ACM 6th international conference on, IEEE. 2013. pp. 75–82.

Levy E, Silberschatz A. Distributed file systems: concepts and examples. ACM Comput Surv. 1990;22(4):321–74.

Muthitacharoen A, Morris R, Gil TM, Chen B. Ivy: a read/write peer-to-peer file system. ACM SIGOPS Oper Syst Rev. 2002;36(SI):31–44.

Ross RB, Thakur R, et al. PVFS: a parallel file system for Linux clusters. In: Proceedings of the 4th annual Linux showcase and conference. 2000. pp. 391–430.

Yu B, Pan J. Sketch-based data placement among geo-distributed datacenters for cloud storages. In: INFOCOM, San Francisco: IEEE. 2016. pp. 1–9.

Greene WS, Robertson JA. Method and system for managing partitioned data resources. 2005. US Patent 6,922,685.

Greenberg A, Hamilton J, Maltz DA, Patel P. The cost of a cloud: research problems in data center networks. ACM SIGCOMM Comput Commun Rev. 2008;39(1):68–73.

Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Reactive nuca: near-optimal block placement and replication in distributed caches. ACM SIGARCH Comput Archit News. 2009;37(3):184–95.

Kosar T, Livny M. Stork: making data placement a first class citizen in the grid. In: Distributed computing systems, 2004. Proceedings. 24th international conference on, IEEE. 2004. pp. 342–9.

Xie T. Sea: a striping-based energy-aware strategy for data placement in raid-structured storage systems. IEEE Trans Comput. 2008;57(6):748–61.


Yuan D, Yang Y, Liu X, Chen J. A data placement strategy in scientific cloud workflows. Future Gener Comput Syst. 2010;26(8):1200–14.

Doraimani S, Iamnitchi A. File grouping for scientific data management: lessons from experimenting with real traces. In: Proceedings of the 17th international symposium on High performance distributed computing, ACM. 2008. pp. 153–64.

Cope JM, Trebon N, Tufo HM, Beckman P. Robust data placement in urgent computing environments. In: Parallel & distributed processing, 2009. IPDPS 2009. IEEE international symposium on, IEEE. 2009. pp. 1–13.

Kosar T, Livny M. A framework for reliable and efficient data placement in distributed computing systems. J Parallel Distrib Comput. 2005;65(10):1146–57.

Bell DA. Difficult data placement problems. Comput J. 1984;27(4):315–20.

Wang J, Shang P, Yin J. Draw: a new data-grouping-aware data placement scheme for data intensive applications with interest locality. IEEE Trans Magnetic. 2012;49(6):2514–20.

McCormick W, Schweitzer P, White T. Problem decomposition and data reorganisation by a clustering technique. Oper Res. 1972;20:993–1009.

Ebrahimi M, Mohan A, Kashlev A, Lu S. BDAP: a Big Data placement strategy for cloud-based scientific workflows. In: BigDataService, IEEE computer society. 2015. pp. 105–14.

Papaioannou TG, Bonvin N, Aberer K. Scalia: an adaptive scheme for efficient multi-cloud storage. In: Proceedings of the international conference on high performance computing, networking, storage and analysis. IEEE Computer Society Press. 2012. p. 20.

Er-Dun Z, Yong-Qiang Q, Xing-Xing X, Yi C. A data placement strategy based on genetic algorithm for scientific workflows. In: CIS, IEEE computer society. 2012. pp. 146–9.

Rafique A, Van Landuyt D, Reniers V., Joosen W. Towards an adaptive middleware for efficient multi-cloud data storage. In: Proceedings of the 4th workshop on CrossCloud infrastructures & platforms, Crosscloud’17. 2017. pp. 1–6.

Lan K, Fong S, Song W, Vasilakos AV, Millham RC. Self-adaptive pre-processing methodology for big data stream mining in internet of things environmental sensor monitoring. Symmetry. 2017;9(10):244.

Zhang J, Chen J, Luo J, Song A. Efficient location-aware data placement for data-intensive applications in geo-distributed scientific data centers. Tsinghua Sci Technol. 2016;21(5):471–81.

Hsu CH, Slagter KD, Chung YC. Locality and loading aware virtual machine mapping techniques for optimizing communications in mapreduce applications. Future Gener Comput Syst. 2015;53:43–54.

Xu Q, Xu Z, Wang T. A data-placement strategy based on genetic algorithm in cloud computing. Int J Intell Sci. 2015;5(3):145–57.

Chervenak AL, Smith DE, Chen W, Deelman E. Integrating policy with scientific workflow management for data-intensive applications. In: 2012 SC companion: high performance computing, networking storage and analysis. 2012. pp. 140–9.

Fedak G, He H, Cappello F. Bitdew: a programmable environment for large-scale data management and distribution. In: 2008 SC—international conference for high performance computing, networking, storage and analysis. 2008. pp. 1–12.

Deelman E, Singh G, Su MH, Blythe J, Gil Y, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci Program. 2005;13(3):219–37.

Chervenak A, Deelman E, Foster I, Guy L, Hoschek W, Iamnitchi A, Kesselman C, Kunszt P, Ripeanu M, Schwartzkopf B, et al. Giggle: a framework for constructing scalable replica location services. In: Proceedings of the 2002 ACM/IEEE conference on supercomputing. IEEE computer society press. 2002. pp. 1–17.

LeBeane M, Song S, Panda R, Ryoo JH, John LK. Data partitioning strategies for graph workloads on heterogeneous clusters. In: SC, Austin: ACM; 2015. pp. 1–12.

Quamar A, Kumar KA, Deshpande A. Sword: scalable workload-aware data placement for transactional workloads. In: Proceedings of the 16th international conference on extending database technology, EDBT ’13, ACM. 2013. pp. 430–41.

Kumar KA, Deshpande A, Khuller S. Data placement and replica selection for improving co-location in distributed environments. CoRR 2012. arXiv:1302.4168 .

Catalyurek UV, Kaya K, Uçar B. Integrated data placement and task assignment for scientific workflows in clouds. In: Proceedings of the Fourth International Workshop on Data-intensive Distributed Computing, New York, NY, USA: ACM; 2011. pp. 45–54.

Baur D, Seybold D, Griesinger F, Tsitsipas A, Hauser CB, Domaschka J. Cloud orchestration features: are tools fit for purpose? In: Utility and Cloud Computing (UCC), 2015 IEEE/ACM 8th international conference on, IEEE. 2015. pp. 95–101.

Burns B, Grant B, Oppenheimer D, Brewer E, Wilkes J. Borg, omega, and kubernetes. Queue. 2016;14(1):10.

Schad J, Dittrich J, Quiané-Ruiz JA. Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc VLDB Endow. 2010;3(1–2):460–71.

Thanh TD, Mohan S, Choi E, Kim S, Kim P. A taxonomy and survey on distributed file systems. In: Networked computing and advanced information management, 2008. NCM’08. Fourth international conference on, vol. 1, IEEE. 2008. pp. 144–9.

Ananthanarayanan G, Ghodsi A, Shenker S, Stoica I. Disk-locality in datacenter computing considered irrelevant. In: HotOS. 2011. p. 12.

Nightingale EB, Chen PM, Flinn J. Speculative execution in a distributed file system. In: ACM SIGOPS operating systems review, vol. 39, ACM. 2005. pp. 191–205.

Coelho F, Paulo J, Vilaça R, Pereira J, Oliveira R. Htapbench: Hybrid transactional and analytical processing benchmark. In: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ACM; 2017. pp. 293–304.

Seybold D, Keppler M, Gründler D, Domaschka J. Mowgli: Finding your way in the DBMS jungle. In: Proceedings of the 2019 ACM/SPEC international conference on performance engineering. ACM. 2019.

Allen MS, Wolski R. The livny and plank-beck problems: studies in data movement on the computational grid. In: Supercomputing, 2003 ACM/IEEE conference, IEEE. 2003. pp. 43.


Authors' contributions

" Introduction " is contributed by SM, DS, KK and YV; " Data lifecycle management (DLM) " is contributed by SM, KK and YV; " Methodology " is contributed by KK and SM; " Non-functional data management features " is contributed by DS and SM; " Data storage systems " is contributed by DS, SM and YV; " Data placement techniques " is contributed by KK and SM; and " Lessons learned and future research directions " is contributed by YV, SM, DS, KK. All authors read and approved the final manuscript.

Acknowledgements

The research leading to this survey paper has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 731664. The authors would like to thank the partners of the MELODIC project ( http://www.melodic.cloud/ ) for their valuable advices and comments.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

Not applicable.

Funding

This work is generously supported by the Melodic project (Grant Number 731664) of the European Union H2020 program.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and affiliations

Simula Research Laboratory, 1325, Lysaker, Norway

Somnath Mazumdar

Ulm University, Ulm, Germany

Daniel Seybold

ICS-FORTH, Heraklion, Crete, Greece

Kyriakos Kritikos

Institute of Communication and Computer Systems (ICCS), 9 Iroon Polytechniou Str., Athens, Greece

Yiannis Verginadis


Corresponding author

Correspondence to Kyriakos Kritikos .

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Mazumdar, S., Seybold, D., Kritikos, K. et al. A survey on data storage and placement methodologies for Cloud-Big Data ecosystem. J Big Data 6, 15 (2019). https://doi.org/10.1186/s40537-019-0178-3


Received: 04 October 2018

Accepted: 22 January 2019

Published: 11 February 2019

DOI: https://doi.org/10.1186/s40537-019-0178-3




TsinghuaDatabaseGroup/CloudDB

Cloud Database Papers

We continuously update this list of cloud database papers. Please let us know if we have missed any great papers :)

Table of Contents

0. Unmerged
1. Survey & Tutorial
2. Database as a Service
3. Auto-scaling & Partition
4. Disaggregation
5. Optimizer
6. Safety & Recovery
7. Hardware
8. Application & Industry
9. Challenges

[Monitor] Curino, C., Jones, E. P. C., Madden, S., & Balakrishnan, H. (2011). Workload-aware database monitoring and consolidation. Proceedings of the ACM SIGMOD International Conference on Management of Data, 313–324. [ paper ]

[Survey] Armbrust, M., Fox, A., Griffith, R., et al. (2009). Above the clouds: A Berkeley view of cloud computing. University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28. [ paper ]

[Survey] Jonas, E., Schleier-Smith, J., Sreekanti, V., Tsai, C.-C., Khandelwal, A., Pu, Q., Shankar, V., Carreira, J., Krauth, K., Yadwadkar, N., Gonzalez, J. E., Popa, R. A., Stoica, I., & Patterson, D. A. (2019). Cloud Programming Simplified: A Berkeley View on Serverless Computing. [ paper ]

[Survey] Li, F. (2018). Cloud native database systems at Alibaba: Opportunities and challenges. Proceedings of the VLDB Endowment, 11(12), 2263–2272. [ paper ]

[DBaaS] Depoutovitch, A., Chen, C., Chen, J., Larson, P., Lin, S., Ng, J., Cui, W., Liu, Q., Huang, W., Xiao, Y., & He, Y. (2020). Taurus Database: How to be Fast, Available, and Frugal in the Cloud. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1463–1478. [ paper ]

[DBaaS] Taft, R., Lang, W., Duggan, J., Elmore, A. J., Stonebraker, M., & DeWitt, D. (2016). STeP: Scalable tenant placement for managing database-as-a-service deployments. Proceedings of the 7th ACM Symposium on Cloud Computing, SoCC 2016, 388–400. [ paper ]

[DBaaS] Das, S., Li, F., Narasayya, V. R., & König, A. C. (2016). Automated demand-driven resource scaling in relational database-as-a-service. Proceedings of the ACM SIGMOD International Conference on Management of Data, 26-June-2016, 1923–1934. [ paper ]

[DBaaS] Narasayya, V., Menache, I., Singh, M., Li, F., Syamala, M., & Chaudhuri, S. (2015). Sharing Buffer Pool Memory in Multi-Tenant Relational Database-as-a-Service. Proceedings of the VLDB Endowment, 8(7), 726–737. [ paper ]

[Auto-scaling] Perron, M., Castro Fernandez, R., Dewitt, D., & Madden, S. (2020). Starling: A Scalable Query Engine on Cloud Functions. Proceedings of the ACM SIGMOD International Conference on Management of Data, 131–141. [ paper ]

[Auto-scaling] Shen, Z., Subbiah, S., Gu, X., & Wilkes, J. (2011). CloudScale: Elastic resource scaling for multi-tenant cloud systems. Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC 2011. [ paper ]

[Auto-scaling] Wu, C., Sreekanti, V., & Hellerstein, J. M. (2021). Autoscaling tiered cloud storage in Anna. VLDB Journal, 30(1), 25–43. [ paper ]

[Auto-scaling] [Disaggregation] Zhang, Y., Ruan, C., Li, C., Yang, J., Cao, W., Li, F., Wang, B., Fang, J., Wang, Y., Huo, J., & Bi, C. (2021). Towards Cost-Effective and Elastic Cloud Database Deployment via Memory Disaggregation. Proc. VLDB Endow., 14(10), 1900–1912. [ paper ]

[Partition] Hilprecht, B., Binnig, C., & Röhm, U. (2020). Learning a Partitioning Advisor for Cloud Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, 143–157. [ paper ]

[Disaggregation] Shan, Y., Huang, Y., Chen, Y., & Zhang, Y. (2018). LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), 69–87. [ paper ]

[Disaggregation] Angel, S., Nanavati, M., & Sen, S. (2020). Disaggregation and the application. HotCloud 2020 - 12th USENIX Workshop on Hot Topics in Cloud Computing, Co-Located with USENIX ATC 2020.[ paper ]

[Disaggregation] Klimovic, A., Kozyrakis, C., Thereska, E., John, B., & Kumar, S. (2016). Flash storage disaggregation. Proceedings of the 11th European Conference on Computer Systems, EuroSys 2016. [ paper ]

[Disaggregation] Zhang, Q., Cai, Y., Chen, X., Angel, S., Chen, A., Liu, V., & Loo, B. T. (2020). Understanding the effect of data center resource disaggregation on production DBMSs. Proceedings of the VLDB Endowment, 13(9), 1568–1581. [ paper ]

[Optimizer] Wu, C., Jindal, A., Amizadeh, S., Patel, H., & Le, W. (2018). Towards a learning optimizer for shared clouds. Proceedings of the VLDB Endowment, 12(3), 210–222. [ paper ]

[Optimizer] Leis, V., & Kuschewski, M. (2021). Towards Cost-Optimal Query Processing in the Cloud. Proc. VLDB Endow., 14(9), 1606–1612. [ paper ]

[Safety] Antonopoulos, P., Arasu, A., Singh, K. D., Eguro, K., Gupta, N., Jain, R., Kaushik, R., Kodavalla, H., Kossmann, D., Ogg, N., Ramamurthy, R., Szymaszek, J., Trimmer, J., Vaswani, K., Venkatesan, R., & Zwilling, M. (2020). Azure SQL Database Always Encrypted. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1, 1511–1525. [ paper ]

[Safety] Arasu, A., Eguro, K., Kaushik, R., & Ramamurthy, R. (2014). Querying encrypted data. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1259–1261. [ paper ]

[Recovery] Yang, Y., Youill, M., Woicik, M., Liu, Y., Yu, X., Serafini, M., Aboulnaga, A., & Stonebraker, M. (2021). FlexPushdownDB: Hybrid Pushdown and Caching in a Cloud DBMS. PVLDB, 14(11), 2101–2113. [ paper ]

[System] Ortiz, J., Lee, B., Balazinska, M., Gehrke, J., & Hellerstein, J. L. (2018). SLAOrchestrator: Reducing the cost of performance SLAs for cloud data analytics. Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018, 547–560. [ paper ]

[Hardware] Do, J., Sengupta, S., & Swanson, S. (2019). Programmable solid-state storage in future cloud datacenters. Communications of the ACM, 62(6), 54–62. [ paper ]

[Hardware] Xue, S., Zhao, S., Chen, Q., Deng, G., Liu, Z., Zhang, J., Song, Z., Ma, T., Yang, Y., Zhou, Y., Niu, K., Sun, S., & Guo, M. (2020). Spool: Reliable virtualized NVMe storage pool in public cloud infrastructure. Proceedings of the 2020 USENIX Annual Technical Conference, ATC 2020, 97–110.

[Memory] Kalia, A., Andersen, D., & Kaminsky, M. (2020). Challenges and solutions for fast remote persistent memory access. SoCC 2020 - Proceedings of the 2020 ACM Symposium on Cloud Computing, 105–119. [ paper ]

[Memory] Wei, X., Chen, R., Chen, H., & Jiao, S. (2020). Fast RDMA-based Ordered Key-Value Store using Remote Learned Cache. Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20). [ paper ]

[Memory] Nelson, J., Holt, B., Myers, B., Briggs, P., Ceze, L., Kahan, S., & Oskin, M. (2015). Latency-Tolerant Software Distributed Shared Memory. Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC 2015, 291–305.[ paper ]

[Memory] Shan, Y., Tsai, S. Y., & Zhang, Y. (2017). Distributed shared persistent memory. SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing, 323–337. [ paper ]

[Memory] Fent, P., van Renen, A., Kipf, A., Leis, V., Neumann, T., & Kemper, A. (2020). Low-latency communication for fast DBMS using RDMA and shared memory. Proceedings - International Conference on Data Engineering, 2020-April, 1477–1488. [ paper ]

[Memory] Aguilera, M. K., Amit, N., Calciu, I., Deguillard, X., Gandhi, J., Subrahmanyam, P., Suresh, L., Tati, K., Venkatasubramanian, R., & Wei, M. (2017). Remote memory in the age of fast networks. SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing, 121–127. [ paper ]

[Memory] Lagar-Cavilla, A., Ahn, J., Souhlal, S., Agarwal, N., Burny, R., Butt, S., Chang, J., Chaugule, A., Deng, N., Shahid, J., Thelen, G., Yurtsever, K. A., Zhao, Y., & Ranganathan, P. (2019). Software-Defined Far Memory in Warehouse-Scale Computers. International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS, 317–330. [ paper ]

[Network] Ziegler, T., Vani, S. T., Binnig, C., Fonseca, R., & Kraska, T. (2019). Designing distributed tree-based index structures for fast RDMA-capable networks. Proceedings of the ACM SIGMOD International Conference on Management of Data, 741–758. [ paper ]

[Network] Tirmazi, M., Ben Basat, R., Gao, J., & Yu, M. (2020). Cheetah: Accelerating Database Queries with Switch Pruning. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2407–2422. [ paper ]

[Network] Craddock, H., Konudula, L. P., Cheng, K., & Kul, G. (2019). The case for physical memory pools: A vision paper. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11513 LNCS, 208–221. [ paper ]

[Application] Müller, I., Marroquín, R., & Alonso, G. (2020). Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure. Proceedings of the ACM SIGMOD International Conference on Management of Data, 115–130. [ paper ]

[Application] Yu, X., Youill, M., Woicik, M., Ghanem, A., Serafini, M., Aboulnaga, A., & Stonebraker, M. (2020). PushdownDB: Accelerating a DBMS Using S3 Computation. Proceedings - International Conference on Data Engineering, 2020-April, 1802–1805. [ paper ]

[Application] Antonopoulos, P., Budovski, A., Diaconu, C., Saenz, A. H., Hu, J., Kodavalla, H., Kossmann, D., Lingam, S., Minhas, U. F., Prakash, N., Purohit, V., Qu, H., Ravella, C. S., Reisteter, K., Shrotri, S., Tang, D., & Wakade, V. (2019). Socrates: The new SQL server in the cloud. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1743–1756. [ paper ]

[Industry] Verbitski, A., Gupta, A., Saha, D., Corey, J., Gupta, K., Brahmadesam, M., Mittal, R., Krishnamurthy, S., Maurice, S., Kharatishvilli, T., & Bao, X. (2018). Amazon Aurora: On avoiding distributed consensus for I/Os, commits, and membership changes. Proceedings of the ACM SIGMOD International Conference on Management of Data, 789–796. [ paper ]

[Industry] Li, F. (2018). Cloud native database systems at Alibaba: Opportunities and challenges. Proceedings of the VLDB Endowment, 12(12), 2263–2272. [ paper ]

[Industry] Brooker, M., Chen, T., & Ping, F. (2020). Millions of Tiny Databases. Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI '20). [ paper ]

[Industry] Cao, W., Liu, Y., Cheng, Z., Zheng, N., Li, W., Wu, W., Ouyang, L., Wang, P., Wang, Y., Kuan, R., Liu, Z., Zhu, F., & Zhang, T. (2020). POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database. 18th USENIX Conference on File and Storage Technologies (FAST 20). [ paper ]

[Industry] Cao, W., Zhang, Y., Yang, X., Li, F., Wang, S., Hu, Q., Cheng, X., Chen, Z., Liu, Z., Fang, J., Wang, B., Wang, Y., Sun, H., Yang, Z., Cheng, Z., Chen, S., Wu, J., Hu, W., Zhao, J., … Tong, J. (2021). PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2477–2489. [ paper ]

[Industry] Cao, W., Liu, Z., Wang, P., Chen, S., Zhu, C., Zheng, S., Wang, Y., & Ma, G. (2018). PolarFS: An ultralow latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment, 11(12), 1849–1862. [ paper ]

[Industry] Dageville, B., Cruanes, T., Zukowski, M., Antonov, V., Avanes, A., Bock, J., Claybaugh, J., Engovatov, D., Hentschel, M., Huang, J., Lee, A. W., Motivala, A., Munir, A. Q., Pelley, S., Povinec, P., Rahn, G., Triantafyllis, S., & Unterbrunner, P. (2016). The snowflake elastic data warehouse. Proceedings of the ACM SIGMOD International Conference on Management of Data, 26-June-20, 215–226. [ paper ]

[Industry] Mattson, T., Rogers, J., & Elmore, A. J. (2018). The BigDAWG polystore system. Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker, 44(2), 279–289. [ paper ]

[Industry] Huang, D., Liu, Q., Cui, Q., Fang, Z., Ma, X., Xu, F., Shen, L., Tang, L., Zhou, Y., Huang, M., Wei, W., Liu, C., Zhang, J., Li, J., Wu, X., Song, L., Sun, R., Yu, S., Zhao, L., … Tang, X. (2020). TiDB: a Raft-based HTAP database. Proceedings of the VLDB Endowment, 13(12), 3072–3084. [ paper ]

[Challenges] Zhang, Q., Cai, Y., Angel, S., Liu, V., Chen, A., & Loo, B. T. (2020). Rethinking Data Management Systems for Disaggregated Data Centers. CIDR 2020 - 10th Conference on Innovative Data Systems Research. [ paper ]

[Challenges] Hellerstein, J. M., Faleiro, J., Gonzalez, J. E., Schleier-Smith, J., Sreekanti, V., Tumanov, A., & Wu, C. (2019). Serverless computing: One step forward, two steps back. CIDR 2019 - 9th Biennial Conference on Innovative Data Systems Research.[ paper ]


Systematic Review Article

Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security


  • 1 Information Technology University (ITU), Lahore, Pakistan
  • 2 AI4Networks Research Center, University of Oklahoma, Norman, OK, United States
  • 3 Social Data Science (SDS) Lab, Queen Mary University of London, London, United Kingdom
  • 4 School of Computing and Communications, Lancaster University, Lancaster, United Kingdom
  • 5 Hamad Bin Khalifa University (HBKU), Doha, Qatar

With the advances in machine learning (ML) and deep learning (DL) techniques, and the potency of cloud computing in offering services efficiently and cost-effectively, Machine Learning as a Service (MLaaS) cloud platforms have become popular. In addition, there is increasing adoption of third-party cloud services for outsourcing the training of DL models, which requires substantial costly computational resources (e.g., high-performance graphics processing units (GPUs)). Such widespread usage of cloud-hosted ML/DL services opens a wide range of attack surfaces for adversaries to exploit the ML/DL system to achieve malicious goals. In this article, we conduct a systematic evaluation of the literature on cloud-hosted ML/DL models along both of the important dimensions related to their security: attacks and defenses. Our systematic review identified a total of 31 related articles, of which 19 focused on attacks, six focused on defenses, and six focused on both attacks and defenses. Our evaluation reveals increasing interest from the research community in both attacking and defending Machine Learning as a Service platforms. In addition, we identify the limitations and pitfalls of the analyzed articles and highlight open research issues that require further investigation.

1 Introduction

In recent years, machine learning (ML) techniques have been successfully applied to a wide range of applications, significantly outperforming previous state-of-the-art methods in various domains: for example, image classification, face recognition, and object detection. These ML techniques—in particular deep learning (DL)–based ML techniques—are resource intensive and require a large amount of training data to accomplish a specific task with good performance. Training DL models on large-scale datasets is usually performed using high-performance graphics processing units (GPUs) and tensor processing units. However, keeping in mind the cost of GPUs/Tensor Processing Units and the fact that small businesses and individuals cannot afford such computational resources, the training of deep models is typically outsourced to clouds, which is referred to in the literature as “Machine Learning as a Service” (MLaaS).

MLaaS refers to different ML services offered as a component of cloud computing services, for example, predictive analytics, face recognition, natural language services, and data modeling APIs. MLaaS allows users to upload their data and model for training on the cloud. In addition to training, cloud-hosted ML services can also be used for inference purposes, that is, models can be deployed in cloud environments; the system architecture of a typical MLaaS is shown in Figure 1.
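
To make this interaction concrete, the following minimal sketch shows how a client might submit an input to a cloud-hosted prediction endpoint over REST. The endpoint URL, authentication scheme, and JSON layout are hypothetical placeholders for illustration, not the interface of any specific MLaaS provider.

```python
import requests

# Hypothetical MLaaS prediction endpoint and API key (placeholders, not a
# real provider's interface).
ENDPOINT = "https://mlaas.example.com/v1/models/my-classifier:predict"
API_KEY = "YOUR-API-KEY"

def predict(features):
    """Send one feature vector to the hosted model and return its prediction."""
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"instances": [features]},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["predictions"][0]

print(predict([5.1, 3.5, 1.4, 0.2]))
```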


FIGURE 1. An illustration of a typical cloud-based ML or machine learning as a service (MLaaS) architecture.

MLaaS 1 can help reduce the entry barrier to the use of ML and DL through access to managed services of wide hardware heterogeneity and incredible horizontal scale. MLaaS is currently provided by several major organizations such as Google, Microsoft, and Amazon. For example, Google offers Cloud ML Engine 2, which allows developers and data scientists to upload training data and models to be trained on the cloud in the TensorFlow 3 environment. Similarly, Microsoft offers Azure Batch AI 4, a cloud-based service for training DL models using different frameworks supported by both Linux and Windows operating systems, and Amazon offers a cloud service named Deep Learning AMI (DLAMI) 5 that provides several pre-built DL frameworks (e.g., MXNet, Caffe, Theano, and TensorFlow) available on Amazon's EC2 cloud computing infrastructure. Such cloud services are popular among researchers, as evidenced by the price of Amazon's p2.16xlarge instance being lifted to the maximum possible two days before the deadline of NeurIPS 2017 (the largest research venue on ML), indicating that a large number of users requested to reserve instances.

In addition to MLaaS services that allow users to upload their model and data for training on the cloud, transfer learning is another strategy for reducing computational cost, in which a pretrained model is fine-tuned for a new task (using a new dataset). Transfer learning is widely applied to image recognition tasks using convolutional neural networks (CNNs). A CNN model learns and encodes features such as edges and other patterns. The learned weights and convolutional filters are useful for image recognition tasks in other domains, and state-of-the-art results can be obtained with a minimal amount of training, even on a single GPU (a minimal fine-tuning sketch is given below). Moreover, various popular pretrained models such as AlexNet ( Krizhevsky et al., 2012 ), VGG ( Simonyan and Zisserman, 2015 ), and Inception ( Szegedy et al., 2016 ) are available for download and fine-tuning online. Both of the aforementioned outsourcing strategies come with new security concerns. In addition, the literature suggests that different types of attacks can be realized on different components of the communication network as well ( Usama et al., 2020a ), for example, intrusion detection ( Han et al., 2020 ; Usama et al., 2020b ), network traffic classification ( Usama et al., 2019 ), and malware detection systems ( Chen et al., 2018 ). Moreover, adversarial ML attacks have also been devised for client-side ML classifiers, for example, Google's phishing pages filter ( Liang et al., 2016 ).
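
As a concrete illustration of the fine-tuning strategy, the sketch below adapts a pretrained VGG-16 from torchvision to a hypothetical 10-class task: the pretrained convolutional filters are frozen and only the replaced classifier head is trained. The class count and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load VGG-16 with ImageNet weights and freeze all pretrained parameters.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier layer with a head for the new 10-class task.
model.classifier[6] = nn.Linear(4096, 10)

optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One gradient step on the new head; the frozen backbone is unchanged."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```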

Contributions of the article: In this article, we analyze the security of MLaaS and other cloud-hosted ML/DL models and provide a systematic review of the associated security challenges and solutions. To the best of our knowledge, this article is the first effort to provide a systematic review of the security of cloud-hosted ML models and services. The following are the major contributions of this article:

(1) We conducted a systematic evaluation of 31 articles related to MLaaS attacks and defenses.

(2) We investigated five themes of approaches aiming to attack MLaaS and cloud-hosted ML services.

(3) We examined five themes of defense methods for securing MLaaS and cloud-hosted ML services.

(4) We identified the pitfalls and limitations of the examined articles. Finally, we have highlighted open research issues that require further investigation.

Organization of the article: The rest of the article is organized as follows. The methodology adopted for the systematic review is presented in Section 2, and the results of the systematic review are presented in Section 3. Section 4 presents various security challenges associated with cloud-hosted ML models, and potential solutions for securing cloud-hosted ML models are presented in Section 5. The pitfalls and limitations of the reviewed approaches are discussed in Section 6, open research issues that require further investigation are highlighted in Section 7, and we briefly reflect on our methodology to identify any threats to validity in Section 8. Finally, we conclude the article in Section 9.

2 Review Methodology

In this section, we present the research objectives and the adopted methodology for the systematic review. The purpose of this article is to identify and systematically review the state-of-the-art research related to the security of cloud-based ML/DL techniques. The methodology followed for this study is depicted in Figure 2.


FIGURE 2. The methodology followed for the systematic review.

2.1 Research Objectives

The following are the key objectives of this article.

O1: To build upon the existing work around the security of cloud-based ML/DL methods and present a broad overview of the existing state-of-the-art literature related to MLaaS and cloud-hosted ML services.

O2: To identify and present a taxonomy of different attack and defense strategies for cloud-hosted ML/DL models.

O3: To identify the pitfalls and limitations of the existing approaches in terms of research challenges and opportunities.

2.2 Research Questions

To achieve our objectives, we considered the two important questions described below and conducted a systematic analysis of 31 articles.

Q1: What are the well-known attacks on cloud-hosted/third-party ML/DL models?

Q2: What are the countermeasures and defenses against such attacks?

2.3 Review Protocol

We developed a review protocol to conduct the systematic review; the details are described below.

2.3.1 Search Strategy and Searching Phase

To build a knowledge base and extract the relevant articles, eight major publishers and online repositories were queried, including the ACM Digital Library, IEEE Xplore, ScienceDirect, the International Conference on Machine Learning, the International Conference on Learning Representations, the Journal of Machine Learning Research, Neural Information Processing Systems, USENIX, and arXiv. As we added non-peer-reviewed articles from the electronic preprint archive (arXiv), we (AQ and AI) performed a critical appraisal using the AACODS checklist ( Tyndall, 2010 ), which is designed to enable the evaluation and critical appraisal of gray literature.

In the initial phase, we queried the main libraries using a set of different search terms that evolved through an iterative process to maximize the number of relevant articles. To achieve optimal sensitivity, we used combinations of the words: attack, poisoning, Trojan attack, contamination, model inversion, evasion, backdoor, model stealing, black box, ML, neural networks, MLaaS, cloud computing, outsource, third party, secure, robust, and defense. The combinations of search keywords used are depicted in Figure 3, and we created search strategies with the controlled or index terms given there. Please note that no lower limit for the publication date was applied; the last search date was June 2020. The researchers (WI and AI) searched for additional articles through citations and by snowballing on Google Scholar. Any disagreement was adjudicated by a third reviewer (AQ). Finally, articles focusing on attacks/defenses for cloud-based ML models were retrieved.


FIGURE 3. Search queries (keyword combinations) used to identify publications to include in the systematic review.

2.3.2 Inclusion and Exclusion Criteria

The inclusion and exclusion criteria followed for this systematic review are defined below.

2.3.2.1 Inclusion Criteria

The following are the key criteria we considered when screening retrieved articles as relevant for the systematic review.

• We included all articles relevant to the research questions and published in the English language that discuss attacks on cloud-based ML services, for example, those offered by cloud computing service providers.

• We then assessed the eligibility of the relevant articles by identifying whether they discussed either attacks or defenses for cloud-based ML/DL models.

• We included comparative studies that compare attacks and robustness against different well-known attacks on cloud-hosted ML services (poisoning attacks, black box attacks, Trojan attacks, backdoor attacks, contamination attacks, inversion attacks, stealing attacks, and evasion attacks).

• Finally, we categorized the selected articles into three categories, that is, articles on attacks, articles on defenses, and articles on attacks and defenses.

2.3.2.2 Exclusion Criteria

The exclusion criteria are outlined below.

• Articles that are written in a language other than English.

• Articles not available in full text.

• Secondary studies (e.g., systematic literature reviews, surveys, editorials, and abstracts or short papers) are not included.

• Articles that do not discuss attacks and defenses for cloud-based/third-party ML services, that is, we only consider those articles which have proposed an attack or defense for a cloud-hosted ML or MLaaS service.

2.3.3 Screening Phase

For the screening of articles, we employed two phases based on the content of the retrieved articles: 1) title and abstract screening and 2) full-text screening. Please note that, to avoid bias and to ensure that judgments about the relevancy of articles were based entirely on the content of the publications, we intentionally did not consider authors, publication type (e.g., conference or journal), or publisher (e.g., IEEE or ACM). Titles and abstracts might not be true reflectors of an article's contents; however, we concluded that our review protocol is sufficient to avoid provenance-based bias.

It is very common for the same work to be published in multiple venues; for example, conference papers are often extended into journal articles. In such cases, we only consider the original article. In the screening phase, every article was screened by at least two authors of this article, who were tasked with annotating each article as relevant, not relevant, or needing further investigation; disagreements were resolved by discussion between the authors until each such article was marked either relevant or not relevant. Only original technical articles were selected, while survey and review articles were ignored. Finally, all selected publications were thoroughly read by the authors for categorization and thematic analysis.

3 Review Results

3.1 Overview of the Search and Selection Process Outcome

The search using the aforementioned strategy identified a total of 4,384 articles. After removing duplicate articles and screening titles and abstracts, the overall number of articles was reduced to 384. A total of 230 articles did not meet the inclusion criteria and were therefore excluded. Of the remaining 154 articles, 123 did not discuss attacks/defenses for third-party cloud-hosted ML models and were excluded as well. Of the remaining articles, a total of 31 were identified as relevant. Reasons for excluding articles were documented and reported in a PRISMA flow diagram, depicted in Figure 4. These articles were categorized into three classes: articles specifically focused on attacks, articles specifically focused on defenses, and articles that considered both attacks and defenses, containing 19, six, and six articles, respectively.
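
As a quick sanity check, the selection funnel reported above can be reconstructed with simple arithmetic:

```python
# Selection funnel from the numbers reported above.
identified = 4384            # retrieved by the search
screened = 384               # after de-duplication and title/abstract screening
excluded_criteria = 230      # did not meet the inclusion criteria
excluded_not_cloud = 123     # no attack/defense for cloud-hosted ML

eligible = screened - excluded_criteria        # 154
included = eligible - excluded_not_cloud       # 31
assert included == 19 + 6 + 6                  # attacks, defenses, both
print(identified, screened, eligible, included)
```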


FIGURE 4. Flowchart of the systematic review and categorization (PRISMA flow diagram).

3.2 Overview of the Selected Studies

The systematic review eventually identified a set of 31 articles related to cloud-based ML/DL models and MLaaS, which we categorized into the three classes mentioned above and shown in Figure 4. As shown in Figure 5, a significant portion of the selected articles were published in conferences (41.94%); a much smaller proportion were published in journals or transactions (19.35%). The percentage of gray literature (i.e., non-peer-reviewed articles) is 25.81%. A small proportion of publications appeared in symposia (6.45%), and the percentage is the same for workshop papers. The distribution of selected publications by their types over the years is shown in Figure 6. The figure depicts that interest in the security of cloud-hosted ML/DL models increased in 2017, peaked in 2018, and was slightly lower in 2019 compared with 2018. The majority of the articles during these years were published in conferences. The distribution of selected publications by publisher over the years is depicted in Figure 7; the figure shows that the majority of the publications were published by IEEE, ACM, and arXiv. The numbers of articles in 2017, 2018, and 2019 follow a similar trend, as discussed previously.


FIGURE 5. Distribution of selected publications according to their types.


FIGURE 6. Distribution of selected publications by types over years.


FIGURE 7. Distribution of selected publications by publishers over years.

3.3 Some Partially Related Non-Selected Studies: A Discussion

We have described our inclusion and exclusion criteria, which helped us identify relevant articles. We note, however, that some seemingly relevant articles failed to meet the inclusion criteria. Here, we briefly describe a few such articles to give a rationale for why they were not included.

• Liang et al. (2016) investigated the security challenges of client-side classifiers via a case study of Google's phishing pages filter, a very widely used classifier for automatically detecting unknown phishing pages. They devised an attack that is not relevant to cloud-based services.

• Demetrio et al. (2020) presented WAF-A-MoLE, a tool that models the presence of an adversary. This tool leverages a set of mutation operators that alter the syntax of a payload without affecting the original semantics. Using the results, the authors demonstrated that ML-based WAFs are exposed to a concrete risk of being bypassed. However, this attack is not associated with any cloud-based services.

• The authors of Apruzzese et al. (2019) discussed adversarial attacks in which the machine learning model is compromised to induce an output favorable to the attacker. These attacks are realized in a setting different from the scope of this systematic review, as we only included articles that discuss attacks or defenses when the cloud is outsourcing its services as MLaaS.

• Han et al. (2020) conducted the first systematic study of practical traffic-space evasion attacks on learning-based network intrusion detection systems; again, this falls outside the inclusion criteria of our work.

• Chen et al. (2018) designed and evaluated three types of attackers targeting the training phase to poison the detection system. To address this threat, the authors proposed a detection system, KuafuDet, and showed that it significantly reduces false negatives and boosts detection accuracy.

• Song et al. (2020) presented a federated defense approach for mitigating the effect of adversarial perturbations in a federated learning environment. This article is potentially relevant to our study, as it addresses the problem of defending cloud-hosted ML models; however, instead of using a third-party service, the authors conducted their experiments on a single computer system in a simulated environment; therefore, this study is not included in the analysis of this article.

• In a similar study, Zhang et al. (2019) presented a defense mechanism against adversarial attacks on cloud-aided automatic speech recognition (ASR); however, it is not explicitly stated that the cloud is outsourcing ML services, nor which ML/DL model or MLaaS was used in the experiments.

4 Attacks on Cloud-Hosted Machine Learning Models (Q1)

In this section, we present the findings from the systematically selected articles that aim at attacking cloud-hosted/third-party ML/DL models.

4.1 Attacks on Cloud-Hosted Machine Learning Models: Thematic Analysis

In ML practice, it is very common to outsource the training of ML/DL models to third-party services that provide high computational resources on the cloud. Such services enable ML practitioners to upload their models along with training data, which is then trained on the cloud. Although such services have clear benefits for reducing training and inference time, they can easily be compromised, and different types of attacks against them have been proposed in the literature. In this section, we present the thematic analysis of 19 articles that are focused on attacking cloud-hosted ML/DL models. These articles are classified into five major themes: 1) attack type, 2) threat model, 3) attack method, 4) target model(s), and 5) dataset.

Attack type: A wide variety of attacks have been proposed in the literature. These are listed below with their descriptions provided in the next section.

• Adversarial attacks ( Brendel et al., 2017 );

• Backdoor attacks 6 ( Chen et al., 2017 ; Gu et al., 2019 );

• Cyber kill chain–based attack ( Nguyen, 2017 );

• Data manipulation attacks ( Liao et al., 2018 );

• Evasion attacks ( Hitaj et al., 2019 );

• Exploration attacks ( Sethi and Kantardzic, 2018 );

• Model extraction attacks ( Correia-Silva et al., 2018 ; Kesarwani et al., 2018 ; Joshi and Tammana, 2019 ; Reith et al., 2019 );

• Model inversion attacks ( Yang et al., 2019 );

• Model-reuse attacks ( Ji et al., 2018 );

• Trojan attacks ( Liu et al., 2018 ).

Threat model: Based on the adversary's knowledge of the target model, the threat models considered in these articles fall into three categories:

• Black box attacks (no knowledge) ( Brendel et al., 2017 ; Chen et al., 2017 ; Hosseini et al., 2017 ; Correia-Silva et al., 2018 ; Sethi and Kantardzic, 2018 ; Hitaj et al., 2019 );

• White box attacks (full knowledge) ( Liao et al., 2018 ; Liu et al., 2018 ; Gu et al., 2019 ; Reith et al., 2019 );

• Gray box attacks (partial knowledge) ( Ji et al., 2018 ; Kesarwani et al., 2018 ).

Attack method: Each article proposes a different method for attacking cloud-hosted ML/DL models; brief descriptions of these methods are presented in Table 1 and discussed in detail in the next section.


TABLE 1. Summary of the state-of-the-art attack types for cloud-based/third-party ML/DL models.

Target model(s): The considered studies used different MLaaS services, for example, Google Cloud ML services ( Hosseini et al., 2017 ; Salem et al., 2018 ; Sethi and Kantardzic, 2018 ), ML models on the BigML platform ( Kesarwani et al., 2018 ), IBM's visual recognition service ( Nguyen, 2017 ), and Amazon Prediction APIs ( Reith et al., 2019 ; Yang et al., 2019 ).

Dataset: These attacks have been realized using different datasets, ranging from small datasets (e.g., MNIST ( Gu et al., 2019 ) and Fashion-MNIST ( Liu et al., 2018 )) to large datasets (e.g., the YouTube Aligned Face Dataset ( Chen et al., 2017 ), Project Wolf Eye ( Nguyen, 2017 ), and the Iris dataset ( Joshi and Tammana, 2019 )). Other datasets include California Housing, Boston House Prices, UJIIndoorLoc, and IPIN 2016 Tutorial ( Reith et al., 2019 ), as well as FaceScrub, CelebA, and CIFAR-10 ( Yang et al., 2019 ). A summary of the thematic analysis of these attacks is presented in Table 1, and they are briefly described in the next section.

4.2 Taxonomy of Attacks on Cloud-Hosted Machine Learning Models

In this section, we present a taxonomy and description of the different attacks identified in the thematic analysis above. A taxonomy of attacks on cloud-hosted ML/DL models is depicted in Figure 8 and described next.


FIGURE 8. Taxonomy of different attacks realized on the third-party cloud-hosted machine learning (ML) or deep learning (DL) models.

4.2.1 Adversarial Attacks

In recent years, DL models have been found vulnerable to carefully crafted, imperceptible adversarial examples ( Goodfellow et al., 2014 ). For instance, a decision-based adversarial attack, namely the boundary attack, against two black box ML models trained for brand and celebrity recognition hosted at Clarifai.com is proposed by Brendel et al. (2017). The first model identifies brand names from natural images for 500 distinct brands, and the second model recognizes over 10,000 celebrities. A variety of adversarial example generation methods have been proposed in the literature to date; interested readers are referred to recent survey articles for a detailed taxonomy of different types of adversarial attacks (i.e., Akhtar and Mian, 2018 ; Yuan et al., 2019 ; Qayyum et al., 2020b ; Demetrio et al., 2020 ).
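
To illustrate the flavor of a decision-based attack, the simplified loop below is inspired by (but not identical to) the boundary attack; it assumes only that a hypothetical `query(x)` returns the label predicted by the remote black box model, and the step sizes and acceptance rule are illustrative.

```python
import numpy as np

def decision_based_attack(query, x_orig, x_adv, steps=1000, eps=0.01, delta=0.1):
    """Walk an adversarial point toward the original image while the remote
    model (reachable only via `query`) keeps assigning it a different label."""
    orig_label = query(x_orig)
    for _ in range(steps):
        # Random perturbation, rescaled relative to the current distance.
        noise = np.random.randn(*x_orig.shape)
        noise *= delta * np.linalg.norm(x_adv - x_orig) / np.linalg.norm(noise)
        candidate = x_adv + noise
        # Contract slightly toward the original to shrink the perturbation.
        candidate += eps * (x_orig - candidate)
        if query(candidate) != orig_label:  # still misclassified: accept
            x_adv = candidate
    return x_adv
```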

4.2.2 Exploratory Attacks

These are inference time attacks in which the adversary attempts to evade the underlying ML/DL model, for example, by forcing the classifier (i.e., the ML/DL model) to misclassify a positive sample as a negative one. Exploratory attacks do not harm the training data and only affect the model at test time. A data-driven exploratory attack using the Seed–Explore–Exploit strategy for evading Google's cloud prediction API under black box settings is presented by Sethi and Kantardzic (2018). The performance evaluation of the proposed framework was performed using 10 real-world datasets.
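
The following sketch conveys the exploration idea in a label-only black box setting; it is a simplified reading of the Seed–Explore–Exploit strategy under assumed seeds and step sizes, not the authors' exact algorithm.

```python
import numpy as np

def explore_evasion(query, seed_malicious, seed_benign, trials=500, step=0.05):
    """Starting from a malicious seed, probe the black box `query` (assumed to
    return 1 for malicious, 0 for benign) along directions anchored by a
    benign seed until an evading sample is found."""
    x = seed_malicious.copy()
    for _ in range(trials):
        direction = (seed_benign - x) + np.random.randn(*x.shape) * step
        x = x + step * direction
        if query(x) == 0:   # classifier now says benign: evasion succeeded
            return x
    return None             # no evading sample found within the budget
```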

4.2.3 Model Extraction Attacks

In model extraction attacks, adversaries can query the deployed ML model and use the query–response pairs to compromise future predictions; they can also potentially realize privacy breaches of the training data and steal the model by learning from extraction queries. In Kesarwani et al. (2018), the authors presented a novel method for quantifying the extraction status of models for users with an increasing number of queries, which aims to measure the model learning rate using the information gain observed in users' query and response streams. The key objective of the authors was to design a cloud-based system for monitoring model extraction status and issuing warnings. The performance evaluation of the proposed method was performed using a decision tree model deployed on the BigML MLaaS platform under different adversarial attack scenarios. Similarly, a model extraction/stealing strategy is presented by Correia-Silva et al. (2018). The authors queried the cloud-hosted DL model with random unlabeled samples and used its predictions to create a fake dataset. They then used the fake dataset to build a fake model by training an oracle (copycat) model, in an attempt to achieve performance similar to that of the target model.
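
A minimal sketch of the copycat idea follows, under the assumption that a hypothetical `query_api` returns the victim model's predicted label for an arbitrary input; the probe distribution and substitute architecture are illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def steal_model(query_api, n_queries=10000, dim=20):
    """Label random unlabeled probes with the cloud model's predictions and
    train a local substitute (copycat) on the resulting fake dataset."""
    X = np.random.rand(n_queries, dim)           # random probe inputs
    y = np.array([query_api(x) for x in X])      # victim's predicted labels
    substitute = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300)
    substitute.fit(X, y)
    return substitute                            # approximates the victim
```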

4.2.4 Backdooring Attacks

In backdooring attacks, an adversary maliciously creates a trained model that performs as well as expected on the user's training and validation data but performs badly on attacker-chosen input samples. Backdooring attacks on deep neural networks (DNNs) are explored and evaluated by Gu et al. (2019). The authors first explored the properties of backdooring for a toy example and created a backdoored model for a handwritten digit classifier; they then demonstrated that backdoors are powerful for DNNs by creating a backdoored model for a United States street sign classifier. Two scenarios were considered, that is, outsourced training of the model and transfer learning, where an attacker can acquire a backdoored pretrained model online. In another similar study ( Chen et al., 2017 ), a targeted backdoor attack against two state-of-the-art face recognition models, that is, DeepID ( Sun et al., 2014 ) and VGG-Face ( Parkhi et al., 2015 ), is presented. The authors proposed two categories of backdooring poisoning attacks, that is, input–instance–key attacks and pattern–key attacks, using two corresponding data poisoning strategies.
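
The pattern-key idea can be sketched as a simple data poisoning routine; the trigger shape, location, and poisoning rate below are illustrative assumptions, not the exact parameters used in the cited studies.

```python
import numpy as np

def poison_with_trigger(images, labels, target_class, rate=0.05):
    """Stamp a small trigger patch on a fraction of training images (assumed
    to be (N, H, W) floats in [0, 1]) and relabel them as the attacker's
    target class. A model trained on the poisoned set behaves normally on
    clean inputs but predicts `target_class` when the trigger is present."""
    poisoned, new_labels = images.copy(), labels.copy()
    n_poison = int(rate * len(images))
    idx = np.random.choice(len(images), n_poison, replace=False)
    poisoned[idx, -4:, -4:] = 1.0        # 4x4 white patch, bottom-right corner
    new_labels[idx] = target_class
    return poisoned, new_labels
```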

4.2.5 Trojan Attacks

In Trojan attacks, the attacker inserts malicious content into the system that looks legitimate but can take over control of the system. The purpose of Trojan insertion can vary, for example, stealing, disruption, misbehavior, or obtaining intended behavior. In Liu et al. (2018), the authors proposed a stealth infection on neural networks, namely SIN2, to realize practical supply chain–triggered neural Trojan attacks. They also proposed a variety of Trojan insertion strategies for agile and practical Trojan attacks. The proof of concept is demonstrated by developing a prototype of the proposed neural Trojan attack (i.e., SIN2) in a Linux sandbox, using the Torch ( Collobert et al., 2011 ) ML/DL framework to build visual recognition models on the Fashion-MNIST dataset.

4.2.6 Model-Reuse Attacks

In model-reuse attacks, an adversary creates a malicious model (i.e., an adversarial model) that influences the host model to misbehave on targeted inputs (i.e., triggers) in an extremely predictable fashion, that is, getting a sample classified into a specific (intended) class. For instance, model-reuse attacks on four pretrained primitive DL models (i.e., speech recognition, autonomous steering, face verification, and skin cancer screening) are experimentally evaluated by Ji et al. (2018).

4.2.7 Data Manipulation Attacks

Attacks in which the training data are manipulated to obtain intended behavior from the ML/DL model are known as data manipulation attacks. Data manipulation attacks for stealthily manipulating traditional supervised ML techniques, logistic regression (LR), and CNN models are studied by Liao et al. (2018). In the attack strategy, the authors added a new constraint on the fully connected layers of the models and used gradient descent to retrain them, while the other layers were frozen (i.e., made non-trainable).
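
A PyTorch-style sketch of this freeze-and-retrain step is given below, assuming the fully connected head parameters are the ones whose names start with "fc"; the naming convention and hyperparameters are assumptions for illustration.

```python
import torch.nn as nn
import torch.optim as optim

def retrain_fc_only(model, manipulated_loader, epochs=3, lr=0.01):
    """Freeze everything except the fully connected head, then run gradient
    descent on attacker-manipulated data so only the head absorbs the change."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc")   # assumed head prefix
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.SGD(trainable, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in manipulated_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model
```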

4.2.8 Cyber Kill Chain–Based Attacks

Kill chain is a term, originally used in the military, that defines the steps for attacking a target. In cyber kill chain–based attacks, cloud-hosted ML/DL models are attacked following such a chain; for example, a high-level threat model targeting the ML cyber kill chain is presented by Nguyen (2017). The authors also provided a proof of concept through a case study using IBM's visual recognition MLaaS (i.e., a cognitive classifier for classifying cats and female lions) and provided recommendations for ensuring secure and robust ML.

4.2.9 Membership Inference Attacks

In a typical membership inference attack, given input data and black box access to the ML model, an attacker attempts to figure out whether the given input sample was part of the training set. To realize a membership inference attack against a target model, a classification model is trained to distinguish the target model's predictions on inputs on which it was trained from its predictions on inputs on which it was not trained ( Shokri et al., 2017 ).
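
A minimal sketch of this shadow-model construction follows, assuming scikit-learn-style models that expose `predict_proba`; in practice, multiple shadow models and per-class attack models are typically used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_membership_attack(shadow_model, X_members, X_nonmembers):
    """Train a binary attack classifier that separates a shadow model's
    confidence vectors on its own training data (members) from those on
    unseen data (non-members)."""
    conf_in = shadow_model.predict_proba(X_members)
    conf_out = shadow_model.predict_proba(X_nonmembers)
    X_attack = np.vstack([conf_in, conf_out])
    y_attack = np.array([1] * len(conf_in) + [0] * len(conf_out))
    attack = LogisticRegression(max_iter=1000).fit(X_attack, y_attack)
    return attack  # attack.predict(target.predict_proba(x)) -> member or not
```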

4.2.10 Evasion Attacks

Evasion attacks are inference time attacks in which an adversary attempts to modify the test data to obtain the intended outcome from the ML/DL model. Two evasion attacks against watermarking techniques for DL models hosted as MLaaS have been presented by Hitaj et al. (2019). The authors used five publicly available models and trained them to distinguish between watermarked and clean (non-watermarked) images, that is, a binary image classification task.

4.2.11 Model Inversion Attacks

In model inversion attacks, an attacker tries to learn about the training data using the model's outcomes. Two model inversion techniques have been proposed by Yang et al. (2019): training an inversion model on an auxiliary set composed using the adversary's background knowledge, and a truncation-based method for aligning the inversion model. The authors evaluated their proposed methods on a commercial prediction MLaaS, namely Amazon Rekognition.
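
The truncation step can be sketched as follows; the value of `k` and the renormalization are illustrative, and the inversion network itself (which maps prediction vectors back to inputs) is omitted.

```python
import numpy as np

def truncate_prediction(confidences, k=3):
    """Keep only the top-k entries of a prediction vector (zeroing the rest
    and renormalizing), so the inversion model is trained on the same coarse
    view of the victim's outputs that the attacker observes at test time."""
    conf = np.asarray(confidences, dtype=float)
    keep = np.argsort(conf)[-k:]         # indices of the k largest entries
    truncated = np.zeros_like(conf)
    truncated[keep] = conf[keep]
    return truncated / truncated.sum()
```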

5 Toward Securing Cloud-Hosted Machine Learning Models (Q2)

In this section, we present the insights from the systematically selected articles that provide tailored defenses against specific attacks, and we report the articles that, along with creating attacks, propose countermeasures for those attacks on cloud-hosted/third-party ML/DL models.

5.1 Defenses for Attacks on Cloud-Hosted Machine Learning Models: Thematic Analysis

Leveraging cloud-based ML services for computational offloading and minimizing communication overhead is accepted as a promising trend. While cloud-based prediction services have significant benefits, sharing the model and the training data raises many privacy and security challenges, and several attacks that can compromise the model and data integrity were described in the previous section. To avoid such issues, users can download the model and make inferences locally. However, this approach has certain drawbacks: confidentiality is at risk, service providers cannot update the models, adversaries can use the model to develop evasion strategies, and the privacy of user data is compromised. To outline the countermeasures against these attacks, we present the thematic analysis of the six articles that are focused on defenses against tailored attacks for cloud-hosted ML/DL models or data, together with the thematic analysis of the six articles that propose both an attack and a defense against it. These articles are classified into five major themes: 1) attack type, 2) defense, 3) target model(s), 4) dataset, and 5) measured outcomes. The thematic analysis of these systematically reviewed articles is given below.

Considered attacks for developing defenses: The defenses proposed in the reviewed articles are developed against the following specific attacks.

• Extraction attacks ( Tramèr et al., 2016 ; Liu et al., 2017 );

• Inversion attacks ( Liu et al., 2017 ; Sharma and Chen, 2018 );

• Adversarial attacks ( Hosseini et al., 2017 ; Wang et al., 2018b ; Rouhani et al., 2018 );

• Evasion attacks ( Lei et al., 2020 );

• GAN attacks ( Sharma and Chen, 2018 );

• Privacy threat attacks ( Hesamifard et al., 2017 );

• Side channel and cache-timing attacks ( Jiang et al., 2018 );

• Membership inference attacks ( Shokri et al., 2017 ; Salem et al., 2018 ).

Most of the aforementioned attacks are elaborated in previous sections. However, in the selected articles identified as either defense or attack-and-defense articles, some attacks are specifically created, for instance, GAN attacks, side channel and cache-timing attacks, and privacy threats. These attacks are therefore worth mentioning in this section to explain the specific countermeasures proposed against them in the defense articles.

Defenses against different attacks: To provide resilience against these attacks, the authors of selected articles proposed different defense algorithms, which are listed below against each type of attack.

• Extraction attacks: MiniONN ( Liu et al., 2017 ), rounding confidence, differential, and ensemble methods ( Tramèr et al., 2016 );

• Adversarial attacks: ReDCrypt ( Rouhani et al., 2018 ) and Arden ( Wang et al., 2018b );

• Inversion attacks: MiniONN ( Liu et al., 2017 ) and image disguising techniques ( Sharma and Chen, 2018 );

• Privacy attacks: encryption-based defense ( Hesamifard et al., 2017 ; Jiang et al., 2018 );

• Side channel and cache-timing attacks: encryption-based defense ( Hesamifard et al., 2017 ; Jiang et al., 2018 );

• Membership inference attack: dropout and model stacking ( Salem et al., 2018 ).

Target model(s): Different cloud-hosted ML/DL models have been used for the evaluation of the proposed defenses, as shown in Table 2 .


TABLE 2. Summary of attack types and corresponding defenses for cloud-based/third-party ML/DL models.

Dataset(s) used: The robustness of these defenses has been evaluated using various datasets, ranging from small datasets (e.g., MNIST ( Liu et al., 2017 ; Wang et al., 2018b ; Rouhani et al., 2018 ; Sharma and Chen, 2018 ) and CIFAR-10 ( Liu et al., 2017 ; Wang et al., 2018b ; Sharma and Chen, 2018 )) to larger datasets (e.g., the Iris dataset ( Tramèr et al., 2016 ), the fertility and climate datasets ( Hesamifard et al., 2017 ), and breast cancer ( Jiang et al., 2018 )). Other datasets include the Crab dataset ( Hesamifard et al., 2017 ), the Face and Traffic signs datasets ( Tramèr et al., 2016 ), SVHN ( Wang et al., 2018b ), and Edinburgh MI, WI-Breast Cancer, and MONKs Prob ( Jiang et al., 2018 ). Each of the defense techniques discussed above is mapped in Table 2 to the specific attack for which it was developed.

Measured outcomes: The measured outcomes on which the defenses are evaluated include response latency and message sizes ( Liu et al., 2017 ; Wang et al., 2018b ), throughput comparison ( Rouhani et al., 2018 ), average cache miss rates per second ( Sharma and Chen, 2018 ), AUC and space complexity to demonstrate approximate storage costs ( Jiang et al., 2018 ), classification accuracy of the model as well as running time ( Hesamifard et al., 2017 ; Sharma and Chen, 2018 ), similarity index ( Lei et al., 2020 ), and training time ( Hesamifard et al., 2017 ; Jiang et al., 2018 ).

5.2 Taxonomy of Defenses on Cloud-Hosted Machine Learning Model Attacks

In this section, we present a taxonomy and summary of the different defensive strategies against attacks on cloud-hosted ML/DL models described in the thematic analysis above. A taxonomy of these defense strategies is presented in Figure 9 and described next.


FIGURE 9. Taxonomy of different defenses proposed for defending attacks on the third-party cloud-hosted machine learning (ML) or deep learning (DL) models.

5.2.1 MiniONN

DNNs are vulnerable to model inversion and extraction attacks. Liu et al. (2017) proposed that, without making any changes to the training phase of the model, it is possible to convert the model into an oblivious neural network. They made nonlinear functions such as tanh and sigmoid more flexible, and by training the models on several datasets, they demonstrated significant results with minimal loss in accuracy. In addition, they implemented an offline precomputation phase to perform incremental encryption operations, along with a SIMD batch processing technique.

5.2.2 ReDCrypt

A reconfigurable hardware-accelerated framework is proposed by Rouhani et al. (2018) for protecting the privacy of deep neural models in cloud networks. The authors performed an innovative and power-efficient implementation of Yao's Garbled Circuit (GC) protocol on FPGAs for preserving privacy. The proposed framework was evaluated for different DL applications and achieved up to a 57-fold throughput gain per core.

5.2.3 Arden

To offload a large portion of DNN computation from mobile devices to the cloud while keeping the framework secure, a privacy-preserving mechanism named Arden is proposed by Wang et al. (2018b). Before the data are uploaded from the mobile device to the cloud, they are perturbed and noisy samples are included to keep them secure. To verify robustness, the authors performed a rigorous analysis based on three image datasets and demonstrated that this defense is capable of preserving user privacy along with inference performance.
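
The perturbation step can be sketched in a few lines; Laplace noise is used here purely for illustration, and Arden's actual mechanism (including its data nullification and privacy calibration) is more involved.

```python
import numpy as np

def perturb_before_offload(local_activation, scale=0.1):
    """Add calibrated random noise to the intermediate representation that a
    mobile device computes locally, before shipping it to the cloud for the
    remaining DNN layers."""
    noise = np.random.laplace(loc=0.0, scale=scale, size=local_activation.shape)
    return local_activation + noise
```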

5.2.4 Image Disguising Techniques

While leveraging services from a cloud GPU server, an adversary can realize an attack by introducing maliciously created training data, performing model inversion, and using the model to obtain desirable incentives and outcomes. To protect against such attacks and to preserve the data as well as the model, Sharma and Chen (2018) proposed an image disguising mechanism. They developed a toolkit that can be leveraged to calibrate certain parameter settings, and they claim that disguised images with block-wise permutation and transformations are resilient to GAN-based attacks and model inversion attacks.

5.2.5 Homomorphic Encryption

To make the cloud services of outsourced MLaaS secure, Hesamifard et al. (2017) proposed a privacy-preserving framework using homomorphic encryption. They trained the neural network using encrypted data and then performed encrypted predictions. The authors demonstrated that, by carefully choosing the polynomials of the activation functions to adapt neural networks, it is possible to achieve the desired accuracy along with privacy-preserving training and classification.
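
The role of polynomial activations can be illustrated with a quick least-squares fit: low-degree polynomials matter because homomorphic schemes evaluate only additions and multiplications. The degree and interval below are illustrative assumptions, not the values used by the cited work.

```python
import numpy as np

# Fit a degree-3 polynomial to the sigmoid over [-6, 6] as an HE-friendly
# replacement activation (illustrative degree and interval).
xs = np.linspace(-6, 6, 1000)
sigmoid = 1.0 / (1.0 + np.exp(-xs))
poly = np.poly1d(np.polyfit(xs, sigmoid, deg=3))

max_err = np.max(np.abs(poly(xs) - sigmoid))
print(f"max approximation error on [-6, 6]: {max_err:.4f}")
```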

In a similar study, to preserve the privacy of outsourced biomedical data and computation on public cloud servers, Jiang et al. (2018) built a homomorphically encrypted model that reinforces hardware security through Software Guard Extensions (SGX). They combined homomorphic encryption and SGX to devise a hybrid model for securing the model most commonly used for biomedical applications, that is, LR. The robustness of the Secure LR framework was evaluated on various datasets, and the authors also compared its performance with state-of-the-art secure LR solutions and demonstrated its superior efficiency.
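
As a toy illustration of computing on encrypted data, the sketch below scores a linear (LR-style) model over Paillier ciphertexts using the python-paillier (`phe`) library; this is a minimal additively homomorphic example, not the hybrid homomorphic encryption plus SGX scheme of Jiang et al., and the weights and inputs are made up.

```python
from phe import paillier  # python-paillier: additively homomorphic encryption

# Client generates keys and encrypts its private features.
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
features = [1.0, 2.0, 0.5]
enc_features = [public_key.encrypt(v) for v in features]

# Server holds the plaintext model and computes w.x + b on ciphertexts
# (Paillier supports ciphertext addition and multiplication by plaintext).
weights, bias = [0.8, -1.2, 0.5], 0.3
enc_score = sum(w * c for w, c in zip(weights, enc_features)) + bias

# Only the client can decrypt the resulting score.
print("w.x + b =", private_key.decrypt(enc_score))
```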

5.2.6 Pelican

Lei et al. (2020) proposed three mutation-based evasion attacks and a sample-based collision attack in white box, gray box, and black box scenarios. They evaluated the attacks and demonstrated a 100% attack success rate against Google's phishing page filter classifier and a transferability success rate of up to 81% against Bitdefender TrafficLight. To deal with such attacks and to increase the robustness of classifiers, they proposed a defense method known as Pelican.

5.2.7 Rounding Confidences and Differential Privacy

Tramèr et al. (2016) presented model extraction attacks against the online services of BigML and Amazon ML. The attacks are capable of model evasion and monetization and can compromise the privacy of training data. The authors also proposed and evaluated countermeasures, such as rounding confidences against equation-solving and decision tree pathfinding attacks; however, this defense has no impact on the regression tree model attack. For the protection of training data, differential privacy is proposed; this defense reduces the ability of an attacker to learn insights about the training dataset. The impact of both defenses is evaluated on the attacks for different models, and the authors also proposed ensemble models to mitigate the impact of attacks; however, their resilience was not evaluated.
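
The rounding countermeasure is simple to state in code; the number of decimals below is an illustrative parameter.

```python
import numpy as np

def round_confidences(probabilities, decimals=2):
    """Coarsen the confidence values returned by a prediction API so that
    equation-solving extraction attacks obtain less information per query."""
    rounded = np.round(np.asarray(probabilities, dtype=float), decimals)
    return rounded / rounded.sum()   # renormalize to keep a valid distribution
```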

5.2.8 Increasing Entropy and Reducing Precision

The training of attack models using shadow training techniques against black box models hosted on the cloud-based Google Prediction API and Amazon ML is studied by Shokri et al. (2017). The attack does not require prior knowledge of the training data distribution. The authors emphasize that, in order to protect the privacy of medical datasets or other public data, countermeasures should be designed, for instance, restricting the prediction vector to the top k classes, which prevents the leakage of important information, or rounding the classification probabilities in the prediction up or down. They show that regularization can be effective for coping with overfitting and increasing the randomness of the prediction vector.
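
The top-k restriction can likewise be sketched in a few lines; returning only labels, without the full confidence values, is one possible reading of the countermeasure.

```python
def restrict_to_top_k(probabilities, class_names, k=3):
    """Return only the k most likely class labels, withholding the full
    confidence vector that membership inference attacks exploit."""
    ranked = sorted(zip(probabilities, class_names), reverse=True)
    return [name for _, name in ranked[:k]]
```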

5.2.9 Dropout and Model Stacking

In the study by Salem et al. (2018), the authors created three diverse attacks and tested their applicability on eight datasets, of which six are the same as those used by Shokri et al. (2017); this work additionally includes a news dataset and a face dataset. In the threat model, the authors considered black box access to the target model, a supervised ML classifier trained for binary classification. To mitigate the privacy threats, the authors proposed a dropout-based method that reduces the impact of the attack by randomly deleting a proportion of edges in each training iteration in a fully connected neural network. The second defense strategy is model stacking, which hierarchically organizes multiple ML models to avoid overfitting. After extensive evaluation, these defense techniques showed the potential to mitigate the performance of the membership inference attack.
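
A minimal PyTorch sketch of the dropout-based defense follows; the layer sizes and dropout probability are illustrative assumptions.

```python
import torch.nn as nn

# Fully connected classifier with dropout after each hidden layer. Randomly
# dropping units during training reduces overfitting, which in turn makes
# the model's confidences on members and non-members harder to distinguish.
model = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 2),   # binary classification
)
```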

5.2.10 Randomness to Video Analysis Algorithms

Hosseini et al. designed two attacks specifically to analyze the robustness of video classification and shot detection (Hosseini et al., 2017). The attack can subtly manipulate the content of a video in such a way that the manipulation is undetectable by humans while the output of the automatic video analysis method is altered. Because the API generates video and shot labels by processing only the first frame of every second of video, the attack can successfully deceive it. To deal with the shot removal and generation attacks, the authors proposed the inclusion of randomness to enhance the robustness of the algorithms. However, while the authors thoroughly evaluated the applicability of these attacks in different video settings, the proposed defense is not rigorously evaluated.

5.2.11 Neuron Distance Threshold and Obfuscation

Transfer learning is an effective technique for quickly building DL student models, in which knowledge from a Teacher model is transferred to a Student model. However, Wang et al. (2018a) showed that, due to the centralization of model training, black-box Student models for image recognition become more vulnerable to misclassification attacks. The authors proposed several defenses to mitigate the impact of such attacks, such as changing the internal representation of the Student model relative to the Teacher model. Other defense methods include increasing dropout randomization to alter the Student model’s training process, modifying input data before classification, adding redundancy, and using an orthogonal model against transfer learning attacks. The authors analyzed the robustness of these defenses and demonstrated that the neuron distance threshold is the most effective at obfuscating the identity of the Teacher model.

6 Pitfalls and Limitations

6.1 Lack of Attack Diversity

The attacks presented in the selected articles have limited scope and lack diversity; that is, they are limited to specific settings, and the variability of attacks is limited as well. However, attack diversity is an important consideration when developing robust attacks from the adversary’s perspective, since it makes detection and prevention of the attacks difficult, and it ultimately helps in the development of robust defense strategies. Moreover, the empirical evaluation of attack variability can identify potential vulnerabilities of cybersecurity systems. Therefore, to build more robust defense solutions, it is important to test model robustness under a diverse set of attacks.

6.2 Lack of Consideration for Adaptable Adversaries

Most of the defenses in the systematically reviewed articles are proposed for a specific attack and do not consider adaptable adversaries. In practice, however, adversarial attacks are an arms race between attackers and defenders: attackers continuously evolve and enhance their knowledge and attack strategies to evade the underlying defensive system. Therefore, considering adaptable adversaries is crucial for developing a robust and long-lasting defense mechanism. If this is not considered, an adversary will adapt to the defensive system over time and bypass it to obtain the intended behavior or outcomes.

6.3 Limited Progress in Developing Defenses

Of the systematically selected articles collected from different databases, only 12 present defense methods, compared with 19 articles that focus on attacks. Of these 12 articles, six only discuss a defense strategy and six develop a defense against a particular attack. This indicates limited activity from the research community in developing defense strategies for attacks already proposed in the literature. In addition, the proposed defenses only mitigate or detect the attacks for which they were developed and are therefore not generalizable. By contrast, the increasing interest in developing new attacks and the popularity of cloud-hosted/third-party services demand a proportionate effort in developing defense systems as well.

7 Open Research Issues

7.1 Adversarially Robust Machine Learning Models

In recent years, adversarial ML attacks have emerged as a major threat to ML/DL models, and the systematically selected articles highlight this threat for cloud-hosted ML/DL models as well. Moreover, the diversity of these attacks is increasing drastically compared with the available defensive strategies, which can pose serious challenges and consequences for the security of cloud-hosted ML/DL models. Each defense method presented in the literature so far has been shown to be resilient only to particular attacks realized in specific settings, and fails to withstand stronger and unseen attacks. Therefore, the development of adversarially robust ML/DL models remains an open research problem, and the literature suggests that worst-case robustness analysis should be performed under adversarial ML settings (Qayyum et al., 2020a; Qayyum et al., 2020b; Ilahi et al., 2020). In addition, it has been argued that most ML developers and security incident responders are unequipped with the tools required for securing industry-grade ML systems against adversarial ML attacks (Kumar et al., 2020). This indicates the increasing need for the development of defense strategies for securing ML/DL models against adversarial ML attacks.

7.2 Privacy-Preserving Machine Learning Models

In cloud-hosted ML services, preserving user privacy is fundamentally important and a matter of high concern. It is also desirable that ML models built using users’ data should not learn information that can compromise the privacy of individuals. However, the literature on developing privacy-preserving ML/DL models or MLaaS is limited. Moreover, one of the privacy-preserving techniques that has been used to build a defense for cloud-hosted ML/DL models, the homomorphic encryption-based protocol (Jiang et al., 2018), has been shown to be vulnerable to model extraction attacks (Reith et al., 2019). Therefore, the development of privacy-preserving ML models for cloud computing platforms is another open research problem.

7.3 Proxy Metrics for Evaluating Security and Robustness

From the systematically reviewed literature on the security of cloud-hosted ML/DL models, we observe that interest from the research community in developing novel security-centric proxy metrics for evaluating security threats and the robustness of cloud-hosted models is very limited. However, with the increasing proliferation of cloud-hosted ML services (i.e., MLaaS) and the development of different attacks (e.g., adversarial ML attacks), effective and scalable metrics for evaluating the robustness of ML/DL models against different attacks and defense strategies are required.

8 Threats to Validity

We now briefly reflect on our methodology in order to identify any threats to the validity of our findings. First, internal validity is maintained as the research questions we pose in Section 2.2 capture the objectives of the study. Construct validity relies on a sound understanding of the literature and how it represents the state of the field. A detailed study of the reviewed articles along with deep discussions between the members of the research team helped ensure the quality of this understanding. Note that the research team is of diverse skills and expertise in ML, DL, cloud computing, ML/DL security, and analytics. Also, the inclusion and exclusion criteria (Section 2.3) help define the remit of our survey. Data extraction is prone to human error as is always the case. This was mitigated by having different members of the research team review each reviewed article. However, we did not attempt to evaluate the quality of the reviewed studies or validate their content due to time constraints. In order to minimize selection bias, we cast a wide net in order to capture articles from different communities publishing in the area of MLaaS via a comprehensive set of bibliographical databases without discriminating based on the venue/source.

9 Conclusion

In this article, we presented a systematic review of the literature focused on the security of cloud-hosted ML/DL models, also known as MLaaS. The relevant articles were collected from major publishers and databases, including ACM Digital Library, IEEE Xplore, ScienceDirect, the International Conference on Machine Learning, the International Conference on Learning Representations, the Journal of Machine Learning Research, USENIX, Neural Information Processing Systems, and arXiv. For the selection of articles, we developed a review protocol with inclusion and exclusion criteria, analyzed the selected articles that fulfill these criteria across two dimensions (attacks and defenses) on MLaaS, and provided a thematic analysis of these articles across five attack and five defense themes, respectively. We also identified limitations and pitfalls in the reviewed literature and, finally, highlighted various open research issues that require further investigation.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

Author Contributions

AQ led the work in writing the manuscript and performed the annotation of the data and analysis as well. AI performed data acquisition, annotation, and analysis from four venues, and contributed to the paper write-up. MU contributed to writing a few sections, did annotations of papers, and helped in analysis. WI performed data scrapping, annotation, and analysis from four venues, and helped in developing graphics. All the first four authors validated the data, analysis, and contributed to the interpretation of the results. AQ and AI helped in developing and refining the methodology for this systematic review. JQ conceived the idea and supervises the overall work. JQ, YEK, and AF provided critical feedback and helped shape the research, analysis, and manuscript. All authors contributed to the final version of the manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

1 We use MLaaS to cover both ML and DL as a Service cloud provisions.

2 https://cloud.google.com/ml-engine/ .

3 A popular Python library for DL.

4 https://azure.microsoft.com/en-us/services/machine-learning-service/ .

5 https://docs.aws.amazon.com/dlami/latest/devguide/AML2_0.html .

6 Backdoor attacks on cloud-hosted models can be further categorized into three categories ( Chen et al., 2020 ): 1) complete model–based attacks, 2) partial model–based attacks, and 3) model-free attacks.

Akhtar, N., and Mian, A. (2018). Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, 14410–14430. doi:10.1109/access.2018.2807385


Apruzzese, G., Colajanni, M., Ferretti, L., and Marchetti, M. (2019). “Addressing adversarial attacks against security systems based on machine learning,” in 2019 11th International conference on cyber conflict (CyCon) , Tallinn, Estonia , May 28–31, 2019 ( IEEE ), 900, 1–18


Brendel, W., Rauber, J., and Bethge, M. (2017). “Decision-based adversarial attacks: reliable attacks against black-box machine learning models,” in International Conference on Learning Representations (ICLR)

Chen, S., Xue, M., Fan, L., Hao, S., Xu, L., Zhu, H., et al. (2018). Automated poisoning attacks and defenses in malware detection systems: an adversarial machine learning approach. Comput. Secur. 73, 326–344. doi:10.1016/j.cose.2017.11.007

Chen, X., Liu, C., Li, B., Lu, K., and Song, D. (2017). Targeted backdoor attacks on deep learning systems using data poisoning. arXiv

Chen, Y., Gong, X., Wang, Q., Di, X., and Huang, H. (2020). Backdoor attacks and defenses for deep neural networks in outsourced cloud environments. IEEE Network 34 (5), 141–147. doi:10.1109/MNET.011.1900577

Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). “Torch7: a Matlab-like environment for machine learning,” in BigLearn, NIPS workshop .

Correia-Silva, J. R., Berriel, R. F., Badue, C., de Souza, A. F., and Oliveira-Santos, T. (2018). “Copycat CNN: stealing knowledge by persuading confession with random non-labeled data,” in 2018 International joint conference on neural networks (IJCNN) , Rio de Janeiro, Brazil , July 8–13, 2018 ( IEEE ), 1–8

Demetrio, L., Valenza, A., Costa, G., and Lagorio, G. (2020). “Waf-a-mole: evading web application firewalls through adversarial machine learning,” in Proceedings of the 35th annual ACM symposium on applied computing , Brno, Czech Republic , March 2020 , 1745–1752

Gong, Y., Li, B., Poellabauer, C., and Shi, Y. (2019). “Real-time adversarial attacks,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI) , Macao, China , August 2019

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv

Gu, T., Liu, K., Dolan-Gavitt, B., and Garg, S. (2019). BadNets: evaluating backdooring attacks on deep neural networks. IEEE Access 7, 47230–47244. doi:10.1109/access.2019.2909068

Han, D., Wang, Z., Zhong, Y., Chen, W., Yang, J., Lu, S., et al. (2020). Practical traffic-space adversarial attacks on learning-based NIDSs. arXiv

Hesamifard, E., Takabi, H., Ghasemi, M., and Jones, C. (2017). “Privacy-preserving machine learning in cloud,” in Proceedings of the 2017 on cloud computing security workshop , 39–43

Hilprecht, B., Härterich, M., and Bernau, D. (2019). “Monte Carlo and reconstruction membership inference attacks against generative models,” in Proceedings on Privacy Enhancing Technologies, Stockholm, Sweden, July 2019, 232–249

Hitaj, D., Hitaj, B., and Mancini, L. V. (2019). “Evasion attacks against watermarking techniques found in MLaaS systems,” in 2019 sixth international conference on software defined systems (SDS) , Rome, Italy , June 10–13, 2019 ( IEEE )

Hosseini, H., Xiao, B., Clark, A., and Poovendran, R. (2017). “Attacking automatic video analysis algorithms: a case study of google cloud video intelligence API,” in Proceedings of the 2017 conference on multimedia Privacy and security (ACM) , 21–32

Ilahi, I., Usama, M., Qadir, J., Janjua, M. U., Al-Fuqaha, A., Hoang, D. T., et al. (2020). Challenges and countermeasures for adversarial attacks on deep reinforcement learning. arXiv

Ji, Y., Zhang, X., Ji, S., Luo, X., and Wang, T. (2018). “Model-reuse attacks on deep learning systems,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (New York, NY: ACM), December 2018, 349–363

Jiang, Y., Hamer, J., Wang, C., Jiang, X., Kim, M., Song, Y., et al. (2018). SecureLR: secure logistic regression model via a hybrid cryptographic protocol. IEEE ACM Trans. Comput. Biol. Bioinf. 16, 113–123. doi:10.1109/TCBB.2018.2833463

Joshi, N., and Tammana, R. (2019). “GDALR: an efficient model duplication attack on black box machine learning models,” in 2019 IEEE international Conference on system, computation, Automation and networking (ICSCAN) , Pondicherry, India , March 29–30, 2019 ( IEEE ), 1–6

Kesarwani, M., Mukhoty, B., Arya, V., and Mehta, S. (2018). Model extraction warning in MLaaS paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference (ACM) , 371–380

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems , 1097–1105 Available at: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Kumar, R. S. S., Nyström, M., Lambert, J., Marshall, A., Goertzel, M., Comissoneru, A., et al. (2020). Adversarial machine learning–industry perspectives. arXiv . Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3532474

Lei, Y., Chen, S., Fan, L., Song, F., and Liu, Y. (2020). Advanced evasion attacks and mitigations on practical ML-based phishing website classifiers. arXiv

Liang, B., Su, M., You, W., Shi, W., and Yang, G. (2016). “Cracking classifiers for evasion: a case study on Google’s phishing pages filter,” in Proceedings of the 25th international conference on world wide web, Montréal, QC, Canada, 345–356

Liao, C., Zhong, H., Zhu, S., and Squicciarini, A. (2018). “Server-based manipulation attacks against machine learning models,” in Proceedings of the eighth ACM conference on data and application security and privacy (ACM) , New York, NY , March 2018 , 24–34

Liu, J., Juuti, M., Lu, Y., and Asokan, N. (2017). “Oblivious neural network predictions via MiniONN transformations,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, October 2017, 619–631

Liu, T., Wen, W., and Jin, Y. (2018). “SIN 2: stealth infection on neural network—a low-cost agile neural Trojan attack methodology,” in 2018 IEEE international symposium on hardware oriented security and trust (HOST) , Washington, DC , April 30–4 May, 2018 ( IEEE ), 227–230

Nguyen, T. N. (2017). Attacking machine learning models as part of a cyber kill chain. arXiv

Parkhi, O. M., Vedaldi, A., Zisserman, A., et al. (2015). Deep face recognition. BMVC 1, 6. doi:10.5244/C.29.41

Qayyum, A., Qadir, J., Bilal, M., and Al-Fuqaha, A. (2020a). Secure and robust machine learning for healthcare: a survey. IEEE Rev. Biomed. Eng. , 1. doi:10.1109/RBME.2020.3013489

Qayyum, A., Usama, M., Qadir, J., and Al-Fuqaha, A. (2020b). Securing connected & autonomous vehicles: challenges posed by adversarial machine learning and the way forward. IEEE Commun. Surv. Tutorials 22, 998–1026. doi:10.1109/comst.2020.2975048

Reith, R. N., Schneider, T., and Tkachenko, O. (2019). “Efficiently stealing your machine learning models,” in Proceedings of the 18th ACM workshop on privacy in the electronic society , November 2019 , 198–210

Rouhani, B. D., Hussain, S. U., Lauter, K., and Koushanfar, F. (2018). Redcrypt: real-time privacy-preserving deep learning inference in clouds using fpgas. ACM Trans. Reconfigurable Technol. Syst. 11, 1–21. doi:10.1145/3242899

Saadatpanah, P., Shafahi, A., and Goldstein, T. (2019). Adversarial attacks on copyright detection systems. arXiv .

Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., and Backes, M. (2018). ML-leaks: model and data independent membership inference attacks and defenses on machine learning models. arXiv .

Sehwag, V., Bhagoji, A. N., Song, L., Sitawarin, C., Cullina, D., Chiang, M., et al. (2019). Better the devil you know: an analysis of evasion attacks using out-of-distribution adversarial examples. arXiv .

Sethi, T. S., and Kantardzic, M. (2018). Data driven exploratory attacks on black box classifiers in adversarial domains. Neurocomputing 289, 129–143. doi:10.1016/j.neucom.2018.02.007

Sharma, S., and Chen, K. (2018). “Image disguising for privacy-preserving deep learning,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (Toronto, Canada: ACM), 2291–2293

Shokri, R., Stronati, M., Song, C., and Shmatikov, V. (2017). “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and privacy (SP) , San Jose, CA , May 22–26, 2017 ( IEEE ), 3–18

Simonyan, K., and Zisserman, A. (2015). “Very deep convolutional networks for large-scale image recognition,”in International Conference on Learning Representations (ICLR)

Song, Y., Liu, T., Wei, T., Wang, X., Tao, Z., and Chen, M. (2020). FDA3: federated defense against adversarial attacks for cloud-based IIoT applications. IEEE Trans. Industr. Inform., 1. doi:10.1109/TII.2020.3005969

Sun, Y., Wang, X., and Tang, X. (2014). “Deep learning face representation from predicting 10,000 classes,” in Proceedings of the IEEE conference on computer vision and pattern recognition , Columbus, OH , June 23–28, 2014 , ( IEEE ).

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, June 27–30, 2016 (IEEE), 2818–2826

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. (2016). “Stealing machine learning models via prediction APIs,” in 25th USENIX security symposium (USENIX Security 16) , 601–618

Tyndall, J. (2010). AACODS checklist. Adelaide, Australia: Flinders University

Usama, M., Mitra, R. N., Ilahi, I., Qadir, J., and Marina, M. K. (2020a). Examining machine learning for 5g and beyond through an adversarial lens. arXiv . Available at: https://arxiv.org/abs/2009.02473 .

Usama, M., Qadir, J., Al-Fuqaha, A., and Hamdi, M. (2020b). The adversarial machine learning conundrum: can the insecurity of ML become the Achilles’ heel of cognitive networks? IEEE Network 34, 196–203. doi:10.1109/mnet.001.1900197

Usama, M., Qayyum, A., Qadir, J., and Al-Fuqaha, A. (2019). “Black-box adversarial machine learning attack on network traffic classification,” in 2019 15th international wireless communications and mobile computing conference (IWCMC), Tangier, Morocco, June 24–28, 2019

Wang, B., Yao, Y., Viswanath, B., Zheng, H., and Zhao, B. Y. (2018a). “With great training comes great vulnerability: practical attacks against transfer learning,” in 27th USENIX security symposium (USENIX Security 18) , Baltimore, MD , August 2018 , 1281–1297

Wang, J., Zhang, J., Bao, W., Zhu, X., Cao, B., and Yu, P. S. (2018b). “Not just privacy: improving performance of private deep learning in mobile cloud,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining London, United Kingdom , January 2018 , 2407–2416

Yang, Z., Zhang, J., Chang, E.-C., and Liang, Z. (2019). “Neural network inversion in adversarial setting via background knowledge alignment,” in Proceedings of the 2019 ACM SIGSAC conference on computer and communications security , London, UK , November 2019 , 225–240

Yuan, X., He, P., Zhu, Q., and Li, X. (2019). Adversarial examples: attacks and defenses for deep learning. IEEE Trans. Neural. Netw. Learn. Syst. 30 (9), 2805–2824. doi:10.1109/TNNLS.2018.2886017

Zhang, J., Zhang, B., and Zhang, B. (2019). “Defending adversarial attacks on cloud-aided automatic speech recognition systems,” in Proceedings of the seventh international workshop on security in cloud computing, New York, 23–31. Available at: https://dl.acm.org/doi/proceedings/10.1145/3327962

Keywords: Machine Learning as a Service, cloud-hosted machine learning models, machine learning security, cloud machine learning security, systematic review, attacks, defenses

Citation: Qayyum A, Ijaz A, Usama M, Iqbal W, Qadir J, Elkhatib Y and Al-Fuqaha A (2020) Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security. Front. Big Data 3:587139. doi: 10.3389/fdata.2020.587139

Received: 24 July 2020; Accepted: 08 October 2020; Published: 12 November 2020.


Copyright © 2020 Qayyum, Ijaz, Usama, Iqbal, Qadir, Elkhatib and Al-Fuqaha. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Adnan Qayyum, [email protected]

This article is part of the Research Topic

Safe and Trustworthy Machine Learning



Amazon Aurora: Design considerations for high throughput cloud-native relational databases

https://www.amazon.science/publications/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases


Cloud Computing: Recently Published Documents


Simulation and performance assessment of a modified throttled load balancing algorithm in cloud computing environment

Load balancing is crucial to ensure scalability, reliability, minimize response time, and processing time and maximize resource utilization in cloud computing. However, the load fluctuation accompanied with the distribution of a huge number of requests among a set of virtual machines (VMs) is challenging and needs effective and practical load balancers. In this work, a two listed throttled load balancer (TLT-LB) algorithm is proposed and further simulated using the CloudAnalyst simulator. The TLT-LB algorithm is based on the modification of the conventional TLB algorithm to improve the distribution of the tasks between different VMs. The performance of the TLT-LB algorithm compared to the TLB, round robin (RR), and active monitoring load balancer (AMLB) algorithms has been evaluated using two different configurations. Interestingly, the TLT-LB significantly balances the load between the VMs by reducing the loading gap between the heaviest loaded and the lightest loaded VMs to be 6.45% compared to 68.55% for the TLB and AMLB algorithms. Furthermore, the TLT-LB algorithm considerably reduces the average response time and processing time compared to the TLB, RR, and AMLB algorithms.

An improved forensic-by-design framework for cloud computing with systems engineering standard compliance

Reliability of trust management systems in cloud computing

Cloud computing is an innovation that delivers services such as software, platform, and infrastructure over the web. This computing structure is widespread and dynamic, operates on a pay-per-use model, and supports virtualization. Cloud computing is expanding quickly among consumers, and many organizations offer services through the web. It provides flexible, on-demand services but still faces various security threats. Its dynamic nature allows it to be customized to user and provider requirements, which is an outstanding benefit of cloud computing. On the other hand, this also creates trust issues and concerns around security, privacy, identity, and legitimacy. Thus, the major challenge in the cloud environment is selecting a suitable provider. For this, the trust mechanism plays a critical role, based on the evaluation of QoS and feedback ratings. Nonetheless, various challenges remain in trust management systems for monitoring and evaluating QoS. This paper discusses the current obstacles in trust systems; its goal is to review the available trust models. Issues such as insufficient trust between the provider and client, which create problems in data sharing, are also addressed. Furthermore, it lays out the limitations of current systems and their possible enhancements to help researchers who intend to investigate this topic.

Cloud Computing Adoption in the Construction Industry of Singapore: Drivers, Challenges, and Strategies

An extensive review of web-based multi-granularity service composition

The paper reviews efforts to compose SOAP, non-SOAP, and non-web services. Traditionally, efforts focused on composite SOAP services; however, they did not include RESTful and non-web services. A SOAP service uses a structured exchange methodology for dealing with web services, while a non-SOAP service follows a different approach. The paper reviews the invocation and composition of combinations of SOAP, non-SOAP, and non-web services into a composite process that executes complex tasks on various devices. It also shows the systematic integration of SOAP, non-SOAP, and non-web services, describing the composition of heterogeneous services from the perspective of resource consumption, in contrast to conventional approaches. The paper further compares and reviews different layout models for the discovery, selection, and composition of services in Cloud computing. Recent research trends in service composition are identified, and research on microservices is evaluated and presented in tables and graphs.

Integrated Blockchain and Cloud Computing Systems: A Systematic Survey, Solutions, and Challenges

Cloud computing is a network model of on-demand access for sharing configurable computing resource pools. Compared with conventional service architectures, cloud computing introduces new security challenges in secure service management and control, privacy protection, data integrity protection in distributed databases, data backup, and synchronization. Blockchain can be leveraged to address these challenges, partly due to its underlying characteristics such as transparency, traceability, decentralization, security, immutability, and automation. We present a comprehensive survey of how blockchain is applied to provide security services in the cloud computing model, and we analyze the research trends of blockchain-related techniques in current cloud computing models. In the course of the review, we also briefly investigate how cloud computing can affect blockchain, especially the performance improvements that cloud computing can provide for the blockchain. Our contributions include the following: (i) summarizing the possible architectures and models of the integration of blockchain and cloud computing and the roles of cloud computing in blockchain; (ii) classifying and discussing recent, relevant works based on different blockchain-based security services in the cloud computing model; (iii) briefly investigating what improvements cloud computing can provide for the blockchain; (iv) introducing the current development status of the industry/major cloud providers in the direction of combining cloud and blockchain; (v) analyzing the main barriers and challenges of integrated blockchain and cloud computing systems; and (vi) providing recommendations for future research and improvement on the integration of blockchain and cloud systems.

Cloud Computing and Undergraduate Researches in Universities in Enugu State: Implication for Skills Demand

Cloud building block chip for creating FPGA and ASIC clouds

Hardware-accelerated cloud computing systems based on FPGA chips (FPGA cloud) or ASIC chips (ASIC cloud) have emerged as a new technology trend for power-efficient acceleration of various software applications. However, the operating systems and hypervisors currently used in cloud computing will lead to power, performance, and scalability problems in an exascale cloud computing environment. Consequently, the present study proposes a parallel hardware hypervisor system that is implemented entirely in special-purpose hardware, and that virtualizes application-specific multi-chip supercomputers, to enable virtual supercomputers to share available FPGA and ASIC resources in a cloud system. In addition to the virtualization of multi-chip supercomputers, the system’s other unique features include simultaneous migration of multiple communicating hardware tasks, and on-demand increase or decrease of hardware resources allocated to a virtual supercomputer. Partitioning the flat hardware design of the proposed hypervisor system into multiple partitions and applying the chip unioning technique to its partitions, the present study introduces a cloud building block chip that can be used to create FPGA or ASIC clouds as well. Single-chip and multi-chip verification studies were performed to verify the functional correctness of the hypervisor system, which consumes only a fraction (10%) of hardware resources.

Study On Social Network Recommendation Service Method Based On Mobile Cloud Computing

Cloud-based network virtualization in IoT with OpenStack

In Cloud computing deployments, specifically in the Infrastructure-as-a-Service (IaaS) model, networking is one of the core enabling facilities provided for the users. The IaaS approach ensures significant flexibility and manageability, since the networking resources and topologies are entirely under users’ control. In this context, considerable efforts have been devoted to promoting the Cloud paradigm as a suitable solution for managing IoT environments. Deep and genuine integration between the two ecosystems, Cloud and IoT, may only be attainable at the IaaS level. In light of extending IoT domain capabilities with Cloud-based mechanisms akin to the IaaS Cloud model, network virtualization is a fundamental enabler of infrastructure-oriented IoT deployments. Indeed, an IoT deployment without networking resilience and adaptability is unsuitable to meet user-level demands and services’ requirements. Such a limitation confines IoT-based services to very specific and statically defined scenarios, thus leading to limited plurality and diversity of use cases. This article presents a Cloud-based approach for network virtualization in an IoT context using the de-facto standard IaaS middleware, OpenStack, and its networking subsystem, Neutron. OpenStack is being extended to enable the instantiation of virtual/overlay networks between Cloud-based instances (e.g., virtual machines, containers, and bare metal servers) and/or geographically distributed IoT nodes deployed at the network edge.


Studies on the Value of Data

The U.S. Bureau of Economic Analysis has undertaken a series of studies that present methods for quantifying the value of simple data, as distinguished from the complex data created by highly skilled workers studied in Calderón and Rassier (2022). Preliminary studies in this series focus on tax data, individual credit data, and driving data. Additional examples include medical records, educational transcripts, business financial records, customer data, equipment maintenance histories, social media profiles, tourist maps, and many more. If new case studies under this topic are released, they will be added to the listing below.

  • Capitalizing Data: Case Studies of Driving Records and Vehicle Insurance Claims | April 2024
  • Private Funding of “Free” Data: A Theoretical Framework | April 2024
  • Capitalizing Data: Case Studies of Tax Forms and Individual Credit Reports | June 2023

Rachel Soloveichik

JEL Code(s): E01. Published April 2024.


Database 23ai: Feature Highlights

Learn how Oracle Database 23ai brings AI to your data, making it simple to power app development and mission-critical workloads with AI. Each week, we'll share a new feature of Oracle Database 23ai with examples so you can get up and running quickly. Save this page and check back each week to see new highlighted features.


Oracle Database 23ai Feature highlights for developers

Check out some of the features we’ve built with developers in mind:

AI Vector Search brings AI to your data by letting you build generative AI pipelines using your business data, directly within the database. Easy-to-use native vector capabilities let your developers build next-gen AI applications that combine relational database processing with similarity search and retrieval augmented generation. Running vector search directly on your business data eliminates data movement as well as the complexity, cost, and data consistency headaches of managing and integrating multiple databases.
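To make the syntax concrete, here is a minimal sketch; the docs table, its columns, the embedding dimensions, and the :query_vec bind variable are illustrative assumptions rather than part of the page above:

    -- Store one embedding per document (type and dimensions are illustrative)
    CREATE TABLE docs (
      id        NUMBER PRIMARY KEY,
      body      CLOB,
      embedding VECTOR(768, FLOAT32)
    );

    -- Return the five documents most similar to a query embedding
    SELECT id
    FROM   docs
    ORDER  BY VECTOR_DISTANCE(embedding, :query_vec, COSINE)
    FETCH FIRST 5 ROWS ONLY;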

Other features developers should get to know include:

  • JSON Relational Duality
  • Property Graph

Previously highlighted features

  • Administration/Performance
  • Languages/Drivers
  • SQL/Data Types
  • Transactions/Microservices

Application availability—zero downtime for database clients

Transparent Application Continuity shields C/C++, Java, .NET, Python, and Node.js applications from the outages of underlying software, hardware, communications, and storage layers. With Oracle Real Application Clusters (RAC), Active Data Guard (ADG), and Autonomous Database (Shared and Dedicated), Oracle Database remains accessible even when a node or a subset of the RAC cluster fails or is taken offline for maintenance.

Oracle Database 23c brings many new enhancements, including batch application support such as open cursors, also called session-state stable cursors.

  • HikariCP Best Practices for Oracle Database and Spring Boot
  • Auditing Enhancements in Oracle Database 23c
  • How to Make Application Continuity Most Effective in Oracle Database 23c
  • Oracle .NET Application Continuity — Getting Started

Documentation

  • ODP.NET and Application Continuity
  • Application Continuity for Java
  • OCI and Application Continuity

Automatic Transaction Rollback

If a transaction does not commit or roll back for a long time while holding row locks, it can potentially block other high-priority transactions. This feature allows applications to assign priorities to transactions, and administrators to set timeouts for each priority. The database will automatically roll back a lower-priority transaction and release the row locks it holds if it blocks a higher-priority transaction beyond the set timeout, allowing the higher-priority transaction to proceed.

Automatic Transaction Rollback reduces the administrative burden while also helping to maintain transaction latencies/SLAs on higher-priority transactions.
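As a flavor of the configuration, here is a minimal sketch; the parameter names follow the 23c documentation as best understood, and the 60-second target and priority value are illustrative:

    -- Administrator: roll back a MEDIUM-priority transaction that blocks a
    -- higher-priority one for more than 60 seconds
    ALTER SYSTEM SET txn_auto_rollback_medium_priority_wait_target = 60;

    -- Application session: tag its transactions as high priority
    ALTER SESSION SET txn_priority = 'HIGH';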

  • Automatic Transaction Rollback in Database 23c with high-, medium-, and low-priority transactions
  • Automatic Transaction Rollback in Oracle Database 23c—Is this the end of Row Lock Contention in Oracle Database?
  • Managing Transactions

DBMS_SEARCH

DBMS_SEARCH implements Oracle Text ubiquitous search. DBMS_SEARCH makes it very easy to create a single index over multiple tables and views. Just create a DBMS_SEARCH index and add tables and views. All searchable values, including VARCHAR, CLOB, JSON, and numeric columns will be included in the index, which is automatically maintained as the table or view contents change.
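A minimal sketch of this flow follows; the index name and source tables are hypothetical, and the final query assumes the index table exposes JSON columns named DATA and METADATA as in Oracle's 23c examples, so treat those names as an assumption to verify:

    -- One ubiquitous search index spanning several tables and views
    EXEC DBMS_SEARCH.CREATE_INDEX('APP_SEARCH_IDX');
    EXEC DBMS_SEARCH.ADD_SOURCE('APP_SEARCH_IDX', 'CUSTOMERS');
    EXEC DBMS_SEARCH.ADD_SOURCE('APP_SEARCH_IDX', 'ORDERS_V');

    -- Search across everything that was added (column names assumed)
    SELECT metadata
    FROM   app_search_idx
    WHERE  CONTAINS(data, 'acme') > 0;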

  • Oracle 23c DBMS_SEARCH—Ubiquitous Search
  • Easy Text Search over Multiple Tables and Views with DBMS_SEARCH in Oracle Database 23c
  • DBMS_SEARCH Package
  • Performing Ubiquitous Database Search with the DBMS_SEARCH APIs

Fast Ingest enhancements

We've added enhancements to Memoptimized Rowstore Fast Ingest with support for partitioning, compressed tables, fast flush using direct writes, and direct in-memory column store population support. These enhancements make the Fast Ingest feature easier to incorporate in more situations where fast data ingest is required. Now Oracle Database provides better support for applications requiring fast data ingest capabilities. Data can be ingested and then processed all in the same database. This reduces the need for special loading environments and thus reduces complexity and data redundancy.
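For illustration, a minimal sketch of the usage pattern (the table and its columns are hypothetical):

    -- Opt the table into the Memoptimized Rowstore write path
    CREATE TABLE sensor_readings (
      sensor_id NUMBER,
      ts        TIMESTAMP,
      reading   NUMBER
    ) MEMOPTIMIZE FOR WRITE;

    -- The hint routes the insert through the buffered fast-ingest path,
    -- where writes are batched and persisted asynchronously
    INSERT /*+ MEMOPTIMIZE_WRITE */ INTO sensor_readings
    VALUES (42, SYSTIMESTAMP, 98.6);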

  • Oracle Database 23c Fast Ingest Enhancements
  • Memoptimized Rowstore—Fast Ingest Updates
  • Enabling High Performance Data Streaming with the Memoptimized Rowstore

Raft-based replication in Globally Distributed Database

Oracle Globally Distributed Database introduced the Raft replication feature in Oracle Database 23c. This allows very fast (sub-three-second) failover with zero data loss in the case of a node or data center outage. Raft replication uses a consensus-based commit protocol and is configured declaratively by specifying the replication factor. All shards in a Distributed Database act as leaders and followers for a subset of data. This enables an active/active/active symmetric distributed database architecture in which all shards serve application traffic.

This helps improve availability with zero data loss, simplify management, and optimize hardware utilization for Globally Distributed Database environments.

  • Oracle Globally Distributed Database supports Raft replication in Oracle Database 23c
  • Using Raft replication in Oracle Globally Distributed Database

SQL Analysis Report

This week we’re turning the spotlight on SQL Analysis Report, an easy-to-use feature that helps developers write better SQL statements. SQL Analysis Report reports common issues with SQL statements, particularly those that can lead to poor SQL performance. It’s available in DBMS_XPLAN and SQL Monitor.
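As an illustrative sketch (the employees table is hypothetical): explaining a statement that compares a NUMBER column with a string literal should surface an implicit-conversion warning in the SQL Analysis Report section at the end of the plan output.

    EXPLAIN PLAN FOR
      SELECT * FROM employees WHERE employee_id = '1042';

    -- In 23c the plan output ends with a "SQL Analysis Report" section
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());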

  • SQL Analysis Report in Oracle Database 23c


Blockchain tables


Blockchain and immutable tables, available since the release of Oracle Database 19c, use crypto-secure methods to help protect data from tampering or deletion by external hackers and rogue or compromised insiders. This includes insert-only restrictions that prevent updates or deletions (even by DBAs), cryptographic hash chains to enable verification, signed table digests to detect any large-scale rollbacks, and end user signing of inserted rows using their private keys. Oracle Database 23c introduces many enhancements, including support for logical replication via Oracle GoldenGate and rolling upgrades using Active Data Guard, support for distributed transactions that involve blockchain tables, efficient partition-based bulk dropping for expired rows, and performance optimizations for inserts/commits.

This release also introduces the ability to add/drop columns without impacting cryptographic hash chaining, user-specific chains and table digests for filtered rows, delegate-signing capability, and database countersigning. It also expands crypto-secure data management to regular tables by enabling an audit of historical changes to a non-blockchain table via Flashback archive defined to use a blockchain history table.

Great for built-in audit trail or journaling use cases, these capabilities can be used for financial ledgers, payments history, regulated compliance tracking, legal logs, and any data representing assets where tampering or deletions could lead to significant legal, reputation, or financial consequences.
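A minimal sketch of the syntax follows; the table name, columns, and retention periods are illustrative, and VERSION "v2" is assumed to select the newer 23c row format:

    CREATE BLOCKCHAIN TABLE payments_ledger (
      payment_id NUMBER,
      payer      VARCHAR2(128),
      amount     NUMBER
    )
    NO DROP UNTIL 31 DAYS IDLE            -- table cannot be dropped while active
    NO DELETE UNTIL 16 DAYS AFTER INSERT  -- rows immutable for 16 days
    HASHING USING "SHA2_512"
    VERSION "v2";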

  • Blockchain Tables in Oracle Database 21c (4:15)
  • Database In-Memory and Blockchain tables (55:42)
  • Reclaiming unused space in Oracle Database 23c with 'tablespace_shrink'
  • Blockchain Table Enhancements in Oracle Database 23c
  • Immutable Table Enhancements in Oracle Database 23c
  • Why Oracle implemented blockchain in Oracle Database 23c
  • Prevent and Detect Fraud Using Blockchain Tables on Oracle Autonomous Database
  • Managing Blockchain Tables
  • Managing Immutable Tables

Schema privileges

Oracle Database now supports schema privileges in addition to existing object, system, and administrative privileges. This feature improves security by simplifying authorization for database objects to better implement the principle of least privilege and keep the guesswork out of who should have access to what.
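For example (the schema and grantee names are illustrative), a single schema-level grant covers every current and future table in the schema:

    -- Read access to all of HR's tables, including ones created later
    GRANT SELECT ANY TABLE ON SCHEMA hr TO app_reader;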

  • Security made so much SIMPLER in 23c! (3:55)
  • So much simpler security management in 23c (1:18)
  • ACE Tim Hall: Schema privileges in Oracle Database 23c
  • ACE Peter Finnigan: Oracle 23c schema-level grants
  • ACE Gavin Soorma: Oracle 23c schema-level privileges and schema-only users
  • Schema-level privilege grants with Database 23c

Sample code

  • Tutorial on Database 23c schema privilege grants
  • Configuring Privilege and Role Authorization

SQL Firewall

Use SQL Firewall to detect anomalies and prevent SQL injection attacks. SQL Firewall examines all SQL, including session context information such as IP address and OS user. Embedded into the database kernel, SQL Firewall logs and (if enabled) blocks unauthorized SQL, ensuring that it can’t be bypassed. By enforcing an allow-list of SQL and approved session contexts, SQL Firewall can prevent many zero-day attacks and reduce the risk of credential theft or abuse.
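A minimal sketch of the capture-then-enforce flow using the DBMS_SQL_FIREWALL package; the procedure and constant names follow the 23c documentation as best understood, and the user name is illustrative:

    -- Record the application user's normal SQL and session contexts
    BEGIN
      DBMS_SQL_FIREWALL.ENABLE;
      DBMS_SQL_FIREWALL.CREATE_CAPTURE(username => 'APP_USER', start_capture => TRUE);
    END;
    /

    -- ... run a representative workload, then turn the capture log into policy
    BEGIN
      DBMS_SQL_FIREWALL.STOP_CAPTURE('APP_USER');
      DBMS_SQL_FIREWALL.GENERATE_ALLOW_LIST('APP_USER');
      DBMS_SQL_FIREWALL.ENABLE_ALLOW_LIST(
        username => 'APP_USER',
        enforce  => DBMS_SQL_FIREWALL.ENFORCE_ALL,
        block    => TRUE);  -- log and block anything off the allow-list
    END;
    /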

  • SQL Firewall now built into Oracle Database 23c
  • Oracle Database 23c new feature—SQL Firewall by ACE Director Gavin Soorma
  • The three new PL/SQL packages in Oracle Database 23c by ACE Director Julian Dontcheff
  • SQL Firewall in Oracle Database 23c by ACE Director Tim Hall
  • SQL Firewall, Oracle Database 23c by database security expert Pete Finnigan: Part 1 , Part 2 , Part 3

Hands-on tutorials

  • Oracle SQL Firewall sample demo scripts
  • Using SQL Firewall

DB_DEVELOPER_ROLE

Oracle Database 23c includes the new role DB_DEVELOPER_ROLE, which provides an application developer with all the necessary privileges to design, implement, debug, and deploy applications on Oracle Databases. By using this role, administrators no longer have to guess which privileges may be necessary for application development.
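Usage is a single grant (the grantee is illustrative):

    GRANT DB_DEVELOPER_ROLE TO dev_user;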

  • DB_DEVELOPER_ROLE in Oracle Database 23c
  • Comparing the RESOURCE, CONNECT, and DEVELOPER roles
  • Use of the DB_DEVELOPER_ROLE Role for Application Developers

Boolean data type

Oracle Database now supports the ISO SQL standard-compliant Boolean data type. This enables you to store True and False values in tables and use Boolean expressions in SQL statements. The Boolean data type standardizes the storage of Yes and No values and makes it easier to migrate to Oracle Database.
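A minimal example (table and data are illustrative):

    CREATE TABLE feature_flags (
      name    VARCHAR2(64),
      enabled BOOLEAN
    );

    INSERT INTO feature_flags VALUES ('dark_mode', TRUE);

    -- A BOOLEAN column can be used directly as a condition
    SELECT name FROM feature_flags WHERE enabled;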

  • Boom! Boolean is here in 23c, and it's easy to use (1:36)
  • Oracle 23c - Unlock the Power of Boolean Data Types (0:59)
  • Boolean data type in Oracle Database 23c (Oracle-Base)
  • Oracle 23c - Tipo de Datos BOOLEAN en SQL (Spanish language)
  • Oracle 23c Boolean support in SQL
  • More Boolean features in 23c
  • Boolean data type in Oracle Database 23c (Medium)
  • SQL Boolean Data Type

Direct Joins for UPDATE and DELETE Statements

Oracle Database now allows you to join the target table in UPDATE and DELETE statements to other tables using the FROM clause. These other tables can limit the rows that are changed or be the source of new values. Direct joins make it easier to write SQL to change and delete data.
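A sketch of the new UPDATE form (tables and columns are illustrative); DELETE accepts an analogous FROM join:

    -- Raise salaries only for employees in the Sales department
    UPDATE employees e
    SET    e.salary = e.salary * 1.10
    FROM   departments d
    WHERE  e.department_id   = d.department_id
    AND    d.department_name = 'Sales';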

  • UPDATE and DELETE Statements via Direct Joins in Oracle Database 23c
  • ACE Lisandro Fernigrini: Oracle Database 23c—Joins en DELETE y UPDATE
  • ACE Timothy Hall: Direct Joins for UPDATE and DELETE Statements in Oracle Database 23c

GROUP BY column alias

You can now use column alias or SELECT item position in GROUP BY, GROUP BY CUBE, GROUP BY ROLLUP, and GROUP BY GROUPING SETS clauses. Additionally, the HAVING clause supports column aliases. These new Database 23c enhancements make it easier to write GROUP BY and HAVING clauses, making SQL queries much more readable and maintainable while providing better SQL code portability.
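For example (the schema is illustrative), an alias defined in the SELECT list can be reused in both GROUP BY and HAVING:

    SELECT EXTRACT(YEAR FROM hire_date) AS hire_year,
           COUNT(*)                     AS hires
    FROM   employees
    GROUP  BY hire_year
    HAVING hire_year >= 2020;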

  • SQL tips DBAs should know | Aliases in GROUP BY (0:59)
  • Oracle Database 23c: Simplifying Query Development with Improved GROUP BY and HAVING Clauses
  • GROUP BY Column Alias or Position

IF [NOT] EXISTS

DDL object creation, modification, and deletion in Oracle Database now supports the IF EXISTS and IF NOT EXISTS syntax modifiers. This enables you to control whether an error should be raised if a given object exists or does not exist, simplifying error handling in scripts and by applications.
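For example (object names are illustrative):

    -- No error if the table already exists
    CREATE TABLE IF NOT EXISTS app_config (
      cfg_key   VARCHAR2(64),
      cfg_value VARCHAR2(256)
    );

    -- No error if the table is already gone
    DROP TABLE IF EXISTS temp_staging;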

  • Coding Tips Developers Need to Know | Unleash the power of IF [NOT] EXISTS clause with Oracle Database 23c (1:00)
  • Improved table management in Oracle Database 23c: Introducing the “IF [NOT] EXISTS” clause
  • ACE Timothy Hall: IF [NOT] EXISTS DDL Clause in Oracle Database 23c
  • ACE Lisandro Fernigrini: Oracle Database 23c—IF [NOT] EXISTS en Sentencias DDL (Spanish language)
  • Using IF EXISTS and IF NOT EXISTS

INTERVAL data type aggregations

Oracle Database 23c makes it easier for developers to calculate totals and averages over INTERVAL values. With this enhancement, you now can pass INTERVAL data types to the SUM and AVG aggregate and analytic functions.
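For example (the job_runs table and its INTERVAL DAY TO SECOND duration column are illustrative):

    SELECT job_name,
           SUM(duration) AS total_runtime,
           AVG(duration) AS avg_runtime
    FROM   job_runs
    GROUP  BY job_name;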

  • Aggregation over INTERVAL data types
  • Aggregation over INTERVAL data types in Oracle Database 23c
  • Oracle Database 23c INTERVAL data type aggregations

RETURNING INTO clause

The RETURNING INTO clause for INSERT, UPDATE, and DELETE statements has been enhanced to report the old and new values affected by the respective statement. This allows developers to use the same logic for each of these DML types to obtain values before and after statement execution. Note that both old and new values are meaningful only for UPDATE statements: INSERT statements report no old values, and DELETE statements report no new values.

The ability to obtain old and new values affected by INSERT, UPDATE, and DELETE statements as part of the SQL command’s execution offers developers a uniform approach to reading these values and reduces the amount of work the database must perform.
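A minimal PL/SQL sketch of the 23c OLD/NEW form (the employees table is illustrative):

    DECLARE
      l_old_salary employees.salary%TYPE;
      l_new_salary employees.salary%TYPE;
    BEGIN
      UPDATE employees
      SET    salary = salary * 1.10
      WHERE  employee_id = 100
      RETURNING OLD salary, NEW salary INTO l_old_salary, l_new_salary;

      DBMS_OUTPUT.PUT_LINE(l_old_salary || ' -> ' || l_new_salary);
    END;
    /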

  • YouTube: Shorts: Check out Oracle Database 23’s new enhanced returning clause (0:55)
  • Enhancements in Oracle 23c: Introducing the New/Old Returning Clause
  • SQL UPDATE RETURN Clause Enhancements

SELECT without FROM clause

You can now run SELECT expression-only queries without a FROM clause. This new feature improves SQL code portability and ease of use for developers.
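For example:

    SELECT SYSDATE;          -- no more SELECT ... FROM dual
    SELECT 2 * 21 AS answer;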

  • Game-Changing Developer Feature (0:59)
  • SELECT without FROM Clause in Oracle Database 23c
  • Oracle Database 23c Enhanced Querying: Eliminating the “FROM DUAL” Clause
  • SELECT Without FROM Clause

SQL macros

Create SQL macros to factor out common SQL expressions and statements into reusable, parameterized constructs that can be used in other SQL statements. SQL macros can be scalar expressions that are typically used in SELECT lists as well as WHERE, GROUP BY, and HAVING clauses. SQL macros can also be used to encapsulate calculations and business logic or can be table expressions, typically used in a FROM clause. Compared to PL/SQL constructs, SQL macros can improve performance. SQL macros increase developer productivity, simplify collaborative development, and improve code quality.
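A minimal sketch of a table macro (the orders table is illustrative): the function returns SQL text that the parser expands in place, so the macro behaves like a parameterized view:

    CREATE OR REPLACE FUNCTION recent_orders(p_days NUMBER)
      RETURN VARCHAR2 SQL_MACRO(TABLE) IS
    BEGIN
      RETURN q'[ SELECT * FROM orders WHERE order_date > SYSDATE - p_days ]';
    END;
    /

    SELECT * FROM recent_orders(7);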

  • Create reusable SQL expressions with SQL macros (1:01:29)
  • Pattern Matching + SQL Macros = Pure SQL Awesomeness! (58:03)
  • Using SQL Macros Scalar and Table Expressions
  • How to Make Reusable SQL Pattern Matching Clauses with SQL Macros
  • SQL Macros: Creating Parameterized Views
  • How to create a parameterized view in Oracle
  • SQL macros have arrived in Autonomous Database
  • How to Make SQL Easier to Understand, Test, and Maintain
  • SQL_MACRO Clause

PL/SQL functions within SQL statements are automatically converted (transpiled) into SQL expressions whenever possible. Transpiling PL/SQL functions into SQL expressions can speed up overall execution.

  • Automatic PL/SQL to SQL Transpiler in Oracle Database 23c
  • Automatic PL/SQL to SQL Transpiler

The Oracle Database SQL engine now supports a VALUES clause for many types of statements. This enables you to materialize rows of data on the fly by specifying them using the new syntax without relying on existing tables. Oracle Database 23c supports the VALUES clause for the SELECT, INSERT, and MERGE statements. The introduction of the new VALUES clause allows developers to write less code for ad-hoc SQL commands, leading to better readability with less effort.
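
A short sketch of the table value constructor; the table alias t and its column list name the generated columns (all names here are illustrative):

    SELECT id, name
    FROM   (VALUES (1, 'Alpha'),
                   (2, 'Beta'),
                   (3, 'Gamma')) t (id, name);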

  • Using the table value constructor (0:59)
  • New value constructor in Oracle Database 23c
  • Oracle 23c SQL Syntax for Efficient Data Manipulation: Table Value Constructor
  • Table Value Constructor in Oracle Database 23c

Annotations enable you to store and retrieve metadata about database objects. They are free-form text fields applications can use to customize business logic or user interfaces. Annotations are name-value pairs or simply a name. They help you use database objects in the same way across all applications, simplifying development and improving data quality.
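
A hedged sketch of the syntax, with hypothetical table, column, and annotation names; annotations can be attached at the table and column level and read back from the data dictionary (for example, the USER_ANNOTATIONS_USAGE view):

    CREATE TABLE accounts (
      account_id NUMBER ANNOTATIONS (identity, display 'Account ID'),
      balance    NUMBER ANNOTATIONS (display 'Current Balance')
    ) ANNOTATIONS (description 'Core accounts table');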

  • Annotations: The new metadata in Database 23c
  • Annotations in Oracle Database 23c
  • Application Usage Annotations

Usage Domains (sometimes called SQL domains or Application Usage Domains) are high-level dictionary objects that act as lightweight type modifiers and centrally document intended data usage for applications. Usage Domains can define data usage and standardize operations by encapsulating a set of check constraints, display properties, ordering rules, and other usage properties, without requiring application-level metadata.

Usage Domains for one or more columns in a table do not modify the underlying data type and can, therefore, also be added to existing data without breaking applications or creating portability issues.
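
As an illustrative sketch (the domain and table names are hypothetical), a domain bundles a check constraint that any column associated with it inherits:

    CREATE DOMAIN email_d AS VARCHAR2(320)
      CONSTRAINT email_chk CHECK (REGEXP_LIKE(email_d, '^[^@]+@[^@]+$'));

    CREATE TABLE subscribers (
      id    NUMBER PRIMARY KEY,
      email VARCHAR2(320) DOMAIN email_d
    );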

  • Less coding with SQL domains in Oracle Database 23c
  • Application Usage Domains

Now you can store a larger number of attributes in a single row, which may simplify application design and implementation for some applications.

The maximum number of columns allowed in a database table or view has been increased to 4,096. This goes beyond the previous 1,000-column limit, allowing you to store thousands of attributes in a single table. Some workloads, such as machine learning and streaming Internet of Things (IoT) applications, may require de-normalized tables with more than 1,000 columns.
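
A sketch of enabling the wider limit via the MAX_COLUMNS initialization parameter; the assumptions here are that the database runs with COMPATIBLE set to 23 or higher and that, depending on the release, an instance restart is needed:

    -- Switch the per-table column limit from the standard 1,000 to 4,096.
    ALTER SYSTEM SET max_columns = EXTENDED SCOPE = SPFILE;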

  • Oracle Database In-Memory blog: Oracle Database 23c Free—Wide Tables
  • Oracle-Base: MAX_COLUMNS: Increase the Maximum Number of Columns for a Table (Wide Tables) in Oracle Database 23c
  • Wide Tables documentation

Connection management for extreme scalability

Oracle Database 23c and CMAN-TDM now bring best-in-class connection management and monitoring capabilities with implicit connection pooling, multi-pool DRCP, per-PDB PRCP, and much more. Enhance the scalability and power of your C, Java, Python, Node.js, and ODP.NET applications with the latest DRCP and PRCP features, and monitor PRCP pool usage with statistics from the new V$TDM_STATS dynamic view in Oracle Database 23c.

  • Per-PDB Proxy Resident Connection Pooling
  • Medium: Multi-pool DRCP in Oracle Database 23c
  • Implicit Connection Pooling
  • Using Multi-pool DRCP
  • Per-PDB PRCP
  • TDM_PERPDB_PRCP_CONNFACTOR—Per-PDB PRCP parameter
  • CMAN-TDM and PRCP Monitoring—V$TDM_STATS
  • JDBC Support for DRCP

Database driver asynchronous programming and pipelining

With Oracle Database 23c, the Pipelining feature enables .NET, Java, and C/C++ applications to send multiple requests to the database without waiting for the response from the server. Oracle Database queues and processes those requests one by one, allowing client applications to continue working until they are notified that the requests have completed. These enhancements provide a better end-user experience, improved data-driven application responsiveness, end-to-end scalability, avoidance of performance bottlenecks, and efficient resource utilization on both the server and the client sides.

For the client request to return immediately, Oracle Database Pipelining requires an asynchronous or reactive API in .NET, Java, and C/C++ drivers. These mechanisms can be used against Oracle Database, with or without Database Pipelining.

For Java, Oracle Database 23c furnishes the Reactive Extensions in Java Database Connectivity (JDBC), Universal Connection Pool (UCP), and the Oracle R2DBC Driver. It also supports Java virtual threads (Project Loom) in the driver, as well as Reactive Streams libraries such as Reactor, RxJava, Akka Streams, Vert.x, and more.

  • Oracle 23c .NET development features
  • What's in Oracle Database 23c for Java Developers? (PDF)
  • ODP.NET async code sample
  • ODP.NET Asynchronous Programming and Pipelining
  • JDBC Support for Pipelined Database Operations

JavaScript stored procedures

Multilingual engine (MLE) module calls allow developers to invoke JavaScript functions stored in modules from SQL and PL/SQL. Call specifications written in PL/SQL link JavaScript to PL/SQL code units. This feature enables developers to use JavaScript functions anywhere PL/SQL functions are called.
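
A compact sketch, with hypothetical module and function names: the MLE module holds the JavaScript, and a PL/SQL call specification exposes it to SQL:

    CREATE OR REPLACE MLE MODULE greeter_mod LANGUAGE JAVASCRIPT AS
    export function greet(name) {
      return `Hello, ${name}!`;
    }
    /

    CREATE OR REPLACE FUNCTION greet (p_name VARCHAR2) RETURN VARCHAR2 AS
      MLE MODULE greeter_mod SIGNATURE 'greet(string)';
    /

    SELECT greet('Oracle');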

  • Introduction to JavaScript in Oracle Database 23c Free—Developer Release
  • Using JavaScript community modules in Oracle Database 23c Free—Developer Release
  • How to import JavaScript ES modules in Oracle Database 23c Free and use them in SQL queries
  • APEX + Server Side JavaScript (MLE)
  • Simple Data Driven applications using JavaScript in Oracle Database 23c Free-Developer Release
  • Overview of JavaScript in Oracle Database

Multicloud configuration and security integration

A new feature of Oracle Database 23c is the client capability to store Oracle configuration information, such as connection strings, in Microsoft Azure App Configuration or Oracle Cloud Infrastructure Object Storage. This capability simplifies application cloud configuration, deployment, and connectivity with Oracle JDBC, .NET, Python, Node.js, and Oracle Call Interface data access drivers. The information is stored in configuration providers, which provide the benefit of separating application code and configuration.

Combine this with OAuth 2.0 single sign-on to the cloud and database to further ease administration. Oracle Database 23c clients can use Microsoft Entra ID, Azure Active Directory, or Oracle Cloud Infrastructure access tokens for database sign-on.

  • Database 23c JDBC Seamless Authentication with OCI Identity and Access Management and Azure Active Directory
  • JDBC Configuration Via App Config Providers and Vaults
  • ODP.NET Centralized Configuration Providers
  • ODP.NET and Azure Active Directory
  • ODP.NET and OCI Identity and Access Management

Observability, OpenTelemetry, and diagnosability for Java and .NET applications

The three pillars of observability are metrics, logging, and distributed tracing. This release brings enhanced logging, new debugging (diagnose on first failure), and new tracing capabilities. The JDBC and ODP.NET drivers have also been instrumented with a hook for tracing database calls; this hook enables distributed tracing using OpenTelemetry.

  • Java and .NET Application Observability with OpenTelemetry and Oracle Database
  • ODP.NET OpenTelemetry documentation
  • JDBC Trace Event Listener documentation
  • Oracle JDBC Trace Event Listener Javadoc
  • Oracle JDBC OpenTelemetry Provider

Transportable Binary XML

Oracle Database 23c introduces Transportable Binary XML (TBX), a new self-contained XMLType storage method. TBX supports sharding, XML search index, and Exadata pushdown operations, providing better performance and scalability than other XML storage options.

With the support of more database architectures, such as sharding or Exadata, and its capability to easily migrate and exchange XML data among different servers, containers, and PDBs, TBX allows your applications to take full advantage of this new XML storage format on more platforms and architectures.
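
A sketch of declaring TBX storage for an XMLType column (the table and column names are illustrative):

    CREATE TABLE purchase_orders (
      id  NUMBER PRIMARY KEY,
      doc XMLTYPE
    )
    XMLTYPE COLUMN doc STORE AS TRANSPORTABLE BINARY XML;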

You can migrate existing XMLType storage of a different format to TBX format in any of the following ways:

  • Insert-as-select or create-as-select
  • Online redefinition
  • Oracle Data Pump

  • Database 23c new features for XML: Sharding of XML and XML Search Index (1:14:37)
  • Transportable Binary XML—Modern XML document storage in Oracle Database 23c
  • Introduction to Choosing an XMLType Storage Model and Indexing Approaches

JSON binary data type

The JSON data type is an Oracle-optimized binary JSON format called OSON. It is designed for faster query and DML performance in the database and in database clients from release 21c and on.
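
A brief sketch using a hypothetical orders table; the JSON column is stored in the binary OSON format, and simple dot notation reads into the document:

    CREATE TABLE orders (
      id      NUMBER PRIMARY KEY,
      details JSON
    );

    INSERT INTO orders VALUES (1, JSON('{"item":"book","qty":2}'));

    SELECT o.details.item FROM orders o WHERE o.id = 1;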

  • JSON data type support in Oracle 21c
  • Native JSON Data Type Support: Maturing SQL and NoSQL Convergence in Oracle Database (PDF)
  • JSON Data Type

JSON Relational Duality views

JSON Relational Duality, an innovation introduced in Oracle Database 23c, unifies the relational and document data models to provide the best of both worlds. Developers can build applications in either relational or JSON paradigms with a single source of truth and benefit from the strengths of both models. Data is held once but can be accessed, written, and modified with either approach. Developers benefit from ACID-compliant transactions and concurrency controls, which means they no longer have to choose between complex object-relational mappings and data inconsistency issues.
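
An illustrative sketch of the SQL syntax for a duality view over a hypothetical employees table; documents read from emp_dv are assembled from the rows, and writes through the view update the underlying table:

    CREATE OR REPLACE JSON RELATIONAL DUALITY VIEW emp_dv AS
      SELECT JSON {'_id'    : e.employee_id,
                   'name'   : e.last_name,
                   'salary' : e.salary}
      FROM   employees e WITH INSERT UPDATE DELETE;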

  • Medium: ODP.NET and JSON Relational Duality and Oracle Database 23c Free
  • Key benefits of JSON Relational Duality
  • Use JSON Relational Duality with Oracle Database API for MongoDB
  • REST with JSON Relational Duality
  • JSON Relational Duality: The Revolutionary Convergence of Document, Object, and Relational Models
  • JSON Relational Duality Views Overview

JSON Schema

Oracle Database supports JSON to store and process schema-flexible data. With Oracle Database 23c, Oracle Database now supports JSON Schema to validate structure and values of JSON data. The SQL operator IS JSON was enhanced to accept a JSON Schema, and various PL/SQL functions were added to validate JSON and to describe database objects such as tables, views, and types as JSON Schema documents.

By default, JSON data is schemaless, providing flexibility. However, you may want to ensure that JSON data has a particular structure and typing, which can be done via industry-standard JSON Schema validation.
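
A hedged sketch of schema validation via a check constraint (the table name and schema document are illustrative); inserts that violate the JSON Schema are rejected:

    CREATE TABLE config_docs (
      id  NUMBER PRIMARY KEY,
      doc JSON,
      CONSTRAINT doc_schema_chk CHECK (doc IS JSON VALIDATE '{
        "type"       : "object",
        "properties" : { "port" : { "type" : "number" } },
        "required"   : ["port"]
      }')
    );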

Contribute to JSON Schema

Oracle actively contributes to JSON Schema, an open source effort to standardize a JSON-based declarative language that allows you to annotate and validate JSON documents. The specification is currently at the Request for Comments (RFC) stage.

  • Review Oracle's contributions to JSON Schema and comment
  • Or you can contribute via GitHub
  • JSON/JSON_VALUE will Convert PL/SQL Aggregate Type to/from JSON (12:36)
  • Mastering Oracle Database 23c Free: SQL Domains and JSON Schema

PL/SQL JSON constructor support for aggregate types

The PL/SQL JSON constructor is enhanced to accept an instance of a corresponding PL/SQL aggregate type, returning a JSON object or array type populated with the aggregate type data.

The PL/SQL JSON_VALUE operator is enhanced so its returning clause can accept a type name that defines the type of the instance that the operator is to return. JSON constructor support for aggregate data types streamlines data interchange between PL/SQL applications and languages that support JSON.
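
A PL/SQL sketch of the round trip, using a hypothetical record type:

    DECLARE
      TYPE emp_rec_t IS RECORD (id NUMBER, name VARCHAR2(30));
      l_emp emp_rec_t := emp_rec_t(id => 1, name => 'Alpha');
      l_doc JSON;
    BEGIN
      -- 23c: the JSON constructor accepts a PL/SQL aggregate type.
      l_doc := JSON(l_emp);
      -- 23c: JSON_VALUE can return a PL/SQL aggregate type.
      l_emp := JSON_VALUE(l_doc, '$' RETURNING emp_rec_t);
    END;
    /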

  • JSON_VALUE Function Enhancements in Oracle Database 23c
  • JSON Data Type Constructor Enhancements in Oracle Database 23c
  • Application development documentation

MongoDB-compatible API

With the Oracle Database API for MongoDB, developers can continue to use MongoDB's tools and drivers connected to an Oracle Database while gaining access to Oracle's multimodel capabilities and self-driving database. Customers can run MongoDB workloads on Oracle Cloud Infrastructure (OCI). Often, few or no changes are required to existing MongoDB applications; you simply change the connection string.

The Oracle Database API for MongoDB is part of standard Oracle REST Data Services. It is preconfigured and fully managed as part of the Oracle Autonomous Database.

  • Demos and QA: Oracle Database API for MongoDB (55:01)
  • Demonstration of Oracle Database API for Mongo DB (6:18)
  • Oracle Database API for MongoDB
  • Installing Database API for MongoDB for any Oracle Database
  • Oracle Database API for MongoDB—Best Practices
  • SQL, JSON, and MongoDB API: Unify worlds with Oracle Database 23c Free
  • Use the Oracle Database API for MongoDB
  • Overview of Oracle Database API for MongoDB

Operational property graphs

Oracle Database offers native support for property graph data structures and graph queries. If you're looking for flexibility to build graphs in conjunction with transactional data, JSON, Spatial, and other data types, we've got you covered. Developers can now easily build graph applications with SQL using existing SQL development tools and frameworks.
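
A condensed sketch in SQL/PGQ, assuming hypothetical accounts and transfers tables:

    CREATE PROPERTY GRAPH transfers_graph
      VERTEX TABLES (
        accounts KEY (account_id)
          LABEL account PROPERTIES (account_id, owner)
      )
      EDGE TABLES (
        transfers KEY (transfer_id)
          SOURCE      KEY (from_acct) REFERENCES accounts (account_id)
          DESTINATION KEY (to_acct)   REFERENCES accounts (account_id)
          LABEL transferred PROPERTIES (amount)
      );

    SELECT *
    FROM   GRAPH_TABLE (transfers_graph
             MATCH (a IS account) -[t IS transferred]-> (b IS account)
             COLUMNS (a.owner AS sender, b.owner AS receiver, t.amount));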

  • Create, Query, and Visualize a Property Graph with SQL Oracle Database 23c Free—Developer Release (3:53)
  • When property graphs join SQL—Oracle CloudWorld 2022 (30:29)
  • Operational property graphs in Oracle Database 23c Free—Developer Release
  • Property graphs in SQL Developer Release 23.1
  • Get started with property graphs in Oracle Database 23c Free—Developer Release
  • Lucas Jellema: SQL Property Graph for Network-Style Querying
  • Lucas Jellema: Graph Database Style Explorations of Relational Database with Formula One Data (GitHub content here)
  • ACE Timothy Hall: SQL Property Graphs and SQL/PGQ in Oracle Database 23c
  • Exploring Operational Property Graphs in Oracle Database 23c Free
  • SQL Property Graphs

Happy Holidays!

As we wrap up 2023, here's a recap of the new features in Oracle Database 23c that we highlighted throughout the year. If you haven't had a chance to try out our latest Oracle Database release yet—especially if you're a developer—check out the different options here or at oracle.com/database/free.

  • Oracle Database 23c: The next long-term support release
  • Oracle Database 23c blog posts from SQLMaria
  • How to set up Oracle Database 23c Free—Developer Release and ORDS on OCI
  • Oracle Database 23c Free—Developer Release: getting started…
  • Deploying Oracle Database 23c Free—Developer Release on Kubernetes with Helm
  • Exploring JSON-relational duality views in Oracle Database 23c Free—Developer Release
  • Getting Started with Oracle Database 23c Free—Developer Release

Hands-On Labs/Downloads

  • Oracle Database Free Get Started
  • Oracle Database Software Downloads
  • Oracle Database 23c

AQ to TxEventQ Online Migration Tool

Oracle Database 23c introduces an online migration tool that simplifies migration from Oracle Advanced Queuing (AQ) to Transactional Event Queues (TxEventQ) with orchestration automation, source and target compatibility diagnostics and remediation, and a unified user experience. Migration scenarios can be short- or long-lived and can be performed with or without AQ downtime, eliminating operational disruption.

Existing AQ customers interested in higher-throughput queues and Kafka compatibility, using a Kafka Java client and Confluent-like REST APIs, can easily migrate from AQ to TxEventQ. TxEventQ offers scalability, performance, key-based partitioning, and native JSON payload support, which makes it easier to write event-driven microservices and applications in multiple languages, including Java, JavaScript, PL/SQL, Python, and more.

  • Streamlining Oracle Advanced Queue to Transactional Event Queues Migration
  • Navigating DBMS_AQMIGTOOL Package in Oracle Database 23c: A Starter’s Guide
  • DBMS_AQMIGTOOL package documentation
  • Sample steps to migrate from AQ to TxEventQ
  • Example walkthrough

Oracle Database 23c provides even more refined compatibility for Apache Kafka applications. This new feature enables easy migration of Kafka Java applications to Transactional Event Queues (TxEventQ): Kafka Java APIs can now connect to Oracle Database and use TxEventQ as a messaging platform.

Developers can easily migrate an existing Java application that uses Kafka to Oracle Database using the JDBC thin driver. And with the Oracle Database 23c client-side library feature, Kafka applications can now connect to Oracle Database instead of a Kafka cluster and use TxEventQ's messaging platform transparently.

  • Simplify Event-driven Apps with TxEventQ in Oracle Database (with Kafka interoperability)
  • Kafka interoperability in Oracle Database 23c
  • New 23c version of Kafka-compatible Java APIs for Transactional Event Queues published
  • Playing with Kafka Java Client for TxEventQ – creating the simplest of producers and consumers
  • Oracle REST Data Services 22.3 brings new REST APIs for Transactional Event Queueing
  • Interoperability of Transactional Event Queue with Apache Kafka (Java APIs)
  • Kafka Java Client Interface for Oracle Transactional Event Queues (Java APIs)
  • Kafka Java Client for Oracle Transactional Event Queues (Java APIs)
  • Kafka Connectors for TxEventQ (Connectors)
  • Oracle Transactional Event Queues REST Endpoints (REST APIs)

Lock-free column value reservations

Lock-Free Reservations enable concurrent transactions to proceed without being blocked on updates of heavily updated rows. Reservations are held on the rows instead of locks: the database verifies that the updates can succeed and defers them until transaction commit time. Lock-Free Reservations improve user experience and concurrency in transactions.
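
A sketch of declaring a reservable column (names are illustrative); concurrent decrements of the same row proceed without blocking, as long as the check constraint can still be satisfied at commit:

    CREATE TABLE accounts (
      account_id NUMBER PRIMARY KEY,
      balance    NUMBER RESERVABLE CONSTRAINT balance_chk CHECK (balance >= 0)
    );

    -- Two sessions can both run this against the same row without waiting.
    UPDATE accounts SET balance = balance - 10 WHERE account_id = 1;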

  • TikTok: Rethink everything you think you know about row locking in relational databases (0:29)
  • ACE Lucas Jellema: Oracle Database 23c—Fine-grained locking—Lock-Free Reservations
  • ACE Tim Hall: Lock-Free Reservations to prevent blocking sessions in Oracle Database 23c
  • Oracle Schema-Level Privileges and Lock-Free Column Reservations
  • Using Lock-Free Reservations

Grafana observability

Oracle continues to expand its cloud native and Kubernetes support with the new Observability Exporter for Oracle Database, which lets customers easily export database and application metrics in the industry-standard Prometheus format and create Grafana dashboards to monitor the performance of their Oracle Databases and applications.

  • DevOps meets DataOps (50:10)
  • Introducing Oracle Database Observability Exporter
  • Unified Observability for Oracle Database
  • Unified Observability in Grafana with converged Oracle Database

Saga APIs in Oracle Database 23c

The Saga framework introduced in Oracle Database 23c provides a unified framework for building asynchronous Saga applications in the database. Saga makes modern, high performance microservices application development easier and more reliable.

A Saga is a business transaction spanning multiple databases, implemented as a series of independent local transactions. Sagas avoid the global transaction duration locking found with synchronous distributed transactions and simplify consistency requirements for maintaining a global application state. The Saga framework integrates with Lock-Free reservable columns in Oracle Database 23c to provide automatic Saga compensation, simplifying application development.

The Saga framework emulates the MicroProfile LRA specification.

  • Developing Event-Driven, Auto-Compensating Transactions With Oracle Database Sagas and Lock-Free Reservation
  • Oracle Saga documentation
  • Oracle Saga CloudBank demo

Internet of things technology, research, and challenges: a survey

  • Published: 02 May 2024

  • Amit Kumar Vishwakarma,
  • Soni Chaurasia,
  • Kamal Kumar,
  • Yatindra Nath Singh &
  • Renu Chaurasia

The world of digitization is growing exponentially; data optimization, network security, and energy efficiency are becoming more prominent. The Internet of Things (IoT) is a core technology of modern society. This paper surveys recent and past technologies used for IoT optimization models, such as IoT with Blockchain, IoT with WSN, IoT with ML, and IoT with big data analysis. For anyone starting core research on IoT technologies, research opportunities, challenges, and solutions, this paper covers the basics, including security, interoperability, standards, scalability, complexity, data management, and quality of service (QoS). It also discusses some recent technologies and the challenges in implementing them. Finally, it discusses research possibilities in basic and applied IoT domains.

Data Availability

Available on request.

No funding was received to carry out this work.

Author information

Authors and Affiliations

Management science and technology, Khalifa University, Abu Dhabi, UAE

Amit Kumar Vishwakarma

Computer science & Engineering, SGT University, Gurugram, India

Soni Chaurasia

Department of Information Technology, IGDTUW, New Delhi, India

Kamal Kumar

Electrical Engineering, IIT Kanpur, Kanpur, India

Yatindra Nath Singh

Computer science & Engineering, AIT, Rooma, Kanpur, India

Renu Chaurasia

Contributions

Equally contributed.

Corresponding author

Correspondence to Soni Chaurasia .

Ethics declarations

Conflicts of interest

No conflict of interest.

Consent to Publish

As per journal policy.

About this article

Vishwakarma, A.K., Chaurasia, S., Kumar, K. et al. Internet of things technology, research, and challenges: a survey. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19278-6

Received: 18 October 2023

Revised: 13 March 2024

Accepted: 18 April 2024

Published: 02 May 2024

DOI: https://doi.org/10.1007/s11042-024-19278-6

  • Semantic intelligence
  • IoT protocol
  • IoT application
  • Research possibilities
  • IoT Platforms
  • IoT optimization models

Computer Science > Machine Learning

Title: Lazy Data Practices Harm Fairness Research

Abstract: Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.

COMMENTS

  1. REVIEW OF CLOUD DATABASE BENEFITS AND CHALLENGES

Amazon Web Services (AWS), Microsoft Azure, and Google Cloud are the top cloud computing providers (Bajpai, 2023). In Q1 2023, AWS revenue increased by 20% year over year to $21.4B, Intelligent ...

  2. PDF Cloud Data Systems: What are the Opportunities for the Database ...

    The panel will discuss the research opportunities for the database research community in the context of cloud native data services. PVLDB Reference Format: Magdalena Balazinska, Surajit Chaudhuri, AnHai Doan, Joseph M. Hellerstein, Hanuma Kodavalla, Ippokratis Pandis, and Matei Zaharia. Cloud Data Systems: What are the Opportunities for the ...

  3. Cloud databases: new techniques, challenges, and opportunities

As database vendors are increasingly moving towards the cloud data service, i.e., databases as a service (DBaaS), cloud databases have become prevalent. Compared with the early cloud-hosted databases, the new generation of cloud databases, also known as cloud-native databases, seeks higher elasticity and lower cost by developing new ...

  4. A Study on Cloud Database

From the data management point of view, cloud computing provides full availability: users can read and write data at any time without ever being blocked. The acknowledgment times are almost stable and do not depend on the number of associated users, the size of the database, or any added computing or communication constraints. Furthermore, users are freed from the burden of taking backups. If ...

  5. Cloud computing research: A review of research themes, frameworks

    This paper presents a meta-analysis of cloud computing research in information systems with the aim of taking stock of literature and their associated research frameworks, research methodology, geographical distribution, level of analysis as well as trends of these studies over the period of 7 years.

  6. Monitoring the performance of cloud real-time databases: A firebase

    In this paper, an IoT model made use of the embedded smartphone sensors to gather data, send it to a cloud Firebase service provider, and store it in Firebase's real-time database. Additionally, using the Firebase test lab service, various experiments are carried out using 15 smartphone devices to observe the performance of the Firebase real ...

  7. PDF Databases in Cloud Computing: A Literature Review

    In this paper, cloud databases, cloud computing and the databases that can be hosted and deployed in the cloud have been discussed, respectively. Furthermore, the advantages and disadvantages of the most widely used database in cloud computing have been presented. This paper is organized as follows. In the next section, cloud

  8. Home page

  9. A survey on data storage and placement methodologies for Cloud-Big Data

In this survey paper, we are providing a state of the art overview of Cloud-centric Big Data placement together with the data storage methodologies. ...

  10. Adoption of cloud computing as innovation in the organization

In the research of Chandran D and Kempegowda S, we can observe a hybrid E-learning platform being proposed for teaching based on a cloud architecture model. The main motivation for this proposal was the ability to reduce costs and provide a dependable data storage and data sharing environment. Through this research, it has been noticed that ...

  11. GitHub

continuously update cloud database papers. Contribute to TsinghuaDatabaseGroup/CloudDB development by creating an account on GitHub. ... two steps back. CIDR 2019 - 9th Biennial Conference on Innovative Data Systems Research. ...

  12. PDF Amazon Aurora: Design Considerations for High Throughput Cloud-Native

    Figure 1: Move logging and storage off the database engine. In this paper, we describe Amazon Aurora, a new database service that addresses the above issues by more aggressively leveraging the redo log across a highly-distributed cloud environment. We use a novel service-oriented architecture (see Figure 1) with a multi-tenant scale-out storage ...

  13. Securing Machine Learning in the Cloud: A Systematic Review of Cloud

    With the advances in machine learning (ML) and deep learning (DL) techniques, and the potency of cloud computing in offering services efficiently and cost-effectively, Machine Learning as a Service (MLaaS) cloud platforms have become popular. In addition, there is increasing adoption of third-party cloud services for outsourcing training of DL models, which requires substantial costly ...

  14. An Overview of Data Storage in Cloud Computing

    An Overview of Data Storage in Cloud Computing ... It examines present trends in the area of Cloud storage and provides a guide for future research. The objective of this paper is to answer the question of what the current trend and development in Cloud storage is? The expected result at the end of this review is the identification of trends in ...

  15. Amazon Aurora: Design considerations for high throughput cloud-native

    Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this paper, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network.

  16. Security and privacy protection in cloud computing ...

    Research on access control technology based on trust relationship. With the development of research on the trust model, the trust relationship among the data provider, cloud platform and user in a cloud computing system is different. (6) Research and implement a cross-domain, cross group, hierarchical dynamic fine-grained access control system.

  17. Key Opportunities and Challenges of Data Migration in Cloud: Results

The results of this research paper can give a road map for the data migration journey and can help decision makers towards a safe and productive migration to a cloud computing environment. ... are often in mature levels when it comes to the discussion and implementations of cloud computing to migrate their data into the ...

  18. cloud computing Latest Research Papers

The paper further compares and reviews different layout models for the discovery, selection, and composition of services in Cloud computing. Recent research trends in service composition are identified, and research about microservices is evaluated and shown in the form of tables and graphs.

  19. A survey on security challenges in cloud computing: issues, threats

    Cloud computing has gained huge attention over the past decades because of continuously increasing demands. There are several advantages to organizations moving toward cloud-based data storage solutions. These include simplified IT infrastructure and management, remote access from effectively anywhere in the world with a stable Internet connection and the cost efficiencies that cloud computing ...

  20. Research trends in deep learning and machine learning for cloud

    Deep learning and machine learning show effectiveness in identifying and addressing cloud security threats. Despite the large number of articles published in this field, there remains a dearth of comprehensive reviews that synthesize the techniques, trends, and challenges of using deep learning and machine learning for cloud computing security. Accordingly, this paper aims to provide the most ...

  21. Service placement in fog-cloud computing environments: a ...

    With the rapid expansion of the Internet of Things and the surge in the volume of data exchanged in it, cloud computing became more significant. To face the challenges of the cloud, the idea of fog computing was formed. The heterogeneity of nodes, distribution, and limitation of their resources in fog computing in turn led to the formation of the service placement problem. In service placement ...

  22. Studies on the Value of Data

    The U.S. Bureau of Economic Analysis has undertaken a series of studies that present methods for quantifying the value of simple data that can be differentiated from the complex data created by highly skilled workers that was studied in Calderón and Rassier 2022. Preliminary studies in this series focus on tax data, individual credit data, and driving data.

  23. Research on Cloud Data Storage Technology and Its Architecture

    Data storage is a very important and valuable research field in cloud computing. This paper introduces the concept of cloud computing and cloud storage as well as the architecture of cloud storage firstly. Then we analyze the cloud data storage technology--GFS(Google File System)/HDFS(Hadoop Distributed File System) towards concrete enterprise ...

  24. Database 23ai

    Oracle Database 23ai Feature highlights for developers. Check out some of the features we've built with developers in mind: AI Vector Search brings AI to your data by letting you build generative AI pipelines using your business data, directly within the database. Easy-to-use native vector capabilities let your developers build next-gen AI applications that combine relational database ...

  25. [2404.18760] Flow AM: Generating Point Cloud Global Explanations by

    Although point cloud models have gained significant improvements in prediction accuracy over recent years, their trustworthiness is still not sufficiently investigated. In terms of global explainability, Activation Maximization (AM) techniques in the image domain are not directly transplantable due to the special structure of the point cloud models. Existing studies exploit generative models ...

  26. Internet of things technology, research, and challenges: a survey

    The world of digitization is growing exponentially; data optimization, security of a network, and energy efficiency are becoming more prominent. The Internet of Things (IoT) is the core technology of modern society. This paper is based on a survey of recent and past technologies used for IoT optimization models, such as IoT with Blockchain, IoT with WSN, IoT with ML, and IoT with big data ...

  27. [2404.17293] Lazy Data Practices Harm Fairness Research

    Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices ...