Oracle blogs

Case Study: How a bank turned challenges into opportunities to serve its customers using NoSQL Database


Acknowledgements: Michael Brey, Director of NoSQL Database Development, Oracle 

An industry in flux

The financial services industry is at a crossroads, experiencing massive change in response to shifting customer demands. With the increasing adoption of cloud technologies, digital-only enterprises are offering innovative solutions at ever-lower cost.

Customer experience is a strategic imperative for most organizations today, but delivering an engaging experience across the growing number of digital customer touchpoints can be challenging, especially if they have an aging technology stack.

Additionally, organizations have to navigate these transformational changes while managing the volume, variety, and velocity of digital transactions and data without straining their business systems or suffering data loss, breaches, or downtime.

The graphic below shows the IT priorities of financial services institutions. It is no surprise that 25% of them want to modernize their systems, and an equal share want to ramp up their digital touchpoints.


This blog examines how one of India's leading private banks modernized and expedited its digital presence, providing an enhanced experience for its customers using Oracle NoSQL Database.

Some of the bank's challenges:

  • Exceeding customer expectations: More than 50% of India's population is below age 25, and more than 65% is below age 35. The bank's customers increasingly compare banking experiences to other areas of their digital lives. These digital natives aren't just looking to check balances and deposit checks. They want more meaningful online experiences: for example, they want to start and finish an application to open an account without ever walking into a branch, and they want it to happen fast. The bank needed a system that could provide an engaging, personalized digital customer experience in real time under strict SLAs (e.g., processing a loan in under 10 seconds).
  • Providing comprehensive services: Offer 'always-on' digital services and delight customers by assisting them through chatbot interactions. Additionally, the bank wanted to experiment with and deliver new services valued by its customers, such as enhanced payment and blockchain technologies.
  • Providing a customer 360 experience: The bank offers various services, and its customers interact with those services in many different ways. However, customers want a consistent experience, regardless of the business division they are interacting with or the device they use in the process. Delivering an engaging, personalized customer experience with a single customer view and a unified view of all interactions across every touchpoint with the bank is challenging.
  • Managing change without disruption: The bank needed the agility to launch new services and make its development staff more productive, and it wanted to minimize outages with high availability built into the system.

Choosing the right data management strategy

A comprehensive data management strategy sets the stage for establishing a deeper understanding of customer experience. It can offer a single view by collecting all of a customer's structured and unstructured data from across the organization and other relevant external sources into one place. A NoSQL database is an ideal choice. It can store personal and demographic information and customer interactions with the company, including calls, chats, emails, texts, social media responses, product/service activity history, and past and present purchases. McKinsey's study suggests that companies that use customer analytics extensively are 23X more likely to outperform competitors in acquiring new customers, 6X more likely to retain customers, and 19X more likely to be profitable.


Source: https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/five-facts-how-customer-analytics-boosts-corporate-performance

Why Oracle NoSQL Database

Oracle NoSQL Database's multi-model support makes it easy for developers to store and combine data of any structure within the database without giving up the sophisticated validation rules that govern data quality.

  • Support for a flexible data model:

With the JSON document model, the schema can be dynamically modified without application or database downtime. The bank can localize all data for a given entity – such as a financial asset class or user class – into a single document, rather than spreading it across multiple relational tables. Applications can access an entire document in a single database operation, rather than joining separate tables spread across the database. As a result of this data localization, application performance is often much higher with Oracle NoSQL Database, which can be the decisive factor in improving customer experience.
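As an illustration of this localization, here is a minimal sketch in plain Python and JSON (not the Oracle NoSQL SDK; all field names are hypothetical) of a customer profile kept as one document instead of several joined tables:

```python
import json

# A hypothetical customer profile localized in one JSON document,
# rather than spread across several relational tables.
customer_doc = {
    "customerId": "C-1001",
    "name": {"first": "Asha", "last": "Rao"},
    "accounts": [
        {"type": "savings", "balance": 52000.0},
        {"type": "loan", "principal": 300000.0, "status": "active"},
    ],
    "interactions": [
        {"channel": "chatbot", "topic": "loan status"},
        {"channel": "mobile", "topic": "balance check"},
    ],
}

# One read returns the whole entity -- no joins needed.
doc = json.loads(json.dumps(customer_doc))
loan_accounts = [a for a in doc["accounts"] if a["type"] == "loan"]
print(loan_accounts[0]["status"])  # active
```

Because accounts and interactions travel with the customer record, a single lookup serves a "customer 360" screen that would otherwise require joining several tables.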

  • Predictable scalability with always-on availability

As banking use cases evolve, data sources and attributes grow. As additional applications, digital channels, and users are onboarded, processing and storage demands grow quickly as well.

Oracle NoSQL Database supports a scale-out architecture and sharding technology. With sharding, data is distributed across multiple database instances spread across different machines, overcoming the limitations of a single server and its resources such as CPU, RAM, or I/O. An Oracle NoSQL cluster can be expanded horizontally online, without incurring any application downtime and in a way that is completely transparent to the application. Oracle NoSQL Database also maintains multiple copies of data for high availability.
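The core sharding idea can be sketched in a few lines: hash each record's key and use the hash to pick a shard. This is a simplified illustration, not Oracle NoSQL Database's actual placement algorithm, and the shard count is an arbitrary assumption:

```python
import hashlib

NUM_SHARDS = 3  # hypothetical cluster size

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard by hashing -- a simplified sketch of
    how a sharded store distributes data across machines."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Every key deterministically lands on exactly one shard.
keys = ["C-1001", "C-1002", "C-1003", "C-1004"]
placement = {k: shard_for(k) for k in keys}
print(placement)
```

Production systems typically use consistent-hashing-style schemes instead of a plain modulus, so that adding a shard relocates only a fraction of the keys.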

  • Scale-out architecture for business continuity

The bank needed the ability to deploy the system across multiple data centers for disaster recovery purposes and also for the ability to perform local writes to the data center. Oracle NoSQL Database supports active-active architecture with multi-region tables. A multi-region architecture is two or more independent, geographically distributed Oracle NoSQL Database clusters bridged by bi-directional replication, ensuring the customers always have fast access to services and the latest data.
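To make the bi-directional replication idea concrete, here is a toy reconciliation rule in which the write with the latest timestamp wins. This illustrates one common approach to merging concurrent regional writes; it is not a description of Oracle NoSQL Database's internal conflict-resolution logic:

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: float  # write time recorded at the originating region

def merge(local: Version, remote: Version) -> Version:
    # Latest-write-wins reconciliation for bi-directional replication:
    # whichever region wrote last, its value survives in both regions.
    return local if local.timestamp >= remote.timestamp else remote

mumbai = Version(value="address: MG Road", timestamp=100.0)
delhi = Version(value="address: Ring Road", timestamp=105.0)
print(merge(mumbai, delhi).value)  # address: Ring Road
```

Note that the merge is symmetric: both regions converge on the same value regardless of which side applies the remote update first.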

  • Simplify application development with rich queries and APIs

Oracle NoSQL provides a rich query language and extensive secondary indexes that give users fast, flexible access to data with any query pattern. This can range from simple key-value lookups to complex search, traversals, and aggregations across rich data structures, including embedded sub-documents and arrays. It also supports several easy-to-use SDKs in various programming languages; in particular, the customer was looking at the Node.js driver.
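A secondary index is essentially a reverse mapping from an attribute value to the primary keys of the records that hold it. A tiny in-memory sketch (the table and attribute names are hypothetical, and a real database maintains such indexes automatically):

```python
from collections import defaultdict

# Primary key -> record: a tiny in-memory stand-in for a NoSQL table.
table = {
    "C-1001": {"city": "Pune", "segment": "retail"},
    "C-1002": {"city": "Mumbai", "segment": "corporate"},
    "C-1003": {"city": "Pune", "segment": "corporate"},
}

# A secondary index on "city" avoids scanning every record.
city_index = defaultdict(list)
for pk, rec in table.items():
    city_index[rec["city"]].append(pk)

pune_customers = sorted(city_index["Pune"])
print(pune_customers)  # ['C-1001', 'C-1003']
```

The trade-off is the classic one: each index speeds up a query pattern at the cost of extra work on every write.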

High-level architecture of the proposed solution


Critical components in the architecture include:

  • Applications Layer:  This layer manages all user input applications, e.g., loan or credit card applications. The applications are based on forms technology, allowing the developers to create adaptive and responsive documents to capture information. The forms have a notion of fragments that allows for pulling out standard segments such as personal details like name and address, family details, income details, etc. The application layer is responsible for doing all the "application plumbing": interacting with the database, enforcing validation at event points, etc. It interacts with the bank's backend system through the API gateway and doesn't store any personal or sensitive information.
  • Database Layer:  A CRM system is used primarily for lead generation to target customers. Also available in this layer is the ELK stack (Elasticsearch, Logstash, Kibana), which is primarily used to audit the log data stored in the NoSQL Database. Oracle NoSQL Database has an out-of-box integration with Elasticsearch. Oracle NoSQL Database also feeds the user drop-off (incomplete form activity) data to the orchestration framework primarily used for retargeting the users.
  • Marketing Layer: This layer hosts various servers that drive the business decision process. It comprises servers and tools used for customer segmentation (identifying groups of individuals who are similar in attitudes, demographic profile, etc.) and customer journey analysis (the sum of all customer experiences with the bank). Additionally, it handles personalization (showing the product or service a customer would be interested in buying) and retargeting (persuading potential customers to reconsider the bank's products and services after they leave or drop off from the app), based on the drop-off campaign data coming out of Oracle NoSQL Database.

Banking experience re-imagined

A typical user journey, e.g., loan processing, starts with the user interacting with the bank's loan-processing application via the web, a mobile device, email, or even a branch. The application is served off the forms in the application layer. At this stage, the user fills in details and submits scanned supporting documents. These scanned forms are classified, information is extracted, and the data is sent to the NoSQL Database store. The data then flows to the processing system, which triggers the underwriting process, beginning with the rule engine and credit-scoring engine. Depending on the underwriting results, an application is approved, denied, or sent back to the user for additional information. If the application is approved, the loan amount is deposited into the user's account. If the user drops off at any point while filling in the form, this drop-off information is stored in the NoSQL Database and fed into the orchestration system to kick-start a retargeting campaign aimed at that customer. The process is repeated with targeted ads, emails, or WhatsApp messages. If the customer returns, they can resume the journey where they left off.
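The drop-off detection step in this journey can be sketched as follows (step names and records are hypothetical; a real implementation would query the NoSQL store rather than a Python list):

```python
# Each in-progress application records the last completed step;
# anything short of "submitted" is a drop-off candidate for retargeting.
STEPS = ["personal_details", "income_details", "documents", "submitted"]

applications = [
    {"user": "u1", "last_step": "submitted"},
    {"user": "u2", "last_step": "income_details"},
    {"user": "u3", "last_step": "personal_details"},
]

def dropped_off(app: dict) -> bool:
    return app["last_step"] != "submitted"

# Users to feed into the orchestration/retargeting campaign.
retarget = [a["user"] for a in applications if dropped_off(a)]
print(retarget)  # ['u2', 'u3']
```

Because the last completed step is stored, a returning customer can be resumed at exactly the point where they left off.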

In conclusion, one of India's leading private banks modernized and expedited its digital presence and provided an enhanced experience for its customers using Oracle NoSQL Database. 

More information

Oracle NoSQL Database is a multi-model, multi-region database designed to provide a highly available, scalable, flexible, high-performance, and reliable data management solution for today's most demanding workloads. It is well suited to high-volume, high-velocity workloads such as the Internet of Things, customer 360, online contextual advertising, fraud detection, mobile applications, user personalization, and online gaming. Developers can use a single application interface to quickly build applications that run in on-premises and cloud environments. Visit the NoSQL Database Cloud Service page to learn more.

Michael Brey

Director of NoSQL Database Development, Oracle




NoSQL Database Use Cases


Relational databases impose fairly rigid, schema-based structure on data models: tables consisting of columns and rows, which can be joined to enable ‘relations’ among entities. Each table typically defines an entity. Each row in a table holds one entry, and each column contains a specific piece of information for that record. The relationships among tables are clearly defined and usually enforced by schemas and database rules.

Unlike RDBMSs, NoSQL databases encourage ‘application-first’ or API-first development patterns. Following these models, developers first consider queries that support the functionality specific to an application, rather than considering the data models and entities. This developer-friendly architecture paved the path to the success of the first generation of NoSQL databases.

NoSQL databases are often preferred when:

  • There are large quantities and varieties of data
  • Scalability is important
  • Continuous availability is a priority
  • Real-time analytics or big data processing is required


What are some of the top NoSQL database use cases? Here are some of the most common:

NoSQL for Big Data

NoSQL is a good option for organizations with data workloads directed toward rapidly processing and analyzing massive quantities of unstructured data—coined “big data” back in the 1990s. A flexible data model, continuous application availability, optimized database architecture, and modern transaction support are all important for processing big data. In contrast to relational databases, NoSQL databases are flexible because they are not confined to a fixed schema model.

In action, organizations that can process and act on fresh data rapidly achieve greater bottom-line value, business agility, and operational efficiency from it. A typical approach to real-time big data processing ingests new data with stream processing, analyzes historical data, and integrates both with a NoSQL database. For example, big data stored in a NoSQL database can be used for customer segmentation, delivering personalized ads to customers, data mining, and fraud detection.


NoSQL for IoT

Tens of billions of IoT devices, such as mobile devices, smart vehicles, home appliances, factory sensors, and healthcare systems, are now online. These devices continuously generate a massive amount of diverse, semi-structured data (approximately 847 zettabytes) that NoSQL databases are better equipped to ingest and analyze than their relational cousins. There are three ways to consider this:

Scalability. Scalability is difficult for SQL databases because IoT use cases often experience unpredictable traffic bursts and are frequently write-heavy to start with. NoSQL databases are a good option for managing system load when easy scaling of write capacity is a priority.

Consistency. Although relational databases deliver strong consistency guarantees, IoT applications are often well-suited for eventual consistency models.

Flexibility. While validation and schema capabilities are built into SQL databases, IoT data often needs more flexibility. NoSQL databases allow users to push schema enforcement logic to application code.
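Pushing schema enforcement into application code, as described above, can look like this minimal sketch (the reading fields and the rule itself are hypothetical):

```python
# With a schema-free store, validation moves into application code.
# A minimal check for a semi-structured sensor reading.
REQUIRED = {"device_id", "ts"}

def valid_reading(reading: dict) -> bool:
    # Required fields must be present...
    if not REQUIRED.issubset(reading):
        return False
    # ...while optional fields may vary freely per device type.
    return isinstance(reading.get("ts"), (int, float))

ok = valid_reading({"device_id": "th-7", "ts": 1700000000, "temp_c": 21.5})
bad = valid_reading({"temp_c": 21.5})  # missing required fields
print(ok, bad)  # True False
```

The point is the division of labor: the database accepts any shape, and the application decides which shapes it is willing to write.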


NoSQL for Ecommerce

NoSQL may offer better affordability, availability, performance, scalability, and flexibility for ecommerce applications compared to relational databases. Most ecommerce applications are characterized by frequent queries; massive, rapidly updating and expanding online catalogs; and huge amounts of inventory.

NoSQL databases often respond more rapidly to queries and are known for their cost-effective, predictable, horizontal scalability and high availability. With NoSQL databases, organizations can:

  • Handle massive volumes of data and traffic growth
  • Scale easily for a good price
  • Analyze inventory and catalog in real time
  • Provide catalog refreshes more rapidly
  • Expand online catalog and product offerings


NoSQL for Content Management

Multimedia content such as user-generated/social media reviews and imagery drives online sales most when it is curated and delivered to shoppers at the moment of interaction. Content management systems serve and store data assets and the metadata associated with them to a range of applications such as online publications, websites, and archives.

NoSQL databases offer an open-ended, flexible data model that is optimal for storing a mix of content, including structured, semi-structured, and unstructured data. NoSQL also allows users to aggregate and incorporate user data within a single catalog database to serve multiple business applications. In contrast, fixed RDBMS data models often cause many overlapping catalogs with different goals to proliferate.


NoSQL for Time Series Data

Except for very small datasets, extracting high performance for time series data with an SQL database demands significant customization and configuration—and any such configuration is nontrivial.

Time series data is unique in that it is generally monitoring data gathered to assess the health of a host, system, patient, environment, etc. To optimize for time series use cases, NoSQL databases typically add data in time-ascending order and delete against a large range of old data. This ensures high query and write performance.
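The time-ascending append and bulk-expiry pattern described above can be modeled in a few lines (an in-memory toy, not a real time series engine):

```python
import bisect

class TimeSeries:
    """Append-only, time-ordered storage with bulk expiry of old data --
    a toy model of the time series access pattern."""
    def __init__(self):
        self.times, self.values = [], []

    def append(self, ts, value):
        # Data normally arrives in time-ascending order.
        assert not self.times or ts >= self.times[-1]
        self.times.append(ts)
        self.values.append(value)

    def expire_before(self, cutoff):
        # Delete a large contiguous range of old points in one operation.
        i = bisect.bisect_left(self.times, cutoff)
        del self.times[:i], self.values[:i]

s = TimeSeries()
for t in range(10):
    s.append(t, t * 1.5)
s.expire_before(7)  # drop everything older than t=7
print(s.times)  # [7, 8, 9]
```

Because writes only touch the tail and deletes only touch a contiguous head, both operations stay cheap no matter how large the series grows.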

SQL databases are generally equally focused on creating, reading, updating, and deleting data, while NoSQL is less so. In addition, in contrast to the more loosely structured NoSQL database, SQL databases are typically designed with the ideas of atomicity, consistency, isolation, durability (ACID principles) in mind.

Time series databases typically collect data in time order in real time and accommodate extremely high volumes by holding data as immutable and append-only. Relational databases accommodate only lower ingest volumes and are optimized for transactions. Overall, NoSQL databases trade ACID principles for the basic availability, soft state, and eventual consistency (BASE) model, depending on the particular use case. In other words, for time series data the important notion is generally the aggregate trend, not any single point in the series.


NoSQL for Retail

To create differentiating, engaging digital customer experiences, it is essential to build on time-critical, data-intensive capabilities such as user profile management, personalization, and a unified customer view across touchpoints. This massive load of behavioral, demographic, and logistical data taxes RDBMS infrastructure that is designed to scale up rather than out.

Distributed NoSQL databases allow users to manage increasing attributes with less work, scale more cost-effectively, and enjoy reduced latency—the essence of satisfying online interactions for users in real-time. A personalized, high-quality, fast, consistent experience is no longer a standout feature; it’s what customers demand, across all devices.

NoSQL platforms help deliver positive customer support experiences across multiple verticals by capturing data from massive quantities of omnichannel interactions and relating it to the accounts and service status of individual customers. NoSQL databases:

  • Allow for expanding customer bases with extremely low latency and fast response times
  • Handle structured and unstructured data from a range of sources
  • Scale cost-effectively by design, and manage, store, query, and modify massive quantities of data while delivering personalized experiences
  • Flex to enable innovations in the customer experience
  • Seamlessly collect, integrate, and analyze new data in real-time
  • Serve as the backbone for artificial intelligence (AI) and machine learning (ML) engine algorithms that drive personalization with recommendations


NoSQL for Social Media

The bulk of social networking platforms consist of posts, media, profiles, relationships, and APIs. Posts allow users to share thoughts, while media allows them to share videos and photos. Profiles store basic user information, and relationships connect users. Through APIs, the platform and its users can interact with other sites and apps. These features demand that social network data be more flexible—and it is therefore more difficult to process.

Massive amounts of data present social media platforms with both daily maintenance and development problems. Storing huge quantities of unstructured, unpredictable information in SQL databases is impractical to process. Social media networks demand a flexible, occurrence-oriented database that operates on a schemaless data model—something relational SQL databases struggle to provide. Also, the vertical scaling SQL databases rely on requires upgrading hardware, which makes processing large batches of data expensive.

NoSQL can store generic objects, such as JSON, and support huge volumes of read-write operations. This contributes to data consistency across the distributed system, making NoSQL databases a good option for processing the big, unstructured patterns of data access typical of social media platforms.


NoSQL for Cybersecurity

To react in real time to a threat landscape that evolves constantly, cybersecurity demands speed and scale. To collect, store, and analyze the billions of events that reveal insight into the activities of malicious actors, cybersecurity providers are adopting cloud-native infrastructure. There are several reasons why a high-performance, low-latency NoSQL database offers a cybersecurity advantage, most linking back to improved speed and scalability:

Intrusion detection. Greater speed supports real-time analytics and insights users can compare rapidly to events and operational data contained in a single database to detect problems.

Threat analysis. Real-time updates enable more proactive responses to security breaches and other attacks—including prevention.

Compliance and governance. A NoSQL structure can collect and store events and telemetry when deployed across diverse topologies, either on-premises or in the cloud, to ensure compliance.

Virus and malware protection. Enables machine learning and file analysis to identify malware within harmless content to defend users and endpoints from threats and intrusions.


NoSQL for Fraud Detection and Identity Authentication

Ensuring only authentic users have access to applications and protecting sensitive personal data is a top priority that is heightened for banking, financial services, payments, and insurance.

It is sometimes possible to identify anomalies and patterns to detect fraud in real-time or even in advance. This demands real-time analysis of large quantities of both live and historic data, including environment, user profile, biometric data, geographic data, and context. For example, a $500 withdrawal may be typical until it occurs after hours in the wrong zip code.
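A rule like the withdrawal example above can be sketched as a simple check against a customer's historical profile (all thresholds and fields here are illustrative; real systems combine many such signals, often with ML models):

```python
from dataclasses import dataclass

@dataclass
class Withdrawal:
    amount: float
    hour: int       # 0-23, local time
    zip_code: str

# Hypothetical per-customer profile built from historical data.
profile = {"usual_zip": "560001", "active_hours": range(8, 22)}

def suspicious(w: Withdrawal, profile: dict) -> bool:
    # A $500 withdrawal is routine -- unless it happens after hours
    # in an unfamiliar zip code, as in the example above.
    after_hours = w.hour not in profile["active_hours"]
    wrong_zip = w.zip_code != profile["usual_zip"]
    return w.amount >= 500 and after_hours and wrong_zip

print(suspicious(Withdrawal(500, 14, "560001"), profile))  # False
print(suspicious(Withdrawal(500, 2, "110001"), profile))   # True
```

Combining live context (hour, location) with historical data (the profile) is what requires the fast reads over both fresh and historic records that the section describes.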

Reputational stakes are amplified with mistakes over social media, yet excessively high false positive rates hurt the customer experience. This is why a fast and highly available NoSQL database is so important to support complex data analysis of website interactions, the CRM system, historical shopping data, and other data that fraud detection and identity authentication demand.


NoSQL for Adtech

The speed that NoSQL databases are well-suited to deliver is a critical competitive advantage for AdTech and MarTech businesses:

SLAs. To meet strict SLAs, these platforms must capture ad space during page loads—and this demands single-digit-millisecond latencies.

Real-time bidding. Consistently responsive, available databases allow users to win more available ad inventory and avoid latency spikes.

Precision ad targeting . High-volume ad service based on revenue optimization, impressions, and campaign goals can allow a team to target audiences and determine the most engaging content for individual users rapidly.

Highly scalable personalization engines . AdTech and MarTech services rely on personalization engines. These engines analyze behavioral, demographic, and geo-location data in real-time to ensure each user has a tailored experience each time they visit.

Real-time analytics. Drive real-time decision-making with actionable insights extracted from masses of data.

Mobile device metadata stores. A geographically-distributed metadata store for mobile devices can improve user conversion and retention.

User behavior and impressions. Engage in real-time capture and analysis of clickstreams to identify trends, understand sentiment, and optimize campaigns.

Machine learning. Run analytics and operational workloads at high velocity on the same infrastructure against the same datasets.

The Definitive Guide to NoSQL Databases

Limited SQL scalability has prompted the industry to develop and deploy a number of NoSQL database management systems, with a focus on performance, reliability, and consistency. The trend was driven by proprietary NoSQL databases developed by Google and Amazon. Eventually, open-source systems like MongoDB, Cassandra, and Hypertable brought NoSQL within reach of everyone.

In this post, Senior Software Engineer Mohammad Altarade dives into some of them and explains why NoSQL will probably be with us for years to come.


By Mohammad Altarade

Mohammad is a highly motivated, high-energy individual with a passion for writing useful software and working with the latest technologies.

There is no doubt that the way web applications deal with data has changed significantly over the past decade. More data is being collected and more users are accessing this data concurrently than ever before. This means that scalability and performance are more of a challenge than ever for relational databases that are schema-based and therefore can be harder to scale.

The Evolution of NoSQL

The SQL scalability issue was recognized by Web 2.0 companies with huge, growing data and infrastructure needs, such as Google, Amazon, and Facebook. They came up with their own solutions to the problem – technologies like BigTable , DynamoDB , and Cassandra .

This growing interest resulted in a number of NoSQL database management systems (DBMSs), with a focus on performance, reliability, and consistency. A number of existing indexing structures were reused and improved upon with the purpose of enhancing search and read performance.

First, there were proprietary (closed source) types of NoSQL databases developed by big companies to meet their specific needs, such as Google’s BigTable, which is believed to be the first NoSQL system, and Amazon’s DynamoDB.

The success of these proprietary systems initiated development of a number of similar open-source and proprietary database systems, the most popular ones being Hypertable, Cassandra, MongoDB, DynamoDB, HBase, and Redis.

What Makes NoSQL Different?

One key difference between NoSQL databases and traditional relational databases is the fact that NoSQL is a form of unstructured storage.

This means that NoSQL databases do not have a fixed table structure like the ones found in relational databases.

Advantages and Disadvantages of NoSQL Databases

NoSQL databases have many advantages compared to traditional, relational databases.

One major, underlying difference is that NoSQL databases have a simple and flexible structure. They are schema-free.

Unlike relational databases, NoSQL databases are based on key-value pairs.

NoSQL store types include column stores, document stores, key-value stores, graph stores, object stores, XML stores, and other data store modes.

Usually, each value in the database has a key. Some NoSQL database stores also allow developers to store serialized objects into the database, not just simple string values.
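Storing a serialized object rather than a plain string can be as simple as serializing it on write and rebuilding it on read. A sketch using JSON serialization (the class, fields, and key names are hypothetical):

```python
import json

class SessionState:
    def __init__(self, user, step):
        self.user, self.step = user, step

# The value stored under the key is a serialized object, not a plain string.
kv = {}
obj = SessionState("u42", "income_details")
kv["session:u42"] = json.dumps(obj.__dict__)

# On read, the object is rebuilt from its serialized form.
restored = SessionState(**json.loads(kv["session:u42"]))
print(restored.step)  # income_details
```

Stores that accept binary values can use other serialization formats (e.g., language-native or schema-based ones) in exactly the same pattern.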

Open-source NoSQL databases don’t require expensive licensing fees and can run on inexpensive hardware, rendering their deployment cost-effective.

Also, when working with NoSQL databases, whether they are open-source or proprietary, expansion is easier and cheaper than with relational databases. This is because it is done by scaling horizontally and distributing the load across all nodes, rather than the vertical scaling typical of relational database systems, which means replacing the main host with a more powerful one.

Disadvantages

Of course, NoSQL databases are not perfect, and they are not always the right choice.

For one thing, most NoSQL databases do not support reliability features that are natively supported by relational database systems. These reliability features can be summed up as atomicity, consistency, isolation, and durability (ACID). This also means that NoSQL databases that don't support those features trade consistency for performance and scalability.

In order to support reliability and consistency features, developers must implement their own application-level code, which adds more complexity to the system.

This might limit the number of applications that can rely on NoSQL databases for secure and reliable transactions, like banking systems.

Another source of complexity in most NoSQL databases is incompatibility with SQL queries. Queries must instead be written in a manual or proprietary querying language, adding even more time and complexity.

NoSQL vs. Relational Databases

This table provides a brief feature comparison between NoSQL and relational databases:

It should be noted that the table shows a comparison at the database model level, not of the various database management systems that implement both models. These systems provide their own proprietary techniques to overcome some of the problems and shortcomings of both models and, in some cases, significantly improve performance and reliability.

NoSQL Data Store Types

Key Value Store

In the Key Value store type, a hash table is used in which a unique key points to an item.

Keys can be organized into logical groups of keys, only requiring keys to be unique within their own group. This allows for identical keys in different logical groups. The following table shows an example of a key-value store, in which the key is the name of the city, and the value is the address for Ulster University in that city.

Some implementations of the key value store provide caching mechanisms, which greatly enhance their performance.

All that is needed to deal with the items stored in the database is the key. Data is stored in the form of a string, JSON, or BLOB (Binary Large Object).

One of the biggest flaws in this form of database is the lack of consistency at the database level. This can be added by the developers with their own code, but as mentioned before, this adds more effort, complexity, and time.

The most famous NoSQL database that is built on a key value store is Amazon’s DynamoDB.
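The key-value model above, including logical groups of keys, can be sketched in a few lines of Python (a toy illustration, not the API of any real product):

```python
import json

class KeyValueStore:
    """Toy key-value store: one hash table per logical group of keys.

    Keys need only be unique within their own group, so identical keys
    may exist in different logical groups, as described above.
    """

    def __init__(self):
        self._groups = {}  # group name -> {key: value}

    def put(self, group, key, value):
        self._groups.setdefault(group, {})[key] = value

    def get(self, group, key):
        return self._groups[group][key]

store = KeyValueStore()
# The same key "Belfast" can live in two different logical groups;
# values can be plain strings or serialized (here JSON-encoded) objects.
store.put("universities", "Belfast", json.dumps({"name": "Ulster University"}))
store.put("airports", "Belfast", "BFS")

record = json.loads(store.get("universities", "Belfast"))
```

As the text notes, all that is needed to retrieve an item is its key (plus, here, its group).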

Document Store

Document stores are similar to key value stores in that they are schema-less and based on a key-value model. Both, therefore, share many of the same advantages and disadvantages. Both lack consistency at the database level, leaving it to applications to provide reliability and consistency features.

There are, however, key differences between the two.

In Document Stores, the values (documents) provide encoding for the stored data. Those encodings can be XML, JSON, or BSON (binary-encoded JSON).

Also, unlike key value stores, documents can be queried based on their contents, not just their keys.

The most popular database application that relies on a Document Store is MongoDB.

Column Store

In a Column Store database, data is stored in columns, as opposed to being stored in rows as is done in most relational database management systems.

A Column Store comprises one or more Column Families that logically group certain columns in the database. A key is used to identify and point to a number of columns, with a keyspace attribute that defines the scope of this key. Each column contains tuples of names and values, ordered and comma-separated.

Column Stores offer fast read/write access to the stored data. Rows that correspond to a single column are stored as a single disk entry, which makes access faster during read/write operations.

The most popular databases that use the column store include Google’s BigTable, HBase, and Cassandra.
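As a toy illustration of the column-family layout described above (a keyspace scoping row keys, column families grouping columns, and columns as name/value pairs), consider this hypothetical Python sketch:

```python
class ColumnStore:
    """Toy column-family store: keyspace -> column family -> row key -> columns.

    Each row key points to a set of columns, and each column is a
    (name, value) pair, loosely following the description above.
    """

    def __init__(self, keyspace):
        self.keyspace = keyspace  # the scope of the row keys
        self._families = {}       # family -> {row_key: {column: value}}

    def put(self, family, row_key, column, value):
        rows = self._families.setdefault(family, {})
        rows.setdefault(row_key, {})[column] = value

    def get_row(self, family, row_key):
        # A whole row of one family comes back together, mirroring the
        # "single disk entry" read path described above.
        return self._families[family][row_key]

db = ColumnStore(keyspace="app")
db.put("users", "u1", "name", "Ada")
db.put("users", "u1", "city", "London")
row = db.get_row("users", "u1")
```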

Graph Store

In a graph-based NoSQL database, a directed graph structure is used to represent the data. The graph comprises edges and nodes.

Formally, a graph is a representation of a set of objects, where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions, called vertices, and the links that connect some pairs of vertices are called edges. A set of vertices and the edges that connect them is said to be a graph.

A graph about graphs: a graph records nodes and relationships, both of which have properties, and relationships organize nodes.

This illustrates the structure of a graph-based database, which uses edges and nodes to represent and store data. Nodes are organized by their relationships with one another, represented by edges between the nodes. Both the nodes and the relationships have defined properties.

Graph databases are most typically used in social networking applications. Graph databases allow developers to focus more on relations between objects rather than on the objects themselves. In this context, they indeed allow for a scalable and easy-to-use environment.

Currently, InfoGrid and InfiniteGraph are the most popular graph databases.
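The node/edge/property structure described above can be sketched as follows (an illustrative toy, not the API of any real graph database):

```python
class GraphStore:
    """Toy graph database: nodes and directed edges, both with properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source, destination, relationship, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, dst, rel, **props):
        self.edges.append((src, dst, rel, props))

    def neighbours(self, node_id, rel=None):
        """Follow outgoing edges, optionally filtered by relationship type."""
        return [dst for s, dst, r, _ in self.edges
                if s == node_id and (rel is None or r == rel)]

g = GraphStore()
g.add_node("alice", age=30)
g.add_node("bob", age=28)
g.add_node("acme", kind="company")
g.add_edge("alice", "bob", "FRIEND", since=2015)
g.add_edge("alice", "acme", "WORKS_AT")

friends = g.neighbours("alice", rel="FRIEND")
```

Note how the query focuses on relationships between objects rather than on the objects themselves, which is the strength of this model in social-networking workloads.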

NoSQL Database Management Systems

The following table provides a brief comparison between different NoSQL database management systems.

MongoDB has flexible schema storage, which means stored objects are not required to have the same structure or fields. MongoDB also has optimization features that distribute data collections across nodes, resulting in overall performance improvements and a more balanced system.

Other NoSQL database systems, such as Apache CouchDB, are also document store type databases and share many features with MongoDB, with the notable difference that CouchDB's database can be accessed using RESTful APIs.

REST is an architectural style consisting of a coordinated set of architectural constraints applied to components, connectors, and data elements, within the World Wide Web. It relies on a stateless, client-server, cacheable communications protocol (e.g., the HTTP protocol).

RESTful applications use HTTP requests to post, read, and delete data.

As for column-based databases, Hypertable is a NoSQL database written in C++ and based on Google's BigTable.

Hypertable supports distributing data stores across nodes to maximize scalability, just like MongoDB and CouchDB.

One of the most widely used NoSQL databases is Cassandra, developed by Facebook.

Cassandra is a column store database that includes a lot of features aimed at reliability and fault tolerance.

Rather than providing an in-depth look at every NoSQL DBMS, the next subsections explore Cassandra and MongoDB, two of the most widely used NoSQL database management systems.

Cassandra

Cassandra is a database management system originally developed by Facebook.

The goal behind Cassandra was to create a DBMS that has no single point of failure and provides maximum availability.

Cassandra is mostly a column store database, though some studies have referred to it as a hybrid system inspired by Google's BigTable, a column store database, and Amazon's DynamoDB, a key-value database.

This hybrid nature is achieved by providing a key-value system in which the keys point to sets of column families, relying on BigTable's distributed storage model and Dynamo's availability features (a distributed hash table).

Cassandra is designed to store huge amounts of data distributed across different nodes. Cassandra is a DBMS designed to handle massive amounts of data, spread out across many servers, while providing a highly available service with no single point of failure, which is essential for a big service like Facebook.

The main features of Cassandra include:

  • No single point of failure. For this to be achieved, Cassandra must run on a cluster of nodes rather than a single machine. That doesn't mean the data on each node is the same, but the management software is. When one node fails, the data on that node becomes inaccessible; however, other nodes (and their data) remain accessible.
  • Distributed Hashing is a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots. This provides the ability to distribute the load to servers or nodes according to their capacity, and in turn, minimize downtime.
  • Relatively easy-to-use client interface. Cassandra uses Apache Thrift for its client interface. Apache Thrift provides a cross-language RPC client, but most developers prefer open-source alternatives built on top of Apache Thrift, such as Hector.
  • Other availability features. One of Cassandra's features is data replication, which mirrors data to other nodes in the cluster. Replication can be random, or targeted to maximize data protection, for example by placing replicas on nodes in a different data center. Another Cassandra feature is the partitioning policy, which decides on which node to place a key; this can also be random or ordered. By combining both types of partitioning policies, Cassandra can strike a balance between load balancing and query performance optimization.
  • Consistency. Features like replication make consistency challenging, because all nodes must hold the latest values at any point in time, or at least by the time a read operation is triggered. Cassandra tries to maintain a balance between replication actions and read/write actions by leaving this trade-off customizable by the developer.
  • Read/Write Actions. The client sends a request to a single Cassandra node, which stores the data in the cluster according to the replication policy. Each node first performs the data change in its commit log and then updates the table structure with the change, both done synchronously. The read operation is very similar: a read request is sent to a single node, and that node determines which node holds the data according to the partitioning/placement policy.
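The distributed hashing listed above is commonly implemented as a consistent-hashing ring. Here is a minimal, illustrative Python sketch (not Cassandra's actual implementation): adding a node only remaps the keys that fall on the new node's arc of the ring, which is what keeps downtime and data movement small.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hashing ring: each node owns an arc of the
    hash space, and a key belongs to the first node at or after its
    hash position (wrapping around)."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _pos(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._pos(node), node))

    def node_for(self, key):
        positions = [p for p, _ in self._ring]
        i = bisect.bisect(positions, self._pos(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in ("k1", "k2", "k3", "k4")}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in before}
moved = [k for k in before if before[k] != after[k]]  # only these keys move
```

A useful invariant of this scheme: any key whose placement changed after adding a node must now map to the new node; all other keys stay put.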

MongoDB

MongoDB is a schema-free, document-oriented database written in C++. The database is document store based, which means it stores values (referred to as documents) in the form of encoded data.

The encoding format of choice in MongoDB is JSON. This is powerful because even data nested inside JSON documents remains queryable and indexable.

The subsections that follow describe some of the key features available in MongoDB.

Sharding

Sharding is the partitioning and distribution of data across multiple machines (nodes). A shard is a collection of MongoDB nodes, in contrast to Cassandra, where nodes are symmetrically distributed. Using shards also means the ability to scale horizontally across multiple nodes. An application using a single database server can be converted to a sharded cluster with very few changes to the original application code, because the way MongoDB implements sharding is almost completely decoupled from the public APIs exposed to the client side.

Mongo Query Language

To retrieve certain documents from a database collection, a query document is created containing the fields that the desired documents should match.
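Query-document matching of this kind can be sketched in Python. This toy `matches` helper is hypothetical and only loosely imitates MongoDB's behavior, including dotted paths for reaching into nested fields:

```python
def matches(doc: dict, query: dict) -> bool:
    """Return True if every field in the query document matches the document.

    Dotted paths (e.g. "address.city") reach into nested JSON-style
    documents, loosely imitating how query documents address nested fields.
    """
    for path, expected in query.items():
        value = doc
        for part in path.split("."):
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        if value != expected:
            return False
    return True

collection = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}},
    {"_id": 2, "name": "Alan", "address": {"city": "Manchester"}},
]

# Find documents whose nested city field equals "London".
found = [d for d in collection if matches(d, {"address.city": "London"})]
```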

In MongoDB, there is a group of servers called routers, each acting as a server for one or more clients. Similarly, the cluster contains a group of servers called configuration servers, each holding a copy of the metadata indicating which shard contains what data. Read and write actions are sent from clients to one of the router servers in the cluster, and are automatically routed by that server to the appropriate shards containing the data, with the help of the configuration servers.

Similar to Cassandra, a shard in MongoDB has a data replication scheme, which creates a replica set of each shard holding exactly the same data. There are two types of replication schemes in MongoDB: Master-Slave replication and Replica-Set replication. Replica-Set provides more automation and better failure handling, while Master-Slave sometimes requires administrator intervention. Regardless of the replication scheme, at any point in time only one shard in a replica set acts as the primary shard; all other replica shards are secondary shards. All write and read operations go to the primary shard and are then distributed evenly (if needed) to the other secondary shards in the set.

In the graphic below, we see the MongoDB architecture explained above, showing the router servers in green, the configuration servers in blue, and the shards that contain the MongoDB nodes.

Four numbered shards each contain three "mongod" nodes; Shard4 is colored grey and labeled "replica set." Shard1 is connected to a group of three blue "C1 mongod" nodes labeled "config servers"; that group and each of the shards are connected to a series of green "mongos" nodes, which are in turn connected to a series of clients.

It should be noted that sharding (the splitting of data between shards) in MongoDB is completely automatic, which reduces the failure rate and makes MongoDB a highly scalable database management system.

Indexing Structures for NoSQL Databases

Indexing is the process of associating a key with the location of a corresponding data record in a DBMS. There are many indexing data structures used in NoSQL databases. The following sections will briefly discuss some of the more common methods; namely, B-Tree indexing, T-Tree indexing, and O2-Tree indexing.

B-Tree Indexing

B-Tree is one of the most common index structures in DBMSs.

In B-trees, internal nodes can have a variable number of child nodes within some predefined range.

One major difference from other tree structures, such as AVL-Trees, is that a B-Tree allows nodes to have a variable number of child nodes, meaning less tree balancing but more wasted space.

The B+-Tree is one of the most popular variants of the B-Tree. It improves on the B-Tree by requiring all keys to reside in the leaves.
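To make the B+-Tree idea concrete, here is a minimal, hand-built Python sketch (an illustration, not a full implementation with insertion and splitting) in which internal nodes only route searches and all keys reside in the leaves:

```python
import bisect

class Leaf:
    """B+-Tree leaf: all keys, with their records, live at this level."""
    def __init__(self, keys, records):
        self.keys, self.records = keys, records

class Internal:
    """B+-Tree internal node: separator keys only route the search."""
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

def search(node, key):
    # Walk down: pick the child whose key range contains the key,
    # then look the key up inside the leaf.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.records[i]
    return None

# A tiny hand-built tree: every key resides in a leaf, and the
# internal node holds only the separator key 20.
tree = Internal([20], [Leaf([5, 10], ["r5", "r10"]),
                       Leaf([20, 30], ["r20", "r30"])])
```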

T-Tree Indexing

The data structure of T-Trees was designed by combining features from AVL-Trees and B-Trees.

AVL-Trees are a type of self-balancing binary search tree, while B-Trees allow each node to have a variable number of children.

In a T-Tree, the structure is very similar to the AVL-Tree and the B-Tree.

Each node stores more than one {key-value, pointer} tuple. Also, binary search is utilized in combination with the multiple-tuple nodes to produce better storage and performance.

A T-Tree has three types of nodes: A T-Node that has a right and left child, a leaf node with no children, and a half-leaf node with only one child.

It is believed that T-Trees have better overall performance than AVL-Trees.

O2-Tree Indexing

The O2-Tree is essentially an improvement over Red-Black Trees, a form of binary search tree, in which the leaf nodes contain the {key-value, pointer} tuples.

O2-Tree was proposed to enhance the performance of current indexing methods. An O2-Tree of order m (m ≥ 2), where m is the minimum degree of the tree, satisfies the following properties:

  • Every node is either red or black. The root is black.
  • Every leaf node is colored black and consists of a block or page that holds “key value, record-pointer” pairs.
  • If a node is red, then both its children are black.
  • For each internal node, all simple paths from the node to descendant leaf-nodes contain the same number of black nodes. Each internal node holds a single key value.
  • Leaf-nodes are blocks that have between ⌈m/2⌉ and m “key-value, record-pointer” pairs.
  • If a tree has a single node, then it must be a leaf, which is the root of the tree, and it can hold between 1 and m key data items.
  • Leaf nodes are double-linked in forward and backward directions.

Here, we see a straightforward performance comparison between O2-Tree, T-Tree, B+-Tree, AVL-Tree, and Red-Black Tree:

A graph comparing "Total Time in Seconds" (0-250) on the Y axis to "Update Ratio" (0-100) on the X axis. The five tree types all start with total times under 100 on the left, then increase on the right. O2-Tree, T-Tree, and AVL-Tree increase slower than the other two toward the right, with AVL-Tree ending around 125, O2-Tree ending around 75, and T-Tree somewhere in between.  Red-Black Tree and B+-Tree have more ups and downs, and both finish near each other in the top-right, with Red-Black Tree having a slightly higher value there.

The order of the T-Tree, B+-Tree, and the O2-Tree used was m = 512.

Time is recorded for operations of search, insert, and delete with update ratios varying between 0%-100% for an index of 50M records, with the operations resulting in adding another 50M records to the index.

It is clear that with an update ratio of 0-10%, the B+-Tree and T-Tree perform better than the O2-Tree. However, as the update ratio increases, the O2-Tree index performs significantly better than most other data structures, with the B+-Tree and Red-Black Tree structures suffering the most.

The Case for NoSQL?

A quick introduction to NoSQL databases, highlighting the key areas where traditional relational databases fall short, leads to the first takeaway:

While relational databases offer consistency, they are not optimized for high performance in applications where massive data is stored and processed frequently.

NoSQL databases gained a lot of popularity due to high performance, high scalability and ease of access; however, they still lack features that provide consistency and reliability.

Fortunately, a number of NoSQL DBMSs address these challenges by offering new features to enhance scalability and reliability.

Not all NoSQL database systems perform better than relational databases.

MongoDB and Cassandra have similar, and in most cases better, performance than relational databases in write and delete operations.

There is no direct correlation between the store type and the performance of a NoSQL DBMS. NoSQL implementations undergo changes, so performance may vary.

Therefore, performance measurements across database types in different studies should always be updated with the latest versions of database software in order for those numbers to be accurate.

While I can’t offer a definitive verdict on performance, here are a few points to keep in mind:

  • Traditional B-Tree and T-Tree indexing is commonly used in traditional databases.
  • One study offered improvements and enhancements by combining the characteristics of multiple indexing structures to come up with the O2-Tree.
  • The O2-Tree outperformed other structures in most tests, especially with huge datasets and high update ratios.
  • The B-Tree structure delivered the worst performance of all indexing structures covered in this article.

Further work can and should be done to enhance the consistency of NoSQL DBMSs. The integration of both systems, NoSQL and relational databases, is an area to further explore.

Finally, it’s important to note that NoSQL is a good addition to existing database standards, but with a few important caveats. NoSQL trades reliability and consistency features for sheer performance and scalability. This renders it a specialized solution, as the number of applications that can rely on NoSQL databases remains limited.

The upside? Specialization might not offer much in the way of flexibility, but when you want to get a specialized job done as quickly and efficiently as possible, you don’t need a Swiss Army Knife. You need NoSQL.


  • Open access
  • Published: 14 August 2015

Choosing the right NoSQL database for the job: a quality attribute evaluation

João Ricardo Lourenço, Bruno Cabral, Paulo Carreiro, Marco Vieira & Jorge Bernardino

Journal of Big Data, volume 2, Article number: 18 (2015)


For over forty years, relational databases have been the leading model for data storage, retrieval and management. However, due to increasing needs for scalability and performance, alternative systems have emerged, namely NoSQL technology. The rising interest in NoSQL technology, as well as the growth in the number of use case scenarios over the last few years, has resulted in an increasing number of evaluations and comparisons among competing NoSQL technologies. While most research work focuses on performance evaluation using standard benchmarks, it is important to note that the architecture of real-world systems is not only driven by performance requirements, but has to comprehensively include many other quality attribute requirements. Software quality attributes form the basis from which software engineers and architects develop software and make design decisions. Yet, there has been no quality-attribute-focused survey or classification of NoSQL databases in which databases are compared with regard to their suitability for the quality attributes common in the design of enterprise systems. To fill this gap, and to aid software engineers and architects, in this article we survey and create a concise and up-to-date comparison of NoSQL engines, identifying their most beneficial use case scenarios from the software engineer's point of view and the quality attributes that each of them is most suited to.

Introduction

Relational databases have been the stronghold of modern computing applications for decades. ACID properties (Atomicity, Consistency, Isolation, Durability) made relational databases the solution for almost all data management systems. However, the need to handle data in web-scale systems [ 1 – 3 ], in particular Big Data systems [ 4 ], have led to the creation of numerous NoSQL databases.

The term NoSQL was first coined in 1998 to name a relational database that did not have a SQL (Structured Query Language) interface [ 5 ]. It was then brought back in 2009 for naming an event which highlighted new non-relational databases, such as BigTable [ 3 ] and Dynamo [ 6 ], and has since been used without an “official” definition. Generally speaking, a NoSQL database is one that uses a different approach to data storage and access when compared with relational database management systems [ 7 , 8 ]. NoSQL databases lose the support for ACID transactions as a trade-off for increased availability and scalability [ 1 , 7 ]. Brewer created the term BASE for these systems - they are Basically Available, have a Soft state (during which they are not yet consistent), and are Eventually consistent, as opposed to ACID systems [ 9 ]. This BASE model forfeits the essential ACID properties of consistency and isolation in order to favor “availability, graceful degradation, and performance” [ 9 ]. While originally the term stood for “No SQL”, it has recently been restated as “Not Only SQL” [ 1 , 7 , 10 ] to highlight that these systems rarely fully drop the relational model. Thus, in spite of being a recurrent theme in literature, NoSQL is a very broad term, encompassing very distinct database systems.

There are hundreds of readily available NoSQL databases, each with different use case scenarios [ 11 ]. They are usually divided into four categories [ 2 , 7 , 12 ], according to their data model and storage: Key-Value Stores, Document Stores, Column Stores and Graph databases. This classification reflects the fact that each kind of database offers different solutions for specific contexts. The “one size fits all” approach of relational databases no longer applies.

There has been extensive research in the comparison of relational and non-relational databases in terms of their performance for different applications. However, when developing enterprise systems, performance is only one of many quality attributes to be considered. Unfortunately, there has not yet been a comprehensive assessment of NoSQL technology in what concerns software quality attributes. The goal of this article is to fill this gap, by clearly identifying which NoSQL databases better promote the several quality attributes, thus becoming a reference for software engineers and architects.

This article is a revised and extended version of our WorldCIST 2015 paper [ 13 ]. It improves and complements the former in the following aspects:

Three more quality attributes (Consistency, Robustness and Maintainability) were evaluated.

A new section describing the evaluated NoSQL databases was introduced.

The state of the art was extended to provide more up to date and thorough information.

All of the previously evaluated quality attributes were reevaluated in light of new studies and new developments in the NoSQL ecosystem.

New conclusions and insights derived from the quality attribute based analysis of several NoSQL databases.

Henceforth, the main contributions of this article can be summarized as follows:

The development of a quality-attribute oriented evaluation of NoSQL databases (Table 2 ). Software architects may use this information to assess which NoSQL database best fits their quality attribute requirements.

A survey of the literature on the evaluation of NoSQL databases from a historic perspective.

The identification of several future research directions towards the full coverage of software quality attributes in the evaluation of NoSQL databases.

The remainder of this article is structured as follows. In Section ‘ Background and literature review ’, we perform a review of the literature and evaluation surrounding NoSQL systems. In Section ‘ Research design and methodology ’, we introduce the methodology used to select the quality attributes and NoSQL databases that we evaluated, as well as the methodology used in that evaluation. In Section ‘ Evaluated NoSQL databases ’, we present and describe the evaluated NoSQL databases. In Section ‘ Software quality attributes ’, we analyze the different quality attributes and identify the best NoSQL solutions for each of these quality attributes according to the literature. In Section ‘ Results and discussion ’, a summary table and analysis of the results of this evaluation is provided. Finally, Section ‘ Conclusions ’ presents future work and draws the conclusions.

Background and literature review

The word NoSQL was re-introduced in 2009 during an event about distributed databases [ 5 ]. The event intended to discuss the new technologies being presented by Google (Google BigTable [ 3 ]) and Amazon (Dynamo [ 14 ]) to handle high amounts of data. Interest in the research of NoSQL technologies has bloomed since then, leading to the publication of works such as those by Stonebraker and Cattell [ 12 , 15 , 16 ]. Stonebraker began his research by describing different types of NoSQL technology and the differences among them when compared to relational technology. The author argues that the main reasons to move to NoSQL databases are performance and flexibility. Performance is mainly focused on the sharding and management of distributed data (i.e., dealing with “Big Data”), while flexibility relates to the semi-structured or unstructured data that may arise on the web.

By 2011, the NoSQL ecosystem was thriving, with several databases being the center of multiple studies [ 17 – 20 ]. These included Cassandra, Amazon SimpleDB, SciDB, CouchDB, MongoDB, Riak, Redis, and many others. Researchers categorized existing databases, and identified what kinds of NoSQL databases existed according to different architectures and goals. Ultimately, the majority agreed on four categories of databases [ 11 ]: Document Store, Column Store, Key-value Store and Graph-oriented databases.

Hecht and Jablonski [ 11 ] described the main characteristics offered by different NoSQL solutions, such as availability and horizontal scalability. Konstantinou et al. [ 19 ] performed a study based on the elasticity of non-relational solutions and compared HBase, Cassandra and Riak during execution of read and update operations. The authors concluded that HBase provided high elasticity and fast reads, while Cassandra was capable of delivering fast inserts (writes). On the other hand, according to the authors, Riak did not show good scaling or a high performance increase, regardless of the type of access. Many studies focused on evaluating performance [ 4 , 11 , 21 ].

Performance evaluations were made easier by the popularization of the Yahoo! Cloud Serving Benchmark (YCSB), proposed and implemented by Cooper et al. [ 21 ]. This benchmark, still widely used today, allows testing the read/write, latency and elasticity capabilities of any database, in particular NoSQL databases. The first studies using YCSB evaluated Cassandra, HBase, PNUTS and MySQL to conclude that each database offers its own set of trade-offs. The authors warn that each database performs at its best in different circumstances and, thus, a careful choice of the one to use must be made according to the nature of each project.

Since 2012, NoSQL databases have most often been evaluated and compared to RDBMSs (Relational Database Management Systems). A performance evaluation carried out by [ 22 ] compared Cassandra, MongoDB and PostgreSQL, concluding that MongoDB is capable of providing high throughput, but mainly when it is used as a single server instance. On the other hand, Cassandra was considered the best choice for supporting a large distributed sensor system due to its horizontal scalability. Floratou et al. [ 4 ] used YCSB and TPC-H to compare the performance of MongoDB and MS SQL Server, as well as Hive. The authors state that NoSQL technology has room for improvement and should be further updated. Ashram and Anderson [ 7 ] studied the data model of Twitter and found that using non-relational technology creates additional difficulties on the programmers’ side. Parker et al. [ 23 ] also chose MongoDB and compared its performance with MS SQL Server using only one server instance. According to their results, when performing inserts, updates and selects, MongoDB is faster, but MS SQL Server outperforms MongoDB when running complex queries instead of simpler key-value access. In [ 24 ], Kashyap et al. compare the performance, scalability and availability of HBase, MongoDB, Cassandra and CouchDB by using YCSB. Their results show that Cassandra and HBase shared similar behaviour, but the former scaled better, and that MongoDB performed better than HBase by factors in the hundreds for their particular workload. The authors are prudent, and note that NoSQL is constantly evolving and that evaluations can quickly become obsolete. Rabl et al. [ 25 ] studied Cassandra, Voldemort, HBase, Redis, VoltDB and MySQL Cluster with regards to throughput, latency and scalability. Cassandra’s throughput is consistently better than that of the other databases, but it exhibits high latency. Voldemort, HBase and Cassandra all show linear scalability, and Voldemort has the most stable, lowest latency.
Of the tested NoSQL databases, VoltDB has the worst results and HBase also lagged behind the other databases in terms of throughput.

Already in 2013, with the research focus on performance, Thumbtack Technologies produced two white papers comparing Aerospike, Cassandra, Couchbase and MongoDB [ 26 , 27 ]. In [ 26 ], the authors compare the durability and performance trade-offs of several state of the art NoSQL systems. Their results firstly show that Couchbase and Aerospike have good in-memory performance, and that MongoDB and Cassandra lagged behind in bulk loading capabilities. Regarding durability, Aerospike beats the competition in large balanced and read-heavy datasets. For in-memory datasets, Couchbase performed similarly to Aerospike as well. In their second paper [ 27 ], the authors study failover characteristics. Their results allow for many conclusions, but overall tend to indicate that Aerospike, Cassandra and Couchbase give strong availability guarantees.

In [ 28 ], MongoDB and Cassandra are compared in terms of their features and capabilities using YCSB. MongoDB is shown to be impacted by high workloads, whereas Cassandra seemed to experience performance boosts with increasing amounts of data. Additionally, Cassandra was found to be superior for update operations. In [ 29 ], the authors studied the applicability of NoSQL to RDF (Resource Description Framework) data processing and make several key observations: 1) distributed NoSQL systems can be competitive with RDF stores with regard to query times; 2) NoSQL systems scale more gracefully than RDF stores when loading data in parallel; 3) complex SPARQL (SPARQL Protocol and RDF Query Language) queries, particularly those with joins, perform poorly on NoSQL systems; 4) classical query optimization techniques work well on NoSQL RDF systems; 5) MapReduce-like operations introduce higher latency. As their final conclusion, the authors state that NoSQL represents a compelling alternative to native RDF stores for simple workloads. Several other studies were performed in the same year regarding the applicability of NoSQL to diverse scenarios, such as [ 30 – 32 ].

More recently, as of 2014, experiments have become less focused on performance and more focused on applicability. NoSQL has seen validation and widespread usage, and so, in [ 10 ], a survey of some of the most popular NoSQL solutions is presented. The authors state some of the advantages and main uses according to the NoSQL database type. In another evaluation, [ 33 ] tested MongoDB and CouchDB in real medical scenarios. They concluded that MongoDB and CouchDB have similar performance and drawbacks, and note that, while applicable to medical imaging archiving, NoSQL still has to improve. In [ 34 ], the Yahoo! Cloud Serving Benchmark is used with a middleware layer that translates SQL queries into NoSQL commands. The authors tested Cassandra and MongoDB with and without the middleware layer, noting that it was possible to build middleware to ease the move from relational data stores to NoSQL databases. In [ 35 ], a write-intensive enterprise application is used as the basis for comparing Cassandra, MongoDB and Couchbase with MS SQL Server. The results show that Cassandra outperforms the other NoSQL databases in a four-node setup, and that an MS SQL Server running on a single node vastly outperforms all NoSQL contenders for this particular setup and scenario.

The latest trends in NoSQL research, although still related to applicability and performance, have also concerned the validity of the benchmarking processes and tools used throughout the years. The authors of [ 36 ] propose an improvement of YCSB, called YCSB++, to deal with several shortcomings of the benchmark. In [ 37 ], the author proposes a method to validate previously proposed benchmarks of NoSQL databases, claiming that rigorous algorithms should underpin any benchmarking methodology before practical use. Chen et al., in [ 38 ], survey benchmarking tools such as YCSB and TPC-C, list shortcomings and difficulties in implementing MapReduce and Big Data related benchmark systems, and propose methods for overcoming these difficulties. Similar work had already been done in [ 39 ], where benchmarks are reviewed and suggestions are given on building better benchmarks.

As we have seen, and to the best of our knowledge, there are no studies focused on quality attributes and how each NoSQL system fits each of these attributes. Our work attempts to fill this gap by reviewing the literature, in Section ‘ Software quality attributes ’, with regard to the different quality attributes, and finally presenting our findings in a summary table in Section ‘ Results and discussion ’.

It is important to note that the analysis of NoSQL systems is inherently bound to the CAP theorem [ 40 ]. The CAP theorem, proposed by Brewer, states that no distributed system can simultaneously guarantee Consistency, Availability and Partition-Tolerance. In the context of the CAP theorem [ 40 , 41 ], consistency is often viewed as the premise that all nodes see the same data at the same time [ 42 ]. Under this classification, most NoSQL databases choose to be “AP”, meaning they provide Availability and Partition-Tolerance. Since Partition-Tolerance is a property that usually cannot be traded away, Availability and Consistency are balanced against each other, with most databases sacrificing more consistency than availability [ 43 ]. An illustration of CAP is shown in Fig. 1 .

CAP theorem with databases that “choose” CA, CP and AP

Some authors (Brewer among them) have come to criticize the way the CAP theorem is interpreted, and some have claimed that much has been written in the literature under false assumptions [ 41 , 44 – 46 ]. The idea of CA (systems which ensure Consistency and Availability) is now most often looked at as a trade-off on a micro-scale [ 41 ], where individual operations can have their consistency guarantees explicitly defined. This means that some operations can be tied to full consistency (in the sense of ACID semantics), or to one of a vast range of possible consistency options. Modern NoSQL systems allow for this consistency tuning and should therefore not be viewed through such a simplistic lens, which reduces the whole system to “CA”, “CP” or “AP”.

Research design and methodology

This work was developed to answer the following research question: “Is there currently enough knowledge on quality attributes in NoSQL systems to aid a software engineer’s decision process”? In our literature survey, we did not find any similar work attempting to provide a quality attribute guided evaluation of NoSQL databases. Thus, we devised a methodology to develop this work and answer our original research question.

We began by identifying several desirable quality attributes to evaluate in NoSQL databases. There are hundreds of quality attributes, yet some are nearly ubiquitous in every software project [ 47 ], and others are intimately tied to the topic of database systems, storage models and web applications (where the database backend often requires certain quality attributes) [ 48 ]. This led us to identify the following quality attributes to evaluate: Availability, Consistency, Durability, Maintainability, Read and Write performance, Recovery Time, Reliability, Robustness, Scalability and Stabilization Time. These attributes are interdependent and have an impact on most software projects. Most of these attributes have also been the target (even if indirectly) of several studies [ 18 , 27 , 49 , 50 ], rendering them ideal picks for this work.

Once these quality attributes had been identified, we determined which NoSQL systems were most popular and widely used, so as to narrow our research to a fixed set of NoSQL databases. This search led us to selecting Aerospike, Cassandra, Couchbase, CouchDB, HBase, MongoDB and Voldemort as the systems to evaluate. These are often found in the literature [ 6 , 10 , 11 , 26 , 51 , 52 ] and other sources [ 53 ] as the most popular and used systems, as well as the most versatile or appropriate to certain scenarios. For instance, while Couchbase and CouchDB share source code and several similar original design goals, they have evolved into different systems, both highly successful and with different characteristics. In much the same way, MongoDB and Cassandra, which are probably the most used NoSQL databases in the market, have fundamentally different approaches to their storage models. Thus, our selection of databases attempted to find not only the most popular and mature databases in general, but also those that find high applicability in specific areas.

We surveyed the literature to evaluate the selected quality attributes on the aforementioned databases. This survey took into account already available evaluations regarding certain quality attributes, such as performance [ 51 , 54 ], consistency [ 43 ] or durability [ 26 ]. Each of the surveyed papers was taken into account according to the versions of the database tested (e.g. papers with outdated versions were given less relevance), generality of results and overall relevance to this evaluation. The summary table presented in Section ‘ Results and discussion ’ is the result of this careful evaluation of the NoSQL literature, technical knowledge found on the NoSQL ecosystem and expert opinions and positions. We also took into account the overall architectures of each NoSQL system (e.g. systems built with durability limitations are intrinsically limited in terms of this quality attribute). The result of this methodology is the aforementioned summary table, which we hope will aid software engineers and architects in their decision process when selecting a given NoSQL database according to a certain quality attribute.

In the following sections, we present the databases that we evaluated from the literature, as well as that evaluation.

Evaluated NoSQL databases

There are several popular NoSQL databases which have gained recognition and are usually considered before other NoSQL alternatives. We studied several of these databases (Aerospike, Cassandra, Couchbase, CouchDB, HBase, MongoDB and Voldemort) by performing a literature review, and introduce the first quality-attribute-based evaluation of NoSQL databases. In this section, these selected databases are presented, with a summary table at the end (Table 1 ) detailing their characteristics.

Aerospike (formerly known as Citrusleaf [ 10 ] and recently open-sourced) is a shared-nothing NoSQL key-value database which offers mainly AP (Availability and Partition-Tolerance) characteristics. Additionally, its developers claim that it provides high consistency [ 55 ] by trading off availability and consistency at low granularity in specific subsystems, restricting communication latencies, minimizing cluster size, maximizing consistency and availability during failover situations, and providing automatic conflict resolution. Consistency is ensured through synchronous writes to replicas, which guarantee immediate consistency. This immediate consistency can be relaxed if the software architects deem it necessary. Durability is ensured by guaranteeing the use of flash/SSD on every node and performing direct reads from flash, as well as replication at several different layers.

Failover can be handled in two different ways [ 55 ]: favoring high consistency while in AP mode, or availability while in CP (Consistency and Partition-Tolerance) mode. The former uses techniques to “virtually eliminate network based partitioning”, including fast heartbeats and consistent Paxos-based cluster formation. These techniques favor Consistency over Availability to ensure that the system does not enter a state of network partition. If, however, partitioning does occur, Aerospike offers two conflict handling policies: one relies on the database’s auto-merging capabilities, and the other offloads the conflict to the application layer so that application developers can resolve it themselves and write the correct data back to the database. The second way that Aerospike manages failover is to maximize availability while in CP mode; here, some availability must be sacrificed by, for instance, forcing the minority quorum(s) to halt, thus avoiding data inconsistency if a network split occurs.

Aerospike is, in summary, an in-memory database with disk persistence, automatic data partitioning and synchronous replication, offering cross data center replication and configurability in the failover handling mechanism, favoring either full or high consistency [ 10 , 52 , 55 ].

Cassandra is an open-source shared-nothing NoSQL column-store database developed and used at Facebook [ 10 , 52 , 56 ]. It is based on the ideas behind Google BigTable [ 3 ] and Amazon Dynamo [ 14 ].

Cassandra is similar to BigTable with respect to its data model. The minimal unit of storage is a column, with rows consisting of columns or super columns (nested columns). Columns themselves consist of a name, a value and a timestamp, all of which are provided by the client. Since it is column-based, rows need not have the same number of columns [ 10 ].

Cassandra supports a SQL-like language called CQL, together with other protocols [ 10 ]. Indexes and secondary indexes are supported, and atomicity is guaranteed at the level of one table row. Persistence is ensured by logging. Consistency is highly tunable on a per-operation basis – the application developer can specify the desired level of consistency, trading off latency and consistency. Conflicts are resolved based on timestamps (the newest record is kept). The database operates in master-master mode [ 52 ], where no node differs from another, and combines disk persistence with in-memory caching of results, yielding high write throughput [ 52 , 56 ]. The master-master architecture makes horizontal scaling straightforward [ 56 ]. There are several different partitioning techniques, and replication can be automatically managed by the database [ 56 ].
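Cassandra's timestamp-based conflict resolution can be illustrated with a minimal sketch (hypothetical Python, not Cassandra's actual implementation): each replica's candidate value carries a client-supplied timestamp, and the newest one wins.

```python
# Illustrative sketch of last-write-wins conflict resolution: among the
# (value, timestamp) pairs held by different replicas, keep the newest.
# This mirrors Cassandra's policy of resolving conflicts by timestamp.

def resolve(replica_values):
    """Return the winning (value, timestamp) pair among replicas."""
    return max(replica_values, key=lambda vt: vt[1])

# Two replicas hold diverging values for the same column:
winner = resolve([("alice@old.example", 100), ("alice@new.example", 250)])
# winner is the pair carrying the larger timestamp
```

Note that this policy silently discards the older write, which is precisely why the client-supplied timestamps mentioned above matter: skewed clocks can cause a logically newer value to lose.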

Apache CouchDB is another open-source project, written in Erlang, and following a document-oriented approach [ 10 ]. Documents are written in JSON and are meant to be accessed with CouchDB’s specific implementation of MapReduce views written in Javascript.

This database uses a B-tree index [ 10 ], updated during data modifications. These modifications have ACID properties at the document level, and the use of MVCC (Multi-Version Concurrency Control) means that readers never block [ 10 ]. CouchDB’s document manipulation uses optimistic locks, updating an append-only B-tree for data storage, which means that data files must be periodically compacted. This compaction, in spite of maintaining availability, may hinder performance [ 10 ].
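This optimistic-locking behaviour can be sketched with a toy in-process model (hypothetical names, not CouchDB's actual REST API): an update must name the revision it read, and a stale revision is rejected immediately instead of blocking the writer.

```python
# Toy sketch of MVCC-style optimistic locking as used by CouchDB:
# every update names the revision it is based on; if another writer
# got there first, the update fails with a conflict rather than block.

class RevisionConflict(Exception):
    pass

class DocStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (rev, body)

    def put(self, doc_id, body, expected_rev=0):
        current_rev, _ = self._docs.get(doc_id, (0, None))
        if expected_rev != current_rev:
            raise RevisionConflict(f"stale revision {expected_rev}, now at {current_rev}")
        self._docs[doc_id] = (current_rev + 1, body)
        return current_rev + 1

    def get(self, doc_id):
        return self._docs[doc_id]

store = DocStore()
rev1 = store.put("user:1", {"name": "Ada"})            # creates rev 1
rev2 = store.put("user:1", {"name": "Ada L."}, rev1)   # update based on rev 1
```

A writer that retries with `rev1` after `rev2` exists gets a conflict, which the application resolves and resubmits, mirroring the conflict handling described above.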

Regarding fault-tolerant replication mechanisms [ 57 ], CouchDB supports both master-slave and master-master replication that can be used between different instances of CouchDB or on a single instance. Scaling in CouchDB is achieved by replicating data, a process which is performed asynchronously. It does not natively support sharding/partitioning [ 10 ]. Consistency is guaranteed in the form of strengthened eventual consistency [ 10 ], and conflict resolution is performed by selecting the most up to date version (the application layer can later try to merge conflicting changes, if possible, back into the document). CouchDB’s programming interface is REST-based [ 10 , 57 ]. Ideally, CouchDB should be able to fit the whole dataset into the RAM of the cluster, as it is primarily a RAM-based database.

Couchbase is a combination of Membase (a key-value system with memcached compatibility) and CouchDB. It can be used in key-value fashion, but is considered a document store working with JSON documents (similarly to CouchDB) [ 10 ].

Documents in Couchbase have an intrinsic unique id and are stored in so-called data buckets. Like CouchDB, queries are built using MapReduce views in Javascript. The optimistic locking associated with an append-only B-tree is also implemented as in CouchDB. The default consistency level is eventual consistency (due to MapReduce views being constructed asynchronously), although there is also the option of specifying that data should be indexed immediately [ 10 ].

A major difference between Couchbase and CouchDB regards sharding [ 10 ]. Whereas CouchDB does not natively support sharding (there are projects, such as CouchDB Lounge [ 10 ], which enable this), Couchbase comes with transparent sharding off-the-shelf, with application transparency. Replication is another major point of difference between the two databases, as Couchbase supports both intracluster and intercluster replication. The former is performed within a cluster, guaranteeing immediate consistency. The latter ensures eventual consistency and is performed asynchronously between geographically distributed clusters (conflict resolution is performed in the same way as in CouchDB). This database is mostly intended to run in-memory, so as to hold the whole dataset in RAM [ 10 , 29 ].

HBase is an open-source database written in Java and developed by the Apache Software Foundation. It is intended to be the open-source implementation of the Google BigTable principles, and relies on the Apache Hadoop Framework and the Apache ZooKeeper projects. It is, therefore, a column-store database [ 10 ].

HBase’s architecture is highly inspired by Google’s BigTable [ 3 , 10 ], and, thus, their capabilities are similar. There are, however, certain differences. The Hadoop Distributed File System (HDFS) is used for distributed storage in place of the Google File System, although other storage backends can be used. HBase also supports several master servers to improve system reliability, but does not support the concept of locality. Similarly to Google BigTable, it does not support full ACID semantics, although several properties are guaranteed [ 58 ]. Atomicity is guaranteed within a row, and consistency ensures that no rows result from interleaved operations (i.e. the row must have effectively existed at some point in time). Still on the topic of consistency, rows are guaranteed to only move forward in time, never backward, and scans do not exhibit snapshot isolation, but, rather, the “read committed” isolation level. Durability is established in the sense that all data which is read has already been made durable (i.e. persisted to disk), and that all operations returning success have ensured this durability property. This can be configured so that data is only periodically flushed to disk [ 58 ]. HBase does not support secondary indexes, meaning that data can only be queried by the primary key or by a table scan. It is also worth noting that data in HBase is untyped (everything is a byte array) [ 52 ]. Regarding the programming interface, HBase can be accessed through a Java API, a REST interface, and the Avro and Thrift protocols [ 10 ].

MongoDB is an open-source document-oriented database written in C++ and developed by the 10gen company. It uses JSON (data is stored and transferred in a binary, more compact form named BSON), allowing for a schemaless data model where the only requirement is that an id is always present [ 10 , 56 ].

MongoDB’s horizontal scalability is mainly provided through the use of automatic sharding [ 56 ]. Replication is also supported using locks and the asynchronous master-slave model, meaning that writes are only processed by the master node and reads can be made from both the master node and from one of the slave nodes. Writes are propagated to the slave nodes by reading from the master’s oplog (operation log) [ 56 ]. Database clients can choose the kind of consistency models they wish, by defining whether reads from secondary nodes are allowed and from how many nodes the confirmation must be obtained.
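This asynchronous oplog-based replication can be illustrated with a toy in-process model (the names are illustrative, not MongoDB's actual API): the master appends each write to its operation log, and a slave replays entries past its own last applied position.

```python
# Toy sketch of MongoDB-style asynchronous master-slave replication:
# the master appends operations to an oplog, and each slave replays
# the entries it has not yet applied. Between syncs, slaves lag behind.

class Master:
    def __init__(self):
        self.data, self.oplog = {}, []

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))

class Slave:
    def __init__(self):
        self.data, self.applied = {}, 0

    def sync(self, master):
        # Replay only the oplog entries not yet applied locally.
        for key, value in master.oplog[self.applied:]:
            self.data[key] = value
        self.applied = len(master.oplog)

master, slave = Master(), Slave()
master.write("balance", 100)
slave.sync(master)           # reads from the slave now see the write
master.write("balance", 90)  # until the next sync, the slave lags behind
```

Reading `balance` from the slave at this point returns the stale value 100, which is exactly the consistency trade-off clients accept when they allow reads from secondary nodes.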

Document manipulation is a strong focus of MongoDB, as the database provides different frameworks (e.g. MapReduce and Aggregation Framework) and ways of interacting with documents [ 10 ]. These can be queried, sorted, projected, iterated with cursors, aggregated, among other operations. The changes to a document are guaranteed to be atomic. Indexing can be used on one or several fields (implemented using B-trees), with the possibility of using two-dimensional spatial indexes for geometry-based data [ 10 ]. There are many different programming interfaces supported by MongoDB, with most popular programming languages having native bindings. A REST interface is also supported [ 10 ].

Project Voldemort is an open-source key-value store implemented in Java which presents itself as an open-source implementation of the Amazon Dynamo database [ 10 , 14 , 59 , 60 ]. It supports scalar values, lists and records with named fields associated with a single key. Arbitrary fields can be used if they are serializable [ 10 ].

Operations on the data are simple and limited: there are put , get and delete commands [ 10 , 60 ]. In this sense, Voldemort can be considered (as the developers themselves put it), “basically just a big, distributed, persistent, fault-tolerant hash table” [ 59 ]. For data modification, the MVCC mechanism is used [ 10 ].

Replication is supported using the consistent hashing method (proposed in [ 61 ]) [ 10 , 60 ]. Sharding is implemented transparently, with support for adding and removing nodes in real-time (although this feature was not always easily available [ 62 ]). Data is meant to stay in RAM as much as possible, with persistent storage backed by several mechanisms, such as Berkeley DB [ 60 ]. Voldemort uses a Java API [ 52 ].
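Consistent hashing, the partitioning scheme behind Voldemort and other Dynamo-style stores, can be sketched as follows (illustrative Python with hypothetical names; real implementations add replication and rebalancing): nodes and keys hash onto a ring, and a key is owned by the first node clockwise from its position, so adding or removing a node only remaps the keys in one arc of the ring.

```python
import bisect
import hashlib

# Sketch of a consistent hashing ring: each node is placed at several
# positions ("virtual nodes") on a hash ring, and a key belongs to the
# first node found clockwise from the key's own hash position.

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Virtual nodes smooth out the key distribution across nodes.
        self._ring = sorted(
            (_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    def owner(self, key):
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
node = ring.owner("user:42")  # deterministic owner for this key
```

Because only the arc between a departed node and its predecessor changes hands, node topology changes disturb far fewer keys than naive `hash(key) % n` partitioning would.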

Table 1 summarizes the characteristics of the studied NoSQL databases, similar to the work seen in [ 1 , 11 , 17 , 49 , 63 ], but providing a broader and more up-to-date view of these characteristics. Its information is derived from the previous sections and additional relevant sources ([ 12 , 64 – 71 ]). Each NoSQL database is described according to key characteristics: category (e.g. Key-Value database), positioning in the context of the CAP theorem, consistency guarantees and configurability, durability guarantees and configurability, querying possibilities and mechanisms (i.e. how are queries made and how complex can they be?), concurrency control mechanisms, partitioning schemes and the existence of native partitioning. It should be noted that, as previously discussed, modern NoSQL databases often allow fine-tuning of consistency and availability properties on a per-query basis, making the CAP-based classification (“AP”, “CP”, etc.) overly simplistic [ 41 , 44 – 46 ].

Software quality attributes

In the previous section we identified and described several NoSQL databases. In this section, we survey the literature on NoSQL databases to find how each of them satisfies the software quality attributes that we selected. Each subsection explores the NoSQL literature on a given quality attribute, drawing conclusions regarding all of the evaluated NoSQL databases. This information is then summarized in the following section (Section ‘ Results and discussion ’), where a table is provided to aid software architects and engineers in their decision process.

Availability

Availability concerns what percentage of time a system is operating correctly [ 1 ]. NoSQL technology is inherently bound to provide availability more easily than relational systems. In fact, given the existence of Brewer’s CAP theorem [ 40 ], and the presence of failures in real-world systems (whether they are related to the network or to an application crash), NoSQL databases oppose most relational databases by favoring availability instead of consistency. Thus, one can assert that the higher the availability of a NoSQL system, the less likely it is that it provides high consistency guarantees. Several NoSQL databases provide ways to tune the trade-off between consistency and availability, including Dynamo [ 14 ], Cassandra, CouchDB and MongoDB [ 9 ].

Apache CouchDB uses a shared-nothing clustering approach, allowing all replica nodes to continue working even if they are disconnected, thus being a good candidate for systems where high availability is needed [ 9 ]. It is worth noting, however, that this database periodically requires a compaction step which may hinder system performance, but which does not affect the availability of its nodes under normal operation [ 3 ].

In 2013, [ 27 ] tested several NoSQL Databases (Aerospike, Cassandra, Couchbase and MongoDB) concerning their failover characteristics. Their results showed that Aerospike had the lowest downtime, followed by Cassandra, with MongoDB having the least favorable downtime. One should note that the results shown in the paper are limited to RAM-only datasets and hence might not be the best source for real-world scenarios. MongoDB’s results are also not surprising, as even though it allows for fine-tuning (to adjust the consistency-availability trade-offs), several tests have shown that it is not the best choice for a highly available system, in particular due to overhead when nodes are rejoining the system (see, for instance, [ 1 , 9 ] and our section on reliability). Lastly, [ 5 ] tested several NoSQL databases on the Cloud and noted that Riak could not provide high-availability under very high loads.

Thus, there is no obvious candidate for a highly available system, but there are several competing solutions, in particular when coupled with systems such as Memcached [ 2 ]. The specific architecture employed (number of replicas, consistency options, etc.) will play a major role, as pointed out by several authors [ 27 , 72 ]. Furthermore, the popular MongoDB and Riak databases seem less likely to be good picks for this use case scenario.

Consistency

Consistency is related to transactions and, although not universally defined, can be seen as the guarantee that transactions started in the future see the effects of transactions committed in the past, coupled with the enforcement of database constraints [ 73 – 75 ]. It is useful to recall that, in the context of the CAP theorem [ 40 , 41 ], consistency is often seen as the premise that all nodes see the same data at the same time [ 42 ] (i.e., the CAP version of consistency is merely a subset of the ACID version of the same property [ 41 ]). We have previously seen that consistency and availability are highly related properties of NoSQL systems.

Cassandra offers several different consistency guarantees [ 76 ]. The database allows for tunable consistency at both the read and the write level, even with near-ACID semantics if the consistency level “ALL” is picked. MongoDB, in spite of being generally regarded as a CP system, offers similar consistency options [ 76 , 77 ]. Couchbase offers strong consistency guarantees for document access, whereas query access is eventually consistent [ 67 ]. HBase offers no fine-grained consistency tuning [ 58 ]: there is only the binary choice between strong and eventual consistency. CouchDB, being an AP system, fully relies on eventual consistency [ 78 ]. The Voldemort project puts more stress on application logic to deal with inconsistencies in data, by using read repair [ 60 ].
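The quorum arithmetic behind this per-operation tuning can be sketched in a few lines (a back-of-envelope illustration, not any database's API): with N replicas, a write acknowledged by W nodes and a read consulting R nodes are guaranteed to overlap whenever R + W > N.

```python
# Quorum overlap rule used by tunable-consistency stores such as
# Cassandra: with n replicas, a read quorum of r and a write quorum of
# w, r + w > n guarantees every read sees the latest acknowledged write.

def is_strongly_consistent(n, r, w):
    return r + w > n

N = 3
assert is_strongly_consistent(N, r=2, w=2)      # QUORUM reads and writes
assert not is_strongly_consistent(N, r=1, w=1)  # ONE/ONE: eventual only
assert is_strongly_consistent(N, r=1, w=3)      # write ALL, read ONE
```

Lowering r or w below the quorum threshold trades this guarantee for lower latency, which is exactly the consistency-versus-latency trade-off the databases above expose per operation.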

Regarding concrete experiments, not much has been done to study consistency as a property in itself. Recent work by Bermbach et al. [ 76 ] proposed a general architecture for consistency benchmarking. The authors test their proposal on Cassandra and MongoDB, concluding that MongoDB performed better, but also noting that they are merely proposing an architecture and that their tests might have been negatively impacted by their testing environment. The authors of [ 54 ] study Cassandra and Couchbase in a real world microblogging scenario, concluding that Couchbase provided consistent results faster (i.e. the same value took less time to reach all node replicas). In [ 79 ], the authors study Amazon S3’s consistency behavior and conclude that it frequently violates the monotonic read consistency property (other related work is presented by Bermbach et al. [ 76 ]). It seems that a general framework for testing consistency might provide more in-depth answers regarding the effectiveness of the consistency trade-offs and techniques provided by each NoSQL database.

In summary, as the NoSQL ecosystem matures, there is a tendency towards micromanagement of consistency and availability [ 41 ], with some solutions opting to provide consistency (withholding availability), others providing availability (withholding consistency), and another set, such as Cassandra and MongoDB, allowing for fine-tuning on a per-query basis.

Durability

Durability refers to the requirement that data be valid and committed to disk after a successful transaction [ 1 ]. As we have previously covered, NoSQL databases act on the premise that consistency need not be fully enforced in the real world, preferring to sacrifice it in adjustable ways to achieve higher availability and partition tolerance. This impacts durability: if a system suffers from consistency problems, its durability is also at risk, leading to potential data loss [ 26 ].

In [ 26 ], the authors test Aerospike, Couchbase, Cassandra and MongoDB in a series of tests regarding durability and performance trade-offs. Their results featured Aerospike as the fastest database, by a factor of 5-10, when the databases were set to synchronous replication. However, most scenarios rely not on synchronous but on asynchronous replication (meaning that changes are not instantly propagated among nodes). In that regard, the same authors, who in [ 27 ] studied the same databases in the context of failover characteristics, show that MongoDB loses the least data upon node failure when asynchronous replication is used. Cassandra is the runner-up to MongoDB, losing about 100 times more data, while Aerospike and Couchbase both lose very large amounts of data. In [ 1 ], MongoDB is found to have issues with data loss when compared to CouchDB, in particular during recovery after a crash. In the same paper, the authors highlight that CouchDB’s immutable, append-only B+ tree ensures that files are always in a valid state. CouchDB’s durability is also noted and justified by the authors of [ 2 ]. It should be noted that document-based systems such as MongoDB usually use a single-versioning system, which is designed specifically to target durability [ 49 ]. HBase’s reliance on Hadoop means that it is inherently durable in the way requests are processed, as several authors have noted [ 80 – 82 ]. Voldemort’s continuing operation as the backend of LinkedIn’s service is backed by strong durability [ 83 ], although there is a lack of studies focusing specifically on Voldemort’s durability.

In conclusion, as with other properties, the durability of NoSQL systems can be fine-tuned according to specific needs. However, databases based on immutability, such as CouchDB, are good picks for a system with good durability due to their inherent properties [ 1 ]. Furthermore, single-version databases, such as MongoDB, should also be the focus of those interested in durability advantages.

Maintainability

Maintainability is a quality attribute that regards the ease with which a product can be maintained, i.e., upgraded, repaired, debugged and adapted to new requirements [ 84 ]. From an intuitive point of view, systems with many components (e.g. several nodes) should add complexity and hinder maintainability, a view that several authors share [ 7 , 85 ]. On the other hand, as some have hypothesized, the benefits of thoughtful modularity and task division make the case for a more maintainable system [ 86 ]. Assessing maintainability is a difficult problem which has seen vast amounts of research throughout the years, but that research has seldom focused explicitly on the database itself (in particular due to the widespread usage of the relational model with similar database interfaces).

In spite of the perceived difficulty in assessing the maintainability of NoSQL systems, there has been some research on the subject. Dzhakishev [ 50 ] studied the usability and maintainability of several NoSQL solutions in a real enterprise scenario. The author notes how MongoDB and Neo4j have “great shell applications”, easing maintenance, and that Neo4j even has a web interface (other NoSQL databases have similar software, e.g. Couchbase Server). The authors of [ 87 ] study social network system implementation processes and rely on their own application-specific code to ensure the maintainability of their application. They claim that versioning the schema using Subversion serves their goals well. Throughout their work, maintainability seems to be moved more into the application layer and less into the database layer, possibly suggesting that NoSQL maintainability must be achieved with the help of the developer. In [ 29 ], another real world experiment, the authors note the added maintainability difficulties in using HBase, Couchbase, Cassandra and CouchDB to replace their RDF data system. Similarly, Han [ 88 ] also faced maintainability problems with MongoDB when comparing it with relational alternatives. Although no particular study in the literature has focused on the maintainability of Voldemort, from the point of view of ease of use this database seems harder to configure (in particular in terms of node topology changes) than others [ 62 ].

It seems that most NoSQL systems offer limited maintainability when compared with traditional RDBMSs, but the literature has little to say as to which is the most maintainable. Some authors [50, 87] point to the ease of use of web interfaces and the readiness of tools. In that sense, Couchbase and Neo4j are prominent examples of databases that are easy to set up and use. On the other hand, MongoDB and HBase are known to be hard to install [89, 90] or to confuse first-time users, limiting their ease of use. Further research can and should be developed in this area so as to truly compare the maintainability of NoSQL solutions.

Performance

When it comes to the performance and execution of different types of operations, NoSQL databases fall mostly into two categories: read-optimized and write-optimized [21, 91]. That is, largely regardless of the system type and the records stored, each database carries an optimization granted by the mechanisms it uses for the storage, organization and retrieval of data. For example, Cassandra is known for being highly optimized for writes (inserts) and is not able to show the same performance during reads [21, 91]. The database achieves this by buffering write operations: updates are immediately appended to a log file, then cached in memory and only later written to disk, making the insertion process itself faster. In general, Column Store and Key-Value databases use more memory to store their data, some of them being completely in-memory (and, hence, at odds with other attributes such as durability).
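
As a concrete illustration of this write-optimized design, the following toy sketch (purely illustrative, not Cassandra's actual implementation) mimics the commit-log/memtable/flush pattern described above:

```python
import json

class LSMWriteSketch:
    """Toy sketch of a log-structured write path: appends go to a
    commit log and an in-memory table; disk writes are deferred."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []    # sequential, append-only (fast)
        self.memtable = {}      # recent writes, kept in memory
        self.sstables = []      # immutable, sorted "on-disk" segments
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Append to the commit log for durability (sequential I/O only).
        self.commit_log.append(json.dumps({"k": key, "v": value}))
        # 2. Update the in-memory table; no random disk I/O on the write path.
        self.memtable[key] = value
        # 3. Flush to an immutable sorted segment only when the memtable fills.
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def read(self, key):
        # Reads are costlier: check the memtable, then every segment,
        # newest first (real systems add Bloom filters and compaction).
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.sstables):
            for k, v in segment:
                if k == key:
                    return v
        return None
```

Writes touch only an append-only log and memory, while reads may have to scan several segments; real systems mitigate this with Bloom filters and compaction, but the asymmetry is the root of the read/write trade-off discussed here.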

Document Stores, on the other hand, are considered more read-optimized. This behavior resembles that of relational databases, where data loading and organization are slower, with the advantage of better preparing the system for future reads. Examples of this are MongoDB and Couchbase. If one compares most Column Store databases, such as Cassandra, to the document-based NoSQL landscape with regard to read performance, the latter wins. This has been seen in numerous works, such as [51, 54] and [26]. We should also consider that databases such as MongoDB and Couchbase are positioned as enterprise solutions, with a set of mechanisms and functionality beyond the traditional key-value retrieval used by Key-Value stores and Column Store databases alike. This impacts performance significantly, as additional functionality is usually associated with high performance costs.
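
The extra organization work that read-optimized stores perform at write time can be illustrated with a toy secondary index (purely illustrative; real document stores use far more sophisticated structures such as B-trees):

```python
class IndexedDocStore:
    """Toy document store: maintains a field index at write time, so
    writes do extra work but queries avoid scanning every document."""

    def __init__(self, indexed_field):
        self.docs = {}
        self.indexed_field = indexed_field
        self.index = {}  # field value -> set of doc ids

    def put(self, doc_id, doc):
        # Extra organization work on the write path: keep the index current.
        old = self.docs.get(doc_id)
        if old is not None:
            self.index.get(old.get(self.indexed_field), set()).discard(doc_id)
        self.docs[doc_id] = doc
        self.index.setdefault(doc.get(self.indexed_field), set()).add(doc_id)

    def find(self, value):
        # Read path: a single index lookup instead of a full scan.
        return [self.docs[i] for i in self.index.get(value, set())]
```

The write path pays for index maintenance on every update, which is exactly the cost that prepares the system for cheap future reads.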

Much work has been done on performance testing of databases. Since NoSQL is constantly changing, past evaluations quickly become obsolete, but recent evaluations have been performed, some of which we now enumerate. In [51], a performance overview is given for Cassandra, HBase, MongoDB, OrientDB and Redis. The conclusions are that Redis is particularly well suited for all kinds of workloads (although this result should be taken lightly, since the database trades off many other quality attributes), that Cassandra performs very well in write/update scenarios, that OrientDB performs poorly overall in this scenario, and that HBase deals poorly with update queries. In [33], MongoDB and CouchDB are tested in a medical archiving scenario, with MongoDB showing better performance. In [92], MongoDB is shown to perform poorly for bulk CRUD (create, read, update and delete) operations when compared with PostgreSQL. Regarding write-heavy scenarios, a real-world enterprise scenario is presented in [35], where Cassandra, Couchbase and MongoDB are compared with MS SQL Server. In a four-node environment, Cassandra greatly outperforms the NoSQL competition (as expected, since it is a write-optimized database), but is outperformed by a single-node MS SQL Server instance. Less recent, but also relevant, is the work presented in [26, 27], where Cassandra, Couchbase, MongoDB and Aerospike are tested. Aerospike is shown to have the best performance, with Cassandra coming second in read throughput and Couchbase second in write throughput. Rabl et al. [25] compared Voldemort, Redis, HBase, Cassandra, MySQL Cluster and VoltDB with regard to throughput, scalability and disk usage, and noted that while Cassandra's throughput dominated most tests, Voldemort exhibits a good balance between read and write performance, competing with the other databases. The authors also note that VoltDB had the worst performance and that HBase's throughput is low (although it scales better than most other databases).

In conclusion, performance depends heavily on the database architecture. Column Store databases, such as Cassandra, are usually oriented towards write operations, whereas document-based databases are more read-oriented. The latter group is also generally more feature-rich, bearing more resemblance to the traditional relational model, and thus tends to carry a bigger performance penalty. Experiments have validated this theory, and we can conclude that, in contrast with some of the other quality attributes studied in this article, performance is definitely not lacking in terms of research and evaluation.

Reliability

Reliability concerns the system’s probability of operating without failures for a given period of time [49]. The higher the reliability, the less likely it is that the system fails. Recently, Domaschka et al. [49] proposed a taxonomy for describing distributed databases with regard to their reliability and availability. Since reliability is significantly harder to define than availability (it depends on the context of the application requirements), the authors suggest that software architects consider the following two questions: “(1) How are concurrent writes to the same item resolved?; (2) What is the consistency experienced by clients?”. With these in mind, and by using their taxonomy, we can see that systems which use single-version techniques, such as Redis, Couchbase, MongoDB and Neo4j, all perform online write-conflict resolution, making them good picks for a reliable system in the sense that they answer question (1) with reliable options. Regarding question (2), MongoDB, CouchDB, Neo4j, Cassandra and HBase all provide strong consistency guarantees. Thus, in order to achieve strong consistency guarantees and good concurrent write-conflict resolution, as proposed by the authors, one should look at the systems which have both characteristics: MongoDB and Neo4j.

In conclusion, in spite of reliability being an important quality attribute, we have found little focus on this topic in the current literature and are, therefore, limited in our answers to this research question.

Robustness

Robustness concerns the ability of the database to cope with errors during execution [93]. Relational technology is known for its robustness, but many questions still arise when the topic is discussed in the context of NoSQL [4]. If, from one point of view, one might consider NoSQL databases more robust due to their replication (i.e. crashes are “faded out” by appropriate replication and consensus algorithms [94]), from another, lack of code maturity and extensive testing might make NoSQL less robust in general [4, 12]. Little [12, 95] has been written on this subject, although there have been some real-world studies where the impact of NoSQL on a system's robustness was considered (even if only indirectly). In [88], Han experiments with MongoDB as a possible replacement for a traditional RDBMS in an air quality monitoring scenario. With regard to robustness, the author notes that as cluster scale and workloads increase, robustness becomes a more pressing issue (i.e. problems become more evident). Ranjan [95] studies Big Data platforms and notes that lack of robustness is an open question in Big Data scheduling platforms and, in particular, in the NoSQL (Hadoop) case. In 2011, the authors of [12] postulated that robustness would be an issue for NoSQL, as the technology was new and needed testing. Neo4j is seen by some as a robust graph-based database [96, 97]. Lior et al. [98] reviewed security issues in NoSQL databases and found that Cassandra and MongoDB are subject to denial-of-service attacks, which can be read as a lack of robustness. Similarly, Manoj [77] presents a comparative table of features for Cassandra, MongoDB and HBase, where HBase is identified as having an intrinsic single point of failure that needs to be overcome by explicitly using failover clustering.
Lastly, in [ 87 ], the authors claim Cassandra is robust due to Facebook’s contribution to its development and the fact that it is used as one of the backends of the social network.

Overall, not much can be concluded for each individual database in terms of robustness. A robustness benchmark is currently lacking in the NoSQL ecosystem, and software engineers looking for the most robust database would benefit from research in this area. The most up-to-date information and research indicates that the more popular and widely used databases are more robust, although in general these systems are seen as less robust than their relational counterparts when tested in practice.

Scalability

Scalability concerns a system’s ability to deal with increasing workloads [ 1 ]. In the context of databases, it may be defined as the change in performance when new nodes are added, or hardware is improved [ 99 ]. NoSQL databases have been developed specifically to target scenarios where scalability is very important. These systems rely on horizontal and “elastic” scalability, by adding more nodes to a system instead of upgrading hardware [ 3 , 4 , 9 ]. The term “elastic” refers to elasticity, which is a characterization of the way a cluster reacts to the addition or removal of nodes [ 99 ].
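
The "elastic" addition of nodes is commonly built on consistent hashing, which several of these systems (e.g. Cassandra, Voldemort) use in some form. The toy sketch below (illustrative only; node names and key counts are arbitrary) shows why adding a node relocates only a fraction of the keys rather than forcing a global reshuffle:

```python
import bisect
import hashlib

def _h(s):
    # Stable 64-bit hash for placing nodes and keys on the ring.
    return int(hashlib.md5(s.encode()).hexdigest()[:16], 16)

class HashRing:
    """Toy consistent-hash ring: each key belongs to the first node
    clockwise from its hash, so adding a node claims only the keys
    between it and its predecessor."""

    def __init__(self, nodes=()):
        self.ring = sorted((_h(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self.ring, (_h(node), node))

    def node_for(self, key):
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect(hashes, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")  # elastic growth: no global reshuffle
after = {k: ring.node_for(k) for k in keys}
# Only the keys that fall into the new node's arc change owner
# (on average 1/4 of them with 4 nodes; variance is high without
# virtual nodes, which production systems use).
moved = sum(before[k] != after[k] for k in keys)
```

Every relocated key moves to the new node and nowhere else, which is what makes live node addition cheap compared with naive modulo-based partitioning, where almost all keys would move.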

In [18], the authors compared Cassandra and HBase, improving upon previous work. They concluded that both databases scale linearly, with different read and write performance. They also provided a more in-depth analysis of Cassandra's scalability, noticing how horizontal scaling with this platform leads to fewer performance problems than vertical scaling.

In [99], the authors measure the elasticity and scalability of Cassandra, HBase and MongoDB. They were surprised to identify “superlinear speedups for clusters of size 24” when using Cassandra, stating that “it is almost as if Cassandra uses better algorithms for large cluster sizes”. For clusters of sizes 6 and 12, their results show HBase to be the fastest competitor, with stable performance. Regarding elasticity, they found that HBase gives the best results, stabilizing significantly faster than Cassandra and MongoDB.

Rabl et al. [25] studied the scalability (and other attributes) of Cassandra, Voldemort, HBase, VoltDB, Redis and MySQL Cluster. They noted the linear scalability of Cassandra, HBase and Voldemort, observing, however, that Cassandra's latency was “peculiarly high” while Voldemort's was stable. HBase, while the worst of these databases in terms of throughput, scaled better than the rest. Regarding the scalability capabilities of the databases themselves, Cassandra, HBase and Riak all support the addition of machines during live operation. Key-value databases, such as Aerospike and Voldemort, are also easier to scale, as the data model allows for better distribution of data across several nodes [12]. In particular, Voldemort was designed to be highly scalable, serving as a major backend at LinkedIn.

Further studies regarding scalability are needed. It is clear that NoSQL databases are scalable, but the question of which scales furthest, or with the best performance, remains unanswered. Nevertheless, we can conclude that popular choices for highly scalable systems are Cassandra and HBase. One must also note that scalability is influenced by the particular choice of configuration parameters.

Stabilization Time and Recovery Time

Besides availability, there are other failover characteristics which determine the behavior of a system and might impact system stability. In the study made in [27], which we have already covered, the authors measure the time it takes for several NoSQL systems to recover from a node failure (the recovery time), as well as the time it takes for the system to stabilize when that node rejoins the cluster (the stabilization time). They find that MongoDB has the best recovery time, followed by Aerospike (when in synchronous change-propagation mode), with Couchbase an order of magnitude slower and Cassandra two orders of magnitude slower than MongoDB. Regarding the time to stabilize on node re-entry, all systems perform well (< 1 ms), with the exception of MongoDB and Aerospike. The former takes a full 31 seconds to stabilize on node re-entry, while Aerospike, in synchronous mode, takes 3 seconds. These results indicate that MongoDB and Aerospike are good picks if one is looking for good recovery times, but these choices should be made with care, so that when a node re-enters the system it does not affect stability.
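
A measurement of this kind can be reproduced with a simple polling harness; the sketch below (illustrative, with a simulated health check standing in for real database operations) captures how recovery time is typically defined in such studies:

```python
import time

def measure_recovery_time(is_healthy, timeout=60.0, poll_interval=0.01):
    """Poll a health predicate; return the seconds elapsed until it
    first succeeds (the recovery time), or None if the timeout is hit.
    Stabilization time can be measured the same way, with a predicate
    that checks latency has returned to its pre-failure baseline."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_healthy():
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None

# Simulated cluster that becomes reachable again 50 ms after a failure;
# a real harness would issue reads/writes against the database instead.
failed_at = time.monotonic()
recovery = measure_recovery_time(lambda: time.monotonic() - failed_at > 0.05,
                                 timeout=2.0)
```

The poll interval bounds the measurement resolution, which is one reason published recovery-time figures should be compared only when the measurement methodology is the same.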

Overall, the topic of failover is highly dependent on configuration and desired properties and should be studied more thoroughly (we note this as part of our future work). The current literature is limited and does not allow for very general and broad conclusions.

Results and discussion

We used the criteria described in each of the previous sections to grade the databases. Regarding availability, downtime was used as the primary measure, together with relevant studies [5, 27]. Consistency was graded according to two essential criteria: 1) how far the database can provide ACID-style consistency and 2) how much consistency can be fine-tuned. Durability was measured according to the use of single- or multi-version concurrency control schemes, the way data are persisted to disk (e.g. if data is always asynchronously persisted, this hinders durability), and studies that specifically targeted durability [26]. Regarding maintainability, the criteria were the currently available literature on real-world experiments, the ease of setup and use, and the accessibility of tools to interact with the database. For read and write performance, we considered recent studies [27] and the fine-tuning options of each database, as noted in the previous sections. Reliability is graded according to the taxonomy presented in [49] and by looking at synchronous propagation modes (databases which do not support them tend to be less reliable, as Domaschka et al. note). Database robustness was assessed via the real-world experiments carried out by researchers, as well as the available documentation on the tendency of databases to have problems dealing with crashes or attacks (e.g. being subject to DoS attacks). With respect to scalability, we looked at each database's elasticity, its increase in performance due to horizontal scaling, and the ease of on-line scalability (i.e. is the live addition of nodes supported?). For recovery time and stabilization time, highly related to availability, we based our classification on the results shown in [27] (implying that our grading of these attributes is mostly limited to that particular study and should be taken with appropriate care). We looked at the databases described in Section ‘Evaluated NoSQL databases’.

By analyzing Table 2, we can see that Aerospike suffers from data-loss issues, affecting its durability, and also has issues with stabilization time (in particular in synchronous mode). Cassandra is a multi-purpose database (in particular due to its configurable consistency properties) which mostly lacks read performance (since it is tuned for write-heavy workloads). CouchDB provides similar technology to MongoDB, but is better suited for situations where availability is needed. Couchbase provides good availability capabilities (coupled with good recovery and stabilization times), making it a good candidate for situations where failover is bound to happen. HBase has similar capabilities to Cassandra, but is unable to cope with high loads, limiting its availability in these scenarios, and is also the worst database in terms of robustness (mostly due to the research seen in [77, 95]). MongoDB is the database that most resembles the classical relational use case: it is better suited for reliable, durable, consistent and read-oriented scenarios. It is somewhat lacking in terms of availability, scalability and write performance, and it is very much hindered by its stabilization time (which is also one of the reasons for its low availability). Lastly, we lack some information on Voldemort, but find it to be a poor pick in terms of maintainability. It is, however, a good pick for durable, write-heavy scenarios, and provides a good balance between read and write performance (in line with [25]). We should highlight that there are more quality attributes that deserve attention, which we intend to address in future work; thus, the table does not intend to show that “one database is better than another”, but, rather, that a given database is better for a particular use case where these attributes are needed.

Many software quality attributes are highly interdependent. For example, availability and consistency, in the context of the CAP theorem, are often polarized. Similarly, availability, stabilization time and recovery time are highly related, since low stabilization and recovery time are bound to hinder availability. With this in mind, there are several interesting findings in the summary table we have presented.

Availability, stabilization time and recovery time, as mentioned, are highly related software quality attributes. In this sense, it is interesting to note the polarizing results found for different databases. Software engineers focusing on availability should note that although Aerospike, Cassandra, Couchbase, CouchDB and Voldemort all provide high availability, some of these databases are not ideal picks for situations where a fast recovery time is needed. Indeed, only Aerospike and MongoDB have a “Great” recovery-time rating, with Cassandra having the worst possible grade. On the other hand, Aerospike and MongoDB have poor stabilization times, whereas Cassandra has a “Good” rating for this quality attribute. Couchbase, another highly available system, has a “Good” rating in both of these quality attributes, although reaching “Great” in neither. Thus, for systems which require high availability with a balance of stabilization and recovery time, Couchbase is an ideal pick.

It is interesting to note that Aerospike and Cassandra achieve both high availability and high consistency ratings. A naive application of the CAP theorem to distributed systems and NoSQL systems would suggest that these two quality attributes must ultimately be traded off. Nevertheless, as other authors have pointed out [41, 44–46], this is not the case, and our table reflects it. Systems such as Cassandra allow these properties to be traded off on a per-query basis. This, combined with the other characteristics of each database, results in high ratings in both quality attributes.
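
The per-query trade-off offered by systems such as Cassandra can be summarized by the replica-quorum rule of Dynamo-style replication: a read of R replicas is guaranteed to overlap a write acknowledged by W replicas only if R + W > N. A minimal sketch of the rule itself (not any particular database's API):

```python
def reads_are_strongly_consistent(n, r, w):
    """Replica-quorum rule: with N replicas, a read that contacts R
    replicas is guaranteed to see the latest write acknowledged by W
    replicas iff the read and write sets must overlap, i.e. R + W > N."""
    return r + w > n

N = 3
# ONE/ONE favors availability and latency; reads may be stale.
assert not reads_are_strongly_consistent(N, r=1, w=1)
# QUORUM/QUORUM (2 of 3) trades latency for strong consistency.
assert reads_are_strongly_consistent(N, r=2, w=2)
# Writing to ALL replicas allows consistent reads from any single one.
assert reads_are_strongly_consistent(N, r=1, w=3)
```

Because R and W can be chosen per operation, an application can run most queries in the fast, weakly consistent mode and reserve quorum settings for the few operations that need strong guarantees, which is how high ratings in both attributes coexist.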

If one inspects only the availability and scalability quality attributes, it becomes clear that they are highly correlated. Nearly all systems with high availability also have high scalability, the only exception being CouchDB (this database does not support native partitioning, hindering its scalability). In cases where availability is limited, results are somewhat polarized: HBase achieves high scalability, whereas MongoDB is also hindered in terms of scalability (this can easily be traced to MongoDB's locking mechanism; indeed, as we have mentioned, this database is the one most similar to the typical relational use case).

There are other highly correlated quality attributes, which can be surprising. For instance, there is a high correlation between scalability and write performance. When one of these quality attributes tends towards “Great”, the other does too; similarly, when one tends towards “Bad”, so does the other. This result provides insight into how scalability is achieved in many NoSQL systems: write optimizations (particularly found in column-store databases) help achieve scalability, and systems with poor write performance tend to be fairly limited in their scalability. Contrasting with this positive correlation, read performance seems slightly negatively correlated with scalability: databases with high scalability tend to have higher write performance than read performance. This could be because many of these databases rely on partitioning as an efficient way to scale, and partitioning improves write performance (through parallel writes) much more than it does read performance (this would be interesting to study as future work). Consistency and recovery time are also quality attributes that share a high degree of correlation. This result is intuitive, since systems that react quickly to the loss of a node will tend to have fewer conflicts and, thus, fewer consistency problems. Still on the topic of consistency, it shares some similarity with robustness in our table. Indeed, robust systems tend to also be consistent ones (notable exceptions are HBase and MongoDB). This relationship, however, is probably due to the nature of each database rather than any intrinsic connection between consistency and robustness. Finally, reliability and write performance are often in polarized positions (e.g., Aerospike has “Good” write performance but low reliability).

There are some quality attributes for which no “Great” rating has been attributed. Indeed, in terms of durability, maintainability, robustness and stabilization time, no NoSQL system was found to achieve optimal results. This indicates directions of future work for NoSQL databases. Some of these ratings can be explained by the relative infancy of these systems: robustness and maintainability are properties that evolve over time, as systems mature, bugs are found and new functionality is added. These two quality attributes have the worst overall ratings and reveal weaknesses of NoSQL systems.

Another point of interest made clear by the summary table is that while some quality attributes have no “Great” rating, there are others for which no “Bad” rating is given: availability, consistency, maintainability, reliability, scalability, and read and write performance. Some of these, such as consistency and scalability, are in fact found to have generally high ratings. This implies that they are among the key attributes offered by NoSQL databases. It is no surprise, then, that availability, consistency and scalability, the three major reasons for the initial development of NoSQL databases [63], are among these attributes.

Although performance is often considered an isolated quality attribute, read and write performance can differ. This difference is reflected in our table, and it is interesting to analyse the performance quality attribute as a whole with the data presented there. Most NoSQL databases polarize on performance, having high ratings on either read or write performance (Cassandra, Couchbase, HBase and MongoDB), but there are exceptions. Aerospike provides a balance between write and read performance without reaching the “Great” rating in either quality attribute. Voldemort provides high write performance (“Great”) and good read performance (“Good”), while Couchbase offers good write performance (“Good”) and high read performance (“Great”). This implies that software engineers can look to Voldemort, Couchbase and Aerospike as “balanced” systems in terms of performance, with Voldemort and Couchbase tending slightly towards write-specific and read-specific scenarios, respectively.

The only quality attributes with neither “Great” nor “Bad” ratings are durability and maintainability. Indeed, it would make little sense for a NoSQL system to have bad durability, since this is a key attribute of most database solutions. On the other hand, the trade-offs associated with NoSQL often mean that durability must be sacrificed, resulting in no system yet achieving the best durability (a clear area for future work in NoSQL systems).

Conclusions

In this article we described the main characteristics and types of NoSQL technology, while covering different aspects that strongly influence the adoption of these systems. We also presented the state of the art of non-relational technology by describing some of the most relevant studies and performance tests and their conclusions, after surveying a vast number of publications since NoSQL's birth. This state of the art is also intended to give a time-based perspective on the evolution of NoSQL research, highlighting four clearly distinct periods: 1) database type characterization (where NoSQL was in its infancy and researchers tried to categorize databases into different sets); 2) performance evaluations, with the advent of YCSB and a surge in NoSQL popularity; 3) real-world scenarios and criticism of some interpretations of the CAP theorem; and 4) an even bigger focus on applicability and a reinvigorated focus on the validation of benchmarking software.

We concluded that although there have been a variety of studies and evaluations of NoSQL technology, there is still not enough information to verify how well suited each non-relational database is to a specific scenario or system. Moreover, every working system differs from the next, and the required functionality and mechanisms strongly affect the database choice; sometimes it is impossible to clearly state the best database solution. Furthermore, we tried to find the best databases from a quality-attribute perspective, an approach not yet found in the current literature: this is our main contribution. In the future, we expect NoSQL databases to be used more in real enterprise systems, providing more information and user experience from which to conclude the most appropriate use of NoSQL according to each quality attribute and to further improve this initial approach.

As we have seen, NoSQL is still a developing field, with many questions and a shortage of definite answers. Its technology is ever-changing, rendering even recent benchmarks and performance evaluations obsolete. There is also a lack of studies focusing on use-case-oriented scenarios or software engineering quality attributes (we believe ours is the first work on this subject). All of these reasons make it difficult to find the best pick for each of the quality attributes we chose in this work, as well as for others. The summary table we presented makes clear the current need for a broad study of quality attributes in order to better understand the NoSQL ecosystem, and it would be interesting to conduct research in this domain. When more studies, with more consistent results, have been performed, a more thorough survey of the literature can be done, with clearer, more concise results.

Software architects and engineers can look to the summary table presented in this article for help in understanding the wide array of offerings in the NoSQL world from a quality-attribute-based overview. The table also brings to light some hidden or unexpected relationships between quality attributes in the NoSQL world; for instance, scalability is highly related to write performance, but not necessarily to read performance. Additionally, broad sets of highly related quality attributes (e.g. availability, stabilization time and recovery time) can be studied individually, so that the appropriate trade-offs can be selected for a candidate system architecture.

Our literature review allows us to establish future directions for research on a quality-attribute-based approach to NoSQL databases. It is our belief that the development of a framework for assessing most of these quality attributes would greatly benefit the lives of software engineers and architects alike. In particular, research is currently lacking in terms of reliability, robustness, durability and maintainability, with most work in the literature focusing on raw performance. Future work in this area, with the development of such a framework for quality-attribute evaluation, would undoubtedly benefit NoSQL research in the long term.

Orend K (2010) Analysis and Classification of NoSQL Databases and Evaluation of their Ability to Replace an Object-relational Persistence Layer. Dissertation, Technische Universität München.

Leavitt N (2010) Will nosql databases live up to their promise?Computer 43(2): 12–14.

Article   Google Scholar  

Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: A distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2): 4.

Floratou A, Teletia N, DeWitt DJ, Patel JM, Zhang D (2012) Can the elephants handle the nosql onslaught?Proc VLDB Endowment 5(12): 1712–1723.

Lith A, Mattson J (2013) Investigating storage solutions for large data: A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data. Dissertation, Chalmers University of Technology.


Acknowledgement

This research would not have been possible without the support and funding of the FEED - Free Energy Data and iCIS - Intelligent Computing in the Internet Services (CENTRO-07 - ST24 - FEDER - 002003) projects, to which we are extremely grateful.

Author information

Authors and affiliations

CISUC, Department of Informatics Engineering, University of Coimbra, Pólo II – Pinhal de Marrocos, Coimbra, 3030-290, Portugal

João Ricardo Lourenço, Bruno Cabral, Marco Vieira & Jorge Bernardino

Critical Software, Parque Industrial de Taveiro, lote 49, Coimbra, 3045-504, Portugal

Paulo Carreiro

ISEC – Superior Institute of Engineering of Coimbra, Polytechnic Institute of Coimbra, Coimbra, 3030-190, Portugal

Jorge Bernardino

Corresponding author

Correspondence to João Ricardo Lourenço .

Additional information

Competing interests.

The authors declare that they have no competing interests.

Authors’ contributions

JRL surveyed most of the literature. BC, JB and MV helped identify and evaluate the quality attributes, find the appropriate NoSQL databases to study, guide the research, and iteratively review and revise the work. PC provided the initial case study from which this work originally sprouted and helped identify the evaluated quality attributes. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( https://creativecommons.org/licenses/by/4.0 ), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article.

Lourenço, J.R., Cabral, B., Carreiro, P. et al. Choosing the right NoSQL database for the job: a quality attribute evaluation. Journal of Big Data 2, 18 (2015). https://doi.org/10.1186/s40537-015-0025-0

Received : 02 June 2015

Accepted : 27 July 2015

Published : 14 August 2015

DOI : https://doi.org/10.1186/s40537-015-0025-0


Keywords

  • NoSQL databases
  • Document store
  • Software engineering
  • Quality attributes
  • Software architecture

Improving security in NoSQL document databases through model-driven modernization

  • Regular Paper
  • Open access
  • Published: 13 July 2021
  • Volume 63, pages 2209–2230 (2021)

  • Alejandro Maté 1 ,
  • Jesús Peral   ORCID: orcid.org/0000-0003-1537-0218 1 ,
  • Juan Trujillo 1 ,
  • Carlos Blanco 2 ,
  • Diego García-Saiz 2 &
  • Eduardo Fernández-Medina 3  


Abstract

NoSQL technologies have become a common component in many information systems and software applications. These technologies focus on performance, enabling scalable processing of large volumes of structured and unstructured data. Unfortunately, most developments over NoSQL technologies consider security as an afterthought, putting the personal data of individuals at risk and potentially causing severe economic losses as well as reputation crises. To avoid these situations, companies require an approach that introduces security mechanisms into their systems without scrapping already in-place solutions and restarting the design process from scratch. Therefore, in this paper we propose the first modernization approach for introducing security in NoSQL databases, focusing on access control and thereby improving the security of their associated information systems and applications. Our approach analyzes the organization's existing NoSQL solution, using a domain ontology to detect sensitive information and creating a conceptual model of the database. Together with this model, a series of security issues related to access control is listed, allowing database designers to identify the security mechanisms that must be incorporated into their existing solution. For each security issue, our approach automatically generates a proposed solution, consisting of a combination of privilege modifications, new roles and views to improve access control. To test our approach, we apply our process to a medical database implemented using the popular document-oriented NoSQL database, MongoDB.
The great advantages of our approach are that: (1) it takes into account the context of the system thanks to the introduction of domain ontologies, (2) it helps avoid missing critical access control issues since the analysis is performed automatically, (3) it reduces the effort and costs of the modernization process thanks to the automated steps in the process, (4) it can be used successfully with different NoSQL document-based technologies by adjusting the metamodel, and (5) it is aligned with known standards, hence allowing the application of guidelines and best practices.


1 Introduction

Enormous amounts of data are already present and still rapidly growing due to heterogeneous data sources (sensors, GPS and many other types of smart devices). There has been increasing interest in the efficient processing of these unstructured data, normally referred to as "Big Data", and in their incorporation into traditional applications. This necessity has forced traditional database systems and processing techniques to evolve to accommodate them. Therefore, new technologies have arisen that focus on performance, enabling the processing of large volumes of structured and unstructured data. NoSQL technologies are an example of these new technologies and have become a common component in many enterprise architectures in different domains (medical, scientific, biological, etc.).

We can distinguish four different categories of NoSQL databases: (1) Key/Value, where data are stored and accessed by a unique key that references a value (e.g., DynamoDB, Riak, Redis); (2) Column, similar to the key/value model, but where the key consists of a combination of column, row and timestamp used to reference groups of columns (e.g., Cassandra, BigTable, Hadoop/HBase); (3) Document, in which data are stored in documents that encapsulate all the information following a standard format such as XML, YAML or JSON (e.g., MongoDB, CouchDB); (4) Graph, in which graph theory is applied and data are distributed across multiple computers (e.g., Neo4j and GraphBase).

One of the main challenges is that these new NoSQL technologies have focused mainly on dealing with Big Data characteristics, whereas security and privacy constraints have been relegated to a secondary place [ 1 , 2 , 3 ], thus leading to information leaks that cause economic losses and reputation crises. The main objective of our research is to incorporate security in NoSQL databases, focusing on document databases as a starting point. Accordingly, this paper presents the first modernization approach for introducing security in NoSQL document databases through the improvement of access control.

The proposed approach consists of two stages: (1) the analysis of the existing NoSQL solution (using a domain ontology and applying natural language processing, NLP) to detect sensitive data and create a conceptual model of the database (reverse engineering); (2) the identification of access control issues to be tackled in order to modernize the existing NoSQL solution. At a later stage, which is beyond the scope of this paper, different transformation rules for each detected security issue will be applied. These transformation rules will consist of a combination of privilege modifications, new roles and the creation of views, which can be adapted by the database designer. Finally, the implementation of the transformation rules will be carried out. In order to evaluate our proposal, we have applied it to a medical database implemented using the document-oriented NoSQL database MongoDB.

The great advantages of our framework are that: (1) it takes into account the context of the system thanks to the introduction of domain ontologies; (2) it helps avoid missing critical security issues since the analysis is performed automatically; (3) it reduces the effort and costs of the modernization process thanks to the automated steps in the process; (4) it can be used successfully with different NoSQL technologies by adjusting the metamodel; and (5) it is aligned with known standards, hence allowing the application of guidelines and best practices.

The main contributions of this paper are summarized as follows:

The first general modernization approach for introducing security through improved access control in NoSQL document databases. We focus on document databases although our proposal could be applied to other NoSQL technologies, such as columnar and graph-based databases.

Our approach adapts to each domain by using a specialized ontology that allows users to specify the sensitive information.

The automatic analysis of data and database structure to identify potential security issues.

The generation of the security enhanced database model, including the automatic generation of a solution for each access control issue consisting of a combination of privilege modifications, new roles and views.

The remainder of this paper is organized as follows. Sect.  2 reviews the related work. Following this, in Sect.  3 , our framework for NoSQL modernization including security aspects is defined. In Sect.  4 , our approach is applied to a case study within the medical domain. Section  5 presents a discussion and the limitations of the present work. Finally, Sect.  6 explains the conclusions and sketches future work.

2 Related work

Given the multiple disciplines involved in our approach, there are several areas that must be considered as part of the related work.

With respect to security issues in NoSQL databases, different works have been developed. They address the problem of including security policies in this kind of database (usually applied in Big Data environments). However, these approaches rarely consider applying such policies at the different modeling stages [ 1 , 2 , 3 , 4 ] or including security and privacy restrictions [ 3 , 5 , 6 ]. As a result, the proposals that currently exist on this topic lack adequate security in terms of confidentiality, privacy and integrity (to mention just a few properties) in Big Data domains [ 1 , 3 , 4 , 7 ].

Other approaches have achieved proper security in the development of information systems, but they are not focused on NoSQL databases and their specific security problems. In this sense, the most relevant proposals are listed below: (i) Secure TROPOS is an extension that provides security for the TROPOS methodology [ 8 ]; it is focused on software development using the intentional goals of agents. (ii) Mokum is an object-oriented system for modeling [ 9 ]; it is knowledge-based and facilitates the definition of security and integrity constraints. (iii) UMLsec evaluates general security issues using semantics and specifies confidentiality, integrity needs and access control [ 10 ]. (iv) MDS (model-driven security) is a proposal that applies the model-driven approach to high-level system models [ 11 ]; it adds security properties to the model and automatically generates a secure system.

Focusing on Big Data systems, we can conclude that current proposals do not consider the security concept adequately at all stages. They provide partial security solutions such as: (i) anonymization and data encryption [ 12 ], (ii) description of reputation models [ 13 ], and (iii) authentication and signature encryption [ 14 , 15 ]. Furthermore, many deficiencies in the implementation of security features in NoSQL databases [ 6 ] or in Hadoop ecosystems [ 5 , 16 , 17 , 18 ] have been detected.

It is important to mention that our proposal follows the standards defined for Big Data systems. We consider two main approaches: (1) the BIGEU Project (The Big Data Public Private Forum Project) from the European Union [ 19 ], which tries to define a clear strategy for the successful use and exploitation of Big Data in our society, aligned with the objectives of the Horizon 2020 program; (2) the NIST (National Institute of Standards and Technology, USA) standard [ 16 ], which proposes a reference architecture for Big Data that identifies a component corresponding to the Big Data application.

With respect to the BIGEU Project, our stages can be aligned with those of the Big Data Value Chain. The NoSQL database analysis corresponds to data acquisition and analysis (with special emphasis on the use of ontologies and NLP techniques). Our modeling stage (conceptual model creation and security issues list) can be matched to the data curation and storage stages. Finally, the automatic generation of the solution is related to the data usage stage. Regarding the NIST architecture, the different stages of the information value chain (collection, preparation/curation, analytics and access) are defined in the Big Data component (similar to the previously mentioned stages of BIGEU). The NIST Big Data reference architecture has been extended to integrate security issues [ 20 ]. Our approach is also aligned with this security reference architecture for Big Data, through components such as the security requirement, security metadata and security solution. We can conclude that the presented architectures take into account the specific characteristics of Big Data systems, oriented and driven by the data, defining the stages of its value chain. These standardization efforts are considered in our proposal, which is aligned with the aforementioned architectures.

Furthermore, with regard to conceptual modeling and semantics, several works present the use of ontologies in conceptual modeling. Weber shows how ontological theories can be used to inform conceptual modeling research, practice, and pedagogy [ 21 ]. The general formal ontology (GFO) is a foundational ontology that integrates objects and processes and is designed for applications in areas as diverse as medicine, biology, biomedicine, economics, and sociology [ 22 ]. A comparison between traditional conceptual modeling and ontology-driven conceptual modeling was made in the work of Verdonck et al. [ 23 ], demonstrating that there are meaningful differences between the two techniques, with ontology-driven conceptual modeling yielding higher-quality models. However, to the best of our knowledge, we present here the first use of ontologies to improve security in NoSQL databases by detecting sensitive information.

As shown, there have been advances in several areas that make it possible to provide a modernization process to incorporate security in existing NoSQL databases. However, after carrying out this review of the related literature, it is evident that our research is the first modernization approach for introducing security in NoSQL databases that automatically identifies and enables security mechanisms that were not considered during the initial implementation. Furthermore, our approach adapts to each domain by using an ontology that allows users to specify what information can be considered sensitive or highly sensitive.

3 A modernization approach for NoSQL document databases

The proposed approach (see Fig.  1 ) consists of two stages: (1) the process of reverse engineering to create a conceptual model of the database and the analysis of the existing NoSQL solution to detect sensitive information; (2) the identification of security issues to modernize the existing NoSQL solution. These stages are detailed in the following subsections.

figure 1

Scheme of NoSQL modernization process

3.1 Reverse engineering

The modernization process starts with a reverse engineering of the database. The aim of the reverse engineering process is to obtain an abstracted model that allows us to reason about and identify security issues that will be the target of security improvements. In order to perform this process, we require a metamodel that represents the structures in the DB. One such metamodel is presented in Fig.  2 .

figure 2

Metamodel for Document-based Databases & MongoDB

The excerpt of the metamodel shown in Fig.  2 is divided into two parts. The upper part corresponds to the database structures and contains the main elements of NoSQL document databases, whereas the lower part corresponds to the security actions (modifications) to be made in order to improve access control. Since we work with MongoDB, the database metamodel is tailored to MongoDB structures and datatypes and is defined as follows:

The first element in the metamodel is the Database element, which acts as root of the model. A database may have a name and has associated any number of Collections (Views), Roles and Users.

Each Collection has a name and establishes an id that enables the identification of each document. A Collection can have several Documents, each of which may have several fields, some of which can be required for all the documents within the collection. Each Field can be a SimpleField, storing a basic type of value, or a composite field, storing other collections within it. Finally, each field may have a constraint applied to it.
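As a minimal sketch of the Field/Constraint elements just described (our own illustration, not code from the paper; field names and the helper are hypothetical), required fields and a per-field constraint can be checked for a document like this:

```javascript
// Illustrative only: check that a document carries its collection's
// required fields and that present fields satisfy their constraints.
function validateDocument(doc, requiredFields, constraints = {}) {
  const missing = requiredFields.filter(f => !(f in doc));
  const violated = Object.entries(constraints)
    .filter(([f, check]) => f in doc && !check(doc[f]))
    .map(([f]) => f);
  return { ok: missing.length === 0 && violated.length === 0, missing, violated };
}

const result = validateDocument(
  { _id: 7, name: "A. Jones", age: -3 },
  ["_id", "name", "age"],
  { age: v => Number.isInteger(v) && v >= 0 } // constraint on a SimpleField
);
console.log(result); // → { ok: false, missing: [], violated: [ 'age' ] }
```

In MongoDB itself, such required-field and type constraints would typically be enforced with a collection validator rather than application code.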

Aside from collections, a MongoDB database may store views. A View has a name and a base collection (or view) over which a pipeline of projections and operations are performed. Projections enable us to hide or show certain fields, and may involve conditions that determine whether an instance of the field is included or excluded. Moreover, a View can include aggregation operations to derive new dynamic fields obtained from the underlying collections.

To manage the access to the data stored within the different collections, the database may establish a Role-based system. A Role has a name and a series of privileges over a given Database and the different Collections or Views. These roles are assigned to users, with each User having at least one, and potentially multiple Roles.

In order to manage access control, privileges can be granted or revoked for each Role. Modifications over privileges in the database are covered in the lower part of our metamodel, which deals with access control. Our metamodel includes the four basic action privileges: Find, Insert, Update and Remove. These privileges can be granted over certain collections through the GrantRevokeCollection or, more in detail, over certain fields through the GrantRevokeField.

It is important to note that although at the conceptual level our model is generic and supports revoking field privileges, MongoDB does not implement field-level access control; thus, no privileges over individual fields can be directly established for a role or user.

In addition to grant and revoke actions, our model includes SecurityActions that represent high-level objectives, such as hiding fields or values, to be achieved through the creation of views, collections, grant and revoke actions.

The aim of the lower part of the metamodel is to make explicit the changes that will be performed over the database as a result of the security analysis. This also enables the addition or removal of any security actions before making the changes effective.

Using this metamodel as reference, the process for obtaining the reverse-engineered model is as follows:

First, a Database element is created with the name of the database that is reverse engineered.

Second, the list of collections is obtained by using the list_collection_names() method. For each collection, a Collection and a Document element are created and appended to the Database element. For simplicity, the Document element will hold the set of keys used in all the documents of the collection, as retrieving all the metadata of each specific document is unnecessary at this point.

Third, for each collection, the set of keys is retrieved and appended to the document of the collection. This can be done using multiple publicly known methods [ 24 ]. For each key, a Field element is added with its name set to the retrieved key. After all fields have been set, they are added to the collection and the next collection in the list is processed.

Fourth, after all collections have been retrieved, users and roles are retrieved using the usersInfo command. For each user a User is created along with its associated roles and privileges over collections. Once all the User and Role elements have been created, they are appended to the Database element.
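The four steps above can be sketched in Python against a pymongo-style API. In this sketch, `list_collection_names()` and the `usersInfo` command are the real MongoDB/pymongo calls; the stub classes below stand in for a live MongoDB connection so the example is self-contained, and the sample data are purely illustrative:

```python
# Sketch of the reverse-engineering steps of Sect. 3.1, assuming a
# pymongo-style API. The Stub* classes replace a live MongoDB connection.

class StubCollection:
    def __init__(self, docs):
        self._docs = docs
    def find(self, *args, **kwargs):
        return list(self._docs)

class StubDatabase:
    def __init__(self, name, collections, users):
        self.name = name
        self._collections = collections
        self._users = users
    def list_collection_names(self):          # used in step 2
        return list(self._collections)
    def __getitem__(self, name):
        return StubCollection(self._collections[name])
    def command(self, cmd):                   # used in step 4 (usersInfo)
        assert cmd == "usersInfo"
        return {"users": self._users}

def reverse_engineer(db):
    model = {"database": db.name, "collections": {}, "users": []}
    for cname in db.list_collection_names():              # step 2
        keys = set()
        for doc in db[cname].find():                      # step 3: union of keys
            keys.update(doc.keys())
        model["collections"][cname] = sorted(keys)
    info = db.command("usersInfo")                        # step 4
    for u in info["users"]:
        model["users"].append({"user": u["user"],
                               "roles": [r["role"] for r in u["roles"]]})
    return model

db = StubDatabase(
    "medical",
    {"Patient": [{"_id": 1, "name": "A", "race": "X"},
                 {"_id": 2, "name": "B", "address": "Y"}],
     "Admission": [{"_id": 1, "treatment": "T"}]},
    [{"user": "analyst", "roles": [{"role": "Analyst", "db": "medical"}]}])

model = reverse_engineer(db)
print(model["collections"]["Patient"])  # union of keys across all documents
```

Note that, as in step 2 of the process, the Document element simply holds the union of keys observed across the documents of each collection.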

After the reverse engineering process, we will have obtained a conceptual model of the database to be modernized. In order to identify potential security issues, we must perform an analysis that will start with an ontological analysis of the contents in the database. This analysis will allow domain experts to tag sensitive collections and properties that should be restricted, enabling the identification of new views, privileges, and roles that will need to be implemented or existing ones that should be adjusted.

3.1.1 Data analysis: the ontology

At this stage we will work with the source database in order to detect sensitive data. It is important to emphasize that the analysis stage can be applied to any NoSQL database technology used both in the previous phase of reverse engineering and in the next phase of identification of security issues.

Each field of the database is analyzed in search of sensitive information. One of the contributions of this proposal is the establishment of the security privileges needed to access each field of the data set. In order to define the different security privileges, we have followed the four levels of security clearance defined by the U.S. Department of State Footnote 1: unclassified, confidential, secret, and top secret, although the classifications used in other countries (Canada, UK, etc.) could also be used. We have defined a mapping between the selected security clearance levels, which we have called security levels (SL), and numerical values to facilitate subsequent calculations. Therefore, we will use SL = 0 (unclassified: everyone has access to the information); SL = 1 (confidential: persons registered in the system can access the information); SL = 2, 3 (secret and top secret: only certain people with specific profiles and characteristics can access the information).

In order to tag the database fields with their corresponding SL, we have used NLP techniques together with lexical and ontological resources. In our approach, we have used the lexical database WordNet 3.1, Footnote 2 which contains semantic relations (synonyms, hyponyms, meronyms, etc.) between words in more than 200 languages. In addition, the help of experts in the specific domain of the source database (in our case study, the medical domain) and in the domain of data protection was necessary. Footnote 3

The labeling process consists of two steps. In the first step, the lexical resource is enriched with information related to sensitivity. Therefore, WordNet was enriched by adding the SL to all the concepts; initially all WordNet concepts are labeled with SL = 1. Next, the specific domain expert (a physician, in our case scenario) will update the WordNet concepts related to his/her domain, distinguishing between SL = 2 (concepts that have sensitive information such as those related to the treatment or diagnosis of the patient) and SL = 3 (they have very sensitive information, for example, the concepts related to the medical specialty Oncology). Finally, the data protection expert will carry out the same process distinguishing between SL = 2 (for example, the patient’s address) and SL = 3 (such as race or religious beliefs among others).

In the second step, each database field is labeled with a specific SL. Thus, all the field values are looked up in WordNet. Two cases can occur, giving rise to two types of restrictions: (1) all values have the same security level (e.g., SL = 3) and, consequently, this SL is assigned to the field (security constraints); (2) the field values have different SLs (e.g., SL = 2 and SL = 3) and, consequently, distinctions within the same field must be made (fine-grain security constraints). These restrictions are explained below:

Security constraints. They are defined at the field level. A security constraint is specified when the information contained in a field (after processing all of its instances) has the same security level; that is, no information within the field is more sensitive than any other. For example, a race field is sensitive: the information it contains is always sensitive regardless of the specific values. Therefore, a security level of SL = 3, ’top secret’, might be required for queries.

Fine-grain security constraints. They are specified at the field content level. These constraints define different security privileges for a field depending on its content. For example, querying a field that represents the medical specialty of patients might require a generic SL = 2; however, patients with terminal diseases might require a higher SL (for example, SL = 3). Thus, a user with SL = 3 could see all the patients (including those with terminal diseases), whereas users with a lower SL could only see patients with non-terminal diseases.
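The distinction between the two kinds of restrictions can be sketched as follows, assuming the enriched lexical resource is available as a simple value-to-SL lookup (the table below is illustrative, not the real enriched WordNet):

```python
# Sketch of the field-labeling step of Sect. 3.1.1: a field whose values
# all share one SL receives a security constraint; a field with mixed SLs
# receives a fine-grain constraint. The SL lookup table is illustrative.

SL = {"asian": 3, "caucasian": 3,          # race values: top secret
      "dermatology": 2, "oncology": 3}     # specialties: secret / top secret

def label_field(values, sl_lookup):
    # values absent from the lookup default to SL = 1 (confidential)
    levels = {v: sl_lookup.get(v.lower(), 1) for v in values}
    distinct = set(levels.values())
    if len(distinct) == 1:                       # case (1): security constraint
        return {"kind": "security", "sl": distinct.pop()}
    return {"kind": "fine-grain", "sl": levels}  # case (2): per-value SLs

print(label_field(["Asian", "Caucasian"], SL))
print(label_field(["Dermatology", "Oncology"], SL))
```

The first call yields a single field-level SL = 3 (as for race), while the second flags the field for fine-grain treatment (as for medical specialty).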

A more detailed description of the entire process will be explained in Sect.  4.3 with our case study.

3.2 Identification of security issues: security improvements

Once we have performed the data analysis using the expert tagging and the ontology, we will have a set of fields (properties in the MongoDB metamodel) tagged with different security levels. In order to identify potential security issues automatically, we will proceed as follows:

First, for each collection C , a security level array \(S_c\) will be defined, including a triple (attribute, security level, condition) for each attribute in the collection. The security values will be defined according to the security levels specified in the previous step, such that:

\(S_c = (\langle a_1, s_1, c_1\rangle , \ldots , \langle a_n, s_n, c_n\rangle )\)     (1)

In this sense, if a field presents multiple security levels, it will be repeated, and the associated condition stored for later use. If no condition is specified, \(c_i\) will be null. For each array, the maximum and minimum value will be calculated. If they are equal, the array will be compressed into a single number that represents the security level of the entire collection. An example of ( 1 ) would be Patient , where each value represents the security level \(s_i\) of an attribute \(a_i\) (the names have been omitted for simplicity):

\(S_{Patient} = (1, \ldots , 1, 2, 3)\)

Afterward, we will obtain the role access matrix RA of the system, composed of role access arrays. A role array i contains the name of the role \(r_i\) and the associated set of permissions \(A_i\) to access each collection. There is a role array for each role in the system except for the admin role, which necessarily has access to everything. Therefore, the role access matrix is defined as:

\(RA = (\langle r_1, A_1\rangle , \ldots , \langle r_k, A_k\rangle )\)     (2)

where \(A_i\) is a list of pairs \(\langle c,p\rangle \) , denoting that the role has access to the collection named c if \(p=1\) or that its access is restricted if \(p=0\) . A role access matrix ( 2 ), such as the following one for the medical database, will be used in conjunction with the security level arrays to determine the security level of each user and role:

\(RA = (\langle Analyst, \{\langle Patient,1\rangle , \langle Admission,1\rangle \}\rangle , \langle Patients, \{\langle Patient,1\rangle , \langle Admission,0\rangle \}\rangle )\)

The user access matrix UA is obtained in a similar fashion, using the permissions of users in the systems instead of the roles.

Once we have the security arrays and the access matrices, we obtain the extended access matrices for roles (REA) and users (UEA) by multiplying the access level of each role/user by the security level array of each collection. Essentially, each row of the REA matrix will be:

\(REA_i = \langle r_i, p_{i,1} \cdot S_{c_1}, \ldots , p_{i,m} \cdot S_{c_m}\rangle \)

where \(p_{i,j}\) is the access bit of role \(r_i\) over collection \(c_j\) .

An example of the REA matrix for the medical database is as follows:

\(REA_{Analyst} = \langle 1 \cdot S_{Patient}, 1 \cdot S_{Admission}\rangle = \langle (1, \ldots , 1, 2, 3), (1, \ldots , 1, 2, 2, 2)\rangle \)

\(REA_{Patients} = \langle 1 \cdot S_{Patient}, 0 \cdot S_{Admission}\rangle = \langle (1, \ldots , 1, 2, 3), (0, \ldots , 0)\rangle \)
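The construction of the extended matrix can be sketched as follows; the collections, fields and security levels mirror the case study of Sect. 4, while the data shapes (plain dictionaries) are an assumption made for illustration:

```python
# Sketch of Sect. 3.2: security level arrays per collection, a role access
# matrix, and the extended matrix REA obtained by multiplying each role's
# access bit by the security array of each collection.

S = {"Patient":   {"name": 1, "address": 2, "race": 3},
     "Admission": {"treatment": 2, "medical_specialty": 2, "diagnosis": 2}}

RA = {"Analyst":  {"Patient": 1, "Admission": 1},
      "Patients": {"Patient": 1, "Admission": 0}}

def extended_matrix(ra, s):
    rea = {}
    for role, access in ra.items():
        # p * sl: a role without access (p = 0) zeroes out the whole array
        rea[role] = {c: {f: p * sl for f, sl in s[c].items()}
                     for c, p in access.items()}
    return rea

REA = extended_matrix(RA, S)
print(REA["Patients"]["Admission"])  # all zeros: no access to Admission
```

A zeroed-out row segment thus encodes "no access", while surviving values of 2 or 3 mark the sensitive fields each role can actually reach.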

The extended access matrices are used as input for the analysis. The security issues identified during the process will depend on the choice of the user across three different security policies:

Exclusive access to highly sensitive data per user (only one user can access each highly sensitive field/collection)

(Default) Exclusive access to highly sensitive data per role

Without exclusive access

Taking into account the selected policy, we will analyze the accessibility of each field as follows:

For each set of rows that are equal to each other within the role matrix, we will identify these roles as “Duplicate Roles”, suggesting the removal of duplicates in the security issues list. The users that had their roles removed will be assigned the remaining role. Accordingly, the security action that summarizes this process, role removal, will include eliminating duplicate roles as well as corresponding grant role actions.

For security level 1 collections and fields, we will identify any 0’s in the column of the matrix. Since these fields are considered as “generally accessible”, we will report the list of roles without access for review, suggesting that access is given to them in the security issues list. The security action will be tagged as show field, with the corresponding grant collection actions.

For security level 2 collections and fields, we will identify any columns without 0’s. For these collections and fields, a warning will be included in the security issues list. In order to deal with this security issue we will proceed depending on whether the entire collection has a security level higher than 1 or if higher security is set only for selected fields:

If the entire collection has a security level higher than 1, then the suggested modification will be the removal of all the access rights and the creation of a new role with access to all level 1 collections as well as the restricted collection. The associated security action, hide collection, will include the revoke access action to all roles and will grant access to the new role to be created.

If only selected fields are affected, then the suggested modification will be to create a new View that contains only the level 1 fields. All existing roles with access to the collection will have their access removed and will be granted access to the view instead. Finally, a new role will be created with access to all level 1 collections as well as the restricted collection. These actions will be aggregated into the hide field security action.

For security level 3 collections and fields, first, we will identify any columns with two or more non-zero values. Since these are highly sensitive data, it is expected that only the admin and one specific role have access to them. Therefore, a warning will be included in the security issues list. If the exclusive role access policy has been selected, then the removal of all access rights from existing roles will be suggested, creating a new role with exclusive access as in the previous step. These actions will be related to a hide field or hide collection security action. Second, we will identify any rows with access to two or more level 3 fields. If these fields pertain to different collections, it would mean that a single role has access to sensitive information from multiple collections. Therefore, in order to improve security, a warning will be included in the security issues list and the role will keep its access only to the first collection. A new role will be added for each collection with highly sensitive information that cannot be accessed by any other role. These actions will be related to a hide field security action.

In the case of fine-grained constraints (i.e., fields that contain multiple security levels depending on their contents), such as the “medical_specialty” field, a new View will be created for the lower security level using the $redact operator from MongoDB and the condition previously stored. In this way, the original collection retains all data, including highly sensitive information, while the other two views allow access to generally available and sensitive information, respectively. In these cases, an additional role will be created that has access to all level 1 collections as well as the restricted view, but not to the original collection. All these actions will be related to the hide value security action.

For users in the database, steps 2–4 will be repeated, identifying users without access to general collections as well as sensitive data to which all users have access. In addition, if the exclusive user access policy has been selected, step 5 will be repeated using the extended user access matrix, suggesting the removal of the rights of all users with access to highly sensitive data so that the database administrator can choose which users should maintain the access.
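Two of the checks above, duplicate-role detection (step 1) and sensitive fields reachable by every role (steps 3 and 4), can be sketched as follows; the matrices are toy data in the same shape used throughout this section:

```python
# Sketch of two checks from Sect. 3.2: equal rows in the role access matrix
# ("Duplicate Roles") and columns of the extended matrix with no zeros
# (a sensitive field that every role can reach). Data are illustrative.

RA = {"Analyst":  {"Patient": 1, "Admission": 1},
      "Reporter": {"Patient": 1, "Admission": 1},   # duplicate of Analyst
      "Patients": {"Patient": 1, "Admission": 0}}

REA = {"Analyst":  {"Patient": {"address": 2, "race": 3}},
       "Reporter": {"Patient": {"address": 2, "race": 3}},
       "Patients": {"Patient": {"address": 2, "race": 3}}}

def duplicate_roles(ra):
    seen, dups = {}, []
    for role, row in sorted(ra.items()):
        key = tuple(sorted(row.items()))
        if key in seen:
            dups.append((role, seen[key]))    # suggest removing `role`
        else:
            seen[key] = role
    return dups

def field_accessible_to_all(rea, collection, field):
    # True means no zero in the field's column: every role has access
    return all(rea[r].get(collection, {}).get(field, 0) > 0 for r in rea)

print(duplicate_roles(RA))
print(field_accessible_to_all(REA, "Patient", "race"))  # SL = 3 open to all
```

A `True` result for a level 2 or 3 field is exactly the condition that triggers a hide field or hide collection security action in the steps above.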

4 Case study

In order to prove the validity of our proposal, we have applied it to a case study within the medical domain. In the following subsections, we present: the source data, the reverse engineering process to extract the original model, the process to identify sensitive data in order to obtain the security recommendations, and finally, the generation of the security-enhanced model.

4.1 Source data

With respect to the database selection, we have used the structured data on patients with diabetes employed in the study developed by Strack et al. [ 27 ], extracted from the Health Facts database (Cerner Corporation, Kansas City, MO), a national data warehouse that collects comprehensive clinical (electronic medical) records across hospitals throughout the USA. It contains personal data of the patients and all the information related to their admission to the hospital.

The Health Facts data we used were an extract representing 10 years (1999-2008) of clinical care including 130 hospitals and integrated delivery networks throughout the United States. The database consists of 41 tables in a fact-dimension schema and a total of 117 features. The database includes 74,036,643 unique encounters (visits) that correspond to 17,880,231 unique patients and 2,889,571 providers.

The data set was created in two steps. First, encounters of interest were extracted from the database with 55 attributes. Second, preliminary analyses and preprocessing of the data were performed, retaining only those features (attributes) and encounters that could be used in further analyses, in other words, features that contain sufficient information. The full list of the features and their description is provided in [ 27 ]. This data set is available as Supplementary Material online, Footnote 4 and it is also in the UCI Machine Learning Repository.

Finally, the information from the mentioned database for “diabetic” encounters was extracted. In this way, 101,766 encounters were identified related to diabetic patients. These data were used in our experiments.

4.2 Reverse engineering

Using the information of diabetic patients, we created an initial MongoDB database replicating the structure of the dataset. The database contains two collections. The first one, “Admission”, stores all the information regarding the admission of patients. This includes sensitive information such as the drugs that have been administered to the patient or the medical specialty corresponding to their case. The second one, “Patient”, stores information regarding patients, such as name, gender, race, age or address.

Together with the “Admission” and “Patient” collections, two users with their corresponding roles are created in the database: first, the “Analyst” user and role, with access to every collection and field in order to manage the database as would be expected in a Data Warehouse-style database; second, a “Patients” user and role for querying information about patients, with access to the “Patient” collection only.

Using the reverse engineering process presented in Sect.  3.1 , we obtain the model shown in Fig.  3 .

figure 3

Initial database model

As can be seen in Fig. 3, security-wise this would be a poor database model for general use. There is little security beyond one user having access to only one collection, and there is no discrimination between sensitive and non-sensitive fields. Therefore, to show how our process modernizes database security, our next step is to analyze the data at hand in order to annotate the model, identify security issues, and generate security recommendations.

4.3 Security recommendations extraction

In our example, considering our data model, we analyzed the different fields in order to establish the security constraints.

We applied the process defined in Sect. 3.1.1. The first step consists of enriching the lexical resource by adding the SL to the concepts. As previously mentioned, we have used WordNet 3.1. Initially, the lowest security level (SL = 1) was assigned to all the WordNet terms. These initial values are then modified by an expert according to the specific domain being worked on.

In our case scenario, an expert in the medical domain (a physician) updates the concept security levels by distinguishing sensitive information (SL = 2) and very sensitive information (SL = 3). For instance, the concepts related to the patient’s treatment and the respective medicaments are treated as sensitive information (SL = 2). Likewise, the concepts related to the patient’s medical specialty carry sensitive information (SL = 2); however, the physician distinguishes the most sensitive specialties (Oncology, etc.) by assigning them the highest security level (SL = 3). The new security levels established by the expert are updated in WordNet in the following way: the concepts with security levels 2 or 3 are searched for in WordNet. When a concept is found, its security level is modified and propagated to all its child concepts in the tree structure. This process ends when a leaf is reached or when concepts with security levels greater than the new one are found (this indicates that the security level has been previously modified in that sub-tree). If a concept is not found in WordNet, WordNet is enriched with the new concept (the expert indicates where it is to be inserted).
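The propagation rule described above can be sketched as follows; the toy taxonomy and the `propagate` helper are illustrative assumptions, not part of WordNet itself:

```python
# Sketch of expert SL propagation: a new SL set on a concept propagates to
# its descendants, stopping at leaves or at nodes already tagged with a
# stricter level (a previously modified sub-tree). Toy taxonomy only.

TREE = {"medical_specialty": ["pediatrics", "dermatology", "oncology"],
        "oncology": ["pediatric_oncology"]}

def propagate(sl, concept, new_level):
    # stop if this sub-tree was already tagged with a stricter level
    if sl.get(concept, 1) > new_level:
        return
    sl[concept] = new_level
    for child in TREE.get(concept, []):
        propagate(sl, child, new_level)

sl = {"oncology": 3, "pediatric_oncology": 3}   # tagged first by the physician
propagate(sl, "medical_specialty", 2)           # then the generic SL = 2 pass
print(sl)
```

Note how the Oncology sub-tree keeps SL = 3 while its siblings receive the generic SL = 2, exactly as in the Fig. 4 example.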

figure 4

A fragment of treatment and medical specialty ontologies

In Fig.  4 , two fragments of WordNet related to the fields treatment and medical specialty are shown. On the one hand, the expert will assign the SL = 2 to the concept Medicament which will be propagated to their children (Antibacterial, Anticoagulant, Antidiabetic, Metformin, Glyburide, etc.). On the other hand, he/she will assign the SL = 2 to the concept Medical_Specialty which will be propagated to their children (Pediatrics, Dermatology, Hematology, Cardiology, etc.), except the concept Oncology that will have the SL = 3.

A similar process will be carried out by the data protection expert to update the SL in all the concepts of his/her domain with sensitive information. For example, the expert will assign the SL = 3 to the concept Race which will be propagated to the concepts Asian , Caucasian , Hispanic , etc.

In order to check that the SL labeling carried out by the experts is valid, Cohen’s kappa coefficient has been calculated to measure the agreement between them [ 28 ]. Thus, two experts in the medical domain and two experts in data protection carried out the SL labeling of the concepts of their respective domains. The kappa coefficients obtained for each domain were higher than 0.75, which is considered excellent agreement.

In the second step, each database field was labeled with a specific SL. Next, we introduce some examples of the two kinds of constraints previously defined. Suppose we are analyzing the treatment field. Beforehand, we do not know the details of the information stored in this field, although its values are known. After having consulted each of the values of the treatment field in WordNet, they are all assigned SL = 2. An example of the use of security constraints would be, given the previous preconditions, setting SL = 2 for this field.

On the other hand, to illustrate an example of a fine-grain constraint, suppose the field medical specialty is now analyzed. This field describes, with 84 values, the specialty of the patient: “dermatology”, “endocrinology”, “pulmonology”, “oncology”, etc. These values are searched for in WordNet, having, in most instances, SL = 2. However, it can be seen that some patients have the value “oncology”, which is the most sensitive (SL = 3). According to this, a new recommendation will be created for the following modeling stage.

The final result of the analysis, after applying the constraints set, was:

Patient collection: Race: SL = 3; Address: SL = 2; remaining fields: SL = 1.

Admission collection: Treatment: SL = 2; Medical specialty: SL = 2 (SL = 3 in the case of oncology); Diagnosis: SL = 2; remaining fields: SL = 1.

It is worth highlighting that the application of NLP techniques is very useful for database fields where natural language is used to express concepts or ideas. For example, in the medical domain, fields containing information about the encounter with the patient, the diagnosis, or the medication are very common. After carrying out both a lexical-morphological (POS tagging) and a partial syntactic (partial parsing) analysis of these textual fields, we can identify the main concepts of the text. The application of these NLP techniques also contributes to dealing effectively with natural language issues such as ambiguities, ellipses or anaphoric expressions. Once these key concepts are extracted, the process defined above is carried out by assigning a security level to the text field.

Furthermore, it is interesting to note that a similar problem is tackled by the person responsible for data protection and privacy in an organization or company when dealing with document anonymization or text sanitization. In these cases, named entity recognition (NER) techniques from NLP are used to identify sensitive entities or words in order to anonymize them by generalizing them to broader concepts.

With regard to the implementation, WordNet has been converted into a JSON format that is compatible with our database engine. Each of the terms of the fields is searched for in WordNet to assign it a security level. If a concept does not exist in WordNet, it can be included as shown in Fig. 5.

figure 5

Example of enrichment of WordNet

In our example, we have focused on the treatment and medical specialty fields to show the aforementioned constraints. An automatic process is carried out to establish the field security level. In Fig.  6 , the functions to extract the security levels of the Admission collection are shown.

figure 6

Extraction of security levels of Admission collection

The result of this stage is a list of security recommendations to be taken into account in the following stage of modeling. For instance, we have obtained these two recommendations related to the mentioned fields: (1) the treatment field has the SL = 2 (security constraint); and (2) the medical_specialty field has the SL = 2, and if it is Oncology: SL = 3 (fine-grain security constraint).

4.4 Security-enhanced model

Using the information from the previous step, we analyze the access levels using the steps described in Sect.  3.2 .

In the case at hand, we have two roles, not including the admin, with the following access levels: the “Analyst” role has access to both the “Patient” and “Admission” collections, whereas the “Patients” role has access to the “Patient” collection only.

Combining this information with the security levels identified in the previous step, we obtain the extended access matrix. For the sake of brevity, all access levels 0 (the “Patients” role does not have access to the “Admission” collection) and 1 (general access for registered users) are omitted in the paper: for the “Analyst” role, address = 2 and race = 3 in “Patient”, and treatment = 2, diagnosis = 2 and medical specialty = 2/3 in “Admission”; for the “Patients” role, address = 2 and race = 3 in “Patient”.

Using the extended access matrix for roles, our approach identifies the following issues and recommendations:

The “address” field in “Patient” can be accessed by all roles (no 0s in the column). The suggested modification is to create a new view “ViewSL1_Patient” removing all security level 2 and 3 fields. The “Analyst” and “Patients” roles have their access to the collection “Patient” removed and are granted instead access to “ViewSL1_Patient”. A new role, “Patients_SL3” is created with access to the original collection.

No further SL = 2 issues are detected. The access matrix is updated.

The “race” field in “Patient” could initially be accessed by all roles (no 0s in the column). With the modifications made no further changes are needed to deal with this issue.

The “Analyst” role could initially access highly sensitive information (multiple security level 3 fields) from different collections. With the modifications made no further changes are needed to deal with this issue.

The “medical specialty” field in “Admission” has two security levels (2 and 3), yet only one role, “Analyst”, has access to all the information. The suggested modification is to create a new view “ViewSL2_Admission” using the $redact operator to remove the information related to oncology patients. A new role “Admission_Oncology” is proposed, which has access to the original collection. The “Analyst” role has its access revoked and is instead granted access to the new view. All these operations are related to a hide value security action. The access matrix is updated.
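A sketch of the suggested view definition, assuming pymongo: the `$redact` stage and the `$$PRUNE`/`$$DESCEND` system variables are standard MongoDB, and the view name follows the recommendation above. Only the pipeline itself is executed here; the server-side command is shown in a comment:

```python
# Sketch of the "ViewSL2_Admission" view: a $redact stage that prunes
# whole documents whose medical_specialty is "Oncology" (SL = 3) and
# keeps the rest (SL <= 2).

pipeline = [{
    "$redact": {
        "$cond": {
            "if": {"$eq": ["$medical_specialty", "Oncology"]},
            "then": "$$PRUNE",      # drop SL = 3 documents from the view
            "else": "$$DESCEND",    # keep the remaining documents
        }
    }
}]

# With a live pymongo connection, the view would be created as
# (not executed in this sketch):
#   db.command("create", "ViewSL2_Admission",
#              viewOn="Admission", pipeline=pipeline)
print(pipeline[0]["$redact"]["$cond"]["then"])
```

Roles granted access to the view (rather than to “Admission” itself) can then query all non-oncology admissions without ever seeing the SL = 3 documents.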

The resulting model, summarizing the new structure of the database and the actions to be taken, is shown in Fig. 7, where the new roles and views are highlighted in gray. Additionally, since all the new elements and modifications are related to their corresponding security action, it is easy to locate and remove undesired changes by removing the corresponding security action.

As a result of the analysis, our approach suggests the creation of two new roles, “Patients_SL3” and “Admission_Oncology”, that have exclusive access to highly sensitive information. Existing roles (and therefore users) have their access revoked and can be granted access again by assigning them the new roles. In this way, the database now has a security hierarchy that ensures data are adequately protected without duplicity of roles.

5 Discussion and limitations

Our proposed approach allows users to become aware of security flaws and to introduce security mechanisms into existing NoSQL document databases. Our approach provides users with a clear view of the main components of their document database, not only warning them about potential security flaws but also providing the mechanisms to tackle them. Still, there are some limitations that must be taken into account when applying the proposal.

First and foremost, the specific implementation of the reverse engineering process is dependent on the MongoDB API. While MongoDB is the most popular document database available, the reverse engineering process (i) depends on the evolution of the API and (ii) would need to be adapted for other NoSQL document databases. Nevertheless, the constructs used in our proposal (Collections, Fields, Roles, etc.) are generic, and the rest of the process including the ontological and security analysis can be applied to any document-oriented database since they are independent of the specific technology used. As such, our proposal would be applicable to other popular NoSQL document databases such as Apache CouchDB or Amazon DynamoDB by updating the API calls used during the process.

The case of other NoSQL technologies (key-value, columnar, graph, etc.) is different, however. As database structures differ more and more, deeper changes are required, not only in the analysis process but also in the structural part of the metamodel. As such, it is expected that the process would require a certain effort to be adapted, for example, to columnar NoSQL databases, where the way information is stored maintains certain similarities but the structures are different. More radical changes would be needed in the case of graph NoSQL databases, where not only is the structure completely different but the emphasis is also put on the relationships. Therefore, in these cases the process would need to be rebuilt from scratch.

figure 7

Security enhanced database model

Second, the current reverse-engineering process does not obtain an exact model of the database. Most notably, it does not differentiate on its own between Collections and Views in the database to be modernized. This is due to limitations in the current version of the MongoDB API. Nevertheless, the analysis process is the same, since both elements need to be checked for security issues. Furthermore, the modernization itself involves modifications that do not remove or alter existing views and collections; it only creates new ones and alters the permissions that users have over those that already exist.

Third, the process could be optimized by carrying out the ontological analysis at the same time as the reverse engineering process, thereby increasing performance by reducing the number of reads over the database. However, this would imply coupling both processes and making the ontological analysis dependent on the specific database technology used. As such, we have preferred to keep both steps decoupled, making it easier to adapt the process to other technologies, including non-document-oriented database technologies.

Fourth, the proposed process focuses on security issues related to access control. Thus, other security issues such as vulnerability to attacks, weak user passwords, etc., are considered out of the scope of the proposal. Therefore, these issues would need to be tackled by existing approaches that model attack scenarios and test the security of user accounts.

6 Conclusions

In this paper, we have proposed the first modernization approach for introducing security in NoSQL document databases, improving the security of their associated information systems and applications. It identifies and enables security mechanisms that were overlooked or not even considered during the initial implementation. Our approach adapts to each domain by using an ontology that allows users to specify what information is considered sensitive or highly sensitive. It then automatically analyzes the data and the database structure to identify security issues and propose security mechanisms that enable fine-grained access control, even when this level of security is not supported by default in the existing technology. As such, our approach can be adapted to any domain and reduces the effort and knowledge required to introduce security in NoSQL document databases.

As part of our future work, we plan to cover the entire cycle, automatically deriving the code that is required to modify the database. Furthermore, we plan to expand our approach to other NoSQL technologies, such as columnar and graph-based databases.

Change history

30 October 2021

Funding information updated.

https://www.state.gov/security-clearances (visited on April, 2021).

http://wordnetweb.princeton.edu/perl/webwn (visited on December, 2019).

This person will be responsible for processing the data of the organization or company (according to the General Data Protection Regulation of the European Union [ 25 , 26 ], this is the person who decides the purpose and the way in which the organization’s data are processed).

http://dx.doi.org/10.1155/2014/781670 (visited on December, 2019).

Michael K, Miller KW (2013) Big data: new opportunities and new challenges [guest editors’ introduction]. Computer 46:22–24


Kshetri N (2014) Big data’s impact on privacy, security and consumer welfare. Telecommun Policy 38:1134–1145

Thuraisingham B (2015) Big data security and privacy. In: Proceedings of the 5th ACM conference on data and application security and privacy, pp 279–280

Toshniwal R, Dastidar KG, Nath A (2015) Big data security issues and challenges. Int J Innov Res Adv Eng 2:15–20


Saraladevi B, Pazhaniraja N, Paul PV, Basha MS, Dhavachelvan P (2015) Big data and hadoop—a study in security perspective. Procedia Comput Sci 50:596–601

Okman L, Gal-Oz N, Gonen Y, Gudes E, Abramov J (2011) Security issues in nosql databases. In: Proceedings of the 10th IEEE international conference on trust, security and privacy in computing and communications. IEEE, pp 541–547

RENCI/NCDS, Security and privacy in the era of big data. White paper (2014)

Compagna L, El Khoury P, Krausová A, Massacci F, Zannone N (2009) How to integrate legal requirements into a requirements engineering methodology for the development of security and privacy patterns. Artif Intell Law 17:1–30

van de Riet RP (2008) Twenty-five years of mokum: for 25 years of data and knowledge engineering: Correctness by design in relation to mde and correct protocols in cyberspace. Data Knowl Eng 67:293–329

Schmidt H, Jürjens J (2011) UMLsec4UML2-adopting UMLsec to support UML2. Technical report, Technische Universitat Dortmund, Department of Computer Science

Basin D, Doser J, Lodderstedt T (2006) Model driven security: from uml models to access control infrastructures. ACM Trans Softw Eng Methodol 15:39–91

Lafuente G (2015) The big data security challenge. Netw Secur 2015:12–14

Yan S-R, Zheng X-L, Wang Y, Song WW, Zhang W-Y (2015) A graph-based comprehensive reputation model: Exploiting the social context of opinions to enhance trust in social commerce. Inf Sci 318:51–72


Wei G, Shao J, Xiang Y, Zhu P, Lu R (2015) Obtain confidentiality or/and authenticity in big data by id-based generalized signcryption. Inf Sci 318:111–122

Hou S, Huang X, Liu JK, Li J, Xu L (2015) Universal designated verifier transitive signatures for graph-based big data. Inf Sci 318:144–156

NIST, Nist big data interoperability framework: Volume 4, security and privacy, NIST Big Data Public Working Group (2017)

O’Malley O, Zhang K, Radia S, Marti R, Harrell C (2009) Hadoop security design. Technical report, Yahoo, Inc

Yuan M (2012) Study of security mechanism based on hadoop. Inf Secur Commun Privacy 6:042

Cavanillas JM, Curry E, Wahlster W (2016) New horizons for a data-driven economy: a roadmap for usage and exploitation of big data in Europe. Springer, Berlin


Moreno J, Serrano MA, Fernandez-Medina E, Fernandez EB (2018) Towards a security reference architecture for big data. In: Proceedings of the 20th international workshop on design, optimization, languages and analytical processing of Big Data (DOLAP)

Weber R (2003) Conceptual modelling and ontology: possibilities and pitfalls. J Database Manag 14:1–20

Herre H (2010) General formal ontology (gfo): A foundational ontology for conceptual modelling. In: Theory and applications of ontology: computer applications. Springer, pp 297–345

Verdonck M, Gailly F, Pergl R, Guizzardi G, Martins B, Pastor O (2019) Comparing traditional conceptual modeling with ontology-driven conceptual modeling: an empirical study. Inf Syst 81:92–103

Object Rocket (2019) Get the Name of All Keys in a MongoDB Collection

EU, Regulation (European Union) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Official Journal L 119, 04/05/2016, p 1–88, 2016

EU, Directive 2002/58/EC of the European Parliament and of the Council of 12 July 2002 concerning the processing of personal data and the protection of privacy in the electronic communications sector (Directive on privacy and electronic communications), Official Journal L 201, 31/07/2002, pp 37–47, 2002

Strack B, DeShazo JP, Gennings C, Olmo JL, Ventura S, Cios KJ, Clore JN (2014) Impact of hba1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Res Int 2014:781670

Smeeton NC (1985) Early history of the kappa statistic. Biometrics 41:795


Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was supported in part by the Spanish Ministry of Science, Innovation and Universities through the Project ECLIPSE under Grants RTI2018-094283-B-C31 and RTI2018-094283-B-C32. Furthermore, it has been funded by the AETHER-UA (PID2020-112540RB-C43) Project from the Spanish Ministry of Science and Innovation.

Author information

Authors and Affiliations

Lucentia Research Group, Department of Software and Computing Systems, University of Alicante, Alicante, Spain

Alejandro Maté, Jesús Peral & Juan Trujillo

ISTR Research Group, Department of Computer Science and Electronics, University of Cantabria, Santander, Spain

Carlos Blanco & Diego García-Saiz

GSyA Research Group, Institute of Information Technologies and Systems, Information Systems and Technologies Department, University of Castilla-La Mancha, Ciudad Real, Spain

Eduardo Fernández-Medina


Corresponding author

Correspondence to Jesús Peral .

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Maté, A., Peral, J., Trujillo, J. et al. Improving security in NoSQL document databases through model-driven modernization. Knowl Inf Syst 63, 2209–2230 (2021). https://doi.org/10.1007/s10115-021-01589-x


Received : 30 October 2020

Revised : 14 June 2021

Accepted : 19 June 2021

Published : 13 July 2021

Issue Date : August 2021

DOI : https://doi.org/10.1007/s10115-021-01589-x


  • NoSQL databases
  • Modernization process

What is NoSQL?

NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads.

In this article, you'll learn what a NoSQL database is, why (and when!) you should use one, and how to get started.

This article will cover:

  • Brief History of NoSQL Databases
  • NoSQL Database Features
  • Types of NoSQL Database
  • Difference between RDBMS and NoSQL
  • When should NoSQL be Used?
  • NoSQL Database Misconceptions
  • NoSQL Query Tutorial

What is a NoSQL database?

When people use the term “NoSQL database”, they typically use it to refer to any non-relational database. Some say the term “NoSQL” stands for “non SQL” while others say it stands for “not only SQL”. Either way, most agree that NoSQL databases are databases that store data in a format other than relational tables.

Brief history of NoSQL databases

NoSQL databases emerged in the late 2000s as the cost of storage dramatically decreased. Gone were the days of needing to create a complex, difficult-to-manage data model in order to avoid data duplication. Developers (rather than storage) were becoming the primary cost of software development, so NoSQL databases optimized for developer productivity.

Additionally, the Agile Manifesto was rising in popularity, and software engineers were rethinking the way they developed software. They were recognizing the need to rapidly adapt to changing requirements. They needed the ability to iterate quickly and make changes throughout their software stack — all the way down to the database. NoSQL databases gave them this flexibility.

Cloud computing also rose in popularity, and developers began using public clouds to host their applications and data. They wanted the ability to distribute data across multiple servers and regions to make their applications resilient, to scale out instead of scale up, and to intelligently geo-place their data. Some NoSQL databases like MongoDB provide these capabilities.

NoSQL database features

Each NoSQL database has its own unique features. At a high level, many NoSQL databases have the following features:

  • Flexible schemas
  • Horizontal scaling
  • Fast queries due to the data model
  • Ease of use for developers

Check out What are the Benefits of NoSQL Databases? to learn more about each of the features listed above.
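To make the "flexible schemas" point concrete: documents in the same collection need not share the same fields. A minimal Python sketch, with plain dicts standing in for stored documents (the names and fields are invented for illustration):

```python
# Two "documents" in the same hypothetical collection: they share _id and
# name, but each carries its own extra fields -- no schema migration is
# needed to add a field to new documents.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Grace", "languages": ["COBOL", "FORTRAN"]},
]

# Code that reads the collection handles optional fields explicitly.
for user in users:
    email = user.get("email", "<no email on file>")
    print(user["name"], email)
```

The trade-off is that the reading side, rather than the database, decides what to do when a field is absent.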

Types of NoSQL databases

Over time, four major types of NoSQL databases emerged: document databases, key-value databases, wide-column stores, and graph databases.

  • Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or objects.
  • Key-value databases are a simpler type of database where each item contains keys and values.
  • Wide-column stores store data in tables, rows, and dynamic columns.
  • Graph databases store data in nodes and edges. Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes.
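The four data models above can be sketched as plain Python literals. This illustrates shape only, not any particular database's storage format; all names and values are invented:

```python
# Document: nested fields and arrays in a single record.
document = {"_id": 1, "name": "Ada", "hobbies": ["chess", "walking"]}

# Key-value: an opaque value looked up by a single key.
key_value = {"session:1234": "eyJ1c2VyIjogIkFkYSJ9"}

# Wide-column: rows in a table, each with its own dynamic set of columns.
wide_column = {
    "row-1": {"name": "Ada", "city": "London"},
    "row-2": {"name": "Grace"},  # no "city" column for this row
}

# Graph: nodes plus edges describing relationships between them.
graph = {
    "nodes": {"ada": {"kind": "person"}, "london": {"kind": "city"}},
    "edges": [("ada", "LIVES_IN", "london")],
}
```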

To learn more, visit Understanding the Different Types of NoSQL Databases .

Difference between RDBMS and NoSQL databases

While a variety of differences exist between relational database management systems (RDBMS) and NoSQL databases, one of the key differences is the way the data is modeled in the database. In this section, we'll work through an example of modeling the same data in a relational database and a NoSQL database. Then, we'll highlight some of the other key differences between relational databases and NoSQL databases.

RDBMS vs NoSQL: Data Modeling Example

Let's consider an example of storing information about a user and their hobbies. We need to store a user's first name, last name, cell phone number, city, and hobbies.

In a relational database, we'd likely create two tables: one for Users and one for Hobbies.

In order to retrieve all of the information about a user and their hobbies, information from the Users table and Hobbies table will need to be joined together.

The data model we design for a NoSQL database will depend on the type of NoSQL database we choose. Let's consider how to store the same information about a user and their hobbies in a document database like MongoDB.

In order to retrieve all of the information about a user and their hobbies, a single document can be retrieved from the database. No joins are required, resulting in faster queries.
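The contrast described above can be sketched in Python, with plain dicts standing in for rows and documents (the field values are invented for illustration):

```python
# Relational shape: two tables, joined on user_id at query time.
users_table = [
    {"user_id": 1, "first_name": "Leslie", "last_name": "Yepp",
     "cell": "8125552344", "city": "Pawnee"},
]
hobbies_table = [
    {"hobby_id": 10, "user_id": 1, "hobby": "scrapbooking"},
    {"hobby_id": 11, "user_id": 1, "hobby": "eating waffles"},
]

# Document shape: one document holds the user and their hobbies,
# so a single read returns everything -- no join required.
user_doc = {
    "_id": 1,
    "first_name": "Leslie",
    "last_name": "Yepp",
    "cell": "8125552344",
    "city": "Pawnee",
    "hobbies": ["scrapbooking", "eating waffles"],
}

# Simulating the relational retrieval: filter the hobbies table by user_id.
joined = [h["hobby"] for h in hobbies_table if h["user_id"] == 1]
assert joined == user_doc["hobbies"]
```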

To see a more detailed version of this data modeling example, read Mapping Terms and Concepts from SQL to MongoDB .

Other differences between RDBMS and NoSQL databases

While the example above highlights the differences in data models between relational databases and NoSQL databases, many other important differences exist, including:

  • Flexibility of the schema
  • Scaling technique
  • Support for transactions
  • Reliance on data to object mapping

To learn more about the differences between relational databases and NoSQL databases, visit NoSQL vs SQL Databases, or watch the From RDBMS to NoSQL presentation from AWS re:Invent 2022.

NoSQL databases are used in nearly every industry. Use cases range from the highly critical (e.g., storing financial data and healthcare records) to the more fun and frivolous (e.g., storing IoT readings from a smart kitty litter box).

In the following sections, we'll explore when you should choose to use a NoSQL database and common misconceptions about NoSQL databases.

When should NoSQL be used?

When deciding which database to use, decision-makers typically find one or more of the following factors lead them to selecting a NoSQL database:

  • Fast-paced Agile development
  • Storage of structured and semi-structured data
  • Huge volumes of data
  • Requirements for scale-out architecture
  • Modern application paradigms like microservices and real-time streaming

See When to Use NoSQL Databases and Exploring NoSQL Database Examples for more detailed information on the reasons listed above.

NoSQL database misconceptions

Over the years, many misconceptions about NoSQL databases have spread throughout the developer community. In this section, we'll discuss two of the most common misconceptions:

  • Relationship data is best suited for relational databases.
  • NoSQL databases don't support ACID transactions.

To learn more about common misconceptions, read Everything You Know About MongoDB is Wrong .

Misconception: relationship data is best suited for relational databases

A common misconception is that NoSQL databases or non-relational databases don’t store relationship data well. NoSQL databases can store relationship data — they just store it differently than relational databases do.

In fact, when compared with relational databases , many find modeling relationship data in NoSQL databases to be easier than in relational databases, because related data doesn’t have to be split between tables. NoSQL data models allow related data to be nested within a single data structure.

Misconception: NoSQL databases don't support ACID transactions

Another common misconception is that NoSQL databases don't support ACID transactions. Some NoSQL databases like MongoDB do, in fact, support ACID transactions .

Note that the way data is modeled in NoSQL databases can eliminate the need for multi-record transactions in many use cases. Consider the earlier example where we stored information about a user and their hobbies in both a relational database and a document database. In order to ensure information about a user and their hobbies was updated together in a relational database, we'd need to use a transaction to update records in two tables. In order to do the same in a document database, we could update a single document — no multi-record transaction required.
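A small Python sketch of that point, with a dict standing in for the stored document (hypothetical fields):

```python
# Document shape: the user and their hobbies live in one document, so a
# single-document update changes both together. (In MongoDB, a write to
# one document is atomic; no multi-record transaction is needed.)
user_doc = {"_id": 1, "city": "Pawnee", "hobbies": ["scrapbooking"]}

def move_and_add_hobby(doc, new_city, new_hobby):
    """Apply both changes to the one document."""
    doc["city"] = new_city
    doc["hobbies"].append(new_hobby)
    return doc

# In the relational shape, the same change would span the Users and
# Hobbies tables and need a transaction to stay consistent.
move_and_add_hobby(user_doc, "Eagleton", "jogging")
print(user_doc["city"], user_doc["hobbies"])  # Eagleton ['scrapbooking', 'jogging']
```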

NoSQL query tutorial

A variety of NoSQL databases exist. Today, we'll be trying MongoDB, the world's most popular NoSQL database according to DB-Engines.

In this tutorial, you'll load a sample database and learn to query it — all without installing anything on your computer or paying anything.

Authenticate to MongoDB Atlas

The easiest way to get started with MongoDB is MongoDB Atlas . Atlas is MongoDB's fully managed database-as-a-service. Atlas has a forever free tier, which is what you'll be using today.

  • Navigate to Atlas .
  • Create an account if you haven't already.
  • Log into Atlas.
  • Create an Atlas organization and project.

For more information on how to complete the steps above, visit the official MongoDB documentation on creating an Atlas account .

Create a cluster and a database

A cluster is a place where you can store your MongoDB databases. In this section, you'll create a free cluster.

Once you have a cluster, you can begin storing data in Atlas. You could choose to manually create a database in the Atlas Data Explorer, in the MongoDB Shell, in MongoDB Compass, or using your favorite programming language. Instead, in this example, you will import Atlas's sample dataset.

  • Create a free cluster by following the steps in the official MongoDB documentation .
  • Load the sample dataset by following the instructions in the official MongoDB documentation .

Loading the sample dataset will take several minutes.

While we don't need to think about database design for this tutorial, note that database design and data modeling are major factors in MongoDB performance. Learn more about best practices for modeling data in MongoDB:

  • MongoDB Schema Design Patterns Blog Series
  • MongoDB Schema Design Anti-Patterns Blog Series
  • Free MongoDB University Course: M320 Data Modeling

Query the database

Now that you have data in your cluster, let's query it! Just like you had multiple ways to create a database, you have multiple options for querying a database: in the Atlas Data Explorer, in the MongoDB Shell, in MongoDB Compass, or using your favorite programming language.

In this section, you’ll query the database using the Atlas Data Explorer. This is a good way to get started querying, as it requires zero setup.

  • 1. Navigate to the Data Explorer (the Collections tab), if you are not already there. See the official MongoDB documentation for information on how to navigate to the Data Explorer. The left panel of the Data Explorer displays a list of databases and collections in the current cluster. The right panel of the Data Explorer displays a list of documents in the current collection.

A screenshot of the Collections tab in Atlas

  • 2. Select the sample_mflix database in the left panel.
  • 3. Select the movies collection. The Find View is displayed in the right panel, showing the first twenty documents of the results.
  • 4. You are now ready to query the movies collection. Let's query for the movie Pride and Prejudice. In the query bar, input { title: "Pride and Prejudice" } and click Apply.

Two documents with the title “Pride and Prejudice” are returned.

A screenshot of the query bar and results in the Atlas Data Explorer. A query { title: "Pride and Prejudice"} is in the query bar. Two documents with the title "Pride and Prejudice" are returned.
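For intuition, the filter { title: "Pride and Prejudice" } is an equality match on a field. A toy Python re-implementation of that matching rule over an in-memory list (the sample years are illustrative, not taken from the actual dataset):

```python
def matches(doc, query):
    """Minimal imitation of MongoDB's equality filter: every field in the
    query must be present in the document with an equal value."""
    return all(doc.get(field) == value for field, value in query.items())

movies = [
    {"title": "Pride and Prejudice", "year": 1940},
    {"title": "Pride and Prejudice", "year": 2005},
    {"title": "Sense and Sensibility", "year": 1995},
]

results = [m for m in movies if matches(m, {"title": "Pride and Prejudice"})]
print(len(results))  # 2 -- two documents match, as in the Data Explorer example
```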

Congrats! You've successfully queried a NoSQL database!

Continue exploring your data

In this tutorial, we only scratched the surface of what you can do in MongoDB and Atlas. Continue interacting with your data by using the Data Explorer to insert new documents, edit existing documents, and delete documents.

When you want to visualize your data, check out MongoDB Charts . Charts is the easiest way to visualize data stored in Atlas and Atlas Data Lake. Charts allows you to create dashboards that are filled with visualizations of your data.

NoSQL databases provide a variety of benefits including flexible data models, horizontal scaling, lightning fast queries, and ease of use for developers. NoSQL databases come in a variety of types including document databases, key-value databases, wide-column stores, and graph databases.

MongoDB is the world's most popular NoSQL database . Learn more about MongoDB Atlas , and give the free tier a try.

Excited to learn more now that you have your own Atlas account? Head over to MongoDB University where you can get free online training from MongoDB engineers and earn a MongoDB certification . The Quick Start Tutorials are another great place to begin; they will get you up and running quickly with your favorite programming language.

Follow this tutorial with MongoDB Atlas

What are the advantages of NoSQL?

Many NoSQL databases have the following advantages:

What is eventual consistency?

What is the CAP theorem?

What is NoSQL used for?

NoSQL databases are used in nearly every industry for a variety of use cases .

The type of NoSQL database determines the typical use case. For example, document databases like MongoDB are general purpose databases. Key-value databases are ideal for large volumes of data with simple lookup queries. Wide-column stores work well for use cases with large amounts of data and predictable query patterns. Graph databases excel at analyzing and traversing relationships between data. See Understanding the Different Types of NoSQL Databases for more information.

How do I write a NoSQL query?

Is NoSQL hard to learn?

No, NoSQL databases are not hard to learn. In fact, many developers find modeling data in NoSQL databases to be incredibly intuitive. For example, documents in MongoDB map to data structures in most popular programming languages, making programming faster and easier.

Note that those with training and experience in relational databases will likely face a bit of a learning curve as they adjust to new ways of modeling data in NoSQL databases .

Is JSON a NoSQL?

What language is used to query NoSQL?

Does NoSQL have a schema?

Learn more about key differences between NoSQL vs SQL Databases

Related NoSQL Resources

  • What are the main differences between NoSQL and SQL?
  • When should you use a NoSQL database?
  • What are the 4 different types of NoSQL databases?
  • NoSQL Databases Advantages
  • NoSQL data modeling and schema design
  • NoSQL Database Examples
  • MongoDB Compatibility
  • MongoDB Basics
  • Learn About Databases
  • Languages compatible with MongoDB


NoSQL Use Cases: When to Use a Non-Relational Database

Rich Edwards

For decades, many companies have relied on relational databases to store, protect, and access their data. SQL databases, in particular, worked well for a long time and still do for many use cases. But, today, there are a wide range of situations where SQL databases can no longer satisfy the needs of modern enterprises, especially those that have made the move to the cloud. Increasingly, these companies are turning to NoSQL databases to meet their goals. 

NoSQL databases are likely the better choice when:

  • You have a large volume and variety of data
  • Scalability is a top priority
  • You need continuous availability
  • Working with big data or performing real-time analytics

While this will often make a NoSQL database the right choice, there are many things to consider before making the move. In this post, we’ll explore when NoSQL use cases make sense. First, let’s take a closer look at NoSQL.

What is NoSQL?

NoSQL is short for “not only SQL,” or “non-SQL.” It’s a term used to describe databases that are not relational. To better understand NoSQL databases, let’s first take a look at their alternative, SQL databases.

Developed in the early 1970s, a time when data storage was extremely expensive, SQL databases attempt to minimize data duplication between tables. While extremely organized, this also makes them extremely inflexible and difficult to modify. Since then, the cost of storage has plummeted, while the cost of developer time has dramatically increased. With NoSQL databases, developers are no longer limited to the rigid, tabular approach of relational databases and have far more flexibility to do their best work. 

NoSQL comes with many benefits, including:

  • The choice of several database types—key-value, document, tabular (or wide column), graph, and multi-model—so you can find the best fit for your data.
  • The flexibility to easily store and access a wide variety of data types together, without upfront planning. The data types can include structured, semi-structured, unstructured, and polymorphic data.
  • The ability to add new data types and fields to the database without having to redefine the data model.
  • Built-in, horizontal scalability that can handle rapid growth and is much less costly than attempting to scale out a SQL database.
  • Continuous availability and strong resilience, due to its horizontal scaling approach.
  • Ease of use for developers that fits well with modern, Agile teams.

Learn more about NoSQL .

Comparing NoSQL to SQL

While NoSQL databases have many advantages, they’re not the right choice for every situation. Sometimes sticking with a tried-and-true SQL database is the way to go. Let’s compare SQL and NoSQL databases across several factors. Think about how each would apply to your data profile and use cases.

NoSQL use cases

As you can see, making the choice between a SQL and NoSQL database is not always a straightforward decision. Each has its advantages and disadvantages. Making the right choice depends on your organization’s specific data environment, along with your current needs and future goals. Many development teams actually use both within their cloud data architecture, sometimes even within the same application—deploying each to cover the areas they handle best.

So, what are the non-relational use cases? Here are several where NoSQL has been proven to make sense:

  • Fraud detection and identity authentication
  • Inventory and catalog management
  • Personalization, recommendations, and customer experience
  • Internet of things (IoT) and sensor data
  • Financial services and payments
  • Logistics and asset management
  • Content management systems
  • Digital and media management

Let’s look at the first three NoSQL use cases more closely.

Fraud detection and identity authentication

Protecting sensitive personal data and ensuring only real customers have access to applications is understandably a top priority. Of course, this is only heightened in areas such as financial services, banking, payments, and insurance.

It’s a never-ending battle. Fraudsters are creative and nimble. They tirelessly look for new ways to break the seal and their attacks continue to rise at an alarming rate. Whether you’re trying to prevent illegitimate users from gaining access, or authenticating the identity of your customers, you have to lean heavily on your data.

It’s possible to identify patterns and anomalies to pinpoint fraud in real-time or, in some cases, even before it occurs. To do so, real-time analysis of a large volume of both historic and live data of all types is required, including but not limited to user profile, environment, geographic data, and perhaps even biometric data. And context matters. For example, a $500 withdrawal may not typically be a big deal for a particular customer, but it might raise a red flag if the attempt originates at 3 a.m. in a foreign country.
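The contextual rule described above (a routine withdrawal amount becoming suspicious at 3 a.m. in a foreign country) can be sketched as a toy scoring function in Python; the field names and thresholds are invented for illustration and are not part of any real fraud system:

```python
def fraud_score(txn, profile):
    """Toy contextual check: each factor alone may be fine, but the
    combination raises the score. Thresholds are illustrative only."""
    score = 0
    if txn["amount"] > profile["typical_max_withdrawal"]:
        score += 1
    if txn["hour"] < 5:                          # unusual time of day
        score += 1
    if txn["country"] != profile["home_country"]:
        score += 1
    return score

profile = {"typical_max_withdrawal": 600, "home_country": "US"}
ok = fraud_score({"amount": 500, "hour": 14, "country": "US"}, profile)
risky = fraud_score({"amount": 500, "hour": 3, "country": "FR"}, profile)
print(ok, risky)  # 0 2
```

A real system would learn such patterns from large volumes of historic and live data rather than hand-coded rules, which is exactly why the data requirements described above matter.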

The stakes to your company’s reputation are higher than ever. One breach or mistake can be quickly amplified with the social media megaphone. It’s a balancing act because setting restrictions too narrowly could result in a false positive rate that can adversely impact the customer experience. You want to make it as easy as possible for customers to use your application or website, while ensuring they actually are who they say they are. It’s quite a tightrope to walk.

This combination of needs, including real-time analysis, large and growing datasets, numerous data types, along with the ability to continuously analyze and conduct machine learning and AI , makes the decision to use a NoSQL database a no-brainer for fraud detection and identity authentication.

Take the case of  ACI Worldwide , a company that provides real-time payment capabilities to 19 of the top 20 banks in the world. Their transaction volume is astronomically high, processing trillions of dollars in payments every day, and their data analysis needs are complex.

While payment intelligence systems have used relational databases in the past, that approach struggles to handle growing, large-scale use cases that require complex data analysis. At some point, it becomes impractical and cost-prohibitive to build a relational database big enough to do the job. To have any chance at handling these needs, a SQL database would have to be partitioned. In addition to being extremely resource intensive and expensive, partitioning would have another drawback. For the fraud use case, all information across all dimensions is needed to make each transaction decision. To handle the ever-growing volume, inevitably, a partitioned relational database would have to decrease the window of time of past transactions evaluated. As that time window shrinks, so does the ability to detect fraud.

For effective fraud detection and identity authentication, the data types analyzed extend far beyond transactional information. They could include anything from demographic data, help desk information from the CRM system, website interactions, historical shopping data, and much, much more. It would be impossible to develop a schema upfront that would define everything customers might want to do in the future. This environment requires the flexibility of a NoSQL database where any type of data element can be quickly added to the mix.

Using  DataStax Enterprise (DSE), ACI has improved its fraud detection rate and false positive rate, while saving their customers millions of dollars. And ACI’s call center is saving money as fewer false positive cases are routed there.

Read more about how ACI is battling fraud with a NoSQL solution .

Inventory and catalog management

NoSQL databases are known for their high availability and predictable, cost-effective, horizontal scalability. This makes them a great match for e-commerce companies with massive, growing online catalogs and loads of inventory to manage.

These organizations need the flexibility to quickly update their product mix, without volume limits. And the worst thing imaginable for them would be to have their site or application go down on Black Friday or during the Christmas holiday season.

For these reasons,  Macy’s has made the journey from relational databases to NoSQL. One of the most prominent department stores in the world, Macy’s also has one of the largest e-commerce sites, with billions in annual sales. Like ACI, Macy’s handles a massive volume of data that is diverse and growing. Before the move to NoSQL, the company had a heavily normalized database that limited their ability to scale their catalog and online inventory. Now that DSE and a NoSQL database are in place, this is no longer a source of concern for the Macy’s team.

With their NoSQL database setup, Macy’s can now:

  • Handle traffic growth and massive volumes of data
  • Easily and cost-effectively scale
  • Provide faster catalog refreshes
  • Grow its online catalog and number of products
  • Analyze its catalog and inventory in real time

Learn more about Macy’s move to NoSQL.

Personalization, recommendations, and customer experience

Providing a fast, personalized experience is no longer a differentiator. Today, it’s table stakes. Customers expect a consistent, high-quality, tailored experience from your brand, 24/7, across all devices. 

They take it for granted. They demand near real-time interactions and relevant recommendations. While it’s still possible to carve out unique, memorable experiences, the first priority is to make sure you have these bases covered. If you don’t, that’s what they’ll remember. And, if that happens, you run the risk of them turning to Twitter or Facebook and amplifying your shortcomings. NoSQL databases are the answer to power the individualized experiences that will keep your customers happy.

That’s because NoSQL databases:

  • Have fast response times with extremely low latency, even as a customer base expands
  • Can handle all types of data, structured and unstructured, from a variety of sources
  • Are built to cost-effectively scale, with the ability to store, manage, query, and modify extremely large volumes of data and concurrently deliver personalized experiences to millions of customers
  • Are extremely flexible, so you can continuously innovate and improve the customer experience
  • Can seamlessly capture, integrate, and analyze new data that is continuously flowing in
  • Are adept at being the backbone for the machine learning and AI engine algorithms that provide recommendations and power personalization

By focusing on providing intuitive, superior online customer experiences from the start,  Macquarie Bank , an Australian financial services company, was able to move from no retail banking presence to a top contender in the digital banking space in less than two years. Their focus on truly understanding customer behavior and prioritizing personalization has been a key to their success. So, it’s no surprise they use a NoSQL database (Apache Cassandra with DataStax Enterprise) to provide their customers with near real-time recommendations, interactions, and insights.

Read more about how Macquarie uses NoSQL to provide personalization for its customers.

Do you have a NoSQL use case?

Hopefully, this post and the non-relational database examples above have provided some guidance about when using a NoSQL database would be the smart move. So, what’s the next step if you determine your company does indeed have NoSQL use cases?

A great place to start is to schedule a demo for DataStax Astra DB, a scale-out NoSQL database built on Apache Cassandra™.

Or, if you want to jump right in, you can get started with Astra DB for free.


Introduction to NoSQL


NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data. Unlike traditional relational databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can adapt to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.

The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a wide range of different database architectures and data models.

NoSQL databases are generally classified into four main categories:

  • Document databases: These databases store data as semi-structured documents, such as JSON or XML, and can be queried using document-oriented query languages.
  • Key-value stores: These databases store data as key-value pairs, and are optimized for simple and fast read/write operations.
  • Column-family stores: These databases store data as column families, which are sets of columns that are treated as a single entity. They are optimized for fast and efficient querying of large amounts of data.
  • Graph databases: These databases store data as nodes and edges, and are designed to handle complex relationships between data.
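To make the four categories concrete, here is a hypothetical sketch (plain Python structures, no real database drivers; all names and values are invented) of how the same user record might be shaped under each model:

```python
# Hypothetical sketch: one "user" record shaped for each NoSQL category.
# Plain Python structures stand in for each data model.

# 1. Document database (MongoDB-style): one self-contained JSON-like document.
document = {
    "_id": "user:42",
    "name": "Asha",
    "email": "asha@example.com",
    "orders": [{"sku": "A1", "qty": 2}],  # nested data lives inside the document
}

# 2. Key-value store (Redis-style): an opaque value looked up by key.
key_value = {"user:42": '{"name": "Asha", "email": "asha@example.com"}'}

# 3. Column-family store (Cassandra-style): row key -> named column families.
column_family = {
    "user:42": {
        "profile": {"name": "Asha", "email": "asha@example.com"},
        "activity": {"last_login": "2024-05-01"},
    }
}

# 4. Graph database (Neo4j-style): nodes plus explicit, first-class edges.
graph = {
    "nodes": {"user:42": {"name": "Asha"}, "product:A1": {"sku": "A1"}},
    "edges": [("user:42", "PURCHASED", "product:A1")],
}

print(document["orders"][0]["sku"])         # a document read walks the nesting
print(column_family["user:42"]["profile"])  # a column-family read targets one family
```

The point of the sketch is only the shape of the data: the same information is nested in one place for a document store, opaque behind a key for a key-value store, grouped into families for a column store, and split into nodes and relationships for a graph store.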

NoSQL databases are often used in applications where there is a high volume of data that needs to be processed and analyzed in real-time, such as social media analytics, e-commerce, and gaming. They can also be used for other applications, such as content management systems, document management, and customer relationship management.

However, NoSQL databases may not be suitable for all applications, as they may not provide the same level of data consistency and transactional guarantees as traditional relational databases. It is important to carefully evaluate the specific needs of an application when choosing a database management system.

NoSQL, originally referring to "non-SQL" or "non-relational," is a database that provides a mechanism for the storage and retrieval of data modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century. NoSQL databases are used in real-time web applications and big data, and their use is increasing over time.

  • NoSQL systems are also sometimes called "Not only SQL" to emphasize the fact that they may support SQL-like query languages. A NoSQL database offers simplicity of design, simpler horizontal scaling to clusters of machines, and finer control over availability. The data structures used by NoSQL databases differ from those used by default in relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must solve.
  • NoSQL databases have gained popularity in recent years due to their scalability and flexibility. They are designed to handle large amounts of unstructured or semi-structured data and can accommodate dynamic changes to the data model, which makes them a good fit for modern web applications, real-time analytics, and big data processing.
  • Data structures used by NoSQL databases are sometimes also viewed as more flexible than relational database tables. Many NoSQL stores compromise consistency in favor of availability, speed, and partition tolerance. Barriers to greater adoption of NoSQL stores include the use of low-level query languages, lack of standardized interfaces, and huge previous investments in existing relational databases.
  • Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability) transactions, but a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB, have made them central to their designs.
  • Most NoSQL databases offer eventual consistency, in which database changes are propagated to all nodes over time, so queries might not return updated data immediately or might read data that is no longer accurate, a problem known as stale reads. Some NoSQL systems may also exhibit lost writes and other forms of data loss; some provide mechanisms such as write-ahead logging to avoid it.
  • One simple example of a NoSQL database is a document database. In a document database, data is stored in documents rather than tables, and each document can contain a different set of fields, making it easy to accommodate changing data requirements.
  • Take, for instance, a database that holds data about employees. In a relational database, this information might be split across tables, with one table for employee information and another for department information. In a document database, each employee would be stored as a separate document, with all of their information contained within that document.
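The employee example above can be sketched in a few lines of Python (a hypothetical illustration; the table layouts, field names, and values are invented):

```python
# Relational shape: two normalized tables, linked by department_id.
employees_table = [
    {"id": 1, "name": "Ravi", "department_id": 10},
]
departments_table = [
    {"id": 10, "name": "Engineering"},
]

def employee_with_department(emp_id):
    """A join done by hand, the way a relational query would resolve it."""
    emp = next(e for e in employees_table if e["id"] == emp_id)
    dept = next(d for d in departments_table if d["id"] == emp["department_id"])
    return {**emp, "department": dept["name"]}

# Document shape: everything about the employee lives in one document,
# so the read needs no join at all.
employee_document = {
    "_id": 1,
    "name": "Ravi",
    "department": {"id": 10, "name": "Engineering"},
}

print(employee_with_department(1)["department"])  # resolved via a join
print(employee_document["department"]["name"])    # resolved in a single read
```

Both shapes answer the same question; the document model trades the cost of the join for duplication of the department data inside each employee document.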

Key Features of NoSQL:

  • Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing data structures without the need for migrations or schema alterations.
  • Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a database cluster, making them well-suited for handling large amounts of data and high levels of traffic.
  • Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a schema-less semi-structured format, such as JSON or BSON.
  • Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data is stored as a collection of key-value pairs.
  • Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model, where data is organized into columns instead of rows.
  • Distributed and high availability: NoSQL databases are often designed to be highly available and to automatically handle node failures and data replication across multiple nodes in a database cluster.
  • Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic manner, with support for multiple data types and changing data structures.
  • Performance: NoSQL databases are optimized for high performance and can handle a high volume of reads and writes, making them suitable for big data and real-time applications.
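The "dynamic schema" feature above can be demonstrated with a toy in-memory collection (a hypothetical stand-in, not a real database driver; the helper names `insert` and `find` are invented):

```python
# A toy in-memory "collection" illustrating dynamic schemas.
collection = []

def insert(doc):
    collection.append(doc)

def find(**criteria):
    """Return documents whose fields match all the given criteria."""
    return [d for d in collection if all(d.get(k) == v for k, v in criteria.items())]

# Documents with different shapes coexist -- no ALTER TABLE, no migration.
insert({"name": "Lena", "email": "lena@example.com"})
insert({"name": "Omar", "phone": "+91-555-0100", "tags": ["vip"]})
insert({"name": "Lena", "loyalty_points": 240})

print(find(name="Lena"))           # matches both Lena documents despite differing fields
print(find(phone="+91-555-0100"))  # a field only one document has is still queryable
```

A relational table would require every row to share one column set (or hold many NULLs); here each document simply carries whatever fields it needs.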

Advantages of NoSQL: There are many advantages of working with NoSQL databases such as MongoDB and Cassandra. The main advantages are high scalability and high availability.

  • High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means partitioning the data and placing it on multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to an existing machine, whereas horizontal scaling means adding more machines to handle the data; horizontal scaling is far easier to implement. Examples of horizontally scaling databases are MongoDB and Cassandra. Because of this scalability, NoSQL can handle huge amounts of data: as the data grows, the database scales out to handle it efficiently.
  • Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which means that they can accommodate dynamic changes to the data model. This makes NoSQL databases a good fit for applications that need to handle changing data requirements.
  • High availability: The auto-replication feature in NoSQL databases makes them highly available, because in case of any failure the data replicates itself back to the previous consistent state.
  • Performance: NoSQL databases are designed to handle large amounts of data and traffic, which means that they can offer improved performance compared to traditional relational databases.
  • Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational databases, as they are typically less complex and do not require expensive hardware or software.
  • Agility: Their flexible schemas make NoSQL databases well suited to agile, iterative development.
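The sharding idea in the high-scalability bullet can be sketched as hash-based routing (a minimal, hypothetical sketch: the dicts stand in for machines, and real systems typically use consistent hashing so that adding a machine moves only a few keys):

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one machine

def shard_for(key: str) -> int:
    """Deterministically route a key to a shard via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

# Spread 1,000 records across the shards; each read or write touches
# only the one shard that owns the key.
for i in range(1000):
    put(f"user:{i}", {"n": i})

print(get("user:7"))             # the read goes straight to the owning shard
print([len(s) for s in shards])  # keys are spread across all four shards
```

Because the hash is stable, every client computes the same owner for a key without any central lookup, which is what lets capacity grow by adding machines rather than by buying a bigger one.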

Disadvantages of NoSQL: NoSQL has the following disadvantages.

  • Lack of standardization: There are many different types of NoSQL databases, each with its own unique strengths and weaknesses. This lack of standardization can make it difficult to choose the right database for a specific application.
  • Lack of ACID compliance: NoSQL databases are not fully ACID-compliant, which means they do not guarantee the consistency, integrity, and durability of data. This can be a drawback for applications that require strong data consistency guarantees.
  • Narrow focus: NoSQL databases have a narrow focus: they are designed mainly for storage and provide little functionality beyond it. Relational databases remain a better choice for transaction management.
  • Open source: NoSQL databases are mostly open source, and there is no reliable standard for NoSQL yet; in other words, two database systems are likely to be incompatible.
  • Lack of support for complex queries: NoSQL databases are not designed to handle complex queries, which means that they are not a good fit for applications that require complex data analysis or reporting.
  • Lack of maturity: NoSQL databases are relatively new and lack the maturity of traditional relational databases. This can make them less reliable and less secure than traditional databases.
  • Management challenge: The purpose of big data tools is to make managing large amounts of data as simple as possible, but it is not so easy. Data management in NoSQL is much more complex than in a relational database; NoSQL in particular has a reputation for being challenging to install and even harder to manage on a daily basis.
  • GUI is not available: GUI-mode tools for accessing the database are not widely available in the market.
  • Backup: Backup is a weak point for some NoSQL databases like MongoDB, which has no built-in approach for backing up data in a consistent manner.
  • Large document size: Some database systems like MongoDB and CouchDB store data in JSON format, so documents can be quite large, costing network bandwidth and speed, and having descriptive key names actually hurts since they increase the document size.
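The last point is easy to measure: since JSON stores the key names inside every document, descriptive keys add a fixed overhead per document. A quick sketch with invented field names:

```python
import json

# The same record with descriptive vs. abbreviated key names.
verbose = {"customer_full_name": "Asha Rao", "customer_email_address": "asha@example.com"}
terse = {"n": "Asha Rao", "e": "asha@example.com"}

verbose_bytes = len(json.dumps(verbose).encode("utf-8"))
terse_bytes = len(json.dumps(terse).encode("utf-8"))

# Key names are repeated inside every stored document, so across millions
# of documents the per-document difference multiplies.
print(verbose_bytes, terse_bytes)
print(f"overhead per document: {verbose_bytes - terse_bytes} bytes")
```

This is why some document-database schema guides suggest short key names for very large collections, trading readability for storage and bandwidth.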

Types of NoSQL database: Types of NoSQL databases, with example systems in each category, are:

  • Graph databases: Examples – Amazon Neptune, Neo4j
  • Key-value stores: Examples – Memcached, Redis, Coherence
  • Column-family stores: Examples – HBase, Bigtable, Accumulo
  • Document-based: Examples – MongoDB, CouchDB, Cloudant

When should NoSQL be used:

  • When a huge amount of data needs to be stored and retrieved.
  • When the relationships between the data you store are not that important.
  • When the data changes over time and is unstructured.
  • When support for constraints and joins is not required at the database level.
  • When the data is growing continuously and you need to scale the database regularly to handle it.

In conclusion, NoSQL databases offer several benefits over traditional relational databases, such as scalability, flexibility, and cost-effectiveness. However, they also have several drawbacks, such as a lack of standardization, lack of ACID compliance, and lack of support for complex queries. When choosing a database for a specific application, it is important to weigh the benefits and drawbacks carefully to determine the best fit.


Parasoft Equips Teams with Improved Software Testing Capabilities

Parasoft, a global leader in software testing and quality solutions, is introducing several advancements in software testing technology, equipping teams with automated AI-enhanced API test generation, microservices code coverage collection, and web accessibility testing.

Parasoft's AI journey now includes the auto-parameterization of API scenario tests generated with the OpenAI/Azure OpenAI integration to streamline the creation of test scenarios that validate data flow through a use case.

Parasoft now collects code coverage metrics from multiple parallel test executions in the same test environment for Java and .NET microservices.

Microservices code coverage can now be published under one project in Parasoft DTP, enabling an aggregated view of microservices coverage.

With support for WCAG 2.2 (Web Content Accessibility Guidelines) and new reporting capabilities in Parasoft SOAtest and DTP, rule mapping with severity classifications helps teams streamline remediation efforts.

The new updates will help a myriad of users including:

  • Java development managers
  • QA managers
  • QA directors
  • Directors of engineering

For more information about this news, visit www.parasoft.com.
