
Types of NoSQL Databases


Develop Intuition

The decision to choose a data-store is somewhat similar to buying a car, where one needs to choose from a variety of available options. There are several parameters that car buyers consider. People for whom space is of utmost importance (generally those with families) tend to buy larger SUVs.

On the other hand, someone who wants to reduce fuel costs will most likely opt for a hatchback or a sedan. Similarly, the choice of a data-store depends to a great extent on application requirements and user query patterns. If the application’s data contains multi-level relationships, then we may choose a graph-oriented database. On the other end, if the application mostly runs analytical queries, then we may opt for a column-oriented database. Let’s dive deeper into the trade-offs associated with using different types of NoSQL databases.

Fig 0: An analogy to explain different types of NoSQL data stores

Overview

The underlying model that a database uses to organize data and manage access to it is referred to as its “data model.” Relational databases store data by normalizing the data entities using entity relationships. The growth of NoSQL databases marked a shift towards modeling data around query patterns. NoSQL solutions for data that can be effectively de-normalized, with only trivial entity relationships, fall under the umbrella of aggregate-oriented databases. On the other hand, data with complex relationships (e.g., social networks) is better served by graph-based databases.

Aggregate Oriented Databases

An aggregate is effectively a collection of related objects that can be treated as a unit for data manipulation. The aggregate boundaries are decided by how data gets manipulated. For example, a customer purchase order makes a suitable aggregate when a customer is reviewing or making the order. However, it’s not useful if we have to determine the product sales history as it will require cross-aggregate joins.

Aggregate-oriented databases guarantee atomicity only at the level of a single aggregate. Therefore, they are not very efficient for transactions that span multiple aggregates; in such cases, the application needs to ensure atomicity at its own level.

Key-Value Data Model

Key-value data-stores act as a giant hash table that can be queried using a key and an efficient hashing scheme (e.g., consistent hashing). This allows key-value databases to perform data look-ups and manipulation in an extremely scalable fashion on large clusters. Such a data model is a best fit for applications that fetch the complete aggregate using a unique identifier (e.g., retrieve order info using orderId). A few examples of key-value data stores are Amazon DynamoDB and Riak.
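To build intuition for how a key on such a cluster is routed to a node, below is a minimal Java sketch of consistent hashing using a sorted hash ring; the class, method, and node names are illustrative and not the API of any particular data store.

import java.util.SortedMap;
import java.util.TreeMap;

// Minimal, illustrative consistent-hashing ring: nodes and keys are hashed onto the
// same ring, and a key is served by the first node found clockwise from its position.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Place a storage node on the ring at the position given by its hash.
    public void addNode(String nodeName) {
        ring.put(hash(nodeName), nodeName);
    }

    // Find the node responsible for a key: the first node at or after the key's
    // position, wrapping around to the start of the ring if necessary.
    public String nodeForKey(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private int hash(String value) {
        // A real system would use a stronger, better-distributed hash function.
        return value.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");
        // Adding or removing a node only remaps the keys adjacent to it on the ring.
        System.out.println("order-42 is stored on " + ring.nodeForKey("order-42"));
    }
}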

**Brain Exercise:**

We need to design a data model for storing the user-session information of a web application. The session data comprises three data entities: UserProfile, SessionData, and ShoppingCart. How would you store this information in a key-value data-store?

**We recommend that you think about the solution to this problem on your own before moving forward.**

Most of the queries to fetch the session information will use a unique identifier such as sessionId. The data model for storing the session information will look similar to Fig 1. The session information (comprising the user profile, session data, and shopping cart) will be denormalized and bundled together as a single aggregate with sessionId as the key. This reduces the overall I/O operations required for fetching the session information in real-time.

Fig 1: User Session Information in a Key-Value Store
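As a rough sketch of the model in Fig 1, the three entities can be denormalized into one aggregate, serialized (for example, to JSON), and written under sessionId as the key. The KeyValueStore interface and the serialize/deserialize helpers below are hypothetical placeholders, not the API of DynamoDB or Riak.

// Hypothetical aggregate bundling the three session entities from the exercise into one value.
class SessionAggregate {
    UserProfile userProfile;
    SessionData sessionData;
    ShoppingCart shoppingCart;
}

// Hypothetical key-value client; stores such as DynamoDB and Riak expose similar put/get operations.
interface KeyValueStore {
    void put(String key, byte[] value);
    byte[] get(String key);
}

// Write the whole denormalized aggregate under sessionId as the key ...
keyValueStore.put(sessionId, serialize(sessionAggregate));   // serialize() is a placeholder helper

// ... and fetch it back later with a single key look-up.
SessionAggregate session = deserialize(keyValueStore.get(sessionId));   // deserialize() is a placeholder helper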

However, such a data model isn’t efficient when keys are correlated and transactions are spread across aggregates. It is also not optimal for cases where queries or updates target individual values inside the aggregate (e.g., updating the unique user visits of a page).

Document Data Model

The aggregates in this data model are called documents; a document is effectively a collection of named fields and values. Unlike key-value stores, document stores expose their aggregates, which enables queries and updates on the values of document fields. Besides this, such data stores allow retrieving only a subset of the aggregate rather than the whole thing. They are suitable for use cases such as orders in e-commerce applications, web-application visitors for real-time analytics, document management, and profiles in blogging platforms. As with key-value data stores, transactions and queries that span multiple documents are not a good fit. Some of the popular document databases are MongoDB, Amazon DocumentDB, and Couchbase.

Choice between Key-Value and Document Data Stores

The data and application needs can be used to make the trade-off between key-value and document databases; both are excellent choices for many applications. If the query patterns and data structures are relatively simple and the entire aggregate needs to be fetched, key-value databases are the right choice. As the complexity of queries and entities increases, document databases become a better option, although the line between key-value and document data stores is very thin.

**Brain Exercise:**

We need to come up with a data model to support real-time analytics for product pages of a website. The data-model should contain page information that gets frequently updated, such as the number of page visits, unique page viewers, and so forth.

**We recommend that you think about the solution to this problem on your own before moving forward.**

The queries to the database will be targeted towards fetching the real-time statistics of the product pages. In Fig 2, we have shown how the data can be stored in a document data store. We have used embedded documents to store the relationships between data entities. The data model shown below represents product-page details with a one-to-many relationship between the page and its page-sections data. This not only enables us to fetch the entire page document with just one query but also to update frequently changing aggregate fields such as unique visitors and the number of page visits.

Fig 2: Real-time Page Statistics Data Model in a Document Store
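A rough sketch of how the Fig 2 model could be read and updated with the MongoDB Java driver (MongoDB being one of the document stores mentioned above) is shown below; the database name, collection name, field names, and document _id are illustrative assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class PageStatsExample {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (the connection string is an assumption).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> pages =
                    client.getDatabase("analytics").getCollection("productPages");

            // Increment frequently changing counters inside the page aggregate
            // without rewriting the whole document.
            pages.updateOne(Filters.eq("_id", "product-page-101"),
                    Updates.combine(Updates.inc("pageVisits", 1), Updates.inc("uniqueVisitors", 1)));

            // Fetch the entire page document (page details plus embedded page sections) in one query.
            Document page = pages.find(Filters.eq("_id", "product-page-101")).first();
            System.out.println(page);
        }
    }
}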

Columnar Data Model

Columnar databases model data in a format that optimizes the performance of analytical queries by reducing the overall disk I/O requirements and the amount of data loaded from disk. The following illustrations explain why columnar data stores are efficient at retrieving data into memory. Let’s assume that Table 1 contains customer payment information for a website.

Table 1: Customer Payments Data

In row-wise database storage, the consecutive columns of a row are stored sequentially in data blocks, as shown in Fig 3. Such storage is optimal for OLTP databases, where all the values of entire records are accessed frequently. However, there are two scenarios in which such storage does not use disk space effectively. In the first, the data block is smaller than the record, in which case a record spans more than one block. In the second, the block is larger than the record, so the record occupies only part of a block and the remaining space in that block is wasted.

Fig 3: Data Blocks in row-wise database storage

In columnar storage, the values of a single column for multiple rows are stored sequentially in data blocks. This makes columnar databases more storage-efficient than row-wise databases. The efficiency is further improved by using compression schemes selected specifically for the column’s data type, since each data block holds values of the same type.

Fig 4: Data Blocks in columnar data-model
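The contrast between Fig 3 and Fig 4 can also be sketched in code: a row-wise layout keeps whole records together, whereas a columnar layout keeps each column in its own contiguous array, so an aggregate over a single column scans only that column. The column names and sample values below are illustrative assumptions about Table 1.

public class RowVersusColumnLayout {
    // Row-wise layout: each record keeps all of its columns together, so an aggregate over
    // one column still pulls the other columns of every record out of the data blocks.
    static class PaymentRecord {
        long customerId;            // illustrative columns for the customer payments table
        String paymentMethod;
        double amount;

        PaymentRecord(long customerId, String paymentMethod, double amount) {
            this.customerId = customerId;
            this.paymentMethod = paymentMethod;
            this.amount = amount;
        }
    }

    public static void main(String[] args) {
        PaymentRecord[] rowStore = {
                new PaymentRecord(1, "card", 20.0),
                new PaymentRecord(2, "cash", 35.5)
        };
        double rowWiseTotal = 0;
        for (PaymentRecord record : rowStore) {
            rowWiseTotal += record.amount;    // walks whole records to read a single column
        }

        // Columnar layout: each column lives in its own contiguous (and compressible) array,
        // mirroring how a columnar store keeps one column per run of data blocks.
        long[] customerIdColumn = {1, 2};
        double[] amountColumn = {20.0, 35.5};
        double columnarTotal = 0;
        for (double amount : amountColumn) {
            columnarTotal += amount;          // scans only the "amount" column
        }

        System.out.println(rowWiseTotal + " == " + columnarTotal);
    }
}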

Such data stores are often used in OLAP scenarios where writes are rare and reads touch a few columns of many rows at once. An excellent example of a columnar data store is Amazon Redshift, a cloud-hosted data-warehouse solution optimized for analysis over vast quantities of data.

Graph Data Model

Relational databases use foreign keys to store relationships between data entities. However, they are not able to scale efficiently beyond relationships that are more than two levels deep, because relational databases compute connections at query time.

The advent of social networks drove the need to store more complex relationships and played a crucial role in the popularity of graph databases. These databases store data entities as nodes and the relationships between them as edges. The relationships between data entities are pre-computed and stored in the database, which is a significant improvement over relational databases since relationships no longer need to be computed at query time.

Graph-based databases are useful in social-networking applications, location-based services, or any application that requires storing complex relationships. However, such databases don’t perform well in analytical workloads where updates are applied to the properties of all the nodes. Lookup performance depends on how quickly one node can be reached from another; it is improved by storing the physical RAM addresses of adjacent nodes and caching links to directly related nodes.

Fig. 5: A graph data model of favorite TV series of a group of people

In Fig 5, we have shown a sample graph data model that stores information about a group of people along with their friends and favorite TV shows. The following snippets show sample queries run against this graph data model. The first snippet shows a query to fetch the friends of a person. Let’s assume that all the nodes are indexed in a nodeIndex.

Query 1: Get all friends of _Justin_

We fetch all relationships of the justin node that have edge direction Direction.OUTGOING and then filter for relationships of type FRIEND.

// Fetch the node containing information about Justin from the node index
Node justin = nodeIndex.get("name", "Justin");

// Collect the end node of every outgoing FRIEND relationship of Justin
List<Node> friendsOfJustin = new ArrayList<>();
for (Relationship relationship : justin.getRelationships(OUTGOING, FRIEND)) {
    friendsOfJustin.add(relationship.getEndNode());
}

Query 2: Find all people who like House of Cards and GOT and are friends with each other

// Create the set of people who like House of Cards
Set<Node> peopleLikingHouseOfCards = new HashSet<>();
Node houseOfCards = nodeIndex.get("name", "House of Cards");

for (Relationship relationship : houseOfCards.getRelationships(INCOMING, LIKES)) {
    peopleLikingHouseOfCards.add(relationship.getStartNode());
}

// Create the set of people who like GOT
Set<Node> peopleLikingGOT = new HashSet<>();
Node got = nodeIndex.get("name", "GOT");

for (Relationship relationship : got.getRelationships(INCOMING, LIKES)) {
    peopleLikingGOT.add(relationship.getStartNode());
}

// Get the friends of people liking House of Cards
Set<Node> friendsOfPeopleLikingHouseOfCards = new HashSet<>();
for (Node person : peopleLikingHouseOfCards) {
    for (Relationship relationship : person.getRelationships(OUTGOING, FRIEND)) {
        friendsOfPeopleLikingHouseOfCards.add(relationship.getEndNode());
    }
}

// Get the friends of people liking GOT
Set<Node> friendsOfPeopleLikingGOT = new HashSet<>();
for (Node person : peopleLikingGOT) {
    for (Relationship relationship : person.getRelationships(OUTGOING, FRIEND)) {
        friendsOfPeopleLikingGOT.add(relationship.getEndNode());
    }
}

// Return the intersection of friendsOfPeopleLikingHouseOfCards and friendsOfPeopleLikingGOT
friendsOfPeopleLikingHouseOfCards.retainAll(friendsOfPeopleLikingGOT);

**Brain Exercise:**

Banks lose a lot of money every year to fraud. Often, such scams are carried out by a group of individuals who form a fraud-ring. A significant opportunity to improve fraud detection lies in looking beyond the individual bank account owners to the connections that link them. Some of the common characteristics of the fraud rings are listed below.

  • A group of two or more people organizes into a fraud ring.
  • The ring shares the information required for opening a bank account, such as phone numbers, SSNs, and addresses.
  • The individual accounts in the ring hold unsecured loans, overdraft protection, credit cards, and so forth.

How will you model the data of the bank account owners so that these fraud rings can be detected by running appropriate entity link analysis queries?

**We recommend that you think about the solution to this problem on your own before moving forward.**

Real-time graph traversals at the right kinds of events can help banks identify probable fraud rings. We can model data of the bank account owners (phone number, SSN, address, credit card, and so forth) as shown in Fig 6.


Fig 6: Data Model for bank account owner details

The graph data model shown in the image above helps us run queries that detect rings using the various elements of contact information; a sketch of one such query is given after the table. For instance, in the case of the bank account owners, we can identify the following rings.

| Sl. No. | Fraud Ring | Contact Type | Ring Size |
| --- | --- | --- | --- |
| 1 | [AccountHolder1, AccountHolder2, AccountHolder3] | Address | 3 |
| 2 | [AccountHolder1, AccountHolder2] | SSN | 2 |
| 3 | [AccountHolder2, AccountHolder3] | Phone Number | 2 |

Table 2: Fraud rings detected by connected link analysis
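Continuing the Neo4j-style Java pseudocode used earlier, here is a sketch of one such link-analysis query: starting from a shared SSN node, collect every account holder connected to it. The HAS_SSN relationship type, the index key, the sample SSN value, and the reporting helper are assumptions about how the Fig 6 model is implemented.

// Find all account holders attached to a given SSN node (a candidate fraud ring)
Node ssn = nodeIndex.get("ssn", "123-45-6789");   // assumed index key and placeholder value

Set<Node> accountHoldersSharingSsn = new HashSet<>();
for (Relationship relationship : ssn.getRelationships(INCOMING, HAS_SSN)) {
    accountHoldersSharingSsn.add(relationship.getStartNode());
}

// Two or more holders attached to the same SSN form a candidate ring of that size
if (accountHoldersSharingSsn.size() >= 2) {
    reportPossibleFraudRing(accountHoldersSharingSsn);   // hypothetical reporting hook
}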

Case Study – Time Series Database

In the recent past, there has been an increasing need to store, query, and analyze time-series data. The demand for such data models is particularly high in industries such as IoT (e.g., consumer wearables such as Fitbit, sensors in oil and gas plants, etc.) and DevOps (e.g., monitoring application logs, customer events, and so forth). Two primary application needs make time-series databases unique. The first is the ability to summarize a broad scan of records and provide aggregated results with minimal latency. The second is efficient life-cycle management, which ensures that data that is no longer needed is not retained. The table below shows what sample time-series data looks like.

| Timestamp | Price | Ticker Symbol | Market |
| --- | --- | --- | --- |
| 2019-05-18T15:13:00Z | $175.80 | AAPL | NASDAQ |
| 2019-05-18T15:03:00Z | $1680.09 | AMZN | NYSE |
| 2019-05-18T14:55:00Z | $125.10 | MSFT | NASDAQ |
| 2019-05-18T14:23:00Z | $123.89 | MSFT | NYSE |
| 2019-05-18T13:15:00Z | $1702.10 | AMZN | NASDAQ |
| 2019-05-18T13:05:00Z | $174.50 | AAPL | NYSE |

Table 3: Fictional stock price data

InfluxDB is one of the most widely used time-series databases and provides several built-in features for manipulating and analyzing time-series data. InfluxDB defines its own textual format called the Line Protocol. The protocol comprises four components: measurement, tag-set, field-set, and timestamp, written as measurement,tag-set field-set timestamp (the measurement and tag-set are comma-separated, while the field-set and timestamp are space-separated). An example of the stock-price data in this format is shown below.

Fig 7: Sample stock price data represented in Line Protocol format
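For concreteness, here is a sketch of how the first few rows of Table 3 could be expressed in the Line Protocol; the field name price is an assumption, and the timestamps are the same instants written as nanosecond epoch values.

stock-price,ticker=AAPL,market=NASDAQ price=175.80 1558192380000000000
stock-price,ticker=AMZN,market=NYSE price=1680.09 1558191780000000000
stock-price,ticker=MSFT,market=NASDAQ price=125.10 1558191300000000000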

In Fig 8, we have represented the stock-price data in a graphical format. Each value plotted against a timestamp on the x-axis is called a point. The collection of points for a combination of tag-set and measurement is called a series. For instance, the red line in the graph for the measurement “stock-price” and tag-set “ticker=AMZN, market=NYSE” represents a series. The data in InfluxDB is indexed on the tag-set and timestamp, and these indexes help optimize queries that use those two components. A sample query could find the maximum stock price for ticker=AMZN, market=NYSE in a specific period.

Fig 8: Graphical representation of fictional stock-price data

**Brain Exercise:**

We need to build an application for a water treatment facility that is spread across multiple locations in the country. The application should store the pH level of the water at a given location along with the timestamp when the measurement was taken. It should allow its users to make time-bound queries (e.g., the average pH level in a particular time-frame). We have given a sneak peek of the data in the table below. How would you model the data in InfluxDB to support this use case?

| Sl. No. | Location | Measured pH | Measurement Time |
| --- | --- | --- | --- |
| 1 | Atlanta | 7.1 | 1439856000000000000 |
| 2 | Atlanta | 6.6 | 1439857000000000000 |
| 3 | New York | 8.0 | 1439857000000000000 |
| 4 | New York | 7.9 | 1439856000000000000 |
| 5 | Atlanta | 8.0 | 1439858000000000000 |
| 6 | New York | 6.9 | 1439858000000000000 |

Table 4: Sample data of the water facility

**We recommend that you think about the solution to this problem on your own before moving forward.**

The queries within the application will mostly be targeted towards monitoring the pH behavior at each of the water treatment facility’s locations over a time-frame. We can model the pH measurement data shown in Table 4 using the Line Protocol in the format shown below.

Fig 9: pH time-series data represented using Line Protocol
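For concreteness, here is a sketch of how the first few rows of Table 4 could be expressed in the Line Protocol; the measurement name pH_level and field name pH are assumptions, and the space in the New York tag value is escaped with a backslash as the protocol requires.

pH_level,location=Atlanta pH=7.1 1439856000000000000
pH_level,location=Atlanta pH=6.6 1439857000000000000
pH_level,location=New\ York pH=8.0 1439857000000000000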

The data model will create indexes on location (the tag-set) and timestamp, which is optimal for the majority of the queries this application will receive. A sample query would fetch the average pH of the water at the New York location in a specific time-frame. In Fig 10, we have provided a graphical representation of the pH time-series data to help our readers develop an intuition.

Fig 10: Graphical representation of pH time-series data

Conclusion

Today, applications are no longer tied to a single database and have the flexibility to use different data-stores depending upon their data-storage needs. The rise of SOA has also played a crucial role in ensuring that applications use the data-store that best fits the data-access pattern. In the context of SOA, a large application is a collection of sub-components (each a micro-service), and each of those micro-services may use an entirely different data-store depending on its use-case.

References

  1. http://www.informit.com/articles/article.aspx?p=2429466
  2. https://www.youtube.com/watch?v=qI_g07C_Q5I
  3. https://panoply.io/data-warehouse-guide/redshift-columnar-storage-101/
  4. https://en.wikipedia.org/wiki/Column-oriented_DBMS
  5. https://en.wikipedia.org/wiki/Graph_database
  6. https://docs.aws.amazon.com/cli/latest/reference/dynamodb/put-item.html
  7. https://docs.aws.amazon.com/documentdb/latest/developerguide/getting-started.connect.html
  8. https://docs.influxdata.com/influxdb/v1.7/introduction/getting-started/
  9. https://www.youtube.com/watch?v=OoCsY8odmpM
  10. https://neo4j.com/graphgist/bank-fraud-detection