NoSQL Database Systems: M.Tech. (IInd Sem, CE/CN)
1. The analysis of ice cream sales data in order to determine how the
number of ice cream cones sold is related to the daily
temperature. The results of such an analysis would support
decisions related to
– how much ice cream a store should order in relation to weather forecast
information.
• The term NoSQL groups together technologies, such as graph, document, and text
stores, that are by definition non-relational.
• NoSQL databases are a good option for information systems that do not require the
ACID properties. Because transactions are not strictly managed, concurrency is not
controlled by locking; other mechanisms are used instead.
• Far from replacing the relational database management systems, they complement and
extend their functional characteristics.
• Since durability is not a critical element in some web applications, it is not necessary to write
logs for each transaction. Examples of data not sensitive to durability include copies of user
session data and text messages in forums or social networks.
Challenges with NoSQL
• NoSQL has no standardized query language. Each database
management system has its own query language, which has to
be learned separately. This is true even within a single NoSQL
storage category.
• NoSQL has problems in transactions. As a result of not following
the ACID properties, there is a weaker control over the
consistency, durability and isolation of transactions.
• NoSQL has integrity problems. Ensuring the integrity of the data
requires extra manual programming and is not a function of the
DBMS. Integrity here means the usual constraints: domain,
referential, not-null, etc.
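The following is a minimal sketch of what "extra manual programming" for integrity can look like when the DBMS does not enforce constraints. The in-memory dictionaries and the `insert_order` function are hypothetical stand-ins for a real store and are not taken from any particular product.

```python
from typing import Any

# Hypothetical in-memory "collections" standing in for a NoSQL store.
customers: dict[int, dict[str, Any]] = {1: {"name": "Martin"}}
orders: dict[int, dict[str, Any]] = {}


def insert_order(order_id: int, order: dict[str, Any]) -> None:
    """Insert an order only if it passes checks an RDBMS would enforce itself."""
    # Not-null constraint: required fields must be present.
    for field in ("customerId", "price"):
        if order.get(field) is None:
            raise ValueError(f"missing required field: {field}")
    # Domain constraint: price must be a non-negative number.
    if not isinstance(order["price"], (int, float)) or order["price"] < 0:
        raise ValueError("price must be a non-negative number")
    # Referential constraint: the referenced customer must exist.
    if order["customerId"] not in customers:
        raise ValueError(f"unknown customer: {order['customerId']}")
    orders[order_id] = order


insert_order(99, {"customerId": 1, "price": 32.45})      # accepted
try:
    insert_order(100, {"customerId": 7, "price": 10.0})  # no such customer
except ValueError as err:
    print(err)
```

Every code path that writes orders has to go through such a function, which is exactly the burden that moves from the DBMS to the application.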
RDBMS or NoSQL
The decision on what type of DBMS should be used depends on
a set of factors including, but not limited to:
• The volume of data to store.
• The estimated concurrency.
• The number of operations that are done on the database per
unit of time.
• The desired scalability of the database.
• The degree of integrity and consistency that is desired.
• The nature of the data to be stored.
• The most frequent types of operations that you want to do with
the data.
CAP and BASE
Two main concepts characterize NoSQL databases: the
CAP theorem and the BASE properties. The CAP theorem has
been adopted by several web companies and by the NoSQL community. The
CAP acronym stands for the following:
• C as consistency. It refers to whether a system is in a consistent state after the
execution of an operation or not. A distributed system is considered consistent if
after an update operation by a node, the rest of the nodes see the update in a
shared data resource.
• A as availability. It means that a system is designed and implemented in such a
way that it can continue its operation if there are software or hardware problems
or a node fails.
• P as partition tolerance. It is the ability of a system to continue its operation in the
presence of network partitions. A partition happens if a set of nodes in a network loses
connectivity with other nodes. Partition tolerance can also be considered as the
ability of a system to dynamically add and remove nodes.
BASE
In short, the CAP theorem says that a shared-data system can have at most two of
the three characteristics.
• NoSQL databases, on the other hand, follow the BASE properties
instead of ACID:
– Basically available.
– Soft state (the information will expire unless it is refreshed).
– Eventual consistency.
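Eventual consistency can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: two in-memory replicas, a last-writer-wins rule based on an explicit version number, and a synchronisation round triggered by hand. The `Replica` class and `anti_entropy` function are hypothetical names, not the API of any real database.

```python
from typing import Optional


class Replica:
    """One node holding a copy of a key-value dataset."""

    def __init__(self) -> None:
        self.data: dict[str, tuple[int, str]] = {}  # key -> (version, value)

    def write(self, key: str, version: int, value: str) -> None:
        # Last-writer-wins: keep the value with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas: list[Replica]) -> None:
    """Propagate every known entry to every replica (one sync round)."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, version, value)


a, b = Replica(), Replica()
a.write("lastcity", 1, "Chicago")   # accepted by node a only
stale = b.read("lastcity")          # b has not seen the update yet
anti_entropy([a, b])                # background synchronisation
fresh = b.read("lastcity")          # b has now converged
```

Between the write and the synchronisation round the system stays available but the two nodes disagree; after the round they hold the same state.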
{
"firstname": "Pramod",
"citiesvisited": [ "Chicago", "London", "Pune", "Bangalore" ],
"addresses": [
{ "state": "AK", "city": "DILLINGHAM", "type": "R" },
{ "state": "MH", "city": "PUNE", "type": "R" }
],
"lastcity": "Chicago"
}
Column Family Database
• Column-oriented databases, as the name
implies, store the data by columns
instead of by rows.
• That is, the values of each attribute are stored
together, so that each column can be accessed
as a unit.
• Examples of column-oriented DBMSs
include Sybase IQ, Cassandra, Hypertable, and
HBase.
Column Family Databases
• Column-oriented databases are not good for queries that
need to present the whole record of an entity. In this
case, a row store is better.
• Column-oriented storage is especially efficient when reads
are massive and writes touch only a few columns. This is
because a query fetches only the data of the columns of
interest, not all the columns of a record,
which increases efficiency.
• Writes become a problem when you want to write
to all the columns of a record, since the columns work as
individual units and are not necessarily contiguous as in
row-oriented storage.
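The trade-off described above can be sketched with plain Python lists and dictionaries. This only illustrates the two layouts, not how any real column store is implemented; the sample records are made up for the example.

```python
# The same three records in a row-oriented layout.
rows = [
    {"id": 1, "city": "Chicago", "amount": 32.45},
    {"id": 2, "city": "Pune", "amount": 10.00},
    {"id": 3, "city": "London", "amount": 7.55},
]

# Column layout: one list per attribute; position i belongs to record i.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregate over one column: only the "amount" list is touched.
total = sum(columns["amount"])

# Rebuilding a whole record must visit every column.
record_2 = {name: values[1] for name, values in columns.items()}

# Writing a whole record appends to every column separately.
new_record = {"id": 4, "city": "Bangalore", "amount": 5.00}
for name, value in new_record.items():
    columns[name].append(value)
```

The aggregate reads one list, while both the full-record read and the full-record write have to touch every column, which is the cost the bullets above describe.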
Cassandra Datamodel with Column Families
Cassandra Model
Example of a column
{
name: "fullName",
value: "Martin Fowler",
timestamp: 12345667890
}
• The column has a key of fullName and
the value of Martin Fowler and has a
timestamp attached to it.
• A row is a collection of columns
attached or linked to a key.
• A collection of similar rows makes a
column family.
• When the columns in a column family
are simple columns, the column family
is known as standard column family.
Standard Column Family
//column family
{
  //row
  "pramod-sadalage" : {
    firstName: "Pramod",
    lastName: "Sadalage",
    lastVisit: "2012/12/12"
  }
  //row
  "martin-fowler" : {
    firstName: "Martin",
    lastName: "Fowler",
    location: "Boston"
  }
}
• Each column family can be compared to a container of rows in an RDBMS table, where the key identifies
the row and the row consists of multiple columns.
• The difference is that various rows do not have to have the same columns, and columns can be added to
any row at any time without having to add it to other rows.
• We have the pramod-sadalage row and the martin-fowler row with different columns; both rows are part
of the column family.
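As a rough mental model, a standard column family can be sketched as a nested map: row key, then column name, then a value with its timestamp. The `put` and `get` helpers below are hypothetical and only mimic the shape of the data; they are not Cassandra's API.

```python
import time
from typing import Any

# column family: row key -> column name -> (value, timestamp)
ColumnFamily = dict[str, dict[str, tuple[Any, int]]]


def put(cf: ColumnFamily, row_key: str, name: str, value: Any) -> None:
    """Store one column in a row, creating the row if needed."""
    cf.setdefault(row_key, {})[name] = (value, time.time_ns())


def get(cf: ColumnFamily, row_key: str, name: str) -> Any:
    """Return the column value, or None if this row lacks the column."""
    column = cf.get(row_key, {}).get(name)
    return column[0] if column else None


people: ColumnFamily = {}
put(people, "pramod-sadalage", "firstName", "Pramod")
put(people, "pramod-sadalage", "lastVisit", "2012/12/12")
put(people, "martin-fowler", "firstName", "Martin")
put(people, "martin-fowler", "location", "Boston")  # only this row has it
```

Adding `location` to one row does not require touching the other row, which is the difference from an RDBMS table noted above.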
Super Column
• When a column consists of a map of columns,
then we have a super column. A super column
consists of a name and a value which is a map
of columns. Think of a super column as a
container of columns.
Example of Super Column
{
name: "book:978-0767905923",
value: {
author: "Mitch Albom",
title: "Tuesdays with Morrie",
isbn: "978-0767905923"
}
}
Super Column Family (When we use super columns to create a column family,
we get a super column family.)
//super column family
{
  //row
  name: "billing:martin-fowler",
  value: {
    address: {
      name: "address:default",
      value: {
        fullName: "Martin Fowler",
        street: "100 N. Main Street",
        zip: "20145"
      }
    },
    billing: {
      name: "billing:default",
      value: {
        creditcard: "8888-8888-8888-8888",
        expDate: "12/2016"
      }
    }
  }
  //row
  name: "billing:pramod-sadalage",
  value: {
    address: {
      name: "address:default",
      value: {
        fullName: "Pramod Sadalage",
        street: "100 E. State Parkway",
        zip: "54130"
      }
    },
    billing: {
      name: "billing:default",
      value: {
        creditcard: "9999-8888-7777-4444",
        expDate: "01/2016"
      }
    }
  }
}
Super Column Family
• Super column families are good to keep related data
together, but when some of the columns are not
needed most of the time, the columns are still fetched
and deserialized by Cassandra, which may not be
optimal.
• Cassandra puts the standard and super column families
into keyspaces. A keyspace is similar to a database in
RDBMS where all column families related to the
application are stored. Keyspaces have to be created so
that column families can be assigned to them.
Graph Databases
• In graph-oriented databases, a graph is represented as a set of nodes (entities) interconnected
by edges (relationships).
• Graphs give importance not only to the data, but also to the relationships between them.
• Relationships can also have attributes, and queries can be made directly on relationships rather
than on the nodes.
• Because data is stored in this way, it is much more efficient to navigate between relationships than in a
relational model.
• This type of database is only useful when the information can be easily
represented as a network.
• Among the most used implementations are Neo4j, Hyperbase-DB, and InfoGrid.
• Performance and scalability vary between graph-oriented databases. They are highly flexible,
but highly complex to implement.
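A minimal sketch of the graph model, assuming a simple in-memory adjacency list: each edge carries a relationship type and its own attributes, and queries can target either nodes (traversal) or the relationships themselves. The node names, relationship types, and properties are invented for the illustration.

```python
from collections import deque

# Adjacency list: node -> list of (relationship type, properties, target).
graph = {
    "Martin": [("FRIEND", {"since": 2005}, "Pramod")],
    "Pramod": [("FRIEND", {"since": 2005}, "Martin"),
               ("VISITED", {"year": 2011}, "Chicago")],
    "Chicago": [],
}


def reachable(start: str, max_hops: int) -> set[str]:
    """Breadth-first traversal following edges up to max_hops away."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for _rel, _props, target in graph[node]:
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen - {start}


def edges_of_type(rel_type: str) -> list[tuple[str, dict, str]]:
    """Query the relationships themselves rather than the nodes."""
    return [(src, props, dst)
            for src, edges in graph.items()
            for rel, props, dst in edges if rel == rel_type]
```

Following an edge is a direct lookup from the current node, whereas a relational model would typically need a join per hop.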
Graph Database
When to use - which
Schemaless Databases
• An RDBMS restricts the kind of data that
can be stored, based on the data types of the
columns.
• NoSQL gives freedom: any document or value
can have its own fields.
• The challenge moves to the application.
• The schema moves into the application.
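What "the schema moves into the application" means in practice can be sketched as follows. The documents reuse field names from the example document earlier in these slides; the second document and the default values are assumptions chosen for the illustration.

```python
# Two documents in the same collection with different fields.
documents = [
    {"firstname": "Pramod", "lastcity": "Chicago",
     "citiesvisited": ["Chicago", "London", "Pune", "Bangalore"]},
    {"firstname": "Martin"},  # no lastcity, no citiesvisited
]


def last_city(doc: dict) -> str:
    """Implicit schema: the application decides what a missing field means."""
    return doc.get("lastcity", "unknown")


def visit_count(doc: dict) -> int:
    """A missing list of visited cities is treated as an empty list."""
    return len(doc.get("citiesvisited", []))


summary = [(d["firstname"], last_city(d), visit_count(d)) for d in documents]
```

The database accepts both documents; it is the reading code that has to know which fields may be absent and how to interpret that.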
Materialized Views
• Aggregates – Advantage/Disadvantage
• Views in RDBMS
• Materialized View
– Eager View
– Computational View
– Summary View
Modelling for Data Access
• Key-Value store
Document store with references
• Reference
Document Database with References
# Customer object
{
"customerId": 1, "customer": {
"name": "Martin",
"billingAddress": [{"city": "Chicago"}],
"payment": [{"type": "debit","ccinfo": "1000-1000-1000-1000"}],
"orders":[{"orderId":99}]
}
}
# Order object
{
"customerId": 1, "orderId": 99, "order":{
"orderDate":"Nov-20-2011", "orderItems":[{"productId":27,
"price": 32.45}],
"orderPayment":[{"ccinfo":"1000-1000-1000-1000",
"txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"}
}
}
Business Analytics with Aggregates
• Materialized view (which orders contain a given
product)
{ "itemid":27,
"orders":
[99,545,897,678]
}
{ "itemid":29,
"orders":
[199,545,704,819]
}
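A view like the one above can be precomputed from the order aggregates. The sketch below assumes orders shaped like the order object shown earlier; only order 99 with product 27 comes from the slides, the second order is made up, and `build_product_view` is a hypothetical helper name.

```python
from collections import defaultdict

# Order aggregates, reduced to the fields the view needs.
orders = [
    {"orderId": 99, "orderItems": [{"productId": 27, "price": 32.45}]},
    {"orderId": 545, "orderItems": [{"productId": 27, "price": 32.45},
                                    {"productId": 29, "price": 12.00}]},
]


def build_product_view(all_orders: list[dict]) -> dict[int, list[int]]:
    """Precompute product id -> ids of the orders that contain it."""
    view: dict[int, list[int]] = defaultdict(list)
    for order in all_orders:
        for item in order["orderItems"]:
            view[item["productId"]].append(order["orderId"])
    return dict(view)


product_view = build_product_view(orders)
```

Analytics queries then read `product_view` directly instead of scanning every order; the price is that the view has to be refreshed when orders change.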
Removing References from Document Storage
# Customer object
{
"customerId": 1, "name": "Martin",
"billingAddress": [{"city": "Chicago"}], "payment": [
{"type": "debit",
"ccinfo": "1000-1000-1000-1000"} //No need to update customer record with order id.
]
}
# Order object
{
"orderId": 99, "customerId": 1, "orderDate":"Nov-20-2011",
"orderItems":[{"productId":27, "price": 32.45}],
"orderPayment":[{"ccinfo":"1000-1000-1000-1000",
"txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"}
}
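With the reference removed from the customer object, the customer's orders are found through the `customerId` stored in each order. A minimal sketch, assuming in-memory dictionaries in place of a document store; order 100 and the helper names are hypothetical.

```python
customers = {1: {"customerId": 1, "name": "Martin"}}
orders = {
    99: {"orderId": 99, "customerId": 1, "orderDate": "Nov-20-2011"},
    100: {"orderId": 100, "customerId": 1, "orderDate": "Nov-21-2011"},
}


def place_order(order: dict) -> None:
    """Adding an order writes one document; the customer is untouched."""
    orders[order["orderId"]] = order


def orders_for_customer(customer_id: int) -> list[int]:
    """Find a customer's orders by the customerId held in each order."""
    return sorted(order_id for order_id, order in orders.items()
                  if order["customerId"] == customer_id)
```

A real document store would usually answer `orders_for_customer` from an index on `customerId` rather than a full scan; the scan here just keeps the sketch short.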
Distribution Models
Background
• NoSQL databases are designed primarily to run on clusters
• They can therefore scale better
• Mainly two paths for data distribution
– Sharding
– Replication
• Master-Slave
• Peer-to-Peer
Sharding
• Sharding is the horizontal partitioning of a large dataset into a collection of
smaller, more manageable datasets called shards.
• Shards are distributed across multiple nodes (machines),
share the same schema, and collectively represent the
complete dataset.
• Scalability is achieved by distributing the processing load
across various nodes, and it can be extended by
adding more nodes to the existing infrastructure.
• As each node is responsible only for its own data,
read and write performance is greatly improved.
Sharding
• Each shard can independently service reads and writes for
the specific subset of data that it is responsible for.
• Shards provide some resilience against failures as only the
data of failed node will be inaccessible.
• Query patterns need to be taken into account.
• Depending on the query, the data may need to be fetched
from multiple shards.
• Queries requiring data from multiple shards may hurt
performance.
• Data locality keeps commonly accessed data co-located on a
single shard and improves performance.
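The routing idea behind sharding can be sketched with a hash of the key. This is a simplified illustration that assumes a fixed number of in-memory shards and plain modulo placement; `ShardedStore` is a hypothetical class, and real systems use more elaborate schemes (for example range-based or consistent hashing) to cope with adding and removing nodes.

```python
import hashlib
from typing import Callable, Optional


class ShardedStore:
    """Key-value store split over a fixed number of in-memory shards."""

    def __init__(self, shard_count: int) -> None:
        self.shards: list[dict[str, dict]] = [{} for _ in range(shard_count)]

    def shard_for(self, key: str) -> int:
        # Stable hash, so a key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key: str, doc: dict) -> None:
        self.shards[self.shard_for(key)][key] = doc

    def get(self, key: str) -> Optional[dict]:
        # Lookup by shard key touches exactly one shard.
        return self.shards[self.shard_for(key)].get(key)

    def scan(self, predicate: Callable[[dict], bool]) -> list[dict]:
        # A query that cannot use the shard key must consult every shard.
        return [doc for shard in self.shards
                for doc in shard.values() if predicate(doc)]


store = ShardedStore(shard_count=4)
for i in range(20):
    city = "Chicago" if i % 2 == 0 else "Pune"
    store.put(f"customer:{i}", {"customerId": i, "city": city})
```

`get` is served by a single shard, while `scan` fans out to all of them, which is why the query pattern and data locality matter when choosing the shard key.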
Replication
• Replication stores multiple copies of the dataset, known
as replicas, on multiple nodes.
• The result is improved scalability and availability.
• It also provides fault tolerance.
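A minimal sketch of master-slave replication, assuming synchronous copying to every live slave and a failure simulated by a flag. The `Node` and `MasterSlaveCluster` classes are hypothetical and leave out everything a real system needs, such as failover and catching up a slave that was down.

```python
from typing import Optional


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, str] = {}
        self.alive = True


class MasterSlaveCluster:
    def __init__(self, master: Node, slaves: list[Node]) -> None:
        self.master = master
        self.slaves = slaves

    def write(self, key: str, value: str) -> None:
        # All writes go through the master.
        if not self.master.alive:
            raise RuntimeError("master is down: writes are unavailable")
        self.master.data[key] = value
        # Copy the write to every live slave (synchronously, for simplicity).
        for slave in self.slaves:
            if slave.alive:
                slave.data[key] = value

    def read(self, key: str) -> Optional[str]:
        # Reads can be served by any live node, slaves first.
        for node in [*self.slaves, self.master]:
            if node.alive:
                return node.data.get(key)
        raise RuntimeError("no live node: reads are unavailable")


cluster = MasterSlaveCluster(Node("master"), [Node("slave-1"), Node("slave-2")])
cluster.write("lastcity", "Chicago")
cluster.master.alive = False              # simulate a master failure
still_readable = cluster.read("lastcity") # a slave answers the read
```

Reads survive the loss of the master because the slaves hold copies, while writes stop until a new master is chosen; this is the read scalability and fault tolerance the bullets refer to.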