IN THE NAME OF GOD
NoSql
Contents
Preface
What is SQL?
Alternatives
Definition and Introduction of NoSQL
Context and a Bit of History
ACID
    Characteristics
        Atomicity
        Consistency
        Isolation
        Durability
Architecture
NoSQL Emerged From a Need
What is NoSQL?
NoSQL Categories
    1) Key-value Stores
    2) Column Family
    3) Document Databases
        Keys
        Retrieval
        Organization
    4) Graph Databases
Major NoSQL Players
Querying NoSQL
Future of NoSQL
Structuring data without relations
Five advantages of NoSQL
Five challenges of NoSQL
Summary
Last Word
References
Like most new and upcoming technologies, NoSQL is shrouded in a mist of fear,
uncertainty, and doubt. The world of developers is probably divided into three
groups when it comes to NoSQL:
Those who love it: People in this group are exploring how NoSQL fits in an
application stack. They are using it, creating it, and keeping abreast of the
developments in the world of NoSQL.
Those who deny it: Members of this group are either focusing on NoSQL's
shortcomings or are out to prove that it's worthless.
Those who ignore it: Developers in this group are agnostic either because they are
waiting for the technology to mature, or they believe NoSQL is a passing fad and
ignoring it will shield them from the rollercoaster ride of a hype cycle, or have
simply not had a chance to get to it.
This report looks at what NoSQL is, what its characteristics are, what constitutes
its typical use cases, and where it fits in the application stack.
SQL (sometimes referred to as Structured Query Language) is a programming
language designed for managing data in relational database management systems
(RDBMS).
Originally based upon relational algebra and tuple relational calculus, its scope
includes data insert, query, update and delete, schema creation and modification,
and data access control.
SQL was one of the first commercial languages for Edgar F. Codd's relational model,
as described in his influential 1970 paper, "A Relational Model of Data for Large
Shared Data Banks". Despite not adhering to the relational model as described by
Codd, it became the most widely used database language. Although SQL is often
described as, and to a great extent is, a declarative language, it also
includes procedural elements. SQL became a standard of the American National
Standards Institute (ANSI) in 1986, and of the International Organization for
Standardization (ISO) in 1987. Since then, the standard has been enhanced several times
with added features. However, issues of SQL code portability between major
RDBMS products still exist due to lack of full compliance with, or different
interpretations of, the standard. Among the reasons mentioned are the large size
and incomplete specification of the standard, as well as vendor lock-in.
SQL was initially developed at IBM by Donald D. Chamberlin and Raymond F.
Boyce in the early 1970s. This version, initially called SEQUEL (Structured English
Query Language), was designed to manipulate and retrieve data stored in IBM's
original quasi-relational database management system, System R, which a group
at IBM San Jose Research Laboratory had developed during the 1970s. The acronym
SEQUEL was later changed to SQL because "SEQUEL" was a trademark of
the UK-based Hawker Siddeley aircraft company.
The first Relational Database Management System (RDBMS) was RDMS,
developed at MIT in the early 1970s, soon followed by Ingres, developed in 1974
at U.C. Berkeley. Ingres implemented a query language known as QUEL, which
was later supplanted in the marketplace by SQL.
In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the
potential of the concepts described by Codd, Chamberlin, and Boyce and developed
their own SQL-based RDBMS with aspirations of selling it to the U.S.
Navy, Central Intelligence Agency, and other U.S. government agencies. In June
1979, Relational Software, Inc. introduced the first commercially available
implementation of SQL, Oracle V2 (Version 2) for VAX computers. Oracle V2 beat
IBM's August release of the System/38 RDBMS to market by a few weeks.
After testing SQL at customer test sites to determine the usefulness and
practicality of the system, IBM began developing commercial products based on
their System R prototype including System/38, SQL/DS, and DB2, which were
commercially available in 1979, 1981, and 1983, respectively.
This chart shows several of the SQL language elements that compose a single
statement.
The SQL language is subdivided into several language elements, including:
Clauses, which are constituent components of statements and queries. (In some
cases, these are optional.)
Expressions, which can produce either scalar values or tables consisting
of columns and rows of data.
Predicates, which specify conditions that can be evaluated to SQL three-valued
logic (3VL) (true/false/unknown) or Boolean truth values and which are used to
limit the effects of statements and queries, or to change program flow.
Queries, which retrieve the data based on specific criteria. This is the most
important element of SQL.
Statements, which may have a persistent effect on schemata and data, or which
may control transactions, program flow, connections, sessions, or
diagnostics. SQL statements also include the semicolon (";") statement
terminator. Though not required on every platform, it is defined as a standard
part of the SQL grammar.
Insignificant whitespace is generally ignored in SQL statements and queries,
making it easier to format SQL code for readability.
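To make the language elements listed above concrete, here is a minimal sketch that runs one annotated statement and one query through Python's built-in sqlite3 module. The country table and its values are hypothetical and exist only so the statements have something to act on; SQLite is used simply because it needs no server.

```python
import sqlite3

# Hypothetical table and data, used only to have something to run against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT PRIMARY KEY, population INTEGER)")
conn.executemany("INSERT INTO country VALUES (?, ?)", [("USA", 100), ("Iran", 50)])

# One statement, broken down by language element:
#   UPDATE / SET / WHERE  : clauses
#   population + 1        : expression (produces a scalar value)
#   name = 'USA'          : predicate (limits which rows are affected)
#   ;                     : statement terminator
conn.execute("UPDATE country SET population = population + 1 WHERE name = 'USA';")

# A query retrieves rows based on a predicate.
rows = conn.execute(
    "SELECT name, population FROM country WHERE population > 60"
).fetchall()
print(rows)
```

Only the row that satisfies the predicate is updated, and only rows that satisfy the query's predicate are returned.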
A distinction should be made between alternatives to relational query languages
and alternatives to SQL. Below are proposed relational alternatives to SQL:
.QL - object-oriented Datalog
4D Query Language (4D QL)
Datalog
HTSQL - URL-based query method
IBM Business System 12 (IBM BS12) - one of the first fully relational
  database management systems, introduced in 1982
ISBL
Java Persistence Query Language (JPQL) - the query language used by
  the Java Persistence API and Hibernate persistence library
JoSQL - runs SQL statements written as Strings to query collections from
  inside Java code
LINQ - runs SQL statements written like language constructs to query
  collections directly from inside .NET code
Object Query Language
QBE (Query By Example) - created by Moshé Zloof, IBM, 1977
Quel - introduced in 1974 by the U.C. Berkeley Ingres project
Tutorial D
SBQL - the Stack Based Query Language
UnQL - the Unstructured Query Language, a functional superset of SQL,
  developed by the authors of SQLite and CouchDB
XQuery
NoSQL is literally a combination of two words: No and SQL. The implication is that
NoSQL is a technology or product that counters SQL. The creators and early
adopters of the buzzword NoSQL probably wanted to say No RDBMS or No
relational but were infatuated by the nicer-sounding NoSQL and stuck to it. In due
course, some have proposed NonRel as an alternative to NoSQL. A few others have
tried to salvage the original term by proposing that NoSQL is actually an acronym
that expands to "Not Only SQL." Whatever the literal meaning, NoSQL is
used today as an umbrella term for all databases and data stores that don't follow
the popular and well-established RDBMS principles and often relate to large data
sets accessed and manipulated on a Web scale. This means NoSQL is not a single
product or even a single technology. It represents a class of products and a collection
of diverse, and sometimes related, concepts about data storage and manipulation.
MapReduce is a parallel programming model that allows distributed processing on
large data sets on a cluster of computers. The MapReduce framework is patented by
Google, but the ideas are freely shared and adopted in a number of open-source
implementations.
MapReduce derives its ideas and inspiration from concepts in the world of
functional programming. Map and reduce are commonly used functions in the world
of functional programming. In functional programming, a map function applies an
operation or a function to each element in a list. For example, a multiply-by-two
function on a list [1, 2, 3, 4] would generate another list as follows: [2, 4, 6, 8]. When
such functions are applied, the original list is not altered. Functional
programming believes in keeping data immutable and avoids sharing data among
multiple processes or threads.
This means the map function that was just illustrated, trivial as it may be, could be
run via two or more threads on the list and these threads would not step on
each other, because the list itself is not altered.
Like the map function, functional programming has a concept of a reduce function.
Actually, a reduce function in functional programming is more commonly known as
a fold function. A reduce or a fold function is also sometimes called an accumulate,
compress, or inject function. A reduce or fold function applies a function on all
elements of a data structure, such as a list, and produces a single result or output.
So applying a reduce function like summation on the list generated out of the map
function, that is, [2, 4, 6, 8], would generate an output equal to 20.
So map and reduce functions could be used in conjunction to process lists of data,
where a function is first applied to each member of a list and then an aggregate
function is applied to the transformed and generated list.
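The multiply-by-two and summation steps described above can be sketched in a few lines of Python, using the built-in map and functools.reduce. The variable names are illustrative only.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply multiply-by-two to each element, producing a new list.
doubled = list(map(lambda x: x * 2, numbers))

# reduce (fold): combine all elements of the new list into a single result.
total = reduce(lambda acc, x: acc + x, doubled, 0)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 20
print(numbers)  # [1, 2, 3, 4]  (the original list is not altered)
```

Because map returns a new list rather than changing the input, several workers could process slices of the same input without interfering with each other.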
This same simple idea of map and reduce has been extended to work on large data
sets. The idea is slightly modified to work on collections of tuples or key/value pairs.
The map function applies a function on every key/value pair in the collection and
generates a new collection. Then the reduce function works on the new generated
collection and applies an aggregate function to compute a final output. This is better
understood through an example, so let me present a trivial one to explain the flow.
Say you have a collection of key/value pairs as follows:
[{"94303": "Tom"}, {"94303": "Jane"}, {"94301": "Arun"}, {"94302": "Chen"}]
This is a collection of key/value pairs where the key is the zip code and the value is
the name of a person who resides within that zip code. A simple map function on this
collection could get the names of all those who reside in a particular zip code. The
output of such a map function is as follows:
[{"94303": ["Tom", "Jane"]}, {"94301": ["Arun"]}, {"94302": ["Chen"]}]
Now a reduce function could work on this output to simply count the number of
people who belong to a particular zip code. The final output then would be as follows:
[{"94303": 2}, {"94301": 1}, {"94302": 1}]
This example is extremely simple and a MapReduce mechanism seems too complex
for such a manipulation, but I hope you get the core idea behind the concepts and
the flow.
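The zip code example can be traced in a small, single-process Python sketch. It is not a distributed MapReduce framework: the function names are hypothetical, and the grouping of names by zip code, which the text folds into the map step, is written here as its own group_by_key step as an assumption about how the work might be split.

```python
from collections import defaultdict

pairs = [("94303", "Tom"), ("94303", "Jane"), ("94301", "Arun"), ("94302", "Chen")]

def map_phase(pairs):
    """Emit one (zip_code, name) pair per input record."""
    for zip_code, name in pairs:
        yield zip_code, name

def group_by_key(mapped):
    """Collect all values emitted for the same key into one list."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Aggregate each key's list of values into a single count."""
    return {key: len(values) for key, values in groups.items()}

grouped = group_by_key(map_phase(pairs))
counts = reduce_phase(grouped)

print(grouped)  # {'94303': ['Tom', 'Jane'], '94301': ['Arun'], '94302': ['Chen']}
print(counts)   # {'94303': 2, '94301': 1, '94302': 1}
```

Each key's count depends only on that key's list, so in a real cluster different keys could be reduced on different machines.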
Carlo Strozzi used the term NoSQL in 1998 to name his lightweight, open-source
relational database that did not expose the standard SQL interface. (Strozzi
suggests that, as the current NoSQL movement "departs from the relational model
altogether; it should therefore have been called more appropriately 'NoREL', or
something to that effect.")
Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009
when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source
distributed databases. The name attempted to label the emergence of a
growing number of non-relational, distributed data stores that often did not attempt
to provide ACID (atomicity, consistency, isolation, durability) guarantees, which are
the key attributes of classic relational database systems such as Sybase, IBM DB2,
MySQL, Microsoft SQL Server, PostgreSQL, Oracle RDBMS, Informix, Oracle Rdb,
etc.
In 2011, work began on UnQL (Unstructured Query Language), a specification for a
query language for NoSQL databases. It is built to query collections (versus tables)
of documents (versus rows) with loosely defined fields (versus columns). UnQL is a
superset of SQL within which SQL is a very constrained type of UnQL for which the
queries always return the same fields (same number, names and types). However,
UnQL does not cover the data definition language (DDL) SQL statements like
CREATE TABLE or CREATE INDEX.
Before I start with details on the NoSQL types and the concepts involved, it's
important to set the context in which NoSQL emerged. Non-relational databases are
not new. In fact, the first non-relational stores go back in time to when the first set
of computing machines were invented.
Non-relational databases thrived through the advent of mainframes and have
existed in specialized and specific domains (for example, hierarchical directories
for storing authentication and authorization credentials) through the years.
However, the non-relational stores that have appeared in the world of NoSQL are a
new incarnation, which were born in the world of massively scalable Internet
applications. These non-relational NoSQL stores, for the most part, were
conceived in the world of distributed and parallel computing.
Starting out with Inktomi, which could be thought of as the first true search engine,
and culminating with Google, it is clear that the widely adopted relational database
management system (RDBMS) has its own set of problems when applied to massive
amounts of data. The problems relate to efficient processing, effective
parallelization, scalability, and costs.
In computer science, ACID (atomicity, consistency, isolation, durability) is a set of
properties that guarantee that database transactions are processed reliably. In the
context of databases, a single logical operation on the data is called a transaction.
For example, a transfer of funds from one bank account to another, even though
that might involve multiple changes (such as debiting one account and crediting
another), is a single transaction.
Jim Gray defined these properties of a reliable transaction system in the late 1970s
and developed technologies to automatically achieve them. In 1983, Andreas Reuter
and Theo Härder coined the acronym ACID to describe them.
Characteristics
Atomicity
Atomicity requires that each transaction is "all or nothing": if one part of the
transaction fails, the entire transaction fails, and the database state is left
unchanged. An atomic system must guarantee atomicity in each and every
situation, including power failures, errors, and crashes.
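The bank transfer mentioned earlier gives a convenient way to see "all or nothing" in practice. The following is a minimal sketch using Python's built-in sqlite3 module; the account table, the names, and the CHECK rule that forbids negative balances are all assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint is the rule a valid state must satisfy.
conn.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one transaction; undo both if either fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK rule
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to bob has already been executed when the debit fails, yet the rollback leaves no trace of it: the balances are the same as after the first transfer.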
Consistency
The consistency property ensures that any transaction will bring the database from
one valid state to another. Any data written to the database must be valid according
to all defined rules, including but not limited to constraints, cascades, triggers, and
any combination thereof.
Isolation
Isolation refers to the requirement that no transaction should be able to interfere
with another transaction. One way of achieving this is to ensure that no
transactions that affect the same rows can run concurrently, since their sequence,
and hence the outcome, might be unpredictable. This property of ACID is often
partly relaxed due to the huge speed decrease this type of concurrency management
entails.
Durability
Durability means that once a transaction has been committed, it will remain so,
even in the event of power loss, crashes, or errors. In a relational database, for
instance, once a group of SQL statements execute, the results need to be stored
permanently. If the database crashes immediately thereafter, it should be possible
to restore the database to the state after the last transaction committed.
Typical modern relational databases have shown poor performance on certain
data-intensive applications, including indexing a large number of documents, serving
pages on high-traffic websites, and delivering streaming media. Typical RDBMS
implementations are tuned either for small but frequent read/write transactions or
for large batch transactions with rare write accesses. NoSQL, on the other hand,
can service heavy read/write workloads. Real-world NoSQL deployments include
Digg's 3 TB for green badges (markers that indicate stories voted for by others in a
social network) and Facebook's 50 TB for inbox search.
NoSQL architectures often provide weak consistency guarantees, such as eventual
consistency, or transactions restricted to single data items. Some systems, however,
provide full ACID guarantees in some instances by adding a supplementary
middleware layer (e.g., AppScale and CloudTPS). Two systems have been developed
that provide snapshot isolation for column stores: Google's Percolator system based
on BigTable, and a transactional system for HBase developed at the University of
Waterloo. These systems, developed independently, use similar concepts to achieve
multi-row distributed ACID transactions with snapshot isolation guarantee for the
underlying column store, without the extra overhead of data management,
middleware system deployment, or maintenance introduced by the middleware
layer.
Several NoSQL systems employ a distributed architecture, with the data held in a
redundant manner on several servers, often using a distributed hash table. In this
way, the system can readily scale out by adding more servers, and failure of a
server can be tolerated.
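One common way to spread keys redundantly over servers is a hash ring, sketched below. This is an illustration of the general idea only, not the scheme of any particular product; the HashRing class, the server names, and the choice of two copies per key are assumptions.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self._ring = sorted((_hash(node), node) for node in nodes)

    def add_node(self, node):
        """Scale out: a new server takes over part of the ring."""
        bisect.insort(self._ring, (_hash(node), node))

    def nodes_for(self, key):
        """Return the distinct servers that hold copies of `key`."""
        positions = [position for position, _ in self._ring]
        start = bisect.bisect(positions, _hash(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

ring = HashRing(["server-a", "server-b", "server-c"])
owners = ring.nodes_for("user:42")
print(owners)  # two different servers; which two depends on the hash values

ring.add_node("server-d")
print(ring.nodes_for("user:42"))
```

Because every key lives on two servers, the failure of one of them still leaves a copy reachable, and adding a server only changes ownership for the keys that fall next to it on the ring.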
Some NoSQL advocates promote very simple interfaces such as associative arrays
or key-value pairs. Other systems, such as native XML databases, promote support
of the XQuery standard. Newer systems such as CloudTPS also support join
queries.
Data Storage: The world's stored digital data is measured in exabytes. An exabyte is
equal to one billion gigabytes (GB) of data. According to Internet.com, the amount of
stored data added in 2006 was 161 exabytes. Just 4 years later, in 2010, the amount
of data stored was expected to reach almost 1,000 exabytes, an increase of over 500%. In
other words, there is a lot of data being stored in the world and it's just going to
continue growing.
Interconnected Data: Data continues to become more connected. The creation of the
web ushered in hyperlinks, blogs have pingbacks, and every major social network
system has tags that tie things together. Major systems are built to be
interconnected.
Complex Data Structure: NoSQL can handle hierarchical nested data structures
easily. To accomplish the same thing in SQL, you would need multiple relational
tables with all kinds of keys. In addition, there is a relationship between
performance and data complexity. Performance can degrade in a traditional
RDBMS as we store the massive amounts of data required in social networking
applications and the semantic web.
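The point about nested data can be illustrated with a small Python sketch. The blog-post document and the three table names are hypothetical; the sketch only shows how one nested record, which a document-style store could keep as a single unit, breaks apart into several key-linked tables in a relational design.

```python
# A hypothetical nested record, kept as one unit in a document-style store.
post = {
    "id": 1,
    "author": {"name": "Tom", "zip": "94303"},
    "tags": ["nosql", "databases"],
    "comments": [
        {"user": "Jane", "text": "Nice overview"},
        {"user": "Arun", "text": "What about joins?"},
    ],
}

def to_relational_rows(doc):
    """Split one nested document into rows for separate tables linked by post id."""
    posts = [(doc["id"], doc["author"]["name"], doc["author"]["zip"])]
    tags = [(doc["id"], tag) for tag in doc["tags"]]
    comments = [
        (doc["id"], position, comment["user"], comment["text"])
        for position, comment in enumerate(doc["comments"])
    ]
    return {"post": posts, "post_tag": tags, "post_comment": comments}

tables = to_relational_rows(post)
for name, rows in tables.items():
    print(name, rows)
```

Reading the post back from the relational form means joining the three tables on the post id, while the nested form is fetched in one lookup.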
One way to define NoSQL is to consider what it's not. It's not SQL and it's
not relational. As the "Not Only SQL" reading of the name suggests, it's not a
replacement for an RDBMS but complements it. NoSQL is designed for distributed
data stores for very large scale data needs. Think about Facebook with its
500,000,000 users or Twitter, which accumulates terabytes of data every single day.
In a NoSQL database, there is no fixed schema and no joins. An RDBMS "scales up"
by getting faster and faster hardware and adding memory. NoSQL, on the other
hand, can take advantage of "scaling out". Scaling out refers to spreading the load
over many commodity systems. This is the component of NoSQL that makes it an
inexpensive solution for large datasets.
The current NoSQL world fits into 4 basic categories:
1) Key-value Stores:
Key/value stores are based primarily on Amazon's Dynamo paper, which was published
in 2007. The main idea is the existence of a hash table where there is a unique key
and a pointer to a particular item of data. These mappings are usually accompanied
by cache mechanisms to maximize performance.
A HashMap or an associative array is the simplest data structure that can hold a set of
key/value pairs. Such data structures are extremely popular because they provide a very
efficient, O(1) average running time for accessing data. The key of a key/value pair is a
unique value in the set and can be easily looked up to access the data.
Key/value pairs are of varied types: some keep the data in memory and some provide the
capability to persist the data to disk. Key/value pairs can be distributed and held in a
cluster of nodes.
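At its smallest, a key/value store is a thin wrapper around a hash map. The sketch below is an in-memory toy with hypothetical method names (put, get, delete); Python's dict is a hash table, so each operation takes constant time on average.

```python
class KeyValueStore:
    """A minimal in-memory key/value store backed by a hash map."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique: writing an existing key replaces its value.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False

store = KeyValueStore()
store.put("user:1", {"name": "Tom", "zip": "94303"})
store.put("user:1", {"name": "Tom", "zip": "94301"})  # overwrites the first value
print(store.get("user:1"))   # {'name': 'Tom', 'zip': '94301'}
print(store.get("user:2"))   # None
print(store.delete("user:1"), store.delete("user:1"))  # True False
```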
A simple, yet powerful, key/value store is Oracle's Berkeley DB. Berkeley DB is a
pure storage engine where both key and value are an array of bytes. The core
storage engine of Berkeley DB doesn't attach meaning to the key or the value. It
takes byte array pairs in and returns the same back to the calling client. Berkeley
DB allows data to be cached in memory and flushed to disk as it grows. There is also
a notion of indexing the keys for faster lookup and access. Berkeley DB has existed
since the mid-1990s. It was created to replace AT&T's NDBM as a part of migrating
from BSD 4.3 to 4.4. In 1996, Sleepycat Software was formed to maintain and
provide support for Berkeley DB.
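The "bytes in, bytes out" model can be tried without installing Berkeley DB. The sketch below uses dbm.dumb from Python's standard library as a stand-in; it is a different, much simpler engine from the same DBM tradition, chosen here only because it ships with Python. The file name and keys are hypothetical.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store")

    # Write: keys and values are plain byte strings; the store attaches
    # no meaning to them.
    with dbm.dumb.open(path, "c") as db:
        db[b"user:1"] = b"\x00\x01 opaque binary payload"
        db[b"user:2"] = "Jane".encode("utf-8")

    # Reopen read-only: the data was persisted to disk, and the caller
    # gets back exactly the bytes that were stored.
    with dbm.dumb.open(path, "r") as db:
        raw = db[b"user:2"]
        keys = sorted(db.keys())

print(raw)   # b'Jane'
print(keys)  # [b'user:1', b'user:2']
```

Any structure inside the value (text encoding, serialized objects) is the client's business, which is what makes such engines usable as the storage layer beneath other systems.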
Another type of key/value store in common use is a cache. A cache provides an
in-memory snapshot of the most-used data in an application. The purpose of cache is to
reduce disk I/O. Cache systems could be rudimentary map structures or robust
systems with a cache expiration policy. Caching is a popular strategy employed at all
levels of a computer software stack to boost performance. Operating systems,
databases, middleware components, and applications use caching.
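A cache with an expiration policy can be sketched as a map that remembers when each entry goes stale and discards the least recently used entry when full. The class below is an illustrative toy, not the design of any named cache system; ExpiringLRUCache, its parameters, and read_from_disk are hypothetical names.

```python
import time
from collections import OrderedDict

class ExpiringLRUCache:
    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        entry = self._items.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            self._items.move_to_end(key)  # mark as most recently used
            return entry[1]
        value = load(key)  # stand-in for the expensive disk read
        self._items[key] = (now + self.ttl, value)
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return value

disk_reads = []

def read_from_disk(key):
    disk_reads.append(key)
    return key.upper()

cache = ExpiringLRUCache(capacity=2, ttl_seconds=60)
cache.get("a", read_from_disk)  # miss
cache.get("a", read_from_disk)  # hit, no disk read
cache.get("b", read_from_disk)  # miss
cache.get("c", read_from_disk)  # miss, evicts "a"
cache.get("a", read_from_disk)  # miss again
print(disk_reads)  # ['a', 'b', 'c', 'a']
```

The second request for "a" never touches the disk, which is the whole purpose of the cache; the last one does, because the entry was evicted to make room.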
Robust open-source distributed cache systems like EHCache (http://ehcache.org/)
are widely used in Java applications. EHCache could be considered as a NoSQL
solution. Another caching system popularly used in web applications is Memcached
(http://memcached.org/), which is an open-source, high-performance object caching
system. Brad Fitzpatrick created Memcached for LiveJournal in 2003. Apart from
being a caching system, Memcached also helps effective memory management by
creating a large virtual pool and distributing memory among nodes as required.
This prevents fragmented zones where one node could have excess but unused
memory and another node could be starved for memory.
As the NoSQL movement has gathered momentum, a number of key/value pair data
stores have emerged. Some of these newer stores build on the Memcached API, some
use Berkeley DB as the underlying storage, and a few others provide alternative
solutions built from scratch.
Many of these key/value stores have APIs that allow get-and-set mechanisms to get
and set values. A few, like Redis (http://redis.io/), provide richer abstractions and
powerful APIs. Redis could be considered as a data structure server because it
provides data structures like strings (character sequences), lists, and sets, apart from
maps. Also, Redis provides a very rich set of operations to access data from these
different types of data structures.
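To show what "data structure server" means compared with plain get and set, here is a toy, in-process imitation. It is not a Redis client and does not talk to a server; the method names are borrowed from the Redis commands SET, GET, LPUSH, LPOP, SADD, and SINTER, but the behavior is deliberately simplified and the class itself is hypothetical.

```python
class ToyDataStructureServer:
    """In-process imitation of a data structure server; not a Redis client."""

    def __init__(self):
        self._strings = {}
        self._lists = {}
        self._sets = {}

    def set(self, key, value):
        self._strings[key] = value

    def get(self, key):
        return self._strings.get(key)

    def lpush(self, key, *values):
        """Push each value onto the head of the list; return the new length."""
        items = self._lists.setdefault(key, [])
        for value in values:
            items.insert(0, value)
        return len(items)

    def lpop(self, key):
        """Remove and return the head of the list, or None if it is empty."""
        items = self._lists.get(key)
        return items.pop(0) if items else None

    def sadd(self, key, *members):
        self._sets.setdefault(key, set()).update(members)

    def sinter(self, *keys):
        """Return the members common to all the named sets."""
        sets = [self._sets.get(key, set()) for key in keys]
        return set.intersection(*sets) if sets else set()

server = ToyDataStructureServer()
server.set("greeting", "hello")
server.lpush("queue", "job1", "job2")
server.sadd("tags:post1", "nosql", "redis")
server.sadd("tags:post2", "nosql", "sql")

first = server.lpop("queue")
common = server.sinter("tags:post1", "tags:post2")
print(server.get("greeting"), first, common)  # hello job2 {'nosql'}
```

The value stored under a key is no longer an opaque blob: the store itself understands lists and sets and can operate on them, so the client does not have to fetch, modify, and rewrite the whole value.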
Here is an enumeration of a few important characteristics of some of these stores:
Membase (proposed to be merged into Couchbase, gaining features from
CouchDB after the creation of Couchbase, Inc.)
Official Online Resources: www.membase.org/.
History: Project started in 2009 by NorthScale, Inc. (later renamed as Membase).
Zynga and NHN have been contributors since the beginning. Membase builds on
Memcached and supports Memcached's text and binary protocol. Membase adds a lot
of additional features on top of Memcached. It adds disk persistence, data
replication, live cluster reconfiguration, and data rebalancing. A number of
core Membase creators are also Memcached contributors.
Technologies and Language: Implemented in Erlang, C, and C++.
Access Methods: Memcached-compliant API with some extensions. Can be a drop-in
replacement for Memcached.
Open-Source License: Apache License version 2.
Who Uses It: Zynga, NHN, and others.
Kyoto Cabinet
Official Online Resources: http://fallabs.com/kyotocabinet/.
History: Kyoto Cabinet is a successor of Tokyo Cabinet. The database is a simple
data file containing records; each is a pair of a key and a value. Every key and value
are serial bytes with variable length.
Technologies and Language: Implemented in C++.
Access Methods: Provides APIs for C, C++, Java, C#, Python, Ruby, Perl, Erlang,
OCaml, and Lua. The protocol simplicity means there are many, many clients.
Open-Source License: GNU GPL and GNU LGPL.
Who Uses It: Mixi, Inc. sponsored much of its original work before the author left
Mixi to join Google. Blog posts and mailing lists suggest that there are many users
but no public list is available.
Redis
Official Online Resources: http://redis.io/.
History: Project started in 2009 by Salvatore Sanfilippo. Salvatore created it for
his startup LLOOGG (http://lloogg.com/). Though still an independent project,
Redis's primary author is employed by VMware, who sponsor its development.
Technologies and Language: Implemented in C.
Access Methods: Rich set of methods and operations. Can access via Redis
command-line interface and a set of well-maintained client libraries for languages
like Java, Python, Ruby, C, C++, Lua, Haskell, AS3, and more.
Open-Source License: BSD.
Who Uses It: Craigslist.
The three key/value stores listed here are nimble, fast implementations that provide
storage for real-time data, temporary frequently used data, or even full-scale
persistence.
The key/value stores listed so far provide a strong consistency model for the data
they store. However, a few other key/value stores emphasize availability over
consistency in distributed deployments.
Many of these are inspired by Amazon's Dynamo, which is also a key/value store.
Amazon's Dynamo promises exceptional availability and scalability, and forms the
backbone for Amazon's distributed, fault-tolerant, and highly available system.
Apache Cassandra, Basho Riak, and Voldemort are open-source implementations of
the ideas proposed by Amazon Dynamo.
Amazon Dynamo brings a lot of key high-availability ideas to the forefront. The
most important of the ideas is that of eventual consistency. Eventual consistency
implies that there could be small intervals of inconsistency between replicated nodes
as data gets updated among peer-to-peer nodes.
Eventual consistency does not mean inconsistency. It just implies a weaker form of
consistency than the typical ACID type consistency found in RDBMS.
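The "small interval of inconsistency" can be made visible with a toy simulation. The sketch below is not how Dynamo or any of its clones is implemented: the Replica class, the integer version numbers, and the rule that the highest version wins are simplifying assumptions, and real systems use more elaborate conflict handling.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        """Accept a write only if it is newer than what this replica holds."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Exchange entries so every replica ends with the newest version of each key."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for key, (version, value) in list(source.data.items()):
                    target.write(key, value, version)

a, b, c = Replica("a"), Replica("b"), Replica("c")
replicas = [a, b, c]
for replica in replicas:
    replica.write("cart", ["book"], version=1)

a.write("cart", ["book", "pen"], version=2)  # update reaches one node first
stale = b.read("cart")                       # other nodes still see the old value

anti_entropy(replicas)                       # peers exchange updates
converged = [replica.read("cart") for replica in replicas]
print(stale)      # ['book']
print(converged)  # [['book', 'pen'], ['book', 'pen'], ['book', 'pen']]
```

Between the write and the exchange, a read from b returns stale data; after the exchange all replicas agree, which is the sense in which the system is consistent "eventually".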
For now I will list the Amazon Dynamo clones and introduce you to a few important
characteristics of these data stores.
Cassandra
Official Online Resources: http://cassandra.apache.org/.
History: Developed at Facebook and open sourced in 2008, Apache Cassandra
was donated to the Apache foundation.
Technologies and Language: Implemented in Java.
Access Methods: A command-line access to the store. Thrift interface and an
internal Java API exist. Clients for multiple languages including Java, Python,
Grails, PHP, .NET, and Ruby are available. Hadoop integration is also supported.
Query Language: A query language specification is in the making.
Open-Source License: Apache License version 2.
Who Uses It: Facebook, Digg, Reddit, Twitter, and others.
Voldemort
Official Online Resources: http://project-voldemort.com/.
History: Created by the data and analytics team at LinkedIn in 2008.
Technologies and Language: Implemented in Java. Provides for pluggable storage
using either Berkeley DB or MySQL.
Access Methods: Integrates with Thrift, Avro, and protobuf
(http://code.google.com/p/protobuf/) interfaces. Can be used in conjunction with
Hadoop.
Open-Source License: Apache License version 2.
Who Uses It: LinkedIn.
Riak
Official Online Resources
History
http://wiki.basho.com/.
Created at Basho, a company formed in 2008.
Technologies and Language
Implemented in Erlang. Also uses a bit of C and
JavaScript.
Access Methods
Interfaces for JSON (over HTTP) and protobuf clients exist.
Libraries for Erlang, Java, Ruby, Python, PHP, and JavaScript exist.
Open-Source License
Apache License version 2.
Who Uses It
Comcast and Mochi Media.
All three (Cassandra, Riak, and Voldemort) provide open-source Amazon
Dynamo capabilities. Cassandra and Riak demonstrate a dual nature as far as their
behavior and properties go. Cassandra has properties of both Google Bigtable and
Amazon Dynamo. Riak acts both as a key/value store and a document database.
2) Column Family Stores:
Column family stores were created to store and process very large amounts of data
distributed over many machines. There are still keys, but they point to multiple
columns. In the case of BigTable (Google's Column Family NoSQL model), rows are
identified by a row key, with the data sorted and stored by this key. The columns
are arranged by column family.
Google's Bigtable espouses a model where data is stored in a column-oriented way.
This contrasts with the row-oriented format in RDBMS. The column-oriented
storage allows data to be stored effectively. It avoids consuming space when storing
nulls by simply not storing a column when a value doesn't exist for that column.
Each unit of data can be thought of as a set of key/value pairs, where the unit itself
is identified with the help of a primary identifier, often referred to as the primary
key. Bigtable and its clones tend to call this primary key the row-key. Units are also
stored in an ordered, sorted manner: the units of data are sorted and ordered on the
basis of the row-key. To explain sorted ordered column-oriented stores, an example
serves better than a lot of text. Consider a simple table of values that keeps
information about a set of people. Such a table could have columns like first_name,
last_name, occupation, zip_code, and gender. A person's information in this table
could be as follows:
first_name: John
last_name: Doe
zip_code: 10001
gender: male
Another set of data in the same table could be as follows:
first_name: Jane
zip_code: 94303
The row-key of the first data point could be 1 and the second could be 2. Then data
would be stored in a sorted ordered column-oriented store in a way that the data
point with row-key 1 will be stored before a data point with row-key 2 and also that
the two data points will be adjacent to each other.
Next, only the valid key/value pairs would be stored for each data point. So, a
possible column-family for the example could be name with columns first_name and
last_name being its members. Another column-family could be location with
zip_code as its member. A third column-family could be profile. The gender column
could be a member of the profile column-family. In column-oriented stores similar to
Bigtable, data is stored on a column-family basis.
Column-families are typically defined at configuration or startup time. Columns
themselves need no a-priori definition or declaration. Also, columns are capable of
storing any data type as long as the data can be persisted to an array of bytes.
So the underlying logical storage for this simple example consists of three storage
buckets: name, location, and profile. Within each bucket, only key/value pairs with
valid values are stored.
Therefore, the name column-family bucket stores the following values:
For row-key: 1
first_name: John
last_name: Doe
For row-key: 2
first_name: Jane
The location column-family stores the following:
For row-key: 1
zip_code: 10001
For row-key: 2
zip_code: 94303
The profile column-family has values only for the data point with row-key 1, so it
stores only the following:
For row-key: 1
gender: male
In real storage terms, the column-families are not physically isolated for a given
row. All data pertaining to a row-key is stored together. The column-family acts as
a key for the columns it contains and the row-key acts as the key for the whole data
set.
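The logical layout described above can be mimicked with nested dictionaries. This is only a sketch of the data model, not of how Bigtable or HBase store bytes on disk; the names table, get_cell, and family_view are hypothetical helpers introduced here.

```python
# row-key -> column-family -> column -> value.
# A column without a value is simply absent; no null placeholder is stored.
table = {
    1: {
        "name": {"first_name": "John", "last_name": "Doe"},
        "location": {"zip_code": "10001"},
        "profile": {"gender": "male"},
    },
    2: {
        "name": {"first_name": "Jane"},
        "location": {"zip_code": "94303"},
    },
}


def get_cell(table, row_key, family, column):
    """Return the stored value, or None when the column was never stored."""
    return table.get(row_key, {}).get(family, {}).get(column)


def family_view(table, family):
    """List (row-key, columns) pairs for one column-family, in row-key order."""
    return [
        (row_key, table[row_key][family])
        for row_key in sorted(table)
        if family in table[row_key]
    ]


name_bucket = family_view(table, "name")
profile_bucket = family_view(table, "profile")
```

Here profile_bucket contains an entry for row-key 1 only, matching the profile listing above, and get_cell(table, 2, "name", "last_name") returns None because that column was never written for row-key 2.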
Data in Bigtable and its clones is stored in a contiguous sequenced manner. As data
grows to fill up one node, it is split into multiple nodes. The data is sorted and
ordered not only on each node but also across nodes, providing one large continuously
sequenced set. The data is persisted in a fault-tolerant manner where three copies of
each data set are maintained. Most Bigtable clones leverage a distributed file system
to persist data to disk. Distributed file systems allow data to be stored among
a cluster of machines.
The sorted ordered structure makes data seek by row-key extremely efficient. Data
access is less random and ad-hoc, and lookup is as simple as finding the node in the
sequence that holds the data.
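Because each node holds one contiguous range of row-keys, locating the right node reduces to a binary search over the range boundaries. The sketch below assumes three nodes with made-up split points; node_start_keys, node_names, and node_for are hypothetical names, not part of any Bigtable clone's API.

```python
import bisect

# Each node holds a contiguous, sorted range of row-keys.
# Only the first row-key of each range is recorded (hypothetical split points).
node_start_keys = ["a", "h", "p"]
node_names = ["node-1", "node-2", "node-3"]


def node_for(row_key):
    """Find the node whose key range contains row_key, using binary search."""
    index = bisect.bisect_right(node_start_keys, row_key) - 1
    # Keys that sort before the first split point fall back to the first node.
    return node_names[max(index, 0)]


apple_node = node_for("apple")
mango_node = node_for("mango")
zebra_node = node_for("zebra")
```

With these split points, "apple" maps to node-1, "mango" to node-2, and "zebra" to node-3.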
Data is inserted at the end of the list. Updates are in-place but often imply adding a
newer version of data to the specific cell rather than in-place overwrites. This means
a few versions of each cell are maintained at all times. The versioning property is
usually configurable.
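Cell versioning can be pictured as a short, bounded history per cell. The following is a minimal sketch under the assumption that versions are ordered by a numeric timestamp; the VersionedCell class and its max_versions parameter are hypothetical and only stand in for the configurable versioning property mentioned above.

```python
class VersionedCell:
    """Keeps the newest max_versions values of one cell, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value)

    def put(self, timestamp, value):
        # An update adds a new version instead of overwriting the old one.
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        # Drop versions beyond the configured limit.
        del self.versions[self.max_versions:]

    def get(self, as_of=None):
        """Return the newest value, or the newest one not later than as_of."""
        for timestamp, value in self.versions:
            if as_of is None or timestamp <= as_of:
                return value
        return None


cell = VersionedCell(max_versions=2)
cell.put(1, "10001")
cell.put(2, "10002")
cell.put(3, "10003")
latest = cell.get()
previous = cell.get(as_of=2)
expired = cell.get(as_of=1)   # version 1 was dropped when version 3 arrived
```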
HBase is a popular, open-source, sorted ordered column-family store that is modeled
on the ideas proposed by Google's Bigtable.
Data stored in HBase can be manipulated using the MapReduce infrastructure.
Hadoop's MapReduce tools can easily use HBase as the source and/or sink of data.
The best way to learn about and leverage the ideas proposed by Google's
infrastructure is to start with the Hadoop (http://hadoop.apache.org) family of
products. The NoSQL Bigtable store called HBase is part of the Hadoop family.
HBase
Official Online Resources
http://hbase.apache.org.
History
Created at Powerset (now part of Microsoft) in 2007. Donated to the
Apache foundation before Powerset was acquired by Microsoft.
Technologies and Language
Implemented in Java.
Access Methods
A JRuby shell allows command-line access to the store. Thrift,
Avro, REST, and protobuf clients exist. A few language bindings are also available.
A Java API is available with the distribution. Protobuf, short for Protocol Buffers, is
Google's data interchange format. More information is available online at
http://code.google.com/p/protobuf/.
Query Language
No native querying language. Hive
(http://hive.apache.org) provides a SQL-like interface for HBase.
Open-Source License
Apache License version 2.
Who Uses It
Facebook, StumbleUpon, Hulu, Ning, Mahalo, Yahoo!, and others.
WHAT IS THRIFT?
Thrift is a software framework and an interface definition language that
allows cross-language services and API development. Services generated using
Thrift work efficiently and seamlessly between C++, Java, Python, PHP, Ruby,
Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml. Thrift was created by
Facebook in 2007. It's an Apache incubator project. You can find more information
on Thrift at http://incubator.apache.org/thrift/.
Hypertable
Official Online Resources
www.hypertable.org.
History
Created at Zvents in 2007. Now an independent open-source project.
Technologies and Language
Implemented in C++, uses Google RE2 regular
expression library. RE2 provides a fast and efficient implementation. Hypertable
promises a performance boost over HBase, potentially serving to reduce time and cost
when dealing with large amounts of data.
Access Methods
A command-line shell is available. In addition, a Thrift interface
is supported. Language bindings have been created based on the Thrift interface. A
creative developer has even created a JDBC-compliant interface for Hypertable.
Query Language
HQL (Hypertable Query Language) is a SQL-like abstraction
for querying Hypertable data. Hypertable also has an adapter for Hive.
Open-Source License
GNU GPL version 2.
Who Uses It
Zvents, Baidu (China's biggest search engine), Rediff (India's biggest
portal).
Cloudata
Official Online Resources
www.cloudata.org/.
History
Created by a Korean developer named YK Kwon
(www.readwriteweb.com/hack/2011/02/open-source-bigtable-cloudata.php). Not
much is publicly known about its origins.
Technologies and Language
Implemented in Java.
Access Methods
A command-line access is available. Thrift, REST, and Java API
are available.
Query Language
CQL (Cloudata Query Language) defines a SQL-like query
language.
Open-Source License
Apache License version 2.
Who Uses It
Not known.
Sorted ordered column-family stores form a very popular NoSQL option. However,
NoSQL consists of a lot more variants of key/value stores and document databases.
Next, I introduce the document databases.
3) Document Databases:
Document databases were inspired by Lotus Notes and are similar to key-value
stores. The model is basically versioned documents that are collections of other
key-value collections. The semi-structured documents are stored in formats like
JSON. Document databases are not document management systems. More often than
not, developers starting out with NoSQL confuse document databases with document
and content management systems. The word document in document databases
connotes loosely structured sets of key/value pairs in documents, typically JSON
(JavaScript Object Notation), and not documents or spreadsheets (though these
could be stored too).
The central concept of a document-oriented database is the notion of a Document.
While each document-oriented database implementation differs on the details of
this definition, in general, they all assume documents encapsulate and encode data
(or information) in some standard formats or encodings. Encodings in use
include XML, YAML, JSON, and BSON, as well as binary forms like PDF and
Microsoft Office documents (MS Word, Excel, and so on).
Documents inside a document-oriented database are similar, in some ways, to
records or rows in relational databases, but they are less rigid. They are not
required to adhere to a standard schema, nor will they have all the same sections,
slots, parts, keys, or the like. For example, here's a document:
FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing".
Another document could be:
FirstName:"Jonathan", Address:"15 Wanamassa Point Road",
Children:[{Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha",
Age:5}, {Name:"Elena", Age:2}].
Both documents have some similar information and some different. Unlike a
relational database where each record would have the same set of fields and unused
fields might be kept empty, there are no empty 'fields' in either document (record) in
this case. This system allows new information to be added and it does not require
explicitly stating if other pieces of information are left out.
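The two example documents can be written down directly as Python dictionaries and encoded as JSON. This is only an illustrative sketch; the variable names bob, jonathan, and collection are assumptions, and no particular document database is implied.

```python
import json

bob = {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"}

jonathan = {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [
        {"Name": "Michael", "Age": 10},
        {"Name": "Jennifer", "Age": 8},
        {"Name": "Samantha", "Age": 5},
        {"Name": "Elena", "Age": 2},
    ],
}

# Both documents live in the same collection even though their fields differ.
collection = [bob, jonathan]

# Fields that both documents happen to share; nothing forces them to.
shared_fields = set(bob) & set(jonathan)

# A document is typically persisted in an encoding such as JSON.
encoded = json.dumps(jonathan)
decoded = json.loads(encoded)
```

Neither document carries an empty placeholder for the other's fields: bob has no Children key and jonathan has no Hobby key.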
Keys
Documents are addressed in the database via a unique key that represents
that document. Often, this key is a simple string. In some cases, this string is
a URI or path. Regardless, you can use this key to retrieve the document
from the database. Typically, the database retains an index on the key such
that document retrieval is fast.
Retrieval
One of the other defining characteristics of a document-oriented database is
that, beyond the simple key-document (or key-value) lookup that you can use
to retrieve a document, the database will offer an API or query language that
will allow you to retrieve documents based on their contents. For example,
you may want a query that gets you all the documents with a certain field set
to a certain value. The set of query APIs or query language features
available, as well as the expected performance of the queries, varies
significantly from one implementation to the next.
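Key lookup and content-based retrieval can be contrasted in a few lines. The sketch below is a toy in-memory store, assuming a hypothetical DocumentStore class with put, get, and find methods; real document databases expose their own, much richer query APIs and use indexes rather than a full scan.

```python
class DocumentStore:
    """Toy in-memory document store keyed by a unique string."""

    def __init__(self):
        self.documents = {}  # key -> document (a dict)

    def put(self, key, document):
        self.documents[key] = document

    def get(self, key):
        """Key lookup: fetch one document by its unique key."""
        return self.documents.get(key)

    def find(self, field, value):
        """Content query: keys of all documents whose field equals value."""
        return sorted(
            key for key, doc in self.documents.items()
            if doc.get(field) == value
        )


store = DocumentStore()
store.put("people/bob", {"FirstName": "Bob", "Hobby": "sailing"})
store.put("people/ann", {"FirstName": "Ann", "Hobby": "sailing"})
store.put("people/jon", {"FirstName": "Jonathan"})

by_key = store.get("people/bob")
sailors = store.find("Hobby", "sailing")
```

The find method scans every document, which is acceptable for a sketch but is exactly the kind of work a real implementation would avoid with an index on the queried field.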
Organization
Implementations offer a variety of ways of organizing documents, including
notions of
Collections
Tags
Non-visible Metadata
Directory hierarchies
Document databases treat a document as a whole and avoid splitting a document
into its constituent name/value pairs. At a collection level, this allows for putting
together a diverse set of documents into a single collection. Document databases
allow indexing of documents on the basis of not only their primary identifier but also
their properties. A few different open-source document databases are available today,
but the most prominent among the available options are MongoDB and CouchDB.
MongoDB
Official Online Resources
www.mongodb.org.
History
Created at 10gen.
Technologies and Language
Implemented in C++.
Access Methods
A JavaScript command-line interface. Drivers exist for a number
of languages including C, C#, C++, Erlang, Haskell, Java, JavaScript, Perl, PHP,
Python, Ruby, and Scala.
Query Language
SQL-like query language.
Open-Source License
GNU Affero GPL (http://gnu.org/licenses/agpl-3.0.html).
Who Uses It
FourSquare, Shutterfly, Intuit, GitHub, and more.
CouchDB
Official Online Resources
http://couchdb.apache.org and
www.couchbase.com. Most of the authors are part of Couchbase, Inc.
History
Work started in 2005 and it was incubated into Apache in 2008.
Technologies and Language
Implemented in Erlang with some C and a
JavaScript execution environment.
Access Methods
Upholds REST above every other mechanism. Use standard web
tools and clients to access the database, the same way as you access web resources.
Open-Source License
Apache License version 2.
Who Uses It
Apple, BBC, Canonical, Cern, and more at
http://wiki.apache.org/couchdb/CouchDB_in_the_wild.
4) Graph Databases:
Graph databases are built with nodes, relationships between nodes, and the
properties of nodes. Instead of tables of rows and columns and the rigid structure of
SQL, a flexible graph model is used, which can scale across many machines.
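The node, relationship, and property model can be sketched with plain Python structures. The data and helper names below (nodes, relationships, neighbors, reachable) are hypothetical and chosen for illustration; they do not reflect the API of any specific graph database.

```python
from collections import deque

# Nodes carry properties; relationships are (source, type, target) triples.
nodes = {
    "alice": {"kind": "person", "city": "Boston"},
    "bob": {"kind": "person", "city": "Austin"},
    "carol": {"kind": "person", "city": "Denver"},
    "dave": {"kind": "person", "city": "Boston"},
}
relationships = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("alice", "FOLLOWS", "dave"),
]


def neighbors(node, rel_type):
    """Nodes directly reachable from node over relationships of rel_type."""
    return [t for s, r, t in relationships if s == node and r == rel_type]


def reachable(start, rel_type):
    """Breadth-first traversal: every node reachable from start, in visit order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current, rel_type):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


direct = neighbors("alice", "FOLLOWS")
everyone = reachable("alice", "FOLLOWS")
```

A traversal such as reachable follows relationships from node to node, which is the access pattern graph databases are built to make fast.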
So far I have listed most of the mainstream open-source NoSQL products. A few
other products like Graph databases and XML data stores could also qualify as
NoSQL databases. This book does not cover Graph and XML databases. However, I
list the two Graph databases that may be of interest and something you may want to
explore beyond this book: Neo4j and FlockDB. Neo4j is an ACID-compliant graph
database. It facilitates rapid traversal of graphs.
Neo4j
Official Online Resources
http://neo4j.org.
History
Created at Neo Technologies in 2003. (Yes, this database has been
around before the term NoSQL was known popularly.)
Technologies and Language
Implemented in Java.
Access Methods
A command-line access to the store is provided. REST interface
also available. Client libraries for Java, Python, Ruby, Clojure, Scala, and PHP
exist.
Query Language
Supports SPARQL protocol and RDF Query Language.
Open-Source License
AGPL.
Who Uses It
Box.net.
FlockDB
Official Online Resources
https://github.com/twitter/flockdb
History
Created at Twitter and open sourced in 2010. Designed to store the
adjacency lists for followers on Twitter.
Technologies and Language
Implemented in Scala.
Access Methods
A Thrift and Ruby client.
Open-Source License
Apache License version 2.
Who Uses It
Twitter.
The major players in NoSQL have emerged primarily because of the organizations
that have adopted them. Some of the largest NoSQL technologies include:
Dynamo: Dynamo was created by Amazon.com and is the most prominent
key-value NoSQL database. Amazon was in need of a highly scalable
distributed platform for their e-commerce businesses, so they developed
Dynamo. Amazon S3 uses Dynamo as the storage mechanism.
Cassandra: Cassandra was open sourced by Facebook and is a column-oriented
NoSQL database.
BigTable: BigTable is Google's proprietary column-oriented database. Google
allows the use of BigTable but only for the Google App Engine.
SimpleDB: SimpleDB is another Amazon database. Used for Amazon EC2 and
S3, it is part of Amazon Web Services, which charges fees depending on usage.
CouchDB: CouchDB, along with MongoDB, is an open-source document-oriented
NoSQL database.
Neo4j: Neo4j is an open-source graph database.
The question of how to query a NoSQL database is what most developers are
interested in. After all, data stored in a huge database doesn't do anyone any good if
you can't retrieve and show it to end users or web services. NoSQL databases do not
provide a high-level declarative query language like SQL. Instead, querying these
databases is data-model specific. Many of the NoSQL platforms allow for RESTful
interfaces to the data. Others offer query APIs. There are a couple of query tools that
have been developed that attempt to query multiple NoSQL databases. These tools
typically work across a single NoSQL category. One example is SPARQL. SPARQL
is a declarative query specification designed for graph databases. Here is an
example of a SPARQL query that retrieves the URL of a particular blogger
(courtesy of IBM):
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
?contributor foaf:name "Jon Foobar" .
?contributor foaf:weblog ?url .
}
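What the two triple patterns in this query do can be imitated in Python over a small list of (subject, predicate, object) triples. This is only a sketch of the matching logic, not a SPARQL engine; the sample triples, the example.org URLs, and the weblogs_of function are made up for illustration.

```python
# Hypothetical data standing in for the contents of bloggers.rdf.
triples = [
    ("person1", "foaf:name", "Jon Foobar"),
    ("person1", "foaf:weblog", "http://example.org/jon"),
    ("person2", "foaf:name", "Ann Example"),
    ("person2", "foaf:weblog", "http://example.org/ann"),
]


def weblogs_of(name):
    """Bind ?contributor via foaf:name, then return each matching ?url."""
    # First pattern: ?contributor foaf:name "<name>"
    contributors = {s for s, p, o in triples if p == "foaf:name" and o == name}
    # Second pattern: ?contributor foaf:weblog ?url
    return [o for s, p, o in triples if p == "foaf:weblog" and s in contributors]


urls = weblogs_of("Jon Foobar")
```

The shared variable ?contributor is what joins the two patterns: only subjects that satisfy the first pattern are considered in the second.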
Organizations that have massive data storage needs are looking seriously at
NoSQL. Apparently, the concept isn't getting as much traction in smaller
organizations. In a survey conducted by Information Week, 44% of business IT
professionals haven't heard of NoSQL. Further, only 1% of the respondents reported
that NoSQL is a part of their strategic direction. Clearly, NoSQL has its place in
our connected world but will need to continue to evolve to get the mass appeal that
many think it could have.
Let's consider the Twitter example. A tweet is a very small piece of text created by
one user. Twitter should be able to save it quickly and then distribute it widely to
that user's followers so they can all read it.
Now, if I go to that user's profile and don't immediately see their latest tweet, it's
not the end of the world. If it shows up seconds (or even minutes?) later, no big deal.
It's a tweet.
Twitter used to use a relational database (MySQL) and caching (memcached) to
handle this, but they're switching to Cassandra, a "key-value" store that is designed
for this kind of data and can be massively scaled horizontally. Other key-value
databases include Redis and Riak, created by Boston's own Basho.
Consider other "real-world" examples that you might need to model in your web
application:
resume
business card
receipt
All of these are self-contained data structures. Everything you need in real life is in
one place. Yet, if you were to model this in a relational database, you'd probably
have a "persons" table, an "orders" table, and a "line_items" table, splitting the
data into its atomic pieces.
But if you're the user, all you want is the receipt for your purchase. Document
databases, a class of NoSQL databases that includes MongoDB and CouchDB, are a
good fit here. You might have a Receipt document that stores your customers'
receipts, then just send that data cleanly back to the user.
So, is your data relational? Do parts of your application better fit a different data
structure that NoSQL tools are really good at?
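The receipt example can be made concrete. The sketch below assumes made-up sample data and hypothetical helper names (receipt_from_tables, receipt_from_document); it contrasts reassembling a receipt from three relational-style tables with reading it back as one stored document.

```python
# Relational-style layout: the receipt is split across three tables.
persons = {1: {"name": "Jane Doe"}}
orders = {100: {"person_id": 1, "date": "2011-03-01"}}
line_items = [
    {"order_id": 100, "item": "book", "price": 12.5},
    {"order_id": 100, "item": "pen", "price": 1.5},
]


def receipt_from_tables(order_id):
    """Join the three tables back together to rebuild one receipt."""
    order = orders[order_id]
    items = [
        {"item": li["item"], "price": li["price"]}
        for li in line_items
        if li["order_id"] == order_id
    ]
    return {
        "customer": persons[order["person_id"]]["name"],
        "date": order["date"],
        "items": items,
        "total": sum(i["price"] for i in items),
    }


# Document-style layout: the whole receipt is stored as one unit.
receipts = {
    100: {
        "customer": "Jane Doe",
        "date": "2011-03-01",
        "items": [
            {"item": "book", "price": 12.5},
            {"item": "pen", "price": 1.5},
        ],
        "total": 14.0,
    }
}


def receipt_from_document(order_id):
    """One lookup returns everything the user asked for."""
    return receipts[order_id]
```

Both functions return the same receipt; the document version trades the flexibility of normalized tables for a single read of a self-contained structure.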
1. Elastic scaling
For years, database administrators have relied on scale up (buying bigger servers
as database load increases) rather than scale out (distributing the database
across multiple hosts as load increases). However, as transaction rates and
availability requirements increase, and as databases move into the cloud or onto
virtualized environments, the economic advantages of scaling out on commodity
hardware become irresistible.
RDBMS might not scale out easily on commodity clusters, but the new breed of
NoSQL databases are designed to expand transparently to take advantage of new
nodes, and they're usually designed with low-cost commodity hardware in mind.
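Scaling out means deciding which host stores which key. One simple way to sketch this is hash-based placement, shown below with hypothetical host names and a hypothetical host_for helper. It is an illustration of the idea only, not how any particular NoSQL product assigns data.

```python
import hashlib


def host_for(key, hosts):
    """Pick a host for a key by hashing the key and taking it modulo the host count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]


hosts = ["host-a", "host-b", "host-c"]   # hypothetical commodity servers

# Spread 1,000 keys over the hosts and count how many land on each.
placement = {}
for i in range(1000):
    host = host_for(f"user:{i}", hosts)
    placement[host] = placement.get(host, 0) + 1
```

The same key always maps to the same host, and load is spread across all hosts. A known weakness of plain modulo placement is that adding a host remaps most keys, which is one reason Dynamo-style systems use consistent hashing instead.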
2. Big data
Just as transaction rates have grown out of recognition over the last decade, the
volumes of data that are being stored also have increased massively. O'Reilly has
cleverly called this the "industrial revolution of data." RDBMS capacity has been
growing to match these increases, but as with transaction rates, the constraints of
data volumes that can be practically managed by a single RDBMS are becoming
intolerable for some enterprises. Today, the volumes of "big data" that can be
handled by NoSQL systems, such as Hadoop, outstrip what can be handled by the
biggest RDBMS.
3. Goodbye DBAs (see you later?)
Despite the many manageability improvements claimed by RDBMS vendors over
the years, high-end RDBMS systems can be maintained only with the assistance of
expensive, highly trained DBAs. DBAs are intimately involved in the design,
installation, and ongoing tuning of high-end RDBMS systems.
NoSQL databases are generally designed from the ground up to require less
management: automatic repair, data distribution, and simpler data models lead to
lower administration and tuning requirements, in theory. In practice, it's likely
that rumors of the DBA's death have been slightly exaggerated. Someone will
always be accountable for the performance and availability of any mission-critical
data store.
4. Economics
NoSQL databases typically use clusters of cheap commodity servers to manage the
exploding data and transaction volumes, while RDBMS tends to rely on expensive
proprietary servers and storage systems. The result is that the cost per gigabyte or
transaction/second for NoSQL can be many times less than the cost for RDBMS,
allowing you to store and process more data at a much lower price point.
5. Flexible data models
Change management is a big headache for large production RDBMS. Even minor
changes to the data model of an RDBMS have to be carefully managed and may
necessitate downtime or reduced service levels.
NoSQL databases have far more relaxed (or even nonexistent) data model
restrictions. NoSQL key-value stores and document databases allow the application
to store virtually any structure it wants in a data element. Even the more rigidly
defined BigTable-based NoSQL databases (Cassandra, HBase) typically allow new
columns to be created without too much fuss.
The result is that application changes and database schema changes do not have to
be managed as one complicated change unit. In theory, this will allow applications
to iterate faster, though, clearly, there can be undesirable side effects if the
application fails to manage data integrity.
The promise of the NoSQL database has generated a lot of enthusiasm, but there
are many obstacles to overcome before they can appeal to mainstream enterprises.
Here are a few of the top challenges.
1. Maturity
RDBMS systems have been around for a long time. NoSQL advocates will argue
that their advancing age is a sign of their obsolescence, but for most CIOs, the
maturity of the RDBMS is reassuring. For the most part, RDBMS systems are
stable and richly functional. In comparison, most NoSQL alternatives are in
pre-production versions with many key features yet to be implemented.
Living on the technological leading edge is an exciting prospect for many
developers, but enterprises should approach it with extreme caution.
2. Support
Enterprises want the reassurance that if a key system fails, they will be able to get
timely and competent support. All RDBMS vendors go to great lengths to provide a
high level of enterprise support.
In contrast, most NoSQL systems are open source projects, and although there are
usually one or more firms offering support for each NoSQL database, these
companies often are small start-ups without the global reach, support resources, or
credibility of an Oracle, Microsoft, or IBM.
3. Analytics and business intelligence
NoSQL databases have evolved to meet the scaling demands of modern Web 2.0
applications. Consequently, most of their feature set is oriented toward the
demands of these applications. However, data in an application has value to the
business that goes beyond the insert-read-update-delete cycle of a typical Web
application. Businesses mine information in corporate databases to improve their
efficiency and competitiveness, and business intelligence (BI) is a key IT issue for
all medium to large companies.
NoSQL databases offer few facilities for ad-hoc query and analysis. Even a simple
query requires significant programming expertise, and commonly used BI tools do
not provide connectivity to NoSQL.
Some relief is provided by the emergence of solutions such as Hive or Pig, which
can provide easier access to data held in Hadoop clusters and perhaps eventually,
other NoSQL databases. Quest Software has developed a product, Toad for Cloud
Databases, that can provide ad-hoc query capabilities to a variety of NoSQL
databases.
4. Administration
The design goals for NoSQL may be to provide a zero-admin solution, but the
current reality falls well short of that goal. NoSQL today requires a lot of skill to
install and a lot of effort to maintain.
5. Expertise
There are literally millions of developers throughout the world, and in every
business segment, who are familiar with RDBMS concepts and programming. In
contrast, almost every NoSQL developer is in a learning mode. This situation will
resolve itself naturally over time, but for now, it's far easier to find experienced
RDBMS programmers or administrators than a NoSQL expert.
Conclusion
NoSQL databases are becoming an increasingly important part of the database
landscape, and when used appropriately, can offer real benefits. However,
enterprises should proceed with caution with full awareness of the legitimate
limitations and issues that are associated with these databases.
NoSQL is a shell-based relational database management system that runs
under Unix-like operating systems, or others with compatibility layers (e.g., Cygwin
under Windows). Its name merely reflects the fact that it does not express its
queries using Structured Query Language; the NoSQL RDBMS is distinct from the
circa-2009 general concept of NoSQL databases, which are typically non-relational,
unlike the NoSQL RDBMS. NoSQL is released under the GNU GPL.
OK, then, we have reached the end!
Maybe you want to choose your group (remember the Preface):
I. Love it
II. Deny it
III. Ignore it
I hope that I have covered the main concepts of NoSQL, like its definition,
advantages, and uses in other areas. But we shouldn't forget that this new database
technology still has a long way to go. Let's see how far it will go, and whether it
can stand up to the challenges that I mentioned earlier. Hope so.
A.Akhtar
http://www.techrepublic.com
http://en.wikipedia.org/wiki/NoSQL
http://newtech.about.com/od/databasemanagement/a/Nosql.htm