No SQL

Arsalan Akhtar


IN THE NAME OF GOD

Contents

Preface
What is SQL?
Alternatives
Definition and Introduction of No SQL
Context and a Bit of History
ACID
    Characteristics
        Atomicity
        Consistency
        Isolation
        Durability
Architecture
NoSQL Emerged From a Need
What is NoSQL?
NoSQL Categories
    1) Key-values Stores
    2) Column Family
    3) Document Databases
        Keys
        Retrieval
        Organization
    4) Graph Databases
Major NoSQL Players
Querying NoSQL
Future of NoSQL
Structuring data without relations
Five advantages of NoSQL
Five challenges of NoSQL
Summary
Last Word
References

Preface

Like most new and upcoming technologies, NoSQL is shrouded in a mist of fear, uncertainty, and doubt. The world of developers is probably divided into three groups when it comes to NoSQL:

Those who love it: People in this group are exploring how NoSQL fits in an application stack. They are using it, creating it, and keeping abreast with the developments in the world of NoSQL.

Those who deny it: Members of this group are either focusing on NoSQL's shortcomings or are out to prove that it's worthless.

Those who ignore it: Developers in this group are agnostic either because they are waiting for the technology to mature, or because they believe NoSQL is a passing fad and ignoring it will shield them from the roller coaster ride of a hype cycle, or because they have simply not had a chance to get to it.

This document explains what NoSQL is, what its characteristics are, what constitutes its typical use cases, and where it fits in the application stack.
What is SQL?

SQL (sometimes referred to as Structured Query Language) is a programming language designed for managing data in relational database management systems (RDBMS). Originally based upon relational algebra and tuple relational calculus, its scope includes data insert, query, update and delete, schema creation and modification, and data access control.

SQL was one of the first commercial languages for Edgar F. Codd's relational model, as described in his influential 1970 paper, "A Relational Model of Data for Large Shared Data Banks". Despite not adhering to the relational model as described by Codd, it became the most widely used database language. Although SQL is often described as, and to a great extent is, a declarative language, it also includes procedural elements.

SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. Since then, the standard has been enhanced several times with added features. However, issues of SQL code portability between major RDBMS products still exist due to lack of full compliance with, or different interpretations of, the standard. Among the reasons mentioned are the large size and incomplete specification of the standard, as well as vendor lock-in.

SQL was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s. This version, initially called SEQUEL (Structured English Query Language), was designed to manipulate and retrieve data stored in IBM's original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed during the 1970s. The acronym SEQUEL was later changed to SQL because "SEQUEL" was a trademark of the UK-based Hawker Siddeley aircraft company.

The first Relational Database Management System (RDBMS) was RDMS, developed at MIT in the early 1970s, soon followed by Ingres, developed in 1974 at U.C. Berkeley.
Ingres implemented a query language known as QUEL, which was later supplanted in the marketplace by SQL.

In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the potential of the concepts described by Codd, Chamberlin, and Boyce and developed their own SQL-based RDBMS with aspirations of selling it to the U.S. Navy, Central Intelligence Agency, and other U.S. government agencies. In June 1979, Relational Software, Inc. introduced the first commercially available implementation of SQL, Oracle V2 (Version 2) for VAX computers. Oracle V2 beat IBM's August release of the System/38 RDBMS to market by a few weeks.

After testing SQL at customer test sites to determine the usefulness and practicality of the system, IBM began developing commercial products based on their System R prototype including System/38, SQL/DS, and DB2, which were commercially available in 1979, 1981, and 1983, respectively.

This chart shows several of the SQL language elements that compose a single statement.

The SQL language is subdivided into several language elements, including:

Clauses, which are constituent components of statements and queries. (In some cases, these are optional.)

Expressions, which can produce either scalar values or tables consisting of columns and rows of data.

Predicates, which specify conditions that can be evaluated to SQL three-valued logic (3VL) (true/false/unknown) or Boolean truth values and which are used to limit the effects of statements and queries, or to change program flow.

Queries, which retrieve the data based on specific criteria. This is the most important element of SQL.

Statements, which may have a persistent effect on schemata and data, or which may control transactions, program flow, connections, sessions, or diagnostics. SQL statements also include the semicolon (";") statement terminator. Though not required on every platform, it is defined as a standard part of the SQL grammar.
Insignificant whitespace is generally ignored in SQL statements and queries, making it easier to format SQL code for readability.

Alternatives

A distinction should be made between alternatives to relational query languages and alternatives to SQL. Below are proposed relational alternatives to SQL:

.QL - object-oriented Datalog
4D Query Language (4D QL)
Datalog
HTSQL - URL based query method
IBM Business System 12 (IBM BS12) - one of the first fully relational database management systems, introduced in 1982
ISBL
Java Persistence Query Language (JPQL) - the query language used by the Java Persistence API and Hibernate persistence library
JoSQL - runs SQL statements written as Strings to query collections from inside Java code
LINQ - runs SQL statements written like language constructs to query collections directly from inside .Net code
Object Query Language
QBE (Query By Example) - created by Moshè Zloof, IBM 1977
Quel - introduced in 1974 by the U.C. Berkeley Ingres project
Tutorial D
SBQL - the Stack Based Query Language
UnQL - the Unstructured Query Language, a functional superset of SQL, developed by the authors of SQLite and CouchDB
XQuery

Definition and Introduction of No SQL

NoSQL is literally a combination of two words: No and SQL. The implication is that NoSQL is a technology or product that counters SQL. The creators and early adopters of the buzzword NoSQL probably wanted to say No RDBMS or No relational but were infatuated by the nicer sounding NoSQL and stuck to it. In due course, some have proposed NonRel as an alternative to NoSQL. A few others have tried to salvage the original term by proposing that NoSQL is actually an acronym that expands to Not Only SQL. Whatever the literal meaning, NoSQL is used today as an umbrella term for all databases and data stores that don't follow the popular and well-established RDBMS principles and often relate to large data sets accessed and manipulated on a Web scale.
This means NoSQL is not a single product or even a single technology. It represents a class of products and a collection of diverse, and sometimes related, concepts about data storage and manipulation.

MapReduce is a parallel programming model that allows distributed processing on large data sets on a cluster of computers. The MapReduce framework is patented by Google, but the ideas are freely shared and adopted in a number of open-source implementations.

MapReduce derives its ideas and inspiration from concepts in the world of functional programming. Map and reduce are commonly used functions in the world of functional programming. In functional programming, a map function applies an operation or a function to each element in a list. For example, a multiply-by-two function on a list [1, 2, 3, 4] would generate another list as follows: [2, 4, 6, 8]. When such functions are applied, the original list is not altered. Functional programming believes in keeping data immutable and avoids sharing data among multiple processes or threads. This means the map function that was just illustrated, trivial as it may be, could be run via two or more threads on the list and these threads would not step on each other, because the list itself is not altered.

Like the map function, functional programming has a concept of a reduce function. Actually, a reduce function in functional programming is more commonly known as a fold function. A reduce or a fold function is also sometimes called an accumulate, compress, or inject function. A reduce or fold function applies a function on all elements of a data structure, such as a list, and produces a single result or output. So applying a reduce function like summation on the list generated out of the map function, that is, [2, 4, 6, 8], would generate an output equal to 20.
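The map and reduce steps just described can be sketched in a few lines of Python. This is only an illustration of the functional idea on a small in-memory list, not of any MapReduce framework; the helper names double and add are made up for the example, and the two-thread run uses Python's standard thread pool simply to show that an operation which never alters the input list can safely be shared.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def double(x):
    """Map step: return a new value and leave the input untouched."""
    return x * 2


def add(accumulator, x):
    """Reduce (fold) step: combine the running result with the next element."""
    return accumulator + x


numbers = [1, 2, 3, 4]

# map applies double to every element and yields a new list.
doubled = list(map(double, numbers))

# double never mutates shared data, so the same map can run on two threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    doubled_in_threads = list(pool.map(double, numbers))

# reduce folds the mapped list into a single value, starting from 0.
total = reduce(add, doubled, 0)

print(doubled)             # [2, 4, 6, 8]
print(doubled_in_threads)  # [2, 4, 6, 8]
print(total)               # 20
print(numbers)             # [1, 2, 3, 4] (the original list is unchanged)
```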
So map and reduce functions could be used in conjunction to process lists of data, where a function is first applied to each member of a list and then an aggregate function is applied to the transformed and generated list.

This same simple idea of map and reduce has been extended to work on large data sets. The idea is slightly modified to work on collections of tuples or key/value pairs. The map function applies a function on every key/value pair in the collection and generates a new collection. Then the reduce function works on the new generated collection and applies an aggregate function to compute a final output. This is better understood through an example, so let me present a trivial one to explain the flow. Say you have a collection of key/value pairs as follows:

    [{"94303": "Tom"}, {"94303": "Jane"}, {"94301": "Arun"}, {"94302": "Chen"}]

This is a collection of key/value pairs where the key is the zip code and the value is the name of a person who resides within that zip code. A simple map function on this collection could get the names of all those who reside in a particular zip code. The output of such a map function is as follows:

    [{"94303": ["Tom", "Jane"]}, {"94301": ["Arun"]}, {"94302": ["Chen"]}]

Now a reduce function could work on this output to simply count the number of people who belong to a particular zip code. The final output then would be as follows:

    [{"94303": 2}, {"94301": 1}, {"94302": 1}]

This example is extremely simple and a MapReduce mechanism seems too complex for such a manipulation, but I hope you get the core idea behind the concepts and the flow.

Carlo Strozzi used the term NoSQL in 1998 to name his lightweight, open-source relational database that did not expose the standard SQL interface.
(Strozzi suggests that, as the current NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect.") Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases. The name attempted to label the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide ACID (atomicity, consistency, isolation, durability) guarantees, which are the key attributes of classic relational database systems such as Sybase, IBM DB2, MySQL, Microsoft SQL Server, PostgreSQL, Oracle RDBMS, Informix, Oracle Rdb, etc.

In 2011, work began on UnQL (Unstructured Query Language), a specification for a query language for NoSQL databases. It is built to query collections (versus tables) of documents (versus rows) with loosely defined fields (versus columns). UnQL is a superset of SQL within which SQL is a very constrained type of UnQL for which the queries always return the same fields (same number, names and types). However, UnQL does not cover the data definition language (DDL) SQL statements like CREATE TABLE or CREATE INDEX.

Context and a Bit of History

Before I start with details on the NoSQL types and the concepts involved, it's important to set the context in which NoSQL emerged. Non-relational databases are not new. In fact, the first non-relational stores go back in time to when the first set of computing machines were invented. Non-relational databases thrived through the advent of mainframes and have existed in specialized and specific domains (for example, hierarchical directories for storing authentication and authorization credentials) through the years. However, the non-relational stores that have appeared in the world of NoSQL are a new incarnation, which were born in the world of massively scalable Internet applications.
These non-relational NoSQL stores, for the most part, were conceived in the world of distributed and parallel computing.

Starting out with Inktomi, which could be thought of as the first true search engine, and culminating with Google, it is clear that the widely adopted relational database management system (RDBMS) has its own set of problems when applied to massive amounts of data. The problems relate to efficient processing, effective parallelization, scalability, and costs.

ACID

In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even though that might involve multiple changes (such as debiting one account and crediting another), is a single transaction.

Jim Gray defined these properties of a reliable transaction system in the late 1970s and developed technologies to automatically achieve them. In 1983, Andreas Reuter and Theo Härder coined the acronym ACID to describe them.

Characteristics

Atomicity

Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes.

Consistency

The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including but not limited to constraints, cascades, triggers, and any combination thereof.

Isolation

Isolation refers to the requirement that no transaction should be able to interfere with another transaction.
One way of achieving this is to ensure that no transactions that affect the same rows can run concurrently, since their sequence, and hence the outcome, might be unpredictable. This property of ACID is often partly relaxed due to the huge speed decrease this type of concurrency management entails.

Durability

Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently. If the database crashes immediately thereafter, it should be possible to restore the database to the state after the last transaction committed.

Architecture

Typical modern relational databases have shown poor performance on certain data-intensive applications, including indexing a large number of documents, serving pages on high-traffic websites, and delivering streaming media. Typical RDBMS implementations are tuned either for small but frequent read/write transactions or for large batch transactions with rare write accesses. NoSQL, on the other hand, can service heavy read/write workloads. Real-world NoSQL deployments include Digg's 3 TB for green badges (markers that indicate stories voted for by others in a social network) and Facebook's 50 TB for inbox search.

NoSQL architectures often provide weak consistency guarantees, such as eventual consistency, or transactions restricted to single data items. Some systems, however, provide full ACID guarantees in some instances by adding a supplementary middleware layer (e.g., AppScale and CloudTPS). Two systems have been developed that provide snapshot isolation for column stores: Google's Percolator system based on BigTable, and a transactional system for HBase developed at the University of Waterloo.
These systems, developed independently, use similar concepts to achieve multi-row distributed ACID transactions with snapshot isolation guarantee for the underlying column store, without the extra overhead of data management, middleware system deployment, or maintenance introduced by the middleware layer.

Several NoSQL systems employ a distributed architecture, with the data held in a redundant manner on several servers, often using a distributed hash table. In this way, the system can readily scale out by adding more servers, and failure of a server can be tolerated.

Some NoSQL advocates promote very simple interfaces such as associative arrays or key-value pairs. Other systems, such as native XML databases, promote support of the XQuery standard. Newer systems such as CloudTPS also support join queries.

NoSQL Emerged From a Need

Data Storage: The world's stored digital data is measured in exabytes. An exabyte is equal to one billion gigabytes (GB) of data. According to Internet.com, the amount of stored data added in 2006 was 161 exabytes. Just 4 years later, in 2010, the amount of data stored will be almost 1,000 exabytes, which is an increase of over 500%. In other words, there is a lot of data being stored in the world and it's just going to continue growing.

Interconnected Data: Data continues to become more connected. The creation of the web fostered hyperlinks, blogs have pingbacks, and every major social network system has tags that tie things together. Major systems are built to be interconnected.

Complex Data Structure: NoSQL can handle hierarchical nested data structures easily. To accomplish the same thing in SQL, you would need multiple relational tables with all kinds of keys. In addition, there is a relationship between performance and data complexity. Performance can degrade in a traditional RDBMS as we store the massive amounts of data required in social networking applications and the semantic web.

What is NoSQL?

I guess one way to define NoSQL is to consider what it's not.
It's not SQL and it's not relational. Like the name suggests, it's not a replacement for an RDBMS but complements it. NoSQL is designed for distributed data stores for very large scale data needs. Think about Facebook with its 500,000,000 users or Twitter, which accumulates terabits of data every single day.

In a NoSQL database, there is no fixed schema and no joins. An RDBMS "scales up" by getting faster and faster hardware and adding memory. NoSQL, on the other hand, can take advantage of "scaling out". Scaling out refers to spreading the load over many commodity systems. This is the component of NoSQL that makes it an inexpensive solution for large datasets.

NoSQL Categories

The current NoSQL world fits into 4 basic categories:

1) Key-values Stores:

Key-value stores are based primarily on Amazon's Dynamo paper, which was written in 2007. The main idea is the existence of a hash table where there is a unique key and a pointer to a particular item of data. These mappings are usually accompanied by cache mechanisms to maximize performance.

A HashMap or an associative array is the simplest data structure that can hold a set of key/value pairs. Such data structures are extremely popular because they provide a very efficient, O(1) average running time for accessing data. The key of a key/value pair is a unique value in the set and can be easily looked up to access the data.

Key/value stores are of varied types: some keep the data in memory and some provide the capability to persist the data to disk. Key/value pairs can be distributed and held in a cluster of nodes.

A simple, yet powerful, key/value store is Oracle's Berkeley DB. Berkeley DB is a pure storage engine where both key and value are an array of bytes. The core storage engine of Berkeley DB doesn't attach meaning to the key or the value. It takes byte array pairs in and returns the same back to the calling client.
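The idea of a store that takes byte arrays in and hands the same bytes back, without attaching meaning to them, can be sketched with a plain Python dictionary acting as the hash table. This is an illustrative toy under stated assumptions, not Berkeley DB's API: the class name ByteStore and its put, get, and delete methods are hypothetical, and everything lives in memory.

```python
class ByteStore:
    """Toy key/value store: keys and values are opaque byte strings."""

    def __init__(self):
        # A dict is a hash table, so lookups take O(1) time on average.
        self._table = {}

    def put(self, key: bytes, value: bytes) -> None:
        # The store does not interpret the bytes; it only checks the type.
        if not isinstance(key, bytes) or not isinstance(value, bytes):
            raise TypeError("keys and values must be bytes")
        self._table[key] = value

    def get(self, key: bytes, default=None):
        # Returns exactly the bytes that were stored, or default if absent.
        return self._table.get(key, default)

    def delete(self, key: bytes) -> bool:
        # True if the key existed and was removed, False otherwise.
        return self._table.pop(key, None) is not None


store = ByteStore()
# Encoding and decoding are the caller's job, not the store's.
store.put(b"user:1", "Jane".encode("utf-8"))
print(store.get(b"user:1").decode("utf-8"))  # Jane
print(store.get(b"user:2"))                  # None
```

Keeping the values opaque is what lets such an engine stay simple: any structure (strings, serialized objects, images) is the client's concern.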
Berkeley DB allows data to be cached in memory and flushed to disk as it grows. There is also a notion of indexing the keys for faster lookup and access. Berkeley DB has existed since the mid-1990s. It was created to replace AT&T's NDBM as a part of migrating from BSD 4.3 to 4.4. In 1996, Sleepycat Software was formed to maintain and provide support for Berkeley DB.

Another type of key/value store in common use is a cache. A cache provides an in-memory snapshot of the most-used data in an application. The purpose of cache is to reduce disk I/O. Cache systems could be rudimentary map structures or robust systems with a cache expiration policy. Caching is a popular strategy employed at all levels of a computer software stack to boost performance. Operating systems, databases, middleware components, and applications use caching.

Robust open-source distributed cache systems like EHCache (http://ehcache.org/) are widely used in Java applications. EHCache could be considered as a NoSQL solution. Another caching system popularly used in web applications is Memcached (http://memcached.org/), which is an open-source, high-performance object caching system. Brad Fitzpatrick created Memcached for LiveJournal in 2003. Apart from being a caching system, Memcached also helps effective memory management by creating a large virtual pool and distributing memory among nodes as required. This prevents fragmented zones where one node could have excess but unused memory and another node could be starved for memory.

As the NoSQL movement has gathered momentum, a number of key/value pair data stores have emerged. Some of these newer stores build on the Memcached API, some use Berkeley DB as the underlying storage, and a few others provide alternative solutions built from scratch.

Many of these key/value stores have APIs that allow get-and-set mechanisms to get and set values. A few, like Redis (http://redis.io/), provide richer abstractions and powerful APIs.
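As a rough illustration of a get-and-set API combined with a cache expiration policy, the sketch below keeps each value together with the time at which it stops being valid. It is an assumption-laden toy, not the API of Memcached, EHCache, or Redis: the class name ExpiringCache, the ttl_seconds parameter, and the injectable clock are all hypothetical, and the fake clock is there only so the expiry can be shown without waiting.

```python
import time


class ExpiringCache:
    """Toy in-memory cache where every entry expires after a fixed time."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be demonstrated
        self._entries = {}       # key -> (value, expires_at)

    def set(self, key, value):
        # Store the value along with the moment it becomes stale.
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default       # cache miss: caller falls back to disk
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # expired: evict lazily on read
            return default
        return value


# Demonstration with a fake clock that is advanced by hand.
now = [0.0]
cache = ExpiringCache(ttl_seconds=10, clock=lambda: now[0])
cache.set("page:/home", "<html>...</html>")
fresh = cache.get("page:/home")      # still within the 10 second window
now[0] = 11.0
stale = cache.get("page:/home")      # expired, so the default is returned
print(fresh)   # <html>...</html>
print(stale)   # None
```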
Redis could be considered as a data structure server because it provides data structures like strings (character sequences), lists, and sets, apart from maps. Also, Redis provides a very rich set of operations to access data from these different types of data structures.

The following is an enumeration of a few important characteristics of some of these stores:

Membase (proposed to be merged into Couchbase, gaining features from CouchDB after the creation of Couchbase, Inc.)
Official Online Resources: www.membase.org/.
History: Project started in 2009 by NorthScale, Inc. (later renamed as Membase). Zynga and NHN have been contributors since the beginning. Membase builds on Memcached and supports Memcached's text and binary protocol. Membase adds a lot of additional features on top of Memcached. It adds disk persistence, data replication, live cluster reconfiguration, and data rebalancing. A number of core Membase creators are also Memcached contributors.
Technologies and Language: Implemented in Erlang, C, and C++.
Access Methods: Memcached-compliant API with some extensions. Can be a drop-in replacement for Memcached.
Open-Source License: Apache License version 2.
Who Uses It: Zynga, NHN, and others.

Kyoto Cabinet
Official Online Resources: http://fallabs.com/kyotocabinet/.
History: Kyoto Cabinet is a successor of Tokyo Cabinet. The database is a simple data file containing records; each is a pair of a key and a value. Every key and value are serial bytes with variable length.
Technologies and Language: Implemented in C++.
Access Methods: Provides APIs for C, C++, Java, C#, Python, Ruby, Perl, Erlang, OCaml, and Lua. The protocol simplicity means there are many, many clients.
Open-Source License: GNU GPL and GNU LGPL.
Who Uses It: Mixi, Inc. sponsored much of its original work before the author left Mixi to join Google. Blog posts and mailing lists suggest that there are many users but no public list is available.

Redis
Official Online Resources: http://redis.io/.
History: Project started in 2009 by Salvatore Sanfilippo. Salvatore created it for his startup LLOOGG (http://lloogg.com/). Redis is still an independent project, but its primary author is employed by VMware, which sponsors its development.
Technologies and Language: Implemented in C.
Access Methods: Rich set of methods and operations. Can access via Redis command-line interface and a set of well-maintained client libraries for languages like Java, Python, Ruby, C, C++, Lua, Haskell, AS3, and more.
Open-Source License: BSD.
Who Uses It: Craigslist.

The three key/value stores listed here are nimble, fast implementations that provide storage for real-time data, temporary frequently used data, or even full-scale persistence.

The key/value stores listed so far provide a strong consistency model for the data they store. However, a few other key/value stores emphasize availability over consistency in distributed deployments. Many of these are inspired by Amazon's Dynamo, which is also a key/value store. Amazon's Dynamo promises exceptional availability and scalability, and forms the backbone for Amazon's distributed, fault-tolerant, and highly available system. Apache Cassandra, Basho Riak, and Voldemort are open-source implementations of the ideas proposed by Amazon Dynamo.

Amazon Dynamo brings a lot of key high-availability ideas to the forefront. The most important of the ideas is that of eventual consistency. Eventual consistency implies that there could be small intervals of inconsistency between replicated nodes as data gets updated among peer-to-peer nodes. Eventual consistency does not mean inconsistency. It just implies a weaker form of consistency than the typical ACID type consistency found in RDBMS.

For now I will list the Amazon Dynamo clones and introduce you to a few important characteristics of these data stores.

Cassandra
Official Online Resources: http://cassandra.apache.org/.
History: Developed at Facebook and open sourced in 2008, Apache Cassandra was donated to the Apache foundation.
Technologies and Language: Implemented in Java.
Access Methods: A command-line access to the store. Thrift interface and an internal Java API exist. Clients for multiple languages including Java, Python, Grails, PHP, .NET, and Ruby are available. Hadoop integration is also supported.
Query Language: A query language specification is in the making.
Open-Source License: Apache License version 2.
Who Uses It: Facebook, Digg, Reddit, Twitter, and others.

Voldemort
Official Online Resources: http://project-voldemort.com/.
History: Created by the data and analytics team at LinkedIn in 2008.
Technologies and Language: Implemented in Java. Provides for pluggable storage using either Berkeley DB or MySQL.
Access Methods: Integrates with Thrift, Avro, and protobuf (http://code.google.com/p/protobuf/) interfaces. Can be used in conjunction with Hadoop.
Open-Source License: Apache License version 2.
Who Uses It: LinkedIn.

Riak
Official Online Resources: http://wiki.basho.com/.
History: Created at Basho, a company formed in 2008.
Technologies and Language: Implemented in Erlang. Also uses a bit of C and JavaScript.
Access Methods: Interfaces for JSON (over HTTP) and protobuf clients exist. Libraries for Erlang, Java, Ruby, Python, PHP, and JavaScript exist.
Open-Source License: Apache License version 2.
Who Uses It: Comcast and Mochi Media.

All three (Cassandra, Riak, and Voldemort) provide open-source Amazon Dynamo capabilities. Cassandra and Riak demonstrate a dual nature as far as their behavior and properties go. Cassandra has properties of both Google Bigtable and Amazon Dynamo. Riak acts both as a key/value store and a document database.

2) Column Family:

Column family stores were created to store and process very large amounts of data distributed over many machines. There are still keys but they point to multiple columns.
In the case of BigTable (Google's Column Family NoSQL model), rows are identified by a row key, with the data sorted and stored by this key. The columns are arranged by column family.

Google's Bigtable espouses a model where data is stored in a column-oriented way. This contrasts with the row-oriented format in an RDBMS. The column-oriented storage allows data to be stored efficiently. It avoids consuming space when storing nulls by simply not storing a column when a value doesn't exist for that column.

Each unit of data can be thought of as a set of key/value pairs, where the unit itself is identified with the help of a primary identifier, often referred to as the primary key. Bigtable and its clones tend to call this primary key the row-key. Units are also stored in a sorted, ordered manner: the units of data are sorted and ordered on the basis of the row-key. To explain sorted ordered column-oriented stores, an example serves better than a lot of text, so let me present an example to you. Consider a simple table of values that keeps information about a set of people. Such a table could have columns like first_name, last_name, occupation, zip_code, and gender. A person's information in this table could be as follows:

first_name: John
last_name: Doe
zip_code: 10001
gender: male

Another set of data in the same table could be as follows:

first_name: Jane
zip_code: 94303

The row-key of the first data point could be 1 and the second could be 2. Then data would be stored in a sorted ordered column-oriented store in a way that the data point with row-key 1 will be stored before a data point with row-key 2, and also that the two data points will be adjacent to each other.

Next, only the valid key/value pairs would be stored for each data point. So, a possible column-family for the example could be name, with columns first_name and last_name being its members.
Another column-family could be location, with zip_code as its member. A third column-family could be profile. The gender column could be a member of the profile column-family. In column-oriented stores similar to Bigtable, data is stored on a column-family basis. Column-families are typically defined at configuration or startup time. Columns themselves need no a-priori definition or declaration. Also, columns are capable of storing any data type as long as the data can be persisted to an array of bytes.

So the underlying logical storage for this simple example consists of three storage buckets: name, location, and profile. Within each bucket, only key/value pairs with valid values are stored. Therefore, the name column-family bucket stores the following values:

For row-key: 1
first_name: John
last_name: Doe
For row-key: 2
first_name: Jane

The location column-family stores the following:

For row-key: 1
zip_code: 10001
For row-key: 2
zip_code: 94303

The profile column-family has values only for the data point with row-key 1, so it stores only the following:

For row-key: 1
gender: male

In real storage terms, the column-families are not physically isolated for a given row. All data pertaining to a row-key is stored together. The column-family acts as a key for the columns it contains, and the row-key acts as the key for the whole data set.

Data in Bigtable and its clones is stored in a contiguous sequenced manner. As data grows to fill up one node, it is split across multiple nodes. The data is sorted and ordered not only on each node but also across nodes, providing one large continuously sequenced set. The data is persisted in a fault-tolerant manner where three copies of each data set are maintained. Most Bigtable clones leverage a distributed file system to persist data to disk. Distributed file systems allow data to be stored among a cluster of machines.

The sorted ordered structure makes data seek by row-key extremely efficient.
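
The layout described above can be sketched in a few lines of Python. This is only an illustration and not how Bigtable or HBase are actually implemented: the function names (to_row, node_for, put, get), the two-node cluster, and the split point are all assumptions made for the example. The sketch groups the John and Jane records into the name, location, and profile column-families, skips missing values instead of storing nulls, and uses a sorted list of split points to find the node that holds a given row-key.

```python
from bisect import bisect_right

# Column-families are fixed up front; the columns inside them are not.
COLUMN_FAMILIES = {
    "name": {"first_name", "last_name"},
    "location": {"zip_code"},
    "profile": {"gender"},
}


def to_row(record):
    """Group a flat record into column-families, skipping missing values."""
    row = {}
    for family, columns in COLUMN_FAMILIES.items():
        cells = {c: v for c, v in record.items() if c in columns and v is not None}
        if cells:  # an empty family is simply not stored
            row[family] = cells
    return row


# Hypothetical two-node cluster. Each node holds a contiguous range of
# row-keys; split_points[i] is the first row-key stored on node i + 1.
split_points = [2]
nodes = [{}, {}]


def node_for(row_key):
    """Binary search over the sorted split points to find the owning node."""
    return bisect_right(split_points, row_key)


def put(row_key, record):
    nodes[node_for(row_key)][row_key] = to_row(record)


def get(row_key):
    return nodes[node_for(row_key)].get(row_key)


put(1, {"first_name": "John", "last_name": "Doe", "zip_code": "10001", "gender": "male"})
put(2, {"first_name": "Jane", "zip_code": "94303"})
```

In this sketch, get(2) returns only the name and location families for Jane; no profile entry exists for her at all, which mirrors the point that columns without a value are not stored.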
Data access is less random and ad-hoc, and lookup is as simple as finding the node in the sequence that holds the data. Data is inserted at the end of the list. Updates are in-place but often imply adding a newer version of data to the specific cell rather than in-place overwrites. This means a few versions of each cell are maintained at all times. The versioning property is usually configurable.

HBase is a popular, open-source, sorted ordered column-family store that is modeled on the ideas proposed by Google's Bigtable. Data stored in HBase can be manipulated using the MapReduce infrastructure. Hadoop's MapReduce tools can easily use HBase as the source and/or sink of data.

The best way to learn about and leverage the ideas proposed by Google's infrastructure is to start with the Hadoop (http://hadoop.apache.org) family of products. The NoSQL Bigtable store called HBase is part of the Hadoop family.

HBase
Official Online Resources: http://hbase.apache.org.
History: Created at Powerset (now part of Microsoft) in 2007. Donated to the Apache foundation before Powerset was acquired by Microsoft.
Technologies and Language: Implemented in Java.
Access Methods: A JRuby shell allows command-line access to the store. Thrift, Avro, REST, and protobuf clients exist. A few language bindings are also available. A Java API is available with the distribution. Protobuf, short for Protocol Buffers, is Google's data interchange format. More information is available online at http://code.google.com/p/protobuf/.
Query Language: No native querying language. Hive (http://hive.apache.org) provides a SQL-like interface for HBase.
Open-Source License: Apache License version 2.
Who Uses It: Facebook, StumbleUpon, Hulu, Ning, Mahalo, Yahoo!, and others.

WHAT IS THRIFT?
Thrift is a software framework and an interface definition language that allows cross-language services and API development.
Services generated using Thrift work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml. Thrift was created by Facebook in 2007. It's an Apache incubator project. You can find more information on Thrift at http://incubator.apache.org/thrift/.

Hypertable
Official Online Resources: www.hypertable.org.
History: Created at Zvents in 2007. Now an independent open-source project.
Technologies and Language: Implemented in C++; uses the Google RE2 regular expression library. RE2 provides a fast and efficient implementation. Hypertable promises a performance boost over HBase, potentially serving to reduce time and cost when dealing with large amounts of data.
Access Methods: A command-line shell is available. In addition, a Thrift interface is supported. Language bindings have been created based on the Thrift interface. A creative developer has even created a JDBC-compliant interface for Hypertable.
Query Language: HQL (Hypertable Query Language) is a SQL-like abstraction for querying Hypertable data. Hypertable also has an adapter for Hive.
Open-Source License: GNU GPL version 2.
Who Uses It: Zvents, Baidu (China's biggest search engine), Rediff (India's biggest portal).

Cloudata
Official Online Resources: www.cloudata.org/.
History: Created by a Korean developer named YK Kwon (www.readwriteweb.com/hack/2011/02/open-source-bigtable-cloudata.php). Not much is publicly known about its origins.
Technologies and Language: Implemented in Java.
Access Methods: Command-line access is available. Thrift, REST, and Java API are available.
Query Language: CQL (Cloudata Query Language) defines a SQL-like query language.
Open-Source License: Apache License version 2.
Who Uses It: Not known.

Sorted ordered column-family stores form a very popular NoSQL option. However, NoSQL also includes many variants of key/value stores and document databases. Key/value stores were covered earlier; document databases come next.
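
As a closing illustration for this section, the cell versioning mentioned earlier (an update adds a newer version of a cell instead of overwriting it, and the number of retained versions is configurable) can be sketched as follows. The class name VersionedCell, the max_versions parameter, and the counter used in place of a real timestamp are assumptions made for this example; they are not taken from HBase, Hypertable, or Cloudata.

```python
import itertools


class VersionedCell:
    """Keeps the newest max_versions values of one cell, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = []  # list of (timestamp, value), newest first
        self._clock = itertools.count(1)  # stand-in for a real timestamp

    def put(self, value):
        # An update adds a new version rather than overwriting the old one.
        self._versions.insert(0, (next(self._clock), value))
        # Versions beyond the configured limit are discarded.
        del self._versions[self.max_versions:]

    def get(self, versions=1):
        """Return up to `versions` values, newest first."""
        return [value for _, value in self._versions[:versions]]


cell = VersionedCell(max_versions=2)
for zip_code in ("10001", "10002", "10003"):
    cell.put(zip_code)
```

After the three writes, cell.get() yields only the newest value, while cell.get(versions=5) yields the two retained versions, because the oldest one has been dropped.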
3) Document Databases: These were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON.

Document databases are not document management systems. More often than not, developers starting out with NoSQL confuse document databases with document and content management systems. The word document in document databases connotes loosely structured sets of key/value pairs in documents, typically JSON (JavaScript Object Notation), and not documents or spreadsheets (though these could be stored too).

The central concept of a document-oriented database is the notion of a Document. While each document-oriented database implementation differs on the details of this definition, in general, they all assume documents encapsulate and encode data (or information) in some standard formats or encodings. Encodings in use include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on).

Documents inside a document-oriented database are similar, in some ways, to records or rows in relational databases, but they are less rigid. They are not required to adhere to a standard schema, nor will they all have the same sections, slots, parts, keys, or the like. For example, here's a document:

FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing".

Another document could be:

FirstName:"Jonathan", Address:"15 Wanamassa Point Road", Children:[{Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2}].

Both documents have some similar information and some different. Unlike a relational database, where each record would have the same set of fields and unused fields might be kept empty, there are no empty 'fields' in either document (record) in this case.
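
The two documents above can be held side by side in one collection, as the following Python sketch suggests. It is a toy in-memory store written for illustration only; the class DocumentStore, its put, get, and find methods, and the keys "bob" and "jonathan" are assumptions for this example and do not correspond to the API of MongoDB, CouchDB, or any other product.

```python
class DocumentStore:
    """Toy in-memory collection of schema-free documents addressed by key."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs[key] = document

    def get(self, key):
        return self._docs.get(key)

    def find(self, **criteria):
        """Return the keys of documents whose fields equal all criteria."""
        return sorted(
            key
            for key, doc in self._docs.items()
            if all(field in doc and doc[field] == value
                   for field, value in criteria.items())
        )


store = DocumentStore()
store.put("bob", {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
store.put("jonathan", {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [
        {"Name": "Michael", "Age": 10},
        {"Name": "Jennifer", "Age": 8},
        {"Name": "Samantha", "Age": 5},
        {"Name": "Elena", "Age": 2},
    ],
})
```

Here store.find(Hobby="sailing") matches only Bob's document. Jonathan's document has no Hobby field at all, so nothing has to be stored or checked as an empty value.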
This system allows new information to be added, and it does not require explicitly stating if other pieces of information are left out.

Keys
Documents are addressed in the database via a unique key that represents that document. Often, this key is a simple string. In some cases, this string is a URI or path. Regardless, you can use this key to retrieve the document from the database. Typically, the database retains an index on the key such that document retrieval is fast.

Retrieval
One of the other defining characteristics of a document-oriented database is that, beyond the simple key-document (or key-value) lookup that you can use to retrieve a document, the database will offer an API or query language that will allow you to retrieve documents based on their contents. For example, you may want a query that gets you all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to the next.

Organization
Implementations offer a variety of ways of organizing documents, including notions of:
Collections
Tags
Non-visible Metadata
Directory hierarchies

Document databases treat a document as a whole and avoid splitting a document into its constituent name/value pairs. At a collection level, this allows for putting together a diverse set of documents into a single collection. Document databases allow indexing of documents on the basis of not only their primary identifier but also their properties. A few different open-source document databases are available today, but the most prominent among the available options are MongoDB and CouchDB.

MongoDB
Official Online Resources: www.mongodb.org.
History: Created at 10gen.
Technologies and Language: Implemented in C++.
Access Methods: A JavaScript command-line interface. Drivers exist for a number of languages including C, C#, C++, Erlang,
Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, and Scala.
Query Language: SQL-like query language.
Open-Source License: GNU Affero GPL (http://gnu.org/licenses/agpl-3.0.html).
Who Uses It: FourSquare, Shutterfly, Intuit, Github, and more.

CouchDB
Official Online Resources: http://couchdb.apache.org and www.couchbase.com. Most of the authors are part of Couchbase, Inc.
History: Work started in 2005 and it was incubated into Apache in 2008.
Technologies and Language: Implemented in Erlang with some C and a JavaScript execution environment.
Access Methods: Upholds REST above every other mechanism. Use standard web tools and clients to access the database, the same way as you access web resources.
Open-Source License: Apache License version 2.
Who Uses It: Apple, BBC, Canonical, Cern, and more at http://wiki.apache.org/couchdb/CouchDB_in_the_wild.

4) Graph Databases: These are built with nodes, relationships between nodes, and the properties of nodes. Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used, which can scale across many machines.

So far I have listed most of the mainstream open-source NoSQL products. A few other products like Graph databases and XML data stores could also qualify as NoSQL databases. This book does not cover Graph and XML databases. However, I list the two Graph databases that may be of interest and something you may want to explore beyond this book: Neo4j and FlockDB.

Neo4j is an ACID-compliant graph database. It facilitates rapid traversal of graphs.

Neo4j
Official Online Resources: http://neo4j.org.
History: Created at Neo Technologies in 2003. (Yes, this database has been around since before the term NoSQL was popularly known.)
Technologies and Language: Implemented in Java.
Access Methods: Command-line access to the store is provided. REST interface also available.
Client libraries for Java, Python, Ruby, Clojure, Scala, and PHP exist.
Query Language: Supports SPARQL protocol and RDF Query Language.
Open-Source License: AGPL.
Who Uses It: Box.net.

FlockDB
Official Online Resources: https://github.com/twitter/flockdb
History: Created at Twitter and open sourced in 2010. Designed to store the adjacency lists for followers on Twitter.
Technologies and Language: Implemented in Scala.
Access Methods: A Thrift and Ruby client.
Open-Source License: Apache License version 2.
Who Uses It: Twitter.

Major NoSQL Players

The major players in NoSQL have emerged primarily because of the organizations that have adopted them. Some of the largest NoSQL technologies include:

Dynamo: Dynamo was created by Amazon.com and is the most prominent Key-Value NoSQL database. Amazon needed a highly scalable distributed platform for its e-commerce businesses, so it developed Dynamo. Amazon S3 uses Dynamo as the storage mechanism.
Cassandra: Cassandra was open sourced by Facebook and is a column-oriented NoSQL database.
BigTable: BigTable is Google's proprietary column-oriented database. Google allows the use of BigTable, but only for the Google App Engine.
SimpleDB: SimpleDB is another Amazon database. Used for Amazon EC2 and S3, it is part of Amazon Web Services, which charges fees depending on usage.
CouchDB: CouchDB and MongoDB are open-source document-oriented NoSQL databases.
Neo4j: Neo4j is an open-source graph database.

Querying NoSQL

The question of how to query a NoSQL database is what most developers are interested in. After all, data stored in a huge database doesn't do anyone any good if you can't retrieve it and show it to end users or web services. NoSQL databases do not provide a high-level declarative query language like SQL. Instead, querying these databases is data-model specific. Many of the NoSQL platforms allow for RESTful interfaces to the data. Others offer query APIs.
There are a couple of query tools that have been developed that attempt to query multiple NoSQL databases. These tools typically work across a single NoSQL category. One example is SPARQL. SPARQL is a declarative query specification designed for graph databases. Here is an example of a SPARQL query that retrieves the URL of a particular blogger (courtesy of IBM):

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?url
    FROM <bloggers.rdf>
    WHERE {
      ?contributor foaf:name "Jon Foobar" .
      ?contributor foaf:weblog ?url .
    }

Future of NoSQL

Organizations that have massive data storage needs are looking seriously at NoSQL. Apparently, the concept isn't getting as much traction in smaller organizations. In a survey conducted by Information Week, 44% of business IT professionals hadn't heard of NoSQL. Further, only 1% of the respondents reported that NoSQL is a part of their strategic direction. Clearly, NoSQL has its place in our connected world, but it will need to continue to evolve to get the mass appeal that many think it could have.

Let's consider the Twitter example. A tweet is a very small piece of text created by one user. Twitter should be able to save it quickly and then distribute it widely to that user's followers so they can all read it. Now, if I go to that user's profile and don't immediately see their latest tweet, it's not the end of the world. If it shows up seconds (or even minutes?) later, no big deal. It's a tweet. Twitter used to use a relational database (MySQL) and caching (memcached) to handle this, but they're switching to Cassandra. A "key-value" store like Cassandra is designed for this kind of data and can be massively scaled horizontally. Other key-value databases include Redis and Riak, created by Boston's own Basho.

Consider other "real-world" examples that you might need to model in your web application:
resume
business card
receipt
All of these are self-contained data structures.
Everything you need in real life is in one place. Yet, if you were to model this in a relational database, you'd probably have a "persons" table, an "orders" table, and a "line_items" table, splitting the data into its atomic pieces. But if you're the user, all you want is the receipt for your purchase. A class of NoSQL databases called document databases, such as MongoDB and CouchDB, is a good fit here. You might have a Receipt document that stores your customers' receipts, then just send that data cleanly back to the user.

So, is your data relational? Do parts of your application better fit a different data structure that NoSQL tools are really good at?

1. Elastic scaling
For years, database administrators have relied on scale up (buying bigger servers as database load increases) rather than scale out (distributing the database across multiple hosts as load increases). However, as transaction rates and availability requirements increase, and as databases move into the cloud or onto virtualized environments, the economic advantages of scaling out on commodity hardware become irresistible. RDBMS might not scale out easily on commodity clusters, but the new breed of NoSQL databases are designed to expand transparently to take advantage of new nodes, and they're usually designed with low-cost commodity hardware in mind.

2. Big data
Just as transaction rates have grown out of all recognition over the last decade, the volumes of data that are being stored have also increased massively. O'Reilly has cleverly called this the "industrial revolution of data." RDBMS capacity has been growing to match these increases, but as with transaction rates, the constraints on data volumes that can be practically managed by a single RDBMS are becoming intolerable for some enterprises. Today, the volumes of "big data" that can be handled by NoSQL systems, such as Hadoop, outstrip what can be handled by the biggest RDBMS.

3.
Goodbye DBAs (see you later?)
Despite the many manageability improvements claimed by RDBMS vendors over the years, high-end RDBMS systems can be maintained only with the assistance of expensive, highly trained DBAs. DBAs are intimately involved in the design, installation, and ongoing tuning of high-end RDBMS systems. NoSQL databases are generally designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements, in theory. In practice, it's likely that rumors of the DBA's death have been slightly exaggerated. Someone will always be accountable for the performance and availability of any mission-critical data store.

4. Economics
NoSQL databases typically use clusters of cheap commodity servers to manage the exploding data and transaction volumes, while RDBMS tends to rely on expensive proprietary servers and storage systems. The result is that the cost per gigabyte or transaction/second for NoSQL can be many times less than the cost for RDBMS, allowing you to store and process more data at a much lower price point.

5. Flexible data models
Change management is a big headache for large production RDBMS. Even minor changes to the data model of an RDBMS have to be carefully managed and may necessitate downtime or reduced service levels. NoSQL databases have far more relaxed, or even nonexistent, data model restrictions. NoSQL Key Value stores and document databases allow the application to store virtually any structure it wants in a data element. Even the more rigidly defined BigTable-based NoSQL databases (Cassandra, HBase) typically allow new columns to be created without too much fuss. The result is that application changes and database schema changes do not have to be managed as one complicated change unit.
In theory, this will allow applications to iterate faster, though, clearly, there can be undesirable side effects if the application fails to manage data integrity.

The promise of the NoSQL database has generated a lot of enthusiasm, but there are many obstacles to overcome before these databases can appeal to mainstream enterprises. Here are a few of the top challenges.

1. Maturity
RDBMS systems have been around for a long time. NoSQL advocates will argue that their advancing age is a sign of their obsolescence, but for most CIOs, the maturity of the RDBMS is reassuring. For the most part, RDBMS systems are stable and richly functional. In comparison, most NoSQL alternatives are in preproduction versions with many key features yet to be implemented. Living on the technological leading edge is an exciting prospect for many developers, but enterprises should approach it with extreme caution.

2. Support
Enterprises want the reassurance that if a key system fails, they will be able to get timely and competent support. All RDBMS vendors go to great lengths to provide a high level of enterprise support. In contrast, most NoSQL systems are open source projects, and although there are usually one or more firms offering support for each NoSQL database, these companies often are small start-ups without the global reach, support resources, or credibility of an Oracle, Microsoft, or IBM.

3. Analytics and business intelligence
NoSQL databases have evolved to meet the scaling demands of modern Web 2.0 applications. Consequently, most of their feature set is oriented toward the demands of these applications. However, data in an application has value to the business that goes beyond the insert-read-update-delete cycle of a typical Web application. Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for all medium to large companies.
NoSQL databases offer few facilities for ad-hoc query and analysis. Even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL. Some relief is provided by the emergence of solutions such as HIVE or PIG, which can provide easier access to data held in Hadoop clusters and perhaps, eventually, other NoSQL databases. Quest Software has developed a product, Toad for Cloud Databases, that can provide ad-hoc query capabilities to a variety of NoSQL databases.

4. Administration
The design goals for NoSQL may be to provide a zero-admin solution, but the current reality falls well short of that goal. NoSQL today requires a lot of skill to install and a lot of effort to maintain.

5. Expertise
There are literally millions of developers throughout the world, and in every business segment, who are familiar with RDBMS concepts and programming. In contrast, almost every NoSQL developer is in a learning mode. This situation will resolve itself naturally over time, but for now, it's far easier to find experienced RDBMS programmers or administrators than a NoSQL expert.

Conclusion
NoSQL databases are becoming an increasingly important part of the database landscape, and when used appropriately, can offer real benefits. However, enterprises should proceed with caution, with full awareness of the legitimate limitations and issues that are associated with these databases.

NoSQL is also the name of a shell-based relational database management system that runs under Unix-like operating systems, or others with compatibility layers (e.g., Cygwin under Windows). Its name merely reflects the fact that it does not express its queries using Structured Query Language; the NoSQL RDBMS is distinct from the circa-2009 general concept of NoSQL databases, which are typically non-relational. The NoSQL RDBMS is released under the GNU GPL.

OK, so we have reached the end of No SQL!
Maybe you want to choose your group (remember the Preface):
I. Love it
II. Deny it
III. Ignore it

I hope that I have covered the main concepts of No SQL, such as its definition, its advantages, and its uses in other areas. But we shouldn't forget that this new database technology still has a long way to go. Let's see how far it will go, and whether it can stand up to the challenges that I mentioned earlier. Hope so.

A. Akhtar

http://www.techrepublic.com
http://en.wikipedia.org/wiki/NoSQL
http://newtech.about.com/od/databasemanagement/a/Nosql.htm
