Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Unit 2

Download as pdf or txt
Download as pdf or txt
You are on page 1of 64

 Introduction

 Aggregate data model


 Distribution Models
 Consistency
 Map-Reduce
 Types of NoSQL Databases
▪ Key-Value
▪ Document
▪ Column Family
▪ Graph

John Edgar 2
 Relational databases are the prevalent
solution for storing enterprise data
 Some of the main benefits of RDBMS are
▪ Access to persistent data
▪ ACID Properties
▪ Integration
▪ Where multiple applications share data
▪ Standardized model

John Edgar 4
 A major disadvantage with a RDBMS is the impedance
mismatch with the object model
 This is partly mitigated by the availability of object-
relational mapping frameworks
 The typical DB model uses a RDBMS as an integration
database
▪ Providing a data source for multiple applications
 An alternative is to create separate application databases
▪ And use web services to integrate the application databases

Impedance mismatch – a term borrowed from electrical engineering


where (very broadly) the output does not match the input

John Edgar 5
 The volume of data has grown dramatically in
recent years
▪ Caused by the onset of the web as a vehicle for
trade, information and social networking
▪ With the growth in data came a dramatic growth
in users
 Managing the increase in data requires more
computing resources
▪ Scale up or out
John Edgar 6
 There are two basic methods for scaling
computing resources
▪ Vertical scaling ("up")
▪ Buy bigger, more powerful machines
▪ Horizontal scaling ("out")
▪ Buy more, cheaper, machines
 Many small machines in a cluster ends up
being cheaper than scaling up
▪ And provides more resilience
John Edgar 7
 RDBMS are not designed to run on clusters
▪ Though many DBMS products support distributed
databases
▪ Primarily through writing to a highly available disk subsystem
▪ Though this subsystem may still be a single point of failure
 Organizations sought alternatives to RDBMS
▪ Google Dynamo
▪ Amazon BigTable
▪ Neither of these products use SQL
▪ They are NoSQL databases
John Edgar 8
 There is no one definition of NoSQL
▪ And NoSQL databases differ significantly
▪ There is even disagreement over whether it stands for No
SQL or Not only SQL
 Common NoSQL database characteristics
▪ Does not use relational model
▪ Runs well on clusters
▪ Open source (not always)
▪ Built for 21st century web applications
▪ Schema-less

John Edgar 9
 NoSQL databases move away from the
relational data model
 There are four main types of NoSQL database
▪ Key-value
▪ Document
▪ Column store
▪ Graph
 The data models used by the first three have
some similarities
John Edgar 11
 Modern programming language data structures
have more complex structure than relations
▪ That allow lists and other structures to be nested
▪ We will refer to such structures as aggregates
▪ A collection of related objects to be treated as a unit
 Using aggregates makes it easier for databases
to operate in clusters
▪ The aggregate can be used as a unit for replication
and sharding
John Edgar 12
 A relational database captures relationships
using foreign keys in tables
▪ Combining tables entails joins
▪ Which may be expensive
▪ It does not capture the structure of an aggregate
 The same relational schema could have many
different corresponding aggregates
▪ Let's look at our bank example

John Edgar 13
firstName
type balance city

customerID lastName accNumber rate branchName phone

Customer owns Account holds Branch

birthDate income

transacts works

transNumber Transaction sin Employee


firstName
amount description lastName startDate
transDate salary

John Edgar 14
 There are many different possible aggregates
▪ For example a Branch object could include a container
holding its associated accounts
▪ Or Account objects could be independent of the Branch
and contain a reference to the branch
 There are similar variations between the relationships
between other entities
 The aggregates that are chosen should reflect the way
that data is manipulated
▪ Data related to a single aggregate should be maintained
at the same cluster
John Edgar 15
 NoSQL databases are schema-less
 Before storing data in a relational database the schema
has to be defined
▪ Tables, columns and their domains are defined
 NoSQL databases are much more casual
▪ Key-value store allows any data to be associated with the key
▪ Document databases do not make restrictions on what is
contained in a document
▪ Column family databases allow any data to be stored in any
column

John Edgar 16
 Advantages
▪ Freedom and flexibility
▪ New data can be added as required
▪ Old data can be retained since un-needed columns do not
have to be deleted
▪ Easier to deal with non-uniform data
 Disadvantages
▪ In practice most programs rely on schemata
▪ If the schema is only implicit it must be deduced
▪ The schema is moved from the data store to the application

John Edgar 17
 NoSQL databases handle large amounts of
data by scaling out
▪ Running on clusters of machines
 The data to be stored needs to be distributed
across the cluster
▪ Sharding
▪ Replication

John Edgar 19
 Sharding distributes data between nodes
▪ The goal is for users to get all, or most of, their data
from one server
 Sharding methods
▪ By physical location
▪ Locate the Vancouver accounts in Vancouver servers
▪ Locate aggregates that are likely to be accessed
together or in sequence in the same location
▪ Many NoSQL databases perform automatic sharding
John Edgar 20
 Sharding can improve both read and write
performance
▪ Sharding allows horizontal scaling for both reads and
writes
 However sharding does not improve resilience
▪ Since sharding distributes data across many machines
there is a larger chance of failure
▪ Particularly compared to a single machine that is highly
maintained

John Edgar 21
 Replication is the process of maintaining multiple
copies of data
▪ To improve read performance
▪ And improve availability and resilience
 Replication works better for read-intensive databases
▪ Since all copies of the data have to be updated when
processing writes
▪ There are two replication schemes that handle writes in
different ways
 Replication may lead to inconsistency

John Edgar 22
 In master-slave replication one copy is maintained as
the definitive data source
▪ All updates are performed on this master copy and then
propagated to the slaves
▪ Read requests are handled by the slaves
 Since the master handles all updates it is not good for
write-intensive systems
 If the master fails one of the slaves is appointed as the
new master
▪ Either manually or automatically

John Edgar 23
 Master-slave replication does not improve write
scalability
▪ Resilience is improved for slaves but not the master
▪ Master is a bottleneck and a single point of failure
 In peer to peer replication all replicas accept writes
and have equal weight
▪ There is a trade-off between availability and inconsistency
▪ Read inconsistency can occur when changes have not been
propagated to all replicas
▪ Write inconsistency can occur when two updates are performed at
the same time

John Edgar 24
 Sharding and replication can be combined
▪ Each shard is replicated
 In master – slave replication there is one
master for each shard
 Peer to peer replication of shards is
commonly used for column family databases

John Edgar 25
 Relational databases guarantee consistency
▪ Through ensuring that transactions are processed atomically as
if they occurred in isolation
▪ Databases interleave actions of transactions to improve
throughput
▪ While identifying and preventing conflicts that could leave the
database in an inconsistent state
▪ Often through locking
 When actions of two transactions conflict the database
prevents one from starting before the other has finished
▪ Guaranteeing consistency becomes more difficult when the
data resides on multiple servers
John Edgar 27
Consider two transactions that affect a single bank account which initially holds
$1,000. In one transaction Bob (T1) is going to deposit $500 and in the second interest
of 10% is going to be added to the account.

T1 T2 A T1 T2 A
Read(A) 1,000 Read(A) 1,000
Read(A) 1,000 Write(A) 1,500
Write(A) 1,500 Read(A) 1,500
Write(A) 1,100 Write(A) 1,650

This interleaving results in a lost In this version the database locks A


update caused by an unrepeatable which prevents T2 from acting on A
read which leaves the database in (including reading it) until T1 has
an inconsistent state completed

John Edgar 28
 It’s important to understand that locking mechanisms
come with a cost
▪ A reduction in throughput
▪ i.e. less transactions are processed in a given time
 Controlling concurrency through locking in a
distributed system is time consuming
▪ The process is more complex
▪ And it takes time to communicate across the network
 NoSQL databases often relax their requirements for
consistency

John Edgar 29
 Controlling concurrency on a distributed
system becomes more complex
▪ Particularly if the system is using peer to peer
replication
▪ Where any node can process updates
 Conflicts must be detected across nodes
▪ And updates must be processed in the same order
▪ Note that in the example the account’s balance would
be different if T2 occurred before T1

John Edgar 30
 A common claim is that NoSQL databases to not
guarantee ACID transactions
▪ That is, they drop acid
▪ Specifically they do not support atomic transactions
 Aggregate oriented databases do support atomic
transactions The typical inconsistency window for
Amazon's SimpleDB is claimed to be
▪ But only within aggregates less than one second
▪ And not necessarily across aggregates
 Updates that affect multiple aggregates may result in
inconsistent reads for some time
▪ Known as the inconsistency window

John Edgar 31
 An additional issue for distributed databases
is replication consistency
▪ Where updates reach different replicas at
different times
▪ This may result in two users reading different
values at the same time
 Over time replicas will have the same values
▪ That is they will be eventually consistent
▪ Replication consistency issues may result in an
increase in the size of the inconsistency window
John Edgar 32
 The CAP theorem states that it is only possible
to maintain two out of three properties
▪ Consistency
▪ Availability
▪ If a node is available it can read and write data, or
▪ Every request received by a non-failing node in the system
must result in a response
▪ Partition tolerance
▪ The cluster can survive communication breakdowns that
separate it into multiple partitions

John Edgar 33
 A distributed system must have partition tolerance
▪ Unlike a single server system which can therefore provide
consistency and availability
▪ Without partition tolerance the implication is that if a
partition occurs all the nodes in the cluster go down
 As partition tolerance is required, distributed systems
must choose between availability and consistency
▪ This is not an either / or choice
▪ Most often it is necessary to reduce consistency to increase
availability

John Edgar 34
 Consider Bob and Kate trying to book the last
hotel room in the Grand Hotel in Vancouver
▪ On a peer to peer system with two nodes
▪ Bob is in Vancouver and Kate is in London
▪ If consistency is to be ensured then London must
confirm Kate's booking with Vancouver
 If the network link fails then neither node can
book rooms
▪ Which sacrifices availability
John Edgar 35
 An alternative is to use master-slave replication
▪ All bookings for Vancouver hotels will be processed by the
Vancouver node
▪ Vancouver is the master
 What happens if the network connection fails?
▪ Bob can still book the last room but Kate cannot
▪ Kate can see that a room is available but cannot book it
▪ There is an availability failure in London
▪ As Kate can talk to the London node but the node is unable to
update data

John Edgar 36
 A third alternative is to allow both nodes to
accept reservations when the connection fails
▪ This increases availability
▪ But may result in both Bob and Kate booking the last
room
▪ A consistency failure
 This reduction in consistency may be acceptable
▪ The hotel does not lose bookings
▪ And may keep a few spare rooms even when fully
booked in case a room has to be vacated
John Edgar 37
 There may be situations where some
inconsistency is permissible
▪ These situations are domain dependent and
would have to be identified
▪ By talking to the client
 Deciding to deal with some inconsistent
updates (or reads) can be very useful
▪ The tradeoff may be for more availability or
performance
John Edgar 38
 NoSQL databases are said to follow the BASE
properties rather than the ACID properties
▪ Basically Available
▪ Soft state
▪ Eventually consistent
 It is debatable how useful this is, or how well
defined the properties are
▪ But its cute …

John Edgar 39
 Read-Your-Writes consistency
▪ Once a process has updated a record it will always
read the updated value
 Session consistency
▪ Read-Your-Writes consistency over a session
 Monotonic Read consistency
▪ Once a process sees a version of a value it will never
see an earlier version of that value
 Monotonic Write consistency
▪ Updates are executed in the order in which they are
performed
John Edgar 40
 A key-value store has two columns
▪ The key – the primary key for the store
▪ The value – which can be anything
 The value in a key-value store is not
understood by the store
▪ It is the responsibility of the application that is
accessing the value
 The structurally simplest NoSQL database

John Edgar 42
 Consistency
▪ Applicable only for single key operations
▪ Eventual consistency a popular model
 Transactions
▪ Varies considerably between products
 Query features
▪ Key-values stores support querying by the key
▪ Querying by attributes of the value column is not supported
 Scaling – by sharding
▪ The value of the key determines the node on which the key is
stored
 Keys should be well designed
▪ Use a naming convention
▪ Use meaningful and unambiguous names
▪ Use consistently named values for ranges
▪ e.g. dates
▪ Use a common delimiter
▪ Keep keys short while complying with the above
▪ Take implementation limitations into account

John Edgar 44
 Generally the value will be driven by the
application but some design issues remain
▪ What aggregates are to be used?
▪ One key – many values (e.g. customer123)
▪ Many keys – many values (e.g. customer123 address,
customer 123name etc.)
▪ Large or small values?
▪ Large values reduce the number of reads
▪ But the time to read and write values increases

John Edgar 45
 Uses for which key-value stores are suitable include
▪ Session information
▪ Storing configuration and user data information
▪ Shopping carts
 Examples of key-value stores
▪ Riak
▪ Redis
▪ Amazon WebServices Simple Db (and DynamoDB)
▪ Project Voldemort

John Edgar 46
 Document databases store … documents …
▪ Often XML, JSON etc.
 Documents are self-describing hierarchical tree
structures
▪ Documents are similar but do not have to be identical
▪ And can have different schema
 Document databases are similar to key-value stores
▪ Except that the value is a document
▪ And the document can be examined, rather than just
being obtained

John Edgar 48
 Consistency
▪ Using MongoDB as an example, the database uses
master-slave replication
▪ The level of consistency can be specified
▪ That is, the number of nodes to which an update has to be
propagated before it is deemed successful
▪ Making write consistency stronger reduce availability
 Availability
▪ Availability is improved through replication
▪ Data is duplicated across nodes
▪ Allowing data to be read even when the master node is unavailable

John Edgar 49
 Queries
▪ Document databases allow documents to be queried
without first retrieving the entire document
▪ Different document databases provide different query
features
 Scaling
▪ Scaling for reads is supported by adding more slaves
▪ Scaling for writes is supported by sharding
▪ A shard key is selected that determines how documents are
broken into shards

John Edgar 50
 Documents can be grouped into collections
▪ Grouping similar documents together
▪ Documents in a collection do not have to have identical
structure
▪ But should contain documents of the general type
▪ Documents in a single collection will typically be
processed by the same application code
▪ If not, consider if the collection should be split
 Documents in a collection can be operated on as
a group
John Edgar 51
 Uses for which document databases are suitable
include
▪ Event logging
▪ Content management systems
▪ Web analytics
▪ E-commerce
 Examples of document databases
▪ MongoDB
▪ CouchDB

John Edgar 52
 Column-family stores group column families and are a
refinement of columnar databases
▪ A columnar database stores each column separately
▪ Which greatly increases the speed of aggregate operations on
column data
▪ But makes accessing an entire row inefficient
 Column-families contain multiple related columns
▪ But may still break down what would be a single table in a
relational database into multiple tables
▪ Rows in a column-family do not have to have the same
columns as other rows
John Edgar 54
 Column family DBs appear similar to relational
databases
▪ They have rows and columns
▪ Rows are identified by unique identifiers
 There are important differences
▪ Column family DBs do not support multi-row transactions
▪ In some column family DBs column values are not typed
▪ The data is interpreted at the application level
▪ Column family DBs are typically de-normalized
▪ The same key may identify different column families
▪ Column values may include lists and other structures
John Edgar 55
Row key Customer column family
Customer ID lname fname address …
Bobson Bob 123 Bobville Rd

Product column family


16735 56122 98735
Thermal Detonator Pulse Rifle BFG

 Different customers rows can have different numbers of products


 The product ID is used as a column name
▪ Data is stored in column name order
▪ Column values, the product names are repeated for each customer
▪ But joins are not required to return customer information

John Edgar 56
 Consistency
▪ Column-family stores use peer to peer replication
▪ The level of consistency can be specified
▪ By selecting the number of nodes that have to respond to a
read or write before it is committed
 Availability
▪ Availability is improved through replication
▪ Using peer to peer replication improves availability
▪ Which can be further improved by reducing consistency

John Edgar 57
 Queries
▪ Column-families can be queried
▪ Although query languages are not as rich as SQL
▪ Do not, for example, allow joins or subqueries
▪ Columns can be indexed to improve efficiency
▪ Data in rows are sorted by column names
 Scaling
▪ Scaling for reads or writes is achieved by adding
additional nodes
John Edgar 58
 Uses for which column-family stores are suitable
include
▪ Event logging
▪ Content management systems, blogging sites
▪ Counters
 Examples of column-family stores
▪ Cassandra
▪ HBase
▪ Google BigTable
▪ Designing BigTable schema

John Edgar 59
 Graph databases are designed to efficiently store
relationships
▪ Nodes map to entities and edges to relationships
 Nodes have properties such as name
 Edges have types such as likes
▪ Edges can be set to bidirectional
▪ Different edges in the same graph can have different types
▪ For example likes and employee
 Traversing relationships in a graph database is fast
▪ The relationships are stored persistently

John Edgar 61
 Graph databases support a set of specialized
operations
▪ Uni0n
▪ Combines two graphs by taking the union
▪ Intersection
▪ Combines two graphs by taking the union
▪ Traversal
▪ Traverses the graph from a given node, visiting all
connected nodes

John Edgar 62
 Isomorphism
▪ Two graphs are isomorphic if they have corresponding vertices
and edges between vertices
 Order – number of vertices
 Size – number of edges
 Degree – the number of edges of a vertex
 Closeness
▪ A measure of distance between vertices
 Betweenness
▪ A measure of how connected vertices are
▪ Can be used to identify vulnerable paths

John Edgar 63
 Uses for which graph databases are suitable
include
▪ Storing connected data such as social networks
▪ Location based services
▪ Recommendation engines
▪ Tracking infections
 Examples of graph databases
▪ Neo4j
▪ Titan
▪ OrientDB

John Edgar 64

You might also like