How does a relational database work
by Christophe | updated: February 7, 2017 | posted: August 19, 2015
Relational databases are used everywhere, from the small and useful SQLite to the powerful Teradata. But there are only a few articles that explain how a database works. You can google by yourself "how does a relational database work" to see how few results there are. Moreover, those articles are short. Now, if you look for the latest trendy technologies (Big Data, NoSQL or JavaScript), you'll find more in-depth articles explaining how they work.

Are relational databases too old and too boring to be explained outside of university courses, research papers and books?
As a developer, I HATE using something I don't understand. And, if databases have been used for 40 years, there must be a reason. Over the years, I've spent hundreds of hours to really understand these weird black boxes I use every day. Relational databases are very interesting because they're based on useful and reusable concepts. If understanding a database interests you but you've never had the time or the will to dig into this wide subject, you should like this article.
Though the title of this article is explicit, the aim of this article is NOT to understand how to use a database. Therefore, you should already know how to write a simple join query and basic CRUD queries; otherwise you might not understand this article. This is the only thing you need to know; I'll explain everything else.

I'll start with some computer science stuff like time complexity. I know that some of you hate this concept but, without it, you can't understand the cleverness inside a database. Since it's a huge topic, I'll focus on what I think is essential: the way a database handles an SQL query. I'll only present the basic concepts behind a database so that at the end of the article you'll have a good idea of what's happening under the hood.
Since it’s a long and technical a icle that involves many algorithms and data structures, take
Search this site... Search
your time to read it. Some concepts are more di cult to understand; you can skip them and
still get the overall idea.
For the more knowledgeable of you, this a icle is more or less divided into 3 pa s:
Contents [show]
Back to basics
A long time ago (in a galaxy far, far away....), developers had to know exactly the number of operations they were coding. They knew by heart their algorithms and data structures because they couldn't afford to waste the CPU and memory of their slow computers.

In this part, I'll remind you about some of these concepts because they are essential to understand a database. I'll also introduce the notion of database index.
O(1) vs O(n²)
Nowadays, many developers don’t care about time complexity … and they’re right!
But when you deal with a large amount of data (I'm not talking about thousands) or if you're fighting for milliseconds, it becomes critical to understand this concept. And guess what, databases have to deal with both situations! I won't bore you for long, just the time to get the idea. This will help us later to understand the concept of cost-based optimization.
The concept
The time complexity is used to see how long an algorithm will take for a given amount
of data. To describe this complexity, computer scientists use the mathematical big O
notation. This notation is used with a function that describes how many operations an
algorithm needs for a given amount of input data.
For example, when I say "this algorithm is in O( some_function() )", it means that for a certain amount of data the algorithm needs some_function(a_certain_amount_of_data) operations to do its job.

What's important is not the amount of data but the way the number of operations increases when the amount of data increases. The time complexity doesn't give the exact number of operations but a good idea.
In this figure, you can see the evolution of different types of complexities. I used a logarithmic scale to plot it. In other words, the amount of data quickly increases from 1 to 1 billion. We can see that:

The worst complexity is O(n²), where the number of operations quickly explodes.
The two other complexities are quickly increasing.
Examples
With a low amount of data, the difference between O(1) and O(n²) is negligible. For example, let's say you have an algorithm that needs to process 2 000 elements.

The difference between O(1) and O(n²) seems like a lot (4 million operations) but you'll lose at most 2 ms, just the time to blink your eyes. Indeed, current processors can handle hundreds of millions of operations per second. This is why performance and optimization are not an issue in many IT projects.
As I said, it's still important to know this concept when facing a huge amount of data. If this time the algorithm needs to process 1 000 000 elements (which is not that big for a database):

I didn't do the math but I'd say with the O(n²) algorithm you have the time to take a coffee (even a second one!). If you put another 0 on the amount of data, you'll have the time to take a long nap.
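To get a feel for these orders of magnitude, here is a tiny Java sketch (not from the original article) that simply prints the approximate number of operations of each complexity class for the two data sizes used above:

public class ComplexityDemo {
    public static void main(String[] args) {
        // approximate operation counts for the data sizes discussed above
        for (int n : new int[]{2_000, 1_000_000}) {
            long logN = (long) (Math.log(n) / Math.log(2));
            System.out.printf("n = %,d -> O(log n) ~ %,d | O(n) ~ %,d | O(n log n) ~ %,d | O(n^2) ~ %,d operations%n",
                    n, logN, (long) n, n * logN, (long) n * n);
        }
    }
}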
Going deeper
Note: In the next parts, we'll see these algorithms and data structures.
I only talked about time complexity, but complexity also applies to memory consumption and disk I/O consumption. And of course there are worse complexities than O(n²), like:

n⁴: that sucks! Some of the algorithms I'll mention have this complexity.
3ⁿ: that sucks even more! One of the algorithms we're going to see in the middle of this article has this complexity (and it's really used in many databases).
n! (factorial n): you'll never get your results, even with a low amount of data.
nⁿ: if you end up with this complexity, you should ask yourself if IT is really your field...
Note: I didn’t give you the real de nition of the big O notation but just the idea. You can read
this a icle on Wikipedia for the real (asymptotic) de nition.
Merge Sort
What do you do when you need to sort a collection? What? You call the sort() function ... ok, good answer... But for a database you have to understand how this sort() function works.

There are several good sorting algorithms so I'll focus on the most important one: the merge sort. You might not understand right now why sorting data is useful but you should after the part on query optimization. Moreover, understanding the merge sort will help us later to understand a common database join operation called the merge join.
Merge
Like many useful algorithms, the merge sort is based on a trick: merging 2 sorted arrays of size N/2 into an N-element sorted array only costs N operations. This operation is called a merge.

You can see on this figure that to construct the final sorted array of 8 elements, you only need to iterate one time in the two 4-element arrays. Since both 4-element arrays are already sorted:

1) you compare both current elements in the 2 arrays (current = first for the first time)
2) then take the lowest one and put it in the 8-element array
3) then go to the next element in the array from which you took the lowest element
and repeat 1, 2, 3 until you reach the last element of one of the arrays.
Then you take the rest of the elements of the other array and put them in the 8-element array.

This works because both 4-element arrays are sorted and therefore you don't need to "go back" in these arrays.
Now that we’ve understood this trick, here is my pseudocode of the merge so .
array mergeSort(array a)
if(length(a)==1)
return a;
end if
//recursive calls
[left_array right_array] := split_into_2_equally_sized_arrays(a);
array new_left_array := mergeSort(left_array);
array new_right_array := mergeSort(right_array);
//merging the 2 small ordered arrays into a big one
array result := merge(new_left_array,new_right_array);
return result;
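If pseudocode isn't your thing, here is a minimal Java sketch of the same idea (the merge() helper is the N-operation merge described above; this is an illustrative version, not the in-place one a real database would use):

import java.util.Arrays;

public class MergeSortDemo {
    // merges two already sorted arrays into one sorted array in left.length + right.length operations
    static int[] merge(int[] left, int[] right) {
        int[] result = new int[left.length + right.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            result[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        }
        while (i < left.length)  result[k++] = left[i++];   // rest of the left array
        while (j < right.length) result[k++] = right[j++];  // rest of the right array
        return result;
    }

    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;                                        // nothing left to split
        int[] left  = mergeSort(Arrays.copyOfRange(a, 0, a.length / 2));    // recursive calls
        int[] right = mergeSort(Arrays.copyOfRange(a, a.length / 2, a.length));
        return merge(left, right);                                          // merge the 2 sorted halves
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mergeSort(new int[]{8, 1, 9, 7, 3, 2, 6, 4})));
    }
}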
The merge sort breaks the problem into smaller problems then finds the results of the smaller problems to get the result of the initial problem (note: this kind of algorithm is called divide and conquer). If you don't understand this algorithm, don't worry; I didn't understand it the first time I saw it. If it can help you, I see this algorithm as a two-phase algorithm:

The division phase where the array is divided into smaller arrays
The sorting phase where the small arrays are put together (using the merge) to form a bigger array.
Division phase
During the division phase, the array is divided into unitary arrays using 3 steps. The formal number of steps is log(N) (since N=8, log(N) = 3).

How do I know that? I'm a genius! In one word: mathematics. The idea is that each step divides the size of the initial array by 2. The number of steps is the number of times you can divide the initial array by two. This is the exact definition of logarithm (in base 2).
Sorting phase
In the sorting phase, you start with the unitary arrays. During each step, you apply multiple merges and the overall cost is N=8 operations:

In the first step you have 4 merges that cost 2 operations each
In the second step you have 2 merges that cost 4 operations each
In the third step you have 1 merge that costs 8 operations

Since there are log(N) steps, the overall cost is N * log(N) operations.
This merge sort is powerful because:

You can modify it in order to reduce the memory footprint, in a way that you don't create new arrays but directly modify the input array.
You can modify it to run on multiple processes/threads/servers; for example, the distributed merge sort is one of the key components of Hadoop (which is THE framework in Big Data).
This sorting algorithm is used in most (if not all) databases but it's not the only one. If you want to know more, you can read this research paper that discusses the pros and cons of the common sorting algorithms in a database.
Array
The two-dimensional array is the simplest data structure. A table can be seen as an array. For example:

Though it's great to store and visualize data, when you need to look for a specific value it sucks.

For example, if you want to find all the guys who work in the UK, you'll have to look at each row to find if the row belongs to the UK. This will cost you N operations (N being the number of rows), which is not bad, but could there be a faster way? This is where trees come into play.
Note: Most modern databases provide advanced arrays to store tables efficiently, like heap-organized tables or index-organized tables. But it doesn't change the problem of fast searching for a specific condition on a group of columns.
Binary search tree

A binary search tree is a binary tree with a special property: the key in each node must be greater than all the keys stored in its left sub-tree and smaller than all the keys stored in its right sub-tree.
The idea
This tree has N=15 elements. Let's say I'm looking for 208:

I start with the root whose key is 136. Since 136<208, I look at the right sub-tree of the node 136.
398>208 so, I look at the left sub-tree of the node 398.
250>208 so, I look at the left sub-tree of the node 250.
200<208 so, I look at the right sub-tree of the node 200. But 200 doesn't have a right sub-tree, so the value doesn't exist (because if it did exist it would be in the right sub-tree of 200).

Now let's say I'm looking for 40:

I start with the root whose key is 136. Since 136>40, I look at the left sub-tree of the node 136.
80>40 so, I look at the left sub-tree of the node 80.
40=40, the node exists. I extract the id of the row inside the node (it's not in the figure) and look at the table for the given row id.
Knowing the row id lets me know where the data is precisely in the table and therefore I can get it instantly.
In the end, both searches cost me the number of levels inside the tree. If you read carefully the part on the merge sort, you should see that there are log(N) levels. So the cost of the search is log(N), not bad!

This search only costs you log(N) operations instead of N operations if you directly use the array. What you've just imagined was a database index.
You can build a tree index for any group of columns (a string, an integer, 2 strings, an integer and a string, a date ...) as long as you have a function to compare the keys (i.e. the group of columns) so that you can establish an order among the keys (which is the case for any basic types in a database).
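Here is a minimal Java sketch of this idea (a hypothetical index node storing a key and the id of the associated row; the tree below reproduces the part of the example tree used in the two searches above):

public class BstIndexDemo {
    // a node of the index: a key and the id of the associated row in the table
    static class Node {
        long key;
        long rowId;
        Node left, right;
        Node(long key, long rowId) { this.key = key; this.rowId = rowId; }
    }

    // returns the row id associated with the key, or -1 if the key is not in the index (log(N) operations)
    static long search(Node node, long key) {
        while (node != null) {
            if (key == node.key) return node.rowId;
            node = (key < node.key) ? node.left : node.right;   // go left or right depending on the key
        }
        return -1;
    }

    public static void main(String[] args) {
        Node root = new Node(136, 1);
        root.left = new Node(80, 2);
        root.left.left = new Node(40, 3);
        root.right = new Node(398, 4);
        root.right.left = new Node(250, 5);
        root.right.left.left = new Node(200, 6);

        System.out.println(search(root, 208));  // -1: 200 has no right sub-tree, so 208 is not in the index
        System.out.println(search(root, 40));   // 3: the row id stored in the node with key 40
    }
}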
B+Tree Index
Although this tree works well to get a specific value, there is a BIG problem when you need to get multiple elements between two values. It will cost O(N) because you'll have to look at each node in the tree and check if it's between these 2 values (for example, with an in-order traversal of the tree). Moreover this operation is not disk I/O friendly since you'll have to read the full tree. We need to find a way to efficiently do a range query. To answer this problem, modern databases use a modified version of the previous tree called a B+Tree. In a B+Tree:

only the lowest nodes (the leaves) store information (the location of the rows in the associated table)
the other nodes are just here to route to the right node during the search.
As you can see, there are more nodes (twice as many). Indeed, you have additional nodes, the "decision nodes" that will help you to find the right node (that stores the location of the rows in the associated table). But the search complexity is still O(log(N)) (there is just one more level). The big difference is that the lowest nodes are linked to their successors.
With this B+Tree, if you're looking for values between 40 and 100:

You just have to look for 40 (or the closest value after 40 if 40 doesn't exist) like you did with the previous tree.
Then gather the successors of 40 using the direct links to the successors until you reach 100.
Let’s say you found M successors and the tree has N nodes. The search for a speci c node
costs log(N) like the previous tree. But, once you have this node, you get the M successors
in M operations with the links to their successors. This search only costs M + log(N)
operations vs N operations with the previous tree. Moreover, you don’t need to read the full
tree (just M + log(N) nodes), which means less disk usage. If M is low (like 200 rows) and N
large (1 000 000 rows) it makes a BIG di erence.
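Writing a full B+Tree is beyond the scope of a short example, but Java's TreeMap (a red-black tree, not a B+Tree, yet also an ordered index) is enough to illustrate the M + log(N) cost of a range scan on a sorted structure:

import java.util.Map;
import java.util.TreeMap;

public class RangeScanDemo {
    public static void main(String[] args) {
        // key = indexed column value, value = row id in the table
        TreeMap<Integer, Long> index = new TreeMap<>();
        index.put(25, 101L);
        index.put(40, 102L);
        index.put(71, 103L);
        index.put(85, 104L);
        index.put(96, 105L);
        index.put(200, 106L);

        // "give me the values between 40 and 100": find 40 in log(N), then follow the successors (M operations)
        for (Map.Entry<Integer, Long> entry : index.subMap(40, true, 100, true).entrySet()) {
            System.out.println("key " + entry.getKey() + " -> row id " + entry.getValue());
        }
    }
}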
But there are new problems (again!). If you add or remove a row in a database (and therefore in the associated B+Tree index):

you have to keep the order between nodes inside the B+Tree otherwise you won't be able to find nodes inside the mess.
you have to keep the lowest possible number of levels in the B+Tree otherwise the time complexity in O(log(N)) will become O(N).
In other words, the B+Tree needs to be self-ordered and self-balanced. Thankfully, this is possible with smart deletion and insertion operations. But this comes with a cost: the insertion and deletion in a B+Tree are in O(log(N)). This is why some of you have heard that using too many indexes is not a good idea. Indeed, you're slowing down the fast insertion/update/deletion of a row in a table since the database needs to update the indexes of the table with a costly O(log(N)) operation per index. Moreover, adding indexes means more workload for the transaction manager (we will see this manager at the end of the article).
For more details, you can look at the Wikipedia article about B+Trees. If you want an example of a B+Tree implementation in a database, look at this article and this article from a core developer of MySQL. They both focus on how InnoDB (the engine of MySQL) handles indexes.
Note: I was told by a reader that, because of low-level optimizations, the B+Tree needs to be
fully balanced.
Hash table
Our last important data structure is the hash table. It's very useful when you want to quickly look for values. Moreover, understanding the hash table will help us later to understand a common database join operation called the hash join. This data structure is also used by a database to store some internal stuff (like the lock table or the buffer pool, we'll see both concepts later).

The hash table is a data structure that quickly finds an element with its key. To build a hash table you need to define a key for your elements, a hash function for the keys (the computed hashes of the keys give the locations of the elements, called buckets) and a function to compare the keys (so that you can look for an element inside a bucket).
A simple example
As you can see, depending on the value you're looking for, the cost is not the same: with a simple "modulo 10" hash function, some buckets end up holding many more elements than others.

If I now change the hash function to the modulo 1 000 000 of the key (i.e. taking the last 6 digits), the second search only costs 1 operation because there are no elements in the bucket 000059. The real challenge is to find a good hash function that will create buckets that contain a very small amount of elements.
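Here is a small Java sketch (with made-up keys) showing how the choice of the hash function changes the way the same keys are spread over the buckets:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashBucketDemo {
    // groups the keys by bucket for a given hash function (here simply: key modulo m)
    static Map<Long, List<Long>> buckets(long[] keys, long m) {
        Map<Long, List<Long>> buckets = new TreeMap<>();
        for (long key : keys) {
            buckets.computeIfAbsent(key % m, bucket -> new ArrayList<>()).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        long[] keys = {1_000_003, 2_000_013, 3_000_023, 4_000_033, 5_000_043, 6_000_059};

        // modulo 10: the keys pile up in very few buckets, so a lookup may have to scan many elements
        System.out.println("modulo 10        -> " + buckets(keys, 10));
        // modulo 1 000 000 (last 6 digits): every key gets its own bucket, so a lookup costs ~1 operation
        System.out.println("modulo 1 000 000 -> " + buckets(keys, 1_000_000));
    }
}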
In my example, finding a good hash function is easy. But this is a simple example; finding a good hash function is more difficult when the key is a string, 2 strings, 2 strings and a date, ...

With a good hash function, the search in a hash table is in O(1). So why not use an array instead?

A hash table can be half loaded in memory and the other buckets can stay on disk.
With an array you have to use a contiguous space in memory. If you're loading a large table it's very difficult to have enough contiguous space.
With a hash table you can choose the key you want (for example the country AND the last name of a person).
For more information, you can read my article on the Java HashMap, which is an efficient hash table implementation; you don't need to understand Java to understand the concepts inside this article.
Global overview
We’ve just seen the basic components inside a database. We now need to step back to see
the big picture.
A database is a collection of information that can easily be accessed and modi ed. But a
simple bunch of les could do the same. In fact, the simplest databases like SQLite are
nothing more than a bunch of les. But SQLite is a well-cra ed bunch of les because it
allows you to:
The network manager: Network I/O is a big issue, especially for distributed databases. That's why some databases have their own manager.
The file system manager: Disk I/O is the first bottleneck of a database. Having a manager that will perfectly handle the Operating System file system or even replace it is important.
The memory manager: To avoid the disk I/O penalty a large quantity of RAM is required. But if you handle a large amount of memory, you need an efficient memory manager. Especially when you have many queries using memory at the same time.
The security manager: for managing the authentication and the authorizations of the users
The tools:
Recovery manager: for restarting the database in a coherent state after a crash
Monitor manager: for logging the activity of the database and providing tools to monitor a database
Administration manager: for storing metadata (like the names and the structures of the tables) and providing tools to manage databases, schemas, tablespaces, ...
…
For the rest of this article, I'll focus on how a database manages an SQL query through the following processes:

the client manager
the query manager
the data manager (I'll also include the recovery manager in this part)
Client manager
The client manager is the part that handles the communications with the client. The client can be a (web) server or an end-user/end-application. The client manager provides different ways to access the database through a set of well-known APIs: JDBC, ODBC, OLE-DB ...
The manager first checks your authentication (your login and password) and then checks if you have the authorizations to use the database. These access rights are set by your DBA.
Then, it checks if there is a process (or a thread) available to manage your query.
It also checks if the database is not under heavy load.
It can wait a moment to get the required resources. If this wait reaches a timeout, it closes the connection and gives a readable error message.
Then it sends your query to the query manager and your query is processed.
Since the query processing is not an "all or nothing" thing, as soon as it gets data from the query manager, it stores the partial results in a buffer and starts sending them to you.
In case of a problem, it stops the connection, gives you a readable explanation and releases the resources.
Query manager
This part is where the power of a database lies. During this part, an ill-written query is transformed into fast executable code. The code is then executed and the results are returned to the client manager. It's a multiple-step operation:

the query is first parsed to see if it's valid
it's then rewritten to remove useless operations and add some pre-optimizations
it's then optimized to improve the performance and transformed into an execution and data access plan
then the plan is compiled
then it's executed

In this part, I won't talk a lot about the last 2 points because they're less important.
If you want to dig deeper into query optimization, here are some good resources:

The initial research paper (1979) on cost based optimization: Access Path Selection in a Relational Database Management System. This article is only 12 pages long and understandable with an average level in computer science.
A very good and in-depth presentation on how DB2 9.X optimizes queries here.
A very good presentation on how PostgreSQL optimizes queries here. It's the most accessible document since it's more a presentation on "let's see what query plans PostgreSQL gives in these situations" than a "let's see the algorithms used by PostgreSQL".
The official SQLite documentation about optimization. It's "easy" to read because SQLite uses simple rules. Moreover, it's the only official documentation that really explains how it works.
A good presentation on how SQL Server 2005 optimizes queries here.
Query parser
Each SQL statement is sent to the parser where it is checked for correct syntax. If you made a mistake in your query, the parser will reject the query. For example, if you wrote "SLECT ..." instead of "SELECT ...", the story ends here.

But this goes deeper. It also checks that the keywords are used in the right order. For example a WHERE before a SELECT will be rejected.

Then, the tables and the fields inside the query are analyzed. The parser uses the metadata of the database to check that these tables and fields really exist (and that the operations you ask for are possible on them).

Then it checks if you have the authorizations to read (or write) the tables in the query. Again, these access rights on tables are set by your DBA.

During this parsing, the SQL query is transformed into an internal representation (often a tree).

If everything is ok, the internal representation is sent to the query rewriter.
Query rewriter
At this step, we have an internal representation of the query. The aim of the rewriter is to pre-optimize the query: it removes useless operations and helps the optimizer to find the best possible solution.

The rewriter executes a list of known rules on the query. If the query fits the pattern of a rule, the rule is applied and the query is rewritten. Here is a non-exhaustive list of (optional) rules:
View merging: If you're using a view in your query, the view is transformed with the SQL code of the view.
Subquery flattening: Having subqueries is very difficult to optimize so the rewriter will try to modify a query with a subquery to remove the subquery.
For example
SELECT PERSON.*
FROM PERSON
WHERE PERSON.person_key IN
(SELECT MAILS.person_key
FROM MAILS
WHERE MAILS.mail LIKE 'christophe%');
Will be replaced by
SELECT PERSON.*
FROM PERSON, MAILS
WHERE PERSON.person_key = MAILS.person_key
and MAILS.mail LIKE 'christophe%';
Removal of unnecessary operators: For example, if you use a DISTINCT whereas you have a UNIQUE constraint that prevents the data from being non-unique, the DISTINCT keyword is removed.
Redundant join elimination: If you have twice the same join condition, because one join condition is hidden in a view, or if by transitivity there is a useless join, it's removed.
Constant arithmetic evaluation: If you write something that requires a calculation, then it's computed once during the rewriting. For example, WHERE AGE > 10+2 is transformed into WHERE AGE > 12 and TODATE("some date") is transformed into the date in the datetime format.
(Advanced) Partition pruning: If you're using a partitioned table, the rewriter is able to find what partitions to use.
(Advanced) Materialized view rewrite: If you have a materialized view that matches a subset of the predicates in your query, the rewriter checks if the view is up to date and modifies the query to use the materialized view instead of the raw tables.
(Advanced) Custom rules: If you have custom rules to modify a query (like Oracle policies), then the rewriter executes these rules.
(Advanced) OLAP transformations: analytical/windowing functions, star joins, rollup ... are also transformed (but I'm not sure if it's done by the rewriter or the optimizer; since both processes are very close, it must depend on the database).
This rewritten query is then sent to the query optimizer where the fun begins!
Statistics
Before we see how a database optimizes a query, we need to speak about statistics because without them a database is stupid. If you don't tell the database to analyze its own data, it will not do it and it will make (very) bad assumptions.

I first have to (briefly) talk about how databases and operating systems store data. They're using a minimum unit called a page or a block (4 or 8 kilobytes by default). This means that if you only need 1 Kbyte, it will cost you one page anyway. If the page takes 8 Kbytes then you'll waste 7 Kbytes.
Back to the statistics! When you ask a database to gather statistics, it computes values like the number of rows and pages in a table and, for each column of a table, the distinct data values, the length of the values and the min/max values.

These statistics will help the optimizer to estimate the disk I/O, CPU and memory usage of the query.
The statistics for each column are very important. For example, if a table PERSON needs to be joined on 2 columns: LAST_NAME, FIRST_NAME. With the statistics, the database knows that there are only 1 000 different values for FIRST_NAME and 1 000 000 different values for LAST_NAME. Therefore, the database will join the data on LAST_NAME, FIRST_NAME instead of FIRST_NAME, LAST_NAME because it produces far fewer comparisons: the LAST_NAME values are unlikely to be the same, so most of the time a comparison on the 2 (or 3) first characters of the LAST_NAME is enough.
But these are basic statistics. You can ask a database to compute advanced statistics called histograms. Histograms are statistics that inform about the distribution of the values inside the columns: for example the most frequent values or the quantiles.

These extra statistics will help the database to find an even better query plan, especially for equality predicates (ex: WHERE AGE = 18) or range predicates (ex: WHERE AGE > 10 AND AGE < 40), because the database will have a better idea of the number of rows concerned by these predicates (note: the technical word for this concept is selectivity).
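To make this concrete, here is a small Java sketch (not how a real database stores its statistics, just an illustration with a hypothetical AGE column) that computes this kind of basic statistics, a tiny histogram and the selectivity of a range predicate:

import java.util.Arrays;

public class ColumnStatsDemo {
    public static void main(String[] args) {
        int[] ages = {18, 18, 19, 22, 25, 25, 25, 31, 40, 67};  // values of the AGE column

        // basic statistics on the column
        System.out.println("rows: " + ages.length);
        System.out.println("min: " + Arrays.stream(ages).min().getAsInt()
                + ", max: " + Arrays.stream(ages).max().getAsInt());
        System.out.println("distinct values: " + Arrays.stream(ages).distinct().count());

        // a tiny equi-width histogram: how many rows fall in each 25-year bucket
        int[] histogram = new int[4];
        for (int age : ages) histogram[Math.min(age / 25, 3)]++;
        System.out.println("histogram [0-24, 25-49, 50-74, 75+]: " + Arrays.toString(histogram));

        // selectivity of the predicate "WHERE AGE > 10 AND AGE < 40"
        long matching = Arrays.stream(ages).filter(age -> age > 10 && age < 40).count();
        System.out.println("selectivity: " + (double) matching / ages.length);
    }
}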
The statistics are stored in the metadata of the database. For example, you can look at the statistics of the (non-partitioned) tables directly in the dedicated metadata tables of your database.
The statistics have to be up to date. There is nothing worse than a database thinking a table has only 500 rows whereas it has 1 000 000 rows. The only drawback of statistics is that it takes time to compute them. This is why they're not automatically computed by default in most databases. It becomes difficult to compute them when you have millions of rows. In this case, you can choose to compute only the basic statistics or to compute the stats on a sample of the database.
For example, when I was working on a project dealing with hundreds of millions of rows in each table, I chose to compute the statistics on only 10% of the rows, which led to a huge gain in time. For the story, it turned out to be a bad decision because occasionally the 10% chosen by Oracle 10G for a specific column of a specific table were very different from the overall 100% (which is very unlikely to happen for a table with 100M rows). This wrong statistic led to a query occasionally taking 8 hours instead of 30 seconds; a nightmare to find the root cause. This example shows how important the statistics are.
Note: Of course, there are more advanced statistics specific to each database. If you want to know more, read the documentation of the databases. That being said, I've tried to understand how the statistics are used and the best official documentation I found was the one from PostgreSQL.
Query optimizer
All modern databases use Cost Based Optimization (CBO) to optimize queries. The idea is to put a cost on every operation and find the best way to reduce the cost of the query by using the cheapest chain of operations to get the result.

To understand how a cost based optimizer works, I think it's good to have an example to "feel" the complexity behind this task. In this part I'll present the 3 common ways to join 2 tables, and we will quickly see that even a simple join query is a nightmare to optimize. After that, we'll see how real optimizers do this job.
For these joins, I'll focus on their time complexity, but a database optimizer computes their CPU cost, disk I/O cost and memory requirement. The difference between time complexity and CPU cost is that time complexity is very approximate (it's for lazy guys like me). For the CPU cost, I should count every operation like an addition, an "if statement", a multiplication, an iteration ... Moreover:

Each high level code operation has a specific number of low level CPU operations.
The cost of a CPU operation is not the same (in terms of CPU cycles) whether you're using an Intel Core i7, an Intel Pentium 4, an AMD Opteron ... In other words it depends on the CPU architecture.

Using the time complexity is easier (at least for me) and with it we can still get the concept of CBO. I'll sometimes speak about disk I/O since it's an important concept. Keep in mind that the bottleneck is most of the time the disk I/O and not the CPU usage.
Indexes
We talked about indexes when we saw the B+Trees. Just remember that these indexes are already sorted.

FYI, there are other types of indexes, like bitmap indexes. They don't offer the same cost in terms of CPU, disk I/O and memory as B+Tree indexes.

Moreover, many modern databases can dynamically create temporary indexes just for the current query if it can improve the cost of the execution plan.
Access Path
Before applying your join operators, you first need to get your data. Here is how you can get your data.

Note: Since the real problem with all the access paths is the disk I/O, I won't talk a lot about time complexity.
Full scan
If you’ve ever read an execution plan you must have seen the word full scan (or just scan). A
full scan is simply the database reading a table or an index entirely. In terms of disk I/O, a
table full scan is obviously more expensive than an index full scan.
Range Scan
There are other types of scan, like the index range scan. It is used, for example, when you use a predicate like "WHERE AGE > 20 AND AGE < 40".

Of course, you need to have an index on the field AGE to use this index range scan.

We already saw in the first part that the time cost of a range query is something like log(N) + M, where N is the number of keys in the index and M an estimation of the number of rows inside this range. Both N and M values are known thanks to the statistics (Note: M is the selectivity for the predicate AGE > 20 AND AGE < 40). Moreover, for a range scan you don't need to read the full index, so it's less expensive in terms of disk I/O than a full scan.
Unique scan
If you only need one value from an index you can use the unique scan.
Access by row id
Most of the time, if the database uses an index, it will have to look for the rows associated with the index. To do so, it will use an access by row id.

For example, if you have an index on the column AGE of the table PERSON, the optimizer will use the index to find all the persons who are 28, then it will ask for the associated rows in the table because the index only has information about the age and you want to know the last name and the first name.

But if you only need data that is already inside the index (for example, a query that joins PERSON with TYPE_PERSON only on the indexed column), the index on PERSON will be used to join with TYPE_PERSON but the table PERSON will not be accessed by row id since you're not asking for information from this table.

Though it works great for a few accesses, the real issue with this operation is the disk I/O. If you need too many accesses by row id the database might choose a full scan.
Other paths

I didn't present all the access paths. If you want to know more, you can read the Oracle documentation. The names might not be the same for the other databases, but the concepts behind them are the same.
Join operators
I’ll present the 3 common join operators: Merge Join, Hash Join and Nested Loop Join. But
before that, I need to introduce new vocabulary: inner relation and outer relation. A
relation can be:
a table
an index
an intermediate result from a previous operation (for example the result of a previous
join)
When you’re joining two relations, the join algorithms manage the two relations di erently.
In the rest of the a icle, I’ll assume that:
For example, A JOIN B is the join between A and B where A is the outer relation and B the
inner relation.
Most of the time, the cost of A JOIN B is not the same as the cost of B JOIN A.
In this pa , I’ll also assume that the outer relation has N elements and the inner
relation M elements. Keep in mind that a real optimizer knows the values of N and M with
the statistics.
Nested loop join
The nested loop join is the easiest one: for each row of the outer relation, you look at all the rows of the inner relation to find the rows that match.

In terms of disk I/O, for each of the N rows of the outer relation, the inner loop needs to read M rows from the inner relation. This algorithm needs to read N + N*M rows from disk. But, if the inner relation is small enough, you can put the whole relation in memory and just have M + N reads. With this modification, the inner relation must be the smallest one since it has more chance to fit in memory.
In terms of time complexity it makes no difference, but in terms of disk I/O it's way better to read both relations only once.

Of course, the inner relation can be replaced by an index, which will be better for the disk I/O.
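Here is a minimal Java sketch of the simple (in-memory) nested loop join described above, on a hypothetical equi-join between two small relations:

import java.util.ArrayList;
import java.util.List;

public class NestedLoopJoinDemo {
    record Row(int key, String payload) {}

    // for each row of the outer relation, scan the whole inner relation and keep the matching rows
    static List<Row[]> nestedLoopJoin(List<Row> outer, List<Row> inner) {
        List<Row[]> result = new ArrayList<>();
        for (Row o : outer) {             // N rows in the outer relation
            for (Row i : inner) {         // M rows read for each outer row
                if (o.key() == i.key()) {
                    result.add(new Row[]{o, i});
                }
            }
        }
        return result;                    // roughly N*M comparisons
    }

    public static void main(String[] args) {
        List<Row> persons = List.of(new Row(1, "Alice"), new Row(2, "Bob"));
        List<Row> mails = List.of(new Row(1, "alice@a.org"), new Row(1, "alice@b.org"), new Row(3, "carol@c.org"));
        nestedLoopJoin(persons, mails)
                .forEach(pair -> System.out.println(pair[0].payload() + " <-> " + pair[1].payload()));
    }
}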
Since this algorithm is very simple, here is another version that is more disk I/O friendly if the inner relation is too big to fit in memory. Here is the idea:

instead of reading both relations row by row, you read them bunch by bunch and keep the current bunch of each relation in memory,
you compare the rows inside the two bunches and keep the rows that match,
then you load new bunches from disk and compare them,
and so on until there are no bunches left to load from disk.

With this version, the time complexity remains the same, but the number of disk accesses decreases:

With the previous version, the algorithm needs N + N*M accesses (each access gets one row).
With this new version, the number of disk accesses becomes number_of_bunches_for(outer) + number_of_bunches_for(outer) * number_of_bunches_for(inner).
If you increase the size of the bunch you reduce the number of disk accesses.

Note: Each disk access gathers more data than with the previous algorithm, but it doesn't matter since they're sequential accesses (the real issue with mechanical disks is the time to get the first data).
Hash join
The hash join is more complicated but gives a better cost than a nested loop join in many situations. The idea of the hash join is to:

1) get all elements from the inner relation
2) build an in-memory hash table with them
3) get all elements of the outer relation one by one
4) compute the hash of each element (with the hash function of the hash table) to find the associated bucket of the inner relation
5) find if there is a match between the elements in the bucket and the element of the outer table
In terms of time complexity I need to make some assumptions to simplify the problem:
The inner relation is divided into X buckets
The hash function distributes hash values almost uniformly for both relations. In other
words the buckets are equally sized.
The matching between an element of the outer relation and all the elements inside a bucket costs the number of elements inside the bucket.
If the Hash function creates enough small-sized buckets then the time complexity is
O(M+N)
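Here is a minimal Java sketch of the in-memory hash join described above (build a hash table on the inner relation, then probe it with the outer relation); the relation names are just for illustration:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinDemo {
    record Row(int key, String payload) {}

    static List<Row[]> hashJoin(List<Row> outer, List<Row> inner) {
        // build phase: put the (smaller) inner relation into an in-memory hash table (M operations)
        Map<Integer, List<Row>> buckets = new HashMap<>();
        for (Row i : inner) {
            buckets.computeIfAbsent(i.key(), k -> new ArrayList<>()).add(i);
        }
        // probe phase: for each outer row, hash its key and look only inside the matching bucket (N operations)
        List<Row[]> result = new ArrayList<>();
        for (Row o : outer) {
            for (Row i : buckets.getOrDefault(o.key(), List.of())) {
                result.add(new Row[]{o, i});
            }
        }
        return result;   // ~O(M + N) if the hash function keeps the buckets small
    }

    public static void main(String[] args) {
        List<Row> persons = List.of(new Row(1, "Alice"), new Row(2, "Bob"));
        List<Row> mails = List.of(new Row(1, "alice@a.org"), new Row(2, "bob@b.org"), new Row(3, "carol@c.org"));
        hashJoin(persons, mails)
                .forEach(pair -> System.out.println(pair[0].payload() + " <-> " + pair[1].payload()));
    }
}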
Here is another version of the hash join which is more memory friendly but less disk I/O
friendly. This time:
1) you compute the hash tables for both the inner and outer relations
2) then you put them on disk
3) then you compare the 2 relations bucket by bucket (with one loaded in-memory and
the other read row by row)
Merge join
Note: In this simplified merge join, there are no inner or outer tables; they both play the same role. But real implementations make a difference, for example, when dealing with duplicates.
The merge join is divided into 2 steps:

1. (Optional) Sort join operations: both inputs are sorted on the join key(s).
2. Merge join operation: the sorted inputs are merged together.
Sort
We already spoke about the merge sort; in this case a merge sort is a good algorithm (but not the best if memory is not an issue).

But sometimes the data sets are already sorted, for example:

If the table is natively ordered, for example an index-organized table on the join condition
If the relation is an index on the join condition
If this join is applied on an intermediate result already sorted during the process of the query
Merge join
This part is very similar to the merge operation of the merge sort we saw. But this time, instead of picking every element from both relations, we only pick the elements from both relations that are equal. Here is the idea:

1) you compare both current elements in the 2 relations (current = first for the first time)
2) if they're equal, then you put both elements in the result and you go to the next element in both relations
3) if not, you go to the next element in the relation with the lowest element (because the next element might match)
4) and repeat 1, 2, 3 until you reach the last element of one of the relations.

This works because both relations are sorted and therefore you don't need to "go back" in these relations.

This algorithm is a simplified version because it doesn't handle the case where the same data appears multiple times in both relations (in other words, multiple matches). The real version is more complicated "just" for this case; this is why I chose a simplified version.

If both relations need to be sorted, then the time complexity is the cost to sort both relations: O(N*log(N) + M*log(M)).
For the CS geeks, here is a possible algorithm that handles the multiple matches (note: I’m
not 100% sure about my algorithm):
mergeJoin(relation a, relation b)
  relation output
  integer a_key:=0;
  integer b_key:=0;

  while (a[a_key]!=null and b[b_key]!=null)
    if (a[a_key] < b[b_key])
      a_key++;
    else if (a[a_key] > b[b_key])
      b_key++;
    else //Join predicate satisfied
      //i.e. a[a_key] == b[b_key]

      //count the number of duplicates in relation a
      integer nb_dup_in_a := 1;
      while (a[a_key]==a[a_key+nb_dup_in_a])
        nb_dup_in_a++;

      //count the number of duplicates in relation b
      integer nb_dup_in_b := 1;
      while (b[b_key]==b[b_key+nb_dup_in_b])
        nb_dup_in_b++;

      //write the duplicates in output
      for (int i = 0 ; i < nb_dup_in_a ; i++)
        for (int j = 0 ; j < nb_dup_in_b ; j++)
          write_result_in_output(a[a_key+i],b[b_key+j])

      //skip the duplicates in both relations
      a_key := a_key + nb_dup_in_a;
      b_key := b_key + nb_dup_in_b;
    end if
  end while
If there was a best type of join, there wouldn't be multiple types. This question is very difficult because many factors come into play, like:

The amount of free memory: without enough memory you can say goodbye to the powerful hash join (at least the full in-memory hash join).
The size of the 2 data sets. For example, if you have a big table with a very small one, a nested loop join will be faster than a hash join because the hash join has an expensive creation of hashes. If you have 2 very large tables, the nested loop join will be very CPU expensive.
The presence of indexes. With 2 B+Tree indexes the smart choice seems to be the merge join.
If the result needs to be sorted: even if you're working with unsorted data sets, you might want to use a costly merge join (with the sorts) because at the end the result will be sorted and you'll be able to chain the result with another merge join (or maybe because the query asks implicitly/explicitly for a sorted result with an ORDER BY/GROUP BY/DISTINCT operation).
If the relations are already sorted: in this case the merge join is the best candidate.
The type of join you're doing: is it an equijoin (i.e. tableA.col1 = tableB.col2)? Is it an inner join, an outer join, a cartesian product or a self-join? Some joins can't work in certain situations.
The distribution of data. If the data on the join condition are skewed (for example you're joining people on their last name, but many people have the same last name), using a hash join will be a disaster because the hash function will create ill-distributed buckets.
If you want the join to be executed by multiple threads/processes.

For more information, you can read the DB2, ORACLE or SQL Server documentation.
Simplified example
Now let’s say we need to join 5 tables to have a full view of a person. A PERSON can have:
multiple MOBILES
multiple MAILS
multiple ADRESSES
multiple BANK_ACCOUNTS
As a query optimizer, I have to find the best way to process the data. But there are 2 problems:

Which kind of join should I use for each join? I have 3 possible joins (Hash Join, Merge Join, Nested Loop Join), with the possibility to use 0, 1 or 2 indexes (not to mention that there are different types of indexes).
In which order should I execute the joins?

For example, the following figure shows different possible plans for only 3 joins on 4 tables.
Using the database statistics, I compute the cost of every possible plan and I keep the best one. But there are many possibilities. For a given order of joins, each join has 3 possibilities: HashJoin, MergeJoin, NestedJoin. So, for a given order of joins there are 3^4 possibilities. The join ordering is a permutation problem on a binary tree and there are (2*4)!/(4+1)! possible orders. For this very simplified problem, I end up with 3^4 * (2*4)!/(4+1)! possibilities.

In non-geek terms, it means 27 216 possible plans. If I now add the possibility for the merge join to take 0, 1 or 2 B+Tree indexes, the number of possible plans becomes 210 000. Did I forget to mention that this query is VERY SIMPLE?
It’s very tempting but you wouldn’t get your result and I need money to pay the bills.
3) I only try a few plans and take the one with the lowest cost.
Since I’m not superman, I can’t compute the cost of every plan. Instead, I can arbitrary
choose a subset of all the possible plans, compute their costs and give you the best
plan of this subset.
I can use "logical" rules that will remove useless possibilities, but they won't filter a lot of possible plans. For example: "the inner relation of the nested loop join must be the smallest data set".
I can accept not finding the best solution and apply more aggressive rules to reduce the number of possibilities a lot. For example: "if a relation is small, use a nested loop join and never use a merge join or a hash join".
In this simple example, I end up with many possibilities. But a real query can have other relational operators like OUTER JOIN, CROSS JOIN, GROUP BY, ORDER BY, PROJECTION, UNION, INTERSECT, DISTINCT ... which means even more possibilities.

A relational database tries the multiple approaches I've just described. The real job of an optimizer is to find a good solution in a limited amount of time.

Most of the time an optimizer doesn't find the best solution but a "good" one.

For small queries, doing a brute force approach is possible. But there is a way to avoid unnecessary computations so that even medium queries can use the brute force approach. This is called dynamic programming.
Dynamic Programming
The idea behind these 2 words is that many execution plans are very similar. If you look at the following plans:

They share the same (A JOIN B) subtree. So, instead of computing the cost of this subtree in every plan, we can compute it once, save the computed cost and reuse it when we see this subtree again. More formally, we're facing an overlapping problem. To avoid the extra computation of the partial results, we're using memoization.

Using this technique, instead of having a (2*N)!/(N+1)! time complexity, we "just" have 3^N. In our previous example with 4 joins, it means passing from 336 orderings to 81. If you take a bigger query with 8 joins (which is not big), it means passing from 57 657 600 to 6 561.
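If you want to check these numbers yourself, here is a tiny Java sketch (just applying the two formulas above) that compares the number of join orderings of a brute force approach with the ~3^N cost of the memoized version:

public class JoinOrderingCountDemo {
    static double factorial(int n) {
        double result = 1;
        for (int i = 2; i <= n; i++) result *= i;
        return result;
    }

    public static void main(String[] args) {
        for (int joins : new int[]{4, 8}) {
            double bruteForce = factorial(2 * joins) / factorial(joins + 1);  // (2*N)!/(N+1)! orderings
            double withMemoization = Math.pow(3, joins);                      // "just" 3^N with dynamic programming
            System.out.printf("%d joins: %,.0f orderings vs %,.0f with dynamic programming%n",
                    joins, bruteForce, withMemoization);
        }
    }
}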
For the CS geeks, here is an algorithm I found on the formal course I already gave you. I
won’t explain this algorithm so read it only if you already know dynamic programming or if
you’re good with algorithms (you’ve been warned!):
procedure findbestplan(S)
if (bestplan[S].cost != infinite)
   return bestplan[S]
// else bestplan[S] has not been computed earlier, compute it now
if (S contains only 1 relation)
   set bestplan[S].plan and bestplan[S].cost based on the best way
   of accessing S  /* Using selections on S and indices on S */
else for each non-empty subset S1 of S such that S1 != S
P1= findbestplan(S1)
P2= findbestplan(S - S1)
A = best algorithm for joining results of P1 and P2
cost = P1.cost + P2.cost + cost of A
if cost < bestplan[S].cost
bestplan[S].cost = cost
bestplan[S].plan = “execute P1.plan; execute P2.plan;
join results of P1 and P2 using A”
return bestplan[S]
For bigger queries, you can still use a dynamic programming approach, but with extra rules (or heuristics) to remove possibilities:

If we analyze only a certain type of plan (for example: the left-deep trees), we end up with n*2^n instead of 3^n.
If we add logical rules to avoid plans for some patterns (like "if a table has an index for the given predicate, don't try a merge join on the table but only on the index"), it will reduce the number of possibilities without hurting the best possible solution too much.
If we add rules on the flow (like "perform the join operations BEFORE all the other relational operations"), it also removes a lot of possibilities.
…
Greedy algorithms
But for a very big query, or to have a very fast answer (but not a very fast query), another type of algorithm is used: greedy algorithms.

The idea is to follow a rule (or heuristic) to build a query plan in an incremental way. With this rule, a greedy algorithm finds the best solution to a problem one step at a time. The algorithm starts the query plan with one JOIN. Then, at each step, the algorithm adds a new JOIN to the query plan using the same rule.
Let’s take a simple example. Let’s say we have a query with 4 joins on 5 tables (A, B, C, D and
E). To simplify the problem we just take the nested join as a possible join. Let’s use the rule
“use the join with the lowest cost”
we then compute the cost of every join with the result of the (A JOIN B) JOIN C …
….
At the end we nd the plan (((A JOIN B) JOIN C) JOIN D) JOIN E)
Since we arbitrary sta ed with A, we can apply the same algorithm for B, then C then D
then E. We then keep the plan with the lowest cost.
By the way, this algorithm has a name: it’s called the Nearest neighbor algorithm.
I won’t go into details, but with a good modeling and a so in N*log(N) this problem can
easily be solved. The cost of this algorithm is in O(N*log(N)) vs O(3N) for the full
dynamic programming version. If you have a big query with 20 joins, it means 26 vs 3 486
784 401, a BIG di erence!
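Here is a small, hypothetical Java sketch of this greedy idea; the cardinalities are made up and the cost model is deliberately crude (the relation with the smallest cardinality is always considered the "cheapest" next join), which is exactly the kind of simplifying rule a greedy approach relies on:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GreedyJoinOrderDemo {
    public static void main(String[] args) {
        // toy statistics: the cardinality of each relation (a real optimizer would use real statistics)
        Map<String, Long> cardinality = new LinkedHashMap<>(Map.of(
                "A", 1_000L, "B", 10L, "C", 100_000L, "D", 500L, "E", 5_000L));

        String plan = "A";                               // arbitrarily start with relation A
        long estimatedRows = cardinality.remove("A");
        List<String> remaining = new ArrayList<>(cardinality.keySet());

        while (!remaining.isEmpty()) {
            // greedy rule: join next with the relation that gives the cheapest (here: smallest) result
            String best = remaining.get(0);
            for (String relation : remaining) {
                if (cardinality.get(relation) < cardinality.get(best)) best = relation;
            }
            remaining.remove(best);
            estimatedRows = estimatedRows * cardinality.get(best) / 100;  // crude, made-up cost model
            plan = "(" + plan + " JOIN " + best + ")";
        }
        System.out.println(plan + "  estimated rows: " + estimatedRows);
    }
}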
The problem with this algorithm is that we assume that finding the best join between 2 tables will still give us the best cost when we keep this join and add a new one. But even if A JOIN B gives the lowest cost at the first step, a plan starting with A JOIN C might end up cheaper overall than one that keeps (A JOIN B).

To improve the result, you can run multiple greedy algorithms using different rules and keep the best plan.
Other algorithms
[If you’re already fed up with algorithms, skip to the next pa , what I’m going to say is not
impo ant for the rest of the a icle]
The problem of nding the best possible plan is an active research topic for many CS
researchers. They o en try to nd be er solutions for more precise problems/pa erns. For
example,
if the query is a star join (it’s a ce ain type of multiple-join query), some databases will
use a speci c algorithm.
if the query is a parallel query, some databases will use a speci c algorithm
Other algorithms are also studied to replace dynamic programming for large queries.
Greedy algorithms belong to a larger family called heuristic algorithms. A greedy algorithm follows a rule (or heuristic), keeps the solution it found at the previous step and "appends" it to find the solution for the current step. Some algorithms follow a rule and apply it in a step-by-step way but don't always keep the best solution found in the previous step. They are called heuristic algorithms.

For example, genetic algorithms follow a rule, but the best solution of the last step is not often kept.

FYI, genetic algorithms are implemented in PostgreSQL but I wasn't able to find out if they're used by default.
There are other heuristic algorithms used in databases like Simulated Annealing, Iterative Improvement, Two-Phase Optimization ... But I don't know if they're currently used in enterprise databases or if they're only used in research databases.

For more information, you can read the following research article that presents more possible algorithms: Review of Algorithms for the Join Ordering Problem in Database Query Optimization.
Real optimizers
[You can skip to the next part; what I'm going to say is not important]

But all this blabla is very theoretical. Since I'm a developer and not a researcher, I like concrete examples.

Let's see how the SQLite optimizer works. It's a light database so it uses a simple optimization based on a greedy algorithm with extra rules to limit the number of possibilities:
…
Prior to version 3.8.0, SQLite used the "Nearest Neighbor" greedy algorithm when searching for the best query plan.
Since version 3.8.0 (released in 2013), SQLite uses the "N Nearest Neighbors" greedy algorithm when searching for the best query plan.
Let’s see how another optimizer does his job. IBM DB2 is like all the enterprise databases but
I’ll focus on this one since it’s the last one I’ve really used before switching to Big Data.
If we look at the o cial documentation, we learn that the DB2 optimizer let you use 7
di erent levels of optimization:
1 – low optimization
2 – full optimization
FYI, the default level is 5. By default the optimizer uses the following characteristics:

All available statistics, including frequent-value and quantile statistics, are used.
All query rewrite rules (including materialized query table routing) are applied, except computationally intensive rules that are applicable only in very rare cases.
Dynamic programming join enumeration is used, with:
Limited use of composite inner relations
Limited use of Cartesian products for star schemas involving lookup tables
A wide range of access methods is considered, including list prefetch (note: we'll see what this means), index ANDing (note: a special operation with indexes), and materialized query table routing.
By default, DB2 uses dynamic programming limited by heuristics for the join ordering. The other conditions (GROUP BY, DISTINCT ...) are handled by simple rules.

Since the creation of a plan takes time, most databases store the plan in a query plan cache to avoid useless re-computations of the same query plan. It's kind of a big topic since the database needs to know when to update the outdated plans. The idea is to put a threshold, and if the statistics of a table have changed above this threshold, then the query plans involving this table are purged from the cache.
Query executor
At this stage we have an optimized execution plan. This plan is compiled to become executable code. Then, if there are enough resources (memory, CPU), it is executed by the query executor. The operators in the plan (JOIN, SORT BY ...) can be executed in a sequential or parallel way; it's up to the executor. To get and write its data, the query executor interacts with the data manager, which is the next part of the article.
Data manager
At this step, the query manager is executing the query and needs the data from the tables and indexes. It asks the data manager to get the data, but there are 2 problems:

Relational databases use a transactional model. So, you can't get any data at any time because someone else might be using/modifying the data at the same time.
Data retrieval is the slowest operation in a database, therefore the data manager needs to be smart enough to get and keep data in memory buffers.

In this part, we'll see how relational databases handle these 2 problems. I won't talk about the way the data manager gets its data because it's not the most important thing (and this article is long enough!).
Cache manager
As I already said, the main bottleneck of databases is disk I/O. To improve performance, modern databases use a cache manager.

Instead of directly getting the data from the file system, the query executor asks for the data from the cache manager. The cache manager has an in-memory cache called the buffer pool. Getting data from memory dramatically speeds up a database. It's difficult to give an order of magnitude because it depends on the operation you need to do:

sequential access (ex: full scan) vs random access (ex: access by row id),
read vs write,

and on the type of disks used by the database:

SSD
RAID 1/5/...

but I'd say memory is 100 to 100k times faster than disk.
But, this leads to another problem (as always with databases…). The cache manager needs
to get the data in memory BEFORE the query executor uses them; otherwise the query
manager has to wait for the data from the slow disks.
Prefetching
This problem is called prefetching. A query executor knows the data it'll need because it knows the full flow of the query and has knowledge of the data on disk thanks to the statistics. Here is the idea: when the query executor is processing its first bunch of data, it asks the cache manager to pre-load the second bunch; when it starts processing the second bunch, it asks the CM to pre-load the third, and so on.

The CM stores all these data in its buffer pool. In order to know if a data is still needed, the cache manager adds an extra piece of information about the cached data (called a latch).

Sometimes the query executor doesn't know what data it'll need and some databases don't provide this functionality. Instead, they use speculative prefetching (for example: if the query executor asked for data 1, 3, 5 it'll likely ask for 7, 9, 11 in the near future) or sequential prefetching (in this case the CM simply loads from disk the next contiguous data after the ones asked for).
To monitor how well the prefetching is working, modern databases provide a metric called the buffer/cache hit ratio. The hit ratio shows how often a requested data has been found in the buffer cache without requiring disk access.

Note: a poor cache hit ratio doesn't always mean that the cache is ill-working. For more information, you can read the Oracle documentation.
Buffer-replacement strategies
Most modern databases (at least SQL Server, MySQL, Oracle and DB2) use an LRU
algorithm.
LRU
LRU stands for Least Recently Used. The idea behind this algorithm is to keep in the cache
the data that have been recently used and, therefore, are more likely to be used again.
For the sake of comprehension, I'll assume that the data in the buffer are not locked by latches (and therefore can be removed). In this simple example the buffer can store 3 elements:

1: the cache manager uses the data 1 and puts the data into the empty buffer
2: the CM uses the data 4 and puts the data into the half-loaded buffer
3: the CM uses the data 3 and puts the data into the half-loaded buffer
4: the CM uses the data 9. The buffer is full so data 1 is removed since it's the least recently used data. Data 9 is added into the buffer
5: the CM uses the data 4. Data 4 is already in the buffer therefore it becomes the most recently used data again.
6: the CM uses the data 1. The buffer is full so data 3 is removed since it's the least recently used data. Data 1 is added into the buffer
…
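Java's LinkedHashMap can be configured to behave like a tiny LRU cache; here is a sketch (assuming a 3-element buffer and no latches) that replays the access sequence of the example above:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruBufferDemo {
    public static void main(String[] args) {
        int capacity = 3;  // the buffer can store 3 elements, as in the example above
        // accessOrder = true: the map keeps its entries ordered from least to most recently used
        Map<Integer, String> buffer = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > capacity;   // evict the least recently used page when the buffer is full
            }
        };

        for (int page : new int[]{1, 4, 3, 9, 4, 1}) {   // same access sequence as in the example
            buffer.put(page, "data " + page);
            System.out.println("use data " + page + " -> buffer (least recently used first): " + buffer.keySet());
        }
    }
}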
This algorithm works well but there are some limitations. What if there is a full scan on a large table? In other words, what happens when the size of the table/index is above the size of the buffer? Using this algorithm will remove all the previous values in the cache whereas the data from the full scan are likely to be used only once.
Improvements
To prevent this from happening, some databases add specific rules. For example, according to the Oracle documentation:

"For very large tables, the database typically uses a direct path read, which loads blocks directly [...], to avoid populating the buffer cache. For medium size tables, the database may use a direct read or a cache read. If it decides to use a cache read, then the database places the blocks at the end of the LRU list to prevent the scan from effectively cleaning out the buffer cache."
There are other possibilities, like using an advanced version of LRU called LRU-K. For example, SQL Server uses LRU-K with K=2.
The idea behind this algorithm is to take into account more history. With the simple LRU (which is also LRU-K for K=1), the algorithm only takes into account the last time the data was used. With LRU-K:

It takes into account the K last times the data was used.
A weight is put on the number of times the data was used.
If a bunch of new data is loaded into the cache, the old but often used data are not removed (because their weights are higher).
But the algorithm can't keep old data in the cache if they aren't used anymore.
So the weights decrease over time if the data is not used.

The computation of the weight is costly and this is why SQL Server only uses K=2. This value performs well for an acceptable overhead.
For a more in-depth knowledge of LRU-K, you can read the original research paper (1993): The LRU-K page replacement algorithm for database disk buffering.
Other algorithms
Write buffer
I only talked about read bu ers that load data before using them. But in a database you also
have write bu ers that store data and ush them on disk by bunches instead of writing data
one by one and producing many single disk accesses.
Keep in mind that bu ers store pages (the smallest unit of data) and not rows (which is a
logical/human way to see data). A page in a bu er pool is di y if the page has been
modi ed and not wri en on disk. There are multiple algorithms to decide the best time to
write the di y pages on disk but it’s highly linked to the notion of transaction, which is the
next pa of the a icle.
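As a toy illustration of this batching idea (the page ids, byte-array pages and the writeToDisk stub are made up for the example), modified pages are only marked dirty in memory and a single flush later writes them in one batch:

import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of a write buffer: modifications only mark pages dirty in memory,
// and a single flush writes all the dirty pages in one batch.
class WriteBuffer {
    private final Map<Integer, byte[]> pages = new HashMap<>();      // pageId -> page content
    private final Set<Integer> dirtyPages = new LinkedHashSet<>();   // modified, not yet on disk

    void modifyPage(int pageId, byte[] newContent) {
        pages.put(pageId, newContent);
        dirtyPages.add(pageId); // no disk access here
    }

    void flush() {
        for (int pageId : dirtyPages) {
            writeToDisk(pageId, pages.get(pageId)); // one batch of disk writes
        }
        dirtyPages.clear();
    }

    private void writeToDisk(int pageId, byte[] content) {
        // placeholder for the real write done by the data access manager
        System.out.println("writing page " + pageId + " (" + content.length + " bytes)");
    }
}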
Transaction manager
Last but not least, this part is about the transaction manager. We'll see how this process
ensures that each query is executed in its own transaction. But before that, we need to
understand the concept of ACID transactions.
I’m on acid
Atomicity: the transaction is “all or nothing”, even if it lasts 10 hours. If the transaction
crashes, the state goes back to before the transaction (the transaction is rolled back).
Isolation: if 2 transactions A and B run at the same time, the result of transactions A
and B must be the same whether A nishes before/a er/during transaction B.
Durability: once the transaction is commi ed (i.e. ends successfully), the data stay in
the database no ma er what happens (crash or error).
Consistency: only valid data (in terms of relational constraints and functional
constraints) are wri en to the database. The consistency is related to atomicity and
isolation.
During the same transaction, you can run multiple SQL queries to read, create, update and
delete data. The mess begins when two transactions are using the same data. The classic
example is a money transfer from an account A to an account B. Imagine you have 2
transactions:
Transaction 1 that takes 100$ from account A and gives them to account B
Transaction 2 that takes 50$ from account A and gives them to account B
If we go back to the ACID properties:
Atomicity ensures that no matter what happens during T1 (a server crash, a network failure…), you can't end up in a situation where the 100$ are withdrawn from account A but not given to account B (an inconsistent state).
Isolation ensures that if T1 and T2 run at the same time, in the end A will have lost 150$ and B gained 150$, and not, for example, that T2 partly erases the actions of T1 (also an inconsistent state).
Durability ensures that T1 won't disappear into thin air if the database crashes just
after T1 is committed.
Consistency ensures that no money is created or destroyed in the system.
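As an illustration only, here is how transaction 1 could look with JDBC (the ACCOUNT table and its columns are made up for the example; this is a sketch of the application side, not of the database internals): both updates are committed together or rolled back together.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MoneyTransfer {
    // Transfers 100$ from account A to account B inside one transaction.
    static void transfer(Connection connection) throws SQLException {
        connection.setAutoCommit(false); // start an explicit transaction
        try (PreparedStatement debit = connection.prepareStatement(
                 "UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ID = 'A'");
             PreparedStatement credit = connection.prepareStatement(
                 "UPDATE ACCOUNT SET BALANCE = BALANCE + 100 WHERE ID = 'B'")) {
            debit.executeUpdate();
            credit.executeUpdate();
            connection.commit();     // both updates become durable together
        } catch (SQLException e) {
            connection.rollback();   // atomicity: neither update survives
            throw e;
        }
    }
}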
[You can skip to the next part if you want, what I'm going to say is not important for the rest
of the article]
Many modern databases don't use pure isolation as a default behavior because it comes
with a huge performance overhead. The SQL norm defines 4 levels of isolation:
Serializable (default behavior in SQLite): the highest level of isolation. Two transactions executed at the same time behave as if they ran one after the other.
Repeatable read (default behavior in MySQL): it's a serializable + a new break of isolation. If a transaction A reads a data and this data is later modified or deleted and committed by a transaction B, A still sees the original value when it reads it again; but new data added and committed by B becomes visible to A.
For example, if a transaction A does a "SELECT count(1) from TABLE_X" and then a new
data is added and committed in TABLE_X by Transaction B, if transaction A does again a
count(1) the value won't be the same.
Read committed (default behavior in Oracle, PostgreSQL and SQL Server): it's a
repeatable read + a new break of isolation. If a transaction A reads a data D and then
this data is modified (or deleted) and committed by a transaction B, if A reads data D
again it will see the modification (or deletion) made by B on the data.
Read uncommitted: the lowest level of isolation. It's a read committed + a new break
of isolation. If a transaction A reads a data D and then this data D is modified by a
transaction B (that is not committed and still running), if A reads data D again it will see
the modified value. If transaction B is rolled back, then the data D read by A the second
time doesn't make sense since it has been modified by a transaction B that never
happened (since it was rolled back). This behavior is called a dirty read.
Most databases add their own custom levels of isolation (like the snapshot isolation used by
PostgreSQL, Oracle and SQL Server). Moreover, most databases don't implement all the
levels of the SQL norm (especially the read uncommitted level).
The default level of isolation can be overridden by the user/developer at the beginning of
the connection (it's a very simple line of code to add; see the sketch below).
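For example, with JDBC the level can be changed per connection (a sketch; whether a given level is actually supported depends on the database and the driver):

import java.sql.Connection;
import java.sql.SQLException;

public class IsolationExample {
    static void useReadCommitted(Connection connection) throws SQLException {
        // Override the default isolation level for this connection only.
        connection.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        // Other standard levels: TRANSACTION_READ_UNCOMMITTED,
        // TRANSACTION_REPEATABLE_READ, TRANSACTION_SERIALIZABLE.
    }
}

Most databases also accept an SQL statement along the lines of SET TRANSACTION ISOLATION LEVEL READ COMMITTED, though the exact syntax and the supported levels vary.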
Concurrency Control
The real issue when ensuring isolation, coherency and atomicity is the write operations on the
same data (add, update and delete):
if all transactions are only reading data, they can work at the same time without
modifying the behavior of another transaction.
if (at least) one of the transactions is modifying a data read by other transactions, the
database needs to find a way to hide this modification from the other transactions.
Moreover, it also needs to ensure that this modification won't be erased by another
transaction that didn't see the modified data.
The easiest way to solve this problem is to run each transaction one by one (i.e.
sequentially). But that doesn't scale at all: only one core of a multi-processor/core server
would be working, which is not very efficient…
The ideal way to solve this problem is, every time a transaction is created or cancelled:
to reorder the operations inside the conflicting transactions to reduce the size of the
conflicting parts
to execute the conflicting parts in a certain order (while the non-conflicting
transactions are still running concurrently)
to take into account that a transaction can be cancelled.
More formally, it's a scheduling problem with conflicting schedules. More concretely, it's a
very difficult and CPU-expensive optimization problem. Enterprise databases can't afford to
wait hours to find the best schedule for each new transaction event. Therefore, they use
less ideal approaches that lead to more time wasted between conflicting transactions.
Lock manager
To handle this problem, most databases use locks and/or data versioning. Since it's a
big topic, I'll focus on the locking part, then I'll speak a little bit about data versioning.
Pessimistic locking
The idea is to lock a data before a transaction uses it:
if a transaction needs to modify a data, it asks for an exclusive lock; only one transaction at a time can hold an exclusive lock on a given data.
if a transaction only needs to read a data, it asks for a shared lock; multiple transactions can hold a shared lock on the same data at the same time.
Still, if a data has an exclusive lock, a transaction that just needs to read the data will have to
wait for the end of the exclusive lock to put a shared lock on the data.
The lock manager is the process that gives and releases locks. Internally, it stores the locks
in a hash table (where the key is the data to lock) and knows for each data:
which transactions are currently locking the data,
which transactions are waiting for the data.
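The shared/exclusive idea is the same as a readers-writer lock. Here is a minimal analogy in Java (an analogy only, not how a database lock manager is implemented): many readers can hold the shared lock at the same time, but a writer needs exclusive access and blocks everyone else.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Readers-writer lock: the same shared/exclusive idea used by a lock manager.
public class SharedExclusiveExample {
    private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private static int data = 0;

    static int read() {
        lock.readLock().lock();          // shared lock: many readers at once
        try {
            return data;
        } finally {
            lock.readLock().unlock();
        }
    }

    static void write(int value) {
        lock.writeLock().lock();         // exclusive lock: blocks readers and writers
        try {
            data = value;
        } finally {
            lock.writeLock().unlock();
        }
    }
}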
Deadlock
But the use of locks can lead to a situation where 2 transactions are waiting forever for a
data:
In this figure:
transaction A has an exclusive lock on data 1 and is waiting to get data 2,
transaction B has an exclusive lock on data 2 and is waiting to get data 1.
This is called a deadlock. During a deadlock, the lock manager chooses which transaction to
cancel (rollback) in order to remove the deadlock, and this choice is not easy:
Is it better to kill the transaction that modified the least amount of data (and therefore
will produce the least expensive rollback)?
Is it better to kill the youngest transaction because the users of the other transactions
have waited longer?
Is it better to kill the transaction that will take less time to finish (and avoid a possible
starvation)?
But before making this choice, the lock manager needs to check if there are deadlocks.
The hash table can be seen as a graph (like in the previous figures). There is a deadlock if
there is a cycle in the graph. Since it's expensive to check for cycles (because the graph with
all the locks is quite big), a simpler approach is often used: a timeout. If a lock is not
granted within this timeout, the transaction enters a deadlock state.
The lock manager can also check before giving a lock whether this lock would create a deadlock. But
again it's computationally expensive to do it perfectly. Therefore, these pre-checks are often
reduced to a set of basic rules.
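As a toy sketch of the cycle check (assuming a wait-for graph where an edge A -> B means "transaction A waits for a lock held by transaction B"), a simple depth-first search is enough to find a cycle; as said above, real databases often prefer the cheaper timeout approach.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy deadlock detection: look for a cycle in the wait-for graph with a DFS.
public class DeadlockCheck {
    static boolean hasDeadlock(Map<String, List<String>> waitsFor) {
        Set<String> visiting = new HashSet<>();
        Set<String> done = new HashSet<>();
        for (String txn : waitsFor.keySet()) {
            if (hasCycle(txn, waitsFor, visiting, done)) return true;
        }
        return false;
    }

    private static boolean hasCycle(String txn, Map<String, List<String>> waitsFor,
                                    Set<String> visiting, Set<String> done) {
        if (done.contains(txn)) return false;
        if (!visiting.add(txn)) return true;     // already on the current path: cycle
        for (String blocker : waitsFor.getOrDefault(txn, List.of())) {
            if (hasCycle(blocker, waitsFor, visiting, done)) return true;
        }
        visiting.remove(txn);
        done.add(txn);
        return false;
    }

    public static void main(String[] args) {
        // T1 waits for T2, T2 waits for T1: a deadlock.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("T1", List.of("T2"));
        graph.put("T2", List.of("T1"));
        System.out.println(hasDeadlock(graph)); // true
    }
}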
Two-phase locking
The simplest way to ensure pure isolation is to acquire the locks at the beginning of the
transaction and release them at the end of the transaction. This means that a transaction has to
wait for all its locks before it starts, and the locks held by a transaction are released when the
transaction ends. It works but it wastes a lot of time waiting for all the locks.
A faster way is the Two-Phase Locking Protocol (used by DB2 and SQL Server) where a
transaction is divided into 2 phases:
the growing phase, where a transaction can obtain locks but can't release any lock.
the shrinking phase, where a transaction can release locks (on the data it has already
processed and won't process again) but can't obtain new locks.
The idea behind these 2 simple rules is:
to release the locks that aren't used anymore, to reduce the wait time of other
transactions waiting for these locks
to prevent cases where a transaction gets data modified after the transaction
started, and which therefore aren't coherent with the first data the transaction acquired.
This protocol works well except if a transaction that modified a data and released the
associated lock is cancelled (rolled back). You could end up in a case where another
transaction reads the modified value whereas this value is going to be rolled back. To avoid
this problem, all the exclusive locks must be released at the end of the transaction.
A few words
Of course a real database uses a more sophisticated system involving more types of locks
(like intention locks) and more granularities (locks on a row, on a page, on a partition, on a
table, on a tablespace) but the idea remains the same.
I only presented the pure lock-based approach. Data versioning is another way to deal
with this problem. The idea behind versioning is:
every transaction can modify the same data at the same time,
each transaction has its own copy (or version) of the data,
if 2 transactions modify the same data, only one modification will be accepted; the
other will be refused and the associated transaction will be rolled back (and maybe re-
run).
Everything is better than locks, except when 2 transactions write the same data. Moreover,
you can quickly end up with a huge disk space overhead.
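As a small illustration of the optimistic idea at the application level (a simplification: real MVCC engines version the rows internally; the PERSON table, its VERSION column and the method are made up for the example), each row carries a version number and an update only succeeds if the version hasn't changed since it was read:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticUpdate {
    // Returns true if the update won, false if another transaction changed the row first.
    static boolean updateAge(Connection connection, long personId,
                             int newAge, long versionReadEarlier) throws SQLException {
        try (PreparedStatement update = connection.prepareStatement(
                "UPDATE PERSON SET AGE = ?, VERSION = VERSION + 1 " +
                "WHERE ID = ? AND VERSION = ?")) {
            update.setInt(1, newAge);
            update.setLong(2, personId);
            update.setLong(3, versionReadEarlier);
            // 0 rows updated means the version changed: the caller must retry or give up,
            // which plays the role of the "refused and rolled back" transaction above.
            return update.executeUpdate() == 1;
        }
    }
}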
Data versioning and locking are two different visions: optimistic locking versus pessimistic
locking. They both have pros and cons; it really depends on the use case (more reads vs
more writes). For a presentation on data versioning, I recommend this very good
presentation on how PostgreSQL implements multiversion concurrency control.
Some databases like DB2 (until DB2 9.7) and SQL Server (except for snapshot isolation) use
only locks. Others like PostgreSQL, MySQL and Oracle use a mixed approach involving
locks and data versioning. I'm not aware of a database using only data versioning (if you
know of a database based on pure data versioning, feel free to tell me).
If you read the part on the different levels of isolation: when you increase the isolation level,
you increase the number of locks and therefore the time wasted by transactions waiting for
their locks. This is why most databases don't use the highest isolation level (Serializable) by
default.
As always, you can check by yourself in the documentation of the main databases (for
example MySQL, PostgreSQL or Oracle).
Log manager
We've already seen that to increase its performances, a database stores data in memory
buffers. But if the server crashes while a transaction is being committed, you'll lose the
data still in memory during the crash, which breaks the Durability of a transaction.
You could write everything on disk, but if the server crashes you'll end up with the data half
written on disk, which breaks the Atomicity of a transaction. Any modification written by a
transaction must be undone or finished. There are 2 ways to deal with this problem:
Shadow copies/pages: each transaction creates its own copy of the database (or just
a part of the database) and works on this copy. In case of error, the copy is removed. In
case of success, the database instantly switches the data from the copy with a
filesystem trick, then removes the "old" data.
Transaction log: a transaction log is a storage space. Before each write on disk, the
database writes a piece of information in the transaction log so that in case of crash or cancellation of a
transaction, the database knows how to remove (or finish) the unfinished transaction.
WAL
Shadow copies/pages create a huge disk overhead when used on large databases
involving many transactions. That's why modern databases use a transaction log. The
transaction log must be stored on stable storage. I won't go deeper into storage
technologies, but using (at least) RAID disks is mandatory to protect against a disk failure.
Most databases (at least Oracle, SQL Server, DB2, PostgreSQL, MySQL and SQLite) deal
with the transaction log using the Write-Ahead Logging protocol (WAL). The WAL protocol
is a set of 3 rules:
1) Each modification into the database produces a log record, and the log record
must be written into the transaction log before the data is written on disk.
2) The log records must be written in order; a log record A that happens before a log
record B must be written before B.
3) When a transaction is committed, the commit order must be written in the
transaction log before the transaction ends successfully.
This job is done by a log manager. An easy way to see it is that, between the cache manager
and the data access manager (that writes data on disk), the log manager writes every
update/delete/create/commit/rollback on the transaction log before they're written on disk.
Easy, right?
WRONG ANSWER! After all we've been through, you should know that everything related to
a database is cursed by the "database effect". More seriously, the problem is to find a way to
write logs while keeping good performances. If the writes on the transaction log are too
slow, they will slow down everything.
ARIES
In 1992, IBM researchers "invented" an enhanced version of WAL called ARIES. ARIES is more
or less used by most modern databases. The logic might not be the same but the concepts
behind ARIES are used everywhere. I put quotes around "invented" because, according to this
MIT course, the IBM researchers did "nothing more than writing the good practices of
transaction recovery". Since I was 5 when the ARIES paper was published, I don't care about
this old gossip from bitter researchers. In fact, I only put this info to give you a break before
we start this last technical part. I've read a huge part of the research paper on ARIES and I
find it very interesting! In this part I'll only give you an overview of ARIES, but I strongly
recommend reading the paper if you want real knowledge.
ARIES stands for Algorithms for Recovery and Isolation Exploiting Semantics. The aim of this
technique is twofold: writing logs while keeping good performances, and having a fast and
reliable recovery.
There are multiple reasons why a database has to roll back a transaction: the user cancelled
it, a server or network failure occurred, or the transaction broke the integrity of the database.
Sometimes (for example, in case of network failure), the database can recover the
transaction.
How is that possible? To answer this question, we need to understand the information
stored in a log record.
The logs
Each operation (add/remove/modify) during a transaction produces a log. Each log record is
made of:
LSN: a unique Log Sequence Number. This LSN is given in chronological order*. This
means that if an operation A happened before an operation B, the LSN of log A will be
lower than the LSN of log B.
TransID: the id of the transaction that produced the operation.
PageID: the location on disk of the modified data.
PrevLSN: a link to the previous log record produced by the same transaction.
UNDO: a way to remove the effect of the operation.
For example, if the operation is an update, the UNDO will store either the value/state of
the updated element before the update (physical UNDO) or the reverse operation to go
back to the previous state (logical UNDO)**.
REDO: a way to replay the operation. Likewise, there are 2 ways to do that: either you store the value/state of the element
after the operation, or the operation itself to replay it.
…: (FYI, an ARIES log has 2 other fields: the UndoNxtLSN and the Type.)
Moreover, each page on disk (that stores the data, not the logs) has the id of the log record (LSN)
of the last operation that modified the data.
*The way the LSN is given is more complicated because it is linked to the way the logs are
stored. But the idea remains the same.
**ARIES uses only logical UNDO because it's a real mess to deal with physical UNDO.
Note: from my little knowledge, only PostgreSQL is not using an UNDO. It uses instead a
garbage collector daemon that removes the old versions of data. This is linked to the
implementation of data versioning in PostgreSQL.
To give you a better idea, here is a visual and simplified example of the log records produced
by the query "UPDATE PERSON SET AGE = 18;". Let's say this query is executed in
transaction 18.
Each log has a unique LSN. The logs that are linked belong to the same transaction. The logs
are linked in chronological order (the last log of the linked list is the log of the last
operation).
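To make this structure concrete, here is a minimal sketch of such a log record as a Java class (the field names follow the description above; the UNDO/REDO payloads are simplified to plain strings, whereas real records are binary):

// Minimal sketch of an ARIES-style log record (simplified: UNDO/REDO as plain strings).
public class LogRecord {
    final long lsn;        // unique, chronologically increasing Log Sequence Number
    final long transId;    // id of the transaction that produced the operation
    final long pageId;     // location on disk of the modified data
    final long prevLsn;    // LSN of the previous log record of the same transaction
    final String undo;     // how to remove the effect of the operation
    final String redo;     // how to replay the operation

    LogRecord(long lsn, long transId, long pageId, long prevLsn, String undo, String redo) {
        this.lsn = lsn;
        this.transId = transId;
        this.pageId = pageId;
        this.prevLsn = prevLsn;
        this.undo = undo;
        this.redo = redo;
    }
}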
Log Buffer
To avoid that writing the logs becomes a major bottleneck, a log buffer is used. When the
query executor asks for a modification:
1) the cache manager stores the modification in its buffer;
2) the log manager stores the associated log in its own buffer;
3) at this step, the query executor considers the operation is done (and can therefore ask for other modifications);
4) then (later) the log manager writes the log on the transaction log;
5) then (later) the cache manager writes the modification on disk.
When a transaction is committed, it means that for every operation in the transaction
the steps 1, 2, 3, 4, 5 are done. Writing in the transaction log is fast since it's just "adding a
log somewhere in the transaction log" whereas writing data on disk is more complicated
because it's "writing the data in a way that it's fast to read them".
For performance reasons, step 5 might be done after the commit because in case of a
crash it's still possible to recover the transaction with the REDO logs. This is called a NO-
FORCE policy.
A database can choose a FORCE policy (i.e. step 5 must be done before the commit) to
lower the workload during the recovery.
Another issue is to choose whether the data are written step-by-step on disk (STEAL
policy) or if the buffer manager needs to wait until the commit order to write everything at
once (NO-STEAL policy). The choice between STEAL and NO-STEAL depends on what you want:
fast writing with a long recovery using UNDO logs, or a fast recovery?
STEAL/NO-FORCE needs both UNDO and REDO: the highest performances, but more
complex logs and recovery processes (like ARIES). The other combinations need less:
STEAL/FORCE only needs UNDO, NO-STEAL/NO-FORCE only needs REDO, and
NO-STEAL/FORCE needs neither, at the cost of performance and memory.
STEAL/NO-FORCE is the choice made by most databases. Note: I read this fact in multiple
research papers and courses but I couldn't find it (explicitly) in the official documentations.
The recovery part
Let's say the new intern has crashed the database (rule n°1: it's always the intern's fault). You
restart the database and the recovery process begins. ARIES recovers from a crash in 3
passes:
1) The Analysis pass: the recovery process reads the full transaction log* to recreate
the timeline of what was happening during the crash. It determines which transactions
to rollback (all the transactions without a commit order are rolled back) and which data
needed to be written on disk at the time of the crash.
2) The Redo pass: this pass starts from a log record determined during analysis, and
uses the REDO to update the database to the state it was in before the crash.
During the redo phase, the REDO logs are processed in chronological order (using the
LSN).
For each log, the recovery process reads the LSN of the page on disk containing the
data to modify. If LSN(page_on_disk) >= LSN(log_record), the data was already written
on disk before the crash, so nothing is done. If LSN(page_on_disk) < LSN(log_record),
the page on disk is updated (a toy sketch of this decision follows below).
The redo is done even for the transactions that are going to be rolled back because it
simplifies the recovery process (but I'm sure modern databases don't do that).
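Here is a toy sketch of that redo decision (an assumed simplification: the log and the pages are reduced to maps of LSNs, and "replaying" is just a print): a logged modification is replayed only when the page on disk is older than the log record.

import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of the redo decision during recovery: a logged modification is replayed
// only if the page on disk is older than the log record (pageLSN < logLSN).
public class RedoPass {
    // redoLogInLsnOrder: logLsn -> pageId of the modified page (in chronological order).
    // pageLsnOnDisk: pageId -> LSN of the last modification already written on that page.
    static void redo(Map<Long, Long> redoLogInLsnOrder, Map<Long, Long> pageLsnOnDisk) {
        for (Map.Entry<Long, Long> log : redoLogInLsnOrder.entrySet()) {
            long logLsn = log.getKey();
            long pageId = log.getValue();
            long pageLsn = pageLsnOnDisk.getOrDefault(pageId, -1L);
            if (pageLsn < logLsn) {
                System.out.println("replaying log " + logLsn + " on page " + pageId);
                pageLsnOnDisk.put(pageId, logLsn); // the page now reflects this operation
            }
            // otherwise the page already contains this modification: nothing to do
        }
    }

    public static void main(String[] args) {
        Map<Long, Long> redoLog = new LinkedHashMap<>();
        redoLog.put(10L, 1L); // log 10 modified page 1
        redoLog.put(11L, 2L); // log 11 modified page 2
        Map<Long, Long> diskPages = new LinkedHashMap<>();
        diskPages.put(1L, 10L); // page 1 was flushed before the crash
        redo(redoLog, diskPages); // only log 11 is replayed
    }
}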
3) The Undo pass: this pass rolls back all transactions that were incomplete at the
time of the crash. The rollback starts with the last logs of each transaction and
processes the UNDO logs in anti-chronological order (using the PrevLSN of the log
records).
During the recovery, the transaction log must be warned of the actions made by the
recovery process so that the data written on disk stay synchronized with what's written in the
transaction log. A solution could be to remove the log records of the transactions that are
being undone, but that's very difficult. Instead, ARIES writes compensation logs in the
transaction log that logically delete the log records of the transactions being removed.
When a transaction is cancelled "manually", or by the lock manager (to stop a deadlock), or
just because of a network failure, then the analysis pass is not needed. Indeed, the
information about what to REDO and UNDO is available in 2 in-memory tables:
the Transaction table, which stores the state of all current transactions;
the Dirty page table, which stores which data need to be written on disk.
These tables are updated by the cache manager and the transaction manager for each new
transaction event. Since they are in-memory, they are destroyed when the database
crashes.
The job of the analysis phase is to recreate both tables after a crash using the information in
the transaction log.
*To speed up the analysis pass, ARIES provides the notion of checkpoint. The idea is to
write on disk, from time to time, the content of the transaction table and the dirty page table
along with the last LSN at the time of this write, so that during the analysis pass only the logs
after this LSN are analyzed.
To conclude
Before writing this article, I knew how big the subject was and I knew it would take time to
write an in-depth article about it. It turned out that I was very optimistic and I spent twice
as much time as expected, but I learned a lot.
If you want a good overview of databases, I recommend reading the research paper
"Architecture of a Database System". It is a good introduction to databases (110 pages)
and, for once, it's readable by non-CS people. This paper helped me a lot to find a plan for this
article, and it's not focused on data structures and algorithms like my article but more on
architecture concepts.
If you read this article carefully, you should now understand how powerful a database is.
Since it was a very long article, let me remind you what we've seen:
an overview of the cost-based optimization with a strong focus on join operators
an overview of the buffer pool management
an overview of the transaction management (locks, logging and recovery)
But a database contains even more cleverness. For example, I didn't speak about some
touchy problems like:
how to manage clustered databases and global transactions
how to take a snapshot while the database is still running
how to efficiently store (and compress) the data
how to manage memory
So, think twice when you have to choose between a buggy NoSQL database and a rock-
solid relational database. Don't get me wrong, some NoSQL databases are great. But they're
still young and answer specific problems that concern a few applications.
To conclude, if someone asks you how a database works, instead of running away you’ll now
be able to answer:
Otherwise you can give him/her this article.