SQL Courses
COURSE 1: Databases
DBMS (Database Management System)
• System designed to define and manipulate data.
• Storage.
• Retrieval.
• Updates.
Relational                  NoSQL
• Vertical scalability      • Horizontal scalability
• ACID                      • BASE
• Pre-defined schema        • Flexible schema
• SQL language              • No standard
• Normalized data           • Collections, redundancy
• High-level design.
• Suitable for structured systems.
ERD
• Diagram symbols: strong entity, weak entity, key, unary and binary relations, simple and multivalued attributes.
• RELATION links entities (unary, binary, ternary); usually a verb.
• ATTRIBUTE describes entities or relations.
ISA relationship
• A sub-entity has the same key as the super-entity and all its attributes
and relationships.
Index
• Optimized search.
• Optimized joins (lookup in more than one table).
• Optimized order/group.
• Cost: extra memory and slower DML (extra load on INSERT, UPDATE).
Databases C1 Intro, Entity Relationship
Sql Optimizer
• Is there an index for the request?
  • NO → Full Scan.
  • YES → does the request return roughly <15% of the rows?
    • YES → Index search.
    • NO → Full Scan.
• Oracle rowid:
  • Pseudo column, 18 characters = 6 + 3 + 6 + 3 (data object number, relative file number, block, row).
  • Stores and returns the row address as a base-64 encoded string.
  • Unique identifier for each row.
  • Immutable for the life of the row, unless the row is physically moved (e.g., table reorganization).
• Oracle rownum:
  • Sequential number giving the order in which Oracle fetched the row, assigned before the result is ordered.
  • Temporary: generated along with a select statement.
• Mongo
  • ObjectId: 12 bytes = 4-byte timestamp + 5-byte random value + 3-byte counter.
MySQL
SHOW EXTENDED INDEX FROM index_test;

select * from information_schema.statistics
where table_name = 'index_test1'
and index_name = 'primary';

Oracle
select * from user_indexes
where table_name = 'INDEX_TEST';

select * from user_constraints
where table_name = 'INDEX_TEST';
[Figure: B-tree index. Branch blocks hold key ranges ([1..20], [30..50], …); leaf entries map a key to a rowid (30: AAB0lYAAEAAAFNHABD, 32: AAB0lYAAEAAAFNHABA, …).]
Reverse key index
• B-tree in which each key is stored reversed: key 4573 is stored as 3754.
• Optimized insert operations: consecutive keys are spread over different leaf blocks.
• Key 4573 (stored as 3754) lands in the same block as key 9573 (stored as 3759),
while 4574 (stored as 4754) is stored in a different block.
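The key reversal can be sketched in a few lines of Python. This is only a toy model: it reverses decimal digits, as on the slide, whereas a real engine may reverse the bytes of the key; the function name `reverse_key` is an assumption made for illustration.

```python
def reverse_key(key: int) -> str:
    """Return the decimal digits of key in reverse order (toy model only)."""
    return str(key)[::-1]


keys = [4573, 4574, 9573]

# Stored form of every key.
stored = {k: reverse_key(k) for k in keys}

# Sorting by the stored form shows which keys become neighbours in the index.
order = sorted(keys, key=reverse_key)

print(stored)
print(order)
```

Sorting by the stored form places 4573 next to 9573, while the consecutive key 4574 moves away, which is the behaviour described in the bullets.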
emp_id  en  fr  row_id              A1  A2  B1  B2  C1  C2
1       A1  B1  AAB0lYAAEAAAFNHABD  1   0   0   0   0   0
2       A2  B2  AAB0lYAAEAAAFNHABV  0   1   0   0   0   0
3       C1  A1  AAB0lYAAEAAAFNHABX  0   0   0   0   1   0
4       A1  B1  AAB0lYAAEAAAFNHAAv  1   0   0   0   0   0
5       A1      AAB0lYAAEAAAFNHAAV  1   0   0   0   0   0
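The 0/1 columns of the table can be read as one bitmap per distinct value of the column `en`. A minimal sketch, assuming the bitmaps are kept as plain Python lists (the helper `build_bitmap_index` is hypothetical, not a database API):

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to a list of 0/1 flags, one flag per row."""
    index = defaultdict(lambda: [0] * len(values))
    for row, value in enumerate(values):
        index[value][row] = 1
    return dict(index)


# Column `en` for emp_id 1..5, taken from the table.
en = ["A1", "A2", "C1", "A1", "A1"]
bitmaps = build_bitmap_index(en)

# Rows (1-based emp_id) whose `en` value is A1: read the set bits.
rows_a1 = [i + 1 for i, bit in enumerate(bitmaps["A1"]) if bit]

print(bitmaps)
print(rows_a1)
```

A lookup then reduces to scanning one bitmap, and predicates on several values can be combined with bitwise AND/OR over the lists.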
Relational model
Components: integrity constraints, relations, operators.
Integrity constraints
• Domain constraints
  • The value of each attribute must be atomic and belong to the attribute's domain (specified data types: integers, real
numbers, characters, Booleans, variable length strings etc.).
• Key constraint
  • Unique + not null → PK
Relations
• $R \subseteq D_1 \times D_2 \times \cdots \times D_n$, where each $D_i$ is a domain
Operators
• PROJECT
• SELECT
• JOIN
• DIVISION
SALES_REP
EMP_ID TARGET LAST_NAME FIRST_NAME SALARY
2 25 Grant Anee 2700
… …
SOFTWARE_ENG
EMP_ID TEAM LAST_NAME FIRST_NAME SALARY
3 3 Brown Gregory 2300
… …
Databases C2 Relational Model, indexes
Rules for relationships
• 1 to 1 & 1 to M → foreign keys.
  • 1 (PK) to M (FK)
  • Usually, in 1 to 1 relationships the FK is placed in the table with fewer rows.
• M to M → associative table.
  • PK contains the FKs and, if needed, an additional column.
CARD
CARD_ID ACCOUNT_ID CVN DATE
16897 10 125 18/04/21
24789 22 987 14/04/22
34597 300 875 03/05/21
… … … …
LOAN
LOAN_ID CUSTOMER_ID VALUES DATE
16897 10 125000 18/04/21
24789 22 987000 14/04/22
34597 300 87500 03/05/21
… … … …
FLIGHT_CREW
CREW_ID FLIGHT_ID OBSERVATIONS
10 1 …
22 1 …
10 2 …
AIRCREW
CREW_ID LAST_NAME FIRST_NAME JOB_ID
10 Snow John captain
22 Grant Anee first_officer
… … … …
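The M to M rule can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `flight` table is hypothetical (it is not shown on the slides), the column types are guesses, and SQLite stands in for whatever DBMS the course uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE aircrew (
    crew_id    INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL,
    job_id     TEXT NOT NULL
);
CREATE TABLE flight (              -- hypothetical table, not on the slide
    flight_id INTEGER PRIMARY KEY
);
CREATE TABLE flight_crew (         -- associative table: PK built from the two FKs
    crew_id      INTEGER NOT NULL REFERENCES aircrew(crew_id),
    flight_id    INTEGER NOT NULL REFERENCES flight(flight_id),
    observations TEXT,
    PRIMARY KEY (crew_id, flight_id)
);
""")
conn.executemany("INSERT INTO aircrew VALUES (?, ?, ?, ?)",
                 [(10, "Snow", "John", "captain"),
                  (22, "Grant", "Anee", "first_officer")])
conn.executemany("INSERT INTO flight VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO flight_crew VALUES (?, ?, ?)",
                 [(10, 1, None), (22, 1, None), (10, 2, None)])

# Crew members per flight, resolved through the associative table.
rows = conn.execute("""
    SELECT fc.flight_id, a.last_name
    FROM flight_crew fc JOIN aircrew a ON a.crew_id = fc.crew_id
    ORDER BY fc.flight_id, a.last_name
""").fetchall()

# A row pointing at a non-existent crew member violates referential integrity.
try:
    conn.execute("INSERT INTO flight_crew VALUES (99, 1, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```

The composite primary key prevents assigning the same crew member to the same flight twice, and the two foreign keys keep every pair valid.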
Ternary Relationships
TEACH
PROFESSOR_ID COURSE_ID STUDENT_ID GRADE
1 BD 1001 9
1 SGBD 1002 10
1 BD 1002 8
2 TAP 1001 8
2 TAP 1002 10
2 AG 1001 5
…. …. …. ….
EMP_SKILL
EMP_ID SKILL LEVEL
1 Python 3
1 C++ 2
1 NoSql 3
2 SQL 1
• Atomicity: either all operations complete successfully or none is saved in the database.
Statement 3
Statement 4
Statement 5
commit -- end transaction 2
• Database constraints
  • PRIMARY KEY (key constraint), UNIQUE, NOT NULL, FOREIGN KEY (referential integrity),
CHECK
• Business constraints
Transaction states (diagram):
• active: normal execution.
• Final statement executed → successful completion.
• failed: normal execution cannot proceed.
• aborted: rolled back, initial state restored.
• DURABILITY: log → hardware failure → system restart → recover updates from the log.
• CONSISTENCY: failure → recover the initial state from the log.
read (B)
B := B + temp
write (B)
commit
2. View serializability
➢ Precedence graph: a directed graph whose vertices are the transactions (names).
➢ We draw an arc from Ti to Tj if the two transactions conflict and Ti accessed the data item on which
the conflict arose earlier.
➢ If the precedence graph is acyclic, the serializability order can be obtained by a topological sort of
the graph.
➢ The problem of checking if a schedule is view serializable falls in the class of NP-complete
problems. Thus, existence of an efficient algorithm is extremely unlikely.
➢ Practical algorithms that just check some sufficient conditions for view serializability can still
be used.
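The precedence-graph test can be sketched as follows. The schedule encoding (tuples of transaction name, `"r"`/`"w"`, data item) and both function names are assumptions made for this illustration; two operations conflict when they come from different transactions, touch the same item, and at least one is a write.

```python
from collections import defaultdict, deque


def precedence_graph(schedule):
    """Build arcs Ti -> Tj for conflicting operations where Ti acted first."""
    edges = defaultdict(set)
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "w" in (op_i, op_j):
                edges[ti].add(tj)
    return edges


def serial_order(schedule):
    """Return an equivalent serial order, or None if the graph has a cycle."""
    txns = sorted({t for t, _, _ in schedule})
    edges = precedence_graph(schedule)
    indegree = {t: 0 for t in txns}
    for src in list(edges):
        for dst in edges[src]:
            indegree[dst] += 1
    # Kahn's algorithm: repeatedly remove vertices without incoming arcs.
    queue = deque(t for t in txns if indegree[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for dst in sorted(edges[t]):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                queue.append(dst)
    return order if len(order) == len(txns) else None


serializable = [("T1", "r", "A"), ("T1", "w", "A"), ("T2", "r", "A"), ("T2", "w", "A")]
not_serializable = [("T1", "r", "A"), ("T2", "w", "A"), ("T1", "w", "A")]

print(serial_order(serializable))
print(serial_order(not_serializable))
```

In the second schedule T1 reads A before T2 writes it, and T2 writes A before T1 writes it, so the graph contains the cycle T1 → T2 → T1 and no serial order exists.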
update stock
set qte = :nS - 1
where n_prod = 100
if nS < 10
insert into restock(n_prod, qte)
values(100, 15)
Phantom-read anomaly: AVG > MAX, because new rows are inserted between the two reads.
T1                               T2
select MAX(qte)
into :max
from orders
where n_prod = 100
                                 insert into orders(n_prod, qte)
                                 values(100, 789455)
                                 commit
select AVG(qte)
into :average
from orders
where n_prod = 100
Databases C4: Transactional systems
Transactions errors
Dirty-write anomaly: final stock 11! In the first transaction, the stock returns to 13. Only one update should decrease the number of products.
T1                               T2
select qte into :nS
from stock
where n_prod = 100
--nS = 13
update stock
set qte = :nS - 1
where n_prod = 100
                                 select qte into :nS
                                 from stock
                                 where n_prod = 100
                                 --nS = 12
                                 update stock
                                 set qte = :nS - 1
                                 where n_prod = 100
                                 --nS = 11
abort
                                 insert …
                                 commit
Isolation levels, from weakest to strongest:
• READ UNCOMMITTED
• READ COMMITTED
• REPEATABLE READ
• SERIALIZABLE
• Locking
• Timestamp
• A transaction waits until all incompatible locks held by other transactions are released.
• https://oracle-base.com/articles/misc/deadlocks
• https://docs.oracle.com/cd/B19306_01/server.102/b14220/consist.htm
Snapshot isolation
• Snapshot of the database at the beginning of each transaction.
Normalization
• REDUNDANCY → INSERT/UPDATE/DELETE ANOMALY.
• Normalization: analyze dependencies (example relation: LOAN), then apply a “good” decomposition / synthesis.
• Lossless: $\Pi_{R_1}(R) \bowtie \Pi_{R_2}(R) = R$
• The domain of each attribute contains only atomic values and each
attribute contains only a value of its domain.
• partial: (X,Y) → Z
  • Y → Z
• total: (X,Y) → T
  • X -/-> T
  • Y -/-> T

X    Y    Z    T
…    …    …    …
X2   …    …    T2
X2   …    …    T3
…    …    …    …
Partial dependency (second normal form)
• R(K1, K2, X, Y) with K1 -> X
• Decomposition: R1(K1, K2, Y) and R2(K1, X)
• Example:
  K1 = AIRPLANE_ID
  K2 = AIRPORT_ID, DEPARTURE
  Y = BOARDING_GATE
  X = AIRPLANE_MODEL
Transitive dependency (third normal form)
• R(K, X, Y, Z) with K -> X and X -> Y
• Decomposition: R1(K, X, Z) and R2(X, Y)
• Example:
  K = AIRPLANE_ID
  X = AIRPLANE_MODEL
  Y = CAPACITY
  Z = REVISION_DATE
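Whether a decomposition is lossless can be checked on sample data by projecting and joining back. The sketch below does this for the decomposition of (K, X, Y, Z) into (K, X, Z) and (X, Y); the rows (models, capacities, dates) are invented for illustration, and `project` and `natural_join` are hypothetical helper names.

```python
def project(rows, attrs):
    """Relational projection: keep only attrs and drop duplicate tuples."""
    return {tuple((a, row[a]) for a in attrs) for row in rows}


def natural_join(r1, r2):
    """Natural join of two relations given as sets of (attribute, value) tuples."""
    result = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            common = d1.keys() & d2.keys()
            if all(d1[a] == d2[a] for a in common):
                result.add(tuple(sorted({**d1, **d2}.items())))
    return result


# Invented sample data: K -> X (model of a plane) and X -> Y (capacity of a model).
rows = [
    {"K": 1, "X": "A320", "Y": 180, "Z": "2021-04-18"},
    {"K": 2, "X": "A320", "Y": 180, "Z": "2022-04-14"},
    {"K": 3, "X": "B737", "Y": 160, "Z": "2021-05-03"},
]

r1 = project(rows, ["K", "X", "Z"])
r2 = project(rows, ["X", "Y"])        # capacity stored once per model
joined = natural_join(r1, r2)
original = {tuple(sorted(row.items())) for row in rows}

print(len(r2), joined == original)
```

The join returns exactly the original tuples, and the capacity of each model is now stored only once, which removes the redundancy behind the update anomaly.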
• Query → Relational-Algebra expression (relations + operators) → Optimizer → EXECUTION-PLAN → Evaluation (algorithms) → Output.
• Example of a Relational-Algebra expression over products p1 and products p2:
  $\mathrm{JOIN}(\Pi_{name,minprice}(p_1), \Pi_{name,minprice}(p_2))$
• PROP2: associativity
Statistics:
• table: number of rows, number of blocks, avg row length
• column statistics: number of distinct values, number of nulls, data distribution
• index statistics: number of leafs, levels
• system statistics
• Fault tolerant.
Map reduce
input → map → Shuffle, Sort → Reduce → output

map output, one line per map task:
• (laptop, 10), (usb, 13)
• (laptop, 50), (usb, 57)
• (laptop, 78), (mouse, 25)
• (phone, 49), (mouse, 67), (laptop, 5)

Shuffle, Sort, then Reduce:
• M1: (laptop, [10, 50, 78, 5]) → laptop, 143
• M2: (mouse, [25, 67]) → mouse, 92
• M2: (usb, [13, 57]) → usb, 70
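The map, shuffle/sort and reduce steps of the example can be simulated in plain Python. This is a single-process sketch, not Hadoop code; the function names are assumptions, and the input pairs are the ones from the slide (with usb = 13 in the first split).

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for product, quantity in records:
        yield product, quantity


def shuffle(pairs):
    """Shuffle and sort: group all values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: sum the values of every key."""
    return {key: sum(values) for key, values in groups.items()}


# One inner list per map task.
splits = [
    [("laptop", 10), ("usb", 13)],
    [("laptop", 50), ("usb", 57)],
    [("laptop", 78), ("mouse", 25)],
    [("phone", 49), ("mouse", 67), ("laptop", 5)],
]

pairs = [pair for split in splits for pair in map_phase(split)]
groups = shuffle(pairs)
totals = reduce_phase(groups)

print(groups)
print(totals)
```

In a real cluster each split would be mapped on a different node and each key range reduced on a different machine (M1, M2); here everything runs sequentially to show only the data flow.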
MapReduce Hadoop
• Open source from Apache. https://hadoop.apache.org/
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
• Components
• MapReduce
• PageRank-ing
Inverted index
D1: data base systems
D2: economic base analysis
D3: distributed systems
D4: data analysis

input → map → shuffle, sort → reduce

map output (word, position):
• D1: (data, 1), (base, 6), (systems, 11)
• D2: (economic, 1), (base, 10), (analysis, 15)
• D3: (distributed, 1), (systems, 14)
• D4: (data, 1), (analysis, 6)

shuffle, sort, then reduce:
• M1: (analysis, [D2:15, D4:6])
• M1: (base, [D1:6, D2:10])
• M1: (data, [D1:1, D4:1])
• M2: (distributed, [D3:1])
• M2: (economic, [D2:1])
• M2: (systems, [D1:11, D3:14])
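The inverted-index job can be sketched in plain Python as well. Assumptions: positions are 1-based character offsets computed from the text with single spaces between words, and the function names `map_doc` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (word, 'doc_id:offset') using 1-based character offsets."""
    for match in re.finditer(r"\w+", text):
        yield match.group(), f"{doc_id}:{match.start() + 1}"


def build_inverted_index(docs):
    """Shuffle/sort and reduce: collect the postings of every word."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for word, posting in map_doc(doc_id, text):
            index[word].append(posting)
    return dict(sorted(index.items()))


docs = {
    "D1": "data base systems",
    "D2": "economic base analysis",
    "D3": "distributed systems",
    "D4": "data analysis",
}

index = build_inverted_index(docs)
for word, postings in index.items():
    print(word, postings)
```

A query for a word then becomes a single dictionary lookup that returns every document and position where the word occurs.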
MapReduce: Sql operators
• Selection
• Group by
• Join
EMPLOYEES
emp_id  name             dep_id
100     Steven King      90
102     Lex De Hann      90
108     Nancy Greenberg  100
116     Shelli Baida     30
117     Sigal Tobias     30

DEPARTMENTS
dep_id  dep_name
30      Purchasing
90      Executive
100     Finance
20      Marketing

[Figure: map tasks over EMPLOYEES and DEPARTMENTS.]
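A join can be expressed as a MapReduce job by using the join column as the key and tagging each value with its source table. The sketch below is one common way to do it (a reduce-side join), not necessarily the variant used in the course; the tags `"EMP"`/`"DEP"` and the function names are assumptions.

```python
from collections import defaultdict

employees = [
    (100, "Steven King", 90),
    (102, "Lex De Hann", 90),
    (108, "Nancy Greenberg", 100),
    (116, "Shelli Baida", 30),
    (117, "Sigal Tobias", 30),
]
departments = [(30, "Purchasing"), (90, "Executive"), (100, "Finance"), (20, "Marketing")]


def map_employee(row):
    """Map an EMPLOYEES row to (dep_id, tagged value)."""
    emp_id, name, dep_id = row
    return dep_id, ("EMP", name)


def map_department(row):
    """Map a DEPARTMENTS row to (dep_id, tagged value)."""
    dep_id, dep_name = row
    return dep_id, ("DEP", dep_name)


def reduce_join(tagged):
    """Reduce: pair every employee with every department sharing the key."""
    names = [value for tag, value in tagged if tag == "EMP"]
    deps = [value for tag, value in tagged if tag == "DEP"]
    return [(name, dep) for dep in deps for name in names]


# Shuffle: group the tagged values by dep_id.
groups = defaultdict(list)
mapped = [map_employee(r) for r in employees] + [map_department(r) for r in departments]
for key, value in mapped:
    groups[key].append(value)

joined = [pair for dep_id in sorted(groups) for pair in reduce_join(groups[dep_id])]
print(joined)
```

Department 20 (Marketing) has no employees, so its group produces no output, which matches the behaviour of an inner join.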
• High performance.
→Put(key, value)
→Get(Key)
• Examples
• Bigtable, Apache HBase, Dynamo, Cassandra, MongoDB, Azure etc.
• Challenges: managing requests that must access data from multiple shards
→ replicas are needed to ensure availability in case of failure,
→ replicas must be kept consistent,
→ joins are expensive if tables are stored on different nodes,
performance depends on the speed of the communication network.
Sharding
• Types of partitioning: horizontal partitioning (example sharding),
vertical partitioning
• Architecture:
• Mongos: query routers
• Config Servers
• Shards (replicas)
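Horizontal partitioning needs a rule that maps each key to a shard. A minimal sketch of a hash-based rule follows; it is an illustration only, not the routing logic of mongos, and `shard_for` and the in-memory `shards` list are assumptions.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the key (illustrative rule only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one shard


def put(key, value):
    """Route the write to the shard that owns the key."""
    shards[shard_for(key, NUM_SHARDS)][key] = value


def get(key):
    """Route the read to the same shard; no other shard is contacted."""
    return shards[shard_for(key, NUM_SHARDS)].get(key)


for emp_id in range(100, 110):
    put(str(emp_id), {"emp_id": emp_id})

print([len(s) for s in shards])
print(get("105"))
```

A single-key read touches one shard only, whereas a query that cannot be routed by key has to be sent to all shards, which is the multi-shard challenge listed earlier.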
CAP Theorem
• C (consistency), A (availability), P (partition tolerance).
• CP: MongoDB, HBase, Spanner, Redis.
• MongoDB: CP datastore.
• Alternative: PACELC
• Eventually consistent: it is not guaranteed that all replicas have the
same data at a given moment.
• Consistency level: number of replicas that need to respond to a
read/write operation.
• ONE: closest replica
• QUORUM: synchronize → majority
Consistency levels
• Strict consistency: global clock, all writes seen instantaneously by all processors.
• Eventually consistent: if there are no writes for a period of time that is system
dependent, every node will “see” the value of the last write.
BASE
• Basically Available: low latency, high availability
• Eventually consistent
Mongo DB
Mongo DB and SQL
Mongo                                             RDBMS
Document: set of key-value pairs,                 row in a table
similar to JSON objects
Collection: set of documents; documents in a      table
collection may have different sets of fields
Field in JSON document                            column
$lookup and embedded documents                    joins
…
https://docs.mongodb.com/manual/reference/sql-comparison/
Mongo API
Use/create/delete database
show dbs show available databases
Test w ∈ S → h(w) = 1?
• Bit array B of m = 25 positions; each element v of S sets B[h1(v)], B[h2(v)], …, B[hk(v)] to 1.
• Example with k = 3 and elements v1, v2, v3: h1(v1) = 4, h2(v1) = 12, h3(v1) = 22.

position  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
bit       0  0  0  1  0  0  1  0  0  1  0  1  0  0  0  1  1  0  0  0  0  1  0  1  0

• Hash family H: $\Pr_{h \in H}[h(x_1) = y_1 \wedge h(x_2) = y_2 \wedge \dots \wedge h(x_k) = y_k] = \frac{1}{m^k}$
  • $h(x_1)$ uniformly distributed.
  • $h(x_1), h(x_2), \dots, h(x_k)$ independent random variables.
• Small probability of false positive: a value w with B[h1(w)] = 1, B[h2(w)] = 1, …, B[hk(w)] = 1, i.e. each hash of w equals a hash of an element in the set.
• Probability of false negative = 0.
Bloom filters – accuracy
• m size of array, n number of elements in S, k number of hash functions.
• Probability of false positive, for a value w that is not in S:
  • A given position, e.g. h1(w), differs from all kn hash values h1(v1), …, hk(v1), h1(v2), …, hk(vn) with probability $\left(1 - \frac{1}{m}\right)^{kn}$.
  • It equals at least one of them (the bit is set) with probability $1 - \left(1 - \frac{1}{m}\right)^{kn}$.
  • All k positions of w are set with probability
    $P = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$ or $P \approx \left(1 - e^{-kn/m}\right)^{k}$
• m = 10 * n and k = 7: P ≃ 0.01
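A small Bloom filter makes the two properties concrete: members are always found, and the estimated false-positive rate follows the approximate formula. This is a sketch under assumptions: the k hash functions are simulated by salting SHA-256 with the function index, which is a convenience choice and not the hash family from the slides; the class and function names are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item: str):
        """Yield the k array positions of item (salted SHA-256, an assumption)."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True may be a false positive."""
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate formula P = (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


n, m, k = 100, 1000, 7            # m = 10 * n and k = 7, as in the bullet above
bloom = BloomFilter(m, k)
members = [f"v{i}" for i in range(n)]
for v in members:
    bloom.add(v)

estimate = false_positive_rate(m, n, k)
print(all(bloom.might_contain(v) for v in members))
print(round(estimate, 4))
```

With m = 10 * n and k = 7 the formula gives a value slightly below 0.01, in line with the figure quoted above.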
Log Structured Merge Trees
• Optimize I/O operations.
https://commons.wikimedia.org/wiki/File:Btree.png
LSMT
• Structure: L0 in memory; L1, L2, … on disk.
• lookup: separate lookups in L0, L1, L2, …; merge results.
• insert: insert into L0 if memory is available.
• insert, memory full: if L1 is empty, copy L0 → L1 and delete L0; rolling merge into the on-disk levels.
• update, delete: “deleted” or “updated” records are written to L0, L1; the merge checks if a record is deleted or updated.
• stepped-merge: several trees per level (L0, L0_1, L0_2, …; L1_1, L1_2, …); Bloom filters optimize lookup.
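The insert, lookup, delete and merge behaviour can be sketched with a toy structure. It is a deliberate simplification: dictionaries stand in for the in-memory tree and the on-disk levels, every flush creates a new level instead of performing a rolling merge, and the class `TinyLSM` and its methods are hypothetical.

```python
class TinyLSM:
    TOMBSTONE = object()  # marker written for deleted keys

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.l0 = {}       # L0: in memory
        self.levels = []   # simulated on-disk levels, newest first

    def put(self, key, value):
        self.l0[key] = value
        if len(self.l0) >= self.memtable_limit:  # memory full
            self._flush()

    def delete(self, key):
        self.put(key, self.TOMBSTONE)  # a delete is just another write

    def _flush(self):
        """Copy L0 to a new on-disk level and empty L0."""
        self.levels.insert(0, dict(self.l0))
        self.l0 = {}

    def get(self, key):
        """Look in L0 first, then in each level from newest to oldest."""
        for table in [self.l0, *self.levels]:
            if key in table:
                value = table[key]
                return None if value is self.TOMBSTONE else value
        return None

    def compact(self):
        """Merge all on-disk levels; newer records win, deleted ones are dropped."""
        merged = {}
        for table in reversed(self.levels):  # oldest first
            merged.update(table)
        merged = {k: v for k, v in merged.items() if v is not self.TOMBSTONE}
        self.levels = [merged] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # L0 full -> flushed to disk
db.put("a", 10)     # update: newer record shadows the old one
db.delete("b")      # delete: tombstone, L0 full -> flushed again
levels_before = len(db.levels)
value_a, value_b = db.get("a"), db.get("b")
db.compact()
print(levels_before, value_a, value_b, db.levels)
```

Writes never modify data in place, which keeps them sequential; the price is that a lookup may have to inspect several levels, which is where the Bloom filters mentioned above help.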
Materialized views
• redundant data, contents can be inferred from the definition
• Selection
• Projection
• Aggregation