
The Case for a Learned Sorting Algorithm

Ani Kristo* (Brown University, ani@brown.edu)
Kapil Vaidya* (MIT, kapilv@mit.edu)
Uğur Çetintemel (Brown University, ugur@cs.brown.edu)
Sanchit Misra (Intel Labs, sanchit.misra@intel.com)
Tim Kraska (MIT, kraska@mit.edu)

* Both authors contributed equally to this research.
ABSTRACT

Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases, not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38× performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49× improvement over sequential Radix Sort, and 5.54× improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.

ACM Reference Format:
Ani Kristo, Kapil Vaidya, Uğur Çetintemel, Sanchit Misra, and Tim Kraska. 2020. The Case for a Learned Sorting Algorithm. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3318464.3389752

SIGMOD '20, June 14–19, 2020, Portland, OR, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6735-6/20/06... $15.00
https://doi.org/10.1145/3318464.3389752

1 INTRODUCTION

Sorting is one of the most fundamental and well-studied problems in Computer Science. Counting-based sorting algorithms, such as Radix Sort, have a complexity of O(wN), with N being the number of keys and w being the key length, and are often the fastest algorithms for small keys. However, for larger key domains, comparison-based sorting algorithms, such as Quicksort or Mergesort, which have a time complexity of O(N log N), are often faster, as are hybrid algorithms, which combine various comparative and distributive sorting techniques. Those are also the default sorting algorithms used in most standard libraries (e.g., C++ STL sort).

In this paper, we introduce an ML-enhanced sorting algorithm by building on our previous work [28]. The core idea of the algorithm is simple: we train a CDF model F over a small sample of keys A and then use the model to predict the position of each key in the sorted output. If we were able to train the perfect model of the empirical CDF, we could use the predicted probability P(A ≤ x) for a key x, scaled to the number of keys N, to predict the final position of every key in the sorted output: pos = F_A(x) · N = P(A ≤ x) · N. Assuming the model already exists, this would allow us to sort the data with only one pass over the input, in O(N) time.
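As an illustration of this core idea, the following minimal sketch (added here; it is not the authors' implementation) assumes a perfect, monotone function cdf(x) = P(A ≤ x) is already available, in which case a single pass suffices to place every key:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal sketch of sorting with a *perfect* CDF model.
// `cdf` is a hypothetical callable returning P(A <= x) in [0, 1].
template <typename Cdf>
std::vector<double> perfect_cdf_sort(const std::vector<double>& A, Cdf cdf) {
    const std::size_t N = A.size();
    std::vector<double> out(N);
    for (double x : A) {
        // Scale the CDF value to an index in the output array.
        std::size_t pos =
            std::min<std::size_t>(N - 1, static_cast<std::size_t>(cdf(x) * N));
        out[pos] = x;  // with a perfect CDF and distinct keys, no two keys collide
    }
    return out;
}
```

The rest of the paper is about what breaks when the CDF is only an approximation: collisions, non-monotonicity, and, as discussed next, the cost of the random writes this loop performs.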
Obviously, several challenges exist with this approach. Most importantly, it is unlikely that we can build a perfect empirical model. Furthermore, state-of-the-art approaches to model the CDF, in particular NNs, would be overly expensive to train and execute. More surprisingly though, even with a perfect model the sorting time might be slower than a highly optimized Radix Sort algorithm: Radix Sort can be implemented to only use sequential writes, whereas a naïve ML-enhanced sorting algorithm like the one we outlined in [28] creates a lot of random writes to place the data directly into its sorted order.

In this paper, we describe Learned Sort, a sequential ML-enhanced sorting algorithm that overcomes these challenges. In addition, we introduce a fast training and inference algorithm for CDF modeling. This paper is the first in-depth study describing a cache-efficient ML-enhanced sorting algorithm which does not suffer from the random access problem. Our experiments show that Learned Sort can indeed achieve better performance than highly tuned counting-based sorting algorithms, including Radix Sort and histogram-based sorts, as well as comparison-based and hybrid sorting algorithms. In fact, our learned sorting algorithm provides the best performance even when we include the model training time as a part of the overall sorting time. For example, our experiments show that Learned Sort yields an average of 3.38× performance improvement over C++ STL sort (std::sort) [16], 5.54× improvement over Timsort (Python's default sorting algorithm [45]), 1.49× over Radix Sort [51], and 1.31× over IS4o [2], a cache-efficient version of Samplesort and one of the fastest available sorting implementations [40].

In summary, we make the following contributions:
• We propose a first ML-enhanced sorting algorithm, called Learned Sort, which leverages simple ML models of the empirical CDF to significantly speed up a new variant of Radix Sort.
• We theoretically analyze our sorting algorithm.
• We exhaustively evaluate Learned Sort over various synthetic and real-world datasets.

2 LEARNING TO SORT NUMBERS

Given a function F_A(x), which returns the exact empirical CDF value for each key x ∈ A, we can sort A by calculating the position of each key within the sorted order as pos ← F_A(x) · |A|. This would allow us to sort a dataset with a single pass over the data, as visualized in Figure 1. However, in general, we will not have a perfect CDF function, especially if we train the model just based on a sample from the input data. In addition, there might be duplicates in the dataset, which may cause several keys to be mapped to the same position. In the following, we describe an initial learned sorting algorithm, similar to the one of SageDB [28], that is robust against imprecise models, and then explain why this first approach is still not competitive, before introducing the final algorithm. To simplify the discussion, our focus in this section is exclusively on the sorting of numbers, and we only describe the out-of-place variant of our algorithm, in which we use an auxiliary array as big as the input array. Later, we discuss the changes necessary to create an in-place variant of the same algorithm in Section 4.1 and address the sorting of strings and other complex objects in Section 4.2.

Figure 1: Sorting with the perfect CDF model

Algorithm 1 A first Learned Sort
Input: A - the array to be sorted
Input: F_A - the CDF model for the distribution of A
Input: o - the over-allocation rate (default 1)
Output: A′ - the sorted version of array A
1:  procedure Learned-Sort(A, F_A, o)
2:      N ← A.length
3:      A′ ← empty array of size (N · o)
4:      for x in A do
5:          pos ← ⌊F_A(x) · N · o⌋
6:          if empty(A′[pos]) then A′[pos] ← x
7:          else Collision-Handler(x)
8:      if o > 1 then Compact(A′)
9:      if non-monotonic then Insertion-Sort(A′)
10:     return A′

2.1 Sorting with imprecise models

As discussed earlier, duplicate keys and imprecise models may lead to the mapping of multiple keys to the same output position in the sorted array. Moreover, some models (e.g., NNs or even the Recursive Model Index (RMI) [29]) may not be able to guarantee monotonicity, creating small misplacements in the output. That is, for two keys a and b with a < b, the CDF value of a might be greater than that of b (F(a) > F(b)), thus causing the output to not be entirely sorted. Obviously, such errors should be small, as otherwise using a model would provide no benefits. However, if the model does not guarantee monotonicity, further work on the output is needed to repair such errors. That is, a learned sorting algorithm also has to (1) correct the sort order for non-monotonic models, (2) handle key collisions, and preferably (3) minimize the number of such collisions. A general algorithm for dealing with those three issues is outlined in Algorithm 1. The core idea is again simple: given a model, we calculate the expected position for each key (Line 5) and, if that position in the output is free, place the key there (Line 6). In case the position is not empty, we have several options to handle the collisions (Line 7):

(1) Linear probing: If a position is already occupied, we could sequentially scan the array for the nearest empty spot and place the element there. However, this technique might misplace keys (like non-monotonic models do) and will take increasingly more time as the array fills up.

(2) Chaining: Like in hash tables, we could chain elements for already-filled positions. This could be implemented either with a linked list or with variable-sized sub-arrays, both of which introduce additional performance overhead due to pointer chasing and dynamic memory allocation.
(3) Spill bucket: We could use a "spill bucket" to which we append all the colliding elements. This technique requires us to separately sort and merge the spill bucket.

We experimented with all three methods and found that the spill bucket often outperforms the other methods (see Figure 2). Thus, in the remainder of the paper, we only focus on the spill bucket approach.

Figure 2: The sorting rate for different collision handling strategies for Algorithm 1 on normally distributed keys.

After all items have been placed, for non-monotonic models we correct any mistakes using Insertion Sort over the entire output (Line 9). Note that any sorting algorithm could guarantee correctness; however, we choose Insertion Sort because it performs well when (1) the number of misplaced elements is small and (2) the distance of misplaced elements from their correct positions is small.

From Algorithm 1, it should be clear that the model quality determines the number of collisions and that the number of collisions will have a profound impact on the performance. Interestingly, the expected number of collisions depends on how well the model overfits to the observed data. For example, let us assume our CDF model is exactly the same model as the one used to generate the keys we want to sort (i.e., we have the model of the underlying distribution); the number of collisions would still be around 1/e ≈ 36.7%, independently of the distribution. This result follows directly from the birthday paradox and is similar to the problem of hash table collisions. However, if the model overfits to the observed data (i.e., learns the empirical CDF), the number of collisions can be significantly lower. Unfortunately, if we want to train a model just based on a sample, to reduce the training cost, it is mostly impossible to learn the perfect empirical CDF.

Hence, we need a different way to deal with collisions. For example, we can reduce the number of collisions by over-provisioning the output array (o in Algorithm 1), again similar to how hash tables are often over-provisioned to reduce collisions. However, this comes at the cost of requiring more memory and time to remove the empty space for the final output (Line 8). Another idea is to map keys to small buckets rather than individual positions. Bucketing helps significantly reduce the number of collisions and can be combined with over-allocation.

Algorithm 1 is in many ways similar to a hash table with the CDF model F_A as an order-preserving hash function.¹ Yet, to our surprise, even with a perfect zero-overhead CDF model, Algorithm 1 is not faster than Radix Sort. For example, as a test we generated a randomly permuted dataset containing all integer numbers from 0 to 10⁹. In this case, the key itself can be used as the position prediction, as it represents the offset inside the sorted array, making the model for F_A just the identity function pos ← key: a perfect zero-overhead oracle. Note that we also maintain a bitmap to track whether the positions in the output array are filled. To our astonishment, in this micro-experiment we observed that distributing the keys into their final sorted positions, despite the zero-overhead oracle function, took 38.7 sec, whereas Radix Sort took 37.5 sec. This performance gap is due to the fact that the permutation step makes random and unpredictable array accesses, which hurt CPU cache and TLB locality and incur multiple stalls (see Line 6 in Algorithm 1), whereas our cache-efficient Radix Sort implementation was memory-access optimized and mainly used sequential memory accesses [51]. The Radix Sort implementation achieved this by carefully adjusting the number of radices to the L2-cache size, and while it made several passes over the data, it still outperformed our idealized learned sorting algorithm.

¹ Note that existing perfect or order-preserving hash functions cannot be used in our context because of their very slow training and execution times, which is also the reason why there does not exist a single sorting algorithm using them. Similarly, our problem is NOT related to locality-sensitive hashing either, as sorting keys lie in a single-dimensional space.

Figure 3: Radix Sort [51] can be implemented to mainly use sequential memory access by making sure that at least one cache line per histogram fits into the cache. This way the prefetcher can detect when to load the next cache line per histogram (green slots indicate processed items, red the current one, white slots unprocessed or empty slots).

Based on the insights discussed in this section regarding random memory access, collision handling, and monotonicity, we developed a cache-efficient Learned Sort algorithm, which is explained in the next section.
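To make the 1/e collision figure above concrete, here is a small, self-contained simulation (added here, not from the paper): draw N keys from their true distribution, map each through the true CDF onto one of N unit-capacity slots, and count how many keys find their slot already taken.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Monte Carlo check of the birthday-paradox argument: even with the model of
// the underlying distribution, about 1/e of the keys collide and would have
// to go to a spill bucket.
int main() {
    const std::size_t N = 10'000'000;
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> key(0.0, 1.0);  // true CDF(x) = x

    std::vector<bool> occupied(N, false);
    std::size_t spilled = 0;
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t pos =
            std::min<std::size_t>(N - 1, static_cast<std::size_t>(key(gen) * N));
        if (occupied[pos]) ++spilled;
        else occupied[pos] = true;
    }
    std::printf("spill fraction: %.3f (1/e = 0.368)\n", double(spilled) / N);
    return 0;
}
```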
2.2 Cache-optimized learned sorting

Our final Learned Sort algorithm enhances Algorithm 1 with the idea from the cache-optimized Radix Sort. In fact, in case the number of elements to sort is close to the key domain size (e.g., 2³² for 32-bit keys), the run-time of our Learned Sort algorithm is almost identical to Radix Sort. However, in case the number of elements is much smaller than the key domain size, Learned Sort starts to significantly outperform even the optimized Radix Sort implementations as well as other comparison-based sorts. The reason is that, with every pass over the data, our learned model can extract more information than Radix Sort about where the key should go in the final output, thus overcoming the core challenge of Radix Sort that the run-time heavily depends on the key domain size².

² Obviously, the model itself still depends on the key domain size, as discussed later in more detail.

Figure 4: Cache-optimized Learned Sort: First the input is partitioned into f fixed-capacity buckets (here f = 2) and the input keys are shuffled into these buckets based on the CDF model's predictions. If a bucket gets full, the overflowing items are placed into a spill bucket S. Afterwards, each bucket is split again into f smaller buckets and the process repeats until the bucket capacity meets a threshold t (here t = 6). Then, each bucket is sorted using a CDF model-based Counting Sort-style subroutine (Step 2). The next step corrects any sorting mistakes using Insertion Sort (Step 3). Finally we sort the spill bucket S, merge it with B, and return the sorted array (Step 4).

The basic idea of our algorithm is visualized in Figure 4:
• We organize the input array into logical buckets. That is, instead of predicting an exact position, the model only has to predict a bucket index for each element, which reduces the number of collisions as explained earlier.
• Step 1: For cache efficiency, we start with a few large buckets and recursively split them into smaller ones. By carefully choosing the fan-out (f) per iteration, we can ensure that at least one cache line per bucket fits into the cache, hence transforming the memory access pattern into a more sequential one. This recursion repeats until the buckets become as small as a preset threshold t. Section 3.1 explains how f and t should be set based on the CPU cache size.
• Step 2: When the buckets reach capacity t, we use the CDF model to predict the exact position for each element within the bucket.
• Step 3: Afterwards we take the now sorted buckets and merge them into one sorted array. If we use a non-monotonic model, we also correct any sorting mistakes using Insertion Sort.
• Step 4: The buckets are of fixed capacity, which minimizes the cost of dynamic memory allocation. However, if a bucket becomes full, the additional keys are placed into a separate spill bucket array (see the "S" bucket symbol in Figure 4). As a last step, the spill bucket has to be sorted and merged. The overhead of this operation is low as long as the model is capable of evenly distributing the keys to buckets.

Algorithm 2 shows the pseudocode of Learned Sort. The algorithm requires an input array of keys (A), a CDF model that was trained on a sample of this array (F_A), a fan-out factor (f) that determines the ratio of new buckets in each iteration, and a threshold (t) which decides when to stop the bucketization, such that every bucket fits into the cache.
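As a rough guide (a back-of-the-envelope note added here, not a formula stated in the paper): the bucket capacity starts at ⌊N/f⌋ and shrinks by a factor of f per round, so Stage 1 performs roughly

$$\big\lfloor \log_f (N/t) \big\rfloor \ \text{bucketization rounds.}$$

For example, with N = 10⁸ keys, f = 1000, and t = 100, only two rounds are needed, which is why Section 3.2 later treats the number of iterations as effectively constant.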
Algorithm 2 The Learned Sort algorithm
Input: A - the array to be sorted
Input: F_A - the CDF model for the distribution of A
Input: f - fan-out of the algorithm
Input: t - threshold for bucket size
Output: A′ - the sorted version of array A
1:  procedure Learned-Sort(A, F_A, f, t)
2:      N ← |A|                              ▷ Size of the input array
3:      n ← f                                ▷ n represents the number of buckets
4:      b ← ⌊N/f⌋                            ▷ b represents the bucket capacity
5:      B ← [] × N                           ▷ Empty array of size N
6:      I ← [0] × n                          ▷ Records bucket sizes
7:      S ← []                               ▷ Spill bucket
8:      read_arr ← pointer to A
9:      write_arr ← pointer to B
10:     // Stage 1: Model-based bucketization
11:     while b ≥ t do                       ▷ Until bucket capacity reaches the threshold t
12:         I ← [0] × n                      ▷ Reset array I
13:         for x ∈ read_arr do
14:             pos ← ⌊Infer(F_A, x) · n⌋
15:             if I[pos] ≥ b then           ▷ Bucket is full
16:                 S.append(x)              ▷ Add to spill bucket
17:             else                         ▷ Write into the predicted bucket
18:                 write_arr[pos · b + I[pos]] ← x
19:                 Increment I[pos]
20:         b ← ⌊b/f⌋                        ▷ Update bucket capacity
21:         n ← ⌊N/b⌋                        ▷ Update the number of buckets
22:         PtrSwp(read_arr, write_arr)      ▷ Pointer swap to reuse memory
23:     // Stage 2: In-bucket reordering
24:     offset ← 0
25:     for i ← 0 up to n do                 ▷ Process each bucket
26:         K ← [0] × b                      ▷ Array of counts
27:         for j ← 0 up to I[i] do          ▷ Record the counts of the predicted positions
28:             pos ← ⌊Infer(F_A, read_arr[offset + j]) · N⌋
29:             Increment K[pos − offset]
30:         for j ← 1 up to |K| do           ▷ Calculate the running total
31:             K[j] ← K[j] + K[j − 1]
32:         T ← []                           ▷ Temporary auxiliary memory
33:         for j ← 0 up to I[i] do          ▷ Order keys w.r.t. the cumulative counts
34:             pos ← ⌊Infer(F_A, read_arr[offset + j]) · N⌋
35:             T[j] ← read_arr[offset + K[pos − offset]]
36:             Decrement K[pos − offset]
37:         Copy T back to read_arr[offset]
38:         offset ← offset + b
39:     // Stage 3: Touch-up
40:     Insertion-Sort-And-Compact(read_arr)
41:     // Stage 4: Sort & Merge
42:     Sort(S)
43:     A′ ← Merge(read_arr, S)
44:     return A′

Step 1: The algorithm starts by allocating a linear array B that is of the same size as the input A (Line 5). This will be logically partitioned into n buckets, each of fixed capacity b (Lines 3-4). We record the bucket sizes (i.e., how many elements are currently in the bucket) in an integer array I, which has the same size as the current number of buckets (n). Then, the algorithm shuffles each key into buckets by using the model F_A to predict its empirical CDF value and scaling it out to the current number of buckets in that round (Line 14). If the predicted bucket (at index pos) has reached its capacity, then the key is placed in the spill bucket S; otherwise, the key is inserted into the bucket (Lines 15-19). Here, we calculate the bucket start offset as pos · b and the write offset within the bucket as I[pos]. After one iteration, each bucket will be logically split further into f smaller buckets (Lines 20-21), until the buckets are smaller than threshold t (Line 11). Note that, in order to preserve memory, we reuse the arrays A and B by simply swapping the read and write pointers (Line 22) and updating the bucket splitting parameters (Lines 20-21).

Step 2: When the bucket capacity reaches t, we switch to a model-based Counting Sort-style routine (Lines 23-38) to map the items to their final positions. We again do that using the model, which now predicts the exact index position, not the bucket. That is, we first calculate the final position for every key (Line 28) and store in array K the count of keys that are mapped to each predicted index (Line 29). The array K is then transformed to a running total (Line 31). Finally, we place the items into their final positions using the cumulative counts (Lines 32-38), which is similar to the Counting Sort routine [7, pp.168-170]. As we only sort one bucket at a time and want to keep the array size of K small, we use an offset to set the start index of the bucket in Lines 28-36.

We switch to the model-based Counting Sort for two reasons. First, and most importantly, it helps improve the overall sorting time, as we are able to fully utilize our model's precision for fine-level predictions. Second, it helps reduce the number of overflows (see Section 3.1.2 for more details).

Step 3: After the last sorting stage we remove any empty space and, for non-monotonic models, correct any potential mistakes with Insertion Sort (Line 40).

Step 4: Finally, because we used a spill bucket (S) for the overflowing elements in Stage 1, we have to sort it and merge it with the sorted buckets before returning (Lines 42-43). Provided that the sorting algorithm for the spill bucket is stable, Learned Sort also maintains the stability property.

2.3 Implementation optimizations

The pseudocode in Algorithm 2 gives an abstract view of the procedure; however, we have used a number of optimizations at the implementation level. Below we describe the most important ones:
• We process elements in batches. First we use the model to get the predicted indices for all the elements in the batch, and then place them into the predicted buckets. This batch-oriented approach maintains cache locality.
• As in Algorithm 1, we can over-provision array B by a small factor (e.g., 1.1×) in order to increase the bucket sizes and consequently reduce the number of overflowing elements in the spill bucket S. This in turn reduces the sorting time for S.
• Since the bucket sizes in Stage 2 are small (i.e., b ≤ t), we can cache the predicted position for every element in the current bucket in Line 28 and reuse them in Line 34.
• In order to preserve the cache's and TLB's temporal locality, we use a bucket-at-a-time approach, where we perform all the operations in Lines 11-40 for all the keys in a single bucket before moving on to the next one.

The code for the algorithm can be found at http://dsg.csail.mit.edu/mlforsystems.
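To make Stage 1 more tangible, the following is a much-simplified, single-round sketch of the model-based bucketization of Algorithm 2 (added here as an illustration; the released implementation additionally batches predictions, over-provisions the buckets, and swaps read/write pointers as described above). The `predict` callable stands in for Infer(F_A, ·) and is assumed to return a value in [0, 1).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One Stage-1 round: shuffle keys into n fixed-capacity logical buckets using
// a CDF model, spilling overflow into S. Assumes N is divisible by n for brevity.
template <typename Model>
void bucketize_once(const std::vector<double>& in, std::vector<double>& out,
                    std::vector<double>& spill, const Model& predict,
                    std::size_t n /* number of buckets */) {
    const std::size_t N = in.size();
    const std::size_t b = N / n;                 // fixed bucket capacity
    std::vector<std::size_t> fill(n, 0);         // array I: current bucket sizes
    out.assign(N, 0.0);                          // write array holding the logical buckets

    for (double x : in) {
        std::size_t pos = std::min(n - 1, static_cast<std::size_t>(predict(x) * n));
        if (fill[pos] >= b) spill.push_back(x);  // bucket full -> spill bucket S
        else out[pos * b + fill[pos]++] = x;     // bucket start offset + write offset
    }
}
```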
2.4 Choice of the CDF model

Our sorting algorithm does not depend on a specific model to approximate the CDF. However, it is paramount that the model is fast to train and has a very low inference time, to keep the overall sorting time low. Thus, models such as KDE [43, 47], neural networks, or even perfect order-preserving hash functions are usually too expensive to train or execute for our purposes. One might think that histograms would be an interesting alternative, and indeed histogram-based sorting algorithms have been proposed in the past [7, pp.168-177]. Unfortunately, histograms have the problem that they are either too coarse-grained, making any prediction very inaccurate, or too fine-grained, which increases the time to navigate the histogram itself (see also Section 6.8).

Certainly many model architectures could be used; however, for this paper we use the recursive model index (RMI) architecture as proposed in [29] (shown in Figure 5). RMIs contain simple linear models which are organized into a layered structure, acting like a mixture of experts [29].

Figure 5: A typical RMI architecture containing three layers

2.4.1 Inference. Algorithm 3 shows the inference procedure for an RMI architecture.

Algorithm 3 The inference procedure for the CDF model
Input: F_A - the trained model (F_A[l][r] refers to the r-th model in the l-th layer)
Input: x - the key
Output: r - the predicted rank (between 0 and 1)
1:  procedure Infer(F_A, x)
2:      L ← the number of layers of the CDF model F_A
3:      M_l ← the number of models in the l-th layer of the RMI F_A
4:      r ← 0
5:      for l ← 0 up to L do
6:          r ← x · F_A[l][r].slope + F_A[l][r].intercept
7:      return r

During inference, each layer of the model takes the key as an input and linearly transforms it to obtain a value, which is used as an index to pick a model in the next layer (Line 6). The intermediate models' slope and intercept terms are already scaled out to the number of models in the next layer, hence avoiding additional multiplications at inference time, whereas the last layer returns a CDF value between 0 and 1. Note that the inference can be extremely fast because the procedure uses simple data dependencies instead of control dependencies (i.e., if-statements), consequently making it easier for the optimizer to perform loop unrolling and even vectorization. Hence, for each layer, the inference requires only one addition, one multiplication, and one array look-up to read the model parameters [29].

2.4.2 Training Procedure. Algorithm 4 shows the training procedure, which can be run on a small sample of the input array.

Algorithm 4 The training procedure for the CDF model
Input: A - the input array
Input: L - the number of layers of the CDF model
Input: M_l - the number of linear models in the l-th layer of the CDF model
Output: F_A - the trained CDF model with RMI architecture
1:  procedure Train(A, L, M)
2:      S ← Sample(A)
3:      Sort(S)
4:      T ← [][][]                           ▷ Training sets implemented as a 3D array
5:      for i ← 0 up to |S| do
6:          T[0][0].add((S[i], i/|S|))
7:      for l ← 0 up to L do
8:          for m ← 0 up to M_l do
9:              F_A[l][m] ← linear model trained on the set {t | t ∈ T[l][m]}
10:             if l + 1 < L then
11:                 for t ∈ T[l][m] do
12:                     F_A[l][m].slope ← F_A[l][m].slope · M_{l+1}
13:                     F_A[l][m].intercept ← F_A[l][m].intercept · M_{l+1}
14:                     i ← F_A[l][m].slope · t + F_A[l][m].intercept
15:                     T[l + 1][i].add(t)
16:     return F_A

The algorithm starts by selecting a sample and sorting it using an efficient deterministic sorting algorithm, e.g., std::sort (Lines 2-3), creating a 3D array to represent a tree structure of training sets, and inserting all <key, CDF> pairs into the top node T[0][0]. Here the empirical CDF for a training tuple is calculated as its index in the sorted sample over the sample size (i/|S|). Starting at the root layer, the algorithm trains linear models working its way top-down. The CDF model F_A can be implemented as a 2D array where F_A[l][r] refers to the r-th model in the l-th layer of the RMI. For the root model, the algorithm uses the entire sample as a training set to calculate a slope and intercept term (Line 9). After it has been trained, for each of the training tuples, the root model predicts a CDF value and scales it by M_{l+1} (the number of models in the next layer) (Lines 12-13). Then, it distributes these tuples into multiple training subsets that will be used to train each of the linear models in the subsequent layer. Each tuple goes to the training subset at index i, which is calculated in Line 14 by using the slope and intercept terms of the parent model. This partitioning process continues until the second-to-last layer of the RMI, and each of the newly created training subsets is used to train the corresponding linear models in a recursive fashion.
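As an illustration of how cheap this inference is, here is a minimal C++ sketch of Algorithm 3 for a two-layer RMI (added here, not the authors' code); the root model's slope and intercept are assumed to be pre-scaled by the number of leaf models as described above, and the clamping is an extra safety step not present in the pseudocode.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct LinearModel { double slope, intercept; };

// Two-layer RMI: the root picks a leaf model, the leaf returns a CDF value in [0, 1].
struct TwoLayerRMI {
    LinearModel root;
    std::vector<LinearModel> leaves;

    double infer(double x) const {
        // Layer 0: one multiply, one add, one lookup to select the leaf.
        double r = x * root.slope + root.intercept;
        std::size_t idx = static_cast<std::size_t>(
            std::clamp(r, 0.0, static_cast<double>(leaves.size() - 1)));
        // Layer 1: predicted rank, clamped to [0, 1].
        return std::clamp(leaves[idx].slope * x + leaves[idx].intercept, 0.0, 1.0);
    }
};
```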
2.4.3 Training of the individual linear models. One way to train the linear models is using the closed form of univariate linear regression with an MSE loss function. However, when using linear regression training, it is possible that two neighboring linear models in the last layer of the CDF model predict values in overlapping ranges. Hence, the resulting prediction is not guaranteed to be monotonic, increasing the time that Insertion Sort takes at the end of Algorithm 2. One way to force the CDF model to be monotonic is to use boundary checks for the prediction ranges of each leaf model; however, this comes at the expense of additional branching instructions for every element and at every iteration of the loop in Line 10 of Algorithm 2.

Thus, we opted instead to train the models using linear spline fitting, which has better monotonicity. Furthermore, this method is cheaper to compute than the closed form of linear regression and only requires fitting a line between the min and max keys in the respective training sets.

From the perspective of a single model, splines seem to fit much worse than other models. However, recall that we use several linear models as part of the RMI, which overall maintains a good level of approximation of the CDF even for highly skewed distributions. Therefore, in contrast to [29], using linear splines provides bigger advantages because (1) it is on average 2.7× faster to "train" than the closed-form version of linear regression, and (2) it results in up to 35% fewer key swaps during Insertion Sort.

3 ALGORITHM ANALYSIS

In this section we analyze the complexity and the performance of the algorithm w.r.t. various parameters.

3.1 Analysis of the sorting parameters

3.1.1 Choosing the fan-out (f). One key parameter of our algorithm is the fan-out f. On one hand, we want to have a high fan-out, as it allows us to leverage the model's accuracy to the fullest extent. On the other hand, in order to utilize the cache in the optimal way, we have to restrict the number of buckets, as we otherwise cannot ensure that we can append to a bucket without causing a cache miss. Ideally, for all buckets we would keep in cache the auxiliary data structures, as well as the next empty slot per bucket.

In order to understand this trade-off, we measured the performance of Learned Sort with varying fan-out values (Figure 6) using a random array of 100M doubles, which is large enough to not fit in any cache level. The minimum of this plot corresponds to the point where all the hot memory locations fit in the L2 cache (f ≈ 1-5K). Empirically, the fan-out value that gives the best trade-off for the particular cache size in this setup is 1K. Hence, like in the cache-efficient Radix Sort implementation, this parameter has to be tuned based on the available cache size.

Figure 6: Mapping time for various fan-out values (log scale)

3.1.2 Choosing the threshold (t). The threshold t determines the minimum bucket capacity as well as when to switch to the Counting Sort subroutine (Line 11 in Algorithm 2). We do this for two reasons: (1) to reduce the number of overflows (i.e., the number of elements in the spill bucket) and (2) to take better advantage of the model for the in-bucket sorting. Here we show how the threshold t affects the size of the spill bucket, which directly influences the performance.

If we had a perfect model, every element would be mapped to a unique position. Yet, in most cases, this is impossible to achieve, as we train based on a sample and aim to restrict the complexity of the model itself, inevitably mapping different items to the same position. Then our problem becomes that of randomly hashing N elements onto N unit-capacity buckets (i.e., t = 1). That is, the model that we learn behaves similarly to an order-preserving hash function, as a randomly generated element from this distribution is equally likely to be mapped onto any bucket. This holds for any distribution of the input array A, since applying its CDF function F_A to the keys yields values that are uniformly distributed in [0, 1] [13]. Since we are using N buckets, the k-th bucket will contain the value range [(k−1)/N, k/N), and the probability of the k-th bucket being empty is (1 − 1/N)^N, which is approximately 1/e for large N. This means that approximately N/e buckets will be empty at the end of the mapping phase and, correspondingly, approximately N/e elements will overflow into the spill bucket. Using s to denote the size of the spill bucket, we have E[s] = N/e.

In the general case, our problem is that of randomly hashing N elements into N/t buckets, each with capacity t ≥ 1. Then, the expected number of overflowing elements is:

E[s] = (N / (t · e^t)) · Σ_{i=0}^{t−1} (t − i) · t^i / i!

Capacity (t)   Overflow        Capacity (t)   Overflow
     1          36.7%               25          7.9%
     2          27.1%               50          5.6%
     5          17.5%              100          3.9%
    10          12.5%             1000          1.3%

Table 1: Bucket capacity (t) vs. proportion of elements in the spill bucket (E[s]/N).
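As a quick sanity check (a snippet added here, not part of the paper), the closed form above can be evaluated numerically and reproduces the proportions in Table 1; the Poisson terms are computed in log space so the sum stays stable even for t = 1000.

```cpp
#include <cmath>
#include <cstdio>

// E[s]/N = (1/(t*e^t)) * sum_{i=0}^{t-1} (t - i) * t^i / i!
//        = (1/t) * sum_{i=0}^{t-1} (t - i) * Poisson(i; lambda = t)
double overflow_fraction(int t) {
    double sum = 0.0;
    for (int i = 0; i < t; ++i) {
        double log_p = -double(t) + i * std::log(double(t)) - std::lgamma(i + 1.0);
        sum += (t - i) * std::exp(log_p);
    }
    return sum / t;
}

int main() {
    for (int t : {1, 2, 5, 10, 25, 50, 100, 1000})
        std::printf("t = %4d  ->  overflow = %4.1f%%\n", t, 100.0 * overflow_fraction(t));
    return 0;  // ~1/e for t = 1 down to ~1.3% for t = 1000, matching Table 1
}
```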
Table 1 shows the proportion of elements in the spill bucket for various bucket capacities. Empirically, we found that we can maximize the performance of Learned Sort when the spill bucket contains less than 5% of the elements; therefore, we set the minimum bucket capacity to be t = 100.

3.1.3 The effect of model quality on the spill bucket size. The formula that we presented in the section above shows the expected size of the spill bucket when the model perfectly learns the underlying distribution. However, in this section, we analyze the expected size of the spill bucket for an approximation model that is trained on a small sample of the input.

For this analysis, we again assume that we can treat the sample as independently generated from the underlying distribution, and that we use unit-capacity buckets. For simplicity, we also linearly transform the input space into the range [0, 1]. Let f(x) be the CDF of the underlying distribution and g(x) be our approximation that is learned from the sample. In the mapping phase, the keys in the range [a, b] will be mapped to N · (g(b) − g(a)) buckets, whereas, in expectation, there will be N · (f(b) − f(a)) keys present in the range [a, b]. The difference between the number of elements that are actually present in that range and the number of buckets that they are mapped to leads to bucket overflows.

For a small range of input keys dx, the number of elements in the array within the range [x, x + dx) will be proportional to f′(x)dx, but our approximation will map them to g′(x)dx buckets. So the problem turns into hashing f′(x)dx elements into g′(x)dx buckets of unit capacity. The number of empty buckets, which is the same as the number of overflowing elements, in this input range will be g′(x) · e^(−f′(x)/g′(x)) dx, which is then integrated over the input range [0, 1]. This gives us the following equation for the expected number of overflowing elements w.r.t. the model's quality of approximation:

E[s] = N · ∫₀¹ g′(x) · e^(−f′(x)/g′(x)) dx

Using Jensen's inequality [25], it can be shown that this number is never less than N/e, with equality occurring when f′(x) = g′(x). This shows that, for small samples, learning the underlying distribution leads to fewer elements in the spill bucket. A qualitative takeaway from the formula above is that one needs to approximate the derivative of the CDF (the PDF) well in order to minimize the expected size of the spill bucket, and therefore maximize performance.
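For completeness, the Jensen step can be spelled out as follows (a short derivation added here, not in the original text): since g is a CDF on [0, 1], ∫₀¹ g′(x) dx = 1, so g′ defines a probability density, and because e^(−z) is convex,

$$E[s] \;=\; N \int_0^1 g'(x)\, e^{-\frac{f'(x)}{g'(x)}}\, dx \;\ge\; N \exp\!\left(-\int_0^1 g'(x)\,\frac{f'(x)}{g'(x)}\, dx\right) \;=\; N e^{-1},$$

with equality exactly when f′(x)/g′(x) is constant (and hence equal to 1), i.e., when the learned density matches the true one.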
3.1.4 Choosing the sample size. We now discuss how the model quality changes w.r.t. increasing sample size. We approximate the CDF of an element in a sample by looking at its position in the sorted sample. This empirical CDF of the sample is different from its CDF in the distribution that generated the sample. The Dvoretzky-Kiefer-Wolfowitz inequality [12] is one method for generating CDF-based confidence bounds, and it states that for a given sample size n the difference between the empirical CDF and the real CDF is on the order of O(1/√n). So, as the sample size increases, the accuracy of its empirical CDF improves, making the model better.

On the other hand, a large sample increases the training time, creating a trade-off between model quality and runtime performance. Figure 7 shows the trend of the time to sort the array (Sorting Time) and the total time (Training and Sorting) w.r.t. the sample size, while keeping the other parameters constant. In the figure, as the sample size increases we can see the sorting time decrease, because a larger sample leads to a better model quality. However, the training time keeps increasing with the sample size. We empirically found that a sampling rate of 1% provides a good balance in this trade-off, as is evident from the graph.

Figure 7: Sorting time for various sampling rates of 100M normally-distributed keys (log-log scale).

3.2 Complexity

Stage 1 of the sorting algorithm scans the input keys sequentially and for each key uses the trained CDF model to predict which bucket it goes to. This process takes O(N · L) time, where L is the number of layers in the model. Since we split the buckets progressively using a fan-out factor f until a threshold size t, the number of iterations and the actual complexity depend on the choice of f and t. However, in practice we use a large fan-out factor, therefore the number of iterations can be considered constant.

On the other hand, Stage 2 of the algorithm uses a routine similar to Counting Sort, which is a linear-time sorting procedure with respect to the bucket capacity (t), hence accounting for an O(N) term. Assuming that the CDF model is monotonic, the worst-case complexity of Stage 3 is O(Nt), due to the fact that we use Insertion Sort and we have a threshold on the bucket size. Otherwise, the worst case for a non-monotonic model would be O(N²). However, the constant term of this stage depends on the model quality: a good model, such as the one described above, will provide a nearly-sorted array before the Insertion Sort subroutine is called, hence making this step non-dominant (refer to Figure 13).

Finally, as in the touch-up stage, the performance of Stage 4 also depends on the model quality. Assuming we employ a traditional, asymptotically optimal, comparison-based sorting routine for the spill bucket S, this stage's worst-case complexity is O(s log s) + O(N) (the O(N) accounts for the merging step). Then again, a good model will not permit the size of the spill bucket to inflate (empirically ≤ 5%), which makes this step fairly insignificant in practice (see also Figure 13).

The space complexity of Learned Sort is in the order of O(N), given that the algorithm uses auxiliary memory for the buckets, which is linearly dependent on the input size. The in-place version introduced in Section 4.1, however, only uses a memory buffer that is independent of the input size, therefore accounting for an O(1) space complexity.

4 EXTENSIONS

So far we have only discussed the Learned Sort implementation that is not in-place and does not handle strings or other complex objects. In this section, we provide an overview of how we can extend the Learned Sort algorithm to address these shortcomings, whereas in the Evaluation section we show experiments for an in-place version as well as early results for using Learned Sort over strings.

4.1 Making Learned Sort in-place

The current algorithm's space complexity is linear with respect to the input size due to the requirements of the mapping stage of the algorithm. However, there is a way of making the algorithm in-place (i.e., have a constant memory requirement that is independent of the input size). The in-place version of the mapping stage would group the input keys into blocks such that all the elements in each block belong to the same bucket. The algorithm maintains a small buffer equal to the block size for every bucket. It iterates over the unsorted array and maps the elements into their respective buffers, and whenever a buffer fills up, it is written onto the already-scanned section of the array. These blocks are then permuted, so that blocks from the same bucket are stored contiguously. Note that this type of approach is very common for designing in-place algorithms, such as in [2], and we show results in Section 6.6.

4.2 Learning to sort strings

The primary focus so far has been on sorting numerical values, and extending our CDF model to strings creates a number of unique challenges. While we are still investigating how to best handle strings, and much of the existing work on ML-enhanced data structures and algorithms so far only considers numeric values [11, 29, 55], we outline an early implementation we did for strings, which also has very compelling performance (see Section 6).

Our string model has an RMI architecture, but represents strings as input feature vectors x ∈ Rⁿ, where n is the length of the string. For simplicity, we can work with fixed-length strings by padding shorter sequences with null characters. Then, one way to train the CDF model is to fit multivariate linear regression models (w·x + b) over the feature vectors in each layer of the model. However, this strategy is computationally expensive, as it would take O(n) operations at every layer of the model. As a workaround, we could limit the characters considered by the model; however, that might lead to non-monotonic predictions. If we consider C₁, C₂, ..., Cₙ to be the ASCII values of the characters in the string, then we can obtain a monotonic encoding of the string by calculating C₁/256 + C₂/256² + ... + Cₙ/256ⁿ. This value is bound to be between zero and one, is monotonic, and can potentially be used as a CDF value. This prediction would be accurate if the ASCII value of each character were uniformly distributed and independent of the values of the other characters in the string. This is not always the case, so we transform the encodings to make their distribution uniform.

In the training phase, we take a sample from the array, encode the strings using their ASCII values, and use the encodings to map the strings into buckets. If the bucket sizes are uneven, we readjust the encoding ranges falling into these buckets by making a linear transformation of the slope and intercept terms of the respective models. Then we re-map the strings into another layer of buckets after this linear transformation. This re-mapping step continues until we obtain evenly distributed buckets. Similar to the numeric algorithm, we split the array into finer buckets until a threshold size, after which point we use std::sort. Some promising preliminary results on this approach are shown in Section 6.4.
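As a small illustration of the base-256 encoding described above (a sketch added here, not the paper's code; the prefix length is a hypothetical parameter):

```cpp
#include <cstddef>
#include <string>

// Monotone encoding of an ASCII string into [0, 1): C1/256 + C2/256^2 + ... .
// Lexicographic order is preserved; shorter strings are padded with '\0'.
double encode_string(const std::string& s, std::size_t max_len = 8) {
    double value = 0.0, scale = 1.0 / 256.0;
    for (std::size_t i = 0; i < max_len; ++i) {
        unsigned char c = (i < s.size()) ? static_cast<unsigned char>(s[i]) : 0;
        value += c * scale;
        scale /= 256.0;
    }
    return value;  // can be fed to the bucketization stage as a raw CDF estimate
}
```

Note that a double only resolves roughly the first six characters (256⁶ = 2⁴⁸ distinct prefixes fit within the 53-bit mantissa), so in practice only a prefix of each string influences the predicted bucket.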
4.3 Duplicates

The number of duplicates (i.e., repeated keys) is a factor that affects the run-time behavior of our algorithm, as our model will always assign the same bucket to a repeated key, which, by definition, increases the number of collisions and the number of keys placed into the spill bucket. Consequently, the spill bucket inflates and Stage 4 of the algorithm takes longer to execute, since it relies on a slower algorithm.

As a remedy, we incorporated a fast heuristic into our algorithm that detects repeated keys at training time. While building the CDF model, the algorithm looks at the frequencies at which equal keys appear in the training sample and, if a frequency is above a certain threshold, it adds the corresponding key to an exception list. Then, at sorting time, if the algorithm comes across an element whose key is in the exception list, it skips the bucket insertion step and only merges the repeated keys at the end of the procedure. In the absence of duplicates, we found that this additional step only introduces a small performance overhead (<3%), which is a tolerable cost for the average case.

5 RELATED WORK

Sorting algorithms are generally classified as comparison-based or distribution-based, depending on whether they rely only on pairwise key comparisons to come up with a sorted order, or rather make some assumptions or estimations about the distribution of the keys.

Comparison sorts: Some of the most common comparison sorts are Quicksort, Mergesort, and Insertion Sort. While they all have a lower bound of Ω(N log N) comparisons in the average case, their performance in practice also depends largely on factors such as memory access patterns, which dictate their cache efficiency.

The GNU Standard Template Library in C++ employs Introsort [38] as its default sorting function (std::sort) [16], which combines the speed of Quicksort in the average case with the optimal worst case of Heapsort and the efficiency of Insertion Sort for small arrays. Introsort was also adopted by the standard library of the Go language [17] and that of the Swift language until version 5 [1].

Samplesort is another variant of Quicksort that uses multiple pivots selected from a sample of the input array (note that this does NOT create a histogram over the data). Thus, the elements are arranged into partitions in a finer-grained way that enables early termination of the recursion when a certain partition contains only identical keys. One of the most recent and efficient implementations of Samplesort is the Super Scalar Samplesort, initially introduced in [48] and later improved with an in-place implementation in [2] (IPS4o). The strongest points of this implementation are the use of a binary search tree to efficiently discover the right partition for each key, and the avoidance of conditional branches with the help of conditional instructions provided in modern processors.

Java's List.sort() function [23] and Python's built-in sorting function [45] use a hybrid of Mergesort and Insertion Sort, called Timsort [36]. Timsort combines Insertion Sort's ability to benefit from the input's pre-sortedness with the stability of Mergesort, and it is said to work well on real-world data which contain intrinsic patterns. The procedure starts by scanning the input to find pre-sorted key sub-sequences and proceeds to merge them onto the final sorted output.

It should be noted that most open-source DB systems implement their sorting routines building upon Quicksort or Mergesort, depending on whether the data fits in memory or whether the ordering needs to be stable [37, 39, 46, 52].

Distribution sorts comprise the other major group of sorting procedures, and they include algorithms like Radix Sort, Counting Sort, and Histogram Sort. Radix Sort is arguably the most commonly used one, and it works by calling the Counting Sort subroutine for all the keys, scanning a given number of digits at a time. The Counting Sort subroutine calculates a count histogram of all the keys based on the selected digits, transforms that into a cumulative histogram by generating running totals, and then re-orders the keys back into a sorted order based on these calculated counts. Radix Sort's time complexity is O(d · (N + r)), where d is the number of passes over the data (i.e., the number of digits divided by the radix size), N is the input size, and r is the range of each key (i.e., 10 raised to the power of the radix size). Note that, in order to use Radix Sort with IEEE-754 floating-point numbers, it is first necessary to shift and mask the bit representation. While Radix Sort is highly sensitive to the key length, which dictates the number of passes, it is nevertheless a very efficient sorting algorithm for numerical types that is well-suited for multi-core procedures [6, 22, 40] and SIMD vectorization [50].
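For reference, the usual order-preserving bit transformation for IEEE-754 doubles looks as follows (a standard trick shown here for illustration, not code from the paper): negative values have all bits flipped, non-negative values have only the sign bit flipped, after which an unsigned radix sort over the resulting integers yields the correct numeric order for finite values.

```cpp
#include <cstdint>
#include <cstring>

// Map a finite IEEE-754 double to a uint64_t whose unsigned order matches the
// double's numeric order, so that it can be fed to an unsigned Radix Sort.
uint64_t double_to_sortable(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));
    return (bits & 0x8000000000000000ULL) ? ~bits : (bits | 0x8000000000000000ULL);
}
```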
Most related to Learned Sort is Histogram Sort [4]. However, Histogram Sort implicitly assumes a uniform distribution for the input data, as it allocates n variable-sized buckets and maps each key x into a bucket B_i by calculating i = n · (x − x_min)/(x_max − x_min). It then sorts these buckets using Insertion Sort and merges them in order.

SIMD optimization: There has been a lot of work on enhancing traditional sorting implementations with data parallelism in SIMD-enabled CPUs [5, 14, 26], as well as on the use of adaptive and cache-sensitive partitioning techniques for multi-core or multi-processor implementations [3, 6, 9, 21, 50]. Nevertheless, there has not been much recent innovation in the algorithmic space for sorting, and we found that IS4o, one of our baselines, is one of the most competitive openly available implementations.

Hashing functions: In a way, the CDF model might be regarded as an order-preserving hash function for the input keys, such as [8, 32]. However, order-preserving hashing is unsuitable for sorting, since it does not provide fast enough training and inference times and, to the best of our knowledge, there does not exist any sorting algorithm that uses order-preserving hashing. Similarly, locality-sensitive hashing [54, 56, 57] can also not be used, as we are concerned with sorting a single dimension rather than efficiently finding similar items in a multi-dimensional space. Finally, perfect hashing's objective is to avoid element collisions, which would initially seem an interesting choice for Learned Sort. However, perfect hash functions grow in size with the input data, are not fast to train, and, most importantly, are usually not order-preserving [10].

ML-enhanced algorithms: There has been growing interest in the use of Machine Learning techniques to speed up traditional algorithms in the context of systems. Most notably, the work on Learned Index Structures [29] introduces the use of an RMI structure to substitute traditional B+ tree indexes in database systems, a work that was followed up later on by [11]. In addition, other ML algorithms, as well as reinforcement learning, have been used to tune system parameters or optimize query execution plans [27, 30, 34, 35, 58]. Finally, this new research trend has also reached other system applications outside databases, such as scheduling [33], congestion control [24], and frequency estimation in data stream processing [20].

6 EVALUATION

The goal of this section is to:
• Evaluate the performance of Learned Sort compared to other highly tuned sorting algorithms over real and synthetic datasets
• Explain the time breakdown of the different stages of Learned Sort
• Evaluate the main weakness of Learned Sort: duplicates
• Show the relative performance of the in-place versus the out-of-place Learned Sort
• Evaluate the Learned Sort performance over strings

6.1 Setup and datasets

As baselines, we compare against cache-optimized and highly tuned C++ implementations of Radix Sort [51], Timsort [18], Introsort (std::sort), Histogram Sort [4], and IS4o [49] (one of the most optimized sorting algorithms we were able to find, which was also recently used in other studies [40] as a comparison point). Note that we use a recursive, equi-depth version of Histogram Sort that adapts to the input's skew so as to avoid severe performance penalties. While we are presenting only the most competitive baselines, we have, in fact, conducted measurements against Mergesort, Heapsort, Insertion Sort, Shell Sort, Quicksort, and one of its improved variants, PDQS [44]. However, we did not consider them further, as their performance was always worse than that of one of the other baselines. Note that for all our experiments we include the model training time as part of the overall sorting time unless mentioned otherwise.

We evaluate the performance of our algorithm for numerical keys on both synthetic and real datasets of varying precision. For the synthetic datasets we generated the following distributions:
• Uniform distribution with min=0 and max=1
• Multimodal distribution that is a mixture of five normal distributions, whose PDF is shown in the histogram below the performance charts in Figure 9
• Exponential distribution with λ = 2, scaled by a factor of 10⁶ (80% of the keys are concentrated in 7% of the key domain)
• Lognormal distribution with µ = 0 and σ = 1 that has an extreme skew (80% of the keys are concentrated in 1% of the key range)

We also use real-world data from OpenStreetMap [42] and sort 100M longitude and latitude compound keys (osm/longlat), generated using the transformation longlat = 180 · lon + lat, as in [11], as well as their respective node IDs (osm/id). In addition, we use an IoT dataset [15] to sort the iot/mem and iot/bytes columns (10M keys), which represent the amount of available memory and the number of input bytes to a server machine at regular time intervals. Thirdly, we use the Facebook RW dataset (fb/rw) to sort 1.1M user IDs collected from a random walk in the Facebook user graph [31]. Finally, we show results from TPC-H benchmark data on the customer account balances (tpch/bal, 3M keys) and order keys (tpch/o_key, 30M keys) for a scale factor of 20, which are of course not real but represent data that is often used to evaluate the performance of database systems. All datasets were randomly shuffled before sorting. We display the distributions of these datasets with histograms below each result in Figure 9.

All the experiments were measured on a server-grade Linux machine with an Intel® Xeon® Gold 6150 CPU @ 2.70GHz and 376GB of memory, and the code was compiled with GCC 9.2 with the -O3 flag for full optimizations³. The model we used for training was always 2 layers deep and contained 1000 leaf models, trained with a uniformly selected 1% sample of the array.

³ Intel Xeon and Intel Xeon Phi are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. © Intel Corporation. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.
Research 11: Machine Learning for Databases II SIGMOD ’20, June 14–19, 2020, Portland, OR, USA

Histogram Sort

(A) synthetic, 64-bit floating points (B) real/benchmark, 64-bit floating points
(high precision)

100M 100M 100M 100M 100M 10M 10M 3M

(C) real/benchmark, 64-bit floating points (D) synth & real, 32-bit integers
(low precision)

100M 10M 10M 3M 100M 100M 1.1M 30M

Figure 9: The sorting rate of Learned Sort and other baselines for real and synthetic datasets containing both doubles and
integers. The pictures below the charts visualize the key distributions and the dataset sizes.

6.2 Overall Performance
As the first experiment we measured the sorting rate for array sizes varying from 1 million up to 1 billion double-precision keys following a standard normal distribution, and compared it to the baseline algorithms that we selected as described in Section 6.1. The sorting rate (bytes per second) is shown in Figure 8 for Learned Sort and our main baselines, in addition to SageDB::sort [28]. As can be seen, Learned Sort achieves an average of 30% higher throughput than the next best algorithm (IS⁴o) and 55% as compared to Radix Sort for larger data sizes. However, when the data fits into the L3 cache, as is the case with 1 million keys (roughly 8MB), Radix Sort is almost as fast as Learned Sort. However, as soon as the data does not fit into the L3 cache, the sorting rate of Learned Sort is significantly higher than that of Radix Sort or IS⁴o. Furthermore, Learned Sort's cache optimization enables it to maintain a good sorting throughput even for sizes up to 8GB.
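For reference, the sorting rate reported throughout this section is simply the number of key bytes sorted per second. A minimal timing harness along the following lines (our own illustration, not the benchmark code used for the paper) shows how such a rate can be measured for a baseline such as std::sort on normally-distributed doubles.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Illustrative measurement of sorting rate (bytes per second); adjust n from
// 1 million up to 1 billion keys to reproduce the size sweep.
int main() {
    const std::size_t n = 10'000'000;
    std::vector<double> keys(n);
    std::mt19937_64 gen(42);
    std::normal_distribution<double> dist(0.0, 1.0);
    for (auto& k : keys) k = dist(gen);

    auto start = std::chrono::steady_clock::now();
    std::sort(keys.begin(), keys.end());
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    double mb_per_sec = double(n * sizeof(double)) / secs / 1e6;
    std::printf("sorted %zu keys in %.2f s -> %.1f MB/s\n", n, secs, mb_per_sec);
    return 0;
}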
6.3 Sorting rate
To better understand the behaviour of our algorithm, we compared Learned Sort against our other baselines on (A) synthetic data with 64-bit doubles generated from different distributions, (B) high-precision real-world/benchmark data with 64-bit doubles, which have at least 10 significant digits, (C) low-precision real-world/benchmark data with 64-bit doubles with reduced floating point precision, and (D) synthetic and real-world data with 32-bit integers.
to as "Spectre" and "Meltdown". Implementation of these updates may make Figure 9A shows that Learned Sort consistently outper-
these results inapplicable to your device or system. forms Radix Sort by an average of 48%, IS4 o by an average

1012
Research 11: Machine Learning for Databases II SIGMOD ’20, June 14–19, 2020, Portland, OR, USA

Figure 10: The sorting rate for various strings datasets.

of 27%, and the other baselines by much larger margins. The


same performance gain is also present for the high-precision
real-world datasets (Figure 9B). Note, that we achieve a sig-
nificant higher sorting rate for the tpch/bal data due to the Reference line Histogram Sort

fact that it is smaller, and incurs more cache benefits. Figure 11: The sorting rate of Learned Sort and the other
On the other hand, we observed that Radix Sort’s perfor- baselines for varying degrees of duplicated keys and number
mance improves when the input keys have less precision of spikes, as well as on different Zipf distributions. The ref-
(Figure 9C). In this case, Learned Sort and all the other base- erence line represents the sorting rate of Learned Sort where
lines algorithms remain unaffected, while Radix Sort gets there are no duplicates.
a 34% performance boost. This improvement results from
the fact that everything can be sorted on the most signifi- • bcmrk: 10M ASCII arrays generated using the code from
cant bits. However, it is necessary to note that Radix Sort’s the SortBenchmark dataset[19]
performance does not surpass that of Learned Sort. • synth, synth_lo, synth_hi : A set of 1M randomly gen-
Finally, in Figure 9A we show that for integers Learned erated strings following a uniform distribution of a-z
Sort has an even higher throughput and, in some cases, even a characters at each position with characters having no
bigger benefit. For example, on the synthetic integer dataset correlation, low correlation, and high correlation with
it is 38% better than IS4 o and twice as fast as Radix Sort. neighbouring ASCII values, respectively.
Whereas for the FB dataset and OSM IDs the performance Overall, the experiment show that Learned Sort is also a
difference compared to IS4 o is less because of the particular very promising direction for sorting complex objects, such
distribution of values and duplicates (see Section 4.3). as strings. It should be noted, that building efficient models
for string is still an active area of research and probably a
6.4 Sorting Strings paper on its own as it also have far reaching applications
Figure 10 show the preliminary sorting rate for strings for for indexes, tries, and many other data structures and algo-
our algorithm of Section 4.2 with respect to IS4 o, std::sort, rithms. Finally, we would like to point out that if we include
and Timsort. In this experiment we excluded the training the training time with the sorting time, Learned Sort still
time for the model. However, it should be noted, that many dominates the other algorithms but by a margin of of 2-8%
real-world scenarios exists in which a dataset or a subset has rather than the 5-20% shown in Figure 9.
to be sorted several times. For example, within a database
recurring merge-joins operation or the sorting for the final 6.5 The impact of duplicates
result, would allow to pre-train models as similar (but not The number of duplicates (i.e., repeated keys) is a factor that
identical) subsets of the data might appear over and over affects the run-time behavior of our algorithm as our model
again. Note, that we excluded Radix Sort from this compar- will always assign the same position/bucket to the key, which
ison as it was significantly slower than any of the other increases the number of collisions and the number of keys
baselines. For the data we used: placed into the spill bucket. To study the impact of duplicates,
• addr: A set of 1M address strings from the OpenAd- we first generate a normal distribution dataset (µ = 0 and
dresses dataset (Northeast USA)[41]. σ = 5) and afterwards duplicate a few randomly selected
• dict: A set of 479K words from an English dictionary[53]. keys n-times (referred to as spikes).
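As an illustration, such duplicate-heavy inputs can be generated along the following lines. This is a sketch under our own assumptions about the exact procedure (the helper name make_spiky, the way spike values are chosen, and the final shuffle are ours); the number of spikes and the duplication factor n are the parameters varied in Figure 11.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative generator: start from N(0, 5) keys, then overwrite random
// positions with a few "spike" values so that each spike value appears
// roughly n times in the array.
std::vector<double> make_spiky(std::size_t size, std::size_t num_spikes,
                               std::size_t n, std::uint64_t seed = 1) {
    std::mt19937_64 gen(seed);
    std::normal_distribution<double> normal(0.0, 5.0);
    std::vector<double> keys(size);
    for (auto& k : keys) k = normal(gen);

    std::uniform_int_distribution<std::size_t> pos(0, size - 1);
    for (std::size_t s = 0; s < num_spikes; ++s) {
        double spike = keys[pos(gen)];       // pick a random existing key
        for (std::size_t i = 0; i < n; ++i)
            keys[pos(gen)] = spike;          // duplicate it n times
    }
    std::shuffle(keys.begin(), keys.end(), gen);
    return keys;
}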
Figure 11 shows the performance of Learned Sort and the other baseline algorithms for different combinations of the
number of spikes and number of duplicate keys, in addition to three different Zipf distributions with parameters 0.25, 0.50, and 0.90. As the results demonstrate, our technique from Section 4.3 ensures that Learned Sort remains highly competitive even for datasets with large degrees of duplicates. Only for the most extreme cases with 50-80% duplicates is IS⁴o actually faster than Learned Sort.

6.6 In-place sorting
The performance of the in-place version described in Section 4.1 is shown in Figure 12, and as observed, the sorting rate drops by an average of 8-11% as compared to when the mapping procedure directly uses fixed-capacity buckets.

Figure 12: The sorting rate of Learned Sort and its in-place version for all of our synthetic datasets.

6.7 Performance decomposition
Next, in Figure 13 we show the time spent in each phase of Learned Sort. With a 1% sample from the input, the training procedure only accounts for ≈ 6% of the overall runtime. The majority of the time (≈ 80%) goes towards the model-based key bucketization. Since the CDF model produces a nearly-sorted order, the touch-up step with Insertion Sort makes up only 5% of the total time, and the spill bucket sorting only 2%, which reconfirms the quality of the model predictions.

Figure 13: Performance of each of the stages of Learned Sort (Training, Bucketization, Insertion Sort, Spill bucket sorting, and Merging) as a share of the total runtime (0-100%).
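To relate these stages to one another, the sketch below shows one simplified way they can be composed: training on a roughly 1% sample, model-based bucketization with a spill bucket for overflow, an Insertion Sort touch-up over the nearly-sorted concatenation of the buckets, sorting of the spill bucket, and a final merge. It reuses the illustrative CdfModel sketched in Section 6.1 and is not the authors' implementation; the bucket count and capacity are arbitrary choices, and a large, randomly shuffled input is assumed.

#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified composition of the measured stages; see the CdfModel sketch in
// Section 6.1 for the model used here.
std::vector<double> learned_sort_sketch(const std::vector<double>& in) {
    const std::size_t n = in.size();
    const std::size_t num_buckets = 1000;
    const std::size_t cap = 2 * (n / num_buckets + 1);   // fixed bucket capacity

    // (1) Training: fit the CDF model on a ~1% sample of the (shuffled) input.
    std::vector<double> sample;
    for (std::size_t i = 0; i < n; i += 100) sample.push_back(in[i]);
    CdfModel model;
    model.train(sample, /*num_leaves=*/1000);

    // (2) Bucketization: place each key into its predicted bucket, spilling
    //     overflow into a separate bucket.
    std::vector<std::vector<double>> buckets(num_buckets);
    std::vector<double> spill;
    for (double key : in) {
        std::size_t b = model.predict(key, num_buckets);
        if (buckets[b].size() < cap) buckets[b].push_back(key);
        else spill.push_back(key);
    }

    // (3) Insertion Sort touch-up over the nearly-sorted concatenation.
    std::vector<double> out;
    out.reserve(n);
    for (auto& b : buckets) out.insert(out.end(), b.begin(), b.end());
    for (std::size_t i = 1; i < out.size(); ++i) {
        double key = out[i];
        std::size_t j = i;
        while (j > 0 && out[j - 1] > key) { out[j] = out[j - 1]; --j; }
        out[j] = key;
    }

    // (4) Spill bucket sorting and (5) merging of the two sorted runs.
    std::sort(spill.begin(), spill.end());
    std::vector<double> result(n);
    std::merge(out.begin(), out.end(), spill.begin(), spill.end(), result.begin());
    return result;
}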
6.8 Using histograms as CDF models
Finally, in Figure 14 we run a micro-experiment to show the performance of a version of our Learned Sort algorithm that uses a histogram as the CDF model. We also compare this approach with Histogram Sort [4]. For both of these algorithms we show the performance using both equi-width and equi-depth histograms. The equi-depth CDF histogram model was implemented as a linear array that records the starting key for each bin and uses binary search to find the right bin for any given key. On the other hand, for the equi-width version we do not need to store the keys and we can find the right bin using a simple array lookup.
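The two lookup strategies just described can be illustrated as follows (a sketch with our own naming, not the paper's code): the equi-depth variant searches an array of bin start keys, while the equi-width variant computes the bin index directly.

#include <algorithm>
#include <cstddef>
#include <vector>

// Equi-depth histogram: starts[i] holds the smallest key of bin i (bins were
// built so that each holds the same number of sample keys); the bin of a key
// is found with binary search.
std::size_t equi_depth_bin(const std::vector<double>& starts, double key) {
    auto it = std::upper_bound(starts.begin(), starts.end(), key);
    return it == starts.begin() ? 0 : std::size_t(it - starts.begin()) - 1;
}

// Equi-width histogram: bins split [min_key, max_key] into equal ranges, so
// the bin index is a single arithmetic lookup and no keys need to be stored.
std::size_t equi_width_bin(double min_key, double max_key,
                           std::size_t num_bins, double key) {
    double idx = (key - min_key) * double(num_bins) / (max_key - min_key);
    if (idx < 0.0) idx = 0.0;
    if (idx > double(num_bins - 1)) idx = double(num_bins - 1);
    return static_cast<std::size_t>(idx);
}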
As a reference, the figure includes the RMI-based Learned Sort as a dashed line at around 280MB/s. The figure shows that the number of bins has a clear impact on the sorting rate of the histogram-based method. Surprisingly, Histogram Sort using equi-depth histograms performs better than our Learned Sort with a histogram as the model. The reason is that the additional number of passes over the data performed as part of Learned Sort does not pay off for the imprecision of the histogram-based models (consider, for example, Step 2 in Algorithm 2). However, Learned Sort with an RMI model is almost twice as fast. The advantage of using RMI models comes from the fact that they use continuous functions to quickly estimate the position for a key.

Figure 14: The sorting rate of the Learned Sort algorithm on 100M normally-distributed keys as compared with (1) a version of LS that uses an equi-depth histogram as the CDF model, (2) a version with an equi-width histogram, (3) equi-depth Histogram Sort, and (4) equi-width Histogram Sort.

7 CONCLUSION AND FUTURE WORK
In this paper we presented a novel approach to accelerate sorting by leveraging models that approximate the empirical CDF of the input to quickly map elements into their sorted position within a small error margin. This approach results in significant performance improvements as compared to the most competitive and widely used sorting algorithms, and marks an important step towards building ML-enhanced algorithms and data structures. Much future work remains open, most notably how to handle strings, complex types, and parallel sorting.

ACKNOWLEDGMENTS
This research is supported by Google, Intel, and Microsoft as part of the MIT Data Systems and AI Lab (DSAIL) at MIT, NSF IIS 1900933, DARPA Award 16-43-D3M-FP040, and the MIT Air Force Artificial Intelligence Innovation Accelerator (AIIA).


REFERENCES
[1] Apple. 2018. Apple/Swift: standard library sort. (2018). https://github.com/apple/swift/blob/master/test/Prototypes/IntroSort.swift
[2] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. 2017. In-Place Parallel Super Scalar Samplesort (IPSSSSo). In 25th Annual European Symposium on Algorithms (ESA 2017) (Leibniz International Proceedings in Informatics (LIPIcs)), Kirk Pruhs and Christian Sohler (Eds.), Vol. 87. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 9:1–9:14. https://doi.org/10.4230/LIPIcs.ESA.2017.9
[3] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M Tamer Özsu. 2013. Multi-core, main-memory joins: Sort vs. hash revisited. Proceedings of the VLDB Endowment 7, 1 (2013), 85–96.
[4] Paul E. Black. 2019. Histogram Sort. (2019). https://www.nist.gov/dads/HTML/histogramSort.html
[5] Berenger Bramas. 2017. Fast sorting algorithms using AVX-512 on Intel Knights Landing. arXiv preprint arXiv:1704.08579 305 (2017), 315.
[6] Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. 2015. PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment 8, 12 (2015), 1518–1529.
[7] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2007. Introduction to algorithms (2 ed.). MIT Press, Cambridge, MA.
[8] Zbigniew J. Czech, George Havas, and Bohdan S. Majewski. 1992. An optimal algorithm for generating minimal perfect hash functions. Inform. Process. Lett. 43, 5 (1992), 257–264. https://doi.org/10.1016/0020-0190(92)90220-P
[9] T. Dachraoui and L. Narayanan. 1996. Fast deterministic sorting on large parallel machines. In Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing. IEEE, New Orleans, LA, 273–280. https://doi.org/10.1109/SPDP.1996.570344
[10] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, Friedhelm Meyer Auf Der Heide, Hans Rohnert, and Robert E Tarjan. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput. 23, 4 (1994), 738–761.
[11] Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, and David Lomet. 2019. ALEX: An Updatable Adaptive Learned Index. (2019). arXiv:cs.DB/1905.08898
[12] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. 1956. Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. Ann. Math. Statist. 27, 3 (09 1956), 642–669. https://doi.org/10.1214/aoms/1177728174
[13] Paul Embrechts and Marius Hofert. 2013. A note on generalized inverses. Mathematical Methods of Operations Research 77, 3 (2013), 423–432.
[14] Timothy Furtak, José Nelson Amaral, and Robert Niewiadomski. 2007. Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures. ACM, San Diego, CA, 348–357.
[15] Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1189–1206. https://doi.org/10.1145/3299869.3319860
[16] GNU. 2009. C++: STL sort. (2009). https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html
[17] Go. 2009. Source file src/sort/sort.go. (2009). https://golang.org/src/sort/sort.go
[18] Fuji Goro and Morwenn. 2019. Open-source C++ implementation of Timsort. (2019). https://github.com/gfx/cpp-TimSort
[19] Jim Gray, Chris Nyberg, Mehul Shah, and Naga Govindaraju. 2017. The SortBenchmark dataset. (2017). http://sortbenchmark.org/
[20] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-Based Frequency Estimation Algorithms. In International Conference on Learning Representations. ICLR, New Orleans, LA. https://openreview.net/forum?id=r1lohoCqY7
[21] Hiroshi Inoue and Kenjiro Taura. 2015. SIMD- and cache-friendly algorithm for sorting an array of structures. Proceedings of the VLDB Endowment 8, 11 (2015), 1274–1285.
[22] Intel. 2020. Intel Performance Primitive library for x86 architectures. (2020). http://software.intel.com/en-us/intel-ipp/
[23] Java. 2017. Java 9: List.sort. (2017). https://docs.oracle.com/javase/9/docs/api/java/util/List.html#sort-java.util.Comparator-
[24] Nathan Jay, Noga H. Rotman, P. Brighten Godfrey, Michael Schapira, and Aviv Tamar. 2018. Internet Congestion Control via Deep Reinforcement Learning. (2018). arXiv:cs.NI/1810.03259
[25] Johan Ludwig William Valdemar Jensen et al. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica 30 (1906), 175–193.
[26] Jie Jiang, Lixiong Zheng, Junfeng Pu, Xiong Cheng, Chongqing Zhao, Mark R Nutter, and Jeremy D Schaub. 2017. Tencent Sort. (2017).
[27] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2018. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. (2018). arXiv:cs.DB/1809.00677
[28] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p117-kraska-cidr19.pdf
[29] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489–504.
[30] Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196 (2018).
[31] Maciej Kurant, Minas Gjoka, Carter T. Butts, and Athina Markopoulou. 2011. Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks. In Proceedings of ACM SIGMETRICS '11. San Jose, CA.
[32] Bohdan S Majewski, Nicholas C Wormald, George Havas, and Zbigniew J Czech. 1996. A family of perfect hashing methods. Comput. J. 39, 6 (1996), 547–554.
[33] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM '19). ACM, New York, NY, USA, 270–288. https://doi.org/10.1145/3341302.3342080
[34] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer. Proc. VLDB Endow. 12, 11 (July 2019), 1705–1718. https://doi.org/10.14778/3342263.3342644
[35] Ryan Marcus and Olga Papaemmanouil. 2018. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. ACM, 3.
[36] Peter McIlroy. 1993. Optimistic sorting and information theoretic complexity. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 467–474.
[37] MongoDB. 2018. MongoDB: sorter.cpp. (2018). https://github.com/mongodb/mongo/blob/master/src/mongo/db/sorter/sorter.cpp
[38] David R Musser. 1997. Introspective sorting and selection algorithms. Software: Practice and Experience 27, 8 (1997), 983–993.
[39] MySQL. 2000. MySQL: filesort.cc. (2000). https://github.com/mysql/mysql-server/blob/8.0/sql/filesort.cc
[40] Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. 2019. Theoretically-Efficient and Practical Parallel In-Place Radix Sorting. In The 31st ACM on Symposium on Parallelism in Algorithms and Architectures. ACM, 213–224.
[41] OpenAddresses. 2020. The OpenAddresses - Northeast dataset. (2020). https://data.openaddresses.io/openaddr-collected-us_northeast.zip
[42] OpenStreetMap contributors. 2017. Planet dump retrieved from https://planet.osm.org. (2017). https://www.openstreetmap.org
[43] Emanuel Parzen. 1962. On estimation of a probability density function and mode. The annals of mathematical statistics 33, 3 (1962), 1065–1076.
[44] Orson R. L. Peters. 2020. The Pattern-Defeating Quicksort Algorithm. (2020). https://github.com/orlp/pdqsort
[45] Tim Peters. 2002. Python: list.sort. (2002). https://github.com/python/cpython/blob/master/Objects/listsort.txt
[46] Postgres. 1996. Postgres: tuplesort.c. (1996). https://github.com/postgres/postgres/blob/master/src/backend/utils/sort/tuplesort.c
[47] Murray Rosenblatt. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics (1956), 832–837.
[48] Peter Sanders and Sebastian Winkel. 2004. Super scalar sample sort. In European Symposium on Algorithms. Springer, 784–796.
[49] Michael Axtmann and Sascha Witt. 2020. Open-source C++ implementation of the IPS4o algorithm. (2020). https://github.com/SaschaWitt/ips4o
[50] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D Nguyen, Victor W Lee, Daehyun Kim, and Pradeep Dubey. 2010. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 351–362.
[51] Andrew Schein. 2009. Open-source C++ implementation of Radix Sort for double-precision floating points. (2009). https://bitbucket.org/ais/usort/src/474cc2a19224/usort/f8_sort.c
[52] SQLite. 2011. SQLite: vdbesort.c. (2011). https://github.com/mackyle/sqlite/blob/master/src/vdbesort.c
[53] Simon Steele and Marius Žilénas. 2020. 479k English words for all your dictionary. (2020). https://github.com/dwyl/english-words
[54] Michal Turčaník and Martin Javurek. 2016. Hash function generation by neural network. In 2016 New Trends in Signal Processing (NTSP). IEEE, 1–5.
[55] Peter Van Sandt, Yannis Chronis, and Jignesh M Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proceedings of the 2019 International Conference on Management of Data. ACM, 36–53.
[56] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927 (2014).
[57] Jianfeng Wang, Jingdong Wang, Nenghai Yu, and Shipeng Li. 2013. Order preserving hashing for approximate nearest neighbor search. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 133–142.
[58] Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. Proceedings of the VLDB Endowment 12, 3 (2018), 210–222.
