Figure 4: Cache-optimized Learned Sort: First the input is partitioned into f fixed-capacity buckets (here f = 2) and the input keys are shuffled into these buckets based on the CDF model's predictions. If a bucket gets full, the overflowing items are placed into a spill bucket S. Afterwards, each bucket is split again into f smaller buckets and the process repeats until the bucket capacity reaches a threshold t (here t = 6). Then, each bucket is sorted using a CDF model-based counting sort-style subroutine (Step 2). The next step corrects any sorting mistakes using Insertion Sort (Step 3). Finally, we sort the spill bucket S, merge it with B, and return the sorted array (Step 4).
size (e.g., 2^32 for 32-bit keys), the run-time of our Learned Sort algorithm is almost identical to Radix Sort. However, when the number of elements is much smaller than the key domain size, Learned Sort starts to significantly outperform even optimized Radix Sort implementations as well as other comparison-based sorts. The reason is that, with every pass over the data, our learned model can extract more information than Radix Sort about where a key should go in the final output, thus overcoming the core challenge of Radix Sort: that its run-time heavily depends on the key domain size.²

² Obviously, the model itself still depends on the key domain size, as discussed later in more detail.

The basic idea of our algorithm is visualized in Figure 4:

• We organize the input array into logical buckets. That is, instead of predicting an exact position, the model only has to predict a bucket index for each element, which reduces the number of collisions as explained earlier.
• Step 1: For cache efficiency, we start with a few large buckets and recursively split them into smaller ones. By carefully choosing the fan-out (f) per iteration, we can ensure that at least one cache line per bucket fits into the cache, hence transforming the memory access pattern into a more sequential one. This recursion repeats until the buckets become as small as a preset threshold t. Section 3.1 explains how f and t should be set based on the CPU cache size.
• Step 2: When the buckets reach capacity t, we use the CDF model to predict the exact position of each element within its bucket.
• Step 3: Afterwards, we take the now sorted buckets and merge them into one sorted array. If we use a non-monotonic model, we also correct any sorting mistakes using Insertion Sort.
• Step 4: The buckets are of fixed capacity, which minimizes the cost of dynamic memory allocation. However, if a bucket becomes full, the additional keys are placed into a separate spill bucket array (see the "S" bucket in Figure 4). As a last step, the spill bucket has to be sorted and merged. The overhead of this operation is low as long as the model is capable of evenly distributing the keys to the buckets.

Algorithm 2 shows the pseudocode of Learned Sort. The algorithm requires an input array of keys (A), a CDF model trained on a sample of this array (F_A), a fan-out factor (f) that determines the ratio of new buckets in each iteration, and a threshold (t) that decides when to stop the bucketization, such that every bucket fits into the cache.

Algorithm 2: The Learned Sort algorithm
Input: A - the array to be sorted
Input: F_A - the CDF model for the distribution of A
Input: f - fan-out of the algorithm
Input: t - threshold for bucket size
Output: A' - the sorted version of array A
 1: procedure Learned-Sort(A, F_A, f, t)
 2:   N ← |A|                       ▷ Size of the input array
 3:   n ← f                         ▷ n represents the number of buckets
 4:   b ← ⌊N/f⌋                     ▷ b represents the bucket capacity
 5:   B ← [] × N                    ▷ Empty array of size N
 6:   I ← [0] × n                   ▷ Records bucket sizes
 7:   S ← []                        ▷ Spill bucket
 8:   read_arr ← pointer to A
 9:   write_arr ← pointer to B
10:   // Stage 1: Model-based bucketization
11:   while b ≥ t do                ▷ Until bucket capacity reaches the threshold t
12:     I ← [0] × n                 ▷ Reset array I
13:     for x ∈ read_arr do
14:       pos ← ⌊Infer(F_A, x) · n⌋
15:       if I[pos] ≥ b then        ▷ Bucket is full
16:         S.append(x)             ▷ Add to spill bucket
17:       else                      ▷ Write into the predicted bucket
18:         write_arr[pos · b + I[pos]] ← x
19:         Increment I[pos]
20:     b ← ⌊b/f⌋                   ▷ Update bucket capacity
21:     n ← ⌊N/b⌋                   ▷ Update the number of buckets
22:     PtrSwp(read_arr, write_arr) ▷ Pointer swap to reuse memory
23:   // Stage 2: In-bucket reordering
24:   offset ← 0
25:   for i ← 0 up to n do          ▷ Process each bucket
26:     K ← [0] × b                 ▷ Array of counts
27:     for j ← 0 up to I[i] do     ▷ Record the counts of the predicted positions
28:       pos ← ⌊Infer(F_A, read_arr[offset + j]) · N⌋
29:       Increment K[pos − offset]
30:     for j ← 1 up to |K| do      ▷ Calculate the running total
31:       K[j] ← K[j] + K[j − 1]
32:     T ← []                      ▷ Temporary auxiliary memory
33:     for j ← 0 up to I[i] do     ▷ Order keys w.r.t. the cumulative counts
34:       pos ← ⌊Infer(F_A, read_arr[offset + j]) · N⌋
35:       T[j] ← read_arr[offset + K[pos − offset]]
36:       Decrement K[pos − offset]
37:     Copy T back to read_arr[offset]
38:     offset ← offset + b
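For concreteness, the following C++ sketch mirrors Stages 1 and 2 of Algorithm 2 under simplifying assumptions: `cdf` stands in for Infer(F_A, ·) and is assumed to return a value in [0, 1], the per-round pointer swap of Algorithm 2 is replaced by a fresh allocation, and all type and function names are invented for the example. It illustrates the control flow and is not the authors' implementation.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-in for Infer(F_A, .); assumed to return a value in [0, 1].
using Cdf = std::function<double(double)>;

struct Buckets {
  std::vector<double> keys;        // n fixed-capacity buckets, laid out contiguously
  std::vector<std::size_t> fill;   // I: current occupancy of each bucket
  std::size_t n = 0, b = 0;        // number of buckets and capacity per bucket
};

// Stage 1: model-based bucketization. Assumes N >= f >= 2 and t >= 1.
Buckets model_bucketize(const std::vector<double>& input, const Cdf& cdf,
                        std::size_t f, std::size_t t, std::vector<double>& spill) {
  const std::size_t N = input.size();
  Buckets cur{std::vector<double>(N), std::vector<std::size_t>(f, 0), f, N / f};
  auto place = [&](Buckets& dst, double x) {
    double c = cdf(x);
    if (c < 0.0) c = 0.0;
    std::size_t pos = static_cast<std::size_t>(c * dst.n);   // predicted bucket
    if (pos >= dst.n) pos = dst.n - 1;
    if (dst.fill[pos] >= dst.b) spill.push_back(x);           // bucket full -> spill bucket S
    else dst.keys[pos * dst.b + dst.fill[pos]++] = x;         // bucket start + write offset
  };
  for (double x : input) place(cur, x);                       // first round over the raw input
  while (cur.b / f >= t) {                                    // keep splitting while sub-buckets stay >= t
    std::size_t b2 = cur.b / f, n2 = N / b2;
    Buckets next{std::vector<double>(N), std::vector<std::size_t>(n2, 0), n2, b2};
    for (std::size_t i = 0; i < cur.n; ++i)                   // re-scatter occupied entries only
      for (std::size_t j = 0; j < cur.fill[i]; ++j)
        place(next, cur.keys[i * cur.b + j]);
    cur = std::move(next);
  }
  return cur;
}

// Stage 2: within each bucket, place keys at their exact predicted rank,
// Counting Sort style (counts, running totals, stable placement from the back).
std::vector<double> reorder_buckets(const Buckets& bk, const Cdf& cdf, std::size_t N) {
  std::vector<double> out;
  out.reserve(N);
  for (std::size_t i = 0; i < bk.n; ++i) {
    const std::size_t base = i * bk.b, cnt = bk.fill[i];
    if (cnt == 0) continue;
    auto slot = [&](double x) {                               // predicted rank relative to the bucket
      std::size_t pos = static_cast<std::size_t>(cdf(x) * N);
      std::size_t s = pos > base ? pos - base : 0;
      return s < cnt ? s : cnt - 1;                           // clamp model error into the bucket
    };
    std::vector<std::size_t> K(cnt, 0);
    std::vector<double> T(cnt);
    for (std::size_t j = 0; j < cnt; ++j) ++K[slot(bk.keys[base + j])];
    for (std::size_t j = 1; j < cnt; ++j) K[j] += K[j - 1];   // running totals
    for (std::size_t j = cnt; j-- > 0;) {
      std::size_t s = slot(bk.keys[base + j]);
      T[--K[s]] = bk.keys[base + j];
    }
    out.insert(out.end(), T.begin(), T.end());
  }
  return out;                                                 // nearly sorted; Steps 3 and 4 finish the job
}
```

For normally distributed keys one could, for instance, pass the closed-form normal CDF, 0.5 · erfc(−x/√2), as `cdf`. Steps 3 and 4 then reduce to an insertion-sort touch-up pass and a sort-and-merge of the spill bucket.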
Step 1: The algorithm starts by allocating a linear array B of the same size as the input A (Line 5). This array is logically partitioned into n buckets, each of fixed capacity b (Lines 3-4). We record the bucket sizes (i.e., how many elements are currently in each bucket) in an integer array I, which has the same size as the current number of buckets (n). Then, the algorithm shuffles each key into the buckets by using the model F_A to predict its empirical CDF value and scaling it out to the current number of buckets in that round (Line 14). If the predicted bucket (at index pos) has reached its capacity, the key is placed in the spill bucket S; otherwise, the key is inserted into the bucket (Lines 15-19). Here, we calculate the bucket start offset as pos · b and the write offset within the bucket as I[pos]. After one iteration, each bucket is logically split further into f smaller buckets (Lines 20-21), until the buckets are smaller than threshold t (Line 11). Note that, in order to preserve memory, we reuse the arrays A and B by simply swapping the read and write pointers (Line 22) and updating the bucket-splitting parameters (Lines 20-21).
Step 2: When the bucket capacity reaches t, we switch to a model-based Counting Sort-style routine (Lines 23-38) to map the items to their final positions. We again do that using the model, which now predicts the exact index position, not the bucket. That is, we first calculate the final position of every key (Line 28) and store in array K the count of keys that are mapped to each predicted index (Line 29). The array K is then transformed into a running total (Line 31). Finally, we place the items into their final positions using the cumulative counts (Lines 32-38), which is similar to the Counting Sort routine [7, pp. 168-170]. As we only sort one bucket at a time and want to keep the array size of K small, we use an offset to set the start index of the bucket in Lines 28-36.

We switch to the model-based Counting Sort for two reasons. First, and most importantly, it helps improve the overall sorting time, as we are able to fully utilize our model's precision for fine-level predictions. Second, it helps reduce the number of overflows (see Section 3.1.2 for more details).

Step 3: After the last sorting stage we remove any empty space and, for non-monotonic models, correct any potential mistakes with Insertion Sort (Line 40).

Step 4: Finally, because we used a spill bucket (S) for the overflowing elements in Stage 1, we have to sort it and merge it with the sorted buckets before returning (Lines 42-43). Provided that the sorting algorithm for the spill bucket is stable, Learned Sort also maintains the stability property.
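These last two steps map naturally onto standard-library primitives. The sketch below is again only illustrative: it repairs a nearly sorted array with an insertion-sort pass and then folds in the sorted spill bucket; std::stable_sort and std::inplace_merge are convenient stand-ins here, not choices prescribed by the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Step 3: Insertion Sort is cheap here because the keys are already nearly sorted.
void insertion_sort_touch_up(std::vector<double>& a) {
  for (std::size_t i = 1; i < a.size(); ++i) {
    double key = a[i];
    std::size_t j = i;
    while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
    a[j] = key;
  }
}

// Step 4: sort the spill bucket S and merge it with the already sorted keys.
std::vector<double> merge_spill(std::vector<double> sorted, std::vector<double> spill) {
  std::stable_sort(spill.begin(), spill.end());   // a stable spill sort keeps Learned Sort stable
  const std::size_t mid = sorted.size();
  sorted.insert(sorted.end(), spill.begin(), spill.end());
  std::inplace_merge(sorted.begin(), sorted.begin() + static_cast<std::ptrdiff_t>(mid),
                     sorted.end());
  return sorted;
}
```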
2.3 Implementation optimizations
The pseudocode in Algorithm 2 gives an abstract view of the
to keep the overall sorting time low. Thus, models such as KDE [43, 47], neural networks, or even perfect order-preserving hash functions are usually too expensive to train or execute for our purposes. One might think that histograms would be an interesting alternative, and indeed histogram-based sorting algorithms have been proposed in the past [7, pp. 168-177]. Unfortunately, histograms have the problem that they are either too coarse-grained, making any prediction very inaccurate, or too fine-grained, which increases the time to navigate the histogram itself (see also Section 6.8).

Certainly many model architectures could be used; however, for this paper we use the recursive model index (RMI) architecture as proposed in [29] (shown in Figure 5). RMIs contain simple linear models which are organized into a layered structure, acting like a mixture of experts [29].

Figure 5: A typical RMI architecture containing three layers.
2.4.1 Inference. Algorithm 3 shows the inference procedure for an RMI architecture. During inference, each layer of the model takes the key as an input and linearly transforms it to obtain a value, which is used as an index to pick a model in the next layer (Line 6). The intermediate models' slope and intercept terms are already scaled out to the number of models in the next layer, hence avoiding additional multiplications at inference time, whereas the last layer returns a CDF value between 0 and 1. Note that the inference can be extremely fast because the procedure uses simple data dependencies instead of control dependencies (i.e., if-statements), consequently making it easier for the optimizer to perform loop unrolling and even vectorization. Hence, for each layer, the inference requires only one addition, one multiplication, and one array look-up to read the model parameters [29].
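Algorithm 3 itself is not reproduced here; as a stand-in, the following hedged sketch evaluates a two-layer RMI whose root model already has its slope and intercept pre-scaled to the number of leaf models, as described above. The struct layout and field names are assumptions made for the example, not the paper's code.

```cpp
#include <cstddef>
#include <vector>

struct LinearModel { double slope, intercept; };

// Two-layer RMI inference: the root's output (pre-scaled to |leaves|) selects a
// leaf model, whose output is the CDF estimate. One multiply, one add, and one
// array lookup per layer. Precondition: `leaves` is non-empty.
double rmi_infer(const LinearModel& root,
                 const std::vector<LinearModel>& leaves, double key) {
  double idx = root.slope * key + root.intercept;        // already scaled to |leaves|
  if (idx < 0.0) idx = 0.0;                              // clamp out-of-range predictions
  double max_idx = static_cast<double>(leaves.size() - 1);
  if (idx > max_idx) idx = max_idx;
  const LinearModel& leaf = leaves[static_cast<std::size_t>(idx)];
  double cdf = leaf.slope * key + leaf.intercept;
  return cdf < 0.0 ? 0.0 : (cdf > 1.0 ? 1.0 : cdf);      // clamp to a valid CDF value
}
```

The clamping branches are only for the sketch's safety; as the text notes, a tuned implementation keeps the hot path free of control dependencies on the key.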
2.4.2 Training Procedure. Algorithm 4 shows the training procedure, which can be performed on a small sample of the input array. The algorithm starts by selecting a sample and sorting it using an efficient deterministic sorting algorithm, e.g., std::sort (Lines 2-3), creating a 3D array to represent a tree structure of training sets, and inserting all <key, CDF> pairs into the top node T[0][0]. Here, the empirical CDF for each training tuple is calculated as its index in the sorted sample over the sample size (i/|S|). Starting at the root layer, the algorithm trains linear models, working its way top-down.

Algorithm 4: The training procedure for the CDF model
Input: A - the input array
Input: L - the number of layers of the CDF model
Input: M_l - the number of linear models in the l-th layer of the CDF model
Output: F_A - the trained CDF model with RMI architecture
 1: procedure Train(A, L, M)
 2:   S ← Sample(A)
 3:   Sort(S)
 4:   T ← [][][]                    ▷ Training sets implemented as a 3D array
 5:   for i ← 0 up to |S| do
 6:     T[0][0].add((S[i], i/|S|))
 7:   for l ← 0 up to L do
 8:     for m ← 0 up to M_l do
 9:       F_A[l][m] ← linear model trained on the set {t | t ∈ T[l][m]}
10:       if l + 1 < L then
11:         for t ∈ T[l][m] do
12:           F_A[l][m].slope ← F_A[l][m].slope · M_{l+1}
13:           F_A[l][m].intercept ← F_A[l][m].intercept · M_{l+1}
14:           i ← F_A[l][m].slope · t + F_A[l][m].intercept
15:           T[l+1][i].add(t)
16:   return F_A

The CDF model F_A can be implemented as a 2D array, where F_A[l][r] refers to the r-th model in the l-th layer of the RMI. For the root model, the algorithm uses the entire sample as a training set to calculate a slope and an intercept term (Line 9). After it has been trained, for each of the training tuples, the root model predicts a CDF value and scales it by M_{l+1} (the number of models in the next layer) (Lines 12-13). Then, it distributes these tuples into multiple training subsets that will be used to train each of the linear models in the subsequent layer. Each tuple goes to a training subset at index i, which is calculated in Line 14 by using the slope and intercept terms of the parent model. This partitioning process continues until the second-to-last layer of the RMI, and each of the newly created training subsets is used to train the corresponding linear model in a recursive fashion.
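To make the recursion concrete, here is a hedged two-layer training sketch: it fits each linear model with the closed-form least-squares solution over <key, empirical-CDF> pairs, pre-scales the root's output to the number of leaf models, and partitions the sample accordingly. Function and variable names are invented for the example; Algorithm 4 remains the authoritative version.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct LinearModel { double slope = 0.0, intercept = 0.0; };

// Ordinary least squares over (key, cdf) pairs; falls back to a constant model
// when the subset is empty or all keys in it are identical.
LinearModel fit_linear(const std::vector<std::pair<double, double>>& pts) {
  double sx = 0, sy = 0, sxx = 0, sxy = 0, n = static_cast<double>(pts.size());
  for (auto [x, y] : pts) { sx += x; sy += y; sxx += x * x; sxy += x * y; }
  double denom = n * sxx - sx * sx;
  LinearModel m;
  if (pts.empty() || denom == 0.0) { m.intercept = pts.empty() ? 0.0 : sy / n; return m; }
  m.slope = (n * sxy - sx * sy) / denom;
  m.intercept = (sy - m.slope * sx) / n;
  return m;
}

// Train a two-layer RMI on a sample: one root model plus `fanout` leaf models.
// Precondition: fanout >= 1.
std::pair<LinearModel, std::vector<LinearModel>>
train_rmi(std::vector<double> sample, std::size_t fanout) {
  std::sort(sample.begin(), sample.end());
  std::vector<std::pair<double, double>> root_set;
  for (std::size_t i = 0; i < sample.size(); ++i)
    root_set.push_back({sample[i], double(i) / sample.size()});  // empirical CDF targets

  LinearModel root = fit_linear(root_set);
  root.slope *= fanout;             // pre-scale so the root directly yields a leaf index
  root.intercept *= fanout;

  std::vector<std::vector<std::pair<double, double>>> subsets(fanout);
  for (auto [x, y] : root_set) {
    double idx = root.slope * x + root.intercept;
    std::size_t i = idx <= 0 ? 0
                             : std::min<std::size_t>(static_cast<std::size_t>(idx), fanout - 1);
    subsets[i].push_back({x, y});   // distribute tuples to the next layer's training sets
  }
  std::vector<LinearModel> leaves(fanout);
  for (std::size_t i = 0; i < fanout; ++i) leaves[i] = fit_linear(subsets[i]);
  return {root, leaves};
}
```

The convention matches the inference description above: the pre-scaled root picks a leaf, and the leaf's unscaled output is interpreted as the CDF estimate.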
2.4.3 Training of the individual linear models. One way to train the linear models is using the closed form of univariate linear regression with an MSE loss function. However, when using linear regression training, it is possible that two neighboring linear models in the last layer of the CDF model predict values in overlapping ranges. Hence,
complexity is O(s log s) + O(N) (O(N) for the merging step). Then again, a good model will not permit the size of the spill bucket to inflate (empirically ≤ 5%), which makes this step fairly insignificant in practice (see also Figure 13).

The space complexity of Learned Sort is on the order of O(N), given that the algorithm uses auxiliary memory for the buckets, which is linearly dependent on the input size. The in-place version introduced in Section 4.1, however, only uses a memory buffer that is independent of the input size, therefore accounting for an O(1) space complexity.

4 EXTENSIONS
So far we have only discussed the Learned Sort implementation that is not in-place and does not handle strings or other complex objects. In this section, we provide an overview of how we can extend the Learned Sort algorithm to address these shortcomings, whereas in the Evaluation section we show experiments for an in-place version as well as early results for using Learned Sort over strings.

4.1 Making Learned Sort in-place
The current algorithm's space complexity is linear with respect to the input size due to the requirements of the mapping stage of the algorithm. However, there is a way of making the algorithm in-place (i.e., have a constant memory requirement that is independent of the input size). The in-place version of the mapping stage would group the input keys into blocks such that all the elements in each block belong to the same bucket. The algorithm maintains a small buffer equal to the block size for every bucket. It iterates over the unsorted array and maps the elements into their respective buffers, and whenever a buffer fills up it is written onto the already scanned section of the array. These blocks are then permuted so that blocks from the same bucket are stored contiguously. Note that this type of approach is very common for designing in-place algorithms, such as in [2], and we show results in Section 6.6.

4.2 Learning to sort strings
The primary focus so far has been on sorting numerical values, and extending our CDF model to strings creates a number of unique challenges. While we are still investigating how to best handle strings, and much existing work on ML-enhanced data structures and algorithms so far only considers numeric values [11, 29, 55], we outline an early implementation we did for strings, which also has very compelling performance (see Section 6).

Our string model has an RMI architecture, but represents strings as input feature vectors x ∈ R^n, where n is the length of the string. For simplicity, we can work with fixed-length strings by padding shorter sequences with null characters. Then, one way to train the CDF model is to fit multivariate linear regression models (w·x + b) over the feature vectors in each layer of the model. However, this strategy is computationally expensive, as it would take O(n) operations at every layer of the model. As a workaround, we could limit the characters considered by the model; however, that might lead to non-monotonic predictions. If we consider C_1, C_2, ..., C_n to be the ASCII values of the characters in the string, then we can obtain a monotonic encoding of strings by calculating C_1/256 + C_2/256^2 + ... + C_n/256^n. This value is bound to be between zero and one, is monotonic, and can potentially be used as a CDF value. This prediction would have been accurate if the ASCII value of each character were uniformly distributed and independent of the values of the other characters in the string. This is not always the case, so we transform the encodings to make their distribution uniform.
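A small sketch of this base-256 encoding is shown below. The fixed maximum prefix length is an implementation choice made for the example, not something the paper fixes.

```cpp
#include <cstddef>
#include <string>

// Monotonic string encoding: C_1/256 + C_2/256^2 + ... + C_n/256^n.
// Lexicographically larger strings always map to larger values in [0, 1).
double encode_string(const std::string& s, std::size_t max_len = 8) {
  double value = 0.0;
  double scale = 1.0 / 256.0;
  for (std::size_t i = 0; i < max_len; ++i) {
    unsigned char c = (i < s.size()) ? static_cast<unsigned char>(s[i]) : 0;  // pad with NUL
    value += c * scale;
    scale /= 256.0;
  }
  return value;
}
```

Truncating at max_len keeps the encoding cheap; strings that agree on the first max_len bytes map to the same value and are only separated later, e.g., by the final std::sort pass mentioned in the next paragraph.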
In the training phase, we take a sample from the array, encode the strings using their ASCII values, and use them to map the strings into the buckets. If the bucket sizes are uneven, we readjust the encoding ranges falling into these buckets by making a linear transformation of the slope and intercept terms of the respective models. Then we re-map the strings into another layer of buckets after this linear transformation. This re-mapping step continues until we obtain evenly distributed buckets. Similar to the numeric algorithm, we split the array into finer buckets until a threshold size, after which point we use std::sort. Some promising preliminary results on this approach are shown in Section 6.4.

4.3 Duplicates
The number of duplicates (i.e., repeated keys) is a factor that affects the run-time behavior of our algorithm, as our model will always assign the same bucket to the key, which, per definition, increases the number of collisions and the number of keys placed into the spill bucket. Consequently, the spill bucket inflates and Stage 4 of the algorithm takes longer to execute, since it relies on a slower algorithm.

As a remedy, we incorporated a fast heuristic in our algorithm that detects repeated keys at training time. While building the CDF model, the algorithm looks at the frequencies at which equal keys appear in the training sample and, if a frequency is above a certain threshold, it adds these keys to an exception list. Then, at sorting time, if the algorithm comes across an element whose key is in the exception list, it skips the bucket insertion step and only merges the repeated keys at the end of the procedure. However, in the absence of duplicates, we found that this additional step only introduces a small performance overhead (<3%), which is a tolerable cost for the average case.
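The sketch below illustrates one way such an exception list could work; the threshold, data structures, and merge strategy are assumptions made for the example, not the paper's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Training time: collect keys that appear unusually often in the sample.
std::unordered_set<double> find_heavy_keys(const std::vector<double>& sample,
                                           double freq_threshold = 0.01) {
  std::unordered_map<double, std::size_t> counts;
  for (double x : sample) ++counts[x];
  std::unordered_set<double> heavy;
  for (const auto& [key, c] : counts)
    if (c >= freq_threshold * sample.size()) heavy.insert(key);
  return heavy;
}

// Sorting time: keys on the exception list bypass bucketization entirely and
// are merged back into the otherwise-sorted output at the end.
std::vector<double> split_out_heavy(std::vector<double>& keys,
                                    const std::unordered_set<double>& heavy) {
  auto it = std::stable_partition(keys.begin(), keys.end(),
                                  [&](double x) { return heavy.count(x) == 0; });
  std::vector<double> held_out(it, keys.end());
  keys.erase(it, keys.end());
  return held_out;   // sort separately (cheap: few distinct values) and merge at the end
}
```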
5 RELATED WORK
Sorting algorithms are generally classified as comparison-based or distribution-based, depending on whether they rely only on pairwise key comparisons to come up with a sorted order, or rather make some assumptions or estimations about the distribution of the keys.

Comparison sorts: Some of the most common comparison sorts are Quicksort, Mergesort, and Insertion Sort. While they all have a lower bound of Ω(N log N) comparisons in the average case, their performance in practice also depends largely on factors such as memory access patterns, which dictate their cache efficiency.

The GNU Standard Template Library in C++ employs Introsort [38] as its default sorting function (std::sort) [16], which combines the speed of Quicksort in the average case with the optimal worst case of Heapsort and the efficiency of Insertion Sort for small arrays. Introsort was also adopted by the standard library of the Go language [17] and that of the Swift language until version 5 [1].

Samplesort is another variant of Quicksort that uses multiple pivots selected from a sample of the input array (note that this does NOT create a histogram over the data). Thus, the elements are arranged into partitions in a finer-grained way that enables early termination of the recursion when a certain partition contains only identical keys. One of the most recent and efficient implementations of Samplesort is the Super Scalar Samplesort, initially introduced in [48] and later presented again with an in-place and improved implementation in [2] (IPS4o). The strongest points of this implementation are the use of a binary search tree to efficiently discover the right partition for each key and the avoidance of conditional branches with the help of conditional instructions provided in modern processors.

Java's List.sort() function [23] and Python's built-in sorting function [45] use a hybrid of Mergesort and Insertion Sort, called Timsort [36]. Timsort combines Insertion Sort's ability to benefit from the input's pre-sortedness with the stability of Mergesort, and it is said to work well on real-world data that contain intrinsic patterns. The procedure starts by scanning the input to find pre-sorted key sub-sequences and proceeds to merge them onto the final sorted output.

It should be noted that most open-source DB systems implement their sorting routines building upon Quicksort or Mergesort, depending on whether the data fits in memory or the ordering needs to be stable [37, 39, 46, 52].

Distribution sorts comprise the other major group of sorting procedures, and they include algorithms like Radix Sort, Counting Sort, and Histogram Sort. Radix Sort is arguably the most commonly used one, and it works by calling the Counting Sort subroutine for all the keys, scanning a given number of digits at a time. The Counting Sort subroutine calculates a count histogram of all the keys based on the selected digits, transforms that into a cumulative histogram by generating running totals, and then re-orders the keys back into a sorted order based on these calculated counts. Radix Sort's time complexity is O(d · (N + r)), where d is the number of passes over the data (i.e., the number of digits divided by the radix size), N is the input size, and r is the range of each key (i.e., 10 raised to the power of the radix size). Note that, in order to use Radix Sort with IEEE-754 floating point numbers, it is first necessary to shift and mask the bit representation. While Radix Sort is highly sensitive to the key length, which dictates the number of passes, it is nevertheless a very efficient sorting algorithm for numerical types that is well-suited for multi-core procedures [6, 22, 40] and SIMD vectorization [50].
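For reference, a compact LSD Radix Sort over 32-bit unsigned keys with an 8-bit radix is sketched below: each pass is exactly the counting-sort subroutine described above (count, prefix-sum, re-order), and d = 4 passes cover the full key width regardless of how many of those bits actually vary. It is an illustrative sketch, not any particular library's implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD Radix Sort with an 8-bit radix: d = 4 counting-sort passes over 32-bit keys.
void radix_sort_u32(std::vector<uint32_t>& keys) {
  std::vector<uint32_t> buf(keys.size());
  for (int pass = 0; pass < 4; ++pass) {
    const int shift = 8 * pass;
    std::array<std::size_t, 256> count{};                 // histogram of the current byte
    for (uint32_t k : keys) ++count[(k >> shift) & 0xFF];
    std::size_t total = 0;
    for (auto& c : count) { std::size_t tmp = c; c = total; total += tmp; }  // exclusive prefix sums
    for (uint32_t k : keys) buf[count[(k >> shift) & 0xFF]++] = k;           // stable re-order
    keys.swap(buf);
  }
}
```

Learned Sort's advantage in the experiments comes precisely from sidestepping this dependence on the full key width: its model-guided passes extract more ordering information per pass over the data.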
Most related to Learned Sort is Histogram Sort [4]. However, Histogram Sort implicitly assumes a uniform distribution for the input data, as it allocates n variable-sized buckets and maps each key x into a bucket B_i by calculating i = n · (x − x_min)/(x_max − x_min). It then sorts these buckets using Insertion Sort and merges them in order.
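The contrast with the learned approach fits in a few lines: Histogram Sort's bucket index is a single global linear map over the key range, while Learned Sort replaces that map with a fitted CDF model. The snippet below is illustrative only; `cdf` is a stand-in for the trained model.

```cpp
#include <cstddef>
#include <functional>

// Histogram Sort: assumes keys are uniform over [min, max].
std::size_t histogram_bucket(double x, double min, double max, std::size_t n) {
  if (max <= min) return 0;                      // degenerate range
  std::size_t i = static_cast<std::size_t>(n * (x - min) / (max - min));
  return i < n ? i : n - 1;                      // x == max lands in the last bucket
}

// Learned Sort: the same role is played by the empirical CDF model.
std::size_t learned_bucket(double x, const std::function<double(double)>& cdf,
                           std::size_t n) {
  std::size_t i = static_cast<std::size_t>(cdf(x) * n);
  return i < n ? i : n - 1;
}
```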
SIMD optimization: There has been a lot of work on enhancing traditional sorting implementations with data parallelism in SIMD-enabled CPUs [5, 14, 26], as well as on the use of adaptive and cache-sensitive partitioning techniques for multi-core or multi-processor implementations [3, 6, 9, 21, 50]. Nevertheless, there has not been much recent innovation in the algorithmic space for sorting, and we found that IS4o, one of our baselines, is one of the most competitive openly available implementations.

Hashing functions: In a way, the CDF model might be regarded as an order-preserving hash function for the input keys, such as [8, 32]. However, order-preserving hashing is unsuitable for sorting since it does not provide fast enough training and inference times, and, to the best of our knowledge, there does not exist any sorting algorithm that uses order-preserving hashing. Similarly, locality-sensitive hashing [54, 56, 57] can also not be used for sorting a single numeric value, as we are concerned with sorting a single dimension rather than efficiently finding similar items in a multi-dimensional space. Finally, perfect hashing's objective is to avoid element collisions, which would initially seem an interesting choice for Learned Sort. However, perfect hash functions grow in size with the input data, are not fast to train, and, most importantly, are usually not order-preserving [10].

ML-enhanced algorithms: There has been growing interest in the use of Machine Learning techniques to speed up traditional algorithms in the context of systems. Most notably, the work on Learned Index Structures [29] introduces the use of an RMI structure to substitute the traditional
Figure 9: The sorting rate of Learned Sort and other baselines for real and synthetic datasets containing both doubles and integers: (A) synthetic 64-bit floating points; (B) real/benchmark 64-bit floating points (high precision); (C) real/benchmark 64-bit floating points (low precision); (D) synthetic & real 32-bit integers. The pictures below the charts visualize the key distributions and the dataset sizes.
6.2 Overall Performance
As a first experiment, we measured the sorting rate for array sizes varying from 1 million up to 1 billion double-precision keys following a standard normal distribution, and compared it to the baseline algorithms that we selected as described in Section 6.1. The sorting rate (bytes per second) is shown in Figure 8 for Learned Sort and our main baselines, in addition to SageDB::sort [28]. As can be seen, Learned Sort achieves an average of 30% higher throughput than the next best algorithm (IS4o) and 55% higher as compared to Radix Sort for larger data sizes. However, when the data fits into the L3 cache, as is the case with 1 million keys (roughly 8MB), Radix Sort is almost as fast as Learned Sort. However, as soon as the data does not fit into the L3 cache, the sorting rate of Learned Sort is significantly higher than that of Radix Sort or IS4o. Furthermore, Learned Sort's cache optimization enables it to maintain a good sorting throughput even for sizes up to 8GB.

6.3 Sorting rate
To better understand the behaviour of our algorithm, we compared Learned Sort against our other baselines on (A) synthetic data with 64-bit doubles generated from different distributions, (B) high-precision real-world/benchmark data with 64-bit doubles, which have at least 10 significant digits, (C) low-precision real-world/benchmark data with 64-bit doubles, with reduced floating point precision, and (D) synthetic and real-world data with 32-bit integers.

Figure 9A shows that Learned Sort consistently outperforms Radix Sort by an average of 48%, IS4o by an average
fact that it is smaller, and incurs more cache benefits.

On the other hand, we observed that Radix Sort's performance improves when the input keys have less precision (Figure 9C). In this case, Learned Sort and all the other baseline algorithms remain unaffected, while Radix Sort gets a 34% performance boost. This improvement results from the fact that everything can be sorted on the most significant bits. However, it is necessary to note that Radix Sort's performance still does not surpass that of Learned Sort.

Finally, in Figure 9D we show that for integers Learned Sort has an even higher throughput and, in some cases, an even bigger benefit. For example, on the synthetic integer dataset it is 38% better than IS4o and twice as fast as Radix Sort. For the FB dataset and the OSM IDs, the performance difference compared to IS4o is smaller because of the particular distribution of values and duplicates (see Section 4.3).

6.4 Sorting Strings
Figure 10 shows the preliminary sorting rate for strings for our algorithm of Section 4.2 with respect to IS4o, std::sort, and Timsort. In this experiment we excluded the training time for the model. However, it should be noted that many real-world scenarios exist in which a dataset or a subset of it has to be sorted several times. For example, within a database, recurring merge-join operations or the sorting of a final result would allow pre-training models, as similar (but not identical) subsets of the data might appear over and over again. Note that we excluded Radix Sort from this comparison as it was significantly slower than any of the other baselines. For the data we used:

• addr: A set of 1M address strings from the OpenAddresses dataset (Northeast USA) [41].
• dict: A set of 479K words from an English dictionary [53].
• url: A list of 1.1M URLs from the Weblogs dataset [15] containing requests to a webserver for cs.brown.edu.
• bcmrk: 10M ASCII arrays generated using the code from the SortBenchmark dataset [19].
• synth, synth_lo, synth_hi: A set of 1M randomly generated strings following a uniform distribution of a-z characters at each position, with characters having no correlation, low correlation, and high correlation with neighbouring ASCII values, respectively.

Overall, the experiments show that Learned Sort is also a very promising direction for sorting complex objects, such as strings. It should be noted that building efficient models for strings is still an active area of research and probably a paper on its own, as it also has far-reaching applications for indexes, tries, and many other data structures and algorithms. Finally, we would like to point out that if we include the training time with the sorting time, Learned Sort still dominates the other algorithms, but by a margin of 2-8% rather than the 5-20% shown in Figure 9.

6.5 The impact of duplicates
The number of duplicates (i.e., repeated keys) is a factor that affects the run-time behavior of our algorithm, as our model will always assign the same position/bucket to the key, which increases the number of collisions and the number of keys placed into the spill bucket. To study the impact of duplicates, we first generate a normally distributed dataset (µ = 0 and σ = 5) and afterwards duplicate a few randomly selected keys n times (referred to as spikes).

Figure 11: The sorting rate of Learned Sort and the other baselines for varying degrees of duplicated keys and numbers of spikes, as well as on different Zipf distributions. The reference line represents the sorting rate of Learned Sort where there are no duplicates.

Figure 11 shows the performance of Learned Sort and the other baseline algorithms for different combinations of the
6.7 Performance decomposition
Next, in Figure 13 we show the time spent in each phase of Learned Sort. With a 1% sample of the input, the training procedure only accounts for ≈ 6% of the overall runtime. The majority of the time (≈ 80%) goes towards the model-based key bucketization. Since the CDF model produces a nearly sorted order, the touch-up step with Insertion Sort makes up only 5% of the total time, and the spill bucket sorting only 2%, which reconfirms the quality of the model predictions.

6.8 Using histograms as CDF models
Finally, in Figure 14 we run a micro-experiment to show the performance of a version of our Learned Sort algorithm that uses a histogram as the CDF model. We also compare this approach with Histogram Sort [4]. For both of these algorithms we show the performance using both equi-width and equi-depth histograms. The equi-depth CDF histogram model was implemented as a linear array that records the starting key for each bin and uses binary search to find the right bin for any given key. On the other hand, for the equi-width version we do not need to store the keys and we can find the right bin using a simple array lookup.
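The two bin-lookup strategies can be sketched as follows; the data layout is invented for the example, but it follows the description above: the equi-depth variant binary-searches stored bin-boundary keys, while the equi-width variant needs only arithmetic and an array index.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Equi-depth histogram as a CDF model: bins[i] holds the smallest key of bin i
// (bins must be sorted and non-empty). Lookup costs one binary search.
std::size_t equi_depth_bin(const std::vector<double>& bins, double key) {
  auto it = std::upper_bound(bins.begin(), bins.end(), key);
  return it == bins.begin() ? 0 : static_cast<std::size_t>(it - bins.begin()) - 1;
}

// Equi-width histogram: no keys stored; the bin index is pure arithmetic.
std::size_t equi_width_bin(double key, double min, double width, std::size_t n_bins) {
  if (key <= min) return 0;
  std::size_t i = static_cast<std::size_t>((key - min) / width);
  return i < n_bins ? i : n_bins - 1;
}
```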
Figure 14: The sorting rate of the Learned Sort algorithm on 100M normally distributed keys as compared with (1) a version of LS that uses an equi-depth histogram as the CDF model, (2) a version with an equi-width histogram, (3) Equi-depth Histogram Sort, and (4) Equi-width Histogram Sort.

As a reference, the figure includes the RMI-based Learned Sort as a dashed line at around 280MB/s. The figure shows that the number of bins has a clear impact on the sorting rate of the histogram-based methods. Surprisingly, the Histogram Sort using equi-depth performs better than using our Learned

7 CONCLUSION AND FUTURE WORK
In this paper we presented a novel approach to accelerate sorting by leveraging models that approximate the empirical CDF of the input to quickly map elements into the sorted position within a small error margin. This approach results in significant performance improvements as compared to the most competitive and widely used sorting algorithms, and marks an important step toward building ML-enhanced algorithms and data structures. Much future work remains open, most notably how to handle strings, complex types, and parallel sorting.

ACKNOWLEDGMENTS
This research is supported by Google, Intel, and Microsoft as part of the MIT Data Systems and AI Lab (DSAIL) at MIT, NSF IIS 1900933, DARPA Award 16-43-D3M-FP040, and the MIT Air Force Artificial Intelligence Innovation Accelerator (AIIA).
REFERENCES
[1] Apple. 2018. Apple/Swift: standard library sort. (2018). https://github.com/apple/swift/blob/master/test/Prototypes/IntroSort.swift
[2] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. 2017. In-Place Parallel Super Scalar Samplesort (IPSSSSo). In 25th Annual European Symposium on Algorithms (ESA 2017) (Leibniz International Proceedings in Informatics (LIPIcs)), Kirk Pruhs and Christian Sohler (Eds.), Vol. 87. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 9:1–9:14. https://doi.org/10.4230/LIPIcs.ESA.2017.9
[3] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Özsu. 2013. Multi-core, main-memory joins: Sort vs. hash revisited. Proceedings of the VLDB Endowment 7, 1 (2013), 85–96.
[4] Paul E. Black. 2019. Histogram Sort. (2019). https://www.nist.gov/dads/HTML/histogramSort.html
[5] Berenger Bramas. 2017. Fast sorting algorithms using AVX-512 on Intel Knights Landing. arXiv preprint arXiv:1704.08579 305 (2017), 315.
[6] Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. 2015. PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment 8, 12 (2015), 1518–1529.
[7] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2007. Introduction to Algorithms (2 ed.). MIT Press, Cambridge, MA.
[8] Zbigniew J. Czech, George Havas, and Bohdan S. Majewski. 1992. An optimal algorithm for generating minimal perfect hash functions. Inform. Process. Lett. 43, 5 (1992), 257–264. https://doi.org/10.1016/0020-0190(92)90220-P
[9] T. Dachraoui and L. Narayanan. 1996. Fast deterministic sorting on large parallel machines. In Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing. IEEE, New Orleans, LA, 273–280. https://doi.org/10.1109/SPDP.1996.570344
[10] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, Friedhelm Meyer Auf Der Heide, Hans Rohnert, and Robert E. Tarjan. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput. 23, 4 (1994), 738–761.
[11] Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, and David Lomet. 2019. ALEX: An Updatable Adaptive Learned Index. (2019). arXiv:cs.DB/1905.08898
[12] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. 1956. Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. Ann. Math. Statist. 27, 3 (1956), 642–669. https://doi.org/10.1214/aoms/1177728174
[13] Paul Embrechts and Marius Hofert. 2013. A note on generalized inverses. Mathematical Methods of Operations Research 77, 3 (2013), 423–432.
[14] Timothy Furtak, José Nelson Amaral, and Robert Niewiadomski. 2007. Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures. ACM, San Diego, CA, 348–357.
[15] Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-Aware Index Structure. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1189–1206. https://doi.org/10.1145/3299869.3319860
[16] GNU. 2009. C++: STL sort. (2009). https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html
[17] Go. 2009. Source file src/sort/sort.go. (2009). https://golang.org/src/sort/sort.go
[18] Fuji Goro and Morwenn. 2019. Open-source C++ implementation of Timsort. (2019). https://github.com/gfx/cpp-TimSort
[19] Jim Gray, Chris Nyberg, Mehul Shah, and Naga Govindaraju. 2017. The SortBenchmark dataset. (2017). http://sortbenchmark.org/
[20] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-Based Frequency Estimation Algorithms. In International Conference on Learning Representations. ICLR, New Orleans, LA. https://openreview.net/forum?id=r1lohoCqY7
[21] Hiroshi Inoue and Kenjiro Taura. 2015. SIMD- and cache-friendly algorithm for sorting an array of structures. Proceedings of the VLDB Endowment 8, 11 (2015), 1274–1285.
[22] Intel. 2020. Intel Performance Primitive library for x86 architectures. (2020). http://software.intel.com/en-us/intel-ipp/
[23] Java. 2017. Java 9: List.sort. (2017). https://docs.oracle.com/javase/9/docs/api/java/util/List.html#sort-java.util.Comparator-
[24] Nathan Jay, Noga H. Rotman, P. Brighten Godfrey, Michael Schapira, and Aviv Tamar. 2018. Internet Congestion Control via Deep Reinforcement Learning. (2018). arXiv:cs.NI/1810.03259
[25] Johan Ludwig William Valdemar Jensen et al. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica 30 (1906), 175–193.
[26] Jie Jiang, Lixiong Zheng, Junfeng Pu, Xiong Cheng, Chongqing Zhao, Mark R. Nutter, and Jeremy D. Schaub. 2017. Tencent Sort. (2017).
[27] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2018. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. (2018). arXiv:cs.DB/1809.00677
[28] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p117-kraska-cidr19.pdf
[29] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489–504.
[30] Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196 (2018).
[31] Maciej Kurant, Minas Gjoka, Carter T. Butts, and Athina Markopoulou. 2011. Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks. In Proceedings of ACM SIGMETRICS '11. San Jose, CA.
[32] Bohdan S. Majewski, Nicholas C. Wormald, George Havas, and Zbigniew J. Czech. 1996. A family of perfect hashing methods. Comput. J. 39, 6 (1996), 547–554.
[33] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM '19). ACM, New York, NY, USA, 270–288. https://doi.org/10.1145/3341302.3342080
[34] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer. Proc. VLDB Endow. 12, 11 (July 2019), 1705–1718. https://doi.org/10.14778/3342263.3342644
[35] Ryan Marcus and Olga Papaemmanouil. 2018. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. ACM, 3.
[36] Peter McIlroy. 1993. Optimistic sorting and information theoretic complexity. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 467–474.
[37] MongoDB. 2018. MongoDB: sorter.cpp. (2018). https://github.com/mongodb/mongo/blob/master/src/mongo/db/sorter/sorter.cpp
[38] David R. Musser. 1997. Introspective sorting and selection algorithms. Software: Practice and Experience 27, 8 (1997), 983–993.
[39] MySQL. 2000. MySQL: filesort.cc. (2000). https://github.com/mysql/mysql-server/blob/8.0/sql/filesort.cc
[40] Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. 2019. Theoretically-Efficient and Practical Parallel In-Place Radix Sorting. In The 31st ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 213–224.
[41] OpenAddresses. 2020. The OpenAddresses - Northeast dataset. (2020). https://data.openaddresses.io/openaddr-collected-us_northeast.zip
[42] OpenStreetMap contributors. 2017. Planet dump retrieved from https://planet.osm.org. (2017). https://www.openstreetmap.org
[43] Emanuel Parzen. 1962. On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33, 3 (1962), 1065–1076.
[44] Orson R. L. Peters. 2020. The Pattern-Defeating Quicksort Algorithm. (2020). https://github.com/orlp/pdqsort
[45] Tim Peters. 2002. Python: list.sort. (2002). https://github.com/python/cpython/blob/master/Objects/listsort.txt
[46] Postgres. 1996. Postgres: tuplesort.c. (1996). https://github.com/postgres/postgres/blob/master/src/backend/utils/sort/tuplesort.c
[47] Murray Rosenblatt. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics (1956), 832–837.
[48] Peter Sanders and Sebastian Winkel. 2004. Super scalar sample sort. In European Symposium on Algorithms. Springer, 784–796.
[49] Michael Axtmann and Sascha Witt. 2020. Open-source C++ implementation of the IPS4o algorithm. (2020). https://github.com/SaschaWitt/ips4o
[50] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. 2010. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 351–362.
[51] Andrew Schein. 2009. Open-source C++ implementation of Radix Sort for double-precision floating points. (2009). https://bitbucket.org/ais/usort/src/474cc2a19224/usort/f8_sort.c
[52] SQLite. 2011. SQLite: vdbesort.c. (2011). https://github.com/mackyle/sqlite/blob/master/src/vdbesort.c
[53] Simon Steele and Marius Žilénas. 2020. 479k English words for all your dictionary. (2020). https://github.com/dwyl/english-words
[54] Michal Turčaník and Martin Javurek. 2016. Hash function generation by neural network. In 2016 New Trends in Signal Processing (NTSP). IEEE, 1–5.
[55] Peter Van Sandt, Yannis Chronis, and Jignesh M. Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proceedings of the 2019 International Conference on Management of Data. ACM, 36–53.
[56] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927 (2014).
[57] Jianfeng Wang, Jingdong Wang, Nenghai Yu, and Shipeng Li. 2013. Order preserving hashing for approximate nearest neighbor search. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 133–142.
[58] Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. Proceedings of the VLDB Endowment 12, 3 (2018), 210–222.