Location via proxy:   
[Report a bug]   [Manage cookies]                
Thrill logo

About Thrill

Thrill is a C++ framework for distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing.

We last presented our ongoing work on Thrill at the IEEE Conference on Big Data in December 2016. A longer technical report about the design and goals is available at arXiv: https://arxiv.org/abs/1608.05634. This paper gives a good introduction into the concepts and ideas. The slides of our presentation at the conference are also available and give a visual introduction.

The development code is available on github under a BSD open-source license and outside contributors are welcome to join and contact us.

Doxygen documentation automatically built from the master is available. The doxygen documentation also contains a tutorial “ Write your first Thrill program”.

GitHub: Travis-CI: Travis-CI Status

Some of the main goals for the design are:

  • To create a high-performance Big Data batch processing framework.

  • Expose a powerful C++ user interface, that is efficiently tied to the framework’s internals. The interface supports the Map/Reduce paradigm, but also versatile “dataflow graph” style computations like Apache Spark or Apache Flink with host language control flow.
    See our examples of WordCount, PageRank, and k-Means.

  • Leverage newest C++11 and C++14 features like lambda functions and auto types to make writing user programs easy and convenient.

  • Enable compilation of binary programs with full compile-time optimization runnable directly on hardware without a virtual machine interpreter. Exploit cache effects due to less indirections than in Java and other languages. Save energy and money by reducing computation overhead.

  • Due to the zero-overhead concept of C++, enable applications to process small datatypes efficiently with no overhead.

  • Support external memory well by implementing I/O-efficient algorithms where needed, but keep most computations in RAM.

  • Perform full pipelining of data flows, where pipelined stages are often combined at compile time.

  • Avoid all unnecessary round trips of data to memory or disk.

  • Enable reproducible benchmarking of programs due to RAII memory management.

In the long term the framework can play a mediator role between Big Data applications and lower layer algorithms research, which may include:

  • Research into more communication efficient distributed algorithms for basic operations like sorting, selection, hashing, etc.

  • Support fault tolerant execution with lower overheads due to fault-resilient algorithms and better checkpointing.

  • Join Big Data research with succinct data structures and compression to enable more computations to be performed in RAM.

Current Authors and Contributors:

Michael Axtmann, Timo Bingmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Matthias Stumpp, Peter Sanders, Sebastian Schlag, Tobias Sturm.

Weblog Posts

  • Word Count Example

    This C++ snippet shows our (unoptimized) working example of Word Count in Thrill.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    
    using namespace thrill;
    
    size_t WordCountExample(Context& ctx) {
    
        auto lines = ReadLines(ctx, "wordcount.in");
    
        auto word_pairs = lines.template FlatMap<WordCountPair>(
            [](const std::string& line, auto emit) -> void {
                    /* map lambda: emit each word */
                for (const std::string& word : common::split(line, ' ')) {
                    if (word.size() != 0)
                        emit(WordCountPair(word, 1));
                }
            });
    
        auto red_words =  word_pairs.ReduceBy(
            [](const WordCountPair& in) -> std::string {
                /* reduction key: the word string */
                return in.first;
            },
            [](const WordCountPair& a, const WordCountPair& b) -> WordCountPair {
                /* associative reduction operator: add counters */
                return WordCountPair(a.first, a.second + b.second);
            });
    
        red_words.Map(
            [](const WordCountPair& wc) {
                return wc.first + ": " + std::to_string(wc.second);
            })
        .WriteLinesMany(
            "wordcount_" + std::to_string(ctx.my_rank()) + ".out");
    
        return 0;
    }
    
    int main(int argc, char* argv[]) {
        return api::Run(WordCountExample);
    }