This document discusses real-time log analysis using Mesos, Docker, Kafka, Spark, Cassandra and Solr at scale. It provides an overview of the architecture, describing how data from various sources like syslog can be ingested into Kafka via Docker producers. It then discusses consuming from Kafka to write to Cassandra in real-time and running Spark jobs on Cassandra data. The document uses these open source tools together in a reference architecture to enable real-time analytics and search capabilities on streaming data.
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, Kafka, Spark, Cassandra and Solr at scale
5. Mesos Papers
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
http://static.usenix.org/event/nsdi11/tech/full_papers/Hindman_new.pdf
Google Borg - https://research.google.com/pubs/pub43438.html
Google Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
10. Fine Grained Resource Elasticity
"If people knew how low it really is, we’d all get fired."
https://gigaom.com/2013/11/30/the-sorry-state-of-server-utilization-and-the-impending-post-hypervisor-era/
17. Kafka papers
Apache Kafka was first open sourced by LinkedIn in 2011
Papers
● Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf
● Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
● Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf
● The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://kafka.apache.org/
28. Kafka on Mesos
● smart broker.id assignment.
● preservation of broker placement (through constraints and/or new features).
● ability to make configuration changes.
● rolling restarts (for things like configuration changes).
● scaling the cluster up and down with automatic, programmatic and manual options.
● smart partition assignment via constraints vis-à-vis roles, resources and attributes.
29. CLI & REST API
● scheduler - starts the scheduler.
● broker
○ add - adds one or more brokers to the cluster.
○ update - changes resources, constraints or broker properties of one or more brokers.
○ remove - takes a broker out of the cluster.
○ start - starts a broker up.
○ stop - stops a broker, either with a graceful shutdown or a forced kill (./kafka-mesos.sh help stop)
● topic
○ list - list topics in cluster
○ add - add new topics in cluster
○ update - change topics in cluster
○ rebalance - allows you to rebalance a cluster either by selecting the brokers or topics to rebalance. Manual assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor on a topic.
● help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command}
32. Consume from Kafka → Write to Cassandra
Implement the CQL write at https://github.com/stealthly/go_kafka_client/blob/master/consumers/consumers.go#L186-L194 using https://github.com/gocql/gocql
Go Kafka Client does fan-out work processing; a rebalance doesn’t upset consumers that are already reading.
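A minimal sketch of the fan-out write path, under stated assumptions: it does not use the Go Kafka Client API and is not the code at the link above. The Message type, the RowWriter interface, the FanOut function, the memWriter stand-in and the logs table with its columns are all hypothetical names introduced for illustration. With gocql, a RowWriter could plausibly wrap session.Query(stmt, args...).Exec(); here an in-memory writer is used so the example runs without a Kafka or Cassandra cluster.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical stand-in for a message consumed from Kafka.
type Message struct {
	Topic     string
	Partition int32
	Offset    int64
	Value     []byte
}

// RowWriter abstracts the Cassandra write (assumption: in a real consumer
// this would be backed by a gocql session executing the statement).
type RowWriter interface {
	Write(stmt string, args ...interface{}) error
}

// insertCQL targets a hypothetical table; the schema is not from the slides.
const insertCQL = "INSERT INTO logs (topic, part, msg_offset, body) VALUES (?, ?, ?, ?)"

// FanOut starts `workers` goroutines that drain msgs and write each message
// through w. It returns the number of failed writes once msgs is closed
// and every worker has finished.
func FanOut(msgs <-chan Message, w RowWriter, workers int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range msgs {
				err := w.Write(insertCQL, m.Topic, m.Partition, m.Offset, m.Value)
				if err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return failed
}

// memWriter is an in-memory RowWriter used only for this demonstration.
type memWriter struct {
	mu   sync.Mutex
	rows [][]interface{}
}

// Write records the bound values instead of sending them to Cassandra.
func (m *memWriter) Write(stmt string, args ...interface{}) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.rows = append(m.rows, args)
	return nil
}

func main() {
	// Pre-fill a buffered channel to simulate a batch of consumed messages.
	msgs := make(chan Message, 4)
	for i := 0; i < 4; i++ {
		msgs <- Message{Topic: "syslog", Partition: 0, Offset: int64(i), Value: []byte("line")}
	}
	close(msgs)

	w := &memWriter{}
	failed := FanOut(msgs, w, 2)
	fmt.Println("rows written:", len(w.rows), "failed:", failed)
}
```

Keeping the write behind an interface means the worker pool can be exercised without a cluster, and the Cassandra-specific code stays in one small adapter.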