Big Data technologies like distributed databases, queues, batch processors, and stream processors are fun and exciting to play with. Making them play nicely together can be challenging. Keeping it fun for engineers to continuously improve and operate them is hard. At ResearchGate, we run thousands of YARN applications every day to gain insights and to power user-facing features. Of course, there are numerous integration challenges along the way:
* integrating batch and stream processors with operational systems
* ingesting data and playing back results while controlling performance crosstalk
* rolling out new versions of synchronous, stream, and batch applications and their respective data schemas
* controlling the amount of glue and adapter code between different technologies
* modeling cross-flow dependencies while handling failures gracefully and limiting their repercussions
We describe our ongoing journey in identifying patterns and principles that make our big data stack integrate well. Technologies covered include MongoDB, Kafka, Hadoop (YARN), Hive (Tez), Flink Batch, and Flink Streaming.
2. ResearchGate is built for scientists. The social network gives scientists new tools to connect, collaborate, and keep up with the research that matters most to them.
3. Our mission is to connect the world of science and make research open to all.
9. Patterns & Principles
Integration patterns should be strategic, but also ...
should be driven by use cases
should tackle real world pain points
should not be dictated by a single technology
10. Patterns & Principles
Big data is still a fast moving space
Big data batch processing today is quite different compared to 5 years ago
Big data stream processing is evolving heavily right now
Big data architecture must evolve over time
14. Enriching User Generated Content
Users and batch flows continuously enrich an evolving dataset
Both user actions and batch flow results ultimately affect the same live database
15. Bibliographic Metadata – Data Model
[Diagram: entities — Account, Author, Publication, Journal, Institution, Department, Asset, Derivative — connected by relations such as Affiliation, Authorship, Citation, Publication Link, and Claiming]
16. Bibliographic Metadata – Services
[Diagram: the same entity model partitioned into services — community service, publication service, asset service]
21. Debugging an Error on Production
Your flow
has unit and integrations tests
but still breaks unexpectedly in production
You need to find the root cause
Is it a change in input data?
Is it a change on the cluster?
Is it a race condition?
Crucial capabilities
Easy ad-hoc analysis of all involved data (input, intermediate, result)
Rerun current flow with current cluster configuration on yesterday’s data
Confirm hotfix by re-running on today’s data (exactly the same data that triggered the bug)
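The replay capability above falls out almost for free once input data lives in immutable, date-partitioned paths and the calculation date is injected rather than computed. A minimal sketch (paths and names are illustrative, not our actual layout):

```python
from datetime import date, timedelta

def input_path(dataset: str, day: date) -> str:
    """Immutable, date-partitioned path: re-reading it is repeatable."""
    return f"/data/{dataset}/{day.isoformat()}"

def run_flow(day: date) -> str:
    """Hypothetical flow entry point: the calculation date is injected,
    so the same code can replay any past day's data."""
    return f"processing {input_path('events', day)}"

# Reproduce yesterday's failing run on exactly the data that triggered
# the bug, then confirm a hotfix by re-running on today's partition.
yesterday = date(2016, 3, 14) - timedelta(days=1)
run_flow(yesterday)
```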
27. Platform Data Import
Dedicated component, but generic
Every team can onboard new data sources, as required by use cases
Every ingested source is immediately available for all consumers (incl. analytics)
Feature parity for all data sources (e.g., mounting everything in Hive)
28. #2 Speak a common format*
* have at least one copy of all data in a common format (e.g., Avro)
29. Formats
Compared: Text, SequenceFiles, Avro, ORC
Avro: schema evolution, self-describing, reflect datum reader, flexible for batch & streaming
ORC: columnar, great for batch
Text, SequenceFiles: none of these properties
30. Speak a common format
Have at least one copy of all data in a common format
Your choice of processing framework should not be limited by format of existing data
Every ingested source should be available for all consumers
When optimizing for a framework (e.g., ORC for Hive) consider a copy
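The "consider a copy" rule can be made mechanical: treat the common-format copy as canonical and format-specific copies as optional additions. A hypothetical catalog sketch (dataset names are made up):

```python
# Hypothetical catalog: every dataset keeps a canonical Avro copy; optimized
# copies (e.g., ORC for Hive) are additions, never replacements.
catalog = {
    "publications": {"canonical": "avro", "optimized": ["orc"]},
    "citations":    {"canonical": "avro", "optimized": []},
}

def readable_by_any_framework(dataset: str) -> bool:
    # The canonical copy guarantees every consumer can read the data,
    # regardless of which framework the optimized copy was made for.
    return catalog[dataset]["canonical"] == "avro"

assert all(readable_by_any_framework(d) for d in catalog)
```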
31. #3 Speak a common language*
* continuously propagate schema changes
33. Data Warehouse vs. Data Lake
data lake: assume no schema, defer schema to the consumer (schema on read)
data warehouse: enforce schema at ingestion (schema on write)
34. Can we have both?
Preserve schema information that is already present
sometimes at the database level
often at the application level
Preserve full data – be truthful to our data source
continuously propagate schema changes
Can we have something like a Data Lakehouse?
35. Entities Define Schema
Code first
entities within owning service define schema
Auto conversion preferred
conversion to other representations via annotations
(JSON, BSON, Avro, ...)
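In Python terms, the "code first" idea can be sketched by deriving an Avro-style record schema from the entity class itself via reflection (the entity and the type mapping here are illustrative, not our production converter):

```python
from dataclasses import dataclass, fields

# Hypothetical "code first" entity: the owning service defines it once,
# and other representations (here an Avro-style schema) are derived from it.
@dataclass
class Publication:
    id: int
    title: str
    doi: str

AVRO_TYPES = {int: "long", str: "string", float: "double", bool: "boolean"}

def to_avro_schema(entity_cls) -> dict:
    """Reflect over the entity's fields to produce an Avro record schema."""
    return {
        "type": "record",
        "name": entity_cls.__name__,
        "fields": [{"name": f.name, "type": AVRO_TYPES[f.type]}
                   for f in fields(entity_cls)],
    }

schema = to_avro_schema(Publication)
```

Because the schema is generated, it cannot drift from the entity: any change to the class is a change to the schema.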
36. Continuously propagate schema changes
Data ingestion process is generic and driven by the Avro schema
Changes in the Avro schema are continuously propagated to the data ingestion process
Consumers with an old schema can still read data thanks to Avro schema evolution
Caveat: breaking changes still have to be dealt with by a change process
Everyone speaks the same language
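The schema-evolution mechanism Avro provides can be illustrated with a minimal resolver: a reader schema resolves records written under an older writer schema by filling added fields from their defaults (record contents here are made up):

```python
# Minimal sketch of Avro-style schema resolution: a field added later with a
# default can still be read from records written before it existed; a missing
# field without a default is exactly the "breaking change" case.
reader_schema = {
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "title", "type": "string"},
        {"name": "language", "type": "string", "default": "en"},  # added later
    ]
}

def resolve(record: dict, schema: dict) -> dict:
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"breaking change: no value for {field['name']}")
    return out

old_record = {"id": 7, "title": "On Growth and Form"}  # predates "language"
resolved = resolve(old_record, reader_schema)
```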
42. Model Data Dependencies Explicitly
More flexible scheduling – run flows as early as possible
Allows multiple ingestion or processing attempts
Allows immutable data (repeatable read)
Allows analysis of dependency graph
which datasets are used by which flows
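Once each flow declares its input datasets and each dataset names its producing flow, scheduling and impact analysis become graph problems. A sketch using Python's standard-library topological sorter (flow and dataset names are invented):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency declarations: datasets map to the flow that
# produces them; flows list the datasets they read.
produces = {"author_stats": "author-analysis", "citation_graph": "citation-builder"}
flow_inputs = {
    "citation-builder": ["publications"],   # ingested dataset, no producing flow
    "author-analysis": ["citation_graph"],
}

# Flow-level graph: each flow depends on the flows producing its inputs.
graph = {flow: [produces[d] for d in deps if d in produces]
         for flow, deps in flow_inputs.items()}

# A valid execution order; each flow can run as early as its inputs exist.
order = list(TopologicalSorter(graph).static_order())
```

The same declarations answer the reverse question ("which flows consume this dataset") for free, which is what makes impact analysis of schema or data changes tractable.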
46. Push results via HTTP to service
Export of results just becomes a client of the service
service does not have to be aware of big data technologies
Service can validate results, e.g.,
plausibility checks
optimistic locking
Makes testing much easier
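Because results arrive as ordinary client calls, the service can guard batch writes with the same mechanisms it uses for user writes. A toy sketch of the optimistic-locking case (store layout and IDs are illustrative):

```python
# Hypothetical service-side handler: a batch result is applied only if the
# entity version it was computed against is still current.
store = {"author:42": {"version": 3, "score": 0.8}}

def update(entity_id: str, expected_version: int, new_score: float) -> bool:
    current = store[entity_id]
    if current["version"] != expected_version:
        return False  # stale write from the batch flow is rejected, not applied
    store[entity_id] = {"version": expected_version + 1, "score": new_score}
    return True

assert update("author:42", expected_version=3, new_score=0.9)
assert not update("author:42", expected_version=3, new_score=0.1)  # now stale
```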
47. Avro → HTTP
Part of the flow, but a standardized component
Handles tracking of progress
treats the input file as a “queue”
converts records to HTTP calls
can be interrupted and resumed at any time
Sends standardized headers, e.g.,
X-rg-client-id: author-analysis
Handles backpressure signals from services
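The "file as a queue" idea reduces to two invariants: persist the offset of the last acknowledged record, and retry (not skip) on a backpressure signal. A minimal sketch with the delivery function injected (all names are illustrative; a 429 status stands in for the service's backpressure signal):

```python
import time

def export(records, send, load_offset, save_offset):
    """Push records one by one; resumable and backpressure-aware."""
    offset = load_offset()
    for i, record in enumerate(records):
        if i < offset:
            continue  # already delivered before the last interruption
        while True:
            status = send(record)       # e.g., one HTTP POST per record
            if status == 429:           # backpressure: slow down and retry
                time.sleep(1)
                continue
            break
        save_offset(i + 1)              # persist progress after each ack
```

Because progress is saved per record, killing and restarting the exporter at any point re-delivers at most the one in-flight record.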
49. Model Flow Orchestration Explicitly
Consider using an execution system like Azkaban, Luigi, or Airflow
Establish coding standards for orchestration, e.g.,
inject paths from outside – don’t construct them in your flow
inject calculation dates – never call now()
inject configuration settings – don’t hardcode -D mapreduce.map.java.opts=-Xmx4096m
foresee environment-specific settings
Think about
ease of operations
tuning of settings
upgrades
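The injection standards above can be sketched as a flow entry point that accepts everything run-specific from the orchestrator (flag names here are invented for illustration):

```python
import argparse
from datetime import date

def parse_args(argv):
    """Everything environment- or run-specific is injected, never computed
    or hardcoded inside the flow."""
    p = argparse.ArgumentParser()
    p.add_argument("--input-path", required=True)        # never constructed in-flow
    p.add_argument("--calculation-date", required=True,
                   type=date.fromisoformat)              # never now()
    p.add_argument("--java-opts", default="")            # tuned per environment
    return p.parse_args(argv)

# The orchestrator (Azkaban, Luigi, Airflow, ...) owns these values, which
# makes re-runs on past dates and per-environment tuning trivial.
args = parse_args(["--input-path", "/data/events/2016-03-13",
                   "--calculation-date", "2016-03-13"])
```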
51. Sources of Streaming Data
[Diagram: entity conveyor, timeseries data, and non-timeseries data (e.g., graph data) all feed into Kafka]
52. Stream Processing
[Diagram: stream processing pipeline with the patterns applied]
#1 Decouple data ingestion
#2 Speak a common format
#3 Speak a common language
#5 Decouple export of results
#6 Model flow execution explicitly
53. What about #4 ?
Model Data Dependencies Explicitly
We are thinking about it
Depends on use cases and pain points
Potentially put Kafka topics into Memento
storing “offsets of interest” from producers
facilitate switching between incompatible versions of stream processors