Working with large volumes of data is complicated enough, but it's even harder when everything must happen in real time and you have to figure it all out yourself. This session uses practical examples to discuss architectural best practices and lessons learned while solving real-time social media analytics, sentiment analysis, data visualization, and decision-making problems with AWS. Learn how you can leverage AWS services like Amazon RDS, AWS CloudFormation, Auto Scaling, Amazon S3, Amazon Glacier, and Amazon Elastic MapReduce to perform reliable, high-performance, real-time big data analytics while saving time, effort, and money. Gain insight from two years of real-time analytics successes and failures so you don't have to go down this path on your own.
1. • SaaS Company, since 2008
• Social Media Analytics: we track and measure the activity of brands and personalities, providing information for market research and brand comparison
• Multi-language technology (English, Portuguese, and Spanish)
• Leader in Latin America, with operations in 5 countries and customers across LatAm and the US
• 1 of 34 companies in the Twitter Certified Program worldwide
9. Challenges: Velocity
• Updates every second
• Top users and top hashtags every minute (see the sketch below)
• Post-event analysis runs as a batch job over the complete dataset
• Spikes of 20,000+ tweets per minute
[Chart: tweets per minute over time, with spikes annotated "Last TV Debate" and "Results Announced"]
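The per-minute rankings above lend themselves to simple in-memory aggregation. A minimal Python sketch, assuming tweets arrive as dicts in the classic public Twitter API shape; the surrounding pipeline and field handling are illustrative, not the system described in the talk:

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical per-minute hashtag aggregation; "created_at" parsing
# follows the classic Twitter API timestamp format.
per_minute = defaultdict(Counter)

def ingest(tweet):
    ts = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S +0000 %Y")
    minute = ts.replace(second=0, microsecond=0)
    for tag in tweet["entities"]["hashtags"]:
        per_minute[minute][tag["text"].lower()] += 1

def top_hashtags(minute, n=10):
    # Counter.most_common returns the n largest counts in order
    return per_minute[minute].most_common(n)
```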
15. Architecture – 1st iteration
What we needed:
• Complete data isolation
• Freedom to try different solutions/offerings
16. Architecture – 1st iteration
What we did:
• All-in-one approach
• Multi instance architecture
• Simple vertical scalability
• MySQL performance tuning
17. Architecture – 1st iteration
What we've learned:
• Multi-instance is harder to administer, but it minimizes the impact of instability on customers
• Vertical scalability: poor resource management
• MySQL schema changes translate into downtime
18. Architecture – 2nd iteration
What we needed:
• Separation of responsibilities (crawling, processing)
• Horizontal scalability
• Fast provisioning
• Cost reduction
19. Architecture – 2nd iteration
What we changed:
• Migrated to AWS
• RabbitMQ (single node) to decouple crawling from processing (see the sketch below)
• Replaced MySQL with RDS
• CloudFormation
• Auto Scaling Groups
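To give a feel for the crawler/processor split over RabbitMQ, here is a minimal sketch using the pika client. The host, queue name, message shape, and the process_tweet step are assumptions, not the talk's actual setup:

```python
import json
import pika  # RabbitMQ client library

# Sketch of the crawler -> queue -> processor split.
conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.internal"))
channel = conn.channel()
channel.queue_declare(queue="tweets", durable=True)

def publish(tweet):
    # Crawler side: persistent delivery (delivery_mode=2) survives broker restarts
    channel.basic_publish(exchange="", routing_key="tweets",
                          body=json.dumps(tweet),
                          properties=pika.BasicProperties(delivery_mode=2))

def on_message(ch, method, properties, body):
    # Processor side: ack only after the (hypothetical) analytics step succeeds
    process_tweet(json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="tweets", on_message_callback=on_message)
channel.start_consuming()  # blocks, dispatching messages to on_message
```

Because the queue buffers spikes, the crawler and processor fleets can then scale independently.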
20. Architecture – 2nd iteration
What we've learned:
• Provisioned IOPS (PIOPS)
• Tuning the auto scaling policies can be hard (see the sketch below)
• CloudFormation: great for migration, not enough for daily ops
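As a concrete example of what "tuning the policies" involves, a scale-out policy might be registered like this with boto3. The group name, adjustment size, and cooldown are placeholder values; finding ones that track real load without flapping is the hard part:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical scale-out policy: add capacity in steps, then wait out
# a cooldown so consecutive triggers don't over-provision.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="processor-asg",
    PolicyName="scale-out-on-load",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,   # add two instances per trigger
    Cooldown=300,          # wait 5 minutes before scaling again
)
```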
21. Architecture – 3rd iteration
What we needed:
• Deliver new features (near-real-time, more complex analytics)
• Scale fast
• Be resilient against failure
• Add and improve data sources
• Keep costs under control (always)
22. Architecture – 3rd iteration
What we changed:
• Apache Storm
• RabbitMQ HA
• EMR (Hadoop/Hive)
• CloudFormation + Chef
• Glacier + S3 lifecycle policies (see the sketch below)
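The Glacier + S3 item maps to a bucket lifecycle rule that ages raw data into cold storage automatically. A sketch with boto3; the bucket name, prefix, and 30-day threshold are assumptions:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: raw objects move to Glacier after 30 days,
# keeping S3 costs bounded while preserving the full dataset.
s3.put_bucket_lifecycle_configuration(
    Bucket="socialmetrix-raw-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-tweets",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```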
23. Architecture – 3rd iteration
What we've learned:
• Spot instances + Reserved instances
• Hive = SQL → SQL scripts are hard to test
• Bulk upserts on RDS can be expensive (PIOPS); see the batching sketch below
• DynamoDB is great, but expensive (for our use case)
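One way to tame the IOPS cost of upserts is to batch many rows into a single statement and commit. A hedged sketch with PyMySQL; the table, columns, and connection details are invented for illustration:

```python
import pymysql

# Hypothetical connection; credentials are placeholders.
conn = pymysql.connect(host="rds.internal", user="app",
                       password="...", db="metrics")

def upsert_counts(rows):
    # rows: iterable of (tag, minute, count) tuples
    sql = (
        "INSERT INTO hashtag_counts (tag, minute, cnt) "
        "VALUES (%s, %s, %s) "
        "ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)"
    )
    with conn.cursor() as cur:
        cur.executemany(sql, rows)  # one round-trip for the whole batch
    conn.commit()                   # one commit, fewer write IOPS
```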
25. Architecture – 4th iteration
What we needed:
• Monitor millions of social media profiles
• Make data accessible (exploration, PoC)
• Improve UI response times
• Test our data pipelines
• Reprocess data (faster)
26. Architecture – 4th iteration
What we changed:
• Cassandra (DSE)
• MongoDB MMS
• Apache Spark (see the sketch below)
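To illustrate the Spark piece, a small PySpark job in the spirit of this pipeline: counting user mentions from raw tweets stored on S3. The bucket paths, JSON-lines layout, and the job itself are illustrative assumptions:

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="brand-mentions")

# Hypothetical batch job: count user mentions across raw tweets on S3.
counts = (
    sc.textFile("s3://socialmetrix-raw-data/raw/2014/*/*.json")
      .map(json.loads)
      .flatMap(lambda t: [m["screen_name"]
                          for m in t["entities"]["user_mentions"]])
      .map(lambda name: (name, 1))
      .reduceByKey(lambda a, b: a + b)
)
counts.saveAsTextFile("s3://socialmetrix-results/mentions/")
```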
27. Architecture – 4th iteration
What we've learned:
• Leverage the AWS ecosystem
• DataStax AMI + OpsCenter integration
• MongoDB MMS: automation magic!
• Apache Spark unit testing + ec2 launch scripts (see the test sketch below)
• EMR doesn't have the latest stable versions
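The unit-testing point works because Spark can run entirely inside the test process with a local master, so pipeline logic is testable without a cluster. A sketch; the transformation under test is hypothetical:

```python
import unittest
from pyspark import SparkContext

def count_mentions(rdd):
    # The transformation under test: a pure function of an RDD.
    return rdd.map(lambda name: (name, 1)).reduceByKey(lambda a, b: a + b)

class CountMentionsTest(unittest.TestCase):
    def setUp(self):
        # "local[2]" runs Spark in-process with two worker threads
        self.sc = SparkContext("local[2]", "unit-test")

    def tearDown(self):
        self.sc.stop()

    def test_counts(self):
        rdd = self.sc.parallelize(["alice", "bob", "alice"])
        self.assertEqual(dict(count_mentions(rdd).collect()),
                         {"alice": 2, "bob": 1})
```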
31. Lessons Learned
• Automate from day 1 (CloudFormation + Chef)
• Monitor system activity and understand your data patterns, e.g. with Logstash (ELK)
• Always have a Source of Truth (S3 + Glacier)
• Make your Source of Truth searchable
32. Lessons Learned (II)
• Approximation is a good thing: HyperLogLog, Count-Min Sketch, Bloom filters (see the sketch below)
• Write your pipelines with reprocessing needs in mind
• Avoid framework explosion at all costs
• The AWS ecosystem allows rapid prototyping
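As an example of "approximation is a good thing", here is a minimal Bloom filter, the simplest of the three structures named above: constant memory, no false negatives, and a small tunable false-positive rate. The dedup use case at the end is an illustrative assumption:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: constant-memory set membership with
    false positives but never false negatives."""

    def __init__(self, size=1 << 20, hashes=5):
        self.size = size            # number of bits
        self.hashes = hashes        # hash functions per item
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # Derive k positions by salting one hash function
        for i in range(self.hashes):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Example: deduplicating tweet IDs seen by a crawler
seen = BloomFilter()
seen.add("tweet:12345")
assert "tweet:12345" in seen  # always true once added; unseen IDs
                              # test False with high probability
```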