The document introduces a workshop on big data tools and MongoDB, discusses how MediaGlu uses big data for advertising by tracking user paths across different channels, and outlines an agenda covering MongoDB fundamentals, running MongoDB, and labs on shell commands, aggregation, replication, and sharding.
2. About Me
Sundar Nathikudi
– Co-Founder & CTO, MLN Advertising
– Former Principal Engineer, AOL Advertising
3. About MLN Advertising
• Baltimore-based Online Advertising Startup
• MediaGlu - Cloud-based Ad Platform
• Our engineers use
4. Why Does MediaGlu Need Big Data?
MediaGlu takes a holistic approach to marketing and advertising. By using real data and user
path tracking, our state-of-the-art technology gives marketers the tools to find their
customers wherever they are: in Search, Social & Content sites.
Data Management
• Tag Management: One tag tracks all your digital assets (website, social media, etc).
• Media Attribution: Track user actions and attribute them, even from sources like Facebook and Pinterest.
• Reporting: See how user actions connect across the digital space in one intuitive interface.
Discover Insights
• Channel Analytics: Visualize the effectiveness of your ad campaigns across media by learning what interactions actually drive revenue.
• User Analytics: Visualize the timeline in which users interact with your brand.
• Predictive Analytics: Visualize how channel and user path metrics overlap into predictable behavior.
Action
• Campaign Management: Let MediaGlu improve your media bidding and creative scheduling in Real Time.
• Budget Optimization: Get reports detailing which channels are providing the most value.
• Personalize Web/Social Experience: Custom tailored brand interactions based on known and predicted behavior.
5. Agenda
• Big Data and NoSQL Basics
• MongoDB Fundamentals
• Running MongoDB
• Lab #1 - Shell Commands
• Lab #2 - MongoDB Map/Reduce and Aggregation Framework
• Replication and Sharding
8. Era of Big Data
• Facebook
– 2.7 billion likes made daily on and off of the
Facebook site
– 300 million photos uploaded
– 500+ terabytes of new data "ingested"
• Twitter
– 340 million tweets daily
– 500 Million Users
9. Big Data - Definition
• Volumes & Volumes of data
• Data does not fit on one Rack.
• Unstructured or Semi Structured
10. RDBMS & Big Data
• Pros
– Examples: Oracle, SQL Server, MySQL, etc.
– Good for structured data and relational model
– Supports Partitioning
– ACID
– Transactions
• Cons
– Joins make horizontal scaling difficult
– Vertical scaling is limited by physics and cost
– Hard to scale vertically in cloud.
11. NoSQL – Not Only SQL
• Not using the relational model (nor the SQL
language)
• Open source
• Designed to run on large clusters
• No schema, allowing fields to be added to any
record without controls
• Based on the needs of web 2.0 properties
• Rise of NoSQL = Polyglot Persistence
12. NoSQL – Data Models
• Key Value Stores
– data is stored in Key-Value pairs
– support get, put, and delete operations based on a primary key
– Couchbase (Membase), Redis, Riak
• Document
– store data in structured “documents” such as JSON/XML, with no support for
relationships/joins
– MongoDB, CouchDB, SimpleDB
• Column Family (Big Table)
– contains columns of related data
– HBase, Cassandra
• Graph
– organize data into node and edge graphs; they work best for data that has
complex relationship structures
– Facebook social graph
– Neo4J
13. NoSQL – CAP Theorem
• Consistency – all nodes see the same data at the same time.
• Availability – node failures do not prevent ongoing reads and writes.
• Partition Tolerance – the system continues to operate despite message
loss or network failures between nodes.
Eric Brewer – “distributed system can satisfy any two
of these guarantees at the same time, but not all
three”
15. What’s a document database
• Composed of Documents – Self describing
• Schema Free
• Store arbitrary data – Collections, trees
16. What is JSON??
• JavaScript Object Notation
• Lightweight data-interchange format
• Elements of JSON
– Object: K/V pairs
– Key: String
– Value
• Number
• String
• Boolean
• Array
• Object
• Example
{ "id": 1,
  "name": "Foo",
  "price": 123,
  "tags": [ "Bar", "Eek" ],
  "stock": { "warehouse": 300,
             "retail": 20 }
}
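As a small illustration of the element types listed above, the sample product object can be parsed and inspected in JavaScript. This is only a sketch: it relies on the standard JSON.parse and JSON.stringify functions, and the variable names are chosen for illustration.

```javascript
// The sample product document from the slide, as JSON text.
const text = '{ "id": 1, "name": "Foo", "price": 123, ' +
             '"tags": [ "Bar", "Eek" ], ' +
             '"stock": { "warehouse": 300, "retail": 20 } }';

// JSON.parse turns the text into a JavaScript object.
const product = JSON.parse(text);

// Each value is one of the JSON element types.
const kinds = {
  id: typeof product.id,                     // a number
  name: typeof product.name,                 // a string
  tagsIsArray: Array.isArray(product.tags),  // an array
  stock: typeof product.stock                // a nested object
};

// Nested values are reached by walking the keys.
const totalStock = product.stock.warehouse + product.stock.retail;

// JSON.stringify converts the object back to text.
const roundTrip = JSON.stringify(product);

console.log(kinds, totalStock, roundTrip);
```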
17. MongoDB - Overview
• BSON – Binary-encoded serialization of
JSON-like documents (more at
http://bsonspec.org/)
• Schema Less
• Embedded documents and arrays reduce need
for joins
• Scalable – Replication and Sharding
• Best features of key/value stores, document
databases and relational databases in one.
18. MongoDB - Overview
• Name stems from humongous
• Developed by 10gen
• Written in C++
• Understands JavaScript
• SpiderMonkey JavaScript engine for server-side
JavaScript execution
• Lots of language drivers available
21. Blog Post - Document Model
{ _id: 1234,
author: { name: "Bob Davis", email : "bob@bob.com" },
post: "In these troubled times I like to …",
date: { $date: "2010-07-12 13:23UTC" },
location: [ -121.2322, 42.1223222 ],
rating: 2.2,
comments: [ { user: "jgs32@hotmail.com", upVotes: 22, downVotes: 14,
              text: "Great point! I agree" },
            { user: "holly.davidson@gmail.com", upVotes: 421, downVotes: 22,
              text: "You are an idiot" } ],
tags: [ "Politics", "Virginia" ]
}
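To show how embedding removes the need for joins, the sketch below reads a trimmed copy of the blog post as a plain JavaScript object. It does not talk to MongoDB; the variable names and the "net score" ranking are assumptions made for illustration.

```javascript
// A trimmed copy of the blog post document as a plain JavaScript object.
const post = {
  _id: 1234,
  author: { name: "Bob Davis", email: "bob@bob.com" },
  rating: 2.2,
  comments: [
    { user: "jgs32@hotmail.com", upVotes: 22, downVotes: 14,
      text: "Great point! I agree" },
    { user: "holly.davidson@gmail.com", upVotes: 421, downVotes: 22,
      text: "You are an idiot" }
  ],
  tags: ["Politics", "Virginia"]
};

// The author is embedded, so no lookup in a second table is needed.
const authorEmail = post.author.email;

// Comments are an embedded array; pick the one with the best net score
// (upVotes minus downVotes, an illustrative ranking).
const topComment = post.comments.reduce((best, c) =>
  (c.upVotes - c.downVotes) > (best.upVotes - best.downVotes) ? c : best
);

// Tags are a simple array of strings.
const isPolitical = post.tags.includes("Politics");

console.log(authorEmail, topComment.user, isPolitical);
```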
22. NoSQL vs. RDBMS terminology
MySQL          MongoDB
Database       Database
Table          Collection
Index          Index
Row            Document
Column         Field
Join           Embedding and Linking
Primary Key    _id field
Group By       Aggregation
23. Installing MongoDB
• Mongo Distributions
– OS X, Linux , Windows, Solaris
– Runs on commodity hardware
24. Installing MongoDB
• Download MongoDB server
– http://www.mongodb.org/downloads
– http://www.mlnsitelabs.com/mongodb
– Extract the bin folder to C:\MongoDB
• Create Data Folder - C:\MongoDbData
• To Start from command line
– Run Mongo.Bat
• To install as a Windows service
– Run MongoService.bat from command line.
27. Learning MongoDB Shell
• Interactive JavaScript shell
• Use online browser shell
– http://try.mongodb.org/
• Or run from command line
– mongo --host localhost --port 27017
28. Learning Shell Commands
• Create Database (created on first write)
– use student;
– db.scores.find();
• Inserting a document into collection
– var student = {name: 'Jim', scores: [75, 99, 87.2]};
– db.scores.save(student);
– var student = {name: 'John', scores: [35, 45, 55]};
– db.scores.save(student);
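The insert statements above can be tried without a running server by using a tiny in-memory stand-in. The FakeCollection class below is hypothetical and is not part of MongoDB: it mimics only save() and an exact-match find(), and it assigns integer _id values where MongoDB would generate an ObjectId.

```javascript
// Hypothetical in-memory stand-in for a collection (NOT the MongoDB API).
class FakeCollection {
  constructor() { this.docs = []; this.nextId = 1; }

  // Like the shell's save(): insert the document, or replace the one
  // that already has the same _id. A missing _id is assigned here.
  save(doc) {
    if (doc._id === undefined) doc._id = this.nextId++;
    const i = this.docs.findIndex(d => d._id === doc._id);
    if (i >= 0) this.docs[i] = doc; else this.docs.push(doc);
  }

  // Like find(query), restricted to top-level equality conditions.
  find(query = {}) {
    return this.docs.filter(d =>
      Object.keys(query).every(k => d[k] === query[k]));
  }
}

const db = { scores: new FakeCollection() };

// The same statements as on the slide.
var student = { name: 'Jim', scores: [75, 99, 87.2] };
db.scores.save(student);
var student = { name: 'John', scores: [35, 45, 55] };
db.scores.save(student);

const all = db.scores.find();                    // every document
const johns = db.scores.find({ name: 'John' });  // filter by name
console.log(all.length, johns[0].scores);
```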
34. Indexes
• What is an Index??
– structure that allows you to quickly locate
documents based on the values stored in certain
specified fields.
• Indexes enhance query performance
35. Indexes
• Mongo DB Indexes
– defines indexes on a per-collection level.
– B-Tree Indexes
– Compound indexes with multiple fields
• db.scores.ensureIndex( { name: 1, id: 1 } )
– Unique Index
• db.addresses.ensureIndex( { "user_id": 1 }, { unique: true } )
– Sparse Index
• db.addresses.ensureIndex( { "xmpp_id": 1 }, { sparse: true } )
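The idea behind unique and sparse indexes can be sketched in plain JavaScript. This is an illustration only: a Map stands in for MongoDB's B-tree, so it supports equality lookups but not range scans, and buildIndex and the sample addresses are hypothetical names and data.

```javascript
// Build a lookup structure over one field of a collection.
// Assumption: a Map replaces the B-tree, so only equality lookups work.
function buildIndex(docs, field, { unique = false, sparse = false } = {}) {
  const index = new Map();
  for (const doc of docs) {
    const hasField = Object.prototype.hasOwnProperty.call(doc, field);
    if (!hasField && sparse) continue;   // sparse: skip docs lacking the field
    const key = hasField ? doc[field] : null;
    if (unique && index.has(key)) {
      throw new Error("duplicate key: " + key);  // unique: reject repeats
    }
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

// Hypothetical sample data.
const addresses = [
  { user_id: 1, city: "Baltimore", xmpp_id: "a@example.com" },
  { user_id: 2, city: "Towson" },
  { user_id: 3, city: "Columbia", xmpp_id: "c@example.com" }
];

const byUser = buildIndex(addresses, "user_id", { unique: true });
const byXmpp = buildIndex(addresses, "xmpp_id", { sparse: true });

// Lookup goes straight to the entry instead of scanning every document.
console.log(byUser.get(2)[0].city, byXmpp.size);
```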
36. Map Reduce
• Pattern to allow computations to be
parallelized over a cluster.
• Comparable to GROUP BY in SQL
37. Map Reduce
• Write two functions – Map and Reduce
• Write them in JavaScript
• Map Function
– Called once per document; emits key/value pairs
• Reduce Function
– Called once per emitted key, with an array of values
• Finalize (optional)
– Allows a final adjustment of the reduced data set, such as rounding.
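The three functions can be sketched with a minimal in-memory driver. This is not MongoDB's implementation: it runs on one machine with no parallelism, and it passes emit to the map function as an argument, whereas in MongoDB map reads the document through this and calls a global emit. The mapReduce helper and the sample data are hypothetical.

```javascript
// Minimal in-memory driver (assumption: single process, no parallelism).
function mapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  // Map phase: each document may emit any number of (key, value) pairs.
  for (const doc of docs) {
    map(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  // Reduce phase: one call per key that has more than one value,
  // followed by the optional finalize step.
  const out = {};
  for (const [key, values] of groups) {
    const reduced = values.length > 1 ? reduce(key, values) : values[0];
    out[key] = finalize ? finalize(key, reduced) : reduced;
  }
  return out;
}

const students = [
  { name: 'Jim', scores: [75, 99, 87.2] },
  { name: 'John', scores: [35, 45, 55] }
];

// Map: emit one partial {sum, count} per score, keyed by student name.
const map = (doc, emit) => {
  for (const s of doc.scores) emit(doc.name, { sum: s, count: 1 });
};

// Reduce: combine partial sums; the output has the same shape as the input.
const reduce = (key, values) => values.reduce(
  (a, b) => ({ sum: a.sum + b.sum, count: a.count + b.count }));

// Finalize: turn the totals into a rounded average.
const finalize = (key, v) => Math.round(v.sum / v.count);

const averages = mapReduce(students, map, reduce, finalize);
console.log(averages);
```

Emitting partial sums and counts, rather than averages, keeps the reduce step safe to apply to its own output.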
38. Map Reduce
• User Profile
{
"_id" : ObjectId("505e717a6794e396ac493e37"),
"UserId" : NumberLong(5209704),
"Browser" : "Microsoft Internet Explorer",
"Gender" : "M",
"CountryCode" : "US",
"State" : "FL",
"City" : "Spring Hill"
}
• Count the users from California by Browser and Gender
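One possible shape for the map and reduce functions behind this count is sketched below in plain JavaScript. The sample profiles are invented, the "Browser|Gender" string key is an illustrative choice, and the small loop at the end stands in for the database, which would call these functions itself.

```javascript
// Hypothetical sample profiles shaped like the user profile document.
const profiles = [
  { UserId: 1, Browser: "Chrome", Gender: "F", State: "CA" },
  { UserId: 2, Browser: "Chrome", Gender: "F", State: "CA" },
  { UserId: 3, Browser: "Firefox", Gender: "M", State: "CA" },
  { UserId: 4, Browser: "Microsoft Internet Explorer", Gender: "M", State: "FL" }
];

// Map: emit ("browser|gender", 1) for California users only.
function map(doc, emit) {
  if (doc.State === "CA") emit(doc.Browser + "|" + doc.Gender, 1);
}

// Reduce: sum the counts collected for one key.
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Stand-in driver: group emitted values by key, then reduce each group.
const emitted = new Map();
for (const doc of profiles) {
  map(doc, (k, v) => emitted.set(k, (emitted.get(k) || []).concat(v)));
}
const counts = {};
for (const [k, vs] of emitted) counts[k] = reduce(k, vs);

console.log(counts);
```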
40. Lab#2 – Map/Reduce
• CSV file– user profile information
• Count the users by Browser and Gender
• Download
• http://www.mlnsitelabs.com/mongodb/Labs/Lab2
41. Aggregation Framework
• Map/Reduce is a big hammer
– Sum, Average
– Avoid JavaScript overhead if you can
• Aggregation Framework
– Specify a pipeline
– Pipeline = series of operations
– Documents in a collection pass through the pipeline
to produce an aggregated result
42. Aggregation Framework
• $match
– Uses query predicate
• $project
– Uses a sample document to determine the shape of the result
• $unwind
– Hands out the array elements one at a time
• $group
– Aggregates items into group defined by a key
43. Aggregation Framework
• $sort
– sort the result
• $limit
– Limit the number of documents to pass
• $skip
– Skip over the specified number of documents
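The pipeline idea can be sketched in plain JavaScript by treating each stage as a function from an array of documents to a new array. This is a simplification and not the MongoDB API: the stage helpers below borrow the operator names but take JavaScript functions instead of query documents, and runPipeline is a hypothetical helper.

```javascript
// Each stage takes an array of documents and returns a new array.
const $match = pred => docs => docs.filter(pred);
const $project = shape => docs => docs.map(shape);
const $unwind = field => docs => docs.flatMap(d =>
  d[field].map(v => ({ ...d, [field]: v })));   // one document per element
const $group = (keyFn, init, step) => docs => {
  const acc = new Map();
  for (const d of docs) {
    const k = keyFn(d);
    acc.set(k, step(acc.has(k) ? acc.get(k) : init(), d));
  }
  return [...acc].map(([k, v]) => ({ _id: k, ...v }));
};
const $sort = cmp => docs => [...docs].sort(cmp);
const $skip = n => docs => docs.slice(n);
const $limit = n => docs => docs.slice(0, n);

// A pipeline is a series of stages applied in order.
const runPipeline = (docs, stages) =>
  stages.reduce((out, stage) => stage(out), docs);

const scores = [
  { name: 'Jim', scores: [75, 99, 87.2] },
  { name: 'John', scores: [35, 45, 55] }
];

// Total of the scores of 40 or more per student, best student only.
const result = runPipeline(scores, [
  $unwind("scores"),
  $match(d => d.scores >= 40),
  $group(d => d.name,
         () => ({ total: 0, n: 0 }),
         (a, d) => ({ total: a.total + d.scores, n: a.n + 1 })),
  $sort((a, b) => b.total - a.total),
  $limit(1)
]);

console.log(result);
```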
47. Sharding
• Partitioning of data among multiple machines
• Enables Horizontal Scaling – increases write throughput (writes per second)
• Partition a collection, specify a shard key – ex:
_id, timestamp
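Routing by shard key can be sketched as a lookup over key ranges. The chunk boundaries, shard names, and the shardFor helper below are all hypothetical; a real cluster maintains and rebalances its ranges on its own.

```javascript
// Hypothetical chunk boundaries on the shard key (here a numeric _id).
// Each shard owns the keys in the half-open range [min, max).
const chunks = [
  { shard: "shardA", min: -Infinity, max: 1000 },
  { shard: "shardB", min: 1000, max: 2000 },
  { shard: "shardC", min: 2000, max: Infinity }
];

// Find the shard whose range contains the document's shard key value.
function shardFor(doc, shardKey) {
  const value = doc[shardKey];
  const chunk = chunks.find(c => value >= c.min && value < c.max);
  return chunk.shard;
}

// Distribute a few documents and record where each one lands.
const placement = {};
for (const doc of [{ _id: 5 }, { _id: 1500 }, { _id: 1999 }, { _id: 42000 }]) {
  const shard = shardFor(doc, "_id");
  (placement[shard] = placement[shard] || []).push(doc._id);
}

console.log(placement);
```

Because writes for different key ranges go to different machines, adding shards spreads the write load.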
49. GridFS
• Specification for storing large files in
MongoDB
• BSON documents in MongoDB are limited to 16 MB
in size
• GridFS – Divide large files among multiple
documents
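The chunking idea can be sketched with Node.js buffers. This is an illustration and not a GridFS driver: the toChunks and fromChunks helpers, the file id, and the tiny chunk size are assumptions, while the files_id, n, and data fields follow the layout GridFS uses for chunk documents.

```javascript
// Split a byte buffer into chunk documents of at most chunkSize bytes.
function toChunks(fileId, data, chunkSize) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n++) {
    chunks.push({
      files_id: fileId,                                // owning file
      n,                                               // chunk sequence number
      data: data.subarray(offset, offset + chunkSize)  // this chunk's bytes
    });
  }
  return chunks;
}

// Reassemble the file by ordering chunks on n and concatenating them.
function fromChunks(chunks) {
  const ordered = [...chunks].sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map(c => c.data));
}

// A 26-byte stand-in for a large file, with a deliberately tiny chunk size.
const file = Buffer.from("a large file would go here");
const chunks = toChunks("file-1", file, 10);
const restored = fromChunks(chunks);

console.log(chunks.length, restored.equals(file));
```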
50. References
• Mongo Cookbook
– http://cookbook.mongodb.org/
• NoSQL Distilled: A Brief Guide to the Emerging
World of Polyglot Persistence
– http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
• Seven Databases in Seven Weeks
– http://www.amazon.com/Seven-Databases-Weeks-Modern-Movement/dp/1934356921