MediaGlu and Mongo DB

Big Data Tools Workshop
Introduction to MongoDB

About Me
Sundar Nathikudi
– Co-Founder & CTO, MLN Advertising
– Former Principal Engineer, AOL Advertising

About MLN Advertising
• Baltimore-based Online Advertising Startup
• MediaGlu - Cloud-based Ad Platform
• Our engineers use

Why Does MediaGlu Need Big Data?
MediaGlu takes a holistic approach to marketing and advertising: by using real data and user
path tracking, our state-of-the-art technology gives marketers the tools to find their
customers wherever they are, in Search, Social & Content sites.

• Data Management
– Tag Management: One tag tracks all your digital assets (website, social media, etc.).
– Media Attribution: Track user actions and attribute them, even from sources like Facebook and Pinterest.
– Reporting: See how user actions connect across the digital space in one intuitive interface.
• Discover Insights
– Channel Analytics: Visualize the effectiveness of your ad campaigns across media by learning what interactions actually drive revenue.
– User Analytics: Visualize the timeline in which users interact with your brand.
– Predictive Analytics: Visualize how channel and user path metrics overlap into predictable behavior.
• Action
– Campaign Management: Let MediaGlu improve your media bidding and creative scheduling in Real Time.
– Budget Optimization: Get reports detailing which channels are providing the most value.
– Personalize Web/Social Experience: Custom tailored brand interactions based on known and predicted behavior.

Agenda
• Big Data and NoSQL Basics
• MongoDB Fundamentals
• Running MongoDB
• Lab #1 - Shell Commands
• Lab #2 - MongoDB Map/Reduce and Aggregation Framework
• Replication and Sharding

Era of Big Data
• Facebook
– 2.7 billion likes made daily on and off of the
Facebook site
– 300 million photos uploaded
– 500+ terabytes of new data "ingested"
• Twitter
– 340 million tweets daily
– 500 Million Users

Big Data - Definition
• Volumes & Volumes of data
• Data does not fit on one Rack.
• Unstructured or Semi Structured

RDBMS & Big Data
• Pros
– Oracle, SQL Server, MySQL, etc.
– Good for structured data and relational model
– Supports Partitioning
– ACID
– Transactions
• Cons
– Joins make horizontal scaling difficult
– Vertical scaling is limited by physics and cost
– Hard to scale vertically in cloud.

NoSQL – Not Only SQL
• Does not use the relational model (or the SQL language)
• Open source
• Designed to run on large clusters
• No schema, allowing fields to be added to any
record without controls
• Based on the needs of web 2.0 properties
• Rise of NoSQL = Polyglot Persistence

NoSQL – Data Models
• Key Value Stores
– data is stored in Key-Value pairs
– support get, put, and delete operations based on a primary key
– Couchbase (Membase), Redis, Riak
• Document
– store data in structured “documents” such as JSON/XML, with no support for
relationships/joins
– MongoDB, CouchDB, SimpleDB
• Column Family (Big Table)
– contains columns of related data
– HBase, Cassandra
• Graph
– organize data into node and edge graphs; they work best for data that has
complex relationship structures
– Facebook social graph
– Neo4J

NoSQL – CAP Theorem
• Consistency – all nodes see the same data at the same time.
• Availability – node failures do not prevent ongoing reads/writes.
• Partition tolerance – the system continues to operate despite network
partitions (lost or delayed messages between nodes).

Eric Brewer – “a distributed system can satisfy any two
of these guarantees at the same time, but not all
three”

NoSQL – Triangle of compromise

What’s a document database
• Composed of Documents – Self describing
• Schema Free
• Store arbitrary data – Collections ,trees

What is JSON?
• JavaScript Object Notation
• Lightweight data-interchange format
• Elements of JSON
– Object: K/V pairs
– Key: String
– Value
• Number
• String
• Boolean
• Array
• Object
• Example
{ "id": 1,
  "name": "Foo",
  "price": 123,
  "tags": [ "Bar", "Eek" ],
  "stock": { "warehouse": 300,
             "retail": 20 }
}
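
As a small illustration of the elements listed above, the following sketch parses the slide's example document and reads a nested object and an array. It assumes a Node.js or browser JavaScript runtime; the variable names are made up for this example.

```javascript
// The example document from the slide, as JSON text.
const text =
  '{ "id": 1, "name": "Foo", "price": 123, "tags": ["Bar", "Eek"],' +
  ' "stock": { "warehouse": 300, "retail": 20 } }';

// JSON.parse turns JSON text into a JavaScript object.
const product = JSON.parse(text);

// Nested objects and arrays are reached with ordinary property access.
const totalStock = product.stock.warehouse + product.stock.retail;
const firstTag = product.tags[0];

// JSON.stringify turns the object back into JSON text.
const roundTrip = JSON.stringify(product);

console.log(totalStock, firstTag, roundTrip);
```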

MongoDB - Overview
• BSON – binary-encoded serialization of JSON-like documents (more at
http://bsonspec.org/)
• Schemaless
• Embedded documents and arrays reduce the need for joins
• Scalable – replication and sharding
• Best features of key/value stores, document databases
and relational databases in one

MongoDB - Overview
• Name stems from "humongous"
• Developed by 10gen
• Written in C++
• Understands JavaScript
• SpiderMonkey JavaScript engine for server-side
JavaScript execution
• Lots of language drivers available

Blog Post - Document Model
{ _id: 1234,
  author: { name: "Bob Davis", email: "bob@bob.com" },
  post: "In these troubled times I like to …",
  date: { $date: "2010-07-12 13:23UTC" },
  location: [ -121.2322, 42.1223222 ],
  rating: 2.2,
  comments: [
    { user: "jgs32@hotmail.com", upVotes: 22, downVotes: 14,
      text: "Great point! I agree" },
    { user: "holly.davidson@gmail.com", upVotes: 421, downVotes: 22,
      text: "You are an idiot" } ],
  tags: [ "Politics", "Virginia" ]
}
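
To show why embedding reduces the need for joins, here is a minimal sketch that works on a trimmed copy of the blog post document as a plain JavaScript object. The helper `netVotes` is hypothetical and not part of MongoDB.

```javascript
// A trimmed copy of the blog post document from the slide.
const post = {
  _id: 1234,
  author: { name: "Bob Davis", email: "bob@bob.com" },
  comments: [
    { user: "jgs32@hotmail.com", upVotes: 22, downVotes: 14, text: "Great point! I agree" },
    { user: "holly.davidson@gmail.com", upVotes: 421, downVotes: 22, text: "You are an idiot" },
  ],
  tags: ["Politics", "Virginia"],
};

// The author is embedded, so no second lookup (join) is needed.
const authorName = post.author.name;

// Comments are embedded as an array, so they can be summarized in place.
// netVotes is a hypothetical helper for this example.
function netVotes(comment) {
  return comment.upVotes - comment.downVotes;
}
const netVotesPerComment = post.comments.map(netVotes);

console.log(authorName, netVotesPerComment);
```

In a relational model the author and the comments would typically live in separate tables and be joined at read time; here one document read returns all of them.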

NoSQL vs. RDBMS terminology
MySQL          NoSQL (MongoDB)
Database       Database
Table          Collection
Index          Index
Row            Document
Column         Field
Join           Embedding and Linking
Primary Key    _id field
Group By       Aggregation

Installing Mongo DB
• Mongo Distributions
– OS X, Linux , Windows, Solaris
– Runs on commodity hardware

Installing Mongo DB
• Download Mongo DB server
– http://www.mongodb.org/downloads
– http://www.mlnsitelabs.com/mongodb
– Extract the bin folder to C:\MongoDB
• Create Data Folder - C:\MongoDbData
• To start from the command line
– Run Mongo.Bat
• To install as a Windows service
– Run MongoService.bat from the command line.

Installing MongoVue GUI
• Download MongoVue GUI tool
– http://www.mongovue.com/downloads/
– http://www.mlnsitelabs.com/mongodb

System components
• mongod.exe – database server
• mongo.exe – shell
• mongos.exe – sharding router

Learning MongoDB Shell
• Interactive JavaScript shell
• Use online browser shell
– http://try.mongodb.org/
• Or run from command line
– mongo --host localhost --port 27017

Learning Shell Commands
• Create Database
– use student;
– db.scores.find();
– The database is created when the first document is saved

• Inserting a document into a collection
– var student = {name: 'Jim', scores: [75, 99, 87.2]};
– db.scores.save(student);
– var student = {name: 'John', scores: [35, 45, 55]};
– db.scores.save(student);

• Querying a collection
– db.scores.find();
– db.scores.find({scores: {'$gt': 15}});

• Updating a document
– db.scores.update({name: 'Jim'}, {name: 'Jim', scores: [92, 34, 54]});

• Deleting a document
– db.scores.remove({name: 'Jim'});
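
The update above passes a whole document as its second argument, which replaces the stored document (apart from _id) rather than merging into it. The sketch below imitates that behavior on plain JavaScript objects and contrasts it with a field-level change of the kind the $set operator performs. The helpers and the `grade` field are hypothetical illustrations, not MongoDB's implementation.

```javascript
// Hypothetical helpers that mimic two update styles on a plain object.

// Full replacement: everything except _id comes from the new document.
function replaceDocument(stored, replacement) {
  return Object.assign({ _id: stored._id }, replacement);
}

// Field-level change (the effect of $set): untouched fields survive.
function setFields(stored, fields) {
  return Object.assign({}, stored, fields);
}

// "grade" is a made-up extra field used to show the difference.
const stored = { _id: 1, name: "Jim", scores: [75, 99, 87.2], grade: "A" };

const replaced = replaceDocument(stored, { name: "Jim", scores: [92, 34, 54] });
const patched = setFields(stored, { scores: [92, 34, 54] });

console.log(replaced); // no grade field any more
console.log(patched);  // grade field is kept
```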

Lab #1 - Shell Commands
• http://www.mlnsitelabs.com/mongodb/Labs/
Lab1

Let's do it!

Data Types
• string
• integer
• boolean
• double
• null
• array
• object
• binary data
• regular expression

Query Selectors
• Selectors
– $ne
– $lt
– $lte
– $gt
– $gte
– $in
– $nin
– $all

• Creating an index
– db.scores.ensureIndex({name: 1})
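
To make the selectors listed above more concrete, here is a simplified matcher written in plain JavaScript. It is a sketch under assumptions, not MongoDB's implementation: `matchesSelector` and `find` are hypothetical names, and only the eight selectors from the slide are handled. It does reflect one real rule: on array fields, a comparison selector matches when any element satisfies it.

```javascript
// Hypothetical, simplified matcher for a single field.
function matchesSelector(value, selector) {
  // Treat scalars as one-element arrays so both cases share the same code.
  const values = Array.isArray(value) ? value : [value];
  const tests = {
    $lt: (arg) => values.some((v) => v < arg),
    $lte: (arg) => values.some((v) => v <= arg),
    $gt: (arg) => values.some((v) => v > arg),
    $gte: (arg) => values.some((v) => v >= arg),
    $ne: (arg) => values.every((v) => v !== arg),
    $in: (arg) => values.some((v) => arg.includes(v)),
    $nin: (arg) => values.every((v) => !arg.includes(v)),
    $all: (arg) => arg.every((wanted) => values.includes(wanted)),
  };
  // Every selector in the object must hold.
  return Object.keys(selector).every((op) => tests[op](selector[op]));
}

// find() analogue over an in-memory array of documents.
function find(docs, field, selector) {
  return docs.filter((doc) => matchesSelector(doc[field], selector));
}

// The two documents saved in the shell examples.
const scores = [
  { name: "Jim", scores: [75, 99, 87.2] },
  { name: "John", scores: [35, 45, 55] },
];

const highScorers = find(scores, "scores", { $gt: 90 }).map((d) => d.name);
const notJim = find(scores, "name", { $ne: "Jim" }).map((d) => d.name);
const has35And45 = find(scores, "scores", { $all: [35, 45] }).map((d) => d.name);

console.log(highScorers, notJim, has35And45);
```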

Indexes
• What is an index?
– A structure that allows you to quickly locate
documents based on the values stored in certain
specified fields.
• Indexes enhance query performance

Indexes
• MongoDB Indexes
– Indexes are defined at the per-collection level.
– B-Tree Indexes
– Compound indexes with multiple fields
• db.scores.ensureIndex( { name: 1, id: 1 } );
– Unique Index
• db.addresses.ensureIndex( { "user_id": 1 }, { unique: true } )
– Sparse Index
• db.addresses.ensureIndex( { "xmpp_id": 1 }, { sparse: true } )
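
The following sketch illustrates why an index speeds up lookups. MongoDB uses B-trees; as a stand-in, this example keeps a sorted array of keys and uses binary search, which shows the same idea of avoiding a scan of every document. `buildIndex` and `lookup` are hypothetical names, and the document for "Ann" is made-up sample data.

```javascript
// Hypothetical in-memory index: a sorted list of { key, position } entries.
function buildIndex(docs, field) {
  return docs
    .map((doc, position) => ({ key: doc[field], position }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Binary search over the index: O(log n) comparisons instead of a full scan.
function lookup(index, docs, key) {
  let low = 0;
  let high = index.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    if (index[mid].key === key) return docs[index[mid].position];
    if (index[mid].key < key) low = mid + 1;
    else high = mid - 1;
  }
  return null; // no document with that key
}

const students = [
  { name: "John", scores: [35, 45, 55] },
  { name: "Jim", scores: [75, 99, 87.2] },
  { name: "Ann", scores: [88, 91, 79] },
];

const nameIndex = buildIndex(students, "name");
const found = lookup(nameIndex, students, "Jim");
const missing = lookup(nameIndex, students, "Zoe");

console.log(found, missing);
```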

Map Reduce
• Pattern to allow computations to be
parallelized over a cluster.
• Comparable to GROUP BY in SQL

Map Reduce
• Write two functions – Map and Reduce
• Write them in JavaScript
• Map Function
– Called once per document – emits key/value pairs
• Reduce Function
– Called per emitted key, with an array of values
– May be called more than once for the same key, so its result
must have the same shape as the emitted values
• Finalize (optional)
– Allows final post-processing of the reduced data set (e.g. rounding).

Map Reduce
• User Profile
{
"_id" : ObjectId("505e717a6794e396ac493e37"),
"UserId" : NumberLong(5209704),
"Browser" : "Microsoft Internet Explorer",
"Gender" : "M",
"CountryCode" : "US",
"State" : "FL",
"City" : "Spring Hill"
}

• Count the users from California by Browser and Gender

Map Reduce
• Map Function
function() {
    var key = { Browser: this.Browser, Gender: this.Gender };
    emit(key, { Count: 1 });
}

• Reduce Function
function(key, values) {
    var cnt = 0;
    values.forEach(function(value) {
        cnt += value.Count;
    });
    return { Count: cnt };
}
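
To see what the two functions above compute, the sketch below runs them through a tiny in-memory driver. This is an assumption-laden stand-in for MongoDB's server-side mapReduce: the `mapReduce` driver, the global `emit`, and the sample profiles are all made up for this example. The California restriction is applied as a query filter, since the map function itself does not check the State field.

```javascript
// Hypothetical in-memory stand-in for MongoDB's server-side mapReduce.
// emit() is global here so the slide's map function can be used unchanged.
let groups = new Map(); // serialized key -> { key, values }
function emit(key, value) {
  const id = JSON.stringify(key);
  if (!groups.has(id)) groups.set(id, { key: key, values: [] });
  groups.get(id).values.push(value);
}

function mapReduce(docs, map, reduce, query) {
  groups = new Map();
  // Only documents passing the query are mapped; `this` is the document.
  docs.filter(query).forEach(function (doc) { map.call(doc); });
  // Reduce each key's list of values to a single value.
  return Array.from(groups.values()).map(function (group) {
    return { _id: group.key, value: reduce(group.key, group.values) };
  });
}

// Map and reduce functions from the slide.
function map() {
  var key = { Browser: this.Browser, Gender: this.Gender };
  emit(key, { Count: 1 });
}
function reduce(key, values) {
  var cnt = 0;
  values.forEach(function (value) { cnt += value.Count; });
  return { Count: cnt };
}

// Made-up sample profiles (only the fields the functions need).
const profiles = [
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Firefox", Gender: "M", State: "CA" },
  { Browser: "Chrome", Gender: "F", State: "FL" },
];

const counts = mapReduce(profiles, map, reduce, function (doc) {
  return doc.State === "CA";
});
console.log(JSON.stringify(counts));
```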

Lab#2 – Map/Reduce
• CSV file – user profile information
• Count the users by Browser and Gender
• Download
– http://www.mlnsitelabs.com/mongodb/Labs/Lab2

• Map/Reduce is a big hammer
– Sum, Average
– Avoid JavaScript overhead if you can
• Aggregation Framework
– Specify a pipeline
– Pipeline = series of operations
– Collections run through a pipeline to produce
aggregated result

• $match
– Uses query predicate
• $project
– Uses a template document to select and reshape the fields of the result
• $unwind
– Hands out the array elements one at a time
• $group
– Aggregates items into group defined by a key

• $sort
– sort the result
• $limit
– Limit the number of documents to pass
• $skip
– Skip over the specified number of documents

Lab#3 – Aggregation framework
• CSV file – user profile information
• { aggregate : 'UserProfileInfo',
    pipeline : [
      { $match : { State : 'CA' } },
      { $group : { _id : { Browser : '$Browser', Gender : '$Gender' },
                   Count : { $sum : 1 } } },
      { $project : { _id : 0, Browser : '$_id.Browser',
                     Gender : '$_id.Gender', Count : 1 } }
    ] }
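
The pipeline above can be read as three array transformations applied in order. The sketch below mirrors each stage in plain JavaScript on made-up sample data; the function names are hypothetical, and this is an illustration of the data flow, not how the aggregation framework is implemented.

```javascript
// Each function takes an array of documents and returns a new array.

// Like { $match : { State : state } }
function matchState(docs, state) {
  return docs.filter((doc) => doc.State === state);
}

// Like the $group stage: one output document per Browser/Gender pair,
// with Count incremented once per input document ($sum : 1).
function groupByBrowserAndGender(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const id = { Browser: doc.Browser, Gender: doc.Gender };
    const key = JSON.stringify(id);
    if (!groups.has(key)) groups.set(key, { _id: id, Count: 0 });
    groups.get(key).Count += 1;
  }
  return Array.from(groups.values());
}

// Like the $project stage: drop _id and lift its parts to top-level fields.
function projectFlat(docs) {
  return docs.map((doc) => ({
    Browser: doc._id.Browser,
    Gender: doc._id.Gender,
    Count: doc.Count,
  }));
}

// Made-up sample profiles.
const userProfiles = [
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Firefox", Gender: "M", State: "CA" },
  { Browser: "Chrome", Gender: "F", State: "FL" },
];

const result = projectFlat(groupByBrowserAndGender(matchState(userProfiles, "CA")));
console.log(JSON.stringify(result));
```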

Replication
• Data Redundancy
• Automated Failover /HA
• Read Scaling
• Master – Slave Replication
– Master handles writes
– Slave handles reads

Replication

[Diagram: the Client sends Writes to the Master; two Slaves replicate from the Master]

Sharding
• Partitioning of data among multiple machines
• Enables Horizontal Scaling – writes per second
• Partition a collection by specifying a shard key – e.g. _id, timestamp
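
As a rough picture of how a shard key decides where a document lives, the sketch below routes key values to shards by range. It assumes range-based partitioning; the chunk boundaries, shard names, and `routeByShardKey` are hypothetical and chosen only for illustration.

```javascript
// Hypothetical chunk table: each shard owns a half-open range [min, max)
// of shard-key values.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard1" },
  { min: 1000, max: 2000, shard: "shard2" },
  { min: 2000, max: Infinity, shard: "shard3" },
];

// The router's job in this sketch: pick the shard whose range contains the key.
function routeByShardKey(shardKeyValue) {
  const chunk = chunks.find(
    (c) => shardKeyValue >= c.min && shardKeyValue < c.max
  );
  return chunk.shard;
}

// Documents with different key values land on different machines.
const targets = [42, 1000, 1999, 5209704].map(routeByShardKey);
console.log(targets);
```

Because each write goes to only one shard, adding shards spreads the write load, which is the horizontal scaling the slide refers to.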

Sharding

[Diagram: Client → Router (mongos) → Shard 1, Shard 2, Shard 3, plus a Config server]

GridFS
• Specification for storing large files in
MongoDB
• BSON documents in MongoDB are limited to 16 MB in size
• GridFS – divides large files among multiple
documents
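
The idea of dividing a file among multiple documents can be sketched as follows. This example assumes a Node.js runtime (for `Buffer`) and uses the field names of the GridFS chunks collection (files_id, n, data); the helper functions, the file id, and the tiny chunk size are made up for the demonstration.

```javascript
// Hypothetical chunker: splits a Buffer into fixed-size pieces,
// one piece per chunk document.
function splitIntoChunks(fileId, data, chunkSize) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n += 1) {
    chunks.push({
      files_id: fileId,                               // which file the chunk belongs to
      n: n,                                           // sequence number of the chunk
      data: data.subarray(offset, offset + chunkSize) // the chunk's bytes
    });
  }
  return chunks;
}

// Reassemble a file by sorting its chunks on n and concatenating the bytes.
function joinChunks(chunks) {
  const ordered = chunks.slice().sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map((c) => c.data));
}

const file = Buffer.from("0123456789");            // stand-in for a large file
const pieces = splitIntoChunks("file-1", file, 4); // tiny chunk size for the demo
const restored = joinChunks(pieces.slice().reverse());

console.log(pieces.length, restored.toString());
```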

References
• Mongo Cookbook
– http://cookbook.mongodb.org/
• NoSQL Distilled: A Brief Guide to the Emerging
World of Polyglot Persistence
– http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
• Seven Databases in Seven Weeks
– http://www.amazon.com/Seven-Databases-Weeks-Modern-Movement/dp/1934356921

Contacts
{
name : “Sundar Nathikudi”
mail: mln@mlnadvertising.com
website: http://www.mlnadvertising.com
}
