Big Data Tools Workshop
  Introduction to MongoDB
About Me
Sundar Nathikudi
  – Co-Founder & CTO, MLN Advertising
  – Former Principal Engineer, AOL Advertising
About MLN Advertising
• Baltimore based Online Advertising Startup
• MediaGlu - Cloud based Ad Platform
• Our engineers use
Why Does MediaGlu Need Big Data?
MediaGlu takes a holistic approach to marketing and advertising - by using real data and user
path tracking, our state of the art technology gives marketers the tools to find their
customers wherever they are, in Search, Social & Content sites.

• Data Management
  – Tag Management: One tag tracks all your digital assets (website, social media, etc).
  – Media Attribution: Track user actions and attribute them, even from sources like Facebook and Pinterest.
  – Reporting: See how user actions connect across the digital space in one intuitive interface.
• Discover Insights
  – Channel Analytics: Visualize the effectiveness of your ad campaigns across media by learning what interactions actually drive revenue.
  – User Analytics: Visualize the timeline in which users interact with your brand.
  – Predictive Analytics: Visualize how channel and user path metrics overlap into predictable behavior.
• Action
  – Campaign Management: Let MediaGlu improve your media bidding and creative scheduling in Real Time.
  – Budget Optimization: Get reports detailing which channels are providing the most value.
  – Personalize Web/Social Experience: Custom tailored brand interactions based on known and predicted behavior.
Agenda
• Big Data and NoSQL Basics
• MongoDB Fundamentals
• Running MongoDB
• Lab #1 - Shell Commands
• Lab #2 - MongoDB Map /Reduce and
  Aggregation Framework
• Replication and Sharding
Era of Big Data
• Facebook
  – 2.7 billion likes made daily on and off of the
    Facebook site
  – 300 million photos uploaded
  – 500+ terabytes of new data "ingested"
• Twitter
  – 340 million tweets daily
  – 500 Million Users
Big Data - Definition
• Volumes & Volumes of data
• Data does not fit on one Rack.
• Unstructured or Semi Structured
RDBMS & Big Data
• Pros
   – Oracle, SQL Server, MySQL, etc.
   – Good for structured data and relational model
   – Supports Partitioning
   – ACID
   – Transactions
• Cons
   – Joins make it difficult for horizontal scaling
   – Vertical scaling is limited by physics and cost
   – Hard to scale vertically in cloud.
NoSQL – Not Only SQL
• Not using the relational model (nor the SQL
  language)
• Open source
• Designed to run on large clusters
• No schema, allowing fields to be added to any
  record without controls
• Based on the needs of web 2.0 properties
• Rise of NoSQL = Polyglot Persistence
NoSQL – Data Models
• Key Value Stores
    – data is stored in Key-Value pairs
    – support get, put, and delete operations based on a primary key
    – Couchbase (Membase), Redis, Riak
• Document
    – store data in structured “documents” such as JSON/XML, with no support for
      relationships/joins
    – MongoDB, CouchDB, SimpleDB
• Column Family (Big Table)
    – contains columns of related data
    – HBase, Cassandra
• Graph
    – organize data into node and edge graphs; they work best for data that has
      complex relationship structures
    – Facebook social graph
    – Neo4J
NoSQL – CAP Theorem
• Consistency – all nodes see the same data at the same
  time.
• Availability – node failures do not prevent the surviving nodes
  from serving writes/reads
• Partition Tolerance – the system continues to operate when
  network failures cut off communication between groups of nodes

     Eric Brewer – “distributed system can satisfy any two
       of these guarantees at the same time, but not all
                             three”
NoSQL – Triangle of compromise
What’s a document database
• Composed of Documents – Self describing
• Schema Free
• Store arbitrary data – Collections, trees
What is JSON??
• JavaScript Object Notation
• Lightweight data-interchange format
• Elements of JSON
   – Object : K/V Pairs
   – Key, String
   – Value
      •   Number
      •   String
      •   Boolean
      •   Array
      •   Object
• Example
   { "id": 1,
     "name": "Foo",
     "price": 123,
     "tags": [ "Bar", "Eek" ],
     "stock": { "warehouse": 300,
                "retail": 20 }
   }
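A minimal sketch of reading the example object from this slide in JavaScript. It uses only the built-in JSON.parse and JSON.stringify; the variable names are illustrative and not part of the original deck.

```javascript
// The JSON text from the slide, held as a string.
const text =
  '{ "id": 1, "name": "Foo", "price": 123, ' +
  '"tags": ["Bar", "Eek"], ' +
  '"stock": { "warehouse": 300, "retail": 20 } }';

// JSON.parse turns the text into a plain JavaScript object.
const product = JSON.parse(text);

// Values can be numbers, strings, arrays, or nested objects.
const totalStock = product.stock.warehouse + product.stock.retail;
const firstTag = product.tags[0];

// JSON.stringify serializes the object back to JSON text.
const roundTrip = JSON.stringify(product);

console.log(totalStock, firstTag, roundTrip);
```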
MongoDB - Overview
• BSON – Binary-encoded serialization of
  JSON-like documents (more at
  http://bsonspec.org/)
• Schema Less
• Embedded documents and arrays reduce need
  for joins
• Scalable – Replication and Sharding
• Best features of key/value stores, document
  databases and relational databases in one.
MongoDB - Overview
• Name stems from humongous
• Developed by 10gen
• Written in C++
• Understands JavaScript
• SpiderMonkey JavaScript engine for server-side
  JavaScript execution
• Lots of language drivers available
Languages Supported
Blog Post - Relational Model
Blog Post - Document Model
{ _id: 1234,
 author: { name: "Bob Davis", email : "bob@bob.com" },
 post: "In these troubled times I like to …",
 date: { $date: "2010-07-12 13:23UTC" },
 location: [ -121.2322, 42.1223222 ],
 rating: 2.2,
 comments: [ { user: "jgs32@hotmail.com", upVotes: 22, downVotes: 14,
               text: "Great point! I agree" },
             { user: "holly.davidson@gmail.com", upVotes: 421, downVotes: 22,
               text: "You are an idiot" } ],
 tags: [ "Politics", "Virginia" ]
}
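As a rough illustration of why embedding reduces the need for joins, the sketch below holds the blog post as a plain JavaScript object and reads the author and comments directly. It runs in memory only (no database involved), the date is simplified to a string, and the variable names are assumptions made for this example.

```javascript
// The blog post from the slide as a plain object (date simplified to a string).
const post = {
  _id: 1234,
  author: { name: "Bob Davis", email: "bob@bob.com" },
  post: "In these troubled times I like to …",
  date: "2010-07-12 13:23UTC",
  location: [-121.2322, 42.1223222],
  rating: 2.2,
  comments: [
    { user: "jgs32@hotmail.com", upVotes: 22, downVotes: 14, text: "Great point! I agree" },
    { user: "holly.davidson@gmail.com", upVotes: 421, downVotes: 22, text: "You are an idiot" },
  ],
  tags: ["Politics", "Virginia"],
};

// Embedded document: the author is reached without a join.
const authorName = post.author.name;

// Embedded array: the comments are aggregated in place.
const totalUpVotes = post.comments.reduce((sum, c) => sum + c.upVotes, 0);
const topComment = post.comments.reduce((best, c) => (c.upVotes > best.upVotes ? c : best));

console.log(authorName, totalUpVotes, topComment.user);
```

In the relational model the same reads would touch separate author, comment, and tag tables.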
RDBMS vs. MongoDB terminology
  MySQL         MongoDB
  Database      Database
  Table         Collection
  Index         Index
  Row           Document
  Column        Field
  Join          Embedding and Linking
  Primary Key   _id field
  Group By      Aggregation
Installing Mongo DB
• Mongo Distributions
  – OS X, Linux , Windows, Solaris
  – Runs on commodity hardware
Installing Mongo DB
• Download Mongo DB server
  – http://www.mongodb.org/downloads
  – http://www.mlnsitelabs.com/mongodb
  – Extract the bin folder to C:\MongoDB
• Create Data Folder - C:MongoDbData
• To Start from command line
  – Run Mongo.Bat
• To install as a Windows service
  – Run MongoService.bat from command line.
Installing MongoVue GUI
• Download MongoVue GUI tool
  – http://www.mongovue.com/downloads/
  – http://www.mlnsitelabs.com/mongodb
System components
• mongod.exe – database server
• mongo.exe – shell
• mongos.exe – sharding router
Learning MongoDB Shell
• Interactive JavaScript shell
• Use online browser shell
  – http://try.mongodb.org/
• Or run from command line
  – mongo localhost:27017
Learning Shell Commands
• Create Database
  – use student;
  – db.scores.find();
  – The database is created when the first document is inserted



• Inserting a document into collection
  –   var student = {name: 'Jim', scores: [75, 99, 87.2]};
  –   db.scores.save(student);
  –   var student = {name: 'John', scores: [35, 45, 55]};
  –   db.scores.save(student);
Learning Shell Commands
• Querying a collection
  – db.scores.find();
  – db.scores.find({scores: {'$gt': 15}});

• Updating a document
  – db.scores.update({name : 'Jim'},{name: 'Jim', scores: [92,34,54]});



• Deleting a document
  – db.scores.remove({name: 'Jim'});
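The semantics of these three commands can be sketched with plain arrays. This is an in-memory stand-in, not the MongoDB driver or shell API: the helper names (findScoresGreaterThan, replaceFirstByName, removeByName) are hypothetical, and the sketch assumes the shell defaults of that era, where update without operators replaces the first matching document and remove deletes every match.

```javascript
// In-memory stand-in for the "scores" collection (not a real driver API).
const scores = [
  { _id: 1, name: "Jim", scores: [75, 99, 87.2] },
  { _id: 2, name: "John", scores: [35, 45, 55] },
];

// find({scores: {$gt: n}}): on an array field, a document matches when
// at least one element satisfies the condition.
function findScoresGreaterThan(docs, n) {
  return docs.filter((d) => d.scores.some((s) => s > n));
}

// update(query, replacement) without operators replaces the whole first
// matching document; only _id is preserved.
function replaceFirstByName(docs, name, replacement) {
  const i = docs.findIndex((d) => d.name === name);
  if (i === -1) return docs;
  const copy = docs.slice();
  copy[i] = { _id: docs[i]._id, ...replacement };
  return copy;
}

// remove(query) deletes every matching document.
function removeByName(docs, name) {
  return docs.filter((d) => d.name !== name);
}

const above90 = findScoresGreaterThan(scores, 90);
const updated = replaceFirstByName(scores, "Jim", { name: "Jim", scores: [92, 34, 54] });
const remaining = removeByName(updated, "Jim");

console.log(above90.map((d) => d.name), updated[0].scores, remaining.length);
```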
Lab #1 - Shell Commands
• http://www.mlnsitelabs.com/mongodb/Labs/Lab1


            Let's Do it!!
Data Types
•   string
•   integer
•   boolean
•   double
•   null
•   array
•   object
•   binary data
•   regular expression
Query Selectors
• Selectors
  – $ne
  – $lt
  – $lte
  – $gt
  – $gte
  – $in
  – $nin
  – $all
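To make the selectors concrete, here is a simplified in-memory matcher written for this note. It is a sketch, not MongoDB's implementation: it assumes scalar field values for every selector except $all, ignores missing-field rules, and the sample documents and the matches helper are hypothetical.

```javascript
// One predicate per selector; v is the field value, arg is the selector argument.
const operators = {
  $ne: (v, arg) => v !== arg,
  $lt: (v, arg) => v < arg,
  $lte: (v, arg) => v <= arg,
  $gt: (v, arg) => v > arg,
  $gte: (v, arg) => v >= arg,
  $in: (v, arg) => arg.includes(v),
  $nin: (v, arg) => !arg.includes(v),
  // $all applies to array fields: every listed value must be present.
  $all: (v, arg) => Array.isArray(v) && arg.every((x) => v.includes(x)),
};

// A query such as { age: { $gte: 20, $lt: 40 } } matches when every
// selector on every listed field holds.
function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) =>
    Object.entries(condition).every(([op, arg]) => operators[op](doc[field], arg))
  );
}

const users = [
  { name: "Ann", age: 31, state: "CA", tags: ["a", "b"] },
  { name: "Bo", age: 19, state: "FL", tags: ["a"] },
  { name: "Cy", age: 45, state: "NY", tags: ["b", "c"] },
];

const names = (query) => users.filter((u) => matches(u, query)).map((u) => u.name);

console.log(names({ age: { $gte: 20, $lt: 40 } }));
console.log(names({ state: { $in: ["CA", "NY"] } }));
console.log(names({ tags: { $all: ["a", "b"] } }));
```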
Learning Shell Commands
• Creating an index
  – db.scores.ensureIndex({name: 1})
Indexes
• What is an Index??
  – structure that allows you to quickly locate
    documents based on the values stored in certain
    specified fields.
• Indexes enhance query performance
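The idea behind an index can be sketched in a few lines: keep the indexed values sorted so a lookup needs a binary search rather than a scan of every document. This is an illustration only; MongoDB's real indexes are B-trees, and the names below (nameIndex, findByName) are made up for the example.

```javascript
const docs = [
  { name: "John", scores: [35, 45, 55] },
  { name: "Jim", scores: [75, 99, 87.2] },
  { name: "Ann", scores: [90, 91] },
];

// Index on "name": sorted (key, position) pairs, built once.
const nameIndex = docs
  .map((doc, position) => ({ key: doc.name, position }))
  .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

// Binary search over the index: O(log n) comparisons instead of a full scan.
function findByName(name) {
  let lo = 0;
  let hi = nameIndex.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const entry = nameIndex[mid];
    if (entry.key === name) return docs[entry.position];
    if (entry.key < name) lo = mid + 1;
    else hi = mid - 1;
  }
  return null; // no document with that name
}

console.log(findByName("Jim"), findByName("Zed"));
```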
Indexes
• Mongo DB Indexes
  – Indexes are defined per collection.
  – B-Tree Indexes
  – Compound indexes with multiple fields
     • db.scores.ensureIndex({ name: 1, id: 1 });
  – Unique Index
     • db.addresses.ensureIndex( { "user_id": 1 }, { unique: true } )
  – Sparse Index
     • db.addresses.ensureIndex( { "xmpp_id": 1 }, { sparse: true } )
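The difference between unique and sparse indexes can be sketched with a toy index built on a Map. This is an assumption-laden illustration, not MongoDB internals: buildIndex is a hypothetical helper, and the sample addresses are invented. It follows the documented behavior that a non-sparse index records a missing field as null, so a unique non-sparse index allows at most one document without the field.

```javascript
// Build a toy index: a Map from field value to the documents holding it.
function buildIndex(docs, field, { unique = false, sparse = false } = {}) {
  const index = new Map();
  for (const doc of docs) {
    const hasField = Object.prototype.hasOwnProperty.call(doc, field);
    if (!hasField && sparse) continue; // sparse: skip documents lacking the field
    const key = hasField ? doc[field] : null; // non-sparse: missing field indexed as null
    if (unique && index.has(key)) {
      throw new Error(`duplicate key for ${field}: ${key}`);
    }
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

const addresses = [
  { user_id: 1, xmpp_id: "a@example.com" },
  { user_id: 2 },
  { user_id: 3 },
];

// Every document has a distinct user_id, so the unique index builds.
const byUser = buildIndex(addresses, "user_id", { unique: true });

// Only one document has xmpp_id, so the sparse index holds a single entry.
const byXmpp = buildIndex(addresses, "xmpp_id", { sparse: true });

console.log(byUser.size, byXmpp.size);
```

Building a unique, non-sparse index on xmpp_id would fail in this sketch, because two documents lack the field and both map to null.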
Map Reduce
• Pattern to allow computations to be
  parallelized over a cluster.
• Plays the role of Group By in SQL
Map Reduce
• Write two functions – Map and Reduce
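• Write them in JavaScript
• Map Function :
   – Called once per document – emits key and value pairs
• Reduce Function
   – Called per key emitted, with an array of values; it may run more than
     once for the same key, so it returns the same shape it receives
• Finalize (optional)
   – Called once per key after reduce to post-process the reduced value
     (for example, rounding).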
Map Reduce
• User Profile
{
    "_id" : ObjectId("505e717a6794e396ac493e37"),
    "UserId" : NumberLong(5209704),
    "Browser" : "Microsoft Internet Explorer",
    "Gender" : "M",
    "CountryCode" : "US",
    "State" : "FL",
    "City" : "Spring Hill"
}

• Count the users from California by Browser and Gender
Map Reduce
• Map Function
   – function() {
         var key = { Browser: this.Browser, Gender: this.Gender };
         emit(key, { Count: 1 });
     }

• Reduce Function
   – function(key, values) {
         var cnt = 0;
         values.forEach(function(value) {
             cnt += value.Count;
         });
         return { Count: cnt };
     }
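To see how the two functions cooperate, the sketch below runs them over a handful of in-memory documents. It imitates what the server does (bind each document to this, collect emit calls, reduce per key) but is not MongoDB's implementation; the sample users are invented, and the State filter stands in for a query option that restricts the input to California.

```javascript
// Sample documents shaped like the User Profile slide (values are made up).
const users = [
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Firefox", Gender: "M", State: "CA" },
  { Browser: "Chrome", Gender: "M", State: "FL" },
];

// emit() collects values per key; keys are objects, so they are serialized.
const emitted = new Map();
function emit(key, value) {
  const id = JSON.stringify(key);
  if (!emitted.has(id)) emitted.set(id, { key, values: [] });
  emitted.get(id).values.push(value);
}

// Map and reduce, as on the slide.
function map() {
  const key = { Browser: this.Browser, Gender: this.Gender };
  emit(key, { Count: 1 });
}
function reduce(key, values) {
  let cnt = 0;
  values.forEach((value) => {
    cnt += value.Count;
  });
  return { Count: cnt };
}

// Restrict the input to California, then call map with each document as `this`.
users.filter((u) => u.State === "CA").forEach((u) => map.call(u));

// Reduce once per distinct key.
const results = [...emitted.values()].map(({ key, values }) => ({
  _id: key,
  value: reduce(key, values),
}));

console.log(JSON.stringify(results));
```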
Lab#2 – Map/Reduce
• CSV file– user profile information
• Count the users by Browser and Gender
• Download
  • http://www.mlnsitelabs.com/mongodb/Labs/Lab2
Aggregation Framework
• Map/Reduce is a big hammer
  – Overkill for simple aggregates such as Sum and Average
  – Avoid JavaScript overhead if you can
• Aggregation Framework
  – Specify a pipeline
  – Pipeline = series of operations
  – Collections run through a pipeline to produce
    aggregated result
Aggregation Framework
• $match
  – Uses query predicate
• $project
  – Reshapes each document: includes, excludes, renames or computes fields
• $unwind
  – Hands out the array elements one at a time
• $group
  – Aggregates items into group defined by a key
Aggregation Framework
• $sort
  – sort the result
• $limit
  – Limit the number of documents to pass
• $skip
  – Skip over the specified number of documents
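The effect of $unwind, $sort, $skip and $limit can be imitated with array methods. The sketch below is an in-memory approximation for illustration, not the aggregation engine; the student documents reuse the names from the earlier shell slides, and the stage order chosen here is just one possible pipeline.

```javascript
const students = [
  { name: "Jim", scores: [75, 99, 87.2] },
  { name: "John", scores: [35, 45, 55] },
];

// $unwind on "scores": one output document per array element,
// with the array field replaced by that single element.
const unwound = students.flatMap((s) =>
  s.scores.map((score) => ({ name: s.name, scores: score }))
);

// $sort by scores, descending.
const sorted = [...unwound].sort((a, b) => b.scores - a.scores);

// $skip 1 then $limit 3: drop the first document, pass on the next three.
const page = sorted.slice(1).slice(0, 3);

console.log(unwound.length, page.map((d) => d.scores));
```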
Lab#3 – Aggregation framework
• CSV file– user profile information
• { aggregate : 'UserProfileInfo',
    pipeline : [
      { $match : { State: 'CA' } },
      { $group : { _id: { Browser: '$Browser', Gender: '$Gender' },
                   Count: { $sum: 1 } } },
      { $project : { _id: 0, Browser: '$_id.Browser',
                     Gender: '$_id.Gender', Count: 1 } }
    ] }
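What the three stages of this pipeline compute can be traced step by step in plain JavaScript. This is an in-memory approximation written for this note, with invented sample users; it is not how the server evaluates a pipeline.

```javascript
// Sample documents shaped like the user profile data (values are made up).
const users = [
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Chrome", Gender: "F", State: "CA" },
  { Browser: "Firefox", Gender: "M", State: "CA" },
  { Browser: "Chrome", Gender: "M", State: "FL" },
];

// $match: keep only documents where State is 'CA'.
const matched = users.filter((u) => u.State === "CA");

// $group: one output document per distinct (Browser, Gender) pair,
// with Count accumulated like { $sum: 1 }.
const groups = new Map();
for (const u of matched) {
  const id = { Browser: u.Browser, Gender: u.Gender };
  const k = JSON.stringify(id);
  if (!groups.has(k)) groups.set(k, { _id: id, Count: 0 });
  groups.get(k).Count += 1;
}

// $project: drop _id and lift its parts to top-level fields.
const projected = [...groups.values()].map((g) => ({
  Browser: g._id.Browser,
  Gender: g._id.Gender,
  Count: g.Count,
}));

console.log(JSON.stringify(projected));
```

The result has the same shape as the Map/Reduce output of Lab#2, without any JavaScript running on the server.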
Replication
•   Data Redundancy
•   Automated Failover /HA
•   Read Scaling
•   Master – Slave Replication
    – Master handles writes
    – Slave handles reads
Replication
[Diagram: the Client sends Writes to the Master; the Master has two Slaves]
Sharding
• Partitioning of data among multiple machines
• Enables Horizontal Scaling – writes per second
• Partition a collection, specify a shard key – ex:
  _id, timestamp
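One way to picture how a router uses a shard key is range-based routing: each shard owns a range of key values, and the router sends a document to the shard whose range contains its key. The sketch below assumes that scheme; the ranges, the numeric key, and the routeByShardKey helper are hypothetical and chosen only for illustration.

```javascript
// Hypothetical key ranges, one per shard; min is inclusive, max is exclusive.
const chunks = [
  { min: -Infinity, max: 1000, shard: "Shard 1" },
  { min: 1000, max: 2000, shard: "Shard 2" },
  { min: 2000, max: Infinity, shard: "Shard 3" },
];

// The router picks the shard whose range contains the shard key value.
function routeByShardKey(key) {
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  return chunk.shard;
}

// Count how a batch of writes spreads across the shards.
function distribution(keys) {
  const counts = {};
  for (const key of keys) {
    const shard = routeByShardKey(key);
    counts[shard] = (counts[shard] || 0) + 1;
  }
  return counts;
}

console.log(distribution([5, 999, 1000, 1500, 2500]));
```

Under this assumption, a key that only ever grows (such as a timestamp) sends every new write to the last range, so the choice of shard key affects how well writes spread out.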
Sharding
[Diagram: Client, Router (mongos), Config, Shard 1, Shard 2, Shard 3]
GridFS
• Specification for storing large files in
  MongoDB
• BSON documents in MongoDB are limited to 16 MB
  in size
• GridFS – Divide large files among multiple
  documents
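The chunking idea can be sketched with a Node.js Buffer. This is an illustration, not the GridFS driver API: the 4-byte chunk size is a deliberately tiny, hypothetical value, the helper names are made up, and only the files_id and n fields mirror how GridFS chunk documents refer to their file and order.

```javascript
const CHUNK_SIZE = 4; // bytes; tiny hypothetical value so the result is easy to inspect

// Split a file's bytes into numbered chunk documents.
function splitIntoChunks(fileId, data, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n += 1) {
    chunks.push({ files_id: fileId, n, data: data.subarray(offset, offset + chunkSize) });
  }
  return chunks;
}

// Reassemble the file by ordering the chunks on n and concatenating them.
function reassemble(chunks) {
  const ordered = [...chunks].sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map((c) => c.data));
}

const file = Buffer.from("hello gridfs!"); // 13 bytes
const fileDoc = { _id: "file-1", length: file.length, chunkSize: CHUNK_SIZE };
const chunkDocs = splitIntoChunks(fileDoc._id, file);

console.log(chunkDocs.length, reassemble(chunkDocs).toString());
```

Each chunk document stays far below the document size limit, while the file document records the total length.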
References
• Mongo Cookbook
  – http://cookbook.mongodb.org/
• NoSQL Distilled: A Brief Guide to the Emerging
  World of Polyglot Persistence
  – http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
• Seven Databases in Seven Weeks
  – http://www.amazon.com/Seven-Databases-Weeks-Modern-Movement/dp/1934356921
Contacts
{
    name: "Sundar Nathikudi",
    mail: "mln@mlnadvertising.com",
    website: "http://www.mlnadvertising.com"
}
Thank You

MediaGlu and Mongo DB
