Cloud Computing Era
Trend Micro
Three Major Trends to Change the World

                 Cloud Computing

           Big Data           Mobile
What is Cloud Computing?
 National Institute of Standards and Technology (NIST) definition :


                                                             Service Models

                                                             Deployment Models

 Provide scalable and elastic IT-related functions to users via
 as-a-service business models and internet technologies
It’s About the Ecosystem

                        Structured, Semi-structured

                                                      Cloud Computing

   Enterprise Data Warehouse



                                                                             Big Data


                                                                          Business Insights


                                                                        Competition, Innovation,
What is Big Data?
What is the problem?

• Getting the data to the processors
  becomes the bottleneck

• Quick calculation
   – Typical disk data transfer rate:
      • 75MB/sec
   – Time taken to transfer 100GB of data
     to the processor:
      • approx. 22   minutes!
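
The quick calculation above can be reproduced with a minimal sketch. The function name and the choice of 1 GB = 1000 MB are assumptions made for illustration; with 1 GB = 1024 MB the result is closer to 23 minutes.

```python
def transfer_minutes(size_gb, rate_mb_per_s, mb_per_gb=1000):
    """Time to stream size_gb from one disk at rate_mb_per_s, in minutes."""
    seconds = size_gb * mb_per_gb / rate_mb_per_s
    return seconds / 60

minutes = transfer_minutes(100, 75)
print(f"{minutes:.1f} minutes")
```

Reading the same data in parallel from many disks divides this time by the number of disks, which is the motivation for moving the computation to where the data is stored.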
The Era of Big Data – Are You Ready

        Data for business commercial analysis
        • 2011: multi-terabyte (TB)
        • 2020: 35.2 ZB (1 ZB = 1 billion TB)
Who Needs It?
           Enterprise Database                          Hadoop

When to use?                          When to use?
•   Ad-hoc Reporting (<1sec)          •   Affordable Storage/Compute
•   Multi-step Transactions           •   Unstructured or Semi-structured
•   Lots of Inserts/Updates/Deletes   •   Resilient Auto Scalability
Hadoop – inspired by Google
• Apache Hadoop project
  – inspired by Google's MapReduce and Google File System
• Open sourced, flexible and available architecture for
  large scale computation and data processing on a
  network of commodity hardware
• Open Source Software + Commodity Hardware
  – IT Costs Reduction
Hadoop Core


              © 2011 Cloudera, Inc. All Rights Reserved.

• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
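
To make the redundancy bullet concrete, here is a rough sketch of how a file maps to replicated blocks. The 64 MB block size and replication factor of 3 are assumed defaults (both are configurable), and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = -(-file_bytes // block_bytes)   # ceiling division
    # The last block only occupies as much space as the data it holds,
    # so raw storage is simply the file size times the replication factor.
    return blocks, blocks * replication, file_bytes * replication

blocks, replicas, raw_bytes = hdfs_footprint(100 * 1024**3)
print(blocks, replicas, raw_bytes // 1024**3, "GB raw")
```

Losing one node removes only some replicas of some blocks; the remaining copies are what makes re-replication ("self healing") possible.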


 • MapReduce: two phases of functional programming (map and reduce)
 • Redundancy
 • Fault Tolerant
 • Scalable
 • Self Healing
 • Java API

Word Count Example

       Map input        Key: offset   Value: line
       Map output       Key: word     Value: count
       Reduce output    Key: word     Value: sum of count

0:The cat sat on the mat
22:The aardvark sat on the sofa
The Hadoop Ecosystem
The Ecosystem is the System

• Hadoop has become the kernel of the distributed
  operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map

                       Hue (Web Console)                    Mahout (Data Mining)

                               Oozie (Job Workflow & Scheduling)

       Flume/Sqoop (Data Integration)              Pig/Hive (Analytical Language)

       MapReduce Runtime (Dist. Programming Framework)     HBase (Column NoSQL DB)

                         Hadoop Distributed File System (HDFS)
Zookeeper – Coordination Framework

What is ZooKeeper

• A centralized service for
  – Maintaining configuration information
  – Providing distributed synchronization
• A set of tools to build distributed applications that can
  safely handle partial failures
• ZooKeeper was designed to store coordination data
  – Status information
  – Configuration
  – Location information
Flume / Sqoop – Data Integration Framework

What’s the problem for data collection

• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
  collection path
What is Flume (and how can it help?)

• A distributed data collection service
• Efficiently collects, aggregates, and moves large
  amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats
Flume Architecture

Log                                                        Log

Flume Node                                                 Flume Node


Flume Sources and Sinks

• Local Files
• Stdin, Stdout
• Twitter


• Sqoop: easy, parallel database import/export
• What do you want to do?
  – Import data from an RDBMS into HDFS
  – Export data from HDFS back into an RDBMS




Sqoop Examples

 $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City

 $ hadoop fs -cat City/part-m-00000

Pig / Hive – Analytical Language

Why Hive and Pig?

• Although MapReduce is very powerful, it can also be
  complex to master
• Many organizations have business or data analysts who
  are skilled at writing SQL queries, but not at writing Java
• Many organizations have programmers who are skilled
  at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
  to help such people analyze huge amounts of data via
  MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!
Hive – Developed by Facebook

• What is Hive?
  – An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
  summarization and ad hoc querying on top of Hadoop
  – MapReduce for execution
  – HDFS for storage
• Hive Query Language
  – Basic-SQL : Select, From, Join, Group-By
  – Equi-Join, Multi-Table Insert, Multi-Group-By
  – Batch query
 SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid




Pig – Initiated by Yahoo!

• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• Makes MapReduce programs simple to write
• Easy to understand
• Easy to debug

  A = LOAD 'a.txt' AS (id, name, age, ...);
  B = LOAD 'b.txt' AS (id, address, ...);
  C = JOIN A BY id, B BY id;
  STORE C INTO 'c.txt';




Hive vs. Pig

                     Hive                        Pig
Language             HiveQL (SQL-like)           Pig Latin, a scripting language
Schema               Table definitions that      A schema is optionally defined
                     are stored in a metastore   at runtime
Programmatic Access  JDBC, ODBC                  PigServer
WordCount Example

• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  < Hello, 1>
  < World, 1>
  < Bye, 1>
  < World, 1>
  < Hello, 1>
  < Hadoop, 1>
  < Goodbye, 1>
  < Hadoop, 1>

• The reduce step just sums up the values
  < Bye, 1>
  < Goodbye, 1>
  < Hadoop, 2>
  < Hello, 2>
  < World, 2>
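
The map and reduce steps above can be traced with a small in-memory sketch. It only imitates the data flow on a single machine; the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    """Group values by key and sort the keys, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)

lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = [reduce_phase(word, counts) for word, counts in shuffle(mapped).items()]
print(result)
```

The shuffle step is the part the slide leaves implicit: the framework groups all values for a key before calling reduce, which is why each word reaches the reducer exactly once.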
WordCount Example In MapReduce
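import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit <word, 1> for every token in the input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}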

WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH 'wordcount/input' INTO TABLE wordcount;

SELECT token, count(*) FROM wordcount GROUP BY token;
The Story So Far

    SQL         Hive                               Pig        Script

    Java        MapReduce
    Java        HDFS
                Sqoop Flume
    SQL         RDBMS FS                                      Posix

HBase – Column NoSQL DB

Structured-data vs Raw-data
HBase – Inspired by Google BigTable

• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
  –   PUT
  –   GET
  –   DELETE
  –   SCAN
HBase – Data Model

• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
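
The three data model properties can be illustrated with a toy in-memory table. This is only a sketch of the concepts: `ToyTable` and its methods are hypothetical and are not the HBase client API, versions come from a simple counter rather than timestamps, and the scan range is treated as start-inclusive, end-exclusive.

```python
import bisect
import itertools

class ToyTable:
    """In-memory stand-in for an HBase table (illustration only)."""

    def __init__(self):
        self._keys = []                   # row keys, kept sorted
        self._rows = {}                   # row -> column -> [(version, value), ...]
        self._clock = itertools.count(1)  # stand-in for cell timestamps

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, []).append((next(self._clock), value))

    def get(self, row, column, versions=1):
        """Return up to `versions` values of one cell, newest first."""
        cells = self._rows.get(row, {}).get(column, [])
        return [value for _, value in reversed(cells)][:versions]

    def delete(self, row):
        if self._rows.pop(row, None) is not None:
            self._keys.remove(row)

    def scan(self, start_key, end_key):
        """Yield rows with start_key <= row key < end_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        for row in self._keys[lo:hi]:
            latest = {col: cells[-1][1] for col, cells in self._rows[row].items()}
            yield row, latest

table = ToyTable()
table.put("row2", "mycf:col1", "val3")
table.put("row1", "mycf:col1", "val1")
table.put("row1", "mycf:col1", "val1b")   # a second version of the same cell
print(table.get("row1", "mycf:col1", versions=2))
print(list(table.scan("row1", "row3")))
```

Because rows are kept sorted by key, a scan over a key range touches a contiguous slice, which is the same reason a region can be described by just its start and end key.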
HBase Examples

hbase>   create 'mytable', 'mycf'
hbase>   list
hbase>   put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase>   put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase>   put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase>   scan 'mytable'
hbase>   disable 'mytable'
hbase>   drop 'mytable'

Oozie – Job Workflow & Scheduling

What is Oozie?

• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
• Triggered by
  – Time
  – Data

(Figure: example workflow of Job 1 through Job 5)
Oozie Features

• Component Independent
  –   MapReduce
  –   Hive
  –   Pig
  –   Sqoop
  –   Streaming

Mahout – Data Mining

What is Mahout?

• Machine-learning tool
• Distributed and scalable machine learning algorithms on
  the Hadoop platform
• Makes building intelligent applications easier and faster
Mahout Use Cases

• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targeting
• Amazon: Personalization Platform

Hue – developed by Cloudera

• Hadoop User Experience
• Apache-licensed open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI

• HUE comes with a suite of applications
  – File Browser: Browse HDFS; change permissions and
    ownership; upload, download, view and edit files.
  – Job Browser: View jobs, tasks, counters, logs, etc.
  – Beeswax: Wizards to help create Hive tables, load data, run and
    manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beeswax UI
Use case Example

• Predict what the user likes based on
  – His/Her historical behavior
  – Aggregate behavior of people similar to him/her
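
One common way to combine these two signals is user-based collaborative filtering. The sketch below is a single-machine illustration under stated assumptions: the ratings are made-up sample data, cosine similarity is one of several possible similarity measures, and none of this is Mahout's API.

```python
from math import sqrt

# Hypothetical ratings (user -> item -> score), invented for illustration.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted = total = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)   # uses the user's own history
        weighted += sim * their[item]        # aggregates similar users' behavior
        total += sim
    return weighted / total if total else None

score = predict("alice", "D")
print(f"predicted rating for alice on D: {score:.2f}")
```

The prediction lands closer to bob's rating than carol's because bob's history matches alice's more closely; distributed implementations follow the same idea but spread the similarity computation across the cluster.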

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop ecosystem
Recap – Hadoop Ecosystem

                       Hue (Web Console)                    Mahout (Data Mining)

                               Oozie (Job Workflow & Scheduling)

       Flume/Sqoop (Data Integration)              Pig/Hive (Analytical Language)

       MapReduce Runtime (Dist. Programming Framework)     HBase (Column NoSQL DB)

                         Hadoop Distributed File System (HDFS)
Trend Micro Smart Protection Network (SPN) Case Study
Collaboration in the underground
Network Threats Show Explosive Growth
Network threats such as virus variants, spam, and unknown download
sources escape detection by traditional security systems and continue to
show explosive growth.

                        New Unique Malware Discovered
New Design Concept for Threat Intelligence
                 CDN / xSP                             Human


                                                                      Web Crawler

       Trend Micro
       Mail Protection
                                                 Trend Micro
                             Trend Micro         Endpoint Protection
                             Web Protection

                   150M+ Worldwide Endpoints/Sensors
Challenges We Face
             The Concept is Great but ….
  6TB of data and 15B lines of logs received daily by SPN

           It becomes the Big Data Challenge!
Issues to Address

 Raw Data           Information   Intelligence/Solution

 Volume: Infinite
 Time: No Delay
 Target: Keep Changing Threats
SPN High Level Architecture

                         CDN Log             HTTP POST


                        Log               Log
                      Receiver          Receiver


                       Log Post          Log Post           Log Post
                      Processing        Processing         Processing
SPN infrastructure

                                                   Adhoc-Query (Pig)

                         MapReduce                      HBase
                          Hadoop Distributed File System                    (Ambari)

                     Feedback Information

                                                                                         Message Bus

                       Email Reputation Service                                        Web Reputation     File Reputation
                                                                                          Service              Service
Trend Micro Big Data process capacity

Daily amount of SPN data to be processed
• 8.5 billion Web Reputation queries
• 3 billion Email Reputation queries
• 7 billion File Reputation queries
• 6 TB of worldwide raw logs
• 150 million endpoint connections
Trend Micro: Web Reputation Services
 Technology                     Process                           Operation

 Trend Micro Products /         User Traffic | Honeypot           8 billion/day
   Technology
 CDN Cache                      Akamai                            40% filtered, 4.8 billion/day remain
 High Throughput Web Service    Rating Server for Known Threats   82% filtered, 860 million/day remain
 Hadoop Cluster                 Unknown & Prefilter
 Web Crawling                   Page Download
 Machine Learning,                                                99.98% filtered,
   Data Mining                                                    25,000 malicious URLs/day

 (End to end: 15 minutes)

Block a malicious URL within 15 minutes of it going online!
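
The first two filtering stages of the pipeline can be checked with a few lines of arithmetic. This is a sketch using only the percentages on the slide; the stage labels are shorthand, and the last stage (99.98%) is left out because the slide does not say which count it applies to.

```python
# (stage label, percent of queries answered and filtered out at that stage)
stages = [
    ("CDN cache", 40),
    ("Rating server for known threats", 82),
]

remaining = 8_000_000_000   # queries per day entering the pipeline
after_each_stage = []
for label, percent_filtered in stages:
    remaining = remaining * (100 - percent_filtered) // 100
    after_each_stage.append(remaining)
    print(f"after {label}: {remaining:,} per day")
```

The second result is 864 million, which the slide rounds to 860 million per day reaching the Hadoop cluster.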
Big Data Cases
Google vs. Hadoop Ecosystem

Chubby vs. Zookeeper

MapReduce vs. MapReduce

BigTable vs. HBase


 Ref: Google Cluster
Pioneer of Big Data Infrastructure – Google
HBase Use Case @ Facebook - Messages
 HBase Use Cases @ Facebook

            Facebook Insights   Operational Data Store
            Self-service        More Analytics/Hashout apps
 Messages   Hashout             Site Integrity

    2010        2011                 2012            2013

                                                  Social Graph Search Indexing
                                                  Realtime Hive Updates
                                                  Cross-system Tracing
                                                  … and more
Flagship App: Facebook Messages
• Monthly data volume prior to launch

                 15B x 1,024 bytes = 14TB

                 120B x 100 bytes = 11TB
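
The two volume figures work out if "B" means billion and a terabyte is counted as 2^40 bytes; both are assumptions, since the slide does not say. A minimal check:

```python
TIB = 1024 ** 4   # bytes per terabyte, binary convention (assumed)

def monthly_volume_tb(item_count, bytes_per_item):
    """Total size of item_count items of bytes_per_item bytes, in binary TB."""
    return item_count * bytes_per_item / TIB

first = monthly_volume_tb(15 * 10**9, 1024)    # 15B x 1,024 bytes
second = monthly_volume_tb(120 * 10**9, 100)   # 120B x 100 bytes
print(f"{first:.1f} TB, {second:.1f} TB")
```

With decimal terabytes (10^12 bytes) the same products come to about 15.4 TB and 12 TB, so the slide's 14TB and 11TB suggest the binary convention.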
Facebook Messages Now
Quick Stats
       Messages   Chats   Emails   SMS

       – 11B+ messages/day
         • 90B+ data accesses
         • Peak: 1.5M ops/sec
         • ~55% Read, 45% Write
       – 20PB+ of total data
         • Grows 400TB/month
Facebook Messages:Requirements

• Very High Write Volume
  – Previously, chat was not persisted to disk
• Ever-growing data sets (old data rarely gets accessed)
• Elasticity & automatic failover
• Strong consistency within a single data center
• Large scans/map-reduce support for migrations &
  schema conversions
• Bulk import data
Physical Multi-tenancy

• Real-time Ads Insights
  – Real-time analytics for social plugins on top of HBase
  – Publishers get real-time distribution/engagement metrics:
     • # of impressions, likes
     • analytics by domain/URL/demographics and time periods
  – Uses HBase capabilities:
     • Efficient counters (single-RPC increments)
     • TTL for purging old data
  – Needs massive write throughput & low latencies
     • Billions of URLs
     • Millions of counter increments/second

• Operational Data Store
Facebook Open Source Stack

• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs
Thank you!

