Cloud computing era

Cloud Computing Era
Trend Micro

Three Major Trends to Chang the World

Cloud Computing

Big Data Mobile

What is Cloud Computing?
National Institute of Standards and Technology (NIST) definition :

Essential
Characteristics

Service Models

Deployment Models

Provide scalable and elastic IT relative functions to users via as-a-
service business models and internet technologies

It’s About the Ecosystem

Structured, Semi-structured

Cloud Computing

SaaS
Enterprise Data Warehouse
PaaS

IaaS

Generate

Big Data

Lead

Business Insights

create

Competition, Innovation,
Productivity

What is the problem

• Getting the data to the processors
becomes the bottleneck

• Quick calculation
– Typical disk data transfer rate:
• 75MB/sec
– Time taken to transfer 100GB of data
to the processor:
• approx. 22 minutes!

The Era of Big Data – Are You Ready

Data for business commercial analysis
• 2011: multi-terabyte (TB)
• 2020: 35.2 ZB (1 ZB = 1 billion TB)

Who Needs It?
Enterprise Database Hadoop

When to use? When to use?
• Ad-hoc Reporting (<1sec) • Affordable Storage/Compute
• Multi-step Transactions • Unstructured or Semi-structured
• Lots of Inserts/Updates/Deletes • Resilient Auto Scalability

– inspired by
• Apache Hadoop project
– inspired by Google's MapReduce and Google File System
papers.
• Open sourced, flexible and available architecture for
large scale computation and data processing on a
network of commodity hardware
• Open Source Software + Hardware Commodity
– IT Costs Reduction

Hadoop Core

MapReduce
HDFS

© 2011 Cloudera, Inc. All Rights Reserved.

HDFS

• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool


MapReduce

• Two Phases of Functional Programming
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Java API

13

Hadoop Core

Java
Java

MapReduce
HDFS
Java
Java

14

Word Count Example

Key: offset
Value: line

Key: word Key: word
Value: count Value: sum of count

0:The cat sat on the mat
22:The aardvark sat on the sofa

The Ecosystem is the System

• Hadoop has become the kernel of the distributed
operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache

Relation Map

Hue Mahout
(Web Console) (Data Mining)

Oozie
(Job Workflow & Scheduling)
(Coordination)
Zookeeper

Sqoop/Flume
Pig/Hive (Analytical Language)
(Data integration)

MapReduce Runtime
(Dist. Programming Framework) Hbase
(Column NoSQL DB)

Hadoop Distributed File System (HDFS)

Zookeeper – Coordination Framework

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What is ZooKeeper

• A centralized service for maintaining
– Configuration information
– Providing distributed synchronization
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information

Flume / Sqoop – Data Integration Framework

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What’s the problem for data collection

• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
collection path

(and how can it help?)

• A distributed data collection service
• It efficiently collecting, aggregating, and moving large
amounts of data
• Fault tolerant, many failover and recovery mechanism
• One-stop solution for data collection of all formats

Flume Architecture

Log Log
...

Flume Node Flume Node

HDFS


Flume Sources and Sinks

• Local Files
• HDFS
• Stdin, Stdout
• Twitter
• IRC
• IMAP


Sqoop

• Easy, parallel database import/export
• What you want do?
– Insert data from RDBMS to HDFS
– Export data from HDFS back into RDBMS

Sqoop

HDFS

Sqoop

RDBMS

28

Sqoop Examples

$ sqoop import --connect jdbc:mysql://localhost/world --
username root --table City
...

$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,He
rat,AFG,Herat,1868004,Mazar-e-
Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord-Holland,731200
...

29

Pig / Hive – Analytical Language

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


Why Hive and Pig?

• Although MapReduce is very powerful, it can also be
complex to master
• Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing Java
code
• Many organizations have programmers who are skilled
at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data via
MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!

Hive – Developed by

• What is Hive?
– An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
summarization and ad hoc querying on top of Hadoop
– MapRuduce for execution
– HDFS for storage
• Hive Query Language
– Basic-SQL : Select, From, Join, Group-By
– Equi-Join, Muti-Table Insert, Multi-Group-By
– Batch query
SELECT * FROM purchases WHERE price > 100 GROUP BY storeid

Hive

SQL

Hive

MapReduce

33

Pig – Initiated by

• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simple to write MapReduce program
• Easy understand
• Easy debug A = load ‘a.txt’ as (id, name, age, ...)
B = load ‘b.txt’ as (id, address, ...)
C = JOIN A BY id, B BY id;STORE C into ‘c.txt’

Pig

Script

Pig

MapReduce


Hive vs. Pig

Hive Pig
Language HiveQL (SQL-like) Pig Latin, a scripting language
Schema Table definitions A schema is optionally defined
that are stored in a at runtime
metastore
Programmait Access JDBC, ODBC PigServer

WordCount Example

• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

• the reduce just sums up the values
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>

WordCount Example In MapReduce
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();

Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);
}

WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;

WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH ’wordcount/input'
OVERWRITE INTO TABLE wordcount;

SELECT count(*) FROM wordcount GROUP BY token;

The Story So Far

SQL Hive Pig Script

Java MapReduce
Java HDFS
Sqoop Flume
SQL RDBMS FS Posix

4
1 © 2011 Cloudera, Inc. All Rights Reserved.

Hbase – Column NoSQL DB

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


I – Inspired by

• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
– PUT
– GET
– DELETE
– SCAN

Hbase – Data Model

• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]

HBase Examples

hbase> create 'mytable', 'mycf‘
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1‘
hbase> scan 'mytable‘
hbase> disable 'mytable‘
hbase> drop 'mytable'


Oozie – Job Workflow & Scheduling

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What is ?

• A Java Web Application
• Oozie is a workﬂow scheduler for Hadoop
• Crond for Hadoop
• Triggered Job 1 Job 2
– Time
– Data
Job 3

Job 4 Job 5

Oozie Features

• Component Independent
– MapReduce
– Hive
– Pig
– SqoopStreaming


Mahout – Data Mining

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What is

• Machine-learning tool
• Distributed and scalable machine learning algorithms on
the Hadoop platform
• Building intelligent applications easier and faster

Mahout Use Cases

• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targetting
• Amazon: Personalization Platform


Hue – developed by

• Hadoop User Experience
• Apache Open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI
library

Hue

• HUE comes with a suite of applications
– File Browser: Browse HDFS; change permissions and
ownership; upload, download, view and edit files.
– Job Browser: View jobs, tasks, counters, logs, etc.
– Beeswax: Wizards to help create Hive tables, load data, run and
manage Hive queries, and download results in Excel format.

Use case Example

• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him

Conclusion

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
ecosystem

Recap – Hadoop Ecosystem

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


Trend Micro Smart Protection
Network (SPN) Case Study

Collaboration in the underground

Network Threats Shows Explosive Growth
Threats on the network like variants of the virus, spams, unknown download
sources escapes the detection of the traditional security system and continues to
show explosive growth.

New Unique Malware Discovered
1M
unique
Malwares
every
month

New Design Concept for Threat Intelligence
CDN / xSP Human
Intelligence

Honeypot

Web Crawler

Trend Micro
Mail Protection
Trend Micro
Trend Micro Endpoint Protection
Web Protection

150M+ Worldwide Endpoints/Sensors

Challenges We Are Faced
The Concept is Great but ….
6TB of data and 15B lines of logs received daily by

It becomes the Big Data Challenge!

Issues to Address

Threat
Raw Data Information Intelligence/Solution

 Volume: Infinite
 Time: No Delay
 Target: Keep Changing Threats

SPN
Feedback
SPAM
CDN Log SPN High Level Architecture
HTTP POST

L4

Log Log
Receiver Receiver

L4

Log Post Log Post Log Post
Processing Processing Processing
SPN infrastructure

Adhoc-Query (Pig)

MapReduce HBase
Circus
Hadoop Distributed File System (Ambari)
(HDFS)

Feedback Information

Message Bus
Application

Email Reputation Service Web Reputation File Reputation
Service Service

Trend Micro Big Data process capacity

Daily amount of SPN data to be processed
• 8.5 billions Web Reputation queries
• 3 billions Email Reputation queries
• 7 billions File Reputation queries
• Process 6 TB worldwide raw logs
• 150 millions End-point connections

Trend Micro: Web Reputation Services
Technology Process Operation

Trend Micro User Traffic | Honeypot
Products / Technology
8 billions/day
Akamai 40% filtered
CDN Cache
4.8 billions/day
Rating Server for Known

15 Minutes
High Throughput Web Service Threats 82% filtered

Unknown & Prefilter
Hadoop Cluster
860 millions/day
Page Download
Web Crawling
99.98% filtered
Threat
Analysis
Machine Learning
Data Mining
25,000 malicious URL /day

Block malicious URL within 15 minutes once it goes online!

Google vs. Hadoop Ecosystem

Chubby vs. Zookeeper

MapReduce vs. MapReduce

BigTable vs. HBase

GFS vs. HDFS

Ref: Google Cluster

Pioneer of Big Data Infrastructure – Google

Hbase use Case@Facebook - Messages
HBase Use Cases @ Facebook

Facebook Insights Operational Data Store
Self-service More Analytics/Hashout apps
Messages Hashout Site Integrity

2010 2011 2012 2013

Social Graph Search Indexing
Realtime Hive Updates
Cross-system Tracing
… and more

Flagship App：Facebook Messages
Monthly data volume prior to launch
• Monthly data volume prior to launch

15B x 1,024 bytes = 14TB

120B x 100 bytes = 11TB

Facebook Messages Now
book Messages NOW
•
StatsQuick Stats
Messages Chats
– 11B+ messages/day
• 90B+ data accesses
+ messages/day
• Peak：1.5M ops/sec
0B+ data~55% Read, 45% Write
•
accesses
eak: 1.5M ops/sec
55%Rd, 45% Wr data
– 20PB+ of total
Emails SMS
• Grows 400TB/month

B+ of total data
rows 400TB/month

Facebook Messages：Requirements

• Very High Write Volume
– Previously, chat was not persisted to disk
• Ever-growing data sets（Old data rarely gets
accessed）
• Elasticity & automatic failover
• Strong consistency within a single data center
• Large scans/map-reduce support for migrations &
schema conversions
• Bulk import data

Physical Multi-tenancy

• Real-time Ads Insights
– Real-time analytics for social plugins on top of Hbase
– Publishers get real-time distribution/engagement metrics:
• # of impressions, likes
• analytics by domain/URL/demographics and time periods
– Uses HBase capabilities:
• Efficient counters (single-RPC increments)
• TTL for purging old data
– Needs massive write throughput & low latencies
• Billions of URLs
• Millions of counter increments/second

• Operational Data Store

Facebook Open Source Stack

• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs

Cloud computing era

More Related Content

Cloud computing era