
Big Data Processing - Introduction

Nistor Grozavu
nistor.grozavu@gmail.com

Big Data - Text Mining


Lecture 1 :
- Big Data introduction
- Text mining : general context, NLP, document similarities, bag of words, vector quantisation, tf-idf
- Lab in Python on text transformation and visualisation, on Wikipedia text (a short tf-idf sketch follows this outline)

Lecture 2 :
- Latent semantic analysis, LSA, LSI, LDA
- Clustering and classification on text
- Lab in Python on text categorisation and opinion mining, on the 20NG and Tweets datasets

Lecture 3 :
Word embedding methods :
- Matrix factorisation based models : SVD, NMF
- Neural network based models : word2vec, GloVe, FastText
- Lab/Project in Python/Keras

Lecture 4 :
- Advanced embedding : CNN, LSTM
- Applications : topic extraction, sentiment analysis, ...
- Lab/Project in Python/Keras

Evaluation : Project
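As a preview of the Lecture 1 lab, here is a minimal bag-of-words / tf-idf sketch in Python. It assumes scikit-learn is available and uses a toy corpus in place of the Wikipedia text; the lab's actual tooling and data are not specified in these slides.

# Minimal bag-of-words / tf-idf sketch (toy corpus, not the lab's Wikipedia data).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "big data requires scalable processing",
    "text mining extracts knowledge from text",
    "tf idf weights terms by frequency and rarity",
]

# Bag of words: raw term counts per document.
bow = CountVectorizer()
counts = bow.fit_transform(docs)      # sparse matrix, shape (3, vocabulary size)

# tf-idf: down-weights terms that appear in many documents.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# Document similarity in the tf-idf space.
print(cosine_similarity(weights))     # 3x3 cosine similarity matrix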

Outline
• Part I. Introduction & Context
• Introduction to Big Data
• Methods and Tools
• Part II. Dimensionality reduction
• Definition and objectives
• Methods of data reduction
• Examples of applications
• The curse of dimensionality
• Techniques of dimension reduction
• Part III. Dimensionality reduction for large heterogeneous data

Part I
Introduction & Context

Who’s Generating Big Data?

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

● Progress and innovation are no longer hindered by the ability to collect data
● But by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion

How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year

640K ought to be
enough for anybody.
Introduction
• The sheer volume of data being stored today is exploding. In the year
2000, 800,000 petabytes (PB) of data were stored in the world. Of
course, a lot of the data being created today isn't analyzed at all,
and that's another problem we're trying to address with BigInsights.
We expect this number to reach 35 zettabytes (ZB) by 2020.

• Twitter alone generates more than 7 terabytes (TB) of data every
day, Facebook 10 TB, and some enterprises generate terabytes of
data every hour of every day of the year.

• We store everything: environmental data, financial data, medical
data, surveillance data, and the list goes on and on.
Volume of data (1)

• The volume of data available to organizations today is on the rise, while the
percent of data they can analyze is on the decline.
Volume of data (2)

• 2 billion internet users
• 5 billion mobile phones in use in 2010
• 30 billion pieces of content shared on Facebook every month
• 7 TB of data are processed by Twitter every day
• 10 TB of data are processed by Facebook every day
• 40% projected growth in global data generated per year
• 235 TB of data collected by the US Library of Congress by April 2011
• 15 out of 17 sectors in the US have more data stored per
company than the US Library of Congress
• 90% of the data in the world today has been created in the last
two years alone
Type of data sources & storage

• Data capture and collection
– Sensor data, mobile devices, social networks, web clickstreams,
– Traffic monitoring, multimedia content, smart energy meters,
– DNA analysis, industrial machines in the age of the Internet of Things.
Consumer activities – communicating, browsing, buying, sharing,
searching – create enormous trails of data.

• Data Storage
– Cost of storage has been reduced tremendously
– Seagate 3 TB Barracuda @ $149.99 from Amazon.com (4.9¢/GB)

Big Data: 3V
• Gartner’s definition of the 3Vs is still widely used, and is in agreement
with a consensus definition stating that "Big Data represents the
information assets characterized by such a high volume, velocity and
variety as to require specific technology and analytical methods for their
transformation into value". The 3Vs have been expanded with other
complementary characteristics of big data:
• Volume: big data doesn't sample. It just observes and tracks what
happens
• Velocity: big data is often available in real-time
• Variety: big data draws from text, images, audio, video; plus it
completes missing pieces through data fusion
• Machine Learning: big data often doesn't ask why and simply detects
patterns
• Digital footprint: big data is often a cost-free byproduct of digital
interaction
Big Data: 5V…
• Big data can be described by the following characteristics:
• Volume - The quantity of generated data is important in this context. The size of the data
determines the value and potential of the data under consideration, and whether it can
actually be considered big data or not. The name ‘big data’ itself contains a term related
to size, and hence the characteristic.
• Variety - The type of content, and an essential fact that data analysts must know. This
helps people who are associated with and analyze the data to effectively use the data to
their advantage and thus uphold its importance.
• Velocity - In this context, the speed at which the data is generated and processed to meet
the demands and the challenges that lie in the path of growth and development.
• Variability - The inconsistency the data can show at times, which can hamper the
process of handling and managing the data effectively.
• Veracity - The quality of captured data, which can vary greatly. Accurate analysis depends
on the veracity of the source data.
• Complexity - Data management can be very complex, especially when large volumes of
data come from multiple sources. Data must be linked, connected, and correlated so users
can grasp the information the data is supposed to convey.
Big Data : 6C

Factory work and Cyber-physical systems may have a 6C system:

• Connection (sensor and networks)


• Cloud (computing and data on demand)
• Cyber (model and memory)
• Content/context (meaning and correlation)
• Community (sharing and collaboration)
• Customization (personalization and value)
• Data must be processed with advanced tools (analytics and algorithms) to
reveal meaningful information. Considering visible and invisible issues in,
for example, a factory, the information generation algorithm must detect
and address invisible issues such as machine degradation, component wear,
etc. on the factory floor.
Introduction: Explosion in Quantity of Data

1946: ENIAC  vs.  2012: LHC
(roughly 6,000,000 ENIACs = 1 LHC, at about 40 TB/s)

Airbus A380
- about 1 billion lines of code
- 640 TB generated per flight
- each engine generates 10 TB every 30 min

Twitter generates approximately 12 TB of data per day

New York Stock Exchange: 1 TB of data every day

Storage capacity has doubled roughly every three years since the 1980s
Explosion in Quantity of Data

Our Data-driven World


• Science
– Databases from astronomy, genomics, environmental data,
transportation data, …

• Humanities and Social Sciences


– Scanned books, historical documents, social interactions data, new
technology like GPS …

• Business & Commerce


– Corporate sales, stock market transactions, census, airline traffic, …

• Entertainment
– Internet images, Hollywood movies, MP3 files, …

• Medicine
– MRI & CT scans, patient records, …
Introduction: Explosion in Quantity of Data

Our Data-driven World

- Fish and Oceans of Data

What do we do with these amounts of data?

Ignore them, or select the pertinent data


Big Data Characteristics
How big is Big Data?
- What is big today may not be big tomorrow

- Any data that can challenge our current technology
in some manner can be considered Big Data
- Volume
- Communication
- Speed of generation
- Meaningful analysis

Big Data Vectors (3Vs)


"Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process
optimization”
Gartner 2012
Big Data Characteristics
Big Data Vectors (3Vs)
- high-volume
amount of data

- high-velocity
The rate at which data is collected, acquired, generated, or processed

- high-variety
Different data types such as audio, video, and image data (mostly unstructured data)
Cost Problem (example)
Cost of processing 1 petabyte of data with 1000 nodes?
1 PB = 10^15 B = 1 million gigabytes = 1 thousand terabytes

- At 15 MB/s, one node processes about 500 GB in 9 hours:
  15 * 60 * 60 * 9 = 486,000 MB ≈ 500 GB
- One 9-hour run on 1000 nodes at $0.34 per node-hour:
  1000 * 9 * $0.34 = $3,060 per single run
- For one node, 1 PB = 1,000,000 GB / 500 GB = 2000 runs,
  i.e. 2000 * 9 = 18,000 h ≈ 750 days
- The cost for 1000 cloud nodes each processing 1 PB:
  2000 * $3,060 = $6,120,000
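A short Python check of the arithmetic above; the 15 MB/s throughput and the $0.34 per node-hour rate are the slide's assumed figures, not measured values.

# Re-deriving the cost figures from the slide's assumptions.
RATE_MB_S = 15            # per-node throughput (MB/s), assumed
NODE_HOUR_COST = 0.34     # cloud cost per node-hour ($), assumed
NODES = 1000

gb_per_run = RATE_MB_S * 3600 * 9 / 1000        # ~486 GB ≈ 500 GB per 9-hour run
cost_per_run = NODES * 9 * NODE_HOUR_COST       # $3,060 for one 9-hour run on 1000 nodes

runs_for_1pb = 1_000_000 / 500                  # 2000 runs of 500 GB to cover 1 PB on one node
single_node_days = runs_for_1pb * 9 / 24        # ~750 days on a single node

# 1000 nodes each processing a full petabyte:
total_cost = runs_for_1pb * cost_per_run        # 2000 * $3,060 = $6,120,000
print(gb_per_run, cost_per_run, single_node_days, total_cost)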
Usage Example of Big Data
US 2012 Election

Obama campaign:
- Predictive modeling, data mining for individualized ad targeting
- mybarackobama.com used to drive traffic to other campaign sites
- Facebook page (33 million "likes"), YouTube channel (240,000 subscribers
  and 246 million page views)
- A contest to dine with Sarah Jessica Parker
- Every single night, the team ran 66,000 computer simulations
- Reddit, Amazon web services

Romney campaign:
- Orca big-data app
- YouTube channel (23,700 subscribers and 26 million page views)
- Ace of Spades HQ
Usage Example of Big Data
Data analysis predictions for the US 2012 Election

Drew Linzer, June 2012 (while the media continued reporting the race as very tight):
332 electoral votes for Obama,
206 for Romney

Nate Silver's FiveThirtyEight blog:
Predicted Obama had an 86% chance of winning
Predicted all 50 states correctly

Sam Wang, the Princeton Election Consortium:
Put the probability of Obama's re-election
at more than 98%
Usage Example of Big Data

■ French elections:
Analysis of the evolution of French political communities on Twitter
(DPDA) during 2012, both in terms of relevant terms, opinions, and
behaviors (6th of May 2012)

Usage Example of Big Data

Cluster analysis for each candidate.

Some Challenges in Big Data
➢ Big Data Integration is Multidisciplinary
➢ Less than 10% of the Big Data world is genuinely relational
➢ Meaningful data integration in the real, messy, schema-less
and complex Big Data world of databases and the semantic web,
using multidisciplinary and multi-technology methods

➢ The Billion Triple Challenge
➢ The web of data contains 31 billion RDF triples, of which 446 million
are RDF links; 13 billion government data, 6 billion geographic data,
4.6 billion publication and media data, 3 billion life science data
➢ BTC 2011, Sindice 2011

➢ The Linked Open Data Ripper
➢ Mapping, Ranking, Visualization, Key Matching, Snappiness

➢ Demonstrate the Value of Semantics: let data integration drive
DBMS technology
➢ Large volumes of heterogeneous data, like linked data and RDF
Other Aspects of Big Data

Six Provocations for Big Data

1- Automating Research Changes the Definition of Knowledge

2- Claims to Objectivity and Accuracy are Misleading

3- Bigger Data are not always Better Data

4- Not all Data are Equivalent

5- Just because it is accessible doesn't make it ethical

6- Limited access to Big Data creates new digital divides


Other Aspects of Big Data
▪ Five Big Questions about Big Data:
1- What happens in a world of radical transparency, with data widely available?

2- If you could test all your decisions, how would that change the way you compete?

3- How would your business change if you used big data for widespread, real time
customization?

4- How can big data augment or even replace Management?

5- Could you create a new business model based on data?


Implementation of Big Data

Platforms for Large-scale Data Analysis


• Parallel DBMS technologies
– Proposed in the late eighties
– Matured over the last two decades
– Multi-billion dollar industry: Proprietary DBMS Engines intended
as Data Warehousing solutions for very large enterprises
• Map Reduce
– pioneered by Google
– popularized by Yahoo! (Hadoop)
Implementation of Big Data

MapReduce
• Overview:
– Data-parallel programming model
– An associated parallel and distributed implementation for commodity clusters
• Pioneered by Google
– Processes 20 PB of data per day
• Popularized by open-source Hadoop
– Used by Yahoo!, Facebook, Amazon, and the list is growing …

Parallel DBMS technologies
▪ Popularly used for more than two decades
▪ Research Projects: Gamma, Grace, …
▪ Commercial: Multi-billion dollar industry, but access to only a privileged few
▪ Relational Data Model
▪ Indexing
▪ Familiar SQL interface
▪ Advanced query optimization
▪ Well understood and studied
Implementation of Big Data

MapReduce
[Diagram: raw input arrives as <key, value> pairs; the MAP phase emits intermediate
pairs <K1, V1>, <K2, V2>, <K3, V3>; the REDUCE phase aggregates the values grouped by key.]
Implementation of Big Data

MapReduce Advantages
▪ Automatic Parallelization:
▪ Depending on the size of RAW INPUT DATA ➔ instantiate multiple
MAP tasks
▪ Similarly, depending upon the number of intermediate <key, value>
partitions ➔ instantiate multiple REDUCE tasks
▪ Run-time:
▪ Data partitioning
▪ Task scheduling
▪ Handling machine failures
▪ Managing inter-machine communication
▪ Completely transparent to the programmer/analyst/user
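To make the programming model concrete, below is a minimal word-count sketch in plain Python that simulates the map, shuffle, and reduce steps in a single process. It is an illustration only, not the slides' implementation; a real deployment would rely on Hadoop or a similar framework, and all names in the sketch are made up for the example.

# Word count in the MapReduce style, simulated in one process.
from collections import defaultdict

def map_phase(document):
    # MAP: emit an intermediate <key, value> pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # REDUCE: aggregate all values that share the same key.
    return (key, sum(values))

documents = ["Big data big insight", "data about data"]

# Shuffle step: group intermediate values by key
# (done automatically by the framework in Hadoop).
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

counts = [reduce_phase(k, vs) for k, vs in grouped.items()]
print(counts)   # e.g. [('big', 2), ('data', 3), ('insight', 1), ('about', 1)]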
Big Data scenario tools
• NoSQL
– Databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak,
ZooKeeper

• MapReduce
– Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka,
Azkaban, Oozie, Greenplum

• Storage
– S3, Hadoop Distributed File System

• Servers
– EC2, Google App Engine, Elastic Beanstalk, Heroku

• Processing
– R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
