BigData Processing Intro
Nistor Grozav
nistor.grozavu@gmail.com
Lecture 2:
- Latent semantic analysis: LSA, LSI, LDA
- Clustering and classification on text
- Lab in Python on text categorisation and opinion mining on the 20NG and tweets datasets
Lecture 3:
Word embedding methods:
- Matrix factorisation based models: SVD, NMF
- Neural network based models: word2vec, GloVe, FastText
- Lab/Project in Python/Keras
Lecture 4:
Advanced embeddings: CNN, LSTM
Applications: topic extraction, sentiment analysis, ...
Lab/Project in Python/Keras
Evaluation: Project
Outline
• Part I. Introduction & Context
• Introduction to Big Data
• Methods and Tools
• Part II. Dimensionality reduction
• Definition and objectives
• Methods of data reduction
• Examples of applications
• The curse of dimensionality
• Techniques of dimensionality reduction
• Part III. Dimensionality reduction for large heterogeneous data
Part I
Introduction & Context
Mobile devices
(tracking all objects all the time)
● Progress and innovation are no longer hindered by the ability to collect data,
● but by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion.
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
640K ought to be
enough for anybody.
Introduction
• The sheer volume of data being stored today is exploding. In the year
2000, 800,000 petabytes (PB) of data were stored in the world. Of
course, a lot of the data being created today isn’t analyzed at all,
and that’s another problem we’re trying to address with BigInsights.
We expect this number to reach 35 zettabytes (ZB) by 2020.
• The volume of data available to organizations today is on the rise, while the
percent of data they can analyze is on the decline.
Volume of data (2)
• Data Storage
– The cost of storage has dropped tremendously
– Seagate 3 TB Barracuda @ $149.99 from Amazon.com (4.9¢/GB)
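As a quick back-of-envelope check of that cost-per-gigabyte figure, a minimal sketch in Python (the decimal convention 1 TB = 1000 GB is our assumption):

# Cost per gigabyte of the 3 TB drive quoted above.
drive_price_usd = 149.99   # Seagate 3 TB Barracuda, price from the slide
capacity_gb = 3 * 1000     # 3 TB in decimal gigabytes (assumption: 1 TB = 1000 GB)
cents_per_gb = drive_price_usd / capacity_gb * 100
print(f"{cents_per_gb:.2f} cents/GB")  # -> 5.00 cents/GB, i.e. the ~4.9¢/GB quoted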
Big Data: 3V
• Gartner’s definition of the 3Vs is still widely used, and is in agreement
with a consensual definition stating that "Big Data represents the
Information assets characterized by such a High Volume, Velocity and
Variety to require specific Technology and Analytical Methods for its
transformation into Value”. The 3Vs have been expanded with other
complementary characteristics of big data:
• Volume: big data doesn't sample. It just observes and tracks what
happens
• Velocity: big data is often available in real-time
• Variety: big data draws from text, images, audio, video; plus it
completes missing pieces through data fusion
• Machine Learning: big data often doesn't ask why and simply detects
patterns
• Digital footprint: big data is often a cost-free byproduct of digital
interaction
Big Data: 5V…
• Big data can be described by the following characteristics:
• Volume - The quantity of generated data is important in this context. The size of the data
determines the value and potential of the data under consideration, and whether it can
actually be considered big data or not. The name ‘big data’ itself contains a term related
to size, and hence the characteristic.
• Variety - The type and nature of the data. Knowing the variety helps the
people who analyze the data to use it effectively to their advantage, and
thus upholds its importance.
• Velocity - The speed at which the data is generated and processed to meet
the demands and the challenges that lie in the path of growth and development.
• Variability - The inconsistency the data can show at times, which can hamper
the process of handling and managing the data effectively.
• Veracity - The quality of captured data, which can vary greatly. Accurate
analysis depends on the veracity of the source data.
• Complexity - Data management can be very complex, especially when large
volumes of data come from multiple sources. Data must be linked, connected,
and correlated so users can grasp the information the data is supposed to
convey.
Big Data: 6C
[Figure: ENIAC (1946) vs. LHC (2012): roughly 6,000,000 ENIACs correspond to one LHC data stream (40 TB/s)]
• Entertainment
– Internet images, Hollywood movies, MP3 files, …
• Medicine
– MRI & CT scans, patient records, …
Introduction: Explosion in Quantity of Data
- high-velocity
The rate at which data is collected, acquired, generated, or processed
- high-variety
Different data types such as audio, video, and image data (mostly unstructured)
Cost Problem (example)
What would it cost to process 1 petabyte of data
with 1,000 nodes? (A rough estimate follows.)
1 PB = 10^15 B = 1 million gigabytes = 1 thousand terabytes
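A minimal sketch of the kind of estimate this question asks for, in Python; the 50 MB/s per-node scan throughput and the $0.10 per node-hour price are illustrative assumptions, not figures from the slide:

# Rough time/cost estimate for scanning 1 PB with a 1000-node cluster.
data_bytes = 10**15                 # 1 PB = 10^15 bytes
nodes = 1000
throughput_per_node = 50 * 10**6    # 50 MB/s per node (assumption)
node_cost_per_hour = 0.10           # USD per node-hour (assumption)

hours = data_bytes / (nodes * throughput_per_node) / 3600
cost = hours * nodes * node_cost_per_hour
print(f"{hours:.1f} hours, ~${cost:.0f}")  # -> 5.6 hours, ~$556

Even with a thousand nodes, a single pass over a petabyte takes hours, which is why throughput, not just storage, drives the cost.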
■ US elections:
Drew Linzer, June 2012: while the media continued reporting the race as very
tight, his model predicted 332 electoral votes for Obama and 206 for Romney
■ French elections:
Analysis of the evolution of French political communities on Twitter
(DPDA) during 2012, both in terms of relevant terms, opinions, and
behaviors (6th of May 2012)
Usage Example of Big Data
Some Challenges in Big Data
➢ Big Data Integration is Multidisciplinary
➢ Less than 10% of the Big Data world is genuinely relational
➢ Meaningful data integration in the real, messy, schema-less
and complex Big Data world of databases and the semantic web,
using multidisciplinary and multi-technology methods
2- If you could test all your decisions, how would that change the way you compete?
3- How would your business change if you used big data for widespread, real-time
customization?
MapReduce
[Diagram: MapReduce data flow: raw input as <key, value> pairs feeds the MAP
tasks, whose intermediate <key, value> pairs are grouped and passed to the
REDUCE tasks]
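To make the <key, value> flow concrete, here is a minimal single-machine word-count sketch of the MAP, shuffle, and REDUCE stages in plain Python (no Hadoop; the function names are ours, for illustration only):

from collections import defaultdict

def map_phase(document):
    # MAP: emit an intermediate <word, 1> pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key into <word, [1, 1, ...]>.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # REDUCE: collapse each key's list of values into a single count.
    return (key, sum(values))

docs = ["big data is big", "data about data"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}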
MapReduce Advantages
▪ Automatic Parallelization (a runnable sketch follows this list):
▪ Depending on the size of RAW INPUT DATA ➔ instantiate multiple
MAP tasks
▪ Similarly, depending upon the number of intermediate <key, value>
partitions ➔ instantiate multiple REDUCE tasks
▪ Run-time:
▪ Data partitioning
▪ Task scheduling
▪ Handling machine failures
▪ Managing inter-machine communication
▪ Completely transparent to the programmer/analyst/user
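A rough sketch of what that automatic parallelization boils down to, using Python's multiprocessing as a stand-in for the MapReduce run-time; partitioning the input one document per MAP task is an arbitrary assumption, and scheduling, failure handling, and inter-machine communication are precisely what the real run-time adds on top:

from multiprocessing import Pool

def map_task(chunk):
    # One MAP task: count the words in its own partition of the input.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge_counts(partials):
    # One REDUCE task: merge the partial counts from all MAP tasks.
    merged = {}
    for part in partials:
        for word, n in part.items():
            merged[word] = merged.get(word, 0) + n
    return merged

if __name__ == "__main__":
    docs = ["big data is big", "data about data", "big big data"]
    # The run-time would choose the number of MAP tasks from the input size;
    # here we simply launch one worker process per document.
    with Pool(processes=len(docs)) as pool:
        partials = pool.map(map_task, docs)
    print(merge_counts(partials))  # {'big': 4, 'data': 3, 'is': 1, 'about': 1}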
Big Data scenario tools
• NoSQL
– Databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak,
ZooKeeper
• MapReduce
– Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka,
Azkaban, Oozie, Greenplum
• Storage
– S3, Hadoop Distributed File System
• Servers
– EC2, Google App Engine, Elastic Beanstalk, Heroku
• Processing
– R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop