0% found this document useful (0 votes)

15 views

BigData Processing Intro

This document provides an overview of big data processing and introduces the topic of text mining and natural language processing. It outlines four lectures that will cover latent semantic analysis, text clustering and classification, word embedding techniques like word2vec and GloVe, and advanced embedding methods like CNNs and LSTMs. The lectures will include labs and projects in Python focusing on text transformation, categorization, and other NLP applications. Evaluation will be based on a final project.

Uploaded by

Elie Al Howayek

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views

BigData Processing Intro

Uploaded by

Elie Al Howayek

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 34

Big Data Processing - Introduction

Nistor Grozav
nistor.grozavu@gmail.com
u

Big Data - Text Mining

Lecture 1 :
- Big Data introduction
- Text mining : general context, NLP, documents similarities, bag of words, vector quantisation, tf-idf
- Lab in Python on text transformation and visualisation, on wikipedia text

Lecture 2 :
- Latent semantic analysis, LSA, LSI, LDA
- Clustering and classification on text
- Lab in Python on text categorisation, and opinion mining on 20NG and tweets dataset

Lecture 3 :
Word embedding methods :
- Matrix factorisation based models : SVD, NMF
- Neural networks based models word2vec; GloVe; FastText
- Lab/Project in Python/Keras

Lecture 4 :
Advanced embedding : CNN, LSTM
Applications : Topic extraction, sentiment analysis,...
Lab/Project in Python/Keras

Evaluation : Project

2
Outline
• Part I. Introduction & Contex
• Introduction to Big Dat
• Methods and Tool
• Part II. Dimensionality reductio
• Definition and objectives
• Methods of data reduction
• Examples of applications
• The curse of dimensionality
• Techniques of dimensions reducin
• Part III. Dimensionality reduction for large heterogeneous data

3
s

Part
Introduction & Context

4
I

Who’s Generating Big Data

Mobile devices
(tracking all objects all the time)

Social media and networks Scientific instruments

(all of us are generating data) (collecting all sorts of data)

Sensor technology and

networks
(measuring all kinds of data)

● The progress and innovation is no longer hindered by the ability to collect data
● But, by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion

5
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hydron Collider (LHC) generates 15 PB a year

640K ought to be
enough for anybody.
Introduction
• The sheer volume of data being stored today is exploding. In the year
2000, 800,000 petabytes (PB) of data were stored in the world. Of
course, a lot of the data that’s being created today isn’t analyzed at all
and that’s another problem we’re trying to address with BigInsights.
We expect this number to reach 35 zettabytes (ZB) by 2020.

• Twitter alone generates more than 7 terabytes (TB) of data every

day, Facebook 10 TB, and some enterprises generate terabytes of
data every hour of every day of the year.

• We store everything: environmental data, financial data, medical

data, surveillance data, and the list goes on and on.
Volume of data (1)

• The volume of data available to organizations today is on the rise, while the
percent of data they can analyze is on the decline.
Volume of data (2)

• 2 billion internet users

• 5 billion mobile phones in use in 2010.
• 30 billion pieces of content shared on Facebook every month.
• 7TB of data are processed by Twitter every day,
• 10TB of data are processed by Facebook every day.
• 40% projected growth in global data generated per year.
• 235T data collected by US library of Congress in April 2011
• 15 out of 17 sectors in the US have more data stored per
company than the US library of Congress.
• 90% of the data in the world today has been created in the last
two years alone.
Type of data sources & storage

• Data capture and collectio

– Sensor data, Mobile device, Social Network, Web clickstream,
– Traffic monitoring, Multimedia content, Smart energy meters,
– DNA analysis, Industry machines in the age of Internet of Things,
Consumer activities – communicating, browsing, buying, sharing,
searching – create enormous trails of data.

• Data Storag
– Cost of storage is reduced tremendously
– Seagate 3 TB Barracuda @ $149.99 from Amazon.com (4.9¢/GB)
e

Big Data: 3V
Big Data: 3V
• Gartner’s definition of the 3Vs is still widely used, and in agreement
with a consensual definition that states that "Big Data represents the
Information assets characterized by such a High Volume, Velocity and
Variety to require specific Technology and Analytical Methods for its
transformation into Value”. The 3Vs have been expanded to other
complementary characteristics of big data:
• Volume: big data doesn't sample. It just observes and tracks what
happens
• Velocity: big data is often available in real-time
• Variety: big data draws from text, images, audio, video; plus it
completes missing pieces through data fusion
• Machine Learning: big data often doesn't ask why and simply detects
patterns
• Digital footprint: big data is often a cost-free byproduct of digital
interaction
Big Data: 5V…
• Big data can be described by the following characteristics:
• Volume - The quantity of generated data is important in this context. The size of the data
determines the value and potential of the data under consideration, and whether it can
actually be considered big data or not. The name ‘big data’ itself contains a term related
to size, and hence the characteristic.
• Variety - The type of content, and an essential fact that data analysts must know. This
helps people who are associated with and analyze the data to effectively use the data to
their advantage and thus uphold its importance.
• Velocity In this context, the speed at which the data is generated and processed to meet
the demands and the challenges that lie in the path of growth and development.
• Variability The inconsistency the data can show at times—-which can hamper the
process of handling and managing the data effectively.
• Veracity The quality of captured data, which can vary greatly. Accurate analysis depends
on the veracity of source data. Complexity Data management can be very complex,
especially when large volumes of data come from multiple sources. Data must be linked,
connected, and correlated so users can grasp the information the data is supposed to
convey.
Big Data : 6C

Factory work and Cyber-physical systems may have a 6C system:

• Connection (sensor and networks)

• Cloud (computing and data on demand)
• Cyber (model and memory)
• Content/context (meaning and correlation)
• Community (sharing and collaboration)
• Customization (personalization and value)
• Data must be processed with advanced tools (analytics and algorithms) to
reveal meaningful information. Considering visible and invisible issues in,
for example, a factory, the information generation algorithm must detect
and address invisible issues such as machine degradation, component wear,
etc. on the factory floor.
Introduction: Explosion in Quantity of Data

1946 2012
Eniac LHC
X 6000000 = 1 (40 TB/S)

Air Bus A380

- 1 billion line of code 640TB per Flight
- each engine generate 10 TB every
30 min
Twitter Generate approximately 12 TB of data per
day

New York Stock Exchange 1TB of data everyday

storage capacity has doubled roughly every three

years since the 1980s
Explosion in Quantity of Data

Our Data-driven World

• Science
– Data bases from astronomy, genomics, environmental data,
transportation data, …

• Humanities and Social Sciences

– Scanned books, historical documents, social interactions data, new
technology like GPS …

• Business & Commerce

– Corporate sales, stock market transactions, census, airline traffic, …

• Entertainment
– Internet images, Hollywood movies, MP3 files, …

• Medicine
– MRI & CT scans, patient records, …
Introduction: Explosion in Quantity of Data

Our Data-driven World

- Fish and Oceans of Data

What we do with these amount of data?

Ignore, select pertinent data

Big Data Characteristics
How big is the Big Data?
- What is big today maybe not big tomorrow

- Any data that can challenge our current technology

in some manner can consider as Big Data
- Volume
- Communication
- Speed of Generating
- Meaningful Analysis

Big Data Vectors (3Vs)

"Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process
optimization”
Gartner 2012
Big Data Characteristics
Big Data Vectors (3Vs)
- high-volume
amount of data

- high-velocity
Speed rate in collecting or acquiring or generating or processing of data

- high-variety
different data type such as audio, video, image data (mostly unstructured data)
Cost Problem (example)
Cost of processing 1 Petabyte of data
with 1000 node ?
1 PB = 1015 B = 1 million gigabytes = 1 thousand terabytes

- 9 hours for each node to process 500GB at rate of 15MB/S

- 15*60*60*9 = 486000MB ~ 500 GB
- 1000 * 9 * 0.34$ = 3060$ for single run

- 1 PB = 1000000 / 500 = 2000 * 9 = 18000 h /

24 = 750 Day

- The cost for 1000 cloud node each processing

1PB
2000 * 3060$ = 6,120,000$
Usage Example of Big Data
US 2012 Election

- predictive modeling - data mining for

- mybarackobama.com individualized ad targeting
- drive traffic to other campaign sites
Facebook page (33 million "likes") - Orca big-data app
YouTube channel (240,000 subscribers
and 246 million page views). - YouTube channel( 23,700 subscribers
- a contest to dine with Sarah Jessica Parker and 26 million page views)
- Every single night, the team ran 66,000
computer simulations, Reddit!!! - Ace of Spades HQ
- Amazon web services
Usage Example in Big Data
Data Analysis prediction for US 2012 Election

Drew Linzer, June 2012 media continue reporting the race as very tight
332 for Obama,
206 for Romney

Nate Silver’s, Five thirty Eight blog

Predict Obama had a 86% chance of winning
Predicted all 50 state correctly

Sam Wang, the Princeton Election Consortium

The probability of Obama's re-election
at more than 98%
Usage Example of Big Data

■ French elections:
Analysis of the evolution of French political communities over Twitter
(DPDA) during 2012 both in terms of relevant terms, opinions,
behaviors (6th of May 2012)

24
Usage Example of Big Data

Clusters analysis for each candidate.

25
14
Some Challenges in Big Data
➢ Big Data Integration is Multidisciplinary
➢Less than 10% of Big Data world are genuinely relational
➢Meaningful data integration in the real, messy, schema-less
and complex Big Data world of database and semantic web
using multidisciplinary and multi-technology methode

➢ The Billion Triple Challenge

➢Web of data contain 31 billion RDf triples, that 446million of
them are RDF links, 13 Billion government data, 6 Billion
geographic data, 4.6 Billion Publication and Media data, 3 Billion
life science data
➢BTC 2011, Sindice 2011

➢ The Linked Open Data Ripper

➢Mapping, Ranking, Visualization, Key Matching, Snappiness

➢ Demonstrate the Value of Semantics: let data integration drive

DBMS technology
➢Large volumes of heterogeneous data, like link data and RDF
Other Aspects of Big Data 15

Six Provocations for Big Data

1- Automating Research Changes the Definition of Knowledge

2- Claim to Objectively and Accuracy are Misleading

3- Bigger Data are not always Better data

4- Not all Data are equivalent

5- Just because it is accessible doesn’t make it ethical

6- Limited access to big data creatrs new digital divides

16
Other Aspects of Big Data
▪ Five Big Question about big Data:
1- What happens in a world of radical transparency, with data widely available?

2- If you could test all your decisions, how would that change the way you compete?

3- How would your business change if you used big data for widespread, real time
customization?

4- How can big data augment or even replace Management?

5-Could you create a new business model based on data?

Implementation of Big Data 17

Platforms for Large-scale Data Analysis

• Parallel DBMS technologies
– Proposed in late eighties
– Matured over the last two decades
– Multi-billion dollar industry: Proprietary DBMS Engines intended
as Data Warehousing solutions for very large enterprises
• Map Reduce
– pioneered by Google
– popularized by Yahoo! (Hadoop)
Implementation of Big Data 18

MapReduce Parallel DBMS technologies

▪ Popularly used for more than two decades
• Overview:
▪ Research Projects: Gamma, Grace, …
– Data-parallel programming model
▪ Commercial: Multi-billion dollar
– An associated parallel and distributed industry but access to only a privileged
implementation for commodity clusters few
• Pioneered by Google ▪ Relational Data Model
– Processes 20 PB of data per day ▪ Indexing
• Popularized by open-source Hadoop ▪ Familiar SQL interface
– Used by Yahoo!, Facebook, ▪ Advanced query optimization
Amazon, and the list is growing … ▪ Well understood and studied
Implementation of Big Data 19

MapReduce
Raw Input: <key, value>

MAP

<K1, V1> <K2,V2> <K3,V3>

REDUCE
Implementation of Big Data 20

MapReduce Advantages
▪ Automatic Parallelization:
▪ Depending on the size of RAW INPUT DATA ➔ instantiate multiple
MAP tasks
▪ Similarly, depending upon the number of intermediate <key, value>
partitions ➔ instantiate multiple REDUCE tasks
▪ Run-time:
▪ Data partitioning
▪ Task scheduling
▪ Handling machine failures
▪ Managing inter-machine communication
▪ Completely transparent to the programmer/analyst/user
Big Data scenario tools
• NoSQL
– DatabasesMongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak,
ZooKeeper

• MapReduce
– Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka,
Azkaban, Oozie, Greenplum

• Storage
– S3, Hadoop Distributed File System

• Servers
– EC2, Google App Engine, Elastic, Beanstalk, Heroku

• Processing
– R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop

Big Data PPT 55b0fc01e7543
No ratings yet
Big Data PPT 55b0fc01e7543
31 pages
Lecture 6 BigData
No ratings yet
Lecture 6 BigData
61 pages
The Age of Big Data: Kayvan Tirdad
No ratings yet
The Age of Big Data: Kayvan Tirdad
26 pages
Big Data Basics Unit 1
No ratings yet
Big Data Basics Unit 1
12 pages
Introduction To Big Data
No ratings yet
Introduction To Big Data
83 pages
Class - Big Data UNIT-I
No ratings yet
Class - Big Data UNIT-I
40 pages
Introduction To Big Data: Soorya Prasanna Ravichandran
No ratings yet
Introduction To Big Data: Soorya Prasanna Ravichandran
33 pages
Basic Concepts in Big Data 1
No ratings yet
Basic Concepts in Big Data 1
43 pages
Bigdata 201126054145 PDF
No ratings yet
Bigdata 201126054145 PDF
23 pages
Unit 1 Bigdata
No ratings yet
Unit 1 Bigdata
30 pages
Seminar Report BIG DATA
No ratings yet
Seminar Report BIG DATA
28 pages
Geie 112 - S19 - LCN - 5
No ratings yet
Geie 112 - S19 - LCN - 5
24 pages
Part 1 - Introduction To Big Data
No ratings yet
Part 1 - Introduction To Big Data
24 pages
Big Data
No ratings yet
Big Data
30 pages
Bigdatappt
No ratings yet
Bigdatappt
31 pages
Overview of Big Data
No ratings yet
Overview of Big Data
4 pages
CC Becse Unit 4 PDF
No ratings yet
CC Becse Unit 4 PDF
32 pages
Apache Hadoop Training For Developers Day 1
No ratings yet
Apache Hadoop Training For Developers Day 1
136 pages
Future Revolution On Big Data
No ratings yet
Future Revolution On Big Data
24 pages
CC&BD Unit 3
No ratings yet
CC&BD Unit 3
16 pages
What Is Big Data
No ratings yet
What Is Big Data
8 pages
Prepared By: Asmita Deshmukh
No ratings yet
Prepared By: Asmita Deshmukh
51 pages
Big Data Analytics_Lecture Slides
No ratings yet
Big Data Analytics_Lecture Slides
72 pages
Unit 5 - Principles of Big Data 2
No ratings yet
Unit 5 - Principles of Big Data 2
14 pages
Deloitte Solutions Network: Introduction To Big Data
No ratings yet
Deloitte Solutions Network: Introduction To Big Data
9 pages
UNIT I
No ratings yet
UNIT I
25 pages
Unit 5
No ratings yet
Unit 5
63 pages
Big Data (Analytics) in Power Systems
No ratings yet
Big Data (Analytics) in Power Systems
20 pages
1Big_Data (1)
No ratings yet
1Big_Data (1)
69 pages
DSBDA_UNIT1
No ratings yet
DSBDA_UNIT1
221 pages
Introduction to Big Data
No ratings yet
Introduction to Big Data
51 pages
BDA 01 - Introduction
No ratings yet
BDA 01 - Introduction
42 pages
Chapter N1 Introduction To Big Data
No ratings yet
Chapter N1 Introduction To Big Data
40 pages
Big Data
No ratings yet
Big Data
25 pages
Big Data
No ratings yet
Big Data
24 pages
Big Data, Hadoop
No ratings yet
Big Data, Hadoop
24 pages
Data Science and Big Data
No ratings yet
Data Science and Big Data
80 pages
Current Trends in Big Data - V4
0% (1)
Current Trends in Big Data - V4
8 pages
Lecture Notes - Introduction To Big Data
0% (1)
Lecture Notes - Introduction To Big Data
8 pages
Big Data_1
No ratings yet
Big Data_1
46 pages
Unit 3 Big Data Analytics
No ratings yet
Unit 3 Big Data Analytics
18 pages
17 2017 Lecture1-2 INT312
0% (2)
17 2017 Lecture1-2 INT312
21 pages
2 LecturE 1 2
No ratings yet
2 LecturE 1 2
28 pages
Big Data Intro
No ratings yet
Big Data Intro
12 pages
I Jcs It 2015060405
No ratings yet
I Jcs It 2015060405
6 pages
Big Data
No ratings yet
Big Data
15 pages
Big Data Class - Introduction
No ratings yet
Big Data Class - Introduction
60 pages
Module 2-4
No ratings yet
Module 2-4
16 pages
What Is Big Data?
No ratings yet
What Is Big Data?
3 pages
Lecture 07
No ratings yet
Lecture 07
64 pages
Bda CHP1
No ratings yet
Bda CHP1
83 pages
Big Data CH 1
No ratings yet
Big Data CH 1
62 pages
Introduction To Big Data
No ratings yet
Introduction To Big Data
29 pages
Evolution of Big Data and Tools For Big Data
No ratings yet
Evolution of Big Data and Tools For Big Data
9 pages
07-08 What Is Big Data
No ratings yet
07-08 What Is Big Data
41 pages
Bda Unit 1
No ratings yet
Bda Unit 1
47 pages
Crash Course Big Data
From Everand
Crash Course Big Data
IntroBooks Team
No ratings yet
Big Data: the Revolution That Is Transforming Our Work, Market and World
From Everand
Big Data: the Revolution That Is Transforming Our Work, Market and World
PAT NAKAMOTO
No ratings yet
Reality Mining: Using Big Data to Engineer a Better World
From Everand
Reality Mining: Using Big Data to Engineer a Better World
Nathan Eagle
4/5 (2)
Data Decoded - Understanding Big Data and Its Everyday Applications
From Everand
Data Decoded - Understanding Big Data and Its Everyday Applications
Michael Reed
No ratings yet
MRA Extended Project Problem Statement
No ratings yet
MRA Extended Project Problem Statement
2 pages
Assignment
No ratings yet
Assignment
2 pages
V - SA (Collective Log) User Exits
No ratings yet
V - SA (Collective Log) User Exits
2 pages
10.1 SQLite Primer
No ratings yet
10.1 SQLite Primer
26 pages
SAS Vs FC Vs ISCSI PDF
No ratings yet
SAS Vs FC Vs ISCSI PDF
15 pages
Solving Your Database Backup and Recovery Problems With Commvault
No ratings yet
Solving Your Database Backup and Recovery Problems With Commvault
2 pages
CertyIQ AZ-305 40-Important Real Exam New Questions-2022
No ratings yet
CertyIQ AZ-305 40-Important Real Exam New Questions-2022
85 pages
Pts000108 Dicom Conformance Statement Mylab f11xx 01.3 04
No ratings yet
Pts000108 Dicom Conformance Statement Mylab f11xx 01.3 04
118 pages
Relational Database Management Systems
No ratings yet
Relational Database Management Systems
5 pages
5 DATA WAREHOUSE (1)
No ratings yet
5 DATA WAREHOUSE (1)
17 pages
Mahanta DMAB A1
No ratings yet
Mahanta DMAB A1
8 pages
IT-005 User Acceptance Testing
No ratings yet
IT-005 User Acceptance Testing
10 pages
Romney 15e Accessible Fullppt 03
No ratings yet
Romney 15e Accessible Fullppt 03
15 pages
SQL Optimization Slides
No ratings yet
SQL Optimization Slides
325 pages
Document Management Systems - Wiki - SCN Wiki
No ratings yet
Document Management Systems - Wiki - SCN Wiki
4 pages
012 HCP-MISKAWI - PLAGIASI - 5 Model Pengembangan Pariwisata Berbasis Potensi Cagar Budaya Di Kabupaten Banyuwangi
No ratings yet
012 HCP-MISKAWI - PLAGIASI - 5 Model Pengembangan Pariwisata Berbasis Potensi Cagar Budaya Di Kabupaten Banyuwangi
25 pages
Hana Introduction
No ratings yet
Hana Introduction
23 pages
Dissertation in Library and Information Science
100% (2)
Dissertation in Library and Information Science
6 pages
Digital Libraries
No ratings yet
Digital Libraries
13 pages
25.accessible Display Design To Control Home Area Networks
No ratings yet
25.accessible Display Design To Control Home Area Networks
2 pages
CH 01
No ratings yet
CH 01
23 pages
Importance of Database Design in DBMS
No ratings yet
Importance of Database Design in DBMS
5 pages
JD - Senior DBA - MariaDB or MSSQL - Maybank Core Banking
No ratings yet
JD - Senior DBA - MariaDB or MSSQL - Maybank Core Banking
2 pages
1921 Hippolytus Philosophumena Introduction Book 1
100% (1)
1921 Hippolytus Philosophumena Introduction Book 1
211 pages
DSBDA_Mini_Project
No ratings yet
DSBDA_Mini_Project
11 pages
MIS ch-3 - MIS and DSS
No ratings yet
MIS ch-3 - MIS and DSS
44 pages
CheatSheet Octave PDF
No ratings yet
CheatSheet Octave PDF
1 page
BCA DBMS Important Questions
No ratings yet
BCA DBMS Important Questions
1 page
Database Systems: Ayesha Asmat
No ratings yet
Database Systems: Ayesha Asmat
28 pages
Tugas Kapita Selekta Hasil Jaya
No ratings yet
Tugas Kapita Selekta Hasil Jaya
458 pages