Introduction to Big Data

Dan Lo
Department of Computer Science
Kennesaw State University
What is Big Data?
International System of Units (SI)
• Kilobyte (1 000 Bytes)
• Megabyte (1 000 000 Bytes)
• Gigabyte (1 000 000 000 Bytes)
• Terabyte (1 000 000 000 000 Bytes)
• Petabyte (1 000 000 000 000 000 Bytes)
• Exabyte (1 000 000 000 000 000 000 Bytes)
• Zettabyte (1 000 000 000 000 000 000 000 Bytes)
Name (long-scale name)      Number   SI prefix
Thousand                    10^3     Kilo
Million                     10^6     Mega
Billion (milliard)          10^9     Giga
Trillion (billion)          10^12    Tera
Quadrillion (billiard)      10^15    Peta
Quintillion (trillion)      10^18    Exa
Sextillion (trilliard)      10^21    Zetta
Septillion (quadrillion)    10^24    Yotta
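
The SI prefixes above each differ by a factor of 1000, so a raw byte count can be turned into a readable figure by dividing repeatedly. The following is a minimal Python sketch of that idea; the function name format_bytes and the unit symbols are illustrative choices, not part of the original slides.

```python
# SI (decimal) unit symbols; each step up is a factor of 1000.
# The symbols and the function name are illustrative assumptions.
SI_UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def format_bytes(n_bytes):
    """Return a byte count as a short string using SI (power-of-1000) units."""
    value = float(n_bytes)
    index = 0
    # Divide by 1000 until the value fits the current unit or we run out of units.
    while value >= 1000 and index < len(SI_UNITS) - 1:
        value /= 1000
        index += 1
    # ":g" drops trailing zeros and keeps at most six significant digits.
    return f"{value:g} {SI_UNITS[index]}"


print(format_bytes(999))       # below one kilobyte
print(format_bytes(1000))      # exactly one kilobyte
print(format_bytes(10**21))    # exactly one zettabyte
```

Note that this uses decimal (SI) units only; binary units such as KiB (1024 bytes) would need a divisor of 1024 instead.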
What do you mean by “Big”?
• Number of emails sent every day: 294 billion
• Tweets per day: 500+ million
• Video uploaded to YouTube every minute: 0.4 million hours
• Number of sensors populating the IoT with real-time data: trillions
• Number of Google searches per day: 3.5 billion
• Orders processed by Amazon every second on Prime Day: 398
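
To compare the daily figures above with the per-second Amazon figure, it can help to put them on the same time scale. The sketch below converts the quoted daily counts to average per-second rates in Python; it assumes a constant rate over a 24-hour day, and the dictionary keys and function name are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Daily counts as quoted on the slide above.
daily_counts = {
    "emails": 294_000_000_000,
    "tweets": 500_000_000,            # slide says "500+ million", so a lower bound
    "google_searches": 3_500_000_000,
}


def per_second(count_per_day):
    """Average number of events per second, rounded down (assumes a constant rate)."""
    return count_per_day // SECONDS_PER_DAY


rates = {name: per_second(count) for name, count in daily_counts.items()}
for name, rate in rates.items():
    print(f"{name}: about {rate:,} per second")
```

Real traffic is bursty, so these averages understate peak load.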
Big Data Era
• “Big Data” is a fact of the world.
• Every day, we create 2.5 quintillion bytes of data, so
much that 90% of the data in the world today has been
created in the last two years alone.
• This data comes from everywhere: sensors used to
gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction
records, and cell phone GPS signals, to name a few.
(Source: IBM; IBM Big Data)
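
As a rough worked example, the quoted daily figure can be restated in SI units and scaled to a year. The Python sketch below assumes the rate stays constant at 2.5 quintillion bytes per day over a 365-day year, which is a simplification made only for illustration.

```python
BYTES_PER_DAY = 25 * 10**17   # 2.5 quintillion bytes, the figure quoted from IBM
EXABYTE = 10**18              # SI exabyte
ZETTABYTE = 10**21            # SI zettabyte
DAYS_PER_YEAR = 365           # assumption: non-leap year, constant daily rate

bytes_per_year = BYTES_PER_DAY * DAYS_PER_YEAR
daily_eb = BYTES_PER_DAY / EXABYTE        # daily volume in exabytes
yearly_zb = bytes_per_year / ZETTABYTE    # yearly volume in zettabytes

print(f"Per day:  {daily_eb} EB")
print(f"Per year: {yearly_zb} ZB")
```

Under these assumptions, 2.5 quintillion bytes per day is 2.5 exabytes per day, or a little under one zettabyte per year.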
So Big!
The total amount of digital data will reach 180
zettabytes in 2025. Approximately 80 percent of this
data will be unstructured…
Age of Big Data
• The point is not only having big data, but also having
the capability to make a big deal out of the big data
• Some tasks that were not possible 5-10 years ago
are routine now:
– Build a model to detect credit card fraud using
thousands of features and billions of transactions
– Intelligently recommend millions of products to
millions of users
– Estimate financial risk through simulations of
portfolios including millions of instruments
– Easily manipulate data from thousands of human
genomes to detect genetic associations with disease
What Can We Do?
• Sitting behind this capability are the big data
computing platforms that can leverage
clusters of commodity computers to chug
through massive amounts of data
• In this course we will use Spark, a
cutting-edge big data platform
What is Data Science?
• Just having platforms and big data is not
enough; we need “Data Science” to fill
the gap between platforms and data.
Cont.
• Data science employs techniques and theories
drawn from many fields, such as statistics and
machine learning, to extract knowledge and
insights from big data by leveraging big data
platforms.
Change of Science
• Turing Award winner Jim Gray imagined data
science as a “fourth paradigm” of science
(empirical, theoretical, computational, and
now data-driven) and asserted that
“everything about science is changing because
of the impact of information technology and
the data deluge”.
How is Big Data Related to Me?
• There will be a shortage of the talent necessary
for organizations to take advantage of big
data. By 2030, we could face a shortage of
85,000,000 people with deep analytical skills.
• In 2020, the shortage was 250,000.
• Since 2015, there have been three times as many
job postings as job searches.
(Source: https://quanthub.com/data-scientist-shortage-2020/)
U.S. News Top 100 Best Jobs in 2021
Data Science Activities
• Data Acquisition
• Data Preparation
• Data Analysis
• Data Presentation
• Data Products
About This Course
• Teaching by Examples
• Learning by Doing
• Study a bit of the theory behind the examples
Projects You Can Do
• Scala and Spark basics: data preprocessing
• Recommender Systems: Recommending Music
• Decision Trees and Forests: Predicting Forest
Cover
• Anomaly Detection with K-Means Clustering:
Network Traffic Analysis
• Latent Semantic Analysis: Understanding Wikipedia
• Co-occurrence Analysis: Medline Citation Data
• Geospatial and Temporal Data Analysis: New York
City Taxi Trip Data
Research/Project Component
– Read and present one or more research papers in
the big data area, e.g.
• Theoretical: “Large-scale Logistic Regression and Linear
Support Vector Machines Using Spark” or “Multiple
Submodels Parallel Support Vector Machine on Spark”
• Applications: “U-Air: When Urban Air Quality Inference
Meets Big Data” or “Forecasting Fine-Grained Air Quality
Based on Big Data”
