
Unit I: Chapter 1: Introduction To Big Data


Unit I

Chapter 1: INTRODUCTION TO BIG DATA
Introduction
• The 21st century is characterized by rapid advancement in the field of information
technology
• Information technology has become an integral part of daily life and of industries such as
health, education, entertainment, science and technology, genetics, and business
operations.
• In today's competitive, global economy, organizations must possess a number of
skills to establish and sustain their place in the market.
• The most crucial of these skills is the ability to understand and harness
the immense potential of information technology
• According to the Information Technology Association of America, information
technology is defined as “the study, design, development, application,
implementation, support or management of computer-based information systems”
• This is truly an information age where data is being generated at an alarming rate
• This huge amount of data is often termed Big Data
• Organizations use data generated through various sources to run their businesses
• They analyze the data to understand and interpret market trends, study
customer behaviour and take financial decisions
• The term Big Data is now widely used, particularly in the IT industry, where it
has generated various job opportunities
• Big data consists of large datasets that cannot be managed efficiently by
the common database management systems.
• These datasets range from terabytes to exabytes
• Mobile phones, credit cards, Radio Frequency Identification (RFID) devices and
social networking platforms create huge amounts of data that may reside
unutilized on unknown servers for many years
• However, with the evolution of Big Data, this data can be accessed and
analyzed on a regular basis to generate useful information
What is Big Data?
• Think of the following:
• Every second there are around 8,22 tweets on Twitter
• Every minute, nearly 510 comments are posted, 293,000
statuses are updated, and 136,000 photos are uploaded
on Facebook
• Every hour, Walmart, a global discount department store
chain, handles more than 1 million customer transactions
• Every day, consumers make around 11.5 million
payments using PayPal
• We live in a digital world where data is increasing rapidly because of
the ever-increasing use of the Internet, sensors and heavy machines
• The sheer volume, variety, velocity and veracity of such data are
signified by the term “Big Data”
• Big Data may be structured, unstructured, semi-structured or
heterogeneous in nature
• It becomes difficult for computing systems to manage ‘Big Data’
because of the immense speed and volume at which it is generated
• Traditional data management, warehousing and analysis systems
fail to analyze this type of data
• Due to its complexity, Big Data is stored in a distributed file
system architecture
• Apache Hadoop is widely used for storing and managing Big Data
• Analyzing Big Data is a challenging task as it involves large
distributed file systems, which should be fault tolerant, flexible and
scalable
• According to IBM, “Every day, we create 2.5 quintillion
bytes of data, so much that 90% of the data in the world today
has been created in the last two years alone.
• This data comes from everywhere: sensors used to
gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction
records, and cell phone GPS signals to name a few.
• This data is big data.”
• Data is everywhere, in every industry, in the form of
numbers, images, videos and text
• As data continues to grow, so does the need to organize it
• Collecting such a huge amount of data would just be a waste
of time, effort and storage space if it cannot be put to any
logical use.
• The need to sort, organize, analyze and offer this critical
data in a systematic manner led to the rise of the much
discussed term Big Data
• The process of capturing or collecting Big Data is known as
‘datafication’
• Big Data is datafied so that it can be used productively
• Big Data cannot be made useful by simply organizing it, rather
the data usefulness lies in determining what we can do with it.
• According to IBM, Big Data is being generated by nearly
everything around us at all times, at an alarming velocity,
volume and variety
• To extract meaningful value from Big Data, you need optimal
processing power, analytical capabilities and skills
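To get a feel for the velocity behind the figures quoted in this section, the sketch below converts three of them (Walmart transactions per hour, PayPal payments per day, and IBM's 2.5 quintillion bytes per day) into average per-second rates. It is only back-of-the-envelope arithmetic. It assumes that "quintillion" means 10^18 and that a terabyte is 10^12 bytes, and the helper name `per_second` is made up for this illustration.

```python
# Back-of-the-envelope rates from the figures quoted in this section.
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count: float, period_seconds: int) -> float:
    """Convert a count observed over a period into an average per-second rate."""
    return count / period_seconds


# "more than 1 million customer transactions" every hour
walmart_tps = per_second(1_000_000, SECONDS_PER_HOUR)
# "around 11.5 million payments" every day
paypal_pps = per_second(11_500_000, SECONDS_PER_DAY)
# "2.5 quintillion bytes of data" every day (assuming quintillion = 10**18)
ibm_bytes_per_s = per_second(2.5e18, SECONDS_PER_DAY)

print(f"Walmart: about {walmart_tps:.0f} transactions per second")
print(f"PayPal:  about {paypal_pps:.0f} payments per second")
print(f"IBM:     about {ibm_bytes_per_s / 1e12:.1f} TB created per second")
```

These are averages only; real traffic is bursty, so peak rates would be higher than the numbers printed here.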
Features of Big Data
• Big data is a pool of huge amounts of data of all
types, shapes and formats collected from varied
sources
• Table 1 lists some common types of data and their
sources
History of Data Management – Evolution of Big Data
• Big Data is the latest stage in the evolution of data, driven by
the enormous velocity, variety and volume of data
• Velocity implies the speed with which the data flows
in an organization
• Variety refers to the varied forms of data such as
structured, semi structured or unstructured
• Volume defines the amount or quantity of data an
organization has to deal with
• The advent of IT, the Internet and globalization has
caused the volume of data and information generated to
grow at an exponential rate, which has led to an
‘information explosion’
• This in turn fueled the evolution of Big Data, which
started in 1940 and continues to this day
• Information explosion is described as a continuous
increase in the volume of the published information or
data and the effects of this abundant information
Table 1.2 lists some major milestones in the
evolution of Big Data
• Table 1.2 is only a synopsis of the evolution
• The need for adequate space and storage for data was always
felt, and with time Big Data grew into a technology phenomenon.
• Big Data as a concept has been applied in business for a
long time
• When researchers used computers to analyze huge volumes of
data, they were in effect analyzing Big Data
• The demand for faster access to data and the applications and
programs to process this data led to the present concept of Big
Data and Big Data analytics in the IT industry
• Suppose a bank plans to establish self-service kiosks in a major
metro area
• The marketing department wants to determine the busiest spots for
establishing the self-service kiosks, on the basis of the traffic patterns
of customers across the city.
• This information is not available in the existing data warehouse of the
bank
• In this situation, the bank can acquire the GPS location-based data of
the customers through a third party, and thereby gather
information about the mobility patterns of its customers
• Thus, by using the right set of Big Data with the right techniques of
data extraction, preparation and integration, banks today can identify
the busiest spots in the city for establishing their self-service kiosks
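The kiosk scenario above can be sketched in a few lines: bucket customer GPS pings into a coarse grid and rank the grid cells by how many pings fall into each. This is a minimal illustration, not a description of how any bank does it. The coordinates are invented, the grid resolution of 0.01 degrees (roughly 1 km) is an assumption, and the names `pings`, `cell_of` and `CELLS_PER_DEGREE` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical GPS pings as (latitude, longitude); invented for illustration.
pings = [
    (19.0761, 72.8775), (19.0763, 72.8779), (19.0768, 72.8771),
    (19.1136, 72.8697), (19.1139, 72.8693),
    (18.9220, 72.8347),
]

CELLS_PER_DEGREE = 100  # assumed grid resolution: 0.01 degree, roughly 1 km


def cell_of(lat: float, lon: float) -> tuple[int, int]:
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat * CELLS_PER_DEGREE), math.floor(lon * CELLS_PER_DEGREE))


# Count pings per cell and keep the two busiest cells as kiosk candidates.
traffic = Counter(cell_of(lat, lon) for lat, lon in pings)
busiest = traffic.most_common(2)

for cell, count in busiest:
    print(f"cell {cell}: {count} pings")
```

A real analysis would also need the extraction, preparation and integration steps mentioned above, for example removing duplicate pings and joining the third-party data with the bank's own customer records.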
Activity
• https://www.menti.com/5jv6gunfe1
Structuring Big Data
Types of Data
• Data that comes from multiple sources, such as databases,
ERP systems, web logs, chat history and GPS maps, varies
in its format
• However, different formats of data need to be made
consistent and clear before they can be used for analysis
• Data is obtained primarily from the following types of
sources:
• 1. Internal Sources, such as organizational or enterprise data
• 2. External sources, such as social data
• On the basis of the data received from the aforementioned
sources, Big Data comprises:
– Structured Data
– Unstructured data
– Semi-Structured data
• In real-world scenarios, unstructured data is typically
larger in volume than structured and semi-structured
data
• Figure illustrates the types of data that comprise Big Data:
Structured Data
• Structured data is data that has a defined repeating pattern
• This pattern makes it easier for any program to sort, read and
process the data
• Processing structured data is much easier and faster than
processing data without any specific repeating patterns
• Features:
– Is organized data in a predefined format
– Is stored in tabular form
– Is data that resides in fixed fields within a record or file
– Is formatted data that has entities and their attributes mapped
– Is used to query and report against predetermined data types
• Sources of structured data:
– Relational databases (in the form of tables)
– Flat files in the form of records (such as comma-separated
values (CSV) and tab-separated files)
– Multidimensional databases (mostly used in data
warehouse technology)
– Legacy Databases
• Sample of structured data:

Customer ID   Name    Product ID   City        State
12365         Smith   241          Graz        Styria
23658         Jack    365          Wolfsberg   Carinthia
32456         Kady    421          Enns        Upper Austria
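Because every record in the sample above has the same fixed fields, a program can query it with almost no preparation. The sketch below writes the sample as comma-separated values, one of the flat-file forms listed earlier, and reads it with Python's standard `csv` module. The column names (`customer_id`, `product_id` and so on) are naming choices made for this example.

```python
import csv
import io

# The sample table, written as CSV (a flat-file form of structured data).
SAMPLE_CSV = """customer_id,name,product_id,city,state
12365,Smith,241,Graz,Styria
23658,Jack,365,Wolfsberg,Carinthia
32456,Kady,421,Enns,Upper Austria
"""

# DictReader uses the header row to name the fixed fields of every record.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# With a predefined format, a query is just a filter on a known field.
styria_customers = [r["name"] for r in rows if r["state"] == "Styria"]

# Fields have predetermined types, so they convert cleanly.
product_by_customer = {int(r["customer_id"]): int(r["product_id"]) for r in rows}

print(styria_customers)
print(product_by_customer[23658])
```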
Unstructured Data
• Unstructured data is a set of data that might or might not have any logical or repeating
patterns.
• Features:
– Typically consists of metadata, i.e., additional information related to data
– Comprises inconsistent data, such as data obtained from files, social media
websites, satellites, etc.
– Consists of data in different formats such as e-mails, text, audio, video or images
– Some sources of unstructured data include:
• Text both internal and external to an organization – Documents, logs, survey results, feedback
and e-mails from both within and across the organization
• Social Media – Data obtained from social networking platforms, including YouTube, Facebook,
Twitter, LinkedIn and Flickr
• Mobile Data – Data such as text messages and location information
– About 80% of enterprise data consists of unstructured content.
• Challenges Associated with Unstructured Data:
– Working with unstructured data poses certain challenges,
which are as follows:
• Identifying the unstructured data that can be processed
• Sorting, organizing and arranging unstructured data in different sets
and formats
• Combining and linking unstructured data in a more structured format
to derive any logical conclusions out of the available information
• Cost in terms of storage space and human resources (data analysts
and scientists) needed to deal with the exponential growth of
unstructured data
Figure 1.5 Challenges in Handling Unstructured Data

• Source: “Big Data Infographic and Gartner 2012 Top 10 Strategic Tech Trends.” Business Analytics 3.0 (blog)
(November 11, 2011). http://practicalanalytics.
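One of the challenges listed above is turning unstructured data into a more structured form. As a small, hedged illustration, the sketch below pulls an order number and a phone number out of free-text feedback using regular expressions. The messages, the patterns and the name `to_record` are invented for this example. Real text is far messier, and pattern matching is only one of several possible techniques.

```python
import re

# Hypothetical free-text feedback messages; invented for illustration.
messages = [
    "Order 241 arrived late, please call me at 555-0100.",
    "Great service! No complaints about order 365.",
    "Where is my refund? Order 421 was cancelled last week.",
]

# Assumed patterns: the word "order" followed by digits, and a 3-4 digit phone number.
ORDER_RE = re.compile(r"\border\s+(\d+)", re.IGNORECASE)
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")


def to_record(text: str) -> dict:
    """Extract the fields we can recognize; leave the rest as None."""
    order = ORDER_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "order_id": int(order.group(1)) if order else None,
        "phone": phone.group(0) if phone else None,
        "text": text,
    }


records = [to_record(m) for m in messages]
for rec in records:
    print(rec["order_id"], rec["phone"])
```

Fields that cannot be found stay `None`, which reflects the first challenge in the list: not every piece of unstructured data can be processed.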
Semi-Structured Data
• Semi-structured data, also described as schema-less or self-describing,
is a form of structured data that contains tags or
markup elements in order to separate elements and generate
hierarchies of records and fields in the given data.
• This type of data does not follow the formal structure of data
models used in relational databases
• Data is stored inconsistently in rows and columns of a database
• Some sources of semi-structured data include:
– File systems, such as Web data in the form of cookies
– Data exchange formats, such as JavaScript Object Notation (JSON)
data
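A short JSON example makes "self-describing" concrete: each record carries its own field names, records need not share the same fields, and fields can nest to form hierarchies. The sketch below uses Python's standard `json` module. The records themselves (e-mail address, phone numbers, address) are made up for illustration.

```python
import json

# Hypothetical records; every value here is invented for illustration.
RAW = """
[
  {"id": 12365, "name": "Smith", "email": "smith@example.com"},
  {"id": 23658, "name": "Jack", "phones": ["555-0100", "555-0101"]},
  {"id": 32456, "name": "Kady", "address": {"city": "Enns", "state": "Upper Austria"}}
]
"""

records = json.loads(RAW)

# Each record names its own fields, so the "schema" can differ per record.
all_fields = sorted({key for rec in records for key in rec})
common_fields = sorted(set.intersection(*(set(rec) for rec in records)))

# Nested keys form a hierarchy of records and fields.
city = records[2]["address"]["city"]

print("fields seen anywhere:", all_fields)
print("fields in every record:", common_fields)
print("nested value:", city)
```

Only `id` and `name` appear in every record here, which is why such data does not fit neatly into the fixed rows and columns of a relational table.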
Elements of Big Data
• According to Gartner, data is growing at the rate of
59% every year.
• This growth can be depicted in terms of the
following four Vs:
– Volume
– Velocity
– Variety
– Veracity
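To see what a 59% annual growth rate implies, the sketch below applies ordinary compound growth, volume after n years = initial × (1 + 0.59)^n, and derives the doubling time. It assumes the rate stays constant, which is a simplification, and the function name `projected_volume` is made up for this illustration.

```python
import math

ANNUAL_GROWTH = 0.59  # the 59% yearly growth rate quoted above


def projected_volume(initial: float, years: float, rate: float = ANNUAL_GROWTH) -> float:
    """Compound growth: volume after `years` at a constant yearly rate."""
    return initial * (1 + rate) ** years


# Time for the data volume to double at a constant 59% per year.
doubling_time_years = math.log(2) / math.log(1 + ANNUAL_GROWTH)

print(f"doubling time: {doubling_time_years:.2f} years")
print(f"growth factor after 5 years: {projected_volume(1.0, 5):.1f}x")
```

Under that assumption, the data volume doubles roughly every year and a half and grows about tenfold in five years.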
Volume
• Volume is the amount of data generated by
organizations or individuals
• Today the volume of data in most organizations is
approaching exabytes
• Some experts predict the volume of data to reach
zettabytes in the coming years
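The units mentioned above are easier to compare when written out. The sketch below converts a raw byte count into the largest fitting unit. It assumes decimal (SI) prefixes, where each step is a factor of 1,000, rather than the binary convention where each step is 1,024. The function name `humanize` is chosen for this example.

```python
# Decimal (SI) byte units, each 1,000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the largest unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000
    raise AssertionError("unreachable: the loop always returns")


print(humanize(10**12))  # one terabyte
print(humanize(10**18))  # one exabyte
print(humanize(10**21))  # one zettabyte
```

On this scale an exabyte is a million terabytes, and a zettabyte is a thousand exabytes.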
