
What Is Big Data



What is Data?

Data is a collection of information that can be used for a specific purpose. It can be made up of facts, numbers, names, measurements, or descriptions of things, and it can come in the form of text, observations, figures, images, graphs, or symbols.

Characteristics of Data:

As depicted in the figure below, data has three key characteristics:

Composition: The composition of data deals with the structure of data, that is, the
sources of data, the granularity, the types, and the nature of data as to whether it is
static or real-time streaming.

Condition: The condition of data deals with the state of data, that is, "Can one use this
data as is for analysis?" or "Does it require cleansing for further enhancement and
enrichment?"

Context: The context of data deals with "Where has this data been generated?" "Why was this data generated?" "How sensitive is this data?" "What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty: it is about known data sources, with no major changes to the composition or context of the data.

[Figure: the three characteristics of data - Composition, Condition, Context]

What is big data?

Big data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

Big data is a collection of structured, unstructured, and semi-structured data that is large and diverse: so large and complex that traditional data management systems cannot store, process, or analyze it.

Structuring Big Data

Structured data is data that has a standardized format and a well-defined structure. It
follows a consistent order and is easily accessible by humans and computer programs.

Unstructured data is information that doesn't have a predefined data model or structure. It
can be human-generated or machine-generated, and it can be textual or non-textual.

Semi-structured data has some structure but doesn't conform to a rigid data model. It doesn't have a fixed schema, but it does have some organizational properties (such as tags or key-value pairs) that make it easier to analyze than unstructured data.
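The distinction is easiest to see with a concrete record. As a minimal sketch (the field names below are invented for illustration), Python's built-in json module can parse semi-structured records whose keys vary from one record to the next:

```python
import json

# Semi-structured data: organizational properties (keys) but no fixed schema,
# so one record may carry fields that another lacks.
raw = '''[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "tags": ["premium", "mobile"]}
]'''

records = json.loads(raw)

# Keys are discovered at read time rather than declared up front.
for rec in records:
    print(sorted(rec.keys()))
```

Contrast this with a relational table, where every row must fit the same column schema, and with an image file, where there are no labeled fields to query at all.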

Aspect / feature      Structured data          Semi-structured data           Unstructured data
Organization          Highly organized         Moderately organized           Not organized
Data model            Yes                      No                             No
Structure             Yes                      Yes                            No
Machine readability   High (easy)              Moderate                       Challenging
Analysis              Easy                     Moderate                       Difficult
Data security         Less                     Moderate                       High
Scalability           Hard                     Moderate                       Easy
Examples              RDBMS, spreadsheets      HTML, XML, JSON, YAML,         Images, videos, audio,
                                               NoSQL, log files,              documents, chats,
                                               web pages, emails              presentations

History of big data

The term "big data" gained popularity in the early 1990s; John R. Mashey (who at the time worked at Silicon Graphics) is widely credited as the father of big data for popularizing the term.

The history of big data can be traced back to 1663 when John Graunt used statistical data
analysis to study the bubonic plague. In the 1960s, IBM introduced the first computer
system capable of processing large amounts of data. The 1970s and 1980s saw the
development of databases and data-processing software.
The 1970s and before were the era of mainframes, when data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s, the era of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have since led to an onslaught of structured, unstructured, and multimedia data.

BIG DATA EXAMPLES TO KNOW

 Marketing: forecast customer behavior and shape product strategies, which in turn increases sales.
 Transportation: assist in GPS navigation, traffic and weather alerts, congestion management and traffic control, route planning, and traffic safety.
 Government and public administration: track tax, defense, and public health data.
 Business: streamline management operations and optimize costs.
 Healthcare: access medical records and accelerate treatment development.
 Cybersecurity: detect system vulnerabilities and cyber threats.
 Banking and finance: fraud detection (e.g., credit card purchase patterns) and risk management.

Evolution of Big Data:

Looking at the last few decades, big data technology has grown enormously. The major milestones in the evolution of big data are described below:

1. Data Warehousing:

A data warehouse is an enterprise system used for the analysis and reporting of
structured and semi-structured data from multiple sources, such as point-of-sale
transactions, marketing automation, customer relationship management, and more.
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.

2. Hadoop:

Hadoop is an open-source framework that manages the storage and processing of large amounts of data for applications. It uses distributed storage and parallel processing to handle big data and analytics jobs.
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella.
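Hadoop's processing model, MapReduce, works in two phases: a map phase that emits key-value pairs and a reduce phase that aggregates them by key. The following is only a toy single-machine Python sketch of that idea, not Hadoop itself (real Hadoop distributes both phases across many nodes):

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word; Hadoop runs this on many nodes.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # Sum the counts per word; Hadoop groups pairs by key before reducing.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data needs big storage", "big data needs parallel processing"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts["big"], counts["data"])  # 3 2
```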

3. NoSQL Databases:

NoSQL, which stands for "not only SQL" or "non-SQL", is a database design
approach that allows for the storage and querying of data outside of traditional
relational database structures.
In 2009, NoSQL databases were introduced, which provide a flexible way to
store and retrieve unstructured data.
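As an illustrative sketch (not any particular product's API), a key-value or document store can be pictured as a lookup table whose values need not share a schema:

```python
# Miniature "document store": records are fetched by key, and each value
# can have a different shape -- no table schema is enforced.
store = {}

def put(key, document):
    store[key] = document

def get(key):
    return store.get(key)

put("user:1", {"name": "Asha", "city": "Pune"})
put("user:2", {"name": "Ravi", "last_login": "2024-01-05"})  # different fields

print(get("user:1")["name"])  # Asha
```

Real NoSQL systems (Redis, MongoDB, Cassandra, and others) add persistence, replication, and distribution on top of this basic idea.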

4. Cloud Computing:

Cloud computing technology helps companies store their important data in remote data centers, saving infrastructure and maintenance costs. Examples: Gmail, Dropbox, Facebook, YouTube, etc.

5. Machine Learning:

Machine learning algorithms work on large volumes of data, analyzing huge amounts of it to extract meaningful insights. This has led to the development of artificial intelligence (AI) applications.

6. Data Streaming:

Data streaming technology has emerged as a solution to process large volumes of data in real time. Data streaming is the process of continuously transmitting data as streams: series of data elements ordered by time.
Examples of data streaming applications:
Real-time stock trades
Retail inventory management
Social media feeds
Multiplayer games
Ride-sharing apps
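A stream can be pictured as a time-ordered sequence processed one element at a time, updating results as events arrive rather than waiting for a complete dataset. A minimal Python sketch (the sensor readings are invented):

```python
def sensor_stream(readings):
    # Yield time-ordered (timestamp, value) events one at a time,
    # instead of loading the whole dataset into memory first.
    for event in readings:
        yield event

# Maintain a running average as events arrive -- the essence of streaming.
events = [(1, 10.0), (2, 12.0), (3, 11.0)]
total, count = 0.0, 0
for ts, value in sensor_stream(events):
    total += value
    count += 1
    print(f"t={ts} running average = {total / count:.2f}")
```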

7. Edge Computing:

Edge computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.
Edge computing is used to process time-sensitive data, while cloud computing is used to process data that is not time-driven. Edge computing uses less bandwidth and has lower latency.

Edge computing is used in various applications, such as:

IoT devices
Autonomous vehicles
Industrial automation

Ensuring worker safety at construction sites

Overall, big data technology has come a long way since the early days of data
warehousing. The introduction of Hadoop, NoSQL databases, cloud computing, machine
learning, data streaming, and edge computing has revolutionized how we store, process,
and analyze large volumes of data. As technology evolves, we can expect Big Data to
play a very important role in various industries.

Career in Big Data

1. Big Data Engineer


Median Salary: $151,300
Big data engineers are similar to data analysts in that they turn large volumes of data into
insights that organizations can use to make smarter business decisions. However, they’re
also tasked with retrieving, interpreting, analyzing, and reporting on a business’s data
—which they typically have to gather from a variety of different sources.
These professionals are also often responsible for creating and maintaining the
company’s software and hardware architecture, including the systems and processes users
need to work with that data.

2. Data Architect
Median Salary: $137,000
These professionals design the structure of complex data frameworks and build and
maintain these databases. Data architects develop strategies for each subject area of the
enterprise data model and communicate plans, status, and issues to their company’s
executives.

3. Data Modeler
Median Salary: $130,800
These professionals turn large volumes of data into insights, such as micro and macro
trends, which are gathered into business reports. Data modelers must be skilled in both
information science and statistical analysis and should have proficient programming
skills.

Data modelers often specialize in a particular business area, making it easier to find
useful data trends for their employers.

4. Data Scientist
Median Salary: $122,100
Data scientists design and construct new processes for modeling, data mining, and
production. In addition to conducting data studies and product experiments, these
professionals are tasked with developing prototypes, algorithms, predictive models, and
custom analyses.
Previous work experience in a similar position is usually required, and data scientists
should be skilled in different data mining techniques, such as clustering, regression
analysis, and decision trees.

5. Database Developer
Median Salary: $109,300
Database developers are responsible for analyzing current database processes in order to
modernize, streamline, or eliminate inefficient coding. These professionals are often
charged with monitoring database performance, developing new databases, and
troubleshooting issues as they arise.
Database developers work closely with other members of the development team. They’re
often required to have prior experience with database development, data analysis, and
unit testing.

6. Database Manager
Median Salary: $106,400
Database managers identify problems that occur in databases, take corrective action to
remedy those issues, and assist with the design and physical implementation of storage
hardware and maintenance. They are also responsible for storing and analyzing their
organization’s data.
These professionals work closely with database developers and often provide guidance
and training to lower-level staff.

7. Database Administrator
Median Salary: $105,300
These professionals are responsible for monitoring and optimizing database performance
to avoid damaging effects caused by constant access and high traffic. They also
coordinate with IT security professionals to ensure data security. Database administrators
typically have prior experience working on database administration teams.

8. Data Security Analyst


Median Salary: $97,500
Data security analysts perform security audits, risk assessments, and analyses to help
make recommendations for enhancing data systems security. They often research
attempted breaches of data security and formulate security policies and procedures to
rectify security weaknesses.

9. Business Intelligence Analyst


Median Salary: $97,500
Business intelligence analysts turn companies’ data into insights that executives can use
to make better business decisions. These professionals often respond to management’s
requests for specific information but might also scrutinize data independently to find
patterns and trends.
Business intelligence analysts should have a strong background in analytical and
reporting tools, several years of experience with database queries and stored procedure
writing, as well as online analytical processing (OLAP) and data cube technology skills.

10. Data Analyst


Median Salary: $92,900
Data analysts work with large volumes of data, turning them into insights businesses can
leverage to make better decisions. They work across a variety of industries—from
healthcare and finance to retail and technology.
Data analysts work to improve their own systems to make relaying future insights easier.
The goal is to develop methods to analyze large data sets that can be easily reproduced
and scaled.

Future Applications of Big Data

If you have doubts about the future of big data and whether it actually makes sense,
you can lay them to rest here. After all, big data applications are immense and can
convince all the naysayers.

1. Healthcare

The timely and accurate analysis of medical data can save countless lives. Big data analysis has the potential to revolutionize the healthcare industry. From predictive devices to enhanced diagnostic accuracy, from real-time imaging data to optimized treatment plans, you can have it all at your fingertips. As healthcare generates an ever-increasing volume of big data, future opportunities are bound to rise.

2. Automobile Development

Cars are becoming more intelligent day by day. They generate vast
amounts of data on driving patterns, accidents and their cause, and route optimization.
Manufacturers can use these data points to find patterns and correlations among
different factors. They can study how traffic conditions affect the routes or how driving
speed increases the likelihood of an accident. Using this analysis, they can come up
with optimized solutions.

3. Smart Assistant Development

Virtual assistants like Siri, Alexa, and Google Assistant are all the rage nowadays, and it's all possible because of big data. These assistants are trained on massive volumes of
big data accumulated over a long period of time. This data helps to train machine
learning models based on natural language processing and generate an output
accordingly.

4. Industrial Automation Systems

Industrial automation is the secret to scaling up your business. No matter how efficient
you are, you cannot be everywhere simultaneously. You will need to delegate tasks.
This is where industrial automation systems can help. These systems have been trained
on vast amounts of data, like sensor data, equipment malfunctions, maintenance
records, etc. By analyzing and learning from this data, you can identify inefficiencies
in the process and evaluate performance. Think of yourself as the puppet master. You
can move pieces and change strategies to streamline operations.

Key Technologies Driving the Future of Big Data

1. Artificial Intelligence (AI) and Machine Learning (ML)

AI and ML are at the forefront of big data advancements. These technologies can automate almost any repetitive task you don't want to do: you can automate the sharing of personalized recommendations, or even the entire order fulfillment process, based on customer behavior on an e-commerce website.

2. Cloud Computing

The main reason cloud storage and computing have become so popular is that they bundle scalability with cost-effectiveness: they cost much less than buying dedicated server space.

3. Predictive Analytics

Predictive analytics plays a huge role in the future of big data. As the name suggests, it
utilizes historical data to make predictions.

For example, you can use predictive analytics on your big data sets to determine when
your product will spike in demand. Knowing this in advance or even having an
approximation can allow you to plan appropriately.

4. Internet of Things (IoT)

The popularity of IoT devices will generate a massive amount of data. These devices
collect and store data in real time. Not only that, but they also analyze that data and
provide insights and personalized recommendations. IoT devices will play a significant
role in shaping the future of big data.

5. SQL-based Technologies

SQL is the most prominent programming language used to work with relational databases (where data is stored in rows and columns). For big data clusters (data sets far too large for a spreadsheet), platforms such as SQL Server Big Data Clusters build on HDFS, the Hadoop Distributed File System, allowing standard SQL technologies and operations to scale out while maintaining accuracy. Clearly, SQL widens the future scope of big data.
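The row-and-column model that SQL queries can be shown with Python's built-in sqlite3 module; the table and figures below are invented for illustration:

```python
import sqlite3

# An in-memory SQLite database stands in for a relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 50.0)])

# GROUP BY aggregation: the kind of query SQL-on-big-data engines scale out.
rows = list(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"))
for region, total in rows:
    print(region, total)
```

Engines like Hive or Spark SQL run the same style of query, but split the scan and aggregation across many machines.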

INTRODUCTION TO BIG DATA:

Big data is a collection of data from many different sources. It is often described by five
characteristics: volume, value, variety, velocity, and veracity.
The rise of sensor-based internet-enabled devices, known as the "Internet of Things"
(IoT), is increasing the generation of data. Millions of TVs, thermostats, wearables, and
even refrigerators are now generating zettabytes of data every day.
Big data has become an increasingly important area of research and application with the
advent of new technologies such as cloud computing, machine learning, and artificial
intelligence.
The challenges include capture, storage, search, sharing, transfer, analysis and
visualization.
Elements of Big Data: Big data is often described by five characteristics (the five V's): volume, value, variety, velocity, and veracity.

 Volume: the size and amounts of big data that companies manage and analyze.

Walmart handles 1 million customer transactions per hour.
Facebook handles 40 billion photos from its user base and inserts 500 terabytes of new data every day.
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide.
A flight generates 240 terabytes of flight data in 6-8 hours of flying.

 Value: the most important "V" from the perspective of the business; the value of big data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships, and other clear and quantifiable business benefits.

 Variety: the diversity and range of different data types, including unstructured data,
semi-structured data and raw data
 Velocity: the speed at which companies receive, store and manage data – e.g., the
specific number of social media posts or search queries received within a day, hour or
other unit of time

As an example, in healthcare, many medical devices today are designed to monitor patients and collect data. From in-hospital medical equipment to wearable devices, collected data needs to be sent to its destination and analyzed quickly.

 Veracity: the "truth" or accuracy of data and information assets, which often determines executive-level confidence. Veracity refers to the quality, accuracy, integrity, and credibility of data. Gathered data could have missing pieces, might be inaccurate, or might not be able to provide real, valuable insight. Overall, veracity refers to the level of trust there is in the collected data.

Data can sometimes become messy and difficult to use. A large amount of data can cause
more confusion than insights if it's incomplete. For example, in the medical field, if data
about what drugs a patient is taking is incomplete, the patient's life could be endangered.

Both value and veracity help define the quality and insights gathered from data.
Thresholds for the truth of data often -- and should -- exist in an organization at the
executive level, to determine whether it's suitable for high-level decision-making.

The additional characteristic of variability can also be considered:
 Variability: the changing nature of the data companies seek to capture, manage and
analyze – e.g., in sentiment or text analytics, changes in the meaning of key words or
phrases

NOTE: IBM considers three elements of big data (the 3 V's): Volume, Velocity, and Variety.

What is big data analytics?


Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions, drawing on statistical techniques such as clustering and regression. Big data has been a buzzword since the early 2000s. With the explosion of data, early innovation projects like Hadoop, Spark, and NoSQL databases were created for the storage and processing of big data. This field continues to evolve as data engineers look for ways to integrate the vast amounts of complex information created by sensors, networks, transactions, smart devices, web usage, and more. Even now, big data analytics methods are being used with emerging technologies, like machine learning, to discover and scale more complex insights.

How big data analytics works ?

Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to help organizations operationalize their big data.

1. Collect Data

Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of sources
— from cloud storage to mobile applications to in-store IoT sensors and beyond. Some
data will be stored in data warehouses where business intelligence tools and solutions can
access it easily. Raw or unstructured data that is too diverse or complex for a warehouse
may be assigned metadata and stored in a data lake.

2. Process Data

Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing
option is batch processing, which looks at large data blocks over time. Batch processing
is useful when there is a longer turnaround time between collecting and analyzing
data. Stream processing looks at small batches of data at once, shortening the delay time
between collection and analysis for quicker decision-making. Stream processing is more
complex and often more expensive.
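The trade-off between the two can be sketched in a few lines of Python: batch processing produces one answer once all the data is in, while stream processing yields an updated answer after every record (the numbers are invented):

```python
def batch_process(records):
    # Batch: wait for the whole block of data, then compute once.
    return sum(records) / len(records)

def stream_process(records):
    # Stream: update the answer as each record arrives.
    total = count = 0
    for r in records:
        total += r
        count += 1
        yield total / count  # a fresh answer after every record

data = [4, 8, 6]
print(batch_process(data))         # one result at the end: 6.0
print(list(stream_process(data)))  # intermediate results: [4.0, 6.0, 6.0]
```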

3. Clean Data

Data big or small requires scrubbing to improve data quality and get stronger results; all
data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed insights.
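A minimal sketch of those scrubbing steps (the records and the required field are invented): normalize formatting, drop rows missing required fields, and eliminate duplicates:

```python
def clean(records):
    # Normalize formatting, then drop incomplete rows and duplicates.
    seen, cleaned = set(), []
    for rec in records:
        name = rec.get("name", "").strip().lower()  # consistent format
        if not name:          # drop records missing a required field
            continue
        if name in seen:      # drop duplicates
            continue
        seen.add(name)
        cleaned.append({"name": name, "age": rec.get("age")})
    return cleaned

raw = [{"name": " Asha "}, {"name": "asha"},
       {"name": ""}, {"name": "Ravi", "age": 30}]
cleaned = clean(raw)
print(cleaned)  # two clean records remain
```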

4. Analyze Data

Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights. Some of these big data analysis methods
include:

 Data mining sorts through large datasets to identify patterns and relationships by
identifying anomalies and creating data clusters.
 Predictive analytics uses an organization’s historical data to make predictions
about the future, identifying upcoming risks and opportunities.
 Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and
abstract data.
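Predictive analytics in its simplest form fits a model to historical data and extrapolates. As a toy sketch (the monthly sales figures are invented), ordinary least squares on past points predicts the next one:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b over historical points.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

months = [1, 2, 3, 4]
sales = [100, 120, 140, 160]   # invented historical data
a, b = fit_line(months, sales)
print(a * 5 + b)               # predicted sales for month 5: 180.0
```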

Technologies for handling big data:
Distributed computing: Distributed computing is the method of making multiple computers work together to solve a common problem. It makes a computer network appear as a powerful single computer that provides large-scale resources to deal with complex challenges. Example: mobile and web applications, where several machines work together in the backend to give you the correct information.

Parallel computing: Parallel computing is a computing architecture that involves breaking down large problems into smaller, independent parts that can be processed simultaneously. The results are then combined as part of an overall algorithm. Example: a self-driving car is an excellent application of parallel computing; while the vehicle is in motion, multiple sensors and processors must simultaneously pay attention to traffic, lights, and road signs.
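The break-down/combine pattern described above can be sketched with Python's concurrent.futures (summing squares here is just a stand-in for any workload that divides into independent parts):

```python
from concurrent.futures import ProcessPoolExecutor

def part_sum(chunk):
    # One independent piece of the larger problem.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]  # split into 4 independent parts
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(part_sum, chunks))
    print(sum(partials))  # combined result equals the sequential answer
```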

In-memory computing means using a type of middleware software that allows one to store data in RAM across a cluster of computers and process it in parallel. Storing information in RAM makes computation much faster than storing it on spinning disks.

RAM storage and parallel distributed processing are the two fundamental pillars of in-memory computing. In IMC, the entire dataset is divided across the individual computers' memory, each storing only a portion of the overall dataset. Once data is partitioned this way, parallel distributed processing becomes a technical necessity simply because the data is stored that way.
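That partition-then-process-locally idea can be sketched in a few lines (three Python lists stand in for the RAM of three cluster nodes):

```python
# Each "node" holds one slice of the dataset in memory; a query runs on
# every node where its data lives, and the partial results are merged.
dataset = list(range(100))
nodes = [dataset[i::3] for i in range(3)]  # partition across 3 "nodes"

def local_query(partition):
    # Runs against the in-memory partition -- no data movement needed.
    return max(partition)

result = max(local_query(p) for p in nodes)  # merge partial results
print(result)  # 99
```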

