What Is Big Data
What is Data?
Data is a collection of information that can be used for a specific purpose. It can be made
up of facts, numbers, names, figures, or descriptions of things, and it can come in the
form of text, observations, images, graphs, or symbols.
Characteristics of Data:
Composition: The composition of data deals with the structure of data, that is, the
sources of data, the granularity, the types, and the nature of data as to whether it is
static or real-time streaming.
Condition: The condition of data deals with the state of data, that is, "Can one use this
data as is for analysis?" or "Does it require cleansing for further enhancement and
enrichment?"
Context: The context of data deals with "Where has this data been generated?" "Why
was this data generated?" "How sensitive is this data?" "What are the events associated
with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty: known
data sources and no major changes to the composition or context of data.
[Figure: the three characteristics of data - Composition, Condition, and Context]
Big data refers to extremely large data sets that may be analyzed computationally to reveal
patterns, trends, and associations, especially relating to human behavior and interactions.
Big data is a collection of structured, unstructured, and semi-structured data that is large
and diverse: so large and complex that traditional data management systems cannot
store, process, and analyze it.
Structuring Big Data
Structured data is data that has a standardized format and a well-defined structure. It
follows a consistent order and is easily accessible by humans and computer programs.
Unstructured data is information that doesn't have a predefined data model or structure. It
can be human-generated or machine-generated, and it can be textual or non-textual.
Semi-structured data has some structure but doesn't conform to a formal data model. It
doesn't have a fixed schema, but it does have some organizational properties that make it
easier to analyze than unstructured data; web pages and emails are common examples.
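To make the distinction concrete, here is a minimal Python sketch (the records and field names are invented for illustration) that handles the same kind of customer information in structured, semi-structured, and unstructured form:

```python
import csv
import io
import json

# Structured: fixed schema, every record has the same columns (CSV here).
structured = "id,name,age\n1,Asha,34\n2,Ravi,29\n"
rows = list(csv.DictReader(io.StringIO(structured)))
print(rows[0]["name"])          # fields are addressed by a known schema

# Semi-structured: self-describing tags, but no fixed schema --
# one record may carry fields another lacks (JSON here).
semi = json.loads('{"id": 1, "name": "Asha", "emails": ["asha@example.com"]}')
print(semi.get("emails", []))   # optional fields handled defensively

# Unstructured: free text with no data model; extracting meaning
# requires parsing, search, or natural-language techniques.
unstructured = "Asha (34) wrote in asking about her last order."
print("order" in unstructured)  # only crude keyword checks work directly
```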
The term "big data" is credited to John R. Mashey (who at the time worked at Silicon
Graphics), often called the father of big data; the term gained popularity in the early 1990s.
The history of big data can be traced back to 1663 when John Graunt used statistical data
analysis to study the bubonic plague. In the 1960s, IBM introduced the first computer
system capable of processing large amounts of data. The 1970s and 1980s saw the
development of databases and data-processing software.
The 1970s and before were the era of mainframes, when data was essentially primitive and
structured. Relational databases evolved in the 1980s and 1990s, the era of data-intensive
applications. The World Wide Web (WWW) and the Internet of Things (IoT) have since
led to an onslaught of structured, unstructured, and multimedia data.
Applications of Big Data:
Marketing: forecast customer behavior and product strategies, which in turn increases
sales.
Transportation: assist in GPS navigation and traffic and weather alerts.
Government and public administration: track tax, defense and public health data.
Business: streamline management operations and optimize costs.
Healthcare: access medical records and accelerate treatment development.
Cybersecurity: detect system vulnerabilities and cyber threats.
Banking and finance: detect fraud (e.g., credit card purchase patterns) and manage risk.
Looking back over the last few decades, we can see that Big Data technology has grown
enormously. The major milestones in the evolution of Big Data are described below:
1. Data Warehousing:
A data warehouse is an enterprise system used for the analysis and reporting of
structured and semi-structured data from multiple sources, such as point-of-sale
transactions, marketing automation, customer relationship management, and more.
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop, an open-source Apache framework released in 2006, made it practical to
store and process very large data sets on clusters of commodity hardware. Its two
core components are the Hadoop Distributed File System (HDFS) for storage and
the MapReduce programming model for parallel processing.
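As a rough illustration of the MapReduce model that Hadoop popularized (a plain-Python sketch of word counting, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is big", "data about data"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```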
3. NoSQL Databases:
NoSQL, which stands for "not only SQL" or "non-SQL", is a database design
approach that allows for the storage and querying of data outside of traditional
relational database structures.
In 2009, NoSQL databases were introduced, which provide a flexible way to
store and retrieve unstructured data.
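As a minimal sketch of the document-store flavor of NoSQL (using the pymongo driver; this assumes a MongoDB server running locally, and the database, collection, and field names are invented for illustration):

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (assumption: server on default port).
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["products"]

# Documents in one collection need not share a schema:
collection.insert_one({"name": "laptop", "price": 900, "specs": {"ram_gb": 16}})
collection.insert_one({"name": "ebook", "price": 12})  # no "specs" field at all

# Query by field, including nested fields, without defining tables up front.
print(collection.find_one({"specs.ram_gb": 16}))
```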
4. Cloud Computing:
Cloud computing technology helps companies store their important data in remote
data centers, saving infrastructure and maintenance costs. Examples: Gmail,
Dropbox, Facebook, YouTube.
5. Machine Learning:
Machine learning algorithms analyze huge amounts of data to extract meaningful
insights from it. This has driven the development of artificial intelligence (AI)
applications.
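As a minimal sketch of learning from data (a toy supervised model with scikit-learn; the numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data set: hours of machine use vs. observed energy consumption.
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
energy = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# Fit a linear model on the historical data, then predict an unseen input.
model = LinearRegression().fit(hours, energy)
print(model.predict(np.array([[6.0]])))  # roughly 12, following the trend
```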
6. Data Streaming:
Data streaming technologies process data continuously, as it is generated, rather
than collecting it first and analyzing it later in batches (a minimal streaming
sketch appears after the next milestone's examples).
7. Edge Computing:
Edge computing processes data close to where it is produced, instead of sending
everything to a central data center, which reduces latency and bandwidth use.
Typical applications include:
IoT devices
Autonomous vehicles
Industrial automation
Ensuring worker safety at construction sites
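As a minimal sketch of the streaming idea (plain Python over an invented sensor feed; real deployments would use a framework such as Apache Kafka or Spark Streaming), each reading is processed the moment it arrives, here through a fixed-size sliding window:

```python
from collections import deque

def rolling_average(stream, window_size=3):
    """Process readings one at a time, keeping only a small window in memory."""
    window = deque(maxlen=window_size)
    for reading in stream:          # the stream may be endless
        window.append(reading)
        yield sum(window) / len(window)

sensor_feed = iter([20.0, 21.5, 23.0, 22.0, 24.5])  # stand-in for live data
for avg in rolling_average(sensor_feed):
    print(f"current rolling average: {avg:.2f}")
```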
Overall, big data technology has come a long way since the early days of data
warehousing. The introduction of Hadoop, NoSQL databases, cloud computing, machine
learning, data streaming, and edge computing has revolutionized how we store, process,
and analyze large volumes of data. As technology evolves, we can expect Big Data to
play a very important role in various industries.
Careers in Big Data
2. Data Architect
Median Salary: $137,000
These professionals design the structure of complex data frameworks and build and
maintain these databases. Data architects develop strategies for each subject area of the
enterprise data model and communicate plans, status, and issues to their company’s
executives.
3. Data Modeler
Median Salary: $130,800
These professionals turn large volumes of data into insights, such as micro and macro
trends, which are gathered into business reports. Data modelers must be skilled in both
information science and statistical analysis and should have proficient programming
skills.
Data modelers often specialize in a particular business area, making it easier to find
useful data trends for their employers.
4. Data Scientist
Median Salary: $122,100
Data scientists design and construct new processes for modeling, data mining, and
production. In addition to conducting data studies and product experiments, these
professionals are tasked with developing prototypes, algorithms, predictive models, and
custom analyses.
Previous work experience in a similar position is usually required, and data scientists
should be skilled in different data mining techniques, such as clustering, regression
analysis, and decision trees.
5. Database Developer
Median Salary: $109,300
Database developers are responsible for analyzing current database processes in order to
modernize, streamline, or eliminate inefficient coding. These professionals are often
charged with monitoring database performance, developing new databases, and
troubleshooting issues as they arise.
Database developers work closely with other members of the development team. They’re
often required to have prior experience with database development, data analysis, and
unit testing.
6. Database Manager
Median Salary: $106,400
Database managers identify problems that occur in databases, take corrective action to
remedy those issues, and assist with the design and physical implementation of storage
hardware and maintenance. They are also responsible for storing and analyzing their
organization’s data.
These professionals work closely with database developers and often provide guidance
and training to lower-level staff.
7. Database Administrator
Median Salary: $105,300
These professionals are responsible for monitoring and optimizing database performance
to avoid damaging effects caused by constant access and high traffic. They also
coordinate with IT security professionals to ensure data security. Database administrators
typically have prior experience working on database administration teams.
Future Applications of Big Data
If you have doubts about the future of big data and whether it actually makes sense,
you can lay them to rest here. After all, big data applications are immense and can
convince all the naysayers.
1. Healthcare
The timely and accurate analysis of medical data can save countless lives. Big data
analysis has the potential to revolutionize the healthcare industry and change the
way we do things. From predictive devices to enhanced diagnostic accuracy, from real-
time imaging data to optimized treatment plans, you can have it all at your
fingertips. As healthcare generates an ever-increasing volume of big data, future
opportunities are bound to grow.
2. Automobile Development
Cars are becoming more intelligent by the day. They generate vast
amounts of data on driving patterns, accidents and their causes, and route optimization.
Manufacturers can use these data points to find patterns and correlations among
different factors. They can study how traffic conditions affect the routes or how driving
speed increases the likelihood of an accident. Using this analysis, they can come up
with optimized solutions.
3. Virtual Assistants
Virtual assistants like Siri, Alexa, and Google Assistant are all the rage nowadays. And
it's all possible because of big data. These assistants are trained on massive volumes of
big data accumulated over a long period of time. This data helps train machine
learning models based on natural language processing, which generate output
accordingly.
4. Industrial Automation Systems
Industrial automation is the secret to scaling up your business. No matter how efficient
you are, you cannot be everywhere simultaneously. You will need to delegate tasks.
This is where industrial automation systems can help. These systems have been trained
on vast amounts of data, like sensor data, equipment malfunctions, maintenance
records, etc. By analyzing and learning from this data, you can identify inefficiencies
in the process and evaluate performance. Think of yourself as the puppet master. You
can move pieces and change strategies to streamline operations.
Future Trends in Big Data
1. AI and ML
AI and ML are at the forefront of big data advancements. These technologies can
automate almost any repetitive task you don't want to do. You can automate the
sharing of personalized recommendations, or even the entire order-fulfillment
process, based on customer behavior on an e-commerce website.
2. Cloud Computing
The main reason cloud storage and computing have become so popular is that they
bundle scalability with cost-effectiveness. That they cost much less than buying
dedicated server space is an added advantage.
3. Predictive Analytics
Predictive analytics plays a huge role in the future of big data. As the name suggests, it
utilizes historical data to make predictions.
For example, you can use predictive analytics on your big data sets to determine when
your product will spike in demand. Knowing this in advance or even having an
approximation can allow you to plan appropriately.
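As a minimal sketch of that idea (a simple trend-line forecast with numpy; the monthly demand figures are invented for illustration):

```python
import numpy as np

# Invented historical demand (units sold per month, months 1-6).
months = np.array([1, 2, 3, 4, 5, 6])
demand = np.array([120, 135, 150, 170, 185, 205])

# Fit a straight-line trend to the history, then extrapolate to month 7.
slope, intercept = np.polyfit(months, demand, 1)
forecast = slope * 7 + intercept
print(f"expected demand in month 7: about {forecast:.0f} units")
```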
4. Internet of Things (IoT)
The popularity of IoT devices will generate a massive amount of data. These devices
collect and store data in real time. Not only that, but they also analyze that data and
provide insights and personalized recommendations. IoT devices will play a significant
role in shaping the future of big data.
5. SQL-based Technologies
SQL is the most prominent programming language for working with relational
databases (where data is stored in rows and columns). For big data clusters (data sets
so big that an Excel sheet won't do), Microsoft's SQL Server Big Data Clusters
integrate the Hadoop Distributed File System (HDFS), which helps standard data
technologies and operations scale while maintaining accuracy. Clearly, SQL widens
the future scope for big data.
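As a small sketch of SQL over relational data (using Python's built-in sqlite3 module and an invented sales table; production systems would target a server such as SQL Server or PostgreSQL):

```python
import sqlite3

# An in-memory relational database: data lives in rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 95.5), ("north", 80.0)],
)

# Declarative query: say *what* you want, not how to compute it.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)  # north 200.0, south 95.5
```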
Big data is a collection of data from many different sources. It is often described by five
characteristics: volume, value, variety, velocity, and veracity.
The rise of sensor-based, internet-enabled devices, known as the "Internet of Things"
(IoT), is increasing the generation of data. Millions of TVs, thermostats, wearables, and
even refrigerators now generate zettabytes of data every day.
Big data has become an increasingly important area of research and application with the
advent of new technologies such as cloud computing, machine learning, and artificial
intelligence.
The challenges include capture, storage, search, sharing, transfer, analysis and
visualization.
Elements of Big Data: There are five elements of big data: volume, value, variety,
velocity, and veracity.
Volume: the size and amounts of big data that companies manage and analyze
Facebook inserts 500 terabytes of new data every day.
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones
worldwide.
A flight generates 240 terabytes of flight data in 6-8 hours of flight.
Value: the most important “V” from the perspective of the business, the value of big data
usually comes from insight discovery and pattern recognition that lead to more effective
operations, stronger customer relationships and other clear and quantifiable business
benefits
Variety: the diversity and range of different data types, including unstructured data,
semi-structured data and raw data
Velocity: the speed at which companies receive, store and manage data – e.g., the
specific number of social media posts or search queries received within a day, hour or
other unit of time
Veracity: the “truth” or accuracy of data and information assets, which often determines
executive-level confidence. Veracity refers to the quality, accuracy, integrity and
credibility of data. Gathered data could have missing pieces, might be inaccurate or might
not be able to provide real, valuable insight. Veracity, overall, refers to the level of trust
there is in the collected data.
Data can sometimes become messy and difficult to use. A large amount of data can cause
more confusion than insights if it's incomplete. For example, in the medical field, if data
about what drugs a patient is taking is incomplete, the patient's life could be endangered.
Both value and veracity help define the quality and insights gathered from data.
Thresholds for the truth of data often -- and should -- exist in an organization at the
executive level, to determine whether it's suitable for high-level decision-making.
The additional characteristic of variability can also be considered:
Variability: the changing nature of the data companies seek to capture, manage and
analyze – e.g., in sentiment or text analytics, changes in the meaning of key words or
phrases
NOTE: IBM considers three elements of Big Data (3 V's): Volume, Velocity, and Variety.
Big Data Analytics
Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets
to help organizations operationalize their big data.
1. Collect Data
Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of sources
— from cloud storage to mobile applications to in-store IoT sensors and beyond. Some
data will be stored in data warehouses where business intelligence tools and solutions can
access it easily. Raw or unstructured data that is too diverse or complex for a warehouse
may be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing
option is batch processing, which looks at large data blocks over time. Batch processing
is useful when there is a longer turnaround time between collecting and analyzing
data. Stream processing looks at small batches of data at once, shortening the delay time
between collection and analysis for quicker decision-making. Stream processing is more
complex and often more expensive.
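Here is a toy contrast between the two options (plain Python over invented readings; real systems would typically use a framework such as Apache Spark for batch jobs or Apache Flink/Kafka Streams for streaming):

```python
events = [5, 3, 8, 2, 7, 4]  # invented readings arriving over time

# Batch processing: wait until the whole block has been collected,
# then analyze it in one pass.
def batch_total(collected):
    return sum(collected)

print("batch total:", batch_total(events))

# Stream processing: update the answer as each event arrives,
# so a decision can be made before the data stops coming.
def stream_totals(feed):
    running = 0
    for value in feed:
        running += value
        yield running  # an up-to-date answer after every event

for total in stream_totals(events):
    print("running total so far:", total)
```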
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all
data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed insights.
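A small pandas sketch of typical scrubbing steps (the records and column names are invented for illustration):

```python
import pandas as pd

# Invented raw records with the usual problems: inconsistent formatting,
# a duplicate row, and a missing value.
raw = pd.DataFrame({
    "city": [" Pune", "pune", "Mumbai", None],
    "sales": [100, 100, 250, 75],
})

clean = (
    raw
    .assign(city=raw["city"].str.strip().str.title())  # normalize formatting
    .drop_duplicates()                                  # remove duplicate rows
    .dropna(subset=["city"])                            # drop rows missing a key field
)
print(clean)
```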
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights. Some of these big data analysis methods
include:
Data mining sorts through large datasets to identify patterns and relationships by
identifying anomalies and creating data clusters (see the clustering sketch after this list).
Predictive analytics uses an organization’s historical data to make predictions
about the future, identifying upcoming risks and opportunities.
Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and
abstract data.
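As a minimal sketch of the clustering side of data mining (scikit-learn's KMeans over invented two-dimensional customer points):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: two loose groups of customer (spend, visits) points.
points = np.array([
    [1.0, 2.0], [1.2, 1.8], [0.8, 2.2],   # low-spend group
    [8.0, 9.0], [8.3, 8.7], [7.9, 9.1],   # high-spend group
])

# Ask for two clusters and read back the group assigned to each point.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # e.g., [0 0 0 1 1 1] (labels may be swapped)
print(model.cluster_centers_)  # the discovered group centers
```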
Technologies for handling big data:
Distributed computing: Distributed computing is the method of making multiple
computers work together to solve a common problem. It makes a computer network
appear as a single powerful computer that provides large-scale resources to deal with
complex challenges. For example, mobile and web applications rely on distributed
computing: several machines work together in the backend so the application can
give you the correct information.
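As a single-machine stand-in for the idea (Python's concurrent.futures farms independent chunks of work out to several worker processes; a real deployment would spread these workers across a network of machines):

```python
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk):
    """Work unit each 'node' performs independently on its share of the data."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    # Invented corpus, split into chunks as a coordinator would split a job.
    lines = ["big data needs many machines"] * 1000
    chunks = [lines[i:i + 250] for i in range(0, len(lines), 250)]

    # Each chunk is handled by a separate worker process in parallel.
    with ProcessPoolExecutor() as pool:
        partial_counts = list(pool.map(count_words, chunks))

    # The coordinator combines the partial results into the final answer.
    print(sum(partial_counts))  # 5000 words in total
```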
In-memory computing means using a type of middleware software that allows data to be
stored in RAM, across a cluster of computers, and processed in parallel. Storing
information in RAM makes computation much faster than storing it on spinning disks.
RAM storage and parallel distributed processing are the two fundamental pillars of in-
memory computing. In IMC, the entire dataset is divided across the individual
computers' memory, each machine storing only a portion of the overall dataset. Once the
data is partitioned, parallel distributed processing becomes a technical necessity simply
because that is how the data is stored.
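A toy sketch of the partition-then-process pattern (plain Python multiprocessing standing in for IMC middleware such as Apache Ignite or Hazelcast; each worker holds only its slice of the dataset in its own memory):

```python
from multiprocessing import Pool

def partition_sum(partition):
    """Each worker keeps just its partition in RAM and aggregates it locally."""
    return sum(partition)

if __name__ == "__main__":
    dataset = list(range(1_000_000))  # invented dataset

    # Split the dataset into four partitions, one per worker's memory.
    n = 4
    size = len(dataset) // n
    partitions = [dataset[i * size:(i + 1) * size] for i in range(n)]

    # Because the data lives in separate memories, the processing
    # naturally runs in parallel, one task per partition.
    with Pool(processes=n) as pool:
        partial = pool.map(partition_sum, partitions)

    print(sum(partial))  # combine local results: 499999500000
```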