Big Data Unit 1 Notes - 240311 - 100703
DIGITAL DATA
Digital data is information stored on a computer system as a series of 0s and 1s in a binary
language. Digital data jumps from one value to the next in a discrete, step-by-step sequence.
Example: Whenever we send an email, read a social media post, or take pictures with our digital
camera, we are working with digital data.
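As a small illustration (not from the notes themselves), any piece of text can be viewed as such a series of 0s and 1s:

```python
# Sketch: text stored digitally is ultimately a sequence of 0s and 1s.
def to_bits(text: str) -> str:
    """Encode a string as its binary (UTF-8) representation."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

print(to_bits("Hi"))  # 01001000 01101001
```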
a. Unstructured Data: The data which does not conform to a data model or is not
in a form that can be used easily by a computer program is categorized as unstructured data.
About 80-90% of an organization's data is in this format.
Example: Memos, chat transcripts, PowerPoint presentations, images, videos, letters,
research papers, white papers, the body of an email, etc.
b. Semi-Structured Data: The data which does not conform to a data model but has some
structure is categorized as semi-structured data. However, it is not in a form that can be used
easily by a computer program.
Example: Emails, XML, markup languages like HTML, etc. Metadata for this data is available
but is not sufficient.
c. Structured Data: The data which is in an organized form (i.e., in rows and columns) and
can be easily used by a computer program is categorized as structured data. Relationships
exist between entities of data, such as classes and their objects.
Example: Data stored in databases.
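A minimal sketch of structured data in rows and columns, using Python's built-in sqlite3 module; the table schema and values are illustrative:

```python
import sqlite3

# Structured data: rows and columns with a fixed schema, queryable via SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Delhi")],
)
rows = conn.execute("SELECT name FROM customers WHERE city = 'Pune'").fetchall()
print(rows)  # [('Asha',)]
```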
A big data platform is a type of IT solution that combines the features and capabilities of
several big data applications and utilities within a single solution; this is then used further
for managing as well as analyzing Big Data.
It focuses on providing its users with efficient analytics tools for massive datasets.
The users of such platforms can custom build applications according to their use case, such as
calculating customer loyalty (an e-commerce use case), and so on.
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability, Performance,
and Security.
Example: Some of the most commonly used Big Data Platforms are:
● Hadoop Delta Lake Migration Platform
● Data Catalog Platform
● Data Ingestion Platform
● IoT Analytics Platform
Data storage: Data for batch processing operations is stored in a distributed file
store that can hold high volumes of large files in various formats (also called a data lake).
For example, Azure Data Lake Store or blob containers in Azure Storage.
Batch processing: Because the data sets are so large, a big data solution must process
data files using long-running batch jobs to filter, aggregate, and prepare the data for analysis.
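The filter-and-aggregate shape of such a batch job can be sketched in plain Python; the record fields ("region", "amount") are hypothetical stand-ins for a real dataset:

```python
from collections import defaultdict

# Minimal batch-job sketch: filter raw records, then aggregate totals per key.
def batch_aggregate(records):
    totals = defaultdict(float)
    for rec in records:
        if rec["amount"] > 0:  # filter out invalid rows
            totals[rec["region"]] += rec["amount"]  # aggregate per region
    return dict(totals)

raw = [
    {"region": "north", "amount": 10.0},
    {"region": "south", "amount": -1.0},  # dropped by the filter
    {"region": "north", "amount": 5.0},
]
print(batch_aggregate(raw))  # {'north': 15.0}
```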
Real-time message ingestion: If a solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream processing.
Stream processing: After capturing real-time messages, the solution must process them by
filtering, aggregating, and preparing the data for analysis. The processed stream data is then
written to an output sink. We can use open-source Apache streaming technologies like Storm and
Spark Streaming for this.
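A minimal sketch, in plain Python rather than Storm or Spark Streaming, of the capture-process-sink flow described above; the fixed-size window and list-as-sink are simplifying assumptions:

```python
# Stream processing sketch: consume messages one at a time, aggregate per
# fixed-size window, and write each window's result to an output sink
# (a list here; in practice a queue, database, or analytical store).
def process_stream(messages, window_size, sink):
    window = []
    for msg in messages:
        window.append(msg)
        if len(window) == window_size:
            sink.append(sum(window) / window_size)  # per-window average
            window = []

sink = []
process_stream([4, 6, 1, 9, 5], window_size=2, sink=sink)
print(sink)  # [5.0, 5.0] — the trailing message waits for a full window
```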
Analytical data store: Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools. Example: Azure
Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing.
Analysis and reporting: The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyze the data, the architecture may
include a data modeling layer. Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts.
Orchestration: Most big data solutions consist of repeated data processing operations that
transform source data, move data between multiple sources and sinks, load the processed data
into an analytical data store, or push the results straight to a report. To automate these
workflows, we can use an orchestration technology such as Azure Data Factory.
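The ordering problem an orchestrator solves can be sketched with Python's standard graphlib module; the step names are illustrative, not a real Azure Data Factory pipeline:

```python
from graphlib import TopologicalSorter

# Toy orchestration sketch: declare dependencies between pipeline steps and
# derive a valid execution order, the way an orchestrator schedules activities.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "load_warehouse": {"transform"},
    "report": {"load_warehouse"},
}
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['ingest', 'transform', 'load_warehouse', 'report']
```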
2. Variety :
Big Data can be structured, unstructured, and semi-structured that are being collected from
different sources. In the past, data was collected only from databases and spreadsheets, but
these days it comes in an array of forms, i.e., PDFs, emails, audio, social media
posts, photos, videos, etc.
3. Velocity :
Velocity refers to the speed with which data is generated in real-time.
Velocity plays an important role compared to the other characteristics. It covers the speed of
incoming data sets, their rate of change, and bursts of activity. A primary aspect of Big Data is
to deliver in-demand data rapidly.
Example of data that is generated with high velocity - Twitter messages or Facebook
posts.
4. Veracity
Veracity refers to the quality and trustworthiness of the data that is being analyzed, and to the
ability to handle and manage such data efficiently.
Example: Facebook posts with hashtags (noisy, user-generated content).
5. Value :
Value is an essential characteristic of big data.
It is not raw data itself that matters; it is the valuable and reliable data that we store, process,
and analyze that yields value.
a) Batch, in which large groups of data are gathered and delivered together.
b) Streaming, which is a continuous flow of data. This is necessary for real-time
data analytics.
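The contrast between the two ingestion modes can be sketched as follows (record values are illustrative):

```python
# Batch ingestion delivers records in groups; streaming yields them one by
# one as they arrive.
def ingest_batch(source, batch_size):
    batch = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # deliver the final partial group

def ingest_stream(source):
    for record in source:
        yield record  # continuous, record-at-a-time flow

data = ["a", "b", "c", "d", "e"]
print(list(ingest_batch(data, 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
print(list(ingest_stream(data)))    # ['a', 'b', 'c', 'd', 'e']
```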
2. Storage :
Storage is where the converted data is stored in a data lake or warehouse and eventually
processed.
The data lake/warehouse is the most essential component of a big data
ecosystem.
It needs to contain only complete, relevant data to make insights as valuable as possible.
It must be efficient with as little redundancy as possible to allow for quicker processing.
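One simple way to reduce redundancy before storage is exact-duplicate removal; a minimal sketch with made-up records:

```python
# Drop exact duplicate records while preserving first-seen order, so the
# lake/warehouse holds each fact only once.
def deduplicate(records):
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [{"id": 1, "v": "x"}, {"id": 1, "v": "x"}, {"id": 2, "v": "y"}]
print(deduplicate(rows))  # [{'id': 1, 'v': 'x'}, {'id': 2, 'v': 'y'}]
```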
3. Analysis :
In the analysis layer, data gets passed through several tools, shaping it into actionable insights.
There are four types of analytics on big data :
● Diagnostic: Explains why a problem is happening.
● Descriptive: Describes the current state of a business through historical data.
● Predictive: Projects future results based on historical data.
● Prescriptive: Takes predictive analytics a step further by recommending the best future actions.
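A toy illustration of the descriptive and predictive types on made-up monthly sales figures:

```python
# The numbers are invented; a real pipeline would use far richer models.
sales = [100, 110, 120, 130]  # historical data, one value per month

# Descriptive: summarize the current state from historical data.
average = sum(sales) / len(sales)

# Predictive: project the next value from the average month-over-month change.
step = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + step

print(average, forecast)  # 115.0 140.0
```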
4. Consumption :
The final big data component is presenting the information in a format digestible to the end-user.
This can be in the form of tables, advanced visualizations, and even single numbers if requested.
The most important thing in this layer is making sure the intent and meaning of the output is
understandable.
Time Reductions :
o The high speed of tools like Hadoop and in-memory analytics makes it easy to identify new
sources of data, which helps businesses analyze data immediately.
o This helps us to make quick decisions based on the learnings.
Analysis vs reporting
Reporting :
● Once data is collected, it will be organized using tools such as graphs and tables.
● The process of organizing this data is called reporting.
● Reporting translates raw data into information.
● Reporting helps companies to monitor their online business and be alerted when data falls
outside of expected ranges.
● Good reporting should raise questions about the business from its end users.
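The out-of-range alerting mentioned above can be sketched as follows (the metric names and thresholds are illustrative):

```python
# Minimal reporting-alert sketch: flag metrics that fall outside an
# expected range, the kind of check a monitoring report might run.
def check_metrics(metrics, low, high):
    return [name for name, value in metrics.items()
            if not (low <= value <= high)]

daily = {"visits": 5200, "signups": 40, "errors": 310}
alerts = check_metrics(daily, low=50, high=10000)
print(alerts)  # ['signups']
```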
Analysis :
● Analytics is the process of taking the organized data and analyzing it.
● This helps users to gain valuable insights on how businesses can improve
their performance.
● Analysis transforms data and information into insights.
The goal of analysis is to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.