
BDA Assignment 1

Big Data features and characteristics

Instead of being a single process, big data analytics is a collection of numerous business-related procedures that may involve data scientists, business management, and production teams. Data analytics itself is only one component of this larger practice. Numerous tools are employed in the big data analytics paradigm, and each of them needs to meet specific requirements.

These technologies are necessary for data scientists to speed up and increase the efficiency of the process. The main features of big data analytics are:

1. Data wrangling and Preparation

Data preparation refers to procedures carried out once during the project, before any iterative model is used. Data wrangling, by contrast, is performed during iterative analysis and model construction, typically at the feature engineering stage.
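
As a rough sketch of the distinction, the following pandas example performs one-time preparation of a raw file and then an iterative wrangling/feature-engineering step; the file name sales.csv and its columns are hypothetical.

import pandas as pd

# One-time data preparation: load the raw file, fix types, drop obvious junk.
raw = pd.read_csv("sales.csv", parse_dates=["order_date"])
prepared = (
    raw.drop_duplicates()
       .dropna(subset=["customer_id", "amount"])  # rows are useless without these
       .astype({"customer_id": "int64"})
)

# Iterative data wrangling: reshaping and feature engineering that gets
# revisited on every modelling iteration.
features = (
    prepared.assign(order_month=prepared["order_date"].dt.to_period("M"))
            .groupby(["customer_id", "order_month"], as_index=False)["amount"]
            .sum()
            .rename(columns={"amount": "monthly_spend"})
)
print(features.head())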

2. Data exploration

The initial phase in data analysis is called data exploration, and it involves looking at and visualizing data to find insights right away or to point out regions or patterns that need further investigation. Users may more quickly gain insights by using interactive dashboards and point-and-click data exploration to better understand the broader picture.
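
A minimal, code-driven counterpart to point-and-click exploration might look like the following (again assuming the hypothetical sales.csv):

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["order_date"])

print(df.shape)                     # how much data is there?
print(df.dtypes)                    # what types did we actually get?
print(df.describe(include="all"))   # per-column summary statistics
print(df["region"].value_counts())  # where do the records come from?
print(df.isna().mean().sort_values(ascending=False))  # share of missing values per column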

3. Scalability

To scale up, or vertically scale, a system, a faster server with more powerful processors and memory is needed. This technique uses less network gear and less energy, but it may only be a temporary fix for a big data analytics platform, especially if further growth is anticipated. Scaling out, or horizontally scaling, by adding more commodity servers is therefore the more common approach for big data platforms.
4. Support for various types of Analytics

Due to the big data revolution, new forms, stages, and types of data analysis have evolved. Data analytics is exploding in boardrooms all over the world, offering enterprise-wide techniques for commercial success. But what do these mean for businesses? Gaining the appropriate expertise, which results in information, enables organizations to develop a competitive edge, which is crucial for enterprises to successfully leverage big data. Big data analytics' main goal is to help firms make better business decisions.

Big data analytics shouldn't be thought of as a universal fix. The best data scientists and analysts are also distinguished from the competition by their aptitude for identifying the forms of analytics that can be applied to benefit the business the most. The three most typical categories are descriptive, predictive, and prescriptive analytics.

5. Version control

Version control, often known as source control, is the process of keeping track of and controlling changes to software code. Version control systems are computerized tools that help software development teams keep track of changes to source code over time.

6. Data management

The process of obtaining, storing, and using data in a cost-effective, efficient, and secure way is known as data management. Data management assists people, organizations, and connected things in optimizing the use of data within the bounds of policy and regulation, enabling decisions and actions that benefit the business as much as possible. As businesses increasingly rely on intangible assets to create value, an efficient data management strategy is more important than ever.
7. Data Integration

Data integration is the process of combining information from several sources to give people a cohesive perspective. The fundamental idea behind data integration is to open up data and make it simpler for individuals and systems to access, use, and process. When done correctly, data integration can enhance data quality, free up resources, lower IT costs, and stimulate creativity without significantly modifying current applications or data structures. IT organizations have always needed to integrate data, but the benefits of doing so may never have been greater than they are now.

8. Data Governance

Data governance is the process of ensuring that data is trustworthy, accurate, available, and usable. It describes the actions people must take, the rules they must follow, and the technology that will support them throughout the data life cycle.

9. Data security

Data security is the practice of preventing digital data from being accessed by unauthorized parties, corrupted, or stolen at any point in its lifecycle. It is a concept that encompasses all elements of data security, including administrative and access controls, the logical security of software applications, and the physical security of hardware and storage devices, as well as the organization's policies and procedures. Data security is also one of the key features of data analytics.

10. Data visualization

It's more crucial than ever to have easy ways to see and comprehend
data in our increasingly data-driven environment. Employers are, after all,
increasingly seeking employees with data skills. Data and its
ramifications must be understood by all employees and business owners.
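
As a small illustration, the sketch below plots a hypothetical monthly_sales.csv (columns month and revenue) with matplotlib; any BI tool or plotting library could serve the same purpose.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("monthly_sales.csv", parse_dates=["month"])

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["month"], df["revenue"], marker="o")
ax.set_title("Monthly revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
fig.tight_layout()
fig.savefig("monthly_revenue.png")  # or plt.show() in an interactive session
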
Characteristics of Big Data Analytics

Big data is characterized by several key attributes that distinguish it from traditional data processing techniques. These characteristics are often summarized using the "3 Vs" model, which has expanded over time to include additional dimensions. Here are the primary characteristics of big data:

1. Volume

 Definition: Refers to the vast amount of data generated from various sources, including business transactions, social media, sensors, scientific experiments, etc.
 Scale: Typically terabytes (TB), petabytes (PB), or even exabytes
(EB) of data.
 Example: Social media platforms generating massive volumes of
user interactions and content daily.

2. Velocity

 Definition: Describes the speed at which data is generated and processed to meet the demands of real-time or near-real-time analytics and decision-making.
 Real-time Requirements: Data streams in continuously and must
be processed promptly to derive actionable insights.
 Example: Financial transactions needing immediate fraud detection
or social media trending topics.

3. Variety

 Definition: Refers to the diversity of data types and sources, including structured, semi-structured, and unstructured data.
 Types: Includes text, audio, video, sensor data, log files, social
media posts, etc.
 Example: IoT devices generating sensor data, coupled with
customer reviews and call center transcripts.
Additional Characteristics:

4. Veracity

 Definition: Indicates the trustworthiness or reliability of the data captured. It addresses issues such as data quality, consistency, and accuracy.
 Quality Control: Involves data cleaning, validation, and ensuring
that data meets predefined quality standards.
 Example: Sensor data from IoT devices might have inconsistencies
due to environmental conditions or equipment malfunctions.

5. Variability

 Definition: Refers to the inconsistency in the data flow, which can vary over time or be unpredictable.
 Seasonal Trends: Data may exhibit periodic fluctuations or sudden
spikes in volume and velocity.
 Example: Retail sales data showing spikes during holidays or
promotions.

6. Value

 Definition: The usefulness or potential insights that can be derived from analyzing the data.
 Actionable Insights: Data analysis aims to extract valuable
information that can lead to better decision-making, competitive
advantage, or new revenue streams.
 Example: Customer behavior analysis to optimize marketing
campaigns or predictive maintenance to reduce downtime.

7. Visualization

 Definition: The ability to present data in a meaningful and understandable way to facilitate analysis and decision-making.
 Graphical Representation: Charts, graphs, dashboards, and other
visual tools help interpret complex data sets.
 Example: Interactive dashboards showing real-time sales
performance or geographical distribution of customer
demographics.

8. Venue
 Definition: Refers to the location where data is stored, processed,
and analyzed, including on-premises data centers, cloud
environments, or hybrid solutions.
 Cloud Adoption: Many organizations are migrating to cloud
platforms to leverage scalability, flexibility, and cost-effectiveness.
 Example: Enterprises using cloud-based big data platforms like
AWS, Azure, or Google Cloud for data storage and analytics.

BDA Assignment 2
Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is the primary data storage system that Hadoop applications use. It is the storage layer of Hadoop, an open source distributed processing framework for handling data processing, managing pools of big data, and supporting related big data analytics applications.

HDFS employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. It's designed to run on commodity hardware and is a key part of many Hadoop ecosystem technologies.

The following describes how HDFS works:

 With HDFS, data is written on the server once and read and reused
numerous times.
 HDFS has a primary NameNode, which keeps track of where file data is
kept in the cluster.
 HDFS has multiple DataNodes on a commodity hardware cluster --
typically one per node in a cluster. The DataNodes are generally organized
within the same rack in the data center. Data is broken down into separate
blocks and distributed among the various DataNodes for storage. Blocks
are also replicated across nodes, enabling highly efficient parallel
processing.
 The NameNode knows which DataNode contains which blocks and where
the DataNodes reside within the machine cluster. The NameNode also
manages access to the files, including reads, writes, creates, deletes and
the data block replication across the DataNodes.
 The NameNode operates together with the DataNodes. As a result, the
cluster can dynamically adapt to server capacity demands in real time by
adding or subtracting nodes as necessary.
 The DataNodes are in constant communication with the NameNode to
determine if the DataNodes need to complete specific tasks. Consequently,
the NameNode is always aware of the status of each DataNode. If the
NameNode realizes that one DataNode isn't working properly, it can
immediately reassign that DataNode's task to a different node containing
the same data block. DataNodes also communicate with each other, which
enables them to cooperate during normal file operations.
 The HDFS is designed to be highly fault tolerant. The file system
replicates -- or copies -- each piece of data multiple times and distributes
the copies to individual nodes, placing at least one copy on a
different server rack than the other copies.
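
The write-once/read-many pattern can be exercised from a client program. The sketch below uses the third-party Python hdfs package (HdfsCLI) over WebHDFS; the NameNode address, port 9870 and user name are assumptions for an unsecured test cluster, and production clusters would typically require Kerberos authentication.

from hdfs import InsecureClient  # pip install hdfs

# The client talks to the NameNode, which resolves which DataNodes hold the blocks.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write once ...
client.makedirs("/data/raw")
client.write("/data/raw/events.csv", data=b"id,value\n1,42\n", overwrite=True)

# ... read many times; block data is streamed from the DataNodes.
with client.read("/data/raw/events.csv") as reader:
    print(reader.read().decode("utf-8"))

# File status metadata (including the replication factor) tracked by the NameNode.
print(client.status("/data/raw/events.csv"))
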
HDFS architecture, NameNode and DataNodes
HDFS uses a primary/secondary architecture in which each HDFS cluster comprises many worker nodes and one primary node, the NameNode. The NameNode is the controller node: it holds the metadata and status of all files, including file permissions, names and the location of each block. An application or user can create directories and then store files inside these directories. The file system namespace hierarchy is like that of most other file systems, as a user can create, remove, rename or move files from one directory to another.

The HDFS cluster's NameNode is the primary server that manages the file
system namespace and controls client access to files. As the central
component of the Hadoop Distributed File System, the NameNode maintains
and manages the file system namespace and provides clients with the right
access permissions. The system's DataNodes manage the storage that's
attached to the nodes they run on.
NameNode

The NameNode performs the following key functions:

 The NameNode performs file system namespace operations, including opening, closing and renaming files and directories.
 The NameNode governs the mapping of blocks to the DataNodes.
 The NameNode records any changes to the file system namespace or its
properties. An application can stipulate the number of replicas of a file that
the HDFS should maintain.
 The NameNode stores the number of copies of a file, called the replication
factor of that file.
 To ensure that the DataNodes are alive, the NameNode gets block reports
and heartbeat data.
 In case of a DataNode failure, the NameNode selects new DataNodes for
replica creation.
DataNodes

In HDFS, DataNodes function as worker nodes or Hadoop daemons and are typically built from low-cost, off-the-shelf hardware. A file is split into one or more blocks that are stored on a set of DataNodes, and based on the file's replication factor, those blocks are replicated across separate DataNodes.

The DataNodes perform the following key functions:

 The DataNodes serve read and write requests from the clients of the file
system.
 The DataNodes perform block creation, deletion and replication when the
NameNode instructs them to do so.
 The DataNodes transfer periodic heartbeat signals to the NameNode to
help keep HDFS health in check.
 The DataNodes provide block reports to NameNode to help keep track of
the blocks included within the DataNodes. For redundancy and higher
availability, each block is copied onto two extra DataNodes by default.
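
A quick back-of-the-envelope sketch shows what the default replication factor of 3 implies for storage. The file size is hypothetical; 128 MB is the usual default HDFS block size.

import math

file_size_mb = 1024        # a hypothetical 1 GB file
block_size_mb = 128        # common HDFS default block size
replication_factor = 3     # one block plus two extra copies, as described above

blocks = math.ceil(file_size_mb / block_size_mb)
stored_replicas = blocks * replication_factor
raw_storage_mb = file_size_mb * replication_factor

print(blocks)           # 8 blocks
print(stored_replicas)  # 24 block replicas spread across DataNodes and racks
print(raw_storage_mb)   # 3072 MB of raw cluster storage for 1024 MB of data
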
Features of HDFS
There are several features that make HDFS particularly useful, including the
following:

 Data replication. Data replication ensures that the data is always available
and prevents data loss. For example, when a node crashes or there's a
hardware failure, replicated data can be pulled from elsewhere within a
cluster, so processing continues while data is being recovered.
 Fault tolerance and reliability. HDFS' ability to replicate file blocks and
store them across nodes in a large cluster ensures fault tolerance and
reliability.
 High availability. Because of replication across nodes, data is available
even if the NameNode or DataNode fails.
 Scalability. HDFS stores data on various nodes in the cluster, so as
requirements increase, a cluster can scale to hundreds of nodes.
 High throughput. Because HDFS stores data in a distributed manner, the
data can be processed in parallel on a cluster of nodes. This, plus data
locality, cuts the processing time and enables high throughput.
 Data locality. With HDFS, computation happens on the DataNodes where
the data resides, rather than having the data move to where the
computational unit is. Minimizing the distance between the data and the
computing process decreases network congestion and boosts a system's
overall throughput.
 Snapshots. HDFS supports snapshots, which capture point-in-time copies
of the file system and protect critical data from user or application errors.
BDA Assignment 3
Pig Data model
Creating a data model for pig data in big data analytics involves organizing and structuring data
related to pigs to facilitate efficient analysis and decision-making. This process can be implemented
using various big data technologies like Hadoop, Spark, or NoSQL databases. Here’s a step-by-step
guide to creating a pig data model in big data analytics:

Step 1: Define the Use Case and Objectives

Objectives:

 Real-time health monitoring: Track vital signs and environmental conditions.


 Growth analysis: Monitor weight and size progression over time.
 Predictive maintenance: Predict and prevent disease outbreaks and health issues.
 Feed optimization: Analyze feed consumption patterns to optimize feeding schedules and
reduce waste.

Use Cases:

 Farm Management: Improve overall farm productivity by monitoring pig health and growth.
 Veterinary Analysis: Provide veterinarians with detailed health records to aid in diagnosis
and treatment.
 Research and Development: Enable agricultural researchers to study growth patterns and
the impact of environmental conditions on pig health.

Step 2: Identify Data Sources

 Sensors: IoT devices that measure temperature, humidity, air quality, and pig vital signs.
 Manual Records: Data entered by farm staff about health checks, treatments, and feedings.
 Automated Systems: Feeders and drinkers that log consumption data automatically.
 GPS Devices: Track the location and movement of pigs.
Step 3: Data Collection and Ingestion

 IoT Integration: Use MQTT or similar protocols to collect data from sensors.
 Data Pipelines: Implement data pipelines using Apache NiFi, Apache Kafka, or AWS Data
Pipeline to ingest data into the big data platform.
 Batch and Stream Processing: Set up both batch processing for historical data and stream
processing for real-time data.
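
A minimal ingestion sketch, assuming a Kafka broker at localhost:9092 and the kafka-python package; the topic name and reading schema are illustrative, and in practice the readings would arrive from the barn's IoT gateway (for example over MQTT) rather than being simulated.

import json
import random
import time
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for _ in range(10):
    reading = {
        "pig_id": 101,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "body_temp_c": round(random.uniform(38.5, 40.0), 2),
        "barn_temp_c": round(random.uniform(18.0, 26.0), 1),
        "humidity_pct": round(random.uniform(40.0, 70.0), 1),
    }
    producer.send("pig-sensor-readings", value=reading)
    time.sleep(1)

producer.flush()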

Step 4: Data Storage

 Raw Data Storage: Use HDFS, Amazon S3, or Azure Blob Storage to store raw sensor data.
 Processed Data Storage: Store processed and cleaned data in HBase, Cassandra, or Amazon
DynamoDB.
 Data Warehousing: Use Redshift, BigQuery, or Snowflake for structured data and analytics.
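
One common way to organize the raw zone is to partition files by date, so downstream query engines can prune partitions instead of scanning everything. The sketch below writes date-partitioned Parquet with pandas/pyarrow; the local ./datalake path stands in for an HDFS or S3 location, and the columns follow the hypothetical sensor schema above.

import pandas as pd  # requires pyarrow for the Parquet engine

readings = pd.DataFrame(
    {
        "pig_id": [101, 101, 102],
        "timestamp": pd.to_datetime(
            ["2024-05-01T08:00:00Z", "2024-05-01T09:00:00Z", "2024-05-02T08:00:00Z"]
        ),
        "body_temp_c": [39.1, 39.3, 38.8],
    }
)
readings["date"] = readings["timestamp"].dt.date.astype(str)

# Produces date=2024-05-01/, date=2024-05-02/, ... subdirectories under the root.
readings.to_parquet("./datalake/raw/sensor_readings", partition_cols=["date"])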

Step 5: Data Processing and Transformation

Batch Processing:

 Apache Hive: Use Hive for SQL-like querying of large datasets.


 Apache Pig: Use Pig for data transformation and processing.
 Apache Spark: Utilize Spark for faster, in-memory batch processing.

Stream Processing:

 Apache Storm: Real-time data processing.


 Apache Flink: Stream and batch processing with low latency.
 Spark Streaming: Real-time processing using Spark.
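
As a batch-processing sketch with PySpark (Spark's Python API), the job below rolls raw readings up into a daily health summary; the input/output paths and the 39.5 °C fever threshold are assumptions that follow the hypothetical schema used above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pig-sensor-batch").getOrCreate()

raw = spark.read.parquet("./datalake/raw/sensor_readings")

daily_summary = (
    raw.withColumn("day", F.to_date("timestamp"))
       .groupBy("pig_id", "day")
       .agg(
           F.avg("body_temp_c").alias("avg_body_temp_c"),
           F.max("body_temp_c").alias("max_body_temp_c"),
           F.count(F.lit(1)).alias("num_readings"),
       )
)

# Flag days where the average temperature looks feverish.
daily_summary.filter(F.col("avg_body_temp_c") > 39.5).show()

daily_summary.write.mode("overwrite").parquet("./datalake/processed/daily_summary")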

Step 6: Data Modeling

Schema Design:

 Pigs Table:

CREATE TABLE Pigs (
    PigID INT PRIMARY KEY,
    BirthDate DATE,
    Breed VARCHAR(50),
    Sex VARCHAR(10),
    FarmID INT
);

 HealthRecords Table:

CREATE TABLE HealthRecords (
    RecordID INT PRIMARY KEY,
    PigID INT,
    Date DATE,
    HealthMetric VARCHAR(50),
    Value FLOAT,
    FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);

 EnvironmentalData Table:

CREATE TABLE EnvironmentalData (
    DataID INT PRIMARY KEY,
    PigID INT,
    Timestamp TIMESTAMP,
    Temperature FLOAT,
    Humidity FLOAT,
    FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);

 FeedData Table:

CREATE TABLE FeedData (
    FeedID INT PRIMARY KEY,
    PigID INT,
    Date DATE,
    FeedType VARCHAR(50),
    Quantity FLOAT,
    FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);

 MovementData Table:

CREATE TABLE MovementData (
    MovementID INT PRIMARY KEY,
    PigID INT,
    Timestamp TIMESTAMP,
    Latitude FLOAT,
    Longitude FLOAT,
    FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);

Data Relationships:

 One-to-Many: A pig can have multiple health records, environmental data entries, feed data
entries, and movement records.
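
To make the one-to-many relationship concrete, the sketch below runs a join through Spark SQL; it assumes the tables defined above have been registered as Hive tables or temporary views named Pigs and HealthRecords.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("pig-data-model-demo")
    .enableHiveSupport()
    .getOrCreate()
)

summary = spark.sql("""
    SELECT p.PigID,
           p.Breed,
           COUNT(h.RecordID) AS num_health_records,
           MAX(h.Date)       AS last_checkup
    FROM Pigs p
    LEFT JOIN HealthRecords h ON h.PigID = p.PigID
    GROUP BY p.PigID, p.Breed
""")
summary.show()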

Step 7: Data Analysis and Visualization

 SQL Queries: Use Hive or Presto to run SQL queries on your data.
 Data Analysis: Use Spark for more complex analysis, such as machine learning models or
statistical computations.
 Visualization Tools: Connect Tableau, Power BI, or QlikView to your data warehouse for
creating dashboards and visualizations.
Step 8: Machine Learning and Predictive Analytics

Machine Learning Models:

 Health Prediction: Use logistic regression, decision trees, or neural networks to predict
health issues.
 Growth Prediction: Use linear regression or time series analysis to predict growth patterns.
 Anomaly Detection: Use clustering algorithms like k-means or DBSCAN to detect anomalies
in sensor data.

Implementation:

 Feature Engineering: Extract relevant features from raw data (e.g., average temperature,
weekly weight gain).
 Model Training: Train models using historical data.
 Model Deployment: Deploy models using ML platforms like TensorFlow Serving, AWS
SageMaker, or Google AI Platform.
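
A minimal health-prediction sketch using scikit-learn logistic regression; the features and labels are synthetic stand-ins for the engineered features described above (average body temperature, weekly weight gain, daily feed intake).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.normal(39.2, 0.5, n),  # avg_body_temp_c
    rng.normal(5.0, 1.5, n),   # weekly_weight_gain_kg
    rng.normal(2.5, 0.6, n),   # daily_feed_kg
])
# Synthetic label: higher temperature and lower feed intake -> higher illness risk.
risk = 2.0 * (X[:, 0] - 39.2) - 1.0 * (X[:, 2] - 2.5)
y = (risk + rng.normal(0, 0.5, n) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))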

Step 9: Data Security and Governance

 Data Encryption: Encrypt data at rest and in transit using TLS/SSL.


 Access Controls: Implement role-based access control (RBAC) to restrict access to sensitive
data.
 Audit Logs: Maintain logs of data access and changes for auditing purposes.
 Compliance: Ensure compliance with relevant regulations (e.g., GDPR, HIPAA).

Step 10: Deployment and Monitoring

Deployment:

 Production Environment: Deploy your data pipelines, processing jobs, and analytics tools in
a production environment using Kubernetes, Docker, or cloud services like AWS, GCP, or
Azure.
 Continuous Integration/Continuous Deployment (CI/CD): Use CI/CD tools like Jenkins,
Travis CI, or GitLab CI to automate deployments.

Monitoring:

 Monitoring Tools: Use tools like Prometheus, Grafana, or Datadog to monitor the
performance and health of your data infrastructure.
 Alerting: Set up alerts for anomalies, failures, or performance issues.

Example Data Flow

1. Data Collection: IoT sensors collect real-time data on pig vital signs and environmental
conditions.
2. Data Ingestion: Data is ingested into the big data platform using Kafka or NiFi.
3. Data Storage: Raw data is stored in HDFS or S3.
4. Data Processing: Spark processes raw data to generate meaningful insights.
5. Data Storage: Processed data is stored in HBase or a data warehouse like Redshift.
6. Data Analysis: Analysts use SQL queries and Spark to analyze the data.
7. Visualization: Dashboards in Tableau or Power BI provide visual insights.
8. Machine Learning: Models predict health issues or optimize feed schedules.
9. Actionable Insights: Farm managers receive alerts or recommendations based on the
analysis.

This detailed approach ensures a comprehensive and efficient data model for managing and
analyzing pig data in a big data analytics environment.
