BDA Assignment 1: Big Data Features and Characteristics
The big data revolution has given rise to new forms, stages, and types of data analysis. Data analytics is now discussed in boardrooms all over the world as an enterprise-wide route to commercial success. What does this mean for businesses? To leverage Big Data successfully, organizations must gain the right expertise, turn it into usable information, and use that information to build a competitive edge. The main goal of big data analytics is to help firms make better business decisions.
Features of Big Data Analytics
2. Data exploration
3. Scalability
5. Version control
6. Data management
8. Data Governance
9. Data security
In our increasingly data-driven environment, it is more important than ever to have straightforward ways to see and understand data. After all, employers increasingly seek employees with data skills, and every employee and business owner needs to understand data and its implications.
Characteristics of Big Data Analytics
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Variability
6. Value
7. Visualization
8. Venue
Definition: Refers to the location where data is stored, processed,
and analyzed, including on-premises data centers, cloud
environments, or hybrid solutions.
Cloud Adoption: Many organizations are migrating to cloud
platforms to leverage scalability, flexibility, and cost-effectiveness.
Example: Enterprises using cloud-based big data platforms like
AWS, Azure, or Google Cloud for data storage and analytics.
BDA Assignment 2
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications. It is the storage layer of the open source
Hadoop framework: it manages pools of big data, stores them reliably across a
cluster, and supports the related big data analytics applications.
With HDFS, data is written on the server once and read and reused
numerous times.
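To make this write-once, read-many pattern concrete, the following minimal Python sketch uses the third-party hdfs package (a WebHDFS client). The NameNode URL, user name, file path, and sample content are illustrative assumptions.

# Write-once, read-many from a client's point of view, using the
# third-party "hdfs" Python package (WebHDFS). The URL, user and path
# below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='hadoop')

# Write the file once ...
client.write('/data/example/readings.csv',
             data=b'sensor_id,temp_c\n101,39.2\n',
             overwrite=True)

# ... then read it back as many times as needed.
with client.read('/data/example/readings.csv') as reader:
    print(reader.read())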
HDFS has a primary NameNode, which keeps track of where file data is
kept in the cluster.
HDFS has multiple DataNodes on a commodity hardware cluster --
typically one per node in a cluster. The DataNodes are generally organized
within the same rack in the data center. Data is broken down into separate
blocks and distributed among the various DataNodes for storage. Blocks
are also replicated across nodes, enabling highly efficient parallel
processing.
The NameNode knows which DataNode contains which blocks and where
the DataNodes reside within the machine cluster. The NameNode also
manages access to the files, including reads, writes, creates, deletes and
the data block replication across the DataNodes.
The NameNode operates together with the DataNodes. As a result, the
cluster can dynamically adapt to server capacity demands in real time by
adding or subtracting nodes as necessary.
The DataNodes are in constant communication with the NameNode to
determine if the DataNodes need to complete specific tasks. Consequently,
the NameNode is always aware of the status of each DataNode. If the
NameNode realizes that one DataNode isn't working properly, it can
immediately reassign that DataNode's task to a different node containing
the same data block. DataNodes also communicate with each other, which
enables them to cooperate during normal file operations.
The HDFS is designed to be highly fault tolerant. The file system
replicates -- or copies -- each piece of data multiple times and distributes
the copies to individual nodes, placing at least one copy on a
different server rack than the other copies.
HDFS architecture, NameNode and DataNodes
HDFS uses a primary/secondary architecture in which each HDFS cluster
consists of many worker nodes and one primary node, the NameNode.
The NameNode is the controller node, as it knows the metadata and status of
all files including file permissions, names and location of each block. An
application or user can create directories and then store files inside these
directories. The file system namespace hierarchy is like most other file
systems, as a user can create, remove, rename or move files from one
directory to another.
The HDFS cluster's NameNode is the primary server that manages the file
system namespace and controls client access to files. As the central
component of the Hadoop Distributed File System, the NameNode maintains
and manages the file system namespace and provides clients with the right
access permissions. The system's DataNodes manage the storage that's
attached to the nodes they run on.
DataNodes
The DataNodes serve read and write requests from the clients of the file
system.
The DataNodes perform block creation, deletion and replication when the
NameNode instructs them to do so.
The DataNodes transfer periodic heartbeat signals to the NameNode to
help keep HDFS health in check.
The DataNodes provide block reports to NameNode to help keep track of
the blocks included within the DataNodes. For redundancy and higher
availability, each block is copied onto two extra DataNodes by default.
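As a hedged illustration of this replication behavior, the short Python sketch below drives the standard HDFS command-line tools to show where a file's block replicas live and to change its replication factor. The file path and the target factor of 4 are assumptions; a working Hadoop client configuration is assumed as well.

# Inspect and adjust block replication via the standard HDFS CLI,
# driven from Python with subprocess.
import subprocess

path = '/data/example/readings.csv'

# Show which DataNodes hold each block replica of the file.
subprocess.run(['hdfs', 'fsck', path, '-files', '-blocks', '-locations'], check=True)

# Raise the replication factor from the default of 3 to 4 and wait
# until the extra copies have been placed.
subprocess.run(['hdfs', 'dfs', '-setrep', '-w', '4', path], check=True)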
Features of HDFS
There are several features that make HDFS particularly useful, including the
following:
Data replication. Data replication ensures that the data is always available
and prevents data loss. For example, when a node crashes or there's a
hardware failure, replicated data can be pulled from elsewhere within a
cluster, so processing continues while data is being recovered.
Fault tolerance and reliability. HDFS' ability to replicate file blocks and
store them across nodes in a large cluster ensures fault tolerance and
reliability.
High availability. Because of replication across nodes, data is available
even if the NameNode or DataNode fails.
Scalability. HDFS stores data on various nodes in the cluster, so as
requirements increase, a cluster can scale to hundreds of nodes.
High throughput. Because HDFS stores data in a distributed manner, the
data can be processed in parallel on a cluster of nodes. This, plus data
locality, cuts the processing time and enables high throughput.
Data locality. With HDFS, computation happens on the DataNodes where
the data resides, rather than having the data move to where the
computational unit is. Minimizing the distance between the data and the
computing process decreases network congestion and boosts a system's
overall throughput.
Snapshots. HDFS supports snapshots, which capture point-in-time copies
of the file system and protect critical data from user or application errors.
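As a small, hedged illustration of the snapshot feature, the sketch below calls the standard HDFS snapshot commands from Python; the directory and snapshot name are assumptions.

# Enable and take an HDFS snapshot with the standard CLI tools.
import subprocess

data_dir = '/data/example'

# An administrator first marks the directory as snapshottable ...
subprocess.run(['hdfs', 'dfsadmin', '-allowSnapshot', data_dir], check=True)

# ... after which point-in-time snapshots can be created and listed.
subprocess.run(['hdfs', 'dfs', '-createSnapshot', data_dir, 'before-cleanup'], check=True)
subprocess.run(['hdfs', 'dfs', '-ls', data_dir + '/.snapshot'], check=True)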
BDA Assignment 3
Pig Data model
Creating a data model for pig data in big data analytics involves organizing and structuring data
related to pigs to facilitate efficient analysis and decision-making. This process can be implemented
using various big data technologies like Hadoop, Spark, or NoSQL databases. Here’s a step-by-step
guide to creating a pig data model in big data analytics:
Objectives and Use Cases:
Farm Management: Improve overall farm productivity by monitoring pig health and growth.
Veterinary Analysis: Provide veterinarians with detailed health records to aid in diagnosis
and treatment.
Research and Development: Enable agricultural researchers to study growth patterns and
the impact of environmental conditions on pig health.
Data Sources:
Sensors: IoT devices that measure temperature, humidity, air quality, and pig vital signs.
Manual Records: Data entered by farm staff about health checks, treatments, and feedings.
Automated Systems: Feeders and drinkers that log consumption data automatically.
GPS Devices: Track the location and movement of pigs.
Step 3: Data Collection and Ingestion
IoT Integration: Use MQTT or similar protocols to collect data from sensors (a minimal MQTT-to-Kafka sketch follows this list).
Data Pipelines: Implement data pipelines using Apache NiFi, Apache Kafka, or AWS Data
Pipeline to ingest data into the big data platform.
Batch and Stream Processing: Set up both batch processing for historical data and stream
processing for real-time data.
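The following minimal ingestion sketch assumes an MQTT broker for the sensors and a Kafka topic for the raw readings; the broker addresses, topic names, and payload format are illustrative. It uses the paho-mqtt and kafka-python packages.

# Forward MQTT sensor readings into a Kafka topic for downstream processing.
import paho.mqtt.client as mqtt
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='kafka:9092')

def on_message(client, userdata, msg):
    # Each MQTT message is treated as an opaque sensor payload and passed
    # through to the raw-readings Kafka topic.
    producer.send('pig-sensor-readings', value=msg.payload)

mqtt_client = mqtt.Client()  # paho-mqtt 1.x style; 2.x also needs a CallbackAPIVersion argument
mqtt_client.on_message = on_message
mqtt_client.connect('mqtt-broker', 1883)
mqtt_client.subscribe('farm/+/sensors/#')
mqtt_client.loop_forever()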
Data Storage:
Raw Data Storage: Use HDFS, Amazon S3, or Azure Blob Storage to store raw sensor data (see the short sketch after this list).
Processed Data Storage: Store processed and cleaned data in HBase, Cassandra, or Amazon
DynamoDB.
Data Warehousing: Use Redshift, BigQuery, or Snowflake for structured data and analytics.
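A short raw-storage sketch, assuming Amazon S3 as the landing zone; the bucket name, key layout, and sample record are illustrative, and credentials are expected to come from the environment.

# Land a batch of raw sensor readings in S3 with boto3.
import json
import boto3

s3 = boto3.client('s3')

readings = [{'pig_id': 101, 'temp_c': 39.2, 'ts': '2024-05-01T08:00:00Z'}]

# One object per batch, partitioned by date in the key prefix.
s3.put_object(
    Bucket='farm-raw-data',
    Key='sensors/dt=2024-05-01/readings-0001.json',
    Body=json.dumps(readings).encode('utf-8'),
)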
Data Processing:
Batch Processing:
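A minimal batch-processing sketch with PySpark, assuming the raw JSON readings land in the storage location described above; the paths and column names are assumptions.

# Aggregate raw sensor JSON into daily per-pig temperature averages.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('pig-daily-batch').getOrCreate()

raw = spark.read.json('s3a://farm-raw-data/sensors/')

daily = (raw
         .withColumn('date', F.to_date('ts'))
         .groupBy('pig_id', 'date')
         .agg(F.avg('temp_c').alias('avg_temp_c')))

daily.write.mode('overwrite').parquet('s3a://farm-processed/daily_temps/')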
Stream Processing:
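A minimal stream-processing sketch with Spark Structured Streaming, assuming the Kafka topic from the ingestion step and the Spark-Kafka connector on the classpath; the broker address, topic, and record schema are assumptions.

# Consume sensor readings from Kafka and compute rolling per-pig averages.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, TimestampType

spark = SparkSession.builder.appName('pig-sensor-stream').getOrCreate()

schema = StructType([
    StructField('pig_id', IntegerType()),
    StructField('temp_c', DoubleType()),
    StructField('ts', TimestampType()),
])

stream = (spark.readStream.format('kafka')
          .option('kafka.bootstrap.servers', 'kafka:9092')
          .option('subscribe', 'pig-sensor-readings')
          .load())

readings = (stream
            .select(F.from_json(F.col('value').cast('string'), schema).alias('r'))
            .select('r.*'))

rolling = (readings
           .withWatermark('ts', '10 minutes')
           .groupBy(F.window('ts', '5 minutes'), 'pig_id')
           .agg(F.avg('temp_c').alias('avg_temp_c')))

query = rolling.writeStream.outputMode('update').format('console').start()
query.awaitTermination()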
Schema Design:
Pigs Table:
CREATE TABLE Pigs (
PigID INT PRIMARY KEY,
BirthDate DATE,
Breed VARCHAR(50),
Sex VARCHAR(10),
FarmID INT
);
HealthRecords Table:
CREATE TABLE HealthRecords (
RecordID INT PRIMARY KEY,
PigID INT,
Date DATE,
HealthMetric VARCHAR(50),
Value FLOAT,
FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);
EnvironmentalData Table:
CREATE TABLE EnvironmentalData (
DataID INT PRIMARY KEY,
PigID INT,
Timestamp TIMESTAMP,
Temperature FLOAT,
Humidity FLOAT,
FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);
FeedData Table:
CREATE TABLE FeedData (
FeedID INT PRIMARY KEY,
PigID INT,
Date DATE,
FeedType VARCHAR(50),
Quantity FLOAT,
FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);
MovementData Table:
CREATE TABLE MovementData (
MovementID INT PRIMARY KEY,
PigID INT,
Timestamp TIMESTAMP,
Latitude FLOAT,
Longitude FLOAT,
FOREIGN KEY (PigID) REFERENCES Pigs(PigID)
);
Data Relationships:
One-to-Many: A pig can have multiple health records, environmental data entries, feed data
entries, and movement records.
SQL Queries: Use Hive or Presto to run SQL queries on your data.
Data Analysis: Use Spark for more complex analysis, such as machine learning models or
statistical computations (a small PySpark sketch follows this list).
Visualization Tools: Connect Tableau, Power BI, or QlikView to your data warehouse for
creating dashboards and visualizations.
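A small, hedged analysis sketch: the same SQL-style questions can be asked through Spark SQL (Hive or Presto would accept a very similar query). The table locations mirror the schema above but are assumptions.

# Join health records with pig master data and compare breeds.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pig-analysis').getOrCreate()

spark.read.parquet('s3a://farm-processed/pigs/').createOrReplaceTempView('Pigs')
spark.read.parquet('s3a://farm-processed/health_records/').createOrReplaceTempView('HealthRecords')

# Average value of each health metric per breed.
result = spark.sql("""
    SELECT p.Breed, h.HealthMetric, AVG(h.Value) AS avg_value
    FROM HealthRecords h
    JOIN Pigs p ON p.PigID = h.PigID
    GROUP BY p.Breed, h.HealthMetric
""")
result.show()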
Step 8: Machine Learning and Predictive Analytics
Health Prediction: Use logistic regression, decision trees, or neural networks to predict
health issues (see the sketch at the end of this step).
Growth Prediction: Use linear regression or time series analysis to predict growth patterns.
Anomaly Detection: Use clustering algorithms like k-means or DBSCAN to detect anomalies
in sensor data.
Implementation:
Feature Engineering: Extract relevant features from raw data (e.g., average temperature,
weekly weight gain).
Model Training: Train models using historical data.
Model Deployment: Deploy models using ML platforms like TensorFlow Serving, AWS
SageMaker, or Google AI Platform.
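A hedged sketch of the health-prediction idea using scikit-learn logistic regression; the feature file, engineered feature names, and label column are illustrative assumptions.

# Train a simple classifier on engineered per-pig weekly features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# One row per pig-week: engineered features plus a 0/1 health-issue label.
df = pd.read_parquet('features/pig_weekly_features.parquet')
X = df[['avg_temp_c', 'weekly_weight_gain_kg', 'feed_kg']]
y = df['health_issue']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))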
Deployment:
Production Environment: Deploy your data pipelines, processing jobs, and analytics tools in
a production environment using Kubernetes, Docker, or cloud services like AWS, GCP, or
Azure.
Continuous Integration/Continuous Deployment (CI/CD): Use CI/CD tools like Jenkins,
Travis CI, or GitLab CI to automate deployments.
Monitoring:
Monitoring Tools: Use tools like Prometheus, Grafana, or Datadog to monitor the
performance and health of your data infrastructure (a minimal metrics-export sketch follows this section).
Alerting: Set up alerts for anomalies, failures, or performance issues.
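A minimal monitoring sketch, assuming Prometheus scrapes custom metrics exposed by the ingestion job; the metric names and port are illustrative, using the prometheus_client package.

# Expose pipeline metrics that Prometheus can scrape and Grafana can chart.
import time
import random
from prometheus_client import start_http_server, Counter, Gauge

records_ingested = Counter('pig_records_ingested_total', 'Sensor records ingested')
pipeline_lag = Gauge('pig_pipeline_lag_seconds', 'Ingestion-to-storage lag')

start_http_server(8000)  # metrics served at http://localhost:8000/metrics

while True:
    # In a real pipeline these values would be updated by the ingestion job.
    records_ingested.inc(100)
    pipeline_lag.set(random.uniform(0.5, 5.0))
    time.sleep(15)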
Example Workflow:
1. Data Collection: IoT sensors collect real-time data on pig vital signs and environmental
conditions.
2. Data Ingestion: Data is ingested into the big data platform using Kafka or NiFi.
3. Data Storage: Raw data is stored in HDFS or S3.
4. Data Processing: Spark processes raw data to generate meaningful insights.
5. Data Storage: Processed data is stored in HBase or a data warehouse like Redshift.
6. Data Analysis: Analysts use SQL queries and Spark to analyze the data.
7. Visualization: Dashboards in Tableau or Power BI provide visual insights.
8. Machine Learning: Models predict health issues or optimize feed schedules.
9. Actionable Insights: Farm managers receive alerts or recommendations based on the
analysis.
This detailed approach ensures a comprehensive and efficient data model for managing and
analyzing pig data in a big data analytics environment.