Introduction to Arduino

• Arduino is both a software and a hardware platform that helps in making
electronic projects.
• It is an open-source platform and supports a variety of microcontrollers and
microprocessors.
• An Arduino is a single circuit board, which consists of different interfaces
or parts.
• The Arduino board provides a set of digital and analog pins that are used
to connect the various devices and components needed for a project.
• The designs of Arduino boards use a variety of controllers and
microprocessors.
Microcontroller
• The most essential part of the Arduino is the microcontroller.
• A microcontroller is a small, low-power computer.
• Like other computer systems, most microcontrollers have a CPU (Central
Processing Unit), RAM (Random Access Memory), and some storage memory.
• It has very little memory (on the order of 2 KB of RAM); because of this
limited memory, a microcontroller typically runs only one program at a time.
• It is a single chip that includes memory, Input/Output (I/O) peripherals,
and a processor.
• The GPIO (General Purpose Input Output) pins present on the chip let us
control other electronics or circuitry from the program.
Features of Arduino
• The Arduino programming language is a simplified version of C++, which makes
the learning process easy.
• The Arduino IDE is used to program the boards: it sends the compiled
instructions to the microcontroller.
• Arduino does not need an extra board or separate programmer to load new code.
• Arduino can read analog and digital input signals.
• The hardware and software platform is easy to use and implement.
Arduino Boards
• Arduino is an easy-to-use open platform for creating different electronics
projects.
• The components present on Arduino boards include the microcontroller,
digital input/output pins, USB interface and connector, analog pins, reset
button, power jack, LEDs, crystal oscillator, and voltage regulator.
• Some components may differ depending on the type of board.
• The most standard and popular board used over time is the Arduino UNO.
• The ATmega328 microcontroller present on the UNO board makes it quite
powerful for its size.
• Arduino boards are programmed using the Arduino IDE (Integrated Development
Environment), which can run on various platforms.
Arduino Boards
Some of the popular Arduino boards (from the list of available boards in the
Arduino software):
• Arduino UNO
• Arduino Nano
• Arduino Mega
• Arduino Due
• Arduino Bluetooth
Types of Arduino Boards
1) Arduino UNO
• Arduino UNO is based on an
ATmega328P microcontroller.
• The Arduino UNO includes 6 analog input pins, 14 digital pins, a USB
connector, a power jack, and an ICSP (In-Circuit Serial Programming) header.
• It is the most widely used and the standard form factor among the available
Arduino boards.
• It is also recommended for beginners as it is easy to use.
Types of Arduino Boards
2) Arduino Nano
• The Arduino Nano is a small Arduino board based on the ATmega328P or
ATmega168 microcontroller.
• Its connectivity is the same as that of the Arduino UNO board.
• The Nano board is defined as a sustainable, small, consistent, and flexible
microcontroller board.
• It is small in size compared to the UNO board.
• The items required to start a project with the Arduino Nano board are the
Arduino IDE and a mini-USB cable.
• The Arduino Nano includes an I/O pin set of 14 digital pins and 8 analog
pins.
• It also includes 6 power pins and 2 reset pins.
Types of Arduino Boards

3) Arduino Mega
• The Arduino Mega is based on the ATmega2560, an 8-bit microcontroller.
• It has the advantage of more memory space.
• The Arduino Mega includes 54 digital I/O pins, 16 analog input pins, an ICSP
header, a reset button, 4 UART (Universal Asynchronous Receiver/Transmitter)
ports, a USB connection, and a power jack.
Types of Arduino Boards
4) Arduino Due
• The Arduino Due is based on a 32-bit ARM core.
• It was the first Arduino board developed around an ARM microcontroller.
• It consists of 54 digital input/output pins and 12 analog pins.
• The microcontroller present on the board is the Atmel SAM3X8E ARM
Cortex-M3 CPU.
• It has two USB ports: a native USB port and a programming port.
Types of Arduino Boards

5) Arduino Bluetooth
• The Arduino Bluetooth board is based on the ATmega168 microcontroller.
• It is also known as the Arduino BT board.
• The components present on the board are 16 digital pins, 6 analog pins, a
reset button, a 16 MHz crystal oscillator, an ICSP header, and screw terminals.
• The screw terminals are used for power.
• The Arduino Bluetooth board can be programmed over Bluetooth as a wireless
connection.
Arduino IDE
• The Arduino IDE is open-source software used to write and upload code to
the Arduino boards.
• The IDE application runs on different operating systems such
as Windows, Mac OS X, and Linux.
• It supports the programming languages C and C++. IDE stands
for Integrated Development Environment.
• The program or code written in the Arduino IDE is called a sketch.
• We need to connect the Genuino or Arduino board to the computer running the
IDE to upload the sketch.
• The sketch is saved with the extension '.ino'.
Arduino IDE
Each section of the Arduino IDE:
• Toolbar buttons: the icons displayed on the toolbar are New, Open, Save,
Upload, and Verify.
• Menu bar
Introduction to Django
• Django is a Python framework that makes it easier to create websites
using Python.
• Django emphasizes reusability of components, also referred to as
DRY (Don't Repeat Yourself), and comes with ready-to-use
features like a login system, database connection, and CRUD
operations (Create, Read, Update, Delete).
• Django is especially helpful for database-driven websites.
• A database-driven website is one that uses a database for
collecting and storing information.
• Django officially supports the following databases:
PostgreSQL, MariaDB, MySQL, Oracle, and SQLite.
How does Django Work?
• Django follows the MVT design pattern (Model, View, Template).
• Model - The data you want to present, usually data from a
database.
• View - A request handler that returns the relevant template and
content - based on the request from the user.
• Template - A text file (like an HTML file) containing the layout of
the web page, with logic on how to display the data.
Model
• The model provides data from the database.
• In Django, the data is delivered as an Object Relational Mapping
(ORM), which is a technique designed to make it easier to work
with databases.
• The most common way to extract data from a database is SQL.
• One problem with SQL is that you have to have a pretty good
understanding of the database structure to be able to work with it.
• Django, with ORM, makes it easier to communicate with the
database, without having to write complex SQL statements.
• The models are usually located in a file called models.py.
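As a rough illustration, run inside the Django shell (python manage.py shell)
and assuming the Member model that is created later in these slides, an ORM
query replaces a hand-written SQL statement:

>>> from members.models import Member
>>> # ORM query; Django generates the SQL for you:
>>> Member.objects.filter(lastname='Refsnes').values()
>>> # Roughly equivalent hand-written SQL:
>>> # SELECT id, firstname, lastname FROM members_member WHERE lastname = 'Refsnes';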
View
• A view is a function or method that takes an HTTP request as an argument,
imports the relevant model(s), works out what data to send to the template,
and returns the final result.
• The views are usually located in a file called views.py.


Template
• A template is a file where you describe how the result should be
represented.
• Templates are often .html files, with HTML code describing the
layout of a web page, but it can also be in other file formats to
present other results.
• Django uses standard HTML to describe the layout, but uses
Django tags to add logic:
• <h1>My Homepage</h1>
• <p>My name is {{ firstname }}.</p>
• The templates of an application are located in a folder named
templates.
URLs
• Django also provides a way to navigate around the different
pages in a website.
• When a user requests a URL, Django decides which view it will
send it to.
• This is done in a file called urls.py.
Django Getting Started
• To install Django, you must have Python installed, and a
package manager like PIP.
• PIP is included in Python from version 3.4.
Django Requires Python
• To check if your system has Python installed, run this
command in the command prompt: python --version
• If Python is installed, you will get a result with the version
number, like this: Python 3.9.2
• If you find that you do not have Python installed on your computer,
then you can download it for free from the following website:
https://www.python.org/
• To install Django, you must use a package manager like PIP, which
is included in Python from version 3.4.
• To check if your system has PIP installed, run this command in the
command prompt: pip --version
• If PIP is installed, you will get a result with the version number.
• If you do not have PIP installed, you can download and install it
from this page: https://pypi.org/project/pip/
Virtual environment in Django:
• A virtual environment is an isolated Python environment in which a Django
application is executed.
• It is recommended to create and execute a Django application
in a separate environment.
• Python provides a tool, virtualenv, to create an isolated Python
environment (the built-in venv module, used below, serves the same purpose).
Virtual Environment
• It is suggested to have a dedicated virtual environment for each
Django project, and one way to manage a virtual environment is
venv, which is included in Python.
• The name of the virtual environment is your choice; we will call it
myworld.
• Type the following in the command prompt,
where you want to create your project:
• Windows: py -m venv myworld
• Unix/MacOS: python -m venv myworld
• This will set up a virtual environment and create a folder named
"myworld" with subfolders and files.

• Then you have to activate the environment by typing this command:
• Windows: myworld\Scripts\activate.bat
• Unix/MacOS: source myworld/bin/activate
• Once the environment is activated, the environment name appears in
parentheses at the start of the command prompt, for example: (myworld)
Install Django
• Now that we have created a virtual environment, we are ready to install
Django.
• Django is installed using pip, with this command: pip install Django
• pip downloads and installs Django and prints a message confirming the
installed version.
• Now you have installed Django in your new project, running in a
virtual environment!

Check Django Version
• You can check if Django is installed by asking for its version
number like this: django-admin --version
• If Django is installed, you will get a result with the version number.
Django Create Project
My First Project:
• Give your Django project a name, for example my_tennis_club.
• Navigate to where in the file system you want to store the code (inside the
virtual environment), i.e. into the myworld folder, and run this command in the
command prompt: django-admin startproject my_tennis_club
• Django creates a my_tennis_club folder containing manage.py and a
my_tennis_club package with settings.py, urls.py, and related files.
Run the Django Project
• Navigate to the /my_tennis_club folder and execute this command
in the command prompt: python manage.py runserver
• The development server starts and reports that it is serving the project at
http://127.0.0.1:8000/.
• Open a new browser window and type 127.0.0.1:8000 in the address bar; the
default Django welcome page appears.
• The next step is to make an app in your project.
Django Create App
• An app is a web application that has a specific meaning in your
project, like a home page, a contact form, or a members database.
Create App
• We will name the app "members".
Store the App
• In the my_tennis_club folder, run the command below:
  python manage.py startapp members
• Django creates a folder named members in the project, containing the default
app files (models.py, views.py, admin.py, apps.py, tests.py, and a migrations
folder).
Views:
• Django views are Python functions that take HTTP requests and
return HTTP responses, like HTML documents.
• A web page that uses Django is full of views with different tasks
and missions.
• Views are usually put in a file called views.py located in your
app's folder.
• There is a views.py file in your members folder; a minimal members view is
sketched below.
• The render() function, or an HttpResponse object, creates the response that
is sent back to the browser.
• The name of the view is members.
• This is a simple example of how to send a response back to the
browser. We can execute the view via a URL.
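A minimal sketch of such a members view (the exact contents of the generated
views.py are not shown in these slides; this assumed version simply returns a
plain HTTP response):

# members/views.py - a minimal sketch of the members view
from django.http import HttpResponse

def members(request):
    # Return a plain HTTP response; later this view is replaced by one that
    # renders an HTML template.
    return HttpResponse("Hello world!")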
URLs
• Create a file named urls.py in the same folder as the views.py
file, and type this code in it:
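A sketch of what that urls.py could contain, assuming the members view above:

# members/urls.py - a minimal sketch
from django.urls import path
from . import views

urlpatterns = [
    # Map the URL path "members/" to the members view.
    path('members/', views.members, name='members'),
]

For the route to be reachable, the project-level my_tennis_club/urls.py must
also pull this file in, typically with include('members.urls') inside its own
urlpatterns list.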
Templates
• In Django, the result should be in HTML, and it should be created in a
template.
• Create a templates folder inside the members folder, and create an HTML file
named myfirst.html in it.
Modify the View:
• Open the views.py file and replace the members view with this:
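A sketch of the modified view, assuming the myfirst.html template created
above; it loads the template and returns the rendered HTML as the response:

# members/views.py - modified to render the myfirst.html template (sketch)
from django.http import HttpResponse
from django.template import loader

def members(request):
    # Load members/templates/myfirst.html and render it into the response.
    template = loader.get_template('myfirst.html')
    return HttpResponse(template.render())

Note that the members app must be registered in INSTALLED_APPS in settings.py
for its templates folder to be found.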
Django Models
• A Django model is a table in your database.
• We will see how Django allows us to work with data, without
having to change or upload files in the process.
• In Django, data is described by objects called models, which are actually
tables in a database.
Create Table (Model)
• To create a model, navigate to the models.py file in the
/members/ folder.
• Open it, and add a Member table by creating a Member class,
and describe the table fields in it:
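A minimal sketch of that Member class, assuming the two character fields
(firstname and lastname) that appear in the records inserted on the next
slides:

# members/models.py - a minimal sketch of the Member table
from django.db import models

class Member(models.Model):
    # Two text columns; max_length is required for CharField.
    firstname = models.CharField(max_length=255)
    lastname = models.CharField(max_length=255)

After adding the model, the corresponding SQLite table is created by running
python manage.py makemigrations members followed by python manage.py migrate.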
Django Insert Data
Add Records (SQLite)
• Open the Django shell (python manage.py shell), import the Member model with
from members.models import Member, and run Member.objects.all().values(); on an
empty table the result is:
<QuerySet []>
• A QuerySet is a collection of data from a database.
• Add a record to the table by executing these two lines:
>>> member = Member(firstname='Emil', lastname='Refsnes')
>>> member.save()
• Execute this command to see if the Member table got a member:
>>> Member.objects.all().values()
• the result will look like this:
<QuerySet [{'id': 1, 'firstname': 'Emil', 'lastname': 'Refsnes'}]>
Add Multiple Records
• You can add multiple records by making a list of Member objects, and
execute .save() on each entry:
>>> member1 = Member(firstname='Tobias', lastname='Refsnes')
>>> member2 = Member(firstname='Linus', lastname='Refsnes')
>>> member3 = Member(firstname='Lene', lastname='Refsnes')
>>> member4 = Member(firstname='Stale', lastname='Refsnes')
>>> member5 = Member(firstname='Jane', lastname='Doe')
>>> members_list = [member1, member2, member3, member4, member5]
>>> for x in members_list:
...   x.save()
• Now there are 6 members in the Member table:
>>> Member.objects.all().values()
<QuerySet [{'id': 1, 'firstname': 'Emil', 'lastname': 'Refsnes'},
{'id': 2, 'firstname': 'Tobias', 'lastname': 'Refsnes'},
{'id': 3, 'firstname': 'Linus', 'lastname': 'Refsnes'},
{'id': 4, 'firstname': 'Lene', 'lastname': 'Refsnes'},
{'id': 5, 'firstname': 'Stale', 'lastname': 'Refsnes'},
{'id': 6, 'firstname': 'Jane', 'lastname': 'Doe'}]>
Update Records:
• To update records that are already in the database, we first have
to get the record we want to update:
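A sketch of such an update in the Django shell (the record and the new value
are arbitrary examples):

>>> from members.models import Member
>>> member = Member.objects.get(id=1)   # fetch the record to update
>>> member.firstname = 'Emily'          # change a field value
>>> member.save()                       # write the change back to the database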
DESIGNING A RESTFUL WEB API
• RESTful API is an interface that two computer systems use to exchange
information securely over the internet.
• An application programming interface (API) defines the rules that you must
follow to communicate with other software systems.
• Developers create APIs, so that other applications can communicate with
their applications programmatically.
• For example, the timesheet application creates an API that asks for an
employee's full name and a range of dates.
• When it receives this information, it internally processes the employee's
timesheet and returns the number of hours worked in that date range.
Web API as a gateway between clients and resources on the web.
Clients
• Clients are users who want to access information from the web.
• The client can be a person or a software system that uses the API.
• For example, developers can write programs that access weather
data from a weather system.
• Or you can access the same data from your browser when you
visit the weather website directly.
Resources
• Resources are the information that different applications
provide to their clients.
• Resources can be images, videos, text, numbers, or any type of
data.
• The machine that gives the resource to the client is also called
the server.
• Organizations use APIs to share resources and provide web
services while maintaining security, control, and authentication.
• In addition, APIs help them to determine which clients get
access to specific internal resources.
How do RESTful APIs work?
The basic function of a RESTful API is the same as browsing the internet.
The client contacts the server by using the API when it requires a resource. API
developers explain how the client should use the REST API in the server application
API documentation.
These are the general steps for any REST API call:
• The client sends a request to the server. The client follows the API documentation to
format the request in a way that the server understands.
• The server authenticates the client and confirms that the client has the right to make
that request.
• The server receives the request and processes it internally.
• The server returns a response to the client. The response contains information that
tells the client whether the request was successful.
• The REST API request and response details vary slightly depending on how the API
developers design the API.
RESTful APIs require requests to contain the following main
components:
Unique resource identifier
• The server identifies each resource with unique resource identifiers.
• For REST services, the server typically performs resource
identification by using a Uniform Resource Locator (URL).
• The URL specifies the path to the resource. A URL is similar to the
website address that you enter into your browser to visit any
webpage.
• The URL is also called the request endpoint and clearly specifies to
the server what the client requires.
Method
• Developers often implement RESTful APIs by using the Hypertext
Transfer Protocol (HTTP). An HTTP method tells the server what it
needs to do to the resource.
The following are four common HTTP methods (a short client-side sketch follows
this list):
1. GET
• Clients use GET to access resources that are located at the specified
URL on the server. They can cache GET requests and send
parameters in the RESTful API request to instruct the server to filter
data before sending.
2. POST
Clients use POST to send data to the server. They include the data
representation with the request. Sending the same POST request multiple times
has the side effect of creating the same resource multiple times.
3. PUT
Clients use PUT to update existing resources on the server. Unlike POST,
sending the same PUT request multiple times in a RESTful web service gives
the same result.
4. DELETE
Clients use the DELETE request to remove the resource. A DELETE request
can change the server state. However, if the user does not have appropriate
authentication, the request fails.
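A rough client-side sketch of the four methods using Python's requests
library; the endpoint URL and payload are hypothetical, not part of any real
service:

# rest_client_sketch.py - https://api.example.com is a placeholder endpoint
import requests

base = "https://api.example.com/members"

# GET: read the collection (query parameters can ask the server to filter).
resp = requests.get(base, params={"lastname": "Refsnes"})
print(resp.status_code, resp.json())

# POST: create a new resource by sending its representation.
resp = requests.post(base, json={"firstname": "Emil", "lastname": "Refsnes"})
new_id = resp.json()["id"]

# PUT: update the existing resource; repeating the same PUT gives the same result.
requests.put(f"{base}/{new_id}", json={"firstname": "Emily", "lastname": "Refsnes"})

# DELETE: remove the resource.
requests.delete(f"{base}/{new_id}")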
HTTP headers
• Request headers are the metadata exchanged between the client and server. For
instance, the request header indicates the format of the request and response,
provides information about request status, and so on.
Data
• REST API requests might include data for the POST, PUT, and other HTTP
methods to work successfully.
Parameters
• RESTful API requests can include parameters that give the server more details
about what needs to be done. The following are some different types of
parameters:
• Path parameters that specify URL details.
• Query parameters that request more information about the resource.
• Cookie parameters that authenticate clients quickly.
• A REST API allows clients to create, view, update, and delete a collection
of resources.
• Each resource represents a sensor data reading from a weather monitoring
station.
• The station model contains the following fields:
station name, timestamp, temperature, latitude, and longitude.
• JSON stands for JavaScript Object Notation. It is a lightweight data-
interchange format that is used to store and exchange data.
• Serializers in Django REST Framework are responsible for converting objects
into data types understandable by JavaScript and front-end frameworks.
Serializers also provide deserialization, allowing parsed data to be converted
back into complex types, after first validating the incoming data.
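A sketch of such a serializer, assuming a hypothetical Station model with the
fields listed above (Django REST Framework's ModelSerializer derives the JSON
representation from the model):

# serializers.py - sketch; Station is an assumed model, not defined in these slides
from rest_framework import serializers
from .models import Station

class StationSerializer(serializers.ModelSerializer):
    class Meta:
        model = Station
        # Expose the sensor-reading fields as JSON.
        fields = ['name', 'timestamp', 'temperature', 'latitude', 'longitude']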
DATA ANALYTICS FOR IoT
What is Big Data?
• Data which is very large in size is called Big Data.
• Normally we work on data of size MB (Word docs, Excel) or at most
GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called
Big Data.
• It is stated that almost 90% of today's data has been generated in the past 3
years.
Sources of Big Data
Social networking sites: Facebook, Google, LinkedIn all generate huge amounts
of data on a day-to-day basis as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge
amounts of logs from which users' buying trends can be traced.
Weather stations: All the weather stations and satellites give very large
amounts of data, which are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends
and publish their plans accordingly; for this they store the data of their
millions of users.
Share market: Stock exchanges across the world generate huge amounts of data
through their daily transactions.
3V's of Big Data
Velocity:
• The data is increasing at a very fast rate.
• It is estimated that the volume of data will double every 2 years.
Variety:
• Nowadays data is not stored only in rows and columns.
• Data is structured as well as unstructured.
• Log files and CCTV footage are unstructured data.
• Data that can be saved in tables is structured data, like the transaction
data of a bank.
Volume:
• The amount of data which we deal with is of very large size, on the order of
petabytes (10^15 bytes).
Issues:
• Huge amount of unstructured data which needs to be stored, processed and
analyzed.
Solution:
• Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop
Distributed File System), which uses commodity hardware to form clusters and
store data in a distributed fashion.
• It works on a "write once, read many times" principle.
• Processing: The MapReduce paradigm is applied to data distributed over the
network to find the required output.
• Analysis: Pig and Hive can be used to analyze the data.
• Cost: Hadoop is open source, so cost is no longer an issue.
Modules of Hadoop
• HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks
and stored in nodes over the distributed architecture.
• YARN: Yet Another Resource Negotiator is used for job scheduling and for
managing the cluster.
• MapReduce: This is a framework which helps Java programs to do parallel
computation on data using key-value pairs.
• A key-value pair consists of two related data elements: a key, which is a
constant that defines the data set (e.g., gender, color, price), and a value,
which is a variable that belongs to the set (e.g., male/female, green, 100).
• The Map task takes input data and converts it into a data set which can be
computed as key-value pairs.
• The output of the Map task is consumed by the Reduce task, and the output of
the reducer gives the desired result.
• Hadoop Common: These Java libraries are used to start Hadoop and are used by
the other Hadoop modules.
Advantages of Hadoop
• Fast: In HDFS the data is distributed over the cluster and mapped, which
helps in faster retrieval, thus reducing the processing time. Hadoop is able to
process terabytes of data in minutes and petabytes in hours.
• Scalable: A Hadoop cluster can be extended by just adding nodes to the
cluster.
• Cost effective: Hadoop is open source and uses commodity hardware to
store data, so it is really cost-effective compared to a traditional relational
database management system.
• Resilient to failure: HDFS can replicate data over the network, so if one
node is down or some other network failure happens, then Hadoop takes the
other copy of the data and uses it.
• Normally, data is replicated three times, but the replication factor is
configurable.
History of Hadoop
Apache Spark
• The main concern is speed in processing large datasets, in terms of the
waiting time between queries and the waiting time to run a program.
• Spark was introduced by the Apache Software Foundation to speed up the
Hadoop computational software process.
• Spark uses Hadoop in two ways: one is storage and the second is processing.
Since Spark has its own cluster management computation, it uses Hadoop for
storage purposes only.
Features of Apache Spark
• Speed − Spark helps to run an application in a Hadoop cluster up to 100 times
faster in memory, and 10 times faster when running on disk. This is possible by
reducing the number of read/write operations to disk: it stores the
intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala,
Python, and R, so you can write applications in different languages.
• Advanced analytics − Spark not only supports 'Map' and 'Reduce'; it also
supports SQL queries, streaming data, machine learning (ML), and graph
algorithms.
Components of Spark
The Spark framework includes:
• Spark Core as the foundation for the platform
• Spark SQL for interactive queries
• Spark Streaming for real-time analytics
• Spark MLlib for machine learning
• Spark GraphX for graph processing
• Spark Core: It is responsible for memory management, fault recovery,
scheduling, distributing & monitoring jobs, and interacting with storage systems.
Spark Core is exposed through application programming interfaces (APIs) built
for Java, Scala, Python, and R.
• MLlib- Machine Learning library
• Spark includes MLlib, a library of algorithms to do machine learning on data at
scale.
• Machine Learning models can be trained by data scientists with R or Python on
any Hadoop data source, saved using MLlib, and imported into a Java or Scala-
based pipeline.
• Spark was designed for fast, interactive computation that runs in memory,
enabling machine learning to run quickly.
• The algorithms include the ability to do classification, regression, clustering,
collaborative filtering, and pattern mining.
• Spark Streaming
• Real-time
• Spark Streaming leverages Spark Core's fast scheduling capability to
perform streaming analytics.
• Spark Streaming supports data from Twitter, Kafka, Flume, HDFS, and
ZeroMQ, and many others found from the Spark Packages ecosystem.
• Spark SQL - Interactive Queries
• Spark SQL is a distributed query engine that provides low-latency, interactive
queries up to 100x faster than MapReduce (a short PySpark sketch follows this
component list).
• It includes a cost-based optimizer, columnar storage, and code generation for fast
queries, while scaling to thousands of nodes.
• Business analysts can use standard SQL or the Hive Query Language for querying
data.
• Developers can use APIs, available in Scala, Java, Python, and R.
• GraphX
• Graph Processing
• Spark GraphX is a distributed graph processing framework built on top of
Spark.
• GraphX provides ETL, exploratory analysis, and iterative graph computation to
enable users to interactively build, and transform a graph data structure at scale.
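A short PySpark sketch of an interactive SQL query over sensor-style data (the
readings and column names are made up for illustration):

# spark_sql_sketch.py - requires the pyspark package
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-demo").getOrCreate()

# Hypothetical readings; in practice these would be loaded from HDFS, S3, etc.
readings = spark.createDataFrame(
    [("station-1", 31.2), ("station-1", 33.9), ("station-2", 27.4)],
    ["station", "temperature"],
)

# Register the DataFrame as a SQL view and query it with standard SQL.
readings.createOrReplaceTempView("readings")
spark.sql(
    "SELECT station, AVG(temperature) AS avg_temp FROM readings GROUP BY station"
).show()

spark.stop()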
Data Analytics for IoT
• The Volume, velocity and variety of data generated by data-intensive
IoT systems is so large that it is difficult to store, manage, process and
analyse the data using traditional databases and data processing tools.
• Analysis of data can be done with aggregation functions (sum, min,
max, count, average) OR
• Using ML methods such as clustering and classification.
• Clustering - grouping similar data items together, so that data items
which are more similar to each other than to other data items are put in one
cluster.
• Classification is used for categorizing objects into predefined
categories; a short clustering sketch follows.
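A rough sketch of clustering sensor readings with scikit-learn; the readings
and the choice of two clusters are arbitrary illustration, not from the slides:

# clustering_sketch.py - requires scikit-learn
from sklearn.cluster import KMeans

# Hypothetical (temperature, humidity) readings.
readings = [
    [22.1, 45.0], [23.4, 47.2], [21.8, 44.1],   # normal conditions
    [41.5, 12.3], [43.0, 10.8],                 # hot and dry conditions
]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(readings)
print(model.labels_)   # cluster id assigned to each reading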
REST Services, Analytics Component
(IoT Intelligence)
Deployment design of a forest fire detection system
• Deployment design of a forest fire detection system with multiple end
nodes which are deployed in forest.
• The end nodes are equipped with sensors for measuring temperature,
humidity, light, and carbon monoxide (CO) at various locations in the
forest.
• Each end node sends data independently to the cloud using REST-
based communication.
• The data collected in the cloud is analysed to predict whether fire has
broken out in the forest.
Each reading contains: Timestamp, Temperature (°C), Humidity (%), Light (lux),
CO (parts per million).
• A measurement of 1 lux is equal to the illumination of a one square metre
surface that is one metre away from a single candle.
• ppm is used to measure the concentration of a contaminant in soils
and sediments.
• Parts per million (ppm) is the number of units of mass of a
contaminant per million units of total mass.
• Xively Cloud was designed to enable
developers to connect, manage, and
analyze data from IoT devices. It
provided features such as device
connectivity, data management, real-time analytics, and visualization
tools.
1.Apache Hadoop
• Hadoop is an open-source framework from Apache and is used to
store, process, and analyze data which is very huge in volume.
• Hadoop is written in Java.
• It is used for batch/offline processing.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and
many more.
• Moreover it can be scaled up just by adding nodes in the cluster.
1.1 MapReduce Programming Model
• MapReduce is a programming framework that allows us to perform parallel
processing on large data sets in a distributed environment.
• A MapReduce program is composed of a map procedure, which performs
filtering and sorting, and a reduce method, which performs a summary
operation.
• The main components of a MapReduce program are the Mapper and Reducer.
• Languages used: Java, Python, or others.
• Mapper: The Mapper class splits the input data into key-value pairs
(e.g., gender: M/F, color: green, price: 100).
• Reducer: The Reducer class takes the key-value pairs output by the Mapper
and reduces them to a result; a word-count sketch in Python follows.
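A minimal word-count sketch in the Hadoop Streaming style, where the mapper and
reducer are separate Python scripts reading standard input and writing
tab-separated key-value pairs (the file names are an assumption for
illustration):

# mapper.py - emit (word, 1) for every word on standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sum the counts for each word (input arrives sorted by key)
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")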
Data flow in MapReduce and Example
1.2 Hadoop MapReduce Job Execution
• About MapReduce job execution workflow and the steps involved in job
submission, job initialization, task selection and task execution.

The Hadoop Distributed File System (HDFS) is a distributed file system for
Hadoop. It has a master/slave architecture.
Components of a Hadoop Cluster
• A Hadoop cluster comprises a master node, a backup node, and a
number of slave nodes.
• The master node runs the NameNode and JobTracker processes.
• The slave nodes run the DataNode and TaskTracker components of
Hadoop.
• The backup node runs the Secondary NameNode process.
• The functions of the key processes of Hadoop are described as follows:
NameNode:
• Keeps the directory tree of all files in the file system and tracks where
across the cluster the file data is kept.
• The NameNode does not store the data of these files itself.
• Client applications talk to the NameNode whenever they wish to locate a
file or when they want to add/copy/move/delete a file.
• The NameNode responds to successful requests by returning a list of
relevant DataNode servers where the data lives.
• The NameNode serves as both the directory namespace manager and the inode
table (which keeps track of all the files) for the Hadoop DFS.
• There is a single NameNode running in any DFS deployment.
Secondary NameNode:
• HDFS is not currently a high availability system.
• The NameNode is a single point of failure for the HDFS cluster.
• When the NameNode goes down, the file system goes offline.
• An optional secondary NameNode which is hosted on a separate machine
creates checkpoints of the namespace(directories/files & blocks).

JobTracker:
• distributes MapReduce tasks to specific nodes in the cluster.
• Client applications submit jobs to the Job tracker.
• The JobTracker submits the work to the chosen TaskTracker nodes.
TaskTracker:
• A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce
and Shuffle operations - from a JobTracker.
• Every TaskTracker is configured with a set of slots; these indicate the
number of tasks that it can accept.
DataNode:
• DataNodes are the slave nodes in HDFS.
• The actual data is stored on DataNodes.
• A functional filesystem has more than one DataNode, with data replicated
across them.
• On startup, a DataNode connects to the NameNode; spinning until that
service comes up.
Purpose of DataNode:
• The DataNode stores HDFS data in files in its local file system.
• The DataNode has no knowledge about HDFS files.
• It stores each block of HDFS data in a separate file in its local file system.
• The DataNode does not create all files in the same directory.
File Block In HDFS
• Data in HDFS is always stored in terms of blocks: a single file is divided
into multiple blocks of 128 MB (the default size, which can also be changed
manually). Nowadays block sizes of 128 MB to 256 MB are used in Hadoop.
MapReduce Job Execution Workflow for Hadoop
• The job execution starts when the client applications submit jobs to the job tracker.
• The JobTracker returns a JobID to the client application.
• The JobTracker talks to the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at/or near the data.
• The TaskTrackers send out the heartbeat messages to the JobTracker, usually every
few minutes, to reassure the JobTracker that they are still alive.
• A 'heartbeat' is a signal sent between a DataNode and NameNode. This signal is taken
as a sign of vitality(strength/energy). If there is no response to the signal, then it is
understood that there are certain technical problems with the nodes.
• These messages also inform the JobTracker of the number of available slots, so the
JobTracker can stay up to date with where in the cluster, new work can be delegated.
• The JobTracker submits the work to the TaskTracker nodes when they poll for tasks.
• To choose a task for a TaskTracker, the JobTracker uses various scheduling
algorithms.
• The default scheduling algorithm in Hadoop is FIFO (First-In First-Out).
• In FIFO, a work queue is maintained and the JobTracker pulls the oldest job
first for scheduling.
• There is no notion of job priority or job size in FIFO scheduling.
• The TaskTracker nodes are monitored using the heartbeat signals that are sent by the
TaskTrackers to JobTracker.
• The TaskTracker spawns a separate JVM process for each task, so that a task
failure does not bring down the TaskTracker itself.
• The TaskTracker monitors these processes while capturing the output and exit
codes.
• When a process finishes, successfully or not, the TaskTracker notifies the
JobTracker.
• When a task fails, the JobTracker decides whether to resubmit the job to some
other TaskTracker or mark that specific record as something to avoid.
• The JobTracker can blacklist a TaskTracker as unreliable if there are
repeated task failures.
• When the job is completed, the JobTracker updates its status.
• Client applications can poll the JobTracker for the status of jobs.
1.4 Hadoop Cluster Setup
Steps involved in setting up a Hadoop cluster are described as follows:
• Install Java: Hadoop requires Java 6 or a later version.
• Daemons are processes that run in the background of the system. The
components of Hadoop known as daemons include the NameNode, Secondary
NameNode, DataNode, JobTracker, and TaskTracker.
• SSH keys are more secure than passwords and provide an easy way to secure
access to your Hadoop cluster. If your SSH account is secured using a key, the
client must provide the matching private key when you connect.
APACHE STORM VS HADOOP
• Storm: real-time stream processing.  Hadoop: batch processing.
• Storm: stateless.  Hadoop: stateful.
• Storm: master/slave architecture with ZooKeeper-based coordination; the
master node is called Nimbus and the slaves are Supervisors.  Hadoop:
master/slave architecture with or without ZooKeeper-based coordination; the
master node is the JobTracker and the slave nodes are TaskTrackers.
• Storm: a streaming process can access tens of thousands of messages per
second on a cluster.  Hadoop: HDFS with the MapReduce framework processes vast
amounts of data, taking minutes or hours.
• Storm: a topology runs until shut down by the user or an unexpected
unrecoverable failure.  Hadoop: MapReduce jobs are executed in a sequential
order and completed eventually.
• Both are distributed and fault-tolerant.
• Storm: if Nimbus or a Supervisor dies, restarting makes it continue from
where it stopped, so nothing is affected.  Hadoop: if the JobTracker dies, all
the running jobs are lost.
Following are the features of Apache Storm.
• It is an open source and a part of Apache projects.
• It helps to process big data.
• It is a fast and reliable processing system.
• It is highly parallelizable, scalable, and fault-tolerant.
• A stream represents a continuous sequence of bytes of data. It is produced by
one program and consumed by another.
• It is consumed in the ‘First In First Out’ (or FIFO) sequence. That means if
12345 is produced by one program, another program consumes it in the order
12345 only. It can be bounded or unbounded.
• Bounded means that the data is limited.
• Unbounded means that there is no limit and the producer will keep producing
the data as long as it runs and the consumer will keep consuming the data.
• A Linux pipe is an example of a stream.
Industry Use Cases for STORM

Many industries can use Storm for real-time big data processing such as:
• Credit card companies can use it for fraud detection on swipe.
• Investment banks can use it for trade pattern analysis in real time.
• Retail stores can use it for dynamic pricing.
• Transportation providers can use it for route suggestions based on traffic
data.
• Healthcare providers can use it for the monitoring of ICU sensors.
• Telecom organizations can use it for processing switch data.
STORM Data Model: Storm data model consists of tuples and streams.
Tuple
• A tuple is an ordered list of named values similar to a database row. Each field
in the tuple has a data type that can be dynamic. The field can be of any data
type such as a string, integer, float, double, boolean or byte array. User-defined
data types are also allowed in tuples.
• For example, for stock market data, if the schema is in the (ticker, year,
value, status) format, then some tuples are: (ABC, 2011, 20, GOOD),
(ABC, 2012, 30, GOOD), (ABC, 2012, 32, BAD), (XYZ, 2011, 25, GOOD).
Stream
• A stream of Storm is an unbounded sequence of tuples.
• For example, if the above tuples are stored in a file stocks.txt format, then the
command cat stocks.txt produces a stream. If the process is continuously
putting data into stocks.txt format, then it becomes an unbounded stream.
Storm Architecture
• Storm has a master-slave architecture.
• There is a master server called Nimbus running on a single node called
master node.
• There are slave services called Supervisors that run on each
worker node.
• Each Supervisor runs one or more worker processes, called workers, that run
in parallel to process the input.
• The diagram shows the Storm architecture with one master
node and five worker nodes.
• The Nimbus process is running on the master node.
• There is one supervisor process running on each worker node.
• There are multiple worker processes running on each worker
node.
• The workers get the input from the file system or database and
store the output also to a file system or database.
Storm Processes
• A ZooKeeper cluster is used for coordinating the master,
supervisor, and worker processes.
Nimbus (master) process:
• Assigns and distributes the tasks to the worker nodes
• Monitors the tasks
• Reassigns tasks on node failure
Supervisor process:
• Runs on each worker node of the cluster
• Runs each task as a separate process called a worker process
• Communicates with Nimbus using ZooKeeper
• The number of worker processes for each task can be configured
Worker process:
• Runs on any worker node of the cluster
• Started and monitored by the supervisor process
• Runs either spout or bolt tasks
• The number of worker processes for each task can be configured
Sample Program
• A Log processing program takes each line from the log file and
filters the messages based on the log type and outputs the log
type.
• Input: a log file containing error, warning, and informational
messages. This is a growing file receiving continuous lines of log
messages.
• Output: the type of each message (ERROR, WARNING, or INFO).
Let us continue with the sample program.
• This program given below contains a single spout and a single bolt.
The spout does the following:
• Opens the file, reads each line and outputs the entire line as a tuple.
The bolt does the following:
• Reads each tuple from the spout and checks if the tuple contains the string ERROR or
WARNING or INFO.
• Outputs only ERROR or WARNING or INFO.
LineSpout {
    foreach line = readLine(logfile) {
        emit(line)
    }
}

LogTypeBolt(tuple) {
    if (tuple contains "ERROR") emit("ERROR");
    if (tuple contains "WARNING") emit("WARNING");
    if (tuple contains "INFO") emit("INFO");
}
• The spout is named LineSpout.
• It has a loop to read each line of input and outputs the entire line.
• The emit function is used to output the line as a stream of tuples.
• The bolt is named LogTypeBolt. It takes the tuple as input.
• If the line contains the string ERROR, then it outputs the string ERROR.
• If the line contains the string WARNING, then it outputs the string
WARNING.
• Similarly, if the line contains the string INFO, then it outputs the string
INFO.
Storm Components
• Storm provides two types of components that process the input stream, spouts,
and bolts. Spouts process external data to produce streams of tuples. Spouts
produce tuples and send them to bolts. Bolts process the tuples from input
streams and produce some output tuples. Input streams to bolt may come from
spouts or from another bolt.
• The diagram shows a Storm cluster consisting of one spout and two bolts. The
spout gets the data from an external data source and produces a stream of
tuples. The first bolt takes the output tuples from the spout and processes them
to produce another set of tuples. The second bolt takes the output tuples from
bolt 1 and stores them into an output stream.
• Storm Example
• Let us illustrate storm with an example.
• Problem: The stock market data which is continuously sent by an external
system should be processed, so that data with GOOD status is inserted into a
database whereas data with BAD status is written to an error log.
• STORM Solution: This will have one spout and two bolts in the topology.
Spout will get the data from the external system and convert into a stream of
tuples. These tuples will be processed by two bolts. Those with Status GOOD
will be processed by bolt1. Those with status BAD will be processed by bolt2.
Bolt1 will save the tuples to Cassandra database. Bolt2 will save the tuples to
an error log file.
• The diagram shows the Storm topology for the above solution. There is one
spout that gets the input from an external data source. There is bolt 1 that
processes the tuples from the spout and stores the tuples with GOOD status to
Cassandra. There is bolt 2 that processes the tuples from the spout and stores
the tuples with BAD status to an error log.
