SIC Big Data Chapter 3 Workbook
Contents
Lab 1: Data Ingestion with Sqoop for RDBMS (MariaDB)................................................................. 3
Lab 2: Data Ingestion with Apache Flume ....................................................................................... 13
Lab 3: Making your first dataflow ..................................................................................................... 20
Lab 4: Creating Connections .............................................................................................................. 28
Lab 5: Navigating Data Flows ............................................................................................................ 35
Lab 6: Creating and Using Templates ............................................................................................... 42
Lab 7: Using Processor Groups .......................................................................................................... 50
Lab 8: Setting Back Pressure on your Connections ........................................................................ 61
Lab 9: Working with Hadoop in NiFi.................................................................................................. 70
Lab 10: Creating a Kafka Topic, Producer, and Consumer ............................................................... 75
Lab 11: Sending Messages from Flume to Kafka .............................................................................. 81
Lab 1: Data Ingestion with Sqoop for RDBMS (MariaDB)
In this lab, you will import and export tables between an RDBMS and HDFS (Hadoop Distributed
Filesystem) using Sqoop.
First, in the next few steps we will examine the databases and tables in MariaDB before
importing them into HDFS.
Database: labs
Note: If you do not enter anything after the password option, you will be prompted for the password:
2. If the login is successful, the "MariaDB [labs]>" prompt appears and a screen waiting for
commands is displayed. Enter a command to check which databases exist here.
Figure 1. List databases in MariaDB
Figure 3. Structure of Tables (authors and posts)
4. Review the structure of the authors and posts tables and look at some records.
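For example, from the MariaDB prompt (a minimal sketch; adjust the LIMIT values as you like):
DESCRIBE authors;
DESCRIBE posts;
SELECT * FROM authors LIMIT 5;
SELECT * FROM posts LIMIT 5;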
MariaDB> quit
6. Run the following command to check the basic options of sqoop.
$ sqoop help
7. To see detailed options for each sub-command, enter the desired subcommand after help. To
see detailed options for import, run the command as follows.
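For example, to see the detailed options for the import sub-command:
$ sqoop help import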
8. List the databases in MariaDB and the tables in the labs database with the following
commands.
The command execution result is the same as the databases shown in Figure 1.
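Note: a sketch of the two commands, assuming the same student/student credentials used above:
$ sqoop list-databases --connect jdbc:mysql://localhost/labs --username student --password student
$ sqoop list-tables --connect jdbc:mysql://localhost/labs --username student --password student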
Note: an alternative to using the --password argument is to use -P (capital letter) and let Sqoop
prompt you for the password, which is then not visible as you type it.
Note: Sqoop provides import-all-tables, but this command is rarely used in real production
because it tries to accomplish too much with a single command. Do not use it this time.
Real environments typically have hundreds of databases and thousands of tables in each
database, so use this command only to test your system. Even importing a single table can take
a lot of time, so importing all tables is usually impractical; in most cases, the import command
is used instead.
10. Execute the command to fetch the posts table from the labs database using Sqoop and
store it in HDFS.
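Note: a sketch of such a command, assuming the same connection settings as above; the tab
delimiter matches the tab-delimited posts file described in step 13 below:
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
  --table posts --fields-terminated-by '\t'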
When this command is executed, the posts directory is created under the /user/student home
directory of HDFS and data is stored as follows.
12. Import the authors table and save it to the HDFS directory we created above using ',' to
delimit the fields.
Note: The --fields-terminated-by ',' option separates the fields in the HDFS file with a comma.
If you plan to work with Hive or Spark, it is often better to use '\t' instead of ','.
--target-dir /mywarehouse/authors
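Note: putting the pieces together, a sketch of the full step-12 command (assuming the same
credentials as before) might look like:
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
  --table authors --fields-terminated-by ',' --target-dir /mywarehouse/authors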
13. Verify that the command worked by running hdfs commands against the target directory.
If you execute the cat command, you can see that each line of data is stored separated by ",",
unlike the previous posts file (tab delimited) in HDFS.
14. Import only the specified columns with --columns for authors into the HDFS home directory.
The imported columns are first_name, last_name, and email.
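Note: a sketch of such a command; with no --target-dir, Sqoop writes to an authors directory
under the HDFS home directory:
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
  --table authors --columns "first_name,last_name,email"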
15. Import only the matching rows with the --where option. The imported rows are those whose
first name is 'Dorthy' in the authors table.
Note: The output of Hadoop jobs is saved as one or more "partition" files. Usually 4 files are
created, and the query results are spread across those files.
16. Import a table using an alternate file format instead of text format. Import the authors
table in Parquet format.
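Note: a sketch of the Parquet import; the target directory name here is only illustrative:
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
  --table authors --as-parquetfile --target-dir authors_parquet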
17. View the results of the import commands by listing the contents in HDFS (target-dir).
18. Import a table using a compression option, --compress or -z, for the authors table.
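Note: a sketch of the compressed import (gzip is Sqoop's default codec when --compress is given);
the target directory name is illustrative:
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
  --table authors --compress --target-dir authors_compressed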
19. First, import the rows whose first name is "Dorthy", as performed in step 15, and save them
in a dorthy directory under the HDFS home directory.
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
  --table authors --fields-terminated-by '\t' --where "first_name='Dorthy'" --target-dir dorthy
21. Create a target directory in HDFS (/tmp/mylabs) to import table data into for the labs database.
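For example:
$ hdfs dfs -mkdir -p /tmp/mylabs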
21.1. From the posts table, import only the primary key, along with the posting title and
posting date, to the HDFS directory /tmp/mylabs/posts_info. Please save the file in text
format with tab delimiters.
Hint: You will have to figure out what the names of the table columns are in order to complete
this lab using just Sqoop commands instead of MariaDB access.
21.2. This time, save the same data in Parquet format with Snappy compression. Save it in
/tmp/mylabs/posts_compressed.
21.3. From the terminal, display some of the records that you just imported.
21.4. Import only the specified columns of authors into the HDFS home directory. The imported
columns are id, first_name, last_name, and birthdate, with a tab field delimiter. Save the
files in text format.
Hint: If the authors directory exists in the HDFS home directory, delete it and execute the
import command.
21.5. Import into /tmp/mylabs/posts_NotN only the posts (by id) whose title is not null. Import
only the id, title, and content columns, not all columns. Save the files in Parquet format,
compressed using the Snappy codec.
Hint: The SQL command for retrieving data whose field value is not null is "column_name is not
null".
Lab 2: Data Ingestion with Apache Flume
In this lab, you run the Flume agent to collect data from various data sources and store it as
HDFS or local filesystem.
This Agent allows the user to generate events and subsequently log them to the console. This
configuration defines a single agent named agent1.
mkdir flume
cd flume
vi transfer.conf
The agent1 has a source that listens for data on port 3333, a channel that buffers event data
in memory, and a sink that logs event data to the console.
agent1.sources = netcatSrc
agent1.channels = memChannel
agent1.sinks = log
agent1.sources.netcatSrc.channels = memChannel
agent1.sinks.log.channel = memChannel
agent1.sources.netcatSrc.type = netcat
agent1.sources.netcatSrc.bind = 0.0.0.0
agent1.sources.netcatSrc.port = 3333
agent1.sinks.log.type = logger
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100
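Note: the launch command for step 1.3 is not reproduced in this text. A typical invocation,
assuming the configuration above is saved as ~/flume/transfer.conf, looks like this:
$ flume-ng agent --conf-file ~/flume/transfer.conf --name agent1 -Dflume.root.logger=INFO,console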
1.4. Open another terminal window and execute the telnet command.
1.5. Check that the message sent via telnet in step 1.4 is output in the terminal where the
Flume agent was started in step 1.3.
Note: If telnet is not closed properly, the port is not released, and when you try to
connect again you may get an error that the port is already in use.
^] (ctrl+])
telnet> close
Agent 2 saves the files arriving in the spool directory to a local directory.
vi transfer_spool.conf
agent2.sources = dirSrc
agent2.channels = memChannel
agent2.sinks = fileSink
agent2.sources.dirSrc.channels = memChannel
agent2.sinks.fileSink.channel = memChannel
agent2.sources.dirSrc.type = spoolDir
agent2.sources.dirSrc.spoolDir = /home/student/data/spool
agent2.sinks.fileSink.type = file_roll
agent2.sinks.fileSink.sink.directory = /home/student/data/output
agent2.sinks.fileSink.sink.rollInterval = 0
agent2.channels.memChannel.type = memory
agent2.channels.memChannel.capacity = 100
2.4. Open another terminal window and copy the text files to the spool directory.
cd /home/student/flume/incoming
cp ~/Data/*.txt .
vi hello.txt
This is test file for Flume.
2.5. Check that pig_data1.txt, pig_data2.txt, alice_in_wonderland.txt, and hello.txt, which
were copied to the spool directory in step 2.4, are transmitted to the terminal where
Agent 2 is running.
You can also check the transmission of hello.txt created with vi. The transferred files are saved
in the output directory.
2.6. The transferred files are stored as files in the output directory
3. Using an Interceptor
Agent 3 inserts the IP address of the host where the agent is running into the event header.
vi interceptor.conf
agent3.sources = netcatSrc
agent3.channels = memChannel
agent3.sinks = log
agent3.sources.netcatSrc.channels = memChannel
agent3.sinks.log.channel = memChannel
agent3.sources.netcatSrc.type = netcat
agent3.sources.netcatSrc.bind = 0.0.0.0
agent3.sources.netcatSrc.port = 3333
agent3.sinks.log.type = logger
agent3.channels.memChannel.type = memory
agent3.channels.memChannel.capacity = 100
agent3.sources.netcatSrc.interceptors = i1
agent3.sources.netcatSrc.interceptors.i1.type = host
agent3.sources.netcatSrc.interceptors.i1.hostHeader = hostname
3.4. Open another terminal window and execute the telnet command.
telnet localhost 3333
This is testing Flume with interceptor.
Hadoop
Spark
3.5. Check that the message sent via telnet in step 3.4 is output in the terminal where the
Flume agent was started in step 3.3, and confirm that the IP address of the host where
the agent is currently running has been inserted into the event header.
Note: IP is 127.0.0.1
3.6. Delete the temporary directory used for the flume operations.
$cd ~/flume
$rm -rf incoming output
Source
  Type: Netcat
  Bind: localhost
  Port: 11111
Channel
  Type: Disk
  Capacity: 1000
  transactionCapacity: 100
Sink
  Type: logger
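Note: a hedged sketch of what the completed configuration might look like; the agent and
component names (agent4, netcatSrc, fileChannel, log) are choices made here rather than given
by the lab, and the "Disk" channel is mapped to Flume's file channel type.
agent4.sources = netcatSrc
agent4.channels = fileChannel
agent4.sinks = log
agent4.sources.netcatSrc.type = netcat
agent4.sources.netcatSrc.bind = localhost
agent4.sources.netcatSrc.port = 11111
agent4.sources.netcatSrc.channels = fileChannel
agent4.channels.fileChannel.type = file
agent4.channels.fileChannel.capacity = 1000
agent4.channels.fileChannel.transactionCapacity = 100
agent4.sinks.log.type = logger
agent4.sinks.log.channel = fileChannel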
4.2. Start the agent
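Note: assuming the configuration above was saved as ~/flume/exercise4.conf, a typical launch
command would be:
$ flume-ng agent --conf-file ~/flume/exercise4.conf --name agent4 -Dflume.root.logger=INFO,console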
4.3. From another terminal, start telnet and connect to the port you configured above (11111).
Start typing, and you should see the result in the other terminal.
Lab 3: Making your first dataflow
NiFi supports user authentication via client certificates, username/password, Apache Knox, or
OpenID Connect. The default option is Single User with a username and password. In order to
authenticate and synchronize with the Linux user, one of the more sophisticated authentication
mechanisms would have to be installed. However, installing one would be beyond the scope of the
labs and beyond the resources available within the virtual machine.
In order to facilitate the labs, the NiFi service will be run as the Linux root user.
1.1. Open a terminal and run the following command to start NiFi services
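Note: the exact command depends on how NiFi is installed in this VM; the service name and
install path below are assumptions.
$ sudo systemctl start nifi               # if NiFi is registered as a systemd service
$ sudo /opt/nifi/bin/nifi.sh start        # or, using NiFi's bundled start script (path assumed)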
1.2. Open the Firefox browser and open the NiFi Web UI. It may take some time for the NiFi
service to come up. You may have to wait a few minutes if you cannot connect to the
NiFi Web UI right away.
https://localhost:8443/nifi
1.3. Log in to the NiFi service. The current username is student123456789 and the password is
also student123456789.
2.2. Explore the Add Processor pop up window
2.2.1. Try clicking on hadoop from the category choices on the left tab. Observe
how the selection of available processors in the middle changes. How many
processors are related to Hadoop?
You will see text near the top of the screen saying Displaying nn of NNN. There are NNN processors
in total in this version of NiFi. Of those, nn are related to Hadoop.
2.2.2. Try combining hadoop with other categories such as put or get. What happens if
you select both put and get? Now remove get. What happens?
When you try to select both put and get, you will notice that there are no matching
processors. This is because there are no Hadoop related processors that can
perform both get and put operations at the same time.
2.2.3. With hadoop and put selected on the left pane, try entering "Gener" in the filter
on the top left of the pop up window. Notice that there are no available
selections again. What happened?
2.2.4. The filter and the categories work together. There is no processor that has
"Gener" in its name and belongs to both the hadoop and put categories.
2.2.5. Now, remove hadoop and put from the category filters and try entering
"Gener" in the search filter. You will now see several processors that can be
selected. Choose the GenerateFlowFile processor by selecting it and clicking Add.
Alternatively, you can add it by double clicking it. Your screen should look
similar to below
2.3.1. Move the GenerateFlowFile processor around the canvas by clicking and
dragging it.
2.3.2. Move the canvas around by clicking on an empty space
within the canvas and dragging the mouse around. Notice
that the small box in the Navigate palette screen on the
left moves around when you move the canvas.
2.3.3. This time, try moving the box in the Navigate Palette. What happens?
2.3.4. Zoom in and out of the canvas by moving the middle mouse wheel up and
down. Alternatively, try using the + and - buttons in the Navigate palette.
2.3.5. Make the processor icon very small by zooming out and
then try clicking the fit and actual buttons on the
Navigate Palette.
2.4. Explore the Configure Processor pop-up window by either double clicking the
GenerateFlowFile processor or right-click and selecting configure.
2.5.3. Select both processors by either holding down the shift key and selecting
each processor, or holding down the shift key and clicking and dragging
around the two processors to create a selection box.
2.5.4. Disable the second Lab1-GenerateFlowFile processor. You can do this by
using the disable button from the Operate Palette or the disable option from the
context menu.
2.5.5. Now delete the disabled processor that we just created. How would you do
this?
Use the delete button from the Operate Palette or the delete option from the context
menu after right-clicking on the processor to delete.
3.1.5. Select the success option for the For Relationships selection and
ADD the connection. When you are done, your data flow should look similar to
below.
3.1.6. The yellow exclamation triangle in the Lab1-PutFile processor indicates that
there is a problem with the setting for this processor. Hover the mouse over
the yellow icon to see what the problem is.
3.1.7. Recall from our lectures that all Relationships must be either directed
forward to another processor or terminated. There are a couple of
relationships that need to be terminated. Since, after saving the flowfiles to
disk, we will not need to forward them further downstream, let's auto-terminate
them at this processor. Go to the SETTINGS tab of the processor
and check both the failure and success Relationships for auto-termination.
3.2. Run the Processors
3.2.1. Now both processors should be in a stopped state (red square) and we are
ready to run the processors. Select both processors and click on the Run
button from the Operate Palette or choose the Start option from the canvas
context menu.
3.2.2. Both processors should transition to the run state and you will see data
movement from the 5 minute statistics.
3.2.3. Stop both processors. The procedure is analogous to starting the processors but
this time select the stop icons.
3.3.1. Open a terminal and navigate to the local /tmp/nifi directory. Get a listing of
the directory to verify that the PutFile processor has saved the flowfiles in this
directory. Your output will look similar to below.
Lab 4: Creating Connections
In this lab, we will look at Relationships and their connections between processors. A
Relationship in a processor represents a potential outcome of that processor. The available
Relationships will be determined by the type of processor as well as attributes in the flowfile.
For example, success or failure is a very common Relationship available in most processors.
Recall that all Relationships must be either forwarded downstream or auto-terminated.
1.1. Add a new processor group to the root canvas by clicking and dragging the Processor Group
icon to the canvas
1.1.2. Double click on the newly created processor group. You will be presented with a
clean canvas where you can create data flows. NiFi will also let you know where
you are in the hierarchy of processor groups from the breadcrumb screen. You
can move between processor groups by clicking on the desired processor group.
In order to return to the root canvas, you can click on the top level "NiFi Flow."
Do not leave the Lab 2 processor group for now. We will create our data flow
inside this processor group.
2. Create a data pipeline with multiple paths
2.1. Add a GenerateFlowFile processor to the canvas
terminating the failure and original Relationships, we are effectively preventing
any flowfiles from being passed further downstream in those two situations.
2.3.3. Leave the Run Schedule at 0 sec. This will allow this processor to process any
flowfiles as soon as they become available.
2.3.4. Set the Line Split Count property to 1. This will split the incoming text to a
separate flow file for each line of text. Since we have set the upstream
GenerateFlowFile processors to send multiple lines of text, this processor will
split each line into a new flow file.
2.4. Connect the GenerateFlowFile processors to the SplitText processor
2.4.2. Now, click and drag the connection icon to the middle of the SplitText processor
and let go of the mouse
2.4.3. A pop-up window to create the connection will come up. Make sure the success
Relationship is selected for the connection
2.5. Connect Lab2-Bad-GenerateFlowFile the same way. Your connection so far should look
similar to below.
(.*?):
2.6.4. This is a regular expression that matches zero or more of any character, non-greedily,
until a colon (:) is encountered. Learning regular expressions is not
within the scope of this lab, but you can easily find resources to learn the syntax. The
parentheses capture a group, and the value captured within this group will be
assigned to the attribute named names.
2.7. Connect the Lab2-SplitText processor to the Lab2-ExtractText processor on the splits
Relationship.
2.8. Notice the yellow exclamation mark on the Lab2-ExtractText processor. Hover the
mouse over the yellow icon to review the error message. What does the error message
say? How can you fix it?
2.9. The problem is that there are two relationships generated from Lab2-ExtractText that
are not terminated. We have to either automatically terminate them or pass the flowfiles
for each relationship further downstream.
2.10.1. Add two Funnels below the Lab2-ExtractText processor by clicking and dragging
the Funnel icon to the canvas.
2.10.2. Connect the matched relationship to one of the funnels and the unmatched
relationship to the other. A funnel lets you take connections from multiple
sources and forward them to a downstream processor. However, many NiFi
developers also use a funnel as a temporary destination for data flows still in the
works. We will use the funnel in this context. When you are done, your data
flow pipeline should look similar to below.
Lab 5: Navigating Data Flows
In this lab, we will run the processors that we created in Lab 2 and observe the flow files as
they are generated. Finally, we will take part of the data flow and create a template for reuse
in other labs.
1.2. Go to the data flow previously created in Lab 2 and select all components, including
processors, connections and funnels. Do you recall how to select multiple components?
You can either hold down the shift key and select each of the components or hold down the
shift key and click and drag the mouse around all the components that you are trying to select.
1.3. Copy all the selected components using either the context menu or copy icon from the
Operate palette. You can also use the Ctrl-C keyboard shortcut.
1.4. Navigate back to the Lab 3 processor group and paste the components. You can use the
paste icon from the Operate palette or right click anywhere on the canvas and select
paste. Finally, you can use the Ctrl-V keyboard shortcut as well.
2. Start Processors
2.1. Start the Lab2-Good-GenerateFlowFile processor
2.1.1. Before any processor can be started, they must be in a stopped state. This is the
red square button shown on each of the processors.
2.1.2. Select just the Lab2-Good-GenerateFlowFile and start it. You can use the start
( ) icon from the Operate Palette. You can also right click on the processor and
use the context menu.
2.2. Stop and observe the generated flowfiles.
2.2.1. Let the Lab2-Good-GenerateFlowFile run for about 30 seconds or so. Recall that
we have set this processor to trigger every 10 seconds so you should get 3 or 4
flowfiles generated during that time. Stop the processor using the stop icon ( )
from the Operate palette or using the context menu.
2.2.2. Your connection between Lab2-Good-GenerateFlowFile and Lab2-SplitText
should now have some flowfiles queued similar to below. The Lab2-Good-
GenerateFlowFile will also have updated and now shows activity.
2.2.3. Right click on the connection where the flowfiles are queued. A context menu
will appear. Select the List queue option.
2.2.5. You can get detailed information about flowfiles including information on
attributes. Click on any of the ( ) icons on the far left to display the pop-up
window shown below.
2.2.6. Every flowfile has a unique UUID associated with it. This allows NiFi to track
every flowfile and record its lineage. A lineage is information tracking how and
when a flowfile was created, modified, copied, cloned, deleted, etc. It tracks
each change in time order. This is the provenance feature of NiFi. We also see
that flowfiles are actual physical files with a Filename, and that they are saved durably for
some configured period of time. This portion of the flowfile does not contain the
actual content of the flowfile, as that would be very inefficient. The flowfile
itself keeps a pointer to a separate contents file that contains the actual content.
We can view or download the content using the download ( ) or view
icon ( ) from the details tab.
2.2.7. From the Attributes tab, we can view all attributes associated with this flowfile.
Every flowfile will have a few default attributes such as filename, uuid, path, etc.
In addition, if the user has created any custom attributes, it will be displayed
here.
2.2.8. View the contents of several flowfiles. What do you see?
You should see the text you entered in custom text field
from the Lab2-Good-GenerateFlowFile processor. This will
be:
processor for about 30 seconds and then stop. View the queued flowfiles from this
Relationship. What are the contents of the flowfiles?
You should also see the custom text entered. In our case, we entered “some text, more random
text, last text” all on separate lines.
2.4. Start only the Lab2-SplitText processor and let it run until you see all the queued
flowfiles from upstream clear. Each flowfile from the Lab2-Good-GenerateFlowFile
will be split into a separate flowfile for each line entry. Since there are four lines, this
will result in four (4) new flowfiles. Similarly, there are three (3) line entries in each
flowfile from Lab2-Bad-GenerateFlowFile. This will result in three (3) new flowfiles.
How many new flowfiles do you expect to see queued between the Lab2-SplitText and
Lab2-ExtractText processor?
If you had 3 flowfiles on each connection, then you should now see (3 × 4) + (3 × 3) = 21 flowfiles
queued up.
2.5. As before in steps 2.2.5 and 2.2.8 in Lab 3, view the detail information including the
content and attribute of the queued flowfiles. What do you see?
You should see some new attributes automatically created by the system due to the split
operation. There is now a fragment.count attribute that corresponds to the number of new
flowfiles created from the original flowfile upstream. The new fragment.index attribute
corresponds to index number of the current flowfile within the fragments. The flowfile shown
below is the second fragment. When we check the contents, we see that it corresponds to the
second line item we entered in the custom text field for Lab2-Good-GenerateFlowFile.
2.6. Start the Lab2-ExtractText processor and let it run until all the queued flowfiles are
processed and passed downstream to one of the two funnels.
2.7. Open the queued flowfiles from the matched queue. What do you observe?
2.8. Similarly, view the flowfiles from the unmatched queue. What do you observe?
You will see that all of the flowfiles that originated from Lab2-Good-GenerateFlowFile have
ended up on the matched queue while all from Lab2-Bad-GenerateFlowFile are in the
unmatched queue. This is because of the regular expression we used in the names field of the
Lab2-ExtractText processor properties. The regular expression matches all characters prior to
seeing a colon. All the text entered in Lab2-Good-GenerateFlowFile was in the form of a
key:value pair with a colon between the key and value.
Sample unmatched flowfile. Notice that in the unmatched flowfile, the names attribute has
not been created since there was no match.
2.9. Let's clean up the queues before we finish this lab. Right click on the queue to bring up
the context menu. From there, select Empty queue to delete all the flowfiles that are
queued in the connection. Do this for both the matched and unmatched relationships.
Lab 6: Creating and Using Templates
In this lab, we are going to learn how to create templates from existing data flows. Templates
are a way to save frequently used data flow logic. Many NiFi developers have libraries of templates
that they can reuse to create more complex data flows.
1. Creating a Template
1.1. Setup Lab 4 Processor Group
1.1.1. Add a new processor group to the root canvas and name it Lab 4
1.1.2. Copy the entire dataflow you created in Lab 3 and paste inside the Lab 4
processor group
1.1.3. Make sure there are no running processors and all the queues for all the
relationships are empty
1.2. Remove processors not part of the template
In the previous lab, this processor simply generated 3 random lines of text. We will
enhance this processor slightly so that each flowfile it generates contains a unique UUID,
using the NiFi Expression Language expression below.
${UUID()}
1.5. Save as a Template: We are now ready to save the three processors as a template. This
small dataflow can simulate a streaming data source where actual key:value data is
mixed in with random noise. The flow that we will save should look similar to below:
1.5.1. Select all components including the three processors and two connections.
1.5.2. Select the Create Template icon from the Operate Palette or right click on any of
the selected components (they have to be already selected) to bring up the
context menu. Select Create Template.
1.5.3. The Create Template pop-up window will open. Name the template KV
Datasource and select Create
2. Create new dataflow from Templates
2.1. Create a new dataflow by using the template that we just saved. Select the Template
icon from the components toolbar and click and drag to the canvas
2.2. From the pop-up menu, select the KV Datasource template that you just saved. A copy
of the dataflow will be placed on the canvas. Delete the second copy for now.
3. Exporting Templates
3.1. We can export templates as XML files to be transported to
other NiFi clusters or to simply save permanently as a
physical file. Open the Global Menu on the top right of the
screen.
3.3. A new pop-up window with all the saved templates in the current NiFi cluster will be
displayed. On the far right, there are download and trashcan icons. Use the trashcan
icon to delete any saved templates. Use the download icon to export the selected
template as an XML file.
3.4. Your Firefox browser will prompt you to either save the file or open it. Choose to save the file.
3.5. Open a terminal and navigate to the Downloads directory. List the directory and you
will see the saved XML file. You may copy this file and use it on another NiFi cluster.
cd ~/Downloads
ls
3.6. Open the KV_Datasource.xml file using an editor of your choice. In this example, we
will use the KWrite utility available on the bottom right in the panel. Review the XML
file. Do not make any changes, however.
3.7. Delete the template from the NiFi cluster.
Delete the saved template from the current cluster so that we can import it in the next
step. If a template with the same name already exists in the cluster, you will not be
able to import it. Refer to the instructions in step 3.3.
4. Importing Templates
4.1. Click on the Upload Template icon from the Operate Palette or right click on an empty
area of the canvas. Make sure that nothing is selected. From the context menu, select
Upload template.
4.2. Select the magnifier icon to browse through the files in your directory. Select the
Downloads directory on the left tab and, from there, select the KV_Datasource.xml file
that we just saved in the previous step
4.3. Once the XML file has been selected, click on the upload button to complete the
operation. A success pop-up window will confirm the success of the operation.
4.4. Return to the Global menu to confirm that the template has been imported and is once
again, part of the NiFi cluster.
4.5. Confirm the template by repeating the operations from step 2.1 to make a copy of the
KV_Datasource dataflow on the canvas.
5.1.1. Navigate to the root canvas using the breadcrumb menu on the bottom left
screen
5.1.2. Select all the components using Ctrl-A
5.1.3. Create a template from all the selections as you did in step 1.5.
Lab 7: Using Processor Groups
In this lab, we will combine templates and processor groups to simplify the development
of dataflows. So far, we have only been using processor groups to organize projects into
different directories. Now we will use them to connect smaller logical dataflows into larger
ones.
We will create a dataflow where we will take a datasource, merge the contents,
compress it and save it to local disk.
In essence, we are going to wait until the minimum size of the merged content is
30 MB and then package it in tar format.
1.4. Connect the processors
1.5. Run the dataflow, processor by processor and observe the queue
1.5.1. Run the Lab5-GenerateFlowFile for about 30 seconds until there are flowfiles
queued downstream. Stop the processor and then open the queue and inspect
the flowfiles. Take note of the file size.
1.5.2. Run the MergeContent for about 15 seconds until there are flowfiles queued
downstream. Stop the processor and then open the queue and inspect the
flowfiles. Observe the file size. From the attributes tab, look at the new
attributes that have been created by this processor. What do you think those
attributes mean?
Let's look at some of the more interesting attributes. The merge.count attribute
shows how many flowfiles have been combined together. If you stopped the
processor at 15 seconds, you will have about 8 flowfiles; your results may differ
slightly and may show more or fewer. Since we had set the Lab5-GenerateFlowFile
processor to send 5MB flowfiles, we should expect the new flowfile sent
downstream to be approximately 5MB x 8 = 40MB. The actual size is 39.07MB. Once
again, this size will depend on merge.count; if your count is 8, it should be
similar, though your results may differ slightly. The merge.reason states that the
minimum threshold was reached. Go back to the MergeContent properties tab and
confirm that we have set the Minimum Group Size to 30 MB. There is also a maximum
threshold that can be set. Another way to trigger the processor to bundle and send
forward is the bin age: if a certain amount of time has passed (Max Bin Age), the
processor will pack all the flowfiles collected so far and send them downstream.
1.5.3. Run the CompressContent processor until there are flowfiles queued downstream.
Stop the processor. First take a look at the 5 min statistics. It shows 1 in, a
Read/Write of 39.07MB/32.28MB, and 1 out. This indicates that this processor
processed 1 flowfile coming in and sent out 1 flowfile, and the amount of
Read/Write I/O needed to complete this is as shown. Since the output file was 32.28MB,
we should expect the flowfile in the success relationship connection queue to be of
that size; the gzip compressor was able to reduce the size slightly.
1.5.4. Open the queue and inspect the flowfile. Notice that the size of the file matches the
output shown in the 5 min statistics above. Also, from the attributes tab, the
mime.type has changed from tar to gzip because we have compressed the content using gzip.
1.5.5. Run the PutFile processor until the queue is cleared. Make sure that there was
no failure by ensuring that the failure queue to the funnel remains empty. If
there is a failure, go back to step 1.3.9 and check the setting.
1.5.6. Open up a new terminal and navigate to the /tmp/nifi/lab5 directory. Get a
listing of the directory
cd /tmp/nifi/lab5
ls
1.5.7. Untar the file and list the directory again. You should see the same number of
files as merge.count from above step 1.5.2. In our current case, that was 8.
1.6. Use more (less) to inspect the contents of any one of the files.
more fe3b610e-abf7-467e-8db2-70842d52fe00
2. Create data flows as Processor Groups
The Merge-Compress-Save portion of our data flow seems like something we can often use.
We will now create a process group from it and use it as part of a dataflow.
2.1. Make sure that all the processors are stopped and all the queues are empty
2.3. Move the Lab5-GenerateFlowFile out of the way so that we can easily select the rest of
the processors and connections.
2.4.1. Select all of the components except the Lab5-GenerateFlowFile and right click
inside any of the selected components to get the context menu. Choose the
Group option.
Alternatively select all the desired components and click the Group icon from
the Operate Palette.
2.4.3. Click on the Add to create the processor group. All the selected components will
now be inserted into the newly created processor group.
2.4.5. Notice that there is now a yellow warning on the MergeContent processor.
Hover the mouse over it to read the warning message.
2.4.6. The MergeContent requires an upstream processor but currently does not have
one. It used to have the Lab5-GenerateFlowfile processor connected upstream
but we have disconnected it. Add an Input Port to the processor group by
selecting it from the components toolbar. Name the port, Get-Datasource
2.4.7. Connect the Get-Datasource input port to the MergeContent processor. Your
dataflow should look similar to below.
2.4.8. Use the breadcrumb menu on the bottom left to move out of the Merge-
Compress-Save processor group and up to the Lab 5 processor group.
2.4.10. Add the connection. Your dataflow should now look similar to below.
2.4.11. From a terminal, navigate back to /tmp/nifi/lab5. We will be testing the new
dataflow. In order to make sure we are adding new files, remove all the files in
the directory.
cd /tmp/nifi/lab5
rm -f *
ls
2.4.12. Return to the Lab 5 canvas and run all processors including the processor group.
Let the dataflow run for about a minute and then stop it. You should see activity
in the 5 minute statistics for both the Lab5-GenerateFlowFile processor and the
Merge-Compress-Save processor group.
2.4.13. Return to the terminal to check that new files have been created.
Lab 8: Setting Back Pressure on your Connections
In this lab, we will explore back pressure. Sometimes a processor may experience delays,
causing the queue upstream of it to accumulate. If the queue gets too big, we can set up and
configure back pressure so that the upstream processor stops sending new flowfiles
downstream. This in turn can cause a recursive back pressure to activate. In other words, since
the upstream processor has stopped processing flowfiles, its own upstream queue will, in turn,
begin to accumulate. This accumulation can trigger another back pressure policy, causing the
processor two hops further upstream to stop processing as well.
1.2. From Lab 5, we created a very useful dataflow. The dataflow was made into a processor
group named Merge-Compress-Save. Merging and compressing content and then
saving it is a scenario that is often encountered. Go back to Lab 5 and create a template
from that processor group. Name the template Merge-Compress-Save.
1.3. Return to the Lab 6 canvas and create a dataflow from the Merge-Compress-Save
template
1.4. We had saved another template that generated Key:Value pairs for us in Lab 4. Make
another dataflow from the KV Datasource template. Your canvas should look similar to
below.
1.5. Processor groups make building dataflows much easier. Now that we know how to
make processor groups, take the three processors from the KV Datasource template
and create a new processor group. Name it KV Datasource. Your dataflow should look
similar to below.
1.8. Configure the backpressure settings
1.8.2. From the Settings tab, change the Back Pressure Object Threshold value to 10
1.8.3. Right click on the connection between SplitEachLine and KeyValueGenFlowFile
to bring up the context menu. Select Configure.
1.8.4. Set the Back Pressure Object Threshold value to 30. This is the maximum
number of flowfiles that can be queued before back pressure will be triggered.
how much of the buffer is filled before back pressure will be applied. Green
is 0-60%, Yellow is 61-85%, and Red is 86-100%.
2.1.2. Stop all the processors and empty all the queues.
2.1.3. This time, start both the KeyValueGenFlowFile processor and the SplitEachLine
processor. Observe both queues.
Notice that there seems to be another threshold marker to the right of the current one.
The current marker indicates the threshold for the number of flowfiles. The as-yet-inactive
marker on the right shows the size threshold.
2.2.1. Stop all the processors and empty all the queues.
2.2.2. Configure the "splits" connection and change the Size Threshold to 1000B.
2.2.3. Configure the "success" connection and change the Size Threshold to 3000B.
2.2.4. Restart the KeyValueGenFlowFile and SplitEachLine processors and observe
the queues again.
What is happening? Can you deduce how the backpressure configuration policy works
based on the observations?
INTENTIONALLY LEFT BLANK
The backpressure policy has two settings: a total number of queued flowfiles
setting and a total size of queued flowfiles setting. The colored
markers are for the number of flowfiles on the left and the size of flowfiles on the right.
The backpressure policy triggers whenever either one of these thresholds is
reached. Notice that in the success queue, the size threshold is reached before the number
of flowfiles.
3.2. Stop all the processors and empty all the queues.
3.5. Start both processor groups. Observe the queues. Notice that the queues in the KV
Datasource keep getting full and the backpressure policy triggers. Also note that the
size of the output files is significantly smaller than before. What is happening? Think
about the answers before you turn the page.
We have set up a MergeContent processor as the receiving processor in the Merge-
Compress-Save processor group. This processor tries to merge several flowfiles
until criteria are met to trigger it. In the previous lab, we had on purpose set things
up so that it would trigger after collecting about 30 MB of data. However, now
that the data source is coming from KV Datasource, the flowfiles are significantly
smaller. So, instead of triggering on the Minimum Group Size, it is triggering on
the Maximum Number of Entries of 1000 flowfiles. This is causing the output files
to be significantly smaller. It is also causing the upstream processors in KV Datasource
to trigger their backpressure policies, because the flowfiles are piling up waiting for
their content to be merged.
Lab 9: Working with Hadoop in NiFi
In this lab, we will modify the Merge-Compress-Save processor group and template. So far, we
have only been saving to the local Linux drive. Now, we will add HDFS as an additional
destination.
1.2. Copy the Merge-Compress-Save processor group by adding it to the canvas from the
template
1.6.1. Hadoop Configuration Resources: This is the location where the Hadoop
configuration XML files are stored.
Open a terminal and log in as hadoop. Enter the following commands to navigate
to the correct directory. We will use the bash terminal's ability to autocomplete
directory names as we navigate: when you press Tab, bash will
autocomplete the rest of the name.
su - hadoop
cd ~/
cd had <and then hit tab>
ls
cd et <and then hit tab>
ls
cd ha <and then hit tab>
ls
pwd
1.6.2. Hadoop Configuration Resources: Enter the following two entries in a single line
with a comma between them:
/home/hadoop/hadoop/etc/hadoop/core-site.xml,/home/hadoop/hadoop/etc/hadoop/hdfs-site.xml
1.7.2. Connect the PutHDFS failure relationship to the funnel (if you have accidentally
deleted the funnel, just add a new one)
This has happened because NiFi is currently running as the Linux root user.
While the root user has complete control over all of the local directories and
files, this is not the case for the HDFS filesystem. The ERROR occurs because
the root user is trying to save to an HDFS directory that belongs to the user student.
2.3. Fix PutHDFS processor by setting root user as super user for HDFS
su
groupadd supergroup
usermod -aG supergroup root
groups root
hdfs dfsadmin -refreshUserToGroupsMappings
2.4. Restart the dataflow and check that the problem has been fixed. Inspect the 5 min
statistics of the processors, and of PutHDFS in particular to see if it was able to
successfully Read and more importantly Write.
2.5. Get a listing of the destination directory to check if files have been saved.
2.6. Stop all the processors and clear any queued flowfiles.
Lab 10: Creating a Kafka Topic, Producer, and Consumer
In this lab, you use the Kafka command-line tools to create a topic. We then use Kafka to
create producers and consumers and pass data through them.
In order to reduce the load on our virtual machine, we shall stop HBase and run only Apache
Kafka.
sudo stop-hbase.sh
1.3. Make sure that both ZooKeeper and Kafka are running. If not, repeat the step above.
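Note: if ZooKeeper and Kafka are installed as systemd services (the service names here are
assumptions), you can check them with:
$ sudo systemctl status zookeeper
$ sudo systemctl status kafka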
2. Creating a Kafka Topic
Create a Kafka topic named topic1_logs that will contain messages representing lines in log
files.
2.1.1. Execute the following command from a terminal to create the topic1_logs topic:
$ kafka-topics --create \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--topic topic1_logs
Note: If you previously worked on a lab that used Kafka, you may get an error here indicating
that this topic already exists. You may disregard the error.
2.1.2. Use the --list option to display all Kafka topics and confirm that the new topic you
just created is listed:
$ kafka-topics --list \
--bootstrap-server localhost:9092
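Note: optionally, you can also inspect the partition and replication settings of the new topic:
$ kafka-topics --describe \
--bootstrap-server localhost:9092 \
--topic topic1_logs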
3.1. Open 2 terminals and create the producer on one and the consumer on the other.
3.1.1. From the first terminal use kafka-console-producer command to start the
producer.
$kafka-console-producer \
--broker-list localhost:9092 \
--topic topic1_logs
Notice that the kafka-console-producer is waiting for text to be typed in. Text that
is typed here will become a message in the Kafka topic topic1_logs.
3.1.2. From the second terminal, use kafka-console-consumer to create a consumer for
the topic
kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic topic1_logs \
--from-beginning
Notice that the kafka-console-consumer is waiting for messages to arrive at
kafka topic topic1_logs.
3.2. Rename the terminals.
3.2.1. From the producer terminal, select Edit > Rename Tab and change the terminal
tab name to Producer.
3.2.2. Do the same from the consumer terminal, naming it Consumer.
3.3.1. Begin typing something from the Producer terminal where the producer is
running
3.3.2. Observe that in Consumer terminal, the consumer will pull messages that have
been pushed by the producer.
3.4. Create messages in batch mode.
3.4.1. From the Producer terminal, stop the Producer by sending a Ctrl-C keyboard
interrupt signal
3.4.2. Send the entire contents of Alice-in-Wonderland.txt file to the topic1_logs topic
cat ~/Data/alice_in_wonderland.txt | \
kafka-console-producer \
--broker-list localhost:9092 \
--topic topic1_logs
What happened? It went very fast; I hope you didn't blink. The entire content
of the book "Alice in Wonderland" was passed as messages by the Producer and
then picked up by the Consumer.
3.5. Clean up
3.5.1. Stop producer and consumer as necessary using Ctrl-C kill signal.
Lab 11: Sending Messages from Flume to Kafka
As we have learned, creating Kafka Producers and Consumers requires programming them
using the Kafka Producer and Consumer APIs. In other words, Kafka is not a plug-and-play end
user tool; rather, it requires enterprises to develop code. Today, there are many publicly
available producers and consumers that organizations can utilize, so the need for development
has decreased dramatically.
During the early days, using Flume in conjunction with Kafka was proposed. Flume supports
an extensive number of sources and sinks. The idea was to develop a Kafka source and a Kafka
sink for Flume. Then, they could be combined with existing Flume sources and sinks to
complete the dataflow pipeline. For example, a Flume dataflow that collects data from Web
logs through a NetCat source, Spool Directory source or Syslog source could then designate
Kafka as the sink. This would effectively create a Kafka Producer. Another Flume dataflow
with a Kafka source to read the collected Web logs and an HDFS sink to store them would
effectively act as a Kafka consumer of the Web logs.
In this lab, we create a Kafka producer using Flume with a Kafka sink. It will collect data
from a streaming source that drops files into a spool directory. We will confirm that the Kafka
producer is working as expected by reading the topic with the Kafka console consumer.
In order to reduce the resource demand on our virtual machine, we shall stop HBase and run only
Apache Kafka
sudo stop-hbase.sh
sudo systemctl status kafka
1.3. Make sure that both ZooKeeper and Kafka are running. If not, repeat the step above.
mkdir /home/student/Labs/C3U4/spooldir
2.3. Start with the spooldir-stub.conf stub file provided and complete the Flume
configuration.
2.3.6. Set the memory-channel channel type to memory
2.3.7. Configure the memory-channel capacity to 10000
2.3.8. Configure the maximum number of events per transaction to 100
2.3.9. Refer to the following to configure a Kafka Sink
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#kafka-
sink
2.3.10. Set the kafka-sink sink type to org.apache.flume.sink.kafka.KafkaSink
2.3.11. Set the Kafka topic to stream_text
2.3.12. Set the list of bootstrap servers to localhost:9092
2.3.13. Set the number of messages to process in one batch to 5
2.3.14. Refer to the following to configure a Logger Sink
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#logger-
sink
2.3.15. Set the logger-sink type to logger
2.3.16. Use the memory-channel for streaming-txt-source, logger-sink and kafka-sink.
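Note: a hedged sketch of what the completed spooldir.conf might look like, based only on the
steps above. The source settings from steps 2.3.1-2.3.5 are not reproduced in this text, so the
spooldir source lines below are assumptions (the spool directory path comes from the mkdir
command earlier, and the agent name agent1 comes from the flume-ng command later in this lab).
agent1.sources = streaming-txt-source
agent1.channels = memory-channel
agent1.sinks = kafka-sink logger-sink
# Spooling directory source (steps 2.3.1-2.3.5 assumed)
agent1.sources.streaming-txt-source.type = spooldir
agent1.sources.streaming-txt-source.spoolDir = /home/student/Labs/C3U4/spooldir
agent1.sources.streaming-txt-source.channels = memory-channel
# Memory channel (steps 2.3.6-2.3.8)
agent1.channels.memory-channel.type = memory
agent1.channels.memory-channel.capacity = 10000
agent1.channels.memory-channel.transactionCapacity = 100
# Kafka sink (steps 2.3.9-2.3.13)
agent1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.kafka.topic = stream_text
agent1.sinks.kafka-sink.kafka.bootstrap.servers = localhost:9092
agent1.sinks.kafka-sink.flumeBatchSize = 5
agent1.sinks.kafka-sink.channel = memory-channel
# Logger sink (steps 2.3.14-2.3.16)
agent1.sinks.logger-sink.type = logger
agent1.sinks.logger-sink.channel = memory-channel
Note that with both sinks drawing from the same channel, each event goes to one of the two
sinks rather than both; the lab text specifies this single-channel layout, so it is reproduced
here as described.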
3.1. Open a new terminal and name it Kafka Consumer, as we did when renaming terminals
in the previous lab.
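Note: in this terminal you will run a console consumer on the stream_text topic configured
above; a sketch following the same pattern as in Lab 10:
$ kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic stream_text \
--from-beginning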
4.1.1. Open a new terminal and change its name to Flafka Producer, as we did when
renaming terminals in the previous lab.
4.1.2. Start the Flume dataflow that we created above
$ flume-ng agent \
--conf-file /home/student/Labs/C3U4/spooldir.conf \
--name agent1 -Dflume.root.logger=INFO,console
rm -rf ./spool
4.2.4. Execute the spool_stream.py python program. This command takes 5000
characters from alice_in_wonderland.txt, creates a temporary file to stage it, and
then moves the file to the spool directory. It produces one file every 5 seconds
(hard-coded).
4.2.5. The output from the three terminals should look similar to the above. In the Flafka
Producer terminal, the logger displays the files from the spool directory that it is
sending to the Kafka topic. The Kafka Consumer screen will update every 5 seconds
with new content.
Note: There is a chance that the Flafka Producer may fail while running. This is
due to the copy/move script taking too much time. You may ignore this.
4.2.6. From the Streaming Source terminal, press Ctrl+C to stop the streaming
simulator.
4.2.7. Stop the Flafka producer.
4.2.8. Stop the Kafka consumer.
END OF LAB