
SIC - Big Data - Chapter 4 - Workbook


Chapter 4.

Big Data Storage


Exercise Workbook

Contents
Lab 1: AWS S3 Storage........................................................................................................................ 3
Lab 2: AWS S3 Glacier Storage.........................................................................................................21
Lab 3: Data Access with HBase........................................................................................................ 35
Lab 4: Data Access Using DML Commands..............................................................42
Lab 5: Working with HBase...............................................................................................................47
Lab 6: Data Access with Cassandra.................................................................................................49
Lab 7: Working with Cassandra....................................................................................................... 62

Lab 1: AWS S3 Storage
In this lab, you will create an AWS account and explore the S3 service. Then you will access
and manage S3 storage using the AWS CLI and Cyberduck.

1. Create an AWS Account

1.1 Create a free tier AWS account from https://aws.amazon.com/console/

1.1.1 Creating a free tier account will require a credit card. However, the services
that we will be using will all be within free tier eligibility.

1.2 Configure your free account

1.2.1 Change the AWS region to a location nearest to you

1.2.2 Change the language to English or a language option of your choice. This
option is on the bottom left of your browser.

2. Explore the AWS S3 Service

2.1 Open the Amazon S3 console at https://s3.console.aws.amazon.com/

2.1.1 You can get to any Amazon AWS service in several ways.
Use the search bar on top and search for the service

2.1.2 Click on the Service icon next to the AWS logo to bring up a menu of all
services

2.1.3 Directly using link: https://s3.console.aws.amazon.com/

2.2 Create a new s3 bucket

2.2.1 Click on the create bucket icon

2.2.2 Enter a unique DNS compliant name for your bucket


 Be unique across all of Amazon S3
 Between 3 and 63 characters

 No uppercase
 Start with lowercase letter or number

2.2.3 Choose a region close to you

2.2.4 Keep the checkmark on Block all public access. By default, Amazon S3
blocks public access to S3 buckets. You can modify this later, even after the
bucket has been created

2.2.5 Keep Bucket Versioning set to Disable. If you enable this feature, S3
will keep multiple variants of an object in the bucket, which can help
protect against accidental deletes.

2.2.6 Keep Default encryption set to Disable. You can enable this feature to encrypt
your data at rest.

2.2.7 Click the Create bucket button to create your bucket.

2.3 Uploading objects to a bucket

2.3.1 Go to the Amazon S3 console and select the bucket to which you wish to
upload an object from the list of buckets.

2.3.2 From the Objects tab, select the Upload icon.

2.3.3 You can choose to upload files or folders. We will upload a file for now.
Choose Add Files icon.

2.3.4 Select the seoul.jpeg file provided to you by your instructor.

2.3.5 Click on the Upload button on the bottom of the screen to complete. When
the upload is complete, you will get a success screen below.

2.4 Downloading an object from an S3 bucket to a local directory

2.4.1 Go to the Amazon S3 console and select the bucket you wish to download from
in the list of buckets

2.4.2 From the Objects tab, select an object that you would like to download. For
now, we will select the seoul.jpeg file that was uploaded.

2.4.3 From the object’s Properties tab, review the information for this object

2.4.4 Click on the Download icon on the top to download this object. The object
will be downloaded to your download directory on your local system.

2.4.5 Confirm that the file has been downloaded

2.5 Create folders in an S3 bucket

2.5.1 Go to the Amazon S3 console and select the bucket in which you wish to create
a folder from the list of buckets

2.5.2 Choose the Create Folder button to create a new folder

2.5.3 Enter a name for your folder

2.5.4 Keep the default values for the rest

2.5.5 Click on the Create Folder button at the bottom to create your new folder. You
should now have a new folder and the seoul.jpeg file you uploaded earlier
inside your bucket.

2.6 Copy objects in an S3 bucket

2.6.1 Go to the S3 bucket or folder that contains the object you want to copy. For
us, this will be the bucket that we first created and where we uploaded the
seoul.jpeg file

2.6.2 Select the object that you want to perform an action on

2.6.3 From the Actions menu, choose Copy

2.6.4 We now have to provide a destination to copy. We can do this in two ways.
One is to give the bucket or bucket/folder address. The other is to use the
Browse S3 option to open a GUI window and navigate to the desired bucket
or folder. If you choose the GUI method, the Destination address will be
automatically filled in for you

2.6.5 As we copy the object, it is possible to change some of the parameters
associated with the object. Leave everything the same and click on the Copy
button on the bottom of the screen.

2.7 Accessing S3 objects externally


Recall that in the bucket we have already created, public access is still
blocked.

2.7.1 Return to the top level of your s3 account

2.7.2 Create a new bucket but this time, allow public access to it by unchecking the
Block all public access. You must acknowledge that this might expose your
object to the public. Make sure that you provide a globally unique name for
the bucket.

2.7.3 Leave everything as is and click on Create bucket to create the bucket.

2.7.4 Copy the seoul.jpeg file that you uploaded in step 2.3 to the newly created
bucket from above.

2.7.5 Go to the bucket to which you copied the file.

2.7.6 Although the bucket you copied to allows public access, you still need to grant
public access on the object itself; it does not have public access yet.
Click on the object and go to the Permissions tab.

2.7.7 Click on Edit button to modify the access control list

2.7.8 Allow read access to all persons (public access). You must also acknowledge
that you understand the consequences of this action.

2.7.9 Save the changes using the Save changes button.

2.7.10 From the Properties tab of the object, find the Object URL and click to test
public access

2.8 Delete an object in your S3 bucket

You incur charges while you have objects stored in S3. We will delete all the
buckets and all of their contents.

2.8.1 Go to your bucket list and select the bucket you would like to delete

2.8.2 You cannot delete a bucket unless it is empty.

2.8.3 Click on Empty button on the top menu to delete all objects and folders from
inside the bucket.

2.8.4 As a caution, you will have to type permanently delete in order to empty
the bucket. Finalize the operation by clicking on Empty.

2.8.5 Now that the bucket is empty, select it again from your bucket list and select
Delete from the top buttons. As a caution, you will have to enter the name of
the bucket in order to delete it. The Delete bucket button will not be enabled
until you correctly enter the name of the bucket.

2.8.6 In the same manner, delete the remaining bucket as well. When completed,
you should not have any buckets remaining in the s3 account.

2.9 Working with lifecycle settings for S3 buckets

AWS S3 has several storage classes with different performance and cost
characteristics. In simple terms, users can select different storage classes
depending on whether the data is hot, warm, or cold.

You can use lifecycle rules to define what you want Amazon S3 to do during an
object's lifecycle, such as moving an object to a different storage class, retaining
the object, or deleting the object after a specified period of time has elapsed.

Before setting up your lifecycle configuration, keep the following in mind:

Propagation delay

When you add an S3 lifecycle configuration to a bucket, there is usually some
latency until the new or updated lifecycle configuration is fully propagated to all
Amazon S3 systems. You will need to wait a few minutes for the configuration
to fully take effect. This delay can also occur when deleting the S3 lifecycle
configuration.

Disable or Delete Lifecycle Rules

If you disable or delete a lifecycle rule, there is a small delay before Amazon S3
stops scheduling deletion or transition for new objects. Any objects that were
already scheduled are unscheduled and will not be deleted or transitioned.

Existing and new objects

When you add a lifecycle configuration to a bucket, the configuration rules apply
to existing objects as well as objects you add later. For example, if you add a
lifecycle configuration rule today that includes an expiration action that expires
objects with a specific prefix 30 days after creation, Amazon S3 queues all
existing objects older than 30 days for deletion.

2.9.1 Make a new bucket and name it <some prefix>-lifecycle-archive. You have to
add some prefix to the beginning of the name to ensure that the bucket name is
globally unique. You may choose to make it public or private.

2.9.2 Select the bucket and open the Management tab. From there select Create
lifecycle rule.

2.9.3 In Lifecycle rule name, enter a name for the rule. However, make sure it has a
unique name within the bucket.

2.9.4 Select the scope of the lifecycle rule

To apply this lifecycle rule to all objects with a specific prefix or tag, choose
Limit the scope of this rule using one or more filters.

From the Filter Type:

To limit the range by prefix, enter a prefix in Prefix.

Objects in s3 can be given a tag. You enter a key and value pair to create the
tag. To limit the scope to tags, choose Add tag and enter the tag key and
value to which this rule should apply

To apply this lifecycle rule to all objects in the bucket, select This rule applies
to all objects in the bucket and confirm that this rule applies to all objects in
the bucket (I acknowledge that this rule applies to all objects in the bucket)

2.9.5 Apply this rule to all objects in the bucket

2.9.6 Set the Lifecycle rule actions. You can select more than one action. Some of the
options only apply if versioning is enabled and there are previous versions.
In this lab, we will set rules for current versions, since we are not versioning
and there are no previous versions. Select Transition current versions of
objects between storage classes.

2.9.7 When the above option is selected, a new section will appear for you to enter
the transition policy. Here you enter the Days after object creation and the
Storage class transitions. To view the costs and storage classes, refer to
https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-transition-general-considerations.html#glacier-pricing-considerations

2.9.8 Choose Glacier storage class and set to 1 day after object creation. You must
also acknowledge your understanding of this rule

2.9.9 Click the Create rule button on the bottom

2.9.10 Upload the seoul.jpeg file to the <prefix>-lifecycle-archive bucket. After more
than a day has passed, come back to check that the file has been moved to
Glacier storage.
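If you prefer to script this instead of using the console, a roughly equivalent rule can be applied with the aws s3api command. This is only a sketch: the bucket name and rule ID below are placeholders, and the JSON follows the standard S3 lifecycle configuration format.

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-after-1-day",
      "Filter": {},
      "Status": "Enabled",
      "Transitions": [
        { "Days": 1, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF
# apply the rule to the bucket created in step 2.9.1 (replace the bucket name with yours)
aws s3api put-bucket-lifecycle-configuration \
    --bucket <prefix>-lifecycle-archive \
    --lifecycle-configuration file://lifecycle.json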

3. Using AWS CLI to access and manage S3 storage

3.1 Add Git Bash (Windows environment only)

Git Bash is a nice Windows terminal program that provides many of the basic
functionalities of a Linux terminal. Many of the commands we will execute below
can be easily run from Git Bash without installing further programs. To
install Git Bash, go to https://git-scm.com/downloads and download the Windows
64-bit version. If your Windows environment is 32-bit, please upgrade your
operating system.

You will use Git Bash as the terminal program for the rest of the lab.

3.2 Add AWS CLI for Windows environment

3.2.1 Install AWS CLI version 2 on Windows from
https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html

3.2.2 Follow the instructions. The recommended method is to install the latest
version from https://awscli.amazonaws.com/AWSCLIV2.msi

3.3 Add AWS CLI for the macOS environment

3.3.1 Install AWS CLI version 2 for macOS from
https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-mac.html

3.3.2 Follow the instructions. The recommended method is to install the latest
version from https://awscli.amazonaws.com/AWSCLIV2.pkg

3.4 Verify that AWS CLI was properly installed by opening a terminal (Git Bash for
Windows or a terminal in macOS; both environments give the same results, so the
lab instructions limit screenshots to Git Bash for brevity) and executing
the following:

aws --version

3.5 Set up your AWS CLI configuration

In order to execute commands from the AWS CLI, it will need to be configured with
the proper authorizations to access your AWS account.

3.5.1 From the AWS console, click on the account name and select My Security
Credentials from the drop-down menu

3.5.2 From the new page, select Access Keys and click on the Create New Access
Key button

3.5.3 From the pop-up window, select Download Key File

3.5.4 Navigate to the download directory and open the downloaded access key
file. Depending on your environment, it might open as an Excel document or
as a Text document. In either case, there will be an AWS Access Key ID and
AWS Secret Access Key. You will need these two pieces of information in the
next step.

3.5.5 From Git Bash (terminal), run aws configure command and enter the AWS
Access Key ID and AWS Secret Access Key when prompted. For the default
region, enter the region name where you created your buckets in step 2.2.
Finally, enter json for the output format.
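For reference, a typical aws configure session looks roughly like the following. The key values and region shown here are placeholders; use your own Access Key ID, Secret Access Key, and bucket region.

$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: ap-northeast-2
Default output format [None]: json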

3.5.6 Test the connection by issuing the following command to get a listing of your
s3 buckets.

aws s3 ls

3.6 Create a new bucket with AWS CLI

3.6.1 First, get a list of available commands with:

aws s3 help

3.6.2 Create a new bucket and name it <prefix>-cli-bucket. In creating a bucket,
you must use the S3 URI path notation:

aws s3 mb s3://student-cli-bucket

3.6.3 List your buckets to confirm a new bucket has been created.

3.6.4 Now, copy the seoul.jpeg file to the newly created bucket with the following
command. Notice that as we copy, we are also creating a folder named my-folder.

aws s3 cp seoul.jpeg s3://student-cli-bucket/my-folder/seoul.jpeg

3.6.5 Confirm that the file was uploaded

3.6.6 Now add another file of your choice to the my-folder directory in the
<prefix>-cli-bucket bucket. Confirm that you have added the file by listing
the bucket and directory again. Notice that we can change the name of the
file at the destination.
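For example, assuming you have a local file named sales.csv (a hypothetical file name), you could upload it under a different name at the destination and then list the folder to confirm:

aws s3 cp sales.csv s3://student-cli-bucket/my-folder/q1-sales.csv
aws s3 ls s3://student-cli-bucket/my-folder/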

3.6.7 We have already uploaded the seoul.jpeg file previously, so let’s delete it
from the new bucket and folder. Execute the following command:

aws s3 rm s3://student-cli-bucket/my-folder/seoul.jpeg
aws s3 ls s3://student-cli-bucket/my-folder/

4. Using Cyberduck to explore AWS S3


Cyberduck is a popular GUI-based FTP and file explorer. It can connect to
various services including AWS S3.

4.1 Install Cyberduck for your platform from https://cyberduck.io/download/

4.2 Start Cyberduck and select Open Connection

4.3 Select Amazon S3 from the drop down menu and enter your Access Key ID and
Secret Access Key, if necessary, that was downloaded from step 3.5.1. Cyberduck
is smart and will read the AWS CLI setting for the Access Key ID and Secret Access
Key. If it was unable to read this information from your system, you will have to
enter it manually. Connect to your account.

4.4 Open each of the buckets by clicking on the arrow (>) next to the
bucket names. Check that the files we had placed are displayed
properly.

4.5 Because Cyberduck is a GUI-based tool, all of the usage is very
intuitive. Click on the cogwheel to see a list of available actions.

We can create new folders, rename files and folders, duplicate
files and folders, download files from and upload files to S3
buckets, delete files and folders, and much more.

4.6 Rename the my-folder in student-cli-bucket to Q1-Sales.

4.6.1 Click on the my-folder icon.

4.6.2 Select Rename from the cogwheel menu

4.6.3 Change the my-folder name to Q1-Sales and hit enter

4.6.4 You will be prompted whether you want to move the selected files and will be
given the option to Rename. Click on Rename

4.7 Upload a file to the Q1-Sales folder

4.7.1 Select the Q1-Sales folder

4.7.2 Select Upload from the cogwheel menu or from the top menu

4.7.3 Navigate to the directory containing the source file and select it.

4.7.4 Click Choose to begin uploading the file. A new pop up window will display
the progress of the upload.

4.7.5 Notice that Cyberduck displays different icons depending on the file type

4.8 Try renaming the student-cli-bucket. Will Cyberduck allow this operation?

Unfortunately, it will not. Remember, bucket names must be globally unique,
and checking for that uniqueness is beyond the scope of Cyberduck.

4.9 Select the Q2 sales.csv file and drag it into the <prefix>-lifecycle-archive bucket.
Does it work as expected?

Yes, Cyberduck prompts you to verify that you want to move the file. When you
click on the Move button, the file is moved to the destination bucket.

4.10 Now try creating a new folder on your own. Create the folder in student-cli-bucket
bucket.

4.11 Download the Q1 sales.csv file and then delete it from S3.

Lab 2: AWS S3 Glacier Storage
AWS S3 Glacier storage is a cost-effective alternative to S3 standard storage for files that are
cold and not expected to be accessed for a long time.

1. Create an AWS Glacier Vault

1.1 Login to the AWS Management Console and navigate to the Glacier service

1.1.1 You can get to any Amazon AWS service in several ways.
Use the search bar on top and search for the service

1.1.2 Go direct to https://console.aws.amazon.com/glacier/

1.2 Select Create Vault from the splash page to create an S3 Glacier vault

1.2.1 Select an AWS Region close to your current location

1.2.2 Provide a name for the vault

1.2.3 Click Next Step to continue

1.3 Keep the setting to Do not enable notification and click continue

1.4 Check the information and Submit

1.5 Your new vault is listed in the Amazon S3 Glacier Vault page

2. Using AWS CLI to work with S3 Glacier vaults

2.1 Check to see that the Glacier vault was properly created. From Git Bash (terminal)
issue the following command. You must include the hyphen at the end of the
command. This tells AWS CLI to use the account id associated with the
information we entered in step 3.5 as part of configuring AWS CLI.

aws glacier list-vaults --account-id -

This command will return the name of that vault in json format.

2.2 Uploading a large file using multi-upload to S3 Glacier using AWS CLI

2.2.1 Create a S3 Glacier vault with the create-vault command

aws glacier create-vault --account-id - --vault-name tempvault

2.2.2 Create a multipart upload to S3 Glacier with the initiate-multipart-upload
command

aws glacier initiate-multipart-upload --account-id - \
    --archive-description "multipart upload test" \
    --part-size 1048576 \
    --vault-name tempvault

This command initiates a multipart upload and requires the size of each part in bytes.
Our chunks are 1 MiB each. We also have to give it a vault name and our account-id
information. We will need the uploadId that is returned after running
this command.
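The command returns JSON similar to the following (the values shown are placeholders); the uploadId field is what you will need in the next step:

{
    "location": "/111122223333/vaults/tempvault/multipart-uploads/EXAMPLEUploadId",
    "uploadId": "EXAMPLEUploadId"
}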

2.3 Setup an environmental variable named UPLOADID and set it to the uploadId
information returned above. Be careful that you don’t introduce any extra
characters or space within the command and especially the uploadId string.

UPLOADID="<your upload id from above step>"

2.4 Use the upload-multipart-part command to upload each part. These commands
can be run in parallel for faster throughput.

2.4.1 Make sure that the 3 chunk files (chunkaa, chunkab, chunkac) are in the current
directory. If they are not, either copy them to the current directory or move
to the directory that has the 3 files.
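If you were given a single large file instead of pre-split chunks, you could produce the 1 MiB parts named chunkaa, chunkab, chunkac yourself. This is only a sketch; largefile.bin is a placeholder name for a 3 MiB source file.

# create a 3 MiB test file (only if you have no source file of your own)
dd if=/dev/urandom of=largefile.bin bs=1048576 count=3

# split it into 1 MiB parts; split names them chunkaa, chunkab, chunkac by default
split -b 1048576 largefile.bin chunk
ls chunk*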

2.4.2 Execute the following command:

aws glacier upload-multipart-part --body chunkaa \
    --range 'bytes 0-1048575/*' \
    --account-id - \
    --vault-name tempvault \
    --upload-id $UPLOADID

aws glacier upload-multipart-part --body chunkab \
    --range 'bytes 1048576-2097151/*' \
    --account-id - \
    --vault-name tempvault \
    --upload-id $UPLOADID

aws glacier upload-multipart-part --body chunkac \
--range 'bytes 2097152-3145727/*' \
--account-id - \
--vault-name tempvault \
--upload-id $UPLOADID

You will get 3 results similar to:

2.5 Checking for proper transmission

Amazon S3 Glacier requires a tree hash of the original file to ensure that all
uploaded fragments have reached AWS unchanged.

To compute the tree hash, split the file into 1 MiB parts and compute the binary
SHA-256 hash of each piece. Then split the hash list into pairs, concatenate the
two binary hashes in each pair, and hash the result. Repeat this process
until only one hash remains. If there is an odd hash at any level, it is promoted to
the next level without modification.

The key to correctly calculating tree hashes when using command line utilities is
to store each hash in binary format and only convert it to hexadecimal in the last
step. Combining or hashing the hexadecimal versions of the hashes can
produce incorrect results.

2.5.1 To calculate the tree hash, compute and store a binary SHA-256 hash of each chunk:

openssl dgst -sha256 -binary chunkaa > hash1
openssl dgst -sha256 -binary chunkab > hash2
openssl dgst -sha256 -binary chunkac > hash3

2.5.2 Combine the first 2 hashes and get the binary hash of the result.

cat hash1 hash2 > hash12
openssl dgst -sha256 -binary hash12 > hash12hash

2.5.3 Combine the parent hash of chunks aa and ab with the hash of chunk ac
and hash the result again. This time, produce the hexadecimal digest.

cat hash12hash hash3 > hash123
openssl dgst -sha256 hash123

2.5.4 Store the result in a shell variable.

TREEHASH=<hexadecimal number produced above>
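Instead of copying the hexadecimal string by hand, you can capture it directly into the variable. This is a sketch; it assumes the hash123 file from the previous step exists and that awk is available (it is in Git Bash).

# the last field of the openssl output is the hexadecimal digest
TREEHASH=$(openssl dgst -sha256 hash123 | awk '{print $NF}')
echo $TREEHASH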

2.5.5 Finally, use the complete-multipart-upload command to complete the
upload. This command takes the size (in bytes) of the original file, the final
tree hash value (in hexadecimal), and the account ID and vault name.

aws glacier complete-multipart-upload \
    --checksum $TREEHASH \
    --archive-size 3145728 \
    --upload-id $UPLOADID \
    --account-id - \
    --vault-name tempvault

2.5.6 You can also check the vault status using the describe-vault command.

aws glacier describe-vault \
    --account-id - \
    --vault-name tempvault

Vault status is updated about once per day. You may not be able to see the files you
have uploaded yet.

2.6 Downloading a Vault Inventory

2.6.1 Initiate an inventory retrieval job using the initiate-job command.

aws glacier initiate-job \
--vault-name tempvault \
--account-id - \
--job-parameters '{"Type": "inventory-retrieval"}'
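This returns JSON similar to the following (the values shown are placeholders):

{
    "location": "/111122223333/vaults/tempvault/jobs/EXAMPLEJobId",
    "jobId": "EXAMPLEJobId"
}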

2.6.2 Save the jobId from above in an environmental variable

JOBID=<jobId from above>

2.6.3 Use the describe-job command to check the status of the retrieval job

aws glacier describe-job \
    --vault-name tempvault \
    --account-id - \
    --job-id $JOBID

2.6.4 Wait for the download operation to complete. You can rerun the above
command to keep checking on the completion status.
Job IDs don't expire for at least 24 hours after S3 Glacier completes the job

2.6.5 When finished, use the get-job-output command to download the job output into
the output.json file.

aws glacier get-job-output \
    --vault-name tempvault \
    --account-id - \
    --job-id $JOBID output.json
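The inventory is plain JSON, so you can inspect it directly, for example (the second command assumes Python 3 is installed, which pretty-prints the file):

cat output.json
python3 -m json.tool output.json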

3. Configuring Vault Notifications Using the Amazon S3 Glacier Console

3.1 From the AWS Management Console, navigate to the S3 Glacier console

3.1.1 Navigate to https://console.aws.amazon.com/glacier

3.2 We have been working with the tempvault in the previous steps. Click and select
the tempvault. This will open up the following screen:

3.3 Select the Notifications tab from the top to configure the notification

3.3.1 Enable the Notification

3.3.2 Select create a new SNS topic

3.3.3 Enter a topic name and display name and click Create topic

3.3.4 Select the notification job type. Select both options.

3.3.5 Finally, Save the notification job

3.4 Create a topic subscription

3.4.1 Navigate to the Amazon Simple Notification Service console by searching for
it from the AWS Management Console

3.4.2 Select Subscription from the tab menu on the left

3.4.3 Click Create subscription

3.4.4 Select the Topic ARN that we created in the previous step

3.4.5 Select Email as the Protocol

3.4.6 Enter an email to receive notification in the Endpoint section

3.4.7 Click Create subscription on the bottom right

3.4.8 A Details page will display the details of the subscription. Select the topic
that we created in the previous section.

3.4.9 You will see that there is now a pending subscription waiting for
confirmation. Check the email address entered in the subscription. There
should be an email requesting confirmation.

3.4.10 Confirm the subscription. A new pop-up window will confirm that you are
now subscribed.

4. Deleting AWS S3 Glacier Vault

4.1 Using AWS CLI to delete a vault

4.1.1 Use the delete-vault command to remove a vault that does not have any
archives

4.1.2 Delete the student-vault that we created in step 1.2.

aws glacier delete-vault \
    --vault-name student-vault \
    --account-id -

The student-vault is now deleted, however, we will keep the tempvault until all the pending
jobs are completed. Since we have subscribed to a notification, wait for the confirmation
notification that the job is complete.

4.2 Using the AWS Glacier console to delete a vault

4.2.1 Create a vault using AWS CLI so that we can delete it from the console

aws glacier create-vault \
    --account-id - \
    --vault-name to-delete-vault

4.2.2 Return to the console and refresh the screen to view all your vaults.

4.2.3 Select the to-delete-vault. This will enable the Delete Vault option on the
top menu

4.2.4 Click Delete Vault to delete the selected vault. Confirm the deletion.

Note: If the vault is not empty, you must first delete all archives. In order to delete the archives,
use the AWS CLI to start an inventory-retrieval job, which gives you a list of archive IDs. Then
delete each archive using aws glacier delete-archive.

Lab 3: Data Access with HBase
In this lab, you will run basic operations to create, delete, and alter a table in the HBase
shell, and you will use the shell to put and get data in HBase.

1. Start with HBase Shell

1.1 Run the HBase shell. Then execute the help command to view the basic usage
information for the HBase shell.

$hbase shell
hbase(main):001:0> help

Note: The HBase shell prompt ends with a “>” character.

1.2 Display the version and status for basic usage

hbase(main):001:0> version
hbase(main):001:0> status

You can check the HBase version (2.3.5) and that it is running standalone (1 active master, 1 server).

1.3 Use the create command to create a new table. You must specify the table name
and the column family name.

1.3.1 Table name: tbl_authors, Column Family: cf1

hbase(main):001:0> create 'tbl_authors', 'cf1'

Note: Table names, row keys, and column names must all be enclosed in quotation marks.

1.4 List tables to verify that tbl_authors was created, using the list command. You can
either run list alone or pass the table name to list.

hbase(main):002:0> list
hbase(main):003:0> list 'tbl_authors'

1.5 Use the describe command to see details, including configuration defaults.

hbase(main):001:0> describe 'tbl_authors'
hbase(main):002:0> desc 'tbl_authors'

1.6 Delete the table you just created, then list all tables to verify that the table was
successfully deleted

hbase(main):001:0> drop 'tbl_authors'

Note: Table tbl_authors is enabled now. Disable it first to delete it.

1.7 Disable the tbl_authors table, then drop it.

hbase(main):003:0> disable 'tbl_authors'
hbase(main):004:0> drop 'tbl_authors'
hbase(main):005:0> list

1.8 Create a test table, and use the put command. Here, we insert three values, one at
a time.

1.8.1 The input data is in the following format.

1.8.2 Table name: temp, Column Family: cf1


Rowkey   cf1:a   cf1:b   cf1:c
rk1      A
rk2              B
rk3                      C

hbase(main):003:0> create 'temp', 'cf1'


hbase(main):004:0> put 'temp', 'rk1', 'cf1:a', 'A'
hbase(main):005:0> put 'temp', 'rk2', 'cf1:b', 'B'
hbase(main):006:0> put 'temp', 'rk3', 'cf1:c', 'C'

1.9 Count the number of rows in the HBase temp table that you created in the
previous step

hbase(main):003:0> count 'temp'

1.10 What is the expected number of rows after running the following command?

1.10.1 Input data under the following conditions into temp table.
Column Family: cf1
RowKey:rk4, column descriptor : d and value: D
RowKey:rk1, column descriptor : b and value: 1B
RowKey:rk5, column descriptor : d and value: E

hbase(main):003:0> put 'temp', 'rk4', 'cf1:d', 'D'
hbase(main):003:0> put 'temp', 'rk1', 'cf1:b', '1B'
hbase(main):003:0> put 'temp', 'rk5', 'cf1:d', 'E'
hbase(main):003:0> scan 'temp'

The following is the result of scanning the data in the temp table. The scan lists 6 cells, but the
actual number of rows is 5.

Note: Row key rk1 has two cells in cf1.
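To see both cells stored under rk1, you can, for example, run a get for that row key (shown here in the non-interactive form introduced in step 1.19; inside the shell you would simply type the get command):

echo "get 'temp', 'rk1'" | hbase shell -n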

1.11 Change the VERSIONS attribute of cf1 in the temp table to 3, and check whether the
change is reflected correctly.

hbase(main):003:0> alter 'temp', {NAME=>'cf1', VERSIONS=>3}
hbase(main):004:0> desc 'temp'

1.12 Alter the temp table to add the cf2 and cf3 column families using the alter
command.

hbase(main):003:0> alter 'temp', 'cf2', 'cf3'

1.13 Execute the command to check whether cf2 and cf3 have been added.

1.14 Remove the newly added cf3 and check the result.

hbase(main):003:0> alter 'temp', {'delete' => 'cf3'}
hbase(main):004:0> desc 'temp'

1.15 Delete the temp table, then list all tables to verify that the table was successfully deleted.

1.16 Exit HBase shell with the quit command

hbase(main):004:0> quit

1.17 Run HBase shell commands from a script file.

$vi rubyscript.rb
disable 'temp'
drop 'temp'

create 'temp', 'cf1'


put 'temp', 'rk1', 'cf1:a', 'A'
put 'temp', 'rk2', 'cf1:b', 'B'
put 'temp', 'rk3', 'cf1:c', 'C'
put 'temp', 'rk4', 'cf1:d', 'D'

alter 'temp', {NAME=>'cf1', VERSIONS=>3}


alter 'temp', 'cf2', 'cf3'
put 'temp', 'rk1', 'cf1:b', '1B'
put 'temp', 'rk5', 'cf1:d', 'E'

scan 'temp'
get 'temp', 'rk1'
count 'temp'

exit

1.18 You can enter HBase shell commands into a text file, one command per line, and
pass the file to the shell.

hbase shell rubyscript.rb


hbase shell

hbase(main):001:0> require './rubyscript.rb'

Note: This is the rubyscript.rb execution result. If there is no exit statement in the
script, you remain in the shell.

1.19 You can also pass commands to the HBase Shell in non-interactive mode using the
echo command and the | (pipe) operator.

echo "describe 'temp'" | hbase shell -n

Lab 4: Data Access Using DML Commands
In this lab, you will use commands to insert, retrieve, scan, and remove rows.

1. CRUD (insert, select, update, delete) operations

1.1 If you have not finished Lab 3, run the rubyscript.rb script first, then proceed with this lab.

1.2 Run the HBase shell:

hbase shell

1.3 Enter data with the following conditions and display the results.
RowKey:rk1, column descriptor : b and value: F

hbase(main):003:0> put 'temp', 'rk1', 'cf1:b', 'F'
hbase(main):004:0> scan 'temp'

1.4 Get the row with rowkey rk1 from temp table.

hbase(main):003:0> get 'temp', 'rk1'

1.5 Get the two most recent versions of the 'b' column for row key rk1.

hbase(main):003:0> get 'temp', 'rk1', {COLUMNS => 'cf1:b', VERSIONS=>2}
hbase(main):004:0> get 'temp', 'rk1', {COLUMNS => 'cf1:b'}

Note: The first input 1B value and the last input F value are displayed. If the version is not
specified, only the final value is displayed.

1.6 Scan the whole table, but only show the 'a', 'b', and 'c' columns.

hbase(main):003:0> scan 'temp', {COLUMNS => ['cf1:b', 'cf1:a', 'cf1:c']}

1.7 Delete the ‘b’ column from the temp table with the rowkey = rk1.

hbase(main):003:0> delete 'temp', 'rk1', 'cf1:b'

1.8 Verify that ‘b’ column has been deleted.

hbase(main):004:0> scan 'temp'

1.9 Delete the entire row with row key rk5 from the temp table.

hbase(main):012:0> deleteall 'temp', 'rk5'

1.10 Show that the row with rk5 has been deleted from the temp table.

2. Using MIN_VERSIONS and Time-To-Live

2.1 Describe the temp table and check the TTL value of cf2 column. The default
TTL(Time-To-Live) is FOREVER, meaning that versions of a cell never expire.

2.2 Change the value from FOREVER to 30 seconds. This means the version will be
deleted 30 seconds after inserting data into the table.

hbase(main):005:0> alter 'temp', NAME => 'cf2', TTL => 30

2.3 Show the TTL changed from FOREVER to 30 secs.

2.4 Insert a data with the following conditions and display the results.

RowKey:rk1, Column Family: cf2, column descriptor : ttl and value: Y

hbase(main):007:0> put 'temp', 'rk1', 'cf2:ttl', 'Y'

2.5 After waiting at least 30 seconds, run the scan command again to see if the
inserted row has expired and has been deleted.

2.6 Set TTL to 10 seconds and set MIN_VERSIONS to 1 for cf2.

hbase(main):014:0> alter 'temp', NAME => 'cf2', TTL => 10, MIN_VERSIONS => 1

Note: When a version expires with the MIN_VERSIONS option set, it will not be deleted if it is the
only remaining version of the cell.

2.7 Verify the TTL and MIN_VERSIONS value in cf2.

2.8 Insert a data with the following conditions and display the results.
RowKey:rk5, Column Family: cf2, column descriptor : ttl_min and value: Z

hbase(main):016:0> put 'temp', 'rk5', 'cf2:ttl_min', 'Z'

2.9 After waiting at least 10 seconds, run the scan command again to verify that the
inserted row has remained even though its TTL has expired.

Lab 5: Working with HBase
In this lab, you will work with HBase.

1. Write commands for creating tables that meet the following conditions

1.1 Create a table with the following conditions:

1.1.1 Table name: movie


Column family: info with 3 versions
Column family: media with 3 versions

1.1.2 Table name: ranking


Column family: info with 3 versions
Column family: versions with 3 versions

1.2 Create a table with the following conditions:

1.2.1 Table name: movie


Column family: info with 3 versions
Column family: media with 3 versions

1.2.2 Table name: ranking


Column family: info with 3 versions
Column family: versions with 3 versions
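As a hint, a table with column families kept at 3 versions can be created along these lines (a sketch for the movie table only, shown in the non-interactive form used in Lab 3; the ranking table follows the same pattern):

echo "create 'movie', {NAME => 'info', VERSIONS => 3}, {NAME => 'media', VERSIONS => 3}" | hbase shell -n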

1.3 Show the number of rows in the movie and user tables you entered.

1.4 Use the alter command to change the movie table to add the stitle column family.

1.5 Run the command to verify that the stitle column family has been added.

1.6 Remove the added stitle.

1.7 Check the movie table to ensure that the stitle column family has been removed.

1.8 Modify the media column family in the movie table to keep 4 versions.

1.9 Show the media change in the movie table.

2. Write commands for CRUD operations that meet the following conditions

2.1 Show the user table to check the column family names.

2.2 Get the row with row key 100 from the user table.

2.3 Add a row with the following properties.

2.3.1 Table name: user

2.3.2 Row key: 100

2.3.3 Column Family: info

2.3.4 Column Descriptor: age and value 20

2.3.5 Column Descriptor: gender and value F

2.3.6 Column Descriptor: zip and value 18730

2.4 Just show the row with row key 100 from the user table.

2.5 In addition, insert a row with the following attributes:

2.5.1 Table name: user

2.5.2 Row key: 100

2.5.3 Column Family: info

2.5.4 Column Descriptor: age and value 30

2.5.5 Table name: user

2.5.6 Row key: 100

2.5.7 Column Family: info

2.5.8 Column Descriptor: age and value 40

2.6 Show the row with row key 100 from the user table.

2.7 In the user table, get all the old columns of the age column version in the row with
row key 100.

2.8 Show the entire table with scan command, but only display the age column.

2.9 In the user table, delete the info:age column of the row with row key 100.

2.10 Verify that the age column has been deleted.

2.11 Delete the entire row with row key 100 from the user table.

2.12 Verify that the row with row key 100 has been removed from the user table.

Lab 6: Data Access with Cassandra
In this lab, you will install Cassandra and use the CQL shell.

1. Installing Cassandra

Check the latest version at https://cassandra.apache.org/_/download.html

1.1 Create the Cassandra repo file for installation.

sudo vi /etc/yum.repos.d/cassandra.repo
[cassandra]
name=Apache Cassandra
baseurl=https://downloads.apache.org/cassandra/redhat/40x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://downloads.apache.org/cassandra/KEYS

1.2 Install Cassandra

sudo yum install cassandra

1.3 Start the Cassandra service

sudo systemctl daemon-reload
sudo systemctl start cassandra

1.4 Verify the Cassandra service

sudo systemctl status cassandra

2. Starting the CQL shell

2.1 Run the CQL shell

cqlsh

2.2 List all keyspaces in the cluster

cqlsh> desc keyspaces;

2.3 Create a keyspace called “ks_work1” on a single node.

cqlsh> CREATE KEYSPACE ks_work1
    WITH REPLICATION = {
        'class' : 'SimpleStrategy',
        'replication_factor' : 1
    };

2.4 If you want to use multiple data centers, use NetworkTopologyStrategy even on an
evaluation cluster.

2.5 This example shows how to create a keyspace with network topology in a single
node evaluation cluster.

cqlsh> CREATE KEYSPACE ks_work2
    WITH REPLICATION = {
        'class' : 'NetworkTopologyStrategy',
        'datacenter1' : 1
    };

2.6 Verify the generated keyspace information.

cqlsh> describe keyspace ks_work1;
cqlsh> desc ks_work2;

2.7 Change and delete keyspace settings

cqlsh> alter keyspace ks_work2 with durable_writes = false;
cqlsh> desc ks_work2;

2.7.1 Verify the changed setting in ks_work2

cqlsh> drop KEYSPACE ks_work2;
cqlsh> desc ks_work2;

2.7.2 Verify that keyspace ks_work2 has been dropped.

3. Data Access using DDL

3.1 Create tables with the following characteristics (include the keyspace name in the
create table statement instead of using the use command)

3.1.1 Keyspace: ks_work1

Table name: user_info

Column name: id uuid, l_name text, f_name text, country text, birth int, gender text

Partition Key: id

cqlsh> create table ks_work1.user_info (id uuid primary key, l_name text, f_name text,
country text, birth int, gender text);

3.1.2 Verify the user_info table.

3.1.3 Keyspace: ks_work1

Table name: user_info2

Column name: id uuid, l_name text, f_name text, country text, birth int, gender text,
s text, num int

static column: s

primary key : country, birth, id

composite partition key : country, birth

cqlsh> create table ks_work1.user_info2 (id int, l_name text, f_name text, country text,
birth text, gender text, s text, num int, primary key((country,birth), id));
cqlsh> desc tables;

3.1.4 Verify the user_info2 table.

3.1.5 Keyspace: ks_work1

Table name: user_info3

Column name: id int, l_name text, f_name text, country text, birth int, age int

Compound Primary Key : country, id, birth

Clustering key: birth

cqlsh> create table ks_work1.user_info3 (id int, l_name text, f_name text, country text,
birth text, gender text, s text static, num int, primary key(country, id, birth));

3.1.6 Keyspace: ks_work1

Table name: user_info33

Column name: id int, l_name text, f_name text, country text, birth int, age int

Compound Primary Key : country, id, birth (descending order)

Clustering key: birth (store data in descending order)

cqlsh> create table ks_work1.user_info33 (id int, l_name text, f_name text, country
text, birth text, gender text, s text static, age int, primary key(id, birth)) with clustering
order by (birth desc);
cqlsh> desc tables;

3.2 Change the keyspace to ks_work1 and verify whether the tables user_info,
user_info2 and user_info3 were created successfully.

cqlsh> use ks_work1;
cqlsh:ks_work1> desc tables;
cqlsh:ks_work1> desc user_info;

3.2.1 Verify the user_info

3.3 Add the following column to the user_info33 table. And verify the user_info33
table.

3.3.1 Column name : height, type int

cqlsh:ks_work1> alter table user_info33 add height int;
cqlsh:ks_work1> desc user_info33;

3.4 Delete the user_info33 table you just modified.

cqlsh:ks_work1> drop table user_info33;

3.5 List all tables again to confirm that table user_info33 was dropped successfully.

cqlsh:ks_work1> desc tables;

3.6 Enter data into the user_info2 and user_info3 tables, respectively, as shown in the
following figure.

3.6.1 User_info2

3.6.2 User_info3

4. Data Access using DML

4.1 Write SELECT statements that output the user_info2 and user_info3 data, then limit the output to 1 row per partition.

cqlsh:ks_work1> select * from user_info2;
cqlsh:ks_work1> select * from user_info3;

4.1.1 Verify the records for user_info2 and user_info3;

cqlsh:ks_work1> select * from user_info2 per partition limit 1;
cqlsh:ks_work1> select * from user_info3 per partition limit 1;

4.1.2 Verify the records for user_info2 and user_info3 with 1 row per partition.

4.2 List the persons whose country is KOR or JPN.

cqlsh:ks_work1> select f_name, l_name from user_info3 where country in ('KOR', 'JPN');

4.3 Find the total number of records in the user_info3 table and the sum, minimum,
and average values of num.

cqlsh:ks_work1> select count(id) as count, sum(num) as sum, min(num) as min,
avg(num) as avg from user_info3;

4.4 Get the records where the country is KOR.

cqlsh:ks_work1> select * from user_info3 where country='KOR';

4.5 Find the records where country is KOR and num is greater than 30

cqlsh:ks_work1> select * from user_info3 where country='KOR' and num > 30 allow
filtering;

4.6 Modify the value of s where country is KOR to 'value_new', and make it null after 30 seconds using a TTL.

cqlsh:ks_work1> update user_info3 using ttl 30 set s='value_new' where country='KOR';

4.6.1 After 30 seconds, the values in 3 rows are deleted and become null.

4.7 Insert a row with id 4 using the following values:
f_name: ann, l_name: kim, num: 85

cqlsh:ks_work1> insert into user_info3 (id, f_name, l_name, num, country, birth)
values(4,'ann','kim', 85, 'JPN', '1974');

Note: insert into user_info3 (id, f_name, l_name, num) values(4,'ann','kim',85);
insert into user_info3 (id, f_name, l_name, num, country) values(4,'kim',85, 'JPN');
Execute the above 2 commands and consider the reason for the error.

4.8 Change the value of f_name in the record with id 3 to "scott".

cqlsh:ks_work1> insert into user_info3 (id, l_name, f_name, country, birth, gender, s,
num) values (3,'jeong','scott','USA','1967', 'M', 'value3', 32) if not exists;

4.9 Change the value 'Scitt' to 'scott' in the row with id 3 of the user_info3 table.

4.10 Delete the s column where the country is USA from the user_info3 table.

cqlsh:ks_work1> delete s from user_info3 where country='USA';
cqlsh:ks_work1> select * from user_info3;

4.11 Delete the l_name column under the same conditions as in 4.10

cqlsh:ks_work1> delete l_name from user_info3 where country='USA';

4.12 Delete the rows where the country is CAN or JPN from the user_info3 table.

cqlsh:ks_work1> delete from user_info3 where country in ('CAN', 'JPN');
cqlsh:ks_work1> select * from user_info3;

Lab 7: Working with Cassandra

1. Write a command to create the country table

1.1 Create a table with the following conditions:


Column Name   Column Type   Key
id            uuid          Primary Key
name          text          Partition Key
capital       text          Partition Key
population    int

1.2 Insert the following values into the created country table.


id   Name      Capital           Population
1    USA       Washington D.C.   333,098,437
2    Germany   Berlin            84,073,352
3    France    Paris             65,426,179
4    Korea     Seoul             51,305,186

2. Working with Queries.

2.1 Query the country table for the following information:

2.1.1 Population of Korea

2.1.2 Population and capital of all entries in the country table

2.1.3 Names of countries in alphabetical order

2.1.4 Country, capital and population, ordered by population, descending

2.1.5 Same as above but ordered ascending

2.1.6 Countries with a population less than 100,000,000.

2.1.7 Countries with a population between 50 and 100 million

2.2 Create queries on the country table

2.2.1 Name the country with the smallest population

2.2.2 Name the country with the second largest population

2.2.3 List the first row in the countries table

2.2.4 How many countries are listed in the country table?

END OF LAB

