Assignment 2
Conor Dorrian
L00101441
Hypothesis
An exploratory investigation into the application of big data analytics to a live Twitter feed
and its top hashtags.
Introduction
This assignment works with big data in order to understand it better. I chose this topic
because I will be using Hadoop and Python for project preparation next semester, and I
wanted to try something new and see the similarities and differences between them.
Twitter is a social networking service where users post and read short "tweets". Registered
users can post and read tweets, but unregistered users can only read them.
Scala is similar to Java: it is an object-oriented programming language that runs on the
Java Virtual Machine. Unlike Java, Scala also has many features of functional programming
languages such as Scheme, including immutability.
Spark is an open-source processing engine built around speed, ease of use, and analytics. It
is a good fit when you have large amounts of data that require low-latency processing that a
typical MapReduce program cannot provide.
In this assignment I set up and install Spark Streaming and the Scala IDE, and then look
into a live Twitter feed and Twitter's top hashtags by creating API tokens and access keys on
https://apps.twitter.com/ and writing segments of code.
Setting up Spark Streaming
1. Download JDK 1.8.0_101 for Windows
2. Run the installer and save to the Windows (C:\Program Files\Java) folder
3. Download the Apache Spark 2.0.0 .tgz file
4. Download WinRAR x64 to extract the files from the .tgz archive
5. Extract the .tgz file
6. Created a folder on the Windows (C:) drive to make it easier for myself to find:
PC > Windows (C:) > Spark
Spark is now installed. Spark is built on Scala, which is built on top of Java.
Next Installation
1. Renamed the log4j.properties.template file in the Spark conf folder by deleting the
.template extension it had at the end of its name
2. Edited the file in WordPad by changing rootCategory from INFO to ERROR. This lowers the
log level so that only errors are printed, which makes the output easier to read.
3. Downloaded winutils to make Spark think it is running on a Hadoop cluster. Made a
winutils folder with a bin folder in it and put the winutils file in the bin folder.
Environment variables
1. Next was to edit the environment variables: Control Panel > System and Security >
System > Advanced system settings > Environment Variables
2. Create 3 new variables:
Variable name = SPARK_HOME
Variable value = C:\Spark
Variable name = JAVA_HOME
Variable value = C:\files\java\jdk1.8.0_101
Variable name = HADOOP_HOME
Variable value = C:\winutils
Figure 1.1: Showing the variables that were created
Figure 1.2: Showing the variables that were created
Figure 1.3: Showing the variables that were created
3. Then edited the Path variable by adding
%SPARK_HOME%\bin
%JAVA_HOME%\bin
Figure 2: Showing the paths that were added after the variables
Environment variables are now complete.
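The variables above can be checked from code before starting Spark. The following is a minimal sketch in plain Scala, not part of the original assignment: the object name EnvCheck and its helper are hypothetical, and it only reports whether each of the three names has a non-empty value, not whether the folder it points to is correct.

```scala
object EnvCheck {
  // The three variable names created in the steps above.
  val required: Seq[String] = Seq("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")

  // Returns the names for which `lookup` finds no value or only a blank one.
  // `lookup` is a parameter so the check can be tried without touching the real environment.
  def missing(names: Seq[String], lookup: String => Option[String]): Seq[String] =
    names.filter(name => lookup(name).forall(_.trim.isEmpty))

  def main(args: Array[String]): Unit = {
    // System.getenv returns null for an unset variable, so wrap it in Option.
    val notSet = missing(required, name => Option(System.getenv(name)))
    if (notSet.isEmpty) println("All three variables are set.")
    else println("Not set: " + notSet.mkString(", "))
  }
}
```

Running EnvCheck in a newly opened command prompt should list any variable that was mistyped or not saved; a prompt that was already open before the variables were created will not see them.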
Installing Scala IDE
1. Downloaded Scala IDE from www.Scala-ide.org/downloads/sdk.html
2. Because Scala is built on top of Java, no other special software needs to be
installed
3. Unzipped the folder and placed it in Downloads
4. eclipse.exe is the Scala IDE
5. Moved the eclipse folder from the Scala IDE download to my (C:) drive so I can get to it more easily
6. Then pinned it to the desktop for easier access
Test to see if it all works
1. Open Command Prompt as administrator
2. cd C:\Spark to get into the Spark folder
3. dir to list the directories
4. spark-shell to start Spark
5. After a few minutes Spark should start without any errors
Figure 3: Showing Spark working
6. Then create a small dataset (an RDD) from a text file
7. val rdd = sc.textFile("README.md") loads the file as the dataset
8. rdd.count() gives the number of lines in that file
Figure 4: Loaded the dataset and showed how big it was
Apache Spark is now installed and the first Spark application is complete.
Starting Assessments
Live Twitter feed application
1. Went to apps.twitter.com, signed in to Twitter and created an app, typing in the
information about the app.
2. The "Keys and Access Tokens" button is at the bottom of the screen (very important).
3. Clicked the button and created the keys needed to connect Spark Streaming to Twitter.
4. Created a new folder to save my code in, and a Notepad file to keep all of the
randomly generated keys from Twitter.
5. Opened the Scala IDE (Eclipse) and began work on my Twitter app.
6. Set up the Twitter credentials and logging:
 • setupTwitter()
 • setupLogging()
7. Before starting the app, I added the JARs from the Spark jars folder, which came with
the Apache Spark download, to the build path.
8. I then added 3 more Twitter JAR files; without these the app would not work.
9. After the JARs were added I began on the workings of the code.
10. I then ran the code and it worked. I had created my first application on Spark and
Scala. It output a live Twitter feed, printing 15 tweets every second.
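The assignment does not show what setupTwitter() does internally. One plausible way to write it is sketched below: Twitter4J can read OAuth credentials from JVM system properties named twitter4j.oauth.consumerKey, twitter4j.oauth.consumerSecret, twitter4j.oauth.accessToken and twitter4j.oauth.accessTokenSecret, so a helper can copy the saved keys into those properties. The file name twitter.txt, its one "name value" pair per line layout, and the object and method names are all assumptions made for illustration.

```scala
import scala.io.Source

object TwitterCredentials {
  // Parses lines of the form "<name> <value>", skipping blank or malformed lines.
  def parse(lines: Seq[String]): Map[String, String] =
    lines.map(_.trim.split("\\s+"))
      .collect { case Array(name, value) => name -> value }
      .toMap

  // Copies each credential into the system property that Twitter4J reads.
  def install(credentials: Map[String, String]): Unit =
    credentials.foreach { case (name, value) =>
      System.setProperty("twitter4j.oauth." + name, value)
    }

  // Hypothetical stand-in for setupTwitter(); the default file name is an assumption.
  def setupTwitter(path: String = "twitter.txt"): Unit = {
    val source = Source.fromFile(path)
    try install(parse(source.getLines().toList))
    finally source.close()
  }
}
```

Keeping the keys in a separate text file, as in step 4, means they never have to be typed into the code itself, and a mistyped key can be corrected in one place.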
Continuing Twitter feed (getting popular hashtags)
Live top hashtags every 5 minutes
1. Set up the Twitter credentials from the access keys
 • setupTwitter()
 • setupLogging()
2. Created a DStream (a continuous sequence of RDDs, each a collection of elements
partitioned across the nodes) of tweets and extracted the text from each tweet
3. Separated each tweet into a list of words
4. Put the words in a new DStream, "val tweetwords", using the flatMap operation so that
each word becomes its own entry
5. Then filtered out everything which is not a hashtag
 • val hashtags = tweetwords.filter(word => word.startsWith("#"))
6. Used a map to pair every hashtag with a count of 1 and then added up the counts for
each hashtag, which is basically MapReduce
 • Example: k:1, k:1, d:1, d:1, d:1, a:1
 • Equals: k:2
d:3
a:1
7. Added up the counts over a 5-minute window, updated every second
8. These were then placed in "hashtagsCounts", which holds the final value for each hashtag
9. Then sorted the results by count
10. Printed the top 10 most popular hashtags
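The steps above can be tried out on an ordinary Scala list, without Spark or a Twitter connection. This is only a sketch of the logic under that simplification: the names tweetwords, hashtags and hashtagsCounts follow the steps, while the object TopHashtags, the method top and the sample tweets are made up for illustration. In the streaming application the same operations would run on DStreams, with the adding in step 7 done over a window (Spark Streaming provides reduceByKeyAndWindow for this); that wiring is not shown here.

```scala
object TopHashtags {
  // Returns the `n` most frequent hashtags in `tweets` as (hashtag, count) pairs,
  // most frequent first.
  def top(tweets: Seq[String], n: Int = 10): Seq[(String, Int)] = {
    // Steps 3-4: flatMap turns every tweet into its words, one entry per word.
    val tweetwords = tweets.flatMap(_.split(" "))
    // Step 5: keep only the words that are hashtags.
    val hashtags = tweetwords.filter(word => word.startsWith("#"))
    // Step 6: pair every hashtag with a count of 1.
    val pairs = hashtags.map(tag => (tag, 1))
    // Steps 6-8: add up the 1s that share the same hashtag.
    val hashtagsCounts = pairs.groupBy(_._1)
      .map { case (tag, ones) => (tag, ones.map(_._2).sum) }
    // Steps 9-10: sort by count, largest first, and keep the top n.
    hashtagsCounts.toSeq.sortBy { case (_, count) => -count }.take(n)
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq("#k is here", "more #d and #k", "#d again", "#d and #a")
    top(sample).foreach(println)
  }
}
```

With the sample tweets, #d is counted 3 times, #k twice and #a once, which matches the k:2, d:3, a:1 example in step 6.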
Top Hashtag output
Conclusion
This assignment took me a long time to complete, but in the end it was all worth it. I learnt a
lot, which I hope will be beneficial in the long run after college.
Getting the assignment started was easy enough, although it gave me many errors at first. I
just had to set up Spark Streaming, which is built on Scala, which in turn is built on top of Java.
Errors
1. The first error occurred when trying to start Spark in the command prompt.
Figure 5: Showing the error while trying to run Spark
I got multiple errors like this. I knew they had to do with the winutils file but did not know
what exactly. I then created a new winutils file, but I got a different error.
Figure 6: Showing a different error from creating new winutils folder
I then realised that I had accidentally put the winutils file inside one bin folder too many.
Spark and winutils did not work because there were 2 bin folders
(C:\Program files\winutils\bin\bin). After realising this I deleted 1 of the bin folders.
I pressed "Ctrl+D" to restart in the command prompt and did it again. It then worked fine.
After this installation, I was ready to get going on my assignment.
My next error came when I was trying to run the code for the Twitter feed.
When this showed up, I was very confused and did not really know what was wrong.
However, after much research I finally understood what the first error was: Windows just
needed to be told to let the program through the firewall. After understanding the first
error, I then got another error (below) every time, after 3 Time messages.
The error above took me days to fix because I was not used to the language and had other
things to do as well. I first synced my computer's clock, which I thought would fix it, but
it did not. I then went back to my access keys, and that is where the error lay: I had
forgotten to put a "-" in between 2 numbers. After this was fixed it worked perfectly.
After I was finished with the live Twitter feed, I went on to the top hashtags for Twitter.
This was a lot easier because I did not have to set up anything; everything was already
downloaded and installed (i.e. Spark). The top hashtags code was fine to get my head around
and I had hardly any problems with it.
Stuff I learned from top hashtags:
Filtering DStreams
flatMap vs map
Key/value pairs
Sorting RDDs
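The difference between flatMap and map can be seen on a plain Scala list. The sketch below is illustrative only: the object name and the two sample tweets are made up, and the same behaviour applies to the operations of the same name on RDDs and DStreams.

```scala
object FlatMapVsMap {
  val tweets: List[String] = List("hello #spark", "#scala rocks")

  // map: exactly one output per input, so the result is a list of word lists.
  val mapped: List[List[String]] = tweets.map(_.split(" ").toList)

  // flatMap: each input may yield many outputs, flattened into one list of words.
  val flatMapped: List[String] = tweets.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(mapped)
    println(flatMapped)
  }
}
```

This is why the hashtag code uses flatMap to split tweets: the filter that follows needs a stream of single words, not a stream of word lists.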
