
Data Mining

IE:4172 Big Data Analytics


Stephen Baek
Sea of Information
● Data on the internet are extremely abundant
● They can be useful in many applications:
○ Predicting outcomes of political elections
○ Market trend research
○ Sentiment/reputation analysis
○ Stock market prediction
○ Sports science
○ Diffusion of information
○ Natural disasters
○ Diseases, epidemiology, public health
○ … the list goes on and on

Image Source: Unknown


Data is the new oil
● We have to “mine” it…
○ Publicly available datasets
■ Raw files made available for download
■ e.g. UCI ML repository, Kaggle competitions, data.gov, NIH Chest X-ray Dataset, …
○ Web crawling/scraping
■ Automated bots/macros to collect data from the web
■ Navigate through websites by tracking down the links
■ e.g. Search engines!
○ API - Application Programming Interface
■ A programming interface for sending queries and retrieving data
■ e.g. Twitter API
○ Proprietary datasets

Image Source: Wikipedia


Public Datasets
● https://www.data.gov/
● https://www.kaggle.com
● https://archive.ics.uci.edu/ml/index.php
Web Crawling & Scraping
● Data mining from websites can be incredibly tedious and repetitious
● Web browser macros can automate repetitive web clicks, filling in forms, etc.

https://youtu.be/hytfjJGqlio
Web Crawling & Scraping
● Crawler: aka web robot, or web spider
○ A software program that automatically traverses hyperlinks
○ Systematically browses the world wide web
○ Examples:
■ Googlebot: collects documents from the web to build a searchable index.
■ Xenon: a web crawler used by government tax authorities to detect fraud.

● There are many open-source crawlers and scraping libraries:
○ For example: https://github.com/scrapinghub
○ BeautifulSoup, lxml
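As a sketch of what these libraries do under the hood, the core link-extraction step can be written with only the Python standard library (BeautifulSoup and lxml make the parsing far more robust in practice; the HTML snippet below is made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag -- the step that lets a crawler
    traverse hyperlinks from page to page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="https://example.com">Ext</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about', 'https://example.com']
```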
Web Crawler Policies
● The behavior of a web crawler is the outcome of a combination of policies
○ a selection policy which states the pages to download,
○ a re-visit policy which states when to check for changes to the pages,
○ a politeness policy that states how to avoid overloading Web sites.
○ a parallelization policy that states how to coordinate distributed web crawlers.
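The first three policies can be sketched as a toy crawl loop. Here `fetch` and `extract_links` are hypothetical caller-supplied callbacks standing in for real HTTP and HTML-parsing code:

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(seed_urls, fetch, extract_links, max_pages=10, delay=1.0):
    """Toy frontier loop: selection policy = FIFO queue,
    politeness policy = per-host delay between requests."""
    frontier = deque(seed_urls)   # selection: first-in, first-out
    visited = set()
    last_hit = {}                 # politeness: last request time per host
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:        # skip pages we already downloaded
            continue
        host = urlparse(url).netloc
        wait = delay - (time.monotonic() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)      # do not overload a single site
        last_hit[host] = time.monotonic()
        content = fetch(url)      # download the page
        visited.add(url)
        pages.append(url)
        frontier.extend(extract_links(content))  # follow hyperlinks
    return pages
```

A re-visit policy would additionally record download timestamps and re-queue pages after some staleness threshold.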
● Web crawlers are not always welcome
○ A not so well-behaved crawler can be blacklisted
○ robots.txt: a special file on a web server that states which pages crawlers may access (well-behaved crawlers honor it; it is not technically enforced)
■ ‘Allow’ directive: list of pages that can be accessed
■ ‘Disallow’ directive: list of pages that should not be crawled/indexed
○ HTML META tags: serve a similar purpose to robots.txt, on a per-page basis
■ <META NAME="ROBOTS" CONTENT="NOFOLLOW">
■ <META NAME="GOOGLEBOT" CONTENT="NOINDEX">
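Python's standard library ships a parser for these rules. A minimal check against an inline robots.txt body (the rules below are made up; normally you would call `set_url(...)` and `read()` to fetch the site's real file):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])

# A polite crawler calls can_fetch() before every download.
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
```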
Application Programming Interface (API)
● Set of functions, routines, protocols, and tools for building software
applications
● APIs define the standard way of accessing data
● Examples:
○ Twitter API: https://dev.twitter.com
○ Facebook API: https://developers.facebook.com
○ Yahoo! Finance API
○ Google Maps API
○ …
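The typical API workflow is: build a query URL, send an HTTP request, and parse a structured (usually JSON) response. A sketch with a hypothetical endpoint (real APIs such as the Twitter API additionally require authentication keys or OAuth tokens):

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters for illustration.
base = "https://api.example.com/search"
query = urlencode({"q": "big data", "count": 10})
url = f"{base}?{query}"
print(url)  # https://api.example.com/search?q=big+data&count=10

# A real call would be: body = urllib.request.urlopen(url).read()
body = '{"results": [{"text": "hello"}], "count": 1}'  # canned response
data = json.loads(body)  # parse the JSON payload into Python objects
print(data["count"])  # 1
```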
(ICA) Let’s Play

Image Source: https://pixabay.com


Homework! - Due: 9/17 (Tuesday)
ICA - Topic 1
● Debate on the Nobel Prize in Physics 2017: “First Direct Observation of Gravitational Waves”
○ What is a gravitational wave?
■ https://www.nationalgeographic.com/news/2017/10/gravitational-waves-nobel-prize-physics-ligo-science-space/
○ The debate:
■ https://arstechnica.com/science/2018/10/danish-physicists-claim-to-cast-doubt-on-detection-of-gravitational-waves/

● Discuss:
○ What is a gravitational wave in layperson’s terms?
○ What’s the root of the debate?
○ What is correlated noise, and what can you do about it?
○ Danish vs. American scientists - who do you think is more convincing?
ICA - Topic 2
● David Bailey. (2018). Why outliers are good for science.
○ https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2018.01105.x

● Discuss:
○ What is the Gaussian distribution (the bell curve) and what is the Cauchy distribution?
○ Are real-world measurements closer to the Gaussian or the Cauchy? What do you think is the reason?
○ What criteria are commonly used to determine outliers? How can they be wrong?
○ What is the author’s point in claiming that outliers might actually be good for science?
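The “how can they be wrong” question can be made concrete in a few lines: the classic 3-sigma rule can fail to flag an obvious outlier because the outlier itself inflates the estimated mean and standard deviation (so-called masking). The numbers below are made up for illustration:

```python
import statistics

# Six readings near 10 plus one clearly aberrant reading at 25.
data = [9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 25.0]

mu = statistics.mean(data)     # 12.1 -- pulled up by the outlier
sigma = statistics.stdev(data) # ~5.69 -- inflated by the outlier

# 3-sigma rule: flag points more than 3 standard deviations from the mean.
outliers = [x for x in data if abs(x - mu) > 3 * sigma]
print(outliers)  # [] -- the outlier inflated sigma enough to hide itself
```

Robust alternatives (e.g. median and MAD instead of mean and standard deviation) are less susceptible to this masking effect.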
ICA - Topic 3
● Candace Corbeil - Gaps in the Spreadsheet
○ https://www.apa.org/science/about/psa/2016/02/gaps-spreadsheet
● Gerhard Svolba - The origin, detection, treatment and consequences of
missing values in analytics.
○ http://analytics-magazine.org/missing-values/

● Discuss:
○ What are the three types of missing data?
○ What is multiple imputation, and how can it be useful for data that are missing at random?
○ In the case of systematic (non-random) missing data, would you still use multiple imputation? Or what else can you do?
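As background for the multiple-imputation question, a schematic using only the standard library. The “plausible draw” model here (a normal distribution fitted to the observed values) is a deliberate oversimplification of real imputation models, and the data are made up:

```python
import random
import statistics

random.seed(0)  # reproducible draws

observed = [4.2, 3.9, 4.5, 4.1, 4.3]  # made-up measurements
n_missing = 2                          # two values are missing at random

# Multiple imputation, schematically:
# 1. fill each hole several times with a plausible random draw,
# 2. analyze each completed dataset,
# 3. pool the per-dataset results (and their variability).
mu, sigma = statistics.mean(observed), statistics.stdev(observed)
per_dataset_means = []
for _ in range(5):  # 5 imputed datasets
    completed = observed + [random.gauss(mu, sigma) for _ in range(n_missing)]
    per_dataset_means.append(statistics.mean(completed))

pooled_estimate = statistics.mean(per_dataset_means)
```

The pooling step is what distinguishes multiple imputation from single imputation: the spread across the imputed datasets reflects the uncertainty introduced by the missing values.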
