
Data Mining

IE:4172 Big Data Analytics


Stephen Baek
Sea of Information
● Data on the internet are extremely abundant
● They can be useful in many applications:
○ Predicting outcomes of political elections
○ Market trend research
○ Sentiment/reputation analysis
○ Stock market prediction
○ Sports science
○ Diffusion of information
○ Natural disasters
○ Diseases, epidemiology, public health
○ … the list goes on and on

Image Source: Unknown


Data is the new oil
● We have to “mine” it…
○ Publicly available datasets
■ Raw files made available for download
■ e.g. UCI ML repository, Kaggle competitions, data.gov, NIH Chest X-ray Dataset, …
○ Web crawling/scraping
■ Automated bots/macros to collect data from the web
■ Navigate through websites by tracking down the links
■ e.g. Search engines!
○ API - Application Programming Interface
■ A programming interface for sending queries and retrieving data
■ e.g. Twitter API
○ Proprietary datasets

Image Source: Wikipedia


Public Datasets
● https://www.data.gov/
● https://www.kaggle.com
● https://archive.ics.uci.edu/ml/index.php
Web Crawling & Scraping
● Data mining from websites can be incredibly tedious and repetitious
● Web browser macros can automate repetitive web clicks, filling in forms, etc.

https://youtu.be/hytfjJGqlio
Web Crawling & Scraping
● Crawler: aka web robot, or web spider
○ A software program that automatically traverses hyperlinks
○ Systematically browses the world wide web
○ Examples:
■ Googlebot: collects documents from the web to build a searchable index.
■ Xenon: a web crawler used by government tax authorities to detect fraud.

● There are many open-source crawlers and scraping libraries:
○ For example: https://github.com/scrapinghub
○ BeautifulSoup, lxml
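As a sketch of what these libraries do under the hood, the core link-extraction step can be written with only the Python standard library (BeautifulSoup and lxml make the parsing far more robust in practice; the HTML snippet below is made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag -- the step that lets a crawler
    traverse hyperlinks from page to page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="https://example.com">Ext</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about', 'https://example.com']
```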
Web Crawler Policies
● The behavior of a web crawler is the outcome of a combination of policies
○ a selection policy which states the pages to download,
○ a re-visit policy which states when to check for changes to the pages,
○ a politeness policy that states how to avoid overloading Web sites.
○ a parallelization policy that states how to coordinate distributed web crawlers.
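The first three policies can be sketched as a toy crawl loop. Here `fetch` and `extract_links` are hypothetical caller-supplied callbacks standing in for real HTTP and HTML-parsing code:

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(seed_urls, fetch, extract_links, max_pages=10, delay=1.0):
    """Toy frontier loop: selection policy = FIFO queue,
    politeness policy = per-host delay between requests."""
    frontier = deque(seed_urls)   # selection: first-in, first-out
    visited = set()
    last_hit = {}                 # politeness: last request time per host
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:        # skip pages we already downloaded
            continue
        host = urlparse(url).netloc
        wait = delay - (time.monotonic() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)      # do not overload a single site
        last_hit[host] = time.monotonic()
        content = fetch(url)      # download the page
        visited.add(url)
        pages.append(url)
        frontier.extend(extract_links(content))  # follow hyperlinks
    return pages
```

A re-visit policy would additionally record download timestamps and re-queue pages after some staleness threshold.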
● Web crawlers are not always welcome
○ A not so well-behaved crawler can be blacklisted
○ robots.txt: a special file on a web server that states which pages crawlers may access (well-behaved crawlers honor it; it is not technically enforced)
■ ‘Allow’ directive: list of pages that can be accessed
■ ‘Disallow’ directive: list of pages that should not be crawled/indexed
○ HTML META tags: serve a similar purpose to robots.txt, on a per-page basis
■ <META NAME="ROBOTS" CONTENT="NOFOLLOW">
■ <META NAME="GOOGLEBOT" CONTENT="NOINDEX">
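Python's standard library ships a parser for these rules. A minimal check against an inline robots.txt body (the rules below are made up; normally you would call `set_url(...)` and `read()` to fetch the site's real file):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])

# A polite crawler calls can_fetch() before every download.
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
```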
Application Programming Interface (API)
● Set of functions, routines, protocols, and tools for building software
applications
● APIs define the standard way of accessing data
● Examples:
○ Twitter API: https://dev.twitter.com
○ Facebook API: https://developers.facebook.com
○ Yahoo! Finance API
○ Google Maps API
○ …
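The typical API workflow is: build a query URL, send an HTTP request, and parse a structured (usually JSON) response. A sketch with a hypothetical endpoint (real APIs such as the Twitter API additionally require authentication keys or OAuth tokens):

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters for illustration.
base = "https://api.example.com/search"
query = urlencode({"q": "big data", "count": 10})
url = f"{base}?{query}"
print(url)  # https://api.example.com/search?q=big+data&count=10

# A real call would be: body = urllib.request.urlopen(url).read()
body = '{"results": [{"text": "hello"}], "count": 1}'  # canned response
data = json.loads(body)  # parse the JSON payload into Python objects
print(data["count"])  # 1
```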
(ICA) Let’s Play

Image Source: https://pixabay.com


Homework! - Due: 9/17 (Tuesday)
ICA - Topic 1
● Debate on the Nobel Prize in Physics 2017: “First Direct Observation of Gravitational Waves”
○ What is a gravitational wave?
■ https://www.nationalgeographic.com/news/2017/10/gravitational-waves-nobel-prize-physics-ligo-science-space/
○ The debate:
■ https://arstechnica.com/science/2018/10/danish-physicists-claim-to-cast-doubt-on-detection-of-gravitational-waves/

● Discuss:
○ What is a gravitational wave in layperson’s terms?
○ What’s the root of the debate?
○ What is correlated noise, and what can you do about it?
○ Danish vs. American scientists - who do you think is more convincing?
ICA - Topic 2
● David Bailey. (2018). Why outliers are good for science.
○ https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2018.01105.x

● Discuss:
○ What is the Gaussian distribution (the bell curve) and what is the Cauchy distribution?
○ Are real-world measurements closer to the Gaussian or the Cauchy? What do you think is the reason?
○ What criteria are commonly used to determine outliers? How can they be wrong?
○ What is the author’s point in claiming that outliers might actually be good for science?
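The “how can they be wrong” question can be made concrete in a few lines: the classic 3-sigma rule can fail to flag an obvious outlier because the outlier itself inflates the estimated mean and standard deviation (so-called masking). The numbers below are made up for illustration:

```python
import statistics

# Six readings near 10 plus one clearly aberrant reading at 25.
data = [9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 25.0]

mu = statistics.mean(data)     # 12.1 -- pulled up by the outlier
sigma = statistics.stdev(data) # ~5.69 -- inflated by the outlier

# 3-sigma rule: flag points more than 3 standard deviations from the mean.
outliers = [x for x in data if abs(x - mu) > 3 * sigma]
print(outliers)  # [] -- the outlier inflated sigma enough to hide itself
```

Robust alternatives (e.g. median and MAD instead of mean and standard deviation) are less susceptible to this masking effect.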
ICA - Topic 3
● Candace Corbeil - Gaps in the Spreadsheet
○ https://www.apa.org/science/about/psa/2016/02/gaps-spreadsheet
● Gerhard Svolba - The origin, detection, treatment and consequences of
missing values in analytics.
○ http://analytics-magazine.org/missing-values/

● Discuss:
○ What are the three types of missing data?
○ What is multiple imputation, and how can it be useful for data that are missing at random?
○ In the case of systematic (non-random) missing data, would you still use multiple imputation? Or what else can you do?
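As background for the multiple-imputation question, a schematic using only the standard library. The “plausible draw” model here (a normal distribution fitted to the observed values) is a deliberate oversimplification of real imputation models, and the data are made up:

```python
import random
import statistics

random.seed(0)  # reproducible draws

observed = [4.2, 3.9, 4.5, 4.1, 4.3]  # made-up measurements
n_missing = 2                          # two values are missing at random

# Multiple imputation, schematically:
# 1. fill each hole several times with a plausible random draw,
# 2. analyze each completed dataset,
# 3. pool the per-dataset results (and their variability).
mu, sigma = statistics.mean(observed), statistics.stdev(observed)
per_dataset_means = []
for _ in range(5):  # 5 imputed datasets
    completed = observed + [random.gauss(mu, sigma) for _ in range(n_missing)]
    per_dataset_means.append(statistics.mean(completed))

pooled_estimate = statistics.mean(per_dataset_means)
```

The pooling step is what distinguishes multiple imputation from single imputation: the spread across the imputed datasets reflects the uncertainty introduced by the missing values.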
