Data Mining: IE:4172 Big Data Analytics Stephen Baek
Web Crawling & Scraping
● Crawler: aka web robot, or web spider
○ A software program that automatically traverses hyperlinks
○ Systematically browses the world wide web
○ Examples:
■ Googlebot: collects documents from the web to build a searchable index.
■ Xenon: a web crawler used by government tax authorities to detect fraud.
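The idea of "automatically traversing hyperlinks" can be sketched as a breadth-first walk over pages. The following is a minimal illustration, not how Googlebot or Xenon actually work. The names `LinkExtractor`, `crawl`, and `PAGES` are hypothetical, and the three-page site is held in memory so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first traversal of hyperlinks starting from start_url.

    fetch is a caller-supplied function that maps a URL to its HTML text,
    or returns None when the page cannot be retrieved.
    Returns the URLs in the order they were visited.
    """
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])  # frontier of URLs still to visit
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page, drop #fragments.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site (assumption: stands in for real HTTP fetches).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/a#top">A</a> <a href="/missing">?</a>',
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

A real crawler would replace `PAGES.get` with an HTTP fetch and would normally also respect `robots.txt` and limit its request rate.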
ICA - Topic 1
● Discuss:
○ What is a gravitational wave, in layperson's terms?
○ What’s the root of the debate?
○ What is correlated noise, and what can you do about it?
○ Danish vs. American scientists: who do you think is more convincing?
ICA - Topic 2
● David Bailey. (2018). Why outliers are good for science.
● Discuss:
○ What is the Gaussian distribution (the bell curve) and what is the Cauchy distribution?
○ Are real-world measurements closer to the Gaussian or the Cauchy? Why do you think that is the case?
○ What criteria are commonly used to identify outliers? How can they be wrong?
○ What is the author’s argument for claiming that outliers might actually be good for science?
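As a starting point for the discussion, the sketch below compares how often a fixed outlier rule flags points under each distribution. The rule is an assumption: the slides do not name a criterion, so Tukey's 1.5 × IQR fences are used here as one common choice. The helper names `tukey_fences` and `cauchy_cdf` are hypothetical, and the probabilities are computed from the exact distribution functions rather than from samples.

```python
import math
from statistics import NormalDist


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cauchy_cdf(x):
    """CDF of the standard Cauchy distribution (location 0, scale 1)."""
    return 0.5 + math.atan(x) / math.pi


# Standard Gaussian: quartiles come from the inverse CDF.
normal = NormalDist(mu=0.0, sigma=1.0)
lo_n, hi_n = tukey_fences(normal.inv_cdf(0.25), normal.inv_cdf(0.75))
p_normal = normal.cdf(lo_n) + (1.0 - normal.cdf(hi_n))

# Standard Cauchy: the quartiles are exactly -1 and +1.
lo_c, hi_c = tukey_fences(-1.0, 1.0)
p_cauchy = cauchy_cdf(lo_c) + (1.0 - cauchy_cdf(hi_c))

print(f"Fraction flagged under a Gaussian: {p_normal:.4f}")
print(f"Fraction flagged under a Cauchy:   {p_cauchy:.4f}")
```

Under this rule well under 1% of Gaussian data is flagged, while roughly one Cauchy point in six is. If real measurements have heavier tails than the bell curve, a rule calibrated on Gaussian assumptions will label many legitimate observations as outliers.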
ICA - Topic 3
● Candace Corbeil - Gaps in the Spreadsheet
● Gerhard Svolba - The origin, detection, treatment and consequences of
missing values in analytics.
● Discuss:
○ What are the three types of missing data?
○ What is multiple imputation, and how can it be useful for data that are missing at random?
○ In the case of systematic (non-random) missing data, would you still use multiple imputation? If not,
what else can you do?
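To make the multiple-imputation question concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data-generating model, the missingness probabilities, the choice of 20 imputations, and all variable names. It is also a simplified variant: each imputation adds residual noise to a single fitted regression line, but the regression coefficients themselves are not redrawn, so the pooled variance understates the uncertainty compared with a full multiple-imputation procedure.

```python
import random
import statistics

random.seed(0)

# Simulated data (assumption): y depends on x, and y is missing far more
# often when x > 0, so missingness depends only on the observed variable x.
n = 2000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2 * x + random.gauss(0, 1) for x in xs]
observed = [random.random() > (0.8 if x > 0 else 0.1) for x in xs]

obs_x = [x for x, o in zip(xs, observed) if o]
obs_y = [y for y, o in zip(ys, observed) if o]

# Naive estimate: average only the rows where y was observed.
complete_case_mean = statistics.mean(obs_y)

# Least-squares fit of y on x using the observed rows.
mx, my = statistics.mean(obs_x), statistics.mean(obs_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(obs_x, obs_y))
         / sum((x - mx) ** 2 for x in obs_x))
intercept = my - slope * mx
resid_sd = statistics.stdev(
    [y - (intercept + slope * x) for x, y in zip(obs_x, obs_y)]
)

# Build m completed data sets, filling each missing y with the regression
# prediction plus random residual noise, and estimate the mean on each.
m = 20
estimates, within = [], []
for _ in range(m):
    completed = [
        y if o else intercept + slope * x + random.gauss(0, resid_sd)
        for x, y, o in zip(xs, ys, observed)
    ]
    estimates.append(statistics.mean(completed))
    within.append(statistics.variance(completed) / n)

# Rubin's rules: pool the estimates and combine within- and
# between-imputation variance.
pooled_mean = statistics.mean(estimates)
between = statistics.variance(estimates)
total_var = statistics.mean(within) + (1 + 1 / m) * between

print(f"complete-case mean of y: {complete_case_mean:.3f}")
print(f"pooled imputed mean:     {pooled_mean:.3f}")
print(f"pooled standard error:   {total_var ** 0.5:.3f}")
```

The true mean of y in this simulation is 0. The complete-case mean is pulled well below that because large-x rows are mostly missing, while the pooled estimate stays close to 0 because the imputation model uses x. The same approach would not remove the bias if missingness depended on the unobserved y values themselves.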