Lecture Notes 1 2 Intro Python
INTRODUCTION
I. BASICS OF MACHINE LEARNING
Machine Learning is the science and art of programming computers to learn from data.
Examples:
• bank pre-approval for a loan: approved vs. not approved (supervised, classification)
• bank pre-approval for a loan amount (supervised, regression)
• spam filter (supervised, classification)
• document topic modeling (unsupervised)
• building an intelligent bot for a game (reinforcement learning)
ML is about collecting data and using it not only for analysis, but to perform a task, such as making predictions.
Why is ML so important, useful, and popular these days, and how is it different from traditional
approaches? In the 1990s, scientists working on image analysis and spam filters wrote code in
which they hand-crafted the rules the computer should follow; nowadays, scientists write code
asking the computer to figure out from the data why an image is a face. Two developments make
this possible:
• a great amount of available data
• tremendous computational power
Skills:
Exploratory data analysis and visualization (discover and visualize the data to get insights)
o techniques depend on whether data is categorical or numerical: charts, graphs, tables,
numerical measures (average, standard deviation, min, max, range, quartiles, etc.)
o Pie chart showing the class level of students at some university
o Bar chart showing the number of male and female students at UHD enrolled each
year, from 2010 to 2021.
o Histogram showing the number of diamonds of a certain carat value
o Box-and-whiskers diagram showing the distribution of hours students spent on
HW last week
o Scatter plot showing diamond price vs. its carat value
http://abyss.uoregon.edu/~js/glossary/correlation.html
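The numerical measures listed above (average, standard deviation, min, max, range, quartiles) take only a few lines of NumPy. A minimal sketch, using a made-up sample of homework hours:

```python
import numpy as np

# Hypothetical sample: hours students spent on homework last week
hours = np.array([2, 5, 3, 8, 4, 6, 7, 3, 5, 10])

print("mean:", hours.mean())                      # average
print("std:", hours.std(ddof=1))                  # sample standard deviation
print("min/max:", hours.min(), hours.max())
print("range:", hours.max() - hours.min())
print("quartiles:", np.percentile(hours, [25, 50, 75]))
```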
Clustering
• It takes unlabeled data and returns a grouping of data
• We are not given any a priori class labels; instead, we want to find the “natural” groups,
called clusters, within the data
• Applications:
o grouping customers based on their purchasing behavior to send customized
targeted advertisements to each group
• Algorithms: K-means, Hierarchical Clustering
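As a sketch of the first algorithm listed, here is a minimal K-means loop in plain NumPy; the two synthetic "customer" groups and the simple pick-k-points initialization are illustrative assumptions, not a production recipe:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # initialize centers as k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic groups of 2-D points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
centers, labels = kmeans(X, k=2)
```

With well-separated groups like these, the algorithm recovers the two natural clusters regardless of which points initialize the centers.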
Association Rule Mining
• Market basket analysis: data consists of transactions; given that the customer purchased
burger and chips, predict what other items the customer is likely to buy
https://www.analyticsvidhya.com/blog/2014/08/effective-cross-selling-market-basket-analysis/
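A tiny illustration of the basket idea: the standard support and confidence measures computed by hand over a handful of made-up transactions (the item names and helper functions are illustrative, not from any library):

```python
# Hypothetical market-basket transactions, one set of items per customer
transactions = [
    {"burger", "chips", "soda"},
    {"burger", "chips"},
    {"burger", "soda"},
    {"chips", "soda"},
    {"burger", "chips", "ketchup"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Given that a customer bought burger and chips, how likely is ketchup?
print(confidence({"burger", "chips"}, {"ketchup"}))
```

Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds.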
Dimensionality Reduction
• Principal Component Analysis: topic modeling (Latent Semantic Analysis in NLP)
https://www.datacamp.com/tutorial/discovering-hidden-topics-python
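A minimal PCA sketch using NumPy's SVD; the synthetic data set, which varies almost entirely along a single direction, is an illustrative assumption:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # coordinates in the reduced space

# Synthetic 3-D data that mostly varies along one direction
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + rng.normal(scale=0.01, size=(100, 3))
Z = pca(X, n_components=1)             # 100 x 1: one component keeps almost all variance
```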
Feature selection/extraction includes methods that select relevant features and discard the
irrelevant features in the data
• For example, assume that our task is to select features for predicting the mileage of a car
and we are given data that includes engine capacity, top speed, and color; color is
presumably irrelevant to mileage and can be discarded
• Types of feature selection methods:
o true selection methods – choose a subset of all the features measured
o projection or embedding methods – compute linear or nonlinear combinations of
the features measured and then select a subset of these combinations
https://www.tibco.com/reference-center/what-is-a-neural-network
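One simple way to act out the car example in code is to rank each feature by its absolute correlation with the target; the data below are simulated under the stated assumption that color is unrelated to mileage (all numbers here are made up):

```python
import numpy as np

# Hypothetical car data: engine capacity, top speed, color (as an arbitrary code)
rng = np.random.default_rng(0)
n = 200
engine = rng.uniform(1.0, 5.0, n)             # liters
top_speed = 120 + 25 * engine + rng.normal(0, 5, n)
color = rng.integers(0, 6, n)                 # color code: irrelevant by construction
mileage = 60 - 8 * engine + rng.normal(0, 2, n)

# Rank features by absolute correlation with the target
for name, col in zip(["engine", "top_speed", "color"], [engine, top_speed, color]):
    r = np.corrcoef(col, mileage)[0, 1]
    print(f"{name:10s} |r| = {abs(r):.2f}")
```

A true selection method would keep the highly correlated features and drop color; a projection method such as PCA would instead combine the correlated features engine capacity and top speed into a single component.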
Model evaluation
https://scikit-learn.org/stable/modules/cross_validation.html
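The k-fold cross-validation described at the link can be hand-rolled in a few lines; the predict-the-training-mean "model" below is only a placeholder to keep the sketch self-contained:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Evaluate a trivial model (predict the training mean) with 5-fold CV
y = np.arange(20, dtype=float)
errors = []
for train, test in kfold_indices(len(y), k=5):
    pred = y[train].mean()                       # "fit" on the training folds
    errors.append(np.mean((y[test] - pred) ** 2))  # score on the held-out fold
print("mean CV MSE:", np.mean(errors))
```

In practice scikit-learn's `cross_val_score` and `KFold` do this (and more) for you.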
III. MAIN CHALLENGES IN MACHINE LEARNING
• Insufficient quantity of data – it takes a lot of data for most ML models to work properly
o M. Banko, E. Brill, “Scaling to very very large corpora for natural language
disambiguation”, ACL '01: Proceedings of the 39th Annual Meeting on Association
for Computational Linguistics (July 2001), pages 26–33.
• Nonrepresentative training data
o The training data must be representative of the new data we want to generalize to.
o Example: Literary Digest poll for the US presidential election in 1936; 2.4 million
completed surveys predicted that Landon would get 57% of the votes; Roosevelt
won with 62% of the votes.
• Poor quality data and irrelevant data - “garbage in, garbage out”
o outliers, missing values, etc.
o feature selection/extraction
• There is no universally best model
o D. H. Wolpert, W. G. Macready, "No Free Lunch Theorems for Optimization",
IEEE Transactions on Evolutionary Computation 1, 67 (1997).
• Overfitting and underfitting
https://www.kaggle.com/getting-started/166897
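One way to see both failure modes is to fit polynomials of increasing degree to noisy quadratic data: a degree-1 line underfits, while a high-degree polynomial drives the training error down by fitting the noise, at the expense of generalization (the data below are simulated):

```python
import numpy as np

# Noisy samples from a quadratic relationship
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x ** 2 + rng.normal(0, 1, size=30)
x_new = np.linspace(-3, 3, 300)       # fresh, noise-free data for evaluation
y_new = x_new ** 2

results = {}
for degree in (1, 2, 10):
    coef = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coef, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coef, x_new) - y_new) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:6.2f}, test MSE {test_mse:6.2f}")
```

The training error always falls as the degree grows, but the error on new data is smallest near the true model complexity (degree 2).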
References and Reading Material:
2. PYTHON TUTORIAL
Look at Python tutorial codes (courtesy of Dr. Randy Davila).