DSRT - 734 - Residency Week - Second - Presentation

School of Computer & Information Sciences
DSRT 734-Inferential Statistics in Decision Making
UC Residency Schedule Feb 14 – 16, 2020
2020 Summer_MAIN_DSRT 734-Inferential Statistics in Decision Making
Location: Virtual
Friday: 5PM – 10PM
Saturday: 8AM – 7:30PM
Sunday: 8AM – 1:00PM
Major Break Times
- Lunch @ 12:45 PM
- Lunch on Saturday only
Note: If you cannot attend this scheduled residency session for the course, please refer to the Academic Department to attend the make-up session.
Residency Rules
• Residency session is 60% of the course learning experience
• The assignment includes research/case analysis/industry project
– Students should complete collaboratively in groups
– 4 students per group (max)
• The deliverable of the project includes
– the group presentation
– the research paper by each individual
• The research paper is 10-15 pages, to be submitted by Week 14.
• Final Presentations are:
– 20 minutes for each group
– Delivered in the final session 8am - 1:00pm on Sunday.
Statistical Methods Review
• Some well-known statistical tests and procedures are:
– Analysis of variance (ANOVA)
– Chi-squared test
– Correlation
– Factor analysis
– Mann–Whitney U
– Mean square weighted deviation (MSWD)
– Pearson product-moment correlation coefficient
– Regression analysis
– Etc…
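As a quick refresher, two of the procedures listed above can be computed by hand. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the course material, and in practice a statistics package would also report p-values.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def chi_squared_statistic(table):
    """Chi-squared test statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

The chi-squared statistic is then compared against a chi-squared distribution with (rows - 1) x (columns - 1) degrees of freedom.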
Residency Grading
Stage | Points | Due | Requirements | Output
Pre-Residency Assessment | 10 | 6:00 pm Friday June 5th | Complete pre-residency assessment | Fill out the iLearn questionnaire
Quiz #2 | | | |
Identify a data set for analysis and define the statistical methods | 20 | 11:00 am Saturday June 6th; each group presents 7-8 mins | Each team explains the data set it chose, analyzes the data attributes, visualizes, correlates variables, and proposes an ML approach | ~5 slides per team
Design Solution | 40 | 9:00 am – 12:00 noon Sunday June 7th; 20 minutes per group | Each team presents full analysis with comparison of ML algorithms |
Post-Residency Assessment | 10 | 1:00 pm Sunday June 7th | Complete post-residency assessment | Fill out the iLearn questionnaire
Agenda - Overall
Friday June 5th
5:00 pm Introductions
5:30 pm Pre-Residency assessment
6:00 pm Machine Learning Review
6:45 pm Group Planning & Q&A
7:25 pm Q&A Session – Begin Research
10:00 pm Session End
Saturday June 6th
8:00 am Effective Presentations
8:30 am Continue research, presentation
11:00 am Group presentations – explain and analyze, visualize your data set – and you will analyze the data
Saturday continued
12:30 pm Feedback
12:45 pm Lunch
1:45 pm Continue Group work
7:30 pm Session End
Sun June 7th
8:00 am Group Presentations
• Each Group presents 20 mins – Each group is graded on how they present their results
• Q&A
12:30 pm Post-Residency Assessment
1:00 pm Wrap Up
Friday June 5th
5:00 pm Introductions
• Everyone introduces themselves
• Name
• Professional Background
• Why taking this course?
• What have you learnt so far?
• Skills & preferences
• Lead?
• Present?
• Process data?
• Design Charts and Slides
Pre-Residency assessment
• Go into iLearn and fill out Pre-assessment
• How does this course help you professionally?
• What are your course expectations?
1. Heart Disease Dataset
https://www.kaggle.com/ronitf/heart-disease-uci
• This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date.
• The "goal" field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
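The presence/absence distinction described above amounts to collapsing the 0-4 "goal" value into a binary label. A minimal sketch, assuming the goal values are already loaded as integers; the function name and the sample values are made up for illustration.

```python
def to_binary_target(goal):
    """Map the 0-4 'goal' value to 0 (absence) or 1 (presence)."""
    if goal not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected goal value: {goal!r}")
    return 0 if goal == 0 else 1


goals = [0, 2, 1, 0, 4, 3]  # made-up example values, not real patient data
labels = [to_binary_target(g) for g in goals]
```

The resulting 0/1 label is what a binary classifier (for example logistic regression) would be trained to predict.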
2. Mall Customer Segmentation Data
Market Basket Analysis
https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/kernels?sortBy=hotness&group=everyone&pageSize=20&datasetId=42674&language=R
• You own a supermarket mall
– Through membership cards, you have some basic data about your customers, such as Customer ID, age, gender, annual income and spending score.
• Spending Score
– A score you assign to each customer based on parameters you define, such as customer behavior and purchasing data.
• Problem Statement
– You own the mall and want to understand which customers can be easily converted [Target Customers], so that this understanding can be passed to the marketing team, which can plan its strategy accordingly.
3. International football results from 1872 to 2018
https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017
• This dataset includes 39,669 results of international football matches, starting from the very first official match in 1872 up to 2018.
• The matches range from FIFA World Cup to FIFI Wild Cup to regular friendly matches.
• The matches are strictly men's full internationals, and the data does not include Olympic Games or matches where at least one of the teams was the nation's B-team, U-23 or a league select team.
• results.csv includes the following columns:
– date - date of the match
– home_team - the name of the home team
– away_team - the name of the away team
– home_score - full-time home team score including extra time, not including penalty-shootouts
– away_score - full-time away team score including extra time, not including penalty-shootouts
– tournament - the name of the tournament
– city - the name of the city/town/administrative unit where the match was played
– country - the name of the country where the match was played
– neutral - TRUE/FALSE column indicating whether the match was played at a neutral venue
• Goal: Soccer predictions for the FIFA World Cup 2018
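A prediction task needs a target variable, and one natural choice is the match outcome derived from home_score and away_score. The sketch below parses rows in the results.csv column layout listed above; the three sample rows and the outcome labels are invented for illustration, and a real analysis would open the downloaded file instead.

```python
import csv
import io

# Invented rows in the results.csv column layout; not real match data.
SAMPLE = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
2000-01-01,Team A,Team B,2,1,Friendly,City X,Country A,FALSE
2000-01-02,Team C,Team A,0,0,Friendly,City Y,Country D,TRUE
2000-01-03,Team B,Team C,1,3,Friendly,City Z,Country B,FALSE
"""


def match_outcome(home_score, away_score):
    """Label a match from the home team's point of view."""
    if home_score > away_score:
        return "home_win"
    if home_score < away_score:
        return "away_win"
    return "draw"


def load_matches(handle):
    """Read matches, convert types, and attach an 'outcome' field."""
    matches = []
    for row in csv.DictReader(handle):
        row["home_score"] = int(row["home_score"])
        row["away_score"] = int(row["away_score"])
        row["neutral"] = row["neutral"].strip().upper() == "TRUE"
        row["outcome"] = match_outcome(row["home_score"], row["away_score"])
        matches.append(row)
    return matches


matches = load_matches(io.StringIO(SAMPLE))
```

Because the neutral column is converted to a boolean, it can be used directly as a feature when modelling home advantage.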
4. Students Performance in Exams
https://www.kaggle.com/spscientist/students-performance-in-exams/home
• Context
– Marks secured by the students
• Content
– This data set consists of the marks secured by the students in various subjects.
• Acknowledgements
– http://roycekimmons.com/tools/generated_data/exams
• Inspiration
– Understand the influence of the parents' background, test preparation, etc. on students' performance
– Use the student data of test results to create a fictitious variable of pass or fail.
– Predict whether a student passes or fails using any classification method
https://www.kaggle.com/katiej277/classification-in-r-logistic-regression-and-lda
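Creating the pass/fail variable can be sketched as below. The pass mark of 50, the use of the average across subjects, and the column names "math score", "reading score" and "writing score" are all assumptions made for illustration; check them against the downloaded file and choose a threshold your group can justify.

```python
PASS_MARK = 50  # assumed threshold, not specified in the slides


def add_pass_fail(student, pass_mark=PASS_MARK):
    """Return a copy of the record with an average score and a pass/fail label."""
    scores = [
        student["math score"],
        student["reading score"],
        student["writing score"],
    ]
    average = sum(scores) / len(scores)
    labeled = dict(student)  # copy so the input record is left unchanged
    labeled["average"] = average
    labeled["result"] = "pass" if average >= pass_mark else "fail"
    return labeled


# Made-up records for illustration only.
students = [
    {"math score": 72, "reading score": 80, "writing score": 75},
    {"math score": 40, "reading score": 45, "writing score": 38},
]
labeled = [add_pass_fail(s) for s in students]
```

The "result" field then serves as the class label for a classifier such as logistic regression or LDA.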
5. Bank Marketing
https://www.kaggle.com/henriqueyamahata/bank-marketing
Bank client data:
• Age (numeric)
• Job : type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
• Marital : marital status (categorical: 'divorced', 'married', 'single', 'unknown' ; note: 'divorced'
means divorced or widowed)
• Education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate',
'professional.course', 'university.degree', 'unknown')
• Default: has credit in default? (categorical: 'no', 'yes', 'unknown')
• Housing: has housing loan? (categorical: 'no', 'yes', 'unknown')
• Loan: has personal loan? (categorical: 'no', 'yes', 'unknown')
y - has the client subscribed to a term deposit? (binary: 'yes', 'no')
6. Breast Cancer Wisconsin (Diagnostic) Data
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
On UCI Machine Learning Repository:

Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
• All feature values are recoded with four significant digits.
• Missing attribute values: none
• Class distribution: 357 benign, 212 malignant
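The class distribution gives a useful reference point before comparing ML algorithms: a model that always predicts the majority class sets the accuracy any real classifier should beat. A minimal sketch using the counts stated above; the function name is illustrative.

```python
class_counts = {"benign": 357, "malignant": 212}  # counts stated on the slide


def majority_baseline(counts):
    """Return the most frequent class and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


label, accuracy = majority_baseline(class_counts)
print(f"Always predicting '{label}' is right {accuracy:.1%} of the time")
```

With 569 samples in total, the baseline works out to roughly 63%, so accuracy alone can look deceptively good on this data set.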
7. IBM HR Analytics Employee Attrition & Performance
https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
• Uncover the factors that lead to employee attrition and explore important questions
such as ‘show me a breakdown of distance from home by job role and attrition’ or
‘compare average monthly income by education and attrition’.
This is a fictional data set created by IBM data scientists.
• Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'
• Environment Satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• Job Involvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• Job Satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• Performance Rating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'
• Relationship Satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
• Work Life Balance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'
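A question such as "compare average monthly income by education and attrition" can be answered by decoding the integer codes and grouping. The sketch below assumes the columns are named Education, Attrition and MonthlyIncome, which should be verified against the downloaded file; the three rows are made up for illustration.

```python
from collections import defaultdict

# Code-to-label mapping taken from the list above.
EDUCATION = {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"}


def mean_income_by_group(rows):
    """Average monthly income for each (education label, attrition) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        key = (EDUCATION[row["Education"]], row["Attrition"])
        totals[key][0] += row["MonthlyIncome"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up rows for illustration only.
rows = [
    {"Education": 3, "Attrition": "No", "MonthlyIncome": 5000},
    {"Education": 3, "Attrition": "No", "MonthlyIncome": 7000},
    {"Education": 4, "Attrition": "Yes", "MonthlyIncome": 4000},
]
summary = mean_income_by_group(rows)
```

The same mapping approach applies to the other coded fields, such as Job Satisfaction or Work Life Balance.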
8. House Sales in King County, USA
https://www.kaggle.com/harlfoxem/housesalesprediction
• This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
• Predict house price using regression
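As a starting point for the regression task, a single-feature least-squares fit can be written in a few lines. This is a minimal sketch, assuming living area as the only predictor; the numbers are made up, and a full analysis would use several features and a statistics library.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up living areas (square feet) and sale prices, for illustration only.
sqft = [1000, 1500, 2000, 2500]
price = [200000, 300000, 400000, 500000]

slope, intercept = fit_line(sqft, price)
predicted = slope * 1800 + intercept  # predicted price for an 1800 sq ft home
```

On real data the points will not fall on a line, so the fit should be checked with residual plots and a held-out test set.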
Questions?