Lecture 1: Introduction to Data Science

The document discusses the importance of data in improving lives, making informed decisions, and monitoring quality across various sectors. It highlights the concept of data as a valuable resource that requires processing to unlock its potential and presents applications of data science in predicting trends and analyzing behavior. Additionally, it outlines the skill sets needed for data science and provides examples of datasets for practical projects.

Uploaded by

Gokul Gokul

Introduction to Data Science
Why is Data Important?
● Improve people’s lives
○ Health monitoring, AI-based diagnosis
● Make informed decisions
○ Data-driven decisions, such as identifying a problematic product
feature from users’ reviews and redesigning it
● Quality monitoring
○ TDS alarms, low-cartridge alarms, monitoring a complex system with
a large number of parameters
● Measure the effectiveness of a strategy
○ The effectiveness of a strategy can be measured using different parameters
Why is Data Important?
● Find the cause of a problem
○ A sudden problem can be due to recent changes. If more child
deaths are reported, the cause could be the wrong application
of a certain medicine or understaffing
● Stop guessing
○ “I think this would work” – no more trial and error; go with the data.
● Effective resource utilization
○ Data helps decide how to use critical resources more
effectively
Why is Data Important?
● Add-on menu in a hostel
○ Requirements may vary depending on the menu, day, month, festival,
or vacation
● Data and elections
○ Data not only helps predict an election result; it may also help you
win an election
○ Use psychometric tests to identify behavioural traits of a control
group: impatient? risk averse? easily influenced by authority figures?
strongly opinionated?
○ Test your planned adverts on this control group and measure their
effectiveness.
○ If you want them to back a political candidate, measure how likely
they are to vote for that candidate after seeing the ad.
○ Analyse the control group’s social media data
Why is Data Important?
● Example of OLA/UBER/OYO
● Motivation from Google/Facebook/Amazon

Data is the new oil

● Clive Humby, the British data scientist, was the
first to coin the phrase “Data is the New Oil”
● Humby highlighted the fact that, although
inherently valuable, data needs processing,
just as oil needs refining before its true value
can be unlocked.
What Can We Do with Data?
● Google Flu Trends: detecting outbreaks two weeks ahead of the
CDC.
● New models estimate which cities are most at risk for the
spread of the Ebola virus.
● Recommender systems (Netflix [which movie to watch],
Amazon [which product to buy], Facebook [suggesting friends])
● Prediction systems (maps [predicting the best route based on
traffic conditions], where to invest, weather forecasts)
What Can We Do with Data?
● Opinion mining and sentiment analysis on social
media data
● Diagnosis -> from a set of medical
examinations and knowledge about different
diseases.
● Software log data -> automatic troubleshooting
(Splunk)
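To make sentiment analysis of social media data concrete, here is a minimal Python sketch of a word-list (lexicon) scorer. It is only an illustration under stated assumptions: the two word lists and the sample posts are made up for this example, and practical systems rely on much larger lexicons or trained models.

```python
import re

# Tiny illustrative lexicons (hypothetical); real systems use far larger ones.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def sentiment_score(text):
    """Count of positive words minus count of negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical posts, not taken from any real data set.
posts = [
    "Love the new update, great work!",
    "Terrible battery life and slow charging.",
    "It arrived on Tuesday.",
]
for post in posts:
    print(label(post), "-", post)
```

This approach ignores negation and sarcasm ("not good" still counts as positive), which is one reason messy real-world text is harder than it first appears.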
Where Does Data Come From?
● It’s all happening online: every action you perform online
○ Click ….
○ Fast forward, pause, ….
○ Server request
○ Transaction
○ Network message
○ …
● User-generated data
● Mobile
● Internet of Things / M2M
● Health/scientific computing
Datafication
● How to quantify friendship?
● How to rate a product?
● Taking all aspects of life and turning them
into data
○ Google’s augmented-reality glasses datafy the gaze
○ LinkedIn datafies our professional network
● When we like something or someone online,
we are helping to datafy it.
How Big Is the Data?
● 2.5 exabytes (1 exabyte = 10^18 bytes) of data are created each day
● Internet
○ More than 3.7 billion humans use the internet
○ On average, Google now processes more than 3.5 billion searches per day
● Social media (monthly active users)
● Communication (every minute)
○ We send 16 million text messages
○ There are 990,000 Tinder swipes
○ 156 million emails are sent; worldwide, 2.9 billion email
users are expected by 2019
○ 15,000 GIFs are sent via Facebook Messenger
○ 103,447,520 spam emails are sent
○ There are 154,200 calls on Skype
● Digital photos
○ People take around 1.2 trillion photos per year
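To get a feel for these magnitudes, the daily volume can be converted into other units. The following minimal Python sketch uses only two figures quoted on this slide (2.5 exabytes per day and 3.7 billion internet users). The variable and function names are illustrative, and the per-user value is a plain average, not a measured quantity.

```python
EXABYTE = 10**18    # bytes, as defined on the slide
TERABYTE = 10**12   # bytes (decimal terabyte)
MEGABYTE = 10**6    # bytes (decimal megabyte)

bytes_per_day = 2.5 * EXABYTE   # figure quoted on the slide
internet_users = 3.7e9          # figure quoted on the slide

def bytes_per_second(daily_bytes):
    """Average creation rate in bytes per second."""
    return daily_bytes / (24 * 60 * 60)

def bytes_per_user(daily_bytes, users):
    """Average bytes created per user per day."""
    return daily_bytes / users

rate_tb_s = bytes_per_second(bytes_per_day) / TERABYTE
per_user_mb = bytes_per_user(bytes_per_day, internet_users) / MEGABYTE

print(f"{rate_tb_s:.1f} TB per second")                    # 28.9 TB per second
print(f"{per_user_mb:.0f} MB per internet user per day")   # 676 MB per internet user per day
```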
Data generated in a Day
The Data Equation
● Oceans of Data (Praia de Forte, Brazil)
● Rivers of Information (Doubtful Sound, New Zealand)
● Streams of Knowledge (Wasatch, Utah, USA)
● Drops of Understanding
(Nix 1984)
What is Data Science?
Like any emerging field, it isn’t yet well defined,
but incorporates elements of:
● Exploratory Data Analysis and Visualization
● Machine Learning and Statistics
● High-Performance Computing technologies
for dealing with scale.
What is Data Science?
● Data science is an interdisciplinary field that uses
scientific methods, processes, algorithms and systems to
extract knowledge and insights from data in various forms.

● Data science is a "concept to unify statistics, data
analysis, machine learning and their related methods" in
order to "understand and analyze actual phenomena" with
data. It employs techniques and theories drawn from many
fields within the context of mathematics, statistics,
information science, and computer science.
Skill Sets for Data Science
Appreciating Data
Computer Scientists do not naturally appreciate
data: it’s just stuff to run through a program.
The usual way to test algorithm performance is
to run the implementation on “random data”.
But interesting data sets are a scarce resource,
which requires hard work and imagination to
obtain.
Computer vs. Real Scientists (1)
● Scientists strive to understand the
complicated and messy natural world, while
computer scientists build their own clean and
organized virtual worlds. Thus:
● Nothing is ever completely true or false in
science, while everything is either true or
false in Computer Science / Mathematics.
Computer vs. Real Scientists (2)
● Scientists are data-driven, while computer
scientists are algorithm-driven.
● Scientists obsess about discovering things,
which computer scientists invent rather than
discover.
● Scientists are comfortable with the idea that
data has errors; computer scientists are not.
Asking Good Questions
Software developers are not encouraged to ask
questions, but data scientists are:
● What exciting things might you be able to
learn from a given data set?
● What things do you/your people really want
to know?
● What data sets might get you there?
Let’s Practice Asking Questions!
Who, What, Where, When, and Why on the
following datasets:
● Baseball-reference.com
● Internet Movie Database (IMDb)
● NYC taxi cab records
Baseball-Reference.com: biosketch
Statistical Record of Play
Summary statistics of each year’s batting, pitching, and
fielding record, with teams and awards.
Baseball Questions
● How best to measure an individual player’s skill,
value or performance?
● How fairly do trades between teams work out?
● What is the trajectory of players’
performance as they mature and age?
● To what extent does batting performance
correlate with the position played?
Demographic Questions
● Do left-handed people have shorter lifespans
than right-handers?
● How often do people return to where they
were born?
● Do player salaries reflect past, present, or
future performance?
● Are heights and weights increasing in the
population?
IMDb: Movie Data
IMDb: Actor Data
Movie Questions
● Can we predict how well people will like a
movie? What about its gross?
● What does the social network of actors look
like?
● What is the age distribution of actors and
actresses in film?
● Do stars live longer or shorter lives than the
bit players or public?
NYC Taxi Cab Data
● Gives driver/owner, pickup/dropoff location,
and fare data for every taxi trip taken.
● Data obtained from NYC via a Freedom of
Information Act (FOIA) request
Taxicab Questions
● How much do drivers make each night?
● How far do they travel?
● How much slower is traffic during rush hour?
● Where are people traveling to/from at
different times of the day?
● Do faster drivers get tipped better?
● Where should drivers go to pick up their next
fare?
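The rush-hour question can be made concrete with a small calculation. The following is a minimal Python sketch, assuming each trip record carries a pickup hour, a distance in miles, and a duration in minutes. The record layout, the rush-hour windows, and the sample trips are all hypothetical and are not drawn from the NYC data set.

```python
from statistics import mean

# Hypothetical trips: (pickup_hour, distance_miles, duration_minutes)
trips = [
    (8, 2.0, 20.0),
    (9, 3.0, 30.0),
    (13, 2.0, 10.0),
    (17, 1.5, 15.0),
    (22, 4.0, 12.0),
    (23, 6.0, 18.0),
]

# Assumed rush-hour windows: 07:00-09:59 and 16:00-18:59.
RUSH_HOURS = set(range(7, 10)) | set(range(16, 19))

def speed_mph(distance_miles, duration_minutes):
    """Average speed of one trip in miles per hour."""
    return distance_miles / (duration_minutes / 60.0)

def average_speeds(trips):
    """Return (mean rush-hour speed, mean off-peak speed) over the trips."""
    rush, off_peak = [], []
    for hour, dist, mins in trips:
        bucket = rush if hour in RUSH_HOURS else off_peak
        bucket.append(speed_mph(dist, mins))
    return mean(rush), mean(off_peak)

rush_avg, off_avg = average_speeds(trips)
print(f"rush hour: {rush_avg:.1f} mph, off-peak: {off_avg:.1f} mph")
print(f"slowdown: {off_avg - rush_avg:.1f} mph")
```

The sketch averages per-trip speeds, so short and long trips count equally. Weighting by distance or time is an equally defensible choice and can change the answer, which is exactly the kind of decision a data scientist has to justify.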
Projects

● Datasets:
○ https://www.kaggle.com/datasets?tagids=3022
○ https://www.data.gov/
○ https://data.gov.in/
● Some Project Ideas:
○ https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
○ KDnuggets
○ https://www.analyticsindiamag.com/popular-data-science-projects-for-aspiring-data-scientists
Course Evaluation
● Quiz: 4-6 (40)
● Mid Sem: quiz (10) + project/assignment (15)
● End Sem: quiz (15) + project/assignment (20)
Reference Books
● The Data Science Design Manual, Skiena
● Probability and Statistics for Engineers and
Scientists, Ronald E Walpole, Raymond H
Myers, Sharon L Myers, Keying E Ye
