Data Science and Visualization
Why Now?
We have a lot of data from both our online and offline activities, and we also have cheap and powerful
computers to analyze it. This data comes from many areas like shopping, social media, finance, medicine,
and more.
The large amount of data helps us understand human behavior better and create new data-driven products,
such as recommendations on Amazon and Facebook, credit ratings, and personalized learning in
education.
We're entering a period where our behavior influences these data products, and these products, in turn,
influence our behavior, creating a continuous feedback loop. This is possible due to advances in
technology and widespread acceptance of it in our lives, which wasn't the case ten years ago.
Given this impact, we need to seriously consider how data is used and the ethical and technical
responsibilities of those handling it.
Datafication is the process of turning various aspects of our lives into data that can be collected, analyzed,
and used. For example, when you like something on social media or use services like Google Glass,
your actions generate data. This data can then be analyzed for various purposes, often by businesses or
organizations to improve their services or sell products.
The key idea is that everything we do—whether intentionally or not—can be turned into data. This
includes both what we choose to share, like our social media activities, and what happens without our
direct input, like being tracked by sensors or cameras.
The authors argue that once something is datafied, it can be used to create new value, often for
businesses. This value is usually about improving efficiency or making money. However, if we think
about datafication in a broader sense, involving everyone rather than just businesses, the benefits and
implications can be more complex and might not always align with business goals.
In Academia:
- Who They Are: In academia, people who work with large amounts of data aren’t usually called
"data scientists" but may hold titles in other fields like statistics, computer science, or social
science. They often join interdisciplinary groups or institutes focused on data science.
- Role: Academic data scientists handle large and complex data sets, solving real-world problems
by applying computational techniques. Their work often spans different fields and can lead to new
research areas and PhD topics.
In Industry:
- Chief Data Scientist: This role involves setting the company's data strategy, managing data
infrastructure, ensuring data privacy, and integrating data into products. They also lead a team and
work closely with company leaders.
- General Data Scientist: Typically, they extract insights from data using statistical and machine
learning methods. They spend time cleaning and preparing data, exploring it to find patterns, and
building models to solve business problems. They also design experiments and communicate
findings clearly to help the company make data-driven decisions.
In Summary: A data scientist, whether in academia or industry, is someone who works with data
to solve problems. They use a mix of technical skills, like programming and statistics, and practical
skills, like communication and problem-solving, to turn data into actionable insights.
MODULE 2
Philosophy of EDA
This module covers the importance and purpose of Exploratory Data Analysis (EDA) in data science and
statistical work, particularly in large-scale contexts like those at Google.
Even with large datasets, such as those at Google, EDA is essential. While the
reasons for doing EDA on large-scale data are similar to those for smaller datasets, there
are additional motivations, such as debugging the data logging process.
EDA aids in understanding the data, which informs and improves algorithm
development. For instance, defining a concept like "popularity" for a ranking
algorithm requires understanding the data's behavior through EDA.
Analysts and data scientists are encouraged to make EDA a standard part of
their workflow.
Sanity Checks: EDA involves sanity checks to verify that the data is on the
expected scale and in the correct format.
Identifying Issues: EDA is crucial for finding missing data or outliers and for
summarizing the data effectively.
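A minimal sketch of these checks in Python, assuming pandas and a hypothetical listings.csv file; the column names and the z-score threshold are placeholders for illustration, not part of the original notes.

import pandas as pd

# Hypothetical tabular dataset; any CSV loaded into a DataFrame works the same way.
df = pd.read_csv("listings.csv")

# Sanity checks: is the data on the expected scale and in the correct format?
print(df.shape)       # number of rows and columns
print(df.dtypes)      # is each column the type we expect?
print(df.describe())  # are the ranges plausible (no negative prices, impossible values, etc.)?

# Identifying issues: missing values and a rough count of outliers per column.
print(df.isnull().sum())                    # missing values per column
numeric = df.select_dtypes(include="number")
z = (numeric - numeric.mean()) / numeric.std()
print((z.abs() > 3).sum())                  # values more than 3 standard deviations from the mean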
EDA is a foundational step in the data analysis process, crucial for understanding, verifying,
and summarizing data. It distinguishes itself from data visualization by its exploratory nature
and is indispensable in both small and large-scale data analysis, with significant benefits for
debugging and algorithm development. Making EDA a standard practice is highly
encouraged for all data analysts and scientists.
The Data Science Process
1. Real World:
o People are engaging in various activities (e.g., using social media,
participating in events, etc.).
o Data is generated from these activities.
2. Raw Data is Collected:
o This includes all kinds of data like logs, records, emails, or genetic
information.
3. Data is Processed:
o The collected raw data is processed using various tools (Python, R, SQL) to
clean it up. This step involves organizing the data into a structured format
(like columns in a spreadsheet).
4. Clean Data:
o After processing, we get clean data, which is now ready for analysis.
5. Exploratory Data Analysis (EDA):
o EDA is performed to understand the data better, find patterns, and identify
issues like missing values or outliers. If issues are found, we might need to go
back and collect more data or clean it further.
6. Machine Learning Algorithms / Statistical Models:
o We apply machine learning algorithms or statistical models to the clean data.
The choice of model depends on the problem we are trying to solve (e.g.,
classification, prediction); a minimal sketch of steps 3 to 6 appears after this list.
7. Build Data Product:
o Sometimes, the goal is to create a data product (like a recommendation system
or spam classifier) that users interact with.
8. Communicate Findings:
o The results of the analysis are visualized and reported. This can be for internal
use (e.g., to inform a boss or colleagues) or external (e.g., academic papers,
public reports).
9. Make Decisions:
o Based on the analysis, decisions are made, which might involve implementing
a data product that feeds back into the real world.
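To make steps 3 through 6 concrete, here is a minimal sketch in Python, assuming pandas and scikit-learn; the file name raw_data.csv and the columns age, income, and purchased are hypothetical placeholders, and this is an illustration of the flow rather than a prescribed implementation.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Steps 3-4: process raw data into a clean, structured table (hypothetical file and columns).
raw = pd.read_csv("raw_data.csv")
clean = raw.dropna(subset=["age", "income", "purchased"])  # drop rows with missing fields

# Step 5: light EDA to understand the data before modeling.
print(clean.describe())
print(clean["purchased"].value_counts())  # is the label balanced?

# Step 6: fit a simple model suited to the problem (here, classification).
X = clean[["age", "income"]]
y = clean["purchased"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))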
Feedback Loop:
Influence of Models:
o Unlike some predictions (e.g., weather forecasts), data products can influence
the real world. For example, a recommendation system can change user
behavior, creating a feedback loop where new data is generated from these
interactions.
o It's important to account for this influence when building and analyzing
models, as the model itself can change the phenomenon it is observing.
Key Takeaway:
Data science is not just about analyzing data; it's about creating products and making
decisions that impact the real world. This process is iterative and involves continuous
feedback and improvement.
Case Study: RealDirect
Doug Perlson, with a background in real estate law, startups, and online advertising, founded
RealDirect to revolutionize the real estate market using data. Typically, people sell their homes
once every seven years with the help of brokers and current data. However, there are issues with
the broker system and the quality of data available. RealDirect addresses both problems effectively.
Traditional brokers operate independently and aggressively guard their data. Experienced brokers
might have slightly more data than their less experienced counterparts, but it’s still limited.
RealDirect solves this by hiring a team of licensed real estate agents who collaborate and pool their
knowledge. They built an interface for sellers, providing them with data-driven tips on selling their
homes and offering real-time recommendations based on interaction data.
These brokers become data experts, using tools to collect and access new and relevant information,
including recent data on co-op sales in NYC. One issue with publicly available data is its outdated
nature, as there’s a three-month lag between a sale and when the data is available. RealDirect is
working on real-time feeds to capture when people start searching for homes, initial offers, the
time between offer and closing, and how people search for homes online. This up-to-date
information benefits both buyers and sellers, as long as they’re honest.
RealDirect makes money by offering a subscription service to sellers for around $395 a month,
granting them access to selling tools. They also provide the option to use RealDirect’s agents at a
reduced commission, typically 2% of the sale price instead of the usual 2.5% or 3%. The pooling
of data allows RealDirect to optimize the process, reducing commission costs and increasing
volume.
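As a small worked example of the commission difference (the $1,000,000 sale price is hypothetical; only the 2% and 2.5% rates come from the description above):

# Hypothetical sale price, for illustration only.
price = 1_000_000
traditional = 0.025 * price      # typical 2.5% broker commission -> 25,000
realdirect = 0.02 * price        # RealDirect's reduced 2% commission -> 20,000
print(traditional - realdirect)  # 5,000 saved on this sale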
The website acts as a platform for managing the sale or purchase process, with various statuses like
active, offer made, offer rejected, showing, in contract, etc. Based on your status, the software
suggests different actions. However, RealDirect faces challenges, such as a New York law
requiring listings to be behind a registration wall. While this can be an obstacle, serious buyers are
likely to register. Competitors like Zillow do not pose a real threat since they only show listings
without additional services.
Despite facing resistance from traditional brokers who dislike RealDirect’s approach to cutting
commissions, the transparency of listings means brokers have no choice but to cooperate. Potential
buyers can see the listings elsewhere and would complain if brokers withheld information.
Doug highlighted key factors buyers care about, such as nearby parks, subway access, schools, and
price comparisons of apartments sold in the same building or block. RealDirect aims to cover these
data points as part of their service, continually improving the buying and selling experience
through data transparency and better information.
By combining licensed agents, data-driven tools, and real-time information, RealDirect enhances
the home buying and selling process, making it more efficient and transparent for all parties
involved. Despite the challenges and resistance from traditional brokers, RealDirect’s innovative
approach to leveraging data stands to transform the real estate market.
Linear Regression
3. Benefits:
Provides a more reliable estimate of model performance compared to a single
train-test split.
Helps in tuning model parameters and selecting the best model.
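The benefits listed above describe cross-validation. A minimal sketch, assuming scikit-learn and synthetic data, of comparing a single train-test split with 5-fold cross-validation for a linear regression model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data for illustration only: 200 rows, 3 features, known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# Single train-test split: one performance estimate, sensitive to how the split falls.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
single_score = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: five estimates, averaged for a more reliable picture.
cv_scores = cross_val_score(model, X, y, cv=5)
print("single split R^2:", single_score)
print("5-fold R^2:", cv_scores.mean(), "+/-", cv_scores.std())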