Data Science and Visualization
Why Now?
We have a lot of data from both our online and offline activities, and we also have cheap and powerful
computers to analyze it. This data comes from many areas like shopping, social media, finance, medicine,
and more.
The large amount of data helps us understand human behavior better and create new data-driven products,
such as recommendations on Amazon and Facebook, credit ratings, and personalized learning in
education.
We're entering a period where our behavior influences these data products, and these products, in turn,
influence our behavior, creating a continuous feedback loop. This is possible due to advances in
technology and widespread acceptance of it in our lives, which wasn't the case ten years ago.
Given this impact, we need to seriously consider how data is used and the ethical and technical
responsibilities of those handling it.
Datafication is the process of turning various aspects of our lives into data that can be collected, analyzed,
and used. For example, when you like something on social media or use services like Google Glass,
your actions generate data. This data can then be analyzed for various purposes, often by businesses or
organizations to improve their services or sell products.
The key idea is that everything we do—whether intentionally or not—can be turned into data. This
includes both what we choose to share, like our social media activities, and what happens without our
direct input, like being tracked by sensors or cameras.
The authors argue that once something is datafied, it can be used to create new value, often for
businesses. This value is usually about improving efficiency or making money. However, if we think
about datafication in a broader sense, involving everyone rather than just businesses, the benefits and
implications can be more complex and might not always align with business goals.
In Academia:
- Who They Are: In academia, people who work with large amounts of data aren’t usually called
"data scientists" but may hold titles in other fields like statistics, computer science, or social
science. They often join interdisciplinary groups or institutes focused on data science.
- Role: Academic data scientists handle large and complex data sets, solving real-world problems
by applying computational techniques. Their work often spans different fields and can lead to new
research areas and PhD topics.
In Industry:
- Chief Data Scientist: This role involves setting the company's data strategy, managing data
infrastructure, ensuring data privacy, and integrating data into products. They also lead a team and
work closely with company leaders.
- General Data Scientist: Typically, they extract insights from data using statistical and machine
learning methods. They spend time cleaning and preparing data, exploring it to find patterns, and
building models to solve business problems. They also design experiments and communicate
findings clearly to help the company make data-driven decisions.
In Summary: A data scientist, whether in academia or industry, is someone who works with data
to solve problems. They use a mix of technical skills, like programming and statistics, and practical
skills, like communication and problem-solving, to turn data into actionable insights.
MODULE 2
Philosophy of EDA
This module covers the importance and purpose of Exploratory Data Analysis (EDA) in data science and
statistical work, particularly in large-scale contexts like those at Google.
Even with large datasets, such as those at Google, EDA is essential. While the
reasons for doing EDA on large-scale data are similar to those for smaller datasets, there
are additional motivations, such as debugging the data logging process.
EDA aids in understanding the data, which informs and improves algorithm
development. For instance, defining a concept like "popularity" for a ranking
algorithm requires understanding the data's behavior through EDA.
Analysts and data scientists are encouraged to make EDA a standard part of
their workflow.
Sanity Checks: EDA involves sanity checks to verify that the data is on the
expected scale and in the correct format.
Identifying Issues: EDA is crucial for finding missing data or outliers and for
summarizing the data effectively.
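A minimal sketch of these checks in Python, assuming pandas and a hypothetical listings.csv file; the column names and the z-score threshold are placeholders for illustration, not part of the original notes.

import pandas as pd

# Hypothetical tabular dataset; any CSV loaded into a DataFrame works the same way.
df = pd.read_csv("listings.csv")

# Sanity checks: is the data on the expected scale and in the correct format?
print(df.shape)       # number of rows and columns
print(df.dtypes)      # is each column the type we expect?
print(df.describe())  # are the ranges plausible (no negative prices, impossible values, etc.)?

# Identifying issues: missing values and a rough count of outliers per column.
print(df.isnull().sum())                    # missing values per column
numeric = df.select_dtypes(include="number")
z = (numeric - numeric.mean()) / numeric.std()
print((z.abs() > 3).sum())                  # values more than 3 standard deviations from the mean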
EDA is a foundational step in the data analysis process, crucial for understanding, verifying,
and summarizing data. It distinguishes itself from data visualization by its exploratory nature
and is indispensable in both small and large-scale data analysis, with significant benefits for
debugging and algorithm development. Making EDA a standard practice is highly
encouraged for all data analysts and scientists.
The Data Science Process
1. Real World:
o People are engaging in various activities (e.g., using social media,
participating in events, etc.).
o Data is generated from these activities.
2. Raw Data is Collected:
o This includes all kinds of data like logs, records, emails, or genetic
information.
3. Data is Processed:
o The collected raw data is processed using various tools (Python, R, SQL) to
clean it up. This step involves organizing the data into a structured format
(like columns in a spreadsheet).
4. Clean Data:
o After processing, we get clean data, which is now ready for analysis.
5. Exploratory Data Analysis (EDA):
o EDA is performed to understand the data better, find patterns, and identify
issues like missing values or outliers. If issues are found, we might need to go
back and collect more data or clean it further.
6. Machine Learning Algorithms / Statistical Models:
o We apply machine learning algorithms or statistical models to the clean data.
The choice of model depends on the problem we are trying to solve (e.g.,
classification, prediction); a minimal sketch of steps 3 to 6 appears after this list.
7. Build Data Product:
o Sometimes, the goal is to create a data product (like a recommendation system
or spam classifier) that users interact with.
8. Communicate Findings:
o The results of the analysis are visualized and reported. This can be for internal
use (e.g., to inform a boss or colleagues) or external (e.g., academic papers,
public reports).
9. Make Decisions:
o Based on the analysis, decisions are made, which might involve implementing
a data product that feeds back into the real world.
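To make steps 3 through 6 concrete, here is a minimal sketch in Python, assuming pandas and scikit-learn; the file name raw_data.csv and the columns age, income, and purchased are hypothetical placeholders, and this is an illustration of the flow rather than a prescribed implementation.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Steps 3-4: process raw data into a clean, structured table (hypothetical file and columns).
raw = pd.read_csv("raw_data.csv")
clean = raw.dropna(subset=["age", "income", "purchased"])  # drop rows with missing fields

# Step 5: light EDA to understand the data before modeling.
print(clean.describe())
print(clean["purchased"].value_counts())  # is the label balanced?

# Step 6: fit a simple model suited to the problem (here, classification).
X = clean[["age", "income"]]
y = clean["purchased"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))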
Feedback Loop:
Influence of Models:
o Unlike some predictions (e.g., weather forecasts), data products can influence
the real world. For example, a recommendation system can change user
behavior, creating a feedback loop where new data is generated from these
interactions.
o It's important to account for this influence when building and analyzing
models, as the model itself can change the phenomenon it is observing.
Key Takeaway:
Data science is not just about analyzing data; it's about creating products and making
decisions that impact the real world. This process is iterative and involves continuous
feedback and improvement.
Case Study: RealDirect
Doug Perlson, with a background in real estate law, startups, and online advertising, founded
RealDirect to revolutionize the real estate market using data. Typically, people sell their homes
once every seven years with the help of brokers and current data. However, there are issues with
the broker system and the quality of data available. RealDirect addresses both problems effectively.
Traditional brokers operate independently and aggressively guard their data. Experienced brokers
might have slightly more data than their less experienced counterparts, but it’s still limited.
RealDirect solves this by hiring a team of licensed real estate agents who collaborate and pool their
knowledge. They built an interface for sellers, providing them with data-driven tips on selling their
homes and offering real-time recommendations based on interaction data.
These brokers become data experts, using tools to collect and access new and relevant information,
including recent data on co-op sales in NYC. One issue with publicly available data is its outdated
nature, as there’s a three-month lag between a sale and when the data is available. RealDirect is
working on real-time feeds to capture when people start searching for homes, initial offers, the
time between offer and closing, and how people search for homes online. This up-to-date
information benefits both buyers and sellers, as long as they’re honest.
RealDirect makes money by offering a subscription service to sellers for around $395 a month,
granting them access to selling tools. They also provide the option to use RealDirect’s agents at a
reduced commission, typically 2% of the sale price instead of the usual 2.5% or 3%. The pooling
of data allows RealDirect to optimize the process, reducing commission costs and increasing
volume.
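As a small worked example of the commission difference (the $1,000,000 sale price is hypothetical; only the 2% and 2.5% rates come from the description above):

# Hypothetical sale price, for illustration only.
price = 1_000_000
traditional = 0.025 * price      # typical 2.5% broker commission -> 25,000
realdirect = 0.02 * price        # RealDirect's reduced 2% commission -> 20,000
print(traditional - realdirect)  # 5,000 saved on this sale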
The website acts as a platform for managing the sale or purchase process, with various statuses like
active, offer made, offer rejected, showing, in contract, etc. Based on your status, the software
suggests different actions. However, RealDirect faces challenges, such as a New York law
requiring listings to be behind a registration wall. While this can be an obstacle, serious buyers are
likely to register. Competitors like Zillow do not pose a real threat since they only show listings
without additional services.
Despite facing resistance from traditional brokers who dislike RealDirect’s approach to cutting
commissions, the transparency of listings means brokers have no choice but to cooperate. Potential
buyers can see the listings elsewhere and would complain if brokers withheld information.
Doug highlighted key factors buyers care about, such as nearby parks, subway access, schools, and
price comparisons of apartments sold in the same building or block. RealDirect aims to cover these
data points as part of their service, continually improving the buying and selling experience
through data transparency and better information.
By combining licensed agents, data-driven tools, and real-time information, RealDirect enhances
the home buying and selling process, making it more efficient and transparent for all parties
involved. Despite the challenges and resistance from traditional brokers, RealDirect’s innovative
approach to leveraging data stands to transform the real estate market.
Linear Regression
3. Benefits:
Provides a more reliable estimate of model performance compared to a single
train-test split.
Helps in tuning model parameters and selecting the best model.
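The benefits listed above describe cross-validation. A minimal sketch, assuming scikit-learn and synthetic data, of comparing a single train-test split with 5-fold cross-validation for a linear regression model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data for illustration only: 200 rows, 3 features, known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# Single train-test split: one performance estimate, sensitive to how the split falls.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
single_score = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: five estimates, averaged for a more reliable picture.
cv_scores = cross_val_score(model, X, y, cv=5)
print("single split R^2:", single_score)
print("5-fold R^2:", cv_scores.mean(), "+/-", cv_scores.std())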