Notes - Support Course For Data Science Fundamentals
Presentation forum
Before starting, go to the Presentation Forum. This space is dedicated to getting to know
us and to exchanging points of view about your expectations of the course. You will
find the link to the Presentation Forum in the menu on the left.
Data science is important in almost all major sectors and companies. In
this course, students will explore the scope and multidisciplinary nature of data science,
with a focus on solving problems using data across multiple disciplines. Students will
see how to use different tools to discover patterns and meaning in data. They will
consider ethical issues related to the application of data science, develop an
inquiring mindset, and explore a data science framework that they can apply in any
industry sector, company, or institution.
Analyze the role that data science has in advising business decisions.
Define the process and questions inherent to the data science process or cycle.
Unit results:
Collect data
Explore data
Detail of the hours that the student must dedicate to each module and each unit.
Technical requirements
A computer that plays sound and has one of the following operating systems:
Recent versions of the most common browsers with JavaScript enabled (e.g. Chrome,
Mozilla Firefox, Apple Safari, Microsoft Internet Explorer)
Course requirements
2. Complete all reading assignments and review all multimedia resource presentations.
3. Actively participate in online discussions and post thoughtful responses (if applicable).
Get Inspired contains course videos that give students the opportunity to learn from
experts in the field and link key concepts and issues to real-world context. Key and
reflective questions are presented to frame the core ideas presented in the unit.
Learn and Practice engages students in readings, videos, and interactive activities. In
this section, the necessary requirements will be identified, which will provide
students with the keys to the success of their learning. Interactive activities will allow
students to practice what they have learned in a safe and entertaining environment.
Apply and Share offers opportunities to assess mastery through quizzes, the unit
test, or written assignments. This section invites students to share questions,
knowledge, and discoveries with their peers and instructor through an online
discussion platform.
Course Policies
General
All course assignments, including discussion forum posts and responses, are expected
to be submitted by the due date, where applicable. Your instructor can explain the
scheduled due dates for your assignments.
The tasks in each module are cumulative and mutually reinforcing. Therefore, all
course assignments must be completed and submitted on time.
Academic integrity
Students are expected to exhibit the highest standards of academic conduct. Breaches of
academic ethics that affect the academic integrity of a student, classmates, or the school
will be addressed with disciplinary measures that may include temporary or permanent
suspension. The following are considered unethical:
Cheating. Cheating includes attempting to receive help from people, notes, or other
materials without the instructor's permission, or accessing a test or quiz before it is
administered.
The consequences of any such conduct depend on the seriousness of the offense that
occurred, the student's prior violations of regulations and policies, and the student's
attitude and cooperation as determined by the instructor or program officials. Disciplinary
actions for verifiable academic dishonesty are at the immediate discretion of the
instructor. The instructor may decide how to handle any such violation.
Unit 1. Understanding the world of data
"The ability to capture data - to understand it, process it, extract value from it,
visualize it and communicate it - that will be a very important capability in the coming
decades." -Hal Varian Chief Economist at Google.
In this unit, we'll introduce you to data scientists and you'll hear them talk about
the scope and importance of their work. They'll share tools they use to process and
understand data, explain how institutions (of any size) can derive value from data science,
and why they love what they do. Additionally, through a global case study, you will meet
a data science team and see how they work to extract value from data.
In this case study, you will play the role of an intern at Nutri Mondo, an
international organization that helps local communities access healthy food and
addresses the injustices that lead to hunger. The team uses data science to help
communities understand the factors that lead to food insecurity, build healthy-eating
educational programs, reduce foodborne illnesses, and mitigate the impact that poverty,
race, and geography have on food security.
Learning objectives:
- Data science is a process, not an event. It is the process of using data to understand
different things, to understand the world.
- For me, it is when you have a model or a hypothesis of a problem, and you try to validate
that hypothesis or model with your data.
- Data science is the art of discovering the ideas and trends hidden behind data.
- It is when data is translated into a story. So use storytelling to generate insight. And with
these ideas, you can make strategic choices for a company or an institution.
- Data science is a field of processes and systems for extracting data in various forms,
whether structured or unstructured.
- Data science is the study of data, just as biological sciences is the study of biology,
physical sciences is the study of physical reactions. Data is real, data has real properties,
and we need to study it if we are going to work on it.
- Data science involves some data and some science. The definition and the name came about
in the '80s and '90s when some professors were reviewing the statistics curriculum and
thought it would be better to call it data science. But what is data science? I would see
data science as an attempt to work with data, to find answers to the questions you are
exploring. Simply put, it's more about data than science. If you have data, and you're
curious, and you're working with data, manipulating it and exploring it, the very exercise
of analyzing the data and trying to get some answers from it is data science.
Data science is relevant today because we have tons of data available. We used to worry
about a lack of data; now we have a deluge of data. In the past, we didn't have
algorithms; now we have algorithms. In the past, software was expensive; now it is
open source and free. In the past, we could not store large amounts of data; now, for a
fraction of the cost, we can store millions of data sets. So, the tools to work with data,
the variety of data, and the ability to store and analyze data: everything is cheap,
everything is available, everything is everywhere. There has never been a better time to
be a data scientist.
Data science is a relatively new field that has emerged from the technological ability to
produce, collect and analyze large data sets. As you get started, consider the following
questions:
Resources
In this video from IBM, you will learn about the relationship between data,
science, and curiosity.
Why is it called Data Science?
- Data science involves some data and some science. The definition and the name came about
in the '80s and '90s when some professors were reviewing the statistics curriculum and
thought it would be better to call it data science. But what is data science? I would see
data science as an attempt to work with data, to find answers to the questions you are
exploring. Simply put, it's more about data than science. If you have data, and you're
curious, and you're working with data, manipulating it and exploring it, the very exercise
of analyzing the data and trying to get some answers from it is data science.
Watch this video from IBM to learn the definition of big data and its
characteristics.
Hello. Welcome to the Big Data University. By the end of this lesson, you will know
the definition of Big Data and know its characteristics. Plus, you'll learn what the V's of
Big Data stand for and discover why so many people are interested in Big Data. Bernard
Marr defines Big Data as the digital footprint that we generate in this technological era.
This footprint is made up of all the data that is recorded when we use any digital
technology. The basic idea behind the term Big Data is that everything we do increasingly
leaves a digital footprint that we can use and analyze to become smarter. The
driving forces of this new world are access to ever-larger volumes of data and our growing
technological ability to mine that data for business insights.
The research company Gartner defines Big Data as follows: Big Data is high-volume,
high-velocity, and/or high-variety information assets that demand cost-effective,
innovative forms of information processing that enable enhanced insight, improved
decision making, and process automation.
Ernst & Young, for its part, offers the following definition. Big Data refers to the
large, dynamic, and disparate volumes of data created by people, tools, and machines.
Innovative, scalable new technology is needed to collect, host, and analytically process
the massive amounts of data being gathered to deliver real-time business insights that
relate to customers, risk, profitability, and performance. , productivity management and
the greatest value of actions.
According to Lisa Arthur, a Forbes contributor, Big Data is a collection of data from
traditional and digital sources, both internal and external to the company, that
represents a source of ongoing discovery and analysis.
There is no single definition of Big Data, but there are certain elements that are
common to the different definitions, such as Velocity, Volume, Variety and Veracity.
These are the Vs of Big Data.
Velocity is the speed of the data and how quickly it accumulates. Volume is the
scale of the data, or the increase in the amount of data stored. Variety is the diversity of
data: we have structured data that is neatly organized in rows and columns, as in
relational databases, and unstructured data that is not organized in a predefined way,
such as tweets, blog posts, photos, numbers, and even video. Veracity is conformity with
facts and accuracy. Given the enormous amount of data available in the digital age,
concerns about the accuracy of data persist. Is the information real or false?
Let's look at the V's in more detail. Velocity is the idea that data is generated
extremely quickly and that this is a process that never stops. Attributes include real-time
or near-real-time transmission and on-premise and cloud-based technologies that can
process information quickly. Volume refers to the amount of data that is generated. For
example, exabyte, zettabyte, yottabytes, etc. Other drivers of Volume are increasing data
sources, higher-resolution sensors, and scalable infrastructure. Veracity is the quality and
origin of the data. Attributes include consistency, completeness, integrity, and
ambiguity. The drivers are cost and the need for traceability. Variety is the concept
that data comes from different sources, machines, people and processes, both internal to
the organization and external to it. Attributes include the degree of structure and
complexity, and drivers are mobile technologies, social media, wearable technologies,
geolocation technologies, videos, and many more.
Let's look at some examples of the V's in action. Velocity: many hours of video are
uploaded to YouTube every 60 seconds. This amount of data is generated every minute.
Think, then, how much data accumulates over hours, days, and years. Volume: every
day we create approximately 2.5 quintillion bytes of data. That figure is equivalent to 10
million Blu-ray discs per day. The world population is about seven billion people, and the
vast majority use digital devices. All of these devices generate, record and store data.
Additionally, if each person has more than one device -- for example, mobile devices,
desktop computers, laptops, and so on -- even more data is produced. Variety. Let's think
about the different types of data, texts, images and videos. There is also sound, the health
data that wearable devices provide, and the various types of data that devices connected
to the Internet of Things provide. Veracity: it is estimated that 80 percent of data is
unstructured, so we must devise ways to produce reliable and accurate insights. Data
must be categorized, analyzed, and visualized. The emerging V is Value. This V refers to
our ability and our need to extract value from data. Value is not only economic; it can
also refer to medical or social benefits, or to the satisfaction of customers, employees, or
oneself. The main reason people invest time in understanding Big Data is to obtain Value.
This is just the beginning of your learning about Big Data. Have fun learning.
Read this article and understand why the role of data scientist is gaining
importance in more and more institutions.
Davenport, T. H., & Patil, D. J. (2012). Data Scientist: The Most Popular Job of the
21st Century Harvard Business Review, 90(10), 70–76.
This article explores the field of data science through data and its structures, as
well as the high-level process you can use to extract value from data.
Watch this video from IBM and learn why data scientist skills are important for
companies and institutions.
- Data science is relevant today because we have tons of data available. We used to worry
about a lack of data; now we have a deluge of data. In the past, we didn't have
algorithms; now we have algorithms. In the past, software was expensive; now it is
open source and free. In the past, we could not store large amounts of data; now, for a
fraction of the cost, we can store millions of data sets. So, the tools to work with data,
the variety of data, and the ability to store and analyze data: everything is cheap,
everything is available, everything is everywhere. There has never been a better time to
be a data scientist.
Watch this video from IBM and learn how businesses derive value from data.
How does Big Data impact companies and people?
WOMAN: Hello. Welcome to Big Data University. In this lesson we will give
you an overview of Big Data. Additionally, we will learn how to obtain Value from Big Data.
We will cover the terms, concepts and technologies and analyze how we arrived at the era
of Big Data. Many of us generate and use Big Data without even knowing it.
How does Big Data impact companies and people? Have you ever searched for or
purchased a product on Amazon? Did you notice that Amazon started making
recommendations related to the product you searched for? Recommendation engines are
a common big data application. Companies like Amazon, Netflix, and Spotify use
algorithms that rely on Big Data to make specific recommendations based on customer
preferences and historical behavior. Personal assistants, like Siri on Apple devices, use Big
Data to craft answers to the endless number of questions end users can ask. Google now
makes recommendations based on Big Data from users' devices. Now that we have an
idea of how customers use Big Data, let's look at how Big Data impacts businesses.
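Before turning to the business examples, here is a toy sketch (not part of the lesson) of the item-similarity idea behind such recommendation engines. The ratings matrix, item names, and the cosine helper are invented purely for illustration; real systems at the companies mentioned use far larger data sets and more sophisticated algorithms.

```python
import numpy as np

# Invented ratings matrix: rows are users, columns are items; 0 means "not rated".
items = ["Item A", "Item B", "Item C", "Item D"]
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two item (column) vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Recommend for user 0: score each unrated item by how similar it is to the items
# the user already rated, weighted by those ratings.
user = ratings[0]
scores = {}
for j, item in enumerate(items):
    if user[j] == 0:  # only consider items the user has not rated yet
        scores[item] = sum(cosine(ratings[:, j], ratings[:, k]) * user[k]
                           for k in range(len(items)) if user[k] > 0)

print(max(scores, key=scores.get))  # the unrated item closest to the user's tastes
```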
In 2011, McKinsey & Company stated that Big Data would become the main basis
of competition and would bring about a new era of innovation and productive growth. In
2013, UPS announced it was using customer, driver and vehicle data through a new route
guidance system that was designed to save time, money and fuel. Initiatives like this
support the idea that Big Data will dramatically change the way businesses compete and
operate.
How does a company gain a competitive advantage? Have you heard of the Netflix
series "House of Cards"? The first season was released in 2013 and immediately became a
hit. At the time, the New York Times reported that Netflix executives knew "House of
Cards" would be a hit even before filming the series. But how do they know? Thanks to Big
Data. Netflix has a huge amount of data. This company knows what time of day you watch
movies. Additionally, it records when users pause, rewind or fast forward. It also gets
ratings from millions of users, as well as information from the searches they perform. By
analyzing all the Big Data, Netflix learned that many of its users had streamed David
Fincher's work and that movies starring Kevin Spacey had always been successful. It
also discovered that the British version of "House of Cards" had been successful.
Netflix knew that people who liked Fincher also liked Spacey. All this information suggested
that purchasing the series would be a good bet for the company, and in fact it was. In
other words, thanks to Big Data, Netflix knows what people want before they do.
Let's analyze another example. Market saturation and selective customers will
require Chinese e-commerce companies to better leverage Big Data to gain market share.
Companies will have to persuade customers to make more frequent, larger purchases and
from a broader range of online store categories. As digital shopping grows, e-commerce
companies already have the tools to achieve this goal. Major companies are already using
data to create models that are designed to increase retention rate and spend per
customer based on eCommerce data. They also began to adopt promotional activities and
analytics-based pricing systems.
The Internet of Things refers to the exponential increase in connected devices. The
"IoT" suggests that many of today's products will be connected to a network or the
Internet and will generate many different data, such as refrigerators, coffee makers or
pillows. Another "IoT" connection is called wearable, and refers to the clothing and items
we wear that are already connected. These items include the Fitbit, Apple Watch, or
Nike's new self-lacing running shoes.
In this lesson you discovered some of the characteristics of Big Data and learned
about some of the applications. Have fun learning.
Watch this video from IBM and learn the type of data and problems data scientists
work with.
Watch this video from IBM to learn some big data use cases and understand how
big data adds business value.
WOMAN: Hello. Welcome to the Big Data University. In this lesson we will analyze
some practical cases of Big Data and see how Big Data adds value to companies.
We'll look at exploring Big Data to find, visualize, and understand this type of data
to optimize business insights. We will learn the concept of enhanced 360-degree view.
This is a way to drill down into the customer for a true unified view that incorporates
internal and external data sources. We will explore the concept of security and the
extension of intelligence, in order to reduce risk, detect fraud and monitor cybersecurity
in real time. We will look at operations analytics to analyze various machine data, with
the goal of improving business results.
Exploring Big Data addresses the challenge facing all large organizations. Business
information is distributed in different systems and isolated structures. Exploration allows
you to analyze and mine Big Data to find, visualize and understand all your data to
improve decision making. By creating a unified view of information from different data
sources, both internal to and external to your organization, you get improved value and
new insights. Let's analyze an example from the transportation area. By using data from
different systems, such as cameras in different parts of the city, weather information and
GPS data from Ubers, taxis, trucks, and cars, we can predict traffic conditions more
quickly and precisely, in order to implement smarter real-time systems that
optimize traffic flow. This offers many benefits, such as reducing fuel
emissions, improving public transport planning, and building a longer-lasting transport
infrastructure. With the advent of self-driving cars, machine learning algorithms can be
trained using the historical and real-time data offered by human-driven cars. In this way,
the autonomous car learns how real drivers behaved in different traffic situations and
under different circumstances and weather conditions.
During the digital era, the points of contact that exist between an organization and
its customers have multiplied, so organizations now need specialized solutions to manage
these connections effectively. The enhanced 360-degree customer view is a
comprehensive approach that takes into account all available and relevant customer
information to achieve greater engagement, improved revenue and long-term loyalty.
This is the basis of the new customer relationship management, or CRM systems. Let's
analyze an example in detail. By taking an enhanced 360-degree view of the customer
and using available and relevant information such as spending habits, purchasing
behavior and preferences, stores can plan, prepare and serve customers more effectively.
The increasing number of technological crimes, acts of cyber terrorism, espionage events,
computer leaks and major cyber fraud cases pose a real threat to all people and
organizations. In order to address these security challenges, companies are using Big
Data technologies to change and utilize their intelligence and cybersecurity activities.
How do they do that? They process and analyze new types of data, such as social media,
emails, and endless hours of video. Analyzing data in transit and at rest can help you find
new associations or discover patterns and facts that can significantly improve intelligence,
security, and compliance. Operations analytics focuses on analyzing machine data, which
can include everything from signals, sensors and logs, to GPS device data. This type of
data grows at an exponential rate and comes in large volumes and various formats. By
using Big Data for this type of analysis, organizations can gain real-time visibility into
operations, customer experience, transactions, and behaviors. Big Data allows companies
to predict if a machine will stop working, if its components will need to be replaced, and
even if an employee will quit. Let's analyze an example. Aircraft engines generate
enormous amounts of data per second. By analyzing this enormous amount of data from
the turbine, and even other aircraft sensors such as GPS, temperature and speed,
organizations can gain real-time visibility into aircraft operations. This data is used to
operate the aircraft safely and efficiently. And in the unlikely event of an accident, this
data can also tell investigators exactly what caused the accident. Many of the current
protocols and regulations in aeronautics come from data collected in previous accidents.
Thank you for completing this lesson. In it, you analyzed some practical Big Data use
cases and discovered how they add value to companies. Have fun learning.
- When a company is hiring anyone to work on a data science team, they need to
think about what role that person is going to take. Before a company starts, it needs to
understand what it wants from its data science team and then hire accordingly. As a
data science team grows, they need to understand whether they need
engineers, architects, designers to work on visualization or if they just need more people
who can multiply large matrices.
Watch this video from IBM and learn how important it is for data scientists to be
good at communicating, linking, and interacting with data.
- I think there is no hard and fast rule for hiring data scientists; it's going to be a
case-by-case matter. I would say there has to be some kind of technical component:
someone should be able to manipulate the data and be able to communicate
what they find in the data. I find that very often no one really cares about the R-squared or
confidence interval, so you have to be able to introduce those things and explain
something convincingly. And you also have to find someone who can relate, because data
science, being typically new means that the person in that role has to make relationships
and has to work in different departments.
- Yes, a data scientist must have a good background in mathematics and statistics.
- They have to consider problem-solving and analysis skills. A data scientist must be good
at problem analysis.
- The people being hired should love playing with data, know how to play with
data visualization, and have analytical thinking.
- From a skills point of view, let's focus on technical skills. In that case, the first thing
to decide is what type of technical platform you would like to adopt. Let's say you want to
work in a structured data environment, and let's say you want to work in market
research; then the kinds of skills you need are slightly different from those of someone
who would like to work in big data environments.
If you want to work in big data, then there's the other aspect of it, and that is
being able to store data. So you start with experience storing large amounts of data and
then look at platforms that allow you to do that. The next step would be to be able to
manipulate large amounts of data and the last step would be to apply algorithms for
those large data sets.
So it's a three-step process, but most importantly it starts with where
you would like to be, in what field, in what domain. In terms of platforms, let's
say you want to be in a traditional predictive analytics environment and you're not
working with big data; then R, Stata, or Python would be your tools.
If you are working primarily with unstructured data, then Python is more suitable
than R. If you are working with big data, then Hadoop and Spark are the environments
you will work with. So it all depends on where you would like to be, what type of work
excites you and then you choose your tools.
In addition to technical skills, the second aspect of data science is having the
ability to communicate: communication skills or presentation skills. I call them
storytelling skills; that is, you've done your analysis, now can you tell a great story about
it? If you have a very large table, can you synthesize it and make it more engaging, so
that when it appears on the screen or as part of the document it practically speaks, even
sings, the findings, and the reader just takes it from there? The ability to present your
results, either verbally or in a presentation or a document, these presentation and
communication skills, is just as important as technical skills.
When you have a great insight and you present your results, imagine that
you are driving on a mountain road and there is a sharp turn, and you cannot see what is
beyond the turn; and then you make that turn, and suddenly you see a tremendous
valley in front of you, and there is this great feeling of amazement: "I didn't know,"
right? So when you present your conclusions and you have this great finding and you
communicate it well, this is what people feel, because they didn't expect it. They weren't
aware of it, and then there is this great feeling of happiness: "Now I know. I didn't realize.
Now I know," and it gives them power. It gives them ideas of what they can do with
this knowledge, this new insight. It is a great feeling of joy. And as a data scientist, you
can share it with your clients because you enabled it.
What is more important: having the analytical skills of a data scientist or loving the
institution you work for? Watch this video from IBM to find out.
- When companies are hiring people for a data science team, perhaps a data scientist or
an analyst or a chief data scientist, the tendency is to look for the person who has all
the skills: who has the specific domain knowledge, is excellent at structured and
unstructured data analysis, is great at presentation, and has great storytelling skills. So,
if you put all this together, you realize that you are looking for a unicorn, and your
chances of finding a unicorn are pretty slim.
I think what you have to do is look at the pool of applicants you have and ask: who
resonates the most with your company's DNA? Because, you see, you can teach analytical
skills; anyone can learn analytical skills if they put in the time and effort. But what
really matters is who is passionate about the type of business you do. Someone might be
a great data scientist in a retail environment, but they may not be as excited working
in IT-related firms or working with gigabytes of web logs; but if someone is excited about
those web logs, or someone is excited about health-related data, then they could
contribute to your productivity a lot more.
And I would say that if I'm looking for someone, if I have to put together a data
science team, I would first look for curiosity. Is that person curious about things? Not just
about data science, but anything. Curious why this room is painted a certain way? Why do
the shelves have books, and what kind of books? You have to have a certain degree of
curiosity about everything in your field of vision. The second thing is: do they have a
sense of humor? Because, you see, you have to be able to enjoy it. If someone is too
serious about it, they would probably take it too seriously and wouldn't be able to look at
the lighter elements. The third thing, and I think the last thing I would look for, if I have
to have a hierarchy, is technical skills. I would go through these social skills first:
curiosity, a sense of humor, the ability to tell a story, the ability to know that there is a
story there. And then, once everything is there, I'll say, "Well, can you do the technical
side of this?" And if there is any hope or any sign of some technical skills, I would take
them, because I can train them in whatever skill they need, but I can't teach curiosity, I
can't teach storytelling, and I certainly can't instill a sense of humor in anyone.
Watch this video from IBM to understand the relationship between data science
and big data.
- Hello. Welcome to the Big Data University. In this lesson we will analyze how Big Data
relates to data science, examine the skills that data scientists must have, and analyze the
stages of the data science process. For Big Data analysis, you can start with a few general
topics: integration, analysis, visualization, optimization, security, and governance.
Analysis. Let's analyze the example of Walmart. Walmart uses a search engine
called Polaris that helps customers search for products they want to buy. This engine
evaluates the way the user behaves on the website in order to display the most relevant
results. Polaris displays certain items based on the user's interests. And since many
customers visit Walmart's website, huge amounts of data are collected, making the
analysis of that Big Data very important.
Visualization. Some people work well with data tables. However, the vast majority
of people need Big Data presented to them graphically in order to understand it. Data
visualization is useful for people who need to analyze data such as analysts or data
scientists, and it is primarily useful for non-technical people who need to make decisions
from data but do not work with it on a daily basis. An example of Big Data visualization is
the display of temperature on a map divided into regions. By using the huge amounts of
data collected by sensors and satellites in space, people can see a quick and easy summary
of where the temperature will be higher or lower.
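As a small illustration of this kind of visual summary, the following sketch (not part of the lesson) plots an invented grid of temperature readings as a heatmap with Matplotlib, so warmer and cooler regions can be seen at a glance. The values and region labels are made up for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented grid of temperature readings (degrees Celsius) over a 10 x 10 set of regions.
rng = np.random.default_rng(0)
temperatures = 15 + 10 * rng.random((10, 10))

# A heatmap lets a non-technical reader see at a glance where it is warmer or cooler.
plt.imshow(temperatures, cmap="coolwarm", origin="lower")
plt.colorbar(label="Temperature (°C)")
plt.title("Illustrative temperature map by region")
plt.xlabel("Region (west to east)")
plt.ylabel("Region (south to north)")
plt.show()
```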
Security and governance. Data privacy is a fundamental part of the Big Data era.
Companies and individuals need to take a closer look at how data is collected, retained,
used and disclosed. Privacy violations occur when there is unauthorized access, use or
disclosure of personal information. And sometimes, this situation leads to legal
proceedings. Companies must establish strict controls and privacy policies that comply
with the legal framework of the geographic region in which they are located. Big Data
governance has three requirements. Automated integration, meaning easy access to data
no matter where you are. Visual content, that is, easy classification, indexing, and
discovery within Big Data, in order to optimize its use. And agile governance, which is the
definition and execution of governance that is appropriate to the value of the data and its
intended use. By analyzing these three aspects, companies can quickly get a profile of the
level of importance of the data and the level of security needed to protect it.
Some of the applications used in Big Data are Hadoop, Oozie, Hive, Apache Pig,
Apache Spark, MapReduce and YARN, Sqoop, ZooKeeper, and text analytics. We need people
who have the skills to run these applications and analyze Big Data. Big Data University
offers free courses on Hadoop, machine learning, analytics, Spark, and many more topics.
Look for Big Data Dudes, Big Data experts, to learn more about Spark and Big Data. There
are many massive open online courses, MOOCs and some formal programs related to Big
Data.
Data science is the process of cleaning, extracting, and analyzing data with the goal
of obtaining valuable insights. In data science, data size is less important: you can use
data of all sizes, small, medium, and large, that relate to a business or scientific case.
Insights are extracted through a combination of exploratory data analysis and modeling.
Data science is the process of analyzing data to obtain insights that inform
decisions. The data scientist is the person trained to obtain insights from data through
the use of tools and their experience with information, statistics, and business or science.
These are other skills that data scientists must have.
The following process can be used to understand Big Data: problem determination
(What is the business problem? What is the objective of the project? What would you do if
you had all the data?); data collection (What data is relevant? Are there privacy issues?);
data exploration (plot the data: is there a pattern?); data analysis (create a model,
tune the model, validate the model); storytelling (visualization and communication: can we
tell a story?); and, finally, taking action and making decisions.
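The sketch below (a simplified illustration, not part of the lesson) walks a tiny, invented data set through several of these steps with pandas and NumPy: collecting the data, exploring it for a pattern, fitting a simple model, checking its predictions, and stating the finding.

```python
import numpy as np
import pandas as pd

# Data collection: a tiny, invented data set relating advertising spend to sales.
df = pd.DataFrame({
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "sales":    [2.1, 4.3, 5.9, 8.2, 9.8, 12.1],
})

# Data exploration: summary statistics and a quick check for a pattern.
print(df.describe())
print("correlation:", round(df["ad_spend"].corr(df["sales"]), 3))

# Model creation: fit a simple line (sales ~ intercept + slope * ad_spend).
slope, intercept = np.polyfit(df["ad_spend"], df["sales"], deg=1)

# Model validation: compare predictions with the observed values.
df["predicted"] = intercept + slope * df["ad_spend"]
print(df)

# Storytelling / action: the slope is the "story" -- roughly how much additional
# sales each extra unit of ad spend is associated with.
print(f"each extra unit of ad spend ~ {slope:.2f} more units of sales")
```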
Thank you for completing this lesson. In it, you learned about some Big Data
applications and the data science process. Have fun learning.
In the World of Data Science, visit Miami and go to the Meet the Data Science
Team website.
Nutri Mondo is an international organization that uses data science to help local
communities gain greater access to healthy food and address the inequities that drive
hunger.
We live in a world with an overabundance of food and yet there are millions of
malnourished people. The health problems related to this dietary imbalance are diverse,
but their solution is known: implement more effective food distribution through the
scientific and business tools we have at our disposal.
Nutri Mondo's mission is to combine data and analytical resources to alleviate suffering
and ensure a better life for all people.
ABOUT US
Jonathan Febres.
Research administrator. Jonathan has been with Nutri Mondo for five
years. Before joining Nutri Mondo, he was a sales training manager. Jonathan learned
about the power of data at the same time he discovered design thinking. His love of data
and desire to make a difference pushed him to leave the corporate world and seek
solutions to food and nutritional problems.
Favorite quote: "Data science is not a shortcut to the right answers, but a way to
find the right question."
Karen Pond
Data analyst. Karen has been with Nutri Mondo for four and a half years. Before
joining Nutri Mondo, Karen worked for a small law firm investigating bank fraud and
developing financial audit systems. Her computer science studies and desire to make a
lasting impact on people's lives impressed Jonathan, who offered her this position during
their first meeting.
Francis Hassani
Location: Tegucigalpa, Honduras
Project manager. Francis has been with Nutri Mondo for three years. Before
Nutri Mondo, he worked for the city's safety department and collaborated with the fire
department to develop software that identified apartments where low-income
families lived in potentially dangerous conditions. Between that experience and military
service, he became a logistics teacher.
Favorite quote: "If everything seems to be under control, it's likely that not enough
is known about the project."
Susana Maciel
Location: Miami, United States
Director of Education Engagement and Development at Nutri Mondo. Susana has
been involved with Nutri Mondo for 10 years, where she started as a procurement
manager. She dedicated herself to food assistance after working as a high school teacher.
What inspired Susana to change careers was her concern about the health problems her
students had related to malnutrition. Her love of food moved beyond the garden and
kitchen of her home and turned to the food problems of others.
Favorite quote: "People deserve to be fed in any way necessary."
LOCATIONS:
Clark, T., & Wiesenfeld, D. (2017) There are 3 factors that limit your analysis, and
technology is not one of them. Harvard Business Review Articles, 2–5.
"Fundamentally, what makes a great data scientist stand out is their fierce curiosity; that's
the key factor."
-John Sullivan
The role of the data scientist is relatively new in organizations, whether for-profit,
non-profit or government. Technology has made the creation, storage and analysis of
available data much more accessible. Still, the mere presence of data and data tools doesn't
mean much value without data science professionals with the skills to make sense of it.
Institutions need people with the right skills and mindset to turn all the data into meaningful
information that can guide decision-making.
In this unit, you will learn about data science methodology through a global case
study, in which you are an intern at Nutri Mondo, a fictional institution that uses data science
to solve problems related to food insecurity and hunger worldwide. Below, you can read the
latest message from Susana, its director, in which she discusses what you will explore next in
the field of data science.
Learning objectives:
Below you can read the email from Susana Maciel, the director you will be working with
on this case study:
MURTAZA HAIDER: Once you have completed your analysis and you have your
results, assume that your results could be wrong; doubt your results a little bit. Always
believe that there is no black and white, there is always a little gray; and if you can, do the
analysis again and again until you are absolutely sure that no mistakes have been made.
RAFAEL B. DA SILVA: I really like Ockham's razor which says that if there is a
problem with two explanations, the simpler one is usually the better one.
LUIS O. MARTINS: In my opinion, “less is more.” Try to simplify things.
STEPHEN SHERMAN: In God we trust; all others must bring data. I didn't
invent that. It was Deming. I had to look it up.
Data scientists are always looking for the answers and insights hidden in the data. As
you become familiar with the data science methodology, consider the following:
Click on the image below to see an interactive diagram that summarizes each step
of the data science methodology. In this unit, you will focus on Business Understanding
and the Analytical Approach.
Welcome.
Despite the recent increase in computing power and access to data observed in
recent decades, our ability to use data in the decision-making process is often lost
or not maximized, because we tend not to fully understand the questions being asked and
how to apply the data correctly to the problem at hand.
The data science methodology discussed in this course has been developed by
John Rollins, an experienced data scientist currently working at IBM. This course is based
on his experience and expresses his opinion regarding the importance of following a
certain methodology to achieve success. Simply put, data science methodology aims to
answer ten basic questions in a prescribed sequence.
As you can see on this slide, there are two questions designed to define the
problem, and thus determine the approach that will be used. Then there are four
questions that will allow you to organize the data you need. And finally there are four
more questions that will try to validate the data and the approach that is designed. Now
take a moment to familiarize yourself with the ten questions, as they will be critical to
your success.
This course comprises several components. There are five modules and each one
covers two stages of the methodology and explains the reasons why each stage is
necessary. Within the same module, a practical case is shared that exemplifies what has
just been learned. There is also a hands-on lab that allows you to apply the material.
Finally, there are three review questions to assess whether you understood the concepts.
When you are ready, take the final exam.
The case study included in the course highlights how data science methodology
can be applied in context. This case revolves around the following situation: There is a
limited budget to provide medical care to the public. Hospital readmissions for recurring
problems can be seen as a sign of failure by the system to adequately address the
patient's illness before initial discharge.
The main question is this: What is the best way to allocate these funds to maximize
their use in the delivery of quality health care? As you can see, if the new data science
pilot program is successful, it will provide better care for patients as doctors will have new
tools that allow them to incorporate timely, data-driven information into healthcare
decisions.
In the case study sections, these icons appear in the upper right corner of the
screen to allow you to differentiate theory from practice within each module. A glossary of
data science terminology is also provided to help you understand the key terms used in
the course. If you have a problem or question during the course, explore the information
and discussion sessions. Now that you're ready, put on your headphones and let's get
started.
Watch this video from IBM and discover the first step of the data science
methodology: business understanding.
Business understanding.
Has this ever happened to you?: Your boss invites you to a meeting and informs
you that there is an important task that must be completed within a very limited deadline.
The two of you work to make sure you have considered all aspects of the task, and at the
end of the meeting you are both confident that everything is under control.
That afternoon, however, after examining the various issues at hand for a while,
you realize that you must ask several more questions in order to actually complete the
task. Unfortunately, your boss won't be available until the next morning. Now, with that
tight deadline ringing in your head, you begin to feel uneasy. What should you do? Do you
risk the progress of the task, or do you stop and ask for clarification?
In the first part of the data science methodology, time must be spent seeking
clarification in order to obtain what can be called a business understanding. This
understanding must be achieved at the beginning of the methodology, because if you are
clear about the problem that must be solved, you can determine what data should be
used to answer the main question.
For example, if a business owner asks how to reduce the costs of a certain
activity, we must understand whether the goal is to improve the efficiency of that
activity or to increase the profitability of the company. Once the goal is clarified, the next
step is to determine the objectives that support that goal.
By analyzing the objectives, you can have structured conversations that allow you
to identify priorities and thus organize and plan how to address the problem. Depending
on the problem, certain people must be present in the conversation to identify the
requirements and clarify the questions.
Now, let's examine the case study related to applying business
understanding. In this case study the question is: What is the best way to allocate the
limited budget allocated to medical care to maximize its use in the provision of quality
care? This question was very important for a US health insurance provider.
As public funding for readmissions decreased, this insurer was at risk of having to
make up the difference in costs, which could increase its clients' rates. Knowing that
raising rates would not be a favorable strategy, the insurer met with medical authorities in
its region and brought in data scientists from IBM to see how data science could be
applied to the question at hand.
Goals and objectives had to be defined before beginning to collect data. After
spending time determining the goals and objectives, the team prioritized patient readmissions
as a useful topic to investigate. Taking into account the goals and objectives, it
was discovered that 30% of people who complete rehabilitation treatment are re-
hospitalized within one year, and 50% within five years. After analyzing some records, it
was detected that patients with congestive heart failure were the most likely to be
readmitted. Furthermore, it was determined that a decision tree model could be applied
to analyze this situation in order to identify the reasons for its existence.
To begin, to gain the business understanding that would guide the analytics team
as they formulate and execute their first project, IBM data scientists proposed and
conducted an in-person workshop. The participation of the main business sponsors was
very important throughout the project for the following reasons. They set the general
direction, maintained commitment, provided suggestions, and ensured the necessary
assistance where appropriate.
Finally, four business requirements were identified for the model to be created.
Those requirements were as follows: Predict readmission outcomes for patients with
congestive heart failure. Predict the risk of readmission. Understand the combination of
events that led to the predicted outcome. Apply a process to new patients that is easy to
understand regarding readmission risk. This concludes the business understanding session
of this course. Thanks for your time.
How do data scientists separate cause from effect? In this article, you will find the
answer.
Redman, T. C. (2014, February 19). How to Explore Cause and Effect Like a Data
Scientist. Harvard Business Review Digital Articles.
Analytical Approach. – This step involves seeking clarification from the stakeholders who are
asking the question, so that the most appropriate path or approach can be selected.
Data Requirements. – Think of this aspect of data science methodology as the process of
cooking with data. Each step is essential to making food. Therefore, if the problem to be
solved is the recipe and the data is an ingredient, the data scientist will need to identify
what ingredients are needed, how they can obtain or collect them, how they can
understand or work with them, and how they should prepare the data to achieve the
desired result.
Data collection. – In this step, you review the data requirements and decide if the
collection requires more or less data.
Data Understanding. – This step includes activities related to the construction of the data
set. Basically, in data science methodology, the understanding step answers the following
question: Is the data you collected representative of the problem that needs to be solved?
Modeling. – Data modeling focuses on developing models that are descriptive (what
happened) or predictive (what could happen). These models are based on the analytical
approach that has been selected.
Assessment. – The evaluation of a model goes hand in hand with its creation. Therefore,
the modeling and evaluation stages are carried out in an iterative manner.
The evaluation of the model is carried out during its development and before its
implementation.
Implementation. – Once the model is evaluated and the data scientist is convinced that it
will work, it is implemented and tested definitively. Depending on the purpose of the
model, it can be implemented for a limited group of users, or in a test environment to
increase confidence in the application of the result for use on a larger scale.
Feedback. – User feedback will help fine-tune the model and evaluate it in terms of
performance and impact. The value of the model will depend on correctly incorporating
feedback and applying adjustments for as long as the solution is needed.
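To make the modeling and evaluation steps concrete, here is a minimal sketch using scikit-learn with entirely synthetic data (it is not the IBM readmission case study itself): it trains a small decision tree, the kind of model mentioned in the case study, and evaluates it on records held out from training. The features, labeling rule, and parameters are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic, invented patient records: age, number of prior admissions, length of stay (days).
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.integers(30, 90, 200),   # age
    rng.integers(0, 6, 200),     # prior admissions
    rng.integers(1, 15, 200),    # length of stay
])
# Invented labeling rule: older patients with several prior admissions are marked
# as readmitted, purely so the tree has a pattern to learn.
y = ((X[:, 0] > 65) & (X[:, 1] >= 2)).astype(int)

# Keep some records aside so the model is evaluated on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Modeling: fit a small decision tree classifier.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Evaluation: check the predictions before any deployment decision is made;
# in practice this step and the modeling step are repeated iteratively.
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```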
Module: Common Terms and Tools
Watch this video from IBM and discover some examples of Big Data and how it is
generated.
NARRATOR: Hello! Welcome to Big Data University. In this lesson we will look at
some examples of big data and how it is generated. We will examine the different types of
big data and their sources. Why is everyone talking about big data?
In the last two years, more data was created than in the entire history of humanity.
By 2020, approximately 1.7 megabytes of new information will be created per second for
every human being on the planet. Furthermore, by 2020 the data we create and copy will
reach 35 zettabytes, compared to the current 7.9 zettabytes. The graph on the right shows
global data growth in zettabytes. Pay attention to the 343% increase between 2015 and
2020.
How big is a zettabyte? A bit is binary: it is a one or a zero. Eight bits make up a
byte, and 1024 bytes make up a kilobyte. In turn, 1024 kilobytes make up one megabyte.
The size of full-length videos and DVDs is measured in gigabytes. One gigabyte of storage
is equal to 1024 megabytes. Today, we have USB flash drives that can store a few tens of
gigabytes of information, while computers and hard drives now store terabytes of
information. One terabyte is equivalent to 1024 gigabytes. For its part, 1024 terabytes
make up a petabyte, 1024 petabytes make up one exabyte, and 1024 exabytes make up
one zettabyte.
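A short sketch (not part of the lesson) can make this ladder of units concrete; it simply multiplies by 1024 at each step, following the binary convention used above, and answers the lesson's opening question of how big a zettabyte is.

```python
# The binary unit ladder described above: each unit is 1024 of the previous one.
STEP = 1024
units = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte"]

size = 1  # start from one byte
for unit in units:
    size *= STEP
    print(f"1 {unit} = {size:,} bytes")

# The last line printed answers "How big is a zettabyte?":
# 1 zettabyte = 1,180,591,620,717,411,303,424 bytes.
```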
Think of a large city or a busy international airport such as Heathrow, JFK, O'Hare,
Dubai, or OR Tambo in Johannesburg. Petabytes and exabytes are handled there. All of
these airports record and transmit data. Everyone in those airports has mobile devices.
Also be aware of security cameras and all personnel in and around the airport. A study of
the digital universe carried out by IDC found that last year the information reached 0.8
zerabytes and predicted that by 2020 this figure will rise to 35 zerabytes. It is anticipated
that by 2020 machines will produce one-tenth of the world's data. And that most of the
planet's data will be produced in emerging markets. Furthermore, it is predicted that the
amount of data generated will increasingly exceed the available storage space.
Advances in cloud computing have contributed to the rise of big data potential.
According to a 2013 McKinsey study, the rise of cloud computing greatly contributed to
launching the era of big data. Cloud computing allows users to access highly scalable
computing and storage resources over the Internet. By using cloud computing, businesses
can use server capacity however needed and quickly expand it to the scale needed to
process large data sets and run complicated mathematical models. Cloud computing
lowers the price of big data analysis as resources are shared among many users who pay
only for the capacity they actually use.
A survey by IBM and Saïd Business School identified three main sources of big data:
human-generated data, machine-generated data, and enterprise-generated data, which is
the data that organizations produce in their own operations. The graph on the right shows
different responses where people were able to select multiple options. Big data will
require analysts to have specific skills. Big data skills include detecting and analyzing
trends occurring in big data.
In addition, this structure makes it easier to search for data using a programming
language such as SQL. Two examples of semi-structured big data sources are XML and
JSON files. These sources use labels or other markers to reinforce the hierarchies of
records and fields that exist within the data. In 2011 alone, a large radio telescope
project called the Square Kilometre Array, or SKA, produced about 1,000 petabytes of raw
data per day. It is projected that in 2020 it will produce about 20,000 petabytes or 20
billion gigabytes of data per day.
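To see what "semi-structured" means in practice, here is a small sketch (not from the lesson) that parses an invented JSON record with Python's standard json module; the keys and nesting, rather than fixed rows and columns, mark the hierarchy of records and fields mentioned above. The sample record is made up for illustration.

```python
import json

# A hypothetical semi-structured record, invented for illustration.
raw = """
{
  "user": {"id": 42, "name": "Ana"},
  "posts": [
    {"title": "Big Data basics", "tags": ["data", "intro"]},
    {"title": "The V's of Big Data", "tags": ["volume", "velocity"]}
  ]
}
"""

record = json.loads(raw)               # parse the text into nested Python objects
print(record["user"]["name"])          # navigate the hierarchy by key: Ana
for post in record["posts"]:           # iterate over the repeated, nested elements
    print(post["title"], post["tags"])
```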
Watch this video from IBM and learn about big data processing technologies.
WOMAN: Hello. Welcome to the Big Data University. In this lesson you will learn about Big
Data processing technologies, learn about Hadoop, and discover what it is and why it is
considered a great Big Data solution. In a 2011 McKinsey Global Institute report, the main
components and ecosystems are summarized as follows:
Data analysis techniques, such as A/B Testing, Machine Learning and Natural Language
Processing. Big Data technologies, such as business intelligence, cloud computing, and
databases. Visualization, such as graphs and other elements that allow data to be
displayed.
The Big Data processing technologies we will discuss combine large sets of
structured and unstructured data in a format that supports analysis and visualization.
Value can only be obtained from Big Data if it can be reduced or converted into formats
that people can understand. One trend enabled by the Big Data revolution is the
development of new software tools and database systems, such as Hadoop, HBase, and
NoSQL, for large unstructured data sets. There are many vendors that offer Big Data
processing tools and provide training.
We'll start with IBM, which has the Big Data University and the Data Scientist Lab.
The Data Scientist Lab is a collection of open source tools hosted in the cloud, such as
OpenRefine, Jupyter Notebooks, Zeppelin Notebooks, and RStudio. This lab provides easy
access to Spark, Hadoop, and many other big data analytics engines, as well as
programming languages such as Python, R, and Scala.
Let's discuss some of the terms used in conversations related to Hadoop. A node is
simply a computer. Typically this is basic non-enterprise hardware that contains data. So,
in this example, we have node one and then we can add more, like node two, node three,
and so on. This is called a rack. A rack is a collection of 30 or 40 nodes that are physically
stored together and are all connected to the same network switch. The network
bandwidth between two nodes in the same rack is greater than the bandwidth between
two nodes in different racks.
The Hadoop Cluster is a set of racks. IBM Analytics defines Hadoop as follows:
Apache Hadoop is a highly scalable storage platform that is designed to process very large
data sets on hundreds or thousands of compute nodes operating in parallel. Hadoop
provides a cost-effective storage solution for large volumes of data with no formatting
requirement. MapReduce, the programming paradigm that enables this enormous
scalability, is the backbone of Hadoop.
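To make the MapReduce idea concrete, here is a minimal sketch of the map, shuffle, and reduce phases applied to a word count, the classic introductory example. It is plain Python rather than actual Hadoop code, and the sample lines are invented; in Hadoop, the input would be file blocks distributed across the nodes of the cluster.

```python
from collections import defaultdict

# Invented sample "documents"; in Hadoop these would be blocks of a large file.
lines = [
    "big data needs big storage",
    "hadoop stores big data",
]

# Map phase: emit an intermediate (key, value) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the intermediate values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the grouped values for each key into the final counts.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}
```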
There are three types of Big Data and data warehouse augmentation. Pre-processing,
which uses Big Data as a landing zone before determining what data should be moved to
the data warehouse; that data can then be categorized as irrelevant, or as relevant data
that is sent to the warehouse. Offloading, which is the transfer of data that is accessed
very infrequently from the warehouse to an enterprise-grade version of Hadoop.
And exploration, which is the use of Big Data capabilities to explore and discover new,
high-value data from huge amounts of raw data, and to free up the warehouse for more
structured, deep analysis.
Thanks for watching this video. Here you learned what Hadoop is and why it is a
great solution for Big Data. Have fun learning.
Watch this video from IBM and discover the tools and technology that data
scientists enjoy.
SHINGAI MANJENGWA: I love regression. I would say regression is one of the first
concepts that helped me understand data. I really enjoy regression.
MURTAZA HAIDER: Structured data is like tabulated data. It is what you would recognize
from the Microsoft Excel format: rows and columns. That's called structured data.
Unstructured data is data that comes mostly from the web, where it is not tabulated; it is
not in rows and columns, it is text, and sometimes it is video or audio. You would have to
use more sophisticated algorithms to extract data from it.
In fact, many times, we take unstructured data and spend a lot of time and effort
to structure it so we can analyze it. If there is something structured in tables, columns and
lines, go ahead. That is structured data. But if it's a blog, or if they're trying to get
information from web pages, and there are millions of web pages, that's unstructured
data. It will take more effort to get information from that.
Machine learning is a set of advanced tools that people use to find answers. I don't really like machine learning, and I'll tell you why. Imagine that there is an island with 45,000 inhabitants. They are isolated from the rest of the world. No one can enter or leave the island. Imagine there is a murder on the island. You are the detective who must find the culprit. Various methods can be used. One method is to say: "The murderer is on this island. There are 45,000 suspects. I will interview each one until I find the culprit." That's machine learning, because there are no reasons, there are no assumptions, there are no hypotheses, there are no hunches. You say, "I don't know anything. I'm going to give everything to this model to find the culprit." Sometimes the culprit is found, sometimes not. But it takes time.
Machine learning is used when you don't have many assumptions about the data and when you don't know much about the data. The model is used to see what comes out. It is a black-box method. I know that many professionals use it all the time. I like to look at data with preconceived notions, because as I said, a data scientist is someone who is very critical. A data scientist is someone with opinions about the data, about what they are learning or researching. They can't use a kitchen-sink method where they throw everything into the model. Machine learning encourages you to throw everything at it and see what comes out.
There are thousands of books on regression and there are millions of classes on
regression. I have always believed that they do not explain regression well because they
go into data, models and statistical distributions. Let's forget about that. I will explain
regression in simple terms. If you've ever taken a taxi, you already understand regression. This is how it works. The moment you get into the taxi, you see that there is a fixed amount. It says $2.50. Whether the taxi moves forward or you get off right there, that is what you owe the driver as soon as you get in. That's a constant. You have to pay that amount if you entered the taxi.
As you start moving, for every 100 meters the fare increases by a certain amount. It's a fraction. There is a relationship between the distance and the amount paid beyond that constant. If you don't move because you are stuck in traffic, you have to pay more for each additional minute. As your minutes increase, your fare increases. As the distance increases, the fare increases. And while all this is happening, you have already paid a base fare, which is the constant. That's regression.
The regression tells you what the base fare is and what the relationship is between time and the fare you paid, and between distance and the fare you paid. Without knowing those relationships, and knowing only the distance and what was paid, the regression allows you to calculate the constant, which you did not know was $2.50, and to calculate the relationship between fare and distance, and fare and time. That's regression.
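As a rough sketch of the taxi analogy, the snippet below fits an ordinary linear regression to a few invented trips and recovers the base fare as the intercept and the per-kilometer and per-minute charges as coefficients. The fares are generated from assumed rates purely for illustration; scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented taxi trips: distance in km, minutes stuck in traffic.
X = np.array([
    [1.0,  2.0],
    [3.0,  5.0],
    [5.0,  4.0],
    [8.0, 10.0],
    [2.5,  1.0],
])
# Fares generated from assumed rates: 2.50 base + 1.20 per km + 0.30 per minute.
y = 2.50 + 1.20 * X[:, 0] + 0.30 * X[:, 1]

model = LinearRegression().fit(X, y)
print("base fare (intercept):", round(model.intercept_, 2))   # ~2.50
print("per-km and per-minute rates:", model.coef_.round(2))   # ~[1.20, 0.30]
```

Because the fares were built from an exact linear rule, the regression recovers the constant and the two rates exactly; with real trips the fit would be approximate.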
Read this article and learn about the five things institutions need in order to use big data correctly.
Davenport, T. H. (March 26, 2014) Reasons for the success of big data projects,
Harvard Business Review Digital Articles.
Read this article and discover the most used languages in data science applications and
programming.
Jones, M. T. (April 23, 2018) Introduction to Data Science. Part 4: The languages of
data science. DeveloperWorks. IBM Corporation.
Another thing data scientists need is comfort and flexibility with analytics platforms, with software and computing platforms. But that is secondary. The most important thing is curiosity and the ability to take a position. Once you have done that, once you have analyzed the data, you will have answers. Finally, a data scientist needs the ability to tell a story. Once you have the analytics, once you have the tabulations, you should be able to tell a great story, because if you don't tell a great story based on the findings, they will remain hidden and buried and no one will know. To stand out, you must be able to tell stories.
Maybe it's movies, retail, health, computers. When you know what your strong area is, then start acquiring analytical skills: which platforms to learn, because those platforms and tools will be specific to the industry you are interested in. When you have some mastery of the tools, the next thing is to apply those skills to real problems, and then show everyone what you can do.
Watch this video from IBM and learn how to select the right analytics approach for
a data science project.
Analytical Approach.
Choosing the correct analytical approach depends on the question being asked.
The approach involves clarification from the person asking the question so that the most appropriate path or approach can be selected. In this video we will see how the second
stage of the data science methodology is applied. Once the problem to be solved is
defined, the appropriate analytical approach is selected for the problem in the context of
the business requirements. This is the second stage of the data science methodology.
Once the question is well understood, the analytical approach can be selected. This
step involves identifying what type of patterns will be needed to address the question
most effectively. If the question attempts to determine the probabilities of an action, a
predictive model may be used. However, if the question attempts to show relationships, a
descriptive approach may be needed. This type of approach analyzes clusters of similar activities based on certain events and preferences.
Now let's examine the practical case related to the application of the analytical approach. In this case study, a decision tree classification model was used to identify the combination of events that led to each patient's outcome. In this approach, examining the variables at each node, along with the path to each leaf, led to a respective threshold value. The decision tree classifier provides the predicted outcome, as well as the probability of that outcome, based on the proportion of the dominant outcome, positive or negative, in each group.
With this information, analysts can obtain the risk of readmission, or the probability of a positive result, for each patient. If the dominant result is positive, the risk is simply the proportion of positive patients found at the leaf. On the other hand, if the result is negative, the risk is one minus the proportion of patients with negative results found at the leaf.
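The case study's actual model and data are not shown in the course, so the following is only a minimal sketch using scikit-learn's DecisionTreeClassifier on synthetic patient records. It illustrates the idea described above: the predicted probability at a leaf can be read as a readmission risk. All variable names and values are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic patient-level data: [age, prior_admissions, length_of_stay].
X = rng.normal(loc=[70, 2, 6], scale=[8, 1.5, 3], size=(200, 3))
# Synthetic readmission label, loosely tied to prior admissions.
y = (X[:, 1] + rng.normal(0, 1, 200) > 2.5).astype(int)

# A shallow tree so each decision path stays easy to inspect and explain.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

new_patient = [[75, 4, 9]]
risk = tree.predict_proba(new_patient)[0, 1]  # proportion of readmitted
                                              # patients at this patient's leaf
print(f"Predicted readmission risk: {risk:.0%}")
```

Keeping the tree shallow is a deliberate choice: it mirrors the point above that people outside data science, such as clinicians, should be able to read the path that classifies a patient as high risk.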
Decision tree classification models allow people outside of data science to easily
understand and apply them to classify new patients with respect to their readmission
risks. Clinicians can instantly see what events cause a patient to be classified as high risk,
and multiple models can be created and applied at various points in the hospital stay. In
this way, a dynamic overview of the patient's risk and the way it evolves with the various
treatments applied is obtained.
For these reasons, the decision tree classification approach was selected for
creating the congestive heart failure readmission model. This concludes the analytical
approach section of this course. Thanks for your time.
Watch this video from IBM and learn how data scientists began their careers.
SHINGAI MANJENGWA: Data Science didn't exist when I was a child. I never said, “I want to be a data scientist.” No, it did not exist. I didn't know I would work in data science.
RAFAEL DA SILVA: When I was a child, the field of data science did not exist. It's
something very new.
STEPHEN SHERMAN: Data science didn't come into being until 2009, 2011. It was
someone like DJ Patil or Andrew Gelman who coined the term. Before that, there were
statistics. I didn't want to work on that. I wanted to work in business. I found Data Science
to be more interesting.
DIANA ZARATE-DIAZ: I studied statistics. That's how I started. I went through many
stages in my life where I wanted to be a singer or a doctor. But I realized I was good at
math. So I chose an area that focuses on quantitative analysis. From there, I knew I
wanted to work with data. Not necessarily data science as it is known today.
LUIS O. MARTINS: The first time I heard about data science was during my first year
in Mechanical Engineering. Strategic advisory firms use data science to make decisions. It
was the first time I heard of data science.
RAFAEL DA SILVA: I had a complicated problem that I needed to solve, and the
common techniques that existed at that time could not solve the problem.
STEPHEN SHERMAN: I graduated with a degree in mathematics at the worst
possible time, right after the economic crisis. You had to be useful to get a job. So I got a
degree in Statistics. I worked in so many Data Scientist positions that I became one.
After earning my degree, it turned out that the first company I worked for analyzed
electronic point-of-sale data for retail manufacturers. What we were doing was data
science, but we didn't use that term until much later. It wasn't until five or six years ago
that we started calling it Analytics and Data Science.
LUIS O. MARTINS: I had several internship options here in Canada. One of the
options was to work with Data Science. I worked in project development. I think it was a
good choice. This is how I started my internship in Data Science.
MURTAZA HAIDER: I am a civil engineer. All engineers work with data. I would say
my mainstream use of Data Science in my life began with transportation research. I started
building large models, I tried to forecast traffic on the streets, determine congestion and
gas emissions, or exhaust emissions. I think that's where I started. I started building these
models when I was studying at the University of Toronto. I started working with large data
sets, looking at samples of 150,000 households from half a million trips. And that was in
the '90s, when this made up a very large set, but not in today's terms.
That's how I started. I kept working with it. I then moved to McGill University,
where I was a Professor of Transportation Engineering, and built larger data models that
included data and analytics. I would say yes, transportation led me to Data Science.
Sullivan, J. (2014). Get the right data scientists by asking the “wrong” questions.
Harvard Business Review Digital Articles.
Read this article and discover the importance for data scientists of establishing
relationships and dialogues with their colleagues and stakeholders.
Redman, T. C. (January 16, 2017). The best data scientists get closer to people.
Harvard Business Review Digital Articles.
Data can come from many sources and in very different formats. In this unit, you
will learn how data scientists collect data to examine the problem they are investigating.
You will also discuss the role of ethics and security practices that should guide data
collection and storage.
Additionally, you will have a role as an intern at Nutri Mondo, an organization that
uses data science to address issues related to food insecurity and other nutrition-related
issues. Read the message from Nutri Mondo Director Susana Maciel to determine the
context of your learning in this unit.
Learning objectives:
Below you can read the email from Susana Maciel, the director you will be working
with on this case study:
The same technology was used in North Africa for similar purposes, to connect activists when governments intentionally disconnected them as a means of controlling the population. But over the years, thinking about these technologies and the things I work on, a question worries me: what if we are wrong about the virtues of technology, and what if we sometimes harm the communities we are trying to help? The technology industry operates under the premise that if we build great things, they will positively affect everyone and that, ultimately, these innovations will reach everyone. But that is not always the case.
In the world of smartphones and apps, there is a big movement toward tracking
personal health with apps that measure the number of calories you burn or whether
you're sitting too much or walking enough. These technologies make it more efficient for
patients to enter clinics and in return, these clinics expect these types of efficiencies. As
these devices reach doctors' offices and are welcomed by them, what happens to the
techno-invisibles? What is the medical experience like for someone who doesn't have a $400 phone or watch tracking their movements? Do they become a burden on the medical system? Does their experience change?
In the financial world, Bitcoin and cryptocurrencies are revolutionizing the way
money is moved around the world, but the challenge with these technologies is that the
barrier to entry is very high, right? You need access to the same phones, devices,
connectivity, and even where you don't need them, where there is an intermediary, some capital is usually required to participate. And I wonder, what happens to the last
community that uses paper money when the rest of the world moves to digital?
You can check out books from home, you can do research going to and from
school, but these are two big assumptions, first, that you have access at home, second,
that you have access on your cell phone, and in Philadelphia, a lot of kids don't have it. So
what will your educational experience be like in a completely cloud-based library that was
once a basic part of your education? How will they stay competitive?
A final example from East Africa: there is a big movement to digitize land property
rights, for several reasons. Migrating communities, older generations dying, and basically
poor records management have led to conflicts over property. So there's a big movement
to put this information online, and track ownership of these parcels of land, get them into
the cloud, and deliver them to communities.
But really, the unintended consequence of this has been that venture investors and real estate developers have rushed in and bought these plots of land out from under these communities, because they have access to the technologies and connectivity that make that possible.
This is the common denominator that connects these examples: the unintended consequences of the devices and technologies we create. As engineers, as technologists, we sometimes prefer efficiency over effectiveness. We put more weight on doing things than on the results of what we do. This must change. We have a responsibility to think about the results of technologies, especially when they increasingly control the world in which we live.
In the late '90s there was a big push for ethics in the financial and banking world. I think that in 2014 a similar movement in the technology area is overdue. So I encourage you, as entrepreneurs, managers, engineers, and creators, to think about the next big thing and about the unintended consequences of the things we are creating, because true innovation is finding ways to include everyone. Thank you.
Business and institutional needs require data scientists to solve certain problems.
But what other big problem can data science solve in your community or globally?
What are the ethical questions data scientists should ask themselves as they
advance their work?
Click on the image below to see an interactive diagram that summarizes each step
of the data science methodology. In this unit, you will focus on Data Requirements and
Data Collection.
Module: Data: What is needed and how to get it?
Watch this video from IBM to understand how data scientists determine data
requirements.
Data Requirements.
If your goal is to cook spaghetti but you don't have the right ingredients to prepare the dish, your success will suffer. Think of this section of the data science methodology as the process of cooking with data. Each step is essential to making the meal. Therefore, if the problem to be solved is the recipe and the data is an ingredient, the data scientist needs to identify which ingredients are required, how to obtain or collect them, how to understand or work with them, and how to prepare the data to achieve the desired result. Once the problem at hand is understood and the analytical approach has been selected, the data scientist is ready to begin.
Now let's examine some examples of the data requirements stage within the data
science methodology. Before the data collection and preparation stages, it is essential to
define the data requirements for decision tree classification. This includes identifying the
content, formats, and data sources that are necessary for initial data collection.
Now let's examine the use case related to the application of data requirements. In
this case study the first task was to define the data requirements for the decision tree
classification approach that had been selected. This included selecting a suitable patient
cohort from a membership base of certain health insurance providers. In order to compile
complete medical records, 3 criteria for inclusion in the cohort were identified.
First, in order for them to have access to the necessary information, the person
had to have been admitted as a hospital patient within the provider's service area.
Second, they focused on patients who had been diagnosed with congestive heart
failure during a particular year.
Third, for a complete medical history to be compiled, the patient had to have been
hospitalized for at least 6 months.
Patients with congestive heart failure who had also been diagnosed with other
relevant disorders were excluded from the cohort, as these disorders would result in
higher-than-normal readmission rates, thus skewing the results.
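A hedged sketch of how such inclusion and exclusion criteria might be applied with pandas is shown below. The table, column names, and thresholds are hypothetical; they simply mirror the three criteria and the exclusion described above.

```python
import pandas as pd

# Hypothetical admissions table; column names and values are illustrative only.
admissions = pd.DataFrame({
    "patient_id":         [1, 2, 3, 4],
    "in_service_area":    [True, True, False, True],
    "chf_diagnosis_year": [2015, 2014, 2015, 2015],
    "months_of_history":  [12, 3, 24, 18],
    "excluded_disorder":  [False, False, False, True],
})

cohort = admissions[
    admissions["in_service_area"]                   # 1. admitted within the service area
    & (admissions["chf_diagnosis_year"] == 2015)    # 2. diagnosed with CHF in the target year
    & (admissions["months_of_history"] >= 6)        # 3. at least six months of history
    & ~admissions["excluded_disorder"]              # exclude other relevant disorders
]
print(cohort["patient_id"].tolist())  # patients meeting all cohort criteria
```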
The content, format, and data representations needed for decision tree classification were then defined. This modeling technique requires one record per patient, with columns representing the model variables. Modeling readmission outcomes required data covering all aspects of the patient's medical history.
To obtain the one-record-per-patient format, data scientists summarized the transactional records at the patient level, creating a number of new variables to represent that information. That was a task for the data preparation stage, so it is important to foresee and anticipate the following stages. This concludes the data requirements section of this course. Thanks for your time.
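As a small illustration of summarizing transactional records into one record per patient, the sketch below uses a pandas groupby on invented claims data. The column names and aggregations are hypothetical, not the case study's actual variables.

```python
import pandas as pd

# Hypothetical transactional records: one row per claim, several per patient.
claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "cost":       [200.0, 150.0, 75.0, 500.0, 120.0],
    "diagnosis":  ["CHF", "diabetes", "CHF", "CHF", "arrhythmia"],
})

# Summarize to one record per patient, creating new columns (model variables).
patient_level = claims.groupby("patient_id").agg(
    n_claims=("cost", "size"),
    total_cost=("cost", "sum"),
    n_distinct_diagnoses=("diagnosis", "nunique"),
).reset_index()
print(patient_level)
```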
Watch this video from IBM and find out what data scientists do during the data
collection phase under the data science methodology.
Data collection.
After initial data collection, the data scientist performs an assessment to determine whether they have what they need. Just as when you buy ingredients for a meal, some are likely to be out of season and more difficult to obtain, or to cost more than you expected.
At this stage the data requirements are revisited and it is decided whether the collection requires more or less data. Once the data is obtained during the collection stage, the data scientist will have a clear idea of the content they will work with. Techniques such as descriptive statistics and visualization can be applied to assess the content, quality, and initial statistics provided by the data.
Problematic data will be identified and plans created to fill the gaps or replace the data. Basically, the ingredients are now sitting on the cutting board, waiting to be chopped. Let's examine some examples of the data collection stage within the data science methodology. This stage represents a kind of follow-up to the data requirements stage.
Now let's examine the practical case related to the application of data collection. Data collection requires knowing the source, or knowing where the necessary data can be found. In the context of our case study, this may include: demographics, medical records, patient coverage information, provider data, claims records, pharmacy information, and data related to all congestive heart failure patient diagnoses.
For this case study, information on certain drugs was also needed, but that data
source was not yet integrated with the rest. This allows us to reach an important
conclusion. It is not wrong to postpone decisions on unavailable data and try to acquire it
at a later stage.
For example, this can even be done after obtaining some intermediate results with
predictive modeling. If those results suggest that drug information may be important in
obtaining an effective model, it is necessary to invest time in trying to obtain that
information. However, the team was ultimately able to create a reasonably effective
model without the information from these drugs.
DBAs and programmers often work together to extract data from various sources and then combine it. In this way, redundant data can be removed and the information made available for the next stage of the methodology, data understanding. At this stage, if necessary, data scientists and analytics team members can determine different ways to better manage their data, including automating certain processes in the database to make the collection task easier and faster. Thanks for your time.
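A minimal sketch of combining extracts from different sources and removing redundant rows, assuming pandas and invented table and column names, might look like this:

```python
import pandas as pd

# Hypothetical extracts from two source systems; names are illustrative only.
demographics = pd.DataFrame({"patient_id": [1, 2, 3], "age": [72, 65, 80]})
pharmacy = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "drug":       ["drug_a", "drug_a", "drug_b"],  # note the duplicated row
})

# Remove redundant rows, then combine the sources on their shared key.
pharmacy = pharmacy.drop_duplicates()
combined = demographics.merge(pharmacy, on="patient_id", how="left")
print(combined)
```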
Nutri Mondo's data science team analyzes the relationships between diet-related
problems and other factors. Listen as the research leader, data scientist, and project
manager briefly discuss different criteria that can be used to choose a data set. In the
work of Nutri Mondo's data science team, think about how the analytical approach aligns
with the data requirements and data collection process.
FRANCIS: Welcome. Well, let's analyze where we are. In our last meeting with
Susana we determined that we will look for data to understand the relationship that exists
between food availability and diet-related problems. And we will also explore other
relationships that may influence these problems. Will this information help you design
educational programs and engagement initiatives that are relevant and effective for the
local community?
FRANCIS: Yes! He talked to the regional teams and they all believe our goals are
appropriate. Remember how she thought the regional offices could collect the data?
FRANCIS: That's right! Preferably government data, but I'm not sure why.
KAREN: Public data is transparent. Additionally, it may be more accurate than the data our regional teams could collect. There are likely even more variables that are
relevant and accessible, such as ethnicity and income level. With our exploratory
approach, the analysis of the current situation will allow us to see what we are focusing
on. Furthermore, access is free.
FRANCIS: According to Susana, the regional offices believe that if the data we use
comes from national data sets, there will be greater effectiveness in our partnerships with
local governments.
JONATHAN: Did you take into account that with government data we may not be
able to get data from all the countries we work with?
FRANCIS: Yes. First of all they would like to analyze the situation in the United
States. They want to see what is available in terms of data and then analyze whether we
can do something similar in other countries.
KAREN: Okay. So, with respect to our data requirements, the data should be public, free, and come from the government. The variables we
need are obesity, diabetes, heart disease, cancer, stunted growth in adolescents and
cognitive disorders, and also data on food distribution, ethnicity and the economy.
JONATHAN: Yes. For now we will start with the data from the United States government.
KAREN: Good. I analyzed some data that is publicly available. If we only look at data from the US government, we are likely to reach an interesting result. Furthermore, regarding data collection, we only have to select it and then download it.
JONATHAN: These four are all from US government sources. But perhaps they are applicable to some of our other communities. We have one CDC/NIH set, one US Department of Labor set, and two US Department of Agriculture sets.
KAREN: Based on your analysis, are there any sets that stand out?
JONATHAN: I've only looked at them once, but they all look pretty solid.
KAREN: Let's see, what else do we have and how could it relate to one of these studies? For example, this Department of Labor study seems to cover a similar population during the same time period.
FRANCIS: So what should be done now? Should data be compared directly to each
other?
Read this article and find out how a data scientist can ensure they collect the right
data.
Wessel, M. (November 3, 2016) You don't need big data, you need the right data.
Harvard Business Review Digital Articles.
Read this article to understand the processes data scientists use to collect and
store data.
Rao, V. R. (March 7, 2018) How data is turned into knowledge. Part 2: Data lakes
and data swamps.
Hello. Welcome to this Data Privacy Basics course from Fireside Analytics Inc. for
Big Data University.
In this course you will get an overview of data privacy laws. We'll use Canada's laws
as a framework, you'll be exposed to high-stakes data privacy cases, and you'll discover
some guiding principles to help you avoid these types of issues. We will cover some
general data privacy issues to provide a brief summary of what you need to know
regarding collection, retention, use and disclosure. Let's get started.
The lesson objectives are as follows: a summary of Canada's privacy laws and the first case study. Canada has two federal privacy laws: the Privacy Act, which addresses personal information handling practices by federal government agencies and departments, and the Personal Information Protection and Electronic Documents Act, PIPEDA, which is the federal privacy law for the private sector.
These privacy laws rank among the most progressive in the world and can be considered a good foundation for data privacy in general. Let's learn about the Privacy Act.
The Privacy Act relates to the right of an individual to access and correct information that the Government of Canada holds about them, and to the collection, use, and disclosure of that individual's personal information by the government in the course of providing services, for example, old-age pensions or employment insurance.
The Privacy Act applies only to the federal government institutions listed in the Privacy Act's schedule of institutions. It applies to all personal information that the federal government collects, uses, and discloses, whether about individuals or federal employees. The Privacy Act does not apply to political parties or political representatives.
The Office of the Privacy Commissioner of Canada monitors compliance with the Privacy Act. Canada's Privacy Commissioner is an officer of Parliament who reports directly to the House of Commons and the Senate. You can use the following information to contact the Office of the Privacy Commissioner of Canada directly.
The Commissioner is an advocate for the privacy rights of Canadians, and their powers include investigating complaints, conducting audits, and pursuing court action under the two federal laws; publicly reporting on the personal information handling practices of public and private sector organizations; supporting, conducting, and publishing research into privacy issues; and promoting public awareness and understanding of privacy issues.
PIPEDA does not apply to organizations operating entirely within a province whose legislation has been deemed substantially similar to PIPEDA. This exception does not apply if the personal information crosses provincial or national borders. PIPEDA generally applies to private sector organizations doing business in Canada.
Watch this video from IBM and learn the seven foundational principles for
designing systems that protect data.
[Translation Summary]
1. A privacy violation occurs when there is unauthorized access to, collection, use or
disclosure of personal information.
2. Such " unauthorized " activity occurs in contravention of applicable privacy
legislation, such as PIPEDA , or similar provincial privacy legislation.
o PIPEDA is the Personal Information Protection and Electronic Documents
Act.
3. Some of the most common privacy breaches occur when personal information of clients, patients, customers, or employees is stolen, lost, or disclosed by mistake (for example, a computer containing personal information is stolen, or personal information is sent by email to the wrong people by mistake).
Companies should invest some time in creating privacy plans and privacy policies.
Canada's Privacy Commissioner offers a free tool to help businesses create a privacy plan.
o http://www.priv.gc.ca/resource/tool-outil/english/index.asp?a=logout
You will be asked to answer a few questions about the organization of interest, and upon completion the tool provides the following:
An information audit of your business.
Consent provisions required specifically for your business.
A security plan to protect the personal information in your care.
An example of a privacy brochure for your clients.
A training needs assessment.
1. Collection.
2. Retention.
3. Use.
4. Disclosure.
Privacy by Design.
1. You feel that your personal information has been collected, used or disclosed in
error;
2. You were denied access to your personal information, or
3. You consider that there was an unreasonable delay in accessing your information.
https://www.privacyhorizon.wikispaces.com
Cognitive Class. (n.d.). Think W3 – Debit and credit card data: Parties involved.
Read this article and learn about some of the ethical issues that companies should
consider when using big data to understand their customers.
Schrage, M. (January 14, 2014) The dangerous new era of big data discrimination.
Harvard Business Review Digital Articles.
Read this article and find out how managers can make the most of big data and
data science.
Lambrecht, A. & Tucker, C. (July 12, 2016) The 4 mistakes most managers make
with analytics. Harvard Business Review Digital Articles.
Similarly, the data that data scientists collect in and of itself doesn't have much
value. Data must be filtered, cleaned and polished to get value from it. In this unit, you will
learn what data scientists do after collecting data in preparation for building models that
institutions can use to support decision making.
Additionally, you will have the role of an intern at Nutri Mondo, an organization
that uses data science to solve problems related to food insecurity and other nutrition-related issues. Read the message from Nutri Mondo director, Susana Maciel, to establish the
learning context in this unit.
Learning objectives:
Below you can read the email from Susana Maciel, the director you will be working
with on this case study:
But the truth is that by seeing this you will be able to get to know me much better
than by seeing this, which is what you are probably more used to and probably have on
your phones now. Bar graphs that count your steps, circular graphs for your sleep quality,
your early running path. In my day to day, I work with data.
I have a data visualization design company where we devise and create methods
that make information accessible through visual representations. What my work has
taught me over the years is that, to understand data and its true potential, sometimes we
have to put it aside and see through it instead. Because data is just a medium we use to
capture reality. They are always used as substitutes for something, but they are not the
real thing.
We lived in a very liberal city and my father was a politician in the Democratic
Party. I remember that nobody believed that Berlusconi could win. It didn't seem possible
at all. But he won. And I remember perfectly what I felt. It was a total surprise; my father swore he didn't know anyone who had voted for him in my city.
For the first time, the data I had offered me a distorted view of reality. The truth is
that my data sample was quite limited and skewed so I thought it was because I lived in a
bubble and had little chance of seeing outside of it.
I think it's fair to say that this time the data failed us, and spectacularly so. We
believed in the data, but what happened, even in the most serious newspapers, is that the
obsession with reducing everything to two simple percentages to get a shocking headline
made us focus on those two figures and nothing else. In an effort to simplify the message
and create a nice, unambiguous blue and red map we completely lost our way. Somehow,
we forgot that there were stories, human stories behind those figures.
In a different context, but with a similar situation, a woman brought me and my
team a peculiar challenge. She came to us with a wealth of data, but ultimately, she
wanted to tell one of the most human stories possible. This is Samantha Cristoforetti.
She was the first Italian woman astronaut, and she contacted us before leaving on a six-month expedition to the International Space Station. She told us, "I'm going to space, and I want to do something meaningful with my mission data to bring people together."
People don't care about data just for the sake of it, because numbers are never the
end all be all. They are always a means to an end. The story we needed to tell is that there
was a human being in a tiny box flying in space above their heads, visible to the naked eye
on a clear night.
So we decided to use the data to create a connection between Samantha and all
the people looking for her from below. We designed and developed what we called
"Friends in Space", a web application that simply allowed them to say hello to Samantha
from wherever they were, and they could also say hello to people who were also online
from around the world. All of these greetings left marks on a map as Samantha flew and
she greeted us back every day using Twitter from the ISS. Thanks to that, people were able
to see the mission data from another perspective.
Suddenly it was much more about human nature and curiosity than it was about
technology. The data enhanced the experience, but the human stories were the impetus.
Such a positive response from thousands of users taught me an important lesson: working with data means designing ways to transform the abstract and the uncountable into something that can be seen, felt, and reconnected to our lives and behaviors, something that is hard to achieve if we let the obsession with numbers and the technology that surrounds them guide us in the process.
But we can do even more to relate data and the stories they represent. We can do
without technology completely. A couple of years ago I met this woman, Stefanie
Posavec, a designer from London with whom I share the passion and obsession with data.
We didn't know each other, but we decided to perform a very radical experiment, we
started a communication using only data, no other language, and we chose not to use
technology to share our data.
In fact, our only means of communication was the old-fashioned postal mail. For
“Dear Data”, every week for a year, we used our personal data to get to know each other,
weekly personal data about trivial topics, our feelings, the interaction with our partner,
the compliments we received and the sounds of our environment.
The first week of this project, we chose a somewhat cold and impersonal topic
"How many times do we check the time in a week?" This is the front of my postcard, as
you can see, the little symbols represent all the times I checked the time, organized by day
and time chronologically, nothing really complex here.
But, as you'll see in the caption, I added anecdotal details about those moments.
The different types of symbols indicate why I was checking the time. What was I doing? Was I bored? Was I hungry? Was I late? Did I check the time on purpose, or did I glance at the clock by chance? And most importantly, I represented the details of my days and my personality in my data collection.
The data served as a magnifying glass or filter to discover and demonstrate, for
example, my eternal concern about being late, although I am never late. Stefanie and I
collected our data manually for a year in order to focus on the details that computers miss
—at least until now.
We also use data to study our minds and our words, not just our activities. Around
week three we monitored the “thank yous” we said and received, and I discovered that I
tend to thank people I don't know more.
I seem to thank waitresses and waiters compulsively, but I don't thank the people close to me enough. After a year, consciously noting and listing these kinds of actions became a ritual. It really changed something in us. We became more in tune with ourselves, more aware of our behavior and our environment.
That year, Stefanie and I made a very deep connection through the shared data
journal, but we did it because we invested ourselves in these numbers by adding the
contexts of our personal stories. Only in this way could we make them meaningful and
truly represent us.
I'm not asking you to start drawing your personal details, or to send postcards to a
friend on the other side of the world. What I'm asking you is to see data—all types of data
—as the beginning of a conversation and not the end. Because the data itself will never
give us a solution.
That is why the data failed us so much, because we failed to include enough
context to represent reality, a complex, intricate reality full of nuances. We kept seeing
these two numbers, we became obsessed with them and thought that our world could be
reduced to two percentages and a confrontation, meanwhile, the real stories that really
mattered, we pushed aside.
Seeing these stories through models and algorithms caused us to miss what I call
“data humanism.” In Renaissance humanism, European intellectuals placed human
nature, rather than God, at the center of their worldview. I think something similar must
happen with the data universe. Apparently, today data is believed to be like God,
containing infallible truth about our present and future.
The experiences I shared with you today taught me that to make data accurately represent human nature, and to ensure that it does not mislead us further, we need to start designing ways to include empathy, imperfection, and human qualities in the way we collect, process, analyze, and display data. Ultimately, I do see a time when, instead of using data just to be more efficient, we will all use data to be more human. Thank you.
TEDNYC. (March 2017). Giorgia Lupi: How we can find ourselves in data [video file].
How on a small scale and a large scale do you use data to understand the world
around you?
What challenges arise when people try to share and understand data and
statistical information?
Click on the image below to see an interactive diagram that summarizes each step
of the data science methodology. In this unit, you will focus on Data Understanding and
Data Preparation.
Module: Understanding the data
Watch this video from IBM and learn about all the activities related to building
data sets during this step in data science methodology.
Data Understanding
Data understanding includes all activities related to the construction of data sets.
Basically understanding data within the data science methodology answers the following
question: Is the data you collected representative of the problem that needs to be solved?
Let's apply the data understanding stage of our methodology to the practical case we have
been analyzing.
In order to understand the data related to hospitalizations for congestive heart failure, descriptive statistics had to be run against the columns of data that would become the variables in the model.
First, these statistics included Hurst and univariate statistics on each variable, such as the median, minimum, maximum, and standard deviation.
Second, pairwise correlations were used to analyze how closely certain variables were related, and to identify which, if any, were so highly correlated that they would be redundant, meaning only one of them would be relevant for the modeling.
Histograms are ideal for understanding how the values of a variable are distributed and what types of data preparation may be needed to make the variable most useful in a given model. For example, if you wanted to make a categorical variable with too many distinct values informative in a model, the histogram would help you decide how to consolidate those values.
Univariate statistics and histograms were also used to assess data quality. With the information they provide, certain values can be recoded or, if necessary, even discarded, such as when a given variable has many missing values.
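A hedged illustration of these data understanding steps with pandas is shown below: univariate statistics, pairwise correlations, histograms, and a missing-value count. The table and column names are invented stand-ins for the case study's variables.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small invented patient-level table; column names are illustrative only.
df = pd.DataFrame({
    "age":              [72, 65, 80, 58, 77, 69],
    "length_of_stay":   [6, 3, 10, 4, 8, 5],
    "prior_admissions": [2, 0, 4, 1, 3, 2],
})

# Univariate statistics: count, mean, std, min, quartiles, max for each column.
print(df.describe())

# Pairwise correlations: values near +1 or -1 flag possibly redundant variables.
print(df.corr())

# Histograms show how each variable is distributed and what preparation it may need.
df.hist(bins=5)
plt.tight_layout()
plt.show()

# Data quality check: count missing values per column before recoding or dropping.
print(df.isna().sum())
```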
Initially, the definition of hospitalization for congestive heart failure was based on a primary diagnosis of that disease. However, during the data understanding stage, it was discovered that the initial definition did not capture all the hospitalizations for this disease that were expected based on clinical experience. For this reason, the team returned to the data collection stage, added secondary and tertiary diagnoses, and developed a more comprehensive definition of hospitalization for congestive heart failure.
This is just one example of the iterative processes of the methodology. The more you work with the problem and the data, the more you learn, and therefore the more adjustments you can make to the model, which in turn leads to a more effective solution to the problem. This concludes the data understanding section of this course. Thanks for your time.
In the World of Data Science, visit São Paulo and watch the Understanding Data video. Watch as a data science team explores a new data set and begins to identify initial patterns in the collected data.
During a status update meeting, Jonathan, the research leader; Karen, the data scientist; and Francis, the project manager, discuss the steps to take to
understand the data collected for the use case. They talk about different processes that
can be carried out to identify initial patterns. Think about why the data understanding
stage tends to lead us back to the collection phase.
FRANCIS: Very good, thank you. It seems like an opportune time to make this
status update. I thank you for taking this extra time to update me on the status of this
project. Well, where are we?
JONATHAN: What we're doing now is using IBM's Watson Analytics. A browser
platform that allows you to enter data, analyze it and then view the results. This is a very
interesting introductory project, as we use an exploratory approach. We take an existing
data set and simply analyze the relationships that exist between the variables. In this case
with respect to diseases related to diet and food insecurity. We are not creating a
predictive model, we are just describing relationships.
KAREN: Many of the tools we will use are common to all data science projects.
These tools help us understand and prepare data. This project allows us to establish
connections and create visualizations with the data, through IBM Watson.
JONATHAN: When we analyze our data set in this notebook, the analysis begins
with a typical data science segmentation. In the early stages of examining the data we
may discover certain sections with stronger information than others. We analyze a series
of visualizations to determine which data is most useful and focus our efforts on it. In this
stage we explore what the data looks like and clean it. Although often considered boring,
this part of the job is extremely important. Without cleansed data our statistics and
visualizations could be inaccurate or confusing.
KAREN: Let me show you an example. I will share my screen. Notice the column full of NaN?
KAREN: It means "not a number." We also have zero values in another column. At this stage we just want to see where they are. Typically a value of zero means that an observation was not recorded. NaN can mean the same thing, or it can appear when a number was entered as text, or it can simply be an error. However, before deciding what to do with these values, let's examine the entire data set. What we will do here is apply a correlation table to the entire data set. Correlation tables can appear simply as large tables of numbers or as visualizations. This is what the table looks like.
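A small sketch of the checks Karen describes, using pandas on invented values, might look like the following; the column names are illustrative only.

```python
import pandas as pd
import numpy as np

# Small invented sample standing in for the team's data set.
df = pd.DataFrame({
    "food_insecurity_pct": [18.4, np.nan, 12.1, 0.0],
    "adult_obesity_pct":   [30.2, 28.9, np.nan, 25.0],
    "snap_participation":  [0.0, 14.5, 9.8, 11.2],
})

# Where are the NaNs, and where are the zeros that may mean "not recorded"?
print(df.isna().sum())
print((df == 0).sum())

# A correlation table across the whole data set, as discussed in the meeting.
print(df.corr())
```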
We don't need data science for this type of analysis. However, we are looking at
many variables here. For our project we analyze the relationships that exist between the
variables. Therefore, in the data understanding stage we need to know what data we have
and what types of relationships exist. In this table let's see if we can find variables that
have positive and negative correlations. For some people these figures are a little difficult
to visualize. Watson data science tools are ideal for displaying correlations visually.
KAREN: In this graph we can see the complete table. All 1235 columns and rows at
once. We can also see that many of the variables have relationships. The red ones show
positive correlations, while the blue ones show negative correlations. We can quickly see
that there are many areas that we may need to explore. We will also apply some basic
descriptive statistics on each of the variables.
For example: What is the maximum value of a variable, the minimum value and its
standard deviation? We must ask the program to identify that information. We can use
this function to better understand some of the variables. You can imagine how important
it is to be able to analyze data and apply mathematical relationships with millions of
variables so quickly.
FRANCIS: We're not just focusing on large data sets, though, are we? Don't we
need to look at some of the variables?
JONATHAN: Yes. We can also get to know a specific variable a little better. Let's
look at the "food insecurity" column and see how the different observations tracked in
terms of frequency. This graph shows that 18.4 percent were observed more than 250
times in our data set.
KAREN: We can even look at some specific relationships. I created two scatterplots
to share with you. This is the relationship that exists between the percentage of obese
adults in a population and the participation rate in SNAP, which is the program commonly
known as the "food stamp program." If they have a certain level of income, families can
apply for government assistance to buy food in different stores. It can be seen that the
data are positively correlated. There appears to be a relationship between increasing
SNAP participation and increasing obesity rates.
This second graph is a scatter plot with the regression line incorporated. You can see that the line slopes upward to the right, which shows that these two variables have a clearer positive correlation. In the graph above I did not include a regression line; in this graph the line makes the positive correlation visible. The steeper the slope, the stronger the correlation. In this second graph we can also see that the part of the population with a higher rate of obesity is more likely to live in areas of greater food insecurity.
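For reference, a scatter plot with a fitted regression line like the one Karen describes can be produced with Seaborn's regplot. The values below are invented stand-ins for the real data set.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented county-level values standing in for the real data set.
df = pd.DataFrame({
    "snap_participation_pct": [5, 8, 10, 13, 15, 18, 20, 24],
    "adult_obesity_pct":      [22, 24, 25, 27, 28, 30, 31, 33],
})

# Scatter plot with a fitted regression line, like the second graph described.
sns.regplot(data=df, x="snap_participation_pct", y="adult_obesity_pct")
plt.title("SNAP participation vs. adult obesity (illustrative data)")
plt.show()
```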
FRANCIS: So does this prove that people who use food stamps become obese?
That's not good.
KAREN: No. We cannot yet draw those kinds of conclusions. Correlation is not causation. These graphs do not prove that food insecurity causes obesity or that federal food assistance programs cause obesity. They only show that when one variable increases, so does the other.
An old teacher of mine said that height and reading ability have a positive correlation. But that doesn't mean that taller people read better. It means that children are shorter than adults, and that adults generally have greater reading ability than children. In these tables, the correlations simply suggest that there is likely some other connection between the variables. What they tell us is that a relationship may be worth exploring further.
JONATHAN: Okay. Now that we have seen some relationships in the data we have
to start preparing it. What variables do we want to focus on? What do we do with all the
zero and NAN values? All of this takes place at the data preparation stage. I will talk to
Susana regarding these initial patterns and see if she has any suggestions. This review has
highlighted several relationships that may be worth exploring a little further. Since she has
a lot of experience, it is likely that Susana can specify which variables we should focus on
in the next step: Data preparation.
Watch this video from IBM and discover how data is prepared for the next step in
data science methodology.
Data Preparation
In a sense, data preparation is similar to washing freshly picked vegetables, as it involves removing unwanted elements such as dirt or blemishes. Along with data collection and data understanding, data preparation is the most time-consuming stage of a data science project, typically taking up 70 percent and even up to 90 percent of the project time. If some of the collection and preparation processes in the database are automated, this time can be reduced by up to 50 percent. This time savings means data scientists can spend more time building models.
To continue the culinary metaphor, we know that chopping onions allows their flavor to spread through the sauce more easily than if we placed a whole onion in the pot. Similarly, in the preparation stage, transforming the data means getting it into a state that makes it easier to work with.
Specifically, the data preparation stage of the methodology answers the following question: How should the data be prepared? To work with the data effectively, it must be prepared to address missing or invalid values, remove duplicates, and ensure that all content is properly formatted.
Feature engineering is also part of data preparation. It is the process of using domain knowledge of the data to create features that enable machine learning algorithms to work. A feature is a characteristic that can be useful in solving a problem. Features are important for predictive models and influence the results you want to achieve. Feature engineering is essential when applying machine learning tools to analyze data.
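A brief, hedged sketch of these preparation steps in pandas is shown below: removing duplicates, handling missing values, and engineering a simple derived feature. The table and column names are invented for illustration.

```python
import pandas as pd
import numpy as np

# Invented raw table; column names are illustrative only.
raw = pd.DataFrame({
    "county":              ["A", "A", "B", "C"],
    "food_insecurity_pct": [18.4, 18.4, np.nan, 12.1],
    "population":          [50_000, 50_000, 120_000, 80_000],
    "snap_recipients":     [4_000, 4_000, 15_000, 6_500],
})

prepared = (
    raw.drop_duplicates()                                    # remove duplicate rows
       .assign(
           food_insecurity_pct=lambda d: d["food_insecurity_pct"]
               .fillna(d["food_insecurity_pct"].median()),   # fill missing values
           # Feature engineering: derive a rate a model can use directly.
           snap_rate=lambda d: d["snap_recipients"] / d["population"],
       )
)
print(prepared)
```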
When working with text, you must follow text analysis steps to code the data so that it can be manipulated. To address the question, the data scientist needs to know what they are looking for within the data set. Text analysis is essential to ensure that appropriate groupings are established and that the programming does not miss what is hidden within. The data preparation stage lays the foundation for the next steps in addressing the question.
Although this stage may require a lot of time, if it is carried out correctly the results will be very beneficial for the project. On the contrary, if this stage is skipped, the results will not be as intended and you may have to return to the starting point. It is essential to take the time necessary during this stage and to use the tools at your disposal to automate common steps and streamline data preparation.
During this stage, make sure to pay attention to the small details. After all, just one wrong ingredient can ruin a sophisticated dish. This concludes the data preparation section of this course, in which we have analyzed some key concepts. Thanks for your time.
Two members of the data science team, Susana and Karen, meet to discuss how
they can get the most out of their data. In that meeting, they discuss how they can
prepare the data for analysis.
SUSANA: Karen, thank you for the overview you gave of the initial investigation of
this data. I was pleasantly surprised to see that the data was very relevant despite being
several years old. After reading your report I noticed that we should only focus on seeing
the connection that exists between some specific values. With a data set of this size, a
large amount of irrelevant information appears and confusion arises.
KAREN: I agree. But at least we have options. What data do you want to include in
our study?
SUSANA: I think we will get a good picture of the situation if we analyze variables
related to economic inequality, obesity, access to food, ethnicity or race. And participation
in federal food programs, such as food stamps and free or reduced-price lunches, which
are offered to students.
KAREN: Let me examine those variables. I'll be back online in a moment.
KAREN: In the end there ended up being only 28 variables. To make this change I created a smaller data frame and a quick heat map that lets us see which correlations we should look at. I visualized some of these relationships with Seaborn, but I am very interested in trying other visualizations. The quickest way to explore them is through PixieDust, a visualization library that can be used in Watson Studio. In this way, Susana, we can immediately see that there is a considerable number of strong relationships and correlations. In this map, positive correlations have a lighter color, while
inverse or negative correlations have darker colors. Some of these correlations include
people 18 years of age or younger and people of Hispanic origin. There is an inverse
relationship between Asian origin and obesity. And correlations between the tax on sugary
drinks and Hispanic origin, and between African American origin and obesity and food
insecurity. There are also correlations between the tax on sugary drinks and obese minors.
Between farmers markets and assistance methods like WIC and SNAP. Between obese
minors and reduced price lunches and some other correlations.
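A minimal sketch of the smaller data frame and correlation heat map Karen describes, using pandas and Seaborn on invented values (only four of the variables are shown), could look like this:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented values standing in for a few of the 28 selected variables.
df = pd.DataFrame({
    "adult_obesity_pct":       [30.2, 25.4, 28.9, 33.1, 27.5],
    "food_insecurity_pct":     [18.4, 10.2, 14.7, 21.0, 12.9],
    "snap_participation_pct":  [20.1, 9.8, 15.3, 24.6, 11.7],
    "reduced_price_lunch_pct": [45.0, 30.5, 38.2, 52.3, 33.8],
})

# Heat map of the correlation matrix for the selected variables; annot prints
# the correlation coefficient inside each cell.
sns.heatmap(df.corr(), annot=True, cmap="vlag", center=0)
plt.tight_layout()
plt.show()
```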
SUSANA: Can we try to trace some of these relationships?
KAREN: Of course. I will use Seaborn. Let's look at the white population and the
use of food stamps.
SUSANA: How interesting. Look at that positive correlation. The percentage of the
white population is generally associated with the population using food stamps. Look at
this case. This observation had an 80 percent white population and the percentage of
people using food stamps was approximately 80 percent.
KAREN: Exactly. Note that this does not mean that 80 percent of white people used
food stamps. But in that observation, 80 percent of the people who used food stamps
were white. What else would you like to see?
SUSANA: What about the percentage of the Hispanic population that uses food
stamps? How interesting! There we see an inverse relationship.
KAREN: Exactly. And look at this: In this observation they make up approximately
25 percent of the population. But only about 50 percent of them participate in the food
stamp program.
SUSANA: Can you show me the Hispanic population and their participation in
reduced price lunches in public schools?
KAREN: Of course. Give me a minute. With these simple regression diagrams we
can see that in 2010 the white population was highly correlated with the use of the SNAP
program or food stamps. We can see that the Hispanic population is not highly correlated
in this time period. However, the Hispanic population does have a positive correlation
with reduced price lunches in public schools.
SUSANA: These variations could be due to a variety of reasons, including eligibility,
reporting, dynamic policies, and program usage. More than anything our teams want to
know what the variations are so we can customize our approach accordingly.
KAREN: Remember that these graphs don't establish statistical significance. But they can
help us understand the data and become familiar with it.
SUSANA: I understand. What I would like to see is how those observations play out
in different states in the United States. Since our offices are regional, if we could see
relationships based on where issues are a problem and where they are not, our team
could determine how to address some of these issues.
KAREN: Yes, that is possible. The models I shared with you don't look that good.
Let me organize the information a little better so you can interact with the data on a map.
Read this article and find out how data scientists clean and prepare a data set.
The brain can process an image much faster than it can process text. That's why
data modeling has so much potential. Models allow you to create statistics and visualize
relationships between data that cannot be easily explained in text format.
In this unit, you will explore the steps of data science methodology in modeling
and evaluation, and see the different ways a data scientist creates, evaluates, and tunes a
model to tell a story.
Additionally, you will have the role of an intern at Nutri Mondo, an organization
that uses data science to solve problems related to food insecurity and other food-related
conflicts. Read the message from Nutri Mondo director, Susana Maciel, to establish the
learning context in this unit.
Learning objectives:
Below you can read the email from Susana Maciel, the director you will be working
with on this case study:
It seems that we all suffer from information overload or data excess. The good
thing is that there may be an easy solution to that, simply by using your eyes more. That
is, visualizing information so that we can see important patterns and connections, and
then designing that information so that it makes more sense, or so that it tells a story, or
so that it allows us to focus only on the relevant information. Otherwise, the information
displayed may be really cold.
Let's see. Here we have the billion-dollar diagram. And this came out of the frustration I
had having to write an article about numbers, about billions of dollars. That is, they lack
any meaning without context. 500 billion for the pipeline. 20 billion for this war. That way
it doesn't make any sense, the only way to understand it is to visualize it and relativize it.
So, I dug into a bunch of figures published in various media and then adapted the
boxes to the quantities. And the colors that represent the motivation behind money. So
purple means combat, red means donations and green means speculation. And what is
immediately perceived is that a different relationship with numbers begins to be
established. You can literally see them. But the most important thing is that we begin to
see patterns and connections between the numbers that we would otherwise find
scattered in multiple press articles.
Let me point out a few that I like. These are the income of OPEC, the green box
here, 780 billion a year. And this little pixel in the corner of three billion is the funds for
climate change. Americans, an incredibly generous people, donate more than $300 billion
to charity every year, compared to the total foreign aid donated by the 17 most
industrialized countries of $120 billion.
And, of course, the Iraq war, whose expected cost was barely 60 billion back in 2003.
And that has multiplied slightly: together with Afghanistan, it now comes to about $3 trillion. So it's great because
now we have this environment where we can add numbers as well. So we can say, a new
number comes out... let's look at the African debt.
How much of this diagram do you think the debt that Africa owes to the West would
take up? Let's take a look. There it is: 227 billion is what Africa owes. And the recent
financial crisis, how much of this diagram might that amount take up? What did it cost
the world? Let's see. Woohoo! I think that's the appropriate sound effect for that much
money: 11.9 trillion.
Now let's imagine a landscape of the world's fears. Let's take a look. These are
mountains, a range of hills: a timeline of global panic as broadcast in the media. I'll
label it for you in a second. The height shows the intensity of certain fears as reported
in the media. Let me show you.
This one in pink is swine flu. Bird flu. The SARS epidemic is the brown one. Do you
remember that one? The millennium bug, a terrible disaster. These small green spikes are
asteroid collisions. And here, in summer, killer wasps. So these are what our fears seem
to be over time in our media. But what I love, I'm a journalist, is finding hidden patterns, I
love being a data detective. And there is a very interesting and strange pattern hidden in
this data that can only be seen by visualizing it. Let me highlight them.
Look at this line. This line relates to violent video games. You see, there's a kind
of strange, consistent pattern in the data: identical spikes every year. If we look in detail,
we see that the peaks repeat in the same months every year. Why? In November,
video games are launched for the Christmas season and there may be greater concern
about their content.
But April is not a particularly relevant month for video games. Why April? In April
1999 was the Columbine massacre, and since then, that fear has been remembered by the
media and resonates in the minds of some throughout the year.
We have flashbacks, anniversaries, court cases, even copycat shootings, all
fueling fear as the theme of the day. And there is another pattern here. Do you see it? Do
you see that gap? There is an empty space, and it affects all the other news. Why is there
a gap there? Do you see when it starts? September 2001, when we had something very real to
fear.
I've been working as a data journalist for about a year, and I keep hearing one
phrase all the time, that is: "Data is the new oil." And data is a ubiquitous resource that we
can shape to innovate and generate new insights, and it is all within our reach and can be
easily extracted. It's an unhappy metaphor these days, especially if you live in the Gulf of
Mexico. That's why I would adapt this metaphor a little, and I would say that information
is the new soil. Because for me, it is a fertile and creative medium.
Over the years, we have deposited a huge amount of information and data online,
we irrigate it with networks and connectivity, and it has been worked and cultivated
by unpaid workers and by governments. And, well, I wish I could stretch the metaphor a
little further.
Can you guess what this data shows? Something that rises twice a year, once at Easter
and then two weeks before Christmas, has a mini peak every Monday, and then
flattens out during the summer. Let's hear your answers. (Audience: Chocolate) David
McCandless: Chocolate. You might well want some chocolate. Any other ideas? (Audience:
Shopping) DM: Shopping. Yes, retail therapy could help. (Audience: Sick leave) DM: Sick
leave. Yes, you would certainly like some time off. Shall we see?
This is the data that Lee Byron and I obtained by scraping 10,000 Facebook status
updates for the phrases "broke up" and "breakup," and this is the pattern we
found: people clearing out for spring break […], coming out of bad weekends on Mondays,
being single over the summer. And the lowest point of the year, of course, is
Christmas Day. Who would do that? So there is a titanic, unprecedented amount of
data.
But if you ask the right kind of question, or approach it in the right way, interesting
things can emerge. So information is beautiful. Data is beautiful. I wondered if I could
make my own life beautiful. Here is my visual CV. I'm not really sure I pulled it off. It's quite
blocky, and the colors aren't that wonderful either. But I wanted to convey something to
you.
I started as a programmer, and then I worked as an editor for many years, about 20
years, in the press, the Internet and then in advertising and I recently started designing.
And I've never been to design school. I've never studied art or anything like that. I am a
self-taught person who learns by doing. And when I started designing, I discovered a
strange thing about myself. I already knew how to design, but not in the sense of being
very brilliant, but rather of having a sensitivity to the concepts of grid and space, as well as
alignment and typography. It's as if being exposed to all this in the media for years has
instilled in me a kind of design literacy. And I don't feel unique.
I think every day, all of us are now touched by information design. It has spread
before our eyes across the Web, and we are all visualizers now; we all demand a visual
aspect to our information.
And there is something almost magical about visual information. It doesn't take
effort, but skill. And if you're navigating a dense jungle of information, coming across a
beautiful graph or a beautiful data visualization is a relief, it's like coming across a clearing
in the jungle. And this made me curious, which led me to the work of a Danish physicist,
Tor Norretranders, who converted the bandwidth of the senses into computer terms. So here
we go.
These are the senses, and how much pours into them every second. The sense of sight is the
fastest. It has the same bandwidth as a computer network. In addition, we have touch,
with the approximate speed of a USB key. As for hearing and smell, these have the
performance of a hard drive.
And finally, poor old taste, whose throughput is barely equivalent to that of
a calculator. And this little square in the corner, 0.7 percent, that's the amount we're
actually aware of. So most of perception, the bulk of it, is visual, it keeps pouring in, and it is
unconscious.
The eye is very sensitive to variations in color, shape, and pattern. The eye
loves them and considers them beautiful. It is the visual language. And if you combine this
visual language with mental language, which deals with words, numbers and concepts,
you begin to speak two languages simultaneously, one improving the other.
So first the eye takes it in, and then the concepts follow. And it is two languages
working at the same time. So we could use this new type of language, if you will, to modify
our perspective or change our opinions. Let me ask you a simple question with a very
simple answer. Who has the largest military budget? Surely the US, right? Enormous: 609
billion in 2008; 607 billion, rather. So enormous that it can cover the sum of all the
rest of the world's military budgets. Gobble, gobble, gobble.
Now, we can look at Africa's total debt and the UK's budget deficit for comparison. So this
might well confirm your view that the United States is a kind of military war machine
prepared to dominate the world with its enormous industrial and military apparatus. But
is it true that the US has the largest military budget? Well, it is an incredibly rich
country. In fact, it is so extremely rich that it can contain the economies of the next four
most industrialized nations inside it. That's how immensely rich it is. Naturally, then, it has
a huge military budget.
So, to be fair and modify our perspective, we have to compare it with another set
of data, and that set of data is the GDP, or the income of the country. Who has the largest
budget in proportion to GDP? Let's see.
This changes the picture considerably. Other countries that, perhaps, were not
considered, come to light and the US drops to eighth place. Now we also do it with
soldiers. Who has the greatest number of soldiers? China, for sure: 2.1 million.
Once again, that fits the view that China is a militarized regime ready to mobilize
large armed forces.
But China has a huge population. So if we do the same thing, we see a radically
different image. China drops to 124th place. In reality, it has a relatively small army when
other data is taken into account.
So absolute figures, like the military budget, do not provide the full picture in a
connected world. They are not as true as they could be. We need relative figures
that connect with other data so that we can see a more complete picture, and that leads
us to change our perspective.
As Hans Rosling, the teacher, my teacher, said: "Let the data change the way you
think." And if it can be done, perhaps behavior can also be changed. Let's look at this one.
I'm a bit of a health freak. I like taking vitamin supplements and staying fit, but I
don't understand what's going on with the evidence. There is always contradictory
evidence. Should I take vitamin C? Should I eat durum wheat? This is a visualization of the
evidence related to nutritional supplements.
This type of diagram is called a balloon race. The higher up the image a supplement
appears, the more evidence there is for it. The bubbles correspond to popularity, measured
by Google hits. So you can immediately grasp the relationship between efficacy and
popularity, but you can also, once the evidence is graded, draw a "worth it" line. Supplements
above that line are worth investigating, but only under the conditions listed below it, and
supplements below the line are perhaps not worth investigating. Now,
this involves an enormous amount of work. We have scoured about 1,000 studies from
PubMed, the biomedical database, and compiled and classified them. And it was very
frustrating for me because I still had 250 visualizations to do for my book, and I spent a
month doing this, and I only wrote two pages.
But what it points to is that the information displayed in this way is a form of
knowledge compression. It's a way to squeeze a huge amount of information and
understanding into a small space. And once you select the data and once you clean that
data, and once it's ready, you can do interesting things.
So I turned this into an interactive application, so now you can generate this
visualization over the Internet; this is the online view. And I can say, "Yes, fantastic,"
since it generates itself. And I can ask it, "Show me the things that affect heart health." So
let's filter: I select the heart, if that's what interests me. Then I think, "No, no, I don't want
to take anything synthetic. I just want to see the herbs and plants. I want all natural
ingredients." And the application generates the answer from the data.
The data is stored in a Google Doc, and the image is literally generated from that data.
So the information is now alive, it's a living image, and I can update it in a second. New
evidence comes out; I just change one row in a spreadsheet. Wow! Once again, the image
regenerates automatically. This is great.
It's a kind of living thing, something you can take further, and it can go beyond
numbers. And I like to apply information visualization to ideas and concepts. This is a
visualization of the political spectrum, in an attempt to try to understand how it works and
how ideas infiltrate from the government into society and culture, into families, into
individuals, into their beliefs and again around a cycle.
What I like about this image is that it is made up of concepts, it explores our vision
of the world and helps us, it helps me, in any case, to see what others think, to see where
they are coming from. It felt incredibly good to do. And the most exciting thing for me
when designing it was that, while I was designing this image, I desperately wanted this
side, the left side, to be better than the right, being a journalist with left-leaning views,
but it was not possible, because that would have created an unbalanced and biased
diagram.
So, in order to create a really complete picture, I had to honor the views of the
right side, and at the same time it was uncomfortable to recognize how many of those
qualities were also really mine, which turned out to be very upsetting and uncomfortable.
[…] But not too uncomfortable, because there is something unthreatening about seeing a
political perspective, as opposed to being told one or forced to listen to one.
In fact, one is able to hold conflicting points of view with joy, when one can see
them. It's even fun to get involved in them because it's visual. And that's what excites me,
seeing how data can change my point of view and modify my way of thinking, beautiful,
precious data.
So, to recap, I wanted to say that for me design is about solving problems and
offering elegant solutions. And information design is about solving information problems.
And it seems like we have a lot of information problems these days in our society, from
overload and saturation to breakdowns in trust and reliability, as well as rampant
skepticism and lack of transparency, or even just "interestingness."
That is, I can find information that is too interesting. This one has a magnetic
quality that draws me in. Therefore, visualizing information can give us a very quick
solution to this type of problem. And even when the information is terrible, the visual can
be very beautiful. And we often achieve greater clarity or the answer to a simple question
very quickly, like this one about the recent Icelandic volcano: which emitted more CO2,
the grounded planes or the volcano? Let's see.
We look at the data and see that the volcano emitted 150,000 tons, while the planes
would have emitted 345,000 if they had flown. So, in effect, it was our first carbon-neutral
volcano. […] And this is beautiful. Thank you.
TEDGlobal. (July 2010). David McCandless: The beauty of data visualization [video
file].
The brain processes visual information a thousand times faster than written
information:
What would this mean for data scientists' effective ways of communicating project
results?
Might you find visualization of a complex data set interesting?
Might you find complex data visualization misleading and inaccurate?
Click on the image below to see an interactive diagram that summarizes each step of
the data science methodology. In this unit, you will focus on Modeling and Evaluation .
Module: What is data modeling?
Watch this video from IBM and discover how data scientists approach the data
modeling phase in data science methodology.
Modeling: Concepts.
Modeling is the stage of data science methodology where the scientist can taste
the sauce and determine if it is good or if it needs a little more seasoning. This part of the
course aims to answer two key questions: First, what is the purpose of data modeling?
And secondly, what are the characteristics of this process?
Data modeling focuses on developing models that are descriptive or predictive. For
example, a descriptive model might analyze a situation as follows: If a person did that,
then they are likely to prefer this. A predictive model attempts to predict an outcome, such
as a yes or no answer. These models are based on the analytical approach that has been taken,
regardless of whether it is based on statistics or machine learning.
The data scientist will use a training set for predictive modeling. A training set is a
set of historical data in which the results are already known. The training set works as an
indicator to determine whether the model should be calibrated.
At this stage the data scientist will work with different algorithms to ensure that
the variables in question are really necessary. The success of data compilation,
preparation and modeling depends on understanding the problem at hand and taking the
appropriate analytical approach. The data allows the question to be answered and, like
the quality of the ingredients in the kitchen, paves the way to the result.
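As a rough illustration of the training-set idea described above (not the course's own model), the sketch below fits a simple predictive classifier on historical records with known outcomes; the file name and column names are assumptions made only for this example.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Hypothetical historical data in which the outcome is already known.
    df = pd.read_csv("patients.csv")
    X = df.drop(columns=["readmitted_30_days"])   # candidate predictor variables
    y = df["readmitted_30_days"]                  # known outcome

    # Hold back part of the history so the model can later be checked
    # against results that are already known.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    model = DecisionTreeClassifier(max_depth=4, random_state=42)
    model.fit(X_train, y_train)                   # calibrate on the training set

    print("Accuracy on held-out data:",
          accuracy_score(y_test, model.predict(X_test)))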
The ultimate goal is to get the data scientist to a point where they can create a
data model that answers the question. If dinner is ready to be served and there is a very
hungry diner at the table, the key question is this: is the meal I made enough? Well, let's
hope so.
In this stage of the methodology, the evaluation of the model, the implementation
and the comments guarantee that the answer is close and relevant. This relevance is
fundamental for the field of data science in general since it is a new field of study and we
are interested in the possibilities it has to offer. The more people benefit from the results
of this practice, the more this field will develop.
This concludes the section of this course called: From modeling to evaluation, in
which we analyze the key concepts that are related to modeling. Thanks for your time.
Read this article and discover how companies can derive value from data
repositories with machine learning.
Watch this video from IBM and discover some of the processes and tools data
scientists use to evaluate a predictive model.
Evaluation
The evaluation of the model goes hand in hand with its creation since the modeling
and evaluation stages are carried out iteratively. The evaluation of the model is carried
out during its development and before its implementation. The evaluation allows the
quality of the model to be calculated but also provides the possibility of seeing if it
matches the original request. The evaluation answers the following question: Does the
model used really answer the initial question or should adjustments be made?
Model evaluation can have two main phases. The first is the diagnostic measures
phase, which is used to ensure that the model is performing as intended. If the model is
predictive, a decision tree can be used to evaluate whether the response it offers agrees
with the initial design. It can also be used to see if there are areas that require
adjustments. On the other hand, if the model is descriptive, that is, if it is a model in which
relationships are evaluated, a test set with known results can be applied and the
adjustments that the model requires can be made.
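One way to carry out such a diagnostic check, sketched here under the same assumptions as the training example above (the names `model`, `X_test`, and `y_test` come from that sketch), is to score the held-out records and compare the predictions against the known results.

    from sklearn.metrics import confusion_matrix, classification_report

    # Predictions on records whose real outcomes are already known.
    y_pred = model.predict(X_test)

    print(confusion_matrix(y_test, y_pred))       # counts of correct and incorrect calls
    print(classification_report(y_test, y_pred))  # precision, recall, F1 per class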
The second phase of evaluation that can be used is the statistical significance test.
This type of evaluation can be applied to ensure that data is appropriately handled and
interpreted within the model. Additionally, this type of evaluation is designed to avoid
unnecessary questioning once the answer is revealed.
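The course does not say which significance test is used, but as one common illustration, a two-sample t-test can check whether an observed difference between two groups is likely to be real; the `length_of_stay` column below is hypothetical, added only for this example.

    from scipy import stats

    # Compare a measured value between the two outcome groups (hypothetical column).
    readmitted = df.loc[df["readmitted_30_days"] == 1, "length_of_stay"]
    not_readmitted = df.loc[df["readmitted_30_days"] == 0, "length_of_stay"]

    t_stat, p_value = stats.ttest_ind(readmitted, not_readmitted, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p suggests a real difference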
Now let's return to our practical case to be able to apply the evaluation component
within the data science methodology. Let's discuss a way to find the most effective model
through a diagnostic measure based on adjusting one of the parameters in the model
creation. Specifically, we will discuss how to adjust the relative cost of misclassifying
positive or negative outcomes.
For budgetary reasons, the risk-reducing intervention could not be applied to most
or all patients with congestive heart failure. In any case, many of them would not have
required readmission.
On the other hand, because not enough patients with high-risk congestive heart
failure were targeted, the intervention would not be as effective as it should be in
improving patient care. So how do we determine which model is the most effective?
As you can see on this slide, the most effective model is the one that provides the
maximum separation between the blue ROC curve and the red reference line. We can see
that, with a relative misclassification cost of 4 to 1, Model 3 is the best of the 4. Also, in
case you were wondering, ROC stands for receiver operating characteristic curve. This
curve was first developed during World War II to detect enemy aircraft using radar. Since
then, it has also been used in many other fields.
Currently, it is commonly used in machine learning and data mining.
The ROC curve is a useful diagnostic tool for determining the most effective
classification model. This curve quantifies the performance of a binary classification model
as it classifies positive and negative results while a discrimination criterion is varied.
In this case, the criterion is a relative cost of misclassification. By comparing the true
positive rate against the false positive rate, for different values of the relative cost of
misclassification, the ROC curve helped select the most effective model. This concludes
the evaluation section of this course. Thanks for your time.
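The video refers to IBM tooling, but the same idea can be sketched with scikit-learn: express the 4-to-1 relative misclassification cost as a class weight, then plot the ROC curve against the reference line. The data and variable names are the hypothetical ones from the earlier sketches, not the case study's actual model.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_curve, auc
    import matplotlib.pyplot as plt

    # Penalize missing a true readmission four times as heavily as a false alarm.
    weighted_model = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, random_state=42)
    weighted_model.fit(X_train, y_train)

    scores = weighted_model.predict_proba(X_test)[:, 1]   # predicted readmission risk
    fpr, tpr, _ = roc_curve(y_test, scores)

    plt.plot(fpr, tpr, label=f"Model (AUC = {auc(fpr, tpr):.2f})")
    plt.plot([0, 1], [0, 1], "r--", label="Reference line")   # the red baseline
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()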
In the World of Data Science, visit Tegucigalpa and watch the Data Modeling
video. Watch as a data science team evaluates some of the descriptive models they
created for this project.
The team discusses ways that certain simple visualizations can be used to model
data after it has been prepared. This step is preparation for deployment.
FRANCIS: Hello everyone. I organized this meeting so that we can evaluate the
descriptive models that Karen and Jonathan developed for Susana's project.
JONATHAN: Thank you, Francis, for organizing this meeting. For greater
coordination, I remind you that we are in the stage of the process in which we focus on
evaluating whether the model really meets our goal. The initial business plan for this
project was to use data to understand the relationship between food availability and diet-
related problems.
As well as exploring other relationships that could influence these problems, such
as economic income. Our goal was to create descriptive models that will help local offices
design educational programs and engagement initiatives that are relevant to the local
community.
We identified data sets from the US Department of Agriculture and the US Bureau
of Labor Statistics. We combined those data sets, removed variables that weren't relevant
to this project, and created some visualizations that I think we can get some real value
from.
SUSANA: Thank you for that excellent summary of our progress. My goal is to help
determine if the models that Karen and Jonathan offered actually meet our original goal.
Will these models help local offices design educational programs and engagement
initiatives that are relevant and effective for the local community? Once we achieve this I
will need your help to determine how we can implement this work. Let's analyze it.
KAREN: Very good. Let's examine one of our visualizations. I thought our work with
children and schools would help us evaluate and see how we can share the data and
relationships we explore.
This work contains information about some United States federal programs that
are designed to help low-income families. These models show access to stores that are
authorized to accept SNAP per thousand people. Stores that can accept WIC per thousand
people and farmers markets that accept WIC and SNAP. As well as children's eligibility for
free lunches and childhood obesity.
JONATHAN: Let's analyze the characteristics of graphs 1, 2 and 3. You can scroll to
get information about each State. Additionally, color coding is used to show magnitude. I
would like to focus on Texas and Georgia. Darker colors show a higher rate of authorized
stores in the population.
SUSANA: That information is very useful. I like being able to visualize the figures as
that makes general understanding much easier. However, can you show me what
information appears for Texas in models 1 and 2?
KAREN: 39.71.
SUSANA: So my question is this: Are there 212.8 and 39.71 stores per thousand
people or per million people? Or every thousand people who participate in SNAP? That
seems a little confusing. I wish the labels were clearer.
FRANCIS: Very good. So, keep the magnitude colors, but make the chart labels
clearer.
KAREN: Very good. We can do it. The calculation is every one million people. So for
Texas we would read it like this: 212.8 stores authorized to receive SNAP per one million
people.
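One possible way to build the kind of interactive, color-coded state map the team describes is a Plotly Express choropleth; this is only a sketch, and the file and column names below are placeholders rather than the team's real files.

    import pandas as pd
    import plotly.express as px

    # Hypothetical table: one row per state, with the per-million store rate.
    states = pd.read_csv("snap_stores_by_state.csv")

    fig = px.choropleth(
        states,
        locations="state_code",            # two-letter codes such as "TX" or "GA"
        locationmode="USA-states",
        color="snap_stores_per_million",   # darker colors mean more authorized stores
        scope="usa",
        hover_name="state_name",
    )
    fig.show()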
SUSANA: Texas and Georgia really catch my attention on these maps. It's great that
the data is overlaid on the map.
JONATHAN: Thank you. We wanted to keep this data on a single screen so that
some level of connection could be observed. Look at the graph of children's eligibility for
free lunches and childhood obesity. Look at the two highest points of free lunch eligibility
and the prevalence of childhood obesity.
KAREN: Additionally, if you look at how each State fits in with farmers markets that
accept WIC and SNAP you can see their prevalence within the State.
JONATHAN: We thought that local offices in the United States could see the
comparisons between their States.
SUSANA: Yes, that is very useful when we think about the nutrition education
programs that are aimed at families with young children and the success we had when the
agricultural community and farmers markets intervened.
However, look at free lunch eligibility and the prevalence of childhood obesity. The
X axis label is not clear. It says "percentage of obesity" but the figures are not percentages.
I don't know how to read that information. And, if I have doubts, it is evident that the
teams will have them too.
SUSANA: I would also like to have the data set file. I am referring to the file you
cleaned, with the set of selected variables. I know the work we did is not too complex in
mathematical terms, but the volume of data we are modeling is very large. It would
also be useful to be able to re-examine the data set, query the variables, and show
them to the teams.
Read this information to understand the importance of using data to tell a story.
Hammond, K. J. (May 1, 2013). The value of big data is not the data. Harvard
Business Review Digital Articles.
Li, M. (October 13, 2015). The best data scientists know how to tell stories.
Harvard Business Review Digital Articles.
In this Data Science Basics unit, we will cover the last phases of data science
methodology: implementation and feedback. Remember that data science is a process,
not a final answer. As you will see in this unit, when data scientists reach the “end” of a
question, this stage can surface new questions, statistics, and needs. Curiosity, possibility,
and technology are driving data scientists further and further into the field. Think about
how you could become part of this important field and how you could continue learning
about this valuable and exciting work.
Additionally, you will have the role of an intern at Nutri Mondo, an organization
that uses data science to solve problems related to food insecurity and other food-related
conflicts. Read the message from Nutri Mondo director, Susana Maciel, to establish the
learning context in this unit.
Learning objectives:
Below you can read the email from Susana Maciel, the director you will be working
with on this case study:
3 WAYS TO SPOT A BAD STATISTIC
Today I will talk to you about statistics. If that makes you immediately suspicious,
that's fine; it doesn't make you a conspiracy theorist, it makes you skeptical. When it
comes to numbers, especially today, you should be skeptical. But you should also be able
to distinguish reliable numbers from those that are not. So today I want to give you tools
to do that.
But before doing so, I want to clarify which numbers I will talk to you about. I'm not
talking about claims like, "9 out of 10 women recommend this anti-aging cream." I think
everyone rolls their eyes at numbers like that. What people now question are statistics
like, "The unemployment rate in the US is 5 percent." This statement is different because
it does not come from a private company, it comes from the government.
On the one hand, there are those who say that these statistics are crucial, and
necessary to understand society as a whole to overcome emotional anecdotes and
measure progress objectively. On the other hand, there are those who say that these
statistics are elitist, even manipulated; that they do not make sense and that they do not
really reflect what happens in people's daily lives.
It seems that the second group is winning the fight so far. We live in a world of
alternative facts, where people do not connect with these statistics, which should be the
starting point for debate. This is a problem. There are currently moves in the US to
remove some government statistics.
How to legislate health services without data on health or poverty? How can we
have a public debate on immigration if we do not agree on how many people enter and
leave the country? Statistics come from the state; that's where its name comes from. The
goal was to better measure the population in order to better serve them. So we need the
government's numbers, but we don't have to blindly accept them or blindly reject them.
We need to have the skills to spot bad statistics.
So I became convinced that to get accurate numbers I had to question as many of them
as I could. This is how I became a data journalist. My job is to find data sets and share
them with the public. Anyone can do it; you don't need to be a nerd. You can ignore those
words; people use them to sound smart while pretending to be humble. In
short, anyone can do it.
I want to give you three questions that will help you spot bad statistics. Question
number one: can you see uncertainty? One thing that has changed people's
relationship with numbers, including their trust in the media, has been the use of political
polls. Personally, I have a lot of problems with political polls because I believe that the role
of journalists is to report facts and not try to predict them, especially when those
predictions can damage democracy by telling people: don't even vote for that guy, he
doesn't have a chance. Let's put that aside and talk about the accuracy of this task.
Based on national elections in the United Kingdom, Italy, Israel and of course, the
most recent US presidential election. In the US, using polls to predict election results is as
accurate as using the Moon to predict hospital admissions. No, seriously, I used real data
from an academic study to come up with it.
There are many reasons why polls have become inaccurate. Our societies are
diverse, which makes it difficult for pollsters to obtain a representative sample of the
population in their surveys. People are reluctant to answer telephone surveys, and also,
surprisingly, people lie. But you didn't need to know all that to see what the media did. The
probability of Hillary Clinton winning was reported to the decimal place. We don't even use
decimal places to describe the temperature.
How can predicting the behavior of 230 million voters in this country be that precise?
And then we have those fancy graphics. Many data visualizations exaggerate accuracy,
and it works; those graphs disarm our skepticism. When you hear a number you may
feel skeptical; when it is put on a graph, it feels like objective science, and it is not. So I looked
for better ways to communicate this to people, to show them the inaccuracy in the
numbers.
I started taking sets of real data and turning them into hand-drawn
visualizations, so people could see how imprecise the data is, and see that they were made
by a human who found the data and visualized it. For example, instead of showing the
probability of getting the flu in any given month, you see the rough distribution of the
flu season. This is… […] a bad shot, shown in February.
But it's a more responsible way to display data, because showing exact probabilities might
encourage people to get their flu shots at the wrong time. The point of these shaky lines
is that people remember the imprecision; they don't necessarily walk away with a specific
number, but they do remember the important facts.
Data like this, about injustice and inequity, marks our lives. Facts like: African Americans
and Native Americans have shorter life expectancies than other races, and that won't change
anytime soon. Facts like: prisoners in the US can be held in solitary confinement in cells
smaller than the average parking space.
The point of these visualizations is also to remind people of some very important
statistical concepts, concepts like averages. Suppose you hear, "The average swimming
pool in the US contains 6.23 fecal accidents." That does not mean that every pool in the
country contains exactly 6.23 incidents. To show that, I went back to the original data, which
comes from the CDC, who studied 47 aquatic facilities. I spent an afternoon redistributing
the excrement so people could see how misleading averages can be.
The second question to ask yourself to detect an erroneous number: Am I reflected
in the data? This question is also about averages, because people get frustrated with
national statistics when those statistics don't tell them who is winning or losing from
national policy. It's understandable why people get frustrated with global averages when
they don't match their personal experiences.
I wanted to show them how data relates to everyday life. I started an advice
column "Dear Mona," where people write me questions and concerns and I try to respond
with facts. People ask everything, like, "Is it normal for me to sleep in separate beds with
my wife?", "Do people regret getting tattoos?", "What does it mean to die of natural
causes?"
All questions are good, because they make you think about ways to find and
communicate these numbers. If someone asks, "How much urine is too much urine?",
which is a question I was asked, you want to make sure the visualization makes sense to as
many people as possible.
These numbers are available. Sometimes they are buried in the appendix of an
academic study. And they are certainly not inscrutable; if you really want to test these
urine volume numbers, you could try it yourself with a bottle.
The point of this is not necessarily that every data set relates to you. I'm interested
in how many women were fined in France for wearing the face veil, or the niqab, even if I
don't live in France or wear the face veil. The point is to get as much context as possible.
It's about stepping back from a single data point, like "the unemployment rate is 5 percent,"
and seeing how it changes over time, or seeing how it changes by educational status (this is
why your parents always wanted you to go to college), or seeing how it varies by gender.
Today the unemployment rate for men is higher than that for women. Until the
early eighties, it was the other way around. This is one of the biggest changes in American
society, and it's all on the graph, if you look beyond the averages. The axes are
everything; if you change the scale, you can change the story.
The third and final question I want you to think about when looking at a statistic is:
How was the information collected? So far, I've only talked about how the data is
communicated, but how it was collected is just as important. I know it's difficult, because
the methodology can be opaque and somewhat boring, but there are some simple steps
to review this.
I'll use one last example. One survey found that 41% of the country's Muslims
supported jihad, which is obviously scary, and was reported everywhere in 2015. When I
want to review a number like that, I'll start by finding the original questionnaire. It turns
out that the journalists who reported that statistic ignored a later question in the survey
that asked respondents to define "jihad." And most defined it as, "The personal and
peaceful struggle to be more religious." 16% defined it as a "holy and violent war against
non-believers." This is an important point: based on those numbers it is possible that no
one in the survey that defined it as a holy war said they also supported it.
Those two groups may not overlap at all. It is also pertinent to ask how the survey was
conducted. This was an opt-in survey: anyone could find it on the internet and
answer it. It is not possible to know whether the respondents even identified themselves as Muslims.
Ultimately, there were 600 respondents to that survey. There are nearly three million
Muslims in this country, according to the Pew Research Center. It means that one in every
5,000 Muslims in this country was surveyed.
This is one of three reasons why government statistics are better than private
ones. A private survey can talk to a few hundred people, or a few thousand, or, like L'Oreal
did in 2005 to sell skin care products, talk to just 48 women and claim the products work.
Private companies don't care much about getting the numbers right, they just need
the right numbers. Government statisticians are not like that. In theory, at least, they are
impartial, because the majority do their jobs regardless of who is in power. They are
public servants. And they do their job properly, they don't just talk to a couple hundred.
Those unemployment numbers that I've talked about come from the Bureau of Labor
Statistics, and to come up with their estimates, they talk to about 140,000
businesses in the country.
TEDNYC. (2017, February). Mona Chalabi: 3 ways to spot a bad statistic [video file].
When you look at the presentation of data and statistics, how can you tell if what is
presented is reliable or not? Answer the following questions:
Can you detect the uncertainty in the data, or is what is presented treated as verified fact?
If the data relates to some aspect of your life, can you see yourself reflected in the
data? In other words, does the data contradict or support your own experience?
Can you understand how the data was collected?
Click on the image below to see an interactive diagram that summarizes each step
of the data science methodology. In this unit, you will focus on Implementation and
Feedback .
Watch this video from IBM and learn how data scientists implement a predictive
model for a data science project.
Implementation.
In the business environment, stakeholders have different experts who will make
this possible, such as the solution owner, the marketing department, application
developers, and IT management.
Once the model is evaluated and the data scientist is convinced that it will work, it
is finally implemented and tested. Depending on the purpose of the model, it can be
implemented for a limited group of users or in a test environment that increases
confidence in the application of the result in order to use it throughout the organization.
Now let's discuss the use case related to the deployment application. To prepare
for the implementation of the solution, the next step was to assimilate the knowledge for
the business group that would design and manage the intervention program to reduce the
risk of readmission.
In this situation, members of the business sector translated the results of the
model so that clinical staff could understand how high-risk patients could be identified and
design appropriate intervention actions. Clearly, the goal was to reduce the likelihood of
these patients being readmitted within 30 days of their discharge date.
During the business requirements stage the intervention program manager and
her team wanted an application that would provide near real-time automated risk
assessments for congestive heart failure. Additionally, this application needed to be easy
to use for clinical staff, and preferably browser-based and tablet-compatible so that staff
could use it everywhere.
Patient data was generated during the hospital stay. These were automatically
prepared in the format needed by the model and each patient received a score as the
discharge date approached. Thus, doctors obtained an updated risk assessment for each
patient, allowing them to see which patients should receive intervention after discharge.
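A minimal sketch of that scoring step is shown below, reusing the hypothetical `model` and training columns from the earlier sketches; the patient record and helper function are illustrative, not the hospital's actual application.

    import pandas as pd

    def score_patient(model, patient_record, feature_columns):
        """Return the model's estimated probability of 30-day readmission."""
        # Arrange the record in the same column layout used for training.
        row = pd.DataFrame([patient_record])[feature_columns]
        return float(model.predict_proba(row)[0, 1])

    # Clinical staff would see this number refresh as new data arrives during
    # the stay; `new_patient` is a hypothetical dict of field values.
    risk = score_patient(model, new_patient, list(X_train.columns))
    print(f"Estimated readmission risk: {risk:.1%}")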
As part of the implementation of the solution the intervention team developed and
provided training for clinical staff. Additionally, the processes for recording and
monitoring patients receiving the intervention needed to be developed in collaboration
with IT developers and database administrators, so that the results could pass through the
feedback stage and the model could be adjusted over time.
This slide displays an interactive summary of the risk relative to the patient
population within a given model node, so that clinicians can understand the mix of
conditions for this patient subgroup. Additionally, this report provides a detailed summary
about the patient in question, including their expected risk and medical history details,
providing a concise summary for the physician. This concludes the implementation section
of this course. Thanks for your time.
In the World of Data Science, visit São Paulo and watch the Implementation
video. Watch as a data science team considers how best to implement a descriptive model
they have developed for a data science project.
In the World of Data Science, visit São Paulo and see the implementation options.
The data science team shares strategies for implementing the model in the organization.
Four options are evaluated.
Read this article and find out how to improve your data visualization skills.
Watch this video from IBM and discover why feedback is critical to the
effectiveness of data science methodology.
Feedback
The feedback stage plan included these steps: First, the review process was to be
defined and implemented, with overall responsibility for measuring the results of the risk
model on the population at risk of congestive heart failure. Clinical management
executives were to have overall
responsibility for the review process. Second, patients with congestive heart failure who
received intervention had to be followed up and their readmission outcomes had to be
recorded. Third, the intervention needed to be measured to determine its effectiveness in
reducing readmissions. For ethical reasons, patients with congestive heart failure should
not be divided into controlled groups and treatment groups. Rather, readmission rates
needed to be compared before and after implementation of the model in order to
measure its impact.
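One way to make that before-and-after comparison, sketched here with placeholder counts rather than real results, is a two-proportion z-test on the readmission rates from the two periods.

    from statsmodels.stats.proportion import proportions_ztest

    # Placeholder counts: readmissions and total CHF discharges, before vs. after.
    readmitted = [180, 150]
    discharged = [1000, 1000]

    z_stat, p_value = proportions_ztest(count=readmitted, nobs=discharged)
    print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # a small p suggests the change is not chance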
After the implementation and feedback stages, the impact that the intervention
program had had on readmission rates after the first year of implementation had to be
analyzed. The model then needed to be fine-tuned based on all the data collected after
implementation and the insights gained throughout these stages. Other adjustments
included: Incorporating information on participation in the intervention program and
possibly adjusting the model to add detailed pharmaceutical data. As you probably
remember, the collection of this data was initially postponed because it was not available
at that time. However, after feedback and practical experience with the model, it is likely
to be determined that incorporating that data is worthwhile. We also have to allow
for the possibility that other adjustments might be required during the feedback stage.
After sharing the model with the Nutri Mondo teams in San Antonio (TX), Atlanta
(GA), São Paulo, and Mexico City, the data science team analyzes the feedback that other
teams emailed them.
The data science team decided to expose their models and visualizations to
regional offices in the United States and to two country directors, in São Paulo, Brazil, and
in Mexico City, Mexico. After the presentation of the models, the data science team
reviewed the email comments they received from different teams.
Message 1 of 4
First of all, I wanted to thank you for allowing our office to review the models you
created. Our team certainly values this information. We have already begun to discuss the
relationship between this data and our current projects. What we find most useful:
Being able to see how national data and trends translate to the local level was a
huge benefit. The states that have the program (Texas, New Mexico and Arizona)
have different rates of poverty, diabetes and obesity in adults and children, among
others. Information at the local level allows us to have a clearer understanding.
Right now, we are analyzing the relationship between data and current tasks in
education and outreach programs.
We were able to compare some of our counties and states in depth with others in
the United States to determine who our "peers" are, in terms of the data. I would like
to partner with similar areas at the national level, so that we can learn from each
other.
We would love to have Karen and Jonathan come to town and teach us how to
interact with the data sets they created to easily generate graphs, which will then
be disseminated to our primary education programs and governments at the local
level (city and county).
We would love for Nutri Mondo to be able to compare this data to more current
trends so we can project what would happen in our area in a 5-10 year period.
Message 2 of 4
We knew how Georgia ranked on health issues, but it's great to have public
government data organized this way. Our outreach teams have already drafted press
releases so local media can focus more of their attention on the issues these models
reveal. What we liked:
The visualizations they created are not static. We can delve into them to be more
specific.
We need more of this :) We want to train someone here in using Tableau and
other tools to be able to calculate some of these numbers. Can your team help us?
A big question has arisen for Georgia and the Southeast region: How should we
divide our resources between education activities that teach nutrition and healthy
cooking, and legal efforts to bring fresh food closer to the population that
needs it most in our region? We will explore this in depth, and this data
strongly supports that work.
Message 3 of 4
Thanks for sharing these models. Part of the team here has reviewed them and
considers that it would be very useful to carry out a similar project for Mexico. We must
have local and national data to visualize our challenges with issues related to diet and
access to fresh foods. Mexico has one of the highest adult obesity rates in the world.
These are some of the comments:
Message 4 of 4
You may be aware that, here in Brazil, we have had an alarming increase in obesity
and other related health problems: hypertension, diabetes and heart disease. Poverty
represents another problem. We are accustomed, historically, to seeing health problems
related to malnutrition and undernutrition. However, as income per person at the national
level has increased considerably in recent decades, diet-related problems such as diabetes
and obesity have emerged, apparently similar to what the US data shows. As a consequence, we
have this strange combination of malnutrition due to lack of income to buy food with an
increase in diet-related diseases due to a high rate of unhealthy eating habits. I appreciate
the data models that help explain some of the problems. My comments on this:
Seeing the problems revealed by the data, I would like to compare the work of our
program with what happens in the US. How have the classes helped families
in the communities address food and nutrition issues? We have similar problems,
and we are likely to learn from each other.
I would love to investigate how migration has changed diets. Many communities in
São Paulo moved from rural areas within the past generation. These people used to
have vegetable gardens in their yards; now they spend too much time at work or
commuting, and they no longer have the time they once had to get
fresh food. Has the same happened with internal migration in the US?
Read this article to understand what data quality issues are affecting businesses,
and what you can do about them.
Case Summary
We have reached the end of our story, and we hope you will share it. Here you
learned to think like a data scientist: how to take the necessary steps to address a
data science problem and apply them to interesting real-world examples. These steps
include defining a specific business or research problem, collecting and analyzing data,
creating a model, and understanding feedback after implementing the model.
In this course you also learned methodical ways to move from problem to focus,
including the importance of understanding the business question, goals, and objectives.
And select the most effective analytical approach to answer the question and solve the
problem. You also learned methodical ways of working with data, specifically determining
data requirements, collecting appropriate data, understanding it, and preparing it for
modeling.
You also learned how to model data using the analytical approach that was
appropriate based on your data requirements and the problem you were trying to solve.
Once you selected your approach, you learned the steps to take to evaluate and
implement the model, obtain feedback, and use that feedback constructively to improve
it.
Remember that the stages of this methodology are iterative. This means that as
long as the solution is still needed, the model can continue to be improved, whether the
improvements come from constructive feedback or from examining newly available data
sources.
Through a real case study, you learned how data science methodology can be
applied in context to successfully achieve the goals that were established in the business
requirements stage. You also discovered how the methodology delivered greater value to
business units by incorporating data science practices into their daily analysis and
reporting functions.
The success of the new pilot program analyzed in this case study was
overwhelming, since doctors were able to provide better care to patients through new
tools that allowed them to incorporate timely, data-based information into decisions
related to medical care. Finally, in a few words, you learned the true meaning of a
methodology: its purpose is to explain how you can examine a problem, work with the
data to solve it, and generate an answer that addresses the root problem.
And that's how John Rollins sees it. We hope you enjoyed this course on data
science methodology and that this was a valuable experience that you want to share with
others.
Additionally, we also hope that you will check out and take other courses in the
Data Science Fundamentals Learning Path. Now, if you are ready and willing to face the
challenge, take the final exam. Thanks for your time.