
WHAT IS DATA SCIENCE?

Hello and welcome to the Data Scientist's Toolbox, the first course in the Data Science Specialization
series. Here, we will be going over the basics of data science and introducing you to the tools that will be
used throughout the series. So, the first question you probably need answered going into this course is,
what is data science? That is a great question. To different people this means different things, but at its
core, data science is using data to answer questions. This is a pretty broad definition and that's because
it's a pretty broad field. Data science can involve statistics, computer science, mathematics, data
cleaning and formatting, and data visualization. An Economist Special Report sums up this melange of
skills well. They state that a data scientist is broadly defined as someone who combines the skills of
software programmer, statistician, and storyteller/artist to extract the nuggets of gold hidden under
mountains of data. By the end of these courses, hopefully you will feel equipped to do just that. One of
the reasons for the rise of data science in recent years is the vast amount of data currently available and
being generated. Not only are massive amounts of data being collected about many aspects of the world
and our lives, but we simultaneously have the rise of inexpensive computing. This has created the perfect storm in which we have rich data and the tools to analyze it: rising computer memory capabilities, better processors, more software, and now, more data scientists with the skills to put this to use and answer questions using this data. There is a little anecdote that describes the truly exponential growth
of data generation we are experiencing. In the third century BC, the Library of Alexandria was believed
to house the sum of human knowledge. Today, there is enough information in the world to give every
person alive 320 times as much of it as historians think was stored in Alexandria's entire collection, and
that is still growing. We'll talk a little bit more about big data in a later lecture. But it deserves an
introduction here since it has been so integral to the rise of data science. There are a few qualities that
characterize big data. The first is volume. As the name implies, big data involves large datasets. These
large datasets are becoming more and more routine. For example, say you had a question about online
video. Well, YouTube has approximately 300 hours of video uploaded every minute. You would
definitely have a lot of data available to you to analyze. But you can see how this might be a difficult
problem to wrangle all of that data. This brings us to the second quality of Big Data, velocity. Data is
being generated and collected faster than ever before. In our YouTube example, new data is coming at
you every minute. In a completely different example, say you have a question about shipping times. Well, most transport trucks have real-time GPS data available. You could analyze the trucks' movements in real time if you have the tools and skills to do so. The third quality of big data is variety. In the
examples I've mentioned so far, you have different types of data available to you. In the YouTube
example, you could be analyzing video or audio, which is a very unstructured dataset, or you could have
a database of video lengths, views or comments, which is a much more structured data set to analyze.
So, we've talked about what data science is and what sorts of data it deals with, but something else we
need to discuss is what exactly a data scientist is. The most basic of definitions would be that a data
scientist is somebody who uses data to answer questions. But more importantly to you, what skills does
a data scientist embody? To answer this, we have this illustrative Venn diagram in which data science is
the intersection of three sectors, substantive expertise, hacking skills, and math and statistics. To explain
a little on what we mean by this, we know that we use data science to answer questions. So first, we
need to have enough expertise in the area that we want to ask about in order to formulate our
questions, and to know what sorts of data are appropriate to answer that question. Once we have our
question and appropriate data, we know from the sorts of data that data science works with that oftentimes it needs to undergo significant cleaning and formatting. This often takes computer
programming/hacking skills. Finally, once we have our data, we need to analyze it. This often takes math
and stats knowledge. In this specialization, we'll spend a bit of time focusing on each of these three
sectors. But we'll primarily focus on math and statistics knowledge and hacking skills. For hacking skills,
we'll focus on teaching two different components, computer programming or at least computer
programming with R which will allow you to access data, play around with it, analyze it, and plot it.
Additionally, we'll focus on having you learn how to go out and get answers to your programming
questions. One reason data scientists are in such demand is that most of the answers are not already
outlined in textbooks. A data scientist needs to be somebody who knows how to find answers to novel
problems. Speaking of that demand, there is a huge need for individuals with data science skills. Not
only are machine-learning engineers, data scientists, and big data engineers among the top emerging
jobs in 2017 according to LinkedIn, the demand far exceeds the supply. They state, "Data scientist roles
have grown over 650 percent since 2012. But currently, 35,000 people in the US have data science skills
while hundreds of companies are hiring for those roles - even those you may not expect in sectors like
retail and finance. Supply of candidates for these roles cannot keep up with demand." This is a great
time to be getting into data science. Not only do we have more and more data, and more and more
tools for collecting, storing, and analyzing it, but data science is becoming increasingly recognized as important in many diverse sectors, not just business and academia. Additionally,
according to Glassdoor, in which they ranked the top 50 best jobs in America, data scientist is THE top
job in the US in 2017, based on job satisfaction, salary, and demand. The diversity of sectors in which
data science is being used is exemplified by looking at examples of data scientists. One place we might
not immediately recognize the demand for data science is in sports. Daryl Morey is the general manager
of a US basketball team, the Houston Rockets. Despite not having a strong background in basketball,
Morey was awarded the job as GM on the basis of his bachelor's degree in computer science and his
MBA from MIT. He was chosen for his ability to collect and analyze data and use that to make informed
hiring decisions. Another data scientists that you may have heard of his Hilary Mason. She is a co-
founder of FastForward Labs, a machine learning company recently acquired by Cloudera, a data science
company, and is the Data Scientist in Residence at Accel. Broadly, she uses data to answer questions
about mining the web and understanding the way that humans interact with each other through social
media. Finally, Nate Silver is one of the most famous data scientists or statisticians in the world today.
He is founder and editor in chief at FiveThirtyEight, a website that uses statistical analysis - hard
numbers - to tell compelling stories about elections, politics, sports, science, economics, and lifestyle. He
uses large amounts of totally free public data to make predictions about a variety of topics. Most
notably, he makes predictions about who will win elections in the United States, and has a remarkable
track record for accuracy doing so. One great example of data science in action is from 2009 in which
researchers at Google analyzed 50 million commonly searched terms over a five-year period and
compared them against CDC data on flu outbreaks. Their goal was to see if certain searches coincided
with outbreaks of the flu. One of the benefits of data science and using big data is that it can identify
correlations. In this case, they identified 45 words that had a strong correlation with the CDC flu
outbreak data. With this data, they have been able to predict flu outbreaks based solely off of common
Google searches. Without these massive amounts of data, these 45 words could not have been predicted
beforehand. Now that you have had this introduction into data science, all that really remains to cover
here is a summary of what it is that we will be teaching you throughout this course. To start, we'll go
over the basics of R. R is the main programming language that we will be working with in this course
track. So, a solid understanding of what it is, how it works, and getting it installed on your computer is a
must. We'll then transition into RStudio, which is a very nice graphical interface to R, that should make
your life easier. We'll then talk about version control, why it is important, and how to integrate it into
your work. Once you have all of these basics down, you'll be all set to apply these tools to answering
your very own data science questions. Looking forward to learning with you. Let's get to it.

WHAT IS DATA?

Since we've spent some time discussing what data science is, we should spend some time looking at
what exactly data is. First, let's look at what a few trusted sources consider data to be. First up, we'll
look at the Cambridge English Dictionary which states that data is information, especially facts or
numbers collected to be examined and considered and used to help decision-making. Second, we'll look
at the definition provided by Wikipedia which is, a set of values of qualitative or quantitative variables.
These are slightly different definitions, and they get at different components of what data is. Both agree
that data is values or numbers or facts. But the Cambridge definition focuses on the actions that
surround data. Data is collected, examined and most importantly, used to inform decisions. We've
focused on this aspect before. We've talked about how the most important part of data science is the
question and how all we are doing is using data to answer the question. The Cambridge definition
focuses on this. The Wikipedia definition focuses more on what data entails. And although it is a fairly
short definition, we'll take a second to parse this and focus on each component individually. So, the first
thing to focus on is, a set of values. To have data, you need a set of items to measure from. In statistics,
this set of items is often called the population. The set as a whole is what you are trying to discover
something about. The next thing to focus on is, variables. Variables are measurements or characteristics
of an item. Finally, we have both qualitative and quantitative variables. Qualitative variables are,
unsurprisingly, information about qualities. They are things like country of origin, sex or treatment
group. They're usually described by words, not numbers and they are not necessarily ordered.
Quantitative variables on the other hand, are information about quantities. Quantitative measurements
are usually described by numbers and are measured on a continuous ordered scale. They're things like
height, weight and blood pressure. So, taking this whole definition into consideration we have
measurements, either qualitative or quantitative on a set of items making up data. Not a bad definition.
When we were going over the definitions, our examples of data, country of origin, sex, height, and weight, were pretty basic examples. You can easily envision them in a nice-looking spreadsheet, with individuals along one side of the table in rows and the measurements for those variables along the columns; a small, made-up sketch of such a table in R follows.
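As a hedged illustration only (the values below are invented purely to show the layout, not real measurements), a table like this could be built in R as a data frame:

    # hypothetical individuals; qualitative and quantitative variables side by side
    people <- data.frame(
      country   = c("Canada", "India", "Brazil"),   # qualitative
      sex       = c("F", "M", "F"),                 # qualitative
      height_cm = c(165, 180, 158),                 # quantitative
      weight_kg = c(60, 82, 55)                     # quantitative
    )
    people   # rows are individuals, columns are variables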
Unfortunately, this is rarely how data is presented to you. The data sets we commonly encounter are much messier. It is our job to extract the information we want, corral it into something tidy like the table above, analyze it appropriately, and often, visualize our results. These are just some of
the data sources you might encounter. And we'll briefly look at what a few of these data sets often look
like, or how they can be interpreted. But one thing they have in common is the messiness of the data.
You have to work to extract the information you need to answer your question. One type of data that I
work with regularly is sequencing data. This data is generally first encountered in the FASTQ format, the raw file format produced by sequencing machines. These files are often hundreds of millions of lines
long, and it is our job to parse this into an understandable and interpretable format, and infer
something about that individual's genome. In this case, the data was interpreted into expression data and used to produce a plot called a volcano plot. One rich source of information is countrywide censuses. In
these, almost all members of a country answer a set of standardized questions and submit these
answers to the government. When you have that many respondents, the data is large and messy. But
once this large database is ready to be queried, the answers embedded within are important. Here we have a very basic result of the last US Census, in which all respondents are divided by sex and age. This distribution is plotted in a population pyramid plot. I urge you to check out your home country's census bureau, if available, and look at some of the data there. This is a mock example of an electronic medical
record. This is a popular way to store health information, and more and more population-based studies
are using this data to answer questions and make inferences about populations at large, or as a method
to identify ways to improve medical care. For example, if you are asking about a population's common
allergies, you will have to extract many individuals' allergy information and put that into an easily interpretable table format where you will then perform your analysis. A more complex data source to analyze is images/videos. There is a wealth of information coded in an image or video, and it is
just waiting to be extracted. An example of image analysis that you may be familiar with is when you
upload a picture to Facebook. Not only does it automatically recognize faces in the picture, but then
suggests who they may be. A fun example you can play with is the Deep Dream software that was
originally designed to detect faces in an image, but has since moved onto more artistic pursuits. There is
another fun Google initiative involving image analysis, where you help provide data to Google's machine
learning algorithm by doodling. Recognizing that we've spent a lot of time going over what data is, we
need to reiterate data is important, but it is secondary to your question. A good data scientist asks
questions first and seeks out relevant data second. Admittedly, often the data available will limit, or
perhaps even enable certain questions you are trying to ask. In these cases, you may have to re-frame
your question or answer a related question but the data itself does not drive the question asking. In this
lesson we focused on data, both in defining it and in exploring what data may look like and how it can be
used. First, we looked at two definitions of data. One that focuses on the actions surrounding data, and
another on what comprises data. The second definition embeds the concepts of populations, variables
and looks at the differences between quantitative and qualitative data. Second, we examined different
sources of data that you may encounter and emphasized the lack of tidy data sets. Examples of messy
data sets, where raw data needs to be wrangled into an interpretable form, can include sequencing data,
census data, electronic medical records et cetera. Finally, we return to our beliefs on the relationship
between data and your question and emphasize the importance of question first strategies. You could
have all the data you could ever hope for, but if you don't have a question to start, the data is useless.

GETTING HELP
One of the main skills you are going to be called upon for as a data scientist is your ability to solve
problems. And sometimes to do that, you need help. The ability to solve problems is at the root of data
science. So the importance of being able to do so is paramount. In this lesson, we are going to equip you
with some strategies to help you when you get stuck with a problem and need some help. Much of this
information has been compiled from Roger Peng's video on Getting Help and Eric Raymond's How to
Ask Questions the Smart Way, so definitely check out those resources. Before we dive into how to get
help, we first need to focus on why you need these skills in the first place. First off, this course is not like
a standard class you have taken before where there may be 30 to 100 people and you have access to
your professor for immediate help. In this class, at any one time there can be thousands of students
taking the class, no one person could provide help to all of these people all of the time. So we'll
introduce you to some strategies to deal with getting help in this course. Also, as we said earlier, being
able to solve problems is often one of the core skills of a data scientist. Data science is new, you may be
the first person to come across a specific problem and you need to be equipped with skills that allow
you to tackle problems that are both new to you and the community. Finally, troubleshooting and figuring out solutions to problems is a great
transferable skill. It will serve you well as a data scientist, but so much of what any job often entails is
problem solving. Being able to think about problems and get help effectively is a benefit to you in
whatever career path you find yourself in. Before you begin asking others for help on your problem,
there are a few steps you can take on your own. Oftentimes, the fastest answer is one you find for
yourself. One of your first stops for data analysis problems should be reading the manuals or help files.
For R problems, try typing a question mark followed by the name of the command, which pulls up its help file.
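As a quick, minimal illustration (mean and "regression" are used here purely as examples), these are the standard ways to pull up documentation from the R console:

    ?mean           # open the help file for the mean() function
    help("mean")    # equivalent to ?mean
    ??regression    # search all installed help files for a topic (shorthand for help.search())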
If you post a question on a forum that is easily answered by the manual, you will often get a reply of "read the manual," which is not the easiest way to get at the answer you were going for. Next steps are searching on Google and searching relevant forums. Common forums for data science problems include Stack Overflow and Cross Validated.
Additionally, for you in this class, there is a course forum that is a great resource and super helpful.
Before posting a question to any forum, try and double-check that it hasn't been asked before using the
forum search functions. While you are Googling, things to pay attention to and look for are tutorials,
FAQs, or vignettes of whatever command or program is giving you trouble. These are great resources to
get you started. Either in telling you the language/words to use in your next searches or outright
showing you how to do something. As you get further into this course and using R, you may run into
coding problems and errors. And there are a few strategies you should have ready to deal with these. In
my experience, coding problems generally fall into two categories. Your command produces no data and
spits out an error message, or your command produces an output but it is not at all what you wanted.
These two problems have different strategies for dealing with them. If you are getting an error message,
I've been there. You type out a command and all you get are lines and lines of angry red text telling you
that you did something wrong and this can be overwhelming. But taking a second to check over your
command for typos and then carefully reading the error message solves the problem in nearly all of the
cases. The error messages are there to help you. It is the computer telling you what went wrong. And
when all else fails, you can be pretty assured that somebody out there got the same error message
panicked and posted to a forum. The answer is out there. On the other hand, if you get an output,
consider how the output was different from what you expected. And think about what it looks like the
command actually did, why it would do that, and not what you wanted. Most problems like this are
because the command you provided told the program to do one thing, and it did that thing exactly. It
just turns out what you told it to do wasn't actually what you wanted. These problems are often the
most frustrating. You are so close, but so far. These sorts of problems give you plenty of practice
thinking like a computer program. All right, you've done everything you're supposed to do to solve the
problem on your own. You need to bring in the big guns now, other people. Easiest is to find a peer with
some experience with what you are working on, and ask them for help/direction. This is often great
because the person explaining gets to solidify their understanding while teaching it to you. And you get
a hands-on experience seeing how they would solve the problem. In this class, your peers can be your
classmates, and you can interact with them through the course forum. Double-check your question
hasn't been asked already. But outside of this course, you may not have too many data science savvy
peers. What then? Rubber duck debugging is a long held tradition of solitary programmers everywhere.
In the book, The Pragmatic Programmer, there is a story of how stumped programmers would explain
their problem to a rubber duck and in the process of explaining the problem, identify the solution.
Wikipedia explains it well. Many programmers have had the experience of explaining a programming
problem to someone else, possibly even to someone who knows nothing about programming. And then
hitting upon the solution in the process of explaining the problem. In describing what the code is
supposed to do and observing what it actually does, any incongruity between these two becomes
apparent. So next time you are stumped, bring out the bath toys. You've done your best, you've searched and searched, you've talked with peers, you've
done everything possible to figure it out on your own and you are still stuck. It's time, time to post your
question to a relevant forum. Before you go ahead and just post your question, you need to consider
how you can best ask your question to garner helpful answers. Try to include details such as a very
specific question that you are trying to answer and what steps you have already taken in your
troubleshooting. Give details on how to reproduce the problem and include sample data for
troubleshooters to work from. Explain what your goal and expected output are in detail and what your
output was instead. If you got an error message, definitely mention that in your post. Additionally,
relevant details about your operating system or version of the product in question are often helpful
details to your potential problem solvers. One of the most important details of your posting is the title. It
is what signals to others what you are having trouble with. There is an art to titling your posts. Without being
specific you don't give your potential helpers a lot to go off of. They don't really know what the problem
is and if they are able to help you. Instead, you need to provide some details about what you are having
problems with. Answering what you were doing and what the problem is are two key pieces of
information that you need to provide. This way, somebody who is on the forum will know exactly what
is happening and that they might be able to help. Use titles that focus on the very specific core problem
that you are trying to get help with. It signals to people that you are looking for a very specific answer.
The more specific the question, often the faster the answer. Following all of the tips mentioned so far will serve you well in posting on forums and
observing forum etiquette. You are asking for help. You are hoping somebody else will take time out of
their day to help you. You need to be courteous. Often this takes the form of asking specific questions,
doing some troubleshooting of your own. And giving potential problem solvers easy access to all the
information they need to help you. Formalizing some of these dos and don'ts, you get some guidelines
to follow. Before posting, make sure you're asking your question in an appropriate form and read the
forum posting guidelines. Make sure you describe your goal and are explicit and detailed in your
explanation of the problem in your problem solving steps so far. Provide the minimum information
required to describe and replicate the problem. Don't bog people down with unrelated problems. And
finally, the big two. One, be courteous. These people are helping you. And two, make sure to follow up
on your post and post the solution. Not only do the people helping you deserve thanks, but this is
helpful to anybody else who has the same problem as you later on. There are also pretty clear guidelines
on what not to do. First, nobody wants to help somebody who assumes that the root cause of the
problem isn't because they have made a mistake, but that there is something wrong with the program.
Spoiler alert. It's almost always because you made a mistake. Similarly, nobody wants to do your
homework for you. They want to help somebody who is genuinely trying to learn, not find a shortcut.
Additionally, for people active on multiple forums, it is always aggravating when the same person posts
the same question on five different forums, or when the same question is posted on the same forum repeatedly. Be patient: pick the most relevant forum for your purposes, post once, and wait. There is an
art to problem solving and the only way to get practice is to get out there and start solving problems. In
this lesson, we look at how to effectively get help when you run into a problem. This is important for this
course, but also for your future as a data scientist. We first looked at strategies to use before asking for
help. Including reading the manual, checking the help files, and searching Google and appropriate
forums. We also covered some common coding problems you may face and some preliminary steps you
can take on your own. Including paying special attention to error messages and examining how your
code behaved compared to your goal. Once you've exhausted these options, we turn to other people for
help. We can ask peers for help or explain our problems to our trusty rubber ducks, be it an actual
rubber duck or an unsuspecting coworker. Our course forum is also a great resource for you all to talk
with many of your peers. Go introduce yourself. And if all else fails, we can post on forums, be it in this
class or at another forum, like Stack Overflow, with very specific, reproducible questions. Before doing
so, be sure to brush up on your forum etiquette. It never hurt anybody to be polite. Be a good citizen of
our forums. There is an art to problem solving, and the only way to get practice is to get out there and
start solving problems. Get to it.

DATA SCIENCE PROCESS

In the first few lessons of this course, we discussed what data and data science are and ways to get help.
What we haven't yet covered is what an actual data science project looks like. To do so, we'll first step
through an actual data science project, breaking down the parts of a typical project and then provide a
number of links to other interesting data science projects. Our goal in this lesson is to expose you to the
process one goes through as they carry out data science projects. Every data science project starts with
a question that is to be answered with data. That means that forming the question is an important first
step in the process. The second step, is finding or generating the data you're going to use to answer that
question. With the question solidified and data in hand, the data are then analyzed first by exploring the
data and then often by modeling the data, which means using some statistical or machine-learning
techniques to analyze the data and answer your question. After drawing conclusions from this analysis,
the project has to be communicated to others. Sometimes this is the report you send to your boss or
team at work, other times it's a blog post. Often it's a presentation to a group of colleagues. Regardless,
a data science project almost always involves some form of communication of the project's findings.
We'll walk through these steps using a data science project example below. For this example, we're
going to use an example analysis from a data scientist named Hilary Parker. Her work can be found on
her blog, and the specific project we'll be working through here is from 2013, entitled Hilary: The Most Poisoned Baby Name in US History. To get the most out of this lesson, click on that link and read through
Hilary's post. Once you're done, come on back to this lesson and read through the breakdown of this
post. When setting out on a data science project, it's always great to have your question well-defined.
Additional questions may pop up as you do the analysis. But knowing what you want to answer with
your analysis is a really important first step. Hilary Parker's question is included in bold in her post.
Highlighting this makes it clear that she's interested in answering the following question: is Hilary/Hillary really the most rapidly poisoned name in recorded American history? To answer this question, Hilary
collected data from the Social Security website. This data set included the 1,000 most popular baby names
from 1880 until 2011. As explained in the blog post, Hilary was interested in calculating the relative risk
for each of the 4,110 different names in her data set from one year to the next, from 1880-2011. By
hand, this would be a nightmare. Thankfully, by writing code in R, all of which is available on GitHub,
Hilary was able to generate these values for all these names across all these years. It's not important at
this point in time to fully understand what a relative risk calculation is, although Hilary does a great job
breaking it down in her post. But it is important to know that after getting the data together, the next
step is figuring out what you need to do with that data in order to answer your question. For Hilary's question, calculating the relative risk for each name from one year to the next, from 1880 to 2011, and looking at the percentage of babies given each name in a particular year, was what she needed to do to answer her question.
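The sketch below is purely illustrative: it is not Hilary Parker's code (her actual code is on her GitHub), the numbers are made up, and it uses a simple year-over-year percentage drop rather than her exact relative risk calculation. It only shows the shape of the computation described above, assuming the dplyr and tibble packages are installed.

    library(dplyr)
    # toy data: made-up percentages of babies given each name per year
    names_by_year <- tibble::tribble(
      ~name,    ~year, ~percent,
      "Hilary", 1991,  0.20,
      "Hilary", 1992,  0.23,
      "Hilary", 1993,  0.05,
      "Anna",   1991,  0.50,
      "Anna",   1992,  0.48,
      "Anna",   1993,  0.47
    )
    biggest_drops <- names_by_year %>%
      arrange(name, year) %>%
      group_by(name) %>%
      mutate(relative_drop = (lag(percent) - percent) / lag(percent)) %>%
      summarize(max_drop = max(relative_drop, na.rm = TRUE)) %>%
      arrange(desc(max_drop))
    biggest_drops   # in this toy data, Hilary shows the largest single-year drop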
What you don't see in the blog post is all of the code Hilary wrote to get the data from the Social Security website, to get it into the format she needed to do the analysis, and to generate the figures. As mentioned above, she made all this code available on GitHub so that others
could see what she did and repeat her steps if they wanted. In addition to this code, data science
projects often involve writing a lot of code and generating a lot of figures that aren't included in your
final results. This is part of the data science process: figuring out how to do what you want to do to
answer your question of interest. It's part of the process. It doesn't always show up in your final project
and can be very time consuming. That said, given that Hilary now had the necessary values calculated,
she began to analyze the data. The first thing she did was look at the names with the biggest drop in
percentage from one year to the next. In this preliminary analysis, Hilary was sixth on the list, meaning there were five other names that had had a single-year drop in popularity larger than the one the name Hilary experienced from 1992-1993. In looking at the results of this analysis, the first five names appeared peculiar to Hilary Parker. It's always good to consider whether or not the results were what you were expecting from an analysis. None of them seemed to be names that were popular for long periods of
time. To see if this hunch was true, Hilary plotted the percent of babies born each year with each of the
names from this table. What she found was that among these poisoned names, names that experienced
a big drop from one year to the next in popularity, all of the names other than Hilary became popular all
of a sudden and then dropped off in popularity. Hilary Parker was able to figure out why most of these
other names became popular. So definitely read that section of her post. The name, Hilary, however,
was different. It was popular for a while and then completely dropped off in popularity. To figure out
what was specifically going on with the name Hilary, she removed names that became popular for short
periods of time before dropping off and only looked at names that were in the top 1,000 for more than
20 years. The results from this analysis definitively showed that Hilary had the quickest fall from
popularity in 1992 of any female baby name between 1880 and 2011; by comparison, Marian's decline was gradual over many years. For the final step in this data analysis process, once Hilary Parker had answered her
question, it was time to share it with the world. An important part of any data science project is
effectively communicating the results of the project. Hilary did so by writing a wonderful blog post that
communicated the results of her analysis. Answered the question she set out to answer, and did so in an
entertaining way. Additionally, it's important to note that most projects build off someone else's work.
It's really important to give those people credit. Hilary accomplishes this by linking to a blog post where
someone had asked a similar question previously, to the Social Security website where she got the data
and where she learned about web scraping. Hilary's work was carried out using the R programming
language. Throughout the courses in this series, you'll learn the basics of programming in R, exploring
and analyzing data, and how to build reports and web applications that allow you to effectively
communicate your results. To give you an example of the types of things that can be built using the R
programming and suite of available tools that use R, below are a few examples of the types of things
that have been built using the data science process and the R programming language. The types of
things that you'll be able to generate by the end of this series of courses. Master's students at the
University of Pennsylvania set out to predict the risk of opioid overdoses in Providence, Rhode Island.
They include details on the data they used, the steps they took to clean their data, their visualization
process, and their final results. While the details aren't important now, seeing the process and what
types of reports can be generated is important. Additionally, they've created a Shiny app, which is an
interactive web application. This means that you can choose what neighborhood in Providence you want
to focus on. All of this was built using R programming. The following are smaller projects than the
example above, but data science projects nonetheless. In each project, the author had a question they
wanted to answer and used data to answer that question. They explored, visualized, and analyzed the
data. Then, they wrote blog posts to communicate their findings. Take a look to learn more about the
topics listed and to see how others work through the data science project process and communicate
their results. Maëlle Salmon looked at using data to see where one should live in the US given their
weather preferences. David Robinson carried out an analysis of Trump's tweets to show that Trump only
writes the angrier ones himself. Charlotte Galvin used open data available from the City of Toronto to
build a map with information about sexual health clinics. In this lesson, we hope we've conveyed that
sometimes data science projects tackle difficult questions (can we predict the risk of opioid overdose?), while other times the goal of the project is to answer a question you're interested in personally (is Hilary the most rapidly poisoned baby name in recorded American history?). In either case,
the process is similar. You have to form your question, get data, explore and analyze your data, and
communicate your results. With the tools you will learn in this series of courses, you will be able to set
out and carry out your own data science projects like the examples included in this lesson.

INSTALLING R

Now that we've got a handle on what a data scientist is and how to find answers, and have spent some time going over a data science example, it's time to get you set up to start exploring on your own. The first step of that is installing R. First, let's remind ourselves exactly what R is and why we might want to use it. R is both a programming language and an environment focused mainly on statistical analysis and graphics. It
will be one of the main tools you use in this and following courses. R is downloaded from the
Comprehensive R Archive Network or CRAN. While this might be your first brush with it, we will be
returning to CRAN time and time again when we install packages, so keep an eye out. Outside of this
course, you may be asking yourself, "Why should I use R?" One reason to use R is its popularity. R is quickly becoming the standard language for statistical analysis. This makes R a great language to learn, as the more popular a piece of software is, the quicker new functionality is developed, the more powerful it becomes, and the better the support there is. Additionally, knowing R is one of the top five languages asked for in data scientist job postings. Another benefit of R is its cost: free. This one is pretty self-explanatory. Every aspect of R is free to use, unlike some other stats packages you may have heard of (e.g., SAS or SPSS), so there is no cost barrier to using R. Yet another benefit is R's
extensive functionality. R is a very versatile language. We've talked about its use in stats and in graphing.
But its use can be expanded to many different functions: making websites, making maps using GIS data, analyzing language, and even making these lectures and videos. Here we are showing a dot
density map made in R of the population of Europe. Each dot is worth 50 people in Europe. For
whatever task you have in mind, there is often a package available for download that does exactly that.
The reason that the functionality of R is so extensive is the community that has been built around R.
Individuals have come together to make packages that add to the functionality of R, and more are being
developed every day. Particularly for people just getting started out with R, its community is a huge benefit due to its popularity. There are multiple forums that have pages and pages dedicated to solving R problems. We talked about this in the Getting Help lesson. These forums are great both for finding other people who have had the same problem as you and for posting your own new problems. Now that
we've spent some time looking at the benefits of R, it is time to install it. We'll go over installation for
both Windows and Mac below, but know that these are general guidelines, and small details are likely to
change subsequent to the making of this lecture. Use this as a scaffold. For both Windows and Mac
machines, we start at the CRAN homepage. If you're on a Windows computer, follow the link Download R
for Windows and follow the directions there. If this is your first time installing R, go to the base
distribution and click on the link at the top of the page that should say something like Download R
version number for Windows. This will download an executable file for installation. Open the
executable, and if prompted by a security warning, allow it to run. Select the language you prefer during
installation and agree to the licensing information. You will next be prompted for a destination location.
This will likely be defaulted to program files in a subfolder called R, followed by another sub-directory
for the version number. Unless you have any issues with this, the default location is perfect. You will
then be prompted to select which components should be installed. Unless you are running short on
memory, installing all of the components is desirable. Next, you'll be asked about startup options and,
again, the defaults are fine for this. You will then be asked where setup should place shortcuts. That is
completely up to you. You can allow it to add the program to the start menu, or you can click the box at
the bottom that says, "Do not create a start menu link." Finally, you will be asked whether you want a
desktop or quick launch icon. Up to you. I do not recommend changing the defaults for the registry
entries though. After this window, the installation should begin. Test that the installation worked by
opening R for the first time. If you are on a Mac computer, follow the link Download R for Mac OS X.
There you can find the various R versions for download. Note, if your Mac is older than OS X 10.6 Snow
Leopard, you will need to follow the directions on this page for downloading older versions of R that are
compatible with those operating systems. Click on the link to the most recent version of R, which will
download a PKG file. Open the PKG file and follow the prompts as provided by the installer. First, click
"Continue "on the welcome page and again on the important information window page. Next, you will
be presented with the software license agreement. Again, continue. Next you may be asked to select a
destination for R, either available to all users or to a specific disk. Select whichever you feel is best suited
to your setup. Finally, you will be at the standard install page. R selects a default directory, and if you are
happy with that location, go ahead and click Install. At this point, you may be prompted to type in the
admin password, do so and the install will begin. Once the installation is finished, go to your applications
and find R. Test that the installation worked by opening R for the first time. In this lesson, we first looked
at what R is and why we might want to use it. We then focused on the installation process for R on both
Windows and Mac computers. Before moving on to the next lecture, be sure that you have R installed
properly.

INSTALL R STUDIO

We've installed R and can open the R interface to input code. But there are other ways to interface with
R, and one of those ways is using RStudio. In this lesson, we'll get RStudio installed on your computer.
RStudio is a graphical user interface for R that allows you to write, edit, and store code; generate, view, and store plots; manage files, objects, and dataframes; and integrate with version control systems, to
But for anybody just starting out with R coding, the visual nature of this program as an interface for R is
a huge benefit. Thankfully, installation of RStudio is fairly straightforward. First, you go to the RStudio
download page. We want to download the RStudio Desktop version of the software, so click on the
appropriate download under that heading. You will see a list of installers for supported platforms. At this
point, the installation process diverges for Macs and Windows, so follow the instructions for the
appropriate OS. For Windows, select the RStudio Installer for the various Windows editions (Vista, 7, 8, 10).
This will initiate the download process. When the download is complete, open this executable file to
access the installation wizard. You may be presented with a security warning at this time, allow it to
make changes to your computer. Following this, the installation wizard will open. Following the defaults
on each of the windows of the wizard is appropriate for installation. In brief, on the welcome screen,
click next. If you want RStudio installed elsewhere, browse through your file system, otherwise, it will
likely default to the Program Files folder, which is appropriate. Click "Next". On this final page, allow RStudio to create a Start Menu shortcut. Click "Install". RStudio is now being installed; wait for this process to finish. RStudio is now installed on your computer. Click "Finish". Check that RStudio is working appropriately by opening it from your Start Menu. For Macs, select the Mac OS X RStudio installer (Mac OS X 10.6+, 64-bit). This will initiate the download process. When the download is
complete, click on the downloaded file and it will begin to install. When this is finished, the applications
window will open. Drag the RStudio icon into the applications directory. Test the installation by opening
your Applications folder and opening the RStudio software. In this lesson, we installed RStudio, both for
Macs and for Windows computers. Before moving on to the next lecture, click through the available
menus and explore the software a bit. We will have an entire lesson dedicated to exploring RStudio, but
having some familiarity beforehand will be helpful.

TOUR OF R STUDIO

Now that we have RStudio installed, we should familiarize ourselves with the various components and
functionality of it. RStudio provides a cheat sheet of the RStudio environment that you should definitely
check out. RStudio can be roughly divided into four quadrants, each with specific and varied functions,
plus a main menu bar. When you first open RStudio, you should see a window that looks roughly like
this. You may be missing the upper-left quadrant and instead have the left side of the screen with just
one region, the console. If this is the case, go to "File", then "New File", then "R Script", and now it should more closely resemble the image. You can change the sizes of each of the various quadrants by hovering your mouse over the spaces between quadrants and click-dragging the divider to resize the sections. We
will go through each of the regions and describe some of their main functions. It would be impossible to
cover everything that RStudio can do. So, we urge you to explore RStudio on your own too. The menu
bar runs across the top of your screen and should have two rows. The first row should be a fairly
standard menu starting with File and Edit. Below that there is a row of icons that are shortcuts for
functions that you'll frequently use. To start, let's explore the main sections of the menu bar that you
will use, the first being the File menu. Here we can open new or saved files, open new or saved projects (we'll have an entire lesson in the future about R projects, so stay tuned), save our current document, or close RStudio. If you mouse over New File, a new menu will appear that suggests the various file
formats available to you. RScript and RMarkdown files are the most common file types for use, but you
can also generate RNotebooks, web apps, websites or slide presentations. If you click on any one of
these, a new tab in the source quadrant will open. We'll spend more time in a future lesson on
RMarkdown files and their use. The Session menu has some R-specific functions with which you can restart,
interrupt or terminate R. These can be helpful if R isn't behaving or is stuck and you want to stop what it
is doing and start from scratch. The Tools menu is a treasure trove of functions for you to explore. For
now, you should know that this is where you can go to install new packages (see the next lecture), set up your version control software (see the future lesson on linking GitHub and RStudio), and set your options and preferences for how RStudio looks and functions. For now, we will leave this alone, but be sure to
explore these menus on your own once you have a bit more experience with RStudio and see what you
can change to best suit your preferences. The console region should look familiar to you. When you
opened R, you were presented with the console. This is where you type and execute commands, and where the output of each command is displayed. To execute your first command, try typing 1 + 1 and then pressing Enter at the > prompt. You should see the output [1] 2 below your command. Now copy and paste the matrix-creating code shown on screen into your console and hit Enter; it creates a matrix with four rows and two columns containing the numbers one through eight.
To view this matrix, first look to the environment quadrant where you should see a data set called
example. Click anywhere on the example line and a new tab on the source quadrant should appear
showing the matrix you created. Any dataframe or matrix that you create in R can be viewed this way in
RStudio. RStudio also tells you some information about the object in the environment, like whether it is
a list or a dataframe or if it contains numbers, integers or characters. This is very helpful information to
have as some functions only work with certain classes of data and knowing what kind of data you have is
the first step to that. This quadrant has two other tabs running across the top of it; we'll just look at the History tab now. Your History tab should look something like this. Here you will see the commands that we have run in this session of R. If you click on any one of them, you can click "To Console" or "To Source", and this will either rerun the command in the console or move the command to the source editor, respectively. Do so now for your example matrix and send it to source. The Source panel is where you
will be spending most of your time in RStudio. This is where you store the R commands that you want to save for later, either as a record of what you did or as a way to rerun the code. We'll spend a lot of time in this quadrant when we discuss RMarkdown. But for now, click the "Save" icon along the top of this quadrant and save this script as my_first_R_Script.R. Now you will always have a record of creating
this matrix. The final region we'll look at occupies the bottom right of the RStudio window. In this
quadrant, five tabs run across the top, Files, Plots, Packages, Help, and Viewer. In files, you can see all of
the files in your current working directory. If this isn't where you want to save or retrieve files from, you
can also change the current working directory in this tab using the ellipsis at the far right, finding the
desired folder and then under the More cog wheel, setting this new folder as the working directory. In
the plots tab, if you generate a plot with your code, it will appear here. You can use the arrows to
navigate to previously generated plots. The zoom function will open the plot in a new window that is
much larger than the quadrant. "Export" is how you save the plot. You can either save it as an image or
as a PDF. The broom icon clears all plots from memory. The "Packages" tab will be explored more in
depth in the next lesson on R packages. Here you can see all the packages you have installed, load and
unload these packages, and update them. The "Help" tab is where you find the documentation for your R packages and their various functions. In the upper right of this panel, there is a search function for when you
have a specific function or package in question. In this lesson, we took a tour of the RStudio software.
We became familiar with the menu bar and its various menus. We looked at the console, where our code is input and run. We then moved on to the Environment panel, which lists all of the objects that have been created within an R session and allows you to view these objects in a new tab in the source quadrant. In this
same quadrant, there is a history tab that keeps a record of all commands that have been run. It also
presents the option to either rerun the command in the console or send the command to source to be
saved. Source is where you save your R commands. The bottom-right quadrant contains a listing of all
the files in your working directory, displays generated plots, lists your installed packages, and supplies
help files for when you need some assistance. Take some time to explore RStudio on your own.

R PACKAGES

Now that we've installed R and RStudio and have a basic understanding of how they work together, we can get at what makes R so special: packages. So far, anything we've played around with in R uses the base R system. Base R, or everything included in R when you download it, has rather basic functionality
for statistics and plotting, but it can sometimes be limiting. To expand upon R's basic functionality,
people have developed packages. A package is a collection of functions, data, and code conveniently
provided in a nice complete format for you. At the time of writing, there are just over 14,300 packages
available to download, each with their own specialized functions and code, all for some different
purpose. An R package is not to be confused with a library. These two terms are often conflated in colloquial speech about R. A library is the place where the package is located on your computer. To think of an analogy, a library is, well, a library, and a package is a book within the library. The library is where the books/packages are located. Packages are what make R so unique. Not only does base R have some
great functionality, but these packages greatly expand its functionality. Perhaps, most special of all, each
package is developed and published by the R community at large and deposited in repositories. A
repository is a central location where many developed packages are located and available for download.
There are three big repositories. They are the Comprehensive R Archive Network, or CRAN, which is R's
main repository with over 12,100 packages available. There is also the Bioconductor repository, which is
mainly for bioinformatics-focused packages. Finally, there is GitHub, a very popular, open source repository
that is not R specific. So, you know where to find packages. But there are so many of them. How can you
find a package that will do what you are trying to do in R? There are a few different avenues for
exploring packages. First, CRAN groups all of its packages by their functionality/topic into 35 themes; it calls these its Task Views. This at least allows you to narrow the packages you look through to a topic
relevant to your interests. Second, there is a great website, RDocumentation, which is a search engine
for packages and functions from CRAN, Bioconductor, and GitHub, that is, the big three repositories. If
you have a task in mind, this is a great way to search for specific packages to help you accomplish that
task. It also has a Task View like CRAN that allows you to browse themes. More often, if you have a
specific task in mind, Googling that task followed by R package is a great place to start. From there,
looking at tutorials, vignettes, and forums for people already doing what you want to do is a great way
to find relevant packages. Great. You found a package you want. How do you install it? If you are
installing from the CRAN repository, use the install.packages() function with the name of the package you
want to install in quotes between the parentheses. Note, you can use either single or double quotes. For
example, if you want to install the package ggplot2, you would use install.packages("ggplot2"). Try doing
so in your R Console. This command downloads the ggplot2 package from CRAN and installs it onto your
computer. If you want to install multiple packages at once, you can do so by using a character vector
with the names of the packages separated by commas as formatted here. If you want to use RStudio's
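A minimal sketch of both forms (the second package name is an arbitrary example):

    install.packages("ggplot2")                 # install a single package from CRAN
    install.packages(c("ggplot2", "devtools"))  # install several packages in one call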
If you want to use RStudio's graphical interface to install packages, go to the Tools menu, and the first option should be Install Packages. If installing from CRAN, select it as the repository and type the desired packages into the appropriate box. The Bioconductor repository uses its own method to install packages. First, to get
the basic functions required to install through Bioconductor, use source("https://bioconductor.org/biocLite.R"). This makes the main install function of Bioconductor, biocLite, available to you. Following this, you call the package you want to install in quotes between the parentheses of the biocLite command, as in the sketch below for the GenomicRanges package.
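A sketch of the Bioconductor workflow described here (note that more recent Bioconductor releases have since moved to the BiocManager package, but biocLite is what this lesson assumes):

    source("https://bioconductor.org/biocLite.R")   # load Bioconductor's installer functions
    biocLite("GenomicRanges")                        # install the GenomicRanges package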
Installing from GitHub is a more specific case that you probably won't run into too often. In the event you want to do this, you first must find the package you want on GitHub and take note of both the package name and the author of the package. The general workflow is: install the devtools package, but only if you don't already have it (if you've been following along with this lesson, you may have installed it when we were practicing installations using the R console); then load the devtools package using the library function (more on what this command is doing in a few seconds); and finally use the command install_github(), calling the author's GitHub username followed by the package name, as in the sketch below.
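A minimal sketch; the "username/packagename" pair is a placeholder for whichever GitHub package you actually want:

    install.packages("devtools")             # only needed if devtools isn't installed already
    library(devtools)                        # load devtools to get install_github()
    install_github("username/packagename")   # the author's GitHub username, then the package name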
Installing a package does not make its functions immediately available to you. First, you must load the package into R. To do so, use the library function. Think of this like any other software you install on your computer: just because you've installed the program doesn't mean it's automatically running; you have to open it. Same with R packages: you've installed the package, but now you have to load it. For example, to load the ggplot2 package, you would call the library function with ggplot2 inside the parentheses. Note that you do not put the package name in quotes here. Unlike when you are installing packages, the library command is typically called with the bare package name (it will also accept a quoted name, but the convention is to leave the quotes off), as in the example below.
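For example, loading the ggplot2 package we installed earlier:

    library(ggplot2)   # note: no quotes around the package name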
There is an order to loading packages. Some packages require other packages to be loaded first; these are called dependencies. That package's manual/help pages will help you figure out the correct order if they are picky. If you want to load a package using the RStudio interface, in the lower right quadrant there is a tab called Packages that lists all of the packages you have installed, along with a brief description and the version number of each. To load a package, just click on the checkbox beside the package name. Once you've got a package, there are a few things you might need
to know how to do. If you aren't sure whether you've already installed a package, or want to check which packages are installed, you can use either the installed.packages() or the library() command with nothing between the parentheses. In RStudio, the Packages tab introduced earlier is another way to look at all of the packages you have installed. You can check which packages need an update with a call to the function old.packages(). This will identify all packages that have a newer version available than the one you installed/last updated. To update all packages, use update.packages(). If you only want to update a specific package, just use install.packages() once again. Within the RStudio interface, still in that Packages tab, you can click Update, which will list all of the packages that are not up to date. It gives you the option to update all of your packages or to select specific packages. You will want to periodically check on your packages and see whether you've fallen out of date. Be careful, though: sometimes an update can change the functionality of certain functions, so if you rerun some old code, a command may have changed or perhaps even be gone outright, and you will need to update your code.
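A quick sketch of that checking and updating workflow; what each call reports will of course depend on your own system:

    installed.packages()          # every package you have installed
    library()                     # another way to list installed packages
    old.packages()                # installed packages with a newer version available
    update.packages()             # update everything (you are prompted package by package)
    install.packages("ggplot2")   # update just one package by reinstalling it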
Sometimes you want to unload a package in the middle of a script. The package you have loaded may
not play nicely with another package you want to use. To unload a given package, you can use the
detach function. For example, you would type detach("package:ggplot2", unload = TRUE). This would unload the ggplot2 package that we loaded earlier. Within the RStudio
interface in the Packages tab, you can simply unload a package by unchecking the box beside the
package name. If you no longer want to have a package installed, you can simply uninstall it using the
function remove.packages(). For example, try remove.packages("ggplot2"), but then actually reinstall the ggplot2 package; it's a super useful plotting package. Within RStudio, in the
Packages tab, clicking on the X at the end of a package's row will uninstall that package. Sometimes,
when you are looking at a package that you might want to install, you will see that it requires a certain
version of R to run. To know if you can use that package, you need to know what version of R you are
running. One way to know your R version is to check when you first open R or RStudio. The first thing it
outputs in the console tells you what version of R is currently running. If you didn't pay attention at the
beginning, you can type version into the console and it will output information on the R version you're
running. Another helpful command is sessionInfo(). It will tell you what version of R you are running along with a listing of all of the packages you have loaded. The output of this command is a great detail to include when posting a question to forums, as it tells potential helpers a lot of information about your OS, R, and the packages (plus their version numbers) that you are using.
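Both can be typed straight into the console:

    version         # details on the version of R you are running
    sessionInfo()   # R version plus the packages you have loaded, with their versions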
In all of this information about packages, we have not actually discussed how to use a package's functions. First, you need to know what functions are included within a package. To do this, you can look at the manual/help pages included in all well-made packages. In the console, you can use the help function to access a package's help file. Try using help(package = "ggplot2") and you will see all of the many functions that ggplot2 provides. Within the RStudio interface, you can access the help files through the
Packages tab. Again, clicking on any package name should open up these associated help files in the
Help tab found in that same quadrant beside the Packages tab. Clicking on any one of these help pages
will take you to that function's help page, which tells you what that function is for and how to use it. Once
you know what function within a package you want to use, you simply call it in the console like any other
function we've been using throughout this lesson. Once a package has been loaded, it is as if it were a
part of the base R functionality. If you still have questions about what functions within a package are
right for you or how to use them, many packages include vignettes. These are extended help files that
include an overview of the package and its functions, but often they go the extra mile and include
detailed examples of how to use the functions in plain words that you can follow along with to see how
to use the package. To see the vignettes included in a package, you can use the browseVignettes
function. For example, let's look at the vignettes included in ggplot2. Using browseVignettes("ggplot2"), you should see that there are two included vignettes: "Extending ggplot2" and "Aesthetic specifications". Exploring the aesthetic specifications vignette is a great example of how vignettes can provide helpful, clear instructions on how to use the included functions.
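As a quick recap, both of these exploration tools are just console calls:

    help(package = "ggplot2")     # list ggplot2's functions and open its help pages
    browseVignettes("ggplot2")    # open ggplot2's vignettes in your browser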
In this lesson, we've explored R packages in depth. We examined what a package is and how it differs from a library, what repositories
are, and how to find a package relevant to your interests. We investigated all aspects of how packages
work, how to install them from the various repositories, how to load them, how to check which
packages are installed, and how to update, uninstall, and unload packages. We took a small detour and
looked at how to check which version of R you have, which is often an important detail to know when
installing packages. Finally, we spent some time learning how to explore help files and vignettes which
often give you a good idea of how to use a package and all of its functions.

PROJECTS IN R

One of the ways people organize their work in R is through the use of R projects, a built-in functionality of RStudio that helps to keep all your related files together. RStudio provides a great guide on how to
use projects. So, definitely check that out. First off, what is an R project? When you make a project, it
creates a folder where all files will be kept, which is helpful for organizing yourself and keeping multiple
projects separate from each other. When you reopen a project, RStudio remembers what files were open and will restore the work environment as if you had never left, which is very helpful when you are starting back up on a project after some time off. Functionally, creating a project in R will create a new
folder and assign that as the working directory so that all files generated will be assigned to the same
directory. The main benefit of using projects is that it starts the organization process off right. It creates
a folder for you and now you have a place to store all of your input data, your code and the output of
your code. Everything you are working on within a project is self-contained, which often means finding
things is much easier. There's only one place to look. Also, since everything related to one project is all in
the same place, it is much easier to share your work with others, either by directly sharing the folder/files or by associating it with version control software. We'll talk more about linking projects in R with version control systems in a future lesson entirely dedicated to the topic. Finally, since RStudio remembers what documents you had open when you closed the session, it is easier to pick a project
up after a break. Everything is set up just as you left it. There are three ways to make a project. First, you
can make it from scratch. This will create a new directory for all your files to go in. Or you can create a
project from an existing folder. This will link an existing directory with R Studio. Finally, you can link a
project from version control. This will clone an existing project onto your computer. Don't worry too
much about this one. You'll get more familiar with it in the next few lessons. Let's create a project from
scratch, which is often what you will be doing. Open R Studio and under "File," select "New Project." You
can also create a new project by using the projects toolbar and selecting new project in the drop-down
menu, or there is a new project shortcut in the toolbar. Since we are starting from scratch, select "New
Directory." When prompted about the project type, select "New Project." Pick a name for your project
and for this time, save it to your desktop. This will create a folder on your desktop where all of the files
associated with this project will be kept. Click create project. A blank R Studio session should open. A
few things to note. One, in the Files quadrant of the screen, you can see that RStudio has made this new directory your working directory and generated a single file with the extension .Rproj. Two, in the
upper right of the window, there is a project's toolbar that states the name of your current project and
has a drop-down menu with a few different options that we'll talk about in a second. Opening an
existing project is as simple as double clicking the R Project file on your computer. You can accomplish
the same from within R Studio by opening R Studio and going to file then open project. You can also use
the project toolbar and open the drop down menu and select "Open Project." Quitting a project is as
simple as closing your RStudio window. You can also go to File, then "Close Project," and this will do the same. Finally, you can use the project toolbar by clicking on the drop-down menu and choosing "Close Project." All of these options will quit a project; doing so causes RStudio to record which documents are currently open so they can be restored when you start back up again, and it then closes the R session. When you set up your project, you can tell it to save the environment so that, for example, all of your variables and data tables will be pre-loaded when you reopen the project, but this is not the default behavior. The
projects toolbar is also an easy way to switch between projects. Click on the drop-down menu and
choose "Open Project" and find your new project you want to open. This will save the current project,
close it and then open the new project within the same window. If you want multiple projects open at
the same time, do the same, but instead, select "Open Project in New Session." This can also be
accomplished through the file menu, where those same options are available. When you are setting up a
project, it can be helpful to start out by creating a few directories. Try a few strategies and see what
works best for you, but most file structures are set up around having a directory containing the raw data, a directory that you keep scripts/R files in, and a directory for the output of your code. If you set up these folders before you start, it can save you organizational headaches later on in a project when you can't quite remember where something is. A sketch of one such layout, created from the R console, is below.
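A minimal sketch, run from the console with the project open; the folder names are just a common convention, not something RStudio requires:

    dir.create("data")     # raw data
    dir.create("code")     # scripts / R files
    dir.create("output")   # results produced by your code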
In this lesson, we've covered what projects in R are, why you might want to use them, how to open, close, or switch between projects, and some best practices to set you up for organizing yourself.

Version Control

Now that we've got a handle on our RStudio and projects, there are a few more things we want to set
you up with before moving on to the other courses, understanding version control, installing Git, and
linking Git with RStudio. In this lesson, we will give you a basic understanding of version control. First
things first, what is version control? Version control is a system that records changes that are made to a
file or a set of files over time. As you make edits, the version control system takes snapshots of your files
and the changes, and then saves those snapshots so you can refer back to, or revert to, previous versions later
if need be. If you've ever used the track changes feature in Microsoft Word, you have seen a
rudimentary type of version control in which the changes to a file are tracked and you can either choose
to keep those edits or revert to the original format. Version control systems like Git are like a more
sophisticated track changes in that, they are far more powerful and are capable of meticulously tracking
successive changes on many files with potentially many people working simultaneously on the same
groups of files. Hopefully, once you've mastered version control software, paper final final two actually
finaldoc.docx will be a thing of the past for you. As we've seen in this example, without version control,
you might be keeping multiple, very similar copies of a file and this could be dangerous. You might start
editing the wrong version not recognizing that the document labeled final has been further edited to
final two and now all your new changes have been applied to the wrong file. Version control systems
help to solve this problem by keeping a single updated version of each file with a record of all previous
versions and a record of exactly what changed between the versions which brings us to the next major
benefit of version control. It keeps a record of all changes made to the files. This can be of great help
when you are collaborating with many people on the same files. The version control software keeps
track of who, when, and why those specific changes were made. It's like track changes to the extreme.
This record is also helpful when developing code. If you realize after some time that you made a mistake
and introduced an error, you can find the last time you edited the particular bit of code, see the changes
you made, and revert back to that original, unbroken code, leaving everything else you've done in the meanwhile untouched. Finally, when working with a group of people on the same set of files, version
control is helpful for ensuring that you aren't making changes to files that conflict with other changes. If
you've ever shared a document with another person for editing, you know the frustration of integrating
their edits with a document that has changed since you sent the original file. Now, you have two
versions of that same original document. Version control allows multiple people to work on the same
file and then helps merge all of the versions of the file and all of their edits into one cohesive file. Git is a
free and open source version control system. It was developed in 2005 and has since become the most
commonly used version control system around. Stack Overflow which should sound familiar from our
getting help lesson surveyed over 60,000 respondents on which version control system they use. As you
can tell from the chart, Git is by far the winner. As you become more familiar with Git and how it works
in interfaces with your projects, you'll begin to see why it has risen to the height of popularity. One of
the main benefits of Git is that it keeps a local copy of your work and revisions, which you can then edit offline. Then, once you return to internet service, you can sync your copy of the work, with all of your new edits and tracked changes, to the main repository online. Additionally, since all collaborators on a project have their own local copy of the code, everybody can simultaneously work on their own parts of
the code without disturbing the common repository. Another big benefit that we'll definitely be taking
advantage of is the ease with which RStudio and Git interface with each other. In the next lesson, we'll
work on getting Git installed and linked with RStudio and making a GitHub account. GitHub is an online
interface for Git. Git is software used locally on your computer to record changes. GitHub is a host for
your files and the records of the changes made. You can think of it as being similar to Dropbox. The files
are on your computer but they are also hosted online and are accessible from many computers. GitHub
has the added benefit of interfacing with Git to keep track of all of your file versions and changes. There
is a lot of vocabulary involved in working with Git and often the understanding of one word relies on
your understanding of a different Git concept. Take some time to familiarize yourself with the following
words and go over them a few times to see how the concepts relate. A repository is equivalent to the
projects folder or directory. All of your version controlled files and the recorded changes are located in a
repository. This is often shortened to repo. Repositories are what are hosted on GitHub and through this
interface you can either keep your repositories private and share them with select collaborators or you
can make them public, where anybody can see your files and their history. To commit is to save your edits and the changes made. A commit is like a snapshot of your files: Git compares the previous version of all of your files in the repo to the current version and identifies those that have changed since then. For those that have not changed, it keeps the previously stored file untouched. For those that have changed, it compares the files, records the changes, and saves the new version of your file. We'll touch on this in the
next section, but when you commit a file, typically you accompany that file change with a little note
about what you changed and why. When we talk about version control systems, commits are at the
heart of them. If you find a mistake, you will revert your files to a previous commit. If you want to see
what has changed in a file over time, you compare the commits and look at the messages to see why
and who. To push is to update the repository with your edits. Since Git involves making changes locally,
you need to be able to share your changes with the common online repository. Pushing is sending those
committed changes to that repository so now everybody has access to your edits. Pulling is updating
your local version of the repository to the current version since others may have edited in the
meanwhile. Because the shared repository is hosted online, any of your collaborators, or even you on a different computer, could have made changes to the files and then pushed them to the shared repository. If you are behind the times, the files you have locally on your computer may be outdated, so you pull to check if you are up to date with the main repository. One final term you must know is staging, which is
the act of preparing a file for a commit. For example, if since your last commit you have edited three
files for completely different reasons, you may not want to commit all of the changes in one go; your message on why you are making the commit and what has changed would be complicated, since three files have been changed for different reasons. So instead, you can stage just one of the files and prepare it
for committing. Once you've committed that file, you can stage the second file and commit it and so on.
Staging allows you to separate out file changes into separate commits, very helpful. To summarize these
commonly used terms so far and to test whether you've got the hang of this, files are hosted in a
repository that is shared online with collaborators. You pull the repository's contents so that you have a
local copy of the files that you can edit. Once you are happy with your changes to a file, you stage the
file and then commit it. You push this commit to the shared repository. This uploads your new file and
all of the changes and is accompanied by a message explaining what changed, why, and by whom.
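In Git's own command-line terms, one pass through that cycle looks roughly like the following sketch; the file name and commit message are placeholders, not anything from the lesson:

    git pull                               # update your local copy of the repository
    git add analysis.R                     # stage the file you edited
    git commit -m "Describe your change"   # commit it with an informative message
    git push                               # share the commit with the online repository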
A branch is when the same file has two simultaneous copies. When you are working locally and editing a file, you have created a branch, where your edits are not yet shared with the main repository. So, there
are two versions of the file. The version that everybody has access to on the repository and your local
edited version of the file. Until you push your changes and merge them back into the main repository,
you are working on a branch. Following a branch point, the version history splits in two and tracks the independent changes made both to the original file in the repository, which others may be editing, and to your copy on your branch, and then merges the files back together. Merging is when independent
edits of the same file are incorporated into a single unified file. Independent edits are identified by Git
and are brought together into a single file with both sets of edits incorporated. But you can see a
potential problem here. If both people made an edit to the same sentence, and those edits preclude one another, we have a problem. Git recognizes this disparity as a conflict and asks for user
assistance in picking which edit to keep. So, a conflict is when multiple people make changes to the
same file and Git is unable to merge the edits. You are presented with the option to manually try and
merge the edits or to keep one edit over the other. When you clone something, you are making a copy
of an existing Git repository. If you have just been brought on to a project that has been tracked with
version control, you will clone the repository to get access to and create a local version of all of the
repository's files and all of the tracked changes. A fork is a personal copy of a repository that you have
taken from another person. If somebody is working on a cool project and you want to play around with
it, you can fork their repository and then when you make changes, the edits are logged on your
repository not theirs. It can take some time to get used to working with version control software like Git,
but there are a few things to keep in mind to help establish good habits that will help you out in the
future. One of those things is to make purposeful commits. Each commit should address only a single issue. This way, if you need to identify when you changed a certain line of code, there is only one place
to look to identify the change and you can easily see how to revert the code. Similarly, making sure you
write informative messages on each commit is a helpful habit to get into. If each message is precise about
what was being changed, anybody can examine the committed file and identify the purpose for your
change. Additionally, if you are looking for a specific edit you made in the past, you can easily scan
through all of your commits to identify those changes related to the desired edit. Finally, be cognizant of the version of the files you are working on. Check that you are up to date with the current repo
by frequently pulling. Additionally, don't hoard your edited files. Once you have committed your files
and written that helpful message, you should push those changes to the common repository. If you are
done editing a section of code and are planning on moving onto an unrelated problem, you need to
share that edit with your collaborators. Now that we've covered what version control is and some of the
benefits, you should be able to understand why we have three whole lessons dedicated to version
control and installing it. We looked at what Git and GitHub are and then covered much of the commonly
used and sometimes confusing vocabulary inherent to version control work. We then quickly went over
some best practices to using Git, but the best way to get a hang of this all is to use it. Hopefully, you feel
like you have a better handle on how Git works now. So, let's move on to the next lesson and get it
installed.

GitHub and Git

Now that we've got a handle on what version control is, in this lesson you will sign up for a GitHub account, navigate around the GitHub website to become familiar with some of its features, and install and configure Git, all in preparation for linking both with your RStudio. As we previously learned, GitHub
is a cloud-based management system for your version controlled files. Like Dropbox, your files are both
locally on your computer and hosted online and easily accessible. Its interface allows you to manage
version control and provides users with a web-based interface for creating projects, sharing them,
updating code, etc. To get a GitHub account, first go to www.github.com. You will be brought to their
homepage where you should fill in your information, make a username, put in your email, choose a
secure password, and click sign up for GitHub. You should now be logged into GitHub. In the future, to
log onto GitHub, go to github.com where you will be presented with a homepage. If you aren't already
logged in, click on the sign in link at the top. Once you've done that, you will see the login page where
you will enter in your username and password that you created earlier. Once logged in, you will be back
at github.com but this time the screen should look like this. We're going to take a quick tour of the
GitHub website and we'll particularly focus on these sections of the interface, user settings, notifications,
help files, and the GitHub guide. Following this tour, we will make your very first repository using the
GitHub guide. First, let's look at your user settings. Now that you've logged onto GitHub, we should fill
out some of your profile information and get acquainted with the account settings. In the upper right
corner, there is an icon with an arrow beside it. Click this and go to your profile. This is where you control your account from and can view your contribution history and repositories. Since you are just
starting out, you aren't going to have any repositories or contributions yet, but hopefully we'll change
that soon enough. What we can do right now is edit your profile. Go to edit profile along the left-hand
edge of the page. Here, take some time and fill out your name and a little description of yourself in the
bio box. If you like, upload a picture of yourself. When you are done, click update profile. Along the left-
hand side of this page, there are many options for you to explore. Click through each of these menus to
get familiar with the options available to you. To get you started, go to the account page. Here, you can
edit your password or, if you are unhappy with your username, change it. Be careful though: there can be unintended consequences when you change your username. If you are just starting out and don't have any content yet, you'll probably be safe. Continue looking through the personal settings options on your own. When you're done, go back to your profile. Once you've had a bit more experience with GitHub, you'll eventually end up with some repositories to your name. To find those,
click on the repositories link on your profile. For now, it will probably look like this. By the end of the
lecture though, check back to this page to find your newly created repository. Next, we'll check out the
notifications menu. Along the menu bar across the top of your window, there is a bell icon representing
your notifications. Click on the bell. Once you become more active on GitHub and are collaborating with
others, here is where you can find messages and notifications for all the repositories, teams, and
conversations you are a part of. Along the bottom of every single page there is the help button. GitHub
has a great help system in place. If you ever have a question about GitHub, this should be your first point
to search. Take some time now and look through the various help files and see if any catch your eye.
GitHub recognizes that this can be an overwhelming process for new users and as such has developed
a mini tutorial to get you started with GitHub. Go through this guide now and create your first
repository. When you're done, you should have a repository that looks something like this. Take some
time to explore around the repository. Check out your commit history so far. Here you can find all of the
changes that have been made to the repository and you can see who made the change, when they
made the change, and, provided you wrote an appropriate commit message, why they made the change. Once you've explored all of the options in the repository, go back to your user profile. It
should look a little different from before. Now when you are on your profile, you can see your latest
repository created. For a complete listing of your repositories, click on the Repositories tab. Here you
can see all of your repositories, a brief description, the time of the last edit, and along the right-hand
side, there is an activity graph showing when and how many edits have been made on the repository. As
you may remember from our last lecture, Git is the free and open-source version control system which
GitHub is built on. One of the main benefits of using the Git system is its compatibility with RStudio.
However, in order to link the two software together, we first need to download and install Git on your
computer. To download Git, go to git-scm.com/download. Click on the appropriate download link for
your operating system. This should initiate the download process. We'll first look at the install process
for Windows computers and follow that with Mac installation steps. Follow along with the relevant
instructions for your operating system. For Windows computers, once the download is finished, open
the .exe file to initiate the installation wizard. If you receive a security warning, click Run to allow it.
Following this, click through the installation wizard generally accepting the default options unless you
have a compelling reason not to. Click install and allow the wizard to complete the installation process.
Following this, check the launch Git Bash option. Unless you are curious, deselect the View Release
Notes box as you are probably not interested in this right now. Doing so, a command line environment
will open. Provided you accepted the default options during the installation process, there will now be a
start menu shortcut to launch Git Bash in the future. You have now installed Git. For Macs, we will walk
you through the most common installation process. However, there are multiple ways to get Git onto
your Mac; you can follow the tutorials at www.atlassian.com/git/tutorials/install-git for alternative installation routes. After downloading the appropriate
git version for Macs, you should have downloaded a dmg file for installation on your Mac. Open this file.
This will install Git on your computer. A new window will open. Double click on the PKG file and an
installation wizard will open. Click through the options accepting the defaults. Click Install. When
prompted, close the installation wizard. You have successfully installed Git. Now that Git is installed, we
need to configure it for use with GitHub in preparation for linking it with RStudio. We need to tell Git
what your username and email are so that it can label each commit as coming from you. To do so, in the command prompt (either Git Bash for Windows or Terminal for Mac), type git config --global user.name "Jane Doe" with your desired username in place of Jane Doe. This is the name each commit will be tagged with. Following this, in the command prompt type git config --global user.email janedoe@gmail.com, making sure to use the same email address you signed up for GitHub with. At this point, you should be set for the next step. But just to check, confirm your changes by typing git config --list. Doing so, you should see the username and email you selected above. If you notice any problems or want to change these values, just retype the original config commands from earlier with your desired changes. Once you are satisfied that your username and email are correct, exit the command line by typing exit and hitting Enter; the whole exchange looks roughly like the sketch below.
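Jane Doe and the email address here are stand-ins for your own details, as in the lesson:

    git config --global user.name "Jane Doe"           # the name each commit will be tagged with
    git config --global user.email janedoe@gmail.com   # the same email you signed up to GitHub with
    git config --list                                   # confirm both values were recorded
    exit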
At this point, you are all set up for the next lecture. In this lesson, we signed up
for a GitHub account and toured the GitHub website. We made your first repository and filled in some
basic profile information on GitHub. Following this, we installed Git on your computer and configured it
for compatibility with GitHub and RStudio.

Link Github and RStudio

Now that we have both RStudio and Git set up on your computer in a GitHub account, it's time to link
them together so that you can maximize the benefits of using RStudio in your version control pipelines.
To link RStudio and Git, in RStudio go to Tools, then Global Options, then Git/SVN. Sometimes the default
path to the Git executable is not correct. Confirm that git.exe resides in the directory that RStudio has
specified. If not, change the directory to the correct path. Otherwise, click "Okay" or "Apply". RStudio
and Git are now linked. Now, to link RStudio to GitHub in that same RStudio option window, click
"Create RSA Key" and when there is complete, click "Close". Following this, in that same window again,
click "View public key" and copy the string of numbers and letters. Close this window. You have now
created a key that is specific to you which we will provide to GitHub so that it knows who you are when
you commit a change from within RStudio. To do so, go to github.com, log in if you are not already, and
go to your account settings. There, go to SSH and GPG keys and click "New SSH key". Paste in the public
key you have copied from RStudio into the key box and give it a title related to RStudio. Confirm the
addition of the key with your GitHub password. GitHub and RStudio are now linked. From here, we can
create a repository on GitHub and link to RStudio. To do so, go to GitHub and create a new repository by
going to your Profile, Repositories and New. Name your new test repository and give it a short
description. Click "Create Repository", copy the URL for your new repository. In RStudio, go to File, New
Project, select Version Control, select Git as your version control software. Paste in the repository URL
from before, select the location where you would like the project stored. When done, click on "Create
Project". Doing so will initialize a new project linked to the GitHub repository and open a new session of
RStudio. Create a new R script by going to File, New File, R Script and copy and paste the following code:
print("This file was created within RStudio") and then on a new line paste, print("And now it lives on
GitHub"). Save the file. Note that when you do so, the default location for the file is within the new
project directory you created earlier. Once that is done, looking back at RStudio, in the Git tab of the
environment quadrant, you should see the file you just created. Click the checkbox under Staged to stage your file, then click Commit. A new window should open that lists all of the changed files and, below that, shows the differences in the staged files from previous versions. In the upper quadrant, in the Commit message box, write yourself a commit message. Click Commit and close the window. So far, you
have created a file, saved it, staged it, and committed it. If you remember your version control lecture,
the next step is to push your changes to your online repository. Push your changes to the GitHub repository, then go to your GitHub repository and see that the commit has been recorded. You've just
successfully pushed your first commit from within RStudio to GitHub. In this lesson, we linked Git and
RStudio so that RStudio recognizes you are using it as your version control software. Following that, we
linked RStudio to GitHub so that you can push and pull repositories from within RStudio. To test this, we
created a repository on GitHub, linked it with a new project within RStudio, created a new file and then
staged, committed and pushed the file to your GitHub repository.

Projects Under Version Control

In the previous lesson, we linked RStudio with Git and GitHub. In doing this, we created a repository on
GitHub and linked it to RStudio. Sometimes, however, you may already have an R project that isn't yet
under version control or linked with GitHub. Let's fix that. So, what if you already have an R project that
you've been working on but don't have linked up to any version control software? Thankfully, RStudio and GitHub recognize this can happen and have steps in place to help you. Admittedly, this is slightly
more troublesome to do than just creating a repository on GitHub and linking it with RStudio before
starting the project. So, first, let's set up a situation where we have a local project that isn't under
version control. Go to File, New Project, New Directory, New Project and name your project. Since we
are trying to emulate a time where you have a project not currently under version control, do not click
Create a git repository, click Create Project. We've now created an R project that is not currently under
version control. Let's fix that. First, let's set it up to interact with Git. Open Git Bash or Terminal and
navigate to the directory containing your project files. Move around directories by typing cd (for change directory) followed by the path of the directory. When the command prompt, in the line before the dollar sign, shows the correct location of your project, you are in the right place. Once here, type git init followed by git add . (that is, "git add period"). This initializes the directory as a Git repository and adds all of the files in the directory to your local repository. Commit these changes to the Git repository using git commit -m "Initial commit", roughly as in the sketch below.
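A sketch of those steps in Git Bash or Terminal; the project path is a placeholder for wherever your project actually lives:

    cd ~/Desktop/myproject           # placeholder path: navigate to your project folder
    git init                         # initialize the directory as a Git repository
    git add .                        # add all files in the directory to the local repository
    git commit -m "Initial commit"   # record the first commit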
At this point, we have created an R project and have now linked it to Git version control. The next step is to link this with GitHub. To do this, go to github.com and again create a new repository. Make sure the name is the exact same as your R project, and do not initialize the README file, .gitignore, or license. Once you've created this repository, you should see that there is an option to push an existing repository from the command line, with instructions below containing code on how to do so. In Git Bash or Terminal, copy and paste those lines of code to link your repository with GitHub; they will look roughly like the sketch below, with your own details filled in.
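The lines GitHub shows you will have your own username and repository name in the URL (and possibly main rather than master as the branch name on newer setups); these are placeholders:

    git remote add origin https://github.com/username/myproject.git
    git push -u origin master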
After doing so, refresh your GitHub page, and it should now look something like this. When you reopen your
project in RStudio, you should now have access to the Git tab in the upper right quadrant and can push any future changes to GitHub from within RStudio. If there is an existing project that others are working
on that you are asked to contribute to, you can link the existing project with your RStudio. It follows the
exact same premise as the last lesson, where you created a GitHub repository and then cloned it
to your local computer using RStudio. In brief, in RStudio, go to File, New Project, Version Control. Select
Git as your version control system, and like in the last lesson, provide the URL to the repository that you
are attempting to clone and select a location on your computer to store the files locally. Create the
project. All the existing files in the repository should now be stored locally on your computer and you
have the ability to push edits from your RStudio interface. The only difference from the last lesson is
that you did not create the original repository. Instead, you cloned somebody else's. In this lesson, we
went over how to convert an existing project to be under Git version control using the command line.
Following this, we linked your newly version-controlled project to GitHub using a mix of Git commands at the command line and the GitHub website. We then briefly recapped how to clone an existing GitHub repository to
your local machine using RStudio.

R Markdown

We've spent a lot of time getting R and RStudio working, learning about projects and version control.
You are practically an expert at this. There is one last major functionality of R/RStudio that we would be remiss not to include in your introduction to R: R Markdown. R Markdown is a way of creating fully reproducible documents in which both text and code can be combined. In fact, these lessons are written using R Markdown. That's how we make things like bulleted lists, bolded and italicized text, and inline links, and run inline R code. By the end of this lesson, you should be able to do each of those things too, and more. Despite these documents all starting as plain text, you can render them into HTML pages, PDFs, Word documents, or slides, and the symbols you use to signal, for example, bold or italics are compatible with all of those formats. One of the main benefits is the reproducibility of using R
Markdown. Since you can easily combine text and code chunks in one document, you can easily
integrate introductions, hypotheses, your code that you are running, the results of that code, and your
conclusions all in one document. Sharing what you did, why you did it, and how it turned out becomes
so simple, and that person you share it with can rerun your code and get the exact same answers you
got. That's what we mean about reproducibility. But also, sometimes you will be working on a project
that takes many weeks to complete. You want to be able to see what you did a long time ago and
perhaps be reminded exactly why you were doing this. And you can see exactly what you ran and the
results of that code, and R Markdown documents allow you to do that. Another major benefit to R
Markdown is that since it is plain text, it works very well with version control systems. It is easy to track what character changes occur between commits, unlike other formats that are not plain text. For example, in one version of this lesson, I may have forgotten to bold "this" word. When I catch my mistake, I can
make the plain text changes to signal I would like that word bolded, and in the commit, you can see the
exact character changes that occurred to now make the word bold. Another selfish benefit of R
Markdown is how easy it is to use. Like everything in R, this extended functionality comes from an R
package: rmarkdown. All you need to do to install it is run install.packages("rmarkdown"), and that's it.
You are ready to go. To create an R Markdown document in RStudio, go to File, New File, R Markdown.
You will be presented with this window. I've filled in a title and an author and switched the output format to a PDF. Explore this window and the tabs along the left to see all the different formats that you can output to. When you are done, click OK, and a new window should open with a little explanation of R
Markdown files. There are three main sections of an R Markdown document. The first is the header at
the top bounded by the three dashes. This is where you can specify details like the title, your name, the
date, and what kind of document you want to output. If you filled in the blanks in the window earlier,
these should be filled out for you. Also on this page, you can see text sections, for example, one section
starts with ## R Markdown. We'll talk more about what this means in a second, but this section will
render as text when you produce the PDF of this file, and all of the formatting you will learn generally
applies to this section. Finally, you will see code chunks. These are bounded by triple backticks. These are pieces of R code that you can run right from within your document, and the output
of this code will be included in the PDF when you create it. The easiest way to see how each of these
sections behave is to produce the PDF. When you are done with a document in R Markdown, you are set
to knit your plain text and code into your final document. To do so, click on the Knit button along the top
of the source panel. When you do so, it will prompt you to save the document as an .Rmd file; do so. You
should see a document like this one. So, here you can see that the content of a header was rendered
into a title, followed by your name and the date. The text chunks produced a section header called R Markdown, which is followed by two paragraphs of text. Following this, you can see the R code summary(cars), which, importantly, is followed by the output of running that code. Further down, you will see code
that ran to produce a plot, and then that plot. This is one of the huge benefits of R Markdown, rendering
the results of code inline. Go back to the R Markdown file that produced this PDF and see if you can work out how you signify that you want text bolded; look at the word "Knit" and see what it is surrounded by. At this point, I hope we've convinced you that R Markdown is a useful way to combine your code and text, and we have
set you up to be able to play around with it. To get you started, we'll practice some of the formatting
that is inherent to R Markdown documents. To start, let's look at bolding and italicizing text. To bold
text, you surround it by two asterisks on either side. Similarly, to italicize text, you surround the word
with a single asterisk on either side. We've also seen from the default document that you can make
section headers. To do this, you put a series of hash marks. The number of hash marks determines what
level of heading it is. One hash is the highest level and will make the largest text. Two hashes is the next
highest level and so on. Play around with this formatting and make a series of headers. The other thing
we've seen so far is code chunks. To make an R code chunk, you type three backticks followed by curly brackets surrounding a lowercase r, put your code on a new line, and end the chunk with three backticks, as in the sketch below.
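A minimal chunk, exactly as you would type it into the .Rmd file:

    ```{r}
    print("Hello world")
    ```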
Thankfully, RStudio recognizes you'll be doing this a lot, and there are shortcuts: Ctrl+Alt+I for Windows, or Command+Option+I for Macs. Additionally, along the top of the source
quadrant, there is the insert button that will also produce an empty code chunk. Try making an empty
code chunk. Inside it, type the code print("Hello world"). When you knit your document, you will see this code chunk and the admittedly simplistic output of that chunk. If you aren't ready to knit your document yet but want to see the output of your code, select the line of code you want to run and use Ctrl+Enter, or hit the Run button along the top of your source window. The text "Hello world" should be
outputted in your console window. If you have multiple lines of code in a chunk and you want to run
them all in one go, you can run the entire chunk by using Ctrl+Shift+Enter, or hitting the green arrow
button on the right side of the chunk, or going to the Run menu and selecting Run Current Chunk. One
final thing we will go into detail on is making bulleted lists, like the one at the top of this lesson. Lists are
easily created by preceding each prospective bullet point with a single dash followed by a space. Importantly, end each bullet's line with two spaces. This is a quirk of R Markdown that will cause spacing problems if not included. A few of these formatting marks, as you would type them in the plain text, are sketched below.
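A small sample of the markup covered in this lesson, as it would appear in the plain text of your .Rmd file (remember the two trailing spaces at the end of each bullet line):

    **This text will be bold**
    *This text will be italicized*
    # A top-level header
    ## A second-level header
    - a bullet point
    - another bullet point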
This is a great starting point, and there is so much more you can do with R Markdown. Thankfully, RStudio's developers have produced an R Markdown cheat sheet that
we urge you to go check out and see everything you can do with R Markdown. The sky is the limit. In this
lesson, we've delved into R Markdown, starting with what it is and why you might want to use it. We
hopefully got you started with R Markdown, first by installing it, and then by generating and knitting our
first R Markdown document. We then looked at some of the various formatting options available to you and practiced generating code chunks and running them within the RStudio interface.

Types of Data Science Questions

In this lesson, we're going to be a little more conceptual and look at some of the types of analyses data
scientists employ to answer questions in data science. There are, broadly speaking, six categories into which data analyses fall. In approximate order of difficulty, they are: descriptive, exploratory,
inferential, predictive, causal, and mechanistic. Let's explore the goals of each of these types and look at
some examples of each analysis. To start, let's look at descriptive data analysis. The goal of descriptive
analysis is to describe or summarize a set of data. Whenever you get a new data set to examine, this is
usually the first kind of analysis you will perform. Descriptive analysis will generate simple summaries
about the samples and their measurements. You may be familiar with common descriptive statistics,
including measures of central tendency (e.g., mean, median, mode) or measures of variability (e.g., range, standard deviation, or variance); a quick sketch of computing a few of these in base R follows.
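A minimal sketch, using the mtcars dataset that ships with R as stand-in data:

    mean(mtcars$mpg)      # central tendency: mean miles per gallon
    median(mtcars$mpg)    # central tendency: median
    range(mtcars$mpg)     # variability: minimum and maximum
    sd(mtcars$mpg)        # variability: standard deviation
    var(mtcars$mpg)       # variability: variance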
This type of analysis is aimed at summarizing your sample, not at generalizing the results of the analysis to a larger population or trying to draw conclusions. Description
of data is separated from making interpretations. Generalizations and interpretations require additional
statistical steps. Some examples of purely descriptive analysis can be seen in censuses. Here the
government collects a series of measurements on all of the country's citizens, which can then be
summarized. Here you are being shown the age distribution in the US, stratified by sex. The goal of this
is just to describe the distribution. There are no inferences about what this means or predictions on how
the data might trend in the future. It is just to show you a summary of the data collected. The goal of
exploratory analysis is to examine or explore the data and find relationships that weren't previously
known. Exploratory analyses explore how different measures might be related to each other but do not confirm that the relationship is causative. You've probably heard the phrase "correlation does not imply causation," and exploratory analyses lie at the root of this saying. Just because you observed a
relationship between two variables during exploratory analysis, it does not mean that one necessarily
causes the other. Because of this, exploratory analysis, while useful for discovering new connections,
should not be the final say in answering a question. It can allow you to formulate hypotheses and drive
the design of future studies and data collection. But exploratory analysis alone should never be used as
the final say on why or how data might be related to each other. Going back to the census example from
above, rather than just summarizing the data points within a single variable, we can look at how two or
more variables might be related to each other. In this plot, we can see the percent of the work force
that is made up of women in various sectors, and how that has changed between 2000 and 2016.
Exploring this data, we can see quite a few relationships. Looking just at the top row of the data, we can
see that women make up a vast majority of nurses, and that this proportion has slightly decreased over those 16 years. While these are interesting relationships to note, the causes of these relationships are not apparent from this analysis. All exploratory analysis can tell us is that a relationship exists, not the cause. The goal of
inferential analysis is to use a relatively small sample of data to infer or say something about the population at large. Inferential analysis is commonly the goal of statistical modelling, where you have a small amount of information and want to extrapolate and generalize that information to a larger group. Inferential analysis typically involves using the data you have to estimate that value in the population, and then
give a measure of uncertainty about your estimate. Since you are moving from a small amount of data
and trying to generalize to a larger population, your ability to accurately infer information about the
larger population depends heavily on your sampling scheme. If the data you collect is not from a
representative sample of the population, the generalizations you infer won't be accurate for the
population. Unlike in our previous examples, we shouldn't be using census data in inferential analysis. A
census already collects information on functionally the entire population; there is nobody left to infer to. And inferring from the US census to another country would not be a good idea, because the US isn't necessarily representative of the other country we are trying to infer knowledge about. Instead, a better example of inferential analysis is a study in which a subset of the US population was surveyed on their life expectancy given the level of air pollution they experienced. This study uses the data they
collected from a sample of the US population, to infer how air pollution might be impacting life
expectancy in the entire US. The goal of predictive analysis is to use current data to make predictions
about future data. Essentially, you are using current and historical data to find patterns, and predict the
likelihood of future outcomes. Like in inferential analysis, the accuracy of your predictions is dependent on
measuring the right variables. If you aren't measuring the right variables to predict an outcome, your
predictions aren't going to be accurate. Additionally, there are many ways to build prediction models, with some being better or worse for specific cases. But in general, having more data and a simple model performs well at predicting future outcomes. All this being said, much like in exploratory analysis, just because one variable may predict another, it does not mean that one causes
the other. You are just capitalizing on this observed relationship to predict this second variable. A
common saying is that prediction is hard, especially about the future. There aren't easy ways to gauge
how well you are going to predict an event until that event has come to pass. So evaluating different
approaches or models is a challenge. We spend a lot of time trying to predict things. The upcoming
weather. The outcomes of sports events. And in the example we'll explore here, the outcomes of
elections. We've previously mentioned Nate Silver of FiveThirtyEight, where they try and predict the
outcomes of US elections, and sports matches too. Using historical polling data and trends in current
polling, FiveThirtyEight builds models to predict the outcomes in the next US presidential vote, and has
been fairly accurate at doing so. FiveThirtyEight's models accurately predicted the 2008 and 2012
elections, and their 2016 forecast was widely considered an outlier, as it was one of the few models to suggest Donald Trump had a chance of winning. The caveat to a lot of the analyses we've looked
at so far is we can only see correlations and can't get at the cause of the relationships we observe.
Causal analysis fills that gap. The goal of causal analysis is to see what happens to one variable when we
manipulate another variable, looking at the cause and effect of a relationship. Generally, causal analyses are fairly complicated to do with observed data alone; there will always be questions as to whether it is correlation driving your conclusions, or whether the assumptions underlying your analysis are valid. More often, causal analyses are applied to the results of randomized studies that were
designed to identify causation. Causal analysis is often considered the gold standard in data analysis, and
is seen frequently in scientific studies where scientists are trying to identify the cause of a phenomenon.
But often getting appropriate data for doing a causal analysis is a challenge. One thing to note about
causal analysis is that the data is usually analyzed in aggregate and observed relationships are usually
average effects. So, while on average, giving a certain population a drug may alleviate the symptoms of a
disease, this causal relationship may not hold true for every single affected individual. As we've said,
many scientific studies allow for causal analysis. Randomized controlled trials for drugs are a prime
example of this. For example, one randomized controlled trial examined the effect of a new drug for treating infants with spinal muscular atrophy, comparing a sample of infants receiving the drug with a sample receiving a mock control. The researchers measured various clinical outcomes in the babies and looked at how the drug affected those outcomes.
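As a rough illustration of the kind of comparison a randomized trial enables, here is a small Python sketch. The outcome scores below are simulated, not the actual trial data, and the group sizes and score scale are invented for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated clinical outcome scores (higher = better) for the two groups.
treatment = rng.normal(loc=55, scale=10, size=60)  # received the drug
control = rng.normal(loc=48, scale=10, size=60)    # received the mock control

# The average effect of the drug, estimated as the difference in group means.
average_effect = treatment.mean() - control.mean()

# A two-sample t-test asks how plausible a difference this large would be
# if the drug actually did nothing.
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"Estimated average effect: {average_effect:.1f} points, p-value: {p_value:.4f}")

Note that the estimate is an average effect across the whole group, which is exactly the caveat raised above.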
Mechanistic analyses are not nearly as commonly used as the previous types of analysis. The goal of mechanistic analysis is to understand the exact changes in variables that lead to exact changes in other variables. These analyses are exceedingly hard to use to infer much, except in simple situations or in those that are nicely modeled by deterministic equations. Given this description, it might be clear to see how mechanistic analyses are most commonly applied to the physical or engineering sciences; biological sciences, for example, produce datasets that are far too noisy for mechanistic analysis.
Often, when these analyses are applied, the only noise in the data is measurement error, which can be
accounted for. You can generally find examples of mechanistic analysis in material science experiments.
Here, we have a study on biocomposites (essentially, biodegradable plastics) that examined how biocarbon particle size, functional polymer type, and concentration affected the mechanical properties of the resulting plastic. The researchers were able to do mechanistic analysis through a careful balance of controlling and manipulating variables, with very accurate measures of both those variables and the desired outcome. In this lesson, we've covered the various types of data analysis and their goals, and looked at a few examples of each to demonstrate what each analysis is capable of and, importantly, what it is not.
Experimental Design
Now that we've looked at the different types of data science questions, we are going to spend some
time looking at experimental design concepts. As a data scientist, you are a scientist, and as such, you need to have the ability to design proper experiments to best answer your data science questions. Experimental design is organizing an experiment so that you have the correct data, and enough of it, to clearly and effectively answer your data science question. This process involves clearly formulating your question in advance of any data collection, designing the best setup possible to gather the data to answer your question, identifying problems or sources of error in your design, and only then collecting the appropriate data. Going into an analysis, you need to plan in advance what you're going to do and how you are going
to analyze the data. If you do the wrong analysis, you can come to the wrong conclusions. We've seen
many examples of this exact scenario play out in the scientific community over the years. There's an
entire website, Retraction Watch, dedicated to identifying papers that have been retracted or removed
from the literature as a result of poor scientific practices, and sometimes those poor practices are a
result of poor experimental design and analysis. Occasionally, these erroneous conclusions can have
sweeping effects particularly in the field of human health. For example, here we have a paper that was
trying to predict the effects of a person's genome on their response to different chemotherapies to
guide which patient receives which drugs to best treat their cancer. As you can see, this paper was
retracted over four years after it was initially published. In that time, this data, which was later shown to have numerous problems in its setup and cleaning, was cited in nearly 450 other papers that may
have used these erroneous results to bolster their own research plans. On top of this, this wrongly
analyzed data was used in clinical trials to determine cancer patient treatment plans. When the stakes
are this high, experimental design is paramount. There are a lot of concepts and terms inherent to
experimental design. Let's go over some of these now. The independent variable (AKA factor) is the variable that the experimenter manipulates. It does not depend on other variables being measured and is often displayed on the x-axis. Dependent variables are those that are expected to change as a result of changes in the independent variable, and are often displayed on the y-axis. So, changes in x, the independent variable, effect changes in y. When you are designing an experiment, you have to decide what variables you will measure, and which you will manipulate to effect changes in other measured
variables. Additionally, you must develop your hypothesis: essentially, an educated guess as to the relationship between your variables and the outcome of your experiment. Let's do an example experiment now. Let's say, for example, that I have a hypothesis that as shoe size increases, literacy also increases. In designing my experiment, I will use a measure of literacy (e.g., reading fluency) as my variable that depends on an individual's shoe size. To answer this question, I will design an experiment in which I measure the shoe size and literacy level of 100 individuals. Sample size is the number of experimental subjects you will include in your experiment; there are ways to pick an optimal sample size that you will cover in later courses. Before I collect my data though, I need to consider if there are
problems with this experiment that might cause an erroneous result. In this case, my experiment may
be fatally flawed by a confounder. A confounder is an extraneous variable that may affect the
relationship between the dependent and independent variables. In our example, since age affects shoe size and literacy is also affected by age, if we see any relationship between shoe size and literacy, the relationship may actually be due to age; age is confounding our experimental design. To control for this, we can make sure we also measure the age of each individual, so that we can take into account the effects of age on literacy. Another way we could control for age's effect on literacy would be to fix the age of all participants. If everyone we study is the same age, then we have removed the possible effect of age on literacy.
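To see how a confounder can mislead us, here is a small simulated Python example in which age drives both shoe size and literacy. All of the numbers are invented; the point is only that the apparent shoe-size effect shrinks toward zero once age is also measured and included in the model.

import numpy as np

rng = np.random.default_rng(0)
n = 100

# Age drives both shoe size and literacy (a reading-fluency score); the
# numbers are invented purely for illustration.
age = rng.uniform(5, 12, size=n)
shoe_size = 0.8 * age + rng.normal(0, 0.5, size=n)
literacy = 10 * age + rng.normal(0, 5, size=n)

# Ignoring age, literacy regressed on shoe size alone shows a strong "effect".
slope_ignoring_age = np.polyfit(shoe_size, literacy, deg=1)[0]

# Measuring age and including it in the model shrinks the shoe-size
# coefficient toward zero: the apparent relationship was driven by age.
X = np.column_stack([np.ones(n), shoe_size, age])
coefficients, *_ = np.linalg.lstsq(X, literacy, rcond=None)

print(f"Shoe-size slope, ignoring age:      {slope_ignoring_age:.2f}")
print(f"Shoe-size slope, adjusting for age: {coefficients[1]:.2f}")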
In other experimental design paradigms, a control group may be appropriate. This is when you have a group of experimental subjects that are not manipulated. So, if you were studying the effect of a drug on survival, you would have a group that received the drug (the treatment group) and a group that did not (the control group). This way, you can compare the effects of the drug in the treatment group versus the control group. In these study designs, there are strategies we can use to control for confounding effects. One,
we can blind the subjects to their assigned treatment group. Sometimes, when a subject knows that they are in the treatment group (e.g., receiving the experimental drug), they can feel better not from the drug itself but from knowing they are receiving treatment. This is known as the placebo effect. To combat this, participants are often blinded to the treatment group they are in. This is usually achieved by giving the control group a mock treatment (e.g., a sugar pill they are told is the drug). In this way, if the placebo effect is causing a problem with your experiment, both groups should experience it equally. This strategy of spreading any possible confounding effects equally across the groups being compared is at the heart of many study designs. For example, if you think age is a possible
confounding effect, making sure that both groups have similar ages and age ranges will help to mitigate
any effect age may be having on your dependent variable, because the effect of age will be equal between your two groups. This balancing of confounders is often achieved by randomization. Generally, we don't know beforehand what will be a confounder, so to help lessen the risk of accidentally biasing one group to be enriched for a confounder, you can randomly assign individuals to each of your groups. This means that any potential confounding variables should be distributed between the groups roughly equally, helping to eliminate or reduce systematic errors.
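Here is a minimal sketch of what random assignment looks like in practice, using simulated ages for 100 hypothetical subjects. Because the split is random, the two groups end up with similar age distributions without us having to plan for age specifically.

import numpy as np

rng = np.random.default_rng(1)

# Ages of 100 hypothetical study subjects (simulated for illustration).
ages = rng.integers(20, 70, size=100)

# Randomly shuffle the subject indices and split them into two groups.
order = rng.permutation(len(ages))
treatment_idx = order[:50]
control_idx = order[50:]

# Because assignment was random, age (and any other confounder, measured
# or not) should be roughly balanced between the groups.
print("Mean age, treatment group:", ages[treatment_idx].mean())
print("Mean age, control group:  ", ages[control_idx].mean())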
There is one final concept of experimental design that we need to cover in this lesson, and that is replication. Replication is pretty much what it sounds like: repeating an experiment with different experimental subjects. A single experiment's results may have occurred by chance: a confounder was unevenly distributed across your groups, there was a systematic error in the data collection, there were some outliers, etcetera. However, if you can repeat the experiment, collect a whole new set of data, and still come to the same conclusion, your study is much stronger. Also
at the heart of replication is that it allows you to measure the variability of your data more accurately,
which allows you to better assess whether any differences you see in your data are significant. Once
you've collected and analyzed your data, one of the next steps of being a good citizen scientist is to
share your data and code for analysis. Now that you have a GitHub account and we've shown you how to keep your version-controlled data and analyses on GitHub, this is a great place to share your code. In fact, our group, the Leek group, has developed a guide, hosted on GitHub, that has great advice for how to best share data. One of the many things often reported in experiments is a value called the p-value.
This is a value that tells you the probability that the results of your experiment were observed by
chance. This is a very important concept in statistics that we won't be covering in depth here. If you
want to know more, check out the YouTube video linked which explains more about p-values. What you
need to look out for is when people manipulate p-values towards their own ends. Often, when your p-value is less than 0.05 (in other words, when there is a five percent chance that the differences you saw were observed by chance), a result is considered significant. But if you do 20 tests, by chance you would expect one of the 20 (that is, five percent) to be significant. In the age of big data, testing 20 hypotheses is
a very easy proposition, and this is where the term p-hacking comes from. This is when you exhaustively
search a dataset to find patterns and correlations that appear statistically significant by virtue of the
sheer number of tests you have performed. These spurious correlations can be reported as significant
and if you perform enough tests, you can find a dataset and analysis that will show you what you
wanted to see. Check out this FiveThirtyEight activity, where you can manipulate unfiltered data and perform a series of tests such that you can get the data to find whatever relationship you want. XKCD mocks this concept in a comic testing the link between jelly beans and acne. Clearly there is no link there. But if you test enough jelly bean colors, eventually one of them will be correlated with acne at a p-value of less than 0.05.
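You can see the multiple-testing problem directly with a short simulation: run 20 tests on pure noise, where there is no real difference between the groups, and on average about one of them will still come out "significant" at p < 0.05. The group sizes and number of tests below are arbitrary choices for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

significant = 0
for _ in range(20):                    # e.g., 20 jelly bean colors
    group_a = rng.normal(size=30)      # two groups drawn from the same
    group_b = rng.normal(size=30)      # distribution: no real difference
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        significant += 1

print(f"{significant} of 20 tests came out 'significant' purely by chance")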
In this lesson, we covered what experimental design is and why good experimental design matters. We then looked in depth at the principles of experimental design and defined some of the common terms you need to consider when designing an experiment. Next, we took a brief detour to see how you should
share your data and code for analysis, and finally we looked at the dangers of p-hacking and
manipulating data to achieve significance.

Big Data

A term you may have heard of before this course is Big Data. There have always been large datasets, but
it seems like lately, this has become a buzzword in data science. What does it mean? We talked a little about big data in the very first lecture of this course. As the name suggests, big data are very large datasets. We previously discussed three qualities that are commonly attributed to big datasets: volume, velocity, and variety. From these three adjectives, we can see that big data involves large datasets of diverse data types that are being generated very rapidly. But none of these qualities seem particularly new. Why has the concept of Big Data been so recently popularized? In part, because technology for data storage has evolved to be able to hold larger and larger datasets, the definition of "big" has evolved too. Also, our ability to collect and record data has improved with time, such that the speed with which data is collected is unprecedented. Finally, what is considered data has evolved, so that there is now more data than ever. Companies have recognized the benefits of collecting different information, and the rise of
the internet and technology have allowed different and varied datasets to be more easily collected and
available for analysis. One of the main shifts in data science has been moving from structured datasets
to tackling unstructured data. Structured data is what you traditionally might think of as data: long tables, spreadsheets, or databases with columns and rows of information that you can sum or average or analyze however you like within those confines. Unfortunately, this is rarely how data is presented to you in this day and age. The datasets we commonly encounter are much messier, and it is our job to extract the information we want and corral it into something tidy and structured.
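As a small illustration of what that corralling can look like, here is a sketch that pulls structured fields out of some made-up, unstructured log lines. The log format, field names, and values are all invented for the example.

import re
import pandas as pd

# Invented, unstructured log lines (the format and field names are made up).
raw_lines = [
    "2024-05-01 09:13:02 user=alice page=/home duration=34s",
    "2024-05-01 09:14:45 user=bob page=/videos duration=120s",
    "2024-05-01 09:15:10 user=alice page=/videos duration=87s",
]

pattern = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) user=(?P<user>\S+) "
    r"page=(?P<page>\S+) duration=(?P<duration>\d+)s"
)

# Each messy line becomes one tidy row with named columns.
rows = [m.groupdict() for m in (pattern.match(line) for line in raw_lines) if m]
df = pd.DataFrame(rows)
df["duration"] = df["duration"].astype(int)

print(df)
print("Average seconds on /videos:", df[df["page"] == "/videos"]["duration"].mean())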
With the digital age and the advance of the Internet, many pieces of information that weren't traditionally collected were suddenly able to be translated into a format that a computer could record, store, search, and analyze. Once this was appreciated, there was a proliferation of unstructured data being collected from all of our digital interactions: emails, Facebook and other social media interactions, text messages, shopping habits, smartphones and their GPS tracking, websites you visit, how long you are on that website and what you look at, CCTV cameras and other video sources, et cetera. The amount of data and the various
sources that can record and transmit data has exploded. It is because of this explosion in the volume,
velocity and variety of data that big data has become so salient a concept. These datasets are now so
large and complex that we need new tools and approaches to make the most of them. As you can guess,
given the variety of data types and sources, very rarely is the data stored in a neat, ordered spreadsheet that traditional methods for cleaning and analysis can be applied to. Given some of the
qualities of big data above, you can already start seeing some of the challenges that may be associated
with working with big data. For one, it is big. There is a lot of raw data that you need to be able to store and analyze. Second, it is constantly changing and updating. By the time you finish your analysis, there is even more new data you could incorporate into your analysis; every second you are analyzing is another second of data you haven't used. Third, the variety can be overwhelming. There are so many
sources of information that it can sometimes be difficult to determine what source of data may be best
suited to answer your data science question. Finally, it is messy. You don't have neat data tables to
quickly analyze. You have messy data. Before you can start looking for answers, you need to turn your
unstructured data into a format that you can analyze. So, with all of these challenges, why don't we just
stick to analyzing smaller, more manageable, curated datasets and arriving at our answers that way?
Sometimes questions are best addressed using these smaller datasets, but many questions benefit from
having lots and lots of data, and even if there is some messiness or inaccuracies in this data, the sheer volume of it negates the effect of these small errors, so we are able to get closer to the truth even with these messier datasets. Additionally, when you have data that is constantly updating, while this can be a
challenge to analyze, the ability to have real-time, up-to-date information allows you to do analyses that
are accurate to the current state and make on the spot, rapid, informed predictions and decisions. One
of the benefits of having all these new sources of information is that questions that weren't previously able to be answered due to lack of information suddenly have many more sources to glean information from, and new connections and discoveries are now able to be made. Questions that previously were
inaccessible now have newer, unconventional data sources that may allow you to answer these formerly
unfeasible questions. Another benefit to using big data is that it can identify hidden correlations. Since
we can collect data on a myriad of qualities on any one subject, we can look for qualities that may not
be obviously related to our outcome variable, but the big data can identify a correlation there. Instead
of trying to understand precisely why an engine breaks down or why a drug side effect disappears,
researchers can instead collect and analyze massive quantities of information about such events and
everything that is associated with them, looking for patterns that might help predict future occurrences.
Big data helps answer what, not why, and often that's good enough. Big data has now made it possible to
collect vast amounts of data, very rapidly from a variety of sources and improvements in technology
have made it cheaper to collect, store and analyze. But the question remains, how much of this data
explosion is useful for answering questions you care about? Regardless of the size of the data, you need
the right data to answer a question. A famous statistician, John Tukey, said in 1986, "The combination of
some data and an aching desire for an answer does not ensure that a reasonable answer can be
extracted from a given body of data." Essentially, any given dataset may not be suited to your question, even if you really want it to be, and big data does not fix this. Even the largest datasets around might
not be big enough to be able to answer your question if it's not the right data. In this lesson, we went
over some qualities that characterize big data: volume, velocity, and variety. We compared structured and unstructured data and examined some of the new sources of unstructured data. Then, we turned to
looking at the challenges and benefits of working with these big datasets. Finally, we came back to the
idea that data science is question-driven science and even the largest of datasets may not be
appropriate for your case.
