Statistical Analysis Lesson 1 Notes
Introduction to Statistical Analysis
Contents
Introduction
Lesson outcomes
Introduction to SAS
Lesson outcomes
By the end of this lesson, you should be able to:
• Explain the difference between Statistical Analysis and data analysis
• Understand the Statistical Analysis journey
• Download and install SAS University Edition
• Download an open source data set from Kaggle
• Import the open source data set into SAS University Edition
Introduction
Lesson 1 introduces Statistical Analysis.
We will discuss how this course differs from the data analysis course and who this course is aimed
at. We will then cover the fundamental concepts you need to build a solid understanding of Statistical Analysis.
Thereafter, we will introduce SAS, the tool we will use throughout this module. We will end the lesson with a
practical demonstration in SAS.
Data analysis is the investigation, cleaning, limited modelling and presentation of data.
A data analyst is, therefore, someone who specializes in exploring data, whereas a statistical analyst centres their
attention more on inferring beyond the data. The two fields overlap, especially as technology
evolves, but they remain separate pursuits.
1. Step one involves understanding the question posed to us as the statistician or statistical analyst.
4. Thereafter, we manage the data by, for instance, creating extra variables.
5. In step 5, we describe the data with descriptive measures like the mean and median, as well as with the help of
plots.
Let us break each step down and make sure we know what is expected of us in each part of the journey.
Before we dive into the data, we need to understand what stakeholders want from the data and the question they
are posing to us.
2. Data collection
Next up in the process we need to extract data from the various sources that can help us answer the question posed by the
stakeholders.
Use all data sources available to draw insights from; the more data we gather, the better.
Make sure that the format of the data is compatible with the tool you are using in this step and if not, transform it.
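To illustrate what transforming the format can look like, here is a minimal sketch. It is written in Python rather than SAS purely for illustration, and it assumes the source data arrives as a JSON array of flat records while the analysis tool expects CSV. The function name json_records_to_csv and the sample fields are hypothetical.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so that no field is lost.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


# Hypothetical sample data: the second record has an extra field.
sample = '[{"id": 1, "age": 34}, {"id": 2, "age": 41, "city": "Durban"}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

The resulting CSV text can then be saved to a file and imported into the analysis tool of your choice.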
3. Data cleaning
Data cleaning involves removing information that is not needed, updating information that is incomplete or
incorrect, and checking for duplicate variables, to name but a few of the steps.
Note: If you are combining data sources, check that unexpected errors did not creep in. Also, check if the data is outdated
or not.
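A few of these cleaning steps can be sketched in code. The example below is a simplified illustration in Python (not SAS). It assumes the data is a list of records, that incomplete records are dropped rather than updated, and that duplicates are exact repeats of a record. The function clean_rows and the column names are hypothetical.

```python
def clean_rows(rows, required, wanted):
    """Keep only the wanted columns, then drop incomplete and duplicate rows.

    `required` should be a subset of `wanted`.
    """
    cleaned = []
    seen = set()
    for row in rows:
        # Remove information that is not needed.
        trimmed = {col: row.get(col) for col in wanted}

        # Skip rows where a required value is missing or blank.
        if any(trimmed.get(col) in (None, "") for col in required):
            continue

        # Skip exact duplicates of a row we have already kept.
        key = tuple(trimmed[col] for col in wanted)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(trimmed)
    return cleaned


# Hypothetical raw data.
raw = [
    {"id": "1", "age": "34", "note": "x"},
    {"id": "1", "age": "34", "note": "y"},  # duplicate once 'note' is removed
    {"id": "2", "age": "", "note": "z"},    # age is missing
    {"id": "3", "age": "29", "note": ""},
]
cleaned = clean_rows(raw, required=["id", "age"], wanted=["id", "age"])
print(cleaned)
```

In practice, you may prefer to correct or fill in incomplete values instead of dropping the record; which is appropriate depends on the question being asked.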
4. Data management
In this step, we create new variables through a combination of existing ones, for instance, to make the data exploration
step more efficient. After we have explored the data, we will return to this step and likely have a better understanding of
how to manage the data.
This step can also include making sure the data complies with the applicable data protection and privacy legislation if the data is drawn from
older sources.
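As an illustration of creating a new variable from a combination of existing ones, the sketch below (in Python, for illustration only) derives a body mass index from weight and height. The data set, the column names weight_kg and height_m, and the function add_bmi are all hypothetical.

```python
def add_bmi(rows):
    """Return copies of the rows with a derived 'bmi' variable added."""
    managed = []
    for row in rows:
        new_row = dict(row)  # copy so that the original data stays untouched
        # New variable built from two existing ones: weight / height squared.
        new_row["bmi"] = round(row["weight_kg"] / row["height_m"] ** 2, 1)
        managed.append(new_row)
    return managed


# Hypothetical data.
people = [
    {"id": 1, "weight_kg": 70.0, "height_m": 1.75},
    {"id": 2, "weight_kg": 90.0, "height_m": 1.80},
]
managed = add_bmi(people)
print(managed)
```

Keeping the original records unchanged makes it easier to return to this step later and manage the data differently.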
5. Describing data
• Describing the data involves showing or summarising the data in a meaningful way such that, for example,
patterns might emerge from the data.
• Data is described to spot any obvious trends and outliers and evaluate initial distributions.
• This step helps us to further clean and manage the data.
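The points above can be sketched with a few descriptive measures. The example below uses Python's standard statistics module for illustration. The data is hypothetical, and flagging values more than 1.5 times the interquartile range beyond the quartiles is one common rule of thumb for outliers, assumed here rather than prescribed by this lesson.

```python
import statistics


def describe(values):
    """Summarise a numeric variable and flag possible outliers."""
    ordered = sorted(values)
    # Quartiles, using the module's default 'exclusive' method.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
        "outliers": [v for v in ordered if v < low or v > high],
    }


# Hypothetical ages; 98 looks suspicious next to the other values.
ages = [23, 25, 27, 29, 31, 33, 35, 98]
summary = describe(ages)
print(summary)
```

Here the mean (37.625) sits well above the median (30.0) because of the single value 98, which is also the only value flagged as an outlier. A gap like this between mean and median is a typical cue to go back and check the data.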
6. Rewind
We return to the data cleaning step once we have a better understanding of the data, and we repeat the data cleaning,
management and description steps of this journey until the data is in a sufficient condition for us to start
the modelling process.
7. Model
Finally, we get to the fun part: modelling the data! Here we apply the appropriate model to the data for predictive
analytical purposes.
With the model, we can forecast the event of interest within a certain accuracy.
If we work with garbage data, the result in this step will also be garbage, which is why we spend so much time cleaning and
managing the data.
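To make the idea of forecasting within a certain accuracy concrete, here is a minimal sketch in Python (for illustration only). It assumes a simple linear regression fitted by least squares is the appropriate model, which will not always be the case, and it uses the root mean squared error on the fitted data as a rough accuracy measure. The data and the function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Forecast y for a given x using the fitted line."""
    return intercept + slope * x


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given data."""
    squared = [(y - predict(slope, intercept, x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
forecast = predict(slope, intercept, 6)
error = rmse(hours, scores, slope, intercept)
print(slope, intercept, forecast, error)
```

For this data the fitted slope is about 4.1 and the intercept about 47.7, so the forecast for 6 hours is about 72.3. The error is measured on the same data used to fit the line, so it is likely to understate the error on new data.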
8. Interpret
Appropriate interpretation of the results is critical to make sense of the apparent disarray. To accurately assess the results,
you need to understand the subject matter you are analysing as well as the statistical method you are applying.
9. Report
The last step, but never the end of the analytics process, is to communicate your findings to your
audience, the stakeholders that posed the question.
Always be sure to convey the findings in a manner that is understandable to the audience. If the audience has some
technical knowledge of the field, you can explain findings in more detail, but if the audience has no technical knowledge of
the field, make sure to focus more on the key takeaways of the results.
Reporting can also take the form of a written document, for example, in the case of academic papers. Naturally, academic
writing is more technical than a presentation and is typically shaped by the medium in which it is to be published, but we will not
go into too much detail, because this depends on your field of study and many other factors.
Finally, remember to include visualizations, as these help to sketch a clearer picture of the outcome for both technical
and non-technical audiences.
Introduction to SAS
SAS
SAS (Statistical Analysis System) is a software suite, established in 1976, designed for data management, advanced analytics,
multivariate analysis, business intelligence and predictive analytics, to name but a few of its functions.
Its tools include data mining, Statistical Analysis, forecasting, text analysis, optimization and
simulation.
Advantages of SAS
Disadvantages of SAS
• There is a high cost involved in purchasing a license to work with SAS (in this module, however, we will use SAS
University Edition, which is free SAS software).
• In comparison to open-source platforms, SAS has shortcomings when it comes to representing data graphically.
• Users have also mentioned that it is difficult to use SAS for text mining (which we will not do in this course,
so this drawback is not of great importance to us).
Download
https://www.sas.com/en_za/software/university-edition/download-software.html
• Stata files.
• Microsoft Excel files. To import XLSB and XLSM files, you must use the SAS LIBNAME statement.
• JMP files.
• Paradox DB files.
• SPSS files.
1. Ensure the data set is in the myfolder folder, created as a subfolder in the SAS University Edition folder.
3. Import data