Introduction To Database Management and Statistical Software
03/13/23
Objective
• The main objective of this session is to enable
participants:
– To process and manage public health data using statistical
software (EpiData, Epi Info, SPSS, Stata…)
Data management and analysis
Data processing
–Data coding
–Data entry
–Data cleaning
Data analysis
–Descriptive analysis
–Bivariate analysis
–Stratified analysis
–Multivariate analysis
Data management
Data processing refers to:
–Data entry into a computer
–Data checks and correction
–Variable coding
–Data cleaning
The aim of this process is to produce a
relatively ‘clean’ data set
Data Entry
• Data entry is the transfer of data from a
questionnaire to a computer file.
• It converts the data into a form that can be
read and manipulated by the computers used in
quantitative data analysis.
• Before analysis, data must be checked for
errors, information that needs coding must be
coded, and missing values must be dealt with.
Data coding
• In general, computers are at their best with
numbers.
• Variables must therefore be translated into numbers through coding.
• Coding is assigning a separate (non-overlapping)
numerical code to each distinct answer and to
missing values.
• E.g., instead of using ‘Male’ and ‘Female’ for
the variable sex, the values can be coded as 1 =
Male and 2 = Female.
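The coding step can be sketched in a few lines of Python. This is only an illustration: the codebook dictionary, the function name `code_sex`, and the choice of 9 as the missing-value code are assumptions made for the example, not part of any particular package.

```python
# Hypothetical codebook: each answer gets one unique, non-overlapping code.
SEX_CODES = {"Male": 1, "Female": 2}
MISSING_CODE = 9  # assumed code for unanswered questions


def code_sex(answer):
    """Return the numeric code for a questionnaire answer.

    Anything that is blank or not in the codebook is treated as
    missing in this sketch.
    """
    cleaned = answer.strip().title() if answer else ""
    return SEX_CODES.get(cleaned, MISSING_CODE)


answers = ["Male", "female", " Female ", "", "male"]
coded = [code_sex(a) for a in answers]
print(coded)  # [1, 2, 2, 9, 1]
```

Normalising spacing and capitalisation before the lookup keeps ‘female’ and ‘ Female ’ from ending up as two different categories.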
Missing values
• Missing values occur when measurements were
not taken or respondents did not answer
questions.
• Missing values should not be entered as a ‘blank’,
because some statistical packages interpret
blanks as zero.
• Ideally, a code should be chosen to denote a
missing value (e.g. a code ‘9’ or ‘99’ or ‘999’); the
code must not be a value the variable can legitimately take.
• After the analysis is finished, the codes must be
decoded back to the original values when writing
the report.
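A minimal Python sketch of these points follows. The code 999 for a missing age, the label dictionary, and the function name `enter_age` are assumptions chosen for the example.

```python
from statistics import mean

MISSING_AGE = 999  # assumed missing-value code, outside the plausible age range


def enter_age(raw):
    """Convert a raw questionnaire field to a number, never leaving a blank."""
    raw = raw.strip()
    return int(raw) if raw else MISSING_AGE


raw_fields = ["34", "", "51", " ", "27"]
ages = [enter_age(field) for field in raw_fields]

# Exclude the missing code before computing statistics; otherwise 999
# would be averaged in as if it were a real age.
valid_ages = [age for age in ages if age != MISSING_AGE]
n_missing = len(ages) - len(valid_ages)
mean_age = mean(valid_ages)

# Decoding: turn numeric codes back into labels for the report.
SEX_LABELS = {1: "Male", 2: "Female", 9: "Missing"}
coded_sex = [1, 2, 2, 9, 1]
decoded_sex = [SEX_LABELS[code] for code in coded_sex]

print(ages, n_missing, decoded_sex)
```

The important design point is that the missing code is filtered out explicitly before any calculation; most statistical packages do this automatically once the code has been declared as missing.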
Data Cleaning
• Once data have been gathered, they need to
be entered into a computer data file and
checked for errors.
• The aim of this process is to produce a clean
set of data for statistical analysis.
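Typical cleaning checks (valid codes, plausible ranges, duplicate identifiers) can be sketched as below. The records, the valid-code set, and the age range are made up for illustration; real limits depend on the study.

```python
# Made-up records containing three deliberate errors.
records = [
    {"id": 1, "sex": 1, "age": 34},
    {"id": 2, "sex": 3, "age": 51},   # invalid sex code
    {"id": 3, "sex": 2, "age": 430},  # implausible age, probably a typing error
    {"id": 3, "sex": 2, "age": 43},   # duplicate id
]

VALID_SEX = {1, 2, 9}        # assumed codes: 1 = Male, 2 = Female, 9 = missing
AGE_RANGE = range(0, 121)    # assumed plausible ages: 0 to 120
MISSING_AGE = 999            # assumed missing-value code for age


def find_errors(rows):
    """Return a list of (record id, problem) pairs for later correction."""
    errors = []
    seen_ids = set()
    for row in rows:
        if row["id"] in seen_ids:
            errors.append((row["id"], "duplicate id"))
        seen_ids.add(row["id"])
        if row["sex"] not in VALID_SEX:
            errors.append((row["id"], "invalid sex code"))
        if row["age"] != MISSING_AGE and row["age"] not in AGE_RANGE:
            errors.append((row["id"], "age out of range"))
    return errors


errors = find_errors(records)
for record_id, problem in errors:
    print(record_id, problem)
```

Each flagged record is then checked against the original questionnaire rather than corrected by guesswork.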
Components of Data Analysis
Data processing
– Data entry
– Coding
– Cleaning
Descriptive/exploratory
– Frequencies
– Tables and graphs
– Cross tabulations (chi-square, Spearman’s correlation…)
– Measures of central tendency and variation
– Proportions/percentages
Analytic/inferential
– Estimation
– Confidence intervals (P-value, OR,…)
– Hypothesis testing
– Statistical models
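As a rough illustration of the descriptive and inferential steps, the sketch below computes a frequency, a percentage, measures of central tendency and variation, a 2 x 2 cross-tabulation, and an odds ratio with an approximate 95% confidence interval (Woolf's log-based method). All data are invented, and a statistical package would normally do this for you.

```python
import math
from collections import Counter
from statistics import mean, median, stdev

# Invented ages for the descriptive part.
ages = [23, 35, 31, 42, 29]
age_mean, age_median, age_sd = mean(ages), median(ages), stdev(ages)

# Invented records: (exposed, diseased), coded 1 = yes, 0 = no.
data = [(1, 1)] * 20 + [(1, 0)] * 30 + [(0, 1)] * 10 + [(0, 0)] * 40

# Frequency and percentage for one variable.
exposure_counts = Counter(exposed for exposed, _ in data)
percent_exposed = 100 * exposure_counts[1] / len(data)

# Cross-tabulation of exposure by disease (a 2 x 2 table).
table = Counter(data)
a, b = table[(1, 1)], table[(1, 0)]  # exposed: diseased, not diseased
c, d = table[(0, 1)], table[(0, 0)]  # unexposed: diseased, not diseased

# Odds ratio with an approximate 95% CI computed on the log scale.
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"mean age {age_mean}, median {age_median}, SD {age_sd:.2f}")
print(f"exposed: {percent_exposed:.0f}%")
print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```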
Statistical software
Spreadsheets and graphics
• Create tables
• Filtering (to select and show certain data in the
table)
• Charts
1. Bar chart
2. Line chart
3. Pie (circle) chart
4. Scatter chart
For data entry:
• EpiData
• Epi Info
Statistical software for data management and analysis
•A number of software packages exist for data management,
including Epi Info, EpiData, SPSS, Stata, spreadsheets,…
EpiData and Epi Info are good for
data management (data entry)
•They are fast and reliable, and they allow
for controlled and double data entry.
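The idea behind double data entry is that the same questionnaires are keyed in twice and the two files are compared field by field. The Python sketch below illustrates that comparison only; it is not how EpiData or Epi Info implement it, and the function name and data layout are assumptions.

```python
def compare_entries(first, second):
    """Return (record_id, field, value_1, value_2) for every disagreement.

    Both arguments map a record id to a dictionary of field values.
    """
    mismatches = []
    for record_id in sorted(first.keys() | second.keys()):
        row1 = first.get(record_id)
        row2 = second.get(record_id)
        if row1 is None or row2 is None:
            # The record was keyed in only once.
            mismatches.append((record_id, "<record>", row1, row2))
            continue
        for field in sorted(row1.keys() | row2.keys()):
            if row1.get(field) != row2.get(field):
                mismatches.append(
                    (record_id, field, row1.get(field), row2.get(field))
                )
    return mismatches


# The two operators disagree on the age of record 101 (34 versus 43).
entry_1 = {101: {"sex": 1, "age": 34}, 102: {"sex": 2, "age": 51}}
entry_2 = {101: {"sex": 1, "age": 43}, 102: {"sex": 2, "age": 51}}
mismatches = compare_entries(entry_1, entry_2)
print(mismatches)  # [(101, 'age', 34, 43)]
```

Every mismatch is resolved by going back to the paper questionnaire, since either entry could be the wrong one.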
EpiData
• It is used for simple or programmed data entry and data
documentation.
Statistical Software for Data Analysis
• As quantitative research grows, the application of
statistical software (SS) becomes a more crucial
part of data analysis.
• Initially, analysis was done with paper and pen; later came
punching machines, then simple calculators and complex scientific
calculators, and, with the advent of the computer, statistical software.
• Many proprietary and freeware statistical software
packages are available that are suitable for different
statistical analysis, depending on the user's needs
Available Statistical Packages
STATA
• It is a powerful statistical package with smart data management
facilities, a wide array of up-to-date statistical techniques, and an
excellent system for producing publication-quality graphs.
• STATA is available for Windows, Unix, and Mac computers.
• The standard version is called Stata/IC (or Intercooled Stata)
and can handle up to 2,047 variables.
• A special edition called Stata/SE can handle up to 32,767
variables (and allows longer string variables and larger matrices).
• The version for multicore/multiprocessor computers is called
Stata/MP; it has the same limits but is substantially faster.
• STATA performs most general statistical
analyses:
– Regression analysis (any type of regression)
– Survival analysis
– Analysis of variance (ANOVA/ANCOVA)
– Factor Analysis and PCA
– Multivariate analysis (MANOVA/MANCOVA)
– Time series analysis.
Epi Info
• Epi Info is a suite of free data management,
processing, and analysis software designed
specifically for the public health community by the CDC.
SPSS
• SPSS stands for Statistical Package for the Social Sciences.
Opening SPSS
•You can open SPSS in one of two ways.
1. If there is an SPSS icon on the desktop, simply
put the cursor on it and double-click the left
mouse button.
2. If not, you can open it by following these steps:
Start > All Programs > SPSS 16.0
When you use SPSS, you work in one of
several windows:
• The data view
• Variable view
• Output view
• Draft output view
• The syntax view
The data view
•The data view displays your actual data and any
new variables you have created.
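The variable view
•The variable view lists the properties of each variable
(name, type, width, and so on).
Name
–Each variable name must be unique; duplication
is not allowed.
–Must start with a letter
–May have up to 64 characters in SPSS 12 and later (8 in
earlier versions), including letters,
numbers, and the symbols (@, #, _, or $).
–Cannot end with a period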
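The naming rules can be expressed as a small checker. This is an illustrative Python sketch, not part of SPSS: the function name is hypothetical, the maximum length is left as a parameter because the limit depends on the SPSS version, and two further assumptions are made (names are compared without regard to case, and a period is allowed inside a name but not at the end).

```python
import re


def check_variable_name(name, max_length, existing=()):
    """Return a list of rule violations; an empty list means the name is fine."""
    problems = []
    # Assumption: names are compared without regard to upper/lower case.
    if name.lower() in {other.lower() for other in existing}:
        problems.append("duplicate name")
    if not name[:1].isalpha():
        problems.append("must start with a letter")
    if len(name) > max_length:
        problems.append("too long")
    # Letters, numbers, and @ # _ $; a period inside the name is assumed legal.
    if not re.fullmatch(r"[A-Za-z0-9@#_$.]*", name):
        problems.append("contains a character that is not allowed")
    if name.endswith("."):
        problems.append("cannot end with a period")
    return problems


print(check_variable_name("age", max_length=8))            # []
print(check_variable_name("1age", max_length=8))           # ['must start with a letter']
print(check_variable_name("householdsize", max_length=8))  # ['too long']
```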
Type
–Specifies whether a variable is
numeric, string, or another type
Width
–Indicates the number of characters
–Maximum width is 40 characters.
Decimals
–Indicates the number of decimal places displayed.
–If more decimals have been entered or computed by
SPSS, the additional information will be retained
internally but not displayed on screen.
Label
–Identifies in detail what a variable
represents.
–Is limited to 255 characters
–May contain spaces and punctuation
Values
–This is where we assign codes to the categories of
categorical variables.
Missing
–Signals to SPSS which data should be treated as
missing
–System-missing data: SPSS displays a single
period
Columns
–Columns affect only the display of values in
the Data Editor. Changing the column width
does not change the defined width of a variable.
Measure
–Indicates the level of measurement of
each variable. The types of measurement
are:
•Nominal
•Ordinal
•Scale (interval or ratio)
Nominal
-Is the simplest type of data, in which all values
fall into unordered categories or classes.
-Data that take on one of two distinct values, such as
male and female, are said to be dichotomous.
-However, not all nominal data need to be
dichotomous. Often there are 3 or 4 possible
categories into which our data may fall,
for example: blood type.
Ordinal
-When the order among categories is important,
the observations are referred to as ordinal
data.
For example: the blood pressure status of a
patient (normal, serious, critical).
Scale data (Interval and Ratio)
- Is the data type in which the observations are
numeric measurements that can be arranged from lowest
to highest and whose differences are meaningful.
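The level of measurement matters because it determines which summaries make sense. The Python sketch below uses one common rule of thumb (mode for nominal data, median for ordinal data, mean for scale data); the function name, the data, and the numeric codes are invented for the example.

```python
from statistics import mean, median, mode


def summarize(values, level):
    """Return the summaries that are meaningful for the given level."""
    if level == "nominal":
        return {"mode": mode(values)}
    if level == "ordinal":
        return {"mode": mode(values), "median": median(values)}
    if level == "scale":
        return {"median": median(values), "mean": mean(values)}
    raise ValueError(f"unknown level of measurement: {level}")


blood_type = ["A", "O", "B", "O", "AB"]    # nominal: unordered categories
status = [1, 1, 2, 3, 2, 1, 1]             # ordinal: 1 = normal, 2 = serious, 3 = critical
systolic_bp = [120, 135, 128, 142]         # scale: measured values

print(summarize(blood_type, "nominal"))
print(summarize(status, "ordinal"))
print(summarize(systolic_bp, "scale"))
```

A mean of the ordinal codes could be computed, but it would treat the gap between ‘normal’ and ‘serious’ as equal to the gap between ‘serious’ and ‘critical’, which the coding does not guarantee.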
The output view
•The output window is where you see
the results of your analyses.
The syntax view
•The best method of preserving the exact steps of
a particular analysis is the syntax view.
•In the syntax view, you’ll preserve the code used
to generate any set of tables or charts.
Note
Syntax is basically the actual computer code that
produces a specific output.