The document defines statistical terms across multiple sections. It provides definitions for common statistical concepts like population, sample, measures of central tendency, and types of data. The glossary contains over 50 terms and their meanings to help explain foundational statistical concepts.
365 Data Science - Statistics
Glossary
| Section | Lesson | Word | Definition |
|---|---|---|---|
| 1 | Population vs sample | population | The collection of all items of interest to our study; denoted N. |
| 1 | Population vs sample | sample | A subset of the population; denoted n. |
| 1 | Population vs sample | parameter | A value that refers to a population. It is the opposite of a statistic. |
| 1 | Population vs sample | statistic | A value that refers to a sample. It is the opposite of a parameter. |
| 1 | Population vs sample | random sample | A random sample is collected when each member of the sample is chosen from the population strictly by chance. |
| 2 | Types of data | representative sample | A representative sample is a subset of the population that accurately reflects the members of the entire population. |
| 2 | Types of data | variable | A variable is a set of characteristics of a person, object, thing, idea, etc. Variables can vary from case to case. For example, 'height' is a variable… |
| 2 | Types of data | types of data | A way to classify data. There are two types of data: categorical and numerical. |
| 2 | Types of data | categorical data | A subset of types of data. Describes categories or groups. |
| 2 | Types of data | numerical data | A subset of types of data. Represents numbers. Can be further divided into discrete and continuous. |
| 2 | Types of data | discrete data | Data that can be counted in a finite manner. Opposite of continuous. |
| 2 | Types of data | continuous data | Data that is 'infinite' and impossible to count. Opposite of discrete. |
| 2 | Levels of measurement | levels of measurement | A way to classify data. There are two levels of measurement, qualitative and quantitative, which are further divided into nominal & ordinal, and… |
| 2 | Levels of measurement | qualitative data | A subset of levels of measurement. There are two types of qualitative data: nominal and ordinal. |
| 2 | Levels of measurement | quantitative data | A subset of levels of measurement. There are two types of quantitative data: ratio and interval. |
| 2 | Levels of measurement | nominal | Refers to variables that describe different categories and cannot be put in any order. |
| 2 | Levels of measurement | ordinal | Refers to variables that describe different categories, but can be ordered. |
| 2 | Levels of measurement | ratio | A number, no matter if whole or a fraction. There exists a unique and unambiguous zero point. |
| 2 | Levels of measurement | interval | An interval variable represents a number or an interval. There isn't a unique and unambiguous zero point. For example, degrees in Celsius and… |
| 2 | Categorical variables. Visualization techniques | frequency distribution table | A table that represents the frequency of each variable. |
| 2 | Categorical variables. Visualization techniques | frequency | Measures the occurrence of a variable. |
| 2 | Categorical variables. Visualization techniques | absolute frequency | Measures the NUMBER of occurrences of a variable. |
| 2 | Categorical variables. Visualization techniques | relative frequency | Measures the RELATIVE NUMBER of occurrences of a variable. Usually expressed in percentages. |
| 2 | Categorical variables. Visualization techniques | cumulative frequency | The sum of relative frequencies so far. The cumulative frequency of all members is 100% or 1. |
| 2 | Categorical variables. Visualization techniques | Pareto diagram | A special type of bar chart, where frequencies are shown in descending order. There is an additional line on the chart, showing the cumulative… |
| 2 | The Histogram | histogram | A type of bar chart that represents numerical data. It is divided into intervals (or bins) that are not overlapping and span from the first observation… |
| 2 | The Histogram | bins (histogram) | The intervals that are represented in a histogram. |
| 2 | Cross table and scatter plot | cross table | A table which represents categorical data. On one axis we have the categories, and on the other, their frequencies. It can be built with absolute… |
| 2 | Cross table and scatter plot | contingency table | See cross table. |
| 2 | Cross table and scatter plot | scatter plot | A plot that represents numerical data. Graphically, each observation looks like a point on the scatter plot. |
| 2 | Mean, median and mode | measures of central tendency | Measures that describe the data through the so-called 'averages'. The most common are the mean, median and mode. There is also geometric… |
| 2 | Mean, median and mode | mean | The simple average of the dataset. Denoted μ. |
| 2 | Mean, median and mode | median | The middle number in an ordered dataset. |
| 2 | Mean, median and mode | mode | The value that occurs most often. A dataset can have 0, 1 or multiple modes. |
| 2 | Skewness | measures of asymmetry | Measures that describe the data through the level of symmetry that is observed. The most common are skewness and kurtosis. |
| 2 | Skewness | skewness | A measure that describes the symmetry of the dataset around its mean. |
| 2 | Variance | sample formula | A formula that is calculated on a sample. The value obtained is a statistic. |
| 2 | Variance | population formula | A formula that is calculated on a population. The value obtained is a parameter. |
| 2 | Variance | measures of variability | Measures that describe the data through the level of dispersion (variability). The most common ones are variance and standard deviation. |
| 2 | Variance | variance | Measures the dispersion of the dataset around its mean. It is measured in units squared. Denoted σ² for a population and s² for a sample. |
| 2 | Standard deviation and coefficient of variation | standard deviation | Measures the dispersion of the dataset around its mean. It is measured in original units. It is equal to the square root of the variance. Denoted… |
| 2 | Standard deviation and coefficient of variation | coefficient of variation | Measures the dispersion of the dataset around its mean. It is also called 'relative standard deviation'. It is useful for comparing different datasets… |
| 2 | Covariance | univariate measure | A measure which refers to a single variable. |
| 2 | Covariance | multivariate measure | A measure which refers to multiple variables. |
| 2 | Covariance | covariance | A measure of relationship between two variables. Usually, because of its scale of measurement, covariance is not directly interpretable. Denoted… |
| 2 | Correlation | linear correlation coefficient | A measure of relationship between two variables. Very useful for direct interpretation as it takes on values from [-1,1]. Denoted ρxy for a population… |
| 2 | Correlation | correlation | A measure of the relationship between two variables. There are several ways to compute it, the most common being the linear correlation coefficient… |
| 3 | What is a distribution | distribution | A distribution is a function that shows the possible values for a variable and the probability of their occurrence. |
| 3 | The normal distribution | Bell curve | A common name for the normal distribution. |
| 3 | The normal distribution | Gaussian distribution | The original name of the normal distribution. Named after the famous mathematician Gauss, who was the first to explore it through his work on… |
| 3 | The normal distribution | to control for the mean/std/etc | Holding this particular value constant, we change the other variables and observe the effect. |
| 3 | The standard normal distribution | standard normal distribution | A normal distribution with a mean of 0 and a standard deviation of 1. |
| 3 | The standard normal distribution | z-statistic | The statistic associated with the normal distribution. |
| 3 | The standard normal distribution | standardized variable | In statistics, we usually standardize a variable using the z-score formula. This is done by first subtracting the mean and then dividing by the standard… |
| 3 | The central limit theorem | central limit theorem | No matter the distribution of the underlying dataset, the sampling distribution of the means of the dataset approximates a normal distribution. |
| 3 | The central limit theorem | sampling distribution | The distribution of a sample statistic (for example, the sample mean) across repeated samples. |
| 3 | Standard error | standard error | The standard error is the standard deviation of the sampling distribution. It takes into account the size of the sample. |
| 3 | Estimators and estimates | estimator | A function or a rule, according to which we make estimations. |
| 3 | Estimators and estimates | estimate | A particular value that was estimated through an estimator. |
| 3 | Estimators and estimates | bias | An unbiased estimator has an expected value equal to the population parameter. A biased one has an expected value different from the population parameter… |
| 3 | Estimators and estimates | efficiency (in estimators) | In the context of estimators, efficiency loosely refers to 'lack of variability'. The most efficient estimator is the one with the least variability. … |
| 3 | Estimators and estimates | point estimator | A function or a rule, according to which we make estimations that will result in a single number. |
| 3 | Estimators and estimates | point estimate | A single number that was derived from a certain point estimator. |
| 3 | Estimators and estimates | interval estimator | A function or a rule, according to which we make estimations that will result in an interval. In this course, we will only consider confidence intervals… |
| 3 | Estimators and estimates | interval estimate | A particular result that was obtained from an interval estimator. It is an interval. |
| 3 | Definition of confidence intervals | confidence interval | A confidence interval is the range within which you expect the population parameter to be. You have a certain probability of it being correct, … |
| 3 | Definition of confidence intervals | reliability factor | A value from a z-table, t-table, etc. that is associated with our test. |
| 3 | Definition of confidence intervals | level of confidence | Shows in what % of the cases we expect the population parameter to fall into the confidence interval we obtained. Denoted 1 - α. Example: 95… |
| 3 | Population variance known, z-score | critical value | A value coming from a table for a specific statistic (z, t, F, etc.) associated with the probability α that the researcher has chosen. |
| 3 | Population variance known, z-score | z-table | A table associated with the z-statistic, where given a probability (α), we can see the value of the standardized variable, following the standard… |
| 3 | Student's T distribution | t-statistic | A statistic that is generally associated with the Student's T distribution, in the same way the z-statistic is associated with the normal distribution… |
| 3 | Student's T distribution | a rule of thumb | A principle which is approximately true, but is widely used in practice due to its simplicity. |
| 3 | Student's T distribution | t-table | A table associated with the t-statistic, where given a probability (α) and certain degrees of freedom, we can check the reliability factor. |
| 3 | Student's T distribution | degrees of freedom | The number of variables in the final calculation that are free to vary. |
| 3 | Margin of error | margin of error | Half the width of a confidence interval. It drives the width of the interval. |
| 4 | Null vs alternative | hypothesis | Loosely, a hypothesis is 'an idea that can be tested'. |
| 4 | Null vs alternative | hypothesis test | A test that is conducted in order to verify if a hypothesis is true or false. |
| 4 | Null vs alternative | null hypothesis | The null hypothesis is the one to be tested. Whenever we are conducting a test, we are trying to reject the null hypothesis. |
| 4 | Null vs alternative | alternative hypothesis | The alternative hypothesis is the opposite of the null. It is usually the opinion of the researcher, as he is trying to reject the null hypothesis and… |
| 4 | Null vs alternative | to accept a hypothesis | The statistical evidence shows that the hypothesis is likely to be true. |
| 4 | Null vs alternative | to reject a hypothesis | The statistical evidence shows that the hypothesis is likely to be false. |
| 4 | Null vs alternative | one-tailed (one-sided) test | Tests which determine if a value is lower (lower or equal) or higher (higher or equal) than a certain value are one-sided. This is because… |
| 4 | Null vs alternative | two-tailed (two-sided) test | Tests which determine if a value is equal (or different) to a certain value are two-sided. This is because they can be rejected on two sides… |
| 4 | Rejection region and significance level | significance level | The probability of rejecting the null hypothesis, if it is true. Denoted α. You choose the significance level. All else equal, the lower the level, the… |
| 4 | Rejection region and significance level | rejection region | The part of the distribution for which we would reject the null hypothesis. |
| 4 | Type I error vs type II error | type I error (false positive) | This error consists of rejecting a null hypothesis that is true. The probability of committing it is α, the significance level. |
| 4 | Type I error vs type II error | type II error (false negative) | This error consists of accepting a null hypothesis that is false. The probability of committing it is β. |
| 4 | Type I error vs type II error | power of the test | Probability of rejecting a null hypothesis that is false (the researcher's goal). Denoted by 1 - β. |
| 4 | Test for the mean. Population variance known | z-score | The standardized variable associated with the dataset we are testing. It is observed in the table with an α equal to the level of significance of the… |
| 4 | Test for the mean. Population variance known | μ₀ | The hypothesized population mean. |
| 4 | p-value | p-value | The smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic. |
| 4 | Test for the mean. Population variance unknown | email open rate | An email open rate is a measure of how many people on the email list actually open the emails they have received. |
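
Several Section 2 entries (mean, median, mode, sample formula vs population formula, variance, standard deviation, coefficient of variation, covariance, linear correlation coefficient) describe calculations. The following is a minimal Python sketch of those calculations, not part of the original glossary. The data values and all variable names are hypothetical and chosen only for illustration.

```python
import math
import statistics

# Hypothetical paired observations (assumed data, not from the glossary).
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0, 7.0]
n = len(x)

# Measures of central tendency.
mean_x = sum(x) / n                  # simple average
median_x = statistics.median(x)      # middle value of the ordered data
modes_x = statistics.multimode(x)    # list of the most frequent values

# Variance: the population formula divides by n, the sample formula by n - 1.
squared_deviations = sum((v - mean_x) ** 2 for v in x)
population_variance = squared_deviations / n
sample_variance = squared_deviations / (n - 1)

# Standard deviation is the square root of the variance (original units).
sample_std_x = math.sqrt(sample_variance)

# Coefficient of variation: standard deviation relative to the mean.
coefficient_of_variation = sample_std_x / mean_x

# Sample covariance and linear correlation coefficient between x and y.
mean_y = sum(y) / n
sample_covariance = sum(
    (a - mean_x) * (b - mean_y) for a, b in zip(x, y)
) / (n - 1)
sample_std_y = statistics.stdev(y)
correlation = sample_covariance / (sample_std_x * sample_std_y)

print(mean_x, median_x, modes_x)
print(population_variance, sample_variance, coefficient_of_variation)
print(sample_covariance, correlation)
```

Dividing the covariance by both standard deviations is what brings the correlation into the [-1, 1] range mentioned in the glossary, which is why it is easier to interpret than the raw covariance.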
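
The Section 3 and 4 entries on standard error, z-score, critical value, margin of error, confidence interval, significance level and p-value fit together in a test for the mean with known population variance. The sketch below shows one way this might look in Python for a two-sided test. Every input value (sample mean, μ₀, σ, n, α) is an assumption made up for the example, and the variable names are illustrative only.

```python
import math
from statistics import NormalDist

# Hypothetical inputs (assumed for illustration, not from the glossary).
sample_mean = 52.0   # observed sample statistic
mu_0 = 50.0          # hypothesized population mean
sigma = 6.0          # population standard deviation, assumed known
n = 36               # sample size
alpha = 0.05         # significance level chosen by the researcher

# Standard error: standard deviation of the sampling distribution of the mean.
standard_error = sigma / math.sqrt(n)

# z-score: standardized distance between the sample mean and mu_0.
z_score = (sample_mean - mu_0) / standard_error

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Critical value for a two-sided test: alpha is split across both tails.
critical_value = standard_normal.inv_cdf(1 - alpha / 2)

# Two-sided p-value: probability of a result at least this extreme under the null.
p_value = 2 * (1 - standard_normal.cdf(abs(z_score)))

# Reject the null hypothesis when the p-value is below the significance level.
reject_null = p_value < alpha

# Confidence interval at level 1 - alpha: point estimate +/- margin of error.
margin_of_error = critical_value * standard_error
confidence_interval = (sample_mean - margin_of_error,
                       sample_mean + margin_of_error)

print(z_score, critical_value, p_value, reject_null)
print(confidence_interval)
```

With these assumed numbers the z-score exceeds the critical value, the p-value falls below α, and μ₀ lies outside the confidence interval, so all three views lead to the same decision to reject the null hypothesis.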