A Hand Note for Department of Finance
Time series, Sampling and Test of Hypothesis
Department of Finance
Jagannath University
Prepared By
Md. Mazharul Islam (Jony)
ID no:091541, 3rd Batch.
Department of Finance.
Jagannath University.
Email: jony007ex@gmail.com
Time Series
Time Series:
A time series is a set of observations taken at specific times, usually at equal intervals. In other words, a time series is a collection of data recorded over a period of time (weekly, monthly, quarterly), an analysis of history that management can use to make current decisions and plans based on long-term forecasting. It usually assumes that past patterns will continue into the future.
Mathematically, a time series is defined by the values 𝑦1 , 𝑦2 , … of a variable y (temperature, closing price of a share, etc.) at times 𝑡1 , 𝑡2 , …. Thus y is a function of t; this is symbolized 𝑦 = 𝑓(𝑡). Here t is the independent variable and y is the dependent variable. Examples include the total annual production of rice in Bangladesh over a period of years, the daily closing price of a share on the stock exchange, and hourly temperatures.
Components of Time Series:
There are four basic types of variation which account for the change in the series over a period of time. These four types of patterns, variations, or movements are known as the components or elements of a time series. They are:
Secular Trend:
The general or smooth tendency of the data to grow or decline over a long period of time is technically called secular trend, or simply trend. By trend we mean the smooth, regular, long-term movement of the data; sudden and erratic movements, whether upward or downward, have nothing to do with the trend. For example, population, prices, and production typically show an upward trend.
Trend can be divided under two heads:
1. Linear or straight-line trend: A linear trend equation is used when the data are increasing or decreasing by equal amounts. The equation is 𝑦 = 𝑎 + 𝑏𝑡.
2. Nonlinear trend: A nonlinear trend equation is used when the data are increasing or decreasing by increasing amounts over time. The equation is log 𝑦 = log 𝑎 + 𝑡 log 𝑏 (the logarithmic form of 𝑦 = 𝑎𝑏^𝑡).
Seasonal Variation:
Seasonal variation is a pattern of change in a time series that takes place within a period of 12 months and tends to repeat each year as a result of changes in climate, weather conditions, festivals, etc. For example, during winter there is a greater demand for woolen clothes, and on the occasion of Eid there is a big demand for sweets and bank withdrawals go up for shopping.
Uses of seasonal Variation:
Seasonal fluctuations help in planning for sufficient goods and materials on hand to meet varying seasonal demand.
Seasonal variations are fluctuations that coincide with certain seasons and are repeated year after year.
Analysis of seasonal fluctuations over a period of years helps in evaluating current sales.
Cyclical Variation:
The term cycle refers to the recurrent variation in a time series that usually lasts longer than a year and is not regular, in either amplitude or length. A business cycle consists of the recurrence of rising and falling movements of business activity around some sort of statistical trend or normal (a statistical average). There are four well-defined periods or phases in the business cycle, namely: i) Prosperity, ii) Decline, iii) Depression, iv) Improvement.
For example, in the prosperity period share business is booming, prices are high and profits are made. After a period of time share prices decline, business activity declines with them, and then the depression comes; as a result factories close and businesses fail. After this rigid economy, business activity increases with rising prices in the improvement or recovery period.
Irregular Variation:
Irregular Variation refers to such variations in business activities which do not repeat in a definite
pattern. In fact, the category labeled irregular variation is really intended to include all types of
variation other than those accounting for the trend, seasonal and cyclical movements. For example,
Irregular variations are caused by such isolated special occurrences as flood, earthquakes, strikes,
war, rapid technological progress etc.
Irregular Variation can be classified into two types.
Episodic: Episodic fluctuations are unpredictable, but they can be identified. The initial
impact on the economy of a major strike or a war can be identified, but strike or war cannot
be predicted.
Residual: After the episodic fluctuations have been removed, the remaining variation is
called residual variation. The residual fluctuations, often called chance fluctuations, are
unpredictable and they cannot be identified.
Note: Neither episodic nor residual variation can be projected into the future.
Mathematical Model or Relationship for Time Series:
In traditional time series analysis, the mathematical relationship can be described as 𝑦𝑡 = 𝑓(𝑇, 𝑆, 𝐶, 𝐼).
Where,
T = Secular trend, S = Seasonal variation, C = Cyclical variation, I = Irregular variation,
y = Total value of the time series at time t.
There are two relationships in the time series.
Multiplicative Relationship: The multiplicative relationship between the four components assumes that any particular value in the series is the product of factors that can be attributed to the various components. Symbolically,
𝑦 = 𝑇 × 𝑆 × 𝐶 × 𝐼
In this model, the seasonal, cyclical and irregular components (all but the trend) are viewed not as absolute amounts but as relative influences.
For example, A seasonal index of 110 percent would mean that the actual value is 10 percent
higher than it otherwise would be because of seasonal influences.
Additive Relationship: The additive relationship treats each observation of a time series as the sum of the four components. Symbolically,
𝑦 = 𝑇 + 𝑆 + 𝐶 + 𝐼.
When this relationship is assumed, the major aim of time series analysis is to isolate those parts of the overall variation of the series which are traceable to each of the four components and to measure each part independently.
The multiplicative model, by contrast, is not only considered the standard or traditional assumption for time series analysis; it is also employed in practice more often than all other possible models combined.
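As a numerical illustration of the two relationships (all component values below are hypothetical):

```python
# Multiplicative model: seasonal, cyclical and irregular components are
# relative indices applied to the trend (1.10 = a seasonal index of 110%).
T = 500.0
S, C, I = 1.10, 0.95, 1.02
y_mult = T * S * C * I
print(round(y_mult, 2))   # 532.95

# Additive model: the components are absolute amounts added to the trend.
S_amt, C_amt, I_amt = 50.0, -25.0, 10.0
y_add = T + S_amt + C_amt + I_amt
print(y_add)              # 535.0
```

Note how the multiplicative indices scale the trend, while the additive amounts simply shift it.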
Moving-Average Method:
The moving average method is not only useful in smoothing a time series to see its trend; it is the
basic method used in measuring the seasonal fluctuation. When a trend is to be determined by the
method of moving average, the average value for a number of years ( or months, or weeks) is
secured and this average is taken as the normal or trend value for the unit of time falling at the
middle of the period covered in the calculation of the average.
The effect of averaging is to give a smoother curve, lessening the influence of the fluctuations that pull the annual figures away from the general trend.
While applying this method, it is necessary to select a period for moving average such as 3 yearly
moving averages, 6 yearly moving averages, 8 yearly moving averages etc. The period of moving
average is to be decided in the light of the length of the cycle.
The formula for a 3-year moving average will be:
(𝑎 + 𝑏 + 𝑐)/3, (𝑏 + 𝑐 + 𝑑)/3, (𝑐 + 𝑑 + 𝑒)/3, ………
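A minimal sketch of this calculation (the data values are hypothetical annual figures):

```python
def moving_average(values, period=3):
    """Moving averages of the given period, as in the 3-yearly
    formula above: (a+b+c)/3, (b+c+d)/3, (c+d+e)/3, ..."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]

data = [10, 12, 14, 16, 18]
print(moving_average(data))   # [12.0, 14.0, 16.0]
```

Each average is taken as the trend value for the middle year of its window, which is why a 3-year series of five values yields three smoothed points.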
Sampling
Sampling:
Sampling refers to drawing a sample (a subset) from a population (the full set). In other words, Sampling is
that part of statistical practice concerned with the selection of an unbiased or random subset of individual
observations within a population of individuals intended to yield some knowledge about the population of
concern, especially for the purposes of making predictions based on statistical inference. Sampling is an
important aspect of data collection.
Census:
A census is the process of obtaining information about every member of a population (not necessarily a
human population). It can be contrasted with sampling in which information is only obtained from a subset
of a population. As such it is a method used for accumulating statistical data.
Sample and Population:
A sample in a research study is a relatively small number of individuals about whom information is
obtained. The larger group to whom the information is then generalized is the population.
Reasons for sampling instead of census / Need for sampling:
There are six reasons for sampling. They are described below.
1. Economy: The unit cost of collecting data in the case of a census is significantly less than in the case of sampling (for example, Tk. 200 per unit in a census against Tk. 1,000 per unit in sampling), but due to the larger number of items the total cost involved in a census is significantly higher than in a sample survey.
We can find the total cost of collecting information by multiplying the total population (N) by the unit cost in the case of a census, and by multiplying the sample size (n) by the unit cost in the case of sampling:
Census: 10,00,000 × 200 = 20,00,00,000
Sampling: 5,000 × 1,000 = 50,00,000
2. Timeliness: The unit time involved in the case of sampling is higher than in the case of a census, but due to the larger size of the population the total time involved in a census is significantly higher than in a sample survey.
3. Large size of many populations: In some cases the size of the population is extremely large, and not all of its members are reachable, owing to travel, disease, death, mental abnormality, imprisonment, etc. In that situation the only way to conduct the research is to collect data through a sample survey.
4. Inaccessibility of the entire population: In some cases the entire population may not be accessible; in that case sampling is necessary. For example, the entire population of interest may be inaccessible after an aircraft crash.
5. Destructive nature of many populations: Due to the destructive nature of many populations, researchers are compelled to collect information on only a part of the population.
For example:
Blood test for a patient.
Life hours of a tube light.
6. Reliability: By using a scientific sampling technique one can minimize the sampling error, and as qualified investigators are included, the non-sampling error committed in the case of a sample survey is also at a minimum.
The amount of non-sampling error in the case of a census is much higher than the total amount of sampling and non-sampling error committed in the case of a sample survey (as less qualified investigators are involved in a census, and the supervision, monitoring and quality-control mechanisms are weaker there).
The degree of error has a relationship with reliability: if error decreases, reliability increases. Sampling decreases both the sampling and the non-sampling error, so it enhances the reliability of the information.
Sampling Errors:
There are two types of errors:
1. Sampling error: The gap between the sample mean and the population mean constitutes the sampling error. The gaps between various sample means are known as sampling fluctuation.
2. Non-sampling error: Any error other than sampling error that may affect a sample estimate is known as non-sampling error. Non-sampling errors are of two types:
i. Systematic: Errors arising out of the systematic tendency of respondents to conceal facts and to over-estimate or under-estimate values.
ii. Unsystematic: Errors arising in the process of collecting, recording, processing and analyzing the data due to carelessness or other mistakes are known as unsystematic errors.
Sampling errors in Sampling and Census:
1. In the case of a census there is no sampling error but there are non-sampling errors; in the case of sampling there are both sampling error and non-sampling error.
2. By using a scientific sampling technique one can minimize the sampling error, and as qualified investigators are included, the non-sampling error committed in a sample survey is also at a minimum. The amount of non-sampling error in a census is much higher than the total amount of sampling and non-sampling error committed in a sample survey, as less qualified investigators are involved in a census and its supervision, monitoring and quality-control mechanisms are weaker.
Sample Design
There are three components of sample design:
1. Sample Size
2. Sampling Frame
3. Sampling Technique
SAMPLE SIZE
In general, the larger the sample size (selected with the use of probability techniques) the better. The more
heterogeneous a population is on a variety of characteristics (e.g. race, age, sexual orientation, religion)
then a larger sample is needed to reflect that diversity.
Determination of sample size through the approach based on confidence level and precision rate:
If population size is unknown
𝑛 = (𝑧² × 𝑠²) / 𝑒²    or    𝑛 = (𝑧² × 𝑝 × 𝑞) / 𝑒²
where,
n = sample size
s = standard deviation
p = proportion of success for the indicator
q =1-p
z = standard normal variate at a given level of significance (z = 1.96 at 95% level of confidence)
e = Precision rate or amount of admissible error in the estimate
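These formulas can be applied directly; a minimal sketch in Python (rounding the result up to the next whole unit is a convention assumed here, and the inputs mirror Problems 1 and 2 below):

```python
import math

def sample_size_mean(z, s, e):
    """n = z^2 * s^2 / e^2 (estimating a mean, population size unknown)."""
    return math.ceil((z ** 2 * s ** 2) / e ** 2)

def sample_size_proportion(z, p, e):
    """n = z^2 * p * q / e^2 with q = 1 - p (estimating a proportion)."""
    return math.ceil((z ** 2 * p * (1 - p)) / e ** 2)

# 95% confidence (z = 1.96), s = 1.5 kg, admissible error e = 0.15 kg:
print(sample_size_mean(1.96, 1.5, 0.15))         # 385
# 95% confidence, p = 0.55, precision e = 0.02:
print(sample_size_proportion(1.96, 0.55, 0.02))  # 2377
```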
If population size is known:
𝑧2 × 𝑠2. 𝑁
𝑛= 2
𝑧 × 𝑠 2 + (𝑁 − 1)𝑒 2
𝑧 2 × 𝑝 × 𝑞. 𝑁
𝑛= 2
𝑧 × 𝑝 × 𝑞 + (𝑁 − 1)𝑒 2
𝑜𝑟
where,
n = sample size
p = proportion of success for the indicator
q=1-p
z = standard normal variate at a given level of significance. (z = 1.96 at 95% levels of confidence)
N= Population size
e = Precision rate or amount of admissible error in the estimate
The sample size will be adjusted further (if n > 5% of N) using the formula given below:
𝑛𝑎 = 𝑛 / (1 + 𝑛/𝑁)
Where, 𝑛𝑎 = adjusted sample size
Note:
* For a large population, the sampling fraction is typically 1% to 2%.
* For a medium population, 5% to 10%.
* For a small population, 20% to 50%.
* If there is a budget and time constraint, a sample size of 30 items from each group, irrespective of the total number of items in the different groups, may help objectively assess the characteristics of all the items belonging to the different groups.
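The known-population formula and the adjustment above can be sketched the same way; rounding up to the next whole unit and taking z ≈ 2.58 at the 99% confidence level are conventions assumed here (the first call mirrors Problem 3 below, the second is a made-up adjustment case):

```python
import math

def sample_size_known_N(z, p, e, N):
    """n = z^2*p*q*N / (z^2*p*q + (N - 1)*e^2), with q = 1 - p."""
    pq = p * (1 - p)
    return math.ceil((z ** 2 * pq * N) / (z ** 2 * pq + (N - 1) * e ** 2))

def adjusted_sample_size(n, N):
    """n_a = n / (1 + n/N), applied when n exceeds 5% of N."""
    return math.ceil(n / (1 + n / N))

# 99% confidence (z = 2.58), 6% defectives, 2% precision, lot of 25000:
print(sample_size_known_N(2.58, 0.06, 0.02, 25000))  # 905

# Adjustment example: n = 385 from a population of N = 2000 (n > 5% of N):
print(adjusted_sample_size(385, 2000))               # 323
```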
PROBLEMS:
1. Determine the sample size for estimating the true weight of containers based on the following
information:
i. Estimate must be made at 95% confidence level and within 0.15 kg of the true weight.
ii. The standard deviation of weight = 1.5 kg.
2. In an election for the post of President of the FBCCI there are 2 candidates: X and Y. You are
interested in estimating the proportion of voters favoring Candidate X at 95% confidence interval and with
an error not to exceed 2%. A previous poll showed 55% of the registered voters favoring candidate X.
What sample size should be used for the purpose?
3. Beximco Pharma wishes to estimate the true proportion of defectiveness of its products at 99%
confidence level and with 2% precision rate. Past record indicates that 6% of the products are defective.
What would be the required sample size for estimating the true proportion of defectiveness in a production
lot of 25000 units?
4. ACI wishes to estimate the true proportion of defectiveness of its products at 99% confidence level
and with 3% precision rate. Past record indicates that 13% of the products are defective. What would be the
required sample size for estimating the true proportion of defectiveness in a production lot of 8000 units?
5. The government wishes to be 95% certain of estimating the mean income of garments workers within
an error of Tk. 20. Past studies indicate that the standard deviation of income is Tk. 200. What size of
sample of garments workers should be used to obtain a reliable estimate of the mean income?
6. Monno Ceramic wishes to estimate the true proportion of defectiveness of its products at 99%
confidence level and with 3% precision rate. Past record indicates that 9% of the products are defective.
What would be the required sample size for estimating the true proportion of defectiveness in a production
lot of 10000 units?
Sampling Frame
A sampling frame is a list that includes every member of the population from which a sample is to be taken. Without some form of sampling frame, a random sample of a population, other than an extremely small population, is impossible.
In short, the list containing the particulars of all the items of a population is known as the sampling frame. It is prepared with a view to facilitating the researcher in selecting the required samples.
Sampling Methods or Techniques:
Based on the method of drawing samples, sampling techniques are broadly divided into two types:
Random Sampling:
1. Simple Random Sampling
2. Stratified Random Sampling
3. Cluster Sampling
4. Systematic Sampling
5. Multiphase Sampling
6. Multistage Sampling
Non-Random Sampling:
1. Persuasive/Judgmental Sampling
2. Convenience Sampling
3. Quota Sampling
Random sampling:
When all the items of the population have an equal chance to be included in the sample, the technique is
known as Random sampling. In random sampling, each item or element of the population has an equal
chance of being chosen at each draw.
A sample is random if the method for obtaining the sample meets the criterion of randomness (each
element having an equal chance at each draw). The actual composition of the sample itself does not
determine whether or not it was a random sample.
There are 6 types of Random Sampling, they are, Simple Random Sampling, Stratified Sampling,
Systematic Sampling, Cluster Sampling, Multiphase Sampling and Multistage Sampling.
Simple Random Sampling
Simple random sample (SRS) is a special case of a random sample. A sample is called simple random
sample if each unit of the population has an equal chance of being selected for the sample. Whenever a unit
is selected for the sample, the units of the population are equally likely to be selected. It is an equal
probability sampling method (which is abbreviated by EPSEM). EPSEM means "everyone in the sampling
frame has an equal chance of being in the final sample."
Difference between Random Sample and Simple Random Sample:
If each unit of the population has known (equal or un-equal) probability of selection in the sample, the
sample is called a random sample. If each unit of the population has equal probability of being selected for
the sample, the sample obtained is called simple random sample.
Selection of Simple Random Sample:
A simple random sample is usually selected without replacement. The following methods are used for the selection of a simple random sample:
Lottery Method: This is an old classical method but it is a powerful technique and modern
methods of selection are very close to this method. All the units of the population are numbered
from 1 𝑡𝑜 𝑁. This is called sampling frame. These numbers are written on the small slips of paper
or the small round metallic balls. The paper slips or the metallic balls should be of the same size
otherwise the selected sample will not be truly random. The slips or the balls are thoroughly mixed
and a slip or ball is picked up. Again the population of slips is mixed and the next unit is selected. In
this manner, the number of slips equal to the sample size ′𝑛′ is selected. The units of the population
which appear on the selected slips make the simple random sample. This method of selection is
commonly used when the size of the population is small. For a large population there would be a big heap of paper slips and it is difficult to mix them properly.
Using a Random Number Table: All the units of the population are numbered from 1 to 𝑁 or from
0 to 𝑁 − 1. We consult the random number table to take a simple random sample. Suppose the size
of the population is 80 and we have to select a random sample of 8 units. The units of the
population are numbered from 01 to 80. We read two-digit numbers from the table of random
numbers. We can start from any column or row of the table. Let us consult the random number table given in this content. Two-digit numbers are taken from the table. Any number above 80 will be ignored, and if any number is repeated we shall not record it, since sampling is done without replacement. Let us read the first two columns of the table. The random numbers from the table are 10, 37, 08, 12, 66, 31, 63 and 73. The two numbers 99 and 85 have not been recorded because the
population does not contain these numbers. The units of the population whose numbers have been
selected constitute the simple random sample. Let us suppose that the size of the population is 100.
If the units are numbered from 001 to 100, we shall have to read 3-digit random numbers. From the
first 3 columns of the random number table, the random numbers are 100, 375, 084, 990 and 128
and so on. We find that most of the numbers are above 100 and we are wasting our time while
reading the table. We can avoid it by numbering the units of the population from 00 to 99. In this
way, we shall read 2-digit numbers from the table. Thus if N is 100, 1000 or 10000, the numbering
is done from 00 to 99, 000 to 999 or 0000 to 9999.
Using the Computer: The facility of selecting a simple random sample is available on the
computers. The computer is used for selecting a sample of prize-bond winners, a sample of Hajj
applicants, and a sample of applicants for residential plots and for various other purposes.
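The computer-based selection above can be sketched with Python's standard library; the numbering of units from 1 to 80 and the sample size of 8 mirror the random-number-table example (the fixed seed is only to make the draw repeatable):

```python
import random

population = list(range(1, 81))   # units numbered 1 to 80 (the frame)

random.seed(42)                   # fixed seed for a repeatable draw
# Simple random sample without replacement: every unit equally likely.
sample = random.sample(population, 8)
print(sorted(sample))
```

As with the table method, no unit can appear twice because the draw is without replacement.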
Stratified sampling:
Stratified sampling is probably the most commonly used probability method. In a stratified sample the
sampling frame is divided into non-overlapping groups or strata, e.g. geographical areas, age-groups,
genders. A sample is taken from each stratum, and when this sample is a simple random sample it is
referred to as stratified random sampling.
Stratification is the process of grouping members of the population into relatively homogeneous
subgroups before sampling. The strata should be mutually exclusive: every element in the population must
be assigned to only one stratum. The strata should also be collectively exhaustive: no population element
can be excluded. Then random or systematic sampling is applied within each stratum. This often improves
the representativeness of the sample by reducing sampling error. It can produce a weighted mean that has
less variability than the arithmetic mean of a simple random sample of the population.
Stratified sampling techniques are generally used when the population is heterogeneous, or dissimilar,
where certain homogeneous, or similar, sub-populations can be isolated (strata).
Types of stratified random sampling:
There are actually two different types of stratified sampling.
Proportional stratified sampling: When the total sample size is distributed among the different strata according to the size of the population of each stratum, the sampling is known as proportional stratified random sampling. In proportional stratified sampling, the sample proportions are made the same as the population proportions on the stratification variable(s). Proportional stratified sampling is an equal probability sampling method.
The proportional weight of each stratum is maintained through the use of the following formula:
𝑛𝑖 = (𝑁𝑖 / 𝑁) × 𝑛
Here, 𝑛𝑖 = sample size of the ith stratum, 𝑁𝑖 = population size of the ith stratum, N = total population size, and n = total sample size.
Example: In Uttara sector 12 there live 10,000 poor, 5,000 middle-class and 20,000 rich people. We take 200 people from here as a sample to conduct a survey. Allocate the total sample size to the different strata following the technique of proportional stratified random sampling.
Solution:
Here, N = 10000 + 5000 + 20000 = 35000, n = 200.
Then, sample size:
For poor: 𝑛1 = (10000/35000) × 200 ≈ 57
For middle class: 𝑛2 = (5000/35000) × 200 ≈ 29
For rich: 𝑛3 = (20000/35000) × 200 ≈ 114
Example: There are 90,000 traders in the city of Dhaka. Of these, 50% are retailers, and the proportions of Arothdars, wholesalers and hawkers are 10%, 15% and 25% respectively. Determine the size of the sample to estimate the contribution of trade services to the national economy (GDP) at a 95% confidence level and 2% precision. Allocate the total sample size to the different strata following the technique of proportional stratified random sampling.
Solution:
Here, N = 90000, z = 1.96 (at 95% confidence), p = 0.50, q = 1 − 0.50 = 0.50, e = 0.02.
Then n = z²pqN / [z²pq + (N − 1)e²]
= [(1.96)² × 0.50 × 0.50 × 90000] / [(1.96)² × 0.50 × 0.50 + (90000 − 1)(0.02)²] ≈ 2339
∴ Sample size:
For retailers [90000 × 50% = 45000]: 𝑛1 = (45000/90000) × 2339 ≈ 1169
For Arothdars [90000 × 10% = 9000]: 𝑛2 = (9000/90000) × 2339 ≈ 234
For wholesalers [90000 × 15% = 13500]: 𝑛3 = (13500/90000) × 2339 ≈ 351
For hawkers [90000 × 25% = 22500]: 𝑛4 = (22500/90000) × 2339 ≈ 585
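The proportional-allocation rule 𝑛𝑖 = (𝑁𝑖/𝑁) × 𝑛 used in both examples can be sketched as follows; the stratum names and the use of nearest-integer rounding are illustrative assumptions:

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a total sample size n across strata in proportion
    to each stratum's population size: n_i = (N_i / N) * n."""
    N = sum(strata_sizes.values())
    return {name: round(N_i / N * n) for name, N_i in strata_sizes.items()}

# Uttara sector 12 example: strata of 10000, 5000 and 20000 people,
# total sample size n = 200.
strata = {"poor": 10000, "middle": 5000, "rich": 20000}
print(proportional_allocation(strata, 200))
# {'poor': 57, 'middle': 29, 'rich': 114}
```

Different rounding conventions (nearest integer versus truncation) can shift an allocation by one unit when a stratum's share falls exactly on .5.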
Disproportional stratified sampling: In disproportional stratified sampling, the sample proportions are
made to be different from the proportions on the stratification variable(s). In other words, the subsamples
are not proportional to their sizes in the population.
Here is an example showing the difference between proportional and disproportional stratified sampling:
For example if gender is your stratification variable and the population is composed of 75% females
and you want a sample of 100 people, then you would randomly select 75 females and 25 males. In
disproportional stratified sampling you might instead select 50 males and 50 females from this same
population. In the first case the percentages are proportional; in the second case they are not
proportional.
In both types, the sampling frame is first divided into subpopulations.
Difference between proportional and Disproportional stratified random sampling
The main difference between the two sampling techniques is the proportion given to each stratum with respect to the other strata. In proportional sampling, each stratum has the same sampling fraction, while in disproportional sampling the sampling fraction of each stratum varies.
Proportionate Versus Disproportionate Stratification
All stratified sampling designs fall into one of two categories, each of which has strengths and weaknesses
as described below.
Proportionate stratification. With proportionate stratification, the sample size of each stratum is
proportionate to the population size of the stratum. This means that each stratum has the
same sampling fraction.
Proportionate stratification provides equal or better precision than a simple random sample
of the same size.
Gains in precision are greatest when values within strata are homogeneous.
Gains in precision accrue to all survey measures.
Disproportionate stratification. With disproportionate stratification, the sampling fraction may
vary from one stratum to the next.
The precision of the design may be very good or very poor, depending on how sample points are allocated to strata.
If variances differ across strata, disproportionate stratification can provide better precision
than proportionate stratification, when sample points are correctly allocated to strata.
With disproportionate stratification, the researcher can maximize precision for a single
important survey measure. However, gains in precision may not accrue to other survey
measures.
Cluster Sampling:
A cluster is a group of individuals or objects having different characteristics. In cluster sampling, clusters are formed so that there is maximum heterogeneity within each cluster and homogeneity from cluster to cluster. Cluster sampling involves selecting the sample units in groups.
In short, Cluster sampling is a sampling technique used when "natural" groupings are evident in a statistical
population.
For example, a sample of telephone calls may be collected by first taking a collection of telephone lines and
collecting all the calls on the sampled lines. The analysis of cluster samples must take into account the
intra-cluster correlation which reflects the fact that units in the same cluster are likely to be more similar
than two units picked at random.
Systematic Sampling
Systematic sampling is a random sampling technique which is frequently chosen by researchers for its
simplicity and its periodic quality. Systematic sampling relies on arranging the target population according
to some ordering scheme and then selecting elements at regular intervals through that ordered list.
A common way of selecting members for a sample using systematic sampling is simply to divide the total number of units in the general population by the desired number of units for the sample. The result of the division serves as the marker for selecting sample units from within the general population.
For example, if anyone wanted to select a random group of 1,000 people from a population of 50,000 using
systematic sampling, he would simply select every 50th person, since 50,000/1,000 = 50.
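The every-kth selection above can be sketched as follows; drawing the starting point at random from within the first interval is a common refinement assumed here:

```python
import random

def systematic_sample(N, n):
    """Pick every k-th unit (k = N // n) from a random start in [0, k)."""
    k = N // n                   # sampling interval, e.g. 50000 // 1000 = 50
    start = random.randrange(k)  # random starting unit within first interval
    return list(range(start, N, k))[:n]

random.seed(7)
sample = systematic_sample(50_000, 1_000)
print(len(sample), sample[:3])
```

Once the random start is fixed, the whole sample is determined, which is what gives the method its simplicity and periodic quality.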
Multiphase Sampling
Multiphase sampling is a sampling method in which certain items of information are collected from all the units of a sample and certain other items of information are taken from a subsample.
For example, if the collection of information concerning variate, y, is relatively expensive, and there exists
some other variate, x, correlated with it, which is relatively cheap to investigate, it may be profitable to
carry out sampling in two phases. At the first phase, x is investigated, and the information thus obtained is
used either,
(a) to stratify the population at the second phase, when y is investigated, or
(b) as supplementary information at the second phase, a ratio or regression estimate being used.
Two-phase sampling is sometimes called “double sampling”.
Multistage sampling
Multistage sampling is a complex form of cluster sampling in which two or more levels of units are
embedded one in the other. Multistage sampling is a sampling method in which the population is divided
into a number of groups or primary stages from which samples are drawn; these are then divided into
groups or secondary stages from which samples are drawn, and so on.
For instance, Geographic areas (primary units), Factories (secondary units), and Employees (tertiary
units). At each stage, a sample of the corresponding units is selected. At first, a sample of primary units is
selected, then, in each of those selected, a sample of secondary units is selected, and so on. All ultimate
units (individuals, for instance) selected at the last step of this procedure are then surveyed.
Multistage sampling is frequently used when a complete list of all members of the population does not exist or is inappropriate. Moreover, by avoiding the use of all sample units in all selected clusters, multistage sampling avoids the large, and perhaps unnecessary, costs associated with traditional cluster sampling.
Non-Random Sampling:
Under non-random sampling, samples are chosen on the basis of the experience, judgment, liking and disliking of the researcher. Here all the items do not have an equal chance of being selected. There are three types of non-random sampling. They are discussed below.
Persuasive/ Judgmental Sampling:
Judgment sampling is a common nonprobability method. The researcher selects the sample based on
judgment. This is usually an extension of convenience sampling. For example, a researcher may decide to
draw the entire sample from one "representative" city, even though the population includes all cities. When
using this method, the researcher must be confident that the chosen sample is truly representative of the
entire population. In judgment sampling, the researcher or some other "expert" uses his/her judgment in
selecting the units from the population for study based on the population’s parameters.
In short, Judgment sampling involves the choice of subjects who are most advantageously placed or in the
best position to provide the information required.
Convenience sampling:
When data are collected from a group or chunk of respondents at the convenience of the researcher, this kind of sampling is known as convenience sampling. This type of sampling is conducted only to assess the attitudinal dimensions of the respondents.
Convenience sampling is a non-probability method. This means that subjects are chosen in a nonrandom
manner, and some members of the population have no chance of being included.
Convenience sampling is also known as Opportunity Sampling, Accidental Sampling or Haphazard
Sampling.
Example: A group of students in a high school do a study about teacher attitudes. They interview teachers at their own school, a couple of teachers in their families, and a few others who are known to their parents.
Quota Sampling:
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means the researcher can specify in advance who is to be sampled (targeting).
Quota sampling is useful when time is limited, a sampling frame is not available, the research budget is very tight, or when detailed accuracy is not important. You can also choose how many of each category are selected.
There are similarities with stratified sampling, but in quota sampling the selection of the sample is non-random.
Example: Interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.
Random versus Non-random Samples
In statistics, a sample is a subset of a population. Usually, the population is very large, making a complete
enumeration of all the values in the population impractical or impossible. The sample represents a subset of
manageable size; the sample size is the number of units in the sample. Samples are collected and statistics
are calculated from the samples so that one can make inferences or extrapolations from the sample to the
population. This process of collecting information from a sample is referred to as sampling.
Samples are selected in such a way as to avoid presenting a biased view of the population. The sample will
be unrepresentative of the population if certain members of the population are excluded from any possible
sample. For example, if a researcher is interested in the drug-usage patterns among teenagers, but collects
the sample from schools, the sample is biased because it excludes teenagers not in school for a variety of
reasons, such as lack of funds to attend or schooled at home. Biases may also occur if some members of the
population are more likely or less likely to be included in the sample than other members of the population
for a reason other than the sample design. So the sample collected from schools is also biased because
students who miss a lot of school days because of a chronic illness will be less likely to be selected than
students who attend regularly.
The best way to avoid a biased or unrepresentative sample, and thus to obtain a representative sample of
the population, is to select a random sample, also known as a probability sample. A random sample is
defined as a sample in which every individual member of the population has a non-zero probability of
being selected as part of the sample. In a simple random sample, every individual member of the population
has the same probability of being selected as every other individual member. Other types of random
samples fall under the category of complex sample design.
A sample that is not random is called a non-random sample or a non-probability sample. Some examples of
non-random samples are convenience samples, judgment samples, purposive samples, quota samples, and
snowball samples.
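In a simple random sample, as defined above, every member of the population has the same inclusion probability. Python's standard library draws one directly (the frame below is hypothetical):

```python
import random

random.seed(1)  # reproducible illustration

# Hypothetical sampling frame: every population member is listed and numbered.
population = [f"person{i}" for i in range(1, 501)]

# Simple random sample of n = 25, drawn without replacement: each member has
# the same inclusion probability, 25/500 = 0.05.
sample = random.sample(population, 25)
```

By contrast, a convenience sample would be something like `population[:25]`, which gives members later in the frame no chance of selection.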
Shortcuts
Here are some important terms used in sampling:
A sample is a set of elements taken from a larger population.
The sample is a subset of the population, which is the full set of elements, people, or whatever you are sampling.
A statistic is a numerical characteristic of a sample, but a parameter is a numerical characteristic of a population.
Sampling error refers to the difference between the value of a sample statistic, such as the sample mean, and the true value of the population parameter, such as the population mean. Note: some error is always present in sampling. With random sampling methods, the error is random rather than systematic.
The response rate is the percentage of people in the sample selected for
the study who actually participate in the study.
A sampling frame is just a list of all the people that are in the population. Here is an example of a
sampling frame (a list of all the names in my population, and they are numbered). Note that the
following sampling frame also has information on age and gender included in case you want to draw
some samples and do some calculations.
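Sampling error, as defined above, can be made concrete with a small simulation; the population parameters below are assumed for illustration:

```python
import random
import statistics

random.seed(7)  # reproducible illustration

# Hypothetical population with a knowable mean, so sampling error is visible.
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)        # the population parameter

sample = random.sample(population, 30)  # a random sample of n = 30
x_bar = statistics.mean(sample)         # the sample statistic

# Sampling error: statistic minus parameter. It varies from sample to sample,
# and with random sampling it is random rather than systematic.
sampling_error = x_bar - mu
```

Redrawing the sample gives a different `x_bar` and hence a different error, which is exactly the point of the definition.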
Test of Hypothesis
A hypothesis is a theory that is testable. A hypothesis test is a statistical method that uses sample data to
evaluate a hypothesis about a population or populations. Every hypothesis test requires the analyst to state a
null hypothesis and an alternative hypothesis. The hypotheses are stated in such a way that they are
mutually exclusive. That is, if one is true, the other must be false; and vice versa.
Null Hypothesis (𝑯𝒐 )
The null hypothesis represents a theory that has been put forward, either because it is believed to be true or
because it is to be used as a basis for argument, but has not been proved.
The null hypothesis (H0) states that in the general population there is no change, no difference, or no
relationship … (=, ≤, ≥). In the context of an experiment, the null hypothesis predicts that the experiment
has no effect on the dependent variable for the population.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on
average, than the current drug. We would write
𝑯𝒐 : There is no difference between the two drugs on average.
We give special consideration to the null hypothesis. This is due to the fact that the null hypothesis relates
to the statement being tested, whereas the alternative hypothesis relates to the statement to be accepted if /
when the null is rejected.
Alternative Hypothesis (𝑯𝟏 )
The alternative hypothesis, 𝐻1 , is a statement of what a statistical hypothesis test is set up to establish. The
alternative hypothesis (H1) states that there is a change, a difference, or a relationship for the general
population … (≠, <, >). In the context of an experiment, the alternative hypothesis predicts that the
independent variable does have an effect on the dependent variable. For example, in a clinical trial of a new
drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to
that of the current drug. We would write
𝐻1 : the two drugs have different effects, on average.
The alternative hypothesis might also be that the new drug is better, on average, than the current drug. In
this case we would write
𝐻1 : the new drug is better than the current drug, on average.
The final conclusion once the test has been carried out is always given in terms of the null hypothesis. We
either "Reject 𝐻𝑜 in favor of 𝐻1 " or "Do not reject 𝐻𝑜 "; we never conclude "Reject 𝐻1 ", or even "Accept
𝐻1 ".
If we conclude "Do not reject 𝐻𝑜 ", this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against 𝐻𝑜 in favor of H1. Rejecting the null hypothesis, then, suggests that the alternative hypothesis may be true.
Test Statistic
A test statistic is a quantity calculated from our sample of data. Its value is used to decide whether or not
the null hypothesis should be rejected in our hypothesis test.
The choice of a test statistic will depend on the assumed probability model and the hypotheses under
question.
Critical Value(s)
The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is
compared to determine whether or not the null hypothesis is rejected.
The critical value for any hypothesis test depends on the significance level at which the test is carried out,
and whether the test is one-sided or two-sided.
See also critical region.
Critical Region
The critical region CR, or rejection region RR, is a set of values of the test statistic for which the null
hypothesis is rejected in a hypothesis test. That is, the sample space for the test statistic is partitioned into
two regions; one region (the critical region) will lead us to reject the null hypothesis H0, the other will not.
So, if the observed value of the test statistic is a member of the critical region, we conclude "Reject H0"; if
it is not a member of the critical region then we conclude "Do not reject H0".
Power
The power of a statistical hypothesis test measures the test's ability to reject the null hypothesis when it is
actually false - that is, to make a correct decision.
In other words, the power of a hypothesis test is the probability of not committing a type II error. It is
calculated by subtracting the probability of a type II error from 1, usually expressed as:
Power = 1 - P(type II error) = 1 − 𝛽
The maximum power a test can have is 1, the minimum is 0. Ideally we want a test to have high power,
close to 1.
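For a one-sided z-test the power can be computed in closed form; the means, σ, n, and α below are assumed purely for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Assumed scenario: H0: mu = 100 vs H1: mu > 100 (one-sided z-test),
# sigma = 15, n = 36, alpha = 0.05, true mean 106 under the alternative.
mu0, mu_true, sigma, n, alpha = 100, 106, 15, 36, 0.05

z_alpha = NormalDist().inv_cdf(1 - alpha)    # one-sided critical value, ~1.645
shift = (mu_true - mu0) / (sigma / sqrt(n))  # true shift in standard errors

beta = NormalDist().cdf(z_alpha - shift)     # P(type II error)
power = 1 - beta                             # P(reject H0 | H0 false)
```

With these numbers the power comes out near 0.77; increasing n or the true shift raises it toward 1.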
Choosing a Test Statistic
The test statistic is the random variable whose value is examined to arrive at a decision. The Central Limit Theorem states that for large sample sizes (n > 30) drawn randomly from a population, the distribution of the sample means will approximate normality, even when the data in the parent population are not normally distributed. A z statistic is usually used for large sample sizes (n > 30), but large samples are often not easy to obtain, in which case the t-distribution can be used. The population standard deviation is then estimated by the sample standard deviation, s. The t curves are bell-shaped and centered around t = 0; the exact shape of a given t-curve depends on the degrees of freedom. When performing multiple comparisons by one-way ANOVA, the F-statistic is normally used. It is defined as the ratio of the mean square due to the variability between groups to the mean square due to the variability within groups. The critical value of F is read from tables of the F-distribution, knowing the Type I error rate and the degrees of freedom between and within the groups.
Confidence and Precision
The confidence level of a confidence interval is an assessment of how confident we are that the true
population mean is within the interval.
The precision of the interval is given by its width (the difference between the upper and lower endpoint).
Wide intervals do not provide us with very precise information about the location of the true population
mean. Short intervals provide us with very precise information about the location of the population mean.
If the sample size n remains the same:
Increasing the confidence level of an interval decreases precision
Decreasing the confidence level of an interval increases its precision
Generally confidence levels are chosen to be between about 90% and 99%. These confidence levels usually
provide reasonable precision and confidence.
Decision Errors in Test of Hypothesis
In statistical hypothesis testing, there are two types of errors that can be made or incorrect conclusions that
can be drawn. If a null hypothesis is incorrectly rejected when it is in fact true, this is called a Type I error
(also known as a false positive). A Type II error (also known as a false negative) occurs when a null
hypothesis is not rejected despite being false. The Greek letter 𝛼 is used to denote the probability of type I
error, and the letter 𝛽 is used to denote the probability of type II error.
Type I error
Type I error, also known as an "error of the first kind", an 𝛼 error, or a "false positive": the error of
rejecting a null hypothesis when it is actually true. Plainly speaking, it occurs when we are observing a
difference when in truth there is none, thus indicating a test of poor specificity. An example of this would
be if a test shows that a woman is pregnant when in reality she is not, or telling a patient he is sick when in
fact he is not. Type I error can be viewed as the error of excessive credulity. In other words, a Type I error
indicates "A Positive Assumption is False"
Types of error

                               Accept (Fail to Reject) H0        Reject H0                         Sum
H0 true                        Correct decision (1 - α)          Wrong decision: Type I error (α)  1.00
(Null hypothesis is true)      = Confidence level                = False positive
H0 false                       Wrong decision: Type II error (β) Correct decision (1 - β)          1.00
(Alternative hypothesis true)  = False negative                  = Power of the test
A type I error is often considered to be more serious, and therefore more important to avoid, than a type II
error. The hypothesis test procedure is therefore adjusted so that there is a guaranteed 'low' probability of
rejecting the null hypothesis wrongly; this probability is never 0. This probability of a type I error can be
precisely computed as,
P(type I error) = significance level = 𝛼.
Type II error
Type II error, also known as an "error of the second kind", a 𝛽 error, or a "false negative": the error of
failing to reject a null hypothesis when in fact we should have rejected it. In other words, this is the error of
failing to observe a difference when in truth there is one, thus indicating a test of poor sensitivity. An
example of this would be if a test shows that a woman is not pregnant, when in reality, she is. Type II error
can be viewed as the error of excessive skepticism. In other words, a Type II error indicates "A Negative
assumption is False".
The probability of a type II error is generally unknown, but is symbolized by 𝛽 and written,
P(type II error) = 𝛽
False positive rate
The false positive rate is the proportion of absent events that yield positive test outcomes, i.e., the
conditional probability of a positive test result given an absent event.
The false positive rate is equal to the significance level. The specificity of the test is equal to 1
minus the false positive rate.
In statistical hypothesis testing, this fraction is given the Greek letter 𝛼, and 1 − 𝛼 is defined as the specificity of the test. Increasing the specificity of the test lowers the probability of type I errors, but raises the probability of type II errors (false negatives: failing to reject the null hypothesis when the alternative is true).
False negative rate
The false negative rate is the proportion of present events that yield negative test outcomes, i.e., the
conditional probability of a negative test result given present event.
In statistical hypothesis testing, this fraction is given the letter 𝛽. The power (or the sensitivity) of
the test is equal to 1 minus 𝛽.
Bayes' theorem
The probability that an observed positive result is a false positive (as contrasted with an observed
positive result being a true positive) may be calculated using Bayes' theorem.
The key concept of Bayes' theorem is that the true rates of false positives and false negatives are
not a function of the accuracy of the test alone, but also the actual rate or frequency of occurrence
within the test population; and, often, the more powerful issue is the actual rates of the condition
within the sample being tested.
In summary:
Rejecting a null-hypothesis when it should not have been rejected creates a type I error.
Failing to reject a null-hypothesis when it should have been rejected creates a type II error.
(In either case, a wrong decision or error in judgment has occurred.)
Decision rules (or tests of hypotheses), in order to be good, must be designed to minimize errors of
decision.
Minimizing errors of decision is not a simple issue; for any given sample size, the effort to reduce one type of error generally results in an increase in the other type of error.
Based on the real-life application of the error, one type may be more serious than the other.
(In such cases, a compromise should be reached in favor of limiting the more serious type of error.)
The only way to minimize both types of error is to increase the sample size, and this may or may
not be feasible.
Significance Level
The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the
null hypothesis H0, if it is in fact true.
It is the probability of a type I error and is set by the investigator in relation to the consequences of
such an error. That is, we want to make the significance level as small as possible in order to
protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently
making false claims.
The significance level is usually denoted by 𝛼.
Significance Level = P(type I error) = 𝛼
Usually, the significance level is chosen to be 0.05 (or equivalently, 5%).
One-Tailed and Two-Tailed Tests
A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling
distribution, is called a one-tailed test. For example, suppose the null hypothesis states that the
mean is less than or equal to 10. The alternative hypothesis would be that the mean is greater than
10. The region of rejection would consist of a range of numbers located on the right side of
sampling distribution; that is, a set of numbers greater than 10.
A test of a statistical hypothesis, where the region of rejection is on both sides of the sampling
distribution, is called a two-tailed test. For example, suppose the null hypothesis states that the
mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or greater
than 10. The region of rejection would consist of a range of numbers located on both sides of
sampling distribution; that is, the region of rejection would consist partly of numbers that were less
than 10 and partly of numbers that were greater than 10.
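The difference between the two rejection regions shows up in the critical values: a one-tailed test puts all of α in one tail, while a two-tailed test splits it between the tails. A quick sketch with Python's standard library:

```python
from statistics import NormalDist

alpha = 0.05
nd = NormalDist()

# One-tailed (right-tail) test: all of alpha sits in the upper tail.
z_one = nd.inv_cdf(1 - alpha)      # ~1.645; reject if z > z_one

# Two-tailed test: alpha is split between the two tails.
z_two = nd.inv_cdf(1 - alpha / 2)  # ~1.960; reject if |z| > z_two
```

A statistic of z = 1.8 would therefore be rejected by the one-tailed test but not by the two-tailed test at the same α.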
Steps in Hypothesis Testing
All hypothesis tests are conducted the same way. The researcher states a hypothesis to be tested, formulates an analysis plan, analyzes sample data according to the plan, and accepts or rejects the null hypothesis based on the results of the analysis.
1. Stating the Management Question: The first step is to state the management problem in terms of a question that identifies the population(s) of interest to the researcher, the parameter(s) of the variable under investigation, and the hypothesized value of the parameter(s). This step makes the researcher define not only what is to be tested but also what variable will be used in sample data collection.
2. State the Hypotheses: Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis. The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be false, and vice versa.
3. Level of Significance: Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10, but any value between 0 and 1 can be used.
4. Choosing the Test Method: Typically, the test method involves a test statistic and a sampling distribution. Computed from sample data, the test statistic might be a mean, a proportion, a difference between means, a difference between proportions, a z-test, t-test, chi-square test, etc. Given a test statistic and its sampling distribution, a researcher can assess probabilities associated with the test statistic. If the test statistic probability is less than the significance level, the null hypothesis is rejected.
5. Calculate the Test Statistic: The fifth step is to calculate a statistic analogous to the parameter specified by the null hypothesis. If the null hypothesis is defined by the parameter µ, then the statistics computed on our data set would be the mean (x̄) and the standard deviation (s). Mathematically,

   Test Statistic = (Sample Statistic − Hypothesized Population Parameter) / (Standard Error of the Sample Statistic)

   Standard error between x̄ and µ: σx̄ = SD/√n = σ/√n

6. Comparison between the Calculated Value and the Theoretical (Table) Value: Compare the observed value of the statistic to the critical value.
7. Make a Decision: If the test statistic falls in the critical region, reject H0 in favor of H1. If the test statistic does not fall in the critical region, conclude that there is not enough evidence to reject H0.
8. Provide an Answer to the Management Question: The final step is to describe the results and state correct statistical conclusions in a way that is understandable to the management. The conclusion consists of two statements: one describing the result for the null hypothesis and the other describing the result for the alternative hypothesis. The first statement should state whether we accepted or rejected the null hypothesis and for what value of alpha or p-value for our test statistic. The second statement should answer the research question proposed in step 1, stating the sample statistic collected which estimated the parameter we hypothesized.
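The steps above can be walked through end to end; every number in this sketch (the hypothesized mean, the sample summary, and α) is assumed for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Steps 1-3: hypothetical question "Is the mean different from 0.1?",
# so H0: mu = 0.1 vs H1: mu != 0.1 (two-tailed), at alpha = 0.05.
mu0, alpha = 0.1, 0.05

# Step 4: a z-test applies (assumed n > 30 with sigma treated as known).
n, x_bar, sigma = 49, 0.16, 0.21

# Step 5: test statistic = (sample statistic - hypothesized parameter) / SE.
se = sigma / sqrt(n)       # standard error of the sample mean, 0.03
z = (x_bar - mu0) / se     # z = 0.06 / 0.03 = 2.0

# Steps 6-7: compare with the critical value and decide.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
decision = "Reject H0" if abs(z) > z_crit else "Do not reject H0"

# Step 8: report the decision and the sample estimate (x_bar) to management.
```

Here 2.0 > 1.96, so the sketch lands in the critical region and H0 is rejected at the 5% level.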
Terminology of Hypothesis Test
All hypothesis tests share the same basic terminology and structure.
A null hypothesis is an assertion about a population that you would like to test. It is "null" in the
sense that it often represents a status quo belief, such as the absence of a characteristic or the lack of
an effect. It may be formalized by asserting that a population parameter, or a combination of
population parameters, has a certain value. In the example given in the Introduction, the null
hypothesis would be that the average price of gas across the state was $1.15. This is written
H0: µ = 1.15.
An alternative hypothesis is a contrasting assertion about the population that can be tested against
the null hypothesis. In the example given in the Introduction, possible alternative hypotheses are:
H1: µ ≠ 1.15 — State average was different from $1.15 (two-tailed test)
H1: µ > 1.15 — State average was greater than $1.15 (right-tail test)
H1: µ < 1.15 — State average was less than $1.15 (left-tail test)
To conduct a hypothesis test, a random sample from the population is collected and a relevant test
statistic is computed to summarize the sample. This statistic varies with the type of test, but its
distribution under the null hypothesis must be known (or assumed).
The p value of a test is the probability, under the null hypothesis, of obtaining a value of the test
statistic as extreme or more extreme than the value computed from the sample.
The significance level of a test is a threshold of probability α agreed to before the test is conducted.
A typical value of α is 0.05. If the p value of a test is less than α, the test rejects the null hypothesis.
If the p value is greater than α, there is insufficient evidence to reject the null hypothesis. Note that
lack of evidence for rejecting the null hypothesis is not evidence for accepting the null hypothesis.
Also note that substantive "significance" of an alternative cannot be inferred from the statistical
significance of a test.
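The p value rule described above, for a two-tailed z-test with an assumed observed statistic, looks like this:

```python
from statistics import NormalDist

# Assumed observed test statistic for a two-tailed z-test.
z_obs = 2.0

# p value: probability, under H0, of a statistic at least as extreme as the
# one observed; both tails count as "extreme" in a two-tailed test.
p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))   # ~0.0455

alpha = 0.05
reject = p_value < alpha   # 0.0455 < 0.05, so the test rejects H0
```

Note the asymmetry stressed in the text: `reject` being `False` would not be evidence *for* H0, only a lack of evidence against it.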
The significance level α can be interpreted as the probability of rejecting the null hypothesis when it
is actually true—a type I error. The distribution of the test statistic under the null hypothesis
determines the probability α of a type I error. Even if the null hypothesis is not rejected, it may still
be false—a type II error. The distribution of the test statistic under the alternative hypothesis
determines the probability β of a type II error. Type II errors are often due to small sample sizes.
The power of a test, 1 – β, is the probability of correctly rejecting a false null hypothesis.
Results of hypothesis tests are often communicated with a confidence interval. A confidence
interval is an estimated range of values with a specified probability of containing the true population
value of a parameter. Upper and lower bounds for confidence intervals are computed from the
sample estimate of the parameter and the known (or assumed) sampling distribution of the
estimator. A typical assumption is that estimates will be normally distributed with repeated
sampling (as dictated by the Central Limit Theorem). Wider confidence intervals correspond to
poor estimates (smaller samples); narrow intervals correspond to better estimates (larger samples).
If the null hypothesis asserts the value of a population parameter, the test rejects the null hypothesis
when the hypothesized value lies outside the computed confidence interval for the parameter.
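Using the gas-price null hypothesis H0: µ = 1.15 from the example above, with assumed sample numbers, the confidence-interval version of the test can be sketched as:

```python
from math import sqrt
from statistics import NormalDist

# Assumed sample summary for the gas-price example (H0: mu = 1.15).
x_bar, sigma, n = 1.22, 0.25, 100

z = NormalDist().inv_cdf(0.975)                 # ~1.96 for 95% confidence
half_width = z * sigma / sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)   # ~ (1.171, 1.269)

# The test rejects H0 exactly when the hypothesized value lies outside the CI.
mu0 = 1.15
reject = not (ci[0] <= mu0 <= ci[1])            # 1.15 < 1.171, so rejected
```

A larger n would shrink `half_width`, giving the narrower, more precise interval described in the text.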
Chi-square test
A chi-square test (also chi squared test or 𝜒 2 test) is any statistical hypothesis test in which the
sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is
true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null
hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by
making the sample size large enough. The chi-square test is also known as Pearson's chi-square test.
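Pearson's chi-square statistic can be computed by hand; the observed counts below are hypothetical, and 5.991 is the standard table value for df = 2 at α = 0.05:

```python
# Sketch: chi-square goodness-of-fit for hypothetical observed counts against
# equal expected counts (H0: the three categories are equally likely).
observed = [48, 35, 37]
total = sum(observed)
expected = [total / len(observed)] * len(observed)   # 40 per category

# Pearson's statistic: sum of (O - E)^2 / E over the categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value from a chi-square table: df = 3 - 1 = 2, alpha = 0.05.
reject = chi2 > 5.991
```

Here the statistic is 2.45, below the critical value, so the equal-proportions hypothesis is not rejected.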
Definitions of other symbols:
α = probability of a Type I error (rejecting a null hypothesis when it is in fact true)
n = sample size; n1 = sample 1 size; n2 = sample 2 size
x̄ = sample mean
µ0 = hypothesized population mean; µ1 = population 1 mean; µ2 = population 2 mean
σ = population standard deviation; σ² = population variance
s = sample standard deviation; s² = sample variance
s1 = sample 1 standard deviation; s2 = sample 2 standard deviation
t = t statistic; df = degrees of freedom
d̄ = sample mean of differences; d0 = hypothesized population mean difference
sd = standard deviation of differences
p̂ = x/n = sample proportion, unless specified otherwise
p0 = hypothesized population proportion; p1 = proportion 1; p2 = proportion 2
dp = hypothesized difference in proportion
min{n1, n2} = minimum of n1 and n2
x1 = n1·p1; x2 = n2·p2
χ² = chi-squared statistic; F = F statistic

Name: One-sample z-test
Formula: z = (x̄ − µ0) / (σ/√n)
Assumptions or notes: (Normal population or n > 30) and σ known. (z is the distance from the mean in relation to the standard deviation of the mean.) For non-normal distributions it is possible to calculate a minimum proportion of a population that falls within k standard deviations for any k (see: Chebyshev's inequality).

Name: Two-sample z-test
Formula: z = (x̄1 − x̄2 − d0) / √(σ1²/n1 + σ2²/n2)
Assumptions or notes: Normal population, independent observations, and σ1 and σ2 are known.

Name: Two-sample pooled t-test, equal variances
Formula: t = (x̄1 − x̄2 − d0) / (Sp·√(1/n1 + 1/n2)),
  where Sp = √[((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)] and df = n1 + n2 − 2.
Assumptions or notes: (Normal populations or n1 + n2 > 40), independent observations, σ1 = σ2, and σ1 and σ2 unknown.
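The pooled two-sample t formula above translates directly into code; the two samples below are made-up numbers used only to exercise the formula:

```python
from math import sqrt

def pooled_t(x1, x2, d0=0.0):
    """Two-sample pooled t statistic (equal variances) and its df."""
    n1, n2 = len(x1), len(x2)
    m1 = sum(x1) / n1
    m2 = sum(x2) / n2
    # Unbiased sample variances S1^2 and S2^2.
    s1sq = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    s2sq = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    # Pooled standard deviation Sp, as in the formula above.
    sp = sqrt(((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2))
    t = (m1 - m2 - d0) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical data: two independent samples of size 8 each.
t, df = pooled_t([5, 7, 5, 3, 5, 3, 3, 9], [8, 1, 4, 6, 6, 4, 1, 2])
```

The resulting t would then be compared against the t-table critical value for `df` degrees of freedom at the chosen α.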
Website: http://jagannath.academia.edu/jony007ex ; Email: jony007ex@gmail.com ; Phone:+88 01198150195