Sample Module

Introduction to Statistics

Introduction

Statistical analysis plays an important role in both business and government. Analysis is the process of organizing, rearranging, sorting, and summarizing raw data in a manner that yields insights we call information. Information improves decision quality. The more information a business manager can obtain, the better the decisions that manager can make. Statistical analysis is the processing of raw data with a focus on measures of central tendency and measures of dispersion to yield probabilities of the occurrences of future events. For example, data on past sales is analyzed and the probability of some future level of sales is computed based on the statistical analysis.

Most people without a statistics background feel that statistical analysis is all about crunching numbers: making histograms and computing means and variances. While number crunching is a part of statistics, the interpretation of those numbers is the far more important, and difficult, part of statistical analysis. We do not stress the computations. They are mechanical, and there are many spreadsheet and statistics programs to do the number crunching for you. While we show the computation of statistical values, we stress the appropriate use of a particular concept and the correct interpretation of the resulting number.

Samples

As we noted above, the more information a manager has, the higher the quality of the decisions that can be made. It stands to reason that the highest quality decisions can be made when the manager has access to COMPLETE information. A manager can achieve complete information if the entire population for which the manager is responsible is polled or surveyed. If a manager is responsible for 10 employees, it is very feasible to survey all 10 employees and achieve complete information. If a manager is responsible for a target market consisting of 10,000 potential customers, then a survey of all 10,000 potential customers is both practically infeasible and prohibitively expensive. In such a situation, it is more feasible and much cheaper to draw a sample from the population and make inferences about the population based on the results of statistical analysis of the sample.

Definitions

A population is the entirety of conceivable members of some defined group, body, entity, area, or jurisdiction which is the object of statistical examination. A population whose members can be counted is said to be finite. The students in a school would be a finite population. A population whose members cannot be counted or whose members can assume values which are infinitely divisible is said to be infinite. The weights of the students in a school would be an infinite population since the weight of something can assume an infinite number of values.

A census is a survey of the entire population. Conducting a census is a difficult, time-consuming, and expensive endeavor, extremely so when the population is large. A sample is a small number (relative to the population) of members of the population drawn to represent that population. The interpretation of the terms population and sample is influenced by context. The students in one school can be a population in one context, but can also be a sample in the context of all students in a school district.

A sample must be representative of the population from which it is drawn. This is achieved when every member of the population has the same chance of being selected for the sample. A sample drawn in a manner that meets this requirement is called a random sample. Analyses performed on random samples yield unbiased results, meaning that the results can be accurately applied to the population.

A parameter is a characteristic of a population. A statistic is a characteristic of a sample. Sample statistics are used as estimates of their corresponding population parameters. The science of statistics is about the drawing of a random sample from a population in order to use the statistics of that sample to make inferences (informed claims called hypotheses) regarding the parameters of that population.

The Simple Random Sample

There are entire books written on JUST sampling. We will only scratch the surface of that topic in this module. As noted above, a random sample is one for which each and every member of the population has the same chance of being selected for the sample. An example is the choosing of a name or number from a bin containing all names or numbers in the population. A raffle will have participants' ticket stubs in a bin and the winner is the holder of the ticket whose stub is drawn. A lottery often has lightweight numbered balls held in a bin with the winning number determined by the numbers on the balls drawn from the bin. The game BINGO also has chips or balls with a letter/number combination on each one. The game progresses as chips or balls are drawn from the bin. In such drawings, you may have seen the bin being rotated or agitated so that the stubs, chips, or balls are shuffled inside the bin. The shuffling is to ensure that the population members are well mixed, affording each member the same probability of being drawn. This shuffling has the technical term of randomization, a process resulting in each member of a population having the same probability of being drawn for a sample and thus leading to a random sample.

Randomization is an extremely important and extremely difficult topic, hence the large number of books, courses, and statistics professionals who specialize in that topic. We limit our discussion of the topic to recognizing when a population is sufficiently randomized or not, and when a sample possibly may not be random. The most common reason for a non-random sample is that the target population to be analyzed is not the same as the population from which the sample was actually drawn.

An example of this is polling by telephone. Say a politician wants to survey voters' attitudes on an issue. The target population is all possible voters. If a survey firm uses a land-line telephone registry to contact potential voters, what is the actual population from which the sample is being drawn? The actual population is the subscribers to land-line telephone service. If almost all voters also subscribe to land-line telephones, then there may not be a serious problem because the two populations have an extremely large area of overlap. This was the case in the United States after land-line telephones became inexpensive in the 1950s and before cell phones became very popular in the 1990s. Currently, there is not a large area of overlap because land-lines have lost popularity and wireless communication is extremely popular and becoming increasingly inexpensive. Today, a land-line telephone survey would exclude a large number of voters, giving them a ZERO probability of being selected for the sample. The consequent sample would not be random and the results from analyzing it would be biased.

Example 1.

A population consists of 100 lightweight balls numbered from 00 through 99. Each number has 2 digits because a scanner reads the number on a ball as it is sampled and requires input for 2 digits. The lightweight balls are shuffled by a fan. A sample of 10 balls is drawn, yielding the numbers 08, 36, 78, 29, 03, 95, 51, 40, 66, and 17. Does the mean of this sample provide an unbiased estimate of the mean of the population?

The sample is random because each member of the population, each ball, has the same chance of being drawn for the sample. Its mean of 423/10 = 42.3 is an unbiased estimate of the population mean of 4,950/100 = 49.5.
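The arithmetic in Example 1 can be checked with a short Python sketch. The simulation at the end is an illustration added here, not part of the original example: it draws many simple random samples of 10 balls (the seed and the number of repetitions are arbitrary assumptions) to show that "unbiased" describes the long-run average of the sample means, not any single sample.

```python
import random
import statistics

# Population from Example 1: balls numbered 00 through 99.
population = list(range(100))

# The sample drawn in Example 1.
sample = [8, 36, 78, 29, 3, 95, 51, 40, 66, 17]

population_mean = statistics.fmean(population)  # 4,950 / 100
sample_mean = statistics.fmean(sample)          # 423 / 10

# Illustration only: repeat the drawing many times with a simple random
# sample of 10 balls each time. The seed and repetition count are arbitrary.
rng = random.Random(0)
sample_means = [
    statistics.fmean(rng.sample(population, 10)) for _ in range(20000)
]
average_of_sample_means = statistics.fmean(sample_means)

print(population_mean)  # 49.5
print(sample_mean)      # 42.3
print(average_of_sample_means)  # close to the population mean
```

A single sample mean such as 42.3 can miss the population mean of 49.5 by a wide margin; the estimate is unbiased because the sample means average out to the population mean over many random drawings.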

Example 2.

A population consists of 100 lightweight balls numbered from 0 through 99. Each number is supposed to have 2 digits because a scanner reads the number on a ball as it is sampled and requires input for 2 digits. However, the balls numbered 0 through 9 have only 1 digit. The lightweight balls are shuffled by a fan. A sample of 10 balls is drawn, yielding the numbers 8, 36, 78, 29, 3, 95, 51, 40, 66, and 17. The balls numbered 3 and 8 are rejected by the scanner due to having only 1 digit, and two other balls are drawn in their stead; their numbers are 12 and 84. Does the mean of this sample provide an unbiased estimate of the mean of the population?

The sample is not random because the members of the population, the balls, do not all have the same chance of being drawn for the sample. The balls numbered 0 through 9 have a 0% chance of being drawn for the sample, while the balls numbered 10 through 99 all share the same nonzero chance of being drawn. The sample mean of 508/10 = 50.8 is a biased estimate of the population mean of 4,950/100 = 49.5.
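The size of the bias in Example 2 can be made concrete with a small Python sketch. The variable names are illustrative assumptions. Because the scanner rejects single-digit balls, the sample is effectively drawn from the balls numbered 10 through 99, so the sample mean estimates the mean of that smaller group rather than the mean of the whole population.

```python
import statistics

# Target population from Example 2: balls numbered 0 through 99.
population = list(range(100))

# Balls the scanner can actually accept (2-digit numbers only).
reachable = [ball for ball in population if ball >= 10]

# Final sample from Example 2, after 3 and 8 were replaced by 12 and 84.
sample = [36, 78, 29, 95, 51, 40, 66, 17, 12, 84]

population_mean = statistics.fmean(population)  # 4,950 / 100
reachable_mean = statistics.fmean(reachable)    # 4,905 / 90
sample_mean = statistics.fmean(sample)          # 508 / 10

print(population_mean)  # 49.5
print(reachable_mean)   # 54.5
print(sample_mean)      # 50.8
```

Samples drawn this way average out to 54.5 in the long run, which is 5 units above the true population mean of 49.5, no matter how many balls are drawn.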

The Stratified Random Sample

Frequently, a population has characteristics that are not evenly distributed among its members. An example is the income of the households in an area or jurisdiction. In most countries, household income is very high for a relatively small number of households and low for a relatively large number of households. A simple random sample, in which every household has the same probability of being drawn, will have a large number of low-income households and very few, possibly no, high-income households. In such a scenario, a sample statistic, such as mean income, would not be representative of the population mean. If an investigator is interested in mean income, it is important that a sufficient number of high-income households are in the sample in order for its statistics to be representative of the entire population. The investigator would need to over-draw or over-sample the high-income households. Such an action would cause high-income households to have a higher chance of being drawn for the sample than lower-income households, creating a non-random sample and leading to biased results.

The strategy for over-sampling high-income households while still yielding a random sample is to divide the population into strata based on income and draw a simple random sample from each stratum. The decisions of how many strata to divide the population into and what each stratum threshold should be are the subject of the many books on sampling. We do not discuss them in this module and take the stratification decisions as given. Such a strategy, where there are multiple strata with all members of one stratum having the same chance of being drawn for the sample but the members of different strata having different chances of being drawn for the sample, is called a stratified random sample. The difficulty with stratified random samples is achieving unbiased results. That requires accounting for the different chances, among the population members, of being drawn for the sample.

That accounting involves applying weights to each sample item and computing weighted sample statistics. (Weighted sample statistics, such as the weighted mean, are discussed in the complete module but not in this sample module.) Those population members who are over-drawn for the sample must be under-weighted in sample computations, and the under-drawn members over-weighted in sample computations. We now present the computations of those weights.

Say a population of size N is divided into 2 strata, one with a sub-population of N1 and the other with a sub-population of N2. A sample is drawn with n1 observations from the first stratum and n2 observations from the second stratum. The probability of a first stratum member being drawn is n1 / N1, and that for a second stratum member is n2 / N2. The raw weights applied to the sample observations are the reciprocals of each observation's probability of being drawn for the sample. For sample observations from the first stratum, the weight is w1 = N1 / n1, and that for the second stratum is w2 = N2 / n2. These raw weights are adjusted to account for the sample size n = n1 + n2. The adjustment factor is the sample size, n, divided by the sum of the raw weights for all the observations in the sample, Σwi. Each raw weight is multiplied by the adjustment factor to give the adjusted weight for its stratum. To compute the sample mean, the value of each observation is multiplied by the adjusted weight for its stratum, and the weighted values of all the observations are summed and divided by the sample size. This yields an unbiased estimate of the population mean.
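In symbols, the weighting scheme just described can be summarized as follows. The stratum index h, the adjustment factor c, the adjusted weight written with a tilde, and the observation notation x_{hi} are introduced here for illustration only; they do not appear in the original text.

```latex
\begin{align*}
w_h &= \frac{N_h}{n_h}, \qquad h = 1, 2
  && \text{(raw weight: reciprocal of the selection probability)} \\
c &= \frac{n}{\sum_{h} n_h w_h} = \frac{n}{N}
  && \text{(adjustment factor, since } n_h w_h = N_h \text{)} \\
\tilde{w}_h &= c \, w_h
  && \text{(adjusted weight)} \\
\bar{x}_w &= \frac{1}{n} \sum_{h} \tilde{w}_h \sum_{i=1}^{n_h} x_{hi}
  = \sum_{h} \frac{N_h}{N} \, \bar{x}_h
  && \text{(weighted sample mean)}
\end{align*}
```

The last equality shows that the weighted mean is the average of the stratum sample means, with each stratum counted in proportion to its share of the population.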

Example 3.

A population consists of 80 lightweight balls numbered from 00 through 49 and 20 lightweight balls numbered from 50 through 99. Each number has 2 digits because a scanner reads the number on a ball as it is sampled and requires input for 2 digits. The lightweight balls are shuffled by a fan. To assure a representative sample, the population is divided into 2 strata: one for ball values 00 through 49 and the other for ball values 50 through 99. A stratified sample is drawn, 5 balls from the lower stratum and 5 from the upper stratum, yielding the numbers 08, 17, 29, 36, 40, 58, 64, 78, 83, and 91. What is the mean of this sample, and is it an unbiased estimate of the mean of the population?

The probability of a population member being drawn from the lower stratum is 5/80, so the raw weight for observations from that stratum is 80/5 = 16. For the upper stratum, the probability is 5/20, yielding a raw weight of 20/5 = 4. The sum of the raw weights is 5 * 16 + 5 * 4 = 100 and the sample size is 10, so the adjustment factor for the weights is 10 / 100 = 0.10. The adjusted weight for each lower stratum observation is 16 * 0.10 = 1.6, and that for each upper stratum observation is 4 * 0.10 = 0.40. Notice that the stratum which was under-drawn for the sample, the lower stratum, has an adjusted weight which is greater than 1, while the over-drawn stratum, the upper stratum, has an adjusted weight which is less than 1. Remember: The observations from a stratum which is over-drawn for the sample are under-weighted in computations. The observations from a stratum which is under-drawn for the sample are over-weighted in computations. The unbiased mean of this stratified sample is [1.6 * (8 + 17 + 29 + 36 + 40) + 0.40 * (58 + 64 + 78 + 83 + 91)] / 10 = 35.76.
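The weighting steps of Example 3 can be reproduced with a short Python sketch. The function name `stratified_mean` and its input layout (a list of stratum size and sample value pairs) are assumptions made for this illustration.

```python
def stratified_mean(strata):
    """Weighted mean of a stratified sample.

    strata: list of (stratum_population_size, sample_values) pairs.
    """
    n = sum(len(values) for _, values in strata)

    # Raw weight per observation: reciprocal of its selection probability.
    raw = [(size / len(values), values) for size, values in strata]

    # Adjustment factor: sample size over the sum of all raw weights.
    adjustment = n / sum(weight * len(values) for weight, values in raw)

    # Sum of observation values times their adjusted stratum weight.
    weighted_total = sum(
        weight * adjustment * sum(values) for weight, values in raw
    )
    return weighted_total / n


lower = [8, 17, 29, 36, 40]    # drawn from the 80 balls numbered 00-49
upper = [58, 64, 78, 83, 91]   # drawn from the 20 balls numbered 50-99

example_3_mean = stratified_mean([(80, lower), (20, upper)])
unweighted_mean = sum(lower + upper) / 10

print(round(example_3_mean, 2))  # 35.76
print(unweighted_mean)           # 50.4
```

The unweighted mean of 50.4 shows what happens when the over-drawn upper stratum is not under-weighted: the estimate is pulled well above the weighted value of 35.76.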

Confidence Intervals

A use of sample statistics is the determination of a range of possible values for a population parameter. This range is called a confidence interval. A confidence interval represents the range of values a population parameter may assume with a certain probability or level of confidence. The level of confidence is also sometimes referred to as the complement of the level of significance. The level of significance has the symbol alpha, α. The level of confidence equals 1 - α. There are 2-sided and 1-sided confidence intervals. A 2-sided confidence interval shows the upper AND lower limits of the range of possible values for the population parameter associated with a given level of confidence. A 1-sided confidence interval shows the upper OR lower limit of the range of possible values for the population parameter associated with a given level of confidence.

The level of confidence is the probability that the population parameter falls within the range of values given for the confidence interval. The analyst chooses the level of confidence. A 90% level of confidence (or 10% level of significance) for a 2-sided confidence interval means that 90% of the time the population parameter value will fall within the confidence interval limits, while 5% of the time it will fall below the lower limit and 5% of the time it will fall above the upper limit. A 90% level of confidence (or 10% level of significance) for a 1-sided upper bound confidence interval means that the population parameter value falls at or below the upper limit 90% of the time, and above the limit 10% of the time. A 90% level of confidence (or 10% level of significance) for a 1-sided lower bound confidence interval means that the population parameter value falls at or above the lower limit 90% of the time, and below the limit 10% of the time. Each level of confidence is associated with a value used in computing the confidence interval. This value depends on which probability distribution is the appropriate one to use, and that depends on the population parameter for which the confidence interval is being computed.

For this sample module, we will compute confidence intervals for the population mean. That will require use of the normal probability distribution. The values for the normal distribution, called Z-scores, associated with the different levels of confidence are:

```
Confidence Level     Significance Level     2-Sided Z     1-Sided Z

90%                  10%                    1.645         1.282
95%                   5%                    1.960         1.645
98%                   2%                    2.326         2.054
99%                   1%                    2.576         2.326
```
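Z-scores for any level of confidence can be computed from the inverse cumulative distribution function of the standard normal distribution. The sketch below uses `NormalDist.inv_cdf` from Python's standard `statistics` module; the helper name `z_score` is an assumption made for this illustration.

```python
from statistics import NormalDist


def z_score(confidence, two_sided=True):
    """Z-score for a given level of confidence (e.g. 0.95)."""
    alpha = 1.0 - confidence  # level of significance
    # A 2-sided interval splits alpha evenly between the two tails.
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1.0 - tail)


for confidence in (0.90, 0.95, 0.98, 0.99):
    two = z_score(confidence)
    one = z_score(confidence, two_sided=False)
    print(f"{confidence:.0%}  2-sided {two:.3f}  1-sided {one:.3f}")
```

Note that the 1-sided Z-score for a given level of significance equals the 2-sided Z-score for twice that level of significance, which is why 1.645 appears in both columns.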

The 2-sided confidence interval is constructed as follows. The sample mean has a precision measure added to it to find the upper limit of the interval, and subtracted from it to find the lower limit of the interval. The precision measure is the Z-score value times the standard deviation of the mean. The formula is $\mu = \bar{X} \pm S_{\bar{X}} \cdot Z_{\alpha/2}$, where $\bar{X}$ is the sample mean and $S_{\bar{X}}$ is the standard deviation of the sample mean (written SX in the examples, and computed as the sample standard deviation divided by the square root of the sample size). The 1-sided confidence interval is constructed in a similar manner. The sample mean has a precision measure added to it to find the upper limit, or subtracted from it to find the lower limit. The formulae are $\mu = \bar{X} + S_{\bar{X}} \cdot Z_{\alpha}$ for the upper bound confidence interval, and $\mu = \bar{X} - S_{\bar{X}} \cdot Z_{\alpha}$ for the lower bound confidence interval. The final confidence interval is written as $\mu \in [\mu_{lower}, \mu_{upper}]$ for the 2-sided interval, $\mu \in [\mu_{lower}, \infty)$ for the 1-sided lower bound interval, or $\mu \in (-\infty, \mu_{upper}]$ for the 1-sided upper bound interval.

Example 1.

A sample of 100 observations is found to have a mean of 12.8 and a standard deviation of 3.6. What is the 2-sided 95% confidence interval for the population mean?

The standard deviation of the sample mean is: SX = 3.6 / √100 = 0.36. The Z-score for a 2-sided 95% confidence interval is 1.960. The precision amount is 1.960 * 0.36 = 0.706. μ = 12.8 ± 0.706. μ ∈ [12.1, 13.5]. There is a 95% probability that the population mean falls in the range from 12.1 through 13.5.
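One way to read the 95% figure is as a long-run frequency: if many random samples are drawn and a 2-sided interval is built from each one, about 95% of those intervals contain the population mean. The simulation below is a minimal sketch of that idea. It assumes a hypothetical normally distributed population with mean 12.8 and standard deviation 3.6; the function name, the seed, and the number of trials are arbitrary choices made for this illustration.

```python
import random
import statistics


def coverage(trials=2000, n=100, mu=12.8, sigma=3.6, z=1.960, seed=1):
    """Fraction of simulated 2-sided intervals that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one random sample of n observations.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        # Standard deviation of the sample mean: S / sqrt(n).
        std_error = statistics.stdev(sample) / n ** 0.5
        margin = z * std_error
        if mean - margin <= mu <= mean + margin:
            hits += 1
    return hits / trials


observed = coverage()
print(observed)  # expected to be close to 0.95
```

The limits change from sample to sample while the population mean stays fixed, so the level of confidence describes how often the procedure captures the mean.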

Example 2.

A sample of 100 observations is found to have a mean of 12.8 and a standard deviation of 3.6. What is the 1-sided upper bound 95% confidence interval for the population mean?

The precision amount is now 1.645 * 0.36 = 0.592. μ = 12.8 + 0.592. μ ∈ (-∞, 13.4]. There is a 95% probability that the population mean falls in the range from minus infinity through 13.4.

Example 3.

A sample of 100 observations is found to have a mean of 12.8 and a standard deviation of 3.6. What is the 1-sided lower bound 95% confidence interval for the population mean?

The precision amount is again 1.645 * 0.36 = 0.592. μ = 12.8 - 0.592. μ ∈ [12.2, ∞). There is a 95% probability that the population mean falls in the range from 12.2 through infinity.
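The three confidence interval examples can be reproduced with one small Python function. This is a sketch under stated assumptions: the function name `confidence_interval` and its `kind` parameter are hypothetical, and the Z-scores are passed in by hand from the table of Z-scores.

```python
import math


def confidence_interval(mean, std_dev, n, z, kind="two-sided"):
    """Return (lower, upper) limits of a confidence interval for the mean.

    kind: "two-sided", "upper" (upper bound only) or "lower" (lower bound only).
    """
    std_error = std_dev / math.sqrt(n)  # standard deviation of the sample mean
    margin = z * std_error              # the precision amount
    if kind == "two-sided":
        return (mean - margin, mean + margin)
    if kind == "upper":
        return (-math.inf, mean + margin)
    if kind == "lower":
        return (mean - margin, math.inf)
    raise ValueError(f"unknown kind: {kind!r}")


# Sample of 100 observations, mean 12.8, standard deviation 3.6.
two_sided = confidence_interval(12.8, 3.6, 100, 1.960)
upper = confidence_interval(12.8, 3.6, 100, 1.645, kind="upper")
lower = confidence_interval(12.8, 3.6, 100, 1.645, kind="lower")

print([round(limit, 1) for limit in two_sided])  # [12.1, 13.5]
print(round(upper[1], 1))                        # 13.4
print(round(lower[0], 1))                        # 12.2
```

The 1-sided intervals use the smaller Z-score of 1.645 because the entire 5% level of significance sits in a single tail, so their finite limit lies closer to the sample mean than the matching 2-sided limit.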