STATISTICS IV: HYPOTHESIS TESTING
Hypothesis tests are procedures for making rational decisions about the reality of effects.
Rational Decisions
Most decisions require that an individual select a single alternative from a number of possible alternatives. The decision is made without knowing whether or not it is correct; that is, it is based on incomplete information. For example, a person either takes or does not take an umbrella to school based upon both the weather report and observation of outside conditions. If it is not currently raining, this decision must be made with incomplete information.
A rational decision is characterized by the use of a procedure which ensures that the likelihood or probability of success is incorporated into the decision-making process. The procedure must be stated in such a fashion that another individual, using the same information, would make the same decision.
One is reminded of a STAR TREK episode. Captain Kirk, for one reason or another, is stranded on a planet without his communicator and is unable to get back to the Enterprise. Spock has assumed command and is being attacked by Klingons (who else). Spock asks for and receives information about the location of the enemy, but is unable to act because he does not have complete information. Captain Kirk arrives at the last moment and saves the day because he can act on incomplete information.
This story goes against the concept of rational man. Spock, being the ultimate rational man, would not be immobilized by indecision. Instead, he would have selected the alternative which realized the greatest expected benefit given the information available. If complete information were required to make decisions, few decisions would be made by rational men and women. This is obviously not the case. The script writer misunderstood Spock and rational man.
Effects
When a change in one thing is associated with a change in another, we have an effect. The changes may be either quantitative or qualitative, with the hypothesis testing procedure selected based upon the type of change observed. For example, if changes in salt intake in a diet are associated with changes in activity level in children, we say an effect occurred. In another case, if the distribution of political party preference (Republicans, Democrats, or Independents) differs by sex (Male or Female), then an effect is present. Much of behavioral science is directed toward discovering and understanding effects.
The effects discussed in the remainder of this text appear as various statistics, including differences between means, contingency tables, and correlation coefficients.
General Principles
All hypothesis tests conform to similar principles and proceed with the same sequence of events.
* A model of the world is created in which there are no effects. Within this model, the experiment is imagined to be repeated an infinite number of times.
* The results of the experiment are compared with the model of step one. If, given the model, the results are unlikely, then the model is rejected and the effects are accepted as real. If the results could be explained by the model, the model must be retained. In the latter case no decision can be made about the reality of effects.
Hypothesis testing is analogous to the geometrical method of proof by negation (proof by contradiction). That is, if one wishes to prove that A (the hypothesis) is true, one first assumes that it isn't true. If it is shown that this assumption leads to a logical impossibility, then the original hypothesis is proven. In hypothesis testing, however, the hypothesis can never be proven; rather, it is decided that the model of no effects is unlikely enough that the opposite hypothesis, that of real effects, is accepted.
An analogous situation exists with respect to hypothesis testing in statistics. In hypothesis testing one wishes to show real effects of an experiment. By showing that the experimental results were unlikely, given that there were no effects, one may decide that the effects are, in fact, real. The hypothesis that there were no effects is called the NULL HYPOTHESIS. The symbol H0 is used to abbreviate the Null Hypothesis in statistics. Note that, unlike in geometry, we cannot prove the effects are real; rather, we may decide the effects are real.
For example, suppose the following probability model (distribution) described the state of the world. In this case the decision would be that there were no effects; the null hypothesis is true.
Event A might be considered fairly likely, given the above model was correct. As a result the model would be retained, along with the null hypothesis. Event B on the other hand is unlikely, given the model. Here the model would be rejected, along with the null hypothesis.
The Model
The SAMPLING DISTRIBUTION is a distribution of a sample statistic. It is used as a model of what would happen if
1.) the null hypothesis were true (there really were no effects), and
2.) the experiment were repeated an infinite number of times.
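The idea of repeating an experiment many times under the null hypothesis can be approximated on a computer. The following is a minimal sketch, not a prescribed procedure: it assumes a normally distributed population with mean 100 and standard deviation 16 (the standard deviation is an assumption made for illustration), a sample size of 64, and 5000 repetitions standing in for "an infinite number of times."

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

MU_NULL = 100.0      # population mean if there are no effects
SIGMA = 16.0         # ASSUMPTION: population standard deviation
N = 64               # sample size of one experiment
REPLICATIONS = 5000  # finite stand-in for "an infinite number of times"


def one_sample_mean():
    """Run one imaginary experiment under the null model and return its mean."""
    scores = [random.gauss(MU_NULL, SIGMA) for _ in range(N)]
    return sum(scores) / N


# The collection of means approximates the sampling distribution of the mean.
sample_means = [one_sample_mean() for _ in range(REPLICATIONS)]

print("mean of the sample means:", round(mean(sample_means), 2))
print("standard deviation of the sample means:", round(stdev(sample_means), 2))
print("theoretical standard error, sigma / sqrt(N):", SIGMA / sqrt(N))
```

The simulated means cluster around the null value, and their spread is close to the population standard deviation divided by the square root of the sample size.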
Probability
Probability is a theory of uncertainty. It is a necessary concept because the world according to the scientist is unknowable in its entirety. However, prediction and decisions are obviously possible. As such, probability theory is a rational means of dealing with an uncertain world.
Probabilities are numbers associated with events that range from zero to one (0-1). A probability of zero means that the event is impossible. For example, if I were to flip a coin, the probability of a leg is zero, due to the fact that a coin may have a head or tail, but not a leg. Given a probability of one, however, the event is certain. For example, if I flip a coin the probability of heads, tails, or an edge is one, because the coin must take one of these possibilities.
In real life, most events have probabilities between these two extremes. For instance, the probability of rain tonight is .40; tomorrow night the probability is .10. Thus it can be said that rain is more likely tonight than tomorrow.
The meaning of the term probability depends upon one's philosophical orientation. In the CLASSICAL approach, probabilities refer to the relative frequency of an event, given the experiment was repeated an infinite number of times. For example, the .40 probability of rain tonight means that if the exact conditions of this evening were repeated an infinite number of times, it would rain 40% of the time.
In the Subjective approach, however, the term probability refers to a "degree of belief." That is, the individual assigning the number .40 to the probability of rain tonight believes that, on a scale from 0 to 1, the likelihood of rain is .40. This leads to a branch of statistics called "BAYESIAN STATISTICS." While many statisticians take this approach, it is not usually taught at the introductory level. At this point in time all the introductory student needs to know is that a person calling themselves a "Bayesian Statistician" is not ignorant of statistics. Most likely, he or she is simply involved in the theory of statistics.
No matter what theoretical position is taken, all probabilities must conform to certain rules. Some of the rules are concerned with how probabilities combine with one another to form new probabilities. For example, when events are independent, that is, one doesn't affect the other, the probabilities may be multiplied together to find the probability of the joint event. The probability of rain today AND getting a head when flipping a coin is the product of the two individual probabilities.
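The multiplication rule for independent events can be checked with a short simulation. This sketch borrows the .40 probability of rain from the earlier example and assumes a fair coin (probability of a head = .50); both values are used only for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

P_RAIN = 0.40  # borrowed from the rain example in the text
P_HEAD = 0.50  # ASSUMPTION: a fair coin

# Multiplication rule for independent events.
p_joint = P_RAIN * P_HEAD

# Check by simulation: generate the two events independently and count
# how often both occur on the same trial.
TRIALS = 100_000
both = 0
for _ in range(TRIALS):
    rain = random.random() < P_RAIN
    head = random.random() < P_HEAD
    if rain and head:
        both += 1
relative_frequency = both / TRIALS

print("product rule:", p_joint)
print("simulated relative frequency:", relative_frequency)
```

The simulated relative frequency should land close to the product of the two probabilities, which is the relative-frequency reading of the rule.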
A deck of cards illustrates other principles of probability theory. In bridge, poker, rummy, etc., the probability of a heart can be found by dividing thirteen, the number of hearts, by fifty-two, the number of cards, assuming each card is equally likely to be drawn. The probability of a queen is four (the number of queens) divided by the number of cards. The probability of a queen OR a heart is sixteen divided by fifty-two. This figure is computed by adding the probability of hearts to the probability of a queen, and then subtracting the probability of a queen AND a heart, which equals 1/52.
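The card arithmetic above can be verified by counting cards directly. The sketch below builds a standard fifty-two card deck and uses exact fractions; the helper name `probability` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def probability(event):
    """Probability of an event when every card is equally likely to be drawn."""
    favorable = sum(1 for card in deck if event(card))
    return Fraction(favorable, len(deck))


p_heart = probability(lambda card: card[1] == "hearts")
p_queen = probability(lambda card: card[0] == "Q")
p_queen_and_heart = probability(lambda card: card == ("Q", "hearts"))
p_queen_or_heart = probability(lambda card: card[0] == "Q" or card[1] == "hearts")

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p_queen_or_heart == p_heart + p_queen - p_queen_and_heart

print("P(heart) =", p_heart)
print("P(queen) =", p_queen)
print("P(queen and heart) =", p_queen_and_heart)
print("P(queen or heart) =", p_queen_or_heart)
```

Note that `Fraction` reduces to lowest terms, so sixteen divided by fifty-two is displayed as 4/13.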
Testing Hypotheses About Single Means
THE HEAD-START EXPERIMENT
Suppose an educator has a theory which argues that a great deal of learning occurs before children enter grade school or kindergarten. This theory holds that socially disadvantaged children start school intellectually behind other children and are never able to catch up. In order to remedy this situation, he proposes a head-start program, which starts children in a school situation at ages three and four.
A politician reads this theory and feels that it might be true. However, before he is willing to invest the billions of dollars necessary to begin and maintain a head-start program, he demands that the scientist demonstrate that the program really does work. At this point the educator calls for the services of a researcher and statistician.
Because this is a fantasy, the following research design would probably never be used in practice. This design will be used to illustrate the procedure and the logic underlying the hypothesis test.
A random sample of 64 four-year-old children is taken from the population of all four-year-old children. The children in the sample are all enrolled in the head-start program for a year, at the end of which time they are given a standardized intelligence test. The mean I.Q. of the sample is found to be 103.27.
On the basis of this information, the educator wishes to begin a nationwide head-start program. He argues that the average I.Q. in the population is 100 (μ = 100) and that 103.27 is greater than that. Therefore, the head-start program had an effect of about 103.27 - 100 or 3.27 I.Q. points. As a result, the billions of dollars necessary for the program would be well invested.
The statistician, being in this case the devil's advocate, is not ready to act so hastily. He wants to know whether chance could have caused the large mean. In other words, perhaps head-start doesn't make a bit of difference, and the mean of 103.27 was obtained because the sixty-four children selected for the sample were slightly brighter than average. He argues that this possibility must be ruled out before any action is taken. If it cannot be ruled out completely, he argues that its likelihood must be small enough that the possible benefits of making a correct decision outweigh the risk of making a wrong decision.
To determine if chance could have caused the difference, the hypothesis test proceeds as a thought experiment. First, the statistician assumes that there were no effects; in this case, the head-start program didn't work. He then creates a model of what the world would look like if the experiment were performed an infinite number of times under the assumption of no effects. The sampling distribution of the mean is used as this model. The reasoning goes something like this:
[Figure: population distribution assuming no effects; sampling distribution assuming no effects and N = 64; results of the experiment]
The statistician then compares the results of the actual experiment with those expected from the model, given there were no effects and the experiment was repeated an infinite number of times, and concludes that the model probably could explain the results.
Therefore, because chance could explain the results, the educator was premature in deciding that head-start had a real effect.
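The comparison between the sample mean and the model can be made concrete with a small calculation. This is only a sketch under stated assumptions: the population standard deviation of 16 I.Q. points, the two-tailed test, and the .05 cutoff for "unlikely" are not given in the discussion above and are assumed here for illustration.

```python
from math import sqrt
from statistics import NormalDist

MU_NULL = 100.0       # population mean assuming no effects
SIGMA = 16.0          # ASSUMPTION: population standard deviation of I.Q.
N = 64                # sample size
SAMPLE_MEAN = 103.27  # result of the experiment
ALPHA = 0.05          # ASSUMPTION: cutoff for calling a result "unlikely"

# Standard deviation of the sampling distribution of the mean.
standard_error = SIGMA / sqrt(N)

# How many standard errors the observed mean lies from the null mean.
z = (SAMPLE_MEAN - MU_NULL) / standard_error

# Probability, under the null model, of a mean at least this far from 100
# in either direction (two-tailed).
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

retain_null = p_two_tailed > ALPHA

print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.3f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
print("retain the model of no effects" if retain_null else "reject the model of no effects")
```

Under these assumptions the observed mean lies well under two standard errors from 100, so the model of no effects is retained, in line with the statistician's conclusion.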
HEAD-START EXPERIMENT REDONE
Suppose that the researcher changed the experiment. Instead of a sample of sixty-four children, the sample was increased to N = 400 four-year-old children. Furthermore, this sample had the same mean (103.27) at the conclusion as had the previous study. The statistician must now change the model to reflect the larger sample size.
[Figure: population distribution assuming no effects; sampling distribution assuming no effects and N = 400; results of the experiment]
The conclusion reached by the statistician states that it is highly unlikely the model could explain the results. The model of chance is rejected and the reality of effects accepted. Why? The mean that resulted from the study fell in the tail of the sampling distribution.
The different conclusions reached in these two experiments may seem contradictory to the student. A little reflection, however, reveals that the second experiment was based on a much larger sample size (400 vs. 64). As such, the researcher is rewarded for doing more careful work and taking a larger sample. The sampling distribution of the mean specifies the nature of the reward.
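The effect of sample size on the conclusion can be seen by running the same calculation for both experiments. As a hedged sketch, the code below assumes a population standard deviation of 16 I.Q. points and a two-tailed test, neither of which is stated above; the function name `z_test` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

MU_NULL = 100.0       # population mean assuming no effects
SIGMA = 16.0          # ASSUMPTION: population standard deviation of I.Q.
SAMPLE_MEAN = 103.27  # same observed mean in both experiments


def z_test(sample_mean, mu_null, sigma, n):
    """Return (standard error, z, two-tailed p) for a single sample mean."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu_null) / standard_error
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return standard_error, z, p_two_tailed


# Same mean, two sample sizes: only the width of the sampling distribution changes.
results = {n: z_test(SAMPLE_MEAN, MU_NULL, SIGMA, n) for n in (64, 400)}

for n, (se, z, p) in results.items():
    print(f"N = {n:3d}  standard error = {se:.3f}  z = {z:.3f}  two-tailed p = {p:.5f}")
```

With the larger sample the sampling distribution is much narrower, so the same mean of 103.27 falls far out in its tail.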
At this point it should also be pointed out that we are discussing statistical significance: whether or not the results could have occurred by chance. The second question, that of practical significance, occurs only after an affirmative decision about the reality of the effects. The practical significance question is tackled by the politician, who must decide whether the effects are large enough to be worth the money to begin and maintain the program. Even though head-start works, the money may be better spent in programs for the health of the aged or more nuclear submarines. In short, this is a political and practical decision made by people and not statistical procedures.