STATISTICS I: COLLECTING AND ORGANIZING DATA
Objectives:
Statistics is a branch of mathematics that studies techniques for collecting, organizing and interpreting data.
Descriptive Statistics includes procedures used to organize and present data in a convenient, useable and communicable form.
One important use of statistics is to summarize a collection of data in a clear and understandable way. For example, assume a psychologist gave a personality test measuring shyness to all 2500 students attending a small college. How might these measurements be summarized? There are two basic methods: numerical and graphical.
Using the numerical approach one might compute statistics such as the mean and standard deviation. These statistics convey information about the average degree of shyness and the degree to which people differ in shyness.
Using the graphical approach one might create a stem and leaf display and a box plot. These plots contain detailed information about the distribution of shyness scores.
Graphical methods are better suited than numerical methods for identifying patterns in the data. Numerical approaches are more precise and objective.
Since the numerical and graphical approaches complement each other, it is wise to use both.
Inferential Statistics are used to draw inferences about a population from a sample. Consider an experiment in which 10 subjects who performed a task after 24 hours of sleep deprivation scored 12 points lower than 10 subjects who performed after a normal night's sleep. Is the difference real or could it be due to chance? How much larger could the real difference be than the 12 points found in the sample? These are the types of questions answered by inferential statistics.
There are two main methods used in inferential statistics: estimation and hypothesis testing.
Variables
A variable is any measured characteristic or attribute that differs for different subjects. For example, if the weight of 30 subjects were measured, then weight would be a variable.
The first step, before any calculations or plotting of data, is to decide what type of data one is dealing with.
It is useful to distinguish between two broad types of variables (data): quantitative (or numeric) (for which one asks "how much?") and qualitative (for which one asks "what type?").
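Each is broken down into two sub-types: numeric data are either discrete or continuous, and qualitative data are either ordinal or nominal.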
Because qualitative data always have a limited number of alternative values, such variables are also described as discrete. All qualitative data are discrete, while some numeric data are discrete and some are continuous.
For statistical analysis, qualitative data can be converted into discrete numeric data by simply counting the different values that appear.
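The counting step can be sketched in a few lines of Python. This is only an illustration: the list of eye colours below is invented for the example and is not data from these notes, and the function name count_categories is likewise an assumption.

```python
from collections import Counter

# Hypothetical eye-colour observations, invented purely for illustration.
eye_colours = ["blue", "brown", "green", "brown", "blue", "brown"]

def count_categories(values):
    """Return a dict mapping each category to the number of times it occurs."""
    return dict(Counter(values))

counts = count_categories(eye_colours)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

The resulting counts are discrete numeric data and can be tabulated or shown in a bar chart.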
Note: the word "variable" is used in two senses. It can mean an item of data collected on each sampling unit, and it can mean "random variable". A random variable is a variable in the mathematical sense, but one that takes different values according to a probability distribution. The word "variate" is also sometimes used to mean random variable. In Statistics, we use random variables to build probability models for data variables. This makes sense because when data are collected on observational units sampled at random, the values recorded for the data variables can be regarded as realisations of mathematical random variables.
Qualitative Data
Qualitative data arise when the observations fall into separate distinct categories.
Examples:
Colour of eyes: blue, green, brown etc.
Exam result: pass or fail
Socio-economic status: low, middle or high
Such data are inherently discrete, in that there are a finite number of possible categories into which each observation may fall.
Data are classified as:
Quantitative or numerical data arise when the observations are counts or measurements. The data are said to be discrete if the measurements are integers (e.g. number of people in a household, number of cigarettes smoked per day) and continuous if the measurements can take on any value, usually within some range (e.g. weight).
Quantities such as sex and weight are called variables, because the values of these quantities vary from one observation to another.
Numbers calculated to describe important features of the data are called statistics. For example, (i) the proportion of females, and (ii) the average age of unemployed persons, in a sample of residents of a town are statistics.
The following table shows a part of some (hypothetical) data on a group of 48 subjects. 'Age' and 'income' are continuous numeric variables, 'age group' is an ordinal qualitative variable, and 'sex' is a nominal qualitative variable.
The ordinal variable 'age group' is created from the continuous variable 'age' using five categories:
age group = 1 if age is less than 20;
age group = 2 if age is 20 to 29;
age group = 3 if age is 30 to 39;
age group = 4 if age is 40 to 49;
age group = 5 if age is 50 or more.
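The five rules above can be written as a small function. This is a minimal sketch, assuming ages are recorded in whole years (so "20 to 29" is treated as 20 or more but less than 30); the function name age_group is chosen here for illustration.

```python
def age_group(age):
    """Map an age in years to the ordinal codes 1 to 5 defined above.

    Assumes ages are whole years, so "20 to 29" means 20 <= age < 30.
    """
    if age < 20:
        return 1
    if age < 30:
        return 2
    if age < 40:
        return 3
    if age < 50:
        return 4
    return 5

# Ages of subjects 1, 2, 3 and 47, the rows visible in Table 1.
ages = [32, 20, 45, 19]
print([age_group(a) for a in ages])  # prints [3, 2, 4, 1]
```

The printed codes agree with the 'Age Group' column for those subjects. Note that the conversion discards information: once only the group is kept, ages 20 and 29 can no longer be told apart.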
Table 1 - Hypothetical Data
| Subject No. | Age (years) | Age Group | Annual Income (*$10,00) | Sex |
| 1 | 32 | 3 | 41 | F |
| 2 | 20 | 2 | 15 | M |
| 3 | 45 | 4 | 23 | F |
| ... | .... | ..... | .... | .... |
| ... | ... | ... | .... | ... |
| 47 | 19 | 1 | 0.5 | F |
| 48 | 32 | 3 | 19 | F |
Stem and leaf plots
Before any statistical calculation, even the simplest, is performed, the data should be tabulated or plotted. If they are quantitative and relatively few, say up to about 30, they are conveniently written down in order of size.
For example, a pediatric registrar in a district general hospital is investigating the amount of lead in the urine of children from a nearby housing estate. In a particular street there are 15 children whose ages range from 1 year to under 16, and in a preliminary study the registrar has found the following amounts of urinary lead, given in Table 2 (which is called an array):
Table 2
| Urinary concentration of lead in 15 children from housing estate |
| 0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5, 3.2, 1.7, 1.9, 1.9, 2.2 |
A simple way to order, and also to display, the data is to use a stem and leaf plot. To do this we need to abbreviate the observations to two significant digits. In the case of the urinary concentration data, the digit to the left of the decimal point is the "stem" and the digit to the right the "leaf".
We first write the stems in order down the page. We then work along the data set, writing the leaves down "as they come". Thus, for the first data point, we write a 6 opposite the 0 stem. These are as given in Table 3.
Table 3 Stem and leaf plot "as they come"
| Stem | Leaf |
| 0 | 6 1 4 8 |
| 1 | 1 3 2 5 7 9 9 |
| 2 | 6 0 2 |
| 3 | 2 |
We then order the leaves, as in Table 4.
Table 4 Ordered stem and leaf plot
| Stem | Leaf |
| 0 | 1 4 6 8 |
| 1 | 1 2 3 5 7 9 9 |
| 2 | 0 2 6 |
| 3 | 2 |
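The procedure just described can be sketched in Python. This is an illustrative sketch, not part of the original notes; it assumes every observation has exactly one decimal place, as in Table 2, and the function name stem_and_leaf is made up for the example.

```python
from collections import defaultdict

# Urinary lead concentrations from Table 2, in the order given.
lead = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
        3.2, 1.7, 1.9, 1.9, 2.2]

def stem_and_leaf(values, ordered=True):
    """Return {stem: [leaves]} for values with one decimal place."""
    plot = defaultdict(list)
    for v in values:
        tenths = round(v * 10)           # 2.6 -> 26; round() guards against float error
        stem, leaf = divmod(tenths, 10)  # 26 -> stem 2, leaf 6
        plot[stem].append(leaf)          # leaves are recorded "as they come"
    if ordered:
        for leaves in plot.values():
            leaves.sort()
    return dict(sorted(plot.items()))

for stem, leaves in stem_and_leaf(lead).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Calling stem_and_leaf(lead, ordered=False) gives the leaves in the "as they come" order, and the default call gives the ordered version. A stem with no observations would simply be missing from the result, which does not arise with these data.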
The advantage of first setting the figures out in order of size and not simply feeding them straight from notes into a calculator (for example, to find their mean) is that the relation of each to the next can be looked at. Is there a steady progression, a noteworthy hump, a considerable gap? Simple inspection can disclose irregularities. Furthermore, a glance at the figures gives information on their range. The smallest value is 0.1 and the largest is 3.2.
Median
To find the median (or mid point) we need to identify the point which has the property that half the data are greater than it, and half the data are less than it. For 15 points, the mid point is clearly the eighth largest, so that seven points are less than the median, and seven points are greater than it. This is easily obtained from Table 4 by counting the eighth "leaf", which is 1.5.
To find the median for an even number of points, the procedure is as follows. Suppose the pediatric registrar obtained a further set of 16 urinary lead concentrations from children living in the countryside in the same county as the hospital (Table 5).
Table 5
| Urinary concentration of lead in 16 rural children |
| 0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4 |
To obtain the median we average the eighth and ninth points (1.8 and 1.9) to get 1.85. In general, if n is even, we average the n/2th largest and the (n/2 + 1)th largest observations.
The main advantage of using the median as a measure of location is that it is "robust" to outliers. For example, if we had accidentally written 34 rather than 3.4 in Table 5, the median would still have been 1.85. One disadvantage is that it is tedious to order a large number of observations by hand (there is usually no "median" button on a calculator).
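The rule for odd and even numbers of points, and the robustness of the median to a single mistyped value, can be checked with a short Python sketch. The function name median_by_hand is an assumption made for this example; Python's standard library also provides statistics.median.

```python
def median_by_hand(values):
    """Middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # e.g. the 8th of 15 points
    return (ordered[mid - 1] + ordered[mid]) / 2   # the n/2th and (n/2 + 1)th points

# Table 2 (15 children) and Table 5 (16 rural children).
urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
         3.2, 1.7, 1.9, 1.9, 2.2]
rural = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9,
         2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

# The same rural data with the largest value mistyped as 34 instead of 3.4.
rural_typo = rural[:-1] + [34]

print(f"urban median: {median_by_hand(urban):.2f}")
print(f"rural median: {median_by_hand(rural):.2f}")
print(f"rural median with typo: {median_by_hand(rural_typo):.2f}")
print(f"rural mean: {sum(rural) / len(rural):.2f}")
print(f"rural mean with typo: {sum(rural_typo) / len(rural_typo):.2f}")
```

The two rural medians are identical because the mistyped value only changes the largest observation, while the mean is pulled upwards by it.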
Data display
The simplest way to show data is a dot plot. Figure 1 shows the data from Tables 2 and 5, together with the median for each set.
When the data sets are large, plotting individual points can be cumbersome. An alternative is a box-whisker plot. The box is marked by the first and third quartile, and the whiskers extend to the range. The median is also marked in the box, as shown in Figure 2.
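The five numbers needed to draw such a plot (minimum, first quartile, median, third quartile, maximum) can be computed as follows. This is a sketch under a stated assumption: several conventions for quartiles are in use, and the one here takes the medians of the lower and upper halves of the ordered data, leaving out the middle value when the number of points is odd. Statistical software may use a different rule and give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Quartile convention (an assumption, one of several in use): Q1 and Q3
    are the medians of the lower and upper halves of the ordered data,
    excluding the middle value when the number of points is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Urinary lead concentrations from Table 2.
urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
         3.2, 1.7, 1.9, 1.9, 2.2]

print(five_number_summary(urban))  # prints (0.1, 0.8, 1.5, 2.0, 3.2)
```

For the Table 2 data the box would therefore run from 0.8 to 2.0 with the median marked at 1.5, and the whiskers would extend to 0.1 and 3.2.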
Histograms
Suppose the pediatric registrar referred to earlier extends the urban study to the entire estate in which the children live. He obtains figures for the urinary lead concentration in 140 children aged over 1 year and under 16. We can display these data as a grouped frequency table (Table 6).
Table 6. Lead concentration in 140 urban children
| Lead Concentration | Number of Children |
| 0- | 2 |
| 0.4- | 7 |
| 0.8- | 10 |
| 1.2- | 16 |
| 1.6- | 23 |
| 2.0- | 28 |
| 2.4- | 19 |
| 2.8- | 16 |
| 3.2- | 11 |
| 3.6- | 7 |
| 4.0- | 1 |
| 4.4- | |
| Total | 140 |
Figure 3 shows the histogram of data from Table 6.
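A rough text version of such a histogram can be produced directly from the grouped frequencies. The sketch below is an illustration only and is no substitute for a properly drawn figure. It assumes that each interval in Table 6 is 0.4 wide and that the final "4.4-" row, which has no count, merely marks the upper end of the last interval.

```python
# Lower bound of each interval and the number of children, from Table 6.
# Assumption: every interval is 0.4 wide and "4.4-" only closes the last one.
table6 = [
    (0.0, 2), (0.4, 7), (0.8, 10), (1.2, 16), (1.6, 23), (2.0, 28),
    (2.4, 19), (2.8, 16), (3.2, 11), (3.6, 7), (4.0, 1),
]

def text_histogram(rows, width=0.4):
    """Return one line per interval, with one '*' per child."""
    lines = []
    for lower, count in rows:
        label = f"{lower:.1f}-{lower + width:.1f}"
        lines.append(f"{label:<8} {'*' * count} ({count})")
    return lines

total = sum(count for _, count in table6)
print("\n".join(text_histogram(table6)))
print("Total:", total)
```

Because the intervals are equal, the length of each row of stars is proportional to the number of children in that interval, and the printed total confirms that the counts add up to 140.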
Bar charts
Suppose, of the 140 children, 20 lived in owner occupied houses, 70 lived in council houses and 50 lived in private rented accommodation. Figures from the census suggest that for this age group, throughout the county, 50% live in owner occupied houses, 30% in council houses, and 20% in private rented accommodation. Type of accommodation is a categorical variable, which can be displayed in a bar chart. We first express our data as percentages:
14% owner occupied, 50% council house, 36% private rented. We then display the data as a bar chart. The sample size should always be given (Figure 4).
Figure 4 Bar chart of housing data for 140 children and comparable census data
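The conversion from counts to percentages can be reproduced with a short Python sketch. The counts and census percentages are those given in the text; the variable and function names are chosen here for illustration.

```python
# Counts for the 140 children and the census percentages quoted in the text.
housing = {"owner occupied": 20, "council house": 70, "private rented": 50}
census_pct = {"owner occupied": 50, "council house": 30, "private rented": 20}

def to_percentages(counts):
    """Convert category counts to percentages rounded to whole numbers."""
    total = sum(counts.values())
    return {kind: round(100 * n / total) for kind, n in counts.items()}

sample_pct = to_percentages(housing)
sample_size = sum(housing.values())

for kind in housing:
    print(f"{kind:<15} sample {sample_pct[kind]:>3}%   census {census_pct[kind]:>3}%")
print(f"(sample size n = {sample_size})")
```

Printing the sample size alongside the percentages follows the advice above. In general, percentages rounded in this way need not add up to exactly 100, although here they do.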
How many groups should I have for a histogram?
In general one should choose enough groups to show the shape of a distribution, but not so many that the shape is lost in the noise. It is partly a matter of aesthetic judgement but, in general, between 5 and 15 groups, depending on the sample size, gives a reasonable picture. Try to keep the intervals (also known as "bin widths") equal. With equal intervals the height of the bars and the area of the bars are both proportional to the number of subjects in the group. With unequal intervals this link is lost, and interpretation of the figure can be difficult.
What is the distinction between a histogram and a bar chart?
Alas, with modern graphics programs the distinction is often lost. A histogram shows the distribution of a continuous variable and, since the variable is continuous, there should be no gaps between the bars. A bar chart shows the distribution of a discrete variable or a categorical one, and so will have spaces between the bars. It is a mistake to use a bar chart to display a summary statistic such as a mean, particularly when it is accompanied by some measure of variation to produce a "dynamite plunger plot". It is better to use a box-whisker plot.
What is the best way to display data?
The general principle should be, as far as possible, to show the original data and to try not to obscure the design of a study in the display. Within the constraints of legibility, show as much information as possible. If data points are matched or from the same patients, link them with lines. When displaying the relationship between two quantitative variables, use a scatter plot in preference to categorising one or both of the variables.
Suggested reading from HyperStat Online Textbook:
Chapter 1. Introduction to Statistics
Chapter 2. Describing Univariate Data
HOMEWORK
1. A teacher wishes to know whether the males in his/her class have more
favorable attitudes toward gun control than do the females. All students
in the class are given a questionnaire about gun control and the mean
responses of the males and the females are compared. Is this an example of
descriptive or inferential statistics?
2. A medical researcher is testing the effectiveness of a new drug
for treating Parkinson's disease. Ten subjects with the disease are given
the new drug and 10 are given a placebo. Improvement in symptomology is
measured. What would be the roles of descriptive and inferential
statistics in the analysis of these data?
3. From the 140 children whose urinary concentration of lead was
investigated 20 were chosen who were aged at least 1 year but under 5
years. The following concentrations of copper (in ) were found.
0.70, 0.45, 0.72, 0.30, 1.16, 0.69, 0.83, 0.74, 1.24, 0.77,
0.65, 0.76, 0.42, 0.94, 0.36, 0.98, 0.64, 0.90, 0.63, 0.55
a) Retaining only the first 2 significant digits, create an ordered stem
and leaf plot with the given data
b) Find the median, range and the first, second and third quartiles.
c) Draw a box-whisker plot.
4. Categorize the following variables as being qualitative or
quantitative, continuous or discrete:
Time spent eating every day
Number of pairs of shoes owned by a person
The tail length of a certain species of mice
Favourite color
Favourite music star
IQ
Quantity of water drunk per day