STATISTICS I: COLLECTING AND ORGANIZING DATA

Objectives:


Statistics is a branch of mathematics that studies techniques for collecting, organizing and interpreting data.

Descriptive Statistics includes procedures used to organize and present data in a convenient, useable and communicable form.

    One important use of statistics is to summarize a collection of data in a clear and understandable way. For example, assume a psychologist gave a personality test measuring shyness to all 2500 students attending a small college. How might these measurements be summarized?
There are two basic methods: numerical and graphical.

    Using the numerical approach one might compute statistics such as the mean and standard deviation. These statistics convey information about the average degree of shyness and the degree to which people differ in shyness.

    Using the graphical approach one might create a stem and leaf display and a box plot. These plots contain detailed information about the distribution of shyness scores.

Graphical methods are better suited than numerical methods for identifying patterns in the data. Numerical approaches are more precise and objective.

Since the numerical and graphical approaches complement each other, it is wise to use both.

Inferential Statistics are used to draw inferences about a population from a sample. Consider an experiment in which 10 subjects who performed a task after 24 hours of sleep deprivation scored 12 points lower than 10 subjects who performed after a normal night's sleep. Is the difference real or could it be due to chance? How much larger could the real difference be than the 12 points found in the sample? These are the types of questions answered by inferential statistics.

There are two main methods used in inferential statistics: estimation and hypothesis testing.


 


Variables

A variable is any measured characteristic or attribute that differs for different subjects. For example, if the weight of 30 subjects were measured, then weight would be a variable.

The first step, before any calculations or plotting of data, is to decide what type of data one is dealing with.
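It is useful to distinguish between two broad types of variables (data): quantitative (or numeric) (for which one asks "how much?") and qualitative (for which one asks "what type?"). Each is broken down into two sub-types: qualitative variables are either nominal or ordinal, and quantitative variables are either discrete or continuous.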


Because qualitative data always have a limited number of alternative values, such variables are also described as discrete. All qualitative data are discrete, while some numeric data are discrete and some are continuous.

For statistical analysis, qualitative data can be converted into discrete numeric data by simply counting the different values that appear.
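
As a rough illustration of this counting step, the sketch below tallies a small list of exam results with Python's standard collections.Counter. The list itself is hypothetical and is not taken from any data set in these notes.

```python
from collections import Counter

# Hypothetical exam results for eight students (illustrative values only).
exam_results = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail"]

# Counting how often each category appears turns the qualitative
# observations into discrete numeric data (one count per category).
counts = Counter(exam_results)

for category, count in sorted(counts.items()):
    print(category, count)
```

The counts, rather than the category labels themselves, are what a later analysis would work with.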

Note: the word "variable" is used in two senses. It can mean an item of data collected on each sampling unit, and it can mean "random variable". A random variable is a variable in the mathematical sense, but one that takes different values according to a probability distribution. The word "variate" is also sometimes used to mean random variable. In Statistics, we use random variables to build probability models for data variables. This makes sense because when data are collected on observational units sampled at random, the values recorded for the data variables can be regarded as realisations of mathematical random variables. 


Qualitative Data

Qualitative data arise when the observations fall into separate distinct categories.

Examples:
Colour of eyes : blue, green, brown etc
Exam result : pass or fail
Socio-economic status : low, middle or high.

Such data are inherently discrete, in that there are a finite number of possible categories into which each observation may fall.

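Qualitative data are classified as nominal, when the categories have no natural order (for example, colour of eyes), or ordinal, when the categories can be ranked (for example, socio-economic status).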

Quantitative Data

Quantitative or numerical data arise when the observations are counts or measurements. The data are said to be discrete if the measurements are integers (e.g. number of people in a household, number of cigarettes smoked per day) and continuous if the measurements can take on any value, usually within some range (e.g. weight).

Quantities such as sex and weight are called variables, because the values of these quantities vary from one observation to another.
Numbers calculated to describe important features of the data are called statistics. For example, (i) the proportion of females, and (ii) the average age of unemployed persons, in a sample of residents of a town are statistics.

The following table shows a part of some (hypothetical) data on a group of 48 subjects.
'Age' and 'income' are continuous numeric variables,
'age group' is an ordinal qualitative variable,
and 'sex' is a nominal qualitative variable.

The ordinal variable 'age group' is created from the continuous variable 'age' using five categories:
age group = 1 if age is less than 20;
age group = 2 if age is 20 to 29;
age group = 3 if age is 30 to 39;
age group = 4 if age is 40 to 49;
age group = 5 if age is 50 or more
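
The coding rule above can be sketched as a small Python function. The function name age_group is an assumption made for this illustration; the ages passed to it are those of the subjects listed in Table 1.

```python
def age_group(age):
    """Return the age group code (1 to 5) for an age in years."""
    if age < 20:
        return 1
    if age < 30:
        return 2
    if age < 40:
        return 3
    if age < 50:
        return 4
    return 5

# Ages of the subjects shown in Table 1 (subjects 1, 2, 3, 47 and 48).
ages = [32, 20, 45, 19, 32]
groups = [age_group(a) for a in ages]
print(groups)
```

Note that the conversion loses information: two subjects aged 20 and 29 receive the same code.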
 

Table 1 - Hypothetical Data
 
 

Subject No.   Age (years)   Age Group   Annual Income (*$10,00)   Sex
1             32            3           41                        F
2             20            2           15                        M
3             45            4           23                        F
...           ....          .....       ....                      ....
...           ...           ...         ....                      ...
47            19            1           0.5                       F
48            32            3           19                        F

Stem and leaf plots

Before any statistical calculation, even the simplest, is performed, the data should be tabulated or plotted. If they are quantitative and relatively few, say up to about 30, they are conveniently written down in order of size.

For example, a pediatric registrar in a district general hospital is investigating the amount of lead in the urine of children from a nearby housing estate. In a particular street there are 15 children whose ages range from 1 year to under 16, and in a preliminary study the registrar has found the following amounts of urinary lead, given in Table 2 (which is called an array):

Table 2

Urinary concentration of lead in 15 children from a housing estate
0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5, 3.2, 1.7, 1.9, 1.9, 2.2

A simple way to order, and also to display, the data is to use a stem and leaf plot. To do this we need to abbreviate the observations to two significant digits. In the case of the urinary concentration data, the digit to the left of the decimal point is the "stem" and the digit to the right the "leaf".

We first write the stems in order down the page. We then work along the data set, writing the leaves down "as they come". Thus, for the first data point, we write a 6 opposite the 0 stem. These are as given in Table 3.

Table 3    Stem and leaf plot "as they come"

Stem  Leaf
0     6 1 4 8
1     1 3 2 5 7 9 9
2     6 0 2
3     2

We then order the leaves, as in Table 4.

Table 4    Ordered stem and leaf plot

Stem  Leaf
0     1 4 6 8
1     1 2 3 5 7 9 9
2     0 2 6
3     2
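
The two steps above (writing the leaves against their stems, then ordering them) can be sketched in Python as follows. The function name stem_and_leaf is an assumption for this illustration, and the sketch only handles values recorded to one decimal place, as in Table 2.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted list of leaves} for values with one decimal place."""
    plot = defaultdict(list)
    for value in values:
        # Work in whole tenths to avoid floating-point surprises:
        # 2.6 -> 26 -> stem 2, leaf 6.
        stem, leaf = divmod(round(value * 10), 10)
        plot[stem].append(leaf)
    # Order the stems, and the leaves within each stem.
    return {stem: sorted(leaves) for stem, leaves in sorted(plot.items())}

# Urinary lead concentrations from Table 2.
lead_urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
              3.2, 1.7, 1.9, 1.9, 2.2]

ordered_plot = stem_and_leaf(lead_urban)
for stem, leaves in ordered_plot.items():
    print(stem, " ".join(str(leaf) for leaf in leaves))
```

One limitation of this sketch is that a stem with no observations is omitted entirely, whereas a hand-drawn plot would normally show it as an empty row.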

The advantage of first setting the figures out in order of size and not simply feeding them straight from notes into a calculator (for example, to find their mean) is that the relation of each to the next can be looked at. Is there a steady progression, a noteworthy hump, a considerable gap? Simple inspection can disclose irregularities. Furthermore, a glance at the figures gives information on their range. The smallest value is 0.1 and the largest is 3.2.


Median

 To find the median (or mid point) we need to identify the point which has the property that half the data are greater than it, and half the data are less than it. For 15 points, the mid point is clearly the eighth largest, so that seven points are less than the median, and seven points are greater than it. This is easily obtained from Table 4 by counting the eighth "leaf", which is 1.5.

To find the median for an even number of points, the procedure is as follows. Suppose the pediatric registrar obtained a further set of 16 urinary lead concentrations from children living in the countryside in the same county as the hospital (Table 5).

Table 5

Urinary concentration of lead in 16 rural children
0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9, 2.0, 2.0, 2.1, 2.8, 3.1, 3.4

To obtain the median we average the eighth and ninth points (1.8 and 1.9) to get 1.85. In general, if n is even, we average the n/2th largest and the (n/2 + 1)th largest observations.
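
The rule for odd and even n can be sketched as a short Python function and checked against Tables 2 and 5. The function and variable names are assumptions made for this illustration.

```python
def median(values):
    """Median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation, at zero-based index n // 2.
        return ordered[mid]
    # Even n: average the (n/2)th and (n/2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Table 2 (15 urban children) and Table 5 (16 rural children).
lead_urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
              3.2, 1.7, 1.9, 1.9, 2.2]
lead_rural = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9,
              2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

print("urban median:", median(lead_urban))
print("rural median:", median(lead_rural))
```

For the urban data the function picks out the eighth ordered value; for the rural data it averages the eighth and ninth, which may print with a small floating-point rounding error.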

The main advantage of using the median as a measure of location is that it is "robust" to outliers. For example, if we had accidentally written 34 rather than 3.4 in Table 5, the median would still have been 1.85. One disadvantage is that it is tedious to order a large number of observations by hand (there is usually no "median" button on a calculator).
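
To make the robustness point concrete, the sketch below takes the rural values of Table 5 (the set that contains 3.4), replaces 3.4 by the mistyped 34, and compares the effect on the mean and on the median. It uses the mean and median functions from Python's standard statistics module; the variable names are assumptions for this illustration.

```python
from statistics import mean, median

# Rural urinary lead concentrations from Table 5.
lead_rural = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9,
              2.0, 2.0, 2.1, 2.8, 3.1, 3.4]

# The same data with the largest value mistyped as 34 instead of 3.4.
lead_rural_typo = lead_rural[:-1] + [34.0]

mean_clean, mean_typo = mean(lead_rural), mean(lead_rural_typo)
median_clean, median_typo = median(lead_rural), median(lead_rural_typo)

print("mean:  ", round(mean_clean, 4), "->", round(mean_typo, 4))
print("median:", round(median_clean, 4), "->", round(median_typo, 4))
```

The single mistyped value roughly doubles the mean, while the median is unchanged because the eighth and ninth ordered values are not affected.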


Data display

The simplest way to show data is a dot plot. Figure 1 shows the data from Tables 2 and 5, together with the median for each set.


 

When the data sets are large, plotting individual points can be cumbersome. An alternative is a box-whisker plot. The box is marked by the first and third quartiles, and the whiskers extend to the range. The median is also marked in the box, as shown in Figure 2.
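
The five numbers needed to draw such a box-whisker plot (minimum, first quartile, median, third quartile, maximum) can be sketched for the Table 2 data as below. Conventions for computing quartiles differ between textbooks and software, so this is only one possible choice: it uses quantiles from Python's standard statistics module with method="exclusive", which is an assumption and not necessarily the convention used for Figure 2.

```python
from statistics import quantiles

# Urinary lead concentrations from Table 2.
lead_urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
              3.2, 1.7, 1.9, 1.9, 2.2]

# n=4 asks for the three cut points that split the data into quarters.
# With the "exclusive" method and 15 observations these fall on the
# 4th, 8th and 12th ordered values.
q1, q2, q3 = quantiles(lead_urban, n=4, method="exclusive")

five_number_summary = (min(lead_urban), q1, q2, q3, max(lead_urban))
print(five_number_summary)
```

The box would span q1 to q3 with a line at q2, and the whiskers would run out to the minimum and maximum.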


Histograms
 

Suppose the paediatric registrar referred to earlier extends the urban study to the entire estate in which the children live. He obtains figures for the urinary lead concentration in 140 children aged over 1 year and under 16. We can display these data as a grouped frequency table (Table 6).

Table 6. Lead concentration in 140 urban children
 
 

Lead Concentration   Number of Children
0-                   2
0.4-                 7
0.8-                 10
1.2-                 16
1.6-                 23
2.0-                 28
2.4-                 19
2.8-                 16
3.2-                 11
3.6-                 7
4.0-                 1
4.4-
Total                140
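
The individual values for the 140 children are not listed here, so the sketch below shows how a grouped frequency table of this kind could be built from raw values by applying the same 0.4-wide intervals to the 15 values of Table 2. The function name grouped_frequencies and its width_tenths parameter are assumptions made for this illustration.

```python
from collections import Counter

def grouped_frequencies(values, width_tenths=4):
    """Count one-decimal values in intervals of width_tenths / 10, starting at 0."""
    # Work in whole tenths so that boundary values such as 2.0
    # fall in the interval that starts at 2.0.
    bins = Counter(round(v * 10) // width_tenths for v in values)
    n_bins = max(bins) + 1
    # Each entry is (lower limit of the interval, number of observations).
    return [(i * width_tenths / 10, bins[i]) for i in range(n_bins)]

# Urinary lead concentrations from Table 2.
lead_urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
              3.2, 1.7, 1.9, 1.9, 2.2]

table = grouped_frequencies(lead_urban)
for lower, count in table:
    print(f"{lower:.1f}-  {count}")
print("Total", sum(count for _, count in table))
```

Intervals with no observations are kept with a count of zero, so that a histogram drawn from the table shows the gap.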

Figure 3 shows the histogram of data from Table 6.




 

Bar charts

Suppose, of the 140 children, 20 lived in owner occupied houses, 70 lived in council houses and 50 lived in private rented accommodation. Figures from the census suggest that for this age group, throughout the county, 50% live in owner occupied houses, 30% in council houses, and 20% in private rented accommodation. Type of accommodation is a categorical variable, which can be displayed in a bar chart. We first express our data as percentages:

14% owner occupied, 50% council house, 36% private rented. We then display the data as a bar chart. The sample size should always be given (Figure 4).
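
The conversion from counts to percentages can be checked with a few lines of Python. The sketch below also prints a crude text version of the comparison with the census figures; the variable names and the text layout are assumptions for this illustration, not a reproduction of Figure 4.

```python
# Housing counts for the 140 children, as given in the text.
housing_counts = {"owner occupied": 20, "council house": 70, "private rented": 50}
# Census percentages for the same age group across the county.
census_percentages = {"owner occupied": 50, "council house": 30, "private rented": 20}

total = sum(housing_counts.values())
percentages = {kind: round(100 * count / total)
               for kind, count in housing_counts.items()}

print(f"Sample size: {total}")
for kind, pct in percentages.items():
    census = census_percentages[kind]
    # One '#' per two percentage points keeps the text bars short.
    print(f"{kind:<15} sample {pct:>3}% {'#' * (pct // 2)}")
    print(f"{'':<15} census {census:>3}% {'#' * (census // 2)}")
```

Printing the sample size alongside the percentages follows the advice above that it should always be given.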

Figure 4 Bar chart of housing data for 140 children and comparable census data

 
 



Common questions

How many groups should I have for a histogram?

In general one should choose enough groups to show the shape of a distribution, but not so many that the shape is lost in the noise. It is partly a matter of aesthetic judgement but, in general, between 5 and 15 groups, depending on the sample size, gives a reasonable picture. Try to keep the intervals (also known as "bin widths") equal. With equal intervals the height of the bars and the area of the bars are both proportional to the number of subjects in the group. With unequal intervals this link is lost, and interpretation of the figure can be difficult.

What is the distinction between a histogram and a bar chart?

Alas, with modern graphics programs the distinction is often lost. A histogram shows the distribution of a continuous variable and, since the variable is continuous, there should be no gaps between the bars. A bar chart shows the distribution of a discrete variable or a categorical one, and so will have spaces between the bars. It is a mistake to use a bar chart to display a summary statistic such as a mean, particularly when it is accompanied by some measure of variation to produce a "dynamite plunger plot". It is better to use a box-whisker plot.

What is the best way to display data?

The general principle should be, as far as possible, to show the original data and to try not to obscure the design of a study in the display. Within the constraints of legibility, show as much information as possible. If data points are matched or from the same patients, link them with lines. When displaying the relationship between two quantitative variables, use a scatter plot in preference to categorising one or both of the variables.


Suggested reading from HyperStat Online Textbook:

Chapter 1. Introduction to Statistics
Chapter 2. Describing Univariate Data



HOMEWORK

1. A teacher wishes to know whether the males in his/her class have more favorable attitudes toward gun control than do the females. All students in the class are given a questionnaire about gun control and the mean responses of the males and the females are compared. Is this an example of descriptive or inferential statistics?

2. A medical researcher is testing the effectiveness of a new drug for treating Parkinson's disease. Ten subjects with the disease are given the new drug and 10 are given a placebo. Improvement in symptomology is measured. What would be the roles of descriptive and inferential statistics in the analysis of these data?

3. From the 140 children whose urinary concentrations of lead were investigated, 20 were chosen who were aged at least 1 year but under 5 years. The following concentrations of copper (in ) were found.

0.70, 0.45, 0.72, 0.30, 1.16, 0.69, 0.83, 0.74, 1.24, 0.77,

0.65, 0.76, 0.42, 0.94, 0.36, 0.98, 0.64, 0.90, 0.63, 0.55

a) Retaining only the first 2 significant digits, create an ordered stem and leaf plot with the given data.
b) Find the median, the range, and the first, second and third quartiles.
c) Draw a box-whisker plot.

4. Categorize the following variables as being qualitative or quantitative, continuous or discrete:

Time spent eating every day
Number of pairs of shoes owned by a person
The tail length of a certain species of mice
Favourite color
Favourite music star
IQ
Quantity of water drunk per day