Changes

528 bytes removed ,  09:49, 4 January 2013
no edit summary
Line 1: Line 1: −
             
     −
  −
  −
  −
'''Statistics'''
  −
  −
  −
  −
  −
  −
   
= Introduction =
 
= Introduction =
 
   
 
   
Line 70: Line 59:  
   
 
   
 
=== Descriptive and Inferential Statistics ===
 
=== Descriptive and Inferential Statistics ===
  −
  −
  −
   
   
 
   
 
When analysing data, for example, the marks
 
When analysing data, for example, the marks
Line 1,698: Line 1,683:     
   
 
   
 +
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]]
 +
       
   
 
   
 +
36 25 38 46 55 68
 +
72 55 36 38
    +
 +
67 45 22 48 91 46
 +
52 61 58 55
    
   
 
   
       +
 +
=== How do you construct a histogram from a continuous variable? ===
 
   
 
   
       
   
 
   
 
+
To construct a
 +
histogram from a continuous variable you first need to split the data
 +
into intervals, called bins. In the example above, age has been split
 +
into bins, with each bin representing a 10-year period starting at 20
 +
years. Each bin contains the number of occurrences of scores in the
 +
data set that are contained within that bin. For the above data set,
 +
the frequencies in each bin have been tabulated along with the scores
 +
that contributed to the frequency in each bin (see below):
    
   
 
   
Line 1,716: Line 1,717:     
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]]
+
Bin Frequency Scores
 +
Included in Bin
    
   
 
   
 
+
20-30 2 25,22
    
   
 
   
 
+
30-40 4 36,38,36,38
    
   
 
   
 
+
40-50 4 46,45,48,46
    
   
 
   
 
+
50-60 5 55,55,52,58,55
    
   
 
   
 
+
60-70 3 68,67,61
    
   
 
   
 
+
70-80 1 72
    
   
 
   
 
+
80-90 0 -
    
   
 
   
 
+
90-100 1 91
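The tabulation above can be reproduced programmatically. The following is a minimal Python sketch and is not part of the original page. The function name `tabulate_bins` is illustrative, and it assumes that a score falling exactly on a bin boundary is placed in the upper bin.

```python
def tabulate_bins(scores, start, width, n_bins):
    """Group scores into n_bins equal-width bins beginning at `start`.

    Assumption: each bin covers lower <= score < lower + width,
    so a score on a boundary is counted in the upper bin.
    """
    bins = [[] for _ in range(n_bins)]
    for score in scores:
        index = (score - start) // width
        if 0 <= index < n_bins:
            bins[index].append(score)
    return bins


# The 20 scores listed earlier on this page.
scores = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

table = tabulate_bins(scores, start=20, width=10, n_bins=8)
for i, members in enumerate(table):
    lower = 20 + 10 * i
    print(f"{lower}-{lower + 10}", len(members), members)
```

Each printed row gives the bin, its frequency and the scores that fall in it, in the same layout as the table above.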
    
   
 
   
Line 1,746: Line 1,748:     
   
 
   
 
+
Notice that, unlike a
 +
bar chart, there are no "gaps" between the bars (although
 +
some bars might be "absent" reflecting no frequencies).
 +
This is because a histogram represents a continuous data set, and as
 +
such, there are no gaps in the data. (Although you will have to
 +
decide whether you round up or round down scores on the boundaries of
 +
bins)
    
   
 
   
Line 1,752: Line 1,760:     
   
 
   
 
+
=== Choosing the correct bin width ===
 
   
   
 
   
       
   
 
   
 +
There is no right or
 +
wrong answer as to how wide a bin should be, but there are rules of
 +
thumb. You need to make sure that the bins are not too small or too
 +
large. Consider the histogram we produced earlier (see above): the
 +
following histograms use the same data but have either much smaller
 +
or larger bins, as shown below:
   −
 
+
 
 +
[[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]]
 
   
 
   
 
+
We can see from the
 
+
histogram on the left, that the bin width is too small as it shows
+
too much individual data and does not allow the underlying pattern
36 25 38 46 55 68
+
(frequency distribution) of the data to be easily seen. At the other
72 55 36 38
+
end of the scale, is the diagram on the right, where the bins are too
 +
large and, again, we are unable to find the underlying trend in the
 +
data.
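To see the effect of bin width without the images, the short Python sketch below prints a rough text histogram of the same 20 scores for three bin widths. The widths 5 and 40 are assumptions chosen for illustration (the widths used in the figure are not stated), and `bin_counts` is a hypothetical helper name.

```python
def bin_counts(scores, start, stop, width):
    """Count scores in equal-width bins covering start <= score < stop."""
    counts = [0] * ((stop - start) // width)
    for score in scores:
        if start <= score < stop:
            counts[(score - start) // width] += 1
    return counts


scores = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

for width in (5, 10, 40):
    print(f"bin width {width}")
    for i, count in enumerate(bin_counts(scores, 20, 100, width)):
        lower = 20 + i * width
        # One '#' per score in the bin.
        print(f"  {lower:>3}-{lower + width:<3} {'#' * count}")
```

With a width of 5 the output is ragged and several bins are empty; with a width of 40 only two bars remain and the shape of the distribution is lost.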
    
   
 
   
67 45 22 48 91 46
+
Histograms are based on
52 61 58 55
+
area not height of bars
    
   
 
   
Line 1,775: Line 1,791:     
   
 
   
=== How do you construct a histogram from a continuous variable? ===
+
In a histogram, it is
 +
the area of the bar that indicates the frequency of occurrences for
 +
each bin. This means that the height of the bar does not necessarily
 +
indicate how many occurrences of scores there were within each
 +
individual bin. It is the product of height multiplied by the width
 +
of the bin that indicates the frequency of occurrences within that
 +
bin. One reason the height of the bars is often
 +
mistaken for the frequency, rather than the area of the
 +
bar, is that many histograms have bins of
 +
equal width and, under these circumstances, the height of the
 +
bin does reflect the frequency.
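A small worked example may help. The Python sketch below is an illustration only: it takes the frequencies tabulated earlier and assumes, hypothetically, that the last four bins are merged into a single bin from 60 to 100. The bar height is then the frequency divided by the bin width (often called the frequency density), so that height multiplied by width gives back the frequency.

```python
import math

# (lower, upper, frequency); the 60-100 bin is a hypothetical merge of
# the 60-70, 70-80, 80-90 and 90-100 bins from the table (3 + 1 + 0 + 1 = 5).
bins = [(20, 30, 2), (30, 40, 4), (40, 50, 4), (50, 60, 5), (60, 100, 5)]

heights = []
for lower, upper, frequency in bins:
    width = upper - lower
    height = frequency / width          # frequency density
    heights.append(height)
    area = height * width               # recovers the frequency
    assert math.isclose(area, frequency)
    print(f"{lower}-{upper}: width={width}, height={height}, area={area}")
```

The 50-60 bin and the 60-100 bin both contain 5 scores, but the wider bin is drawn only a quarter as tall; the two bars have the same area.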
 +
 
 
   
 
   
       
   
 
   
To construct a
+
=== What is the difference between a bar chart and a histogram? ===
histogram from a continuous variable you first need to split the data
  −
into intervals, called bins. In the example above, age has been split
  −
into bins, with each bin representing a 10-year period starting at 20
  −
years. Each bin contains the number of occurrences of scores in the
  −
data set that are contained within that bin. For the above data set,
  −
the frequencies in each bin have been tabulated along with the scores
  −
that contributed to the frequency in each bin (see below):
  −
 
   
   
 
   
 
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6dfca87b.png]]
    
   
 
   
Bin Frequency Scores
+
The major difference is
Included in Bin
+
that a histogram is only used to plot the frequency of score
 +
occurrences in a continuous data set that has been divided into
 +
classes, called bins. Bar charts, on the other hand, can be used for
 +
many other types of variables, including ordinal and
 +
nominal data sets.
    
   
 
   
20-30 2 25,22
+
 
    
   
 
   
30-40 4 36,38,36,38
+
 
    
   
 
   
40-50 4 46,45,48,46
+
 
    
   
 
   
50-60 5 55,55,52,58,55
+
== Circle or Pie Chart ==
 
   
   
 
   
60-70 3 68,67,61
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_461389d1.png]]
 +
Charts like this are called circle graphs (or pie charts). A circle graph shows the relationship
 +
between a whole and its parts. Here, the whole circle is divided into
 +
sectors. The size of each sector is proportional to the activity or
 +
information it represents.
    
   
 
   
70-80 1 72
     −
  −
80-90 0 -
      
   
 
   
90-100 1 91
+
A variety of graphical
 +
representations of data are now possible using spreadsheet software.
 +
OpenOffice Calc can convert a table of data into bar charts, pie
 +
charts, area charts and so on, making the data much easier to
 +
read/interpret.
    
   
 
   
       +
 
 +
== Activities ==
 +
 +
=== Activity 2: Histogram and Bar Chart ===
 
   
 
   
Notice that, unlike a
+
==== Learning Objectives ====
bar chart, there are no "gaps" between the bars (although
  −
some bars might be "absent" reflecting no frequencies).
  −
This is because a histogram represents a continuous data set, and as
  −
such, there are no gaps in the data. (Although you will have to
  −
decide whether you round up or round down scores on the boundaries of
  −
bins)
  −
 
   
   
 
   
 
+
Learn to draw a histogram and bar chart.
 +
Understand the difference between a bar chart and a histogram and be
 +
able to select the appropriate chart by looking at the problem and
 +
data.
    
   
 
   
=== Choosing the correct bin width ===
+
==== Materials and Resources Required ====
 
   
 
   
 
+
Paper and Pencil
    
   
 
   
There is no right or
+
==== Pre-requisites/ Instructions ====
wrong answer as to how wide a bin should be, but there are rules of
  −
thumb. You need to make sure that the bins are not too small or too
  −
large. Consider the histogram we produced earlier (see above): the
  −
following histograms use the same data but have either much smaller
  −
or larger bins, as shown below:
  −
 
   
   
 
   
 
+
==== Method ====
 
   
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]]
+
Solve the problems A and B
    
   
 
   
   −
  −
         
   
 
   
We can see from the
+
A> In the past year, you have recorded the
histogram on the left, that the bin width is too small as it shows
+
number of tickets that a movie theater has sold during each month.
too much individual data and does not allow the underlying pattern
+
To represent this data set graphically, would you construct a bar
(frequency distribution) of the data to be easily seen. At the other
+
graph or a histogram? Why is this choice better than the other?
end of the scale, is the diagram on the right, where the bins are too
+
Using the following data, construct the graph that you choose.
large and, again, we are unable to find the underlying trend in the
  −
data.
     −
+
                                                       
Histograms are based on
+
{| border="1"
area not height of bars
+
|-
 +
|
 +
Month
    
   
 
   
 
+
|
 +
Number of Tickets Sold
    
   
 
   
In a histogram, it is
+
|-
the area of the bar that indicates the frequency of occurrences for
+
|
each bin. This means that the height of the bar does not necessarily
+
January
indicate how many occurrences of scores there were within each
  −
individual bin. It is the product of height multiplied by the width
  −
of the bin that indicates the frequency of occurrences within that
  −
bin. One of the reasons that the height of the bars is often
  −
incorrectly assessed as indicating frequency and not the area of the
  −
bar is due to the fact that a lot of histograms often have equally
  −
spaced bars (bins) and, under these circumstances, the height of the
  −
bin does reflect the frequency.
      
   
 
   
 
+
|
 +
25
    
   
 
   
=== What is the difference between a bar chart and a histogram? ===
+
|-
 +
|
 +
February
 +
 
 
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6dfca87b.png]]
+
|
 +
20
    
   
 
   
The major difference is
+
|-
that a histogram is only used to plot the frequency of score
+
|
occurrences in a continuous data set that has been divided into
+
March
classes, called bins. Bar charts, on the other hand, can be used for
  −
a great deal of other types of variables including ordinal and
  −
nominal data sets.
      
   
 
   
 +
|
 +
15
    +
 +
|-
 +
|
 +
April
    
   
 
   
 +
|
 +
20
    +
 +
|-
 +
|
 +
May
    
   
 
   
 +
|
 +
30
    +
 +
|-
 +
|
 +
June
    
   
 
   
== Circle or Pie Chart ==
+
|
 +
35
 +
 
 
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_461389d1.png]]These
+
|-
are called circle graphs. A circle graph shows the relationship
+
|
between a whole and its parts. Here, the whole circle is divided into
+
July
sectors. The size of each sector is proportional to the activity or
  −
information it represents.
      
   
 
   
 +
|
 +
40
    +
 +
|-
 +
|
 +
August
    
   
 
   
A variety of graphical
+
|
representations of data are now possible using spreadsheet software.
+
20
OpenOffice CALC can convert a table of data into bar charts, pie
  −
charts, area charts etc and make data much more easy to
  −
read/interpret.
      
   
 
   
 +
|-
 +
|
 +
September
    +
 +
|
 +
25
   −
 
  −
== Activities ==
   
   
 
   
=== Activity 2: Histogram and Bar Chart ===
+
|-
 +
|
 +
October
 +
 
 
   
 
   
==== Learning Objectives ====
+
|
+
15
Learn to draw a histogram and bar chart.
  −
Understand the difference between a bar chart and a histogram and be
  −
able to select the appropriate chart by looking at the problem and
  −
data.
      
   
 
   
==== Materials and Resources Required ====
  −
  −
Paper and Pencil
  −
  −
  −
==== Pre-requisites/ Instructions ====
  −
  −
==== Method ====
  −
  −
Solve the problems A and B
  −
  −
  −
  −
  −
  −
  −
A> In the past year, you have recorded the
  −
number of tickets that a movie theater has sold during each month.
  −
To represent this data set graphically, would you construct a bar
  −
graph or a histogram? Why is this choice better than the other?
  −
Using the following data, construct the graph that you choose.
  −
  −
                                                       
  −
{| border="1"
   
|-
 
|-
 
|  
 
|  
Month
+
November
    
   
 
   
 
|  
 
|  
Number of Tickets Sold
+
20
    
   
 
   
 
|-
 
|-
 
|  
 
|  
January
+
December
    
   
 
   
 
|  
 
|  
25
+
30
    
   
 
   
|-
+
|}
|
+
 
February
+
 
 +
 +
 
    
   
 
   
|
+
B> For a recent
20
+
science project, you collected data regarding the distribution of
 +
fish and aquatic life in a nearby pond. Your data consists of the
 +
number of living creatures found in each 1 meter depth increment in
 +
the pond. Construct a bar graph and several histograms (vary the
 +
depth increment size) for the following data. In which case(s) is the
 +
histogram the same as the bar graph? How do the other histograms vary
 +
from the bar graph?
    
   
 
   
 +
 +
 +
                                               
 +
{| border="1"
 
|-
 
|-
 
|  
 
|  
March
+
'''Depth Range'''
    
   
 
   
 
|  
 
|  
15
+
'''Number of Living Creatures '''
    
   
 
   
 
|-
 
|-
 
|  
 
|  
April
+
0 – 1 meters
    
   
 
   
 
|  
 
|  
20
+
10
    
   
 
   
 
|-
 
|-
 
|  
 
|  
May
+
1 – 2 meters
    
   
 
   
 
|  
 
|  
30
+
93
    
   
 
   
 
|-
 
|-
 
|  
 
|  
June
+
2 – 3 meters
    
   
 
   
 
|  
 
|  
35
+
23
    
   
 
   
 
|-
 
|-
 
|  
 
|  
July
+
3 – 4 meters
    
   
 
   
 
|  
 
|  
40
+
47
    
   
 
   
 
|-
 
|-
 
|  
 
|  
August
+
4 – 5 meters
    
   
 
   
 
|  
 
|  
20
+
68
    
   
 
   
 
|-
 
|-
 
|  
 
|  
September
+
5 – 6 meters
    
   
 
   
 
|  
 
|  
25
+
51
    
   
 
   
 
|-
 
|-
 
|  
 
|  
October
+
6 – 7 meters
    
   
 
   
 
|  
 
|  
15
+
43
    
   
 
   
 
|-
 
|-
 
|  
 
|  
November
+
7 – 8 meters
    
   
 
   
 
|  
 
|  
20
+
21
    
   
 
   
 
|-
 
|-
 
|  
 
|  
December
+
8 – 9 meters
    
   
 
   
 
|  
 
|  
30
+
15
    
   
 
   
|}
+
|-
 
+
|
 +
9 – 10 meters
    
   
 
   
 
+
|
 +
8
    
   
 
   
B> For a recent
+
|}
science project, you collected data regarding the distribution of
+
==== Evaluation ====
fish and aquatic life in a nearby pond. Your data consists of the
+
number of living creatures found in each 1 meter depth increment in
+
# Does the student understand the difference between a bar chart and a histogram?
the pond. Construct a bar graph and several histograms (vary the
+
# Does the student know when to use each of these charts, depending on whether the data are continuous or discrete?
depth increment size) for the following data. In which case(s) is the
+
histogram the same as the bar graph? How do the other histograms vary
+
== Evaluation ==
from the bar graph?
+
 +
== Self-Evaluation ==
 +
 +
== Further Explorations ==
 +
 +
=== Types of Variables ===
 +
 +
All experiments examine some kind of variable(s).
 +
A variable is not only something that we measure, but also something
 +
that we can manipulate and something we can control for. To
 +
understand the characteristics of variables and how we use them in
 +
research, this guide is divided into three main sections. First, we
 +
illustrate the role of dependent and independent variables. Second,
 +
we discuss the difference between experimental and non-experimental
 +
research. Finally, we explain how variables can be characterised as
 +
either categorical or continuous.
    +
 +
=== Dependent and Independent Variables ===
 
   
 
   
      −
                                               
  −
{| border="1"
  −
|-
  −
|
  −
'''Depth Range'''
      
   
 
   
|
+
An independent variable, sometimes called an
'''Number of Living Creatures '''
+
experimental or predictor variable, is a variable that is being
 
+
manipulated in an experiment in order to observe the effect on a
 +
dependent variable, sometimes called an outcome variable.
 +
 
 
   
 
   
|-
+
 
|
+
 
0 – 1 meters
      
   
 
   
|
+
Imagine that a tutor asks 100 students to complete
10
+
a maths test. The tutor wants to know why some students perform
 +
better than others. Whilst the tutor does not know the answer to
 +
this, she thinks that it might be because of two reasons: (1) some
 +
students spend more time revising for their test; and (2) some
 +
students are naturally more intelligent than others. As such, the
 +
tutor decides to investigate the effect of revision time and
 +
intelligence on the test performance of the 100 students. The
 +
dependent and independent variables for the study are:
    
   
 
   
|-
+
 
|
+
 
1 – 2 meters
      
   
 
   
|
+
Dependent Variable: Test Mark (measured from 0 to
93
+
100)
    
   
 
   
|-
  −
|
  −
2 – 3 meters
     −
+
 
|
  −
23
      
   
 
   
|-
+
Independent Variables: Revision time (measured in
|
+
hours) Intelligence (measured using IQ score)
3 – 4 meters
      
   
 
   
|
  −
47
     −
  −
|-
  −
|
  −
4 – 5 meters
     −
  −
|
  −
68
      
   
 
   
|-
+
The dependent variable is simply that, a variable
|
+
that is dependent on an independent variable(s). For example, in our
5 – 6 meters
+
case the test mark that a student achieves is dependent on revision
 +
time and intelligence. Whilst revision time and intelligence (the
 +
independent variables) may (or may not) cause a change in the test
 +
mark (the dependent variable), the reverse is implausible; in other
 +
words, whilst the number of hours a student spends revising and the
 +
higher a student's IQ score may (or may not) change the test mark
 +
that a student achieves, a change in a student's test mark has no
 +
bearing on whether a student revises more or is more intelligent
 +
(this simply doesn't make sense).
    
   
 
   
|
  −
51
     −
  −
|-
  −
|
  −
6 – 7 meters
     −
  −
|
  −
43
      
   
 
   
|-
+
Therefore, the aim of the tutor's investigation is
|
+
to examine whether these independent variables - revision time and IQ
7 – 8 meters
+
- result in a change in the dependent variable, the students' test
 +
scores. However, it is also worth noting that whilst this is the main
 +
aim of the experiment, the tutor may also be interested to know if
 +
the independent variables - revision time and IQ - are also connected
 +
in some way.
    
   
 
   
|
+
 
21
+
 
    
   
 
   
|-
+
In the section on experimental and
|
+
non-experimental research that follows, we find out a little more
8 – 9 meters
+
about the nature of independent and dependent variables.
    
   
 
   
|
+
=== Experimental and Non-Experimental Research ===
15
+
   −
  −
|-
  −
|
  −
9 – 10 meters
     −
  −
|
  −
8
      
   
 
   
|}
+
Experimental research: In experimental research,
==== Evaluation ====
+
the aim is to manipulate an independent variable(s) and then examine
+
the effect that this change has on a dependent variable(s). Since it
# Does the student understand the difference between a bar chart and a histogram?
+
is possible to manipulate the independent variable(s), experimental
# Does the student know when to use each of these charts, depending on whether the data are continuous or discrete?
+
research has the advantage of enabling a researcher to identify a
+
cause and effect between variables. For example, take our example of
== Evaluation ==
+
100 students completing a maths exam where the dependent variable was
+
the exam mark (measured from 0 to 100) and the independent variables
== Self-Evaluation ==
+
were revision time (measured in hours) and intelligence (measured
+
using IQ score). Here, it would be possible to use an experimental
== Further Explorations ==
+
design and manipulate the revision time of the students. The tutor
+
could divide the students into two groups, each made up of 50
=== Types of Variables ===
+
students. In "group one", the tutor could ask the students
+
not to do any revision. Alternatively, "group two" could be
All experiments examine some kind of variable(s).
+
asked to do 20 hours of revision in the two weeks prior to the test.
A variable is not only something that we measure, but also something
+
The tutor could then compare the marks that the students achieved.
that we can manipulate and something we can control for. To
  −
understand the characteristics of variables and how we use them in
  −
research, this guide is divided into three main sections. First, we
  −
illustrate the role of dependent and independent variables. Second,
  −
we discuss the difference between experimental and non-experimental
  −
research. Finally, we explain how variables can be characterised as
  −
either categorical or continuous.
      
   
 
   
=== Dependent and Independent Variables ===
+
Non-experimental research: In non-experimental
 +
research, the researcher does not manipulate the independent
 +
variable(s). This is not to say that it is impossible to do so, but
 +
it will either be impractical or unethical to do so. For example, a
 +
researcher may be interested in the effect of illegal, recreational
 +
drug use (the independent variable(s)) on certain types of behaviour
 +
(the dependent variable(s)). However, whilst possible, it would be
 +
unethical to ask individuals to take illegal drugs in order to study
 +
what effect this had on certain behaviours. As such, a researcher
 +
could ask both drug and non-drug users to complete a questionnaire
 +
that had been constructed to indicate the extent to which they
 +
exhibited certain behaviours. Whilst it is not possible to identify
 +
the cause and effect between the variables, we can still examine the
 +
association or relationship between them. In addition to understanding
 +
the difference between dependent and independent variables, and
 +
experimental and non-experimental research, it is also important to
 +
understand the different characteristics amongst variables. This is
 +
discussed next.
 +
 
 
   
 
   
   Line 2,241: Line 2,279:     
   
 
   
An independent variable, sometimes called an
+
=== Categorical and Continuous Variables ===
experimental or predictor variable, is a variable that is being
  −
manipulated in an experiment in order to observe the effect on a
  −
dependent variable, sometimes called an outcome variable.
  −
 
   
   
 
   
   Line 2,251: Line 2,285:     
   
 
   
Imagine that a tutor asks 100 students to complete
+
Categorical variables are also known as discrete
a maths test. The tutor wants to know why some students perform
+
or qualitative variables. Categorical variables can be further
better than others. Whilst the tutor does not know the answer to
+
categorized as either '''nominal, ordinal or dichotomous'''.
this, she thinks that it might be because of two reasons: (1) some
  −
students spend more time revising for their test; and (2) some
  −
students are naturally more intelligent than others. As such, the
  −
tutor decides to investigate the effect of revision time and
  −
intelligence on the test performance of the 100 students. The
  −
dependent and independent variables for the study are:
      
   
 
   
Line 2,266: Line 2,294:     
   
 
   
Dependent Variable: Test Mark (measured from 0 to
+
'''Nominal variables''' are variables that have
100)
+
two or more categories but which do not have an intrinsic order. For
 +
example, a real estate agent could classify their types of property
 +
into distinct categories such as houses, condos, co-ops or bungalows.
 +
So "type of property" is a nominal variable with 4
 +
categories called houses, condos, co-ops and bungalows. Of note, the
 +
different categories of a nominal variable can also be referred to as
 +
groups or levels of the nominal variable. Another example of a
 +
nominal variable would be classifying where people live in Karnataka
 +
by district. In this case there will be many more levels of the
 +
nominal variable (30 in fact).
    
   
 
   
 
+
'''Dichotomous variables''' are nominal
 
+
variables which have only two categories or levels. For example, if
 +
we were looking at gender, we would most probably categorize somebody
 +
as either "male" or "female". This is an example
 +
of a dichotomous variable (and also a nominal variable). Another
 +
example might be if we asked a person if they owned a mobile phone.
 +
Here, we may categorise mobile phone ownership as either "Yes"
 +
or "No". In the real estate agent example, if type of
 +
property had been classified as either residential or commercial then
 +
"type of property" would be a dichotomous variable.
    
   
 
   
Independent Variables: Revision time (measured in
+
'''Ordinal variables''' are variables that have
hours) Intelligence (measured using IQ score)
+
two or more categories, just like nominal variables, except that the
 +
categories can also be ordered or ranked. So if you asked someone if
 +
they liked the policies of the Democratic Party and they could answer
 +
either "Not very much", "They are OK" or "Yes,
 +
a lot" then you have an ordinal variable. Why? Because you have
 +
3 categories, namely "Not very much", "They are OK"
 +
and "Yes, a lot" and you can rank them from the most
 +
positive (Yes, a lot), to the middle response (They are OK), to the
 +
least positive (Not very much). However, whilst we can rank the
 +
levels, we cannot place a "value" to them; we cannot say
 +
that "They are OK" is twice as positive as "Not very
 +
much" for example.
    
   
 
   
Line 2,282: Line 2,338:     
   
 
   
The dependent variable is simply that, a variable
+
Continuous variables are also known as
that is dependent on an independent variable(s). For example, in our
+
quantitative variables. Continuous variables can be further
case the test mark that a student achieves is dependent on revision
+
categorized as either interval or ratio variables.
time and intelligence. Whilst revision time and intelligence (the
  −
independent variables) may (or may not) cause a change in the test
  −
mark (the dependent variable), the reverse is implausible; in other
  −
words, whilst the number of hours a student spends revising and the
  −
higher a student's IQ score may (or may not) change the test mark
  −
that a student achieves, a change in a student's test mark has no
  −
bearing on whether a student revises more or is more intelligent
  −
(this simply doesn't make sense).
      
   
 
   
Line 2,299: Line 2,347:     
   
 
   
Therefore, the aim of the tutor's investigation is
+
'''Interval variables''' are variables whose
to examine whether these independent variables - revision time and IQ
+
central characteristic is that they can be measured along a
- result in a change in the dependent variable, the students' test
+
continuum and they have a numerical value (for example, temperature
scores. However, it is also worth noting that whilst this is the main
+
measured in degrees Celsius or Fahrenheit). So the difference between
aim of the experiment, the tutor may also be interested to know if
+
20 °C and 30 °C is the same as that between 30 °C and 40 °C. However, temperature measured
the independent variables - revision time and IQ - are also connected
+
in degrees Celsius or Fahrenheit is NOT a ratio variable.
in some way.
      
   
 
   
 
+
'''Ratio variables''' are interval variables but
 
+
with the added condition that 0 (zero) of the measurement indicates
 
+
that there is none of that variable. So, temperature measured in
+
degrees Celsius or Fahrenheit is not a ratio variable because 0C does
In the section on experimental and
+
not mean there is no temperature. However, temperature measured in
non-experimental research that follows, we find out a little more
+
Kelvin is a ratio variable as 0 Kelvin (often called absolute zero)
about the nature of independent and dependent variables.
+
indicates that there is no temperature whatsoever. Other examples of
 +
ratio variables include height, mass, distance and many more. The
 +
name "ratio" reflects the fact that you can use the ratio
 +
of measurements. So, for example, a distance of ten metres is twice
 +
the distance of 5 metres.
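The difference between interval and ratio variables can be checked numerically. The Python sketch below is an illustration only; the helper name `celsius_to_kelvin` is an assumption. It shows that a ratio of distances is the same whatever the unit, whereas a "ratio" of Celsius temperatures changes when the same temperatures are expressed in Kelvin.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15


# Ratio variable: the ratio of two distances does not depend on the unit.
ratio_metres = 10 / 5
ratio_centimetres = (10 * 100) / (5 * 100)

# Interval variable: 40 °C looks like "twice" 20 °C, but the same two
# temperatures in Kelvin have a ratio of only about 1.07.
ratio_celsius = 40 / 20
ratio_kelvin = celsius_to_kelvin(40) / celsius_to_kelvin(20)

print(ratio_metres, ratio_centimetres)
print(ratio_celsius, round(ratio_kelvin, 3))
```

Because the zero of the Celsius scale is arbitrary, only differences between Celsius values are meaningful, not their ratios.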
   −
  −
=== Experimental and Non-Experimental Research ===
   
   
 
   
   Line 2,323: Line 2,372:     
   
 
   
Experimental research: In experimental research,
+
=== Ambiguities in classifying a type of variable ===
the aim is to manipulate an independent variable(s) and then examine
+
the effect that this change has on a dependent variable(s). Since it
+
 
is possible to manipulate the independent variable(s), experimental
+
 
research has the advantage of enabling a researcher to identify a
  −
cause and effect between variables. For example, take our example of
  −
100 students completing a maths exam where the dependent variable was
  −
the exam mark (measured from 0 to 100) and the independent variables
  −
were revision time (measured in hours) and intelligence (measured
  −
using IQ score). Here, it would be possible to use an experimental
  −
design and manipulate the revision time of the students. The tutor
  −
could divide the students into two groups, each made up of 50
  −
students. In "group one", the tutor could ask the students
  −
not to do any revision. Alternatively, "group two" could be
  −
asked to do 20 hours of revision in the two weeks prior to the test.
  −
The tutor could then compare the marks that the students achieved.
      
   
 
   
Non-experimental research: In non-experimental
+
In some cases, the measurement scale for data is
research, the researcher does not manipulate the independent
+
ordinal but the variable is treated as continuous. For example, a
variable(s). This is not to say that it is impossible to do so, but
+
Likert scale that contains five values - strongly agree, agree,
it will either be impractical or unethical to do so. For example, a
+
neither agree nor disagree, disagree, and strongly disagree - is
researcher may be interested in the effect of illegal, recreational
+
ordinal. However, where a Likert scale contains seven or more values -
drug use (the independent variable(s)) on certain types of behaviour
+
strongly agree, moderately agree, agree, neither agree nor disagree,
(the dependent variable(s)). However, whilst possible, it would be
+
disagree, moderately disagree, and strongly disagree - the underlying
unethical to ask individuals to take illegal drugs in order to study
+
scale is sometimes treated as continuous, although whether you should do
what effect this had on certain behaviours. As such, a researcher
+
this is a cause of great dispute.
could ask both drug and non-drug users to complete a questionnaire
  −
that had been constructed to indicate the extent to which they
  −
exhibited certain behaviours. Whilst it is not possible to identify
  −
the cause and effect between the variables, we can still examine the
  −
association or relationship between them. In addition to understanding
  −
the difference between dependent and independent variables, and
  −
experimental and non-experimental research, it is also important to
  −
understand the different characteristics amongst variables. This is
  −
discussed next.
      
   
 
   
Line 2,364: Line 2,392:       −
  −
=== Categorical and Continuous Variables ===
   
   
 
   
   Line 2,371: Line 2,397:     
   
 
   
Categorical variables are also known as discrete
+
== Enrichment Activities ==
or qualitative variables. Categorical variables can be further
+
categorized as either '''nominal, ordinal or dichotomous'''.
+
= Central tendency =
 
   
   
 
   
 
+
== Introduction ==
 
  −
 
   
   
 
   
'''Nominal variables''' are variables that have
+
A measure of central tendency is a single value
two or more categories but which do not have an intrinsic order. For
+
that attempts to describe a set of data by identifying the central
example, a real estate agent could classify their types of property
+
position within that set of data. As such, measures of central
into distinct categories such as houses, condos, co-ops or bungalows.
+
tendency are sometimes called measures of central location. They are
So "type of property" is a nominal variable with 4
+
also classed as summary statistics. The mean (often called the
categories called houses, condos, co-ops and bungalows. Of note, the
+
average) is most likely the measure of central tendency that you are
different categories of a nominal variable can also be referred to as
+
most familiar with, but there are others, such as the median and the
groups or levels of the nominal variable. Another example of a
+
mode.
nominal variable would be classifying where people live in Karnataka
  −
by district. In this case there will be many more levels of the
  −
nominal variable (30 in fact).
      
   
 
   
'''Dichotomous variables''' are nominal
+
The mean, median and mode are all valid measures
variables which have only two categories or levels. For example, if
+
of central tendency but, under different conditions, some measures of
we were looking at gender, we would most probably categorize somebody
+
central tendency become more appropriate to use than others. In the
as either "male" or "female". This is an example
+
following sections we will look at the mean, mode and median and
of a dichotomous variable (and also a nominal variable). Another
+
learn how to calculate them and under what conditions they are most
example might be if we asked a person if they owned a mobile phone.
+
appropriate to be used.
Here, we may categorise mobile phone ownership as either "Yes"
  −
or "No". In the real estate agent example, if type of
  −
property had been classified as either residential or commercial then
  −
"type of property" would be a dichotomous variable.
      
   
 
   
'''Ordinal variables''' are variables that have
+
== Objectives ==
two or more categories, just like nominal variables, except that the
+
categories can also be ordered or ranked. So if you asked someone if
+
* Understand and know that a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
they liked the policies of the Democratic Party and they could answer
+
* Understand that the mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others.
either "Not very much", "They are OK" or "Yes,
+
* Learn to calculate the mean and median, analyse data and draw conclusions.
a lot" then you have an ordinal variable. Why? Because you have
+
3 categories, namely "Not very much", "They are OK"
+
== Mean (Arithmetic) ==
and "Yes, a lot" and you can rank them from the most
  −
positive (Yes, a lot), to the middle response (They are OK), to the
  −
least positive (Not very much). However, whilst we can rank the
  −
levels, we cannot place a "value" on them; we cannot say
  −
that "They are OK" is twice as positive as "Not very
  −
much" for example.
  −
 
   
   
 
   
 +
The mean (or average) is the most popular and well
 +
known measure of central tendency. It can be used with both discrete
 +
and continuous data, although its use is most often with continuous
 +
data. The mean is equal to the sum of all the values in the data set
 +
divided by the number of values in the data set. So, if we have n
 +
values in a data set and they have values x<sub>1</sub>, x<sub>2</sub>,
 +
..., x<sub>n</sub>, then the sample mean, usually denoted by
 +
[[Image:KOER-%20Mathematics%20-%20Statistics_html_174cec39.gif]]
 +
(pronounced x bar), is:
   −
 
+
 
 +
[[Image:KOER-%20Mathematics%20-%20Statistics_html_69b2cf9e.gif]]
    
   
 
   
Continuous variables are also known as
+
This formula is usually written in a slightly
quantitative variables. Continuous variables can be further
+
different manner using the Greek capital letter, Σ,
categorized as either interval or ratio variables.
+
pronounced &quot;sigma&quot;, which means &quot;sum of...&quot;:
    
   
 
   
 
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m50e9a786.gif]]
 
      
   
 
   
'''Interval variables''' are variables whose
+
You may have noticed that the above formula refers
central characteristic is that they can be measured along a
+
to the sample mean. So, why have we called it a sample mean?
continuum and they have a numerical value (for example, temperature
+
This is because, in statistics, samples and populations have very
measured in degrees Celsius or Fahrenheit). So the difference between
+
different meanings and these differences are very important, even if,
20°C and 30°C is the same as that between 30°C and 40°C. However, temperature measured
+
in the case of the mean, they are calculated in the same way. To
in degrees Celsius or Fahrenheit is NOT a ratio variable.
+
acknowledge that we are calculating the population mean and not the
 +
sample mean, we use the Greek lower case letter &quot;mu&quot;,
 +
denoted as µ:
    
   
 
   
'''Ratio variables''' are interval variables but
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_7b1e9596.gif]]
with the added condition that 0 (zero) of the measurement indicates
  −
that there is none of that variable. So, temperature measured in
  −
degrees Celsius or Fahrenheit is not a ratio variable because 0°C does
  −
not mean there is no temperature. However, temperature measured in
  −
Kelvin is a ratio variable as 0 Kelvin (often called absolute zero)
  −
indicates that there is no temperature whatsoever. Other examples of
  −
ratio variables include height, mass, distance and many more. The
  −
name &quot;ratio&quot; reflects the fact that you can use the ratio
  −
of measurements. So, for example, a distance of ten metres is twice
  −
the distance of 5 metres.
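
The difference between interval and ratio scales can be sketched in a few lines of Python. This is only an illustration: the function name <code>celsius_to_kelvin</code> and the two sample temperatures are assumptions made for the example, not part of the original text.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15


# Two made-up temperatures, chosen only for illustration.
cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale: the gap is the same
# size whether it is expressed in Celsius or in Kelvin.
difference_c = warm_c - cold_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are only meaningful on a ratio scale, which has a true zero.
naive_ratio = warm_c / cold_c  # 2.0, but 20 degrees C is not "twice as hot"
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(difference_c, round(difference_k, 2))
print(naive_ratio, round(true_ratio, 3))
```

The Celsius ratio suggests a doubling, while the Kelvin ratio shows an increase of only a few percent, which is why only the Kelvin scale is treated as a ratio variable here.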
      
   
 
   
 
+
The mean is essentially a model of your data set.
 
+
It is the value that is most typical of the data. You will notice, however, that
 +
the mean is not often one of the actual values that you have observed
 +
in your data set. However, one of its important properties is that it
 +
minimises error in the prediction of any one value in your data set.
 +
That is, it is the value that produces the lowest amount of error
 +
from all other values in the data set.
    
   
 
   
=== Ambiguities in classifying a type of variable ===
+
An important property of the mean is that it
 +
includes every value in your data set as part of the calculation. In
 +
addition, the mean is the only measure of central tendency where the
 +
sum of the deviations of each value from the mean is always zero.
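
The properties of the mean described above can be checked with a short Python sketch. The data values and the helper names <code>arithmetic_mean</code> and <code>squared_error</code> are made up for illustration, and "error" is assumed here to mean the sum of squared deviations, which is one common reading of the statement that the mean minimises error.

```python
def arithmetic_mean(values):
    """Sum of all the values divided by the number of values."""
    return sum(values) / len(values)


def squared_error(values, centre):
    """Sum of squared deviations of each value from a chosen centre."""
    return sum((x - centre) ** 2 for x in values)


marks = [4, 8, 15, 16, 23, 42]  # hypothetical data set, n = 6

mean = arithmetic_mean(marks)  # 108 / 6 = 18.0
deviations = [x - mean for x in marks]
total_deviation = sum(deviations)  # deviations from the mean cancel out

print(mean)                        # 18.0
print(total_deviation)             # 0.0
print(squared_error(marks, mean))  # 910.0
print(squared_error(marks, 15.5))  # 947.5 (any other centre gives more error)
```

Note that 18 is not one of the observed values, which matches the remark that the mean is often not an actual value in the data set.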
 +
 
 
   
 
   
   Line 2,464: Line 2,483:     
   
 
   
In some cases, the measurement scale for data is
+
'''When not to use the mean'''
ordinal but the variable is treated as continuous. For example, a
  −
Likert scale that contains five values - strongly agree, agree,
  −
neither agree nor disagree, disagree, and strongly disagree - is
  −
ordinal. However, where a Likert scale contains seven or more values -
  −
strongly agree, moderately agree, agree, neither agree nor disagree,
  −
disagree, moderately disagree, and strongly disagree - the underlying
  −
scale is sometimes treated as continuous, although whether you should do
  −
this is a cause of great dispute.
      
   
 
   
 +
The mean has one main disadvantage: it is
 +
particularly susceptible to the influence of outliers. These are
 +
values that are unusual compared to the rest of the data set by being
 +
especially small or large in numerical value. For example, consider
 +
the wages of staff at a factory below:
    +
                                     
 +
{| border="1"
 +
|-
 +
|
 +
Staff
    +
 +
|
 +
1
    
   
 
   
 +
|
 +
2
    +
 +
|
 +
3
    +
 +
|
 +
4
    
   
 
   
== Enrichment Activities ==
+
|
 +
5
 +
 
 
   
 
   
= Central tendency =
+
|
 +
6
 +
 
 
   
 
   
== Introduction ==
+
|
 +
7
 +
 
 
   
 
   
A measure of central tendency is a single value
+
|
that attempts to describe a set of data by identifying the central
+
8
position within that set of data. As such, measures of central
  −
tendency are sometimes called measures of central location. They are
  −
also classed as summary statistics. The mean (often called the
  −
average) is most likely the measure of central tendency that you are
  −
most familiar with, but there are others, such as, the median and the
  −
mode.
      
   
 
   
The mean, median and mode are all valid measures
+
|
of central tendency but, under different conditions, some measures of
+
9
central tendency become more appropriate to use than others. In the
  −
following sections we will look at the mean, mode and median and
  −
learn how to calculate them and under what conditions they are most
  −
appropriate to be used.
      
   
 
   
== Objectives ==
+
|
 +
10
 +
 
 
   
 
   
* Understand and know that a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
+
|-
* Understand that the mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others.
+
|
* Learn to calculate the mean and median, analyse data and draw conclusions.
+
Salary
 +
 
 
   
 
   
== Mean (Arithmetic) ==
+
|
 +
15k
 +
 
 
   
 
   
The mean (or average) is the most popular and well
+
|
known measure of central tendency. It can be used with both discrete
+
18k
and continuous data, although its use is most often with continuous
  −
data. The mean is equal to the sum of all the values in the data set
  −
divided by the number of values in the data set. So, if we have n
  −
values in a data set and they have values x<sub>1</sub>, x<sub>2</sub>,
  −
..., x<sub>n</sub>, then the sample mean, usually denoted by
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_174cec39.gif]]
  −
(pronounced x bar), is:
      
   
 
   
 +
|
 +
16k
    +
 +
|
 +
14k
    +
 +
|
 +
15k
    
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_69b2cf9e.gif]]
+
|
 +
15k
    
   
 
   
This formula is usually written in a slightly
+
|
different manner using the Greek capital letter, Σ,
+
12k
pronounced &quot;sigma&quot;, which means &quot;sum of...&quot;:
      
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m50e9a786.gif]]
+
|
 +
17k
    
   
 
   
You may have noticed that the above formula refers
+
|
to the sample mean. So, why have we called it a sample mean?
+
90k
This is because, in statistics, samples and populations have very
  −
different meanings and these differences are very important, even if,
  −
in the case of the mean, they are calculated in the same way. To
  −
acknowledge that we are calculating the population mean and not the
  −
sample mean, we use the Greek lower case letter &quot;mu&quot;,
  −
denoted as µ:
      
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_7b1e9596.gif]]
+
|
 +
95k
    
   
 
   
The mean is essentially a model of your data set.
+
|}
It is the value that is most typical of the data. You will notice, however, that
+
The mean salary for these ten staff is $30.7k.
the mean is not often one of the actual values that you have observed
+
However, inspecting the raw data suggests that this mean value might
in your data set. However, one of its important properties is that it
+
not be the best way to accurately reflect the typical salary of a
minimises error in the prediction of any one value in your data set.
+
worker, as most workers have salaries in the $12k to $18k range. The
That is, it is the value that produces the lowest amount of error
+
mean is being skewed by the two large salaries. Therefore, in this
from all other values in the data set.
+
situation we would like to have a better measure of central tendency.
 +
As we will find out later, taking the median would be a better
 +
measure of central tendency in this situation.
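
The effect of the two large salaries can be reproduced with Python's standard <code>statistics</code> module. This is a minimal sketch using the ten salaries from the table above (in thousands); the variable names and the cut-off of 18k used to separate the outliers are assumptions made for illustration.

```python
from statistics import mean, median

# Salaries of the ten staff, in thousands, taken from the table above.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries_k)      # pulled upwards by 90k and 95k
median_salary = median(salaries_k)  # average of the two middle salaries

# Mean of the eight salaries in the typical 12k to 18k range.
typical = [s for s in salaries_k if s <= 18]
mean_without_outliers = mean(typical)

print(mean_salary, median_salary, mean_without_outliers)
```

The mean of all ten salaries is 30.7k, whereas the median is 15.5k and the mean of the eight typical salaries is 15.25k, so the median is much closer to what most workers earn.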
    
   
 
   
An important property of the mean is that it
+
Another time when we usually prefer the median
includes every value in your data set as part of the calculation. In
+
over the mean (or mode) is when our data is skewed (i.e. the
addition, the mean is the only measure of central tendency where the
+
frequency distribution for our data is skewed). If we consider the
sum of the deviations of each value from the mean is always zero.
+
normal distribution - as this is the most frequently assessed in
 
+
statistics - when the data is perfectly normal then the mean, median
 +
and mode are identical. Moreover, they all represent the most typical
 +
value in the data set. However, as the data becomes skewed the mean
 +
loses its ability to provide the best central location for the data
 +
as the skewed data is dragging it away from the typical value.
 +
However, the median best retains this position and is not as strongly
 +
influenced by the skewed values. This is explained in more detail in
 +
the skewed distribution section later in this guide.
 +
 
 
   
 
   
 +
== Median ==
 +
 +
The median is the middle score for a set of data
 +
that has been arranged in order of magnitude. The median is less
 +
affected by outliers and skewed data. In order to calculate the
 +
median, suppose we have the data below:
    +
      −
  −
'''When not to use the mean'''
     −
+
                         
The mean has one main disadvantage: it is
  −
particularly susceptible to the influence of outliers. These are
  −
values that are unusual compared to the rest of the data set by being
  −
especially small or large in numerical value. For example, consider
  −
the wages of staff at a factory below:
  −
 
  −
                                     
   
{| border="1"
 
{| border="1"
 
|-
 
|-
 
|  
 
|  
Staff
+
65
    
   
 
   
 
|  
 
|  
1
+
55
    
   
 
   
 
|  
 
|  
2
+
89
    
   
 
   
 
|  
 
|  
3
+
56
    
   
 
   
 
|  
 
|  
4
+
35
    
   
 
   
 
|  
 
|  
5
+
14
    
   
 
   
 
|  
 
|  
6
+
56
    
   
 
   
 
|  
 
|  
7
+
55
    
   
 
   
 
|  
 
|  
8
+
87
    
   
 
   
 
|  
 
|  
9
+
45
    
   
 
   
 
|  
 
|  
10
+
92
 +
 
 +
 +
|}
 +
 
 +
 
 +
 
 +
 +
We first need to rearrange that data into order of
 +
magnitude (smallest first):
    
   
 
   
 +
 +
 +
 +
                         
 +
{| border="1"
 
|-
 
|-
 
|  
 
|  
Salary
+
14
    
   
 
   
 
|  
 
|  
15k
+
35
    
   
 
   
 
|  
 
|  
18k
+
45
    
   
 
   
 
|  
 
|  
16k
+
55
    
   
 
   
 
|  
 
|  
14k
+
55
    
   
 
   
 
|  
 
|  
15k
+
'''56'''
    
   
 
   
 
|  
 
|  
15k
+
56
    
   
 
   
 
|  
 
|  
12k
+
65
    
   
 
   
 
|  
 
|  
17k
+
87
    
   
 
   
 
|  
 
|  
90k
+
89
    
   
 
   
 
|  
 
|  
95k
+
92
    
   
 
   
 
|}  
 
|}  
The mean salary for these ten staff is $30.7k.
  −
However, inspecting the raw data suggests that this mean value might
  −
not be the best way to accurately reflect the typical salary of a
  −
worker, as most workers have salaries in the $12k to $18k range. The
  −
mean is being skewed by the two large salaries. Therefore, in this
  −
situation we would like to have a better measure of central tendency.
  −
As we will find out later, taking the median would be a better
  −
measure of central tendency in this situation.
     −
+
 
Another time when we usually prefer the median
  −
over the mean (or mode) is when our data is skewed (i.e. the
  −
frequency distribution for our data is skewed). If we consider the
  −
normal distribution - as this is the most frequently assessed in
  −
statistics - when the data is perfectly normal then the mean, median
  −
and mode are identical. Moreover, they all represent the most typical
  −
value in the data set. However, as the data becomes skewed the mean
  −
loses its ability to provide the best central location for the data
  −
as the skewed data is dragging it away from the typical value.
  −
However, the median best retains this position and is not as strongly
  −
influenced by the skewed values. This is explained in more detail in
  −
the skewed distribution section later in this guide.
      
   
 
   
== Median ==
+
Our median mark is the middle mark - in this case
+
56 (highlighted in bold). It is the middle mark because there are 5
The median is the middle score for a set of data
+
scores before it and 5 scores after it. This works fine when you have
that has been arranged in order of magnitude. The median is less
+
an odd number of scores but what happens when you have an even number
affected by outliers and skewed data. In order to calculate the
+
of scores? What if you had only 10 scores? Well, you simply have to
median, suppose we have the data below:
+
take the middle two scores and average the result. So, if we look at
 +
the example below:
    
   
 
   
Line 2,710: Line 2,743:       −
                         
+
                       
 
{| border="1"
 
{| border="1"
 
|-
 
|-
Line 2,751: Line 2,784:  
|  
 
|  
 
45
 
45
  −
  −
|
  −
92
      
   
 
   
Line 2,762: Line 2,791:     
   
 
   
We first need to rearrange that data into order of
+
We again rearrange that data into order of
 
magnitude (smallest first):
 
magnitude (smallest first):
  −
        Line 2,789: Line 2,816:  
   
 
   
 
|  
 
|  
55
+
'''55'''
    
   
 
   
Line 2,821: Line 2,848:     
   
 
   
Our median mark is the middle mark - in this case
+
Only now we have to take the 5th and 6th score in
56 (highlighted in bold). It is the middle mark because there are 5
+
our data set and average them to get a median of 55.5.
scores before it and 5 scores after it. This works fine when you have
  −
an odd number of scores but what happens when you have an even number
  −
of scores? What if you had only 10 scores? Well, you simply have to
  −
take the middle two scores and average the result. So, if we look at
  −
the example below:
      
   
 
   
 +
== Mode ==
 +
 +
The mode is the most frequent score in our data
 +
set. Graphically, it corresponds to the highest bar in a bar chart or
 +
histogram. You can, therefore, sometimes consider the mode as being
 +
the most popular option. An example of a mode is presented below:
   −
 
+
 
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]]
                       
  −
{| border="1"
  −
|-
  −
|
  −
65
      
   
 
   
|
+
Normally, the mode is used for categorical data
55
+
where we wish to know which is the most common category as
 +
illustrated below:
    
   
 
   
|
+
We can see above that the most common form of
89
+
transport, in this particular data set, is the bus. However, one of
 +
the problems with the mode is that it is not unique, so it leaves us
 +
with problems when we have two or more values that share the highest
 +
frequency, such as below:
   −
+
 
|
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]]
56
     −
+
We are now stuck as to which mode best describes
|
+
the central tendency of the data. This is particularly problematic
35
+
when we have continuous data, as we are more likely not to have any
 
+
one value that is more frequent than the other. For example, consider
+
measuring the weight of 30 people (to the nearest 0.1 kg). How likely is
|
+
it that we will find two or more people with '''exactly'''
14
+
the same weight, e.g. 67.4 kg? The answer is that it is very unlikely
 +
- many people might be close but with such a small sample (30 people)
 +
and a large range of possible weights you are unlikely to find two
 +
people with exactly the same weight, that is, to the nearest 0.1 kg.
 +
This is why the mode is very rarely used with continuous data.
   −
+
Another problem with the mode is that it will not
|
+
provide us with a very good measure of central tendency when the most
56
+
common mark is far away from the rest of the data in the data set, as
 +
depicted in the diagram below:
    
   
 
   
|
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]]
55
      
   
 
   
|
+
In the above diagram the mode has a value of 2. We
87
+
can clearly see, however, that the mode is not representative of the
 +
data, which is mostly concentrated around the 20 to 30 value range.
 +
To use the mode to describe the central tendency of this data set
 +
would be misleading.
    
   
 
   
|
+
== Skewed Distributions and the Mean and Median ==
45
  −
 
   
   
 
   
|}
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]] We
 +
often test whether our data is normally distributed as this is a
 +
common assumption underlying many statistical tests. An example of a
 +
normally distributed set of data is presented below:
         −
+
When you have a normally distributed sample you
We again rearrange that data into order of
+
can legitimately use either the mean or the median as your measure of
magnitude (smallest first):
+
central tendency. In fact, in any symmetrical, unimodal distribution the mean,
 
+
median and mode are equal. However, in this situation, the mean is
 
+
widely preferred as the best measure of central tendency as it is the
 
+
measure that includes all the values in the data set for its
                         
+
calculation, and any change in any of the scores will affect the
{| border="1"
+
value of the mean. This is not the case with the median or mode.
|-
  −
|
  −
14
      
   
 
   
|
+
However, when our data is skewed, for example, as
35
+
with the right-skewed data set below:
    
   
 
   
|
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]]
45
  −
 
  −
  −
|
  −
55
  −
 
  −
  −
|
  −
'''55'''
  −
 
  −
  −
|
  −
'''56'''
  −
 
  −
  −
|
  −
56
  −
 
  −
  −
|
  −
65
  −
 
  −
  −
|
  −
87
  −
 
  −
  −
|
  −
89
  −
 
  −
  −
|
  −
92
  −
 
  −
  −
|}
  −
 
  −
 
  −
 
  −
  −
Only now we have to take the 5th and 6th score in
  −
our data set and average them to get a median of 55.5.
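
The two cases (odd and even number of scores) can be written as one small Python function. This is a sketch rather than a reference implementation; the function name <code>median_of</code> is hypothetical, and the two lists are the marks from the tables above.

```python
def median_of(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    if not values:
        raise ValueError("median_of() needs at least one value")
    ordered = sorted(values)  # arrange in order of magnitude, smallest first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of scores: a single middle score exists.
        return ordered[middle]
    # Even number of scores: average the two middle scores.
    return (ordered[middle - 1] + ordered[middle]) / 2


eleven_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median_of(eleven_marks))  # 56
print(median_of(ten_marks))     # 55.5
```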
  −
 
  −
  −
== Mode ==
  −
  −
The mode is the most frequent score in our data
  −
set. Graphically, it corresponds to the highest bar in a bar chart or
  −
histogram. You can, therefore, sometimes consider the mode as being
  −
the most popular option. An example of a mode is presented below:
  −
 
  −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]]
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
Normally, the mode is used for categorical data
  −
where we wish to know which is the most common category as
  −
illustrated below:
  −
 
  −
  −
We can see above that the most common form of
  −
transport, in this particular data set, is the bus. However, one of
  −
the problems with the mode is that it is not unique, so it leaves us
  −
with problems when we have two or more values that share the highest
  −
frequency, such as below:
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]]
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
We are now stuck as to which mode best describes
  −
the central tendency of the data. This is particularly problematic
  −
when we have continuous data, as we are more likely not to have any
  −
one value that is more frequent than the other. For example, consider
  −
measuring the weight of 30 people (to the nearest 0.1 kg). How likely is
  −
it that we will find two or more people with '''exactly'''
  −
the same weight, e.g. 67.4 kg? The answer is that it is very unlikely
  −
- many people might be close but with such a small sample (30 people)
  −
and a large range of possible weights you are unlikely to find two
  −
people with exactly the same weight, that is, to the nearest 0.1 kg.
  −
This is why the mode is very rarely used with continuous data.
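
The behaviour of the mode in these three situations can be sketched in Python. All of the data below is hypothetical (the counts behind the charts are not given in the text), and the function name <code>modes</code> is an assumption made for the example.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Categorical data with a single most common category.
transport = ["bus", "car", "bus", "train", "bus", "car", "walk"]

# Two categories share the highest frequency, so the mode is not unique.
tied_transport = ["bus", "car", "bus", "car", "train"]

# Continuous data measured to 0.1 kg: no value repeats, so every value
# is reported as a mode, which tells us nothing about the centre.
weights_kg = [67.4, 71.2, 58.9, 80.1, 64.3]

print(modes(transport))       # ['bus']
print(modes(tied_transport))  # ['bus', 'car']
print(modes(weights_kg))
```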
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
Another problem with the mode is that it will not
  −
provide us with a very good measure of central tendency when the most
  −
common mark is far away from the rest of the data in the data set, as
  −
depicted in the diagram below:
  −
 
  −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]]
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
In the above diagram the mode has a value of 2. We
  −
can clearly see, however, that the mode is not representative of the
  −
data, which is mostly concentrated around the 20 to 30 value range.
  −
To use the mode to describe the central tendency of this data set
  −
would be misleading.
  −
 
  −
  −
== Skewed Distributions and the Mean and Median ==
  −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]] We
  −
often test whether our data is normally distributed as this is a
  −
common assumption underlying many statistical tests. An example of a
  −
normally distributed set of data is presented below:
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
When you have a normally distributed sample you
  −
can legitimately use either the mean or the median as your measure of
  −
central tendency. In fact, in any symmetrical, unimodal distribution the mean,
  −
median and mode are equal. However, in this situation, the mean is
  −
widely preferred as the best measure of central tendency as it is the
  −
measure that includes all the values in the data set for its
  −
calculation, and any change in any of the scores will affect the
  −
value of the mean. This is not the case with the median or mode.
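
The claim that every score affects the mean, while the median may stay the same, can be illustrated with a short Python sketch. It reuses the eleven marks from the median example; the changed value of 103 is an arbitrary choice made for illustration.

```python
from statistics import mean, median

# The eleven marks from the median example, in order of magnitude.
marks = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]

# Change only the largest score (92 becomes 103) and leave the rest alone.
changed_marks = marks[:-1] + [103]

mean_before, mean_after = mean(marks), mean(changed_marks)
median_before, median_after = median(marks), median(changed_marks)

print(mean_before, mean_after)      # the mean moves from 59 to 60
print(median_before, median_after)  # the median stays at 56
```

Changing a single score shifts the mean, but the middle score is still 56, so the median is unchanged.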
  −
 
  −
  −
However, when our data is skewed, for example, as
  −
with the right-skewed data set below:
  −
 
  −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]]
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
  −
 
  −
 
  −
  −
 
       