
Changes

From Karnataka Open Educational Resources
2,323 bytes removed ,  09:13, 4 January 2013
no edit summary
Line 1: Line 1: −
         
+
           
<br>
+
 
<br>
+
 
    
   
 
   
Line 7: Line 7:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 71: Line 71:  
=== Descriptive and Inferential Statistics ===
 
=== Descriptive and Inferential Statistics ===
 
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 176: Line 176:  
= Mind Map =
 
= Mind Map =
 
   
 
   
<br>
+
 
    
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m14464871.jpg]]<br>
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m14464871.jpg]]
    
   
 
   
Line 230: Line 230:  
written as numbers, for example, if you were collecting data from
 
written as numbers, for example, if you were collecting data from
 
people on their height or weight.
 
people on their height or weight.
 +
 +
 
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
 +
    
   
 
   
Line 247: Line 271:     
   
 
   
<br>
+
 
    
   
 
   
Line 255: Line 279:     
   
 
   
<br>
+
 
    
   
 
   
Line 274: Line 298:  
* A teacher may ask “How many hours are spent by students on TV?” to get an idea of what children are learning from TV at home and how it supplements (or affects) the learning in the school
 
* A teacher may ask “How many hours are spent by students on TV?” to get an idea of what children are learning from TV at home and how it supplements (or affects) the learning in the school
 
   
 
   
<br>
+
 
    
   
 
   
Line 312: Line 336:  
# Specialised equipment (rainwater gauges to measure rainfall in a place, various medical equipment that collect information about different biological processes)
 
# Specialised equipment (rainwater gauges to measure rainfall in a place, various medical equipment that collect information about different biological processes)
 
   
 
   
<br>
+
 
    
   
 
   
Line 321: Line 345:     
   
 
   
<br>
+
 
    
   
 
   
Line 331: Line 355:     
   
 
   
<br>
+
 
    
   
 
   
 
NatWest One Day
 
NatWest One Day
International Series: England v India<br>
+
International Series: England v India
 
Friday, 16 September 2011 at
 
Friday, 16 September 2011 at
 
The Swalec Stadium
 
The Swalec Stadium
Line 362: Line 386:  
   
 
   
 
|}  
 
|}  
<br>
+
 
    
   
 
   
Line 372: Line 396:  
|-
 
|-
 
|  
 
|  
<br>
+
Name
    
   
 
   
 
|  
 
|  
<br>
+
Wicket
    
   
 
   
Line 459: Line 483:  
   
 
   
 
|  
 
|  
<br>
+
 
    
   
 
   
Line 550: Line 574:  
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
Line 608: Line 632:  
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
Line 633: Line 657:  
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
Line 645: Line 669:  
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
Line 654: Line 678:  
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
Line 670: Line 694:  
   
 
   
 
|}  
 
|}  
<br>
+
 
    
        
 
        
Line 825: Line 849:  
   
 
   
 
|}  
 
|}  
<br>
+
 
    
   
 
   
 
|}  
 
|}  
<br>
+
 
    
   
 
   
Line 840: Line 864:  
and wrote the choice of fruit against each name. This list would help
 
and wrote the choice of fruit against each name. This list would help
 
the teacher to distribute fruits according to the choice.
 
the teacher to distribute fruits according to the choice.
  −
  −
<br>
  −
<br>
      
            
 
            
Line 911: Line 931:  
   
 
   
 
|}   
 
|}   
<br>
  −
<br>
     −
+
 
<br>
  −
<br>
      
   
 
   
Line 924: Line 940:  
this information to Maria. Maria does so using a paper and a pencil.
 
this information to Maria. Maria does so using a paper and a pencil.
 
After arranging the choices in a column, she puts against a choice of
 
After arranging the choices in a column, she puts against a choice of
food one ( | ) mark for every student making that choice.
+
food one ( / ) mark for every student making that choice.
 +
 
 +
 +
 
    
              
 
              
Line 949: Line 968:  
   
 
   
 
|  
 
|  
|||||||||||||||||
+
/////////////// //
    
   
 
   
|||||||||||||
+
/////////////
    
   
 
   
||||||||||||||||||||
+
////////////////////
    
   
 
   
 
|}  
 
|}  
<br>
+
 
<br>
+
 
    
   
 
   
 
Umesh, after seeing the
 
Umesh, after seeing the
 
table suggested a better method to count the students. He asked
 
table suggested a better method to count the students. He asked
Maria to organise the marks ( | ) in a group of ten as shown below :
+
Maria to organise the marks ( / ) in a group of ten as shown below :
    
                  
 
                  
Line 994: Line 1,013:  
   
 
   
 
|  
 
|  
|||||||||| |||||||
+
////////// ///////
    
   
 
   
|||||||||| |||
+
////////// ///
    
   
 
   
|||||||||| ||||||||||
+
////////// //////////
    
   
 
   
Line 1,014: Line 1,033:  
   
 
   
 
|}   
 
|}   
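Umesh's idea of grouping the marks can be sketched in a few lines of Python. This is only an illustration: the function name <code>tally</code> and its parameters are assumptions made for this example, and the counts 17, 13 and 20 are simply the number of marks in each row of Maria's table.

```python
def tally(count, group_size=10, mark="/"):
    """Return `count` tally marks with a space after every full group."""
    full_groups, remainder = divmod(count, group_size)
    groups = [mark * group_size] * full_groups
    if remainder:
        groups.append(mark * remainder)
    return " ".join(groups)


# Number of marks in each row of Maria's table.
counts = [17, 13, 20]

for n in counts:
    # Print the grouped marks next to the total they stand for.
    print(f"{tally(n):<24}{n}")
```

Calling <code>tally(n, group_size=5)</code> gives the same marks in groups of five, which are even quicker to count.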
<br>
+
 
 +
 
 +
 +
 
 +
 
 +
 +
 
    
   
 
   
Line 1,050: Line 1,075:  
   
 
   
 
|  
 
|  
||||| |||||
+
///// ///// /////
||||| ||
+
//
    
   
 
   
||||| |||||
+
///// ///// ///
|||
      
   
 
   
||||| ||||| ||||| |||||
+
///// ///// ///// /////
    
   
 
   
Line 1,072: Line 1,096:  
   
 
   
 
|}   
 
|}   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
Line 1,086: Line 1,110:     
   
 
   
<br>
+
 
    
                                
 
                                
Line 1,109: Line 1,133:  
   
 
   
 
|  
 
|  
|||| ||
+
///// //
    
   
 
   
Line 1,122: Line 1,146:  
   
 
   
 
|  
 
|  
||||
+
/////
    
   
 
   
Line 1,135: Line 1,159:  
   
 
   
 
|  
 
|  
|||||
+
///// /
    
   
 
   
Line 1,148: Line 1,172:  
   
 
   
 
|  
 
|  
||||
+
////
    
   
 
   
Line 1,156: Line 1,180:  
   
 
   
 
|}  
 
|}  
<br>
+
 
    
   
 
   
Line 1,178: Line 1,202:     
   
 
   
<br>
+
 
    
   
 
   
Line 1,186: Line 1,210:     
   
 
   
<br>
+
 
    
                                
 
                                
Line 1,256: Line 1,280:  
   
 
   
 
|}  
 
|}  
<br>
+
 
    
    
 
    
Line 1,305: Line 1,329:  
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|-
 
|-
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|  
 
|  
<br>
+
-
    
   
 
   
 
|}  
 
|}  
<br>
+
 
<br>
+
 
    
   
 
   
Line 1,704: Line 1,728:  
=== What is a histogram? ===
 
=== What is a histogram? ===
 
   
 
   
<br>
+
 
    
   
 
   
Line 1,715: Line 1,739:     
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]]<br>
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]]
    
   
 
   
<br>
     −
  −
<br>
      
   
 
   
<br>
     −
  −
<br>
      
   
 
   
<br>
     −
  −
<br>
      
   
 
   
<br>
     −
  −
<br>
      
   
 
   
<br>
     −
  −
<br>
      
   
 
   
<br>
     −
  −
<br>
      
   
 
   
<br>
     −
  −
<br>
      
   
 
   
<br>
+
 
 +
 
 +
 +
 
 +
 
 +
 +
 
 +
 
 +
 +
 
 +
 
 +
 +
 
 +
 
 +
 +
 
 +
 
 +
 +
 
 +
 
 +
 +
 
    
   
 
   
Line 1,789: Line 1,813:     
   
 
   
<br>
+
 
    
   
 
   
 
=== How do you construct a histogram from a continuous variable? ===
 
=== How do you construct a histogram from a continuous variable? ===
 
   
 
   
<br>
+
 
    
   
 
   
Line 1,807: Line 1,831:     
   
 
   
<br>
+
 
    
   
 
   
Line 1,838: Line 1,862:     
   
 
   
<br>
+
 
    
   
 
   
Line 1,850: Line 1,874:     
   
 
   
<br>
+
 
    
   
 
   
 
=== Choosing the correct bin width ===
 
=== Choosing the correct bin width ===
 
   
 
   
<br>
+
 
    
   
 
   
Line 1,866: Line 1,890:     
   
 
   
<br>
+
 
    
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]]<br>
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]]
    
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
Line 1,891: Line 1,915:     
   
 
   
<br>
+
 
    
   
 
   
Line 1,907: Line 1,931:     
   
 
   
<br>
+
 
    
   
 
   
 
=== What is the difference between a bar chart and a histogram? ===
 
=== What is the difference between a bar chart and a histogram? ===
 
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6dfca87b.png]]<br>
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_6dfca87b.png]]
    
   
 
   
Line 1,923: Line 1,947:     
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
Line 1,941: Line 1,965:     
   
 
   
<br>
+
 
    
   
 
   
Line 1,951: Line 1,975:     
   
 
   
<br>
+
 
    
    
 
    
Line 1,978: Line 2,002:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,108: Line 2,132:  
   
 
   
 
|}  
 
|}  
<br>
+
 
    
   
 
   
<br>
+
 
    
   
 
   
Line 2,124: Line 2,148:     
   
 
   
<br>
+
 
    
                                                  
 
                                                  
Line 2,254: Line 2,278:  
=== Dependent and Independent Variables ===
 
=== Dependent and Independent Variables ===
 
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,264: Line 2,288:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,279: Line 2,303:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,287: Line 2,311:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,295: Line 2,319:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,312: Line 2,336:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,325: Line 2,349:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,336: Line 2,360:  
=== Experimental and Non-Experimental Research ===
 
=== Experimental and Non-Experimental Research ===
 
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,378: Line 2,402:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
 
=== Categorical and Continuous Variables ===
 
=== Categorical and Continuous Variables ===
 
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,393: Line 2,417:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,437: Line 2,461:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,446: Line 2,470:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,471: Line 2,495:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
 
=== Ambiguities in classifying a type of variable ===
 
=== Ambiguities in classifying a type of variable ===
 
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,492: Line 2,516:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
It is worth noting that how we categorise
+
 
variables is somewhat of a choice. Whilst we categorised gender as a
+
 
dichotomous variable (you are either male or female), social
  −
scientists may disagree with this, arguing that gender is a more
  −
complex variable involving more than two distinctions, but also
  −
including measurement levels like genderqueer, intersex, and
  −
transgender. At the same time, some researchers would argue that a
  −
Likert scale, even with seven values, should never be treated as a
  −
continuous variable.
      
   
 
   
Line 2,550: Line 2,567:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,593: Line 2,610:     
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,731: Line 2,748:     
   
 
   
<br>
+
 
<br>
+
 
    
                            
 
                            
Line 2,782: Line 2,799:  
   
 
   
 
|}  
 
|}  
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,790: Line 2,807:     
   
 
   
<br>
+
 
<br>
+
 
    
                            
 
                            
Line 2,841: Line 2,858:  
   
 
   
 
|}  
 
|}  
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,854: Line 2,871:     
   
 
   
<br>
+
 
<br>
+
 
    
                          
 
                          
Line 2,901: Line 2,918:  
   
 
   
 
|}  
 
|}  
<br>
+
 
<br>
+
 
    
   
 
   
 
We again rearrange that data into order of
 
We again rearrange that data into order of
magnitude (smallest first):<br>
+
magnitude (smallest first):
<br>
+
 
<br>
+
 
    
                            
 
                            
Line 2,958: Line 2,975:  
   
 
   
 
|}  
 
|}  
<br>
+
 
<br>
+
 
    
   
 
   
Line 2,974: Line 2,991:     
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]]<br>
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]]
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
Normally, the mode is used for categorical data
  −
where we wish to know which is the most common category as
  −
illustrated below:
      
   
 
   
We can see above that the most common form of
  −
transport, in this particular data set, is the bus. However, one of
  −
the problems with the mode is that it is not unique, so it leaves us
  −
with problems when we have two or more values that share the highest
  −
frequency, such as below:
     −
  −
<br>
  −
<br>
     −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]]<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
We are now stuck as to which mode best describes
  −
the central tendency of the data. This is particularly problematic
  −
when we have continuous data, as we are more likely not to have any
  −
one value that is more frequent than the other. For example, consider
  −
measuring the weight of 30 people (to the nearest 0.1 kg). How likely is
  −
it that we will find two or more people with '''exactly'''
  −
the same weight, e.g. 67.4 kg? The answer is that it is very unlikely
  −
- many people might be close but with such a small sample (30 people)
  −
and a large range of possible weights you are unlikely to find two
  −
people with exactly the same weight, that is, to the nearest 0.1 kg.
  −
This is why the mode is very rarely used with continuous data.
     −
  −
<br>
  −
<br>
      
   
 
   
Another problem with the mode is that it will not
  −
provide us with a very good measure of central tendency when the most
  −
common mark is far away from the rest of the data in the data set, as
  −
depicted in the diagram below:
     −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]]<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
+
Normally, the mode is used for categorical data
<br>
+
where we wish to know which is the most common category as
 +
illustrated below:
    
   
 
   
<br>
+
We can see above that the most common form of
<br>
+
transport, in this particular data set, is the bus. However, one of
 +
the problems with the mode is that it is not unique, so it leaves us
 +
with problems when we have two or more values that share the highest
 +
frequency, such as below:
    
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]]
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
In the above diagram the mode has a value of 2. We
  −
can clearly see, however, that the mode is not representative of the
  −
data, which is mostly concentrated around the 20 to 30 value range.
  −
To use the mode to describe the central tendency of this data set
  −
would be misleading.
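The problems of tied modes and of continuous data can be seen in a short Python sketch. All three data sets below are invented for illustration and are not taken from the diagrams; <code>multimode</code> comes from Python's standard <code>statistics</code> module (Python 3.8 or later) and returns every value that shares the highest frequency.

```python
from statistics import multimode

# Hypothetical survey answers, invented for illustration.
transport = ["bus", "car", "bus", "walk", "car", "bus"]
print(multimode(transport))   # a single most common category

# Two categories share the highest frequency, so there are two modes.
tied = ["bus", "car", "bus", "car", "walk"]
print(multimode(tied))

# Continuous data: no weight repeats, so every value counts as a "mode".
weights_kg = [67.4, 70.1, 58.9, 81.3, 64.0]
print(multimode(weights_kg))
```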
      
   
 
   
== Skewed Distributions and the Mean and Median ==
  −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]]We
  −
often test whether our data is normally distributed as this is a
  −
common assumption underlying many statistical tests. An example of a
  −
normally distributed set of data is presented below:
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
When you have a normally distributed sample you
  −
can legitimately use either the mean or the median as your measure of
  −
central tendency. In fact, in any symmetrical, unimodal distribution the mean,
  −
median and mode are equal. However, in this situation, the mean is
  −
widely preferred as the best measure of central tendency as it is the
  −
measure that includes all the values in the data set for its
  −
calculation, and any change in any of the scores will affect the
  −
value of the mean. This is not the case with the median or mode.
      
   
 
   
However, when our data is skewed, for example, as
  −
with the right-skewed data set below:
     −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]]<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
  −
<br>
  −
<br>
     −
  −
<br>
  −
<br>
      
   
 
   
<br>
  −
<br>
     −
+
 
<br>
  −
<br>
      
   
 
   
<br>
+
 
<br>
+
 
    
   
 
   
we find that the mean is being dragged in the
  −
direction of the skew. In these situations, the median is generally
  −
considered to be the best representative of the central location of
  −
the data. The more skewed the distribution the greater the difference
  −
between the median and mean, and the greater emphasis should be
  −
placed on using the median as opposed to the mean. A classic example
  −
of the above right-skewed distribution is income (salary), where
  −
higher-earners provide a false representation of the typical income
  −
if expressed as a mean and not a median.
     −
  −
If a normal distribution is assumed, but tests
  −
of normality show that the data is non-normal, then it is customary
  −
to use the median instead of the mean. This is more a rule of thumb
  −
than a strict guideline however. Sometimes, researchers wish to
  −
report the mean of a skewed distribution if the median and mean are
  −
not appreciably different (a subjective assessment) and if it allows
  −
easier comparisons to previous research to be made.
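The effect of skew on the mean can be checked with a small example. The incomes below are hypothetical numbers chosen only to show the effect: five similar incomes and one very high earner.

```python
from statistics import mean, median

# Hypothetical monthly incomes (in rupees), invented for illustration.
incomes = [8000, 9000, 10000, 11000, 12000, 100000]

income_mean = mean(incomes)      # pulled upwards by the single high earner
income_median = median(incomes)  # stays close to the typical income

print("Mean:  ", income_mean)
print("Median:", income_median)
```

Here the mean is 25000 while the median is 10500, so the median is much closer to what most people in this small group earn.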
     −
  −
<br>
  −
<br>
      
   
 
   
== Summary of when to use the mean, median and mode ==
  −
  −
Please use the following summary table to know
  −
what the best measure of central tendency is with respect to the
  −
different types of variables.
     −
  −
<br>
  −
<br>
     −
                       
  −
{| border="1"
  −
|-
  −
|
  −
'''Type of Variable'''
      
   
 
   
|
  −
'''Best measure of central tendency'''
     −
+
 
|-
  −
|
  −
Nominal
      
   
 
   
|
+
 
Mode
+
 
    
   
 
   
|-
+
We are now stuck as to which mode best describes
|
+
the central tendency of the data. This is particularly problematic
Ordinal
+
when we have continuous data, as we are more likely not to have any
 +
one value that is more frequent than the other. For example, consider
 +
measuring the weight of 30 people (to the nearest 0.1 kg). How likely is
 +
it that we will find two or more people with '''exactly'''
 +
the same weight, e.g. 67.4 kg? The answer is that it is very unlikely
 +
- many people might be close but with such a small sample (30 people)
 +
and a large range of possible weights you are unlikely to find two
 +
people with exactly the same weight, that is, to the nearest 0.1 kg.
 +
This is why the mode is very rarely used with continuous data.
    
   
 
   
|
+
 
Median
+
 
    
   
 
   
|-
+
Another problem with the mode is that it will not
|
+
provide us with a very good measure of central tendency when the most
Interval/Ratio (not skewed)
+
common mark is far away from the rest of the data in the data set, as
 +
depicted in the diagram below:
    
   
 
   
|
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]]
Mean
+
 
    
   
 
   
|-
+
 
|
+
 
Interval/Ratio (skewed)
      
   
 
   
|
+
 
Median
+
 
    
   
 
   
|}
+
 
<br>
+
 
<br>
      
   
 
   
== Relative advantages and disadvantages of mean, median and  mode ==
+
 
+
 
Mean:<br>
  −
Advantages:
  −
Finds the most accurate average of the set of numbers.<br>
  −
Disadvantages:
  −
Outliers (few values are very different from most) can change the
  −
mean a lot... making it much lower/higher than it should
  −
be.<br>
  −
<br>
  −
Median:<br>
  −
Advantages: Finds the middle number of a set of
  −
data, so outliers have little or no effect.<br>
  −
Disadvantages: If the
  −
gap between some numbers is large, while it is small between other
  −
numbers in the data, this can cause the median to be a very
  −
inaccurate way to find the middle of a set of
  −
values.<br>
  −
<br>
  −
Mode:<br>
  −
Advantages: Allows you to see what value
  −
happened the most in a set of data. This can help you to figure out
  −
things in a different way. It is also quick and easy.<br>
  −
Disadvantages:
  −
Could be very far from the actual middle of the data. The least
  −
reliable way to find the middle or average of the data.
      
   
 
   
<br>
+
 
 +
 
    
   
 
   
This means that each of
+
 
these measures can be useful in different kinds of distributions.
+
 
    
   
 
   
<br>
+
 
 +
 
    
   
 
   
== Activities ==
+
 
+
 
== Activity 1 : Central Tendency ==
+
 
 
   
 
   
==== Learning Objectives ====
  −
  −
Learn to calculate each measure of average (Mean,
  −
Median and Mode), understand the difference between them, and know in
  −
which situation each measure should be used.
     −
+
 
==== Pre-requisites/ Instructions ====
  −
  −
<br>
  −
<br>
      
   
 
   
==== Materials and Resources Required ====
+
 
 +
 
 +
 
 
   
 
   
Paper and Pencil
+
 
 +
 
    
   
 
   
==== Method ====
+
 
 +
 
 +
 
 
   
 
   
Solve the problems A and B
+
 
 +
 
    
   
 
   
<br>
  −
<br>
     −
+
 
A. 27 members of a
  −
class were given a puzzle to solve and the times (in minutes) each
  −
pupil took to solve it were noted.
      
   
 
   
<br>
  −
<br>
     −
       
  −
{| border="1"
  −
|-
  −
|
  −
'''the times (in minutes) each pupil took'''
     −
  −
|-
  −
|
  −
19 14 15 9 18 16 10 11 16
      
   
 
   
4 20 10 14 11 9 13 15 13
     −
  −
12 2 17 15 14 10 11 10 12
     −
  −
|}
  −
<br>
  −
<br>
      
   
 
   
<br>
+
In the above diagram the mode has a value of 2. We
<br>
+
can clearly see, however, that the mode is not representative of the
 +
data, which is mostly concentrated around the 20 to 30 value range.
 +
To use the mode to describe the central tendency of this data set
 +
would be misleading.
    
   
 
   
# The MEAN value of a set of data is Sum of Values / Number of Values . What is the mean (to 2 decimal places) of the times given in the table?
+
== Skewed Distributions and the Mean and Median ==
# The MEDIAN is the middle value of an ordered set of data.
  −
## Write down the times in the table above in ascending order.
  −
## How many values are there?
  −
## What is the median ?
  −
#
  −
# The MODE is the value which occurs most often, i.e. the most popular.
  −
## What is the mode of the times in the table above?
  −
#
  −
# Which of the three measures do you think is most representative of the average time? In this case it is probably the mean, but this will not always be so.
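Once problem A has been solved with paper and pencil, the answers can be checked with a short Python sketch. It uses the 27 times from the table and the standard <code>statistics</code> module; the variable names are chosen only for this example.

```python
from statistics import mean, median, mode

# The 27 times (in minutes) from the table in problem A.
times = [
    19, 14, 15, 9, 18, 16, 10, 11, 16,
    4, 20, 10, 14, 11, 9, 13, 15, 13,
    12, 2, 17, 15, 14, 10, 11, 10, 12,
]

ordered_times = sorted(times)          # ascending order, as in question 2
mean_time = round(mean(times), 2)      # sum of values / number of values
median_time = median(times)            # middle (14th) value of the ordered list
mode_time = mode(times)                # value that occurs most often

print("Ordered:", ordered_times)
print("Number of values:", len(times))
print("Mean:", mean_time, "Median:", median_time, "Mode:", mode_time)
```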
   
   
 
   
<br>
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]]We
<br>
+
often test whether our data is normally distributed as this is a
 +
common assumption underlying many statistical tests. An example of a
 +
normally distributed set of data is presented below:
    
   
 
   
'''B Choosing which measure to use '''
     −
+
 
The sales in one week of a particular dress are
  −
given in terms of the dress sizes.
      
   
 
   
# Determine the mean, median and mode for this data .
  −
# What is the size that is sold the most ?
  −
# Which of these measures is of most use?
  −
  −
<br>
  −
<br>
     −
  −
Dress sizes sold in one week
     −
             
  −
{| border="1"
  −
|-
  −
|
  −
10
      
   
 
   
16
     −
  −
16
     −
  −
12
      
   
 
   
16
     −
  −
|
  −
14
     −
  −
12
      
   
 
   
14
     −
  −
16
     −
  −
18
      
   
 
   
|
  −
12
     −
  −
10
     −
  −
18
      
   
 
   
10
     −
  −
14
     −
  −
|
  −
16
      
   
 
   
14
     −
  −
8
     −
  −
10
      
   
 
   
16
     −
  −
|
  −
18
     −
  −
16
      
   
 
   
14
     −
  −
16
     −
  −
8
      
   
 
   
|}
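The answers to problem B can be checked in the same way. The sketch below uses the 25 dress sizes from the table above; <code>Counter</code> from the standard library shows how many dresses of each size were sold, which helps when deciding which measure is of most use.

```python
from collections import Counter
from statistics import mean, median

# The 25 dress sizes sold in one week, read column by column from the table.
sizes = [
    10, 16, 16, 12, 16,
    14, 12, 14, 16, 18,
    12, 10, 18, 10, 14,
    16, 14, 8, 10, 16,
    18, 16, 14, 16, 8,
]

sales_per_size = Counter(sizes)                   # size -> number sold
best_seller, sold = sales_per_size.most_common(1)[0]

print("Mean size:  ", mean(sizes))
print("Median size:", median(sizes))
print("Mode (best-selling size):", best_seller, "sold", sold, "times")
```

Note that the mean comes out as a number that is not an actual dress size.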
+
When you have a normally distributed sample you
<br>
+
can legitimately use either the mean or the median as your measure of
<br>
+
central tendency. In fact, in any symmetrical, unimodal distribution the mean,
 +
median and mode are equal. However, in this situation, the mean is
 +
widely preferred as the best measure of central tendency as it is the
 +
measure that includes all the values in the data set for its
 +
calculation, and any change in any of the scores will affect the
 +
value of the mean. This is not the case with the median or mode.
    
   
 
   
==== Evaluation ====
+
However, when our data is skewed, for example, as
 +
with the right-skewed data set below:
 +
 
 
   
 
   
# Does the student understand the difference between Mean, Median and Mode?
+
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]]
# Can the student calculate each of the measures ?
+
 
# Does the student know which measure is useful and represents the actual data given a data set ?
+
 
 
   
 
   
== Self-Evaluation ==
+
 
 +
 
 +
 
 
   
 
   
== Further Explorations ==
+
 
 +
 
 +
 
 
   
 
   
== Enrichment Activities ==
+
 
 +
 
 +
 
 
   
 
   
= Dispersion =
+
 
 +
 
 +
 
 
   
 
   
== Introduction ==
+
 
 +
 
 +
 
 
   
 
   
A measure of spread, sometimes also called a
+
 
measure of dispersion, is used to describe the variability in a
+
 
sample or population. It is usually used in conjunction with a
  −
measure of central tendency, such as, the mean or median, to provide
  −
an overall description of a set of data.
      
   
 
   
There are many reasons why the measure of the
+
 
spread of data values is important but one of the main reasons
+
 
regards its relationship with measures of central tendency. A measure
+
 
of spread gives us an idea of how well the mean, for example,
+
represents the data. If the spread of values in the data set is large
+
 
then the mean is not as representative of the data as if the spread
+
 
of data is small. This is because a large spread indicates that there
  −
are probably large differences between individual scores.
  −
Additionally, in research, it is often seen as positive if there is
  −
little variation in each data group as it indicates that the values are similar.
      
   
 
   
We will be looking at the range, quartiles,
+
 
variance, absolute deviation and standard deviation.
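A small sketch can show why spread matters. The two groups of marks below are invented for illustration: they have the same mean, but very different spreads. The function <code>pstdev</code> from Python's <code>statistics</code> module computes the population standard deviation, one of the measures listed above.

```python
from statistics import mean, pstdev

# Two hypothetical groups of marks with the same mean, invented for illustration.
group_a = [48, 49, 50, 51, 52]   # values close together
group_b = [10, 30, 50, 70, 90]   # values far apart

for name, marks in (("A", group_a), ("B", group_b)):
    print(name, "mean:", mean(marks), "standard deviation:", round(pstdev(marks), 2))
```

Both means are 50, but the mean describes group A much better than group B, because the marks in group A all lie close to it.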
+
 
    
   
 
   
== Objectives ==
+
 
 +
 
 +
 
 
   
 
   
* Understand that a measure of dispersion, also called a measure of spread, is used to describe the variability in a sample or population.
+
 
* It is usually used in conjunction with a measure of central tendency, such as, the mean or median, to provide an overall description of a set of data.
+
 
* It is important to measure the spread of data because we can understand its relationship with measures of central tendency to make more accurate interpretation of data.
+
 
* Understand and know the terms: Range, Quartile, Standard Deviation, Cumulative Frequency
  −
* Calculation of Co-efficient of Variation. Meaning and interpretation of C.V. Analyse data and make conclusions
   
   
 
   
== Range ==
  −
  −
The range is the difference between the highest
  −
and lowest scores in a data set and is the simplest measure of
  −
spread. So we calculate range as:
     −
  −
<br>
  −
<br>
     −
  −
Range = maximum value - minimum value
      
   
 
   
<br>
  −
<br>
     −
  −
For example, let us consider the following data
  −
set:
     −
  −
23 56 45 65 59 55 62 54 85 25
      
   
 
   
<br>
  −
<br>
     −
  −
The maximum value is 85 and the minimum value is
  −
23. This results in a range of 62, which is 85 minus 23. Whilst using
  −
the range as a measure of spread is limited, it does set the
  −
boundaries of the scores. This can be useful if you are measuring a
  −
variable that has either a critical low or high threshold (or both)
  −
that should not be crossed. The range will instantly inform you
  −
whether at least one value broke these critical thresholds. In
  −
addition, the range can be used to detect any errors when entering
  −
data. For example, if you have recorded the age of school children in
  −
your study and your range is 7 to 123 years old you know you have
  −
made a mistake!<br>
  −
<br>
  −
<br>
     −
  −
=== Quartiles and Interquartile Range ===
  −
  −
<br>
  −
<br>
      
   
 
   
Quartiles tell us about the spread of a data set
  −
by breaking the data set into quarters, just like the median breaks
  −
it in half. For example, consider the marks of the 100 students
  −
below, which have been ordered from the lowest to the highest scores,
  −
and the quartiles highlighted in red.
     −
  −
<br>
  −
<br>
     −
  −
Order Score Order Score Order Score Order Score Order Score
      
   
 
   
1st 35 21st 42 41st 53 61st 64 81st 74
     −
  −
2nd 37 22nd 42 42nd 53 62nd 64 82nd 74
     −
  −
3rd 37 23rd 44 43rd 54 63rd 65 83rd 74
      
   
 
   
4th 38 24th 44 44th 55 64th 66 84th 75
+
we find that the mean is being dragged in the
 +
direction of the skew. In these situations, the median is generally
 +
considered to be the best representative of the central location of
 +
the data. The more skewed the distribution the greater the difference
 +
between the median and mean, and the greater emphasis should be
 +
placed on using the median as opposed to the mean. A classic example
 +
of the above right-skewed distribution is income (salary), where
 +
higher-earners provide a false representation of the typical income
 +
if expressed as a mean and not a median.
    
   
 
   
5th 39 25th 45 45th 55 65th 67 85th 75
+
If we expect a normal distribution, but tests
 
+
of normality show that the data is non-normal, then it is customary
+
to use the median instead of the mean. This is more a rule of thumb
6th 39 26th 45 46th 56 66th 67 86th 76
+
than a strict guideline however. Sometimes, researchers wish to
 +
report the mean of a skewed distribution if the median and mean are
 +
not appreciably different (a subjective assessment) and if it allows
 +
easier comparisons to previous research to be made.
    
   
 
   
7th 39 27th 45 47th 57 67th 67 87th 77
     −
  −
8th 39 28th 45 48th 57 68th 67 88th 77
     −
  −
9th 39 29th 47 49th 58 69th 68 89th 79
      
   
 
   
10th 40 30th 48 50th 58 70th 69 90th 80
+
== Summary of when to use the mean, median and mode ==
 
   
   
 
   
11th 40 31st 49 51st 59 71st 69 91st 81
+
Please use the following summary table to know
 +
what the best measure of central tendency is with respect to the
 +
different types of variables.
    
   
 
   
12th 40 32nd 49 52nd 60 72nd 69 92nd 81
     −
  −
13th 40 33rd 49 53rd 61 73rd 70 93rd 81
     −
  −
14th 40 34th 49 54th 62 74th 70 94th 81
     −
+
                       
15th 40 35th 51 55th 62 75th 71 95th 81
+
{| border="1"
 +
|-
 +
|
 +
'''Type of Variable'''
    
   
 
   
16th 41 36th 51 56th 62 76th 71 96th 81
+
|
 +
'''Best measure of central tendency'''
    
   
 
   
17th 41 37th 51 57th 63 77th 71 97th 83
+
|-
 +
|
 +
Nominal
    
   
 
   
18th 42 38th 51 58th 63 78th 72 98th 84
+
|
 +
Mode
    
   
 
   
19th 42 39th 52 59th 64 79th 74 99th 84
+
|-
 +
|
 +
Ordinal
    
   
 
   
20th 42 40th 52 60th 64 80th 74 100th 85
+
|
 +
Median
    
   
 
   
<br>
+
|-
<br>
+
|
 +
Interval/Ratio (not skewed)
    
   
 
   
<br>
+
|
<br>
+
Mean
    
   
 
   
The first quartile (Q1) lies between the 25th and
+
|-
26th student's marks, the second quartile (Q2) between the 50th and
+
|
51st student's marks, and the third quartile (Q3) between the 75th
+
Interval/Ratio (skewed)
and 76th student's marks. Hence:
      
   
 
   
<br>
+
|
<br>
+
Median
    
   
 
   
First quartile (Q1) = (45 + 45) ÷ 2 = 45
+
|}
   −
  −
Second quartile (Q2) = (58 + 59) ÷ 2 = 58.5
     −
  −
Third quartile (Q3) = (71 + 71) ÷ 2 = 71
      
   
 
   
<br>
+
== Relative advantages and disadvantages of mean, median and  mode ==
<br>
  −
 
   
   
 
   
In the above example, we have an even number of
+
Mean.
scores (100 students rather than an odd number such as 99 students).
+
Advantages:
This means that when we calculate the quartiles, we take the sum of
+
Finds the most accurate average of the set of numbers.
the two scores around each quartile and then halve them (hence Q1 = (45
+
Disadvantages:
+ 45) ÷ 2 = 45). However, if we had an odd number of scores (say, 99
+
Outliers (few values are very different from most) can change the
students), then we would only need to take one score for each
+
mean a lot... making it much lower/higher than it should
quartile (that is, the 25th, 50th and 75th scores). You should
+
be.
recognize that the second quartile is also the median.
+
 
 +
Median:
 +
Advantages: Finds the middle number of a set of
 +
data, so outliers have little or no effect.
 +
Disadvantages: If the
 +
gap between some numbers is large, while it is small between other
 +
numbers in the data, this can cause the median to be a very
 +
inaccurate way to find the middle of a set of
 +
values.
 +
 
 +
Mode:
 +
Advantages: Allows you to see what value
 +
happened the most in a set of data. This can help you to figure out
 +
things in a different way. It is also quick and easy.
 +
Disadvantages:
 +
Could be very far from the actual middle of the data. The least
 +
reliable way to find the middle or average of the data.
    
   
 
   
<br>
  −
<br>
     −
  −
Quartiles are a useful measure of spread because
  −
they are much less affected by outliers or a skewed data set than the
  −
equivalent measures of mean and standard deviation. For this reason,
  −
quartiles are often reported along with the median as the best choice
  −
of measure of spread and central tendency, respectively, when dealing
  −
with skewed data and/or data with outliers. A common way of expressing
  −
quartiles is as an interquartile range. The interquartile range
  −
describes the difference between the third quartile (Q3) and the
  −
first quartile (Q1), telling us about the range of the middle half of
  −
the scores in the distribution. Hence, for our 100 students:
      
   
 
   
<br>
+
This means that each of
<br>
+
these measures can be useful in different kinds of distributions.
    
   
 
   
Interquartile range = Q3 - Q1
     −
  −
= 71 - 45
      
   
 
   
= 26
+
== Activities ==
 
   
   
 
   
<br>
+
== Activity 1 : Central Tendency ==
<br>
  −
 
   
   
 
   
However, it should be noted that in journals and
+
==== Learning Objectives ====
other publications you will usually see the interquartile range
  −
reported as 45 to 71, rather than the calculated range.
  −
 
   
   
 
   
<br>
+
Learn to calculate each average measure - Mean,
<br>
+
Median, Mode. And understand the difference between them. Know in
 +
which situation which measure must be used.
    
   
 
   
A slight variation on this is the
+
==== Pre-requisites/ Instructions ====
semi-interquartile range, which is half the interquartile range = ½
+
(Q3 - Q1). Hence, for our 100 students, this would be 26 ÷ 2 = 13.
+
 
 +
 
    
   
 
   
== Standard Deviation ==
+
==== Materials and Resources Required ====
 
   
 
   
The standard deviation is a measure of the spread
+
Paper and Pencil
of scores within a set of data. Usually, we are interested in the
  −
standard deviation of a population. However, as we are often
  −
presented with data from a sample only, we can estimate the
  −
population standard deviation from a sample standard deviation. These
  −
two standard deviations, sample and population standard deviations,
  −
are calculated differently. In statistics we usually have
  −
to calculate sample standard deviations, and so this is
  −
what this article will focus on, although the formula for a
  −
population standard deviation will also be shown.
      
   
 
   
=== When to use the sample or population standard deviation ===
+
==== Method ====
 
   
 
   
We are normally interested in knowing the
+
Solve the problems A and B
population standard deviation as our population contains all the
  −
values we are interested in. Therefore, you would normally calculate
  −
the population standard deviation if: (1) you have the entire
  −
population or (2) you have a sample of a larger population but you
  −
are only interested in this sample and do not wish to generalize your
  −
findings to the population. However, in statistics, we are usually
  −
presented with a sample from which we wish to estimate (generalize
  −
to) a population, and the standard deviation is no exception to this.
  −
Therefore, if all you have is a sample but you wish to make a
  −
statement about the population standard deviation from which the
  −
sample is drawn, then you need to use the sample standard deviation.
  −
Confusion can often arise as to which standard deviation to use due
  −
to the name "sample" standard deviation incorrectly being
  −
interpreted as meaning the standard deviation of the sample itself
  −
and not as the estimate of the population standard deviation based on
  −
the sample.
      
   
 
   
=== What type of data should you use when you calculate a standard deviation? ===
+
 
 +
 
 +
 
 
   
 
   
The standard deviation is used in conjunction with
+
A. 27 members of a
the mean, to summarise [[continuous]]
+
class were given a puzzle to solve and the times (in minutes) each
data, not categorical data. In addition, the standard deviation, like
+
pupil took to solve it were noted.
the [[mean]],
  −
is normally only appropriate when the continuous data is not
  −
significantly skewed and does not contain outliers.
      
   
 
   
=== Examples of when to use the sample or population standard deviation ===
+
 
 +
 
 +
 
 +
       
 +
{| border="1"
 +
|-
 +
|
 +
'''the times (in minutes) each pupil took'''
 +
 
 
   
 
   
Q. A teacher sets an exam for their pupils. The
+
|-
teacher wants to summarize the results the pupils attained as a mean
+
|
and standard deviation. Which standard deviation should be used?
+
19 14 15 9 18 16 10 11 16
    
   
 
   
A. Population standard deviation. Why? Because the
+
4 20 10 14 11 9 13 15 13
teacher is only interested in this class of pupils' scores and nobody
  −
else.
      
   
 
   
Q. A researcher has recruited males aged 45 to 65
+
12 2 17 15 14 10 11 10 12
years old for an exercise training study to investigate risk markers
  −
for heart disease, e.g. cholesterol. Which standard deviation would
  −
most likely be used?
      
   
 
   
A. Sample standard deviation. Although not
+
|}
explicitly stated, a researcher investigating health related issues
+
 
will not be simply concerned with just the participants of their
+
 
study; they will want to show how their sample results can be
  −
generalised to the whole population (in this case, males aged 45 to
  −
65 years old). Hence, the use of the sample standard deviation.
      
   
 
   
Q. One of the questions on a national census
  −
survey asks for the respondent's age. Which standard deviation would be
  −
used to describe the variation in all ages received from the
  −
census?
     −
+
 
A. Population standard deviation. A national
  −
census is used to find out information about the nation's
  −
citizens. By definition, it includes the whole population, therefore,
  −
a population standard deviation would be used.
      
   
 
   
=== What are the formulas for the standard deviation? ===
+
# The MEAN value of a set of data is Sum of Values / Number of Values. What is the mean (to 2 decimal places) of the times given in the table?
 +
# The MEDIAN is the middle value of an ordered set of data.
 +
## Write down the times in the table above in ascending order.
 +
## How many values are there?
 +
## What is the median ?
 +
#
 +
# The MODE is the value which occurs most often, i.e. the most popular.
 +
## What is the mode of the times in the table above?
 +
#
 +
# Which of the three measures do you think is most representative of the average time? In this case it is probably the mean, but this will not always be so.
 
   
 
   
The '''sample standard deviation formula'''
  −
is:
     −
  −
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m5610ded5.gif]]
     −
  −
where,
      
   
 
   
s = sample standard
+
'''B Choosing which measure to use '''
deviation<br>
  −
Σ = sum
  −
of...<br>
  −
X̄ = sample mean<br>
  −
n = number of scores in sample.
      
   
 
   
The '''population standard deviation'''
+
The sales in one week of a particular dress are
formula is:
+
given in terms of the dress sizes.
    
   
 
   
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m48922b88.gif]]
+
# Determine the mean, median and mode for this data .
 +
# What is the size that is sold the most ?
 +
# Which of these measures is of most use?
 +
 +
 
 +
 
    
   
 
   
where,
+
Dress sizes sold in one week
 +
 
 +
             
 +
{| border="1"
 +
|-
 +
|
 +
10
    
   
 
   
σ
+
16
= population standard deviation<br>
  −
Σ
  −
= sum of...<br>
  −
μ =
  −
population mean<br>
  −
n = number of scores in the population.
     −
 
  −
== Variation ==
   
   
 
   
Quartiles are useful but they are also somewhat
+
16
limited because they do not take into account every score in our
  −
group of data. To get a more representative idea of spread we need to
  −
take into account the actual values of each score in a data set. The
  −
absolute deviation, variance and standard deviation are such
  −
measures.
      
   
 
   
<br>
+
12
<br>
      
   
 
   
The absolute and mean absolute deviation show the
+
16
amount of deviation (variation) that occurs around the mean score. To
  −
find the total variability in our group of data, we simply add up the
  −
deviation of each score from the mean. The average deviation of a
  −
score can then be calculated by dividing this total by the number of
  −
scores. How we calculate the deviation of a score from the mean
  −
depends on our choice of statistic, whether we use absolute
  −
deviation, variance or standard deviation.
      
   
 
   
<br>
+
|
<br>
+
14
    
   
 
   
=== Absolute Deviation and Mean Absolute Deviation ===
+
12
 +
 
 +
 +
14
 +
 
 
   
 
   
<br>
+
16
<br>
      
   
 
   
Perhaps the simplest way of calculating the
+
18
deviation of a score from the mean is to take each score and subtract
  −
the mean score. For example, the mean score for the group of 100
  −
students we used earlier was 58.75 out of 100. Therefore, if we took
  −
a student that scored 60 out of 100, the deviation of a score from
  −
the mean is 60 - 58.75 = 1.25. It is important to note that scores
  −
above the mean have positive deviations (as demonstrated above)
  −
whilst scores below the mean have negative deviations.
      
   
 
   
<br>
+
|
<br>
+
12
    
   
 
   
To find out the total variability in our data set,
+
10
we would perform this calculation for all of the 100 students'
  −
scores. However, the problem is that because we have both positive
  −
and minus signs, when we add up all of these deviations, they cancel
  −
each other out, giving us a total deviation of zero. Since we are
  −
only interested in the deviations of the scores and not whether they
  −
are above or below the mean score, we can ignore the minus sign and
  −
take only the absolute value, giving us the absolute deviation.
  −
Adding up all of these absolute deviations and dividing them by the
  −
total number of scores then gives us the mean absolute deviation (see
  −
below). Therefore, for our 100 students the mean absolute deviation
  −
is 12.81, as shown below:
      
   
 
   
<br>
+
18
<br>
      
   
 
   
=== Variance ===
+
10
 +
 
 
   
 
   
<br>
+
14
<br>
      
   
 
   
Another method for calculating the deviation of a
+
|
group of scores from the mean, such as the 100 students we used
+
16
earlier, is to use the variance. Unlike the absolute deviation, which
+
 
uses the absolute value of the deviation in order to "rid
+
itself" of the negative values, the variance achieves positive
+
14
values by squaring each of the deviations instead. Adding up these
+
 
squared deviations gives us the sum of squares, which we can then
+
divide by the total number of scores in our group of data (in other
+
8
words, 100 because there are 100 students) to find the variance (see
+
 
below). Therefore, for our 100 students, the variance is 211.89, as
+
shown below:
+
10
    
   
 
   
<br>
+
16
<br>
      
   
 
   
As a measure of variability, the variance is
+
|
useful. If the scores in our group of data are spread out then the
+
18
variance will be a large number. Conversely, if the scores are spread
  −
closely around the mean, then the variance will be a smaller number.
  −
However, there are two potential problems with the variance. First,
  −
because the deviations of scores from the mean are 'squared', this
  −
gives more weight to extreme scores. If our data contains outliers
  −
(in other words, one or a small number of scores that are
  −
particularly far away from the mean and perhaps do not represent well
  −
our data as a whole) this can give undue weight to these scores.
  −
Secondly, the variance is not in the same units as the scores in our
  −
data set: variance is measured in the units squared. This means we
  −
cannot place it on our frequency distribution and cannot directly
  −
relate its value to the values in our data set. Therefore, the figure
  −
of 211.89, our variance, appears somewhat arbitrary. Calculating the
  −
standard deviation rather than the variance rectifies this problem.
  −
Nonetheless, analysing variance is extremely important in some
  −
statistical analyses, discussed in other statistical guides.
      
   
 
   
<br>
+
16
    
   
 
   
=== Coefficient of variation ===
+
14
 +
 
 +
 +
16
 +
 
 +
 +
8
 +
 
 +
 +
|}
 +
 
 +
 
 +
 
 +
 +
==== Evaluation ====
 +
 +
# Does the student understand the difference between Mean, Median and Mode?
 +
# Can the student calculate each of the measures ?
 +
# Does the student know which measure is useful and represents the actual data given a data set ?
 +
 +
== Self-Evaluation ==
 +
 +
== Further Explorations ==
 +
 +
== Enrichment Activities ==
 +
 +
= Dispersion =
 +
 +
== Introduction ==
 +
 +
A measure of spread, sometimes also called a
 +
measure of dispersion, is used to describe the variability in a
 +
sample or population. It is usually used in conjunction with a
 +
measure of central tendency, such as, the mean or median, to provide
 +
an overall description of a set of data.
 +
 
 +
 +
There are many reasons why the measure of the
 +
spread of data values is important but one of the main reasons
 +
regards its relationship with measures of central tendency. A measure
 +
of spread gives us an idea of how well the mean, for example,
 +
represents the data. If the spread of values in the data set is large
 +
then the mean is not as representative of the data as if the spread
 +
of data is small. This is because a large spread indicates that there
 +
are probably large differences between individual scores.
 +
Additionally, in research, it is often seen as positive if there is
 +
little variation in each data group as it indicates that the scores are similar.
 +
 
 +
 +
We will be looking at the range, quartiles,
 +
variance, absolute deviation and standard deviation.
 +
 
 +
 +
== Objectives ==
 +
 +
* Understand that a measure of dispersion, also called a measure of spread, is used to describe the variability in a sample or population.
 +
* It is usually used in conjunction with a measure of central tendency, such as, the mean or median, to provide an overall description of a set of data.
 +
* It is important to measure the spread of data because understanding its relationship with measures of central tendency allows a more accurate interpretation of the data.
 +
* Understand and know the terms: Range, Quartile, Standard Deviation, Cumulative Frequency
 +
* Calculation of Co-efficient of Variation. Meaning and interpretation of C.V. Analyse data and make conclusions
 +
 +
== Range ==
 +
 +
The range is the difference between the highest
 +
and lowest scores in a data set and is the simplest measure of
 +
spread. So we calculate range as:
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
Range = maximum value - minimum value
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
For example, let us consider the following data
 +
set:
 +
 
 +
 +
23 56 45 65 59 55 62 54 85 25
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
The maximum value is 85 and the minimum value is
 +
23. This results in a range of 62, which is 85 minus 23. Whilst using
 +
the range as a measure of spread is limited, it does set the
 +
boundaries of the scores. This can be useful if you are measuring a
 +
variable that has either a critical low or high threshold (or both)
 +
that should not be crossed. The range will instantly inform you
 +
whether at least one value broke these critical thresholds. In
 +
addition, the range can be used to detect any errors when entering
 +
data. For example, if you have recorded the age of school children in
 +
your study and your range is 7 to 123 years old you know you have
 +
made a mistake!
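
The range calculation above can be sketched in a few lines of Python. This is only an illustration: the variable names, the list of ages and the age limits used in the data-entry check are assumptions chosen for the example and do not come from the original text.

```python
# Scores from the example data set above.
scores = [23, 56, 45, 65, 59, 55, 62, 54, 85, 25]

# Range = maximum value - minimum value.
score_range = max(scores) - min(scores)
print(score_range)  # 85 - 23 = 62

# Using plausible limits to spot data-entry errors.
# The ages and the limits 4 and 20 are hypothetical values for school children.
ages = [7, 9, 12, 123, 10]
suspicious = [age for age in ages if age < 4 or age > 20]
print(suspicious)  # ages that fall outside the plausible limits
```

Checking the minimum and maximum first is a cheap way to catch a mistyped value before any other statistic is calculated.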
 +
 
 +
 
 +
 
 +
 +
=== Quartiles and Interquartile Range ===
 +
 +
 
 +
 
 +
 
 +
 +
Quartiles tell us about the spread of a data set
 +
by breaking the data set into quarters, just like the median breaks
 +
it in half. For example, consider the marks of the 100 students
 +
below, which have been ordered from the lowest to the highest scores,
 +
and the quartiles highlighted in red.
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
Order Score Order Score Order Score Order Score Order Score
 +
 
 +
 +
1st 35 21st 42 41st 53 61st 64 81st 74
 +
 
 +
 +
2nd 37 22nd 42 42nd 53 62nd 64 82nd 74
 +
 
 +
 +
3rd 37 23rd 44 43rd 54 63rd 65 83rd 74
 +
 
 +
 +
4th 38 24th 44 44th 55 64th 66 84th 75
 +
 
 +
 +
5th 39 25th 45 45th 55 65th 67 85th 75
 +
 
 +
 +
6th 39 26th 45 46th 56 66th 67 86th 76
 +
 
 +
 +
7th 39 27th 45 47th 57 67th 67 87th 77
 +
 
 +
 +
8th 39 28th 45 48th 57 68th 67 88th 77
 +
 
 +
 +
9th 39 29th 47 49th 58 69th 68 89th 79
 +
 
 +
 +
10th 40 30th 48 50th 58 70th 69 90th 80
 +
 
 +
 +
11th 40 31st 49 51st 59 71st 69 91st 81
 +
 
 +
 +
12th 40 32nd 49 52nd 60 72nd 69 92nd 81
 +
 
 +
 +
13th 40 33rd 49 53rd 61 73rd 70 93rd 81
 +
 
 +
 +
14th 40 34th 49 54th 62 74th 70 94th 81
 +
 
 +
 +
15th 40 35th 51 55th 62 75th 71 95th 81
 +
 
 +
 +
16th 41 36th 51 56th 62 76th 71 96th 81
 +
 
 +
 +
17th 41 37th 51 57th 63 77th 71 97th 83
 +
 
 +
 +
18th 42 38th 51 58th 63 78th 72 98th 84
 +
 
 +
 +
19th 42 39th 52 59th 64 79th 74 99th 84
 +
 
 +
 +
20th 42 40th 52 60th 64 80th 74 100th 85
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
The first quartile (Q1) lies between the 25th and
 +
26th student's marks, the second quartile (Q2) between the 50th and
 +
51st student's marks, and the third quartile (Q3) between the 75th
 +
and 76th student's marks. Hence:
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
First quartile (Q1) = (45 + 45) ÷ 2 = 45
 +
 
 +
 +
Second quartile (Q2) = (58 + 59) ÷ 2 = 58.5
 +
 
 +
 +
Third quartile (Q3) = (71 + 71) ÷ 2 = 71
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
In the above example, we have an even number of
 +
scores (100 students rather than an odd number such as 99 students).
 +
This means that when we calculate the quartiles, we take the sum of
 +
the two scores around each quartile and then halve them (hence Q1 = (45
 +
+ 45) ÷ 2 = 45). However, if we had an odd number of scores (say, 99
 +
students), then we would only need to take one score for each
 +
quartile (that is, the 25th, 50th and 75th scores). You should
 +
recognize that the second quartile is also the median.
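
The quartile method described above, for the 100 ordered scores, can be sketched in Python as follows. The helper name `midpoint_after` is an assumption made for this example. Library functions may use other interpolation rules, so for other data sets they can give slightly different quartile values.

```python
from statistics import median

# The 100 ordered scores from the table above (1st to 100th).
scores = [
    35, 37, 37, 38, 39, 39, 39, 39, 39, 40, 40, 40, 40, 40, 40, 41, 41, 42, 42, 42,
    42, 42, 44, 44, 45, 45, 45, 45, 47, 48, 49, 49, 49, 49, 51, 51, 51, 51, 52, 52,
    53, 53, 54, 55, 55, 56, 57, 57, 58, 58, 59, 60, 61, 62, 62, 62, 63, 63, 64, 64,
    64, 64, 65, 66, 67, 67, 67, 67, 68, 69, 69, 69, 70, 70, 71, 71, 71, 72, 74, 74,
    74, 74, 74, 75, 75, 76, 77, 77, 79, 80, 81, 81, 81, 81, 81, 81, 83, 84, 84, 85,
]


def midpoint_after(ordered, position):
    """Average the score at a 1-based position and the score after it."""
    return (ordered[position - 1] + ordered[position]) / 2


q1 = midpoint_after(scores, 25)  # between the 25th and 26th scores
q2 = midpoint_after(scores, 50)  # between the 50th and 51st scores
q3 = midpoint_after(scores, 75)  # between the 75th and 76th scores

print(q1, q2, q3)            # 45.0 58.5 71.0
print(q2 == median(scores))  # True: the second quartile is the median
```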
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
Quartiles are a useful measure of spread because
 +
they are much less affected by outliers or a skewed data set than the
 +
equivalent measures of mean and standard deviation. For this reason,
 +
quartiles are often reported along with the median as the best choice
 +
of measure of spread and central tendency, respectively, when dealing
 +
with skewed data and/or data with outliers. A common way of expressing
 +
quartiles is as an interquartile range. The interquartile range
 +
describes the difference between the third quartile (Q3) and the
 +
first quartile (Q1), telling us about the range of the middle half of
 +
the scores in the distribution. Hence, for our 100 students:
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
Interquartile range = Q3 - Q1
 +
 
 +
 +
= 71 - 45
 +
 
 +
 +
= 26
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
However, it should be noted that in journals and
 +
other publications you will usually see the interquartile range
 +
reported as 45 to 71, rather than the calculated range.
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
A slight variation on this is the
 +
semi-interquartile range, which is half the interquartile range = ½
 +
(Q3 - Q1). Hence, for our 100 students, this would be 26 ÷ 2 = 13.
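
As a small illustration, the interquartile range and semi-interquartile range can be calculated from the two quartiles found for the 100 students. The variable names are assumptions made for this sketch.

```python
# Quartiles found for the 100 students above.
q1 = 45
q3 = 71

# Interquartile range = Q3 - Q1.
interquartile_range = q3 - q1

# Semi-interquartile range = half of the interquartile range.
semi_interquartile_range = interquartile_range / 2

print(interquartile_range)       # 26
print(semi_interquartile_range)  # 13.0
```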
 +
 
 +
 +
== Standard Deviation ==
 +
 +
The standard deviation is a measure of the spread
 +
of scores within a set of data. Usually, we are interested in the
 +
standard deviation of a population. However, as we are often
 +
presented with data from a sample only, we can estimate the
 +
population standard deviation from a sample standard deviation. These
 +
two standard deviations, sample and population standard deviations,
 +
are calculated differently. In statistics we usually have
 +
to calculate sample standard deviations, and so this is
 +
what this article will focus on, although the formula for a
 +
population standard deviation will also be shown.
 +
 
 +
 +
=== When to use the sample or population standard deviation ===
 +
 +
We are normally interested in knowing the
 +
population standard deviation as our population contains all the
 +
values we are interested in. Therefore, you would normally calculate
 +
the population standard deviation if: (1) you have the entire
 +
population or (2) you have a sample of a larger population but you
 +
are only interested in this sample and do not wish to generalize your
 +
findings to the population. However, in statistics, we are usually
 +
presented with a sample from which we wish to estimate (generalize
 +
to) a population, and the standard deviation is no exception to this.
 +
Therefore, if all you have is a sample but you wish to make a
 +
statement about the population standard deviation from which the
 +
sample is drawn, then you need to use the sample standard deviation.
 +
Confusion can often arise as to which standard deviation to use due
 +
to the name "sample" standard deviation incorrectly being
 +
interpreted as meaning the standard deviation of the sample itself
 +
and not as the estimate of the population standard deviation based on
 +
the sample.
 +
 
 +
 +
=== What type of data should you use when you calculate a standard deviation? ===
 +
 +
The standard deviation is used in conjunction with
 +
the mean, to summarise [[continuous]]
 +
data, not categorical data. In addition, the standard deviation, like
 +
the [[mean]],
 +
is normally only appropriate when the continuous data is not
 +
significantly skewed and does not contain outliers.
 +
 
 +
 +
=== Examples of when to use the sample or population standard deviation ===
 +
 +
Q. A teacher sets an exam for their pupils. The
 +
teacher wants to summarize the results the pupils attained as a mean
 +
and standard deviation. Which standard deviation should be used?
 +
 
 +
 +
A. Population standard deviation. Why? Because the
 +
teacher is only interested in this class of pupils' scores and nobody
 +
else.
 +
 
 +
 +
Q. A researcher has recruited males aged 45 to 65
 +
years old for an exercise training study to investigate risk markers
 +
for heart disease, e.g. cholesterol. Which standard deviation would
 +
most likely be used?
 +
 
 +
 +
A. Sample standard deviation. Although not
 +
explicitly stated, a researcher investigating health related issues
 +
will not be simply concerned with just the participants of their
 +
study; they will want to show how their sample results can be
 +
generalised to the whole population (in this case, males aged 45 to
 +
65 years old). Hence, the use of the sample standard deviation.
 +
 
 +
 +
Q. One of the questions on a national census
 +
survey asks for the respondent's age. Which standard deviation would be
 +
used to describe the variation in all ages received from the
 +
census?
 +
 
 +
 +
A. Population standard deviation. A national
 +
census is used to find out information about the nation's
 +
citizens. By definition, it includes the whole population, therefore,
 +
a population standard deviation would be used.
 +
 
 +
 +
=== What are the formulas for the standard deviation? ===
 +
 +
The '''sample standard deviation formula'''
 +
is:
 +
 
 +
 +
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m5610ded5.gif]]
 +
 
 +
 +
where,
 +
 
 +
 +
s = sample standard deviation
 +
Σ = sum of...
 +
X̄ = sample mean
 +
n = number of scores in sample.
 +
 
 +
 +
The '''population standard deviation'''
 +
formula is:
 +
 
 +
 +
[[Image:KOER-%20Mathematics%20-%20Statistics_html_m48922b88.gif]]
 +
 
 +
 +
where,
 +
 
 +
 +
σ = population standard deviation
 +
Σ = sum of...
 +
μ = population mean
 +
n = number of scores in the population.
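
The difference between the two formulas can be sketched in Python. This example assumes the usual textbook definitions, in which the population formula divides the sum of squared deviations by n and the sample formula divides it by n - 1. It reuses the ten scores from the Range section purely as illustrative data.

```python
import math
import statistics

scores = [23, 56, 45, 65, 59, 55, 62, 54, 85, 25]
n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations of each score from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population standard deviation: divide by n.
population_sd = math.sqrt(sum_of_squares / n)

# Sample standard deviation: divide by n - 1.
sample_sd = math.sqrt(sum_of_squares / (n - 1))

# Python's statistics module provides both versions.
print(math.isclose(population_sd, statistics.pstdev(scores)))  # True
print(math.isclose(sample_sd, statistics.stdev(scores)))       # True
print(sample_sd > population_sd)                               # True
```

Dividing by n - 1 makes the sample value slightly larger, which matters most when the sample is small.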
 +
 
 +
 
 +
== Variation ==
 +
 +
Quartiles are useful but they are also somewhat
 +
limited because they do not take into account every score in our
 +
group of data. To get a more representative idea of spread we need to
 +
take into account the actual values of each score in a data set. The
 +
absolute deviation, variance and standard deviation are such
 +
measures.
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
The absolute and mean absolute deviation show the
 +
amount of deviation (variation) that occurs around the mean score. To
 +
find the total variability in our group of data, we simply add up the
 +
deviation of each score from the mean. The average deviation of a
 +
score can then be calculated by dividing this total by the number of
 +
scores. How we calculate the deviation of a score from the mean
 +
depends on our choice of statistic, whether we use absolute
 +
deviation, variance or standard deviation.
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
=== Absolute Deviation and Mean Absolute Deviation ===
 +
 +
 
 +
 
 +
 
 +
 +
Perhaps the simplest way of calculating the
 +
deviation of a score from the mean is to take each score and subtract
 +
the mean score. For example, the mean score for the group of 100
 +
students we used earlier was 58.75 out of 100. Therefore, if we took
 +
a student that scored 60 out of 100, the deviation of a score from
 +
the mean is 60 - 58.75 = 1.25. It is important to note that scores
 +
above the mean have positive deviations (as demonstrated above)
 +
whilst scores below the mean have negative deviations.
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
To find out the total variability in our data set,
 +
we would perform this calculation for all of the 100 students'
 +
scores. However, the problem is that because we have both positive
 +
and minus signs, when we add up all of these deviations, they cancel
 +
each other out, giving us a total deviation of zero. Since we are
 +
only interested in the deviations of the scores and not whether they
 +
are above or below the mean score, we can ignore the minus sign and
 +
take only the absolute value, giving us the absolute deviation.
 +
Adding up all of these absolute deviations and dividing them by the
 +
total number of scores then gives us the mean absolute deviation (see
 +
below). Therefore, for our 100 students the mean absolute deviation
 +
is 12.81, as shown below:
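
The calculation of the mean absolute deviation for the 100 students can be sketched in Python. The variable names are assumptions made for this example; the scores are those listed in the quartiles table.

```python
# The 100 ordered scores from the quartiles table.
scores = [
    35, 37, 37, 38, 39, 39, 39, 39, 39, 40, 40, 40, 40, 40, 40, 41, 41, 42, 42, 42,
    42, 42, 44, 44, 45, 45, 45, 45, 47, 48, 49, 49, 49, 49, 51, 51, 51, 51, 52, 52,
    53, 53, 54, 55, 55, 56, 57, 57, 58, 58, 59, 60, 61, 62, 62, 62, 63, 63, 64, 64,
    64, 64, 65, 66, 67, 67, 67, 67, 68, 69, 69, 69, 70, 70, 71, 71, 71, 72, 74, 74,
    74, 74, 74, 75, 75, 76, 77, 77, 79, 80, 81, 81, 81, 81, 81, 81, 83, 84, 84, 85,
]

mean = sum(scores) / len(scores)
print(mean)  # 58.75

# Signed deviations: positive and negative values cancel out.
deviations = [x - mean for x in scores]
total_deviation = sum(deviations)
print(total_deviation)  # 0.0

# Absolute deviations ignore the sign, so they no longer cancel.
absolute_deviations = [abs(d) for d in deviations]
mean_absolute_deviation = sum(absolute_deviations) / len(scores)
print(mean_absolute_deviation)  # 12.81
```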
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
=== Variance ===
 +
 +
 
 +
 
 +
 
 +
 +
Another method for calculating the deviation of a
 +
group of scores from the mean, such as the 100 students we used
 +
earlier, is to use the variance. Unlike the absolute deviation, which
 +
uses the absolute value of the deviation in order to "rid
 +
itself" of the negative values, the variance achieves positive
 +
values by squaring each of the deviations instead. Adding up these
 +
squared deviations gives us the sum of squares, which we can then
 +
divide by the total number of scores in our group of data (in other
 +
words, 100 because there are 100 students) to find the variance (see
 +
below). Therefore, for our 100 students, the variance is 211.89, as
 +
shown below:
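
As a rough illustration in Python, the sum of squares and the variance can be computed as follows. The five scores are hypothetical (the 100 students' scores are not listed here), and the variance divides by the number of scores, as described in the text.

```python
def sum_of_squares(scores):
    """Sum of the squared deviations of the scores from their mean."""
    mean = sum(scores) / len(scores)
    return sum((score - mean) ** 2 for score in scores)


def variance(scores):
    """Sum of squares divided by the total number of scores."""
    return sum_of_squares(scores) / len(scores)


# Hypothetical scores for five students (assumed data, mean = 60).
sample_scores = [40, 50, 60, 70, 80]

sample_sum_of_squares = sum_of_squares(sample_scores)
sample_variance = variance(sample_scores)

print(sample_sum_of_squares, sample_variance)
```

Python's standard library offers `statistics.pvariance`, which divides by the number of scores in the same way.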
 +
 
 +
 +
 
 +
 
 +
 
 +
 +
As a measure of variability, the variance is
 +
useful. If the scores in our group of data are spread out then the
 +
variance will be a large number. Conversely, if the scores are spread
 +
closely around the mean, then the variance will be a smaller number.
 +
However, there are two potential problems with the variance. First,
 +
because the deviations of scores from the mean are 'squared', this
 +
gives more weight to extreme scores. If our data contains outliers
 +
(in other words, one or a small number of scores that are
 +
particularly far away from the mean and perhaps do not represent
 +
our data as a whole), this can give undue weight to these scores.
 +
Secondly, the variance is not in the same units as the scores in our
 +
data set: variance is measured in squared units. This means we
 +
cannot place it on our frequency distribution and cannot directly
 +
relate its value to the values in our data set. Therefore, the figure
 +
of 211.89, our variance, appears somewhat arbitrary. Calculating the
 +
standard deviation rather than the variance rectifies this problem.
 +
Nonetheless, analysing variance is extremely important in some
 +
statistical analyses, discussed in other statistical guides.
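
To see how the standard deviation brings the measure back into the units of the scores, we can take the square root of the variance quoted above. This short Python sketch uses only the figure 211.89 from the text; the variable names are illustrative.

```python
import math

# Variance of the 100 students' scores, as given in the text (in marks squared).
variance_of_scores = 211.89

# The standard deviation is the square root of the variance (in marks).
standard_deviation = math.sqrt(variance_of_scores)

print(round(standard_deviation, 2))  # about 14.56
```

A standard deviation of roughly 14.56 marks can be compared directly with the mean of 58.75 marks, which the variance cannot.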
 +
 
 +
 +
 
 +
 
 +
 +
=== Coefficient of variation ===
 
   
 
   
 
Coefficient of variation is defined as
 +
 
 +
 +
[[Image:KOER-%20Mathematics%20-%20Statistics_html_1afc44b3.png]]
 +
 
 +
 +
 
 +
 
 +
 +
 
 +
 
 +
 +
where σ is the standard
 +
deviation and x̄ is the mean of the given data. It is also called
 +
 
 +
 +
a relative standard
 +
deviation.
 +
 
 +
 +
'''Remarks '''
 +
 
 +
 +
* The coefficient of variation helps us to compare the consistency of two or more collections of data.
 +
* When the coefficient of variation is higher, the given data is less consistent.
 +
* When the coefficient of variation is lower, the given data is more consistent.
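
The remarks above can be illustrated with a small Python sketch. It assumes the common definition of the coefficient of variation, the standard deviation divided by the mean and expressed as a percentage; the two collections of scores are hypothetical and the function name is an illustrative choice.

```python
import math


def coefficient_of_variation(scores):
    """Standard deviation as a percentage of the mean (assumed definition)."""
    mean = sum(scores) / len(scores)
    variance = sum((score - mean) ** 2 for score in scores) / len(scores)
    return math.sqrt(variance) / mean * 100


# Two hypothetical collections of scores with the same mean (60).
spread_out_scores = [40, 50, 60, 70, 80]
close_scores = [55, 58, 60, 62, 65]

cv_spread_out = coefficient_of_variation(spread_out_scores)
cv_close = coefficient_of_variation(close_scores)

# The collection with the lower coefficient of variation is the more consistent one.
more_consistent = "close_scores" if cv_close < cv_spread_out else "spread_out_scores"
print(round(cv_spread_out, 2), round(cv_close, 2), more_consistent)
```

Because the coefficient of variation is a ratio, it can also be used to compare collections measured in different units or with different means.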
 +
 +
 
 +
 
 +
 
 +
 +
== Self-Evaluation ==
 +
 +
== Further Explorations ==
 +
 +
== Enrichment Activities ==
 +
 +
= See Also =
 +
 +
Statistics
 +
on Wikipedia [http://en.wikipedia.org/wiki/Statistics]
    
   
 
   
      
   
 
   
Line 4,224: Line 4,241:  
"Figuring and society" by Ronald Meek,
 
Fontana ISBN 0 00 632560
 