Line 1: |
Line 1: |
− |
| |
| | | |
− |
| |
− |
| |
− |
| |
− | '''Statistics'''
| |
− |
| |
− |
| |
− |
| |
− |
| |
− |
| |
− |
| |
| = Introduction = | | = Introduction = |
| | | |
Line 70: |
Line 59: |
| | | |
| === Descriptive and Inferential Statistics === | | === Descriptive and Inferential Statistics === |
− |
| |
− |
| |
− |
| |
− |
| |
| | | |
| When analysing data, for example, the marks | | When analysing data, for example, the marks |
Line 1,698: |
Line 1,683: |
| | | |
| | | |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]] |
| + | |
| | | |
| | | |
| | | |
| + | 36 25 38 46 55 68 |
| + | 72 55 36 38 |
| | | |
| + | |
| + | 67 45 22 48 91 46 |
| + | 52 61 58 55 |
| | | |
| | | |
| | | |
| | | |
| + | |
| + | === How do you construct a histogram from a continuous variable? === |
| | | |
| | | |
| | | |
| | | |
− | | + | To construct a |
| + | histogram from a continuous variable you first need to split the data |
| + | into intervals, called bins. In the example above, age has been split |
| + | into bins, with each bin representing a 10-year period starting at 20 |
| + | years. Each bin contains the number of occurrences of scores in the |
| + | data set that are contained within that bin. For the above data set, |
| + | the frequencies in each bin have been tabulated along with the scores |
| + | that contributed to the frequency in each bin (see below): |
| | | |
| | | |
Line 1,716: |
Line 1,717: |
| | | |
| | | |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]]
| + | Bin    Frequency    Scores Included in Bin |
| | | |
| | | |
− | | + | 20-30 2 25,22 |
| | | |
| | | |
− | | + | 30-40 4 36,38,36,38 |
| | | |
| | | |
− | | + | 40-50 4 46,45,48,46 |
| | | |
| | | |
− | | + | 50-60 5 55,55,52,58,55 |
| | | |
| | | |
− | | + | 60-70 3 68,67,61 |
| | | |
| | | |
− | | + | 70-80 1 72 |
| | | |
| | | |
− | | + | 80-90 0 - |
| | | |
| | | |
− | | + | 90-100 1 91 |
| | | |
| | | |
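The tabulation above can be reproduced with a short Python sketch. It assumes that each bin includes its lower boundary and excludes its upper one (so a value of exactly 30 would be counted in the 30-40 bin); the function name bin_frequencies is illustrative and does not come from the original text.

```python
# The 20 values from the data set above.
scores = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

def bin_frequencies(values, start, stop, width):
    """Count how many values fall in each bin [lower, lower + width)."""
    # Create every bin first so that empty bins (such as 80-90) still appear.
    counts = {(lower, lower + width): 0 for lower in range(start, stop, width)}
    for v in values:
        if start <= v < stop:
            lower = start + ((v - start) // width) * width
            counts[(lower, lower + width)] += 1
    return counts

frequencies = bin_frequencies(scores, start=20, stop=100, width=10)
for (lower, upper), count in frequencies.items():
    print(f"{lower}-{upper}: {count}")
```

Choosing the opposite convention (upper boundary included) would change the counts only for values that lie exactly on a boundary; this data set has none.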
Line 1,746: |
Line 1,748: |
| | | |
| | | |
− | | + | Notice that, unlike a |
| + | bar chart, there are no "gaps" between the bars (although |
| + | some bars might be "absent" reflecting no frequencies). |
| + | This is because a histogram represents a continuous data set, and as |
| + | such, there are no gaps in the data (although you will have to |
| + | decide whether scores that fall exactly on a bin boundary are |
| + | counted in the lower or the upper bin). |
| | | |
| | | |
Line 1,752: |
Line 1,760: |
| | | |
| | | |
− | | + | === Choosing the correct bin width === |
− | | |
| | | |
| | | |
| | | |
| | | |
| + | There is no right or |
| + | wrong answer as to how wide a bin should be, but there are rules of |
| + | thumb. You need to make sure that the bins are not too small or too |
| + | large. Consider the histogram we produced earlier (see above): the |
| + | following histograms use the same data but have either much smaller |
| + | or larger bins, as shown below: |
| | | |
− | | + | |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]] |
| | | |
− | | + | We can see from the |
− | | + | histogram on the left that the bin width is too small, as it shows |
− |
| + | too much individual data and does not allow the underlying pattern |
− | 36 25 38 46 55 68
| + | (frequency distribution) of the data to be easily seen. At the other |
− | 72 55 36 38
| + | end of the scale is the diagram on the right, where the bins are too |
| + | large and, again, we are unable to find the underlying trend in the |
| + | data. |
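To see the effect of bin width on the same 20 values, the sketch below counts them using widths of 2, 10 and 40. These widths and the helper count_bins are assumptions chosen for illustration; they are not necessarily the widths used in the figure.

```python
scores = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

def count_bins(values, start, stop, width):
    """Return the frequencies of consecutive bins of the given width."""
    counts = [0] * ((stop - start) // width)
    for v in values:
        counts[(v - start) // width] += 1  # the lower boundary belongs to the bin
    return counts

narrow = count_bins(scores, 20, 100, 2)    # 40 bins: mostly 0s and 1s
medium = count_bins(scores, 20, 100, 10)   # 8 bins: the histogram tabulated earlier
wide = count_bins(scores, 20, 100, 40)     # 2 bins: the shape of the data is lost

for label, counts in [("width 2", narrow), ("width 10", medium), ("width 40", wide)]:
    print(label, counts)
```

With a width of 2 almost every bar has a height of 0 or 1, and with a width of 40 only two bars remain, so neither shows the distribution as clearly as the width of 10.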
| | | |
| | | |
− | 67 45 22 48 91 46
| + | Histograms are based on |
− | 52 61 58 55
| + | area not height of bars |
| | | |
| | | |
Line 1,775: |
Line 1,791: |
| | | |
| | | |
− | === How do you construct a histogram from a continuous variable? ===
| + | In a histogram, it is |
| + | the area of the bar that indicates the frequency of occurrences for |
| + | each bin. This means that the height of the bar does not necessarily |
| + | indicate how many occurrences of scores there were within each |
| + | individual bin. It is the product of the height and the width |
| + | of the bin that indicates the frequency of occurrences within that |
| + | bin. The height of the bars is often mistaken for the frequency |
| + | because many histograms have bins of equal width and, under these |
| + | circumstances, the height of each bar is proportional to the |
| + | frequency of its bin. |
| + | |
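A minimal sketch of the area rule, assuming (purely for illustration) that the four widest-ranging bins of the earlier table, 60-70, 70-80, 80-90 and 90-100, are merged into a single bin 60-100 with frequency 3 + 1 + 0 + 1 = 5. The height of each bar is then the frequency divided by the bin width, so that height multiplied by width gives back the frequency.

```python
# (lower boundary, upper boundary, frequency); the last bin is a hypothetical merge.
bins = [(20, 30, 2), (30, 40, 4), (40, 50, 4), (50, 60, 5), (60, 100, 5)]

def bar_heights(bins):
    """Height of each bar = frequency / bin width (frequency density)."""
    return [freq / (upper - lower) for lower, upper, freq in bins]

heights = bar_heights(bins)
for (lower, upper, freq), height in zip(bins, heights):
    area = height * (upper - lower)  # area of the bar recovers the frequency
    print(f"{lower}-{upper}: height {height}, area {area}")
```

The bins 50-60 and 60-100 both contain 5 values, yet the second bar is only a quarter as tall (0.125 against 0.5) because it is four times as wide.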
| | | |
| | | |
| | | |
| | | |
− | To construct a
| + | === What is the difference between a bar chart and a histogram? === |
− | histogram from a continuous variable you first need to split the data | |
− | into intervals, called bins. In the example above, age has been split
| |
− | into bins, with each bin representing a 10-year period starting at 20
| |
− | years. Each bin contains the number of occurrences of scores in the
| |
− | data set that are contained within that bin. For the above data set,
| |
− | the frequencies in each bin have been tabulated along with the scores
| |
− | that contributed to the frequency in each bin (see below):
| |
− | | |
| | | |
− | | + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_6dfca87b.png]] |
| | | |
| | | |
− | Bin Frequency Scores
| + | The major difference is |
− | Included in Bin
| + | that a histogram is only used to plot the frequency of score |
| + | occurrences in a continuous data set that has been divided into |
| + | classes, called bins. Bar charts, on the other hand, can be used for |
| + | many other types of variables, including ordinal and |
| + | nominal data sets. |
| | | |
| | | |
− | 20-30 2 25,22
| + | |
| | | |
| | | |
− | 30-40 4 36,38,36,38
| + | |
| | | |
| | | |
− | 40-50 4 46,45,48,46
| + | |
| | | |
| | | |
− | 50-60 5 55,55,52,58,55
| + | == Circle or Pie Chart == |
− | | |
| | | |
− | 60-70 3 68,67,61
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_461389d1.png]]These |
| + | are called circle graphs. A circle graph shows the relationship |
| + | between a whole and its parts. Here, the whole circle is divided into |
| + | sectors. The size of each sector is proportional to the activity or |
| + | information it represents. |
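Because each sector is proportional to its share of the whole, its central angle is that share multiplied by 360 degrees. The sketch below works this out for a made-up data set (the hours a student spends on different activities in one day); both the data and the function name sector_angles are hypothetical and are not taken from the figure.

```python
def sector_angles(parts):
    """Return the central angle in degrees for each part of the whole."""
    total = sum(parts.values())
    return {label: 360 * value / total for label, value in parts.items()}

# Hypothetical data: hours spent on each activity in a 24-hour day.
activities = {"Sleep": 8, "School": 6, "Play": 4, "Study": 3, "Other": 3}

angles = sector_angles(activities)
for label, angle in angles.items():
    print(f"{label}: {angle} degrees")
```

The angles always add up to 360 degrees, whatever the units of the original data.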
| | | |
| | | |
− | 70-80 1 72
| |
| | | |
− |
| |
− | 80-90 0 -
| |
| | | |
| | | |
− | 90-100 1 91
| + | A variety of graphical |
| + | representations of data are now possible using spreadsheet software. |
| + | OpenOffice Calc can convert a table of data into bar charts, pie |
| + | charts, area charts, etc., making the data much easier to |
| + | read and interpret. |
| | | |
| | | |
| | | |
| | | |
| + | |
| + | == Activities == |
| + | |
| + | === Activity 2: Histogram and Bar Chart === |
| | | |
− | Notice that, unlike a
| + | ==== Learning Objectives ==== |
− | bar chart, there are no "gaps" between the bars (although
| |
− | some bars might be "absent" reflecting no frequencies).
| |
− | This is because a histogram represents a continuous data set, and as
| |
− | such, there are no gaps in the data. (Although you will have to
| |
− | decide whether you round up or round down scores on the boundaries of
| |
− | bins)
| |
− | | |
| | | |
− | | + | Learn to draw a histogram and bar chart. |
| + | Understand the difference between a bar chart and a histogram and be |
| + | able to select the appropriate chart by looking at the problem and |
| + | data. |
| | | |
| | | |
− | === Choosing the correct bin width === | + | ==== Materials and Resources Required ==== |
| | | |
− | | + | Paper and Pencil |
| | | |
| | | |
− | There is no right or
| + | ==== Pre-requisites/ Instructions ==== |
− | wrong answer as to how wide a bin should be, but there are rules of
| |
− | thumb. You need to make sure that the bins are not too small or too
| |
− | large. Consider the histogram we produced earlier (see above): the
| |
− | following histograms use the same data but have either much smaller
| |
− | or larger bins, as shown below:
| |
− | | |
| | | |
− | | + | ==== Method ==== |
− | | |
| | | |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]]
| + | Solve problems A and B. |
| | | |
| | | |
| | | |
− |
| |
− |
| |
| | | |
| | | |
| | | |
− | We can see from the
| + | A> In the past year, you have recorded the |
− | histogram on the left, that the bin width is too small as it shows
| + | number of tickets that a movie theater has sold during each month. |
− | too much individual data and does not allow the underlying pattern
| + | To represent this data set graphically, would you construct a bar |
− | (frequency distribution) of the data to be easily seen. At the other
| + | graph or a histogram? Why is this choice better than the other? |
− | end of the scale, is the diagram on the right, where the bins are too
| + | Using the following data, construct the graph that you choose. |
− | large and, again, we are unable to find the underlying trend in the
| |
− | data.
| |
| | | |
− |
| + | |
− | Histograms are based on
| + | {| border="1" |
− | area not height of bars
| + | |- |
| + | | |
| + | Month |
| | | |
| | | |
− | | + | | |
| + | Number of Tickets Sold |
| | | |
| | | |
− | In a histogram, it is
| + | |- |
− | the area of the bar that indicates the frequency of occurrences for
| + | | |
− | each bin. This means that the height of the bar does not necessarily
| + | January |
− | indicate how many occurrences of scores there were within each
| |
− | individual bin. It is the product of height multiplied by the width
| |
− | of the bin that indicates the frequency of occurrences within that
| |
− | bin. One of the reasons that the height of the bars is often
| |
− | incorrectly assessed as indicating frequency and not the area of the
| |
− | bar is due to the fact that a lot of histograms often have equally
| |
− | spaced bars (bins) and, under these circumstances, the height of the
| |
− | bin does reflect the frequency.
| |
| | | |
| | | |
− | | + | | |
| + | 25 |
| | | |
| | | |
− | === What is the difference between a bar chart and a histogram? ===
| + | |- |
| + | | |
| + | February |
| + | |
| | | |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_6dfca87b.png]]
| + | | |
| + | 20 |
| | | |
| | | |
− | The major difference is
| + | |- |
− | that a histogram is only used to plot the frequency of score
| + | | |
− | occurrences in a continuous data set that has been divided into
| + | March |
− | classes, called bins. Bar charts, on the other hand, can be used for
| |
− | a great deal of other types of variables including ordinal and
| |
− | nominal data sets.
| |
| | | |
| | | |
| + | | |
| + | 15 |
| | | |
| + | |
| + | |- |
| + | | |
| + | April |
| | | |
| | | |
| + | | |
| + | 20 |
| | | |
| + | |
| + | |- |
| + | | |
| + | May |
| | | |
| | | |
| + | | |
| + | 30 |
| | | |
| + | |
| + | |- |
| + | | |
| + | June |
| | | |
| | | |
− | == Circle or Pie Chart ==
| + | | |
| + | 35 |
| + | |
| | | |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_461389d1.png]]These
| + | |- |
− | are called circle graphs. A circle graph shows the relationship
| + | | |
− | between a whole and its parts. Here, the whole circle is divided into
| + | July |
− | sectors. The size of each sector is proportional to the activity or
| |
− | information it represents.
| |
| | | |
| | | |
| + | | |
| + | 40 |
| | | |
| + | |
| + | |- |
| + | | |
| + | August |
| | | |
| | | |
− | A variety of graphical
| + | | |
− | representations of data are now possible using spreadsheet software.
| + | 20 |
− | OpenOffice CALC can convert a table of data into bar charts, pie
| |
− | charts, area charts etc and make data much more easy to
| |
− | read/interpret.
| |
| | | |
| | | |
| + | |- |
| + | | |
| + | September |
| | | |
| + | |
| + | | |
| + | 25 |
| | | |
− |
| |
− | == Activities ==
| |
| | | |
− | === Activity 2: Histogram and Bar Chart ===
| + | |- |
| + | | |
| + | October |
| + | |
| | | |
− | ==== Learning Objectives ====
| + | | |
− |
| + | 15 |
− | Learn to draw a histogram and bar chart.
| |
− | Understand the difference between a bar chart and a histogram and be
| |
− | able to select the appropriate chart by looking at the problem and
| |
− | data.
| |
| | | |
| | | |
− | ==== Materials and Resources Required ====
| |
− |
| |
− | Paper and Pencil
| |
− |
| |
− |
| |
− | ==== Pre-requisites/ Instructions ====
| |
− |
| |
− | ==== Method ====
| |
− |
| |
− | Solve the problems A and B
| |
− |
| |
− |
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | A> In the past year, you have recorded the
| |
− | number of tickets that a movie theater has sold during each month.
| |
− | To represent this data set graphically, would you construct a bar
| |
− | graph or a histogram? Why is this choice better than the other?
| |
− | Using the following data, construct the graph that you choose.
| |
− |
| |
− |
| |
− | {| border="1"
| |
| |- | | |- |
| | | | | |
− | Month
| + | November |
| | | |
| | | |
| | | | | |
− | Number of Tickets Sold
| + | 20 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | January
| + | December |
| | | |
| | | |
| | | | | |
− | 25
| + | 30 |
| | | |
| | | |
− | |- | + | |} |
− | |
| + | |
− | February
| + | |
| + | |
| + | |
| | | |
| | | |
− | |
| + | B> For a recent |
− | 20
| + | science project, you collected data regarding the distribution of |
| + | fish and aquatic life in a nearby pond. Your data consists of the |
| + | number of living creatures found in each 1 meter depth increment in |
| + | the pond. Construct a bar graph and several histograms (vary the |
| + | depth increment size) for the following data. In which case(s) is the |
| + | histogram the same as the bar graph? How do the other histograms vary |
| + | from the bar graph? |
| | | |
| | | |
| + | |
| + | |
| + | |
| + | {| border="1" |
| |- | | |- |
| | | | | |
− | March
| + | '''Depth Range''' |
| | | |
| | | |
| | | | | |
− | 15
| + | '''Number of Living Creatures ''' |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | April
| + | 0 – 1 meters |
| | | |
| | | |
| | | | | |
− | 20
| + | 10 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | May
| + | 1 – 2 meters |
| | | |
| | | |
| | | | | |
− | 30
| + | 93 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | June
| + | 2 – 3 meters |
| | | |
| | | |
| | | | | |
− | 35
| + | 23 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | July
| + | 3 – 4 meters |
| | | |
| | | |
| | | | | |
− | 40
| + | 47 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | August
| + | 4 – 5 meters |
| | | |
| | | |
| | | | | |
− | 20
| + | 68 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | September
| + | 5 – 6 meters |
| | | |
| | | |
| | | | | |
− | 25
| + | 51 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | October
| + | 6 – 7 meters |
| | | |
| | | |
| | | | | |
− | 15
| + | 43 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | November
| + | 7 – 8 meters |
| | | |
| | | |
| | | | | |
− | 20
| + | 21 |
| | | |
| | | |
| |- | | |- |
| | | | | |
− | December
| + | 8 – 9 meters |
| | | |
| | | |
| | | | | |
− | 30
| + | 15 |
| | | |
| | | |
− | |} | + | |- |
− | | + | | |
| + | 9 – 10 meters |
| | | |
| | | |
− | | + | | |
| + | 8 |
| | | |
| | | |
− | B> For a recent
| + | |} |
− | science project, you collected data regarding the distribution of
| + | ==== Evaluation ==== |
− | fish and aquatic life in a nearby pond. Your data consists of the
| + | |
− | number of living creatures found in each 1 meter depth increment in
| + | # Does the student understand the difference between a bar chart and a histogram? |
− | the pond. Construct a bar graph and several histograms (vary the
| + | # Does the student know when to use each of these charts, depending on whether the data are continuous or discrete? |
− | depth increment size) for the following data. In which case(s) is the
| + | |
− | histogram the same as the bar graph? How do the other histograms vary
| + | == Evaluation == |
− | from the bar graph?
| + | |
| + | == Self-Evaluation == |
| + | |
| + | == Further Explorations == |
| + | |
| + | === Types of Variables === |
| + | |
| + | All experiments examine some kind of variable(s). |
| + | A variable is not only something that we measure, but also something |
| + | that we can manipulate and something we can control for. To |
| + | understand the characteristics of variables and how we use them in |
| + | research, this guide is divided into three main sections. First, we |
| + | illustrate the role of dependent and independent variables. Second, |
| + | we discuss the difference between experimental and non-experimental |
| + | research. Finally, we explain how variables can be characterised as |
| + | either categorical or continuous. |
| | | |
| + | |
| + | === Dependent and Independent Variables === |
| | | |
| | | |
| | | |
− |
| |
− | {| border="1"
| |
− | |-
| |
− | |
| |
− | '''Depth Range'''
| |
| | | |
| | | |
− | |
| + | An independent variable, sometimes called an |
− | '''Number of Living Creatures '''
| + | experimental or predictor variable, is a variable that is being |
− | | + | manipulated in an experiment in order to observe the effect on a |
| + | dependent variable, sometimes called an outcome variable. |
| + | |
| | | |
− | |-
| + | |
− | |
| + | |
− | 0 – 1 meters
| |
| | | |
| | | |
− | |
| + | Imagine that a tutor asks 100 students to complete |
− | 10
| + | a maths test. The tutor wants to know why some students perform |
| + | better than others. Whilst the tutor does not know the answer to |
| + | this, she thinks that it might be because of two reasons: (1) some |
| + | students spend more time revising for their test; and (2) some |
| + | students are naturally more intelligent than others. As such, the |
| + | tutor decides to investigate the effect of revision time and |
| + | intelligence on the test performance of the 100 students. The |
| + | dependent and independent variables for the study are: |
| | | |
| | | |
− | |-
| + | |
− | |
| + | |
− | 1 – 2 meters
| |
| | | |
| | | |
− | |
| + | Dependent Variable: Test Mark (measured from 0 to |
− | 93
| + | 100) |
| | | |
| | | |
− | |-
| |
− | |
| |
− | 2 – 3 meters
| |
| | | |
− |
| + | |
− | |
| |
− | 23
| |
| | | |
| | | |
− | |-
| + | Independent Variables: Revision time (measured in |
− | |
| + | hours) Intelligence (measured using IQ score) |
− | 3 – 4 meters
| |
| | | |
| | | |
− | |
| |
− | 47
| |
| | | |
− |
| |
− | |-
| |
− | |
| |
− | 4 – 5 meters
| |
| | | |
− |
| |
− | |
| |
− | 68
| |
| | | |
| | | |
− | |-
| + | The dependent variable is simply that, a variable |
− | |
| + | that is dependent on an independent variable(s). For example, in our |
− | 5 – 6 meters
| + | case the test mark that a student achieves is dependent on revision |
| + | time and intelligence. Whilst revision time and intelligence (the |
| + | independent variables) may (or may not) cause a change in the test |
| + | mark (the dependent variable), the reverse is implausible; in other |
| + | words, whilst the number of hours a student spends revising and |
| + | a student's IQ score may (or may not) change the test mark |
| + | that a student achieves, a change in a student's test mark has no |
| + | bearing on whether a student revises more or is more intelligent |
| + | (this simply doesn't make sense). |
| | | |
| | | |
− | |
| |
− | 51
| |
| | | |
− |
| |
− | |-
| |
− | |
| |
− | 6 – 7 meters
| |
| | | |
− |
| |
− | |
| |
− | 43
| |
| | | |
| | | |
− | |-
| + | Therefore, the aim of the tutor's investigation is |
− | |
| + | to examine whether these independent variables - revision time and IQ |
− | 7 – 8 meters
| + | - result in a change in the dependent variable, the students' test |
| + | scores. However, it is also worth noting that whilst this is the main |
| + | aim of the experiment, the tutor may also be interested to know if |
| + | the independent variables - revision time and IQ - are also connected |
| + | in some way. |
| | | |
| | | |
− | |
| + | |
− | 21
| + | |
| | | |
| | | |
− | |-
| + | In the section on experimental and |
− | |
| + | non-experimental research that follows, we find out a little more |
− | 8 – 9 meters
| + | about the nature of independent and dependent variables. |
| | | |
| | | |
− | |
| + | === Experimental and Non-Experimental Research === |
− | 15
| + | |
| | | |
− |
| |
− | |-
| |
− | |
| |
− | 9 – 10 meters
| |
| | | |
− |
| |
− | |
| |
− | 8
| |
| | | |
| | | |
− | |}
| + | Experimental research: In experimental research, |
− | ==== Evaluation ====
| + | the aim is to manipulate an independent variable(s) and then examine |
− |
| + | the effect that this change has on a dependent variable(s). Since it |
− | # Does the student understand the difference between a bar chart and a histogram ?
| + | is possible to manipulate the independent variable(s), experimental |
− | # Does the student know when to use each of these charts - - depending on the type of data continuous and discrete ?
| + | research has the advantage of enabling a researcher to identify a |
− |
| + | cause-and-effect relationship between variables. Consider our example of |
− | == Evaluation ==
| + | 100 students completing a maths exam where the dependent variable was |
− |
| + | the exam mark (measured from 0 to 100) and the independent variables |
− | == Self-Evaluation ==
| + | were revision time (measured in hours) and intelligence (measured |
− |
| + | using IQ score). Here, it would be possible to use an experimental |
− | == Further Explorations ==
| + | design and manipulate the revision time of the students. The tutor |
− |
| + | could divide the students into two groups, each made up of 50 |
− | === Types of Variables ===
| + | students. In "group one", the tutor could ask the students |
− |
| + | not to do any revision. Alternatively, "group two" could be |
− | All experiments examine some kind of variable(s).
| + | asked to do 20 hours of revision in the two weeks prior to the test. |
− | A variable is not only something that we measure, but also something
| + | The tutor could then compare the marks that the students achieved. |
− | that we can manipulate and something we can control for. To
| |
− | understand the characteristics of variables and how we use them in
| |
− | research, this guide is divided into three main sections. First, we
| |
− | illustrate the role of dependent and independent variables. Second,
| |
− | we discuss the difference between experimental and non-experimental
| |
− | research. Finally, we explain how variables can be characterised as
| |
− | either categorical or continuous.
| |
| | | |
| | | |
− | === Dependent and Independent Variables ===
| + | Non-experimental research: In non-experimental |
| + | research, the researcher does not manipulate the independent |
| + | variable(s). This is not to say that it is impossible to do so, but |
| + | it would be either impractical or unethical. For example, a |
| + | researcher may be interested in the effect of illegal, recreational |
| + | drug use (the independent variable(s)) on certain types of behaviour |
| + | (the dependent variable(s)). However, whilst possible, it would be |
| + | unethical to ask individuals to take illegal drugs in order to study |
| + | what effect this had on certain behaviours. Instead, a researcher |
| + | could ask both drug and non-drug users to complete a questionnaire |
| + | that had been constructed to indicate the extent to which they |
| + | exhibited certain behaviours. Whilst it is not possible to identify |
| + | the cause and effect between the variables, we can still examine the |
| + | association or relationship between them. |
| + | |
| + | In addition to understanding |
| + | the difference between dependent and independent variables, and |
| + | experimental and non-experimental research, it is also important to |
| + | understand the different characteristics amongst variables. This is |
| + | discussed next. |
| + | |
| | | |
| | | |
Line 2,241: |
Line 2,279: |
| | | |
| | | |
− | An independent variable, sometimes called an
| + | === Categorical and Continuous Variables === |
− | experimental or predictor variable, is a variable that is being
| |
− | manipulated in an experiment in order to observe the effect on a
| |
− | dependent variable, sometimes called an outcome variable.
| |
− | | |
| | | |
| | | |
Line 2,251: |
Line 2,285: |
| | | |
| | | |
− | Imagine that a tutor asks 100 students to complete
| + | Categorical variables are also known as discrete |
− | a maths test. The tutor wants to know why some students perform
| + | or qualitative variables. Categorical variables can be further |
− | better than others. Whilst the tutor does not know the answer to
| + | categorized as either '''nominal, ordinal or dichotomous'''. |
− | this, she thinks that it might be because of two reasons: (1) some
| |
− | students spend more time revising for their test; and (2) some
| |
− | students are naturally more intelligent than others. As such, the
| |
− | tutor decides to investigate the effect of revision time and
| |
− | intelligence on the test performance of the 100 students. The
| |
− | dependent and independent variables for the study are:
| |
| | | |
| | | |
Line 2,266: |
Line 2,294: |
| | | |
| | | |
− | Dependent Variable: Test Mark (measured from 0 to
| + | '''Nominal variables''' are variables that have |
− | 100)
| + | two or more categories but which do not have an intrinsic order. For |
| + | example, a real estate agent could classify their types of property |
| + | into distinct categories such as houses, condos, co-ops or bungalows. |
| + | So "type of property" is a nominal variable with 4 |
| + | categories called houses, condos, co-ops and bungalows. Of note, the |
| + | different categories of a nominal variable can also be referred to as |
| + | groups or levels of the nominal variable. Another example of a |
| + | nominal variable would be classifying where people live in Karnataka |
| + | by district. In this case there will be many more levels of the |
| + | nominal variable (30 in fact). |
| | | |
| | | |
− | | + | '''Dichotomous variables''' are nominal |
− | | + | variables which have only two categories or levels. For example, if |
| + | we were looking at gender, we would most probably categorize somebody |
| + | as either "male" or "female". This is an example |
| + | of a dichotomous variable (and also a nominal variable). Another |
| + | example might be if we asked a person if they owned a mobile phone. |
| + | Here, we may categorise mobile phone ownership as either "Yes" |
| + | or "No". In the real estate agent example, if type of |
| + | property had been classified as either residential or commercial then |
| + | "type of property" would be a dichotomous variable. |
| | | |
| | | |
− | Independent Variables: Revision time (measured in
| + | '''Ordinal variables''' are variables that have |
− | hours) Intelligence (measured using IQ score)
| + | two or more categories, just like nominal variables, except that the |
| + | categories can also be ordered or ranked. So if you asked someone if |
| + | they liked the policies of the Democratic Party and they could answer |
| + | either "Not very much", "They are OK" or "Yes, |
| + | a lot" then you have an ordinal variable. Why? Because you have |
| + | 3 categories, namely "Not very much", "They are OK" |
| + | and "Yes, a lot" and you can rank them from the most |
| + | positive (Yes, a lot), to the middle response (They are OK), to the |
| + | least positive (Not very much). However, whilst we can rank the |
| + | levels, we cannot assign a "value" to them; we cannot say |
| + | that "They are OK" is twice as positive as "Not very |
| + | much", for example. |
| | | |
| | | |
Line 2,282: |
Line 2,338: |
| | | |
| | | |
− | The dependent variable is simply that, a variable
| + | Continuous variables are also known as |
− | that is dependent on an independent variable(s). For example, in our
| + | quantitative variables. Continuous variables can be further |
− | case the test mark that a student achieves is dependent on revision
| + | categorized as either interval or ratio variables. |
− | time and intelligence. Whilst revision time and intelligence (the
| |
− | independent variables) may (or may not) cause a change in the test
| |
− | mark (the dependent variable), the reverse is implausible; in other
| |
− | words, whilst the number of hours a student spends revising and the
| |
− | higher a student's IQ score may (or may not) change the test mark
| |
− | that a student achieves, a change in a student's test mark has no
| |
− | bearing on whether a student revises more or is more intelligent
| |
− | (this simply doesn't make sense).
| |
| | | |
| | | |
Line 2,299: |
Line 2,347: |
| | | |
| | | |
− | Therefore, the aim of the tutor's investigation is
| + | '''Interval variables''' are variables whose |
− | to examine whether these independent variables - revision time and IQ
| + | central characteristic is that they can be measured along a |
− | - result in a change in the dependent variable, the students' test
| + | continuum and they have a numerical value (for example, temperature |
− | scores. However, it is also worth noting that whilst this is the main
| + | measured in degrees Celsius or Fahrenheit). So the difference between |
− | aim of the experiment, the tutor may also be interested to know if
| + | 20°C and 30°C is the same as that between 30°C and 40°C. However, temperature measured |
− | the independent variables - revision time and IQ - are also connected
| + | in degrees Celsius or Fahrenheit is NOT a ratio variable. |
− | in some way. | |
| | | |
| | | |
− | | + | '''Ratio variables''' are interval variables but |
− | | + | with the added condition that 0 (zero) of the measurement indicates |
− | | + | that there is none of that variable. So, temperature measured in |
− |
| + | degrees Celsius or Fahrenheit is not a ratio variable because 0C does |
− | In the section on experimental and
| + | not mean there is no temperature. However, temperature measured in |
− | non-experimental research that follows, we find out a little more
| + | Kelvin is a ratio variable as 0 Kelvin (often called absolute zero) |
− | about the nature of independent and dependent variables.
| + | indicates that there is no temperature whatsoever. Other examples of |
| + | ratio variables include height, mass, distance and many more. The |
| + | name "ratio" reflects the fact that you can use the ratio |
| + | of measurements. So, for example, a distance of 10 metres is twice |
| + | a distance of 5 metres. |
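The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below is only an illustration; the temperatures 20°C and 40°C and the helper celsius_to_kelvin are chosen for the example and do not come from the original text.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to Kelvin (0 K is absolute zero)."""
    return celsius + 273.15

# Interval scale: the ratio of two Celsius readings has no physical meaning.
ratio_celsius = 40 / 20

# Ratio scale: the same two temperatures expressed in Kelvin.
ratio_kelvin = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Distance is a ratio variable, so this ratio is meaningful.
ratio_distance = 10 / 5

print(ratio_celsius, round(ratio_kelvin, 3), ratio_distance)
```

Although 40 is twice 20, 40°C is not "twice as hot" as 20°C: on the Kelvin scale the ratio is only about 1.07. For distances, by contrast, 10 metres really is twice 5 metres.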
| | | |
− |
| |
− | === Experimental and Non-Experimental Research ===
| |
| | | |
| | | |
Line 2,323: |
Line 2,372: |
| | | |
| | | |
− | Experimental research: In experimental research,
| + | === Ambiguities in classifying a type of variable === |
− | the aim is to manipulate an independent variable(s) and then examine
| + | |
− | the effect that this change has on a dependent variable(s). Since it
| + | |
− | is possible to manipulate the independent variable(s), experimental
| + | |
− | research has the advantage of enabling a researcher to identify a
| |
− | cause and effect between variables. For example, take our example of
| |
− | 100 students completing a maths exam where the dependent variable was
| |
− | the exam mark (measured from 0 to 100) and the independent variables
| |
− | were revision time (measured in hours) and intelligence (measured
| |
− | using IQ score). Here, it would be possible to use an experimental
| |
− | design and manipulate the revision time of the students. The tutor
| |
− | could divide the students into two groups, each made up of 50
| |
− | students. In "group one", the tutor could ask the students
| |
− | not to do any revision. Alternately, "group two" could be
| |
− | asked to do 20 hours of revision in the two weeks prior to the test.
| |
− | The tutor could then compare the marks that the students achieved.
| |
| | | |
| | | |
− | Non-experimental research: In non-experimental
| + | In some cases, the measurement scale for data is |
− | research, the researcher does not manipulate the independent
| + | ordinal but the variable is treated as continuous. For example, a |
− | variable(s). This is not to say that it is impossible to do so, but | + | Likert scale that contains five values - strongly agree, agree, |
− | it will either be impractical or unethical to do so. For example, a
| + | neither agree nor disagree, disagree, and strongly disagree - is |
− | researcher may be interested in the effect of illegal, recreational
| + | ordinal. However, where a Likert scale contains seven or more values - |
− | drug use (the dependent variable(s)) on certain types of behaviour
| + | strongly agree, moderately agree, agree, neither agree nor disagree, |
− | (the independent variable(s)). However, whilst possible, it would be
| + | disagree, moderately disagree, and strongly disagree - the underlying |
− | unethical to ask individuals to take illegal drugs in order to study
| + | scale is sometimes treated as continuous although whether you should do |
− | what effect this had on certain behaviours. As such, a researcher
| + | this is a cause of great dispute. |
− | could ask both drug and non-drug users to complete a questionnaire
| |
− | that had been constructed to indicate the extent to which they
| |
− | exhibited certain behaviours. Whilst it is not possible to identify
| |
− | the cause and effect between the variables, we can still examine the
| |
− | association or relationship between them.In addition to understanding
| |
− | the difference between dependent and independent variables, and
| |
− | experimental and non-experimental research, it is also important to
| |
− | understand the different characteristics amongst variables. This is
| |
− | discussed next.
| |
| | | |
| | | |
Line 2,364: |
Line 2,392: |
| | | |
| | | |
− |
| |
− | === Categorical and Continuous Variables ===
| |
| | | |
| | | |
Line 2,371: |
Line 2,397: |
| | | |
| | | |
− | Categorical variables are also known as discrete
| + | == Enrichment Activities == |
− | or qualitative variables. Categorical variables can be further
| + | |
− | categorized as either''' nominal, ordinal or dichotomous.'''
| + | = Central tendency = |
− | | |
| | | |
− | | + | == Introduction == |
− | | |
− | | |
| | | |
− | '''Nominal variables''' are variables that have
| + | A measure of central tendency is a single value |
− | two or more categories but which do not have an intrinsic order. For
| + | that attempts to describe a set of data by identifying the central |
− | example, a real estate agent could classify their types of property
| + | position within that set of data. As such, measures of central |
− | into distinct categories such as houses, condos, co-ops or bungalows.
| + | tendency are sometimes called measures of central location. They are |
− | So "type of property" is a nominal variable with 4
| + | also classed as summary statistics. The mean (often called the |
− | categories called houses, condos, co-ops and bungalows. Of note, the
| + | average) is most likely the measure of central tendency that you are |
− | different categories of a nominal variable can also be referred to as
| + | most familiar with, but there are others, such as the median and the |
− | groups or levels of the nominal variable. Another example of a
| + | mode. |
− | nominal variable would be classifying where people live in Karnataka
| |
− | by district. In this case there will be many more levels of the
| |
− | nominal variable (30 in fact).
| |
| | | |
| | | |
− | '''Dichotomous variables''' are nominal
| + | The mean, median and mode are all valid measures |
− | variables which have only two categories or levels. For example, if
| + | of central tendency but, under different conditions, some measures of |
− | we were looking at gender, we would most probably categorize somebody
| + | central tendency become more appropriate to use than others. In the |
− | as either "male" or "female". This is an example
| + | following sections we will look at the mean, mode and median and |
− | of a dichotomous variable (and also a nominal variable). Another | + | learn how to calculate them and under what conditions they are most |
− | example might be if we asked a person if they owned a mobile phone.
| + | appropriate to be used. |
− | Here, we may categorise mobile phone ownership as either "Yes"
| |
− | or "No". In the real estate agent example, if type of
| |
− | property had been classified as either residential or commercial then
| |
− | "type of property" would be a dichotomous variable.
| |
| | | |
| | | |
− | '''Ordinal variables''' are variables that have
| + | == Objectives == |
− | two or more categories just like nominal variables only the
| + | |
− | categories can also be ordered or ranked. So if you asked someone if
| + | * Understand and know that a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. |
− | they liked the policies of the Democratic Party and they could answer
| + | * Understand that the mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others. |
− | either "Not very much", "They are OK" or "Yes,
| + | * Learn to calculate the mean and median, analyse data and draw conclusions. |
− | a lot" then you have an ordinal variable. Why? Because you have
| + | |
− | 3 categories, namely "Not very much", "They are OK"
| + | == Mean (Arithmetic) == |
− | and "Yes, a lot" and you can rank them from the most | |
− | positive (Yes, a lot), to the middle response (They are OK), to the
| |
− | least positive (Not very much). However, whilst we can rank the
| |
− | levels, we cannot place a "value" to them; we cannot say
| |
− | that "They are OK" is twice as positive as "Not very
| |
− | much" for example.
| |
− | | |
| | | |
| + | The mean (or average) is the most popular and well |
| + | known measure of central tendency. It can be used with both discrete |
| + | and continuous data, although its use is most often with continuous |
| + | data. The mean is equal to the sum of all the values in the data set |
| + | divided by the number of values in the data set. So, if we have n |
| + | values in a data set and they have values x<sub>1</sub>, x<sub>2</sub>, |
| + | ..., x<sub>n</sub>, then the sample mean, usually denoted by |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_174cec39.gif]] |
| + | (pronounced x bar), is: |
| | | |
− | | + | |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_69b2cf9e.gif]] |
| | | |
| | | |
− | Continuous variables are also known as
| + | This formula is usually written in a slightly |
− | quantitative variables. Continuous variables can be further
| + | different manner using the Greek capital letter, Σ, |
− | categorized as either interval or ratio variables.
| + | pronounced "sigma", which means "sum of...": |
| | | |
| | | |
− | | + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_m50e9a786.gif]] |
− | | |
| | | |
| | | |
− | '''Interval variables''' are variables for which
| + | You may have noticed that the above formula refers |
− | their central characteristic is that they can be measured along a
| + | to the sample mean. So, why have we called it a sample mean? |
− | continuum and they have a numerical value (for example, temperature
| + | This is because, in statistics, samples and populations have very |
− | measured in degrees Celsius or Fahrenheit). So the difference between
| + | different meanings and these differences are very important, even if, |
− | 20C and 30C is the same as 30C to 40C. However, temperature measured
| + | in the case of the mean, they are calculated in the same way. To |
− | in degrees Celsius or Fahrenheit is NOT a ratio variable.
| + | acknowledge that we are calculating the population mean and not the |
| + | sample mean, we use the Greek lower case letter "mu", |
| + | denoted as µ: |
| | | |
| | | |
− | '''Ratio variables''' are interval variables but
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_7b1e9596.gif]] |
− | with the added condition that 0 (zero) of the measurement indicates
| |
− | that there is none of that variable. So, temperature measured in
| |
− | degrees Celsius or Fahrenheit is not a ratio variable because 0C does
| |
− | not mean there is no temperature. However, temperature measured in
| |
− | Kelvin is a ratio variable as 0 Kelvin (often called absolute zero)
| |
− | indicates that there is no temperature whatsoever. Other examples of
| |
− | ratio variables include height, mass, distance and many more. The
| |
− | name "ratio" reflects the fact that you can use the ratio
| |
− | of measurements. So, for example, a distance of ten metres is twice
| |
− | the distance of 5 metres.
| |
| | | |
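For readers viewing this page without the formula images, the expressions referred to above can be written out in standard notation. This is a transcription of the usual textbook form under the definitions given in the text (n values x<sub>1</sub>, ..., x<sub>n</sub>), not a copy of the images themselves; in particular, the use of n in the population formula is an assumption.

```latex
% Sample mean, written out in full
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}

% The same formula using summation notation
\bar{x} = \frac{\sum x}{n}

% Population mean: calculated in the same way, but denoted by mu
\mu = \frac{\sum x}{n}
```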
| | | |
− | | + | The mean is essentially a model of your data set. |
− | | + | It is a single value chosen to represent the whole data set. You will |
| + | notice, however, that the mean is often not one of the actual values |
| + | that you have observed in your data set. However, one of its important |
| + | properties is that it minimises error in the prediction of any one |
| + | value in your data set. That is, it is the value that produces the |
| + | lowest total squared error from all the values in the data set. |
| | | |
| | | |
− | === Ambiguities in classifying a type of variable ===
| + | An important property of the mean is that it |
| + | includes every value in your data set as part of the calculation. In |
| + | addition, the mean is the only measure of central tendency where the |
| + | sum of the deviations of each value from the mean is always zero. |
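The two properties described above can be checked numerically. The sketch below uses a small hypothetical data set, and the helper names `mean` and `sum_squared_error` are ours; "error" is taken here to mean the sum of squared deviations, which is an assumption about what the text intends.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

def sum_squared_error(values, guess):
    """Total squared distance between each value and a single guess."""
    return sum((v - guess) ** 2 for v in values)

data = [2, 4, 6, 8, 10]  # hypothetical data set
m = mean(data)

# Deviations of each value from the mean cancel out exactly
deviations = [v - m for v in data]
print(m, sum(deviations))

# Any guess other than the mean gives a larger total squared error
for guess in (5, m, 7):
    print(guess, sum_squared_error(data, guess))
```

With these numbers the mean is 6, the deviations sum to zero, and the squared error is 40 at the mean but 45 at either 5 or 7.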
| + | |
| | | |
| | | |
Line 2,464: |
Line 2,483: |
| | | |
| | | |
− | In some cases, the measurement scale for data is
| + | '''When not to use the mean''' |
− | ordinal but the variable is treated as continuous. For example, a
| |
− | Likert scale that contains five values - strongly agree, agree,
| |
− | neither agree nor disagree, disagree, and strongly disagree - is
| |
− | ordinal. However, where a Likert scale contains seven or more value -
| |
− | strongly agree, moderately agree, agree, neither agree nor disagree,
| |
− | disagree, moderately disagree, and strongly disagree - the underlying
| |
− | scale is sometimes treated as continuous although where you should do
| |
− | this is a cause of great dispute.
| |
| | | |
| | | |
| + | The mean has one main disadvantage: it is |
| + | particularly susceptible to the influence of outliers. These are |
| + | values that are unusual compared to the rest of the data set by being |
| + | especially small or large in numerical value. For example, consider |
| + | the wages of staff at a factory below: |
| | | |
| + | |
| + | {| border="1" |
| + | |- |
| + | | |
| + | Staff |
| | | |
| + | |
| + | | |
| + | 1 |
| | | |
| | | |
| + | | |
| + | 2 |
| | | |
| + | |
| + | | |
| + | 3 |
| | | |
| + | |
| + | | |
| + | 4 |
| | | |
| | | |
− | == Enrichment Activities ==
| + | | |
| + | 5 |
| + | |
| | | |
− | = Central tendency =
| + | | |
| + | 6 |
| + | |
| | | |
− | == Introduction ==
| + | | |
| + | 7 |
| + | |
| | | |
− | A measure of central tendency is a single value
| + | | |
− | that attempts to describe a set of data by identifying the central
| + | 8 |
− | position within that set of data. As such, measures of central
| |
− | tendency are sometimes called measures of central location. They are
| |
− | also classed as summary statistics. The mean (often called the
| |
− | average) is most likely the measure of central tendency that you are
| |
− | most familiar with, but there are others, such as, the median and the
| |
− | mode.
| |
| | | |
| | | |
− | The mean, median and mode are all valid measures
| + | | |
− | of central tendency but, under different conditions, some measures of
| + | 9 |
− | central tendency become more appropriate to use than others. In the
| |
− | following sections we will look at the mean, mode and median and
| |
− | learn how to calculate them and under what conditions they are most
| |
− | appropriate to be used.
| |
| | | |
| | | |
− | == Objectives ==
| + | | |
| + | 10 |
| + | |
| | | |
− | * Understand and know that a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
| + | |- |
− | * Understand that the mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others.
| + | | |
− | * Learn to calculation of mean and median and analyse data and make conclusions.
| + | Salary |
| + | |
| | | |
− | == Mean (Arithmetic) ==
| + | | |
| + | 15k |
| + | |
| | | |
− | The mean (or average) is the most popular and well
| + | | |
− | known measure of central tendency. It can be used with both discrete
| + | 18k |
− | and continuous data, although its use is most often with continuous
| |
− | data. The mean is equal to the sum of all the values in the data set
| |
− | divided by the number of values in the data set. So, if we have n
| |
− | values in a data set and they have values x<sub>1</sub>, x<sub>2</sub>,
| |
− | ..., x<sub>n</sub>, then the sample mean, usually denoted by
| |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_174cec39.gif]]
| |
− | (pronounced x bar), is:
| |
| | | |
| | | |
| + | | |
| + | 16k |
| | | |
| + | |
| + | | |
| + | 14k |
| | | |
| + | |
| + | | |
| + | 15k |
| | | |
| | | |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_69b2cf9e.gif]]
| + | | |
| + | 15k |
| | | |
| | | |
− | This formula is usually written in a slightly
| + | | |
− | different manner using the Greek capitol letter, Σ,
| + | 12k |
− | pronounced "sigma", which means "sum of...":
| |
| | | |
| | | |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_m50e9a786.gif]]
| + | | |
| + | 17k |
| | | |
| | | |
− | You may have noticed that the above formula refers
| + | | |
− | to the sample mean. So, why call have we called it a sample mean?
| + | 90k |
− | This is because, in statistics, samples and populations have very
| |
− | different meanings and these differences are very important, even if,
| |
− | in the case of the mean, they are calculated in the same way. To
| |
− | acknowledge that we are calculating the population mean and not the
| |
− | sample mean, we use the Greek lower case letter "mu",
| |
− | denoted as µ:
| |
| | | |
| | | |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_7b1e9596.gif]]
| + | | |
| + | 95k |
| | | |
| | | |
− | The mean is essentially a model of your data set. | + | |} |
− | It is the value that is most common. You will notice, however, that
| + | The mean salary for these ten staff is $30.7k. |
− | the mean is not often one of the actual values that you have observed
| + | However, inspecting the raw data suggests that this mean value might |
− | in your data set. However, one of its important properties is that it
| + | not be the best way to accurately reflect the typical salary of a |
− | minimises error in the prediction of any one value in your data set.
| + | worker, as most workers have salaries in the $12k to $18k range. The |
− | That is, it is the value that produces the lowest amount of error
| + | mean is being skewed by the two large salaries. Therefore, in this |
− | from all other values in the data set.
| + | situation we would like to have a better measure of central tendency. |
| + | As we will find out later, taking the median would be a better |
| + | measure of central tendency in this situation. |
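The effect of the two large salaries can be reproduced with Python's standard `statistics` module. The salary figures are copied from the table above; the variable names, and the idea of dropping the two largest salaries for comparison, are illustrative choices of ours.

```python
from statistics import mean, median

# Salaries in thousands of dollars, copied from the table above
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(mean(salaries))    # 30.7
print(median(salaries))  # 15.5

# For comparison: the mean of the eight salaries without the two largest
without_top_two = sorted(salaries)[:-2]
print(mean(without_top_two))  # 15.25
```

The median of 15.5k sits inside the $12k to $18k range where most of the staff fall, whereas the mean of 30.7k matches nobody's salary.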
| | | |
| | | |
− | An important property of the mean is that it
| + | Another time when we usually prefer the median |
− | includes every value in your data set as part of the calculation. In
| + | over the mean (or mode) is when our data is skewed (i.e. the |
− | addition, the mean is the only measure of central tendency where the
| + | frequency distribution for our data is skewed). If we consider the |
− | sum of the deviations of each value from the mean is always zero.
| + | normal distribution - as this is the most frequently assessed in |
− | | + | statistics - when the data is perfectly normal then the mean, median |
| + | and mode are identical. Moreover, they all represent the most typical |
| + | value in the data set. However, as the data becomes skewed the mean |
| + | loses its ability to provide the best central location for the data |
| + | as the skewed data is dragging it away from the typical value. |
| + | However, the median best retains this position and is not as strongly |
| + | influenced by the skewed values. This is explained in more detail in |
| + | the skewed distribution section later in this guide. |
| + | |
| | | |
| + | == Median == |
| + | |
| + | The median is the middle score for a set of data |
| + | that has been arranged in order of magnitude. The median is less |
| + | affected by outliers and skewed data. In order to calculate the |
| + | median, suppose we have the data below: |
| | | |
| + | |
| | | |
| | | |
− |
| |
− | '''When not to use the mean'''
| |
| | | |
− |
| + | |
− | The mean has one main disadvantage: it is
| |
− | particularly susceptible to the influence of outliers. These are
| |
− | values that are unusual compared to the rest of the data set by being
| |
− | especially small or large in numerical value. For example, consider
| |
− | the wages of staff at a factory below:
| |
− | | |
− |
| |
| {| border="1" | | {| border="1" |
| |- | | |- |
| | | | | |
− | Staff
| + | 65 |
| | | |
| | | |
| | | | | |
− | 1
| + | 55 |
| | | |
| | | |
| | | | | |
− | 2
| + | 89 |
| | | |
| | | |
| | | | | |
− | 3
| + | 56 |
| | | |
| | | |
| | | | | |
− | 4
| + | 35 |
| | | |
| | | |
| | | | | |
− | 5
| + | 14 |
| | | |
| | | |
| | | | | |
− | 6
| + | 56 |
| | | |
| | | |
| | | | | |
− | 7
| + | 55 |
| | | |
| | | |
| | | | | |
− | 8
| + | 87 |
| | | |
| | | |
| | | | | |
− | 9
| + | 45 |
| | | |
| | | |
| | | | | |
− | 10
| + | 92 |
| + | |
| + | |
| + | |} |
| + | |
| + | |
| + | |
| + | |
| + | We first need to rearrange that data into order of |
| + | magnitude (smallest first): |
| | | |
| | | |
| + | |
| + | |
| + | |
| + | |
| + | {| border="1" |
| |- | | |- |
| | | | | |
− | Salary
| + | 14 |
| | | |
| | | |
| | | | | |
− | 15k
| + | 35 |
| | | |
| | | |
| | | | | |
− | 18k
| + | 45 |
| | | |
| | | |
| | | | | |
− | 16k
| + | 55 |
| | | |
| | | |
| | | | | |
− | 14k
| + | 55 |
| | | |
| | | |
| | | | | |
− | 15k
| + | '''56''' |
| | | |
| | | |
| | | | | |
− | 15k
| + | 56 |
| | | |
| | | |
| | | | | |
− | 12k
| + | 65 |
| | | |
| | | |
| | | | | |
− | 17k
| + | 87 |
| | | |
| | | |
| | | | | |
− | 90k
| + | 89 |
| | | |
| | | |
| | | | | |
− | 95k
| + | 92 |
| | | |
| | | |
| |} | | |} |
− | The mean salary for these ten staff is $30.7k.
| |
− | However, inspecting the raw data suggests that this mean value might
| |
− | not be the best way to accurately reflect the typical salary of a
| |
− | worker, as most workers have salaries in the $12k to 18k range. The
| |
− | mean is being skewed by the two large salaries. Therefore, in this
| |
− | situation we would like to have a better measure of central tendency.
| |
− | As we will find out later, taking the median would be a better
| |
− | measure of central tendency in this situation.
| |
| | | |
− |
| + | |
− | Another time when we usually prefer the median
| |
− | over the mean (or mode) is when our data is skewed (i.e. the
| |
− | frequency distribution for our data is skewed). If we consider the
| |
− | normal distribution - as this is the most frequently assessed in
| |
− | statistics - when the data is perfectly normal then the mean, median
| |
− | and mode are identical. Moreover, they all represent the most typical
| |
− | value in the data set. However, as the data becomes skewed the mean
| |
− | loses its ability to provide the best central location for the data
| |
− | as the skewed data is dragging it away from the typical value.
| |
− | However, the median best retains this position and is not as strongly
| |
− | influenced by the skewed values. This is explained in more detail in
| |
− | the skewed distribution section later in this guide.
| |
| | | |
| | | |
− | == Median ==
| + | Our median mark is the middle mark - in this case |
− |
| + | 56 (highlighted in bold). It is the middle mark because there are 5 |
− | The median is the middle score for a set of data
| + | scores before it and 5 scores after it. This works fine when you have |
− | that has been arranged in order of magnitude. The median is less
| + | an odd number of scores but what happens when you have an even number |
− | affected by outliers and skewed data. In order to calculate the
| + | of scores? What if you had only 10 scores? Well, you simply have to |
− | median, suppose we have the data below:
| + | take the middle two scores and average the result. So, if we look at |
| + | the example below: |
| | | |
| | | |
Line 2,710: |
Line 2,743: |
| | | |
| | | |
− |
| + | |
| {| border="1" | | {| border="1" |
| |- | | |- |
Line 2,751: |
Line 2,784: |
| | | | | |
| 45 | | 45 |
− |
| |
− |
| |
− | |
| |
− | 92
| |
| | | |
| | | |
Line 2,762: |
Line 2,791: |
| | | |
| | | |
− | We first need to rearrange that data into order of | + | We again rearrange that data into order of |
| magnitude (smallest first): | | magnitude (smallest first): |
− |
| |
− |
| |
| | | |
| | | |
Line 2,789: |
Line 2,816: |
| | | |
| | | | | |
− | 55 | + | '''55''' |
| | | |
| | | |
Line 2,821: |
Line 2,848: |
| | | |
| | | |
− | Our median mark is the middle mark - in this case
| + | Only now we have to take the 5th and 6th scores in |
− | 56 (highlighted in bold). It is the middle mark because there are 5
| + | our data set and average them to get a median of 55.5. |
− | scores before it and 5 scores after it. This works fine when you have
| |
− | an odd number of scores but what happens when you have an even number
| |
− | of scores? What if you had only 10 scores? Well, you simply have to
| |
− | take the middle two scores and average the result. So, if we look at | |
− | the example below:
| |
| | | |
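The odd and even cases described in the Median section can be sketched as a short function. The function below is our own minimal version (Python's built-in `statistics.median` does the same job); the two data sets are the eleven and ten scores from the tables above.

```python
def median(scores):
    """Middle value of the sorted scores; averages the two middle values
    when the number of scores is even."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

eleven_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_scores = eleven_scores[:-1]  # the same data without the final score, 92

print(median(eleven_scores))  # 56
print(median(ten_scores))     # 55.5
```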
| | | |
| + | == Mode == |
| + | |
| + | The mode is the most frequent score in our data |
| + | set. It corresponds to the highest bar in a bar chart or |
| + | histogram. You can, therefore, sometimes consider the mode as being |
| + | the most popular option. An example of a mode is presented below: |
| | | |
− | | + | |
− | | + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]] |
− |
| |
− | {| border="1"
| |
− | |-
| |
− | |
| |
− | 65
| |
| | | |
| | | |
− | |
| + | Normally, the mode is used for categorical data |
− | 55
| + | where we wish to know which is the most common category as |
| + | illustrated below: |
| | | |
| | | |
− | |
| + | We can see above that the most common form of |
− | 89
| + | transport, in this particular data set, is the bus. However, one of |
| + | the problems with the mode is that it is not unique, so it leaves us |
| + | with problems when we have two or more values that share the highest |
| + | frequency, such as below: |
| | | |
− |
| + | |
− | |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]] |
− | 56
| |
| | | |
− |
| + | We are now stuck as to which mode best describes |
− | |
| + | the central tendency of the data. This is particularly problematic |
− | 35
| + | when we have continuous data, as we are more likely not to have any |
− | | + | one value that is more frequent than the other. For example, consider |
− |
| + | measuring 30 people's weight (to the nearest 0.1 kg). How likely is |
− | |
| + | it that we will find two or more people with '''exactly''' |
− | 14
| + | the same weight, e.g. 67.4 kg? The answer is: probably very unlikely |
| + | - many people might be close but with such a small sample (30 people) |
| + | and a large range of possible weights you are unlikely to find two |
| + | people with exactly the same weight, that is, to the nearest 0.1 kg. |
| + | This is why the mode is very rarely used with continuous data. |
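A minimal sketch of finding the mode, including the case where several values tie. The helper name `modes` and all three data sets are hypothetical and chosen only to illustrate the points made above; they are not the data behind the charts.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency, in first-seen order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical categorical data: one clear mode
transport = ["bus", "car", "bus", "train", "walk", "bus", "car"]
print(modes(transport))  # ['bus']

# Two values tie for the highest frequency, so the mode is not unique
tied = [3, 5, 5, 7, 7, 9]
print(modes(tied))  # [5, 7]

# Weights to the nearest 0.1 kg: no value repeats, so every value is a "mode"
weights = [67.4, 70.1, 58.9, 81.3]
print(len(modes(weights)))  # 4
```

Returning a list rather than a single value makes ties visible instead of silently picking one of them.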
| | | |
− |
| + | Another problem with the mode is that it will not |
− | |
| + | provide us with a very good measure of central tendency when the most |
− | 56
| + | common mark is far away from the rest of the data in the data set, as |
| + | depicted in the diagram below: |
| | | |
| | | |
− | |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]] |
− | 55
| |
| | | |
| | | |
− | |
| + | In the above diagram the mode has a value of 2. We |
− | 87
| + | can clearly see, however, that the mode is not representative of the |
| + | data, which is mostly concentrated around the 20 to 30 value range. |
| + | To use the mode to describe the central tendency of this data set |
| + | would be misleading. |
| | | |
| | | |
− | |
| + | == Skewed Distributions and the Mean and Median == |
− | 45
| |
− | | |
| | | |
− | |}
| + | We often test whether our data is normally distributed as this is a |
| + | common assumption underlying many statistical tests. An example of a |
| + | normally distributed set of data is presented below: |
| + | |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]] |
| | | |
| | | |
| | | |
− |
| + | When you have a normally distributed sample you |
− | We again rearrange that data into order of
| + | can legitimately use either the mean or the median as your measure of |
− | magnitude (smallest first):
| + | central tendency. In fact, in any symmetrical unimodal distribution the mean, |
− | | + | median and mode are equal. However, in this situation, the mean is |
− | | + | widely preferred as the best measure of central tendency as it is the |
− | | + | measure that includes all the values in the data set for its |
− |
| + | calculation, and any change in any of the scores will affect the |
− | {| border="1"
| + | value of the mean. This is not the case with the median or mode. |
− | |-
| |
− | |
| |
− | 14
| |
| | | |
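The difference in sensitivity between the mean and the median can be seen with a small experiment. Both data sets below are hypothetical: the first is symmetric, and the second replaces its largest score with an extreme value to create a right skew.

```python
from statistics import mean, median

# Hypothetical scores, symmetric around 50: mean and median agree
symmetric = [30, 40, 50, 60, 70]
print(mean(symmetric), median(symmetric))

# Changing a single score moves the mean but leaves the median where it was
right_skewed = [30, 40, 50, 60, 270]
print(mean(right_skewed), median(right_skewed))
```

In the symmetric case both measures are 50. After the change the mean rises to 90, which is larger than four of the five scores, while the median is still 50.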
| | | |
− | |
| + | However, when our data is skewed, for example, as |
− | 35
| + | with the right-skewed data set below: |
| | | |
| | | |
− | |
| + | [[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]] |
− | 45
| |
− | | |
− |
| |
− | |
| |
− | 55
| |
− | | |
− |
| |
− | |
| |
− | '''55'''
| |
− | | |
− |
| |
− | |
| |
− | '''56'''
| |
− | | |
− |
| |
− | |
| |
− | 56
| |
− | | |
− |
| |
− | |
| |
− | 65
| |
− | | |
− |
| |
− | |
| |
− | 87
| |
− | | |
− |
| |
− | |
| |
− | 89
| |
− | | |
− |
| |
− | |
| |
− | 92
| |
− | | |
− |
| |
− | |}
| |
− | | |
− | | |
− | | |
− |
| |
− | Only now we have to take the 5th and 6th score in
| |
− | our data set and average them to get a median of 55.5.
| |
− | | |
− |
| |
− | == Mode ==
| |
− |
| |
− | The mode is the most frequent score in our data
| |
− | set. On a histogram it represents the highest bar in a bar chart or
| |
− | histogram. You can, therefore, sometimes consider the mode as being
| |
− | the most popular option. An example of a mode is presented below:
| |
− | | |
− |
| |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]]
| |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | Normally, the mode is used for categorical data
| |
− | where we wish to know which is the most common category as
| |
− | illustrated below:
| |
− | | |
− |
| |
− | We can see above that the most common form of
| |
− | transport, in this particular data set, is the bus. However, one of
| |
− | the problems with the mode is that it is not unique, so it leaves us
| |
− | with problems when we have two or more values that share the highest
| |
− | frequency, such as below:
| |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]]
| |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | We are now stuck as to which mode best describes
| |
− | the central tendency of the data. This is particularly problematic
| |
− | when we have continuous data, as we are more likely not to have any
| |
− | one value that is more frequent than the other. For example, consider
| |
− | measuring 30 peoples' weight (to the nearest 0.1 kg). How likely is
| |
− | it that we will find two or more people with '''exactly'''
| |
− | the same weight, e.g. 67.4 kg? The answer, is probably very unlikely
| |
− | - many people might be close but with such a small sample (30 people)
| |
− | and a large range of possible weights you are unlikely to find two
| |
− | people with exactly the same weight, that is, to the nearest 0.1 kg.
| |
− | This is why the mode is very rarely used with continuous data.
| |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | Another problem with the mode is that it will not
| |
− | provide us with a very good measure of central tendency when the most
| |
− | common mark is far away from the rest of the data in the data set, as
| |
− | depicted in the diagram below:
| |
− | | |
− |
| |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]]
| |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | In the above diagram the mode has a value of 2. We
| |
− | can clearly see, however, that the mode is not representative of the
| |
− | data, which is mostly concentrated around the 20 to 30 value range.
| |
− | To use the mode to describe the central tendency of this data set
| |
− | would be misleading.
| |
− | | |
− |
| |
− | == Skewed Distributions and the Mean and Median ==
| |
− |
| |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]]We
| |
− | often test whether our data is normally distributed as this is a
| |
− | common assumption underlying many statistical tests. An example of a
| |
− | normally distributed set of data is presented below:
| |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | When you have a normally distributed sample you
| |
− | can legitimately use both the mean or the median as your measure of
| |
− | central tendency. In fact, in any symmetrical distribution the mean,
| |
− | median and mode are equal. However, in this situation, the mean is
| |
− | widely preferred as the best measure of central tendency as it is the
| |
− | measure that includes all the values in the data set for its
| |
− | calculation, and any change in any of the scores will affect the
| |
− | value of the mean. This is not the case with the median or mode.
| |
− | | |
− |
| |
− | However, when our data is skewed, for example, as
| |
− | with the right-skewed data set below:
| |
− | | |
− |
| |
− | [[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]] | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
− | | |
− | | |
− |
| |
− | | |
| | | |
| | | |