Changes

Statistics (view source)

Revision as of 09:49, 4 January 2013

528 bytes removed, 09:49, 4 January 2013

no edit summary

Line 1: Line 1: −

−

~~'''Statistics'''~~

−

= Introduction =

Line 70: Line 59:

=== Descriptive and Inferential Statistics ===

−

When analysing data, for example, the marks

Line 1,698: Line 1,683:

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]]

+

36 25 38 46 55 68

+

72 55 36 38

+

67 45 22 48 91 46

+

52 61 58 55

+

=== How do you construct a histogram from a continuous variable? ===

−

+

To construct a

+

histogram from a continuous variable you first need to split the data

+

into intervals, called bins. In the example above, age has been split

+

into bins, with each bin representing a 10-year period starting at 20

+

years. Each bin contains the number of occurrences of scores in the

+

data set that are contained within that bin. For the above data set,

+

the frequencies in each bin have been tabulated along with the scores

+

that contributed to the frequency in each bin (see below):

Line 1,716: Line 1,717:

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_6201ec25.png]]~~

+

Bin Frequency Scores

+

Included in Bin

−

+

20-30 2 25,22

−

+

30-40 4 36,38,36,38

−

+

40-50 4 46,45,48,46

−

+

50-60 5 55,55,52,58,55

−

+

60-70 3 68,67,61

−

+

70-80 1 72

−

+

80-90 0 -

−

+

90-100 1 91
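The tabulation above can be reproduced with a short script. This is only a sketch: the names `ages`, `bin_label` and `frequencies` are made up for illustration, and it assumes that a score lying exactly on a boundary is placed in the higher bin (none of the scores in this data set actually fall on a boundary).

```python
from collections import Counter

# The 20 scores from the example data set
ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
        67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

BIN_WIDTH = 10  # each bin covers a 10-year period
START = 20      # the first bin starts at 20 years


def bin_label(value, start=START, width=BIN_WIDTH):
    """Return the 'lo-hi' label of the bin with lo <= value < hi."""
    lo = start + ((value - start) // width) * width
    return f"{lo}-{lo + width}"


# Count how many scores fall into each bin
frequencies = Counter(bin_label(a) for a in ages)

# Print every bin in order, including empty ones
for lo in range(START, 100, BIN_WIDTH):
    label = f"{lo}-{lo + BIN_WIDTH}"
    print(label, frequencies.get(label, 0))
```

Changing `BIN_WIDTH` is a quick way to see how the choice of bin width changes the shape of the frequency distribution.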

Line 1,746: Line 1,748:

−

+

Notice that, unlike a

+

bar chart, there are no "gaps" between the bars (although

+

some bars might be "absent" reflecting no frequencies).

+

This is because a histogram represents a continuous data set, and as

+

such, there are no gaps in the data. (Although you will have to

+

decide whether you round up or round down scores on the boundaries of

+

bins)

Line 1,752: Line 1,760:

−

+

=== Choosing the correct bin width ===

−

+

There is no right or

+

wrong answer as to how wide a bin should be, but there are rules of

+

thumb. You need to make sure that the bins are not too small or too

+

large. Consider the histogram we produced earlier (see above): the

+

following histograms use the same data but have either much smaller

+

or larger bins, as shown below:

−

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]]

−

+

We can see from the

−

+

histogram on the left, that the bin width is too small as it shows

−

+

too much individual data and does not allow the underlying pattern

−

~~36 25 38 46 55 68~~

+

(frequency distribution) of the data to be easily seen. At the other

−

~~72 55 36 38~~

+

end of the scale, is the diagram on the right, where the bins are too

+

large and, again, we are unable to find the underlying trend in the

+

data.

−

~~67 45 22 48 91 46~~

+

Histograms are based on

−

~~52 61 58 55~~

+

area not height of bars

Line 1,775: Line 1,791:

−

~~=== How do you construct~~ a histogram ~~from~~ a ~~continuous variable? ===~~

+

In a histogram, it is

+

the area of the bar that indicates the frequency of occurrences for

+

each bin. This means that the height of the bar does not necessarily

+

indicate how many occurrences of scores there were within each

+

individual bin. It is the product of height multiplied by the width

+

of the bin that indicates the frequency of occurrences within that

+

bin. One of the reasons that the height of the bars is often

+

incorrectly assessed as indicating frequency and not the area of the

+

bar is due to the fact that a lot of histograms often have equally

+

spaced bars (bins) and, under these circumstances, the height of the

+

bin does reflect the frequency.
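A small numerical sketch may help here. It reuses the bin frequencies from the earlier table but, purely as an assumption for illustration, merges the last three bins into one wider bin (70-100). The bar height is then the frequency divided by the bin width, so that height multiplied by width gives back the frequency.

```python
# (lower edge, upper edge, frequency); the final bin is a hypothetical
# merge of the 70-80, 80-90 and 90-100 bins from the table above
bins = [(20, 30, 2), (30, 40, 4), (40, 50, 4),
        (50, 60, 5), (60, 70, 3), (70, 100, 2)]


def bar_height(lo, hi, frequency):
    """Height of a histogram bar whose AREA equals the frequency."""
    return frequency / (hi - lo)


heights = [bar_height(lo, hi, f) for lo, hi, f in bins]

# Area = height x width recovers the frequency of each bin
areas = [h * (hi - lo) for h, (lo, hi, _) in zip(heights, bins)]

for (lo, hi, f), h in zip(bins, heights):
    print(f"{lo}-{hi}: frequency={f}, height={h:.3f}")
```

The wide 70-100 bin holds as many scores as the 20-30 bin, yet its bar is only a third as tall, because the same area is spread over three times the width.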

+

−

~~To construct~~ a

+

=== What is the difference between a bar chart and a histogram? ===

−

histogram ~~from a continuous variable you first need to split the data~~

−

~~into intervals, called bins. In the example above, age has been split~~

−

~~into bins, with each bin representing a 10-year period starting at 20~~

−

~~years. Each bin contains the number of occurrences of scores in the~~

−

~~data set that are contained within that bin. For the above data set,~~

−

~~the frequencies in each bin have been tabulated along with the scores~~

−

~~that contributed to the frequency in each bin (see below):~~

−

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_6dfca87b.png]]

−

~~Bin Frequency Scores~~

+

The major difference is

−

~~Included~~ in ~~Bin~~

+

that a histogram is only used to plot the frequency of score

+

occurrences in a continuous data set that has been divided into

+

classes, called bins. Bar charts, on the other hand, can be used for

+

a great deal of other types of variables including ordinal and

+

nominal data sets.

−

~~20-30 2 25,22~~

+

−

~~30-40 4 36,38,36,38~~

+

−

~~40-50 4 46,45,48,46~~

+

−

~~50-60 5 55,55,52,58,55~~

+

== Circle or Pie Chart ==

−

60-~~70 3 68,67~~,61

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_461389d1.png]]These

+

are called circle graphs. A circle graph shows the relationship

+

between a whole and its parts. Here, the whole circle is divided into

+

sectors. The size of each sector is proportional to the activity or

+

information it represents.

−

~~70-80 1 72~~

−

~~80-90 0 -~~

−

~~90-100 1 91~~

+

A variety of graphical

+

representations of data are now possible using spreadsheet software.

+

OpenOffice CALC can convert a table of data into bar charts, pie

+

charts, area charts etc. and make data much easier to

+

read/interpret.

+

== Activities ==

+

=== Activity 2: Histogram and Bar Chart ===

−

~~Notice that, unlike a~~

+

==== Learning Objectives ====

−

~~bar chart, there are no "gaps" between the bars (although~~

−

~~some bars might be "absent" reflecting no frequencies).~~

−

~~This is because a histogram represents a continuous data set, and as~~

−

~~such, there are no gaps in the data. (Although you will have to~~

−

~~decide whether you round up or round down scores on the boundaries of~~

−

~~bins)~~

−

+

Learn to draw a histogram and bar chart.

+

Understand the difference between a bar chart and a histogram and be

+

able to select the appropriate chart by looking at the problem and

+

data.

−

=== ~~Choosing the correct bin width~~ ===

+

==== Materials and Resources Required ====

−

+

Paper and Pencil

−

~~There is no right or~~

+

==== Pre-requisites/ Instructions ====

−

~~wrong answer as to how wide a bin should be, but there are rules of~~

−

~~thumb. You need to make sure that the bins are not too small or too~~

−

~~large. Consider the histogram we produced earlier (see above): the~~

−

~~following histograms use the same data but have either much smaller~~

−

~~or larger bins, as shown below:~~

−

+

==== Method ====

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_75ab55c3.png]]~~

+

Solve the problems A and B

−

~~We can see from~~ the

+

A> In the past year, you have recorded the

−

~~histogram on the left~~, ~~that the bin width is too small as it shows~~

+

number of tickets that a movie theater has sold during each month.

−

~~too much individual data and does not allow~~ the ~~underlying pattern~~

+

To represent this data set graphically, would you construct a bar

−

~~(frequency distribution)~~ of ~~the data to be easily seen~~. ~~At the other~~

+

graph or a histogram? Why is this choice better than the other?

−

~~end of the scale~~, is the ~~diagram on~~ the ~~right~~, ~~where~~ the ~~bins are too~~

+

Using the following data, construct the graph that you choose.

−

~~large and, again, we are unable to find the underlying trend in the~~

−

~~data~~.

−

+

−

~~Histograms are based on~~

+

{| border="1"

−

~~area not height of bars~~

+

|-

+

|

+

Month

−

+

|

+

Number of Tickets Sold

−

~~In a histogram, it is~~

+

|-

−

~~the area of the bar that indicates the frequency of occurrences for~~

+

|

−

~~each bin. This means that the height of the bar does not necessarily~~

+

January

−

~~indicate how many occurrences of scores there were within each~~

−

~~individual bin. It is the product of height multiplied by the width~~

−

~~of the bin that indicates the frequency of occurrences within that~~

−

~~bin. One of the reasons that the height of the bars is often~~

−

~~incorrectly assessed as indicating frequency and not the area of the~~

−

~~bar is due to the fact that a lot of histograms often have equally~~

−

~~spaced bars (bins) and, under these circumstances, the height of the~~

−

~~bin does reflect the frequency.~~

−

+

|

+

25

−

~~=== What is the difference between a bar chart and a histogram? ===~~

+

|-

+

|

+

February

+

−

~~[[Image:KOER-%20Mathematics%~~20~~-%20Statistics_html_6dfca87b.png]]~~

+

|

+

20

−

~~The major difference is~~

+

|-

−

~~that a histogram is only used to plot the frequency of score~~

+

|

−

~~occurrences in a continuous data set that has been divided into~~

+

March

−

~~classes, called bins. Bar charts, on the other hand, can be used for~~

−

~~a great deal of other types of variables including ordinal and~~

−

~~nominal data sets.~~

+

|

+

15

+

|-

+

|

+

April

+

|

+

20

+

|-

+

|

+

May

+

|

+

30

+

|-

+

|

+

June

−

~~== Circle or Pie Chart ==~~

+

|

+

35

+

−

~~[[Image:KOER~~-~~%20Mathematics%20-%20Statistics_html_461389d1.png]]These~~

+

|-

−

~~are called circle graphs. A circle graph shows the relationship~~

+

|

−

~~between a whole and its parts. Here, the whole circle is divided into~~

+

July

−

~~sectors. The size of each sector is proportional to the activity or~~

−

~~information it represents.~~

+

|

+

40

+

|-

+

|

+

August

−

~~A variety of graphical~~

+

|

−

~~representations of data are now possible using spreadsheet software.~~

+

20

−

~~OpenOffice CALC can convert a table of data into bar charts, pie~~

−

~~charts, area charts etc and make data much more easy to~~

−

~~read/interpret.~~

+

|-

+

|

+

September

+

|

+

25

−

~~== Activities ==~~

−

~~=== Activity 2: Histogram and Bar Chart ===~~

+

|-

+

|

+

October

+

−

~~==== Learning Objectives ====~~

+

|

−

+

15

−

~~Learn to draw a histogram and bar chart.~~

−

~~Understand the difference between a bar chart and a histogram and be~~

−

~~able to select the appropriate chart by looking at the problem and~~

−

~~data.~~

−

~~==== Materials and Resources Required ====~~

−

~~Paper and Pencil~~

−

~~==== Pre-requisites/ Instructions ====~~

−

~~==== Method ====~~

−

~~Solve the problems A and B~~

−

~~A> In the past year, you have recorded the~~

−

~~number of tickets that a movie theater has sold during each month.~~

−

~~To represent this data set graphically, would you construct a bar~~

−

~~graph or a histogram? Why is this choice better than the other?~~

−

~~Using the following data, construct the graph that you choose.~~

−

~~{| border="1"~~

|-

|

−

~~Month~~

+

November

|

−

~~Number of Tickets Sold~~

+

20

|-

|

−

~~January~~

+

December

|

−

25

+

30

−

|-

+

|}

−

|

+

−

~~February~~

+

−

|

+

B> For a recent

−

20

+

science project, you collected data regarding the distribution of

+

fish and aquatic life in a nearby pond. Your data consists of the

+

number of living creatures found in each 1 meter depth increment in

+

the pond. Construct a bar graph and several histograms (vary the

+

depth increment size) for the following data. In which case(s) is the

+

histogram the same as the bar graph? How do the other histograms vary

+

from the bar graph?

+

{| border="1"

|-

|

−

~~March~~

+

'''Depth Range'''

|

−

15

+

'''Number of Living Creatures '''

|-

|

−

~~April~~

+

0 – 1 meters

|

−

20

+

10

|-

|

−

~~May~~

+

1 – 2 meters

|

−

30

+

93

|-

|

−

~~June~~

+

2 – 3 meters

|

−

35

+

23

|-

|

−

~~July~~

+

3 – 4 meters

|

−

40

+

47

|-

|

−

~~August~~

+

4 – 5 meters

|

−

20

+

68

|-

|

−

~~September~~

+

5 – 6 meters

|

−

25

+

51

|-

|

−

~~October~~

+

6 – 7 meters

|

−

15

+

43

|-

|

−

~~November~~

+

7 – 8 meters

|

−

20

+

21

|-

|

−

~~December~~

+

8 – 9 meters

|

−

30

+

15

−

|}

+

|-

−

+

|

+

9 – 10 meters

−

+

|

+

8

−

~~B> For~~ a ~~recent~~

+

|}
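For checking work on problem B, the pond data can be regrouped into wider depth increments with a few lines of code. This is a sketch only: `rebin` and the other names are hypothetical helpers, and the bar height is taken as the count divided by the increment size, so that the area of each bar equals its count.

```python
# Number of living creatures in each 1 metre depth increment (0-1 m ... 9-10 m)
counts_per_metre = [10, 93, 23, 47, 68, 51, 43, 21, 15, 8]


def rebin(counts, width):
    """Merge consecutive 1 m counts into bins `width` metres deep.

    Returns a list of (lo, hi, frequency, height) tuples, where
    height = frequency / (hi - lo).
    """
    bins = []
    for lo in range(0, len(counts), width):
        chunk = counts[lo:lo + width]
        frequency = sum(chunk)
        hi = lo + len(chunk)
        bins.append((lo, hi, frequency, frequency / (hi - lo)))
    return bins


one_metre = rebin(counts_per_metre, 1)
two_metre = rebin(counts_per_metre, 2)

for lo, hi, frequency, height in two_metre:
    print(f"{lo}-{hi} m: count={frequency}, bar height={height}")
```

With 1 metre increments each bar height equals the raw count; with wider increments the counts are pooled and the heights are scaled by the increment size.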

−

~~science project, you collected~~ data ~~regarding the distribution~~ of

+

==== Evaluation ====

−

~~fish~~ and ~~aquatic life in a nearby pond~~. ~~Your data consists of~~ the

+

−

~~number~~ of ~~living creatures found in each 1 meter depth increment~~ in

+

# Does the student understand the difference between a bar chart and a histogram?

−

~~the pond~~. ~~Construct a bar graph and several histograms (vary the~~

+

# Does the student know when to use each of these charts, depending on whether the data are continuous or discrete?

−

~~depth increment size) for~~ the ~~following data~~. ~~In which case(s) is~~ the

+

−

~~histogram the same~~ as ~~the bar graph? How do the other histograms vary~~

+

== Evaluation ==

−

~~from the bar graph?~~

+

== Self-Evaluation ==

+

== Further Explorations ==

+

=== Types of Variables ===

+

All experiments examine some kind of variable(s).

+

A variable is not only something that we measure, but also something

+

that we can manipulate and something we can control for. To

+

understand the characteristics of variables and how we use them in

+

research, this guide is divided into three main sections. First, we

+

illustrate the role of dependent and independent variables. Second,

+

we discuss the difference between experimental and non-experimental

+

research. Finally, we explain how variables can be characterised as

+

either categorical or continuous.

+

=== Dependent and Independent Variables ===

−

~~{| border="1"~~

−

|-

−

|

−

~~'''Depth Range'''~~

−

|

+

An independent variable, sometimes called an

−

~~'''Number of Living Creatures '''~~

+

experimental or predictor variable, is a variable that is being

−

+

manipulated in an experiment in order to observe the effect on a

+

dependent variable, sometimes called an outcome variable.

+

−

|-

+

−

|

+

−

~~0 – 1 meters~~

−

|

+

Imagine that a tutor asks 100 students to complete

−

10

+

a maths test. The tutor wants to know why some students perform

+

better than others. Whilst the tutor does not know the answer to

+

this, she thinks that it might be because of two reasons: (1) some

+

students spend more time revising for their test; and (2) some

+

students are naturally more intelligent than others. As such, the

+

tutor decides to investigate the effect of revision time and

+

intelligence on the test performance of the 100 students. The

+

dependent and independent variables for the study are:

−

|-

+

−

|

+

−

~~1 – 2 meters~~

−

|

+

Dependent Variable: Test Mark (measured from 0 to

−

93

+

100)

−

|-

−

|

−

~~2 – 3 meters~~

−

+

−

|

−

23

−

|-

+

Independent Variables: Revision time (measured in

−

|

+

hours) Intelligence (measured using IQ score)

−

~~3 – 4 meters~~

−

|

−

47

−

|-

−

|

−

~~4 – 5 meters~~

−

|

−

68

−

|-

+

The dependent variable is simply that, a variable

−

|

+

that is dependent on an independent variable(s). For example, in our

−

~~5 – 6 meters~~

+

case the test mark that a student achieves is dependent on revision

+

time and intelligence. Whilst revision time and intelligence (the

+

independent variables) may (or may not) cause a change in the test

+

mark (the dependent variable), the reverse is implausible; in other

+

words, whilst the number of hours a student spends revising and the

+

higher a student's IQ score may (or may not) change the test mark

+

that a student achieves, a change in a student's test mark has no

+

bearing on whether a student revises more or is more intelligent

+

(this simply doesn't make sense).

−

|

−

51

−

|-

−

|

−

~~6 – 7 meters~~

−

|

−

43

−

|-

+

Therefore, the aim of the tutor's investigation is

−

|

+

to examine whether these independent variables - revision time and IQ

−

~~7 – 8 meters~~

+

- result in a change in the dependent variable, the students' test

+

scores. However, it is also worth noting that whilst this is the main

+

aim of the experiment, the tutor may also be interested to know if

+

the independent variables - revision time and IQ - are also connected

+

in some way.

−

|

+

−

21

+

−

|-

+

In the section on experimental and

−

|

+

non-experimental research that follows, we find out a little more

−

~~8 – 9 meters~~

+

about the nature of independent and dependent variables.

−

|

+

=== Experimental and Non-Experimental Research ===

−

15

+

−

|-

−

|

−

~~9 – 10 meters~~

−

|

−

8

−

|}

+

Experimental research: In experimental research,

−

~~==== Evaluation ====~~

+

the aim is to manipulate an independent variable(s) and then examine

−

+

the effect that this change has on a dependent variable(s). Since it

−

~~# Does~~ the ~~student understand the difference between a bar chart~~ and ~~a histogram ?~~

+

is possible to manipulate the independent variable(s), experimental

−

~~# Does~~ the ~~student know when to use each of these charts - - depending~~ on ~~the type of data continuous and discrete ?~~

+

research has the advantage of enabling a researcher to identify a

−

+

cause and effect between variables. For example, take our example of

−

~~== Evaluation ==~~

+

100 students completing a maths exam where the dependent variable was

−

+

the exam mark (measured from 0 to 100) and the independent variables

−

~~== Self-Evaluation ==~~

+

were revision time (measured in hours) and intelligence (measured

−

+

using IQ score). Here, it would be possible to use an experimental

−

~~== Further Explorations ==~~

+

design and manipulate the revision time of the students. The tutor

−

+

could divide the students into two groups, each made up of 50

−

~~=== Types of Variables ===~~

+

students. In "group one", the tutor could ask the students

−

+

not to do any revision. Alternatively, "group two" could be

−

~~All experiments examine some kind of~~ variable(s).

+

asked to do 20 hours of revision in the two weeks prior to the test.

−

A variable ~~is not only something that we measure~~, ~~but also something~~

+

The tutor could then compare the marks that the students achieved.

−

~~that we can manipulate~~ and ~~something we can control for~~. To

−

~~understand~~ the ~~characteristics of~~ variables and ~~how we use them in~~

−

~~research, this guide is divided into three main sections~~. ~~First~~, we

−

~~illustrate~~ the ~~role~~ of ~~dependent and independent variables~~. ~~Second~~,

−

~~we discuss~~ the ~~difference between experimental and non-experimental~~

−

~~research~~. ~~Finally~~, ~~we explain how variables can~~ be ~~characterised as~~

−

~~either categorical or continuous~~.

−

~~=== Dependent~~ and ~~Independent Variables ===~~

+

Non-experimental research: In non-experimental

+

research, the researcher does not manipulate the independent

+

variable(s). This is not to say that it is impossible to do so, but

+

it will either be impractical or unethical to do so. For example, a

+

researcher may be interested in the effect of illegal, recreational

+

drug use (the independent variable(s)) on certain types of behaviour

+

(the dependent variable(s)). However, whilst possible, it would be

+

unethical to ask individuals to take illegal drugs in order to study

+

what effect this had on certain behaviours. As such, a researcher

+

could ask both drug and non-drug users to complete a questionnaire

+

that had been constructed to indicate the extent to which they

+

exhibited certain behaviours. Whilst it is not possible to identify

+

the cause and effect between the variables, we can still examine the

+

association or relationship between them. In addition to understanding

+

the difference between dependent and independent variables, and

+

experimental and non-experimental research, it is also important to

+

understand the different characteristics amongst variables. This is

+

discussed next.

+

Line 2,241: Line 2,279:

−

~~An independent variable, sometimes called an~~

+

=== Categorical and Continuous Variables ===

−

~~experimental or predictor variable, is a variable that is being~~

−

~~manipulated in an experiment in order to observe the effect on a~~

−

~~dependent variable, sometimes called an outcome variable.~~

−

Line 2,251: Line 2,285:

−

~~Imagine that a tutor asks 100 students to complete~~

+

Categorical variables are also known as discrete

−

~~a maths test~~. ~~The tutor wants to know why some students perform~~

+

or qualitative variables. Categorical variables can be further

−

~~better than others. Whilst the tutor does not know the answer to~~

+

categorized as either''' nominal, ordinal or dichotomous.'''

−

~~this, she thinks that it might~~ be ~~because of two reasons: (1) some~~

−

~~students spend more time revising for their test; and (2) some~~

−

~~students are naturally more intelligent than others. As such~~, ~~the~~

−

~~tutor decides to investigate the effect of revision time and~~

−

~~intelligence on the test performance of the 100 students~~. ~~The~~

−

~~dependent and independent variables for the study are:~~

Line 2,266: Line 2,294:

−

~~Dependent Variable: Test Mark (measured from 0~~ to

+

'''Nominal variables''' are variables that have

−

~~100~~)

+

two or more categories but which do not have an intrinsic order. For

+

example, a real estate agent could classify their types of property

+

into distinct categories such as houses, condos, co-ops or bungalows.

+

So "type of property" is a nominal variable with 4

+

categories called houses, condos, co-ops and bungalows. Of note, the

+

different categories of a nominal variable can also be referred to as

+

groups or levels of the nominal variable. Another example of a

+

nominal variable would be classifying where people live in Karnataka

+

by district. In this case there will be many more levels of the

+

nominal variable (30 in fact).

−

+

'''Dichotomous variables''' are nominal

−

+

variables which have only two categories or levels. For example, if

+

we were looking at gender, we would most probably categorize somebody

+

as either "male" or "female". This is an example

+

of a dichotomous variable (and also a nominal variable). Another

+

example might be if we asked a person if they owned a mobile phone.

+

Here, we may categorise mobile phone ownership as either "Yes"

+

or "No". In the real estate agent example, if type of

+

property had been classified as either residential or commercial then

+

"type of property" would be a dichotomous variable.

−

~~Independent Variables: Revision time~~ (~~measured in~~

+

'''Ordinal variables''' are variables that have

−

~~hours) Intelligence~~ (~~measured using IQ score~~)

+

two or more categories just like nominal variables only the

+

categories can also be ordered or ranked. So if you asked someone if

+

they liked the policies of the Democratic Party and they could answer

+

either "Not very much", "They are OK" or "Yes,

+

a lot" then you have an ordinal variable. Why? Because you have

+

3 categories, namely "Not very much", "They are OK"

+

and "Yes, a lot" and you can rank them from the most

+

positive (Yes, a lot), to the middle response (They are OK), to the

+

least positive (Not very much). However, whilst we can rank the

+

levels, we cannot place a "value" to them; we cannot say

+

that "They are OK" is twice as positive as "Not very

+

much" for example.

Line 2,282: Line 2,338:

−

~~The dependent variable is simply that, a variable~~

+

Continuous variables are also known as

−

~~that is dependent on an independent variable(s)~~. ~~For example, in our~~

+

quantitative variables. Continuous variables can be further

−

~~case the test mark that a student achieves is dependent on revision~~

+

categorized as either interval or ratio variables.

−

~~time and intelligence. Whilst revision time and intelligence (the~~

−

~~independent~~ variables~~) may (or may not) cause a change in the test~~

−

~~mark (the dependent variable), the reverse is implausible; in other~~

−

~~words, whilst the number of hours a student spends revising and the~~

−

~~higher a student's IQ score may (~~or ~~may not) change the test mark~~

−

~~that a student achieves, a change in a student's test mark has no~~

−

~~bearing on whether a student revises more or is more intelligent~~

−

~~(this simply doesn't make sense)~~.

Line 2,299: Line 2,347:

−

~~Therefore, the aim of the tutor~~'~~s investigation~~ is

+

'''Interval variables''' are variables whose

−

~~to examine whether these independent variables - revision time~~ and IQ

+

central characteristic is that they can be measured along a

−

~~- result in~~ a ~~change in the dependent variable~~, ~~the students' test~~

+

continuum and they have a numerical value (for example, temperature

−

~~scores~~. ~~However, it is also worth noting that whilst this is~~ the ~~main~~

+

measured in degrees Celsius or Fahrenheit). So the difference between

−

~~aim of~~ the ~~experiment~~, ~~the tutor may also be interested to know if~~

+

20 °C and 30 °C is the same as that between 30 °C and 40 °C. However, temperature measured

−

~~the independent variables - revision time and IQ - are also connected~~

+

in degrees Celsius or Fahrenheit is NOT a ratio variable.

−

in ~~some way~~.

−

+

'''Ratio variables''' are interval variables but

−

+

with the added condition that 0 (zero) of the measurement indicates

−

+

that there is none of that variable. So, temperature measured in

−

+

degrees Celsius or Fahrenheit is not a ratio variable because 0C does

−

~~In the section on experimental~~ and

+

not mean there is no temperature. However, temperature measured in

−

~~non-experimental research~~ that ~~follows~~, ~~we find out~~ a ~~little more~~

+

Kelvin is a ratio variable as 0 Kelvin (often called absolute zero)

−

~~about~~ the ~~nature~~ of ~~independent and dependent variables~~.

+

indicates that there is no temperature whatsoever. Other examples of

+

ratio variables include height, mass, distance and many more. The

+

name "ratio" reflects the fact that you can use the ratio

+

of measurements. So, for example, a distance of ten metres is twice

+

the distance of 5 metres.
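The distinction between interval and ratio variables can be checked numerically. The sketch below uses the standard conversion K = °C + 273.15; the function and variable names are illustrative only.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15


# Interval property: equal differences mean the same thing anywhere on the scale
diff_low = 30 - 20
diff_high = 40 - 30

# Ratios on the Celsius scale are misleading because 0 °C is not "no temperature"
ratio_celsius = 40 / 20

# On the Kelvin scale the zero point is absolute, so the ratio is meaningful
ratio_kelvin = celsius_to_kelvin(40) / celsius_to_kelvin(20)

print(diff_low == diff_high)   # the two differences are equal
print(ratio_celsius)           # 2.0, but 40 °C is not "twice as hot" as 20 °C
print(round(ratio_kelvin, 3))  # about 1.068
```

The same pair of temperatures gives a ratio of 2 in Celsius but only about 1.07 in Kelvin, which is why ratios are only interpreted for ratio variables.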

−

~~=== Experimental and Non-Experimental Research ===~~

Line 2,323: Line 2,372:

−

~~Experimental research: In experimental research,~~

+

=== Ambiguities in classifying a type of variable ===

−

~~the aim is to manipulate an independent variable(s) and then examine~~

+

−

~~the effect that this change has on~~ a ~~dependent variable(s). Since it~~

+

−

~~is possible to manipulate the independent variable(s), experimental~~

+

−

~~research has the advantage of enabling a researcher to identify a~~

−

~~cause and effect between variables. For example, take our example~~ of

−

~~100 students completing a maths exam where the dependent~~ variable ~~was~~

−

~~the exam mark (measured from 0 to 100) and the independent variables~~

−

~~were revision time (measured in hours) and intelligence (measured~~

−

~~using IQ score). Here, it would be possible to use an experimental~~

−

~~design and manipulate the revision time of the students. The tutor~~

−

~~could divide the students into two groups, each made up of 50~~

−

~~students. In "group one", the tutor could ask the students~~

−

~~not to do any revision. Alternately, "group two" could be~~

−

~~asked to do 20 hours of revision in the two weeks prior to the test.~~

−

~~The tutor could then compare the marks that the students achieved.~~

−

~~Non-experimental research:~~ In ~~non-experimental~~

+

In some cases, the measurement scale for data is

−

~~research~~, the ~~researcher does not manipulate~~ the ~~independent~~

+

ordinal but the variable is treated as continuous. For example, a

−

variable~~(s). This~~ is ~~not to say that it is impossible to do so, but~~

+

Likert scale that contains five values - strongly agree, agree,

−

~~it will either be impractical or unethical to do so~~. For example, a

+

neither agree nor disagree, disagree, and strongly disagree - is

−

~~researcher may be interested in the effect of illegal~~, ~~recreational~~

+

ordinal. However, where a Likert scale contains seven or more values -

−

~~drug use (the dependent variable(s)) on certain types of behaviour~~

+

strongly agree, moderately agree, agree, neither agree nor disagree,

−

~~(the independent variable(s)). However~~, ~~whilst possible~~, ~~it would be~~

+

disagree, moderately disagree, and strongly disagree - the underlying

−

~~unethical to ask individuals to take illegal drugs in order to study~~

+

scale is sometimes treated as continuous, although whether you should do

−

~~what effect this had on certain behaviours~~. ~~As such~~, a ~~researcher~~

+

this is a cause of great dispute.

−

~~could ask both drug and non~~-~~drug users to complete a questionnaire~~

−

~~that had been constructed to indicate the extent to which they~~

−

~~exhibited certain behaviours. Whilst it is not possible to identify~~

−

~~the cause and effect between the variables~~, ~~we can still examine the~~

−

~~association or relationship between them.In addition to understanding~~

−

~~the difference between dependent and independent variables~~, and

−

~~experimental and non-experimental research, it~~ is ~~also important to~~

−

~~understand the different characteristics amongst variables. This~~ is

−

~~discussed next~~.

Line 2,364: Line 2,392: −

−

~~=== Categorical and Continuous Variables ===~~

Line 2,371: Line 2,397:

−

~~Categorical variables are also known as discrete~~

+

== Enrichment Activities ==

−

~~or qualitative variables. Categorical variables can be further~~

+

−

~~categorized as either''' nominal, ordinal or dichotomous.'''~~

+

= Central tendency =

−

+

== Introduction ==

−

~~'''Nominal variables''' are variables~~ that ~~have~~

+

A measure of central tendency is a single value

−

~~two or more categories but which do not have an intrinsic order. For~~

+

that attempts to describe a set of data by identifying the central

−

~~example,~~ a ~~real estate agent could classify their types~~ of ~~property~~

+

position within that set of data. As such, measures of central

−

~~into distinct categories~~ such ~~as houses, condos~~, ~~co-ops or bungalows.~~

+

tendency are sometimes called measures of central location. They are

−

~~So "type~~ of ~~property" is a nominal variable with 4~~

+

also classed as summary statistics. The mean (often called the

−

~~categories~~ called ~~houses, condos, co-ops and bungalows~~. ~~Of note, the~~

+

average) is most likely the measure of central tendency that you are

−

~~different categories of a nominal variable can~~ also ~~be referred to~~ as

+

most familiar with, but there are others, such as the median and the

−

~~groups or levels of~~ the ~~nominal variable. Another example~~ of a

+

mode.

−

~~nominal variable would be classifying where people live in Karnataka~~

−

~~by district. In this case~~ there ~~will be many more levels of~~ the

−

~~nominal variable (30 in fact)~~.

−

~~'''Dichotomous variables'''~~ are ~~nominal~~

+

The mean, median and mode are all valid measures

−

~~variables which have only two categories or levels. For example~~, if

+

of central tendency but, under different conditions, some measures of

−

~~we were looking at gender~~, ~~we would most probably categorize somebody~~

+

central tendency become more appropriate to use than others. In the

−

~~as either "male" or "female". This is an example~~

+

following sections we will look at the mean, mode and median and

−

of ~~a dichotomous variable (and also a nominal variable). Another~~

+

learn how to calculate them and under what conditions they are most

−

~~example might be if we asked a person if they owned a mobile phone~~.

+

appropriate to be used.

−

~~Here,~~ we ~~may categorise mobile phone ownership as either "Yes"~~

−

~~or "No". In~~ the ~~real estate agent example~~, ~~if type of~~

−

~~property had been classified as either residential or commercial then~~

−

~~"type of property" would~~ be ~~a dichotomous variable~~.

−

~~'''Ordinal variables''' are variables~~ that ~~have~~

+

== Objectives ==

−

~~two or more categories just like nominal variables only~~ the

+

−

~~categories can also be ordered or ranked~~. ~~So if you asked someone if~~

+

* Understand and know that a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

−

~~they liked~~ the ~~policies~~ of ~~the Democratic Party and they could answer~~

+

* Understand that the mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others.

−

~~either "Not very much"~~, ~~"They are OK" or "Yes~~,

+

* Learn to calculate the mean and median, analyse data and draw conclusions.

−

~~a lot" then you have an ordinal variable~~. ~~Why? Because you have~~

+

−

~~3 categories, namely "Not very much", "They are OK"~~

+

== Mean (Arithmetic) ==

−

and ~~"Yes, a lot"~~ and ~~you can rank them from the most~~

−

~~positive (Yes, a lot), to the middle response (They are OK), to the~~

−

~~least positive~~ (~~Not very much~~)~~. However, whilst we can rank the~~

−

~~levels, we cannot place a "value" to them; we cannot say~~

−

~~that "They are OK" is twice as positive as "Not very~~

−

~~much" for example.~~

−

+

The mean (or average) is the most popular and well

+

known measure of central tendency. It can be used with both discrete

+

and continuous data, although it is most often used with continuous

+

data. The mean is equal to the sum of all the values in the data set

+

divided by the number of values in the data set. So, if we have n

+

values in a data set and they have values x1, x2,

+

..., xn, then the sample mean, usually denoted by

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_174cec39.gif]]

+

(pronounced x bar), is:

−

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_69b2cf9e.gif]]

−

~~Continuous variables are also known as~~

+

This formula is usually written in a slightly

−

~~quantitative variables~~. ~~Continuous variables can be further~~

+

different manner using the Greek capital letter, Σ,

−

~~categorized as either interval or ratio variables~~.

+

pronounced "sigma", which means "sum of...":

−

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_m50e9a786.gif]]

−

~~'''Interval variables''' are variables for which~~

+

You may have noticed that the above formula refers

−

~~their central characteristic is that they can be measured along~~ a

+

to the sample mean. So, why have we called it a sample mean?

−

~~continuum~~ and ~~they~~ have ~~a numerical value (for example~~, ~~temperature~~

+

This is because, in statistics, samples and populations have very

−

~~measured~~ in ~~degrees Celsius or Fahrenheit)~~. So the ~~difference between~~

+

different meanings and these differences are very important, even if,

−

~~20C and 30C is~~ the ~~same as 30C to 40C. However~~, ~~temperature measured~~

+

in the case of the mean, they are calculated in the same way. To

−

~~in degrees Celsius or Fahrenheit is NOT a ratio variable.~~

+

acknowledge that we are calculating the population mean and not the

+

sample mean, we use the Greek lower case letter "mu",

+

denoted as µ:

−

~~'''Ratio variables''' are interval variables but~~

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_7b1e9596.gif]]
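The formula images above do not survive in plain text. Assuming they show the standard definitions described in the surrounding prose, the sample mean (written out in full and with Σ) and the population mean can be written as follows. The index i is introduced here for clarity and is not part of the original text:

```latex
\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n},
\qquad
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},
\qquad
\mu = \frac{\sum_{i=1}^{n} x_i}{n}
\]
```

The sample and population formulas have the same form; only the symbol changes to signal which set of values is being averaged.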

−

~~with the added condition that 0 (zero) of the measurement indicates~~

−

~~that there is none of that variable. So, temperature measured in~~

−

~~degrees Celsius or Fahrenheit is not a ratio variable because 0C does~~

−

~~not mean there is no temperature. However, temperature measured in~~

−

~~Kelvin is a ratio variable as 0 Kelvin (often called absolute zero)~~

−

~~indicates that there is no temperature whatsoever. Other examples of~~

−

~~ratio variables include height, mass, distance and many more. The~~

−

~~name "ratio" reflects the fact that you can use the ratio~~

−

~~of measurements. So, for example, a distance of ten metres is twice~~

−

~~the distance of 5 metres~~.

−

+

The mean is essentially a model of your data set.

−

+

It is a single typical value for the data. You will notice, however, that

+

the mean is not often one of the actual values that you have observed

+

in your data set. However, one of its important properties is that it

+

minimises error in the prediction of any one value in your data set.

+

That is, it is the value that produces the lowest total squared error

+

from all other values in the data set.

−

~~=== Ambiguities~~ in ~~classifying a type~~ of ~~variable ===~~

+

An important property of the mean is that it

+

includes every value in your data set as part of the calculation. In

+

addition, the mean is the only measure of central tendency where the

+

sum of the deviations of each value from the mean is always zero.
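Both properties can be checked numerically. The sketch below is only an illustration: the data values are made up, `squared_error` is a hypothetical helper, and "error" is assumed to mean the sum of squared differences between a candidate value and each data point.

```python
# Illustrative values only; they are not taken from the text.
data = [2, 4, 6, 8]

mean = sum(data) / len(data)

# Deviations of each value from the mean cancel out.
deviations = [x - mean for x in data]
total_deviation = sum(deviations)


def squared_error(centre, values):
    """Total squared distance between `centre` and every value."""
    return sum((x - centre) ** 2 for x in values)


# Compare the mean with two nearby candidate values.
error_at_mean = squared_error(mean, data)
error_below = squared_error(mean - 1, data)
error_above = squared_error(mean + 1, data)

print(total_deviation, error_at_mean, error_below, error_above)
```

For this small data set the deviations sum to zero, and moving the candidate value away from the mean in either direction increases the squared error.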

+

Line 2,464: Line 2,483:

−

~~In some cases,~~ the ~~measurement scale for data is~~

+

'''When not to use the mean'''

−

~~ordinal but the variable is treated as continuous. For example, a~~

−

~~Likert scale that contains five values - strongly agree, agree,~~

−

~~neither agree nor disagree, disagree, and strongly disagree - is~~

−

~~ordinal. However, where a Likert scale contains seven or more value -~~

−

~~strongly agree, moderately agree, agree, neither agree nor disagree,~~

−

~~disagree, moderately disagree, and strongly disagree - the underlying~~

−

~~scale is sometimes treated as continuous although where you should do~~

−

~~this is a cause of great dispute.~~

+

The mean has one main disadvantage: it is

+

particularly susceptible to the influence of outliers. These are

+

values that are unusual compared to the rest of the data set by being

+

especially small or large in numerical value. For example, consider

+

the wages of staff at a factory below:

+

{| border="1"

+

|-

+

|

+

Staff

+

|

+

1

+

|

+

2

+

|

+

3

+

|

+

4

−

~~== Enrichment Activities ==~~

+

|

+

5

+

−

~~= Central tendency =~~

+

|

+

6

+

−

~~== Introduction ==~~

+

|

+

7

+

−

~~A measure of central tendency is a single value~~

+

|

−

~~that attempts to describe a set of data by identifying the central~~

+

8

−

~~position within that set of data. As such, measures of central~~

−

~~tendency are sometimes called measures of central location. They are~~

−

~~also classed as summary statistics. The mean (often called the~~

−

~~average) is most likely the measure of central tendency that you are~~

−

~~most familiar with, but there are others, such as, the median and the~~

−

~~mode.~~

−

~~The mean, median and mode are all valid measures~~

+

|

−

~~of central tendency but, under different conditions, some measures of~~

+

9

−

~~central tendency become more appropriate to use than others. In the~~

−

~~following sections we will look at the mean, mode and median and~~

−

~~learn how to calculate them and under what conditions they are most~~

−

~~appropriate to be used.~~

−

~~== Objectives ==~~

+

|

+

10

+

−

* Understand and know that a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

+

|-

−

* Understand that the mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others.

+

|

−

* Learn to calculation of mean and median and analyse data and make conclusions.

+

Salary

+

−

~~== Mean (Arithmetic) ==~~

+

|

+

15k

+

−

~~The mean (or average) is the most popular and well~~

+

|

−

~~known measure of central tendency. It can be used with both discrete~~

+

18k

−

~~and continuous data, although its use is most often with continuous~~

−

~~data. The mean is equal to the sum of all the values in the data set~~

−

~~divided by the number of values in the data set. So, if we have n~~

−

~~values in a data set and they have values x1, x2,~~

−

~~..., xn, then the sample mean, usually denoted by~~

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_174cec39.gif]]~~

−

~~(pronounced x bar), is:~~

+

|

+

16k

+

|

+

14k

+

|

+

15k

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_69b2cf9e.gif]]~~

+

|

+

15k

−

~~This formula is usually written in a slightly~~

+

|

−

~~different manner using the Greek capitol letter, Σ,~~

+

12k

−

~~pronounced "sigma", which means "sum of...":~~

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_m50e9a786.gif]]~~

+

|

+

17k

−

~~You may have noticed that the above formula refers~~

+

|

−

~~to the sample mean. So, why call have we called it a sample mean?~~

+

90k

−

~~This is because, in statistics, samples and populations have very~~

−

~~different meanings and these differences are very important, even if,~~

−

~~in the case of the mean, they are calculated in the same way. To~~

−

~~acknowledge that we are calculating the population mean and not the~~

−

~~sample mean, we use the Greek lower case letter "mu",~~

−

~~denoted as µ:~~

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_7b1e9596.gif]]~~

+

|

+

95k

−

The mean is ~~essentially a model of your data set~~.

+

|}

−

~~It is~~ the value ~~that is~~ most ~~common~~. ~~You will notice, however, that~~

+

The mean salary for these ten staff is $30.7k.

−

~~the~~ mean is ~~not often one of~~ the ~~actual values that you have observed~~

+

However, inspecting the raw data suggests that this mean value might

−

~~in your data set~~. ~~However~~, ~~one of its important properties is that it~~

+

not be the best way to accurately reflect the typical salary of a

−

~~minimises error in the prediction~~ of ~~any one value in your data set~~.

+

worker, as most workers have salaries in the $12k to $18k range. The

−

~~That is~~, ~~it is~~ the ~~value that produces the lowest amount~~ of ~~error~~

+

mean is being skewed by the two large salaries. Therefore, in this

−

~~from all other values~~ in ~~the data set~~.

+

situation we would like to have a better measure of central tendency.

+

As we will find out later, taking the median would be a better

+

measure of central tendency in this situation.
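The effect of the two large salaries can be seen by computing the figures directly. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the median is used here ahead of its definition later in the text.

```python
from statistics import mean, median

# Salaries from the table above, in thousands of dollars.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries)
median_salary = median(salaries)

# The same calculation without the two largest salaries.
typical = sorted(salaries)[:-2]
mean_without_top_two = mean(typical)

print(mean_salary, median_salary, mean_without_top_two)
```

The mean of all ten salaries is 30.7, while the median and the mean of the eight lower salaries both sit inside the $12k to $18k range where most staff are paid.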

−

~~An important property of~~ the mean is ~~that it~~

+

Another time when we usually prefer the median

−

~~includes every~~ value in ~~your~~ data set as ~~part of~~ the ~~calculation. In~~

+

over the mean (or mode) is when our data is skewed (i.e. the

−

~~addition,~~ the mean is the ~~only measure of~~ central ~~tendency where~~ the

+

frequency distribution for our data is skewed). If we consider the

−

~~sum of~~ the ~~deviations of each~~ value ~~from~~ the ~~mean~~ is ~~always zero~~.

+

normal distribution - as this is the most frequently assessed in

−

+

statistics - when the data is perfectly normal then the mean, median

+

and mode are identical. Moreover, they all represent the most typical

+

value in the data set. However, as the data becomes skewed the mean

+

loses its ability to provide the best central location for the data

+

as the skewed data is dragging it away from the typical value.

+

However, the median best retains this position and is not as strongly

+

influenced by the skewed values. This is explained in more detail in

+

the skewed distribution section later in this guide.

+

== Median ==

+

The median is the middle score for a set of data

+

that has been arranged in order of magnitude. The median is less

+

affected by outliers and skewed data. In order to calculate the

+

median, suppose we have the data below:

+

−

~~'''When not to use the mean'''~~

−

+

−

~~The mean has one main disadvantage: it is~~

−

~~particularly susceptible to the influence of outliers. These are~~

−

~~values that are unusual compared to the rest of the data set by being~~

−

~~especially small or large in numerical value. For example, consider~~

−

~~the wages of staff at a factory below:~~

−

{| border="1"

|-

|

−

~~Staff~~

+

65

|

−

1

+

55

|

−

2

+

89

|

−

3

+

56

|

−

4

+

35

|

−

5

+

14

|

−

6

+

56

|

−

7

+

55

|

−

8

+

87

|

−

9

+

45

|

−

10

+

92

+

|}

+

We first need to rearrange that data into order of

+

magnitude (smallest first):

+

{| border="1"

|-

|

−

~~Salary~~

+

14

|

−

~~15k~~

+

35

|

−

~~18k~~

+

45

|

−

~~16k~~

+

55

|

−

~~14k~~

+

55

|

−

~~15k~~

+

'''56'''

|

−

~~15k~~

+

56

|

−

~~12k~~

+

65

|

−

~~17k~~

+

87

|

−

~~90k~~

+

89

|

−

~~95k~~

+

92

|}

−

~~The mean salary for these ten staff is $30.7k.~~

−

~~However, inspecting the raw data suggests that this mean value might~~

−

~~not be the best way to accurately reflect the typical salary of a~~

−

~~worker, as most workers have salaries in the $12k to 18k range. The~~

−

~~mean is being skewed by the two large salaries. Therefore, in this~~

−

~~situation we would like to have a better measure of central tendency.~~

−

~~As we will find out later, taking the median would be a better~~

−

~~measure of central tendency in this situation.~~

−

+

−

~~Another time when we usually prefer the median~~

−

~~over the mean (or mode) is when our data is skewed (i.e. the~~

−

~~frequency distribution for our data is skewed). If we consider the~~

−

~~normal distribution - as this is the most frequently assessed in~~

−

~~statistics - when the data is perfectly normal then the mean, median~~

−

~~and mode are identical. Moreover, they all represent the most typical~~

−

~~value in the data set. However, as the data becomes skewed the mean~~

−

~~loses its ability to provide the best central location for the data~~

−

~~as the skewed data is dragging it away from the typical value.~~

−

~~However, the median best retains this position and is not as strongly~~

−

~~influenced by the skewed values. This is explained in more detail in~~

−

~~the skewed distribution section later in this guide.~~

−

~~== Median ==~~

+

Our median mark is the middle mark - in this case

−

+

56 (highlighted in bold). It is the middle mark because there are 5

−

~~The~~ median is the middle ~~score for a set of data~~

+

scores before it and 5 scores after it. This works fine when you have

−

~~that has been arranged~~ in ~~order of magnitude~~. ~~The median~~ is ~~less~~

+

an odd number of scores but what happens when you have an even number

−

~~affected by outliers~~ and ~~skewed data~~. ~~In order~~ to ~~calculate~~ the

+

of scores? What if you had only 10 scores? Well, you simply have to

−

~~median~~, ~~suppose~~ we ~~have~~ the ~~data~~ below:

+

take the middle two scores and average the result. So, if we look at

+

the example below:

Line 2,710: Line 2,743: −

+

{| border="1"

|-

Line 2,751: Line 2,784:

|

45

−

|

−

92

Line 2,762: Line 2,791:

−

We ~~first need to~~ rearrange that data into order of

+

We again rearrange that data into order of

magnitude (smallest first):

−

Line 2,789: Line 2,816:

|

−

55

+

'''55'''

Line 2,821: Line 2,848:

−

~~Our median mark is the middle mark - in this case~~

+

Only now we have to take the 5th and 6th score in

−

~~56 (highlighted in bold). It is the middle mark because there are 5~~

+

our data set and average them to get a median of 55.5.
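The two cases (odd and even number of scores) can be sketched as a short function. This is an illustrative implementation, not a prescribed one; the name `median` and the variable names are assumptions, and the function assumes a non-empty list of numbers.

```python
def median(scores):
    """Middle value of the sorted scores.

    With an even number of scores, the two middle values are averaged.
    """
    ordered = sorted(scores)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


eleven_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_marks = eleven_marks[:-1]  # the same marks without the final 92

print(median(eleven_marks))
print(median(ten_marks))
```

With eleven marks the function returns the sixth sorted value, 56; with ten marks it averages the fifth and sixth sorted values, 55 and 56, to give 55.5.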

−

~~scores before it and 5 scores after it. This works fine when you have~~

−

~~an odd number of scores but what happens when you have an even number~~

−

~~of scores? What if you had only 10 scores? Well, you simply~~ have to

−

take the ~~middle two scores~~ and average ~~the result~~. ~~So, if we look at~~

−

~~the example below:~~

+

== Mode ==

+

The mode is the most frequent score in our data

+

set. Graphically, it corresponds to the highest bar in a bar chart or

+

histogram. You can, therefore, sometimes consider the mode as being

+

the most popular option. An example of a mode is presented below:

−

+

−

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]]

−

~~{| border="1"~~

−

|-

−

|

−

65

−

|

+

Normally, the mode is used for categorical data

−

55

+

where we wish to know which is the most common category as

+

illustrated below:

−

|

+

We can see above that the most common form of

−

89

+

transport, in this particular data set, is the bus. However, one of

+

the problems with the mode is that it is not always unique, so it leaves us

+

with problems when we have two or more values that share the highest

+

frequency, such as below:

−

+

−

|

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]]

−

56

−

+

We are now stuck as to which mode best describes

−

|

+

the central tendency of the data. This is particularly problematic

−

35

+

when we have continuous data, as we are more likely not to have any

−

+

one value that is more frequent than the others. For example, consider

−

+

measuring the weight of 30 people (to the nearest 0.1 kg). How likely is

−

|

+

it that we will find two or more people with '''exactly'''

−

14

+

the same weight, e.g. 67.4 kg? The answer is: probably very unlikely

+

- many people might be close but with such a small sample (30 people)

+

and a large range of possible weights you are unlikely to find two

+

people with exactly the same weight, that is, to the nearest 0.1 kg.

+

This is why the mode is very rarely used with continuous data.
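The difficulties described above can be illustrated with a small sketch. All three data sets below are hypothetical (the charts in the original are not reproduced here), and `modes` is an illustrative helper that returns every value tied for the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Hypothetical survey answers; the categories are made up for illustration.
transport = ["bus", "car", "bus", "train", "bus", "car"]

# Hypothetical tied data: two categories share the top frequency.
tied = ["bus", "car", "bus", "car", "train"]

# Hypothetical weights in kg, to the nearest 0.1 kg, with no repeats.
weights = [67.4, 71.2, 58.9, 80.1, 64.3]

print(modes(transport))
print(modes(tied))
print(modes(weights))
```

The categorical data gives one clear mode, the tied data gives two, and the weights give as many "modes" as there are values, which is why the mode says little about continuous measurements.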

−

+

Another problem with the mode is that it will not

−

|

+

provide us with a very good measure of central tendency when the most

−

56

+

common mark is far away from the rest of the data in the data set, as

+

depicted in the diagram below:

−

|

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]]

−

55

−

|

+

In the above diagram the mode has a value of 2. We

−

87

+

can clearly see, however, that the mode is not representative of the

+

data, which is mostly concentrated around the 20 to 30 value range.

+

To use the mode to describe the central tendency of this data set

+

would be misleading.

−

|

+

== Skewed Distributions and the Mean and Median ==

−

45

−

|}

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]] We

+

often test whether our data is normally distributed as this is a

+

common assumption underlying many statistical tests. An example of a

+

normally distributed set of data is presented below:

−

+

When you have a normally distributed sample you

−

~~We again rearrange that data into order~~ of

+

can legitimately use either the mean or the median as your measure of

−

~~magnitude (smallest first):~~

+

central tendency. In fact, in any symmetrical, single-peaked distribution the mean,

−

+

median and mode are equal. However, in this situation, the mean is

−

+

widely preferred as the best measure of central tendency as it is the

−

+

measure that includes all the values in the data set for its

−

+

calculation, and any change in any of the scores will affect the

−

~~{| border="1"~~

+

value of the mean. This is not the case with the median or mode.

−

|-

−

|

−

14

−

|

+

However, when our data is skewed, for example, as

−

35

+

with the right-skewed data set below:

−

|

+

[[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]]

−

45

−

|

−

55

−

|

−

~~'''55'''~~

−

|

−

~~'''56'''~~

−

|

−

56

−

|

−

65

−

|

−

87

−

|

−

89

−

|

−

92

−

|}

−

~~Only now we have to take the 5th and 6th score in~~

−

~~our data set and average them to get a median of 55.5.~~

−

~~== Mode ==~~

−

~~The mode is the most frequent score in our data~~

−

~~set. On a histogram it represents the highest bar in a bar chart or~~

−

~~histogram. You can, therefore, sometimes consider the mode as being~~

−

~~the most popular option. An example of a mode is presented below:~~

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_58d59706.png]]~~

−

~~Normally, the mode is used for categorical data~~

−

~~where we wish to know which is the most common category as~~

−

~~illustrated below:~~

−

~~We can see above that the most common form of~~

−

~~transport, in this particular data set, is the bus. However, one of~~

−

~~the problems with the mode is that it is not unique, so it leaves us~~

−

~~with problems when we have two or more values that share the highest~~

−

~~frequency, such as below:~~

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_m64bbad46.png]]~~

−

~~We are now stuck as to which mode best describes~~

−

~~the central tendency of the data. This is particularly problematic~~

−

~~when we have continuous data, as we are more likely not to have any~~

−

~~one value that is more frequent than the other. For example, consider~~

−

~~measuring 30 peoples' weight (to the nearest 0.1 kg). How likely is~~

−

~~it that we will find two or more people with '''exactly'''~~

−

~~the same weight, e.g. 67.4 kg? The answer, is probably very unlikely~~

−

~~- many people might be close but with such a small sample (30 people)~~

−

~~and a large range of possible weights you are unlikely to find two~~

−

~~people with exactly the same weight, that is, to the nearest 0.1 kg.~~

−

~~This is why the mode is very rarely used with continuous data.~~

−

~~Another problem with the mode is that it will not~~

−

~~provide us with a very good measure of central tendency when the most~~

−

~~common mark is far away from the rest of the data in the data set, as~~

−

~~depicted in the diagram below:~~

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_152dd141.png]]~~

−

~~In the above diagram the mode has a value of 2. We~~

−

~~can clearly see, however, that the mode is not representative of the~~

−

~~data, which is mostly concentrated around the 20 to 30 value range.~~

−

~~To use the mode to describe the central tendency of this data set~~

−

~~would be misleading.~~

−

~~== Skewed Distributions and the Mean and Median ==~~

−

~~[[Image:KOER-%20Mathematics%20-%20Statistics_html_26c6186d.png]]We~~

−

~~often test whether our data is normally distributed as this is a~~

−

~~common assumption underlying many statistical tests. An example of a~~

−

~~normally distributed set of data is presented below:~~

−

~~When you have a normally distributed sample you~~

−

~~can legitimately use both the mean or the median as your measure of~~

−

~~central tendency. In fact, in any symmetrical distribution the mean,~~

−

~~median and mode are equal. However, in this situation, the mean is~~

−

~~widely preferred as the best measure of central tendency as it is the~~

−

~~measure that includes all the values in the data set for its~~

−

~~calculation, and any change in any of the scores will affect the~~

−

~~value of the mean. This is not the case with the median or mode.~~

−

~~However, when our data is skewed, for example, as~~

−

~~with the right-skewed data set below:~~

−

[[Image:KOER-%20Mathematics%20-%20Statistics_html_m2609c500.png]]

−

Bindu

283

edits
