Statistics

Introduction

The following is background literature for teachers. It summarises what a teacher needs to know to teach this topic more effectively. It is meant to be a ready reference for the teacher to develop the concepts, inculcate the necessary skills, and impart knowledge in Statistics from Class 6 to Class 10.

The teacher will get an overall idea of all the subtopics required for school-level statistics, and of how to build students' understanding of the topic from the basics to more advanced aspects. Each subtopic is developed through introductions, objectives, activities, evaluation, and advanced and additional information and resources.

Statistics

In early times, the meaning of statistics was restricted to information about states (any political organization with a government that has supreme independent authority over a geographic area). This was later extended to include all collections of information of all types, and later still it was extended to include the analysis and interpretation of such data. In modern terms, "statistics" means both sets of collected information and analytical work which requires statistical inference.

Through statistical analysis it is possible to test numerical data for relevance, reliability and validity. In order to do this, statisticians must present data in such a form that others can use the relevant information to make judgements. One view is that the study of statistics started with the Englishman John Graunt (1620 – 1674), who collected and studied the death records in various cities of Britain. He was fascinated by the patterns he found in the whole population. Much of current-day statistical analysis is of quite recent development, with the availability of cheap computing power acting as a catalyst for the development of appropriate ways of presenting and analysing data. In fact, the more advanced statistical analyses and tests are based on probability theory, developed over the past few centuries, but put into a more modern context by mathematical statisticians such as Karl Pearson (1857 – 1936), Sir Ronald Fisher (1890 – 1962) and Jerzy Neyman (1894 – 1981).

The curricular objectives for school level statistical work can be described as follows:

To understand the meaning of data, the need for statistics, and how to collect, organise and represent data in different ways.
To develop the skills to represent and analyse data in tabular and graphical forms.
To understand central tendency and compute the measures of central tendency, namely arithmetic mean, median and mode, for both grouped and ungrouped data, and to choose the appropriate measure of central tendency to represent the data.
To understand dispersion and determine the measures of dispersion such as range, quartile deviation, mean deviation and standard deviation.
To understand the limitations and drawbacks of statistics.

Descriptive and Inferential Statistics

When analysing data, for example, the marks achieved by 100 students for a piece of coursework, it is possible to use both descriptive and inferential statistics in your analysis of their marks. Typically, in most research conducted on groups of people, you will use both descriptive and inferential statistics to analyse your results and draw conclusions. So what are descriptive and inferential statistics? And what are their differences?

Descriptive Statistics

Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.

Descriptive statistics are very important because, if we simply presented our raw data, it would be hard to visualize what the data were showing, especially if there was a lot of it. Descriptive statistics therefore allow us to present the data in a more meaningful way, which allows simpler interpretation of the data. For example, if we had the results of 100 pieces of students' coursework, we may be interested in the overall performance of those students. We would also be interested in the distribution or spread of the marks. Descriptive statistics allow us to do this. How to properly describe data through statistics and graphs is an important topic. Typically, there are two general types of statistic that are used to describe data:

Measures of central tendency: these are ways of describing the central position of a frequency distribution for a group of data. In this case, the frequency distribution is simply the distribution and pattern of marks scored by the 100 students from the lowest to the highest. We can describe this central position using a number of statistics, including the mode, median, and mean.

Measures of spread: these are ways of summarizing a group of data by describing how spread out the scores are. For example, the mean score of our 100 students may be 65 out of 100. However, not all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower and others higher. Measures of spread help us to summarize how spread out these scores are. To describe this spread, a number of statistics are available to us, including the range, quartiles, absolute deviation, variance and standard deviation.

When we use descriptive statistics it is useful to summarize our group of data using a combination of tabulated description (i.e. tables), graphical description (i.e. graphs and charts) and statistical commentary (i.e. a discussion of the results).
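As a minimal sketch of the two kinds of descriptive measures described above, the following Python code uses the standard `statistics` module on a small list of marks. The marks are made-up values chosen only for illustration; they are not taken from any real class.

```python
import statistics

# Hypothetical marks out of 100 for ten students (illustrative values only).
marks = [65, 70, 55, 65, 80, 45, 65, 90, 60, 55]

# Measures of central tendency
mean_mark = statistics.mean(marks)
median_mark = statistics.median(marks)
mode_mark = statistics.mode(marks)

# Measures of spread
mark_range = max(marks) - min(marks)
variance = statistics.pvariance(marks)  # population variance
std_dev = statistics.pstdev(marks)      # population standard deviation

print("Mean:", mean_mark)
print("Median:", median_mark)
print("Mode:", mode_mark)
print("Range:", mark_range)
print("Variance:", variance)
print("Standard deviation:", round(std_dev, 2))
```

Here the ten students are treated as the whole group of interest, so the population versions of variance and standard deviation are used.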

Inferential Statistics

We have seen that descriptive statistics provide information about our immediate group of data. For example, we could calculate the mean and standard deviation of the exam marks for the 100 students and this could provide valuable information about this group of 100 students. Any group of data like this, which includes all the data you are interested in, is called a population. A population can be small or large, as long as it includes all the data you are interested in. For example, if you were only interested in the exam marks of 100 students, then the 100 students would represent your population. Descriptive statistics are applied to populations, and the properties of populations, like the mean or standard deviation, are called parameters as they represent the whole population (i.e. everybody you are interested in).

Often, however, you do not have access to the whole population you are interested in investigating but only have a limited amount of data instead. For example, you might be interested in the exam marks of all students in the UK. It is not feasible to measure all exam marks of all students in the whole of the UK, so you have to measure a smaller sample of students, for example 100 students, that is used to represent the larger population of all UK students. Properties of samples, such as the mean or standard deviation, are not called parameters but statistics. Inferential statistics are techniques that allow us to use these samples to make generalizations about the populations from which the samples were drawn. It is, therefore, important that the sample accurately represents the population. The process of achieving this is called sampling. Inferential statistics arise out of the fact that sampling naturally incurs sampling error, and thus a sample is not expected to perfectly represent the population. The methods of inferential statistics are (1) the estimation of parameter(s) and (2) the testing of statistical hypotheses.
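The difference between a parameter and a statistic can be sketched in Python as follows. This is only an illustration under stated assumptions: the "population" of 1000 marks is randomly generated, and the sample size of 100 simply mirrors the example in the text.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch gives the same numbers each run

# Hypothetical population: marks of 1000 students (randomly generated, not real data).
population = [random.randint(30, 100) for _ in range(1000)]

# Parameters describe the whole population.
population_mean = statistics.mean(population)
population_sd = statistics.pstdev(population)

# A sample of 100 students is drawn to represent the population.
sample = random.sample(population, 100)

# Statistics describe the sample and are used to estimate the parameters.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# Sampling error: the sample mean will usually differ a little from the population mean.
sampling_error = sample_mean - population_mean

print("Population mean (parameter):", round(population_mean, 2))
print("Sample mean (statistic):", round(sample_mean, 2))
print("Sampling error:", round(sampling_error, 2))
```

Drawing a different sample (for example, by changing the seed) would give a slightly different sample mean, which is the sampling error that inferential methods have to account for.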

Mind Map

Data Handling

Introduction

Data is a collection of facts, such as values or measurements. It can be numbers, words, measurements, observations or even just descriptions of things. Statistical work is done for problem solving. For problem solving, we first have to understand the problem (postulating hypotheses), then we have to collect relevant data, after which we must be able to present the data, and finally analyse the data and draw conclusions related to the original hypotheses. Statistics provides us with tools to analyse data and draw conclusions from a large set of data by organising the data in the set in different ways and analysing the data by observing patterns. Data handling would include identifying data, collecting data, organising/representing data and summarising data.

Objective

To understand what statistical work is, and why and where we need to use it.
To understand the different types of data: qualitative and quantitative.
To understand the sources of data: primary and secondary.
To learn how to collect, classify and display data; data is information that is used in any process connected with statistics.

Data

The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are the pieces of information that have been observed and recorded, from an experiment or a survey. The word “data” is the plural of the word “datum”, and therefore one should say “the data are” and not “the data is”. Data can be classified as primary or secondary, and both primary and secondary data can be classified as qualitative or quantitative.

The figure below summarises the classifications of data. Primary data describes the original data that have been collected. This type of data is also known as raw data. Often the primary data set is very large and is therefore summarised or processed to extract meaningful information. Qualitative data is information that cannot be written as numbers, for example, if you were collecting data from people on how they feel or what their favourite colour is. Quantitative data is information that can be written as numbers, for example, if you were collecting data from people on their height or weight.

Secondary data is primary data that has been summarised or processed. For example, the set of colours that people gave as favourite colours would be secondary data because it is a summary of responses. Data already collected prior to our use is secondary data, while primary data is what we collect as a part of our own study. All processed data is therefore also secondary.

Transforming primary data into secondary data through analysis, grouping or organisation is the process of generating information.

Purpose of Collecting Primary Data

Data is collected to provide answers that help with understanding a particular situation. Here are examples to illustrate some real-world data collection scenarios in the categories of qualitative and quantitative data.

Qualitative Data

The local government might want to know how many residents have electricity and might ask the question: “Does your home have a safe supply of electricity?”
A supermarket manager might ask the question: “What flavours of soft drink should be stocked in my supermarket?” The question asked of customers might be “What is your favourite soft drink?” Based on the customers’ responses, the manager can make an informed decision as to what soft drinks to stock.
A company manufacturing medicines might ask “How effective is our pill at relieving a headache?” The question asked of people using the pill for a headache might be: “Does taking the pill relieve your headache?” Based on responses, the company learns how effective their product is.
A motor car company might want to improve their customer service, and might ask their customers: “How can we improve our customer service?”
A teacher may ask “How many hours of TV do students watch?” to get an idea of what children are learning from TV at home and how it supplements (or affects) the learning in the school.

Quantitative Data

A cell phone manufacturing company might collect data about how often people buy new cell phones and what factors affect their choice, so that the cell phone company can focus on those features that would make their product more attractive to buyers.
A town councillor might want to know how many accidents have occurred at a particular intersection, to decide whether a robot (traffic light) should be installed. The councillor would visit the local police station to research their records to collect the appropriate data.
A teacher might ask what kinds of TV programmes are watched by students and how many of them are educational in nature.

However, it is important to note that different questions reveal different features of a situation, and that this affects the ability to understand the situation. For example, suppose the question in the list about what kinds of TV programmes students watch was rephrased as “Do your children watch educational programmes on TV?” If you answered yes, but most programmes being watched were of entertainment value, then this could give the wrong impression that TV was being used as an educational tool in your home.

Data Collection

The method of collecting the data must be appropriate to the question being asked. Some examples of data collecting methods are:

Experiments
Questionnaires, surveys, focus group discussions and interviews
Other sources (friends, family, newspapers, books, magazines and now increasingly the Internet)
Observation
Specialised equipment (rainwater gauges to measure rainfall in a place, various medical equipment that collect information about different biological processes)

The most important aspect of each method of data collecting is to clearly formulate the question that is to be answered. The details of the data collection should therefore be structured to take your question into account.

You must have observed your teacher recording the attendance of students in your class every day, or recording the marks you obtained after every test or examination. Similarly, you must have also seen a cricket scoreboard. One such scoreboard is illustrated here:

NatWest One Day International Series: England v India Friday, 16 September 2011 at The Swalec Stadium

England beat India by 6 wickets (D/L). England won the toss and decided to field

India Innings: 304 for 6 (50.0 overs)

England Innings: 241 for 4 (32.2 overs)

India 1st Innings - Close

Name | Wicket | Runs | Balls | 4s | 6s
P Patel | c Bresnan b Swann | 19 | 39 | 0
Rahane | c Finn b Dernbach | 26 | 47 | 3 | 0
Dravid | b Swann | 69 | 79 | 4 | 0
Kohli | hit wicket b Swann | 107 | 93 | 9 | 1
Raina | c Bresnan b Finn | 15 | 0 | 1
Dhoni | not out | 50 | 26 | 5 | 2
Jadeja | c Bopara b Dernbach | 0 | 1 | 0
Ashwin | not out | 0
Extras | 6w 1b 11lb | 18
Total | for 6 | 304 | (50.0 ovs)

Bowler | Overs | Maidens | Runs | Wickets
Bresnan | 9.0 | 0 | 62 | 0
Finn | 10.0 | 1 | 44 | 1
Dernbach | 10.0 | 0 | 73 | 2
Swann | 9.0 | 0 | 34 | 3
S Patel | 8.0 | 0 | 55 | 0
Bopara | 4.0 | 0 | 24 | 0

Recording Data

Let us take an example of a class which is preparing to go for a picnic. The teacher asked the students to give their choice of fruits out of banana, apple, orange or guava. Uma is asked to prepare the list. She prepared a list of all the children and wrote the choice of fruit against each name. This list would help the teacher to distribute fruits according to the choice.

Raghav — Banana

Preeti — Apple

Amar — Guava

Fatima — Orange

Amita — Apple

Raman — Banana

Radha — Orange

Farida — Guava

Anuradha — Banana

Rati — Banana

Bhawana — Apple

Manoj — Banana

Donald — Apple

Maria — Banana

Uma — Orange

Akhtar — Guava

Ritu — Apple

Salma — Banana

Kavita — Guava

Javed — Banana
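The list above can be counted by hand, but the same counting can also be sketched in Python. The sketch below uses `collections.Counter` from the standard library; the variable names are illustrative and the data are exactly the names and choices in Uma's list.

```python
from collections import Counter

# Uma's list: each student's name with his or her choice of fruit.
choices = {
    "Raghav": "Banana", "Preeti": "Apple", "Amar": "Guava", "Fatima": "Orange",
    "Amita": "Apple", "Raman": "Banana", "Radha": "Orange", "Farida": "Guava",
    "Anuradha": "Banana", "Rati": "Banana", "Bhawana": "Apple", "Manoj": "Banana",
    "Donald": "Apple", "Maria": "Banana", "Uma": "Orange", "Akhtar": "Guava",
    "Ritu": "Apple", "Salma": "Banana", "Kavita": "Guava", "Javed": "Banana",
}

# Count how many students chose each fruit.
fruit_counts = Counter(choices.values())

# Print the fruits from the most chosen to the least chosen.
for fruit, count in fruit_counts.most_common():
    print(fruit, count)
```

The counts tell the teacher how many fruits of each kind to bring, which is hard to see from the raw list alone.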

Example 1: A teacher wants to know the choice of food of each student as part of the mid-day meal programme. The teacher assigns the task of collecting this information to Maria. Maria does so using a paper and a pencil. After arranging the choices in a column, she puts one ( / ) mark against a choice of food for every student making that choice.

Choice | Number of students
Rice only | /////////////// //
Chapati only | /////////////
Both rice and chapati | ////////////////////

Umesh, after seeing the table, suggested a better method to count the students. He asked Maria to organise the marks ( / ) in groups of ten, as shown below:

Choice | Tally marks | Number of students
Rice only | ////////// /////// | 17
Chapati only | ////////// /// | 13
Both rice and chapati | ////////// ////////// | 20

Rajan made it simpler by asking her to make groups of five instead of ten, as shown below:

Choice | Tally marks | Number of students
Rice only | ///// ///// ///// // | 17
Chapati only | ///// ///// /// | 13
Both rice and chapati | ///// ///// ///// ///// | 20
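The grouping of tally marks described above can be sketched as a small Python function. The function name `tally_marks` and its `group_size` parameter are illustrative choices, not part of the original example.

```python
def tally_marks(count, group_size=5):
    """Return `count` tally marks as text, separated into groups of `group_size`."""
    full_groups, remainder = divmod(count, group_size)
    groups = ["/" * group_size] * full_groups
    if remainder:
        groups.append("/" * remainder)
    return " ".join(groups)


# Number of students making each choice, as counted in Example 1.
meal_counts = {"Rice only": 17, "Chapati only": 13, "Both rice and chapati": 20}

for choice, count in meal_counts.items():
    print(choice, "|", tally_marks(count), "|", count)
```

Calling `tally_marks(17, group_size=10)` gives the grouping by tens that Umesh suggested, while the default of five gives Rajan's version.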

Meaning of Frequency

Frequency means the number of times something occurs. It is not easy to answer questions about data, such as which subject the students like most, by looking at choices written haphazardly. We therefore arrange the data in the table below using tally marks.

Subject | Tally Marks | Number of Students
Art | ///// // | 7
Mathematics | ///// | 5
Science | ///// / | 6
English | //// | 4

The number of tallies against each subject gives the number of students who like that particular subject. This is known as the frequency of that subject. Frequency gives the number of times that a particular entry occurs. From the above table, the frequency of students who like English is 4 and the frequency of students who like Mathematics is 5. The table is known as a frequency distribution table, as it gives the number of times each entry occurs.

Categorical Frequency Distributions

Categorical frequency distributions can be used for data that can be placed in specific categories, such as nominal- or ordinal-level data. For such data we can distinctly count the occurrences of each value of a variable.

Examples include political affiliation, religious affiliation and blood type. Below is an example of a frequency distribution for blood type.

Class | Frequency | Percent
A | 5 | 20
B | 7 | 28
O | 9 | 36
AB | 4 | 16
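The Percent column of the table above is obtained by dividing each class frequency by the total frequency and multiplying by 100. A minimal Python sketch of this arithmetic, using the four frequencies from the table in row order, is:

```python
# Class frequencies from the table above, in row order.
frequencies = [5, 7, 9, 4]

# Total number of observations.
total = sum(frequencies)

# Percent for each class = frequency / total * 100.
percents = [frequency * 100 / total for frequency in frequencies]

for frequency, percent in zip(frequencies, percents):
    print(frequency, "out of", total, "is", percent, "percent")
```

The percents of all classes together add up to 100, which is a quick check that the table has been filled in correctly.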

Activities

Activity 1 Data Collection

Learning Objectives

Understand the collection of data.

Materials and resources required

Paper & Pen

Pre-requisites/ Instructions

The meaning of data and how data is organised in a tabular form

Method

The table below has spaces for up to 10 entries. The first four columns have headings. Choose headings for the other columns and collect data from 10 of your classmates.

Name | Age | Height | Favourite Colour | -

Evaluation

Looking at the table and data, can the student answer the following questions?

Does any student like green the most?
Do you think red is the most popular colour? Why?
What other information did you come to know about each student?

Evaluation

At the end of this sub-topic the student should be able to

Identify the different types of data
Collect, classify and organise data in a tabular form
Calculate the frequency of data
Interpret data that is given in a tabular form

Self-Evaluation

Further Explorations

Enrichment Activities

Graphical representation of Data

Introduction

Tabular data can also be represented in the form of a picture (charts), as visual representations can sometimes be easier to interpret. There are different types of pictorial representations that can be used to represent different types of data.

Objectives

Understand and know the different pictorial representations: Histogram, Bar Chart, Pie Chart
To be able to look at the data and select the chart that would clearly represent the data as well as convey the intended information about the data.
Understand and know the terms: Frequency Distribution, Class Intervals
To be able to look at a graphical representation and interpret the data

Histogram & Bar Chart

What is a histogram?

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g. normal distribution), outliers, skewness, etc. An example of a histogram, and the raw data it was constructed from, is shown below:

36 25 38 46 55 68 72 55 36 38

67 45 22 48 91 46 52 61 58 55

How do you construct a histogram from a continuous variable?

To construct a histogram from a continuous variable you first need to split the data into intervals, called bins. In the example above, age has been split into bins, with each bin representing a 10-year period starting at 20 years. Each bin contains the number of occurrences of scores in the data set that are contained within that bin. For the above data set, the frequencies in each bin have been tabulated along with the scores that contributed to the frequency in each bin (see below):

Bin Frequency Scores Included in Bin

20-30 2 25,22

30-40 4 36,38,36,38

40-50 4 46,45,48,46

50-60 5 55,55,52,58,55

60-70 3 68,67,61

70-80 1 72

80-90 0 -

90-100 1 91

Notice that, unlike a bar chart, there are no "gaps" between the bars (although some bars might be "absent", reflecting no frequencies). This is because a histogram represents a continuous data set, and as such, there are no gaps in the data. You will, however, have to decide whether scores that fall exactly on the boundary between two bins are counted in the lower bin or the upper bin.
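The binning described in this section can be sketched in Python using the 20 raw values listed earlier. In this sketch each bin includes its lower boundary and excludes its upper boundary; that boundary rule is an assumption made here for illustration (none of the 20 values actually falls on a boundary).

```python
# The 20 raw values from the example above.
ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
        67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

bin_width = 10
first_bin_start = 20
last_bin_end = 100

# Count how many values fall in each bin (lower boundary included, upper excluded).
bin_frequencies = {}
for lower in range(first_bin_start, last_bin_end, bin_width):
    upper = lower + bin_width
    bin_frequencies[(lower, upper)] = sum(1 for age in ages if lower <= age < upper)

# Print a simple text histogram: one "#" per occurrence.
for (lower, upper), frequency in bin_frequencies.items():
    print(f"{lower}-{upper}: {'#' * frequency} ({frequency})")
```

Changing `bin_width` to a much smaller or much larger value is an easy way to see how the choice of bin width affects the shape that the histogram shows.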

Choosing the correct bin width

There is no right or wrong answer as to how wide a bin should be, but there are rules of thumb. You need to make sure that the bins are not too small or too large. Consider the histogram we produced earlier (see above): the following histograms use the same data but have either much smaller or larger bins, as shown below:

We can see from the histogram on the left that the bin width is too small: it shows too much individual data and does not allow the underlying pattern (frequency distribution) of the data to be easily seen. At the other end of the scale is the diagram on the right, where the bins are too large and, again, we are unable to find the underlying trend in the data.

Histograms are based on area not height of bars

In a histogram, it is the area of the bar that indicates the frequency of occurrences for each bin. This means that the height of the bar does not necessarily indicate how many occurrences of scores there were within each individual bin. It is the product of the height and the width of the bin that indicates the frequency of occurrences within that bin. The height of a bar is often incorrectly read as the frequency because many histograms have bins of equal width and, under these circumstances, the height of each bar does reflect the frequency.
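The relationship between area, height and bin width can be sketched in Python. The bins below reuse the frequencies from the earlier example, except that the three bins from 70 to 100 have been merged into one wider bin; that merge is an assumption made here purely to show what happens when bins are unequal.

```python
# (lower boundary, upper boundary, frequency) for each bin.
# The last bin is deliberately wider than the others (illustrative assumption).
bins = [
    (20, 30, 2),
    (30, 40, 4),
    (40, 50, 4),
    (50, 60, 5),
    (60, 70, 3),
    (70, 100, 2),
]

bar_heights = []
for lower, upper, frequency in bins:
    width = upper - lower
    # Height is chosen so that area (height x width) equals the frequency.
    height = frequency / width
    bar_heights.append(height)
    print(f"{lower}-{upper}: width {width}, frequency {frequency}, height {height:.3f}")
```

The bins 20-30 and 70-100 both have a frequency of 2, but the wider bin gets a lower bar, because the same area is spread over a greater width.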

What is the difference between a bar chart and a histogram?

The major difference is that a histogram is only used to plot the frequency of score occurrences in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, can be used for a great deal of other types of variables including ordinal and nominal data sets.

Circle or Pie Chart

Pie charts are also called circle graphs. A circle graph shows the relationship between a whole and its parts. Here, the whole circle is divided into sectors. The size of each sector is proportional to the activity or information it represents.

A variety of graphical representations of data are now possible using spreadsheet software. OpenOffice Calc can convert a table of data into bar charts, pie charts, area charts, etc., and make data much easier to read and interpret.
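Since the whole circle is 360 degrees, the angle of each sector is the category's share of the total multiplied by 360. The Python sketch below works this out for the mid-day meal choices counted earlier (17, 13 and 20 students); reusing that data for a pie chart is an illustrative choice.

```python
# Number of students making each choice, from the mid-day meal example.
meal_counts = {"Rice only": 17, "Chapati only": 13, "Both rice and chapati": 20}

total_students = sum(meal_counts.values())

# Sector angle = (frequency / total) x 360 degrees.
sector_angles = {
    choice: 360 * count / total_students
    for choice, count in meal_counts.items()
}

for choice, angle in sector_angles.items():
    print(f"{choice}: {angle:.1f} degrees")
```

The sector angles always add up to 360 degrees, which gives a simple check before drawing the chart with a protractor.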

Activities