1
00:00:00,314 --> 00:00:02,981
(upbeat music)
2
00:00:11,440 --> 00:00:13,850
- Hello, I'm Professor Rick Jerz,
3
00:00:13,850 --> 00:00:15,730
and I welcome you to this session,
4
00:00:15,730 --> 00:00:18,980
titled Descriptive Statistics.
5
00:00:18,980 --> 00:00:20,470
In our previous session,
6
00:00:20,470 --> 00:00:23,540
we learned how to look
at data graphically.
7
00:00:23,540 --> 00:00:27,120
This is always a useful
and important first step
8
00:00:27,120 --> 00:00:30,680
when trying to understand
the nature of data.
9
00:00:30,680 --> 00:00:35,600
However, graphs don't always
convey the best understanding.
10
00:00:35,600 --> 00:00:38,350
They do not provide us with enough detail
11
00:00:38,350 --> 00:00:41,370
for more in-depth investigation.
12
00:00:41,370 --> 00:00:46,330
They only give a broad
understanding of what is going on.
13
00:00:46,330 --> 00:00:48,540
In this session, we are going to learn
14
00:00:48,540 --> 00:00:53,060
the many ways to describe
data numerically.
15
00:00:53,060 --> 00:00:55,960
We are going to learn
about two general ways
16
00:00:55,960 --> 00:00:57,710
of describing data,
17
00:00:57,710 --> 00:01:01,290
central tendency and dispersion.
18
00:01:01,290 --> 00:01:04,830
This topic will contain
a lot of equations,
19
00:01:04,830 --> 00:01:07,680
but don't worry because
in our next session,
20
00:01:07,680 --> 00:01:10,520
we will learn how to use Microsoft Excel
21
00:01:10,520 --> 00:01:13,050
to do these calculations.
22
00:01:13,050 --> 00:01:16,330
Using Excel also avoids the need
23
00:01:16,330 --> 00:01:19,330
to memorize all these equations.
24
00:01:19,330 --> 00:01:21,500
Great news for you, I'm sure.
25
00:01:21,500 --> 00:01:23,693
Okay, let's get started.
26
00:01:26,800 --> 00:01:29,540
We have a number of
goals for this session.
27
00:01:29,540 --> 00:01:33,170
We want to be able to
calculate the arithmetic mean,
28
00:01:33,170 --> 00:01:36,570
weighted mean, median, and mode.
29
00:01:36,570 --> 00:01:39,620
We also want to explain
the characteristics,
30
00:01:39,620 --> 00:01:42,650
uses, advantages, and disadvantages
31
00:01:42,650 --> 00:01:45,140
of each measure of location.
32
00:01:45,140 --> 00:01:49,920
We want to be able to compute
and interpret the range,
33
00:01:49,920 --> 00:01:53,700
mean deviation, variance,
and standard deviation,
34
00:01:53,700 --> 00:01:56,640
and we want to understand
the characteristics,
35
00:01:56,640 --> 00:01:59,840
uses, advantages, and disadvantages
36
00:01:59,840 --> 00:02:02,950
of each measure of dispersion.
37
00:02:02,950 --> 00:02:04,500
In a previous session,
38
00:02:04,500 --> 00:02:07,170
we assumed that we collected some data
39
00:02:07,170 --> 00:02:10,010
and then looked at it graphically.
40
00:02:10,010 --> 00:02:14,160
This graphic approach provides
a quick overview of data
41
00:02:14,160 --> 00:02:15,550
and it lets us observe
42
00:02:15,550 --> 00:02:18,800
some general relationships and trends.
43
00:02:18,800 --> 00:02:21,780
It's also pretty easy
for people to interpret.
44
00:02:21,780 --> 00:02:25,260
Although some of you would like
to call it quits right here,
45
00:02:25,260 --> 00:02:27,480
we have to go beyond this graphic approach
46
00:02:27,480 --> 00:02:31,143
and learn a variety of numeric approaches.
47
00:02:32,400 --> 00:02:34,840
I'd like to begin this
chapter with an example.
48
00:02:34,840 --> 00:02:39,260
Let's assume that we have
collected our M&M weight data
49
00:02:39,260 --> 00:02:40,360
and plotted it,
50
00:02:40,360 --> 00:02:43,300
and let's say that the
graph looks like this.
51
00:02:43,300 --> 00:02:45,340
From this representation of the data,
52
00:02:45,340 --> 00:02:47,230
what might we be able to say?
53
00:02:47,230 --> 00:02:49,060
Well, one thing that we can certainly say
54
00:02:49,060 --> 00:02:53,390
is that not every M&M
weighs exactly the same.
55
00:02:53,390 --> 00:02:56,510
We can also say that we have
a lowest or minimum value,
56
00:02:56,510 --> 00:02:59,871
somewhere around .84 grams.
57
00:02:59,871 --> 00:03:02,700
Likewise, we can say
that there's a high value
58
00:03:02,700 --> 00:03:04,810
right around one gram.
59
00:03:04,810 --> 00:03:06,980
We might also say that most of the data
60
00:03:06,980 --> 00:03:09,410
seems to be clustered near the middle,
61
00:03:09,410 --> 00:03:12,640
and that most of the
M&Ms seem to be centered
62
00:03:12,640 --> 00:03:16,500
somewhere around .9125 grams.
63
00:03:16,500 --> 00:03:18,720
So, just as we have expected,
64
00:03:18,720 --> 00:03:20,490
the graph can give us a good feel
65
00:03:20,490 --> 00:03:23,730
for the nature of M&M weights.
66
00:03:23,730 --> 00:03:27,050
Now, let's assume that we do
not have a graph to look at,
67
00:03:27,050 --> 00:03:29,240
but I say something to you like,
68
00:03:29,240 --> 00:03:33,360
we have a lowest value near .84 grams,
69
00:03:33,360 --> 00:03:36,500
a highest value near one gram,
70
00:03:36,500 --> 00:03:39,010
most of the data seems to be clustered
71
00:03:39,010 --> 00:03:43,040
around the center value of .9125.
72
00:03:43,040 --> 00:03:45,880
If you kinda close your
eyes and take this data,
73
00:03:45,880 --> 00:03:47,930
and try to mentally visualize
74
00:03:47,930 --> 00:03:49,390
what the graph would look like,
75
00:03:49,390 --> 00:03:51,230
with enough practice,
76
00:03:51,230 --> 00:03:54,250
might your mental image match the graph
77
00:03:54,250 --> 00:03:56,750
that would be created from this data?
78
00:03:56,750 --> 00:03:59,050
If I were to say this a
little bit differently,
79
00:03:59,050 --> 00:04:02,410
like I'm going to give you some values,
80
00:04:02,410 --> 00:04:07,410
or statistics, what can you
say about weights of M&Ms?
81
00:04:08,780 --> 00:04:11,860
Well, that's what this
whole chapter is about.
82
00:04:11,860 --> 00:04:14,800
Learning about some key statistics
83
00:04:14,800 --> 00:04:18,370
that represent the nature of data.
84
00:04:18,370 --> 00:04:22,150
In this example, I said
there's a low value.
85
00:04:22,150 --> 00:04:24,210
The statistical way to say this
86
00:04:24,210 --> 00:04:27,040
would be minimum value.
87
00:04:27,040 --> 00:04:29,170
I also said there's a high value.
88
00:04:29,170 --> 00:04:33,080
Statistically, we would say maximum value.
89
00:04:33,080 --> 00:04:36,180
I said that most of the
data seems to be clustered
90
00:04:36,180 --> 00:04:38,280
and centered around,
91
00:04:38,280 --> 00:04:41,810
the statistical concept of
data being centered around
92
00:04:41,810 --> 00:04:44,470
is central tendency.
93
00:04:44,470 --> 00:04:47,750
And in statistics, we wouldn't
use the word clustered,
94
00:04:47,750 --> 00:04:50,490
but we do use the word dispersion.
95
00:04:50,490 --> 00:04:51,940
In just this short example,
96
00:04:51,940 --> 00:04:53,840
we have learned about the advantage
97
00:04:53,840 --> 00:04:57,720
of being able to calculate
some statistics about the data.
98
00:04:57,720 --> 00:04:58,690
As you might notice,
99
00:04:58,690 --> 00:05:01,880
some of these statistics are
pretty easy conceptually,
100
00:05:01,880 --> 00:05:03,610
minimum and maximum.
101
00:05:03,610 --> 00:05:05,740
However, there are others
102
00:05:05,740 --> 00:05:09,360
like central tendency and dispersion
103
00:05:09,360 --> 00:05:12,800
that could be defined
in a variety of ways.
104
00:05:12,800 --> 00:05:13,910
In the previous chapter,
105
00:05:13,910 --> 00:05:16,720
we learned about graphical
ways of representing data,
106
00:05:16,720 --> 00:05:17,730
and in this chapter,
107
00:05:17,730 --> 00:05:20,130
we're going to learn about numerical ways
108
00:05:20,130 --> 00:05:22,110
of representing data.
109
00:05:22,110 --> 00:05:23,930
The numeric approach provides
110
00:05:23,930 --> 00:05:26,500
a more detailed analysis of data.
111
00:05:26,500 --> 00:05:28,060
It is also more powerful
112
00:05:28,060 --> 00:05:32,140
for understanding relationships
and trends of data.
113
00:05:32,140 --> 00:05:35,380
Certainly, this numeric
approach is more difficult
114
00:05:35,380 --> 00:05:37,050
for people to interpret.
115
00:05:37,050 --> 00:05:40,630
The numeric approach helps
calculate probabilities,
116
00:05:40,630 --> 00:05:42,800
and that's really where we wanna end up.
117
00:05:42,800 --> 00:05:45,470
To be able to assess business risk
118
00:05:45,470 --> 00:05:48,400
and make better business decisions.
119
00:05:48,400 --> 00:05:50,730
After collecting a bunch of data,
120
00:05:50,730 --> 00:05:52,020
we want to be able to know
121
00:05:52,020 --> 00:05:54,240
a couple things about this data.
122
00:05:54,240 --> 00:05:57,610
The first thing is central tendency.
123
00:05:57,610 --> 00:05:59,440
Is there one number that we can use
124
00:05:59,440 --> 00:06:01,880
to characterize all this data?
125
00:06:01,880 --> 00:06:03,700
We will learn that there are several ways
126
00:06:03,700 --> 00:06:06,010
to characterize central tendency,
127
00:06:06,010 --> 00:06:07,810
the mean, or average,
128
00:06:07,810 --> 00:06:09,560
median, and mode.
129
00:06:09,560 --> 00:06:12,230
But we will also learn that
this is not good enough.
130
00:06:12,230 --> 00:06:15,850
We need to also know
something about the dispersion
131
00:06:15,850 --> 00:06:18,050
or the spread of the data.
132
00:06:18,050 --> 00:06:21,040
For dispersion, we will
learn about the range,
133
00:06:21,040 --> 00:06:25,690
the mean deviation, variance,
and standard deviation.
134
00:06:25,690 --> 00:06:27,450
Although we will be presented
135
00:06:27,450 --> 00:06:30,630
with the equations for
all of these measurements,
136
00:06:30,630 --> 00:06:32,100
I really want to teach you
137
00:06:32,100 --> 00:06:34,680
how to do this in Microsoft Excel.
138
00:06:34,680 --> 00:06:36,320
Let's start with the mean.
139
00:06:36,320 --> 00:06:39,120
The arithmetic mean is
the most widely used
140
00:06:39,120 --> 00:06:40,930
measure of location.
141
00:06:40,930 --> 00:06:44,030
It's important to realize
that this does require
142
00:06:44,030 --> 00:06:47,540
interval or ratio scales.
143
00:06:47,540 --> 00:06:49,120
For M&M data,
144
00:06:49,120 --> 00:06:51,570
if we had five yellow and,
145
00:06:51,570 --> 00:06:54,280
let's say, three blue M&Ms,
146
00:06:54,280 --> 00:06:56,840
it wouldn't make any sense
to say that on the average,
147
00:06:56,840 --> 00:07:00,170
we have four green M&Ms.
148
00:07:00,170 --> 00:07:03,920
Assuming that we mix blue
and yellow and get green.
149
00:07:03,920 --> 00:07:08,360
Counting blue and yellow M&Ms
only provides nominal data.
150
00:07:08,360 --> 00:07:09,700
To calculate the mean,
151
00:07:09,700 --> 00:07:13,410
we must use all of the
values of the data collected.
152
00:07:13,410 --> 00:07:14,710
The mean is unique.
153
00:07:14,710 --> 00:07:18,370
We only have one mean for any set of data.
154
00:07:18,370 --> 00:07:21,380
The mean will be somewhere
in the middle of our data.
155
00:07:21,380 --> 00:07:24,060
Some data values will
be higher than the mean
156
00:07:24,060 --> 00:07:25,690
and some will be lower.
157
00:07:25,690 --> 00:07:29,760
If we subtract the mean
from all of our data values,
158
00:07:29,760 --> 00:07:34,430
we will end up with some
positive and negative numbers.
159
00:07:34,430 --> 00:07:38,570
If we add these positive and
negative deviations together,
160
00:07:38,570 --> 00:07:42,030
they will add up to exactly zero.
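This zero-sum property is easy to check with a short sketch (Python; the weights below are illustrative, not from the lecture's data set):

```python
# Deviations from the mean always cancel out; only floating-point
# rounding error remains.
data = [0.84, 0.90, 0.91, 0.92, 1.00]  # illustrative M&M weights in grams
mean = sum(data) / len(data)
total_deviation = sum(x - mean for x in data)
print(abs(total_deviation) < 1e-12)  # True
```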
161
00:07:42,030 --> 00:07:44,220
This next point is very subtle.
162
00:07:44,220 --> 00:07:46,280
That the mean is affected
163
00:07:46,280 --> 00:07:50,680
by unusually large or small data values.
164
00:07:50,680 --> 00:07:52,280
If I wanted to calculate
165
00:07:52,280 --> 00:07:55,580
the average income of students in class,
166
00:07:55,580 --> 00:07:58,880
and let's assume that, oh, Bill Gates,
167
00:07:58,880 --> 00:08:01,250
the former CEO of Microsoft,
168
00:08:01,250 --> 00:08:03,850
and one of the wealthiest
people in the world
169
00:08:03,850 --> 00:08:06,340
happens to be one of our students,
170
00:08:06,340 --> 00:08:09,130
our average salary would probably be,
171
00:08:09,130 --> 00:08:12,770
oh, I don't know, $500
million apiece, right?
172
00:08:12,770 --> 00:08:17,090
Bill Gates' income is unusually high,
173
00:08:17,090 --> 00:08:20,670
and it's going to pull
the mean towards it.
174
00:08:20,670 --> 00:08:23,240
Now you have to realize
that we did nothing wrong
175
00:08:23,240 --> 00:08:24,810
in calculating the mean,
176
00:08:24,810 --> 00:08:28,290
it's just that this unusually high value
177
00:08:28,290 --> 00:08:30,400
has distorted the central tendency
178
00:08:30,400 --> 00:08:32,640
that we were hoping to observe.
179
00:08:32,640 --> 00:08:35,340
But the calculation is still correct.
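The pull of one extreme value on the mean can be seen in a quick sketch (Python; the incomes are invented for illustration):

```python
# Hypothetical student incomes, in dollars.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]
mean_before = sum(incomes) / len(incomes)   # 40000.0

incomes.append(2_000_000_000)               # one Bill-Gates-sized outlier
mean_after = sum(incomes) / len(incomes)

print(mean_before)  # 40000.0
print(mean_after)   # about 333 million: the outlier drags the mean far upward
```

The calculation is still correct in both cases; the second mean simply no longer describes a typical student.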
180
00:08:35,340 --> 00:08:38,540
The mean is calculated
by summing the values
181
00:08:38,540 --> 00:08:41,340
and dividing by the number of values.
182
00:08:41,340 --> 00:08:44,400
This slide represents a graphic image
183
00:08:44,400 --> 00:08:46,610
of what the mean represents.
184
00:08:46,610 --> 00:08:48,800
This is kinda like a teeter-totter,
185
00:08:48,800 --> 00:08:51,010
where the mean is in the middle
186
00:08:51,010 --> 00:08:54,450
and each individual data
value is shown in blue.
187
00:08:54,450 --> 00:08:55,930
Notice how some of the values
188
00:08:55,930 --> 00:08:58,950
have to be on the left side of the mean
189
00:08:58,950 --> 00:09:00,480
and others to the right.
190
00:09:00,480 --> 00:09:03,310
This is how we balance our teeter-totter.
191
00:09:03,310 --> 00:09:07,410
The equation to calculate
the mean is as follows,
192
00:09:07,410 --> 00:09:10,480
mu is equal to the sum of x,
193
00:09:10,480 --> 00:09:13,750
from the first value all the
way up to the last value,
194
00:09:13,750 --> 00:09:16,420
divided by the total number of values.
195
00:09:16,420 --> 00:09:19,820
Mu represents the population mean.
196
00:09:19,820 --> 00:09:22,970
We will later see that a
different symbol is used
197
00:09:22,970 --> 00:09:26,010
when we are calculating
the mean of a sample.
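The formula just described, mu equals the sum of the x values divided by N, reduces to a one-liner (Python sketch with illustrative values):

```python
# Population mean: add every value, divide by how many values there are.
population = [4, 8, 6, 2, 10]  # illustrative data
mu = sum(population) / len(population)
print(mu)  # 6.0
```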
198
00:09:26,010 --> 00:09:27,950
Here is a very simple example
199
00:09:27,950 --> 00:09:30,250
of calculating the population mean.
200
00:09:30,250 --> 00:09:32,870
There are 12 automobile
manufacturing companies
201
00:09:32,870 --> 00:09:34,240
in the United States.
202
00:09:34,240 --> 00:09:37,030
Listed below is the number of patents
203
00:09:37,030 --> 00:09:38,910
granted by the U.S. government
204
00:09:38,910 --> 00:09:41,280
to each company in a recent year.
205
00:09:41,280 --> 00:09:42,710
To calculate the mean,
206
00:09:42,710 --> 00:09:45,280
we add up all of these values
207
00:09:45,280 --> 00:09:46,350
and then we need to know
208
00:09:46,350 --> 00:09:48,690
that there are 12 of these in total,
209
00:09:48,690 --> 00:09:50,440
so we divide by 12
210
00:09:50,440 --> 00:09:55,060
and get 195 patents, on the average.
211
00:09:55,060 --> 00:09:57,180
If we happen to have sample data
212
00:09:57,180 --> 00:09:59,060
instead of population data,
213
00:09:59,060 --> 00:10:01,130
we can still calculate the mean.
214
00:10:01,130 --> 00:10:03,210
The only difference in the calculation
215
00:10:03,210 --> 00:10:05,610
is that since this is a sample,
216
00:10:05,610 --> 00:10:08,880
we use the symbol X bar instead of mu
217
00:10:08,880 --> 00:10:11,580
to represent the sample mean.
218
00:10:11,580 --> 00:10:15,290
If somebody has taken the
time to group data for us,
219
00:10:15,290 --> 00:10:16,790
for example, somebody says,
220
00:10:16,790 --> 00:10:19,720
we had five people with four green M&Ms,
221
00:10:19,720 --> 00:10:23,250
and two people with six green M&Ms,
222
00:10:23,250 --> 00:10:26,960
there is a shortcut method
for calculating the mean.
223
00:10:26,960 --> 00:10:29,250
We sometimes refer to this calculation
224
00:10:29,250 --> 00:10:31,810
as a weighted mean calculation
225
00:10:31,810 --> 00:10:35,620
or a mean for grouped data.
226
00:10:35,620 --> 00:10:38,530
The equation may look a
little bit more difficult
227
00:10:38,530 --> 00:10:42,440
but doing this on a
calculator becomes easier.
228
00:10:42,440 --> 00:10:45,210
Here's an example for weighted mean.
229
00:10:45,210 --> 00:10:46,810
The Carter Construction Company
230
00:10:46,810 --> 00:10:50,902
pays its hourly employees $16.50,
231
00:10:50,902 --> 00:10:54,710
$19, or $25 per hour.
232
00:10:54,710 --> 00:10:57,450
There are 26 hourly employees,
233
00:10:57,450 --> 00:11:01,650
14 of which are paid at the $16.50 rate,
234
00:11:01,650 --> 00:11:03,870
10 at the $19 rate,
235
00:11:03,870 --> 00:11:06,680
and two at the $25 rate.
236
00:11:06,680 --> 00:11:11,400
What is the mean hourly
pay for the 26 employees?
237
00:11:11,400 --> 00:11:15,443
The weighted mean
calculation produces $18.12.
238
00:11:17,090 --> 00:11:19,950
If we didn't have this
weighted mean equation,
239
00:11:19,950 --> 00:11:24,950
we would have to add up 14
$16.50 per hour numbers,
240
00:11:25,910 --> 00:11:30,200
then to that, we'd have to add 10 at $19,
241
00:11:30,200 --> 00:11:33,640
and then two at the $25 rate.
242
00:11:33,640 --> 00:11:36,250
With a calculator or pencil and paper,
243
00:11:36,250 --> 00:11:38,590
this would take a lot more time.
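The Carter Construction numbers above can be verified with a weighted-mean sketch (Python):

```python
# Weighted mean: multiply each rate by its head count,
# sum those products, then divide by the total number of employees.
rates_and_counts = [(16.50, 14), (19.00, 10), (25.00, 2)]
total_pay = sum(rate * count for rate, count in rates_and_counts)
total_workers = sum(count for _, count in rates_and_counts)
weighted_mean = total_pay / total_workers
print(round(weighted_mean, 2))  # 18.12
```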
244
00:11:38,590 --> 00:11:41,400
If somebody has given us frequency data
245
00:11:41,400 --> 00:11:45,550
for categories that have
a low and a high value,
246
00:11:45,550 --> 00:11:47,780
we encounter another kind of problem.
247
00:11:47,780 --> 00:11:49,840
Which category value do we use?
248
00:11:49,840 --> 00:11:52,100
The low value or the high value?
249
00:11:52,100 --> 00:11:54,010
The answer is neither.
250
00:11:54,010 --> 00:11:55,810
What we have to do is calculate
251
00:11:55,810 --> 00:11:58,100
the midpoint of the category
252
00:11:58,100 --> 00:12:00,960
and assume that this is our data.
253
00:12:00,960 --> 00:12:03,440
This slide shows how
we make the calculation
254
00:12:03,440 --> 00:12:05,430
for this provided data.
255
00:12:05,430 --> 00:12:08,550
Although this is the best we
can do with this kind of data,
256
00:12:08,550 --> 00:12:12,270
you might be noticing that
we are losing some accuracy
257
00:12:12,270 --> 00:12:14,700
in the calculation of the mean.
258
00:12:14,700 --> 00:12:17,310
What were these real values?
259
00:12:17,310 --> 00:12:19,750
Well, this is the best we can do.
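A sketch of the midpoint approach (Python; the class intervals and frequencies here are invented, since the slide's actual numbers aren't in the transcript):

```python
# Grouped data: use each class's midpoint as a stand-in for its members.
classes = [((20, 30), 5), ((30, 40), 8), ((40, 50), 3)]  # ((low, high), frequency)
total = sum(((low + high) / 2) * freq for (low, high), freq in classes)
n = sum(freq for _, freq in classes)
approx_mean = total / n
print(approx_mean)  # 33.75
```

Note that this is only an approximation of the true mean, because the real values inside each class are unknown.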
260
00:12:19,750 --> 00:12:21,440
Let's now look at the median,
261
00:12:21,440 --> 00:12:23,640
which is our second of three ways
262
00:12:23,640 --> 00:12:26,270
of calculating central tendency.
263
00:12:26,270 --> 00:12:30,040
The median is the midpoint
of all the values,
264
00:12:30,040 --> 00:12:34,330
after they have been ordered
from smallest to largest.
265
00:12:34,330 --> 00:12:37,110
If we wanted to find the median income
266
00:12:37,110 --> 00:12:38,700
of people in the class,
267
00:12:38,700 --> 00:12:41,500
and let's even assume that
Bill Gates is in the class,
268
00:12:41,500 --> 00:12:43,960
we take all of the income values
269
00:12:43,960 --> 00:12:46,550
and order them from low to high.
270
00:12:46,550 --> 00:12:49,990
Then we drop off the lowest value on the low side,
271
00:12:49,990 --> 00:12:52,090
and the highest value on the high side.
272
00:12:52,090 --> 00:12:53,830
Bye-bye, Bill Gates.
273
00:12:53,830 --> 00:12:55,350
We continue to do this
274
00:12:55,350 --> 00:12:58,410
until we're left with one value.
275
00:12:58,410 --> 00:13:00,490
If we have an odd number of data,
276
00:13:00,490 --> 00:13:03,860
for example, an odd number
of students in the class,
277
00:13:03,860 --> 00:13:07,550
the median will always
be the middle value.
278
00:13:07,550 --> 00:13:10,430
If we happen to have an
even number of students,
279
00:13:10,430 --> 00:13:13,410
dropping one off from
the low and the high side
280
00:13:13,410 --> 00:13:15,130
causes us to end up with,
281
00:13:15,130 --> 00:13:17,040
oh, nothing in the middle.
282
00:13:17,040 --> 00:13:18,470
Well, by convention,
283
00:13:18,470 --> 00:13:22,300
what we do is when we get
to the last two values,
284
00:13:22,300 --> 00:13:26,720
we then take the middle or
average of these two values.
285
00:13:26,720 --> 00:13:29,110
Some of the characteristics of the median
286
00:13:29,110 --> 00:13:32,040
are that there is a unique median,
287
00:13:32,040 --> 00:13:35,470
meaning only one for any set of data.
288
00:13:35,470 --> 00:13:39,360
There are as many values
above the median as below it
289
00:13:39,360 --> 00:13:41,170
in the data array.
290
00:13:41,170 --> 00:13:42,910
With an even set of numbers,
291
00:13:42,910 --> 00:13:47,510
the median will be the average
of the two middle numbers
292
00:13:47,510 --> 00:13:51,150
and what's important is that the median
293
00:13:51,150 --> 00:13:56,150
is not affected by extremely
large or small values.
294
00:13:56,760 --> 00:13:57,610
So sometimes,
295
00:13:57,610 --> 00:14:00,160
the median is a better
measure of central tendency
296
00:14:00,160 --> 00:14:03,830
when your data contains
a few extreme values.
297
00:14:03,830 --> 00:14:06,970
It can be computed for ratio, interval,
298
00:14:06,970 --> 00:14:09,640
and ordinal-level data.
299
00:14:09,640 --> 00:14:13,150
The third measure of central
tendency is the mode.
300
00:14:13,150 --> 00:14:16,100
The mode is the value of the observation
301
00:14:16,100 --> 00:14:19,230
that appears most frequently.
302
00:14:19,230 --> 00:14:21,620
It may not have anything to do whatsoever
303
00:14:21,620 --> 00:14:23,650
with the middle of the data.
304
00:14:23,650 --> 00:14:27,780
It's the most frequently
occurring value, period.
305
00:14:27,780 --> 00:14:30,290
This graphic tries to
show the relationship
306
00:14:30,290 --> 00:14:31,890
between the mean, median,
307
00:14:31,890 --> 00:14:34,580
and mode for some collected data.
308
00:14:34,580 --> 00:14:37,110
If we collected and graphed some data,
309
00:14:37,110 --> 00:14:39,840
and the plot was completely symmetrical,
310
00:14:39,840 --> 00:14:44,140
the mean, median, and mode
would all be the same value.
311
00:14:44,140 --> 00:14:46,880
If the data happen to
have more smaller values
312
00:14:46,880 --> 00:14:48,830
or more higher values,
313
00:14:48,830 --> 00:14:53,070
the median and mode would be to the left
314
00:14:53,070 --> 00:14:56,450
or to the right of the mean, respectively.
315
00:14:56,450 --> 00:15:00,070
The mean, median, and mode are
pretty simple to understand.
316
00:15:00,070 --> 00:15:02,830
Dispersion gets a little bit more complex.
317
00:15:02,830 --> 00:15:04,910
Remember that the mean and median
318
00:15:04,910 --> 00:15:07,510
only describe the center of the data
319
00:15:07,510 --> 00:15:09,470
but they do not tell us anything
320
00:15:09,470 --> 00:15:11,490
about the spread of the data.
321
00:15:11,490 --> 00:15:13,070
A second reason for studying
322
00:15:13,070 --> 00:15:15,390
the dispersion in a set of data
323
00:15:15,390 --> 00:15:20,390
is to compare the spread in
two or more distributions.
324
00:15:20,470 --> 00:15:23,320
This graphic tries to show what I mean.
325
00:15:23,320 --> 00:15:26,330
It shows three different sets of data
326
00:15:26,330 --> 00:15:30,050
that all have the same calculated mean.
327
00:15:30,050 --> 00:15:31,420
But if you look at the data,
328
00:15:31,420 --> 00:15:35,190
as represented as cylinders
on this teeter-totter,
329
00:15:35,190 --> 00:15:38,550
clearly, the data sets are not the same.
330
00:15:38,550 --> 00:15:41,040
The dispersion is different.
331
00:15:41,040 --> 00:15:42,970
There are four measures of dispersion
332
00:15:42,970 --> 00:15:44,980
that we will learn to calculate.
333
00:15:44,980 --> 00:15:47,910
They are the range, mean deviation,
334
00:15:47,910 --> 00:15:50,670
variance, and standard deviation.
335
00:15:50,670 --> 00:15:52,590
Let's look into each of these.
336
00:15:52,590 --> 00:15:55,580
The first one, the
range, is pretty simple.
337
00:15:55,580 --> 00:15:57,910
It's the difference between the highest
338
00:15:57,910 --> 00:15:59,930
and lowest data value.
339
00:15:59,930 --> 00:16:02,060
We simply subtract these two values
340
00:16:02,060 --> 00:16:03,900
and this is the range.
341
00:16:03,900 --> 00:16:04,940
Here's an example.
342
00:16:04,940 --> 00:16:06,680
The number of cappuccinos sold
343
00:16:06,680 --> 00:16:09,660
at the Starbucks location
in the Orange County Airport
344
00:16:09,660 --> 00:16:11,330
between four and seven p.m.,
345
00:16:11,330 --> 00:16:13,900
for a sample of five days last year
346
00:16:13,900 --> 00:16:17,150
were 20, 40, 50, 60, and 80.
347
00:16:17,150 --> 00:16:21,770
The range ends up being
80 minus 20, or 60.
348
00:16:21,770 --> 00:16:23,180
For the mean deviation,
349
00:16:23,180 --> 00:16:25,350
we take the absolute value
350
00:16:25,350 --> 00:16:28,850
of the difference between the
mean and every data point,
351
00:16:28,850 --> 00:16:32,630
add these up, and divide by the
total number of data points.
352
00:16:32,630 --> 00:16:36,600
You might be wondering what
is this absolute value thing?
353
00:16:36,600 --> 00:16:38,150
Remember, from algebra,
354
00:16:38,150 --> 00:16:41,660
that the absolute value
is always positive.
355
00:16:41,660 --> 00:16:44,000
If we didn't take the absolute values,
356
00:16:44,000 --> 00:16:46,360
we would always end up with zero.
357
00:16:46,360 --> 00:16:48,350
That's what we just
learned about the mean.
358
00:16:48,350 --> 00:16:50,120
That the sum of the deviations
359
00:16:50,120 --> 00:16:53,360
from the mean and its data is always zero.
360
00:16:53,360 --> 00:16:55,440
By using the absolute value,
361
00:16:55,440 --> 00:16:58,670
we make every difference be positive
362
00:16:58,670 --> 00:17:02,590
and then the sum of these
divided by the total number
363
00:17:02,590 --> 00:17:06,210
will always produce some
value other than zero.
364
00:17:06,210 --> 00:17:09,360
This slide shows how we
calculate the mean deviation
365
00:17:09,360 --> 00:17:13,360
for the same set of
Starbucks cappuccino sales.
366
00:17:13,360 --> 00:17:16,730
The mean deviation ends up as 16.
367
00:17:16,730 --> 00:17:19,220
Bigger numbers for the mean deviation
368
00:17:19,220 --> 00:17:21,220
mean more dispersion.
369
00:17:21,220 --> 00:17:24,800
Smaller numbers mean less dispersion.
370
00:17:24,800 --> 00:17:27,360
If instead of 16, we ended up with,
371
00:17:27,360 --> 00:17:28,940
oh, let's say three,
372
00:17:28,940 --> 00:17:31,460
we would know that there's
not a whole lot of variation
373
00:17:31,460 --> 00:17:34,380
from day to day from what the average is.
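The mean-deviation calculation for the same five days, as a sketch (Python):

```python
# Mean deviation: the average absolute distance from the mean.
cappuccinos = [20, 40, 50, 60, 80]
mean = sum(cappuccinos) / len(cappuccinos)  # 50.0
mean_deviation = sum(abs(x - mean) for x in cappuccinos) / len(cappuccinos)
print(mean_deviation)  # 16.0
```

The absolute value is what keeps the deviations from cancelling to zero.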
374
00:17:34,380 --> 00:17:36,380
The variance and standard deviation
375
00:17:36,380 --> 00:17:38,380
are the hardest to calculate
376
00:17:38,380 --> 00:17:41,670
with either a calculator
or pencil and paper.
377
00:17:41,670 --> 00:17:45,870
The variance is very similar
to this mean deviation
378
00:17:45,870 --> 00:17:49,660
but instead of taking the
absolute value of the difference,
379
00:17:49,660 --> 00:17:52,460
we square the difference,
380
00:17:52,460 --> 00:17:55,760
add these up, and divide by
the total number of values.
381
00:17:55,760 --> 00:17:58,540
Squaring this difference of the numbers
382
00:17:58,540 --> 00:18:02,160
will always produce a positive number.
383
00:18:02,160 --> 00:18:04,010
You might remember from algebra,
384
00:18:04,010 --> 00:18:07,620
that a positive times a
positive is a positive.
385
00:18:07,620 --> 00:18:10,920
And a negative times a
negative is a positive.
386
00:18:10,920 --> 00:18:13,250
Another thing that squaring the difference
387
00:18:13,250 --> 00:18:14,920
between these numbers does
388
00:18:14,920 --> 00:18:19,040
is any large deviation gets exaggerated.
389
00:18:19,040 --> 00:18:21,020
If you can understand the variance,
390
00:18:21,020 --> 00:18:23,800
the standard deviation is pretty simple.
391
00:18:23,800 --> 00:18:27,870
It is simply the square root
of the variance, that's it.
392
00:18:27,870 --> 00:18:31,580
Here is an example for variance
and standard deviation.
393
00:18:31,580 --> 00:18:34,360
The number of traffic citations issued
394
00:18:34,360 --> 00:18:35,990
during the last five months
395
00:18:35,990 --> 00:18:38,460
in Beaufort County, South Carolina,
396
00:18:38,460 --> 00:18:42,530
is 38, 26, 13, 41, and 22.
397
00:18:42,530 --> 00:18:45,090
What is the population variance?
398
00:18:45,090 --> 00:18:49,180
The calculated variance is 106.8.
399
00:18:49,180 --> 00:18:53,870
The standard deviation is
the square root of 106.8.
400
00:18:53,870 --> 00:18:55,890
It's not shown on this overhead,
401
00:18:55,890 --> 00:18:58,630
but it would be approximately
402
00:18:58,630 --> 00:19:00,520
10.3, let's say.
403
00:19:00,520 --> 00:19:03,120
Remember 10 times 10 is 100,
404
00:19:03,120 --> 00:19:06,060
which is pretty close to 106.
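The Beaufort County example, sketched in Python:

```python
import math

# Population variance: the average of the squared deviations from the mean.
citations = [38, 26, 13, 41, 22]
mu = sum(citations) / len(citations)  # 28.0
variance = sum((x - mu) ** 2 for x in citations) / len(citations)
std_dev = math.sqrt(variance)

print(variance)           # 106.8
print(round(std_dev, 1))  # 10.3
```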
405
00:19:06,060 --> 00:19:10,010
There is a subtle yet very
important characteristic
406
00:19:10,010 --> 00:19:12,890
about the variance and standard deviation
407
00:19:12,890 --> 00:19:14,260
that you must learn.
408
00:19:14,260 --> 00:19:16,240
For sample data,
409
00:19:16,240 --> 00:19:19,560
this equation is a little bit different.
410
00:19:19,560 --> 00:19:22,450
Instead of dividing by the
total number of values,
411
00:19:22,450 --> 00:19:25,920
we divide by one less than the total.
412
00:19:25,920 --> 00:19:29,410
This is represented by n minus one.
413
00:19:29,410 --> 00:19:30,790
The nature of the sample
414
00:19:30,790 --> 00:19:34,670
is that we do not have all
of the data to analyze.
415
00:19:34,670 --> 00:19:38,410
So when we calculate the
variance or standard deviation,
416
00:19:38,410 --> 00:19:41,280
dividing by n minus one
417
00:19:41,280 --> 00:19:43,960
causes the variance and standard deviation
418
00:19:43,960 --> 00:19:47,570
to be bigger than dividing by n.
419
00:19:47,570 --> 00:19:50,360
You might have to think back
to mathematics for this.
420
00:19:50,360 --> 00:19:53,390
But when the denominator gets smaller,
421
00:19:53,390 --> 00:19:55,730
the result gets larger.
422
00:19:55,730 --> 00:19:58,890
An important, but again,
subtle concept here
423
00:19:58,890 --> 00:20:00,350
is that with sample data,
424
00:20:00,350 --> 00:20:03,270
if somebody asks us to
describe the spread,
425
00:20:03,270 --> 00:20:05,070
we're going to hedge a little bit
426
00:20:05,070 --> 00:20:08,420
and make the spread be a little bit wider,
427
00:20:08,420 --> 00:20:11,440
just because we know we
don't have all of the data.
428
00:20:11,440 --> 00:20:15,030
In future chapters, this
will become very important.
429
00:20:15,030 --> 00:20:16,170
You might be wondering,
430
00:20:16,170 --> 00:20:17,100
what are we ever gonna do
431
00:20:17,100 --> 00:20:19,830
with this standard deviation and variance?
432
00:20:19,830 --> 00:20:22,410
Well, it turns out that this calculation
433
00:20:22,410 --> 00:20:25,040
has been studied in-depth over time
434
00:20:25,040 --> 00:20:29,910
and it has produced some very
interesting observations.
435
00:20:29,910 --> 00:20:32,640
A mathematician named Chebyshev
436
00:20:32,640 --> 00:20:35,210
found that for any set of observations,
437
00:20:35,210 --> 00:20:39,190
the proportion of the
values that lie within k,
438
00:20:39,190 --> 00:20:41,430
standard deviations of the mean,
439
00:20:41,430 --> 00:20:44,840
is at least one minus one over k squared,
440
00:20:44,840 --> 00:20:47,530
where k is a constant greater than one.
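Chebyshev's bound, one minus one over k squared, in a quick sketch (Python):

```python
# Chebyshev: at least 1 - 1/k^2 of any data set lies within
# k standard deviations of the mean (for k > 1), whatever the shape.
def chebyshev_bound(k):
    return 1 - 1 / k ** 2

print(chebyshev_bound(2))  # 0.75 -> at least 75% within 2 standard deviations
print(chebyshev_bound(3))  # about 0.889 -> at least ~89% within 3
```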
441
00:20:47,530 --> 00:20:50,030
Hmm, I'm glad I'm not a mathematician.
442
00:20:50,030 --> 00:20:51,500
What else can you tell me?
443
00:20:51,500 --> 00:20:54,660
Well, here's a little bit
more interesting idea.
444
00:20:54,660 --> 00:20:56,180
The empirical rule.
445
00:20:56,180 --> 00:20:58,800
We collect a bunch of data and plot it.
446
00:20:58,800 --> 00:21:00,590
And let's assume that the plot
447
00:21:00,590 --> 00:21:04,380
is a symmetrical bell-shaped
frequency distribution.
448
00:21:04,380 --> 00:21:05,780
The empirical rule says
449
00:21:05,780 --> 00:21:10,780
that approximately 68% of
the observations will lie
450
00:21:10,900 --> 00:21:15,230
within plus or minus one
standard deviation of the mean,
451
00:21:15,230 --> 00:21:20,230
about 95% within two standard
deviations of the mean,
452
00:21:20,600 --> 00:21:22,530
and practically all,
453
00:21:22,530 --> 00:21:25,540
99.7% will lie within
454
00:21:25,540 --> 00:21:29,370
plus or minus three standard
deviations of the mean.
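[On-screen note: the 68-95-99.7 percentages can be verified by simulation. A minimal sketch, assuming normally distributed data; the mean of 100, standard deviation of 15, and sample size are arbitrary choices for illustration.]

```python
# Empirical rule: for bell-shaped (normal) data, roughly 68% of values
# fall within 1 SD of the mean, 95% within 2 SDs, 99.7% within 3 SDs.
import random
import statistics

random.seed(42)
data = [random.gauss(100, 15) for _ in range(100_000)]  # simulated bell-shaped data
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    frac = sum(abs(x - mean) <= k * sd for x in data) / len(data)
    # simulated fractions land close to the rule's stated percentages
    assert abs(frac - expected) < 0.01
```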
455
00:21:29,370 --> 00:21:31,240
Hmm, once again, big deal.
456
00:21:31,240 --> 00:21:34,120
This really doesn't mean a
whole lot to us right now.
457
00:21:34,120 --> 00:21:36,530
But later on, we're gonna study data
458
00:21:36,530 --> 00:21:39,130
that takes on this symmetrical shape.
459
00:21:39,130 --> 00:21:42,450
Let me tie this back to
our discussion on graphing.
460
00:21:42,450 --> 00:21:44,730
Let's assume that we collect some data
461
00:21:44,730 --> 00:21:48,050
and plot it as a relative
frequency diagram
462
00:21:48,050 --> 00:21:51,740
and that this diagram
looks like a bell curve
463
00:21:51,740 --> 00:21:54,180
or normal distribution.
464
00:21:54,180 --> 00:21:57,000
We can calculate the mean of this data,
465
00:21:57,000 --> 00:21:59,900
which happens to be the
middle of this curve,
466
00:21:59,900 --> 00:22:01,290
and we can also calculate
467
00:22:01,290 --> 00:22:04,650
a statistic called the standard deviation.
468
00:22:04,650 --> 00:22:05,720
I have mentioned to you
469
00:22:05,720 --> 00:22:09,100
that once we have a relative
frequency diagram of data,
470
00:22:09,100 --> 00:22:10,750
we can ask various questions
471
00:22:10,750 --> 00:22:15,080
about the probability of
certain events happening.
472
00:22:15,080 --> 00:22:16,900
For example, in this case,
473
00:22:16,900 --> 00:22:18,300
we might want to know
474
00:22:18,300 --> 00:22:21,390
what the probability is of a value being,
475
00:22:21,390 --> 00:22:26,390
oh, let's say, one standard
deviation lower than the mean,
476
00:22:26,710 --> 00:22:29,640
and one standard deviation
higher than the mean.
477
00:22:29,640 --> 00:22:31,660
This empirical rule says
478
00:22:31,660 --> 00:22:34,710
that the area under the curve,
479
00:22:34,710 --> 00:22:36,410
or the probability
480
00:22:36,410 --> 00:22:40,483
of this particular event
occurring would be 68%.
481
00:22:42,070 --> 00:22:43,760
So you might be able to see
482
00:22:43,760 --> 00:22:45,710
that this empirical rule
483
00:22:45,710 --> 00:22:49,640
is getting us over to the
concept of probabilities
484
00:22:49,640 --> 00:22:51,410
and appreciating this concept
485
00:22:51,410 --> 00:22:56,410
that the probability is
the area under the curve.
486
00:22:56,970 --> 00:23:01,010
In later chapters, we will
study this normal distribution
487
00:23:01,010 --> 00:23:04,970
and we will see how to
answer probability questions
488
00:23:04,970 --> 00:23:08,360
given any points underneath this curve.
489
00:23:08,360 --> 00:23:09,970
Not just minus one
490
00:23:09,970 --> 00:23:12,860
or plus one standard
deviations from the mean,
491
00:23:12,860 --> 00:23:15,060
but any values.
492
00:23:15,060 --> 00:23:19,450
By this time, you might be
overwhelmed with these equations.
493
00:23:19,450 --> 00:23:21,470
In another session, we will learn
494
00:23:21,470 --> 00:23:23,770
how to calculate all of these values
495
00:23:23,770 --> 00:23:26,640
using Microsoft Excel.
496
00:23:26,640 --> 00:23:28,000
In this chapter, we have learned
497
00:23:28,000 --> 00:23:30,740
about ways of describing data numerically.
498
00:23:30,740 --> 00:23:32,210
In the previous chapter,
499
00:23:32,210 --> 00:23:36,290
we learned about how to
describe data graphically.
500
00:23:36,290 --> 00:23:37,960
Both methods are important
501
00:23:37,960 --> 00:23:41,142
in our study of business statistics.
502
00:23:41,142 --> 00:23:44,398
(upbeat music)
503
00:23:44,398 --> 00:23:47,940
Thanks for taking time to
spend some moments with me.
504
00:23:47,940 --> 00:23:51,390
I hope that you have enjoyed
what I have shared with you.
505
00:23:51,390 --> 00:23:54,510
If you ever have any
questions or need my help,
506
00:23:54,510 --> 00:23:55,945
please let me know.
507
00:23:55,945 --> 00:23:58,695
(upbeat music)