AB1202: Statistical & Quantitative Methods
Lecture 1: Introduction & Data Presentation
2013-2014, S2
Dr Michael Li
Outline
• Course Briefing
• Introduction
– What is Statistics
– Data & Data Sources
– Populations and Samples
– Examples
• Tabular & Graphical Methods of Data Presentation
– Frequency Distribution
– Histograms & Pareto Charts
– Stem Plots & Scatter Plots
2
Course Information
• COURSE INSTRUCTORS
Dr. Michael Li, S3-B1A-19, 67904659, [email protected]
Dr. Chen Shaoxiang, S3-B2A-30, 67906143, [email protected]
• COURSE ASSESSMENT
Components                        Marks
Coursework                        40%
Final Examination (Open-book)     60%
Total                             100%

Coursework Components             Marks
Class Participation               20%
Case Study (Group)                30%
Two In-Class Quizzes              50%
Sub-Total                         100%
• COURSE DELIVERY
– 12 lectures + 12 tutorials (please pay attention to MI – mobility initiative)
– Two in-class quizzes: during Tutorial 7 (week 9, after recess) and Tutorial 11 (week 13)
– Statistical software knowledge (required): SPSS (a very powerful and useful statistics package), Excel (add-on for statistical analysis), TreePlan (decision trees)
Course Coverage
• Making Sense of Data and Summarizing Data
• Concept of Probability
– Bayes' Theorem
• Random Variables & Probability Distributions
– Binomial, Uniform, Normal, Covariance (Appendix B)
• Decision Analysis
• Sampling Distributions
• Statistical Inference: Confidence Intervals & Hypothesis Testing
• Design of Experiment & Analysis of Variance
• Regression Models
– Simple & Multiple Regressions
• Required textbook
– Bruce L. Bowerman, Richard T. O'Connell and Emily S. Murphree. "Business Statistics in Practice, Sixth Edition", McGraw-Hill/Irwin, 2012
What Is Statistics?
1. Collecting Data (e.g., Survey)
2. Presenting Data (e.g., Charts & Tables)
3. Characterizing Data (e.g., Average)
Why data analysis? Decision-making.
Statistics is the science of data. It involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information.
Basic Concepts
• Data: facts and figures from which conclusions can be drawn
– Data set: the data that are collected for a particular study
• Elements: may be people, objects, events, or other entities
• Variable: any characteristic of an element
• Measurement: a way to assign a value of a variable to the element
– Quantitative: the possible measurements of the values of a variable are numbers that represent quantities
– Qualitative: the possible measurements fall into several categories
• Cross-sectional data: data collected at the same or approximately the same point in time
– Example: mobile phone bills of employees at a bank during a particular month
• Time series data: data collected over different time periods
– Most economic data are time-series data, e.g., inflation, unemployment rate, CPI, exchange rate. Periodic (monthly, quarterly, or yearly) corporate sales figures are also time-series data
Cross‐Sectional Data – SG Example
Source: Singapore Population 2012 (Department of Statistics)
A moment of pondering:
• Any insights from the data?
• Any impact on you?
Time Series Data – SG Example
Source: Singapore Population 2012
Live-Births refer to all live-births occurring within Singapore and its territorial waters.
Total Fertility Rate refers to the average number of live-births each female would have during her reproductive years.
Data Sources
• Existing sources (secondary): data already gathered by public or private sources
– Library
– Government
– Data collection agency
– Internet
• Experimental and observational studies (primary): data that we collect ourselves for a specific purpose
– Response variable: the main variable of interest, e.g., salary
– Factors: other variables related to the response variable, e.g., education, experience, etc.
Data Sources from NTU Library
• NTU Library Business Databases (some examples):
– Compustat Global
• Currency, statement, balance sheet, flow of funds, and supplemental data items of listed global companies from 1989 onwards
– Business Monitor International
• Country risks and business environment
– Datamonitor 360
• Intelligence on companies, industries, products, countries, etc.
– Global Market Information Database (GMID)
• Business intelligence on countries, consumers and industries
– International Financial Statistics (IMF)
• Statistics on exchange rates, international reserves, banking, balance of payments, government finances, prices, etc., for most countries in the world
Singapore Government Data Sources
• Statistics Singapore
– Economic data, sector‐level data, demographic data, household survey data, national census data
• Housing & Development Board (HDB)
– Resale flat prices
• Urban Redevelopment Authority (URA)
– Private residential transactions
• Land Transport Authority (LTA) – Onemotoring
– Vehicle population, COE prices, real-time traffic, etc.
• Singapore Tourism Board (STB)
– Annual, quarterly and monthly tourism statistics
Key Concepts: Populations and Samples
Population: the set of all elements about which we wish to draw conclusions (people, objects or events)
Census: an examination of the entire population of measurements
Sample: a selected subset of the units of a population
Statistical Methods
Descriptive Statistics: the science of describing the important aspects of a set of measurements
Inferential Statistics: the science of using a sample of measurements to make generalizations about the important aspects of a population of measurements
Example 1: Estimating Cell Phone Costs (p.8)
• A bank wishes to decide whether to hire a cellular management service to choose its employees’ calling plans.
– Over 10,000 employees, on different types of calling plans
• The cellular service company suggests studying the calling patterns of mobile users on 500-minute-per-month plans
– Purpose: to see whether cellular costs can be substantially reduced
– The bank has 2,136 employees on a variety of 500-minute-per-month plans, with different basic monthly rates, different overage charges, and different additional charges for long-distance calls and roaming
Cell Phone Costs (cont.)
• Selecting a random sample (from 2,136 employees)
– A random sample of 100 employees on 500-minute plans
– Key observation: many overages and underages
Data file: Lect01‐Data.xlsx Worksheet: CellUse
Excel function: • Countif(range, criteria)
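For readers without Excel at hand, the counting that COUNTIF(range, criteria) performs can be sketched in Python. The minutes below are made-up values for ten employees, not the CellUse worksheet data, and the 500-minute threshold is taken from the plan description above.

```python
# Hypothetical monthly minutes used by ten employees on a 500-minute plan.
# These numbers are invented for illustration; they are NOT the CellUse data.
minutes_used = [75, 654, 496, 0, 879, 511, 542, 571, 719, 482]
PLAN_MINUTES = 500

# Same idea as COUNTIF(range, ">500"): employees who exceeded the allowance.
overages = sum(1 for m in minutes_used if m > PLAN_MINUTES)

# Same idea as COUNTIF(range, "<500"): employees who used less than they paid for.
underages = sum(1 for m in minutes_used if m < PLAN_MINUTES)

print("Overages:", overages)
print("Underages:", underages)
```

Both counts are simple tallies of how many values satisfy a condition, which is all that is needed to see whether overages and underages are common in a sample.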
Example 2: Rating a New Design
• A branding company is studying to see if changes should be made in the bottle design for a popular soft drink.
– Respondents are shoppers from a large shopping mall on a particular Saturday
– Exposed to the new bottle design and asked to rate:
• Five items on a 7-point "Likert scale" (survey instrument)
• A composite score is the sum of all five items
• Rule of thumb: a score of 25 is the smallest score for a success
Rating a New Design (cont.)
• Sampling method: “interception method”
– Not a completely random sample, but can generate an approximately random sample (how?)
– A sample size of 60
Worksheet: Design
– Key observation: 57 of 60 (i.e., 95%) composite scores are at least 25
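The composite-score rule from the bottle design example can be sketched as follows. The four respondents and their ratings are hypothetical (not the Design worksheet data); only the five-item sum and the threshold of 25 come from the example.

```python
# Hypothetical ratings: one row per respondent, five items on a 1-7 Likert scale.
# Invented for illustration; NOT the Design worksheet data.
ratings = [
    [6, 7, 5, 6, 6],
    [4, 5, 4, 5, 4],
    [7, 7, 6, 7, 7],
    [5, 5, 5, 5, 5],
]

SUCCESS_THRESHOLD = 25  # rule of thumb: smallest composite score counted as a success

# Composite score = sum of the five item ratings for each respondent.
composite = [sum(row) for row in ratings]

# Count respondents whose composite score is at least the threshold.
successes = sum(1 for score in composite if score >= SUCCESS_THRESHOLD)
percent_success = 100 * successes / len(composite)

print(composite)
print(f"{successes} of {len(composite)} ({percent_success:.0f}%) are at least {SUCCESS_THRESHOLD}")
```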
Example 3: Estimating Car Gas Mileage
• Study of a tax credit offered by the federal government to automakers for improving the fuel economy of gasoline-powered midsize cars
• An automaker has introduced a new model and wishes to demonstrate that it qualifies for the tax credit
• US EPA Fuel Economy:
– http://www.epa.gov/fueleconomy/
– Market average: 26 miles per gallon (mpg) (year 2009)
– Tax incentive goal: an improvement of 5 mpg, i.e., at least 31 mpg
Estimating Car Gas Mileage (cont.)
• An approximately random sample of 50 cars
– One car from each of 50 consecutive production shifts
– Each selected car is subject to an EPA test
• A 7.5-mile city driving trip & a 10-mile highway driving trip
• A combined mileage for the car
• Mileages vary from 29.8 mpg to 33.3 mpg
• 38 out of 50 (76%) of the mileages are at least 31 mpg
Data Presentation Techniques
• Graphically Summarizing Qualitative Data
– Frequency distribution, bar chart, pie chart, Pareto chart
• Graphically Summarizing Quantitative Data
– Frequency distribution, histograms, ogives
• Stem-and-Leaf Displays
• Crosstabulation Tables
• Scatter Plots
Frequency Distribution for Qualitative Data
• With qualitative data, names identify the different categories
• These data can be summarized using a frequency distribution
– Frequency distribution:
• A table that summarizes the number of items in each of several non-overlapping classes
Example 2.1: 2006 Jeep Purchasing Patterns
• Table 2.1 lists all 251 vehicles sold in 2006 by the Jeep dealers
– It does not reveal much useful information
• A frequency distribution is a useful summary
– Simply count the number of times each model appears in Table 2.1
Worksheet: JeepSales
Relative Frequency and Percent Frequency
• Relative frequency summarizes the proportion of items in each class
– For each class, divide the frequency of the class by the total number of observations
– Multiply by 100 to obtain the percent frequency
Worksheet: JeepSales
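The frequency, relative frequency, and percent frequency calculations can be sketched in Python. The ten sales records below are hypothetical and far smaller than Table 2.1; they only illustrate the "count, divide by the total, multiply by 100" steps.

```python
from collections import Counter

# Hypothetical list of models sold, one entry per vehicle.
# Invented for illustration; NOT the JeepSales worksheet data.
sales = ["Liberty", "Wrangler", "Liberty", "Commander", "Grand Cherokee",
         "Liberty", "Commander", "Grand Cherokee", "Liberty", "Wrangler"]

# Frequency: how many times each model appears.
frequency = Counter(sales)
n = len(sales)

# Relative frequency: class frequency divided by the total number of observations.
relative = {model: count / n for model, count in frequency.items()}

# Percent frequency: relative frequency multiplied by 100.
percent = {model: 100 * rf for model, rf in relative.items()}

for model in sorted(frequency):
    print(f"{model:15s} {frequency[model]:3d} {relative[model]:6.2f} {percent[model]:6.1f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any hand-built table.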
Bar Charts and Pie Charts
• Bar chart: A vertical or horizontal rectangle represents the frequency for each category
– Height can be frequency, relative frequency, or percent frequency
• Pie chart: A circle divided into slices where the size of each slice represents its relative frequency or percent frequency
• Drawing bar charts and pie charts in Excel is easy
Excel Bar and Pie Chart of the Jeep Sales Data
Worksheet: JeepSales
Pareto Chart
• Pareto chart: A bar chart having the different kinds of defects listed on the horizontal scale
– Bar height represents the frequency of occurrence
– Bars are arranged in decreasing height from left to right
– Sometimes augmented by plotting a cumulative percentage point for each bar
Worksheet: Labels
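The numbers behind a Pareto chart (bars in decreasing order plus a cumulative percentage for each bar) can be sketched as below. The defect names and counts are hypothetical, not the Labels worksheet data.

```python
# Hypothetical defect counts, deliberately listed out of order.
# Invented for illustration; NOT the Labels worksheet data.
defects = {
    "Missing label": 36,
    "Smudged label": 3,
    "Crooked label": 78,
    "Loose label": 12,
    "Printing error": 15,
    "Wrinkled label": 6,
}
total = sum(defects.values())

# Arrange categories in decreasing frequency (left to right on the chart).
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Build (name, frequency, cumulative percentage) for each bar.
pareto = []
running = 0
for name, count in ordered:
    running += count
    pareto.append((name, count, 100 * running / total))

for name, count, cum_pct in pareto:
    print(f"{name:15s} {count:4d} {cum_pct:6.1f}%")
```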
Graphically Summarizing Quantitative Data
• Often need to summarize and describe the shape of the distribution
• One way is to group the measurements into the classes of a frequency distribution
– "Classify and count"
– The frequency distribution is a table
• Then display the data in the form of a histogram
– The histogram is a picture of the frequency distribution
Constructing a Frequency Distribution
Steps in making a frequency distribution:
1. Find the number of classes
2. Find the class length
3. Form non-overlapping classes of equal width
4. Tally and count
5. Graph the histogram
Example 2.2: Payment time
• A sample of 65 observations, min = 10 days, max = 29 days
Number of Classes & Class Length
• Number of Classes
– Group all of the n data into K classes
– K is the smallest whole number for which 2^K > n (a guide only)
– In Example 2.2, n = 65
• For K = 6, 2^6 = 64 < n
• For K = 7, 2^7 = 128 > n
• So use K = 7 classes
• Class length
– Find the length of each class as (largest measurement − smallest measurement) / K, where K is the number of classes found earlier
– For Example 2.2, (29 − 10)/7 = 2.7143
• Because payment times are measured in days, round up to three days
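The two calculations above can be checked with a short Python sketch using the Example 2.2 figures quoted on this slide (n = 65, smallest = 10, largest = 29). Rounding the class length up with math.ceil is an assumption made here so that the K classes are sure to cover the whole range; the slide only says to round to three days.

```python
import math

n = 65          # number of payment times in Example 2.2
smallest = 10   # smallest measurement (days)
largest = 29    # largest measurement (days)

# Number of classes: the smallest whole number K with 2**K > n (a guide only).
K = 1
while 2 ** K <= n:
    K += 1

# Class length: range divided by K, rounded up to whole days (assumption: round up).
raw_length = (largest - smallest) / K
class_length = math.ceil(raw_length)

# Lower boundaries of the K non-overlapping, equal-width classes.
lower_bounds = [smallest + i * class_length for i in range(K)]

print("K =", K)
print("raw length =", round(raw_length, 4), "-> class length =", class_length)
print("lower bounds:", lower_bounds)
```

With these inputs the classes start at 10, 13, 16, and so on, each three days wide.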
Histogram – Using Excel
Histogram of payment times (class in days: frequency)
10 < 13: 3
13 < 16: 14
16 < 19: 23
19 < 22: 12
22 < 25: 8
25 < 28: 4
28 < 31: 1
Histogram – Using SPSS
SPSS data file: Lect01‐PaymentTime.sav
Note: Most statistical software generates histograms automatically, so the histogram for a data set is not unique; any version is acceptable so long as the graph shows the data pattern.
Histograms: Three General Cases
Symmetrical: The right and left tails of the histogram appear to be mirror images of each other
Skewed to the right: The right tail of the histogram is longer than the left tail
Skewed to the left: The left tail of the histogram is longer than the right tail
Cumulative Distributions
• Another way to summarize a distribution is to construct a cumulative distribution
• To do this, use the same number of classes, class lengths, and class boundaries used for the frequency distribution
• Rather than a count for each class, record the number of measurements that are less than the upper boundary of that class; in other words, a running total
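The running total can be sketched in Python. The class frequencies below are the seven payment-time counts as read off the Excel histogram earlier in this lecture (3, 14, 23, 12, 8, 4, 1); treat that pairing of counts to classes as an assumption if your own histogram differs.

```python
from itertools import accumulate

# Upper class boundaries (days) and class frequencies for the payment-time data,
# as read from the lecture's histogram (assumed pairing of counts to classes).
upper_bounds = [13, 16, 19, 22, 25, 28, 31]
frequencies = [3, 14, 23, 12, 8, 4, 1]

# Cumulative frequency: number of measurements below each upper boundary.
cumulative = list(accumulate(frequencies))
n = cumulative[-1]

# Cumulative percent frequency, the usual vertical axis of an ogive.
cumulative_percent = [100 * c / n for c in cumulative]

for bound, cum, pct in zip(upper_bounds, cumulative, cumulative_percent):
    print(f"< {bound:2d} days: {cum:2d} ({pct:5.1f}%)")
```

Plotting each cumulative value above its upper class boundary and joining the points gives the graph of the cumulative distribution.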
Ogive
• Ogive: A graph of a cumulative distribution
– Plot a point above each upper class boundary at the height of the cumulative frequency
– Connect the points with line segments
– Can also be drawn using
• Cumulative relative frequencies
• Cumulative percent frequencies
Worksheet: PayTime
Stem‐and‐Leaf Displays
• Purpose is to see the overall pattern of the data, by grouping the data into classes
– the variation from class to class
– the amount of data in each class
– the distribution of the data within each class
• Best for small to moderately sized data distributions
Car Mileage Example
• The stem-and-leaf display:
29 | 8     (29 + 0.8 = 29.8)
30 | 13455677888
31 | 0012334444455667778899
32 | 01112334455778
33 | 03     (33 + 0.3 = 33.3)
• Looking at the stem-and-leaf display, the distribution appears almost "symmetrical"
– The upper portion (29, 30, 31) is almost a mirror image of the lower portion of the display (31, 32, 33)
– But not exactly a mirror reflection
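Since Excel cannot produce stem plots, a minimal Python sketch of how a display like this is built may help. The ten mileages below are a small illustrative subset, not the full 50-car sample, and the stem is taken to be the whole-number part with the leaf as the first decimal digit.

```python
from collections import defaultdict

# A small illustrative set of mileages (mpg); NOT the full 50-car sample.
mileages = [29.8, 30.1, 30.8, 31.0, 31.4, 31.4, 31.9, 32.0, 32.5, 33.3]

stems = defaultdict(list)
for value in sorted(mileages):
    tenths = round(value * 10)        # 29.8 -> 298 (avoids float truncation)
    stem, leaf = divmod(tenths, 10)   # 298 -> stem 29, leaf 8
    stems[stem].append(str(leaf))

# One line per stem, leaves in increasing order.
lines = [f"{stem} | {''.join(leaves)}" for stem, leaves in sorted(stems.items())]
print("\n".join(lines))
```

Sorting the data first keeps the leaves on each stem in increasing order, which is what makes the shape of the distribution readable.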
SPSS data file: Lect01‐GasMiles.sav
Constructing a Stem‐and‐Leaf Display
• No rules that dictate the number of stem values
– Can split the stems as needed
– Use SPSS (Excel cannot generate stem plots)
SPSS data file: Lect01‐PaymentTime.sav
Stem‐and‐Leaf Display ‐ SPSS
Stem‐and‐leaf display for Payment Time data
Stem‐and‐leaf display for Car Mileage data
Note: Stem-and-leaf displays are NOT unique!
Cross‐tabulation Tables
• Classifies data on two dimensions
– Rows classify according to one dimension
– Columns classify according to a second dimension
• Requires three variables
1. The row variable
2. The column variable
3. The variable counted in the cells
• SPSS can easily create cross-tabulation tables
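The counting behind a cross-tabulation table can be sketched in Python. The eight (fund type, satisfaction level) records are hypothetical, chosen only to show how the row variable, the column variable, and the cell counts fit together.

```python
from collections import Counter

# Hypothetical (fund type, satisfaction level) pairs, one per investor.
# Invented for illustration; NOT the Lect01-Invest data.
responses = [
    ("Bond", "High"), ("Stock", "High"), ("Stock", "Medium"),
    ("Tax-deferred", "Low"), ("Bond", "Medium"), ("Stock", "High"),
    ("Tax-deferred", "Medium"), ("Bond", "High"),
]

# Cell counts: how many investors fall in each (row, column) combination.
cells = Counter(responses)

row_labels = sorted({fund for fund, _ in responses})   # row variable: fund type
col_labels = ["High", "Medium", "Low"]                 # column variable: satisfaction

# One list of counts per row, in column order; missing combinations count as 0.
table = {row: [cells[(row, col)] for col in col_labels] for row in row_labels}

print(f"{'':14s}" + "".join(f"{col:>8s}" for col in col_labels))
for row in row_labels:
    print(f"{row:14s}" + "".join(f"{count:8d}" for count in table[row]))
```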
Example 2.5: Investor Satisfaction
• The raw data: fund type & satisfaction level
Investor Satisfaction: Cross‐tabulation
• A cross tabulation table of fund type vs. satisfaction level
Cross‐tabulations – Using SPSS
• Analyze → Descriptive Statistics → Crosstabs
SPSS data file: Lect01‐Invest.sav
Scatter Plots
• Used to study relationships between two variables
– Place one variable on the x-axis
– Place a second variable on the y-axis
– Place a dot at each pair of coordinates
• Software
– Excel: easy & simple
– SPSS: easy & sophisticated!
• Types of Relationships
– Linear: A straight-line relationship between the two variables
• Positive: When one variable goes up, the other variable goes up
• Negative: When one variable goes up, the other variable goes down
– No Linear Relationship: There is no coordinated linear movement between the two variables
Scatter Plots – Using Excel
Worksheet “SalesPlot”
End of Lecture 1
NEXT LECTURE: CHAPTER 3 DESCRIPTIVE STATISTICS