Statistics


basic quantitative analysis


Histograms
In addition to grouping data, we often graph them to better visualize any patterns in the data. Seeing data displayed graphically can significantly deepen our understanding of a data set and the situation it describes.
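For readers working outside Excel (or without the Analysis ToolPak described below), here is a minimal sketch of building a histogram in Python with matplotlib; the ages are invented purely for illustration.

```python
import matplotlib.pyplot as plt

# Invented ages of students in a college course (illustration only).
ages = [19, 20, 20, 21, 21, 21, 22, 22, 23, 23, 24, 25, 26, 28, 30, 75]

# Group the ages into 5-year bins and draw the histogram.
# The 75-year-old appears as an isolated bar far to the right.
plt.hist(ages, bins=range(15, 85, 5), edgecolor="black")
plt.xlabel("Age (years)")
plt.ylabel("Number of students")
plt.title("Age distribution of students")
plt.show()
```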

Outliers
In many data sets, there are occasional values that fall far from the rest of the data. For example, if we graph the age distribution of students in a college course, we might see a data point at 75 years. Data points like this one that fall far from the rest of the data are known as outliers. How do we interpret them?

Summary
With any data set we encounter, we must find ways to allow the data to tell their story. Ordering and graphing data sets often expose patterns and trends, thus helping us to learn more about the data and the underlying situation. If data can provide insight into a situation, they can help us make the right decisions.

Creating Histograms
Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to create histograms using the Histogram tool. However, we suggest you read through the instructions to learn how Excel creates histograms so you can construct them in the future when you do have access to the Data Analysis ToolPak. To check whether the ToolPak is installed on your computer, go to the Data tab in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins," highlight the "Analysis ToolPak" in the list, and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak.

Central Values for Data
Graphs are very useful for gaining insight into data. However, sometimes we would like to summarize the data in a concise way with a single number.

The mean
Often, we'd like to summarize a set of data with a single number. We'd like that summary value to describe the data as well as possible. But how do we do this? Which single value best represents an entire set of data? That depends on the data we're investigating and the type of questions we'd like the data to answer.

The median
Let's look at the revenues of the top 100 companies in the US. The mean revenue of these companies is about $42 billion. How should we interpret this number? How well does this average represent the revenues of these companies?

The mode
A third statistic to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We might use the mode to represent data when knowing the average value isn't as important as knowing the most common value.

Summary
To summarize a data set using a single value, we can choose one of three values: the mean, the median, or the mode. They are often called summary statistics or descriptive statistics. All three give a sense of the "center" or "central tendency" of the data set, but we need to understand how they differ before using them.

Finding the mean in Excel
To find the mean of a data set entered in Excel, we use the AVERAGE function. Excel can find the median, even if a data set is unordered, using the MEDIAN function. Excel can also find the most common value of a data set, the mode, using the MODE function.

Variability
The mean, median, and mode give you a sense of the center of the data, but none of these indicate how far the data are spread around the center. "Two sets of data could have the same mean and median, and yet be distributed completely differently around the center value," Alice tells you. "We need a way to measure variation in the data."

The Standard Deviation
It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or are the values widely dispersed?

Calculating
A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a staffing plan for Saturdays, typically a heavy-traffic day. In the hospitality industry, like many service industries, proper staffing can make the difference between unhappy guests and satisfied customers who want to return.

Interpreting
What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been 50 requests.

Summary
The standard deviation measures how much data vary about their mean value.
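The same summary statistics can be computed outside Excel; here is a minimal sketch using Python's statistics module, with invented numbers standing in for the Saturday request counts mentioned above.

```python
import statistics

# Invented Saturday front-desk request counts (illustration only).
requests = [62, 85, 90, 97, 101, 110, 118, 130, 145, 162]

print("mean:  ", statistics.mean(requests))    # Excel: AVERAGE
print("median:", statistics.median(requests))  # Excel: MEDIAN
print("stdev: ", statistics.stdev(requests))   # Excel: STDEV (sample standard deviation)

# The mode is the most frequently occurring value, so it is most useful
# for data with repeated values (Excel: MODE).
print("mode:  ", statistics.mode([4, 5, 5, 6, 5, 7, 4]))
```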

Finding in Excel
Excel's STDEV function calculates the standard deviation.

The Coefficient of Variation
The standard deviation measures how much a data set varies from its mean. But the standard deviation only tells you so much. How can you compare the variability in different data sets?

Summary
The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use it to compare variation in different data sets of different scales or units (see the sketch after the exercises below).

Applying Data Analysis
After a good night's sleep, you meet Alice for breakfast. "It's time to get started on Leo's assignments. Could you get those price quotes from diving schools and prepare a presentation for Leo? We'll want to present our findings as neatly and concisely as possible. Use graphs and summary statistics wherever appropriate. Meanwhile, I'll start working on Leo's hotel occupancy problem."

Pricing scuba school
In addition to the school Leo is currently using, you find 20 other scuba services in the phone book. You call those 20 and get price quotes on how much they would charge the Kahana per guest for a Scuba Certification Course.

Exercise 1
After a company completes its initial public offering, how is the ownership of common stock distributed between individuals in the firm, often termed "named insiders"?

Exercise 3
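Returning to the coefficient of variation summarized above, here is a minimal sketch of the calculation; the two data sets are invented and chosen only to show a comparison across different scales.

```python
import statistics

def coefficient_of_variation(data):
    """Standard deviation expressed as a fraction of the mean."""
    return statistics.stdev(data) / statistics.mean(data)

# Invented data sets on very different scales (illustration only).
daily_guests = [180, 210, 195, 240, 220]              # guests per day
daily_revenue = [41000, 47500, 44200, 52300, 49800]   # dollars per day

# The raw standard deviations are not comparable across units, but the
# coefficients of variation are, because each is scaled by its own mean.
print(round(coefficient_of_variation(daily_guests), 3))
print(round(coefficient_of_variation(daily_revenue), 3))
```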

Two Variables
We use histograms to help us answer questions about one variable. How do we start to investigate patterns and trends with two variables? Sometimes, we are not as interested in the relationship between two variables as we are in the behavior of a single variable over time. In such cases, we can consider time as our second variable. Suppose we are planning the purchase of a large amount of high-speed computer memory from an electronics distributor. Experience tells us these components have high price volatility. Should we make the purchase now? Or wait? Assuming we have price data collected over time, we can plot a scatter diagram for memory price,

in the same way we plotted height and weight. Because time is one of the variables, we call this graph a time series. Let's look at two data sets: heights and weights of athletes. What can we say about the two data sets? Is there a relationship between the two? Our intuition tells us that height and weight should be related. How can we use the data to inform that intuition? How can we let the data tell their story about the strength and nature of that relationship? As always, one of our first steps is to try to visualize the data. Because we know that each height and weight belong to a specific athlete, we first pair the two variables, with one height-weight pair for each athlete. Plotting these data pairs on axes of height and weight — one data point for each athlete in our data set — we can see a relationship between height and weight. This type of graph is called a "scatter diagram." Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful in recognizing patterns in a relationship. The more data points we have, the more apparent the relationship becomes. In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier. We need to be careful not to draw conclusions about causality when we see these types of relationships. Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about our weights. Assuming causality in the other direction would be just plain wrong. Although we may wish otherwise, growing heavier certainly doesn't make us taller! The direction and extent of causality might be easy to understand with the height and weight example, but in business situations, these issues can be quite subtle. Managers who use data to make decisions without firm understanding of the underlying situation often make blunders that in hindsight can appear as ludicrous as assuming that gaining weight can make us taller. Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a scatter diagram, we plot for each day the number of massages purchased at a spa resort versus the total number of guests visiting the resort. We can see a relationship between the number of guests and the number of massages. The more guests that stay at the resort, the more massages purchased — to a point, where massages level off. Why does the number of massages reach a plateau? We should investigate further. Perhaps there are limited numbers of massage rooms at the spa. Scatter plots can give us insights that prompt us to ask good questions, those that deepen our understanding of the underlying context from which the data are drawn.
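As a sketch of what such a plot looks like in code, here is a Python/matplotlib version of the height-weight scatter diagram; the athlete data are invented for illustration.

```python
import matplotlib.pyplot as plt

# Invented height-weight pairs, one per athlete (illustration only).
heights_cm = [170, 175, 178, 180, 183, 185, 188, 191, 193, 198]
weights_kg = [68, 72, 74, 77, 80, 83, 86, 90, 92, 99]

# Each point pairs one athlete's height with that same athlete's weight.
plt.scatter(heights_cm, weights_kg)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter diagram: height vs. weight")
plt.show()
```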

Time series are extremely useful because they put data points in temporal order and show how data change over time. Have prices been steadily declining or rising? Or have prices been erratic

over time? Are there seasonal patterns, with prices in some months consistently higher than in others? Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on visual analysis when looking for relationships and patterns. Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But human intuition isn't foolproof, and often we infer relationships where there are none. We must be careful to avoid some of these common pitfalls.

Let's look at an example. For US presidents of the last 150 years, there seems to be a connection between being elected in a year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in office. Abraham Lincoln (elected in 1860) was the first victim of this unfortunate relationship. James Garfield (elected 1880) was assassinated during his first year in office, and William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and John F. Kennedy (1960) also died in office. Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data suggest about the president elected in 2020? Probably nothing. Unless we have a reasonable theory about the connection between the two variables, the relationship is no more than an interesting coincidence.

Hidden variables
Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the relationship. We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they are instead mutually related to another underlying factor. Suppose we're examining sales of ice-hockey pucks and baseballs at a sporting goods store. The sales of the two products form a relationship on a scatter plot: when puck sales slump, baseball sales jump. But are the two data sets actually related? If so, why? A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey. In spring and summer, people play baseball. If we had simply plotted puck and baseball sales without thinking further, we might not have considered the time of year at all. We could have neglected a critical variable driving the sales of both products. In many business contexts, hidden variables can complicate the investigation of a relationship between almost any two variables.

A final point: Keep in mind that scatter plots don't prove anything about causality. They never prove that one variable causes the other, but simply illustrate how the data behave.

Summary
Plotting two variables helps us see relationships between two data sets. But even when relationships exist, we still need to be skeptical: is the relationship plausible? An apparent relationship between two variables may simply be coincidental, or may stem from a relationship each variable has with a third, often hidden variable.
1. Plotting two variables on a scatter diagram can help illustrate the relationship between them.
2. When one variable is time, the relationship is known as a time series.
3. A relationship is not proof of causality.
4. Be alert to the possibility of hidden variables.

To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then use Excel's built-in chart tools to plot the data. To prepare our data, we need to be sure that each data point in the first set is aligned with its corresponding value in the other set. The sets don't need to be contiguous, but it's easier if the data are aligned side by side in two columns. If the data sets are next to each other, simply select both sets. Next, from the Insert tab in the toolbar, select Scatter in the Charts bin from the Ribbon, and choose the first type: Scatter with Only Markers. Excel will insert a basic scatter plot into the worksheet, with the first column of data represented on the X-axis and the second column of data on the Y-axis. We can include a chart title and label the axes by selecting Quick Layout from the Ribbon and choosing Layout 1. Then we can add the chart title and label the axes by selecting and editing the text. Finally, our scatter diagram is complete. You can explore more of Excel's Chart Tools to edit and design elements of your chart.

Correlation
By plotting two variables on a scatter plot, we can examine their relationship. But can we measure the strength of that relationship? Can we describe the relationship in a standardized way? Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship between two variables looks strong ... ... or weak ... ... linear ...

... or nonlinear ... positive (when one variable increases, the other tends to increase) ... ... or negative (when one variable increases, the other tends to decrease). Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively, we notice when data points are close to an imaginary line running through a scatter plot. Logically, the closer the data points are to that line, the more confidently we can say there is a linear relationship between the two variables. However, it is useful to have a simple measure to quantify and communicate to others what we so readily perceive visually. The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship between two variables. To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1. Here's a strong positive correlation (about 0.85) ... and here's a strong negative correlation (about -0.90). If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1. At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but what happens in the middle? Even when the correlation coefficient is 0, a relationship might exist — just not a linear relationship. As we've seen, scatter plots can reveal patterns and help us better understand the business context the data describe. To reinforce our understanding of how our intuition about the strength of a linear relationship between variables translates into a correlation coefficient, let's revisit the examples we analyzed visually earlier. In some cases, the correlation coefficient may not tell the whole story. Managers want to understand the attendance patterns of their employees. For example, do workers' absence rates vary by time of year? Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature rises. After pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466. While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship — suggesting that the weather might indeed be the culprit. But look at the data — besides a few outliers, there isn't a clear relationship. Seeing the scatter

plot, the manager might realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some workers homebound the previous year. Without looking at the data, the correlation coefficient can lead us down false paths. If we exclude the outliers, the relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather. Why do the outliers influence our measure of linearity so much? As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value of every data point. Just as it does with the mean, this inclusiveness can get us into trouble... Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.

Summary
The correlation coefficient characterizes the strength and direction of a linear relationship between two data sets. The value of the correlation coefficient ranges between -1 and +1.
1. A correlation coefficient near +1 or -1 indicates that the two variables have a strong positive or negative linear relationship, respectively.
2. A correlation coefficient near zero indicates a weak or nonexistent linear relationship.
3. A coefficient near zero does not prove there is no relationship between the two variables; it indicates only that any relationship that does exist is not linear.
4. Outliers can unduly influence the calculation of the correlation coefficient, making the correlation much higher or lower than what it would be without the outlying points.

Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to our data on athletes' height and weight. Enter the data set into the spreadsheet as two paired columns. We must make sure that each data point in the first set is aligned with its corresponding value in the other set. To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the CORREL function as shown below. The order in which the two data sets are selected does not matter, as long as the data "pairs" are maintained. With height and weight, both values certainly need to refer to the same person!

Occupancy and Arrivals
Alice is eager to move forward: "With your new understanding of scatter diagrams and correlation, you'll be able to help me with Leo's hotel occupancy problem."
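Before turning to the occupancy data, here is a minimal Python sketch of the outlier effect described above; statistics.correlation (Python 3.10+) plays the role of Excel's CORREL. The temperature and absence figures are invented for illustration and are not the manager's actual data.

```python
import statistics

# Invented daily figures: temperature (deg F) paired with worker absences.
# The last three pairs stand in for the transportation-strike outliers.
temps    = [61, 64, 66, 68, 70, 72, 75, 77, 88, 90, 91]
absences = [ 3,  2,  4,  3,  2,  4,  3,  2, 11, 12, 13]

# statistics.correlation computes the same Pearson correlation
# coefficient as Excel's CORREL function.
with_outliers = statistics.correlation(temps, absences)
without_outliers = statistics.correlation(temps[:-3], absences[:-3])

print(round(with_outliers, 2))     # strongly positive, driven by three points
print(round(without_outliers, 2))  # close to zero once the outliers are dropped
```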

In the hotel industry, one of the most important management performance measures is room occupancy rate, the percentage of available rooms occupied by guests. Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving on the island each month. On a geographically isolated location like Hawaii, visitors almost all arrive by airplane or cruise ship, so state agencies can gather very precise data on arrivals. Alice asks you to investigate the relationship between room occupancy rates and the influx of visitors, as measured by the average number of visitors arriving to Kauai per day in a given month. She wants a graphical overview of this relationship, and a measure of its strength. Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy rates in Kauai, as tracked by the Hawaii Department of Business, Economic Development, and Tourism. The best way to graphically represent the relationship between arrivals and occupancy is: A histogram A scatter diagram A time series A series of concentric burning wheels

You generate the scatter diagram using the data file and Excel's Chart Wizard. The relationship can be characterized as: Weakly negative and linear Strongly negative and non-linear Strongly positive and linear Strongly positive and non-linear This is the best answer. The relationship is positive. Higher levels of occupancy generally correspond to higher numbers of arrivals. The trend appears to be reasonably linear.

You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. To find the correlation coefficient, open the Kahana Data file. In any empty cell, type =CORREL(B2:B37,C2:C37). When you hit enter, the correct answer, 0.71, will appear.

Together with Alice, you compile your findings and present them to Leo.

I see. The relationship between the number of people arriving on Kauai and the island's hotel occupancy rate follows a general trend, but not a precise pattern. Look at this: in two months with nearly the same average number of daily arrivals, the occupancy rates were very different — 68% in one month and 82% in the other. But why should they be so different? When people arrive on the island, they have to sleep somewhere.

Do more campers come to Kauai in one month, and more hotel patrons in the other? Well, that might be one explanation. There could be differences in the type of tourists arriving. The vacation preferences of the arrivals would be what we call a hidden variable. Another hidden variable might be the average length of stay. If the length of stay varies month to month, then so will hotel occupancy. When 50 arrivals check into a hotel, the occupancy rate will be higher if they spend 10 days each at the hotel than if they spend only 3 days.

I'm following you, but I'm beginning to see that the occupancy issue is more complex than I expected. Let's get back to it at a later time. The scuba school contract is more pressing at the moment.

Exercise 1
As online retailing expands, many companies are interested in knowing how effective search engines are in helping consumers find goods online. Computer scientists study the effectiveness of such search engines and compare how many results search engines recall and the precision with which they recall them. "Precision" is another way of saying that the search found its target, for example a page containing both the phrases "winter parka" and "Eddie Bauer." What could you say about the relationship between the Precision and the number of Results Recalled? The amount of information a search engine recalls decreases over time. An increase in precision causes the amount retrieved to decrease. Recall and precision seem to be related: a large number of results typically pairs with low precision.

This is the best answer. From the scatter plot, we can see that the variables demonstrate a relationship, but maybe not a linear one. However, even when we recognize a clear relationship, we cannot conclude that greater precision causes the amount of information recalled to decrease.

Exercise 2
Is an education a good investment in your future? Some very successful business executives are college dropouts, but is there a relationship in the general population between income and education level?

Consider the following scatter plot, which lists the income and years of formal education for 18 people. Is the correlation: Strongly positive Weakly positive Weakly negative This is the best answer. The level of income is strongly associated with the number of years of education for our data. Though we should always calculate the correlation coefficient if we want to have a precise measure, it's good to have a rough feel for the correlation between two variables we see plotted on a scatter diagram. For the income-education data, the coefficient is nearest to: 0.1 -0.5 0.9 This is the best answer. A fairly strong linear relationship has a correlation coefficient closer to 1.0, making 0.9 a reasonable guess for what we see occurring between income and education level.

Sampling & Estimation

The scuba problem
Leo asks you to help him evaluate the Kahana's contract with the scuba school. Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from their business activities. We have an excellent coral reef, and scuba diving is becoming very popular among vacationers and business travelers. We started our year-round diving program last year, contracting a local diving school to do a scuba certification course. The one-year trial contract is now up for renewal. Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a week, and we subsidize the costs associated with each course. So I want to get a good handle on how satisfied the guests are with the lessons before I decide whether or not to renew the contract. The hotel has a database with information about which guests took scuba lessons and when. Feel free to take a look at it, but I can't spend a fortune figuring this out. And I need to know as soon as possible, since our contract expires at the end of the month.

Alice convinces you to do some field research and join her for a scuba diving lesson. You return late that afternoon exhausted but exhilarated. Alice is especially enthusiastic.

"Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet! "But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's clientele as a whole enjoyed the scuba certification course. After all, we may have caught the instructor on his best day this year." Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school. Random Samples Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You have to survey a few guests, and from their opinions draw conclusions about hotel guests in general. The guests you choose to survey must be representative of all of the guests who have taken the scuba course at the resort. But how can you be sure you get a good sample? As managers, we often need to know something about a large group of people or products. For example, how many defective parts does a large plant produce each year? What are the average annual earnings of a Wall Street investment banker? How many people in our industry plan to attend the annual conference? When it is too costly to gather the information we want to know about every person or every thing in an entire group, we often ask the question of a subset, or sample of the group. We then try to use that information to draw conclusions about the whole group. To take a sample, we first select elements from the entire group, or "population," at random. We then analyze that sample and try to infer something about the total population we're interested in. For example, we could select a sample of people in our industry, ask them if they plan to attend the annual conference, and then infer from their answers how many people in the entire industry plan to attend. For example, if 10% of the people in our sample say they will attend, we might feel quite confident saying that between 7% and 13% of our entire population will attend. This is the general structure of all the problems we'll address in this unit — we'll work out the details as we go forward. We want to know something about a population large enough to make examining every population member impractical. We first select elements from the population at random... ...then analyze that sample... ...and then draw an inference about the total population we're interested in.

Taking a Random Sample The first trick to sampling is to make sure we select a sample that broadly represents the entire group we're interested in. For example, we couldn't just ask the conference organizers if they wanted to attend. They would not be representative of the whole group — they would be biased in favor of attending the conference! To get a good sample, we must make sure we select the sample "at random" from the full population. This means that every person or thing in the population is equally likely to be selected. If there are 15,000 people in the industry, and we are choosing a sample of 1,000, then every person needs to have the same chance — 1 out of 15 — of being selected. Selecting a random sample sounds easy, but actually doing it can be quite challenging. In this section, we'll see examples of some major mistakes people have made while trying to select a random sample, and provide some advice about how to avoid the most common types of sampling errors. In some cases, selecting a random sample can be fairly easy. If we have a complete list of each member of the group in a database, we can just assign a unique number to each member of the group. We then let a computer draw random numbers from the list. This would ensure that each element of the population has an equal likelihood of being selected. If the population about which we need to obtain information is not listed in an easy-to-access database, the task of selecting a sample at random becomes more difficult. In these cases, we have to be extremely careful not to introduce a bias in the way we select the sample. For example, if we want to know something about the opinions of an entire company, we cannot just pick employees from one department. We have to make sure that each employee has an equal chance of being included in the sample. A department as a whole might be biased in favor of one opinion. Once we have decided how to select a sample, we have to ask how large our sample needs to be. How many members of the group do we need to study to get a good estimate about what we want to know about the entire population? The answer is: It depends on how "accurate" we want our estimate to be. We might expect that the larger the population, the larger the sample size needed to achieve a given level of accuracy, but this is not true. A sample size of 1,000 randomly-selected individuals can often give a satisfactory estimation about the underlying population, as long as the sample is representative of the whole population. This is true regardless of whether the population consists of thousands of employees or millions of factory parts.
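As a sketch of the "let a computer draw random numbers" step described above, assuming the population is already indexed in a database, here is a minimal example using the industry figures from the text (15,000 people, sample of 1,000):

```python
import random

# The industry example above: a population of 15,000 people, from which we
# want a random sample of 1,000, each person equally likely to be chosen.
population_ids = range(15_000)   # one unique number per person, as in a database

random.seed(42)  # fixed seed only so this sketch is reproducible
sample_ids = random.sample(population_ids, k=1_000)

# Every ID had the same 1,000/15,000 = 1-in-15 chance of being selected.
print(len(sample_ids), sample_ids[:5])
```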

Sometimes, a sample size of 100 or even 50 might be enough when we are not that concerned about the accuracy of our estimate. Other times, we might need to sample thousands to obtain the accuracy we require. Later in this unit, we will find out how to calculate a good sample size. For now, it's important to understand that the sample size depends on the level of accuracy we require, not on the size of the population. Learning about a sample Once we select our sample, we need to make sure we obtain accurate information about each member of the sample. For example, if we want to learn about the number of defects a plant produces, we must carefully measure each item in the sample. When we want to learn something about a group of people and don't have any existing data, we often use a survey to learn about an issue of interest. Conducting a survey raises problems that can be surprisingly tricky to resolve. First, how do we phrase our questions? Is there a bias in any questions that might lead participants to answer them in a certain way? Are any questions worded ambiguously? If some of the people in the sample interpret a question one way, and others interpret it differently, our results will be meaningless! Second, how do we best conduct the survey? Should we send the survey in the mail, or conduct it over the phone? Should we interview survey participants in person, or distribute handouts at a meeting? There are advantages and disadvantages to all methods. A survey sent through the mail may be relatively inexpensive, but might have a very low response rate. This is a major problem if those who respond have a different opinion than those who don't respond. After all, the sample is meant to learn about the entire population, not just those with strong opinions! Creating a telephone survey creates other issues: When do we call people? Who is home during regular business hours? Most likely not working professionals. On the other hand, if we call household numbers in the evening the "happy hour crowd" might not be available. When we decide to conduct a survey in person, we have to consider whether the presence of the person asking the questions might influence the survey results. Are the survey participants likely to conceal certain information out of embarrassment? Are they likely to exaggerate? Clearly, every survey will have different issues that we need to confront before going into the field to collect the data. With any type of survey, we must pay close attention to the response rate. We have to be sure that those who respond to the survey answer questions in much the same way as those who don't

respond would answer them. Otherwise, we will have a biased view of what the whole population thinks. Surveys with low response rates are particularly susceptible to bias. If we get a low response rate, we must try to follow up with the people who did not respond the first time. We either need to increase the response rate by getting answers from those who originally did not respond, or we must demonstrate that the non-respondents' opinions do not differ from those of the respondents on the issue of interest.

Low response rate: contact non-respondents, then either raise the response rate or show that non-respondents do not differ.

Tracking down everyone in a sample and getting their response can be costly and time consuming. When our resources are limited, it is often better to take a small sample and relentlessly pursue a high response rate than to take a larger sample and settle for a low response rate.

Summary
Often it makes sense to infer facts about a large population from a smaller sample. To make sound inferences:
1. Make sure the sample is representative of the population: pick elements at random; every member of the population must be equally likely to be selected.
2. Choose an appropriate sample size: for large populations, sample size does not depend on population size; it depends on the desired accuracy.
3. Avoid biased results: phrase questions neutrally to avoid bias, pursue high response rates (better a smaller sample with a high response rate), and understand the incentives and motivations of respondents and pollsters.

To understand the importance of representative samples, let's go back in history and look at some mistakes made in the Literary Digest poll of 1936. The Literary Digest, a popular magazine in the 1930s, had correctly predicted the outcome of U.S. presidential elections from 1916 to 1932. When the results of the 1936 poll were announced, the public paid attention. Who would become the next president?

Newscaster: "Once again, the Literary Digest sent out a survey to the American public, asking, "Whom will you vote for in this year's presidential election?" This may well be the largest poll in American history." Newscaster: "The Digest sent the survey to over 10 million Americans and over two million responded!" Newscaster: "And the survey results predict: Alf Landon will beat Franklin D. Roosevelt by a large margin and become President of the United States." As it turned out, Alf Landon did not become President of the United States. Instead, Franklin D. Roosevelt was re-elected to a third term in office in the largest landslide victory recorded to that date. This was a devastating blow to the Digest's reputation. What went wrong? How could such a large survey be so far off the mark? The Literary Digest made two mistakes that led it to predict the wrong election outcome. First, it mailed the survey to people on three different lists: the magazine's subscribers, car owners, and people listed in telephone directories. What was wrong with choosing a sample from these lists? The sample was not representative of the American public. Most lower-income people did not subscribe to the Digest and did not own phones or cars back in 1936. This led the poll to be biased towards higher-income households and greatly distorted the poll's results. Lower-income households were more likely to vote for the Democrat, Roosevelt, but they were not included in the poll. Second, the magazine relied on people to voluntarily send their responses back to the magazine. Out of the ten million voters who were sent a poll, over two million responded. Two million is a huge number of people. What was wrong with this survey? Mistakes: Unrepresentative sample Low response rate Biased respondents Biased questions The mistake was simple: Republicans, who wanted political change, felt more strongly about the election than Democrats. Democrats, who were generally happy with Roosevelt's policies, were less interested in returning the survey. Among those who received the survey, a disproportionate number of Republicans responded, and the results became even more biased. The Digest had put an unprecedented effort into the poll and had staked its reputation on predicting the outcome of the election. Its reputation wounded, the Digest went out of business soon thereafter. During the same election year, a little known psychologist named George Gallup correctly predicted what the Digest missed: Roosevelt's victory. What did Gallup do that the Literary Digest

did not? Did he create an even bigger sample? Surprisingly, George Gallup used a much smaller sample. He knew that large samples were no guarantee of accurate results if they weren't randomly selected from the population. Gallup's team interviewed only 3,000 people, but made sure that the people they selected were truly representative of the US population. He also instructed his team to be persistent in asking the opinion of each person in the sample, which generated a high response rate. Gallup's correct prediction of the 1936 election winner boosted his reputation, and Gallup's method of polling soon became a standard for public opinion polls. Today's polls usually consist of a sample of around a thousand randomly selected people who are truly representative of the underlying populations. For example, look at a poll reported in a leading newspaper: the sample size will likely be around a thousand.

Another common survey mistake is phrasing the questions in a way that leads to a biased response. Let's take a look at an example of a biased question. In 1992, Ross Perot, an independent contender for the US presidential election, conducted a mail-in survey to show that the public supported his desire to abolish special interest groups. This is the question he asked: “Should laws be passed to eliminate all possibilities of special interests giving huge sums of money to candidates?” In Perot's mail-in survey, 99 percent of respondents said "yes" to that question. It seemed as if everyone in America agreed with Perot's stance.

Soon after Perot's survey, Yankelovich Partners, an independent market research firm, conducted two interesting follow-up surveys. In the first survey, it used the same question that Perot asked and found that 80 percent of the population favored passing the law. Yankelovich attributed the difference to the fact that it was able to create a more representative sample than Perot. Interestingly, Yankelovich then conducted a similar survey, but rephrased the question in the following way: “Should laws be passed to prohibit interest groups from contributing to campaigns, or do groups have a right to contribute to the candidates they support?” The response to this question was strikingly different. Only 40 percent of the sampled population agreed to prohibit contributions. As it turned out, the results of the survey all came down to the way the question was phrased. For any survey we conduct, it's critical to phrase the question in the most neutral way possible to avoid bias in the sample results.

The real lesson of these two examples is this: How data are collected is at least as important as

how data are analyzed. A sample that is unrepresentative, biased, or not drawn at random can give highly misleading results. How sample data are collected is at least as important as how they are analyzed. Knowing that sample data need to be representative and unbiased, you conduct a survey of the hotel guests. How can you best determine if hotel guests are enjoying the scuba course? By searching the hotel database, you determine that 2,804 hotel guests took scuba trips in the past year. The scuba certification course was offered year-round. The database includes each guest's name, address, phone number, age, date of arrival, length of stay, and room number. Your first step is deciding what type of survey to conduct that will be inexpensive, quick, and will provide a good sample of all the guests who took scuba lessons. Should you mail a survey to the whole list of guests who took scuba lessons, expecting that a small percentage will respond, or conduct a telephone survey, which would likely provide a higher response rate, but cost more per guest contacted? To ensure a good response rate — and because Leo wants an answer quickly — you choose to contact customers by phone. Alice warns that to keep costs low, you can only contact 50 hotel guests, and reminds you to create a random, representative sample. You open up the list of names in the hotel database. The names were entered as guests arrived. To make things simple, you randomly select a date and then record the first 50 guests arriving after that date who took the course. You ask the hotel operator to call them for you, and tell him to be persistent. Eventually he is able to contact 45 of the guests on the list. He asks the guests to rate their scuba experience on a 1 to 6 scale and reports the results back to you. Click the link below to view your sample. Enter the average satisfaction level as a decimal number with one digit to the right of the decimal point (e.g., enter "5" as "5.0"). Round if necessary.

You compute the average satisfaction level and find that it is 2.5. You give Leo the news. He explodes. Two point five! That's impossible! I know for sure that it must be higher than that! You'd better go over your data again. Back in your room, you look over your list of data. What should you tell Leo?

You should have mailed out your survey. Your survey is not representative of the guests who took the scuba course. Your survey is unbiased and representative, and Leo should accept the survey results as true.

Your observation is correct. Although mailing out the survey might have changed your result, that was not the main problem with your survey. What factor is biasing your results?

By bothering people at home, you got negative responses. The income levels of the customers you phoned were not representative of the scuba-diving guests. The dates that the surveyed customers visited the resort were not representative of the scuba-diving guests. Correct! Since you chose guests only from the month of April, any unusual event that happened in that period could bias your results. In addition, your sample would be biased if more of a certain type of guest (for example business travelers versus tourists) visited during April than during the rest of the year.

When you report this news to Leo, he begins to laugh. We were hit with a hurricane at the beginning of April. Half the scuba classes were cancelled, and the ones that did meet had to deal with choppy water and bad visibility. Even the weeks following the hurricane were bad. Usually guests see a manta ray every week, and the guests in April could barely see the underwater coral. No wonder they weren't happy.

You assure Leo you will conduct the survey again with a more representative sample. This time, you make sure that the guests are truly randomly selected. Later, you have new data in your hands from 45 randomly chosen guests that show the average satisfaction rate to be 4.4 on a 1 to 6 scale. The standard deviation of the sample is 1.54.

Mr. Gavin Collins is the Chief Operating Officer of Bell Computers, a market leader in personal computers. This morning, he opened the latest issue of Business 4.0, a business journal, and noticed an article on Bell Computers. The article praised the high quality and low cost of the PCs made by Bell. However, it also included some negative comments about Bell's customer service. Currently, customer service is only available to customers of Bell Computers over the phone. Collins wants to understand more fully what customers think of Bell's customer service. His marketing department designs a survey that asks customers to rate Bell's customer service from 1 to 10. How should he conduct the survey? Bell Computers should mail a survey to every customer in Bell's database asking them to write

Bell about their experiences with the customer service department. Bell's sales peak during the holidays, when people give gifts, including computers. Bell should send a mail survey along with each of its outbound computer shipments in December. Bell is located in the Southern United States. 55% of Bell's customers are also located in the South. Bell should conduct a phone survey in one of the major Southern cities. Every month, on a random day and time, Bell should conduct a phone survey immediately after a Customer Service Representative has spoken to a customer. New answers should be added to a rolling average.

This is the best answer. Conducting a phone survey immediately after a randomly chosen customer service session will create a random sample that is representative of all of Bell's customers.

"Wave" is a company that manufactures laundry detergent in several countries around the world. In India, the competition among laundry detergents is fierce. The sales per month of Wave have been constant for the past five years. Wave CEO Mr. Sharma instructed his marketing team to come up with a strong advertising campaign stressing Wave's superiority over other competitors. Wave conducted a survey in the month of June. They asked the following questions: "Have you heard of Wave?" "Do you think Wave is a good product?" "Do you notice a difference in the color of your clothes after using Wave?" Then, citing the results of their survey, Wave aired a major television campaign claiming that 75% of the population thought that Wave was a good product. You are a new associate at Madison Consulting. With your partner, Ms. Mehta, you have been asked to conduct a study for Wave's main competitor, the Coral Reef Detergent Company, about whether Wave's claims hold water. Coral Reef wonders how the Wave results are possible, considering that Coral Reef holds over 45% of the current market share. Ms. Mehta has been going through the survey methodology, and she tells you, "This sample is obviously neither representative nor unbiased. Coral Reef can dispute Wave's claim!" What has Ms. Mehta noticed?

You have been asked to conduct a survey to determine the percentage of flights arriving at a small airport that were filled to capacity that morning. You decide to stand outside the airport's single exit door and ask a sample of 60 passengers leaving the airport how full their flight was. Your first thought is to just ask the first 60 passengers departing the airport how full their flight was, but you quickly realize that that could be a highly biased sample. Any 60 people leaving at the same time would likely have come from only a couple of flights, and you want to get a good sense of what percent of all flights arriving that morning were filled to capacity. Thus, you decide to randomly select 60 people from all the passengers departing the building that morning.

After conducting your survey, you tally the results: 10 people decline to answer, 30 people tell you that their flight was filled to capacity, and 20 people tell you that their flight was not filled to capacity. What can you conclude from your survey results so far? The best estimate is that 60% of the flights were filled to capacity. The best estimate is that 50% of the flights were filled to capacity. There is a problem with the survey approach. This is the correct answer. There is a problem with your survey.

What is the problem with your survey? A sample of 60 passengers is not large enough to provide a good estimate. Only those passengers that feel most strongly about the issue are likely to respond. Passengers from full planes are likely to be selected more frequently than passengers from relatively empty planes. This is the correct answer. There is a systematic bias in your sample: When you sample passengers at the exit door of an airport, you will, on average, select more people from full planes, simply because when a plane is full, there are more passengers on it — and hence more leaving the airport than when a plane is relatively empty. To see this, imagine that 10 planes have arrived that morning — five of which were full (having 100 passengers each) and five of which had only a single passenger on the plane. In this case, half of the planes were full. However, almost all of the passengers (500 of the total 505) departing from the airport would report (correctly!) that they had been on a full plane. Since people from a full plane are more likely to be selected, there is a systematic bias in your responses.

It is important, in every survey, to try to make your sample as representative as possible. In this case, your sample was not representative of the planes arriving at the airport. A better approach might be to ask the people you select what their flight number was, and then ask them how full their flight was. Make sure you have at least one passenger from every plane. Then count the responses of only one person from each flight. By including only one person per flight in your sample, you ensure that your sample is an accurate prediction of how many planes are filled to capacity. Sampling is complicated, and it is important to think through all the factors that might influence your results. In this case, the mistake is that you are trying to estimate a population of planes by sampling a population of passengers. This makes the sample unrepresentative of the underlying population. By randomly sampling the passengers rather than the flights, each flight is not equally likely to be selected, and the sample is biased.
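To make the arithmetic above concrete, here is a minimal simulation sketch of the same ten-flight scenario (five full flights of 100 passengers, five flights with a single passenger each):

```python
import random

# Ten flights: five full (100 passengers each) and five with a single passenger.
flights = ["full"] * 5 + ["nearly empty"] * 5
passengers = []
for flight in flights:
    count = 100 if flight == "full" else 1
    passengers.extend([flight] * count)   # 505 passengers in total

random.seed(3)  # fixed seed only so the sketch is reproducible

# Sampling passengers at the exit door: full flights dominate the sample.
passenger_sample = random.sample(passengers, k=60)
print(passenger_sample.count("full") / 60)   # close to 500/505, about 0.99

# Sampling one response per flight recovers the true proportion of full flights.
flight_sample = random.sample(flights, k=10)
print(flight_sample.count("full") / 10)      # exactly 0.5
```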

The Scuba Problem (Part II)

You report the results of your survey, the sample mean, and its standard deviation to Leo.

A sample mean of 4.4 makes more sense to me, but I'm still a bit uneasy about your survey result. After all, you've only collected 45 responses. If you'd chosen different people, they likely would have given different responses. What if — just by chance — these 45 people loved the scuba course, and no one else did?

You have a good point there, Leo. Our intuition is that the average satisfaction rate for all guests isn't too far from 4.4, but at this point we're not sure exactly how far away it might be. Without more calculations, all we can say is that 4.4 is the best estimate we have. That is why...

Wait a minute! This is very unsatisfying. Are you telling me that there's no way to gauge the accuracy of this survey result? If the results are a little off, that's not a problem. But you have to tell me how far off they might be. What if you're off by two whole points, and the true satisfaction of my hotel guests is 2.4, not 4.4? In that case, my decision would be completely different. I need to know how accurately this sample reflects the opinions of all the hotel guests who went scuba diving!

The sample mean is the best point estimate of the population mean, but it cannot tell you how accurately the sample reflects the population. Alice suggests giving Leo a range of values that is almost certain to contain the population mean. "We may not be able to pin down mean satisfaction precisely. But confining it to a range of likely values will provide Leo with enough information to make a sound business decision." That sounds like a good idea, but you wonder how to actually do it.

Using Confidence Intervals
The sample mean is the best estimate of our population mean. However, it is only a point estimate. It does not give us a sense of how accurately the sample mean estimates the population mean. Think about it. If we know only the sample mean, what can we really say about the population mean? In the case of our scuba school, what can we say about the average satisfaction rate of all scuba-diving hotel guests? Could it be 4.3? 4.0? 4.7? 2.0? To make decisions as a manager, we need to have more than just a good point estimate. We need to have a sense of how close or far away the true population mean might be from our estimate. We can indicate the most likely values of the true population mean by creating a range, or interval, around the sample mean. If we construct it correctly, this range will very likely contain the true population mean.

For example, by constructing a range, we might be able to tell Leo that we are very confident that the true average customer satisfaction for all scuba guests falls between 4.2 and 4.6. Knowing that the true average is almost certainly between 4.2 and 4.6, Leo is better equipped to make a decision than if he simply knew the estimated average of 4.4.

Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the mean x-bar, the standard deviation s, and the sample size n. We also need to know how "confident" we'd like to be that the range contains the true mean of the population. For any level of "confidence", there is a value we'll call z to put into the formula. We'll learn later in this unit exactly what we mean by "confidence," and how to compute z. For now, just keep in mind that for higher levels of confidence, we'll need to put in a larger value of z. Using these numbers, we can create a range around the sample mean according to the following formula:

x-bar ± z * s / sqrt(n)

Before we actually use the formula, let's try to develop our intuition about the range we're creating. Where should the range be centered? How wide must the range be to make us confident that it contains the true population mean? What factors would lead us to need a wider or narrower range? Let's see how the statistics of the sample influence the location and width of the range.

Let's start with the sample mean. The sample mean is our best estimate of the population mean. This suggests that the sample mean should always be the center of the range. Move the slider bar to see how the sample mean affects the range.

Second, the width of the range depends on the standard deviation of the sample. When the sample standard deviation is large, we have greater uncertainty about the accuracy of the sample mean as an estimate of the population mean. Thus, we have to create a wider range to be confident that it includes the true population mean. On the other hand, if the sample standard deviation is small, we feel more confident that our sample mean is an accurate predictor of the true population mean. In this case, we can draw a narrower range. The larger the standard deviation, the wider the range must be. Move the slider bar to see how the sample standard deviation affects the range.

Third, the width of the range depends on the sample size. With a very small sample, it's quite possible that one or two atypical points in the sample could throw the sample mean off considerably from the true population mean. So with a small sample, we need to create a wide

range to feel comfortable that the true mean is likely to be inside it. The larger the sample, the more certain we can be that the sample mean represents the population mean. With a large sample, even if our sample includes a few atypical points, there are likely to be many more typical points in the sample to compensate for the outliers. Thus, with a large sample, we can feel comfortable with a small range. Move the slider bar to see how the sample size influences the range. Finally, the width of the range depends on our desired level of confidence. The level of confidence states how certain we want to be that the range contains the mean of the population. The more confident we want to be that the range contains the true population mean, the wider we have to make the range. If our desired level of confidence is fairly low, we can draw a more narrow range. In the language of statistics, we indicate our level of confidence by saying, for example, that we are "95% confident" that the range contains the true population mean. This means there is a 95% chance that the range contains the true population mean. Move the slider bar to see how the confidence level affects the range.

P74 These variables determine the size of the range that we want to construct. We will learn exactly how to construct this range in a later section. For now, all we have to understand is that the population mean can best be estimated by a range of values and that the range depends on three sample statistics as well as the level of confidence that we want to assign to the range.

Summary

The sample mean is our best initial estimate of the population mean. To indicate how accurate this estimate is, we construct a range around the sample mean that likely contains the population mean. The width of the range is determined by the sample size, sample standard deviation, and the level of confidence. The confidence level measures how certain we are that the range we construct contains the true population mean.

1. Construct a range around the sample mean to estimate the population mean
2. Larger standard deviation => wider range
3. Larger sample => smaller range
4. Greater confidence => wider range

The normal distribution

Alice recommends taking a step back from sampling and learning about the normal distribution. The normal distribution helps us create a range around a sample mean that is likely to contain the true population mean. You can use the normal distribution to turn the intuitive notion of "confidence in your estimate" into a precisely defined concept. Understanding the normal distribution will also give you deeper insight into how sampling works.

The normal distribution is a probability distribution that is centered at the mean. It is shaped like a bell, and is sometimes called the "bell curve."

The z-statistic The unique shape of the normal curve allows us to translate any normal distribution into a standard normal curve, as we did with women's heights simply by re-labeling the x-axis. To do this more formally, we use something called the z-statistic.

The normal distribution has a unique symmetrical shape whose center and width are completely determined by its mean and its standard deviation. For every normal distribution, the probability of being within a specified number of standard deviations of the mean is the same. The distance from the mean, as measured in standard deviations, is known as the z-value. Using the properties of the normal distribution, we can calculate a probability associated with any range of values.

Unique, symmetrical bell shape
Center at mean; width determined by standard deviation
Probability within 2 standard deviations of mean = 95%
Probability within 1 standard deviation of mean = 68%
z-value = (x - mean) / sigma

P79 Using Excel's normal functions

To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel function NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal curve with mean zero and standard deviation one. The previous clip shows us how to use software programs like Excel to calculate z-values and cumulative probabilities for the normal curve. Another way to find z-values and cumulative probabilities is to use a z-table. Using z-tables is a bit more cumbersome than using Excel, but it helps reinforce the concepts.
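If Excel isn't handy, the same lookups can be done in other tools. Below is a minimal Python sketch assuming SciPy is installed; scipy.stats.norm.cdf plays the role of NORMSDIST, and norm.ppf plays the role of the inverse lookup (the NORMSINV function used later in this unit). The mean, standard deviation, and x value in the middle block are illustrative numbers only.

```python
from scipy.stats import norm

# Cumulative probability for a z-value on the standard normal curve (like NORMSDIST).
print(norm.cdf(2.0))            # ~0.977, so P(-2 < z < 2) = 0.977 - 0.023, roughly 95%

# For a non-standard normal curve, convert to a z-value first: z = (x - mean) / sigma.
mean, sigma, x = 50, 5, 57      # illustrative numbers
z = (x - mean) / sigma
print(norm.cdf(z))              # cumulative probability associated with x

# z-value associated with a cumulative probability (like NORMSINV).
print(norm.ppf(0.99))           # ~2.33, matching the 98% confidence example later in this unit
```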

Find the cumulative probability associated with the z-value 2. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

Find the cumulative probability associated with the z-value 2.36. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

Find the cumulative probability associated with the z-value -1. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

Find the cumulative probability associated with the z-value 1.645. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

Find the cumulative probability associated with the z-value -1.645. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 115. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 80. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

For a normal curve with mean 100 and standard deviation 10, find the probability of obtaining a value greater than 80 but less than 115. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

For a normal curve with mean 80 and standard deviation 5, find the probability of obtaining a value greater than 85 but less than 95.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 45. Enter your answer in decimal notation with 3 digits to the right of the decimal, (e.g., enter "5" as "5.000"). Round if necessary.

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 38 but less than 45. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

Find the z-value associated with the cumulative probability of 60%. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

Find the z-value associated with the cumulative probability of 40%. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

Find the z-value associated with the cumulative probability of 2.5%. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.

For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of 88%. Enter your answer as an integer (e.g., "5"). Round if necessary.

For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of 28%. Enter your answer as an integer (e.g., "5"). Round if necessary.

The Central Limit Theorem

How can the normal distribution help you sample Leo's hotel guests? How do the unique properties of the normal distribution help us when we use a random sample to infer something about the underlying population? After all, when we sample a population, we usually have no idea whether or not the population is normally distributed. We're typically sampling because we don't even know the mean of the population! If the normal distribution is such a great tool, when can we use it? It turns out that even if a population is not normally distributed, the properties of the normal distribution are very helpful to us in sampling. To see why, let's first learn about a well-established statistical fact known as the "Central Limit Theorem".

Definition

Roughly speaking, the Central Limit Theorem says that if we took many random samples from a population and plotted the means of each sample, then — assuming the samples we take are sufficiently large — the resulting plot of the sample means would look normally distributed. Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would be equal to the true mean of the population. To repeat: no matter what type of distribution the population has — uniform, skewed, bi-modal, or completely bizarre — if we took enough samples, and the samples were sufficiently large, then the means of those samples would form a normal distribution centered around the true mean of the population.

The Central Limit Theorem is one of the subtlest aspects of basic statistics. It may seem odd to be drawing a distribution of the means of many samples, but that is exactly what we are doing. We'll call this distribution the Distribution of Sample Means. (Statisticians also often call it the Sampling Distribution of the Mean.) Let's walk through this step-by-step. If we have a population — any population — we can take a random sample. This sample has a mean. We can plot that mean on a graph. Then we take another sample. That sample also has a mean, which we also plot on the graph. Now, if we plot a lot of sample means in this way, they will start to form a normal distribution around the population's mean. The more samples we take, the more the graph of the sample means would look like a normal

distribution. Eventually, the graph of the sample means — the Distribution of the Sample Means — would form a nearly perfect replica of a normal distribution.

Now, nobody would actually take a lot of samples, calculate all of the sample means, and then construct a normal distribution with them. We're taking a lot of samples here just to let you see that graphing the means of many samples would give you a normal curve. In the real world, we take a single sample and squeeze it for all the information it's worth. But what does the Central Limit Theorem allow us to say based on that single sample? The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More specifically, we know that the sample mean falls somewhere in a normal Distribution of Sample Means that is centered at the true population mean.

The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the underlying distribution of the population we want to learn about. Since we know the Distribution of Sample Means is normally distributed and centered at the true population mean, we can completely disregard the underlying distribution of the population. As we'll see shortly, because we know so much about the normal distribution, we can use the information about the Distribution of Sample Means to draw conclusions about the likelihood of different values of the actual population mean.

SUMMARY

The Central Limit Theorem states that for any population distribution, the means of samples from that population are distributed approximately normally. The more samples, and the larger the sample size, the closer the Distribution of Sample Means fits a normal curve. The mean of a single sample lies on this normal curve, so we can use the normal curve's special properties to extract more information from a single sample mean.

1. Sample means are distributed approximately normally, regardless of the distribution of the underlying population.
2. More samples, larger sample size => better approximation to the normal distribution
3. Mean of the Distribution of Sample Means = mean of the population distribution
4. Use properties of the normal distribution to extract information from a sample

Let's see how the Central Limit Theorem works using a graphical illustration with three population distributions, marked "Uniform," "Bimodal," and "Skewed." A uniform distribution is one for which all values in a specified range are equally likely to occur. A bimodal distribution has two separate areas where values are more likely to occur than elsewhere. A skewed distribution is not symmetrical — values are more likely to fall above the mean than below. The simulation sketch below mirrors this illustration in code.
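Here is a small Python simulation of that process, a sketch assuming NumPy is available. The skewed population and the sample sizes are arbitrary choices made for illustration; the point is that the collected sample means cluster symmetrically around the population mean even though the population itself is not symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed population: exponential values pile up near zero with a long right tail.
population = rng.exponential(scale=10.0, size=100_000)

sample_size = 50        # a "sufficiently large" sample
num_samples = 2_000     # number of samples whose means we record

# Take many samples and record each sample's mean.
sample_means = [rng.choice(population, size=sample_size).mean()
                for _ in range(num_samples)]

print("population mean:      ", population.mean())
print("mean of sample means: ", np.mean(sample_means))   # close to the population mean
print("std of sample means:  ", np.std(sample_means))    # roughly population std / sqrt(n)
# A histogram of sample_means (e.g., with matplotlib) looks approximately bell-shaped,
# even though the population itself is heavily skewed.
```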

Uniform

The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.

Bimodal

The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.

Skewed

The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.

The Central Limit Theorem states that the means of sufficiently large samples are always normally distributed, a key insight that will allow you to estimate the population mean from a sample.

Confidence intervals

Using the properties of the normal distribution and the Central Limit Theorem, you can construct a range of values that is almost certain to contain the population mean. For a normal distribution, we know that if we select a value at random, it will be within two standard deviations of the distribution's mean 95% of the time.

The Central Limit Theorem offers us two additional insights. First, we know that the means of sufficiently large samples are normally distributed, regardless of the distribution of the underlying population. Second, we know that the mean of the Distribution of Sample Means is equal to the true population mean. Combining these facts can give us a measure of how accurately the mean of a sample estimates the population mean. Specifically, we can now conclude that if we take a sufficiently large sample — let's say at least 30 points — from a population, there is a 95% chance that the mean of that sample falls within two standard deviations of the true population mean.

Let's build this up step by step to make sure we understand the logic. First, we take a sample from a population and compute its mean. We know that the mean of that sample is a point on a normal distribution — the Distribution of Sample Means. Since the mean of our sample is a value randomly obtained from a normal distribution, there is a 95% chance that the sample mean is within two standard deviations of the mean of the distribution. The Central Limit Theorem tells us that the mean of that distribution is the same as the true population mean. Thus, we can conclude that there is a 95% chance that the sample mean is within two standard deviations of the population mean.

We have argued that 95% of our samples will have a mean within the range shown around the true population mean. Next we'll turn this around and look at intervals around sample means, because that's exactly what a confidence interval is. Let's look at intervals around the means of two different types of samples: those whose sample means fall within the 2 standard deviation range around the population mean (which should be the case for 95% of all samples) and those whose sample means fall outside the 2 standard deviation range around the population mean (which should be the case for 5% of all samples).

First, let's look at a sample whose mean falls outside the 2 standard deviation range shown around the population mean. Since this sample mean is outside the range, it must be more than 2 standard deviations away from the population mean. Since the population mean is more than 2 standard deviations away from this sample mean, an interval extending 2 standard deviations on either side of this sample mean could not contain the true population mean.

We know that 5% of all samples should have sample means outside the 2 standard deviation range around the population mean. Therefore 5% of all samples we obtain will have intervals that do not contain the population mean. Now let's think about the remaining 95% of samples whose means do fall within the 2 standard deviation range around the population mean. If we draw an interval extending 2 standard deviations on either side of any one of these sample means, the interval would contain the true population mean. Thus, 95% of all samples we obtain will have intervals that contain the population mean.

We've just shown how to go from any sample mean — a point estimate — to a range around the sample mean — a 95% confidence interval. We've also argued that 95% of confidence intervals obtained in this way should contain the true population mean. It's important to emphasize: We are not saying that 95% of the time our sample mean is the population mean, but we are saying that 95% of the time a range extending two standard deviations on either side of the sample mean contains the population mean.

To visualize the general concept of a confidence interval, imagine taking 20 different samples from a population and drawing a confidence interval around each. As the diagram shows, on average 95% of these intervals — or 19 out of 20 — would actually contain the population mean. What does this insight mean for us as managers? When we set a confidence level of 95%, we are agreeing to an approach that 1 out of 20 times will give us an interval that does not contain the true population mean. If we aren't comfortable with those odds, we should raise the confidence level.

P90 If we increase the confidence level to 98%, we have only a 1 out of 50 chance of obtaining an interval that does not contain the true population mean. However, this higher confidence comes at a cost. If we keep the same sample size, then the confidence interval will widen, thereby decreasing the accuracy of our estimate. Alternatively, to keep the same interval width, we can increase our sample size. How do we know if an interval is too wide? Typically, if we would make a different decision for different values within an interval, that interval is too wide. Let's look at an example.

P90 To estimate the percent of people in our industry who will attend the annual conference, we might construct a confidence interval that ranges from 7% to 13%. If we would select a different conference venue when the true percentage is 7% than when it is 13%, we need to tighten our range. Now, before we are ready to actually create our own confidence intervals, there is a technical point we need to be acquainted with. We need to know that the standard deviation of the Distribution of

Sample Means is σ, the standard deviation of the underlying population, divided by the square root of n, the sample size. We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about the Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large samples to be tightly clustered around the true population mean, and thereby form a narrow distribution.

A confidence interval is an estimate for the mean of a population. It specifies a range that is likely to contain the population mean. A confidence interval is centered at the mean of a sample randomly drawn from the population under study. When we have a confidence level of 95%, we expect the equally wide confidence intervals centered at 95 out of 100 such sample means to contain the population mean.
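For reference, the two facts just cited can be written compactly in the document's own symbols (σ for the population standard deviation, s for the sample standard deviation used in its place, n for the sample size, x-bar for the sample mean):

\[
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\qquad \text{and the interval around the sample mean is} \qquad
\bar{x} \pm z\,\frac{s}{\sqrt{n}} .
\]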

1. The confidence interval is centered at the sample mean.
2. 95% confidence => confidence intervals around 95% of all sample means contain the population mean.
3. Weigh the odds of a confidence interval not containing the population mean against the costs of a higher confidence level.
4. The confidence interval should be narrow enough that the management decision will not change for different values in the interval.
5. The standard deviation of the Distribution of Sample Means is σ divided by the square root of n (see notebook P90-1).

Finding a confidence interval

You understand the theory behind a confidence interval. But how do you actually construct one? We can now translate the previous discussion into a simple method for finding a confidence interval for the mean of any population. First, we randomly select a sample of size at least 30 from the population. We then compute the mean and standard deviation of the sample. Next, we assign the sample mean as the center of the confidence interval. To find the width of the interval, we must know the level of confidence we want to assign to the interval. If we want a 95% confidence interval, the interval should extend 2 times the standard deviation of the population divided by the square root of n, the sample size, on either side of the sample mean. Since we typically don't know the standard deviation of the population, we substitute the best estimate that we do have — the standard deviation of the sample. Here's what the equation looks like for our example: x-bar ± 2 * s/sqrt(n).
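As a companion to the formula above, here is a minimal Python sketch of the calculation. The function name is illustrative; the trial values are the Wine Lover's Magazine numbers from the example that follows (mean 52, standard deviation 40, sample size 60), used here only as a check.

```python
from math import sqrt

def confidence_interval(sample_mean, sample_sd, n, z):
    """Range around the sample mean: sample_mean +/- z * sample_sd / sqrt(n)."""
    margin = z * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# 95% confidence uses a z-value of about 1.96 ("about 2").
low, high = confidence_interval(sample_mean=52, sample_sd=40, n=60, z=1.96)
print(round(low, 2), round(high, 2))   # about 41.88 and 62.12, matching the example below
```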

If we want a level of confidence other than 95%, instead of multiplying s/sqrt(n) by 2, we multiply by the z-value corresponding to the desired level of confidence. We can use this formula to compute any confidence interval. There is one restriction: in order for it to work, the sample size has to be at least 30.

Let's walk through an example. Wine Lover's Magazine's managers have asked us to help them estimate the average age of their subscribers so they can better target potential advertisers. We tell them we plan to survey a sample of their subscribers. They say they're comfortable with our working with a sample, but emphasize that they want to be 95% confident that the range we give them contains the true average age of their full set of subscribers. We obtain survey results from 60 randomly-chosen subscribers and determine that the sample has a mean of 52 and a standard deviation of 40. To find an appropriate confidence interval, we incorporate information about the sample into the formula: 52 ± 1.96 * 40/sqrt(60) = 52 ± 10.12. The z-value for a 95% confidence interval is about 2, or more accurately, about 1.96. This tells us that a 95% confidence interval would begin at 52 minus 10.12, or 41.88, and end at the mean plus 10.12, or 62.12. (SEE NOTE BOOK P90-2) We give management the range from 41.88 to 62.12 as an estimate of the average age of its subscribers, telling them they can be 95% confident that the true population mean falls between these values.

What if we want a confidence level other than 95%? We can use the sample mean, standard deviation, and size from the sample data, but how do we obtain the right z-value? The z-value for 95% confidence is well known to be about 2, but how do we find a z-value for a less common confidence interval? To be 98% confident that our interval contains the population mean, how do we obtain the appropriate z-value? (SEE NOTE BOOK P90-3) To find the z-value for a 98% confidence level, we are essentially asking: How far to the left and right of the standard normal curve's mean do we have to go to capture 98% of the area? Capturing 98% of the area centered at the mean of the normal curve leaves two areas at the tails, each covering 1% of the area under the curve. The z-value of the right boundary is the z-value associated with a cumulative probability of 99% — the sum of the central 98% and the 1% in the left tail. Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is essential because Excel's NORMSINV function and the z-table work

with cumulative probabilities. To find the z-value associated with a cumulative probability of 99%, enter into Excel =NORMSINV(0.99), which returns the z-value 2.33. Or, look in the z table and find the cell that contains a cumulative probability closest to 0.9900. The z-value is 2.33, the sum of the row-value 2.3 and the column-value 0.03. Try finding a z-value yourself. Find the z-value associated with a 99.5% confidence level using the appropriate normal distribution function in Excel or using the Standard Normal Table (z-table) in your briefcase. The correct z-value for a confidence level of 99.5% is: 2.81

Our first step is to convert the confidence level of 99.5% into the corresponding cumulative probability on the standard normal curve. To do this, note that to have 99.5% probability in the middle of the standard normal curve, we must exclude a total area of 1 - 99.5% = 0.5% from the curve. That area is divided into two equal parts in the distribution's tails: 0.25% in each tail. (SEE NOTE BOOK P90-4) We can now see that the cumulative probability associated with a confidence level of 99.5% is 1 - 0.25% = 99.75%. Thus, the z-value for a confidence level of 99.5% is the same as the z-value of a cumulative probability of 99.75%. We find the z-value in Excel by entering =NORMSINV(0.9975), which returns the value 2.81. Alternatively, we could find the z-value in the z-table by looking up the probability 0.9975.

Summary

To calculate a confidence interval, we take a sample, compute its mean and standard deviation, and then build a range around the sample mean with a specified level of confidence. The confidence level indicates how confident we are that the interval we construct around the sample mean contains the population mean. (SEE NOTE BOOK P90-5)

Using Small Samples

We assumed in our confidence interval calculations that the sample size was at least 30. What if it isn't? What if we have only a small sample? Let's consider a different survey, one that concerns a delicate matter. The business manager of a large ocean liner, the Demiurgos, asks for our help. She wants us to find out the value of her guests' belongings. She needs this value to determine the correct insurance protection in case guest belongings disappear from their cabins, are destroyed in a fire, or sink with the ship.

She has no idea how valuable her guests' belongings are, but she feels uneasy asking them for this information. She is willing to ask only 16 guests to estimate the total value of the belongings in their cabins. From this sample, we need to prepare an estimate.

With a sample size less than 30, we cannot calculate confidence intervals in the same way as with a large sample size. A small sample increases our uncertainty about two important aspects of our estimate of the population mean. First, with a small sample, the consequences of the Central Limit Theorem are not assured, so we cannot be sure that the sample means follow a normal distribution. Second, with a small sample, we can't be sure that the sample standard deviation is a good estimate of the population standard deviation. Due to these additional uncertainties, we cannot use z-values to construct confidence intervals. Using a z-value would overstate our confidence in our estimate.

Can we still create a confidence interval? Is there a way to estimate the population mean even if we have only a handful of data points? It depends: if we don't know anything about the underlying population, we cannot create a confidence interval with fewer than 30 data points. However, if the underlying population is normally distributed — or even roughly normally distributed — we can use a confidence interval to estimate the population mean. In practice, as long as we are sure the underlying population is not highly skewed or extremely bimodal, we can construct a confidence interval, even when we have a small sample. However, we do need to modify our approach slightly.

To estimate the population mean with a small sample, we use a t-distribution, which was discovered in the early 20th century at the Guinness Brewing Company in Ireland. A t-distribution gives us t-values in much the same way as a normal distribution gives us z-values. What is the difference between the normal distribution and the t-distribution? A t-distribution looks similar to a normal distribution, but is not as tall in the center and has thicker tails, because it is more likely than the normal distribution to have values fall farther away from the mean. Therefore, the normal distribution's "rules of thumb" for 68% and 95% probabilities no longer hold. For example, we must go more than 2 standard deviations on either side of the mean to capture 95% of the probability for a t-distribution. Thus, to achieve the same level of confidence, a confidence interval based on a t-distribution will

be wider than one based on a normal distribution. This reinforces our intuition: we have less certainty about our estimate with a smaller sample, so we need a wider interval to achieve a given level of confidence.

The t-distribution is also different because it varies with the sample size: For each sample size, there is a different t-value associated with a given level of confidence. The smaller the sample size n, the shorter the height and the thicker the tails of the t-distribution curve, and the farther we have to go from the mean to reach a given level of confidence. On the other hand, as the sample size increases, the shape of the t-distribution becomes more and more like the shape of a normal distribution. Once we reach a sample size of 30, the t-distribution becomes virtually identical to the z-distribution, so t-values and z-values can be used interchangeably. Incidentally, we can use the t-distribution even for sample sizes larger than 30. However, most people use the z-distribution for larger samples, partially out of habit and partially because it's easier, since the z-value doesn't vary based on the sample size.

To find the right t-value, we first have to identify the t-distribution that corresponds to our sample size. We do this by finding the number of "degrees of freedom" of the sample, which for our purposes is simply the sample size minus one. If our sample size is 16, we have 15 degrees of freedom, and so on. Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1 minus the level of confidence we want and the degrees of freedom into the Excel function TINV, Excel gives us the appropriate t-value. For example, for a 95% confidence interval and a sample size of n = 16, the Excel function TINV(0.05,15) would return the value 2.131. Once we find the t-value, we use it just like we used the z-value to find a confidence interval. For example, for t = 2.131, the appropriate confidence interval is: x-bar ± 2.131 * s/sqrt(n).

If we don't have Excel handy, we can use a t-distribution table to find the t-value associated with the degrees of freedom and the confidence level we specify. When using different t-value tables, we need to be careful to note which probability the table depicts. Some tables report values associated with the confidence level, like 0.95. Others report values based on the area in the tails, which would be 0.05 for a 95% confidence interval. Our t-table, like many others, reports values associated with a cumulative probability, so for a 95% level of confidence, we would have to look at a cumulative probability of 97.5%.

Returning to the good ship Demiurgos, let's determine an estimate of the average value of passengers' belongings. The manager samples 16 guests, and reports that they have an average of $10,200 worth of clothing, jewelry, and personal effects in their cabins. From her survey numbers, we calculate a standard deviation of $4,800. We need to double check that the distribution isn't too skewed, which we might expect, since some of the passengers are quite wealthy. The manager explains that the insurance policy has a limited liability clause that limits a passenger's maximum claim to $20,000. Above $20,000, passengers' own homeowners' policies must cover any losses. Thus, in the survey, if a guest reported values above $20,000, the manager simply reported $20,000 as the value to be covered for our data set. We sketch a graph of the 16 values that confirms that the distribution is not too asymmetric, so we feel comfortable using the t-distribution.

Since we have a sample of 16 passengers, there are 15 degrees of freedom. The Excel function =TINV(0.05,15) tells us that the appropriate t-value is 2.131. Using the confidence interval formula, the guests' valuables are worth $10,200 plus or minus 2.131 times $4,800 over the square root of 16. Thus, the interval extends 2.131*4,800/4 = $2,557 on either side of the mean, and we can report that we are 95% confident that the average value of passengers' belongings is between $7,643 and $12,757.

What if the Demiurgos' manager thinks this interval is too large? She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also increases the size of the denominator (the square root of n). Both factors narrow the confidence interval. For example, if she asks 10 more guests, and the standard deviation of the sample does not change, the t-value would drop to 2.06 and the square root of n in the denominator would increase. The distance the interval extends on either side of the mean would decrease significantly, from $2,557 to $1,939.
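Here is a sketch of the Demiurgos calculation in Python, assuming SciPy is installed. The function scipy.stats.t.ppf plays the role of Excel's TINV, but it takes the cumulative probability (0.975 for 95% confidence) rather than the two-tailed area 0.05.

```python
from math import sqrt
from scipy.stats import t

def t_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Small-sample interval: sample_mean +/- t * sample_sd / sqrt(n), with n - 1 degrees of freedom."""
    df = n - 1
    t_value = t.ppf(1 - (1 - confidence) / 2, df)   # e.g., t.ppf(0.975, 15) is about 2.131
    margin = t_value * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = t_confidence_interval(sample_mean=10_200, sample_sd=4_800, n=16)
print(round(low), round(high))
# roughly 7642 and 12758; the text's $7,643 and $12,757 come from rounding the t-value to 2.131
```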

SUMMARY

Confidence intervals can be constructed even with a sample size of less than 30, as long as the population is roughly normally distributed (or, at least, not too skewed or bimodal). To find a confidence interval with a small sample, use a t-distribution. T-distributions are a set of distributions that resemble the normal distribution, but with shorter heights near the mean and thicker tails. To find a confidence interval for a small sample size, place the appropriate t-value into the confidence interval formula. (SEE NOTE BOOK P90-6)

When we take a survey, we often want a specific level of accuracy in our estimate of the population mean. For example, when estimating car owners' average spending on car repairs each year, we might want to be 95% confident that our estimate is within $50 of the true mean. We know that the sample size of our survey directly affects the accuracy of our estimate. The larger the sample size, the tighter the confidence interval and the more accurate our estimate. A sample of size n gives us a confidence interval that extends a distance of d on either side of the mean: d = z * sigma/sqrt(n).

To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of sigma, the standard deviation of spending. If we do not have an estimate based on past data or some other source, we might take a preliminary survey to obtain a rough estimate of sigma. In this example, we estimate sigma to be $300 based on past experience. Since we want a 95% level of confidence, we set z = 1.96. To ensure our desired accuracy — that d is no more than $50 — we must randomly sample at least 139 people.

In general, to ensure a confidence interval extends a distance of at most d on either side of the mean, we choose a sample size n that satisfies n ≥ (z * sigma/d)^2. We can do this with simple algebra, or by using the attached Excel utility. When estimating a population mean, we can ensure that our confidence interval extends a distance of at most d on either side of the mean by choosing an appropriate sample size. (SEE NOTE BOOK P90-7)
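A small Python sketch of this sample-size calculation, using the estimates from the repair-spending example above (z = 1.96 for 95% confidence, an estimated sigma of $300, and a desired accuracy d of $50):

```python
from math import ceil

def required_sample_size(z, sigma, d):
    """Smallest n with z * sigma / sqrt(n) <= d, i.e. n >= (z * sigma / d) ** 2."""
    return ceil((z * sigma / d) ** 2)

print(required_sample_size(z=1.96, sigma=300, d=50))   # 139
```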

Here is a step-by-step process for creating a confidence interval: First, we choose a level of confidence and a sample size n appropriate to the decision context. Second, we take a random sample and find the sample mean. This is our best estimate for the population mean. Third, we find the sample's standard deviation. Fourth, we find the z-value or t-value associated with the proper confidence level. If our sample size is over 30, we find the z-value for our confidence level. If not, we find the t-value for our confidence level and with degrees of freedom = sample size - 1. Fifth, we calculate the end points of the confidence interval using the formula x-bar ± z * s/sqrt(n) (or x-bar ± t * s/sqrt(n) for a small sample).

SUMMARY

Construct confidence intervals using the steps outlined above. With a confidence interval derived from an unbiased random sample, we can say that the true population mean falls within the interval with the corresponding level of confidence. (SEE NOTE BOOK P90-8)

Click here to open an Excel utility that allows you to create confidence intervals by providing the sample mean, standard deviation, size, and desired level of confidence. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the Wine Lover's Magazine and the Demiurgos examples.

The sample you collected earlier has all the data you need to create a confidence interval for Leo's problem. You take another look at the survey you created earlier for Leo: you sampled 45 guests, and calculated that the average satisfaction rate of the sample was 4.4, with a standard deviation of 1.54. Using this information, you decide to create a 95% confidence interval for Leo. Your calculations show the following:

We can be 95% sure that the population mean falls between 3.95 and 4.85. To create a 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence interval by going 0.45 points above and below the sample mean of 4.4, which translates into a confidence interval from 3.95 to 4.85. You meet with Leo and tell him that you can be 95% certain that the population mean falls between 3.95 and 4.85. Leo looks at your numbers and shakes his head. That's just not accurate enough for me to make a decision. If the mean is close to 4.85, I'd be happy, but if it's closer to 4, I'm concerned. Can we narrow the range at all? Looking over your notes, you think you can give Leo some options.

We can survey a larger group of people. This is the best answer. By increasing the sample size, you can narrow your confidence interval even if the standard deviation stays constant. Why don't you create a larger sample and report the results back to me? You select another 40 guests at random and ask the hotel operator to conduct the survey for you again. He is able to reach 25 guests. You combine the two samples, which gives a new sample size of 70. For the combined sample, you find that the new sample mean is 4.5 and the new sample standard deviation is 1.2. Armed with more data, you create another confidence interval. We can be 95% certain that the average satisfaction of all hotel guests with the scuba school is

between: 4.22 and 4.78

To create this 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence interval by going 0.28 points above and below the sample mean of 4.5, which translates into a confidence interval from 4.22 to 4.78.

Thank you. I am much happier with this result. I have enough information now to decide whether to keep the current scuba diving school.

Exercise1

Toshi Matsumoto is the Chief Operating Officer of a consumer electronics retailer with over 150 stores spread throughout Japan. For over a year, the sales of high-end VCRs have lagged, due to a shift towards DVD players. Just today, Toshi heard that Veetek, a large South Asian electronics retailer, is looking to purchase a bulk shipment of high-end VCRs. This would be a perfect opportunity for Toshi to liquidate slow-moving inventory currently languishing on the shelves of his stores. Before he calls Veetek, he wants to know how many high-end VCRs he can promise. After two days of furious phone calls, his deputy has gathered data from 36 representative outlets in his retail chain. The mean high-end VCR inventory in each store polled was 500 units. The standard deviation was 180. Toshi needs you to find a 95% confidence interval for the average VCR inventory per store. The interval is: From 441 to 559

Exercise2

Paul Segal manages the pig-farming division of the agricultural company Bowman-LyonsCenterville. A rumored outbreak of Pulluscular Pig Disorder (PPD) in one of Paul's herds is on the verge of causing a public relations disaster. The main symptom of PPD is a shrinking brain, and the only certain way to diagnose PPD is by measuring brain size post-mortem. Paul needs to know if his herd is affected by PPD, but he does not want to have to slaughter hundreds of swine to find out. At the preliminary stage, he can offer no more than 5 prime porkers to be slaughtered and diagnosed. For the pigs slaughtered, the mean brain weight was 0.18 lbs, with a standard deviation of 0.06 lbs. With 95% confidence, in what range does the herd's average brain weight lie? [0.106 lbs, 0.254 lbs]

Proportions The next morning, you and Alice are about to head off to the hotel pool when Leo calls you. I'm sorry to disturb you, but I have another problem, and I think you might be able to help. The Kahana is a very popular resort during the summer tourist season. But the number of leisure visitors drops significantly during the off-season, from September through February and then April through May. We usually have quite a few room vacancies during that period of time. We expect to have about 200 rooms vacant for weeklong periods during the slow season this year. I've developed a new program that rewards our best guests with a special discount if they book a weeklong stay during our slow period. They won't have complete date flexibility of course, but the steep discount should make the offer attractive for them. To see how many of our past guests would accept such an offer, I sent promotional brochures to 100 of them. The deadline by which they had to respond to the offer has passed. Ten guests responded with the required room deposit prior to the deadline — that's a solid 10 percent. I figure if we send out 2,000 promotions, we'll get about 200 responses. This is a nice idea Leo, but I'm concerned it could backfire. If more than 10% respond to this offer, you might end up disappointing some of the very guests you're trying to reward. Or, if too many respond and you give them all the discount, you'll have to turn away customers willing to pay full price. That is exactly my concern. I wonder how accurate the 10% response rate is. Just because it held for 100 guests, will it hold for 2,000? What if 11% actually respond to the promotions? Imagine what would happen if 220 guests responded. I don't want to anger 20 loyal customers by telling them the offer is not valid, but I also don't want to turn away full paying guests to accommodate the extra 20 guests at a discount. I'm willing to reserve 200 rooms for these discount weeklong stays during the slow season. How many return guests can I safely send the discount offer and be confident that no more than 200 will respond? You can tell that Leo is growing quite comfortable with relying on your statistical methods. He seems almost as interested in them as he is in your results.

Sometimes, the question we pose to members of a sample calls for a yes or no answer. We might survey people in a target market and ask if they plan to buy a new car this year. Or survey voters and ask if they plan to vote for the incumbent candidate for office. Or we might take a sample of the products our plant produced yesterday and count how many are defective. Even though our question has only two answers, we still have to address an inherent uncertainty: We know what values our data can take — yes or no — but we don't know how often each response will be given. In these cases, we usually convey the survey results by reporting the percentage of yes responses as a proportion, p-bar. This is our best estimate of p, the true percentage of "yes" responses in the underlying population. NOTE 1001

Suppose, for example, that we have posted advertisements in the subway cars on Boston's "Red Line," and want to know what percentage of all passengers remembers seeing our ad.

We create a proper survey, and ask randomly selected Red Line passengers if they remember seeing our ad. 300 passengers respond to our survey, of which 100 passengers report remembering the ad. Then p-bar is simply 33%, which is the number of people that remember the ad, 100, divided by the number of respondents, 300. The remaining 200 passengers, or 67% of the sample, report not remembering the ad. The two proportions always add up to 1 because survey respondents report either remembering the ad or not. Once we know the proportion of the sample, we can draw conclusions about all Red Line passengers. Our best estimate, or point estimate, for p, the percentage of all passengers who remember seeing our ad, is 33%. As managers, we typically want more than this simple point estimate — we want to know how accurate the estimate is. How far from 33% might the true percentage be? Can we say confidently that it is between 30% and 36%, for example? When we work with proportions, how do we find a confidence interval around our point estimate? The process for creating a confidence interval around a proportion is nearly identical to the process we've used before. The only difference is that we can approximate the standard deviation of the population with a simple formula rather than calculating it directly from the raw data.

Based on our sample, our best estimate of the true population proportion is p-bar, the percentage of "yes" responses in our survey. Statistical theory tells us that our best estimate of the standard deviation of the underlying population is the square root of [(p-bar) * (1 - (p-bar))]. We can use this approximate standard deviation to determine a confidence interval for the proportion. NOTE 1002

For our Red Line ad, we approximate the standard deviation with the square root of 0.33 times 0.67, or 0.47. A 95% confidence interval is 0.33 plus or minus 1.96 times 0.47 divided by the square root of 300. This is equal to 0.33 plus or minus 0.053, or 27.7% to 38.3%. The sketch below reproduces this calculation. Unfortunately, there is one catch when we calculate confidence intervals around proportions... NOTE 1003
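Here is a minimal Python sketch reproducing the Red Line calculation above; as in the text, p-bar is rounded to 0.33 before the calculation.

```python
from math import sqrt

def proportion_interval(p_bar, n, z):
    """Interval for a proportion: p_bar +/- z * sqrt(p_bar * (1 - p_bar)) / sqrt(n)."""
    sd = sqrt(p_bar * (1 - p_bar))      # approximate standard deviation of the population
    margin = z * sd / sqrt(n)
    return p_bar - margin, p_bar + margin

low, high = proportion_interval(p_bar=0.33, n=300, z=1.96)
print(round(low, 3), round(high, 3))    # about 0.277 and 0.383, i.e. 27.7% to 38.3%
```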

Sample size Sample size matters, particularly when dealing with very small or very large proportions. Suppose we are sampling New Yorkers for Amyotrophic Lateral Sclerosis, commonly known as Lou Gehrig's Disease. In the U.S., the odds of having the disease are less than 1 in 10,000. Would our sample be useful if we surveyed 100 people? No. We probably wouldn't find a single person with the disease in our sample. Since the true proportion is very small, we need to have a large enough sample to make sure we find at least a few people with the disease. Otherwise, we will not have enough data to get a good estimate of the true proportion. There is a guideline we must meet to make sure that our sample is large enough when estimating proportions. Two conditions must be met: First, the product of the sample size and the proportion must be at least 5. Second, the product of the sample size and 1 minus the proportion must also be at least 5. NOTE 1004 If both these requirements are met, we can use the sample. Essentially, this guideline guarantees that our sample contains a reasonable number of "yes" and a reasonable number of "no" answers. Our sample will not be useful otherwise. To avoid an invalid sample, we need to create a large enough sample size to satisfy the requirements. However, since we don't know the proportion p-bar before sampling, we don't know if the two conditions are met before setting the sample size. How can we get around this problem? We can obtain a preliminary estimate of p-bar using either of two methods: first, we can use past experience. For example, to estimate the rate of Lou Gehrig's disease, we can research the rate of occurrence in the general population. This is a reasonable first estimate for p-bar. In many cases, however, we are sampling for the first time. Without past experience, we don't know what p-bar might be. In this case, it may well be worth our time to take a small test sample to estimate the proportion, p-bar.

For example, if the proportion of yes answers in our small test sample is 3%, then we can use 3% as our preliminary estimate of p-bar. Substituting 3% for p-bar in our two requirements, n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5, tells us that n must satisfy n*0.03 ≥ 5 and n*0.97 ≥ 5. Thus the sample size we need for our real sample must be at least 167. We would then use a real sample — with at least 167 respondents — to find an actual sample value of p-bar to create a confidence interval for the population proportion. NOTE 1005
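A sketch of this sample-size check, using the 3% preliminary estimate from the example above (the function name is illustrative):

```python
from math import ceil

def min_sample_size_for_proportion(p_estimate):
    """Smallest n with n * p >= 5 and n * (1 - p) >= 5."""
    return ceil(5 / min(p_estimate, 1 - p_estimate))

print(min_sample_size_for_proportion(0.03))   # 167
```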

Summary Proportions are often used to indicate the frequency of some characteristic in a population. The sample proportion p-bar is the number of occurrences of the characteristic in the sample divided by the number of respondents, the sample size. It is our best estimate of the true proportion in the population. We can construct a confidence interval for the population proportion. Two guidelines for the sample size must be met for a valid confidence interval: n(p-bar) and n(1 - (p-bar)) must each be at least five. NOTE 1006 Creating confidence intervals around proportions is not much different from creating them around means. Finding the right number of Leo's promotional brochures to mail should be easy. Leo needs to know how accurate the 10 percent response rate of his 100-customer sample is. Will this response rate hold for 2,000 guests? To how many guests can he send the discount offer for his 200 rooms? First, you calculate a 95% confidence interval for the response rate. Enter the lower bound as a decimal number with two digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% and 15.88%. You obtain that answer by using the sample data and applying the familiar formula: NOTE 1007

Then after giving Leo's questions some thought, you recommend to him that he send the mailing to a specific number of guests. Enter the number of guests as an integer, (e.g., "5"). Round if necessary.

Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the discount offer (at the 95% confidence level) is 15.88%. So, if 15.88% of people were to respond for 200 rooms, how many people should Leo send the offer to? Simply divide 200 by 0.1588 to get the answer: Leo needs to send the offer to at most 1,259 past customers. Leo is pleased with your work. He tells you to relax and enjoy the resort.

Exercise1

GMW is a German auto manufacturer that has regional sales subsidiaries throughout the world. Arturo Lopez heads the Mexican sales division of the company's Latin American subsidiary. GMW earns additional profit when customers choose to finance their car purchase with a GMW financing package. Arturo has been asked to submit a report to the GMW CEO in Germany about the percentage of GMW customers who opt for financing. Arturo has asked you, a new member of the division sales team, to devise a way to estimate this percentage. You take a random sample of 64 cars sold in the Mexican sales division, and find that 13 of them, or about 20.3%, opted for GMW financing. If you want to be 95% confident in your report to Mr. Lopez, you should tell him that the percentage of all Mexican customers opting for GMW financing falls in the range: from 10.4% to 30.2%

EXERCISE2

Kayleigh Marlon is the Chief Buyer at Tar-Mart, a company that operates a chain of superstores selling discount merchandise. Tar-Mart has a huge national presence, and manufacturers compete fiercely to get their products onto Tar-Mart's shelves. Crown Toothpaste, a new entrant in the toothpaste market, is one of them. Kayleigh agreed to stock Crown for 4 weeks and display it prominently. After that period, she will stop stocking Crown unless 5% of Tar-Mart's customers bought Crown or were considering buying Crown within the next month. The trial period is now over. Kayleigh has asked you to take a sample of customers to see if Tar-Mart should continue stocking Crown. She would like you to be at least 95% confident in your answer. The first step is to decide how large a sample size to choose. Kayleigh tells you that, in the past, when Tar-Mart introduced a new product, the percentage of people who expressed interest ranged between 2% and 10%. What sample size should you use? 250

This is the best answer. This sample size will satisfy the two rules of thumb (n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5) for all proportions falling in the range 2% to 10%.

You choose a sample size of 250. After conducting the survey, you find that 10 out of 250 people surveyed had bought Crown or were considering buying Crown within the next month. What is the 95% confidence interval for the population proportion? From 1.6% to 6.4%

First, you find the sample proportion: 10 out of 250 is a proportion of 4%. You verify that n(p-bar) = 250*0.04 = 10 ≥ 5 and n(1 - (p-bar)) = 250*0.96 = 240 ≥ 5. Then, using the formula, you find the confidence interval around the sample proportion. The endpoints of that interval are 1.6% and 6.4%.

EXERCISE3

OO-P-S is a small-package delivery service with worldwide operations. Celine Bedex, VP Marketing, has heard increasing complaints about late deliveries, and wants to know how many of the shipments are late by one day or more. Celine would like an estimate of the percentage of late deliveries. In a sample of 256 shipments, 2 were delivered late, a proportion of about 0.008, or 0.8%. If Celine wants to be 99% confident in the result of a confidence interval calculation, the interval is: No valid inferences can be drawn from these data.

This is the best answer. One of the rules of thumb for the sample size is not being satisfied: n(p-bar) = (256)(0.008) = 2 is less than 5. Celine collects a new sample, this time of 729 shipments. Of these, 8 were late. Celine can be 99% confident that the population proportion of late packages is between: 0.1% and 2.1%

This is the correct answer. The new sample size is sufficiently large to investigate a population proportion of 0.011. First, calculate the sample proportion for the new sample: 8/729 = 0.011. Then, verify that the new sample size satisfies the rules of thumb. Both n(p-bar) and n(1 - (p-bar)) are greater than 5. Using the new sample size and sample proportion, calculate the confidence interval: [0.1%, 2.1%]. NOTE 1008

Hypothesis Testing After finishing the sampling assignments, you and Alice decide to take some time off to enjoy the beach. Just as you are gathering your beach gear, Leo gives you another call. Hi there! Don't let me keep you from enjoying the beach. I just wanted to let you know what I'd like you to help me with next. I've been working on ideas to increase the Kahana's profits. Is it possible to increase profits by raising the room prices? That would be an easy solution.

I wish it were that easy. Room prices are extremely competitive and are often the first thing potential guests take into consideration. So if we increase room prices, I'm afraid we'll have fewer guests. That might put us back where we started from with profits — or even worse. What other factors influence your profits? The two major ones are room occupancy rates and discretionary spending. "Discretionary spending" is the money guests spend on non-room amenities. You know, food, drinks, spa services, sports activities, and so on. As a manager I can affect a variety of factors that influence discretionary spending: the quality of the restaurant, for example, or the types of amenities offered. And you'd like us to help you understand your guests' discretionary spending patterns better. Right. Then I can explore new ways to increase profits on non-room amenities. I can also see if some of my recent efforts to increase guest spending have paid off. I'm particularly interested in restaurant operations. I've made some changes to the restaurants recently. For example, I hired a new executive chef last year. I'd like to know if restaurant revenues per person have changed since then. I'd also like to find out if the renovation of our premier cocktail lounge has resulted in higher spending on beverages. Finally, I've been wondering if discretionary spending patterns are different for leisure and business guests. If so, I might change our marketing campaigns to better suit each of those market segments. What records do you have for us to work with? We don't have a consolidated report for this year yet, so we'll need to conduct some surveys and analyze the results. You're really getting into these statistical methods, aren't you, Leo? Leo made some important changes to his business and he has some ideas of what the impact of these changes has been. How do you put his ideas to the test? As managers, we often need to put our claims, ideas, or theories to the test before we make important decisions. Based on whether or not our claim is statistically supported, we may wish to take managerial action. Hypothesis testing is a statistical method for testing such claims. A hypothesis is simply a claim that we want to substantiate. To begin, we will learn how to test hypotheses about population

means. For instance, suppose we know that the historical average number of defects in a production process is 3 defects per 1,000 units produced. We have a hunch that a certain change to the process — a new machine, say — has changed this number. The hypothesis we wish to substantiate is that the average defect rate has changed — that it is no longer 3 per 1,000. How do we conduct a hypothesis test? First, we collect a random sample of units produced by the process. Then, we see whether or not what we learn about the sample supports our hypothesis that the defect rate has changed. Suppose our sample has an average defect rate of 2.7 defects per 1,000. Based on this sample, can we confidently say that the defect rate has changed? That depends. To find out, we construct a range around the historical defect rate of 3 — the population mean that has been cast in doubt. We construct the range so that if the mean defect rate in the population is still 3, it is very likely for the mean of a sample taken from the population to fall within that range. The outcome of our test will depend on whether 2.7, the mean of the sample we have taken, falls within the range or not. If the sample mean of 2.7 falls outside of the range, we feel comfortable rejecting the hypothesis that the defect rate is still 3. However, if the sample mean falls within the range, we don't have enough evidence to support the claim that the defect rate has changed. This example captures the essence of hypothesis testing, but we need to formalize our intuition about the example and define our new statistical technique more precisely. To conduct a hypothesis test, we formulate two hypotheses: the so-called null hypothesis and the alternative hypothesis. Based on experience or conventional wisdom, we have an initial value of the population mean in mind. The null hypothesis states that the population mean is equal to that initial value: in our example, the null hypothesis states that the current population mean is 3 defects per 1,000. We use the Greek letter mu to represent the population mean, in this case the current average defect rate. NOTE 1009 The alternative hypothesis is the claim we are trying to substantiate. Here, the alternative hypothesis is that the average defect rate has changed. Note that the alternative hypothesis states that the null hypothesis does not hold. As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we

gather from a sample, there are only two possible conclusions we can draw from a hypothesis test: either we reject the null hypothesis or we do not reject it. Since the alternative hypothesis states the opposite of the null hypothesis, by "rejecting" the null hypothesis we necessarily "accept" the alternative hypothesis. In our example, the evidence from our sample will help us determine whether or not we should reject the null hypothesis that the defect rate is still 3 in favor of the alternative hypothesis that the defect rate has changed. Based on our sample evidence, which conclusion should we draw? We reject the null hypothesis if it is highly unlikely that our sample mean would come from a population with the mean stated by the null hypothesis. For example, if the sample we drew had a defect rate of 14 per 1,000, we would reject the null hypothesis. Drawing a sample with 14 defects from a population with an average defect rate of 3 would be very unlikely. We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a population with the mean stated by the null hypothesis. The null hypothesis may or may not be true: we simply don't have enough evidence to draw a definite conclusion. For example, if the sample we drew had a defect rate of 3.05 per 1,000, we could not reject the null hypothesis, since it wouldn't be unusual to randomly draw a sample with 3.05 defects from a population with an average defect rate of 3. Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we never say that we "accept" the null hypothesis — we simply don't reject it. It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to substantiate as the null hypothesis — such a test would never allow us to "accept" our claim! The only way we can substantiate our claim is to state it as the opposite of the null hypothesis, and then reject the null hypothesis based on the evidence. It is important that we understand exactly how to interpret the results of a hypothesis test. Let's illustrate the two types of conclusions with an analogy: a US jury trial. In the US judicial system, the accused is considered innocent until proven guilty. So, the null hypothesis is that the accused is innocent. The alternative hypothesis is that the accused is guilty: this is the claim that the prosecution is trying to prove. The two possible outcomes of a jury trial are "guilty" or "not guilty." The jury does not convict the accused unless it is certain beyond reasonable doubt that the accused is guilty. With insufficient evidence, the jury cannot conclude that the accused truly is innocent. The jury simply declares that the accused is "not guilty."

Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not prove that the null hypothesis is true. We simply have failed to show it is false, and thus cannot reject it. A hypothesis is a claim or assertion that can be tested. On the basis of a hypothesis test we either reject or leave unchallenged a particular statement: the null hypothesis. Alice promises Leo that the two of you will drop by his office first thing in the morning to test if Leo's survey results support his claims that food and beverage spending patterns have changed. SUMMARY We use hypothesis tests to substantiate a claim about a population mean. The null hypothesis states that the population mean is equal to an initial value that is based on our experience or conventional wisdom. We test the null hypothesis to learn if we should reject it in favor of our claim, the alternative hypothesis, which states that the null hypothesis does not hold.
- Use hypothesis tests to substantiate theories and claims about population means.
- The null hypothesis expresses conventional wisdom about the mean.
- The alternative hypothesis is the claim we wish to substantiate; it is the opposite of the null hypothesis.
- To conduct a hypothesis test: collect a sample, then ask whether the sample is highly unlikely if the null hypothesis is true. If yes, reject the null hypothesis. If no, do not reject the null hypothesis.
- Never "accept" the null hypothesis.

Single population means The next morning, Leo explains the measures he has undertaken to increase customer spending on food and beverages. "I'd like to see if they've had a discernible impact on my guests' restaurant-related spending patterns." Last year, I made two major changes to restaurant operations: I brought in a new executive chef and renovated the main cocktail lounge. The chef introduced a new menu: a fusion of traditional Hawaiian and French cuisine. She put some elaborate items on the menu, like that mango and brie tart I recommended to you. She also has offerings that cater to simpler tastes. But the question is, have restaurant profits been affected by the new chef? Since we set our food margins as a fixed percentage of food revenue, I know that if revenues have increased, profits have increased too. Based on last year's consolidated reports, the average spending on food per person per day was $55. I'm curious to see if that has changed. In addition, I renovated the cocktail lounge. The old bar was designed poorly and used space inefficiently. Now more guests can be seated in the lounge, and more seats have good views of the ocean.

I also invested in a large machine that makes a wide variety of frozen drinks. Frozen pina coladas are very, very popular. I hope my investments in the bar are paying off in terms of higher guest spending on drinks. Beverages have high margins, but I'm not sure if beverage sales have increased enough to cover the investments. Can we say, for beverages, as for food, that "changes in revenues" are a good proxy for "changes in profits"? Absolutely. I set my profit margins as a fixed percentage of revenues for beverages as well. Last year, the average spending on beverages per guest per day was $21. Isn't that high? Well, we have some very nice wines in our restaurants. We don't have the consolidated report yet, but I've already had my staff choose a random sample of guests. We pulled the restaurant and lounge receipts for the guests in the sample and noted three items: total food revenues, total beverage revenues, and number of guests at the table. Using this information, we should be able to estimate the daily spending on food and beverages per guest. You look at Leo's data and wonder how you can discern whether Leo's changes — the new chef and the bar renovations — have influenced the resort's profits. Leo has prepared data for you. How are you going to put it to use? Our first type of hypothesis test is used to study population means. Let's walk through an example of this type of test. Suppose the manager of a movie theater implemented a new strategy at the beginning of the year: he started showing old classics instead of recent releases. He knows that prior to the change in strategy, average customer satisfaction was 6.7 out of a possible 10 points. He would like to know if average customer satisfaction has changed since he altered his theater's artistic focus. The manager's null hypothesis states that the current mean satisfaction has not changed; it is still 6.7. We use the Greek letter mu to represent the current mean satisfaction rating of the theater's entire film-going population. His alternative hypothesis is the opposite of the null hypothesis: it states that average customer satisfaction is now different. NOTE 1010 To substantiate his claim that the mean has changed, the manager takes a random sample of 196 moviegoers. He is careful to sample across movies, show times, and dates. The mean satisfaction rating for the sample is 7.3, with a standard deviation of 2.8.

Does the fact that the random sample's mean of 7.3 is higher than the historical mean of 6.7 indicate that this year's moviegoers really are more satisfied? Or, is the mean still the same, and the manager "just happened" to pick a sample with an unusually high average satisfaction rating? This is equivalent to asking the question: If the null hypothesis is true — the average satisfaction is still 6.7 — would we be likely to randomly draw the sample that we did, with average satisfaction 7.3? To answer this question, we have to first define what we mean by "likely." As in sampling and estimation, we typically use 95% as our threshold level of likelihood. We then construct a range around the population mean specified by our null hypothesis. The range should be drawn so that if the null hypothesis is true, 95% of all samples drawn from the population would fall in that range. In other words, we create a range of likely sample means. The central limit theorem tells us that the distribution of sample means follows a normal curve, so we can use its familiar properties to find probabilities. Moreover, the distribution of sample means is centered at our assumed population mean, mu, and has standard deviation sigma/sqrt(n). We don't know sigma, the underlying population standard deviation, so we use the sample standard deviation as our best estimate. As we do when constructing 95% confidence intervals, we create a range with width z*s/sqrt(n) = 1.96*s/sqrt(n) on either side of the mean. However, when we conduct a hypothesis test, we center the range around the mean specified in the null hypothesis because we always start a hypothesis test by assuming the null hypothesis is true. NOTE1011 In our example, the null hypothesis is that the population mean is 6.7, n is 196, and s is 2.8. Our 95% confidence level translates into a z-value of 1.96. We construct the range of likely sample means: NOTE1012 This tells us that if the population mean is 6.7, there is a 95% chance that the mean of a randomly selected sample will fall between 6.3 and 7.1. Now, if we take a sample, and the mean does not fall within the range around 6.7, we can reject the null hypothesis. Why? Because if the population mean were 6.7, it would be unlikely to collect a sample whose mean falls outside this range. The region outside the range of likely sample means is called the "rejection region," since we reject the null hypothesis if our sample mean falls into it. In the movie theater example, the rejection region contains all values less than 6.3 and all values greater than 7.1. In this example, the sample mean, 7.3, falls in the rejection region, so we reject the null hypothesis. Whenever we reject the null hypothesis, we in effect accept the alternative hypothesis.
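As a quick check on this arithmetic, here is a short Python sketch (ours, not part of the course material) that reproduces the 6.3 to 7.1 range of likely sample means for the movie theater example.

import math

def likely_sample_mean_range(mu_0, s, n, z=1.96):
    # Two-sided range of likely sample means around the null hypothesis mean mu_0.
    half_width = z * s / math.sqrt(n)
    return (mu_0 - half_width, mu_0 + half_width)

low, high = likely_sample_mean_range(6.7, 2.8, 196)
print(low, high)                 # about 6.31 and 7.09, i.e. roughly 6.3 to 7.1
print(7.3 < low or 7.3 > high)   # True: the sample mean 7.3 falls in the rejection region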

We conclude that customer satisfaction has indeed changed from the historical mean value of 6.7. If our sample mean had fallen within the range around 6.7, we could not make a definite statement about moviegoers' satisfaction. We would not have enough evidence to state that things have changed, but we can never claim that they have definitely remained the same. Unless we poll every customer, we'll never know for sure if customer satisfaction has truly changed. Working only with sample data, there is always a chance that we'll draw the wrong conclusion about the population. We can go wrong in two ways: rejecting a null hypothesis that is in fact true or failing to reject a null hypothesis that is in fact false. Let's look at the first of these: the null hypothesis is true, but we reject it. We choose the confidence level so it is unlikely — but not impossible — for the sample mean to fall in the rejection region when the null hypothesis is true. In this case, we are using a 95% confidence level, so by unlikely we mean a 5% chance. However, 5% of all samples from a population with the null hypothesis mean would fall in the rejection region, so when we reject a null hypothesis, there is a 5% chance we will do so erroneously. Therefore, when the sample mean falls in the rejection region, we can only be 95% confident that we are justified in rejecting the null hypothesis. Hence we continue to speak of a confidence level of 95%. A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5% significance level says that there is a 5% chance of a sample mean falling in the rejection region when the null hypothesis is true. This is what people mean when they say that something is "statistically significant at a 5% significance level." If we increase our confidence level, we widen the range around the null hypothesis mean. At a 99% confidence level, our range captures 99% of all sample means. This reduces to 1% our chance of rejecting the null hypothesis erroneously. But doing this has a downside: by decreasing the chance of one type of error, we increase the chance of the other type. The higher the confidence level, the smaller the rejection region, and the less likely it is that we can reject the null hypothesis when it is in fact false. This decreases our chance of being able to substantiate the alternative hypothesis when it is true. As managers, we need to choose the confidence level of our test based on the relative costs of making each type of error. The range of likely sample means should not be confused with a confidence interval. Confidence intervals are always constructed around sample means, never around population means. When we construct a confidence interval, we don't even have an initial estimate of the population mean. Constructing a confidence interval is a process for estimating the population mean, not for testing particular claims about that mean. Summary In a hypothesis test for population means, we assume that the null hypothesis is true. Then, we

construct a range of likely sample means around the null hypothesis mean. If the sample mean we collect falls in the rejection region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. The confidence level measures how confident we are that we are justified in rejecting the null hypothesis.note1013 One-sided Hypothesis tests The movie theater manager did not have a strong conviction about the direction of change for customer satisfaction prior to performing the hypothesis test. He wanted to test for change in both directions — up or down — and thus he used a two-sided hypothesis test. The null hypothesis — that no change has taken place — could have been wrong in either of two ways: Customer satisfaction may have increased or decreased. The two-tailed nature of the test was reflected in the two-sided range we drew around the population mean. Sometimes, we may want to know if the actual population mean differs from our initial value of the population mean in a specific direction. For instance, if the theater manager were quite sure that satisfaction had not decreased, he wouldn't have to test in that direction; rather, he'd only have to test for positive change. In these cases, our alternative hypothesis should clearly state which direction of change we want to test for. These kinds of tests are called one-sided hypothesis tests. Here, we substantiate the claim that the mean has increased only if the sample mean is sufficiently higher than 6.7, so our rejection region extends only to the right. Let's outline how to formulate one- and two-sided tests. For a two-sided test we have an initial understanding of the population: the population mean is equal to a specified initial value. If we want to substantiate the claim that a population mean has changed, the null hypothesis should state that the mean still equals that initial value. The alternative hypothesis should state that the mean does not equal that initial value. If we want to know that the actual population mean is greater than the initial value — the null hypothesis mean — then the null hypothesis should state that the population mean has at most that value. The alternative hypothesis states that the mean is greater than the null hypothesis mean. Likewise, if we want to substantiate the claim that a population mean is less than the initial value, the null hypothesis should state that the mean is at least that initial value. The alternative hypothesis should state that the mean is less than the null hypothesis mean, and the rejection region extends only to the left. When we conduct a one-sided hypothesis test, we need to create a one-sided range of likely sample means. Suppose the theater manager claims that satisfaction improved. As usual, he states the claim he wants to substantiate as his alternative hypothesis.
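A quick aside on where the z-values used in these tests come from: they are inverse cumulative probabilities of the standard normal distribution. The short Python sketch below (ours, using only the standard library) computes the values quoted in this section.

from statistics import NormalDist

z = NormalDist()          # standard normal distribution
print(z.inv_cdf(0.975))   # about 1.96: two-sided test, 95% confidence (2.5% in each tail)
print(z.inv_cdf(0.95))    # about 1.645: one-sided test, 95% confidence (5% in one tail)
print(z.inv_cdf(0.995))   # about 2.576: two-sided test, 99% confidence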

The 196-person sample has mean 7.3 and standard deviation 2.8. Does this sample provide sufficient evidence to substantiate the claim that mean satisfaction increased? To find out, the manager creates a one-sided range: he assumes the population mean is the null hypothesis mean, 6.7, and finds the range that contains the lower 95% of all sample means. To find this range, all he needs to do is calculate its upper bound. Below what value would 95% of all sample means fall? To find out, we use what we know about the cumulative probability under the normal curve: a cumulative probability of 95% corresponds to a z-value of 1.645. Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the z-value corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail. For a one-sided test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is excluded from the upper tail. We now have all the information we need to find the upper bound on the range of likely sample means. note1014 The rejection region is everything above the value 7.0. The sample mean falls in the rejection region, so the manager rejects the null hypothesis. He is confident that customer satisfaction is higher. summary When we want to test for change in a specific direction, we use a one-sided test. Instead of finding a range containing 95% of all sample means centered at the null hypothesis mean, we find a one-sided range. We calculate its endpoint using the cumulative probability under the normal curve. note1015 The Excel Utility link below allows you to perform hypothesis tests for single populations. Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the theater manager's example. A single-population hypothesis test tests a claim using a sample from a single population. With a plan in mind, you take a look at Leo's sample data. You are ready to analyze the impact of the changes Leo has made to his restaurant operations. You draw a table to organize the data from your sample on daily guest spending on restaurant food. One change Leo made to his restaurant operations was to hire a new chef. He wants to know

whether average restaurant spending per guest has changed since she took over the menu and the kitchen. This is a clear case for a hypothesis test. Last year's average spending on food per person was $55; this gives you an initial value for the mean. Leo wants to know if mean spending has changed, so you use a two-sided test. You jot down your null hypothesis, which states that the average revenue per guest is still $55. If the null hypothesis is true, the difference between the sample mean of $64 and the initial value of $55 can be accounted for by chance. You add the alternative hypothesis to your notes. Next, you assume that the null hypothesis is true: the population mean is $55. Now you need to construct a range of likely sample means around $55 and ask: does the sample mean of $64 fall within that range? Or does it fall in the rejection region? Leo didn't specify what level of confidence he wanted for your results. You call him for clarification. I suppose a 95% confidence level is okay. I'd like to be more confident, of course. After you point out that higher confidence would reduce his chances of being able to substantiate a change in spending if a change has taken place, he agrees to 95%. You pull out your trusty calculator and get ready to compute a range around the null hypothesis mean of $55. Consulting your notes, you find the correct formula: You find the range containing 95% of all sample means. Its endpoints are: [$49.12; $60.88] This is the correct answer. The z-value for 95% confidence in a two-sided test is 1.96. You pause for a moment to reflect on the interpretation of this range. Suppose the null hypothesis is true. Then 19 out of 20 samples of this size from the population of hotel guests would have means that would fall in the calculated range. The sample mean of $64 falls outside of this range. You and Alice report your results to Leo. Looks like hiring that chef was a good decision. The evidence suggests that mean spending per person has increased. I'm glad to hear it. Now what about renovating the bar? Can you run a similar test to see if that has

affected average beverage spending? Leo emphasizes that he can't imagine that his investments in the bar could have reduced average beverage spending per guest. He wants to know if spending has gone up. You decide to do a one-sided test. First, you write down all of Leo's data, along with the hypotheses: You need to find an upper bound such that 95% of all sample means are smaller than it. To do so, you use a z-value of 1.645. The upper bound is $24.29. What is the correct interpretation of this number? Given that the null hypothesis is true and the population mean is $21, 19 out of 20 samples have means LESS than $24.29.

This is the correct answer. $24.29 is an upper bound: 95% of all sample means collected from a population with the null hypothesis mean fall below $24.29.
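The text does not report the sample size or standard deviation behind the beverage test, only that the upper bound works out to $24.29, which implies a standard error (s divided by the square root of n) of about $2. The Python sketch below (ours) uses a purely hypothetical pair, s = $20 and n = 100, chosen only because it gives that standard error; any pair with s/sqrt(n) = 2 reproduces the same bound.

import math

mu_0 = 21.0        # null hypothesis mean: last year's average beverage spending per guest
s, n = 20.0, 100   # hypothetical values chosen so that s/sqrt(n) = 2, matching the $24.29 bound
z = 1.645          # one-sided test, 95% confidence

upper_bound = mu_0 + z * s / math.sqrt(n)
print(upper_bound)         # about 24.29
print(24 <= upper_bound)   # True: the sample mean of $24 is NOT in the rejection region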

The range of likely sample means contains the collected sample mean of $24. This tells you that: The null hypothesis should NOT be rejected. This is the correct answer. The difference between the sample mean and the population mean may well be due to the randomness of the sample. There is not enough evidence to reject the null hypothesis. When you present your full report to Leo, he appears confused and disappointed. How is this possible? Why hasn't renovating the bar increased revenues? Even if the frozen drink machine didn't pay off, shouldn't the increase in seats have helped? First of all, we haven't concluded that average revenue has not increased. We just can't be sure that it has. The fact that our sample mean is $24 vs. $21 last year does not allow us to say anything definitive about the change in average beverage revenue. Remember, we set out to substantiate our hypothesis that spending has improved. Based just on this sample, we are unable to conclude that spending has increased. You added seats and now more people can be seated in your lounge. But a greater number of guests does not necessarily translate into more spending per person. That does make a lot of sense.

Your overall revenues may have actually increased, because more guests can be seated in the lounge. Gosh, I'm glad to hear that. For a moment there, I thought I had made a really bad investment. I'm quite optimistic I'll see a jump in total beverage revenues in the consolidated report at the end of the year. Why don't we go fill three of those new seats right now? Exercise1 Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks, and advertises that these pretzels contain an average of 112 calories per serving. In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true. Somewhat to their surprise, the researchers found that the average calorie content in a sample of 32 bags was 102 calories per serving. The standard deviation of the sample was 19. Blanche would like to know if the calorie content of Oma's pretzels really has changed, so she can market them appropriately. With 99% confidence, do these data indicate that the pretzels' calorie content has changed? Yes. This is the correct answer. The data indicate that the null hypothesis should be rejected. The calorie content has probably changed. You begin any hypothesis test by formulating a null and an alternative hypothesis. The null hypothesis states that the population mean is equal to the initial value. In this problem, the null hypothesis is that the caloric content in the actual population is what Oma's has always advertised. The alternative hypothesis should contradict the null hypothesis. For a two-sided test, the alternative hypothesis simply states that the mean does not equal the initial value. A two-sided test is more appropriate in this problem, since Blanche only wants to know if the mean calorie content has changed. You assume that the null hypothesis is true and construct a range of likely sample means around the population mean. Using the data and the appropriate formula, you find the range [103; 121]. The sample mean of 102 falls outside of that range, so you can reject the null hypothesis. Blanche can be 99% confident that the population mean is not 112.
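A brief Python sketch (ours, not part of the exercise) of the numbers behind Blanche's test:

import math

mu_0, s, n = 112, 19, 32   # advertised calories, sample standard deviation, sample size
z = 2.576                  # two-sided test, 99% confidence

half_width = z * s / math.sqrt(n)
print(mu_0 - half_width, mu_0 + half_width)  # about 103.3 and 120.7, i.e. roughly [103, 121]
# The sample mean of 102 falls below 103.3, so the null hypothesis is rejected.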

Why might Blanche have chosen a 99% confidence level rather than the more typical 95% level for her test? She feels that it would be very costly to change her marketing campaign if there is in fact no change in the average number of calories. This is the correct answer. A high confidence level decreases our chance of erroneously rejecting the null hypothesis. In this case, Blanche wants to minimize the chance of saying that the caloric content has changed if it really is still 112 calories per serving. Exercise2 The Clearwater Power Company produces electrical power from coal. A local environmental group claims that Clearwater's emissions have raised sulfur dioxide levels above permissible standards in Blue Sky, the town downwind of the plant. According to Environmental Protection Agency standards, an acceptable average sulfur dioxide level is 30 parts per billion (ppb). As Clearwater's PR consultant, you want to defend the company, and you try to anticipate the environmentalists' argument. The environmental group collects 36 samples on randomly selected days over the course of a year. It finds a mean sulfur dioxide content of 35 ppb with a standard deviation of 24 ppb. The environmentalist group will use a hypothesis test to back up its claim that the sulfur dioxide levels are higher than permitted. Which of the following is an appropriate null hypothesis for this problem?

The average sulfur dioxide level is no higher than 30 ppb, the EPA's standard of acceptability. This is the best answer. The null hypothesis states the conventional wisdom: that the population mean of the population under investigation (the sulfur dioxide concentration of the air in Blue Sky) is less than or equal to 30 ppb, the acceptability standard for which the EPA does not require a remedy. The environmentalists will pose as the alternative hypothesis the claim they are trying to substantiate: that Blue Sky's levels exceed the acceptable standard. The environmentalists' claim is that sulfur dioxide levels are higher, so they will want to run a one-sided test. The alternative hypothesis states that the sulfur dioxide levels are above the accepted standard. We assume they will choose a 95% confidence level. What is the range of likely sample means? All values below 36.58 ppb. They calculate the one-sided range around the null hypothesis mean that contains 95% of all sample means. The z-value for a one-sided 95% range is 1.645. The upper bound on the range of likely sample means is 36.58 ppb. note1016
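The 36.58 ppb bound follows directly from the one-sided formula; a small Python sketch (ours) of that calculation:

import math

mu_0, s, n = 30, 24, 36   # EPA standard, sample standard deviation, number of samples
z = 1.645                 # one-sided test, 95% confidence

upper_bound = mu_0 + z * s / math.sqrt(n)
print(upper_bound)        # about 36.58 ppb
# The sample mean of 35 ppb falls below this bound, so the null hypothesis is not rejected.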

Based on your calculations, you should: Do not reject the null hypothesis. This is the correct answer. 35 ppb falls within the range of likely sample means. At a 95% confidence level, these sample data do not provide enough evidence to reject the null hypothesis.

Exercise3 You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its former capacity. If the machine isn't working at top capacity, you will need to have it replaced. Which type of hypothesis test is most appropriate for this problem? One-sided test. This is the best answer. You want to know if the machine's performance has been impaired, not simply if the performance has changed. The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340 Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44. You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, you should:

Have the machine replaced. This is the correct answer. The lower output in the sample hours you observed was not due solely to chance: the sample mean falls below the lower bound of the one-sided range of likely sample means around the null hypothesis mean. You can be 95% confident that the machine's performance has been impaired. The null hypothesis is that µ ≥ 340. The alternative hypothesis is that µ < 340, since you are using a one-tail test and you are assuming that the new population mean is lower than the population mean before the flood. Identify the relevant values. The sample size n=32. The standard deviation s=44. The appropriate

z-value is 1.645 if you want to capture 95% of all sample means in a one-sided range around the null hypothesis mean. Use the formula and calculate the lower bound, 327. The sample mean of 318 falls well outside of the calculated range of likely sample means. You accept this as strong evidence against the null hypothesis, substantiating the alternative hypothesis that the mean output rate has dropped. You should replace the machine. Single Population Proportions Happy with your work on restaurant spending, Leo jumps right into the next problem. "It's not just the revenue of the restaurants that I care about," Leo says, "It's also my guests' satisfaction with their restaurant experience." When I go out to eat, I expect more than just excellent food. The whole dining experience is essential — everything from the service, to the décor, to the design and quality of the silverware. And it's not just that all of these factors must be excellent individually — they have to fit together. The restaurant has to have ambiance! I'm sure my guests have similar expectations, and I want to be sure my restaurant meets them. Since my new chef introduced more sophisticated cuisine, I made some changes to the décor that I think have improved the ambiance. It took me a long time and a substantial amount of money to get everything right, but I'm pleased with the result: the restaurants are elegant and distinctly Hawaiian. Just like the new chef's cuisine. In the past, I've contracted a local market research firm to conduct surveys, asking guests to rate the Kahana's restaurants' ambiance on a scale of one to five. Historically, the percentage of people that rated ambiance the top score of 5 gave me a good idea of how well we were doing. That percentage has been very high: 72%. I've collected this year's data for you. Can you figure out if my guests are happier with my restaurants' ambiance? Alice tells you that testing Leo's claim about a proportion will be very similar to testing a mean. Often the summary statistic we want to make a claim about is a proportion. How do we test a hypothesis about a population proportion instead of a population mean? We know from our work with confidence intervals that the processes for estimating population proportions and population means are virtually identical. Similarly, hypothesis tests for proportions are much like hypothesis tests for means. Because we are examining a population proportion instead of a population mean, we use slightly different notation: we use a lower case p to represent the population proportion in place of µ for a population mean. We construct a hypothesis test to test a claim about the value of p.

Again, we formulate null and alternative hypotheses. Based on conventional wisdom or past experience, we have an initial understanding of the population proportion. The null hypothesis for a proportion test states the initial understanding. For example, in a two-sided test, the null hypothesis asserts that the population proportion, p, is equal to the initial value we had in mind. The alternative hypothesis is the claim we are using the hypothesis test to substantiate. The alternative hypothesis typically states the opposite of the null hypothesis: it states that our initial understanding is incorrect. As with population means, we collect a random sample and calculate the sample proportion, "p bar." However, for a hypothesis test about a population proportion, we don't need to calculate a standard deviation from the sample. Statistical theory tells us that σ, the standard deviation of the population proportion, is the square root of [p*(1 - p)]. Since we always start the test assuming the null hypothesis is true, we will calculate σ using the null hypothesis proportion. Analogously to population mean tests, we create a range of likely sample proportions around the null hypothesis proportion. To create the range, we substitute for σ, the standard deviation of the underlying null hypothesis population. note1017 If our sample proportion falls outside the range of likely sample proportions, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. SUMMARY In a hypothesis test for population proportions, we assume that the null hypothesis is true. Then, we construct a range of likely sample proportions around the null hypothesis proportion. If the sample proportion we collect falls in the rejection region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. note1018 Once you understand hypothesis testing for means, using the same techniques on proportions is easy. By now, you're familiar with the concept of testing a hypothesis. You recognize that Leo's restaurant ambiance problem calls for a hypothesis test for a population proportion. Leo wants you to find out if the proportion of his guests that rate restaurant ambiance "excellent" has increased. Historically, that population proportion has been 0.72. Since Leo wants to see if there has been positive change, you do a one-sided test. The appropriate pair of hypotheses is: Null hypothesis p ≤ 0.72, alternative hypothesis: p > 0.72

You are doing a one-sided test to see if the proportion of guests rating the restaurant "excellent" has increased. The alternative hypothesis states that the proportion has increased, and the null hypothesis states that it has not increased. You look at Leo's data. The sample proportion is 0.81 and the sample size is 126. But what about the standard deviation?

You have enough information to calculate the standard deviation. This is the correct answer. For proportions, you can calculate the standard deviation using the null hypothesis proportion. Here's how you find the standard deviation for a proportion problem: Using the appropriate formula, you calculate the standard deviation to be 0.45. Leo wanted you to use a 95% confidence level. Now you're ready to construct a range of likely sample proportions around the null hypothesis value of the population proportion: 0.72. Find the range of likely sample proportions around the null hypothesis proportion, and formulate a short answer for Leo. The evidence supports Leo's claim that the proportion of guests rating the restaurant ambiance "excellent" has increased. A one-sided test calls for a one-sided range of likely sample proportions. You need to find the upper bound for this range such that the range captures the lower 95% of the sample proportions. The z-value for a one-sided 95% confidence level is 1.645. Substitute the null hypothesis proportion, 0.72, for p. The upper bound for the range containing the lower 95% of all sample proportions is 0.78. Since the sample proportion 0.81 falls in the rejection region, you reject the null hypothesis. The data provide sufficient evidence that the population proportion has, in fact, increased. Alice presents your findings to Leo, telling him that with 95% confidence, the data you collected indicate that the difference between the historical population proportion and the proportion of the random sample is not due to chance. The proportion of your guests that rate the restaurants' ambiance as "excellent" has increased.
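A short Python sketch (ours) of the ambiance test. The unrounded upper bound is about 0.786; the course reports it as 0.78, and either way the sample proportion of 0.81 falls in the rejection region.

import math

p_0, p_bar, n = 0.72, 0.81, 126   # null hypothesis proportion, sample proportion, sample size
z = 1.645                         # one-sided test, 95% confidence

sigma = math.sqrt(p_0 * (1 - p_0))            # about 0.45, from the null hypothesis proportion
upper_bound = p_0 + z * sigma / math.sqrt(n)
print(upper_bound)                # about 0.786; the sample proportion 0.81 exceeds this bound,
                                  # so the null hypothesis is rejected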

Exercise1 Luther Lenya, the new product guru of The Ventura Automotive Insurance Company, is considering marketing a special insurance package to members of certain professional groups. In particular, Luther wants to create a special package for health professionals. To find out what rate to charge for this package, Luther conducts a preliminary study to see if health professionals are less likely to be involved in car accidents than the rest of his customer base. If the data indicate that health professionals are less likely to be involved in car accidents, then Ventura can offer health professionals a lower, more competitive rate.

In the past 5 years, 8.3% of Ventura's customers have been involved in accidents. Which of the following is the correct pair of hypotheses for solving Luther's problem? Null hypothesis p ≥ 8.3%; Alternative hypothesis p < 8.3%. This is the correct answer. Luther wants a one-sided test, because he wants to know if medical professionals are better drivers. The alternative hypothesis should state that medical professionals are less likely to be in accidents.

A sample of 240 customers in the health profession reveals that 12 (5.0%) have had accidents. If he uses a 95% confidence level, which of the following is the best conclusion Luther can come to?

The evidence suggests that health professionals are less likely to be involved in car accidents. This is the best answer. The range of likely sample proportions around the null hypothesis proportion does not contain the sample proportion, so we can reject the null hypothesis. With 95% confidence, the proportion of health professionals involved in car accidents is lower than the proportion of Ventura's population of drivers. You need to find a range of likely sample proportions. To find this range, you calculate a standard deviation. The standard deviation is 0.28. For a one-sided test, a confidence level of 95% corresponds to a z-value of 1.645. The lower bound of this range is 0.054 = 5.4%. The range of likely sample proportions does not contain 5.0%, so you should reject the null hypothesis. With 95% confidence, the proportion of health professionals involved in car accidents is lower than the proportion of the overall population of drivers. P-Values After sleeping on your analysis of restaurant operations, Leo seems unsatisfied. Don't get me wrong, I appreciate your hard work. But look here: these hypothesis tests result in a "reject/don't reject" decision. If I understand you correctly, it doesn't matter how close to the border of the rejection region our sample statistic falls: "reject" is "reject." But can't you tell me more? I want to know how strong the evidence against the null hypothesis is, not just if it is strong enough. I'm glad you brought that issue up, Leo. We have a second method of doing hypothesis tests, one that provides a measure of the strength of the evidence. The evening before, Alice had acquainted you with p-values: "We can use the p-value method of hypothesis testing to make 'reject/not reject' decisions in the same way we have been doing all along. But the p-value also measures the strength of evidence against a null hypothesis." In hypothesis tests we've done so far, we first chose the confidence level of the test. The confidence level tells us the significance level of the test, which is simply 1 minus the confidence level. Typically, we chose a 5% significance level — a 95% confidence level — as our threshold value for rejection. Assuming that the null hypothesis is true, we reasoned that certain sample mean values are less likely to appear than others. If the mean of the sample we collected was sufficiently unlikely to appear (that is, less than 5% likely), we considered the null hypothesis implausible and rejected it. Now, rather than simply checking whether the likelihood of collecting our sample is above or

below our chosen threshold, we'll ask: if the null hypothesis is true, how likely is it to choose a sample with a mean at least as far from the null hypothesis mean as the sample mean we collected? The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that falls at least a certain distance from the null hypothesis mean. In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we reject our null hypothesis. The p-value does more than simply answer the question of whether or not we can reject the hypothesis. It also indicates the strength of the evidence for rejecting the null hypothesis. For example, if the p-value is 0.049, we barely have enough evidence to reject the null hypothesis at the 0.05 level of significance; if it is 0.001, we have strong evidence for rejecting the hypothesis. Let's look at an example. Recall the movie theater manager who wanted to know if the average satisfaction rate for his clientele had changed from its historical rate of 6.7. To find out, we constructed the range, 6.3 to 7.1, which would have contained 95% of the sample means if the null hypothesis had still been true. Since the mean of the sample of current moviegoers we collected, 7.3, fell outside of that range, we rejected the null hypothesis. Because 7.3 fell in the rejection region, we know that the likelihood of collecting a sample mean as extreme as 7.3 is less than 5% if the null hypothesis is true. Now let's find out exactly how unlikely it is by calculating the p-value. Calculating the p-value is a little tricky, but we have all the tools we need to do it. Recall that for samples of sufficient size, the sample means of any population are distributed normally. To calculate the likelihood of a certain range of sample mean values — in our example, sample mean values greater than 7.3 or less than 6.1 — we just need to find the appropriate area under the distribution curve of the sample means. To calculate the p-value for this two-sided test, we want to find the area under the normal curve to the right of 7.3 and to the left of 6.1. The standard deviation in this example is 2.8, and the sample size is 196. We can calculate this probability by first calculating the z-value associated with the value 7.3. That z-value is 3. Then, we find the probability of having a z-value less than -3 or greater than 3. The area to the left of the z-value of -3 is 0.00135. The area to the right of the z-value of +3 is the same size, so the total area is 0.0027. That is our p-value. These areas and the p-value can be found in Excel using the NORMSDIST(-3) function, in the z-table, or with the Excel utility

provided. Our p-value calculation tells us that the probability of collecting a sample mean at least as far from 6.7 as 7.3 is 0.0027. The p-value is lower than 0.05. Thus, at a significance level of 0.05, we would reject the null hypothesis and conclude that moviegoers' average satisfaction rating is no longer 6.7. But the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at 0.0027, a much lower significance level. In other words, we can reject the null hypothesis with 99.73% confidence. In general, the lower the p-value, the higher our confidence in rejecting the null hypothesis. One-sided hypothesis tests are also easily conducted with p-values. For one-sided tests, the p-value is the area under one side of the curve. In our movie theater example, if the alternative hypothesis states that the population mean is larger than 6.7, the p-value is the area under the normal curve to the right of the sample mean of 7.3. Summary The p-value measures the strength of the evidence against the null hypothesis. It is the likelihood, assuming that the null hypothesis is true, of collecting a sample mean at least as far from the null hypothesis mean as the sample actually collected. We compare the p-value to the threshold significance level to make a reject/not reject decision. The p-value also tells us how comfortable we can be with that decision. Note 1019 Now Alice explains the basics of p-values to Leo, so you can present the results of your restaurant revenue hypothesis test again. This time, you'll be able to give Leo an idea of how strong the statistical evidence is. Leo wants you to complete the p-value hypothesis test right there in his office. You're a little nervous — you've never had a client peering over your shoulder when you work. But you oblige him, because you're growing more confident of your statistical skills. Looking back at your notes on the problem, you find the data and the hypotheses. You make a mental note that you are doing a two-sided test to see whether or not average spending on food has changed from its historical level of $55. An eager Leo interrupts your thought process: When you ran the hypothesis test earlier I had you use a 95% confidence level. That corresponds to a significance level of 0.5, right? You politely respond: I'm sorry, but I don't think that's right. Good choice. Leo is still a little confused, but you bring him up to speed. To find the significance level corresponding to a confidence level of 95%, simply subtract 95%

from 100%, and convert into decimal notation: 0.05. After you clarify Leo's mistake, he sits back and lets you finish your analysis without further interruption. First, you find the appropriate z-value. Enter the z-value as a decimal number with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

The correct z-value is 3.00, corresponding to a right-tail probability of 0.00135. You: Double that probability to find the p-value. This is the correct answer. For a two-sided test, you calculate both tail probabilities. Your sample has a mean (x-bar = $64) that is $9 higher than the assumed population mean, $55. You want to calculate the likelihood of getting a sample mean that is at least as far from the population mean as x-bar. That likelihood is not just the tail probability to the right of the sample mean. Sample means on the other side of the normal curve are just as far from the population mean as x-bar. They must be included, too, when you calculate the p-value for a two-sided hypothesis test. Doubling the right-tail probability gives you the correct p-value: 0.0027. Alice summarizes your results for Leo. All we have to do is compare the p-value to the significance level. The p-value 0.0027 is less than the significance level 0.05. Our data are statistically significant at the 0.05 level. Just as we calculated earlier by constructing a range around the null hypothesis mean, the p-value method suggests that we reject the null hypothesis. With 95% confidence, average food spending per guest has changed. But now we can also see that the evidence is very strong, because the p-value is much lower than the significance level. We can claim that food spending has changed at the 0.0027 level of significance. Thanks, you two. I feel much more comfortable concluding that average guest spending in my restaurant has changed. In the following exercise you will revisit an earlier problem, this time solving it with the p-value method. Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks. Each bag of pretzels contains one serving, and Oma's advertises that the pretzel snacks contain an average of 112 calories per serving. In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true. The researchers found that the average calorie content in a sample of 32 bags was 102 calories per serving. The standard deviation of the sample was 19. Blanche would like to know if the calorie content of Oma's pretzels has really changed, so she can

market them appropriately. At the significance level 0.01, do these data indicate that the pretzels' calorie content has changed? Yes. This is the best answer. The data indicate that the null hypothesis should be rejected. The calorie content has probably changed. In this problem, the null hypothesis is that the actual population mean is what Oma's has always advertised. A two-sided test is more appropriate in this problem, since Blanche only wants to know if the mean calorie content has changed. Assuming that the null hypothesis is true, you find a z-value for the sample mean of 102 using the appropriate formula. The z-value is -2.98. Using the Excel NORMSDIST function or the Standard Normal Table, you can find the corresponding left-tail probability of 0.0014. For a two-sided test, you double this number to find the p-value, in this case 0.0028. Since this p-value is less than the significance level, you can reject the null hypothesis. Moreover, you now can say that you are rejecting the null hypothesis at the 0.0028 level of significance. You can recommend to Blanche that she have the labeling changed on the pretzel bags, and adjust her marketing accordingly. Comparing two populations Now satisfied with your analysis of the restaurant, Leo asks you to compare the discretionary spending habits of two categories of guests: leisure and business. Every hotel manager wrestles with the problem of stretching limited marketing resources. I want to make sure that I'm wisely allocating each marketing dollar. Leisure guests, such as tourists and honeymooners, are especially attracted to Hawaii. Also, many professional associations like to have their conventions here, so our islands attract business travelers, who mix business and pleasure. Business travelers pay lower room prices because conferences book rooms in bulk. Bulk reservations are good for me because they keep my occupancy levels high. However, I don't have a good sense of whether the discretionary spending of my business guests is different from that of my leisure guests: they may take fewer scuba lessons but use the spa services more, for example. Can you help me figure out whether there is any significant difference between leisure and business travelers' discretionary spending habits? Your conclusions might influence my marketing efforts. I collected two random samples: one of leisure guests and one of business guests. Not including room, meal, and beverage charges, leisure travelers spent an average of $75 a day, compared to $64 a day for the business travelers.

I knew that the difference between the two averages of the two samples could be due to chance, so I thought I'd have you do a hypothesis test to find out. When I was compiling the data for you, I realized that my samples were of different sizes. I was able to get 85 leisure guests to respond, but only 76 business guests returned my survey. Which figure will you use as the sample size? Or will you add them together? I also realized that with these data, you'd have to calculate two sample standard deviations, one for each sample. How do you go about solving a problem like this? How do you test whether two populations have different means? So far, we've used hypothesis tests to study the mean or proportion of a single population. Often, managers want to compare the means or proportions of two different populations: in this case, we use a two-population hypothesis test. Let's clarify when we use each type of test. We conduct single-population tests when we have an initial value for a population mean and want to test to see if it is correct. Single population tests are especially useful when we suspect that the population mean has changed. For example, we use a single-population test when we know the historical average of a population and want to test whether that historical average has changed. We conduct two-population tests to compare a characteristic of two groups for which we have access to sample data for each group. For example, we'd use a two-population test to study which of two educational software packages better prepares students for the GMAT. Do the students using package 1 perform better on the GMAT than the students using package 2? In two-population tests, we take two samples, one from each population. For each sample, we calculate the sample mean, standard deviation, and sample size. We can then use the two sets of sample data to test claims about differences between the two populations. For example, when we want to know whether two populations have different means, we formulate a null hypothesis stating that the means are not different: the first population mean is equal to the second. Let's look at the GMAT software package example more closely. The manager of one educational software company might wonder if the average GMAT score of students using her software is different from the average GMAT score of students using the competitor's software. Since the manager only wants to test if the average GMAT scores are different, she conducts a two-sided hypothesis test for two populations. The null hypothesis states that there is no difference between the average GMAT scores of the students who use the two companies' software. The alternative hypothesis states that the average GMAT scores of the students who use the two

companies' software are different. We denote the average scores of the two populations by the Greek letter mu and distinguish them with subscripts. Our hypotheses are H0: mu1 = mu2 and Ha: mu1 ≠ mu2. To be 95% confident in the result of the test, we use a significance level of 0.05. We collect two samples, one from each population. We denote the sample means with the familiar x-bar, which we again distinguish with subscripts. We are able to collect the GMAT scores of 45 people who used the company's software, and 36 people who used the competitor's software. As we will see shortly, the different sample sizes will not pose a problem. The respective sample means are 650 and 630, and the standard deviations are 60 and 50. Could the two random samples we picked just happen to have different means by chance but really have come from populations that have the same population means? The null hypothesis states that there is no difference in the two population means. As with single-population tests, we test the null hypothesis by asking how likely it would be to produce the sample results if the null hypothesis is in fact true. That is, if the average GMAT scores for students using the two different software packages actually are the same, what is the chance that two samples we collect would have sample means as different as 650 and 630? Our intuition tells us that the greater the difference between the means of the two samples, the more likely it is that the samples came from different populations. But how do we know when the numerical difference is large enough to be statistically significant? When do we have enough evidence to actually conclude that the two populations must be different? We use p-values to answer this question. First we calculate a z-value for the difference of the sample means, incorporating the data from both populations. It looks a bit complicated: z = [(x-bar1 - x-bar2) - (mu1 - mu2)] / sqrt[(s1^2/n1) + (s2^2/n2)], where s1 and s2 are the sample standard deviations and n1 and n2 are the sample sizes. Let's compute the z-value for our example. Since we assume that the null hypothesis is true, the hypothesized difference mu1 - mu2 is zero. NOTE 1020 Using the formula, we find that the z-value is 1.64. For a two-sided test, a z-value of 1.64 translates into a probability in one tail of 0.05, and thus a p-value of 0.10. Since this p-value is greater than the significance level of 0.05, we cannot reject the null

hypothesis. In other words, the high p-value tells us that there is insufficient evidence from the two samples to conclude that the average GMAT score of the students who use the company's software is different from the average GMAT score of students who use the competitor's software. Two-population hypothesis tests can be performed using the formula shown above, or you can click here to access the Excel utility for hypothesis testing.

Summary

In a hypothesis test for two population means, we assume a null hypothesis: that the two population means are equal. We collect a sample from each population and calculate its sample statistics. We calculate a p-value for the difference between the two sample means. If the p-value is less than the significance level, we reject the null hypothesis. NOTE 1021
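To make the arithmetic above concrete, here is a short Python sketch that is not part of the original course materials; the helper name two_mean_ztest is ours, and it uses the unpooled standard error assumed above, which matches the z-value of 1.64 reported for the GMAT example.

```python
from math import sqrt
from statistics import NormalDist

def two_mean_ztest(mean1, s1, n1, mean2, s2, n2):
    """Two-sided z-test for the difference between two population means.

    Assumes the null hypothesis mu1 - mu2 = 0 and uses the unpooled
    standard error sqrt(s1^2/n1 + s2^2/n2).
    """
    se = sqrt(s1**2 / n1 + s2**2 / n2)            # standard error of the difference
    z = (mean1 - mean2) / se                      # hypothesized difference is 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p_value

# GMAT example: company's students vs. competitor's students
z, p = two_mean_ztest(650, 60, 45, 630, 50, 36)
print(round(z, 2), round(p, 2))   # roughly 1.64 and 0.10: do not reject at the 0.05 level
```

The same numbers can, of course, be reproduced in Excel with NORMSDIST, as the course does for single-population tests.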

Often, managers want to know if two population proportions are equal. For example, a marketing manager of a packaged snack foods company might want to compare the snack food habits of different states in the US. The marketing manager might think that the proportion of consumers who favor potato chips in Texas is different from the proportion of consumers who favor potato chips in Oklahoma.

Comparing two population proportions is similar to comparing two population means. We have two populations: the null hypothesis states that their proportions are the same; the alternative hypothesis states that they are different. We collect a sample from each population and calculate its sample size and sample proportion. As in the single-population proportion test, we don't need to find the sample standard deviation, since we know that the population standard deviation is the square root of [p*(1 - p)]. Similarly to the hypothesis tests for comparing two population means, we calculate a z-value for the difference between the proportions using the following formula: z = (p1 - p2) / sqrt[p1*(1 - p1)/n1 + p2*(1 - p2)/n2], where p1 and p2 are the sample proportions and n1 and n2 are the sample sizes. We translate the z-value into a p-value just as we would for any other type of hypothesis test. If the p-value is less than our significance level, we reject the null hypothesis and conclude that the proportions are different. If the p-value is greater than the significance level, we do not reject the null hypothesis.

Let's take a closer look at the study of snacking habits in Texas and Oklahoma. The manager does not wish to test for a particular direction of difference; he just wants to know if the proportions are different. Thus, he should use a two-sided test. The marketing manager wants to be 95% confident in the result of this test, so the significance level is 0.05.
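For readers who want to check the potato chip example that follows in code, here is a brief Python sketch that is not from the original text; the helper name two_prop_ztest is ours, and it uses the unpooled standard error shown above, which is consistent with the z-value of 2.48 worked out below.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two population proportions.

    Null hypothesis: the population proportions are equal (difference = 0).
    Uses the unpooled standard error described in the text.
    """
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Texas vs. Oklahoma potato chip example from the passage below
z, p = two_prop_ztest(0.45, 400, 0.35, 225)
print(round(z, 2), round(p, 3))   # roughly 2.48 and 0.013: reject at the 0.05 level
```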

Suppose we collect responses from 400 people in Texas and 225 people in Oklahoma. The sample proportions are 45% and 35%, respectively. Could the two random samples we picked just happen to have different sample proportions? That is, if the true proportions of Texans and Oklahomans favoring potato chips actually are the same, what would be the chance that the sample proportions are 45% and 35% respectively? We use p-values to answer this question. First, we calculate a z-value for the difference of the sample proportions that incorporates the data from both populations. The null hypothesis states that the population proportions are equal, so their difference is 0. The z-value is 2.48. For a two-sided test, a z-value of 2.48 translates into a probability in one tail of 0.0065 and hence a p-value of 0.013. Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis. In other words, the low p-value tells us that there is sufficient evidence from the samples to conclude that there is a difference between the proportions of Texan and Oklahoman potato chip lovers. We can make this claim at a 0.013 level of significance. Two-population hypothesis tests for population proportions can be performed using the formula shown above, or you can click here to access the Excel utility for hypothesis testing.

In a hypothesis test for two population proportions, we assume a null hypothesis: the two population proportions are equal. We collect two samples and calculate the sample proportions. We calculate a p-value for the difference between the sample proportions. If the p-value is less than the significance level, we reject the null hypothesis. NOTE 1022

Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the GMAT and potato chip examples.

Two-population hypothesis tests help you determine whether two populations have different means. You use a two-population test to solve Leo's problem. You have to find out if leisure guests' average daily discretionary spending is different from business guests' average daily discretionary spending. Leo has provided the data from his two samples. Now it's time to state the null hypothesis. The best formulation is:

There is no difference between business and leisure guests' mean spending. This is the best answer. You want to know if two means are different, not if they differ in one particular direction. If Leo had asked you to conduct a test to learn only if business guests' spending was greater than that of leisure guests, a one-sided formulation of the hypotheses would be correct.

Regression Basics

Introduction

As you relax in your room during a brief afternoon downpour, your phone rings. Leo just called. He wants us to come to his office immediately. He sounds a little angry. We'd better not keep him waiting.

I'm sorry if I was short on the phone. I'm very upset. We just had a little incident down in the restaurant. A server spilled a tureen of crab bisque on one of our most "favored" guests, Mr. Pitt. The Kahana's occupancy this year has been higher than I expected, and I had to hire extra help from a staffing agency. Those staffing agencies charge a fortune, which is especially irritating considering that the employees they refer to us are often poorly suited to customer service in an upscale hotel. Really, this is my fault for not having a more effective staffing process. I just wish I could predict my needs better. Sometimes, when demand is lower than I expected, I'm overstaffed. Then I lose money paying idle bellhops. If I had a good sense of my staffing needs at least a month in advance, I could avoid hiring workers at the last minute and having idle staff. I had been thinking that the number of advance reservations would give me a good idea of how high my occupancy would be a month down the road. But clearly advance reservations don't tell me the whole story. I've been making way too many false predictions. Is there anything you can do to help me here? What predictions about occupancy can I make based on advance bookings? And how much can I trust them?

We'll take a look at the data on advance bookings and occupancy and let you know what we find out. Alice seems confident that the two of you can offer useful advice on Leo's staffing problem: "This will be a great opportunity for you to learn regression. It's a powerful statistical tool used all the time in business: in finance, demand forecasting, and market research, to name just a few areas. I'm sure you'll use it in your MBA program. And it's a great chance to review what you've learned so far: sampling, confidence intervals, and hypothesis testing all play a part in regression."

As we have seen, it is often useful to examine the relationship between two variables. Using

scatter diagrams, we can visualize such relationships. NOTE 1023

We can learn more about the relationship by finding the correlation coefficient, which measures the strength of the linear relationship on a scale from -1 to 1. Regression is a statistical tool that goes even further: it can help us understand and characterize the specific structure of the relationship between two variables. Let's look at an example. Julius Tabin owns a small food processing company that produces the spreadable lunchmeat product EasyMeat. Julius is trying to understand the relationship between his firm's advertising and its sales. Total sales in the spreadable meat industry have been fairly flat over the last decade, and Julius' competitors' actions have been quite stable. Julius believes that his advertising levels influence his firm's sales positively, but he doesn't have a clear understanding of what the relationship looks like. Let's have a look at data on his firm's advertising and sales over the last 10 years. Click on the Excel link to create the scatter diagram yourself from an Excel spreadsheet.

Year    Advertising ($)    Actual Sales ($)
1992    35,000             1,100,000
1993    45,000             2,105,000
1994    55,000             3,000,000
1995    55,000             2,000,000
1996    65,000             3,200,200
1997    60,000             2,699,500
1998    70,000             3,100,000
1999    75,000             2,900,000
2000    80,000             4,007,000
2001    95,000             4,300,000

Plotting annual sales against annual advertising expenditures gives us a visual sense of the relationship between the two variables. Looking at the graph, we can see that as advertising has gone up, sales have generally increased. The relationship looks reasonably linear. The correlation coefficient for the two variables is 0.93, indicating a strong linear relationship between advertising and sales. What if we were to draw a line that characterizes this relationship? Which line would best fit the data? Our mind's eye already sees how the two variables are related, but how can we formalize our visual impression? Before we start any calculations, let's look at several lines that could describe the relationship.

One of these lines most accurately describes the relationship between the two variables: the "best-fit" or regression line. In our example, the best-fit line is Sales = -333,831 + 50*Advertising. For this line, the y-intercept is -333,831 and the slope is 50. In general, a regression line can be described by a simple linear equation, y = a + bx, with y-intercept a and slope b. In this equation, the y-variable, sales, is called the dependent variable, to suggest that we think Julius' sales depend to some degree on his advertising. The x-variable, advertising, is called the independent variable, or the explanatory variable. NOTE 1024
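If you want to see where coefficients like these come from without a spreadsheet, the short Python sketch below (our addition, not part of the course) computes the least-squares slope and intercept directly from the advertising and sales table above. Because the published coefficients are rounded, expect a slope near 50 and an intercept in the neighborhood of -335,000 rather than an exact match to the quoted line.

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (one independent variable)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_mean - b * x_mean    # intercept
    return a, b

# Advertising and sales data from the table above (in dollars)
advertising = [35_000, 45_000, 55_000, 55_000, 65_000,
               60_000, 70_000, 75_000, 80_000, 95_000]
sales = [1_100_000, 2_105_000, 3_000_000, 2_000_000, 3_200_200,
         2_699_500, 3_100_000, 2_900_000, 4_007_000, 4_300_000]

a, b = least_squares(advertising, sales)
# Intercept near -335,000 and slope near 50, close to the
# Sales = -333,831 + 50*Advertising line quoted in the text.
print(round(a), round(b, 2))
```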

When we observe that a change in the independent variable (here advertising) is typically accompanied by a proportional change in the dependent variable (here sales), regression analysis can identify and formalize that relationship.

Summary

Regression analysis helps us find the mathematical relationship between two variables. We can use regression to describe a linear relationship: one that can be represented by a straight line and characterized by an equation of the form y = a + bx.

- Plot the behavior of two variables on a scatter diagram to observe patterns in their relationship.
- Use regression analysis to identify the linear relationship that best fits the data.
- The linear relationship has the form y = a + bx.
- a is the y-intercept of the line.
- b is the slope of the line.
- y is called the dependent variable and x is called the independent, or explanatory, variable.

What kinds of questions can regression analysis help answer? How does regression help us as managers? It can help in two ways: first, it helps us forecast. For example, we can make predictions about future values of sales based on possible future values of advertising. Second, it helps us deepen our understanding of the structure of the relationship between two variables by expressing the relationship mathematically. Let's talk first about how managers can use regression to forecast. In our example, regression can

help Julius predict his company's sales for a specified level of advertising. For example, if he plans to spend $65,000 in advertising next year, what might we expect sales to be? If we didn't know anything about the relationship, but only had the historical data, we might simply note that the last time Julius spent $65,000 on advertising, his sales were $3,200,200. But is this the best prediction we can make? Not at all. Regression analysis brings the entire data set to bear on our prediction. In general, this will allow us to make more accurate predictions than if we infer the future value of sales from a single observation of advertising and sales. Having identified the relationship between the two variables from the full data set, we can apply our understanding of that relationship to our forecast. Using regression analysis, we found the regression line to be Sales = -333,831 + 50*Advertising. If Julius plans to spend $65,000 in advertising, what would we predict sales to be? Around $2,900,000. This is the best answer. The point on the line shows us what level of sales to expect. In this case, we would expect sales of $2,916,169. With regression, we can forecast sales for any advertising level within the range of advertising levels we've seen historically. For example, even if Julius has never spent exactly $50,000 on advertising, we can still forecast a corresponding level of sales. NOTE 1025

We must be extremely cautious about forecasting sales for values of advertising beyond the range of values we have already observed. The further we are from the historical values of advertising, the more we should question the reliability of our forecast. For example, we might feel comfortable forecasting sales for advertising levels a bit above the observed range, perhaps as high as $100,000 or $105,000. But we shouldn't infer that if Julius spent $10 million on advertising, he would achieve $500 million in sales. The total market for spreadable meat is probably much less than $500 million annually! Likewise, we might feel comfortable forecasting sales for advertising levels just below the observed range. But we certainly shouldn't report that if Julius spent $0 on advertising he would have negative sales! If we try to use our regression equation to forecast sales for advertising levels outside of the historical range, we are implicitly assuming that the relationship between advertising and sales continues to be linear outside of the historical range.

In reality, although the relationship may be quite linear for the range of values we've observed, the curve may well level off for advertising values much lower or much higher than those we've observed. With no observations outside the historical data range, we simply don't have evidence about what the relationship looks like there. Another critical caveat to keep in mind is that whenever we use historical data to predict future values, we are assuming that the past is a reasonable predictor of the future. Thus, we should only use regression to predict the future if the general circumstances that held in the past, such as competition, industry dynamics, and economic environment, are expected to hold in the future.

Regression can be used to deepen our understanding of the structural relationship between two variables. If we think about it, many business decisions are about increasing or decreasing one variable — investments or advertising, for example — to affect some other variable — productivity, brand recognition, or profits, for example. Regression can reveal the structure of relationships of this type. NOTE 1026

Our regression analysis stipulates a linear relationship between sales and advertising. Understanding "the structure" of this relationship translates into finding and interpreting the coefficients of the regression equation. As we've noted above, the constant term -333,831 may have no real managerial significance; it just "anchors" the regression line by telling us the y-intercept. We've never seen advertising levels close to $0, so we cannot infer that spending no money on advertising would lead to sales of -$333,831! The more important term is the advertising coefficient, 50, which gives us the slope of the line. The advertising coefficient tells us how sales have changed on average as advertising has increased. In the past, when advertising has increased by $10,000, what has been the average corresponding change in sales? Sales have increased by $500,000.

Assuming that the relationship between sales and advertising is linear, each $1 increase in advertising should be accompanied by the same average increase in sales. In our example, for every incremental $1 in advertising, sales increase on average by $50. Thus, for every incremental $10,000 in advertising, sales increase on average by $500,000. The regression line gives us insight into how two variables are related. As one variable increases, by how much does the other variable typically change? How much growth in sales can we anticipate from an incremental increase in advertising expenditures? Regression analysis helps managers answer questions like these.
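To connect the forecasting and interpretation ideas, here is a small Python sketch (ours, not from the course) that uses the regression line quoted in the text to predict sales and to show the effect of an incremental $10,000 of advertising. The guard against extrapolating outside the observed advertising range is our own addition, reflecting the caution discussed above.

```python
INTERCEPT = -333_831                 # regression line quoted in the text
SLOPE = 50
OBSERVED_RANGE = (35_000, 95_000)    # historical advertising levels in the data set

def predict_sales(advertising):
    """Point forecast from the regression line, warning on extrapolation."""
    low, high = OBSERVED_RANGE
    if not (low <= advertising <= high):
        print(f"Warning: ${advertising:,} is outside the observed range; "
              "the linear relationship may not hold there.")
    return INTERCEPT + SLOPE * advertising

print(predict_sales(65_000))                          # 2,916,169, as in the text
print(predict_sales(75_000) - predict_sales(65_000))  # +$500,000 for +$10,000 of advertising
```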

Summary

We use regression analysis for two primary purposes: forecasting and studying the structure of the relationship between two variables. We can use regression to predict the value of the dependent variable for a specified value of the independent variable. The regression equation also tells us how the dependent variable has typically changed with changes in the independent variable.

Use regression analysis to understand the structure of the relationship between two variables.
Structural Relationship: y = a + bx
Use regression analysis to forecast y for a value of x within the historically observed range of x-values.
Be cautious about using regression to forecast for values beyond the historically observed range of x-values.

Exercise 1

Per-capita consumption of soft drink beverages is related to per-capita gross domestic product (GDP). Generally, the higher the GDP of a country, the more soda its citizens consume. Soft drink consumption is measured in number of 8-oz servings. Based on data from 12 countries, the relationship can be expressed mathematically as: (Per-capita soft drink consumption) = 130 + 0.018*(per-capita GDP). Based on this relationship, you can expect that, on average, for each additional $1,000 of per-capita GDP, a country's soda consumption increases by: 18 servings. The regression equation tells us that in our data set, average soda consumption increases by 0.018 servings for every additional $1 of per-capita GDP. So, for an additional $1,000, average consumption increases by ($1,000)(0.018 servings/$) = 18 servings.

The per-capita GDP in the Netherlands is $25,034. What do you predict is the average number of servings of soda consumed in the Netherlands per year? Enter predicted average soda consumption (in servings) as an integer (e.g., "5"). Round if necessary. The regression equation tells us that average soda consumption = 130 + 0.018*(per-capita GDP). Therefore, we anticipate the Netherlands' average soda consumption to be 580.6 servings. Although the regression predicts a soda consumption of around 581 servings per person for the Netherlands, the actual measured number of servings consumed is much lower: 362. The discrepancy in the actual and predicted consumption reinforces that per-capita GDP alone is not a perfect predictor of soda consumption.

A regression line helps you understand the relationship between two variables and forecast future values of the dependent variable. Alice points out to you that these two features of regression analysis make it a powerful tool for managers who make important decisions in the uncertain world of business. But how do you generate a regression line from observed data? Of all the straight lines that you could draw through a scatter diagram, which one is the regression line? Let's return to Julius Tabin's sales and advertising data. As we can see from the graph, no straight line could be drawn that would pass through every point in the data set. This is not surprising. Typically, advertising is not a perfect predictor of sales, so we don't expect every data point to fall in a perfect line. The regression line depicts the best linear relationship between the two variables. We attribute the difference between the actual data points and the line to the influence that other variables have on sales, or to chance alone.

Since the regression line does not pass through every point, the line does not fit the data perfectly. How accurately does the regression line represent the data? To measure the accuracy of a line, we'll quantify the dispersion of the data around the line. Let's look at one line we could draw through our data set. Let's consider a second line. Click on the line that more closely fits the ten data points. Although in this example we can see which of two lines is more accurate, it is useful to have a precise measure of a line's accuracy. To quantify how accurately a line fits a data set, we measure the vertical distance between each data point and the line. Why don't we measure the shortest distance between the point and the line — the distance perpendicular to the line? Why do we measure vertically? We measure vertical distance because we are interested in how well the line predicts the value of the dependent variable. The dependent variable — in our case, sales — is measured on the vertical axis. For each data point, we want to know: how close is the value of sales predicted by the line to the historically observed value of sales? From now on we will refer to this vertical distance between a data point and the line as the error in prediction or the residual error, or simply the error. The error is the difference between the

observed value and the line's prediction for our dependent variable. This difference may be due to the influence of other variables or to plain chance. Going forward, we will refer to the value of the dependent variable predicted by the line as y-hat and to the actual value of the dependent variable as y. Then the error is y - (y-hat), the difference between the actual and predicted values of the dependent variable. The complete mathematical description of the relationship between the dependent and independent variables is y = a + bx + error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the regression line plus the error, y - (y-hat). Collectively, the errors in prediction for all the data points measure how accurately a line fits a set of data. To quantify the total size of the errors, we cannot just sum each of the vertical distances. If we did, positive and negative distances would cancel each other out. Instead, we take the square of each distance and then sum all the squares, similarly to what we do when we calculate variance. This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good measure of how accurately a line describes a set of data. The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors.

Summary

To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors. To find the Sum of Squared Errors, we calculate the vertical distances from the data points to the line, square the distances, and sum the squares.

Error = vertical distance from data point to line = actual value - predicted value = y - (y-hat)
Measure of accuracy = Sum of Squared Errors

Now that you have a way to measure how well a line fits a set of data, you need a way to identify the line that "best fits" the data: the regression line. We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines will give us different Sums of Squared Errors. The line we are looking for — the regression line — is the one with the smallest Sum of Squared Errors.
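As a concrete illustration, the Python sketch below (our addition) computes the Sum of Squared Errors of any candidate line over the advertising and sales data used earlier; comparing the SSE of a few candidate lines is exactly how the next section judges which line fits best.

```python
def sum_of_squared_errors(xs, ys, intercept, slope):
    """SSE for the candidate line y_hat = intercept + slope * x."""
    sse = 0.0
    for x, y in zip(xs, ys):
        y_hat = intercept + slope * x   # value predicted by the line
        error = y - y_hat               # vertical distance: actual minus predicted
        sse += error ** 2
    return sse

advertising = [35_000, 45_000, 55_000, 55_000, 65_000,
               60_000, 70_000, 75_000, 80_000, 95_000]
sales = [1_100_000, 2_105_000, 3_000_000, 2_000_000, 3_200_200,
         2_699_500, 3_100_000, 2_900_000, 4_007_000, 4_300_000]

# The line quoted in the text has a far smaller SSE than, say,
# a flat line drawn at the level of mean sales.
print(sum_of_squared_errors(advertising, sales, -333_831, 50))
print(sum_of_squared_errors(advertising, sales, sum(sales) / len(sales), 0))
```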

Let's look at several lines that could describe the relationship between advertising and sales in our example. Our intuition tells us that the middle line is a much better fit than line a or line b. Let's check our intuition. For each line, we can calculate the Sum of Squared Errors to determine its accuracy. The lower the Sum of Squared Errors, the more precisely the line fits the data, and the higher the line's accuracy. The line that most accurately describes the relationship between advertising and sales — the regression line — is the line that minimizes the Sum of Squared Errors. Finding the regression line for a set of data is a calculation-intensive process best left to statistical software.

Summary

The line that most accurately fits the data — the regression line — is the line for which the Sum of Squared Errors is minimized.

Lower SSE → Higher Accuracy
Lowest SSE → Regression Line

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to do regression analysis using the regression tool. However, we suggest you read through the following instructions to learn how Excel's regression tool works, so you can run regressions in the future, when you do have access to the Data Analysis Toolpak. Performing regression analysis by hand is a time-consuming process. Fortunately, statistical software packages and major spreadsheet programs — Excel, for example — can do the necessary calculations for you in a matter of seconds. Click on the Excel link to access the data file so you can practice doing the analysis in Excel as you read through the instructions. Let's go through the process step by step. We start with data entered in two columns in an Excel spreadsheet. Each column contains values of a variable. To perform regression analysis, there must be an equal number of entries in each column. Under the Data tab in the toolbar we select the Data Analysis option. A window pops open containing an alphabetical list of statistical tools. We select "Regression" and click "OK". A new window opens offering several options for regression analysis. In the regression window, we see a prompt field titled "Input Y Range." In it, we enter C1:C11, the range of cells containing the column label (C1) and the data (C2:C11) for the dependent variable: Sales ($).

We repeat this for the prompt field titled "Input X Range," entering B1:B11 to include both the column label (B1) and the data (B2:B11) for the independent variable: Advertising ($). Since we included the column labels in row 1 in our ranges, we must check the "Labels" box. Including labels is helpful because Excel uses the labels to identify the variable coefficients in the output sheet. If you do not include the labels in your ranges, do not check the Labels box, or Excel will treat the first row of data as labels, excluding those entries from the regression. Finally, we select the output option "New Worksheet Ply:", enter the name for the new worksheet, and click "OK." Excel opens a new worksheet with the name we specified. In it, we see an intimidating array of data. For the moment, we are mainly interested in the entries in the cells labeled "Coefficients", which specify the intercept and slope of the regression line. Note that the label "Advertising ($)" has been carried over from the original data column. The coefficient in the "Advertising ($)" row is the slope of the regression line. For the exercises in this unit, we strongly recommend you find the relevant data in an Excel spreadsheet and perform the regression analyses yourself. If you do not have the Analysis Toolpak, you can open a file containing the relevant regression output.

Exercise 1

To practice using Excel's regression tool, run a regression using the world soft drink consumption data from an earlier exercise. Use soft drink consumption for the dependent variable and per-capita GDP for the independent variable. What is the slope of the regression line? Enter the slope as a decimal number with 3 digits to the right of the decimal point (e.g., enter "5" as "5.000"). Round if necessary.

We run the regression by selecting range C1:C13 for the Y-range, the dependent variable consumption, and B1:B13 for the X-range, the independent variable GDP per capita. We check the label box and examine the resulting output. The slope of the regression line is the coefficient of the independent variable, GDP per capita: 0.018. What is the intercept of the regression line? Enter the intercept as an integer (e.g., "5"). Round if necessary. The intercept of the line is the coefficient labeled "Intercept": 130.

Deeper Into Regression

Equipped with the basic tools needed to find and interpret the regression line, you feel ready to tackle Leo's assignment. But Alice cautions you not to be hasty and urges you to consider some tricky questions: "How well does the regression line actually characterize the relationship in the data? Is a straight line even a good descriptor of the relationship?"

How much does the relationship between advertising and sales help us understand and predict sales? We'd like to be able to quantify the predictive power of the relationship in determining sales levels. How much more do we know about sales thanks to the advertising data? To answer this question, we need a benchmark telling us how much we know about the behavior of sales without the advertising data. Only then does it make sense to ask how much more information the advertising data give us. Without the advertising data, we have the sales data alone to work with. Using no information other than the sales data, the best predictor for future sales is simply the mean of previous sales. Thus, we use mean sales as our benchmark, and draw a "mean sales line" through the data. Let's compare the accuracy of the regression line and the mean sales line. We already have a measure of how accurately an individual line fits a set of data: the Sum of Squared Errors about the line. Now we want a measure of how much more accurate the regression line is than the mean line. To obtain such a measure, we'll calculate the Sum of Squared Errors for each of the two lines, and see how much smaller the error is around the regression line than around the mean line. The Sum of Squared Errors for the mean sales line measures the total variation in the sales data. In fact, it is the same measure of variation we use to derive the standard deviation of sales. We call the Sum of Squared Errors for our benchmark — the mean sales line — the Total Sum of Squares. Here, the Total Sum of Squares is 8.01 trillion. The difference between the Total Sum of Squares and the Residual Sum of Squares, 6.88 trillion in this case, is called the Regression Sum of Squares. The Regression Sum of Squares measures the variation in sales "explained" by the regression line. Excel's regression output reports all three of these terms. A standardized measure of the regression line's explanatory power is called R-squared. R-squared is the fraction of the total variation in the dependent variable that is explained by the regression line. R-squared will always be between 0 and 1 — at worst, the regression line explains none of

the variation in sales; at best it explains all of it.

R-squared is presented either as a fraction, a percentage, or a decimal. We find that in the advertising and sales example, the R-squared value is 6.88 trillion/8.01 trillion = 0.859 = 85.9%. NOTE 1030 Equivalently, we can divide the Residual Sum of Squares by the Total Sum of Squares to find the fraction of variation left unexplained, and then we subtract the fraction of unexplained variation from 1 to obtain R-squared. Fortunately, we don't need to calculate R-squared ourselves — Excel computes R-squared and includes it in the standard regression output. In a regression that has only one independent variable, R-squared is closely related to the correlation coefficient between the independent and dependent variables: the correlation coefficient is simply the positive or negative square root of R-squared; positive if the slope of the regression line is positive and negative if the slope of the regression line is negative. NOTE 1031 Excel's regression output always computes the square root of R-squared, which it labels "Multiple R." NOTE 1032

Summary

R-squared measures how well the behavior of the independent variable explains the behavior of the dependent variable. R-squared is the ratio of the Regression Sum of Squares to the Total Sum of Squares. As such, it tells us what proportion of the total variation in the dependent variable is explained by its linear relationship with the independent variable. NOTE 1034
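For readers who prefer to verify these figures in code, here is a short Python sketch (ours, not part of the course) that computes the Total, Residual, and Regression Sums of Squares and R-squared from the advertising and sales data. It should land close to the 8.01 trillion, 6.88 trillion, and 85.9% quoted above, with any small differences due to rounding in the transcribed data.

```python
def r_squared(xs, ys):
    """Fit y = a + b*x by least squares and return (R^2, total_ss, residual_ss)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    total_ss = sum((y - y_mean) ** 2 for y in ys)                      # SSE around the mean line
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # SSE around the regression line
    return 1 - residual_ss / total_ss, total_ss, residual_ss

advertising = [35_000, 45_000, 55_000, 55_000, 65_000,
               60_000, 70_000, 75_000, 80_000, 95_000]
sales = [1_100_000, 2_105_000, 3_000_000, 2_000_000, 3_200_200,
         2_699_500, 3_100_000, 2_900_000, 4_007_000, 4_300_000]

r2, total_ss, residual_ss = r_squared(advertising, sales)
print(round(r2, 3))                      # about 0.859, i.e. 85.9%
print(total_ss, total_ss - residual_ss)  # about 8.01 trillion and 6.88 trillion
```

The positive square root of this R-squared, roughly 0.93, matches the correlation coefficient reported earlier and the "Multiple R" that Excel displays.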
