lab08


School

University of Oregon

Course

101

Subject

Statistics

Date

Apr 3, 2024


Uploaded by MasterGrouse3886 on coursehero.com

lab08
March 20, 2024

[1]: import otter
     grader = otter.Notebook()

1 Lab 8: Confidence Intervals and Characteristics of Distributions

Reading:
* Estimation
* Mean and Variability

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

In this lab, we will learn how to generate confidence intervals from samples to infer information about a population. We will also introduce some important statistics characterizing distributions.

[2]: # Don't change this cell; just run it.
     import numpy as np
     from datascience import *

     # These lines do some fancy plotting magic.
     import matplotlib
     %matplotlib inline
     import matplotlib.pyplot as plt
     plt.style.use('fivethirtyeight')
     import warnings
     warnings.simplefilter('ignore', FutureWarning)

     import otter
     grader = otter.Notebook()

1.1 1. Thai Restaurants

Ben and Frank are trying to see what the best Thai restaurant in Eugene is. They survey 1500 UO students selected uniformly at random and ask each student which Thai restaurant is the best (Note: this data is fabricated for the purposes of this homework). The choices of Thai restaurant are Sweet Basil Express, Drumrongthai, Manola’s Thai Cuisine, and Tasty Thai. After compiling the results, Ben and Frank release the following percentages from their sample:
Thai Restaurant          Percentage
Sweet Basil Express      8%
Drumrongthai             52%
Manola’s Thai Cuisine    25%
Tasty Thai               15%

These percentages come from a uniform random sample of the population of UO students. The sample suggests that UO students heavily prefer Drumrongthai to other Thai restaurants, but to what extent does this sample reflect the preference of the overall population (all UO students)? To answer this,

1. We could take more random samples and compare them to our initial sample. However, this is time- and cost-intensive, and if we could do this, we probably would have taken a larger initial sample anyway.
2. We could take our estimate for UO’s favorite Thai restaurant (Drumrongthai, 52%) at face value and decide that this is as good as it gets.
3. We could bootstrap our sample by resampling it thousands of times to create a confidence interval for our estimate.

You’ve probably guessed that the third option is our best one. But why? Since we do not have the true population parameter at hand, our sample is what we have to work with. The estimate itself is OK, but it’s even better to know how much this estimate might have varied (as if we had taken multiple random samples and not just one). By bootstrapping our sample, we are tacitly acknowledging that our sample estimate is almost certainly not exactly right, but a confidence interval derived via resampling provides a range of values that could contain the true value.

Using this sample, we will attempt to estimate the corresponding parameters, or the percentage of the votes that each restaurant would receive from the entire population. We will first generate bootstrapped estimates and then use them to compute confidence intervals, ranges of values that reflect the uncertainty of our estimates.

The table votes contains the results of the survey.

[3]: # Just run this cell
     votes = Table.read_table('votes.csv')
     votes

[3]: Vote
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
     Sweet Basil Express
… (1490 rows omitted)

Question 1. Complete the function one_resampled_percentage below. It should return Drumrongthai’s percentage of votes after simulating one bootstrap sample of tbl. Remember to sample with replacement; otherwise we’ll end up with the same distribution as our sample.

Note: tbl will always be in the same format as votes.

[4]: def one_resampled_percentage(tbl):
         # Table.sample() draws tbl.num_rows rows with replacement by default.
         sample = tbl.sample()
         # Use the table passed in rather than the global votes table.
         num_votes = tbl.num_rows
         percentage = np.count_nonzero(sample.where('Vote', are.equal_to('Drumrongthai')).column('Vote')) / num_votes * 100
         return percentage

     one_resampled_percentage(votes)

[4]: 52.33333333333333

[5]: grader.check("q1_1")

[5]: q1_1 results: All test cases passed!

We now have a function to compute a single bootstrap estimate from our sample. But we’ll need many more to create our confidence interval.

Question 2. Complete the percentages_in_resamples function such that it returns an array of 2500 bootstrapped estimates of the percentage of voters who will vote for Drumrongthai. You should use the one_resampled_percentage function you wrote above.

[6]: def percentages_in_resamples():
         percentage_drum = make_array()
         for i in np.arange(2500):
             # One bootstrapped percentage per iteration.
             sample = one_resampled_percentage(votes)
             percentage_drum = np.append(percentage_drum, sample)
         return percentage_drum

[7]: grader.check("q1_2")

[7]: q1_2 results: All test cases passed!

In the following cell, we run the function you just defined, percentages_in_resamples, and create a histogram of the calculated statistic for the 2,500 bootstrap estimates of the percentage of voters who voted for Drumrongthai. As you can see, we’ve derived not just a single estimate from our sample, but an entire distribution of estimates. Based on what the original Thai restaurant percentages were, does the graph seem reasonable? Talk to a friend or ask a TA if you are unsure!

[8]: resampled_percentages = percentages_in_resamples()
     Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

Now that we have our bootstrapped distribution, we only need to find our desired percentiles to create a confidence interval.

Question 3. Using the array resampled_percentages, find the values at the two edges of the middle 95% of the bootstrapped percentage estimates. (Compute the lower and upper ends of the interval, named drum_lower_bound and drum_upper_bound, respectively.)

Hint: If you’re unsure how to do this, the book chapter on percentiles can help you out.

[14]: drum_lower_bound = percentile(2.5, resampled_percentages)
      drum_upper_bound = percentile(97.5, resampled_percentages)
      print("Bootstrapped 95% confidence interval for the percentage of Drumrongthai voters in the population: [{:f}, {:f}]".format(drum_lower_bound, drum_upper_bound))

Bootstrapped 95% confidence interval for the percentage of Drumrongthai voters in the population: [49.533333, 54.600000]

[15]: grader.check("q1_3")

[15]: q1_3 results: All test cases passed!
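The resample, collect, and take-percentiles steps of Questions 1 to 3 can also be sketched with plain NumPy. This is only an illustration under stated assumptions: the `labels` array is a hypothetical stand-in for votes.csv built from the reported percentages (8%, 52%, 25%, 15% of 1500 votes), the function name `bootstrap_percentages` is made up for this sketch, and `np.percentile` interpolates between values, so its endpoints may differ slightly from those of the `percentile` function used in the lab.

```python
import numpy as np

# Hypothetical stand-in for votes.csv: 1500 labels in the reported proportions.
labels = np.array(
    ["Sweet Basil Express"] * 120
    + ["Drumrongthai"] * 780
    + ["Manola's Thai Cuisine"] * 375
    + ["Tasty Thai"] * 225
)

def bootstrap_percentages(votes_array, target, repetitions, rng):
    """Return `repetitions` bootstrapped percentages of votes for `target`."""
    n = len(votes_array)
    is_target = votes_array == target  # True where the vote is for `target`
    estimates = np.empty(repetitions)
    for i in range(repetitions):
        # Resample n votes with replacement from the original sample.
        resample = rng.choice(is_target, size=n, replace=True)
        estimates[i] = resample.mean() * 100
    return estimates

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable
estimates = bootstrap_percentages(labels, "Drumrongthai", 2500, rng)

# Endpoints of the middle 95% of the bootstrap distribution.
lower, upper = np.percentile(estimates, [2.5, 97.5])
print(f"Approximate 95% interval: [{lower:.2f}, {upper:.2f}]")
```

With a sample of 1500 and a sample percentage near 52%, the interval should come out a few percentage points wide, in line with the interval printed in the lab.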
Question 4. The survey results seem to indicate that Drumrongthai is beating all the other Thai restaurants combined among voters. We would like to use confidence intervals to determine a range of likely values for Drumrongthai’s true lead over all the other restaurants combined. The calculation for Drumrongthai’s lead over Sweet Basil Express, Manola’s Thai Cuisine, and Tasty Thai combined is:

Drumrongthai’s % of the vote - (Sweet Basil Express’s % + Manola’s Thai Cuisine’s % + Tasty Thai’s %)

Define the function one_resampled_difference that returns exactly one value of Drumrongthai’s percentage lead over Sweet Basil Express, Manola’s Thai Cuisine, and Tasty Thai combined from one bootstrap sample of tbl.

[16]: def one_resampled_difference(tbl):
          bootstrap = tbl.sample()
          # Labels must match the 'Vote' column exactly. 'Sweet Basil Express' appears in the
          # table above; the spelling "Manola's Thai Cuisine" is assumed from the survey description.
          drum_percentage = np.count_nonzero(bootstrap.where('Vote', 'Drumrongthai').column('Vote')) / bootstrap.num_rows
          sbe_percentage = np.count_nonzero(bootstrap.where('Vote', 'Sweet Basil Express').column('Vote')) / bootstrap.num_rows
          mtc_percentage = np.count_nonzero(bootstrap.where('Vote', "Manola's Thai Cuisine").column('Vote')) / bootstrap.num_rows
          tt_percentage = np.count_nonzero(bootstrap.where('Vote', 'Tasty Thai').column('Vote')) / bootstrap.num_rows
          return drum_percentage - (sbe_percentage + mtc_percentage + tt_percentage)

[17]: grader.check("q1_4")

[17]: q1_4 results: All test cases passed!

Question 5. Write a function called leads_in_resamples that finds 2,500 bootstrapped estimates (the result of calling one_resampled_difference) of Drumrongthai’s lead over Sweet Basil Express, Manola’s Thai Cuisine, and Tasty Thai combined. Plot a histogram of the resulting samples.

Note: Drumrongthai’s lead can be negative.

[18]: def leads_in_resamples():
          leads = make_array()
          for i in np.arange(2500):
              bootstrap = votes.sample()
              # Same label assumptions as in one_resampled_difference.
              drum_percentage = np.count_nonzero(bootstrap.where('Vote', 'Drumrongthai').column('Vote')) / bootstrap.num_rows
              sbe_percentage = np.count_nonzero(bootstrap.where('Vote', 'Sweet Basil Express').column('Vote')) / bootstrap.num_rows
              mtc_percentage = np.count_nonzero(bootstrap.where('Vote', "Manola's Thai Cuisine").column('Vote')) / bootstrap.num_rows
              tt_percentage = np.count_nonzero(bootstrap.where('Vote', 'Tasty Thai').column('Vote')) / bootstrap.num_rows
              diffs = drum_percentage - (sbe_percentage + mtc_percentage + tt_percentage)
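The lead defined in Question 4 can be cross-checked with a shortcut. If every respondent picks exactly one of the four restaurants (an assumption based on the survey description), the other three shares sum to one minus Drumrongthai's share p, so the lead equals p - (1 - p) = 2p - 1. The sketch below compares the two calculations on a hypothetical stand-in for the votes data built from the reported percentages; the function name `lead_two_ways` is made up for illustration, and shares are kept as proportions between 0 and 1.

```python
import numpy as np

# Hypothetical stand-in for votes.csv: 1500 labels in the reported proportions.
labels = np.array(
    ["Sweet Basil Express"] * 120
    + ["Drumrongthai"] * 780
    + ["Manola's Thai Cuisine"] * 375
    + ["Tasty Thai"] * 225
)

def lead_two_ways(votes_array, target):
    """Return the target's lead over all other options, computed two ways."""
    n = len(votes_array)
    target_share = np.count_nonzero(votes_array == target) / n
    other_share = np.count_nonzero(votes_array != target) / n
    direct = target_share - other_share  # share minus everyone else's share
    shortcut = 2 * target_share - 1      # valid when the shares sum to 1
    return direct, shortcut

direct, shortcut = lead_two_ways(labels, "Drumrongthai")
print(direct, shortcut)
```

Both values should agree up to floating-point rounding. A resampled lead that differs noticeably from 2p - 1 for the same resample suggests that one of the restaurant labels is not matching the Vote column.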