Solutions to Exercises

Exercises page for Biostats Course VHM 801 at AVC - Fall Semester 2020

Follow this link to extra exercises (labeled as x:number).

This page contains links to solutions for selected exercises of VHM 801. The solutions are provided either as text files (to be opened directly in a web browser or in Notepad or similar) or as .pdf files (to be opened in a suitable reader, such as Adobe Acrobat). The solutions have been compiled by the Biostats 801 course instructor, Henrik Stryhn (with help from Jenny Yu).

Extra exercises

x:1. This exercise uses the Mean and Median applet (use link from homepage), which allows you to place observations on a line and see their mean and median displayed visually. Place a few points on the line to get a sense of how it works.
1. Now put two observations on the line. Why does only one arrow appear?
2. Try next with three observations, where two are close together near the center and one is somewhat to the right of these two. Move the rightmost point away from and towards the other points, and observe how the mean and median behave. Explain your findings. Then move the rightmost point across the two center points to the left: what happens to the mean and median as you do that? Again, explain your findings.
3. Finally, put five (distinct) observations on the line. Try to add one point without changing the median: where is your new point? Explore what happens to the median when you add another point; explain your findings.
x:2. In this exercise, we use the guinea pig data from Exercise 1.51 to explore how the number of bins in a histogram affects its shape and the impression it gives of the distribution.
1. Draw first the default histogram (in your software). How many bins does it have?
2. In Minitab, in order to change the binning of a histogram, you have to right-click on the bars, select the menu item Edit bars and the submenu Binning. Try to construct histograms with the maximal and minimal number of bins. How many bins did you get, and how does it affect your impression of the distribution?
3. Experiment further with the binning to determine your preferred number of bins; how many bins did you choose? Does your preferred choice agree with the software default or any of the guidelines mentioned in Lecture 1? Do you think the shape of the distribution will affect the "best" number of bins? (For example, you may think about whether we should use the same number of bins for a symmetrical and a skewed distribution.)
x:3. Randomization may be based on tables of random digits (e.g., Table A of PSLS or Table B of IPS). Discuss whether each of the following statements about such tables is true.
1. Among 40 random digits (corresponding to one row in the table), there are exactly 10 nines.
2. Each pair of digits has a chance/probability of 1/100 of being 99.
3. The number 9999 can never appear as a group because it is not a random pattern.
x:4. ABO blood types are determined from a pair of blood type alleles, inherited independently from the parents. Three alleles exist: A, B, and O, corresponding to the 4 blood types (with allele combinations indicated in parentheses): A (AA and AO), B (BB and BO), AB (AB) and O (OO). Each of a parent's two alleles is passed on to a child with probability 0.5. Use this information to answer the following questions.
1. Determine which blood types the children of two parents with blood type AB can have, and the probabilities of these blood types.
2. If the parents have allele combinations AB and AO, what is the probability that two of their children both have blood type A? What is the probability that the two children have the same blood type?
3. Which of the probabilities computed above are actually from binomial distributions?
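The enumeration behind x:4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the course solution: the function names are made up for this example, and the last two lines assume that the blood types of two siblings are independent given the parents' allele combinations.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def blood_type(allele1, allele2):
    """Map an allele pair to the ABO blood type listed in the exercise."""
    pair = {allele1, allele2}
    if pair == {"A", "B"}:
        return "AB"
    if "A" in pair:
        return "A"  # AA or AO
    if "B" in pair:
        return "B"  # BB or BO
    return "O"      # OO


def child_type_distribution(parent1, parent2):
    """Return {blood type: probability} for one child of the two parents.

    Each parent is given as a two-letter allele combination, e.g. "AO".
    Every allele is passed on with probability 0.5, so the four
    combinations from product() are equally likely.
    """
    counts = Counter(blood_type(a, b) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return {t: Fraction(n, total) for t, n in counts.items()}


dist = child_type_distribution("AB", "AO")
# Assumption: siblings inherit independently, so probabilities multiply.
both_a = dist["A"] ** 2
same_type = sum(prob ** 2 for prob in dist.values())
print(dist, both_a, same_type)
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with a hand calculation.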
x:5. The Probability applet (link at homepage) allows us to simulate random tosses of a coin. We can use the applet to explore the behaviour of both relative frequencies (or proportions) and the counts of heads as the number of trials increases. Experiment a bit with the applet before resetting it for the work to be done for this exercise. Set the number of trials (tosses) at 50.
1. Record the number of heads after 50 tosses, and compute the difference between the observed and expected proportion, as well as the difference between the observed and expected count. Hint: it may be easiest to enter the number of heads and the number of tosses in Minitab, and do the calculation in the Calculate menu (if you use formulas, the calculations will automatically expand to new rows).
2. Continue the tosses up to totals of first 100 and then 200. Repeat the calculations with the new numbers.
3. Set the number of tosses at 200 (without resetting), and continue the tosses up to totals of 400, 1000 and 2000, respectively. Repeat also here the calculations with the new numbers. You may continue with further tosses if you think it will be helpful.
4. What patterns do you observe for the proportion and count of heads? Are these findings what you would expect?
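If the applet is unavailable, the bookkeeping in x:5 can be imitated with a short simulation. The sketch below is an assumption-laden stand-in for the applet (the function name, the seed and the use of Python's random module are choices made for this example); it continues the same run of tosses up to each total, as the exercise asks, and records both differences.

```python
import random


def toss_summary(checkpoints, p=0.5, seed=801):
    """Simulate one growing sequence of coin tosses.

    For each total n in `checkpoints` (increasing), return a tuple
    (n, heads, observed proportion - p, observed count - n*p).
    """
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    heads = 0
    tosses = 0
    rows = []
    for n in checkpoints:
        while tosses < n:
            heads += rng.random() < p  # True counts as 1 head
            tosses += 1
        rows.append((n, heads, heads / n - p, heads - n * p))
    return rows


for n, heads, diff_prop, diff_count in toss_summary([50, 100, 200, 400, 1000, 2000]):
    print(f"{n:5d} tosses, {heads:5d} heads, "
          f"proportion diff {diff_prop:+.4f}, count diff {diff_count:+.1f}")
```

Changing the seed gives a new sequence of tosses; repeating the run a few times helps to separate the general pattern from random noise.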
x:6. The Monty Hall problem is a famous statistical paradox that has generated heated debate among statisticians and laymen. The following description is taken from the detailed Wikipedia article on the problem. The context is a game (television) show, described as follows:
In order to properly understand the situation, it is important to explicitly define the role of the host:
1. the host must always open a door that was not picked by the contestant,
2. the host must always open a door to reveal a goat and never the car,
3. the host must always offer the chance to switch between the originally chosen door and the remaining closed door.
With these assumptions, determine the chance of winning the car in both of the options available to the contestant: (i) stay at the originally chosen door, and (ii) switch doors.
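A simulation can serve as a check on the probabilities derived in x:6. The sketch below encodes the three rules for the host listed above, together with two assumptions that are not stated in the text: there are three doors with one car and two goats, and the host chooses at random when two goat doors are available to open. The function names are hypothetical.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]            # assumption: three doors, one car
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car
    # (chosen at random if two such doors exist, which is an assumption).
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, n_games=20_000, seed=801):
    """Estimate the chance of winning for the stay or switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_games)) / n_games


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The estimates are subject to simulation error, so they should be compared with, not substituted for, the exact probabilities.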
x:7. The Normal density curve applet (link at homepage) computes and visualizes areas under the normal curve. In this exercise, we will use it to illustrate the 68-95-99.7 rule.
1. With the flags placed one standard deviation on either side of the mean, what is the area between those two values? Note that if you drag the flags across each other, the applet will display the area in the middle. What does the 68-95-99.7 rule say this area equals?
2. Locate the flags two and three standard deviations from the mean, and answer the same questions.
3. As discussed in the lecture, the length of human pregnancies may be approximated by a normal distribution with mean 266 and standard deviation 16 days. Use the 68-95-99.7 rule and/or the applet to answer the following questions.
  1. Almost all (99.7%) human pregnancies fall in what range of lengths?
  2. What percent of human pregnancies are longer than 282 days?
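The areas in x:7 can also be computed without the applet. The sketch below uses NormalDist from Python's standard statistics module (Python 3.8 or later); the helper name area_within is made up for this example, and the pregnancy parameters are those given in the exercise.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Area under the standard normal curve within k standard deviations."""
    return standard.cdf(k) - standard.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")

# Length of human pregnancies: normal with mean 266 and SD 16 days.
pregnancy = NormalDist(mu=266, sigma=16)
low, high = 266 - 3 * 16, 266 + 3 * 16        # range covering about 99.7%
share_longer_than_282 = 1 - pregnancy.cdf(282)  # 282 is one SD above the mean
print(low, high, round(share_longer_than_282, 4))
```

The exact areas differ slightly from the rounded 68-95-99.7 values, which is why the rule is only an approximation.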
x:8. We can use simulated data to explore how random variability affects statistical procedures. It is recommended to repeat each of the questions below a few times with different simulated data to get a sense of whether your findings are consistent or just reflect random noise.
1. Generate 100 observations from the standard normal distribution. Produce a histogram with an overlaid normal distribution curve and a normal plot. Do these plots suggest any important deviations from a normal distribution?
2. Generate 100 observations from the uniform distribution on (0,1). Produce a histogram with an overlaid normal distribution curve and a normal plot. Do these plots suggest any important deviations from a normal distribution? Describe and explain any such deviations you see.
x:9. Binomial versus hypergeometric distribution. We plan to carry out a survey among graduate students in one faculty of a small university, where the total population encompasses (only) 100 students. Our question of interest is whether the students own electronics (computer, tablet, phone) of a particular brand A. We plan to include 25 students in the survey. Assume that the true proportion in the population is 0.7 (or 70%) of owners.
1. If the survey (quite unrealistically) was carried out with replacement (i.e., students could be selected to participate multiple times), what is the distribution of the number (X) of respondents saying they own electronics of brand A? Use statistical software to produce a tabulation of the probability function for the distribution of X, and produce also a graph of this distribution. What are the mean and variance of X (or X's distribution)?
2. If the survey was carried out without replacement, the variable X would have a hypergeometric distribution with parameters 100 (population size), 25 (sample size) and 0.7 (true proportion). Use statistical software to produce a tabulation of the probability distribution of X and a graph of its distribution. Answer the following questions:
  1. Based on the graphs for sampling with and without replacement, what is the visually biggest difference between the two distributions?
  2. Which value of X has the biggest difference in probability in the two distributions?
  3. Use your tabulated probabilities for sampling without replacement to compute the mean of that distribution. Do the two distributions have the same mean?
  4. Use your tabulated probabilities for sampling without replacement to compute the variance of that distribution. Do the two distributions have the same variance?
3. How large should the survey be in order for it to be acceptable to use the same distribution for sampling with and without replacement?
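The tabulations asked for in x:9 can be produced in any statistical software. As one possible sketch, the Python code below builds both probability functions from binomial coefficients (math.comb, Python 3.8 or later), assuming that a true proportion of 0.7 in a population of 100 means exactly 70 owners. The constant and function names are chosen for this example only.

```python
from math import comb

POPULATION = 100   # students in the faculty
OWNERS = 70        # assumption: proportion 0.7 means exactly 70 owners
SAMPLE = 25        # students included in the survey
P = OWNERS / POPULATION


def binom_pmf(x):
    """P(X = x) when sampling with replacement."""
    return comb(SAMPLE, x) * P**x * (1 - P) ** (SAMPLE - x)


def hyper_pmf(x):
    """P(X = x) when sampling without replacement."""
    return comb(OWNERS, x) * comb(POPULATION - OWNERS, SAMPLE - x) / comb(POPULATION, SAMPLE)


def mean_and_variance(pmf):
    """Mean and variance computed directly from tabulated probabilities."""
    xs = range(SAMPLE + 1)
    mean = sum(x * pmf(x) for x in xs)
    var = sum((x - mean) ** 2 * pmf(x) for x in xs)
    return mean, var


for x in range(SAMPLE + 1):
    print(f"{x:2d}  {binom_pmf(x):.5f}  {hyper_pmf(x):.5f}")
print("with replacement:   ", mean_and_variance(binom_pmf))
print("without replacement:", mean_and_variance(hyper_pmf))
```

The printed table can be pasted into a spreadsheet or plotting tool to draw the two distributions side by side.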
x:10. This exercise uses the Law of Large Numbers, Central Limit Theorem and Normal Approximation to Binomial Distributions applets for the textbooks. Use the first of these to illustrate the Law of Large Numbers. Choose two dice, tick the boxes to add the mean and roll totals to the figure, and roll the dice a few times. Make sure you understand the numbers displayed (in particular, the mean value). Then roll the dice a larger number of times (say 40, or even more) to see the stabilization of the average with increasing number of trials.
Use the two other applets to illustrate the approximate normal distribution for an average of i.i.d. variables. Explore how large a value of n is required to make the approximation good for the exponential distribution and the binary distributions with p=0.5, 0.7, 0.9 and 0.95, and try to explain why.
x:11. This exercise uses the Confidence Intervals applet for the textbooks. First explore the meaning of the controls for Confidence level and Sample size when you generate new samples. Reset these values at their default values (95 and 20, respectively), and sample 25 intervals. Note the number of intervals that cover the true value (the Hit value). Reset and repeat for a total of 30 times. Describe the distribution using a stemplot and suitable statistics, and summarize your findings. If you repeated the experiment very many times, what would you expect the average number and average proportion to be? (If you continue the trials without resetting, the Percent hit gives you the proportion across all runs of the experiment.)
x:12. This exercise uses the Statistical Significance applet for the textbooks. Keep the default settings, change the observed mean to 1, and update the calculation. What is the P-value for testing H0 (mu=0) against Ha (mu>0)? Is the test significant at significance level 0.05? Write down the P-value and indicate its significance (no/yes). Now change the alternative hypothesis to a two-sided alternative (Ha: mu<>0), and repeat the procedure. Do the same for the other one-sided alternative (Ha: mu<0). Repeat all these steps while you change the observed mean from 1 to 0 in steps of 0.1. Display all your results (P-values and significance) in tabular form with the different values of the sample mean along the rows and the 3 alternative hypotheses along the columns. Describe the patterns you see in the P-values in the columns for the different Ha, and explain your findings.
x:13. Explain in simple language why a test significant at the 1% level is also significant at the 5% level.
x:14. A study on whether dogs could be trained to detect lung and breast cancer by sniffing exhaled breath samples was carried out in 2005, reported in McCulloch et al. (2006), Integrative Cancer Therapies, 5:30-39. Multiple dogs were involved in a total of 125 breast cancer trials. In each trial there were four control samples and one cancer sample, and the dog was supposed to identify the cancer sample by lying down next to it. In 110 out of the 125 trials, the dogs correctly identified the cancer sample. Construct a 95% confidence interval for the true proportion of times the dogs correctly identify a breast cancer sample. Use also the data to test the two null hypotheses: (1) pure guessing (p=0.2), and (2) same sensitivity as for lung cancer samples (p=0.99).
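The calculations in x:14 can be sketched as follows. This is not necessarily the method used in the course solution: the sketch assumes the large-sample (Wald) interval for a proportion, and it computes exact binomial tail probabilities for the two null hypotheses because the normal approximation is doubtful for p=0.99 with n=125. The function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_ci(successes, n, level=0.95):
    """Large-sample confidence interval for a proportion (an assumed choice)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def binom_tail(successes, n, p0, upper=True):
    """Exact P(X >= successes) if upper, else P(X <= successes), X ~ Bin(n, p0)."""
    xs = range(successes, n + 1) if upper else range(0, successes + 1)
    return sum(comb(n, x) * p0**x * (1 - p0) ** (n - x) for x in xs)


lower, upper_limit = wald_ci(110, 125)
print(f"95% CI: ({lower:.3f}, {upper_limit:.3f})")
# One-sided tail probabilities in the direction of the observed deviation.
print("H0: p=0.20, P(X >= 110):", binom_tail(110, 125, 0.20, upper=True))
print("H0: p=0.99, P(X <= 110):", binom_tail(110, 125, 0.99, upper=False))
```

Other interval methods (for example the plus-four interval) give slightly different limits, so the choice of method should be stated along with the result.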
x:15. Supplementary Exercises 7.102-4 study the effect of piano lessons on the spatial-temporal reasoning of preschool children. The data involve 34 children who took piano lessons and a control group of 44 children. The data take only small whole-number values, and there are many ties. Use the Wilcoxon (Mann-Whitney) rank sum test to decide whether piano lessons improve spatial-temporal reasoning.
x:16. Supplementary Exercises 6.95 and 7.64 study the accuracy of radon detectors based on readings from 12 detectors placed in a chamber holding 105 picocuries per liter of radon. In this exercise we explore whether the median reading differs significantly from the true value 105. Give a justification for using nonparametric methods based on the information you have (from previous descriptive analysis) about the distribution. Then carry out a nonparametric test of the question of interest; state your hypotheses explicitly; draw conclusions, and compare with those of the previous analysis.
x:17. For IPS 7e Supplementary Exercises 9.2, 9.4, 9.22 and 9.57, describe the statistical model and statistical hypotheses you would use to analyze the data. In particular, would you use Model I (independent multinomials) to compare several populations, or would you use Model II (a single multinomial) to assess independence between two categorical variables?
x:18. This exercise uses the One-Way ANOVA applet at the PSLS website.
1. Start by exploring the impact of within-group sample size (n) on the F-statistic. What happens when you increase/decrease n, and why?
2. Set the within-group sample size at n=5. Move the black dots representing the group means up or down until you arrive at a P-value of 0.05 (or very close to 0.05). What value does the F-statistic have? Check your finding by finding the 95th percentile of the relevant F-distribution in a statistical table and/or using statistical software.
3. How did the F-statistic change when you moved the group means further from/closer to each other? Explain why this makes sense.
4. Now explore the impact of increasing the within-group variation by dragging the standard deviation slider to the right. Describe what happens to the F-statistic and the P-value, and explain why.
x:19. We previously analyzed data on the impact of piano lessons on spatial-temporal reasoning in preschool children (Supplementary Exercise 7.102). The control group was actually made up of 3 different groups: children who took singing lessons or computer lessons or who received no extra lessons; the group2 variable gives this classification of the children. Make a table giving the sample size, mean, standard deviation and standard error for each group. Then analyze the data to compare the gains in spatial-temporal reasoning among the 4 groups. If an overall test for homogeneity among groups is significant, continue your analysis by pairwise comparisons between groups, using a suitable procedure. Summarize your results and present your findings by a table and graph of group means with associated confidence intervals.
x:20. This problem is based on an article in the Journal of Wildlife Diseases; follow this link to the article. The article reports multiple analyses of which we will only focus on a few (although others can be questioned as well).
1. Describe the study in general terms, including e.g. study type, statistical design and blinding.
2. Consider Figure 1. Determine the statistical design and the number of animals the figure is based on. Suggest a possible statistical model for the data, and review the assumptions it involves. Identify the analysis reported in the paper, and discuss whether you think this analysis is justified and correct for the problem examined, results presented and conclusions drawn.
3. Consider Figure 2. Determine again the statistical design the figure is based on, and critique the actual analysis carried out in a similar manner as above. Focus your attention on any new issues with the analysis.
x:21. Frank Anscombe published a paper in 1973 (The American Statistician 27, 17-21) to illustrate the importance of graphically assessing the data used for simple linear regression. He constructed four datasets to illustrate his points (datafile anscombe), each consisting of 11 pairs of values (x,y). The first 3 datasets use the same x-variable (denoted x1_3). For each of the four datasets, create a scatterplot for (x,y), add the regression line for prediction of y from x to the plot, and predict y for x=10. Examine the output of the regression analysis, and discuss in which of these datasets you think the regression could be used to describe the dependence of y on x.
x:22. Supplementary Exercise 10.33 studied serum retinol and C-reactive protein (CRP) values measured on 40 children. We will use these data to illustrate Spearman's rank correlation coefficient. It is computed in three steps: (i) rank the values for the x-variable; (ii) rank the values for the y-variable; and (iii) compute the (Pearson) correlation between the two columns of ranks. For large n (say n>30), a t-test for no association (correlation=0) can be computed with the rank correlation coefficient in the same way as for the Pearson correlation. For small n, use instead a table with critical values for different significance levels, e.g. here. A good resource for very exact percentiles for Spearman's rho is Ramsey (1989), J Educat Statist 14, 245-253. Some software packages (e.g. Minitab and Stata) use pretty crude and inaccurate approximations for small n.
For the CRP and retinol data, compute the Spearman rank correlation coefficient, and its statistical significance, both directly in the software and indirectly via the ranks. Compare your findings with those for the Pearson correlation coefficient. Additionally, explore how strongly the Spearman rank correlation is affected by the outlier in the golf scores data (Supplementary Exercises 2.2 and 10.7), and compare also here with your findings for the Pearson correlation.
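The three-step computation of Spearman's rank correlation described in x:22 can be written out directly. The sketch below is a minimal illustration with made-up numbers, not the CRP and retinol data; it assumes that tied values receive the average of their ranks, and all function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank the values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Steps (i)-(iii): rank x, rank y, then correlate the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def t_statistic(r, n):
    """t-statistic with n-2 degrees of freedom; requires |r| < 1."""
    return r * sqrt((n - 2) / (1 - r**2))


# Made-up illustration data (not from the exercise).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
rho = spearman(x, y)
print(rho, t_statistic(rho, len(x)))
```

With only five made-up pairs the t-statistic is shown purely to illustrate the formula; as the exercise notes, small samples call for tabulated critical values instead.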
x:23. An aquaculture site holds 1000 fish (actual numbers are usually much higher). It is of interest to determine whether a particular pathogen is present in the population. Because of the infectious nature of the pathogen, a sporadic presence is assumed quite implausible, so the focus is on identifying a prevalence (proportion of positives) of at most 10%.
1. Determine a sample size that would give 95% confidence that the prevalence does not exceed 10% if all samples were negative. Carry out the calculation both with an infinite and finite population assumption - are the results substantially different?
2. The 1000 fish are actually held in 10 separate cages/tanks, say of 100 fish each. Does this affect your sample size calculation? If yes, revise your calculation. If not, discuss the assumptions you make in order for this to be valid.
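The first part of x:23 can be sketched as a search for the smallest sample in which the chance of seeing only negatives is at most 5% when the prevalence is 10%. The sketch makes assumptions that are not stated in the exercise: the diagnostic test is perfect, fish are sampled at random without replacement in the finite case, and a 10% prevalence among 1000 fish means exactly 100 positives. The function names are hypothetical.

```python
from math import ceil, comb, log


def n_infinite(prevalence, confidence=0.95):
    """Smallest n with (1 - prevalence)**n <= 1 - confidence."""
    return ceil(log(1 - confidence) / log(1 - prevalence))


def prob_all_negative(n, population, infected):
    """Hypergeometric probability that a sample of n contains no positives."""
    return comb(population - infected, n) / comb(population, n)


def n_finite(population, infected, confidence=0.95):
    """Smallest n for which an all-negative sample has probability <= 1 - confidence."""
    for n in range(1, population + 1):
        if prob_all_negative(n, population, infected) <= 1 - confidence:
            return n
    return population


print("infinite population:", n_infinite(0.10))
print("finite population:  ", n_finite(1000, 100))
```

Calling n_finite with a smaller population (for example a single cage of 100 fish with 10 positives) shows how strongly the finite-population correction depends on the sampling fraction.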
2001 home assignment 1. link
2003 home assignment 4. link
2006 final exam: 3. link
2009 home assignment 2. link

Solutions to extra exercises:

x:1, x:2, x:3, x:4, x:5, x:6, x:7, x:8 (do-file; R-program), x:9, x:10, x:11, x:12, x:13, x:14, x:15, x:16, x:17, x:18, x:19 (do-file), x:20, x:21 (do-file), x:22 (do-file; R-program), x:23, home assignment 2001:1, home assignment 2003:4, final exam 2006:3, home assignment 2009:2.

Henrik Stryhn (hstryhn@upei.ca) 2020-11-28
