Exercises page for Biostats Course VHM 801 at AVC - Fall Semester 2020
Follow this link to extra exercises (labeled as x:number).
This page contains links to solutions for selected exercises of VHM 801, either as text files (to be opened
directly in a web browser or in Notepad or similar) or as .pdf files (to be opened in a suitable reader,
such as Adobe Acrobat). The solutions have been compiled
by the Biostats 801 course instructor, Henrik Stryhn (with help from Jenny Yu).
Solutions to exercises
(from Supplementary Exercises for IPS7e)
- 1, 10,
16, 22,
42, 51,
65,
72, 77,
110, 111,
113, 117,
121, 123,
127, 144,
145,
- 2, 7,
11, 12,
27, 28,
48, 57,
59, 60,
67, 69,
- 4,
10, 14,
18, 19,
40,
77, 79,
94, 95,
- 9, 10,
14, 26,
48, 52,
56, 60,
71,
73, 75,
76, 78,
92, 107,
108, 115,
122, 123,
- 7, 9,
14, 33,
40,
47, 49,
51, 53,
54,
- 7,
11, 12,
13, 14,
33, 38,
39, 45,
46, 68,
70, 85,
87, 95,
96, 99,
103, 107,
108, 111,
115,
140, 142,
- 4, 50,
58, 59,
64, 66,
68,
73, 74,
91, 93,
102, 103,
104, 127,
129, 132,
143, 145,
- 1, 21,
33, 62,
84, 85,
98, 103,
- 20, 36,
38, 39,
40, 44,
48, 50,
52, 62,
- 7, 12,
17, 18,
26, 27,
33, 38,
39, 40,
- 6, 15,
16, 17,
45,
- 1, 9,
25, 26,
27,
35, 36,
40, 43,
54, 55,
- 3, 4,
15, 16,
19, 31,
- 7, 8,
17, 18,
19, 33,
37,
- 8, 19,
22, 45,
- 4, 14,
Stata do-files for exercises
- 10,
42, 72,
51, 77,
110, 111,
127, 145,
- 2,
7, 11,
12, 27,
28,
48,
57, 59,
60,
- 14, 40,
- 10, 71,
- 7, 40,
49, 53,
- 13, 14,
87, 95,
115,
140,
- 4, 58,
59, 64,
68,
73, 74,
91, 93,
102, 103,
127,
143, 145,
- 62, 84,
85,
- 20, 36,
38, 44,
48, 50,
52,
- 7, 12,
17, 33,
38, 39,
- 15, 16,
17,
- 35, 43,
54, 55,
- 15, 31,
- 33, 37,
- 8, 19,
22,
R program files for exercises
- 10,
42, 72,
51, 77,
110, 111,
127, 145,
- 2, 11,
12, 48,
- 14, 40,
- 10, 71,
- 7, 40,
49, 53,
- 13, 14,
87, 95,
115, 140,
- 4, 58,
59, 64,
68,
73, 74,
102, 103,
127,
143, 145,
- 62, 84,
85,
- 20, 36,
38, 48,
50,
- 7, 17, 38,
39,
- 35, 43,
54, 55,
- 31,
- 33,
- 8, 19,
22,
Data files
(.zip archive of all current data files)
Data files (.mtw)
- Chapter 1: 10, 16,
23, 42, 51,
145,
- Chapter 2: 2, 7,
11, 12,
57, 59,
- Chapter 3: 14, 40,
- Chapter 6: 11, 13, 95, 140,
- Chapter 7: 58, 64,
66, 73,
102, 143, 145,
- Chapter 9: 19, 20, 36,
44, 50,
52, 62,
- Chapter 10: 12, 17, 33,
- Chapter 11: 15, 16, 17,
- Chapter 12: 54,
- Chapter 13: 15, 31,
- Chapter 26: 33, 37,
- Chapter 27: 8, 19,
22, 45,
- Chapter 28: 14,
- Extra: parasite, PEI migration,
crab, poplar,
lab concentrations, body temperature,
neuron, avadex, health habit,
wine, job satisfaction (Stephens), reading,
satscore (Stephens), fidget,
anscombe, sales (Stephens),
tomato, CSData,
plants1, plants2,
home assignment 2001:1,
sparrow data (exam 2006:3),
home assignment 2009:2,
insulin data (exam 2013:2),
Data files (.csv)
- Chapter 1: 10, 16,
23, 42, 51,
145,
- Chapter 2: 2, 7,
11, 12,
57, 59,
- Chapter 3: 14, 40,
- Chapter 6: 11, 13, 95, 140,
- Chapter 7: 58, 64,
66, 73,
102, 143, 145,
- Chapter 9: 19, 20, 36,
44, 50,
52, 62,
- Chapter 10: 12, 17, 33,
- Chapter 11: 15, 16, 17,
- Chapter 12: 54,
- Chapter 13: 15, 31,
- Chapter 26: 33, 37,
- Chapter 27: 8, 19,
22, 45,
- Chapter 28: 14,
- Extra: parasite, PEI migration,
crab, lab concentrations, body temperature,
neuron, avadex, health habit,
wine, job satisfaction (Stephens), reading,
satscore (Stephens), fidget,
anscombe, sales (Stephens),
tomato, CSData,
plants1, plants2,
home assignment 2001:1,
sparrow data (exam 2006:3),
home assignment 2009:2,
insulin data (exam 2013:2),
- x:1. This exercise uses the Mean and Median
applet (use link from homepage) which allows you to place observations on a line and see their mean
and median displayed visually. Place a few points on the line to get a sense of how it
works.
- Now put two observations on the line. Why does only one arrow appear?
- Try next with three observations, where two are close together near
the center and one somewhat to the right of these two. Move the
rightmost point away and towards the other points, and observe how the
mean and median behave. Explain your findings. Then move the rightmost
point across the two centre points to the left: what happens to the mean and
median as you do that? Again, explain your findings.
- Finally, put five (distinct) observations on the line. Try to add
one point without changing the median: where is your new point? Explore
what happens to the median when you add another point; explain your
findings.
- x:2. In this exercise, we use the guinea pig data from
Exercise 1.51 to explore how the number of bins in a histogram
affects its shape and the impression it gives of the distribution.
- Draw first the default histogram (in your software). How many bins does it have?
- In Minitab, in order to change the binning of a histogram, you have to right-click
on the bars, select the menu item Edit bars and the submenu
Binning. Try to construct histograms with the maximal and minimal number
of bins. How many bins did you get, and how does it affect your impression of the
distribution?
- Experiment further with the binning to determine your preferred
number of bins; how many bins did you choose? Does your preferred choice
agree with the software default or any of the guidelines mentioned in
Lecture 1? Do you think the shape of the distribution will affect the
"best" number of bins? (For example, you may think about whether we should use the same number of bins
for a symmetrical and a skewed distribution.)
- x:3. Randomization may be based on tables of random digits (e.g., Table A of PSLS or Table B of IPS).
Discuss whether each of the following statements about such tables is true.
- Among 40 random digits (corresponding to one row in the table), there are exactly 10 nines.
- Each pair of digits has a chance/probability of 1/100 of being 99.
- The number 9999 can never appear as a group because it is not a random pattern.
- x:4. ABO blood types are determined from a pair of blood type alleles, inherited
independently from the parents. Three alleles exist: A, B, and O, corresponding to the
4 blood types (with allele combinations indicated in parentheses): A (AA and AO),
B (BB and BO), AB (AB) and O (OO). Each of a parent's two alleles is passed on to a child with probability 0.5.
Use this information to answer the following questions.
- Determine which blood types the children of two parents with blood type AB can have, and the corresponding probabilities.
- If the parents have allele combinations AB and AO, what is the probability that two of their children
both have blood type A? What is the probability that the two children have the same blood type?
- Which of the probabilities computed above are actually from binomial
distributions?
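The allele bookkeeping in x:4 can be checked by brute-force enumeration. The following is a minimal Python sketch, not part of the course solutions; the function names `blood_type` and `child_distribution` are made up for illustration, and the two-children calculation assumes that the children inherit their alleles independently of each other.

```python
from fractions import Fraction
from itertools import product

def blood_type(allele1, allele2):
    """Map a pair of alleles to the ABO blood type listed in the exercise."""
    pair = {allele1, allele2}
    if pair == {"A", "B"}:
        return "AB"
    if "A" in pair:
        return "A"   # AA or AO
    if "B" in pair:
        return "B"   # BB or BO
    return "O"       # OO

def child_distribution(parent1, parent2):
    """Blood type probabilities for one child.

    Each parent is given as a two-letter allele combination, e.g. "AO";
    each of the 2 x 2 allele pairings has probability 1/4.
    """
    dist = {}
    for a1, a2 in product(parent1, parent2):
        t = blood_type(a1, a2)
        dist[t] = dist.get(t, Fraction(0)) + Fraction(1, 4)
    return dist

print(child_distribution("AB", "AB"))

one_child = child_distribution("AB", "AO")
# Assumption: the two children are independent given the parents' alleles.
both_a = one_child["A"] ** 2
same_type = sum(p ** 2 for p in one_child.values())
print(one_child, both_a, same_type)
```

Exact fractions are used instead of floating-point numbers so that the probabilities can be compared directly with a hand calculation.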
- x:5. The Probability applet (link at homepage) allows us to simulate random tosses of a coin.
We can use the applet to explore the behaviour of both relative frequencies (or proportions) and
the counts of heads as the number of trials increases. Experiment a bit with the applet before resetting
it for the work to be done for this exercise. Set the number of trials (tosses) at 50.
- Record the number of heads after 50 tosses, and compute the difference between the
observed and expected proportion, as well as the difference between the observed and expected count. Hint: it may be
easiest to enter the number of heads and the number of tosses in Minitab, and do the calculation in the
Calculate menu (if you use formulas, the calculations will automatically expand to new rows).
- Continue the tosses up to totals of first 100 and then 200. Repeat the calculations with the new numbers.
- Set the number of tosses at 200 (without resetting), and continue the tosses up to totals of 400, 1000 and 2000,
respectively. Repeat the calculations here as well with the new numbers.
You may continue with further tosses if you think it will be helpful.
- What patterns do you observe for the proportion and count of heads?
Are these findings what you would expect?
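If the applet is unavailable, the experiment in x:5 can be imitated in software. This is a rough Python sketch under the assumption of a fair coin (p = 0.5); the function name `running_heads` and the seed value are arbitrary choices for illustration, and a different seed gives different tosses.

```python
import random

def running_heads(checkpoints, p=0.5, seed=801):
    """Toss a coin up to the largest checkpoint and summarize at each checkpoint.

    Tosses accumulate across checkpoints (no resetting), as in the exercise.
    """
    rng = random.Random(seed)
    heads, tosses, rows = 0, 0, []
    for n in sorted(checkpoints):
        while tosses < n:
            heads += rng.random() < p   # True counts as 1 head
            tosses += 1
        rows.append({
            "n": n,
            "heads": heads,
            "prop_diff": heads / n - p,    # observed minus expected proportion
            "count_diff": heads - n * p,   # observed minus expected count
        })
    return rows

for row in running_heads([50, 100, 200, 400, 1000, 2000]):
    print(row)
```

Running it with a few different seeds gives a feel for which of the two differences tends to shrink and which does not.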
- x:6. The Monty Hall problem is a famous statistical paradox that has generated heated debate among
statisticians and laymen. The following description is taken from the detailed Wikipedia article
on the problem. The context is a game (television) show,
described as follows:
Suppose you're on a game show, and you're given the choice of
three doors: Behind one door is a car; behind the others, goats. You
pick a door, say No. 1, and the host, who knows what's behind the doors,
opens another door, say No. 3, which has a goat. He then says to you,
"Do you want to pick door No. 2?" Is it to your advantage to switch your
choice?
In order to properly understand the situation, it is important to
explicitly define the role of the host:
- the host must always open a door that was not picked by the contestant,
- the host must always open a door to reveal a goat and never the car,
- the host must always offer the chance to switch between the
originally chosen door and the remaining closed door.
With these assumptions, determine the chance of winning the car in both
of the options available to the contestant: (i) stay at the originally
chosen door, and (ii) switch doors.
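A simulation can be used to check a theoretical answer to x:6. The sketch below follows the three host rules listed above; it additionally assumes that the host chooses at random between the two goat doors when the contestant has picked the car, which the description does not specify. The names `play` and `win_rate` are illustrative only.

```python
import random

def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that was not picked and hides a goat.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, n_games=50_000, seed=801):
    """Proportion of wins over n_games simulated games."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_games)) / n_games

print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```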
- x:7. The Normal density curve applet (link at homepage) computes and
visualizes areas under the normal curve. In this exercise,
we will use it to illustrate the 68-95-99.7 rule.
- With the flags placed one standard deviation on either side of the mean, what is the area
between those two values? Note that if you drag the flags across each other, the applet
will display the area in the middle. What does the 68-95-99.7 rule say
this area equals?
- Locate the flags two and three standard deviations from the mean, and answer the
same questions.
- As discussed in the lecture, the length of human pregnancies may be approximated by a normal distribution with
mean 266 and standard deviation 16 days. Use the 68-95-99.7 rule and/or the applet
to answer the following questions.
- Almost all (99.7%) human pregnancies fall in what range of lengths?
- What percent of human pregnancies are longer than 282 days?
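The areas shown by the applet in x:7 can also be computed directly. The following is a small Python sketch using the standard library class `statistics.NormalDist`; the helper name `area_within` is made up here, and the pregnancy parameters are those given in the exercise.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def area_within(k):
    """Area under the standard normal curve within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")

# Pregnancy lengths: mean 266 days, standard deviation 16 days.
pregnancy = NormalDist(mu=266, sigma=16)
low = pregnancy.mean - 3 * pregnancy.stdev
high = pregnancy.mean + 3 * pregnancy.stdev
print("three-SD range:", low, "to", high)
print("P(longer than 282 days):", 1 - pregnancy.cdf(282))
```

Comparing the exact areas with the rounded values of the 68-95-99.7 rule shows how rough the rule is.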
- x:8. We can use simulated data to explore how random variability affects
statistical procedures. It is recommended to repeat each of the questions below a
few times with different simulated data to get a sense of whether your findings
are consistent or just reflect random noise.
- Generate 100 observations from the standard normal distribution. Produce a histogram
with an overlaid normal distribution curve and a normal plot. Do these plots suggest any
important deviations from a normal distribution?
- Generate 100 observations from the uniform distribution on (0,1). Produce a histogram
with an overlaid normal distribution curve and a normal plot. Do these plots suggest any
important deviations from a normal distribution? Describe and explain any such
deviations you see.
- x:9. Binomial versus hypergeometric distribution.
We plan to carry out a survey among graduate students in one faculty of a small university, where
the total population encompasses (only) 100 students. Our question of interest is whether the students
own electronics (computer, tablet, phone) of a particular brand A. We plan to include 25 students in the survey.
Assume that the true proportion of owners in the population is 0.7 (or 70%).
- If the survey (quite unrealistically) was carried out with replacement (i.e., students could
be selected to participate multiple times), what is the distribution of the number (X) of respondents saying
they own electronics of brand A? Use statistical software to produce a tabulation of the probability
function for the distribution of X, and produce also a graph of this
distribution. What are the mean and variance of X (or X's distribution)?
- If the survey was carried out without replacement, the variable X would have a hypergeometric distribution with parameters
100 (population size), 25 (sample size) and 0.7 (true proportion).
Use statistical software to produce a tabulation of the probability
distribution of X and a graph of its distribution. Answer the following
questions:
- Based on the graphs from a) and b), what is the visually
biggest difference between the two distributions?
- Which value of X has the biggest difference in probability in the
two distributions?
- Use your tabulated probabilities of the distribution in b) to
compute its mean - do the two distributions have the same mean?
- Use your tabulated probabilities of the distribution in b) to
compute its variance - do the two distributions have the same variance?
- How large should the survey be in order for it to be acceptable to use the same distribution for settings a) and b)?
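The tabulations asked for in x:9 can be produced in any statistical software. As one possible illustration, here is a Python sketch that uses only `math.comb`. It takes the population to contain exactly 70 owners (100 times 0.7), and the function names are chosen for this example only.

```python
from math import comb

N, n, p = 100, 25, 0.7
K = round(N * p)  # number of owners in the population (70)

def binom_pmf(x):
    """P(X = x) when sampling with replacement: binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def hyper_pmf(x):
    """P(X = x) when sampling without replacement: hypergeometric(N, K, n)."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def mean_var(pmf):
    """Mean and variance computed from the tabulated probabilities."""
    mean = sum(x * pmf(x) for x in range(n + 1))
    var = sum((x - mean) ** 2 * pmf(x) for x in range(n + 1))
    return mean, var

print(" x  binomial  hypergeometric")
for x in range(n + 1):
    print(f"{x:2d}  {binom_pmf(x):.4f}    {hyper_pmf(x):.4f}")
print("binomial mean, variance:      ", mean_var(binom_pmf))
print("hypergeometric mean, variance:", mean_var(hyper_pmf))
```

Changing `N` and `n` at the top makes it possible to explore how the two distributions approach each other when the sample is a small fraction of the population.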
- x:10. This exercise uses the Law of Large Numbers,
Central Limit Theorem and Normal Approximation to Binomial Distributions applets
for the textbooks.
Use the first of these to illustrate the Law of large numbers. Choose
two dice, tick the boxes to add the mean and roll totals to the figure,
and roll the dice a few times. Make sure you understand the numbers
displayed (in particular, the mean value). Then roll the dice a larger
number of times (say 40, or even more) to see the stabilization of the
average with increasing number of trials.
Use the two other applets to illustrate the approximate normal distribution of an average of
i.i.d. variables. Explore how large a value of n is required to make the approximation
good for the exponential distribution and the binary distributions with p=0.5, 0.7, 0.9 and 0.95,
and try to explain why.
- x:11. This exercise uses the Confidence Intervals applet
for the textbooks. First explore the meaning of the controls for
Confidence level and Sample size when you generate new samples. Reset
these values at their default values (95 and 20, respectively), and
sample 25 intervals. Note the number of intervals that cover the true
value (the Hit value). Reset and repeat for a total of 30 times.
Describe the distribution of the number of hits using a stemplot and suitable statistics, and
summarize your findings. If you repeated the experiment very many times,
what would you expect the average number and average proportion of hits to be?
(If you continue the trials without resetting, the Percent hit gives you the proportion
across all runs of the experiment.)
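The repeated-sampling idea in x:11 can be imitated without the applet. The sketch below makes assumptions that may differ from the applet: samples are drawn from a normal population with mean 0 and known standard deviation 1, and each interval is the z-interval for the mean. The name `hits_in_run` and the seed are illustrative.

```python
import random
from statistics import NormalDist, mean

def hits_in_run(n_intervals=25, sample_size=20, level=0.95, rng=None,
                mu=0.0, sigma=1.0):
    """Number of z-intervals (known sigma) out of n_intervals that cover mu."""
    rng = rng or random.Random()
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_star * sigma / sample_size ** 0.5
    hits = 0
    for _ in range(n_intervals):
        xbar = mean(rng.gauss(mu, sigma) for _ in range(sample_size))
        hits += abs(xbar - mu) <= margin   # True counts as 1 hit
    return hits

rng = random.Random(801)
counts = [hits_in_run(rng=rng) for _ in range(30)]
print(sorted(counts))
print("average number of hits per 25 intervals:", mean(counts))
```

The sorted list of counts is a convenient starting point for a stemplot.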
- x:12. This exercise uses the Statistical Significance applet
for the textbooks. Keep the default settings, change the observed mean to 1, and
update the calculation. What is the P-value for testing H0 (mu=0) against
Ha (mu>0)? Is the test significant at significance level 0.05? Write down the P-value
and indicate its significance (no/yes). Now change the alternative hypothesis to a two-sided
alternative (Ha: mu<>0), and repeat the procedure. Do the same for the other one-sided
alternative (Ha: mu<0). Repeat all these steps while you change the observed mean from 1 to 0
in steps of 0.1. Display all your results (P-values and significance) in tabular form with the different
values of the sample mean along the rows and the 3 alternative hypotheses along the columns. Describe the patterns you see in the P-values in the
columns for the different Ha, and explain your findings.
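The table requested in x:12 can be cross-checked numerically. The following Python sketch assumes a z-test with known standard deviation; the applet's default sample size and standard deviation are not stated on this page, so the values n = 10 and sigma = 1 below are placeholders that should be replaced by the applet's actual settings.

```python
from statistics import NormalDist

def z_test_p_values(xbar, mu0=0.0, sigma=1.0, n=10):
    """P-values of the z-test of H0: mu = mu0 for the three alternatives.

    sigma and n are placeholder assumptions, not the applet's documented defaults.
    """
    z = (xbar - mu0) / (sigma / n ** 0.5)
    upper = 1 - NormalDist().cdf(z)      # Ha: mu > mu0
    lower = NormalDist().cdf(z)          # Ha: mu < mu0
    two_sided = 2 * min(upper, lower)    # Ha: mu <> mu0
    return upper, two_sided, lower

print("xbar   mu>0    mu<>0   mu<0")
for step in range(10, -1, -1):
    xbar = step / 10
    cells = "  ".join(f"{pv:.4f}" for pv in z_test_p_values(xbar))
    print(f"{xbar:4.1f}  {cells}")
```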
- x:13. Explain in simple language why a test significant at
the 1% level is also significant at the 5% level.
- x:14. A study on whether dogs could be trained to detect lung and breast cancer by sniffing
exhaled breath samples was carried out in 2005, reported in McCulloch et al. (2006), Integrative
Cancer Therapies, 5:30-39. Multiple dogs were involved in a total of 125 breast cancer trials. In each trial there were four
control samples and one cancer sample, and the dog was supposed to identify the cancer sample
by lying down next to it. In 110 out of the 125 trials, the dogs correctly identified the cancer
sample. Construct a 95% confidence interval for the true proportion of times the dogs correctly
identify a breast cancer sample. Also use the data to test the two null hypotheses:
(1) pure guessing (p=0.2), and (2) same sensitivity as for lung
cancer samples (p=0.99).
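As a hedged illustration for x:14, the Python sketch below computes the large-sample (Wald) confidence interval and exact binomial tail probabilities under each null hypothesis. It is one possible approach, not necessarily the one used in the course solution. Note that for p=0.99 the expected number of failures, 125 times 0.01, is small, so a normal approximation is questionable there and the exact tails are the safer reference.

```python
from math import comb, sqrt
from statistics import NormalDist

x, n = 110, 125          # correct identifications, trials
p_hat = x / n
z_star = NormalDist().inv_cdf(0.975)

# Large-sample (Wald) 95% confidence interval for the true proportion.
se = sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z_star * se, p_hat + z_star * se)

def binom_pmf(k, p0):
    """P(X = k) for X ~ binomial(n, p0)."""
    return comb(n, k) * p0**k * (1 - p0)**(n - k)

def exact_tails(p0):
    """Exact P(X <= x) and P(X >= x) under H0: p = p0."""
    lower = sum(binom_pmf(k, p0) for k in range(0, x + 1))
    upper = sum(binom_pmf(k, p0) for k in range(x, n + 1))
    return lower, upper

print("estimate and 95% CI:", p_hat, ci)
print("tails under p = 0.20:", exact_tails(0.2))
print("tails under p = 0.99:", exact_tails(0.99))
```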
- x:15. Supplementary Exercises 7.102-4 study the effect of piano
lessons on the spatial-temporal
reasoning of preschool children. The data involve 34
children who took piano lessons and a control group of 44 children. The
data take only small whole-number values, and there are many ties.
Use the Wilcoxon (Mann-Whitney) rank sum test to decide whether piano lessons improve
spatial-temporal reasoning.
- x:16. Supplementary Exercises 6.95 and 7.64 study the accuracy of radon detectors
based on readings from 12 detectors placed in a chamber holding 105 picocuries per liter of radon.
In this exercise we explore whether the median reading differs
significantly from the true value 105. Give a justification for using nonparametric
methods based on the information you have (from previous descriptive analysis) about
the distribution. Then carry out a nonparametric test of the question of interest; state your
hypotheses explicitly; draw conclusions, and compare with those of the previous
analysis.
- x:17. For IPS 7e Supplementary Exercises 9.2, 9.4, 9.22 and
9.57, describe the statistical model and statistical hypotheses you
would use to analyze the data. In particular, would you use Model I
(independent multinomials) to compare several populations, or would you
use Model II (a single multinomial) to assess independence between two
categorical variables?
- x:18. This exercise uses the One-Way ANOVA applet
at the PSLS website.
- Start by exploring the impact of within-group sample size (n) on the
F-statistic. What happens when you increase/decrease n, and why?
- Set the within-group sample size at n=5. Move the black dots
representing the group means up or down until you arrive at a P-value of
0.05 (or very close to 0.05). What value does the F-statistic have?
Check your result by finding the 95th percentile of the relevant
F-distribution in a statistical table and/or using statistical software.
- How did the F-statistic change when you moved the group means
further from/closer to each other? Explain why this makes sense.
- Now explore the impact of increasing the within-group variation by
dragging the standard deviation slider to the right. Describe what
happens to the F-statistic and the P-value, and explain why.
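The behaviour explored with the applet in x:18 follows from the formula for the F-statistic. Here is a minimal Python sketch for the balanced case, assuming that every group has the same size n and the same within-group standard deviation; the function name and the example numbers are made up and do not come from the applet.

```python
def f_statistic(group_means, sd, n):
    """One-way ANOVA F-statistic and degrees of freedom.

    Assumes k groups of equal size n that share the within-group
    standard deviation sd, so that MSE = sd**2.
    """
    k = len(group_means)
    grand_mean = sum(group_means) / k
    msg = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    mse = sd ** 2
    return msg / mse, (k - 1, k * (n - 1))

print(f_statistic([1, 2, 3], sd=1, n=5))        # baseline
print(f_statistic([1, 2, 3], sd=1, n=10))       # larger groups
print(f_statistic([1, 2, 3], sd=2, n=5))        # more within-group variation
print(f_statistic([1.5, 2, 2.5], sd=1, n=5))    # group means closer together
```

The P-value is not computed here because the Python standard library has no F-distribution; a statistical table or statistical software is needed for that step.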
- x:19. We previously analyzed data on the impact of piano
lessons on spatial-temporal reasoning in preschool children
(Supplementary Exercise 7.102). The control group was actually made up of 3
different groups: children who took singing lessons or computer lessons
or who received no extra lessons; the group2 variable gives this
classification of the children. Make a table giving the sample size,
mean, standard deviation and standard error for each group. Then analyze
the data to compare the gains in spatial-temporal reasoning among the 4
groups. If an overall test for homogeneity among groups is significant,
continue your analysis by pairwise comparisons between groups, using a
suitable procedure. Summarize your results and present your findings by
a table and graph of group means with associated confidence
intervals.
- x:20. This problem is based on an article
in the Journal of Wildlife Diseases; follow this link
to the article. The article reports
multiple analyses of which we will only focus on a few (although others
can be questioned as well).
- Describe the study in general terms, including e.g. study type,
statistical design and blinding.
- Consider Figure 1. Determine the statistical design and the number
of animals the figure is based on. Suggest a possible statistical model
for the data, and review the assumptions it involves. Identify the
analysis reported in the paper, and discuss whether you think this analysis
is justified and correct for the problem examined, results presented and conclusions drawn.
- Consider Figure 2. Determine again the statistical design the figure
is based on, and critique the actual analysis carried out in a similar
manner as above. Focus your attention on any new issues with the
analysis.
- x:21. Frank Anscombe published a paper in 1973 (The American Statistician 27, 17-21) to
illustrate the importance of graphically assessing the data used for simple linear regression. He
constructed four datasets to illustrate his points (datafile anscombe), each consisting of
11 pairs of variables (x,y). The first 3 datasets use the same x-variable (denoted x1_3).
For each of the four datasets, create a scatterplot for (x,y), add the regression line for
prediction of y from x to the plot, and predict y for x=10. Examine the output of the regression analysis,
and discuss in which of these datasets you think the regression could be used to describe the dependence
of y on x.
- x:22. Supplementary Exercise 10.33 studied serum retinol and C-reactive protein (CRP)
values measured on 40 children. We will use
these data to illustrate Spearman's rank correlation coefficient. It is computed in three steps: (i)
rank the values for the x-variable; (ii) rank the values for the
y-variable; and (iii) compute the (Pearson) correlation between the two
columns of ranks. For large n (say n>30), a t-test for no association (correlation=0) can be
computed with the rank correlation coefficient in the same way as for
the Pearson correlation. For small n, use instead a table with critical
values for different significance levels, e.g. here. A good resource
for very exact percentiles for Spearman's rho is Ramsey (1989), J Educat Statist 14, 245-253. Some software
packages (e.g. Minitab and Stata) use pretty crude and inaccurate approximations for small n.
For the CRP and retinol data, compute the Spearman rank correlation coefficient, and its
statistical significance, both directly in the software and indirectly via the ranks.
Compare your findings with those for the Pearson correlation
coefficient. Additionally, explore how strongly the Spearman rank correlation is affected
by the outlier in the golf scores data (Supplementary Exercises 2.2 and 10.7), and compare also
here with your findings for the Pearson correlation.
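The three-step recipe for Spearman's rank correlation in x:22 can be written out as follows. This Python sketch gives tied values the average of their ranks, which is the usual convention but is an assumption here. The small dataset at the end is invented for illustration and is not the retinol or golf data.

```python
from math import sqrt

def average_ranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Steps (i)-(iii): rank x, rank y, then correlate the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def t_statistic(r, n):
    """t-statistic (n - 2 degrees of freedom) for testing correlation = 0; needs |r| < 1."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Invented data with one extreme x-value.
x = [1, 2, 3, 4, 100]
y = [2, 1, 4, 3, 5]
print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
```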
- x:23. An aquaculture site holds 1000 fish (actual numbers are usually much higher). It is of interest to determine whether a particular
pathogen is present in the population. Because of the infectious nature of the pathogen, a sporadic presence
is assumed quite implausible, so the focus is on identifying a prevalence (proportion of positives)
of at most 10%.
- Determine a sample size that would give 95% confidence that the prevalence does not exceed 10%
if all samples were negative. Carry out the calculation both with an infinite and finite population
assumption - are the results substantially different?
- The 1000 fish are actually held in 10 separate cages/tanks, say of 100 fish each. Does this affect your sample size calculation?
If yes, revise your calculation. If not, discuss the assumptions you
make in order for this to be valid.
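The two calculations in the first part of x:23 can be sketched in Python as shown below. The sketch assumes simple random sampling of fish and a perfect diagnostic test (no false negatives or false positives), neither of which is stated in the exercise. It also takes the design prevalence of 10% to mean exactly 100 infected fish in the finite-population case. The function names are illustrative.

```python
from math import ceil, comb, log

def n_infinite(prevalence=0.10, confidence=0.95):
    """Smallest n with (1 - prevalence)**n <= 1 - confidence (infinite population)."""
    return ceil(log(1 - confidence) / log(1 - prevalence))

def prob_all_negative(n, pop_size=1000, n_infected=100):
    """P(no infected fish among n sampled without replacement), hypergeometric."""
    return comb(pop_size - n_infected, n) / comb(pop_size, n)

def n_finite(pop_size=1000, prevalence=0.10, confidence=0.95):
    """Smallest n for which P(all negative) <= 1 - confidence in a finite population."""
    n_infected = round(pop_size * prevalence)
    n = 1
    while prob_all_negative(n, pop_size, n_infected) > 1 - confidence:
        n += 1
    return n

print("infinite population:", n_infinite())
print("finite population:  ", n_finite())
```

Calling `n_finite(pop_size=100)` gives the corresponding number for a single cage of 100 fish, which may be useful when thinking about the second part.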
- 2001 home assignment 1. link
- 2003 home assignment 4. link
- 2006 final exam: 3. link
- 2009 home assignment 2. link
Solutions to extra exercises:
x:1, x:2, x:3,
x:4, x:5, x:6,
x:7, x:8 (do-file; R-program),
x:9, x:10, x:11, x:12,
x:13, x:14, x:15, x:16,
x:17, x:18, x:19 (do-file),
x:20, x:21 (do-file),
x:22 (do-file; R-program),
x:23,
home assignment 2001:1, home assignment 2003:4,
final exam 2006:3, home assignment 2009:2,
Henrik Stryhn
(hstryhn@upei.ca) 2020-11-28