Analysis of Variance (ANOVA)

ANOVA
Create the ANOVA table quickly using EXCEL. Can you remember what to do?!
Let's try a one-way ANOVA in R:
> anova_dataset1 <- read.csv( "c:/anova_data_set.csv", header=TRUE )
Visualise!
> boxplot( anova_dataset1 )
Box plot: what can we see?
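If you don't have the CSV to hand, a small stand-in data frame in the same wide format (one column per group) lets you follow along. The values below are invented for illustration, so they will not reproduce the exact F statistic shown later in these slides:

```r
# Hypothetical stand-in for anova_data_set.csv: four groups in "wide"
# format, one column per group (values made up for illustration)
anova_dataset1 <- data.frame(
  Group.1 = c(38, 39, 42, 40, 41),
  Group.2 = c(38, 40, 41, 39, 42),
  Group.3 = c(42, 44, 43, 41, 45),
  Group.4 = c(44, 46, 45, 47, 43)
)

# One box per column -- a quick visual check of group medians and spread
boxplot( anova_dataset1 )
```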

After the test
We conducted a one-way ANOVA to find out whether there was a difference between at least one pair of groups, but we did not carry out any further testing. A statistically significant F value tells us only that somewhere there is a meaningful difference between the group means; it does not tell us which groups differ significantly from each other. To find that out, we must conduct post hoc tests. There are a variety of post hoc tests available. Some are more conservative, making it more difficult to find statistically significant differences between groups, whereas others are more liberal.

ANOVA in R

> names( anova_dataset1 )
[1] "Group.1" "Group.2" "Group.3" "Group.4"
> s_anova_dataset1 <- stack( anova_dataset1 )
> names( s_anova_dataset1 )
[1] "values" "ind"
> head( s_anova_dataset1 )
  values     ind
1     38 Group.1
2     39 Group.1
3     42 Group.1
4     40 Group.1
5     41 Group.1
6     38 Group.2
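The stack() call above is what converts the wide data frame (one column per group) into the long format that the modelling functions expect. A tiny sketch with a made-up two-column frame shows what it does:

```r
# stack() reshapes a "wide" data frame (one column per group) into "long"
# format: a 'values' column plus an 'ind' factor recording the source column
wide <- data.frame( a = c(1, 2), b = c(3, 4) )   # made-up toy frame
long <- stack( wide )

long
#   values ind
# 1      1   a
# 2      2   a
# 3      3   b
# 4      4   b
```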

ANOVA
> oneway.test( values ~ ind, var.equal=TRUE, data=s_anova_dataset1 )
        One-way analysis of means
data: values and ind
F = 8.4272, num df = 3, denom df = 16, p-value = 0.001376
> model <- aov( values ~ ind, data=s_anova_dataset1 )
> summary(model)
            Df Sum Sq Mean Sq F value  Pr(>F)
ind          3  89.75   29.92   8.427 0.00138 **
Residuals   16  56.80    3.55

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc Tests
All post hoc tests use the same basic principle: they allow you to compare each group mean to each other group mean and determine whether they are significantly different, while controlling for the number of group comparisons being made. We'll look at one test: the Tukey HSD (honestly significant difference) post hoc test. The Tukey test compares each group mean to each other group mean by using the familiar formula described for t tests. More specifically, it is the mean of one group minus the mean of a second group, divided by the

standard error. Our final value is an observed Tukey HSD value.

Tukey HSD
> TukeyHSD( model )
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = values ~ ind, data = s_anova_dataset1)
$ind
                diff        lwr      upr     p adj
Group.2-Group.1  0.8 -2.6092986 4.209299 0.9063674
Group.3-Group.1  3.2 -0.2092986 6.609299 0.0697312
Group.4-Group.1  5.4  1.9907014 8.809299 0.0017410
Group.3-Group.2  2.4 -1.0092986 5.809299 0.2238581
Group.4-Group.2  4.6  1.1907014 8.009299 0.0068097
Group.4-Group.3  2.2 -1.2092986 5.609299 0.2890583
Which groups differ?

Another example
> group1 <- c(5,5,4,4,3)

> group2 <- c(5,4,4,3,3)
> group3 <- c(4,3,2,2,1)
> groups <- data.frame( group1, group2, group3 )
> names( groups )
[1] "group1" "group2" "group3"
> head( groups )
  group1 group2 group3
1      5      5      4
2      5      4      3
3      4      4      2
4      4      3      2
5      3      3      1

ANOVA example
> sgroups <- stack( groups )
> oneway.test( values ~ ind, var.equal=TRUE, data=sgroups )
        One-way analysis of means
data: values and ind
F = 4.963, num df = 2, denom df = 12, p-value = 0.02687
> model2 <- aov( values ~ ind, data=sgroups )

> summary(model2)
            Df Sum Sq Mean Sq F value Pr(>F)
ind          2  8.933   4.467   4.963 0.0269 *
Residuals   12 10.800   0.900
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA example
> TukeyHSD( model2 )
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = values ~ ind, data = sgroups)
$ind
              diff       lwr        upr     p adj
group2-group1 -0.4 -2.000718  1.2007182 0.7866927
group3-group1 -1.8 -3.400718 -0.1992818 0.0277219
group3-group2 -1.4 -3.000718  0.2007182 0.0891867

How to write this up!
If we were to describe the last ANOVA we performed, we would describe it as follows:

A one-way ANOVA to compare the three groups was performed. This analysis produced a statistically significant result (F(2,12) = 4.96, p < .05). Post hoc Tukey tests revealed that the only significant difference between groups was found between group 1 (M = 4.20) and group 3 (M = 2.40). Are these numbers correct? Caveat: we would also mention what we were investigating.

A note on ANOVA in R
The oneway.test( ) function is well behaved when it comes to missing values, because the default is to omit them from the analysis. This means you should check your data for missing values beforehand, as this procedure will not tell you about them if they exist! Now let's see what happens when we have more than one independent

variable! You should know from our work on multiple regression that this is not an unlikely scenario to come across.

Factorial ANOVA
Factorial ANOVA is the technique to use when you have one continuous (i.e., interval or ratio scaled) dependent variable and two or more categorical (i.e., nominally scaled) independent variables. For example, suppose I want to know whether boys and girls differ in the amount of television they watch per week, on average. Suppose I also want to know whether children in different regions of the United States (i.e., East, West, North, and South) differ in their average amount of television watched per week. In this example, average amount of television watched per week is my dependent variable, and gender and region of the country are my two independent variables.
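The television example can be sketched with simulated data. Everything here (the variable names, cell sizes, and rnorm values) is invented purely to show the shape of a factorial design in R:

```r
# Sketch of the TV example with simulated data (all numbers invented):
# a balanced 2 x 4 design with 5 children per gender/region cell
set.seed(1)
tv <- data.frame(
  gender = factor( rep( c("boy", "girl"), each = 20 ) ),
  region = factor( rep( c("East", "West", "North", "South"), times = 10 ) ),
  hours  = rnorm( 40, mean = 14, sd = 3 )   # weekly hours of TV watched
)

# One continuous response, two categorical predictors: a factorial ANOVA
tv_aov <- aov( hours ~ gender * region, data = tv )
summary( tv_aov )   # main effects for gender and region, plus the interaction
```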

This is known as a 2 × 4 factorial analysis, because one of my independent variables has two levels (gender) and one has four levels (region).

Example
Let's say you want to compare the average reduction in blood pressure on certain dosages of a drug. The factor is drug dosage. Suppose it has three levels: 10mg per day, 20mg per day, or 30mg per day. Suppose someone else studies the response to that same drug and examines whether the number of times taken per day (once or twice) has any effect on blood pressure. In this case, the factor is number of times per day, and it has two levels: once and twice.

Suppose you want to study the effects of dosage and number of times taken together, because you believe both may have an effect on the response. We use a two-way ANOVA to answer this.

Two-way ANOVA
It's an extension of the one-way ANOVA, but it does more than just run two separate ANOVAs, because the two factors you use may operate on the response differently together than they would separately. They may interact. We'll look at:

setting up the model, the ANOVA table, the F-tests, and drawing the appropriate conclusions.

Two-way ANOVA
The two-way ANOVA model extends the ideas of the one-way ANOVA model and adds an interaction term to examine how various combinations of the two factors affect the response. The two-way ANOVA model contains two factors, A and B, and each factor has a certain number of levels (say i levels of factor A and j levels of factor B). In the drug study example you have A = drug dosage with i = 1, 2, or 3 and B = number of times

taken per day with j = 1 or 2. Each person involved in the study is subject to one of the three drug dosages and will take the drug in one of the two ways given. That means you have 3 * 2 = 6 different combinations of factors A and B that you can apply to the subjects, and you can study them in the two-way ANOVA model.

Treatments
Each different combination of levels of factors A and B is called a treatment in the model. For example, one treatment is the combination of 20mg of the drug taken in two doses of 10mg each per day. If factor A has i levels and factor B has j levels, you have i * j different combinations of treatments in your two-way ANOVA model.
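The full list of i * j treatments can be generated with expand.grid(). This is only a sketch of the drug example; the level labels are ours:

```r
# All i * j = 3 * 2 = 6 treatment combinations for the drug example
treatments <- expand.grid(
  dosage = c("10mg", "20mg", "30mg"),   # factor A: i = 3 levels
  times  = c("once", "twice")           # factor B: j = 2 levels
)

nrow( treatments )   # 6 treatments, one per dosage/times combination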

Two-way ANOVA terms
The two-way ANOVA model contains three terms:
The main effect A: a term for the effect of factor A on the response.
The main effect B: a term for the effect of factor B on the response.
The interaction of A and B: the effect of the combination of factors A and B (denoted AB).
In one-way ANOVA we separated total variability into: SS-total = SS-within + SS-between (SS-total = SS-treatments + SS-error). We now have a new main effect and an interaction between A and B, so the decomposition gains additional terms.

Two-way ANOVA
The sums of squares equation for the two-way ANOVA model is: SS-total = SS-A + SS-B + SS-AB + SS-within. Here SS-total is the total variability in the y-values; SS-A is the sum of squares due to factor A (representing the variability in the y-values explained by factor A), and similarly for SS-B and factor B. SS-AB is the sum of squares due to the interaction of factors A and B, and SS-within (aka SS-error) is the amount of variability left unexplained, and deemed error. Now let's look into the interaction more.

Interaction Effects

The interaction effect is the heart of the two-way ANOVA model. Knowing that the two factors may act together in a different way than they would separately is important and must be taken into account. What is it? Interaction is when two factors meet, or interact with each other, on the response in a way that's different from how each factor affects the response separately. For example, before you can test to see whether dosage of medicine (factor A) or number of times taken (factor B) is important in explaining changes in blood pressure, you have to look at how they operate together to affect blood pressure. That is, you have to examine the interaction term.

Interaction
Suppose you're taking one type of medicine for cholesterol and one

medicine for a heart problem. Suppose researchers only looked at the effects of each drug alone, finding that each one was good for managing the problem for which it was designed, with little to no side effects. Now you come along and mix the two drugs in your system. As far as the individual study results are concerned, all bets are off. With only those separate studies to go on, the researchers have no idea how the drugs will interact with each other, and you could be in a great deal of trouble very quickly.

Interaction Plots
In the two-way ANOVA model, you have two factors and their interaction. A number of results could come out of this model in terms of

significance of the individual terms, as you can see in the following:
Factors A and B are both significant.
Factor A is significant but not factor B.
Factor B is significant but not factor A.
Neither factor A nor B is significant.
The interaction term AB is significant.

Interaction Plots

Both A and B are significant in the model (no interaction present). The lines represent the levels of the times-per-day factor (B); the x-axis represents the levels of the dosage factor (A); and the y-axis represents the average value of the response variable y, change in blood pressure, at each combination of treatments. Two parallel lines in an interaction plot mean a lack of an interaction effect.

Interaction Plots
Factor A is significant but not factor B: blood pressure changes across dosage levels whether you take the drug once or twice a day. However, the two lines are so close together that whether you take the

drug once or twice a day has no effect. So factor A (dosage) is significant, and factor B (times per day) isn't.

Interaction Plots
Factor B is significant but not factor A: the lines are flat across dosage levels, indicating that dosage has no effect on blood pressure. However, the two lines for times per day are spread apart, so their effect on blood pressure is significant. (Parallel lines mean no interaction effect.)

Interaction Plots
Neither factor is significant: we see two flat lines that are very close to each other. This represents the case where neither factor A nor factor B is significant.

Interaction Plots
Interaction term is significant: this is the most interesting interaction plot of all. The big picture is that because the two lines cross, factors A and B interact with each other in the way that they operate on the response. If they didn't interact, the lines would be parallel. When you take the drug two times per day at the low dose, you get a low change in blood pressure; as you increase the dosage, blood pressure increases also. But when you take the drug once per day, the opposite result happens.
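Base R can draw such a plot with interaction.plot(). The cell means below are invented purely to reproduce the crossing-lines pattern just described:

```r
# Sketch of a crossing ("interaction present") plot with invented cell means:
# taken twice a day, the blood-pressure change rises with dosage;
# taken once a day, it falls
dosage <- factor( rep( c(10, 20, 30), times = 2 ) )
times  <- factor( rep( c("once", "twice"), each = 3 ) )
bp     <- c( 12, 8, 4,    # once a day: decreasing response (made-up values)
              4, 8, 12 )  # twice a day: increasing response

# One line per level of the trace factor; crossing lines suggest an
# interaction, parallel lines suggest none
interaction.plot( x.factor = dosage, trace.factor = times, response = bp )
```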

In practice
> data( "weightgain", package = "HSAUR" )
> str( weightgain )
> plot.design( weightgain )
> wg_aov <- aov( weightgain ~ source * type, data = weightgain )
> summary(wg_aov)

Two-way ANOVA table
> summary(wg_aov)
            Df Sum Sq Mean Sq F value Pr(>F)
source       1    221   220.9   0.988 0.3269

type         1   1300  1299.6   5.812 0.0211 *
source:type  1    884   883.6   3.952 0.0545 .
Residuals   36   8049   223.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
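As a sanity check, the sums of squares printed in this table should (up to rounding of the printed values) satisfy the decomposition SS-total = SS-A + SS-B + SS-AB + SS-within:

```r
# Sums of squares read off the weightgain table above; the residual SS is
# recovered as Mean Sq * Df (values are rounded as printed in the output)
ss_source   <- 220.9
ss_type     <- 1299.6
ss_interact <- 883.6
ss_within   <- 223.6 * 36   # residual Mean Sq * residual Df = 8049.6

# SS-total = SS-A + SS-B + SS-AB + SS-within
ss_total <- ss_source + ss_type + ss_interact + ss_within
ss_total   # 10453.7: the total variability in weight gain, up to rounding
```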