How to check ANOVA assumptions

ANOVA (Analysis of Variance) is a statistical technique used to compare the means of multiple groups to determine if there are significant differences among them. For the results of ANOVA to be valid, three fundamental assumptions must hold: the residuals must be normally distributed, the variances must be homogeneous, and the observations must be independent. In this post, we will demonstrate how to verify each of these assumptions using Python.

Step 1: Normality of Residuals with Q-Q Plots

Why it Matters: ANOVA assumes that the residuals (i.e., the errors between observed and predicted values) follow a normal distribution. If this assumption is violated, the resulting p-values and confidence intervals may be inaccurate, potentially leading to incorrect conclusions.

We use a Q-Q plot to check this assumption. Additionally, we support the visual assessment with a formal Shapiro-Wilk test.

How to do that in Python

Let’s create a Q-Q plot to assess the normality of the residuals in Python.

We start by generating three groups of normally distributed data points using np.random.normal(). Each group has 30 observations, and they have slightly different means. We combine all of these into a pandas DataFrame called data, where we have two columns: value (the actual observations) and group (indicating to which group each value belongs).

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import shapiro

# Generate normal data for three groups
np.random.seed(42)
group1 = np.random.normal(loc=5, scale=1, size=30)
group2 = np.random.normal(loc=5.5, scale=1, size=30)
group3 = np.random.normal(loc=6, scale=1, size=30)

# Combine data into a DataFrame
data = pd.DataFrame({
    'value': np.concatenate([group1, group2, group3]),
    'group': ['Group 1'] * 30 + ['Group 2'] * 30 + ['Group 3'] * 30
})

Then we use the statsmodels library to fit an ANOVA model. Specifically, we use Ordinary Least Squares (OLS) regression to model the relationship between value and group. The formula 'value ~ group' tells the model to predict value based on the categorical variable group. After fitting the model, we extract the residuals (errors) using model.resid, which will be used to assess normality.

# Fit an ANOVA model
model = sm.OLS.from_formula('value ~ group', data).fit()
residuals = model.resid

The Q-Q plot is created with the sm.qqplot() function from statsmodels. It plots the quantiles of the residuals against the theoretical quantiles of a normal distribution. If the residuals are normally distributed, the points should fall approximately along the reference line added by line='s' (a line based on the sample's mean and standard deviation).

# Q-Q plot
sm.qqplot(residuals, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()

The Shapiro-Wilk test is a formal statistical test of normality. Its null hypothesis is that the data come from a normal distribution: a p-value greater than 0.05 means we cannot reject normality, while a p-value below 0.05 suggests the residuals deviate significantly from a normal distribution.

# Shapiro-Wilk test for normality
stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk Test: p-value = {p_value}")

OUTPUT: Shapiro-Wilk Test: p-value = 0.89

If your residuals are not normally distributed, consider applying a data transformation (e.g., log or square root transformation) to better meet the normality assumption.
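For illustration, here is a minimal sketch of such a remedy, reusing the data and model from above. The log1p transform and the log_value column name are our own additions, not part of the original example.

# Illustrative remedy (our addition): log-transform the response and refit.
# np.log1p computes log(1 + x), which also handles zero values safely.
data['log_value'] = np.log1p(data['value'])
log_model = sm.OLS.from_formula('log_value ~ group', data).fit()
stat, p_value = shapiro(log_model.resid)
print(f"Shapiro-Wilk after log transform: p-value = {p_value:.3f}")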

Step 2: Homogeneity of Variances with Residual Plots

Why it Matters: ANOVA also assumes that the variances across the groups are equal, a property known as homoscedasticity. If group variances differ significantly, it can inflate the Type I error rate, leading to an increased risk of falsely detecting significant differences. Homogeneity of variances ensures that the variability within each group is comparable, which is a key condition for making valid statistical inferences.

We plot the residuals against the fitted values (the model's predictions) to check whether the group variances are equal.

How to do that in Python

First, we extract the fitted values (i.e., the predicted values) using model.fittedvalues. We then create a residual plot by plotting the residuals against these fitted values. The red dashed line at zero represents the expected value of residuals if there is no bias. The goal here is to see if the residuals are evenly spread around this line. If you notice a clear pattern, such as a funnel shape, this suggests heteroscedasticity (i.e., unequal variances).

# Fitted values
fitted_values = model.fittedvalues

# Residual plot
plt.scatter(fitted_values, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs Fitted Values')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
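If you want a formal test to complement the visual check, Levene's test is a common choice. The following sketch is our addition, not part of the original walkthrough, and reuses the three simulated groups from Step 1.

# Levene's test (supplementary check): the null hypothesis is that all
# groups have equal variances, so a p-value above 0.05 is consistent
# with homoscedasticity.
from scipy.stats import levene
stat, p_value = levene(group1, group2, group3)
print(f"Levene's Test: p-value = {p_value:.3f}")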

If unequal variances are detected, you can either transform the data or use a more robust version of ANOVA, such as Welch's ANOVA, which does not require the assumption of equal variances.
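As a rough sketch of that alternative, Welch's ANOVA is available in the third-party pingouin package (assumed installed here, e.g. via pip install pingouin):

# Welch's ANOVA via pingouin (our sketch; requires the pingouin package).
# It compares group means without assuming equal variances.
import pingouin as pg
welch_result = pg.welch_anova(dv='value', between='group', data=data)
print(welch_result)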

Step 3: Independence of Observations with Residual Analysis

Why it Matters: The independence assumption means that each observation should be unrelated to the others. If observations are not independent, the validity of the ANOVA results is compromised, as the data points may influence each other, leading to biased results. This issue often arises from poor experimental design or when data have a time or spatial structure. For example, repeated measures data or time series data can often lead to non-independent observations.

How to do that in Python

To check the independence of observations, we plot the residuals in the order they appear in the dataset. This helps us see if there are any systematic patterns. If there are noticeable trends (e.g., the residuals tend to increase or decrease over time), this suggests that the residuals are not independent, which could be problematic for ANOVA.

# Residuals ordered by observation
plt.plot(residuals, marker='o', linestyle='--')
plt.title('Residuals by Observation Order')
plt.xlabel('Observation Order')
plt.ylabel('Residuals')
plt.show()
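As a numeric complement to this plot (our addition, not part of the original steps), the Durbin-Watson statistic quantifies first-order autocorrelation in the residuals:

# Durbin-Watson statistic (supplementary check): values near 2 suggest
# independent residuals; values near 0 or 4 indicate positive or negative
# autocorrelation, respectively.
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.2f}")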

If non-independence is identified, consider redesigning your experiment to include randomization or using statistical methods that account for correlated observations, such as mixed-effects models. Mixed-effects models can model the correlation structure among observations, making them appropriate when independence is not achievable.
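As a rough sketch of what that could look like, assume each block of consecutive rows came from one subject. The subject column below is purely hypothetical and does not exist in the simulated data above; with real repeated-measures data it would come from your experimental design.

# Hypothetical example: invent a 'subject' grouping (10 rows per subject)
# and fit a random-intercept mixed-effects model that accounts for
# within-subject correlation.
data['subject'] = np.repeat(np.arange(9), 10)
mixed_model = sm.MixedLM.from_formula('value ~ group', data,
                                      groups=data['subject']).fit()
print(mixed_model.summary())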

Next Steps

Now that you understand how to check ANOVA assumptions, try these techniques on your own datasets! Verifying these assumptions is an important step before interpreting the results of any statistical analysis. If you encounter issues with your assumptions, consider applying data transformations or using alternative statistical methods like Welch's ANOVA or non-parametric tests. These methods can provide more reliable results when standard ANOVA assumptions are not met. Feel free to reach out if you have questions or need further guidance.
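For example, here is a minimal sketch of one such non-parametric alternative, the Kruskal-Wallis H-test, applied to the three simulated groups from Step 1:

# Kruskal-Wallis H-test (our sketch): a rank-based alternative to ANOVA
# that does not assume normally distributed residuals.
from scipy.stats import kruskal
stat, p_value = kruskal(group1, group2, group3)
print(f"Kruskal-Wallis Test: p-value = {p_value:.3f}")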

Happy analyzing! πŸš€
