ANOVA with Python for intermediates
In this blog post, we’ll walk through performing an ANOVA in Python, using an example with filtration rate as the response variable. Our dataset includes four factors: temperature (T), pressure (P), formaldehyde concentration (CoF), and stirring rate (RPM). With ANOVA, we want to identify which of these factors significantly affect the filtration rate. For example, does changing the temperature really change the filtration rate, or could the observed differences be explained by random variation alone? Based on the results, we build the best possible model for our design of experiments (DoE).
Understanding the Dataset
In a previous post, we visualized the data to get an idea of which factors and interactions might be relevant. This is important because it guides the ANOVA. We saw that temperature, concentration, and stirring rate appear to be important factors, while pressure appears to have little effect. Among the interactions, temperature:concentration and temperature:stirring rate appear relevant. This information tells us which terms to include in the model for the ANOVA.
ANOVA Basics
How do you approach an ANOVA in DoE? Well, there are basically two ways to go about it.
- All Parameters Model: Start by including all main effects and interactions. Gradually remove the parameter with the highest p-value (least significant) and refit the model until only significant parameters remain.
- Stepwise Addition Model: Begin with a subset of parameters known to be significant. Add additional parameters step-by-step, assessing significance after each addition.
In both cases, follow the hierarchy rule: whenever a higher-order term is in the model, its lower-order terms must be included as well. That means that when an interaction between two or more factors is significant, you keep the main effects of those factors in the model even if a main effect is not significant on its own.
For our example, we'll follow the stepwise addition approach, since our data visualization has already indicated which parameters are likely to be significant. We therefore start our ANOVA with the main effects temperature, concentration of formaldehyde, and stirring rate, and add the interactions temperature:concentration and temperature:stirring rate.
Performing ANOVA in Python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Load the full factorial dataset
df = pd.read_excel("example_filtration rate_fullfact.xlsx")
# Define the formula for the model
formula = 'Filtration_rate ~ T + CoF + RPM + T:CoF + T:RPM'
# Fit the model
model = ols(formula, data=df).fit()
# Perform ANOVA
anova_results = sm.stats.anova_lm(model)
# Print the ANOVA results
print(anova_results)
Step-by-Step Breakdown
1. Define the Formula for the Model
formula = 'Filtration_rate ~ T + CoF + RPM + T:CoF + T:RPM'
The formula specifies the relationship between the response variable (Filtration_rate) and the explanatory variables (T, CoF, RPM). Additionally, it includes interaction terms (T:CoF and T:RPM) to examine how combinations of these factors influence the filtration rate.
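If you want to check which terms a formula actually produces, you can look at the column names of the fitted model. The sketch below uses a small stand-in dataset with made-up response values (not the data from this post) and also shows the a*b shorthand, which expands to a + b + a:b.

```python
import itertools
import pandas as pd
from statsmodels.formula.api import ols

# Stand-in data: 2^3 full factorial in coded units, made-up responses.
demo = pd.DataFrame(list(itertools.product([-1, 1], repeat=3)),
                    columns=["T", "CoF", "RPM"])
demo["Filtration_rate"] = [45, 71, 48, 65, 68, 60, 80, 96]  # illustrative only

long_form = "Filtration_rate ~ T + CoF + RPM + T:CoF + T:RPM"
short_form = "Filtration_rate ~ T*CoF + T*RPM"  # a*b means a + b + a:b

# exog_names lists the columns of the design matrix built from the formula.
names_long = ols(long_form, data=demo).fit().model.exog_names
names_short = ols(short_form, data=demo).fit().model.exog_names

print(names_long)
print(set(names_long) == set(names_short))
```

Both formulas describe the same model; the long form just makes every term explicit, which is handy when you add or remove terms one at a time.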
2. Fit the Model
model = ols(formula, data=df).fit()
This line fits a linear model to the data using the Ordinary Least Squares (OLS) method.
3. Perform ANOVA
anova_results = sm.stats.anova_lm(model)
This line runs the ANOVA on the fitted model. By default, anova_lm computes Type I (sequential) sums of squares; you can request Type II with typ=2.
4. Print the ANOVA Results
print(anova_results)
This line prints the ANOVA table.
Example Output and Interpretation
The ANOVA table provides the following columns:
- Degrees of Freedom (df): The number of independent pieces of information used to estimate a term (or, for the residual row, left over to estimate the error).
- Sum of Squares (sum_sq): Measures the variation due to each factor.
- Mean Square (mean_sq): The sum of squares divided by the respective degrees of freedom.
- F-Value (F): The test statistic, calculated as the mean square of the term divided by the mean square of the residual.
- p-Value (PR(>F)): The probability of observing an F-value at least this large if the null hypothesis (the term has no effect) were true.
Or, in simpler language:
- Degrees of Freedom (df): How many independent pieces of information go into estimating a term.
- Sum of Squares (sum_sq): How much each factor contributes to the differences we see.
- Mean Square (mean_sq): The contribution of each factor per degree of freedom.
- F-Value (F): A big number means the factor explains much more variation than random noise does.
- p-Value (PR(>F)): If this is small (conventionally less than 0.05), the factor is considered statistically significant.
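To make the relationship between these columns concrete, here is the arithmetic for a single row of an ANOVA table. The numbers are made up for illustration and do not come from our filtration dataset; the p-value is taken from the upper tail of the F-distribution using scipy.stats.f.sf.

```python
from scipy import stats

# Made-up numbers for one term and the residual row of an ANOVA table.
sum_sq_term, df_term = 120.0, 1
sum_sq_resid, df_resid = 30.0, 10

# Mean square = sum of squares / degrees of freedom
mean_sq_term = sum_sq_term / df_term
mean_sq_resid = sum_sq_resid / df_resid

# F = mean square of the term / mean square of the residual
f_value = mean_sq_term / mean_sq_resid

# p = probability of an F-value at least this large under the null hypothesis
p_value = stats.f.sf(f_value, df_term, df_resid)

print(f"F = {f_value:.1f}, p = {p_value:.2e}")
```

A term with the same sum of squares but a noisier experiment (larger residual mean square) would get a smaller F-value and a larger p-value.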
In our example, the factors temperature, concentration, and stirring rate, along with the two interactions, all have p-values less than 0.05, indicating they are significant.
Let us now add some terms to the model to see whether or not they are significant. We add pressure and the interaction between pressure and concentration of formaldehyde, since this was the pressure-related interaction with the highest effect in the visualization step.
# Adding additional effects involving pressure to confirm insignificance
# Import data
df = pd.read_excel("example_filtration rate_fullfact.xlsx")
# Define the formula for the model
formula = 'Filtration_rate ~ T + P + CoF + RPM + T:CoF + T:RPM + P:CoF'
# Fit the model
model = ols(formula, data=df).fit()
# Perform ANOVA
anova_results = sm.stats.anova_lm(model)
# Print the ANOVA results
print(anova_results)
As we can see, neither the main effect pressure nor the interaction between pressure and concentration of formaldehyde is significant. So we finalize our model without the pressure-related terms.
Adding Quadratic Terms
If we had run a central composite design (CCD), we could add quadratic terms to account for non-linear dependencies within the data. You can do that as shown below:
# Adding quadratic terms
formula_quad = 'Filtration_rate ~ T + CoF + RPM + T:CoF + T:RPM + I(T**2) + I(CoF**2) + I(RPM**2)'
model_quad = ols(formula_quad, data=df).fit()
anova_results_quad = sm.stats.anova_lm(model_quad, typ=2)
print(anova_results_quad)
But remember that you also need an adequate design for that. A simple two-level full factorial design cannot support quadratic terms, because two levels per factor are not enough to estimate curvature.
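To see why a two-level design cannot support quadratic terms, here is a small NumPy sketch in coded units. With levels -1 and +1, the squared column is 1 in every run, so it is identical to the intercept column and the model matrix loses rank. As a simple contrast, the sketch uses a three-level grid instead of a real CCD; that is an assumption made to keep the example short.

```python
import itertools
import numpy as np

def model_matrix(levels):
    """Columns: intercept, T, CoF, T^2 for a full grid over the given levels."""
    grid = np.array(list(itertools.product(levels, repeat=2)), dtype=float)
    T, CoF = grid[:, 0], grid[:, 1]
    return np.column_stack([np.ones(len(grid)), T, CoF, T**2])

X_two_level = model_matrix([-1, 1])       # T^2 is always 1 -> same as intercept
X_three_level = model_matrix([-1, 0, 1])  # T^2 now varies between 0 and 1

rank_two = np.linalg.matrix_rank(X_two_level)
rank_three = np.linalg.matrix_rank(X_three_level)

print(rank_two, X_two_level.shape[1])      # rank is below the column count
print(rank_three, X_three_level.shape[1])  # full column rank
```

When the rank is lower than the number of columns, the quadratic effect cannot be separated from the intercept, no matter how many times you repeat the two-level runs.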
But…
Whenever you perform an ANOVA and create a model that describes your results, you should check if the model is a good description of the underlying data. That is what we will cover in our next blog post.