What is a QQ-Plot and why is it important?

After performing an ANOVA, it is crucial to validate the assumptions that underlie the statistical model we have created. One powerful tool for this purpose is the Quantile-Quantile plot, or QQ-Plot. In this post, we’ll explore what a QQ-Plot is, how it works, and why it is a vital part of the model validation process in Design of Experiments (DoE).

What is a QQ-Plot?

A QQ-Plot is a graphical tool used to assess whether a set of data plausibly comes from some theoretical distribution, such as the normal distribution. It plots the quantiles of the data against the quantiles of a specified theoretical distribution. If the data follows the specified distribution, the points on the QQ-Plot will approximately lie on a straight line.

What is a Normal Distribution?

A normal distribution, also called a Gaussian distribution, is a continuous probability distribution characterized by its symmetric, bell-shaped curve. It describes how data points tend to scatter around a central value, known as the mean. Typically, the scatter of your response variable caused by random error will follow a normal distribution.

Imagine you measured the temperature of a chemical reaction multiple times. You'd notice that most temperature readings are close to the average (mean) temperature, with fewer readings being much higher or much lower. This pattern creates a symmetric, bell-shaped curve when you plot it on a graph.

Key points about a normal distribution:

  1. Symmetry: The left side of the curve mirrors the right side.
  2. Mean: The center of the curve represents the average value.
  3. Frequency: Values closer to the mean are more common, while values further from the mean are less common.

One useful feature of the normal distribution is that it allows us to predict how likely it is to get a certain measurement:

  • About 68% of the data falls within one standard deviation of the mean.
  • About 95% of the data falls within two standard deviations of the mean.

So, if you know the mean and standard deviation of your reaction temperatures, you can estimate the probability of future temperature readings falling within specific ranges. This helps in understanding and controlling the consistency of your chemical processes.
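As a small sketch of the estimate described above, the snippet below uses Python's built-in `statistics.NormalDist`. The mean of 80.0 °C and standard deviation of 2.0 °C are made-up values for illustration, and `prob_between` is a hypothetical helper name.

```python
from statistics import NormalDist

# Hypothetical reaction temperatures: mean 80.0 °C, standard deviation 2.0 °C
temps = NormalDist(mu=80.0, sigma=2.0)


def prob_between(dist, low, high):
    """Probability that a single reading falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_1sd = prob_between(temps, 78.0, 82.0)  # mean +/- 1 standard deviation
within_2sd = prob_between(temps, 76.0, 84.0)  # mean +/- 2 standard deviations

print(f"within 1 SD: {within_1sd:.3f}")  # 0.683
print(f"within 2 SD: {within_2sd:.3f}")  # 0.954
```

The same helper works for any range, for example `prob_between(temps, 79.0, 83.0)`.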

What are Quantiles?

Quantiles are cut points that divide a probability distribution into intervals of equal probability. In other words, they partition your sorted data into consecutive subsets of equal size. Here are some common quantiles:

  • Median (50th percentile): The middle value that separates the higher half from the lower half of the data set.
  • Quartiles (25th, 50th, and 75th percentiles): The three values that divide the data into four equal parts.
  • Percentiles: Values that divide the data into 100 equal parts.

For example, the 25th percentile (or first quartile) is the value below which 25% of the data falls. You can calculate these quantiles for your own data and for perfectly normally distributed data, then plot one set against the other in a QQ-Plot. This comparison shows how closely your data follows a normal distribution.
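The comparison can be sketched in a few lines of standard-library Python. This is one possible construction, not the only one: the plotting positions `(i + 0.5) / n` are a common convention among several, `qq_points` is a hypothetical helper name, and the readings are made-up numbers.

```python
from statistics import NormalDist, mean, stdev


def qq_points(data):
    """Return (theoretical, observed) quantile pairs for a normal QQ-Plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the data
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n avoid the probabilities 0 and 1,
    # where the normal quantile would be infinite.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Made-up temperature readings, for illustration only
readings = [79.1, 80.4, 78.2, 81.0, 80.1, 79.6, 82.3, 80.8, 77.5, 79.9]

for theo, obs in qq_points(readings):
    print(f"{theo:7.2f}  {obs:7.2f}")
```

Plotting the first column on the horizontal axis and the second on the vertical axis (for example with a scatter plot in a plotting library of your choice) gives the QQ-Plot.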

Why is the QQ-Plot Important?

In ANOVA and other parametric tests, one key assumption is that the residuals (the differences between observed and predicted values) are normally distributed. Normality of residuals leads to reliable p-values and confidence intervals, which in turn support valid conclusions from the analysis. Or in simple terms: you can only trust the p-values and confidence intervals from an ANOVA when the residuals are at least approximately normally distributed.

A QQ-Plot is sensitive to deviations from the theoretical distribution, making it easy to see whether or not the residuals are normally distributed.
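To make the idea of residuals concrete, here is a minimal sketch for a one-way layout, where the predicted value of each run is simply its group mean. The yields and catalyst labels are invented for illustration, and dividing by the plain sample standard deviation is a simplification (a full ANOVA would use the residual degrees of freedom).

```python
from statistics import NormalDist, mean, stdev

# Hypothetical yields for three catalyst levels (made-up numbers)
groups = {
    "A": [71.2, 69.8, 70.5, 72.1],
    "B": [74.9, 76.3, 75.1, 75.8],
    "C": [68.0, 67.1, 69.4, 68.3],
}

# Residual = observed value minus predicted value (here: the group mean)
residuals = []
for ys in groups.values():
    group_mean = mean(ys)
    residuals.extend(y - group_mean for y in ys)

# Roughly standardize the residuals so they can be compared
# directly with standard normal quantiles.
scale = stdev(residuals)
standardized = sorted(r / scale for r in residuals)

n = len(standardized)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

for theo, obs in zip(theoretical, standardized):
    print(f"{theo:6.2f}  {obs:6.2f}")
```

If the two columns track each other closely, the normality assumption looks plausible for this data.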

Ideal Case

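In an ideal scenario where your data follows the theoretical distribution perfectly, the points of the QQ-Plot fall on a straight line. If both axes use the same scale (for example, standardized residuals plotted against standard normal quantiles), this line runs at a 45-degree angle.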

Deviations from the Ideal Case

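With the theoretical quantiles on the horizontal axis and the data quantiles on the vertical axis, the typical patterns are:

  • S-Shape (points flatten out at both ends): Indicates lighter tails than the normal distribution.
  • Inverted S-Shape (points get steeper at both ends): Indicates heavier tails than the normal distribution.
  • Curved Upwards: Suggests a right-skewed distribution.
  • Curved Downwards: Suggests a left-skewed distribution.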
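The four patterns can be reproduced without any random numbers by taking "samples" straight from known quantile functions. The sketch below assumes the theoretical quantiles are on the horizontal axis. `tail_offsets` is a hypothetical helper, and the stand-in distributions (exponential, mirrored exponential, Cauchy, uniform) are my own choices for illustration.

```python
import math
from statistics import NormalDist

n = 99
probs = [(i + 0.5) / n for i in range(n)]
theoretical = [NormalDist().inv_cdf(p) for p in probs]

# Deterministic stand-ins built from exact quantile functions
samples = {
    "right-skewed (exponential)": [-math.log(1.0 - p) for p in probs],
    "left-skewed (mirrored exponential)": [math.log(p) for p in probs],
    "heavy tails (Cauchy)": [math.tan(math.pi * (p - 0.5)) for p in probs],
    "light tails (uniform)": list(probs),
}


def tail_offsets(theo, sample):
    """Vertical distance of the lowest and highest point from the
    straight line drawn through the points near the two quartiles."""
    lo, hi = len(theo) // 4, (3 * len(theo)) // 4
    slope = (sample[hi] - sample[lo]) / (theo[hi] - theo[lo])
    intercept = sample[lo] - slope * theo[lo]
    low_end = sample[0] - (intercept + slope * theo[0])
    high_end = sample[-1] - (intercept + slope * theo[-1])
    return low_end, high_end


for name, sample in samples.items():
    low_end, high_end = tail_offsets(theoretical, sample)
    print(f"{name:36s} low end: {low_end:+8.2f}  high end: {high_end:+8.2f}")
```

Reading the signs: both ends above the line means the plot curves upwards (right skew), both below means it curves downwards (left skew), low end below and high end above means steeper ends (heavy tails), and the reverse means flatter ends (light tails).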

Model Improvement

By identifying these deviations, you can take steps to improve your model. For instance, if your residuals are not normally distributed, you might consider transforming your response (a log transformation is a common choice for right-skewed data), or check whether you have left out important interaction terms.
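As a minimal sketch of the transformation idea, the snippet below measures how straight a QQ-Plot is using the correlation between the sorted data and the normal quantiles (1.0 means a perfectly straight line). The "data" are exact log-normal quantiles, chosen so that a log transform fixes the problem completely; real data will rarely behave this neatly. `straightness` is a hypothetical helper name.

```python
import math
from statistics import NormalDist, mean


def straightness(sample):
    """Pearson correlation between the sorted sample and normal quantiles."""
    ordered = sorted(sample)
    n = len(ordered)
    theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(theo), mean(ordered)
    sxy = sum((x - mx) * (y - my) for x, y in zip(theo, ordered))
    sxx = sum((x - mx) ** 2 for x in theo)
    syy = sum((y - my) ** 2 for y in ordered)
    return sxy / math.sqrt(sxx * syy)


# Right-skewed stand-in data: exact quantiles of a log-normal distribution
n = 99
skewed = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]

raw = straightness(skewed)
logged = straightness([math.log(y) for y in skewed])

print(f"before transform: {raw:.4f}")
print(f"after log transform: {logged:.4f}")
```

A value noticeably below 1 before the transform and close to 1 afterwards suggests the transformation helped.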

If nothing helps, ANOVA may not be the right tool for building your model, but fingers crossed it doesn't come to that.
