Design of experiments has been around for well over fifty years, the software to run it for decades, and most people still do not use it. It looks complicated, and people think a DoE requires hundreds of experiments. So most of us stuck with changing one factor at a time, which at least felt like less work. What I think held DoE back was the expertise needed to drive it: knowing which of the many designs to pick and how to properly analyze and interpret the results.

I believe that generative AI will take all that off our hands. In the not-so-distant future we will just specify a goal, and an AI agent will figure out the most efficient way to reach it, with as few experiments as possible. It picks the right design, analyzes the results for you, and suggests the right follow-up experiments. At every step it explains its reasoning so you always stay in the loop (until it gets so good that you don’t have to anymore).

This is how I picture the future of experimental work. We won’t use any DoE software tools anymore. No need for statistical training. We’ll only have a conversation with an AI agent (or maybe multiple at a time). This essay is an example of how I imagine this future.

What you had to learn to run DoE

I don’t think it is a coincidence that most DoE software providers also offer plenty of courses and other learning material. Applying design of experiments is not straightforward. You have to learn and understand quite a lot.

The first is choosing the design. Most people who have heard of DoE know one design, the full factorial design. It is the easiest one to understand but unfortunately also the most expensive to run. For example, testing 7 factors at two settings each, which is not that uncommon, takes 2^7 = 128 runs. But the full factorial design is only one design among dozens, and the right fractional design tests those same 7 factors with only 16 experiments instead of 128. Previously you needed to know that this fractional design existed and, more importantly, how to use it properly.
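
To make the run counts concrete, here is a minimal Python sketch that builds both designs in coded levels (-1 for the low setting, +1 for the high one). The generators E = ABC, F = BCD, G = ACD are one common textbook choice for a 16-run design with 7 factors. They are an assumption for illustration, not something a particular tool or agent is known to pick, and the function names are made up for this example.

```python
from itertools import product

def full_factorial(n_factors):
    """All 2**n_factors combinations of coded levels -1 (low) and +1 (high)."""
    return [list(levels) for levels in product((-1, 1), repeat=n_factors)]

def fractional_factorial_7_in_16():
    """16-run design for 7 factors A..G, built on a full factorial in A..D.

    The three extra columns come from the assumed generators
    E = ABC, F = BCD, G = ACD.
    """
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        e = a * b * c
        f = b * c * d
        g = a * c * d
        runs.append([a, b, c, d, e, f, g])
    return runs

full = full_factorial(7)
frac = fractional_factorial_7_in_16()
print(len(full), "runs in the full factorial")
print(len(frac), "runs in the fractional factorial")
```

Every factor still appears at its low and high setting equally often in the 16 runs. What is given up is the ability to separate some of the effects from each other.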

The second is analyzing the results. You had to fit a model and judge which main effects and interactions are significant using statistical methods like ANOVA, all while navigating the confounding that the design itself introduced, as in the fractional factorial design I just mentioned.
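
What confounding means in practice can be worked out mechanically. The sketch below assumes a 16-run, 7-factor design with the generators E = ABC, F = BCD, G = ACD (a common textbook choice, not taken from any specific software) and lists which effects end up sharing the same estimate. Effects are written as words such as "AB", and the helper names are hypothetical.

```python
from itertools import combinations

def multiply(word_a, word_b):
    """Product of two effect words: letters that appear in both cancel (A*A = I)."""
    return frozenset(word_a) ^ frozenset(word_b)

def defining_relation(generator_words):
    """All non-empty products of the generator words."""
    words = set()
    for size in range(1, len(generator_words) + 1):
        for combo in combinations(generator_words, size):
            word = frozenset()
            for generator in combo:
                word = multiply(word, generator)
            words.add(word)
    return words

def aliases(effect, relation):
    """Effects that cannot be separated from `effect`, shortest first."""
    names = ("".join(sorted(multiply(effect, word))) for word in relation)
    return sorted(names, key=lambda name: (len(name), name))

# E = ABC, F = BCD, G = ACD written as defining words (assumed generators).
relation = defining_relation(["ABCE", "BCDF", "ACDG"])
resolution = min(len(word) for word in relation)

print("resolution:", resolution)
print("A shares its estimate with:", aliases("A", relation))
print("AB shares its estimate with:", aliases("AB", relation))
```

With these generators every main effect is mixed only with interactions of three or more factors, while each two-way interaction is mixed with other two-way interactions. That second part is what the analysis has to untangle.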

And the third is choosing the next set of experiments, because you are usually not finished after a single design, and you shouldn’t be. The concept of sequential experimentation says that you should move in small steps and reach your goal through multiple smaller designs rather than one large one. It is much more effective this way, but you had to know which design makes sense to run next and how to adapt the design space.

These are all the things you had to learn through a lot of dedication or expensive courses or both.

Let the agent be the design expert

If the agent is the design expert, you just describe your goal and provide the necessary domain knowledge, and the agent figures out the rest.

You
I need to increase the hardness of a coating. Can you help me with that?
Claude
Yes I can. Can you provide some more context? What type of coating is it? Do you already have factors in mind that could be relevant?
You
It is a two-component polyurethane clear coat that is cured at elevated temperatures, and I was hoping that we can increase the hardness without me reformulating from scratch. So I wanted to focus on varying only the curing temperature, the hardener type, the catalyst level, and the curing time.
Claude
Ok. That’s four factors. I would suggest that we test each one at two settings, a high and a low. That would be 16 experimental runs if we test all possible combinations but I think we can learn most of what we need to know from just 8 experiments. Should we try that?
You
How does that work? How can we reduce the number of experiments that much?
Claude

If we choose to test every possible combination, that is called a full factorial design. These designs are quite detailed and usually provide much more information about the factors and their effects than needed. For example, a full factorial design can tell you whether the effect of temperature changes depending on which hardener and which catalyst concentration you use together. This is called a three-way interaction, and such interactions are usually negligible.

The design we use instead is a so-called fractional factorial design. The way it works is that some effects share the same runs. You get a combined number that contains the effect of temperature alone and a three-way interaction between the other three factors, and you cannot separate the two. But that is fine because three-way interactions are usually so small that the combined number is almost entirely the effect of temperature. The design we choose also combines two-way interactions with each other, which is slightly more problematic because the effect of temperature might actually depend on the catalyst concentration. But we will figure that out as we go. Does that sound reasonable to you?

You
Yes. Sounds good. How do we continue?
Claude
Below is a table with one row per factor. Fill in the ranges you want to test. I already added typical values to start from, but change anything you want. Once you confirm the ranges, I will generate the design so you can perform the experiments.
| Factor | Low setting | High setting |
| --- | --- | --- |
| Curing temperature | 60 °C | 80 °C |
| Hardener type | Type A | Type B |
| Catalyst level | 1 % | 5 % |
| Curing time | 30 min | 60 min |
You
5 % catalyst is too much. Reduce that to 3 %. The rest is fine.
Claude
Then these are the eight experiments to test:
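| Run | Curing temperature | Hardener | Catalyst | Curing time |
| --- | --- | --- | --- | --- |
| 1 | 60 °C | Type A | 1 % | 30 min |
| 2 | 60 °C | Type A | 3 % | 60 min |
| 3 | 60 °C | Type B | 1 % | 60 min |
| 4 | 60 °C | Type B | 3 % | 30 min |
| 5 | 80 °C | Type A | 1 % | 60 min |
| 6 | 80 °C | Type A | 3 % | 30 min |
| 7 | 80 °C | Type B | 1 % | 30 min |
| 8 | 80 °C | Type B | 3 % | 60 min |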
You
Here is the table with the hardness measured on each panel. Please analyze the results for me:
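| Run | Curing temperature | Hardener | Catalyst | Curing time | Hardness (oscillations) |
| --- | --- | --- | --- | --- | --- |
| 1 | 60 °C | Type A | 1 % | 30 min | 33 |
| 2 | 60 °C | Type A | 3 % | 60 min | 87 |
| 3 | 60 °C | Type B | 1 % | 60 min | 57 |
| 4 | 60 °C | Type B | 3 % | 30 min | 68 |
| 5 | 80 °C | Type A | 1 % | 60 min | 88 |
| 6 | 80 °C | Type A | 3 % | 30 min | 72 |
| 7 | 80 °C | Type B | 1 % | 30 min | 77 |
| 8 | 80 °C | Type B | 3 % | 60 min | 84 |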
Claude
It seems like the hardener type has almost no effect on the hardness. That’s good, because it also resolves the combined effects that the fractional design introduced: with the hardener out of the picture, the other three factors can be analyzed as if we had tested all of their combinations.
[Figure: bar chart of the effect on hardness in oscillations. Curing temperature +19.0, curing time +16.5, catalyst +14.0, hardener type +1.5.]
Average effect of each factor on the hardness of the coating.
Claude
Curing temperature and curing time both increase the hardness of your coating. Catalyst concentration does too, but only at 60 °C. If you cure at 80 °C, adding more catalyst doesn’t increase the hardness any more as you can see here:
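[Figure: interaction plot of hardness in oscillations against catalyst concentration (1 % and 3 %), with one line per curing temperature. At 60 °C the hardness rises from 45 to about 78; at 80 °C it goes from 82.5 to about 78.]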
Two-way interaction between curing temperature and catalyst. At 60 °C, raising the catalyst concentration from 1 % to 3 % increases the hardness by about 32 oscillations. At 80 °C the concentration doesn’t matter.
You
Ok, that’s interesting. What does that mean for my experiment?
Claude
It means you have a choice now. You can either cure the coating at 80 °C with 1 % catalyst or cure it at 60 °C with 3 %. Both give more or less the same hardness in the end. We can also optimize further: we could raise either the temperature or the catalyst concentration and see how much higher the hardness gets. What do you think?
You
It would be great to only need 60 degrees instead of 80. I can’t increase the catalyst concentration much higher though so I’m happy with the result for now. Thanks. Please write the final report.
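
The numbers in this conversation can be checked without any statistics package. The sketch below enters the eight runs in coded levels (-1 for the low setting, +1 for the high one), computes each main effect as the mean hardness at the high setting minus the mean at the low setting, and then averages the two runs in each temperature and catalyst combination. The variable and function names are made up for this example, and the check on the design assumes it was built by setting curing time equal to the product of the other three columns, which is what the run table implies.

```python
# Coded levels: -1 = low, +1 = high.
# Columns: temperature (60/80 degC), hardener (A/B), catalyst (1/3 %), time (30/60 min).
design = [
    (-1, -1, -1, -1),
    (-1, -1, +1, +1),
    (-1, +1, -1, +1),
    (-1, +1, +1, -1),
    (+1, -1, -1, +1),
    (+1, -1, +1, -1),
    (+1, +1, -1, -1),
    (+1, +1, +1, +1),
]
hardness = [33, 87, 57, 68, 88, 72, 77, 84]
factors = ["temperature", "hardener", "catalyst", "time"]

def mean(values):
    return sum(values) / len(values)

def main_effect(column):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(design, hardness) if run[column] == +1]
    low = [y for run, y in zip(design, hardness) if run[column] == -1]
    return mean(high) - mean(low)

def cell_mean(temperature, catalyst):
    """Mean hardness of the two runs at one temperature and catalyst level."""
    return mean([y for run, y in zip(design, hardness)
                 if run[0] == temperature and run[2] == catalyst])

# Half fraction: time = temperature * hardener * catalyst in every run.
assert all(t * h * c == d for t, h, c, d in design)

effects = {name: main_effect(i) for i, name in enumerate(factors)}
for name, value in effects.items():
    print(f"{name:12s} {value:+.1f}")

for temperature, label in ((-1, "60 degC"), (+1, "80 degC")):
    low, high = cell_mean(temperature, -1), cell_mean(temperature, +1)
    print(f"{label}: {low} -> {high} ({high - low:+.1f})")
```

This reproduces the effects of +19.0, +16.5, +14.0, and +1.5. The cell means are 45 and 77.5 at 60 °C against 82.5 and 78 at 80 °C, so the extra catalyst adds 32.5 oscillations at the lower temperature and changes almost nothing at the higher one.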

From experiments to goals

This was just a small example of how the future could look, and I don’t think this future is far away.

And it is only the beginning. Classical design of experiments is a great tool, but you can imagine this same agent also reaching for Bayesian optimization and much more. In fact, there is research showing that the LLM itself, on its reasoning alone, is quite good at running experiments and suggesting the next ones to try. As these models get smarter, you might not need a formal method at all. And once you connect your whole database of past results and experiments, it only gets better.

So the question is, will this make humans obsolete? I don’t think it will, but we need to change and adapt to work with these systems. Our job moves more towards deciding which problems are worth solving. The agent optimizes towards a target. Picking the target, and knowing why it’s the right one, is the part that stays human.