At Intellegens, we share lots of examples of how machine learning (ML) accelerates R&D in areas like chemicals, materials, food research, and life science. But I bet there are a few frustrated readers who never made it to first base with ML. If you’re one of those, or fear you might become one, a key reason might well be that “it won’t work with my data.” So, in this month’s blog, we go back to basics and discuss how to avoid some key pitfalls.
The ‘real data’ problem
When machine learning struggles to generate good results, the underlying reasons often come back to the same fundamental issue: you had the temerity to present it with real data. You know, data that comes out of the lab or off a production line. Perhaps data that’s been merged together from many sources. Data that hasn’t been carefully curated, cleaned up, or pre-analysed. Here are some typical features of that data (with a toy example after the list):
- It has gaps. Maybe that’s because it doesn’t make sense to measure every property in every test. Or because combining datasets from diverse projects inevitably results in a grid of data with lots of holes in it. Or simply because not every result gets recorded cleanly.
- It is ‘noisy’. A degree of scatter in the results you are measuring might be inherent to your experiments. Or there might be measurement errors. Or both.
- There isn’t enough data. You might, rightly, be concerned that a model trained with a small amount of data won’t be accurate. So maybe you resort to Design of Experiments (DOE) methods to plan the collection of more data, and wait until you have ‘enough’ before trying machine learning.
- It’s a mix of different types. Probably, a lot of your data is numerical. But you’re also likely to have categorical data. Perhaps one input is just the name of the operator or machine used for a test. Or outputs may be captured as “Poor / Acceptable / Good”.
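To make this concrete, here is a toy example in Python (pandas) showing all four features at once – gaps, scatter, only a handful of rows, and a mix of numerical and categorical columns. The column names and values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# A toy version of 'real' R&D data: gaps (NaN), scatter in the
# measurements, only a few rows, and mixed column types.
df = pd.DataFrame({
    "temperature_C": [60.0, 80.0, np.nan, 100.0, 120.0],    # numerical input
    "operator":      ["Ana", "Ben", "Ana", np.nan, "Ben"],  # categorical input
    "viscosity":     [1.21, 1.18, 1.95, np.nan, 2.60],      # noisy numerical output
    "appearance":    ["Poor", np.nan, "Good", "Acceptable", "Good"],  # categorical output
})

print(df.isna().sum())  # how many gaps are in each column?
```

Merge a few datasets like this from different projects and the grid of holes only gets worse.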
Many ML methods fall at the first of these hurdles. The initial step in any ML project is to ‘train’ a model using existing data. The model (a mathematical representation of the system you are studying) can then be applied, for example, to predict likely new outputs for a different set of inputs. But most ML methods want data that is complete and clean for the training step. They don’t work well if there are gaps. So you have to do a lot of pre-processing just to get started: perhaps removing otherwise useful rows of data, breaking the data down to study it one dataset or one property at a time, or estimating the missing values.
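To see why that pre-processing burden hurts, here is a minimal sketch of the two usual workarounds, using an invented two-column dataset and scikit-learn. Neither is satisfying: one throws data away, the other makes up values with no measure of confidence.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "temperature_C": [60.0, np.nan, 100.0, 120.0],
    "viscosity":     [1.2, 1.9, np.nan, 2.6],
})

# Workaround 1: drop incomplete rows. Easy, but here it discards
# half of an already small dataset.
complete = df.dropna()

# Workaround 2: fill each gap with a column mean before training a
# standard model. Crude: it ignores relationships between columns
# and gives no indication of how uncertain the filled-in values are.
filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df),
    columns=df.columns,
)
```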
An ML checklist – what makes a practical R&D tool?
That’s all very well if you have time and a bit of statistical or data science knowledge on your side. But for most practical R&D projects, you want a tool that you can quickly throw data at and get useful insights from – insights that might tell you, for example, what experiment to do next. You want to focus on R&D objectives, not data processing. If that’s the type of tool you want, here are the features to look for:
- An underlying algorithm that can not only train ML models but also infer missing values, even when the training dataset is ‘sparse and noisy’ (i.e., has gaps and is messy). This rare capability is a particular strength of our Alchemite ML technology.
- Excellence in uncertainty quantification. Your model’s predictions will not be perfect. That’s OK, as long as you understand how much reliance to place on them, and can see how this uncertainty changes for different predictions and as more data is added. Any good ML method estimates the uncertainty in its results. Make sure this is clearly presented and calculated by a method (like non-parametric Bayesian probability distributions) that draws information from your data, rather than being based on standard distributions.
- Support for an adaptive approach. Instead of waiting for ‘enough’ data – why not try out machine learning early? Good uncertainty quantification is also vital here. If the software can tell you how uncertain its predictions are and recommend the data most likely to reduce that uncertainty, you can conduct the relevant experiments and quickly improve model accuracy (see the sketch after this checklist). This adaptive, iterative approach has been found to reduce the overall amount of experimentation needed to reach a solution by 50-80% compared with more conventional Design of Experiments methods.
- An ability to recognise and model both numerical and categorical data, in the inputs it learns from and in the outputs it predicts.
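For a feel of how the last three items fit together, here is a minimal Python sketch. It is not Alchemite: it uses scikit-learn’s Gaussian process regressor simply as a stand-in for any model that reports an uncertainty alongside each prediction, one-hot encodes a categorical input, and then picks the candidate experiment where the model is least certain – the core of the adaptive loop described above. All column names and numbers are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small, mixed-type training set (invented numbers)
X_train = pd.DataFrame({
    "temperature_C": [60.0, 80.0, 100.0, 120.0],
    "machine":       ["A", "B", "A", "B"],   # categorical input
})
y_train = np.array([1.2, 1.9, 2.1, 3.0])

model = Pipeline([
    # One-hot encode the categorical column; pass numbers through
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(sparse_output=False), ["machine"])],
        remainder="passthrough",
    )),
    # WhiteKernel lets the model account for noise in the measurements
    ("gp", GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                    normalize_y=True)),
])
model.fit(X_train, y_train)

# Candidate experiments we could run next
X_pool = pd.DataFrame({
    "temperature_C": [70.0, 90.0, 110.0, 130.0],
    "machine":       ["A", "A", "B", "B"],
})
mean, std = model.predict(X_pool, return_std=True)

# Adaptive step: do the experiment where the prediction is least certain
best = int(np.argmax(std))
print(X_pool.iloc[best].to_dict(),
      f"predicted {mean[best]:.2f} +/- {std[best]:.2f}")
```

Each time a new measurement comes back, you add it to the training set, refit, and ask again – the uncertainties shrink where the new data lands.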
Avoid the data gap
If you apply an ML tool that ticks these boxes, you’ll eliminate many of the frustrations that have prevented researchers from getting started with ML. Lab scientists will get to a result with fewer experiments through an adaptive DOE approach. Formulation experts will find new solutions, faster, by testing out hypotheses based on knowledge gained from experimental data. Project managers will more easily draw insights from legacy data compiled from different projects, despite all those irritating gaps.
So those data frustrations can be avoided. If you’re still not sure, it shouldn’t take long to give it a go!