Conversational AI for science: finding a route through the jungle

Our head of Agentic AI, Joel Strickland, is currently working remotely while travelling through South and Central America. Alongside working on the ongoing development of our agentic assistant tools for R&D, he’s also been reflecting on what it really takes to build trustworthy AI for science. Here, in the first of a series of blog posts, he shares some insights from an unusual location…

Blog post by Joel Strickland.

I’m writing this from the Guatemalan jungle with a very good cup of coffee and a slightly less relaxing question: what would it take to build an AI assistant for science that researchers can actually trust? Not just something that spits out answers, but something you’d trust when the work actually matters.

It turns out most researchers run up against the same problems. They rely heavily on software: 92% depend on it, and 69% say their work would be impossible without it ¹⁻³. Yet they still spend weeks stitching systems together and struggling to reproduce their own results ⁴⁻⁵.

After years helping clients clean messy data, build tailored analyses, and uncover real insights – and after testing multiple AI assistants that helped with bits and pieces but couldn’t be trusted with full research workflows – we decided to take a different approach at Intellegens. We started building NOA (No Ordinary Assistant): an AI assistant researchers can trust because it doesn’t just generate answers. It executes validated workflows that produce consistent, reproducible results.

This is ongoing development work, part of a project to integrate agentic and generative AI methods with our machine learning solutions. That project has already delivered new tools to help our Alchemite™ customers interpret their ML results. NOA itself is at a prototype stage. But even getting this far wasn't simple. Here are some of the things we have learned so far.

The Problem: Flexibility vs. Rigour

Science needs two things at once: the flexibility to test ideas quickly and the rigour to reproduce results reliably. Tools like ChatGPT give you flexibility, but they generate different code every time. That’s fine for brainstorming, not for publishable science. Traditional R&D software gives you rigour, but requires programming skills that 56% of researchers don’t have ¹.

Before building NOA, we interviewed 11 R&D teams across materials science, chemistry, and formulations, asking what would actually make AI useful in their work.

The results surprised us. Out of 300 discussion points, 187 (62%) were about trust: security, understanding how decisions are made, and setting limits on what gets automated. Only 19% were about features. Every team said security was non-negotiable. 82% wanted explainable reasoning. 73% wanted operational control. 64% wanted human approval gates.

The insight was clear: researchers don’t doubt AI can work. They doubt they can trust it.

Building Trust Through Design

NOA is built around a simple idea: it talks to you naturally but only executes through pre-approved, validated workflows – like following tested recipes instead of improvising.

Ask NOA a question and, instead of guessing, it finds the right workflow – whether that’s cleaning data, running feature engineering, or training a model. Each step uses a specific tool with a clear definition of its inputs and outputs.
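To make that concrete, here is a minimal sketch of the idea (illustrative only: `Tool` and `clean_data` are hypothetical names, not NOA's internal API). Each step is a pre-approved function with an explicit input/output contract, so the same request always runs the same code:

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass(frozen=True)
class Tool:
    """A pre-approved workflow step: fixed code with an explicit I/O contract."""
    name: str
    description: str  # what the assistant sees when routing a request
    run: Callable[[pd.DataFrame], pd.DataFrame]

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Deterministic cleaning: drop duplicate rows, then all-empty columns."""
    return df.drop_duplicates().dropna(axis=1, how="all")

CLEAN = Tool(
    name="clean_data",
    description="Remove duplicate rows and empty columns from a dataset.",
    run=clean_data,
)

# The same input always produces the same output, because the tool's code
# is fixed; the assistant only chooses which tool to run, never rewrites it.
result = CLEAN.run(pd.DataFrame({"x": [1, 1, 2], "y": [None, None, None]}))
print(result)
```

Because the assistant can only choose between such tools, never write new ones on the fly, the flexibility lives in the conversation while the execution stays deterministic.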

But turning that into a smooth user experience was not easy.

At first, NOA made several separate calls to the LLM per request, on the theory that more prompts would mean better accuracy. Instead, we got more hallucinations. One good prompt with the right context gave far better results.
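As a sketch of the difference (with `ask_llm` as a stand-in for whatever LLM client is in use; none of these names come from NOA), the fix amounts to assembling everything the model needs into a single, well-structured prompt:

```python
def ask_llm(prompt: str) -> str:
    """Stand-in for a real LLM client call; any provider would fit here."""
    raise NotImplementedError("wire up your LLM client of choice")

def route_request(user_request: str, tool_catalogue: str, history: str) -> str:
    """Build one prompt carrying all relevant context, then call the LLM once."""
    prompt = (
        "You are a routing assistant. Pick exactly one workflow from the "
        "catalogue and list the parameters it needs.\n\n"
        f"Available workflows:\n{tool_catalogue}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"User request:\n{user_request}"
    )
    # One call with full context, versus several narrow calls that each
    # saw only a fragment and invited the model to fill gaps by guessing.
    return ask_llm(prompt)
```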

Speed mattered more than expected too. No one wants to stare at a loading icon during heavy data processing, so we added real-time status updates showing exactly what’s running.
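One simple way to do this, sketched below with hypothetical names rather than NOA's actual mechanism, is to have each long-running step yield human-readable status messages that the interface can stream as they arrive:

```python
import time
from typing import Iterator

def run_heavy_step(n_stages: int = 3) -> Iterator[str]:
    """A long-running workflow step that reports progress as it works."""
    yield "Validating input data..."
    time.sleep(0.1)  # stand-in for real work
    for stage in range(1, n_stages + 1):
        yield f"Processing: stage {stage}/{n_stages}"
        time.sleep(0.1)
    yield "Done."

# The chat interface would render each message as a live status line
# instead of leaving the user staring at a spinner.
for status in run_heavy_step():
    print(status)
```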

We also learned that splitting work across multiple specialized agents broke the flow entirely. Each agent lived in its own bubble – meaning follow-up questions confused the whole chain. So we rebuilt around one unified agent that keeps full context throughout the conversation.
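Conceptually, the change looks like the sketch below (again, hypothetical names): one agent owns the full conversation history, so every follow-up is answered with complete context rather than handed off to a specialist that sees only a fragment.

```python
from dataclasses import dataclass, field

@dataclass
class UnifiedAgent:
    """One agent, one growing context: every turn sees the whole conversation."""
    history: list[str] = field(default_factory=list)

    def handle(self, user_message: str) -> str:
        self.history.append(f"user: {user_message}")
        # A real implementation would pass self.history to the LLM here;
        # the point is simply that no turn is answered out of context.
        reply = f"(reply informed by {len(self.history)} turns of context)"
        self.history.append(f"assistant: {reply}")
        return reply

agent = UnifiedAgent()
agent.handle("Clean my dataset")
print(agent.handle("Now train a model on it"))  # follow-up keeps full context
```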

Good technical design starts with understanding what people actually need. Researchers told us trust mattered more than features, so we built everything around that. 

What’s next?

We’re feeding all of these insights into our development program, and it’s exciting to see the path ahead becoming clearer as we cut through the forest of practical AI challenges, one validated workflow at a time. There’s still a long journey ahead, but I look forward to keeping you updated as we progress, and to demonstrating NOA as it matures. This post is the first in a three-part series on building AI for scientific research. In February, I’ll share what we’re learning from NOA in real-world use – what works, what breaks, and what surprises us. In April, we’ll look at what’s next: how agentic workflows could reshape research in materials and chemistry.

References

¹ Hettrick, S., Antonioletti, M., Carr, L., Chue Hong, N., Crouch, S., De Roure, D.C., Emsley, I., Goble, C., Hay, A., Inupakutika, D. and Jackson, M., 2014. UK research software survey 2014.

² Hannay, J.E., MacLeod, C., Singer, J., Langtangen, H.P., Pfahl, D. and Wilson, G., 2009, May. How do scientists develop and use scientific software? In 2009 ICSE workshop on software engineering for computational science and engineering (pp. 1-8). IEEE.

³ Nangia, U. and Katz, D.S., 2017, September. Track 1 paper: surveying the US National Postdoctoral Association regarding software use and training in research. In Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE 5.1).

⁴ Jiménez, R.C., Kuzak, M., Alhamdoosh, M., Barker, M., Batut, B., Borg, M., Capella-Gutierrez, S., Hong, N.C., Cook, M., Corpas, M. and Flannery, M., 2017. Four simple recommendations to encourage best practices in research software. F1000Research, 6, pp.ELIXIR-876.

⁵ Carver, J., Heaton, D., Hochstein, L. and Bartlett, R., 2013. Self-perceptions about software engineering: A survey of scientists and engineers. Computing in Science & Engineering, 15(1), pp.7-11.
