As generative AI technologies become increasingly embedded in software products and workflows, those products inherit the characteristics of the underlying large language models (LLMs). This shift has raised concerns about reliability, because these models are inherently non-deterministic and often produce different responses to identical inputs. That variability is both a feature and a challenge, particularly in enterprise environments where consistency and accuracy are paramount.
The non-deterministic nature of LLMs means that errors can propagate, especially when reasoning models and AI agents are involved. According to Dan Lines, COO of LinearB, “Ultimately, any kind of probabilistic model is sometimes going to be wrong. These kinds of inconsistencies that are drawn from the absence of a well-structured world model are always going to be present at the core of a lot of the systems that we’re working with and systems that we’re reasoning about.”
Understanding the Non-Determinism of LLMs
LLMs are designed to be “dream machines,” capable of generating novel and unexpected outputs. However, when these outputs are factually incorrect, they become problematic. In enterprise software, where reliability is crucial, understanding and mitigating these errors is essential. Daniel Loreto, CEO of Jetify, highlights the difficulty in predicting LLM behavior, emphasizing the need for tools and processes to ensure desired system performance.
Enterprise applications rely heavily on trust, which is built on authorized access, high availability, and idempotency. For GenAI processes, accuracy is an additional critical factor. Tariq Shaukat, CEO of Sonar, notes, “A lot of the real success stories that I hear about are apps that have relatively little downside if it goes down for a couple of minutes or there’s a minor security breach or something like that.”
Addressing Hallucinations and Enhancing Accuracy
One common issue with LLMs is “hallucinations,” where the model generates inaccurate or irrelevant information. Retrieval-augmented generation (RAG) is a typical approach to grounding responses in factual data, yet even RAG systems can falter. Amr Awadallah, CEO of Vectara, points out, “Even when you ground LLMs, 1 out of every 20 tokens coming out might be completely wrong, completely off topic, or not true.”
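To make the grounding step concrete, here is a minimal RAG sketch in Python. The document store, the lexical-overlap scorer, and the call_llm placeholder are illustrative assumptions rather than any particular vendor's API; a production system would retrieve with embeddings and a vector index rather than keyword overlap.

```python
# A minimal sketch of retrieval-augmented generation (RAG): retrieve the most
# relevant passages from a small document store, then constrain the model to
# answer using only that context. `call_llm` is a hypothetical stand-in for
# whatever model API is actually in use.

from collections import Counter

DOCUMENTS = [
    "Invoices are processed within 3 business days of receipt.",
    "Refunds over $500 require manager approval.",
    "Support tickets are triaged by severity, not arrival time.",
]

def score(query: str, doc: str) -> int:
    """Crude lexical overlap score; a real system would use embeddings."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(doc.lower().split())
    return sum((q_terms & d_terms).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages that best match the query."""
    ranked = sorted(DOCUMENTS, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

def build_grounded_prompt(query: str) -> str:
    """Constrain the model to the retrieved context to reduce hallucination."""
    context = "\n".join(f"- {p}" for p in retrieve(query))
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def call_llm(prompt: str) -> str:  # hypothetical model call
    raise NotImplementedError("Replace with your model provider's API call.")

if __name__ == "__main__":
    print(build_grounded_prompt("How fast are invoices processed?"))
```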
To mitigate these issues, additional guardrails on prompts and responses are necessary. Maryam Ashoori, Head of Product at IBM watsonx.ai, explains the importance of filtering on both the input and output sides, so that harmful prompts never reach the model and inappropriate responses never reach the user.
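A simple way to picture those guardrails is a wrapper that screens the prompt before the model runs and the response before it is returned. The keyword lists and the generate stub below are illustrative assumptions; real deployments typically rely on dedicated moderation models or policy engines rather than hard-coded term lists.

```python
# A minimal sketch of prompt/response guardrails around a hypothetical
# `generate` function standing in for the underlying model call.

BLOCKED_INPUT_TERMS = {"password dump", "credit card numbers"}
BLOCKED_OUTPUT_TERMS = {"ssn:", "api_key="}

class GuardrailViolation(Exception):
    pass

def check_input(prompt: str) -> None:
    """Reject prompts that ask for disallowed content before the model runs."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_INPUT_TERMS):
        raise GuardrailViolation("Prompt rejected by input filter.")

def check_output(response: str) -> str:
    """Screen the model's response; block anything disallowed from going out."""
    lowered = response.lower()
    if any(term in lowered for term in BLOCKED_OUTPUT_TERMS):
        raise GuardrailViolation("Response blocked by output filter.")
    return response

def generate(prompt: str) -> str:  # hypothetical model call
    return "The quarterly report is due Friday."

def guarded_generate(prompt: str) -> str:
    check_input(prompt)
    return check_output(generate(prompt))

if __name__ == "__main__":
    print(guarded_generate("When is the quarterly report due?"))
```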
Implementing Observability and Monitoring
Observability in LLMs is crucial for understanding and rectifying errors. Abby Kearns, CTO of Alembic, highlights the need to reinvent traditional tooling for machine workloads. While standard software telemetry such as logs and stack traces provides insight into system behavior, LLMs require more nuanced approaches to measure hallucination rates, factual consistency, and bias.
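One way to approach that instrumentation is to wrap every model call and record quality signals alongside the usual latency and error data. The sketch below is a rough illustration; the supported_by_context check is a stand-in for the evaluation models or sampled human review that production systems would use to estimate hallucination rates.

```python
# A minimal sketch of LLM-specific observability: wrap each model call and
# record not just latency but also whether the answer appears supported by
# the retrieved context. The grounding check here is an illustrative proxy.

import json
import time

def supported_by_context(response: str, context: str) -> bool:
    """Very rough proxy: does the response reuse terms from the context?"""
    resp_terms = set(response.lower().split())
    ctx_terms = set(context.lower().split())
    return len(resp_terms & ctx_terms) >= 3

def instrumented_call(model_fn, prompt: str, context: str) -> str:
    start = time.time()
    response = model_fn(prompt)
    record = {
        "latency_s": round(time.time() - start, 3),
        "prompt_chars": len(prompt),
        "grounded": supported_by_context(response, context),
    }
    # In practice this record would go to a metrics or tracing backend.
    print(json.dumps(record))
    return response

if __name__ == "__main__":
    stub = lambda p: "Refunds over $500 require manager approval."
    instrumented_call(stub, "What is the refund policy?",
                      "Refunds over $500 require manager approval.")
```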
Mark Doble, CEO of Alexi, suggests using multiple models to evaluate outputs, an "LLM-as-judge" approach. Cross-checking a response against the judgments of several models can catch errors a single model would miss, yielding more reliable outputs.
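A stripped-down version of that idea looks like the sketch below: several judge models grade a candidate answer, and the answer is accepted only on a majority vote. The judge functions are hypothetical stand-ins for calls to different models that would each be prompted to grade factual accuracy.

```python
# A minimal sketch of an "LLM-as-judge" check: multiple judges grade a
# candidate answer, and it is accepted only if a majority approve.

from typing import Callable

JudgeFn = Callable[[str, str], bool]  # (question, answer) -> approve?

def majority_approves(question: str, answer: str, judges: list[JudgeFn]) -> bool:
    """Accept the answer only if more than half of the judges approve it."""
    votes = [judge(question, answer) for judge in judges]
    return sum(votes) > len(votes) / 2

# Hypothetical judges; each would normally prompt a different model and
# parse a structured verdict from its response.
def judge_a(q: str, a: str) -> bool: return True
def judge_b(q: str, a: str) -> bool: return True
def judge_c(q: str, a: str) -> bool: return False

if __name__ == "__main__":
    answer = "Refunds over $500 require manager approval."
    print(majority_approves("What is the refund policy?", answer,
                            [judge_a, judge_b, judge_c]))
```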
Ensuring Reliability in AI Workflows
Incorporating determinism into AI workflows is essential for enterprise applications. Jeremy Edberg, CEO of DBOS, emphasizes the importance of durable execution technologies that save progress within workflows, so a failure does not mean starting over at significant cost. "We've always had a cost to downtime, right? Now, though, it's getting much more important because AI is non-deterministic," he says.
Qian Li, cofounder of DBOS, advocates checkpointing applications so that completed work is saved, reducing the need to re-issue prompts and minimizing the risk of getting a different response on retry.
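In spirit, that looks something like the following sketch, which persists each step's result so a restart resumes from the last completed step rather than re-running earlier model calls. This is a generic illustration of checkpointing, not DBOS's actual API; the step functions stand in for real LLM calls.

```python
# A minimal sketch of checkpointing a multi-step AI workflow: each step's
# result is persisted, so a crash or restart resumes from the last completed
# step instead of repeating earlier (non-deterministic) model calls.

import json
from pathlib import Path

CHECKPOINT = Path("workflow_checkpoint.json")

def load_checkpoint() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def run_step(name: str, fn, state: dict):
    """Run fn only if this step has no saved result yet."""
    if name not in state:
        state[name] = fn()
        CHECKPOINT.write_text(json.dumps(state))  # persist after each step
    return state[name]

def summarize():      # stand-in for an LLM call
    return "summary of the source documents"

def draft_email():    # stand-in for a second LLM call
    return "draft email based on the summary"

if __name__ == "__main__":
    state = load_checkpoint()
    run_step("summarize", summarize, state)
    run_step("draft_email", draft_email, state)
    print(state)
```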
Ultimately, while LLMs offer powerful capabilities, they also introduce complexity and risk. For personal projects, the non-determinism of AI can be intriguing and even delightful. However, for enterprise software, reliability and trust are non-negotiable. As Raj Patel, AI transformation lead at Holistic AI, aptly puts it, “Trust is key. I think trust takes years to build, seconds to break, and then a fair bit to recover.”