Over the past couple of years, RAG has turned into a kind of credibility signal in the AI field. If a company wants to look serious to investors, clients, or even its own leadership, it’s now expected to have a Retrieval-Augmented Generation story ready. LLMs changed the landscape almost overnight and pushed generative AI into nearly every business conversation.
But in practice: Building a bad RAG system is worse than no RAG at all.
I’ve seen this pattern repeat itself again and again. Something ships quickly, the demo looks fine, leadership is satisfied. Then real users start asking real questions. The answers are vague. Sometimes wrong. Occasionally confident and completely nonsensical. That’s usually the end of it. Trust disappears fast, and once users decide a system can’t be trusted, they don’t check back to see whether it has improved. They simply stop using it.
The real failure here is not technical; it’s human. People will tolerate slow tools and clunky interfaces. What they won’t tolerate is being misled. When a system gives you the wrong answer with confidence, it feels deceptive. Recovering from that, even after months of work, is extremely hard.
Only a few incorrect answers are enough to send users back to manual searches. By the time the system finally becomes truly reliable, the damage is already done, and no one wants to use it anymore.
In this article, I share six lessons I wish I had known before deploying RAG projects for clients.
Important RAG decisions happen long before you write any code.
Why are you embarking on this project? Identify the problem you are actually solving. Doing it “because everyone else is doing it” isn’t a strategy.
Then there’s the question of return on investment, the one everyone avoids. How much time will this actually save in concrete workflows, not just in abstract metrics presented on slides?
And finally, the use case. This is where most RAG projects quietly fail. “Answer internal questions” is not a use case. Is it helping HR respond to policy questions without endless back-and-forth? Is it giving developers instant, accurate access to internal documentation while they’re coding? Is it a narrowly scoped onboarding assistant for the first 30 days of a new hire? A strong RAG system does one thing well.
RAG can be powerful. It can save time, reduce friction, and genuinely improve how teams work. But only if it’s treated as real infrastructure, not as a trend experiment.
The rule is simple: don’t chase trends. Implement value.
If that value can’t be clearly measured in time saved, efficiency gained, or costs reduced, then the project probably shouldn’t exist at all.
Many teams rush their RAG development, and to be honest, a simple MVP can be built very quickly if you aren’t worried about performance. But RAG is not a quick prototype; it’s a huge infrastructure project. The moment you start stressing your system with real, evolving data in production, the weaknesses in your pipeline will begin to surface.
Given the recent popularity of LLMs with context windows sometimes measured in millions of tokens, some declare that long-context models make retrieval optional, and teams try to bypass the retrieval step altogether. Having implemented this architecture many times, I’ve found that large context windows are genuinely useful, but they are not a substitute for a good RAG solution. When you compare the complexity, latency, and cost of stuffing a massive context window against retrieving only the most relevant snippets, a well-engineered RAG system remains necessary.
But what defines a “good” retrieval system? Your data and its quality, of course. The classic principle of “Garbage In, Garbage Out” applies just as much here as it did in traditional machine learning. If your source data isn’t meticulously prepared, your entire system will struggle. It doesn’t matter which LLM you use; your retrieval quality is the most critical component.
Too often, teams push raw data directly into their vector database (VectorDB). It quickly becomes a dumping ground where the only retrieval mechanism is a plain cosine-similarity search. That setup might pass your quick internal tests, but it will almost certainly fail under real-world pressure.
In mature RAG systems, data preparation has its own pipeline with tests and versioning steps. This means cleaning and preprocessing your input corpus. No amount of clever chunking or fancy architecture can fix fundamentally bad data.
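As a minimal illustration (the exact steps depend entirely on your corpus, and the function names here are just a sketch), a preparation pipeline might normalize documents and drop exact duplicates before anything reaches the vector store:

```python
import hashlib
import re

def clean_document(text: str) -> str:
    """Basic normalization: collapse runs of whitespace and drop empty lines."""
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

def deduplicate(documents: list[str]) -> list[str]:
    """Drop exact duplicates by content hash before chunking and embedding."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# raw_documents would come from your own loaders (PDF parser, wiki export, ...).
raw_documents = ["Remote work policy  \n\n Employees may work remotely two days per week."]
corpus = deduplicate([clean_document(doc) for doc in raw_documents])
```

Each of these steps should be tested and versioned like any other pipeline stage, so you always know which transformation produced the data sitting in your VectorDB.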
When we talk about data preparation, we’re not just talking about clean data; we’re talking about meaningful context. That brings us to chunking.
Chunking refers to breaking down a source document, perhaps a PDF or internal document, into smaller chunks before encoding it into vector form and storing it within a database.
Why is chunking needed? LLMs have a limited context window, and even “long context” LLMs get costly and suffer from distraction when given too much noise. The point of chunking is to let retrieval pick out only the most relevant pieces of information for the user’s question and pass just those to the LLM.
Most development teams split documents using simple techniques: token limits, character counts, or rough paragraph breaks. These methods are very fast, but this is usually where retrieval starts to degrade.
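For reference, here is roughly what that quick approach looks like, as a minimal sketch with illustrative sizes:

```python
# A typical "quick" chunker: fixed character windows with a small overlap.
# Fast to ship, but the boundaries ignore sentence and topic structure.
def naive_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

chunks = naive_chunk("Your long internal document goes here... " * 50)
```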
When text is chunked without smarter rules, it becomes fragments rather than complete concepts. The result is pieces that lose their surrounding context and retrieve unreliably. Copying a naive chunking strategy from another company’s published architecture, without understanding your own data structure, is dangerous.
The best RAG systems I’ve seen incorporate Semantic Chunking.
In practice, Semantic Chunking means breaking text into meaningful pieces rather than arbitrary sizes: each chunk should represent a single, complete idea.
To implement more robust techniques, you can consult open-source libraries such as LangChain’s text-splitting modules (especially the more advanced recursive splitters) and the research literature on topic segmentation.
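As a minimal sketch of the idea (assuming the sentence-transformers library; the model name and threshold are illustrative, not recommendations): embed each sentence and open a new chunk whenever two adjacent sentences stop being similar.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
    """Group consecutive sentences; open a new chunk when similarity drops.

    Assumes a non-empty, pre-split list of sentences.
    """
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity; the vectors are already normalized.
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Production implementations usually compare each sentence against a small window of neighbours rather than a single predecessor, but the principle is the same: boundaries follow meaning, not character counts.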
The list of problems does not end once you have launched. What happens when your source data evolves? Outdated embeddings slowly kill RAG systems over time.
This is what happens when the underlying knowledge in your document corpus changes (new policies, updated facts, restructured documentation) but the vectors in your database are never updated.
If your embeddings are stale, your system will essentially answer from a historical record rather than from current facts.
Why is updating a VectorDB technically challenging? Vector databases are very different from traditional SQL databases. When a single document changes, you don’t simply update a couple of fields: you may well have to re-chunk the whole document, generate new vectors, and then replace or delete the old ones. That is a computationally intensive, time-consuming operation, and it can easily lead to downtime or inconsistencies if not handled with care. Teams often skip it because the engineering effort is non-trivial.
When do you have to re-embed the corpus? There’s no universal rule of thumb; testing is your only guide. Rather than waiting for a specific number of changes to accumulate, have your system re-embed automatically on meaningful events, for example after a major version release of your internal rules (if you are building an HR system). You also need to re-embed if the domain itself changes significantly (for example, after a major regulatory shift).
Embedding versioning, keeping track of which documents were embedded in which run and with which model, is good practice. This space still needs innovative ideas; VectorDB migration is a step many teams simply skip.
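As a rough sketch of what that can look like (assuming an in-memory Chroma collection for brevity, the naive chunker only to keep the example short, and an illustrative version label):

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("internal_docs")

EMBEDDING_VERSION = "policies-2024-06-r3"  # bump on model change or major corpus release

def chunk_text(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def reembed_document(doc_id: str, new_text: str) -> None:
    # Remove every chunk that belonged to the previous version of this document.
    collection.delete(where={"doc_id": doc_id})
    # Re-chunk and re-insert; Chroma embeds the documents with its default
    # embedding function here, and the metadata records which run produced them.
    chunks = chunk_text(new_text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[{"doc_id": doc_id, "embedding_version": EMBEDDING_VERSION}
                   for _ in chunks],
    )
```

Tagging every vector with a version string is what makes it possible to audit, roll back, or migrate the store later instead of guessing which embeddings are stale.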
RAG evaluation means measuring how well your RAG application actually performs. The idea is to check whether your knowledge assistant powered by RAG gives accurate, helpful, and grounded answers. Or, more simply: is it actually working for your real use case?
Evaluating a RAG system is different from evaluating a classic LLM. Your system has to perform on real queries that you can’t fully anticipate. What you want to understand is whether the system pulls the right information and answers correctly.
A RAG system is made of multiple components, starting from how you chunk and store your documents, to embeddings, retrieval, prompt format, and the LLM version.
Because of this, RAG evaluation should also be multi-level. The best evaluations include metrics for each part of the system separately, as well as business metrics to assess how the entire system performs end to end.
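One minimal component-level check is sketched below: does the retriever surface the expected source document in its top-k results for a small, hand-labeled question set? Here retrieve() stands in for whatever retrieval step your own system exposes.

```python
def retrieval_hit_rate(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of questions whose expected document appears in the top-k results.

    eval_set items look like {"question": ..., "expected_doc_id": ...}.
    """
    hits = 0
    for item in eval_set:
        retrieved_ids = {doc["doc_id"] for doc in retrieve(item["question"], k=k)}
        if item["expected_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)

# Example usage with a hand-labeled set of real user questions:
# score = retrieval_hit_rate(labeled_questions, retrieve=my_retriever, k=5)
```

Generation quality, whether the final answer is correct and grounded in the retrieved context, needs its own metrics on top of this, typically human review or judge-based scoring.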
While this evaluation usually starts during development, you will need it at every stage of the AI product lifecycle.
Rigorous evaluation transforms RAG from a proof of concept into a measurable technical project.
Architecture decisions are frequently imported from blog posts or conference talks without ever asking whether they fit your organization’s specific requirements.
For those who are not familiar with RAG, many RAG architectures exist, starting from a simple Monolithic RAG system and scaling up to complex, agentic workflows.
You do not need a complicated Agentic RAG for your system to work well. In fact, most business problems are best solved with a Basic RAG or a Two-Step RAG architecture. I know the words “agent” and “agentic” are popular right now, but please prioritize implemented value over implemented trends.
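For a sense of scale, a Basic RAG loop can be as small as the sketch below (reusing the Chroma collection from the re-embedding example; the OpenAI client and model name are illustrative choices, not requirements):

```python
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, k: int = 3) -> str:
    # Retrieve the k most relevant chunks from the vector store.
    results = collection.query(query_texts=[question], n_results=k)
    context = "\n\n".join(results["documents"][0])
    # Ask the LLM to answer strictly from the retrieved context.
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Everything beyond this, rerankers, query rewriting, agents, should earn its place by measurably improving your evaluation numbers.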
RAG systems are a fascinating architecture that has gained massive traction recently. While some claim “RAG is dead,” I believe this skepticism is just a natural part of an era where technology evolves incredibly fast.
If your use case is clear and you want to resolve a specific pain point involving large volumes of document data, RAG remains a highly effective architecture. The key is to keep it simple and to involve the user from the very beginning.
Do not forget that building a RAG system is a complex undertaking that requires a mix of Machine Learning, MLOps, deployment, and infrastructure skills. You absolutely must embark on the journey with everyone—from developers to end-users—involved from day one.
🤝 Stay Connected
If you enjoyed this article, feel free to follow me on LinkedIn for more honest insights about AI, Data Science, and careers.
👉 LinkedIn: Sabrine Bendimerad
👉 Medium: https://medium.com/@sabrine.bendimerad1
👉 Instagram: https://tinyurl.com/datailearn