Build a RAG Assistant on Your Own Data: A Technical Guide

A general-purpose model like Claude or GPT is brilliant at language and useless at your specifics. It has never seen your product catalog, your policies, your past tickets, or your internal docs, so when you ask it something only your business would know, it either declines or, worse, makes up a confident answer. That gap is what stops most companies from putting AI in front of customers or staff.

Retrieval-augmented generation (RAG) closes it. Instead of hoping the model already knows, you fetch the relevant passages from your own data at question time and hand them to the model as context, so the answer is grounded in your facts rather than its imagination. This guide walks through how to build one properly: the pipeline, the part everyone gets wrong, and how to keep it from hallucinating or leaking. The examples assume a Node or Vercel app.

Why RAG instead of fine-tuning or a giant prompt

There are three ways to give a model your knowledge, and RAG is usually the right one. Fine-tuning bakes knowledge into the model's weights, which is expensive, slow to update, and bad at facts that change. Stuffing everything into one prompt does not scale past a few documents and gets costly fast. RAG keeps your data in a searchable store and pulls only what each question needs, so it is cheap, picks up a changed document as soon as it is re-indexed, and can cite exactly where an answer came from. For the large majority of business assistants, that is the winning trade.

How RAG works: the pipeline

The whole system is two phases. Offline, you ingest your documents, split them into chunks, turn each chunk into an embedding (a vector that captures its meaning), and store those vectors. Online, when a question arrives, you embed the question the same way, find the chunks whose vectors are closest to it, paste those into the prompt as context, and ask the model to answer from them with citations. Everything else is detail around making each of those steps reliable.
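The two phases can be sketched as two small functions. This is a minimal outline, not a finished implementation: every name in it (`Deps`, `ingest`, `answer`, and the five injected functions) is hypothetical, and you would supply your own chunker, embedding call, store, and model call.

```typescript
// All names here are hypothetical placeholders, not a real library's API.
interface Chunk { text: string; title: string }
interface SourceDoc { title: string; body: string }

interface Deps {
  chunk: (doc: SourceDoc) => Chunk[];
  embed: (text: string) => Promise<number[]>;
  save: (chunk: Chunk, embedding: number[]) => Promise<void>;
  search: (queryEmbedding: number[], limit: number) => Promise<Chunk[]>;
  generate: (context: Chunk[], question: string) => Promise<string>;
}

// Offline phase: split a document, embed each chunk, and store it.
async function ingest(deps: Deps, doc: SourceDoc): Promise<number> {
  let stored = 0;
  for (const chunk of deps.chunk(doc)) {
    await deps.save(chunk, await deps.embed(chunk.text));
    stored++;
  }
  return stored; // number of chunks written
}

// Online phase: embed the question the same way, retrieve, then answer from context.
async function answer(deps: Deps, question: string, limit = 6): Promise<string> {
  const hits = await deps.search(await deps.embed(question), limit);
  return deps.generate(hits, question);
}
```

Passing the steps in as functions keeps each one swappable, which makes it easier to change the chunking or the model later without touching the flow.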

Chunking: the part everyone gets wrong

Retrieval quality is decided here, long before the model is involved. Chunk too large and each result is full of irrelevant text that dilutes the answer and wastes tokens. Chunk too small and you sever the context a passage needs to make sense. A practical starting point is a few hundred words per chunk with a small overlap so ideas that straddle a boundary are not split, and chunking on natural seams (headings, paragraphs) rather than blindly every N characters.
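One way to chunk on paragraph seams with a word budget and overlap is sketched below. The function name `chunkByParagraph` and its default sizes are assumptions chosen for illustration, and it treats blank lines as the paragraph boundary; a real chunker would likely also respect headings.

```typescript
interface TextChunk { text: string; wordCount: number }

// Split on blank lines, then pack whole paragraphs into chunks of roughly
// `maxWords`. Each new chunk starts with the last `overlapWords` words of the
// previous one so an idea that straddles a boundary appears in both.
function chunkByParagraph(body: string, maxWords = 300, overlapWords = 40): TextChunk[] {
  const paragraphs = body
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: TextChunk[] = [];
  let current: string[] = []; // words in the chunk being built
  let fresh = 0;              // words in `current` that are not carried-over overlap

  const flush = () => {
    if (fresh === 0) return; // nothing new since the last chunk
    chunks.push({ text: current.join(" "), wordCount: current.length });
    current = overlapWords > 0 ? current.slice(-overlapWords) : [];
    fresh = 0;
  };

  for (const paragraph of paragraphs) {
    const words = paragraph.split(/\s+/);
    // Close the current chunk if this paragraph would push it past the budget.
    if (fresh > 0 && current.length + words.length > maxWords) flush();
    current.push(...words);
    fresh += words.length;
  }
  flush();
  return chunks;
}
```

Note two simplifications: a single paragraph longer than `maxWords` becomes one oversized chunk rather than being split, and paragraph breaks inside a chunk are flattened to spaces.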

Attach metadata to every chunk too: the source document, a title, a URL, a date, and any access tags. You will need it for citations, for filtering, and for permissions.

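// chunk, embed, and store each piece with its metadata
// (chunkDocument, embed, and db stand in for your own helpers)
for (const chunk of chunkDocument(doc, { size: 800, overlap: 100 })) {
  const embedding = await embed(chunk.text);   // text -> vector
  await db.chunks.insert({
    docId: doc.id, title: doc.title, url: doc.url,
    updatedAt: doc.updatedAt, accessTags: doc.accessTags,   // assumed fields: date and permissions
    text: chunk.text, embedding,
  });
}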

Embeddings and the vector store

An embedding model maps text to a vector so that similar meanings land near each other, and a vector store lets you find the nearest ones fast. You do not need exotic infrastructure: pgvector on the Postgres you already run is enough for most businesses, and a dedicated store like Pinecone makes sense only at large scale. Retrieval is then a nearest-neighbor search.

async function retrieve(question: string, userId: string) {
  const qVec = await embed(question);
  // nearest chunks the user is allowed to see; <-> is pgvector's L2 (Euclidean)
  // distance operator, and <=> is the cosine-distance equivalent
  return db.query`
    SELECT text, title, url FROM chunks
    WHERE doc_id IN (SELECT doc_id FROM acl WHERE user_id = ${userId})
    ORDER BY embedding <-> ${qVec}
    LIMIT 6`;
}
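To make the query less of a black box, here is roughly what a permission-filtered nearest-neighbor search computes, written as an exact in-memory search. It is a sketch for understanding and for small test fixtures only: the `Row` shape and the `nearest` function are hypothetical, and a database index does this far more efficiently.

```typescript
interface Row { docId: string; title: string; text: string; embedding: number[] }

// Euclidean (L2) distance, the metric behind pgvector's <-> operator.
function l2Distance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimension");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Exact search: keep only permitted documents, sort by distance, return the closest k.
function nearest(rows: Row[], query: number[], allowedDocIds: Set<string>, k = 6): Row[] {
  return rows
    .filter(row => allowedDocIds.has(row.docId))
    .map(row => ({ row, distance: l2Distance(row.embedding, query) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
    .map(scored => scored.row);
}
```

Because the permission filter runs before ranking, a chunk the user cannot see never competes for a slot in the results.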

Grounding: how you stop hallucinations

The model should answer from the retrieved context and nothing else. Say so explicitly in the system prompt, tell it to admit when the answer is not in the context rather than inventing one, and ask it to cite the sources you provided. The single biggest lever on quality is not which model you use but whether retrieval actually surfaced the right passage, so when answers are wrong, look at what was retrieved first.

const messages = [
  { role: "system", content:
    "Answer only from the provided context. If it is not there, say you do not know. Cite the source titles." },
  { role: "user", content:
    `Context:\n${hits.map(h => `[${h.title}] ${h.text}`).join("\n---\n")}\n\nQuestion: ${question}` },
];
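A prompt instruction is a request, not a guarantee, so it can help to check the citations after the fact. The sketch below assumes the model cites sources as bracketed titles, mirroring the `[title]` format used in the context above; that citation format and the `checkCitations` helper are assumptions, not something the model is guaranteed to follow.

```typescript
interface CitationCheck {
  cited: string[];    // bracketed titles that match a provided source
  unknown: string[];  // bracketed titles that were never in the context
}

// Hypothetical post-check: collect every [Title] in the answer and compare it
// against the titles that were actually passed in as context.
function checkCitations(answer: string, providedTitles: string[]): CitationCheck {
  const provided = new Set(providedTitles);
  const found = new Set<string>();
  const pattern = /\[([^\]]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    found.add(match[1].trim());
  }
  const all = Array.from(found);
  return {
    cited: all.filter(title => provided.has(title)),
    unknown: all.filter(title => !provided.has(title)),
  };
}
```

An answer with no citations at all, or with any `unknown` entries, is a reasonable candidate for logging and review rather than being shown as-is.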

Keeping it fresh and access-controlled

Two things turn a demo into something you can ship. First, freshness: re-embed and update a document's chunks whenever it changes, so the assistant never answers from stale content. Second, permissions: retrieval must respect who is asking. Filter candidate chunks by the user's access before the nearest-neighbor search, as in the query above, so the assistant can never surface a document the person is not allowed to see. An assistant that leaks an internal doc to the wrong customer is worse than no assistant.
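For freshness, one common approach is to store a hash of each document's text at indexing time and re-embed only when the hash changes. The sketch below uses Node's built-in `crypto` module; the `needsReindex` helper and the idea of keeping hashes in a `Map` are assumptions for illustration, and in practice the hash would more likely live in a column next to the document.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the document text, as a 64-character hex string.
function contentHash(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// True when the document is new or its text differs from what was last indexed.
// `indexed` maps document id -> hash recorded at the last indexing run.
function needsReindex(doc: { id: string; body: string }, indexed: Map<string, string>): boolean {
  const previous = indexed.get(doc.id);
  return previous === undefined || previous !== contentHash(doc.body);
}
```

When a document does need re-indexing, replacing its old chunks and inserting the new ones in a single transaction avoids a window where the assistant sees a mix of both versions.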

Evaluate it like software

You cannot improve what you do not measure. Build a small set of real questions with known good answers, and check the system against it whenever you change the chunking, the model, or the prompt. Track whether the right source was retrieved and whether the answer was faithful to it. This turns tuning from guesswork into a feedback loop, and it is what separates an assistant you trust in production from a clever weekend demo.
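The retrieval half of that measurement is easy to automate. The sketch below scores a batch of already-collected results, assuming each evaluation question has one expected source title; the `EvalResult` shape and `scoreRetrieval` name are hypothetical. Faithfulness of the final answer is harder to score mechanically and is not covered here.

```typescript
interface EvalResult {
  question: string;
  expectedTitle: string;      // the source that should have been retrieved
  retrievedTitles: string[];  // what retrieval actually returned
}

// Fraction of questions whose expected source appeared in the retrieved set,
// plus the questions that missed so they can be inspected by hand.
function scoreRetrieval(results: EvalResult[]): { hitRate: number; misses: string[] } {
  const misses = results
    .filter(r => !r.retrievedTitles.some(title => title === r.expectedTitle))
    .map(r => r.question);
  const hitRate = results.length === 0 ? 0 : (results.length - misses.length) / results.length;
  return { hitRate, misses };
}
```

Running this before and after a chunking or prompt change gives a single number to compare, and the `misses` list points at the questions worth debugging first.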

A RAG build checklist

Choose RAG over fine-tuning for knowledge that changes or needs citations.
Chunk on natural boundaries, a few hundred words with overlap, and attach source metadata.
Store embeddings in pgvector on your existing database before reaching for anything heavier.
Filter retrieval by the user's permissions before the similarity search.
Ground the model: answer only from context, admit uncertainty, cite sources.
Re-index documents when they change so answers stay fresh.
Keep an evaluation set and measure retrieval and faithfulness on every change.

FAQ

What is RAG and why would I use it instead of fine-tuning?

RAG (retrieval-augmented generation) fetches relevant passages from your own data at question time and gives them to the model as context, so answers are grounded in your facts. Fine-tuning instead bakes knowledge into the model's weights, which is expensive, slow to update, and weak on changing facts. For most business assistants RAG wins, because it is cheaper, picks up a changed document as soon as it is re-indexed, and can cite exactly where each answer came from.

How do I stop the AI from making things up?

Ground it and check retrieval. Instruct the model to answer only from the provided context and to say it does not know when the answer is not there, and have it cite the sources you passed in. Most wrong answers are actually retrieval failures: the right passage never reached the model. So when you see a hallucination, inspect what was retrieved first and fix the chunking or search before blaming the model.

Do I need an expensive vector database?

Usually not. For most businesses, the pgvector extension on the Postgres database you already run handles embedding storage and nearest-neighbor search comfortably. A dedicated vector store like Pinecone earns its place only at large scale or very high query volume. Starting on your existing database keeps the architecture simple, the data in one place, and the cost near zero while you prove the assistant out.

How do I keep the assistant from leaking data to the wrong person?

Enforce permissions at retrieval, not just in the interface. Before the similarity search, filter candidate chunks to only the documents the asking user is allowed to see, so a restricted document can never be surfaced as context in the first place. Store access tags on each chunk during ingestion and apply them on every query. Hiding a result in the UI is not enough: the model must never receive content the user cannot access.

How big should my chunks be?

Start around a few hundred words per chunk with a small overlap, and split on natural boundaries like headings and paragraphs rather than fixed character counts. Chunks that are too large bury the relevant sentence in noise and waste tokens, while chunks that are too small lose the surrounding context. Treat chunk size as a setting you tune against an evaluation set, because the right value depends on your documents.

If you want an assistant that actually knows your business and can be trusted in front of customers or staff, tell me what it needs to answer and I will map out the cleanest way to build it on your own data.
