Rag on David Lang

Context Window Strategies: Making the Most of Long-Context LLMs

Fri, 10 Apr 2026 00:00:00 +0000

Million-token context windows tempt teams to dump entire repos into prompts. That is expensive, slow, and often less accurate than targeted retrieval.

When Full Context Helps

Full context works best for bounded inputs: single-file refactors, analysis of one large document, or comparison of a few long contracts.

When Retrieval Wins

Whole codebases, ticket backlogs, and wiki sites are better served by retrieval: embed, filter, rerank, then pass only the top-k chunks.
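
A minimal sketch of the filter, rank, and top-k steps, assuming embeddings are already computed elsewhere. The `Chunk` class, `retrieve` function, and metadata keys are hypothetical names for illustration; plain cosine similarity stands in for a real reranker such as a cross-encoder.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    vector: list[float]                      # embedding produced by some model (assumed)
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float], chunks: list[Chunk],
             required_metadata: dict, k: int = 3) -> list[Chunk]:
    # Filter: keep only chunks whose metadata matches every required key.
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required_metadata.items())
    ]
    # Rank: order the survivors by similarity to the query, best first.
    ranked = sorted(candidates, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    # Top-k: only these chunks go into the prompt.
    return ranked[:k]
```

Filtering before ranking keeps the expensive scoring step off chunks that could never be relevant, such as those from the wrong repository or an archived project.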

Compression Techniques

Summarize conversation history. Use hierarchical memory (session summary + recent turns). Strip comments and generated noise from code context.
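
Hierarchical memory can be as simple as the sketch below, which assumes the session summary is produced elsewhere (for example by a separate summarization call). The function name, the `(role, text)` turn format, and the `keep_recent` default are illustrative choices, not a fixed convention.

```python
def build_context(session_summary: str, turns: list[tuple[str, str]],
                  keep_recent: int = 4) -> str:
    """Combine a running summary with the last few verbatim turns."""
    # Guard against keep_recent == 0: turns[-0:] would return every turn.
    recent = turns[-keep_recent:] if keep_recent > 0 else []
    lines = [f"Summary of earlier conversation: {session_summary}"]
    lines += [f"{role}: {text}" for role, text in recent]
    return "\n".join(lines)
```

Older turns survive only through the summary, so the prompt grows with `keep_recent` rather than with the length of the whole session.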

From RAG to Agentic AI: What's Next for LLM-Powered Apps

Mon, 01 Dec 2025 00:00:00 +0000

The industry moved from chatbots → RAG → agents. Understanding the progression helps you invest in the right layer for your product maturity.

RAG Era

Ground models in private data. Mature patterns: chunking, hybrid search, citations. Still the right default for Q&A and search.

Agent Era

Models call tools, plan multi-step workflows, and maintain state. Higher capability, higher risk.

What’s Next

Evals-as-code in every pipeline
Smaller specialist models routed by orchestrators
On-device for privacy-sensitive steps
Human-agent collaboration UIs, not just chat

Migration Path

Master RAG and evals first. Add one well-scoped agent tool. Measure task completion before expanding autonomy.
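
One way to measure task completion is a small eval harness like this sketch. `EvalCase`, `completion_rate`, and the `run_agent` callable are hypothetical names; a real check would usually inspect side effects or structured output rather than a plain string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    task: str
    check: Callable[[str], bool]   # True if the agent's output completes the task


def completion_rate(run_agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check passes; 0.0 for an empty suite."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if case.check(run_agent(case.task)))
    return passed / len(cases)
```

Tracking this number before and after adding a tool gives a concrete basis for deciding whether to expand autonomy.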

Improving LLM Accuracy: Techniques Beyond Prompt Engineering

Tue, 25 Mar 2025 00:00:00 +0000

When prompts plateau, these engineering levers move accuracy more than another adjective in the system message.

Better Retrieval

Hybrid search (BM25 + vectors), rerankers (Cohere, cross-encoders), and metadata filters reduce wrong context reaching the model.
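
One common way to merge BM25 and vector results is reciprocal rank fusion (RRF). The post does not name a fusion method, so treat this as an illustrative choice; the constant `k = 60` is a conventional default, and the function expects ranked lists of document IDs produced by the two searches.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.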

Structured Outputs

Force JSON with schemas (Zod, Pydantic, OpenAI structured outputs). When parsing fails, retry with a repair prompt that includes the error.
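
The retry-on-parse-failure loop might look like the sketch below. It uses only the standard library's `json` module rather than Zod or Pydantic, and `call_model` is a hypothetical callable standing in for whatever client sends a prompt and returns the raw reply.

```python
import json
from typing import Callable


def ask_for_json(call_model: Callable[[str], str], prompt: str,
                 required_keys: set[str], max_attempts: int = 3) -> dict:
    """Call the model until it returns a JSON object containing the required keys."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as error:  # json.JSONDecodeError is a subclass of ValueError
            # Repair prompt: restate the task, the error, and the bad reply.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({error}). "
                f"Reply with valid JSON only.\nPrevious reply:\n{raw}"
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")
```

Capping the attempts matters: a model that keeps failing should surface an error rather than loop and burn tokens.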

Model Routing

Small models classify intent; large models answer hard questions. Cuts cost and reduces overconfident rambling on simple queries.
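
A routing layer can be a few lines, as in this sketch. The three callables and the `"simple"` label are assumptions for illustration; in practice `classify` would be a call to a small model that returns an intent label.

```python
from typing import Callable


def route(query: str,
          classify: Callable[[str], str],
          small_model: Callable[[str], str],
          large_model: Callable[[str], str]) -> str:
    """Send queries labeled "simple" to the small model and everything else to the large one."""
    label = classify(query)
    model = small_model if label == "simple" else large_model
    return model(query)
```

Defaulting unknown labels to the large model trades some cost for safety when the classifier is unsure.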

Vector Databases: Pinecone, Weaviate, and Chroma Compared

Mon, 22 Apr 2024 00:00:00 +0000

Vector databases store embeddings and perform similarity search. They are the retrieval layer in RAG and recommendation systems.

Comparison

	Pinecone	Weaviate	Chroma
Hosting	Managed cloud	Self-host or cloud	Embedded / local
Best for	Production scale	Hybrid search + GraphQL	Prototyping
Ops burden	Low	Medium	Low

pgvector Alternative

PostgreSQL with pgvector keeps vectors beside relational data, which is excellent when you already run Postgres and need ACID transactions.

Selection Criteria

Consider QPS, filtering (metadata predicates), hybrid keyword + vector search, cost, and data residency. Prototype on Chroma or pgvector; migrate to Pinecone or Weaviate at scale.

Building RAG Systems: Retrieval-Augmented Generation Explained

Thu, 18 Jan 2024 00:00:00 +0000

RAG grounds LLM responses in your private data by retrieving relevant documents before generation. It reduces hallucinations and keeps answers current without retraining models.

Pipeline Overview

Ingest - Load PDFs, wikis, and tickets, then split them into chunks (500–1000 tokens).
Embed - Convert chunks to vectors with an embedding model.
Store - Save vectors in Pinecone, pgvector, or Chroma.
Retrieve - On query, embed the question and find top-k similar chunks.
Generate - Pass chunks as context to the LLM.

# retrieved_chunks: list of chunk strings returned by the Retrieve step
context = "\n\n".join(retrieved_chunks)
prompt = f"Use only this context:\n{context}\n\nQuestion: {user_query}"

Chunking Strategy

Overlap chunks by 10–20% to avoid cutting sentences. Metadata (source, page) helps citations and debugging.
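
A sliding-window chunker with overlap and metadata could look like this sketch. It assumes the text is already split into tokens (whitespace-separated words are a rough stand-in for model tokens), and the function name, dictionary keys, and the 75-token overlap (15% of 500) are illustrative.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 75,
                 source: str = "") -> list[dict]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        # Metadata travels with the chunk so answers can cite their source.
        chunks.append({"text": " ".join(window), "source": source, "start_token": start})
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

For example, `chunk_tokens(text.split(), source="handbook.pdf")` would attach the file name to every chunk; a page number could be added the same way when the loader provides one.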