<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>RAG on David Lang</title>
    <link>https://www.davidlang.tech/tags/rag/</link>
    <description>Recent content in RAG on David Lang</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Fri, 10 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://www.davidlang.tech/tags/rag/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Context Window Strategies: Making the Most of Long-Context LLMs</title>
      <link>https://www.davidlang.tech/posts/context-window-strategies-making-the-most-of-long-context-llms/</link>
      <pubDate>Fri, 10 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://www.davidlang.tech/posts/context-window-strategies-making-the-most-of-long-context-llms/</guid>
      <description>&lt;p&gt;Million-token context windows tempt teams to dump entire repos into prompts. That is expensive, slow, and often less accurate than targeted retrieval.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-full-context-helps&#34;&gt;When Full Context Helps&lt;/h2&gt;&#xA;&lt;p&gt;Single-file refactors, analyzing one large document, comparing a few long contracts.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-retrieval-wins&#34;&gt;When Retrieval Wins&lt;/h2&gt;&#xA;&lt;p&gt;Whole codebases, ticket backlogs, and wiki sites: embed, filter, rerank, then pass the top-k chunks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;compression-techniques&#34;&gt;Compression Techniques&lt;/h2&gt;&#xA;&lt;p&gt;Summarize conversation history. Use hierarchical memory (session summary + recent turns). Strip comments and generated noise from code context.&lt;/p&gt;</description>
    </item>
    <item>
      <title>From RAG to Agentic AI: What&#39;s Next for LLM-Powered Apps</title>
      <link>https://www.davidlang.tech/posts/from-rag-to-agentic-ai-whats-next-for-llm-powered-apps/</link>
      <pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://www.davidlang.tech/posts/from-rag-to-agentic-ai-whats-next-for-llm-powered-apps/</guid>
      <description>&lt;p&gt;The industry moved from chatbots → RAG → agents. Understanding the progression helps you invest in the right layer for your product maturity.&lt;/p&gt;&#xA;&lt;h2 id=&#34;rag-era&#34;&gt;RAG Era&lt;/h2&gt;&#xA;&lt;p&gt;Ground models in private data. Mature patterns: chunking, hybrid search, citations. Still the right default for Q&amp;amp;A and search.&lt;/p&gt;&#xA;&lt;h2 id=&#34;agent-era&#34;&gt;Agent Era&lt;/h2&gt;&#xA;&lt;p&gt;Models call tools, plan multi-step workflows, and maintain state. Higher capability, higher risk.&lt;/p&gt;&#xA;&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s Next&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Evals-as-code&lt;/strong&gt; in every pipeline&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Smaller specialist models&lt;/strong&gt; routed by orchestrators&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;On-device&lt;/strong&gt; for privacy-sensitive steps&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Human-agent collaboration&lt;/strong&gt; UIs, not just chat&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;migration-path&#34;&gt;Migration Path&lt;/h2&gt;&#xA;&lt;p&gt;Master RAG and evals first. Add one well-scoped agent tool. Measure task completion before expanding autonomy.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Improving LLM Accuracy: Techniques Beyond Prompt Engineering</title>
      <link>https://www.davidlang.tech/posts/improving-llm-accuracy-techniques-beyond-prompt-engineering/</link>
      <pubDate>Tue, 25 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://www.davidlang.tech/posts/improving-llm-accuracy-techniques-beyond-prompt-engineering/</guid>
      <description>&lt;p&gt;When prompts plateau, these engineering levers move accuracy more than another adjective in the system message.&lt;/p&gt;&#xA;&lt;h2 id=&#34;better-retrieval&#34;&gt;Better Retrieval&lt;/h2&gt;&#xA;&lt;p&gt;Hybrid search (BM25 + vectors), rerankers (Cohere, cross-encoders), and metadata filters reduce wrong context reaching the model.&lt;/p&gt;&#xA;&lt;h2 id=&#34;structured-outputs&#34;&gt;Structured Outputs&lt;/h2&gt;&#xA;&lt;p&gt;Force JSON with schemas (Zod, Pydantic, OpenAI structured outputs). Parse failures trigger retry with repair prompts.&lt;/p&gt;&#xA;&lt;h2 id=&#34;model-routing&#34;&gt;Model Routing&lt;/h2&gt;&#xA;&lt;p&gt;Small models classify intent; large models answer hard questions. Cuts cost and reduces overconfident rambling on simple queries.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Vector Databases: Pinecone, Weaviate, and Chroma Compared</title>
      <link>https://www.davidlang.tech/posts/vector-databases-pinecone-weaviate-and-chroma-compared/</link>
      <pubDate>Mon, 22 Apr 2024 00:00:00 +0000</pubDate>
      <guid>https://www.davidlang.tech/posts/vector-databases-pinecone-weaviate-and-chroma-compared/</guid>
      <description>&lt;p&gt;Vector databases store embeddings and perform similarity search, which makes them the retrieval layer in RAG and recommendation systems.&lt;/p&gt;&#xA;&lt;h2 id=&#34;comparison&#34;&gt;Comparison&lt;/h2&gt;&#xA;&lt;table&gt;&#xA;  &lt;thead&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;th&gt;&lt;/th&gt;&#xA;          &lt;th&gt;Pinecone&lt;/th&gt;&#xA;          &lt;th&gt;Weaviate&lt;/th&gt;&#xA;          &lt;th&gt;Chroma&lt;/th&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/thead&gt;&#xA;  &lt;tbody&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;Hosting&lt;/td&gt;&#xA;          &lt;td&gt;Managed cloud&lt;/td&gt;&#xA;          &lt;td&gt;Self-host or cloud&lt;/td&gt;&#xA;          &lt;td&gt;Embedded / local&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;Best for&lt;/td&gt;&#xA;          &lt;td&gt;Production scale&lt;/td&gt;&#xA;          &lt;td&gt;Hybrid search + GraphQL&lt;/td&gt;&#xA;          &lt;td&gt;Prototyping&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;Ops burden&lt;/td&gt;&#xA;          &lt;td&gt;Low&lt;/td&gt;&#xA;          &lt;td&gt;Medium&lt;/td&gt;&#xA;          &lt;td&gt;Low&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/tbody&gt;&#xA;&lt;/table&gt;&#xA;&lt;h2 id=&#34;pgvector-alternative&#34;&gt;pgvector Alternative&lt;/h2&gt;&#xA;&lt;p&gt;PostgreSQL with pgvector keeps vectors beside relational data, which is excellent when you already run Postgres and need ACID transactions.&lt;/p&gt;&#xA;&lt;h2 id=&#34;selection-criteria&#34;&gt;Selection Criteria&lt;/h2&gt;&#xA;&lt;p&gt;Consider QPS, filtering (metadata predicates), hybrid keyword + vector search, cost, and data residency. Prototype on Chroma or pgvector; migrate to Pinecone or Weaviate at scale.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Building RAG Systems: Retrieval-Augmented Generation Explained</title>
      <link>https://www.davidlang.tech/posts/building-rag-systems-retrieval-augmented-generation-explained/</link>
      <pubDate>Thu, 18 Jan 2024 00:00:00 +0000</pubDate>
      <guid>https://www.davidlang.tech/posts/building-rag-systems-retrieval-augmented-generation-explained/</guid>
      <description>&lt;p&gt;RAG grounds LLM responses in your private data by retrieving relevant documents before generation. It reduces hallucinations and keeps answers current without retraining models.&lt;/p&gt;&#xA;&lt;h2 id=&#34;pipeline-overview&#34;&gt;Pipeline Overview&lt;/h2&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Ingest&lt;/strong&gt; - Load PDFs, wikis, and tickets, then split them into chunks (500–1000 tokens).&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Embed&lt;/strong&gt; - Convert chunks to vectors with an embedding model.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Store&lt;/strong&gt; - Save vectors in Pinecone, pgvector, or Chroma.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Retrieve&lt;/strong&gt; - On query, embed the question and find top-k similar chunks.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Generate&lt;/strong&gt; - Pass chunks as context to the LLM.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#93a1a1;background-color:#002b36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;# Join the retrieved chunks, separated by blank lines.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;context &lt;span style=&#34;color:#719e07&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#2aa198&#34;&gt;&amp;#34;\n\n&amp;#34;&lt;/span&gt;.join(retrieved_chunks)&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;# Build the prompt: context first, then the user question.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;prompt &lt;span style=&#34;color:#719e07&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#2aa198&#34;&gt;f&amp;#34;Use only this context:\n{context}\n\nQuestion: {user_query}&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;chunking-strategy&#34;&gt;Chunking Strategy&lt;/h2&gt;&#xA;&lt;p&gt;Overlap chunks by 10–20% so that a sentence cut at a chunk boundary still appears whole in the neighboring chunk. Metadata (source, page) helps with citations and debugging.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
