Llm on David Lang

Context Window Strategies: Making the Most of Long-Context LLMs

Fri, 10 Apr 2026 00:00:00 +0000

Million-token context windows tempt teams to dump entire repos into prompts. That is expensive, slow, and often less accurate than targeted retrieval.

When Full Context Helps

Single-file refactors, analyzing one large document, comparing a few long contracts.

When Retrieval Wins

Whole codebases, ticket backlogs, and wiki sites-embed, filter, rerank, then pass top-k chunks.

Compression Techniques

Summarize conversation history. Use hierarchical memory (session summary + recent turns). Strip comments and generated noise from code context.

Building Reliable AI Agents: Lessons from Production

Sat, 28 Feb 2026 00:00:00 +0000

Production agents fail in boring ways: timeouts, tool errors, runaway loops, and silent wrong answers. Reliability engineering applies to agents too.

Hardening Checklist

Max steps and token budgets per session
Idempotent tools with clear error messages
Checkpoint state for long workflows
Circuit breakers when external APIs fail
Structured logging of every tool call

Graceful Degradation

When the agent fails, fall back to search-only RAG or human handoff-never an empty error.

From RAG to Agentic AI: What's Next for LLM-Powered Apps

Mon, 01 Dec 2025 00:00:00 +0000

The industry moved from chatbots → RAG → agents. Understanding the progression helps you invest in the right layer for your product maturity.

RAG Era

Ground models in private data. Mature patterns: chunking, hybrid search, citations. Still the right default for Q&A and search.

Agent Era

Models call tools, plan multi-step workflows, and maintain state. Higher capability, higher risk.

What’s Next

Evals-as-code in every pipeline
Smaller specialist models routed by orchestrators
On-device for privacy-sensitive steps
Human-agent collaboration UIs, not just chat

Migration Path

Master RAG and evals first. Add one well-scoped agent tool. Measure task completion before expanding autonomy.

Evaluating LLM Outputs: RAGAS, DeepEval, and Custom Metrics

Sat, 18 Oct 2025 00:00:00 +0000

Frameworks like RAGAS and DeepEval codify LLM evaluation metrics so you can regression-test prompts and pipelines in CI.

RAGAS (RAG Assessment)

Measures context precision/recall, faithfulness, and answer relevance-ideal for retrieval pipelines.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

result = evaluate(dataset=eval_dataset, metrics=[faithfulness, answer_relevancy])

DeepEval

Offers pytest-style LLM tests, G-Eval, and hallucination metrics with CI integration.

Custom Metrics

Domain-specific checks often outperform generic scores-JSON schema match, SQL execution success, unit test pass rate for codegen.

Building Multi-Agent AI Systems

Tue, 20 May 2025 00:00:00 +0000

Multi-agent systems divide work among specialized agents-a researcher, coder, critic-coordinated by a supervisor or message bus.

Patterns

Supervisor - One model delegates subtasks and aggregates results.

Peer-to-peer - Agents message each other until consensus or max rounds.

Pipeline - Fixed stages (plan → implement → test).

Implementation Tips

Give each agent a narrow system prompt and tool set. Pass structured state (JSON) between agents, not raw chat logs.

Failure Modes

Infinite loops, duplicated work, conflicting edits. Enforce step limits, idempotent tools, and single-writer rules for shared files.

Improving LLM Accuracy: Techniques Beyond Prompt Engineering

Tue, 25 Mar 2025 00:00:00 +0000

When prompts plateau, these engineering levers move accuracy more than another adjective in the system message.

Better Retrieval

Hybrid search (BM25 + vectors), rerankers (Cohere, cross-encoders), and metadata filters reduce wrong context reaching the model.

Structured Outputs

Force JSON with schemas (Zod, Pydantic, OpenAI structured outputs). Parse failures trigger retry with repair prompts.

Model Routing

Small models classify intent; large models answer hard questions. Cuts cost and reduces overconfident rambling on simple queries.

How to Validate and Measure LLM Accuracy in Production

Tue, 18 Feb 2025 00:00:00 +0000

Shipping an LLM feature without measurement is shipping a bug generator. Production validation combines automated metrics, human review, and business KPIs.

Levels of Evaluation

Unit-level - Schema validation, regex checks, refusal detection
Golden set - Curated Q&A pairs scored automatically
Online - User thumbs, task completion, support escalations
Human - Expert rubrics on sampled traffic

Metrics That Matter

Faithfulness - Answer grounded in retrieved context (RAG)
Relevance - Addresses the user question
Toxicity / PII - Safety filters
Latency and cost - p95 tokens and dollars per session

Implementation Sketch

def validate_response(answer: str, context: str) -> dict:
    return {
        "has_citation": "[source:" in answer,
        "length_ok": 50 < len(answer) < 4000,
        "grounded": entailment_score(context, answer) > 0.7,
    }

Log scores to your observability stack (Datadog, LangSmith, Phoenix).

AI-Powered Code Review: Integrating LLMs into Dev Workflows

Sun, 22 Sep 2024 00:00:00 +0000

LLMs can summarize diffs, flag security smells, and suggest tests-but they should augment human review, not replace it.

CI Integration

Post PR diffs to an LLM with a structured prompt. Output JSON findings consumed by GitHub Actions or GitLab CI. Fail builds only on high-severity, high-confidence issues to reduce noise.

Prompt Design for Reviews

Include: changed files, diff hunks, coding standards doc, and explicit instruction to cite line numbers and avoid nits.

Claude API vs OpenAI API: Choosing the Right LLM

Wed, 14 Aug 2024 00:00:00 +0000

Anthropic’s Claude and OpenAI’s GPT families both offer strong APIs. Choosing between them depends on task, context length, cost, and compliance-not benchmark hype alone.

Strengths at a Glance

Claude - Long context windows, careful refusals, strong long-document analysis and coding reviews.

OpenAI - Broad ecosystem, function calling maturity, image and audio modalities, largest third-party integration surface.

Integration Pattern

Abstract the provider behind an interface:

interface LLMProvider {
  chat(messages: Message[]): Promise<string>;
}

Swap implementations per route (cheap model for classification, premium for generation).

Fine-Tuning LLMs: When and How to Customize AI Models

Wed, 15 May 2024 00:00:00 +0000

Fine-tuning adapts a base model to your domain with labeled examples. Use it when prompting and RAG cannot achieve consistent style, format, or task-specific behavior.

When to Fine-Tune

Fixed output schema (legal clauses, medical codes)
Brand voice across thousands of responses
Specialized terminology poorly covered by general models

When Not to Fine-Tune

Facts that change frequently (use RAG)
One-off tasks (use prompting)
Small datasets without validation (risk overfitting)

OpenAI Fine-Tuning Flow

Prepare JSONL with messages arrays. Upload, create job, evaluate on a holdout set. Monitor loss and human ratings before promoting to production.

Building RAG Systems: Retrieval-Augmented Generation Explained

Thu, 18 Jan 2024 00:00:00 +0000

RAG grounds LLM responses in your private data by retrieving relevant documents before generation. It reduces hallucinations and keeps answers current without retraining models.

Pipeline Overview

Ingest - Load PDFs, wikis, tickets into chunks (500–1000 tokens).
Embed - Convert chunks to vectors with an embedding model.
Store - Save vectors in Pinecone, pgvector, or Chroma.
Retrieve - On query, embed the question and find top-k similar chunks.
Generate - Pass chunks as context to the LLM.

context = "

".join(retrieved_chunks)
prompt = f"Use only this context:
{context}

Question: {user_query}"

Chunking Strategy

Overlap chunks by 10–20% to avoid cutting sentences. Metadata (source, page) helps citations and debugging.

Prompt Engineering Fundamentals for Developers

Sat, 14 Oct 2023 00:00:00 +0000

Prompt engineering is the practice of designing inputs so LLMs produce reliable, useful outputs. Developers who treat prompts as code ship better AI features.

Structure Your Prompts

Use clear sections: role, context, task, format, and constraints.

You are a code reviewer for a TypeScript React codebase.
Context: PR diff below.
Task: List bugs, security issues, and style problems.
Format: JSON array of { severity, file, message }.
Constraints: Max 10 items. No speculation beyond the diff.

Few-Shot Examples

Include 2–3 input/output pairs for classification or extraction tasks. Examples beat lengthy instructions for format adherence.

Introduction to LangChain: Building AI-Powered Apps

Wed, 08 Mar 2023 00:00:00 +0000

LangChain composes LLM calls with prompts, memory, tools, and retrieval. It standardizes patterns that every AI app eventually needs.

Chains and Prompts

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer as a senior engineer."),
    ("user", "{question}"),
])
chain = prompt | llm
response = chain.invoke({"question": "What is RAG?"})

Retrieval

Load documents, chunk text, embed with OpenAI or open models, store in a vector DB, and retrieve relevant chunks at query time-foundation for RAG systems.

Getting Started with the OpenAI API in Node.js

Thu, 12 Jan 2023 00:00:00 +0000

The OpenAI API brought large language models to application developers through a simple HTTP interface. Node.js remains a natural fit for BFF layers that call LLMs.

Installation and First Request

npm install openai

import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const completion = await client.chat.completions.create({
  model: 'gpt-4',
  messages: [
    { role: 'system', content: 'You are a helpful coding assistant.' },
    { role: 'user', content: 'Explain async/await in JavaScript.' },
  ],
});

console.log(completion.choices[0].message.content);

Production Considerations

Never expose API keys in frontend bundles. Proxy requests through your backend. Set max_tokens, timeouts, and retry policies. Log token usage for cost control.