LLM Ops · 2024-11-28 · 10 min read

Dynamic Context Injection: Solving RAG Hallucinations

Why static vector search isn't enough for Enterprise AI

Business Impact

Static RAG fails in long conversations. dyncontext acts as 'smart RAM' for your AI agents, dynamically injecting relevant history and reducing token costs by 40% via intelligent caching.

The Limitation of Vector Databases

Standard RAG pipelines are static:

  1. User asks a question.
  2. System retrieves top-K chunks from Vector DB.
  3. LLM answers.

This fails in conversational workflows. If a user says "Change that to 50%," a static RAG system doesn't know what "that" refers to. It searches the vector DB for "50%" and returns garbage.

The Solution: Dynamic Context Management

dyncontext is a middleware layer that sits between your Vector DB and your LLM. It treats context as a living asset, prioritizing information based on recency, relevance, and semantic weight.

Hybrid Retrieval Strategy

We don't rely on embeddings alone. We implement a weighted scoring system, which is crucial for "Sovereign AI" deployments where accuracy is non-negotiable.

import asyncio

from dyncontext import ContextManager

# Configure a hybrid retrieval strategy
cm = ContextManager(
    semantic_weight=0.4,    # Vector Similarity (The "Vibe")
    keyword_weight=0.2,     # BM25 (Exact Keyword Match)
    recency_weight=0.2,     # Time Decay (Newer is better)
    tag_weight=0.2          # Metadata (Department/Project scope)
)

async def main():
    # Retrieve context that actually fits the conversation
    context = await cm.get_context(
        query="What's the liability cap?",
        session_id="legal-case-884"
    )
    print(context)

asyncio.run(main())
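For intuition, here is roughly how those four signals can combine into a single score per chunk. This is an illustrative sketch, not dyncontext's internal code: the cosine, BM25, and tag-match inputs are assumed to be pre-computed and normalized to [0, 1], and the half-life value is a placeholder.

import time

def hybrid_score(cosine, bm25, created_at, tag_match,
                 semantic_weight=0.4, keyword_weight=0.2,
                 recency_weight=0.2, tag_weight=0.2,
                 half_life_s=3600.0):
    # Exponential time decay: a chunk exactly one half-life old scores 0.5.
    age_s = max(time.time() - created_at, 0.0)
    recency = 0.5 ** (age_s / half_life_s)

    # Weighted sum of the four signals, mirroring the ContextManager weights above.
    return (semantic_weight * cosine
            + keyword_weight * bm25
            + recency_weight * recency
            + tag_weight * tag_match)

# Example: a semantically close chunk, created 10 minutes ago, in the right project scope.
score = hybrid_score(cosine=0.83, bm25=0.40, created_at=time.time() - 600, tag_match=1.0)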

Intelligent Reranking & Caching

Retrieval is just step one. We implemented a Cross-Encoder Reranker to filter the results, as sketched after the steps below.

  • Step 1: Retrieve 50 chunks (High Recall).
  • Step 2: Rerank using a high-precision model.
  • Step 3: Pass only the top 5 to the LLM (High Precision).
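The reranking step itself is model-agnostic. Here is a minimal sketch using the open-source sentence-transformers CrossEncoder; the model name and chunk texts are placeholders, not what dyncontext ships with.

from sentence_transformers import CrossEncoder

# Assumed inputs: the user query and the ~50 candidate chunks from the retrieval step.
query = "What's the liability cap?"
chunks = ["Liability is capped at 2x annual fees...", "Termination requires 30 days notice..."]

# Score every (query, chunk) pair with a high-precision cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in chunks])

# Keep only the top 5 chunks for the LLM prompt.
top_chunks = [chunk for _, chunk in sorted(zip(scores, chunks), reverse=True)[:5]]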

We also built a 3-Layer Cache (Embedding, Query, Session) to reduce latency by 70%. If a user asks the same question, we never hit the LLM API.
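To make the query layer concrete, here is one possible shape of it assuming a Redis backend; the key naming, TTL, and session scoping are illustrative rather than dyncontext's exact schema.

import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def query_cache_key(session_id: str, query: str) -> str:
    # Normalize the query so trivial whitespace/case differences still hit the cache.
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(f"{session_id}:{normalized}".encode()).hexdigest()
    return f"dyncontext:query:{digest}"

def get_cached_answer(session_id: str, query: str):
    cached = r.get(query_cache_key(session_id, query))
    return json.loads(cached) if cached else None

def cache_answer(session_id: str, query: str, answer: dict, ttl_s: int = 3600):
    r.setex(query_cache_key(session_id, query), ttl_s, json.dumps(answer))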

Telemetry & Observability

You cannot optimize what you cannot measure. dyncontext provides deep telemetry for every interaction, compatible with OpenTelemetry.

{
    "retrieval_ms": 45,
    "cache_hit": true,
    "relevance_score": 0.92,
    "tokens_saved": 1400
}
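If you already run an OpenTelemetry collector, these fields map naturally onto span attributes. A rough sketch with the standard Python SDK follows; the span and attribute names are illustrative, not dyncontext's exact instrumentation.

from opentelemetry import trace

tracer = trace.get_tracer("dyncontext")

with tracer.start_as_current_span("dyncontext.get_context") as span:
    # Attach the retrieval telemetry to the current span.
    span.set_attribute("dyncontext.retrieval_ms", 45)
    span.set_attribute("dyncontext.cache_hit", True)
    span.set_attribute("dyncontext.relevance_score", 0.92)
    span.set_attribute("dyncontext.tokens_saved", 1400)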

Why Use This Over LangChain?

LangChain is great for prototyping. dyncontext is built for production latency and cost optimization. It is provider-agnostic, working seamlessly with OpenAI, Anthropic, or Local Llama-3 deployments.

Integration

pip install dyncontext
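Wiring it into an LLM call takes a few lines. The sketch below assumes the ContextManager API shown earlier and that get_context returns text you can interpolate into a prompt; the OpenAI SDK and model name are purely examples, and an Anthropic or local Llama-3 endpoint would only change the final call.

import asyncio

from dyncontext import ContextManager
from openai import OpenAI

cm = ContextManager(semantic_weight=0.4, keyword_weight=0.2,
                    recency_weight=0.2, tag_weight=0.2)
client = OpenAI()  # interchangeable with Anthropic or a local Llama-3 server

async def answer(query: str, session_id: str) -> str:
    # Assumes get_context returns prompt-ready text for this session.
    context = await cm.get_context(query=query, session_id=session_id)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(asyncio.run(answer("What's the liability cap?", "legal-case-884")))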

View the Source: GitHub Repository

Technologies Used

Python · Hybrid Search · Vector DB · Reranking · Redis