The "Garbage In, Garbage Out" Problem in RAG
The biggest failure point in Enterprise RAG (Retrieval-Augmented Generation) isn't the model—it's the data. Most organizations try to feed raw PDFs directly into a vector database, leading to poor retrieval and hallucinations.
To build sovereign AI capable of handling complex queries, raw documents (PDF, DOCX, Confluence) must be:
- Layout-parsed (preserving tables and headers)
- Chunked semantically (not just by character count)
- Augmented with synthetic data (generated Q&A pairs for fine-tuning)
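The chunking step above can be sketched with a toy similarity measure standing in for real embeddings. Here Jaccard word-overlap decides where one chunk ends and the next begins; `jaccard` and `semantic_chunks` are illustrative names, not doc2dataset's API:

```python
def jaccard(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: word-set overlap."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever similarity to the previous sentence drops."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) >= threshold:
            chunks[-1].append(cur)   # topically continuous: extend chunk
        else:
            chunks.append([cur])     # topic shift: open a new chunk
    return chunks
```

A production chunker would compare sentence embeddings rather than word sets, but the control flow — split on similarity drops, not character counts — is the same.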
We built doc2dataset to solve this: an open-source pipeline that uses LLMs to prepare data for other LLMs.
Architecture: How to Automate Data Cleaning
We moved away from regex and rule-based parsing. Instead, we treat data preparation as an agentic workflow.
1. The Multi-Modal Parser
We support the chaotic reality of enterprise data. The parser detects input type and applies specific extraction strategies:
- PDFs: Uses OCR and layout analysis to preserve table structures (crucial for financial/legal data).
- DOCX: Maps the XML hierarchy to Markdown headers.
- Unstructured Web: Strips boilerplate HTML while keeping semantic tags.
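The detection step can be sketched as a suffix-based dispatch. The strategy names and the `pick_strategy` helper below are illustrative assumptions, not doc2dataset's internal API:

```python
from pathlib import Path

# Illustrative mapping from input type to extraction strategy.
STRATEGIES = {
    ".pdf":  "ocr_layout",        # OCR + layout analysis, preserves tables
    ".docx": "xml_to_markdown",   # XML hierarchy mapped to Markdown headers
    ".html": "strip_boilerplate", # drop page chrome, keep semantic tags
}

def pick_strategy(path: str) -> str:
    """Choose an extraction strategy from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in STRATEGIES:
        raise ValueError(f"unsupported input type: {suffix!r}")
    return STRATEGIES[suffix]
```

Real detection would also sniff content (a `.pdf` extension on a scanned image vs. a born-digital file changes the strategy), but extension dispatch is the common first pass.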
2. LLM-Powered Extraction (The Secret Sauce)
Instead of manually writing rules for every document type, we define Data Schemas and let a small, efficient LLM (like Mistral or GPT-4o-mini) handle the extraction.
```python
import asyncio

from doc2dataset import DatasetGenerator, ExtractionType

# Define what you want to extract from the noise
generator = DatasetGenerator(
    extraction_types=[
        ExtractionType.QA_PAIRS,   # Great for fine-tuning Llama-3
        ExtractionType.FACTS,      # Great for vector search (RAG)
        ExtractionType.SUMMARIES,  # Great for metadata filtering
    ]
)

# Process entire directories asynchronously
dataset = asyncio.run(generator.process("./messy_legal_docs/"))
```
3. The Quality Gate
Automated pipelines can be noisy. We implemented a filtering layer to ensure only "Gold Standard" data makes it to the fine-tuning set:
- Deduplication: Uses cosine similarity to remove redundant chunks.
- Hallucination Check: Runs self-consistency checks to ensure each Q&A pair is grounded in the source text.
- Complexity Scoring: Discards examples that are too simple ("What is the date?") in favor of complex reasoning tasks.
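The deduplication step can be sketched in pure Python with cosine similarity over bag-of-words counts (a real pipeline would compare dense embeddings, but the gate logic is the same; `deduplicate` is a hypothetical helper, not doc2dataset's API):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(chunks: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a chunk only if it is not near-identical to one already kept."""
    kept: list[str] = []
    vecs: list[Counter] = []
    for chunk in chunks:
        vec = Counter(chunk.lower().split())
        if all(cosine(vec, v) < threshold for v in vecs):
            kept.append(chunk)
            vecs.append(vec)
    return kept
```

Exact and near-exact repeats score at or above the threshold against an already-kept chunk and are dropped before they can skew the fine-tuning set.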
Impact on Model Performance
In our internal benchmarks, models fine-tuned on doc2dataset outputs showed:
- 40% increase in adherence to formatting instructions.
- Reduced hallucinations regarding specific numerical data (due to better table parsing).
- 10x faster iteration cycles for data scientists.
Implementation
The library is optimized for local execution; when paired with local models, your data never leaves your infrastructure.
```bash
pip install doc2dataset
```
View the Source: GitHub Repository