How Prompt Engineering Holds a Pipeline Together

I built a pipeline to convert medical guideline PDFs into structured JSON that a clinical decision-support system could actually query. The layout parsing was the hard part, I thought. Get MinerU to detect headings, group sections, preserve tables. That took a week.

The prompt engineering took longer. And it’s the part that determines whether the whole thing works.

1. The Problem With Asking an LLM to “Extract Information”

The naive version of this pipeline sends a chunk of text to an LLM and asks it to pull out the important facts. You get something back. It looks reasonable. It’s also inconsistent across documents, inconsistent across sections of the same document, and structurally unpredictable in ways that break any downstream system trying to consume it.

The LLM isn’t the problem. The prompt is.

“Extract the key information from this text” is not a contract. It’s a vibe. The model will produce whatever structure seems reasonable for that particular chunk of text on that particular call. Sometimes a list. Sometimes prose. Sometimes the JSON keys are different names for the same concept. Downstream code that tries to parse this will spend most of its time handling edge cases that shouldn’t exist.

The fix is specificity. Not slightly more specific. Radically more specific.

2. Schema as a Prompt Constraint

The pipeline’s Intent Extraction mode enforces a strict output schema. The LLM doesn’t get to decide what shape the output is.

{
  "intent": "short_snake_case_topic",
  "triggers": ["search phrase 1", "search phrase 2"],
  "context_packs": [
    {
      "type": "diagnosis | treatment | referral | symptoms | protocol",
      "source_anchor": "section_name",
      "facts": ["atomic fact 1"],
      "contraindications": ["harmful action to avoid"],
      "rules": [
        {
          "if": ["explicit condition"],
          "then": ["explicit outcome"],
          "else": []
        }
      ]
    }
  ]
}

Every field is defined. Every field’s purpose is defined. The prompt doesn’t just show this schema - it explains what goes in each key, what counts as a valid fact versus a rule, and what short_snake_case_topic means and why.

The result is output with the same shape across every section of every document. The downstream system always gets what it expects. Schema compliance isn’t enforced by the model’s goodwill. It’s enforced by how the prompt is written, backed by a validator that fills missing keys with safe defaults when the model slips.
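A validator of that kind can be quite small. The following is a minimal sketch, not the pipeline’s actual code: the function names and the choice of empty strings and empty lists as "safe defaults" are assumptions, and only the key names come from the schema above. Note that this version also drops any key the schema does not define.

```python
# Hypothetical defaults; only the key names are taken from the schema above.
TOP_LEVEL_DEFAULTS = {"intent": "", "triggers": [], "context_packs": []}
PACK_DEFAULTS = {
    "type": "",
    "source_anchor": "",
    "facts": [],
    "contraindications": [],
    "rules": [],
}
RULE_DEFAULTS = {"if": [], "then": [], "else": []}


def _fill(obj, defaults):
    """Return a dict holding exactly the keys in `defaults`.

    Missing keys and values of the wrong type are replaced with a fresh
    empty value of the expected type ("" or []).
    """
    source = obj if isinstance(obj, dict) else {}
    filled = {}
    for key, default in defaults.items():
        value = source.get(key)
        if not isinstance(value, type(default)):
            value = type(default)()
        filled[key] = value
    return filled


def normalize(result):
    """Apply the defaults at every level of one extraction result."""
    out = _fill(result, TOP_LEVEL_DEFAULTS)
    packs = []
    for pack in out["context_packs"]:
        pack = _fill(pack, PACK_DEFAULTS)
        pack["rules"] = [_fill(rule, RULE_DEFAULTS) for rule in pack["rules"]]
        packs.append(pack)
    out["context_packs"] = packs
    return out
```

With this in place, downstream code can index `pack["contraindications"]` directly, because the key is always present even when the model left it out.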

3. What the Prompt Actually Has to Say

There’s a section in the README that lists what a valid prompt for this pipeline must include. It reads like a contract clause because that’s what it is.

The prompt explicitly forbids hallucination. Not “try to stay accurate” - explicitly states that the model must only extract what is present in the source text. It instructs the model to return empty arrays rather than omitting keys when a field has no content. It requires raw JSON with no markdown fences. It defines what a trigger is: a phrase a real user would type when searching for this information, diverse in phrasing, not just synonyms of each other.

Each of those constraints exists because something broke without it.

The empty array requirement came from a downstream parser hitting a KeyError when a section had no contraindications and the model just omitted the key. The no-markdown-fences rule came from the model wrapping JSON in triple-backtick blocks that broke parsing until a stripping step was added. The hallucination prohibition came from the model inferring clinical facts not in the source text and presenting them with the same confidence as extracted ones.
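The fence-stripping step is the easiest of these to picture. Here is a rough sketch of what such a step might look like, assuming the model wraps the whole response in a single fenced block; `parse_model_output` and the regular expression are illustrative, not taken from the pipeline.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Matches a response that is entirely one fenced block, with an optional
# language tag such as "json" after the opening fence.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_model_output(raw):
    """Strip a surrounding markdown fence if present, then parse as JSON."""
    match = _FENCED.match(raw)
    body = match.group(1) if match else raw
    return json.loads(body.strip())
```

The prompt rule and the stripping step cover the same failure from both sides: the rule makes fences rare, and the code makes the rare case harmless.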

Prompt constraints aren’t defensive programming. They’re a record of failures.

4. Multi-Call as a Recall Strategy

One pass per section isn’t enough for dense documents. The pipeline supports running multiple LLM calls per section with different prompt versions and merging the results.

The idea is that different prompt phrasings surface different aspects of the same text. A prompt focused on conditional logic pulls out rules that a fact-focused prompt misses. A prompt that asks what a clinician would search for generates different triggers than one that asks what a patient would type.

MULTI_CALL_ENABLED=true
MULTI_CALL_COUNT=3

# Each section gets processed three times:
# document_prompt_1.json  - fact and rule focused
# document_prompt_2.json  - trigger and search phrase focused
# document_prompt_3.json  - edge case and contraindication focused

This is prompt engineering used not for accuracy on a single call but for coverage across multiple calls. Variation between calls, whether it comes from the prompt’s focus or from the model’s stochastic nature, becomes useful rather than a problem to suppress.

The tradeoff is cost. Three calls per section on a 50-page document adds up fast. The toggle exists for that reason - you decide when the recall improvement is worth it.
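The merge step is where the extra calls pay off. One plausible way to do it is sketched below, under stated assumptions: results are combined by order-preserving union, context packs are matched on their type and source_anchor, and the intent is taken from the first call. None of this is confirmed as the pipeline’s merge logic, and `merge_results` is a hypothetical name.

```python
import json


def _dedupe(items):
    """Drop exact duplicates while keeping first-seen order.

    Items are compared by their JSON serialization, so this works for
    rule objects (dicts) as well as plain strings.
    """
    seen = set()
    unique = []
    for item in items:
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def merge_results(results):
    """Union several extraction results for the same section."""
    list_fields = ("facts", "contraindications", "rules")
    merged = {
        # Assumption: the first call's intent wins.
        "intent": results[0].get("intent", "") if results else "",
        "triggers": [],
        "context_packs": [],
    }
    packs = {}
    for result in results:
        merged["triggers"].extend(result.get("triggers", []))
        for pack in result.get("context_packs", []):
            key = (pack.get("type", ""), pack.get("source_anchor", ""))
            target = packs.setdefault(
                key,
                {"type": key[0], "source_anchor": key[1],
                 "facts": [], "contraindications": [], "rules": []},
            )
            for field in list_fields:
                target[field].extend(pack.get(field, []))
    merged["triggers"] = _dedupe(merged["triggers"])
    for pack in packs.values():
        for field in list_fields:
            pack[field] = _dedupe(pack[field])
    merged["context_packs"] = list(packs.values())
    return merged
```

Exact-match deduplication is deliberately conservative. Two calls that phrase the same fact differently will both survive the merge, which is the safer failure for recall but may call for a later near-duplicate pass.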

5. Zero-Schema Mode and What It Reveals

The second mode strips the schema constraint entirely. The LLM gets a prompt, returns whatever valid JSON it wants, and the pipeline accepts it. No validation beyond JSON parsability.

This mode exists because the document parsing infrastructure in Stage 1 is valuable regardless of what you do with the text. The same section grouping, reading order preservation, and table extraction all work for generating conversational supervised fine-tuning (SFT) training data, Q&A pairs, summaries, or translations. To switch tasks entirely, you replace one entry in prompts.json. No code changes.

The Python handles I/O, retries, and parsing. The prompt is what the system actually does.
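To make that division of labor concrete, here is a minimal sketch of a task-agnostic section runner. It assumes the model client is passed in as a plain callable (`call_llm`, a hypothetical stand-in for whatever API wrapper is in use) and that a retry simply re-sends the same request; the real pipeline’s retry policy is not described here.

```python
import json


def run_section(section_text, prompt, call_llm, max_retries=2):
    """Send one section to the model and return the parsed JSON.

    Nothing in this function knows what task the prompt describes.
    Swapping the prompt swaps the application.
    """
    last_error = None
    for _ in range(max_retries + 1):
        raw = call_llm(f"{prompt}\n\n{section_text}")
        try:
            # Zero-schema mode: any parsable JSON is accepted.
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
    raise ValueError("model never returned parsable JSON") from last_error
```

The same function serves knowledge extraction, Q&A generation, or translation. The only argument that differs between those runs is `prompt`.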

I didn’t fully appreciate that until I watched the same pipeline - same section grouping, same table extraction, same Docker setup - produce clinical knowledge objects one day and training conversations for ASHA (Accredited Social Health Activist) workers the next, because someone changed a text file. The code didn’t change. The application changed. That’s a strange thing to sit with if you’re used to thinking of software as the code.

6. The JSON Repair Layer and What It Admits

The pipeline includes an auto-repair system that runs before parsing LLM output. It closes unterminated strings, fixes missing commas between JSON objects, and strips trailing commas that break standard parsers. There’s also a standalone cleaning script for manual recovery of existing outputs.

This layer exists because LLMs generating large structured outputs will occasionally truncate mid-sentence when they hit token limits, or produce formatting glitches that are close to valid JSON but not quite. The repair system handles the common failure modes automatically.
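A repair pass for those three failure modes can be written as a single string-aware scan. The sketch below is an illustration under assumptions, not the pipeline’s repair system: `repair_json` is a hypothetical name, and it handles only trailing commas, a missing comma between adjacent objects in an array, and output truncated inside a string value or between values. Truncation right after a key or a colon still produces invalid JSON.

```python
import json


def _last_non_space(chars):
    """Return the last non-whitespace character emitted so far, or ""."""
    for ch in reversed(chars):
        if not ch.isspace():
            return ch
    return ""


def _drop_trailing_comma(chars):
    """Remove trailing whitespace and at most one trailing comma, in place."""
    while chars and chars[-1].isspace():
        chars.pop()
    if chars and chars[-1] == ",":
        chars.pop()


def repair_json(text):
    out = []          # characters of the repaired document
    closers = []      # stack of closing brackets still owed
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            # Inside a string nothing is structural; just track escapes.
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            # Two objects back to back inside an array: add the comma.
            if (ch == "{" and closers and closers[-1] == "]"
                    and _last_non_space(out) == "}"):
                out.append(",")
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]":
            _drop_trailing_comma(out)
            if closers:
                closers.pop()
        out.append(ch)
    if in_string:
        if escaped:
            out.pop()  # a dangling backslash would escape the closing quote
        out.append('"')
    _drop_trailing_comma(out)
    out.extend(reversed(closers))  # close whatever truncation left open
    return "".join(out)
```

Tracking string state is the part that matters. A regex that deletes every comma before a closing bracket would also rewrite that sequence when it appears inside a fact’s text, silently altering extracted content.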

What it admits is that even well-engineered prompts don’t produce perfect output every time. The system is designed around that reality. Prompt engineering sets the ceiling on output quality. Defensive parsing infrastructure is what keeps the pipeline running when the output falls short of that ceiling.

That combination - tight prompts plus tolerant parsing - is the actual engineering pattern here. Neither alone is enough.