llm-feature-design
Skill · v1.0.0 · MIT

llm-feature-design

Design LLM-powered features that survive production - model choice, prompt structure, structured output, fallbacks, cost control, evals, and guardrails. Use when the user wants to add AI/LLM functionality to a product, asks about prompt design, RAG, agent features, handling model failures, or evaluating LLM output quality.

elyra › /skills install llm-feature-design

An LLM feature is a probabilistic component inside a deterministic system. Design for the 5% of calls that go wrong, not the demo that went right.

When to use

  • "Add an AI feature that…" (summarize, classify, extract, generate, chat)
  • Choosing between models, or between LLM and non-LLM solutions
  • "The prompt works sometimes" / output quality is inconsistent
  • Reviewing an LLM integration before launch

Principles

  • LLM last. If regex, SQL, or a rules engine solves it, use that. LLMs buy flexibility at the cost of latency, money, and nondeterminism.
  • Structured in, structured out. Free-text responses parsed with string-splitting is a production incident on a timer. Use schemas/tool calls.
  • The prompt is code. Versioned, reviewed, tested — never edited live in a dashboard.
  • No evals, no iteration. Without a measured baseline, every prompt "improvement" is a coin flip.

Process

1. Frame the task

  • What's the exact job? (classify into N labels ≠ summarize ≠ extract fields ≠ open generation)
  • What does failure cost? Wrong label in an internal tool vs wrong refund amount to a customer → very different designs
  • Could this be done without an LLM? Write down why not.

2. Choose the shape

Task Shape
Classification / extraction Single call, structured output (JSON schema / tool call), low temperature
Summarization / rewriting Single call, length + style constraints in prompt
Q&A over own data RAG: retrieve → ground → answer with citations
Multi-step with side effects Agent with tools — only if a fixed pipeline truly can't work

Pick the smallest model that passes your evals; upgrade on evidence, not vibes.

3. Design the prompt

  • Structure: role → task → constraints → examples (few-shot) → input
  • Put variable user content in clearly delimited sections; treat it as untrusted (prompt injection)
  • Specify the output schema and the failure behavior: "If the answer is not in the context, return null" — give the model an out, or it will invent one

4. Engineer the failure paths

  • Validation: parse + schema-check every response; on failure → one retry with the error fed back, then fallback
  • Fallback: deterministic default, cached previous result, degraded UX, or human handoff — never a raw stack trace
  • Timeouts + budgets: max tokens, max retries, max $/request; rate-limit per user
  • Idempotency for anything with side effects

5. Build evals before shipping

  • Collect 20–100 representative inputs (include the ugly ones) with expected outputs
  • Score automatically where possible (exact match, schema validity, contains-citation); LLM-as-judge for fuzzy quality, spot-checked by humans
  • Run evals on every prompt/model change — this is your regression suite

6. Guardrails & ops

  • Input: size limits, injection screening for high-stakes actions
  • Output: PII/unsafe-content filtering where relevant; citations required for factual claims from RAG
  • Log prompt version, model, latency, token cost per call; alert on error-rate and cost anomalies
  • Plan for model deprecation: pinned model versions + eval suite = safe upgrades

Output format

## LLM feature: <name>

**Task:** … **Why LLM:** …
**Shape:** single call / RAG / agent. **Model:** … (smallest that passes evals)

### Prompt contract
Input: … → Output schema: … → On unanswerable: …

### Failure handling
| Failure | Behavior |
|---------|----------|
| Invalid output | retry ×1 w/ error → fallback: … |
| Timeout | … |

### Evals
Dataset: N cases. Metrics: … Baseline: …%

### Cost & limits
~$/request, max tokens, rate limit.

Anti-patterns

  • ❌ Shipping on "it worked in the playground" with zero evals
  • ❌ Parsing free-text output with regex instead of demanding structured output
  • ❌ No "I don't know" path — forcing the model to always answer manufactures hallucinations
  • ❌ User input concatenated raw into the prompt for a feature with side effects
  • ❌ Defaulting to the biggest model without testing a smaller one
  • ❌ Prompt edited in production with no version history
  • ❌ Unbounded loops in agents — no max iterations, no budget