
Prompt Engineering Techniques That Actually Work for LLMs



Chain-of-Thought and Why It Works

Large language models generate tokens sequentially, and each token is influenced by all previous tokens in the context. Instructing a model to “think step by step” is not magic: it forces the model to surface intermediate reasoning as tokens that subsequent tokens can attend to. This dramatically reduces errors on multi-step tasks because mistakes in early reasoning steps become visible and correctable, rather than being compressed into an unexplained final answer. Chain-of-thought prompting is most effective for tasks that require arithmetic, logical deduction, or multi-hop information retrieval.

The practical implication is that the effort you put into structuring the reasoning scaffold pays dividends. Instead of asking “What is the net margin if revenue is $2.4M and costs are $1.9M?”, ask “Calculate the net profit first, then divide by revenue to get the margin, and show your work.” The explicit instruction to show work changes which tokens the model generates, which changes the quality of the answer. For complex pipelines, decompose problems into subtasks and solve each in a separate prompt rather than asking the model to juggle everything in a single context.
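One way to sketch that decomposition is a small pipeline that runs each subtask as its own prompt and carries earlier results forward. This is a minimal illustration, not a prescribed API: the `ask` callable is a hypothetical stand-in for whatever model client you use, injected so the pipeline logic is testable without an API key.

```python
from typing import Callable

def solve_stepwise(question: str, steps: list[str], ask: Callable[[str], str]) -> str:
    """Run each reasoning step as a separate prompt, feeding every
    intermediate result into the context of the next prompt."""
    context = f"Question: {question}\n"
    answer = ""
    for step in steps:
        # Each subtask sees the original question plus all prior results.
        prompt = f"{context}\nSubtask: {step}\nShow your work, then state the result."
        answer = ask(prompt)
        context += f"\nSubtask: {step}\nResult: {answer}\n"
    return answer

# Subtasks for the margin example above:
margin_steps = [
    "Calculate the net profit (revenue minus costs).",
    "Divide the net profit by revenue to get the margin.",
]
```

Because each step's result is appended to the context, later prompts can attend to earlier arithmetic instead of redoing it implicitly.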

Few-Shot Examples and Output Formatting

The fastest way to get consistent structured output from an LLM is to show it exactly what you want. A well-designed few-shot prompt includes two to four input-output pairs that demonstrate the expected format, tone, and level of detail. When you need JSON output, include examples with the exact keys and value types you expect — models are remarkably good at inferring schemas from examples. Combine this with explicit instructions like “respond only with valid JSON, no prose” and a system prompt that reinforces the constraint.

import json

from anthropic import Anthropic

client = Anthropic()
SYSTEM = "You are a structured data extractor. Always respond with valid JSON only."

FEW_SHOT = """Extract the named entities.

Input: "Apple reported $120B in revenue for Q4, CEO Tim Cook said."
Output: {"organizations": ["Apple"], "people": ["Tim Cook"], "metrics": [{"value": "120B", "unit": "USD", "context": "Q4 revenue"}]}

Input: "Elon Musk's xAI raised $6B from Sequoia and Andreessen Horowitz."
Output: {"organizations": ["xAI", "Sequoia", "Andreessen Horowitz"], "people": ["Elon Musk"], "metrics": [{"value": "6B", "unit": "USD", "context": "funding round"}]}

Input: {user_input}
Output:"""

def extract_entities(text: str) -> dict:
    # str.format() would choke on the literal braces in the JSON examples,
    # so substitute the placeholder with a plain string replace instead.
    prompt = FEW_SHOT.replace("{user_input}", text)
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)

Retrieval-Augmented Generation for Grounded Answers

Prompt engineering alone cannot solve the hallucination problem — a model that does not know a fact will confabulate one. Retrieval-Augmented Generation (RAG) addresses this by fetching relevant documents at inference time and injecting them into the prompt as grounding context. The prompt template becomes: “Given the following context, answer the question. If the answer is not in the context, say you don’t know.” This simple instruction, combined with high-quality retrieval, reduces hallucination rates dramatically and makes model outputs auditable — you can trace every claim back to a source document. The quality of your retrieval system matters as much as your prompt: a well-crafted prompt on poorly retrieved context will still produce wrong answers.
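The grounding template described above can be sketched as a small prompt builder. This assumes retrieval has already returned a ranked list of text chunks; the retriever itself, and the function name `build_grounded_prompt`, are illustrative.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks as numbered context, with an explicit
    refusal instruction to reduce confabulation and numbered sources
    so every claim in the answer can be traced back to a document."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Given the following context, answer the question. "
        "Cite sources by number. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Numbering the chunks is a deliberate choice: it lets the model cite `[2]` rather than paraphrase a source, which makes the audit trail explicit.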