AI & LLM Cheatsheet

Prompt engineering, token optimization, reducing hallucination, RAG, embeddings, API usage & evaluation

AI / Practical Guide
Contents
✍️

Prompt Engineering

πŸ’‘ Core Techniques
  • Zero-shot: Direct instruction with no examples
  • Few-shot: Provide 2-5 examples before the task
  • Chain-of-Thought (CoT): "Think step by step" β€” improves reasoning
  • Self-Consistency: Generate multiple CoT paths, take majority vote
  • Tree of Thoughts: Explore multiple reasoning branches, evaluate each
  • ReAct: Reasoning + Acting β€” model explains thought, then calls tools
# System prompt template (structured)
system_prompt = """
You are an expert {role}.

## Rules
- Always respond in {format}
- Cite sources when making claims
- If unsure, say "I don't know"

## Output Format
{output_schema}
"""

# Few-shot example
messages = [
    {"role": "system", "content": "Classify sentiment as positive, negative, or neutral."},
    {"role": "user", "content": "The food was amazing!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Terrible service, never coming back."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "The meeting is at 3pm."},  # actual query
]

# Chain-of-Thought prompt
prompt = """Solve the following problem step by step.
Show your reasoning before giving the final answer.

Q: If a train travels 120km in 2 hours,
   then slows to 40km/h for 3 hours,
   what is the average speed for the whole journey?

Think step by step:"""
🎯 Prompt Patterns
  • Persona: "You are a senior Python developer reviewing code"
  • Constraints: "Respond in JSON only", "Max 100 words"
  • Delimiters: Use ```, ---, or XML tags to separate sections
  • Step numbering: "Step 1: … Step 2: …" forces structured thinking
  • Negative examples: "Do NOT include explanations" or "Never hallucinate data"
πŸͺ™

Token Optimization

πŸ’‘ Strategies to Reduce Token Usage
  • Be concise: Remove filler words, use terse instructions
  • Structured output: JSON/YAML is more token-efficient than prose
  • Summarize context: Compress long documents before injecting
  • Caching: Reuse system prompts (prompt caching in API)
  • Truncation: Only include relevant parts of documents
  • Tiktoken: Count tokens before sending to avoid overflows
import tiktoken

# Count tokens (OpenAI models)
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(len(tokens))  # 4

# Estimate cost
input_tokens = len(enc.encode(prompt))
cost = input_tokens / 1_000_000 * 2.50  # GPT-4o: $2.50/1M input

# Token-efficient formatting
# ❌ Verbose: "Please classify the following text into one of these categories"
# βœ… Concise: "Classify β†’ [positive|negative|neutral]"

# Compress context with summarization
summary_prompt = f"Summarize in <200 words:\n{long_document}"
summary = call_llm(summary_prompt)
final_prompt = f"Based on this context:\n{summary}\n\nAnswer: {question}"
πŸ›‘οΈ

Reducing Hallucination

⚠️ Hallucination Causes: Training data gaps, over-confident generation, ambiguous prompts, long context distraction.
Mitigation Strategies
  • Ground in context: Provide source documents, use RAG
  • Instruction: "Only use information from the provided context. If unsure, say 'I don't know'"
  • Lower temperature: 0.0–0.3 for factual tasks
  • Structured output: Force JSON schema β€” constrains free-form generation
  • Citation: "Quote the exact passage supporting your answer"
  • Self-verification: "After answering, verify each claim against the context"
  • Multi-turn verification: Ask model to critique its own response
  • Constrained decoding: Use logit_bias, response_format, tool_use
# Anti-hallucination system prompt
system = """You are a precise research assistant.

RULES:
1. Only answer using the CONTEXT provided below
2. If the context doesn't contain the answer, say "Not found in context"
3. Quote relevant passages with [Source: section_name]
4. Never invent facts, dates, or statistics
5. Express uncertainty with confidence levels: [HIGH/MEDIUM/LOW]

CONTEXT:
{retrieved_documents}
"""

# Structured output with schema enforcement
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0.1,           # low for factuality
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",
            "schema": {
                "type": "object",
                "properties": {
                    "answer": {"type": "string"},
                    "confidence": {"type": "string", "enum": ["HIGH","MEDIUM","LOW"]},
                    "sources": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["answer", "confidence", "sources"]
            }
        }
    }
)
πŸ“š

RAG (Retrieval-Augmented Generation)

RAG Pipeline
  1. Index: Chunk documents β†’ embed β†’ store in vector DB
  2. Retrieve: Embed query β†’ find top-K similar chunks
  3. Augment: Inject retrieved chunks into prompt context
  4. Generate: LLM answers grounded in retrieved context
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(docs)

# 2. Embed & store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Retrieve
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("What is the return policy?")

# 4. Generate with context
context = "\n\n".join([d.page_content for d in docs])
prompt = f"""Answer using ONLY the context below.

Context:
{context}

Question: {question}
Answer:"""
Advanced RAG Strategies
  • Hybrid search: Combine vector similarity + BM25 keyword search
  • Re-ranking: Use a cross-encoder to re-rank retrieved chunks
  • Parent-child chunking: Retrieve small chunks, expand to parent for context
  • Query decomposition: Break complex queries into sub-questions
  • Metadata filtering: Filter by date, source, category before similarity search
🧬

Embeddings & Vector Databases

from openai import OpenAI
import numpy as np

client = OpenAI()

# Generate embeddings
response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dims, cheap
    input=["Hello world", "Goodbye moon"]
)
vec = response.data[0].embedding  # list of floats

# Cosine similarity
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Vector DB Options
  • Chroma: Lightweight, local, great for prototyping
  • Pinecone: Managed, scalable, serverless tier available
  • Weaviate: Open-source, hybrid search built-in
  • Qdrant: Rust-based, fast, filtering support
  • pgvector: PostgreSQL extension, familiar SQL interface
  • FAISS: Facebook's library, local, no server needed
πŸ€–

OpenAI API

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

# Chat completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in Python"}
    ],
    temperature=0.7,
    max_tokens=500,
    top_p=0.9,
)
answer = response.choices[0].message.content

# Streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Tool / Function calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]
response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools
)
🧠

Anthropic API

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY

# Chat
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a precise coding assistant.",
    messages=[
        {"role": "user", "content": "Write a Python fibonacci function"}
    ]
)
print(message.content[0].text)

# Streaming
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=messages
) as stream:
    for text in stream.text_stream:
        print(text, end="")

# Tool use
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }],
    messages=[{"role": "user", "content": "Weather in Tokyo?"}]
)
πŸ”¬

Fine-Tuning

When to Fine-Tune vs Prompt Engineer
  • Prompt engineering first: Always try few-shot + RAG before fine-tuning
  • Fine-tune when: Consistent format/style needed, domain-specific jargon, latency/token reduction, specific behavior hard to prompt
  • Don't fine-tune for: Adding new knowledge (use RAG), temporary needs, small datasets (<50 examples)
# OpenAI fine-tuning (JSONL format)
# training_data.jsonl β€” one JSON per line
{"messages": [{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}

# Upload & create fine-tuning job
file = client.files.create(file=open("data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3}
)

# Use fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini:my-org::abc123",
    messages=messages
)
🀝

AI Agents & Tool Use

Agent Architectures
  • ReAct: Thought β†’ Action β†’ Observation loop
  • Plan-and-Execute: Create full plan, execute steps, revise
  • Multi-agent: Specialized agents collaborate (researcher, coder, reviewer)
  • Reflection: Agent reviews its own output and iterates
# LangGraph agent example
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search internal documentation."""
    # your retrieval logic
    return results

@tool
def run_sql(query: str) -> str:
    """Execute a read-only SQL query."""
    return db.execute(query)

llm = ChatOpenAI(model="gpt-4o")
agent = create_react_agent(llm, tools=[search_docs, run_sql])

result = agent.invoke({
    "messages": [{"role": "user", "content": "How many users signed up last week?"}]
})
πŸ“

Evaluation & Metrics

Key Metrics
  • Accuracy / F1: For classification tasks
  • BLEU / ROUGE: Text similarity (reference-based)
  • BERTScore: Semantic similarity using embeddings
  • Faithfulness: Does the answer match the provided context? (RAG)
  • Relevance: Is the retrieved context relevant to the question?
  • Hallucination rate: % of claims not supported by context
  • LLM-as-Judge: Use a stronger model to evaluate outputs
# LLM-as-Judge evaluation
eval_prompt = """Rate the following answer on a scale of 1-5.

Criteria:
- Accuracy: Is the information correct?
- Completeness: Does it fully answer the question?
- Conciseness: Is it appropriately brief?

Question: {question}
Answer: {answer}
Reference: {reference}

Output JSON: {"accuracy": int, "completeness": int, "conciseness": int, "explanation": str}
"""

# RAGAS evaluation framework
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)  # {'faithfulness': 0.92, 'answer_relevancy': 0.87, ...}
πŸ”’

Safety & Guardrails

Common Attack Vectors
  • Prompt injection: User embeds "Ignore previous instructions…"
  • Jailbreaking: Role-play scenarios to bypass safety
  • Data extraction: Trying to extract system prompts or training data
  • Indirect injection: Malicious instructions in retrieved documents
Defenses
  • Separate system and user content with clear delimiters
  • Input validation and content filtering before LLM
  • Output filtering: check for PII, harmful content, off-topic
  • Use Guardrails frameworks: NeMo Guardrails, Guardrails AI
  • Rate limiting and abuse detection
  • Monitor and log all interactions
βš–οΈ

Model Comparison (2024-2025)

Popular Models At-a-Glance
Model Context Strengths
GPT-4o 128K Best all-round, multimodal, tools
Claude Opus 4 200K Long context, coding, analysis
Claude Sonnet 4 200K Great balance of speed/quality
Gemini 2.5 Pro 1M+ Massive context, multimodal
Llama 3.1 405B 128K Open-source, self-hostable
Mistral Large 128K European, multilingual, fast
DeepSeek V3 128K Cost-effective, strong reasoning