AI • 2024-01-31

Large Language Models (LLMs) - Complete Guide

Comprehensive guide to Large Language Models covering GPT, BERT, transformers, prompt engineering, and practical applications.

What are Large Language Models?

Large Language Models (LLMs) are AI models trained on massive amounts of text data to understand and generate human-like text. They use deep learning, specifically the transformer architecture, to process and generate language.

Key Characteristics

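  • Large Scale: Billions of parameters (GPT-3: 175B; GPT-4: not disclosed, with unofficial estimates above 1T)
  • Pre-trained: Trained on vast internet text
  • Transfer Learning: A pre-trained model can be fine-tuned for specific tasks
  • Few-Shot Learning: Can pick up a task from a few examples given in the prompt
  • Emergent Abilities: Capabilities that appear at scale without being explicitly trained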

Evolution of LLMs

Timeline

2017: Transformer Architecture (Attention is All You Need)
2018: BERT (Bidirectional Encoder)
2018: GPT-1 (117M parameters)
2019: GPT-2 (1.5B parameters)
2020: GPT-3 (175B parameters)
2021: DALL-E, Codex
2022: ChatGPT, InstructGPT
2023: GPT-4, Claude, LLaMA, Bard, GPT-4 Turbo, Gemini
2024: Claude 3

Transformer Architecture

Core Components

1. Self-Attention Mechanism

# Simplified attention calculation
Q = Query  # What we're looking for
K = Key    # What we have
V = Value  # What we return

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

Example:
Input: "The cat sat on the mat"
- "cat" attends to "sat" (subject-verb)
- "sat" attends to "mat" (verb-location)
- "on" attends to "mat" (preposition-object)
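
To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention. The 2-dimensional vectors are made-up numbers chosen for illustration, and the helper names (softmax, attention) are my own; real models use learned projections and optimized tensor libraries.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Returns (outputs, weights); weights[i][j] is how much query i attends to key j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy vectors for three tokens (illustrative values only)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors.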

2. Multi-Head Attention

# Multiple attention mechanisms in parallel
# Each head learns different relationships

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O

Benefits:
- Capture different types of relationships
- Parallel processing
- Better representation learning
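
The split, attend, concatenate structure can be sketched in a few lines. This is a deliberate simplification: the learned projections W^Q, W^K, W^V and W^O are assumed to be identity matrices, so only the shape of the computation is shown. The function names are hypothetical.

```python
import math

def attend(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice each row of X into num_heads equal chunks; returns one matrix per head."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    """Run attention per head on slices of X, then concatenate the results.

    Simplification: all learned projection matrices are treated as identity.
    """
    heads = [attend(Xh, Xh, Xh) for Xh in split_heads(X, num_heads)]
    # Concatenate the head outputs position by position
    return [sum((head[i] for head in heads), []) for i in range(len(X))]

# Three tokens with 4-dimensional embeddings (illustrative values only)
X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
Y = multi_head_self_attention(X, num_heads=2)
print(len(Y), len(Y[0]))
```

The output keeps the input shape, because each of the two heads works on half of the embedding and the halves are concatenated again.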

3. Feed-Forward Networks

FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2

# Two linear transformations with ReLU
# Applied to each position independently
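
As a rough sketch, the feed-forward formula translates directly into code. The weights below are small made-up matrices (2 inputs, 3 hidden units, 1 output) picked so the arithmetic is easy to follow by hand.

```python
def linear(x, W, b):
    """Compute x·W + b for a vector x, a matrix W (one row per input), and a bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to one position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

# Illustrative weights, not taken from any trained model
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0], [1.0], [1.0]]
b2 = [0.5]

# The same weights are applied to every position independently
positions = [[1.0, 2.0], [-1.0, 0.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in positions]
print(outputs)  # [[4.5], [1.5]]
```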

4. Positional Encoding

# Add position information to embeddings
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

# Allows model to understand word order
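
A minimal sketch of the sinusoidal encoding, assuming a small d_model so the table is easy to inspect. Note that the loop variable i already steps over even indices, so it plays the role of 2i in the formula.

```python
import math

def positional_encoding(max_len, d_model):
    """Build a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, i.e. "2i" in the formula
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=3, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0); later positions get distinct patterns that the model can use to tell tokens apart by order.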

Architecture Types

1. Encoder-Only (BERT)

Input → Encoder → Output

Use Cases:
- Text classification
- Named entity recognition
- Question answering
- Sentiment analysis

Example: BERT, RoBERTa, ALBERT

2. Decoder-Only (GPT)

Input → Decoder → Output

Use Cases:
- Text generation
- Code generation
- Creative writing
- Chatbots

Example: GPT-3, GPT-4, LLaMA

3. Encoder-Decoder (T5)

Input → Encoder → Decoder → Output

Use Cases:
- Translation
- Summarization
- Question answering
- Text-to-text tasks

Example: T5, BART, mT5

Popular LLMs

1. GPT (Generative Pre-trained Transformer)

GPT-3.5 (ChatGPT):

Parameters: Not disclosed (its predecessor GPT-3 has 175B)
Context Length: 4,096 tokens (16K in the later gpt-3.5-turbo-16k variant)
Training Data: Up to Sep 2021
Strengths:
- Conversational
- Creative writing
- Code generation
- General knowledge

Limitations:
- Knowledge cutoff
- Can hallucinate
- No internet access

GPT-4:

Parameters: Not disclosed (~1.7T is an unofficial estimate)
Context Length: 8K-32K tokens (128K for GPT-4 Turbo)
Training Data: Up to Sep 2021 (Apr 2023 for GPT-4 Turbo)
Strengths:
- Multimodal (text + images)
- Better reasoning
- More accurate
- Longer context

Pricing (launch list prices; check the current price list before budgeting):
- GPT-4: $0.03/1K input, $0.06/1K output
- GPT-4-32K: $0.06/1K input, $0.12/1K output
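
To see what these rates mean for a single request, here is a small cost estimator. The prices are copied from the list above and may be out of date; the function name and the token counts are illustrative.

```python
# Prices in USD per 1K tokens, copied from the list above
PRICES = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-32k": {"input": 0.06, "output": 0.12},
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the cost in USD of one request."""
    price = PRICES[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

cost = estimate_cost("gpt-4", input_tokens=1500, output_tokens=500)
print(f"${cost:.3f}")  # $0.075
```

A 1,500-token prompt with a 500-token reply costs 1.5 × $0.03 + 0.5 × $0.06 = $0.075 at these rates.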

2. Claude (Anthropic)

Claude 3:

Variants: Opus, Sonnet, Haiku
Context Length: 200K tokens
Strengths:
- Trained with Constitutional AI (a safety-focused training method)
- Long context
- Better at following instructions
- Reduced hallucinations

Use Cases:
- Document analysis
- Research assistance
- Content creation
- Code review

3. LLaMA (Meta)

LLaMA 2:

Sizes: 7B, 13B, 70B parameters
License: Llama 2 Community License (free for research and most commercial use, with restrictions)
Strengths:
- Openly available weights
- Efficient
- Can run locally
- Fine-tunable

Use Cases:
- Research
- Custom applications
- On-premise deployment
- Cost-effective solutions

4. Gemini (Google)

Gemini Pro:

Multimodal: Text, images, audio, video
Context Length: 32K tokens
Strengths:
- Multimodal understanding
- Google integration
- Real-time information
- Code execution

Use Cases:
- Complex reasoning
- Multimodal tasks
- Research
- Development

Prompt Engineering

Basic Principles

1. Be Specific

❌ Bad: "Write about AI"
✅ Good: "Write a 500-word article explaining AI to beginners, 
         including 3 real-world examples"

2. Provide Context

❌ Bad: "Translate this"
✅ Good: "Translate the following technical documentation from 
         English to Spanish, maintaining technical terminology"

3. Use Examples (Few-Shot)

Classify sentiment:

Example 1:
Text: "I love this product!"
Sentiment: Positive

Example 2:
Text: "Terrible experience"
Sentiment: Negative

Now classify:
Text: "It's okay, nothing special"
Sentiment: ?
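
Few-shot prompts like the one above are usually assembled from a list of examples rather than written by hand. A minimal sketch, with a hypothetical helper name:

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot classification prompt from (text, label) pairs."""
    lines = [task, ""]
    for n, (text, label) in enumerate(examples, start=1):
        lines += [f"Example {n}:", f'Text: "{text}"', f"Sentiment: {label}", ""]
    # Leave the final label blank for the model to complete
    lines += ["Now classify:", f'Text: "{query}"', "Sentiment:"]
    return "\n".join(lines)

examples = [
    ("I love this product!", "Positive"),
    ("Terrible experience", "Negative"),
]
prompt = build_few_shot_prompt("Classify sentiment:", examples, "It's okay, nothing special")
print(prompt)
```

Keeping examples in a list makes it easy to swap them out or add more when the model misclassifies a particular kind of input.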

Advanced Techniques

1. Chain-of-Thought (CoT)

Prompt: "Let's solve this step by step:
Problem: If a train travels 120 km in 2 hours, 
         what's its speed in m/s?

Step 1: Calculate speed in km/h
Step 2: Convert to m/s
Step 3: Final answer"

Benefits:
- Better reasoning
- Fewer errors
- Explainable results
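
For reference, here are the three steps of the train problem worked out in code, so you can check a model's chain of thought against the expected answer:

```python
distance_km = 120
time_h = 2

# Step 1: speed in km/h
speed_kmh = distance_km / time_h

# Step 2: convert to m/s (1 km = 1000 m, 1 h = 3600 s)
speed_ms = speed_kmh * 1000 / 3600

# Step 3: final answer
print(f"{speed_kmh:.0f} km/h = {speed_ms:.2f} m/s")  # 60 km/h = 16.67 m/s
```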

2. Self-Consistency

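from collections import Counter

# Generate multiple responses (sample with temperature > 0 so they differ)
# Choose most consistent answer
# `llm` is a placeholder for your client; `prompt` is your chain-of-thought prompt

responses = []
for i in range(5):
    response = llm.generate(prompt)
    responses.append(response)

# Majority vote over the sampled answers
final_answer = Counter(responses).most_common(1)[0][0]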
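
In practice the vote should be taken over the extracted final answers, not the full responses, because two reasoning chains rarely match word for word. Here is a runnable sketch that uses canned responses in place of a real model; the "Final answer:" convention and the helper names are assumptions for illustration.

```python
import re
from collections import Counter

def extract_answer(response):
    """Pull the value after 'Final answer:' out of a response, or None if absent."""
    match = re.search(r"Final answer:\s*(\S+)", response)
    return match.group(1) if match else None

def majority_answer(responses):
    """Return the most common extracted answer, or None if nothing could be parsed."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

def self_consistency(sample, prompt, n_samples=5):
    """Call sample(prompt) n_samples times and vote on the extracted answers."""
    return majority_answer([sample(prompt) for _ in range(n_samples)])

# Canned responses standing in for five sampled LLM calls
canned = iter([
    "60 km/h is 16.67 m/s. Final answer: 16.67",
    "Speed is 120 / 2. Final answer: 60",
    "60 * 1000 / 3600. Final answer: 16.67",
    "Convert km/h to m/s. Final answer: 16.67",
    "120 / 3.6. Final answer: 33.33",
])
final_answer = self_consistency(lambda prompt: next(canned), "What's the speed in m/s?")
print(final_answer)  # 16.67
```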

3. ReAct (Reasoning + Acting)

Thought: I need to find the current weather
Action: search("weather in New York")
Observation: 72°F, sunny
Thought: Now I can answer
Answer: The weather in New York is 72°F and sunny
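
The trace above comes out of a simple loop: ask the model for the next step, run any requested tool, append the observation, and repeat. Below is a minimal sketch of that loop. Both the search tool and the model are hypothetical stubs with canned behavior, so the control flow can run without any API.

```python
import re

def search(query):
    """Hypothetical tool: a canned lookup instead of a real search API."""
    return {"weather in New York": "72°F, sunny"}.get(query, "no results")

TOOLS = {"search": search}

def scripted_llm(transcript):
    """Hypothetical model stub: picks the next step from the transcript so far."""
    if "Observation:" not in transcript:
        return 'Thought: I need to find the current weather\nAction: search("weather in New York")'
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: Now I can answer\nAnswer: The weather in New York is {observation}"

def react(question, llm, max_steps=5):
    """Alternate model steps and tool calls until the model emits an Answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        answer = re.search(r"Answer: (.*)", step)
        if answer:
            return answer.group(1), transcript
        action = re.search(r'Action: (\w+)\("(.*)"\)', step)
        if action:
            tool, arg = action.groups()
            transcript += f"Observation: {TOOLS[tool](arg)}\n"
    return None, transcript  # gave up after max_steps

answer, transcript = react("What's the weather in New York?", scripted_llm)
print(transcript)
```

The max_steps cap matters in a real system: without it, a model that never produces an Answer line would loop indefinitely.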

4. Tree of Thoughts

# Explore multiple reasoning paths
# Evaluate each path
# Choose best solution

Problem → [Path 1, Path 2, Path 3]
       → Evaluate each
       → Select best
       → Continue reasoning
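
One simple way to realize this is a beam search over partial solutions. The sketch below uses a toy arithmetic puzzle (reach a target number from a start value) so it runs without a model; in a real system, propose and score would each be LLM calls. All names here are hypothetical.

```python
def propose(state):
    """Stand-in 'thought generator': candidate next steps from a partial solution."""
    value, steps = state
    return [
        (value + 3, steps + ["+3"]),
        (value * 2, steps + ["*2"]),
        (value - 1, steps + ["-1"]),
    ]

def score(state, target):
    """Stand-in evaluator: closer to the target is better."""
    return -abs(target - state[0])

def tree_of_thoughts(start, target, beam_width=2, max_depth=6):
    """Expand every path in the frontier, keep the best beam_width, repeat."""
    frontier = [(start, [])]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for value, steps in candidates:
            if value == target:
                return steps
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return None  # no path found within max_depth

path = tree_of_thoughts(start=1, target=10)
print(path)  # ['+3', '+3', '+3']
```

The beam width trades cost for coverage: a wider beam explores more paths per level but needs more evaluations.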

Prompt Templates

1. Role-Based

You are an expert Python developer with 10 years of experience.
Your task is to review the following code and suggest improvements.

Code:
[paste code here]

Please provide:
1. Code quality assessment
2. Potential bugs
3. Performance improvements
4. Best practices recommendations

2. Structured Output

Analyze the following text and provide output in JSON format:

Text: "Apple Inc. announced record profits of $100B in Q4 2023"

Output format:
{
  "company": "company name",
  "event": "event type",
  "amount": "monetary value",
  "period": "time period"
}
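
Model output is not guaranteed to be valid JSON, so it is worth validating before using it. A minimal sketch, assuming the model may wrap the JSON in a sentence or two; the reply string below is a made-up example of such output.

```python
import json

REQUIRED_KEYS = {"company", "event", "amount", "period"}

def parse_extraction(raw):
    """Parse model output as JSON and check that the expected keys are present."""
    # Models sometimes wrap JSON in prose; keep only the outermost {...}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    data = json.loads(raw[start:end + 1])
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hypothetical model reply for the example text above
raw_reply = (
    "Here is the result:\n"
    '{"company": "Apple Inc.", "event": "record profits", '
    '"amount": "$100B", "period": "Q4 2023"}'
)
parsed = parse_extraction(raw_reply)
print(parsed["company"], parsed["period"])
```

On a ValueError or json.JSONDecodeError, a common approach is to retry the request with the error message appended to the prompt.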

3. Iterative Refinement

Initial prompt: "Write a blog post about AI"
Refinement 1: "Make it more technical"
Refinement 2: "Add code examples"
Refinement 3: "Focus on practical applications"

Fine-Tuning LLMs

When to Fine-Tune

Use Cases:

  • Domain-specific knowledge
  • Consistent style/tone
  • Specialized tasks
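  • Better performance on a narrow task than prompting alone
  • Cost reduction (shorter prompts, or a smaller model doing a larger model's job)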

Alternatives:

  • Prompt engineering
  • RAG (Retrieval Augmented Generation)
  • Few-shot learning
  • In-context learning

Fine-Tuning Process

1. Data Preparation

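import json

# Prepare training data in the chat format that gpt-3.5-turbo fine-tuning expects
# (OpenAI requires at least 10 examples; two are shown here for brevity)
training_data = [
    {
        "messages": [
            {"role": "user", "content": "Classify: I love this product!"},
            {"role": "assistant", "content": "Positive"}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Classify: Terrible experience"},
            {"role": "assistant", "content": "Negative"}
        ]
    }
]

# Format for OpenAI: one JSON object per line (JSONL)
with open("training.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")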

2. Fine-Tuning

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload training file
with open("training.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo"
)

# Monitor progress
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status)

3. Using Fine-Tuned Model

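# Replace the placeholder model name with the one reported by your finished job
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:org:model:id",
    messages=[
        {"role": "user", "content": "Classify: Great product!"}
    ]
)
print(response.choices[0].message.content)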

RAG (Retrieval Augmented Generation)

Architecture

User Query
    ↓
Retrieve Relevant Documents (Vector DB)
    ↓
Combine Query + Documents
    ↓
LLM Generation
    ↓
Response
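
The retrieve, combine, generate flow can be sketched without any external services. In the toy version below, a bag-of-words count vector stands in for a real embedding model and a plain list stands in for the vector database. The final LLM call is left out, so the script stops at the combined prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, context_docs):
    """Combine the query and the retrieved documents into one prompt."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

documents = [
    "RAG retrieves relevant documents and passes them to the LLM as context.",
    "Fine-tuning updates model weights on task-specific examples.",
    "Positional encoding adds word order information to embeddings.",
]
query = "How does RAG use documents?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

Real embeddings capture meaning rather than exact word overlap, so they would also match a query such as "How is external knowledge added?", which this toy version would miss.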

Implementation

# These import paths match older LangChain releases; newer releases move
# these classes into the langchain_openai and langchain_community packages.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# `docs` is assumed to be a list of LangChain Document objects
# that you have already loaded and split into chunks.

# 1. Create embeddings
embeddings = OpenAIEmbeddings()

# 2. Create vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

# 3. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# 4. Query
result = qa_chain({"query": "What is RAG?"})
print(result['result'])

Benefits

  • ✅ Up-to-date information
  • ✅ Domain-specific knowledge
  • ✅ Reduced hallucinations
  • ✅ Source attribution
  • ✅ Cost-effective (no retraining needed when the knowledge base changes)

LLM Applications

1. Chatbots

from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

print(response.choices[0].message.content)
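
A real chatbot sends the whole conversation on every request, so the message list grows until it hits the context limit. One simple strategy is to keep the system message plus the most recent turns. The sketch below shows this with made-up assistant replies; trim_history is a hypothetical helper, not part of the OpenAI SDK.

```python
def trim_history(messages, max_recent):
    """Keep every system message plus the last max_recent other messages."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    recent = others[-max_recent:] if max_recent > 0 else []
    return system + recent

messages = [{"role": "system", "content": "You are a helpful assistant"}]

# Simulated conversation; the assistant replies are made up for illustration
for user_text, assistant_text in [
    ("Hello!", "Hi! How can I help?"),
    ("What is an LLM?", "A large language model trained on text."),
    ("And RAG?", "Retrieval Augmented Generation."),
]:
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})

trimmed = trim_history(messages, max_recent=4)
print(len(messages), "->", len(trimmed))  # 7 -> 5
```

Trimming by message count is crude; counting tokens or summarizing older turns are common refinements.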

2. Code Generation

prompt = """
Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Sorts in descending order
4. Returns the result

Include docstring and type hints.
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

3. Text Summarization

# long_text holds the document to summarize
prompt = f"""
Summarize the following text in 3 bullet points:

{long_text}

Summary:
"""

4. Sentiment Analysis

# review_text holds the review to analyze
prompt = f"""
Analyze the sentiment of the following review:

Review: {review_text}

Provide:
1. Overall sentiment (Positive/Negative/Neutral)
2. Confidence score (0-1)
3. Key phrases
"""

5. Data Extraction

# text holds the passage to extract from
prompt = f"""
Extract structured information from this text:

Text: {text}

Extract:
- Names
- Dates
- Locations
- Organizations

Format as JSON.
"""

Best Practices

1. Cost Optimization

# Use appropriate model
- GPT-3.5: Simple tasks, high volume
- GPT-4: Complex reasoning, accuracy critical

# Optimize token usage
- Clear, concise prompts
- Limit output length
- Cache responses
- Batch requests

# Monitor usage
import tiktoken

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))
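
The "cache responses" tip can be as simple as a dictionary keyed by a hash of the request. Here is a minimal in-memory sketch. The fake_call function is a stand-in for a real API call, and caching like this only makes sense when you want identical requests to return identical answers.

```python
import hashlib
import json

class ResponseCache:
    """In-memory cache keyed by a hash of the model name and messages."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, model, messages):
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        """Return the cached reply, or invoke call(model, messages) and store it."""
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = call(model, messages)
        return self._store[key]

calls = []

def fake_call(model, messages):
    """Hypothetical stand-in for a real API call; records each invocation."""
    calls.append(messages)
    return f"reply #{len(calls)}"

cache = ResponseCache()
msgs = [{"role": "user", "content": "Define RAG in one sentence."}]
first = cache.get_or_call("gpt-4", msgs, fake_call)
second = cache.get_or_call("gpt-4", msgs, fake_call)
print(first, second, cache.hits)  # reply #1 reply #1 1
```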

2. Error Handling

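import time

from openai import OpenAI, APIConnectionError, APITimeoutError, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except (RateLimitError, APIConnectionError, APITimeoutError):
            # Retry only transient failures; other errors propagate immediately
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...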

3. Safety and Moderation

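# Use moderation API
def moderate(user_input):
    """Return a refusal message if the input is flagged, otherwise None."""
    moderation = client.moderations.create(input=user_input)
    if moderation.results[0].flagged:
        return "Content violates policy"
    return None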

# Add safety instructions
system_prompt = """
You are a helpful assistant. Follow these rules:
1. Don't provide harmful information
2. Respect privacy
3. Be unbiased
4. Admit when unsure
"""

4. Evaluation

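# Test prompts systematically
# `llm` is a placeholder for your own client wrapper
test_cases = [
    {"input": "...", "expected": "..."},
    {"input": "...", "expected": "..."}
]

correct = 0
for test in test_cases:
    result = llm.generate(test["input"])
    # Exact match after normalization; swap in a metric that fits your task
    if result.strip().lower() == test["expected"].strip().lower():
        correct += 1

print(f"Accuracy: {correct / len(test_cases):.0%}")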

Future Trends

1. Multimodal Models

  • Text + Images + Audio + Video
  • Unified understanding
  • Cross-modal generation

2. Smaller, Efficient Models

  • Distillation
  • Quantization
  • Edge deployment

3. Specialized Models

  • Domain-specific LLMs
  • Task-specific optimization
  • Better performance

4. Better Reasoning

  • Chain-of-thought
  • Tool use
  • Multi-step planning

5. Reduced Hallucinations

  • Fact-checking
  • Source attribution
  • Confidence scores

Conclusion

Large Language Models are transforming how we interact with AI and build applications. Understanding their capabilities and limitations is crucial for effective use.

Key Takeaways:

  • LLMs use transformer architecture
  • Prompt engineering is crucial
  • Fine-tuning for specific tasks
  • RAG for up-to-date information
  • Consider cost and safety

Next Steps:

  1. Experiment with different LLMs
  2. Practice prompt engineering
  3. Build a RAG application
  4. Learn fine-tuning
  5. Stay updated with latest models

Happy building! 🤖