AI • 2024-01-31

Large Language Models (LLMs) - Complete Guide

Comprehensive guide to Large Language Models covering GPT, BERT, transformers, prompt engineering, and practical applications.

What are Large Language Models?

Large Language Models (LLMs) are AI models trained on massive amounts of text data to understand and generate human-like text. They are deep neural networks built on the transformer architecture, which lets them process and generate language one token at a time.

Key Characteristics

Large Scale: Billions of parameters (GPT-3: 175B; GPT-4: undisclosed, unofficially estimated at ~1.7T)
Pre-trained: Trained on vast amounts of internet text
Transfer Learning: Can be fine-tuned for specific tasks
Few-Shot Learning: Learn a task from a few examples given in the prompt
Emergent Abilities: Capabilities the model was not explicitly trained for

Evolution of LLMs

Timeline

2017: Transformer Architecture (Attention is All You Need)
2018: GPT-1 (117M parameters)
2018: BERT (Bidirectional Encoder)
2019: GPT-2 (1.5B parameters)
2020: GPT-3 (175B parameters)
2021: DALL-E, Codex
2022: InstructGPT, ChatGPT
2023: GPT-4, Claude, LLaMA, Bard, GPT-4 Turbo, Gemini
2024: Claude 3

Transformer Architecture

Core Components

1. Self-Attention Mechanism

# Simplified attention calculation
Q = Query  # What we're looking for
K = Key    # What we have
V = Value  # What we return

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

Example (illustrative only; real attention patterns are learned and vary by head and layer):
Input: "The cat sat on the mat"
- "cat" attends to "sat" (subject-verb)
- "sat" attends to "mat" (verb-location)
- "on" attends to "mat" (preposition-object)
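The attention formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, assuming random Q, K, and V matrices (6 tokens, d_k = 8) in place of learned projections; the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of the values

# Random stand-ins for the projected queries, keys, and values
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # one output vector per input token
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
```

Dividing by √d_k keeps the scores in a range where softmax does not saturate as the dimension grows.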

2. Multi-Head Attention

# Multiple attention mechanisms in parallel
# Each head learns different relationships

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O

Benefits:
- Capture different types of relationships
- Parallel processing
- Better representation learning
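As a rough sketch of the multi-head formula above, the NumPy code below splits the model dimension into heads, runs attention in each head, concatenates the results, and applies the output projection W^O. The weight matrices are random placeholders and the sizes (d_model = 8, 2 heads) are assumptions chosen for readability.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ V                          # (heads, seq, d_head)
    # Concat(head_1, ..., head_h): back to (seq_len, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 5
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```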

3. Feed-Forward Networks

FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2

# Two linear transformations with ReLU
# Applied to each position independently
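A minimal NumPy sketch of the feed-forward formula, assuming random weights and small sizes (d_model = 8, hidden size 32) purely for illustration:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)  # first linear layer followed by ReLU
    return hidden @ W2 + b2              # second linear layer

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(5, d_model))        # 5 token positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)

# "Applied to each position independently": running the first position
# on its own gives the same result as running the whole sequence.
first_only = feed_forward(x[:1], W1, b1, W2, b2)
print(np.allclose(first_only, y[:1]))
```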

4. Positional Encoding

# Add position information to embeddings
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

# Allows model to understand word order
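The sinusoidal encoding above could be computed as follows. This sketch assumes an even d_model; the function name and the sizes used are hypothetical.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]   # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)           # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)           # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)
print(pe[0, :4])  # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The resulting matrix is added to the token embeddings, so each position gets a distinct, deterministic offset.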

Architecture Types

1. Encoder-Only (BERT)

Input → Encoder → Output

Use Cases:
- Text classification
- Named entity recognition
- Question answering
- Sentiment analysis

Example: BERT, RoBERTa, ALBERT

2. Decoder-Only (GPT)

Input → Decoder → Output

Use Cases:
- Text generation
- Code generation
- Creative writing
- Chatbots

Example: GPT-3, GPT-4, LLaMA

3. Encoder-Decoder (T5)

Input → Encoder → Decoder → Output

Use Cases:
- Translation
- Summarization
- Question answering
- Text-to-text tasks

Example: T5, BART, mT5

Popular LLMs

1. GPT (Generative Pre-trained Transformer)

GPT-3.5 (ChatGPT):

Parameters: Not officially disclosed (GPT-3, its predecessor, has 175B)
Context Length: 4,096 tokens
Training Data: Up to Sep 2021
Strengths:
- Conversational
- Creative writing
- Code generation
- General knowledge

Limitations:
- Knowledge cutoff
- Can hallucinate
- No internet access

GPT-4:

Parameters: Not officially disclosed (~1.7T is an unofficial estimate)
Context Length: 8K-32K tokens (128K for GPT-4 Turbo)
Training Data: Up to Sep 2021 (up to Apr 2023 for GPT-4 Turbo)
Strengths:
- Multimodal (text + images)
- Better reasoning
- More accurate
- Longer context

Pricing (per 1K tokens, at the time of writing):
- GPT-4: $0.03/1K input, $0.06/1K output
- GPT-4-32K: $0.06/1K input, $0.12/1K output
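To make the pricing concrete, here is a small cost estimator using the per-1K-token prices listed above. The function name and the example token counts are made up for illustration, and prices change over time, so treat the numbers as a snapshot.

```python
# Prices in USD per 1K tokens, copied from the list above
PRICES = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-32k": {"input": 0.06, "output": 0.12},
}

def estimate_cost(model, input_tokens, output_tokens):
    price = PRICES[model]
    return (input_tokens / 1000 * price["input"]
            + output_tokens / 1000 * price["output"])

# Example: a 1,500-token prompt with a 500-token reply on GPT-4
cost = estimate_cost("gpt-4", input_tokens=1500, output_tokens=500)
print(f"${cost:.3f}")  # $0.075
```

Output tokens cost twice as much as input tokens here, which is why limiting response length tends to matter more than trimming the prompt.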

2. Claude (Anthropic)

Claude 3:

Variants: Opus, Sonnet, Haiku
Context Length: 200K tokens
Strengths:
- Trained with Constitutional AI, Anthropic's safety-oriented training approach
- Long context
- Strong instruction following
- Designed to reduce hallucinations

Use Cases:
- Document analysis
- Research assistance
- Content creation
- Code review

3. LLaMA (Meta)

LLaMA 2:

Sizes: 7B, 13B, 70B parameters
License: Llama 2 Community License (weights openly available; commercial use permitted with some restrictions)
Strengths:
- Openly available weights
- Efficient
- Can run locally
- Fine-tunable

Use Cases:
- Research
- Custom applications
- On-premise deployment
- Cost-effective solutions

4. Gemini (Google)

Gemini Pro:

Multimodal: Text, images, audio, video
Context Length: 32K tokens
Strengths:
- Multimodal understanding
- Google integration
- Real-time information
- Code execution

Use Cases:
- Complex reasoning
- Multimodal tasks
- Research
- Development

Prompt Engineering

Basic Principles

1. Be Specific

❌ Bad: "Write about AI"
✅ Good: "Write a 500-word article explaining AI to beginners, 
         including 3 real-world examples"

2. Provide Context

❌ Bad: "Translate this"
✅ Good: "Translate the following technical documentation from 
         English to Spanish, maintaining technical terminology"

3. Use Examples (Few-Shot)

Classify sentiment:

Example 1:
Text: "I love this product!"
Sentiment: Positive

Example 2:
Text: "Terrible experience"
Sentiment: Negative

Now classify:
Text: "It's okay, nothing special"
Sentiment: ?
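In practice, few-shot prompts like the one above are usually assembled from a list of labeled examples. A minimal sketch, with a hypothetical helper name:

```python
def build_few_shot_prompt(examples, query):
    lines = ["Classify sentiment:", ""]
    for n, (text, label) in enumerate(examples, start=1):
        lines += [f"Example {n}:", f'Text: "{text}"', f"Sentiment: {label}", ""]
    # End with an open "Sentiment:" slot for the model to complete
    lines += ["Now classify:", f'Text: "{query}"', "Sentiment:"]
    return "\n".join(lines)

examples = [
    ("I love this product!", "Positive"),
    ("Terrible experience", "Negative"),
]
prompt = build_few_shot_prompt(examples, "It's okay, nothing special")
print(prompt)
```

Keeping the examples in a list makes it easy to swap or add examples and measure how that changes the results.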

Advanced Techniques

1. Chain-of-Thought (CoT)

Prompt: "Let's solve this step by step:
Problem: If a train travels 120 km in 2 hours, 
         what's its speed in m/s?

Step 1: Calculate speed in km/h
Step 2: Convert to m/s
Step 3: Final answer"

Benefits:
- Better reasoning
- Fewer errors
- Explainable results
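For reference, the train problem above works out as follows. A quick check like this is handy for verifying the final answer of a model's step-by-step reasoning.

```python
distance_km = 120
time_h = 2

# Step 1: speed in km/h
speed_kmh = distance_km / time_h

# Step 2: convert to m/s (1 km = 1000 m, 1 h = 3600 s)
speed_ms = speed_kmh * 1000 / 3600

# Step 3: final answer
print(f"{speed_kmh} km/h = {speed_ms:.2f} m/s")  # 60.0 km/h = 16.67 m/s
```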

2. Self-Consistency

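# Sample several responses (with temperature > 0 so they differ),
# then choose the most common final answer
from collections import Counter

responses = []
for i in range(5):
    response = llm.generate(prompt)  # llm is a placeholder for your model client
    responses.append(response)

final_answer = Counter(responses).most_common(1)[0][0]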

3. ReAct (Reasoning + Acting)

Thought: I need to find the current weather
Action: search("weather in New York")
Observation: 72°F, sunny
Thought: Now I can answer
Answer: The weather in New York is 72°F and sunny
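The loop behind a trace like this can be sketched as below. Everything here is a stand-in: `fake_llm` is a scripted function rather than a real model, `search` is a stub tool, and the `Action: tool("arg")` syntax is just one possible convention.

```python
import re

def search(query):
    # Stub tool: a real agent would call a search or weather API here
    return "72°F, sunny"

TOOLS = {"search": search}

def fake_llm(transcript):
    # Scripted stand-in for a model: act first, answer once an observation exists
    if "Observation:" not in transcript:
        return ('Thought: I need to find the current weather\n'
                'Action: search("weather in New York")')
    return ("Thought: Now I can answer\n"
            "Answer: The weather in New York is 72°F and sunny")

def react(question, llm, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += "\n" + step
        answer = re.search(r"Answer: (.*)", step)
        if answer:
            return answer.group(1), transcript
        action = re.search(r'Action: (\w+)\("(.*)"\)', step)
        if action:
            tool, arg = action.groups()
            # Run the tool and feed the result back as an observation
            transcript += f"\nObservation: {TOOLS[tool](arg)}"
    return None, transcript  # gave up after max_steps

answer, transcript = react("What is the weather in New York?", fake_llm)
print(answer)
```

The `max_steps` cap matters in real agents, since a model can otherwise loop on actions without ever producing an answer.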

4. Tree of Thoughts

# Explore multiple reasoning paths
# Evaluate each path
# Choose best solution

Problem → [Path 1, Path 2, Path 3]
       → Evaluate each
       → Select best
       → Continue reasoning
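The explore, evaluate, and select loop can be illustrated with a toy search. In this sketch `propose` and `score` are simple stand-ins for what would be LLM calls in a real Tree of Thoughts setup, and the "problem" is just picking steps of 1 to 3 that sum to a target.

```python
def propose(state):
    # Stand-in for an LLM proposing next reasoning steps: extend with 1, 2, or 3
    return [state + [step] for step in (1, 2, 3)]

def score(state, target):
    # Stand-in for an LLM evaluating a partial path: closer to the target is better
    return -abs(target - sum(state))

def tree_of_thoughts(target, depth=4, beam_width=2):
    frontier = [[]]  # start from a single empty path
    for _ in range(depth):
        # Explore: expand every path currently kept
        candidates = [child for state in frontier for child in propose(state)]
        # Evaluate and select: keep only the best few paths
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]

best = tree_of_thoughts(target=10)
print(best, sum(best))
```

Keeping more than one path alive (`beam_width`) is what distinguishes this from plain chain-of-thought, which commits to a single line of reasoning.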

Prompt Templates

1. Role-Based

You are an expert Python developer with 10 years of experience.
Your task is to review the following code and suggest improvements.

Code:
[paste code here]

Please provide:
1. Code quality assessment
2. Potential bugs
3. Performance improvements
4. Best practices recommendations

2. Structured Output

Analyze the following text and provide output in JSON format:

Text: "Apple Inc. announced record profits of $100B in Q4 2023"

Output format:
{
  "company": "company name",
  "event": "event type",
  "amount": "monetary value",
  "period": "time period"
}
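Model output is not guaranteed to be valid JSON, so it is worth validating before use. A minimal sketch, where `raw_reply` is a hypothetical model response to the prompt above:

```python
import json

REQUIRED_KEYS = {"company", "event", "amount", "period"}

def parse_model_json(raw):
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hypothetical reply; a real model's wording may differ
raw_reply = ('{"company": "Apple Inc.", "event": "record profits", '
             '"amount": "$100B", "period": "Q4 2023"}')

parsed = parse_model_json(raw_reply)
print(parsed["company"], parsed["amount"])
```

When parsing fails, a common approach is to retry the request or send the error message back to the model and ask it to correct its output.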

3. Iterative Refinement

Initial prompt: "Write a blog post about AI"
Refinement 1: "Make it more technical"
Refinement 2: "Add code examples"
Refinement 3: "Focus on practical applications"

Fine-Tuning LLMs

When to Fine-Tune

Use Cases:

Domain-specific knowledge
Consistent style/tone
Specialized tasks
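Better performance on a narrow task than prompting alone
Cost reduction (shorter prompts, or a smaller model in place of a larger one)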

Alternatives:

Prompt engineering
RAG (Retrieval Augmented Generation)
Few-shot learning
In-context learning

Fine-Tuning Process

1. Data Preparation

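# Prepare training data in the chat format used for gpt-3.5-turbo fine-tuning
# (two examples shown; a real job needs many more)
training_data = [
    {"messages": [
        {"role": "user", "content": "Classify: I love this product!"},
        {"role": "assistant", "content": "Positive"}
    ]},
    {"messages": [
        {"role": "user", "content": "Classify: Terrible experience"},
        {"role": "assistant", "content": "Negative"}
    ]}
]

# Write one JSON object per line (JSONL), the format OpenAI expects
import json
with open('training.jsonl', 'w') as f:
    for item in training_data:
        f.write(json.dumps(item) + '\n')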

2. Fine-Tuning

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload training file
file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo"
)

# Monitor progress
print(client.fine_tuning.jobs.retrieve(job.id).status)

3. Using Fine-Tuned Model

# Replace the placeholder with the model name reported by the finished job
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:org:model:id",
    messages=[
        {"role": "user", "content": "Classify: Great product!"}
    ]
)
print(response.choices[0].message.content)

RAG (Retrieval Augmented Generation)

Architecture

User Query
    ↓
Retrieve Relevant Documents (Vector DB)
    ↓
Combine Query + Documents
    ↓
LLM Generation
    ↓
Response
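Before reaching for a framework, the retrieve-then-combine steps above can be shown with the standard library alone. This sketch substitutes word counts for learned embeddings and a list for the vector database, so it is a toy illustration of the flow, not a production setup; the final LLM call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. A real system would use an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    # Rank documents by similarity to the query and keep the top k
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, documents):
    # Combine the query with the retrieved context
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

documents = [
    "RAG combines document retrieval with LLM generation.",
    "Transformers use self-attention to process sequences.",
    "Fine-tuning adapts a pre-trained model to a task.",
]
prompt = build_prompt("What is RAG?", documents)
print(prompt)  # this prompt would then be sent to the LLM
```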

Implementation

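# Note: these import paths follow older LangChain releases; newer versions
# move these classes into the langchain_openai and langchain_community packages.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.schema import Document

# 0. Documents to index (placeholder content; load your own data here)
docs = [Document(page_content="RAG retrieves relevant documents and passes them to an LLM.")]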

# 1. Create embeddings
embeddings = OpenAIEmbeddings()

# 2. Create vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

# 3. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# 4. Query
result = qa_chain({"query": "What is RAG?"})
print(result['result'])

Benefits

✅ Up-to-date information
✅ Domain-specific knowledge
✅ Reduced hallucinations
✅ Source attribution
✅ Cost-effective

LLM Applications

1. Chatbots

from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

print(response.choices[0].message.content)

2. Code Generation

prompt = """
Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Sorts in descending order
4. Returns the result

Include docstring and type hints.
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
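For comparison, a correct answer to this prompt might look like the following. It is a hand-written reference, not actual model output, and the function name is made up; having such a reference makes it easier to check what the model returns.

```python
def unique_sorted_desc(numbers: list[float]) -> list[float]:
    """Return the distinct values of `numbers`, sorted in descending order."""
    return sorted(set(numbers), reverse=True)

print(unique_sorted_desc([3, 1, 2, 3, 1]))  # [3, 2, 1]
```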

3. Text Summarization

prompt = f"""
Summarize the following text in 3 bullet points:

{long_text}

Summary:
"""

4. Sentiment Analysis

prompt = f"""
Analyze the sentiment of the following review:

Review: {review_text}

Provide:
1. Overall sentiment (Positive/Negative/Neutral)
2. Confidence score (0-1)
3. Key phrases
"""

5. Data Extraction

prompt = f"""
Extract structured information from this text:

Text: {text}

Extract:
- Names
- Dates
- Locations
- Organizations

Format as JSON.
"""

Best Practices

1. Cost Optimization

# Use appropriate model
- GPT-3.5: Simple tasks, high volume
- GPT-4: Complex reasoning, accuracy critical

# Optimize token usage
- Clear, concise prompts
- Limit output length
- Cache responses
- Batch requests

# Monitor usage
import tiktoken

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))
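The "cache responses" tip above can be as simple as memoizing on the model and prompt. In this sketch `fake_generate` stands in for a real API call, and the class name is hypothetical. Caching like this only makes sense when identical prompts should produce identical answers, for example at temperature 0.

```python
class ResponseCache:
    """Memoize responses by (model, prompt) so repeated prompts cost nothing."""

    def __init__(self, generate):
        self._generate = generate  # callable: (model, prompt) -> response text
        self._store = {}
        self.hits = 0

    def __call__(self, model, prompt):
        key = (model, prompt)
        if key in self._store:
            self.hits += 1  # served from cache, no API call
        else:
            self._store[key] = self._generate(model, prompt)
        return self._store[key]

calls = []

def fake_generate(model, prompt):
    # Stand-in for a paid API call; records each real invocation
    calls.append((model, prompt))
    return prompt.upper()

cached = ResponseCache(fake_generate)
first = cached("gpt-4", "hello")
second = cached("gpt-4", "hello")  # identical request: answered from the cache
print(len(calls), cached.hits)
```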

2. Error Handling

from openai import OpenAI
import time

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except Exception:
            # Out of retries: re-raise the last error to the caller
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...

3. Safety and Moderation

# Use moderation API
def check_input(user_input):
    moderation = client.moderations.create(input=user_input)
    if moderation.results[0].flagged:
        return "Content violates policy"
    return None  # input passed moderation

# Add safety instructions
system_prompt = """
You are a helpful assistant. Follow these rules:
1. Don't provide harmful information
2. Respect privacy
3. Be unbiased
4. Admit when unsure
"""

4. Evaluation

# Test prompts systematically
test_cases = [
    {"input": "...", "expected": "..."},
    {"input": "...", "expected": "..."}
]

# llm and evaluate are placeholders for your model client and scoring function
for test in test_cases:
    result = llm.generate(test["input"])
    score = evaluate(result, test["expected"])
    print(f"Score: {score}")
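A runnable version of this idea needs a concrete scoring function and an aggregate metric. The sketch below uses exact-match scoring and a keyword rule (`toy_model`) as a stand-in for a real LLM call; both are assumptions chosen to keep the example self-contained.

```python
def exact_match(prediction, expected):
    # Case- and whitespace-insensitive comparison
    return prediction.strip().lower() == expected.strip().lower()

def run_eval(model, test_cases):
    correct = sum(exact_match(model(c["input"]), c["expected"]) for c in test_cases)
    return correct / len(test_cases)

def toy_model(text):
    # Hypothetical stand-in for an LLM call: a simple keyword rule
    return "Positive" if "love" in text.lower() else "Negative"

test_cases = [
    {"input": "I love this product!", "expected": "Positive"},
    {"input": "Terrible experience", "expected": "Negative"},
    {"input": "Great value", "expected": "Positive"},
]

accuracy = run_eval(toy_model, test_cases)
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.67
```

Running the same test set after every prompt change shows whether an edit actually helped or just moved the errors around.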

Future Trends

1. Multimodal Models

Text + Images + Audio + Video
Unified understanding
Cross-modal generation

2. Smaller, Efficient Models

Distillation
Quantization
Edge deployment

3. Specialized Models

Domain-specific LLMs
Task-specific optimization
Better performance

4. Better Reasoning

Chain-of-thought
Tool use
Multi-step planning

5. Reduced Hallucinations

Fact-checking
Source attribution
Confidence scores

Conclusion

Large Language Models are transforming how we interact with AI and build applications. Understanding their capabilities and limitations is crucial for effective use.

Key Takeaways:

LLMs use transformer architecture
Prompt engineering is crucial
Fine-tuning adapts a model to specific tasks
RAG supplies up-to-date information
Consider cost and safety

Next Steps:

Experiment with different LLMs
Practice prompt engineering
Build a RAG application
Learn fine-tuning
Stay updated with latest models

Happy building! 🤖