Large Language Models (LLMs) - Complete Guide
Comprehensive guide to Large Language Models covering GPT, BERT, transformers, prompt engineering, and practical applications.
What are Large Language Models?
Large Language Models (LLMs) are neural networks trained on massive amounts of text data to understand and generate human-like text. They rely on deep learning, specifically the transformer architecture, to process and generate language.
Key Characteristics
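- Large Scale: Billions of parameters (GPT-3: 175B; GPT-4: not disclosed, with unofficial estimates above 1T)
- Pre-trained: Trained on vast amounts of text, much of it from the internet
- Transfer Learning: Can be fine-tuned for specific tasks
- Few-Shot Learning: Can pick up a task from a few examples given in the prompt
- Emergent Abilities: Capabilities that appear at scale without being explicitly trained for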
Evolution of LLMs
Timeline
2017: Transformer Architecture (Attention is All You Need)
2018: BERT (Bidirectional Encoder)
2018: GPT-1 (117M parameters)
2019: GPT-2 (1.5B parameters)
2020: GPT-3 (175B parameters)
2021: DALL-E, Codex
2022: ChatGPT, InstructGPT
2023: GPT-4, Claude, LLaMA, Bard, GPT-4 Turbo, Gemini
2024: Claude 3, Gemini 1.5
Transformer Architecture
Core Components
1. Self-Attention Mechanism
# Simplified attention calculation
Q = Query # What we're looking for
K = Key # What we have
V = Value # What we return
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
Example (illustrative; learned attention patterns are rarely this clean):
Input: "The cat sat on the mat"
- "cat" attends to "sat" (subject-verb)
- "sat" attends to "mat" (verb-location)
- "on" attends to "mat" (preposition-object)
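The attention formula above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single head, no masking, and random matrices standing in for learned query, key, and value projections; the function names are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

# Random stand-ins for the projected token vectors (assumption: d_k = 8)
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_k = 8
Q = rng.normal(size=(len(tokens), d_k))
K = rng.normal(size=(len(tokens), d_k))
V = rng.normal(size=(len(tokens), d_k))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # one output vector per token
```

Row i of `weights` shows how strongly token i attends to every other token; with trained projections those rows would carry relationships like the ones listed above.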
2. Multi-Head Attention
# Multiple attention mechanisms in parallel
# Each head learns different relationships
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
Benefits:
- Capture different types of relationships
- Parallel processing
- Better representation learning
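As a rough sketch of the MultiHead formula, the NumPy code below splits the model dimension into several heads, runs attention in each, then concatenates and projects with W^O. The weight matrices are random placeholders and the sizes (`d_model = 16`, four heads) are assumptions chosen for the example, not values from any particular model.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ V                            # (heads, seq, d_head)
    # Concat(head_1, ..., head_h), then the output projection W^O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```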
3. Feed-Forward Networks
FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2
# Two linear transformations with ReLU
# Applied to each position independently
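A small NumPy sketch of the same feed-forward formula, assuming random weights and an inner dimension four times the model dimension (a common choice, used here only for illustration):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(6, d_model))          # 6 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = feed_forward(x, W1, b1, W2, b2)

# Position independence: feeding one position alone gives the same row
y_first = feed_forward(x[:1], W1, b1, W2, b2)
print(np.allclose(y_first, y[:1]))
```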
4. Positional Encoding
# Add position information to embeddings
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
# Allows model to understand word order
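The two formulas above can be turned into a lookup table with NumPy. This sketch assumes an even `d_model`; the function name and sizes are illustrative.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)
```

The resulting matrix is added to the token embeddings, so two identical words at different positions end up with different input vectors.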
Architecture Types
1. Encoder-Only (BERT)
Input → Encoder → Output
Use Cases:
- Text classification
- Named entity recognition
- Question answering
- Sentiment analysis
Examples: BERT, RoBERTa, ALBERT
2. Decoder-Only (GPT)
Input → Decoder → Output
Use Cases:
- Text generation
- Code generation
- Creative writing
- Chatbots
Examples: GPT-3, GPT-4, LLaMA
3. Encoder-Decoder (T5)
Input → Encoder → Decoder → Output
Use Cases:
- Translation
- Summarization
- Question answering
- Text-to-text tasks
Examples: T5, BART, mT5
Popular LLMs
1. GPT (Generative Pre-trained Transformer)
GPT-3.5 (ChatGPT):
Parameters: Not disclosed by OpenAI (often assumed to be close to GPT-3's 175B)
Context Length: 4,096 tokens
Training Data: Up to Sep 2021
Strengths:
- Conversational
- Creative writing
- Code generation
- General knowledge
Limitations:
- Knowledge cutoff
- Can hallucinate
- No internet access
GPT-4:
Parameters: Not disclosed by OpenAI (~1.7T is an unofficial estimate)
Context Length: 8K-32K tokens (128K for GPT-4 Turbo)
Training Data: Up to Sep 2021 (Apr 2023 for GPT-4 Turbo)
Strengths:
- Multimodal (text + images)
- Better reasoning
- More accurate
- Longer context
Pricing (list prices at launch, per 1K tokens; check the current price list before budgeting):
- GPT-4: $0.03/1K input, $0.06/1K output
- GPT-4-32K: $0.06/1K input, $0.12/1K output
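To see what these per-token prices mean for a single request, here is a small cost estimator. The prices are copied from the list above; the dictionary keys, function name, and token counts are made up for the example.

```python
# USD per 1K tokens, taken from the pricing list above
PRICES = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-32k": {"input": 0.06, "output": 0.12},
}

def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for one request."""
    price = PRICES[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# Example: a 1,500-token prompt with a 500-token reply
cost = estimate_cost("gpt-4", input_tokens=1500, output_tokens=500)
print(f"${cost:.3f}")
```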
2. Claude (Anthropic)
Claude 3:
Variants: Opus, Sonnet, Haiku
Context Length: 200K tokens
Strengths:
- Trained with Constitutional AI, Anthropic's safety-focused training method
- Long context
- Strong instruction following
- Designed to reduce hallucinations
Use Cases:
- Document analysis
- Research assistance
- Content creation
- Code review
3. LLaMA (Meta)
LLaMA 2:
Sizes: 7B, 13B, 70B parameters
License: Llama 2 Community License (openly available weights; commercial use permitted with some restrictions)
Strengths:
- Openly available weights
- Efficient
- Can run locally
- Fine-tunable
Use Cases:
- Research
- Custom applications
- On-premise deployment
- Cost-effective solutions
4. Gemini (Google)
Gemini Pro:
Multimodal: Text, images, audio, video
Context Length: 32K tokens
Strengths:
- Multimodal understanding
- Google integration
- Real-time information
- Code execution
Use Cases:
- Complex reasoning
- Multimodal tasks
- Research
- Development
Prompt Engineering
Basic Principles
1. Be Specific
❌ Bad: "Write about AI"
✅ Good: "Write a 500-word article explaining AI to beginners,
including 3 real-world examples"
2. Provide Context
❌ Bad: "Translate this"
✅ Good: "Translate the following technical documentation from
English to Spanish, maintaining technical terminology"
3. Use Examples (Few-Shot)
Classify sentiment:
Example 1:
Text: "I love this product!"
Sentiment: Positive
Example 2:
Text: "Terrible experience"
Sentiment: Negative
Now classify:
Text: "It's okay, nothing special"
Sentiment: ?
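A few-shot prompt like the one above is usually assembled from a list of labeled examples rather than typed by hand. The helper below is a minimal sketch; its name and signature are assumptions for illustration.

```python
def build_few_shot_prompt(examples, new_text):
    """Build a sentiment-classification prompt from (text, label) pairs."""
    lines = ["Classify sentiment:", ""]
    for n, (text, label) in enumerate(examples, start=1):
        lines += [f"Example {n}:", f'Text: "{text}"', f"Sentiment: {label}", ""]
    # Leave the final label blank so the model fills it in
    lines += ["Now classify:", f'Text: "{new_text}"', "Sentiment:"]
    return "\n".join(lines)

examples = [
    ("I love this product!", "Positive"),
    ("Terrible experience", "Negative"),
]
prompt = build_few_shot_prompt(examples, "It's okay, nothing special")
print(prompt)
```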
Advanced Techniques
1. Chain-of-Thought (CoT)
Prompt: "Let's solve this step by step:
Problem: If a train travels 120 km in 2 hours,
what's its speed in m/s?
Step 1: Calculate speed in km/h
Step 2: Convert to m/s
Step 3: Final answer"
Benefits:
- Better accuracy on multi-step reasoning problems
- Fewer arithmetic and logic slips
- Intermediate steps that can be inspected
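For reference, the three steps in the train problem work out as follows. The variable names are chosen for this example; a good chain-of-thought response should reach the same numbers.

```python
distance_km = 120
time_h = 2

# Step 1: speed in km/h
speed_kmh = distance_km / time_h          # 60.0 km/h

# Step 2: convert to m/s (1 km = 1000 m, 1 h = 3600 s)
speed_ms = speed_kmh * 1000 / 3600        # about 16.67 m/s

# Step 3: final answer
print(f"{speed_ms:.2f} m/s")
```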
2. Self-Consistency
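# Generate multiple responses (sampled with temperature > 0 so they differ)
# Choose the most consistent answer by majority vote
# `llm.generate` is a placeholder assumed to return the final answer as a string
from collections import Counter

responses = []
for i in range(5):
    response = llm.generate(prompt)
    responses.append(response)
final_answer = Counter(responses).most_common(1)[0][0]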
3. ReAct (Reasoning + Acting)
Thought: I need to find the current weather
Action: search("weather in New York")
Observation: 72°F, sunny
Thought: Now I can answer
Answer: The weather in New York is 72°F and sunny
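The trace above comes from a loop that alternates between model output and tool calls. The sketch below shows that control flow under heavy assumptions: `fake_llm` replays a scripted trace instead of calling a real model, and `search` is a hypothetical tool that returns a canned result.

```python
import re

def search(query: str) -> str:
    """Hypothetical tool: returns a canned observation, not a real lookup."""
    return "72°F, sunny"

TOOLS = {"search": search}

# Scripted stand-in for the model, replaying the trace from the text
SCRIPT = [
    'Thought: I need to find the current weather\nAction: search("weather in New York")',
    "Thought: Now I can answer\nAnswer: The weather in New York is 72°F and sunny",
]

def fake_llm(transcript: str, step: int) -> str:
    return SCRIPT[step]

ACTION_RE = re.compile(r'Action: (\w+)\("(.*)"\)')

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for step in range(max_steps):
        output = fake_llm(transcript, step)
        transcript += "\n" + output
        if "Answer:" in output:
            # The model decided it has enough information
            return output.split("Answer:", 1)[1].strip()
        match = ACTION_RE.search(output)
        if match is None:
            break
        tool_name, tool_arg = match.groups()
        observation = TOOLS[tool_name](tool_arg)   # run the requested tool
        transcript += f"\nObservation: {observation}"
    raise RuntimeError("No answer produced within the step limit")

answer = react("What is the weather in New York?")
print(answer)
```

In a real agent, `fake_llm` would be replaced by a model call that receives the growing transcript, and each tool would validate its arguments before running.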
4. Tree of Thoughts
# Explore multiple reasoning paths
# Evaluate each path
# Choose best solution
Problem → [Path 1, Path 2, Path 3]
→ Evaluate each
→ Select best
→ Continue reasoning
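One way to read the diagram above is as a beam search over partial solutions. The sketch below uses a toy problem (pick three numbers from 1, 3, 5 that sum to 9) so it runs without a model; in a real system the `expand` and `score` functions would be LLM calls that propose and rate candidate thoughts. All names here are assumptions for illustration.

```python
def tree_of_thoughts(root, expand, score, beam_width=2, depth=3):
    """Keep the best `beam_width` partial solutions at each level."""
    frontier = [root]
    for _ in range(depth):
        # Explore multiple reasoning paths from every kept state
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        # Evaluate each path and keep only the most promising ones
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)

TARGET = 9

def expand(state):
    # Toy "thought generator": extend the partial solution by one number
    return [state + (n,) for n in (1, 3, 5)]

def score(state):
    # Toy "evaluator": closer to the target sum is better
    return -abs(TARGET - sum(state))

best = tree_of_thoughts((), expand, score, beam_width=2, depth=3)
print(best, sum(best))
```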
Prompt Templates
1. Role-Based
You are an expert Python developer with 10 years of experience.
Your task is to review the following code and suggest improvements.
Code:
[paste code here]
Please provide:
1. Code quality assessment
2. Potential bugs
3. Performance improvements
4. Best practices recommendations
2. Structured Output
Analyze the following text and provide output in JSON format:
Text: "Apple Inc. announced record profits of $100B in Q4 2023"
Output format:
{
"company": "company name",
"event": "event type",
"amount": "monetary value",
"period": "time period"
}
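Model replies do not always follow the requested format, so it helps to parse and validate them before use. The sketch below assumes the four keys from the template above; `sample_reply` is written by hand for illustration and is not real model output.

```python
import json

REQUIRED_KEYS = {"company", "event", "amount", "period"}

def parse_structured_output(raw: str) -> dict:
    """Parse a model reply as JSON and check that the requested keys are present."""
    # Models sometimes wrap the JSON in extra prose; keep the outermost {...} span
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data

# Hypothetical reply for the Apple example above
sample_reply = (
    'Here is the result: {"company": "Apple Inc.", "event": "earnings announcement", '
    '"amount": "$100B", "period": "Q4 2023"}'
)
parsed = parse_structured_output(sample_reply)
print(parsed["company"], parsed["period"])
```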
3. Iterative Refinement
Initial prompt: "Write a blog post about AI"
Refinement 1: "Make it more technical"
Refinement 2: "Add code examples"
Refinement 3: "Focus on practical applications"
Fine-Tuning LLMs
When to Fine-Tune
Use Cases:
- Domain-specific knowledge
- Consistent style/tone
- Specialized tasks
- Better performance
- Cost reduction
Alternatives:
- Prompt engineering
- RAG (Retrieval Augmented Generation)
- Few-shot learning
- In-context learning
Fine-Tuning Process
1. Data Preparation
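# Prepare training data in the chat format used for gpt-3.5-turbo fine-tuning
# (a real job needs more examples than the two shown here)
training_data = [
    {
        "messages": [
            {"role": "user", "content": "Classify: I love this product!"},
            {"role": "assistant", "content": "Positive"}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Classify: Terrible experience"},
            {"role": "assistant", "content": "Negative"}
        ]
    }
]
# Write one JSON object per line (JSONL); requires `pip install jsonlines`
import jsonlines
with jsonlines.open('training.jsonl', 'w') as writer:
    for item in training_data:
        writer.write(item)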
2. Fine-Tuning
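# Uses the openai>=1.0 client interface
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload training file
with open("training.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo"
)

# Monitor progress
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status)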
3. Using Fine-Tuned Model
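from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    # Placeholder name; use the model name reported by the finished fine-tuning job
    model="ft:gpt-3.5-turbo:org:model:id",
    messages=[
        {"role": "user", "content": "Classify: Great product!"}
    ]
)
print(response.choices[0].message.content)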
RAG (Retrieval Augmented Generation)
Architecture
User Query
↓
Retrieve Relevant Documents (Vector DB)
↓
Combine Query + Documents
↓
LLM Generation
↓
Response
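The flow above can be tried end to end without any external service. The sketch below is a toy: bag-of-words counts stand in for learned embeddings, a Python list stands in for the vector database, and the final LLM call is left out so only retrieval and prompt assembly are shown. Function names and sample documents are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Rank documents by similarity to the query and keep the top k
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    # Combine query + documents into the text sent to the LLM
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

documents = [
    "RAG combines document retrieval with LLM generation.",
    "Transformers use self-attention to process sequences.",
    "Vector databases store embeddings for similarity search.",
]
query = "How does RAG use retrieval?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```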
Implementation
# Import paths below match older LangChain releases; newer releases moved these
# classes into separate packages such as langchain-openai and langchain-community
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
# `docs` is assumed to be a list of LangChain Document objects, loaded and split beforehand
# 1. Create embeddings
embeddings = OpenAIEmbeddings()
# 2. Create vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)
# 3. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# 4. Query
result = qa_chain({"query": "What is RAG?"})
print(result['result'])
Benefits
- ✅ Up-to-date information
- ✅ Domain-specific knowledge
- ✅ Reduced hallucinations
- ✅ Source attribution
- ✅ Cost-effective
LLM Applications
1. Chatbots
from openai import OpenAI
client = OpenAI()
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"}
]
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)
print(response.choices[0].message.content)
2. Code Generation
prompt = """
Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Sorts in descending order
4. Returns the result
Include docstring and type hints.
"""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
3. Text Summarization
prompt = f"""
Summarize the following text in 3 bullet points:
{long_text}
Summary:
"""
4. Sentiment Analysis
prompt = f"""
Analyze the sentiment of the following review:
Review: {review_text}
Provide:
1. Overall sentiment (Positive/Negative/Neutral)
2. Confidence score (0-1)
3. Key phrases
"""
5. Data Extraction
prompt = f"""
Extract structured information from this text:
Text: {text}
Extract:
- Names
- Dates
- Locations
- Organizations
Format as JSON.
"""
Best Practices
1. Cost Optimization
# Use appropriate model
- GPT-3.5: Simple tasks, high volume
- GPT-4: Complex reasoning, accuracy critical
# Optimize token usage
- Clear, concise prompts
- Limit output length
- Cache responses
- Batch requests
# Monitor usage
import tiktoken

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))
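Caching responses, one of the token-saving tips above, can be as simple as a dictionary keyed by model and prompt. This is a minimal in-memory sketch; `cached_call` and `fake_llm` are hypothetical names, and the stub stands in for a real API call so the example runs offline.

```python
import hashlib

_cache = {}

def cached_call(prompt, model, call_fn):
    """Return a cached response when the same model and prompt were seen before."""
    key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(prompt)   # only pay for the first call
    return _cache[key]

# Stub standing in for a real LLM call; it records how often it is invoked
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return prompt.upper()

first = cached_call("hello", "gpt-4", fake_llm)
second = cached_call("hello", "gpt-4", fake_llm)
print(first, second, len(calls))
```

Caching like this only makes sense for deterministic use cases (for example temperature 0), and a production setup would add expiry and a shared store.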
2. Error Handling
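from openai import OpenAI, APIConnectionError, InternalServerError, RateLimitError
import time

client = OpenAI()

def call_llm_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        # Retry only transient failures; errors such as a bad API key should surface at once
        except (RateLimitError, APIConnectionError, InternalServerError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...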
3. Safety and Moderation
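# Use moderation API (assumes `client = OpenAI()` as in the earlier examples)
def check_input(user_input):
    moderation = client.moderations.create(input=user_input)
    if moderation.results[0].flagged:
        return "Content violates policy"
    return None  # input passed moderation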
# Add safety instructions
system_prompt = """
You are a helpful assistant. Follow these rules:
1. Don't provide harmful information
2. Respect privacy
3. Be unbiased
4. Admit when unsure
"""
4. Evaluation
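# Test prompts systematically
# `llm.generate` is a placeholder for your model call; exact match is the simplest metric
test_cases = [
    {"input": "...", "expected": "..."},
    {"input": "...", "expected": "..."}
]
correct = 0
for test in test_cases:
    result = llm.generate(test["input"])
    correct += int(result.strip() == test["expected"])
accuracy = correct / len(test_cases)
print(f"Accuracy: {accuracy:.0%}")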
Future Trends
1. Multimodal Models
- Text + Images + Audio + Video
- Unified understanding
- Cross-modal generation
2. Smaller, Efficient Models
- Distillation
- Quantization
- Edge deployment
3. Specialized Models
- Domain-specific LLMs
- Task-specific optimization
- Better performance
4. Better Reasoning
- Chain-of-thought
- Tool use
- Multi-step planning
5. Reduced Hallucinations
- Fact-checking
- Source attribution
- Confidence scores
Conclusion
Large Language Models are transforming how we interact with AI and build applications. Understanding their capabilities and limitations is crucial for effective use.
Key Takeaways:
- LLMs are built on the transformer architecture
- Prompt engineering is usually the first and cheapest lever
- Fine-tuning adapts a model to specific tasks
- RAG supplies up-to-date, domain-specific information
- Cost and safety need attention from the start
Next Steps:
- Experiment with different LLMs
- Practice prompt engineering
- Build a RAG application
- Learn fine-tuning
- Stay updated with latest models
Happy building! 🤖