Build a PDF Knowledge Assistant with Spring AI
A detailed step-by-step guide to build a PDF knowledge assistant using Spring Boot, Spring AI, PDF parsing, TokenTextSplitter, PGVector, embeddings, and RAG.
A PDF knowledge assistant is an AI application that can answer questions from uploaded PDF files.
Instead of asking the model to answer from general knowledge, we first extract text from PDFs, split the text into chunks, store those chunks in a vector database, and retrieve the most relevant chunks whenever the user asks a question.
This pattern is called RAG: Retrieval-Augmented Generation.
In this guide, we will build a Spring Boot application that:
- Uploads PDF files.
- Reads PDF text using Spring AI's PDF document reader.
- Splits extracted text into smaller chunks.
- Creates embeddings for each chunk.
- Stores chunks and embeddings in PostgreSQL with PGVector.
- Lets users ask questions about uploaded PDFs.
- Returns an answer with source snippets.
Final Application APIs
| API | Method | Purpose |
|---|---|---|
| /api/pdf/health | GET | Check service health |
| /api/pdf/upload | POST | Upload and ingest a PDF |
| /api/pdf/search | POST | Search relevant PDF chunks |
| /api/pdf/ask | POST | Ask a question from uploaded PDFs |
| /api/pdf/ask-manual | POST | Ask using manual RAG context for learning |
How the PDF Assistant Works
flowchart TD
A["Upload PDF"] --> B["Save temporarily"]
B --> C["PagePdfDocumentReader"]
C --> D["PDF pages become Documents"]
D --> E["TokenTextSplitter"]
E --> F["Smaller text chunks"]
F --> G["Embedding model"]
G --> H["PGVector VectorStore"]
Q["User question"] --> I["Similarity search"]
H --> I
I --> J["Relevant PDF chunks"]
J --> K["ChatClient prompt"]
K --> L["Grounded answer"]
The important idea:
The LLM does not read your whole PDF every time. It only receives the most relevant chunks retrieved from PGVector.
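To make "most relevant chunks" concrete, here is a minimal sketch of top-K retrieval by cosine similarity in plain Java. It is a toy under stated assumptions: the three-dimensional vectors are hand-written stand-ins for real embeddings, and the names TopKRetrievalSketch, Chunk, and Scored are hypothetical. PGVector does this work inside PostgreSQL with an index, so this only illustrates the idea, not its implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy illustration of top-K retrieval; not how PGVector works internally. */
final class TopKRetrievalSketch {

    /** A stored chunk with a hypothetical, hand-written embedding. */
    record Chunk(String text, double[] embedding) {}

    /** A chunk paired with its similarity to the query. */
    record Scored(Chunk chunk, double score) {}

    /** Cosine similarity: dot(a, b) / (|a| * |b|). */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Keeps chunks scoring at or above the threshold, best first, at most topK of them. */
    static List<Scored> retrieve(List<Chunk> store, double[] query, int topK, double threshold) {
        List<Scored> scored = new ArrayList<>();
        for (Chunk chunk : store) {
            double score = cosine(chunk.embedding(), query);
            if (score >= threshold) {
                scored.add(new Scored(chunk, score));
            }
        }
        scored.sort(Comparator.comparingDouble(Scored::score).reversed());
        return scored.subList(0, Math.min(topK, scored.size()));
    }

    public static void main(String[] args) {
        List<Chunk> store = List.of(
                new Chunk("ChatClient is a fluent API", new double[] {0.9, 0.1, 0.0}),
                new Chunk("PGVector stores embeddings", new double[] {0.1, 0.9, 0.0}),
                new Chunk("Vacation policy", new double[] {0.0, 0.1, 0.9}));
        double[] query = {1.0, 0.2, 0.0}; // pretend embedding of "What is ChatClient?"
        for (Scored s : retrieve(store, query, 2, 0.5)) {
            System.out.printf("%.2f  %s%n", s.score(), s.chunk().text());
        }
    }
}
```

With these toy vectors only the first chunk clears the 0.5 threshold, so the prompt would contain one chunk even though topK is 2. The topK and similarityThreshold request fields used later in this guide play the same two roles.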
Tools and Frameworks
| Tool | Recommended Version | Purpose |
|---|---|---|
| Java | 21 or later | Application runtime |
| Spring Boot | 4.0.x | REST API framework |
| Spring AI | 2.0.0 | PDF readers, embeddings, VectorStore, ChatClient |
| PostgreSQL | 16 or later | Database |
| PGVector | Current Docker image | Vector search extension |
| OpenAI API key | Required in this guide | Chat and embedding model |
| Docker | Current version | Run PostgreSQL + PGVector |
| Maven | 3.9+ | Build tool |
| curl or Postman | Any current version | API testing |
Spring AI 2.0.x works with Spring Boot 4.0.x and 4.1.x. If your project uses Spring Boot 3.x, use the matching Spring AI 1.x dependency line.
Project Structure
Create this structure:
spring-ai-pdf-knowledge-assistant/
├── docker-compose.yml
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/
        │       └── codewithvenu/
        │           └── pdfassistant/
        │               ├── PdfKnowledgeAssistantApplication.java
        │               ├── controller/
        │               │   └── PdfKnowledgeController.java
        │               ├── dto/
        │               │   ├── AskPdfRequest.java
        │               │   ├── AskPdfResponse.java
        │               │   ├── PdfSearchRequest.java
        │               │   ├── PdfSourceDto.java
        │               │   └── UploadPdfResponse.java
        │               ├── exception/
        │               │   └── GlobalExceptionHandler.java
        │               └── service/
        │                   └── PdfKnowledgeService.java
        └── resources/
            └── application.yml
Step 1: Create pom.xml
File: pom.xml
Copy this complete file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>4.0.0</version>
<relativePath/>
</parent>
<groupId>com.codewithvenu</groupId>
<artifactId>spring-ai-pdf-knowledge-assistant</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>spring-ai-pdf-knowledge-assistant</name>
<description>PDF Knowledge Assistant with Spring AI</description>
<properties>
<java.version>21</java.version>
<spring-ai.version>2.0.0</spring-ai.version>
</properties>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>${spring-ai.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-validation</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-vector-store-advisor</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>
Dependency explanation:
| Dependency | Purpose |
|---|---|
| spring-ai-pdf-document-reader | Reads PDF pages with Apache PDFBox |
| spring-ai-starter-vector-store-pgvector | Stores PDF chunks and embeddings in PostgreSQL |
| spring-ai-starter-model-openai | Provides chat and embedding models |
| spring-ai-vector-store-advisor | Provides QuestionAnswerAdvisor for RAG |
| spring-boot-starter-jdbc | Connects to PostgreSQL |
Step 2: Start PGVector with Docker
File: docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg16
    container_name: pdf-assistant-pgvector
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: pdf_assistant
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    volumes:
      - pdf-assistant-data:/var/lib/postgresql/data
volumes:
  pdf-assistant-data:
Start it:
docker compose up -d
Check it:
docker ps
Expected container:
pdf-assistant-pgvector
Step 3: Configure Spring Boot
File: src/main/resources/application.yml
server:
  port: 8080
spring:
  application:
    name: spring-ai-pdf-knowledge-assistant
  servlet:
    multipart:
      max-file-size: 25MB
      max-request-size: 25MB
  datasource:
    url: jdbc:postgresql://localhost:5432/pdf_assistant
    username: postgres
    password: postgres
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4.1-mini
          temperature: 0.2
      embedding:
        options:
          model: text-embedding-3-small
    vectorstore:
      pgvector:
        initialize-schema: true
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536
        max-document-batch-size: 1000
Set your OpenAI API key:
export OPENAI_API_KEY="your-openai-api-key-here"
On Windows PowerShell:
$env:OPENAI_API_KEY="your-openai-api-key-here"
Important settings:
- multipart.max-file-size allows PDF upload.
- initialize-schema: true tells Spring AI to create the PGVector table.
- dimensions: 1536 matches text-embedding-3-small.
- If you change the embedding model, verify the vector dimensions.
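One way to catch a model and dimension mismatch early is a small fail-fast check. The sketch below is plain Java with hypothetical names (EmbeddingDimensionCheck, verify) and is not part of Spring AI. The table lists the default output sizes of three OpenAI embedding models. Extend it for whatever model you actually configure.

```java
import java.util.Map;

/** Hypothetical startup check: does the configured vector size match the embedding model? */
final class EmbeddingDimensionCheck {

    // Default output sizes of these OpenAI embedding models.
    static final Map<String, Integer> KNOWN_DIMENSIONS = Map.of(
            "text-embedding-3-small", 1536,
            "text-embedding-3-large", 3072,
            "text-embedding-ada-002", 1536);

    /** Throws if the configured dimensions do not match the model's default output size. */
    static void verify(String model, int configuredDimensions) {
        Integer expected = KNOWN_DIMENSIONS.get(model);
        if (expected == null) {
            throw new IllegalArgumentException("Unknown embedding model: " + model);
        }
        if (expected != configuredDimensions) {
            throw new IllegalStateException(
                    model + " produces " + expected + " dimensions, but "
                            + configuredDimensions + " are configured");
        }
    }

    public static void main(String[] args) {
        verify("text-embedding-3-small", 1536); // matches the application.yml in this guide
        System.out.println("Embedding dimensions are consistent");
    }
}
```

Running a check like this at startup turns a confusing insert-time database error into a clear configuration message.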
Step 4: Main Application Class
File: src/main/java/com/codewithvenu/pdfassistant/PdfKnowledgeAssistantApplication.java
package com.codewithvenu.pdfassistant;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@SpringBootApplication
public class PdfKnowledgeAssistantApplication {
public static void main(String[] args) {
SpringApplication.run(PdfKnowledgeAssistantApplication.class, args);
}
}
Step 5: Create DTOs
UploadPdfResponse
File: src/main/java/com/codewithvenu/pdfassistant/dto/UploadPdfResponse.java
package com.codewithvenu.pdfassistant.dto;
public record UploadPdfResponse(
String fileName,
String documentId,
int pagesRead,
int chunksStored
) {
}
AskPdfRequest
File: src/main/java/com/codewithvenu/pdfassistant/dto/AskPdfRequest.java
package com.codewithvenu.pdfassistant.dto;
import jakarta.validation.constraints.NotBlank;
public record AskPdfRequest(
@NotBlank(message = "question is required")
String question,
String documentId,
Integer topK,
Double similarityThreshold
) {
public int safeTopK() {
return topK == null ? 5 : topK;
}
public double safeSimilarityThreshold() {
return similarityThreshold == null ? 0.70 : similarityThreshold;
}
}
PdfSearchRequest
File: src/main/java/com/codewithvenu/pdfassistant/dto/PdfSearchRequest.java
package com.codewithvenu.pdfassistant.dto;
import jakarta.validation.constraints.NotBlank;
public record PdfSearchRequest(
@NotBlank(message = "query is required")
String query,
String documentId,
Integer topK,
Double similarityThreshold
) {
public int safeTopK() {
return topK == null ? 5 : topK;
}
public double safeSimilarityThreshold() {
return similarityThreshold == null ? 0.70 : similarityThreshold;
}
}
PdfSourceDto
File: src/main/java/com/codewithvenu/pdfassistant/dto/PdfSourceDto.java
package com.codewithvenu.pdfassistant.dto;
public record PdfSourceDto(
String content,
String fileName,
String documentId,
String page,
Double score
) {
}
AskPdfResponse
File: src/main/java/com/codewithvenu/pdfassistant/dto/AskPdfResponse.java
package com.codewithvenu.pdfassistant.dto;
import java.util.List;
public record AskPdfResponse(
String answer,
List<PdfSourceDto> sources
) {
}
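The safeTopK() and safeSimilarityThreshold() helpers above only replace null values. A caller can still send topK: 100000 or a threshold outside the 0 to 1 range. A possible hardening step is to clamp both values, as in this sketch. The class name RequestLimits and the upper bound of 20 are assumptions chosen for illustration, not values from this guide.

```java
/** Hypothetical helper that bounds user-supplied retrieval parameters. */
final class RequestLimits {

    static final int DEFAULT_TOP_K = 5;            // same default as the DTOs above
    static final int MAX_TOP_K = 20;               // assumed upper bound, tune for your prompts
    static final double DEFAULT_THRESHOLD = 0.70;  // same default as the DTOs above

    /** Returns the default for null, otherwise a value forced into [1, MAX_TOP_K]. */
    static int clampTopK(Integer topK) {
        if (topK == null) {
            return DEFAULT_TOP_K;
        }
        return Math.max(1, Math.min(MAX_TOP_K, topK));
    }

    /** Returns the default for null or NaN, otherwise a value forced into [0.0, 1.0]. */
    static double clampThreshold(Double threshold) {
        if (threshold == null || threshold.isNaN()) {
            return DEFAULT_THRESHOLD;
        }
        return Math.max(0.0, Math.min(1.0, threshold));
    }

    public static void main(String[] args) {
        System.out.println(clampTopK(100000));    // bounded by MAX_TOP_K
        System.out.println(clampThreshold(-0.3)); // bounded below by 0.0
    }
}
```

Bounding topK also bounds how much retrieved text can end up in the prompt, which keeps token usage predictable.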
Step 6: Build the PDF Knowledge Service
This service contains the main logic:
- Validate PDF upload.
- Save file temporarily.
- Read pages with PagePdfDocumentReader.
- Add metadata.
- Split pages with TokenTextSplitter.
- Store chunks in PGVector.
- Search chunks.
- Ask the model using RAG.
File: src/main/java/com/codewithvenu/pdfassistant/service/PdfKnowledgeService.java
package com.codewithvenu.pdfassistant.service;
import com.codewithvenu.pdfassistant.dto.AskPdfRequest;
import com.codewithvenu.pdfassistant.dto.AskPdfResponse;
import com.codewithvenu.pdfassistant.dto.PdfSearchRequest;
import com.codewithvenu.pdfassistant.dto.PdfSourceDto;
import com.codewithvenu.pdfassistant.dto.UploadPdfResponse;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;
@Service
public class PdfKnowledgeService {
private final VectorStore vectorStore;
private final ChatClient chatClient;
public PdfKnowledgeService(VectorStore vectorStore, ChatClient.Builder chatClientBuilder) {
this.vectorStore = vectorStore;
this.chatClient = chatClientBuilder
.defaultSystem("""
You are a PDF knowledge assistant.
Answer only from the retrieved PDF context.
If the answer is not present in the PDF context, say:
I do not know from the uploaded PDF.
Keep answers clear, accurate, and beginner-friendly.
""")
.build();
}
public UploadPdfResponse uploadAndIngest(MultipartFile file) {
validatePdf(file);
String documentId = UUID.randomUUID().toString();
String originalFileName = cleanFileName(file.getOriginalFilename());
Path tempFile = null;
try {
tempFile = Files.createTempFile("spring-ai-pdf-", ".pdf");
file.transferTo(tempFile);
List<Document> pages = readPdfPages(tempFile, originalFileName, documentId);
List<Document> chunks = splitPagesIntoChunks(pages);
vectorStore.add(chunks);
return new UploadPdfResponse(
originalFileName,
documentId,
pages.size(),
chunks.size()
);
}
catch (IOException ex) {
throw new IllegalStateException("Failed to process PDF file", ex);
}
finally {
deleteTempFile(tempFile);
}
}
public List<PdfSourceDto> search(PdfSearchRequest request) {
SearchRequest searchRequest = buildSearchRequest(
request.query(),
request.documentId(),
request.safeTopK(),
request.safeSimilarityThreshold()
);
return vectorStore.similaritySearch(searchRequest)
.stream()
.map(this::toSource)
.toList();
}
public AskPdfResponse ask(AskPdfRequest request) {
SearchRequest searchRequest = buildSearchRequest(
request.question(),
request.documentId(),
request.safeTopK(),
request.safeSimilarityThreshold()
);
QuestionAnswerAdvisor advisor = QuestionAnswerAdvisor.builder(vectorStore)
.searchRequest(searchRequest)
.build();
String answer = chatClient
.prompt()
.advisors(advisor)
.user(request.question())
.call()
.content();
List<PdfSourceDto> sources = vectorStore.similaritySearch(searchRequest)
.stream()
.map(this::toSource)
.toList();
return new AskPdfResponse(answer, sources);
}
public AskPdfResponse askManual(AskPdfRequest request) {
SearchRequest searchRequest = buildSearchRequest(
request.question(),
request.documentId(),
request.safeTopK(),
request.safeSimilarityThreshold()
);
List<Document> documents = vectorStore.similaritySearch(searchRequest);
String context = documents.stream()
.map(Document::getText)
.collect(Collectors.joining("\n\n---\n\n"));
String answer = chatClient
.prompt()
.user(user -> user
.text("""
Use the PDF context below to answer the question.
PDF context:
{context}
Question:
{question}
Rules:
- Answer only from the PDF context.
- If the context does not contain the answer, say: I do not know from the uploaded PDF.
- Keep the answer clear.
""")
.param("context", context)
.param("question", request.question()))
.call()
.content();
List<PdfSourceDto> sources = documents.stream()
.map(this::toSource)
.toList();
return new AskPdfResponse(answer, sources);
}
private List<Document> readPdfPages(Path pdfPath, String fileName, String documentId) {
PagePdfDocumentReader reader = new PagePdfDocumentReader(
pdfPath.toUri().toString(),
PdfDocumentReaderConfig.builder()
.withPageTopMargin(0)
.withPageExtractedTextFormatter(
ExtractedTextFormatter.builder()
.withNumberOfTopTextLinesToDelete(0)
.build()
)
.withPagesPerDocument(1)
.build()
);
List<Document> pages = reader.read();
return pages.stream()
.map(page -> new Document(
page.getText(),
enrichMetadata(page.getMetadata(), fileName, documentId)
))
.toList();
}
private List<Document> splitPagesIntoChunks(List<Document> pages) {
TokenTextSplitter splitter = TokenTextSplitter.builder()
.withChunkSize(800)
.withMinChunkSizeChars(350)
.withMinChunkLengthToEmbed(20)
.withMaxNumChunks(10000)
.withKeepSeparator(true)
.build();
return splitter.apply(pages);
}
private SearchRequest buildSearchRequest(String query, String documentId, int topK, double threshold) {
SearchRequest.Builder builder = SearchRequest.builder()
.query(query)
.topK(topK)
.similarityThreshold(threshold);
if (documentId != null && !documentId.isBlank()) {
builder.filterExpression("documentId == '" + escapeFilterValue(documentId) + "'");
}
return builder.build();
}
private Map<String, Object> enrichMetadata(Map<String, Object> existingMetadata, String fileName, String documentId) {
Map<String, Object> metadata = new HashMap<>(existingMetadata);
metadata.put("fileName", fileName);
metadata.put("documentId", documentId);
metadata.put("type", "pdf");
return metadata;
}
private PdfSourceDto toSource(Document document) {
Map<String, Object> metadata = document.getMetadata();
return new PdfSourceDto(
document.getText(),
String.valueOf(metadata.getOrDefault("fileName", "unknown.pdf")),
String.valueOf(metadata.getOrDefault("documentId", "unknown")),
String.valueOf(metadata.getOrDefault("page_number", metadata.getOrDefault("page", "unknown"))),
document.getScore()
);
}
private void validatePdf(MultipartFile file) {
if (file == null || file.isEmpty()) {
throw new IllegalArgumentException("PDF file is required");
}
String fileName = file.getOriginalFilename();
if (fileName == null || !fileName.toLowerCase().endsWith(".pdf")) {
throw new IllegalArgumentException("Only PDF files are allowed");
}
String contentType = file.getContentType();
if (contentType != null && !contentType.equalsIgnoreCase("application/pdf")) {
throw new IllegalArgumentException("Invalid content type. Expected application/pdf");
}
}
private String cleanFileName(String fileName) {
if (fileName == null || fileName.isBlank()) {
return "uploaded.pdf";
}
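Path name = Path.of(fileName).getFileName();
// getFileName() returns null for a root path such as "/", so fall back to a default name.
return name == null ? "uploaded.pdf" : name.toString();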
}
private String escapeFilterValue(String value) {
return value.replace("'", "\\'");
}
private void deleteTempFile(Path tempFile) {
if (tempFile == null) {
return;
}
try {
Files.deleteIfExists(tempFile);
}
catch (IOException ignored) {
// Temporary file cleanup failure should not fail the user request.
}
}
}
Service Flow Explained
sequenceDiagram
participant U as User
participant C as Controller
participant S as Service
participant R as PDF Reader
participant T as Token Splitter
participant V as PGVector
U->>C: Upload PDF
C->>S: uploadAndIngest(file)
S->>R: Read PDF pages
R-->>S: Page documents
S->>T: Split pages into chunks
T-->>S: Smaller chunks
S->>V: vectorStore.add(chunks)
V-->>S: Stored embeddings
S-->>C: documentId + chunk count
C-->>U: Upload response
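On the query side, buildSearchRequest in the service concatenates the caller-supplied documentId into a filter expression string. Every documentId this application issues comes from UUID.randomUUID(), so one defensive option is to reject anything that is not a canonical UUID before it reaches the filter. The sketch below is plain Java with hypothetical names (DocumentIdFilter, filterFor). It shows an additional check under that assumption, not the service's actual code.

```java
import java.util.regex.Pattern;

/** Hypothetical guard: only canonical UUIDs may be placed into the filter expression. */
final class DocumentIdFilter {

    // Canonical 8-4-4-4-12 hex form, which is what UUID.randomUUID().toString() produces.
    private static final Pattern UUID_PATTERN = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    /** Builds the same filter text the service uses, after validating the id. */
    static String filterFor(String documentId) {
        if (documentId == null || !UUID_PATTERN.matcher(documentId).matches()) {
            throw new IllegalArgumentException("documentId must be a UUID");
        }
        // Safe to concatenate: the value contains only hex digits and hyphens.
        return "documentId == '" + documentId + "'";
    }

    public static void main(String[] args) {
        System.out.println(filterFor("d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111"));
    }
}
```

Throwing IllegalArgumentException fits the existing error handling, because GlobalExceptionHandler already maps it to a 400 response.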
Step 7: Build the Controller
File: src/main/java/com/codewithvenu/pdfassistant/controller/PdfKnowledgeController.java
package com.codewithvenu.pdfassistant.controller;
import com.codewithvenu.pdfassistant.dto.AskPdfRequest;
import com.codewithvenu.pdfassistant.dto.AskPdfResponse;
import com.codewithvenu.pdfassistant.dto.PdfSearchRequest;
import com.codewithvenu.pdfassistant.dto.PdfSourceDto;
import com.codewithvenu.pdfassistant.dto.UploadPdfResponse;
import com.codewithvenu.pdfassistant.service.PdfKnowledgeService;
import jakarta.validation.Valid;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;
import java.util.List;
import java.util.Map;
@RestController
@RequestMapping("/api/pdf")
public class PdfKnowledgeController {
private final PdfKnowledgeService pdfKnowledgeService;
public PdfKnowledgeController(PdfKnowledgeService pdfKnowledgeService) {
this.pdfKnowledgeService = pdfKnowledgeService;
}
@GetMapping("/health")
public Map<String, String> health() {
return Map.of("status", "UP", "service", "pdf-knowledge-assistant");
}
@PostMapping(value = "/upload", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
public UploadPdfResponse upload(@RequestParam("file") MultipartFile file) {
return pdfKnowledgeService.uploadAndIngest(file);
}
@PostMapping("/search")
public List<PdfSourceDto> search(@Valid @RequestBody PdfSearchRequest request) {
return pdfKnowledgeService.search(request);
}
@PostMapping("/ask")
public AskPdfResponse ask(@Valid @RequestBody AskPdfRequest request) {
return pdfKnowledgeService.ask(request);
}
@PostMapping("/ask-manual")
public AskPdfResponse askManual(@Valid @RequestBody AskPdfRequest request) {
return pdfKnowledgeService.askManual(request);
}
}
Step 8: Add Error Handling
File: src/main/java/com/codewithvenu/pdfassistant/exception/GlobalExceptionHandler.java
package com.codewithvenu.pdfassistant.exception;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.MethodArgumentNotValidException;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
@RestControllerAdvice
public class GlobalExceptionHandler {
@ExceptionHandler(MethodArgumentNotValidException.class)
public ResponseEntity<Map<String, Object>> handleValidation(MethodArgumentNotValidException ex) {
Map<String, String> fieldErrors = new HashMap<>();
ex.getBindingResult().getFieldErrors().forEach(error ->
fieldErrors.put(error.getField(), error.getDefaultMessage())
);
Map<String, Object> body = new HashMap<>();
body.put("timestamp", Instant.now());
body.put("status", HttpStatus.BAD_REQUEST.value());
body.put("error", "Validation failed");
body.put("fields", fieldErrors);
return ResponseEntity.badRequest().body(body);
}
@ExceptionHandler(IllegalArgumentException.class)
public ResponseEntity<Map<String, Object>> handleBadRequest(IllegalArgumentException ex) {
Map<String, Object> body = new HashMap<>();
body.put("timestamp", Instant.now());
body.put("status", HttpStatus.BAD_REQUEST.value());
body.put("error", "Bad request");
body.put("message", ex.getMessage());
return ResponseEntity.badRequest().body(body);
}
@ExceptionHandler(Exception.class)
public ResponseEntity<Map<String, Object>> handleException(Exception ex) {
Map<String, Object> body = new HashMap<>();
body.put("timestamp", Instant.now());
body.put("status", HttpStatus.INTERNAL_SERVER_ERROR.value());
body.put("error", "PDF assistant request failed");
body.put("message", ex.getMessage());
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(body);
}
}
For production, do not expose raw exception messages. Log detailed errors internally and return safe messages to users.
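A common way to follow that advice is to log the full exception under a generated error ID and return only the ID to the caller. The sketch below uses java.util.logging so it runs without extra dependencies. The class name SafeErrorBody and the errorId field are hypothetical, and a real Spring Boot application would more likely log through SLF4J.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper that builds a response body without leaking exception details. */
final class SafeErrorBody {

    private static final Logger LOG = Logger.getLogger(SafeErrorBody.class.getName());

    /** Logs the exception with a correlation id and returns a body that omits its message. */
    static Map<String, Object> forUnexpected(Exception ex) {
        String errorId = UUID.randomUUID().toString();
        // Full details, including the stack trace, stay in the server log.
        LOG.log(Level.SEVERE, "Unhandled error, errorId=" + errorId, ex);

        Map<String, Object> body = new LinkedHashMap<>();
        body.put("timestamp", Instant.now().toString());
        body.put("status", 500);
        body.put("error", "PDF assistant request failed");
        body.put("errorId", errorId); // lets support staff find the matching log entry
        return body;
    }

    public static void main(String[] args) {
        System.out.println(forUnexpected(new IllegalStateException("connection refused: db-host")));
    }
}
```

The catch-all handler for Exception could return a body like this instead of putting ex.getMessage() in the response.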
Step 9: Run the Application
Start PGVector:
docker compose up -d
Run Spring Boot:
mvn spring-boot:run
Health check:
curl http://localhost:8080/api/pdf/health
Expected output:
{
"service": "pdf-knowledge-assistant",
"status": "UP"
}
Step 10: Upload a PDF
Assume you have a file named spring-ai-guide.pdf.
Upload:
curl -X POST http://localhost:8080/api/pdf/upload \
-F "file=@spring-ai-guide.pdf"
Expected response:
{
"fileName": "spring-ai-guide.pdf",
"documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
"pagesRead": 12,
"chunksStored": 34
}
Save the documentId. You can use it later to ask questions only from this PDF.
Step 11: Search PDF Chunks
Search across all uploaded PDFs:
curl -X POST http://localhost:8080/api/pdf/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is ChatClient in Spring AI?",
"topK": 3,
"similarityThreshold": 0.65
}'
Expected response shape:
[
{
"content": "ChatClient provides a fluent API for communicating with AI chat models...",
"fileName": "spring-ai-guide.pdf",
"documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
"page": "4",
"score": 0.89
}
]
Search only one PDF:
curl -X POST http://localhost:8080/api/pdf/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is ChatClient in Spring AI?",
"documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
"topK": 3,
"similarityThreshold": 0.65
}'
Step 12: Ask Questions from PDFs
Ask across all PDFs:
curl -X POST http://localhost:8080/api/pdf/ask \
-H "Content-Type: application/json" \
-d '{
"question": "Explain Spring AI ChatClient in simple terms.",
"topK": 5,
"similarityThreshold": 0.65
}'
Expected response:
{
"answer": "Spring AI ChatClient is a fluent API for communicating with AI chat models. It helps developers build prompts, send user messages, receive responses, and use features like advisors and streaming.",
"sources": [
{
"content": "ChatClient provides a fluent API for communicating with AI chat models...",
"fileName": "spring-ai-guide.pdf",
"documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
"page": "4",
"score": 0.89
}
]
}
Ask from one specific PDF:
curl -X POST http://localhost:8080/api/pdf/ask \
-H "Content-Type: application/json" \
-d '{
"question": "What are the main steps in a RAG pipeline?",
"documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
"topK": 5,
"similarityThreshold": 0.65
}'
Expected answer style:
{
"answer": "The main RAG pipeline steps are reading source documents, splitting them into chunks, creating embeddings, storing them in a vector database, retrieving relevant chunks for a question, and generating an answer using the retrieved context.",
"sources": [
{
"content": "A typical RAG pipeline includes document loading, text splitting, embedding, vector storage, retrieval, and generation...",
"fileName": "spring-ai-guide.pdf",
"documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
"page": "8",
"score": 0.91
}
]
}
Step 13: Test Manual RAG
The /ask endpoint uses QuestionAnswerAdvisor.
The /ask-manual endpoint shows the RAG process manually:
- Search PGVector.
- Join chunks into context.
- Put context into the prompt.
- Ask the chat model.
curl -X POST http://localhost:8080/api/pdf/ask-manual \
-H "Content-Type: application/json" \
-d '{
"question": "What is PGVector used for in this PDF?",
"topK": 5,
"similarityThreshold": 0.65
}'
This endpoint is useful for learning and debugging because you can see exactly how the context is passed to the model in code.
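The "join chunks into context" step can be tried without a model or database. The sketch below mirrors what askManual does with plain strings, using the same separator as the service. One difference is an assumption added for illustration: each chunk is prefixed with a source label (file name and page), which the service above does not do. The names ManualContextSketch and Chunk are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Plain-Java illustration of assembling a manual RAG prompt. */
final class ManualContextSketch {

    /** Hypothetical stand-in for a retrieved chunk and its metadata. */
    record Chunk(String fileName, String page, String text) {}

    /** Joins chunks with the same separator the service uses, adding a source label to each. */
    static String buildContext(List<Chunk> chunks) {
        return chunks.stream()
                .map(c -> "[" + c.fileName() + ", page " + c.page() + "]\n" + c.text())
                .collect(Collectors.joining("\n\n---\n\n"));
    }

    /** Places the context and the question into a prompt template. */
    static String buildPrompt(List<Chunk> chunks, String question) {
        return """
                Use the PDF context below to answer the question.

                PDF context:
                %s

                Question:
                %s
                """.formatted(buildContext(chunks), question);
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
                new Chunk("spring-ai-guide.pdf", "4", "ChatClient provides a fluent API."),
                new Chunk("spring-ai-guide.pdf", "8", "PGVector stores embeddings in PostgreSQL."));
        System.out.println(buildPrompt(chunks, "What is PGVector used for?"));
    }
}
```

Printing the assembled prompt like this is a quick way to see what the model receives when an answer looks wrong.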
Input and Output Examples
Upload Input
curl -X POST http://localhost:8080/api/pdf/upload \
-F "file=@employee-handbook.pdf"
Upload Output
{
"fileName": "employee-handbook.pdf",
"documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
"pagesRead": 48,
"chunksStored": 121
}
Question Input
{
"question": "How many paid vacation days are available?",
"documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
"topK": 5,
"similarityThreshold": 0.70
}
Question Output
{
"answer": "The uploaded PDF says employees receive 15 paid vacation days per year after completing the probation period.",
"sources": [
{
"content": "Full-time employees receive 15 paid vacation days per calendar year after successful completion of the probation period...",
"fileName": "employee-handbook.pdf",
"documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
"page": "12",
"score": 0.93
}
]
}
Important Concepts
| Concept | Simple Meaning |
|---|---|
| PDF reader | Extracts text from PDF pages |
| Document | Spring AI object containing text and metadata |
| Metadata | Extra information such as file name, page, document ID |
| Chunking | Splitting large text into smaller pieces |
| Embedding | Numeric vector that represents the meaning of text |
| VectorStore | Stores embeddings and searches similar text |
| PGVector | PostgreSQL extension for vector search |
| RAG | Retrieve context first, then generate answer |
| Source citation | Showing which PDF chunk was used |
Why Chunking Matters
Bad chunking creates bad answers.
If chunks are too large:
- Retrieval may return noisy context.
- Token usage increases.
- The answer may be less precise.
If chunks are too small:
- Context may be incomplete.
- Important meaning can be split across chunks.
- The answer may miss details.
Good starting values:
| Setting | Recommended Start |
|---|---|
| Chunk size | 800 tokens |
| Minimum chunk size | 350 characters |
| Top K | 3 to 5 |
| Similarity threshold | 0.65 to 0.75 |
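To get a feel for the size trade-off, it helps to chunk the same text at different sizes and compare the results. The sketch below is a deliberately simple word-window chunker with overlap. It is not TokenTextSplitter's algorithm, which counts model tokens rather than words, and the names WordChunkerSketch and chunk are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy word-based chunker with overlap; a stand-in for experimenting with chunk sizes. */
final class WordChunkerSketch {

    /**
     * Splits text into windows of chunkWords words. Consecutive windows share
     * overlapWords words so a sentence cut at a boundary appears in both chunks.
     */
    static List<String> chunk(String text, int chunkWords, int overlapWords) {
        if (chunkWords <= 0 || overlapWords < 0 || overlapWords >= chunkWords) {
            throw new IllegalArgumentException("need chunkWords > overlapWords >= 0");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkWords - overlapWords;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Full-time employees receive 15 paid vacation days per calendar year "
                + "after successful completion of the probation period";
        System.out.println(chunk(text, 6, 2));  // smaller chunks, more of them
        System.out.println(chunk(text, 12, 2)); // larger chunks, fewer of them
    }
}
```

With the small window, "15 paid vacation days" and "after ... the probation period" land in different chunks, which is the "context may be incomplete" problem described above.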
Common Problems and Fixes
| Problem | Cause | Fix |
|---|---|---|
| Upload fails | File is not PDF | Check file extension and content type |
| No answers found | Similarity threshold too high | Lower threshold to 0.60 or 0.65 |
| Irrelevant answers | Threshold too low | Increase threshold to 0.75 |
| Wrong PDF used | No document filter | Pass documentId in ask request |
| Slow upload | Large PDF or many chunks | Limit file size and tune chunking |
| Table missing | Schema not initialized | Set initialize-schema: true |
| Vector dimension error | Embedding model changed | Recreate vector table with correct dimension |
| Poor PDF text | Scanned PDF image | Use OCR before ingestion |
Scanned PDFs and OCR
PagePdfDocumentReader extracts text from PDFs that already contain text.
If your PDF is scanned as images, the reader may extract little or no text. In that case, add OCR before Spring AI ingestion.
Common OCR options:
- Tesseract OCR.
- AWS Textract.
- Azure Document Intelligence.
- Google Document AI.
OCR flow:
flowchart LR
PDF["Scanned PDF"] --> OCR["OCR Service"]
OCR --> Text["Extracted Text"]
Text --> Chunk["TokenTextSplitter"]
Chunk --> Vector["PGVector"]
Vector --> RAG["RAG Answer"]
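Before sending every upload through OCR, a cheap heuristic can flag PDFs that are probably scanned: if the reader returns very little text per page, route the file to OCR. The sketch below works on already-extracted page texts, so it needs no PDF library. The name ScannedPdfHeuristic and the 50-character threshold in the example are assumptions to tune on your own documents.

```java
import java.util.List;

/** Hypothetical heuristic: very little extracted text per page suggests a scanned PDF. */
final class ScannedPdfHeuristic {

    /**
     * Returns true when the average number of non-whitespace characters per page
     * is below minCharsPerPage, or when there are no pages at all.
     */
    static boolean looksScanned(List<String> pageTexts, int minCharsPerPage) {
        if (pageTexts.isEmpty()) {
            return true;
        }
        long nonWhitespace = pageTexts.stream()
                .mapToLong(text -> text.chars().filter(ch -> !Character.isWhitespace(ch)).count())
                .sum();
        return (double) nonWhitespace / pageTexts.size() < minCharsPerPage;
    }

    public static void main(String[] args) {
        List<String> scannedLike = List.of("", " \n ", "3");
        List<String> textLike = List.of("Full-time employees receive 15 paid vacation days "
                + "per calendar year after successful completion of the probation period.");
        System.out.println(looksScanned(scannedLike, 50));
        System.out.println(looksScanned(textLike, 50));
    }
}
```

In the service from Step 6, a check like this could run on the page texts right after the reader returns, before splitting and embedding.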
Production Checklist
Before production, add:
- Authentication and authorization.
- Tenant-based documentId or tenantId filtering.
- Virus scanning for uploaded PDFs.
- File size and page count limits.
- OCR support for scanned PDFs.
- Persistent document metadata table.
- Delete and re-index functionality.
- Duplicate file detection.
- Source citations in final UI.
- Logging for ingestion, retrieval, token usage, and model latency.
- Evaluation questions for each PDF collection.
- Prompt injection checks for PDF content.
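As one example from this checklist, duplicate file detection can start with a content hash: two uploads with the same SHA-256 digest are byte-for-byte identical. The sketch below keeps hashes in an in-memory map purely for illustration. In the application this would live in the persistent metadata table mentioned above. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;
import java.util.Optional;

/** Hypothetical duplicate detector keyed by the SHA-256 hash of the uploaded bytes. */
final class DuplicateUploadDetector {

    // In-memory only for illustration; a database table would replace this map.
    private final Map<String, String> documentIdByHash = new HashMap<>();

    /** Returns the lowercase hex SHA-256 digest of the content. */
    static String sha256Hex(byte[] content) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }

    /** Returns the documentId of an earlier identical upload, or empty after registering this one. */
    Optional<String> registerOrFindExisting(byte[] content, String newDocumentId) {
        String existing = documentIdByHash.putIfAbsent(sha256Hex(content), newDocumentId);
        return Optional.ofNullable(existing);
    }

    public static void main(String[] args) {
        DuplicateUploadDetector detector = new DuplicateUploadDetector();
        byte[] pdfBytes = "%PDF-1.7 example".getBytes(StandardCharsets.UTF_8);
        System.out.println(detector.registerOrFindExisting(pdfBytes, "doc-1").isPresent());
        System.out.println(detector.registerOrFindExisting(pdfBytes, "doc-2").orElse("none"));
    }
}
```

A hash only catches exact copies. A re-exported or re-scanned version of the same document produces different bytes and needs a different strategy.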
Complete Test Script
curl http://localhost:8080/api/pdf/health
curl -X POST http://localhost:8080/api/pdf/upload \
-F "file=@spring-ai-guide.pdf"
curl -X POST http://localhost:8080/api/pdf/search \
-H "Content-Type: application/json" \
-d '{"query":"What is Spring AI ChatClient?","topK":3,"similarityThreshold":0.65}'
curl -X POST http://localhost:8080/api/pdf/ask \
-H "Content-Type: application/json" \
-d '{"question":"What is Spring AI ChatClient?","topK":5,"similarityThreshold":0.65}'
curl -X POST http://localhost:8080/api/pdf/ask-manual \
-H "Content-Type: application/json" \
-d '{"question":"What is RAG?","topK":5,"similarityThreshold":0.65}'
Summary
You built a PDF knowledge assistant with Spring AI.
The key pipeline is:
- Upload PDF.
- Read PDF pages with PagePdfDocumentReader.
- Convert pages into Spring AI Document objects.
- Split text using TokenTextSplitter.
- Store chunks in PGVector using VectorStore.
- Retrieve relevant chunks for a question.
- Use ChatClient and RAG to generate a grounded answer.
- Return sources so the user can trust the answer.
This pattern is the foundation for:
- HR policy assistants.
- Legal PDF assistants.
- Insurance document Q&A.
- Banking policy Q&A.
- Technical manual assistants.
- PDF-based customer support bots.