Build a PDF Knowledge Assistant with Spring AI

A detailed step-by-step guide to building a PDF knowledge assistant with Spring Boot, Spring AI, PDF parsing, TokenTextSplitter, PGVector, embeddings, and RAG.

A PDF knowledge assistant is an AI application that can answer questions from uploaded PDF files.

Instead of asking the model to answer from general knowledge, we first extract text from PDFs, split the text into chunks, store those chunks in a vector database, and retrieve the most relevant chunks whenever the user asks a question.

This pattern is called RAG: Retrieval-Augmented Generation.

In this guide, we will build a Spring Boot application that:

Uploads PDF files.
Reads PDF text using Spring AI's PDF document reader.
Splits extracted text into smaller chunks.
Creates embeddings for each chunk.
Stores chunks and embeddings in PostgreSQL with PGVector.
Lets users ask questions about uploaded PDFs.
Returns an answer with source snippets.

Final Application APIs

API	Method	Purpose
`/api/pdf/health`	`GET`	Check service health
`/api/pdf/upload`	`POST`	Upload and ingest a PDF
`/api/pdf/search`	`POST`	Search relevant PDF chunks
`/api/pdf/ask`	`POST`	Ask a question from uploaded PDFs
`/api/pdf/ask-manual`	`POST`	Ask using manual RAG context for learning

How the PDF Assistant Works

flowchart TD
    A["Upload PDF"] --> B["Save temporarily"]
    B --> C["PagePdfDocumentReader"]
    C --> D["PDF pages become Documents"]
    D --> E["TokenTextSplitter"]
    E --> F["Smaller text chunks"]
    F --> G["Embedding model"]
    G --> H["PGVector VectorStore"]

    Q["User question"] --> I["Similarity search"]
    H --> I
    I --> J["Relevant PDF chunks"]
    J --> K["ChatClient prompt"]
    K --> L["Grounded answer"]

The important idea:

The LLM does not read your whole PDF every time. It only receives the most relevant chunks retrieved from PGVector.
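To make the retrieval step concrete, here is a minimal in-memory sketch of what a similarity search does conceptually: score every chunk against the query vector, drop scores below a threshold, and keep the best top K. It is not how PGVector is implemented (PGVector uses an index inside PostgreSQL), and `ToyRetriever`, `Chunk`, `Scored`, and the two-dimensional hand-written vectors are all hypothetical names and data chosen for illustration.

```java
import java.util.Comparator;
import java.util.List;

// Toy in-memory retrieval: a conceptual stand-in for a vector store, not PGVector itself.
final class ToyRetriever {

    // A stored chunk with a hand-written (hypothetical) embedding vector.
    record Chunk(String text, double[] embedding) {}

    // A retrieved chunk together with its similarity score.
    record Scored(String text, double score) {}

    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Keep chunks scoring at or above the threshold, best first, at most topK results.
    static List<Scored> search(List<Chunk> chunks, double[] query, int topK, double threshold) {
        return chunks.stream()
            .map(c -> new Scored(c.text(), cosine(c.embedding(), query)))
            .filter(s -> s.score() >= threshold)
            .sorted(Comparator.comparingDouble(Scored::score).reversed())
            .limit(topK)
            .toList();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
            new Chunk("Vacation policy", new double[] {1.0, 0.0}),
            new Chunk("Expense policy", new double[] {0.0, 1.0}),
            new Chunk("Leave and holidays", new double[] {0.9, 0.1}));

        // Only chunks pointing in roughly the same direction as the query survive the threshold.
        search(chunks, new double[] {1.0, 0.0}, 2, 0.70)
            .forEach(s -> System.out.println(s.text() + " -> " + s.score()));
    }
}
```

In the real application the embedding model produces the vectors and the VectorStore performs the search, but the roles of topK and similarityThreshold are the same as in this sketch.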

Tools and Frameworks

Tool	Recommended Version	Purpose
Java	21 or later	Application runtime
Spring Boot	4.0.x	REST API framework
Spring AI	2.0.0	PDF readers, embeddings, VectorStore, ChatClient
PostgreSQL	16 or later	Database
PGVector	Current Docker image	Vector search extension
OpenAI API key	Required in this guide	Chat and embedding model
Docker	Current version	Run PostgreSQL + PGVector
Maven	3.9+	Build tool
curl or Postman	Any current version	API testing

Spring AI 2.0.x works with Spring Boot 4.0.x and 4.1.x. If your project uses Spring Boot 3.x, use the matching Spring AI 1.x dependency line.

Project Structure

Create this structure:

spring-ai-pdf-knowledge-assistant/
├── docker-compose.yml
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/
        │       └── codewithvenu/
        │           └── pdfassistant/
        │               ├── PdfKnowledgeAssistantApplication.java
        │               ├── controller/
        │               │   └── PdfKnowledgeController.java
        │               ├── dto/
        │               │   ├── AskPdfRequest.java
        │               │   ├── AskPdfResponse.java
        │               │   ├── PdfSearchRequest.java
        │               │   ├── PdfSourceDto.java
        │               │   └── UploadPdfResponse.java
        │               ├── exception/
        │               │   └── GlobalExceptionHandler.java
        │               └── service/
        │                   └── PdfKnowledgeService.java
        └── resources/
            └── application.yml

Step 1: Create `pom.xml`

File: pom.xml

Copy this complete file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>4.0.0</version>
        <relativePath/>
    </parent>

    <groupId>com.codewithvenu</groupId>
    <artifactId>spring-ai-pdf-knowledge-assistant</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>spring-ai-pdf-knowledge-assistant</name>
    <description>PDF Knowledge Assistant with Spring AI</description>

    <properties>
        <java.version>21</java.version>
        <spring-ai.version>2.0.0</spring-ai.version>
    </properties>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.ai</groupId>
                <artifactId>spring-ai-bom</artifactId>
                <version>${spring-ai.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-validation</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-jdbc</artifactId>
        </dependency>

        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <scope>runtime</scope>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-model-openai</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-vector-store-advisor</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-pdf-document-reader</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

Dependency explanation:

Dependency	Purpose
`spring-ai-pdf-document-reader`	Reads PDF pages with Apache PDFBox
`spring-ai-starter-vector-store-pgvector`	Stores PDF chunks and embeddings in PostgreSQL
`spring-ai-starter-model-openai`	Provides chat and embedding models
`spring-ai-vector-store-advisor`	Provides `QuestionAnswerAdvisor` for RAG
`spring-boot-starter-jdbc`	Connects to PostgreSQL

Step 2: Start PGVector with Docker

File: docker-compose.yml

services:
  postgres:
    image: pgvector/pgvector:pg16
    container_name: pdf-assistant-pgvector
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: pdf_assistant
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    volumes:
      - pdf-assistant-data:/var/lib/postgresql/data

volumes:
  pdf-assistant-data:

Start it:

docker compose up -d

Check it:

docker ps

Expected container:

pdf-assistant-pgvector

Step 3: Configure Spring Boot

File: src/main/resources/application.yml

server:
  port: 8080

spring:
  application:
    name: spring-ai-pdf-knowledge-assistant

  servlet:
    multipart:
      max-file-size: 25MB
      max-request-size: 25MB

  datasource:
    url: jdbc:postgresql://localhost:5432/pdf_assistant
    username: postgres
    password: postgres

  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4.1-mini
          temperature: 0.2
      embedding:
        options:
          model: text-embedding-3-small

    vectorstore:
      pgvector:
        initialize-schema: true
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536
        max-document-batch-size: 1000

Set your OpenAI API key:

export OPENAI_API_KEY="your-openai-api-key-here"

On Windows PowerShell:

$env:OPENAI_API_KEY="your-openai-api-key-here"

Important settings:

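spring.servlet.multipart.max-file-size and max-request-size raise the upload limit to 25 MB. Spring Boot's defaults are 1 MB per file and 10 MB per request, which would reject larger PDFs.
initialize-schema: true tells Spring AI to create the PGVector table (vector_store by default) at startup.
dimensions: 1536 matches the default output size of text-embedding-3-small.
If you change the embedding model, verify the vector dimensions and recreate the table if they differ.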
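A dimension mismatch usually surfaces only when the first insert fails. As a rough illustration of the rule, the sketch below checks an embedding's length against the configured value before it is stored. `EmbeddingDimensions` and `requireMatchingDimensions` are hypothetical helpers, not part of Spring AI, and the constant is assumed to mirror the `dimensions` value from application.yml.

```java
// Hypothetical guard that mirrors the "dimensions" setting from application.yml.
final class EmbeddingDimensions {

    // Assumed to match spring.ai.vectorstore.pgvector.dimensions (1536 in this guide).
    static final int CONFIGURED_DIMENSIONS = 1536;

    // Throws if the embedding length differs from what the vector column expects.
    static void requireMatchingDimensions(float[] embedding, int expected) {
        if (embedding.length != expected) {
            throw new IllegalStateException(
                "Embedding has " + embedding.length
                    + " dimensions but the vector table expects " + expected);
        }
    }
}
```

For example, text-embedding-3-large returns 3072-dimensional vectors by default, so switching to it without changing `dimensions` and recreating the table would trip a check like this one.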

Step 4: Main Application Class

File: src/main/java/com/codewithvenu/pdfassistant/PdfKnowledgeAssistantApplication.java

package com.codewithvenu.pdfassistant;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class PdfKnowledgeAssistantApplication {

    public static void main(String[] args) {
        SpringApplication.run(PdfKnowledgeAssistantApplication.class, args);
    }
}

Step 5: Create DTOs

UploadPdfResponse

File: src/main/java/com/codewithvenu/pdfassistant/dto/UploadPdfResponse.java

package com.codewithvenu.pdfassistant.dto;

public record UploadPdfResponse(
    String fileName,
    String documentId,
    int pagesRead,
    int chunksStored
) {
}

AskPdfRequest

File: src/main/java/com/codewithvenu/pdfassistant/dto/AskPdfRequest.java

package com.codewithvenu.pdfassistant.dto;

import jakarta.validation.constraints.NotBlank;

public record AskPdfRequest(
    @NotBlank(message = "question is required")
    String question,

    String documentId,

    Integer topK,

    Double similarityThreshold
) {
    public int safeTopK() {
        return topK == null ? 5 : topK;
    }

    public double safeSimilarityThreshold() {
        return similarityThreshold == null ? 0.70 : similarityThreshold;
    }
}

PdfSearchRequest

File: src/main/java/com/codewithvenu/pdfassistant/dto/PdfSearchRequest.java

package com.codewithvenu.pdfassistant.dto;

import jakarta.validation.constraints.NotBlank;

public record PdfSearchRequest(
    @NotBlank(message = "query is required")
    String query,

    String documentId,

    Integer topK,

    Double similarityThreshold
) {
    public int safeTopK() {
        return topK == null ? 5 : topK;
    }

    public double safeSimilarityThreshold() {
        return similarityThreshold == null ? 0.70 : similarityThreshold;
    }
}
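The safeTopK() and safeSimilarityThreshold() methods above only handle null. A caller can still send topK = 0, topK = 5000, or a threshold of 1.7. One possible hardening, sketched below, is to clamp both values into a sensible range. `SearchLimits` and the `MAX_TOP_K` bound of 20 are assumptions made for this example, not values from Spring AI.

```java
// Hypothetical helper that clamps client-supplied search parameters.
final class SearchLimits {

    static final int DEFAULT_TOP_K = 5;
    static final int MAX_TOP_K = 20;               // assumed upper bound, tune for your data
    static final double DEFAULT_THRESHOLD = 0.70;

    // Null falls back to the default; anything else is forced into [1, MAX_TOP_K].
    static int clampTopK(Integer topK) {
        if (topK == null) {
            return DEFAULT_TOP_K;
        }
        return Math.max(1, Math.min(topK, MAX_TOP_K));
    }

    // Null or NaN falls back to the default; anything else is forced into [0.0, 1.0].
    static double clampThreshold(Double threshold) {
        if (threshold == null || threshold.isNaN()) {
            return DEFAULT_THRESHOLD;
        }
        return Math.max(0.0, Math.min(threshold, 1.0));
    }
}
```

The record methods could delegate to these helpers, for example `return SearchLimits.clampTopK(topK);`. Rejecting out-of-range values with a validation error is an equally reasonable choice.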

PdfSourceDto

File: src/main/java/com/codewithvenu/pdfassistant/dto/PdfSourceDto.java

package com.codewithvenu.pdfassistant.dto;

public record PdfSourceDto(
    String content,
    String fileName,
    String documentId,
    String page,
    Double score
) {
}

AskPdfResponse

File: src/main/java/com/codewithvenu/pdfassistant/dto/AskPdfResponse.java

package com.codewithvenu.pdfassistant.dto;

import java.util.List;

public record AskPdfResponse(
    String answer,
    List<PdfSourceDto> sources
) {
}

Step 6: Build the PDF Knowledge Service

This service contains the main logic:

Validate PDF upload.
Save file temporarily.
Read pages with PagePdfDocumentReader.
Add metadata.
Split pages with TokenTextSplitter.
Store chunks in PGVector.
Search chunks.
Ask the model using RAG.

File: src/main/java/com/codewithvenu/pdfassistant/service/PdfKnowledgeService.java

package com.codewithvenu.pdfassistant.service;

import com.codewithvenu.pdfassistant.dto.AskPdfRequest;
import com.codewithvenu.pdfassistant.dto.AskPdfResponse;
import com.codewithvenu.pdfassistant.dto.PdfSearchRequest;
import com.codewithvenu.pdfassistant.dto.PdfSourceDto;
import com.codewithvenu.pdfassistant.dto.UploadPdfResponse;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

@Service
public class PdfKnowledgeService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public PdfKnowledgeService(VectorStore vectorStore, ChatClient.Builder chatClientBuilder) {
        this.vectorStore = vectorStore;
        this.chatClient = chatClientBuilder
            .defaultSystem("""
                You are a PDF knowledge assistant.
                Answer only from the retrieved PDF context.
                If the answer is not present in the PDF context, say:
                I do not know from the uploaded PDF.
                Keep answers clear, accurate, and beginner-friendly.
                """)
            .build();
    }

    public UploadPdfResponse uploadAndIngest(MultipartFile file) {
        validatePdf(file);

        String documentId = UUID.randomUUID().toString();
        String originalFileName = cleanFileName(file.getOriginalFilename());
        Path tempFile = null;

        try {
            tempFile = Files.createTempFile("spring-ai-pdf-", ".pdf");
            file.transferTo(tempFile);

            List<Document> pages = readPdfPages(tempFile, originalFileName, documentId);
            List<Document> chunks = splitPagesIntoChunks(pages);

            vectorStore.add(chunks);

            return new UploadPdfResponse(
                originalFileName,
                documentId,
                pages.size(),
                chunks.size()
            );
        }
        catch (IOException ex) {
            throw new IllegalStateException("Failed to process PDF file", ex);
        }
        finally {
            deleteTempFile(tempFile);
        }
    }

    public List<PdfSourceDto> search(PdfSearchRequest request) {
        SearchRequest searchRequest = buildSearchRequest(
            request.query(),
            request.documentId(),
            request.safeTopK(),
            request.safeSimilarityThreshold()
        );

        return vectorStore.similaritySearch(searchRequest)
            .stream()
            .map(this::toSource)
            .toList();
    }

    public AskPdfResponse ask(AskPdfRequest request) {
        SearchRequest searchRequest = buildSearchRequest(
            request.question(),
            request.documentId(),
            request.safeTopK(),
            request.safeSimilarityThreshold()
        );

        QuestionAnswerAdvisor advisor = QuestionAnswerAdvisor.builder(vectorStore)
            .searchRequest(searchRequest)
            .build();

        String answer = chatClient
            .prompt()
            .advisors(advisor)
            .user(request.question())
            .call()
            .content();

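        // content() returns only the answer text, so the same search is repeated here
        // to build the source list. This adds one extra query-embedding call per question.
        List<PdfSourceDto> sources = vectorStore.similaritySearch(searchRequest)
            .stream()
            .map(this::toSource)
            .toList();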

        return new AskPdfResponse(answer, sources);
    }

    public AskPdfResponse askManual(AskPdfRequest request) {
        SearchRequest searchRequest = buildSearchRequest(
            request.question(),
            request.documentId(),
            request.safeTopK(),
            request.safeSimilarityThreshold()
        );

        List<Document> documents = vectorStore.similaritySearch(searchRequest);

        String context = documents.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n\n---\n\n"));

        String answer = chatClient
            .prompt()
            .user(user -> user
                .text("""
                    Use the PDF context below to answer the question.

                    PDF context:
                    {context}

                    Question:
                    {question}

                    Rules:
                    - Answer only from the PDF context.
                    - If the context does not contain the answer, say: I do not know from the uploaded PDF.
                    - Keep the answer clear.
                    """)
                .param("context", context)
                .param("question", request.question()))
            .call()
            .content();

        List<PdfSourceDto> sources = documents.stream()
            .map(this::toSource)
            .toList();

        return new AskPdfResponse(answer, sources);
    }

    private List<Document> readPdfPages(Path pdfPath, String fileName, String documentId) {
        PagePdfDocumentReader reader = new PagePdfDocumentReader(
            pdfPath.toUri().toString(),
            PdfDocumentReaderConfig.builder()
                .withPageTopMargin(0)
                .withPageExtractedTextFormatter(
                    ExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .build()
                )
                .withPagesPerDocument(1)
                .build()
        );

        List<Document> pages = reader.read();

        return pages.stream()
            .map(page -> new Document(
                page.getText(),
                enrichMetadata(page.getMetadata(), fileName, documentId)
            ))
            .toList();
    }

    private List<Document> splitPagesIntoChunks(List<Document> pages) {
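        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(800)              // target chunk size in tokens
            .withMinChunkSizeChars(350)      // minimum chunk size in characters
            .withMinChunkLengthToEmbed(20)   // chunks shorter than this are discarded
            .withMaxNumChunks(10000)         // upper limit on chunks produced from one text
            .withKeepSeparator(true)         // keep separators such as line breaks in chunks
            .build();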

        return splitter.apply(pages);
    }

    private SearchRequest buildSearchRequest(String query, String documentId, int topK, double threshold) {
        SearchRequest.Builder builder = SearchRequest.builder()
            .query(query)
            .topK(topK)
            .similarityThreshold(threshold);

        if (documentId != null && !documentId.isBlank()) {
            builder.filterExpression("documentId == '" + escapeFilterValue(documentId) + "'");
        }

        return builder.build();
    }

    private Map<String, Object> enrichMetadata(Map<String, Object> existingMetadata, String fileName, String documentId) {
        Map<String, Object> metadata = new HashMap<>(existingMetadata);
        metadata.put("fileName", fileName);
        metadata.put("documentId", documentId);
        metadata.put("type", "pdf");
        return metadata;
    }

    private PdfSourceDto toSource(Document document) {
        Map<String, Object> metadata = document.getMetadata();

        return new PdfSourceDto(
            document.getText(),
            String.valueOf(metadata.getOrDefault("fileName", "unknown.pdf")),
            String.valueOf(metadata.getOrDefault("documentId", "unknown")),
            String.valueOf(metadata.getOrDefault("page_number", metadata.getOrDefault("page", "unknown"))),
            document.getScore()
        );
    }

    private void validatePdf(MultipartFile file) {
        if (file == null || file.isEmpty()) {
            throw new IllegalArgumentException("PDF file is required");
        }

        String fileName = file.getOriginalFilename();
        if (fileName == null || !fileName.toLowerCase().endsWith(".pdf")) {
            throw new IllegalArgumentException("Only PDF files are allowed");
        }

        String contentType = file.getContentType();
        if (contentType != null && !contentType.equalsIgnoreCase("application/pdf")) {
            throw new IllegalArgumentException("Invalid content type. Expected application/pdf");
        }
    }

    private String cleanFileName(String fileName) {
        if (fileName == null || fileName.isBlank()) {
            return "uploaded.pdf";
        }
        return Path.of(fileName).getFileName().toString();
    }

    private String escapeFilterValue(String value) {
        return value.replace("'", "\\'");
    }

    private void deleteTempFile(Path tempFile) {
        if (tempFile == null) {
            return;
        }
        try {
            Files.deleteIfExists(tempFile);
        }
        catch (IOException ignored) {
            // Temporary file cleanup failure should not fail the user request.
        }
    }
}
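The validatePdf method above trusts the file extension and the client-supplied content type, both of which are easy to fake. A stricter option is to look at the first bytes of the upload, since a PDF file begins with the ASCII marker `%PDF-`. The sketch below shows that check in isolation. `PdfSignature` and `hasPdfHeader` are hypothetical names, and this strict version would reject the rare file that has extra bytes before the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: checks the PDF header marker instead of trusting the file name.
final class PdfSignature {

    // Every PDF header starts with these five ASCII bytes, followed by a version such as 1.7.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the given bytes start with the PDF marker.
    static boolean hasPdfHeader(byte[] firstBytes) {
        return firstBytes != null
            && firstBytes.length >= MAGIC.length
            && Arrays.equals(firstBytes, 0, MAGIC.length, MAGIC, 0, MAGIC.length);
    }

    // Reads only the first few bytes of the stream; the caller is responsible for closing it.
    static boolean hasPdfHeader(InputStream in) throws IOException {
        return hasPdfHeader(in.readNBytes(MAGIC.length));
    }
}
```

In the service this could be called from validatePdf with `file.getInputStream()` inside a try-with-resources block, throwing IllegalArgumentException when the check fails.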

Service Flow Explained

sequenceDiagram
    participant U as User
    participant C as Controller
    participant S as Service
    participant R as PDF Reader
    participant T as Token Splitter
    participant V as PGVector

    U->>C: Upload PDF
    C->>S: uploadAndIngest(file)
    S->>R: Read PDF pages
    R-->>S: Page documents
    S->>T: Split pages into chunks
    T-->>S: Smaller chunks
    S->>V: vectorStore.add(chunks)
    V-->>S: Stored embeddings
    S-->>C: documentId + chunk count
    C-->>U: Upload response

Step 7: Build the Controller

File: src/main/java/com/codewithvenu/pdfassistant/controller/PdfKnowledgeController.java

package com.codewithvenu.pdfassistant.controller;

import com.codewithvenu.pdfassistant.dto.AskPdfRequest;
import com.codewithvenu.pdfassistant.dto.AskPdfResponse;
import com.codewithvenu.pdfassistant.dto.PdfSearchRequest;
import com.codewithvenu.pdfassistant.dto.PdfSourceDto;
import com.codewithvenu.pdfassistant.dto.UploadPdfResponse;
import com.codewithvenu.pdfassistant.service.PdfKnowledgeService;
import jakarta.validation.Valid;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.util.List;
import java.util.Map;

@RestController
@RequestMapping("/api/pdf")
public class PdfKnowledgeController {

    private final PdfKnowledgeService pdfKnowledgeService;

    public PdfKnowledgeController(PdfKnowledgeService pdfKnowledgeService) {
        this.pdfKnowledgeService = pdfKnowledgeService;
    }

    @GetMapping("/health")
    public Map<String, String> health() {
        return Map.of("status", "UP", "service", "pdf-knowledge-assistant");
    }

    @PostMapping(value = "/upload", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    public UploadPdfResponse upload(@RequestParam("file") MultipartFile file) {
        return pdfKnowledgeService.uploadAndIngest(file);
    }

    @PostMapping("/search")
    public List<PdfSourceDto> search(@Valid @RequestBody PdfSearchRequest request) {
        return pdfKnowledgeService.search(request);
    }

    @PostMapping("/ask")
    public AskPdfResponse ask(@Valid @RequestBody AskPdfRequest request) {
        return pdfKnowledgeService.ask(request);
    }

    @PostMapping("/ask-manual")
    public AskPdfResponse askManual(@Valid @RequestBody AskPdfRequest request) {
        return pdfKnowledgeService.askManual(request);
    }
}

Step 8: Add Error Handling

File: src/main/java/com/codewithvenu/pdfassistant/exception/GlobalExceptionHandler.java

package com.codewithvenu.pdfassistant.exception;

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.MethodArgumentNotValidException;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(MethodArgumentNotValidException.class)
    public ResponseEntity<Map<String, Object>> handleValidation(MethodArgumentNotValidException ex) {
        Map<String, String> fieldErrors = new HashMap<>();

        ex.getBindingResult().getFieldErrors().forEach(error ->
            fieldErrors.put(error.getField(), error.getDefaultMessage())
        );

        Map<String, Object> body = new HashMap<>();
        body.put("timestamp", Instant.now());
        body.put("status", HttpStatus.BAD_REQUEST.value());
        body.put("error", "Validation failed");
        body.put("fields", fieldErrors);

        return ResponseEntity.badRequest().body(body);
    }

    @ExceptionHandler(IllegalArgumentException.class)
    public ResponseEntity<Map<String, Object>> handleBadRequest(IllegalArgumentException ex) {
        Map<String, Object> body = new HashMap<>();
        body.put("timestamp", Instant.now());
        body.put("status", HttpStatus.BAD_REQUEST.value());
        body.put("error", "Bad request");
        body.put("message", ex.getMessage());

        return ResponseEntity.badRequest().body(body);
    }

    @ExceptionHandler(Exception.class)
    public ResponseEntity<Map<String, Object>> handleException(Exception ex) {
        Map<String, Object> body = new HashMap<>();
        body.put("timestamp", Instant.now());
        body.put("status", HttpStatus.INTERNAL_SERVER_ERROR.value());
        body.put("error", "PDF assistant request failed");
        body.put("message", ex.getMessage());

        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(body);
    }
}

For production, do not expose raw exception messages. Log detailed errors internally and return safe messages to users.
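One way to follow that advice is to return a generic message plus an error ID, and write the same ID next to the full exception in the server log. The sketch below builds such a response body. `SafeErrors` and the `errorId` field are assumptions for this example, not part of the handler above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper: builds an error body that carries no exception details.
final class SafeErrors {

    // A random ID that can be logged with the stack trace and shown to the user.
    static String newErrorId() {
        return UUID.randomUUID().toString();
    }

    // The body deliberately has no "message" field taken from the exception.
    static Map<String, Object> safeBody(int status, String errorId) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("status", status);
        body.put("error", "PDF assistant request failed");
        body.put("errorId", errorId);
        return body;
    }
}
```

Inside handleException, the handler would generate the ID, log it together with `ex`, and return `safeBody(500, errorId)` instead of putting `ex.getMessage()` into the response.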

Step 9: Run the Application

Start PGVector:

docker compose up -d

Run Spring Boot:

mvn spring-boot:run

Health check:

curl http://localhost:8080/api/pdf/health

Expected output:

{
  "service": "pdf-knowledge-assistant",
  "status": "UP"
}

Step 10: Upload a PDF

Assume you have a file named spring-ai-guide.pdf.

Upload:

curl -X POST http://localhost:8080/api/pdf/upload \
  -F "file=@spring-ai-guide.pdf"

Expected response:

{
  "fileName": "spring-ai-guide.pdf",
  "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
  "pagesRead": 12,
  "chunksStored": 34
}

Save the documentId. You can pass it later to restrict search and questions to this PDF.

Step 11: Search PDF Chunks

Search across all uploaded PDFs:

curl -X POST http://localhost:8080/api/pdf/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is ChatClient in Spring AI?",
    "topK": 3,
    "similarityThreshold": 0.65
  }'

Expected response shape:

[
  {
    "content": "ChatClient provides a fluent API for communicating with AI chat models...",
    "fileName": "spring-ai-guide.pdf",
    "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
    "page": "4",
    "score": 0.89
  }
]

Search only one PDF:

curl -X POST http://localhost:8080/api/pdf/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is ChatClient in Spring AI?",
    "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
    "topK": 3,
    "similarityThreshold": 0.65
  }'
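The documentId from the request ends up inside a filter expression string. Because uploadAndIngest generates every documentId with UUID.randomUUID(), one defensive option is to accept only values that parse as a UUID before building the filter. The helper below is a sketch of that idea. `DocumentIds` and `requireUuid` are hypothetical names.

```java
import java.util.UUID;

// Hypothetical helper: accepts only UUID-shaped document IDs.
final class DocumentIds {

    // Returns the canonical lowercase UUID string, or throws IllegalArgumentException.
    static String requireUuid(String documentId) {
        if (documentId == null) {
            throw new IllegalArgumentException("documentId is required");
        }
        try {
            // The canonical form contains only hex digits and hyphens.
            return UUID.fromString(documentId.strip()).toString();
        } catch (IllegalArgumentException ex) {
            throw new IllegalArgumentException("documentId must be a UUID", ex);
        }
    }
}
```

With this in place, buildSearchRequest could call `DocumentIds.requireUuid(documentId)` and use the returned value in the filter. The existing IllegalArgumentException handler would then turn a malformed ID into a 400 response.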

Step 12: Ask Questions from PDFs

Ask across all PDFs:

curl -X POST http://localhost:8080/api/pdf/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Explain Spring AI ChatClient in simple terms.",
    "topK": 5,
    "similarityThreshold": 0.65
  }'

Expected response:

{
  "answer": "Spring AI ChatClient is a fluent API for communicating with AI chat models. It helps developers build prompts, send user messages, receive responses, and use features like advisors and streaming.",
  "sources": [
    {
      "content": "ChatClient provides a fluent API for communicating with AI chat models...",
      "fileName": "spring-ai-guide.pdf",
      "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
      "page": "4",
      "score": 0.89
    }
  ]
}

Ask from one specific PDF:

curl -X POST http://localhost:8080/api/pdf/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the main steps in a RAG pipeline?",
    "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
    "topK": 5,
    "similarityThreshold": 0.65
  }'

Expected answer style:

{
  "answer": "The main RAG pipeline steps are reading source documents, splitting them into chunks, creating embeddings, storing them in a vector database, retrieving relevant chunks for a question, and generating an answer using the retrieved context.",
  "sources": [
    {
      "content": "A typical RAG pipeline includes document loading, text splitting, embedding, vector storage, retrieval, and generation...",
      "fileName": "spring-ai-guide.pdf",
      "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
      "page": "8",
      "score": 0.91
    }
  ]
}

Step 13: Test Manual RAG

The /ask endpoint uses QuestionAnswerAdvisor, which runs the similarity search and adds the retrieved chunks to the prompt for you.

The /ask-manual endpoint shows the RAG process manually:

Search PGVector.
Join chunks into context.
Put context into the prompt.
Ask the chat model.

curl -X POST http://localhost:8080/api/pdf/ask-manual \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is PGVector used for in this PDF?",
    "topK": 5,
    "similarityThreshold": 0.65
  }'

This endpoint is useful for learning and debugging because you can see exactly how the context is passed to the model in code.
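When you assemble the context yourself, you also control its layout. The askManual method joins the raw chunk text only. A possible variation, sketched below, prefixes each chunk with its file name and page so the model can refer to sources in its answer. `ContextBuilder` and `Snippet` are hypothetical types for this example. In the service, the values would come from each Document's text and metadata.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper: formats retrieved chunks as a labeled context block.
final class ContextBuilder {

    // Minimal stand-in for a retrieved chunk and the metadata this guide stores with it.
    record Snippet(String fileName, String page, String text) {}

    // Produces "[Source N: file, page P]" headers, using the same separator as askManual.
    static String build(List<Snippet> snippets) {
        return IntStream.range(0, snippets.size())
            .mapToObj(i -> {
                Snippet s = snippets.get(i);
                return "[Source " + (i + 1) + ": " + s.fileName() + ", page " + s.page() + "]\n"
                    + s.text().strip();
            })
            .collect(Collectors.joining("\n\n---\n\n"));
    }
}
```

The resulting string can be passed as the `context` parameter of the prompt template, and the rules in the prompt can then ask the model to mention the source number it relied on.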

Input and Output Examples

Upload Input

curl -X POST http://localhost:8080/api/pdf/upload \
  -F "file=@employee-handbook.pdf"

Upload Output

{
  "fileName": "employee-handbook.pdf",
  "documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
  "pagesRead": 48,
  "chunksStored": 121
}

Question Input

{
  "question": "How many paid vacation days are available?",
  "documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
  "topK": 5,
  "similarityThreshold": 0.70
}

Question Output

{
  "answer": "The uploaded PDF says employees receive 15 paid vacation days per year after completing the probation period.",
  "sources": [
    {
      "content": "Full-time employees receive 15 paid vacation days per calendar year after successful completion of the probation period...",
      "fileName": "employee-handbook.pdf",
      "documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
      "page": "12",
      "score": 0.93
    }
  ]
}

Important Concepts

Concept	Simple Meaning
PDF reader	Extracts text from PDF pages
Document	Spring AI object containing text and metadata
Metadata	Extra information such as file name, page, document ID
Chunking	Splitting large text into smaller pieces
Embedding	Numeric vector that represents the meaning of text
VectorStore	Stores embeddings and searches similar text
PGVector	PostgreSQL extension for vector search
RAG	Retrieve context first, then generate answer
Source citation	Showing which PDF chunk was used

Why Chunking Matters

Bad chunking creates bad answers.

If chunks are too large:

Retrieval may return noisy context.
Token usage increases.
The answer may be less precise.

If chunks are too small:

Context may be incomplete.
Important meaning can be split across chunks.
The answer may miss details.

Good starting values:

Setting	Recommended Start
Chunk size	`800` tokens
Minimum chunk size	`350` characters
Top K	`3` to `5`
Similarity threshold	`0.65` to `0.75`
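To get a feel for how chunk size changes the number and shape of chunks, the sketch below splits text into fixed-size groups of words. It is deliberately simpler than TokenTextSplitter, which counts tokens and looks for sentence boundaries, so treat it as an illustration of the trade-off only. `WordChunker` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified word-based chunker; a stand-in for illustration, not TokenTextSplitter's algorithm.
final class WordChunker {

    // Splits text on whitespace and groups the words into chunks of at most maxWords.
    static List<String> chunk(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        List<String> words = Arrays.asList(trimmed.split("\\s+"));
        for (int start = 0; start < words.size(); start += maxWords) {
            int end = Math.min(start + maxWords, words.size());
            chunks.add(String.join(" ", words.subList(start, end)));
        }
        return chunks;
    }
}
```

Running the same page through `chunk(text, 50)` and `chunk(text, 400)` shows the effect directly: the small setting yields many fragments that may cut a sentence in half, while the large one yields a few chunks that mix several topics.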

Common Problems and Fixes

Problem	Cause	Fix
Upload fails	File is not PDF	Check file extension and content type
No answers found	Similarity threshold too high	Lower threshold to `0.60` or `0.65`
Irrelevant answers	Threshold too low	Increase threshold to `0.75`
Wrong PDF used	No document filter	Pass `documentId` in ask request
Slow upload	Large PDF or many chunks	Limit file size and tune chunking
Table missing	Schema not initialized	Set `initialize-schema: true`
Vector dimension error	Embedding model changed	Recreate vector table with correct dimension
Poor PDF text	Scanned PDF image	Use OCR before ingestion

Scanned PDFs and OCR

PagePdfDocumentReader extracts text from PDFs that already contain text.

If your PDF is scanned as images, the reader may extract little or no text. In that case, add OCR before Spring AI ingestion.

Common OCR options:

Tesseract OCR.
AWS Textract.
Azure Document Intelligence.
Google Document AI.

OCR flow:

flowchart LR
    PDF["Scanned PDF"] --> OCR["OCR Service"]
    OCR --> Text["Extracted Text"]
    Text --> Chunk["TokenTextSplitter"]
    Chunk --> Vector["PGVector"]
    Vector --> RAG["RAG Answer"]
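Before sending a file to OCR, it helps to know whether it needs OCR at all. A simple heuristic, sketched below, is to look at how many pages came back with almost no text after normal extraction. `ScannedPdfHeuristic`, the 50-character cutoff, and the ratio parameter are assumptions for this example and would need tuning on real documents.

```java
import java.util.List;

// Hypothetical heuristic: many near-empty pages suggest a scanned (image-only) PDF.
final class ScannedPdfHeuristic {

    // Assumed cutoff: a page with fewer non-whitespace characters counts as sparse.
    static final int MIN_CHARS_PER_PAGE = 50;

    // Fraction of pages that are sparse; an empty page list counts as fully sparse.
    static double sparsePageRatio(List<String> pageTexts) {
        if (pageTexts.isEmpty()) {
            return 1.0;
        }
        long sparse = pageTexts.stream()
            .filter(text -> nonWhitespaceCount(text) < MIN_CHARS_PER_PAGE)
            .count();
        return (double) sparse / pageTexts.size();
    }

    // True when the sparse fraction exceeds the given limit, for example 0.5.
    static boolean probablyScanned(List<String> pageTexts, double maxSparseRatio) {
        return sparsePageRatio(pageTexts) > maxSparseRatio;
    }

    private static long nonWhitespaceCount(String text) {
        if (text == null) {
            return 0;
        }
        return text.chars().filter(c -> !Character.isWhitespace(c)).count();
    }
}
```

In the service, the page texts could be collected from the Documents returned by readPdfPages with `Document::getText`, and a positive result could route the file to the OCR flow instead of storing near-empty chunks.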

Production Checklist

Before production, add:

Authentication and authorization.
Tenant-based documentId or tenantId filtering.
Virus scanning for uploaded PDFs.
File size and page count limits.
OCR support for scanned PDFs.
Persistent document metadata table.
Delete and re-index functionality.
Duplicate file detection.
Source citations in final UI.
Logging for ingestion, retrieval, token usage, and model latency.
Evaluation questions for each PDF collection.
Prompt injection checks for PDF content.
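As an example of one checklist item, duplicate file detection can start with a content hash: if the SHA-256 of the uploaded bytes is already known, the existing documentId can be returned instead of ingesting the file again. The sketch below keeps the hashes in memory. `DuplicateDetector` is a hypothetical class, and a production version would store the hash in the persistent document metadata table instead.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory duplicate check keyed by the SHA-256 of the file content.
final class DuplicateDetector {

    private final Map<String, String> documentIdByHash = new ConcurrentHashMap<>();

    // Hex-encoded SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }

    // Registers the content under documentId if it is new and returns empty.
    // If the same bytes were registered before, returns the earlier documentId.
    Optional<String> registerIfNew(byte[] content, String documentId) {
        String previous = documentIdByHash.putIfAbsent(sha256Hex(content), documentId);
        return Optional.ofNullable(previous);
    }
}
```

A byte-level hash only catches exact copies. Two PDFs with the same text but different metadata or compression would still be treated as different files.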

Complete Test Script

curl http://localhost:8080/api/pdf/health

curl -X POST http://localhost:8080/api/pdf/upload \
  -F "file=@spring-ai-guide.pdf"

curl -X POST http://localhost:8080/api/pdf/search \
  -H "Content-Type: application/json" \
  -d '{"query":"What is Spring AI ChatClient?","topK":3,"similarityThreshold":0.65}'

curl -X POST http://localhost:8080/api/pdf/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"What is Spring AI ChatClient?","topK":5,"similarityThreshold":0.65}'

curl -X POST http://localhost:8080/api/pdf/ask-manual \
  -H "Content-Type: application/json" \
  -d '{"question":"What is RAG?","topK":5,"similarityThreshold":0.65}'

Summary

You built a PDF knowledge assistant with Spring AI.

The key pipeline is:

Upload PDF.
Read PDF pages with PagePdfDocumentReader.
Convert pages into Spring AI Document objects.
Split text using TokenTextSplitter.
Store chunks in PGVector using VectorStore.
Retrieve relevant chunks for a question.
Use ChatClient and RAG to generate a grounded answer.
Return sources so the user can verify the answer.

This pattern is the foundation for:

HR policy assistants.
Legal PDF assistants.
Insurance document Q&A.
Banking policy Q&A.
Technical manual assistants.
PDF-based customer support bots.