
Build a PDF Knowledge Assistant with Spring AI

A detailed step-by-step guide to build a PDF knowledge assistant using Spring Boot, Spring AI, PDF parsing, TokenTextSplitter, PGVector, embeddings, and RAG.

A PDF knowledge assistant is an AI application that can answer questions from uploaded PDF files.

Instead of asking the model to answer from general knowledge, we first extract text from PDFs, split the text into chunks, store those chunks in a vector database, and retrieve the most relevant chunks whenever the user asks a question.

This pattern is called RAG: Retrieval Augmented Generation.

In this guide, we will build a Spring Boot application that:

  • Uploads PDF files.
  • Reads PDF text using Spring AI's PDF document reader.
  • Splits extracted text into smaller chunks.
  • Creates embeddings for each chunk.
  • Stores chunks and embeddings in PostgreSQL with PGVector.
  • Lets users ask questions about uploaded PDFs.
  • Returns an answer with source snippets.

Final Application APIs

| API | Method | Purpose |
|---|---|---|
| /api/pdf/health | GET | Check service health |
| /api/pdf/upload | POST | Upload and ingest a PDF |
| /api/pdf/search | POST | Search relevant PDF chunks |
| /api/pdf/ask | POST | Ask a question from uploaded PDFs |
| /api/pdf/ask-manual | POST | Ask using manual RAG context for learning |

How the PDF Assistant Works

flowchart TD
    A["Upload PDF"] --> B["Save temporarily"]
    B --> C["PagePdfDocumentReader"]
    C --> D["PDF pages become Documents"]
    D --> E["TokenTextSplitter"]
    E --> F["Smaller text chunks"]
    F --> G["Embedding model"]
    G --> H["PGVector VectorStore"]

    Q["User question"] --> I["Similarity search"]
    H --> I
    I --> J["Relevant PDF chunks"]
    J --> K["ChatClient prompt"]
    K --> L["Grounded answer"]

The important idea:

The LLM does not read your whole PDF every time. It only receives the most relevant chunks retrieved from PGVector.
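To make the retrieval step concrete, here is a minimal sketch in plain Java of what a similarity search does conceptually: score every stored chunk against the question, drop chunks below a threshold, and keep the best few. The class `RetrievalSketch`, its records, and the toy two-dimensional vectors are illustrative assumptions. This is not how PGVector or Spring AI implement the search internally, and real embeddings have many more dimensions.

```java
import java.util.Comparator;
import java.util.List;

final class RetrievalSketch {

    // A stored chunk: its text plus a (toy) embedding vector.
    record Chunk(String text, double[] embedding) {}

    // A chunk paired with its similarity to the question.
    record Scored(Chunk chunk, double score) {}

    // Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Keep chunks scoring at or above the threshold, best first, at most topK.
    static List<Scored> retrieve(double[] question, List<Chunk> chunks, int topK, double threshold) {
        return chunks.stream()
            .map(c -> new Scored(c, cosine(question, c.embedding())))
            .filter(s -> s.score() >= threshold)
            .sorted(Comparator.comparingDouble(Scored::score).reversed())
            .limit(topK)
            .toList();
    }
}
```

Only the chunks returned by `retrieve` would be placed into the prompt, which is why the model never sees the whole PDF.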

Tools and Frameworks

| Tool | Recommended Version | Purpose |
|---|---|---|
| Java | 21 or later | Application runtime |
| Spring Boot | 4.0.x | REST API framework |
| Spring AI | 2.0.0 | PDF readers, embeddings, VectorStore, ChatClient |
| PostgreSQL | 16 or later | Database |
| PGVector | Current Docker image | Vector search extension |
| OpenAI API key | Required in this guide | Chat and embedding model |
| Docker | Current version | Run PostgreSQL + PGVector |
| Maven | 3.9+ | Build tool |
| curl or Postman | Any current version | API testing |

Spring AI 2.0.x works with Spring Boot 4.0.x and 4.1.x. If your project uses Spring Boot 3.x, use the matching Spring AI 1.x dependency line.

Project Structure

Create this structure:

spring-ai-pdf-knowledge-assistant/
├── docker-compose.yml
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/
        │       └── codewithvenu/
        │           └── pdfassistant/
        │               ├── PdfKnowledgeAssistantApplication.java
        │               ├── controller/
        │               │   └── PdfKnowledgeController.java
        │               ├── dto/
        │               │   ├── AskPdfRequest.java
        │               │   ├── AskPdfResponse.java
        │               │   ├── PdfSearchRequest.java
        │               │   ├── PdfSourceDto.java
        │               │   └── UploadPdfResponse.java
        │               ├── exception/
        │               │   └── GlobalExceptionHandler.java
        │               └── service/
        │                   └── PdfKnowledgeService.java
        └── resources/
            └── application.yml

Step 1: Create pom.xml

File: pom.xml

Copy this complete file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>4.0.0</version>
        <relativePath/>
    </parent>

    <groupId>com.codewithvenu</groupId>
    <artifactId>spring-ai-pdf-knowledge-assistant</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>spring-ai-pdf-knowledge-assistant</name>
    <description>PDF Knowledge Assistant with Spring AI</description>

    <properties>
        <java.version>21</java.version>
        <spring-ai.version>2.0.0</spring-ai.version>
    </properties>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.ai</groupId>
                <artifactId>spring-ai-bom</artifactId>
                <version>${spring-ai.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-validation</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-jdbc</artifactId>
        </dependency>

        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <scope>runtime</scope>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-model-openai</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-vector-store-advisor</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-pdf-document-reader</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

Dependency explanation:

| Dependency | Purpose |
|---|---|
| spring-ai-pdf-document-reader | Reads PDF pages with Apache PDFBox |
| spring-ai-starter-vector-store-pgvector | Stores PDF chunks and embeddings in PostgreSQL |
| spring-ai-starter-model-openai | Provides chat and embedding models |
| spring-ai-vector-store-advisor | Provides QuestionAnswerAdvisor for RAG |
| spring-boot-starter-jdbc | Connects to PostgreSQL |

Step 2: Start PGVector with Docker

File: docker-compose.yml

services:
  postgres:
    image: pgvector/pgvector:pg16
    container_name: pdf-assistant-pgvector
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: pdf_assistant
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    volumes:
      - pdf-assistant-data:/var/lib/postgresql/data

volumes:
  pdf-assistant-data:

Start it:

docker compose up -d

Check it:

docker ps

Expected container:

pdf-assistant-pgvector

Step 3: Configure Spring Boot

File: src/main/resources/application.yml

server:
  port: 8080

spring:
  application:
    name: spring-ai-pdf-knowledge-assistant

  servlet:
    multipart:
      max-file-size: 25MB
      max-request-size: 25MB

  datasource:
    url: jdbc:postgresql://localhost:5432/pdf_assistant
    username: postgres
    password: postgres

  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4.1-mini
          temperature: 0.2
      embedding:
        options:
          model: text-embedding-3-small

    vectorstore:
      pgvector:
        initialize-schema: true
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536
        max-document-batch-size: 1000

Set your OpenAI API key:

export OPENAI_API_KEY="your-openai-api-key-here"

On Windows PowerShell:

$env:OPENAI_API_KEY="your-openai-api-key-here"

Important settings:

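  • multipart.max-file-size and max-request-size raise the upload limit to 25MB. Spring Boot's default of 1MB per file is too small for many PDFs.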
  • initialize-schema: true tells Spring AI to create the PGVector table.
  • dimensions: 1536 matches text-embedding-3-small.
  • If you change the embedding model, verify the vector dimensions.
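One way to catch a dimension mismatch early is a small startup check, sketched below in plain Java. The class `EmbeddingDimensionCheck` and its lookup table are hypothetical helpers, not part of Spring AI. The numbers are the default output sizes OpenAI documents for these models. The text-embedding-3 models can also be asked for shorter vectors, which this sketch does not cover.

```java
import java.util.Map;

final class EmbeddingDimensionCheck {

    // Default output sizes for common OpenAI embedding models.
    static final Map<String, Integer> DEFAULT_DIMENSIONS = Map.of(
        "text-embedding-3-small", 1536,
        "text-embedding-3-large", 3072,
        "text-embedding-ada-002", 1536
    );

    // Returns true when the configured vector column width matches the model default.
    static boolean matches(String model, int configuredDimensions) {
        Integer expected = DEFAULT_DIMENSIONS.get(model);
        if (expected == null) {
            throw new IllegalArgumentException("Unknown embedding model: " + model);
        }
        return expected == configuredDimensions;
    }
}
```

With the configuration above, `matches("text-embedding-3-small", 1536)` returns true, while switching the model to text-embedding-3-large without changing `dimensions` would return false.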

Step 4: Main Application Class

File: src/main/java/com/codewithvenu/pdfassistant/PdfKnowledgeAssistantApplication.java

package com.codewithvenu.pdfassistant;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class PdfKnowledgeAssistantApplication {

    public static void main(String[] args) {
        SpringApplication.run(PdfKnowledgeAssistantApplication.class, args);
    }
}

Step 5: Create DTOs

UploadPdfResponse

File: src/main/java/com/codewithvenu/pdfassistant/dto/UploadPdfResponse.java

package com.codewithvenu.pdfassistant.dto;

public record UploadPdfResponse(
    String fileName,
    String documentId,
    int pagesRead,
    int chunksStored
) {
}

AskPdfRequest

File: src/main/java/com/codewithvenu/pdfassistant/dto/AskPdfRequest.java

package com.codewithvenu.pdfassistant.dto;

import jakarta.validation.constraints.NotBlank;

public record AskPdfRequest(
    @NotBlank(message = "question is required")
    String question,

    String documentId,

    Integer topK,

    Double similarityThreshold
) {
    public int safeTopK() {
        return topK == null ? 5 : topK;
    }

    public double safeSimilarityThreshold() {
        return similarityThreshold == null ? 0.70 : similarityThreshold;
    }
}

PdfSearchRequest

File: src/main/java/com/codewithvenu/pdfassistant/dto/PdfSearchRequest.java

package com.codewithvenu.pdfassistant.dto;

import jakarta.validation.constraints.NotBlank;

public record PdfSearchRequest(
    @NotBlank(message = "query is required")
    String query,

    String documentId,

    Integer topK,

    Double similarityThreshold
) {
    public int safeTopK() {
        return topK == null ? 5 : topK;
    }

    public double safeSimilarityThreshold() {
        return similarityThreshold == null ? 0.70 : similarityThreshold;
    }
}
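The safeTopK() and safeSimilarityThreshold() helpers in both request records only handle a missing value. If you also want to guard against extreme values sent by a client, one option is to clamp them. The sketch below is an assumption-labeled illustration: the class `SearchLimits` and the upper bound of 20 are made up for this example and are not part of the guide's code.

```java
final class SearchLimits {

    static final int DEFAULT_TOP_K = 5;
    static final int MAX_TOP_K = 20; // hypothetical upper bound, tune for your data
    static final double DEFAULT_THRESHOLD = 0.70;

    // Null falls back to the default; other values are clamped into [1, MAX_TOP_K].
    static int topK(Integer requested) {
        if (requested == null) {
            return DEFAULT_TOP_K;
        }
        return Math.max(1, Math.min(MAX_TOP_K, requested));
    }

    // Null falls back to the default; other values are clamped into [0.0, 1.0].
    static double threshold(Double requested) {
        if (requested == null) {
            return DEFAULT_THRESHOLD;
        }
        return Math.max(0.0, Math.min(1.0, requested));
    }
}
```

Clamping silently corrects bad input. Rejecting it with a validation error is an equally reasonable choice if you prefer clients to notice their mistake.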

PdfSourceDto

File: src/main/java/com/codewithvenu/pdfassistant/dto/PdfSourceDto.java

package com.codewithvenu.pdfassistant.dto;

public record PdfSourceDto(
    String content,
    String fileName,
    String documentId,
    String page,
    Double score
) {
}

AskPdfResponse

File: src/main/java/com/codewithvenu/pdfassistant/dto/AskPdfResponse.java

package com.codewithvenu.pdfassistant.dto;

import java.util.List;

public record AskPdfResponse(
    String answer,
    List<PdfSourceDto> sources
) {
}

Step 6: Build the PDF Knowledge Service

This service contains the main logic:

  • Validate PDF upload.
  • Save file temporarily.
  • Read pages with PagePdfDocumentReader.
  • Add metadata.
  • Split pages with TokenTextSplitter.
  • Store chunks in PGVector.
  • Search chunks.
  • Ask the model using RAG.

File: src/main/java/com/codewithvenu/pdfassistant/service/PdfKnowledgeService.java

package com.codewithvenu.pdfassistant.service;

import com.codewithvenu.pdfassistant.dto.AskPdfRequest;
import com.codewithvenu.pdfassistant.dto.AskPdfResponse;
import com.codewithvenu.pdfassistant.dto.PdfSearchRequest;
import com.codewithvenu.pdfassistant.dto.PdfSourceDto;
import com.codewithvenu.pdfassistant.dto.UploadPdfResponse;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

@Service
public class PdfKnowledgeService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public PdfKnowledgeService(VectorStore vectorStore, ChatClient.Builder chatClientBuilder) {
        this.vectorStore = vectorStore;
        this.chatClient = chatClientBuilder
            .defaultSystem("""
                You are a PDF knowledge assistant.
                Answer only from the retrieved PDF context.
                If the answer is not present in the PDF context, say:
                I do not know from the uploaded PDF.
                Keep answers clear, accurate, and beginner-friendly.
                """)
            .build();
    }

    public UploadPdfResponse uploadAndIngest(MultipartFile file) {
        validatePdf(file);

        String documentId = UUID.randomUUID().toString();
        String originalFileName = cleanFileName(file.getOriginalFilename());
        Path tempFile = null;

        try {
            tempFile = Files.createTempFile("spring-ai-pdf-", ".pdf");
            file.transferTo(tempFile);

            List<Document> pages = readPdfPages(tempFile, originalFileName, documentId);
            List<Document> chunks = splitPagesIntoChunks(pages);

            vectorStore.add(chunks);

            return new UploadPdfResponse(
                originalFileName,
                documentId,
                pages.size(),
                chunks.size()
            );
        }
        catch (IOException ex) {
            throw new IllegalStateException("Failed to process PDF file", ex);
        }
        finally {
            deleteTempFile(tempFile);
        }
    }

    public List<PdfSourceDto> search(PdfSearchRequest request) {
        SearchRequest searchRequest = buildSearchRequest(
            request.query(),
            request.documentId(),
            request.safeTopK(),
            request.safeSimilarityThreshold()
        );

        return vectorStore.similaritySearch(searchRequest)
            .stream()
            .map(this::toSource)
            .toList();
    }

    public AskPdfResponse ask(AskPdfRequest request) {
        SearchRequest searchRequest = buildSearchRequest(
            request.question(),
            request.documentId(),
            request.safeTopK(),
            request.safeSimilarityThreshold()
        );

        QuestionAnswerAdvisor advisor = QuestionAnswerAdvisor.builder(vectorStore)
            .searchRequest(searchRequest)
            .build();

        String answer = chatClient
            .prompt()
            .advisors(advisor)
            .user(request.question())
            .call()
            .content();

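        // Run the same search again to build the source list.
        // Note: this costs a second embedding call for the question.
        List<PdfSourceDto> sources = vectorStore.similaritySearch(searchRequest)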
            .stream()
            .map(this::toSource)
            .toList();

        return new AskPdfResponse(answer, sources);
    }

    public AskPdfResponse askManual(AskPdfRequest request) {
        SearchRequest searchRequest = buildSearchRequest(
            request.question(),
            request.documentId(),
            request.safeTopK(),
            request.safeSimilarityThreshold()
        );

        List<Document> documents = vectorStore.similaritySearch(searchRequest);

        String context = documents.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n\n---\n\n"));

        String answer = chatClient
            .prompt()
            .user(user -> user
                .text("""
                    Use the PDF context below to answer the question.

                    PDF context:
                    {context}

                    Question:
                    {question}

                    Rules:
                    - Answer only from the PDF context.
                    - If the context does not contain the answer, say: I do not know from the uploaded PDF.
                    - Keep the answer clear.
                    """)
                .param("context", context)
                .param("question", request.question()))
            .call()
            .content();

        List<PdfSourceDto> sources = documents.stream()
            .map(this::toSource)
            .toList();

        return new AskPdfResponse(answer, sources);
    }

    private List<Document> readPdfPages(Path pdfPath, String fileName, String documentId) {
        PagePdfDocumentReader reader = new PagePdfDocumentReader(
            pdfPath.toUri().toString(),
            PdfDocumentReaderConfig.builder()
                .withPageTopMargin(0)
                .withPageExtractedTextFormatter(
                    ExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .build()
                )
                .withPagesPerDocument(1)
                .build()
        );

        List<Document> pages = reader.read();

        return pages.stream()
            .map(page -> new Document(
                page.getText(),
                enrichMetadata(page.getMetadata(), fileName, documentId)
            ))
            .toList();
    }

    private List<Document> splitPagesIntoChunks(List<Document> pages) {
        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(800)
            .withMinChunkSizeChars(350)
            .withMinChunkLengthToEmbed(20)
            .withMaxNumChunks(10000)
            .withKeepSeparator(true)
            .build();

        return splitter.apply(pages);
    }

    private SearchRequest buildSearchRequest(String query, String documentId, int topK, double threshold) {
        SearchRequest.Builder builder = SearchRequest.builder()
            .query(query)
            .topK(topK)
            .similarityThreshold(threshold);

        if (documentId != null && !documentId.isBlank()) {
            builder.filterExpression("documentId == '" + escapeFilterValue(documentId) + "'");
        }

        return builder.build();
    }

    private Map<String, Object> enrichMetadata(Map<String, Object> existingMetadata, String fileName, String documentId) {
        Map<String, Object> metadata = new HashMap<>(existingMetadata);
        metadata.put("fileName", fileName);
        metadata.put("documentId", documentId);
        metadata.put("type", "pdf");
        return metadata;
    }

    private PdfSourceDto toSource(Document document) {
        Map<String, Object> metadata = document.getMetadata();

        return new PdfSourceDto(
            document.getText(),
            String.valueOf(metadata.getOrDefault("fileName", "unknown.pdf")),
            String.valueOf(metadata.getOrDefault("documentId", "unknown")),
            String.valueOf(metadata.getOrDefault("page_number", metadata.getOrDefault("page", "unknown"))),
            document.getScore()
        );
    }

    private void validatePdf(MultipartFile file) {
        if (file == null || file.isEmpty()) {
            throw new IllegalArgumentException("PDF file is required");
        }

        String fileName = file.getOriginalFilename();
        if (fileName == null || !fileName.toLowerCase().endsWith(".pdf")) {
            throw new IllegalArgumentException("Only PDF files are allowed");
        }

        String contentType = file.getContentType();
        if (contentType != null && !contentType.equalsIgnoreCase("application/pdf")) {
            throw new IllegalArgumentException("Invalid content type. Expected application/pdf");
        }
    }

    private String cleanFileName(String fileName) {
        if (fileName == null || fileName.isBlank()) {
            return "uploaded.pdf";
        }
        return Path.of(fileName).getFileName().toString();
    }

    private String escapeFilterValue(String value) {
        return value.replace("'", "\\'");
    }

    private void deleteTempFile(Path tempFile) {
        if (tempFile == null) {
            return;
        }
        try {
            Files.deleteIfExists(tempFile);
        }
        catch (IOException ignored) {
            // Temporary file cleanup failure should not fail the user request.
        }
    }
}

Service Flow Explained

sequenceDiagram
    participant U as User
    participant C as Controller
    participant S as Service
    participant R as PDF Reader
    participant T as Token Splitter
    participant V as PGVector

    U->>C: Upload PDF
    C->>S: uploadAndIngest(file)
    S->>R: Read PDF pages
    R-->>S: Page documents
    S->>T: Split pages into chunks
    T-->>S: Smaller chunks
    S->>V: vectorStore.add(chunks)
    V-->>S: Stored embeddings
    S-->>C: documentId + chunk count
    C-->>U: Upload response

Step 7: Build the Controller

File: src/main/java/com/codewithvenu/pdfassistant/controller/PdfKnowledgeController.java

package com.codewithvenu.pdfassistant.controller;

import com.codewithvenu.pdfassistant.dto.AskPdfRequest;
import com.codewithvenu.pdfassistant.dto.AskPdfResponse;
import com.codewithvenu.pdfassistant.dto.PdfSearchRequest;
import com.codewithvenu.pdfassistant.dto.PdfSourceDto;
import com.codewithvenu.pdfassistant.dto.UploadPdfResponse;
import com.codewithvenu.pdfassistant.service.PdfKnowledgeService;
import jakarta.validation.Valid;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.util.List;
import java.util.Map;

@RestController
@RequestMapping("/api/pdf")
public class PdfKnowledgeController {

    private final PdfKnowledgeService pdfKnowledgeService;

    public PdfKnowledgeController(PdfKnowledgeService pdfKnowledgeService) {
        this.pdfKnowledgeService = pdfKnowledgeService;
    }

    @GetMapping("/health")
    public Map<String, String> health() {
        return Map.of("status", "UP", "service", "pdf-knowledge-assistant");
    }

    @PostMapping(value = "/upload", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    public UploadPdfResponse upload(@RequestParam("file") MultipartFile file) {
        return pdfKnowledgeService.uploadAndIngest(file);
    }

    @PostMapping("/search")
    public List<PdfSourceDto> search(@Valid @RequestBody PdfSearchRequest request) {
        return pdfKnowledgeService.search(request);
    }

    @PostMapping("/ask")
    public AskPdfResponse ask(@Valid @RequestBody AskPdfRequest request) {
        return pdfKnowledgeService.ask(request);
    }

    @PostMapping("/ask-manual")
    public AskPdfResponse askManual(@Valid @RequestBody AskPdfRequest request) {
        return pdfKnowledgeService.askManual(request);
    }
}

Step 8: Add Error Handling

File: src/main/java/com/codewithvenu/pdfassistant/exception/GlobalExceptionHandler.java

package com.codewithvenu.pdfassistant.exception;

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.MethodArgumentNotValidException;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(MethodArgumentNotValidException.class)
    public ResponseEntity<Map<String, Object>> handleValidation(MethodArgumentNotValidException ex) {
        Map<String, String> fieldErrors = new HashMap<>();

        ex.getBindingResult().getFieldErrors().forEach(error ->
            fieldErrors.put(error.getField(), error.getDefaultMessage())
        );

        Map<String, Object> body = new HashMap<>();
        body.put("timestamp", Instant.now());
        body.put("status", HttpStatus.BAD_REQUEST.value());
        body.put("error", "Validation failed");
        body.put("fields", fieldErrors);

        return ResponseEntity.badRequest().body(body);
    }

    @ExceptionHandler(IllegalArgumentException.class)
    public ResponseEntity<Map<String, Object>> handleBadRequest(IllegalArgumentException ex) {
        Map<String, Object> body = new HashMap<>();
        body.put("timestamp", Instant.now());
        body.put("status", HttpStatus.BAD_REQUEST.value());
        body.put("error", "Bad request");
        body.put("message", ex.getMessage());

        return ResponseEntity.badRequest().body(body);
    }

    @ExceptionHandler(Exception.class)
    public ResponseEntity<Map<String, Object>> handleException(Exception ex) {
        Map<String, Object> body = new HashMap<>();
        body.put("timestamp", Instant.now());
        body.put("status", HttpStatus.INTERNAL_SERVER_ERROR.value());
        body.put("error", "PDF assistant request failed");
        body.put("message", ex.getMessage());

        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(body);
    }
}

For production, do not expose raw exception messages. Log detailed errors internally and return safe messages to users.
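A minimal sketch of that idea in plain Java: log the full exception with a generated ID, and return only the ID and a generic message to the caller. The class `SafeErrorBody` and the `errorId` field are hypothetical names for this illustration. It uses java.util.logging to stay self-contained, whereas a Spring Boot application would normally log through SLF4J.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

final class SafeErrorBody {

    private static final Logger LOG = Logger.getLogger(SafeErrorBody.class.getName());

    // Logs the full exception internally and returns a body with no internal details.
    static Map<String, Object> from(Exception ex) {
        String errorId = UUID.randomUUID().toString();
        LOG.log(Level.SEVERE, "Request failed, errorId=" + errorId, ex);

        Map<String, Object> body = new LinkedHashMap<>();
        body.put("status", 500);
        body.put("error", "PDF assistant request failed");
        body.put("errorId", errorId); // lets support staff find the matching log entry
        return body;
    }
}
```

The generic handler could return this map instead of putting ex.getMessage() into the response.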

Step 9: Run the Application

Start PGVector:

docker compose up -d

Run Spring Boot:

mvn spring-boot:run

Health check:

curl http://localhost:8080/api/pdf/health

Expected output:

{
  "service": "pdf-knowledge-assistant",
  "status": "UP"
}

Step 10: Upload a PDF

Assume you have a file named spring-ai-guide.pdf.

Upload:

curl -X POST http://localhost:8080/api/pdf/upload \
  -F "file=@spring-ai-guide.pdf"

Expected response:

{
  "fileName": "spring-ai-guide.pdf",
  "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
  "pagesRead": 12,
  "chunksStored": 34
}

Save the documentId. You can use it later to ask questions only from this PDF.

Step 11: Search PDF Chunks

Search across all uploaded PDFs:

curl -X POST http://localhost:8080/api/pdf/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is ChatClient in Spring AI?",
    "topK": 3,
    "similarityThreshold": 0.65
  }'

Expected response shape:

[
  {
    "content": "ChatClient provides a fluent API for communicating with AI chat models...",
    "fileName": "spring-ai-guide.pdf",
    "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
    "page": "4",
    "score": 0.89
  }
]

Search only one PDF:

curl -X POST http://localhost:8080/api/pdf/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is ChatClient in Spring AI?",
    "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
    "topK": 3,
    "similarityThreshold": 0.65
  }'

Step 12: Ask Questions from PDFs

Ask across all PDFs:

curl -X POST http://localhost:8080/api/pdf/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Explain Spring AI ChatClient in simple terms.",
    "topK": 5,
    "similarityThreshold": 0.65
  }'

Expected response:

{
  "answer": "Spring AI ChatClient is a fluent API for communicating with AI chat models. It helps developers build prompts, send user messages, receive responses, and use features like advisors and streaming.",
  "sources": [
    {
      "content": "ChatClient provides a fluent API for communicating with AI chat models...",
      "fileName": "spring-ai-guide.pdf",
      "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
      "page": "4",
      "score": 0.89
    }
  ]
}

Ask from one specific PDF:

curl -X POST http://localhost:8080/api/pdf/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the main steps in a RAG pipeline?",
    "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
    "topK": 5,
    "similarityThreshold": 0.65
  }'

Expected answer style:

{
  "answer": "The main RAG pipeline steps are reading source documents, splitting them into chunks, creating embeddings, storing them in a vector database, retrieving relevant chunks for a question, and generating an answer using the retrieved context.",
  "sources": [
    {
      "content": "A typical RAG pipeline includes document loading, text splitting, embedding, vector storage, retrieval, and generation...",
      "fileName": "spring-ai-guide.pdf",
      "documentId": "d5b5a9d5-9c28-44b5-9b0c-6f4470c8d111",
      "page": "8",
      "score": 0.91
    }
  ]
}

Step 13: Test Manual RAG

The /ask endpoint uses QuestionAnswerAdvisor.

The /ask-manual endpoint shows the RAG process manually:

  1. Search PGVector.
  2. Join chunks into context.
  3. Put context into the prompt.
  4. Ask the chat model.

curl -X POST http://localhost:8080/api/pdf/ask-manual \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is PGVector used for in this PDF?",
    "topK": 5,
    "similarityThreshold": 0.65
  }'

This endpoint is useful for learning and debugging because you can see exactly how the context is passed to the model in code.
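Steps 2 and 3 are plain string handling, so they can be tried without Spring at all. The sketch below assumes the retrieved chunks are already available as strings. The class `ManualRagPrompt` is a hypothetical stand-in for what askManual does with the ChatClient template, using the same separator as the service.

```java
import java.util.List;

final class ManualRagPrompt {

    // Joins retrieved chunks with a visible separator and wraps them in a prompt.
    static String build(List<String> chunks, String question) {
        String context = String.join("\n\n---\n\n", chunks);
        return """
            Use the PDF context below to answer the question.

            PDF context:
            %s

            Question:
            %s
            """.formatted(context, question);
    }

    public static void main(String[] args) {
        // Prints the full prompt that would be sent to the chat model.
        System.out.println(build(
            List.of("PGVector adds a vector type to PostgreSQL.", "HNSW is an index type."),
            "What is PGVector used for?"));
    }
}
```

Printing the prompt like this is a quick way to check whether a poor answer comes from retrieval (wrong chunks) or from generation (right chunks, wrong answer).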

Input and Output Examples

Upload Input

curl -X POST http://localhost:8080/api/pdf/upload \
  -F "file=@employee-handbook.pdf"

Upload Output

{
  "fileName": "employee-handbook.pdf",
  "documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
  "pagesRead": 48,
  "chunksStored": 121
}

Question Input

{
  "question": "How many paid vacation days are available?",
  "documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
  "topK": 5,
  "similarityThreshold": 0.70
}

Question Output

{
  "answer": "The uploaded PDF says employees receive 15 paid vacation days per year after completing the probation period.",
  "sources": [
    {
      "content": "Full-time employees receive 15 paid vacation days per calendar year after successful completion of the probation period...",
      "fileName": "employee-handbook.pdf",
      "documentId": "900fba1d-90d7-4d75-a8a9-2bcf26870c40",
      "page": "12",
      "score": 0.93
    }
  ]
}

Important Concepts

| Concept | Simple Meaning |
|---|---|
| PDF reader | Extracts text from PDF pages |
| Document | Spring AI object containing text and metadata |
| Metadata | Extra information such as file name, page, document ID |
| Chunking | Splitting large text into smaller pieces |
| Embedding | Numeric vector that represents the meaning of text |
| VectorStore | Stores embeddings and searches for similar text |
| PGVector | PostgreSQL extension for vector search |
| RAG | Retrieve context first, then generate the answer |
| Source citation | Showing which PDF chunk was used |

Why Chunking Matters

Bad chunking creates bad answers.

If chunks are too large:

  • Retrieval may return noisy context.
  • Token usage increases.
  • The answer may be less precise.

If chunks are too small:

  • Context may be incomplete.
  • Important meaning can be split across chunks.
  • The answer may miss details.

Good starting values:

| Setting | Recommended Start |
|---|---|
| Chunk size | 800 tokens |
| Minimum chunk size | 350 characters |
| Top K | 3 to 5 |
| Similarity threshold | 0.65 to 0.75 |
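To get a feel for how chunk size and minimum size interact, here is a deliberately simplified chunker in plain Java. It counts whitespace-separated words instead of model tokens and merges a too-small trailing piece into the previous chunk. This is an illustration only, not the algorithm TokenTextSplitter uses, and `WordChunker` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WordChunker {

    // Splits text into chunks of at most maxWords words.
    // A trailing chunk shorter than minWords is merged into the previous one.
    static List<String> chunk(String text, int maxWords, int minWords) {
        if (text == null || text.isBlank()) {
            return List.of();
        }
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            String piece = String.join(" ", Arrays.copyOfRange(words, start, end));
            boolean tooSmall = (end - start) < minWords && !chunks.isEmpty();
            if (tooSmall) {
                int last = chunks.size() - 1;
                chunks.set(last, chunks.get(last) + " " + piece);
            } else {
                chunks.add(piece);
            }
        }
        return chunks;
    }
}
```

For example, `WordChunker.chunk("a b c d e f g", 3, 2)` yields two chunks, "a b c" and "d e f g", because the single leftover word is too small to stand alone.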

Common Problems and Fixes

| Problem | Cause | Fix |
|---|---|---|
| Upload fails | File is not PDF | Check file extension and content type |
| No answers found | Similarity threshold too high | Lower threshold to 0.60 or 0.65 |
| Irrelevant answers | Threshold too low | Increase threshold to 0.75 |
| Wrong PDF used | No document filter | Pass documentId in ask request |
| Slow upload | Large PDF or many chunks | Limit file size and tune chunking |
| Table missing | Schema not initialized | Set initialize-schema: true |
| Vector dimension error | Embedding model changed | Recreate vector table with correct dimension |
| Poor PDF text | Scanned PDF image | Use OCR before ingestion |
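The file extension and content type used for the "Upload fails" check are both supplied by the client. A slightly stronger check is to look at the first bytes of the file, since a PDF normally begins with the marker %PDF-. The sketch below is a hypothetical helper (`PdfSignature` is not part of the guide's service), and it is a sanity check rather than a full validation of the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfSignature {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the first bytes of the upload equal the PDF header marker.
    static boolean looksLikePdf(byte[] head) {
        return head != null
            && head.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }
}
```

In the service, the first few bytes could be read from the upload's input stream and passed to `looksLikePdf` before the temporary file is created.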

Scanned PDFs and OCR

PagePdfDocumentReader extracts text from PDFs that already contain text.

If your PDF is scanned as images, the reader may extract little or no text. In that case, add OCR before Spring AI ingestion.

Common OCR options:

  • Tesseract OCR.
  • AWS Textract.
  • Azure Document Intelligence.
  • Google Document AI.

OCR flow:

flowchart LR
    PDF["Scanned PDF"] --> OCR["OCR Service"]
    OCR --> Text["Extracted Text"]
    Text --> Chunk["TokenTextSplitter"]
    Chunk --> Vector["PGVector"]
    Vector --> RAG["RAG Answer"]
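Before sending every file through OCR, you can estimate whether a PDF is scanned by checking how much text the reader extracted per page. The heuristic below is a sketch under stated assumptions: the class `ScannedPdfHeuristic`, the 20-character minimum, and the 50% cutoff are all illustrative values, not recommendations from Spring AI.

```java
import java.util.List;

final class ScannedPdfHeuristic {

    // Fraction of pages whose extracted text has fewer than minChars non-whitespace characters.
    static double sparsePageRatio(List<String> pageTexts, int minChars) {
        if (pageTexts.isEmpty()) {
            return 1.0; // nothing was extracted at all
        }
        long sparse = pageTexts.stream()
            .filter(t -> t == null || t.replaceAll("\\s+", "").length() < minChars)
            .count();
        return (double) sparse / pageTexts.size();
    }

    // Hypothetical rule: route the file to OCR when most pages are nearly empty.
    static boolean needsOcr(List<String> pageTexts) {
        return sparsePageRatio(pageTexts, 20) > 0.5;
    }
}
```

The page texts would come from the Document objects returned by the PDF reader. Files flagged by `needsOcr` go through the OCR flow above, and the rest are ingested directly.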

Production Checklist

Before production, add:

  1. Authentication and authorization.
  2. Tenant-based documentId or tenantId filtering.
  3. Virus scanning for uploaded PDFs.
  4. File size and page count limits.
  5. OCR support for scanned PDFs.
  6. Persistent document metadata table.
  7. Delete and re-index functionality.
  8. Duplicate file detection.
  9. Source citations in final UI.
  10. Logging for ingestion, retrieval, token usage, and model latency.
  11. Evaluation questions for each PDF collection.
  12. Prompt injection checks for PDF content.
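As one example from this list, duplicate file detection (item 8) can start with a content hash: two uploads with identical bytes produce the same SHA-256 digest. The sketch below keeps the hashes in memory for simplicity. The class `DuplicateDetector` is hypothetical, and a real deployment would more likely store the hash in the document metadata table from item 6.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.Set;

final class DuplicateDetector {

    private final Set<String> seenHashes = new HashSet<>();

    // SHA-256 of the raw file bytes, as lowercase hex.
    static String sha256(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns true the first time a given content is seen, false for repeats.
    boolean registerIfNew(byte[] content) {
        return seenHashes.add(sha256(content));
    }
}
```

This only catches byte-identical files. The same document exported twice with different PDF metadata would hash differently, so it is a first filter rather than a complete answer.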

Complete Test Script

curl http://localhost:8080/api/pdf/health

curl -X POST http://localhost:8080/api/pdf/upload \
  -F "file=@spring-ai-guide.pdf"

curl -X POST http://localhost:8080/api/pdf/search \
  -H "Content-Type: application/json" \
  -d '{"query":"What is Spring AI ChatClient?","topK":3,"similarityThreshold":0.65}'

curl -X POST http://localhost:8080/api/pdf/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"What is Spring AI ChatClient?","topK":5,"similarityThreshold":0.65}'

curl -X POST http://localhost:8080/api/pdf/ask-manual \
  -H "Content-Type: application/json" \
  -d '{"question":"What is RAG?","topK":5,"similarityThreshold":0.65}'

Summary

You built a PDF knowledge assistant with Spring AI.

The key pipeline is:

  1. Upload PDF.
  2. Read PDF pages with PagePdfDocumentReader.
  3. Convert pages into Spring AI Document objects.
  4. Split text using TokenTextSplitter.
  5. Store chunks in PGVector using VectorStore.
  6. Retrieve relevant chunks for a question.
  7. Use ChatClient and RAG to generate a grounded answer.
  8. Return sources so the user can trust the answer.

This pattern is the foundation for:

  • HR policy assistants.
  • Legal PDF assistants.
  • Insurance document Q&A.
  • Banking policy Q&A.
  • Technical manual assistants.
  • PDF-based customer support bots.
