AWS X-Ray with Spring Boot - Distributed Tracing

Learn how to implement distributed tracing in Spring Boot applications using AWS X-Ray to monitor microservices, diagnose latency, and troubleshoot production issues.


Introduction

As applications evolve into microservices, a single user request may travel through multiple services, databases, queues, and external APIs. When performance degrades or failures occur, logs and metrics alone often cannot reveal where the problem originated.

AWS X-Ray provides distributed tracing, allowing developers to visualize the complete lifecycle of a request across all components. It records request paths, measures latency, identifies bottlenecks, and highlights errors, making troubleshooting significantly faster.


Why Distributed Tracing?

Imagine an e-commerce platform where placing an order involves:

  • API Gateway
  • Order Service
  • Inventory Service
  • Payment Service
  • Notification Service
  • PostgreSQL Database
  • Amazon SQS

A customer reports that order placement takes 12 seconds.

Without tracing:

  • Each service must be checked individually.
  • Logs need to be manually correlated.
  • Root cause analysis is slow.

With X-Ray:

  • The complete request flow is visible.
  • Each service's latency is measured.
  • Errors are pinpointed immediately.
  • Dependencies are automatically mapped.

High-Level Architecture

flowchart LR
    U[User]
    APIGW[API Gateway]
    ORDER[Order Service]
    PAYMENT[Payment Service]
    INVENTORY[Inventory Service]
    DB[(PostgreSQL)]
    SQS[Amazon SQS]
    EMAIL[Notification Service]
    XRAY[AWS X-Ray]

    U --> APIGW
    APIGW --> ORDER
    ORDER --> PAYMENT
    ORDER --> INVENTORY
    ORDER --> DB
    ORDER --> SQS
    SQS --> EMAIL

    APIGW --> XRAY
    ORDER --> XRAY
    PAYMENT --> XRAY
    INVENTORY --> XRAY
    EMAIL --> XRAY

Understanding Tracing Concepts

Trace

A trace represents the complete journey of a request from start to finish.

Example:

Customer clicks "Place Order"

↓

API Gateway

↓

Order Service

↓

Payment Service

↓

Database

↓

Notification Service

↓

Response Returned
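X-Ray identifies each trace by an ID of the form 1-<8 hex digits of the start time in epoch seconds>-<24 hex digits of random data>. The sketch below builds and checks such an ID in plain Java. The class and method names (TraceIds, newTraceId, and so on) are illustrative assumptions and not part of the X-Ray SDK, which normally generates the ID for you.

```java
import java.security.SecureRandom;
import java.util.regex.Pattern;

/** Illustrative helper, not an X-Ray SDK class. */
final class TraceIds {
    // Version "1", 8 hex digits of epoch seconds, 24 hex digits of randomness.
    private static final Pattern FORMAT =
            Pattern.compile("1-[0-9a-f]{8}-[0-9a-f]{24}");
    private static final SecureRandom RANDOM = new SecureRandom();

    private TraceIds() {}

    /** Builds a trace ID for a request that started at the given epoch second. */
    static String newTraceId(long epochSeconds) {
        byte[] random = new byte[12]; // 12 bytes = 24 hex digits
        RANDOM.nextBytes(random);
        StringBuilder id = new StringBuilder("1-");
        id.append(String.format("%08x", epochSeconds)).append('-');
        for (byte b : random) {
            id.append(String.format("%02x", b & 0xff));
        }
        return id.toString();
    }

    /** Checks only the shape of the ID, not whether the trace exists. */
    static boolean isValid(String traceId) {
        return traceId != null && FORMAT.matcher(traceId).matches();
    }

    /** Extracts the start time (epoch seconds) encoded in a valid trace ID. */
    static long startEpochSeconds(String traceId) {
        if (!isValid(traceId)) {
            throw new IllegalArgumentException("Not an X-Ray trace ID: " + traceId);
        }
        return Long.parseLong(traceId.substring(2, 10), 16);
    }
}
```

Because the start time is embedded in the ID, a trace ID copied from a log line already tells you roughly when the request began.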

Segment

Each AWS service or application contributes a segment to the trace.

Example:

Order Service

↓

Payment Service

↓

Inventory Service

Each segment contains:

  • Start Time
  • End Time
  • Response Status
  • Errors
  • Metadata

Subsegment

Within a service, smaller operations are captured as subsegments.

Example:

Order Service

├── Validate Request
├── Save Order
├── Call Payment API
├── Query Inventory
└── Publish SQS Message
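One way to picture the relationship between a segment and its subsegments is as a small data model. The sketch below is a simplified assumption for illustration, not the X-Ray SDK's own types: it keeps only a name, start and end times in epoch seconds, and an error flag, and it assumes subsegments do not overlap.

```java
import java.util.List;

/** Illustrative model of a timed operation inside a service. */
record Subsegment(String name, double startTime, double endTime) {
    double durationSeconds() {
        return endTime - startTime;
    }
}

/** Illustrative model of one service's contribution to a trace. */
record Segment(String name, double startTime, double endTime,
               boolean error, List<Subsegment> subsegments) {

    double durationSeconds() {
        return endTime - startTime;
    }

    /** Time inside the segment that no subsegment accounts for. */
    double untracedSeconds() {
        double traced = subsegments.stream()
                .mapToDouble(Subsegment::durationSeconds)
                .sum();
        return durationSeconds() - traced;
    }
}
```

A large untraced remainder is a hint that an expensive step inside the service is not yet captured as a subsegment.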

Request Flow

sequenceDiagram
    participant User
    participant Gateway
    participant Order
    participant Payment
    participant Database
    participant XRay

    User->>Gateway: POST /orders
    Gateway->>Order: Forward Request
    Order->>Payment: Process Payment
    Payment-->>Order: Success
    Order->>Database: Save Order
    Database-->>Order: Saved
    Order->>XRay: Send Trace Data
    Order-->>Gateway: Response
    Gateway-->>User: Order Created

Spring Boot Integration

Required Dependencies

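Add Spring Boot Actuator and the AWS X-Ray SDK for Java (or, for new projects, prefer OpenTelemetry with the AWS Distro for OpenTelemetry). The X-Ray dependency below omits its version on the assumption that aws-xray-recorder-sdk-bom is imported in dependencyManagement; otherwise add an explicit <version> element.

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-xray-recorder-sdk-spring</artifactId>
</dependency>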

Instrumenting Business Logic

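Annotate critical operations so traces include business context. With the X-Ray SDK's Spring module, marking a class with @XRayEnabled lets an X-Ray AOP interceptor, which must be configured separately in the application, wrap calls to its methods in subsegments. OrderRepository and Order.from below are hypothetical placeholders.

import com.amazonaws.xray.spring.aop.XRayEnabled;
import org.springframework.stereotype.Service;

@Service
@XRayEnabled // has an effect only when an X-Ray interceptor aspect is configured
public class OrderService {

    private final OrderRepository orderRepository; // hypothetical repository

    public OrderService(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    public Order createOrder(OrderRequest request) {
        // Validate request
        // Call Payment Service
        // Update Inventory

        // Save to Database (Order.from is a hypothetical factory method)
        Order order = orderRepository.save(Order.from(request));

        // Publish Event
        return order;
    }
}

This allows X-Ray to record the execution time of createOrder as a subsegment. For each step to appear separately in the trace, the step needs its own traced method or an explicit subsegment. OpenTelemetry provides equivalent span instrumentation.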
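To make per-step timing concrete, here is a minimal plain-Java sketch of what a tracing library records for each step. StepTimer is a hypothetical stand-in written for this article, not an X-Ray or OpenTelemetry API; with the X-Ray SDK, AWSXRay.beginSubsegment and AWSXRay.endSubsegment play this role. The clock is passed in so the behavior can be tested deterministically.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative stand-in for subsegments: records how long each named step takes. */
final class StepTimer {
    private final LongSupplier nanoClock;
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    StepTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the step and records its duration, even when the step throws. */
    <T> T time(String stepName, Supplier<T> step) {
        long start = nanoClock.getAsLong();
        try {
            return step.get();
        } finally {
            elapsedNanos.merge(stepName, nanoClock.getAsLong() - start, Long::sum);
        }
    }

    /** Read-only view of the recorded durations, in the order steps first ran. */
    Map<String, Long> elapsedNanos() {
        return Collections.unmodifiableMap(elapsedNanos);
    }
}
```

In application code the timer would be created with new StepTimer(System::nanoTime) and each step of createOrder wrapped in a call such as timer.time("Save Order", ...). The finally block matters: a failed step is often the one whose duration you most want to see.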


Typical Trace Timeline

API Gateway                15 ms

↓

Order Service             120 ms

↓

Payment Service          1800 ms

↓

Database                  60 ms

↓

Notification              45 ms

↓

Total Request          2040 ms

In this example, the Payment Service is the performance bottleneck.
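The same reading of the timeline can be expressed in a few lines of Java. The sketch below uses the example durations from this section; the TraceTimeline class is illustrative and treats the steps as sequential, as the timeline does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative analysis of the example timeline; durations are in milliseconds. */
final class TraceTimeline {
    private TraceTimeline() {}

    /** The example values from the timeline in this section. */
    static Map<String, Long> example() {
        Map<String, Long> millis = new LinkedHashMap<>();
        millis.put("API Gateway", 15L);
        millis.put("Order Service", 120L);
        millis.put("Payment Service", 1800L);
        millis.put("Database", 60L);
        millis.put("Notification", 45L);
        return millis;
    }

    static long totalMillis(Map<String, Long> millis) {
        return millis.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Returns the name of the slowest step, or null for an empty timeline. */
    static String bottleneck(Map<String, Long> millis) {
        return millis.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

For the example data, the Payment Service accounts for 1800 of 2040 ms, roughly 88% of the request.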


Service Map

flowchart TD
    CLIENT[Client]
    API[API Gateway]
    ORDER[Order Service]
    PAYMENT[Payment Service]
    INVENTORY[Inventory Service]
    DB[(PostgreSQL)]
    SQS[Amazon SQS]

    CLIENT --> API
    API --> ORDER
    ORDER --> PAYMENT
    ORDER --> INVENTORY
    ORDER --> DB
    ORDER --> SQS

The service map highlights:

  • Dependencies
  • Latency
  • Error rates
  • Request volume
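Conceptually, each edge of a service map is an aggregate over many individual calls. The following sketch shows one way such per-edge figures could be derived from raw call records. The Call, EdgeStats, and ServiceMapSummary types are assumptions made for illustration; X-Ray computes these values for you.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One observed call between two services (illustrative type). */
record Call(String from, String to, long latencyMillis, boolean failed) {}

/** Aggregated figures for one edge of the map (illustrative type). */
record EdgeStats(long requests, double averageLatencyMillis, double errorRate) {}

final class ServiceMapSummary {
    private ServiceMapSummary() {}

    /** Groups calls by "caller -> callee" and summarizes each edge. */
    static Map<String, EdgeStats> summarize(List<Call> calls) {
        // Per edge: {request count, latency sum, failure count}
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (Call call : calls) {
            long[] t = totals.computeIfAbsent(
                    call.from() + " -> " + call.to(), key -> new long[3]);
            t[0]++;
            t[1] += call.latencyMillis();
            if (call.failed()) {
                t[2]++;
            }
        }
        Map<String, EdgeStats> stats = new LinkedHashMap<>();
        totals.forEach((edge, t) -> stats.put(edge,
                new EdgeStats(t[0], (double) t[1] / t[0], (double) t[2] / t[0])));
        return stats;
    }
}
```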

Monitoring External Calls

Distributed tracing is valuable for:

  • REST APIs
  • Databases
  • Kafka
  • Amazon SQS
  • Amazon SNS
  • Redis
  • External payment gateways

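Each instrumented outbound call becomes part of the trace, typically as a subsegment of the calling service's segment. Calls made through clients that are not instrumented do not appear.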
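For an outbound HTTP call to join the caller's trace, the request has to carry the trace context. X-Ray uses the X-Amzn-Trace-Id header, with a value such as Root=<trace id>;Parent=<segment id>;Sampled=1. Instrumented clients add it automatically; the sketch below shows the manual equivalent using the JDK's HttpRequest builder. TraceHeaders and its methods are illustrative names, not SDK classes.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative helper for propagating trace context by hand. */
final class TraceHeaders {
    static final String HEADER_NAME = "X-Amzn-Trace-Id";

    private TraceHeaders() {}

    /** Formats the header value from a trace ID, the caller's segment ID, and the sampling decision. */
    static String format(String traceId, String parentId, boolean sampled) {
        return "Root=" + traceId
                + ";Parent=" + parentId
                + ";Sampled=" + (sampled ? "1" : "0");
    }

    /** Builds a GET request that carries the trace context to the downstream service. */
    static HttpRequest tracedGet(URI uri, String traceId, String parentId, boolean sampled) {
        return HttpRequest.newBuilder(uri)
                .header(HEADER_NAME, format(traceId, parentId, sampled))
                .GET()
                .build();
    }
}
```

The request is only built here, not sent, so the helper can be checked without a network connection.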


Error Analysis

If a payment gateway fails:

Order Service

↓

Payment Service

↓

HTTP 500

↓

Retry

↓

Timeout

↓

Order Failed

The trace shows:

  • Error location
  • Exception
  • Retry duration
  • Total impact

Sampling

Tracing every request can increase storage and cost.

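Use sampling rules to trace, for example:

  • 100% of critical APIs
  • 10% of normal traffic

X-Ray sampling rules match on request attributes such as service name, HTTP method, and URL path, and the decision is made when the request starts, so a rule cannot select requests by outcome. Keeping 100% of errors generally requires tail-based sampling, for example in an OpenTelemetry Collector.

This balances visibility with cost.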
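An X-Ray sampling rule combines a reservoir (a guaranteed number of traced requests per second) with a fixed rate applied to the remaining requests; the default rule traces the first request each second plus 5% of the rest. The sketch below shows that logic in plain Java. It is not the SDK's implementation, and the clock and random source are injected so the decision can be tested deterministically.

```java
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

/** Illustrative reservoir-plus-rate sampler; not the X-Ray SDK's implementation. */
final class ReservoirSampler {
    private final int reservoirPerSecond;
    private final double fixedRate;
    private final LongSupplier epochSeconds;
    private final DoubleSupplier random;

    private long currentSecond = Long.MIN_VALUE;
    private int takenThisSecond;

    ReservoirSampler(int reservoirPerSecond, double fixedRate,
                     LongSupplier epochSeconds, DoubleSupplier random) {
        if (reservoirPerSecond < 0 || fixedRate < 0.0 || fixedRate > 1.0) {
            throw new IllegalArgumentException("reservoir must be >= 0 and rate within [0, 1]");
        }
        this.reservoirPerSecond = reservoirPerSecond;
        this.fixedRate = fixedRate;
        this.epochSeconds = epochSeconds;
        this.random = random;
    }

    /** Decides, at the start of a request, whether it should be traced. */
    synchronized boolean shouldSample() {
        long now = epochSeconds.getAsLong();
        if (now != currentSecond) {
            // A new second has started: refill the reservoir.
            currentSecond = now;
            takenThisSecond = 0;
        }
        if (takenThisSecond < reservoirPerSecond) {
            takenThisSecond++;
            return true;
        }
        // Reservoir exhausted: fall back to the fixed rate.
        return random.getAsDouble() < fixedRate;
    }
}
```

A sampler for the 10% case could be created with new ReservoirSampler(1, 0.10, () -> System.currentTimeMillis() / 1000, Math::random). The reservoir guarantees that even a low-traffic endpoint produces some traces.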


Deployment Options

AWS X-Ray (or AWS Distro for OpenTelemetry) supports:

  • Amazon EC2
  • Amazon ECS
  • Amazon EKS
  • AWS Lambda
  • AWS Elastic Beanstalk
  • Hybrid environments

CloudWatch Integration

Tracing works best alongside logs and metrics.

flowchart LR
    APP[Spring Boot]
    LOGS[CloudWatch Logs]
    METRICS[CloudWatch Metrics]
    XRAY[X-Ray Traces]
    DASHBOARD[CloudWatch Dashboard]

    APP --> LOGS
    APP --> METRICS
    APP --> XRAY

    LOGS --> DASHBOARD
    METRICS --> DASHBOARD
    XRAY --> DASHBOARD

This provides a complete observability solution:

  • Logs explain what happened.
  • Metrics show how the system is performing.
  • Traces reveal where time is spent.

Production Best Practices

  • Trace all user-facing APIs.
  • Add meaningful operation names.
  • Correlate traces with request IDs.
  • Capture database and outbound HTTP calls.
  • Use sampling to control costs.
  • Avoid storing sensitive data in traces.
  • Monitor latency trends over time.
  • Combine traces with centralized logging.
  • Integrate alarms for high latency and error rates.
  • Review service maps regularly to identify new bottlenecks.
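As one illustration of the advice to keep sensitive data out of traces, the sketch below redacts selected keys before a metadata map is attached to a trace. The class name and the deny-list of keys are hypothetical and would need to match the fields your services actually handle.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative redaction step applied before metadata is attached to a trace. */
final class TraceMetadataSanitizer {
    // Hypothetical deny-list, compared case-insensitively.
    private static final Set<String> SENSITIVE_KEYS =
            Set.of("password", "cardnumber", "cvv", "authorization");

    private TraceMetadataSanitizer() {}

    /** Returns a copy of the metadata with sensitive values replaced. */
    static Map<String, Object> sanitize(Map<String, Object> metadata) {
        Map<String, Object> safe = new LinkedHashMap<>();
        metadata.forEach((key, value) -> safe.put(key,
                SENSITIVE_KEYS.contains(key.toLowerCase(Locale.ROOT)) ? "[REDACTED]" : value));
        return safe;
    }
}
```

A deny-list is easy to start with, but an allow-list of known-safe keys is the more conservative choice once the set of trace fields is stable.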

Common Troubleshooting

Issue                    Possible Cause                       Resolution
No traces visible        Missing IAM permissions              Grant the workload's role X-Ray write permissions (for example, the AWSXRayDaemonWriteAccess managed policy)
Partial traces           Downstream service not instrumented  Enable tracing across all services
Missing database spans   JDBC instrumentation disabled        Enable database tracing
High tracing cost        Sampling rate too high               Reduce the sampling percentage
Incomplete request flow  Context propagation missing          Ensure trace headers are forwarded between services
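When diagnosing missing context propagation, it helps to check what a service actually receives. The sketch below parses an incoming X-Amzn-Trace-Id header value of the form Root=...;Parent=...;Sampled=1. TraceContext is an illustrative type, not an SDK class, and it treats any Sampled value other than 1 as not sampled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative view of the trace context carried by an incoming request. */
record TraceContext(String root, Optional<String> parent, boolean sampled) {

    /** Parses the header value; returns empty when it is missing or has no Root field. */
    static Optional<TraceContext> parse(String headerValue) {
        if (headerValue == null || headerValue.isBlank()) {
            return Optional.empty();
        }
        Map<String, String> fields = new HashMap<>();
        for (String part : headerValue.split(";")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                fields.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
            }
        }
        String root = fields.get("Root");
        if (root == null) {
            return Optional.empty();
        }
        return Optional.of(new TraceContext(
                root,
                Optional.ofNullable(fields.get("Parent")),
                "1".equals(fields.get("Sampled"))));
    }
}
```

An empty result at a downstream service usually means that a proxy, gateway, or hand-written HTTP client upstream dropped the header.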

Enterprise Observability Architecture

flowchart TD
    USER[Users]

    USER --> LB[Load Balancer]

    LB --> ORDER[Order Service]

    ORDER --> PAYMENT[Payment Service]

    ORDER --> INVENTORY[Inventory Service]

    PAYMENT --> DB[(Database)]

    INVENTORY --> REDIS[(Redis)]

    ORDER --> KAFKA[Kafka]

    ORDER --> LOGS[CloudWatch Logs]

    ORDER --> METRICS[CloudWatch Metrics]

    ORDER --> TRACE[AWS X-Ray / OpenTelemetry]

    LOGS --> DASH[CloudWatch Dashboard]

    METRICS --> DASH

    TRACE --> DASH

    DASH --> DEVOPS[Operations Team]

X-Ray vs Logs vs Metrics

Capability           Logs     Metrics   Traces
Error Details
Performance Trends
Request Path
Root Cause Analysis  Limited  Limited   Excellent
Business Insights    Limited  Moderate  Strong

Interview Questions

  1. What is distributed tracing?
  2. How does X-Ray differ from CloudWatch Logs?
  3. What is a trace, segment, and subsegment?
  4. Why is context propagation important?
  5. How do sampling rules reduce cost?
  6. How would you trace requests across microservices?
  7. How do you diagnose latency using a trace timeline?
  8. Why should logs, metrics, and traces be used together?

Summary

Distributed tracing provides end-to-end visibility into modern applications. By integrating Spring Boot with AWS X-Ray (or OpenTelemetry on AWS), teams can follow requests across services, identify latency bottlenecks, troubleshoot failures quickly, and improve overall application reliability.

A production-ready observability strategy combines:

  • CloudWatch Logs for detailed diagnostics
  • CloudWatch Metrics for health monitoring
  • CloudWatch Alarms for proactive alerting
  • AWS X-Ray / OpenTelemetry for end-to-end request tracing

Together, these capabilities enable faster incident response, better performance optimization, and greater confidence when operating distributed Spring Boot applications on AWS.

