Distributed Tracing Pattern for GraphQL API Gateways and GraphQL APIs

Wouldn't it be nice if you could understand what exactly is happening in your GraphQL Servers and API Gateways? That's what this pattern is all about.

Problem

Let's say you're an advanced GraphQL user and you're running multiple Services behind a GraphQL API Gateway using the API Aggregation Pattern. You're using a client like Relay that supports Fragments, and you're leveraging the Persisted Operations Pattern to persist all Operations on the Server.

You've realized that one of your Operations is more than two standard deviations slower than all other Operations. But why is it slow? Is it the Gateway? Is it one of the Services? Is it the database of one of the Services? We don't really know, because we have no visibility into what happens when a GraphQL Operation is executed.

Solution

The solution is to use Distributed Tracing, a technique that allows you to create a trace of all the events that happen when a request is processed. It's called "distributed" because it lets you trace events across multiple services, not just the Gateway or a single GraphQL service.

You might be asking: how is it possible to generate traces across multiple services when every service might use a different programming language and different frameworks? And how is it possible to build tracing components without locking yourself into a specific vendor?

The answer is standardization, and the most popular standard for Distributed Tracing is OpenTelemetry, or OTEL for short. OpenTelemetry is a vendor-neutral standard for Distributed Tracing, and it's supported by all major cloud providers and many other vendors. Almost every production-grade observability solution supports OpenTelemetry these days.

How does it work?

  1. You need an OpenTelemetry compatible Backend that can receive traces
  2. You need an OpenTelemetry Collector that can receive traces and forward them to the Backend in batches
  3. You need to instrument your services with an OTEL SDK and send the traces to the Collector (see the sketch after this list)
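
For a Node.js GraphQL service, step 3 could look roughly like the following. This is a minimal sketch, assuming the OpenTelemetry JavaScript SDK packages shown below and a Collector reachable over OTLP/HTTP at a placeholder address; the service name is also a placeholder, and exact package and option names vary by SDK version.

```typescript
// tracing.ts — start this before the GraphQL server handles requests.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { GraphQLInstrumentation } from '@opentelemetry/instrumentation-graphql';

const sdk = new NodeSDK({
  serviceName: 'graphql-gateway', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    // placeholder Collector endpoint (default OTLP/HTTP traces path)
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),    // spans for inbound/outbound HTTP calls
    new GraphQLInstrumentation(), // spans for parse/validate/execute/resolve
  ],
});

sdk.start();

// Flush buffered spans on shutdown so they are not lost.
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});
```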

The first service to receive a request will add a trace context to the request. This context is forwarded to all other services involved in processing the request, using Headers or other mechanisms. This way, the OTEL backend can correlate all events that belong to the same request, which gives us a meaningful trace, from the edge proxy through the GraphQL API Gateway down to the (Micro) Service layer and even the database.
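
To make that mechanism concrete, here is a sketch of what instrumentation libraries normally do for you under the W3C Trace Context standard, using the propagation API from @opentelemetry/api; the downstream URL is a placeholder.

```typescript
import { context, propagation } from '@opentelemetry/api';

// Manual context propagation; HTTP instrumentation usually does this for you.
async function callUserService(): Promise<Response> {
  const headers: Record<string, string> = {};

  // Writes the active trace context into the carrier, typically as a
  // "traceparent" header such as:
  // 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
  propagation.inject(context.active(), headers);

  // The downstream service extracts the same context from the headers,
  // so its spans are correlated with the trace started at the edge.
  return fetch('http://user-service/graphql', { headers });
}
```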

Considerations

There are a few things to consider when implementing Distributed Tracing.

OpenTelemetry Collector

When setting up your own OTEL Collector, you need to make sure batching is configured correctly. Depending on the number of traces you're processing and the OTEL backend you're using, tune batch sizes and timeouts to avoid performance issues and to make sure that traces are not lost.
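
The same trade-offs show up on the SDK side, where they are easiest to illustrate. Here is a sketch of explicit batch tuning with the JS SDK's BatchSpanProcessor; the numbers are illustrative rather than recommendations, the Collector's own batch processor exposes analogous settings, and the exact wiring varies by SDK version.

```typescript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const provider = new NodeTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(new OTLPTraceExporter(), {
      maxQueueSize: 4096,         // spans buffered before new ones are dropped
      maxExportBatchSize: 512,    // spans sent per export request
      scheduledDelayMillis: 5000, // wait this long before flushing a batch
      exportTimeoutMillis: 30000, // give up on a single export after this
    }),
  ],
});

provider.register();
```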

OTEL Collectors are stateless, so you can easily scale them horizontally to handle more traces.

If you're looking to reduce the probability of losing traces, you might want to consider using an intermediate storage like Kafka. With such a configuration, the Collector can acknowledge receipt of a trace as soon as it is written to Kafka and process it asynchronously in a separate process. If the Collector crashes or is restarted before acknowledging a trace, the SDK will resend the trace to another Collector instance. In addition, a Kafka "buffer" can help you handle spikes in traffic.

Sampling

Sampling is a technique that allows you to reduce the number of traces you're processing. When dealing with high traffic volumes, it's not feasible to create a trace for every request, at least not end-to-end.

For billing purposes, you might want to create a trace for every request, but sample within the trace to reduce the number of spans. For example, you can create a span at the edge proxy with the name of the GraphQL Operation, but then apply a sampling rate of 0.1 to all subsequent spans. You will still have a trace for every request, but only every 10th request will produce a trace that contains spans for all services involved.
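
In the JS SDK, the usual building block for such a rate is a trace-ID-based ratio sampler. A minimal sketch, assuming downstream services configure it at 0.1; the exact edge-always/downstream-sampled split described above is a deployment choice layered on top of this.

```typescript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const provider = new NodeTracerProvider({
  // Decides deterministically from the trace ID, so every service in the
  // chain makes the same keep/drop decision: ~10% of traces get full spans.
  sampler: new TraceIdRatioBasedSampler(0.1),
});

provider.register();
```

If you want services to always honor an upstream sampling decision instead of deciding locally, wrap the ratio sampler in a ParentBasedSampler as its root sampler.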

Depending on your traffic volume, you might want to sample more or less aggressively, but a sample rate of 1, i.e. tracing every request end-to-end, is usually too high because tracing is not free.