GraphQL API Gateway Patterns

Complexity-based Rate Limiting & Quotas for GraphQL APIs

When applying rate limiting or quotas to REST APIs, the most common approach is to limit the number of requests per time window. For example, you might allow 100 requests per minute. This approach works well for APIs that uniquely identify a resource with a URL. But what about GraphQL APIs?

Problem

GraphQL APIs are different from REST APIs in that there's usually only one endpoint, typically something like https://api.example.com/graphql. The client sends a GraphQL Operation to the server using an HTTP POST request.

So, how should we actually rate limit a GraphQL API? Simply limiting the number of requests per time window is not enough, as every GraphQL request can contain arbitrary complexity through nesting and batching via aliases. You might have also heard of the N+1 problem in GraphQL and how it can impact performance.

If you've got some experience with GraphQL, you might have heard of approaches to solve this problem, like calculating the complexity of a GraphQL Operation and limiting it.

For example, you might be thinking to "just limit the depth" or "the number of fields", but does that really solve the problem? If you dig deeper, you'll find that implementing a good complexity algorithm is not that easy, and there are some dangerous pitfalls.

Understanding Complexity of GraphQL Operations

Before we look at some example GraphQL Operations, let's first define what we mean by "complexity".

In this context, complexity is a measure of how much work a GraphQL Operation requires to be executed. But how can we articulate this in a way that we can actually measure it?

First, we can talk about the number of fields in a GraphQL Operation. An Operation with 10 fields is more complex than an Operation with 5 fields.

Second, we can talk about the depth of a GraphQL Operation. The depth is the number of nested fields in a GraphQL Operation. An Operation with a depth of 5 is more complex than an Operation with a depth of 3. Well, actually it's not that easy. What if an Operation with a depth of 5 results in one single database query, whereas another Operation with a depth of 3 results in 3 database queries? Which one is more complex?

As you can see, we cannot simply look at the number of fields or the depth of a GraphQL Operation. Statically analysing an Operation is a good start, but it has its limitations.

Third, we need to look at the (potential) number of Nodes that are being returned by a GraphQL Operation. What's a Node? A Node is simply an object in the response. If you have a GraphQL Operation that returns a list of 10 Users, then you have 11 Nodes in the response, if we count the parent of the list as well. Why potential? When a GraphQL Operation can return a list of Nodes, or even lists of lists of Nodes, then we need to consider both the maximum number of Nodes that could theoretically be returned and the number of Nodes that are actually returned in the response.

Summarizing, we can say that the complexity of a GraphQL Operation is a combination of:

  • the number of fields
  • the depth
  • the underlying data sources
  • the number of Nodes in the response

Let's look at some examples to illustrate this.

Example 1 - Simple Query

query {
  # depth = 1
  user(id: "1") {
    # depth = 2
    id
    name
    email
  }
}

This Operation has a depth of 2, contains 4 fields and returns 1 Node. If the underlying data source can fetch the user in a single request, we make a total of 1 request to the underlying data source.

Example 2 - Nested Query

query {
  # depth = 1
  # nodes = 1
  user(id: "1") {
    # depth = 2
    id
    name
    email
    # nodes = ??? (depends on the number of posts)
    posts {
      # depth = 3
      id
      title
      body
    }
  }
}

This Operation has a depth of 3, contains 8 fields and returns at least 1 Node. There's a problem though: the posts field has no pagination. If the user has 1M posts, then this Operation will return 1M + 1 Nodes.

If we assume that we can still fetch the user with a single request, and we can fetch all posts of the user with a second request, then we make a total of 2 requests to the underlying data source. That said, fetching 1M posts is much more expensive than fetching 20 posts. Our takeaway here is that complexity-based rate limiting cannot account for the cost of unbounded lists.

Example 3 - Nested Query with Pagination

query {
  # depth = 1
  # nodes = 1
  user(id: "1") {
    # depth = 2
    id
    name
    email
    # nodes = 1 + at most 20
    posts(first: 20) {
      # depth = 3
      id
      title
      body
    }
  }
}

This Operation has a depth of 3, contains 8 fields and returns at least 1 Node and at most 21 Nodes. If the user has 1M posts in total, this Operation will still return just 21 Nodes, so we've got an upper bound on the number of Nodes.

Given that we still fetch the user with one request, and the first 20 posts with a second request, it's fair to assume that we make a total of 2 requests with little to medium complexity.

As you can see, it's a lot easier to reason about the complexity of an Operation with pagination.

Example 4 - Nested Query with Pagination and Aliases

query {
  first: user(id: "1") {
    ...UserFields  
  }
  second: user(id: "2") {
    ...UserFields
  }
}
fragment UserFields on User {
    id
    name
    email
    posts(first: 20) {
      id
      title
      body
    }
}

This Operation is essentially the same as the previous one, but it uses aliases to fetch two users in a single request. This example is typically used to illustrate batching attacks in GraphQL. If a GraphQL server doesn't prevent such behaviour, a client could send a single request containing 1,000 aliased root fields. Even though we're just fetching up to 20 posts per user, we could end up with 1,000 * 20 = 20,000 posts in the response, not to mention the generated server load.
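
One common mitigation, complementary to counting Nodes, is to cap the number of root-level selections per Operation. Here's a minimal sketch using graphql-js; the limit of 10 is just an example:

import { parse, Kind } from 'graphql'

const MAX_ROOT_FIELDS = 10

// Reject Operations that batch too many root fields via aliases.
function assertRootFieldLimit(operation: string): void {
  const ast = parse(operation)
  for (const definition of ast.definitions) {
    if (definition.kind === Kind.OPERATION_DEFINITION) {
      const rootFields = definition.selectionSet.selections.length
      if (rootFields > MAX_ROOT_FIELDS) {
        throw new Error(`Too many root fields: ${rootFields} (max ${MAX_ROOT_FIELDS})`)
      }
    }
  }
}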

Example 5 - Nested Query with Pagination and Offsets

query {
  # depth = 1
  # nodes = 1
  user(id: "1") {
    # depth = 2
    id
    name
    email
    # nodes = 1 + at most 20
    posts(first: 20, skip: 20) {
      # depth = 3
      id
      title
      body
    }
  }
}

This Operation is just a slight variation of the previous one: we're using an offset (skip) of 20 to fetch the second page of posts. What I'd like to illustrate with this example is that static analysis of GraphQL Operation complexity needs to understand the semantics of list arguments. While the first argument increases the number of potential Nodes in a response, the skip argument doesn't.

How to implement a Complexity-based Rate Limiting Algorithm for GraphQL APIs

Now that we've seen some examples of GraphQL Operations, let's look at how we can implement a Complexity-based Rate Limiting Algorithm for GraphQL APIs.

The first step is to define a Complexity Model for your GraphQL API. A Complexity Model is a function that takes a GraphQL Operation as input, and returns one or more numbers that represent the complexity of the Operation.

Here's a simple example in pseudo code:

function calculateComplexity(operation: GraphQLOperation): { fields: number, depth: number, nodes: number } {
  // algorithm
  return { fields, depth, nodes }
}

You can implement the algorithm, for example, by using the visitor pattern. First, you parse the GraphQL Operation into an AST (Abstract Syntax Tree), then you traverse the AST and count the fields and potential Nodes, as well as the maximum depth of the Operation.
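
Here's a sketch of what that could look like in TypeScript, using the visit function from graphql-js. It's deliberately simplified: it doesn't expand fragment spreads, and without schema information it can only recognize lists through their pagination arguments (the first argument in our examples), so unpaginated list fields are underestimated.

import { parse, visit, Kind, FieldNode } from 'graphql'

interface Complexity {
  fields: number
  depth: number
  nodes: number
}

// Estimate how many items a field can return.
// The "first" argument bounds the list size; offset arguments like "skip" are ignored on purpose.
function estimatedListSize(field: FieldNode): number {
  const first = field.arguments?.find((arg) => arg.name.value === 'first')
  if (first !== undefined && first.value.kind === Kind.INT) {
    return parseInt(first.value.value, 10)
  }
  return 1 // a single object, or an unpaginated list we can't bound without schema information
}

function calculateComplexity(operation: string): Complexity {
  const ast = parse(operation)

  let fields = 0
  let currentDepth = 0
  let maxDepth = 0
  let nodes = 0
  const multipliers: number[] = [1] // how many parent Nodes the current selection is nested under

  visit(ast, {
    Field: {
      enter(node: FieldNode) {
        fields += 1
        currentDepth += 1
        maxDepth = Math.max(maxDepth, currentDepth)

        // Only fields with a selection set produce objects (Nodes) in the response,
        // leaf fields (scalars) don't.
        if (node.selectionSet) {
          const parent = multipliers[multipliers.length - 1]
          const listSize = estimatedListSize(node)
          nodes += parent * listSize
          multipliers.push(parent * listSize)
        }
      },
      leave(node: FieldNode) {
        currentDepth -= 1
        if (node.selectionSet) {
          multipliers.pop()
        }
      },
    },
  })

  return { fields, depth: maxDepth, nodes }
}

For Example 3 above, this sketch returns { fields: 8, depth: 3, nodes: 21 }, matching the numbers we worked out by hand.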

You can then use the calculated complexity to rate limit users or clients. For example, you could use the number of requested fields and maximum depth as a hard threshold. If a user requests more than 100 fields or a depth of more than 10, you could reject the request.
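
A hard threshold check on top of the calculated complexity can then be as simple as the following sketch; the limit values are just examples, not recommendations:

const MAX_FIELDS = 100
const MAX_DEPTH = 10

// Reject the Operation before executing it if it exceeds the hard limits.
// Complexity is the result type from the sketch above.
function assertHardLimits(complexity: Complexity): void {
  if (complexity.fields > MAX_FIELDS || complexity.depth > MAX_DEPTH) {
    throw new Error(
      `Operation rejected: ${complexity.fields} fields (max ${MAX_FIELDS}), depth ${complexity.depth} (max ${MAX_DEPTH})`,
    )
  }
}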

Then, you could use the number of potential Nodes as a soft threshold. You can use an in-memory store like Redis to keep track of the number of Nodes a user requested in the last minute. When a user makes a request, you calculate the potential number of Nodes and reject the request if the user would exceed, say, 1000 Nodes per minute. As we can only estimate the number of Nodes, we could apply a factor to be on the safe side, e.g. allowing up to 1500 Nodes during the estimation. Once the request is processed, we can take the actual number of Nodes and update the counter in Redis.
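
Here's a rough sketch of what that could look like with the ioredis client and a fixed one-minute window per user; the key layout, limit and safety factor are illustrative:

import Redis from 'ioredis'

const redis = new Redis()

const SOFT_LIMIT = 1000   // Nodes per user per minute
const SAFETY_FACTOR = 1.5 // headroom, because we can only estimate before execution

function windowKey(userId: string): string {
  // Fixed one-minute window; a sliding window would be smoother but more involved.
  return `complexity:${userId}:${Math.floor(Date.now() / 60_000)}`
}

// Reserve the estimated Nodes for this request. Returns false if the request should be rejected.
async function reserveNodes(userId: string, estimatedNodes: number): Promise<boolean> {
  const key = windowKey(userId)
  const total = await redis.incrby(key, estimatedNodes)
  await redis.expire(key, 120) // let old windows expire on their own
  return total <= SOFT_LIMIT * SAFETY_FACTOR
}

// After execution, replace the estimate with the actual Node count.
async function reconcileNodes(userId: string, estimatedNodes: number, actualNodes: number): Promise<void> {
  await redis.incrby(windowKey(userId), actualNodes - estimatedNodes)
}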

We could also make this more sophisticated by taking into consideration the number of Nodes that are actually returned in the response. For example, if the user is above the soft limit, we still process the request, but we return an error if the number of Nodes in the response exceeds the hard limit.
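
To know the actual number of Nodes, you can count the objects in the execution result once it's available. A simple recursive counter, applied per root field so the data wrapper itself isn't counted:

// Count the objects (Nodes) in a GraphQL response value.
// Scalars and null don't count; arrays contribute the Nodes of their items.
function countNodes(value: unknown): number {
  if (Array.isArray(value)) {
    return value.reduce((sum: number, item) => sum + countNodes(item), 0)
  }
  if (value !== null && typeof value === 'object') {
    return 1 + Object.values(value).reduce((sum: number, child) => sum + countNodes(child), 0)
  }
  return 0
}

// For { user: { ..., posts: [ ... ] } } this returns 1 + the number of posts.
function countResponseNodes(data: Record<string, unknown>): number {
  return Object.values(data).reduce((sum: number, value) => sum + countNodes(value), 0)
}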

Considerations for using Complexity-based Rate Limiting for GraphQL APIs in Production

At first glance, adding Complexity-based Rate Limiting to your GraphQL API seems like a good idea, but there are a few things you should consider before implementing it in production.

It's very important that you start with a clear strategy on what you'd like to achieve. Do you want to protect your systems against malicious actors? Do you want to protect yourself against accidental Denial of Service attacks? How will you handle false positives? What if a particular user or client needs to fetch more data than others?

It's very easy to block your users from using your API in a way that you didn't anticipate. How flexible will you be in terms of adjusting the complexity limits? Can you easily adjust or reset the limits for individual users or clients?

How predictable is this behaviour for your users? Will they easily understand why their requests are rejected? What if you deny a request that looks legitimate to them?

How can automated API clients react to rate limiting? Is there a standardized way to communicate rate limiting to the client? Can a client automatically adjust its request rate based on the response?
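
There's no GraphQL-specific standard for this, but the usual HTTP conventions apply: respond with status 429 and headers like Retry-After (standard) or the de-facto X-RateLimit-* family, so well-behaved clients can back off automatically. A small sketch using Node's http types; the exact header set is illustrative:

import type { ServerResponse } from 'http'

// Reject a request and tell the client when it may retry and how much budget is left.
function rejectWithRateLimitInfo(res: ServerResponse, retryAfterSeconds: number, remainingNodes: number): void {
  res.statusCode = 429 // Too Many Requests
  res.setHeader('Retry-After', String(retryAfterSeconds))
  res.setHeader('X-RateLimit-Remaining', String(remainingNodes))
  res.setHeader('Content-Type', 'application/json')
  res.end(JSON.stringify({ errors: [{ message: 'Complexity budget exceeded, please retry later' }] }))
}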

Conclusion

In this article, we've looked at Complexity-based Rate Limiting for GraphQL APIs. We've seen some examples of GraphQL Operations and how to calculate their complexity. We've also looked at how to implement a Complexity-based Rate Limiting Algorithm for GraphQL APIs. Finally, we've discussed some considerations for using Complexity-based Rate Limiting for GraphQL APIs in Production.

My recommendation is to start as simple as possible. Set a hard limit that is high enough not to block legitimate users, but still catches malicious actors or accidental Denial of Service.

Once this is in place, you can start to monitor the actual complexity of your GraphQL Operations in production and slowly adjust the limits.