Case study: Design a solution for long-running requests

Greetings

As developers, we often aim to deliver APIs with minimal latency, typically in just a few milliseconds. However, it's important to acknowledge that not all customer requirements are alike. There are scenarios where we must tackle intricate computations that can extend over many minutes. Let's explore solutions for addressing such use cases.

Problem statement

The user expects to view the outcome of a complex computation, which requires more than 5 minutes to complete.

Initial design

The simplest architecture exposes a synchronous REST or GraphQL endpoint that keeps the request open until the computation finishes.

The primary limitation of this design is that each request holds a server thread for the full duration of the computation, so a handful of concurrent requests can exhaust server resources. Additionally, the user gets no feedback while the computation is in progress.

Improving with an Async Job and Polling

Rather than waiting for the calculation to complete, we can run it on a separate thread and return from the initial request right away, handing the client an identifier for the job. When the calculation completes, we store the result in a data store, such as a database or cache, and expose a separate endpoint that the client polls for the result.

This immediately addresses our primary concern: request threads are no longer blocked, so the server stays responsive to other requests while computations run in the background. Moreover, users receive immediate confirmation that their request was accepted. For such APIs, it's advisable to return status code 202 (Accepted) instead of 200 (OK), because the work has been accepted but not yet completed.

However, this approach does not scale well: the same server must still spend resources on the computation itself and on the extra API calls for polling. Let's address this scalability issue.
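
To make the flow concrete, here is a minimal sketch of the accept-then-poll pattern using only the Python standard library. Everything in it is an assumption for illustration: the handlers `submit_job` and `get_job` stand in for hypothetical `POST /jobs` and `GET /jobs/{id}` endpoints, the `JOBS` dictionary stands in for the database or cache, and `long_computation` is a placeholder that sleeps briefly instead of running for minutes.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the database or cache that holds job state.
JOBS: dict[str, dict] = {}
EXECUTOR = ThreadPoolExecutor(max_workers=4)


def long_computation(n: int) -> int:
    """Placeholder for the real multi-minute computation."""
    time.sleep(0.1)
    return n * n


def _run_job(job_id: str, n: int) -> None:
    """Runs on a background thread and records the outcome."""
    try:
        JOBS[job_id] = {"status": "done", "result": long_computation(n)}
    except Exception as exc:
        JOBS[job_id] = {"status": "failed", "error": str(exc)}


def submit_job(n: int) -> tuple[int, dict]:
    """Hypothetical POST /jobs handler: start the work and return at once."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "pending"}
    EXECUTOR.submit(_run_job, job_id, n)
    return 202, {"job_id": job_id}  # 202 Accepted: not finished yet


def get_job(job_id: str) -> tuple[int, dict]:
    """Hypothetical GET /jobs/{id} handler: report status or result."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": "unknown job"}
    return 200, job


def poll_until_done(job_id: str, interval: float = 0.05, timeout: float = 5.0) -> dict:
    """Client-side loop: poll at a fixed interval until the job finishes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        _, body = get_job(job_id)
        if body.get("status") in ("done", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

A client would call `submit_job`, keep the returned `job_id`, and then call `poll_until_done` (or its own equivalent) until the status changes. In a real deployment the polling interval would likely be seconds rather than milliseconds.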

Improving with a Queue

Instead of burdening the API server with the calculation, we decouple the two. The server places each request on a queue; a separate worker component consumes messages from the queue, performs the calculation, and writes the result to the data store. Workers can then be scaled independently of the API server. Note that the client's user interface still polls for the result.
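
The producer and worker roles can be sketched as follows. This is an in-process illustration only: `queue.Queue` stands in for a real message queue, the `RESULTS` dictionary stands in for the data store, and the names `enqueue_request`, `worker_loop`, and `compute` are hypothetical.

```python
import queue
import threading
import uuid

# Stand-ins: queue.Queue for the message queue, a dict for the data store.
REQUESTS: queue.Queue = queue.Queue()
RESULTS: dict[str, dict] = {}


def enqueue_request(payload: int) -> str:
    """API server side: record the job, enqueue it, and return immediately."""
    job_id = str(uuid.uuid4())
    RESULTS[job_id] = {"status": "pending"}
    REQUESTS.put({"job_id": job_id, "payload": payload})
    return job_id


def compute(payload: int) -> int:
    """Placeholder for the long-running calculation."""
    return payload * 2


def worker_loop() -> None:
    """Worker side: consume messages until a None sentinel arrives."""
    while True:
        message = REQUESTS.get()
        try:
            if message is None:
                return
            job_id = message["job_id"]
            try:
                RESULTS[job_id] = {"status": "done", "result": compute(message["payload"])}
            except Exception as exc:
                RESULTS[job_id] = {"status": "failed", "error": str(exc)}
        finally:
            # Mark the message as handled so REQUESTS.join() can return.
            REQUESTS.task_done()


if __name__ == "__main__":
    worker = threading.Thread(target=worker_loop, daemon=True)
    worker.start()
    job_id = enqueue_request(21)
    REQUESTS.put(None)  # ask the worker to stop after the pending job
    REQUESTS.join()
    print(RESULTS[job_id])
```

With a real queue service the worker would run in its own process or container, which is what lets it scale separately from the API server.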

This design resolves the scalability problem on the computation side, but clients still poll. We can improve on that by pushing results over WebSocket. WebSocket does introduce additional cost and development complexity, so for an initial proof of concept (POC), polling remains a viable choice.

Improving with WebSocket

The overall architecture stays the same; the difference is that the server now pushes the calculation result to the client over WebSocket instead of waiting to be polled.

Rather than storing the result directly in the database, we can use a cache as the data store. Redis is a good fit here because of its performance as an in-memory cache. Furthermore, we can use Redis Pub/Sub to notify whichever server holds the client's WebSocket connection, so that it can send the message.
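
The publish-and-push step can be illustrated with a small in-process sketch. It deliberately does not use a Redis client or a WebSocket library: `InMemoryPubSub` is a stand-in for a broker such as Redis Pub/Sub, `FakeSocket` is a stand-in for an open WebSocket connection, and the channel naming in `channel_for` is an assumption made for this example.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryPubSub:
    """Stand-in for a pub/sub broker such as Redis Pub/Sub (not a Redis client)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: dict) -> int:
        """Deliver to current subscribers and return how many received it."""
        handlers = list(self._subscribers.get(channel, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


class FakeSocket:
    """Stand-in for one open WebSocket connection; records what was sent."""

    def __init__(self) -> None:
        self.sent: list[dict] = []

    def send(self, message: dict) -> None:
        self.sent.append(message)


def channel_for(job_id: str) -> str:
    """Hypothetical channel naming scheme: one channel per job."""
    return f"job-results:{job_id}"


def on_client_connected(broker: InMemoryPubSub, job_id: str, socket: FakeSocket) -> None:
    """WebSocket server side: forward any message for this job to the socket."""
    broker.subscribe(channel_for(job_id), socket.send)


def on_job_finished(broker: InMemoryPubSub, cache: dict, job_id: str, result: int) -> int:
    """Worker side: cache the result first, then notify listeners."""
    cache[job_id] = result
    return broker.publish(channel_for(job_id), {"job_id": job_id, "result": result})
```

The worker writes to the cache before publishing because pub/sub delivery of this kind only reaches subscribers that are connected at that moment. A client that connects late, or reconnects after a dropped connection, can still read the result from the cache.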

For the request API itself, REST is a viable option, and GraphQL is another compelling choice: GraphQL Subscriptions, which are commonly delivered over WebSocket, fit naturally into this push-based architecture.

Whichever option we choose, there are AWS services we can rely on.

Designing with AWS AppSync

Building a GraphQL server is relatively straightforward with modern frameworks, but operating WebSocket connections and ensuring high availability can be challenging. That's why it's advisable to lean on managed cloud services whenever possible.

AWS AppSync is a managed service provided by AWS that simplifies the development of GraphQL APIs. It allows us to create and run serverless, real-time, and scalable GraphQL APIs with ease. (from the AWS documentation)

Designing with AWS API Gateway

If we opt for REST instead of GraphQL, Amazon API Gateway is a good fit, since it can front both the REST endpoints and the WebSocket connections.

Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and securing REST, HTTP, and WebSocket APIs at any scale. (from the AWS documentation)

Conclusion

In this article, we explored potential approaches to designing a solution for long-running API requests. Software solutions vary based on specific needs, making it challenging to prescribe one exact solution for a given problem. Achieving an optimal solution involves balancing trade-offs and selecting what aligns best with our specific requirements.

Manju