Effective Strategies for Handling External Services

Greetings

In almost every software project we depend on external services, whether they are our own or run entirely by a third party. This is especially common with microservices. If we do not plan for and handle their failures properly, our own systems can fail or crash. Having run into such situations myself, I became curious enough to dig deeper.
Note that there are no “exact” answers when it comes to these kinds of situations. It is best to let the business decide which option to use, as that saves time, engineering effort, and of course money by avoiding over-engineering.

What can go wrong?

Let's first categorize the scenarios that can happen when using external systems.

Service Unavailable
Service is not able to handle high load
Service has implemented rate limits
Service is slow

All of these impact the business, so we need to design our systems carefully. However, there is no perfect solution; the right choice depends on the business need at hand.

Service Unavailable

We can expect service unavailability in any system, as high availability does not mean a service is always available. Failures happen for various reasons. In such situations, we can introduce a circuit breaker together with retries that use exponential back-off. That way, we can retry, say, 5 times, and if the service is still unavailable, we can log the error and handle it without crashing, since we can expect the service to come back soon.
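
To make the retry part concrete, here is a minimal sketch in Python. The name call_with_backoff, the default delays, and the choice of ConnectionError as the "service is down" signal are all assumptions for illustration; a real client would catch whatever error its HTTP library raises.

```python
import time


def call_with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `call()` until it succeeds or `max_attempts` is used up.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    The last error is re-raised so the caller can log it and carry on.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
```

The caller wraps this in a try/except, logs the final error, and returns a degraded response instead of crashing. Passing sleep as a parameter is only there to keep the function easy to test.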

Slow APIs

Not all APIs behave the same, and with or without load, some can be slow, which greatly impacts overall performance. This can happen with GET requests as well as POST requests. For a GET, we can maintain a cache on our side. For a POST, we can process the request asynchronously using a queue (depending on the situation).

Rate limits / high load

Service unavailability is common and we can easily prepare for it. In other situations, however, we need to analyze the business carefully. The service we are calling may enforce rate limits that we exceed, causing our requests to fail. Or the service may not have been designed to handle a higher load. Our business situation may tell us that circuit breaking is sufficient. Again, this can happen with either GET or POST requests.

Ignore it

The first thing to consider is whether we can ignore the problem. Engineering effort is money, and we should not waste it when a simple solution exists. We can simply show the user a message asking them to retry. As services do not stay down forever, this is acceptable, depending on the business requirement.

Bulk APIs

We can use bulk APIs for both GET and POST to reduce the impact of these limitations (rate limits, high load, slowness). For example, instead of fetching by id (a standard GET request), there can be an API for fetching by a list of ids or for posting a bulk payload. This is useful not only in the direct flow but also when using caches or queues.
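
As a rough sketch of the idea, the Python below groups ids into batches and makes one call per batch. Here fetch_many is a hypothetical stand-in for whatever bulk endpoint the external service offers (assumed to take a list of ids and return a dict of id to record), and the batch size of 100 is an arbitrary assumption.

```python
from itertools import islice


def chunked(ids, size):
    """Yield lists of at most `size` ids."""
    it = iter(ids)
    while batch := list(islice(it, size)):
        yield batch


def fetch_all(ids, fetch_many, batch_size=100):
    """Fetch records with one bulk call per batch instead of one call per id."""
    results = {}
    for batch in chunked(ids, batch_size):
        results.update(fetch_many(batch))  # a single request covers the whole batch
    return results
```

With 250 ids and a batch size of 100, this makes 3 calls instead of 250, which is often the difference between staying under a rate limit and exceeding it.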

Implement Circuit Breakers

The circuit breaker is a well-known pattern in microservices for handling API failures. Once calls keep failing, the breaker opens and further calls fail fast instead of hitting the struggling service. As mentioned above, we can retry with exponential back-off a few times before falling back. Depending on the business scenario, we can decide whether to stop the overall execution or proceed with another data source.
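
Here is a minimal, single-threaded sketch of a circuit breaker in Python. The class name, the thresholds, and the injectable clock are assumptions for illustration, not any particular library's API. In production we would more likely reach for an existing library than write our own.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("skipping call, circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The caller catches CircuitOpenError and either stops the execution or reads from another data source, whichever the business prefers.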

Use a Cache

When we fetch data heavily from a third-party service, we can implement a cache on our end and call the external service only on a cache miss. This not only handles the scenarios above but also improves overall performance considerably. We need to be careful if the data changes frequently, as the copy on our end can become outdated.
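
A small in-memory sketch of this in Python follows. The TTLCache class, the 60 second expiry, and the loader callback (standing in for the call to the external service) are assumptions for illustration; in practice this role is often played by a shared cache such as Redis.

```python
import time


class TTLCache:
    """Cache values per key and reload them after `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self.loader = loader  # hypothetical function that calls the external service
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # cache hit, no external call
        value = self.loader(key)  # cache miss or expired entry
        self._entries[key] = (self.clock() + self.ttl, value)
        return value
```

The expiry time is the knob for the staleness trade-off: a short one keeps data fresher at the cost of more external calls.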

Use an Event Queue

Event-driven architecture is also very common and super nice. We can greatly improve the responsiveness of the application by handling heavy work asynchronously. Unlike with circuit breakers, retrying does not affect the performance of the main flow, as it happens outside it. We also have the flexibility to choose between bulk processing and single execution. Further, we can easily implement a retry mechanism using a Dead Letter Queue.
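
To show the retry and dead-letter flow, here is a sketch in Python that uses an in-memory deque as a stand-in for a real message broker. The drain function, the (message, attempts) tuple shape, and the limit of 3 attempts are assumptions for illustration; managed queues usually provide this behavior through configuration.

```python
from collections import deque


def drain(queue, handler, dead_letters, max_attempts=3):
    """Process (message, attempts) pairs until the queue is empty.

    A failed message is put back for another try. After `max_attempts`
    failures it is moved to `dead_letters` for later inspection.
    """
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(message)  # give up on this message
            else:
                queue.append((message, attempts))  # retry later
```

The main flow only enqueues the message and returns; a worker running something like drain does the slow call to the external service.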

Getting data closer to our service

One way to keep a system healthily available is to bring the data closer to the service, even though this duplicates the same data set in multiple places. Much like the transactional outbox pattern, we keep the data in our own database and use queues for communication. By doing so, we are choosing availability over consistency, which is suitable for most situations.
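
As a sketch of what the receiving side could look like, the Python below keeps a local copy that is updated from change events. The event shape (id, version, data) and the LocalReplica class are assumptions for illustration, and a dict stands in for our own database.

```python
class LocalReplica:
    """Local copy of external data, kept up to date from change events."""

    def __init__(self):
        self._rows = {}  # id -> (version, data)

    def apply(self, event):
        """Apply an event; return False if it is stale or a duplicate."""
        current = self._rows.get(event["id"])
        if current is not None and current[0] >= event["version"]:
            return False  # queues may redeliver or reorder, so ignore old versions
        self._rows[event["id"]] = (event["version"], event["data"])
        return True

    def get(self, record_id):
        """Read from the local copy without calling the external service."""
        row = self._rows.get(record_id)
        return None if row is None else row[1]
```

Reads never touch the external service, so they keep working when it is down; the price is that the local copy can lag behind until the next event arrives.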

Queue as a database

I first encountered this neat pattern during my AWS journey, using EventBridge. Instead of reaching the business logic directly, the request is dispatched to a queue or a bus (e.g. EventBridge), and workers (e.g. Lambda) invoke the services asynchronously. What is fun about this pattern is that we can completely change our business logic without affecting any customer requests.

Conclusion

External services play a significant role in the majority of applications, whether they are legacy systems or built with modern technologies. This gives engineers new problems to solve with creativity. However, when you have multiple options, let the business decide what to sacrifice, as there is no perfect solution. Do you have any other solutions? Please let me know.

Happy learning ☺

References

Here are a few resources for further study.

https://redis.com/blog/what-is-enterprise-caching/

https://redis.com/glossary/rate-limiting/

https://martinfowler.com/bliki/CircuitBreaker.html

https://microservices.io/patterns/data/transactional-outbox.html

Manju