Excellence in Software Engineering
What if remote service is unavailable?
16 August 2017

Author: Nihal SARMAŞIK, SW Architect – Defence Application Software Group

Today, many applications collaborate with remote services to provide functionality. While developing a distributed system, we sometimes encounter temporary remote service unavailability or failure due to transient faults such as slow network connections. This may lead to resource exhaustion and cascading failure in the application.

As a rule of thumb, before depending on a remote service, think twice and ask yourself: “What if the remote service is unavailable?”

Before digging into the solution, I want to ask a question: “Have you ever thought about what would happen if an overcurrent flowed from the electrical outlets to the appliances in your home?” Your appliances could be damaged or burned, and a fire could start. That would be a disaster.

I think most of you would say, “No, this will not cause any problems. We have protection measures in the household electricity system…” Yes, you are right: we use circuit breakers at home to protect electrical appliances from overloading.

Now some of you may say, “Oops… What is a circuit breaker?” Let me give a little information about circuit breakers and how they work.

A circuit breaker is an electrical switch that protects an electrical circuit from damage caused by overcurrent or a short circuit. It detects the problem quickly and interrupts the current flow by breaking the circuit.

As illustrated in the figure below, a circuit breaker keeps the circuit closed under normal circumstances and opens it immediately after detecting an abnormal condition. [1]

Let’s return to the solution of our problem.

After this look at the electrical circuit breaker, most of you may say, “Yes, that is it! Why can’t we use the same logic in software?”

Actually, there is a resilience pattern named “Circuit Breaker” in software as well. Someone invented the wheel before us.

The Circuit Breaker pattern is used to provide stability and prevent cascading failures in distributed systems.

“The essence of the pattern is that, when one of your dependencies stops responding, you need to stop calling it for a little while.

A file system that has exhausted its operation queue is not going to recover while you keep hammering it with new requests. A remote web service is not going to come back any faster if you keep opening new TCP connections and mindlessly waiting for the 30 second timeout. Worse yet, if your application normally expects that web service to respond in 100ms, suddenly starting to block for 30s is likely to deteriorate the performance of your own application and trigger a cascading failure.” [2]

“A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number of recent failures that have occurred, and use this information to decide whether to allow the operation to proceed, or simply return an exception immediately.” [3]
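To make “the number of recent failures” concrete, here is a minimal Python sketch of a sliding-window failure counter that such a proxy could consult. The class name RecentFailureCounter, the window_seconds parameter, and the injectable clock are assumptions made for illustration; they do not come from any of the cited sources.

```python
import time
from collections import deque


class RecentFailureCounter:
    """Counts failures that occurred within the last `window_seconds`.

    Hypothetical helper for illustration only; not part of any library.
    """

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so the sketch can be tested
        self._timestamps = deque()   # failure times, oldest first

    def record_failure(self):
        # Remember when the failure happened.
        self._timestamps.append(self._clock())

    def count(self):
        # Drop failures that have aged out of the window, then count the rest.
        cutoff = self._clock() - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

A proxy could call record_failure() whenever the remote call raises, and compare count() with its threshold before letting the next call through. Old failures fall out of the window on their own, so a service that recovered long ago is not penalised.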

The circuit breaker can be implemented as a state machine with the following states:

Closed: The circuit breaker forwards the requests coming from the application to the remote service. It counts recent failures within a specified time period. If the number of recent failures exceeds the specified threshold, the circuit breaker changes its state to “Open”.

Open: The circuit breaker immediately fails any request coming from the application and returns an exception to the application. After a specified timeout, the circuit breaker moves to the “Half-Open” state.

Half-Open: The circuit breaker forwards a request to the remote service. If this request succeeds, it goes back to the “Closed” state and everything is back to normal. If the request fails, it goes back to the “Open” state.
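The three states above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production implementation: the names CircuitBreaker, CircuitOpenError, failure_threshold and reset_timeout are hypothetical, the breaker counts consecutive failures instead of failures within a time window, it opens when the count reaches the threshold, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds, assumed default
        self._clock = clock                         # injectable for testing
        self.state = self.CLOSED
        self._failures = 0       # consecutive failures (a simplification)
        self._opened_at = None

    def call(self, operation, *args, **kwargs):
        if self.state == self.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of calling the remote service.
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: let one request through as a probe.
            self.state = self.HALF_OPEN
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success returns the breaker to normal operation.
        self._failures = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self._failures += 1
        # A failed probe, or too many failures while closed, opens the circuit.
        if (self.state == self.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = self.OPEN
            self._opened_at = self._clock()
```

An application would wrap each remote call, for example breaker.call(fetch_orders, customer_id), where fetch_orders stands for whatever function talks to the remote service, and would handle CircuitOpenError by returning a fallback or a quick error to the user.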

Circuit breakers are among the most important protection devices in our daily life. They make electrical devices resilient to overload. To build a resilient application, we can simply copy this real-life solution into software.

[1] http://www.topswagcode.com/circuit-breaker-pattern/
[2] https://blog.tatham.oddie.com.au/2011/10/31/released-reliabilitypatterns-a-circuit-breaker-implementation-for-net/
[3] https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
