February 1, 2014

A failure specification is as important as a functional specification in distributed systems

This post discusses the handling of failures in Work Queue, an abstraction for building and running distributed applications, and describes broad observations and lessons learned.

Work Queue (WQ) is similar to Hadoop in offering a programming model that splits a large computation into multiple pieces and executes them concurrently on available resources.
 

At the same time, WQ differs from Hadoop in several ways. One major difference is that WQ supports arbitrary dependencies between the tasks in a workload.

In addition to the data-parallel workloads supported by Hadoop, WQ can also run workloads expressed as a directed acyclic graph (DAG), as well as iterative workloads in which a set of tasks is executed and evaluated repeatedly until an end condition is satisfied.

WQ employs a master-worker execution model, where the master dispatches tasks to workers for execution and coordinates the successful completion of the entire workload.
 

Applications use WQ's C, Perl, or Python API to describe and run their workloads.
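
To make this concrete, here is a minimal sketch of a master program written against WQ's C API (work_queue.h from cctools). The command and file names are made up for illustration, and the exact interface may vary across versions.

    #include "work_queue.h"
    #include <stdio.h>

    int main(void) {
        /* start a master listening on the default port */
        struct work_queue *q = work_queue_create(WORK_QUEUE_DEFAULT_PORT);
        if(!q) return 1;

        /* describe one task: its command and the files it moves */
        struct work_queue_task *t =
            work_queue_task_create("./sim < in.dat > out.dat");
        work_queue_task_specify_file(t, "in.dat", "in.dat",
                                     WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(t, "out.dat", "out.dat",
                                     WORK_QUEUE_OUTPUT, WORK_QUEUE_NOCACHE);
        work_queue_submit(q, t);

        /* the master dispatches tasks to workers and collects results */
        while(!work_queue_empty(q)) {
            struct work_queue_task *done = work_queue_wait(q, 5);
            if(done) {
                printf("task %d exited with status %d\n",
                       done->taskid, done->return_status);
                work_queue_task_delete(done);
            }
        }
        work_queue_delete(q);
        return 0;
    }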
 

Recently, we ran into a few bugs related to the handling of failures in WQ. Here are three of them:
  1. WQ was not correctly handling failures in the transmission of files between the master and workers. In certain cases, it ignored the failures and marked the transmission as successful.
  2. WQ treated the inability to access or open a file at the master (e.g., a file path incorrectly specified in the application) as a failure during transmission to the worker, and wrongly marked the worker as faulty.
  3. WQ crashed on a memory allocation failure, bringing down the entire operation.
On first inspection, the bugs looked like simple coding errors with straightforward fixes. On closer inspection, however, their root causes ran deeper.

We found the following root causes:

1. Failures were being lost: WQ goes through a number of system calls, library routines, and internal functions when coordinating the execution of tasks across workers.
 
One of the bugs happened because a return value indicating failure was not propagated up the call stack to the function where failure-handling actions are taken.
 
Fix: We constructed call graphs of WQ and made sure return values were propagated to, and received at, the functions where appropriate failure-handling actions can be taken, as in the sketch below.
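
As an illustration of the pattern, the sketch below (with made-up function names, not WQ's actual code) shows a failure code traveling from the lowest layer up to the one function that has enough context to act on it:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* lowest layer: returns 0 on success, -1 on failure with errno set */
    static int send_bytes(int fd, const char *buf, size_t len) {
        /* ... write() loop elided; pretend it failed ... */
        errno = EPIPE;
        return -1;
    }

    /* middle layer: must pass the failure up, not swallow it */
    static int send_file(int fd, const char *path) {
        if(send_bytes(fd, "data", 4) < 0) return -1;
        return 0;
    }

    /* top layer: the only place with enough context to act */
    int main(void) {
        if(send_file(3, "out.dat") < 0)
            fprintf(stderr, "transfer failed (%s); marking worker down\n",
                    strerror(errno));
        return 0;
    }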

Tip: We found Doxygen to be useful in building call graphs for the functions of interest.
 
2. Failures were treated alike: Two types of failures can happen when an application runs on WQ: application-level and system-level failures.
 
Application-level failures are those that can be remedied only by the application. Examples include incorrect file paths and an incorrectly specified execution environment for tasks.
 
System-level failures are everything that happens in the layers below the application. These failures should be transparent to the application: WQ must detect, handle, and recover from them. Examples include network failures, hardware errors, and suspended workers. WQ recovers from these by retrying the operation or migrating the execution to a different worker.
 
WQ had some logic to handle failures based on their type, but it was never applied consistently. In short, WQ lacked two things: (i) a consistent classification of detected failures into these two types, and (ii) a clear demarcation of the actions to be taken for each type.
 
Fix: We added two functions, handle_app_level_failure() and handle_sys_level_failure(), and routed all failure-handling actions through these two routines.
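
A minimal sketch of the resulting dispatch is below; the two handler names are the ones we added, while the enum and the handle_failure() wrapper are illustrative:

    struct task;    /* opaque handles, for illustration only */
    struct worker;

    /* application-level: return the task with diagnostics; only the
       application can fix, say, a bad file path */
    void handle_app_level_failure(struct task *t);

    /* system-level: WQ recovers on its own, e.g., by removing the
       worker and resubmitting the task elsewhere */
    void handle_sys_level_failure(struct task *t, struct worker *w);

    typedef enum { FAILURE_APP, FAILURE_SYS } failure_type;

    /* every detected failure is first classified, then dispatched
       through exactly one of the two routines */
    void handle_failure(failure_type ft, struct task *t, struct worker *w) {
        if(ft == FAILURE_APP) handle_app_level_failure(t);
        else                  handle_sys_level_failure(t, w);
    }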

3. Resource failures were fatal: A failed memory allocation resulted in a crash (segfault) that brought down the entire operation.

Instead, failures in allocating resources should be treated as partial failures so that WQ can continue functioning with the resources it has already acquired. This provides graceful degradation under (multiple) resource failures.
 
Fix: We added code to detect and handle failures in the allocation of file descriptors, memory, and disk space, and to contain their effects within the scope of an individual worker or task.
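
A minimal sketch of the idea, using a hypothetical per-task buffer allocation:

    #include <stdio.h>
    #include <stdlib.h>

    /* returns 0 on success, -1 if this task cannot start right now */
    int start_task(size_t bufsize) {
        char *buf = malloc(bufsize);
        if(!buf) {
            /* partial failure: only this task is deferred; the master
               keeps running with the resources it already holds */
            fprintf(stderr, "allocation failed; deferring task\n");
            return -1;
        }
        /* ... run the task using buf ... */
        free(buf);
        return 0;
    }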
 


The above three root causes are systemic and had the potential to introduce critical bugs, if they hadn't already. We therefore found this to be not only a productive exercise but a necessary one.

While working on identifying the root causes and their fixes, I had interesting discussions with my advisor, Dr. Douglas Thain, who is also the chief architect of WQ. From those discussions, I learned the following about handling failures.

4. Not all failures need to be handled: Hundreds of different failures can happen in a distributed environment. Reacting to every one of them can be a futile exercise.

For example, does every TCP transmission in WQ have to be checked for success? The answer is no, because WQ uses asynchronous messages: the sender does not have to wait for the receiver to acknowledge reception before proceeding to its next action.

If a transmission fails, the connection is broken or terminated by TCP. So, as long as we catch broken connections when receiving on the link, we can safely ignore failures in transmission.
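
Here is a sketch of this check-on-receive pattern over plain POSIX sockets (illustrative helpers, not WQ's actual networking code):

    #include <sys/socket.h>
    #include <sys/types.h>

    /* sends are asynchronous: we deliberately ignore the return value */
    void send_msg(int fd, const char *msg, size_t len) {
        (void)send(fd, msg, len, 0);
    }

    /* a broken link is caught here, on the next receive */
    int recv_reply(int fd, char *buf, size_t len) {
        ssize_t n = recv(fd, buf, len, 0);
        if(n <= 0) return -1;   /* 0: peer closed; <0: error, link is dead */
        return (int)n;
    }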
 
The trick here is to identify and handle the failures that, when neglected, can lead to the failure of the entire system. With a good understanding of the architecture of the system, this shouldn't be hard to do.

5. Realize the application can make smarter choices: When building abstractions for running distributed applications, it is important to avoid the trap of thinking that the abstraction must handle and recover from every failure.


Trying to handle every failure at the abstraction can lead to unnecessary complexity and bloat.

For example, what happens when a task does not produce the output files that the application says it expects to see? 

Should the task be retried? Should the other output files that were successfully produced by the task be thrown away because the task failed?

The answer in WQ is that we retrieve the output files that were successfully produced by the task to the paths specified by the application. We also retrieve as much information as we can about the task, such as its stdout, exit code, and execution time, and return the task to the application.

The application or its developer can harness this information to determine the appropriate remedial action, such as modifying the command, its arguments, or the executables, and then resubmit the task for execution, as in the sketch below.
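
Here is a hedged sketch of what that application-side loop can look like, continuing with the queue q from the C API sketch earlier; fix_command() is a hypothetical helper standing in for whatever remedy the application chooses, and the field names follow the work_queue.h we use:

    #include "work_queue.h"
    #include <stdio.h>

    const char *fix_command(struct work_queue_task *t);  /* hypothetical */

    void check_one_task(struct work_queue *q) {
        struct work_queue_task *t = work_queue_wait(q, 5);
        if(t) {
            if(t->return_status != 0) {
                /* the task failed, but any outputs it did produce are
                   already at the paths we specified, and its stdout is
                   in t->output */
                fprintf(stderr, "task %d failed:\n%s\n",
                        t->taskid, t->output);

                /* the application decides the remedy: here, resubmit
                   with a corrected command via fix_command() */
                struct work_queue_task *retry =
                    work_queue_task_create(fix_command(t));
                work_queue_submit(q, retry);
            }
            work_queue_task_delete(t);
        }
    }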

Conclusion
The lesson here is to treat the handling of failures in distributed systems as an integral part of design and development. One way to accomplish this is to write down a failure specification that describes the failures the system will handle, the actions it will take to handle them, and how those failures and actions will impact the overall operation of the system.