
February 1, 2014

A failure specification is as important as a functional specification in distributed systems

This post discusses the handling of failures in Work Queue, an abstraction for building and running distributed applications, and describes broad observations and lessons learned.

Work Queue (WQ) is similar to Hadoop in offering a programming model that splits a large computation into multiple pieces and executes them concurrently on available resources.
 

At the same time, WQ differs from Hadoop in several ways. One major difference is that WQ supports arbitrary dependencies between tasks in a workload.

In addition to the data-parallel workloads supported by Hadoop, WQ can also run workloads expressed as a directed acyclic graph (DAG), or iterative workloads where a set of tasks is executed and evaluated repeatedly until an end condition is satisfied.

WQ employs a master-worker execution model, where the master dispatches tasks to workers for execution and coordinates the successful completion of the entire workload.
 

Applications use the C, Perl, or Python API of WQ to describe and run their workload.
 

Recently, we ran into a few bugs related to the handling of failures in WQ. Here are three of them:
  1. WQ was not correctly handling failures in the transmission of files between the master and workers. In certain cases, it ignored the failures and marked the transmission as successful.
  2. WQ treated the inability to access or open a file (e.g., a file path was incorrectly specified in the application) at the master as a failure during transmission to the worker and wrongly marked the worker as faulty.
  3. WQ crashed on a memory allocation failure, bringing down the entire operation.
On first inspection, the bugs looked like simple coding errors with straightforward fixes. On closer inspection, however, their root causes ran deeper.

We found the following root causes:

1. Failures were being lost: When coordinating the execution of tasks across workers, WQ passes through a number of system calls, library routines, and functions.
 
One of the bugs happened because a return value indicating failure was not propagated up the call stack to the function where failure-handling actions are taken.
 
Fix: We constructed call graphs in WQ and made sure return values were propagated and received at the functions where appropriate failure-handling actions can be taken.

Tip: We found Doxygen to be useful in building call graphs for the functions of our interest.
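The pattern behind the fix can be sketched in a few lines of illustrative Python (the function names are hypothetical, not WQ internals): each layer propagates the failure code upward unchanged, and only the top-level dispatch function decides what to do about it.

```python
# Sketch: a failure code travels up the call stack intact, so the
# top-level coordinator -- not an intermediate layer -- chooses the remedy.

SUCCESS, FAILURE = 0, -1

def send_file(path, transfer_ok):
    """Lowest layer: reports failure instead of swallowing it."""
    if not transfer_ok:
        return FAILURE          # e.g., a short write or a dropped connection
    return SUCCESS

def transfer_inputs(paths, transfer_ok):
    """Middle layer: propagates the first failure upward unchanged."""
    for p in paths:
        rc = send_file(p, transfer_ok)
        if rc != SUCCESS:
            return rc           # the buggy version ignored rc here
    return SUCCESS

def dispatch_task(paths, transfer_ok):
    """Top layer: the only place where failure-handling actions are taken."""
    rc = transfer_inputs(paths, transfer_ok)
    if rc != SUCCESS:
        return "requeue-task"   # handle the failure where the context exists
    return "task-dispatched"
```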
 
2. Failures were treated alike: There are two types of failures that can happen during the execution of an application using WQ: application-level and system-level failures.
 
Application-level failures are those that can be remedied only by the application. Examples include incorrect path names for files, an incorrect specification of the execution environment for tasks, etc.
 
System-level failures are everything that happens in the layers below the application. These failures are transparent to the application: WQ must detect, handle, and recover from them. Examples are network failures, hardware errors, suspended workers, etc. WQ recovers from these failures by retrying or by migrating the execution to a different worker.
 
WQ had some logic to handle failures based on their type, but it was never applied consistently. In short, WQ lacked two things: (i) a consistent distinction between the two types of detected failures, and (ii) a clear demarcation of the actions to be taken for each type.
 
Fix: We added two functions: handle_app_level_failure() and handle_sys_level_failure(). All failure-handling actions go through these two routines.
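A minimal sketch of this demarcation (the handler names follow the post; the failure classifications and return values are hypothetical examples, not WQ internals):

```python
# Sketch: every detected failure is first classified, then routed through
# one of two central handlers, so the remedy is never chosen ad hoc.

APP_LEVEL = {"missing-input-file", "bad-environment-spec"}
SYS_LEVEL = {"network-error", "hardware-error", "worker-lost"}

def handle_app_level_failure(task, failure):
    # Return the task to the application: only it can fix the specification.
    return ("returned-to-application", task, failure)

def handle_sys_level_failure(task, failure):
    # Recover transparently: retry, or migrate to a different worker.
    return ("retried-on-new-worker", task, failure)

def handle_failure(task, failure):
    if failure in APP_LEVEL:
        return handle_app_level_failure(task, failure)
    if failure in SYS_LEVEL:
        return handle_sys_level_failure(task, failure)
    raise ValueError(f"unclassified failure: {failure}")
```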

3. Resource failures were fatal: A failure in memory allocation resulted in a crash (segfault).

Instead, failures in allocating resources should be treated as partial failures so that WQ can continue functioning with the previously allocated resources. This provides graceful degradation upon (multiple) resource failures.
 
Fix: We added code to detect and handle failures in the allocation of file descriptors, memory, and disk space within the scope of an individual worker or task.
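The idea can be sketched as follows (illustrative Python, not WQ internals): an allocation failure is reported to the caller, which declines the new work and carries on with the resources it already holds.

```python
# Sketch: treat allocation failure as a partial failure. If a buffer for a
# new worker cannot be allocated, the master keeps running with the workers
# it already has instead of crashing.

def try_allocate(nbytes):
    """Allocate a buffer, reporting failure instead of crashing."""
    try:
        return bytearray(nbytes)
    except (MemoryError, OverflowError):
        return None

def add_worker(workers, buffer_size):
    """Admit a new worker only if its buffer can be allocated; otherwise
    keep operating with the workers already admitted (graceful degradation)."""
    buf = try_allocate(buffer_size)
    if buf is None:
        return False            # decline the new worker, keep the rest
    workers.append(buf)
    return True
```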
 


The above three root causes are systemic and have the potential to introduce critical bugs, if they haven't already. We therefore found this to be not only a productive exercise but also a necessary one.

While working on identifying the root-causes and their fixes, I had interesting discussions with my advisor Dr. Douglas Thain, who is also the chief architect of WQ. I learned the following about handling failures from those discussions.

4. Not all failures need to be handled: There are hundreds of failures that can happen in a distributed environment. Reacting to every one of them can be a futile exercise.

For example, does every TCP transmission have to be checked for success in WQ? The answer is no, because WQ uses asynchronous messages: the sender does not have to wait for the receiver to acknowledge reception before proceeding with its next action.

If there was a failure in sending a message, the connection was broken or terminated by TCP. So, as long as we caught broken connections when receiving on the link, we could safely ignore failures in transmission.
 
The trick here is to identify and handle the failures that, when neglected, can lead to the failure of the entire system. With a good understanding of the architecture of the system, this shouldn't be hard to do.
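The TCP point above can be demonstrated with a local socket pair (an illustrative Python sketch, not WQ code): sends go unchecked, and the broken connection is caught on the receive path as an empty read.

```python
# Sketch: "check on receive, not on send." When the peer goes away,
# recv() eventually returns b"" (end-of-stream) -- that is the moment
# a master would mark the worker's link as broken.

import socket

def link_is_alive(sock):
    """Receive on the link; an empty read means the peer closed it."""
    return sock.recv(1024) != b""

master_end, worker_end = socket.socketpair()
worker_end.sendall(b"task-result")   # send is not individually checked
worker_end.close()                   # the worker disappears

got_message = link_is_alive(master_end)       # queued data still arrives
link_broken = not link_is_alive(master_end)   # then the broken link shows up
master_end.close()
```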

5. Realize the application can make smarter choices: When building abstractions for running distributed applications, it is important to avoid the trap of thinking the abstraction must handle and recover from every failure.


Trying to handle every failure at the abstraction can lead to unnecessary complexity and bloat.

For example, what happens when a task does not produce the output files that the application says it expects to see? 

Should the task be retried? Should the other output files that were successfully produced by the task be thrown away since the task failed?

The answer in WQ is to retrieve the output files that were successfully produced by the task to the path specified by the application. We also retrieve as much information as we can about the task, such as its stdout, exit code, and execution time, and return the task to the application.

The application or its developer can harness this information to determine the appropriate remedial action, such as modifying the commands, their arguments, or the executables, and resubmitting the task for execution.
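A sketch of that contract, with hypothetical field and function names (this is not the WQ API): the framework hands back whatever it recovered, and the application computes the remedy.

```python
# Sketch: the framework returns the task with partial outputs and metadata;
# the application inspects them and decides how to resubmit.

from dataclasses import dataclass, field

@dataclass
class TaskResult:
    command: str
    exit_code: int
    stdout: str
    expected_outputs: list
    produced_outputs: list = field(default_factory=list)

def application_remedy(result):
    """Application-side logic: compare expected vs. produced outputs and
    build a corrected command for resubmission."""
    missing = set(result.expected_outputs) - set(result.produced_outputs)
    if not missing:
        return None                     # nothing to fix
    # Here the application "knows" a flag that produces the missing file.
    return result.command + " --write-summary"

r = TaskResult(command="simulate --in data.txt",
               exit_code=1,
               stdout="wrote log.txt; summary step skipped",
               expected_outputs=["log.txt", "summary.txt"],
               produced_outputs=["log.txt"])
resubmit_command = application_remedy(r)
```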

Conclusion
The lesson here is to treat the handling of failures in distributed systems as an integral part of the design and development. One idea to accomplish this is to write down a failure specification that describes (i) the failures the system will handle, (ii) the actions that will be taken to handle them, and (iii) how the failures and the handling actions will impact the overall operation of the system.
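As an illustration, a failure specification could be as simple as structured data enumerating each failure, its type, the handling action, and the impact (the entries below are invented examples, not WQ's actual specification):

```python
# Sketch: one possible shape for a failure specification, expressed as
# structured data that can be reviewed alongside the functional spec.

FAILURE_SPEC = [
    {"failure": "input file missing at master",
     "type":    "application-level",
     "action":  "return task to application with an error",
     "impact":  "task is not dispatched; workload continues"},
    {"failure": "worker disconnects mid-task",
     "type":    "system-level",
     "action":  "retry the task on a different worker",
     "impact":  "task latency increases; no application involvement"},
    {"failure": "memory allocation fails at master",
     "type":    "resource",
     "action":  "decline new work; continue with allocated resources",
     "impact":  "graceful degradation of throughput"},
]

# Every entry answers the same four questions.
REQUIRED_FIELDS = {"failure", "type", "action", "impact"}
assert all(set(entry) == REQUIRED_FIELDS for entry in FAILURE_SPEC)
```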

June 3, 2012

Evolution of High Performance Computing

I was at CondorWeek 2012 as part of our CCL contingent where I had the pleasure of listening to an array of fascinating talks on Distributed and High Performance Computing.

At the end of one such talk, my mind started plotting the history and evolution of High Performance Computing. Here is what I found on its evolution:

(Before I proceed, for clarity, here is the definition I use for high performance computing (HPC): HPC refers to the technologies developed for the class of programs that are too large, too resource-intensive, and too long to run on commodity computing systems, such as desktops.)

Age of the Processors:  

 


For the first twenty years of computing, it was all about the processing power available on a motherboard. Moore's Law aided and dictated the continued advances in the processing speed of single-core CPUs.

This trend continued until the rate of growth slowed and power dissipation became a bottleneck. That led to the design of architectures involving multiple processors and multiple cores on each processor.

Multi-processor systems introduced and developed parallel computing. They marked the beginning of high-performance computing, where large processing-heavy programs were decomposed into parallel pieces to take advantage of multi-processor systems.

 

Age of the Clusters:



The need for higher computational power soon began surpassing the available processing speeds and the projected rate of their improvement.

The introduction and quick adoption of computer networking provided a much needed breakthrough for HPC. Higher computational capacity and power were then achieved by connecting multiple dedicated computing systems together. These connected systems were first called batch processing systems and served as the predecessors to cluster computing.

As networking technologies advanced, cluster computing involving hundreds of sophisticated, dedicated computing systems became prevalent for running large, long-running programs.

Age of the Grids & Clouds:

 

 
Over the last two decades, networking speeds and bandwidth have continued to outpace the advancements in processing and disk storage speeds. This has led to wide area networks connecting thousands of computing systems spread over several geographic regions.

These wide area networks consisted either of (a) dedicated computing systems housed in multiple data centers or (b) multi-purpose shared systems whose idle cycles were harvested and made available for consumption (an idea championed by the Condor project at the University of Wisconsin-Madison).

Soon, efforts began tapping into the vast aggregated processing and storage capacity available in these networks, which came to be treated as platforms for running HPC applications. This trend led to the emergence of clouds and grids, whose resources were made available to stakeholders and customers for consumption.

Age of the Software Frameworks:


With the rate of advances in hardware slowing, software frameworks are becoming the agents of the next wave of growth in HPC.

This is so because software frameworks are best positioned to bring together and manage heterogeneous resources from a variety of environments, such as clouds, grids, and clusters, and satisfy the increasing computational and storage needs of users.

These frameworks are also better equipped to provide fault-tolerance and load-balancing, and to handle the complexities of managing several thousand heterogeneous resources.

I am currently involved with the development of one such software framework, Work Queue, which is available as a C, Python, and Perl library.

Some more examples of software frameworks used for HPC are Hadoop, Pegasus, and Taverna.

This age represents the current trends and research directions in HPC.

Thoughts on the Future:


While software frameworks seem to be the path forward in the evolution of HPC, hardware advances cannot be ignored. For instance, GPUs are slowly gaining traction as relatively inexpensive but effective platforms for HPC, as described in this paper.

All said, the future for HPC looks bright as it continues to evolve toward being more economical, powerful, and easy to deploy and run.

January 13, 2012

Elastic Applications in the Cloud

Parallel- and super-computing gained popularity over the last two decades due to the advent of faster processors, memory, and networking hardware. A need to run large-scale high-performance applications was typically satisfied by procuring or gaining access to such high-end hardware.

However, the latter part of the last decade gave birth to cloud computing, which leveraged the long-existing paradigm known as distributed computing. Cloud computing has continued to gain popularity due to its on-demand resource allocation and usage-based pricing model. Its growth has also been helped by several organizations (Amazon, Google, Microsoft, and more) offering cloud platforms for public use.

Cloud computing also presents an excellent alternative for running high-performance computations. However, successfully and efficiently harnessing the scale of resources available in and across multiple cloud platforms requires fault-tolerance. It also requires run-time adaptability: applications must be able to harness resources as they become available and adapt to resource losses and failures at run-time. Applications that dynamically adapt to resource availability are termed elastic. Elasticity also gives applications better fault-tolerance and scalability.

When elastic applications become portable across multiple platforms, their scalability becomes bounded only by the cost of running resources.

So, how do you build elastic applications, or convert existing parallel applications into portable elastic ones? Use a framework like Work Queue. Work Queue abstracts the underlying distributed execution environment and provides an interface for deploying and running applications in the cloud. Examples and details can be found here
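To give a feel for the execution model, here is a minimal, self-contained sketch of the elastic master-worker pattern using only the Python standard library (this is illustrative code, not the Work Queue API): workers join and leave at will, and a task whose worker fails is simply requeued for another worker.

```python
# Sketch: an elastic master-worker loop. The master holds a queue of tasks;
# any number of workers may join; a lost worker's task migrates back into
# the queue and is picked up by a surviving worker.

import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker(flaky=False):
    """Pull tasks until the queue drains. A flaky worker fails its first
    task; that task is requeued and migrates to another worker."""
    handled = 0
    while True:
        try:
            n = tasks.get(timeout=1.0)
        except queue.Empty:
            return                      # worker leaves the pool
        handled += 1
        if flaky and handled == 1:
            tasks.put(n)                # requeue: another worker will run it
            return                      # this worker is lost
        with lock:
            results.append(n * n)       # "run" the task

for n in range(10):
    tasks.put(n)

# Workers can join in any number; here four join and one of them fails.
pool = [threading.Thread(target=worker, kwargs={"flaky": i == 0})
        for i in range(4)]
for t in pool:
    t.start()
for t in pool:
    t.join()
```

Because completion is driven by the queue rather than by any particular worker, adding workers speeds the workload up and losing workers merely slows it down, which is the essence of elasticity.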

Work Queue, by adhering to the design guidelines for cloud computing abstractions and frameworks presented in this paper, allows applications deployed through it to be elastic and executable across multiple cloud platforms simultaneously. As a result, such applications exhibit the excellent scalability that is often required in high-performance computations to gain valuable scientific insights.