August 31, 2012

Tools for literature review in Computer Science

I am currently reviewing literature to survey research related to my work. Since this process requires time and effort, and often stretches over several months, I set out to explore ways to streamline it and make it more efficient.

Below, I have listed the tools and resources I found helpful.

First, I found these to contain practical advice and guidance on how to read and assimilate a research paper:
  1. How to read a paper by S. Keshav
  2. How to read a CS Research Paper by Philip W. L. Fong
From the references in Keshav's paper, I found this Literature Review Matrix designed by Iain H. McLean. The matrix provides a simple and elegant template to help break down and absorb the contents of a research paper as described by Keshav.
 
I then modified this matrix to incorporate the guidance in Fong's paper for reading publications in computer science and engineering. You can download them here: pdf and doc

I prefer reading papers in print rather than on screen. So I print a copy of this matrix, fill it in, and clip it to the front of every research paper I read and review. This makes it easy to refer back to a paper and quickly recollect its contents.

Finally, I found citeulike to be an excellent tool for importing and storing publications from online sources such as the IEEE and ACM digital libraries. It lets me quickly import (through their bookmarking button), record, and mark a publication for review. This is especially useful when I am looking through the citation list of a publication and want to record the citations that look interesting for later review. I also use the site to archive and label reviewed publications so I can quickly export them into the bibliography section when writing a paper.

Further, citeulike offers several other useful features, such as exporting all your stored references in standard formats (such as BibTeX) and searching for similar or related publications. All of the above features are currently available in its free version.

These tools have greatly helped me streamline the literature review process.

July 31, 2012

Interesting (Computer Science) Blogs

I subscribe to various blogs related to Computer Science, Software, Lifestyle, and Technology. I have compiled here the list of blogs and sites that I follow and have found to be educational, inspirational, and useful.

1. Study Hacks

This blog is authored by Cal Newport, an assistant professor and computer scientist. He has also written books that study and profile the secrets of high performers. In this blog, he describes his thoughts and observations on success, performance, and productivity, such as the application of deliberate focus and the methodical construction of remarkable careers and lives.

2. Matt Might's blog

Matt Might is an assistant professor who blogs about a variety of topics ranging from CS to productivity to useful hacks to healthy living. He has a crisp writing style, with short sentences and paragraphs. A good read for people interested in CS, software, programming, and learning.

3. Prof. Douglas Thain's blog

My advisor's blog! :) He talks about the problems facing users in building, deploying, and running workloads and applications on distributed systems. He also describes the research we do in studying and devising solutions for these problems, and the software products we build and maintain for users to successfully navigate the challenges of using distributed systems.

4. All Things Distributed

This is a blog maintained by Amazon's CTO, Werner Vogels, who is also known as one of the architects of the Amazon AWS platform. He mostly covers topics and news related to the Amazon AWS platform, but his recent posts on the seminal CS papers he reads over the weekend take you back to the foundations of Computer Science and Engineering.

5. Mashable

A site that curates everything related to technology. It covers and aggregates every aspect of technology, ranging from news, product reviews, new announcements and offerings, and recommendations to entrepreneurship and tips and tricks. It helps me stay current with technology and the businesses powered by it.

6. Google Research Blog

Google uses this blog to post on their research and its impact inside and outside of the company. It gives me a peek into how Google approaches and does research on a large scale.

7. Living in an Ivory Basement

A blog about "big data", bioinformatics, Python, and software, written by Titus Brown, a professor at MSU. The blog has good insights into the challenges that biology and bioinformatics scientists face even as access to greater computing capacity becomes easier.

8. Volatile and Decentralized 

This blog is maintained by Matt Welsh, who left a faculty position for industry. His posts touch various topics on research, academia, systems, and software.

9. Google Blog

Google's official blog is regularly updated with news, announcements, insights, tutorials and tips, interviews, and product updates. I use it to see how Google communicates some of its complicated products, technologies, and decisions to the masses.

10. Peter Norvig's blog

Peter Norvig is the Director of Research at Google. His blog is heavy on Artificial Intelligence, but there are several useful posts and essays on programming. It is also interspersed with posts such as this.

11. profserious

This blog by Anthony Finkelstein, a professor at University College London, covers a variety of topics ranging from graduate school and academia to software engineering, programming, and entrepreneurship.

12. Software & Engineering

Carlos Oliveira's blog focuses on C and C++ programming along with observations on software engineering and practices.

June 3, 2012

Evolution of High Performance Computing

I was at CondorWeek 2012 as part of our CCL contingent, where I had the pleasure of listening to an array of fascinating talks on Distributed and High Performance Computing.

At the end of one such talk, my mind started plotting the history and evolution of High Performance Computing. Here is what I found:

(Before I proceed, for clarity, here is the definition I use for high performance computing (HPC): HPC refers to the technologies developed for the class of programs that are too large, too resource-intensive, or too long-running to execute on commodity computing systems such as desktops.)

Age of the Processors:

For the first twenty years of computing, it was all about the processing power available on a motherboard. Moore's Law aided and dictated the continued advances in the processing speed of a single-core CPU.

This trend continued until the rate of growth slowed and power dissipation became a bottleneck. That led to the design of architectures involving multiple processors and multiple cores on each processor.

Multi-processor systems introduced and developed parallel computing. This marked the beginning of high-performance computing, where large, processing-heavy programs were decomposed into parallel pieces to take advantage of multi-processor systems.

Age of the Clusters:

The need for higher computational power soon began surpassing the available processing speeds and the projected rate of their improvement.

The introduction and quick adoption of computer networking provided a much needed breakthrough for HPC. Higher computational capacity and power were then achieved by connecting multiple dedicated computing systems together. These connected systems were first called batch processing systems and served as the predecessors to cluster computing.

As networking technologies advanced, cluster computing involving hundreds of sophisticated and dedicated computing systems became prevalent for running large, long-running programs.

Age of the Grids & Clouds:

Over the last two decades, networking speeds and bandwidth have continued to outpace advances in processing and disk storage speeds. This led to wide area networks connecting thousands of computing systems spread across several geographic regions.

These wide area networks consisted either of (a) dedicated computing systems housed in multiple data centers or (b) multi-purpose shared systems whose idle cycles were harvested and made available for consumption (an idea championed by the Condor project at the University of Wisconsin-Madison).

Soon, efforts began tapping into the vast aggregate processing and storage capacity available in these networks, which came to be treated as platforms for running HPC applications. This trend led to the emergence of grids and clouds whose resources are made available to stakeholders and customers for consumption.

Age of the Software Frameworks:

With the rate of advances in hardware slowing, software frameworks are becoming the agents of the next wave of growth in HPC.

This is because software frameworks are best positioned to bring together and manage heterogeneous resources from a variety of environments, such as clouds, grids, and clusters, and to satisfy the increasing computational and storage needs of users.

These frameworks are also better equipped to provide fault tolerance and load balancing, and to handle the complexities of managing several thousand heterogeneous resources.

I am currently involved in the development of one such software framework, Work Queue, which is available as a C, Python, and Perl library.

Some more examples of software frameworks used for HPC are Hadoop, Pegasus, and Taverna.

This age represents the current trends and research directions in HPC.

Thoughts on the Future:

While software frameworks seem to be the path forward in the evolution of HPC, hardware advances cannot be ignored. For instance, GPUs are slowly gaining traction as relatively inexpensive but effective platforms for HPC, as described in this paper.

All said, the future for HPC looks bright as it continues to evolve toward being more economical, powerful, and easy to deploy and run.

April 25, 2012

Tips for a technical presentation

Recently, I had to give an introductory tutorial on debugging techniques for the Fundamentals of Computing class at the University of Notre Dame.

Around this time, I was reading Dale Carnegie's best-seller on public speaking and I decided this tutorial session was an opportunity to apply some of the methods discussed in the book.

I planned on using a PowerPoint-style presentation, which I had found to be useful and effective for such technical talks. The link to the PDF of the presentation is here.

Here are the methods from the book that I incorporated in the presentation, and how I found them to be beneficial:
  
1. Visualize: Using pictures in my presentation offered several benefits. The key ones were:
  • It made it easier to relate the key points in the talk to normal occurrences and actions in our daily lives.
  • It made it easier to keep the audience engaged. I have realized that nothing grabs the mind's attention like a picture that draws the eyes to it.
  • It allowed me to evoke certain thoughts and feelings in the audience's mind. For example, a well-formatted program is easy to search, move around in, add new items to, throw out old and obsolete items from, and show to your peers, just like the well-organized closet shown in the talk slides above!
  • It allowed me to weave a story and add anecdotes around some of the key points in the presentation. For example, construction of the Sydney Opera House began before all the designs and details were finalized, leading to extreme cost overruns and delays, which illustrated the perils of starting without a design or blueprint.
2. Relate to audience: I never really understood the importance of this until I listened to a lawyer speaking to a group of software engineers. I noticed how he held the attention of everyone in the room for the entire hour. The reason: he was using processes and concepts common in the engineering field as analogies to explain laws and legal practices!

Here is what I did and found to be helpful in my presentation:
  • My audience was a young, fun, sports-loving bunch, eager to explore the world, that grew up watching The Simpsons. So I had pictures that either looked familiar or appealed to their interests and sensibilities.
3. Use examples: My colleague recently shared a study showing that practice is necessary for learning facts, whereas worked examples are necessary for learning skills. This was especially important in my talk since debugging is a skill that is essential to becoming a good software engineer.

I incorporated examples throughout my presentation by:
  • Having sample programs and code illustrated in the slides where possible to describe the techniques being discussed.
  • Having links to sample programs that the audience could download, run, and debug using the tools and techniques described in the presentation. I used the last 10 minutes of my presentation to have them run a sample program and use some of the described tools to identify and fix its errors.
In the end, it was not only my audience who learned something new! I learned, and came to appreciate, the effectiveness of the above methods for speaking to an audience and engaging them in a technical presentation.


February 28, 2012

Import modules in Python

I was factoring some Python code into modules, and it piqued my curiosity about how they work. In particular, how does importing a module work? And more particularly, how are circular dependencies between modules handled? For example, if mod_a imports mod_b and mod_b in turn imports mod_a, what happens?

So, I decided to dig deeper. Here are two simple modules:

mod_a.py:
import mod_b

var_a = 10
print "var_a seen from mod_a is: %d" % (var_a)
print "var_b seen from mod_a is: %d" % (mod_b.var_b)

mod_b.py:
import mod_a

var_b = 100
print "var_a seen from mod_b is: %d" % (mod_a.var_a)
print "var_b seen from mod_b is: %d" % (var_b)

What happens when you run mod_a.py? It fails at the statement printing mod_b.var_b, despite importing mod_b at the beginning.
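
Concretely, running python mod_a.py (with Python 2) prints the first message from mod_a and then stops with an error along these lines (full traceback omitted):

var_a seen from mod_a is: 10
...
AttributeError: 'module' object has no attribute 'var_b'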

This is why it fails: when mod_a is run, it first imports mod_b. In Python, when a module is imported, the interpreter first checks whether that module has already been loaded (it keeps a cache of loaded modules in sys.modules). If not, the imported module is evaluated line by line and loaded into memory. Note that if this module imports other modules, the import process becomes recursive; that is, at each import statement, evaluation of the current module pauses and control transfers to the module being imported.
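
You can see this cache, sys.modules, at work with a small standalone sketch (I use the standard json module here purely as an example of a module that is usually not yet loaded):

import sys

# Before an explicit import, json is usually not in the module cache.
print "json loaded? %s" % ('json' in sys.modules)
import json
# After the import, the module object sits in sys.modules.
print "json loaded? %s" % ('json' in sys.modules)
# A repeated import does not re-evaluate json; it just reuses the cached object.
import json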

In the above example, when mod_a starts executing, it encounters the import of mod_b into its namespace, so evaluation transfers to mod_b. However, mod_b in turn imports mod_a into its namespace, so evaluation moves to mod_a. In this evaluation of mod_a, the import of mod_b is effectively skipped: mod_b is already registered in the module cache as being imported, though it is only partially initialized at this point, so Python simply binds the existing module object rather than evaluating mod_b all over again (which would otherwise recurse endlessly!).

During the evaluation of mod_a, var_a is initialized and printed. When it gets to printing the value of var_b, which is assigned in mod_b, it fails, complaining that there is no attribute named var_b in mod_b. See why? The evaluation of mod_b never reached the assignment of var_b.

Now try moving the var_b assignment in mod_b.py to before the import mod_a statement, as shown below. You will see the print statements from mod_a, followed by the print statements from mod_b, finally followed by the print statements from mod_a again. The execution flow is as follows: mod_a, run as the top-level script (__main__), transfers to the evaluation of mod_b, which in turn transfers to the evaluation of mod_a as an imported module. When these evaluations complete, control flows back to the original invocation of mod_a, whose statements then run once more in the script's own namespace, which is why its messages appear a second time.
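
With that change, mod_b.py looks like this (mod_a.py stays the same), and running python mod_a.py should produce output along the lines shown after it:

mod_b.py (reordered):
# var_b is now assigned before mod_a is imported, so it already exists
# by the time mod_a's evaluation asks for mod_b.var_b.
var_b = 100
import mod_a
print "var_a seen from mod_b is: %d" % (mod_a.var_a)
print "var_b seen from mod_b is: %d" % (var_b)

Expected output:
var_a seen from mod_a is: 10
var_b seen from mod_a is: 100
var_a seen from mod_b is: 10
var_b seen from mod_b is: 100
var_a seen from mod_a is: 10
var_b seen from mod_a is: 100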

Now I know exactly what goes on under the hood when I import modules in Python.

January 13, 2012

Elastic Applications in the Cloud

Parallel and super-computing gained popularity over the last two decades due to the advent of faster processors, memory, and networking hardware. The need to run large-scale, high-performance applications was typically satisfied by procuring or gaining access to such high-end hardware.

However, the latter part of the last decade gave birth to cloud computing, which leveraged the long-existing paradigm of distributed computing. Cloud computing has continued to gain popularity due to its on-demand resource allocation and usage-based pricing model. Its growth has also been helped by several organizations (Amazon, Google, Microsoft, and more) offering cloud platforms for public use.

Cloud computing also presents an excellent alternative for running high-performance computations. However, successfully and efficiently harnessing the scale of resources available in and across multiple cloud platforms requires fault tolerance. In addition, it requires run-time adaptability to available resources. That is, applications must be able to harness resources as they become available and adapt to resource losses and failures at run time. Applications that dynamically adapt to resource availability in this way are termed elastic. Elasticity also gives applications better fault tolerance and scalability.

When elastic applications become portable across multiple platforms, their scalability is bounded only by the cost of the resources they run on.

So, how do you build portable elastic applications, or convert existing parallel applications into them? Use a framework like Work Queue. Work Queue abstracts the underlying distributed execution environment and provides an interface for deploying and running applications in the cloud. Examples and details can be found here.
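
To give a flavor of what this looks like, below is a rough sketch of a Work Queue master program written with its Python library. The command, file names, and port are made up for illustration, and the exact method names may differ slightly between CCTools releases, so treat it as a sketch of the pattern rather than a verbatim recipe: describe tasks, submit them to the queue, and collect results as workers become available.

from work_queue import WorkQueue, Task

# Create a queue listening on a (hypothetical) port for workers to connect to.
q = WorkQueue(port=9123)

# Describe and submit one task per input file.
for i in range(100):
    t = Task("./simulate < input.%d > output.%d" % (i, i))
    t.specify_input_file("simulate")        # the executable, shipped to each worker
    t.specify_input_file("input.%d" % i)    # per-task input
    t.specify_output_file("output.%d" % i)  # per-task output, fetched back to the master
    q.submit(t)

# Collect completed tasks. The pool of workers can grow or shrink while this
# loop runs, which is what makes the application elastic.
while not q.empty():
    t = q.wait(5)
    if t:
        print "task %d finished with return status %d" % (t.id, t.return_status)

Workers are then started separately (for example, with the work_queue_worker command) on clusters, grids, or cloud instances, and the same master program scales up or down with however many workers connect.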

Work Queue, by adhering to the design guidelines for cloud computing abstractions and frameworks presented in this paper, allows applications deployed through it to be elastic and executable across multiple cloud platforms simultaneously. As a result, such applications exhibit the excellent scalability that high-performance computations often require to gain valuable scientific insights.