C++ Next: 05/01/2011

Wednesday, May 25, 2011

Code Complete by Steve McConnell : Chapter 18

The author focuses on coding style and layout. The author first provides an example code written in three different styles and compare them from the point of view of readability. The author also provides examples where the interpretation of a program is different by a human and a computer just because of difference in perception of humans due to layout of the program.

The author points out an observation from Elliot Soloway and Kate Ehrlich that advanced programmers have strong expectations about what the programs look like and their productivity drops drastically when the expectations are violated. The author also enumerates the objectives of a good layout such as accurately and consistently representing the logical structure of the code, improve readability and withstand modifications.

The author compares a program with a book and emphasizes the usage of white-space for following:

Grouping: White-space should be used to group related statements together.
Blank Lines: Blank lines should be used to divide a groups of related statements into paragraphs. The author sites a study to highlight that optimal number of blank lines in a program should be about 8 to 16 percent.
Alignment:
Indentation: The author sites a study to point out that 2 to 4 spaces of indentation is optimal. Anything above 4 spaces or no indentation reduces comprehension.

The author also emphasize the usage of parenthesis to clarify expressions. The author then describes four different layout styles:

pure blocks
endline layout
emulating pure blocks
begin and end as block boundaries

The author discusses the pros and cons of each layout style and finds endline layout to be very difficult to maintain. The author suggests to either use pure blocks as in Ada or use emulation of pure blocks or begin/end as block boundaries in languages like C/Pascal.

The author then provides some guidelines for layout of control structures. The author suggests to avoid unindented or double-indented begin/end pairs. The author also advises to format single-statement blocks consistently. The author also suggests to avoid gotos.

The author then provides guidelines for layouting of individual statements:

Length of a statement should be restricted to fit in a single page.
Use spaces to enhance readability of array references
Use spaces to enhance readability of arguments
Align groups of related assignments

The author then provides formatting guidelines for continuation lines:

Incompleteness of the statement should be made obvious
While breaking a statement, closely related elements should be kept together
Routine call continuation lines should be indented the standard amount
End of continuation line should be easy to find
Control statement continuation line should be indented by standard amount
Continuation lines should be indented after the assignment operator
Each line should have only one statement.
Avoid statements which have side effects

The author then discusses the trade-off between performance and clarity and correctness of code. The author then provides some guidelines for data declaration:

Align data declarations
Use one data declaration per line
Order declarations sensibly

The author then provides some guidelines on how to layout the comments and routines. The author then provides following guidelines to layout files, modules and programs:

Put each module in a file
Separate routines within a file clearly
Sequence routines alphabetically

The author also provides a standard layout of a source file.

Friday, May 20, 2011

Code Complete by Steve McConnell : Chapter 17

In this chapter, the author discusses issues in writing general control constructs. The author starts discussing boolean expressions and provide following tips:

Use true and false instead of 1 and 0 for boolean expressions.
Compare boolean values to false implicitly
Break complicated tests in partial tests with boolean expressions
Move the complicated tests completely to a boolean function
Use decision tables to replace complicated conditions
Use De Morgan's theorem to simplify boolean expressions
Use parenthesis to clarify boolean expressions

The author suggests to use curly braces or begin/end to mark the boundary of block statements executed inside a loop or in a conditional statement. With regards to null statements, the author suggests to call attention to them and use preprocessor macro for null statements.

The author then provides tips to avoid deep nesting by converting nested ifs to if-then-else or to a case statement.

The author then discusses the power of structured programming and its three main constructs:

Sequence: A set of statements executed in order.
Selection: It is a construct that allows statements to be executed selectively.
Iteration: It is a construct that allows group of statements to be executed multiple times.

The author then provides techniques to emulate structured constructs using goto statement in different programming languages.

The author introduces the concept of cyclomatic complexity or control-flow complexity. The author also provides guidelines on how to reduce it.

Code Complete by Steve McConnell : Chapter 16

The author discusses unusual control structures in this chapter. The author first provides the information about debate going on for the use of goto statements in structured programming languages. The author then provides different ways to rewrite the code in such a way that goto statements are not required:

Rewrite with nested if statements
Rewrite with a status variable

The author then provides guidelines for using return statement:

Minimize the number of return statements
Use a return when it enhances readability

The author then discusses the use of recursion and provides following guidelines:

Make sure the recursion stops
Use safety counters to avoid infinite recursion
Limit recursion to one routine
Keep the usage of stack in check
Don't use recursion for Fibonacci or factorial calculation

Wednesday, May 18, 2011

Paper Review : MapReduce - Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat

This is my first review of a published paper. Over the next few days, I am going to concentrate on papers related to distributed systems that changed the way people write softwares for distributed systems.

This paper is available at Google Research. The authors are Jeffrey Dean and Sanjay Ghemawat.

In this paper, the authors discuss about a programming model MapReduce and its implementation to process and generate large data sets. Programs written in this model are automatically parallelized, and the model takes care of partitioning the data, scheduling program's execution on a set of machines, handling machine failures, and managing the required inter-machine communication. The authors then introduce the programming model where the programmer has to write just two functions:

Map: It takes as input a key/value pair and emits a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
Reduce: It takes an intermediate key and a set of values for that key and typically produces one output for each key.

The authors provide sample implementation of counting the number of occurrences of words in a large set of documents. The authors then provide some more examples like

Distributed grep
Count of URL access frequency
Inverted index
Distributed sort

The authors explain the overall flow of MapReduce as following steps:

The MapReduce library divides the input files into pieces and starts many copies of the program on a set of cluster machines.
One task is master and rest are workers. There are $M$ map tasks and $R$ reduce tasks and the master picks up idle worker and assign either one map or one reduce task.
A worker doing map task reads contents of the corresponding input split. It parses key/value pairs out of the input and sends it to user-defined Map function. The intermediate key value pairs produced by the Map function are buffered.
The buffered intermediate key/value pairs are partitioned into $R$ regions and location of them is passed back to the master who forwards them to reduce workers.
The reduce worker will read the data from the locations provided by the master and sorts the intermediate keys.
The reduce worker iterates over sorted intermediate data and passes each unique intermediate key and the corresponding set of intermediate values to user's Reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
When all the map and reduce tasks complete, the user program returns.

The authors then describe the strategy to handle a worker failure and the master failure. The authors then describe the usage of GFS (Google File System) in reducing the network bandwidth usage by coupling data and tasks on the same machine. The authors then provide details about typical values of $M$ and $R$ used in their own environment. The authors also discuss the problem of stragglers, the machines which take too long to complete the last few map or reduce tasks and the solution in form of backup tasks that the authors provided.

The authors then discuss following extensions to the model:

Partitioning: Instead of default partitioning function that divides the data across $R$ reduce tasks, users are allowed to specify their own partitioning function.
Ordering Guarantee: The library guarantees that the intermediate key/value pairs within a given partition are processed in increasing key order.
Combiner Function: The Combiner function is used to process the output of map tasks on the local machine before the result is sent over the network for reduce tasks to process.
Input and Output types: The library supports reading input data in some standard formats but user can customize it by implementing a reader interface.
Side-effects: Users are allowed to produce auxiliary files as additional outputs of their map/reduce operators.
Skipping Bad Records: When there are bugs in user code that crashes map or reduce operation deterministically, the library keeps track of such failures and skips processing of such record.
Local Execution: Debugging problems in map or reduce is impossible in distributed architecture so the library provides a way to execute the code serially on a single machine and the user can attach debugger to it.
Status Information: The master runs an HTTP server and outputs status information of all the map/reduce tasks, their standard error and standard output files.
Counters: The library provides facility of creating named counters which can be used to keep track of occurrences of a particular event, e.g. number of German documents indexed, number of capitalized words in a file.

The authors then measure the performance of their library on a particular hardware configuration for two set of examples: grep and sort. They report the performance of sort to be comparable to that reported in TeraSort benchmark.

The authors also provide performance data for effect of backup tasks and also in case of large number of machine failures (200 out of 1745). The authors conclude the paper with some of the applications written in Google with the usage of MapReduce library.

Outside of Google, the MapReduce library has been implemented as open source in Apache Hadoop project. Amazon has come up with Elastic MapReduce which uses Apache Hadoop to provide facility to develop applications that can process large sets of data.

Some recent comments by one of the ex-googlers on this is available at Ex-Google Engineer Says the Company's Software Infrastructure is Obsolete.

Tuesday, May 17, 2011

Code Complete by Steve McConnell : Chapter 15

The author discusses about loops in this chapter. The author provides different kinds of constructs available in different programming languages to implement looping. The author also provides examples where each kind of looping construct, loop with condition at the start, loop with condition at the end, loop with exit loop, for loop etc..., can be used.

The author suggests to minimize the number of factors that affect the loop and treat the inside of a loop as a routine.
The author then provides following guidelines for entering a loop:

Enter the loop from one location
Put initialization code just before the loop

The author also suggests following guidelines

Use infinite loop construct as per the language
Correctly choose whether to use while or for loop
Keep loop housekeeping scores at either the beginning or the end of the loop
Make each loop perform only one function
Don't play with the loop index for the loop to terminate
Avoid code that depends on loop index's final value
Make loop terminating condition obvious
Make sure the loop exits
Use safety counter that ensures exit from the loop

The author suggests that, for every loop, the programmer should check three cases of interest: the first case, an arbitrary middle case and the last case. The author then provides following guidelines for using loop variables:

Use ordinal or enumerated types instead of floats for limits on loops
Use meaningful variable names to make nested loops readable and avoid index cross-talk
Make loops short enough to view all at once
Limit loop nesting to three levels
Document long loops

The author then provides a way to easily write loops. Languages which do not provide operations on arrays, the author compares loops as single operations on a group of array elements.

Monday, May 16, 2011

Code Complete by Steve McConnell : Chapter 14

In this chapter, the author provides guidelines for writing conditional statements as follows:

Write nominal paths through the code first then the exceptions.
Branch correctly on inequality, i.e. use $<$ and $<=$ correctly.
Put normal case in if condition rather than in else.

The author then provides some guidelines in writing chains of if-then-else statements:

Simplify complicated tests with boolean function calls
Put the most common case first
Make sure all the cases are covered
If the language allows, replace the chain with other constructs such as case/switch

The author then provides some guidelines to write case statements:

Order case statements sorted alphabetically or numerically
Order cases by frequency
Put the normal case first
Keep actions of each case simple
Don't use phony variables in order to use case statement
Use the default clause to detect errors

Code Complete by Steve McConnell : Chapter 13

The author moves away from data to focusing on how to organize control flow in the code. The author provides following guidelines:

* Organize code so that dependencies are clear.
* Name routines so that dependencies are obvious.
* Use routine parameters to make dependencies clear.
* Document unclear dependencies.

The author advises to keep the code in such a way that it maintains the flow readable from top to bottom.

The author defines code between references of a variable as "window of vulnerability" and addition of new code in this area may inadvertently alter the variable. Because of this, the author suggest to keep references to a variable localized.
The author defines span as the number of lines of code between references of two variables. The authors advice the readers to compute the span of all the variables and get an average and organize the code in such a way so as to keep it as low as possible.
This enhances the readability of the code since the person can just focus on one section at a time.

The author then defines the concept of variable live time, the total number of statements over which a variable is live, i.e. from the statement where the variable is referenced first to the statement when it is referenced last. The author suggests to keep average value of this over all the variables low. This increases the readability and reduces the window of vulnerability.

The author also emphasizes the importance of putting related statements together or transfer them to create a newer smaller routine.

Saturday, May 14, 2011

Code Complete by Steve McConnell : Chapter 12

The author discusses tips to use user-defined data types. The authors suggest to use structured data to simplify operations on blocks of data and clarify data relationships. Structured data simplify parameter passing and reduce maintenance.

The authors then discuss the benefits of using table look-ups instead of complicated conditional statements. The authors discuss three techniques to look up an entry in the table:

Direct Access - The author discusses two ways, directly using data as a key to table and the other is fudging the table-lookup keys.
Indexed Access - This works similar to handling of virtual memory address to physical memory address in operating systems.
Stair-step Access - The entries in a table are valid for ranges of data. The authors also provide some suggestions on using the technique effectively.

The authors then discuss the advantages of using abstract data types (ADTs) such as:

Hide implementation details
It's easier to improve performance
Changes are localized
The program becomes self-documenting

The authors then give some examples of ADTs for real world entities and also some guidelines of designing ADTs.

Code Complete by Steve McConnell : Chapter 11

This chapter provides some tips to use primitive data types.

Avoid using number literals directly. Use named constants instead. This makes code easily modifiable and readable.
Anticipate divide by zero errors.
Use type conversions explicitly.
Avoid mixed type comparisons.
Never ignore compiler warnings.
Take proper care for integer divisions.
Check for integer overflow.
Check for overflow in intermediate results.
Avoid additions and subtractions of floating point numbers with a very different magnitude.
Never do equality comparison between floating point numbers.
Anticipate rounding errors in floating point numbers.
Avoid magic characters and strings
Avoid off-by-one errors
Initialize strings to null and use strncpy instead of strcpy to avoid endless strings
Use arrays of characters instead of pointers
Use boolean variables to document conditions
Enumerated types instead of integer or string literals increase readability
Enumerated types in some languages are type-checked which increases reliability
Enumerated types increase modifiability/maintainability
Have the first enumerated value as invalid
Check that array indexes are within the bounds
Use correct subscripts for multi-dimensional arrays
Isolate pointer operations in routines
Check pointers and variables referenced by the pointers before using it
Set pointers to NULL after freeing them

Tuesday, May 10, 2011

Reversi (Othello)

I had developed the game of Reversi (Othello) during my BTech from 1999 till 2002. The earlier versions of the game was in DOS where user was supposed to enter row/column to make a move. I changed it to cursor based implementation and then migrated the whole code to Windows. I also added Windows Help which included history, rules and information about the algorithms I had used in deciding the best move for the computer. There were three different algorithms used:

The move which captures the highest number of opponent's balls - this was very preliminary
Each cell was given a priority with corner cells given higher priority than the middle ones and the next move was decided based on this priority.
A minimax search tree based approach where I was creating a tree of moves by both the players and was trying to get the best move that results in a win ultimately.

During my BTech Thesis in final year, I extended the game such that the games played would be stored in a knowledge base and in deciding the next move, the computer will refer the knowledge base to see if the move has resulted in losing the game in the past. This reduced the points for a particular move.

Recently I thought of learning programming in Android and downloaded the Android SDK. What is the best way to learn other than to implement Reversi on Android? But I am late. I found there are lots of Reversi games available here. The best I got more information is from Aart Bik. Aart is a Software Engineer at Google with a nice blog and a website. He has developed chess as well as checkers too.

His website gives details about perft, a method to verify the correctness of the move generation engine and different board representations. It also talks about protocols UCI, XBoard and Winboard that completely separates game engine from the user interface. During my own development of Reversi, I had never thought about doing all this standardization. Lucky me to know it today!