Friday, February 25, 2011

Google Interview Question : Find depth of a k-ary tree given a parent array

Problem statement: You are given a k-ary tree as a parent array $P$ such that $P[i], 0 \le i < N$ gives the parent of node $ i $, where $N$ is the total number of nodes in the tree. What is the time and space complexity of finding the depth of the tree?


Solutions 1 and 2 run in $ O (n^2) $ time. Solution 3 gives an $ O(n) $ algorithm.
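A minimal sketch of the $O(n)$ idea: compute each node's depth by walking up the parent links, memoizing depths so that every link is traversed only once. The `P[root] == -1` convention for marking the root is my assumption; the problem statement does not specify it.

```python
def tree_depth(P):
    """Depth of a k-ary tree given a parent array.

    Assumes P[root] == -1 marks the root (a convention, not stated
    in the problem). Memoizing each node's depth makes the whole
    pass O(N) time and O(N) space.
    """
    n = len(P)
    depth = [0] * n          # depth[i] == 0 means "not yet computed"

    def node_depth(i):
        if depth[i]:
            return depth[i]
        if P[i] == -1:
            depth[i] = 1     # root is at depth 1
        else:
            # Walk up: the parent's depth is computed (or reused) first.
            depth[i] = node_depth(P[i]) + 1
        return depth[i]

    return max(node_depth(i) for i in range(n))

# Example: root 0 with children 1 and 2; node 3 under node 1.
print(tree_depth([-1, 0, 0, 1]))   # prints 3
```

For very deep trees the recursion could be replaced by an explicit upward walk to stay within recursion limits; the complexity is unchanged.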

Wednesday, February 16, 2011

Code Complete by Steve McConnell : Chapter 6

A module is a collection of data and the routines that act on that data. The author provides empirical evidence that modularity contributes more to maintainability than structure does and is the most important factor in preventing corrective maintenance. Modular programming also aids comprehension by programmers.

The author emphasizes loose coupling between modules and strong cohesion among the routines belonging to a single module.

The author then lists the benefits of information hiding:

  1. Hide module level secrets: These include
    • Data Structures
    • Algorithms
    • Areas that can change
      • Hardware interaction
      • Data size
      • Input and Output
      • Nonstandard language features
  2. Ease of modification: Changes are localized to the routines of a module when information is hidden from the rest of the program.
  3. Better readability
  4. Increased reliability
The author then lists barriers that prevent the use of information hiding:
  1. Excessive distribution of information: If the implementation of employee records as an array of 1000 elements is known to every part of the program, information hiding is defeated.
  2. Circular dependencies among modules: If more than one module uses the same set of data, the data cannot be kept hidden from the rest of the modules.
  3. Perceived performance penalties: The belief that information hiding will degrade performance deters its use.
The author then lists some of the areas of a program around which modules can be created:
  1. User Interface
  2. Hardware Dependencies
  3. System Dependencies
  4. Data Management
  5. Real world objects and abstract data types
  6. Related Operations
  7. Reusable Code
The author concludes the chapter by showing the ways modules can be implemented in different programming languages.
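As a concrete illustration of information hiding (my example, not the book's; all names are hypothetical), here is a minimal sketch that hides the employee-records representation discussed above behind a small interface:

```python
class EmployeeStore:
    """Hypothetical module illustrating information hiding.

    The internal representation -- a dict keyed by employee id --
    is a module-level secret. Callers see only the operations, so
    switching later to a database table or a sorted array touches
    only this class, not the rest of the program.
    """

    def __init__(self):
        self._by_id = {}          # hidden data structure

    def add(self, emp_id, name):
        self._by_id[emp_id] = name

    def find(self, emp_id):
        # Returns None when the employee is unknown.
        return self._by_id.get(emp_id)

store = EmployeeStore()
store.add(42, "Ada")
print(store.find(42))             # prints Ada
```

Contrast this with the barrier described above: if every part of the program knew the records were "an array of 1000 elements", none of them could change representation independently.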

Thursday, February 10, 2011

Code Complete by Steve McConnell : Chapter 5

The author dissects a low-quality routine and lists the problems associated with it, which include:

  • Bad name
  • No documentation
  • No error check on input data
  • Use of magic numbers instead of constants
  • Unused parameters
  • Modification of global variables
  • Too many parameters
  • Poor parameter ordering
The author emphasizes the importance of routines in computer science and discusses reasons to create a routine:
  • Reducing complexity: A routine hides information and makes it easier to manage complex programs.
  • Minimize code size: Similar code at different places in the program can be put into a routine to reduce size of code.
  • Improve correctness: Code in a routine is likely to be more correct since it is called with different inputs from several places.
  • Improve maintainability: If a defect is discovered, the code needs to be changed at only one place.
  • Limit effects of changes: Isolate areas that are likely to change in future.
  • Hiding sequences: A routine hides the sequences of events and bundles them as a single operation.
  • Improve performance: If a routine's performance is improved, many parts of the code benefit.
  • Hiding implementation: A routine hides the implementation details of an operation, be it its data structures or its algorithms.
  • Promoting code reuse: Code in a routine can be reused in another program more easily.
  • Improve readability: Putting a section of a code in a routine and giving it a proper name improves readability of the code where the routine is invoked.
  • Improving portability: Use of routines isolates non-portable capabilities.
The author provides guidelines to create effective routine names:
  • For a procedure, a verb followed by an object is a good combination, e.g. PrintReport, CheckOrder, etc... For Object Oriented languages, the object is not required in the name because it is already part of the procedure call, e.g. report.Print, order.Check, etc...
  • For a function (which returns a value), the description of the return value should be used as the name, e.g. NextCustomer, CurrentPenColor, etc...
The author suggests including everything a routine does in its name; if the name becomes too long to describe everything the routine does, the routine probably does too much and should be broken into several routines. The author notes that a good routine name averages 15 to 20 characters, against 9 to 15 for a good variable name.

The author lists the acceptable and unacceptable types of cohesion in a routine as below:
  • Acceptable
    • Functional Cohesion: This occurs when a routine does one and only one operation.
    • Sequential Cohesion: This occurs when a routine performs operations in a specific order, sharing data from step to step, and forms only part of a bigger operation.
    • Communicational Cohesion: This occurs when operations in a routine make use of the same set of data and aren't related in any other way.
    • Temporal Cohesion: This occurs when operations are combined into a routine because they are all done at the same time.
  • Unacceptable 
    • Procedural Cohesion: This occurs when a routine performs operations in a specific order but the operations do not share any data.
    • Logical Cohesion: This occurs when several operations are combined into a routine and one of the operations is selected by a control flag that is passed in.
    • Coincidental Cohesion: This occurs when operations in a routine are not at all related.
The author discusses the idea of coupling. The degree of coupling between two routines describes the strength of the connection between them. It is the complement of cohesion, and coupling should be kept as loose as possible. The author then discusses several criteria for evaluating coupling:
  1. Size: Small routines with few parameters are more loosely coupled than large routines or routines with many parameters.
  2. Intimacy: The routines should try to interact directly through parameters and avoid interaction through indirect ways like global variables or database records.
  3. Visibility: The connection between different routines should be prominent and not hidden.
  4. Flexibility: It refers to how easily you can change the connections between routines.
The author gives examples of good and bad coupling.
  1. Simple-data coupling: Two routines are simple-data coupled if all the data passed between them is non-structured and through a parameter list.
  2. Data-structure coupling: Two routines are data-structure coupled if the data passed between them is structured.
  3. Control coupling: Two routines are control coupled if one routine passes data to the other that tells the second routine what to do.
  4. Content (pathological) coupling: Two routines are pathologically coupled if one uses the code inside the other or if one alters local data used inside the other.
The first two kinds of coupling are acceptable while the last two are not.
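To illustrate the distinction (my sketch, not an example from the book), here is control coupling next to simple-data coupling; all function names are hypothetical:

```python
# Control coupling (unacceptable): the caller passes a flag that
# tells the routine what to do internally.
def format_value(value, mode):
    if mode == "currency":
        return "$%.2f" % value
    elif mode == "percent":
        return "%.1f%%" % (value * 100)

# Simple-data coupling (acceptable): each routine does one thing
# and receives only non-structured data through its parameter list.
def format_currency(value):
    return "$%.2f" % value

def format_percent(value):
    return "%.1f%%" % (value * 100)

print(format_currency(3.5))    # prints $3.50
print(format_percent(0.25))    # prints 25.0%
```

Splitting the flag-dispatched routine also removes the logical cohesion described above: each resulting routine is functionally cohesive.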

The author then discusses the ideal length of a routine. Contrary to general belief, the author provides empirical results showing that:
  1. The size of a routine is inversely proportional to the number of errors per line of code: small routines have more errors per line of code than larger routines.
  2. Larger routines were cheaper to develop per line of code.
  3. Code needed to be changed the least when routines averaged 100 to 150 lines of code.
But the author still warns against writing routines larger than 200 lines of code, since they become difficult to understand.

The author discusses defensive programming, a technique in which a routine's behaviour remains defined and consistent even when it receives incorrect input. The author lists several defensive programming techniques:
  1. Assertions: The author suggests using preprocessor macros for assertions and warns against putting executable code in assertions, since assertions may not execute in production code.
  2. Check values of data input and routine parameters: Routines should be coded to verify the range of values for data obtained from a file or user and all the input parameters. A strategy should be in place on the software's behaviour on invalid data.
  3. Exception handling: Unexpected situations/errors should generate exceptions that are obvious during development and recoverable in production code.
  4. Change: Anticipate changes that can happen to a program and code in such a way that least number of routines get affected.
  5. Function return values: Code should always check function return values, even when a call is expected to always succeed.
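A small sketch (hypothetical routine, not from the book) combining two of the techniques above: explicit validation of input data, with a defined behaviour on bad input, and a side-effect-free assertion as an internal sanity check:

```python
def average_age(ages):
    """Hypothetical routine showing two defensive techniques."""
    # Check values of input data: decide up front what happens on
    # invalid data -- here, raise a well-defined error.
    if not ages:
        raise ValueError("ages must be a non-empty list")
    for a in ages:
        if not 0 <= a <= 150:
            raise ValueError("age out of expected range: %r" % a)

    result = sum(ages) / len(ages)

    # Assertion for a condition that should never fail; it has no
    # side effects, since assertions may be stripped in production.
    assert min(ages) <= result <= max(ages)
    return result

print(average_age([20, 30, 40]))   # prints 30.0
```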
The author then provides guidelines for function parameters:
  1. Parameters should be in input-modify-output order.
  2. If several functions have similar parameters, their orders should match.
  3. Never have unused parameters.
  4. Put status or error variable last.
  5. Never change the routine parameters. Take a copy and operate on the copy.
  6. Document assumptions about parameters
    • Whether the parameter is input, modify or output
    • Units (milliseconds, inches, etc.)
    • Range of expected values
    • Special values (e.g. error sentinels) that should never appear
    • Meanings of status codes and error values
  7. Limit the number of parameters to seven
  8. Never assume any parameter passing mechanism (i.e. calling conventions)
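A hypothetical routine (my example, not the book's) illustrating several of the guidelines above: input-modify-output order, status/error data last, and documented assumptions about each parameter:

```python
def scale_readings(factor, readings, errors):
    """Hypothetical signature following the parameter guidelines.

    Parameters are in input-modify-output order:
      factor   (input)  -- positive multiplier
      readings (modify) -- list of sensor values, scaled in place;
                           units: millivolts; expected range 0-5000
      errors   (output) -- appended with indices of readings outside
                           the expected range (error data goes last)
    """
    for i, r in enumerate(readings):
        if not 0 <= r <= 5000:
            errors.append(i)       # record, but keep going
        readings[i] = r * factor

readings = [100, 6000]
errors = []
scale_readings(2, readings, errors)
print(readings, errors)            # prints [200, 12000] [1]
```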
The author ends the chapter with comments on the issues with using macros and how to work around them.

Monday, February 7, 2011

Find Top N : Solution 3

The problem is described here. Solution 1 is here. Solution 2 is here. In both the earlier solutions, we used built-in functions offered by IBM Lotus Domino XPages server-side JavaScript. In this solution, I will try to find the optimal solution theoretically, without using any built-in functions.

In my MTech thesis, I worked on the multi-pattern matching algorithm Aho-Corasick. The algorithm builds a finite state machine from a list of patterns that are to be searched for in a document. A sample FSM created for the input strings {he, him, his, she, her} is shown in the image to the left. If we modify the state machine so that each end state also keeps a count of how many times its string occurs, the machine can be used to find the top-N strings in a list of strings.

Now about the time and space complexity: the time to create the state machine is $O(|S|)$, where $|S|$ is the cumulative length of all the strings together. In the second pass, we keep a list of the top $N$ strings with their occurrence counts, go through each end state, and insert/replace the string represented by that end state in the top-N list if its occurrence count is higher than that of any string already in the list. This second pass takes $O(|S| \times N)$. This makes the algorithm optimal in terms of time complexity, but the space required is quite large, to the tune of $O(|S|)$.
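The counting idea can be sketched without the full Aho-Corasick machinery, since counting whole strings needs only the trie part of the machine, not its failure links. This is my sketch, not code from the thesis; it also uses a size-$N$ heap in the second pass (a variation on the plain top-N list described above, which brings that pass down to $O(U \log N)$ over the $U$ distinct strings):

```python
import heapq

def top_n(strings, n):
    """Top-n most frequent strings via a counting trie.

    Insert every string into a trie, keeping an occurrence count
    at each end node (the 'end states' above); building the trie
    is O(|S|). A second pass walks the end nodes with a size-n heap.
    """
    root = {}
    END = object()                      # sentinel key for the count
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1

    # Walk the trie, collecting (count, string) at every end state.
    heap, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for key, child in node.items():
            if key is END:
                heapq.heappush(heap, (child, prefix))
                if len(heap) > n:       # keep only the n largest counts
                    heapq.heappop(heap)
            else:
                stack.append((child, prefix + key))
    return sorted(heap, reverse=True)

print(top_n(["he", "she", "he", "her", "he", "she"], 2))
# prints [(3, 'he'), (2, 'she')]
```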

I thought there might already be some work in this area and found burstsort, a trie-based algorithm devised by Ranjan Sinha that, as of 2007, was the fastest known algorithm for sorting strings. I found that the space requirement reduces a lot by using the technique mentioned in this algorithm, making it optimal in terms of both time and space complexity.