Monday, February 7, 2011

Find Top N : Solution 3

The problem is described here. Solution 1 is here. Solution 2 is here. In both the earlier solution, we used some in-built function offered by IBM Lotus Domino XPages Server side Javascript. In this solution, I will just try to find the optimal solution theoretically without using any in-built functions.

In my MTech thesis, I had worked on multi-pattern matching algorithm, Aho-Corasick. The algorithm basically creates a finite state machine from a list of patterns that are to be searched in a document. A sample FSM created for input strings {he, him, his, she, her} is shown in the image to the left. If we modify the state machine in such a manner that it also keeps the count for a particular string in the end state, the machine can actually work for finding top-N strings from a list of strings.

Now about the time and space complexity, you can clearly see here that the time to create the state machine is $O(|S|)$ where $|S|$ is nothing but the cumulative length of all the strings together. In the second pass, we can keep a list of Top $N$ strings with their occurrence count and then go through each end state and insert/replace the string represented by that end state in the Top N strings list if the occurrence count is higher than any of the string inside the list. This second pass can take $O(|S| \times N)$. This makes it the optimal algorithm in terms of time complexity but space required is quite huge, to the tune of $O(|S|)$.

I thought there might be already some work in this area and found burstsort. This is the fastest sorting algorithm found till 2007 and uses trie. It was discovered by Ranjan Sinha. I found that the space requirement reduces a lot by using the technique mentioned in this algorithm. This is the optimal algorithm in terms of both time and space complexity.

No comments:

Post a Comment