Monday, June 20, 2011

Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein : Chapter 11

The authors introduce the concept of hash tables and the operations on hash tables. The authors then discuss several ways hash tables can be implemented in different scenarios:
  • Direct-Address Tables: This is useful in cases where the set of possible key values is small, and all three operations, i.e. INSERT, DELETE and SEARCH, can be implemented in $ O(1) $ time. The table has $ m $ slots, one for each of the $ m $ possible key values, and key $ k $ is stored in slot $ k $.
  • Hash Tables: This is useful when the universe of key values is very large. A hash function is used to map key $ k $ to slot $ h(k) $; $ h(k) $ is called the hash value of key $ k $. When two keys $ k_1 $ and $ k_2 $ map to the same hash value, a collision is said to occur. The authors discuss the method of chaining to resolve collisions: the elements that map to the same hash value are put in a linked list. The authors then analyze the average running time of a hash table look-up. They define simple uniform hashing as the assumption that any element is equally likely to hash into any slot, independently of other elements. They then prove that for both successful and unsuccessful searches, the average running time under simple uniform hashing is $ \Theta (1 + \alpha) $, where $ \alpha $ is called the load factor and is defined as $ n / m $; here $ n $ is the total number of elements stored in the hash table and $ m $ is the number of hash table slots.
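The chaining scheme above can be sketched in Python roughly as follows (my own illustration, not the book's pseudocode; `_hash` stands in for any hash function mapping keys to slots $ 0, ..., m-1 $, and Python lists play the role of the linked lists):

```python
class ChainedHashTable:
    """Hash table with collision resolution by chaining (a sketch)."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def _hash(self, key):
        # Stand-in for h(k); any function mapping keys to 0..m-1 works here.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        # Walks the chain for slot h(key); expected Theta(1 + alpha) time.
        for k, v in self.slots[self._hash(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        j = self._hash(key)
        self.slots[j] = [(k, v) for k, v in self.slots[j] if k != key]
```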
The authors then discuss three different methods of designing hash functions that perform well under the simple uniform hashing assumption, as below:
  • The division method: The hash function is defined as $$ h(k) = k \text{ mod } m $$ The authors then discuss good choices of $ m $ and advise not to use $ m = 2^{p} $ or $ m = 2^{p} - 1 $, since these choices make the hash value depend on only part of the key.
  • The multiplication method: For a constant $ 0 < A < 1 $, the hash function is defined as $$ h(k) = \lfloor m (kA - \lfloor kA \rfloor ) \rfloor $$ The authors show that the value of $ m $ is not critical in this method and also give some suggestions on how to choose $ A $.
  • Universal Hashing: If the hash function is known in advance, the input can be chosen so as to increase the average search time of the hash table. To avoid such a situation, a set of hash functions is implemented and one of them is chosen at random when the program starts. Let $ \mathbb{H} $ be a collection of hash functions which map keys from a universe $ U $ to the range $ \{0,1,...,m-1\} $. Such a collection is called universal if for each pair of distinct keys $ k,l \in U $, the number of hash functions $ h \in \mathbb{H} $ for which $ h(k) = h(l) $ is at most $ | \mathbb{H} | / m $. The authors then prove that this gives constant search time on average.
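As an illustration, here is a rough Python sketch of the three methods (my own code, not the book's; the particular constants, such as the value of $ A $ and the prime used in the universal family, are choices made for this example):

```python
import math
import random

def hash_division(k, m):
    # Division method: h(k) = k mod m.
    # m should be a prime not too close to a power of 2.
    return k % m

# One common suggestion for A is the fractional part of the golden ratio.
A = (math.sqrt(5) - 1) / 2

def hash_multiplication(k, m, a=A):
    # Multiplication method: h(k) = floor(m * (k*A mod 1)).
    frac = (k * a) % 1.0          # fractional part of k*A
    return int(m * frac)

def make_universal_hash(m, p=2**61 - 1):
    # Draw one function from the family h_{a,b}(k) = ((a*k + b) mod p) mod m,
    # where p is a prime larger than any key; this family is universal.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m
```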
Besides chaining, the other method of resolving collisions is open addressing. All elements are stored in the hash table itself: when a collision occurs, successive slots are examined until an empty one is found, instead of placing the key in a linked list. Each such examination of a slot is called a probe, and the authors provide several methods to implement probing:
  • Linear Probing: This uses an auxiliary hash function $ h'(k) $, and the hash function for the $ i $-th probe is defined as $ h(k,i) = (h'(k) + i) \text{ mod } m $.
  • Quadratic Probing: This uses a hash function of the form $ h(k,i) = ( h'(k) + c_1 i + c_2 i^{2} ) \text{ mod } m $.
  • Double Hashing: This uses a hash function of the form $ h(k,i) = ( h_1(k) + i h_2(k)) \text{ mod } m $.
The authors then prove average-case running times for insertion, successful search and unsuccessful search in an open-address hash table.
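A rough Python sketch of open addressing with linear probing (again my own illustration; built-in `hash` stands in for the auxiliary function $ h'(k) $, and deletion is omitted, since deleting from an open-address table requires marking slots rather than simply emptying them):

```python
class OpenAddressTable:
    """Open-addressing hash table with linear probing (a sketch)."""

    def __init__(self, m=8):
        self.m = m
        self.table = [None] * m    # all elements live in the table itself

    def _probe(self, key, i):
        # Linear probing: h(k, i) = (h'(k) + i) mod m.
        return (hash(key) + i) % self.m

    def insert(self, key, value):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.table[j] is None or self.table[j][0] == key:
                self.table[j] = (key, value)
                return j
        raise RuntimeError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.table[j] is None:
                return None        # an empty slot ends the probe sequence
            if self.table[j][0] == key:
                return self.table[j][1]
        return None
```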

The authors then discuss perfect hashing, where the worst-case running time of a search is $ O(1) $. Perfect hashing applies in cases where the set of keys is static and does not change. The implementation uses two levels of hash functions, where collision resolution is done by using a secondary hash table to store the keys that collide in a primary slot. The authors then prove some results on the running time and space requirements of perfect hashing.
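The two-level idea can be sketched in Python as below (my own illustration of the scheme, assuming non-negative integer keys smaller than the prime `P`; a primary slot holding $ n_j $ keys gets a secondary table of size $ n_j^{2} $, and that slot's hash function is redrawn at random until it is collision-free):

```python
import random

P = 2**61 - 1  # a prime assumed larger than any key in this sketch

def _draw(m):
    # One random function from the universal family ((a*k + b) mod P) mod m.
    a, b = random.randrange(1, P), random.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

def build_perfect(keys):
    # Primary level: hash the n keys into n slots.
    n = len(keys)
    h1 = _draw(n)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)
    # Secondary level: one quadratic-size, collision-free table per slot.
    secondary = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:
            h2 = _draw(size) if size else None
            table = [None] * size
            ok = True
            for k in bucket:
                j = h2(k)
                if table[j] is not None:   # secondary collision: redraw h2
                    ok = False
                    break
                table[j] = k
            if ok:
                secondary.append((h2, table))
                break
    return h1, secondary

def contains(key, h1, secondary):
    # O(1) worst-case lookup: one primary probe, one secondary probe.
    h2, table = secondary[h1(key)]
    return bool(table) and table[h2(key)] == key
```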
