Hashing, Hash Functions, Hashed Access Characteristics, Mapping functions, Open addressing

Database Management System (CS403)
VU
Lecture No. 36
Reading Material
"Database Management Systems", 2nd edition, Raghu Ramakrishnan, Johannes
Gehrke, McGraw-Hill
Database Management, Jeffery A Hoffer
Overview of Lecture
o Hashing
o Hashing Algorithm
o Collision Handling
In the previous lecture we discussed file organization and techniques of file
handling. In today's lecture we will study hashing techniques, their algorithms and
collision handling.
Hashing
A hash function is computed on some attribute of each record; the result of the
function specifies the block of the file in which the record should be placed. Hashing
provides rapid, non-sequential, direct access to records. A key record field is used to
calculate the record address by subjecting it to some calculation, a process called
hashing. For numeric keys that are already in ascending sequence, this might
involve simply using the key as a relative address index from a base storage address
to access records. Most of the time, however, the key field does not contain values in
a sequence that can be used directly as relative record numbers, so it has to be
transformed. Hashing involves computing the address of a data item by applying a
function to its search key value.
A hash function h is a function from the set of all search key values K to the set of all
bucket addresses B.
· We choose a number of buckets to correspond to the number of search key values we will have stored in the database.
· To perform a lookup on a search key value Ki, we compute h(Ki) and search the bucket with that address.
· If two search keys Ki and Kj map to the same address, because h(Ki) = h(Kj), then the bucket at the address obtained will contain records with both search key values.
· In this case we will have to check the search key value of every record in the bucket to get the ones we want.
· Insertion and deletion are simple.
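As a rough, hedged illustration of these points (not code from the lecture), the sketch below maps search key values to buckets with a simple modulo hash; the bucket count, function names, and record layout are assumptions made for this example.

NUM_BUCKETS = 10                              # chosen to roughly match the number of keys expected

def h(search_key):
    """Hash function from a search key value to a bucket address in 0..NUM_BUCKETS-1."""
    return hash(search_key) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]    # each bucket holds (search_key, record) pairs

def insert(search_key, record):
    buckets[h(search_key)].append((search_key, record))

def lookup(search_key):
    # the bucket may also hold records whose keys merely collide with the search key,
    # so the stored key of every record in the bucket must still be compared
    return [rec for key, rec in buckets[h(search_key)] if key == search_key]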
Hash Functions
A good hash function gives an average-case lookup that is a small constant,
independent of the number of search keys. We hope records are distributed uniformly
among the buckets. The worst hash function maps all keys to the same bucket. The
best hash function maps all keys to distinct addresses. Ideally, distribution of keys to
addresses is uniform and random.
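To make the contrast between the worst and a reasonable hash function concrete, here is a small hedged experiment; the key range and both functions are invented for this example.

from collections import Counter

keys = range(1000, 2000)                 # 1000 sample search key values

good = Counter(k % 13 for k in keys)     # modulo a prime: keys spread roughly evenly
worst = Counter(0 for _ in keys)         # constant function: every key lands in bucket 0

print(sorted(good.values()))             # 13 buckets of about 77 keys each
print(sorted(worst.values()))            # one bucket holding all 1000 keys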
Hashed Access Characteristics
Following are the major characteristics:
· No indexes to search or maintain
· Very fast direct access
· Inefficient sequential access
· Use when direct access is needed, but sequential access is not.
Mapping functions
The direct address approach requires that the function, h(k), is a one-to-one mapping
from each k to integers in (1,m). Such a function is known as a perfect hashing
function: it maps each key to a distinct integer within some manageable range and
enables us to trivially build an O(1) search time table.
Unfortunately, finding a perfect hashing function is not always possible. Let's say that
we can find a hash function, h(k), which maps most of the keys onto unique integers,
but maps a small number of keys onto the same integer. If the number of collisions
(cases where multiple keys map onto the same integer) is sufficiently small, then
hash tables work quite well and give O(1) search times.
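For instance, here is a minimal sketch of a perfect (one-to-one) hashing function, under the assumption that the keys are known to be exactly the integers 100..199; the names and the offset-based function are only an illustration of direct addressing, not part of the lecture.

TABLE_SIZE = 100
table = [None] * TABLE_SIZE              # one slot per possible key

def h(k):
    """Perfect hash for keys known to lie in 100..199: every key gets its own distinct slot."""
    return k - 100

def store(k, record):
    table[h(k)] = record                 # no collision is possible, so no key comparison is needed

def fetch(k):
    return table[h(k)]                   # O(1) direct access

Because h is one-to-one on this key set, no two keys can collide, which is what makes the O(1) table trivial to build.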
Handling the Collisions
In the small number of cases where multiple keys map to the same integer, elements with
different keys may be stored in the same "slot" of the hash table. It is clear that when the hash function
is used to locate a potential match, it will be necessary to compare the key of that element with the
search key. But there may be more than one element that should be stored in a single slot of the table.
Various techniques are used to manage this problem:
· Chaining
· Overflow areas
· Re-hashing
· Using neighboring slots (linear probing)
· Quadratic probing
· Random probing, ...
Chaining:
One simple scheme is to chain all collisions in lists attached to the appropriate slot.
This allows an unlimited number of collisions to be handled and doesn't require a
priori knowledge of how many elements are contained in the collection. The tradeoff
is the same as with linked lists versus array implementations of collections: linked list
overhead in space and, to a lesser extent, in time.
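A minimal chained hash table sketch in Python, assuming a fixed number of slots and invented class and method names; it is meant only to show the chains of colliding (key, value) pairs, not a production implementation.

class ChainedHashTable:
    """Minimal chained hash table: each slot holds a list of (key, value) pairs."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _hash(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        chain = self.slots[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                  # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))        # empty slot or collision: append to the chain

    def get(self, key):
        for k, v in self.slots[self._hash(key)]:
            if k == key:
                return v
        raise KeyError(key)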
Re-hashing:
Re-hashing schemes use a second hashing operation when there is a collision. If
there is a further collision, we re-hash until an empty "slot" in the table is found.
The re-hashing function can either be a new function or a re-application of the
original one. As long as the functions are applied to a key in the same order,
a sought key can always be located.
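One concrete way to realize re-hashing is double hashing, where a second hash function determines the step between successive probes. The sketch below assumes an open-addressed table stored as a Python list and uses invented function names; it shows insertion only, and lookup would follow the same probe order.

def insert_double_hashing(table, key, value):
    """Insert into an open-addressed table, re-hashing with a second function on collision.

    `table` is a fixed-size list whose entries are None or (key, value) pairs.
    """
    m = len(table)
    h1 = hash(key) % m                   # primary hash: the first slot to try
    h2 = 1 + (hash(key) % (m - 1))       # secondary hash: the step between probes, never zero
    for i in range(m):
        slot = (h1 + i * h2) % m         # probe sequence: h1, h1+h2, h1+2*h2, ... (mod m)
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("table is full")  # with a prime table size, all m slots have been probed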
Linear probing:
One of the simplest re-hashing functions is +1 (or -1), i.e. on a collision, look in the
neighboring slot in the table. It calculates the new address extremely quickly and may
be extremely efficient on a modern RISC processor due to efficient cache utilization.
A quadratic re-hash function, which probes slots at quadratically increasing offsets,
can be used in the same way for comparison.
Figure: h(j) = h(k) causes a collision, so the next hash function h1 is used; a second collision occurs, so h2 is used.
Clustering:
Linear probing is subject to a clustering phenomenon. Re-hashes from one location occupy a block of
slots in the table which "grows" towards slots to which other keys hash. This exacerbates the collision
problem and the number of re-hashes can become large.
Overflow area:
Another scheme divides the pre-allocated table into two sections: the primary area, to which keys
are mapped, and an area for collisions, normally termed the overflow area.
When a collision occurs, a slot in the overflow area is used for the new element and a
link from the primary slot is established, as in a chained system. This is essentially the
same as chaining, except that the overflow area is pre-allocated and thus possibly
faster to access. As with re-hashing, the maximum number of elements must be
known in advance, but in this case two parameters must be estimated: the optimum
sizes of the primary and overflow areas.
Of course, it is possible to design systems with multiple overflow tables, or with a
mechanism for handling overflow out of the overflow area, which provide
flexibility without losing the advantages of the overflow scheme.
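A hedged sketch of the overflow-area idea: a fixed primary area plus a pre-allocated overflow area, with colliding elements linked from their primary slot. The sizes, names, and list-based layout are assumptions made for illustration.

class OverflowAreaTable:
    """Primary area plus pre-allocated overflow area; colliding keys chain into the overflow."""

    def __init__(self, primary_size=8, overflow_size=8):
        # each entry is [key, value, next_index] or None; next_index links into the overflow area
        self.primary = [None] * primary_size
        self.overflow = [None] * overflow_size
        self.next_free = 0                       # next unused slot in the overflow area

    def _hash(self, key):
        return hash(key) % len(self.primary)

    def put(self, key, value):
        slot = self._hash(key)
        if self.primary[slot] is None:           # no collision: use the primary slot
            self.primary[slot] = [key, value, None]
            return
        if self.next_free >= len(self.overflow):
            raise RuntimeError("overflow area is full")
        # collision: place the element in the overflow area and link it from the end of the chain
        entry = self.primary[slot]
        while entry[2] is not None:
            entry = self.overflow[entry[2]]
        self.overflow[self.next_free] = [key, value, None]
        entry[2] = self.next_free
        self.next_free += 1

    def get(self, key):
        entry = self.primary[self._hash(key)]
        while entry is not None:
            if entry[0] == key:
                return entry[1]
            entry = self.overflow[entry[2]] if entry[2] is not None else None
        raise KeyError(key)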
Collision handling
A well-chosen hash function can avoid the anomalies that result from an excessive
number of collisions, but it does not eliminate collisions altogether. We need some method for
handling collisions when they occur. We'll consider the following techniques:
· Open addressing
· Chaining
Open addressing:
The hash table is an array of (key, value) pairs. The basic idea is that when a (key,
value) pair is inserted into the array, and a collision occurs, the entry is simply
inserted at an alternative location in the array. Linear probing, double hashing, and
rehashing are all different ways of choosing an alternative location. The simplest
probing method is called linear probing. In linear probing, the probe sequence is
simply the sequence of consecutive locations, beginning with the hash value of the
key. If the end of the table is reached, the probe sequence wraps around and
continues at location 0. Only if the table is completely full will the search fail.
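Following the description above, a minimal linear-probing sketch (the probe sequence is the hash slot followed by consecutive locations, wrapping around at the end of the table); the function names and the list-of-pairs table layout are assumptions.

def probe_sequence(key, table_size):
    """Yield the linear-probe sequence: the hash slot, then consecutive slots, wrapping at the end."""
    start = hash(key) % table_size
    for i in range(table_size):
        yield (start + i) % table_size

def insert_linear(table, key, value):
    """Insert a (key, value) pair; fails only when the table is completely full."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("table is full")

def lookup_linear(table, key):
    """Follow the same probe sequence; an empty slot means the key is absent."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None:
            break
        if table[slot][0] == key:
            return table[slot][1]
    raise KeyError(key)

For example, with table = [None] * 7, two keys that hash to the same slot end up in consecutive locations, and lookup_linear walks the same probe sequence until it finds the key or reaches an empty slot.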
Summary Hash Table Organization:
Organization     Advantages                               Disadvantages
Chaining         Unlimited number of elements;            Overhead of multiple linked lists
                 unlimited number of collisions
Re-hashing       Fast re-hashing; fast access through     Maximum number of elements must be
                 use of main table space                  known; multiple collisions may
                                                          become probable
Overflow area    Fast access; collisions don't use        Two parameters which govern
                 primary table space                      performance need to be estimated