File Organizations: Hashing Algorithm, Collision Handling

<< Data Storage Concepts, Physical Storage Media, Memory Hierarchy

Hashing, Hash Functions, Hashed Access Characteristics, Mapping functions, Open addressing >>

Database Management System (CS403)

Lecture No. 35

Reading Material

"Database Systems Principles, Design and Implementation" written by Catherine Ricardo,

Maxwell Macmillan.

"Database Management System" by Jeffery A Hoffer

Overview of Lecture

o Hashing

o Hashing Algorithm

o Collision Handling

In the previous lecture we have studied about different storage media and the RAID

levels and we started with file organization. In this lecture we will study in length

about different types of file organizations.

File Organizations

This is the most common structure for large files that are typically processed in their

entirety, and it's at the heart of the more complex schemes. In this scheme, all the

records have the same size and the same field format, with the fields having fixed size

as well. The records are sorted in the file according to the content of a field of a scalar

type, called ``key''. The key must identify uniquely a records, hence different record

have diferent keys. This organization is well suited for batch processing of the entire

file, without adding or deleting items: this kind of operation can take advantage of the

fixed size of records and file; moreover, this organization is easily stored both on disk

and tape. The key ordering, along with the fixed record size, makes this organization

amenable to dicotomic search. However, adding and deleting records to this kind of

file is a tricky process: the logical sequence of records tipycally matches their

physical layout on the media storage, so to ease file navigation, hence adding a record

and maintaining the key order requires a reorganization of the whole file. The usual

solution is to make use of a "log file'' (also called "transaction file''), structured as a

260

Database Management System (CS403)

pile, to perform this kind of modification, and periodically perform a batch update on

the master file.

Sequential files provide access only in a particular sequence. That does not suit many

applications since it involves too much time. Some mechanism for direct access is

required

Direct Access File Organization:

For most users, the file system is the most visible aspect of an operating system. Files

store data and programs. The operating system implements the abstract concept of a

file by managing mass storage devices, such as tapes and disks. Also files are

normally organized into directories to ease their use, so we look at a variety of

directory structures. Finally, when multiple users have access to files, it may be

desirable to control by whom and in what ways files may be accessed. This control is

known as file protection. Following are the two types:

Indexed Sequential

Direct File Organization

Indexed sequential file:

An index file can be used to effectively overcome the above-mentioned problem, and

to speed up the key search as well. The simplest indexing structure is the single-level

one: a file whose records are pair's key-pointer, where the pointer is the position in

the data file of the record with the given key. Only a subset of data records, evenly

spaced along the data file, are indexed, so to mark intervals of data records. A key

search then proceeds as follows: the search key is compared with the index ones to

find the highest index key preceding the search one, and a linear search is performed

from the record the index key points onward, until the search key is matched or until

the record pointed by the next index entry is reached. In spite of the double file access

(index + data) needed by this kind of search, the decrease in access time with respect

to a sequential file is significant.

A type of file access in which an index is used to obtain the address of the block

which contains the required record. An index file can be used to effectively to speed

up the key search as well. The simplest indexing structure is the single-level one: a

261

Database Management System (CS403)

file whose records are pairs key-pointer, where the pointer is the position in the data

file of the record with the given key. Only a subset of data records, evenly spaced

along the data file, are indexed, so to mark intervals of data records.

A key search then proceeds as follows: the search key is compared with the index

ones to find the highest index key preceding the search one, and a linear search is

performed from the record the index key points onward, until the search key is

matched or until the record pointed by the next index entry is reached. In spite of the

double file access (index + data) needed by this kind of search, the decrease in access

time with respect to a sequential file is significant.

Consider, for example, the case of simple linear search on a file with 1,000 records.

With the sequential organization, an average of 500 key comparisons are necessary

(assuming uniformly distributed search key among the data ones). However, using

and evenly spaced index with 100 entries, the number of comparisons is reduced to 50

in the index file plus 50 in the data file: a 5:1 reduction in the number of operations.

This scheme can obviously be hyerarchically extended: an index is a sequential file in

itself, amenable to be indexed in turn by a second-level index, and so on, thus

exploiting more and more the hyerarchical decomposition of the searches to decrease

the access time. Obviously, if the layering of indexes is pushed too far, a point is

reached when the advantages of indexing are hampered by the increased storage costs,

and by the index access times as well.

Indexed:

Why using a single index for a certain key field of a data record? Indexes can be

obviously built for each field that uniquely identifies a record (or set of records within

the file), and whose type is amenable to ordering. Multiple indexes hence provide a

high degree of flexibility for accessing the data via search on various attributes; this

organization also allows the use of variable length records (containing different

fields).

It should be noted that when multiple indexes are are used the concept of sequentiality

of the records within the file is useless: each attribute (field) used to construct an

index typically imposes an ordering of its own. For this very reason is typicaly not

possible to use the ``sparse'' (or ``spaced'') type of indexing previously described. Two

262

Database Management System (CS403)

types of indexes are usually found in the applications: the exhaustive type, which

contains an entry for each record in the main file, in the order given by the indexed

key, and the partial type, which contain an entry for all those records that contain the

chosen key field (for variable records only).

Defining Keys:

An indexed sequential file must have at least one key. The first (primary) key is

always numbered 0. An indexed sequential file can have up to 255 keys; however, for

file-processing efficiency it is recommended that you define no more than 7 or 8 keys.

(The time required to insert a new record or update an existing record is directly

related to the number of keys defined; the retrieval time for an existing record,

however, is unaffected by the number of keys.)

When you design an indexed sequential file, you must define each key in the

following terms:

Position and size

Data type

Index number

Options selected

When you want to define more than one key, or to define keys of different data types,

you must be careful when you specify the key fields. The next few subsections

describe some considerations for specifying keys. In Indexed sequential files

following are ensured:

New records are added to an overflow file

Record in main file that precedes it is updated to contain a pointer to

the new record

The overflow is merged with the main file during a batch update

Multiple indexes for the same key field can be set up to increase

efficiency

The diagram of Index sequential file is as under:

263

Database Management System (CS403)

Indexed Sequential Summary:

Following are salient features of Indexed sequential file structure:

Records are stored in sequence and index is maintained.

Dense and nondense types of indexes are maintained.

Track overflows and file overflow areas are ensured.

Cylinder index increases the efficiency .

264

Table of Contents: