Tuesday, November 11, 2008

Very Large Files - External Sorting


The previous algorithms have assumed that all data reside in memory. However, there may be

times when the dataset is too large, and alternative methods are required. In this section, we will

examine techniques for sorting (external sorts) and implementing dictionaries (B-trees) for very

large files.

External Sorting

One method for sorting a file is to load the file into memory, sort the data in memory, then write

the results. When the file cannot be loaded into memory due to resource limitations, an external

sort applicable. We will implement an external sort using replacement selection to establish

initial runs, followed by a polyphase merge sort to merge the runs into one sorted file. I highly

recommend you consult Knuth [1998], as many details have been omitted.

Theory

For clarity, I’ll assume that data is on one or more reels of magnetic tape. Figure 4-1 illustrates a

3-way polyphase merge. Initially, in phase A, all data is on tapes T1 and T2. Assume that the

beginning of each tape is at the bottom of the frame. There are two sequential runs of data on

T1: 4-8, and 6-7. Tape T2 has one run: 5-9. At phase B, we’ve merged the first run from tapes

T1 (4-8) and T2 (5-9) into a longer run on tape T3 (4-5-8-9). Phase C is simply renames the

tapes, so we may repeat the merge again. In phase D we repeat the merge, with the final output

on tape T3.

phase T1 T2 T3

A 7

6

8

4

9

5

B

7

6

9

8

5

4

C 9

8

5

4

7

6

D 9

8

7

6

5

4

Figure 4-1: Merge Sort

Several interesting details have been omitted from the previous illustration. For example,

how were the initial runs created? And, did you notice that they merged perfectly, with no extra

runs on any tapes? Before I explain the method used for constructing initial runs, let me digress

for a bit.

In 1202, Leonardo Fibonacci presented the following exercise in his Liber Abbaci (Book of

the Abacus): “How many pairs of rabbits can be produced from a single pair in a year’s time?”

We may assume that each pair produces a new pair of offspring every month, each pair becomes

fertile at the age of one month, and that rabbits never die. After one month, there will be 2 pairs

of rabbits; after two months there will be 3; the following month the original pair and the pair

born during the first month will both usher in a new pair, and there will be 5 in all; and so on.

This series, where each number is the sum of the two preceding numbers, is known as the

Fibonacci sequence:

- 31 -

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... .

Curiously, the Fibonacci series has found widespread application to everything from the

arrangement of flowers on plants to studying the efficiency of Euclid’s algorithm. There’s even

a Fibonacci Quarterly journal. And, as you might suspect, the Fibonacci series has something to

do with establishing initial runs for external sorts.

Recall that we initially had one run on tape T2, and 2 runs on tape T1. Note that the numbers

{1,2} are two sequential numbers in the Fibonacci series. After our first merge, we had one run

on T1 and one run on T2. Note that the numbers {1,1} are two sequential numbers in the

Fibonacci series, only one notch down. We could predict, in fact, that if we had 13 runs on T2,

and 21 runs on T1 {13,21}, we would be left with 8 runs on T1 and 13 runs on T3 {8,13} after

one pass. Successive passes would result in run counts of {5,8}, {3,5}, {2,3}, {1,1}, and {0,1},

for a total of 7 passes. This arrangement is ideal, and will result in the minimum number of

passes. Should data actually be on tape, this is a big savings, as tapes must be mounted and

rewound for each pass. For more than 2 tapes, higher-order Fibonacci numbers are used.

Initially, all the data is on one tape. The tape is read, and runs are distributed to other tapes

in the system. After the initial runs are created, they are merged as described above. One

method we could use to create initial runs is to read a batch of records into memory, sort the

records, and write them out. This process would continue until we had exhausted the input tape.

An alternative algorithm, replacement selection, allows for longer runs. A buffer is allocated in

memory to act as a holding place for several records. Initially, the buffer is filled. Then, the

following steps are repeated until the input is exhausted:

· Select the record with the smallest key that is ³ the key of the last record written.

· If all keys are smaller than the key of the last record written, then we have reached the

end of a run. Select the record with the smallest key for the first record of the next run.

· Write the selected record.

· Replace the selected record with a new record from input.

- 32 -

Figure 4-2 illustrates replacement selection for a small file. The beginning of the file is to the

right of each frame. To keep things simple, I’ve allocated a 2-record buffer. Typically, such a

buffer would hold thousands of records. We load the buffer in step B, and write the record with

the smallest key (6) in step C. This is replaced with the next record (key 8). We select the

smallest key ³ 6 in step D. This is key 7. After writing key 7, we replace it with key 4. This

process repeats until step F, where our last key written was 8, and all keys are less than 8. At this

point, we terminate the run, and start another.

Step Input Buffer Output

A 5-3-4-8-6-7

B 5-3-4-8 6-7

C 5-3-4 8-7 6

D 5-3 8-4 7-6

E 5 3-4 8-7-6

F 5-4 3 | 8-7-6

G 5 4-3 | 8-7-6

H 5-4-3 | 8-7-6

Figure 4-2: Replacement Selection

This strategy simply utilizes an intermediate buffer to hold values until the appropriate time for

output. Using random numbers as input, the average length of a run is twice the length of the

buffer. However, if the data is somewhat ordered, runs can be extremely long. Thus, this

method is more effective than doing partial sorts.

When selecting the next output record, we need to find the smallest key ³ the last key

written. One way to do this is to scan the entire list, searching for the appropriate key. However,

when the buffer holds thousands of records, execution time becomes prohibitive. An alternative

method is to use a binary tree structure, so that we only compare lg n items.

Implementation

Source for the external sort algorithm may be found in file ext.c. Function makeRuns calls

readRec to read the next record. Function readRec employs the replacement selection

algorithm (utilizing a binary tree) to fetch the next record, and makeRuns distributes the records

in a Fibonacci distribution. If the number of runs is not a perfect Fibonacci number, dummy runs

are simulated at the beginning of each file. Function mergeSort is then called to do a

polyphase merge sort on the runs.

No comments: