Tuesday, November 11, 2008

B-Trees


Dictionaries for very large files typically reside on secondary storage, such as a disk. The

dictionary is implemented as an index to the actual file and contains the key and record address

of data. To implement a dictionary we could use red-black trees, replacing pointers with offsets

from the beginning of the index file, and use random access to reference nodes of the tree.

However, every transition on a link would imply a disk access, and would be prohibitively

- 33 -

expensive. Recall that low-level disk I/O accesses disk by sectors (typically 256 bytes). We

could equate node size to sector size, and group several keys together in each node to minimize

the number of I/O operations. This is the principle behind B-trees. Good references for B-trees

include Knuth [1998] and Cormen [1998]. For B+-trees, consult Aho [1983].

Theory

Figure 4-3 illustrates a B-tree with 3 keys/node. Keys in internal nodes are surrounded by

pointers, or record offsets, to keys that are less than or greater than, the key value. For example,

all keys less than 22 are to the left and all keys greater than 22 are to the right. For simplicity, I

have not shown the record address associated with each key.

26

22

10 16

4 6 8 12 14 18 20 24 28 30

Figure 4-3: B-Tree

We can locate any key in this 2-level tree with three disk accesses. If we were to group 100

keys/node, we could search over 1,000,000 keys in only three reads. To ensure this property

holds, we must maintain a balanced tree during insertion and deletion. During insertion, we

examine the child node to verify that it is able to hold an additional node. If not, then a new

sibling node is added to the tree, and the child’s keys are redistributed to make room for the new

node. When descending for insertion and the root is full, then the root is spilled to new children,

and the level of the tree increases. A similar action is taken on deletion, where child nodes may

be absorbed by the root. This technique for altering the height of the tree maintains a balanced

tree.

B-Tree B*-Tree B + -Tree B ++ -Tree

data stored in any node any node leaf only leaf only

on insert, split 1 x 1 ® 2 x 1/2 2 x 1 ® 3 x 2/3 1 x 1 ® 2 x 1/2 3 x 1 ® 4 x 3/4

on delete, join 2 x 1/2 ® 1 x 1 3 x 2/3 ® 2 x 1 2 x 1/2 ® 1 x 1 3 x 1/2 ® 2 x 3/4

Table 4-1: B-Tree Implementations

Several variants of the B-tree are listed in Table 4-1. The standard B-tree stores keys and

data in both internal and leaf nodes. When descending the tree during insertion, a full child node

is first redistributed to adjacent nodes. If the adjacent nodes are also full, then a new node is

created, and ½ the keys in the child are moved to the newly created node. During deletion,

children that are ½ full first attempt to obtain keys from adjacent nodes. If the adjacent nodes are

also ½ full, then two nodes are joined to form one full node. B*-trees are similar, only the nodes

- 34 -

are kept 2/3 full. This results in better utilization of space in the tree, and slightly better

performance.

22

10 16 26

4 6 8 10 12 14 16 18 20 22 24 26 28 30

Figure 4-4: B-+Tree

Figure 4-4 illustrates a B+-tree. All keys are stored at the leaf level, with their associated data

values. Duplicates of the keys appear in internal parent nodes to guide the search. Pointers have

a slightly different meaning than in conventional B-trees. The left pointer designates all keys

less than the value, while the right pointer designates all keys greater than or equal to (GE) the

value. For example, all keys less than 22 are on the left pointer, and all keys greater than or

equal to 22 are on the right. Notice that key 22 is duplicated in the leaf, where the associated

data may be found. During insertion and deletion, care must be taken to properly update parent

nodes. When modifying the first key in a leaf, the last GE pointer found while descending the

tree will require modification to reflect the new key value. Since all keys are in leaf nodes, we

may link them for sequential access.

The last method, B++-trees, is something of my own invention. The organization is similar to

B+-trees, except for the split/join strategy. Assume each node can hold k keys, and the root node

holds 3k keys. Before we descend to a child node during insertion, we check to see if it is full.

If it is, the keys in the child node and two nodes adjacent to the child are all merged and

redistributed. If the two adjacent nodes are also full, then another node is added, resulting in four

nodes, each ¾ full. Before we descend to a child node during deletion, we check to see if it is ½

full. If it is, the keys in the child node and two nodes adjacent to the child are all merged and

redistributed. If the two adjacent nodes are also ½ full, then they are merged into two nodes,

each ¾ full. Note that in each case, the resulting nodes are ¾ full. This is halfway between ½

full and completely full, allowing for an equal number of insertions or deletions in the future.

Recall that the root node holds 3k keys. If the root is full during insertion, we distribute the

keys to four new nodes, each ¾ full. This increases the height of the tree. During deletion, we

inspect the child nodes. If there are only three child nodes, and they are all ½ full, they are

gathered into the root, and the height of the tree decreases.

Another way of expressing the operation is to say we are gathering three nodes, and then

scattering them. In the case of insertion, where we need an extra node, we scatter to four nodes.

For deletion, where a node must be deleted, we scatter to two nodes. The symmetry of the

operation allows the gather/scatter routines to be shared by insertion and deletion in the

implementation.

Implementation

Source for the B++-tree algorithm may be found in file btr.c. In the implementation-dependent

section, you’ll need to define bAdrType and eAdrType, the types associated with B-tree file

offsets and data file offsets, respectively. You’ll also need to provide a callback function that is

used by the B++-tree algorithm to compare keys. Functions are provided to insert/delete keys,

find keys, and access keys sequentially. Function main, at the bottom of the file, provides a

simple illustration for insertion.

The code provided allows for multiple indices to the same data. This was implemented by

returning a handle when the index is opened. Subsequent accesses are done using the supplied

handle. Duplicate keys are allowed. Within one index, all keys must be the same length. A

binary search was implemented to search each node. A flexible buffering scheme allows nodes

to be retained in memory until the space is needed. If you expect access to be somewhat ordered,

increasing the bufCt will reduce paging.

No comments: