Unix I/O Buffer Management Algorithm

Unix I/O buffer management algorithm first appeared in V6 Unix (Ritchie and Thompson 1978; Lion 1996). It is discussed in detail in Chapter 3 of Bach (Bach 1990). The Unix buffer management subsystem consists of the following components.

  • . I/O buffers: A set of NBUF buffers in kernel is used as a buffer cache. Each buffer is represented by a structure.

typdef struct buf{

struct buf *next_free;  // freelist pointer

struct buf *next_dev;  // dev_list pointer

int dev,blk;  // assigned disk block;

int opcode; // READ|WRITE

int dirty; // buffer data modified

int async; // ASYNC write flag

int valid; // buffer data valid

int busy; // buffer is in use

int wanted; // some process needs this buffer

struct semaphore lock=1; // buffer locking semaphore; value=1

struct semaphore iodone=0; // for process to wait for I/O completion;

char buf[BLKSIZE]; // block data area

} BUFFER;

BUFFER buf[NBUF], *freelist; // NBUF buffers and free buffer list

The buffer structure consists of two parts; a header part for buffer management and a data part for a block of data. To conserve kernel memory, the status fields may be defined as a bit vector, in which each bit represents a unique status condition. They are defined as int here for clarity and ease of discussion.

  • . Device Table: Each block device is represented by a device table structure.

struct devtab{

u16 dev;              // major device number

BUFFER *dev_list;     // device buffer list

BUFFER *io_queue;     // device I/O queue

} devtab[NDEV];

Each devtab has a dev_list, which contains I/O buffers currently assigned to the device, and an io_queue, which contains buffers of pending I/O operations on the device. The I/O queue may be organized for optimal I/O operations. For instance, it may implement the various disk scheduling algorithms, such as the elevator algorithm or the linear-sweep algorithm, etc. For the sake of simplicity, Unix uses FIFO I/O queues.

  • . Buffer Initialization: When the system starts, all I/O buffers are in the freelist and all device lists and I/O queues are empty.
  • . Buffer Lists: When a buffer is assigned to a (dev, blk), it is inserted into the devtab’s dev_list. If the buffer is currently in use, it is marked as BUSY and removed from the freelist. A BUSY buffer may also be in the I/O queue of a devtab. Since a buffer cannot be free and busy at the same time, the device I/O queue is maintained by using the same next_free pointer. When a buffer is no longer BUSY, it is released back to the freelist but remains in the dev_list for possible reuse. A buffer may change from one dev_list to another only when it is reassigned. As shown before, read/ write disk blocks can be expressed in terms of bread, bwrite and dwrite, all of which depend on getblk and brelse. Therefore, getblk and brelse form the core of the Unix buffer management scheme. The algorithm of getblk and brelse is as follows.
  • . Unix getblk/brelse algorithm: (Lion 1996; Chapter 3 of (Bach 1990)).

/* getblk: return a buffer=(dev,blk) for exclusive use */

BUFFER *getblk(dev,blk){

while(1){

(1). search dev_list for a bp=(dev, blk);

(2). if (bp in dev_lst){

if (bp BUSY){

set bp WANTED flag;

sleep(bp);     // wait for bp to be released

continue;      // retry the algorithm

}

/* bp not BUSY */

take bp out of freelist;

mark bp BUSY;

return bp;

}

(3). /* bp not in cache; try to get a free buf from freelist */

if (freelist empty){

set freelist WANTED flag;

sleep(freelist); // wait for any free buffer

continue;      // retry the algorithm

}

(4). /* freelist not empty */

bp = first bp taken out of freelist; mark bp BUSY;

if (bp DIRTY){       // bp is for delayed write

awrite(bp);      // write bp out ASYNC;

continue;        // from (1) but not retry

}

(5). reassign bp to (dev,blk); // set bp data invalid, etc.

return bp;

}

/** brelse: releases a buffer as FREE to freelist **/

brelse(BUFFER *bp){

if (bp WANTED)

wakeup(bp);        // wakeup ALL proc’s sleeping on bp;

if (freelist WANTED)

wakeup(freelist); // wakeup ALL proc’s sleeping on freelist;

clear bp and freelist WANTED flags;

insert bp to (tail of) freelist;

}

It is noted that in (Bach 1990), buffers are maintained in hash queues. When the number of buffers is large, hashing may reduce the search time. If the number of buffers is small, hashing may actually increases the execution time due to additional overhead. Furthermore, studies (Wang 2002) have shown that hashing has almost no effect on the buffer cache performance. In fact, we may consider the device lists as hash queues by the simple hashing function hash(dev, blk) = dev. So there is no loss of generality by using the device lists as hash queues. The Unix algorithm is very simple and easy to understand. Perhaps because of its extreme simplicity, most people are not very impressed by it at first sight. Some may even consider it naive because of the repeated retry loops. However, the more you look at it, the more it makes sense. This amazingly simple but effective algorithm attests to the ingenuity of the original Unix designers. Some specific comments about the Unix algorithm follow.

  1. Data Consistency: In order to ensure data consistency, getblk must never assign more than one buffer to the same (dev, blk). This is achieved by having the process re-execute the “retry loops” after waking up from sleep. The reader may verify that every assigned buffer is unique. Second, dirty buffers are written out before they are reassigned, which guarantees data consistency.
  2.  Cache effect: Cache effect is achieved by the following means. A released buffer remains in the device list for possible reuse. Buffers marked for delay-write do not incur immediate I/O and are available for reuse. Buffers are released to the tail of freelist but allocated from the front of freelist. This is based on the LRU ((Least-Recent-Used) principle, which helps prolong the lifetime of assigned buffers, thereby increasing their cache effect.
  3.  Critical Regions: Device interrupt handlers may manipulate the buffer lists, e.g. remove a bp from a devtab’s I/O queue, change its status and call brelse(bp). So in getblk and brelse, device interrupts are masked out in these critical regions. These are implied but not shown in the algorithm.

Shortcomings of Unix Algorithm

The Unix algorithm is very simple and elegant, but it also has the following shortcomings.

  1. Inefficiency: the algorithm relies on retry loops. For example, releasing a buffer may wake up two sets of processes; those who want the released buffer, as well as those who just need a free buffer. Since only one process can get the released buffer, all other awakened processes must go back to sleep again. After waking up from sleep, every awakened process must re-execute the algorithm again from beginning because the needed buffer may already exist. This would cause excessive process switches.
  2. Unpredictable cache effect: In the Unix algorithm, every released buffer is up for grabs. If the buffer is obtained by a process which needs a free buffer, the buffer would be reassigned, even though there may be processes which still need the buffer.
  3.  Possible starvation: The Unix algorithm is based on the principle of “free economy”, in which every process is given chances to try but with no guarantee of success. Therefore, process starvation may occur.
  4.  The algorithm uses sleep/wakeup, which is only suitable for uniprocessor systems.

Source: Wang K.C. (2018), Systems Programming in Unix/Linux, Springer; 1st ed. 2018 edition.

Leave a Reply

Your email address will not be published. Required fields are marked *