Linux Caching Mechanisms: The Block Cache

Published: 2014-11-22 10:47:48 | Source: Linux site | Author: bullbat

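The Linux kernel has not always relied on page-based methods for caching. Early kernel versions contained only the block cache, which was used to speed up file operations and improve system performance. This is a legacy of other UNIX-like operating systems with the same structure. Blocks from the underlying block devices are kept in buffers in main memory, which speeds up read and write operations.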

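Compared with memory pages, blocks are not only smaller (in most cases) but also variable in size, depending on the block device (or filesystem) in use. As generic file access methods implemented on top of page operations have become the norm, the block cache has gradually lost its importance as the central system cache. The main caching work is now done by the page cache. In addition, the standard data structure for block-based I/O is no longer the buffer but struct bio.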

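Buffers are still used for small transfers whose size is comparable to the block size. Filesystems typically use them when handling metadata. Raw data, by contrast, is transferred page by page, and the buffer implementation itself is built on top of the page cache.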


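Structurally, the block cache consists of two parts:

1) A buffer head holds all the management data describing the state of a buffer, including the block number, block size, access counter, and so on. The buffered data itself is not stored directly after the buffer head but in a separate area of physical memory, which is referenced by a pointer in the buffer head structure.

2) The useful data is held in specially allocated pages, which may also reside in the page cache at the same time. This subdivides the page cache further: a page might, for example, be split into four sections of equal size, each described by its own buffer head. The memory in which the buffer heads are stored is unrelated to the memory that holds the useful data.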

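This allows a page to be divided into smaller sections with no gaps between them, because buffer data and buffer head data are kept separate. Since a buffer consists of at least 512 bytes, a page can contain at most MAX_BUF_PER_PAGE buffers. This constant is defined as a function of the page size.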
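The relationship between page size, block size, and the number of buffers per page can be sketched in plain userspace C. The 4 KiB page size and every SKETCH_ name below are assumptions made for illustration; only the 512-byte minimum and the idea behind MAX_BUF_PER_PAGE come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the real page size is architecture-specific. */
#define SKETCH_PAGE_SIZE        4096u
#define SKETCH_MIN_BLOCK_SIZE   512u
#define SKETCH_MAX_BUF_PER_PAGE (SKETCH_PAGE_SIZE / SKETCH_MIN_BLOCK_SIZE)

/* Number of buffers a page holds for a given block size; 0 if the size is unusable. */
static unsigned int buffers_per_page(unsigned int blocksize)
{
	if (blocksize < SKETCH_MIN_BLOCK_SIZE || blocksize > SKETCH_PAGE_SIZE)
		return 0;
	if (blocksize & (blocksize - 1))	/* not a power of two */
		return 0;
	return SKETCH_PAGE_SIZE / blocksize;
}

/* Byte offset of buffer number idx inside its page. */
static size_t buffer_offset(unsigned int idx, unsigned int blocksize)
{
	assert(idx < buffers_per_page(blocksize));
	return (size_t)idx * blocksize;
}
```

With these assumed values, a 1 KiB block size yields four buffers per page, and the buffers sit back to back at offsets 0, 1024, 2048 and 3072.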

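Modifying a buffer immediately affects the contents of the page (and vice versa), so the two caches need no explicit synchronization; after all, they share the same data.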

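Of course, some code accesses block devices in units of blocks rather than pages; reading the superblock of a filesystem is one example. A separate block cache is used to speed up this kind of access. It operates independently of the page cache rather than being built on top of it. To this end, buffer heads (the data structure is the same for the block cache and the page cache) are grouped in an array of constant length whose entries are managed on an LRU (least recently used) basis. After an entry has been used, it is placed at index 0 and the other entries move down accordingly. The most recently used entries are therefore at the start of the array, while rarely used entries are pushed further back and eventually "drop out" of the array if they go unused for long enough.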

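Because the length of the array, that is, the number of entries in the LRU list, is fixed and does not change while the kernel is running, the kernel needs no separate thread to trim the cache to a reasonable size. Instead, all it has to do when an entry drops out of the array is remove the associated buffer from the cache so that the memory can be released for other purposes.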
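The move-to-front behaviour of such a fixed-length array can be sketched as follows. This is a userspace illustration only: the names lru_array and lru_touch and the slot count of 8 are assumptions, and the kernel's own implementation differs in detail (locking and reference counting, for instance).

```c
#include <assert.h>
#include <stddef.h>

#define LRU_SLOTS 8	/* assumed size for this sketch */

struct lru_array {
	void *slot[LRU_SLOTS];	/* slot[0] is the most recently used entry */
};

/*
 * Move entry to slot 0. Returns the entry that fell off the end
 * (so the caller can release it), or NULL if nothing was evicted.
 */
static void *lru_touch(struct lru_array *lru, void *entry)
{
	void *evicted = NULL;
	int i, pos = LRU_SLOTS - 1;

	/* Find the entry if it is already cached; otherwise use the last slot. */
	for (i = 0; i < LRU_SLOTS; i++) {
		if (lru->slot[i] == entry) {
			pos = i;
			break;
		}
	}
	if (lru->slot[pos] != entry)
		evicted = lru->slot[pos];

	/* Shift everything in front of pos down by one and insert at the head. */
	for (i = pos; i > 0; i--)
		lru->slot[i] = lru->slot[i - 1];
	lru->slot[0] = entry;
	return evicted;
}
```

Because the array never grows, eviction is a side effect of insertion: the caller only has to release whatever lru_touch returns, which matches the observation above that no separate trimming thread is required.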


Block Cache Implementation

The block cache is not just an add-on to the page cache; for objects that are handled in blocks rather than pages, it is a cache in its own right.

Data Structures

The Buffer Head

struct buffer_head {
	unsigned long b_state;		/* buffer state bitmap (see above) */
	struct buffer_head *b_this_page;/* circular list of page's buffers */
	struct page *b_page;		/* the page this bh is mapped to */

	sector_t b_blocknr;		/* start block number */
	size_t b_size;			/* size of mapping */
	char *b_data;			/* pointer to data within the page */

	struct block_device *b_bdev;
	bh_end_io_t *b_end_io;		/* I/O completion */
	void *b_private;		/* reserved for b_end_io */
	struct list_head b_assoc_buffers; /* associated with another mapping */
	struct address_space *b_assoc_map;	/* mapping this buffer is
						   associated with */
	atomic_t b_count;		/* users using this buffer_head */
};
 

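Operations

The kernel must provide a set of operations so that the rest of the code can use buffers easily and efficiently. Keep in mind that these mechanisms do not themselves contribute to caching data in memory.

Before buffers can be used, the kernel must first create an instance of the buffer_head structure on which the remaining functions operate. Because creating new buffer heads is a frequently recurring task, it should be as fast as possible. This is a classic case for a slab cache.

Note that the kernel sources do provide functions that act as front ends for creating and destroying buffer heads: alloc_buffer_head generates a new buffer head, and free_buffer_head destroys an existing one.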

/* Allocate a buffer_head */
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
	/* Take the memory from the slab cache */
	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
	if (ret) {
		/* Initialize */
		INIT_LIST_HEAD(&ret->b_assoc_buffers);
		get_cpu_var(bh_accounting).nr++;
		recalc_bh_state();
		put_cpu_var(bh_accounting);
	}
	return ret;
}
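To illustrate why a dedicated object cache suits a frequently repeated allocation, here is a rough userspace analogy. It is not the kernel slab allocator: obj_cache and its functions are hypothetical names, and the free list simply recycles fixed-size objects while keeping a usage counter, loosely mirroring the nr accounting in alloc_buffer_head.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical fixed-size object cache; a userspace analogy, not the slab API. */
struct obj_cache {
	size_t obj_size;	/* size of every object handed out */
	void *free_list;	/* singly linked list of recycled objects */
	unsigned long nr_live;	/* objects currently in use */
};

static void obj_cache_init(struct obj_cache *c, size_t obj_size)
{
	/* Each free object stores the next pointer in its own memory. */
	c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
	c->free_list = NULL;
	c->nr_live = 0;
}

static void *obj_cache_alloc(struct obj_cache *c)
{
	void *obj = c->free_list;

	if (obj)
		c->free_list = *(void **)obj;	/* reuse a recycled object */
	else
		obj = malloc(c->obj_size);	/* fall back to the general allocator */
	if (obj)
		c->nr_live++;
	return obj;
}

static void obj_cache_free(struct obj_cache *c, void *obj)
{
	*(void **)obj = c->free_list;	/* push onto the free list */
	c->free_list = obj;
	c->nr_live--;
}
```

Freed objects are handed out again on the next allocation, so the common path avoids the general-purpose allocator entirely. A real slab cache adds per-CPU caching, constructors and memory reclaim on top of this basic idea.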
 

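Interaction Between the Page Cache and the Block Cache

A page is split into several data units, but the buffer heads are kept in a separate memory area that is unrelated to the actual data. Interacting with buffers does not change the contents of the page; buffers merely provide a new view of the page's data.

To support the interaction between pages and buffers, the private member of struct page is used. It is of type unsigned long and can therefore serve as a pointer to any position in the virtual address space.

The private member can also serve other purposes that, depending on how the page is used, may have nothing to do with buffer heads. Its predominant use, however, is to link buffers and pages. In that case, private points to the first buffer head of those that split the page into smaller units. The buffer heads are linked into a circular list via b_this_page: the b_this_page member of each buffer head points to the next one, and that of the last buffer head points back to the first. Starting from the page structure, the kernel can thus easily scan all buffer_head instances associated with the page.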
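The ring of buffer heads hanging off a page can be modelled in a few lines of userspace C. The two structures below are reduced stand-ins (hence the _sketch suffix) that keep only the fields needed here, and attach_ring and ring_bytes are hypothetical helper names. The sketch also assumes that unsigned long is wide enough to hold a pointer, as it is on Linux.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures; only the fields used here. */
struct buffer_head_sketch {
	struct buffer_head_sketch *b_this_page;	/* next buffer of the same page */
	size_t b_size;
};

struct page_sketch {
	unsigned long private;	/* holds a pointer to the first buffer head */
};

/* Link n buffer heads into a ring and hang the ring off the page. */
static void attach_ring(struct page_sketch *page,
			struct buffer_head_sketch *bh, size_t n, size_t blocksize)
{
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		bh[i].b_size = blocksize;
		bh[i].b_this_page = &bh[(i + 1) % n];	/* last wraps to first */
	}
	page->private = (unsigned long)&bh[0];
}

/* Walk the ring once, starting from the page, and add up the buffer sizes. */
static size_t ring_bytes(const struct page_sketch *page)
{
	struct buffer_head_sketch *head, *bh;
	size_t total = 0;

	head = (struct buffer_head_sketch *)page->private;
	bh = head;
	do {
		total += bh->b_size;
		bh = bh->b_this_page;
	} while (bh != head);
	return total;
}
```

The do/while loop that stops when it returns to head is the same traversal pattern the kernel functions in this article use on b_this_page.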

The kernel provides the create_empty_buffers function to set up the association between the page and buffer_head structures:

/*
 * We attach and possibly dirty the buffers atomically wrt
 * __set_page_dirty_buffers() via private_lock.  try_to_free_buffers
 * is already excluded via the page lock.
 */ 
void create_empty_buffers(struct page *page,
			unsigned long blocksize, unsigned long b_state)
{
	struct buffer_head *bh, *head, *tail;

	head = alloc_page_buffers(page, blocksize, 1);
	bh = head;
	/* Walk all buffer heads, set their state, and close the ring */
	do {
		bh->b_state |= b_state;
		tail = bh;
		bh = bh->b_this_page;
	} while (bh);
	tail->b_this_page = head;

	spin_lock(&page->mapping->private_lock);
	/* The buffer state depends on the state of the data in the page */
	if (PageUptodate(page) || PageDirty(page)) {
		bh = head;
		do {	/* Set the corresponding flags */
			if (PageDirty(page))
				set_buffer_dirty(bh);
			if (PageUptodate(page))
				set_buffer_uptodate(bh);
			bh = bh->b_this_page;
		} while (bh != head);
	}
	/* Attach the buffers to the page */
	attach_page_buffers(page, head);
	spin_unlock(&page->mapping->private_lock);
}

static inline void attach_page_buffers(struct page *page,
		struct buffer_head *head)
{
	page_cache_get(page);	/* Increment the reference count */
	/* Set PG_private to tell the rest of the kernel that the private
	   member of the page instance is in use */
	SetPagePrivate(page);
	/* Make private point to the first buffer head of the ring */
	set_page_private(page, (unsigned long)head);
}


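Interaction

Setting up a link between pages and buffers serves no purpose unless other parts of the kernel benefit from it. Some transfers to and from block devices must be performed in units whose size depends on the block size of the underlying device, whereas many parts of the kernel prefer to carry out I/O with page granularity because this makes other things much easier, memory management in particular. In this scenario, buffers act as the intermediary between the two sides.

Reading Whole Pages into Buffers

We first look at the approach the kernel takes when reading a whole page from a block device, using block_read_full_page as the example. Only the parts relevant to the buffer implementation are discussed.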

/*
 * Generic "read page" function for block devices that have the normal
 * get_block functionality. This is most of the block device filesystems.
 * Reads the page asynchronously --- the unlock_buffer() and
 * set/clear_buffer_uptodate() functions propagate buffer state into the
 * page struct once IO has completed.
 */ 
int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize;
	int nr, i;
	int fully_mapped = 1;

	BUG_ON(!PageLocked(page));
	blocksize = 1 << inode->i_blkbits;
	/* Check whether the page has buffers attached; create them if not */
	if (!page_has_buffers(page))
		create_empty_buffers(page, blocksize, 0);
	/*
	 * Get the buffers, whether newly created or already present.
	 * This merely converts the page's private member into a
	 * buffer_head pointer because, by convention, private points
	 * to the first buffer head associated with the page.
	 */
	head = page_buffers(page);

	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
	lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
	bh = head;
	nr = 0;
	i = 0;
	/* The kernel iterates over all buffers associated with the page */
	do {
		/*
		 * If the buffer contents are up to date, move on to the
		 * next buffer. The data in the buffer matches the block
		 * device, so no additional read is needed.
		 */
		if (buffer_uptodate(bh))
			continue;
		/* The buffer is not mapped yet */
		if (!buffer_mapped(bh)) {
			int err = 0;

			fully_mapped = 0;
			if (iblock < lblock) {
				WARN_ON(bh->b_size != blocksize);
				/* Find the position of the block on the block device */
				err = get_block(inode, iblock, bh, 0);
				if (err)
					SetPageError(page);
			}
			if (!buffer_mapped(bh)) {
				zero_user(page, i * blocksize, blocksize);
				if (!err)
					set_buffer_uptodate(bh);
				continue;
			}
			/*
			 * get_block() might have updated the buffer
			 * synchronously
			 */
			if (buffer_uptodate(bh))
				continue;
		}
		/*
		 * The buffer is mapped to a block but its contents are not
		 * up to date: collect it in a temporary array.
		 */
		arr[nr++] = bh;
	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
	if (fully_mapped)
		SetPageMappedToDisk(page);

	if (!nr) {
		/*
		 * All buffers are uptodate - we can set the page uptodate
		 * as well. But not if get_block() returned an error.
		 */
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}

	/* Stage two: lock the buffers */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		lock_buffer(bh);
		/* Set b_end_io to end_buffer_async_read, which is called
		   when the data transfer has completed */
		mark_buffer_async_read(bh);
	}

	/*
	 * Stage 3: start the IO.  Check for uptodateness
	 * inside the buffer lock in case another process reading
	 * the underlying blockdev brought it uptodate (the sct fix).
	 */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		if (buffer_uptodate(bh))
			end_buffer_async_read(bh, 1);
		else
			/* Hand every buffer that must be read to the block
			   layer (the BIO layer), where the read is started */
			submit_bh(READ, bh);
	}
	return 0;
}
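The two shift expressions that compute iblock and lblock in block_read_full_page are easy to misread, so the arithmetic is sketched here in userspace C. The type and helper names and the 4 KiB page size (shift of 12) are assumptions for illustration; blkbits plays the role of inode->i_blkbits.

```c
#include <assert.h>

typedef unsigned long long sector_sketch_t;	/* stand-in for sector_t */

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* File-relative number of the first block covered by page number index. */
static sector_sketch_t first_block_of_page(unsigned long index,
					   unsigned int blkbits)
{
	return (sector_sketch_t)index << (SKETCH_PAGE_SHIFT - blkbits);
}

/* Number of blocks needed to hold i_size bytes, rounding up. */
static sector_sketch_t blocks_in_file(unsigned long long i_size,
				      unsigned int blkbits)
{
	unsigned long long blocksize = 1ULL << blkbits;

	return (i_size + blocksize - 1) >> blkbits;
}
```

With 1 KiB blocks (blkbits of 10), page 3 starts at block 12, and a 5000-byte file occupies 5 blocks. In the loop above, any buffer whose block number is not below that count lies beyond the end of the file, so get_block is skipped and the buffer is zero-filled instead of being read.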


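Writing Whole Pages into Buffers

Besides reads, page writes can also be split into smaller units. Only the parts of a page that were actually modified need to be written back, not the whole page. Unfortunately, from the buffer's point of view the implementation of writes is considerably more complicated than that of the reads described above.

The __block_write_full_page function contains the buffer-related operations involved in writing back a dirty page.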

/*
 * NOTE! All mapped/uptodate combinations are valid:
 *
 *  Mapped  Uptodate  Meaning
 *
 *  No      No        "unknown" - must do get_block()
 *  No      Yes       "hole" - zero-filled
 *  Yes     No        "allocated" - allocated on disk, not read in
 *  Yes     Yes       "valid" - allocated and up-to-date in memory.
 *
 * "Dirty" is valid only with the last case (mapped+uptodate).
 */ 
 
/*
 * While block_write_full_page is writing back the dirty buffers under
 * the page lock, whoever dirtied the buffers may decide to clean them
 * again at any time.  We handle that by only looking at the buffer
 * state inside lock_buffer().
 *
 * If block_write_full_page() is called for regular writeback
 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
 * locked buffer.   This only can happen if someone has written the buffer
 * directly, with submit_bh().  At the address_space level PageWriteback
 * prevents this contention from occurring.
 *
 * If block_write_full_page() is called with wbc->sync_mode ==
 * WB_SYNC_ALL, the writes are posted using WRITE_SYNC_PLUG; this
 * causes the writes to be flagged as synchronous writes, but the
 * block device queue will NOT be unplugged, since usually many pages
 * will be pushed to the out before the higher-level caller actually
 * waits for the writes to be completed.  The various wait functions,
 * such as wait_on_writeback_range() will ultimately call sync_page()
 * which will ultimately call blk_run_backing_dev(), which will end up
 * unplugging the device queue.
 */ 
static int __block_write_full_page(struct inode *inode, struct page *page,
			get_block_t *get_block, struct writeback_control *wbc,
			bh_end_io_t *handler)
{
	int err;
	sector_t block;
	sector_t last_block;
	struct buffer_head *bh, *head;
	const unsigned blocksize = 1 << inode->i_blkbits;
	int nr_underway = 0;
	int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
			WRITE_SYNC_PLUG : WRITE);

	BUG_ON(!PageLocked(page));

	last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
	/* Create buffers if the page has none attached */
	if (!page_has_buffers(page)) {
		create_empty_buffers(page, blocksize,
					(1 << BH_Dirty)|(1 << BH_Uptodate));
	}
 
	/*
	 * Be very careful.  We have no exclusion from __set_page_dirty_buffers
	 * here, and the (potentially unmapped) buffers may become dirty at
	 * any time.  If a buffer becomes dirty here after we've inspected it
	 * then we just miss that fact, and the page stays dirty.
	 *
	 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
	 * handle that here by just cleaning them.
	 */

	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
	head = page_buffers(page);
	bh = head;

	/*
	 * Get all the dirty buffers mapped to disk addresses and
	 * handle any aliases from the underlying blockdev's mapping.
	 */
	/* First pass: for every unmapped dirty buffer, set up a mapping
	   between the buffer and the block device */
	do {
		if (block > last_block) {
			/*
			 * mapped buffers outside i_size will occur, because
			 * this page can be outside i_size when there is a
			 * truncate in progress.
			 */
			/*
			 * The buffer was zeroed by block_write_full_page()
			 */
			clear_buffer_dirty(bh);
			set_buffer_uptodate(bh);
		} else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
			   buffer_dirty(bh)) {
			WARN_ON(bh->b_size != blocksize);
			/* Find the block on the device that matches this buffer */
			err = get_block(inode, block, bh, 1);
			if (err)
				goto recover;
			clear_buffer_delay(bh);
			if (buffer_new(bh)) {
				/* blockdev mappings never come here */
				clear_buffer_new(bh);
				unmap_underlying_metadata(bh->b_bdev,
							bh->b_blocknr);
			}
		}
		bh = bh->b_this_page;
		block++;
	} while (bh != head);
	/* Second pass: filter out all dirty buffers */
	do {
		if (!buffer_mapped(bh))
			continue;
		/*
		 * If it's a fully non-blocking write attempt and we cannot
		 * lock the buffer then redirty the page.  Note that this can
		 * potentially cause a busy-wait loop from writeback threads
		 * and kswapd activity, but those code paths have their own
		 * higher-level throttling.
		 */
		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
			lock_buffer(bh);
		} else if (!trylock_buffer(bh)) {
			redirty_page_for_writepage(wbc, page);
			continue;
		}
		/* If the buffer's dirty flag is set, this call clears it,
		   because the buffer contents are about to be written back */
		if (test_clear_buffer_dirty(bh)) {
			/* Set the BH_Async_Write state bit and install the
			   supplied handler (typically end_buffer_async_write)
			   as the completion routine, i.e. b_end_io */
			mark_buffer_async_write_endio(bh, handler);
		} else {
			unlock_buffer(bh);
		}
	} while ((bh = bh->b_this_page) != head);
 
	/*
	 * The page and its buffers are protected by PageWriteback(), so we can
	 * drop the bh refcounts early.
	 */
	BUG_ON(PageWriteback(page));
	set_page_writeback(page);
	/* Final pass */
	do {
		struct buffer_head *next = bh->b_this_page;
		if (buffer_async_write(bh)) {
			/* Hand every buffer marked BH_Async_Write in the
			   previous pass to the block layer, which performs
			   the actual write; submit_bh issues the request */
			submit_bh(write_op, bh);
			nr_underway++;
		}
		bh = next;
	} while (bh != head);
	unlock_page(page);
 
	err = 0;
done:
	if (nr_underway == 0) {
		/*
		 * The page was marked dirty, but the buffers were
		 * clean.  Someone wrote them back by hand with
		 * ll_rw_block/submit_bh.  A rare case.
		 */
		end_page_writeback(page);

		/*
		 * The page and buffer_heads can be released at any time from
		 * here on.
		 */
	}
	return err;
 
recover:
	/*
	 * ENOSPC, or some other error.  We may already have added some
	 * blocks to the file, so we need to write these out to avoid
	 * exposing stale data.
	 * The page is currently locked and not marked for writeback
	 */
	bh = head;
	/* Recovery: lock and submit the mapped buffers */
	do {
		if (buffer_mapped(bh) && buffer_dirty(bh) &&
		    !buffer_delay(bh)) {
			lock_buffer(bh);
			mark_buffer_async_write_endio(bh, handler);
		} else {
			/*
			 * The buffer may have been set dirty during
			 * attachment to a dirty page.
			 */
			clear_buffer_dirty(bh);
		}
	} while ((bh = bh->b_this_page) != head);
	SetPageError(page);
	BUG_ON(PageWriteback(page));
	mapping_set_error(page->mapping, err);
	set_page_writeback(page);
	do {
		struct buffer_head *next = bh->b_this_page;
		if (buffer_async_write(bh)) {
			clear_buffer_dirty(bh);
			submit_bh(write_op, bh);
			nr_underway++;
		}
		bh = next;
	} while (bh != head);
	unlock_page(page);
	goto done;
}
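The mapped/uptodate table in the comment that precedes __block_write_full_page can be expressed as a small classification function. This is a sketch only: the SK_ bit values and the enum names are made up for illustration and do not correspond to the kernel's BH_ bit numbers.

```c
#include <assert.h>

enum sketch_bh_bits {	/* illustrative bit positions, not the kernel's */
	SK_MAPPED   = 1u << 0,
	SK_UPTODATE = 1u << 1,
	SK_DIRTY    = 1u << 2
};

enum sketch_bh_class {
	SK_UNKNOWN,	/* not mapped, not uptodate: must call get_block() */
	SK_HOLE,	/* not mapped, uptodate: zero-filled */
	SK_ALLOCATED,	/* mapped, not uptodate: on disk, not read in */
	SK_VALID	/* mapped and uptodate in memory */
};

/* Classify a buffer according to its mapped and uptodate bits. */
static enum sketch_bh_class classify_buffer(unsigned int state)
{
	int mapped = !!(state & SK_MAPPED);
	int uptodate = !!(state & SK_UPTODATE);

	if (!mapped)
		return uptodate ? SK_HOLE : SK_UNKNOWN;
	return uptodate ? SK_VALID : SK_ALLOCATED;
}

/* "Dirty" is only meaningful for mapped, up-to-date buffers. */
static int dirty_is_meaningful(unsigned int state)
{
	return classify_buffer(state) == SK_VALID;
}
```

Read against the function above, the first pass moves dirty buffers out of the unmapped states by calling get_block, and the later passes only submit buffers that are mapped and were dirty.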