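Device Mapper (DM) is a block-device framework that was fully introduced in Linux 2.6. It allows all real and virtual block devices in a system to be managed flexibly.
DM registers itself with the Linux kernel as a block device. Any block devices attached to (or "mapped" under) a DM structure, however they are organized and however they communicate, appear to Linux as a single complete DM block device. DM therefore gives block devices, or clusters of block devices, with different organizations a complete and uniform representation in front of the kernel.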
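1. DM and MD
In the Linux kernel source (this article uses the 2.6.33 kernel as its reference), DM refers to the design framework of Device Mapper as a whole, while an MD (Mapped Device) is one of the virtual devices that the framework creates. In short, DM is the overall architecture in which different kinds of MDs are connected, through specific relationships, to the block-device layer.
The DM kernel code is concentrated in the drivers/md directory. The framework code lives in dm.c and dm-io.c. The code for the various virtual devices is spread across the other files in the same directory, with the exception of dm-ioctl.c, which implements DM's configuration interface device.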
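2. Using dmsetup
The tool for working with DM is dmsetup. This command lets us assemble, dismantle, and monitor our own virtual storage structures. The aim of this article is to look into how DM works by reading the Linux kernel source alongside it.
Among the dmsetup subcommands, create, load, and reload all accept a --table <table> argument. The <table> string is the key to creating a DM device: if it is not given as a command-line argument, it has to be supplied to dmsetup in a table file or on standard input.
A table string has the following form: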
<start> <length> <type> <arguments>
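To understand these parameters, first note that create, load, and reload always present part or all of an existing block device A, or of a group of such devices, as a virtual block device B. In the kernel code, block device B (the one we deal with directly) is called the mapped device. The selected portion of each device A (the part we hand over to DM to manage) is called a target device, and the driver that handles it is the target driver. Device A does not have to be a real disk; it can be another mapped device that DM has already created. Another article explains the relationship among mapped device, target driver, and target device in detail, but it does not focus on the source code, which is the focus here.
<start> and <length> in the table string give, in sectors, the range of the mapped device that this table line covers: the sectors of the mapped device from offset start, for length sectors, are mapped onto the target. Where that range lands on the source device is specified in <arguments>. <type> is the type of target driver, and each type string corresponds to one target driver. <arguments> are the parameters used to create the target device. They are passed to the target's constructor in the same way that command-line arguments are passed to int main(int argc, char *argv[]).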
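The way a table line breaks down into a fixed prefix plus an argc/argv pair can be sketched in user space. This is only an illustration and not the parser used by dmsetup or the kernel: struct table_line, parse_table_line, and MAX_ARGS are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ARGS 32  /* arbitrary limit chosen for this sketch */

/* Hypothetical container for one parsed table line (not a kernel structure). */
struct table_line {
    unsigned long long start;   /* <start>, in sectors */
    unsigned long long length;  /* <length>, in sectors */
    char type[16];              /* <type>, e.g. "striped" */
    unsigned int argc;          /* number of <arguments> tokens */
    char *argv[MAX_ARGS];       /* pointers into the caller's buffer */
};

/*
 * Split "<start> <length> <type> <arguments>" in place.
 * Returns 0 on success and -1 on a malformed line.
 * The buffer is modified, because strtok writes NUL bytes into it.
 */
static int parse_table_line(char *buf, struct table_line *out)
{
    int consumed = 0;
    char *tok;

    assert(buf != NULL && out != NULL);

    /* %n records how many characters the three fixed fields used. */
    if (sscanf(buf, "%llu %llu %15s%n", &out->start, &out->length,
               out->type, &consumed) != 3)
        return -1;

    /* Everything after <type> becomes argv[], as it would for main(). */
    out->argc = 0;
    for (tok = strtok(buf + consumed, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        if (out->argc == MAX_ARGS)
            return -1;
        out->argv[out->argc++] = tok;
    }
    return 0;
}
```

For the line "0 2056320 striped 2 32 /dev/hda 0 /dev/hdb 0", this sketch yields type "striped" and six argument tokens, which is the shape in which a constructor receives them.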
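The target drivers that ship with the Linux kernel include linear, striped, mirror, multipath, and crypt (dm-crypt). The standard RAID drivers found in the same directory belong to the separate MD (multiple devices) subsystem rather than to DM. Next, we use the stripe code as an example to explain how a target driver creates and runs a target device.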
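3. Target Driver
Every target device is backed in the kernel by a corresponding driver. These drivers must conform to the DM framework and are managed by it. One may ask why the drivers in the DM framework are target drivers rather than MD drivers. The reason is that in DM's design the MD is only a uniform external interface: every target driver exposes the same interface to the outside, so there is no need to write a different MD for each kind of virtualization, and supplying a different target driver is enough. (As an aside, calling it a "mapped driver" might avoid some confusion, because an MD and an instance of a target driver (hereafter "driver") are in a one-to-one relationship, while a driver and its target devices (hereafter "targets") are in a one-to-many relationship. Folding the driver concept into the md would give a two-way, one-to-many relationship between md and targets instead of the current three-way md-driver-target relationship. But the md is generic and the driver is specific, so the split into three is easy to understand.)
To settle the abbreviations: mapped device is shortened to md and target device to target, roughly following the naming in the kernel code. The target driver is shortened to driver (a name that does not appear in the source, because the DM framework manages targets, not drivers; a driver is simply loaded with insmod). The source device is shortened to device (in the source, these devices are represented only by variables whose names contain bdev).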
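This article uses dm-stripe.c as the example to work out which basic elements a target needs in order to implement its device abstraction. What stripe does is combine equal-length regions of several devices into one complete virtual device, so the core problem is addressing. Suppose there are n devices, each contributing a region of length m. Then block i should be stored in target number (i % n), at block offset (i / n) within it (note that the actual offset must also include the target's offset within its device).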
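The addressing rule above is small enough to write down directly. The following is a minimal sketch of just that arithmetic, with hypothetical names (stripe_location, stripe_locate). It ignores the target's offset within its device, which the text notes must be added separately.

```c
#include <assert.h>

/* Hypothetical result type: where block i of the striped device lives. */
struct stripe_location {
    unsigned int target;         /* index of the target, 0 .. n-1 */
    unsigned long long offset;   /* block offset inside that target's region */
};

/* Block i goes to target (i % n) at block offset (i / n). */
static struct stripe_location stripe_locate(unsigned long long i, unsigned int n)
{
    struct stripe_location loc;

    assert(n > 0);               /* at least one device is required */
    loc.target = (unsigned int)(i % n);
    loc.offset = i / n;
    return loc;
}
```

With n = 3, for example, block 7 lands on target 1 at offset 2, and consecutive blocks rotate through targets 0, 1, 2, 0, and so on.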
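First, each driver needs a struct target_type to register itself with DM. This structure is shared by all instances of the driver. In other words, every driver instance belongs to this type, so target_type is better understood as a "driver type". The struct target_type of dm-stripe is as follows: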
static struct target_type stripe_target = {
	.name = "striped",           // the <type> name
	.version = {1, 3, 0},
	.module = THIS_MODULE,
	.ctr = stripe_ctr,           // constructor
	.dtr = stripe_dtr,           // destructor
	.map = stripe_map,           // mapping function
	.end_io = stripe_end_io,     // I/O completion notification
	.status = stripe_status,
	.iterate_devices = stripe_iterate_devices, // iterate over the source devices
	.io_hints = stripe_io_hints,
};
A constructor, a destructor, and a map function are what target drivers generally provide.
Every constructor has the following prototype:
int xxx_ctr(struct dm_target *ti, unsigned int argc, char **argv);
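When a device is created, the DM framework automatically allocates the corresponding struct dm_target and initializes as many members as it can. The constructor's job is to finish initializing the structure. So let us see which members the DM framework initializes and which are left to ctr: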
struct dm_target {
	struct dm_table *table;    // @ mapping table from the driver to its target devices, maintained by DM
	struct target_type *type;  // @ the type that the driver registered
	/* target limits */
	sector_t begin;            // @ <start>
	sector_t len;              // @ <length>
	/* Always a power of 2 */
	sector_t split_io;         // chunk size (number of sectors per chunk)
	/*
	 * A number of zero-length barrier requests that will be submitted
	 * to the target for the purpose of flushing cache.
	 *
	 * The request number will be placed in union map_info->flush_request.
	 * It is a responsibility of the target driver to remap these requests
	 * to the real underlying devices.
	 */
	unsigned num_flush_requests;
	/* target specific data */
	void *private;             // driver-defined, device-specific data
	/* Used to provide an error string from the ctr */
	char *error;
};
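Members marked with @ are initialized, fully or partially, by DM. The rest is up to ctr. In general, ctr does two main things:
1. Record the dev information of the source devices in table.
2. Initialize the information about the target device(s) and record it in private.
The table member is the mapping table from the driver instance to its target(s). The DM framework provides int dm_get_device(struct dm_target *ti, const char *path, sector_t start, sector_t len, fmode_t mode, struct dm_dev **result), which fills the bdev of the device named by path, together with the corresponding range, permissions, and mode, into ti->table. All stripe_ctr has to do is pass the relevant argument strings to this function.
stripe_ctr also allocates a driver-specific struct stripe_c, sc, and stores it in ti->private. It uses the result output parameter of dm_get_device to fill the sc->stripes array (the key point is to remember the source device's dev structure, which DM always references through a struct dm_dev * pointer).
The matching destructor, stripe_dtr, simply releases, one by one, the resources that stripe_ctr obtained from the kernel, so it is not discussed further here.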
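The most important function is map. Every bio (a block-device I/O request) has to be mapped to the right position on the device that finally stores it, and the map function does exactly that. Its prototype is: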
int xxx_map(struct dm_target *ti, struct bio *bio, union map_info *map_context);
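ti is the target and bio is the I/O request sent to it. A bio has three key members: bi_sector (position), bi_bdev (device), and bi_io_vec (data). The DM framework hands the bio to the map function so that the target has a chance to change these three members, which serves two purposes: redirecting the request and modifying its data. map_context is per-bio context supplied by DM, and in many cases a target makes little use of it.
If the map function has taken over the bio and dispatched it itself (for example after copying it), it returns DM_MAPIO_SUBMITTED to tell DM not to process the bio any further. If the map function has modified the bio and wants DM to dispatch it again with the new contents, it returns DM_MAPIO_REMAPPED. If the bio cannot be handled now and should be put back on a queue for later processing, the function returns DM_MAPIO_REQUEUE. The code in DM that handles these results is in __map_bio() in dm.c.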
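How a caller might branch on these three results can be sketched as a small decision function. This is a user-space illustration, not the body of __map_bio(): the enum names and their numeric values are stand-ins invented for the example, and the handling of negative (error) return values is an assumption that goes beyond what the text above describes.

```c
#include <assert.h>

/* Stand-ins for the kernel's DM_MAPIO_* constants; the values are illustrative. */
enum map_result {
    MAPIO_SUBMITTED = 0,
    MAPIO_REMAPPED  = 1,
    MAPIO_REQUEUE   = 2
};

/* What the framework does next with the bio (hypothetical names). */
enum dm_action {
    ACTION_NONE,        /* the target owns the bio; nothing left to do */
    ACTION_DISPATCH,    /* send the remapped bio on to the lower device */
    ACTION_END_OR_REQUEUE  /* complete the bio with an error or requeue it */
};

static enum dm_action action_for_map_result(int r)
{
    if (r == MAPIO_REMAPPED)
        return ACTION_DISPATCH;
    /* Assumption: negative values are error codes returned by the target. */
    if (r < 0 || r == MAPIO_REQUEUE)
        return ACTION_END_OR_REQUEUE;
    /* Any other value would be a bug in the target. */
    assert(r == MAPIO_SUBMITTED);
    return ACTION_NONE;
}
```

A target such as stripe, which only rewrites the bio, always ends up on the ACTION_DISPATCH path.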
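stripe_map is then very simple: it modifies the bio's bi_sector and bi_bdev directly and returns DM_MAPIO_REMAPPED so that DM dispatches the bio again. The role stripe plays is like a mail relay station with N sub-mailboxes under it. Every letter is forwarded to the right sub-mailbox according to a rule. The station only has to rewrite the address and recipient on each letter and let the courier deliver it once more. The analysis of the bio path in the next section shows this relay in detail.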
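4. How DM Forwards a bio
DM creates one md for each driver instance as its external interface, and each md is registered in the kernel as a block device, so every driver instance is a virtual block device.
Each md manages one or more targets through its driver instance. The driver's main job is to transform each bio submitted to the md and forward it to the corresponding target. The md implements a standard block device driver; only the bio forwarding path is analyzed here.
Every block device has a request queue, and the request queue contains a make_request_fn pointer to a function with the prototype int make_request(struct request_queue *q, struct bio *bio);. The kernel function void generic_make_request(struct bio *bio); uses the bio to find the corresponding bi_bdev, then the request_queue of that bdev, and calls its make_request_fn:
block/blk-core.c, line 1484: ret = q->make_request_fn(q, bio);
This is where a bio enters DM from the rest of the kernel. Why? Because when an md is created (in alloc_dev() in dm.c), the make_request_fn pointer of md->queue is set to dm_request:
drivers/md/dm.c, line 1908: blk_queue_make_request(md->queue, dm_request);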
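The function-pointer hand-off described here can be modelled in a few lines of user-space C. The structures below are heavily simplified stand-ins, not the kernel definitions (for instance, the toy block_device points straight at its queue), and every toy_ name is hypothetical. The sketch only shows that installing a handler on a queue is what routes a bio to it.

```c
#include <assert.h>
#include <stddef.h>

struct bio;
struct request_queue;

/* Same shape as the make_request prototype quoted in the text. */
typedef int (toy_make_request_fn)(struct request_queue *q, struct bio *bio);

/* Simplified stand-ins for the kernel structures. */
struct request_queue { toy_make_request_fn *make_request; };
struct block_device  { struct request_queue *queue; };
struct bio           { struct block_device *bi_bdev; unsigned long long bi_sector; };

static int dm_calls;  /* counts how often the DM entry point ran */

/* Plays the role of dm_request: the entry point of the md. */
static int toy_dm_request(struct request_queue *q, struct bio *bio)
{
    (void)q;
    (void)bio;
    dm_calls++;
    return 0;
}

/* Plays the role of blk_queue_make_request: install the handler. */
static void toy_queue_make_request(struct request_queue *q, toy_make_request_fn *fn)
{
    q->make_request = fn;
}

/* Plays the role of generic_make_request: bio -> bdev -> queue -> handler. */
static int toy_generic_make_request(struct bio *bio)
{
    struct request_queue *q;

    assert(bio != NULL && bio->bi_bdev != NULL && bio->bi_bdev->queue != NULL);
    q = bio->bi_bdev->queue;
    return q->make_request(q, bio);
}
```

Once toy_queue_make_request has installed toy_dm_request on a queue, any bio whose bi_bdev uses that queue reaches the DM entry point through toy_generic_make_request, which mirrors the path described above.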
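When dm_request receives a bio it has two options: if QUEUE_FLAG_STACKABLE is set in q->queue_flags, it handles the I/O through the request queue; otherwise it dispatches the bio directly. alloc_dev creates the queue with QUEUE_FLAG_DEFAULT, which includes QUEUE_FLAG_STACKABLE, but the flag is cleared right afterwards:
drivers/md/dm.c, line 1889: md->queue = blk_init_queue(dm_request_fn, NULL);
...
drivers/md/dm.c, line 1903: queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
dm_request() therefore takes the _dm_request() branch. The path the bio follows is: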
int dm_request(struct request_queue *q, struct bio *bio)
 `-> _dm_request(q, bio);
      `-> __split_and_process_bio(md, bio);   // md is obtained from q->queuedata
           `-> __clone_and_map(&ci);          // md, bio, and related state are recorded in the ci structure
                `-> __map_bio(ti, clone, tio); // ti is looked up in the table via ci->md; clone is cloned from ci->bio
                     `-> ti->type->map(ti, clone, &tio->info);
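In the end, the bio is passed to the map function of the corresponding ti.
Note that the DM framework and its drivers are generally not drivers for real hardware, so they only process a bio and then forward it. Forwarding is done by modifying bio->bi_bdev and bio->bi_sector. bi_bdev must be a device already registered in the kernel. Those block devices are registered in the Linux kernel alongside DM's block devices, and Linux treats them as equals. An md merely records the bdevs of other block devices in its own mapping table and maps and forwards bios according to its own logic.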
dmsetup(8) - Linux man page
Name
dmsetup - low level logical volume management
Synopsis
dmsetup help [-c|-C|--columns]
dmsetup create device_name [-u uuid] [--notable | --table <table> | table_file]
dmsetup remove [-f|--force] device_name
dmsetup remove_all [-f|--force]
dmsetup suspend [--nolockfs] [--noflush] device_name
dmsetup resume device_name
dmsetup load device_name [--table <table> | table_file]
dmsetup clear device_name
dmsetup reload device_name [--table <table> | table_file]
dmsetup rename device_name new_name
dmsetup message device_name sector message
dmsetup ls [--target target_type] [--exec command] [--tree [-o options]]
dmsetup info [device_name]
dmsetup info -c|-C|--columns [--noheadings] [--separator separator] [-o fields] [-O|--sort sort_fields] [device_name]
dmsetup deps [device_name]
dmsetup status [--target target_type] [device_name]
dmsetup table [--target target_type] [device_name]
dmsetup wait device_name [event_nr]
dmsetup mknodes [device_name]
dmsetup targets
dmsetup version
dmsetup setgeometry device_name cyl head sect start
devmap_name major minor
devmap_name major:minor
Description
dmsetup manages logical devices that use the device-mapper driver. Devices are created by loading a table that specifies a target for each sector (512 bytes) in the logical device.
The first argument to dmsetup is a command. The second argument is the logical device name or uuid.
Invoking the command as devmap_name is equivalent to
dmsetup info -c --noheadings -j major -m minor.
Options
-c|-C|--columns
Display output in columns rather than as Field: Value lines.
-j|--major major
Specify the major number.
-m|--minor minor
Specify the minor number.
-n|--noheadings
Suppress the headings line when using columnar output.
--noopencount
Tell the kernel not to supply the open reference count for the device.
--notable
When creating a device, don't load any table.
-o|--options
Specify which fields to display.
-r|--readonly
Set the table being loaded read-only.
--table <table>
Specify a one-line table directly on the command line.
-u|--uuid
Specify the uuid.
-v|--verbose [-v|--verbose]
Produce additional output.
--version
Display the library and kernel driver version.
Commands
create
device_name [-u uuid] [--notable | --table <table> | table_file]
Creates a device with the given name. If table_file or <table> is supplied, the table is loaded and made live. Otherwise a table is read from standard input unless --notable is used. The optional uuid can be used in place of device_name in subsequent dmsetup commands. If successful a device will appear as /dev/device-mapper/<device-name>. See below for information on the table format.
deps
[device_name]
Outputs a list of (major, minor) pairs for devices referenced by the live table for the specified device.
help
[-c|-C|--columns]
Outputs a summary of the commands available, optionally including the list of report fields.
info
[device_name]
Outputs some brief information about the device in the form:
State: SUSPENDED|ACTIVE, READ-ONLY
Tables present: LIVE and/or INACTIVE
Open reference count
Last event sequence number (used by wait)
Major and minor device number
Number of targets in the live table
UUID
info
[--noheadings] [--separator separator] [-o fields] [-O|--sort sort_fields] [device_name]
Output you can customise. Fields are comma-separated and chosen from the following list: name, major, minor, attr, open, segments, events, uuid. Attributes are: (L)ive, (I)nactive, (s)uspended, (r)ead-only, read-(w)rite. Precede the list with '+' to append to the default selection of columns instead of replacing it. Precede any sort_field with - for a reverse sort on that column.
ls
[--target target_type] [--exec command] [--tree [-o options]]
List device names. Optionally only list devices that have at least one target of the specified type. Optionally execute a command for each device. The device name is appended to the supplied command. --tree displays dependencies between devices as a tree. It accepts a comma-separated list of options. Some specify the information displayed against each node: device/nodevice; active, open, rw, uuid. Others specify how the tree is displayed: ascii, utf, vt100; compact, inverted, notrunc.
load|reload
device_name [--table <table> | table_file]
Loads <table> or table_file into the inactive table slot for device_name. If neither is supplied, reads a table from standard input.
message
device_name sector message
Send message to target. If sector not needed use 0.
mknodes
[device_name]
Ensure that the node in /dev/mapper for device_name is correct. If no device_name is supplied, ensure that all nodes in /dev/mapper correspond to mapped devices currently loaded by the device-mapper kernel driver, adding, changing or removing nodes as necessary.
remove
[-f|--force] device_name
Removes a device. It will no longer be visible to dmsetup. Open devices cannot be removed except with older kernels that contain a version of device-mapper prior to 4.8.0. In this case the device will be deleted when its open_count drops to zero. From version 4.8.0 onwards, if a device can't be removed because an uninterruptible process is waiting for I/O to return from it, adding --force will replace the table with one that fails all I/O, which might allow the process to be killed.
remove_all
[-f|--force]
Attempts to remove all device definitions i.e. reset the driver. Use with care! From version 4.8.0 onwards, if devices can't be removed because uninterruptible processes are waiting for I/O to return from them, adding --force will replace the table with one that fails all I/O, which might allow the process to be killed. This also runs mknodes afterwards.
rename
device_name new_name
Renames a device.
resume
device_name
Un-suspends a device. If an inactive table has been loaded, it becomes live. Postponed I/O then gets re-queued for processing.
setgeometry
device_name cyl head sect start
Sets the device geometry to C/H/S.
status
[--target target_type] [device_name]
Outputs status information for each of the device's targets. With --target, only information relating to the specified target type is displayed.
suspend
[--nolockfs] [--noflush] device_name
Suspends a device. Any I/O that has already been mapped by the device but has not yet completed will be flushed. Any further I/O to that device will be postponed for as long as the device is suspended. If there's a filesystem on the device which supports the operation, an attempt will be made to sync it first unless --nolockfs is specified. Some targets such as recent (October 2006) versions of multipath may support the --noflush option. This lets outstanding I/O that has not yet reached the device remain unflushed.
table
[--target target_type] [device_name]
Outputs the current table for the device in a format that can be fed back in using the create or load commands. With --target, only information relating to the specified target type is displayed.
targets
Displays the names and versions of the currently-loaded targets.
version
Outputs version information.
wait
device_name [event_nr]
Sleeps until the event counter for device_name exceeds event_nr. Use -v to see the event number returned. To wait until the next event is triggered, use info to find the last event number.
Table Format
Each line of the table specifies a single target and is of the form:
logical_start_sector num_sectors target_type target_args
There are currently three simple target types available together with more complex optional ones that implement snapshots and mirrors.
linear
destination_device start_sector
The traditional linear mapping.
striped
num_stripes chunk_size [destination start_sector]+
Creates a striped area.
e.g. striped 2 32 /dev/hda1 0 /dev/hdb1 0 will map the first chunk (16k) as follows:
LV chunk 1 -> hda1, chunk 1
LV chunk 2 -> hdb1, chunk 1
LV chunk 3 -> hda1, chunk 2
LV chunk 4 -> hdb1, chunk 2
etc.
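The chunk layout listed above can be reproduced with a little integer arithmetic. The sketch below is a user-space illustration under the assumption that chunks are handed out round-robin exactly as in the list. It is not the kernel's stripe_map, and stripe_hit and striped_map_sector are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result: which stripe a sector maps to, and where on it. */
struct stripe_hit {
    unsigned int stripe;          /* index into the destination list */
    unsigned long long sector;    /* sector on that destination device */
};

/*
 * Map a sector of the striped area (relative to its start) to a destination.
 * chunk_size is in sectors; start_sector[] holds each destination's start_sector.
 */
static struct stripe_hit
striped_map_sector(unsigned long long sector, unsigned int num_stripes,
                   unsigned long long chunk_size,
                   const unsigned long long *start_sector)
{
    struct stripe_hit hit;
    unsigned long long chunk, offset_in_chunk;

    assert(num_stripes > 0 && chunk_size > 0 && start_sector != NULL);

    chunk = sector / chunk_size;            /* which logical chunk */
    offset_in_chunk = sector % chunk_size;  /* position inside that chunk */

    hit.stripe = (unsigned int)(chunk % num_stripes);
    hit.sector = start_sector[hit.stripe]
               + (chunk / num_stripes) * chunk_size
               + offset_in_chunk;
    return hit;
}
```

For "striped 2 32 /dev/hda1 0 /dev/hdb1 0", sector 0 maps to the first chunk of hda1, sector 32 to the first chunk of hdb1, and sector 64 to the second chunk of hda1 (sector 32 on that device), matching the list.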
error
Errors any I/O that goes to this area. Useful for testing or for creating devices with holes in them.
Examples
# A table to join two disks together
0 1028160 linear /dev/hda 0
1028160 3903762 linear /dev/hdb 0
# A table to stripe across the two disks,
# and add the spare space from
# hdb to the back of the volume
0 2056320 striped 2 32 /dev/hda 0 /dev/hdb 0
2056320 2875602 linear /dev/hdb 1028160
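To see how a multi-line table resolves a logical sector, the first example (two disks joined with linear targets) can be modelled as an array of lines searched in order. This is a sketch for illustration only: linear_line and lookup_linear are hypothetical names, and the simple linear scan is an assumption of the example, not a description of how the kernel looks up targets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory form of one "linear" table line. */
struct linear_line {
    unsigned long long logical_start;  /* logical_start_sector */
    unsigned long long num_sectors;    /* num_sectors */
    const char *device;                /* destination_device */
    unsigned long long device_start;   /* start_sector on the destination */
};

/*
 * Find the line covering a logical sector and compute the sector on the
 * destination device. Returns NULL if the sector is beyond the table.
 */
static const struct linear_line *
lookup_linear(const struct linear_line *table, size_t n,
              unsigned long long sector, unsigned long long *device_sector)
{
    size_t i;

    assert(table != NULL && device_sector != NULL);

    for (i = 0; i < n; i++) {
        const struct linear_line *l = &table[i];

        if (sector >= l->logical_start &&
            sector - l->logical_start < l->num_sectors) {
            *device_sector = l->device_start + (sector - l->logical_start);
            return l;
        }
    }
    return NULL;
}
```

With the table { {0, 1028160, "/dev/hda", 0}, {1028160, 3903762, "/dev/hdb", 0} }, logical sector 1028159 is the last sector of /dev/hda, and logical sector 1028160 is sector 0 of /dev/hdb.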