【原创翻译】A Distributed Storage System for Structured Data（三）_【苏宁物流研发】

Bigtable:ADistributedStorageSystemforStructured...

Bigtable: A Distributed Storage System for Structured DataBigTable：结构化数据的一种分布式存储系统

5 实现方法

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accomodate changes in workloads.

BigTable的实现方法包含三个主要组件：一个链接至每个客户端中的程序库、一个主机服务器，以及很多个数据片服务器（Tablet Server）。可以动态地添加（或移除）集群中的数据片服务器，这样便能适应工作负载的变化。

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column family creations.

主机（Master）负责数据片服务器的数据片分配、数据片服务器的新增和失效检测、数据片服务器的负载均衡，以及GFS文件的垃圾收集。另外，它还会处理对模式的相关修改操作，例如创建表和列族。

Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server). The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.

每个数据片服务器都会管理一系列的数据片（通常，我们的每台数据片服务器都有大约数十个至上千个数据片）。数据片服务器会处理针对已加载数据片的读写请求，并且还会将已经增长过大的数据片拆分为多个较小的数据片。

As with many single-master distributed storage systems, client data does not move through the master: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.

如同很多单主机的分布式存储系统一样，客户端的数据都不会经过主机：当进行读写操作时，客户端会直接和数据片服务器展开通信。因为BigTable的客户端不需要依赖于主机就能取得数据片的位置信息，大多数的客户端从来不会与主机进行通信。在实际应用中，主机的负载是很轻的。

A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.

一个BigTable集群会存储很多张表。每张表都由一系列的数据片组成，每个数据片都包含与某个行区间相关联的所有数据。最初，每张表都只有一个数据片。随着表数据的增长，初始的数据片会自动拆分为多个数据片，每个数据片的大小默认为100-200MB。

5.1 数据片的位置

We use a three-level hierarchy analogous to that of a B+-tree to store tablet location information (Figure 4).

我们使用一个三级层次结构的、类似于B+树的数据结构来存储数据片的位置信息（如图-4所示）。

Figure 4: Tablet location hierarchy.
图-4：数据片位置的层次结构

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially – it is never split – to ensure that the tablet location hierarchy has no more than three levels.

第一层是一个存储在Chubby中的文件，它包含了根数据片（Root Tablet）的位置信息。根数据片包含了一个特殊的METADATA（元数据）表中的所有数据片的位置信息。每个METADATA数据片都包含了一组用户数据片的位置信息。根数据片正好就是METADATA表中的第一个数据片，但是对它的处理比较特殊 —— 它从来不会拆分 —— 这样才能确保数据片位置的层次结构不会超过三级。

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet’s table identifier and its end row. Each METADATA row stores approximately 1KB of data in memory. With a modest limit of 128 MB METADATA tablets, our three-level location scheme is sufficient to address 234 tablets (or 261 bytes in 128 MB tablets).

METADATA表会将一个数据片的位置信息存储在一个行关键字之下，而这个行关键字是由这个数据片的所在表的标识符和这个数据片的最后一行进行编码得到的。每个METADATA行都会在内存中存储大约1KB的数据。METADATA数据片的最合适的大小限制为128MB，我们的三级位置模式完全可以寻址234个数据片（或者在128MB的数据片中寻址261个字节）。

The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.

客户端程序库会缓存数据片的位置信息。如果客户端不知道某个数据片的位置信息，或者它发现缓存的位置信息是不正确的，那么它就会向上递归访问数据片位置的层次结构。如果客户端的缓存是空的，那么寻址算法需要三个来回的网络通信才能寻址，包括一次对Chubby的读取操作。如果客户端的缓存数据失效了，那么寻址算法就需要6个来回的网络通信才能寻址，因为只有在缓存命中失败时才能发现失效的缓存条目[6]。虽然数据片的位置信息是存储在内存中的，这样就不需要访问GFS文件系统，但是通过客户端程序库预取数据片的位置信息，我们还可以在常见情况下进一步减少开销：无论客户端何时读取METADATA表，它每次都会读取多个数据片的元数据。

We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.

我们还会在METADATA表中存储次级信息，包括每个数据片所有的事件日志（例如，某台服务器何时为数据片提供服务）。这种信息对于调试和性能分析非常有帮助。

5.2 数据片的分配

Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers, and the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.

在任何时候，每个数据片只能分配至一个数据片服务器。主机（Master）会记录当前有哪些活跃的数据片服务器、当前有哪些数据片分配至哪些数据片服务器，以及哪些数据片尚未被分配。当有一个数据片尚未被分配，并且刚好有一个数据片服务器具备足够的空间能够存储这个数据片，那么主机便会向这个数据片服务器发送一个加载请求，这样便能将这个数据片分配至这个服务器了。

Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers. A tablet server stops serving its tablets if it loses its exclusive lock: e.g., due to a network partition that caused the server to lose its Chubby session. (Chubby provides an efficient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traffic.) A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. If the file no longer exists, then the tablet server will never be able to serve again, so it kills itself. Whenever a tablet server terminates (e.g., because the cluster management system is removing the tablet server’ s machine from the cluster), it attempts to release its lock so that the master will reassign its tablets more quickly.

BigTable使用Chubby跟踪记录数据片服务器的状态。当一个数据片服务器启动时，它会在一个特定的Chubby目录中创建一个具有唯一名称的文件，然后便能获取一个该文件的排它锁。主机会监控这个目录（servers目录），这样便能发现新加入的数据片服务器。如果一个数据片服务器丢失了自己的排它锁，那么它便会停止数据片服务：例如，由于网络分区而导致这台服务器丢失了自己的Chubby会话。（Chubby提供了一种有效的机制，使得数据片服务器不用耗费网络流量就可以检查自己是否仍然持有排它锁。）只要上述文件仍然存在，数据片服务器将会尝试重新获取这个文件的排它锁。如果这个文件不再存在了，那么这个数据片服务器将不能再次提供服务，因此它便会停止运行。当一个数据片服务器终止运行时（例如，因为集群管理系统正在将这个数据片服务器的机器从集群中移除），它会尝试释放自己的排它锁，这样主机将可以更加快速地重新分配它的数据片。

The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server’s file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server’s file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires. However, as described above, master failures do not change the assignment of tablets to tablet servers.

主机（Master）负责检测数据片服务器何时不再提供数据片服务，并且会尽快将这些数据片进行重新分配。为了检测数据片服务器何时不再提供数据片服务，主机会定期地向每个数据片服务器轮询它的排它锁状态。如果某个数据片服务器报告它已经丢失了自己的排它锁，或者如果主机最近几次尝试和某个数据片服务器进行通信都不成功，那么主机便会尝试获取这个服务器文件的排它锁。如果主机能够获取这个排它锁，那么就说明Chubby是正常运行的，而数据片服务器要么是宕机了，要么是和Chubby之间有通信故障，因此，主机就会删除这个数据片服务器的文件，确保这个数据片服务器再也不能提供服务了。一旦删除了数据片服务器的文件，主机就会将先前分配给这个数据片服务器的所有数据片变为未分配的状态。为了确保主机和Chubby之间的网络问题不会影响BigTable集群的正常运行，如果主机的Chubby会话已过期，那么它便会停止自身的运行。然而，正如上文所述，主机故障不会改变数据片到数据片服务器的分配状态。

When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

当集群管理系统启动一个主机时，在修改数据片分配状态之前，它需要先了解当前的数据片分配状态。主机会在启动阶段执行以下步骤：（1）主机会从Chubby获取一个唯一的主机锁，用于防止并发的主机实例化。（2）主机会扫描Chubby的servers目录，找到活跃的服务器。（3）主机会和每个活跃的数据片服务器进行通信，获取每台服务器上的数据片分配状态信息。（4）主机会扫描METADATA表，获取数据片的集合信息。当这种扫描操作遇到一个尚未分配的数据片时，主机便会将这个数据片添加至未分配数据片集合，这样便使得这个数据片符合数据片分配的条件。

One complication is that the scan of the METADATA table cannot happen until the METADATA tablets have been assigned. Therefore, before starting this scan (step 4), the master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered during step 3. This addition ensures that the root tablet will be assigned. Because the root tablet contains the names of all METADATA tablets, the master knows about all of them after it has scanned the root tablet.

可能会遇到一种复杂的情况：如果METADATA数据片尚未被分配，那么主机是不能扫描METADATA表的。因此，在开始扫描（步骤-4）之前，如果在步骤-3的执行过程中发现没有分配根数据片（Root Tablet），那么主机就会将根数据片添加至未分配的数据片集合中。这个额外操作能够确保根数据片一定会被分配。因为，根数据片包含了所有METADATA数据片的名称，待主机扫描根数据片之后，主机就会获得所有METADATA数据片的名称了。

The set of existing tablets only changes when a table is created or deleted, two existing tablets are merged to form one larger tablet, or an existing tablet is split into two smaller tablets. The master is able to keep track of these changes because it initiates all but the last. Tablet splits are treated specially since they are initiated by a tablet server. The tablet server commits the split by recording information for the new tablet in the METADATA table. When the split has committed, it notifies the master. In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

已有数据片的集合只有当创建或删除表时才会修改，例如，将两个已有的数据片合并为一个较大的数据片，或者将一个已有的数据片拆分为两个较小的数据片。因为除了最后一次修改操作之外，其他的修改操作都是由主机初始化的，所以主机能够跟踪记录这些修改操作。数据片的分拆操作应当特殊对待，因为它们都是由数据片服务器初始化的。数据片服务器会在METADATA表中记录新数据片的信息，这样服务器便能提交分拆操作了。当分拆操作被提交之后，数据片服务器便会通知主机。如果已提交分拆操作的通知信息丢失了（有可能数据片服务器或主机宕机了），那么当主机要求一个数据片服务器加载这个刚刚分拆的数据片时，这个主机便能检测到新的数据片。数据片服务器将会向主机重新发送通知信息，因为它在METADATA表中找到的数据片条目只会指定加载相应数据片的一部分，而主机则会要求数据片服务器完整加载这个数据片。

5.3 数据片的服务

The persistent state of a tablet is stored in GFS, as illustrated in Figure 5. Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of a redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

数据片的持久化状态存储在GFS中，如图-5所示。更新操作会被提交至一个提交日志文件中，这个文件用于存储重做记录（Redo Record）。在这些更新操作中，最近提交的更新操作存储在内存中的一个排序缓冲区中，这个缓冲区被称为memtable；较老的更新操作存储在一系列的SSTable中。若要恢复一个数据片，则数据片服务器就需要从METADATA表中读取它的元数据。数据片的元数据包含了组成一个数据片的SSTable列表，以及一系列的重做点（Redo Point），这些重做点指向可能含有这个数据片的数据的提交日志。数据片服务器会将SSTable的索引读取至内存中，然后通过执行这些重做点之后提交的所有更新操作就可以重建memtable了。

Figure 5: Tablet Representation
图-5：数据片的表示方法

When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations. After the write has been committed, its contents are inserted into the memtable.

当一个数据片服务器收到一条写入操作时，它就会检查这个写入操作的格式是否正确，以及发送方是否具有执行写入操作的权限。数据片服务器会从一个Chubby文件中读取许可的写入者列表（这个文件几乎一定会存放在Chubby的客户端缓存中），这样就能对写入操作的发送发进行身份验证了。一次有效的写入操作会记录在提交日志文件中。可以采用分组提交的方式来提高包含大量小规模写入操作的应用程序的吞吐量。当一个写入操作提交之后，它的内容便会插入至memtable中。

When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

当一个数据片服务器收到一条读取操作时，和处理写入操作类似，它会检查这个读取操作的格式是否正确，以及发送方是否具有执行读取操作的权限。一次有效的读取操作会读取一个由SSTable序列和memtable组成的合并视图。因为SSTable和memtable都是按字典排序的数据结构，所以可以高效地生成合并视图。

Incoming read and write operations can continue while tablets are split and merged.

当数据片进行分拆和合并时，数据片服务器仍然可以继续处理收到的读取操作和写入操作。

5.4 空间压缩

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

随着写入操作的执行，memtable的尺寸也会不断增大。当memtable的尺寸达到一个阈值时，这个memtable就会被冻结，然后会创建一个新的memtable，然后会将这个被冻结的memtable转换成一个SSTable，最后将其写入至GFS文件系统。这种次级压缩（Minor Compaction）处理有两个目的：它能够减少数据片服务器的内存使用率，如果这个服务器宕机了，它还能减少恢复期间需要从提交日志读取的数据总量。当空间压缩正在进行时，数据片服务器仍然可以继续处理收到的读取操作和写入操作。

Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

每次的次级压缩处理都会创建一个新的SSTable。如果这种行为未经检查持续执行下去，那么读取操作可能就需要合并来自任意数量的SSTable的更新操作。但是，我们会定时在后台执行一次合并压缩操作，借此限制这类文件的数量。合并压缩操作会读取一些SSTable和memtable的内容，然后将这些数据合并写入至一个新的SSTable中。一旦合并压缩操作结束，就可以将输入的SSTable和memtable删除了。

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

如果合并压缩会将所有的SSTable都合并重写至一个新的SSTable中，那么这次的合并压缩就被称为一次主干压缩（Major Compaction）。由非主干压缩产生的SSTable可能含有特殊的删除条目，这些删除条目会消除仍然活跃的、较老的SSTable中的被删除数据。另一方面，主干压缩也会产生一个SSTable，但是它不会包含删除信息或被删除数据。BigTable会循环扫描它所有的数据片，并且会定期地对它们进行主干压缩。这些主干压缩操作使得BigTable能够回收被删除数据使用的资源，并且还能够确保被删除的数据能够及时地从系统中消失[7]，这对于存储敏感数据的服务来说是非常重要的。

【微信号】IT_SNL 欢迎关注本公众号。

关注苏宁物流研发

【原创翻译】A Distributed Storage System for Structured Data（三）

5 实现方法

5.1 数据片的位置

5.2 数据片的分配

5.3 数据片的服务

5.4 空间压缩

0 个评论

数据片相关文章

发起人

推荐内容

【原创翻译】A Distributed Storage System for Structured Data（三）

5 实现方法

5.1 数据片的位置

5.2 数据片的分配

5.3 数据片的服务

5.4 空间压缩

0 个评论

数据片 相关文章

发起人

推荐内容

苏宁物流研发相关文章

数据片相关文章