Since each replica in swift functions independently, and clients generally require only a simple majority of nodes responding to consider an operation successful, transient failures like network partitions can quickly cause replicas to diverge. These differences are eventually reconciled by asynchronous, peer-to-peer replicator processes. The replicator processes traverse their local filesystems, concurrently performing operations in a manner that balances load across physical disks.
Replication uses a push model, with records and files generally only being copied from local to remote replicas. This is important because data on the node may not belong there (as in the case of handoffs and ring changes), and a replicator can’t know what data exists elsewhere in the cluster that it should pull in. It’s the duty of any node that contains data to ensure that data gets to where it belongs. Replica placement is handled by the ring.
Every deleted record or file in the system is marked by a tombstone, so that deletions can be replicated alongside creations. These tombstones are cleaned up by the replication process after a period of time referred to as the consistency window, which is related to replication duration and how long transient failures can remove a node from the cluster. Tombstone cleanup must be tied to replication to reach replica convergence.
If a replicator detects that a remote drive has failed, it will use the ring’s “get_more_nodes” interface to choose an alternate node to synchronize with. The replicator can generally maintain desired levels of replication in the face of hardware failures, though some replicas may not be in an immediately usable location.
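To make that fallback concrete, here is a minimal Python sketch of the decision. It is a simplified model rather than the replicator's actual code: the is_reachable probe and the node objects are assumptions; only the get_part_nodes/get_more_nodes names mirror the ring interface mentioned above.

# Simplified model of picking sync targets for one partition, assuming a ring
# object that exposes get_part_nodes(part) and get_more_nodes(part) like
# swift's ring does, plus a hypothetical is_reachable(node) health probe.
def pick_sync_targets(ring, part, is_reachable):
    targets = []
    handoffs = ring.get_more_nodes(part)    # iterator of alternate (handoff) nodes
    for node in ring.get_part_nodes(part):  # the partition's primary nodes
        if is_reachable(node):
            targets.append(node)
        else:
            # Primary looks failed: substitute the next handoff node so the
            # desired replica count is still reached, even if the location
            # is not immediately usable for reads.
            alternate = next(handoffs, None)
            if alternate is not None:
                targets.append(alternate)
    return targets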
Replication is an area of active development, and likely rife with potential improvements to speed and correctness.
There are two major classes of replicator - the db replicator, which replicates accounts and containers, and the object replicator, which replicates object data.
The first step performed by db replication is a low-cost hash comparison to find out whether or not two replicas already match. Under normal operation, this check is able to verify that most databases in the system are already synchronized very quickly. If the hashes differ, the replicator brings the databases in sync by sharing records added since the last sync point.
This sync point is a high water mark noting the last record at which two databases were known to be in sync, and is stored in each database as a tuple of the remote database id and record id. Database ids are unique amongst all replicas of the database, and record ids are monotonically increasing integers. After all new records have been pushed to the remote database, the entire sync table of the local database is pushed, so the remote database knows it’s now in sync with everyone the local database has previously synchronized with.
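A rough Python sketch of that exchange is shown below. It is a simplified model of the logic just described, not the actual db replicator; the method names on the local and remote handles (db_hash, get_sync_point, rows_since, and so on) are invented for illustration.

def replicate_db(local, remote):
    # Cheap hash comparison first: under normal operation most replicas match.
    if local.db_hash() == remote.db_hash():
        return
    # High water mark: the last record id known to be in sync with this remote.
    sync_point = local.get_sync_point(remote.db_id)   # e.g. -1 if never synced
    # Push every record added since the sync point.
    new_rows = local.rows_since(sync_point)
    remote.merge_rows(new_rows)
    # Push the whole local sync table so the remote also learns about every
    # database the local replica has previously synchronized with.
    remote.merge_sync_table(local.get_sync_table())
    # Remember the new high water mark for this remote database.
    if new_rows:
        local.set_sync_point(remote.db_id, new_rows[-1]['ROWID'])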
In practice, DB replication can process hundreds of databases per concurrency setting per second (up to the number of available CPUs or disks) and is bound by the number of DB transactions that must be performed.
The initial implementation of object replication simply performed an rsync to push data from a local partition to all remote servers it was expected to exist on. While this performed adequately at small scale, replication times skyrocketed once directory structures could no longer be held in RAM. We now use a modification of this scheme in which a hash of the contents for each suffix directory is saved to a per-partition hashes file. The hash for a suffix directory is invalidated when the contents of that suffix directory are modified.
The object replication process reads in these hash files, calculating any invalidated hashes. It then transmits the hashes to each remote server that should hold the partition, and only suffix directories with differing hashes on the remote server are rsynced. After pushing files to the remote server, the replication process notifies it to recalculate hashes for the rsynced suffix directories.
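The hashing and comparison step can be pictured with the short Python sketch below. It is only an illustration of the idea; the hash-over-directory-entries scheme and the helper names are stand-ins, not the object replicator's real per-suffix hashing.

import hashlib
import os

def hash_suffix_dir(suffix_path):
    # Stand-in for Swift's per-suffix content hash: digest the entries found
    # under one suffix directory.
    md5 = hashlib.md5()
    for root, _dirs, files in sorted(os.walk(suffix_path)):
        for name in sorted(files):
            md5.update(os.path.join(root, name).encode('utf-8'))
    return md5.hexdigest()

def suffixes_to_rsync(local_hashes, remote_hashes):
    # Only suffix directories whose hashes differ on the remote need rsyncing.
    return [suffix for suffix, digest in local_hashes.items()
            if remote_hashes.get(suffix) != digest]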
Performance of object replication is generally bound by the number of uncached directories it has to traverse, usually as a result of invalidated suffix directory hashes. Using write volume and partition counts from our running systems, it was designed so that around 2% of the hash space on a normal node will be invalidated per day, which has experimentally given us acceptable replication speeds.
Rate limiting in swift is implemented as a pluggable middleware. Rate limiting is performed on requests that result in database writes to the account and container sqlite dbs. It uses memcached and is dependent on the proxy servers having highly synchronized time. The rate limits are limited by the accuracy of the proxy server clocks.
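As a toy illustration of the mechanism, the sketch below keeps a shared "next allowed time" counter per key and sleeps any request that arrives early. It is a simplified model of the approach, not the middleware's actual algorithm, and the plain dict stands in for memcached.

import time

class ToyRateLimiter:
    def __init__(self, cache, max_rate):
        self.cache = cache                 # dict here; memcached in the middleware
        self.interval = 1.0 / max_rate     # seconds allotted to each request

    def sleep_time(self, key):
        now = time.time()
        # Each request pushes the shared "running time" forward by one slot.
        running = max(self.cache.get(key, now), now)
        self.cache[key] = running + self.interval
        # If the counter is ahead of the wall clock, this request must wait.
        return max(0.0, running - now)

limiter = ToyRateLimiter(cache={}, max_rate=100)
time.sleep(limiter.sleep_time('PUT/account/container'))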
All configuration is optional. If no account or container limits are provided there will be no rate limiting. Configuration available:
clock_accuracy (default: 1000)
    Represents how accurate the proxy servers’ system clocks are with each other. 1000 means that all the proxies’ clocks are accurate to each other within 1 millisecond. No ratelimit should be higher than the clock accuracy.
log_sleep_time_seconds (default: 0)
    To allow visibility into rate limiting, set this value > 0 and all sleeps greater than the number will be logged.
rate_buffer_seconds (default: 5)
    Number of seconds the rate counter can drop and be allowed to catch up (at a faster than listed rate). A larger number will result in larger spikes in rate but better average accuracy.
account_whitelist (default: '')
    Comma separated list of account names that will not be rate limited.
account_blacklist (default: '')
    Comma separated list of account names that will not be allowed. Returns a 497 response.
container_ratelimit_size (default: '')
    When set with container_ratelimit_x = r: for containers of size x, limit requests per second to r. Will limit PUT, DELETE, and POST requests to /a/c/o.
The container rate limits are linearly interpolated from the values given (a short sketch of the interpolation follows the result table below). A sample container rate limiting configuration could be:
container_ratelimit_100 = 100
container_ratelimit_200 = 50
container_ratelimit_500 = 20
This would result in:

Container Size    Rate Limit
0-99              No limiting
100               100
150               75
500               20
1000              20
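The linear interpolation can be reproduced with the small Python sketch below; it only illustrates the rule, using the sample values above, and is not the middleware's code.

def container_ratelimit(size, thresholds):
    # thresholds: sorted list of (container_size, requests_per_second) pairs
    if size < thresholds[0][0]:
        return None                          # below the smallest size: no limiting
    if size >= thresholds[-1][0]:
        return thresholds[-1][1]             # at or beyond the largest size
    for (lo_size, lo_rate), (hi_size, hi_rate) in zip(thresholds, thresholds[1:]):
        if lo_size <= size < hi_size:
            frac = (size - lo_size) / float(hi_size - lo_size)
            return lo_rate + frac * (hi_rate - lo_rate)

limits = [(100, 100), (200, 50), (500, 20)]   # the sample configuration above
print(container_ratelimit(150, limits))       # -> 75.0
print(container_ratelimit(1000, limits))      # -> 20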
7. Large Object Support
7.1 Overview
Swift has a limit on the size of a single uploaded object; by default this is 5GB. However, the download size of a single object is virtually unlimited with the concept of segmentation. Segments of the larger object are uploaded and a special manifest file is created that, when downloaded, sends all the segments concatenated as a single object. This also offers much greater upload speed with the possibility of parallel uploads of the segments.
7.2 Using swift for Segmented Objects
The quickest way to try out this feature is to use the included swift command-line tool. You can use the -S option to specify the segment size to use when splitting a large file. For example:
swift upload test_container -S 1073741824 large_file
This would split the large_file into 1G segments and begin uploading those segments in parallel. Once all the segments have been uploaded, swift will then create the manifest file so the segments can be downloaded as one.
So now, the following swift command would download the entire large object:
swift download test_container large_file
swift uses a strict convention for its segmented object support. In the above example it will upload all the segments into a second container named test_container_segments. These segments will have names like large_file/1290206778.25/21474836480/00000000, large_file/1290206778.25/21474836480/00000001, etc.
The main benefit for using a separate container is that the main container listings will not be polluted with all the segment names. The reason for using the segment name format of <name>/<timestamp>/<size>/<segment> is so that an upload of a new file with the same name won’t overwrite the contents of the first until the last moment when the manifest file is updated.
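For illustration, names of that form could be produced as in the short sketch below; the values are placeholders taken from the example names above, and the real names are generated by the swift tool itself.

def segment_name(obj_name, upload_time, total_size, segment_index):
    # <name>/<timestamp>/<size>/<segment>
    return '%s/%s/%d/%08d' % (obj_name, upload_time, total_size, segment_index)

print(segment_name('large_file', '1290206778.25', 21474836480, 1))
# -> large_file/1290206778.25/21474836480/00000001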
swift will manage these segment files for you, deleting old segments on deletes and overwrites, etc. You can override this behavior with the --leave-segments option if desired; this is useful if you want to have multiple versions of the same large object available.
7.3 Direct API
You can also work with the segments and manifests directly with HTTP requests instead of having swift do that for you. You can just upload the segments like you would any other object, and the manifest is just a zero-byte file with an extra X-Object-Manifest header.
All the object segments need to be in the same container, have a common object name prefix, and their names sort in the order they should be concatenated. They don’t have to be in the same container as the manifest file will be, which is useful to keep container listings clean as explained above with swift.
The manifest file is simply a zero-byte file with the extra X-Object-Manifest: <container>/<prefix> header, where <container> is the container the object segments are in and <prefix> is the common prefix for all the segments.
It is best to upload all the segments first and then create or update the manifest. In this way, the full object won’t be available for downloading until the upload is complete. Also, you can upload a new set of segments to a second location and then update the manifest to point to this new location. During the upload of the new segments, the original manifest will still be available to download the first set of segments.
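A hedged sketch of those direct steps, using the Python requests library, is shown below; the storage URL, token, names, and segment contents are placeholders, and the download step is shown with curl right after.

import requests

STORAGE_URL = 'http://storage.example.com/v1/AUTH_test'   # placeholder
HEADERS = {'X-Auth-Token': 'token-goes-here'}              # placeholder

# First, upload the segments as ordinary objects under a common prefix,
# named so they sort in concatenation order.
for index, chunk in enumerate([b'first segment', b'second segment', b'third segment']):
    requests.put('%s/container/myobject/%08d' % (STORAGE_URL, index),
                 headers=HEADERS, data=chunk)

# Then create the zero-byte manifest pointing at <container>/<prefix>.
manifest_headers = dict(HEADERS, **{'X-Object-Manifest': 'container/myobject/'})
requests.put('%s/container/myobject' % STORAGE_URL,
             headers=manifest_headers, data=b'')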
# And now we can download the segments as a single object
curl -H 'X-Auth-Token: <token>' \
http://<storage_url>/container/myobject
7.4 Additional Notes
With a GET or HEAD of a manifest file, the X-Object-Manifest: <container>/<prefix> header will be returned with the concatenated object so you can tell where it’s getting its segments from.
The response’s Content-Length for a GET or HEAD on the manifest file will be the sum of all the segments in the <container>/<prefix> listing, dynamically. So, uploading additional segments after the manifest is created will cause the concatenated object to be that much larger; there’s no need to recreate the manifest file.
The response’s Content-Type for a GET or HEAD on the manifest will be the same as the Content-Type set during the PUT request that created the manifest. You can easily change the Content-Type by reissuing the PUT.
The response’s ETag for a GET or HEAD on the manifest file will be the MD5 sum of the concatenated string of ETags for each of the segments in the <container>/<prefix> listing, dynamically. Usually in Swift the ETag is the MD5 sum of the contents of the object, and that holds true for each segment independently. But, it’s not feasible to generate such an ETag for the manifest itself, so this method was chosen to at least offer change detection.
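For example, the manifest's ETag could be reproduced on the client side roughly as in the sketch below; the two segment ETags are made-up values.

import hashlib

segment_etags = ['d41d8cd98f00b204e9800998ecf8427e',
                 '0cc175b9c0f1b6a831c399e269772661']   # hypothetical per-segment ETags
# MD5 of the concatenated segment ETag strings, as returned for the manifest.
manifest_etag = hashlib.md5(''.join(segment_etags).encode('ascii')).hexdigest()
print(manifest_etag)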
If you are using the container sync feature you will need to ensure both your manifest file and your segment files are synced if they happen to be in different containers.
7.5 History
Large object support has gone through various iterations before settling on this implementation.
The primary factor driving the limitation of object size in swift is maintaining balance among the partitions of the ring. To maintain an even dispersion of disk usage throughout the cluster the obvious storage pattern was to simply split larger objects into smaller segments, which could then be glued together during a read.
Before the introduction of large object support some applications were already splitting their uploads into segments and re-assembling them on the client side after retrieving the individual pieces. This design allowed the client to support backup and archiving of large data sets, but was also frequently employed to improve performance or reduce errors due to network interruption. The major disadvantage of this method is that knowledge of the original partitioning scheme is required to properly reassemble the object, which is not practical for some use cases, such as CDN origination.
In order to eliminate any barrier to entry for clients wanting to store objects larger than 5GB, initially we also prototyped fully transparent support for large object uploads. A fully transparent implementation would support a larger max size by automatically splitting objects into segments during upload within the proxy without any changes to the client API. All segments were completely hidden from the client API.
This solution introduced a number of challenging failure conditions into the cluster, wouldn’t provide the client with any option to do parallel uploads, and had no basis for a resume feature. The transparent implementation was deemed just too complex for the benefit.
The current “user manifest” design was chosen in order to provide a transparent download of large objects to the client and still provide the uploading client a clean API to support segmented uploads.
Alternative “explicit” user manifest options were discussed which would have required a pre-defined format for listing the segments to “finalize” the segmented upload. While this may offer some potential advantages, it was decided that pushing an added burden onto the client which could potentially limit adoption should be avoided in favor of a simpler “API” (essentially just the format of the ‘X-Object-Manifest’ header).
During development it was noted that this “implicit” user manifest approach which is based on the path prefix can be potentially affected by the eventual consistency window of the container listings, which could theoretically cause a GET on the manifest object to return an invalid whole object for that short term. In reality you’re unlikely to encounter this scenario unless you’re running very high concurrency uploads against a small testing environment which isn’t running the object-updaters or container-replicators.
Like all of swift, Large Object Support is a living feature which will continue to improve and may change over time.
8. Container to Container Synchronization
8.1 Overview
Swift has a feature where all the contents of a container can be mirrored to another container through background synchronization. Swift cluster operators configure their cluster to allow/accept sync requests to/from other clusters, and the user specifies where to sync their container to along with a secret synchronization key.
Container sync will sync object POSTs only if the proxy server is set to use “object_post_as_copy = true” which is the default. So-called fast object posts, “object_post_as_copy = false” do not update the container listings and therefore can’t be detected for synchronization.
If you are using the large objects feature you will need to ensure both your manifest file and your segment files are synced if they happen to be in different containers.
8.2 Configuring a Cluster’s Allowable Sync Hosts
The Swift cluster operator must allow synchronization with a set of hosts before the user can enable container synchronization. First, the backend container server needs to be given this list of hosts in the container-server.conf file:
# This is a comma separated list of hosts allowed in the
# X-Container-Sync-To field for containers.
# allowed_sync_hosts = 127.0.0.1
allowed_sync_hosts = host1,host2,etc.
...
[container-sync]
# You can override the default log routing for this app here (don't
# use set!):
# log_name = container-sync
# log_facility = LOG_LOCAL0
# log_level = INFO
# Will sync, at most, each container once per interval
# interval = 300
# Maximum amount of time to spend syncing each container
# container_time = 60
Tracking sync progress, problems, and just general activity can only be achieved with log processing for this first release of container synchronization. In that light, you may wish to set the above log_ options to direct the container-sync logs to a different file for easier monitoring. Additionally, it should be noted there is no way for an end user to detect sync progress or problems other than HEADing both containers and comparing the overall information.
The authentication system also needs to be configured to allow synchronization requests. Here is an example with TempAuth:
[filter:tempauth]
# This is a comma separated list of hosts allowed to send
# X-Container-Sync-Key requests.
# allowed_sync_hosts = 127.0.0.1
allowed_sync_hosts = host1,host2,etc.
The default of 127.0.0.1 is just so no configuration is required for SAIO setups – for testing.
8.3 Using the swift tool to set up synchronized containers
Note
You must be the account admin on the account to set synchronization targets and keys.
You simply tell each container where to sync to and give it a secret synchronization key. First, get the account details for the two cluster accounts with the swift tool; the StorageURL reported for each account is what the sync target URL is built from. Then point the first cluster’s container at the second: the swift tool’s -t option indicates the URL to sync to, which is the StorageURL from cluster2 plus the container name, and the -k option specifies the secret key the two containers will share for synchronization. Finally, do something similar for the second cluster’s container, pointing it back at the first.
That’s it. Now we can upload a bunch of stuff to the first container and watch as it gets synchronized over to the second:
$ swift -A http://cluster1/auth/v1.0 -U test:tester -K testing \
upload container1 .
photo002.png
photo004.png
photo001.png
photo003.png
$ swift -A http://cluster2/auth/v1.0 -U test2:tester2 -K testing2 \
list container2
[Nothing there yet, so we wait a bit...]
[If you're an operator running SAIO and just testing, you may need to
run 'swift-init container-sync once' to perform a sync scan.]
$ swift -A http://cluster2/auth/v1.0 -U test2:tester2 -K testing2 \
list container2
photo001.png
photo002.png
photo003.png
photo004.png
You can also set up a chain of synced containers if you want more than two. You’d point 1 -> 2, then 2 -> 3, and finally 3 -> 1 for three containers. They’d all need to share the same secret synchronization key.
8.4 Using curl (or other tools) instead
So what’s swift doing behind the scenes? Nothing overly complicated. It translates the -t <value> option into an X-Container-Sync-To: <value> header and the -k <value> option into an X-Container-Sync-Key: <value> header.
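So the same setup can be done with any HTTP client. Here is a hedged sketch using the Python requests library; the URLs, account paths, token, and key are placeholders.

import requests

headers = {
    'X-Auth-Token': 'token-for-the-cluster1-account',        # placeholder
    # Point container1 on cluster1 at container2 on cluster2:
    'X-Container-Sync-To': 'http://cluster2/v1/AUTH_account2/container2',
    'X-Container-Sync-Key': 'secret',
}
# A POST to the container stores the sync target and key as container metadata.
requests.post('http://cluster1/v1/AUTH_account1/container1', headers=headers)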
8.5 What’s going on behind the scenes, in the cluster?
The swift-container-sync process does the job of sending updates to the remote container.
This is done by scanning the local devices for container databases and checking for x-container-sync-to and x-container-sync-key metadata values. If they exist, newer rows since the last sync will trigger PUTs or DELETEs to the other container.
Container sync will sync object POSTs only if the proxy server is set to use “object_post_as_copy = true” which is the default. So-called fast object posts, “object_post_as_copy = false” do not update the container listings and therefore can’t be detected for synchronization.
The actual syncing is slightly more complicated to make use of the three (or number-of-replicas) main nodes for a container without each trying to do the exact same work but also without missing work if one node happens to be down.
Two sync points are kept per container database. All rows between the two sync points trigger updates. Any rows newer than both sync points cause updates depending on the node’s position for the container (primary nodes do one third, etc. depending on the replica count of course). After a sync run, the first sync point is set to the newest ROWID known and the second sync point is set to newest ROWID for which all updates have been sent.
An example may help. Assume replica count is 3 and perfectly matching ROWIDs starting at 1.
First sync run, database has 6 rows:
SyncPoint1 starts as -1.
SyncPoint2 starts as -1.
No rows between points, so no “all updates” rows.
Six rows newer than SyncPoint1, so a third of the rows are sent by node 1, another third by node 2, remaining third by node 3.
SyncPoint1 is set as 6 (the newest ROWID known).
SyncPoint2 is left as -1 since no “all updates” rows were synced.
Next sync run, database has 12 rows:
SyncPoint1 starts as 6.
SyncPoint2 starts as -1.
The rows between -1 and 6 all trigger updates (most of which should short-circuit on the remote end as having already been done).
Six more rows newer than SyncPoint1, so a third of the rows are sent by node 1, another third by node 2, remaining third by node 3.
SyncPoint1 is set as 12 (the newest ROWID known).
SyncPoint2 is set as 6 (the newest “all updates” ROWID).
In this way, under normal circumstances each node sends its share of updates each run and just sends a batch of older updates to ensure nothing was missed.
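That bookkeeping can be modeled with the Python sketch below. It is a simplified illustration of the scheme described above, not swift-container-sync itself; in particular, splitting the newest rows by ROWID modulo the replica count is a stand-in for however the real daemon divides work among the nodes.

def sync_run(rows, node_index, replica_count, sync_point1, sync_point2, send_update):
    # rows: this container database's rows in ascending ROWID order.
    newest = sync_point1
    for row in rows:
        rowid = row['ROWID']
        newest = max(newest, rowid)
        if rowid <= sync_point2:
            continue                      # already sent by every node in an earlier run
        if rowid <= sync_point1:
            send_update(row)              # between the points: every node re-sends these
        elif rowid % replica_count == node_index:
            send_update(row)              # newer than both points: send only this node's share
    # SyncPoint1 becomes the newest ROWID known; SyncPoint2 becomes the newest
    # ROWID for which all updates have been sent (the old SyncPoint1).
    return newest, sync_point1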