1. What is the role of the Proxy Server?


1.1 Proxy Server

       When objects are streamed to or from an object server, they pass directly through the proxy server to or from the user; the proxy server does not spool them.

1.2 The Ring


1.3 Object Server

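1.4 Container Server
       The container server's primary job is to handle listings of objects. It does not know where the objects are stored, only which objects are in a given container. The listings are stored as sqlite database files and are replicated across the cluster much as objects are. The container server also tracks some statistics, such as the total number of objects and the storage usage of the container.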
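The idea of keeping a container's listing in an sqlite file can be sketched as follows. This is a minimal illustration only: the table name, the columns, and the helper functions are hypothetical and do not reflect the schema Swift actually uses.

```python
import sqlite3

# Hypothetical schema: one row per object in a single container.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER)")
conn.executemany(
    "INSERT INTO object (name, size) VALUES (?, ?)",
    [("photos/a.jpg", 2048), ("photos/b.jpg", 4096), ("notes.txt", 100)],
)


def list_objects(conn, prefix=""):
    """Return object names in sorted order, optionally filtered by prefix."""
    rows = conn.execute("SELECT name FROM object ORDER BY name")
    return [name for (name,) in rows if name.startswith(prefix)]


def container_stats(conn):
    """Return (object_count, bytes_used) for the container."""
    return conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(size), 0) FROM object"
    ).fetchone()
```

Note that the listing records only names and sizes; nothing in it says where an object is stored, which matches the division of labor described above.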

1.5 Account Server

1.6 Replication

1.7 Updaters

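1.8 Auditors
       Auditors repeatedly crawl the local server, checking the integrity of objects, containers, and accounts. If corruption is found (for example bit rot, where stored bits silently change), the file is quarantined and the replicator replaces the bad file from another replica. Other errors are logged (for example, when an object's listing cannot be found on any container server where it should be).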
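The check-and-quarantine step can be sketched roughly as below. This is a simplified illustration under stated assumptions: the function name `audit_file`, the use of a stored MD5 checksum, and the quarantine directory layout are all hypothetical, not Swift's actual auditor code.

```python
import hashlib
import os
import shutil
import tempfile


def audit_file(path, expected_md5, quarantine_dir):
    """Return True if the file is intact; otherwise move it to quarantine."""
    with open(path, "rb") as f:
        actual = hashlib.md5(f.read()).hexdigest()
    if actual == expected_md5:
        return True
    os.makedirs(quarantine_dir, exist_ok=True)
    shutil.move(path, os.path.join(quarantine_dir, os.path.basename(path)))
    return False


# Demo: write a file, audit it, then corrupt it and audit again.
workdir = tempfile.mkdtemp()
quarantine = os.path.join(workdir, "quarantined")
obj_path = os.path.join(workdir, "object.data")
with open(obj_path, "wb") as f:
    f.write(b"hello world")
good_md5 = hashlib.md5(b"hello world").hexdigest()

ok_before = audit_file(obj_path, good_md5, quarantine)

# Simulate bit rot by flipping one bit of the first byte.
with open(obj_path, "r+b") as f:
    first = f.read(1)
    f.seek(0)
    f.write(bytes([first[0] ^ 0x01]))

ok_after = audit_file(obj_path, good_md5, quarantine)
```

After the second audit the file is gone from its original location, which is what lets a replicator notice the missing copy and restore it from another replica.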

2. The Rings

2.1 Ring Builder

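2.2 Ring Data Structure
The ring data structure consists of three top-level fields: a list of the devices in the cluster; a list of lists of device ids, indicating the partition-to-device assignments; and an integer giving the number of bits by which to shift an MD5 hash to calculate the partition for that hash.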

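2.2.1 List of Devices
device   The on-disk name of the device on the server. For example: sdb1

Note: the list of devices may contain holes, or indexes set to None, for devices that have been removed from the cluster. Generally, device ids are not reused. Some devices may also be temporarily disabled by setting their weight to 0.0. To obtain a list of active devices (for uptime polling, for example), the Python code would look like:
devices = [device for device in self.devs if device and device['weight']]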
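The filter above can be tried on a small stand-alone list. The device dictionaries below are made up for illustration; only the `weight` key and the use of None for removed devices come from the description.

```python
# Hypothetical device list: list index == device id.
devs = [
    {"id": 0, "device": "sdb1", "weight": 100.0},
    None,                                         # hole left by a removed device
    {"id": 2, "device": "sdc1", "weight": 0.0},   # temporarily disabled
    {"id": 3, "device": "sdd1", "weight": 50.0},
]

# Same comprehension as in the text, with self.devs replaced by a local list.
active = [device for device in devs if device and device["weight"]]
active_ids = [device["id"] for device in active]
```

Both the hole (None) and the zero-weight device are falsy in the condition, so only devices 0 and 3 remain.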

2.2.2 Partition Assignment List
       To create a list of the device dictionaries assigned to a partition, the Python code would look like:
devices = [self.devs[part2dev_id[partition]] for part2dev_id in self._replica2part2dev_id]
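A small worked example may make the list-of-lists layout easier to follow. The sketch assumes three replicas and four partitions, uses plain Python lists, and invents the device dictionaries; none of these values come from a real ring.

```python
# Hypothetical devices; list index == device id.
devs = [
    {"id": 0, "device": "sdb1"},
    {"id": 1, "device": "sdc1"},
    {"id": 2, "device": "sdd1"},
]

# One inner list per replica; each inner list holds one device id per partition.
replica2part2dev_id = [
    [0, 1, 2, 0],   # replica 0
    [1, 2, 0, 1],   # replica 1
    [2, 0, 1, 2],   # replica 2
]


def devices_for_partition(partition):
    """Return the device dictionaries holding each replica of a partition."""
    return [devs[part2dev_id[partition]] for part2dev_id in replica2part2dev_id]


part1_devices = [d["device"] for d in devices_for_partition(1)]
```

For partition 1 the three replicas map to device ids 1, 2, and 0, so each replica lands on a different device.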
2.2.3 Partition Shift Value
partition = unpack_from('>I', md5(b'/account/container/object').digest())[0] >> self._part_shift
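The one-liner above can be run on its own as in the sketch below. The names `PART_POWER`, `PART_SHIFT`, and `get_partition` are illustrative, and the value 4 is an arbitrary choice for the demo. The sketch assumes the shift is 32 minus the partition power, so that only the top bits of the first four hash bytes are kept.

```python
from hashlib import md5
from struct import unpack_from

PART_POWER = 4                 # assumed for the demo: 2**4 = 16 partitions
PART_SHIFT = 32 - PART_POWER   # bits dropped from the 32-bit hash prefix


def get_partition(path, part_shift=PART_SHIFT):
    """Map an object path to a partition number."""
    digest = md5(path.encode("utf-8")).digest()
    # '>I' reads the first four bytes of the digest as a big-endian unsigned int.
    return unpack_from(">I", digest)[0] >> part_shift


partition = get_partition("/account/container/object")
```

Because the result depends only on the path and the shift value, every server holding the same ring computes the same partition without consulting any other server.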

2.3 Building the Ring



2.4 History

A “live ring” option was considered where each server could maintain its own copy of the ring and the servers would use a gossip protocol to communicate the changes they made. This was discarded as too complex and error prone to code correctly in the project time span available. One bug could easily gossip bad data out to the entire cluster and be difficult to recover from. Having an externally managed ring simplifies the process, allows full validation of data before it’s shipped out to the servers, and guarantees each server is using a ring from the same timeline. It also means that the servers themselves aren’t spending a lot of resources maintaining rings.

A couple of “ring server” options were considered. One was where all ring lookups would be done by calling a service on a separate server or set of servers, but this was discarded due to the latency involved. Another was much like the current process but where servers could submit change requests to the ring server to have a new ring built and shipped back out to the servers. This was discarded due to project time constraints and because ring changes are currently infrequent enough that manual control was sufficient. However, lack of quick automatic ring changes did mean that other parts of the system had to be coded to handle devices being unavailable for a period of hours until someone could manually update the ring.

The current ring process has each replica of a partition independently assigned to a device. A version of the ring that used a third of the memory was tried, where the first replica of a partition was directly assigned and the other two were determined by “walking” the ring until finding additional devices in other zones. This was discarded as control was lost as to how many replicas for a given partition moved at once. Keeping each replica independent allows for moving only one partition replica within a given time window (except due to device failures). Using the additional memory was deemed a good tradeoff for moving data around the cluster much less often.


Another ring design was tried where the partition to device assignments weren’t stored in a big list in memory but instead each device was assigned a set of hashes, or anchors. The partition would be determined from the data item’s hash and the nearest device anchors would determine where the replicas should be stored. However, to get a reasonable distribution of data each device had to have a lot of anchors, and the cost of walking through those anchors to find replicas started to add up. In the end, the memory savings weren’t that great and more processing power was used, so the idea was discarded.


A completely non-partitioned ring was also tried but discarded as the partitioning helps many other parts of the system, especially replication. Replication can be attempted and retried in a partition batch with the other replicas rather than each data item independently attempted and retried. Hashes of directory structures can be calculated and compared with other replicas to reduce directory walking and network traffic.


Partitioning and independently assigning partition replicas also allowed for the best balanced cluster. The best of the other strategies tended to give ±10% variance on device balance with devices of equal weight and ±15% with devices of varying weights. The current strategy allows us to get ±3% and ±8% respectively.


Various hashing algorithms were tried. SHA offers better security, but the ring doesn’t need to be cryptographically secure and SHA is slower. Murmur was much faster, but MD5 was built-in and hash computation is a small percentage of the overall request handling time. In all, once it was decided the servers wouldn’t be maintaining the rings themselves anyway and only doing hash lookups, MD5 was chosen for its general availability, good distribution, and adequate speed.


3. The Account Reaper

The Account Reaper removes data from deleted accounts in the background.


An account is marked for deletion by a reseller through the services server’s remove_storage_account XMLRPC call. This simply puts the value DELETED into the status column of the account_stat table in the account database (and replicas), indicating the data for the account should be deleted later. There is no set retention time and no undelete; it is assumed the reseller will implement such features and only call remove_storage_account once it is truly desired the account’s data be removed.


The account reaper runs on each account server and scans the server occasionally for account databases marked for deletion. It will only trigger on accounts that server is the primary node for, so that multiple account servers aren’t all trying to do the same work at the same time. Using multiple servers to delete one account might improve deletion speed, but requires coordination so they aren’t duplicating effort. Speed really isn’t as much of a concern with data deletion and large accounts aren’t deleted that often.


The deletion process for an account itself is pretty straightforward. For each container in the account, each object is deleted and then the container is deleted. Any deletion requests that fail won’t stop the overall process, but will cause the overall process to fail eventually (for example, if an object delete times out, the container won’t be able to be deleted later and therefore the account won’t be deleted either). The overall process continues even on a failure so that it doesn’t get hung up reclaiming cluster space because of one troublesome spot. The account reaper will keep trying to delete an account until it eventually becomes empty, at which point the database reclaim process within the db_replicator will eventually remove the database files.
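The control flow described above can be sketched as a loop over an in-memory model of the account. Everything here is hypothetical: the function `reap_account`, the dictionary model, and the callables that stand in for the delete requests are illustrative only and are not the reaper's real interface.

```python
def reap_account(account, delete_object, delete_container):
    """Try to delete everything in the account; return True once it is empty.

    `account` maps container name -> set of object names. The two callables
    return True on success and False on failure (for example, a timeout).
    """
    for container in list(account):
        objects = account[container]
        for obj in list(objects):
            if delete_object(container, obj):
                objects.discard(obj)
            # On failure, keep going; a later pass will retry this object.
        # A container can only be deleted once it holds no objects.
        if not objects and delete_container(container):
            del account[container]
    return not account


# Demo: one object delete fails on the first attempt only.
account = {"photos": {"a.jpg", "b.jpg"}, "docs": {"notes.txt"}}
failed_once = set()


def flaky_delete_object(container, obj):
    if (container, obj) == ("photos", "b.jpg") and obj not in failed_once:
        failed_once.add(obj)
        return False
    return True


def delete_container(container):
    return True


first_pass = reap_account(account, flaky_delete_object, delete_container)
remaining_after_first = {name: set(objs) for name, objs in account.items()}
second_pass = reap_account(account, flaky_delete_object, delete_container)
```

The first pass still reclaims everything it can despite the failure, and the second pass finishes the job, mirroring the "keep trying until the account is empty" behavior.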


3.1 History
At first, a simple approach of deleting an account through completely external calls was considered as it required no changes to the system. All data would simply be deleted in the same way the actual user would, through the public ReST API. However, the downside was that it would use proxy resources and log everything when it didn’t really need to. Also, it would likely need a dedicated server or two, just for issuing the delete requests.


A completely bottom-up approach was also considered, where the object and container servers would occasionally scan the data they held and check if the account was deleted, removing the data if so. The upside was the speed of reclamation with no impact on the proxies or logging, but the downside was that nearly 100% of the scanning would result in no action creating a lot of I/O load for no reason.


A more container server centric approach was also considered, where the account server would mark all the containers for deletion and the container servers would delete the objects in each container and then themselves. This has the benefit of still speedy reclamation for accounts with a lot of containers, but has the downside of a pretty big load spike. The process could be slowed down to alleviate the load spike possibility, but then the benefit of speedy reclamation is lost and what’s left is just a more complex process. Also, scanning all the containers for those marked for deletion when the majority wouldn’t be seemed wasteful. The db_replicator could do this work while performing its replication scan, but it would have to spawn and track deletion processes which seemed needlessly complex.


In the end, an account server centric approach seemed best, as described above.


4. The Auth System
4.1 TempAuth
The auth system for Swift is loosely based on the auth system from the existing Rackspace architecture – actually from a few existing auth systems – and is therefore a bit disjointed. The distilled points about it are:


The token can be passed into Swift using the X-Auth-Token or the X-Storage-Token header. Both have the same format: just a simple string representing the token. Some auth systems use UUID tokens, some an MD5 hash of something unique, some use “something else” but the salient point is that the token is a string which can be sent as-is back to the auth system for validation.


Swift will make calls to the auth system, giving the auth token to be validated. For a valid token, the auth system responds with an overall expiration in seconds from now. Swift will cache the token up to the expiration time.
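The validate-then-cache behavior can be sketched as below. The class `TokenCache`, the `validate` callable standing in for the auth system, and the injectable clock are all assumptions made for illustration; Swift's own caching is implemented differently.

```python
import time


class TokenCache:
    """Cache validated tokens until the expiration reported by the auth system."""

    def __init__(self, validate, clock=time.time):
        self._validate = validate   # callable: token -> seconds until expiry, or None
        self._clock = clock
        self._expires_at = {}       # token -> absolute expiry timestamp

    def is_valid(self, token):
        now = self._clock()
        expires_at = self._expires_at.get(token)
        if expires_at is not None and now < expires_at:
            return True             # served from the cache
        self._expires_at.pop(token, None)
        ttl = self._validate(token)  # ask the (hypothetical) auth system
        if ttl is None or ttl <= 0:
            return False
        self._expires_at[token] = now + ttl
        return True


# Demo with a fake auth system and a controllable clock.
calls = []


def fake_auth(token):
    calls.append(token)
    return 60 if token == "AUTH_tk123" else None


now = [1000.0]
cache = TokenCache(fake_auth, clock=lambda: now[0])

r1 = cache.is_valid("AUTH_tk123")   # cache miss: asks the auth system
r2 = cache.is_valid("AUTH_tk123")   # cache hit: no call made
now[0] += 61
r3 = cache.is_valid("AUTH_tk123")   # cached entry expired: asks again
r4 = cache.is_valid("bogus")        # rejected by the auth system
```

Storing the absolute expiry time rather than the relative one keeps the check to a single comparison on each request.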


The included TempAuth also has the concept of admin and non-admin users within an account. Admin users can do anything within the account. Non-admin users can only perform operations per container based on the container’s X-Container-Read and X-Container-Write ACLs. For more information on ACLs, see swift.common.middleware.acl.
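The admin versus non-admin rule can be illustrated with a deliberately simplified check. The sketch assumes an ACL is just a comma-separated list of user names, which is a simplification; the real ACL syntax is richer (see swift.common.middleware.acl). The function name `is_authorized` is hypothetical.

```python
def is_authorized(user, is_admin, method, container_headers):
    """Decide whether a request is allowed under the simplified rules."""
    if is_admin:
        return True   # admin users can do anything within the account
    # Reads are checked against the read ACL, everything else against the write ACL.
    header = ("X-Container-Read" if method in ("GET", "HEAD")
              else "X-Container-Write")
    acl = container_headers.get(header, "")
    allowed = {entry.strip() for entry in acl.split(",") if entry.strip()}
    return user in allowed


headers = {"X-Container-Read": "alice, bob", "X-Container-Write": "alice"}
bob_can_read = is_authorized("bob", False, "GET", headers)
bob_can_write = is_authorized("bob", False, "PUT", headers)
```

A container with no ACL headers grants nothing to non-admin users in this sketch.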

Additionally, if the auth system sets the request environ’s swift_owner key to True, the proxy will return additional header information in some requests, such as the X-Container-Sync-Key for a container GET or HEAD.

The user starts a session by sending a ReST request to the auth system to receive the auth token and a URL to the Swift system.


4.2 Extending Auth
TempAuth is written as WSGI middleware, so implementing your own auth is as easy as writing new WSGI middleware and plugging it into the proxy server. The Keystone project and the Swauth project are examples of additional auth services.
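To give a feel for what "writing new WSGI middleware" involves, here is a minimal sketch using only the plain WSGI calling convention. The class `SimpleTokenAuth`, the fixed token set, and the stand-in application are all hypothetical; a real auth middleware for Swift would validate tokens against an auth service and integrate with the proxy's own hooks.

```python
class SimpleTokenAuth:
    """Minimal WSGI middleware: reject requests lacking a known token."""

    def __init__(self, app, valid_tokens):
        self.app = app
        self.valid_tokens = set(valid_tokens)

    def __call__(self, environ, start_response):
        # WSGI exposes the X-Auth-Token / X-Storage-Token headers as HTTP_* keys.
        token = (environ.get("HTTP_X_AUTH_TOKEN")
                 or environ.get("HTTP_X_STORAGE_TOKEN"))
        if token not in self.valid_tokens:
            body = b"Unauthorized\n"
            start_response("401 Unauthorized", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the wrapped application (the proxy server in Swift)."""
    body = b"OK\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call(app, headers):
    """Invoke a WSGI app with extra environ keys; return (status, body)."""
    captured = {}

    def start_response(status, response_headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/v1/AUTH_test"}
    environ.update(headers)
    body = b"".join(app(environ, start_response))
    return captured["status"], body


app = SimpleTokenAuth(demo_app, valid_tokens={"AUTH_tk123"})
```

Because the middleware only wraps a callable, it can be stacked in front of any WSGI application without that application knowing about it.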

See also Auth Server and Middleware.


