The Real Cause of the Amazon S3 Outage: Focus on the Comments

Comment: Amazon has published its root-cause analysis of the S3 outage, "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region". On the surface, the cause was operator error: a command removed servers that should not have been removed, which forced other subsystems to restart and took a long time (the outage window) to recover from. Underneath, it is a restart-time and architecture problem that Amazon had neither tested nor anticipated: on restart, each group of servers spent far longer than expected on safety checks and data validation, on the scale of hours rather than seconds or minutes. Potentially more worrying still: the operation did not follow the N+2 operating principle.

We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th.

Comment: Many companies, in China and elsewhere, require a post-mortem after an incident is resolved. It first records the incident in detail, in chronological order: when the alert arrived and what it said, when diagnosis and mitigation began, what actions were taken and what effect each had, and finally how the system was repaired. It then analyzes the root cause, honestly and factually. Finally, it proposes recommendations and fixes to prevent similar incidents. The diagnosis and handling procedure is also written into manuals and playbooks, so that future on-call engineers have a ready-made script to act on.

The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process.

Comment: The benefit of manuals and playbooks is that operations can be carried out "mechanically, without stress", simply following the recipe. The downside is that they invite carelessness: you believe you followed the instructions, but you read the wrong line, mistyped a command, or passed the wrong argument, and a small slip becomes a disaster. Two mitigations help. One is pairing: two engineers work together, one operating and one checking, to catch mistakes. Better still, automate the playbook steps, turning them into a wizard or a script, to minimize human involvement.
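
As an illustration of scripting a playbook step (a sketch only, not AWS's actual tooling; the `decommission` helper, the fleet size, and the 10% policy limit are all hypothetical), a wrapper can validate its input and demand explicit confirmation before doing anything irreversible:

```python
import sys

MAX_REMOVAL_FRACTION = 0.10  # hypothetical policy: never remove >10% of a fleet at once

def decommission(server: str) -> None:
    # Stand-in for the real removal call (API request, config push, etc.).
    print(f"decommissioned {server}")

def guarded_removal(targets: list[str], fleet_size: int) -> None:
    """Validate the playbook step instead of trusting a hand-typed argument."""
    if not targets:
        sys.exit("refusing to run: empty target list")
    fraction = len(targets) / fleet_size
    if fraction > MAX_REMOVAL_FRACTION:
        sys.exit(f"refusing to remove {len(targets)}/{fleet_size} servers "
                 f"({fraction:.0%} exceeds the {MAX_REMOVAL_FRACTION:.0%} policy limit)")
    # Make the operator (ideally a second engineer) re-read what is about to happen.
    print("About to decommission:", ", ".join(targets))
    if input("Type 'yes' to proceed: ").strip() != "yes":
        sys.exit("aborted by operator")
    for server in targets:
        decommission(server)

if __name__ == "__main__":
    guarded_removal(["index-17", "index-18"], fleet_size=200)
```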

Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.

Comment: Even with a manual, a single operator can still make mistakes. And removals like this are hard to roll back, because changing a live system's service configuration tends to set off chained, real-time reactions.

One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests.

The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects.

Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.

Comment: A large removal forces a full system restart? What kind of design logic is that? Ordinarily, a system is designed never to truly delete anything online; records are merely marked as Disabled. This is especially true of database operations: DELETE is dangerous, so avoid it and mark rows Disabled instead, then delete the disabled records in one batch during offline maintenance.
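
To illustrate that mark-Disable pattern, here is a minimal, self-contained sketch using SQLite; the `servers` table and `disabled` column are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE servers (id INTEGER PRIMARY KEY, host TEXT, disabled INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO servers (host) VALUES (?)", [("idx-1",), ("idx-2",), ("idx-3",)])

# Online path: never DELETE; flip a flag so the change is instantly reversible.
conn.execute("UPDATE servers SET disabled = 1 WHERE host = ?", ("idx-2",))

# Serving path reads only enabled rows.
live = conn.execute("SELECT host FROM servers WHERE disabled = 0").fetchall()
print("serving:", live)  # [('idx-1',), ('idx-3',)]

# Offline maintenance window: purge rows that were disabled (and vetted).
conn.execute("DELETE FROM servers WHERE disabled = 1")
conn.commit()
```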

Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact.

We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes.

Comment: Hot-swapping is a basic requirement for any large, scalable system. A common approach is to manage the system's configuration with something like ZooKeeper, so that removing, replacing, or adding a server can happen dynamically, without interrupting service.
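
A hedged sketch of that pattern, assuming a reachable ZooKeeper ensemble and the `kazoo` client library (the paths and addresses are made up): each live server registers an ephemeral node, which disappears automatically when the server goes away, and consumers watch the member list to rebalance without any restart:

```python
from kazoo.client import KazooClient

# Assumes ZooKeeper is running locally and `pip install kazoo`.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each live server registers itself as an ephemeral sequential node; the node
# vanishes automatically if the server dies, so membership stays current.
zk.ensure_path("/services/index")
zk.create("/services/index/node-", value=b"10.0.0.5:8080",
          ephemeral=True, sequence=True)

# Routers/clients watch membership and rebalance on every change, so capacity
# can be removed, replaced, or added without interrupting service.
@zk.ChildrenWatch("/services/index")
def on_membership_change(children):
    print("current index servers:", children)
```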

While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.

Comment: A large company will typically divide all of its data centers into regions and then maintain them region by region on a schedule: power and network shutdowns, equipment checks, restarts, and so on. Systems are then designed and deployed under the N+2 principle: a service is required to be configured and serving in N+2 regions at once, so that one region can be under maintenance while two others still serve, and even if one of those two hits a production problem, at least one can still serve normally. Does AWS really have only one region? And never do regional maintenance?
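
A toy check of that arithmetic (the region names and the `required` count are invented): with N regions needed to carry the load, deploying in N+2 tolerates one planned maintenance plus one unplanned failure:

```python
def satisfies_n_plus_2(deployed_regions: list[str], required: int) -> bool:
    """N+2: survive one region in maintenance AND one unexpected regional failure."""
    worst_case_available = len(deployed_regions) - 2
    return worst_case_available >= required

regions = ["us-east-1", "us-west-2", "eu-west-1"]
print(satisfies_n_plus_2(regions, required=1))  # True: 3 - 2 >= 1
print(satisfies_n_plus_2(regions, required=2))  # False: a 4th region is needed
```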

S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.

Comment: This is well worth learning from, especially for system testing. If one server carries too much initialization work, too much initial data loading and validation, test it thoroughly and make sure the work completes within the intended budget (which should be seconds or minutes). If it cannot, the system may need restructuring, for example sharding, so that no single server carries too heavy a load.
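
A back-of-envelope way to test that budget (every number below is invented for illustration): divide the metadata one node must validate at startup by its validation throughput, and compare the result against the budget; if it lands in hours, shard further:

```python
def restart_seconds(records_per_node: float, validated_per_sec: float) -> float:
    """Rough time for one node to safety-check its share of metadata on restart."""
    return records_per_node / validated_per_sec

# Hypothetical: 10 billion objects over 100 index nodes, 10k validations/sec/node.
total_objects = 10_000_000_000
nodes = 100
rate = 10_000

t = restart_seconds(total_objects / nodes, rate)
print(f"{t:.0f}s = {t / 3600:.1f}h")               # 10000s = 2.8h: hour scale
t10 = restart_seconds(total_objects / (nodes * 10), rate)
print(f"with 10x more shards: {t10 / 60:.0f}min")  # ~17min: closer to budget
```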

The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally.

Comment: From 9:37AM to 12:26PM: the index subsystem's restart and recovery took long enough, and the bulk of that time went to safety checks and metadata validation. Most likely this had never been load-tested at anything near this scale. Every company running a public cloud or SaaS with many users should carefully review and test this scenario!

The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.

Comment: Another point worth noting: the time needed to process accumulated backlog. Many systems today are asynchronous and use something like a queue to hold pending tasks: the front end keeps accepting user requests without interruption, puts them on the queue, and waits for a callback; back-end workers pull tasks from the queue and notify the front end on completion. When such a system recovers, the time to work through the tasks piled up in the queue must be counted; only then do you have the total recovery time.
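
The total is simple queueing arithmetic, assuming a roughly steady arrival rate and a service rate that exceeds it (all figures below are invented):

```python
def drain_seconds(backlog: float, service_rate: float, arrival_rate: float) -> float:
    """Time to clear a backlog while new work keeps arriving (needs service > arrival)."""
    if service_rate <= arrival_rate:
        raise ValueError("backlog never drains: service rate must exceed arrival rate")
    return backlog / (service_rate - arrival_rate)

# Hypothetical: 3.6M queued tasks, workers complete 1500/s, users enqueue 1000/s.
hours = drain_seconds(3_600_000, 1500, 1000) / 3600
print(f"{hours:.1f} hours to full recovery")  # 2.0 hours on top of the restart itself
```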

We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks.

Comment: In automated operations, at every moment and at every step, build in safety checks!
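
As one hedged reading of the two safeguards Amazon describes (throttled removal plus a minimum-capacity floor), here is a minimal sketch; the floor, the throttle, and the function are all hypothetical, not AWS's actual tool:

```python
import time

MIN_CAPACITY = 150        # hypothetical floor the subsystem must never go below
REMOVALS_PER_MINUTE = 2   # hypothetical throttle on how fast capacity may drain

def remove_capacity(current_capacity: int, targets: list[str]) -> int:
    """Remove servers slowly, refusing any step that breaches the capacity floor."""
    for server in targets:
        if current_capacity - 1 < MIN_CAPACITY:
            print(f"halt: removing {server} would breach the {MIN_CAPACITY}-server floor")
            break
        print(f"removing {server}")
        current_capacity -= 1
        time.sleep(60 / REMOVALS_PER_MINUTE)  # spread removals out over time
    return current_capacity

remove_capacity(152, ["billing-3", "billing-7", "billing-9"])  # stops before the floor
```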

We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.

Comment: Safety checks and decomposition into cells are both good and necessary, but the problem will persist unless the same service is also spread across different data centers or regions, that is, the N+2 principle discussed above. The cost is higher, but for a few more 9s of SLA it is worth it.
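
To make the cell idea concrete (a sketch only; the hash choice and cell count are assumptions, not S3's actual design): route each object key to one of many small cells, so that a cell being restarted or misbehaving affects only its own slice of the keyspace:

```python
import hashlib

NUM_CELLS = 64  # hypothetical; more cells means a smaller blast radius per failure

def cell_for_key(key: str) -> int:
    """Stable key-to-cell mapping: restarting one cell leaves the other 63 serving."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CELLS

print(cell_for_key("bucket-a/object-1"))  # always the same cell for a given key
print(cell_for_key("bucket-b/object-9"))
```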

From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD.  We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.

Comment: Amazon's own SHD service also depended on S3, and Amazon knew to spread it across multiple regions to keep it available. But what about other customers? Don't they need N+2 too?

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.

Comment: There should be an N+2 design and implementation that is transparent to customers; otherwise, such problems cannot really be avoided.

