Posted on 2020-6-5 13:32:53
Dragon's "Radiation-Tolerant" Design
November 20, 2012
Last week, NASA revealed that SpaceX's first commercial resupply mission to the ISS experienced a number of anomalies in addition to the shutdown of a Falcon 9 first-stage engine, including the loss of one of three flight computers on the Dragon cargo vessel due to a suspected radiation hit. Over the weekend I spoke with John Muratore, SpaceX director of vehicle certification, who said the loss of the computer was a function of the radiation-tolerant system design on which Dragon relies, rather than hard-to-come-by "rad-hardened" parts that can be costly and difficult to upgrade.
AWST: So, NASA does not require SpaceX to use radiation-hardened computer systems on the Dragon?
John Muratore: No, as a matter of fact NASA doesn't require it on their own systems, either. I spent 30 years at NASA and in the Air Force doing this kind of work. My last job was chief engineer of the shuttle program at NASA, and before that as shuttle flight director. I managed flight programs and built the mission control center that we use there today.
On the space station, some areas are using rad-hardened parts and other parts use COTS parts. Most of the control of the space station occurs through laptop computers which are not radiation hardened.
The radiation environment is something people have known about for a long time. It's part of the natural environment, and it varies. It matters what kind of mission you're doing. With Dragon we're doing low-Earth orbit, short-duration missions and that drives a lot of the architecture.
So NASA didn't require radiation-hardened parts. It did, however, require us to do a hard analysis of the radiation environment, the effect of the environment on the Dragon systems and how we'd respond to that. We not only produced that analysis, but it was reviewed by an independent panel of experts. So NASA had very strong requirements for us to understand the environment and have planned out our responses to the environment, and we've done that.
Q: So, these flight computers on Dragon – there are three on board, and that's for redundancy?
A: There are actually six computers. They operate in pairs, so there are three computer units, each of which has two computers checking on each other. The reason we have three is when operating in proximity of ISS, we have to always have two computer strings voting on critical actions. We have three so we can tolerate a failure and still have two voting on each other. And that has nothing to do with radiation, that has to do with ensuring that we're safe when we're flying our vehicle in the proximity of the space station.
I went into the lab earlier today, and we have 18 different processing units with computers in them. We have three main computers, but 18 units that have a computer of some kind, and all of them are triple computers – everything is three processors. So we have like 54 processors on the spacecraft. It's a highly distributed design and very fault-tolerant and very robust.
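Editor's illustration: the pair-and-vote arrangement described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not SpaceX's flight software. The names `ComputerUnit`, `checked_output` and `vote` are hypothetical, and the rule that a disagreeing pair simply drops out of the vote is an assumption based on the description in the interview.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComputerUnit:
    """One flight-computer unit: two processors that check each other (hypothetical model)."""
    name: str
    output_a: int  # command computed by the first processor
    output_b: int  # command computed by the second processor

    def checked_output(self) -> Optional[int]:
        # If the two processors disagree, the unit cannot be trusted and
        # drops out of the vote (in the interview, this forces a reboot).
        return self.output_a if self.output_a == self.output_b else None


def vote(units: List[ComputerUnit], required: int = 2) -> Optional[int]:
    """Return the agreed command if at least `required` healthy units agree, else None."""
    healthy = [o for o in (u.checked_output() for u in units) if o is not None]
    if not healthy:
        return None
    value, count = Counter(healthy).most_common(1)[0]
    return value if count >= required else None


# One unit (C) has an internal disagreement; the other two still agree.
units = [
    ComputerUnit("A", 7, 7),
    ComputerUnit("B", 7, 7),
    ComputerUnit("C", 7, 5),
]
print(vote(units))

# Processor count quoted in the interview: 18 units with three processors each.
print(18 * 3)
```

With three units, one can fail its internal check and two remain to vote on each other, which matches the reason given for carrying three.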
Q: But there's nothing on the spacecraft in the way of radiation-hardened parts?
A: The parts aren't hardened, the design as a total system is hardened. What it is is each part does not go through the screening that is typical of radiation hardened parts. Now that doesn't mean that each part can't take the dose that a “rad-hardened” part can, because we've taken all of our designs and we've tested them extensively, we've had contracts with the [NASA] Jet Propulsion Lab (JPL) to consult with us, and they're the world's experts in it, and we've gone to the University of Indiana and tested all of our parts, and we test them until they fail. We keep bringing the environment up and up and up until they fail. But we test them as a total system, not each part at a time. We've tested lots of our parts to very, very high radiation environments. So we test them as a total system, and by that I mean a unit with three processors in it, we test the entire unit. We take the cover off and we hit it really, really hard with radiation, and we do that so we understand how the parts react in the radiation environment.
Q: So what happened in this situation where one computer on board Dragon had a suspected radiation hit and shut down?
A: Think of a computer as lots of white marbles that are arranged in a specific pattern on a table, and a black marble comes in and knocks one of the white marbles out of place. Now, the memories of our computers are constantly checking for that happening. So if we take a hit in our most dense part of our computer – the memory – the computer detects it and repairs it and there's no harm done. But our other circuits in the computer, places like where we're bringing information in and out of the processor, if we take a hit there it can cause basically a bit to flip from a zero to a one. And that instruction can be wrong, and that is where the two processors in a single computer element voting on each other can detect that, and it can force a reboot. And that's what happened, we rebooted the computer.
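Editor's illustration: the "detect it and repair it" behavior of the memory can be sketched with a textbook single-error-correcting code. The interview does not say which code Dragon's memory uses; the Hamming(7,4) code below is a stand-in chosen for brevity, and the function names are hypothetical.

```python
from typing import List, Tuple


def hamming_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_correct(code: List[int]) -> Tuple[List[int], int]:
    """Return (corrected codeword, syndrome). A syndrome of 0 means no error was seen;
    otherwise it is the 1-based position of the single flipped bit."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the damaged bit back
    return c, syndrome


def hamming_decode(code: List[int]) -> List[int]:
    """Extract the 4 data bits from a (corrected) codeword."""
    return [code[2], code[4], code[5], code[6]]


# The "black marble": flip one stored bit, then let the check repair it.
stored = hamming_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulated single-bit upset at position 5
repaired, position = hamming_correct(stored)
print(hamming_decode(repaired), position)
```

A code like this only repairs a single flipped bit per word. A flip in logic outside the protected memory is not covered by it, which is where the interview says the paired processors checking each other come in.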
Q: You rebooted the computer, but I understand it didn't re-sync, was that intentional?
A: Let's say you're working on something on your PC and you have Internet Explorer up and Word and a whole bunch of things and you take a glitch in the computer and it reboots and you lose all your work. What we do is when we re-sync, the two computers that are still running and have all the latest applications up, they load all that information in the memory so the three memories have all the same information. So when we rebooted, we had the option to re-sync. And we had practiced that on the ground lots. We do it all the time. Matter of fact when we normally bring the computers up we re-sync them. So we'd done this tons of times. But we needed to coordinate that and explain what we were doing to all the partners on the space station, and that just took time. And NASA said rather than distract everybody with going through a long technical explanation of why we do that and convincing everybody it's all ok, can you guys just fly away the way you are? And we were like, yeah. We met every requirement that NASA had, even with one computer down.
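Editor's illustration: the re-sync step, in which the two running computers bring the rebooted one back to the same memory contents, might look like the sketch below. It is an assumption-laden toy: state is modeled as a plain dictionary, the function name `resync` is hypothetical, and the check that the healthy peers agree before copying is our addition, not something stated in the interview.

```python
import copy
from typing import Dict, List


def resync(rebooted: Dict[str, object], healthy_peers: List[Dict[str, object]]) -> Dict[str, object]:
    """Overwrite the rebooted computer's state with the state shared by the healthy peers."""
    if not healthy_peers:
        raise ValueError("no healthy peer to re-sync from")
    reference = healthy_peers[0]
    # Assumed safeguard: only copy state that the healthy peers agree on.
    if any(peer != reference for peer in healthy_peers[1:]):
        raise ValueError("healthy peers disagree; refusing to re-sync")
    rebooted.clear()
    rebooted.update(copy.deepcopy(reference))  # deep copy so the memories stay independent
    return rebooted


# Two computers kept running; the third came back from reboot with empty state.
peer_1 = {"phase": "approach", "targets": [1, 2, 3]}
peer_2 = {"phase": "approach", "targets": [1, 2, 3]}
rebooted = {}

resync(rebooted, [peer_1, peer_2])
print(rebooted == peer_1 == peer_2)
```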
Q: So, is there going to be any corrective action in terms of modifications to Dragon for the next cargo resupply mission next year? NASA's ISS Program Manager Michael Suffredini has been quoted suggesting you may replace existing parts with “rad-hardened” parts.
A: I think he was just hypothesizing. The first time you do anything on the space station, you talk about it a lot. And then after you talk about it, the next time it happens it's just like the time before, and they say go ahead, no problem. On our output processors, we took some hits on the last mission [the Falcon 9/Dragon demo flight that delivered Dragon to ISS in June under NASA's Commercial Orbital Transportation Services (COTS) program]. And we had to spend a lot of time explaining to people what we were doing. It's an international consortium, it's a $100-billion program, it's a million pounds of hardware, and everybody's systems need to interact, and we need to explain that when we're going to do something. And when we're going to do something the first time, even though we've explained it in safety panels and safety reviews and flight procedures and flight-technique meetings and we had talked about it all before, the first time you actually come up to it, everybody just wants to talk about it again.
So we had similar radiation hits on the output units this time, and we called the flight director and he went “Yeah, go ahead, go reset.” So we reset the input/output units with about a five-minute discussion. It was no big deal. So I think that because of that, he's thinking we spent a lot of time talking about this, maybe you should consider some other kinds of parts. But I think it was just because it was the first time we went through it.
Q: Ok, is there any plan right now to make any changes in the flight computers for the next mission?
A: We might make some slight procedural or software changes so we can get through the re-synching faster. But that's all. We're still talking about that. There's no requirement to make any changes. We met every safety requirement that NASA put on us. Every piece of hardware that had any kind of hit recovered 100%, completely. So the design functioned exactly the way it was intended to function.
Q: Is it possible all three computer units could take a hit and go down at once?
A: So, remember the marbles. Now we've got three tables and the white marbles arranged on all three tables, and the black marble would have to go through so that it hit all three tables at once. And that would be hard to do. But even if it did, we normally power up the vehicle with the computers down. Matter of fact we run with the computers down all the time because each of the input/output units has its own three strings of computers in it. And we can command those directly, we can command them from the station, through the TDRS satellite, we can command them from our own ground station. There was no impact at all. And we would have just rebooted them and come up.
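Editor's illustration: one way to put numbers on "that would be hard to do" is to assume each unit is upset independently with some probability in a given time window. Both the independence assumption and the probability used below are hypothetical; the interview gives no rates. The function names are ours.

```python
def prob_all_down(p_unit: float, n_units: int = 3) -> float:
    """Probability that every unit is upset in the same window, assuming independence."""
    if not 0.0 <= p_unit <= 1.0:
        raise ValueError("p_unit must be a probability")
    return p_unit ** n_units


def prob_at_least_two_up(p_unit: float) -> float:
    """Probability that at least two of three units stay up, so a two-unit vote is still possible."""
    if not 0.0 <= p_unit <= 1.0:
        raise ValueError("p_unit must be a probability")
    up = 1.0 - p_unit
    return up ** 3 + 3 * p_unit * up ** 2


# Hypothetical per-window upset probability, chosen only for illustration.
p = 1e-3
print(prob_all_down(p))
print(prob_at_least_two_up(p))
```

Under these assumptions the chance of losing all three at once falls with the cube of the per-unit probability, which is the arithmetic behind the three-tables picture.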
Q: What's the downside to buying radiation-hardened hardware or software? Is it expensive, or just not widely available?
A: It's really not the expense that drives it. We're committed to having the best possible parts in all of our designs. So if it cost a lot and we needed it, we'd go get it. We were already required to have all this redundancy in the computers to meet all the different safety requirements. Then we started looking at what parts do we want to use and what is appropriate for this design. And what really is more important to us than the cost of the parts is the capability of the parts – how much power do they use, how much memory do they hold, how much do they process, and how physically big are they. That's the first thing.
The second thing is what tools they come with. We run the Linux operating system, we program everything in C++, and that enables us to tap into a huge pool of very talented people and find the absolute best people in the computer and software industry to work with us. If you go into the radiation hardened parts, they are very limited in terms of what languages you can work in, what support packages there are for them, who knows how to program in them. It really limits your ability to work with the parts. And the other thing it really does is they all take a little longer time to get and they're a little harder to come by.
I just walked around the factory this morning, just in the office area alone, and we have over 40 of the flight computers sitting on people's desks. And if they were hard-to-come-by items, we wouldn't have that many computers. We've got 54 in a Dragon – and they're all different kinds of computers, different kinds of processors. We've got computers in the Falcon 9, we've got three computers in one unit on each engine in the Falcon 9, so that's 30 computers right there. We have hundreds of flight computers of different capability levels, and we're in multiple generations of design.
The radiation parts tend not to have growth and upgrade paths. It's very hard to grow, if you decide you want a little more capability, a little faster, you're really limited – it's that part. And we're already in our third generation of flight computer at SpaceX. In the last two years we've worked through three generations, we've got people working on a fourth generation computer. So we are constantly looking at what's available in the marketplace, moving with the marketplace so we can use the best software tools, the best people, the best techniques and achieve the most modern, optimized, efficient design. That's why we don't want to go into these lines, and they are good pieces of equipment, lots of people use them. But they don't open up the kind of possibilities that we want to have.
A lot of other programs are one program. At SpaceX our goal is the most reliable, cost effective and safe access to space in the world, and our CEO [Elon Musk] is very clear: We're going to Mars. So building the computer for the Dragon isn't just about building the computer for the Dragon, it's about building a whole suite of tools, techniques, people and processes to then go to the next vehicle, and the next vehicle. And our equipment crosses lines. Falcon designs go into Dragon, we're currently retrofitting the Dragon design into the new Falcon, so our designs constantly keep evolving, and that's why we don't want to get into lines that have limited growth capacity.
Q: Did the space shuttle have rad-hardened computers?
A: They had rad-hardened design, not rad-hardened parts. I was one of the flight directors the first time we went to repair the Hubble Space Telescope, and they had the same kind of error-correcting memory approach that we have. And we just watched the errors counting up. I remember sitting on the console with my flight computer officer and we were just watching them crank up while we were up repairing the Hubble, and we were just going bang, bang, bang, taking errors and correcting them. So radiation-tolerant design vs. radiation-tolerant parts is very common and was used in shuttle.
Q: So you're not breaking a mold here.
A: We're taking it to an extent previously not done, but we're operating in a well known set of techniques and capabilities.