计算分布式网络中系统故障的概率
我正在尝试构建分布式文件系统中文件可用性的数学模型。我在 MathOverflow 上发布了这个问题,但这也可能被归类为 CS 问题,所以我也在这里尝试一下。
该系统的工作原理如下:节点在 r*b 远程节点存储文件(使用纠删码进行编码),其中 r 是复制因子,b 是整数常量。纠删码文件具有这样的属性:只要至少 b 个远程节点可用并返回文件的一部分,就可以恢复文件。
最简单的方法是假设所有远程节点彼此独立并且具有相同的可用性 p。通过这些假设,文件的可用性遵循二项式分布,即
不幸的是这些两个假设可能会引入不可忽略的错误,如本文所示: http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf。
克服所有节点具有相同可用性的假设的一种方法是计算可用/不可用节点的每种可能组合的概率,并取所有这些结果的总和(这就是他们在上面的论文中建议的,只是比我刚才描述的更正式)。您可以将此方法视为深度为 r*b 的二叉树,每个叶子都是可用/不可用节点的一种可能组合。文件的可用性与您到达休假时具有 >=b 个可用节点的概率相同。这种方法更正确,但计算成本为 。此外,它不涉及节点独立性的假设。
你们有什么好的近似的想法吗?它比二项式分布近似引入的误差更少,但计算成本比 ?
您可以假设每个节点的可用性数据是一组由(测量日期,节点测量,正在测量的节点,成功/失败位)
组成的元组。例如,利用这些数据,您可以计算节点之间的可用性与可用性方差的相关性。
I am trying to build a mathematical model of the availability of a file in a distributed file-system. I posted this question at MathOverflow but this might as well be classified as a CS-question so I give it a shot here as well.
The system works like this: a node stores a file (encoed using erasure codes) at r*b remotes nodes, where r is the replication-factor and b is an integer constant. Erasure-coded files have the property that the file can be restored iff at least b of the remote nodes are available and return its part of the file.
The simplest approach to this is to assume that all remote nodes are independent of each other and have the same availability p. With these assumptions the availability of a file follows the Binomial distribution, i.e.
Unfortunately these two assumptions can introduce a non-neligible error, as shown by this paper: http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf.
One way to overcome the assumption that all nodes have the same availability is to calculate the probability of each possible combination of availaible/non-available node and take the sum of all these outcomes (which is sort of what they suggest in the paper above, just more formally than what I just described). You can see this approach as a binary tree with depth r*b and each leave is one possible combination of available/not-available nodes. The file's availability is the same thing as the probablity that you arrive at a leave with >=b available nodes. This approach is more correct but has a computational cost of . Also, it doesn't deal with the assumption of node independence.
Do you guys have any ideas of a good approximation which introduces less error than the binomial distribution-aproximation but with better computational cost than ?
You can assume that the availability-data of each node is a set of tuples consisting of (measurement-date, node measuring, node being measured, succes/failure-bit)
. With this data you could for example calculate the correlation of the availability between the nodes and the availability variance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
假设每个节点都有一个恒定的、已知的和独立的可用性,我会想到分而治之的方法。
假设您有 N 个节点。
N/2
节点。[0,N/2]
) 发生故障的概率。[0,N]
) 发生故障的概率。步骤 2 可以递归完成,在顶层,您可以根据需要求和,以找出超过某个数字的次数。
我不知道这个的复杂性,但如果我必须猜测,我会说等于或低于
O(n^2 log n)
这个机制可以在终端案例上说明。假设我们有 5 个具有正常运行时间的节点 。我们可以将其分成段
A
,其中 和B
以及 。然后,我们可以处理这些以找到每个段的“N 个节点向上”次:对于 A:
对于 B :
将每个
a
与每个b
并适当求和。Assuming each node has a constant, known and independent availability rate, A divide and conquer approach come to mind.
Say you have
N
nodes.N/2
nodes.[0,N/2]
) are down.[0,N]
) of the full set is down.Step 2 can be done recursively and at the top level you can sum as need to find how often more than some number are down.
I don't know the complexity of this but if I has to guess, I'd say at or below
O(n^2 log n)
The mechanics of this can be illustrated on a terminal case. Say we have 5 nodes with up times . We can split this into segments
A
with andB
with . We then can process these to find the "N nodes up" times for each segment:For A:
For B:
The final results for this stage can be found by multiplying each of the
a
's with each of theb
's and summing appropriately.对于较大的
r
和b
,您可以使用称为蒙特卡罗积分的方法,请参见Wikipedia 上的 Monte Carlo 积分(和/或 SICP 第 3.1.2 章)来计算总和。对于较小的r
和b
以及显着不同的节点故障概率p[i]
,精确方法更为优越。 小和大的确切定义取决于几个因素,最好通过实验进行尝试。具体示例代码:这是一个非常基本的示例代码(Python),用于演示这样的过程如何工作:
该函数对应于二项式测试用例并运行
N
个测试,检查r*b
节点中的b
节点是否存活,失败概率为p
。一些实验将使您相信,您需要数千个样本范围内的N
值才能获得任何合理的结果,但原则上复杂度为O(N*r*b )
。结果的精度按sqrt(N)
缩放,即,要将精度提高两倍,您需要将N
提高四倍。对于足够大的r*b
,此方法显然更优越。近似的扩展:显然,您需要设计测试用例,以使其尊重系统的所有属性。您建议了几个扩展,其中一些可以轻松实现,而另一些则不能。让我给你一些建议:
1) 在不同但不相关的
p[i]
的情况下,上述代码的变化很小:在函数头中,你传递一个数组而不是一个单个浮点p
并将行if random.random()
替换为
2) 在相关
p[i 的情况下]
您需要有关系统的其他信息。我在评论中提到的情况可能是这样的网络:在这种情况下,
A
可能是“根节点”,节点D
的故障可能意味着节点F
、G
、H
和J
的概率为100%自动失效;而F
节点故障会自动导致G
、H
和J
等节点宕机。至少这是我在评论中提到的情况(这是一个合理的解释,因为您在原始问题中讨论了概率的树结构)。在这种情况下,您需要修改代码,使p
引用树结构,并且for j in ...
遍历树,跳过当前节点的较低分支一旦测试失败。当然,结果测试仍然是像以前一样alivenum >= b
。3)如果网络是无法用树结构表示的循环图,这种方法将会失败。在这种情况下,您需要首先创建死的或活的图节点,然后在图上运行路由算法来计算唯一的、可到达的节点的数量。这不会增加算法的时间复杂度,但显然会增加代码复杂度。
4) 时间依赖性并不重要,但如果您知道 mtbf/r(平均故障/维修次数),则可以进行修改,因为这可以为您提供任一树的概率
p
-结构或由指数和组成的不相关线性p[i]
。然后,您必须在不同时间运行 MC 过程,并获得p
的相应结果。5)如果您只有日志文件(如上一段所示),则需要对方法进行实质性修改,这超出了我在该板上所能做的范围。日志文件需要足够彻底,以便能够重建网络图的模型(以及
p
的图)以及p< 的所有节点的各个值。 /代码>。否则,准确性将不可靠。这些日志文件还需要比故障和修复的时间尺度长得多,这一假设在现实网络中可能不现实。
For large
r
andb
you can use a method called Monte-Carlo integration, see e.g. Monte Carlo integration on Wikipedia (and/or chapter 3.1.2 of SICP) to compute the sum. For smallr
andb
and significantly different node-failure probabilitiesp[i]
the exact method is superior. The exact definition of small and large will depend on a couple of factors and is best tried out experimentally.Specific sample code: This is a very basic sample code (in Python) to demonstrate how such a procedure could work:
The function corresponds to the binomial test case and runs
N
tests, checking ifb
nodes out ofr*b
nodes are alive with a probability of failure ofp
. A few experiments will convince you that you need values ofN
in the range of thousands of samples before you can get any reasonable results, but in principle the complexity isO(N*r*b)
. The accuracy of the result scales assqrt(N)
, i.e., to increase accuracy by a factor of two you need to increaseN
by a factor of four. For sufficiently larger*b
this method will be clearly superior.Extension of the approximation: You obviously need to design the test case such, that it respects all the properties of the system. You have suggested a couple of extensions, some of which can be easily implemented while others can not. Let me give you a couple of suggestions:
1) In the case of distinct but uncorrelated
p[i]
, the changes of the above code are minimal: In the function head you pass an array instead of a single floatp
and you replace the lineif random.random()<p: alivenum += 1
by2) In the case of correlated
p[i]
you need additional information about the system. The situation I was referring to in my comment could be a network like this:In this case
A
might be the "root node" and a failure of nodeD
could imply the automatic failure with 100% probability of nodesF
,G
,H
andJ
; while a failure of nodeF
would automatically bring downG
,H
andJ
etc. At least this was the case I was referring to in my comment (which is a plausible interpretation since you talk about a tree structure of probabilities in the original question). In such a situation you would need modify the code thatp
refers to a tree structure andfor j in ...
traverses the tree, skipping the lower branches from the current node as soon as a test fails. The resulting test is still whetheralivenum >= b
as before, of course.3) This approach will fail if the network is a cyclic graph that cannot be represented by a tree structure. In such a case you need to first create graph nodes that are either dead or alive and then run a routing algorithm on the graph to count the number of unique, reachable nodes. This won't increase the time-complexity of the algorithm, but obviously the code complexity.
4) Time dependence is a non-trivial, but possible modification if you know the m.t.b.f/r (mean-times-between-failures/repairs) since this can give you the probabilities
p
of either the tree-structure or the uncorrelated linearp[i]
by a sum of exponentials. You will then have to run the MC-procedure at different times with the corresponding results forp
.5) If you merely have the log files (as hinted in your last paragraph) it will require a substantial modification of the approach which is beyond what I can do on this board. The log-files would need to be sufficiently thorough to allow to reconstruct a model for the network graph (and thus the graph of
p
) as well as the individual values of all nodes ofp
. Otherwise, accuracy would be unreliable. These log-files would also need to be substantially longer than the time-scales of failures and repairs, an assumptions which may not be realistic in real-life networks.