Lexicographical ordering of multiple doubles
Consider a class with two members of type double:
class path_cost {
    double length;
    double time;
};
If I want to sort a list of path_costs lexicographically, I have a problem. Read on :)
If I use exact equality for the equality test, like so
bool operator<(const path_cost& rhs) const {
    if (length == rhs.length) return time < rhs.time;
    return length < rhs.length;
}
the resulting order is likely to be wrong, because a small deviation (e.g. due to numerical inaccuracies in the calculation of the length) may cause the length test to fail, so that e.g.
{ 231.00000000000001, 40 } < { 231.00000000000002, 10 }
erroneously holds.
If I instead use a tolerance, like so
// requires <cmath> for std::fabs
bool operator<(const path_cost& rhs) const {
    if (std::fabs(length - rhs.length) < 1e-6) return time < rhs.time;
    return length < rhs.length;
}
then the sorting algorithm may fail horribly, since operator< is no longer transitive (that is, if a < b and b < c, then a < c may not hold).
Any ideas? Solutions? I have thought about partitioning the real line, so that numbers within each partition are considered equal, but that still leaves too many cases where the equality test fails but should not.
(UPDATE by James Curran, hopefully explaining the problem):
Given the numbers:
- A = {231.0000012, 10}
- B = {231.0000005, 40}
- C = {231.0000001, 60}
- A.length and B.length differ by 7e-7, so we use time, and A < B.
- B.length and C.length differ by 4e-7, so we use time, and B < C.
- A.length and C.length differ by 1.1e-6, so we use length, and A > C.
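The cycle described above can be checked mechanically. The following is only a sketch: `tolerant_less` and `violates_transitivity` are hypothetical helper names, the struct is a simplified stand-in for the question's class with public members, and the lengths passed in are assumed to be spaced 7e-7, 4e-7 and 1.1e-6 apart (for instance 231.0000012, 231.0000005 and 231.0000001) with a tolerance of 1e-6.

```cpp
#include <cassert>
#include <cmath>

// Simplified stand-in for the question's class (public members for brevity).
struct path_cost {
    double length;
    double time;
};

// Tolerance-based comparison from the question, with the tolerance made explicit.
bool tolerant_less(const path_cost& a, const path_cost& b, double tol = 1e-6) {
    assert(tol >= 0.0);
    if (std::fabs(a.length - b.length) < tol) return a.time < b.time;
    return a.length < b.length;
}

// True if a < b and b < c hold but a < c does not, i.e. transitivity is broken.
bool violates_transitivity(const path_cost& a, const path_cost& b, const path_cost& c) {
    return tolerant_less(a, b) && tolerant_less(b, c) && !tolerant_less(a, c);
}
```

Calling `violates_transitivity` on three such values should report a violation, which is the situation a comparator handed to std::sort must never produce.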
(Update by Esben Mose Hansen)
This is not purely theoretical. The standard sort algorithms tend to crash or worse when given a non-transitive sort operator. And this is exactly what I have been contending with (and boy, was that fun to debug ;) )
Comments (5)
Do you really want just a compare function?
Why don't you sort by length first, then group the pairs into what you think are the same length and then sort within each group by time?
Once sorted by length, you can apply whatever heuristic you need, to determine 'equality' of lengths, to do the grouping.
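One way this suggestion might look in code is sketched below. The function name `grouped_sort`, the public struct, and the grouping heuristic (chain adjacent elements whose lengths differ by less than `tol`) are assumptions made for illustration; any other heuristic for "equal" lengths could be substituted in the inner loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the question's class (public members for brevity).
struct path_cost {
    double length;
    double time;
};

// Sort by length, chain neighbours closer than tol into groups,
// then sort each group by time. Both sorts use exact, transitive comparisons.
void grouped_sort(std::vector<path_cost>& v, double tol) {
    assert(tol >= 0.0);
    std::sort(v.begin(), v.end(),
              [](const path_cost& a, const path_cost& b) { return a.length < b.length; });
    std::size_t begin = 0;
    while (begin < v.size()) {
        // Extend the group while the next length is within tol of the previous one.
        std::size_t end = begin + 1;
        while (end < v.size() && v[end].length - v[end - 1].length < tol) ++end;
        std::sort(v.begin() + static_cast<std::ptrdiff_t>(begin),
                  v.begin() + static_cast<std::ptrdiff_t>(end),
                  [](const path_cost& a, const path_cost& b) { return a.time < b.time; });
        begin = end;
    }
}
```

Because neighbours are chained, a group can end up wider than `tol` from its first to its last element; whether that is acceptable depends on the data.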
I don't think you are going to be able to do what you want. Essentially you seem to be saying that in certain cases you want to ignore the fact that a > b and pretend a = b. I'm pretty sure you can construct a proof that if a and b are equivalent whenever their difference is smaller than a certain value, and equivalence is transitive, then a and b are equivalent for all values of a and b. Something along the lines of:
For a tolerance C and two numbers A and B where, without loss of generality, A > B, there exist
D(n) = B + n*(C/10)
where 0 <= n <= (10*(A-B))/C
such that, trivially, D(n) is within the tolerance of D(n-1) and D(n+1) and is therefore equivalent to them. Also, D(0) is B and D((10*(A-B))/C) = A, so by chaining these equivalences A and B can be said to be equivalent.
I think the only way you can solve that problem is using a partitioning method. Something like multiplying by 10^6 and then converting to an int should partition pretty well, but it means that if you have 1.00001*10^-6 and 0.999999*10^-6 they will come out in different partitions, which may not be desired.
The problem then becomes looking at your data to work out how best to partition it, which I can't help with since I don't know anything about your data. :)
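As a rough illustration of the partitioning idea, the sketch below scales each length by 10^6 and rounds down to an integer bucket, then compares (bucket, time) pairs. The names `bucket` and `bucketed_less` and the default scale are hypothetical, and the code assumes the scaled length fits in a long long.

```cpp
#include <cassert>
#include <cmath>
#include <tuple>

// Simplified stand-in for the question's class (public members for brevity).
struct path_cost {
    double length;
    double time;
};

// Maps a length to an integer bucket; lengths in the same bucket count as equal.
// Assumes length * scale is finite and fits in a long long.
long long bucket(double length, double scale = 1e6) {
    assert(scale > 0.0);
    return static_cast<long long>(std::floor(length * scale));
}

// Lexicographic comparison on (bucket, time). It is transitive because
// "same bucket" is a true equivalence relation, unlike "within tolerance".
bool bucketed_less(const path_cost& a, const path_cost& b) {
    return std::make_tuple(bucket(a.length), a.time) <
           std::make_tuple(bucket(b.length), b.time);
}
```

The drawback mentioned above is visible here: 1.00001*10^-6 and 0.999999*10^-6 fall on opposite sides of a bucket boundary and therefore compare as different lengths, even though they are very close.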
P.S. Do the algorithms actually crash whenever they are given such a comparison operator, or just when they encounter specific unsolvable cases?
I can think of two solutions.
You could carefully choose a sorting algorithm that does not fail when the comparisons are intransitive. For example, quicksort shouldn't fail, at least if you implement it yourself. (If you are worried about the worst case behavior of quicksort, you can first randomize the list, then sort it.)
Or you could extend your tolerance patch so that it becomes an equivalence relation and you restore transitivity. There are standard union-find algorithms to extend any relation to the equivalence relation it generates. After applying union-find, you can replace the length in each equivalence class with a consensus value (such as the average) and then do the sort that you wanted to do. It feels a bit strange to doctor floating point numbers to prevent spurious reordering, but it should work.
Actually, Moron makes a good point. Instead of union and find, you can sort by length first, then link together neighbors that are within tolerance, then do a subsort within each group on the second key. That has the same outcome as my second suggestion, but it is a simpler implementation.
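A minimal sketch of the consensus-value variant, using the simpler neighbour-linking described in the last paragraph instead of a full union-find: lengths within `tol` of their sorted neighbour are grouped, every length in a group is overwritten with the group average, and the result is sorted with an ordinary exact lexicographic comparison. The name `snap_and_sort` and the choice of the mean as consensus value are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Simplified stand-in for the question's class (public members for brevity).
struct path_cost {
    double length;
    double time;
};

// Replaces near-equal lengths by their group mean, then sorts lexicographically.
// Note that this modifies the stored lengths.
void snap_and_sort(std::vector<path_cost>& v, double tol) {
    assert(tol >= 0.0);
    std::sort(v.begin(), v.end(),
              [](const path_cost& a, const path_cost& b) { return a.length < b.length; });
    std::size_t begin = 0;
    while (begin < v.size()) {
        // Link neighbours that are within tol and accumulate their lengths.
        std::size_t end = begin + 1;
        double sum = v[begin].length;
        while (end < v.size() && v[end].length - v[end - 1].length < tol) {
            sum += v[end].length;
            ++end;
        }
        const double mean = sum / static_cast<double>(end - begin);
        for (std::size_t i = begin; i < end; ++i) v[i].length = mean;
        begin = end;
    }
    // Exact comparison is now safe: equal-by-tolerance lengths are bitwise equal.
    std::sort(v.begin(), v.end(), [](const path_cost& a, const path_cost& b) {
        return std::tie(a.length, a.time) < std::tie(b.length, b.time);
    });
}
```

After this pass, the exact-equality operator< from the question could be used unchanged on the data, since lengths that were meant to be equal now are equal.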
I'm not familiar with your application, but I'd be willing to bet that the differences in distance between points in your graph are many orders of magnitude larger than the rounding errors on floating point numbers. Therefore, if two entries differ by only the round-off error, they are essentially the same, and it makes no difference in which order they appear in your list. From a common-sense perspective, I see no reason to worry.
You will never get 100% precision with ordinary doubles. You say that you are afraid that using tolerances will affect the correctness of your program. Have you actually tested this? What level of precision does your program actually need?
In most common applications I find a tolerance of something like 1e-9 suffices. Of course it all depends on your application. You can estimate the level of accuracy you need and just set the tolerance to an acceptable value.
If even that fails, it means that double is simply inadequate for your purposes. This scenario is highly unlikely, but can arise if you need very high precision calculations. In that case you have to use an arbitrary precision package (e.g. BigDecimal in Java or something like GMP for C). Again, only choose this option when there is no other way.