为近似相似的数字生成相同的哈希码
我正在 C# 3.5 中创建一个应用程序,该应用程序使用 AutoCAD API 读取 2D AutoCAD 绘图,使用定义的业务逻辑对绘图进行更改,然后在 AutoCAD 中将其调整回来。由于逻辑的本质,绘图的形状必须重新构建 - 例如,矩形由 4 条连接直线组成。
我使用 AutoCAD 中每条线的起始和结束坐标创建这些形状,但某些坐标并不完全匹配。例如,一个点可能位于 0.69912839(在一个轴上),但从同一点开始的一条线可能是 0.69990821。这些以毫米为单位,因此距离是分钟(0.00078毫米!)
我创建了自己的类(称为 MyPoint,类似于 PointF),因为我需要向其添加一些额外的逻辑。在该类中,我创建了一个方法,该方法接受两个双精度值并返回 true 或 false,具体取决于两个点彼此的距离是否在 0.001mm 以内。然后我重写了 Equals 方法、 == 和 != 运算符,这样我就可以执行 (point1 == point2 或 point1.Equals(point2)) 来检查所有轴是否彼此在 0.001mm 以内 - 如果是,我将其归类为同一点。
这很好,而且工作得很好。现在,我需要检查这些点类的集合以消除所有重复项,因此我在集合上使用 LINQ 的 Distinct() 方法。但是,此方法使用 GetHashcode() 而不是 Equals() 来确定实例是否相等。因此,我重写了 GetHashcode(),它使用 double 类的 GetHashcode。
但是,上面的示例失败了,因为显然它们是不同的值,因此生成不同的哈希码。有没有办法让两个相差在 0.001 以内的数字可以生成相同的哈希码? (请注意,这些数字彼此并不了解,因为 GetHashcode 是在不同的类实例上单独调用的。)我尝试了多种方法,这些方法适用于某些示例,但不适用于其他示例。
一个例子是将数字截断为 3dp(乘以 10^3,然后截断它)并在结果上创建哈希码 - 这适用于上面的示例 (699 == 699。) 但这不适用于 0.69990821 和0.70000120(699!= 700。)我尝试过四舍五入,这适用于第二组数字(0.700 == 0.700),但不适用于第一组数字(0.699!= 0.700。)我什至尝试将数字截断为3dp然后将其调整到下一个偶数,这适用于前面的两个示例,但不适用于 12.9809 和 12.9818 (12980 != 12982。)
还有其他方法吗,或者我应该废弃 Equals、==、!= 和 GetHashcode覆盖,并创建我自己的 MyPoint.IsEqualTo() 和 MyPointCollection.Distinct() 方法?
I'm creating an application in C# 3.5 that uses the AutoCAD API to read a 2D AutoCAD drawing, make changes to the drawing using defined business logic, then adjust it back in AutoCAD. Due to the nature of the logic, the shape of the drawing has to be re-constructed - e.g. a rectangle is made up of 4 connecting straight lines.
I'm creating these shapes using the start and end co-ordinates of each line from AutoCAD, but some of the co-ordinates don't exactly match up. For example, one point could at 0.69912839 (on one axis), but a line starting from the same point could be 0.69990821. These are in mm so the distance is minute (0.00078mm!)
I've created my own class (call it MyPoint, similar to PointF) because I've needed to add some additional logic to it. In that class I've created a method that takes two doubles and returns true or false depending on if the two points are within 0.001mm of each other. I've then overridden the Equals method, == and != operators so I can do (point1 == point2 or point1.Equals(point2)) which checks if all axis are within 0.001mm of each other - if they are, I class it as being the same point.
That's fine and working brilliantly. Now, I need to check a collection of these point classes to get rid of all duplicates, so I'm using LINQ's Distinct() method on my collection. However this method uses GetHashcode(), not Equals() to determine if the instances are equal. So, I've overriden GetHashcode() which uses the GetHashcode of the double class.
But, the above example fails because obviously they're different values and therefore generate different hashcodes. Is there any way that two numbers that are within 0.001 of each other can generate the same hashcode? (Note the numbers don't know about each other as GetHashcode is called separately on different class instances.) I've tried numerous ways which work for some examples but not for others.
One example is truncating the number to 3dp (multiply it by 10^3, then truncate it) and creating the hashcode on the result - which works for the above example (699 == 699.) But this doesn't work for 0.69990821 and 0.70000120 (699 != 700.) I've tried rounding, which works for the second set of numbers (0.700 == 0.700) but not for the first (0.699 != 0.700.) I've even tried truncating the number to 3dp then adjusting it up to the next even number, which works for both the previous examples, but not for 12.9809 and 12.9818 (12980 != 12982.)
Is there another way, or should I scrap the Equals, ==, != and GetHashcode overrides, and create my own MyPoint.IsEqualTo() and MyPointCollection.Distinct() methods?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
它无法写入正确的哈希码。让我们证明一下:
我们有2分。
var a = point1.GetHashCode();
var b = point2.GetHashCode();
if a!= b 让我们在 point1 和 point2 之间创建点。等等。
经过这样的操作,我们将创建一条线,其中每个点都靠近其他点,并且它们的哈希码将相同。
因此 point1 和 point2 的哈希码应该相等。
所以像这样过度:
并实现你们平等。
It's unable to write correct hashcode. let proof it:
we have 2 points.
var a = point1.GetHashCode();
var b = point2.GetHashCode();
if a!= b let create point between point1 and point2. and so on.
After such operations we will create line, where each point is near to some other point and their hashcodes will be the same.
So hash code for point1 and point2 should be equals.
So overrtide like this:
and implement you equals.
我认为你不应该重写
Equals()
、==
、!=
或GetHashCode()
如果你重写对于其中任何一个,您都应该确保它们的语义不会改变。在你的例子中,他们这样做了。
例如,您的
==
不能传递,如果 P1 距 P2 0.001 毫米,P2 距 P3 0.001 毫米,P1 距 P3 0.002 毫米,则 P1 == P2,P2 == P3 且 P1 == P3,这不是你想要的。一般来说,你最终得到的所有点都等于所有其他点。我会坚持使用单独的方法来确定点是否足够接近。
编辑
通过覆盖
==
,您现在可以编写如下代码:I think you should not override
Equals()
,==
,!=
orGetHashCode()
If you override any of these you should make sure that their semantics do not change. In your example they do.
For instance your
==
cannot be transitive for it it is then if P1 is 0.001 mm from P2, P2 is 0.001 mm from P3 and P1 is 0.002 mm from P3 then P1 == P2, P2 == P3 and P1 == P3, which is not what you want. In general you end up with all points being equal to all other points.I would just stick to using a separate method for determining if points are close enough.
EDIT
With your override of
==
you can now write code like this:删除对 Distinct 方法的依赖会更容易。实现 System.Collections.IComparer (或通用等效项)并使用简单的集合(例如列表)。然后使用比较器确定该项目是否在列表中,如果已包含则不添加它。
It would be easier just to remove the dependency on the Distinct method. Implement a System.Collections.IComparer (or the generic equivalent) and use a simple collection like a list. Then determine if the item is in the list with the comparer, and not add it if it already is contained.
我猜想,如果您始终返回相同的散列(例如 0),LinQ 将尝试将所有元素与
equals
进行比较。毕竟,哈希对于证明两个元素不同而不是相等很有用。但无论如何,我建议您使用更适合该领域的结构和算法,例如二元分割分区 (BSP) 树。
I guess that if you return always the same hash (say 0), LinQ will try to compare all elements with
equals
. After all, a hash is useful to prove that two elements are distinct, not equal.But anyway, I would recommend you to use better suited structures and algorithms for this domain, like Binary Splitting Partition (BSP) trees (for example).
这应该是对 Steck 和 Paolo 所说内容的更清晰的解释。
假设您确实按照您想要的方式编写了 GetHashCode 方法。
然后,对于任何点
a
和b
,无论它们之间的距离如何,a.GetHashCode() == b.GetHashCode()
。证明:假设
a < b。将
a
和b
之间的距离分割成小于 0.001 的段。即a0 = a
、a1 = a0 + 0.0005
、a2 = a1 + 0.0005
等,直到到达b
代码>.然后a.GetHashCode() == a1.GetHashCode() == a2.GetHashCode() == ... == b.GetHashCode()。
This should be a clearer explanation of what Steck and Paolo said.
Suppose you did manage to write the GetHashCode method in the way you want.
Then for any points
a
andb
, whatever the distance between them,a.GetHashCode() == b.GetHashCode()
.Proof: suppose
a < b
. Split the distance betweena
andb
into segments smaller than 0.001. I.e.a0 = a
,a1 = a0 + 0.0005
,a2 = a1 + 0.0005,
etc. until you get tob
.Then
a.GetHashCode() == a1.GetHashCode() == a2.GetHashCode() == ... == b.GetHashCode()
.是的。
但它不一定是某种完全不同的数据结构。在检查重复项时,您需要检查相邻的哈希码,例如 (x+0.001,y)、(x,y-0.001) 等的哈希值。与通常的重复数据删除查找相比,这只会导致常数因子减慢,而且并不复杂,因此这可能是可行的方法。 (这是一个显而易见的观点,但我还没有看到它在这里明确提出。)
更新:为了澄清这一点,让我们看一下问题的一维版本。 “点”是单个数字 x。当abs(x1 - x2)
abs(x1 - x2)
时,我们认为x1和x2匹配。 .001
。现在我们想查明 x 是否与 {x_0, ..., x_n} 中的任何一个匹配。 x_i 保存在哈希表中,其中hash(x) = h(floor(1000*x))
代表一些传播事物的函数 h()。要查看 x 是否已在表中,我们计算hash(x-.001)
、hash(x)
和hash(x+.001)< /code>,然后我们测试 x 是否与三个存储桶中任何一个中的任何 x_i 匹配。任何匹配的 x_i 不能位于任何其他存储桶中。
在 2-d 变体中,有 9 个相邻的桶需要检查(算上中间的);在 3-d 中,27。
Yes.
It needn't necessarily be some completely different data structure, though. In checking for a duplicate, you need to check the neighboring hashcodes, e.g. the hashes for (x+0.001,y), (x,y-0.001), and so on. This makes for just a constant-factor slowdown compared to the usual deduplication lookup, and it's not complicated, so it might be the way to go. (An obvious point, but I don't see it made explicitly here yet.)
Update: To clarify, let's look at a 1-dimensional version of the problem. The 'points' are single numbers, x. We consider x1 and x2 to match when
abs(x1 - x2) < .001
. Now we want to find out if x matches any of {x_0, ..., x_n}. The x_i are kept in a hashtable, wherehash(x) = h(floor(1000*x))
for some function h() that spreads things around. To see if x is already in the table, we computehash(x-.001)
,hash(x)
, andhash(x+.001)
, then we test if x matches any of the x_i in any of the three buckets. Any matching x_i cannot be in any other bucket.In the 2-d variant, there are 9 neighboring buckets to check (counting the middle); in 3-d, 27.
这是一些代码来展示我在做什么。 “原始”中的每对数字应返回相同的值。
上面的方法是“截断到 3dp 然后调整到最接近的偶数”方法。接下来是 truncate 方法(替换开始/结束注释之间的代码):
这是 round 方法:
Here's some code to show what I'm doing. Each pair of numbers in "original" should return the same value.
That above method is the "truncate to 3dp then adjust to the nearest even" method. Next is the truncate method (replace code between the begin/end comments):
This is the round method: