Calculating context-sensitive text correlation
Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
5 Answers
A good baseline, probably an impractical one in terms of its relatively high computational cost and, more importantly, its production of many false positives, would be a generic string distance algorithm such as the Levenshtein edit distance.
Depending on the level of accuracy required (which, BTW, should be specified in terms of both recall and precision, i.e. stating whether it is more important to avoid missing a correlation or to avoid falsely identifying one), a home-grown process based on [some of] the following kinds of heuristics could do the trick: parsing each address into its components, normalizing abbreviations and spellings, and comparing the records field by field with prioritized rules.
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors over a tree/array-like structure into which the input is initially parsed (the Visitor design pattern).
The advantage of the rule-based framework is that each heuristic lives in its own function, and rules can be prioritized, i.e. some rules can be placed early in the chain, allowing the evaluation to be aborted early on the strength of a strong heuristic (e.g.: different city => correlation = 0, level of confidence = 95%, etc.).
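To make that concrete, here is a minimal Python sketch of such a prioritized rule chain; the parsed-address fields (city, number, street), the particular rules, and the confidence values are hypothetical stand-ins, not a prescribed design:

```python
# Each rule inspects two parsed addresses (plain dicts here) and returns
# either a definitive (correlation, confidence) verdict or None to defer
# to the next rule in the chain.

def rule_different_city(a, b):
    # Strong heuristic placed first: different city => correlation 0.
    if a["city"] and b["city"] and a["city"] != b["city"]:
        return 0.0, 0.95
    return None

def rule_same_number_and_street(a, b):
    if a["number"] == b["number"] and a["street"] == b["street"]:
        return 1.0, 0.90
    return None

RULES = [rule_different_city, rule_same_number_and_street]  # priority order

def evaluate(a: dict, b: dict, default=(0.5, 0.0)):
    """Walk the prioritized rule chain; the first definitive verdict aborts it."""
    for rule in RULES:
        verdict = rule(a, b)
        if verdict is not None:
            return verdict
    return default

a = {"city": "springfield", "street": "lawnmower drive", "number": "54"}
b = {"city": "springfield", "street": "lawnmower drive", "number": "54"}
print(evaluate(a, b))  # (1.0, 0.9)
```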
An important consideration when searching for correlations is the need, a priori, to compare every single item (here, an address) with every other item, hence requiring as many as n(n-1)/2, i.e. roughly ½n², item-level comparisons. Because of this, it may be useful to store the reference items in a pre-processed form (parsed, normalized...) and maybe also to keep a digest/sort key that can be used as a [very rough] indicator of a possible correlation (for example, a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
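As an illustration, here is a small Python sketch of such a digest, assuming each record is a dict with hypothetical zip and street fields; only records sharing a key get the expensive pairwise comparison:

```python
import itertools
from collections import defaultdict

# Letter -> Soundex digit (vowels and H/W are handled separately below).
SOUNDEX_CODES = {c: d for d, group in enumerate(
    ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1) for c in group}

def soundex(name: str) -> str:
    """Classic four-character Soundex code (first letter + three digits)."""
    word = "".join(ch for ch in name.upper() if ch.isalpha())
    if not word:
        return "0000"
    first, codes, prev = word[0], [], SOUNDEX_CODES.get(word[0])
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch)
        if code is not None and code != prev:
            codes.append(str(code))
        if ch not in "HW":  # H and W do not separate a run of equal codes
            prev = code
    return (first + "".join(codes) + "000")[:4]

def blocking_key(record: dict) -> str:
    """Rough digest: 5-digit ZIP plus Soundex of the 'primary' street name."""
    return record["zip"][:5] + soundex(record["street"])

def candidate_pairs(records):
    """Yield only pairs that share a blocking key, instead of all n(n-1)/2."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[blocking_key(rec)].append(rec)
    for bucket in buckets.values():
        yield from itertools.combinations(bucket, 2)
```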
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
It helps if your metric fulfils the following criteria:

- the distance between an object and itself is zero (reflexivity);
- the distance from a to b is the same in both directions (symmetry);
- the distance from a to c is no more than the distance from a to b plus the distance from b to c (the triangle inequality).
If your metric obeys these rules, you can arrange your objects in a metric space, which means you can run queries like: which of these objects is most like this one?
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced, but you could start with something simple like reducing the address line to the digits and the first letter of each word, and then comparing the results using a longest common subsequence algorithm.
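A rough Python sketch of that reduction, using difflib's SequenceMatcher ratio as a stand-in for a true longest-common-subsequence score:

```python
import re
from difflib import SequenceMatcher

def reduce_address(line: str) -> str:
    """Reduce an address line to its digit groups plus the first letter of
    each word, e.g. 'West Lawnmower Drive 54 A' -> 'wld54a'."""
    tokens = re.findall(r"[a-z]+|\d+", line.lower())
    return "".join(tok if tok.isdigit() else tok[0] for tok in tokens)

def similarity(a: str, b: str) -> float:
    """Compare the reduced forms; SequenceMatcher's ratio is used here as a
    stand-in for a longest-common-subsequence score."""
    return SequenceMatcher(None, reduce_address(a), reduce_address(b)).ratio()

print(similarity("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))     # ~0.92
print(similarity("West Lawnmower Drive 54 A", "East Lawnmower Drive 54 A"))  # ~0.83
```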
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
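For illustration, here is a compact Python sketch of both pieces: a standard dynamic-programming Levenshtein distance, and a BK-tree whose search uses the triangle inequality to prune whole subtrees:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over a discrete metric (here edit distance)."""

    def __init__(self, metric=levenshtein):
        self.metric = metric
        self.root = None  # node = (word, {distance: child node})

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.metric(word, node[0])
            if d == 0:
                return  # duplicate word, nothing to do
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word: str, max_dist: int):
        """Return (distance, word) pairs within max_dist; the triangle
        inequality lets us skip subtrees whose edge distance falls outside
        [d - max_dist, d + max_dist]."""
        results, stack = [], ([self.root] if self.root else [])
        while stack:
            node = stack.pop()
            d = self.metric(word, node[0])
            if d <= max_dist:
                results.append((d, node[0]))
            stack.extend(child for dist, child in node[1].items()
                         if d - max_dist <= dist <= d + max_dist)
        return results

tree = BKTree()
for word in ["lawnmower", "lanmower", "lawn", "mower"]:
    tree.add(word)
print(tree.search("lawnmover", 2))  # finds 'lawnmower' (1) and 'lanmower' (2)
```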
Disclaimer: I don't know of any algorithm that does this, but I would really be interested to know if one exists. This answer is a naive attempt at solving the problem with no prior knowledge whatsoever. Comments welcome, please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings: lowercase them, remove punctuation, and maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
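A minimal Python sketch of that normalization step; the abbreviation table here is a tiny hypothetical sample, a real one would be much larger:

```python
import string

# Hypothetical abbreviation table; extend with whatever your data needs.
ABBREVIATIONS = {"w": "west", "e": "east", "dr": "drive", "st": "street"}

def normalize(address: str) -> list[str]:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    address = address.lower().translate(str.maketrans("", "", string.punctuation))
    return [ABBREVIATIONS.get(tok, tok) for tok in address.split()]

print(normalize("W. Lawn Mower Dr. 54A"))
# ['west', 'lawn', 'mower', 'drive', '54a']
```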
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., so diff(a, b) = |1 - 2| = 1): that way, even if some letters are different, the overall correlation can still be high. Then, simply keep the maximal correlation you found across alignments, and decide that the records are the same if it is above a given threshold.
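A naive Python sketch of that idea; here a lower mean letter difference stands for a higher correlation, and ord() is used instead of a = 1, b = 2, which only shifts every value by a constant:

```python
def alignment_score(a: str, b: str) -> float:
    """Slide b across a and return the lowest mean absolute letter
    difference over any sufficiently long overlap (lower = more alike)."""
    best = float("inf")
    min_overlap = (min(len(a), len(b)) + 1) // 2  # ignore tiny overlaps
    for shift in range(-(len(b) - 1), len(a)):
        lo, hi = max(shift, 0), min(len(a), shift + len(b))
        if hi - lo < min_overlap:
            continue
        diffs = [abs(ord(a[i]) - ord(b[i - shift])) for i in range(lo, hi)]
        best = min(best, sum(diffs) / len(diffs))
    return best

print(alignment_score("lawnmower", "lawnmover"))      # ~0.11: one letter off
print(alignment_score("lawnmower", "eastlawnmower"))  # 0.0 at the right shift
```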
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I could.
The USPS had some tests to measure the quality of address standardization programs. I don't remember anything about how that worked, but you might check whether they still do it -- maybe you can get some good training data.