Calculating context-sensitive text correlation
Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
5 Answers
A good baseline, probably an impractical one in terms of its relatively high computational cost and, more importantly, its production of many false positives, would be a generic string distance algorithm such as the Levenshtein edit distance.
Depending on the level of accuracy required (which, BTW, should be specified in terms of both recall and precision, i.e. stating whether it is more important to avoid missing a correlation or to avoid falsely identifying one), a home-grown process based on [some of] the following kinds of heuristics could do the trick: parsing each address into its components, normalizing abbreviations and spellings, and comparing the records field by field with prioritized rules.
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors over a tree/array-like structure into which the input is initially parsed (the Visitor design pattern).
The advantage of the rule-based framework is that each heuristic lives in its own function, and rules can be prioritized, i.e. some rules can be placed early in the chain, allowing the evaluation to be aborted early on the strength of a strong heuristic (e.g.: different city => correlation = 0, level of confidence = 95%, etc.).
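To make that concrete, here is a minimal Python sketch of such a prioritized rule chain; the parsed-address fields (city, number, street), the particular rules, and the confidence values are hypothetical stand-ins, not a prescribed design:

```python
# Each rule inspects two parsed addresses (plain dicts here) and returns
# either a definitive (correlation, confidence) verdict or None to defer
# to the next rule in the chain.

def rule_different_city(a, b):
    # Strong heuristic placed first: different city => correlation 0.
    if a["city"] and b["city"] and a["city"] != b["city"]:
        return 0.0, 0.95
    return None

def rule_same_number_and_street(a, b):
    if a["number"] == b["number"] and a["street"] == b["street"]:
        return 1.0, 0.90
    return None

RULES = [rule_different_city, rule_same_number_and_street]  # priority order

def evaluate(a: dict, b: dict, default=(0.5, 0.0)):
    """Walk the prioritized rule chain; the first definitive verdict aborts it."""
    for rule in RULES:
        verdict = rule(a, b)
        if verdict is not None:
            return verdict
    return default

a = {"city": "springfield", "street": "lawnmower drive", "number": "54"}
b = {"city": "springfield", "street": "lawnmower drive", "number": "54"}
print(evaluate(a, b))  # (1.0, 0.9)
```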
An important consideration when searching for correlations is the need, a priori, to compare every single item (here, an address) with every other item, hence requiring as many as n(n-1)/2, i.e. roughly ½n², item-level comparisons. Because of this, it may be useful to store the reference items in a pre-processed form (parsed, normalized...) and maybe also to keep a digest/sort key that can be used as a [very rough] indicator of a possible correlation (for example, a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
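As an illustration, here is a small Python sketch of such a digest, assuming each record is a dict with hypothetical zip and street fields; only records sharing a key get the expensive pairwise comparison:

```python
import itertools
from collections import defaultdict

# Letter -> Soundex digit (vowels and H/W are handled separately below).
SOUNDEX_CODES = {c: d for d, group in enumerate(
    ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1) for c in group}

def soundex(name: str) -> str:
    """Classic four-character Soundex code (first letter + three digits)."""
    word = "".join(ch for ch in name.upper() if ch.isalpha())
    if not word:
        return "0000"
    first, codes, prev = word[0], [], SOUNDEX_CODES.get(word[0])
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch)
        if code is not None and code != prev:
            codes.append(str(code))
        if ch not in "HW":  # H and W do not separate a run of equal codes
            prev = code
    return (first + "".join(codes) + "000")[:4]

def blocking_key(record: dict) -> str:
    """Rough digest: 5-digit ZIP plus Soundex of the 'primary' street name."""
    return record["zip"][:5] + soundex(record["street"])

def candidate_pairs(records):
    """Yield only pairs that share a blocking key, instead of all n(n-1)/2."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[blocking_key(rec)].append(rec)
    for bucket in buckets.values():
        yield from itertools.combinations(bucket, 2)
```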
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
It helps if your metric fulfils the following criteria:

- the distance between an object and itself is zero (reflexivity);
- the distance from a to b is the same in both directions (symmetry);
- the distance from a to c is no more than the distance from a to b plus the distance from b to c (the triangle inequality).
If your metric obeys these rules, you can arrange your objects in a metric space, which means you can run queries like: which of these objects is most like this one?
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced, but you could start with something simple like reducing the address line to the digits and the first letter of each word, and then comparing the results using a longest common subsequence algorithm.
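A rough Python sketch of that reduction, using difflib's SequenceMatcher ratio as a stand-in for a true longest-common-subsequence score:

```python
import re
from difflib import SequenceMatcher

def reduce_address(line: str) -> str:
    """Reduce an address line to its digit groups plus the first letter of
    each word, e.g. 'West Lawnmower Drive 54 A' -> 'wld54a'."""
    tokens = re.findall(r"[a-z]+|\d+", line.lower())
    return "".join(tok if tok.isdigit() else tok[0] for tok in tokens)

def similarity(a: str, b: str) -> float:
    """Compare the reduced forms; SequenceMatcher's ratio is used here as a
    stand-in for a longest-common-subsequence score."""
    return SequenceMatcher(None, reduce_address(a), reduce_address(b)).ratio()

print(similarity("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))     # ~0.92
print(similarity("West Lawnmower Drive 54 A", "East Lawnmower Drive 54 A"))  # ~0.83
```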
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
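For illustration, here is a compact Python sketch of both pieces: a standard dynamic-programming Levenshtein distance, and a BK-tree whose search uses the triangle inequality to prune whole subtrees:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over a discrete metric (here edit distance)."""

    def __init__(self, metric=levenshtein):
        self.metric = metric
        self.root = None  # node = (word, {distance: child node})

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.metric(word, node[0])
            if d == 0:
                return  # duplicate word, nothing to do
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word: str, max_dist: int):
        """Return (distance, word) pairs within max_dist; the triangle
        inequality lets us skip subtrees whose edge distance falls outside
        [d - max_dist, d + max_dist]."""
        results, stack = [], ([self.root] if self.root else [])
        while stack:
            node = stack.pop()
            d = self.metric(word, node[0])
            if d <= max_dist:
                results.append((d, node[0]))
            stack.extend(child for dist, child in node[1].items()
                         if d - max_dist <= dist <= d + max_dist)
        return results

tree = BKTree()
for word in ["lawnmower", "lanmower", "lawn", "mower"]:
    tree.add(word)
print(tree.search("lawnmover", 2))  # finds 'lawnmower' (1) and 'lanmower' (2)
```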
Disclaimer: I don't know of any algorithm that does this, but I would really be interested to know if one exists. This answer is a naive attempt at solving the problem with no prior knowledge whatsoever. Comments welcome, please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings: lowercase them, remove punctuation, and maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
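A minimal Python sketch of that normalization step; the abbreviation table here is a tiny hypothetical sample, a real one would be much larger:

```python
import string

# Hypothetical abbreviation table; extend with whatever your data needs.
ABBREVIATIONS = {"w": "west", "e": "east", "dr": "drive", "st": "street"}

def normalize(address: str) -> list[str]:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    address = address.lower().translate(str.maketrans("", "", string.punctuation))
    return [ABBREVIATIONS.get(tok, tok) for tok in address.split()]

print(normalize("W. Lawn Mower Dr. 54A"))
# ['west', 'lawn', 'mower', 'drive', '54a']
```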
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., so diff(a, b) = |1 - 2| = 1): that way, even if some letters are different, the overall correlation can still be high. Then, simply keep the maximal correlation you found across alignments, and decide that the records are the same if it is above a given threshold.
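A naive Python sketch of that idea; here a lower mean letter difference stands for a higher correlation, and ord() is used instead of a = 1, b = 2, which only shifts every value by a constant:

```python
def alignment_score(a: str, b: str) -> float:
    """Slide b across a and return the lowest mean absolute letter
    difference over any sufficiently long overlap (lower = more alike)."""
    best = float("inf")
    min_overlap = (min(len(a), len(b)) + 1) // 2  # ignore tiny overlaps
    for shift in range(-(len(b) - 1), len(a)):
        lo, hi = max(shift, 0), min(len(a), shift + len(b))
        if hi - lo < min_overlap:
            continue
        diffs = [abs(ord(a[i]) - ord(b[i - shift])) for i in range(lo, hi)]
        best = min(best, sum(diffs) / len(diffs))
    return best

print(alignment_score("lawnmower", "lawnmover"))      # ~0.11: one letter off
print(alignment_score("lawnmower", "eastlawnmower"))  # 0.0 at the right shift
```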
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I could.
The USPS had some tests to measure the quality of address standardization programs. I don't remember anything about how that worked, but you might check whether they still do it -- maybe you can get some good training data.