Strategies for finding duplicate mailing addresses

Posted on 2024-08-03 20:45:46

I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'

addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'

I'm planning on applying some string transformations to abbreviate long words (e.g. NORTH -> N) and to remove all spaces, commas, dashes and pound symbols. Given this output, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide some simple Python code for this?

addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'

addr_3 = '570348THAV'
addr_4 = '570348AV'

Thankful,

Eduardo

Comments (6)

讽刺将军 2024-08-10 20:45:46

First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):

adr = " ".join(adr.lower().split())

Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":

import re
adr = re.sub(r"1st(\b|$)", r'1', adr)
adr = re.sub(r"([2-9])\s?nd(\b|$)", r'\1', adr)

Note that the second sub() also matches when there is a space between the "2" and the "nd", but I didn't set up the first one to do that, because I'm not sure how you can tell the difference between "41 St Ave" and "41 St" (the second one being "41 Street" abbreviated).
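
The two substitutions above only cover "1st" and "2nd" through "9nd". A minimal sketch that generalizes the idea to any ordinal suffix is shown here. The helper name `strip_ordinals` is hypothetical, and the pattern deliberately does not allow a space before the suffix, so "41 St" is left alone for the reason given above.

```python
import re

# Digits immediately followed by an ordinal suffix, e.g. "41st", "42nd", "3rd", "48th".
# No whitespace is allowed before the suffix, so "41 st" (41 Street) is not touched.
_ORDINAL = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b", re.IGNORECASE)

def strip_ordinals(adr: str) -> str:
    """Return adr with ordinal suffixes removed from numbers (hypothetical helper)."""
    return _ORDINAL.sub(r"\1", adr)

print(strip_ordinals("5703 - 48th ave"))  # 5703 - 48 ave
print(strip_ordinals("41 st ave"))        # 41 st ave (unchanged)
```

The pattern does not check that the suffix is the grammatically right one, so an input such as "11st" is also reduced to "11". For deduplication that is probably acceptable.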

Be sure to read all the help for the re module; it's powerful but cryptic.

Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:

http://en.wikipedia.org/wiki/Soundex

http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html

adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]

Then you can work with the list or join it back to a string as you think best.
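
The list comprehension above assumes a `soundex` function is already available. In case it helps, here is a minimal sketch of the standard American Soundex rules in pure Python. It is not the implementation from the linked page, and `encode_address` is a hypothetical wrapper around the comprehension.

```python
# Letter groups and their Soundex digits; vowels, y, h and w have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word: str) -> str:
    """Return the 4-character American Soundex code, or '' if word has no ASCII letters."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]

def encode_address(adr: str) -> list:
    """Keep numeric words as they are, Soundex-encode everything else."""
    return [word if word.isdigit() else soundex(word) for word in adr.split()]

print(encode_address("3 fairmont link s"))  # ['3', 'F655', 'L520', 'S000']
```

With this encoding "fairmont" and "fairmount" both become F655, which is the kind of misspelling tolerance Soundex is meant to give.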

The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.

Good luck.

兮颜 2024-08-10 20:45:46

Removing spaces, commas and dashes makes addresses ambiguous. It is better to replace them with a single space.

Take, for example, this address:

56 5th avenue

and this one:

5, 65th avenue

With your method, both of them become:

565THAV
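
To make the collision concrete, here is a small sketch. The function names `squash` and `spaced` are hypothetical, and the AVENUE -> AV abbreviation is assumed from the examples in the question. `squash` mimics the transformation proposed in the question, while `spaced` replaces separators with a single space instead.

```python
import re

def squash(adr: str) -> str:
    """Abbreviate, then delete every non-alphanumeric character (the question's plan)."""
    adr = adr.upper().replace("AVENUE", "AV")
    return re.sub(r"[^A-Z0-9]", "", adr)

def spaced(adr: str) -> str:
    """Abbreviate, then turn every run of separators into one space."""
    adr = adr.upper().replace("AVENUE", "AV")
    return " ".join(re.sub(r"[^A-Z0-9]+", " ", adr).split())

a, b = "56 5th avenue", "5, 65th avenue"
print(squash(a), squash(b))            # 565THAV 565THAV
print(spaced(a), "|", spaced(b))       # 56 5TH AV | 5 65TH AV
```

Keeping the word boundaries is what lets the two different addresses stay different.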

What you can do is write a good address shortening algorithm and then use string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of a single digit can mean a completely different address.

The algorithm can go like this:

  1. Replace all commas and dashes with spaces. Use the translate method for that.
  2. Build a dictionary of words and their abbreviated forms.
  3. Remove the TH part if it follows a number.
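
The three steps above could be sketched roughly as follows. The abbreviation table is a small hypothetical sample rather than a complete list, and handling "#", "." and the ST/ND/RD suffixes goes slightly beyond the steps as written, so treat those parts as assumptions.

```python
import re

# Hypothetical sample table; a real one would be much longer.
ABBREVIATIONS = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "AVENUE": "AV", "AVE": "AV", "STREET": "ST",
}

# Step 1: punctuation that should become a space (the "#" and "." are assumptions).
_PUNCT_TO_SPACE = str.maketrans({",": " ", "-": " ", "#": " ", ".": " "})

# Step 3: a whole word made of digits plus an ordinal suffix.
_ORDINAL_WORD = re.compile(r"^(\d+)(?:ST|ND|RD|TH)$")

def normalize(adr: str) -> str:
    """Return a canonical form of adr that can be compared with ==."""
    words = adr.upper().translate(_PUNCT_TO_SPACE).split()
    out = []
    for word in words:
        word = _ORDINAL_WORD.sub(r"\1", word)      # 48TH -> 48
        out.append(ABBREVIATIONS.get(word, word))  # step 2: AVENUE -> AV
    return " ".join(out)

print(normalize("# 3 FAIRMONT LINK SOUTH"))  # 3 FAIRMONT LINK S
print(normalize("5703 - 48TH AVE"))          # 5703 48 AV
print(normalize("5703- 48 AVENUE"))          # 5703 48 AV
```

With this form the four addresses from the question collapse into two distinct strings, while "56 5th avenue" and "5, 65th avenue" stay different.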

谁把谁当真 2024-08-10 20:45:46

This should be helpful in building your dictionary of abbreviations:

https://pe.usps.com/text/pub28/28apc_002.htm

浅笑依然 2024-08-10 20:45:46

I regularly inspect addresses for duplication where I work, and I have to say, I find Soundex highly unsuitable. It's both too slow and too eager to match things. I have similar issues with Levenshtein distance.

What has worked best for me is to sanitize and tokenize the addresses (get rid of punctuation, split things up into words) and then just see how many tokens match up. Because addresses typically have several tokens, you can develop a level of confidence in terms of a combination of (1) how many tokens were matched, (2) how many numeric tokens were matched, and (3) how many tokens are available. For example, if all tokens in the shorter address are in the longer address, the confidence of a match is pretty high. Likewise, if you match 5 tokens including at least one that's numeric, even if the addresses each have 8, that's still a high-confidence match.
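
A rough sketch of that token-matching idea follows. The function names and the "medium" threshold are assumptions made for illustration. Only the two "high" rules come from the description above.

```python
import re

def tokenize(adr: str) -> set:
    """Upper-case, drop punctuation, and split into a set of word tokens."""
    return set(re.sub(r"[^A-Z0-9]+", " ", adr.upper()).split())

def match_confidence(a: str, b: str) -> str:
    """Return 'high', 'medium' or 'low' for a pair of addresses (hypothetical scale)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return "low"
    shared = ta & tb
    has_numeric = any(tok.isdigit() for tok in shared)
    # Every token of the shorter address appears in the longer one.
    if ta <= tb or tb <= ta:
        return "high"
    # Five or more shared tokens, at least one of them numeric.
    if len(shared) >= 5 and has_numeric:
        return "high"
    # Assumed middle tier: half of the shorter address matches, including a number.
    if has_numeric and 2 * len(shared) >= min(len(ta), len(tb)):
        return "medium"
    return "low"

print(match_confidence("3 FAIRMONT LINK S", "# 3 FAIRMONT LINK S, UNIT 2"))  # high
print(match_confidence("5703 48 AV", "5703 50 ST"))                           # low
```

Using sets ignores token order and repeated tokens. That keeps the sketch short, but a real scorer might want to count them.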

It's definitely useful to do some tweaking, like substituting some common abbreviations. The USPS lists help, though I wouldn't go gung-ho trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, 'JFK' should be a match for 'JOHN F KENNEDY', and there are a number of common ways to shorten 'MARTIN LUTHER KING JR'.

Maybe it goes without saying but I'll say it anyway, for completeness: Don't forget to just do a straight string comparison on the whole address before messing with more complicated things! This should be a very cheap test, and thus is probably a no-brainer first pass.
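
As a sketch of that cheap first pass, a dictionary keyed on the address string finds exact duplicates in a single scan. Folding case and collapsing whitespace before comparing is an assumption here, and `exact_duplicates` is a hypothetical name.

```python
from collections import defaultdict

def exact_duplicates(addresses):
    """Group addresses that are identical after case folding and whitespace collapsing."""
    groups = defaultdict(list)
    for adr in addresses:
        key = " ".join(adr.upper().split())
        groups[key].append(adr)
    # Only groups with more than one member are duplicates.
    return [group for group in groups.values() if len(group) > 1]

sample = ["3 Fairmont Link S", "3 FAIRMONT  LINK S", "5703 48 AV"]
print(exact_duplicates(sample))  # [['3 Fairmont Link S', '3 FAIRMONT  LINK S']]
```

Anything caught here never needs to reach the slower pairwise comparisons.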

Obviously, the more time you're willing and able to spend (both on programming/testing and on run time), the better you'll be able to do. Fuzzy string matching techniques (faster and less generalized kinds than Levenshtein) can be useful, as a separate pass from the token approach (I wouldn't try to fuzzy match individual tokens against each other). I find that fuzzy string matching doesn't give me enough bang for my buck on addresses (though I will use it on names).

云淡风轻 2024-08-10 20:45:46

In order to do this right, you need to standardize your addresses according to USPS standards (your address examples appear to be US based). There are many direct marketing service providers that offer CASS (Coding Accuracy Support System) certification of postal addresses. The CASS process will standardize all of your addresses and append ZIP+4 codes to them. Any undeliverable addresses will be flagged, which will further reduce your mailing costs, if that is your intent. Once all of your addresses are standardized, eliminating duplicates will be trivial.

一紙繁鸢 2024-08-10 20:45:46

I had to do this once. I converted everything to lowercase, computed each address's Levenshtein distance to every other address, and ordered the results. It worked very well, but it was quite time-consuming.

You'll want to use an implementation of Levenshtein in C rather than in Python if you have a large data set. Mine was a few tens of thousands and took the better part of a day to run, I think.
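
For reference, here is a minimal pure-Python sketch of that approach. It is not the code I ran, and the names `levenshtein` and `closest_pairs` are hypothetical. It compares every pair, so the work grows with the square of the number of addresses, which is why a C implementation matters on large data sets.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest_pairs(addresses):
    """Return (distance, address, address) for every pair, closest first."""
    lowered = [adr.lower() for adr in addresses]
    scored = [(levenshtein(x, y), i, j)
              for (i, x), (j, y) in combinations(enumerate(lowered), 2)]
    scored.sort()
    return [(dist, addresses[i], addresses[j]) for dist, i, j in scored]

print(levenshtein("570348THAV", "570348AV"))  # 2
```

The ordered output still has to be reviewed from the top down. Where to cut it off depends on the data.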
