Strategies for finding duplicate mailing addresses

Posted on 2024-08-03 20:45:46

I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'

addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'

I'm planning to apply some string transformations to abbreviate long words (for example NORTH -> N) and to remove all spaces, commas, dashes and pound symbols. Given the output below, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide some simple Python code for this?

addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'

addr_3 = '570348THAV'
addr_4 = '570348AV'

Thankful,

Eduardo

讽刺将军 2024-08-10 20:45:46

First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):

adr = " ".join(adr.lower().split())

Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":

import re  # raw strings keep \b as a word boundary, not a backspace character
adr = re.sub(r"1st(\b|$)", r"1", adr)
adr = re.sub(r"([2-9])\s?nd(\b|$)", r"\1", adr)

Note that the second sub() will work with a space between the "2" and the "nd", but I didn't set the first one to do that; because I'm not sure how you can tell the difference between "41 St Ave" and "41 St" (that second one is "41 Street" abbreviated).
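The two steps so far can be combined into one helper. This is only a minimal sketch: the name `simplify` and the generalized ordinal pattern (st, nd, rd, th) are illustrative assumptions rather than part of the answer. Like the first sub() above, it only strips a suffix attached directly to a digit, so "41 St" is left alone.

```python
import re

# Ordinal suffix attached directly to a digit: 1st, 2nd, 3rd, 48th, ...
_ORDINAL = re.compile(r"(\d)(?:st|nd|rd|th)\b")

def simplify(adr):
    """Lower-case, collapse whitespace, and strip ordinal suffixes."""
    adr = " ".join(adr.lower().split())
    return _ORDINAL.sub(r"\1", adr)

print(simplify("5703 -  48TH   Ave"))  # 5703 - 48 ave
print(simplify("41 St Ave"))           # 41 st ave
```

Compiling the pattern once is just a convenience when the helper is applied to many addresses.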

Be sure to read all the help for the re module; it's powerful but cryptic.

Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:

http://en.wikipedia.org/wiki/Soundex

http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html

adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]

Then you can work with the list or join it back to a string as you think best.

The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.
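Python's standard library has no `soundex` function, so the list comprehension above needs one from somewhere. Here is a minimal sketch of American Soundex, assuming plain ASCII words; it is an illustrative implementation, not the code behind the link above.

```python
# Letter -> Soundex digit; vowels and h, w, y have no code.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit

def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        # h and w do not separate two letters with the same code.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]

adr = "3 fairmont link south"
adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
print(adrlist)  # ['3', 'F655', 'L520', 'S300']
```

Note that a token such as "48th" is not all digits, so it would be passed to `soundex`; stripping ordinal suffixes first avoids that.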

Good luck.

兮颜 2024-08-10 20:45:46

Removing spaces, commas and dashes makes the result ambiguous. It is better to replace them with a single space.

Take for example this address

56 5th avenue

And this

5, 65th avenue

With your method, both of them become:

565THAV

What you can do is write a good address-shortening algorithm and then use plain string comparison to detect duplicates. That should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of a single digit can mean a completely different address.
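To see why a raw similarity percentage is risky, here is a small sketch using `difflib.SequenceMatcher` from the standard library. The helper name `similarity` and the second pair of addresses are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return difflib's similarity ratio in [0, 1] for two strings."""
    return SequenceMatcher(None, a, b).ratio()

# The same address written two ways (from the question).
same = similarity("5703 48TH AVE", "5703 48 AVENUE")
# Two different addresses that differ by a single digit.
different = similarity("56 5TH AVE", "56 6TH AVE")

print(f"same address:        {same:.2f}")
print(f"different addresses: {different:.2f}")
```

With this measure the two different addresses score higher than the true duplicate, so no single threshold separates the cases.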

The algorithm can go like this:

Replace all commas and dashes with spaces. Use the translate method for that.
Build a dictionary of words and their abbreviated forms.
Remove the TH part if it follows a number.
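The three steps above can be sketched as follows. The `ABBREVIATIONS` table is a hypothetical, deliberately tiny example, and stripping ST/ND/RD as well as TH is an assumption that goes slightly beyond the step as stated.

```python
import re

# Hypothetical, deliberately tiny abbreviation table; a real one would be
# much longer.
ABBREVIATIONS = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "AVENUE": "AVE", "STREET": "ST", "ROAD": "RD",
}

# Map commas, dashes and pound signs to spaces.
_PUNCT_TO_SPACE = str.maketrans(",-#", "   ")

def shorten(address):
    """Return a canonical short form that keeps word boundaries."""
    text = address.upper().translate(_PUNCT_TO_SPACE)
    # Drop an ordinal suffix that directly follows a digit: 48TH -> 48.
    text = re.sub(r"(\d)(?:ST|ND|RD|TH)\b", r"\1", text)
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

print(shorten("# 3 FAIRMONT LINK SOUTH"))  # 3 FAIRMONT LINK S
print(shorten("5703- 48 AVENUE"))          # 5703 48 AVE
print(shorten("56 5th avenue"))            # 56 5 AVE
print(shorten("5, 65th avenue"))           # 5 65 AVE
```

Because word boundaries survive, the two ambiguous addresses from the example stay distinct, and duplicates can then be found with an ordinary equality test on the shortened strings.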

谁把谁当真 2024-08-10 20:45:46

This should help you build your abbreviation dictionary:

https://pe.usps.com/text/pub28/28apc_002.htm

浅笑依然 2024-08-10 20:45:46

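I regularly check addresses for duplicates at work, and I have to say I find Soundex highly unsuitable. It is both too slow and too eager to match things. I have similar problems with edit distance.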

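What has worked best for me is to sanitize and tokenize the addresses (strip punctuation, split everything into words) and then see how many tokens match. Because addresses usually have several tokens, you can build a level of confidence from a combination of (1) how many tokens matched, (2) how many numeric tokens matched, and (3) how many tokens are available. For example, if every token in the shorter address appears in the longer address, the confidence of a match is fairly high. Likewise, if you match 5 tokens and at least one of them is numeric, that is still a high-confidence match even if each address has 8 tokens.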
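A minimal sketch of that token approach follows. The function names are hypothetical, and the two rules in `is_high_confidence` simply encode the two examples given above; real thresholds would need tuning on your own data.

```python
import re

def tokenize(address):
    """Upper-case, replace punctuation with spaces, return a set of tokens."""
    return set(re.sub(r"[^A-Z0-9 ]", " ", address.upper()).split())

def match_stats(addr_a, addr_b):
    """Return (tokens matched, numeric tokens matched, tokens available)."""
    a, b = tokenize(addr_a), tokenize(addr_b)
    common = a & b
    numeric = sum(1 for token in common if token.isdigit())
    available = min(len(a), len(b))  # size of the shorter address
    return len(common), numeric, available

def is_high_confidence(addr_a, addr_b):
    """Apply the two example rules: full containment, or 5+ matches with a number."""
    matched, numeric, available = match_stats(addr_a, addr_b)
    if available == 0:
        return False
    return matched == available or (matched >= 5 and numeric >= 1)

print(match_stats("3 FAIRMONT LINK", "# 3 FAIRMONT LINK S"))  # (3, 1, 3)
print(is_high_confidence("12 OAK ST", "98 ELM RD"))           # False
```

Using sets ignores token order and repeated tokens, which keeps the sketch short but means "56 5 AVE" and "5 65 AVE" share two of three tokens; the numeric-token count is there to help weigh such cases.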

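Some tweaking is definitely useful, such as substituting common abbreviations. The USPS lists help, although I wouldn't go overboard trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, "JFK" should match "JOHN F KENNEDY", and there are several common ways to shorten "MARTIN LUTHER KING JR".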

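Maybe it goes without saying, but for completeness I'll say it anyway: don't forget to do a straight string comparison on the whole address before trying anything more complicated! It is a very cheap test, so it is probably a no-brainer first pass.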
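A sketch of such a first pass, assuming the addresses are held in a list of strings. The name `exact_duplicates` is hypothetical, and folding case and whitespace before comparing is an added assumption on top of a strictly literal comparison.

```python
from collections import defaultdict

def exact_duplicates(addresses):
    """Group addresses that are identical after case and whitespace folding."""
    groups = defaultdict(list)
    for address in addresses:
        key = " ".join(address.upper().split())
        groups[key].append(address)
    # Keep only the groups that actually contain a duplicate.
    return [group for group in groups.values() if len(group) > 1]

addresses = ["3 Fairmont Link S", "3 FAIRMONT  LINK S", "5703 48 AVE"]
print(exact_duplicates(addresses))
# [['3 Fairmont Link S', '3 FAIRMONT  LINK S']]
```

Grouping by a dictionary key takes a single pass over the list, so this step stays cheap even for large address files.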

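Obviously, the more time you are willing and able to spend (both on programming/testing and at run time), the better you will do. Fuzzy string matching techniques (faster and less general ones than Levenshtein) can be useful as a separate pass from the token approach (I wouldn't try to fuzzy-match individual tokens against each other). I find that fuzzy string matching doesn't give me enough benefit on addresses (though I do use it on names).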

云淡风轻 2024-08-10 20:45:46

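To do this properly, you need to standardize your addresses according to USPS standards (your example addresses appear to be US-based). Many direct-marketing service providers offer CASS (Coding Accuracy Support System) certification of postal addresses. The CASS process standardizes all of your addresses and appends the ZIP+4 code to them. Any undeliverable addresses are flagged, which further reduces your mailing costs, if that is your goal. Once all of the addresses are standardized, eliminating duplicates becomes trivial.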
