Python: How to extract addresses from sentences/paragraphs (non-regex approach)?

Posted on 2025-02-09 20:01:45

I am working on a project that requires me to extract addresses from sentences.

For example, input sentence: Hi, Mr. Sam D. Richards lives here Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345. If you need any help, call me on 12345678

I am trying to extract just the address, i.e. Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345

What I have tried so far:

I tried pyap, which also relies on regex, so it does not generalize well to addresses from countries other than the US/Canada/UK. I realized that we cannot use regex, as there is no pattern to the addresses or the sentences whatsoever. I also tried locationtagger, which only manages to return the country or the city.

Is there a better way of doing it?

Comments (2)

你是我的挚爱i 2025-02-16 20:01:45

If there is no obvious pattern for regex, you can try an ML-based approach. This is a well-known problem called named entity recognition (NER), and it is typically solved as a sequence tagging problem: a model is trained to predict, for each token (e.g. a word or a subword), whether it is part of an address or not.
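To make the sequence-tagging idea concrete, here is a minimal sketch of the last step: turning per-token tags into address strings. The tag names `B-ADDR`, `I-ADDR` and `O` follow the common BIO convention and are an assumption, as is the helper name `extract_address_spans`; the tags below are written by hand, whereas in practice they would come from a trained model.

```python
from typing import List


def extract_address_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Join consecutive tokens tagged B-ADDR / I-ADDR into address strings."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-ADDR":
            # A new address starts; close the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-ADDR" and current:
            # Continuation of the address that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I-ADDR with no open address) ends the current span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hand-written tags standing in for model predictions (illustrative only).
tokens = ["Sam", "lives", "at", "ABC", "Building", ",", "Aloha", "Road", ",",
          "12345", ".", "Call", "me"]
tags = ["O", "O", "O", "B-ADDR", "I-ADDR", "I-ADDR", "I-ADDR", "I-ADDR",
        "I-ADDR", "I-ADDR", "O", "O", "O"]

addresses = extract_address_spans(tokens, tags)
print(addresses)
```

Joining tokens with a single space is a simplification; keeping character offsets per token would let you slice the exact address out of the original text instead.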

You can look for a model that is already trained to extract addresses (e.g. at https://huggingface.co/models?search=address), or fine-tune a BERT-based model on your own dataset (here is a recipe).
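If you go the fine-tuning route, your dataset needs one label per token. The sketch below shows one possible way to derive word-level BIO labels from a sentence in which the address has been annotated as a character span. The function name `bio_labels`, the `ADDR` label and the whitespace tokenization are all assumptions made for illustration; a real BERT-style pipeline would additionally align these word labels with the model's subword tokens.

```python
from typing import List, Tuple


def bio_labels(text: str, addr_start: int, addr_end: int) -> List[Tuple[str, str]]:
    """Split text on whitespace and label each word O, B-ADDR or I-ADDR."""
    labelled: List[Tuple[str, str]] = []
    inside = False
    pos = 0
    for word in text.split():
        # Locate the word in the original text to get its character offsets.
        start = text.index(word, pos)
        end = start + len(word)
        pos = end
        # A word belongs to the address if it overlaps the annotated span.
        if start < addr_end and end > addr_start:
            labelled.append((word, "I-ADDR" if inside else "B-ADDR"))
            inside = True
        else:
            labelled.append((word, "O"))
            inside = False
    return labelled


text = "Sam lives at Shop No / 123, Aloha Road, 12345. Call me on 12345678"
address = "Shop No / 123, Aloha Road, 12345"
span_start = text.index(address)

pairs = bio_labels(text, span_start, span_start + len(address))
for word, label in pairs:
    print(f"{word}\t{label}")
```

Note that the trailing punctuation in "12345." ends up inside the address here because tokenization is purely whitespace-based; a finer tokenizer would separate it.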

花间憩 2025-02-16 20:01:45

Addresses have a well-known structure, so it should be possible to parse them with a grammar parser.
pyparsing has a scanning feature that searches for a pattern without having to parse the rest of the text. You can try this feature. Here is an example that detects three addresses in the sample string.

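#!/usr/bin/env python3

# Import only the pyparsing names the grammar below actually uses.
from pyparsing import (
    Combine, FollowedBy, Literal, OneOrMore, Optional, White, Word,
    ZeroOrMore, alphanums, alphas, nums,
)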

GermanWord = Word("ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ", alphas + "ß")
GermanWordComposition = GermanWord + (ZeroOrMore(Optional(Literal("-")) + GermanWord))
GermanName = GermanWordComposition
GermanStreet = GermanWordComposition
GermanHouseNumber = Word(nums) + Optional(Word(alphas, exact=1) + FollowedBy(White()))
GermanAddressSeparator = Literal(",") | Literal("in") 
GermanPostCode = Word(nums, exact=5)
GermanTown = GermanWordComposition

German_Address = GermanName + GermanAddressSeparator + GermanStreet + GermanHouseNumber \
    + GermanAddressSeparator + GermanPostCode + GermanTown


EnglishWord = Word("ABCDEFGHIJKLMNOPQRSTUVWXYZ", alphanums)
EnglishNumber = Word(nums)
EnglishComposition = OneOrMore(EnglishWord)
EnglishExtension = Word("-/", exact=1) + (EnglishComposition | EnglishNumber)
EnglishAddressSeparator = Literal(",")
EnglishFloor = (Literal("1st") | Literal("2nd") | Literal("3rd") | (Combine(EnglishNumber + Literal("th")))) + Literal("Floor")
EnglishWhere = EnglishComposition
EnglishStreet = EnglishComposition


EnglishAddress = EnglishComposition + Optional(EnglishExtension) \
    + EnglishAddressSeparator + Optional(EnglishFloor)           \
    + Optional(EnglishAddressSeparator + EnglishWhere)           \
    + Optional(EnglishAddressSeparator + EnglishWhere)           \
    + EnglishAddressSeparator + EnglishStreet + EnglishAddressSeparator + EnglishNumber

Address = EnglishAddress | German_Address


test_1 = "I am writing to Peter Meyer, Moritzstraße 22, 54543 Musterdorf a letter. But the letter arrived at \
Hubert Figge, Große Straße 14 in 45434 Berlin. In the letter was written: Hi, Mr. Sam D. Richards lives here \
Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345. If you need any help, call       \
me on 12345678."

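# scanString yields a (tokens, start, end) tuple for each match it finds in the text.
for tokens, start, end in Address.scanString(test_1):
    print(tokens, start, end)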