Python: How to extract addresses from a sentence/paragraph (non-regex approach)?
I am working on a project that requires me to extract addresses from sentences.
For example, given the input sentence: Hi, Mr. Sam D. Richards lives here Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345. If you need any help, call me on 12345678
I want to extract just the address, i.e. Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345
What I have tried so far:
I tried pyap, which is regex-based, so it does not generalize well to addresses from countries other than the US, Canada, and the UK. I realized that regex will not work here, because neither the addresses nor the surrounding sentences follow any fixed pattern. I also tried locationtagger, which only manages to return the country or the city.
Is there a better way to do this?
Comments (2)
If there is no obvious pattern for a regex, you can try an ML-based approach. This is a well-known problem called named entity recognition (NER), and it is typically solved as a sequence tagging problem: a model is trained to predict, for each token (e.g. a word or a subword), whether it is part of an address or not.
You can look for a model that has already been trained to extract addresses (e.g. here: https://huggingface.co/models?search=address), or fine-tune a BERT-based model on your own dataset (here is a recipe).
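To make the sequence tagging formulation concrete, here is a minimal sketch of the step that turns per-token predictions back into address strings. It assumes the common BIO tagging scheme with a hypothetical label name `ADDR` (`B-ADDR` starts an address, `I-ADDR` continues it, `O` is everything else). The tags below are written by hand to stand in for model output; no model is actually run, and the helper name `extract_spans` is made up for illustration.

```python
def extract_spans(tokens, tags, label="ADDR"):
    """Collect runs of B-/I- tagged tokens into space-joined strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            # A new entity begins; close any entity still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I- tag with nothing open) ends the entity.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hand-written stand-in for what a token classifier might predict.
tokens = ["Richards", "lives", "here", "Shop", "No", "/", "123", ",",
          "Aloha", "Road", ",", "12345", ".", "Call", "me"]
tags = ["O", "O", "O", "B-ADDR", "I-ADDR", "I-ADDR", "I-ADDR", "I-ADDR",
        "I-ADDR", "I-ADDR", "I-ADDR", "I-ADDR", "O", "O", "O"]

addresses = extract_spans(tokens, tags)
print(addresses)
```

With a real model, the tokens and tags would come from its tokenizer and predictions, and subword pieces would need to be merged back into words before joining.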
Addresses have a well-known structure, so it should be possible to parse them with a grammar-based parser.
pyparsing has a scanning feature that searches a text for a pattern without having to parse the rest of the input. You can try this feature. I have an example for you that detects three addresses in the example string.