spaCy Matcher for address recognition in Spanish text
I would like to capture the addresses in (Spanish) legal documents like:
import spacy
from spacy.matcher import Matcher
nlp=spacy.load("es_core_news_lg")
texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
doc = nlp(texto)
so the output should be something like:
['160 Nº 765 piso 2 dpto A, La Plata', 'Belfast Nº 1435 Tandil']
I think the matcher should exploit the fact that the relevant information starts after the word 'calle' and ends with the name of the city, which is recognized by:
gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
gpe
['La Plata', 'Belfast Nº', 'Tandil']
I thought that the algorithm should be something like:
- Look for the word 'calle'.
- Take the name in the gpe list that is as far as possible from 'calle' but before the next appearance of 'calle'.
- Take all the text between these two words.
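The three steps above can be sketched in plain Python, without a Matcher. This is only an illustration under stated assumptions: the function name `extract_addresses` is hypothetical, and the list of location names is copied from the `gpe` output shown in the question rather than computed by spaCy.

```python
import re
from typing import List


def extract_addresses(text: str, locations: List[str]) -> List[str]:
    """Return, for each 'calle', the text up to the last known location
    that appears before the next 'calle'."""
    addresses = []
    # Step 1: split at every standalone occurrence of the word "calle".
    segments = re.split(r"\bcalle\b", text)[1:]
    for segment in segments:
        # Step 2: find the location whose mention ends furthest into the segment.
        ends = [segment.rfind(name) + len(name) for name in locations if name in segment]
        if not ends:
            continue  # no known location follows this "calle"
        # Step 3: keep everything between "calle" and the end of that location.
        addresses.append(segment[:max(ends)].strip())
    return addresses


texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
# Assumption: LOC entities as printed in the question (normally read from doc.ents).
locations = ["La Plata", "Belfast Nº", "Tandil"]
print(extract_addresses(texto, locations))
```

Because the furthest location in each segment wins, the spurious entity 'Belfast Nº' does not cut the second address short.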
My problem is that I do not know how to define a Matcher like this one.
Update:
I solved it with the following function:
def domicilios(documento: str) -> list:
    """
    Identify the addresses for each person.
    """
    resultado = []
    # Each fragment after a 'calle' holds at most one address.
    for texto in documento.split('calle')[1:]:
        doc = nlp(texto)
        gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
        if not gpe:
            continue  # no city recognized in this fragment
        ciudad = gpe[-1]  # last location found is taken as the city
        resultado.append([texto[:texto.find(ciudad)], ciudad])
    return resultado
domicilios(texto)
Anyway, I still think there should be a way to solve it with spaCy exclusively.
Comments (1)
First, create a matcher and the document object:
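A minimal sketch of this first step, assuming spaCy v3. To stay runnable without downloading a model, it uses a blank Spanish pipeline via `spacy.blank("es")`, which is an assumption of this sketch; in the question's setting the `nlp` object loaded from `es_core_news_lg` would be used instead.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank Spanish pipeline (tokenizer only) stands in for es_core_news_lg.
nlp = spacy.blank("es")

texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "

matcher = Matcher(nlp.vocab)  # the matcher shares the pipeline's vocabulary
doc = nlp(texto)              # tokenized document the matcher will run on
```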
Next, you need a list of trigger words at which the match should stop. Because the Matcher object only looks at single tokens and a GPE may span several words, use the last word of each GPE:
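One way to build such a list, as a sketch: the `gpe` values are copied from the output printed in the question (normally they would come from `doc.ents`), and the name `gpe_triggers` is an assumption of this example.

```python
# LOC entity texts as printed in the question (normally taken from doc.ents).
gpe = ["La Plata", "Belfast Nº", "Tandil"]

# Keep only the last whitespace-separated word of each entity,
# since a Matcher token pattern compares one token at a time.
gpe_triggers = [name.split()[-1] for name in gpe]
print(gpe_triggers)  # ['Plata', 'Nº', 'Tandil']
```

Note that the mislabeled entity 'Belfast Nº' contributes 'Nº' as a trigger, so a match could also stop there unless only the longest match is kept.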
Now, you can create a matcher that follows your algorithm:
The "OP":"*" makes the matcher search greedily (as far as possible) before stopping at a location from GPE (but without allowing "calle"). The rest is from the spaCy documentation.
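A minimal sketch of a pattern along these lines, assuming spaCy v3 (where `Matcher.add` takes a list of patterns and a `greedy` argument). The blank pipeline, the hard-coded trigger list, and the rule name "DOMICILIO" are assumptions of this example, not part of the original answer. As far as I know, `"OP": "*"` by itself yields a match for every candidate length, so `greedy="LONGEST"` is used here to keep only the longest one.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank Spanish pipeline (tokenizer only) stands in for es_core_news_lg.
nlp = spacy.blank("es")
texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
doc = nlp(texto)

# Assumption: last words of the LOC entities printed in the question.
gpe_triggers = ["Plata", "Nº", "Tandil"]

pattern = [
    {"LOWER": "calle"},                           # start at the word "calle"
    {"LOWER": {"NOT_IN": ["calle"]}, "OP": "*"},  # any tokens except another "calle"
    {"TEXT": {"IN": gpe_triggers}},               # stop at a trigger word
]

matcher = Matcher(nlp.vocab)
# greedy="LONGEST" keeps only the longest of any overlapping matches.
matcher.add("DOMICILIO", [pattern], greedy="LONGEST")

# Sort by start token so the addresses come out in document order.
matches = sorted(matcher(doc), key=lambda m: m[1])
# Drop the leading "calle" token from each matched span.
addresses = [doc[start + 1:end].text for _, start, end in matches]
print(addresses)
```

With the question's full pipeline, `gpe_triggers` would be derived from `doc.ents` rather than typed in by hand.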