Using the spaCy Matcher to recognize addresses in Spanish text

Posted on 2025-01-16 14:17:35

I would like to capture the addresses in (Spanish) legal documents like:

import spacy
from spacy.matcher import Matcher
nlp=spacy.load("es_core_news_lg")

texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "

doc = nlp(texto)

The output should be something like:

['160 Nº 765 piso 2 dpto A, La Plata', 'Belfast Nº 1435 Tandil']

I think the matcher should exploit the fact that the relevant information starts after the word 'calle' and ends with the name of the city, which the model recognizes as a LOC entity:

gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
gpe

['La Plata', 'Belfast Nº', 'Tandil']

I thought that the algorithm should be something like:

Look for the word 'calle'.
Take the name in the gpe list that is as far as possible from 'calle' but before the next appearance of 'calle'.
Take all the text between these two words.
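
The three steps above can be sketched in plain Python before involving the Matcher. This is only an illustration under stated assumptions: KNOWN_CITIES is a hypothetical hard-coded list standing in for the LOC entities the model would return, and extract_addresses is a name invented here, not part of spaCy.

```python
import re

# Hypothetical stand-in for the LOC entities returned by the NER model.
KNOWN_CITIES = ["La Plata", "Tandil"]

def extract_addresses(text, cities=KNOWN_CITIES):
    """Return the text after each 'calle' up to the last city before the next 'calle'."""
    addresses = []
    # Step 1: split on the word 'calle'; every piece after the first starts an address.
    for segment in re.split(r"\bcalle\b", text)[1:]:
        # Step 2: find the city occurrence that ends furthest to the right.
        best_end = -1
        for city in cities:
            pos = segment.rfind(city)
            if pos != -1:
                best_end = max(best_end, pos + len(city))
        # Step 3: keep everything between 'calle' and the end of that city.
        if best_end != -1:
            addresses.append(segment[:best_end].strip())
    return addresses

texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
print(extract_addresses(texto))
```

Segments in which no known city appears are skipped rather than raising an error.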

My problem is that I do not know how to define a Matcher like this one.

Update: I solved it with the following function:

def domicilios(documento: str) -> list:
    """
    Función que identifica las direcciones para cada persona
    """
    domicilios = []
    # Each chunk after a 'calle' is assumed to hold one address.
    for texto in documento.split('calle')[1:]:
        doc = nlp(texto)
        gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']
        if not gpe:
            continue  # no city recognized in this chunk
        ciudad = gpe[-1]
        domicilios.append([texto[:texto.find(ciudad)], ciudad])
    return domicilios

domicilios(texto)

Anyway, I still think there should be a way to solve it exclusively with spaCy.

稚气少女 2025-01-23 14:17:35

First, load the model and create the document object:

import spacy
from spacy.matcher import Matcher
nlp=spacy.load("es_core_news_lg")

texto = "... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad 14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, para que, ... "
doc = nlp(texto)
gpe = [ee.text for ee in doc.ents if ee.label_ == 'LOC']

Next, you need a list of trigger words at which the matcher should stop. A Matcher pattern works token by token and a location may span several words, so use the last word of each location:

gpe_ends = [loc.split()[-1] for loc in gpe]
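
To make the trigger list concrete, here is a small sketch that applies the same expression to the entity texts shown in the question (copied by hand, not recomputed with the model). The NOT_A_CITY set and the triggers name are hypothetical additions for illustration only.

```python
# Entity texts copied from the question's output.
gpe = ["La Plata", "Belfast Nº", "Tandil"]

# Last whitespace-separated word of each entity.
gpe_ends = [loc.split()[-1] for loc in gpe]
print(gpe_ends)

# Hypothetical clean-up: drop words known not to be city names.
NOT_A_CITY = {"Nº"}
triggers = [word for word in gpe_ends if word not in NOT_A_CITY]
print(triggers)
```

Because the model labelled 'Belfast Nº' as a location, 'Nº' ends up among the triggers, so a pattern built on gpe_ends can also stop at 'Nº' unless such words are filtered out or only the longest match is kept.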

Now, you can create a matcher that follows your algorithm:

pattern = [{"LOWER" : "calle"},
           {"TEXT"  : {"NOT_IN":["calle"]},"OP": "*"},
           {"TEXT"  : {"IN"    : gpe_ends}}]

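The "OP": "*" lets the pattern consume any number of tokens other than "calle" before it stops at one of the trigger words. Note that, by default, the Matcher reports every span that fits the pattern, including shorter overlapping ones; passing greedy="LONGEST" to m.add keeps only the longest match. The rest is from the spaCy documentation.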

m = Matcher(nlp.vocab)
m.add("address", [pattern])
matches = m(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
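
As a hedged variation on the code above, the following sketch keeps only the longest match per address by passing greedy="LONGEST" to Matcher.add. To stay runnable without the large model it assumes a blank Spanish pipeline (tokenizer only) and a hard-coded gpe_ends list taken from the question's LOC output; with es_core_news_lg you would compute gpe_ends from doc.ents as shown earlier.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank Spanish pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("es")

texto = ("... domiciliado en calle 160 Nº 765 piso 2 dpto A, La Plata, "
         "don Ricardo Fabián ROSENFELD, Documento Nacional de Identidad "
         "14.464.003 con domicilio legal en calle Belfast Nº 1435 Tandil, "
         "para que, ... ")
doc = nlp(texto)

# Assumption: trigger words hard-coded from the question's LOC output.
gpe_ends = ["Plata", "Nº", "Tandil"]

pattern = [{"LOWER": "calle"},
           {"TEXT": {"NOT_IN": ["calle"]}, "OP": "*"},
           {"TEXT": {"IN": gpe_ends}}]

m = Matcher(nlp.vocab)
# greedy="LONGEST" drops matches that overlap a longer match of the same rule.
m.add("address", [pattern], greedy="LONGEST")

# Sort by start token so the addresses come out in document order.
matches = sorted(m(doc), key=lambda match: match[1])
addresses = [doc[start:end].text for _, start, end in matches]
print(addresses)
```

Each resulting span starts at 'calle' and ends at the last trigger word before the next 'calle'; slicing from start + 1 would drop the word 'calle' itself.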
