Merging named entities with spaCy's Matcher module

Published 2025-01-12 18:13:34

def match_patterns(cleanests_post):

    mark_rutte = [
        [{"LOWER": "mark", "OP": "?"}, {"LOWER": "rutte", "OP": "?"}],
        [{"LOWER": "markie"}],
    ]
    matcher.add("Mark Rutte", mark_rutte, on_match=add_person_ent)

    hugo_dejonge = [
        [{"LOWER": "hugo", "OP": "?"}, {"LOWER": "de jonge", "OP": "?"}],
    ]
    matcher.add("Hugo de Jonge", hugo_dejonge, on_match=add_person_ent)

    adolf_hitler = [
        [{"LOWER": "adolf", "OP": "?"}, {"LOWER": "hitler", "OP": "?"}],
    ]
    matcher.add("Adolf Hitler", adolf_hitler, on_match=add_person_ent)

    matches = matcher(cleanests_post)
    matches.sort(key=lambda x: x[1])

    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = cleanests_post[start:end]  # The matched span
        # print('matches', match_id, string_id, start, end, span.text)

    return cleanests_post



def add_person_ent(matcher, cleanests_post, i, matches):

    # Get the current match and create a tuple of entity label, start and end.
    # Append the entity to the doc's entities. (Don't overwrite doc.ents!)

    match_id, start, end = matches[i]
    entity = Span(cleanests_post, start, end, label="PERSON")

    filtered = filter_spans(cleanests_post.ents)  # When spans overlap, the (first) longest span is preferred over shorter spans.
    filtered += (entity,)

    cleanests_post = filtered

    return cleanests_post

 

with open(filepath, encoding='latin-1') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')

    next(reader, None)  # Skip the first row (= header) of the csv file

    dict_from_csv = {rows[0]: rows[2] for rows in reader}  # dictionary with 'date' as keys and 'text' as values
    # print(dict_from_csv)

    values_list = list(dict_from_csv.values())
    # print('values_list:', values_list)

    people = []

    for post in values_list:  # iterate over each post

        # Do some preprocessing here
        clean_post = remove_images(post)
        cleaner_post = remove_forwards(clean_post)
        cleanest_post = remove_links(cleaner_post)
        cleanests_post = delete_breaks(cleanest_post)

        cleaned_posts.append(cleanests_post)

        cleanests_post = nlp(cleanests_post)
        cleanests_post = match_patterns(cleanests_post)

        if cleanests_post.ents:
            show_results = displacy.render(cleanests_post, style='ent')

        # GET PEOPLE
        for named_entity in cleanests_post.ents:
            if named_entity.label_ == "PERSON":
                # print('NE PERSON:', named_entity)
                people.append(named_entity.text)

    people_tally = Counter(people)

    df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
    print('people:', df)







I'm using spaCy to extract named entities mentioned in a range of Telegram groups. My data are csv files with columns 'date' and 'text' (a string with the content of each post).

To optimize my output I'd like to merge entities such as 'Mark', 'Rutte', 'Mark Rutte', 'Markie' (and their lowercase forms), as they all refer to the same person. My approach is to use spaCy's built-in Matcher module to merge these entities.
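As a self-contained illustration of that approach, the sketch below shows the basic Matcher setup. Note the initialisation of nlp and matcher is not shown in my code above, so the blank English pipeline here is an assumption, not my actual setup:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English model is enough to demonstrate the Matcher;
# the real script uses a full nlp pipeline (not shown above).
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)  # the Matcher must share the pipeline's vocab

# One pattern list can hold several alternatives under the same key:
mark_rutte = [
    [{"LOWER": "mark"}, {"LOWER": "rutte"}],  # "Mark Rutte" in any casing
    [{"LOWER": "markie"}],                    # the nickname
]
matcher.add("Mark Rutte", mark_rutte)

doc = nlp("Markie and MARK RUTTE are the same person.")
matches = matcher(doc)  # list of (match_id, start, end) tuples
```

One caveat: marking both tokens with "OP": "?" (as in my patterns above) also lets each token match on its own, which produces extra overlapping matches for e.g. "Mark" and "Rutte" separately.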

In my code, match_patterns() is used to define patterns such as mark_rutte, and add_person_ent() is used to append each match as an entity to doc.ents (in my case cleanests_post.ents).

The order of the script is this:

  • open the csv file with the Telegram data in a with-open block
  • iterate over each post (= a string with text of the post) individually and do some preprocessing
  • call spaCy's built-in nlp() function on each of the posts to extract named entities
  • call my own match_patterns() function on each of these posts to merge the entities I defined in patterns mark_rutte, hugo_dejonge and adolf_hitler
  • finally, loop over the entities in cleanests_post.ents and append all the PERSON entities to people (= list) and use Counter() and pandas to generate a ranking of each of the persons identified
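The tallying in the last step is straightforward; here is a tiny sketch with toy names (in the real script, people is filled from cleanests_post.ents):

```python
from collections import Counter

# Toy stand-in for the `people` list built in the loop over posts.
people = ["Mark Rutte", "Mark Rutte", "markie", "Hugo de Jonge"]
people_tally = Counter(people)
ranking = people_tally.most_common()  # [('Mark Rutte', 2), ('markie', 1), ('Hugo de Jonge', 1)]
```

The ranking list then feeds directly into pd.DataFrame(..., columns=['character', 'count']) as in the code above.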

What goes wrong: it seems as if match_patterns() and add_person_ent() do not work. My output is exactly the same as when I do not call match_patterns(), i.e. 'Mark', 'mark', 'Rutte', 'rutte', 'Mark Rutte', 'MARK RUTTE', 'markie' are still categorised as separate entities. It seems as if something goes wrong with overwriting cleanests_post.ents. In add_person_ent() I have tried using spaCy's filter_spans() to solve the problem, but without success.
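For reference, one suspicion is that reassigning cleanests_post inside add_person_ent() only rebinds a local variable, so the caller's Doc never changes. A hedged sketch of what writing the merged spans back onto the Doc itself could look like (blank pipeline and sentence are illustrative, not my original setup):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: blank pipeline for illustration; the real script uses a full nlp model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Mark Rutte", [
    [{"LOWER": "mark"}, {"LOWER": "rutte"}],
    [{"LOWER": "markie"}],
])

doc = nlp("markie spoke, and Mark Rutte answered.")
spans = [Span(doc, start, end, label="PERSON") for _, start, end in matcher(doc)]

# Assigning to doc.ents mutates the Doc in place, so the change is still
# visible after the function returns; filter_spans first drops overlapping
# spans, keeping the (first) longest one.
doc.ents = filter_spans(list(doc.ents) + spans)
```
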
