Extracting information from text with spaCy

Posted on 2025-02-11 21:38:38

I want to build a model that extracts personal data collected by a website.

As a first step, I scraped the privacy policy of a website, split it into sentences, and put them in a DataFrame, as shown in the image below:

image

From those sentences, I only want to extract the ones that contain the phrase "personal information".

I used the code below but I don't get the result I want:

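import spacy
from spacy.matcher import Matcher

# assumption: the post does not say which pipeline is loaded; any English one works here
nlp = spacy.load("en_core_web_sm")

def find_names(text):
    names = []

    # spaCy doc
    doc = nlp(text)

    # pattern: the token "personal" followed by the token "data", case-insensitive
    pattern = [{'LOWER': 'personal'},
               {'LOWER': 'data'}]

    # Matcher class object
    matcher = Matcher(nlp.vocab)
    matcher.add("names", [pattern])

    # each match is a (match_id, start, end) tuple
    for match_id, start, end in matcher(doc):
        # append the text of the matched span to the list
        names.append(doc[start:end].text)

    return names

# apply the function to every sentence in the 'Sent' column
df2['PM_Names'] = df2['Sent'].apply(find_names)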

The output:

image

I want to extract only the whole sentences that contain the phrase "personal information".
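
To make the goal concrete, here is a minimal sketch of the desired behaviour: keep a sentence in full when it mentions the phrase, and drop it otherwise. It deliberately swaps the spaCy Matcher for a plain regular expression so that it runs with the standard library alone. The helper name `keep_if_mentions_phrase`, the constant `PHRASE_RE`, and the sample sentences are all made up for illustration and do not come from the post.

```python
import re

# Hypothetical helper, not from the original post: matches the phrase
# "personal information" case-insensitively, allowing any whitespace
# between the two words.
PHRASE_RE = re.compile(r"\bpersonal\s+information\b", re.IGNORECASE)


def keep_if_mentions_phrase(sentence):
    """Return the whole sentence if it contains the phrase, else None."""
    return sentence if PHRASE_RE.search(sentence) else None


# Made-up sample sentences standing in for the 'Sent' column.
sentences = [
    "We collect personal information such as your name and email address.",
    "Cookies help us improve the site.",
    "Your Personal Information is never sold to third parties.",
]

# Keep only the sentences that mention the phrase, in their original order.
selected = [s for s in sentences if keep_if_mentions_phrase(s) is not None]
```

With the DataFrame from the question, the same helper could presumably be applied with `df2['Sent'].apply(keep_if_mentions_phrase)` and the `None` rows dropped afterwards. A spaCy version would follow the same idea: change the second token of the pattern from 'data' to 'information', and return `text` itself whenever `matcher(doc)` is non-empty instead of collecting the matched spans.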
