spaCy (R) - How to save the full name of a person

Posted on 2025-01-12 19:10:52 | Word count: 4327 | Views: 2 | Comments: 0

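I have a large data frame containing articles from German newspapers. I want to loop through all the articles and, for each one, save the full name (where available) of every person mentioned in the text in that article's row. I am using the spacyr package in R for named entity recognition (NER).

The function spacy_extract_entity() does this and returns the full names of the persons mentioned. Unfortunately, it also returns too many false positives among the PER entities. I therefore tried spacy_parse() instead and filtered the person entities myself.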

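library("spacyr")
library("dplyr")  # provides %>%, filter() and pull()
spacy_initialize(model = "de_core_news_lg")

# Create the result column first so that each row can be filled in place.
data$persons <- NA_character_

for (i in seq_len(nrow(data))) {
  data$persons[i] <- spacy_parse(data$text[i], dependency = FALSE, lemma = FALSE) %>%
    # Keep only proper nouns tagged as the beginning or inside of a PER entity.
    filter(entity %in% c("PER_B", "PER_I"), pos == "PROPN") %>%
    pull(token) %>%
    toString()
}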

The results are much better for my purpose. For the example text at the end of this post I get the following output:

Kenji, Mizoguchi, Alain, Resnais, Rainer, Werner, Fassbinder, Alfred, Hitchcock, Werner, Herzog, Aguirre, Claude, Chabrol, Hannelore, Elsner, Steven, Spielberg

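Now I would like the tokens of each full name to be joined, so that commas separate persons rather than individual tokens (Rainer Werner Fassbinder, Alfred Hitchcock, etc.).

In theory, I should check for each row whether the next row has the same sentence_id and, if so, whether its token_id is exactly one higher, and repeat this until the check fails. The tokens collected up to that point would then be saved as one full name, separated by a comma from the next name. I am having a hard time writing a code chunk that does this, and doing it this way also seems slow for a larger corpus. I would be glad for any idea on how to do this, or for an alternative solution. Thanks a lot!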
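The row-by-row check described above can probably be done without an explicit loop. The following is only a sketch in base R, under stated assumptions: `parsed` is a small hand-built stand-in for already filtered spacy_parse() output (it is not real model output), its rows are assumed to be in text order, and `collapse_names()` is a hypothetical helper name, not part of spacyr.

```r
# Hand-built stand-in for filtered spacy_parse() output (PER tokens only).
parsed <- data.frame(
  doc_id      = c("text1", "text1", "text1", "text1", "text1", "text1", "text2", "text2"),
  sentence_id = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L),
  token_id    = c(5L, 6L, 7L, 9L, 10L, 3L, 5L, 6L),
  token       = c("Rainer", "Werner", "Fassbinder", "Alfred", "Hitchcock",
                  "Aguirre", "Claude", "Chabrol"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: joins directly adjacent tokens into full names and
# returns one comma-separated string per document.
collapse_names <- function(parsed) {
  n <- nrow(parsed)
  if (n == 0) return(character(0))
  # TRUE where a token directly follows the token in the previous row
  # (same document, same sentence, token_id exactly one higher).
  continues <- c(
    FALSE,
    parsed$doc_id[-1] == parsed$doc_id[-n] &
      parsed$sentence_id[-1] == parsed$sentence_id[-n] &
      diff(parsed$token_id) == 1
  )
  name_id <- cumsum(!continues)  # running index of full names
  full <- vapply(split(parsed$token, name_id), paste, character(1), collapse = " ")
  doc <- parsed$doc_id[!continues]  # document of the first token of each name
  vapply(split(unname(full), factor(doc, levels = unique(doc))), toString, character(1))
}

persons <- collapse_names(parsed)
```

Because the document id is part of the adjacency check, the whole corpus can be parsed in one spacy_parse() call and collapsed in one pass, which may help with the speed concern. The result is a character vector named by doc_id.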

Example spacy_parse() output:

structure(list(doc_id = c("text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 
49L, 49L, 49L, 53L, 53L, 53L, 53L, 55L, 55L), token_id = c(13L, 
14L, 16L, 17L, 19L, 20L, 21L, 23L, 24L, 26L, 27L, 32L, 15L, 16L, 
18L, 19L, 19L, 20L), token = c("Kenji", "Mizoguchi", "Alain", 
"Resnais", "Rainer", "Werner", "Fassbinder", "Alfred", "Hitchcock", 
"Werner", "Herzog", "Aguirre", "Claude", "Chabrol", "Hannelore", 
"Elsner", "Steven", "Spielberg"), pos = c("PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN"), entity = c("PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_I", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_B", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I")), row.names = c(NA, 
-18L), class = c("spacyr_parsed", "data.frame"))

Example text:

Es gibt immer Klassiker von großen Regisseuren, aktuell unter anderem von Kenji Mizoguchi, Alain Resnais, Rainer Werner Fassbinder, Alfred Hitchcock und Werner Herzog - dessen großer "Aguirre" ist beispielsweise noch sieben Tage zu sehen. In ähnlichen Bereichen wie Mubi sind auch lokale deutsche Anbieter aktiv, so wie zum Beispiel Alles Kino oder Filmfriend. Letztere kooperiert mit Bibliotheken in ganz Deutschland, deren Mitglieder die Mediathek mit ihrem Bibliotheksausweis nutzen können. Die Betreiber teilen mit, dass sie bereits eine leicht höhere Nutzung ihres Portals feststellen können, wobei noch nicht ganz klar sei, ob die wirklich auf die Epidemie zurückgehe. Filmfriend hat keine Eigenproduktionen, bietet aber Werkschauen einzelner Filmemacher (derzeit zum Beispiel Claude Chabrol und Hannelore Elsner) und hat eine gute Auswahl an deutschen Kinoklassikern - auch viele Produktionen der DDR-Filmschmiede Defa. Und wenn das immer noch nicht genug Futter ist? Dann startet am 6. April in den USA die Plattform Quibi, auf der Stars und Meister wie Steven Spielberg in handytauglichen Kurzformaten die Zukunft des Entertainment-Häppchens erproben, mit Studio-Unterstützung und Milliardenbudget.

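EDIT: With a slight adaptation of Ivan's suggestion, I implemented this in a way that fits my purpose. Because PER_B and PER_I are not always assigned correctly, the function relies on the sentence and token ids instead. My adaptation: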

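fix_per_names <- function(entity_names) {
  tokens      <- entity_names$token
  token_id    <- entity_names$token_id
  sentence_id <- entity_names$sentence_id
  full_names  <- character(0)
  current     <- NULL  # the name currently being assembled

  for (i in seq_along(tokens)) {
    # A token continues the current name if it directly follows the previous
    # token in the same sentence; the PER_B/PER_I tags are not relied on.
    continues <- i > 1 &&
      sentence_id[i] == sentence_id[i - 1] &&
      token_id[i] - token_id[i - 1] == 1
    if (continues) {
      current <- paste(current, tokens[i])
    } else {
      full_names <- c(full_names, current)  # store the finished name, if any
      current <- tokens[i]
    }
  }
  # Append the last name and return all names as one comma-separated string.
  toString(c(full_names, current))
}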

浮光之海 2025-01-19 19:10:52

It's not the fifth wonder of the world, but it helps.

dad = structure(list(doc_id = c("text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 
49L, 49L, 49L, 53L, 53L, 53L, 53L, 55L, 55L), token_id = c(13L, 
14L, 16L, 17L, 19L, 20L, 21L, 23L, 24L, 26L, 27L, 32L, 15L, 16L, 
18L, 19L, 19L, 20L), token = c("Kenji", "Mizoguchi", "Alain", 
"Resnais", "Rainer", "Werner", "Fassbinder", "Alfred", "Hitchcock", 
"Werner", "Herzog", "Aguirre", "Claude", "Chabrol", "Hannelore", 
"Elsner", "Steven", "Spielberg"), pos = c("PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN"), entity = c("PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_I", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_B", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I")), row.names = c(NA, 
-18L), class = c("spacyr_parsed", "data.frame"))

a <- dad$token
b <- dad$entity
e <- character(0)  # one element per full name

for (i in seq_along(a)) {
  if (b[i] == "PER_I" && length(e) > 0) {
    # PER_I continues the current name: append the token to the last entry.
    e[length(e)] <- paste(e[length(e)], a[i])
  } else {
    # PER_B (or a stray leading PER_I) starts a new name.
    e <- c(e, a[i])
  }
}
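
The same tag-based grouping can probably be written without a loop. This is a minimal sketch, assuming the tokens are in text order and that the PER_B/PER_I tags are correct, which, as noted in the question, is not always the case. The `tokens` and `entity` vectors are hand-built stand-ins, not real model output.

```r
# Hand-built stand-in for PER tokens from spacy_parse(), in text order.
tokens <- c("Rainer", "Werner", "Fassbinder", "Alfred", "Hitchcock", "Aguirre")
entity <- c("PER_B", "PER_I", "PER_I", "PER_B", "PER_I", "PER_B")

# Every PER_B opens a new name; the PER_I tokens that follow join it.
name_id <- cumsum(entity == "PER_B")

# Paste the tokens of each group into one full name.
full_names <- unname(
  vapply(split(tokens, name_id), paste, character(1), collapse = " ")
)
```

Here `full_names` holds one element per person, so `toString(full_names)` gives the comma-separated string asked for in the question.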
