Splitting text with the tm package in R: identifying speakers

Published 2024-12-26 07:55:48

I am trying to identify the most frequently used words in congressional speeches, and I need to separate them by congressperson. I am just starting to learn R and the tm package. I have code that finds the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?

Text looks like this:

OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN

    The Chairman. Good afternoon to everybody, and thank you 
very much for coming to this hearing this afternoon.
    In today's tough economic climate, millions of seniors have 
lost a big part of their retirement and investments in only a 
matter of months. Unlike younger Americans, they do not have 
time to wait for the markets to rebound in order to recoup a 
lifetime of savings.
[....]

   STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER
[....]

I would like to be able to extract these names, or separate the text by speaker. I hope you can help me. Thanks a lot.


Comments (2)

怼怹恏 2025-01-02 07:55:48

Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.

If so, you might say x is your text, then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, then grep() or str_extract() to return the 2 or 3 words after SENATOR (do they always have only two names as in your example?).

Have a look here for more on the use of these functions, and text manipulation in general in R: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

UPDATE Here's a more complete answer...

#create object containing all text
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN

    The Chairman. Good afternoon to everybody, and thank you 
very much for coming to this hearing this afternoon.
    In today's tough economic climate, millions of seniors have 
lost a big part of their retirement and investments in only a 
matter of months. Unlike younger Americans, they do not have 
time to wait for the markets to rebound in order to recoup a 
lifetime of savings.

STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN

I am trying to identify the most frequently used words in the 
congress speeches, and have to separate them by the congressperson. 
I am just starting to learn about R and the tm package. I have a code 
that can find the most frequent words, but what kind of a code can I  
use to automatically identify and store the speaker of the speech

STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN

Would it be correct to say that you want 
to split the file so you have one text object 
per speaker? And then use a regular expression 
to grab the speaker's name for each object? Then 
you can write a function to collect word frequencies, 
etc. on each object and put them in a table where the 
row or column names are the speaker's names.")

# split the text wherever "STATEMENT OF" occurs
y <- unlist(strsplit(x, "STATEMENT OF"))

# load the stringr package for its handy word() function
library(stringr)

# use word() to return words in positions 3 to 4 of each string, which is
# where the first and last names are; skip y[1], since it contains only one
# word ("OPENING") and word() gives an error if a string has too few words
z <- word(y[2:4], 3, 4)
z # have a look at the result...
[1] "HERB KOHL,"     "BIG APPLE"      "LITTLE ORANGE,"

No doubt a regular expressions wizard could come up with something to do it quicker and neater!
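One such regex sketch (my own hedged alternative, not part of the original answer): a Perl-style lookbehind can pull out the capitalised name that follows SENATOR directly, assuming the header always has the name in upper case followed by a comma, as in the examples above.

```r
# Hypothetical regex alternative: extract the upper-case words after
# "SENATOR" and before the trailing comma (assumes that exact layout).
header <- "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN"
name <- regmatches(header,
                   regexpr("(?<=SENATOR )[A-Z ]+(?=,)", header, perl = TRUE))
name
# [1] "HERB KOHL"
```

Unlike the word() approach, this also drops the trailing comma from the extracted name.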

Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech) and then make another object that combines the word-frequency results with the names for further analysis.
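That counting step might be sketched like this (toy inputs stand in for the y and z objects built above; the tokenisation here is a simple non-alphabetic split, not tm's tokenizer):

```r
# Sketch: per-speaker word frequencies from already-split speeches.
speeches <- c("the markets will rebound the markets",
              "identify the most frequent words")
speakers <- c("HERB KOHL", "MEL MARTINEZ")
freqs <- lapply(speeches, function(s) {
  # lower-case, split on runs of non-letters, drop empty strings, count
  words <- tolower(unlist(strsplit(s, "[^A-Za-z']+")))
  sort(table(words[nzchar(words)]), decreasing = TRUE)
})
names(freqs) <- speakers
freqs[["HERB KOHL"]]  # frequency table for the first speaker
```

Each element of freqs is a named frequency table, so the results can be combined into a matrix or data frame keyed by speaker for further analysis.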

柒夜笙歌凉 2025-01-02 07:55:48

This is how I'd approach it using Ben's example (use qdap to parse and create a dataframe and then convert to a Corpus with 3 documents; note that qdap was designed for transcript data like this and a Corpus may not be the best data format):

library(qdap)
dat <- unlist(strsplit(x, "\\n"))

locs <- grep("STATEMENT OF ", dat)
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2)
dat[locs] <- "SPLIT_HERE"
corp <- with(data.frame(person=nms, dialogue = 
    Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))),
    df2tm_corpus(dialogue, person))

tm::inspect(corp)

## A corpus with 3 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## 

## $`SENATOR HERB KOHL`
## The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon. In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings.
##
## $`SENATOR BIG APPLE KOHL`
## I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech
##
## $`SENATOR LITTLE ORANGE`
## Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.