Splitting text in R with the tm package: identifying speakers
I am trying to find the most frequently used words in congressional speeches, and I need to break them down by member of Congress. I am just starting to learn R and the tm package. I have code that finds the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?
The text looks like this:
OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
[....]
STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER
[....]
I would like to be able to extract these names, or split the text by speaker. I hope you can help. Thanks a lot.
Comments (2)
Would it be correct to say that you want to split the file so that you have one text object per speaker, and then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc., on each object and put them in a table where the row or column names are the speakers' names.
If so, call your text x and use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, then use grep() or str_extract() (from the stringr package) to return the 2 or 3 words after SENATOR (do they always have only two names, as in your example?).
Have a look here for more on the use of these functions, and on text manipulation in R in general: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
UPDATE Here's a more complete answer...
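The split-and-extract idea described above could be sketched in base R roughly as follows. This is not the answerer's original code, only an illustration under stated assumptions: every speech begins with a heading that contains STATEMENT OF SENATOR, and the name runs up to the first comma. The names x and y follow the answer; parts, lines, headings, speakers and speech are hypothetical helper names.

```r
# Sample transcript built from the excerpt in the question. In practice x
# would be read from a file, e.g. paste(readLines(path), collapse = "\n").
x <- paste(
  "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN",
  "The Chairman. Good afternoon to everybody, and thank you",
  "very much for coming to this hearing this afternoon.",
  "STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER",
  "[....]",
  sep = "\n"
)

# Split on the literal heading text. The first piece is whatever precedes
# the first heading ("OPENING " here), so it is dropped.
parts <- strsplit(x, "STATEMENT OF", fixed = TRUE)[[1]]
y <- trimws(parts[-1])

# The first line of each chunk is the rest of the heading,
# e.g. "SENATOR HERB KOHL, CHAIRMAN".
lines <- strsplit(y, "\n", fixed = TRUE)
headings <- vapply(lines, function(l) l[1], character(1))

# Assumption: the name is everything between "SENATOR " and the first comma.
speakers <- sub("^SENATOR ([^,]+).*$", "\\1", headings)

# The remaining lines of each chunk form the speech, named by speaker.
speech <- vapply(lines, function(l) paste(l[-1], collapse = " "), character(1))
names(speech) <- speakers
```

Headings that use another title (for example REPRESENTATIVE) or that contain no comma would need a broader pattern than the one assumed here.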
No doubt a regular expressions wizard could come up with something to do it quicker and neater!
Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech) and then make another object that combines the word frequency results with the names for further analysis.
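One possible shape for that per-speaker word count, again in base R, is sketched below. It assumes y is a character vector with one speech per element, named by speaker. The function word_freq and the table freq_table are hypothetical names, and the tokenizer is deliberately crude (it lower-cases the text and splits on anything that is not a letter or an apostrophe).

```r
# One speech per speaker, named by speaker. The first entry comes from the
# excerpt in the question; the second is invented placeholder text, because
# the question elides that speech.
y <- c(
  "HERB KOHL" = "Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon.",
  "MEL MARTINEZ" = "Placeholder text standing in for the elided speech text."
)

# Count the words in a single speech, most frequent first.
word_freq <- function(txt) {
  words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by the split
  sort(table(words), decreasing = TRUE)
}

# One frequency table per speaker; lapply keeps the speaker names.
freqs <- lapply(y, word_freq)

# Combine into a long table with one row per speaker/word pair.
freq_table <- do.call(rbind, lapply(names(freqs), function(s) {
  data.frame(
    speaker = s,
    word = names(freqs[[s]]),
    n = as.vector(freqs[[s]]),
    stringsAsFactors = FALSE
  )
}))
```

With the tm package, building a DocumentTermMatrix() from a corpus that holds one document per speaker is an alternative route to the same counts.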
This is how I'd approach it using Ben's example (use qdap to parse the text and create a data frame, then convert it to a Corpus with 3 documents; note that qdap was designed for transcript data like this, and a Corpus may not be the best data format):