Extract items from n-line chunks in a file, count item frequency per chunk, Python
I have a text file containing 5-line chunks of tab-delimited lines:
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
etc.
In each chunk, the DESCRIPTION and SENTENCE columns are the same. The data of interest is in the ITEMS column, which is different for each line in the chunk and has the following format:
word1, word2, word3
...and so on
For each 5-line chunk, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line chunk were as follows:
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
1 \t DESCRIPTION \t SENTENCE \t word4
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
then the correct output for this 5-line chunk would be
1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
I.e., the chunk number, followed by the sentence, followed by the frequency counts for the words.
I have some code to extract the five-line chunks and to count the frequency of words in a chunk once it's extracted, but am stuck on the task of isolating each chunk, getting the word frequencies, moving on to the next, etc.
from itertools import groupby

def GetFrequencies(file):
    file_contents = open(file).readlines()  # file as list
    """use zip to get the entire file as a list of 5-line chunk tuples"""
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments:  # for each 5-line chunk...
        for sentence in chunk:  # ...and for each sentence in that chunk
            words = sentence.split('\t')[3].split()  # get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  # get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma]  # get rid of the whitespace left by the removed commas
    """STUCK HERE The idea originally was to take the words lists for
    each chunk and combine them to create a big list, 'collection', and
    feed this into the for-loop below."""
    for key, group in groupby(collection):  # collection is a big list of all the words in the chunk's ITEMS column, e.g. ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', ...]
        print key, len(list(group)),
4 Answers
Using python 2.7
Outputs:
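(The code block and output for this answer were lost in extraction. A sketch of what an answer like this might look like, rewritten for Python 3; the function name and the use of collections.Counter with itertools.groupby are my own choices, not the original author's:)

```python
from collections import Counter
from itertools import groupby

def get_frequencies(path):
    """Return a list of (chunk_id, sentence, {word: count}) per chunk."""
    with open(path) as f:
        rows = [line.rstrip('\n').split('\t') for line in f if line.strip()]
    results = []
    # group consecutive rows by the chunk number in column 0
    for chunk_id, chunk_rows in groupby(rows, key=lambda r: r[0]):
        chunk_rows = list(chunk_rows)
        sentence = chunk_rows[0][2]  # SENTENCE is the same within a chunk
        counts = Counter()
        for row in chunk_rows:
            # ITEMS column looks like "word1, word2, word3"
            counts.update(w.strip() for w in row[3].split(',') if w.strip())
        results.append((chunk_id, sentence, dict(counts)))
    return results
```

Grouping on the chunk-number column means this also works if a chunk has fewer or more than 5 lines.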
There's a csv parser in the standard library that can handle the input splitting for you.
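(The answer's code block is missing. A sketch of the csv-based approach it describes; the function name and the dict layout are assumptions:)

```python
import csv
from collections import defaultdict

def count_items(path):
    """Map chunk id -> (sentence, {word: count}), using csv to split columns."""
    freqs = {}
    with open(path, newline='') as f:
        # csv.reader with a tab delimiter yields one list of fields per line
        for num, desc, sentence, items in csv.reader(f, delimiter='\t'):
            chunk = freqs.setdefault(num, (sentence, defaultdict(int)))
            for word in items.split(','):
                chunk[1][word.strip()] += 1
    return freqs
```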
Edited your code a little bit, I think it does what you want it to do:
Output:
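(The edited code and its output were lost in extraction. A reconstruction of the likely fix, shown here in Python 3: build `collection` per chunk and sort it before groupby, since groupby only merges adjacent equal items:)

```python
from itertools import groupby

def GetFrequencies(path):
    with open(path) as f:
        file_contents = f.readlines()
    # zip an iterator with itself 5 times -> tuples of 5 consecutive lines
    five_line_increments = zip(*[iter(file_contents)] * 5)
    for chunk in five_line_increments:
        collection = []  # all ITEMS words for this chunk
        for sentence in chunk:
            items = sentence.split('\t')[3]  # ITEMS column at index 3
            collection.extend(w.strip() for w in items.split(','))
        # sort first: groupby only groups adjacent equal elements
        for key, group in groupby(sorted(collection)):
            print(key, len(list(group)), end=' ')
        print()
```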
To summarize: You want to append all "words" to a collection if they are not "DESCRIPTION" or "SENTENCE"? Try this:
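(This answer's snippet is also missing; a guess at the intent, collecting every token that is not one of the fixed columns and counting it. The function name is hypothetical:)

```python
from collections import Counter

def chunk_word_counts(lines):
    """Count ITEMS words over a chunk's lines, skipping the fixed columns."""
    counts = Counter()
    for line in lines:
        # flatten tabs and commas, then keep only the interesting tokens
        for token in line.replace('\t', ' ').replace(',', ' ').split():
            if token not in ('DESCRIPTION', 'SENTENCE') and not token.isdigit():
                counts[token] += 1
    return counts
```

This only works verbatim on files where the literal words DESCRIPTION and SENTENCE appear; for real data you would filter by column position instead, as in the other answers.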