Extracting items from n-line chunks in a file and counting item frequencies per chunk, in Python



I have a text file containing 5-line chunks of tab-delimited lines:

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

etc.

In each chunk, the DESCRIPTION and SENTENCE columns are the same. The data of interest is in the ITEMS column which is different for each line in the chunk and is in the following format:

word1, word2, word3

...and so on

For each 5-line chunk, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line chunk was as follows

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3

 1 \t DESCRIPTION \t SENTENCE \t word1, word2

 1 \t DESCRIPTION \t SENTENCE \t word4

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3

 1 \t DESCRIPTION \t SENTENCE \t word1, word2

then the correct output for this 5-line chunk would be

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)

I.e., the chunk number, followed by the sentence, followed by the frequency counts for the words.
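For reference, the whole transformation can also be sketched with collections.Counter and itertools.groupby keyed on the chunk number (a minimal sketch, assuming Python 3, tab-only delimiters, and a hypothetical input file named data.txt):

from collections import Counter
from itertools import groupby

def chunk_frequencies(path):
    """Yield (chunk_id, sentence, Counter) for each run of lines that
    share the same value in the first column."""
    with open(path) as f:
        rows = ([field.strip() for field in line.split('\t')]
                for line in f if line.strip())
        # groupby merges *consecutive* rows with equal keys, which matches
        # the layout of 5-line blocks per chunk number
        for chunk_id, chunk_rows in groupby(rows, key=lambda row: row[0]):
            counts = Counter()
            sentence = ''
            for row in chunk_rows:
                sentence = row[2]  # SENTENCE repeats within a chunk
                counts.update(w.strip() for w in row[3].split(','))  # ITEMS
            yield chunk_id, sentence, counts

for chunk_id, sentence, counts in chunk_frequencies('data.txt'):
    freqs = ', '.join('%s: %d' % (w, n) for w, n in counts.most_common())
    print('%s, %s, (%s)' % (chunk_id, sentence, freqs))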

I have some code to extract the five-line chunks and to count the frequency of words in a chunk once it's extracted, but am stuck on the task of isolating each chunk, getting the word frequencies, moving on to the next, etc.

from itertools import groupby 

def GetFrequencies(file):
    file_contents = open(file).readlines()  #file as list
    """use zip to get the entire file as list of 5-line chunk tuples""" 
    five_line_increments = zip(*[iter(file_contents)]*5) 
    for chunk in five_line_increments:  #for each 5-line chunk... 
        for sentence in chunk:          #...and for each sentence in that chunk
            words = sentence.split('\t')[3].split() #get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  #get rid of the commas
            words_no_ws = [x.strip(' ')for x in words_no_comma] #get rid of the whitespace resulting from the removed commas


       """STUCK HERE   The idea originally was to take the words lists for 
       each chunk and combine them to create a big list, 'collection,' and
       feed this into the for-loop below."""





    for key, group in groupby(collection): #collection is a big list containing all of the words in the ITEMS section of the chunk, e.g, ['word1', 'word2', word3', 'word1', 'word1', 'word2', etc.]
        print key,len(list(group)),    
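For what it's worth, the zip-based chunking above can be completed along these lines (a sketch; get_frequencies is a renamed variant, and the essential fix is sorting each chunk's word list first, since groupby only merges adjacent equal items):

from itertools import groupby

def get_frequencies(path):
    with open(path) as f:
        file_contents = f.readlines()
    five_line_increments = zip(*[iter(file_contents)] * 5)
    for chunk in five_line_increments:
        collection = []  # every ITEMS word in this 5-line chunk
        for line in chunk:
            fields = line.split('\t')
            collection.extend(w.strip(' ,\n') for w in fields[3].split(','))
        collection.sort()  # groupby only groups consecutive equal items
        freqs = ', '.join('%s: %d' % (key, len(list(group)))
                          for key, group in groupby(collection))
        print('%s, %s, (%s)' % (fields[0].strip(), fields[2].strip(), freqs))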


4 Answers

追我者格杀勿论 2024-12-08 03:26:25


Using Python 2.7:

#!/usr/bin/env python

import collections

chunks = {}

with open('input') as fd:
    for line in fd:
        line = line.split()
        if not line:
            continue
        if line[0] not in chunks:
            chunks[line[0]] = [line[2]]  # first element stores the SENTENCE
        for i in line[3:]:
            chunks[line[0]].append(i.replace(',', ''))  # collect the ITEMS words

for k, v in chunks.iteritems():
    counter = collections.Counter(v[1:])
    print k, v[0], counter

Outputs:

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})
万人眼中万个我 2024-12-08 03:26:25


There's a csv parser in the standard library that can handle the input splitting for you:

import csv
import collections

def GetFrequencies(file_in):
    sentences = dict()
    # csv.reader is not a context manager, so keep the open() in the with block
    with open(file_in, 'rb') as f:
        csv_file = csv.reader(f, delimiter='\t')
        for line in csv_file:
            chunk_id = line[0]  # the chunk number in column 0
            if chunk_id not in sentences:
                sentences[chunk_id] = collections.Counter()
            sentences[chunk_id].update([x.strip(' ') for x in line[3].split(',')])
    return sentences
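The function above only builds the per-chunk counters (keyed, despite the dict's name, by the chunk number in column 0); with the return added, a short usage sketch in the same Python 2 style could be:

freqs = GetFrequencies('input')
for chunk_id in sorted(freqs):
    print chunk_id, freqs[chunk_id]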
雾里花 2024-12-08 03:26:25


Edited your code a little bit, I think it does what you want it to do:

file_contents = open(file).readlines()  #file as list
"""use zip to get the entire file as list of 5-line chunk tuples"""
five_line_increments = zip(*[iter(file_contents)]*5)
for chunk in five_line_increments:  #for each 5-line chunk...
    word_freq = {} #word frequencies for each chunk
    for sentence in chunk:          #...and for each sentence in that chunk
        words = sentence.split('\t')[3].strip('\n').split(', ') #get the ITEMS column at index 3 as a list of words
        for word in words:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1

    print word_freq

Output:

{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4}
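To emit the requested "chunk, SENTENCE, (counts)" line instead, the print at the end of the chunk loop could be replaced with something like this (a sketch; chunk[0] is the first line of the chunk, from which the chunk number and SENTENCE are taken):

    fields = chunk[0].split('\t')
    freqs = ', '.join('%s: %d' % (w, n) for w, n in word_freq.items())
    print '%s, %s, (%s)' % (fields[0].strip(), fields[2].strip(), freqs)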
旧情别恋 2024-12-08 03:26:25


To summarize: You want to append all "words" to a collection if they are not "DESCRIPTION" or "SENTENCE"? Try this:

for word in words_no_ws:
    if word not in ("DESCRIPTION", "SENTENCE"):
        collection.append(word)
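One caveat if this feeds the asker's final groupby loop: groupby only groups consecutive equal elements, so the collection needs to be sorted first, e.g.:

for key, group in groupby(sorted(collection)):
    print key, len(list(group)),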