Converting sentences in a file into word tokens in a list

Posted 2024-12-18 13:40:28

I'm using Python to convert the words in the sentences of a text file into individual tokens in a list, for the purpose of counting up word frequencies. I'm having trouble converting the different sentences into a single list. Here's what I do:

f = open('music.txt', 'r')
sent = [word.lower().split() for word in f]

That gives me the following list:

[['party', 'rock', 'is', 'in', 'the', 'house', 'tonight'],
 ['everybody', 'just', 'have', 'a', 'good', 'time'],...]

Since the sentences in the file were in separate lines, it returns this list of lists and defaultdict can't identify the individual tokens to count up.
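(The question doesn't show the counting code, so the following is only a guess at the failing loop: a defaultdict iterating over sent sees whole sentences rather than words, and lists aren't even hashable, so indexing by them raises an error.)

from collections import defaultdict

counts = defaultdict(int)
for item in sent:        # each item is a whole sentence (a list of words), not a word
    counts[item] += 1    # raises TypeError: unhashable type: 'list'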

I tried the following list comprehension to isolate the tokens in the different lists and return them to a single list, but it returns an empty list instead:

sent2 = [[w for w in word] for word in sent]

Is there a way to do this using list comprehensions? Or perhaps another easier way?

Comments (3)

转身以后 2024-12-25 13:40:28

Just use a nested loop inside the list comprehension:

sent = [word for line in f for word in line.lower().split()]

There are some alternatives to this approach, for example using itertools.chain.from_iterable(), but I think the nested loop is much easier in this case.
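For reference, a minimal sketch of the itertools.chain.from_iterable() alternative mentioned above, assuming the same music.txt file from the question:

from itertools import chain

with open('music.txt') as f:
    # Each line yields a list of lowercase tokens; chain flattens them into one stream.
    sent = list(chain.from_iterable(line.lower().split() for line in f))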

戒ㄋ 2024-12-25 13:40:28

Just read the entire file into memory as a single string, and apply split once to that string.
There is no need to read the file line by line in such a case.

Therefore your code can be as short as:

sent = open("music.txt").read().split()

(A few niceties, like closing the file and checking for errors, make the code a little larger, of course.)
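A minimal sketch of those niceties, using a with block so the file is closed automatically (lower() is added here because the question wants lowercase tokens):

try:
    with open("music.txt") as f:    # file is closed automatically on exit
        sent = f.read().lower().split()
except OSError as e:                # e.g. the file is missing or unreadable
    print(f"could not read music.txt: {e}")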

Since you want to be counting word frequencies, you can use the collections.Counter class for that:

from collections import Counter
counter = Counter()
for word in open("music.txt").read().split():
    counter[word] += 1
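As a side note, Counter can also consume an iterable of tokens directly, so the loop above can likely be collapsed into a single line:

from collections import Counter
counter = Counter(open("music.txt").read().split())
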
放肆 2024-12-25 13:40:28

List comprehensions can do the job but will accumulate everything in memory. For large inputs this could be an unacceptable cost. The below solution will not accumulate large amounts of data in memory, even for large files. The final product is a dictionary of the form {token: occurrences}.

import itertools

def distinct_tokens(filename):
    tokendict = {}
    f = open(filename, 'r')
    # Lazily map each line to its list of lowercase tokens
    # (Python 3's built-in map replaces itertools.imap; lower() needs its parentheses).
    tokens = map(lambda line: line.lower().split(), f)
    # chain.from_iterable flattens the per-line token lists into one token stream.
    for tok in itertools.chain.from_iterable(tokens):
        if tok in tokendict:
            tokendict[tok] += 1
        else:
            tokendict[tok] = 1
    f.close()
    return tokendict
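
A quick usage sketch, assuming the music.txt file from the question:

counts = distinct_tokens('music.txt')
print(counts.get('party', 0))   # how many times 'party' occurs in the file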