Iterating over the words of a file in Python

Published 2024-12-09 09:43:55

I need to iterate through the words of a large file, which consists of a single, long long line. I am aware of methods iterating through the file line by line, however they are not applicable in my case, because of its single line structure.

Any alternatives?


Comments (8)

碍人泪离人颜 2024-12-16 09:43:55

It really depends on your definition of a word. But try this:

text = open("your-filename-here").read()
for word in text.split():
    # do something with word
    print(word)

This will use whitespace characters as word boundaries.

Of course, remember to properly open and close the file; this is just a quick example.
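
A tidier variant, shown as a minimal sketch (the filename is a placeholder), lets a with block close the file automatically:

with open("your-filename-here") as fh:   # placeholder filename
    for word in fh.read().split():
        print(word)   # or do something else with each word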

鼻尖触碰 2024-12-16 09:43:55

Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.

First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.

If not, use something like this generator:

def read_words(input_file):
    # Buffer the file in fixed-size chunks and emit space-separated words
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
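
A usage sketch for the function above (the filename is made up):

with open('one-long-line.txt') as input_file:   # hypothetical filename
    for word in read_words(input_file):
        print(word)
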
貪欢 2024-12-16 09:43:55

You really should consider using a generator:

def word_gen(file):
    # Lazily yield each whitespace-separated word from the file object
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    for word in word_gen(f):
        print(word)

铜锣湾横着走 2024-12-16 09:43:55

There are more efficient ways of doing this, but syntactically, this might be the shortest:

 words = open('myfile').read().split()

If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
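
If memory does matter, a chunked generator is one alternative. The sketch below is not part of the original answer; the function name and chunk size are arbitrary:

from functools import partial

def iter_words(path, chunk_size=4096):
    # Yield whitespace-separated words without loading the whole file
    leftover = ''
    with open(path) as f:
        # read() returns '' at end of file, which stops iter()
        for chunk in iter(partial(f.read, chunk_size), ''):
            parts = (leftover + chunk).split()
            if chunk[-1].isspace():
                # The chunk ended on whitespace, so every part is a whole word
                leftover = ''
            else:
                # The last part may be cut off mid-word; carry it over
                leftover = parts.pop() if parts else ''
            yield from parts
    if leftover:
        yield leftover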

荒路情人 2024-12-16 09:43:55

I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):

Here is my totally functional approach, which avoids having to read and split lines. It makes use of the itertools module:

Note: for Python 3, replace itertools.imap with map (a full Python 3 version is sketched after the function below).

import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
      itertools.takewhile(lambda c: bool(c),
          itertools.imap(mfile.read,
              itertools.repeat(1))), str.isspace)

    return ("".join(group) for pred, group in byte_stream if not pred)
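
Applying that note, a Python 3 sketch of the same function (not part of the original answer) might look like:

import itertools

def readwords(mfile):
    # Read one character at a time until read() returns '' at end of file,
    # then group consecutive characters by whether they are whitespace.
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
            map(mfile.read, itertools.repeat(1))), str.isspace)

    # Join each run of non-whitespace characters back into a word
    return ("".join(group) for pred, group in byte_stream if not pred)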

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
           
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
月竹挽风 2024-12-16 09:43:55

Read in the line as normal, then split it on whitespace to break it down into words?

Something like:

word_list = loaded_string.split()
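
Spelled out for a single-line file, that might look like this minimal sketch (the filename is hypothetical):

with open('myfile.txt') as f:    # hypothetical filename
    loaded_string = f.read()     # the file's single long line

word_list = loaded_string.split()
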
空城之時有危險 2024-12-16 09:43:55

After reading the line (into a variable named line, say, rather than shadowing the built-in str) you could do:

l = len(pattern)    # pattern is the word being searched for
i = 0
while True:
    i = line.find(pattern, i)    # line holds the text read from the file
    if i == -1:
        break
    print(line[i:i+l])  # or do whatever
    i += l
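
Wrapped into a small reusable generator, as a sketch (the example text and pattern are made up):

def find_occurrences(text, pattern):
    # Scan left to right, yielding each occurrence of pattern in text
    i = 0
    while True:
        i = text.find(pattern, i)
        if i == -1:
            return
        yield text[i:i + len(pattern)]
        i += len(pattern)

print(list(find_occurrences("spam eggs spam ham", "spam")))   # ['spam', 'spam']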

Alex.

念﹏祤嫣 2024-12-16 09:43:55

What Donald Miner suggested looks good. Simple and short. I used the code below in something I wrote some time ago:

l = []
with open("filename.txt") as f:   # "rU" mode is deprecated; universal newlines are the default
    for line in f:
        for word in line.split():
            l.append(word)

A longer version of what Donald Miner suggested.
