Iterating over the words of a file in Python

Published 2024-12-09 09:43:55

I need to iterate through the words of a large file, which consists of a single, long long line. I am aware of methods iterating through the file line by line, however they are not applicable in my case, because of its single line structure.

Any alternatives?


Comments (8)

碍人泪离人颜 2024-12-16 09:43:55

It really depends on your definition of a word. But try this:

text = open("your-filename-here").read()
for word in text.split():
    # do something with word
    print(word)

This will use whitespace characters as word boundaries.

Of course, remember to properly open and close the file; this is just a quick example.
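
A tidier variant, shown as a minimal sketch (the filename is a placeholder), lets a with block close the file automatically:

with open("your-filename-here") as fh:   # placeholder filename
    for word in fh.read().split():
        print(word)   # or do something else with each word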

鼻尖触碰 2024-12-16 09:43:55

Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.

First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.

If not, use something like this generator:

def read_words(input_file):
    # Buffer the file in fixed-size chunks and emit space-separated words
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
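
A usage sketch for the function above (the filename is made up):

with open('one-long-line.txt') as input_file:   # hypothetical filename
    for word in read_words(input_file):
        print(word)
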
貪欢 2024-12-16 09:43:55

You really should consider using a generator:

def word_gen(file):
    # Lazily yield each whitespace-separated word from the file object
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    for word in word_gen(f):
        print(word)

铜锣湾横着走 2024-12-16 09:43:55

There are more efficient ways of doing this, but syntactically, this might be the shortest:

 words = open('myfile').read().split()

If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
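
If memory does matter, a chunked generator is one alternative. The sketch below is not part of the original answer; the function name and chunk size are arbitrary:

from functools import partial

def iter_words(path, chunk_size=4096):
    # Yield whitespace-separated words without loading the whole file
    leftover = ''
    with open(path) as f:
        # read() returns '' at end of file, which stops iter()
        for chunk in iter(partial(f.read, chunk_size), ''):
            parts = (leftover + chunk).split()
            if chunk[-1].isspace():
                # The chunk ended on whitespace, so every part is a whole word
                leftover = ''
            else:
                # The last part may be cut off mid-word; carry it over
                leftover = parts.pop() if parts else ''
            yield from parts
    if leftover:
        yield leftover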

荒路情人 2024-12-16 09:43:55

I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):

Here is my totally functional approach, which avoids having to read and split lines. It makes use of the itertools module:

Note: for Python 3, replace itertools.imap with map (a full Python 3 version is sketched after the function below).

import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
      itertools.takewhile(lambda c: bool(c),
          itertools.imap(mfile.read,
              itertools.repeat(1))), str.isspace)

    return ("".join(group) for pred, group in byte_stream if not pred)
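
Applying that note, a Python 3 sketch of the same function (not part of the original answer) might look like:

import itertools

def readwords(mfile):
    # Read one character at a time until read() returns '' at end of file,
    # then group consecutive characters by whether they are whitespace.
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
            map(mfile.read, itertools.repeat(1))), str.isspace)

    # Join each run of non-whitespace characters back into a word
    return ("".join(group) for pred, group in byte_stream if not pred)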

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
           
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
月竹挽风 2024-12-16 09:43:55

Read in the line as normal, then split it on whitespace to break it down into words?

Something like:

word_list = loaded_string.split()
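
Spelled out for a single-line file, that might look like this minimal sketch (the filename is hypothetical):

with open('myfile.txt') as f:    # hypothetical filename
    loaded_string = f.read()     # the file's single long line

word_list = loaded_string.split()
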
空城之時有危險 2024-12-16 09:43:55

After reading the line (into a variable named line, say, rather than shadowing the built-in str) you could do:

l = len(pattern)    # pattern is the word being searched for
i = 0
while True:
    i = line.find(pattern, i)    # line holds the text read from the file
    if i == -1:
        break
    print(line[i:i+l])  # or do whatever
    i += l
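
Wrapped into a small reusable generator, as a sketch (the example text and pattern are made up):

def find_occurrences(text, pattern):
    # Scan left to right, yielding each occurrence of pattern in text
    i = 0
    while True:
        i = text.find(pattern, i)
        if i == -1:
            return
        yield text[i:i + len(pattern)]
        i += len(pattern)

print(list(find_occurrences("spam eggs spam ham", "spam")))   # ['spam', 'spam']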

Alex.

念﹏祤嫣 2024-12-16 09:43:55

What Donald Miner suggested looks good. Simple and short. I used the code below in something I wrote some time ago:

l = []
with open("filename.txt") as f:   # "rU" mode is deprecated; universal newlines are the default
    for line in f:
        for word in line.split():
            l.append(word)

A longer version of what Donald Miner suggested.
