Python line-numbers text-files

Is there an easy way to tell which line number a file pointer is on?

Posted on 2024-11-16 01:39:48

In Python 2.5, I am reading a structured text data file (~30 MB in size) using a file pointer:

fp = open('myfile.txt', 'r')
line = fp.readline()
# ... many other fp.readline() processing steps, which
# are used in different contexts to read the structures

But then, while parsing the file, I hit something interesting that I want to report the line number of, so I can investigate the file in a text editor. I can use fp.tell() to tell me where the byte offset is (e.g. 16548974L), but there is no "fp.tell_line_number()" to help me translate this to a line number.

Is there either a Python built-in or extension to easily track and "tell" what line number a text file pointer is on?

Note: I'm not asking to use a line_number += 1 style counter, as I call fp.readline() in different contexts and that approach would require more debugging than it is worth to insert the counter in the right corners of the code.

束缚ｍ 2024-11-23 01:39:48

A typical solution to this problem is to define a new class that wraps an existing file instance and automatically counts the lines as they are read. Something like this (just off the top of my head, I haven't tested this):

class FileLineWrapper(object):
    def __init__(self, f):
        self.f = f
        self.line = 0
    def close(self):
        return self.f.close()
    def readline(self):
        line = self.f.readline()
        if line:  # readline() returns '' at end of file; don't count that
            self.line += 1
        return line
    # to allow using in 'with' statements 
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

Use it like this:

f = FileLineWrapper(open("myfile.txt", "r"))
f.readline()
print(f.line)

It looks like the standard module fileinput does much the same thing (and some other things as well); you could use that instead if you like.
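
The wrapper above only counts lines read through readline(). Below is a minimal sketch of a variant, written for Python 3 rather than the Python 2.5 of the question, that also counts lines consumed by iteration and forwards everything else (tell(), seek(), and so on) to the wrapped file. The class name CountingFile is hypothetical and not part of any library, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io

class CountingFile:
    """Wrap a file object and count every line handed out (hypothetical helper)."""

    def __init__(self, f):
        self._f = f
        self.line = 0  # number of lines read so far

    def readline(self):
        text = self._f.readline()
        if text:  # an empty result means end of file, so nothing to count
            self.line += 1
        return text

    def __iter__(self):
        return self

    def __next__(self):
        text = self.readline()
        if not text:
            raise StopIteration
        return text

    def __getattr__(self, name):
        # Anything not defined here (tell, seek, close, ...) goes to the real file.
        return getattr(self._f, name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._f.close()


f = CountingFile(io.StringIO("first\nsecond\nthird\n"))
f.readline()      # one line read via readline()
for text in f:    # one more line read via iteration
    break
print(f.line)     # prints 2
```

Because both access paths go through the same readline() method, the counter stays correct however the calling code mixes them.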

自找没趣 2024-11-23 01:39:48

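You might find the fileinput module useful. It provides a generalized interface for iterating over an arbitrary number of files. Some relevant highlights from the docs:

fileinput.lineno()

Return the cumulative line number of the line that has just been read. Before the first line has been read, returns 0. After the last line of the last file has been read, returns the line number of that line.

fileinput.filelineno()

Return the line number in the current file. Before the first line has been read, returns 0. After the last line of the last file has been read, returns the line number of that line within the file.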
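
As an illustration of the two counters described above, here is a small Python 3 sketch. The two temporary files and their contents are made up for the example; only fileinput.input(), fileinput.lineno() and fileinput.filelineno() come from the module itself.

```python
import fileinput
import os
import tempfile

# Two small throwaway files so the example is self-contained.
tmpdir = tempfile.mkdtemp()
paths = []
for name, body in (("a.txt", "a1\na2\n"), ("b.txt", "b1\nb2\nb3\n")):
    path = os.path.join(tmpdir, name)
    with open(path, "w") as out:
        out.write(body)
    paths.append(path)

seen = []
with fileinput.input(files=paths) as stream:
    for line in stream:
        # lineno() keeps counting across files; filelineno() restarts per file.
        seen.append((fileinput.lineno(), fileinput.filelineno(), line.rstrip("\n")))

for entry in seen:
    print(entry)
```

The first number in each tuple runs from 1 to 5 across both files, while the second restarts at 1 when the second file begins.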

银河中√捞星星 2024-11-23 01:39:48

The following code prints the line number (where the pointer currently is) while traversing the file ('testfile'):

f = open("testfile", "r")
for line_no, line in enumerate(f):
    print line_no + 1     # enumerate() starts at 0; the content of the line is in variable 'line'
f.close()

Output:

1
2
3
...

葮薆情 2024-11-23 01:39:48

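I don't think so, not in the way you want (as a standard built-in feature of the Python file handle returned by open).

If you don't want to track the line number manually as you read lines, or to use a wrapper class (good suggestions from GregH and senderle, by the way), then I think you will simply have to take the fp.tell() figure, go back to the start of the file, and read until you get there.

That isn't too bad an option, since I assume error conditions are less likely than everything running smoothly. If everything works fine, there is no impact.

If there is an error, you incur the extra effort of rescanning the file. If the file is big, that may affect your perceived performance; you should take that into account if it is a problem.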
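
The rescan described above could look roughly like the following Python 3 sketch. The function name line_number_at_offset and the demo file are assumptions made for the example; the idea is simply to count the newline bytes that precede the offset reported by fp.tell(). It assumes the file was opened in binary mode so that the offset is a plain byte count.

```python
import os
import tempfile

def line_number_at_offset(path, offset, chunk_size=65536):
    """Return the 1-based number of the line that contains byte `offset`."""
    newlines = 0
    remaining = offset
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:  # offset lies beyond the end of the file
                break
            newlines += chunk.count(b"\n")
            remaining -= len(chunk)
    return newlines + 1

# Demo file, made up for the example.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"alpha\nbeta\ngamma\n")

with open(demo_path, "rb") as fp:
    fp.readline()
    fp.readline()
    offset = fp.tell()  # the pointer now sits at the start of the third line

print(line_number_at_offset(demo_path, offset))  # prints 3
```

Note that after a readline() the pointer is already at the start of the next line, so the line that was just read is the returned number minus one.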

简单气质女生网名 2024-11-23 01:39:48

Open the file using the with context manager and enumerate the lines in a for loop.

with open('file_name.ext', 'r') as f:
    for line_num, line in enumerate(f):
        pass  # line_num is the 0-based index of 'line'

不气馁 2024-11-23 01:39:48

One way might be to iterate over the lines and keep an explicit count of the number of lines already seen:

>>> from itertools import izip, count
>>> f = open('test.java', 'r')
>>> for line_no, line in izip(count(), f):
...     print line_no, line

唐婉 2024-11-23 01:39:48

The following code creates a function Which_Line_for_Position(pos) that gives the number of the line for the position pos, that is to say, the number of the line containing the character located at position pos in the file.

This function can be used with any position as argument, independently of the current position of the file's pointer and of the history of this pointer's movements before the function is called.

So, with this function, one isn't limited to determining the number of the current line only during an uninterrupted iteration over the lines, as is the case with Greg Hewgill's solution.

with open(filepath,'rb') as f:
    GIVE_NO_FOR_END = {}
    end = 0
    for i,line in enumerate(f):
        end += len(line)
        GIVE_NO_FOR_END[end] = i
    if line[-1]=='\n':
        GIVE_NO_FOR_END[end+1] = i+1
    end_positions = GIVE_NO_FOR_END.keys()
    end_positions.sort()

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

The same solution can be written with the help of the module fileinput:

import fileinput

GIVE_NO_FOR_END = {}
end = 0
for line in fileinput.input(filepath, mode='rb'):
    end += len(line)
    # filelineno() counts from 1; subtract 1 to match the 0-based enumerate() version
    GIVE_NO_FOR_END[end] = fileinput.filelineno() - 1
if line[-1]=='\n':
    GIVE_NO_FOR_END[end+1] = fileinput.filelineno()
fileinput.close()

end_positions = GIVE_NO_FOR_END.keys()
end_positions.sort()

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

But this solution has some drawbacks:

it needs to import the module fileinput
the mode has to be passed as a keyword argument. Written as fileinput.input(filepath,'rb'), the call deletes all the content of the file!! The second positional parameter of fileinput.input() is inplace, not mode, so 'rb' switches on in-place editing and the file is emptied.
it seems that the file is first read entirely before any iteration can be launched. If so, for a very big file, the size of the file may exceed the capacity of the RAM. I am not sure of this point: I tried to test with a file of 1.5 GB but it took rather long and I dropped this point for the moment. If this point is right, it is an argument for using the other solution with enumerate()

Example:

text = '''Harold Acton (1904–1994)
Gilbert Adair (born 1944)
Helen Adam (1909–1993)
Arthur Henry Adams (1872–1936)
Robert Adamson (1852–1902)
Fleur Adcock (born 1934)
Joseph Addison (1672–1719)
Mark Akenside (1721–1770)
James Alexander Allan (1889–1956)
Leslie Holdsworthy Allen (1879–1964)
William Allingham (1824/28-1889)
Kingsley Amis (1922–1995)
Ethel Anderson (1883–1958)
Bruce Andrews (born 1948)
Maya Angelou (born 1928)
Rae Armantrout (born 1947)
Simon Armitage (born 1963)
Matthew Arnold (1822–1888)
John Ashbery (born 1927)
Thomas Ashe (1836–1889)
Thea Astley (1925–2004)
Edwin Atherstone (1788–1872)'''


#with open('alao.txt','rb') as f:

f = text.splitlines(True)
# argument True in splitlines() makes the newlines kept

GIVE_NO_FOR_END = {}
end = 0
for i,line in enumerate(f):
    end += len(line)
    GIVE_NO_FOR_END[end] = i
if line[-1]=='\n':
    GIVE_NO_FOR_END[end+1] = i+1
end_positions = GIVE_NO_FOR_END.keys()
end_positions.sort()


print '\n'.join('line %-3s  ending at position %s' % (str(GIVE_NO_FOR_END[end]),str(end))
                for end in end_positions)

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

print
for x in (2,450,320,104,105,599,600):
    print 'pos=%-6s   line %s' % (x,Which_Line_for_Position(x))

Result:

line 0    ending at position 25
line 1    ending at position 51
line 2    ending at position 74
line 3    ending at position 105
line 4    ending at position 132
line 5    ending at position 157
line 6    ending at position 184
line 7    ending at position 210
line 8    ending at position 244
line 9    ending at position 281
line 10   ending at position 314
line 11   ending at position 340
line 12   ending at position 367
line 13   ending at position 393
line 14   ending at position 418
line 15   ending at position 445
line 16   ending at position 472
line 17   ending at position 499
line 18   ending at position 524
line 19   ending at position 548
line 20   ending at position 572
line 21   ending at position 600

pos=2        line 0
pos=450      line 16
pos=320      line 11
pos=104      line 3
pos=105      line 4
pos=599      line 21
pos=600      line None

Then, having the function Which_Line_for_Position(), it is easy to obtain the number of the current line: just pass f.tell() as the argument to the function.

But WARNING: when using f.tell() and moving the file's pointer around in the file, it is absolutely necessary that the file is opened in binary mode: 'rb' or 'rb+' or 'ab' or ....
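
The same idea can be sketched in Python 3 with a sorted list of line start offsets and the standard bisect module, which replaces the linear scan over end positions with a binary search. The names build_line_starts and line_for_position and the demo file are hypothetical. Unlike the code above, this sketch returns None for any position at or past the end of the file, including the position just after a trailing newline.

```python
import bisect
import os
import tempfile

def build_line_starts(path):
    """Return sorted byte offsets where each line starts, plus the file size."""
    starts = [0]
    with open(path, "rb") as f:  # binary mode, so offsets are byte counts
        for line in f:
            starts.append(starts[-1] + len(line))
    return starts

def line_for_position(starts, pos):
    """Return the 0-based line index containing byte `pos`, or None if out of range."""
    if pos < 0 or pos >= starts[-1]:
        return None
    return bisect.bisect_right(starts, pos) - 1

# Demo file, made up for the example: three lines, no trailing newline.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"ab\ncd\nef")

starts = build_line_starts(demo_path)
for pos in (0, 2, 3, 7, 8):
    print(pos, line_for_position(starts, pos))
```

The index is built once in a single pass over the file; each lookup afterwards costs O(log n) in the number of lines.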

智商已欠费 2024-11-23 01:39:48

I was messing around with a similar problem recently and came up with this class-based solution.

class TextFileProcessor(object):

    def __init__(self, path_to_file):
        self.print_line_mod_number = 0
        self.__path_to_file = path_to_file
        self.__line_number = 0

    def __printLineNumberMod(self):
        if self.print_line_mod_number != 0:
            if self.__line_number % self.print_line_mod_number == 0:
                print(self.__line_number)

    def processFile(self):
        with open(self.__path_to_file, 'r', encoding='utf-8') as text_file:
            for self.__line_number, line in enumerate(text_file, start=1):
                self.__printLineNumberMod()

                # do some stuff with line here.

Set the print_line_mod_number attribute to the cadence you want logged and then call processFile.

For example, if you want feedback every 100 lines, it would look like this:

tfp = TextFileProcessor('C:\\myfile.txt')
tfp.print_line_mod_number = 100
tfp.processFile()

The console output would be

100
200
300
400
etc...

隱形的亼 2024-11-23 01:39:48

Regarding the solution by @eyquem, I suggest using mode='r' with the fileinput module together with fileinput.lineno(); that has worked for me.

Here is how I am implementing these options in my code.

table = fileinput.input('largefile.txt', mode="r")
for line in table:
    # stop and temp_out are defined elsewhere in my code; you can disregard
    # the IF condition, I am posting it to illustrate the approach.
    if fileinput.lineno() >= stop:
        temp_out.close()
