Reading a flat file as a transpose, Python

Posted on 2024-10-19 09:23:25

I'm interested in reading fixed width text files in Python in as efficient a manner as I can. Specifically, most of the time I'm interested in one or more columns in the flat file but not entire records.

It strikes me as inefficient to read the file a line at a time and extract the desired columns after reading the entire line into memory. I think I'd rather have the option of reading only the desired columns, top to bottom, left to right (instead of reading left to right, top to bottom).

Is such a thing desirable, and if so, is it possible?

Comments (4)

_失温 2024-10-26 09:23:25

Files are laid out as a (one-dimensional) sequence of bits. 'Lines' are just a convenience we added to make things easy to read for humans. So, in general, what you're asking is not possible on plain files. To pull this off, you would need some way of finding where a record starts. The two most common ways are:

  • Search for newline symbols (in other words, read the entire file).
  • Use a specially spaced layout, so that each record is laid out using a fixed width. That way, you can use low-level file operations, like seek, to go directly to where you need to go. This avoids reading the entire file, but is painful to do manually.

I wouldn't worry too much about file reading performance unless it becomes a problem. Yes, you could memory map the file, but your OS probably already caches for you. Yes, you could use a database format (e.g., the sqlite3 file format through sqlalchemy), but it probably isn't worth the hassle.

Side note on "fixed width": what precisely do you mean by this? If you really mean "every column always starts at the same offset relative to the start of the record", then you can definitely use Python's seek to skip past data that you are not interested in.
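
To make the seek idea concrete, here is a minimal sketch. It assumes every record has exactly the same byte length (including a single "\n" terminator) and that the file is opened in binary mode so offsets are exact. The function name read_column, its parameters, and the sample layout (6-byte name, 3-byte age, 2-byte state) are all made up for illustration.

```python
import os
import tempfile

def read_column(path, record_len, offset, width):
    # record_len is the full record size in bytes, INCLUDING the line terminator.
    size = os.path.getsize(path)
    with open(path, "rb") as f:          # binary mode so offsets are exact bytes
        for start in range(0, size, record_len):
            f.seek(start + offset)       # jump straight to the field
            yield f.read(width)          # read only the field itself

# Hypothetical sample data: name (6 bytes), age (3), state (2), "\n" (1) = 12 bytes
sample = b"alice 030NY\nbob   041CA\ncarol 027TX\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

ages = [int(field) for field in read_column(tmp.name, record_len=12, offset=6, width=3)]
print(ages)  # [30, 41, 27]
os.remove(tmp.name)
```

Note that Python's default buffered file objects still fetch data from the OS in blocks, so with short records this will probably not reduce the amount of disk I/O much; the saving is more likely to show up when records are very long.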

逐鹿 2024-10-26 09:23:25

How big are the lines? Unless each record is huge, reading only the fields you're interested in rather than the whole line is likely to make little difference.

For big files with fixed formatting, you might get something out of mmapping the file. I've only done this with C rather than Python, but it seems like mmapping the file then accessing the appropriate fields directly is likely to be reasonably efficient.
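
In Python, the same approach might look roughly like the sketch below, using the standard mmap module. It assumes fixed-length records with a one-byte line terminator and a non-empty file (mmap refuses to map an empty file). The helper name mmap_column and the sample layout are hypothetical.

```python
import mmap
import os
import tempfile

def mmap_column(path, record_len, offset, width):
    # Map the whole file read-only and slice each field out of the mapping.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [mm[start + offset:start + offset + width]
                    for start in range(0, len(mm), record_len)]

# Hypothetical sample data: name (6 bytes), age (3), state (2), "\n" (1) = 12 bytes
sample = b"alice 030NY\nbob   041CA\ncarol 027TX\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

states = mmap_column(tmp.name, record_len=12, offset=9, width=2)
print(states)  # [b'NY', b'CA', b'TX']
os.remove(tmp.name)
```

Slicing the mapping returns ordinary bytes objects, and the OS only pages in the parts of the file that are actually touched.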

往事风中埋 2024-10-26 09:23:25

Flat files are not well suited to what you're trying to do. My suggestion is to convert the files to a SQL database (using sqlite3) and then read just the columns you want. SQLite3 is blazing fast.
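
A rough sketch of that conversion with the standard sqlite3 module follows. The table name people, its columns, and the sample records are invented for illustration, and an in-memory database stands in for a real file on disk.

```python
import sqlite3

# Hypothetical fixed-width records: name (6 chars), age (3), state (2)
lines = ["alice 030NY", "bob   041CA", "carol 027TX"]

conn = sqlite3.connect(":memory:")   # use a file path instead to persist the data
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    ((ln[0:6].strip(), int(ln[6:9]), ln[9:11]) for ln in lines),
)
conn.commit()

# Afterwards, pull out only the column of interest.
ages = [row[0] for row in conn.execute("SELECT age FROM people ORDER BY rowid")]
print(ages)  # [30, 41, 27]
conn.close()
```

The one-time conversion still has to read every line, so this pays off mainly when the same data is queried repeatedly.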

爱要勇敢去追 2024-10-26 09:23:25

If it's truly fixed width, then you should be able to just call read(N) to skip past the fixed number of bytes from the end of your column on one line to the start of it on the next.
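
As a sketch of this read(N) approach, the hypothetical helper below walks a binary stream strictly forward, which also works on streams that cannot seek. It assumes fixed-length records including a one-byte line terminator, and a stream whose read(n) returns n bytes unless it hits end of file (true for io.BytesIO and ordinary buffered files).

```python
import io

def column_from_stream(stream, record_len, offset, width):
    # Bytes between the end of the field on one line and its start on the next.
    skip = record_len - width
    stream.read(offset)              # discard everything before the first field
    fields = []
    while True:
        field = stream.read(width)
        if len(field) < width:       # end of data (or a truncated final record)
            break
        fields.append(field)
        stream.read(skip)            # discard the bytes up to the next field
    return fields

# Hypothetical sample data: name (6 bytes), age (3), state (2), "\n" (1) = 12 bytes
data = io.BytesIO(b"alice 030NY\nbob   041CA\ncarol 027TX\n")
states = column_from_stream(data, record_len=12, offset=9, width=2)
print(states)  # [b'NY', b'CA', b'TX']
```

Unlike seek, read(N) still transfers the skipped bytes into memory before throwing them away, so it saves parsing work rather than I/O.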
