How to handle very large text files?
I'm currently writing something that needs to handle very large text files (a few GiB at least). What's needed here (and this is fixed) is:
- CSV-based, following RFC 4180 with the exception of embedded line breaks
- random read access to lines, though mostly line by line and near the end
- appending lines at the end
- (changing lines). Obviously this calls for the rest of the file to be rewritten; it's also rare, so not particularly important at the moment
The size of the file forbids keeping it completely in memory (which is also not desirable, since when appending, the changes should be persisted as soon as possible).
I have thought of using a memory-mapped region as a window into the file which gets moved around if a line outside its range is requested. Of course, at that stage I still have no abstraction above the byte level. To actually work with the contents I have a `CharsetDecoder` giving me a `CharBuffer`. Now the problem is, I can probably deal with lines of text just fine in the `CharBuffer`, but I also need to know the byte offset of that line within the file (to keep a cache of line indexes and offsets so I don't have to scan the file from the beginning again to find a specific line).
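Roughly, the kind of window I have in mind (just a sketch; the fixed window size, the UTF-8 default and the `decodeWindow` name are placeholders, not a finished design):

```java
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedWindow {
    // Map a window of the file starting at byteOffset and decode it into chars.
    static CharBuffer decodeWindow(Path file, long byteOffset, int windowSize) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long len = Math.min(windowSize, ch.size() - byteOffset);
            MappedByteBuffer window = ch.map(FileChannel.MapMode.READ_ONLY, byteOffset, len);
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
            // If the window boundary cuts a multi-byte sequence, decode() throws;
            // a real implementation would carry the trailing bytes over to the next window.
            return decoder.decode(window);
        }
    }
}
```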
Is there a way to map the offsets in a `CharBuffer` to offsets in the matching `ByteBuffer` at all? It's obviously trivial with ASCII or ISO-8859-*, less so with UTF-8, and with ISO 2022 or BOCU-1 things would get downright ugly (not that I actually expect the latter two, but UTF-8 should be the default here – and still poses problems).
I guess I could just convert a portion of the `CharBuffer` to bytes again and use the length. Either it works or I get problems with diacritics, in which case I could probably mandate the use of NFC or NFD to assure that the text is always unambiguously encoded.
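As a sketch of that idea (the `encodedLength` helper is made up for illustration): re-encode the char range with the file's charset and take the resulting byte count.

```java
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteOffsets {
    // Byte length of chars[start..end) when encoded with the file's charset.
    // Adding this to the byte offset of the window gives the line's offset in the file.
    static int encodedLength(CharBuffer chars, int start, int end, Charset charset) {
        CharBuffer slice = chars.duplicate();
        slice.position(start).limit(end);
        return charset.encode(slice).remaining();
    }

    public static void main(String[] args) {
        CharBuffer cb = CharBuffer.wrap("naïve,row\r\n");
        // 11 chars, but 12 bytes in UTF-8 (the ï takes two bytes).
        System.out.println(encodedLength(cb, 0, cb.length(), StandardCharsets.UTF_8));
    }
}
```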
Still, I wonder if that is even the way to go here. Are there better options?
ETA: Some replies to common questions and suggestions here:
This is a data storage for simulation runs, intended to be a small-ish local alternative to a full-blown database. We do have database backends as well and they are used, but for cases where they are unavailable or not applicable we do want this.
I'm also only supporting a subset of CSV (without embedded line breaks), but that's ok for now. The problematic points here are pretty much that I cannot predict how long the lines are and thus need to create a rough map of the file.
As for what I outlined above: the problem I was pondering was that I can easily determine the end of a line on the character level (U+000D followed by U+000A), but I didn't want to assume that this looks like `0D 0A` on the byte level (which already fails for UTF-16, for example, where it's either `0D 00 0A 00` or `00 0D 00 0A`). My thought was that I could make the character encoding changeable by not hard-coding details of the encoding I currently use. But I guess I could just stick to UTF-8 and ignore everything else. Feels wrong, somehow, though.
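For illustration (a quick check of my own, not something anyone asked for), the same line terminator encodes quite differently depending on the charset:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LineEndingBytes {
    static String hex(ByteBuffer bb) {
        StringBuilder sb = new StringBuilder();
        while (bb.hasRemaining()) sb.append(String.format("%02X ", bb.get() & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        for (Charset cs : new Charset[]{StandardCharsets.UTF_8,
                StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE}) {
            System.out.println(cs + ": " + hex(cs.encode("\r\n")));
        }
        // UTF-8:    0D 0A
        // UTF-16LE: 0D 00 0A 00
        // UTF-16BE: 00 0D 00 0A
    }
}
```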
7 Answers
It's very difficult to maintain a 1:1 mapping between a sequence of Java chars (which are effectively UTF-16) and bytes which could be anything depending on your file encoding. Even with UTF-8, the "obvious" mapping of 1 byte to 1 char only works for ASCII. Neither UTF-16 nor UTF-8 guarantees that a Unicode character can be stored in a single machine `char` or `byte`.

I would maintain my window into the file as a byte buffer, not a char buffer. Then, to find line endings in the byte buffer, I'd encode the Java string `"\r\n"` (or possibly just `"\n"`) as a byte sequence using the same encoding as the file is in. I'd then use that byte sequence to search for line endings in the byte buffer. The position of a line ending in the buffer plus the offset of the buffer from the start of the file maps exactly to the byte position of the line ending in the file.

Appending lines is just a case of seeking to the end of the file and adding your new lines. Changing lines is more tricky. I think I would maintain a list or map of the byte positions of changed lines and what each change is, and apply them when ready to write the changes.
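A rough sketch of that search (UTF-8 and the `indexOf` helper are assumptions of the sketch, not part of the answer):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LineEndingSearch {
    // Find the next occurrence of the encoded line terminator at or after 'from'.
    // Returns the index within the buffer, or -1 if not found.
    // Note: for UTF-16 this naive scan could in theory match inside another
    // code unit; aligning the search to the code-unit size avoids that.
    static int indexOf(ByteBuffer buf, byte[] terminator, int from) {
        outer:
        for (int i = from; i <= buf.limit() - terminator.length; i++) {
            for (int j = 0; j < terminator.length; j++) {
                if (buf.get(i + j) != terminator[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.UTF_8;          // same charset as the file
        byte[] eol = "\r\n".getBytes(cs);             // encoded line terminator
        ByteBuffer window = cs.encode("a,b,c\r\nd,e,f\r\n");
        long windowFileOffset = 0;                    // where the window starts in the file
        int pos = indexOf(window, eol, 0);
        System.out.println("line ends at file byte " + (windowFileOffset + pos));
    }
}
```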
Would it be possible to split the file into "subfiles" (of course, you must not split it within one UTF-8 character)? Then you need some metadata for each of the subfiles (total number of chars, and total number of lines).

If you have this and the "subfiles" are relatively small, so that you can always load one completely, then the handling becomes easy.

Even the editing becomes easy, because you only need to update the "subfile" and its metadata.

If you take this to the extreme, you could use a database and store one line per database row. Whether that is a good idea strongly depends on your use case.
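A minimal sketch of what per-subfile metadata could look like (field names are purely illustrative, not from the answer):

```java
import java.nio.file.Path;

// Metadata for one "subfile": which global lines it holds and how big it is.
public class SubfileInfo {
    final Path path;
    final long firstLine;   // global index of the first line in this subfile
    final long lineCount;   // total number of lines in this subfile
    final long charCount;   // total number of chars in this subfile

    SubfileInfo(Path path, long firstLine, long lineCount, long charCount) {
        this.path = path;
        this.firstLine = firstLine;
        this.lineCount = lineCount;
        this.charCount = charCount;
    }

    // True if the requested global line lives in this subfile.
    boolean contains(long globalLine) {
        return globalLine >= firstLine && globalLine < firstLine + lineCount;
    }
}
```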
`CharBuffer` assumes all characters are UTF-16 or UCS-2 (perhaps someone knows the difference).

The problem with using a proper text format is that you need to read every byte to know where the n-th character or the n-th line is. I use multi-GB text files, but assume ASCII-7 data, and I only read/write sequentially.
If you want random access on an unindexed text file, you can't expect it to be performant.
If you are willing to buy a new server you can get one with 24 GB for around £1,800 and 64GB for around £4,200. These would allow you to load even multi-GB files into memory.
If you had fixed-width lines then using a `RandomAccessFile` might solve a lot of your problems. I realise that your lines are probably not fixed width, but you could artificially impose this by adding an end-of-line indicator and then padding lines (e.g. with spaces).

This obviously works best if your file currently has a fairly uniform distribution of line lengths and doesn't have some lines that are very, very long. The downside is that this will artificially increase the size of your file.
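A hedged sketch of that approach, assuming a record length of 128 bytes (the width and helper names are made up):

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedWidthLines {
    static final int RECORD_LEN = 128;   // assumed fixed width, including the "\r\n"

    // Read line number 'lineNo' by seeking straight to its byte offset.
    static String readLine(RandomAccessFile raf, long lineNo) throws Exception {
        byte[] record = new byte[RECORD_LEN];
        raf.seek(lineNo * RECORD_LEN);
        raf.readFully(record);
        return new String(record, StandardCharsets.UTF_8).trim();  // strip padding and line ending
    }

    // Overwrite a line in place, padding it with spaces up to the fixed width.
    static void writeLine(RandomAccessFile raf, long lineNo, String line) throws Exception {
        byte[] encoded = line.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > RECORD_LEN - 2) throw new IllegalArgumentException("line too long");
        byte[] record = new byte[RECORD_LEN];
        Arrays.fill(record, (byte) ' ');            // pad the slot with spaces
        System.arraycopy(encoded, 0, record, 0, encoded.length);
        record[RECORD_LEN - 2] = '\r';              // keep the terminator at the end of the slot
        record[RECORD_LEN - 1] = '\n';
        raf.seek(lineNo * RECORD_LEN);
        raf.write(record);
    }
}
```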
Sticking with UTF-8 and `\n` denoting the end of the line should not be a problem. Alternatively, you can allow UTF-16 and recognize the data: it has to be quoted (for instance), has N commas (or semicolons) and another end of line. You can read the header to know how many columns the structure has.

Changing lines can be achieved by reserving some space at the end/beginning of each line.

That's trivial as long as the file is locked (as is any other modification).
In case of a fixed column count I'd split the file logically and/or physically into columns and implement some wrappers/adapters for the IO tasks and for managing the file as a whole.
How about a table of offsets at somewhat regular intervals in the file, so you can restart parsing somewhere near the spot you are looking for?
The idea would be that these are byte offsets where the encoding is in its initial state (i.e. if the data was ISO-2022 encoded, that spot would be in the ASCII-compatible mode). Any index into the data would then consist of a pointer into this table plus whatever is required to find the actual row. If you place the restart points such that everything between two points fits into the mmap window, then you can omit the check/remap/restart code from the parsing layer and use a parser that assumes the data is sequentially mapped.
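A rough sketch of building such a table, assuming UTF-8 (so that a 0x0A byte can only ever be a line feed) and a made-up index granularity:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class RestartIndex {
    static final int LINES_PER_ENTRY = 10_000;   // assumed index granularity

    // One sequential pass that records the byte offset of every LINES_PER_ENTRY-th line.
    static List<Long> buildIndex(String file) throws Exception {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);                          // line 0 starts at byte 0
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            long pos = 0, line = 0;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                // pos is now the offset just after the '\n', i.e. the start of the next line
                if (b == '\n' && ++line % LINES_PER_ENTRY == 0) offsets.add(pos);
            }
        }
        return offsets;
    }

    // Byte offset from which to resume parsing when looking for 'lineNo'.
    static long restartOffset(List<Long> index, long lineNo) {
        return index.get((int) (lineNo / LINES_PER_ENTRY));
    }
}
```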