Reading/writing large files in Java

Posted on 2024-10-28 21:26:16


I have a binary file with the following format:

[N bytes identifier & record length] [n1 bytes data] 
[N bytes identifier & record length] [n2 bytes data] 
[N bytes identifier & record length] [n3 bytes data]

As you can see, the records have different lengths. Each record starts with a fixed N bytes that contain an id and the length of the data in the record.

This file is very big and can contain 3 million records.

I want to open this file in an application and let the user browse and edit the records (insert / update / delete records).

My initial plan is to create an index file from the original file and, for each record, keep the addresses of the next and previous records so that I can navigate forward and backward easily (some sort of linked list, but in a file rather than in memory).

Is there a library (a Java library) that would help me implement this requirement?
Any recommendations or experience that you think would be useful?

----------------- EDIT ----------------------------------------------

Thanks for the guidance and suggestions.

Some more info:

The original file and its format are out of my control (it's a third-party file) and I can't change the file format. But I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.

Do you still recommend a database instead of a normal index file?

----------------- SECOND EDIT ----------------------------------------------

The record size in update mode is fixed. That is, an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.

Many thanks


流心雨 2024-11-04 21:26:16


Seriously, you should NOT be using a binary file for this. You should use a database.

The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:

rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
implement some kind of free space management within the file.

All of this is complicated and / or expensive.

Fortunately, there is a class of software that implements this kind of thing. It is called database software. There is a wide range of options, from a full-scale RDBMS to lightweight solutions like BerkeleyDB files.

In response to your 1st and 2nd edits, a database will still be simpler.

However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.

1. Read the file and build an in-memory index that maps ids to file locations.
2. Create a second file to hold new and updated records.
3. Perform the record adds/updates/deletes:
   1. An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
   2. An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
   3. A delete is handled by deleting the index entry for the record's key.
4. Compact the file as follows:
   1. Create a new file.
   2. Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
   3. Repeat the step 4.2 for the second file.
5. If we completed all of the above successfully, delete the old file and second file.

Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
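
For illustration, the second-file approach described above could be sketched in Java roughly as follows. This is only a sketch under stated assumptions: the question never says what the "N bytes" header looks like, so the code assumes a 4-byte int id followed by a 4-byte int data length (big-endian) and unique ids. The names `OverlayStore`, `Loc`, `put`, `delete`, `get` and `compactTo` are made up for the example.

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Sketch only. ASSUMPTION: each record starts with an 8-byte header made of a
// 4-byte int id and a 4-byte int data length (big-endian), and ids are unique.
final class OverlayStore implements Closeable {
    private static final int HEADER_BYTES = 8;

    // Where the live copy of a record sits: original file or second file.
    private static final class Loc {
        final boolean inSecond;
        final long offset;
        Loc(boolean inSecond, long offset) { this.inSecond = inSecond; this.offset = offset; }
    }

    private final RandomAccessFile original; // opened read-only, never modified
    private final RandomAccessFile second;   // append-only file for new/updated records
    private final Map<Integer, Loc> index = new HashMap<>();

    OverlayStore(File originalFile, File secondFile) throws IOException {
        original = new RandomAccessFile(originalFile, "r");
        second = new RandomAccessFile(secondFile, "rw");
        second.setLength(0); // start the editing session with an empty second file
        long pos = 0, end = original.length();
        while (pos + HEADER_BYTES <= end) { // step 1: build the in-memory index
            original.seek(pos);
            int id = original.readInt();
            int len = original.readInt();
            if (len < 0) throw new IOException("Bad length at offset " + pos);
            index.put(id, new Loc(false, pos));
            pos += HEADER_BYTES + (long) len;
        }
    }

    // Steps 3.1 and 3.2: append to the second file and (re)point the index entry.
    void put(int id, byte[] data) throws IOException {
        long pos = second.length();
        second.seek(pos);
        second.writeInt(id);
        second.writeInt(data.length);
        second.write(data);
        index.put(id, new Loc(true, pos));
    }

    // Step 3.3: a delete only drops the index entry.
    void delete(int id) { index.remove(id); }

    // Returns the live data for an id, or null if it was deleted or never existed.
    byte[] get(int id) throws IOException {
        Loc loc = index.get(id);
        if (loc == null) return null;
        RandomAccessFile f = loc.inSecond ? second : original;
        f.seek(loc.offset + 4); // skip the id, land on the length field
        byte[] data = new byte[f.readInt()];
        f.readFully(data);
        return data;
    }

    // Step 4: copy every record the index still points at into a new file.
    void compactTo(File target) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(target)))) {
            copyLive(original, false, out);
            copyLive(second, true, out);
        }
    }

    private void copyLive(RandomAccessFile f, boolean isSecond, DataOutputStream out)
            throws IOException {
        long pos = 0, end = f.length();
        while (pos + HEADER_BYTES <= end) {
            f.seek(pos);
            int id = f.readInt();
            int len = f.readInt();
            if (len < 0) throw new IOException("Bad length at offset " + pos);
            Loc loc = index.get(id);
            if (loc != null && loc.inSecond == isSecond && loc.offset == pos) {
                byte[] data = new byte[len];
                f.readFully(data);
                out.writeInt(id);
                out.writeInt(len);
                out.write(data);
            }
            pos += HEADER_BYTES + (long) len;
        }
    }

    @Override
    public void close() throws IOException {
        original.close();
        second.close();
    }
}
```

One consequence of this design is that updated and added records end up after the untouched ones in the compacted file, so it only fits if record order in the third-party format does not matter. If it does, the index would need to track an explicit ordering as well.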


遮云壑 2024-11-04 21:26:16

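Having a data file and an index file would be the general basic idea for such an implementation, but you would pretty much find yourself dealing with data fragmentation after repeated data updates/deletes, etc. This kind of project should be a separate project in its own right and should not be part of your main application. Essentially, however, a database is what you need, as it is specifically designed for such operations and use cases, and it will also let you search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.

May I suggest that you download Apache Derby and create a local embedded database (Derby does this for you when you create a new embedded connection at run time). It will not only be faster than anything you write yourself, it will also make your application easier to maintain.

Apache Derby is a single jar file that you can simply include in your project and distribute (check the license in case any legal issues apply to your app). There is no need for a database server or third-party software; it's all pure Java.

The bottom line is that it all depends on how large your application is, whether you need to share the data across multiple clients, whether speed is a critical aspect of your app, and so on.

For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using an already made and tested solution is not only smart, it will also cut down your development time (and maintenance effort).

Cheers.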


温柔女人霸气范 2024-11-04 21:26:16

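In general, you are better off letting a library or a database do this work for you.

You may not want an SQL database, and there are plenty of simple databases that don't use SQL. http://nosql-database.org/ lists 122 of them.

At the very least, if you are going to write this yourself, I suggest you read the source of one of these databases to see how they work.

Depending on the size of the records, 3 million is not that many, and I would suggest you keep as much of the data in memory as possible.

The problem you are likely to run into is making sure the data is consistent and recovering it when corruption occurs. The second problem is dealing with fragmentation efficiently (something some of the smartest people working on GCs deal with). The third problem is likely to be maintaining the index transactionally with the source data, so that there are no inconsistencies.

While this may seem simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and focus on the features that are specific to their application.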


镜花水月 2024-11-04 21:26:16

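(Note: My answer is about the general problem, not considering any Java libraries or, as the other answers suggest, using a database (library), which might be better than reinventing the wheel.)

The idea of creating an index is good and will be very helpful for performance (although you wrote "index file", I think it should be kept in memory). Generating the index should be rather fast if you read the ID and record length of each entry and then skip over the data with a file seek.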
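
As a rough illustration of that index pass, here is a minimal Java sketch. It assumes (the question does not say) that the fixed header is 8 bytes: a 4-byte int id followed by a 4-byte int data length, both big-endian. `RecordIndexer`, `Entry` and `buildIndex` are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch only. ASSUMPTION: 8-byte header = 4-byte int id + 4-byte int data
// length, big-endian. Adjust HEADER_BYTES and the parsing to the real format.
final class RecordIndexer {
    static final int HEADER_BYTES = 8;

    // Position and size of one record; 'offset' points at the start of its header.
    static final class Entry {
        final int id;
        final long offset;
        final int dataLength;
        Entry(int id, long offset, int dataLength) {
            this.id = id;
            this.offset = offset;
            this.dataLength = dataLength;
        }
    }

    // Reads only the headers and seeks past the data of every record.
    static List<Entry> buildIndex(String path) throws IOException {
        List<Entry> index = new ArrayList<>();
        byte[] header = new byte[HEADER_BYTES];
        ByteBuffer buf = ByteBuffer.wrap(header); // ByteBuffer is big-endian by default
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            long fileLength = in.length();
            long pos = 0;
            while (pos + HEADER_BYTES <= fileLength) {
                in.seek(pos);
                in.readFully(header);
                int id = buf.getInt(0);
                int len = buf.getInt(4);
                if (len < 0 || pos + HEADER_BYTES + len > fileLength) {
                    throw new IOException("Corrupt or truncated record at offset " + pos);
                }
                index.add(new Entry(id, pos, len));
                pos += HEADER_BYTES + (long) len; // skip the data without reading it
            }
        }
        return index;
    }
}
```

Because the entries are kept in file order in a list, moving to the next or previous record is just an index increment or decrement, so no next/previous pointers need to be stored.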

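You should also think about the editing functionality. Insertion and deletion in particular can be very slow on such a large file if you do it the wrong way (e.g. deleting and then moving all following entries to close the gap).

The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.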
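
A small Java sketch of the bookkeeping this implies is shown below. It is an assumption-laden illustration, not part of the answer: since the third-party format has no "deleted" flag, the deleted slots are tracked in memory only, and a slot is reused only when the new record has exactly the same total size, because the format leaves no room for padding. `SlotAllocator`, `markDeleted` and `allocate` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch only: decides where a new record should be written.
// ASSUMPTION: a freed slot can only be reused by a record of exactly the
// same total size (header + data), since the format has no padding.
final class SlotAllocator {
    private final Map<Integer, Deque<Long>> freeBySize = new HashMap<>();
    private long endOfFile;

    SlotAllocator(long currentFileLength) {
        this.endOfFile = currentFileLength;
    }

    // Remember that the record at 'offset' (recordBytes long) is no longer live.
    void markDeleted(long offset, int recordBytes) {
        freeBySize.computeIfAbsent(recordBytes, k -> new ArrayDeque<>()).push(offset);
    }

    // Returns the offset to write a record of recordBytes bytes to:
    // a previously deleted slot of that size if one exists, else the end of the file.
    long allocate(int recordBytes) {
        Deque<Long> slots = freeBySize.get(recordBytes);
        if (slots != null && !slots.isEmpty()) {
            return slots.pop();
        }
        long offset = endOfFile;
        endOfFile += recordBytes;
        return offset;
    }
}
```

Any slots that are still unused when the user saves would have to be dropped by rewriting the file once, since the original format cannot represent a hole.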


夜访吸血鬼 2024-11-04 21:26:16

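Insert / Update / Delete records

Inserting records into a file (rather than merely appending to it) and deleting records from it is expensive, because you have to move all of the file's following content to create space for the new record or to remove the space it used. Updates are similarly expensive if the update changes the length of the record (you say they are variable length).

The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a database. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.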
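
The point of same-length index records is that the k-th entry can be found by a single seek to `k * entrySize`. A minimal Java sketch follows, assuming a data-file header of a 4-byte int id plus a 4-byte int data length (the question does not specify this) and a made-up 16-byte index entry layout of offset, id and length. `FixedIndexFile`, `write` and `read` are hypothetical names.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch only. ASSUMPTIONS: the data file uses an 8-byte header (int id,
// int data length); each index entry is 16 bytes: long offset, int id, int length.
final class FixedIndexFile {
    static final int HEADER_BYTES = 8;
    static final int ENTRY_BYTES = 16;

    // Scans the data file once and writes one fixed-size entry per record.
    // Returns the number of records indexed.
    static long write(String dataPath, String indexPath) throws IOException {
        long count = 0;
        try (RandomAccessFile data = new RandomAccessFile(dataPath, "r");
             DataOutputStream idx = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(indexPath)))) {
            long pos = 0, end = data.length();
            while (pos + HEADER_BYTES <= end) {
                data.seek(pos);
                int id = data.readInt();
                int len = data.readInt();
                if (len < 0) throw new IOException("Bad length at offset " + pos);
                idx.writeLong(pos);
                idx.writeInt(id);
                idx.writeInt(len);
                pos += HEADER_BYTES + (long) len;
                count++;
            }
        }
        return count;
    }

    // Reads the k-th entry (0-based) with one seek; returns {offset, id, length}.
    static long[] read(RandomAccessFile idx, long k) throws IOException {
        idx.seek(k * ENTRY_BYTES);
        return new long[] { idx.readLong(), idx.readInt(), idx.readInt() };
    }
}
```

With this layout, the previous and next records are simply entries `k - 1` and `k + 1`, so the linked-list pointers from the original plan are not needed for navigation.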


蓝色星空 2024-11-04 21:26:16

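As others have said, a database seems to be the better solution. The following Java SQL DBs could be used: H2, Derby or HSQLDB.

If you want to use an index file, look at Berkeley DB or a NoSQL store.

If you need to use a file for some reason, look at JRecord. It has:

1. Several classes for reading/writing files with variable-length binary records (they were written for Cobol VB files). Any of the Mainframe / Fujitsu / Open Cobol VB file structures would do the job.
2. An editor for editing JRecord files. The latest version of the editor can handle large files (it uses compression / an overflow file). The editor has to download the whole file, and only one user can edit the file at a time.

The JRecord solution is only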