Reading/writing large files in Java
I have a binary file with the following format:
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
As you can see, the records have different lengths. Each record starts with a fixed N bytes that contain an id and the length of the data in the record.
This file is very big and can contain 3 million records.
I want to open this file in an application and let the user browse and edit the records (insert / update / delete records).
My initial plan is to create an index file from the original file and, for each record, keep the addresses of the next and previous records so that I can navigate forward and backward easily (a sort of linked list, but in a file rather than in memory).
Is there a Java library that can help me implement this requirement?
Any recommendations or experience that you think would be useful?
----------------- EDIT ----------------------------------------------
Thanks for the guidance and suggestions.
Some more info:
The original file and its format are out of my control (it's a third-party file), and I can't change the file format. But I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.
Do you still recommend a database instead of a plain index file?
----------------- SECOND EDIT ----------------------------------------------
The record size in update mode is fixed. That is, an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.
Many thanks
Comments (6)
Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
All of this is complicated and/or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There is a wide range of options, from a full-scale RDBMS to lightweight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
1. Read the file and build an in-memory index that maps ids to file locations.
2. Create a second file to hold new and updated records.
3. Perform the record adds/updates/deletes:
   3.1. An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
   3.2. An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
   3.3. A delete is handled by deleting the index entry for the record's key.
4. Compact the file as follows:
   4.1. Create a new file.
   4.2. Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
   4.3. Repeat step 4.2 for the second file.
   4.4. If all of the above completed successfully, delete the old file and the second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
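A rough Java sketch of the index-plus-second-file scheme described above might look like the following. It is only an illustration under stated assumptions: the question does not say how the N header bytes are laid out, so the code assumes an 8-byte header (a 4-byte id followed by a 4-byte data length, both big-endian), and the names `OverlayStore`, `put`, `delete`, `get` and `compactTo` are made up for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Sketch only. ASSUMPTION: 8-byte header = 4-byte id + 4-byte data length
// (big-endian). The real third-party layout may differ. All names are hypothetical.
final class OverlayStore {
    static final int HEADER_BYTES = 8;

    /** Where the live copy of a record sits: which file, and at what offset. */
    static final class Location {
        final RandomAccessFile file;
        final long offset;
        Location(RandomAccessFile file, long offset) { this.file = file; this.offset = offset; }
    }

    private final RandomAccessFile original;
    private final RandomAccessFile second;
    private final Map<Integer, Location> index = new HashMap<>();

    OverlayStore(RandomAccessFile original, RandomAccessFile second) throws IOException {
        this.original = original;
        this.second = second;
        // Scan the original file once to build the in-memory index.
        for (long pos = 0; pos + HEADER_BYTES <= original.length(); ) {
            int[] h = readHeader(original, pos);
            index.put(h[0], new Location(original, pos));
            pos += HEADER_BYTES + h[1];
        }
    }

    /** Returns {id, dataLength}; leaves the file pointer just after the header. */
    private static int[] readHeader(RandomAccessFile f, long pos) throws IOException {
        byte[] header = new byte[HEADER_BYTES];
        f.seek(pos);
        f.readFully(header);
        ByteBuffer buf = ByteBuffer.wrap(header);
        return new int[] { buf.getInt(0), buf.getInt(4) };
    }

    /** Add or update: append to the second file and repoint the index entry. */
    void put(int id, byte[] data) throws IOException {
        long pos = second.length();
        second.seek(pos);
        second.writeInt(id);
        second.writeInt(data.length);
        second.write(data);
        index.put(id, new Location(second, pos));
    }

    /** Delete: only the index entry is removed; the bytes stay until compaction. */
    void delete(int id) { index.remove(id); }

    /** Returns the live data for an id, or null if there is no such record. */
    byte[] get(int id) throws IOException {
        Location loc = index.get(id);
        if (loc == null) return null;
        byte[] data = new byte[readHeader(loc.file, loc.offset)[1]];
        loc.file.readFully(data);
        return data;
    }

    /** Compaction: copy only the records the index still points at. */
    void compactTo(RandomAccessFile target) throws IOException {
        target.setLength(0);
        copyLive(original, target);
        copyLive(second, target);
    }

    private void copyLive(RandomAccessFile source, RandomAccessFile target) throws IOException {
        for (long pos = 0; pos + HEADER_BYTES <= source.length(); ) {
            int[] h = readHeader(source, pos);
            Location loc = index.get(h[0]);
            if (loc != null && loc.file == source && loc.offset == pos) {
                byte[] data = new byte[h[1]];
                source.readFully(data);
                target.writeInt(h[0]);
                target.writeInt(h[1]);
                target.write(data);
            }
            pos += HEADER_BYTES + h[1];
        }
    }
}
```

With 3 million records the `HashMap` holds one small entry per record, which should fit comfortably in a typical heap. Note that this sketch appends new records after the surviving original ones, so it does not preserve an insertion position in the middle of the file.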
Having a data file and an index file would be the general base idea for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated data updates/deletion, etc. This kind of project, in itself, should be a separate project and should not be part of your main application. However, essentially, a database is what you need as it is specifically designed for such operations and use cases and will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest that you download Apache Derby and create a local embedded database (Derby does it for you when you create a new embedded connection at run-time). It will not only be faster than anything you'll write yourself, but will also make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license if any legal issue may apply in your app). There is no need for a database server or third party software; it's all pure Java.
The bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, etc.
For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using already made and tested solutions is not only smart, but will cut down your development time (and maintenance efforts).
Cheers.
Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much and I would suggest you keep as much in memory as possible.
The first problem you are likely to have is ensuring the data is consistent and recovering the data when corruption occurs. The second problem is dealing with fragmentation efficiently (something the brightest minds working on the GC deal with). The third problem is likely to be maintaining the index transactionally with the source data to ensure there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unique to their application.
(Note: My answer is about the problem in general, not considering any Java libraries or - like the other answers also proposed - using a database (library), which might be better than reinventing the wheel)
The idea to create an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip the data with a file seek.
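As an illustration of that scan, here is a minimal Java sketch that reads only the headers and seeks past the data. The header layout is an assumption, since the question only says "N bytes": the code takes N = 8, a 4-byte id followed by a 4-byte data length, both big-endian. `RecordIndexer` and `buildIndex` are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only. ASSUMPTION: each header is 8 bytes, a 4-byte id followed by a
// 4-byte data length, both big-endian. The real third-party layout may differ.
final class RecordIndexer {
    static final int HEADER_BYTES = 8;

    /** Maps each record id to the file offset of its header, in file order. */
    static Map<Integer, Long> buildIndex(RandomAccessFile file) throws IOException {
        Map<Integer, Long> index = new LinkedHashMap<>();
        byte[] header = new byte[HEADER_BYTES];
        ByteBuffer buf = ByteBuffer.wrap(header);   // big-endian by default
        long size = file.length();
        long pos = 0;
        while (pos + HEADER_BYTES <= size) {
            file.seek(pos);
            file.readFully(header);                 // read the header in one call
            int id = buf.getInt(0);
            int dataLength = buf.getInt(4);
            if (dataLength < 0 || pos + HEADER_BYTES + dataLength > size) {
                throw new IOException("Corrupt header at offset " + pos);
            }
            index.put(id, pos);
            pos += HEADER_BYTES + dataLength;       // skip the data without reading it
        }
        return index;
    }
}
```

A `LinkedHashMap` is used here so that iterating the index follows file order, which helps with forward navigation. For backward navigation, a list of offsets would probably be a better fit.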
You should also think about the edit functionality. Inserting and deleting in particular can be very slow on such a big file if you do it wrong (e.g. deleting and then moving all the following entries to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.
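One way to sketch this in Java is shown below. It rests on two assumptions that are not in the question: the header is 8 bytes (4-byte id, 4-byte data length, big-endian), and the id value -1 is never used by real records, so it can serve as a "deleted" marker. `TombstoneFile`, `markDeleted` and `insert` are hypothetical names. Because the marker is not part of the third-party format, the file would still need a clean-up pass before being handed back.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch only. ASSUMPTIONS: 8-byte header (4-byte id + 4-byte data length,
// big-endian), and id -1 is reserved as a "deleted" marker. Names are hypothetical.
final class TombstoneFile {
    static final int HEADER_BYTES = 8;
    static final int TOMBSTONE_ID = -1;

    /** Marks the record at recordOffset as deleted without moving any bytes. */
    static void markDeleted(RandomAccessFile f, long recordOffset) throws IOException {
        f.seek(recordOffset);
        f.writeInt(TOMBSTONE_ID);   // the length field is kept so scans still work
    }

    /**
     * Writes a record into a deleted slot of exactly the same data length,
     * or appends it at the end of the file. Returns the offset used.
     */
    static long insert(RandomAccessFile f, int id, byte[] data) throws IOException {
        long size = f.length();
        long pos = 0;
        while (pos + HEADER_BYTES <= size) {
            f.seek(pos);
            int existingId = f.readInt();
            int length = f.readInt();
            if (existingId == TOMBSTONE_ID && length == data.length) {
                break;              // reusable slot found
            }
            pos += HEADER_BYTES + length;
        }
        // pos is either a reusable slot or the end of a well-formed file.
        f.seek(pos);
        f.writeInt(id);
        f.writeInt(data.length);
        f.write(data);
        return pos;
    }
}
```

The linear scan for a free slot is only for brevity. With an in-memory index, a list of free slots keyed by length would avoid rereading the file on every insert. Only exact-length slots are reused here, since a shorter record would leave stray bytes that break sequential reading.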
Inserting (rather than merely appending) and deleting records to a file is expensive because you have to move all the following content of the file to create space for the new record or to remove the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a database. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.
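The benefit of same-length index records is that the n-th entry sits at a computable position, so moving to the next or previous record is just n + 1 or n - 1. A minimal Java sketch, assuming a hypothetical 12-byte entry (4-byte record id plus 8-byte offset into the data file) and made-up names `FixedIndexFile`, `writeEntry`, `readId`, `readOffset` and `entryCount`:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch only. ASSUMPTION: each index entry is 12 bytes, a 4-byte record id
// followed by the 8-byte offset of that record in the data file.
final class FixedIndexFile {
    static final int ENTRY_BYTES = 12;

    /** Writes entry number n (0-based) at its fixed position. */
    static void writeEntry(RandomAccessFile idx, long n, int id, long offset) throws IOException {
        idx.seek(n * ENTRY_BYTES);
        idx.writeInt(id);
        idx.writeLong(offset);
    }

    /** Reads the record id stored in entry n. */
    static int readId(RandomAccessFile idx, long n) throws IOException {
        idx.seek(n * ENTRY_BYTES);
        return idx.readInt();
    }

    /** Reads the data-file offset stored in entry n. */
    static long readOffset(RandomAccessFile idx, long n) throws IOException {
        idx.seek(n * ENTRY_BYTES + 4);   // skip the 4-byte id
        return idx.readLong();
    }

    /** Number of entries currently in the index file. */
    static long entryCount(RandomAccessFile idx) throws IOException {
        return idx.length() / ENTRY_BYTES;
    }
}
```

Inserting or deleting in the middle of such an index still means shifting entries, but the entries are small and fixed-size, so this is far cheaper than shifting the variable-length data itself.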
As others have stated, a database would seem a better solution. The following Java SQL databases could be used: H2, Derby or HSQLDB.
If you want to use an index file, look at Berkeley DB or a NoSQL store.
If there is some reason for using a file, look at JRecord . It has
The JRecord solution will only work if