A transaction model for editing large binary files

Posted 2024-10-12 05:44:44

I am creating a binary editor for some very large binary files. One of the software requirements is that the editor cannot modify the original file, so the target file must be an edited copy of the original.

I want to design the editor so that the file is copied only once (the copy is a 20-minute process). I know that I can lock the file while it is being edited, but if the user exits the program, they will have to go through the whole 20-minute copy again, unless I can find a way to determine that they are still in their original editing session.

Is there some simple process you can think of by which I can let the user "register" the copied file as an editable file and, once they have completed all of their changes, "finalize" it?

Ideally, such a process would let me detect whether the editable file or the transaction information has been tampered with between editing sessions (tampering or finalization would cause another copy to be made if the file is edited again).

Comments (4)

沧桑㈠ 2024-10-19 05:44:44
  1. Create and maintain a record of sessions in a centralized location (a database, perhaps).
  2. A session consists of the username, if you have one, or the IP address, or whatever else uniquely identifies the user, plus a hash of the file's bytes. If hashing is too expensive for a file of this size, you might rely on the file's modification date and size instead.
  3. When the user closes the editor, update the session record with the above information and mark it as inactive.
  4. When the user reopens the editor, you should have access to your key information, i.e., the username and the file info. If you find a matching session record, it's an inactive session that you can reactivate; otherwise, the file has either been tampered with or is brand new.

Does that suit your needs?
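
For illustration, the four steps might look roughly like this in Python. This is only a sketch under assumptions: `SessionRegistry`, `fingerprint`, the JSON store, and the `user|path` key are all made-up names and choices, and a real deployment would more likely sit on a proper database.

```python
import hashlib
import json
import os
from pathlib import Path


def fingerprint(path, full_hash=False):
    """Describe the working copy cheaply (size + mtime) or strongly (SHA-256)."""
    st = os.stat(path)
    info = {"size": st.st_size, "mtime_ns": st.st_mtime_ns}
    if full_hash:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so a huge file never has to fit in memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        info["sha256"] = digest.hexdigest()
    return info


class SessionRegistry:
    """Hypothetical JSON-backed session store (stand-in for a database)."""

    def __init__(self, store_path):
        self.store_path = Path(store_path)
        self.sessions = {}
        if self.store_path.exists():
            self.sessions = json.loads(self.store_path.read_text())

    def _save(self):
        self.store_path.write_text(json.dumps(self.sessions))

    @staticmethod
    def _key(user, path):
        return f"{user}|{os.path.abspath(path)}"

    def close_session(self, user, path, full_hash=False):
        """Step 3: record the file's state and mark the session inactive."""
        self.sessions[self._key(user, path)] = {
            "active": False,
            "file": fingerprint(path, full_hash),
        }
        self._save()

    def reopen_session(self, user, path):
        """Step 4: True if a matching, untouched session exists, else False."""
        record = self.sessions.get(self._key(user, path))
        if record is None:
            return False  # brand new: a fresh copy is needed
        use_hash = "sha256" in record["file"]
        if fingerprint(path, use_hash) != record["file"]:
            return False  # changed outside the editor: treat as tampered
        record["active"] = True
        self._save()
        return True
```

A `False` from `reopen_session` would be the signal to make a fresh copy. Note that the size-and-date variant only catches careless changes; someone who restores the timestamp can get past it.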

喜你已久 2024-10-19 05:44:44

I think you'll want to keep a log of actions taken by the user. In order to avoid writing to the copy of the source data, I would keep the log in a separate file. Store the user's edits with time stamp information.

When it comes time to commit the transaction, simply read down the list of changes in the log file and apply them, ordered by time stamp.

When the user needs to read data from the file during the editing process, you'll have to read out the relevant portion of the source file into memory and apply the changes to that data from the log file.
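
Here is a rough Python sketch of that log, the read path, and the commit. It assumes every edit is a fixed-length overwrite at a byte offset (no inserts or deletes), which is my simplification and not something the question states; `EditLog`, `read_with_edits`, and `commit` are hypothetical names.

```python
import json
import time


class EditLog:
    """Append-only log of overwrite edits, one JSON object per line."""

    def __init__(self, log_path):
        self.log_path = log_path

    def record(self, offset, data, timestamp=None):
        entry = {
            "ts": time.time() if timestamp is None else timestamp,
            "offset": offset,
            "data": data.hex(),
        }
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    def entries(self):
        """All edits, oldest first (the sort is stable for equal stamps)."""
        try:
            with open(self.log_path, encoding="utf-8") as f:
                items = [json.loads(line) for line in f if line.strip()]
        except FileNotFoundError:
            return []
        return sorted(items, key=lambda e: e["ts"])


def read_with_edits(source_path, log, offset, length):
    """Read a window of the untouched source and overlay the logged edits."""
    with open(source_path, "rb") as f:
        f.seek(offset)
        window = bytearray(f.read(length))
    for e in log.entries():
        data = bytes.fromhex(e["data"])
        # Clip the edit to the part that falls inside the window.
        start = max(offset, e["offset"])
        end = min(offset + len(window), e["offset"] + len(data))
        if start < end:
            window[start - offset:end - offset] = (
                data[start - e["offset"]:end - e["offset"]]
            )
    return bytes(window)


def commit(target_path, log):
    """Apply every logged edit to the copy, in time-stamp order."""
    with open(target_path, "r+b") as f:
        for e in log.entries():
            f.seek(e["offset"])
            f.write(bytes.fromhex(e["data"]))
```

Re-reading the whole log on every read is deliberately naive; an in-memory index keyed by offset would be the obvious next step.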

This could really be the hardest part, depending on the binary file format. If you have the ability to somehow index the contents of the binary file, I would use that information in the edit log. That way, you can pull only the data you need from the log file, and you'll be able to determine which edits are applicable to that data.

If all you have is a big, formless blob, you'll have to keep the entire thing in memory and apply all of the changes every time you perform a read. There's room for optimization here, I think, but the whole thing is still really heinous. Without being able to limit the scope of the read, you have to assume that any edit could change any data at any time.

As to securing the edits, that's a tricky question. If you're running in an environment you trust, you can get away with keeping a secret and using it to authenticate the information. It's cumbersome, but you could hash the concatenation of the binary file, the edit log, and a secret known only to the application. (Without the secret, anyone could come by, modify the file, and insert a new hash.)
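
As a sketch of that idea in Python: note that this swaps the plain hash-of-a-concatenation for an HMAC, which is the standard keyed construction for the same purpose, and prefixes each file with its length so bytes cannot be shifted from the end of one file to the start of the other without changing the tag. `session_tag` and `verify` are hypothetical names, and where the secret comes from is left open.

```python
import hashlib
import hmac
import os


def session_tag(secret, file_path, log_path):
    """Keyed digest over the working copy and the edit log."""
    mac = hmac.new(secret, digestmod=hashlib.sha256)
    for path in (file_path, log_path):
        # The length prefix keeps the boundary between the files unambiguous.
        mac.update(os.path.getsize(path).to_bytes(8, "big"))
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                mac.update(chunk)
    return mac.hexdigest()


def verify(secret, file_path, log_path, expected_tag):
    """Constant-time comparison against the tag stored at the last close."""
    actual = session_tag(secret, file_path, log_path)
    return hmac.compare_digest(actual, expected_tag)
```

The tag would be stored when the session closes and checked when it reopens. Computing it means reading the whole file, so on files this large it is not free.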

If you're running on a machine local to the user (i.e., a desktop), keeping secrets can be really difficult, especially with managed code. This is a topic unto itself, and I don't have a good answer for you.

独自唱情﹋歌 2024-10-19 05:44:44

Can't you just have a field in that file, at a fixed offset from the start or end, where you put session information, or just a 'being edited' flag? It may include a reference to its current editing process (e.g. its pid). If the pid is our pid, then it's our session. If it's not our pid, look at the process list. If a process with this pid exists, it's the legitimate editor; if not, we're seeing the result of a crash, so initiate crash recovery (if any). If the pid is 0, the file was cleanly finalized.
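
A minimal Python sketch of such a field, assuming the file format tolerates a 16-byte trailer at the very end. The `MAGIC` marker, the trailer layout, and all function names are invented for illustration. The "does this pid exist" check is platform-specific, so here it is passed in as a caller-supplied `is_alive` callable instead of being hard-coded.

```python
import os
import struct

MAGIC = b"EDSESS01"              # hypothetical marker identifying the trailer
TRAILER = struct.Struct("<8sQ")  # marker + editor pid, 16 bytes in total


def add_trailer(path, pid):
    """Register the copy: append the session field to the end of the file."""
    with open(path, "ab") as f:
        f.write(TRAILER.pack(MAGIC, pid))


def read_pid(path):
    """Return the recorded pid, or None if the file carries no trailer."""
    if os.path.getsize(path) < TRAILER.size:
        return None
    with open(path, "rb") as f:
        f.seek(-TRAILER.size, os.SEEK_END)
        magic, pid = TRAILER.unpack(f.read(TRAILER.size))
    return pid if magic == MAGIC else None


def set_pid(path, pid):
    """Overwrite the pid field in place; pid 0 marks a clean finalization."""
    if read_pid(path) is None:
        raise ValueError("file has no session trailer")
    with open(path, "r+b") as f:
        f.seek(-TRAILER.size, os.SEEK_END)
        f.write(TRAILER.pack(MAGIC, pid))


def classify(path, my_pid, is_alive):
    """Apply the decision rules to the pid stored in the file."""
    pid = read_pid(path)
    if pid is None:
        return "unregistered"
    if pid == 0:
        return "finalized"
    if pid == my_pid:
        return "ours"
    return "other-editor" if is_alive(pid) else "crashed"
```

One caveat with this scheme: operating systems reuse pids, so a live process with the recorded pid is not guaranteed to be an editor. Storing the process start time alongside the pid would narrow that gap.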

Also: If the big file is available for reading, do you really need to copy it before editing?

If edits are rather small compared to the size of the file, I'd record user actions as 'diffs' between the original file and the result. If the same spot is edited again and again, it may be useful to "join" the diffs somehow so that you don't apply too many layers of diffs. The user's view of the file is, of course, the original with all diffs dynamically applied.
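
One possible way to "join" diffs, sketched in Python under the assumption that every diff is a plain overwrite of bytes at an offset: keep a sorted list of non-overlapping patches and fold each new edit into it, with newer bytes winning. `merge_patch` is a hypothetical helper, not part of any library.

```python
def merge_patch(patches, offset, data):
    """Fold a new overwrite into a sorted list of non-overlapping patches.

    `patches` is a list of (offset, bytes) pairs. Patches that overlap or
    touch the new edit are absorbed into it; where they overlap, the newer
    bytes win. Returns a new sorted list.
    """
    start, end = offset, offset + len(data)
    merged = bytearray(data)
    result = []
    for p_off, p_data in patches:
        p_end = p_off + len(p_data)
        if p_end < start or p_off > end:
            result.append((p_off, p_data))  # disjoint: keep unchanged
            continue
        if p_off < start:
            # Keep the older bytes that stick out to the left.
            merged[0:0] = p_data[:start - p_off]
            start = p_off
        if p_end > end:
            # Keep the older bytes that stick out to the right.
            merged.extend(p_data[len(p_data) - (p_end - end):])
            end = p_end
    result.append((start, bytes(merged)))
    result.sort()
    return result


patches = []
patches = merge_patch(patches, 10, b"AAAA")
patches = merge_patch(patches, 12, b"BBBB")  # overlaps the first patch
patches = merge_patch(patches, 0, b"ZZ")     # separate region
print(patches)  # [(0, b'ZZ'), (10, b'AABBBB')]
```

However many times a spot is edited, the list then holds at most one layer per region, so both reads and the final apply stay proportional to the amount of changed data.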

In the meantime you copy the file, and, once the editing session is over and the file is fully here, you apply all your diffs to the file. Depending on the nature of allowed edits, this may or may not be a time-consuming process, though. If editing sessions are longer than 20 minutes, the user may notice no wait time at all. You will lock the file for the time of diff application, which is presumably shorter than copy time.
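
A small Python sketch of that flow, again assuming overwrite-only diffs held as (offset, bytes) pairs: the copy runs on a worker thread while the user edits, and the diffs are written into the copy only once it has finished. `BackgroundCopy` is a hypothetical name, and file locking is left out.

```python
import shutil
import threading


class BackgroundCopy:
    """Copy the original on a worker thread while the user keeps editing."""

    def __init__(self, source, target):
        self.target = target
        self.error = None
        self._thread = threading.Thread(target=self._run, args=(source, target))
        self._thread.start()

    def _run(self, source, target):
        try:
            shutil.copyfile(source, target)
        except OSError as exc:
            self.error = exc  # surfaced later, on the caller's thread

    def finish(self, patches):
        """Wait for the copy to complete, then write the diffs into it."""
        self._thread.join()
        if self.error is not None:
            raise self.error
        with open(self.target, "r+b") as f:
            for offset, data in patches:
                f.seek(offset)
                f.write(data)
```

If the session outlasts the copy, `join()` returns immediately and the only wait the user sees is the time needed to write the diffs.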

抱猫软卧 2024-10-19 05:44:44

Since you are thinking about transactions and file system activity, it might be helpful to consider Transactional NTFS. This doesn't answer your question, but it might give you fresh insight into the possibilities. Since your question is tagged C# and Windows, you might want to look at a .NET wrapper such as the one here: http://offroadcoder.com/CategoryView,category,Transactions.aspx. Scott Klueppel shows how to do transactional NTFS using the familiar .NET idiom of a TransactionScope. I did a quick test of what Scott has done, and I like what I have seen.
