What is a sensible data structure for efficiently synchronising two root paths?

Posted on 2024-11-18 02:48:19

I am working on an application that involves maintaining consistency between two local directories. Specifically, the directories should be identical, with the exception that all files in one of the directories are modified in some particular way (this part is not important to my question).

While running, my application runs two processes that listen for changes occurring under each of the paths, and perform the relevant operations to bring them back in sync when necessary.

In terms of my specific question: I'm looking for advice on the trickier situation of when one starts the application. At this point, each process needs to check all files/folders under the path that it is looking after, to see if anything has changed in any way whilst the application was not running. (Let us assume that the application cannot be notified by the OS of anything that happened while it was shut down, and thus will need to directly check every file/folder.)

Each process will have access to (and maintain) a persistent data-structure of all files/folders under its designated path. I was thinking that the following should be held within the data-structure for each of the files and folders:

File/folder name;
File hash (CRC32);
File/folder last-modified date; and
File/folder size.

These pieces of information will obviously help to check for any changes to files/folders, but what is the best way to store them?
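As a rough illustration of the per-entry record listed above, here is a minimal Python sketch. The names `EntryMeta`, `crc32_of_file` and `read_meta` are hypothetical, and treating folders as having size 0 and no hash is an assumption made for this sketch, not something the question specifies.

```python
import os
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntryMeta:
    """Hypothetical per-entry record: name, hash, last-modified time, size."""
    name: str                # file or folder name, not the full path
    is_dir: bool
    size: int                # bytes; 0 for folders in this sketch (assumption)
    mtime_ns: int            # last-modified time in nanoseconds
    crc32: Optional[int]     # None for folders (assumption)


def crc32_of_file(path, chunk_size=1 << 16):
    """Compute the CRC32 of a file incrementally, one chunk at a time."""
    crc = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def read_meta(path):
    """Build an EntryMeta for one file or folder on disk."""
    st = os.stat(path)
    is_dir = os.path.isdir(path)
    return EntryMeta(
        name=os.path.basename(path),
        is_dir=is_dir,
        size=0 if is_dir else st.st_size,
        mtime_ns=st.st_mtime_ns,
        crc32=None if is_dir else crc32_of_file(path),
    )
```

Hashing every file on startup reads every byte under the path, so one possible refinement is to compare size and last-modified time first and only compute the CRC when those differ. That trades some safety for speed and is a design choice rather than a requirement.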

It seems to me that one sensible way to approach the situation of an application start is for each process to recursively scan through all files/folders under its designated path, and compare the metadata for each file scanned to the metadata stored in its data-structure. Then the processes should also iterate through the data-structures to look for things that have been removed from the paths. Some cases that may be encountered during this process are:

file modified (file name found in data-structure, but hash differs);
file added (no identical filename or hash found in data-structure);
file renamed (file with same hash exists in data-structure, but not with same filename);
folder added (no folder name in data-structure);
folder removed (folder name in data-structure, but not under path);
folder renamed (tricky one).
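The file cases above could be classified roughly as follows. This is a sketch under stated assumptions: both snapshots are flattened to hypothetical `path -> CRC32` dictionaries, folder cases are left out, and a rename is only reported when the old path has disappeared (otherwise the new file is treated as added, since it may be a copy). The function name `classify_changes` is made up for illustration.

```python
def classify_changes(stored, scanned):
    """Compare a stored snapshot with a fresh scan.

    Both arguments map a relative file path to its CRC32.
    Returns a list of tuples describing what changed.
    """
    changes = []
    # Paths that were known before but are missing from the scan.
    removed = {p: h for p, h in stored.items() if p not in scanned}
    # Index the missing paths by hash so a new path with a known hash
    # can be paired with the path it was probably renamed from.
    removed_by_hash = {}
    for path, h in sorted(removed.items()):
        removed_by_hash.setdefault(h, []).append(path)

    for path in sorted(scanned):
        h = scanned[path]
        if path in stored:
            if stored[path] != h:
                changes.append(("modified", path))
        elif removed_by_hash.get(h):
            old = removed_by_hash[h].pop(0)
            del removed[old]
            changes.append(("renamed", old, path))
        else:
            changes.append(("added", path))

    # Whatever is still unmatched has genuinely gone away.
    changes.extend(("removed", p) for p in sorted(removed))
    return changes
```

Because CRC32 is only 32 bits wide, two unrelated files can share a hash, so a match here is a rename candidate rather than proof. Comparing sizes as well would reduce false matches.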

So, what's the best data-structure to use for this task? In my head I'm thinking of some form of sorted associative array, e.g., a red-black tree, which stores file and folder objects. Each file object contains name, hash and mod-date attributes, while each folder object contains name and children attributes, where children stores another associative array with everything underneath. Given the path to an arbitrary file, e.g., /foo/bar/file.txt, you begin at the root (foo), check for bar and so on until you get to file.txt's parent object.
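A minimal sketch of that nested layout and the path walk, with hypothetical class names `FileNode` and `FolderNode`. It uses Python's built-in `dict` for `children`, which is hash-based rather than a sorted structure such as a red-black tree; that substitution is an assumption made to keep the example short. The top-level node here is an unnamed container that holds `foo` as a child.

```python
class FileNode:
    def __init__(self, name, crc32, mtime, size):
        self.name = name
        self.crc32 = crc32
        self.mtime = mtime
        self.size = size


class FolderNode:
    def __init__(self, name):
        self.name = name
        self.children = {}  # child name -> FileNode or FolderNode

    def add(self, node):
        self.children[node.name] = node
        return node


def lookup(root, path):
    """Walk from the root one path component at a time.

    Returns the node for `path`, or None if any component is missing
    or a file is found where a folder was expected.
    """
    node = root
    for part in (p for p in path.split("/") if p):
        if not isinstance(node, FolderNode):
            return None
        node = node.children.get(part)
        if node is None:
            return None
    return node
```

Each step of the walk costs one associative lookup, so finding an entry is proportional to the depth of the path rather than to the total number of entries.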

Another alternative I can think of is to merely store everything flatly, such that there is one red-black tree where each key is the full path to each file/folder, and the value is the file/folder object. This would probably be quicker for retrieval, but it won't be possible to detect renamed files/folders without iterating through all values anyway, which sounds expensive. In the first approach, it may be the case that identifying a rename would only involve checking a portion of the data-structure rather than all of it.
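One way to soften the cost of rename detection in the flat layout is to keep a second index from hash to paths alongside the path-keyed map. The sketch below is an assumption about how that could look, not something the question proposes; `FlatIndex` and its methods are hypothetical names, and plain dictionaries stand in for the red-black tree.

```python
from collections import defaultdict


class FlatIndex:
    """Flat path -> CRC32 map with a reverse hash -> paths index."""

    def __init__(self):
        self.by_path = {}
        self.paths_by_hash = defaultdict(set)

    def put(self, path, crc32):
        """Insert or update an entry, keeping both indexes consistent."""
        old = self.by_path.get(path)
        if old is not None:
            self._unlink(path, old)
        self.by_path[path] = crc32
        self.paths_by_hash[crc32].add(path)

    def remove(self, path):
        crc = self.by_path.pop(path)
        self._unlink(path, crc)

    def _unlink(self, path, crc):
        # Drop the path from the reverse index and discard empty buckets.
        paths = self.paths_by_hash[crc]
        paths.discard(path)
        if not paths:
            del self.paths_by_hash[crc]

    def candidates_for(self, crc32):
        """Paths currently stored with this hash, i.e. rename candidates."""
        return sorted(self.paths_by_hash.get(crc32, ()))
```

With the reverse index in place, checking whether an unknown path has a known hash is a single lookup instead of a scan over all values, at the price of maintaining two structures on every update.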

Sorry the above ideas aren't terribly well thought out. What's the state of the art in this area, and are there any well-trodden approaches to these types of problems?


此生挚爱伱 2024-11-25 02:48:19

You're modelling a filesystem, so it's quite natural to use a hierarchical data structure. After all, you don't need to compare the file at dir1\dir2\foo.txt to dir3\bar.txt, right? You didn't mention file moves between directories as something you're tracking.

So, the data structure could be:

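// Pseudocode: common interface for anything stored in the tree.
interface IFSEntry {
  string name
  datetime creationDate
  // Compare and UpdateFrom are implemented by both File and Directory.
  pure virtual bool Compare(IFSEntry other)
  pure virtual void UpdateFrom(IFSEntry other)
  // Looks for this entry among possibleOriginals; reports the previous name via oldName.
  pure virtual bool WasRenamed(Dictionary<string,IFSEntry> possibleOriginals, out string oldName)
  ...
}

class File : IFSEntry {
  ...
}

class Directory : IFSEntry {
  // Child entries keyed by name.
  private Dictionary<string,IFSEntry> children;
  ...
}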

The Directory implementations of UpdateFrom and Compare would recurse down their children.
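A minimal Python sketch of that recursion follows. It assumes `compare` returns `True` when two entries match, which the answer does not spell out, and it compares files by name and CRC32 only; the constructor signatures are hypothetical.

```python
class File:
    def __init__(self, name, crc32):
        self.name = name
        self.crc32 = crc32

    def compare(self, other):
        # Assumption: two files match when name and CRC32 both match.
        return (
            isinstance(other, File)
            and self.name == other.name
            and self.crc32 == other.crc32
        )


class Directory:
    def __init__(self, name, children=()):
        self.name = name
        self.children = {child.name: child for child in children}

    def compare(self, other):
        if not isinstance(other, Directory) or self.name != other.name:
            return False
        # The set of child names must be identical before recursing.
        if self.children.keys() != other.children.keys():
            return False
        # Recurse: every child must match its same-named counterpart.
        return all(
            child.compare(other.children[name])
            for name, child in self.children.items()
        )
```

A real implementation would probably return a description of the differences rather than a single boolean, so that the sync step knows what to do; the boolean keeps the recursion easy to see.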

File renames would be relatively easy to detect by comparing CRCs. You'd miss files that changed in both places and were renamed, since their CRCs would no longer match any stored entry. You could add a CRC dictionary to the Directory class if the time to run the comparisons proves to be a performance problem.

For directory moves, if the child files also changed, then you've got a fuzzy logic situation. It would be best to have a merge tool that the user would operate for that situation.

If a file changes in both places and the changes conflict, you also need a user-facing merge strategy. I'd argue that is always a good idea, just to let the user eyeball that the document didn't lose coherence.
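Finding the files that need such a merge decision could be sketched as below. It assumes each side keeps its own stored snapshot as a hypothetical `path -> CRC32` dictionary and that both sides use the same relative paths. Hashes are only ever compared within one side, because the question says the files in one directory are deliberately transformed and so would not hash the same across sides.

```python
def changed_paths(stored, scanned):
    """Paths present in both snapshots whose CRC32 differs."""
    return {p for p in stored.keys() & scanned.keys() if stored[p] != scanned[p]}


def both_sides_changed(stored_a, scanned_a, stored_b, scanned_b):
    """Paths modified under both roots since the last run.

    These are the candidates to hand to a user-facing merge step.
    """
    return sorted(
        changed_paths(stored_a, scanned_a) & changed_paths(stored_b, scanned_b)
    )
```

Files changed on only one side can be propagated automatically; only the paths returned here need the user's attention.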
