Synchronizing the filesystem and cached data at program startup

Posted 2024-08-11 02:56:10

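I have a program that needs to retrieve some data about a set of files (that is, a directory, all the files within it, and subdirectories of certain types). The data is (very) expensive to compute, so rather than traversing the filesystem and computing it at program startup, I cache the data in a SQLite database and use a FileSystemWatcher to monitor changes to the filesystem. This works well while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed, which I presume I can detect via last modified time/size), the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache rather than the filesystem).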

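So the question is: what is a good algorithm for doing this? One way I can think of is to traverse the filesystem and collect the path and last modified time/size of every file in a dictionary. Then I go through the entire list in the database. If an entry has no match in the dictionary, I delete that item from the database/cache. If it has a match, I delete the item from the dictionary. At the end, the dictionary contains all the items whose data needs to be refreshed. This might work, but it seems fairly memory-intensive and time-consuming to run on every startup, so I was wondering whether anyone has a better idea?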
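
The dictionary-based approach described above could look roughly like the following. This is only a sketch, written in Python rather than C# for brevity; the function names `snapshot` and `diff_against_cache` are made up for illustration, and the (mtime, size) pair is the change check the question itself suggests.

```python
import os


def snapshot(root):
    """Map each file path under root to its (mtime_ns, size) signature."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            result[path] = (st.st_mtime_ns, st.st_size)
    return result


def diff_against_cache(on_disk, cached):
    """Return (stale, pending).

    on_disk and cached both map path -> (mtime_ns, size).
    stale:   cached paths whose data must be dropped (deleted or modified).
    pending: on-disk paths whose data must be (re)computed, sorted.
    """
    pending = dict(on_disk)        # start by assuming everything needs work
    stale = []
    for path, signature in cached.items():
        if pending.get(path) == signature:
            del pending[path]      # unchanged: the cached data is still valid
        else:
            stale.append(path)     # missing or changed: drop the cached data
    return stale, sorted(pending)
```

Memory use is one small tuple per file, so this is usually cheap compared with recomputing the data; the dominant cost is the `os.stat` call per file.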

If it matters: the program is written in C#, targets .NET CLR 3.5 only, and uses SQLite for ADO.NET, accessed through the Entity Framework/LINQ for ADO.NET.

迷爱 2024-08-18 02:56:10

Our application is a cross-platform C++ desktop application, but it has very similar requirements. Here's a high-level description of what I did:

In our SQLite database there is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
Every other record refers back to a file_id. This makes it easy to remove "dirty" records when the file changes.
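
As a rough illustration of that layout, here is a minimal sketch using Python's built-in `sqlite3` module (not the C++ code from our application). The `records` table, its `payload` column, and the helper `open_cache` are hypothetical names, and the `ON DELETE CASCADE` clause is an optional convenience that is not part of the description above: it lets SQLite drop dependent rows when a Files row is deleted.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    file_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE,            -- full path of the file
    hash    TEXT NOT NULL,                   -- here: last-modified time as text
    state   TEXT NOT NULL DEFAULT 'UNPARSED'
);
CREATE TABLE IF NOT EXISTS records (         -- hypothetical table of cached data
    record_id INTEGER PRIMARY KEY,
    file_id   INTEGER NOT NULL
              REFERENCES files(file_id) ON DELETE CASCADE,
    payload   TEXT
);
"""


def open_cache(path=":memory:"):
    """Open the cache database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    # SQLite does not enforce foreign keys unless asked to, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```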

Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs (the capitalized names, such as Walker and Loader, are just what I happened to pick for class names):

On 1st Launch

The database is empty. The Walker recursively walks the filesystem and adds the entries to the Files table. The state is set to UNPARSED.
Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data).
This takes a while, so the first launch can be a bit slow.
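
The first-launch steps above can be sketched as follows, again in Python with `sqlite3` rather than our C++ code. The sketch assumes a `files` table with columns `file_id`, `name`, `hash`, and `state` already exists; the function names, the `parse` callback, and the final `PARSED` state are assumptions made for illustration.

```python
import os
import sqlite3


def walk_into_files_table(conn, root):
    """Walker: register every file under root that the table does not know yet.

    Returns the number of newly added rows.
    """
    added = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            known = conn.execute(
                "SELECT 1 FROM files WHERE name = ?", (path,)).fetchone()
            if known:
                continue
            conn.execute(
                "INSERT INTO files (name, hash, state) VALUES (?, ?, 'UNPARSED')",
                (path, str(os.stat(path).st_mtime_ns)),
            )
            added += 1
    conn.commit()
    return added


def load_unparsed(conn, parse):
    """Loader: pass each UNPARSED file to parse(file_id, path), then mark it.

    'PARSED' is a hypothetical name for the finished state.
    Returns the number of files handed to the parser.
    """
    rows = conn.execute(
        "SELECT file_id, name FROM files WHERE state = 'UNPARSED'").fetchall()
    for file_id, path in rows:
        parse(file_id, path)
        conn.execute(
            "UPDATE files SET state = 'PARSED' WHERE file_id = ?", (file_id,))
    conn.commit()
    return len(rows)
```

Because the Walker only touches the Files table and the Loader only reads it, each can be tested against a temporary directory or a hand-filled table without involving the other.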

There's a big testability benefit because you can test the filesystem-walking code independently of the loading/parsing code. On subsequent launches the situation is a little more complicated:

n+1 Launch

The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified, or to DELETED if the file no longer exists.
The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes other related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED.
The Walker then walks the filesystem to pick up new files.
Finally, the Loader loads all UNPARSED files.
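
A minimal sketch of the Scrubber and Deleter steps, in Python with `sqlite3` rather than our C++ code. It assumes a `files` table (`file_id`, `name`, `hash`, `state`) and a hypothetical `records` table keyed by `file_id`; the function names are illustrative. One simplification that is my own choice here: the Scrubber stores the new modification time when it marks a file DIRTY, so the Deleter does not have to touch the filesystem again.

```python
import os
import sqlite3


def scrub(conn):
    """Scrubber: mark rows whose file vanished (DELETED) or changed (DIRTY)."""
    rows = conn.execute("SELECT file_id, name, hash FROM files").fetchall()
    for file_id, path, stored_hash in rows:
        try:
            current = str(os.stat(path).st_mtime_ns)
        except FileNotFoundError:
            conn.execute(
                "UPDATE files SET state = 'DELETED' WHERE file_id = ?",
                (file_id,))
            continue
        if current != stored_hash:
            # Remember the new hash now so the next launch compares against it.
            conn.execute(
                "UPDATE files SET state = 'DIRTY', hash = ? WHERE file_id = ?",
                (current, file_id))
    conn.commit()


def delete_stale(conn):
    """Deleter: drop dependent records, then remove or reset the files row."""
    rows = conn.execute(
        "SELECT file_id, state FROM files "
        "WHERE state IN ('DIRTY', 'DELETED')").fetchall()
    for file_id, state in rows:
        conn.execute("DELETE FROM records WHERE file_id = ?", (file_id,))
        if state == "DELETED":
            conn.execute("DELETE FROM files WHERE file_id = ?", (file_id,))
        else:
            # File still exists: queue it for the Loader again.
            conn.execute(
                "UPDATE files SET state = 'UNPARSED' WHERE file_id = ?",
                (file_id,))
    conn.commit()
```

After these two calls, running the Walker and then the Loader brings the cache back in line with the filesystem.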

Currently the "worst case scenario" (every file changes) is very rare, so we do this every time the application starts up. But by splitting the process up into these steps, we could easily extend the implementation in several ways:

The Scrubber/Deleter could be refactored to leave the dirty records in place until after the new data is loaded (so the application "keeps working" while new data is cached into the database).
The Loader could load/parse on a background thread during idle time in the main application.
If you know something about the data files ahead of time, you could assign a "weight" to the files, load/parse the really important files immediately, and queue up the less important files for processing at a later time.
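
The last two ideas could be combined roughly as below. This is a hypothetical Python sketch, not something we have implemented: `load_by_weight`, `weight_of`, `parse`, and `urgent_weight` are all made-up names, and "background" here simply means a single worker thread.

```python
import threading


def load_by_weight(paths, weight_of, parse, urgent_weight):
    """Parse heavy files right away and the rest on a background thread.

    weight_of(path) returns a number; files at or above urgent_weight are
    parsed before this function returns, heaviest first. The remaining files
    are parsed on a daemon thread, which is returned so the caller can
    join() it before shutting down.
    """
    ranked = sorted(paths, key=weight_of, reverse=True)
    urgent = [p for p in ranked if weight_of(p) >= urgent_weight]
    deferred = [p for p in ranked if weight_of(p) < urgent_weight]

    for path in urgent:
        parse(path)

    def drain():
        # Runs off the main thread; deferred files are still heaviest first.
        for path in deferred:
            parse(path)

    worker = threading.Thread(target=drain, daemon=True)
    worker.start()
    return worker
```

If `parse` writes to SQLite, note that Python's `sqlite3` connections are bound to the thread that created them by default, so the background work would need its own connection.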

Just some thoughts / suggestions. Hope they help!
