Synchronizing filesystem and cached data at program startup

Posted on 2024-08-11 02:56:10

I have a program that needs to retrieve some data about a set of files (that is, a directory and all files within it and subdirectories of certain types). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FileSystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed -- I presume I can detect this via last modified/size), the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).

So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files in a dictionary. Then I go through the entire list in the database. If there is not a match, I delete the item from the database/cache. If there is a match, I delete the item from the dictionary. The dictionary then contains all the items whose data needs to be refreshed. This might work; however, it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?

If it matters: the program is Windows-only, written in C# on the .NET CLR 3.5, using the SQLite for ADO.NET provider, which is accessed via the Entity Framework/LINQ for ADO.NET.
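For concreteness, here is a minimal sketch of the dictionary-based reconciliation pass described above, written with plain ADO.NET commands rather than the Entity Framework for brevity. The CachedFiles table, its Path/LastModified/Size columns, and the RecomputeAndCache helper are hypothetical names used only for illustration, not the actual schema.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SQLite;
using System.IO;

static class CacheSync
{
    // Startup reconciliation pass, roughly as described in the question.
    public static void Synchronize(SQLiteConnection connection, string rootDirectory)
    {
        // 1. Walk the filesystem and remember every file's path, last-modified time and size.
        var onDisk = new Dictionary<string, FileInfo>(StringComparer.OrdinalIgnoreCase);
        foreach (string path in Directory.GetFiles(rootDirectory, "*", SearchOption.AllDirectories))
            onDisk[path] = new FileInfo(path);

        // 2. Walk the cache. Entries that still match are removed from the dictionary,
        //    so only new or changed files remain in it afterwards; everything else is purged.
        var stale = new List<string>();
        using (var cmd = new SQLiteCommand("SELECT Path, LastModified, Size FROM CachedFiles", connection))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                string path = reader.GetString(0);
                FileInfo info;
                if (onDisk.TryGetValue(path, out info)
                    && info.LastWriteTimeUtc == reader.GetDateTime(1)
                    && info.Length == reader.GetInt64(2))
                {
                    onDisk.Remove(path);   // unchanged: cached data can be reused
                }
                else
                {
                    stale.Add(path);       // deleted or modified: drop the cached data
                }
            }
        }

        foreach (string path in stale)
        {
            using (var delete = new SQLiteCommand("DELETE FROM CachedFiles WHERE Path = @path", connection))
            {
                delete.Parameters.AddWithValue("@path", path);
                delete.ExecuteNonQuery();
            }
        }

        // 3. Whatever is left in the dictionary needs its data (re)computed and cached.
        foreach (KeyValuePair<string, FileInfo> entry in onDisk)
            RecomputeAndCache(connection, entry.Key, entry.Value);
    }

    // Placeholder for the expensive per-file computation plus the INSERT into the cache.
    static void RecomputeAndCache(SQLiteConnection connection, string path, FileInfo info)
    {
    }
}
```

Wrapping the deletes and re-inserts in a single transaction would keep the startup cost down, but the overall shape is the same as the approach described in the question.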

Comments (3)

迷爱 2024-08-18 02:56:10

Our application is a cross-platform C++ desktop application, but it has very similar requirements. Here's a high-level description of what I did:

  • In our SQLite database there is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
  • Every other record refers back to a file_id. This makes it easy to remove "dirty" records when the file changes (a rough schema sketch follows this list).
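Translated into the asker's C#/SQLite setting, that layout could look roughly like the following. The exact column types, the ParsedData table, and the state names listed in the comments are assumptions for illustration, not the answerer's actual schema.

```csharp
using System.Data.SQLite;

static class Schema
{
    // Illustrative DDL for the layout described above.
    const string CreateTables = @"
        CREATE TABLE IF NOT EXISTS Files (
            file_id INTEGER PRIMARY KEY AUTOINCREMENT,
            name    TEXT NOT NULL,   -- full path of the file
            hash    TEXT NOT NULL,   -- currently just the last-modified date
            state   TEXT NOT NULL    -- UNPARSED / DIRTY / DELETED
        );
        -- Every derived record points back at its file_id, so purging stale data is easy.
        CREATE TABLE IF NOT EXISTS ParsedData (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            file_id INTEGER NOT NULL REFERENCES Files(file_id),
            payload TEXT
        );";

    public static void EnsureCreated(SQLiteConnection connection)
    {
        using (var cmd = new SQLiteCommand(CreateTables, connection))
            cmd.ExecuteNonQuery();
    }
}
```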

Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs (the names in italics are just what I happened to pick for class names):

On 1st Launch

  • The database is empty. The Walker recursively walks the filesystem and adds the entries into the Files table. The state is set to UNPARSED (a sketch of this step follows the list).
  • Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data).
  • This takes a while, so 1st launch can be a bit slow.
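A minimal sketch of the Walker step under the assumed schema above (as in the earlier sketches, this relies on System.Data.SQLite and System.IO; the guard in the INSERT just avoids re-adding files that are already tracked):

```csharp
// Walker: recursively add any file not already in Files, with state = UNPARSED.
static void Walk(SQLiteConnection connection, string root)
{
    foreach (string path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
    {
        using (var cmd = new SQLiteCommand(
            @"INSERT INTO Files (name, hash, state)
              SELECT @name, @hash, 'UNPARSED'
              WHERE NOT EXISTS (SELECT 1 FROM Files WHERE name = @name)", connection))
        {
            cmd.Parameters.AddWithValue("@name", path);
            cmd.Parameters.AddWithValue("@hash", File.GetLastWriteTimeUtc(path).ToString("o"));
            cmd.ExecuteNonQuery();
        }
    }
}
```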

There's a big testability benefit because you can test the filesystem-walking code independently from the loading/parsing code. On subsequent launches the situation is a little more complicated:

n+1 Launch

  • The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified, or to DELETED if the file no longer exists (see the sketch after this list).
  • The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes other related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED.
  • The Walker then walks the filesystem to pick up new files.
  • Finally, the Loader loads all UNPARSED files.
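Under the same assumed schema, the Scrubber and Deleter passes might look something like this sketch (again an illustration, not the answerer's implementation):

```csharp
// Scrubber: compare each tracked file against the disk and mark mismatches.
static void Scrub(SQLiteConnection connection)
{
    var newStates = new Dictionary<long, string>();   // file_id -> DIRTY or DELETED

    using (var cmd = new SQLiteCommand("SELECT file_id, name, hash FROM Files", connection))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            long id = reader.GetInt64(0);
            string path = reader.GetString(1);
            string hash = reader.GetString(2);

            if (!File.Exists(path))
                newStates[id] = "DELETED";
            else if (File.GetLastWriteTimeUtc(path).ToString("o") != hash)
                newStates[id] = "DIRTY";
        }
    }

    foreach (KeyValuePair<long, string> entry in newStates)
    {
        using (var update = new SQLiteCommand("UPDATE Files SET state = @state WHERE file_id = @id", connection))
        {
            update.Parameters.AddWithValue("@state", entry.Value);
            update.Parameters.AddWithValue("@id", entry.Key);
            update.ExecuteNonQuery();
        }
    }
}

// Deleter: purge dependent records, then remove DELETED files and re-queue DIRTY ones
// (the hash is assumed to be refreshed when the file is re-parsed).
static void Delete(SQLiteConnection connection)
{
    Execute(connection, @"DELETE FROM ParsedData WHERE file_id IN
                          (SELECT file_id FROM Files WHERE state IN ('DIRTY', 'DELETED'))");
    Execute(connection, "DELETE FROM Files WHERE state = 'DELETED'");
    Execute(connection, "UPDATE Files SET state = 'UNPARSED' WHERE state = 'DIRTY'");
}

static void Execute(SQLiteConnection connection, string sql)
{
    using (var cmd = new SQLiteCommand(sql, connection))
        cmd.ExecuteNonQuery();
}
```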

Currently the "worst case scenario" (every file changes) is very rare - so we do this every time the application starts-up. But by splitting the process up unto these steps we could easily extend the implementation to:

  • The Scrubber/Deleter could be refactored to leave the dirty records in-place until after the new
    data is loaded (so the application "keeps working" while new data is cached into the database)
  • The Loader could load/parse on a background thread during an idle time in the main application
  • If you know something about the data files ahead of time you could assign a 'weight' to the files and load/parse the really important files immediately and queue up the less important files for processing at a later time.

Just some thoughts / suggestions. Hope they help!

一曲琵琶半遮面シ 2024-08-18 02:56:10

Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798(VS.85).aspx

EDIT: I think it requires rather high privileges, unfortunately

浅忆 2024-08-18 02:56:10

The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
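A minimal sketch of such a companion logger, using FileSystemWatcher and a plain text file; the command-line arguments and the tab-separated line format are arbitrary choices for illustration.

```csharp
using System;
using System.IO;

// Tiny change-journal companion: watch a directory tree and append each
// change as one line to a log file the main application replays at startup.
class ChangeLogger
{
    static void Main(string[] args)
    {
        string root = args[0];        // directory to watch
        string logPath = args[1];     // journal file the main application reads and clears

        var watcher = new FileSystemWatcher(root)
        {
            IncludeSubdirectories = true,
            NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite | NotifyFilters.Size
        };

        FileSystemEventHandler log = (sender, e) =>
            File.AppendAllText(logPath, e.ChangeType + "\t" + e.FullPath + Environment.NewLine);

        watcher.Created += log;
        watcher.Changed += log;
        watcher.Deleted += log;
        watcher.Renamed += (sender, e) =>
            File.AppendAllText(logPath, "Renamed\t" + e.OldFullPath + "\t" + e.FullPath + Environment.NewLine);

        watcher.EnableRaisingEvents = true;
        Console.WriteLine("Logging changes under {0} to {1}. Press Enter to stop.", root, logPath);
        Console.ReadLine();
    }
}
```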

However, if that is unacceptable to you for some reason, let us try to look at the original problem.

First of all, you have to accept that, in the worst case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (although will not necessarily) take a long time. Once you realize that, you have to think about doing the job in the background, without blocking the application.

Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.

Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).

Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those in mind.

First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
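A sketch of that per-file probe, reusing the hypothetical CachedFiles table from the earlier sketch; the index name is likewise an arbitrary choice.

```csharp
// Run once: CREATE INDEX IF NOT EXISTS idx_cachedfiles_path ON CachedFiles(Path);
static bool IsCachedAndCurrent(SQLiteConnection connection, string path)
{
    FileInfo info = new FileInfo(path);
    using (var cmd = new SQLiteCommand(
        "SELECT 1 FROM CachedFiles WHERE Path = @path AND LastModified = @modified AND Size = @size",
        connection))
    {
        cmd.Parameters.AddWithValue("@path", path);
        cmd.Parameters.AddWithValue("@modified", info.LastWriteTimeUtc);
        cmd.Parameters.AddWithValue("@size", info.Length);
        return cmd.ExecuteScalar() != null;   // a row comes back only for an up-to-date entry
    }
}
```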

Second, you don't actually have to query the database at all :-)
Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it's changed. If it's not, you know you already caught its change last time (with your FileSystemWatcher).
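A sketch of that shortcut, assuming the last-run time is kept in a hypothetical DateTime user setting (Properties.Settings.Default.LastRunUtc) and files are compared on LastWriteTimeUtc; note that deletions still have to be detected some other way.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LastRunCheck
{
    // Returns files added or modified since the recorded last-run time, then records "now".
    // Properties.Settings.Default.LastRunUtc is a hypothetical DateTime user setting.
    public static List<string> ChangedSinceLastRun(string root)
    {
        DateTime lastRun = Properties.Settings.Default.LastRunUtc;
        var changed = new List<string>();

        foreach (string path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
        {
            if (File.GetLastWriteTimeUtc(path) > lastRun)
                changed.Add(path);   // new or modified since the previous session
        }

        // Deleted files are not reported this way; they still need a cache-vs-disk check.
        Properties.Settings.Default.LastRunUtc = DateTime.UtcNow;
        Properties.Settings.Default.Save();
        return changed;
    }
}
```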

Hope this helps. Have fun.
