Synchronizing filesystem and cached data on program startup
I have a program that needs to retrieve some data about a set of files (that is, a directory, all files within it, and subdirectories of certain types). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FileSystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed -- I presume I can detect this via last modified/size), the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).
So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files in a dictionary. Then I go through the entire list in the database. If there is not a match, then I delete the item from the database/cache. If there is a match, then I delete the item from the dictionary. Then the dictionary contains all the items whose data needs to be refreshed. This might work, however it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?
If it matters: the program is Windows-only, written in C# on .NET 3.5, and uses the SQLite ADO.NET provider, accessed via the Entity Framework/LINQ.
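The dictionary approach described above can be sketched roughly as follows. This is only an illustration, written in Python for brevity rather than C#; the function names `snapshot` and `diff_cache` are hypothetical, and the cache is assumed to be available as a mapping from path to (last modified, size).

```python
import os

def snapshot(root):
    """Walk root and map each file path to its (mtime, size) signature."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            files[path] = (st.st_mtime, st.st_size)
    return files

def diff_cache(on_disk, cached):
    """Compare a filesystem snapshot with cached {path: (mtime, size)} rows.

    Returns (stale, refresh): cached paths whose data should be deleted,
    and on-disk paths whose data must be (re)computed.
    """
    refresh = dict(on_disk)      # start by assuming every file needs work
    stale = []
    for path, signature in cached.items():
        if refresh.get(path) == signature:
            del refresh[path]    # unchanged: cached data is still valid
        else:
            stale.append(path)   # deleted or modified: drop the cached data
    return stale, sorted(refresh)
```

A modified file ends up in both lists: its old cached data is dropped and it stays in the set to recompute.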
Comments (3)
Our application is a cross-platform C++ desktop application, but has very similar requirements. Here's a high-level description of what I did:
- There is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
- Other records refer back to the file_id. This makes it easy to remove "dirty" records when the file changes.
Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs:
On 1st Launch
1. The filesystem is walked and each file is added to the Files table. The state is set to UNPROCESSED.
2. The Files table is then scanned for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data).
There's a big testability benefit because you can test the filesystem-walking code independently from the loading/parsing code. On subsequent launches the situation is a little more complicated:
n+1 Launch
1. A first pass iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified, or DELETED if the file no longer exists.
2. A second pass iterates over the Files table looking for DIRTY and DELETED files. It deletes other related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED.
3. Finally, the UNPARSED files are loaded.
Currently the "worst case scenario" (every file changes) is very rare, so we do this every time the application starts up. But by splitting the process up into these steps we could easily extend the implementation to:
- ... data is loaded (so the application "keeps working" while new data is cached into the database)
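The steps above might look something like the following minimal sketch. It uses Python and its standard `sqlite3` module rather than the C++ of the original application. All function names, the `Data` table, and the `PARSED` state are hypothetical, and the `walk` step that picks up new files on later launches is an assumption, since the description does not spell it out. For simplicity it uses UNPARSED as the single "needs parsing" state.

```python
import os
import sqlite3

def open_cache(db_path=":memory:"):
    db = sqlite3.connect(db_path)
    # name holds the full path; hash holds the last-modified time.
    db.execute("""CREATE TABLE IF NOT EXISTS Files (
        file_id INTEGER PRIMARY KEY,
        name    TEXT UNIQUE NOT NULL,
        hash    REAL NOT NULL,
        state   TEXT NOT NULL)""")
    # Hypothetical table standing in for the expensive derived records.
    db.execute("""CREATE TABLE IF NOT EXISTS Data (
        file_id INTEGER NOT NULL REFERENCES Files(file_id),
        value   TEXT)""")
    return db

def scrub(db):
    """Mark rows whose file vanished (DELETED) or changed (DIRTY)."""
    rows = db.execute("SELECT file_id, name, hash FROM Files").fetchall()
    for file_id, name, old_hash in rows:
        if not os.path.exists(name):
            state = "DELETED"
        elif os.path.getmtime(name) != old_hash:
            state = "DIRTY"
        else:
            continue
        db.execute("UPDATE Files SET state = ? WHERE file_id = ?", (state, file_id))

def delete_stale(db):
    """Drop related records, then delete or reset the Files row itself."""
    db.execute("DELETE FROM Data WHERE file_id IN "
               "(SELECT file_id FROM Files WHERE state IN ('DIRTY', 'DELETED'))")
    db.execute("DELETE FROM Files WHERE state = 'DELETED'")
    rows = db.execute("SELECT file_id, name FROM Files WHERE state = 'DIRTY'").fetchall()
    for file_id, name in rows:
        db.execute("UPDATE Files SET state = 'UNPARSED', hash = ? WHERE file_id = ?",
                   (os.path.getmtime(name), file_id))

def walk(db, root):
    """Add files that are not in the table yet, marked UNPARSED."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            db.execute("INSERT OR IGNORE INTO Files (name, hash, state) "
                       "VALUES (?, ?, 'UNPARSED')", (path, os.path.getmtime(path)))

def load(db, parse):
    """Hand every UNPARSED file to the parser and store its result."""
    rows = db.execute("SELECT file_id, name FROM Files WHERE state = 'UNPARSED'").fetchall()
    for file_id, name in rows:
        db.execute("INSERT INTO Data (file_id, value) VALUES (?, ?)", (file_id, parse(name)))
        db.execute("UPDATE Files SET state = 'PARSED' WHERE file_id = ?", (file_id,))

def sync(db, root, parse):
    """Run the steps in order; on an empty database only walk and load do work."""
    scrub(db)
    delete_stale(db)
    walk(db, root)
    load(db, parse)
    db.commit()
```

Because each step only reads and writes the `Files` table, each one can be tested on its own against a small temporary directory.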
Just some thoughts / suggestions. Hope they help!
Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798(VS.85).aspx
EDIT: I think it requires rather high privileges, unfortunately
The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
However, if that is unacceptable to you for some reason, let us try to look at the original problem.
First of all, you have to accept that, in the worst case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (although not necessarily will) take a long time. Once you realize that, you have to think about doing the job in background, without blocking the application.
Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.
Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).
Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those on my mind.
First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
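As a rough sketch of that per-file lookup (in Python with the standard `sqlite3` module; the `files` table layout and the names `open_path_cache` and `needs_refresh` are assumptions made for illustration):

```python
import sqlite3

def open_path_cache(db_path=":memory:"):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS files "
               "(path TEXT NOT NULL, mtime REAL NOT NULL, size INTEGER NOT NULL)")
    # The index turns each per-file lookup into a B-tree search instead of a table scan.
    db.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_files_path ON files(path)")
    return db

def needs_refresh(db, path, mtime, size):
    """True if the file is missing from the cache or its signature changed."""
    row = db.execute("SELECT mtime, size FROM files WHERE path = ?", (path,)).fetchone()
    return row != (mtime, size)
```

This only answers "is this file new or changed?" while walking the tree; rows for files that no longer exist would still need a separate pass over the table.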
Second, you don't actually have to query the database at all :-)
Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it's changed. If it's not, you know you've caught its change last time (with your FileSystemWatcher).
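A minimal sketch of that timestamp check, again in Python for brevity; `changed_since` is a hypothetical helper, and `last_run` is assumed to be stored as seconds since the epoch:

```python
import os

def changed_since(root, last_run):
    """Yield paths under root whose last-modified time is newer than last_run."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_run:
                yield path
```

The tree is still walked in full, but no database query is made per file. This check sees only files that exist now, so it says nothing about files removed while the application was not running.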
Hope this helps. Have fun.