Detecting when data has changed
Ok, so the story is like this:
-- I have lots of files (pretty big, around 25GB) in a particular format that need to be imported into a datastore
-- these files are continuously updated with data, sometimes new and sometimes the same data
-- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works is that I drop all the data in the database each time and reimport it, but this won't work anymore since I'll need a timestamp for when an item changed.
-- the files contain strings and numbers (titles, orders, prices, etc.)
The only solutions I could think of are:
-- compute a hash for each row in the database and compare it against the hash of the corresponding row in the file; if they differ, update the database (a sketch follows this list)
-- keep 2 copies of the files, the previous and the current ones, and diff them (which is probably faster than updating the db), then update the db based on those diffs.
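A minimal sketch of the first idea in Java, assuming each line carries a stable ID followed by the rest of the record (the tab separator, the SHA-256 choice and the class name are illustrative, not part of my actual format):

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ChangedRowDetector {

    // Hex-encoded SHA-256 of a record's payload.
    static String sha256(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // previousHashes: id -> hash remembered from the last import
    // (loaded from the database or from a side file).
    static List<String> changedIds(Path file, Map<String, String> previousHashes) throws Exception {
        List<String> changed = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Hypothetical layout: "<id>\t<payload>"; adjust to the real file format.
                String[] parts = line.split("\t", 2);
                if (parts.length < 2) continue;               // skip malformed lines
                String id = parts[0];
                String hash = sha256(parts[1]);
                if (!hash.equals(previousHashes.get(id))) {   // new or changed row
                    changed.add(id);
                }
            }
        }
        return changed;
    }
}
```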
Since the amount of data is very big, I am kind of out of options for now. In the long run I'll get rid of the files and the data will be pushed straight into the database, but the problem remains.
Any advice will be appreciated.
Answers (4)
Problem definition as understood.
Let's say your file contains records keyed by an ID with a payload such as [Name, Age].
As you stated, a row can be added or updated, hence a later version of the file contains, say, one changed record and one brand-new record.
Now the requirement is to update the database by inserting/updating only those 2 records, in two sql queries or one batch query containing two sql statements, to upload the data into the database.
Store the hash value of each record's [Name, Age] payload against its ID in an in-memory Map, where the ID is the key and the value is the hash [if you require scalability, use hazelcast].
Your batch framework that loads the data [again assuming it treats one line of the file as one record] needs to check the computed hash value against the one stored for that ID in the in-memory map. The first-time creation of the map can also be done using your batch framework while reading the files.
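A minimal sketch of that check, assuming the hash of the [Name, Age] payload has already been computed per line; the plain ConcurrentHashMap stands in for a Hazelcast IMap, which exposes the same Map interface (the map name and the enum are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Decide what to do with one record based on the hash remembered for its ID.
// For scalability the map could instead be hazelcastInstance.getMap("recordHashes").
public class HashCheck {

    public enum Action { INSERT, UPDATE, SKIP }

    private final Map<String, String> hashById = new ConcurrentHashMap<>();

    public Action classify(String id, String payloadHash) {
        String previous = hashById.put(id, payloadHash);      // remember the latest hash
        if (previous == null)              return Action.INSERT; // never seen this ID
        if (!previous.equals(payloadHash)) return Action.UPDATE; // payload changed
        return Action.SKIP;                                       // identical record
    }
}
```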
You might go for parallel processing, chunk processing and in-memory data partitioning using spring-batch and hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
Instead of computing the hash for each row from the database on demand, why don't you store the hash value instead?
Then you could just compute the hash value of the row from the file in question and compare it against the one stored in the database.
Update:
Another option that came to my mind is to store the Last Modified date/time information on the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
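A sketch of what the hash-guarded update could look like over JDBC, assuming a hypothetical items(id, title, price, row_hash, changed_at) table; new rows would still need a separate INSERT path, and the timestamp column gives the "changed when" information asked for:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Updates a row only when the stored hash differs, stamping changed_at in the same statement.
public class HashGuardedUpdate {

    public boolean updateIfChanged(Connection conn, long id, String title,
                                   double price, String newHash) throws Exception {
        String sql = "UPDATE items SET title = ?, price = ?, row_hash = ?, " +
                     "changed_at = CURRENT_TIMESTAMP WHERE id = ? AND row_hash <> ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, title);
            ps.setDouble(2, price);
            ps.setString(3, newHash);
            ps.setLong(4, id);
            ps.setString(5, newHash);
            return ps.executeUpdate() > 0;   // true only when the row actually changed
        }
    }
}
```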
Well, regardless of what you use, your worst case is going to be O(n), which with n ~ 25GB of data is not so pretty.
Unless you can modify the process that writes to the files.
Since you are not updating all of the 25GBs all of the time, that is your biggest potential for saving cycles.
1. Don't write randomly
Why don't you make the process that writes the data append only? This way you'll have more data, but you'll have full history and you can track which data you already processed (what you already put in the datastore).
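A minimal sketch of the append-only approach, assuming the importer may persist the byte offset it reached in a small side file (the file names and record handling are illustrative):

```java
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// With an append-only file, only the bytes written since the last run need to be read.
public class AppendOnlyTail {

    public long processNewEntries(Path dataFile, Path offsetFile) throws Exception {
        long offset = Files.exists(offsetFile)
                ? Long.parseLong(Files.readString(offsetFile).trim())
                : 0L;
        try (RandomAccessFile raf = new RandomAccessFile(dataFile.toFile(), "r")) {
            raf.seek(offset);                          // skip everything already processed
            String line;
            while ((line = raf.readLine()) != null) {
                handleRecord(line);                    // insert/update the datastore
            }
            offset = raf.getFilePointer();
        }
        Files.writeString(offsetFile, Long.toString(offset));
        return offset;
    }

    private void handleRecord(String line) {
        // placeholder: parse the line and push it to the database
    }
}
```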
2. Keep a list of changes if you must write randomly
Alternatively, if you really must do random writes, you could keep a list of updated rows. This list can then be processed as in #1, and you can track which changes you have processed. If you want to save some space, you can keep a list of blocks in which the data changed (where a block is a unit you define).
Furthermore, you can keep checksums/hashes of the changed blocks/lines. However, this might not be very interesting: it is not so cheap to compute, and direct comparison might be cheaper (though if you have free CPU cycles during writing, it might save you some reading time later, YMMV).
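If keeping two full copies of the files is too expensive, one option along these lines is to keep only the per-block checksums from the previous run and compare them on the next run; a sketch with an arbitrary block size:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Checksums the file in fixed-size blocks; only blocks whose checksum differs from the
// previous run need to be re-parsed and pushed to the database.
public class BlockChecksums {

    public static List<Long> checksums(Path file, int blockSize) throws Exception {
        List<Long> result = new ArrayList<>();
        byte[] buffer = new byte[blockSize];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.readNBytes(buffer, 0, blockSize)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(buffer, 0, read);
                result.add(crc.getValue());
            }
        }
        return result;
    }

    // Indices of blocks that are new or changed compared to the previous run.
    public static List<Integer> changedBlocks(List<Long> previous, List<Long> current) {
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < current.size(); i++) {
            if (i >= previous.size() || !previous.get(i).equals(current.get(i))) {
                changed.add(i);
            }
        }
        return changed;
    }
}
```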
Practically, it's the kind of problem that backup software has to solve, so why not use some of their standard solutions?
The best one would be to hook the WriteFile calls so that you receive a callback on each update. This would work pretty well with binary records.
Something that I cannot understand: the files are actually text files that are not just appended to, but updated in place? That is highly inefficient (together with the idea of keeping 2 copies of the files, because it will make file caching work even worse).
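Hooking WriteFile is Windows-specific; as a coarser, portable stand-in (not what this answer proposes), Java's WatchService can at least report which files were modified so that only those get re-scanned for changed rows; the directory path is illustrative:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

// Watches a directory and prints which files were created or modified.
public class FileChangeWatcher {

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/data/imports");            // assumption: files live here
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();                // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) continue;
                Path changed = dir.resolve((Path) event.context());
                System.out.println("Re-scan: " + changed);
            }
            if (!key.reset()) break;                      // directory no longer accessible
        }
    }
}
```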