Detecting when data has changed

Posted 2024-09-10 13:56:46

Ok, so the story is like this:

-- I have lots of files (pretty big, around 25GB) in a particular format that need to be imported into a datastore

-- these files are continuously updated with data, sometimes new, sometimes the same data

-- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database

-- the way it works now is that I drop all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed.

-- the files contain strings and numbers (titles, orders, prices, etc.)

The only solutions I could think of are:

-- compute a hash for each row in the database, compare it against the hash of the corresponding row from the file, and if they differ, update the database

-- keep 2 copies of the files, the previous and the current ones, and diff them (which is probably faster than updating the db), then update the db based on those diffs (a rough sketch of this follows below).

Since the amount of data is very big, I am kind of out of options for now. In the long run I'll get rid of the files and push the data straight into the database, but the problem still remains.
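For the second option, here is a minimal sketch of what such a diff could look like, assuming the files are line-oriented with a stable ID in the first column; the file names and column layout are invented for illustration, and indexing the whole previous snapshot in memory would of course need chunking or per-line hashes once the files approach 25GB:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SnapshotDiff {
    // Index a snapshot by its first column (the ID), mapping ID -> full line.
    // For very large files you would store a hash of the line instead, or work in chunks.
    static Map<String, String> index(String path) throws IOException {
        Map<String, String> rows = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma > 0) {
                    rows.put(line.substring(0, comma), line);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> previous = index("items_previous.csv"); // last imported copy
        try (BufferedReader in = new BufferedReader(new FileReader("items_current.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma <= 0) continue;
                String id = line.substring(0, comma);
                String old = previous.get(id);
                if (old == null) {
                    System.out.println("INSERT: " + line);   // new row -> insert with a fresh timestamp
                } else if (!old.equals(line)) {
                    System.out.println("UPDATE: " + line);   // changed row -> update and bump its timestamp
                }
                // identical lines are skipped, so their timestamps stay untouched
            }
        }
    }
}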

Any advice will be appreciated.


Comments (4)

黯然#的苍凉 2024-09-17 13:56:46

The problem definition as I understand it:

Let’s say your file contains

ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40

As you stated, rows can be added or updated, so the file becomes

ID,Name,Age
1,Jim,20    -- to be discarded 
2,Tim,35    -- to be updated
3,Kim,40    -- to be discarded 
4,Zim,30    --  to be inserted 

Now the requirement is to update the database by inserting/updating only the 2 affected records above, either as two SQL queries or as one batch containing two SQL statements.
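As a hedged illustration of that one-batch idea, here is a plain JDBC sketch (the table and column names are invented) that ships both statements in a single round trip:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class TwoStatementBatch {
    // Sends the UPDATE for Tim and the INSERT for Zim as one batch.
    static void applyChanges(Connection db) throws SQLException {
        try (Statement batch = db.createStatement()) {
            batch.addBatch("UPDATE people SET age = 35, updated_at = CURRENT_TIMESTAMP WHERE id = 2");
            batch.addBatch("INSERT INTO people (id, name, age, updated_at) VALUES (4, 'Zim', 30, CURRENT_TIMESTAMP)");
            batch.executeBatch();   // one round trip, two statements
        }
    }
}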

I am making the following assumptions here:

  • You cannot modify the existing process that creates the files.
  • You are using some batch processing [read from file - process in memory - write to DB]
    to upload the data into the database.

Store the hash value of each record's [Name,Age] against its ID in an in-memory map, where the ID is the key and the hash is the value (if you require scalability, use Hazelcast).

Your batch framework that loads the data (again assuming it treats one line of the file as one record) needs to check the computed hash value against the ID in the in-memory map. The initial population of the map can also be done with the same batch framework while reading the files.

 if (ID present in map)
     compare hashes
         if same      -> discard the record
         if different -> create an UPDATE SQL statement
 else (ID not present in map)
     create an INSERT SQL statement and store the hash value
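A minimal plain-Java version of that check might look like the sketch below; a ConcurrentHashMap stands in for the Hazelcast map, and the table/column names in the generated SQL are invented for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashCheck {
    // ID -> hash of [Name,Age]; in a scaled-out setup this would be a Hazelcast IMap instead.
    private final Map<Long, String> idToHash = new ConcurrentHashMap<>();

    private static String sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the SQL to run for this record, or null if the row is unchanged.
        In real code, use PreparedStatement parameters instead of string concatenation. */
    public String process(long id, String name, int age) {
        String hash = sha256(name + "," + age);
        String previous = idToHash.put(id, hash);
        if (previous == null) {
            return "INSERT INTO items (id, name, age, updated_at) VALUES (" + id + ", '" + name + "', " + age + ", CURRENT_TIMESTAMP)";
        }
        if (!previous.equals(hash)) {
            return "UPDATE items SET name = '" + name + "', age = " + age + ", updated_at = CURRENT_TIMESTAMP WHERE id = " + id;
        }
        return null; // same hash -> discard, nothing to write
    }

    public static void main(String[] args) {
        HashCheck check = new HashCheck();
        System.out.println(check.process(2, "Tim", 30)); // first sight -> INSERT
        System.out.println(check.process(2, "Tim", 35)); // changed     -> UPDATE
        System.out.println(check.process(2, "Tim", 35)); // unchanged   -> null
    }
}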

You might go for parallel processing, chunk processing, and in-memory data partitioning using spring-batch and hazelcast.

http://www.hazelcast.com/

http://static.springframework.org/spring-batch/

Hope this helps.

不弃不离 2024-09-17 13:56:46

Instead of computing the hash for each row from the database on demand, why not store the hash value?

Then you could just compute the hash values for the file in question and compare them against the ones stored in the database.
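As a hedged JDBC sketch of that idea (the table and column names are assumptions): keep a row_hash column next to the data and let the WHERE clause skip rows whose stored hash already matches, so the change timestamp only moves when something actually changed.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StoredHashUpdate {
    /**
     * Updates one row only if the freshly computed hash differs from the stored one.
     * Returns true when something actually changed. Note that 0 affected rows can also
     * mean the ID does not exist yet, so the insert path still has to be handled separately.
     */
    static boolean updateIfChanged(Connection db, long id, String name, int age, String newHash)
            throws SQLException {
        String sql = "UPDATE items SET name = ?, age = ?, row_hash = ?, updated_at = CURRENT_TIMESTAMP "
                   + "WHERE id = ? AND row_hash <> ?";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setInt(2, age);
            ps.setString(3, newHash);
            ps.setLong(4, id);
            ps.setString(5, newHash);
            return ps.executeUpdate() > 0; // 0 = stored hash already equals newHash (or id missing)
        }
    }
}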

Update:

Another option that came to my mind is to store the last-modified date/time information in the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.

茶底世界 2024-09-17 13:56:46

Well, regardless of what you use, your worst case is going to be O(n), which for n ~ 25GB of data is not so pretty.

Unless you can modify the process that writes to the files.

Since you are not updating all 25GB all of the time, that is your biggest potential for saving cycles.

1. Don't write randomly
Why don't you make the process that writes the data append-only? This way you'll have more data, but you'll have the full history and you can track which data you have already processed (what you have already put in the datastore).
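A minimal sketch of the bookkeeping this needs, assuming an append-only data file plus a small side file that remembers how far the importer got last time (the file names are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AppendOnlyImport {
    public static void main(String[] args) throws IOException {
        Path offsetFile = Paths.get("items.log.offset");
        // Where we stopped last time; 0 on the very first run.
        long offset = Files.exists(offsetFile)
                ? Long.parseLong(Files.readString(offsetFile).trim())
                : 0L;

        try (RandomAccessFile log = new RandomAccessFile("items.log", "r")) {
            log.seek(offset);                          // skip everything already imported
            String line;
            while ((line = log.readLine()) != null) {
                System.out.println("import: " + line); // push this record into the datastore
            }
            offset = log.getFilePointer();             // remember the new high-water mark
        }
        Files.writeString(offsetFile, Long.toString(offset));
    }
}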

2. Keep a list of changes if you must write randomly
Alternatively, if you really must do the random writes, you could keep a list of updated rows. This list can then be processed as in #1, and you can track which changes you have processed. If you want to save some space you can keep a list of blocks in which the data changed (where a block is a unit that you define).

Furthermore, you can keep checksums/hashes of the changed blocks/lines. However, this might not be very interesting - it is not so cheap to compute, and direct comparison might be cheaper (if you have free CPU cycles during writing it might save you some reading time later, YMMV).

Note(s)

  • Both #1 and #2 are interesting only if you can make adjustments to the process that writes the data to disk
  • If you cannot modify the process that writes the 25GB of data, then I don't see how checksums/hashes can help - you have to read all the data anyway to compute the hashes (since you don't know what changed), so you might as well compare directly while you read and come up with a list of rows to update/add (or update/add directly)
  • Using diff algorithms might be suboptimal: a diff algorithm will not only look for the lines that changed, but also search for a minimal edit script between the two text files given certain formatting options (in diff this can be controlled with -H or --minimal to make it work faster or slower, i.e. use a heuristic algorithm or search for the exact minimal solution; IIRC the heuristic is O(n log n), which is not bad, but still slower than the O(n) you get if you do a direct line-by-line comparison)

破晓 2024-09-17 13:56:46

Practically, this is the kind of problem that has to be solved by backup software, so why not use some of their standard solutions?
The best one would be to hook the WriteFile calls so that you receive a callback on each update. This would work pretty well with binary records.
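Hooking WriteFile is Windows-specific; as a rough cross-platform stand-in, here is a sketch using Java's WatchService, which only reports that a file changed (not which bytes), so the changed file still has to be re-read or diffed afterwards:

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class FileChangeWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("data");   // directory that holds the exported files (hypothetical)
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();               // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = (Path) event.context();
                System.out.println(event.kind() + ": " + changed); // trigger a re-read / diff of just this file
            }
            if (!key.reset()) {
                break;                                   // directory no longer accessible
            }
        }
    }
}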

Something that I cannot understand: the files are actually text files that are not just appended to, but updated in place? This is highly inefficient (as is the idea of keeping 2 copies of the files, because it will make file caching work even worse).
