Versioning datasets in an RDBMS using initials and deltas
I'm working on a system that mirrors remote datasets using initials and deltas. When an initial comes in, it mass deletes anything preexisting and mass inserts the fresh data. When a delta comes in, the system does a bunch of work to translate it into updates, inserts, and deletes. Initials and deltas are processed inside long transactions to maintain data integrity.
Unfortunately the current solution isn't scaling very well. The transactions are so large and long running that our RDBMS bogs down with various contention problems. Also, there isn't a good audit trail for how the deltas are applied, making it difficult to troubleshoot issues causing the local and remote versions of the dataset to get out of sync.
One idea is to not run the initials and deltas in transactions at all, and instead to attach a version number to each record indicating which delta or initial it came from. Once an initial or delta is successfully loaded, the application can be alerted that a new version of the dataset is available.
This just leaves the issue of how exactly to compose a view of the dataset up to a given version from the initials and deltas. (Apple's Time Machine does something similar, using hard links on the file system to create a "view" of a certain point in time.)
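To make the idea concrete, here is a minimal sketch of the schema it might imply. All table and column names are hypothetical; the verb column anticipates that deltas translate into inserts, updates, and deletes, so deletions have to be representable as rows too:

```sql
-- Hypothetical sketch: one row per initial or delta load,
-- plus a version stamp on every mirrored record.
CREATE TABLE dataset_version (
    version_id INT        PRIMARY KEY,  -- monotonically increasing
    kind       VARCHAR(7) NOT NULL,     -- 'initial' or 'delta'
    loaded_at  TIMESTAMP  NOT NULL
);

CREATE TABLE item (
    item_key             INT NOT NULL,          -- the record's natural key
    dataset_version_id   INT NOT NULL REFERENCES dataset_version (version_id),
    dataset_version_verb VARCHAR(6) NOT NULL,   -- 'insert', 'update', or 'delete'
    payload              VARCHAR(255),          -- the mirrored data itself
    PRIMARY KEY (item_key, dataset_version_id)
);
```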
Does anyone have experience solving this kind of problem or implementing this particular solution?
Thanks!
Comments (2)
Have one writer database and several reader databases. You send writes to the one writer and have it propagate the exact same changes to all the other databases. The reader databases will be eventually consistent, and the time to update is very fast. I have seen this done in environments that get upwards of 1M page views per day; it is very scalable. You can even put a hardware router in front of all the read databases to load balance them.
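If it helps, a rough sketch of the reader-side setup in classic MySQL replication is below. The host name and credentials are placeholders, and the exact statements vary by RDBMS and version (MySQL 8.0.23+ renames these to CHANGE REPLICATION SOURCE TO / START REPLICA):

```sql
-- Rough sketch, MySQL-style replication (pre-8.0.23 statement names).
-- Run on each reader database; connection details are placeholders.
CHANGE MASTER TO
    MASTER_HOST = 'writer.example.internal',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_AUTO_POSITION = 1;   -- GTID auto-positioning (MySQL 5.6+)
START SLAVE;
```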
Thanks to those who tried.
For anyone else who ends up here, I'm benchmarking a solution that adds "dataset_version_id" and "dataset_version_verb" columns to each table in question. A correlated subquery inside a stored procedure is then used to retrieve the current dataset_version_id when fetching specific records. If the latest version of a record has a dataset_version_verb of "delete", it's filtered out of the results by a WHERE clause.
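Roughly, the retrieval pattern looks like the following sketch. The table name, key column, and the @as_of_version parameter are hypothetical stand-ins for what the stored procedure actually uses:

```sql
-- Sketch of the versioned lookup: for each key, pick the row from the
-- latest load at or below the requested version, then drop deletions.
SELECT t.*
FROM item t
WHERE t.dataset_version_id = (
          SELECT MAX(t2.dataset_version_id)
          FROM item t2
          WHERE t2.item_key = t.item_key
            AND t2.dataset_version_id <= @as_of_version
      )
  AND t.dataset_version_verb <> 'delete';
```

If a key's most recent row within range is a "delete", that row is excluded by the verb check and all earlier rows fail the MAX() check, so the record disappears from the view, which is the intended behavior.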
This approach carries an average performance hit of roughly 80% so far, which may be acceptable for our purposes.