One-way database synchronization

There is frequently the need to synchronize data from master tables in one database to clone tables in other databases, often on other servers. For example, consider the case where a backend system manages inventory data and that inventory data ultimately must be pushed to one or more databases that are part of a web site application.

The source data in the backend system is heavily normalized, with dozens of tables and foreign key constraints. It is a well-designed OLTP RDBMS system. Many of the tables in question contain millions of rows. The need is to push this data out to the other databases regularly. As frequently as feasible; latency can be tolerated. Above all, maximum uptime of both the backend and remote databases is imperative.

I am using SQL Server and am familiar with change tracking, rowversion, triggers, and so on. I know that Microsoft pushes replication, SyncFx, and SSIS heavily for these scenarios. However, there is quite a difference between vendor whitepapers and overviews recommending technologies and the actual implementation, deployment, and maintenance of the solution. In the SQL Server world, replication is often viewed as the turnkey solution, but I am trying to explore alternate solutions. (There is some fear that replication is difficult to administer, makes it hard to change schema, and in the event that a re-initialize is ever required there would be large downtime for critical systems.)
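
For reference, a minimal sketch of the change-tracking approach mentioned above, assuming a hypothetical InventoryDB database with a dbo.Inventory table keyed on ItemId (none of these names come from the original system):

```sql
-- One-time setup: enable change tracking on the source database and table.
-- InventoryDB / dbo.Inventory / ItemId are placeholder names.
ALTER DATABASE InventoryDB
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Inventory
    ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

-- Periodic pull: fetch net changes since the last synchronized version.
DECLARE @last_sync_version BIGINT = 0;  -- loaded from a sync-metadata table in practice
DECLARE @current_version   BIGINT = CHANGE_TRACKING_CURRENT_VERSION();

SELECT  ct.SYS_CHANGE_OPERATION,        -- I / U / D
        ct.ItemId,                      -- primary key column of the tracked table
        src.Sku,
        src.QuantityOnHand
FROM    CHANGETABLE(CHANGES dbo.Inventory, @last_sync_version) AS ct
LEFT JOIN dbo.Inventory AS src
        ON src.ItemId = ct.ItemId;      -- non-key columns come back NULL for deletes

-- After applying these rows to the remote database, persist @current_version
-- as the new @last_sync_version for the next run.
```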

There are lots of gotchas. Due to the complex foreign key relationships among large numbers of tables, determining in what order to perform captures or apply updates is not trivial. Due to unique indexes, two rows might be interlocked in such a way that a row-at-a-time update will not even work (intermediate updates to each row are needed before the final update). These are not necessarily show-stoppers, as unique indexes can often be changed to regular indexes and foreign keys can be disabled (though disabling the foreign keys is extremely undesirable). Often, you will hear, "just" use SQL 2008 change tracking and SSIS or SyncFx. These kinds of answers really do not do justice to the practical difficulties. (And of course, clients really have a hard time wrapping their heads around how copying data could be so difficult, making a difficult situation all the worse!)
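
To make the unique-index interlock concrete, here is a small illustrative sketch (table, column, and values are hypothetical): two rows swap values in a uniquely indexed column, so a row-at-a-time apply cannot succeed unless one row is first parked on a temporary value.

```sql
-- Hypothetical target table where Sku has a unique index. Since the last sync,
-- the source has swapped the SKUs of items 1 and 2:
--   target: (ItemId 1, Sku 'A'), (ItemId 2, Sku 'B')
--   source: (ItemId 1, Sku 'B'), (ItemId 2, Sku 'A')

-- A row-at-a-time apply fails on the very first statement:
UPDATE dbo.Inventory SET Sku = 'B' WHERE ItemId = 1;      -- unique key violation

-- Workaround: park one row on an intermediate value, then finish the swap.
BEGIN TRANSACTION;
UPDATE dbo.Inventory SET Sku = '~TMP~' WHERE ItemId = 1;  -- intermediate update
UPDATE dbo.Inventory SET Sku = 'A'     WHERE ItemId = 2;
UPDATE dbo.Inventory SET Sku = 'B'     WHERE ItemId = 1;  -- final value
COMMIT TRANSACTION;
```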

This issue is ultimately very generic: perform one-way synchronization of many heavily related database tables with lots of rows. Almost everyone involved in databases has to deal with this kind of issue. Whitepapers are common, practical expertise hard to find. We know this can be a difficult issue, but the job must get done. Let's hear about what has worked for you (and what to avoid). Tell your experience with Microsoft products or products from other vendors. But if you personally have not battle-tested the solution with large numbers of heavily-related tables and rows, please refrain from answering. Let's keep this practical -- not theoretical.

Comments (1)

青巷忧颜 2024-08-01 17:48:41

Better ask on serverfault.com (I can't post comments, scripts are broken in SO, so I have to post a full answer)

Update: (switched to Safari, scripts work again, I can post properly)

There is no silver bullet. For ease of use and turnkey deployment nothing can beat replication. It is the only solution that covers conflict detection and resolution in depth, has support for pushing schema changes, and comes with a comprehensive set of tools for setting it up and monitoring it. It was the MS poster child of data synchronization for many years before this 'agenda' was taken over by the .Net crowd. Replication has two underlying problems in my opinion:

  • The technology used to push changes is primitive, slow, and unreliable. It requires file shares to initialize the replicas and it depends on T-SQL to actually replicate data, resulting in all sorts of scalability problems: the replication threads use server worker threads, and the fact that they interact with arbitrary tables and application queries leads to blocking and deadlocks. The biggest deployments I've heard of are around 400-500 sites, done by superhuman MVPs and top-dollar consultants. This stops in their tracks many projects that start at 1500 sites (way beyond the largest deployed replication projects). I'm curious to hear if I'm wrong and you know of a SQL Server replication solution deployed with more than 500 sites.
  • The replication metaphor is too data-centric. It does not take into account the requirements of distributed applications: the need for versioned and formalized contracts, autonomy of data 'fiefdoms', and loose coupling from an availability and security point of view. As a result, replication-based solutions solve the immediate need to 'make the data available over there', but fail to solve the true problem of 'my app needs to talk to your app'.

At the other end of the spectrum you'll find solutions that truly address the problem of application communication, like services based on queued messaging. But these are either painfully slow or riddled with problems rooted in the separation of the communication mechanism (web services and/or MSMQ) from the data storage (DTC transactions between the comm layer and the DB, no common high-availability story, no common recoverability story, and so on). Solutions that are blazingly fast and fully integrated with the DB exist in the MS stack, but nobody knows how to use them. Somewhere in between these and replication you'll find various intermediate solutions, like the OCS/Sync Framework and SSIS-based custom solutions. None will offer the ease of setup and monitoring of replication, but they might scale and perform better.
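
The 'blazingly fast and fully integrated with the DB' option alluded to here is presumably SQL Server Service Broker; below is a minimal sketch of one-way queued messaging under that assumption, with purely illustrative object names.

```sql
-- Minimal one-way Service Broker plumbing; all object names are illustrative.
-- Routes/endpoints needed for cross-server delivery are omitted for brevity.
CREATE MESSAGE TYPE [//Inventory/PurchaseMessage] VALIDATION = WELL_FORMED_XML;

CREATE CONTRACT [//Inventory/PurchaseContract]
    ([//Inventory/PurchaseMessage] SENT BY INITIATOR);

CREATE QUEUE dbo.PurchaseTargetQueue;
CREATE QUEUE dbo.PurchaseInitiatorQueue;

CREATE SERVICE [//Inventory/PurchaseTargetService]
    ON QUEUE dbo.PurchaseTargetQueue ([//Inventory/PurchaseContract]);
CREATE SERVICE [//Inventory/PurchaseInitiatorService]
    ON QUEUE dbo.PurchaseInitiatorQueue;
GO

-- Sending one business event from the backend side:
DECLARE @handle UNIQUEIDENTIFIER;

BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//Inventory/PurchaseInitiatorService]
    TO SERVICE   '//Inventory/PurchaseTargetService'
    ON CONTRACT  [//Inventory/PurchaseContract]
    WITH ENCRYPTION = OFF;

SEND ON CONVERSATION @handle
    MESSAGE TYPE [//Inventory/PurchaseMessage]
    (N'<Purchase ItemId="42" CustomerId="7" Quantity="3" />');
```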

I was involved with several projects that required 'data synchronization' on a very large scale (1200+ sites, 1600+ sites), and my solution was to turn the problem into an 'application communication' problem. Once the mindset is changed and the data flow is no longer seen as 'the record with key X of table Y' but instead as 'a message communicating the purchase of item X by customer Y', the solution becomes easier to understand and apply. You no longer think in terms of 'insert records in order X-Y-Z so FK relations don't break' but instead in terms of 'process the purchase as described by message XYZ'.
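
A sketch of the receiving side under the same assumptions as above: the remote database consumes each message as a business event ('apply this purchase') rather than replaying row images in an FK-safe order.

```sql
-- Processing loop on the remote side (illustrative names; typically run from
-- an activation stored procedure attached to the queue).
DECLARE @handle UNIQUEIDENTIFIER,
        @body   XML,
        @type   SYSNAME;

WAITFOR (
    RECEIVE TOP (1)
        @handle = conversation_handle,
        @body   = CAST(message_body AS XML),
        @type   = message_type_name
    FROM dbo.PurchaseTargetQueue
), TIMEOUT 5000;

IF @type = N'//Inventory/PurchaseMessage'
BEGIN
    -- Apply the business event, not a raw row image.
    UPDATE dbo.Inventory
    SET    QuantityOnHand = QuantityOnHand - @body.value('(/Purchase/@Quantity)[1]', 'int')
    WHERE  ItemId = @body.value('(/Purchase/@ItemId)[1]', 'int');
END
```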

In my view replication, and its derivatives (i.e. data tracking and data-gram shipping), are solutions anchored in the technologies and the data/application view of the '80s. Obsolete dinosaurs (and in no way turning into birds).

I know this does not even begin to address all your (very legitimate) concerns, but writing out everything I have to say/rant/ramble on this topic would fill volumes of paperback...
