How to replicate data from a remote system to SQL Server and validate it
I have a task at hand of creating some kind of logic for a database. There is an IBM MQ Series setup in place today that replicates data from a couple of remote systems. At the moment it picks up changes in the remote systems and dumps the data into a staging table in a SQL Server 2005 database.
I want to validate the data according to some simple validation logic (required fields and data format) and some more advanced checks of the data hierarchy. The data being imported is hierarchical, with a parent-child relationship between some of the records. The hierarchical validation should check both the data currently in the staging table and the production table, to make sure that a valid hierarchy will exist after the import completes.
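For illustration, here is a minimal sketch of such a hierarchy check, assuming hypothetical StagingItem and ProductionItem tables with Id and ParentId columns (all names are placeholders):

    -- Rows fail the hierarchy check if their parent exists neither in the
    -- staging batch nor in production.
    SELECT s.Id
    FROM   StagingItem s
    WHERE  s.ParentId IS NOT NULL
      AND  NOT EXISTS (SELECT 1 FROM StagingItem    p WHERE p.Id = s.ParentId)
      AND  NOT EXISTS (SELECT 1 FROM ProductionItem p WHERE p.Id = s.ParentId);

Rows returned by a query like this would be the ones to log and skip.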
If validation of a record fails, the failure should be logged somehow and the failing records should not be imported, but the remaining records should. Records in the current production tables should be replaced by new records if they have the same id. The transfer of the data from the staging table should happen as soon as possible after the data appears in the staging table.
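Since SQL Server 2005 has no MERGE statement, the replace-by-id step could be sketched as an UPDATE followed by an INSERT, again using the hypothetical table names above and assuming the failed rows have already been filtered out:

    -- Replace existing production rows that share an id with validated staging rows.
    UPDATE p
    SET    p.Name     = s.Name,      -- Name/ParentId are placeholder columns
           p.ParentId = s.ParentId
    FROM   ProductionItem p
    JOIN   StagingItem    s ON s.Id = p.Id;

    -- Insert validated staging rows that do not yet exist in production.
    INSERT INTO ProductionItem (Id, Name, ParentId)
    SELECT s.Id, s.Name, s.ParentId
    FROM   StagingItem s
    WHERE  NOT EXISTS (SELECT 1 FROM ProductionItem p WHERE p.Id = s.Id);

Both statements would normally run inside one transaction so the hierarchy never appears half-updated.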
As far as I know now, the total number of records in the production tables will not exceed 1 million, and the number of updated items per batch would be at most a couple of thousand rows.
My question is: which solution is the best fit for importing the data?
I have thought of a couple of possible solutions:
Means of starting the transfer:
- A Windows service that polls the staging table at a regular interval and kicks off some kind of transfer process whenever new data is inserted (a sketch of the polling check follows this list).
- Make the MQ kick off the transfer process upon inserting new data into the table.
- Scheduling an SSIS job to run at a regular interval.
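As a rough illustration of the polling option, the service (or a SQL Agent job) could simply call a procedure like the following; usp_ValidateAndTransfer and the Processed flag are placeholders for whatever transfer process is chosen:

    -- Hypothetical procedure a polling Windows service or SQL Agent job could call.
    CREATE PROCEDURE dbo.usp_PollAndTransfer
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Only kick off the transfer when unprocessed rows exist in staging.
        IF EXISTS (SELECT 1 FROM StagingItem WHERE Processed = 0)
            EXEC dbo.usp_ValidateAndTransfer;
    END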
Means of validating and transferring the data:
- Creating an SSIS job
- Creating Stored procedures
- Custom .NET code
My main concerns are that the hierarchy must remain intact in the production tables at all times, and that the data should be available in the production tables shortly after appearing in the staging table. I cannot be sure that a full hierarchy is available in the staging table at all times.
1 Answer
We use a closed process for this type of work.
Get data from the Remote system into Staging tables, then Import into the Production tables.
If anything can insert into the Staging Tables at will, and there are Child Tables, then there is a risk that you might import into the Production DB before all the Child Records are created in the staging tables.
If you are free to add columns to the Staging Tables (which the remote end will ignore), or if the staging tables have an IDENTITY or GUID that is unique / non-repeating, then you can create a parallel table.
Ideally the routine creating rows in the Staging Table will use a Batch number, and then create a "Batch Number Done" record when successful. So you have a semaphore to stop you importing until all associated records are in.
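A minimal sketch of that semaphore, assuming a hypothetical BatchNo column on the staging table and a StagingBatchDone table written by the loader:

    -- Only pick up rows whose batch has been marked complete.
    SELECT s.*
    FROM   StagingItem s
    WHERE  EXISTS (SELECT 1 FROM StagingBatchDone d WHERE d.BatchNo = s.BatchNo);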
(They could be inserted in a Transaction Block, but you would have to be confident that all processes inserting in Staging Table honoured that).
Given the IDENTITY / GUID I would create a 1:1 "Error table" to store any messages describing import failure.
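Such a 1:1 error table might look like this, assuming the staging table carries an IDENTITY column called StagingId (names are placeholders):

    -- One row per failed staging row, keyed by the staging IDENTITY value.
    CREATE TABLE dbo.StagingItemError
    (
        StagingId INT           NOT NULL PRIMARY KEY,  -- matches StagingItem.StagingId
        ErrorDate DATETIME      NOT NULL DEFAULT GETDATE(),
        ErrorMsg  VARCHAR(1000) NOT NULL
    );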
You may choose to move, or copy, failed rows to a separate failures staging table, so that the main staging table doesn't get clogged up and it is easier to make sure that the failures are being sorted out (by a human I presume).
Having said that, here is a more detailed description of our specific process:
To minimise bandwidth and updates (i.e. reduce blocking and minimise unnecessary transaction log entries etc.) we do the following:
On the Source machine, hold a copy of the table being transferred. This has additional columns for Updated-on and Action (an Update or Delete flag - Update includes Insert, and an Insert may have been Updated again before the destination ever gets that row ...)
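A sketch of what that source-side copy might look like, using a hypothetical Item table as the thing being transferred:

    -- Source-side copy of the transferred table, with change-tracking columns.
    CREATE TABLE dbo.ItemTransferCopy
    (
        Id        INT          NOT NULL PRIMARY KEY,
        Name      VARCHAR(100) NULL,           -- placeholder payload columns
        ParentId  INT          NULL,
        UpdatedOn DATETIME     NOT NULL,
        [Action]  CHAR(1)      NOT NULL        -- 'U' = update/insert, 'D' = delete
    );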
Inserts of new rows to this table are OK.
Updates to this table only happen if there is a difference in at least one column - so it's OK (if rather CPU intensive) to compare the whole source table against the staging table to find out what has changed. Updates set the Updated-on column. (Beware of when the clocks go back.)
Periodically we flag the staging table rows Action=Deleted if they cannot be found in the Source table.
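The change detection and delete flagging described above could be sketched like this, using the hypothetical Item and ItemTransferCopy tables:

    -- Touch Updated-on only when at least one column actually differs.
    UPDATE c
    SET    c.Name      = i.Name,
           c.ParentId  = i.ParentId,
           c.UpdatedOn = GETDATE(),
           c.[Action]  = 'U'
    FROM   dbo.ItemTransferCopy c
    JOIN   dbo.Item i ON i.Id = c.Id
    WHERE  ISNULL(c.Name, '')     <> ISNULL(i.Name, '')
       OR  ISNULL(c.ParentId, -1) <> ISNULL(i.ParentId, -1);

    -- Flag rows that have disappeared from the source table as deleted.
    UPDATE c
    SET    c.UpdatedOn = GETDATE(),
           c.[Action]  = 'D'
    FROM   dbo.ItemTransferCopy c
    WHERE  c.[Action] <> 'D'
      AND  NOT EXISTS (SELECT 1 FROM dbo.Item i WHERE i.Id = c.Id);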
The data is copied from Source to identical tables on Destination where the Updated-on is after the last transfer.
On the Destination server the routine that checks the data, and imports it to Production, works solely on the Updated-on date (process everything since last Updated-on)
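A sketch of that incremental pick-up, assuming the last processed Updated-on value is kept in a hypothetical TransferControl table:

    DECLARE @LastUpdatedOn DATETIME;

    SELECT @LastUpdatedOn = LastUpdatedOn
    FROM   dbo.TransferControl
    WHERE  TableName = 'Item';

    -- Process only rows changed since the last run.
    SELECT c.Id, c.Name, c.ParentId, c.[Action]
    FROM   dbo.ItemTransferCopy c
    WHERE  c.UpdatedOn > @LastUpdatedOn;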
All staging tables are in a separate database which has minimal logging - that database will automatically refresh from the "last updated date" if it is restored, so our aim is to minimise the transaction log. We don't store the staging tables in the Production database (which will have full transaction logging) in order to avoid bloating the TLogs.
Again, it is helpful if these processes are sequential so that they cannot happen concurrently, otherwise some sort of Batch No is needed to prevent transferring partially completed batches where Parent/Child tables exist.
We transfer the data in the style of the Source database, and only make data manipulations at the Destination end. The reason for this is that if the data manipulation is wrong, and we fix it, we only have to rerun it at the Destination end - the Source data is all present in the staging tables at the Destination end. If we did the manipulation at the Source end we would have to retransmit all the data again, as well.
For debugging, having the Updated-on date in the Staging table is helpful. We often get asked "Why is this different?", and seeing the Updated-on tells us whether the manipulation on the Destination end is at fault (i.e. we have recent data showing the expected Source data) or the Source end (no recent data found!)