通常我会用一个简单的联合来解决这个问题。但这一个有我。我有一个表复制到两个不同的数据库引擎。一份在 postgres 中,一份在带有 DB2 的 AS/400 上。
我需要想出一个例程,它采用 postgres 表,将每个记录的字段中的数据与 DB2 表进行比较,并将不同的记录记录在 postgres 表中以进行报告。
当前系统遵循此模型,
load postgres data into recordset prs
for each record in prs {
search for record in db2
compare record
if diff, log
}
通常可以正常工作。问题是,看到大约有 500,000 条记录,大约需要 3 小时才会触发我们的 VPN 中的错误,从而终止连接。当然,你必须重新开始,重新设计例程比修复 VPN 错误更快。
我正在考虑一个例程,将 db2 数据放入临时表中,然后执行联合查询,但在创建和删除表时遇到了安全问题。
我能想到的最现实的方法是将 db2 数据作为记录集加载,然后将其加载到哈希图中,键为 db id 号,然后循环 postgres 数据,但这可能会严重占用我的内存资源。但它更多的是一种蛮力技术。你们能想到更优雅的解决方案吗?
编辑 -
抱歉,该表在年初有 350 万条记录,因此内存解决方案变得非常不切实际。有什么想法吗?
normally i would solve this with a simple union. but this one has me. i have a table replicated to two different database engines. one copy in postgres, one copy on an AS/400 with DB2.
i need to come up with a routine that takes the postgres table, and compares the data in the fields of each record with the DB2 table and log the different records in a postgres table for reporting.
the current system follows this model
load postgres data into recordset prs
for each record in prs {
search for record in db2
compare record
if diff, log
}
this normally works. the problem, seeing there are about 500,000 records it takes about 3 hours before it triggers a bug in our VPN that kills the connection. and of course you have to start over and it will be quicker to redesign the routine than to fix the VPN bug.
i was thinking of a routine that brings the db2 data over in a temporary table and doing a union query then, but i run into security issues with creating and dropping the table.
the most realistic method i can think of is loading the db2 data as a recordset, then loading it into a hashmap with the key being the db id number, and then looping the postgres data but that could really slam my memory resources. but its still more of a brute force technique. is there a more elegant solution you guys can think of?
edit--
sorry, the table starts the year with 3.5 million records so the in memory solution becomes pretty impractical. any ideas?
发布评论
评论(1)
我想做的是将主键划分为多个范围,使每个作业的大小易于管理(并且在触发 VPN 错误所需的时间内)。
使用这些范围打开 2 个记录集,每个表 1 个,迭代两个比较记录(有点像合并排序,但现在用于两个源之间的比较)。这应该比查询另一个表来查找记录更快。
What I would try to do is to partition the primary keys into ranges making the size per job manageable (and under the time it takes to trigger your VPN bug).
Using those ranges open 2 recordsets, 1 for each table, iterating both comparing records as you go (kinda like merge sort but now for diffing between 2 sources). That should be faster than querying the other table to look for the record.