与不同运算符协调的 SQL 语句
这与问题非常相关:SQL 协调语句,但有更多的变化。
给定以下架构:
create table TBL1 (ID varchar2(100) primary key not null, MATCH_CRITERIA timestamp);
create table TBL2 (ID varchar2(100) primary key not null, MATCH_CRITERIA timestamp);
create table TBL_RESULT (ID varchar2(100) primary key not null, TBL1_ID varchar2(100), TBL2_ID varchar2(100));
create unique index UK_TBL_RESULT_TBL1_ID on TBL_RESULT(TBL1_ID);
create unique index UK_TBL_RESULT_TBL2_ID on TBL_RESULT(TBL2_ID);
insert into TBL1 VALUES('1', to_date('01/26/2012 20:00:00', 'mm/dd/yyyy hh24:mi:ss'));
insert into TBL1 VALUES('2', to_date('01/26/2012 20:05:00', 'mm/dd/yyyy hh24:mi:ss'));
insert into TBL2 VALUES('3', to_date('01/26/2012 19:59:00', 'mm/dd/yyyy hh24:mi:ss'));
insert into TBL2 VALUES('4', to_date('01/26/2012 20:04:00', 'mm/dd/yyyy hh24:mi:ss'));
我们当前的查询:
INSERT INTO TBL_RESULT (ID, TBL1_ID, TBL2_ID)
SELECT rawtohex(sys_guid()),t1.id,t2.id
FROM
(SELECT t1.match_criteria,t1.id, row_number() OVER (PARTITION BY t1.match_criteria ORDER BY t1.id) rn
FROM tbl1 t1) t1,
(SELECT t2.match_criteria,t2.id, row_number() OVER (PARTITION BY t2.match_criteria ORDER BY t2.id) rn
FROM tbl2 t2) t2
WHERE t1.match_criteria between t2.match_criteria - (10/1440) AND t2.match_criteria + (10/1440)
AND t1.rn=t2.rn
它的输出:
| ID | TBL1_ID | TBL2_ID |
| '1' | '1' | '3' |
| '2' | '1' | '4' |
| '3' | '2' | '3' |
| '4' | '2' | '4' |
如您所见,结果不符合唯一约束(重复的 TBL1_ID /重复的 TBL2_ID)。这是因为:
- 每条记录的 RN 始终为 1(因此始终相等)
- 两条记录之间的日期为 10 分钟。
我们期望输出如下表所示:
| ID | TBL1_ID | TBL2_ID |
| '1' | '1' | '4' |
| '2' | '2' | '3' |
注意 1:“1”是否与“3”匹配并不重要,但“2”应该与“4”匹配遵守约束条件,并且只要 T1.MATCH_CRITERIA 与 T2.MATCH_CRITERIA 的时间间隔在 10 分钟以内。
注 2:我们从 TBL1 插入一百万条记录,从 TBL2 插入另外一百万条记录。因此,使用 PL/SQL 的顺序插入是不可接受的,除非它运行得非常快(少于 15 分钟)。
注3:不匹配的数据应剔除。预计数据也会出现不平衡。
注 4:我们不限于只执行 1 个查询。一系列有限查询就可以了。
This is very related to question: SQL Statement for Reconciliation but with a more twist.
Given the schema below:
create table TBL1 (ID varchar2(100) primary key not null, MATCH_CRITERIA timestamp);
create table TBL2 (ID varchar2(100) primary key not null, MATCH_CRITERIA timestamp);
create table TBL_RESULT (ID varchar2(100) primary key not null, TBL1_ID varchar2(100), TBL2_ID varchar2(100));
create unique index UK_TBL_RESULT_TBL1_ID on TBL_RESULT(TBL1_ID);
create unique index UK_TBL_RESULT_TBL2_ID on TBL_RESULT(TBL2_ID);
insert into TBL1 VALUES('1', to_date('01/26/2012 20:00:00', 'mm/dd/yyyy hh24:mi:ss'));
insert into TBL1 VALUES('2', to_date('01/26/2012 20:05:00', 'mm/dd/yyyy hh24:mi:ss'));
insert into TBL2 VALUES('3', to_date('01/26/2012 19:59:00', 'mm/dd/yyyy hh24:mi:ss'));
insert into TBL2 VALUES('4', to_date('01/26/2012 20:04:00', 'mm/dd/yyyy hh24:mi:ss'));
Our current query:
INSERT INTO TBL_RESULT (ID, TBL1_ID, TBL2_ID)
SELECT rawtohex(sys_guid()),t1.id,t2.id
FROM
(SELECT t1.match_criteria,t1.id, row_number() OVER (PARTITION BY t1.match_criteria ORDER BY t1.id) rn
FROM tbl1 t1) t1,
(SELECT t2.match_criteria,t2.id, row_number() OVER (PARTITION BY t2.match_criteria ORDER BY t2.id) rn
FROM tbl2 t2) t2
WHERE t1.match_criteria between t2.match_criteria - (10/1440) AND t2.match_criteria + (10/1440)
AND t1.rn=t2.rn
It's output:
| ID | TBL1_ID | TBL2_ID |
| '1' | '1' | '3' |
| '2' | '1' | '4' |
| '3' | '2' | '3' |
| '4' | '2' | '4' |
As you can see, results were not compliant to the unique constraint (duplicate TBL1_ID / duplicate TBL2_ID). This is because:
- The RN for each record is always 1 (thus always equal)
- The date between the two records is 10 minutes.
We are expecting an output that looks like the table below:
| ID | TBL1_ID | TBL2_ID |
| '1' | '1' | '4' |
| '2' | '2' | '3' |
Note 1: it doesn't matter if '1' is matched with '3', but then '2' should be matched to '4' to comply with the constraints and as long as the T1.MATCH_CRITERIA is within 10 minutes of T2.MATCH_CRITERIA.
Note 2: we are inserting a million records from TBL1 and another million records from TBL2. Thus, sequential insert using PL/SQL is not acceptable unless it can run really fast (less than 15 minutes).
Note 3: unmatched data should be eliminated. Unbalanced data is also expected.
Note 4: we are not limited to execute 1 query only. A series of finite queries will do.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
目前,您的查询会产生交叉联接,因为您的业务规则无法提供一种机制来将 T1 中的一条记录与 T2 中的一条记录链接起来。鉴于这显然是一个玩具示例,我们很难提出任何其他非常简单的建议:
这将简单地将 T1 结果集中的第一行与 T2 结果集中的第一行相匹配,将 T1 结果集中的第二行与T2 结果集中的第二行,依此类推。
这可能不是你想要的。在这种情况下,您需要解释您的数据以及决定什么与什么链接的规则。例如,任一组时间是否存在某种模式可以让您导出锚点?
顺便说一句,当我统治世界时,使用 VARCHAR2(100) 列来保存数字 ID 的人将被枪杀。
At the moment your query produces a cross join, because your business rules fail to provide a mechanism to link one record in T1 with just one record in T2. Given that this is obviously a toy example it is difficult for us to suggest anything other than something very simplistic:
This will simply match the first row in the T1 resultset with the first row in the T2 resultset, the second row in the T1 resultset with the second row in the T2 resultset, and so on.
That probably isn't what you want. In which case you need to explain your data and the rules for deciding what links with what. For instance is there some sort of pattern to either set of times which would allow you to derive an anchor point?
As an aside, when I rule the world people who use VARCHAR2(100) columns to hold numeric IDs will be shot.
我认为这可以工作:
I think this can work: