如何加快对包含有效日期数据的大型数据仓库表的查询？

发布于 2024-09-07 02:41:32 字数 1666 浏览 17 评论 0原文

所以我正在查询一些非常大的表。它们之所以如此之大，是因为每次对某些数据进行更改时，PeopleSoft 都会插入新记录，而不是更新现有记录。实际上，它的事务表也是一个数据仓库。

这需要在其中嵌套选择的查询才能获取最新/当前行。它们都具有有效日期，并且在每个日期（转换为一天）内它们可以具有有效序列。因此，为了获取 user_id=123 的当前记录，我必须这样做：

select * from sometable st
where st.user_id = 123
and st.effective_date = (select max(sti.effective_date) 
  from sometable sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence) 
  from sometable sti where sti.user_id = st.user_id
  and sti.effective_date = st.effective_date)

这些表上有大量索引，而且我找不到任何其他可以加快速度的内容我的疑问。

我的麻烦是，我经常想要从这些表中获取有关某个人的数据（可能有 50 个 user_id），但是当我将其中只有几条记录的表与其中一些 PeopleSoft 表连接起来时，事情就会变得一团糟。

PeopleSoft 表位于我通过数据库链接访问的远程数据库上。我的查询往往如下所示：

select st.* from local_table lt, sometable@remotedb st
where lt.user_id in ('123', '456', '789')
and lt.user_id = st.user_id
and st.effective_date = (select max(sti.effective_date) 
  from sometable@remotedb sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence) 
  from sometable@remotedb sti where sti.user_id = st.user_id
  and sti.effective_date = st.effective_date)

当我必须将多个 PeopleSoft 表与我的本地表连接起来时，情况会变得更糟。性能简直令人无法接受。

我可以采取哪些措施来提高绩效？我已尝试使用查询提示来确保我的本地表首先连接到其在 PeopleSoft 中的合作伙伴，因此在将其范围缩小到正确的 user_id 之前，它不会尝试将其所有表连接在一起。我已经尝试过 LEADING 提示并尝试了尝试将处理推送到远程数据库的提示，但是解释计划被遮盖了，只是对几个操作说了“远程”，我有不知道发生了什么事。

假设我无权更改 PeopleSoft 和我的表格的位置，提示是我的最佳选择吗？如果我将本地表与四个远程表连接起来，并且本地表与其中两个表连接，我将如何格式化提示，以便我的本地表（非常小 - 事实上，我可以只做一个内联视图）让我的本地表只是我感兴趣的 user_ids）首先与每个远程表连接？

编辑：应用程序需要实时数据，因此不幸的是物化视图或其他缓存数据的方法是不够的。

原文

So I am querying some extremely large tables. The reason they are so large is because PeopleSoft inserts new records every time a change is made to some data, rather than updating existing records. In effect, its transactional tables are also a data warehouse.

This necessitates queries that have nested selects in them, to get the most recent/current row. They are both effective dated and within each date (cast to a day) they can have an effective sequence. Thus, in order to get the current record for user_id=123, I have to do this:

select * from sometable st
where st.user_id = 123
and st.effective_date = (select max(sti.effective_date) 
  from sometable sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence) 
  from sometable sti where sti.user_id = st.user_id
  and sti.effective_date = st.effective_date)

There are a phenomenal number of indexes on these tables, and I can't find anything else that would speed up my queries.

My trouble is that I often times want to get data about an individual from these tables for maybe 50 user_ids, but when I join my tables having only a few records in them with a few of these PeopleSoft tables, things just go to crap.

The PeopleSoft tables are on a remote database that I access through a database link. My queries tend to look like this:

select st.* from local_table lt, sometable@remotedb st
where lt.user_id in ('123', '456', '789')
and lt.user_id = st.user_id
and st.effective_date = (select max(sti.effective_date) 
  from sometable@remotedb sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence) 
  from sometable@remotedb sti where sti.user_id = st.user_id
  and sti.effective_date = st.effective_date)

Things get even worse when I have to join several PeopleSoft tables with my local table. Performance is just unacceptable.

What are some things I can do to improve performance? I've tried query hints to ensure that my local table is joined to its partner in PeopleSoft first, so it doesn't attempt to join all its tables together before narrowing it down to the correct user_id. I've tried the LEADING hint and toyed around with hints that tried to push the processing to the remote database, but the explain plan was obscured and just said 'REMOTE' for several of the operations and I had no idea what was going on.

Assuming I don't have the power to change PeopleSoft and the location of my tables, are hints my best choice? If I was joining a local table with four remote tables, and the local table joined with two of them, how would I format the hint so that my local table (which is very small -- in fact, I can just do an inline view to have my local table only be the user_ids I'm interested in) is joined first with each of the remote ones?

EDIT: The application needs real-time data so unfortunately a materialized view or other method of caching data will not suffice.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

坚持沉默 2024-09-14 02:41:32

像这样重构你的查询有帮助吗？

SELECT *
  FROM (SELECT st.*, MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
                     MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
          FROM local_table lt JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
         WHERE lt.user_id in ('123', '456', '789'))
 WHERE effective_date = max_dt
   AND effective_seq = max_seq;

我同意@Mark Baker 的观点，即通过数据库链接加入的性能确实很糟糕，并且您使用这种方法可以完成的工作可能会受到限制。

Does refactoring your query something like this help at all?

SELECT *
  FROM (SELECT st.*, MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
                     MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
          FROM local_table lt JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
         WHERE lt.user_id in ('123', '456', '789'))
 WHERE effective_date = max_dt
   AND effective_seq = max_seq;

I agree with @Mark Baker that performance joining over DB Links really can suck and you're likely to be limited in what you can accomplish with this approach.

回复收藏 0 原文

坏尐絯 2024-09-14 02:41:32

一种方法是在所有事情上都使用 PL/SQL 函数。
作为一个例子

create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));

create type typ_remote as object (user_id number, eff_date date, eff_seq number, value varchar2(10));
.
/

create type typ_tab_remote as table of typ_remote;
.
/

insert into remote values (1, date '2010-01-02', 1, 'a');
insert into remote values (1, date '2010-01-02', 2, 'b');
insert into remote values (1, date '2010-01-02', 3, 'c');
insert into remote values (1, date '2010-01-03', 1, 'd');
insert into remote values (1, date '2010-01-03', 2, 'e');
insert into remote values (1, date '2010-01-03', 3, 'f');

insert into remote values (2, date '2010-01-02', 1, 'a');
insert into remote values (2, date '2010-01-02', 2, 'b');
insert into remote values (2, date '2010-01-03', 1, 'd');

create function show_remote (i_user_id_1 in number, i_user_id_2 in number) return typ_tab_remote pipelined is
    CURSOR c_1 is
    SELECT user_id, eff_date, eff_seq, value
    FROM
        (select user_id, eff_date, eff_seq, value, 
                        rank() over (partition by user_id order by eff_date desc, eff_seq desc) rnk
        from remote
        where user_id in (i_user_id_1,i_user_id_2))
    WHERE rnk = 1;
begin
    for c_rec in c_1 loop
        pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
    end loop;
    return;
end;
/

select * from table(show_remote(1,null));

select * from table(show_remote(1,2));

，您可以将它们加载到本地表（例如全局临时表）中，而不是将 user_id 作为参数单独传递。然后 PL/SQL 将循环遍历表，对本地表中的每一行进行远程选择。没有一个查询会同时拥有本地表和远程表。实际上，您将编写自己的加入代码。

One approach would be to stick PL/SQL functions around everything.
As an example

create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));

create type typ_remote as object (user_id number, eff_date date, eff_seq number, value varchar2(10));
.
/

create type typ_tab_remote as table of typ_remote;
.
/

insert into remote values (1, date '2010-01-02', 1, 'a');
insert into remote values (1, date '2010-01-02', 2, 'b');
insert into remote values (1, date '2010-01-02', 3, 'c');
insert into remote values (1, date '2010-01-03', 1, 'd');
insert into remote values (1, date '2010-01-03', 2, 'e');
insert into remote values (1, date '2010-01-03', 3, 'f');

insert into remote values (2, date '2010-01-02', 1, 'a');
insert into remote values (2, date '2010-01-02', 2, 'b');
insert into remote values (2, date '2010-01-03', 1, 'd');

create function show_remote (i_user_id_1 in number, i_user_id_2 in number) return typ_tab_remote pipelined is
    CURSOR c_1 is
    SELECT user_id, eff_date, eff_seq, value
    FROM
        (select user_id, eff_date, eff_seq, value, 
                        rank() over (partition by user_id order by eff_date desc, eff_seq desc) rnk
        from remote
        where user_id in (i_user_id_1,i_user_id_2))
    WHERE rnk = 1;
begin
    for c_rec in c_1 loop
        pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
    end loop;
    return;
end;
/

select * from table(show_remote(1,null));

select * from table(show_remote(1,2));

Rather than having user_id's passed individually as parameters, you could load them into a local table (eg a global temporary table). The PL/SQL would loop then through the table, doing the remote select for each row in the local table. No single query would have both local and remote tables. Effectively you would be writing your own join code.

回复收藏 0 原文

唱一曲作罢 2024-09-14 02:41:32

一种选择是首先使用公用表表达式具体化查询的远程部分，这样您就可以确保只从远程数据库获取相关数据。另一个改进是将针对远程数据库的 2 个子查询合并到一个基于分析函数的子查询中.这样的查询也可以在您当前的查询中使用。只有在玩过数据库之后我才能提出其他建议。

见下文

with remote_query as
(
    select /*+ materialize */  st.* from sometable@remotedb st
    where st.user_id in ('123', '456', '789')
    and st.rowid in( select first_value(rowid) over (order by effective_date desc, 
                         effective_sequence desc ) from sometable@remotedb st1 
                      where st.user_id=st1.user_id)
)

select lt.*,st.* 
FROM local_table st,remote_query rt
where st.user_id=rt.user_id

One option is to first materialize the remote part of the query using a common table expression so you can be sure only relevantt data is fetched from remote db.Another improvement would be to merge the 2 subqueries against the remote db into one analytical function based subquery.Such a query can be used in your current query also. I can make other suggestions only after playing with the db.

see below

with remote_query as
(
    select /*+ materialize */  st.* from sometable@remotedb st
    where st.user_id in ('123', '456', '789')
    and st.rowid in( select first_value(rowid) over (order by effective_date desc, 
                         effective_sequence desc ) from sometable@remotedb st1 
                      where st.user_id=st1.user_id)
)

select lt.*,st.* 
FROM local_table st,remote_query rt
where st.user_id=rt.user_id

回复收藏 0 原文

久随 2024-09-14 02:41:32

您没有提到数据新鲜度的要求，但一种选择是创建物化视图（您将被限制为 REFRESH COMPLETE，因为您无法在源系统中创建快照日志），该视图仅包含以下数据：事务表的当前版本行。这些物化视图表将驻留在您的本地系统中，并且可以向其中添加额外的索引以提高查询性能。

回复收藏 0 原文

无需解释 2024-09-14 02:41:32

性能问题将是跨链路的访问。对于针对本地表的部分查询，它全部在本地执行，因此无法访问远程索引，并且它将所有远程数据拉回以进行 lkocally 测试。

如果您可以在本地数据库中使用定期（每晚）从 peoplesoft 数据库刷新的物化视图来获取历史数据，则仅访问远程 peoplesoft 数据库以获取今天的更改（将 effective_date = Today 添加到 where 子句）并合并两个查询。

另一种选择可能是仅对远程数据使用 INSERT INTO X SELECT FROM 将其拉入临时本地表或物化视图，然后使用第二个查询将其与本地数据连接...类似于 josephj1989 的建议

或者（尽管可能存在许可问题）尝试使用远程 peoplesoft 数据库对本地数据库进行 RAC 集群。

回复收藏 0 原文

蒲公英的约定 2024-09-14 02:41:32

您可以尝试这个，而不是使用子查询。我不知道Oracle是否会在这个方面表现更好，因为我不经常使用Oracle。

SELECT
    ST1.col1,
    ST1.col2,
    ...
FROM
    Some_Table ST1
LEFT OUTER JOIN Some_Table ST2 ON
    ST2.user_id = ST1.user_id AND
    (
        ST2.effective_date > ST1.effective_date OR
        (
            ST2.effective_date = ST1.effective_date AND
            ST2.effective_sequence > ST1.effective_sequence
        )
    )
WHERE
    ST2.user_id IS NULL

另一种可能的解决方案是：

SELECT
    ST1.col1,
    ST1.col2,
    ...
FROM
    Some_Table ST1
WHERE
    NOT EXISTS
    (
        SELECT
        FROM
            Some_Table ST2
        WHERE
            ST2.user_id = ST1.user_id AND
            (
                ST2.effective_date > ST1.effective_date OR
                (
                    ST2.effective_date = ST1.effective_date AND
                    ST2.effective_sequence > ST1.effective_sequence
                )
            )
    )

Instead of using the subqueries, you can try this. I don't know if Oracle will perform better with this or not, since I don't use Oracle much.

SELECT
    ST1.col1,
    ST1.col2,
    ...
FROM
    Some_Table ST1
LEFT OUTER JOIN Some_Table ST2 ON
    ST2.user_id = ST1.user_id AND
    (
        ST2.effective_date > ST1.effective_date OR
        (
            ST2.effective_date = ST1.effective_date AND
            ST2.effective_sequence > ST1.effective_sequence
        )
    )
WHERE
    ST2.user_id IS NULL

Another possible solution would be:

SELECT
    ST1.col1,
    ST1.col2,
    ...
FROM
    Some_Table ST1
WHERE
    NOT EXISTS
    (
        SELECT
        FROM
            Some_Table ST2
        WHERE
            ST2.user_id = ST1.user_id AND
            (
                ST2.effective_date > ST1.effective_date OR
                (
                    ST2.effective_date = ST1.effective_date AND
                    ST2.effective_sequence > ST1.effective_sequence
                )
            )
    )

回复收藏 0 原文

你列表最软的妹 2024-09-14 02:41:32

是否可以选择创建一个用于非仓储类型内容的数据库，并且可以每晚更新？如果是，您可以创建一个每晚进程，仅移动最新的记录。这将消除您每天查询时所做的 MAX 工作，并显着减少数量或记录。

此外，还取决于最新数据和可用数据之间是否可以有 1 天的间隔。

我对 Oracle 不太熟悉，所以可能有一种方法可以通过更改查询来获得改进......

回复收藏 0 原文

眼眸 2024-09-14 02:41:32

您能否将具有所需 user_id 的行 ETL 到您自己的表中，仅创建所需的索引来支持您的查询并对其执行查询？

回复收藏 0 原文

伤感在游骋 2024-09-14 02:41:32

PeopleSoft 表是交付的还是定制的？你确定这是一个物理表，而不是PS端写得不好的视图？如果您要反对的是已交付的记录（示例看起来很像 PS_JOB 或引用它的视图），也许您可以指出这一点。 PS_JOB 是一个提供了大量索引的野兽，大多数网站添加的索引甚至更多。

如果您知道表上的索引，则可以使用 Oracle 提示来指定要使用的首选索引；这有时会有所帮助。

您是否制定了解释计划，看看是否可以确定问题出在哪里？也许有笛卡尔连接、全表扫描等？

回复收藏 0 原文

婴鹅 2024-09-14 02:41:32

在我看来，您正在处理数据仓库中的第 2 类维度。实现类型 2 维度的方法有多种，主要有 ValidFrom、ValidTo、Version、Status 等列。并非所有这些都始终存在，如果您可以发布表的架构，那将会很有趣。下面是一个示例（John Smith 于 2010 年 6 月 24 日从印第安纳州搬到了俄亥俄州）

UserKey  UserBusinessKey State    ValidFrom    ValidTo   Version  Status
7234     John_Smith_17   Indiana  2005-03-20  2010-06-23    1     expired
9116     John_Smith_17   Ohio     2010-06-24  3000-01-01    2     current

要获取行的最新版本，通常使用

WHERE Status = 'current'

或

WHERE ValidTo = '3000-01-01'

请注意，该行在远方有一些常量未来。

或

WHERE ValidTo > CURRENT_DATE

似乎您的示例使用 ValidFrom ( effective_date) ，因此您被迫查找 max() 才能找到最新行。查看架构 - 您的表中是否有 Status 或 ValidTo 等效项？

It looks to me that you are dealing with a type 2 dimension in the data warehouse. There are several ways how to implement type 2 dimension, mostly having columns like ValidFrom, ValidTo, Version, Status. Not all of them are always present, it would be interesting if you could post the schema for your table. Here is an example of how it may look like (John Smith moved from Indiana to Ohio on 2010-06-24)

UserKey  UserBusinessKey State    ValidFrom    ValidTo   Version  Status
7234     John_Smith_17   Indiana  2005-03-20  2010-06-23    1     expired
9116     John_Smith_17   Ohio     2010-06-24  3000-01-01    2     current

To obtain the latest version of a row, it is common to use

WHERE Status = 'current'

WHERE ValidTo = '3000-01-01'

Note that this one has some constant far in the future.

WHERE ValidTo > CURRENT_DATE

Seems that your example uses ValidFrom (effective_date), so you are forced to find max() in order to locate the latest row. Take a look at the schema -- is there Status or ValidTo equivalents in your tables?

回复收藏 0 原文

~没有更多了~