如何加快对包含有效日期数据的大型数据仓库表的查询?
所以我正在查询一些非常大的表。它们之所以如此之大,是因为每次对某些数据进行更改时,PeopleSoft 都会插入新记录,而不是更新现有记录。实际上,它的事务表也是一个数据仓库。
这需要在其中嵌套选择的查询才能获取最新/当前行。它们都具有有效日期,并且在每个日期(转换为一天)内它们可以具有有效序列。因此,为了获取 user_id=123
的当前记录,我必须这样做:
select * from sometable st
where st.user_id = 123
and st.effective_date = (select max(sti.effective_date)
from sometable sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
这些表上有大量索引,而且我找不到任何其他可以加快速度的内容我的疑问。
我的麻烦是,我经常想要从这些表中获取有关某个人的数据(可能有 50 个 user_id),但是当我将其中只有几条记录的表与其中一些 PeopleSoft 表连接起来时,事情就会变得一团糟。
PeopleSoft 表位于我通过数据库链接访问的远程数据库上。我的查询往往如下所示:
select st.* from local_table lt, sometable@remotedb st
where lt.user_id in ('123', '456', '789')
and lt.user_id = st.user_id
and st.effective_date = (select max(sti.effective_date)
from sometable@remotedb sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable@remotedb sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
当我必须将多个 PeopleSoft 表与我的本地表连接起来时,情况会变得更糟。性能简直令人无法接受。
我可以采取哪些措施来提高绩效?我已尝试使用查询提示来确保我的本地表首先连接到其在 PeopleSoft 中的合作伙伴,因此在将其范围缩小到正确的 user_id 之前,它不会尝试将其所有表连接在一起。我已经尝试过 LEADING 提示并尝试了尝试将处理推送到远程数据库的提示,但是解释计划被遮盖了,只是对几个操作说了“远程”,我有不知道发生了什么事。
假设我无权更改 PeopleSoft 和我的表格的位置,提示是我的最佳选择吗?如果我将本地表与四个远程表连接起来,并且本地表与其中两个表连接,我将如何格式化提示,以便我的本地表(非常小 - 事实上,我可以只做一个内联视图)让我的本地表只是我感兴趣的 user_ids)首先与每个远程表连接?
编辑:应用程序需要实时数据,因此不幸的是物化视图或其他缓存数据的方法是不够的。
So I am querying some extremely large tables. The reason they are so large is because PeopleSoft inserts new records every time a change is made to some data, rather than updating existing records. In effect, its transactional tables are also a data warehouse.
This necessitates queries that have nested selects in them, to get the most recent/current row. They are both effective dated and within each date (cast to a day) they can have an effective sequence. Thus, in order to get the current record for user_id=123
, I have to do this:
select * from sometable st
where st.user_id = 123
and st.effective_date = (select max(sti.effective_date)
from sometable sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
There are a phenomenal number of indexes on these tables, and I can't find anything else that would speed up my queries.
My trouble is that I often times want to get data about an individual from these tables for maybe 50 user_ids, but when I join my tables having only a few records in them with a few of these PeopleSoft tables, things just go to crap.
The PeopleSoft tables are on a remote database that I access through a database link. My queries tend to look like this:
select st.* from local_table lt, sometable@remotedb st
where lt.user_id in ('123', '456', '789')
and lt.user_id = st.user_id
and st.effective_date = (select max(sti.effective_date)
from sometable@remotedb sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable@remotedb sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
Things get even worse when I have to join several PeopleSoft tables with my local table. Performance is just unacceptable.
What are some things I can do to improve performance? I've tried query hints to ensure that my local table is joined to its partner in PeopleSoft first, so it doesn't attempt to join all its tables together before narrowing it down to the correct user_id. I've tried the LEADING
hint and toyed around with hints that tried to push the processing to the remote database, but the explain plan was obscured and just said 'REMOTE' for several of the operations and I had no idea what was going on.
Assuming I don't have the power to change PeopleSoft and the location of my tables, are hints my best choice? If I was joining a local table with four remote tables, and the local table joined with two of them, how would I format the hint so that my local table (which is very small -- in fact, I can just do an inline view to have my local table only be the user_ids I'm interested in) is joined first with each of the remote ones?
EDIT: The application needs real-time data so unfortunately a materialized view or other method of caching data will not suffice.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
像这样重构你的查询有帮助吗?
我同意@Mark Baker 的观点,即通过数据库链接加入的性能确实很糟糕,并且您使用这种方法可以完成的工作可能会受到限制。
Does refactoring your query something like this help at all?
I agree with @Mark Baker that performance joining over DB Links really can suck and you're likely to be limited in what you can accomplish with this approach.
一种方法是在所有事情上都使用 PL/SQL 函数。
作为一个例子
,您可以将它们加载到本地表(例如全局临时表)中,而不是将 user_id 作为参数单独传递。然后 PL/SQL 将循环遍历表,对本地表中的每一行进行远程选择。没有一个查询会同时拥有本地表和远程表。实际上,您将编写自己的加入代码。
One approach would be to stick PL/SQL functions around everything.
As an example
Rather than having user_id's passed individually as parameters, you could load them into a local table (eg a global temporary table). The PL/SQL would loop then through the table, doing the remote select for each row in the local table. No single query would have both local and remote tables. Effectively you would be writing your own join code.
一种选择是首先使用公用表表达式具体化查询的远程部分,这样您就可以确保只从远程数据库获取相关数据。另一个改进是将针对远程数据库的 2 个子查询合并到一个基于分析函数的子查询中.这样的查询也可以在您当前的查询中使用。只有在玩过数据库之后我才能提出其他建议。
见下文
One option is to first materialize the remote part of the query using a common table expression so you can be sure only relevantt data is fetched from remote db.Another improvement would be to merge the 2 subqueries against the remote db into one analytical function based subquery.Such a query can be used in your current query also. I can make other suggestions only after playing with the db.
see below
您没有提到数据新鲜度的要求,但一种选择是创建物化视图(您将被限制为 REFRESH COMPLETE,因为您无法在源系统中创建快照日志),该视图仅包含以下数据:事务表的当前版本行。这些物化视图表将驻留在您的本地系统中,并且可以向其中添加额外的索引以提高查询性能。
You haven't mentioned the requirements for the freshness of the data, but one option would be to create materialized views (you'll be restricted to REFRESH COMPLETE since you can't create snapshot logs in the source system) that have data only for the current versioned row of the transaction tables. These materialized view tables will reside in your local system and additional indexing can be added to them to improve query performance.
性能问题将是跨链路的访问。对于针对本地表的部分查询,它全部在本地执行,因此无法访问远程索引,并且它将所有远程数据拉回以进行 lkocally 测试。
如果您可以在本地数据库中使用定期(每晚)从 peoplesoft 数据库刷新的物化视图来获取历史数据,则仅访问远程 peoplesoft 数据库以获取今天的更改(将 effective_date = Today 添加到 where 子句)并合并两个查询。
另一种选择可能是仅对远程数据使用 INSERT INTO X SELECT FROM 将其拉入临时本地表或物化视图,然后使用第二个查询将其与本地数据连接...类似于 josephj1989 的建议
或者(尽管可能存在许可问题)尝试使用远程 peoplesoft 数据库对本地数据库进行 RAC 集群。
The performance issue is going to be the access across the link. With part of the query against local tables, it's all being executed locally so no access to the remote indexes and it's pulling all the remote data back to test lkocally.
If you could use materialized views in a local database refreshed from the peoplesoft database on a periodic (nightly) basis for the historic data, only accessing the remote peoplesoft database for today's changes (adding a effective_date = today to your where clause) and merging the two queries.
Another option might be to use an INSERT INTO X SELECT FROM just for the remote data to pull it into a temporary local table or materialized view, then a second query to join that with your local data... similar to josephj1989's suggestion
Alternatively (though there may be licensing issues) try RAC Clustering your local db with the remote peoplesoft db.
您可以尝试这个,而不是使用子查询。我不知道Oracle是否会在这个方面表现更好,因为我不经常使用Oracle。
另一种可能的解决方案是:
Instead of using the subqueries, you can try this. I don't know if Oracle will perform better with this or not, since I don't use Oracle much.
Another possible solution would be:
是否可以选择创建一个用于非仓储类型内容的数据库,并且可以每晚更新?如果是,您可以创建一个每晚进程,仅移动最新的记录。这将消除您每天查询时所做的 MAX 工作,并显着减少数量或记录。
此外,还取决于最新数据和可用数据之间是否可以有 1 天的间隔。
我对 Oracle 不太熟悉,所以可能有一种方法可以通过更改查询来获得改进......
Would it be an option to create a database that you use for non-warehousing type stuff that you could update on a nightly basis? If it is you could create a nightly process that will move over only the most recent records. That would get rid of the MAX stuff you are doing for every day queries and significantly reduce the number or records.
Also, depends on whether you can have a 1 day lapse between the most recent data and what is available.
I'm not super familiar with Oracle so there may be a way to get improvements by making changes to your query also...
您能否将具有所需 user_id 的行 ETL 到您自己的表中,仅创建所需的索引来支持您的查询并对其执行查询?
Can you ETL the rows with the desired user_id's into your own table, creating only the needed indexes to support your queries and perform your queries on it?
PeopleSoft 表是交付的还是定制的?你确定这是一个物理表,而不是PS端写得不好的视图?如果您要反对的是已交付的记录(示例看起来很像 PS_JOB 或引用它的视图),也许您可以指出这一点。 PS_JOB 是一个提供了大量索引的野兽,大多数网站添加的索引甚至更多。
如果您知道表上的索引,则可以使用 Oracle 提示来指定要使用的首选索引;这有时会有所帮助。
您是否制定了解释计划,看看是否可以确定问题出在哪里?也许有笛卡尔连接、全表扫描等?
Is the PeopleSoft table a delivered one, or is it custom? Are you sure it's a physical table, and not a poorly-written view on the PS side? If it's a delivered record you're going against (example looks much like PS_JOB or a view that references it), maybe you could indicate this. PS_JOB is a beast with tons of indexes delivered, and most sites add even more.
If you know the indexes on the table, you can use Oracle hints to specify a preferred index to use; that sometimes helps.
Have you done an explain plan to see if you can determine where the problem is? Maybe there's a cartesian join, full table scan, etc.?
在我看来,您正在处理数据仓库中的第 2 类维度。实现类型 2 维度的方法有多种,主要有 ValidFrom、ValidTo、Version、Status 等列。并非所有这些都始终存在,如果您可以发布表的架构,那将会很有趣。下面是一个示例(John Smith 于 2010 年 6 月 24 日从印第安纳州搬到了俄亥俄州)
要获取行的最新版本,通常使用
或
请注意,该行在远方有一些常量未来。
或
似乎您的示例使用
ValidFrom
( effective_date) ,因此您被迫查找max()
才能找到最新行。查看架构 - 您的表中是否有Status 或 ValidTo
等效项?It looks to me that you are dealing with a type 2 dimension in the data warehouse. There are several ways how to implement type 2 dimension, mostly having columns like
ValidFrom, ValidTo, Version, Status
. Not all of them are always present, it would be interesting if you could post the schema for your table. Here is an example of how it may look like (John Smith moved from Indiana to Ohio on 2010-06-24)To obtain the latest version of a row, it is common to use
or
Note that this one has some constant far in the future.
or
Seems that your example uses
ValidFrom
(effective_date), so you are forced to findmax()
in order to locate the latest row. Take a look at the schema -- is thereStatus or ValidTo
equivalents in your tables?