Best way to test data warehouse performance?
I have a data warehouse based on PostgreSQL.
Until now, I have been running queries against a database containing just a fraction of my real database. Once I have written the queries in a way that makes them efficient for this small test database, I run them on the real one.
The problem is that once I run the queries on the real database, it runs out of memory and starts writing things like indexes and temp tables to disk. This means that different queries could be optimal for the test database and for the real database. Does this mean that I really have to run queries that take several minutes to complete in order to know which query is the optimal one?
Learn how to interpret EXPLAIN output, then check that the EXPLAIN output shows that the chosen query plan in your large database is similar to what you would expect, before running the query.
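As a minimal sketch of that workflow (the table and column names here are hypothetical, not from the question):

```sql
-- Inspect the planner's chosen plan without executing the query.
EXPLAIN
SELECT customer_id, sum(amount)
FROM sales
WHERE sale_date >= '2020-01-01'
GROUP BY customer_id;

-- Then run it for real, with actual row counts, timings, and buffer
-- usage; a line like "Sort Method: external merge  Disk: ..." is the
-- tell-tale sign that the query spilled to disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(amount)
FROM sales
WHERE sale_date >= '2020-01-01'
GROUP BY customer_id;
```

Comparing the plain EXPLAIN plan on the large database against the plan you saw on the test database tells you, cheaply, whether the planner is about to do something different at full scale.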
Three questions:
1) How complex are the queries? The generation of indexes and temp tables suggests the server has to generate these things because of complex operations on unindexed columns. How likely is this? From what you report, it seems the likely answer is "complex".
2) How large are the return sets? Is the end result 100 rows or 1 million? From what you report, the answer could be anything. I suspect this question is not as important, but it is important at least to know.
3) Restating question 1 in a different way, even if the returned sets are small, are there enormous intermediate results that have to be compiled on the way to the small result? Again, I suspect the answer here is large complex intermediate results are being generated.
This would suggest that at the very least some things need to be indexed, and perhaps the data needs to be structured on the way in to be closer to what you are trying to query.
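Both remedies can be sketched briefly (table, column, and view names are hypothetical):

```sql
-- Index the column the problem query filters or joins on, so the
-- planner can avoid full scans and huge hash/sort steps.
CREATE INDEX idx_sales_sale_date ON sales (sale_date);

-- Or restructure the data on the way in: precompute the shape the
-- queries actually ask for, and refresh it on your load schedule.
CREATE MATERIALIZED VIEW daily_customer_totals AS
SELECT sale_date, customer_id, sum(amount) AS total
FROM sales
GROUP BY sale_date, customer_id;

-- Re-run after each warehouse load.
REFRESH MATERIALIZED VIEW daily_customer_totals;
```

The materialized view is one possible way to "structure the data on the way in"; a summary table maintained by the load job would serve the same purpose.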
One last question, is this a pervasive problem for most of your more important queries, or only for one or two?
EDIT IN RESPONSE TO COMMENT: I do data warehouse queries all day, and some of them take 10 minutes or so. Some take hours, and I push them off into a background job and break them up into stages to prevent bogging everything down. That is the nature of handling very large data sets.
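One way to break a long job into stages, as described above, is to materialize the intermediate result explicitly (all names here are hypothetical):

```sql
-- Stage 1: reduce the raw data to a small intermediate table.
CREATE TABLE stage1_totals AS
SELECT customer_id, sum(amount) AS total
FROM sales
WHERE sale_date >= '2020-01-01'
GROUP BY customer_id;

-- Give the planner statistics about the new table before stage 2.
ANALYZE stage1_totals;

-- Stage 2: the expensive join now touches far fewer rows.
SELECT c.region, sum(t.total)
FROM stage1_totals t
JOIN customers c USING (customer_id)
GROUP BY c.region;
```

Each stage finishes and commits on its own, so a failure or a slow step no longer drags the whole pipeline down with it.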
My questions in the original answer are aimed at figuring out whether your problem queries will ever finish. It is possible to unwittingly write a query that produces so much intermediate data that you can walk away, come back 2 days later, and it is still running. So I would restate my original three questions; they are in fact the only way to answer your question completely.
Recap: Yes, some queries take much longer; it is the nature of the beast. The best you can hope for is performance linear in the amount of data being read, and if there are 100 million rows to process, that will take minutes instead of seconds. But much more importantly, if a query runs in 4 seconds on 1 million rows but on 100 million rows takes >> 400 seconds (like an hour), then those original questions I asked will help you figure out why, with the aim of optimizing those queries.