Why are SQL aggregate functions so much slower than Python and Java (or Poor Man's OLAP)?
I need a real DBA's opinion. Postgres 8.3 takes 200 ms to execute this query on my Macbook Pro while Java and Python perform the same calculation in under 20 ms (350,000 rows):
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
Is this normal behaviour when using a SQL database?
The schema (the table holds responses to a survey):
CREATE TABLE tuples (id integer primary key, a integer, b integer, c integer, d integer);
\copy tuples from '350,000 responses.csv' delimiter as ','
I wrote some tests in Java and Python for context and they crush SQL (except for pure Python):
java 1.5 threads ~ 7 ms
java 1.5 ~ 10 ms
python 2.5 numpy ~ 18 ms
python 2.5 ~ 370 ms
Even sqlite3 is competitive with Postgres despite it assuming all columns are strings (for contrast: even just switching to numeric columns instead of integers in Postgres results in a 10x slowdown)
Tunings I've tried without success include (blindly following some web advice):
increased the shared memory available to Postgres to 256MB
increased the working memory to 2MB
disabled connection and statement logging
used a stored procedure via CREATE FUNCTION ... LANGUAGE SQL
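For reference, the stored-procedure attempt was roughly along these lines (a sketch; the exact function I used may have differed, and the OUT-parameter names are just illustrative):

CREATE FUNCTION tuple_stats(OUT n bigint,
                            OUT avg_a numeric, OUT avg_b numeric,
                            OUT avg_c numeric, OUT avg_d numeric)
AS $$
    -- same aggregates as the ad-hoc query, just wrapped in a function
    SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
$$ LANGUAGE SQL STABLE;

SELECT * FROM tuple_stats();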
So my question is, is my experience here normal, and is this what I can expect when using a SQL database? I can understand that ACID must come with costs, but this is kind of crazy in my opinion. I'm not asking for realtime game speed, but since Java can process millions of doubles in under 20 ms, I feel a bit jealous.
Is there a better way to do simple OLAP on the cheap (both in terms of money and server complexity)? I've looked into Mondrian and Pig + Hadoop but not super excited about maintaining yet another server application and not sure if they would even help.
No, the Python and Java code do all the work in-house, so to speak. I just generate 4 arrays with 350,000 random values each, then take the average. I don't include the generation in the timings, only the averaging step. The Java threads timing uses 4 threads (one per array average); overkill, but it's definitely the fastest.
The sqlite3 timing is driven by the Python program and is running from disk (not :memory:)
I realize Postgres is doing much more behind the scenes, but most of that work doesn't matter to me since this is read only data.
The Postgres query doesn't change timing on subsequent runs.
I've rerun the Python tests to include spooling it off the disk. The timing slows down considerably to nearly 4 secs. But I'm guessing that Python's file handling code is pretty much in C (though maybe not the csv lib?) so this indicates to me that Postgres isn't streaming from the disk either (or that you are correct and I should bow down before whoever wrote their storage layer!)
Comments (10)
Are you using TCP to access Postgres? In that case, Nagle's algorithm is messing with your timing.
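One quick way to check is to time the query from psql once over TCP (psql -h 127.0.0.1 ...) and once over the local Unix-domain socket, which is not subject to Nagle at all (a sketch):

\timing
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;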
You need to increase Postgres' caches to the point where the whole working set fits into memory before you can expect to see performance comparable to doing it in-memory with a program.
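The relevant knobs are roughly these (a sketch for 8.3; shared_buffers needs a server restart, and effective_cache_size only tells the planner how much OS cache it may assume):

SHOW shared_buffers;
SHOW effective_cache_size;
-- in postgresql.conf, something like:
--   shared_buffers = 256MB
--   effective_cache_size = 512MB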
Thanks for the Oracle timings, that's the kind of stuff I'm looking for (disappointing though :-)
Materialized views are probably worth considering as I think I can precompute the most interesting forms of this query for most users.
I don't think query round-trip time should be very high, as I'm running the queries on the same machine that runs Postgres, so it can't add much latency?
I've also done some checking into the cache sizes, and it seems Postgres relies on the OS to handle caching; they specifically mention BSD as the ideal OS for this, so I think Mac OS ought to be pretty smart about bringing the table into memory. Unless someone has more specific params in mind, I think more specific caching is out of my control.
In the end I can probably put up with 200 ms response times, but knowing that 7 ms is a possible target makes me feel unsatisfied, as even 20-50 ms times would enable more users to have more up-to-date queries and get rid of a lot of caching and precomputation hacks.
I just checked the timings using MySQL 5 and they are slightly worse than Postgres. So barring some major caching breakthroughs, I guess this is what I can expect going the relational db route.
I wish I could up vote some of your answers, but I don't have enough points yet.
One other thing that an RDBMS generally does for you is to provide concurrency by protecting you from simultaneous access by another process. This is done by placing locks, and there's some overhead from that.
If you're dealing with entirely static data that never changes, and especially if you're in a basically "single user" scenario, then using a relational database doesn't necessarily gain you much benefit.
I don't think that your results are all that surprising -- if anything it is that Postgres is so fast.
Does the Postgres query run faster a second time once it has had a chance to cache the data? To be a little fairer your test for Java and Python should cover the cost of acquiring the data in the first place (ideally loading it off disk).
If this performance level is a problem for your application in practice, but you need an RDBMS for other reasons, then you could look at memcached. You would then have faster cached access to raw data and could do the calculations in code.
I'm an MS-SQL guy myself, and we'd use DBCC PINTABLE to keep a table cached, and SET STATISTICS IO to see that it's reading from cache, not disk.
I can't find anything on Postgres to mimic PINTABLE, but pg_buffercache seems to give details on what is in the cache - you may want to check that, and see if your table is actually being cached.
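A query along these lines (assuming the pg_buffercache contrib module is installed) shows how many 8 kB buffers of the table are currently cached:

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
  AND c.relname = 'tuples'
GROUP BY c.relname;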
A quick back of the envelope calculation makes me suspect that you're paging from disk. Assuming Postgres uses 4-byte integers, you have (6 * 4) bytes per row, so your table is a minimum of (24 * 350,000) bytes ~ 8.4MB. Assuming 40 MB/s sustained throughput on your HDD, you're looking at right around 200ms to read the data (which, as pointed out, should be where almost all of the time is being spent).
Unless I screwed up my math somewhere, I don't see how it's possible that you are able to read 8MB into your Java app and process it in the times you're showing - unless that file is already cached by either the drive or your OS.
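Incidentally, the estimate is easy to check, since Postgres can report the table's on-disk size directly:

SELECT pg_size_pretty(pg_relation_size('tuples'));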
Those are very detailed answers, but they mostly beg the question: how do I get these benefits without leaving Postgres, given that the data easily fits into memory, requires concurrent reads but no writes, and is queried with the same query over and over again?
Is it possible to precompile the query and optimization plan? I would have thought the stored procedure would do this, but it doesn't really help.
To avoid disk access it's necessary to cache the whole table in memory; can I force Postgres to do that? I think it's already doing this, though, since the query still takes 200 ms after repeated runs.
Can I tell Postgres that the table is read only, so it can optimize any locking code?
I think it's possible to estimate the query construction costs with an empty table (timings range from 20-60 ms)
I still can't see why the Java/Python tests are invalid. Postgres just isn't doing that much more work (though I still haven't addressed the concurrency aspect, just the caching and query construction)
UPDATE:
I don't think it's fair to compare the SELECTs as suggested, pulling 350,000 rows through the driver and serialization steps into Python to run the aggregation, nor even to omit the aggregation, since the overhead of formatting and displaying is hard to separate from the timing. If both engines are operating on in-memory data it should be an apples-to-apples comparison; I'm just not sure how to guarantee that's already happening.
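One rough way to at least make sure the Postgres side is warm before timing (a sketch, not a guarantee that both engines end up equally cached):

\timing
SELECT count(*) FROM tuples;                                   -- throwaway pass to warm the cache
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;  -- the timing that matters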
I can't figure out how to add comments, maybe I don't have enough reputation?
I retested with MySQL specifying ENGINE = MEMORY and it doesn't change a thing (still 200 ms). Sqlite3 using an in-memory db gives similar timings as well (250 ms).
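Roughly what the MEMORY-engine retest looked like (a sketch; it assumes the same data is already loaded into a regular MySQL table called tuples):

CREATE TABLE tuples_mem (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER, d INTEGER) ENGINE = MEMORY;
INSERT INTO tuples_mem SELECT * FROM tuples;
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples_mem;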
The math here looks correct (at least the size, as that's how big the sqlite db is :-)
I'm just not buying the disk-causes-slowness argument, as there is every indication the tables are in memory (the Postgres guys all warn against trying too hard to pin tables in memory, as they swear the OS will do it better than the programmer).
To clarify the timings: the Java code is not reading from disk, making it a totally unfair comparison if Postgres is reading from disk and computing a complicated query, but that's really beside the point; the DB should be smart enough to bring a small table into memory and precompile a stored procedure, IMHO.
UPDATE (in response to the first comment below):
I'm not sure how I'd test the query without using an aggregation function in a way that would be fair, since if I select all of the rows it'll spend tons of time serializing and formatting everything. I'm not saying that the slowness is due to the aggregation function; it could still be just overhead from concurrency, integrity, and friends. I just don't know how to isolate the aggregation as the sole independent variable.
Postgres is doing a lot more than it looks like (maintaining data consistency for a start!)
If the values don't have to be 100% spot on, or if the table is updated rarely, but you are running this calculation often, you might want to look into Materialized Views to speed it up.
(Note, I have not used materialized views in Postgres, they look a little hacky, but they might suit your situation).
Materialized Views
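Postgres 8.3 has no built-in materialized views, so the hacky version is basically a summary table that you refresh from cron or a trigger; a rough sketch (table and column names are mine):

CREATE TABLE tuples_summary AS
  SELECT count(id) AS n, avg(a) AS avg_a, avg(b) AS avg_b,
         avg(c) AS avg_c, avg(d) AS avg_d
  FROM tuples;

-- re-run periodically, or from a trigger on tuples:
TRUNCATE tuples_summary;
INSERT INTO tuples_summary
  SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;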
Also consider the overhead of actually connecting to the server and the round trip required to send the request to the server and back.
I'd consider 200 ms for something like this to be pretty good. A quick test on my Oracle server, same table structure with about 500k rows and no indexes, takes about 1-1.5 seconds, which is almost all just Oracle sucking the data off disk.
The real question is, is 200ms fast enough?
-------------- More --------------------
I was interested in solving this using materialized views, since I've never really played with them. This is in Oracle.
First I created an MV that refreshes every minute.
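It was something along these lines (a sketch; the names and exact refresh options are illustrative):

CREATE MATERIALIZED VIEW tuples_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1/1440   -- every minute
AS SELECT count(id) AS n, avg(a) AS avg_a, avg(b) AS avg_b,
          avg(c) AS avg_c, avg(d) AS avg_d
   FROM tuples;

SELECT * FROM tuples_mv;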
While it's refreshing, no rows are returned.
Once it has refreshed, it's MUCH faster than doing the raw query.
If we insert into the base table, the result is not immediately visible via the MV.
But wait a minute or so and the MV will update behind the scenes, and the result is returned as fast as you could want.
This isn't ideal. For a start, it's not realtime; inserts/updates will not be immediately visible. Also, you've got a query running to update the MV whether you need it or not (this can be tuned to whatever time frame, or run on demand). But it does show how much faster an MV can make things seem to the end user, if you can live with values which aren't quite up-to-the-second accurate.
I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps: parsing the SQL and building an execution plan, reading the table data (from disk, unless it is already cached), and finally computing the aggregates.
So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to this question I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.
To obtain more information about where Postgres spends its time, I would suggest the following tests:
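For instance, something along these lines (a sketch; \timing in psql reports the round-trip time, and EXPLAIN ANALYZE shows where the query itself spends its time):

\timing
EXPLAIN ANALYZE SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
SELECT count(*) FROM tuples;                                   -- scan only, trivial aggregate
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;  -- scan plus the real aggregates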
To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.
There's several ways to do that: give Postgres enough cache that the whole table stays in memory, and use indexes so that less data has to be read in the first place (but see the update below).
Update:
I just realized that you seem to have no use for indices for the above query and most likely aren't using any either, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem; disk access is. I'll leave the index stuff in anyway, it might still have some use.