Why are SQL aggregate functions so much slower than Python and Java (or Poor Man's OLAP)?
I need a real DBA's opinion. Postgres 8.3 takes 200 ms to execute this query on my Macbook Pro while Java and Python perform the same calculation in under 20 ms (350,000 rows):
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
Is this normal behaviour when using a SQL database?
The schema (the table holds responses to a survey):
CREATE TABLE tuples (id integer primary key, a integer, b integer, c integer, d integer);
\copy tuples from '350,000 responses.csv' delimiter as ','
I wrote some tests in Java and Python for context and they crush SQL (except for pure Python):
java 1.5 threads ~ 7 ms
java 1.5 ~ 10 ms
python 2.5 numpy ~ 18 ms
python 2.5 ~ 370 ms
Even sqlite3 is competitive with Postgres despite it assuming all columns are strings (for contrast: even just switching to numeric columns instead of integers in Postgres results in a 10x slowdown)
Tunings I've tried without success include (blindly following some web advice):
increased the shared memory available to Postgres to 256MB
increased the working memory to 2MB
disabled connection and statement logging
used a stored procedure via CREATE FUNCTION ... LANGUAGE SQL
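For reference, the stored-procedure attempt was roughly along these lines (a sketch; the exact function I used may have differed, and the OUT-parameter names are just illustrative):

CREATE FUNCTION tuple_stats(OUT n bigint,
                            OUT avg_a numeric, OUT avg_b numeric,
                            OUT avg_c numeric, OUT avg_d numeric)
AS $$
    -- same aggregates as the ad-hoc query, just wrapped in a function
    SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
$$ LANGUAGE SQL STABLE;

SELECT * FROM tuple_stats();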
So my question is, is my experience here normal, and is this what I can expect when using a SQL database? I can understand that ACID must come with costs, but this is kind of crazy in my opinion. I'm not asking for realtime game speed, but since Java can process millions of doubles in under 20 ms, I feel a bit jealous.
Is there a better way to do simple OLAP on the cheap (both in terms of money and server complexity)? I've looked into Mondrian and Pig + Hadoop but not super excited about maintaining yet another server application and not sure if they would even help.
No, the Python and Java code do all the work in-house, so to speak. I just generate 4 arrays with 350,000 random values each, then take the average. I don't include the generation in the timings, only the averaging step. The Java threads timing uses 4 threads (one per array average); overkill, but it's definitely the fastest.
The sqlite3 timing is driven by the Python program and is running from disk (not :memory:)
I realize Postgres is doing much more behind the scenes, but most of that work doesn't matter to me since this is read only data.
The Postgres query doesn't change timing on subsequent runs.
I've rerun the Python tests to include spooling it off the disk. The timing slows down considerably to nearly 4 secs. But I'm guessing that Python's file handling code is pretty much in C (though maybe not the csv lib?) so this indicates to me that Postgres isn't streaming from the disk either (or that you are correct and I should bow down before whoever wrote their storage layer!)
Comments (10)
Are you using TCP to access Postgres? In that case, Nagle's algorithm is messing with your timing.
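One quick way to check is to time the query from psql once over TCP (psql -h 127.0.0.1 ...) and once over the local Unix-domain socket, which is not subject to Nagle at all (a sketch):

\timing
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;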
You need to increase Postgres' caches to the point where the whole working set fits into memory before you can expect to see performance comparable to doing it in-memory with a program.
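The relevant knobs are roughly these (a sketch for 8.3; shared_buffers needs a server restart, and effective_cache_size only tells the planner how much OS cache it may assume):

SHOW shared_buffers;
SHOW effective_cache_size;
-- in postgresql.conf, something like:
--   shared_buffers = 256MB
--   effective_cache_size = 512MB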
Thanks for the Oracle timings, that's the kind of stuff I'm looking for (disappointing though :-)
Materialized views are probably worth considering as I think I can precompute the most interesting forms of this query for most users.
I don't think query round-trip time should be very high, as I'm running the queries on the same machine that runs Postgres, so it can't add much latency?
I've also done some checking into the cache sizes, and it seems Postgres relies on the OS to handle caching; they specifically mention BSD as the ideal OS for this, so I think Mac OS ought to be pretty smart about bringing the table into memory. Unless someone has more specific params in mind, I think more specific caching is out of my control.
In the end I can probably put up with 200 ms response times, but knowing that 7 ms is a possible target makes me feel unsatisfied, as even 20-50 ms times would enable more users to have more up-to-date queries and get rid of a lot of caching and precomputation hacks.
I just checked the timings using MySQL 5 and they are slightly worse than Postgres. So barring some major caching breakthroughs, I guess this is what I can expect going the relational db route.
I wish I could up vote some of your answers, but I don't have enough points yet.
One other thing that an RDBMS generally does for you is to provide concurrency by protecting you from simultaneous access by another process. This is done by placing locks, and there's some overhead from that.
If you're dealing with entirely static data that never changes, and especially if you're in a basically "single user" scenario, then using a relational database doesn't necessarily gain you much benefit.
I don't think that your results are all that surprising -- if anything it is that Postgres is so fast.
Does the Postgres query run faster a second time once it has had a chance to cache the data? To be a little fairer your test for Java and Python should cover the cost of acquiring the data in the first place (ideally loading it off disk).
If this performance level is a problem for your application in practice, but you need an RDBMS for other reasons, then you could look at memcached. You would then have faster cached access to raw data and could do the calculations in code.
I'm an MS-SQL guy myself, and we'd use DBCC PINTABLE to keep a table cached, and SET STATISTICS IO to see that it's reading from cache, not disk.
I can't find anything on Postgres to mimic PINTABLE, but pg_buffercache seems to give details on what is in the cache - you may want to check that, and see if your table is actually being cached.
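A query along these lines (assuming the pg_buffercache contrib module is installed) shows how many 8 kB buffers of the table are currently cached:

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
  AND c.relname = 'tuples'
GROUP BY c.relname;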
A quick back of the envelope calculation makes me suspect that you're paging from disk. Assuming Postgres uses 4-byte integers, you have (6 * 4) bytes per row, so your table is a minimum of (24 * 350,000) bytes ~ 8.4MB. Assuming 40 MB/s sustained throughput on your HDD, you're looking at right around 200ms to read the data (which, as pointed out, should be where almost all of the time is being spent).
Unless I screwed up my math somewhere, I don't see how it's possible that you are able to read 8MB into your Java app and process it in the times you're showing - unless that file is already cached by either the drive or your OS.
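Incidentally, the estimate is easy to check, since Postgres can report the table's on-disk size directly:

SELECT pg_size_pretty(pg_relation_size('tuples'));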
Those are very detailed answers, but they mostly beg the question: how do I get these benefits without leaving Postgres, given that the data easily fits into memory, requires concurrent reads but no writes, and is queried with the same query over and over again?
Is it possible to precompile the query and optimization plan? I would have thought the stored procedure would do this, but it doesn't really help.
To avoid disk access it's necessary to cache the whole table in memory; can I force Postgres to do that? I think it's already doing this, though, since the query still takes 200 ms after repeated runs.
Can I tell Postgres that the table is read only, so it can optimize any locking code?
I think it's possible to estimate the query construction costs with an empty table (timings range from 20-60 ms)
I still can't see why the Java/Python tests are invalid. Postgres just isn't doing that much more work (though I still haven't addressed the concurrency aspect, just the caching and query construction)
UPDATE:
I don't think it's fair to compare the SELECTs as suggested, pulling 350,000 rows through the driver and serialization steps into Python to run the aggregation, nor even to omit the aggregation, since the overhead of formatting and displaying is hard to separate from the timing. If both engines are operating on in-memory data it should be an apples-to-apples comparison; I'm just not sure how to guarantee that's already happening.
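One rough way to at least make sure the Postgres side is warm before timing (a sketch, not a guarantee that both engines end up equally cached):

\timing
SELECT count(*) FROM tuples;                                   -- throwaway pass to warm the cache
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;  -- the timing that matters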
I can't figure out how to add comments, maybe I don't have enough reputation?
I retested with MySQL specifying ENGINE = MEMORY and it doesn't change a thing (still 200 ms). Sqlite3 using an in-memory db gives similar timings as well (250 ms).
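Roughly what the MEMORY-engine retest looked like (a sketch; it assumes the same data is already loaded into a regular MySQL table called tuples):

CREATE TABLE tuples_mem (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER, d INTEGER) ENGINE = MEMORY;
INSERT INTO tuples_mem SELECT * FROM tuples;
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples_mem;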
The math here looks correct (at least the size, as that's how big the sqlite db is :-)
I'm just not buying the disk-causes-slowness argument, as there is every indication the tables are in memory (the Postgres guys all warn against trying too hard to pin tables in memory, as they swear the OS will do it better than the programmer).
To clarify the timings: the Java code is not reading from disk, making it a totally unfair comparison if Postgres is reading from disk and computing a complicated query, but that's really beside the point; the DB should be smart enough to bring a small table into memory and precompile a stored procedure, IMHO.
UPDATE (in response to the first comment below):
I'm not sure how I'd test the query without using an aggregation function in a way that would be fair, since if I select all of the rows it'll spend tons of time serializing and formatting everything. I'm not saying that the slowness is due to the aggregation function; it could still be just overhead from concurrency, integrity, and friends. I just don't know how to isolate the aggregation as the sole independent variable.
Postgres is doing a lot more than it looks like (maintaining data consistency for a start!)
If the values don't have to be 100% spot on, or if the table is updated rarely, but you are running this calculation often, you might want to look into Materialized Views to speed it up.
(Note, I have not used materialized views in Postgres, they look a little hacky, but they might suit your situation).
Materialized Views
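Postgres 8.3 has no built-in materialized views, so the hacky version is basically a summary table that you refresh from cron or a trigger; a rough sketch (table and column names are mine):

CREATE TABLE tuples_summary AS
  SELECT count(id) AS n, avg(a) AS avg_a, avg(b) AS avg_b,
         avg(c) AS avg_c, avg(d) AS avg_d
  FROM tuples;

-- re-run periodically, or from a trigger on tuples:
TRUNCATE tuples_summary;
INSERT INTO tuples_summary
  SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;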
Also consider the overhead of actually connecting to the server and the round trip required to send the request to the server and back.
I'd consider 200 ms for something like this to be pretty good. A quick test on my Oracle server, same table structure with about 500k rows and no indexes, takes about 1-1.5 seconds, which is almost all just Oracle sucking the data off disk.
The real question is, is 200ms fast enough?
-------------- More --------------------
I was interested in solving this using materialized views, since I've never really played with them. This is in Oracle.
First I created an MV that refreshes every minute.
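It was something along these lines (a sketch; the names and exact refresh options are illustrative):

CREATE MATERIALIZED VIEW tuples_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1/1440   -- every minute
AS SELECT count(id) AS n, avg(a) AS avg_a, avg(b) AS avg_b,
          avg(c) AS avg_c, avg(d) AS avg_d
   FROM tuples;

SELECT * FROM tuples_mv;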
While it's refreshing, no rows are returned.
Once it has refreshed, it's MUCH faster than doing the raw query.
If we insert into the base table, the result is not immediately visible via the MV.
But wait a minute or so and the MV will update behind the scenes, and the result is returned as fast as you could want.
This isn't ideal. For a start, it's not realtime; inserts/updates will not be immediately visible. Also, you've got a query running to update the MV whether you need it or not (this can be tuned to whatever time frame, or run on demand). But it does show how much faster an MV can make things seem to the end user, if you can live with values which aren't quite up-to-the-second accurate.
I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps: parsing the SQL and building an execution plan, reading the table data (from disk, unless it is already cached), and finally computing the aggregates.
So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to this question I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.
To obtain more information about where Postgres spends its time, I would suggest the following tests:
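For instance, something along these lines (a sketch; \timing in psql reports the round-trip time, and EXPLAIN ANALYZE shows where the query itself spends its time):

\timing
EXPLAIN ANALYZE SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
SELECT count(*) FROM tuples;                                   -- scan only, trivial aggregate
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;  -- scan plus the real aggregates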
To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.
There's several ways to do that: give Postgres enough cache that the whole table stays in memory, and use indexes so that less data has to be read in the first place (but see the update below).
Update:
I just realized that you seem to have no use for indices for the above query and most likely aren't using any either, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem; disk access is. I'll leave the index stuff in anyway, it might still have some use.