Is this SQL query theoretically too large to be served quickly under load?

Published on 2024-10-23 21:52:54

I have a SQL query that can easily return 70,000 rows from a relational database table (with a few joins involved). The total size of this result set is about 20 MB. The table itself is about 10 million rows.

I lack perspective here, so I am wondering whether a query of this size can practically be served quickly even at a few hundred requests per second on a web page. Also, this is NOT a read-only table: there is a fairly significant number of updates/deletes (somewhere between a 3:1 and 10:1 read/write ratio, depending on the time of year).

I know I need indexes, etc. What I am wondering is whether a single database server (with, say, 4 GB of RAM and a modern quad-core CPU) could even theoretically serve this without thrashing the hell out of the CPU or disk I/O and getting terrible performance.
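
As a quick back-of-envelope sanity check on those numbers (a sketch only; the 300 requests/second figure is an assumed stand-in for "a few hundred"), the raw volume of result data is already enormous:

-- rough arithmetic only; 20 MB per response comes from the question, 300 req/s is assumed
select 20 * 300    as mb_per_second,    -- ~6,000 MB/s of result data to build and ship
       70000 * 300 as rows_per_second;  -- ~21,000,000 rows/s to read, serialize and send

That is several gigabytes per second leaving the database before indexing or caching is even considered.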

Comments (4)

仅一夜美梦 2024-10-30 21:52:54

You haven't provided much background as to what your queries actually look like, but I will walk you through how to get a ballpark estimate of whether your expectations are realistic, using PostgreSQL as an example.

Preparing a dummy table with 10M rows, and 80 bytes of filler data per row:

create table foo as select
    generate_series(1,10000000) as foo_id,
    repeat('a', 80) as filler;
create unique index foo_foo_id on foo (foo_id);
vacuum analyze foo;

This table is 1400 MB total, including the index, so it fits entirely into my OS cache, but not PostgreSQL's shared buffers.
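
If you want to reproduce that size check on your own table, PostgreSQL can report the total size including indexes directly (substitute your table name for foo):

select pg_size_pretty(pg_total_relation_size('foo'));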

Creating a custom pgbench script to fetch 70000 rows ordered by an index:

\setrandom key 1 9000000
SELECT * FROM foo WHERE foo_id > :key ORDER BY foo_id LIMIT 70000;
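
Note that \setrandom was removed from pgbench in PostgreSQL 9.6; on newer versions the equivalent script would look roughly like this:

\set key random(1, 9000000)
SELECT * FROM foo WHERE foo_id > :key ORDER BY foo_id LIMIT 70000;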

Here are the results from running the benchmark on my 4-core desktop computer (AMD Phenom II X4 955) for 1 minute:

% pgbench -j 4 -c 4 -T 60 -n -f script.pgb
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 60 s
number of transactions actually processed: 3922
tps = 65.309954 (including connections establishing)
tps = 65.316916 (excluding connections establishing)

Note that here the client (pgbench) and the server are on the same physical machine. In reality they would be on different machines, so things like network overhead and throughput come into play.

This naive configuration can handle ~65 such queries per second. That's much lower than "a few hundred requests per second", so you'd need a much more powerful server to handle this kind of workload. Replication with multiple slaves is an option.
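
As a very rough sketch of that replication idea (streaming replication on a reasonably recent PostgreSQL; the role name, password and host below are placeholders, and exact steps vary by version):

-- on the primary: create a role the replicas connect as
create role replicator with replication login password 'change_me';
-- on each replica host, clone the primary; -R writes the standby connection settings:
--   pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -R
-- then point the heavy read-only report queries at the replicas.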

To get a more realistic result, you should tweak the pgbench script and test data to match your workload more closely.

楠木可依 2024-10-30 21:52:54

Well, no. But if you can limit your result set (to display it paginated), cache results, and possibly preprocess/transform your data (in effect, creating your own, optimized index), it could be possible.
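
For the "limit your result set" part, one common approach is keyset pagination: fetch one page per request instead of all 70,000 rows. A sketch against the foo table from the answer above (:last_seen_id stands in for a bind parameter supplied by the application):

-- fetch the next 50 rows after the last row the client has already seen
select *
from foo
where foo_id > :last_seen_id
order by foo_id
limit 50;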

EDIT: What I meant by preprocessing is to run, for example, a cron job periodically that massages your data into a form where it can very easily be queried by your consumer, e.g. a temporary or intermediate table (no joins involved). That way you perform the join-heavy queries only every few seconds/minutes. If you depend on accurate real-time queries, the cron-job optimization might not be possible.
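
A minimal sketch of that kind of cron-driven preprocessing (report_cache, orders and customers are hypothetical names, purely for illustration; report_cache is assumed to be a plain, pre-created table with matching columns):

-- rebuild a flat, join-free reporting table every few minutes from a cron job
begin;
truncate table report_cache;
insert into report_cache (customer_id, customer_name, order_total)
select o.customer_id, c.name, sum(o.total)
from orders o
join customers c on c.id = o.customer_id
group by o.customer_id, c.name;
commit;

The web tier then reads only from report_cache, so the join-heavy query runs once per refresh instead of once per page view.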

To be able to answer all queries without overloading the DB layer, you could possibly cache the reusable results of previous searches in a memory cache, e.g. memcached.

吃不饱 2024-10-30 21:52:54

It depends a lot on how selective the indexes are, and what you're doing with the data. I'd say 70K rows and 20 MB isn't a show-stopper if you're piping the result set to a file for automatic processing. But it might be a show-stopper if you're trying to load it into a web page.

Anyway, I'd encourage you to give some thought to the real reason someone needs to see 70,000 rows and 20 megabytes on a web page. What are they trying to accomplish with that much data at one time?

萌能量女王 2024-10-30 21:52:54

With the hardware you describe, you leave out the most important part: storage. Typical databases are bottlenecked first by disk and then by memory. Modern CPUs are so fast that they're usually not the problem. If you get serious RAID or SSDs, you can make it do some serious stuff. And the 10M-row table will be entirely in memory anyway for most of the tools you describe.

The problem you describe, however, will probably get hung up on locking. There are many users reading and writing small facts to a table, and then you read a large portion of that table. There are different ways of handling that, called isolation levels. With the load you describe, you probably want to steer clear of that contention altogether.
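
If the big read does stay on the live table, the isolation level is the concrete knob being referred to; for example, on PostgreSQL the report could run against one consistent snapshot like this (a sketch reusing the foo table from the first answer; note that very long transactions also delay vacuum cleanup):

-- run the large analytical read at a single consistent snapshot of the data
begin transaction isolation level repeatable read;
select * from foo where foo_id > 1000000 order by foo_id limit 70000;  -- the big report query
commit;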

This is a classic problem in a field called data warehousing, where you want to run large analytical queries against an online system. You want to create a second copy of that table, using log shipping for instance. Most of the databases you tagged can do this. Log shipping creates a buffer between the fast-changing table and the analytical table. When you lock this analytical table, the updates bunch up until you're done, and there are only a few people reading from this table, so you have it all to yourself. Typically this will cost just a couple of percent of your database's maximum throughput. If you're near that already, you have scaling issues. If you really need to see the latest data, look into real-time BI.

Also, having a second copy of that data frees you to structure it differently, in a way that is very easy to query. The central idea there is the star schema.
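
As a tiny illustration of what that restructured copy might look like (a star-schema sketch with hypothetical dimension and fact tables, not a schema taken from the question):

-- dimension tables: small, descriptive lookup tables
create table dim_date     (date_id int primary key, calendar_date date, year int, month int);
create table dim_customer (customer_id int primary key, name text, region text);

-- fact table: narrow rows, one per event, pointing at the dimensions
create table fact_sales (
    date_id     int not null references dim_date,
    customer_id int not null references dim_customer,
    quantity    int,
    amount      numeric
);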

Regards GJ
