Why is PostgreSQL eating up all my precious HD space?

Posted 2024-07-10 18:16:12


I just finished transferring as much link-structure data concerning Wikipedia (English) as I could. Basically, I downloaded a bunch of SQL dumps from Wikipedia's latest dump repository. Since I am using PostgreSQL instead of MySQL, I decided to load all these dumps into my db using piped shell commands.

Anyway, one of these tables has 295 million rows: the pagelinks table; it contains all intra-wiki hyperlinks. From my laptop, using pgAdmin III, I sent the following command to my database server (another computer):

SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);

It's been at it for an hour or so now. The thing is that the postmaster seems to be eating up more and more of my very limited HD space; I think it has consumed about 20 GB so far. I had previously played around with the postgresql.conf file to give it more performance flexibility (i.e. let it use more resources), since the server runs with 12 GB of RAM. I basically quadrupled most of the byte- and size-related variables in that file, thinking it would use more RAM to do its thing.

However, the db does not seem to use much RAM. Using the Linux system monitor, I can see that the postmaster is using 1.6 GB of shared memory (RAM). Anyway, I was wondering if you could help me better understand what it is doing, because it seems that I really do not understand how PostgreSQL uses HD resources.

Concerning the metastructure of the Wikipedia databases, they provide a good schema that may be of use or even of interest to you.

Feel free to ask me for more details, thx.


Comments (3)

瀟灑尐姊 2024-07-17 18:16:12


It's probably the GROUP BY that's causing the problem. In order to do grouping, the database has to sort the rows to put duplicate items together. An index probably won't help. A back-of-the-envelope calculation:

Assuming each row takes 100 bytes of space, that's 29,500,000,000 bytes, or about 30GB of storage. It can't fit all that in memory, so your system is thrashing, which slows operations down by a factor of 1000 or more. Your HD space may be disappearing into swap space, if it's using swap files.

If you only need to do this calculation once, try breaking it apart into smaller subsets of the data. Assuming pl_namespace is numeric and ranges from 1 to 295 million, try something like this:

SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace between 1 and 50000000
GROUP BY (pl_namespace);

Then do the same for 50000001-100000000 and so forth. Combine your answers together using UNION or simply tabulate the results with an external program. Forget what I wrote about an index not helping GROUP BY; here, an index will help the WHERE clause.
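
For example, the per-range results could be stitched back together in a single statement. A sketch, reusing the illustrative range boundaries above; UNION ALL is deliberate here, since the ranges are disjoint there is nothing to de-duplicate and plain UNION would force yet another sort:

SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace BETWEEN 1 AND 50000000
GROUP BY pl_namespace

UNION ALL

SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace BETWEEN 50000001 AND 100000000
GROUP BY pl_namespace;

-- ...and so on for the remaining ranges.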

苦妄 2024-07-17 18:16:12


What exactly is claiming that it's only taking 9.5MB of RAM? That sounds unlikely to me - the shared memory almost certainly is RAM which is being shared between different Postgres processes. (From what I remember, each client ends up as a separate process, although it's been a while so I could be very wrong.)

Do you have an index on the pl_namespace column? If there's an awful lot of distinct results, I could imagine that query being pretty heavy on a 295 million row table with no index. Having said that, 10GB is an awful lot to swallow. Do you know which files it's writing to?
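
One way to answer that last question from inside the database rather than by watching the filesystem; a sketch, assuming a PostgreSQL version recent enough to track temp-file statistics:

-- Cumulative count and total size of temporary files written by
-- queries in this database since the statistics were last reset.
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();

The files themselves live under $PGDATA/base/<db-oid>/pgsql_tmp, which (as the answer below confirms) is exactly where the space was going.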

噩梦成真你也成魔 2024-07-17 18:16:12


Ok so here is the gist of it:

The GROUP BY clause made the indexes useless, so the postmaster (the PostgreSQL server process) decided to create a bunch of temporary files (23 GB of them), located in the directory $PGDATA/base/16384/pgsql_tmp.

When modifying the postgresql.conf file, I had given PostgreSQL permission to use 1.6 GB of RAM (which I will now double, since the machine has 11.7 GB of RAM available); the postmaster process was indeed using the full 1.6 GB, but that wasn't enough, hence the pgsql_tmp directory.
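
For what it's worth, the setting that governs when a sort or hash spills into pgsql_tmp is work_mem, and it can be raised for a single session instead of server-wide. A minimal sketch, with purely illustrative values rather than tuning advice:

-- Per-session override: each sort/hash operation may use up to this
-- much RAM before spilling to temporary files on disk.
SET work_mem = '1GB';

SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);

-- Return to the server-wide default afterwards.
RESET work_mem;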

As was pointed out by Barry Brown, since I was only executing this SQL command to get some statistical information about the distribution of the links among the pagelinks.namespaces, I could have queried a subset of the 296 million pagelinks (this is what they do for surveys).

When the command returned its result set, all the temporary files were automatically deleted, as if nothing had happened.
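
For anyone who would rather watch this happen than discover it from a full disk: EXPLAIN ANALYZE reports when the grouping step spills. A sketch; note that it actually executes the query, and the exact plan wording varies across PostgreSQL versions:

EXPLAIN ANALYZE
SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);

-- A plan node reading something like
--   Sort Method: external merge  Disk: 23456789kB
-- means the sort overflowed work_mem and ran out of
-- $PGDATA/base/<db-oid>/pgsql_tmp, i.e. exactly the files described above.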

Thx for your help guys!
