Why is PostgreSQL eating up all my precious HD space?

Posted 2024-07-10 18:16:12


I just finished transferring as much link-structure data concerning Wikipedia (English) as I could. Basically, I downloaded a bunch of SQL dumps from Wikipedia's latest dump repository. Since I am using PostgreSQL instead of MySQL, I decided to load all these dumps into my db using piped shell commands.

Anyway, one of these tables has 295 million rows: the pagelinks table; it contains all intra-wiki hyperlinks. From my laptop, using pgAdmin III, I sent the following command to my database server (another computer):

SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);

It's been at it for an hour or so now. The thing is that the postmaster seems to be eating up more and more of my very limited HD space; I think it has consumed about 20 GB so far. I had previously played around with the postgresql.conf file to give it more performance flexibility (i.e. let it use more resources), since the server runs with 12 GB of RAM. I basically quadrupled most of the byte- and size-related variables in that file, thinking it would use more RAM to do its thing.

However, the db does not seem to use much RAM. Using the Linux system monitor, I can see that the postmaster is using 1.6 GB of shared memory (RAM). Anyway, I was wondering if you could help me better understand what it is doing, because it seems that I really do not understand how PostgreSQL uses HD resources.

Concerning the metastructure of the Wikipedia databases, they provide a good schema that may be of use or even of interest to you.

Feel free to ask me for more details, thx.


Comments (3)

瀟灑尐姊 2024-07-17 18:16:12


It's probably the GROUP BY that's causing the problem. In order to do grouping, the database has to sort the rows to put duplicate items together. An index probably won't help. A back-of-the-envelope calculation:

Assuming each row takes 100 bytes of space, that's 29,500,000,000 bytes, or about 30GB of storage. It can't fit all that in memory, so your system is thrashing, which slows operations down by a factor of 1000 or more. Your HD space may be disappearing into swap space, if it's using swap files.

If you only need to do this calculation once, try breaking it apart into smaller subsets of the data. Assuming pl_namespace is numeric and ranges from 1 to 295 million, try something like this:

SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace between 1 and 50000000
GROUP BY (pl_namespace);

Then do the same for 50000001-100000000 and so forth. Combine your answers together using UNION or simply tabulate the results with an external program. Forget what I wrote about an index not helping GROUP BY; here, an index will help the WHERE clause.
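
For example, the per-range results could be stitched back together in a single statement. A sketch, reusing the illustrative range boundaries above; UNION ALL is deliberate here, since the ranges are disjoint there is nothing to de-duplicate and plain UNION would force yet another sort:

SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace BETWEEN 1 AND 50000000
GROUP BY pl_namespace

UNION ALL

SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace BETWEEN 50000001 AND 100000000
GROUP BY pl_namespace;

-- ...and so on for the remaining ranges.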

苦妄 2024-07-17 18:16:12


What exactly is claiming that it's only taking 9.5MB of RAM? That sounds unlikely to me - the shared memory almost certainly is RAM which is being shared between different Postgres processes. (From what I remember, each client ends up as a separate process, although it's been a while so I could be very wrong.)

Do you have an index on the pl_namespace column? If there's an awful lot of distinct results, I could imagine that query being pretty heavy on a 295 million row table with no index. Having said that, 10GB is an awful lot to swallow. Do you know which files it's writing to?
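
One way to answer that last question from inside the database rather than by watching the filesystem; a sketch, assuming a PostgreSQL version recent enough to track temp-file statistics:

-- Cumulative count and total size of temporary files written by
-- queries in this database since the statistics were last reset.
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();

The files themselves live under $PGDATA/base/<db-oid>/pgsql_tmp, which (as the answer below confirms) is exactly where the space was going.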

噩梦成真你也成魔 2024-07-17 18:16:12


Ok so here is the gist of it:

The GROUP BY clause made the indexes useless, so the postmaster (the PostgreSQL server process) decided to create a bunch of temporary files (23 GB of them), located in the directory $PGDATA/base/16384/pgsql_tmp.

When modifying the postgresql.conf file, I had given PostgreSQL permission to use 1.6 GB of RAM (which I will now double, since the machine has 11.7 GB of RAM available); the postmaster process was indeed using the full 1.6 GB, but that wasn't enough, hence the pgsql_tmp directory.
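
For what it's worth, the setting that governs when a sort or hash spills into pgsql_tmp is work_mem, and it can be raised for a single session instead of server-wide. A minimal sketch, with purely illustrative values rather than tuning advice:

-- Per-session override: each sort/hash operation may use up to this
-- much RAM before spilling to temporary files on disk.
SET work_mem = '1GB';

SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);

-- Return to the server-wide default afterwards.
RESET work_mem;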

As was pointed out by Barry Brown, since I was only executing this SQL command to get some statistical information about the distribution of the links among the pagelinks.namespaces, I could have queried a subset of the 296 million pagelinks (this is what they do for surveys).

When the command returned its result set, all the temporary files were automatically deleted, as if nothing had happened.
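
For anyone who would rather watch this happen than discover it from a full disk: EXPLAIN ANALYZE reports when the grouping step spills. A sketch; note that it actually executes the query, and the exact plan wording varies across PostgreSQL versions:

EXPLAIN ANALYZE
SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);

-- A plan node reading something like
--   Sort Method: external merge  Disk: 23456789kB
-- means the sort overflowed work_mem and ran out of
-- $PGDATA/base/<db-oid>/pgsql_tmp, i.e. exactly the files described above.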

Thx for your help guys!
