Is Postgres better than MySQL when you need to add a column to a table with millions of rows?

Posted 2024-10-09 00:39:06 | 342 words | 3 views | 0 comments

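We are having problems with MySQL. When I looked around, I found that a lot of people have the same problem.

I joined a product whose database has some tables with as many as 150 million rows. One example of our problem: one of the tables has more than 30 columns, about half of which are no longer used. When we try to drop or rename a column, MySQL wants to copy the entire table and then rename it. With this much data, that takes many hours, and the site is offline for nearly all of that time. This is only the first of several large migrations to improve the schema. These aren't routine operations; I inherited a lot of cleanup work.

I tried searching to see whether people have the same problem with Postgres, but by comparison I found almost nobody talking about it. Is that because Postgres handles this better, or just because fewer people use Postgres?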

没有心的人 2024-10-16 00:39:06

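In PostgreSQL, adding a new column without a default value to a table is instantaneous, because the new column is only registered in the system catalog rather than actually being added to the rows on disk.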
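One way to check this for yourself is to compare the table's on-disk size before and after the change. The following is a minimal sketch; the table name `big_table` and column name `note` are hypothetical, while `pg_relation_size` and `pg_size_pretty` are standard PostgreSQL functions.

```sql
-- Hypothetical table and column names, for illustration only.
SELECT pg_size_pretty(pg_relation_size('big_table')) AS size_before;

-- No DEFAULT clause: only the system catalog is updated, no rows are rewritten.
ALTER TABLE big_table ADD COLUMN note text;

SELECT pg_size_pretty(pg_relation_size('big_table')) AS size_after;
```

If the change is catalog-only as described, the two sizes should match. The ALTER still needs a brief exclusive lock on the table, so it can queue behind other work that is already holding a lock on it.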

念﹏祤嫣 2024-10-16 00:39:06

When the only tool you know is a hammer, all your problems look like nails. For this problem, PostgreSQL is much, much better at handling these types of changes. And the fact is, it doesn't matter how well you designed your app: you WILL have to change the schema on a live database someday. While MySQL's various engines really are amazing for certain corner cases, none of them help here. PostgreSQL's very close integration between its various layers means that you can have things like transactional DDL, which lets you roll back anything except ALTER / CREATE DATABASE and TABLESPACE commands. Or very, very fast ALTER TABLEs. Or index creation that doesn't block writes. And so on. The trade-off is that this limits PostgreSQL to the things it does well (traditional transactional database load handling is a strong point) and leaves it weaker at the things MySQL often fills the gaps on, like live networked clustered storage with the ndb engine.
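To make the transactional DDL and index points concrete, here is a small sketch. It reuses the `mytable`, `test`, and `pkid` names that appear elsewhere in this answer; the index name and the renamed column are made up for illustration, and `CREATE INDEX CONCURRENTLY` is presumably the feature meant by index creation that doesn't get in the way.

```sql
-- Hypothetical index and renamed-column names, for illustration only.
BEGIN;
ALTER TABLE mytable ADD COLUMN test text;
ALTER TABLE mytable RENAME COLUMN test TO test_renamed;
-- Changed our mind: both schema changes are undone together.
ROLLBACK;

-- Build an index without blocking writes to the table.
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY mytable_pkid_idx ON mytable (pkid);
```

After the ROLLBACK the table has neither the new column nor the rename, which is what makes it safe to rehearse a multi-step migration inside a transaction before committing it.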

In this case, none of the different engines in MySQL lets you solve this problem easily. The very versatility of multiple storage engines means that the lexer / parser / top layer of the DB cannot be as tightly integrated with the storage engines, and therefore a lot of the cool things pgsql can do here, mysql can't.

I've got a 118 GB table in my stats db. It has 1.1 billion rows in it. It really should be partitioned, but it isn't read a whole lot, and when it is we can wait on it. At 300 MB/sec (the speed the array it's on can read), it takes approximately 118 × ~3.3 seconds to read, or about six and a half minutes. This machine has 32 GB of RAM, so it cannot hold the table in memory.

When I ran the simple statement on this table:

alter table mytable add test text;

it hung waiting for a vacuum. I cancelled the vacuum with select pg_cancel_backend(12345) (with the vacuum's pid in place of 12345) and the alter finished immediately. A vacuum on this table takes a long time to run, btw. Normally that's not a big deal, but when making changes to table structure, you have to wait on vacuums, or kill them.
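For anyone wondering how to find that pid, the following sketch shows one way to do it. It assumes PostgreSQL 9.2 or later for the `pid` and `query` columns of `pg_stat_activity`, and 9.3 or later for `lock_timeout`; the value 12345 is the same placeholder used above.

```sql
-- Find backends that are currently vacuuming (autovacuum workers show up
-- with a query text starting with "autovacuum:").
SELECT pid, state, query
FROM pg_stat_activity
WHERE query ILIKE '%vacuum%'
  AND pid <> pg_backend_pid();

-- Cancel the one holding things up; 12345 stands in for the pid found above.
SELECT pg_cancel_backend(12345);

-- Optionally, make the ALTER give up instead of queueing indefinitely.
SET lock_timeout = '5s';
ALTER TABLE mytable ADD test text;
```

Setting a lock timeout is a design choice rather than a requirement: while the ALTER waits for its lock, other queries on the table queue up behind it, so failing fast and retrying can be gentler on a live site.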

Dropping a column is just as simple and fast.
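The drop is fast for the same reason the add is: the column is marked as dropped in the system catalog and the existing rows are left alone. A minimal sketch, reusing the `mytable` and `test` names from above, that shows the catalog entry afterwards:

```sql
ALTER TABLE mytable DROP COLUMN test;

-- The column is only marked as dropped; its entry stays in pg_attribute
-- with attisdropped = true and a placeholder name.
SELECT attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'mytable'::regclass
  AND attnum > 0;
```

The space the column occupied is not returned right away; it is reclaimed gradually as rows are rewritten by later updates.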

Now we come to the problem with postgresql, and that is the in-heap MVCC storage. If you add that column and then run update mytable set test='abc', it writes a new version of every row and roughly doubles the size of the table. The exception is when HOT can place the new row versions on the same pages, but for that you need a table with a 50% fill factor, which is double-sized to begin with. The only way to get the space back is to either wait and let vacuum reclaim it over time and reuse it one update at a time, or to run cluster or vacuum full to shrink it back down.
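The options in the paragraph above can be sketched as follows. These are alternatives rather than a script to run top to bottom, and `mytable` is the same example table as before.

```sql
-- Check how large the table and its indexes currently are.
SELECT pg_size_pretty(pg_total_relation_size('mytable'));

-- Alternative A: plain VACUUM marks dead row versions as reusable space
-- inside the table, but usually does not shrink the file on disk.
VACUUM mytable;

-- Alternative B: rewrite the table compactly. This holds an exclusive lock
-- for the duration and needs free disk space for a second copy.
VACUUM FULL mytable;

-- Alternative C (planning ahead): leave half of each page free so updates
-- can stay on the same page. Only pages written after this change are
-- affected, and the table ends up about twice as large.
ALTER TABLE mytable SET (fillfactor = 50);
```

On a table this size, Alternative B is itself a long blocking operation, which is why the batched approach described next tends to be more practical on a live system.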

You can get around this by running the update on one part of the table at a time (update mytable set test='abc' where pkid between 1 and 10000000; ...) and running vacuum between each run to reclaim the space.
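Spelled out, the batched approach might look like the sketch below. It assumes `pkid` is an indexed integer key and uses batch boundaries of 10,000,000 as in the example above; the right batch size depends on the table and hardware.

```sql
-- Run in autocommit mode: VACUUM cannot run inside a transaction block.
UPDATE mytable SET test = 'abc' WHERE pkid BETWEEN 1 AND 10000000;
VACUUM mytable;  -- makes the dead row versions from batch 1 reusable

UPDATE mytable SET test = 'abc' WHERE pkid BETWEEN 10000001 AND 20000000;
VACUUM mytable;  -- batch 3 can now reuse space freed from batches 1 and 2

-- ...continue in steps of 10,000,000 until the highest pkid is covered.
```

Because each batch can reuse the space freed by the previous one, the table should grow by roughly one batch's worth of rows instead of doubling.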

So, both systems have warts and bumps to deal with.
