Is Postgres better than MySQL when you need to add a column to a table with millions of rows?
We're having problems with MySQL. When I search around, I see many people having the same problem.
I have joined a product whose database has some tables with as many as 150 million rows. As one example of our problem, one of these tables has over 30 columns, and about half of them are no longer used. When we try to remove or rename columns, MySQL wants to copy the entire table and then rename it. With this amount of data, that would take many hours, and the site would be offline pretty much the whole time. This is just the first of several large migrations to improve the schema. These aren't intended as a regular thing, just a lot of cleanup I inherited.
I tried searching to see whether people have the same problem with Postgres, and in comparison I find almost nothing about this issue. Is this because Postgres is a lot better at it, or just that fewer people are using Postgres?
Comments (3)
In PostgreSQL, adding a new column without a default value to a table is instantaneous, because the new column is only registered in the system catalog, not actually written to the rows on disk.
When the only tool you know is a hammer, all your problems look like nails. For this problem, PostgreSQL is much, much better at handling these types of changes. And the fact is, no matter how well you designed your app, you WILL have to change the schema on a live database someday. While MySQL's various engines really are amazing for certain corner cases, none of them help here. PostgreSQL's very close integration between its various layers means that you can have things like transactional DDL, which lets you roll back anything that isn't an alter / create database / tablespace. Or very, very fast alter tables. Or non-blocking index creation. And so on. That integration also limits PostgreSQL to the things it does well (traditional transactional db load handling is a strong point), and leaves it not so great at the things where MySQL often fills the gaps, like live networked clustered storage with the ndb engine.
In this case, none of the different engines in MySQL lets you solve this problem easily. The very versatility of multiple storage engines means that the lexer / parser / top layer of the DB cannot be as tightly integrated with the storage engines, and therefore a lot of the cool things pgsql can do here, mysql can't.
I've got a 118 GB table in my stats db. It has 1.1 billion rows in it. It really should be partitioned, but it's not read a whole lot, and when it is, we can wait on it. At 300 MB/sec (the speed the array it's on can read), it takes approximately 118 × ~3.3 seconds to read, or about six and a half minutes. This machine has 32 GB of RAM, so it cannot hold the table in memory.
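As a rough check on that read-time estimate, here is a minimal sketch of the arithmetic. It assumes decimal units (1 GB = 1000 MB) and a perfectly sequential read at the array's full speed; the function name is made up for illustration.

```python
def scan_seconds(table_gb: float, read_mb_per_sec: float) -> float:
    """Seconds to read a table sequentially, assuming 1 GB = 1000 MB."""
    return table_gb * 1000 / read_mb_per_sec


# Figures from the post: a 118 GB table on an array reading 300 MB/sec.
seconds = scan_seconds(118, 300)
print(f"{seconds:.0f} s, about {seconds / 60:.1f} min")
```

Either way, a full pass over the table is a matter of minutes, which is why any operation that has to rewrite every row is worth avoiding.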
When I ran the simple statement on this table:
alter table mytable add test text;
it hung waiting for a vacuum. I killed the vacuum (select pg_cancel_backend(12345), with the vacuum's pid in there) and the alter finished immediately. A vacuum on this table takes a long time to run, btw. Normally that's not a big deal, but when making changes to table structure, you have to wait on vacuums or kill them.
Dropping a column is just as simple and fast.
Now we come to the problem with PostgreSQL, and that is the in-heap MVCC storage. If you add that column and then run update mytable set test='abc', it writes a new version of every row and roughly doubles the size of the table. That is, unless HOT can keep the new row versions in the same pages, but for that you need a table with a 50% fill factor, which is double-sized to begin with. The only way to get the space back is to either wait and let vacuum reclaim it over time, so it gets reused one update at a time, or run cluster or vacuum full to shrink the table back down.
You can get around this by running the update on one part of the table at a time (update ... where pkid between 1 and 10000000; ...) and running vacuum between runs to reclaim the space.
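The batching idea could be scripted along these lines. This is only a sketch: the helper name and its parameters are hypothetical, it assumes pkid is an integer key that starts near 1, and it only generates the SQL text rather than talking to a database. Table and column names are pasted straight into the statements, so it is only suitable for trusted, hard-coded names.

```python
from typing import Iterator


def batched_update_statements(
    table: str,
    set_clause: str,
    key_column: str,
    max_key: int,
    batch_size: int,
) -> Iterator[str]:
    """Yield UPDATE statements over consecutive key ranges, each followed by a VACUUM."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    low = 1
    while low <= max_key:
        high = min(low + batch_size - 1, max_key)
        yield (
            f"UPDATE {table} SET {set_clause} "
            f"WHERE {key_column} BETWEEN {low} AND {high};"
        )
        # VACUUM cannot run inside a transaction block, so each statement
        # is meant to be executed on its own (autocommit).
        yield f"VACUUM {table};"
        low = high + 1


# Example values: 25 million keys in batches of 10 million (illustrative only).
for statement in batched_update_statements(
    "mytable", "test = 'abc'", "pkid", 25_000_000, 10_000_000
):
    print(statement)
```

Smaller batches keep the amount of dead row versions lower between vacuums, at the cost of more vacuum passes over the table.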
So, both systems have warts and bumps to deal with.
Maybe because this should not be a regular occurrence.
Perhaps, reading between the lines, you need to be adding a row to another table instead of adding columns to a large existing table?