在 PostgreSQL 中运行 N 个独立列更新的最佳方法是什么？ SQL 规范中最好的方法是什么？

发布于 2024-08-20 03:42:45 字数 842 浏览 5 评论 0原文

我正在寻找一种更有效的方法来在同一个表上运行许多列更新，如下所示：

UPDATE TABLE table
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE regexp_match( col, 'foo' );

这样 foo 和 bar 将是 40 个不同正则表达式的组合- 替换。我怀疑甚至 25% 的数据集都需要更新，但我想知道的是可以在 SQL 中干净地实现以下目标。

单遍更新正则
表达式的单次匹配，触发单次替换
如果只有一个匹配，则不运行所有可能的 regexp_replaces
如果只有一个需要更新，则更新所有列
>如果没有列发生更改，则不更新行

我也很好奇，我知道在 MySQL 中（请耐心等待）

UPDATE foo SET bar = 'baz'

有一个隐式的 WHERE bar != 'baz' 子句

但是，在PostgreSQL 我知道这不存在：如果我知道如何在目标列未更新的情况下跳过单行的更新，我想我至少可以回答我的一个问题。

像这样的东西

UPDATE TABLE table
SET col = *temp_var* = regexp_replace( col, 'foo', 'bar' )
WHERE col != *temp_var*

原文

I'm looking for a more efficient way to run many columns updates on the same table like this:

UPDATE TABLE table
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE regexp_match( col, 'foo' );

Such that foo, and bar, will be a combination of 40 different regex-replaces. I doubt even 25% of the dataset needs to be updated at all, but what I'm wanting to know is it is possible to cleanly achieve the following in SQL.

A single pass update
A single match of the regex, triggers a single replace
Not running all possible regexp_replaces if only one matches
Not updating all columns if only one needs the update
Not updating a row if no column has changed

I'm also curious, I know in MySQL (bear with me)

UPDATE foo SET bar = 'baz'

Has an implicit WHERE bar != 'baz' clause

However, in PostgreSQL I know this doesn't exist: I think I could at least answer one of my questions if I knew how to skip a single row's update if the target columns weren't updated.

Something like

UPDATE TABLE table
SET col = *temp_var* = regexp_replace( col, 'foo', 'bar' )
WHERE col != *temp_var*

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

深海夜未眠 2024-08-27 03:42:45

用代码来做。打开一个游标，然后：抓取一行，通过 40 个正则表达式运行它，如果发生更改，则将其保存回来。重复此操作，直到光标不再显示任何行。

无论你这样做还是想出神奇的 SQL 表达式，它仍然是对整个表进行行扫描，但代码会简单得多。

实验结果

为了回应批评，我进行了一个实验。我将文档文件中的 10,000 行插入到具有串行主键和 varchar 列的表中。然后我测试了两种更新方法。方法 1：

in a transaction:
  opened up a cursor (select for update)
  while reading 100 rows from the cursor returns any rows:
    for each row:
      for each regular expression:
        do the gsub on the text column
      update the row

使用本地连接的数据库需要 1.16 秒。

然后是“大替换”，一个大型正则表达式更新：

更新 foo 设置 t =
regexp_replace（regexp_replace（regexp_replace（regexp_replace）regexp_replace（regexp_replace（regexp_replace（regexp_replace）regexp_replace（regexp_replace（regexp_replace（regexp_replace）regexp_replace（regexp_replace（regexp_replace）regexp_replace（regexp_replace（regexp_replace（regexp_替换（regexp_replace（regexp_replace（regexp_replace（regexp_replace（ regexp_replace（regexp_replace（regexp_replace（regexp_replace）regexp_replace（regexp_replace（regexp_replace（regexp_replace）regexp_replace（regexp_replace（regexp_replace）regexp_replace（regexp_replace（regexp_replace（regexp_replace（regexp_replace（t，
E'\bcommit\b', E'COMMIT'),
E'\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b',
E'9ACF10762B5F3D3B1B33EA07792A936A25E45010'),
E'\b作者：\b'，E'作者：'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'值得'), E'\b\b',
E''), E'\b日期:\b',
E'日期:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:53:13\b', E'04:53:13'),
E'\b2009\b'、E'2009')、E'\b-0700\b'、
E'-0700'), E'\bUpdate\b', E'更新'),
E'\bversion\b', E'版本'),
E'\bto\b', E'TO'), E'\b2.9.1\b',
E'2.9.1'), E'\bcommit\b', E'COMMIT'),
E'\b61c89e56f361fa860f18985137d6bf53f48c16ac\b',
E'61C89E56F361FA860F18985137D6BF53F48C16AC'),
E'\b作者：\b'，E'作者：'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'值得'), E'\b\b',
E''), E'\b日期:\b',
E'日期:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:51:58\b', E'04:51:58'),
E'\b2009\b'、E'2009')、E'\b-0700\b'、
E'-0700'), E'\b新闻:\b', E'新闻:'),
E'\bAdd\b', E'ADD'), E'\bnotes\b',
E'注释'), E'\bfor\b', E'FOR'),
E'\bthe\b', E'THE'), E'\b2.9.1\b',
E'2.9.1'), E'\brelease.\b',
E'发布。'), E'\b谢谢\b',
E'谢谢'), E'\bto\b', E'TO'),
E'\beveryone\b', E'所有人'),
E'\bfor\b', E'FOR')

大型正则表达式更新需要 0.94 秒来更新。

与 1.16 秒相比，大型正则表达式更新速度为 0.94 秒，确实更快，在代码中运行的时间为 81%。但它不是，但是快得多。你们大神们，看看那个更新声明。你想写这个，或者当 Postgres 抱怨你在某处掉了括号时尝试找出问题所在吗？

代码

使用的代码是：

  def stupid_regex_replace
    sql = Select.new
    sql.select('id')
    sql.select('t')
    sql.for_update
    sql.from(TABLE_NAME)
    Cursor.new('foo', sql, {}, @db) do |cursor|
      until (rows = cursor.fetch(100)).empty?
        for row in rows
          for regex, replacement in regexes
            row['t'] = row['t'].gsub(regex, replacement)
          end
        end
        sql = Update.new(TABLE_NAME, @db)
        sql.set('t', row['t'])
        sql.where(['id = %s', row['id']])
        sql.exec
      end
    end
  end

我通过从文件中获取单词动态生成正则表达式；对于每个单词“foo”，其正则表达式为“\bfoo\b”，其替换字符串为“FOO”（单词大写）。我使用文件中的文字来确保替换确实发生。我让测试程序吐出正则表达式，以便您可以看到它们。每一对都是一个正则表达式和相应的替换字符串：

[[/\bcommit\b/, "COMMIT"],
 [/\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b/,
  "9ACF10762B5F3D3B1B33EA07792A936A25E45010"],
 [/\bAuthor:\b/, "AUTHOR:"],
 [/\bCarl\b/, "CARL"],
 [/\bWorth\b/, "WORTH"],
 [/\b<[email protected]>\b/, "<[email protected]>"],
 [/\bDate:\b/, "DATE:"],
 [/\bMon\b/, "MON"],
 [/\bOct\b/, "OCT"],
 [/\b26\b/, "26"],
 [/\b04:53:13\b/, "04:53:13"],
 [/\b2009\b/, "2009"],
 [/\b-0700\b/, "-0700"],
 [/\bUpdate\b/, "UPDATE"],
 [/\bversion\b/, "VERSION"],
 [/\bto\b/, "TO"],
 [/\b2.9.1\b/, "2.9.1"],
 [/\bcommit\b/, "COMMIT"],
 [/\b61c89e56f361fa860f18985137d6bf53f48c16ac\b/,
  "61C89E56F361FA860F18985137D6BF53F48C16AC"],
 [/\bAuthor:\b/, "AUTHOR:"],
 [/\bCarl\b/, "CARL"],
 [/\bWorth\b/, "WORTH"],
 [/\b<[email protected]>\b/, "<[email protected]>"],
 [/\bDate:\b/, "DATE:"],
 [/\bMon\b/, "MON"],
 [/\bOct\b/, "OCT"],
 [/\b26\b/, "26"],
 [/\b04:51:58\b/, "04:51:58"],
 [/\b2009\b/, "2009"],
 [/\b-0700\b/, "-0700"],
 [/\bNEWS:\b/, "NEWS:"],
 [/\bAdd\b/, "ADD"],
 [/\bnotes\b/, "NOTES"],
 [/\bfor\b/, "FOR"],
 [/\bthe\b/, "THE"],
 [/\b2.9.1\b/, "2.9.1"],
 [/\brelease.\b/, "RELEASE."],
 [/\bThanks\b/, "THANKS"],
 [/\bto\b/, "TO"],
 [/\beveryone\b/, "EVERYONE"],
 [/\bfor\b/, "FOR"]]

如果这是手动生成的正则表达式列表，而不是自动生成，我的问题仍然合适：您宁愿创建或维护哪个？

Do it in code. Open up a cursor, then: grab a row, run it through the 40 regular expressions, and if it changed, save it back. Repeat until the cursor doesn't give you any more rows.

Whether you do it that way or come up with the magical SQL expression, it's still going to be a row scan of the entire table, but the code will be much simpler.

Experimental Results

In response to criticism, I ran an experiment. I inserted 10,000 lines from a documentation file into a table with a serial primary key and a varchar column. Then I tested two ways to do the update. Method 1:

in a transaction:
  opened up a cursor (select for update)
  while reading 100 rows from the cursor returns any rows:
    for each row:
      for each regular expression:
        do the gsub on the text column
      update the row

This takes 1.16 seconds with a locally connected database.

Then the "big replace," a single mega-regex update:

update foo set t =
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(t,
E'\bcommit\b', E'COMMIT'),
E'\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b',
E'9ACF10762B5F3D3B1B33EA07792A936A25E45010'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b\b',
E''), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:53:13\b', E'04:53:13'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bUpdate\b', E'UPDATE'),
E'\bversion\b', E'VERSION'),
E'\bto\b', E'TO'), E'\b2.9.1\b',
E'2.9.1'), E'\bcommit\b', E'COMMIT'),
E'\b61c89e56f361fa860f18985137d6bf53f48c16ac\b',
E'61C89E56F361FA860F18985137D6BF53F48C16AC'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b\b',
E''), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:51:58\b', E'04:51:58'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bNEWS:\b', E'NEWS:'),
E'\bAdd\b', E'ADD'), E'\bnotes\b',
E'NOTES'), E'\bfor\b', E'FOR'),
E'\bthe\b', E'THE'), E'\b2.9.1\b',
E'2.9.1'), E'\brelease.\b',
E'RELEASE.'), E'\bThanks\b',
E'THANKS'), E'\bto\b', E'TO'),
E'\beveryone\b', E'EVERYONE'),
E'\bfor\b', E'FOR')

The mega-regex update takes 0.94 seconds to update.

At 0.94 seconds compared to 1.16, it's true that the mega-regex update is faster, running in 81% of the time of doing it in code. It is not, however a lot faster. And ye Gods, look at that update statement. Do you want to write that, or try to figure out what went wrong when Postgres complains that you dropped a parenthesis somewhere?

Code

The code used was:

  def stupid_regex_replace
    sql = Select.new
    sql.select('id')
    sql.select('t')
    sql.for_update
    sql.from(TABLE_NAME)
    Cursor.new('foo', sql, {}, @db) do |cursor|
      until (rows = cursor.fetch(100)).empty?
        for row in rows
          for regex, replacement in regexes
            row['t'] = row['t'].gsub(regex, replacement)
          end
        end
        sql = Update.new(TABLE_NAME, @db)
        sql.set('t', row['t'])
        sql.where(['id = %s', row['id']])
        sql.exec
      end
    end
  end

I generated the regular expressions dynamically by taking words from the file; for each word "foo", its regular expression was "\bfoo\b" and its replacement string was "FOO" (the word uppercased). I used words from the file to make sure that replacements did happen. I made the test program spit out the regex's so you can see them. Each pair is a regex and the corresponding replacement string:

[[/\bcommit\b/, "COMMIT"],
 [/\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b/,
  "9ACF10762B5F3D3B1B33EA07792A936A25E45010"],
 [/\bAuthor:\b/, "AUTHOR:"],
 [/\bCarl\b/, "CARL"],
 [/\bWorth\b/, "WORTH"],
 [/\b<[email protected]>\b/, "<[email protected]>"],
 [/\bDate:\b/, "DATE:"],
 [/\bMon\b/, "MON"],
 [/\bOct\b/, "OCT"],
 [/\b26\b/, "26"],
 [/\b04:53:13\b/, "04:53:13"],
 [/\b2009\b/, "2009"],
 [/\b-0700\b/, "-0700"],
 [/\bUpdate\b/, "UPDATE"],
 [/\bversion\b/, "VERSION"],
 [/\bto\b/, "TO"],
 [/\b2.9.1\b/, "2.9.1"],
 [/\bcommit\b/, "COMMIT"],
 [/\b61c89e56f361fa860f18985137d6bf53f48c16ac\b/,
  "61C89E56F361FA860F18985137D6BF53F48C16AC"],
 [/\bAuthor:\b/, "AUTHOR:"],
 [/\bCarl\b/, "CARL"],
 [/\bWorth\b/, "WORTH"],
 [/\b<[email protected]>\b/, "<[email protected]>"],
 [/\bDate:\b/, "DATE:"],
 [/\bMon\b/, "MON"],
 [/\bOct\b/, "OCT"],
 [/\b26\b/, "26"],
 [/\b04:51:58\b/, "04:51:58"],
 [/\b2009\b/, "2009"],
 [/\b-0700\b/, "-0700"],
 [/\bNEWS:\b/, "NEWS:"],
 [/\bAdd\b/, "ADD"],
 [/\bnotes\b/, "NOTES"],
 [/\bfor\b/, "FOR"],
 [/\bthe\b/, "THE"],
 [/\b2.9.1\b/, "2.9.1"],
 [/\brelease.\b/, "RELEASE."],
 [/\bThanks\b/, "THANKS"],
 [/\bto\b/, "TO"],
 [/\beveryone\b/, "EVERYONE"],
 [/\bfor\b/, "FOR"]]

If this were a hand-generated list of regex's, and not automatically generated, my question is still appropriate: Which would you rather have to create or maintain?

回复收藏 0 原文