Group items created within a certain time of each other


I have a bunch of products (500k or so) in a database that were created over the last several years and I'd like to group them together (Rails 2.3.14)

Ideally, they would be considered the same group if:

  1. They were created by the same company_id
  2. They were created within 10 minutes of each other

A rough pass at what I'm trying to accomplish:

def self.package_products
  Company.all.each do |company|
    package = Package.new
    # placeholder condition -- "created around similar times" is the part I can't express
    products = Product.find(:all, :conditions => ["company_id = ? AND created_around_similar_times", company.id])
    package.contents = first_few_product_descriptions   # placeholder
    package.save!
    Product.update_all({ :package_id => package.id }, { :id => products.map(&:id) })
  end
end

To me it smells bad though. I don't like looping through the companies and can't help but think there's a better way to do it. Does anyone have any sql-fu that can group similar items? Basically looking to find products from the same company that were created within 10 minutes of each other and assign them the same package_id.


1 Comment

哎呦我呸! 2024-12-14 05:53:25

This is hard to do in pure SQL. I would resort to a plpgsql procedure.
Say, your table looks like this:
(Next time, be so nice as to post a table definition. Worth more than a thousand words.)

create table p (
  id serial primary key     -- or whatever your primary key is!
, company_id int4 NOT NULL
, create_time timestamp NOT NULL
, for_sale bool NOT NULL
);
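
To make the chained 10-minute rule concrete, here are a few hypothetical sample rows (not from the original post). Each company-1 row falls within 10 minutes of the one before it, so the chain keeps the group going even though the first and last rows are 16 minutes apart; the company-2 row gets a group of its own:

-- Hypothetical sample data, for illustration only.
INSERT INTO p (company_id, create_time, for_sale) VALUES
  (1, '2011-01-01 12:00', true)   -- group 1
, (1, '2011-01-01 12:08', true)   -- group 1 (8 min after the previous row)
, (1, '2011-01-01 12:16', true)   -- group 1 (8 min gap again, 16 min after the first row)
, (2, '2011-01-01 12:05', true);  -- different company -> its own group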

Use a plpgsql function like this:

CREATE OR REPLACE FUNCTION f_p_group()
  RETURNS void AS
$BODY$
DECLARE
    g_id             integer := 1;
    last_time        timestamp;
    last_company_id  integer;
    r                p%ROWTYPE;
BEGIN

-- If the table is huge, special settings for these parameters will help
SET temp_buffers = '100MB';   -- more RAM for temp table, adjust to actual size of p
SET work_mem = '100MB';       -- more RAM for sorting

-- create temp table just like original.
CREATE TEMP TABLE tmp_p ON COMMIT DROP AS
SELECT * FROM p LIMIT 0;      -- no rows yet

-- add group_id.
ALTER TABLE tmp_p ADD column group_id integer;

-- loop through table, write row + group_id to temp table
FOR r IN
    SELECT *                  -- get the whole row!
      FROM p
--   WHERE for_sale       -- commented out, after it vanished from the question
     ORDER BY company_id, create_time -- group by company_id first, there could be several groups intertwined

LOOP
    IF r.company_id <> last_company_id OR (r.create_time - last_time) > interval '10 min' THEN
        g_id := g_id + 1;
    END IF;

    INSERT INTO tmp_p SELECT r.*, g_id;

    last_time       := r.create_time;
    last_company_id := r.company_id;
END LOOP;

TRUNCATE p;
ALTER TABLE p ADD column group_id integer; -- add group_id now

INSERT INTO p
SELECT * FROM tmp_p;          -- ORDER BY something?

ANALYZE p;                    -- table has been rewritten, no VACUUM is needed.

END;
$BODY$
  LANGUAGE plpgsql;

Call once, then discard:

SELECT f_p_group();

DROP FUNCTION f_p_group();

Now, all members of a group as per your definition share a group_id.
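
Back on the Rails side, one way to consume the new group_id column and end up with package_id set could look roughly like this. It is only a sketch against Rails 2.3, not part of the original answer; it assumes the group_id column produced above, the Package model from the question, and a description attribute on Product standing in for the "first few product descriptions":

# Sketch only (Rails 2.3): turn each group_id into a Package.
# Assumes products.group_id was populated by f_p_group() above.
def self.package_products
  Product.all(:select => 'DISTINCT group_id').each do |row|
    products = Product.all(:conditions => { :group_id => row.group_id })
    package = Package.new
    package.contents = products.first(3).map(&:description).join(', ')  # "first few" descriptions -- assumed attribute
    package.save!
    Product.update_all({ :package_id => package.id }, { :group_id => row.group_id })
  end
end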

Edit after question edit

I put in a couple more things:

  • Read the table into a temporary table (ordering it in the process), do all the work there, then truncate the original table, add group_id, and write the updated rows back from the temp table in one go. This should be much faster, and no VACUUM is needed afterwards. But you need some RAM for it.
  • for_sale is ignored in the query, since it's no longer in the question.
  • Read about %ROWTYPE.
  • Read up on work_mem and temp_buffers.
  • TRUNCATE, ANALYZE, TEMP TABLE, ALTER TABLE, ... all in the fine manual
  • I tested it with pg 9.0. It should work in 8.4 - 9.0 and probably older versions too.
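
For what it's worth, on 8.4 and later (where window functions are available) the same chained rule can arguably be expressed in plain SQL as a gaps-and-islands query. This is an untested sketch, not part of the original answer, reusing the column names from the table definition above:

-- Untested sketch: a new group starts when the company changes or the gap
-- to the previous row exceeds 10 minutes; a running sum of those flags
-- produces the group number.
SELECT id, company_id, create_time
     , sum(is_new_group) OVER (ORDER BY company_id, create_time) AS group_id
FROM  (
   SELECT id, company_id, create_time
        , CASE WHEN lag(company_id)   OVER w IS DISTINCT FROM company_id
                 OR create_time - lag(create_time) OVER w > interval '10 min'
               THEN 1 ELSE 0 END AS is_new_group
   FROM   p
   WINDOW w AS (ORDER BY company_id, create_time)
   ) sub
ORDER  BY company_id, create_time;

The result set could then be joined back to p in an UPDATE, or used directly instead of rewriting the table.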