PL/SQL rownum update

Published 2024-10-22 19:06:15


I am working on a database with a couple of tables. They are a

      districts table
      PK district_id

      student_data table
      PK study_id
      FK district_id

      ga_data table
      PK study_id
      district_id

The ga_data table holds data that I am adding in. Both the student_data table and ga_data have 1.3 million records. The study_id values are 1-to-1 between the two tables, but the ga_data.district_id values are NULL and need to be updated. I am having trouble with the following PL/SQL:

update ga_data
set district_id = (select district_id from student_data
where student_data.study_id = ga_data.study_id)
where ga_data.district_id is null and rownum < 100;

I need to do it incrementally, which is why I need rownum. But am I using it correctly? After running the query a bunch of times, it only updated about 8,000 of the 1.3 million records (it should be about 1.1 million updates, since some of the district_ids are null in student_data). Thanks!


3 Answers

焚却相思 2024-10-29 19:06:15


ROWNUM just chops off query after the first n rows. You have some rows in STUDENT_DATA which have a NULL for DISTRICT_ID. So after a number of runs your query is liable to get stuck in a rut, returning the same 100 QA_DATA records, all of which match one of those pesky STUDENT_DATA rows.

So you need some mechanism for ensuring that you are working your way progressively through the QA_DATA table. A flag column would be one solution. Partitioning the query so it hits a different set of STUDENT_IDs is another.
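One concrete way to partition the work, on Oracle 11g and later, is the DBMS_PARALLEL_EXECUTE package, which splits a table into rowid-range chunks and runs the update once per chunk, committing each chunk independently. A sketch only, using the question's table names (the task name, chunk size, and parallel level here are arbitrary choices, not anything from the question):

begin
    dbms_parallel_execute.create_task('fill_district_id');
    dbms_parallel_execute.create_chunks_by_rowid(
        task_name   => 'fill_district_id',
        table_owner => user,
        table_name  => 'GA_DATA',
        by_row      => true,
        chunk_size  => 10000);
    dbms_parallel_execute.run_task(
        task_name      => 'fill_district_id',
        sql_stmt       => 'update ga_data g
                              set g.district_id = (select s.district_id
                                                   from student_data s
                                                   where s.study_id = g.study_id)
                            where g.district_id is null
                              and g.rowid between :start_id and :end_id',
        language_flag  => dbms_sql.native,
        parallel_level => 2);
    dbms_parallel_execute.drop_task('fill_district_id');
end;
/

Because each chunk is committed as it finishes, a failure part-way through does not roll back the work already done.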

It's not clear why you have to do this in batches of 100, but perhaps the easiest way of doing this would be to use BULK PROCESSING (at least in Oracle: this PL/SQL syntax won't work in MySQL).

Here is some test data:

SQL> select district_id, count(*)
  2  from student_data
  3  group by district_id
  4  /

DISTRICT_ID   COUNT(*)
----------- ----------
   7369        192
   7499        190
   7521        192
   7566        190
   7654        192
   7698        191
   7782        191
   7788        191
   7839        191
   7844        192
   7876        191
   7900        192
   7902        191
   7934        192
   8060        190
   8061        193
   8083        190
   8084        193
   8085        190
   8100        193
   8101        190
               183

22 rows selected.

SQL> select district_id, count(*)
  2  from qa_data
  3  group by district_id
  4  /

DISTRICT_ID   COUNT(*)
----------- ----------
                  4200

SQL>

This anonymous block uses the Bulk processing LIMIT clause to batch the result set into chunks of 100 rows.

SQL> declare
  2      type qa_nt is table of qa_data%rowtype;
  3      qa_recs qa_nt;
  4
  5      cursor c_qa is
  6          select qa.student_id
  7                 , s.district_id
  8          from qa_data qa
  9                  join student_data s
 10                      on (s.student_id = qa.student_id);
 11  begin
 12      open c_qa;
 13
 14      loop
 15          fetch c_qa bulk collect into qa_recs limit 100;
 16          exit when qa_recs.count() = 0;
 17
 18          for i in qa_recs.first()..qa_recs.last()
 19          loop
 20              update qa_data qt
 21                  set qt.district_id = qa_recs(i).district_id
 22                  where qt.student_id = qa_recs(i).student_id;
 23          end loop;
 24
 25      end loop;
 26  end;
 27  /

PL/SQL procedure successfully completed.

SQL>

Note that this construct allows us to do additional processing on the selected rows before issuing the update. This is handy if we need to apply complicated fixes programmatically.
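Conversely, if no row-by-row processing is actually needed, the inner FOR loop above can be replaced with a FORALL bulk bind, which issues a single update per fetched batch instead of one per row. A sketch under the same assumptions as the block above (note that referencing record fields such as qa_recs(i).district_id inside FORALL requires Oracle 11g or later):

        forall i in qa_recs.first()..qa_recs.last()
            update qa_data qt
                set qt.district_id = qa_recs(i).district_id
                where qt.student_id = qa_recs(i).student_id;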

As you can see, the data in QA_DATA now matches that in STUDENT_DATA

SQL> select district_id, count(*)
  2  from qa_data
  3  group by district_id
  4  /

DISTRICT_ID   COUNT(*)
----------- ----------
   7369        192
   7499        190
   7521        192
   7566        190
   7654        192
   7698        191
   7782        191
   7788        191
   7839        191
   7844        192
   7876        191
   7900        192
   7902        191
   7934        192
   8060        190
   8061        193
   8083        190
   8084        193
   8085        190
   8100        193
   8101        190
               183

22 rows selected.

SQL>
痴梦一场 2024-10-29 19:06:15


It is kind of an odd requirement to only update 100 rows at a time. Why is that?

Anyway, since district_id in student_data can be null, you might be updating the same 100 rows over and over again.

If you extend your query to make sure a non-null district_id exists, you might end up where you want to be:

update ga_data
set district_id = (
  select district_id 
  from student_data
  where student_data.study_id = ga_data.study_id
)
where ga_data.district_id is null 
and exists (
  select 1
  from student_data
  where student_data.study_id = ga_data.study_id
  and district_id is not null
)
and rownum < 100;
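If the work really must proceed in small batches, this corrected statement can also be wrapped in a PL/SQL loop that commits each pass and exits once nothing is left to update. A sketch (note that rownum < 100 caps each pass at 99 rows; rownum <= 100 gives exactly 100):

begin
    loop
        update ga_data
        set district_id = (
            select district_id
            from student_data
            where student_data.study_id = ga_data.study_id
        )
        where ga_data.district_id is null
        and exists (
            select 1
            from student_data
            where student_data.study_id = ga_data.study_id
            and district_id is not null
        )
        and rownum <= 100;

        exit when sql%rowcount = 0;
        commit;
    end loop;
end;
/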
贪恋 2024-10-29 19:06:15


If this is a one-time conversion you should consider a completely different approach. Recreate the table as the join of your two tables. I promise you will laugh out loud when you realise how fast it is compared to all kinds of funny 100-rows-at-a-time updates.

create table new_table as
   select study_id
         ,s.district_id
         ,g.the_remaining_columns_in_ga_data
    from student_data s
    join ga_data      g using(study_id);

   -- create indexes, constraints etc., then:
   drop table ga_data;
   alter table new_table rename to ga_data;

Or if it isn't a one time conversion or you can't re-create/drop tables or you just feel like spending a few extra hours on data loading:

merge
 into ga_data      g
using student_data s
   on (g.study_id  = s.study_id)
when matched then
   update
      set g.district_id = s.district_id;

The last statement can also be rewritten as an updatable view, but I personally never use them.

Dropping/disabling the indexes/constraints on ga_data.district_id before running the merge and recreating them afterwards will improve performance.
