需要更好的选择 - 外连接 32 次到同一张表

发布于 2024-11-06 20:24:34 字数 1688 浏览 0 评论 0原文

我有一个令人讨厌的 SQL 查询问题,我希望得到一个优雅的解决方案的帮助。我试图避免对同一个表进行 32 个左外连接。

数据库是 Teradata。

我有一个包含 1400 万条记录和 33 列的表。主键(我们称之为 Trans_Id)和 32 个编码字段(我们称之为encoded_1 ...encoded_32)。像这样的事情:

CREATE SET TABLE BigTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_1 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_2 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_3 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ...
      ENCODED_32 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC )
PRIMARY INDEX ( TRANS_ID );

我还有一个包含编码/解码值的表。假设此表中有 100 条记录。

CREATE SET TABLE LookupTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      UNIQ_PK { just random numbers }
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      DECODED_DESC VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( UNIQ_PK );

我想避免像这样的令人讨厌的连接(我使用省略号而不是显示所有 32 个外部连接):

SELECT 
TRANS_ID
, a.ENCODED_1
, b1.DECODED_DESC DECODED_DESC_1
, a.ENCODED_2
, b2.DECODED_DESC DECODED_DESC_2
...
, a.ENCODED_31
, b31.DECODED_DESC DECODED_DESC_31
, a.ENCODED_32
, b32.DECODED_DESC DECODED_DESC_32
FROM BigTable a
LEFT OUTER JOIN LookupTable b1 ON a.ENCODED_1 = b1.ENCODED
LEFT OUTER JOIN LookupTable b2 ON a.ENCODED_2 = b1.ENCODED
...
LEFT OUTER JOIN LookupTable b31 ON a.ENCODED_31 = b31.ENCODED
LEFT OUTER JOIN LookupTable b32 ON a.ENCODED_32 = b32.ENCODED

任何帮助将不胜感激。我有一种感觉,外部连接 14M 记录 32 次并不是执行此操作的有效方法!

I have a nasty SQL query problem and I'd love help with an elegant solution. I'm trying to avoid 32 left outer joins to the same table.

The database is Teradata.

I have a table with 14 million records and 33 columns. The primary key (let's call it Trans_Id), and 32 encoded fields (let's call them encoded_1 ... encoded_32). Something like this:

CREATE SET TABLE BigTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_1 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_2 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_3 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ...
      ENCODED_32 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC )
PRIMARY INDEX ( TRANS_ID );

I also have a single table with the coded / decoded values. Let's say there are 100 records in this table.

CREATE SET TABLE LookupTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      UNIQ_PK { just random numbers }
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      DECODED_DESC VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( UNIQ_PK );

I want to avoid a nasty join like this (I used ellipses instead of showing all 32 outer joins):

SELECT 
TRANS_ID
, a.ENCODED_1
, b1.DECODED_DESC DECODED_DESC_1
, a.ENCODED_2
, b2.DECODED_DESC DECODED_DESC_2
...
, a.ENCODED_31
, b31.DECODED_DESC DECODED_DESC_31
, a.ENCODED_32
, b32.DECODED_DESC DECODED_DESC_32
FROM BigTable a
LEFT OUTER JOIN LookupTable b1 ON a.ENCODED_1 = b1.ENCODED
LEFT OUTER JOIN LookupTable b2 ON a.ENCODED_2 = b1.ENCODED
...
LEFT OUTER JOIN LookupTable b31 ON a.ENCODED_31 = b31.ENCODED
LEFT OUTER JOIN LookupTable b32 ON a.ENCODED_32 = b32.ENCODED

Any help would be appreciated. I have a feeling outer joining 14M records 32 times is not the efficient way to do this!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

总以为 2024-11-13 20:24:34

您可以创建一个函数,将 VARCHAR(10)encoded_var 作为参数并返回 VARCHAR(50)decoded_desc,然后您的选择将如下所示:

SELECT TRANS_ID,
     ENCODED_1, somefunc(ENCODED_1) AS DECODED_DESC_1,
     ENCODED_2, somefunc(ENCODED_2) AS DECODED_DESC_2,
     etc.

根据您计划一次返回的行数,这是可行的。

You could create a function that takes as a parameter the VARCHAR(10) encoded_var and return the VARCHAR(50) decoded_desc, then your select would be something like this:

SELECT TRANS_ID,
     ENCODED_1, somefunc(ENCODED_1) AS DECODED_DESC_1,
     ENCODED_2, somefunc(ENCODED_2) AS DECODED_DESC_2,
     etc.

Depending on the number of rows you are planning on returning at a time, this would be doable.

阪姬 2024-11-13 20:24:34

如果encoded_1、encoded_2等都被用作同一个表的查找键,听起来它们都是“相同的想法”。但我的第一个想法是,在这种情况下更好的设计是:

big_table (trans_id, var_id, encoded_var)
lookup_table (encoded_var, decoded_desc)

然后查询就变成:

select trans_id, var_id, encoded_var, decoded_desc
from big_table
join lookup_table on lookup_table.encoded_var=big_table.encoded_var

我不知道这是否是真正的字段名称,或者您只是想省略不相关的细节。您可能会在这里遗漏相关细节。 encoded_1 和encoded_2 等有什么区别?如果它们可以互换,则没有理由为它们设置单独的字段。确实,它会带来很多问题。即使存在语义差异,如果它们都使用相同的查找表,则它们也必须来自同一域。

例如,几年前,我开发了一个系统来管理我们组织制作和使用的技术手册。每本手册有 3 名管理员。 (负责处理预算和时间表的行政经理,负责跟踪谁需要副本并确保他们获得副本的库存经理,以及负责实际文本的内容经理。)但他们都来自同一份人员名单,通常同一个人会担任多个角色,或者可能针对不同的手册担任不同的角色。因此,我们制作了一个包含 ID、姓名、电子邮件地址等的“人员”表,然后在基本手动记录中我创建了 3 列,每个经理类型对应一列。

这是一个巨大的错误。我应该做的是创建一个单独的表,其中包含手动 id、经理类型 id 和人员 id,然后为 3 种经理类型创建 3 条记录,而不是一条记录中的 3 个字段。

为什么?对于三列,我遇到了您所描述的相同问题,尽管规模较小:我必须从手动表连接到人员表三次。像“鲍勃·史密斯负责哪些书?”这样的查询需要一个令人惊讶的复杂查询,就像使用单个列一样,

select ... whatever ...
from manual
join person p1 on p1.person_id=manual.admin_id
join person p2 on p2.person_id=manual.stockmanager_id
join person p3 on p3.person_id=manual.contentmanager_id
where p1.name='Bob Smith'
 or p2.name='Bob Smith'
 or p3.name='Bob Smith'

它本来就很简单

select ... whatever ...
from manual
join manual_manager on manual_manager.manual_id=manual.manual_id
join person on person.person_id=manual_manager.person_id
where person.name='Bob Smith'"

。随着所有的重复,毫不奇怪的是,有几次程序员不小心只检查了 2 个字段而不是所有 3 个字段。使用 1字段此错误是不可能的。对于 3 个字段,如果我们添加第四种类型的管理器,我们将不得不添加另一列,然后更改查看这些字段的每个查询。如果有 1 个字段,我们可能不会。等等。

对于 3 个字段,我们需要 3 个索引,并且还有其他性能影响。

我怀疑同样的想法也适用于你。

如果您的 32 个字段都是完全可互换的,那么该表只需要一个序列号即可创建唯一的 pk。如果它们之间存在一些差异,那么您可以创建一个代码来区分它们。

If encoded_1, encoded_2, etc are all being used as look up keys to the same table, it sounds like they are all the "same idea". But my first thought is that a better design in this case would be:

big_table (trans_id, var_id, encoded_var)
lookup_table (encoded_var, decoded_desc)

Then the query just becomes:

select trans_id, var_id, encoded_var, decoded_desc
from big_table
join lookup_table on lookup_table.encoded_var=big_table.encoded_var

I don't know if this is the real field name or if you're just trying to leave out irrelevant details. You may be leaving out relevant details here. What's the difference between encoded_1 and encoded_2, etc? If they are interchangeable, there is no reason to have separate fields for them. Indeed, it causes a lot of problems. Even if there is a semantic difference, if they all use the same lookup table, they must all be coming from the same domain.

For example, a few years ago I worked on a system to manage technical manuals that our organization produced and used. Each manual had 3 managers. (An administrative manager who handled budgets and schedules, a stock manager who kept track of who needed copies and made sure they got them, and a content manager responsible for the actual text.) But they were all drawn from the same list of people, as often the same person would have more than one of these roles or could have different roles for different manuals. So we made a table of "people" with an id, name, email address, etc, and then in the basic manual record I created 3 columns, one for each manager type.

This was a huge mistake. What I should have done was create a separate table with manual id, manager type id, and person id, and then have 3 RECORDS for the 3 manager types rather than 3 fields within one record.

Why? With three columns, I ran into the same problem you are describing, though on a smaller scale: I had to join from the manual table to the person table three times. A query like "what books is Bob Smith responsible for?" required a surprising complex query, something like

select ... whatever ...
from manual
join person p1 on p1.person_id=manual.admin_id
join person p2 on p2.person_id=manual.stockmanager_id
join person p3 on p3.person_id=manual.contentmanager_id
where p1.name='Bob Smith'
 or p2.name='Bob Smith'
 or p3.name='Bob Smith'

With a single column it would have been simply

select ... whatever ...
from manual
join manual_manager on manual_manager.manual_id=manual.manual_id
join person on person.person_id=manual_manager.person_id
where person.name='Bob Smith'"

With all the repetition, it was not surprising that there were a couple of times that a programmer accidentally only checked 2 of the fields instead of all 3. With 1 field this error wouldn't be possible. With 3 fields, if we added a 4th type of manager, we would have had to add another column, and then change every query that looked at these fields. With 1 field, we probably wouldn't. Etc.

With 3 fields we needed 3 indexes, and there are other performance implications.

I suspect the same sort of thinking applies to you.

If your 32 fields are all completely interchangeable, then the table would only need a sequence number to make a unique pk. If there is some difference between them, then you could create a code to distinguish them.

猫弦 2024-11-13 20:24:34

首先,我会将 LookUp 表上的 PI 更改为 Encoded_Var。您必须在每个 Encoded_Var 列上重新分配大表才能连接到 LookUp 表。为什么每次都必须重新分配 LookUp 表呢?

您的表设计是否有原因不接近

CREATE SET TABLE BigTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID    VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL
UNIQUE PRIMARY INDEX ( TRANS_ID, ENCODED_VAR );

这将在 trans_id 和encoded_var 之间建立更合适的 1:M 关系。除非有遗漏的相关细节可以解释为什么这行不通。事实上,如果有必要,您可以将此表构建为关系表,并拥有另一个如下所示的表:

    CREATE SET TABLE BigTable2 ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
         (
          TRANS_ID    VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
          OtherData1  VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
          OtherData2  SMALLINT NOT NULL,
          ....,
          OtherDataN  VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL
   UNIQUE PRIMARY INDEX ( TRANS_ID );

希望有所帮助。

I would change the PI on the LookUp table to the Encoded_Var for starters. You already have to redistribute the big table on each of the Encoded_Var columns in order to join to the LookUp table. Why bother having to redistribute the LookUp table each time as well.

Is there a reason why your table design isn't something closer to

CREATE SET TABLE BigTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID    VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL
UNIQUE PRIMARY INDEX ( TRANS_ID, ENCODED_VAR );

This would build a more appropriate 1:M relationship between a trans_id and encoded_var. Unless there are pertinent details that have been left out that would explain why this would not work. In fact, if necessary you could build this table as relationship table and have another table that looks like:

    CREATE SET TABLE BigTable2 ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
         (
          TRANS_ID    VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
          OtherData1  VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
          OtherData2  SMALLINT NOT NULL,
          ....,
          OtherDataN  VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL
   UNIQUE PRIMARY INDEX ( TRANS_ID );

Hope that helps.

放低过去 2024-11-13 20:24:34

如果您不想重复编写相同的查询,我建议将其放在视图中。

为了提高性能,我建议如下:

  • 按照 Rob Paller 的建议,将 LookupTable 上的主索引更改为 ENCODED_VAR。
  • 在 LookupTable 中,添加一条新记录,其中 DECODED_DESC = null 且 ENCODED_VAR = 某个您永远不会使用的值。更新 BigTable 以用此值替换所有 null ENCODED_*。然后,您可以更改查询以使用所有内部联接并获得相同的结果。

If you don't want to repeatedly write the same query, I'd suggest putting it in a view.

For performance, I'd suggest the following:

  • As Rob Paller recommended, change primary index on LookupTable to ENCODED_VAR.
  • In LookupTable, add a new record with DECODED_DESC = null and ENCODED_VAR = some value you will never use. Update BigTable to replace all null ENCODED_* with this value. You can then change your query to use all inner joins and get the same result.
壹場煙雨 2024-11-13 20:24:34

你就不能这样吗:

SELECT 
TRANS_ID
, a.ENCODED_1
, CASE a.ENCODED_1 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_1
, a.ENCODED_2
, CASE a.ENCODED_2 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_2
...
, a.ENCODED_31
, CASE a.ENCODED_31 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_31
, a.ENCODED_32
, CASE a.ENCODED_32 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_32
FROM BigTable a
 LEFT JOIN LookupTable b ON (
   a.ENCODED_1 = b.ENCODED OR
   a.ENCODED_2 = b.ENCODED OR
   ...
   a.ENCODED_31 = b.ENCODED OR
   a.ENCODED_32 = b.ENCODED
 )

我也可能想重写加入条件如下:

...ON b.ENCODED IN (a.ENCODED_1, a.ENCODED_2, ... a.ENCODED_31, a.ENCODED_32)

但我不确定它是否不能比以前的版本慢。实际上,我的猜测是,它确实会变慢,但我仍然会验证这一点。

Couldn't you have it like this:

SELECT 
TRANS_ID
, a.ENCODED_1
, CASE a.ENCODED_1 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_1
, a.ENCODED_2
, CASE a.ENCODED_2 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_2
...
, a.ENCODED_31
, CASE a.ENCODED_31 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_31
, a.ENCODED_32
, CASE a.ENCODED_32 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_32
FROM BigTable a
 LEFT JOIN LookupTable b ON (
   a.ENCODED_1 = b.ENCODED OR
   a.ENCODED_2 = b.ENCODED OR
   ...
   a.ENCODED_31 = b.ENCODED OR
   a.ENCODED_32 = b.ENCODED
 )

?

I might also be tempted to rewrite the joining condition as follows:

...ON b.ENCODED IN (a.ENCODED_1, a.ENCODED_2, ... a.ENCODED_31, a.ENCODED_32)

But I'm not really sure if it can't be slower than the former version. Actually, my guess is, it would indeed turn out slower, but I would still verify that to be sure.

黑寡妇 2024-11-13 20:24:34

我也遇到了同样的问题,也是在 Teradata 上。一位同事告诉我一个优雅的解决方案,使用单个 LEFT OUTER JOIN 和多个 CASE 语句。

但是,您的示例有点令人困惑,因为您要加入的列不存在(“LookupTable”中的“ENCODED”列,我认为应该是“ENCODED_VAR”?)。

SELECT     TRANS_ID    
           , a.ENCODED_1    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_1
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_1    
           , a.ENCODED_2    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_2
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_2    
           ...    
           , a.ENCODED_31    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_31
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_31    
           , a.ENCODED_32    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_32
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_32    
FROM BigTable a
LEFT OUTER JOIN LookupTable b 
    ON a.ENCODED_1 = b.ENCODED_VAR
    OR a.ENCODED_2 = b.ENCODED_VAR
    ...
    OR a.ENCODED_31 = b.ENCODED_VAR
    OR a.ENCODED_32 = b.ENCODED_VAR
GROUP BY a.TRANS_ID 

这确实依赖于 BigTable 中的 ENCODED_n 和 LookupTable 中的 ENCODED_VAR 之间存在 1:1 关系。

另外,顺便说一句,您不应使用随机数作为 Teradata 表中的主索引。虽然这将为您提供良好的表分布,但在进行表查找时它将完全无用。如果您对 PI 使用通用联接字段,则数据库可以直接转到存储数据的 AMP。然而,如果没有这个,DBMS 必须每次都进行全表扫描。只要分布仍然合理,您就可以使用 ENCODED_VAR 作为 PRIMARY INDEX 并看到性能的显着提高。

我希望这能起作用。我认为这适用于你的情况。我尚未验证我的代码是否正确,但它与我自己的解决方案非常相似。

I have encountered the same problem, also on Teradata. A colleague has informed me of an elegant solution for you using a single LEFT OUTER JOIN, and a number of CASE statements.

However, your example is a little confusing, because you are joining on a column the doesn't exist ("ENCODED" column in "LookupTable", which I assume should be "ENCODED_VAR"?).

SELECT     TRANS_ID    
           , a.ENCODED_1    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_1
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_1    
           , a.ENCODED_2    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_2
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_2    
           ...    
           , a.ENCODED_31    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_31
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_31    
           , a.ENCODED_32    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_32
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_32    
FROM BigTable a
LEFT OUTER JOIN LookupTable b 
    ON a.ENCODED_1 = b.ENCODED_VAR
    OR a.ENCODED_2 = b.ENCODED_VAR
    ...
    OR a.ENCODED_31 = b.ENCODED_VAR
    OR a.ENCODED_32 = b.ENCODED_VAR
GROUP BY a.TRANS_ID 

This does rely on there being a 1:1 relationship between ENCODED_n in BigTable and ENCODED_VAR in LookupTable.

Also, as an aside, you shouldn't use a random number as the PRIMARY INDEX in a Teradata table. While this will give you great table distribution, it will be completely useless when doing a table lookup. If you use a commonly joined field for the PI, the database can go directly to the AMP on which the data is stored. However, without this, the DBMS must do a full-table scan every time. You could likely use ENCODED_VAR as your PRIMARY INDEX and see much improved performance, as long as the distribution was still reasonable.

I hope this works. I think this will work in your case. I haven't verified my code is correct, but it's very similar to my own solution.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文