Hive 中的左连接产生特殊结果

发布于 2025-01-14 03:51:53 字数 2694 浏览 1 评论 0原文

我有三个表,我想通过一些公共列将它们一一连接在一起,尽管两个表中都存在数据(在第二个左连接操作中),但最终结果中始终包含大量空值。

这是我运行的代码:

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       t1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP

下面是最终结果。正如你所看到的,最后一个表中的unit_nav、returnrate和value列大部分都是空的 输入图片description here

例如,当fundinnercode = 4082,datestamp = 2010-06-30(图中突出显示)时,unit_nav、returnrate和value都是 无效的。但该行出现在表 t_fund_nav_ss 中,如下所示: 输入图片此处的描述

更奇怪的是,当我使用 where 子句专门定位最终结果中的行时,三个缺失的列似乎包含数据

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       t1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
  AND T2.DATESTAMP = '2010-06-30'

在此处输入图像描述

我无法理解它,非常感谢任何帮助或建议。

I have three tables and I want to join them together one by one by some common columns and the end results always have massive amount of null values in them although the data is present (in the second left join operation) in both tables.

Here is the code I ran:

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       t1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP

Below is the end results. As you can see, column unit_nav, returnrate and value which are from the last table are mostly empty
enter image description here

For example, when fundinnercode = 4082, datestamp = 2010-06-30 (highlighted in the picture), unit_nav, returnrate and value are all null. but this row is present in the table t_fund_nav_ss as shown below:
enter image description here

what is even weirder is when I use the where clause to specifically locate the row in the end results, the three missing columns appear to have data in them

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       t1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
  AND T2.DATESTAMP = '2010-06-30'

enter image description here

I could not wrap my head around it and any help or suggestions are greatly appreciated.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

夜司空 2025-01-21 03:51:53

感谢大家的评论。今天偶然遇到一位大数据工程师,向他询问了这个问题。对于那些好奇的人来说,问题出在 T3 表的大小上。 T1 和 T2 大约有 20-30 k 行,而 T3 有 3000 万行。并且左连接操作会导致数据丢失。所以我首先将 T3 过滤到大约 200 万行,现在结果看起来正常。

Thanks to everybody for commenting. I stumbled across a big data engineer today and asked him about the issue. For those who are curious, it turns out that the issue is about the size of T3 table. T1 and T2 have around 20-30 k rows whereas T3 has 30 mil. And the left join operation causes data loss. So I first filter T3 to around 2 mil rows and results look normal now.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文