Hive 中的左连接产生特殊结果
我有三个表,我想通过一些公共列将它们一一连接在一起,尽管两个表中都存在数据(在第二个左连接操作中),但最终结果中始终包含大量空值。
这是我运行的代码:
SELECT T1.INNERCODE AS CODE,
T2.DATESTAMP,
T1.REPORTDATE,
t1.FUNDINNERCODE,
T1.MARKETVALUE,
T1.SHARESHOLDING,
T1.RATIOINNV,
T3.UNIT_NAV,
T3.VALUE,
T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
CLOSEST_DATETIME_BEFORE AS DATESTAMP
FROM APP_RQA.T_CALENDAR_SS
WHERE EXCHMARKET = '83') T2
ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
DATESTAMP,
UNIT_NAV,
VALUE,
RETURNRATE
FROM APP_RQA.T_FUND_NAV_SS) T3
ON T1.FUNDINNERCODE = T3.CODE
AND T2.DATESTAMP = T3.DATESTAMP
下面是最终结果。正如你所看到的,最后一个表中的unit_nav、returnrate和value列大部分都是空的
例如,当fundinnercode = 4082,datestamp = 2010-06-30(图中突出显示)时,unit_nav、returnrate和value都是 无效的。但该行出现在表 t_fund_nav_ss 中,如下所示:
更奇怪的是,当我使用 where 子句专门定位最终结果中的行时,三个缺失的列似乎包含数据
SELECT T1.INNERCODE AS CODE,
T2.DATESTAMP,
T1.REPORTDATE,
t1.FUNDINNERCODE,
T1.MARKETVALUE,
T1.SHARESHOLDING,
T1.RATIOINNV,
T3.UNIT_NAV,
T3.VALUE,
T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
CLOSEST_DATETIME_BEFORE AS DATESTAMP
FROM APP_RQA.T_CALENDAR_SS
WHERE EXCHMARKET = '83') T2
ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
DATESTAMP,
UNIT_NAV,
VALUE,
RETURNRATE
FROM APP_RQA.T_FUND_NAV_SS) T3
ON T1.FUNDINNERCODE = T3.CODE
AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
AND T2.DATESTAMP = '2010-06-30'
我无法理解它,非常感谢任何帮助或建议。
I have three tables and I want to join them together one by one by some common columns and the end results always have massive amount of null values in them although the data is present (in the second left join operation) in both tables.
Here is the code I ran:
SELECT T1.INNERCODE AS CODE,
T2.DATESTAMP,
T1.REPORTDATE,
t1.FUNDINNERCODE,
T1.MARKETVALUE,
T1.SHARESHOLDING,
T1.RATIOINNV,
T3.UNIT_NAV,
T3.VALUE,
T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
CLOSEST_DATETIME_BEFORE AS DATESTAMP
FROM APP_RQA.T_CALENDAR_SS
WHERE EXCHMARKET = '83') T2
ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
DATESTAMP,
UNIT_NAV,
VALUE,
RETURNRATE
FROM APP_RQA.T_FUND_NAV_SS) T3
ON T1.FUNDINNERCODE = T3.CODE
AND T2.DATESTAMP = T3.DATESTAMP
Below is the end results. As you can see, column unit_nav, returnrate and value which are from the last table are mostly empty
For example, when fundinnercode = 4082, datestamp = 2010-06-30 (highlighted in the picture), unit_nav, returnrate and value are all null. but this row is present in the table t_fund_nav_ss as shown below:
what is even weirder is when I use the where clause to specifically locate the row in the end results, the three missing columns appear to have data in them
SELECT T1.INNERCODE AS CODE,
T2.DATESTAMP,
T1.REPORTDATE,
t1.FUNDINNERCODE,
T1.MARKETVALUE,
T1.SHARESHOLDING,
T1.RATIOINNV,
T3.UNIT_NAV,
T3.VALUE,
T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
CLOSEST_DATETIME_BEFORE AS DATESTAMP
FROM APP_RQA.T_CALENDAR_SS
WHERE EXCHMARKET = '83') T2
ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
DATESTAMP,
UNIT_NAV,
VALUE,
RETURNRATE
FROM APP_RQA.T_FUND_NAV_SS) T3
ON T1.FUNDINNERCODE = T3.CODE
AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
AND T2.DATESTAMP = '2010-06-30'
I could not wrap my head around it and any help or suggestions are greatly appreciated.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
感谢大家的评论。今天偶然遇到一位大数据工程师,向他询问了这个问题。对于那些好奇的人来说,问题出在 T3 表的大小上。 T1 和 T2 大约有 20-30 k 行,而 T3 有 3000 万行。并且左连接操作会导致数据丢失。所以我首先将 T3 过滤到大约 200 万行,现在结果看起来正常。
Thanks to everybody for commenting. I stumbled across a big data engineer today and asked him about the issue. For those who are curious, it turns out that the issue is about the size of T3 table. T1 and T2 have around 20-30 k rows whereas T3 has 30 mil. And the left join operation causes data loss. So I first filter T3 to around 2 mil rows and results look normal now.