了解 WHERE 子句中的 SQL 连接
我有一个 SQL 查询,我正在尝试将其转换为 Pig Latin(用于 Hadoop 集群)。大多数时候,我将查询转移到 Pig 没有问题,但我遇到了一些我似乎无法弄清楚的事情。
想象一下这样的 select 语句:
SELECT a.f1, b.f2, b.f3 -- just for example
FROM tableA a, tableB b
WHERE(
(
a.f1 = b.f2
AND
(
a.f2 = b.f1
OR
(
(a.f2 = 'somestring1' OR a.f2 = 'somestring2')
AND
(b.f1 is null OR b.f1 ='somestring3' OR b.f1 = 'somestring4')
)
)
OR
(
a.f3 = b.f4
AND
(
a.f4 = b.f3
OR
(
(a.f4 = 'somestring5' OR a.f4 = 'somestring6')
AND
(b.f3 is null OR b.f3 ='somestring7' OR b.f3 = 'somestring8')
)
)
)
)
GROUP BY -- some other stuff
现在,我知道直接翻译为 Pig 可能是不可能的。 如果是这样的话,我想知道这个语句如何分解为计算它所需的 JOIN(和过滤器)(因为可能更容易了解如何从中构造 Pig 查询)。
我已经去了 Pig 邮件列表,但我还没有找到一个好的解决方案,因为 Pig 并没有真正对 JOIN 进行“OR”操作。我知道这个问题有多么奇怪。
(如果有人确实知道如何在 Pig 中完成此操作,我不会反对查看它......)
编辑:有谁知道这在 Hive 中是否会更容易,也许?
I have a query in SQL that I'm trying to translate into Pig Latin (for use on a Hadoop cluster). Most of the time I have no problem moving the queries over to Pig, but I've encountered something I can't seem to figure out.
Imagine a select statement like this:
SELECT a.f1, b.f2, b.f3 -- just for example
FROM tableA a, tableB b
WHERE(
(
a.f1 = b.f2
AND
(
a.f2 = b.f1
OR
(
(a.f2 = 'somestring1' OR a.f2 = 'somestring2')
AND
(b.f1 is null OR b.f1 ='somestring3' OR b.f1 = 'somestring4')
)
)
OR
(
a.f3 = b.f4
AND
(
a.f4 = b.f3
OR
(
(a.f4 = 'somestring5' OR a.f4 = 'somestring6')
AND
(b.f3 is null OR b.f3 ='somestring7' OR b.f3 = 'somestring8')
)
)
)
)
GROUP BY -- some other stuff
Now, I know that a direct translation to Pig might not be possible.
If that's the case, I'm wondering how this statement gets decomposed into the JOINs (and filters) required to compute it (as it may be easier to see how to construct a Pig query out of that).
I've gone to the Pig mailing list but I haven't found a good solution yet, as Pig doesn't really do "OR" with JOINs. I am aware of how odd this question is.
(If anyone does have an idea of how this be done in Pig, I wouldn't be opposed to looking at it...)
Edit: Does anyone know if this would be easier in Hive, maybe?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
UNION 是另一种拼写 OR ... 的方式。
或许:
UNION is another way of spelling OR ... sort of.
Maybe:
我不知道 Pig,但在 SQL 中,你的语句相当于你写的内容:
它也会很慢,因为缺少一个优化器(例如 Postgres)来分解 OR 子句并尝试每个子句适当的索引,您最终将整个两个表交叉连接。
如果 Pig 更理解后者,您可以将该语句重写为
select from (select ... union select ...) group by ...
。I dunno about Pig, but in SQL, your statement is equivalent to pretty what you've written:
It'll be slow, too, because short of having an optimizer (e.g. Postgres) that decomposes the OR clauses and tries each of them with the appropriate index, you end up cross joining the two tables in their entirety.
You could rewrite the statement as
select from (select ... union select ...) group by ...
instead, if Pig makes more sense of the latter.