Using Pig and Python

Posted 2024-11-18 23:15:04

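Apologies if this question is poorly worded: I am starting a large-scale machine learning project, and I don't like programming in Java. I love writing programs in Python, and I have heard good things about Pig. Could someone explain how usable Pig is in combination with Python for math-related work? Also, if I write "streaming Python code", does Jython come into the picture? If it does, is it more efficient?

Thanks

P.S.: For several reasons I would rather not use Mahout's code as is. I might want to use a few of its data structures, so it would be useful to know whether that is possible.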


陌上芳菲 2024-11-25 23:15:04

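Another option for using Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script that defines the data processing pipeline. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. Joins, groupings, and so on work similarly to Pig in spirit, so there are no surprises if you already know Pig.

A word counting example looks like this: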

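# Flow, Hfs, TextLine, GroupBy, Count and the map decorator are taken
# from PyCascading's helper module
from pycascading.helpers import *

@map(produces=['word'])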
def split_words(tuple):
    # This is called for each line of text
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')

    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output

    flow.run()



看海 2024-11-25 23:15:04

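When you use streaming in Pig, it doesn't matter which language you use. All Pig does is execute a command in a shell (for example via bash). You can use Python just as you could use grep or a C program.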
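As a rough sketch of what such a streamed script can look like: with Pig's default streaming format, each tuple arrives on stdin as one tab-separated line, and whatever the script prints becomes the output relation. The file name `square.py` and the two-field `key<TAB>value` layout are assumptions made up for this illustration.

```python
#!/usr/bin/env python
"""Hypothetical streaming script: squares the numeric second field of each record."""
import sys


def transform_line(line):
    """Turn 'key<TAB>value' into 'key<TAB>value squared'."""
    key, value = line.rstrip("\n").split("\t")
    return "%s\t%s" % (key, float(value) ** 2)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read one record per line, skip blank lines, emit one record per line.
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

On the Pig side this would be wired up with something along the lines of ``B = STREAM A THROUGH `python square.py`;``, shipping the script to the cluster via a `DEFINE ... SHIP('square.py')` clause. No Jython is involved here, because the script runs in whatever `python` executable the shell finds.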

You can now also define Pig UDFs natively in Python. These UDFs are invoked through Jython when they are executed.
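A minimal sketch of such a UDF for math-style work follows. The `sigmoid` function, the schema name `s`, and the file name `mathudfs.py` are invented for the example. To my understanding Pig supplies the `outputSchema` decorator itself when the file is registered with `USING jython`; the fallback definition below is only an assumption that lets the same file be exercised under plain Python.

```python
import math

# Pig is expected to provide `outputSchema` when this file is registered
# with `USING jython`. The fallback is a local stand-in for testing only.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("s:double")
def sigmoid(x):
    """Logistic function; a Pig null (None) is passed through unchanged."""
    if x is None:
        return None
    return 1.0 / (1.0 + math.exp(-x))
```

In a Pig script this might be used as `REGISTER 'mathudfs.py' USING jython AS mathudfs;` followed by `B = FOREACH A GENERATE mathudfs.sigmoid(score);`, where `score` is a hypothetical numeric field.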


狼性发作 2024-11-25 23:15:04

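The book Programming Pig discusses using UDFs, and it is indispensable in general. On a recent project we used Python UDFs and occasionally ran into Float vs. Double mismatches, so be warned. My impression is that support for Python UDFs may not be as solid as support for Java UDFs, but overall it works pretty well.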
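The answer does not say exactly where the mismatches showed up, so the following is only one plausible precaution, not a confirmed fix. As far as I know, a Python float crosses the Jython bridge as a Pig double, so the sketch declares `double` in the schema and coerces every value to a Python float before doing arithmetic. The `mean` UDF and the `to_double` helper are hypothetical names.

```python
# Fallback so the file also runs outside Pig; Pig is expected to supply
# `outputSchema` itself when the file is registered with `USING jython`.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


def to_double(value):
    """Coerce a numeric field to a Python float, keeping nulls as None."""
    return None if value is None else float(value)


@outputSchema("mean:double")
def mean(bag):
    """Average the first field of each tuple in a bag (a list of tuples)."""
    values = [to_double(t[0]) for t in bag if t[0] is not None]
    if not values:
        return None
    return sum(values) / len(values)
```

Declaring the output as `double` rather than `float` keeps the schema aligned with what the function actually returns, which is the kind of disagreement that seems most likely to produce the mismatch described above.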
