我们如何使用pyspark垂直向下迭代列?
例如,在col1是列的名称并且具有值1,2,3的数据框中,因此,对于每行,我如何单独迭代10,20,30 ..值?
For instance, in a dataframe where col1 is the name of a column and it has values 1,2,3 and so for every row, how do I iterate through the 10,20,30.. values alone?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
好吧...坦率地说,在火花中,您只是不迭代。您不会处理火花中的行。您只需学习一种新的思维方式,只处理专栏即可。
例如,您的示例:
如果您只想获得col1 = 10,........ 20,........ 30,........ 40 ....... 40 .............................................. .....您必须在那里看到一个序列。您会考虑一下并创建一个规则以智能滤波器您的数据框架:
在Spark中,行订单绝不是确定性的。每个动作都会改变行顺序。分类可用,但是它是昂贵且不切实际的,因为下一个操作会破坏订单。当您排序时,将所有内容都拉到一台计算机中(仅当数据在一个节点上时,您至少可以暂时保留订单,因为通常数据在许多机器上分配,并且它们都不是“第一个”或“第二”或“第二”) 。在分布式计算中,数据应尽可能保持分布。
也就是说,很少需要迭代。有
df.collect()
(与排序相同)将所有行收集到一台计算机(驱动程序 - 最弱的机器)中的一个列表中。应该避免此操作,因为它会扭曲分布式计算的性质。但是在极少数情况下,它被使用。在行上迭代是一个例外。几乎所有数据操作都是可能的,而无需迭代。您只需搜索网络,思考并学习新的做事方式即可。Well... Bluntly said, in Spark you just don't iterate. You don't deal with rows in Spark. You just learn a new way of thinking and only deal with columns.
E.g., your example:
If you want to get only rows where col1 = 10,........ 20,........ 30,........ 40......... you must see a sequence there. You think about it and create a rule to smart-filter your dataframe:
Row order is never deterministic in Spark. Every action changes row order. Sorting is available, but it's costy and impractical, as next operation will ruin the order. When you sort, you pull everything into one machine (only when data is on one node you may, at least temporarily, preserve the order, because normally data is split across many machines and none of them is "first" or "second"). In distributed computing, data as much as possible should stay distributed.
That said, iterating rarely may be needed. There's
df.collect()
which (same as sorting) collects all rows into one list in one machine (the driver - the weakest machine). This operation is to be avoided, because it distorts the nature of distributed computing. But in rare cases it is used. Iterating over rows is an exception. Almost any data operation is possible without iterating. You just search the web, think and learn new ways of doing things.