SAS Proc Transpose to PySpark

Posted 2025-01-28 14:50:45

I am trying to convert a SAS PROC TRANSPOSE statement to PySpark in Databricks.
Take the following data as a sample:

data = [
    {"duns": 1234,  "finc stress": 100, "ver": 6.0},
    {"duns": 1234,  "finc stress": 125, "ver": 7.0},
    {"duns": 1234,  "finc stress": 135, "ver": 7.1},
    {"duns": 12345, "finc stress": 125, "ver": 7.6},
]

I would expect the result to look like this:
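Reconstructed from the sample data, the expected layout would presumably be one row per duns, with a ver-prefixed column for each ver value holding the finc stress amount:

duns    ver6.0    ver7.0    ver7.1    ver7.6
1234       100       125       135      null
12345     null      null      null       125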

I tried using the pandas pivot_table() function with the following code, but I ran into some performance issues due to the size of the data:

tst = (df.pivot_table(index=['duns'], columns=['ver'], values='finc stress')
              .add_prefix('ver')
              .reset_index())

Is there a way to translate the PROC TRANSPOSE SAS logic to PySpark instead of using pandas?

I am trying something like this, but I am getting an error:

tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')

AssertionError: all exprs should be Column
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<command-2507760044487307> in <module>
      4 df = pd.DataFrame(data) # pandas
      5 
----> 6 tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
      7 
      8 

/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
    115         else:
    116             # Columns
--> 117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    118             jdf = self._jgd.agg(exprs[0]._jc,
    119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))

AssertionError: all exprs should be Column

If you could help me out I would so appreciate it! Thank you so much.

Comments (1)

东风软 2025-02-04 14:50:45

I don't know how you created df from data, but here is what I did:

import pyspark.pandas as ps  # pandas API on Spark

df = ps.DataFrame(data)
df['ver'] = df['ver'].astype('str')  # cast so the pivoted column labels are strings ('6.0', '7.0', ...)

Then your pandas code worked.
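Putting the cast together with the pivot_table() call from the question, the pandas-API-on-Spark version looks like this (a sketch reusing the names from the question):

import pyspark.pandas as ps

df = ps.DataFrame(data)
df['ver'] = df['ver'].astype('str')

tst = (df.pivot_table(index=['duns'], columns=['ver'], values='finc stress')
         .add_prefix('ver')   # column labels come out as 'ver6.0', 'ver7.0', ...
         .reset_index())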

To use the PySpark method, here is what I did:

from pyspark.sql import functions as F
sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))
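For completeness, a minimal end-to-end sketch of the PySpark route, assuming a Databricks notebook where the SparkSession spark is already available (variable names are illustrative):

from pyspark.sql import functions as F

data = [
    {"duns": 1234,  "finc stress": 100, "ver": 6.0},
    {"duns": 1234,  "finc stress": 125, "ver": 7.0},
    {"duns": 1234,  "finc stress": 135, "ver": 7.1},
    {"duns": 12345, "finc stress": 125, "ver": 7.6},
]
sparkdf = spark.createDataFrame(data)

# groupBy + pivot + an aggregate is the PySpark counterpart of PROC TRANSPOSE here;
# there is exactly one 'finc stress' value per (duns, ver) pair, so first() just carries it through.
tst = sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))

# Mimic pandas' add_prefix('ver') by renaming the pivoted columns.
for c in tst.columns:
    if c != 'duns':
        tst = tst.withColumnRenamed(c, f'ver{c}')

tst.show()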