How do I group rows and perform data scaling per group with MLlib and PySpark?

I have a dataset just like in the example below, and I am trying to group all rows for a given symbol and perform standard scaling within each group, so that in the end all my data is scaled by group. How can I do that with MLlib and PySpark? I could not find a single solution on the internet for it. Can anyone help here?

+------+------------------+------------------+------------------+------------------+
|symbol|              open|              high|               low|             close|
+------+------------------+------------------+------------------+------------------+
|   AVT|             4.115|             4.115|            4.0736|            4.0736|
|   ZEC| 365.6924715181936| 371.9164684545918| 364.8854025324053| 369.5950712239761|
|   ETH|  647.220769018717| 654.6370842160561| 644.8942258095359| 652.1231757197687|
|   XRP|0.3856343600456335|0.4042970302356221|0.3662228285447956|0.4016658006619401|
|   XMR|304.97650674864144|304.98649644294267|299.96970554155274| 303.8663243145598|
|   LTC|321.32437862304715| 335.1872636382617| 320.9704201234651| 334.5057757774086|
|   EOS|            5.1171|            5.1548|            5.1075|             5.116|
|   BCH| 1526.839255299505| 1588.106037653013|1526.8392543926366|1554.8447136830328|
|  DASH|      878.00000003|      884.03769206|      869.22000004|      869.22000004|
|   BTC|17042.224796462127| 17278.87984139109|16898.509289685637|17134.611038665582|
|   REP|       32.50162799|         32.501628|       32.41062673|       32.50162799|
|  DASH|      858.98413357|      863.01413927|      851.07145059|      851.17051529|
|   ETH| 633.1390884474979|  650.546984589714| 631.2674221381849| 641.4566047907362|
|   XRP|0.3912300406160967|0.3915937383961073|0.3480682353334925|0.3488616679337076|
|   EOS|              5.11|            5.1675|            5.0995|            5.1674|
|   BCH|1574.9602789966184|1588.6004569127992|            1515.3|            1521.0|
|   BTC|  17238.0199449088| 17324.83886467445|16968.291408828714| 16971.12960974206|
|   LTC| 303.3999614441217| 317.6966006615225|302.40702519057584|  310.971265429805|
|   REP|       32.50162798|       32.50162798|         32.345677|         32.345677|
|   XMR| 304.1618444641083| 306.2720324372592|295.38042671416935|  295.520097663825|
+------+------------------+------------------+------------------+------------------+
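For readers who want to experiment with this, a few of the rows above can be loaded into a small PySpark DataFrame as follows (a minimal, hypothetical setup; the SparkSession and the subset of rows are only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('group-scaling-example').getOrCreate()

# A small subset of the rows shown above, enough to try out per-symbol scaling.
df = spark.createDataFrame(
    [
        ('ETH', 647.220769018717, 654.6370842160561, 644.8942258095359, 652.1231757197687),
        ('ETH', 633.1390884474979, 650.546984589714, 631.2674221381849, 641.4566047907362),
        ('BTC', 17042.224796462127, 17278.87984139109, 16898.509289685637, 17134.611038665582),
        ('BTC', 17238.0199449088, 17324.83886467445, 16968.291408828714, 16971.12960974206),
    ],
    ['symbol', 'open', 'high', 'low', 'close'],
)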

Comments (1)

山川志 2025-01-16 07:48:01

I suggest you import the following:

import pyspark.sql.functions as f

Then you can compute the per-group mean and standard deviation like this (not fully tested code):

stats_df = df.groupBy('symbol').agg(
    f.mean('open').alias('open_mean'),      # per-symbol mean of the open price
    f.stddev('open').alias('open_stddev'),  # per-symbol standard deviation of the open price
)

This is the principle of how you would do it (for MinMax scaling you could use the min and max functions instead); then you just have to apply the standard scaling formula using the statistics in stats_df:

x' = (x - μ) / σ
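For completeness, here is a minimal, untested sketch of that last step, assuming the statistics are kept as a DataFrame rather than collected (the names price_cols, *_mean and *_stddev are illustrative, not part of the original answer): the per-symbol statistics are joined back onto the original DataFrame and x' = (x - μ) / σ is applied to every price column.

import pyspark.sql.functions as f

price_cols = ['open', 'high', 'low', 'close']

# Per-symbol mean and standard deviation for every price column.
stats_df = df.groupBy('symbol').agg(
    *[f.mean(c).alias(c + '_mean') for c in price_cols],
    *[f.stddev(c).alias(c + '_stddev') for c in price_cols],
)

# Join the statistics back and apply x' = (x - mean) / stddev within each symbol group.
# Note: stddev is null for symbols with a single row, which yields null scaled values.
scaled_df = df.join(stats_df, on='symbol', how='left')
for c in price_cols:
    scaled_df = scaled_df.withColumn(
        c + '_scaled',
        (f.col(c) - f.col(c + '_mean')) / f.col(c + '_stddev'),
    )

scaled_df.select('symbol', *[c + '_scaled' for c in price_cols]).show()

The same statistics could also be computed with window functions (pyspark.sql.Window partitioned by symbol), which avoids the explicit join. MLlib's StandardScaler, by contrast, scales a whole vector column rather than each group separately, which is why the statistics are computed per symbol and the formula applied manually here.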
