How to group and perform data scaling per group using MLlib and PySpark?
I have a dataset just like in the example below, and I am trying to group all rows for a given symbol and perform standard scaling of each group, so that in the end all my data is scaled by group. How can I do that with MLlib and PySpark? I could not find a single solution on the internet for it. Can anyone help here?
+------+------------------+------------------+------------------+------------------+
|symbol| open| high| low| close|
+------+------------------+------------------+------------------+------------------+
| AVT| 4.115| 4.115| 4.0736| 4.0736|
| ZEC| 365.6924715181936| 371.9164684545918| 364.8854025324053| 369.5950712239761|
| ETH| 647.220769018717| 654.6370842160561| 644.8942258095359| 652.1231757197687|
| XRP|0.3856343600456335|0.4042970302356221|0.3662228285447956|0.4016658006619401|
| XMR|304.97650674864144|304.98649644294267|299.96970554155274| 303.8663243145598|
| LTC|321.32437862304715| 335.1872636382617| 320.9704201234651| 334.5057757774086|
| EOS| 5.1171| 5.1548| 5.1075| 5.116|
| BCH| 1526.839255299505| 1588.106037653013|1526.8392543926366|1554.8447136830328|
| DASH| 878.00000003| 884.03769206| 869.22000004| 869.22000004|
| BTC|17042.224796462127| 17278.87984139109|16898.509289685637|17134.611038665582|
| REP| 32.50162799| 32.501628| 32.41062673| 32.50162799|
| DASH| 858.98413357| 863.01413927| 851.07145059| 851.17051529|
| ETH| 633.1390884474979| 650.546984589714| 631.2674221381849| 641.4566047907362|
| XRP|0.3912300406160967|0.3915937383961073|0.3480682353334925|0.3488616679337076|
| EOS| 5.11| 5.1675| 5.0995| 5.1674|
| BCH|1574.9602789966184|1588.6004569127992| 1515.3| 1521.0|
| BTC| 17238.0199449088| 17324.83886467445|16968.291408828714| 16971.12960974206|
| LTC| 303.3999614441217| 317.6966006615225|302.40702519057584| 310.971265429805|
| REP| 32.50162798| 32.50162798| 32.345677| 32.345677|
| XMR| 304.1618444641083| 306.2720324372592|295.38042671416935| 295.520097663825|
+------+------------------+------------------+------------------+------------------+
1 Answer
I suggest you import the following:
import pyspark.sql.functions as f
then you can do it like this (not fully tested code):
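A minimal sketch of that grouping step, assuming the price DataFrame shown in the question is named df (a hypothetical name) and the per-symbol statistics are collected into stats_df:

import pyspark.sql.functions as f

# Assumption: `df` is the DataFrame shown in the question.
cols = ["open", "high", "low", "close"]

# Per-symbol mean and standard deviation for every price column.
stats_df = df.groupBy("symbol").agg(
    *[f.mean(c).alias(c + "_mean") for c in cols],
    *[f.stddev(c).alias(c + "_std") for c in cols],
)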
This is the principle of how you would do it (you could instead use the min and max functions for MinMax scaling); then you just have to apply the standard-scaling formula to stats_df: x' = (x - μ) / σ
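Under the same assumptions as the sketch above, applying that formula could look like this: join stats_df back onto df by symbol, then compute (x - μ) / σ column by column.

# Join the per-symbol statistics back onto the original rows and
# standard-scale each price column within its own symbol group.
scaled_df = df.join(stats_df, on="symbol")
for c in cols:
    scaled_df = scaled_df.withColumn(
        c + "_scaled", (f.col(c) - f.col(c + "_mean")) / f.col(c + "_std")
    )

scaled_df.select("symbol", *[c + "_scaled" for c in cols]).show()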