Training a model in Python with a PySpark DataFrame

Published 2025-01-19 08:55:22 · 1,964 characters · 1 view · 0 comments

To train a model, I first fit a Logistic Regression on a dataset, and I am using that model in the script below, but it gives me the error "No module named 'sklearn'". I have already installed the package there, but it still doesn't work. Can someone please tell me what can be done? Here is the script I found on this blog:

# missing imports added: sklearn's LogisticRegression and Spark's VectorAssembler
from sklearn.linear_model import LogisticRegression
from pyspark.ml.feature import VectorAssembler
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql.window import Window as w

# train the scikit-learn model on the driver (X, Y are the training features/labels)
model = LogisticRegression(C=1e5)
model.fit(X, Y)

# creating test data from PySpark
vectorAssembler = VectorAssembler(
    inputCols=[col for col in df.columns if '_id' not in col and 'label' not in col],
    outputCol="features"
)
features_vectorized = vectorAssembler.transform(df)

# ship the fitted model to every executor
model_broadcast = sc.broadcast(model)

# udf to predict on the cluster
def predict_new(feature_map):
    # feature_map is a list of {row_id: feature_vector} dicts
    ids, features = zip(*[
        (k, v) for d in feature_map for k, v in d.items()
    ])
    # column index of the positive class (label 1.0)
    ind = model_broadcast.value.classes_.tolist().index(1.0)
    probs = [
        float(v) for v in
        model_broadcast.value.predict_proba(features)[:, ind]
    ]
    return dict(zip(ids, probs))

predict_new_udf = f.udf(
    predict_new,
    t.MapType(t.LongType(), t.FloatType())
)

# set the number of prediction groups to create
nparts = 5000

# put everything together
outcome_sdf = (
    features_vectorized.select(
        f.create_map(f.col('id'), f.col('features')).alias('feature_map'),
        (f.row_number().over(w.partitionBy(f.lit(1)).orderBy(f.lit(1))) % nparts).alias('grouper')
    )
    .groupby(f.col('grouper'))
    .agg(f.collect_list(f.col('feature_map')).alias('feature_map'))
    .select(predict_new_udf(f.col('feature_map')).alias('results'))
    .select(f.explode(f.col('results')).alias('unique_id', 'probability_estimate'))
)
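As a local sanity check of the UDF's map-in / map-out contract (no Spark or sklearn needed), predict_new can be exercised with stand-in objects; FakeBroadcast and FakeModel below are hypothetical stand-ins for pyspark's Broadcast and a fitted LogisticRegression, not real library classes:

```python
class FakeBroadcast:
    """Mimics pyspark's Broadcast: the payload lives in .value."""
    def __init__(self, value):
        self.value = value

class FakeModel:
    """Stand-in for a fitted sklearn LogisticRegression."""
    classes_ = [0.0, 1.0]  # binary classes, as sklearn exposes them

    def predict_proba(self, features):
        # one [p(class 0), p(class 1)] row per feature vector
        return [[0.3, 0.7] for _ in features]

model_broadcast = FakeBroadcast(FakeModel())

def predict_new(feature_map):
    # feature_map is a list of {row_id: feature_vector} dicts,
    # exactly what collect_list(create_map(...)) hands to the UDF
    ids, features = zip(*[
        (k, v) for d in feature_map for k, v in d.items()
    ])
    # column index of the positive class (label 1.0)
    ind = list(model_broadcast.value.classes_).index(1.0)
    probs = [
        float(row[ind]) for row in model_broadcast.value.predict_proba(features)
    ]
    return dict(zip(ids, probs))

result = predict_new([{1: [0.5, 0.2]}, {2: [0.1, 0.9]}])
print(result)  # {1: 0.7, 2: 0.7}
```

Note that the real UDF body uses numpy slicing (`predict_proba(features)[:, ind]`); the plain-list indexing above is equivalent for this check.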

This runs and executes fine, but when I look up the values of outcome_sdf, I get the error "No module named 'sklearn'". I read about installing sklearn on the cluster; can someone help me with that?



Comments (1)

蘑菇王子 2025-01-26 08:55:22

You'll need to install sklearn on all nodes of your cluster, not just a single node.
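The error appears only when the UDF runs because the executors, not the driver, unpickle the broadcast model and import sklearn there. A sketch of two common ways to get the package onto every node (paths, the environment name `sklearn_env`, and `your_script.py` are placeholders):

```shell
# Option 1: install scikit-learn on every worker node
# (run on each node, e.g. via your cluster manager or a bootstrap action)
pip install scikit-learn

# Option 2: ship a packed conda environment with the job instead of
# installing per-node (requires the conda-pack package)
conda create -y -n sklearn_env python=3.9 scikit-learn
conda pack -n sklearn_env -o sklearn_env.tar.gz
spark-submit \
  --archives sklearn_env.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  your_script.py
```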

