Training a model in Python with a PySpark DataFrame
In order to train a model, I trained a dataset on logistic regression to start with, and I am using that model in the script below, but it gives me an error saying
"No module named 'sklearn'"
I have installed the package there, but it still doesn't work. Can someone please tell me what can be done?
Here is the script I found on this blog:
from sklearn.linear_model import LogisticRegression
from pyspark.ml.feature import VectorAssembler
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql.window import Window as w

# train a scikit-learn model on the driver
model = LogisticRegression(C=1e5)
model.fit(X, Y)

# creating test data from PySpark
vectorAssembler = VectorAssembler(
    inputCols=[col for col in df.columns if '_id' not in col and 'label' not in col],
    outputCol='features'
)
features_vectorized = vectorAssembler.transform(df)

# ship the fitted model to every executor
model_broadcast = sc.broadcast(model)

# udf to predict on the cluster
def predict_new(feature_map):
    ids, features = zip(*[
        (k, v) for d in feature_map for k, v in d.items()
    ])
    ind = model_broadcast.value.classes_.tolist().index(1.0)
    probs = [
        float(v) for v in
        model_broadcast.value.predict_proba(features)[:, ind]
    ]
    return dict(zip(ids, probs))

predict_new_udf = f.udf(
    predict_new,
    t.MapType(t.LongType(), t.FloatType())
)

# set the number of prediction groups to create
nparts = 5000

# put everything together
outcome_sdf = (
    features_vectorized.select(
        f.create_map(f.col('id'), f.col('features')).alias('feature_map'),
        (f.row_number().over(w.partitionBy(f.lit(1)).orderBy(f.lit(1))) % nparts).alias('grouper')
    )
    .groupby(f.col('grouper'))
    .agg(f.collect_list(f.col('feature_map')).alias('feature_map'))
    .select(predict_new_udf(f.col('feature_map')).alias('results'))
    .select(f.explode(f.col('results')).alias('unique_id', 'probability_estimate'))
)
This runs and executes well, but when I look at the values of outcome_sdf, I get the "No module named 'sklearn'" error. I read about installing sklearn on the cluster; can someone help me with that?
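For what it's worth, the unpacking step inside predict_new can be tried locally without Spark, which helps separate logic errors from cluster-environment errors. This is a minimal sketch with hypothetical toy data (the ids and feature vectors below are made up, not from my dataset):

```python
# feature_map arrives as a list of single-entry {id: feature_vector} dicts,
# because collect_list gathers the per-row maps created by create_map
feature_map = [{1: [0.1, 0.2]}, {2: [0.3, 0.4]}]

# flatten the list of dicts into parallel tuples of ids and feature vectors
ids, features = zip(*[
    (k, v) for d in feature_map for k, v in d.items()
])

print(ids)       # (1, 2)
print(features)  # ([0.1, 0.2], [0.3, 0.4])
```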
You'll need to install sklearn on all nodes of your cluster, not just a single node. The driver can pickle and broadcast the model, but the UDF runs on the executors, and unpickling it there requires sklearn to be importable on every worker.
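Two common ways to do that are sketched below. These are configuration fragments, not a drop-in recipe: the exact commands depend on your cluster manager and Python setup, and the environment name, Python version, and script name here are all hypothetical.

```shell
# Option 1: install scikit-learn into the Python environment on every
# worker node (run via ssh, your cluster's bootstrap scripts, or a
# configuration-management tool)
pip install scikit-learn

# Option 2: ship a packed conda environment with the job instead of
# touching the nodes (assumes conda and conda-pack are available)
conda create -y -n sklearn_env python=3.9 scikit-learn
conda pack -n sklearn_env -o sklearn_env.tar.gz
spark-submit \
  --archives sklearn_env.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  your_script.py
```

Option 2 has the advantage that every executor unpacks the same archive, so driver and workers are guaranteed to use identical library versions.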