pyspark RandomForestClassifier doesn't recognize the weightCol parameter

Posted on 2025-01-19 23:41:04

I'm trying to implement a weighted random forest model on an imbalanced dataset.

Here is what I am trying to do: I have a Jupyter notebook in Google Colab that runs a local Spark instance.

First I install:

!apt-get update -qq > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark

Then set up the Spark context:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.8-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.ml.linalg import Vectors
import numpy as np 
from pyspark.sql import SparkSession  # to make DataFrames from RDDs easily
import time

sc = SparkContext(appName="bitcoinFraud", master="local[*]")
spark_session = SparkSession(sc)
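
For context, here is roughly how the weight column used below can be built. This is only a minimal, illustrative sketch on toy data (the real preprocessing is omitted, and the column names indexedLabel, indexedFeatures, and weight are simply the ones I use later); it weights each row inversely to its class frequency:

from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors

# Toy imbalanced dataset: 9 majority-class rows, 1 minority-class row.
df = spark_session.createDataFrame(
    [(0.0, Vectors.dense([1.0, 2.0]))] * 9 + [(1.0, Vectors.dense([5.0, 6.0]))],
    ["indexedLabel", "indexedFeatures"],
)

# Weight each row inversely to its class frequency, so both classes
# contribute equally to training overall.
counts = {r["indexedLabel"]: r["count"] for r in df.groupBy("indexedLabel").count().collect()}
total = sum(counts.values())
df = df.withColumn(
    "weight",
    F.when(F.col("indexedLabel") == 1.0, total / (2.0 * counts[1.0]))
     .otherwise(total / (2.0 * counts[0.0])),
)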

Finally, after making some DataFrames and whatnot, I try to instantiate a random forest model which will digest training data, including a column of instance weights:

from pyspark.ml.classification import RandomForestClassifier

# define the random forest model, using weights this time
rf_weighted = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", weightCol='weight', numTrees=100)

However, I get this error:

TypeError: __init__() got an unexpected keyword argument 'weightCol'

I was surprised, because the documentation for RandomForestClassifier seems to say that there is such an argument. Furthermore, this source seems to show it working.

I have no idea what's going on--I'm new to Spark, and I really need some help!

Comments (1)

无尽的现实 2025-01-26 23:41:04

This feature is not available in Spark versions below 3.0 (weightCol was added to RandomForestClassifier in Spark 3.0); downloading a more recent version of Spark is enough to fix this problem.
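
For example, in the Colab setup above it is enough to swap the 2.4.8 download for a 3.x build and point SPARK_HOME at it; the exact version below is illustrative (check https://downloads.apache.org/spark/ for what is currently hosted). Once a 3.x runtime is in place, the original constructor call is accepted:

# In the notebook, replace the 2.4.8 download with a 3.x build, e.g.:
# !wget -q https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
# !tar xf spark-3.5.1-bin-hadoop3.tgz
# os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

import pyspark
print(pyspark.__version__)  # sanity check: should start with "3."

from pyspark.ml.classification import RandomForestClassifier

# Accepted on Spark >= 3.0, where RandomForestClassifier has weightCol.
rf_weighted = RandomForestClassifier(
    labelCol="indexedLabel",
    featuresCol="indexedFeatures",
    weightCol="weight",
    numTrees=100,
)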
