pyspark RandomForestClassifier does not recognize the weightCol argument
I'm trying to implement a weighted random forest model on an imbalanced dataset.
Here is what I am trying to do: I have a Jupyter notebook running in Google Colab, with Spark running in local mode.
First I install:
!apt-get update -qq > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark
Then set up the Spark context:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.8-bin-hadoop2.7"
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.ml.linalg import Vectors
import numpy as np
from pyspark.sql import SparkSession  # to make DataFrames from RDDs easily
import time
sc = SparkContext(appName="bitcoinFraud", master="local[*]")
spark_session = SparkSession(sc)
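(As a quick sanity check of the environment, the running context can report its own version; given the 2.4.8 download above, this should print 2.4.8:)

# confirm which Spark version the notebook is actually running
print(sc.version)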
Finally, after making some DataFrames and whatnot, I try to instantiate a random forest model which will digest training data, including a column of instance weights:
from pyspark.ml.classification import RandomForestClassifier
# define the random forest model, using weights this time
rf_weighted = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", weightCol='weight', numTrees=100)
However, I get this error:
TypeError: __init__() got an unexpected keyword argument 'weightCol'
I was surprised, because the documentation for RandomForestClassifier seems to say that there is such an argument. Furthermore, this source seems to show it working.
I have no idea what's going on--I'm new to Spark, and I really need some help!
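One way to check whether the installed PySpark build actually exposes the parameter (rather than trusting the online docs, which show the latest release) is to ask the estimator itself; on a build without instance-weight support this prints False:

from pyspark.ml.classification import RandomForestClassifier
# True only if this PySpark build defines a weightCol Param on the estimator
print(RandomForestClassifier().hasParam("weightCol"))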
1 Answer
This feature is not available in Spark versions before 3.0; downloading a recent 3.x version of Spark is enough to fix the problem.
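For example, the install cells above can be pointed at a 3.x build instead, after which the original constructor call works as written. (A minimal sketch: 3.1.2 is just an illustrative release, and the exact archive name/URL is an assumption; use whatever version the Apache download page currently lists, or the archive.apache.org mirror for older releases.)

# assumption: 3.1.2 stands in for whatever 3.x release is currently available
!wget -q https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz

import os
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"

import findspark
findspark.init()
from pyspark.ml.classification import RandomForestClassifier

# weightCol was added to RandomForestClassifier in Spark 3.0,
# so this call no longer raises a TypeError
rf_weighted = RandomForestClassifier(labelCol="indexedLabel",
                                     featuresCol="indexedFeatures",
                                     weightCol="weight",
                                     numTrees=100)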