Can a PySpark ML model be run on only part of a dataframe, depending on a condition?
I have trained a logistic regression algorithm to match job titles and descriptions to a set of 4 digit numeric codes. This it does very well. It will form part of a pipeline that first attempts to match these data by joining to a reference database, which leaves some entries of the dataframe matched to a 4 digit code, and some left with a dummy code indicating they are still to be matched. Therefore, the state of my dataframe just prior to my running the ML algorithm is
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [
        ('1', 'BUTCHER', 'MEAT PERSON', '-1'),
        ('2', 'BAKER', 'BREAD AND PASTRY AND CAKE', '1468'),
        ('3', 'CANDLESTICK MAKER', 'LET THERE BE LIGHT', '-1')
    ],
    [
        'ID',
        'COLUMN_TO_VECTORIZE_1',
        'COLUMN_TO_VECTORIZE_2',
        'RESULTS_COLUMN'
    ]
)
where '-1' is the dummy code for 'as yet unmatched' and 'BAKER' has been matched to '1468' already.
If I wanted to match the whole dataframe, I would go on to write
data = pretrained_model_pipeline.transform(data) # vectorizes/assembles feature column, runs ML algorithm
# other code to perform index to string conversion on labels, and place labels into RESULTS_COLUMN
But that runs the whole dataframe through the algorithm, and anything previously matched does not need to be matched again.
I COULD create a new dataframe from the old one, selecting just those rows with '-1' in the 'RESULTS_COLUMN', but my limited understanding of Spark says that this would essentially double my memory usage.
Is there a way for the pretrained model to be given the whole dataframe to transform, but with some sort of mask telling it to skip rows with '-1' in the results?
1 Answer
Spark doesn't always load data into memory, and it doesn't necessarily load everything unless we tell it to. filter() is a lazy transformation that defines a new plan over the same underlying data rather than copying the rows, so it won't double your memory usage; just filtering is enough.
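A minimal sketch of that approach, assuming the pretrained_model_pipeline and the column names from the question, could look like this:

from pyspark.sql import functions as F

# Lazy transformations: no rows are copied or computed at this point.
unmatched = data.filter(F.col('RESULTS_COLUMN') == '-1')
matched = data.filter(F.col('RESULTS_COLUMN') != '-1')

# Run the ML stages over the unmatched rows only.
unmatched = pretrained_model_pipeline.transform(unmatched)
# ... the index-to-string conversion that writes the predicted labels
# into RESULTS_COLUMN, as described in the question, goes here ...

# Drop the extra feature/prediction columns added by the pipeline and
# stitch the two halves back together by column name.
data = matched.unionByName(unmatched.select(matched.columns))

If data is itself the result of expensive upstream work, it may be worth calling data.cache() before the two filters so that the source isn't recomputed for each branch; otherwise nothing here materializes a second copy of the dataframe.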