Can a PySpark ML model be run on only part of a dataframe, based on a condition?

Asked 2025-01-17 04:19:35


I have trained a logistic regression model to match job titles and descriptions to a set of 4-digit numeric codes, which it does very well. It will form part of a pipeline that first attempts to match these data by joining to a reference database, which leaves some entries of the dataframe matched to a 4-digit code and others with a dummy code indicating they are still to be matched. Therefore, the state of my dataframe just prior to running the ML algorithm is

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [
        ('1', 'BUTCHER', 'MEAT PERSON', '-1'),
        ('2', 'BAKER', 'BREAD AND PASTRY AND CAKE', '1468'),
        ('3', 'CANDLESTICK MAKER', 'LET THERE BE LIGHT', '-1')
    ],
    [
        'ID',
        'COLUMN_TO_VECTORIZE_1',
        'COLUMN_TO_VECTORIZE_2',
        'RESULTS_COLUMN'
    ]
)

where '-1' is the dummy code for 'as yet unmatched' and 'BAKER' has been matched to '1468' already.
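
For context, the reference-join step that produces this state looks roughly like the following. This is only a sketch: ref, with columns TITLE and CODE, is a hypothetical stand-in for my actual reference database.

from pyspark.sql import functions as F

# Hypothetical reference table 'ref' with columns TITLE and CODE
data = (
    data.join(ref, data['COLUMN_TO_VECTORIZE_1'] == ref['TITLE'], 'left')
        # matched rows keep their code; unmatched rows get the dummy '-1'
        .withColumn('RESULTS_COLUMN', F.coalesce(F.col('CODE'), F.lit('-1')))
        .drop('TITLE', 'CODE')
)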

If I wanted to match the whole dataframe, I would go on to write

data = pretrained_model_pipeline.transform(data) # vectorizes/assembles feature column, runs ML algorithm

# other code to perform index to string conversion on labels, and place labels into RESULTS_COLUMN
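
That index-to-string step might look roughly like this. A sketch only, assuming the fitted StringIndexerModel used during training is available under the hypothetical name label_indexer:

from pyspark.ml.feature import IndexToString
from pyspark.sql import functions as F

# Assumption: label_indexer is the fitted StringIndexerModel from training
converter = IndexToString(inputCol='prediction', outputCol='PREDICTED_CODE',
                          labels=label_indexer.labels)

# map numeric predictions back to the original 4-digit string codes
data = (
    converter.transform(data)
             .withColumn('RESULTS_COLUMN', F.col('PREDICTED_CODE'))
             .drop('PREDICTED_CODE')
)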

But that runs the whole dataframe through the algorithm, and anything previously matched does not need to be matched again.

I COULD create a new dataframe from the old one, selecting just those rows with '-1' in the 'RESULTS_COLUMN', but my limited understanding of Spark says that this would essentially double my memory usage.

Is there a way for the pretrained model to be given the whole dataframe to transform, but with some sort of mask telling it to process only the rows that still have '-1' in the results?


1 Answer

红墙和绿瓦, answered 2025-01-24 04:19:35


Spark evaluates transformations lazily and doesn't materialize everything in memory unless you explicitly tell it to (e.g. with cache()). A filter therefore won't double your memory usage, so this is enough:

from pyspark.sql import functions as F

# transform only the rows still carrying the dummy code
data = pretrained_model_pipeline.transform(data.where(F.col('RESULTS_COLUMN') == '-1'))
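
One caveat: the filtered transform returns only the previously unmatched rows, so the already-matched rows need to be brought back in afterwards. A minimal sketch of that recombination, under the same assumptions as above:

from pyspark.sql import functions as F

# lazy filters over the same lineage -- not two materialized copies of the data
matched = data.where(F.col('RESULTS_COLUMN') != '-1')
unmatched = data.where(F.col('RESULTS_COLUMN') == '-1')

# run the pretrained pipeline only on rows that still need a code
predicted = pretrained_model_pipeline.transform(unmatched)
# ... write the predicted labels back into RESULTS_COLUMN here
# (e.g. via the index-to-string step sketched in the question) ...

# keep the original columns and reunite the two halves
data = matched.unionByName(predicted.select(*matched.columns))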