Can a PySpark ML model be run on only part of a dataframe, based on a condition?

Asked 2025-01-17 04:19:35


I have trained a logistic regression model to match job titles and descriptions to a set of 4-digit numeric codes, which it does very well. It will form part of a pipeline that first attempts to match these data by joining to a reference database, which leaves some entries of the dataframe matched to a 4-digit code and others with a dummy code indicating they are still to be matched. Therefore, the state of my dataframe just prior to running the ML algorithm is

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [
        ('1', 'BUTCHER', 'MEAT PERSON', '-1'),
        ('2', 'BAKER', 'BREAD AND PASTRY AND CAKE', '1468'),
        ('3', 'CANDLESTICK MAKER', 'LET THERE BE LIGHT', '-1')
    ],
    [
        'ID',
        'COLUMN_TO_VECTORIZE_1',
        'COLUMN_TO_VECTORIZE_2',
        'RESULTS_COLUMN'
    ]
)

where '-1' is the dummy code for 'as yet unmatched' and 'BAKER' has been matched to '1468' already.
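
For context, the reference-join step that produces this state looks roughly like the following. This is only a sketch: ref, with columns TITLE and CODE, is a hypothetical stand-in for my actual reference database.

from pyspark.sql import functions as F

# Hypothetical reference table 'ref' with columns TITLE and CODE
data = (
    data.join(ref, data['COLUMN_TO_VECTORIZE_1'] == ref['TITLE'], 'left')
        # matched rows keep their code; unmatched rows get the dummy '-1'
        .withColumn('RESULTS_COLUMN', F.coalesce(F.col('CODE'), F.lit('-1')))
        .drop('TITLE', 'CODE')
)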

If I wanted to match the whole dataframe, I would go on to write

data = pretrained_model_pipeline.transform(data) # vectorizes/assembles feature column, runs ML algorithm

# other code to perform index to string conversion on labels, and place labels into RESULTS_COLUMN
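
That index-to-string step might look roughly like this. A sketch only, assuming the fitted StringIndexerModel used during training is available under the hypothetical name label_indexer:

from pyspark.ml.feature import IndexToString
from pyspark.sql import functions as F

# Assumption: label_indexer is the fitted StringIndexerModel from training
converter = IndexToString(inputCol='prediction', outputCol='PREDICTED_CODE',
                          labels=label_indexer.labels)

# map numeric predictions back to the original 4-digit string codes
data = (
    converter.transform(data)
             .withColumn('RESULTS_COLUMN', F.col('PREDICTED_CODE'))
             .drop('PREDICTED_CODE')
)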

But that runs the whole dataframe through the algorithm, and anything previously matched does not need to be matched again.

I COULD create a new dataframe from the old one, selecting just those rows with '-1' in the 'RESULTS_COLUMN', but my limited understanding of Spark says that this would essentially double my memory usage.

Is there a way for the pretrained model to be given the whole dataframe to transform, but with some sort of mask telling it to process only the rows that still have '-1' in the results?


1 Answer

红墙和绿瓦, answered 2025-01-24 04:19:35


Spark evaluates transformations lazily and doesn't materialize everything in memory unless you explicitly tell it to (e.g. with cache()). A filter therefore won't double your memory usage, so this is enough:

from pyspark.sql import functions as F

# transform only the rows still carrying the dummy code
data = pretrained_model_pipeline.transform(data.where(F.col('RESULTS_COLUMN') == '-1'))
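
One caveat: the filtered transform returns only the previously unmatched rows, so the already-matched rows need to be brought back in afterwards. A minimal sketch of that recombination, under the same assumptions as above:

from pyspark.sql import functions as F

# lazy filters over the same lineage -- not two materialized copies of the data
matched = data.where(F.col('RESULTS_COLUMN') != '-1')
unmatched = data.where(F.col('RESULTS_COLUMN') == '-1')

# run the pretrained pipeline only on rows that still need a code
predicted = pretrained_model_pipeline.transform(unmatched)
# ... write the predicted labels back into RESULTS_COLUMN here
# (e.g. via the index-to-string step sketched in the question) ...

# keep the original columns and reunite the two halves
data = matched.unionByName(predicted.select(*matched.columns))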