ALS的Pyspark实施如何处理每个用户项目组合的多个评级？

发布于 2025-01-24 03:04:38 字数 3221 浏览 0 评论 0 原文

我观察到，到ALS的输入数据不需要每个用户项目组合都具有唯一的评分。

这是一个可再现的例子。

# Sample Dataframe
df = spark.createDataFrame([(0, 0, 4.0),(0, 1, 2.0), 
(1, 1, 3.0), (1, 2, 4.0), 
(2, 1, 1.0), (2, 2, 5.0)],["user", "item", "rating"])

df.show(50,0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0   |0   |4.0   |
|0   |1   |2.0   |
|1   |1   |3.0   |
|1   |2   |4.0   |
|2   |1   |1.0   |
|2   |2   |5.0   |
+----+----+------+

如您所见，每个用户项目组合只有一个评分（理想的情况）。如果我们将此数据框传递到ALS，它将为您提供以下预测：

# Fitting ALS
from pyspark.ml.recommendation import ALS
als = ALS(rank=5, 
          maxIter=5, 
          seed=0,
          regParam = 0.1,
         userCol='user',
         itemCol='item',
         ratingCol='rating',
         nonnegative=True)
model = als.fit(df)

# predictions from als
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)

predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0   |0   |3.9169915 |
|0   |1   |2.031506  |
|0   |2   |2.3546133 |
|1   |0   |4.9588947 |
|1   |1   |2.8347554 |
|1   |2   |4.003007  |
|2   |0   |0.9958025 |
|2   |1   |1.0896711 |
|2   |2   |4.895194  |
+----+----+----------+

到目前为止，一切对我来说都是有意义的。但是，如果我们有一个包含多个用户项目额定组合的数据框架，如以下所示，

# sample daataframe
df = spark.createDataFrame([(0, 0, 4.0), (0, 0, 3.5),
                            (0, 0, 4.1),(0, 1, 2.0),
                            (0, 1, 1.9),(0, 1, 2.1),
                            (1, 1, 3.0), (1, 1, 2.8),
                            (1, 2, 4.0),(1, 2, 3.6),
                            (2, 1, 1.0), (2, 1, 0.9),
                            (2, 2, 5.0),(2, 2, 4.9)],
                           ["user", "item", "rating"])
df.show(100,0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0   |0   |4.0   |
|0   |0   |3.5   |
|0   |0   |4.1   |
|0   |1   |2.0   |
|0   |1   |1.9   |
|0   |1   |2.1   |
|1   |1   |3.0   |
|1   |1   |2.8   |
|1   |2   |4.0   |
|1   |2   |3.6   |
|2   |1   |1.0   |
|2   |1   |0.9   |
|2   |2   |5.0   |
|2   |2   |4.9   |
+----+----+------+

您可以在上面的数据框架中看到，有一个用户项目组合的多个记录。例如 - 用户'0'对项目'0'额定额定级，分别为4.0,3.5和4.1。

如果我将此输入数据框传递给ALS怎么办？这个可以吗？我最初认为它应该不起作用，因为ALS应该每个用户项目组合获得唯一的评分，但是当我运行此功能时，它可以奏效并使我感到惊讶！

# Fitting ALS
als = ALS(rank=5, 
          maxIter=5, 
          seed=0,
          regParam = 0.1,
         userCol='user',
         itemCol='item',
         ratingCol='rating',
         nonnegative=True)
model = als.fit(df)

# predictions from als
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)

predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0   |0   |3.7877638 |
|0   |1   |2.020348  |
|0   |2   |2.4364853 |
|1   |0   |4.9624424 |
|1   |1   |2.7311888 |
|1   |2   |3.8018093 |
|2   |0   |1.2490809 |
|2   |1   |1.0351425 |
|2   |2   |4.8451777 |
+----+----+----------+

为什么起作用？我认为它会失败，但也没有失败，也给了我预测。

我尝试查看研究论文，有限的ALS源代码以及Internet上的可用信息，但找不到任何有用的东西。是否平均将这些不同的评分拿到ALS或其他任何评分？

有人遇到过类似的事情吗？还是关于ALS如何在内部处理此类数据的任何想法？

原文

I observed that the input data to ALS need not have unique rating per user-item combination.

Here is a reproducible example.

# Sample Dataframe
df = spark.createDataFrame([(0, 0, 4.0),(0, 1, 2.0), 
(1, 1, 3.0), (1, 2, 4.0), 
(2, 1, 1.0), (2, 2, 5.0)],["user", "item", "rating"])

df.show(50,0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0   |0   |4.0   |
|0   |1   |2.0   |
|1   |1   |3.0   |
|1   |2   |4.0   |
|2   |1   |1.0   |
|2   |2   |5.0   |
+----+----+------+

As you can see that, each user-item combination have only one rating (an ideal scenario).
If we pass this dataframe into ALS, it will give you the predictions like below:

# Fitting ALS
from pyspark.ml.recommendation import ALS
als = ALS(rank=5, 
          maxIter=5, 
          seed=0,
          regParam = 0.1,
         userCol='user',
         itemCol='item',
         ratingCol='rating',
         nonnegative=True)
model = als.fit(df)

# predictions from als
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)

predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0   |0   |3.9169915 |
|0   |1   |2.031506  |
|0   |2   |2.3546133 |
|1   |0   |4.9588947 |
|1   |1   |2.8347554 |
|1   |2   |4.003007  |
|2   |0   |0.9958025 |
|2   |1   |1.0896711 |
|2   |2   |4.895194  |
+----+----+----------+

Everything so far make sense to me. But what if we have a dataframe that contains multiple user-item rating combination like below -

# sample daataframe
df = spark.createDataFrame([(0, 0, 4.0), (0, 0, 3.5),
                            (0, 0, 4.1),(0, 1, 2.0),
                            (0, 1, 1.9),(0, 1, 2.1),
                            (1, 1, 3.0), (1, 1, 2.8),
                            (1, 2, 4.0),(1, 2, 3.6),
                            (2, 1, 1.0), (2, 1, 0.9),
                            (2, 2, 5.0),(2, 2, 4.9)],
                           ["user", "item", "rating"])
df.show(100,0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0   |0   |4.0   |
|0   |0   |3.5   |
|0   |0   |4.1   |
|0   |1   |2.0   |
|0   |1   |1.9   |
|0   |1   |2.1   |
|1   |1   |3.0   |
|1   |1   |2.8   |
|1   |2   |4.0   |
|1   |2   |3.6   |
|2   |1   |1.0   |
|2   |1   |0.9   |
|2   |2   |5.0   |
|2   |2   |4.9   |
+----+----+------+

As you can see in above dataframe, there are multiple records of one user-item combination. For Example - user '0' has rated item '0' multiple times i.e. 4.0,3.5 and 4.1 respectively.

What if i pass this input dataframe to ALS? Will this work?
I initially thought it should not work as ALS is supposed to get unique rating per user-item combination but when i ran this, it worked and surprised me!

# Fitting ALS
als = ALS(rank=5, 
          maxIter=5, 
          seed=0,
          regParam = 0.1,
         userCol='user',
         itemCol='item',
         ratingCol='rating',
         nonnegative=True)
model = als.fit(df)

# predictions from als
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)

predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0   |0   |3.7877638 |
|0   |1   |2.020348  |
|0   |2   |2.4364853 |
|1   |0   |4.9624424 |
|1   |1   |2.7311888 |
|1   |2   |3.8018093 |
|2   |0   |1.2490809 |
|2   |1   |1.0351425 |
|2   |2   |4.8451777 |
+----+----+----------+

Why did it work? I thought it will fail but it did not and giving me predictions as well.

I tried looking at research papers, limited source code of ALS and available information on internet but could not find anything useful.
Is it taking average of these different ratings and then passing it to ALS or anything else?

Has anyone encountered similar thing before? Or any idea around how ALS is handling this kind of data internally?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

以往的大感动 2025-01-31 03:04:38

在Spark组中实现的矩阵分解的平行化方法实际上是通过（用户，项目）对的，并添加了用户对同一项目的不同评分。您可以在Spark的GitHub中的Scala代码中自己验证此事，第1377行：

Implementation note: This implementation produces the same result as the following but
   * generates fewer intermediate objects:
   *
   * {{{
   *     ratings.map { r =>
   *       ((srcPart.getPartition(r.user), dstPart.getPartition(r.item)), r)
   *     }.aggregateByKey(new RatingBlockBuilder)(
   *         seqOp = (b, r) => b.add(r),
   *         combOp = (b0, b1) => b0.merge(b1.build()))
   *       .mapValues(_.build())

de seqop确定如何添加两个评分对象的位置。

The parallelized method to matrix factorization implemented in Spark actually groups by (user,item) pairs and adds the different ratings a user made for the same item. You can verify this by yourself in the Scala Code in Spark's github, line 1377:

Implementation note: This implementation produces the same result as the following but
   * generates fewer intermediate objects:
   *
   * {{{
   *     ratings.map { r =>
   *       ((srcPart.getPartition(r.user), dstPart.getPartition(r.item)), r)
   *     }.aggregateByKey(new RatingBlockBuilder)(
   *         seqOp = (b, r) => b.add(r),
   *         combOp = (b0, b1) => b0.merge(b1.build()))
   *       .mapValues(_.build())

where de seqOp determines how to add up two rating objects.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

回复收藏 0 原文

~没有更多了~