How to use ALS with multiple implicit feedback columns?

Posted on 2025-02-08 07:02:58


In the ALS example given in the PySpark documentation (http://spark.apache.org/docs/latest/ml-collaborative-filtering.html), the data used has explicit feedback in a single column. The data looks like this:
| User | Item | Rating |
| --- | --- | --- |
| First | A | 2 |
| Second | B | 3 |

However, in my case I have implicit feedback in multiple columns, like this:
| User | Item | Clicks | Views | Purchase |
| --- | --- | --- | --- | --- |
| First | A | 20 | 35 | 3 |
| Second | B | 3 | 12 | 0 |

I know we can use implicit feedback by setting implicitPrefs to True. However, ALS only accepts a single rating column. How can I use multiple columns?

I found this question: How to manage multiple positive implicit feedbacks? However, it is not related to Spark or the Alternating Least Squares method. Do I have to manually assign a weighting scheme as per that answer, or is there a better solution in PySpark?
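For reference, this is roughly how ALS is trained on a single implicit-feedback column in PySpark; ALS exposes exactly one ratingCol, which is why a multi-column case has to be reduced to one value per (user, item). The DataFrame, column names and hyperparameters below are illustrative assumptions, not part of the question:

from pyspark.ml.recommendation import ALS

# `ratings` is assumed to be a DataFrame with integer ids and one numeric
# implicit-feedback column, e.g. (userId, itemId, clicks).
als = ALS(
    userCol='userId',
    itemCol='itemId',
    ratingCol='clicks',    # only a single column can be passed here
    implicitPrefs=True,    # treat values as implicit confidence, not explicit ratings
    rank=10,
    regParam=0.1,
)
model = als.fit(ratings)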


Comments (1)

抠脚大汉 2025-02-15 07:02:58


I have researched your issue thoroughly and haven't found a way to pass multiple columns to ALS; most problems of this kind are solved by manually weighting the feedback columns and creating a single Rating column.

Below is my solution:

  1. Create an index for the Views, Clicks and Purchase values, as below: extract the smallest non-zero value of each column and divide all elements of that column by it. Example: the minimum non-zero value of the Purchase column is 3, so the Purchase values become 3/3, 10/3, 20/3, and so on.

  2. After getting the indexed values for these columns, calculate the Rating. The formula for Rating is:

Rating = 60% of Purchase + 30% of Clicks + 10% of Views

from pyspark.sql.functions import col, round

data.show()
+------+----+------+-----+--------+
|  User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First|   A|    20|   35|       3|
|Second|   B|     3|   12|       0|
| Three|   C|     4|   15|      20|
|  Four|   D|     5|   16|      10|
+------+----+------+-----+--------+

# smallest non-zero value of each column, used as the divisor ("index")
df1 = data.sort('Purchase').select('Purchase')
df1 = df1.filter(df1.Purchase > 0)
purch_index = df1.first()['Purchase']

df2 = data.sort('Views').select('Views')
df2 = df2.filter(df2.Views > 0)
Views_index = df2.first()['Views']

df3 = data.sort('Clicks').select('Clicks')
df3 = df3.filter(df3.Clicks > 0)
Clicks_index = df3.first()['Clicks']

# divide each column by its smallest non-zero value and round the result
semi_rawdf = data.withColumn('Clicks', round(col('Clicks') / Clicks_index)) \
    .withColumn('Views', round(col('Views') / Views_index)) \
    .withColumn('Purchase', round(col('Purchase') / purch_index))

semi_rawdf.show()

+------+----+------+-----+--------+
|  User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First|   A|   7.0|  3.0|     1.0|
|Second|   B|   1.0|  1.0|     0.0|
| Three|   C|   1.0|  1.0|     7.0|
|  Four|   D|   2.0|  1.0|     3.0|
+------+----+------+-----+--------+

from pyspark.sql.types import DecimalType

# weighted sum of the indexed columns: 30% Clicks + 10% Views + 60% Purchase
# (the Views and Purchase terms are rounded before being added, matching the output below)
refined_df = semi_rawdf.withColumn('Rating', (col('Clicks') * 0.3) + round(col('Views') * 0.1) + round(col('Purchase') * 0.6))
refined_df = refined_df.withColumn('Rating', col('Rating').cast(DecimalType(6, 2)))

refined_df.select('User','Item','Rating').show()

+------+----+------+
|  User|Item|Rating|
+------+----+------+
| First|   A|  3.10|
|Second|   B|  0.30|
| Three|   C|  4.30|
|  Four|   D|  2.60|
+------+----+------+
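To close the loop with the original question: once there is a single Rating column, it can be passed to ALS as implicit feedback. A minimal sketch, assuming the string User/Item ids still need indexing; the StringIndexer stages, the userIdx/itemIdx column names and the ALS hyperparameters are illustrative assumptions, not part of the answer above:

from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# ALS needs numeric ids, so index the string User/Item columns first.
user_indexer = StringIndexer(inputCol='User', outputCol='userIdx')
item_indexer = StringIndexer(inputCol='Item', outputCol='itemIdx')
ratings = user_indexer.fit(refined_df).transform(refined_df)
ratings = item_indexer.fit(ratings).transform(ratings)
ratings = ratings.withColumn('Rating', col('Rating').cast('float'))

als = ALS(
    userCol='userIdx',
    itemCol='itemIdx',
    ratingCol='Rating',
    implicitPrefs=True,   # the combined Rating is a confidence score, not an explicit rating
    rank=10,
    regParam=0.1,
    coldStartStrategy='drop',
)
model = als.fit(ratings)
model.recommendForAllUsers(2).show()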