How to use ALS with multiple implicit feedback columns?

Posted on 2025-02-08 07:02:58


In the ALS example given in the PySpark documentation (http://spark.apache.org/docs/latest/ml-collaborative-filtering.html), the data used has explicit feedback in a single column. The data looks like this:
| User | Item | Rating |
| --- | --- | --- |
| First | A | 2 |
| Second | B | 3 |

However, in my case I have implicit feedback in multiple columns, like this:
| User | Item | Clicks | Views | Purchase |
| --- | --- | --- | --- | --- |
| First | A | 20 | 35 | 3 |
| Second | B | 3 | 12 | 0 |

I know we can use implicit feedback by setting implicitPrefs to True. However, ALS only accepts a single rating column. How can I use multiple columns?

I found this question: How to manage multiple positive implicit feedbacks? However, it is not related to Spark or the Alternating Least Squares method. Do I have to manually assign a weighting scheme as per that answer, or is there a better solution in PySpark?
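For reference, this is roughly how ALS is trained on a single implicit-feedback column in PySpark; ALS exposes exactly one ratingCol, which is why a multi-column case has to be reduced to one value per (user, item). The DataFrame, column names and hyperparameters below are illustrative assumptions, not part of the question:

from pyspark.ml.recommendation import ALS

# `ratings` is assumed to be a DataFrame with integer ids and one numeric
# implicit-feedback column, e.g. (userId, itemId, clicks).
als = ALS(
    userCol='userId',
    itemCol='itemId',
    ratingCol='clicks',    # only a single column can be passed here
    implicitPrefs=True,    # treat values as implicit confidence, not explicit ratings
    rank=10,
    regParam=0.1,
)
model = als.fit(ratings)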


Comments (1)

抠脚大汉 2025-02-15 07:02:58


I have researched your issue thoroughly and haven't found a way to pass multiple columns to ALS; most problems of this kind are solved by manually weighting the feedback columns and creating a single Rating column.

Below is my solution:

  1. Create an index for the Views, Clicks and Purchase values, as below: extract the smallest non-zero value of each column and divide all elements of that column by it. Example: the minimum non-zero value of the Purchase column is 3, so the Purchase values become 3/3, 10/3, 20/3, and so on.

  2. After getting the indexed values for these columns, calculate the Rating. The formula for Rating is:

Rating = 60% of Purchase + 30% of Clicks + 10% of Views

from pyspark.sql.functions import col, round

data.show()
+------+----+------+-----+--------+
|  User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First|   A|    20|   35|       3|
|Second|   B|     3|   12|       0|
| Three|   C|     4|   15|      20|
|  Four|   D|     5|   16|      10|
+------+----+------+-----+--------+

# smallest non-zero value of each column, used as the divisor ("index")
df1 = data.sort('Purchase').select('Purchase')
df1 = df1.filter(df1.Purchase > 0)
purch_index = df1.first()['Purchase']

df2 = data.sort('Views').select('Views')
df2 = df2.filter(df2.Views > 0)
Views_index = df2.first()['Views']

df3 = data.sort('Clicks').select('Clicks')
df3 = df3.filter(df3.Clicks > 0)
Clicks_index = df3.first()['Clicks']

# divide each column by its smallest non-zero value and round the result
semi_rawdf = data.withColumn('Clicks', round(col('Clicks') / Clicks_index)) \
    .withColumn('Views', round(col('Views') / Views_index)) \
    .withColumn('Purchase', round(col('Purchase') / purch_index))

semi_rawdf.show()

+------+----+------+-----+--------+
|  User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First|   A|   7.0|  3.0|     1.0|
|Second|   B|   1.0|  1.0|     0.0|
| Three|   C|   1.0|  1.0|     7.0|
|  Four|   D|   2.0|  1.0|     3.0|
+------+----+------+-----+--------+

from pyspark.sql.types import DecimalType

# weighted sum of the indexed columns: 30% Clicks + 10% Views + 60% Purchase
# (the Views and Purchase terms are rounded before being added, matching the output below)
refined_df = semi_rawdf.withColumn('Rating', (col('Clicks') * 0.3) + round(col('Views') * 0.1) + round(col('Purchase') * 0.6))
refined_df = refined_df.withColumn('Rating', col('Rating').cast(DecimalType(6, 2)))

refined_df.select('User','Item','Rating').show()

+------+----+------+
|  User|Item|Rating|
+------+----+------+
| First|   A|  3.10|
|Second|   B|  0.30|
| Three|   C|  4.30|
|  Four|   D|  2.60|
+------+----+------+
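To close the loop with the original question: once there is a single Rating column, it can be passed to ALS as implicit feedback. A minimal sketch, assuming the string User/Item ids still need indexing; the StringIndexer stages, the userIdx/itemIdx column names and the ALS hyperparameters are illustrative assumptions, not part of the answer above:

from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# ALS needs numeric ids, so index the string User/Item columns first.
user_indexer = StringIndexer(inputCol='User', outputCol='userIdx')
item_indexer = StringIndexer(inputCol='Item', outputCol='itemIdx')
ratings = user_indexer.fit(refined_df).transform(refined_df)
ratings = item_indexer.fit(ratings).transform(ratings)
ratings = ratings.withColumn('Rating', col('Rating').cast('float'))

als = ALS(
    userCol='userIdx',
    itemCol='itemIdx',
    ratingCol='Rating',
    implicitPrefs=True,   # the combined Rating is a confidence score, not an explicit rating
    rank=10,
    regParam=0.1,
    coldStartStrategy='drop',
)
model = als.fit(ratings)
model.recommendForAllUsers(2).show()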