How to produce recommendations with an incremental SVD recommender system

Posted on 2024-12-27 01:58:02

I am testing a recommender system built according to Simon Funk's algorithm (the implementation written by Timely Dev: http://www.timelydevelopment.com/demos/NetflixPrize.aspx).

The problem is that all incremental SVD algorithms try to predict the rating for a given user_id and movie_id pair. A real system, however, should produce a list of new items for the active user. I have seen some people run kNN after incremental SVD, but unless I am missing something, running kNN after building the model with incremental SVD would throw away all the performance gain.

Does anyone have experience with the incremental SVD / Simon Funk method who can tell me how to produce a list of new recommended items?

夏末的微笑 2025-01-03 01:58:02


The way to produce recommended movies:

  1. Take the list of movies the user has not yet viewed.
  2. Multiply each movie's feature vector by the user's feature vector (a dot product).
  3. Sort by the result in descending order and take the top movies.

For the theory: pretend there are only two dimensions (comedy and drama). If I love comedies but hate dramas, my feature vector is [1.0, 0.0]. If you compare me against the following movies:

Comedy:  [1.0, 0.0] x [1.0, 0.0] = 1
Dramedy: [0.5, 0.5] x [1.0, 0.0] = 0.5
Drama:   [0.0, 1.0] x [1.0, 0.0] = 0

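The three steps can be sketched in a few lines of Python using the toy two-dimensional vectors from the example. This is only an illustration: the `rank_unseen` helper, the `unseen_movies` dict, and the movie titles are hypothetical names, not part of any library or of the Timely Dev code.

```python
import numpy as np

def rank_unseen(user_vector, unseen_movies, top_n=10):
    """Score each unseen movie by the dot product of its feature vector
    with the user's feature vector and return the top_n best matches."""
    scores = {title: float(np.dot(vec, user_vector))
              for title, vec in unseen_movies.items()}
    # Sort descending by score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

user = np.array([1.0, 0.0])  # all weight on comedy, none on drama
unseen_movies = {
    "Drama": np.array([0.0, 1.0]),
    "Comedy": np.array([1.0, 0.0]),
    "Dramedy": np.array([0.5, 0.5]),
}
ranking = rank_unseen(user, unseen_movies)
print(ranking)  # [('Comedy', 1.0), ('Dramedy', 0.5), ('Drama', 0.0)]
```
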
稍尽春風 2025-01-03 01:58:02

Here is some simple Python code based on the Yelp Netflix code. If you install Numba, it will run at C speeds.


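# data_loader.py (the module imported by the training script further down)
import os
import numpy as np
from scipy import sparse

class DataLoader:
    @staticmethod
    def create_review_matrix(file_path):
        # Each line holds tab-separated fields; the first three are
        # user id, movie id and rating
        with open(file_path) as f:
            data = np.array([[int(tok) for tok in line.split('\t')[:3]]
                             for line in f])

        ij = data[:, :2]
        ij -= 1  # shift the 1-based ids to 0-based matrix indices
        values = data[:, 2]
        # Rows are users, columns are movies, cells are ratings
        review_matrix = sparse.csc_matrix((values, ij.T)).astype(float)
        return review_matrix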

if __name__ == '__main__':
    # Scratch checks, guarded so they do not run when the module is imported
    movielens_file_path = '%s/Downloads/ml-100k/u1.base' % os.environ['HOME']

    my_reviews = DataLoader.create_review_matrix(movielens_file_path)

    # Ratings made by the user in row 8
    user_reviews = my_reviews[8]
    user_reviews = user_reviews.toarray().ravel()
    user_rated_movies, = np.where(user_reviews > 0)
    user_ratings = user_reviews[user_rated_movies]

    # Ratings received by the movie in column 201
    movie_reviews = my_reviews[:, 201]
    movie_reviews = movie_reviews.toarray().ravel()
    movie_rated_users, = np.where(movie_reviews > 0)
    movie_ratings = movie_reviews[movie_rated_users]

    user_pseudo_average_ratings = {}
    user_pseudo_average_ratings[8] = np.mean(user_ratings)
    user_pseudo_average_ratings[9] = np.mean(user_ratings)
    user_pseudo_average_ratings[10] = np.mean(user_ratings)
    users, movies = my_reviews.nonzero()

    users_matrix = np.empty((3, 3))
    users_matrix[:] = 0.1

    movies_matrix = np.empty((3, 3))
    movies_matrix[:] = 0.1

    result = users_matrix[0] * movies_matrix[0]
    # Slicing a column returns a view, so this also modifies movies_matrix
    otro = movies_matrix[:, 2]
    otro[2] = 8


# Requires the MovieLens 100k data
import os

import numpy as np
from numba import jit
from data_loader import DataLoader

def get_user_ratings(user_id, review_matrix):
    """
    Returns a numpy array with the ratings that user_id has made

    :rtype : numpy array
    :param user_id: the id of the user
    :return: a numpy array with the ratings that user_id has made
    """
    user_reviews = review_matrix[user_id]
    user_reviews = user_reviews.toarray().ravel()
    user_rated_movies, = np.where(user_reviews > 0)
    user_ratings = user_reviews[user_rated_movies]
    return user_ratings

def get_movie_ratings(movie_id, review_matrix):
    """
    Returns a numpy array with the ratings that movie_id has received

    :rtype : numpy array
    :param movie_id: the id of the movie
    :return: a numpy array with the ratings that movie_id has received
    """
    movie_reviews = review_matrix[:, movie_id]
    movie_reviews = movie_reviews.toarray().ravel()
    movie_rated_users, = np.where(movie_reviews > 0)
    movie_ratings = movie_reviews[movie_rated_users]
    return movie_ratings

def create_user_feature_matrix(review_matrix, NUM_FEATURES, FEATURE_INIT_VALUE):
    """
    Creates a user feature matrix of size NUM_FEATURES X NUM_USERS
    with all cells initialized to FEATURE_INIT_VALUE

    :rtype : numpy matrix
    :return: a matrix of size NUM_FEATURES X NUM_USERS
    with all cells initialized to FEATURE_INIT_VALUE
    """
    num_users = review_matrix.shape[0]
    user_feature_matrix = np.empty((NUM_FEATURES, num_users))
    user_feature_matrix[:] = FEATURE_INIT_VALUE
    return user_feature_matrix

def create_movie_feature_matrix(review_matrix, NUM_FEATURES, FEATURE_INIT_VALUE):
    """
    Creates a movie feature matrix of size NUM_FEATURES X NUM_MOVIES
    with all cells initialized to FEATURE_INIT_VALUE

    :rtype : numpy matrix
    :return: a matrix of size NUM_FEATURES X NUM_MOVIES
    with all cells initialized to FEATURE_INIT_VALUE
    """
    num_movies = review_matrix.shape[1]
    movie_feature_matrix = np.empty((NUM_FEATURES, num_movies))
    movie_feature_matrix[:] = FEATURE_INIT_VALUE
    return movie_feature_matrix

def predict_rating(user_id, movie_id, user_feature_matrix, movie_feature_matrix):
    """
    Makes a prediction of the rating that user_id will give to movie_id if
    he/she sees it

    :rtype : float
    :param user_id: the id of the user
    :param movie_id: the id of the movie
    :return: a float in the range [1, 5] with the predicted rating for
    movie_id by user_id
    """
    # Start from a base rating of 1 and add the dot product of the
    # user's and the movie's feature vectors
    rating = 1.
    for f in range(user_feature_matrix.shape[0]):
        rating += user_feature_matrix[f, user_id] * movie_feature_matrix[f, movie_id]

    # We trim the ratings in case they go above or below the stars range
    if rating > 5: rating = 5
    elif rating < 1: rating = 1
    return rating

def sgd_inner(feature, A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES):
    """
    Runs one stochastic gradient descent pass over all the ratings,
    updating only the given feature, and returns the summed squared error
    """
    K = 0.015  # regularization strength
    LEARNING_RATE = 0.001
    squared_error = 0
    for k in range(len(A_data)):
        user_id = A_row[k]
        movie_id = A_col[k]
        rating = A_data[k]
        p = predict_rating(user_id, movie_id, user_feature_matrix, movie_feature_matrix)
        err = rating - p

        squared_error += err ** 2

        # Cache the old values so both updates use the pre-update features
        user_feature_value = user_feature_matrix[feature, user_id]
        movie_feature_value = movie_feature_matrix[feature, movie_id]
        user_feature_matrix[feature, user_id] += \
            LEARNING_RATE * (err * movie_feature_value - K * user_feature_value)
        movie_feature_matrix[feature, movie_id] += \
            LEARNING_RATE * (err * user_feature_value - K * movie_feature_value)

    return squared_error

def calculate_features(A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES):
    """
    Iterates through all the ratings in search of the best features that
    minimize the error between the predictions and the real ratings.
    This is the main function in Simon Funk's SVD algorithm

    :rtype : float
    :return: the RMSE of the last training pass
    """
    MIN_IMPROVEMENT = 0.0001
    rmse = 0
    last_rmse = 0
    num_ratings = len(A_data)
    print(num_ratings)
    for feature in range(NUM_FEATURES):
        iteration = 0
        # Train this feature for at least MIN_ITERATIONS passes (a constant
        # read from module scope), then keep going while the RMSE still
        # improves by more than MIN_IMPROVEMENT per pass
        while (iteration < MIN_ITERATIONS) or (rmse < last_rmse - MIN_IMPROVEMENT):
            last_rmse = rmse
            squared_error = sgd_inner(feature, A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES)
            rmse = (squared_error / num_ratings) ** 0.5
            iteration += 1
        print('Squared error = %f' % squared_error)
        print('RMSE = %f' % rmse)
        print('Feature = %d' % feature)
    return rmse

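LAMBDA = 0.02  # not used below

# NUM_FEATURES, FEATURE_INIT_VALUE and MIN_ITERATIONS are not defined in the
# code as posted; the values below are assumptions, adjust them as needed
NUM_FEATURES = 20
FEATURE_INIT_VALUE = 0.1
MIN_ITERATIONS = 50

movielens_file_path = '%s/Downloads/ml-100k/u1.base' % os.environ['HOME']

A = DataLoader.create_review_matrix(movielens_file_path)

# Save the review matrix in Matrix Market format
from scipy.io import mmwrite
os.makedirs('./data', exist_ok=True)
mmwrite('./data/A', A)

user_feature_matrix = create_user_feature_matrix(A, NUM_FEATURES, FEATURE_INIT_VALUE)
movie_feature_matrix = create_movie_feature_matrix(A, NUM_FEATURES, FEATURE_INIT_VALUE)

# COO format exposes the row, col and data arrays of the stored ratings
A = A.tocoo()

rmse = calculate_features(A.row, A.col, A.data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES)
print('rmse', rmse)
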
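The script imports `jit` from Numba but, as posted, never applies it, so the loops would run as ordinary Python. One possible way to get the compiled speed the answer mentions is to decorate the hot functions. The sketch below is an assumption about how that could look, not the answer author's code: it uses Numba's `njit` decorator and falls back to a no-op decorator when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Numba is not installed: fall back to running plain Python
    def njit(func):
        return func

@njit
def predict_rating(user_id, movie_id, user_feature_matrix, movie_feature_matrix):
    # Base rating of 1 plus the dot product of the two feature columns
    rating = 1.0
    for f in range(user_feature_matrix.shape[0]):
        rating += user_feature_matrix[f, user_id] * movie_feature_matrix[f, movie_id]
    # Clip to the 1..5 star range
    return min(max(rating, 1.0), 5.0)

# Tiny hypothetical matrices: 3 features, 2 users, 2 movies
users = np.full((3, 2), 0.1)
movies = np.full((3, 2), 0.1)
prediction = predict_rating(0, 1, users, movies)
print(prediction)
```

The same decorator could be applied to `sgd_inner`, since it only touches NumPy arrays and scalars and calls `predict_rating`.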
深居我梦 2025-01-03 01:58:02

I think this is a big question, as there are many recommender approaches that could be called "incremental SVD". To answer your specific question: kNN is run on the projected item space, not the original space, so it should be quite fast.
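To illustrate why this is cheap: after factorization each item is a short k-dimensional vector, so a neighbor search compares k numbers per item instead of one entry per user. A rough NumPy sketch follows, assuming a hypothetical `item_features` array of shape (k, m) and cosine similarity as the metric (the answer does not name one).

```python
import numpy as np

def nearest_items(item_features, item_id, n_neighbors=5):
    """Return the ids of the n_neighbors items most similar to item_id,
    using cosine similarity between latent feature columns."""
    norms = np.linalg.norm(item_features, axis=0)
    norms[norms == 0] = 1.0            # avoid dividing by zero
    unit = item_features / norms       # normalize every item column
    sims = unit.T @ unit[:, item_id]   # shape (m,): similarity to the query
    sims[item_id] = -np.inf            # never return the query item itself
    return np.argsort(-sims)[:n_neighbors]

# Toy latent space: k = 2 features, m = 4 items
item_features = np.array([[1.0, 0.9, 0.0, 0.1],
                          [0.0, 0.1, 1.0, 0.9]])
neighbors = nearest_items(item_features, item_id=0, n_neighbors=2)
print(neighbors)  # items 1 and 3 are closest to item 0
```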

野侃 2025-01-03 01:58:02

Assume you have n users and m items. After incremental SVD you have k trained features. To get new items for a given user, multiply the 1 x k user feature vector by the k x m item feature matrix. You end up with m predicted ratings for that user, one per item. Then just sort them, remove the items the user has already seen, and show some number of new ones.
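A minimal sketch of that procedure with NumPy, using made-up numbers. The `recommend` function, its parameters, and the sample matrices are hypothetical; a real system would use the feature matrices produced by training.

```python
import numpy as np

def recommend(user_vector, item_matrix, seen_items, top_n=3):
    """user_vector has shape (k,), item_matrix has shape (k, m).
    Returns the ids of the top_n highest-scoring items not yet seen."""
    scores = (user_vector @ item_matrix).astype(float)  # one score per item
    seen = np.unique(np.fromiter(seen_items, dtype=np.intp))
    scores[seen] = -np.inf             # push already-seen items to the end
    order = np.argsort(-scores)        # item ids sorted by descending score
    n_unseen = scores.size - seen.size
    return order[:min(top_n, n_unseen)]

# k = 2 features, m = 5 items
user_vector = np.array([1.0, 0.5])
item_matrix = np.array([[0.9, 0.1, 0.5, 0.8, 0.2],
                        [0.1, 0.9, 0.5, 0.0, 0.4]])
recommended = recommend(user_vector, item_matrix, seen_items={0}, top_n=3)
print(recommended)  # item 0 scores highest but was already seen
```

Scoring every item costs one k x m multiplication per user, so producing a list needs no separate neighbor search.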
