How to use custom item identifiers in scikit-learn NearestNeighbors

Posted on 2025-01-18 17:51:21

I use scikit-learn's NearestNeighbors as a movie recommendation engine on the MovieLens dataset. The engine is an item-item recommender (the neighbors of an item are other items).

I'm putting a pickled version of the fitted model behind a Flask API application.

I work with the MovieLens IDs (identifiers) of the movies, which are not contiguous. For example: movie 1, movie 2, movie 3, movie 7, movie 11, etc. (there are no movies 5, 6, 8, 9, 10). Those IDs are stored as the Int64Index of a pandas DataFrame.

To retrieve the proper movie neighbors, I need two things:

  • to retrieve the correct movie vector based on its MovieLens ID
  • to have the NearestNeighbors algorithm return those custom IDs as neighbors, instead of the contiguous row indices of the numpy array

Otherwise, I won't be able to relate the numpy indices to the MovieLens IDs.

Is there a way to do that? Or do you think there is another approach that would fit my use case?

I looked at the source code of the NearestNeighbors.fit() method, and it looks like the DataFrame gets turned into a regular numpy array at some point and "forgets" about the custom IDs.
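A minimal sketch of the behaviour described above. The five-movie rating matrix and the user column names are made up for illustration; only the index values mirror the non-contiguous IDs from the question. It fits on a DataFrame and checks what kneighbors() returns:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy item-by-user rating matrix (assumed data); the index mimics MovieLens IDs.
ratings = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 1.0, 5.0],
     [0.0, 0.0, 4.0],
     [1.0, 0.0, 4.0]],
    index=pd.Index([1, 2, 3, 7, 11], name="movieId"),
    columns=["u1", "u2", "u3"],
)

knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
knn.fit(ratings)

# Query with the rating vector of movie 7, which sits at row position 3.
query = ratings.loc[[7]]
distances, positions = knn.kneighbors(query)

# The closest match is the movie itself, reported as row position 3, not as ID 7.
print(positions)
```

The returned values are row positions between 0 and 4, so the labels 7 and 11 can never appear in the output.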

I'm surprised there is no such option for such a common use case. Maybe I am missing something.

Here is the code I use so far:


# The training part:
# ==================

import pickle

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv('movie_lens_ratings.csv')

# Here the custom movie IDs are stored as the index of the pivot DF:
df_pivot = df.pivot_table(index='movieId', columns='userId', values='rating').fillna(0)

sparse = csr_matrix(df_pivot.values)

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=5)
# Fit on the sparse matrix so that model._fit_X is a csr_matrix, as the API part expects:
knn.fit(sparse)

with open('knn_movie_lens.pkl', 'wb') as f:
    pickle.dump(knn, f)

# The API part:
# ==================

class KnnRecommender:
    def __init__(self, movie_lens_id):
        self.movie_lens_id = movie_lens_id

    def recommend(self):
        with open('knn_movie_lens.pkl', 'rb') as f:
            model: NearestNeighbors = pickle.load(f)
        # Here I fetch the training data. It's a plain matrix,
        # without any custom indices:
        data: csr_matrix = model._fit_X

        # This will return wrong movie vector, since self.movie_lens_id does not
        # match the numpy indices :
        movie_rating_vector: np.ndarray = data.getrow(self.movie_lens_id).toarray()

        # Here, the neighbors are contiguous numpy indices. I cannot use them
        # to retrieve proper movies from my database :
        distances, neighbors = model.kneighbors(movie_rating_vector)

        return neighbors

Thank you

Comments (1)

甜警司 2025-01-25 17:51:21

Indeed, sklearn converts to numpy arrays. They have made strides toward keeping additional information (column names), but processing will probably continue to be in numpy for its efficiency.

I would just save the Movie Lens ID list/array/index as an extra attribute, and map from the predicted contiguous numpy indices to movie IDs.
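One possible reading of this suggestion, as a sketch rather than a definitive implementation: attach the pivot table's index to the fitted model before pickling, then translate in both directions at query time. The attribute name movie_ids_, the recommend() helper and the toy data are assumptions for illustration, and _fit_X is the same private attribute the question already relies on.

```python
import pickle

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for df_pivot (assumed data): rows are movies, index holds the IDs.
df_pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 1.0, 5.0],
     [0.0, 0.0, 4.0],
     [1.0, 0.0, 4.0]],
    index=pd.Index([1, 2, 3, 7, 11], name="movieId"),
    columns=["u1", "u2", "u3"],
)

# Training side: fit on the sparse matrix and attach the IDs to the model.
knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
knn.fit(csr_matrix(df_pivot.values))
knn.movie_ids_ = df_pivot.index  # hypothetical attribute: row position -> movie ID
payload = pickle.dumps(knn)      # the extra attribute is pickled with the model


def recommend(model, movie_id):
    """Return the MovieLens IDs of the nearest neighbors of movie_id."""
    ids = model.movie_ids_
    pos = ids.get_loc(movie_id)        # movie ID -> row position (KeyError if unknown)
    vector = model._fit_X.getrow(pos)  # private attribute, as in the question
    # Request one extra neighbor because the query movie usually matches itself.
    _, positions = model.kneighbors(vector, n_neighbors=model.n_neighbors + 1)
    neighbor_ids = ids[positions[0]]   # row positions -> movie IDs
    return [int(m) for m in neighbor_ids if m != movie_id][: model.n_neighbors]


# API side: unpickle once, then answer requests by MovieLens ID.
model = pickle.loads(payload)
print(recommend(model, 7))
```

In a Flask application the model would typically be unpickled once at startup rather than on every request, and an unknown movie ID surfaces as a KeyError from get_loc() that the endpoint can turn into a 404.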
