How to use custom item identifiers in scikit-learn NearestNeighbors

Posted 2025-01-18 17:51:21 | Word count: 2070 | Views: 6 | Comments: 0

I use scikit-learn's NearestNeighbors as a movie recommendation engine on the MovieLens dataset. The engine is an item-item recommender (the neighbors of one item are other items).

I'm serving a pickled version of the fitted model behind a Flask API application.

I work with the MovieLens IDs (identifiers) of the movies, which are not contiguous. For example: movie 1, movie 2, movie 3, movie 7, movie 11, etc. (there are no movies 5, 6, 8, 9, 10). Those IDs are stored as the Int64Index of a pandas DataFrame.
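To make the gap between IDs and row positions concrete, here is a minimal sketch with made-up ratings (the values and the variable names `ratings`, `pivot`, `movie_ids` and `row_of_movie_7` are only for illustration, not my real data). It shows that the pivot index holds the MovieLens IDs, while the position of a movie in the matrix is a different number.

```python
import pandas as pd

# Toy ratings using the non-contiguous movie IDs from the example above
# (made-up values, not real MovieLens data).
ratings = pd.DataFrame({
    "movieId": [1, 2, 3, 7, 11, 1, 7],
    "userId": [10, 10, 10, 10, 10, 20, 20],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5, 3.5, 1.0],
})

# One row per movie, one column per user; the index holds the MovieLens IDs.
pivot = ratings.pivot_table(index="movieId", columns="userId", values="rating").fillna(0)

movie_ids = pivot.index.tolist()         # labels: the MovieLens IDs
row_of_movie_7 = pivot.index.get_loc(7)  # position of movie 7 in the matrix

print(movie_ids)       # [1, 2, 3, 7, 11]
print(row_of_movie_7)  # 3
```

So movie 7 sits at row 3, and any code that uses the ID directly as a row number reads the wrong row (or runs past the end of the array).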

To retrieve the proper movie neighbors, I need two things:

1. Retrieve the correct movie vector based on its MovieLens ID.
2. Have the NearestNeighbors algorithm return those custom IDs as neighbors, instead of the contiguous row indices of the NumPy array.

Otherwise, I won't be able to map the NumPy row indices back to the MovieLens IDs.
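For illustration, the kind of round trip I have in mind would look roughly like the sketch below. It is only one possible approach, under the assumption that the DataFrame index is kept (for instance pickled) next to the fitted model. The toy matrix and the helper name `recommend_ids` are made up for this example and are not part of scikit-learn or of my real code.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Made-up ratings matrix: rows are movies (non-contiguous IDs), columns are users.
pivot = pd.DataFrame(
    [[5, 4, 0], [5, 5, 0], [0, 0, 5], [4, 5, 1], [0, 1, 4]],
    index=pd.Index([1, 2, 3, 7, 11], name="movieId"),
    columns=[10, 20, 30],
    dtype=float,
)
matrix = pivot.to_numpy()

knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=3)
knn.fit(matrix)

# Assumption: this array is stored alongside the model, since the model
# itself only knows row positions.
movie_ids = pivot.index.to_numpy()


def recommend_ids(movie_lens_id, n_neighbors=3):
    """Hypothetical helper: MovieLens ID in, neighbor MovieLens IDs out."""
    row = pivot.index.get_loc(movie_lens_id)   # MovieLens ID -> row position
    vector = matrix[row].reshape(1, -1)
    _, positions = knn.kneighbors(vector, n_neighbors=n_neighbors)
    neighbor_ids = movie_ids[positions[0]]     # row positions -> MovieLens IDs
    # Drop the query movie itself, which comes back as its own nearest neighbor.
    return [int(m) for m in neighbor_ids if m != movie_lens_id]


print(recommend_ids(7))
```

The translation happens entirely outside the model: `Index.get_loc` goes from ID to row, and indexing `movie_ids` with the returned positions goes from rows back to IDs.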

Is there a way to do that? Or is there another approach that would fit my use case?

I looked at the source code of the NearestNeighbors.fit() method, and it looks like the DataFrame gets converted into a regular NumPy array at some point, which "forgets" the custom IDs.
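A small check of that observation, as a sketch on a made-up three-movie frame. It reads the private attribute `_fit_X`, which is an implementation detail and may change between scikit-learn versions. The names `frame`, `stored` and `neighbors` are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Three movies with IDs 3, 7 and 11 (made-up feature values).
frame = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    index=pd.Index([3, 7, 11], name="movieId"),
)

model = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
model.fit(frame)

# Private attribute: the training data as stored by the fitted model.
stored = model._fit_X

# Query with the first row (movie ID 3) as a plain 2-D array.
_, neighbors = model.kneighbors(frame.to_numpy()[[0]])

print(type(stored).__name__)  # ndarray
print(neighbors)              # [[0 1]]
```

The neighbors come back as positions 0 and 1, not as the IDs 3 and 7, which matches what I see with my real data.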

I'm surprised there is no such option for such a common use case. Maybe I am missing something.

Here is the code I use so far:


# The training part :
# ==================

import pandas as pd
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
import pickle

df = pd.read_csv('movie_lens_ratings.csv')

# Here the custom movie IDs are stored as the index of the pivot DF :
df_pivot = df.pivot_table(index='movieId', columns='userId', values='rating').fillna(0)

sparse = csr_matrix(df_pivot.values)

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=5)
knn.fit(df_pivot)

# Close the file handle once the model has been written :
with open('knn_movie_lens.pkl', 'wb') as f:
    pickle.dump(knn, f)

# The API part :
# ==================

import pickle

import numpy as np
from sklearn.neighbors import NearestNeighbors

class KnnRecommender:
    def __init__(self, movie_lens_id):
        self.movie_lens_id = movie_lens_id

    def recommend(self):
        with open('knn_movie_lens.pkl', 'rb') as f:
            model: NearestNeighbors = pickle.load(f)
        # Here I fetch the training data. It's a regular numpy array,
        # without any custom indices :
        data: np.ndarray = model._fit_X

        # This will return the wrong movie vector, since self.movie_lens_id
        # is a MovieLens ID, not a row position in the numpy array :
        movie_rating_vector: np.ndarray = data[self.movie_lens_id].reshape(1, -1)

        # Here, the neighbors are contiguous numpy indices. I cannot use them
        # to retrieve proper movies from my database :
        distances, neighbors = model.kneighbors(movie_rating_vector)

        return neighbors

Thank you
