Is there a way to speed up a pandas function that extracts items from a list by their index position?

Posted 2025-02-10 11:26:23


I'm using some machine learning from the SBERT python module to calculate the top-K most similar strings between an input corpus and a target corpus (in this case 100K vs 100K in size).

The module is pretty robust and gets the comparison done pretty fast, returning a list of dictionaries containing the top-K most similar comparisons for each input string, in the format:

{Corpus ID : Similarity_Score}

I can then wrap it up in a dataframe, with the query string list used as the index, giving me a dataframe in the format:

Query_String | Corpus_ID | Similarity_Score
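
Roughly, the assembly looks like this (a simplified sketch, not my exact code; query_emb, corpus_emb and query_strings are placeholder names, and util.semantic_search returns one list of {'corpus_id': ..., 'score': ...} dicts per query):

import pandas as pd
from sentence_transformers import util

# semantic_search returns, per query, a list of {'corpus_id': int, 'score': float}
hits = util.semantic_search(query_emb, corpus_emb, top_k=5)

# one row per (query, hit) pair; each cell of 'dictionary' holds one hit dict
output_df = pd.DataFrame({'dictionary': hits}, index=query_strings).explode('dictionary')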

The main time-sink with my approach, however, is matching the Corpus ID back to its string in the corpus, so I know which string each input was matched against. My current solution uses a combination of pandas apply with the pandarallel module:

def retrieve_match_text(row, corpus_list):
    dict_obj = row['dictionary']

    corpus_id = dict_obj['corpus_id']  # corpus_id is an integer index into the corpus list
    score = dict_obj['score']

    matched_corpus_keyword = corpus_list[corpus_id]  # list index lookup (speed this up)

    return [matched_corpus_keyword, score]

.....
.....

# expand the dictionary into two columns and match the corpus KW to its ID
output_df[['Matched Corpus KW', 'Score']] = output_df.parallel_apply(
                            lambda x: pd.Series(retrieve_match_text(x, sentence_list_2)), axis=1)

This takes around 2 minutes for an input corpus of 100K against another corpus of 100K in size. However, I'm dealing with corpora several million strings in size, so any further increase in speed here is welcome.

Comments (1)

止于盛夏 2025-02-17 11:26:23


If I read the question correctly, you have the columns: Query_String and dictionary (is this correct?).
And then corpus_id and score are stored in that dictionary.

Your first target with pandas should be to work in a pandas-friendly way: avoid the nested dictionary and store values directly in columns. After that, you can use efficient pandas operations.

Indexing a list is not what is slow for you. Done correctly, this can be a whole-table merge/join that needs no slow row-by-row apply or per-row dictionary lookups.

Step 1. If you do this:

target_corpus = pd.Series(sentence_list_2, name="target_corpus")

Then you have an indexed Series holding the corpus (this replaces the raw list lookup).

Step 2. Get score and corpus_id as real columns in your main dataframe.

Step 3. Use pd.merge to join your main dataframe on corpus_id against the index of target_corpus, using how="left" (only items that match an existing corpus_id are relevant). This should be an efficient way to do it, since it's a whole-dataframe operation.
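
Putting the three steps together, a minimal sketch (assuming, per the question, that output_df has a 'dictionary' column of {'corpus_id': ..., 'score': ...} dicts and that sentence_list_2 is the target corpus as a plain Python list):

import pandas as pd

# Step 1: wrap the target corpus in a Series; its default RangeIndex
# doubles as the corpus_id
target_corpus = pd.Series(sentence_list_2, name="target_corpus")

# Step 2: flatten the nested dicts into real columns in one pass
flat = pd.DataFrame(output_df["dictionary"].tolist(), index=output_df.index)
output_df["corpus_id"] = flat["corpus_id"]
output_df["score"] = flat["score"]

# Step 3: whole-table left join of corpus_id against the Series index
output_df = output_df.merge(
    target_corpus, left_on="corpus_id", right_index=True, how="left"
)
# output_df now carries the matched string in a 'target_corpus' column

This replaces the per-row list lookup with a single vectorized join.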

Develop and test the solution against a small subset (1K) to iterate quickly, then scale up.
