Is there a way to speed up a pandas function that extracts items from a list by their index position?

Posted 2025-02-10 11:26:23


I'm using some machine learning from the SBERT python module to calculate the top-K most similar strings between an input corpus and a target corpus (in this case 100K vs 100K in size).

The module is pretty robust and gets the comparison done pretty fast, returning a list of dictionaries containing the top-K most similar comparisons for each input string, in the format:

{Corpus ID : Similarity_Score}

I can then wrap it up in a dataframe, with the query string list used as the index, giving me a dataframe in the format:

Query_String | Corpus_ID | Similarity_Score
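
Roughly, the assembly looks like this (a simplified sketch, not my exact code; query_emb, corpus_emb and query_strings are placeholder names, and util.semantic_search returns one list of {'corpus_id': ..., 'score': ...} dicts per query):

import pandas as pd
from sentence_transformers import util

# semantic_search returns, per query, a list of {'corpus_id': int, 'score': float}
hits = util.semantic_search(query_emb, corpus_emb, top_k=5)

# one row per (query, hit) pair; each cell of 'dictionary' holds one hit dict
output_df = pd.DataFrame({'dictionary': hits}, index=query_strings).explode('dictionary')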

The main time-sink with my approach, however, is matching the Corpus ID back to its string in the corpus, so I know which string each input was matched against. My current solution uses a combination of pandas apply with the pandarallel module:

def retrieve_match_text(row, corpus_list):
    dict_obj = row['dictionary']

    corpus_id = dict_obj['corpus_id']  # corpus_id is an integer index into the corpus list
    score = dict_obj['score']

    matched_corpus_keyword = corpus_list[corpus_id]  # list index lookup (speed this up)

    return [matched_corpus_keyword, score]

.....
.....

# expand the dictionary into two columns and match the corpus KW to its ID
output_df[['Matched Corpus KW', 'Score']] = output_df.parallel_apply(
                            lambda x: pd.Series(retrieve_match_text(x, sentence_list_2)), axis=1)

This takes around 2 minutes for an input corpus of 100K against another corpus of 100K in size. However, I'm dealing with corpora several million strings in size, so any further increase in speed here is welcome.

Comments (1)

止于盛夏 2025-02-17 11:26:23


If I read the question correctly, you have the columns: Query_String and dictionary (is this correct?).
And then corpus_id and score are stored in that dictionary.

Your first target with pandas should be to work in a pandas-friendly way: avoid the nested dictionary and store values directly in columns. After that, you can use efficient pandas operations.

Indexing a list is not what is slow for you. Done correctly, this can be a whole-table merge/join that needs no slow row-by-row apply or per-row dictionary lookups.

Step 1. If you do this:

target_corpus = pd.Series(sentence_list_2, name="target_corpus")

Then you have an indexed Series holding the corpus (this replaces the raw list lookup).

Step 2. Get score and corpus_id as real columns in your main dataframe.

Step 3. Use pd.merge to join your main dataframe on corpus_id against the index of target_corpus, using how="left" (only items that match an existing corpus_id are relevant). This should be an efficient way to do it, since it's a whole-dataframe operation.
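
Putting the three steps together, a minimal sketch (assuming, per the question, that output_df has a 'dictionary' column of {'corpus_id': ..., 'score': ...} dicts and that sentence_list_2 is the target corpus as a plain Python list):

import pandas as pd

# Step 1: wrap the target corpus in a Series; its default RangeIndex
# doubles as the corpus_id
target_corpus = pd.Series(sentence_list_2, name="target_corpus")

# Step 2: flatten the nested dicts into real columns in one pass
flat = pd.DataFrame(output_df["dictionary"].tolist(), index=output_df.index)
output_df["corpus_id"] = flat["corpus_id"]
output_df["score"] = flat["score"]

# Step 3: whole-table left join of corpus_id against the Series index
output_df = output_df.merge(
    target_corpus, left_on="corpus_id", right_index=True, how="left"
)
# output_df now carries the matched string in a 'target_corpus' column

This replaces the per-row list lookup with a single vectorized join.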

Develop and test the solution against a small subset (1K) to iterate quickly, then scale up.
