Is there a way to speed up a pandas function that extracts values from a list by their index position?
I'm using some machine learning from the SBERT Python module to calculate the top-K most similar strings given an input corpus and a target corpus (in this case 100K vs 100K in size).
The module is pretty robust and gets the comparison done pretty fast, returning me a list of dictionaries containing the top-K most similar comparisons for each input string in the format:
{Corpus ID : Similarity_Score}
I can then wrap it up in a dataframe with the query string list used as an index, giving me a dataframe in the format:
Query_String | Corpus_ID | Similarity_Score
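For illustration, roughly what that wrapping step looks like (hits and query_strings are placeholder names here, not my real variables; output_df and the 'dictionary' column match the code further down):

import pandas as pd

# one top-k result dictionary per query string (illustrative values only)
hits = [{'corpus_id': 17, 'score': 0.83},
        {'corpus_id': 4, 'score': 0.79}]
query_strings = ['first query string', 'second query string']

# wrap the results in a dataframe, using the query strings as the index
output_df = pd.DataFrame({'dictionary': hits}, index=query_strings)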
The main time-sink with my approach, however, is matching up the Corpus ID with the string in the corpus so I know which string the input is matched against. My current solution uses a combination of pandas apply with the pandarallel module:
def retrieve_match_text(row, corpus_list):
    dict_obj = row['dictionary']
    corpus_id = dict_obj['corpus_id']  # corpus ID is an integer representing the index of a list
    score = dict_obj['score']
    matched_corpus_keyword = corpus_list[corpus_id]  # list index lookup (speed this up)
    return [matched_corpus_keyword, score]

.....
.....

# expand the dictionary into two columns and match the corpus KW to its ID
output_df[['Matched Corpus KW', 'Score']] = output_df.parallel_apply(
    lambda x: pd.Series(retrieve_match_text(x, sentence_list_2)), axis=1)
This takes around 2 minutes for an input corpus of 100K against another corpus of 100K in size. However, I'm dealing with corpora several million in size, so any further increase in speed here is welcome.
If I read the question correctly, you have the columns: Query_String and dictionary (is this correct?).
And then corpus_id and score are stored in that dictionary.
Your first target with pandas should be to work in a pandas-friendly way: avoid the nested dictionary and store the values directly in columns. After that, you can use efficient pandas operations.
Indexing a list is not what is slow for you. If you do this correctly, it can be a whole-table merge/join and won't need any slow row-by-row apply and dictionary lookups.
Step 1. If you do this:
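A minimal sketch of that step, assuming sentence_list_2 from the question is the target corpus list (the name target_corpus is reused in the merge below):

import pandas as pd

# wrap the plain Python list in a Series; the index is each string's former
# list position, i.e. the corpus_id that comes back from SBERT
target_corpus = pd.Series(sentence_list_2, name='Matched Corpus KW')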
Then you have an indexed series of one corpus (formerly the "list lookup").
Step 2. Get columns of score and corpus_id in your main dataframe.
Step 3. Use pd.merge to join the input corpus on corpus_id vs the index of target_corpus, using how="left" (only items that match an existing corpus_id are relevant).
This should be an efficient way to do it, and it's a whole-dataframe operation. Develop and test the solution vs a small subset (1K) to iterate quickly, then grow.
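Putting Steps 2 and 3 together, a minimal sketch (it assumes the output_df and 'dictionary' column from the question, plus the target_corpus Series built in Step 1):

# Step 2: pull the nested dictionary values out into plain columns,
# replacing the row-by-row apply
output_df['corpus_id'] = [d['corpus_id'] for d in output_df['dictionary']]
output_df['Score'] = [d['score'] for d in output_df['dictionary']]

# Step 3: whole-table left join of corpus_id against the index of target_corpus;
# the Series name becomes the 'Matched Corpus KW' column in the result
result = output_df.merge(target_corpus, left_on='corpus_id',
                         right_index=True, how='left')

An equivalent lookup without the merge would be output_df['corpus_id'].map(target_corpus), which also avoids any per-row Python work.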