Improving Python code performance when using Levenshtein to compare strings in Pandas
I have this code that functions properly and produces the result I am looking for:
from thefuzz import fuzz
import pandas as pd

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')

# Build an n x n matrix of pairwise partial_ratio scores between tweets
df_compare = pd.DataFrame(
    df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())

# Zero out the diagonal and upper triangle so each row only keeps its
# scores against the rows that come before it
for i in df_compare.index:
    for j in df_compare.columns[i:]:
        df_compare.iloc[i, j] = 0

# Keep only rows that do not score 75 or higher against any earlier row
df[df_compare.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)
print('Done did')
However, since string comparison is a very costly operation, the script is very slow and only works on relatively small CSV files with 5,000-7,000 rows. Anything larger (over 12 MB) takes days before throwing a memory-related error message. I attempted running it with modin on 32 cores with 32 GB of memory, but it did not change anything and I ended up with the same result.
import glob
from thefuzz import fuzz
import modin.pandas as pd

files = glob.glob('/folder/folder/2013/*.csv')

for file in files:
    df = pd.read_csv(file, dtype=str, lineterminator='\n')

    # Same pairwise partial_ratio matrix as above
    df_compare = pd.DataFrame(
        df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())

    # Zero the diagonal and upper triangle, then drop rows that match an earlier row
    for i in df_compare.index:
        for j in df_compare.columns[i:]:
            df_compare.iloc[i, j] = 0

    df[df_compare.max(axis=1) < 75].to_csv(f'{file[:-4]}-done.csv', index=False)
    print(f'{file} has been done')
It works on smaller files when run as separate jobs, but there are too many files to do them all separately. Would there be a way to optimise this code, or is there some other possible solution?
The data is a collection of tweets, and only one column (out of around 30) is being compared. It looks like this:
ID | Text |
---|---|
11213 | I am going to the cinema |
23213 | Black is my favourite colour |
35455 | I am going to the cinema with you |
421323 | My friends think I am a good guy. |
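For reference on how the 75 cutoff behaves on rows like these, here is a minimal check using two example tweets taken from the table above (the variable names are just illustrative; the exact scores will vary, but the near-duplicate pair should land well above the cutoff while the unrelated pair stays below it):

from thefuzz import fuzz

going = "I am going to the cinema"
going_with_you = "I am going to the cinema with you"
colour = "Black is my favourite colour"

# partial_ratio aligns the shorter string against the best-matching part of the
# longer one, so a tweet contained inside another scores close to 100
print(fuzz.partial_ratio(going, going_with_you))  # high score: the later row would be dropped
print(fuzz.partial_ratio(going, colour))          # low score: the row would be kept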
Comments (1)
It appears that the requirement is to compare each sentence against every other sentence. Given the overall approach here, I don't think there is a good answer: you are looking at n^2 comparisons. As your row count gets large, the overall processing requirement turns into a monster very quickly.
To figure out feasibility, you could run some smaller tests, counting the n^2 comparisons for each test to get a comparisons-per-second metric. Then calculate n^2 for the big datasets you want to process to get an idea of the required processing time. That is assuming your memory could handle it. There may be existing work on handling n^2 problems; you might want to look around for something like that.
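As a rough sketch of that kind of feasibility test (the sample size of 500 is arbitrary, and the file path is just the one from the question), you could time a small slice and extrapolate to the full n^2:

import time
from thefuzz import fuzz
import pandas as pd

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
sample = df['text'].head(500).tolist()      # small slice to benchmark on

# Time the full n^2 loop on the sample to get a comparisons-per-second rate
start = time.perf_counter()
for a in sample:
    for b in sample:
        fuzz.partial_ratio(a, b)
rate = len(sample) ** 2 / (time.perf_counter() - start)

# Extrapolate to the n^2 comparisons needed for the whole file
n = len(df)
print(f'{rate:,.0f} comparisons/s, estimated full run: {n * n / rate / 3600:.1f} hours')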
You are also doing more than twice the work you need to: you compare everything against everything in both directions, and against itself. But even if you only do the combinations, n(n-1)/2 still becomes monstrous once things get large.
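A minimal sketch of what that looks like, assuming the same 'text' column and 75 cutoff as in the question: visit each unordered pair once with itertools.combinations, skip self-comparisons, and drop the later row of any pair that matches.

from itertools import combinations
from thefuzz import fuzz
import pandas as pd

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
texts = df['text'].tolist()

# Visit each unordered pair (i, j) with i < j exactly once: n(n-1)/2 comparisons
# instead of the full n^2 matrix, and no row is compared with itself
drop = set()
for i, j in combinations(range(len(texts)), 2):
    if j in drop:
        continue                          # row j is already marked, skip the work
    if fuzz.partial_ratio(texts[i], texts[j]) >= 75:
        drop.add(j)                       # keep the earlier row, drop the later one

df[~df.index.isin(drop)].to_csv('/folder/folder/2011_05-ready.csv', index=False)

This keeps the same "first occurrence wins" behaviour as the original script while doing at most half the comparisons, and it never builds the full n x n matrix in memory.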