Improving Python code performance when comparing strings in Pandas using Levenshtein

Posted on 2025-02-04 12:27:33


I have this code that functions properly and produces the result I am looking for:

from thefuzz import fuzz
import pandas as pd

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')

# Pairwise similarity matrix: cell (i, j) holds partial_ratio(text[j], text[i])
df_compare = pd.DataFrame(
    df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())

# Zero out the diagonal and upper triangle so each pair is counted only once
for i in df_compare.index:
    for j in df_compare.columns[i:]:
        df_compare.iloc[i, j] = 0

# Keep only rows whose best match against any earlier row is below 75
df[df_compare.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)

print('Done did')

However, since string comparison is a very costly operation, the script is very slow and only works on relatively small CSV files with 5,000-7,000 rows. Anything larger (over 12 MB) takes days before throwing a memory-related error message. I attempted running it with modin on 32 cores with 32 GB of memory, but it did not change anything and I ended up with the same result.

import glob
from thefuzz import fuzz
import modin.pandas as pd

files = glob.glob('/folder/folder/2013/*.csv')

for file in files:
    df = pd.read_csv(file, dtype=str, lineterminator='\n')
    df_compare = pd.DataFrame(
        df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())

    # Zero out the diagonal and upper triangle, as above
    for i in df_compare.index:
        for j in df_compare.columns[i:]:
            df_compare.iloc[i, j] = 0

    df[df_compare.max(axis=1) < 75].to_csv(f'{file[:-4]}-done.csv', index=False)
    print(f'{file} has been done')

It works on smaller files when run as separate jobs, but there are too many files to do them all separately. Would there be a way to optimise this code, or is there some other possible solution?

The data is a collection of tweets, and only one column (out of around 30) is being compared. It looks like this:

ID      Text
11213   I am going to the cinema
23213   Black is my favourite colour
35455   I am going to the cinema with you
421323  My friends think I am a good guy.


Comments (1)

无需解释 2025-02-11 12:27:33


It appears that the requirement is to compare each sentence against every other sentence. Given that overall approach, I don't think there is a good answer: you are looking at n^2 comparisons. Even the 5,000-7,000-row files already mean roughly 25-49 million partial_ratio calls, and as the row count grows the overall processing requirement turns into a monster very quickly.

To figure out feasibility, you could run some smaller tests, compute n^2 for the test size, and divide by the elapsed time to get a comparisons-per-second figure. Then compute n^2 for the big datasets you want to process to get an idea of the required processing time. That is assuming your memory could handle it at all. There may be existing work on handling n^2 problems; it might be worth looking around for something like that.
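
As a minimal sketch of that timing test (the 'sample.csv' path, the 500-row sample size and the 200,000-row target are placeholder assumptions, not values from the question):

import time

import pandas as pd
from thefuzz import fuzz

# Time a small n^2 run with the same scorer, then extrapolate.
texts = pd.read_csv('sample.csv', dtype=str, lineterminator='\n')['text'].head(500).tolist()
n_small = len(texts)

start = time.perf_counter()
for a in texts:
    for b in texts:
        fuzz.partial_ratio(a, b)
elapsed = time.perf_counter() - start

rate = n_small ** 2 / elapsed       # comparisons per second on this machine
n_big = 200_000                     # hypothetical row count of a large file
est_hours = n_big ** 2 / rate / 3600
print(f'{rate:,.0f} comparisons/s -> roughly {est_hours:,.1f} hours for {n_big:,} rows')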

You are also doing more than twice the work you need to: you compare everything against everything in both directions, and each string against itself. But even if you only score the unique combinations, as sketched below, n(n-1)/2 comparisons is still monstrous once things get large.
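
As a loose illustration of that point, here is a minimal sketch that scores each unique pair only once while keeping the question's filter (drop a row when it matches any earlier row at 75 or above). It reuses the placeholder path and 'text' column from the question and is only a sketch, not a drop-in replacement:

import numpy as np
import pandas as pd
from thefuzz import fuzz

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
texts = df['text'].tolist()
n = len(texts)

# Score only the n(n-1)/2 unique pairs instead of the full n x n matrix.
best = np.zeros(n)                   # best score of each row against all earlier rows
for i in range(n):
    for j in range(i + 1, n):        # each pair (i, j) is scored exactly once
        score = fuzz.partial_ratio(texts[i], texts[j])
        if score > best[j]:
            best[j] = score          # j comes after i, so only j can be dropped

df[best < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)

This skips the duplicated and self comparisons, but the cost still grows quadratically with the row count, so it only buys a constant factor.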
