Python script to find duplicate database records is too slow
We have a database with 100,000 artists, some of which are duplicates. We want to run the script below to compare every artist against every other artist and build a list of likely duplicates.
from difflib import SequenceMatcher

# ids already compared, so each pair is only checked once
checkedartistlist = []

artistquery = Artist.objects.filter(
    mainartist__rank__pos__lt=999,
    created_at__gt=latestcheckdate,
).order_by('name').distinct()

for artist in artistquery:
    checkedartistlist.append(artist.id)
    ref_artistquery = Artist.objects.all().exclude(id__in=checkedartistlist)
    for ref_artist in ref_artistquery:
        similarity = SequenceMatcher(None, artist.name, ref_artist.name).ratio()
        if similarity > 0.90:
            SimilarArtist.objects.get_or_create(artist1=artist, artist2=ref_artist)
Unfortunately, this results in on the order of 10 billion comparisons (100,000² name pairs, roughly halved by excluding already-checked artists). Is there a way to accomplish this more efficiently and quickly?
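For context on where the cost could be cut: one standard way to avoid the quadratic blow-up is blocking, e.g. the sorted-neighborhood method, where names are sorted once and each name is compared only to a small window of neighbors instead of to every other record. This is a minimal standalone sketch of that idea; `find_similar` and its parameters are illustrative names, and it works on a plain list of names rather than the Django models above:

```python
from difflib import SequenceMatcher

def find_similar(names, window=5, threshold=0.90):
    """Sorted-neighborhood blocking: sort the names, then compare each
    name only to its next `window` neighbors in sort order, instead of
    to all other names. Near-duplicates usually sort close together."""
    ordered = sorted(names, key=str.casefold)
    pairs = []
    for i, name in enumerate(ordered):
        # only O(n * window) comparisons instead of O(n^2)
        for other in ordered[i + 1 : i + 1 + window]:
            if SequenceMatcher(None, name, other).ratio() > threshold:
                pairs.append((name, other))
    return pairs

# A name with a stray trailing space is flagged; unrelated names are not.
print(find_similar(["Coldplay", "Coldplay ", "Queen"]))
```

In the Django setting this would translate to fetching `id` and `name` once (e.g. via `values_list`) ordered by name, and running the windowed comparison in memory, so the per-iteration `exclude(id__in=...)` query also disappears. The trade-off is that blocking can miss duplicates whose variants sort far apart (e.g. "The Beatles" vs "Beatles"), so the window size and sort key need tuning.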