Python script to find duplicate database records is too slow
We have a database with 100,000 artists, some of which are duplicates. We want to run the script below to compare every artist against every other artist and build a list of likely duplicates.
from difflib import SequenceMatcher

# ids already compared, so each pair is only checked once
checkedartistlist = []

artistquery = Artist.objects.filter(
    mainartist__rank__pos__lt=999,
    created_at__gt=latestcheckdate,
).order_by('name').distinct()

for artist in artistquery:
    checkedartistlist.append(artist.id)
    ref_artistquery = Artist.objects.all().exclude(id__in=checkedartistlist)
    for ref_artist in ref_artistquery:
        similarity = SequenceMatcher(None, artist.name, ref_artist.name).ratio()
        if similarity > 0.90:
            SimilarArtist.objects.get_or_create(artist1=artist, artist2=ref_artist)
Unfortunately, this results in on the order of 10 billion comparisons (100,000² name pairs, roughly halved by excluding already-checked artists). Is there a way to accomplish this more efficiently and quickly?
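For context on where the cost could be cut: one standard way to avoid the quadratic blow-up is blocking, e.g. the sorted-neighborhood method, where names are sorted once and each name is compared only to a small window of neighbors instead of to every other record. This is a minimal standalone sketch of that idea; `find_similar` and its parameters are illustrative names, and it works on a plain list of names rather than the Django models above:

```python
from difflib import SequenceMatcher

def find_similar(names, window=5, threshold=0.90):
    """Sorted-neighborhood blocking: sort the names, then compare each
    name only to its next `window` neighbors in sort order, instead of
    to all other names. Near-duplicates usually sort close together."""
    ordered = sorted(names, key=str.casefold)
    pairs = []
    for i, name in enumerate(ordered):
        # only O(n * window) comparisons instead of O(n^2)
        for other in ordered[i + 1 : i + 1 + window]:
            if SequenceMatcher(None, name, other).ratio() > threshold:
                pairs.append((name, other))
    return pairs

# A name with a stray trailing space is flagged; unrelated names are not.
print(find_similar(["Coldplay", "Coldplay ", "Queen"]))
```

In the Django setting this would translate to fetching `id` and `name` once (e.g. via `values_list`) ordered by name, and running the windowed comparison in memory, so the per-iteration `exclude(id__in=...)` query also disappears. The trade-off is that blocking can miss duplicates whose variants sort far apart (e.g. "The Beatles" vs "Beatles"), so the window size and sort key need tuning.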