有没有办法根据字符串相似度过滤 django 查询集(la python difflib)?

发布于 2024-09-12 01:21:27 字数 2060 浏览 1 评论 0原文

我需要将冷线索与我们客户的数据库进行匹配。

潜在客户大量来自第三方提供商(数千条记录),销售人员要求我们(用他们的话说)“过滤掉我们的客户”,这样他们就不会试图向现有客户出售我们的服务。

显然,线索中存在拼写错误。查尔斯变成查理,约瑟夫变成乔,等等。所以我真的不能只做一个过滤器比较 Lead_first_name 和 client_first_name 等。

我需要使用某种 字符串相似度机制。

现在,我正在使用可爱的 difflib 来比较潜在客户的名字和姓氏使用 Client.objects.all() 生成的列表。它有效,但由于客户端数量较多,速度往往很慢。

我知道大多数sql数据库都有soundex和difference函数。请参阅下面的更新中我对它的测试 - 它的工作效果不如 difflib。

还有其他解决方案吗?有更好的解决方案吗?

编辑:

Soundex,至少在我的数据库中,表现不如 difflib。

这是一个简单的测试 - 在包含“Joseph Lopes”的表中查找“Joe Lopes”:

with temp (first_name, last_name) as (
select 'Joseph', 'Lopes'
union
select 'Joe', 'Satriani'
union
select 'CZ', 'Lopes'
union
select 'Blah', 'Lopes'
union
select 'Antonio', 'Lopes'
union
select 'Carlos', 'Lopes'
)
select first_name, last_name
  from temp
 where difference(first_name+' '+last_name, 'Joe Lopes') >= 3
 order by difference(first_name+' '+last_name, 'Joe Lopes')

上面返回“Joe Satriani”作为唯一匹配项。即使将相似度阈值降低到 2,也不会返回“Joseph Lopes”作为潜在匹配项。

但 difflib 做得更好:

difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']

在 gruszczy 的回复后编辑:

在编写自己的内容之前,我查找并 在所有知识库中找到了Levenshtein Distance的T-SQL实现。

在测试中,它仍然不会比difflib做更好的匹配工作。

这促使我研究 difflib 背后的算法。它似乎是 修改版本 ://www.itl.nist.gov/div897/sqg/dads/HTML/ratcliffObershelp.html" rel="noreferrer">Ratcliff-Obershelp 算法。

不幸的是,我似乎找不到其他已经创建了基于 difflib 的 T-SQL 实现的人...我会尽可能尝试一下。

如果在接下来的几天内没有其他人提出更好的答案,我会将其授予 gruszczy。谢谢,好心的先生。

I have a need to match cold leads against a database of our clients.

The leads come from a third party provider in bulk (thousands of records) and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to a established client.

Obviously, there are misspellings in the leads. Charles becomes Charlie, Joseph becomes Joe, etc. So I can't really just do a filter comparing lead_first_name to client_first_name, etc.

I need to use some sort of string similarity mechanism.

Right now I'm using the lovely difflib to compare the leads' first and last names to a list generated with Client.objects.all(). It works, but because of the number of clients it tends to be slow.

I know that most sql databases have soundex and difference functions. See my test of it in the update below - it doesn't work as well as difflib.

Is there another solution? Is there a better solution?

Edit:

Soundex, at least in my db, doesn't behave as well as difflib.

Here is a simple test - look for "Joe Lopes" in a table containing "Joseph Lopes":

with temp (first_name, last_name) as (
select 'Joseph', 'Lopes'
union
select 'Joe', 'Satriani'
union
select 'CZ', 'Lopes'
union
select 'Blah', 'Lopes'
union
select 'Antonio', 'Lopes'
union
select 'Carlos', 'Lopes'
)
select first_name, last_name
  from temp
 where difference(first_name+' '+last_name, 'Joe Lopes') >= 3
 order by difference(first_name+' '+last_name, 'Joe Lopes')

The above returns "Joe Satriani" as the only match. Even reducing the similarity threshold to 2 doesn't return "Joseph Lopes" as a potential match.

But difflib does a much better job:

difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']

Edit after gruszczy's response:

Before writing my own, I looked for and found a T-SQL implementation of Levenshtein Distance in the repository of all knowledge.

In testing it, it still won't do a better matching job than difflib.

Which led me to research what algorithm is behind difflib. It seems to be a modified version of the Ratcliff-Obershelp algorithm.

Unhappily I can't seem to find some other kind soul who has already created a T-SQL implementation based on difflib's... I'll try my hand at it when I can.

If nobody else comes up with a better answer in the next few days, I'll grant it to gruszczy. Thanks, kind sir.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

翻身的咸鱼 2024-09-19 01:21:27

根据 andilabs 的答案,您可以使用 Levenshtein 函数来创建自定义函数。 Postgres doc表明Levenshtein函数如下:

levenshtein(文本源,文本目标,int ins_cost,int del_cost,int sub_cost)返回int
levenshtein(text source, text target) returns int

andilabs 答案可以使用唯一的第二个函数。如果您想要具有插入/删除/替换成本的更高级搜索,您可以像这样重写函数:

from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s', %(ins_cost)d, %(del_cost)d, %(sub_cost)d)"
    function = 'levenshtein'

    def __init__(self, expression, search_term, ins_cost=1, del_cost=1, sub_cost=1, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            ins_cost=ins_cost,
            del_cost=del_cost,
            sub_cost=sub_cost,
            **extras
        )

并调用该函数:

from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka', 3, 3, 1)  # ins = 3, del = 3, sub = 1
).filter(
    lev_dist__lte=2
)

As per the answer of andilabs you can use the Levenshtein function to create your custom function. Postgres doc indicates that the Levenshtein function is as follows:

levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
levenshtein(text source, text target) returns int

andilabs answer can use the only second function. If you want a more advanced search with insertion/deletion/substitution costs, you can rewrite function like this:

from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s', %(ins_cost)d, %(del_cost)d, %(sub_cost)d)"
    function = 'levenshtein'

    def __init__(self, expression, search_term, ins_cost=1, del_cost=1, sub_cost=1, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            ins_cost=ins_cost,
            del_cost=del_cost,
            sub_cost=sub_cost,
            **extras
        )

And call the function:

from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka', 3, 3, 1)  # ins = 3, del = 3, sub = 1
).filter(
    lev_dist__lte=2
)
久光 2024-09-19 01:21:27

soundex 不会帮助你,因为它是一种语音算法。 Joe 和 Joseph 在语音上并不相似,因此 soundex 不会将它们标记为相似。

您可以尝试 Levenshtein distance,它是在 PostgreSQL 中实现的。也许也在您的数据库中,如果没有,您应该能够编写一个存储过程,它将计算两个字符串之间的距离并在计算中使用它。

soundex won't help you, because it's a phonetic algorithm. Joe and Joseph aren't similar phonetically, so soundex won't mark them as similar.

You can try Levenshtein distance, which is implemented in PostgreSQL. Maybe in your database too and if not, you should be able to write a stored procedure, which will calculate the distance between two strings and use it in your computation.

我的鱼塘能养鲲 2024-09-19 01:21:27

自 Django 1.10 起,可以使用 trigram_similar 查找,请参阅文档了解 PostgreSQL 特定查找全文搜索

It's possible with trigram_similar lookups since Django 1.10, see docs for PostgreSQL specific lookups and Full text search

似狗非友 2024-09-19 01:21:27

如果您需要使用 django 和 postgres 实现这一目标,并且不想使用 1.10 trigram-similarity 中引入的 https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/lookups/#trigram-similarity 您可以使用 Levensthein 实现,如下所示:

< strong>需要扩展 fuzzystrmatch

您需要将 postgres 扩展添加到 psql 中的数据库:

CREATE EXTENSION fuzzystrmatch;

让我们定义自定义函数,我们可以用它来注释查询集。它只需要一个参数 search_term 并使用 postgres levenshtein 函数(请参阅文档):

from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s')"
    function = "levenshtein"

    def __init__(self, expression, search_term, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            **extras
        )

然后在项目中的任何其他位置,我们只需导入定义的 Levenshtein 和 F 来传递 django 字段。

from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka')
).filter(
    lev_dist__lte=2
)

If you need getting there with django and postgres and don't want to use introduced in 1.10 trigram-similarity https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/lookups/#trigram-similarity you can implement using Levensthein like these:

Extension needed fuzzystrmatch

you need adding postgres extension to your db in psql:

CREATE EXTENSION fuzzystrmatch;

Lets define custom function with wich we can annotate queryset. It just take one argument the search_term and uses postgres levenshtein function (see docs):

from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s')"
    function = "levenshtein"

    def __init__(self, expression, search_term, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            **extras
        )

then in any other place in project we just import defined Levenshtein and F to pass the django field.

from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka')
).filter(
    lev_dist__lte=2
)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文