difflib.SequenceMatcher isjunk optional parameter: how to ignore whitespace, tabs, and empty lines?

Posted on 2024-07-06 11:57:52

I am trying to use difflib.SequenceMatcher to compute the similarity between two files. The two files are almost identical, except that one contains some extra whitespace and empty lines and the other doesn't. I am trying to use

s = difflib.SequenceMatcher(isjunk, text1, text2)
ratio = s.ratio()

for this purpose.

So, the question is how to write the lambda expression for this isjunk argument so that SequenceMatcher will discount all whitespace, empty lines, etc. I tried passing lambda x: x == " ", but the result isn't great. For two closely similar texts, the ratio is very low. This is highly counterintuitive.

For testing purposes, here are the two strings that you can use:

What Motivates jwovu to do your Job
Well? OK, this is an entry trying to
win $100 worth of software development
books despite the fact that I don‘t
read

programming books. In order to win the
prize you have to write an entry and
what motivatesfggmum to do your job
well. Hence this post. First
motivation

money. I know, this doesn‘t sound like
a great inspiration to many, and
saying that money is one of the
motivation factors might just blow my
chances away.

As if money is a taboo in programming
world. I know there are people who
can‘t be motivated by money. Mme, on
the other hand, am living in a real
world,

with house mortgage to pay, myself to
feed and bills to cover. So I can‘t
really exclude money from my
consideration. If I can get a large
sum of money for

doing a good job, then definitely
boost my morale. I won‘t care whether
I am using an old workstation, or
forced to share rooms or cubicle with
other

people, or have to put up with an
annoying boss, or whatever. The fact
that at the end of the day I will walk
off with a large pile of money itself
is enough

for me to overcome all the obstacles,
put up with all the hard feelings and
hurt egos, tolerate a slow computer
and even endure

And here's another string

What Motivates You to do your Job
Well? OK, this is an entry trying to
win $100 worth of software development
books, despite the fact that I don't
read programming books. In order to
win the prize you have to write an
entry and describes what motivates you
to do your job well. Hence this post.

First motivation, money. I know, this
doesn't sound like a great inspiration
to many, and saying that money is one
of the motivation factors might just
blow my chances away. As if money is a
taboo in programming world. I know
there are people who can't be
motivated by money. Kudos to them. Me,
on the other hand, am living in a real
world, with house mortgage to pay,
myself to feed and bills to cover. So
I can't really exclude money from my
consideration.

If I can get a large sum of money for
doing a good job, then thatwill
definitely boost my morale. I won't
care whether I am using an old
workstation, or forced to share rooms
or cubicle with other people, or have
to put up with an annoying boss, or
whatever. The fact that at the end of
the day I will walk off with a large
pile of money itself is enough for me
to overcome all the obstacles, put up
with all the hard feelings and hurt
egos, tolerate a slow computer and
even endure

I ran the above command with isjunk set to lambda x: x == " ", and the ratio is only 0.36.

Comments (4)

束缚m 2024-07-13 11:57:52

If you treat all whitespace characters as junk, the similarity is better:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, difflib is not ideal for such a problem: these are two nearly identical documents, but typos and the like produce differences for difflib where a human wouldn't see many.

Try reading up on tf-idf, Bayesian probability, vector space models, and w-shingling.

I have written an implementation of tf-idf that applies it to a vector space and uses the dot product as a similarity measure to classify documents.
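
To make the w-shingling suggestion concrete, here is a minimal sketch. It is not the implementation mentioned above; the function names `shingles` and `resemblance` and the default window size `w=3` are assumptions chosen for illustration. Each document is split into words, every run of `w` consecutive words becomes a shingle, and the two shingle sets are compared with the Jaccard ratio. Because `str.split()` with no arguments discards all whitespace, extra spaces and blank lines do not affect the score.

```python
def shingles(text, w=3):
    """Return the set of all runs of w consecutive words in text."""
    words = text.split()  # no-argument split() drops spaces, tabs, newlines
    if len(words) < w:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(text1, text2, w=3):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    a, b = shingles(text1, w), shingles(text2, w)
    return len(a & b) / len(a | b)


doc1 = "What Motivates You to do your Job\nWell? OK, this is an entry"
doc2 = "What   Motivates You to do your\n\nJob Well? OK, this is an entry"
doc3 = "What Motivates jwovu to do your Job Well? OK, this is an entry"

same_words = resemblance(doc1, doc2)  # differs only in whitespace
one_typo = resemblance(doc1, doc3)    # one word replaced
```

A single changed word invalidates every shingle that contains it, so a smaller `w` is more forgiving of typos and a larger `w` is stricter.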

独留℉清风醉 2024-07-13 11:57:52

Using your sample strings:

>>> s=difflib.SequenceMatcher(lambda x: x == '\n', s1, s2)
>>> s.ratio()
0.94669848846459825

Interestingly, if ' ' is also included as junk:

>>> s=difflib.SequenceMatcher(lambda x: x in ' \n', s1, s2)
>>> s.ratio()
0.7653142402545744

Looks like the newlines are having a much greater effect than the spaces.
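
One point that may help when reading these numbers: as far as the difflib documentation describes it, isjunk does not delete the junk characters from the comparison. It only keeps them from being used to start a match, and they still count toward the total length that ratio() divides by. The tiny strings below are made up for illustration and are not the sample texts from the question.

```python
import difflib

a = "ab"
b = "a b"  # same letters, one extra space

# Mark the space as junk and compare.
with_junk = difflib.SequenceMatcher(lambda ch: ch == " ", a, b).ratio()

# The extra space is still part of b, so the score stays below 1.0.
print(with_junk)
```

So marking whitespace as junk changes how matches are found, but it is not the same as comparing the texts with whitespace removed.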

疯到世界奔溃 2024-07-13 11:57:52

Given the texts above, the test is indeed as suggested:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, to speed things up a little, you can take advantage of CPython's method-wrappers:

difflib.SequenceMatcher(" \t\n".__contains__, doc1, doc2).ratio()

This avoids many Python-level function calls.
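
As a quick sanity check, the sketch below confirms that the bound method and the lambda give the same answer for single characters, so swapping one for the other should not change the ratio. The short `doc1` and `doc2` strings are placeholders made up for this example.

```python
import difflib

doc1 = "one  two\tthree\nfour"
doc2 = "one two three four"

# Two equivalent junk predicates for single-character elements.
is_space_lambda = lambda ch: ch in " \t\n"
is_space_bound = " \t\n".__contains__  # no Python-level function frame

ratio_lambda = difflib.SequenceMatcher(is_space_lambda, doc1, doc2).ratio()
ratio_bound = difflib.SequenceMatcher(is_space_bound, doc1, doc2).ratio()

print(ratio_lambda == ratio_bound)
```

Any speed-up depends on the interpreter and the input size, so it is worth timing both on your own data before relying on it.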

怪我鬧 2024-07-13 11:57:52

I haven't used difflib.SequenceMatcher, but have you considered pre-processing the files to remove all blank lines and whitespace (perhaps via regular expressions) and then doing the comparison?
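
A minimal sketch of that pre-processing idea, assuming it is acceptable to collapse every run of whitespace into a single space rather than delete it entirely (deleting it would glue words together). The helper name `normalize_whitespace` and the two short strings are made up for illustration.

```python
import difflib
import re


def normalize_whitespace(text):
    """Collapse each run of spaces, tabs, and newlines into one space."""
    return re.sub(r"\s+", " ", text).strip()


a = "What Motivates You to do your Job\nWell?  OK,\tthis is an entry"
b = "What Motivates You to do\n\nyour Job Well? OK, this is an entry"

# Comparing the raw strings penalizes the whitespace differences.
raw = difflib.SequenceMatcher(None, a, b).ratio()

# After normalization the two strings are identical.
cleaned = difflib.SequenceMatcher(
    None, normalize_whitespace(a), normalize_whitespace(b)
).ratio()

print(raw, cleaned)
```

With this approach isjunk can be left as None, since the whitespace differences are gone before SequenceMatcher ever sees the text.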
