difflib.SequenceMatcher isjunk optional parameter: how to ignore spaces, tabs, and empty lines?

Posted 2024-07-06 11:57:52

I am trying to use difflib.SequenceMatcher to compute the similarity between two files. The two files are almost identical, except that one contains some extra whitespace and empty lines and the other doesn't. I am trying to use

s = difflib.SequenceMatcher(isjunk, text1, text2)
ratio = s.ratio()

for this purpose.

So the question is how to write the lambda expression for the isjunk argument so that SequenceMatcher discounts all whitespace, empty lines, and so on. I tried passing lambda x: x == " ", but the result isn't great: for two closely similar texts, the ratio is very low, which is highly counterintuitive.

For testing purposes, here are the two strings you can use:

What Motivates jwovu to do your Job
Well? OK, this is an entry trying to
win $100 worth of software development
books despite the fact that I don‘t
read
programming books. In order to win the
prize you have to write an entry and
what motivatesfggmum to do your job
well. Hence this post. First
motivation
money. I know, this doesn‘t sound like
a great inspiration to many, and
saying that money is one of the
motivation factors might just blow my
chances away.
As if money is a taboo in programming
world. I know there are people who
can‘t be motivated by money. Mme, on
the other hand, am living in a real
world,
with house mortgage to pay, myself to
feed and bills to cover. So I can‘t
really exclude money from my
consideration. If I can get a large
sum of money for
doing a good job, then definitely
boost my morale. I won‘t care whether
I am using an old workstation, or
forced to share rooms or cubicle with
other
people, or have to put up with an
annoying boss, or whatever. The fact
that at the end of the day I will walk
off with a large pile of money itself
is enough
for me to overcome all the obstacles,
put up with all the hard feelings and
hurt egos, tolerate a slow computer
and even endure

And here's the other string:

What Motivates You to do your Job
Well? OK, this is an entry trying to
win $100 worth of software development
books, despite the fact that I don't
read programming books. In order to
win the prize you have to write an
entry and describes what motivates you
to do your job well. Hence this post.
First motivation, money. I know, this
doesn't sound like a great inspiration
to many, and saying that money is one
of the motivation factors might just
blow my chances away. As if money is a
taboo in programming world. I know
there are people who can't be
motivated by money. Kudos to them. Me,
on the other hand, am living in a real
world, with house mortgage to pay,
myself to feed and bills to cover. So
I can't really exclude money from my
consideration.
If I can get a large sum of money for
doing a good job, then thatwill
definitely boost my morale. I won't
care whether I am using an old
workstation, or forced to share rooms
or cubicle with other people, or have
to put up with an annoying boss, or
whatever. The fact that at the end of
the day I will walk off with a large
pile of money itself is enough for me
to overcome all the obstacles, put up
with all the hard feelings and hurt
egos, tolerate a slow computer and
even endure

I ran the above command with isjunk set to lambda x: x == " ", and the ratio is only 0.36.

束缚ｍ 2024-07-13 11:57:52

If you match all whitespace, the similarity is better:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, difflib is not ideal for this kind of problem: these are two nearly identical documents, but typos and the like produce differences for difflib where a human wouldn't see many.

Try reading up on tf-idf, Bayesian probability, vector space models, and w-shingling.

I have written an implementation of tf-idf that applies it to a vector space and uses the dot product as a distance measure to classify documents.
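As a rough illustration of one of the techniques named above, here is a minimal w-shingling sketch. It is not the tf-idf implementation mentioned in this answer; the function names `shingles` and `resemblance` and the default window size `w=3` are assumptions chosen for the example. Because the text is split on any whitespace first, extra spaces, tabs, and blank lines do not affect the score.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (word n-grams) in text."""
    words = text.split()  # splits on any run of whitespace
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "this is an entry trying to\nwin   a prize"
    doc2 = "this is an entry trying to win a prize"
    print(resemblance(doc1, doc2))  # whitespace-only differences are ignored
```

A single changed word knocks out up to `w` shingles, so smaller values of `w` are more forgiving of typos.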

独留℉清风醉 2024-07-13 11:57:52

Using your sample strings:

>>> s=difflib.SequenceMatcher(lambda x: x == '\n', s1, s2)
>>> s.ratio()
0.94669848846459825

Interestingly, if ' ' is also included as junk:

>>> s=difflib.SequenceMatcher(lambda x: x in ' \n', s1, s2)
>>> s.ratio()
0.7653142402545744

Looks like the newlines are having a much greater effect than the spaces.
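A note that may help when reading these numbers (a sketch, not a diagnosis of the exact figures above). As documented, isjunk does not delete characters from the inputs: junk elements are only kept from anchoring matches, and ratio() is still 2*M/T over the full lengths of both sequences. Separately, newer Python versions accept an autojunk parameter; when it is left on, very frequent elements of the second sequence are treated as junk once that sequence has 200 or more items. The short strings and the `is_space` helper below are made up for illustration.

```python
import difflib


def is_space(ch):
    # Hypothetical junk predicate: only the space character is junk.
    return ch == " "


# 1) Marking ' ' as junk does not remove it from the length that ratio() uses.
plain = difflib.SequenceMatcher(None, "ab", "a b").ratio()
junked = difflib.SequenceMatcher(is_space, "ab", "a b").ratio()
print(plain, junked)  # both 0.8: 2 matched characters out of 5 in total

# 2) With the default autojunk=True, frequent elements of a long second
#    sequence are recorded in the bpopular attribute and treated as junk.
long_b = "the quick brown fox " * 20  # 400 characters
long_a = long_b.replace(" ", "\n")
with_auto = difflib.SequenceMatcher(None, long_a, long_b)
without_auto = difflib.SequenceMatcher(None, long_a, long_b, autojunk=False)
print(sorted(with_auto.bpopular))     # characters flagged by the heuristic
print(sorted(without_auto.bpopular))  # nothing is flagged when it is off
```

If a whitespace-only junk predicate gives surprising ratios on long texts, comparing the result with autojunk=False is a cheap check.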

疯到世界奔溃 2024-07-13 11:57:52

Given the texts above, the test is indeed as suggested:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, to speed things up a little, you can take advantage of CPython's method-wrappers:

difflib.SequenceMatcher(" \t\n".__contains__, doc1, doc2).ratio()

This avoids many Python function calls.
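To check the speed claim on your own machine, a small timing sketch along these lines may help. The names `WHITESPACE`, `count_junk`, and the sample text are assumptions made for the example, and the timings will vary by interpreter and version, so no numbers are quoted here.

```python
import timeit

WHITESPACE = " \t\n"

# Two predicates that should agree on every single character.
is_junk_lambda = lambda x: x in WHITESPACE
is_junk_bound = WHITESPACE.__contains__

sample = "some text\twith\nmixed whitespace " * 100


def count_junk(pred):
    """Call the predicate once per character, as SequenceMatcher would."""
    return sum(1 for ch in sample if pred(ch))


if __name__ == "__main__":
    t_lambda = timeit.timeit(lambda: count_junk(is_junk_lambda), number=50)
    t_bound = timeit.timeit(lambda: count_junk(is_junk_bound), number=50)
    print(f"lambda:       {t_lambda:.4f}s")
    print(f"__contains__: {t_bound:.4f}s")
```

Both forms classify characters identically, so swapping one for the other should not change the ratio, only the time spent calling the predicate.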

怪我鬧 2024-07-13 11:57:52

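I haven't used difflib.SequenceMatcher, but have you considered preprocessing the files to remove all blank lines and whitespace (perhaps with a regular expression) and then doing the comparison?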
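A minimal sketch of that preprocessing idea, assuming it is acceptable to collapse every run of whitespace to a single space before comparing. The helper name `collapse_whitespace` and the two short sample strings are made up for the example.

```python
import difflib
import re


def collapse_whitespace(text):
    """Replace each run of spaces, tabs, and newlines with one space."""
    return re.sub(r"\s+", " ", text).strip()


a = "What Motivates You\nto do   your Job\n\nWell?"
b = "What Motivates You to do your Job Well?"

raw = difflib.SequenceMatcher(None, a, b).ratio()
normalized = difflib.SequenceMatcher(
    None, collapse_whitespace(a), collapse_whitespace(b)
).ratio()

print(raw)         # below 1.0 because of the extra whitespace
print(normalized)  # 1.0 once the whitespace is normalized
```

Replacing `" "` with `""` in the substitution would drop whitespace entirely, at the cost of also gluing adjacent words together.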
