Determining whether two or more summaries are similar

Posted 2024-07-16 02:00:46

The problem is as follows:

I have one summary, usually between 20 and 50 words, that I'd like to compare to other relatively similar summaries. The general category and the geographical location to which the summary refers are already known.

For instance, if people from the same area are writing about building a house, I'd like to be able to list those summaries with some level of certainty that they actually refer to building houses instead of building a garage or a backyard swimming pool.

The data set currently holds around 50,000 documents and grows by some 200 documents per day.

Preferred languages would be Python, PHP, C/C++, Haskell or Erlang, whichever might get the job done. Also, if you don't mind, I'd like to understand the reasoning for picking a specific language.

Comments (4)

〆一缕阳光ご 2024-07-23 02:00:46

You can try to use some string similarity measures, such as Jaccard and Dice, but instead of calculating character overlaps, you calculate word overlaps. For example, using Python, you can use the following:

def word_overlap(a, b):
    # Distinct words that occur in both inputs (repeated words count once).
    return set(a) & set(b)


def jaccard(a, b, overlap_fn=word_overlap):
    r"""
    Jaccard coefficient (/\ represents intersection), given by:
        Jaccard(A, B) = |A /\ B| / (|A| + |B| - |A /\ B|)
    """
    # Work on distinct words so duplicates cannot push the score above 1.
    a, b = set(a), set(b)
    c = overlap_fn(a, b)
    union_size = len(a) + len(b) - len(c)
    return len(c) / union_size if union_size else 0.0

jaccard("Selling a beautiful house in California".split(), "Buying a beautiful crip in California".split())
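The answer above also mentions the Dice coefficient without showing it. A minimal sketch of a word-level version follows, assuming whitespace tokenization and distinct words; the function name `dice` is chosen here for illustration and is not from the original post.

```python
def dice(a, b):
    """Dice coefficient over distinct words: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    # Two empty inputs share nothing, so treat them as dissimilar.
    return 2 * len(a & b) / total if total else 0.0


score = dice("Selling a beautiful house in California".split(),
             "Buying a beautiful crip in California".split())
print(round(score, 2))
```

Compared with Jaccard, Dice weights the shared words twice, so it gives somewhat higher scores to pairs that overlap only partially; the ranking of candidates is usually the same for both.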

暮光沉寂 2024-07-23 02:00:46

Since Python has good native support for sets, we can modify JG's code as follows:

def jaccard(a, b):
    r"""
    Jaccard coefficient (/\ represents intersection), given by:
        Jaccard(A, B) = |A /\ B| / (|A| + |B| - |A /\ B|)
    """
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    return len(c) / union_size if union_size else 0.0

# Split into words first: set() on a raw string would build a set of characters.
jaccard(set("Selling a beautiful house in California".split()), set("Buying a beautiful crip in California".split()))
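To connect this back to the question, here is a rough sketch of how a set-based Jaccard score could be used to list the summaries most similar to a given one. The stopword list, the 0.2 threshold, the sample summaries, and the names `tokenize`, `jaccard_sets`, and `rank_similar` are all assumptions made for illustration.

```python
import re

# Hypothetical stopword list for illustration; a real one would be longer.
STOPWORDS = {"a", "an", "the", "in", "of", "to", "and", "for", "our", "we", "are"}


def tokenize(text):
    """Lowercase the text and return its distinct non-stopword words."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}


def jaccard_sets(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar(query, summaries, threshold=0.2):
    """Return (score, summary) pairs at or above threshold, best first."""
    q = tokenize(query)
    scored = [(jaccard_sets(q, tokenize(s)), s) for s in summaries]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


summaries = [
    "We are building a house in Austin",
    "Building a garage in Austin",
    "Selling a used car",
]
result = rank_similar("Building a new house in Austin", summaries)
for score, text in result:
    print(f"{score:.2f}  {text}")
```

Since the category and location are already known, the candidate list can be narrowed to that group before scoring, which keeps a plain linear scan small. Word overlap alone will still score "house" and "garage" summaries fairly close, so the threshold needs tuning on real data.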

以往的大感动 2024-07-23 02:00:46

You could have a look at the WEBSOM project.

Even though their web site has not been updated recently, the problem they solve is very similar. They were processing amounts of data similar to yours (and more) about 10 years ago, so today you could probably run the algorithms on something close to a cell phone.

堇年纸鸢 2024-07-23 02:00:46

There isn't really a particular language to pick. You're trying to find semantic similarity. This is a very large area. You might be interested in this paper:

Corpus-based and Knowledge-based Measures of Text Semantic Similarity
