Finding sentences in a list that have a similar meaning to an example sentence

Posted on 2024-11-04 04:50:27

I want to be able to find sentences with the same meaning. I have a query sentence and a long list of millions of other sentences. Sentences are made up of words, some of which are a special type of word called a symbol, which simply stands for some object being talked about.

For example, my query sentence is:

Example: add (x) to (y) giving (z)

There may be a list of sentences already existing in my database such as:

1. the sum of (x) and (y) is (z)
2. (x) plus (y) equals (z)
3. (x) multiplied by (y) does not equal (z)
4. (z) is the sum of (x) and (y)

The example should match sentences 1, 2 and 4 in my database, but not 3. Each match should also carry some weight.

It's not just math sentences; it's any sentence that can be compared to any other sentence based on the meaning of the words. I need some way to compare a sentence against many other sentences to find the ones with the closest relative meaning, i.e. a mapping between sentences based on their meaning.

Thanks! (the tag is language-design as I couldn't create any new tag)

Comments (4)

稳稳的幸福 2024-11-11 04:50:28

Not that easy ^^
You should use a stopword filter first, to strip out the words that carry no information. Here are some good ones
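As a rough illustration of that first step, here is a minimal sketch of a stopword filter. The `STOPWORDS` set and the `tokenize` helper are made up for this example and are far smaller than any real stopword list. Note that "not" is deliberately left out of the list, since dropping it would make sentence 3 in the question look like a positive statement.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one is much longer.
# "not" is intentionally absent because negation changes the meaning.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and"}

def tokenize(sentence):
    """Lowercase the sentence and split it into words and (x)-style symbols."""
    return re.findall(r"\([a-z]\)|[a-z]+", sentence.lower())

def remove_stopwords(tokens):
    """Drop tokens that carry little information on their own."""
    return [t for t in tokens if t not in STOPWORDS]

filtered = remove_stopwords(tokenize("the sum of (x) and (y) is (z)"))
print(filtered)
```

For the first database sentence this leaves only `sum` and the three symbols.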

Then you want to handle synonyms. That's actually a really complex topic, because you need some kind of word sense disambiguation (WSD) to do it, and most state-of-the-art methods are only a little better than the simplest solution, which is to take the most frequently used sense of a word. You can do that with WordNet: you get the synsets for a word, each of which groups its synonyms. You can then generalize the word (the more general term is called a hypernym), take the most frequently used sense, and replace the search term with it.

Just to be clear, handling synonyms is pretty hard in NLP. If you only want to handle different word forms, such as add and adding, you could use a stemmer, but no stemmer will get you from add to sum (WSD is the only way there).

Then there are the different word orders in your sentences, which shouldn't be ignored either if you want exact answers (x+y=z is different from x+z=y). So you need word dependencies as well, so that you can see which words depend on each other. The Stanford Parser is actually the best for that task if you want to use English.

Perhaps you should extract just the nouns and verbs from a sentence, run all the preprocessing on them, and query your search index for the dependencies.
A dependency would look like

x (sum, y)
y (sum, x)
sum (x, y)

which you could use for your search.
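To see why dependencies matter for word order, here is a small sketch that compares sentences as sets of (head, role, dependent) triples. The triples are written by hand and only stand in for what a parser might produce; the role names `arg` and `result` and the `jaccard` helper are assumptions for this example, not the output format of any particular parser.

```python
def jaccard(a, b):
    """Overlap between two sets of triples: 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hand-written triples standing in for parser output: (head, role, dependent).
query = {("sum", "arg", "x"), ("sum", "arg", "y"), ("sum", "result", "z")}    # x + y = z
same = {("sum", "arg", "y"), ("sum", "arg", "x"), ("sum", "result", "z")}     # y + x = z
swapped = {("sum", "arg", "x"), ("sum", "arg", "z"), ("sum", "result", "y")}  # x + z = y

print(jaccard(query, same))
print(jaccard(query, swapped))
```

All three sentences contain exactly the same words, so a bag-of-words comparison could not tell them apart. With role-tagged triples, only the sentence that keeps z as the result is a full match.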

So you need to tokenize, generalize, get dependencies, and filter out unimportant words to get your result. And if you want to do it in German, you need a word decompounder as well.

左秋 2024-11-11 04:50:27

First off: what you're trying to solve is a very hard problem. Depending on what's in your dataset, it may be AI-complete.

You'll need your program to know or learn that add, plus and sum refer to the same concept, while multiplies is a different concept. You may be able to do this by measuring distance between the words' synsets in WordNet/FrameNet, though your distance calculation will have to be quite refined if you don't want to find multiplies. Otherwise, you may want to manually establish some word-concept mappings (such as {'add' : 'addition', 'plus' : 'addition', 'sum' : 'addition', 'times' : 'multiplication'}).
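A minimal sketch of the manual mapping route, using the question's own sentences. The `CONCEPTS` dictionary extends the mapping above with one extra illustrative entry (`multiplied`), and `concepts` and `same_concept` are hypothetical helper names. The check ignores negation and word order entirely.

```python
import re

# Hand-built word-to-concept map; "multiplied" is an extra illustrative entry.
CONCEPTS = {
    "add": "addition", "plus": "addition", "sum": "addition",
    "times": "multiplication", "multiplied": "multiplication",
}

def concepts(sentence):
    """Return the set of concepts mentioned by the words of a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {CONCEPTS[w] for w in words if w in CONCEPTS}

def same_concept(query, candidate):
    """True if the two sentences share at least one mapped concept."""
    return bool(concepts(query) & concepts(candidate))

database = [
    "the sum of (x) and (y) is (z)",
    "(x) plus (y) equals (z)",
    "(x) multiplied by (y) does not equal (z)",
    "(z) is the sum of (x) and (y)",
]
query = "add (x) to (y) giving (z)"
matches = [i + 1 for i, s in enumerate(database) if same_concept(query, s)]
print(matches)
```

On this toy data the query lands on sentences 1, 2 and 4 and skips 3, but only because every operator word happens to be in the map.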

If you want full sentence semantics, you will in addition have to parse the sentences and derive the meaning from the parse trees/dependency graphs. The Stanford parser is a popular choice for parsing.

You can also find inspiration for this problem in Question Answering research. There, a common approach is to parse sentences, then store fragments of the parse tree in an index and search for them using common search engine techniques (e.g. tf-idf, as implemented in Lucene). That will also give you a score for each sentence.
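To give a feel for the scoring part, here is a small pure-Python tf-idf ranking with cosine similarity. It is only a sketch: it indexes plain word tokens rather than parse tree fragments, it uses one common smoothed idf variant rather than Lucene's actual formula, and the function names are made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens; (x) becomes x."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(tokens, idf):
    """Weight each known term by its count times its inverse document frequency."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, sentence) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed idf (an assumption of this sketch, not Lucene's formula).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    q = tfidf(tokenize(query), idf)
    scored = [(cosine(q, tfidf(toks, idf)), d) for toks, d in zip(tokenized, docs)]
    return sorted(scored, reverse=True)

docs = [
    "the sum of (x) and (y) is (z)",
    "(x) plus (y) equals (z)",
    "(x) multiplied by (y) does not equal (z)",
    "(z) is the sum of (x) and (y)",
]
for score, sentence in rank("(x) plus (y) equals (z)", docs):
    print(round(score, 3), sentence)
```

Raw tf-idf is purely lexical: the query `add (x) to (y) giving (z)` shares only the symbols with these sentences, so some synonym normalization would have to run before indexing and querying.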

小红帽 2024-11-11 04:50:27

You will need to stem the words in your sentences down to a common synonym, then compare those stems and use the ratio of stem matches in a sentence (say, 5 out of 10 words) against some threshold to decide whether the sentence is a match. For example, accept all sentences with a word match of over 80% (or whatever percentage you deem accurate). At least that is one way to do it.
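A minimal sketch of that ratio test. The suffix-stripping `stem` function is a crude stand-in for a real stemmer, and the names `match_ratio` and `is_match` are made up for this example. The ratio here is taken over the query's stems, which is one possible reading of the description.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # crude and illustrative, not a real stemmer

def stem(word):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(sentence):
    """Set of stems for the alphabetic tokens of a sentence."""
    return {stem(w) for w in re.findall(r"[a-z]+", sentence.lower())}

def match_ratio(query, candidate):
    """Fraction of the query's stems that also occur in the candidate."""
    q, c = stems(query), stems(candidate)
    if not q:
        return 0.0
    return len(q & c) / len(q)

def is_match(query, candidate, threshold=0.8):
    """Accept the candidate if more than `threshold` of the stems match."""
    return match_ratio(query, candidate) > threshold

print(match_ratio("adding (x) to (y)", "adds (x) to (y)"))
print(match_ratio("adding (x) to (y)", "(x) multiplied by (y)"))
```

Two caveats show up even on this toy input: the shared symbols alone already push the second ratio to one half, and a suffix stripper never maps `add` onto `sum`, so the synonym step still has to come from somewhere else.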

爱已欠费 2024-11-11 04:50:27

Write a function which creates some kind of hash, or "expression", from a sentence, which can easily be compared with other sentences' hashes.

Roughly:
1. "the sum of (x) and (y) is (z)" => x + y = z
4. "(z) is the sum of (x) and (y)" => z = x + y

Some tips for the transformation: omit "the", convert two-word terms to a single word ("sum of" => "sumof"), then find the operator word and replace "and" with it.
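Those tips can be sketched as follows. The word tables (`PREFIX_OPS`, `INFIX_OPS`, `EQUALS`) and the `canonical` step are assumptions added for this example: they cover only the addition sentences from the question, and sorting the operands is valid only because "+" is commutative.

```python
import re

PREFIX_OPS = {"sumof": "+"}   # operator named before its operands
INFIX_OPS = {"plus": "+"}     # operator named between its operands
EQUALS = {"is", "equals"}

def to_expression(sentence):
    """Turn a sentence into a rough expression such as 'x + y = z'."""
    text = sentence.lower().replace("sum of", "sumof")
    tokens = [t for t in re.findall(r"[a-z]+", text) if t != "the"]
    prefix = next((PREFIX_OPS[t] for t in tokens if t in PREFIX_OPS), None)
    out = []
    for t in tokens:
        if t in PREFIX_OPS:
            continue                    # its symbol is emitted in place of "and"
        if t == "and" and prefix:
            out.append(prefix)
        elif t in INFIX_OPS:
            out.append(INFIX_OPS[t])
        elif t in EQUALS:
            out.append("=")
        else:
            out.append(t)
    return " ".join(out)

def canonical(expr):
    """Sort '+' operands and both sides of '=' so equivalent forms compare equal."""
    sides = []
    for side in expr.split("="):
        operands = sorted(p.strip() for p in side.split("+"))
        sides.append(" + ".join(operands))
    return " = ".join(sorted(sides))

for s in ["the sum of (x) and (y) is (z)", "(z) is the sum of (x) and (y)"]:
    print(to_expression(s), "|", canonical(to_expression(s)))
```

The two example sentences produce different raw expressions but the same canonical string, which is what would be stored and compared as the "hash". Sentences outside the word tables pass through mostly unchanged, so they simply fail to match.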
