Words in a database - searching through a lexical dictionary (semantic similarity)

Posted on 2024-09-12 09:39:05

I'm implementing a small dictionary database in which I'd like to search for words based on the lexical/semantic similarity between them.

For example, beer has "sister words" such as soda, lemonade, wine, and champagne, each "different" in a "different direction" (for example, the first two are "moderate" versions of the idea of "beer", while the latter two are "more extreme" versions).

I know WordNet has an API, but most of the words (and phrases) in my dictionary are related in more informal ways.

(Another example: "gangster" is related to [nun, orphan, rebel] {criminal, mafia boss, murderer}, where extremity varies from left to right; the words in [] are considered "positive extremities" and the words in {} are "negative extremities".)

In usage:

  1. User enters search input (a word)
  2. Word is matched with sister words.
  3. User has a chance to "fine-tune" the word by altering extremities in at least 2 directions, as in the examples above.

What's the best way to implement such a search -- steps 2 and 3 above?

I'm considering using PHP/MySQL since that is what I am familiar with, but what are better alternatives? Again - keep in mind that this isn't a large dictionary. It's just a selection of common words.


Here's my attempt at answering this - it's very, very basic... improvement suggestions welcome:

MySQL table words:


id (primary key, autoincrement),
word (varchar 75),
relatedword (varchar 75),
relationscore (int 11),
direction (tinyint, -1 or 1)

Given a $word query and $direction:

"SELECT relatedword FROM words WHERE word='$word' AND direction=$direction ORDER BY relationscore DESC"

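As a rough illustration of the table and query above, here is a minimal Python sketch. It makes several assumptions that are not in the original post: SQLite (via the standard `sqlite3` module) stands in for MySQL so the example runs without a server, the sample rows and scores are invented, the sign convention (-1 for milder words, 1 for more extreme ones) is just one possible reading of `direction`, and the helper name `related_words` is hypothetical. The one deliberate change from the query string above is that the values are bound as parameters rather than interpolated into the SQL text, which avoids SQL injection.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs anywhere;
# the column names follow the table described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE words (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        word VARCHAR(75) NOT NULL,
        relatedword VARCHAR(75) NOT NULL,
        relationscore INTEGER NOT NULL,
        direction INTEGER NOT NULL CHECK (direction IN (-1, 1))
    )
    """
)
# Index the two columns used in the WHERE clause.
conn.execute("CREATE INDEX idx_words_lookup ON words (word, direction)")

# Hypothetical sample rows: the scores and the sign convention
# (-1 = milder, 1 = more extreme) are made up for illustration.
rows = [
    ("beer", "soda", 80, -1),
    ("beer", "lemonade", 60, -1),
    ("beer", "wine", 90, 1),
    ("beer", "champagne", 70, 1),
]
conn.executemany(
    "INSERT INTO words (word, relatedword, relationscore, direction) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def related_words(conn, word, direction):
    """Return related words for one direction, strongest relation first."""
    cursor = conn.execute(
        "SELECT relatedword FROM words "
        "WHERE word = ? AND direction = ? "
        "ORDER BY relationscore DESC",
        (word, direction),  # bound parameters, never string interpolation
    )
    return [row[0] for row in cursor]


milder = related_words(conn, "beer", -1)
stronger = related_words(conn, "beer", 1)
print(milder, stronger)
```

The same parameter-binding idea carries over to PHP/MySQL through prepared statements; only the placeholder syntax and driver calls differ.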

Comments (1)

著墨染雨君画夕 2024-09-19 09:39:05

I'm unclear why you think WordNet is inappropriate. I think what you're calling "positive/negative extremities" and "sister words" are what linguists call hypernyms (more general terms) and hyponyms (more specific terms). WordNet includes a reasonably good model of these.

To use WordNet, you'd find "sister" words by "going up" a few levels using the hypernyms('beer') relation. So if you started with "beer", going up 3 levels would give you "beverage". Then you use the hyponyms('beverage') relation to "go down" several levels, to get types of beverages with the same level of specificity as beer.

This is an example of WordNet's interface as accessed through NodeBox Linguistics. I believe PHP has an equivalent WordNet interface, although I've never used it.

>>> import en
>>> noun = 'beer'
>>> generalization_depth = 3
>>> sister_words = en.noun.hyponym(en.noun.hypernyms(noun)[generalization_depth][0])
>>> for word in reduce(lambda a,b: a+b, sister_words, []):
...     print word
... 
milk
wish-wash
potion
alcohol
alcoholic beverage
intoxicant
inebriant
hydromel
oenomel
near beer
ginger beer
mixer
cooler
refresher
smoothie
fizz
cider
cyder
cocoa
chocolate
hot chocolate
drinking chocolate
fruit juice
fruit crush
fruit drink
ade
mate
soft drink
coffee
java
tea
tea-like drink
drinking water
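
The up-then-down traversal described above does not depend on any particular library, so it can also be sketched over a small hand-curated word list. The following is only an illustration under stated assumptions: the `HYPERNYM` table is a hypothetical toy taxonomy (not WordNet data), each word is given a single parent for simplicity, and the function names are invented for this example.

```python
from collections import defaultdict

# Hypothetical hand-built taxonomy: each word maps to its hypernym (parent).
# A real application would load these links from WordNet or a curated table.
HYPERNYM = {
    "beer": "alcohol",
    "wine": "alcohol",
    "alcohol": "beverage",
    "soda": "soft drink",
    "lemonade": "soft drink",
    "soft drink": "beverage",
    "coffee": "beverage",
}


def build_hyponyms(hypernym):
    """Invert the child -> parent map into parent -> list of children."""
    children = defaultdict(list)
    for child, parent in hypernym.items():
        children[parent].append(child)
    return children


def climb(word, levels, hypernym):
    """Walk up at most `levels` hypernym links; stop early at the root.

    Returns the ancestor reached and the number of links actually climbed.
    """
    climbed = 0
    while climbed < levels and word in hypernym:
        word = hypernym[word]
        climbed += 1
    return word, climbed


def descend(word, levels, hyponyms):
    """Return the words exactly `levels` hyponym links below `word`."""
    current = [word]
    for _ in range(levels):
        current = [c for parent in current for c in hyponyms.get(parent, [])]
    return current


def sister_words(word, levels, hypernym=HYPERNYM):
    """Go up `levels` links, come back down the same number, drop `word`."""
    hyponyms = build_hyponyms(hypernym)
    top, climbed = climb(word, levels, hypernym)
    return [w for w in descend(top, climbed, hyponyms) if w != word]


print(sister_words("beer", 1))  # ['wine']
print(sister_words("beer", 2))  # ['wine', 'soda', 'lemonade']
```

Raising `levels` widens the search: one level up yields only the closest relatives, while two levels up pulls in words from neighbouring branches. Descending by the number of links actually climbed keeps the results at the same depth as the starting word even when the root is reached early.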