PHP: word proximity script?

Posted on 2024-10-29 05:48:07

Okay - so, I've spent ages searching on Google, and even went through a few specific searches at hotscripts etc., several PHP forums and this place ... nothing (not of use anyway).

I want to be able to take a block of text (page/file/doc) and pull it apart to find the "distance" between specific terms (find the proximity/relational distance etc.).

I would have thought there'd be at least a few such things around - but I'm not finding them.
So - it may be harder than I thought.
I understand it may be a somewhat "hungry" endeavour - as it's likely to be fairly intensive on large documents - but surely it is possible?

In fact - whilst looking around - the majority of references that I find (apart from lamo-repeat SEO sites) seem to suggest advanced linguistic studies, strange/advanced packages to install onto a server etc.

Am I to assume that "proximity" is in fact a highly complex issue, and will require serious resources and an awful lot of development?
(Honestly - in my mind it seems somewhat moderate - so I'm wondering exactly what it is I'm missing. Note: simple in a relative sense ... I would compare it to easy (density/count) through to difficult (word stemming/base/thesaurusing).)

So - references/suggestions/ideas/thoughts???


Comments (3)

謸气贵蔟 2024-11-05 05:48:07

I also thought of Hamming distance as Felix Kling commented. Maybe you can make some variant, where you encode your words into specific codewords and then check their distances through an array that holds your codewords.

So if you have array[11, 02, 85, 37, 11], you can easily find that 11 has a maximum distance of 4 in this array.
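A minimal sketch of that idea, assuming "distance" means the number of word positions between two terms. The helper name `min_word_distance` and the ASCII-only tokenising are assumptions for illustration, not something from the thread. It walks the text once, so the cost grows linearly with document size:

```php
<?php
// Hypothetical helper: smallest gap, in word positions, between any
// occurrence of $termA and any occurrence of $termB. Returns null if
// either term is missing. Assumes plain ASCII text.
function min_word_distance(string $text, string $termA, string $termB): ?int
{
    // Lower-case the text and pull out word tokens.
    preg_match_all("/[a-z0-9']+/", strtolower($text), $matches);
    $words = $matches[0];

    $termA = strtolower($termA);
    $termB = strtolower($termB);
    $lastA = null; // position of the most recent $termA
    $lastB = null; // position of the most recent $termB
    $best  = null; // smallest gap seen so far

    foreach ($words as $i => $word) {
        if ($word === $termA) {
            $lastA = $i;
        }
        if ($word === $termB) {
            $lastB = $i;
        }
        if ($lastA !== null && $lastB !== null && $lastA !== $lastB) {
            $gap = abs($lastA - $lastB);
            if ($best === null || $gap < $best) {
                $best = $gap;
            }
        }
    }
    return $best;
}

$text = 'The quick brown fox jumps over the lazy dog while the fox sleeps';
echo min_word_distance($text, 'fox', 'dog'), "\n"; // prints 3
```

Here "fox" sits at positions 3 and 11 and "dog" at position 8, so the closest pair is 3 words apart. Passing the same term twice returns null in this sketch, since it only compares two different terms.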

Don't know if this would work for you, but I think I would do it in a similar manner.

卷耳 2024-11-05 05:48:07

If you are speaking about specific word comparisons, you will want to look at the SOUNDEX function of MySQL. (I will assume you may be using MySQL.) When comparing two words, you can get a reference to how they sound:

SELECT `word` FROM `list_of_words` WHERE SOUNDEX(`word`) = SOUNDEX('{TEST_WORD}');

Then when you get your list of words (as most likely you will get quite a few), you can check the distance between those words and your word to find the one that is closest (or the group of words, depending on how you write your code).

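$word = '{WORD TO CHECK}';
$threshold = 4; // the smaller the distance, the closer the word
$similar_word = null; // stays null if nothing is closer than the threshold
// $word_results is the list of words returned by the query above
foreach ($word_results as $comparison_word) {
   $distance = levenshtein($comparison_word, $word);
   if ($distance < $threshold) {
      // closest word so far: tighten the threshold and remember it
      $threshold = $distance;
      $similar_word = $comparison_word;
   }
}
echo $similar_word;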
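If the words are already in a PHP array rather than a table, roughly the same two steps can be sketched with PHP's own `soundex()` and `levenshtein()`. The helper name `closest_sounding_word` and the sample words are made up for illustration. Note that PHP's soundex key is always four characters, while MySQL's SOUNDEX can return a longer string, so the two may not group words identically:

```php
<?php
// Hypothetical helper: among $candidates, keep the words that sound like
// $word (same soundex() key) and return the one with the smallest
// levenshtein() distance, or null if nothing sounds alike.
function closest_sounding_word(string $word, array $candidates): ?string
{
    $key = soundex($word);
    $best = null;
    $bestDistance = PHP_INT_MAX;

    foreach ($candidates as $candidate) {
        if (soundex($candidate) !== $key) {
            continue; // does not sound like $word, skip it
        }
        $distance = levenshtein($candidate, $word);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $candidate;
        }
    }
    return $best;
}

$words = ['proximity', 'proxy', 'approximate', 'proximate'];
var_dump(closest_sounding_word('proximty', $words));
```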

Hope that helps you find the direction you are looking for.

Happy coding!

○闲身 2024-11-05 05:48:07

Your example searched Word1 ... Word2; should Word2 ... Word1 also be matched? A simple solution is to use a regex:

i.e.:

  1. use regex: \bWord1\b(.*)\bWord2\b
  2. in the first match group, use space (or whatever boundary) to split it into an array, and count
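The two steps above could look roughly like this in PHP. It is only a sketch: the helper name `words_between` is made up, and it uses a lazy `(.*?)` instead of `(.*)` so the match stops at the nearest Word2 rather than the last one. The `i` and `s` flags (case-insensitive, dot matches newlines) are assumptions too:

```php
<?php
// Hypothetical helper: counts the words between the first $word1 and the
// nearest following $word2. Returns null when the pair does not occur in
// that order.
function words_between(string $text, string $word1, string $word2): ?int
{
    $pattern = '/\b' . preg_quote($word1, '/') . '\b(.*?)\b'
             . preg_quote($word2, '/') . '\b/is';
    if (preg_match($pattern, $text, $m) !== 1) {
        return null;
    }
    // Split the captured gap on whitespace and drop empty pieces.
    $gap = preg_split('/\s+/', $m[1], -1, PREG_SPLIT_NO_EMPTY);
    return count($gap);
}

echo words_between('the fox jumps over the lazy dog', 'fox', 'dog'), "\n"; // prints 4
```

To also match Word2 ... Word1, one option is to call it a second time with the arguments swapped and take the smaller non-null result.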

This is the most straightforward method, but definitely not the best one (performance-wise, for instance). I think you need to clarify your needs if you want a more specific answer.

Update:

After the 2 questions were merged, I see other answers mentioning soundex, Levenshtein and Hamming distance etc. I would suggest that theclueless1 clarify the requirements so that people can give useful help. If this is an application related to searching or document clustering, I also suggest you take a look at mature full-text indexing/searching solutions such as Sphinx or Lucene. I think either of them can be used with PHP.
