PHP word indexing, performance and reasonable results

Posted 2024-09-11 02:01:44, 1223 characters, 10 views, 0 comments

I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields look like:

  Field_id   Field_type   Field_name   Field_Data
- 101        text         Name         Intel i7
- 102        integer      Cores        4 physical, 4 virtual
- 103        select       Vendor       Intel
- 104        multitext    Description  The i7 is intel's next gen range of cpus.

The indexer would generate the following results/index:

  Keyword    Occurrences
- intel      101, 103, 104
- i7         101, 104
- physical   102
- virtual    102
- next       104
- gen        104
- range      104
- cpus       104   (*)
- cpu        104   (*)
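
The keyword table above can be reproduced with a small inverted-index sketch. It is written in Python purely for illustration (the question targets PHP), and everything in it is an assumption rather than the asker's actual indexer: the names `FIELDS`, `STOP_WORDS`, `tokenize` and `build_index` are made up, the stop word list is hand-picked for this sample, possessive `'s` is stripped, and bare numbers are skipped. It does no stemming, so it yields "cpus" but not "cpu".

```python
import re
from collections import defaultdict

# Sample rows copied from the question: (field_id, field_data).
FIELDS = [
    (101, "Intel i7"),
    (102, "4 physical, 4 virtual"),
    (103, "Intel"),
    (104, "The i7 is intel's next gen range of cpus."),
]

# Assumption: a tiny hand-picked stop word list, just enough for this sample.
STOP_WORDS = {"the", "is", "of"}


def tokenize(text):
    """Lowercase the text, drop possessive 's, and return alphanumeric tokens."""
    text = re.sub(r"'s\b", "", text.lower())
    return re.findall(r"[a-z0-9]+", text)


def build_index(fields):
    """Map each keyword to the sorted list of field ids it occurs in."""
    index = defaultdict(set)
    for field_id, data in fields:
        for token in tokenize(data):
            # Skip stop words and bare numbers such as the "4" in field 102.
            if token in STOP_WORDS or token.isdigit():
                continue
            index[token].add(field_id)
    return {word: sorted(ids) for word, ids in index.items()}


index = build_index(FIELDS)
print(index["intel"])  # [101, 103, 104]
print(index["i7"])     # [101, 104]
```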

So it all looks fairly nice, but there are some issues I'd like to sort out:

- Filtering out common words (as you perhaps noticed, "the", "is", "of" and "intel's" are missing from the list).
- With regard to "cpus" (plurals vs singulars), would it be best to use one particular form (singular or plural), both, or the exact word (ie, "cpus" is different from "cpu")?
- Continuing with the previous item, how can I determine a plural (different flavors: test=>tests, fish=>fish and leaf=>leaves)?
- I'm currently using MySQL and I'm very concerned about performance; we have 500+ categories and we haven't even launched the site yet.
- Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name); do you think there would be a huge impact on the SQL server?
- Search throttling: I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
- There were other issues which I probably forgot about; if you spot any, you're welcome to yell at me ;-)
- I do not need the search engine to crawl links; in fact, I specifically want it to not crawl links.

(by the way, I'm not biased towards intel, it simply happens that I own an i7-based pc ;-) )

对风讲故事 2024-09-18 02:01:44

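Get a list of stop words (non-keywords) from here; the author has even formatted them as PHP for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/

Then just run a preg_replace on the string you are indexing.

What I have done in the past is strip suffixes such as "s" and "ed" with a regular expression, and apply the same regex to the search string. It wasn't ideal, though; that was a basic site with only 200 pages.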
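
As a rough illustration of that approach, here is a minimal sketch in Python (PHP's preg_replace would play the role of `re.sub`). The stop word list, the two regexes and the function name `normalize` are assumptions made for the example, not the code from the site mentioned above. The important part is that the same function runs over both the indexed text and the search string.

```python
import re

# Assumption: a short illustrative stop word list; a real one is much longer.
STOP_WORDS = ["the", "is", "of", "a", "an", "and"]

# One pattern that matches any stop word as a whole word.
STOP_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b")

# Naive suffix stripping: drop a trailing "ed" or "s" after at least 3 characters.
SUFFIX_RE = re.compile(r"\b(\w{3,}?)(?:ed|s)\b")


def normalize(text):
    """Remove stop words, strip simple suffixes, and return the remaining tokens."""
    text = STOP_RE.sub(" ", text.lower())
    text = SUFFIX_RE.sub(r"\1", text)
    return re.findall(r"\w+", text)


print(normalize("Tested ranges of the CPUs"))  # ['test', 'range', 'cpu']
print(normalize("cpu ranges"))                 # ['cpu', 'range']
```

The weakness is easy to see: a word like "boss" is cut down to "bos", and irregular plurals such as "leaves" are not mapped to "leaf".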

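If you are worried about performance, you may want to consider a search engine such as Lucene (Solr) instead of the database. It will make indexing much easier. You don't want to reinvent the wheel here.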

萌辣 2024-09-18 02:01:44

This is in response to your original question, and your later answer/question.

I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.

I'm sure there are other ways to do this, both with your own custom code, or with other search engines—Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.

I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.

In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:

filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)

11.2.8. stopwords

Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.

With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different from "cpu")?

11.2.9. wordforms

Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (eg. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
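
To make the idea concrete, here is a minimal Python sketch of a word forms table that takes priority over a stemmer. It only mirrors the behaviour described in the quote; it is not how Sphinx implements it. The table contents, `naive_stem` and `normalize` are hypothetical names invented for the example.

```python
# Hypothetical word forms table: each surface form maps to its normal form.
WORDFORMS = {
    "walks": "walk",
    "walked": "walk",
    "walking": "walk",
    "leaves": "leaf",  # irregular plural that a suffix rule would get wrong
}


def naive_stem(word):
    """Stand-in stemmer (assumption): strip one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def normalize(word, wordforms=WORDFORMS, stem=naive_stem):
    """Apply the word forms table first; only unmapped words reach the stemmer."""
    word = word.lower()
    if word in wordforms:
        return wordforms[word]
    return stem(word)


print(normalize("leaves"))   # leaf (from the table, stemmer skipped)
print(normalize("cpus"))     # cpu  (from the stand-in stemmer)
print(normalize("Walking"))  # walk
```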

Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)

Sphinx supports the Porter Stemming Algorithm

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
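
For a feel of what such a stemmer does with plurals, the sketch below implements only step 1a of the Porter algorithm (the plural-ending rules) in Python. It is a fragment for illustration, not the full stemmer and not Sphinx's code; the function name is made up.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter stemmer, which handles plural endings."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


for w in ("tests", "fish", "leaves", "cpus"):
    print(w, "->", porter_step_1a(w))
```

With this step alone, "tests" becomes "test", "fish" is unchanged, "cpus" becomes "cpu", and "leaves" becomes "leave" rather than "leaf". A stemmer produces a common stem, not a dictionary singular, which is why irregular forms are better handled through a word forms list.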

Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?

3.2. Attributes

A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to set up filtering, sorting, and grouping.

You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):

field search operator:
@vendor intel
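
If users type "vendor:intel", the application can rewrite that into the "@vendor intel" form before handing the query to the search engine. A minimal Python sketch of such a rewrite follows; the regex and the function name `to_extended_query` are assumptions for illustration, and the code does not check that the field actually exists in the index.

```python
import re

# Matches "name:value" terms such as "vendor:intel".
FIELD_TERM = re.compile(r"\b([A-Za-z_]\w*):(\S+)")


def to_extended_query(user_query):
    """Rewrite "field:value" terms into the "@field value" form."""
    return FIELD_TERM.sub(
        lambda m: "@{} {}".format(m.group(1).lower(), m.group(2)),
        user_query,
    )


print(to_extended_query("vendor:intel"))  # @vendor intel
```

This is deliberately naive: anything that looks like "word:something" is rewritten, so input such as a URL would need to be handled separately.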

How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?

8.6.1. Query

On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:

"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).

"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.

"total_found":
Total amount of matching documents in index (that were found and processed on server).

"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").

"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.

"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.

See also Listing 11 and Listing 13 from Build a custom search engine with PHP.
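
To show how such a result set ties matches back to field ids, here is a small Python sketch over a made-up result shaped like the description above. It assumes each field row was indexed as one document, so the document id is the field id. The "weight" and "attrs" key names inside each match, the `field_type` attribute, all the numbers, and the function name `ranked_ids` are assumptions for illustration only.

```python
# Made-up sample shaped like the description above; the "weight" and "attrs"
# key names inside each match are assumptions, and the numbers are invented.
result = {
    "matches": {
        101: {"weight": 2, "attrs": {"field_type": "text"}},
        104: {"weight": 1, "attrs": {"field_type": "multitext"}},
    },
    "total": 2,
    "total_found": 2,
    "words": {"intel": {"docs": 2, "hits": 2}},
    "error": "",
    "warning": "",
}


def ranked_ids(result):
    """Return matched document ids, highest weight first; raise on a query error."""
    if result.get("error"):
        raise RuntimeError(result["error"])
    matches = result.get("matches", {})
    return sorted(matches, key=lambda doc_id: matches[doc_id]["weight"], reverse=True)


print(ranked_ids(result))  # [101, 104]
```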