PHP word indexing, performance and reasonable results

Posted on 2024-09-11 02:01:44

I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields look like:

  Field_id   Field_type   Field_name    Field_Data
  101        text         Name          Intel i7
  102        integer      Cores         4 physical, 4 virtual
  103        select       Vendor        Intel
  104        multitext    Description   The i7 is intel's next gen range of cpus.

The indexer would generate the following results/index:

  Keyword    Occurrences
  intel      101, 103, 104
  i7         101, 104
  physical   102
  virtual    102
  next       104
  gen        104
  range      104
  cpus       104   (*)
  cpu        104   (*)

So far this looks fine; however, there are some issues I'd like to sort out:

  • Filtering out common words (as you may have noticed, "the", "is", "of" and "intel's" are missing from the list).
  • With regard to "cpus" (plurals vs. singulars): would it be best to index one particular form (singular or plural), both, or the exact word (i.e. "cpus" is different from "cpu")?
  • Continuing from the previous item, how can I determine a plural (different flavors: test=>tests, fish=>fish and leaf=>leaves)?
  • I'm currently using MySQL and I'm very concerned about performance; we have 500+ categories and we haven't even launched the site yet.
  • Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name). Do you think this would have a huge impact on the SQL server?
  • Search throttling: I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
  • There are other issues which I've probably forgotten about; if you spot any, you're welcome to yell at me ;-)
  • I do not need the search engine to crawl links; in fact, I specifically want it not to crawl links.

(By the way, I'm not biased towards Intel; it simply happens that I own an i7-based PC ;-) )
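
For reference, the index described above could be produced along these lines. This is only a minimal sketch: the `$fields` array, the three-word stop list, the `buildIndex()` name, and the decisions to fold "intel's" into "intel" and to skip purely numeric tokens are all assumptions made to reproduce the example table, not a description of the real indexer.

```php
<?php
// Hypothetical rows mirroring the example table: field_id => field_data.
$fields = [
    101 => 'Intel i7',
    102 => '4 physical, 4 virtual',
    103 => 'Intel',
    104 => "The i7 is intel's next gen range of cpus.",
];

// Illustrative stop words only; a real list would be much longer.
$stopWords = ['the', 'is', 'of'];

// Build keyword => sorted list of field ids (assumed helper, not an existing API).
function buildIndex(array $fields, array $stopWords): array
{
    $index = [];
    foreach ($fields as $fieldId => $data) {
        // Lowercase, then drop a possessive 's so "intel's" counts as "intel".
        $text = preg_replace('/\'s\b/', '', strtolower($data));
        // Split on anything that is not a letter or a digit.
        $tokens = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $token) {
            // Skip stop words and purely numeric tokens such as "4".
            if (in_array($token, $stopWords, true) || preg_match('/^\d+$/', $token)) {
                continue;
            }
            $index[$token][$fieldId] = true; // used as a set: one entry per field
        }
    }
    // Turn each set of field ids into a sorted list.
    foreach ($index as $keyword => $ids) {
        $list = array_keys($ids);
        sort($list);
        $index[$keyword] = $list;
    }
    return $index;
}

$index = buildIndex($fields, $stopWords);
foreach ($index as $keyword => $ids) {
    echo $keyword, ': ', implode(', ', $ids), "\n";
}
```

Note that this sketch only emits "cpus" for field 104; whether "cpu" should be indexed as well is exactly the plural question raised in the list above.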


Comments (7)

对风讲故事 2024-09-18 02:01:44

Grab a list of stop words (non-keywords) from here; the author has even formatted them as PHP for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/

Then simply do a preg_replace on the string you are indexing.

What I've done in the past is remove suffixes like 's', 'ed' etc. with a regex and use the same regex on the search string. It's not ideal though. This was for a basic website with only 200 pages.
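
A rough sketch of both steps, assuming a tiny placeholder stop-word list and two hypothetical helpers, `removeStopWords()` and `stripSuffix()`. The suffix rule here is deliberately naive (it only strips a trailing "ed" or "s" when at least three characters precede it) and is not a real stemmer.

```php
<?php
// Illustrative stop words only; substitute a full list such as the one linked above.
$stopWords = ['the', 'is', 'of', 'a', 'an', 'and'];

// Remove stop words with a single preg_replace, then tidy up the whitespace.
function removeStopWords(string $text, array $stopWords): string
{
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopWords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    $text = preg_replace($pattern, '', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Naive suffix stripping: drop a trailing "ed" or "s" when at least three
// characters precede it. Apply it to indexed words and search terms alike.
function stripSuffix(string $word): string
{
    return preg_replace('/(?<=\w{3})(?:ed|s)$/i', '', $word);
}

echo removeStopWords("The i7 is intel's next gen range of cpus.", $stopWords), "\n";
echo stripSuffix('cpus'), ' ', stripSuffix('indexed'), "\n";
```

Because the same function runs on both sides, a search for "cpus" and an indexed "cpu" end up as the same key, even though the rule will mangle some words (for example "class").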

If you are concerned about performance, you might want to consider using a search engine like Lucene (Solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.

萌辣 2024-09-18 02:01:44

This is in response to your original question, and your later answer/question.

I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.

I'm sure there are other ways to do this, either with your own custom code or with other search engines; Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.

I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.

In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:

filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)

11.2.8. stopwords

Stopwords are the words that will not
be indexed. Typically you'd put most
frequent words in the stopwords list
because they do not add much value to
search results but consume a lot of
resources to process.

With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?

11.2.9. wordforms

Word forms are applied after
tokenizing the incoming text by
charset_table rules. They essentially
let you replace one word with another.
Normally, that would be used to bring
different word forms to a single
normal form (eg. to normalize all the
variants such as "walks", "walked",
"walking" to the normal form "walk").
It can also be used to implement
stemming exceptions, because stemming
is not applied to words found in the
forms list.
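
To illustrate the idea outside Sphinx, here is a plain-PHP sketch of a word-forms table that takes precedence over stemming. The `$wordForms` entries and both function names are made up for the example, and `crudeStem()` is only a stand-in for a real stemmer.

```php
<?php
// Hypothetical word-forms table: variant => normal form.
$wordForms = [
    'walks'   => 'walk',
    'walked'  => 'walk',
    'walking' => 'walk',
    'leaves'  => 'leaf',
];

// Crude stand-in for a real stemmer: strips one trailing "s" from longer words.
function crudeStem(string $token): string
{
    return preg_replace('/(?<=\w{3})s$/', '', $token);
}

// An explicit word form wins and bypasses stemming, which is how
// stemming exceptions are expressed.
function normalizeToken(string $token, array $wordForms): string
{
    $token = strtolower($token);
    return $wordForms[$token] ?? crudeStem($token);
}

foreach (['Walking', 'leaves', 'tests', 'fish'] as $word) {
    echo $word, ' => ', normalizeToken($word, $wordForms), "\n";
}
```

Without the table entry, the crude stemmer would turn "leaves" into "leave", which is the kind of case a forms list is meant to catch.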

Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)

Sphinx supports the Porter Stemming Algorithm

The Porter stemming algorithm (or
‘Porter stemmer’) is a process for
removing the commoner morphological
and inflexional endings from words in
English. Its main use is as part of a
term normalisation process that is
usually done when setting up
Information Retrieval systems.

Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?

3.2. Attributes

A good example for attributes would be
a forum posts table. Assume that only
title and content fields need to be
full-text searchable - but that
sometimes it is also required to limit
search to a certain author or a
sub-forum (ie. search only those rows
that have some specific values of
author_id or forum_id columns in the
SQL table); or to sort matches by
post_date column; or to group matching
posts by month of the post_date and
calculate per-group match counts.

This can be achieved by specifying all
the mentioned columns (excluding title
and content, that are full-text
fields) as attributes, indexing them,
and then using API calls to setup
filtering, sorting, and grouping.

You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):

field search operator:
@vendor intel
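
A small sketch of how user input such as "vendor:intel" might be rewritten into that syntax before it is handed to the search engine. The `toExtendedQuery()` helper and the `$allowedFields` whitelist are assumptions. As I understand the extended syntax, a field limit applies to the keywords that follow it, so the sketch moves free-text terms to the front.

```php
<?php
// Hypothetical whitelist of full-text field names present in the index.
$allowedFields = ['name', 'vendor', 'description'];

// Rewrite "field:term" tokens into "@field term"; everything else is free text.
function toExtendedQuery(string $input, array $allowedFields): string
{
    $free = [];
    $limited = [];
    foreach (preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (preg_match('/^(\w+):(\w+)$/', $token, $m)
            && in_array(strtolower($m[1]), $allowedFields, true)) {
            $limited[] = '@' . strtolower($m[1]) . ' ' . $m[2];
            continue;
        }
        // Keep word characters only so user input cannot inject query operators.
        $clean = trim(preg_replace('/\W+/', ' ', $token));
        if ($clean !== '') {
            $free[] = $clean;
        }
    }
    // Free terms first, so the field limits do not apply to them.
    return implode(' ', array_merge($free, $limited));
}

echo toExtendedQuery('i7 vendor:intel', $allowedFields), "\n";
```

Unknown prefixes (for example "foo:bar") are treated as ordinary keywords instead of being passed through as field names.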

How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?

8.6.1. Query

On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:

"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).

"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.

"total_found":
Total amount of matching documents in index (that were found and processed on server).

"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").

"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.

"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
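
To make the shape of that hash concrete, the sketch below walks a hand-written sample instead of calling a live searchd. Every value in `$result` is invented, `category_id` is a hypothetical attribute, and treating the document ID as the field_id is an assumption about how the index would be set up.

```php
<?php
// Hand-written sample shaped like the documented result hash; all values are made up.
$result = [
    'matches' => [
        104 => ['weight' => 1, 'attrs' => ['category_id' => 7]],
        101 => ['weight' => 2, 'attrs' => ['category_id' => 7]],
    ],
    'total'       => 2,
    'total_found' => 2,
    'words'       => ['i7' => ['docs' => 2, 'hits' => 2]],
    'error'       => '',
    'warning'     => '',
];

// Flatten the matches into rows, highest weight first (assumed helper).
function summarizeResult(array $result): array
{
    if ($result['error'] !== '') {
        throw new RuntimeException('Search failed: ' . $result['error']);
    }
    $rows = [];
    foreach ($result['matches'] as $docId => $match) {
        $rows[] = ['field_id' => $docId, 'weight' => $match['weight']];
    }
    usort($rows, function ($a, $b) {
        return $b['weight'] <=> $a['weight'];
    });
    return $rows;
}

foreach (summarizeResult($result) as $row) {
    echo $row['field_id'], ' (weight ', $row['weight'], ")\n";
}
```

If each document in the index corresponds to one field row, the keys of "matches" give you the field IDs to bind the found keywords to.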

Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.

梅窗月明清似水 2024-09-18 02:01:44

filtering out common words (as you
perhaps noticed, "the" "is" "of" and
"intel's" are missing from list)

Find (or create) a list of common words and filter user input.

With regards to "cpus" (plurals vs
singulars), would it be best to use a
particular type (singular or plural),
both or exact (ie, "cpus" is different
"cpu")?

Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.

Continuing with previous item, how can
I determine a plural (different
flavors: test=>tests fish=>fish and
leaf=>leaves)

Create an Inflector method or class, e.g. Inflect::plural('fish') gives you 'fish'. There might be classes like this for the English language already; look them up.
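
A minimal sketch of such a class, covering the three flavors from the question. The rule lists are illustrative and far from complete, and `singular()` is an addition beyond the `Inflect::plural()` call mentioned above.

```php
<?php
class Inflect
{
    // Illustrative lists only; a real inflector carries far longer tables.
    private static $uncountable = ['fish', 'sheep', 'series'];
    private static $irregular = ['leaf' => 'leaves', 'man' => 'men', 'child' => 'children'];

    public static function plural(string $word): string
    {
        $lower = strtolower($word);
        if (in_array($lower, self::$uncountable, true)) {
            return $lower;                       // fish => fish
        }
        if (isset(self::$irregular[$lower])) {
            return self::$irregular[$lower];     // leaf => leaves
        }
        if (preg_match('/(s|x|z|ch|sh)$/', $lower)) {
            return $lower . 'es';                // box => boxes
        }
        if (preg_match('/[^aeiou]y$/', $lower)) {
            return substr($lower, 0, -1) . 'ies'; // category => categories
        }
        return $lower . 's';                     // test => tests
    }

    public static function singular(string $word): string
    {
        $lower = strtolower($word);
        if (in_array($lower, self::$uncountable, true)) {
            return $lower;
        }
        $found = array_search($lower, self::$irregular, true);
        if ($found !== false) {
            return $found;
        }
        if (preg_match('/[^aeiou]ies$/', $lower)) {
            return substr($lower, 0, -3) . 'y';
        }
        if (preg_match('/(s|x|z|ch|sh)es$/', $lower)) {
            return substr($lower, 0, -2);
        }
        if (preg_match('/[^s]s$/', $lower)) {
            return substr($lower, 0, -1);
        }
        return $lower;
    }
}

echo Inflect::plural('fish'), ' ', Inflect::plural('test'), ' ', Inflect::plural('leaf'), "\n";
```

Indexing `Inflect::singular($word)` and applying the same call to search terms would make "cpus" and "cpu" resolve to one keyword.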

I'm currently using MySql and I'm very
concerned with performance issues; we
have 500+ categories and we didn't
even launch the site

Having a good schema and code design helps, but I can't really give you much advice on that one.

Let's say I wanted to use the search
term "vendor:intel", where vendor
specifies the field name (field_name),
do you think there would be a huge
impact on the sql server?

That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
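
One way to do that filtering is sketched below. The `keywords` and `fields` table names, their columns, and the `$searchableFields` whitelist are all hypothetical; the point is only that the field name comes from a whitelist and every user-supplied value goes through a placeholder.

```php
<?php
// Hypothetical whitelist: search prefix => value stored in field_name.
$searchableFields = ['name' => 'Name', 'vendor' => 'Vendor', 'cores' => 'Cores'];

// Returns [sql, params] for a prepared statement; table names are assumptions.
function buildKeywordQuery(string $input, array $searchableFields): array
{
    $input = strtolower(trim($input));
    $sql = 'SELECT k.field_id FROM keywords k';
    if (preg_match('/^(\w+):(.+)$/', $input, $m) && isset($searchableFields[$m[1]])) {
        // Known prefix: restrict the lookup to that one field name.
        $sql .= ' JOIN fields f ON f.field_id = k.field_id'
              . ' WHERE f.field_name = ? AND k.keyword = ?';
        return [$sql, [$searchableFields[$m[1]], $m[2]]];
    }
    // No prefix, or an unknown one: search the whole input as a keyword.
    return [$sql . ' WHERE k.keyword = ?', [$input]];
}

list($sql, $params) = buildKeywordQuery('vendor:intel', $searchableFields);
echo $sql, "\n";
// With PDO this would run as: $stmt = $pdo->prepare($sql); $stmt->execute($params);
```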

Search throttling; I don't like this
at all, but it's a possibility, and if
you know of any workarounds, make
yourself heard!

Not many options here. To help with this, and with performance in general, you should consider some form of caching.
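
As a rough illustration, the sketch below caches results per normalized query for a fixed time. `SearchCache` is a hypothetical class, the in-memory array would be replaced by a shared store in practice, and the closure stands in for the real search call.

```php
<?php
// Tiny in-memory cache keyed by the normalized query string.
class SearchCache
{
    private $entries = [];
    private $ttl;

    public function __construct(int $ttlSeconds)
    {
        $this->ttl = $ttlSeconds;
    }

    // $now can be injected for testing; it defaults to the current time.
    public function fetch(string $query, callable $search, ?int $now = null): array
    {
        $now = $now ?? time();
        // Normalize so "Intel  i7" and "intel i7" share one entry.
        $key = preg_replace('/\s+/', ' ', strtolower(trim($query)));
        if (isset($this->entries[$key]) && $this->entries[$key]['expires'] > $now) {
            return $this->entries[$key]['result'];
        }
        $result = $search($key);
        $this->entries[$key] = ['result' => $result, 'expires' => $now + $this->ttl];
        return $result;
    }
}

$calls = 0;
$cache = new SearchCache(60);
$search = function (string $q) use (&$calls): array {
    $calls++;              // stands in for the expensive database lookup
    return [101, 104];
};

$cache->fetch('Intel  i7', $search, 1000);
$cache->fetch('intel i7', $search, 1010);  // served from the cache
$cache->fetch('intel i7', $search, 1100);  // entry expired, searched again
echo $calls, "\n";                         // prints 2
```

Repeated popular queries then hit the database only once per time window, which also takes some pressure off any throttling.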

策马西风 2024-09-18 02:01:44

I would heartily suggest you take a look at Solr. It's a Java-based, self-contained search and indexing system, and it probably has more benefits than a PHP solution.

浅语花开 2024-09-18 02:01:44

Search is tough to implement. I'd recommend using a package if you're new to it.

Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?

笑脸一如从前 2024-09-18 02:01:44

Since many of you are suggesting an existing package (and I want to make it harder for you than just suggesting a package ;-) ), let's presume I will use such a package (continuing in this answer thread).

How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
That's not the question I want answered, at least not directly. My issue is, how easy is it to make the search engine work as I want?
Given my above requirements, is this even possible/feasible?

From personal experience, I'd rather spend some time tweaking my own system than fixing someone else's code, which I'd first have to spend far more time understanding.
Call me conservative, but I rarely stick with someone else's code/programs, and when I have, it was because of a desperate situation, and I usually ended up contributing to said project somehow.

梦明 2024-09-18 02:01:44

There's a PHP implementation of a Brill part-of-speech tagger on php/ir. This might provide a framework for identifying the words that should be discarded and those you want to index, while also identifying plurals (and their root singular). It's not perfect, but with a custom dictionary to handle technical terms it could prove useful for resolving your first three questions.
