I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields look like:
Field_id  Field_type  Field_name   Field_Data
101       text        Name         Intel i7
102       integer     Cores        4 physical, 4 virtual
103       select      Vendor       Intel
104       multitext   Description  The i7 is intel's next gen range of cpus.
The indexer would generate the following results/index:
Keyword   Occurrences
intel     101, 103, 104
i7        101, 104
physical  102
virtual   102
next      104
gen       104
range     104
cpus      104 (*)
cpu       104 (*)
This all looks fine so far, but there are some issues I'd like to sort out:
- Filtering out common words (as you may have noticed, "the", "is", "of" and "intel's" are missing from the list).
- With regard to "cpus" (plural vs. singular): would it be best to index one particular form (singular or plural), both, or the exact word (i.e. "cpus" is different from "cpu")?
- Continuing from the previous item, how can I determine a plural (different flavors: test=>tests, fish=>fish and leaf=>leaves)?
- I'm currently using MySQL and I'm very concerned about performance; we have 500+ categories and we haven't even launched the site yet.
- Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name). Do you think there would be a huge impact on the SQL server?
- Search throttling: I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
- There are other issues which I've probably forgotten about; if you spot any, you're welcome to yell at me ;-)
- I do not need the search engine to crawl links; in fact, I specifically want it not to crawl links.
(By the way, I'm not biased towards Intel; it simply happens that I own an i7-based PC ;-) )
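For readers who want something concrete, the keyword-to-field index described above can be sketched roughly as follows. This is only an illustration in Python (the same idea ports to PHP), not the asker's actual indexer: the `FIELDS` rows mirror the sample data, while `STOP_WORDS`, `tokenize` and `build_index` are hypothetical names, and the choices to strip a trailing "'s" and to drop purely numeric tokens are assumptions made to reproduce the sample index.

```python
import re
from collections import defaultdict

# Sample rows from the question: (field_id, field_type, field_name, field_data).
FIELDS = [
    (101, "text", "Name", "Intel i7"),
    (102, "integer", "Cores", "4 physical, 4 virtual"),
    (103, "select", "Vendor", "Intel"),
    (104, "multitext", "Description", "The i7 is intel's next gen range of cpus."),
]

# Illustrative stop list only; a real one would be much longer.
STOP_WORDS = {"the", "is", "of"}


def tokenize(text):
    """Return lowercase keywords, without stop words or bare numbers."""
    words = re.findall(r"[a-z0-9]+(?:'s)?", text.lower())
    tokens = []
    for word in words:
        if word.endswith("'s"):
            word = word[:-2]  # assumption: treat "intel's" as "intel"
        if word.isdigit() or word in STOP_WORDS:
            continue
        tokens.append(word)
    return tokens


def build_index(fields):
    """Map each keyword to the sorted list of field ids that contain it."""
    index = defaultdict(set)
    for field_id, _field_type, _field_name, data in fields:
        for token in tokenize(data):
            index[token].add(field_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


index = build_index(FIELDS)
```

No stemming is done here, so "cpus" is indexed but "cpu" is not; that is exactly the plural/singular question raised in the list above.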
Grab a list of stop words (non-keywords) from here; the guy has even formatted them as PHP for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
Then simply do a preg_replace on the string you are indexing.
What I've done in the past is remove suffixes like 's', 'ed' etc. with a regex and use the same regex on the search string. It's not ideal though. This was for a basic website with only 200 pages.
If you are concerned about performance, you might want to consider using a search engine like Lucene (Solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.
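A minimal sketch of the stop-word and suffix-stripping approach described above, written in Python for illustration (`re.sub` plays the role of `preg_replace`). The stop list is a tiny made-up subset, and `normalize` and `SUFFIX_RE` are hypothetical names. The key point is that the same function runs over both the indexed text and the search string.

```python
import re

# Illustrative subset; a real list (such as the one linked above) is far longer.
STOP_WORDS = {"the", "is", "of", "a", "an", "and"}

# Crude suffix stripper: drops a trailing "ed" or "s", but only when at
# least three characters remain in front of it.
SUFFIX_RE = re.compile(r"(?<=\w{3})(?:ed|s)$")


def normalize(text):
    """Lowercase, drop stop words, and strip simple suffixes from each word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    kept = [word for word in words if word not in STOP_WORDS]
    return [SUFFIX_RE.sub("", word) for word in kept]


indexed_terms = normalize("The indexed pages")
query_terms = normalize("page index")
```

As the answer says, this is not ideal: a word like "class" would be cut down to "clas". Because the index and the query are mangled the same way, matches still line up, which is why it can be good enough for a small site.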
This is in response to your original question, and your later answer/question.
I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.
I'm sure there are other ways to do this, either with your own custom code or with other search engines; Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.
I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.
In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
11.2.8. stopwords
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
11.2.9. wordforms
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Sphinx supports the Porter Stemming Algorithm
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
3.2. Attributes
You can also use the extended query syntax (5.3. Extended query syntax) to search specific fields, as opposed to filtering results by attributes.
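As a rough sketch of how a "vendor:intel" style search could be mapped onto that syntax: Sphinx's extended query syntax uses a field operator of the form `@field keyword`. The Python helper below is hypothetical (`to_extended_query` and `ALLOWED_FIELDS` are made-up names), and it assumes the Sphinx index defines full-text fields with these names. Unrestricted terms are emitted first because a field operator applies to the keywords that follow it. Escaping of Sphinx's special characters is not handled here.

```python
# Hypothetical whitelist of full-text fields assumed to exist in the index.
ALLOWED_FIELDS = {"name", "cores", "vendor", "description"}


def to_extended_query(user_query):
    """Rewrite 'field:term' tokens as '@field term'; leave other tokens alone."""
    free_terms = []
    field_terms = []
    for token in user_query.split():
        field, sep, term = token.partition(":")
        if sep and term and field.lower() in ALLOWED_FIELDS:
            field_terms.append(f"@{field.lower()} {term}")
        else:
            free_terms.append(token)
    # Field-restricted parts go last so they do not capture the free terms.
    return " ".join(free_terms + field_terms)


example = to_extended_query("vendor:intel i7")
```

Restricting field names to a whitelist also covers the "only allow looking up particular columns" concern, since unknown prefixes are simply treated as ordinary search terms.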
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
8.6.1. Query
Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
Find (or create) a list of common words and filter user input.
Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.
Create an Inflector method or class, i.e. Inflect::plural('fish') gives you 'fish'. There might be classes like these for the English language; look them up.
Having good schema and code design helps, but I can't really give you much advice on that one.
That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
Not many options here. To help with this and with performance in general, you should consider some form of caching.
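To make the Inflector idea above a little more concrete, here is a small sketch in Python (chosen for illustration; the `Inflect::plural` call above would be the PHP equivalent). The word lists and rules are illustrative assumptions covering only the three flavors from the question plus a few common patterns; they are nowhere near a complete description of English plurals.

```python
import re

# Words whose plural equals the singular (illustrative, not exhaustive).
UNCOUNTABLE = {"fish", "sheep", "series"}

# Irregular forms handled by lookup (illustrative, not exhaustive).
IRREGULAR = {"leaf": "leaves", "man": "men", "child": "children"}
IRREGULAR_REVERSE = {plural: singular for singular, plural in IRREGULAR.items()}

# (pattern, replacement) pairs, tried in order; the first match wins.
PLURAL_RULES = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),
    (re.compile(r"([^aeiou])y$"), r"\1ies"),
    (re.compile(r"$"), "s"),
]
SINGULAR_RULES = [
    (re.compile(r"([^aeiou])ies$"), r"\1y"),
    (re.compile(r"(s|x|z|ch|sh)es$"), r"\1"),
    (re.compile(r"(?<!s)s$"), ""),
]


def _apply(word, special, rules):
    """Apply the lookup table first, then the first matching regex rule."""
    lower = word.lower()
    if lower in UNCOUNTABLE:
        return lower
    if lower in special:
        return special[lower]
    for pattern, replacement in rules:
        if pattern.search(lower):
            return pattern.sub(replacement, lower, count=1)
    return lower


class Inflect:
    @staticmethod
    def plural(word):
        return _apply(word, IRREGULAR, PLURAL_RULES)

    @staticmethod
    def singular(word):
        return _apply(word, IRREGULAR_REVERSE, SINGULAR_RULES)
```

One possible design, under these assumptions, is to store only `Inflect.singular(word)` in the index and run the search terms through the same method, so that "cpus" and "cpu" end up as the same keyword.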
I would heartily suggest you take a look at Solr. It's a Java-based, self-contained search and indexing system and probably has more benefits than a PHP solution.
Search is tough to implement. Would recommend using a package if you're new to it.
Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?
Since many are suggesting that I use an existing package (and I want to make it harder for you than just suggesting a package ;-) ), let's presume I will use such a package (over in this answer thread).
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
That's not the question I want answered, at least not directly. My issue is, how easy is it to make the search engine work as I want?
Given my above requirements, is this even possible/feasible?
From personal experience, I'd rather spend some time tweaking my own system than fixing someone else's code, which I would first have to spend far more time understanding.
Call me conservative, but I rarely stick to someone else's code/programs, and when I do, it is because of a desperate situation, and I usually end up somehow contributing to said project.
There's a PHP implementation of a Brill part-of-speech tagger on php/ir. This might provide a framework for identifying those words that should be discarded and those you want to index, while it also identifies plurals (and the root singular). It's not perfect, but with a custom dictionary to handle technical terms it could prove useful for resolving your first three questions.