Search query tokenizer
We're trying to add a simple search functionality to our website that lists restaurants. We try to detect the place name, location, and place features from the search string, something like "cheap restaurants near cairo" or "chinese and high-end food in virginia".
What we are doing right now is tokenizing the query and searching the tables with the lowest performance cost first (the table of prices (cheap, budget, expensive, high-end) is smaller than the tables of the places list). Is this the right approach?
--
Regards.
Yehia
Comments (2)
I'd say you should build sets of synonyms (e.g. cheap, low budget, etc. go into synset:1) and map each token from the search string to one of those groups.
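A minimal sketch of that mapping, assuming whitespace tokenization and single-word synonyms. The synset ids, the vocabulary, and the `map_query` helper are all made up for illustration and are not part of any existing schema:

```python
# Hypothetical synonym sets; the ids and words are illustrative only.
SYNSETS = {
    "price:cheap": {"cheap", "inexpensive", "low-budget"},
    "price:high-end": {"high-end", "upscale"},
    "cuisine:chinese": {"chinese"},
    "location:cairo": {"cairo"},
    "location:virginia": {"virginia"},
}

# Reverse lookup: surface token -> synset id.
TOKEN_TO_SYNSET = {
    token: synset_id
    for synset_id, tokens in SYNSETS.items()
    for token in tokens
}


def map_query(query):
    """Return (synset ids found in the query, tokens that matched nothing)."""
    matched, unmatched = [], []
    for token in query.lower().split():
        synset_id = TOKEN_TO_SYNSET.get(token)
        if synset_id is None:
            unmatched.append(token)
        elif synset_id not in matched:
            matched.append(synset_id)
    return matched, unmatched
```

With this toy vocabulary, "cheap restaurants near cairo" maps to the two synsets `price:cheap` and `location:cairo`, while "restaurants" and "near" come back as unmatched. Multi-word synonyms such as "low budget" would need phrase matching, which this sketch leaves out.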
By the way, spelling mistakes will be easy to handle here, since this is generally a pretty small search space. Edit distance, common k-grams, ... anything should be alright.
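As one possible way to do the edit-distance variant, here is a rough sketch using a plain Levenshtein distance and a linear scan over the vocabulary. The `VOCABULARY` list, the `correct_token` helper, and the threshold of 2 are assumptions chosen for illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Hypothetical vocabulary: every word that belongs to some synonym set.
VOCABULARY = ["cheap", "budget", "expensive", "high-end",
              "chinese", "cairo", "virginia"]


def correct_token(token, vocabulary=VOCABULARY, max_distance=2):
    """Return the closest vocabulary word within max_distance, else None."""
    best = min(vocabulary, key=lambda word: edit_distance(token, word))
    return best if edit_distance(token, best) <= max_distance else None
```

A linear scan is fine for a vocabulary of this size. A fixed threshold can over-correct very short tokens, so scaling `max_distance` with the token length is probably worth considering.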
As a next step, you should build an inverted index that maps each of those syn-groups to a sorted list of the restaurants that can be associated with that property. For each syn-group in a query, get all those lists and simply intersect them.
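The intersection step could look roughly like this. The index contents, the restaurant ids, and the `search` helper are invented for the example; in practice the lists would be built from your own tables:

```python
# Hypothetical inverted index: synset id -> ascending list of restaurant ids.
INVERTED_INDEX = {
    "price:cheap": [1, 3, 4, 7, 9],
    "cuisine:chinese": [2, 3, 7, 8],
    "location:cairo": [3, 5, 7, 9],
}


def intersect_sorted(left, right):
    """Intersect two ascending lists of ids with a linear merge."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def search(synset_ids, index=INVERTED_INDEX):
    """Return ids of restaurants that have every requested property."""
    postings = [index.get(synset_id, []) for synset_id in synset_ids]
    if not postings:
        return []
    # Start with the shortest list so intermediate results stay small.
    postings.sort(key=len)
    result = list(postings[0])
    for other in postings[1:]:
        result = intersect_sorted(result, other)
    return result
```

Starting with the shortest list plays the same role as querying the smallest table first: the candidate set shrinks as early as possible. Because the lists are kept sorted, each intersection is a single linear pass.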
Words that cannot be mapped to one of those synsets will probably have to be ignored, unless you have some sort of full text about the restaurants that you could index as well. In that case you can also build such restaurant lists for "normal" words and intersect them as well. But this would already be quite close to a classical search engine, and it might be a good idea to use a technology like Apache Lucene. Without full text I don't think you'd need such a thing, because an inverted index of syn-groups is really easy to process on your own.
Seems you may be missing how misspelled queries are handled.