Solr common keywords/phrases
I am using Solr through PHP for searching all aspects of my site. I am trying to implement a feature and can't find any information on how to accomplish it.
I have a group of documents (reviews), each about a specific product.
I want to find unique 1-2 word keywords (no stop words) that appear in multiple reviews for a single product, with a count for how many reviews they appear in.
Once I have that, I want to show the top X keywords, the number of reviews each one appears in, and a single top review for each one with the keyword highlighted.
EDIT:
Once I have a list of unique keywords (excluding stop words and other common words) that appear in multiple reviews, I want to rank them by the number of reviews they appear in. For example, if people are writing reviews about cameras, the keywords might look like this:
expensive (appears in 7 reviews)
shutter speed (appears in 5 reviews)
poor image (appears in 3 reviews)
Once I have those keywords ranked by number of reviews, I want to select 1 review per keyword and show those reviews highlighting the keyword. For example:
"... unfortunately this camera is far too EXPENSIVE for what you get ..." (in 7 reviews)
"... the SHUTTER SPEED is far too slow for ..." (in 5 reviews)
"... the POOR IMAGE quality is this camera's biggest downfall ..." (in 3 reviews)
As for when to run this, I'm still not sure. It could be in real time (when you view a product, then cached for X time), it could be triggered whenever a new review is posted (marking the product to be updated), or it could be a daily cronjob. It will not be run against all keywords at once; it will be run against all keywords in all reviews for a single product, then repeated for each product.
Hope that makes more sense.
Any help on how to accomplish this in Solr would be greatly appreciated.
Comments (3)
It sounds to me like what you're looking for is the ShingleFilter. You can use it to produce unigrams/bigrams (probably in a separate field populated with a copyField) and then get stats on those tokens to generate your interface.
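To make the idea concrete, here is a rough Python sketch of what unigram/bigram shingling plus a per-review count amounts to. It is not Solr code: the `STOP_WORDS` list and the function names are made up for illustration. It also drops stop words before building bigrams, so a bigram can span a removed word, which as far as I know differs from how Solr's own filter chain treats the gaps left by a stop filter.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "a", "for", "too", "this", "and", "of"}

def shingles(text, max_size=2):
    """Return the set of 1..max_size word shingles in text, stop words removed."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    found = set()
    for size in range(1, max_size + 1):
        for i in range(len(words) - size + 1):
            found.add(" ".join(words[i:i + size]))
    return found

def review_counts(reviews, max_size=2):
    """Count, for each shingle, how many reviews contain it at least once."""
    counts = Counter()
    for text in reviews:
        # shingles() returns a set, so each review adds at most 1 per shingle.
        counts.update(shingles(text, max_size))
    return counts
```

Calling `review_counts(reviews).most_common(10)` on the reviews of one product would give a "top X keywords" list ranked by number of reviews; inside Solr, the closest equivalent is probably faceting on the shingled field with a filter on the product.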
This task is not particularly well suited to Solr. The only thing you gain from using Solr is the stemming/stop word support, which would be much faster if implemented in a local algorithm. I would create a new "review_keyword" table in the database that maps reviews to single keywords and keyword pairs. When inserting a new review, also add one mapping row for each keyword in the review (this is where stemming and stop words come in). When you want the top keywords in the reviews for a product, plus a review for each keyword, run a join select across this table. Depending on your usage, this is probably better run on updates than on queries.
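A minimal sketch of that table layout, using Python's built-in `sqlite3` so it runs as-is. Only the `review_keyword` name comes from the answer; the column names, the `review` table, and the helper functions are assumptions for illustration, and the keywords are assumed to be stemmed and stop-word-filtered before they reach `add_review`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id INTEGER PRIMARY KEY, product_id INTEGER, body TEXT);
    CREATE TABLE review_keyword (
        review_id INTEGER REFERENCES review(id),
        keyword   TEXT,
        PRIMARY KEY (review_id, keyword)  -- one row per keyword per review
    );
""")

def add_review(conn, review_id, product_id, body, keywords):
    """Insert a review and one mapping row per distinct keyword."""
    conn.execute("INSERT INTO review VALUES (?, ?, ?)", (review_id, product_id, body))
    conn.executemany(
        "INSERT OR IGNORE INTO review_keyword VALUES (?, ?)",
        [(review_id, kw) for kw in set(keywords)],
    )

def top_keywords(conn, product_id, limit=10):
    """Return (keyword, review_count, sample_review_id) for one product."""
    return conn.execute(
        """
        SELECT rk.keyword, COUNT(*) AS review_count, MIN(r.id) AS sample_review_id
        FROM review_keyword AS rk
        JOIN review AS r ON r.id = rk.review_id
        WHERE r.product_id = ?
        GROUP BY rk.keyword
        HAVING COUNT(*) > 1            -- only keywords found in multiple reviews
        ORDER BY review_count DESC, rk.keyword
        LIMIT ?
        """,
        (product_id, limit),
    ).fetchall()
```

Here the sample review is simply the lowest review id for each keyword; picking a "top" review would need some rating or helpfulness column, which the question does not specify.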
This looks like a job for a text parser rather than Solr. You will need a script, probably in Python (since it has good text-parsing libraries), that looks at all the words in the reviews and gives you the top-occurring words, with their counts, within each review or across all reviews. You can then take a few words on either side of each top-occurring word, build an abstract for your document (the product, in this case), and index it in Solr so that it is returned as part of the search result.
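For the "few words on either side" part, something along these lines could work. It is only a sketch using the standard library: the `highlight_snippet` name and the `context` parameter are invented here, matching is exact (no stemming), and the keyword is assumed to be one or two plain words.

```python
import re

def highlight_snippet(text, keyword, context=4):
    """Return the keyword upper-cased with up to `context` words on each side.

    Returns None if the keyword does not occur in the text.
    """
    words = text.split()
    key = keyword.lower().split()
    # Compare on lower-cased words with punctuation stripped.
    normalized = [re.sub(r"\W+", "", w).lower() for w in words]
    for i in range(len(words) - len(key) + 1):
        if normalized[i:i + len(key)] == key:
            start = max(0, i - context)
            end = min(len(words), i + len(key) + context)
            middle = " ".join(words[i:i + len(key)]).upper()
            parts = words[start:i] + [middle] + words[i + len(key):end]
            return "... " + " ".join(parts) + " ..."
    return None
```

The returned string is what you would store as the abstract, or render directly next to the keyword's review count.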