How can this Solr 4.0 join query return more results than the *:* query?
I ran into some very odd behaviour that I think is a bug, but I may be wrong or misreading the documentation, so I am asking.
I have a Solr index and am working with the new features of version 4.0.
This is the code I use (I am using the PECL Solr extension):
<?php
// Connection settings for the Solr core being queried.
$options = array(
    'hostname' => '192.168.200.31',
    'path'     => 'solr/slave',
);
$client = new SolrClient($options);
$query = new SolrQuery();
// Join query: this is the variant that returns 38296 documents.
$query->setQuery("{!join from=id to=med_id }type:medium");
// Match-all query: uncomment to override the join above (21867 documents).
#$query->setQuery("*:*");
$query->addFilterQuery('type:product');
$query->addFilterQuery("product_type:tv_free");
$query_response = $client->query($query);
$response = $query_response->getResponse();
echo '<pre>'.print_r($response, true)."</pre>";
?>
The code above returns 38296 documents. However, if I uncomment the line #$query->setQuery("*:*"); so that the query is now *:* and effectively matches every document, I get 21867 documents returned, which I think is the correct number.
If you want to know a bit more about the use case and the thinking behind it, read on, but it is only background information:
I am indexing two types of documents, which I distinguish by the value of the field type:
medium - In my case this is a movie (like Avatar, Casablanca, etc.)
product - These are offers for the movies, like a DVD on Amazon
The reason for this split is that I want filter/facet queries that let the user search, for example, for:
- a movie that was released between 1990 and 1995 (this metadata is stored in the medium document)
- and that is available on Amazon as a DVD for 500 cents or less (this information is stored in the product document)
- and that has the word "jungle" in the movie title (stored in the medium document)
I am doing a search (using dismax) on all documents of type "medium" with "jungle" in the title:
$query->setQuery("{!type=dismax qf='$qf' mm='1' q.alt='*:*'}jungle");
Then I add filter queries like this:
$query->addFilterQuery("{!join from=med_id to=id}provider:amazon");
$query->addFilterQuery("{!join from=med_id to=id}price:[0 TO 500]"); // price is in cents
$query->addFilterQuery("release_year:[1990 TO 1995]");
Note that I need the first two filter queries to be joins against the documents of type product, which have a field called med_id that holds the id of the medium document associated with them.
This all works fine!
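To make the direction of these joins concrete, here is a minimal in-memory sketch of what {!join from=med_id to=id} does conceptually. It is not Solr code: the sample documents, the simulateJoin helper and its closures are all hypothetical and only reuse the field names from the question.

```php
<?php
// Hypothetical sample documents; only the field names come from the question.
$docs = array(
    array('id' => 'm1', 'type' => 'medium', 'title' => 'Jungle Book', 'release_year' => 1994),
    array('id' => 'm2', 'type' => 'medium', 'title' => 'Jungle Fever', 'release_year' => 1991),
    array('id' => 'p1', 'type' => 'product', 'med_id' => 'm1', 'provider' => 'amazon', 'price' => 450),
    array('id' => 'p2', 'type' => 'product', 'med_id' => 'm1', 'provider' => 'amazon', 'price' => 900),
    array('id' => 'p3', 'type' => 'product', 'med_id' => 'm2', 'provider' => 'other', 'price' => 300),
);

// Simulates {!join from=$from to=$to}<inner query>:
// 1. collect the $from values of all documents matching $inner,
// 2. return the ids of documents whose $to value is in that set.
function simulateJoin(array $docs, $from, $to, $inner)
{
    $keys = array();
    foreach ($docs as $doc) {
        if (isset($doc[$from]) && $inner($doc)) {
            $keys[$doc[$from]] = true;
        }
    }
    $result = array();
    foreach ($docs as $doc) {
        if (isset($doc[$to]) && isset($keys[$doc[$to]])) {
            $result[] = $doc['id'];
        }
    }
    return $result;
}

// {!join from=med_id to=id}provider:amazon
$amazon = simulateJoin($docs, 'med_id', 'id', function ($d) {
    return isset($d['provider']) && $d['provider'] === 'amazon';
});
// {!join from=med_id to=id}price:[0 TO 500]
$cheap = simulateJoin($docs, 'med_id', 'id', function ($d) {
    return isset($d['price']) && $d['price'] <= 500;
});

// Separate filter queries are intersected with each other.
$matches = array_values(array_intersect($amazon, $cheap));
print_r($matches); // only m1 remains
```

Because each filter query is evaluated on its own, a medium would also pass if one of its products satisfied the provider filter and a different product satisfied the price filter. Combining both conditions in a single joined filter would likely be needed if the same product must satisfy both.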
However, I want to facet the search by metadata held in the documents of type product, for example the country where they are available (where I can order the DVD).
I get the facet counts for all fields contained in the medium documents from this query. However, join queries do not carry any information from the source documents used to filter the join over to the result. So I need a second query:
I do exactly the same as above, but this time I swap which queries are joined and which are not:
So my dismax query now becomes a join query:
$query->setQuery("{!join from=id to=med_id }{!type=dismax qf='$qf' mm='1' q.alt='*:*'}jungle");
My joined filter queries become normal filter queries:
$query->addFilterQuery("provider:amazon");
$query->addFilterQuery("price:[0 TO 500]");
And my conventional filter query becomes a joined one - this time from the field id to med_id:
$query->addFilterQuery("{!join from=id to=med_id}release_year:[1990 TO 1995]");
This now returns all products that match our filters. One medium may have more than one product, but I only want my facet counts to reflect the number of movies, not the number of products, so I also group by med_id and set group.truncate to true like this:
$query->addParam("group","true");
$query->addParam("group.field","med_id");
$query->addParam("group.truncate","true");
The only problem is that the join query that searches the medium fields somehow makes my query return more results rather than fewer. I boiled this down to the minimal code at the beginning of the question to reproduce it.
Comments (1)
I think I worked around my problem by adding my query as a filter query rather than as the main query, like this:
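The snippet that originally followed is missing from this page. As a rough sketch only, assuming the workaround means moving the join from the minimal example out of q and into an fq while q becomes *:*, the two request shapes could be compared like this. The buildSolrParams helper is hypothetical and just assembles the raw query string, so it runs without a Solr server or the PECL extension.

```php
<?php
// Hypothetical helper: builds a Solr query string with one q parameter
// and one repeated fq parameter per filter query.
function buildSolrParams($q, array $filterQueries)
{
    $parts = array('q=' . rawurlencode($q));
    foreach ($filterQueries as $fq) {
        $parts[] = 'fq=' . rawurlencode($fq);
    }
    return implode('&', $parts);
}

// Form from the question: the join is the main query.
$asQuery = buildSolrParams(
    '{!join from=id to=med_id}type:medium',
    array('type:product', 'product_type:tv_free')
);

// Assumed workaround: match everything in q and move the join into an fq.
$asFilter = buildSolrParams(
    '*:*',
    array('{!join from=id to=med_id}type:medium', 'type:product', 'product_type:tv_free')
);

echo $asQuery, "\n", $asFilter, "\n";
```

With the PECL extension used in the question, the second form would presumably correspond to calling setQuery("*:*") and passing the join string to addFilterQuery().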
It seems to work in small test cases. However, I still have some mismatches in my data, so I need to double-check my data sources for any problems and set up a test case where I can demonstrate the difference.
I am also still interested in why it causes problems when set as the main query...
Edit:
The method described in this answer effectively solves the problem, although I am not sure why the problem existed in the first place.
However, the resulting facet counts are not the desired ones, because field collapsing makes Solr facet only on the most relevant document in each group.
Meaning:
Without collapsing (grouping), the count may be higher than the actual number of matching mediums (because several matching products may exist for one medium).
With collapsing, it may be lower (because only the values of one document per group are taken into account).
So facet counts won't work this way. The only thing you really know is which facet values WILL return at least one result, plus a number that is an upper bound (without collapsing) or a lower bound (with collapsing) and may not be the actual number of results.
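A small worked example may make these bounds easier to see. The sketch below is an in-memory simulation, not Solr itself: the product list, the country field values and all three helper functions are hypothetical, and "first document in the group" stands in for the most relevant one.

```php
<?php
// Hypothetical matching products: two movies (m1, m2) with four offers.
$products = array(
    array('id' => 'p1', 'med_id' => 'm1', 'country' => 'DE'),
    array('id' => 'p2', 'med_id' => 'm1', 'country' => 'DE'),
    array('id' => 'p3', 'med_id' => 'm1', 'country' => 'US'),
    array('id' => 'p4', 'med_id' => 'm2', 'country' => 'US'),
);

// Counts how many documents carry each value of $field.
function facetCounts(array $docs, $field)
{
    $counts = array();
    foreach ($docs as $doc) {
        $value = $doc[$field];
        $counts[$value] = isset($counts[$value]) ? $counts[$value] + 1 : 1;
    }
    ksort($counts);
    return $counts;
}

// Keeps only the first document of each group, as a stand-in for the
// single most relevant document that truncated grouping facets on.
function truncateGroups(array $docs, $groupField)
{
    $heads = array();
    foreach ($docs as $doc) {
        if (!isset($heads[$doc[$groupField]])) {
            $heads[$doc[$groupField]] = $doc;
        }
    }
    return array_values($heads);
}

// Number of distinct groups (movies) per facet value: the figure actually wanted.
function distinctGroupsPerValue(array $docs, $field, $groupField)
{
    $seen = array();
    foreach ($docs as $doc) {
        $seen[$doc[$field]][$doc[$groupField]] = true;
    }
    ksort($seen);
    return array_map('count', $seen);
}

$ungrouped = facetCounts($products, 'country');                            // DE => 2, US => 2
$truncated = facetCounts(truncateGroups($products, 'med_id'), 'country');  // DE => 1, US => 1
$actual    = distinctGroupsPerValue($products, 'country', 'med_id');       // DE => 1, US => 2

print_r(array('ungrouped' => $ungrouped, 'truncated' => $truncated, 'actual' => $actual));
```

In this sample, DE is overcounted without collapsing (2 products but 1 movie) and US is undercounted with collapsing (1 counted but 2 movies), while every value that appears in either count does have at least one matching movie.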