Querying a RavenDB product catalog store for spec aggregates over an arbitrary set of products
This is a continuation of the project outlined in this question.
I have the following model:
public class Product
{
    public string Id { get; set; }
    public string[] Specs { get; set; }   // entries such as "Color~Blue"
    public int CategoryId { get; set; }
}
The "Specs" array stores product specification name-value pairs, each joined by a special character. For example, if a product is colored blue, the spec string would be "Color~Blue". Representing specs in this way allows querying for products that have several spec values specified by a query. There are two principal queries that I would like to support:
- Get all products in a given category.
- Get all products in a given category which have a set of specified specs.
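To make the semantics of the two queries concrete, here is a minimal in-memory sketch using LINQ to Objects. It is not the RavenDB query itself; against RavenDB the same predicate would run server-side through a session query, which is not shown here. The `ProductFilter` class and its method names are hypothetical, introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Product
{
    public string Id { get; set; }
    public string[] Specs { get; set; }
    public int CategoryId { get; set; }
}

// Hypothetical helper, not part of RavenDB.
public static class ProductFilter
{
    // Joins a spec name and value with the '~' separator used in the question.
    public static string ToSpec(string name, string value)
    {
        return name + "~" + value;
    }

    // With no required specs this is query #1; otherwise it is query #2:
    // keep the products of the category that carry every required spec string.
    public static List<Product> Filter(
        IEnumerable<Product> products, int categoryId, params string[] requiredSpecs)
    {
        return products
            .Where(p => p.CategoryId == categoryId)
            .Where(p => requiredSpecs.All(s => p.Specs != null && p.Specs.Contains(s)))
            .ToList();
    }
}
```

For example, `ProductFilter.Filter(products, 1, ProductFilter.ToSpec("Color", "Blue"))` keeps only the blue products of category 1.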
This works well with RavenDB. However, in addition to the products satisfying a given query I would like to return a result set which contains all spec name-value pairs for the set of products specified by the query. The spec name-value pairs should be grouped by the name and value of the spec and contain a count of products which have a given spec name-value pair. For query #1 I created the following map reduce index:
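using System.Linq;
using Raven.Client.Indexes; // RavenDB client of that era; Raven.Client.Documents.Indexes in RavenDB 4.0 and later

// Reduce result: one row per (category, spec string) with the number of products.
public class CategorySpecGroups
{
    public int CategoryId { get; set; }
    public string Spec { get; set; }
    public int Count { get; set; }
}

public class SpecGroups_ByCategoryId : AbstractIndexCreationTask<Product, CategorySpecGroups>
{
    public SpecGroups_ByCategoryId()
    {
        // Map: emit one entry per spec string of each product.
        this.Map = products => from product in products
                               where product.Specs != null
                               from spec in product.Specs
                               select new
                               {
                                   CategoryId = product.CategoryId,
                                   Spec = spec,
                                   Count = 1
                               };

        // Reduce: sum the entries for each (CategoryId, Spec) pair.
        this.Reduce = results => from result in results
                                 group result by new { result.CategoryId, result.Spec } into g
                                 select new
                                 {
                                     CategoryId = g.Key.CategoryId,
                                     Spec = g.Key.Spec,
                                     Count = g.Sum(x => x.Count)
                                 };
    }
}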
I can then query this index and get all spec name-value pairs in a given category. The problem I am running into is getting the same result set for a query which filters by both a category and a set of spec name-value pairs. In SQL, this result set would be obtained by doing a GROUP BY over the set of products filtered by category and specs. In general, this type of query is expensive, but when filtering by both category and specs the product sets are normally small, though not small enough to fit into a single page: they may contain up to 1000 products. For reference, MongoDB supports a group method which can be used to achieve the same result set. It performs the ad hoc grouping server side, and the performance is acceptable.
How can I get this type of result set using RavenDB?
One possible solution is to get all the products for a query and perform the grouping in memory. Another option is to create a map-reduce index as above, though the challenge there would be deducing all possible spec selections that can be made for a given category. Additionally, this type of index might explode in size.
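The first option, grouping in memory, can be sketched as follows. This assumes the filtered products have already been loaded from the store (possibly across several pages); the `SpecCount` and `SpecAggregator` names are hypothetical. Unlike the map-reduce index above, the sketch counts each product at most once per spec string, which is an assumption about the intended meaning of "count of products".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Product
{
    public string Id { get; set; }
    public string[] Specs { get; set; }
    public int CategoryId { get; set; }
}

// Hypothetical result type: one spec string and the number of products carrying it.
public class SpecCount
{
    public string Spec { get; set; }
    public int Count { get; set; }
}

public static class SpecAggregator
{
    // Groups the already-filtered products by spec string and counts them.
    public static List<SpecCount> CountSpecs(IEnumerable<Product> filtered)
    {
        return filtered
            .Where(p => p.Specs != null)
            .SelectMany(p => p.Specs.Distinct())   // a product counts once per spec
            .GroupBy(spec => spec)
            .Select(g => new SpecCount { Spec = g.Key, Count = g.Count() })
            .OrderBy(c => c.Spec, StringComparer.Ordinal)
            .ToList();
    }
}
```

With product sets of up to about 1000 documents the grouping itself is cheap; the cost of this approach is mostly in transferring the documents to the client.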
For an example, take a look at this fastener category page. The user can filter their selection by selecting attributes. When an attribute is selected it narrows the selection of products and displays the attributes within the new set of products. This type of interaction is typically called faceted search.
EDIT
In the meantime, I will be attempting a solution using Solr, since it supports faceted search out of the box.
EDIT 2
It appears that RavenDB also supports faceted search (which of course makes sense, since its indexes are stored by Lucene, just as in Solr). I will explore this and post updates.
EDIT 3
The RavenDB faceted search functionality works as expected. I store a facet setup document for each category ID, which is used to calculate facets for a query within a given category. The issue I am having now is performance. For a collection of 500k products with 4500 distinct categories, resulting in 4500 facet setup documents, a query by category id takes about 16 seconds when also querying for facets and about 0.05 seconds when not querying for facets. The particular category tested contains about 6k products, 23 distinct facets and 2k distinct facet name-range combinations.
After looking at the code in FacetedQueryRunner, it appears that a facets query results in a Lucene query for every facet name-value combination to get the counts, as well as a query for each facet name to get the terms. One problem with the implementation is that it retrieves all the distinct terms for a given facet name regardless of the query, even though restricting the terms to those matching the query would in most cases significantly reduce the number of terms for a facet and therefore the number of Lucene queries. One way to improve performance here would be to store a MapReduce computed result set (as shown above) for each facet setup document, which could then be queried to get all the distinct terms when further filtering by facets. The overall performance, however, may still be too slow.
Comments (1)
I've implemented this feature using RavenDB faceted search, but I made some changes to FacetedQueryRunner to support a heuristic optimization. The heuristic is that, in my case, facets are only displayed in leaf categories. This is a reasonable constraint, since navigation between root and internal categories can be driven by either search or listings of child categories.
Given this constraint, I store a FacetSetup document for each leaf category, with an Id such as "facets/category_123". When the facet setup document is being stored, I have access to the facet names as well as the facet values (or ranges) that are contained in the category. Therefore, I can store all available facet values in the Ranges collection of each Facet in the FacetSetup document, while the facet mode remains FacetMode.Default.
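As a rough sketch of how the per-category facet values could be derived, the code below splits each "Name~Value" spec string and collects the distinct values per facet name. `CategoryFacet` and `CategoryFacetSetup` are hypothetical stand-ins that only mirror the shape described above (an Id plus named facets with a Ranges list); they are not RavenDB's own Facet and FacetSetup classes, and the builder is an assumption about one way to populate them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the facet setup document described above.
public class CategoryFacet
{
    public string Name { get; set; }
    public List<string> Ranges { get; set; }
}

public class CategoryFacetSetup
{
    public string Id { get; set; }
    public List<CategoryFacet> Facets { get; set; }
}

public static class FacetSetupBuilder
{
    // specsOfProducts: the Specs array of every product in one leaf category.
    public static CategoryFacetSetup Build(int categoryId, IEnumerable<string[]> specsOfProducts)
    {
        var facets = specsOfProducts
            .Where(specs => specs != null)
            .SelectMany(specs => specs)
            .Select(spec => spec.Split(new[] { '~' }, 2))   // "Color~Blue" -> ["Color", "Blue"]
            .Where(parts => parts.Length == 2)              // skip malformed entries
            .GroupBy(parts => parts[0], parts => parts[1])
            .Select(g => new CategoryFacet
            {
                Name = g.Key,
                Ranges = g.Distinct().OrderBy(v => v, StringComparer.Ordinal).ToList()
            })
            .OrderBy(f => f.Name, StringComparer.Ordinal)
            .ToList();

        return new CategoryFacetSetup { Id = "facets/category_" + categoryId, Facets = facets };
    }
}
```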
Here are the changes to FacetedQueryRunner. Specifically, the optimization checks whether a given facet stores ranges, in which case it returns those values to use for searching instead of getting all terms in the index associated with that facet. In most cases this will significantly reduce the number of Lucene searches required, since the available facet values in a given category are a subset of the facet values in the entire index.
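The actual patch is not reproduced in this post, so the following is only a sketch of the decision it describes, not the FacetedQueryRunner code. `FacetTerms.TermsToSearch` is a hypothetical name, and the fallback that reads every term from the index is passed in as a delegate so the example stays independent of Lucene.

```csharp
using System;
using System.Collections.Generic;

public static class FacetTerms
{
    // storedRanges: values saved for this facet in the facet setup document (may be null or empty).
    // loadAllTerms: fallback that reads every distinct term for the facet from the index.
    public static IList<string> TermsToSearch(
        IList<string> storedRanges, Func<IList<string>> loadAllTerms)
    {
        // Prefer the per-category values, so only those terms need a count query.
        if (storedRanges != null && storedRanges.Count > 0)
        {
            return storedRanges;
        }

        // No stored values: behave as before and use all terms in the index.
        return loadAllTerms();
    }
}
```

Because the fallback is only invoked when no values are stored, a category with a handful of stored values never pays for enumerating the terms of the whole index.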
The next optimization that can be made is that, if the original query only filters by a category id, the FacetSetup document can actually store the counts as well. One, albeit hacky, way to do this would be to append the count to each facet value in the Ranges collection, then add a boolean to the FacetSetup document to indicate that counts are appended. This facet query would then basically return the values in the FacetSetup document, with no need to query.
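One way the appended counts could be encoded is sketched below. The '|' separator and the `CountedRange` helper are assumptions made for illustration; the post does not say which format was used. The separator is taken from the end of the string so that facet values containing '|' still round-trip.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: packs "value|count" into a single Ranges entry.
public static class CountedRange
{
    private const char Separator = '|';

    public static string Encode(string value, int count)
    {
        return value + Separator + count.ToString(CultureInfo.InvariantCulture);
    }

    public static KeyValuePair<string, int> Decode(string encoded)
    {
        // Split on the last separator; everything before it is the facet value.
        int index = encoded.LastIndexOf(Separator);
        if (index < 0)
        {
            throw new FormatException("No count appended to: " + encoded);
        }

        string value = encoded.Substring(0, index);
        int count = int.Parse(encoded.Substring(index + 1), CultureInfo.InvariantCulture);
        return new KeyValuePair<string, int>(value, count);
    }
}
```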
One consideration is keeping the FacetSetup documents up to date, although this would be required either way. Beyond this optimization, caching can be utilized, which I believe is the approach taken by Solr faceted search.
Furthermore, it would be nice if the FacetSetup documents were automatically synchronized with the product collection, since they are effectively the result of an aggregating MapReduce operation over the set of products, grouping first by category id, then by the name of the facet, and then by the values.