GAE text search with secondary sorting
In lieu of full-text search on GAE, I'm using the solution below to return a result set that's sorted first by keyword relevance and then by date (though the secondary sort could be anything, really). It feels a bit bulky and I'm concerned about performance at scale, so I'm looking for optimization suggestions or a different approach altogether.
The secondary sort is important to my use case, since a given search will likely have multiple results of the same relevance (as measured by the number of keyword matches), but preserving the original query ordering adds a lot of complexity right now. Any ideas?
Step 1: Get a list of keys that match each search term
results_key_list = []
search_terms = ['a', 'b', 'c']  # User's search query, split into a list of strings
# Query each search term and add the results to a list.
# Yields a list of keys, with frequency of occurrence indicating relevance.
for item in search_terms:
    subquery = SomeEntity.all(keys_only=True)
    subquery.filter('SearchIndex =', item)  # SearchIndex is a StringListProperty
    # more filters...
    subquery.order('-DateCreated')
    for returned_item in subquery:
        results_key_list.append(str(returned_item))
Step 2: Group the list by frequency while maintaining the original order
from collections import defaultdict
# Build a dictionary mapping each key to its frequency of occurrence
grouped_results = defaultdict(int)
for key in results_key_list:
    grouped_results[key] += 1
sorted_results = []
known = set()
# Create an empty list for each possible match frequency
for i in range(len(search_terms)):
    sorted_results.append([])
# Using the original results ordering, construct an array of results
# grouped and ordered by descending frequency
for key in results_key_list:
    if key in known:
        continue
    frequency = grouped_results[key]
    sorted_results[len(search_terms) - frequency].append(key)
    known.add(key)
# Combine into a single list
ordered_key_list = []
for bucket in sorted_results:
    ordered_key_list.extend(bucket)
# Apply paging (offset and limit are assumed to be set elsewhere)
del ordered_key_list[:offset]
del ordered_key_list[limit:]
result = SomeEntity.get(ordered_key_list)
You can accumulate the keys in order of appearance and build up the key frequencies in a single pass.
Take advantage of sort stability to sort by decreasing frequency, then by order of appearance.
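The code that originally accompanied this answer is not present here, so what follows is only a minimal sketch of the idea as described, not the answerer's own code. The function name `rank_keys` is hypothetical, and it assumes `results_key_list` is the flat list of keys built in step 1 of the question.

```python
from collections import defaultdict


def rank_keys(results_key_list):
    """Order keys by decreasing match count; ties keep first-appearance order."""
    frequency = defaultdict(int)
    first_seen = []  # unique keys, in order of first appearance
    for key in results_key_list:
        if key not in frequency:  # membership test does not create an entry
            first_seen.append(key)
        frequency[key] += 1
    # sorted() is stable (also with reverse=True), so keys with equal
    # counts stay in the order they first appeared.
    return sorted(first_seen, key=lambda k: frequency[k], reverse=True)
```

Under these assumptions, `ordered_key_list = rank_keys(results_key_list)` would stand in for the whole of step 2, with the paging slices applied afterwards.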
In step 1, you're iterating over a query object. This results in one fetch RPC per 20 objects returned, which is inefficient, time-wise. Instead, call fetch(n) on the Query object, where n is the maximum number of results to return - this does only a single RPC. It also has the benefit of limiting the number of results to search - right now, if I search for 'I', your app will get stuck processing nearly every record in step 1. It's also completely unnecessary to convert the keys into strings - you can add keys to a set just fine.
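As a rough sketch of that suggestion: the helper below calls `fetch()` once per query instead of iterating, and leaves the keys unconverted. The names `collect_keys` and `max_results` are hypothetical, and `queries` is assumed to be the list of per-term `keys_only` queries built as in step 1.

```python
def collect_keys(queries, max_results):
    """Gather keys from several queries with one fetch() call per query."""
    results_key_list = []
    for query in queries:
        # fetch(n) returns at most n results as a list, so no single
        # search term can pull in an unbounded number of keys.
        results_key_list.extend(query.fetch(max_results))
    return results_key_list
```

The keys are kept as returned rather than passed through `str()`, so they can go straight into a set or dictionary for the grouping in step 2.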
For what it's worth, though, I personally find 'or' searches to be particularly useless. I realise you'll rank items that match all the terms first, but those will inevitably be followed by piles and piles of irrelevant results. You could do an 'and' search simply by doing one query with an equality filter for each search term.
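A minimal sketch of such an 'and' search, assuming the same `SearchIndex` list property as in the question. The helper name `build_and_query` is hypothetical, and `model_class` stands for a model like `SomeEntity`.

```python
def build_and_query(model_class, search_terms):
    """Build one keys-only query that must match every search term."""
    query = model_class.all(keys_only=True)
    for term in search_terms:
        # Each equality filter on the list property narrows the results
        # further, so only entities containing all terms match.
        query.filter('SearchIndex =', term)
    return query
```

Something like `build_and_query(SomeEntity, search_terms).fetch(limit)` would then return only entities that contain every term, so the frequency grouping in step 2 is no longer needed.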