Sphinx Search - Multi-index search vs. client program aggregation

Posted on 2024-11-28 19:31:19

I'm looking for insight into the best approach to implementing a Python client for Sphinx Search.

The dataset I am searching through is made up of profile content. All the profiles are organized geographically by latitude and longitude. The profiles have many different attributes, all stored in the database as TEXT and associated with the right profile ID. From a search standpoint, the query procedure is to issue a geographic search that uses the haversine formula to find all ids that fall within a radius, and then use Sphinx to search through all these attributes to find the profiles whose published content is most relevant to the issued query.
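To make the geographic step concrete, here is a minimal plain-Python sketch of a haversine radius search. It is not Sphinx's internal implementation; the names haversine_m and ids_within_radius, the Earth radius constant, and the dict of profile coordinates are assumptions made only for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def ids_within_radius(profiles, lat, lon, radius_m):
    """Return (profile_id, distance_m) pairs inside the radius, nearest first.

    `profiles` is a hypothetical dict mapping profile id -> (lat, lon) in degrees.
    """
    hits = []
    for pid, (plat, plon) in profiles.items():
        dist = haversine_m(lat, lon, plat, plon)
        if dist <= radius_m:
            hits.append((pid, dist))
    return sorted(hits, key=lambda pair: pair[1])
```

Calling ids_within_radius(profiles, lat, lon, 5000.0) would give the ids within 5 km of the anchor point, which is the same set the location query is meant to return.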

The Sphinx client I've been working on so far uses several different Sphinx indexes and runs separate queries. The Python object first runs the location query, saves the ids that fall within the range, and then runs queries against all the other indexes, filtering so that only ids from the geographic set can be returned as valid results.

What I am wondering is whether it would be more efficient to join the location data into the Sphinx full-text index and have Sphinx handle all the querying, rather than structuring my client program to use the API to "fall back" through the queries like this. Would there be any advantage to one large index that gathers all the data into a single Sphinx "document", rather than having the client be responsible for running additional queries and filtering?
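For comparison, the single-index variant I have in mind might look roughly like the sketch below. It assumes a hypothetical combined index named 'profiles_combined' that holds the full-text fields plus 'latitude' and 'longitude' float attributes stored in radians; combined_query and COMBINED_INDEX are illustrative names, not part of sphinxapi. The client argument is expected to be a sphinxapi SphinxClient (or anything exposing the same methods).

```python
import math

try:
    from sphinxapi import SPH_SORT_EXTENDED
except ImportError:
    # Fallback so the sketch can be read and imported without the library;
    # 4 is the value sphinxapi.py assigns to SPH_SORT_EXTENDED.
    SPH_SORT_EXTENDED = 4

COMBINED_INDEX = 'profiles_combined'  # hypothetical index name


def combined_query(client, text, lat_deg, lon_deg, radius_m, limit=1000):
    """Apply the geo filter and the full-text match in a single request."""
    client.ResetFilters()
    # The anchor point must be passed in radians.
    client.SetGeoAnchor('latitude', 'longitude',
                        math.radians(lat_deg), math.radians(lon_deg))
    # Keep only documents within the radius (in metres).
    client.SetFilterFloatRange('@geodist', 0.0, float(radius_m))
    # Most relevant first; distance breaks ties.
    client.SetSortMode(SPH_SORT_EXTENDED, '@weight desc, @geodist asc')
    client.SetLimits(0, limit)
    result = client.Query(text, COMBINED_INDEX)
    if not result:
        return []
    return [(m['id'], m['attrs']['@geodist']) for m in result['matches']]
```

With this shape, the client makes one round trip and never has to ship the list of in-range ids back to Sphinx as a filter.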

The code posted below gives an idea of how the queries run:

import math
from sphinxapi import *  # SphinxClient, SPH_MATCH_ALL, SPH_SORT_EXTENDED

# Both functions below are methods of a SphinxClient subclass.
def LocationQuery(self):
    self.SetServer('127.0.0.1', 9312)
    self.SetMatchMode(SPH_MATCH_ALL)

    # The anchor point must be given in radians.
    self.SetGeoAnchor('latitude', 'longitude', math.radians(self._lat), math.radians(self._lon))
    self.SetLimits(0, 1000)

    # Keep only matches within the radius, nearest first.
    self.SetFilterFloatRange('@geodist', 0.0, float(self._radius))
    self.SetSortMode(SPH_SORT_EXTENDED, '@geodist asc')
    self._results = self.Query('loc', GEO_INDEX)
    if not self._results:
        # Query() returns None on failure.
        raise RuntimeError(self.GetLastError())
    for match in self._results['matches']:
        self._ids_in_range.append(ProfileResult(match['id'], match['attrs']['@geodist']))
    #for obj in self._ids_in_range:
        #print(repr(obj))

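def DescriptionQuery(self):
    self.ResetFilters()
    self.SetSortMode(SPH_SORT_EXTENDED, 'profileid_attr asc')
    # Map profile id -> ProfileResult so each match is a constant-time lookup.
    in_range = dict((obj.profID, obj) for obj in self._ids_in_range)
    if not in_range:
        return  # nothing within the radius, so there is nothing to search

    # Restrict the full-text query to the profiles found by LocationQuery.
    self.SetFilter('profileid_attr', list(in_range))
    self._results = self.Query(self._query, DESCRIPTION_INDEX)
    if not self._results:
        # Query() returns None on failure.
        raise RuntimeError(self.GetLastError())
    for match in self._results['matches']:
        if match['id'] in in_range:
            self.ResultSet.append(in_range[match['id']])
    print('Description Results: %s' % len(self._results['matches']))
    # len() replaces list.count(), which requires an argument.
    print('Total Results: %s' % len(self.ResultSet))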

These methods are run in sequence, and the ids they find are saved on the object.
