How do I implement full-text search in Django?

Posted 2024-08-25 06:31:49

I would like to implement a search function in a django blogging application. The status quo is that I have a list of strings supplied by the user and the queryset is narrowed down by each string to include only those objects that match the string.

See:

from django.db.models import Q

if request.method == "POST":
    form = SearchForm(request.POST)
    if form.is_valid():
        posts = Post.objects.all()
        for string in form.cleaned_data['query'].split():
            posts = posts.filter(
                    Q(title__icontains=string) |
                    Q(text__icontains=string) |
                    Q(tags__name__exact=string)
                    )
        return archive_index(request, queryset=posts, date_field='date')

Now, what if I wanted to combine the words that are searched for with a logical OR instead of a logical AND? How would I do that? Is there a way to do it with Django's own QuerySet methods, or does one have to fall back to raw SQL queries?
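[Editor's note] One common pattern for this is to fold the per-word conditions together with the `|` operator, e.g. via functools.reduce. A minimal sketch follows; the Q class defined here is only a stand-in that mimics how django.db.models.Q combines with `|` and `&`, so the snippet runs outside a Django project (in a real view, import Q from django.db.models instead):

```python
from functools import reduce
import operator

# Stand-in mimicking how django.db.models.Q combines with | (OR) and & (AND).
# In a real Django view, use: from django.db.models import Q
class Q:
    def __init__(self, **lookups):
        self.children = list(lookups.items())
        self.connector = "AND"

    def _combine(self, other, connector):
        combined = Q()
        combined.children = [self, other]
        combined.connector = connector
        return combined

    def __or__(self, other):
        return self._combine(other, "OR")

    def __and__(self, other):
        return self._combine(other, "AND")


words = "django fulltext search".split()

# OR all per-word conditions together in a single filter() call,
# instead of chaining one .filter() per word (which ANDs them):
or_query = reduce(operator.or_, (Q(title__icontains=w) for w in words))
print(or_query.connector)  # OR
```

With the real Q class, `posts.filter(or_query)` would then return posts matching any of the words.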

In general, is doing full-text search like this a proper solution, or would you recommend using a search engine like Solr, Whoosh or Xapian? What are their benefits?

Comments (6)

划一舟意中人 2024-09-01 06:31:49

I suggest you adopt a search engine.

We've used Haystack, a modular search application for Django that supports many search engines (Solr, Xapian, Whoosh, etc.).

Advantages:

  • Faster queries
  • Can perform search queries without even hitting the database
  • Highlight searched terms
  • "More like this" functionality
  • Spelling suggestions
  • Better ranking
  • etc...

Disadvantages:

  • Search Indexes can grow in size pretty fast
  • One of the best search engines (Solr) runs as a Java servlet (Xapian does not)

We're pretty happy with this solution and it's pretty easy to implement.
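[Editor's note] For illustration, a Haystack index for the Post model from the question might look roughly like this. This is a configuration sketch, not a drop-in file: it assumes django-haystack is installed and configured, and the import path `myblog.models` is hypothetical.

```python
# search_indexes.py -- sketch of a Haystack index for the question's Post model.
# Assumes django-haystack is installed and a backend is configured in settings.
from haystack import indexes
from myblog.models import Post  # hypothetical app path; adjust to your project

class PostIndex(indexes.SearchIndex, indexes.Indexable):
    # The main document field; its data template would typically
    # concatenate the post's title, text and tag names.
    text = indexes.CharField(document=True, use_template=True)
    date = indexes.DateTimeField(model_attr='date')

    def get_model(self):
        return Post
```

Queries then go through Haystack's SearchQuerySet rather than the ORM, and the backend's index does the matching.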

似狗非友 2024-09-01 06:31:49

Actually, the query you have posted does use OR rather than AND - you're using | to separate the Q objects. AND would be &.

In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.

撩动你心 2024-09-01 06:31:49

SOLR is very easy to set up and integrate with Django. Haystack makes it even simpler.

绅刃 2024-09-01 06:31:49

Answer to your general question: Definitely use a proper application for this.

With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.

With a proper full-text search engine (or whatever you call it), the text (its words) is indexed every time you insert new records. So queries will be a lot faster, especially as your database grows.
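[Editor's note] The speed difference comes from an inverted index: the engine maps each word to the set of documents containing it at insert time, so a query becomes a lookup instead of a scan of every row. A self-contained toy version, with a hypothetical mini-corpus standing in for the Post table:

```python
from collections import defaultdict

# Hypothetical mini-corpus standing in for the Post table.
docs = {
    1: "django full text search",
    2: "searching in python",
    3: "full text indexing with django",
}

# Build the index once, when records are inserted -- this is the work
# a search engine does up front so queries don't scan every row.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# A query is now a dictionary lookup instead of an icontains scan.
print(sorted(index["django"]))  # [1, 3]
```

An icontains query, by contrast, must examine the full content of every row on every search, which is what the answer above warns about.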

停滞 2024-09-01 06:31:49

I think full-text search at the application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage, it might be more affordable to put some time into a custom full-text search rather than installing an application to perform the search for you. An application would create more dependencies, maintenance and extra effort when storing data. By building the search yourself you can also add nice custom features: for example, if the text exactly matches one title you can direct the user to that page instead of showing the results. Another would be to allow title: or author: prefixes on keywords.

Here is a method I've used for generating relevant search results from a web query.

import shlex

from myblog.models import Post  # adjust to your app


class WeightedGroup:
    def __init__(self):
        # using a dictionary will make the results not paginate
        # but it will be a lot faster when storing data
        self.data = {}

    def list(self, max_len=0):
        # returns a sorted list of the items, heaviest weight first
        res = []
        while len(self.data) != 0:
            nominated = None
            nominated_weight = 0
            for item, weight in self.data.items():
                if weight > nominated_weight:
                    nominated = item
                    nominated_weight = weight
            self.data.pop(nominated)
            res.append(nominated)
            if len(res) == max_len:
                return res
        return res

    def append(self, weight, item):
        if item in self.data:
            self.data[item] += weight
        else:
            self.data[item] = weight


def search(searchtext):
    candidates = WeightedGroup()

    for arg in shlex.split(searchtext):  # shlex understands quotes

        # Search TITLE
        # order by date so we get the most recent posts
        query = Post.objects.filter(title__icontains=arg).order_by('-date')
        arg_hits = query.count()  # count is cheap

        if arg_hits > 1000:
            continue  # skip keywords which have too many hits

        # Each of these is expensive, as it transfers data
        # from the db and builds a Python object,
        for post in query[:50]:  # so we limit it to 50, for example
            # the more hits a keyword has, the less relevant it is
            candidates.append(100.0 / arg_hits, post.post_id)

        # TODO: add searches for other areas.
        # Weight might also be adjusted by the number of hits within the
        # text, or perhaps you can find other metrics to value a post
        # higher, like the number of views.

    # candidates can contain a lot of items now, show only the most relevant
    return Post.objects.filter(post_id__in=candidates.list(20))
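[Editor's note] To show how the weighting ranks posts, here is a condensed, self-contained re-statement of the idea with hypothetical post ids. The compact list() uses sorted() but produces the same heaviest-first ordering as the class above:

```python
# Condensed version of the weighting scheme: rarer keywords contribute more.
class WeightedGroup:
    def __init__(self):
        self.data = {}

    def append(self, weight, item):
        self.data[item] = self.data.get(item, 0) + weight

    def list(self, max_len=0):
        ranked = sorted(self.data, key=self.data.get, reverse=True)
        return ranked[:max_len] if max_len else ranked


g = WeightedGroup()
g.append(100.0 / 2, "post-a")   # rare keyword with only 2 hits: weight 50
g.append(100.0 / 50, "post-b")  # common keyword with 50 hits: weight 2
g.append(100.0 / 50, "post-a")  # post-a also matches the common keyword
print(g.list(2))  # ['post-a', 'post-b']
```

post-a accumulates weight from both keywords (52 total) and outranks post-b (2), which matched only the common keyword.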
送你一个梦 2024-09-01 06:31:49

For full-text search in Python, look at PyLucene. It allows very complex queries. The main problem here is that you must find a way to tell your search engine which pages have changed, so it can eventually update the index.

Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google the changed pages and Google will do all the hard work (indexing, parsing the queries, etc.). On top of that, most people are used to using Google for search, and it will keep your site current in global Google searches, too.
