How to use Sphinx BuildExcerpts

Posted 2025-01-03 04:13:33

So, I've set up a Sphinx configuration file. I have a very simple schema with two fields, title and body, where title is the name of a novel and body is the complete novel itself. To keep things simple, I've only added one novel. The indexer worked just fine and the Python API made querying searchd a breeze. I'm really impressed so far: this seems to be the easiest full-text search engine to set up that I've investigated (much easier than Lucene or Solr, and faster than Whoosh).

I have skipped any DB backend. I have my novels in plain .txt format, and I've added the sample one with this simple XML (through xmlpipe):

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
       <sphinx:document id="1">
             <title><![CDATA[Dan Simmons - I Canti di Hyperion 3 - Endymion]]></title>
             <body><![CDATA[ * ALL THE NOVEL HERE * ]]></body>
       </sphinx:document>
</sphinx:docset>

I then search the archive for "tartaruga", which is Italian for "turtle", and I'm sure that the word is in the file. In fact, it is found three times, and I guess that's what Sphinx is reporting to me ('hits': 3). This is the complete result:

{'attrs': [],
'error': '',
'fields': ['title', 'body'],
'matches': [{'attrs': {}, 'id': 1, 'weight': 1}],
'status': 0,
'time': '0.392',
'total': 1,
'total_found': 1,
'warning': '',
'words': [{'docs': 1, 'hits': 3, 'word': 'tartaruga'}]}

What I want to have, in the end, is something like this:

[
  {
    'title': 'Dan Simmons - I Canti di Hyperion 3 - Endymion',
    'body': 'il vecchio mostrò quel suo sorriso a becco di tartaruga. — non bisogna dimenticare il palazzo dello shrike, né il nostro vecchio amico shrike, giusto? non ce ne sono altre?'
  },
  {
    'title': 'Dan Simmons - I Canti di Hyperion 3 - Endymion',
    'body': '— vieni più vicino, raul endymion. — la voce pareva il rumore di una lama spuntata che sfregasse su pergamena. le labbra si muovevano come il becco d\'una tartaruga.'
  },
  {
    'title': 'Dan Simmons - I Canti di Hyperion 3 - Endymion',
    'body': 'il becco di tartaruga ebbe una contrazione, la grossa testa si mosse in un cenno d\'assenso. notai ora che il viso del vecchio, malgrado i danni provocati dai secoli, aveva ancora tratti netti e spigolosi... un\'aria da satiro.'
  },
]

I mean an array of occurrences, each with the book the excerpt is taken from and the word within its context (I've chosen sentences, but n words before and after the match would work too). I think I have to use BuildExcerpts, but how?

Also, if I want to match both tartaruga (turtle) and tartarughe (turtles), I'd like to issue a query like tartarug*. How do I do this in Sphinx? Thanks in advance.

Comments (1)

压抑⊿情绪 2025-01-10 04:13:33

I do the same thing for a project I work on. My suggestion would be that loading an entire book as a single field isn't a great idea unless you're only ever going to work with one book, rather than many books. Here's how I do it.

  1. Each book is stored in a MySQL database one page at a time.
  2. Run Sphinx across the database with several million pages of text. It works very fast and returns every page containing the text you are looking for (or, depending on the number of pages in the DB, just the first 30 or whatever).
  3. Use the Excerpt Builder to get an excerpt from a page, and then highlight the search phrase.
  4. If Python doesn't have access to the excerpt builder (it may be PHP only), then you could do the same job without too much difficulty using regular expressions: find your search phrase, use one regex to grab so much text on either side, and another regex to add highlighting.
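
The regex fallback in step 4 could look roughly like the sketch below. It is only an illustration, not Sphinx's own excerpt algorithm: the function name build_excerpts and its parameters (around, before, after, prefix) are made up for this example, it handles a single search word rather than a phrase, and it takes context as n whitespace-separated words on each side of the match. The optional prefix flag is a plain-Python stand-in for a tartarug*-style match.

```python
import re


def build_excerpts(text, word, around=5, before='<b>', after='</b>', prefix=False):
    """Return one excerpt per occurrence of `word` in `text`.

    Hypothetical helper: each excerpt keeps up to `around` words of context
    on both sides and wraps the matched word in `before`/`after` markers.
    With prefix=True, `word` is treated as a prefix (like tartarug*).
    """
    tail = r'\w*' if prefix else r'\b'
    target = re.compile(r'\b' + re.escape(word) + tail, re.IGNORECASE)
    tokens = text.split()
    excerpts = []
    for i, token in enumerate(tokens):
        if not target.search(token):
            continue
        start = max(0, i - around)
        window = tokens[start:i + around + 1]
        # Highlight only the matched part, leaving punctuation outside the markers.
        window[i - start] = target.sub(lambda m: before + m.group(0) + after, token)
        excerpts.append(' '.join(window))
    return excerpts
```

Calling build_excerpts(page_text, 'tartarug', prefix=True) on each page returned by the search would give one highlighted snippet per occurrence, which is close to the list of excerpts the question asks for.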

You could write a python script (I use a PHP script run from the bash shell) to extract your text one page at a time, sanitize it, and add it to the database.
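
As a rough idea of what such a script might do, here is a minimal sketch that cuts a plain-text novel into fixed-size "pages". The name paginate and the 300-words-per-page default are arbitrary choices for illustration, and the only sanitizing it does is collapsing runs of whitespace; a real import script would probably also deal with encodings and chapter boundaries.

```python
def paginate(text, words_per_page=300):
    """Yield (page_number, page_text) tuples from a plain-text book.

    Hypothetical helper: pages are simply consecutive runs of
    `words_per_page` words, with all whitespace normalised to single spaces.
    """
    words = text.split()
    for number, start in enumerate(range(0, len(words), words_per_page), start=1):
        yield number, ' '.join(words[start:start + words_per_page])
```

Each yielded page can then be inserted as one row of the pages table, so that a search hit points at a page rather than at the whole novel.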

You'd need a database with at least two tables, something like:

books (fields could be called id, name, author)

pages (fields would be id, book_id, page_text)

Sphinx would return a page id to you, and you then get the page from MySQL using a simple query...

SELECT page_text FROM pages WHERE id = $idreturnedbysphinx;
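
To make the two-table layout and the lookup concrete, here is a small self-contained sketch. It uses Python's built-in sqlite3 module purely as a stand-in so it runs without a server; the setup described here is MySQL, and the column types and the helper names open_library and fetch_page are assumptions for the example. The lookup uses a bound parameter instead of pasting the id into the SQL string.

```python
import sqlite3


def open_library(path=':memory:'):
    """Open (or create) a database with the books/pages layout (sqlite3 stand-in)."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY, name TEXT, author TEXT);
        CREATE TABLE IF NOT EXISTS pages (
            id INTEGER PRIMARY KEY,
            book_id INTEGER REFERENCES books(id),
            page_text TEXT);
    """)
    return conn


def fetch_page(conn, page_id):
    """Return (book_name, page_text) for a page id, or None if it does not exist."""
    return conn.execute(
        "SELECT b.name, p.page_text FROM pages p "
        "JOIN books b ON b.id = p.book_id WHERE p.id = ?",
        (page_id,),
    ).fetchone()
```

Joining against books in the same query also gives the title that belongs with each excerpt.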

You then send that returned text to the text excerpter/text highlighter.

Sphinx can either search for exact words or stemmed words (and much much more), but you need to set this up in your sphinx.conf file.

You need at least two index definitions:

index indexname1
{
     #source database connection and sql query
     source                 = src1
     path                   = /var/data/indexname1
     [... other settings ...]
     #make sure stemming is switched off
     morphology             = none
}
#child index inherits the above, and add stemming
index indexname1stemmed : indexname1
{
    path            = /var/data/indexname1stemmed
    morphology      = stem_en
    index_exact_words   = 1
}

You then also need to specify, in your Sphinx search, the match mode you want to use. I don't know the Python syntax, but the Sphinx manual sets it out better than I can:
http://sphinxsearch.com/docs/current.html#matching-modes
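
For what it's worth, the sphinxapi.py client that ships with Sphinx does, as far as I know, expose both SetMatchMode and BuildExcerpts, so the Python side could be sketched as below. Treat it as a sketch under assumptions: make_client and search_with_excerpts are names invented here, the host, port and excerpt options are example values, fetch_page is whatever function you use to load a page's text by id, and the older sphinxapi.py releases were written for Python 2, so the import may need adapting.

```python
def make_client(host='localhost', port=9312):
    """Create a client for searchd (assumes sphinxapi.py is importable)."""
    import sphinxapi  # the client module bundled with the Sphinx distribution
    client = sphinxapi.SphinxClient()
    client.SetServer(host, port)
    client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)
    return client


def search_with_excerpts(client, index, query, fetch_page, opts=None):
    """Run `query` and return a list of (page_id, excerpt) pairs.

    `client` needs Query() and BuildExcerpts() methods like SphinxClient;
    `fetch_page` is a caller-supplied function mapping a page id to its text.
    """
    result = client.Query(query, index)
    if not result or not result.get('matches'):
        return []
    ids = [match['id'] for match in result['matches']]
    docs = [fetch_page(page_id) for page_id in ids]
    if opts is None:
        # Example options: highlight markers and words of context per match.
        opts = {'before_match': '<b>', 'after_match': '</b>', 'around': 10}
    excerpts = client.BuildExcerpts(docs, index, query, opts)
    if not excerpts:
        return []
    return list(zip(ids, excerpts))
```

Something like search_with_excerpts(make_client(), 'indexname1', 'tartaruga', fetch_page) would then return one highlighted excerpt per matching page.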

You could do all this without a SQL database and keep it in text files, but I'd probably go to one text file per page as a more manageable way to work, otherwise you'll be back to returning the entire ebook as your search result.
