Indexing and querying URLs in Solr

Posted 2024-10-11 21:27:46

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not include www), I am looking for the correct way to index and query URLs.
I've tried a few things, and I think I'm close, but I'm not sure why it doesn't work:

Here is my custom field type:

<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For example:

http://www.twitter.com/AndersonCooper, when indexed, will have the following words in different positions: http, www, twitter, com, andersoncooper
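
To make the expected tokens concrete, here is a rough Python approximation of the analysis chain above (whitespace split, then split on delimiters, then lowercase). It is only a sketch: `approx_url_tokens` is a hypothetical helper, it handles ASCII letters and digits only, and it ignores `preserveOriginal` and the other WordDelimiterFilter options, so it does not reproduce Solr's exact behaviour.

```python
import re


def approx_url_tokens(text):
    """Approximate: whitespace tokenizer -> split on delimiters -> lowercase."""
    tokens = []
    for raw in text.split():  # stand-in for the whitespace tokenizer
        # Runs of ASCII letters or digits become separate parts;
        # everything else (":", "/", ".") acts as a delimiter.
        parts = re.findall(r"[A-Za-z]+|[0-9]+", raw)
        tokens.extend(part.lower() for part in parts)
    return tokens


index_tokens = approx_url_tokens("http://www.twitter.com/AndersonCooper")
query_tokens = approx_url_tokens("twitter.com/andersoncooper")

print(index_tokens)
print(query_tokens)
```

Under this approximation, every token from the shorter query also appears among the tokens of the indexed URL, which is the overlap the search is meant to exploit.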

If I search for simply twitter.com/andersoncooper, I would like this query to match the record that was indexed, which is why I also use the WordDelimiterFilter (WDF) to split the search query. However, the search query ends up like this:

myfield:("twitter com andersoncooper")

when I really want it to match all records that contain all of the following separate words: twitter com andersoncooper

Is there a different query filter or tokenizer I should be using?


Comments (3)

所有深爱都是秘密 2024-10-18 21:27:47

This should be the simplest solution:

<field name="iconUrl" type="string" indexed="true" stored="true" />

But for your requirement you will need to make it multivalued and index each URL three ways: 1. unchanged, 2. without http, 3. without www.

Or make the URL searchable via leading wildcards (which is slower, I guess).
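
As a minimal sketch of the multivalued approach, the three variants could be generated on the client before the document is sent to Solr. The helper name `url_variants` and the `id` field are hypothetical, and reading "without www" as "without both the scheme and a leading www." is an assumption.

```python
import re


def url_variants(url):
    """Return the URL unchanged, without its scheme, and without scheme and www."""
    # Drop a leading scheme such as "http://" or "https://".
    no_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Drop a leading "www." from the scheme-less form.
    no_www = re.sub(r"^www\.", "", no_scheme, flags=re.IGNORECASE)
    variants = []
    for candidate in (url, no_scheme, no_www):
        if candidate not in variants:  # skip duplicates, keep order
            variants.append(candidate)
    return variants


# Hypothetical document for a multivalued "iconUrl" string field.
doc = {"id": "1", "iconUrl": url_variants("http://www.twitter.com/AndersonCooper")}
print(doc["iconUrl"])
```

Note that a `string` field matches values exactly, including case, so the client would probably also need to normalize case consistently at index and query time.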

唐婉 2024-10-18 21:27:47

You can try the keyword tokenizer

From the book Solr 1.4 Enterprise Search Server published by Packt

KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all for that matter! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that due to sorting or faceting requirements you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.

陪你到最终 2024-10-18 21:27:46

If I understand this statement from your question:

myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper

You are trying to write a query that would match both:

http://www.twitter.com/AndersonCooper

and

http://www.andersoncooper.com/socialmedia/twitter

(both links contain all of the tokens), but not match either

http://www.facebook.com/AndersonCooper 

or

http://www.twitter.com/AliceCooper

If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and querying via curl or some other URL-based mechanism, the query parameter needs to look like this:

&q=myField:andersoncooper AND myField:twitter AND myField:com

One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be explicitly specified above. Alternatively, to save some space, you can change the default query operator to "AND" like this:

&q.op=AND&q=myField:(andersoncooper twitter com)
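
When the request is sent from code rather than typed by hand, the parameters need URL encoding (the spaces, colon, and parentheses in particular). Here is a small sketch using Python's standard library; `build_and_query` is a hypothetical helper, and the resulting string would be appended to whatever select URL your Solr instance exposes.

```python
from urllib.parse import urlencode


def build_and_query(field, terms):
    """Build an encoded query string that requires every term in one field."""
    params = {
        "q.op": "AND",  # default operator between terms
        "q": f"{field}:({' '.join(terms)})",
    }
    return urlencode(params)


qs = build_and_query("myField", ["andersoncooper", "twitter", "com"])
print(qs)
```

The field name is passed through as given, so it must match the name defined in the schema exactly.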