Searching filenames with ElasticSearch
I want to use ElasticSearch to search filenames (not the file's content). Therefore I need to find a part of the filename (exact match, no fuzzy search).
Example:
I have files with the following names:
My_first_file_created_at_2012.01.13.doc
My_second_file_created_at_2012.01.13.pdf
Another file.txt
And_again_another_file.docx
foo.bar.txt
Now I want to search for 2012.01.13 to get the first two files. A search for file or ile should return all filenames except the last one.
How can I accomplish that with ElasticSearch?
This is what I have tested, but it always returns zero results:
curl -X DELETE localhost:9200/files
curl -X PUT localhost:9200/files -d '
{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "filename_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "lowercase",
                        "filter" : ["filename_stop", "filename_ngram"]
                    }
                },
                "filter" : {
                    "filename_stop" : {
                        "type" : "stop",
                        "stopwords" : ["doc", "pdf", "docx"]
                    },
                    "filename_ngram" : {
                        "type" : "nGram",
                        "min_gram" : 3,
                        "max_gram" : 255
                    }
                }
            }
        }
    },
    "mappings": {
        "files": {
            "properties": {
                "filename": {
                    "type": "string",
                    "analyzer": "filename_analyzer"
                }
            }
        }
    }
}
'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"
FILES='
http://localhost:9200/files/_search?q=filename:2012.01.13
'
for file in ${FILES}
do
echo; echo; echo ">>> ${file}"
curl "${file}&pretty=true"
done
3 Answers
You have various problems with what you pasted:
1) Incorrect mapping
When creating the index, you specify:
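The relevant fragment from the command you pasted:

"mappings": {
    "files": {
        "properties": {
            "filename": {
                "type": "string",
                "analyzer": "filename_analyzer"
            }
        }
    }
}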
But your type is actually file, not files. If you checked the mapping, you would see that immediately:
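One way to check is a request like this (the exact response format depends on your ElasticSearch version):

curl "localhost:9200/files/_mapping?pretty=true"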
2) Incorrect analyzer definition

You have specified the lowercase tokenizer, but that removes anything that isn't a letter (see docs), so your numbers are being completely removed. You can check this with the analyze API:
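For example, a request along these lines (using the filename_analyzer from your settings) should show that no digits survive analysis:

curl "localhost:9200/files/_analyze?pretty=true&analyzer=filename_analyzer" -d 'My_first_file_created_at_2012.01.13.doc'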
3) Ngrams on search
You include your ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search on the full string, not on each ngram.
For instance, if you index "abcd" with ngrams of length 1 to 4, you will end up with these tokens: a, b, c, d, ab, bc, cd, abc, bcd, abcd. But if you search on "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on: d, c, b, a, dc, cb, ba, dcb, cba, dcba. So a, b, c and d will match!

Solution
First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which will anchor the ngram to the start (or end) of each word.

Also, why exclude docx etc? Surely a user may well want to search on the file type?

So let's break up each filename into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):
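A sketch of such a tokenizer, as one piece of the settings shown in full below (the name filename_tokenizer and the exact pattern are assumptions):

"tokenizer" : {
    "filename_tokenizer" : {
        "type" : "pattern",
        "pattern" : "[^a-zA-Z0-9]+"
    }
}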
Then for the index analyzer, we'll also use edge ngrams on each of those tokens:
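For example (the filter name, gram lengths and analyzer names are assumptions; note that filename_search deliberately omits the ngram filter, per point 3 above):

"filter" : {
    "filename_edge_ngram" : {
        "type" : "edgeNGram",
        "side" : "front",
        "min_gram" : 1,
        "max_gram" : 20
    }
},
"analyzer" : {
    "filename_index" : {
        "type" : "custom",
        "tokenizer" : "filename_tokenizer",
        "filter" : ["lowercase", "filename_edge_ngram"]
    },
    "filename_search" : {
        "type" : "custom",
        "tokenizer" : "filename_tokenizer",
        "filter" : ["lowercase"]
    }
}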
We create the index as follows:
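Putting the pieces together, a sketch of the full command (all names as assumed above; edgeNGram and index_analyzer/search_analyzer match the pre-1.0 ElasticSearch this question targets):

curl -X DELETE "localhost:9200/files"
curl -X PUT "localhost:9200/files" -d '
{
    "settings" : {
        "analysis" : {
            "tokenizer" : {
                "filename_tokenizer" : {
                    "type" : "pattern",
                    "pattern" : "[^a-zA-Z0-9]+"
                }
            },
            "filter" : {
                "filename_edge_ngram" : {
                    "type" : "edgeNGram",
                    "side" : "front",
                    "min_gram" : 1,
                    "max_gram" : 20
                }
            },
            "analyzer" : {
                "filename_index" : {
                    "type" : "custom",
                    "tokenizer" : "filename_tokenizer",
                    "filter" : ["lowercase", "filename_edge_ngram"]
                },
                "filename_search" : {
                    "type" : "custom",
                    "tokenizer" : "filename_tokenizer",
                    "filter" : ["lowercase"]
                }
            }
        }
    },
    "mappings" : {
        "file" : {
            "properties" : {
                "filename" : {
                    "type" : "string",
                    "index_analyzer" : "filename_index",
                    "search_analyzer" : "filename_search"
                }
            }
        }
    }
}
'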
Now, test that our analyzers are working correctly:
filename_search:
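A sketch of the check, assuming the setup above; this should return the whole lowercased tokens my, first, file, created, at, 2012, 01, 13, doc:

curl "localhost:9200/files/_analyze?pretty=true&analyzer=filename_search" -d 'My_first_file_created_at_2012.01.13.doc'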
filename_index:
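And the index analyzer, which should additionally return the edge ngrams of each of those tokens (m, my, f, fi, fir, firs, first, 2, 20, 201, 2012, and so on):

curl "localhost:9200/files/_analyze?pretty=true&analyzer=filename_index" -d 'My_first_file_created_at_2012.01.13.doc'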
OK - seems to be working correctly. So let's add some docs:
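Reusing the documents from the question:

curl -X POST "localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"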
And try a search:
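For example, with a text query (the pre-1.0 ancestor of match; the query 2012.01 is the one discussed in the UPDATE below):

curl "localhost:9200/files/file/_search?pretty=true" -d '
{
    "query" : {
        "text" : { "filename" : "2012.01" }
    }
}
'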
Success!
#### UPDATE ####
I realised that a search for 2012.01 would match both 2012.01.12 and 2012.12.01, so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have thought that the position of each ngram would be the same as for the start of the word).

The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query which tries to match ANY token. However, for a text_phrase query, it tries to match ALL of the tokens, and in the correct order.

To demonstrate the issue, index another doc with a different date:
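For example (the filename is invented for this demonstration; the date 2012.12.01 is the one discussed below):

curl -X POST "localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
curl -X POST "http://localhost:9200/files/_refresh"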
And do the same search as above:
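The same text query for 2012.01 as before:

curl "localhost:9200/files/file/_search?pretty=true" -d '
{
    "query" : {
        "text" : { "filename" : "2012.01" }
    }
}
'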
The first result has a date of 2012.12.01, which isn't the best match for 2012.01. So to match only that exact phrase, we can use a text_phrase query. Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries, but increase the importance of the filename which is in the correct order. Sketches of both follow:
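A sketch of the exact-phrase query. Analyzing the phrase with the index analyzer is an assumption on my part: it makes the query's edge ngrams line up, position for position, with the indexed ones:

curl "localhost:9200/files/file/_search?pretty=true" -d '
{
    "query" : {
        "text_phrase" : {
            "filename" : {
                "query"    : "2012.01",
                "analyzer" : "filename_index"
            }
        }
    }
}
'

And a sketch of the combined version, boosting the clause that requires the correct order:

curl "localhost:9200/files/file/_search?pretty=true" -d '
{
    "query" : {
        "bool" : {
            "should" : [
                { "text" : { "filename" : "2012.01" } },
                { "text_phrase" : {
                    "filename" : {
                        "query"    : "2012.01",
                        "analyzer" : "filename_index",
                        "boost"    : 2
                    }
                } }
            ]
        }
    }
}
'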
I believe this is because of the tokenizer being used.
http://www.elasticsearch.org/guide/reference/index-modules/analysis/lowercase-tokenizer.html
The lowercase tokenizer splits out on word boundaries so 2012.01.13 will be indexed as "2012","01" and "13". Searching for the string "2012.01.13" will obviously not match.
One option would be to add the tokenisation on search as well. Therefore, searching for "2012.01.13" will be tokenised down to the same tokens as in the index and it will match. This is also handy as you then don't need to always lowercase your searches in code.
The second option would be to use an n-gram tokenizer instead of the filter. This means it will ignore word boundaries (and you will keep the "_" characters as well); however, you may have issues with case mismatches, which is presumably the reason you added the lowercase tokenizer in the first place.
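A sketch of that variant, reusing the gram lengths from the question and moving lowercasing into a token filter (the tokenizer name is an assumption):

curl -X PUT "localhost:9200/files" -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "filename_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "filename_ngram_tokenizer",
                    "filter" : ["lowercase"]
                }
            },
            "tokenizer" : {
                "filename_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : 3,
                    "max_gram" : 255
                }
            }
        }
    }
}
'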
I have no experience with ES, but in Solr you would need to specify the field type as text.
Your field is of type string instead of text. String fields are not analyzed, but stored and indexed verbatim. Give that a shot and see if it works.