Lucene.NET: Camel case tokenizer?
I started playing with Lucene.NET today and wrote a simple test method to index and search source code files. The problem is that the standard analyzers/tokenizers treat a whole camel case source code identifier as a single token.
I'm looking for a way to split camel case identifiers like MaxWidth into three tokens: maxwidth, max and width. I've looked for such a tokenizer but couldn't find one. Before I write my own: does something like this already exist? Or is there a better approach than writing a tokenizer from scratch?
UPDATE: In the end I decided to get my hands dirty and wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and update the question.
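For readers who want to see the splitting rule itself, here is a minimal sketch in plain Java (Lucene's original language, and easy to port to C#). It is not the CamelCaseTokenFilter mentioned above and uses no Lucene or Lucene.NET API; the class name `CamelCaseSplitter` and the method `split` are hypothetical. It assumes ASCII identifiers, and it assumes that a run of capitals such as "XML" should stay together as one part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.NET or Solr.
class CamelCaseSplitter {

    // Returns the lowercased whole identifier, followed by its lowercased parts.
    static List<String> split(String identifier) {
        List<String> tokens = new ArrayList<>();
        if (identifier == null || identifier.isEmpty()) {
            return tokens;
        }
        tokens.add(identifier.toLowerCase(Locale.ROOT));

        // Split at a lower-to-upper boundary ("maxWidth" -> "max", "Width")
        // and before the last capital of an acronym run ("XMLParser" -> "XML", "Parser").
        String[] parts = identifier.split(
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

        // A single part means there was nothing to split; avoid a duplicate token.
        if (parts.length > 1) {
            for (String part : parts) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

Calling `CamelCaseSplitter.split("MaxWidth")` returns the tokens maxwidth, max and width, in that order. In a real analyzer this logic would probably sit inside a token filter placed after the tokenizer, so that the parts are emitted alongside the original token instead of replacing it.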
Comments (3)
Solr has a WordDelimiterFilterFactory, which builds a token filter similar to what you need. Maybe you can translate its source code into C#.
The link below might be helpful for writing a custom tokenizer:
http://karticles.com/NoSql/lucene_custom_tokenizer.html
Here is my implementation: