Is there anything wrong with this Lucene TokenFilter?
Disclaimer: I've been coding for 36 of the last 41 hours. I have a headache. And I can't figure out why this combining TokenFilter is returning 2 tokens, both the first token from the source stream.
public class TokenCombiner extends TokenFilter {
  /*
   * Recombines all tokens back into a single token using the specified delimiter.
   */
  public TokenCombiner(TokenStream in, int delimiter) {
    super(in);
    this.delimiter = delimiter;
  }

  int delimiter;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private boolean firstToken = true;
  int startOffset = 0;

  @Override
  public final boolean incrementToken() throws IOException {
    while (true) {
      boolean eos = input.incrementToken(); // We have to process tokens even if they return end of file.
      CharTermAttribute token = input.getAttribute(CharTermAttribute.class);
      if (eos && token.length() == 0) break; // Break early to avoid extra whitespace.
      if (firstToken) {
        startOffset = input.getAttribute(OffsetAttribute.class).startOffset();
        firstToken = false;
      } else {
        termAtt.append(Character.toString((char) delimiter));
      }
      termAtt.append(token);
      if (eos) break;
    }
    offsetAtt.setOffset(startOffset, input.getAttribute(OffsetAttribute.class).endOffset());
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    firstToken = true;
    startOffset = 0;
  }
}
Comments (1)
I think the fundamental problem here is that you must realize both TokenCombiner and the producer it consumes (input) share and reuse the same attribute instances! So token == termAtt always (try adding an assert!).
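For example (my own illustration, not part of the original answer), a quick check inside incrementToken(), right after input.incrementToken(), makes the sharing visible; run the JVM with -ea so the assertion actually fires:

    CharTermAttribute token = input.getAttribute(CharTermAttribute.class);
    assert token == termAtt : "the filter and its producer share one CharTermAttribute instance";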
Man, that sucks if you have been coding for 36 hours on a weekend... try this:
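The code that followed "try this:" isn't reproduced above, so here is a minimal sketch of one way to apply the advice (my own reconstruction, not necessarily the answerer's exact code): accumulate the term text in a local StringBuilder while consuming the input, and only write it into the shared termAtt once, after the input is exhausted.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Sketch only: combines every token from the input into a single token.
    public final class TokenCombiner extends TokenFilter {
      private final StringBuilder buffer = new StringBuilder();
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
      private final int delimiter;
      private boolean done = false;

      public TokenCombiner(TokenStream in, int delimiter) {
        super(in);
        this.delimiter = delimiter;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (done) {
          return false; // the single combined token has already been emitted
        }
        done = true;
        buffer.setLength(0);
        int startOffset = 0;
        int endOffset = 0;
        boolean sawToken = false;

        while (input.incrementToken()) {
          if (!sawToken) {
            // termAtt/offsetAtt are shared with the producer, so right after
            // input.incrementToken() they hold the producer's current values.
            startOffset = offsetAtt.startOffset();
            sawToken = true;
          } else {
            buffer.append((char) delimiter);
          }
          // Copy the term text out of the shared attribute before the next
          // incrementToken() call overwrites it.
          buffer.append(termAtt);
          endOffset = offsetAtt.endOffset();
        }

        if (!sawToken) {
          return false; // the input stream produced no tokens at all
        }

        // Write the accumulated text back into the shared attributes as one token.
        clearAttributes();
        termAtt.setEmpty().append(buffer);
        offsetAtt.setOffset(startOffset, endOffset);
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        done = false;
      }
    }

Because the attribute instances are shared with the producer, anything you want to keep across calls to input.incrementToken() has to be copied out first; the StringBuilder plays that role here, and the done flag makes sure incrementToken() emits the combined token exactly once and then reports end of stream.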