What's wrong with this Lucene TokenFilter?

Posted 2024-12-04 00:02:28

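Disclaimer: I've been coding for 36 of the last 41 hours, and I have a headache. I can't figure out why this combining TokenFilter returns 2 tokens, both of them the first token from the source stream.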

public class TokenCombiner extends TokenFilter {

  /*
   * Recombines all tokens back into a single token using the specified delimiter.
   */
  public TokenCombiner(TokenStream in, int delimiter) {
    super(in);
    this.delimiter = delimiter;
  }
  int delimiter;


  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);


  private boolean firstToken = true;
  int startOffset = 0;
  @Override
  public final boolean incrementToken() throws IOException {
    while (true){ 
        boolean eos = input.incrementToken(); //We have to process tokens even if they return end of file.
        CharTermAttribute token = input.getAttribute(CharTermAttribute.class);
        if (eos && token.length() == 0) break; //Break early to avoid extra whitespace.
        if (firstToken){
            startOffset = input.getAttribute(OffsetAttribute.class).startOffset();
            firstToken = false;

        }else{
            termAtt.append(Character.toString((char)delimiter));
        }
        termAtt.append(token);
        if (eos) break;
    }
    offsetAtt.setOffset(startOffset, input.getAttribute(OffsetAttribute.class).endOffset());
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    firstToken = true;
    startOffset = 0;
  }
}

两人的回忆 2024-12-11 00:02:28

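I think the fundamental problem here is that TokenCombiner and the producer it consumes (input) share and reuse the same attribute instances. So token == termAtt always holds (try adding an assert!). Appending token to termAtt therefore appends the buffer to itself.

Man, that sucks if you have been coding for 36 hours on a weekend... try this: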


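import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TokenCombiner extends TokenFilter {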
  private final StringBuilder sb = new StringBuilder();
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final char separator;
  private boolean consumed; // true if we already consumed

  protected TokenCombiner(TokenStream input, char separator) {
    super(input);
    this.separator = separator;
  }

  @Override
  public final boolean incrementToken() throws IOException {
    if (consumed) {
      return false; // don't call input.incrementToken() after it returns false
    }
    consumed = true;

    int startOffset = 0;
    int endOffset = 0;

    boolean found = false; // true if we actually consumed any tokens
    while (input.incrementToken()) {
      if (!found) {
        startOffset = offsetAtt.startOffset();
        found = true;
      }
      sb.append(termAtt);
      sb.append(separator);
      endOffset = offsetAtt.endOffset();
    }

    if (found) {
      assert sb.length() > 0; // always: because we append separator
      sb.setLength(sb.length() - 1);
      clearAttributes();
      termAtt.setEmpty().append(sb);
      offsetAtt.setOffset(startOffset, endOffset);
      return true;
    } else {
      return false;
    }
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    sb.setLength(0);
    consumed = false;
  }
}
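
The attribute sharing described in this answer can be illustrated without Lucene on the classpath. The following is only a rough model, not Lucene code: `SharedAttributeDemo`, `Producer`, `combineInSharedBuffer` and `combine` are hypothetical names, and a plain `StringBuilder` stands in for the shared `CharTermAttribute`. It assumes the upstream stream clears and rewrites that one buffer for every token, which is why anything accumulated there is lost, while a private buffer keeps it.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Lucene-free model of attribute sharing; every name here is hypothetical.
final class SharedAttributeDemo {

  /** Stand-in for an upstream TokenStream: rewrites one shared buffer per token. */
  static final class Producer {
    final StringBuilder term; // plays the role of the shared CharTermAttribute
    private final Iterator<String> words;

    Producer(List<String> words, StringBuilder term) {
      this.words = words.iterator();
      this.term = term;
    }

    boolean incrementToken() {
      if (!words.hasNext()) {
        return false;
      }
      term.setLength(0); // the previous token's text is discarded here
      term.append(words.next());
      return true;
    }
  }

  /** Accumulates in the shared buffer, so each new token wipes the earlier ones. */
  static String combineInSharedBuffer(Producer in, char separator) {
    StringBuilder termAtt = in.term; // same instance the producer writes to
    while (in.incrementToken()) {
      termAtt.append(separator);
    }
    return termAtt.toString();
  }

  /** Copies each token into a private buffer before the producer overwrites it. */
  static String combine(Producer in, char separator) {
    StringBuilder sb = new StringBuilder();
    boolean first = true;
    while (in.incrementToken()) {
      if (!first) {
        sb.append(separator);
      }
      first = false;
      sb.append(in.term);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("a", "b", "c");
    // prints "c,": only the last token survives in the shared buffer
    System.out.println(combineInSharedBuffer(new Producer(words, new StringBuilder()), ','));
    // prints "a,b,c"
    System.out.println(combine(new Producer(words, new StringBuilder()), ','));
  }
}
```

The private `StringBuilder` in `combine` plays the same role as the `sb` field in the filter above: the shared attribute is only written once, after the input has been fully consumed.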
