Lucene: how to preserve whitespace etc. when tokenizing a stream?



I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords etc. from the input so that the output is formatted in the same way as the input, instead of ending up as a bare stream of translations. So if my input is

Term1: Term2 Stopword! Term3
Term4

then I want the output to look like

Term1': Term2' Stopword! Term3'
Term4'

(where Termi' is the translation of Termi) instead of simply

Term1' Term2' Term3' Term4'

Currently I am doing the following:

PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
                             PatternAnalyzer.WHITESPACE_PATTERN,
                             false, 
                             WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
// addAttribute returns the existing attribute or creates it if absent
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);

ts.reset(); // position the stream before the first token
while (ts.incrementToken()) { // loop over tokens
     String termIn = charTermAttribute.toString(); 
     ...
}

But this, of course, loses all the whitespace etc. How can I modify it so that I can re-insert the whitespace into the output? Thanks much!

============ UPDATE!

I tried splitting the original stream into "words" and "non-words". It seems to work fine. Not sure whether it's the most efficient way, though:

public ArrayList<Token> splitToWords(String sIn) {

    if (sIn == null || sIn.length() == 0) {
        return null;
    }

    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue; // still inside the same run of letters or non-letters
        }
        // the run ended: emit it as a WORD or NONWORD token
        TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    // emit the trailing run
    TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));

    return list;
}
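
(Here Token and TokenType are not Lucene classes but simple helpers of my own. A minimal sketch of what they might look like; the names and fields are assumptions for illustration, not taken from Lucene:)

// Hypothetical classification of a run of characters.
public enum TokenType { WORD, NONWORD }

// Hypothetical holder pairing a substring with its classification.
public class Token {
    private final String text;
    private final TokenType type;

    public Token(String text, TokenType type) {
        this.text = text;
        this.type = type;
    }

    public String getText() { return text; }
    public TokenType getType() { return type; }
}

With these, the translation pass just walks the list, replaces the text of WORD tokens via the dictionary lookup, and appends NONWORD runs (whitespace, punctuation) unchanged, which preserves the original formatting.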

============ ANSWER (by 累赘)


Well, it doesn't really lose the whitespace; you still have your original text :)

So I think you should make use of OffsetAttribute, which contains the startOffset() and endOffset() of each term within your original text. This is what Lucene uses, for example, to highlight snippets of search results in the original text.

I wrote up a quick test (uses EnglishAnalyzer) to demonstrate:
The input is:

Just a test of some ideas. Let's see if it works.

The output is:

just a test of some idea. let see if it work.

// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
  String input = "Just a test of some ideas. Let's see if it works.";
  EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
  StringBuilder output = new StringBuilder(input);
  // in some cases, the analyzer will make terms longer or shorter.
  // because of this we must track how much we have adjusted the text so far
  // so that the offsets returned will still work for us via replace()
  int delta = 0;

  TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    String term = termAtt.toString();
    int start = offsetAtt.startOffset();
    int end = offsetAtt.endOffset();
    output.replace(delta + start, delta + end, term);
    delta += (term.length() - (end - start));
  }
  ts.close();

  System.out.println(output.toString());
}
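
To adapt this to the dictionary translation described in the question, the only change needed is to substitute the looked-up translation for the term before splicing it back in. A minimal sketch of the adjusted loop, assuming a hypothetical translate() helper that wraps your specialized dictionary (not a Lucene API):

while (ts.incrementToken()) {
  String term = termAtt.toString();
  // translate() is a hypothetical lookup, e.g. a Map<String,String>
  // that falls back to the original term when there is no entry
  String translated = translate(term);
  int start = offsetAtt.startOffset();
  int end = offsetAtt.endOffset();
  output.replace(delta + start, delta + end, translated);
  delta += (translated.length() - (end - start));
}

Everything between the offsets (whitespace, punctuation, any stopwords the analyzer dropped) is never touched, so the output keeps the input's formatting.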
