Is there a Java implementation of the Porter2 stemmer?

Do you know of any Java implementation of the Porter2 stemmer (or any better stemmer written in Java)? I know that there is a Java version of Porter (not Porter2) here:

http://tartarus.org/~martin/PorterStemmer/java.txt

but on http://tartarus.org/~martin/PorterStemmer/ the author mentions that Porter is a bit outdated and recommends using Porter2 instead.

However, my problem is that Porter2 is written in Snowball (I had never heard of it before, so I don't know anything about it). What I am looking for is a Java version of it.

Thanks. Your help will be highly appreciated.

The Snowball algorithm is available as a Java download.

And from snowball.tartarus.org:

Feb 2002 - Java support: Richard has
modified the snowball code generator
to produce Java output as well as ANSI
C output. This means that pure Java
systems can now use the snowball

This is what you want, right?

You can create an instance of it like so:

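  // lang is the language name as it appears in the class name; Snowball's English stemmer is Porter2
  Class<?> stemClass = Class.forName("org.tartarus.snowball.ext." + lang + "Stemmer");
  SnowballStemmer stemmer = (SnowballStemmer) stemClass.getDeclaredConstructor().newInstance();
  stemmer.setCurrent("your_word");  // word to stem
  stemmer.stem();
  String your_stemmed_word = stemmer.getCurrent();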
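
The lookup-by-name pattern above can be tried without the Snowball jar on the classpath. The following is a minimal sketch, not Snowball code: `StemmerLoader`, `WordStemmer` and the nested `EnglishStemmer` placeholder are hypothetical names invented for illustration, and the placeholder only strips one trailing "s". It instantiates through `getDeclaredConstructor().newInstance()`, since `Class.newInstance()` has been deprecated since Java 9.

```java
final class StemmerLoader {
    private StemmerLoader() {}

    // Hypothetical stand-in for a stemmer API; this is NOT a Snowball type.
    interface WordStemmer {
        String stem(String word);
    }

    // Placeholder so the loader has something to find.
    // It strips a single trailing 's' and is neither Porter nor Porter2.
    static final class EnglishStemmer implements WordStemmer {
        @Override
        public String stem(String word) {
            return word.length() > 1 && word.endsWith("s")
                    ? word.substring(0, word.length() - 1)
                    : word;
        }
    }

    // Builds the class name from the language name, loads the class,
    // and creates an instance through its no-argument constructor.
    static WordStemmer load(String lang) {
        String className = StemmerLoader.class.getName() + "$" + lang + "Stemmer";
        try {
            Class<?> stemClass = Class.forName(className);
            return (WordStemmer) stemClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No stemmer for language: " + lang, e);
        }
    }
}
```

Calling `StemmerLoader.load("English")` returns the placeholder; an unknown language name ends in an `IllegalArgumentException` rather than a checked reflection exception, which keeps call sites simple.
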
萌梦深 2024-10-13 23:29:18

/*

   Porter stemmer in Java. The original paper is in

       Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
       no. 3, pp 130-137,

   See also http://www.tartarus.org/~martin/PorterStemmer


   Release 1

   Bug 1 (reported by Gonzalo Parra 16/10/99) fixed as marked below.
   The words 'aed', 'eed', 'oed' leave k at 'a' for step 3, and b[k-1]
   is then outside the bounds of b.

   Release 2


   Bug 2 (reported by Steve Dyrdahl 22/2/00) fixed as marked below.
   'ion' by itself leaves j = -1 in the test for 'ion' in step 5, and
   b[j] is then outside the bounds of b.

   Release 3

   Considerably revised 4/9/00 in the light of many helpful suggestions
   from Brian Goetz of Quiotix Corporation.

   Release 4

*/

import java.io.*;

/**
  * Stemmer, implementing the Porter Stemming Algorithm
  *
  * The Stemmer class transforms a word into its root form.  The input
  * word can be provided a character at a time (by calling add()), or at once
  * by calling one of the various stem(something) methods.
  */

class Stemmer
{  private char[] b;
   private int i,     /* offset into b */
               i_end, /* offset to end of stemmed word */
               j, k;
   private static final int INC = 50;
                     /* unit of size whereby b is increased */
   public Stemmer()
   {  b = new char[INC];
      i = 0;
      i_end = 0;
   }

   /**
    * Add a character to the word being stemmed.  When you are finished
    * adding characters, you can call stem() to stem the word.
    */

   public void add(char ch)
   {  if (i == b.length)
      {  char[] new_b = new char[i+INC];
         for (int c = 0; c < i; c++) new_b[c] = b[c];
         b = new_b;
      }
      b[i++] = ch;
   }

   /** Adds wLen characters to the word being stemmed contained in a portion
    * of a char[] array. This is like repeated calls of add(char ch), but
    * faster.
    */

   public void add(char[] w, int wLen)
   {  if (i+wLen >= b.length)
      {  char[] new_b = new char[i+wLen+INC];
         for (int c = 0; c < i; c++) new_b[c] = b[c];
         b = new_b;
      }
      for (int c = 0; c < wLen; c++) b[i++] = w[c];
   }

   /**
    * After a word has been stemmed, it can be retrieved by toString(),
    * or a reference to the internal buffer can be retrieved by getResultBuffer
    * and getResultLength (which is generally more efficient.)
    */
   public String toString() { return new String(b,0,i_end); }

   /**
    * Returns the length of the word resulting from the stemming process.
    */
   public int getResultLength() { return i_end; }

   /**
    * Returns a reference to a character buffer containing the results of
    * the stemming process.  You also need to consult getResultLength()
    * to determine the length of the result.
    */
   public char[] getResultBuffer() { return b; }

   /* cons(i) is true <=> b[i] is a consonant. */

   private final boolean cons(int i)
   {  switch (b[i])
      {  case 'a': case 'e': case 'i': case 'o': case 'u': return false;
         case 'y': return (i==0) ? true : !cons(i-1);
         default: return true;
      }
   }

   /* m() measures the number of consonant sequences between 0 and j. if c is
      a consonant sequence and v a vowel sequence, and <..> indicates arbitrary
      presence,

         <c><v>       gives 0
         <c>vc<v>     gives 1
         <c>vcvc<v>   gives 2
         <c>vcvcvc<v> gives 3
         ....
   */

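   private final int m()
   {  int n = 0;
      int i = 0;
      /* skip the optional leading consonant sequence */
      while(true)
      {  if (i > j) return n;
         if (! cons(i)) break; i++;
      }
      i++;
      while(true)
      {  /* skip a vowel sequence */
         while(true)
         {  if (i > j) return n;
               if (cons(i)) break;
               i++;
         }
         i++;
         n++;
         /* skip the consonant sequence that follows it */
         while(true)
         {  if (i > j) return n;
            if (! cons(i)) break;
            i++;
         }
         i++;
      }
   }
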
   /* vowelinstem() is true <=> 0,...j contains a vowel */

   private final boolean vowelinstem()
   {  int i; for (i = 0; i <= j; i++) if (! cons(i)) return true;
      return false;
   }

   /* doublec(j) is true <=> j,(j-1) contain a double consonant. */

   private final boolean doublec(int j)
   {  if (j < 1) return false;
      if (b[j] != b[j-1]) return false;
      return cons(j);
   }

   /* cvc(i) is true <=> i-2,i-1,i has the form consonant - vowel - consonant
      and also if the second c is not w,x or y. this is used when trying to
      restore an e at the end of a short word. e.g.

         cav(e), lov(e), hop(e), crim(e), but
         snow, box, tray.

   */

   private final boolean cvc(int i)
   {  if (i < 2 || !cons(i) || cons(i-1) || !cons(i-2)) return false;
      {  int ch = b[i];
         if (ch == 'w' || ch == 'x' || ch == 'y') return false;
      }
      return true;
   }

   /* ends(s) is true <=> 0,...k ends with the string s. */

   private final boolean ends(String s)
   {  int l = s.length();
      int o = k-l+1;
      if (o < 0) return false;
      for (int i = 0; i < l; i++) if (b[o+i] != s.charAt(i)) return false;
      j = k-l;
      return true;
   }

   /* setto(s) sets (j+1),...k to the characters in the string s, readjusting
      k. */

   private final void setto(String s)
   {  int l = s.length();
      int o = j+1;
      for (int i = 0; i < l; i++) b[o+i] = s.charAt(i);
      k = j+l;
   }

   /* r(s) is used further down. */

   private final void r(String s) { if (m() > 0) setto(s); }

   /* step1() gets rid of plurals and -ed or -ing. e.g.

          caresses  ->  caress
          ponies    ->  poni
          ties      ->  ti
          caress    ->  caress
          cats      ->  cat

          feed      ->  feed
          agreed    ->  agree
          disabled  ->  disable

          matting   ->  mat
          mating    ->  mate
          meeting   ->  meet
          milling   ->  mill
          messing   ->  mess

          meetings  ->  meet

   */

   private final void step1()
   {  if (b[k] == 's')
      {  if (ends("sses")) k -= 2; else
         if (ends("ies")) setto("i"); else
         if (b[k-1] != 's') k--;
      }
      if (ends("eed")) { if (m() > 0) k--; } else
      if ((ends("ed") || ends("ing")) && vowelinstem())
      {  k = j;
         if (ends("at")) setto("ate"); else
         if (ends("bl")) setto("ble"); else
         if (ends("iz")) setto("ize"); else
         if (doublec(k))
         {  k--;
            {  int ch = b[k];
               if (ch == 'l' || ch == 's' || ch == 'z') k++;
            }
         }
         else if (m() == 1 && cvc(k)) setto("e");
      }
   }

   /* step2() turns terminal y to i when there is another vowel in the stem. */

   private final void step2() { if (ends("y") && vowelinstem()) b[k] = 'i'; }

   /* step3() maps double suffices to single ones. so -ization ( = -ize plus
      -ation) maps to -ize etc. note that the string before the suffix must give
      m() > 0. */

   private final void step3() { if (k == 0) return; /* For Bug 1 */ switch (b[k-1])
   {
       case 'a': if (ends("ational")) { r("ate"); break; }
                 if (ends("tional")) { r("tion"); break; }
                 break;
       case 'c': if (ends("enci")) { r("ence"); break; }
                 if (ends("anci")) { r("ance"); break; }
                 break;
       case 'e': if (ends("izer")) { r("ize"); break; }
                 break;
       case 'l': if (ends("bli")) { r("ble"); break; }
                 if (ends("alli")) { r("al"); break; }
                 if (ends("entli")) { r("ent"); break; }
                 if (ends("eli")) { r("e"); break; }
                 if (ends("ousli")) { r("ous"); break; }
                 break;
       case 'o': if (ends("ization")) { r("ize"); break; }
                 if (ends("ation")) { r("ate"); break; }
                 if (ends("ator")) { r("ate"); break; }
                 break;
       case 's': if (ends("alism")) { r("al"); break; }
                 if (ends("iveness")) { r("ive"); break; }
                 if (ends("fulness")) { r("ful"); break; }
                 if (ends("ousness")) { r("ous"); break; }
                 break;
       case 't': if (ends("aliti")) { r("al"); break; }
                 if (ends("iviti")) { r("ive"); break; }
                 if (ends("biliti")) { r("ble"); break; }
                 break;
       case 'g': if (ends("logi")) { r("log"); break; }
   } }

   /* step4() deals with -ic-, -full, -ness etc. similar strategy to step3. */

   private final void step4() { switch (b[k])
   {
       case 'e': if (ends("icate")) { r("ic"); break; }
                 if (ends("ative")) { r(""); break; }
                 if (ends("alize")) { r("al"); break; }
                 break;
       case 'i': if (ends("iciti")) { r("ic"); break; }
                 break;
       case 'l': if (ends("ical")) { r("ic"); break; }
                 if (ends("ful")) { r(""); break; }
                 break;
       case 's': if (ends("ness")) { r(""); break; }
                 break;
   } }

   /* step5() takes off -ant, -ence etc., in context <c>vcvc<v>. */

   private final void step5()
   {   if (k == 0) return; /* for Bug 1 */ switch (b[k-1])
       {  case 'a': if (ends("al")) break; return;
          case 'c': if (ends("ance")) break;
                    if (ends("ence")) break; return;
          case 'e': if (ends("er")) break; return;
          case 'i': if (ends("ic")) break; return;
          case 'l': if (ends("able")) break;
                    if (ends("ible")) break; return;
          case 'n': if (ends("ant")) break;
                    if (ends("ement")) break;
                    if (ends("ment")) break;
                    /* element etc. not stripped before the m */
                    if (ends("ent")) break; return;
          case 'o': if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break;
                                    /* j >= 0 fixes Bug 2 */
                    if (ends("ou")) break; return;
                    /* takes care of -ous */
          case 's': if (ends("ism")) break; return;
          case 't': if (ends("ate")) break;
                    if (ends("iti")) break; return;
          case 'u': if (ends("ous")) break; return;
          case 'v': if (ends("ive")) break; return;
          case 'z': if (ends("ize")) break; return;
          default: return;
       }
       if (m() > 1) k = j;
   }

   /* step6() removes a final -e if m() > 1. */

   private final void step6()
   {  j = k;
      if (b[k] == 'e')
      {  int a = m();
         if (a > 1 || a == 1 && !cvc(k-1)) k--;
      }
      if (b[k] == 'l' && doublec(k) && m() > 1) k--;
   }

   /** Stem the word placed into the Stemmer buffer through calls to add().
    * You can retrieve the result with
    * getResultLength()/getResultBuffer() or toString().
    */
   public void stem()
   {  k = i - 1;
      if (k > 1) { step1(); step2(); step3(); step4(); step5(); step6(); }
      i_end = k+1; i = 0;
   }

   /** Test program for demonstrating the Stemmer.  It reads text from a
    * list of files, stems each word, and writes the result to standard
    * output. Note that the word stemmed is expected to be in lower case:
    * forcing lower case must be done outside the Stemmer class.
    * Usage: Stemmer file-name file-name ...
    */
   public static void main(String[] args)
   {
      char[] w = new char[501];
      Stemmer s = new Stemmer();
      for (int i = 0; i < args.length; i++)
      try
      {
         FileInputStream in = new FileInputStream(args[i]);

         try
         { while(true)

           {  int ch = in.read();
              if (Character.isLetter((char) ch))
              {
                 int j = 0;
                 while(true)
                 {  ch = Character.toLowerCase((char) ch);
                    w[j] = (char) ch;
                    if (j < 500) j++;
                    ch = in.read();
                    if (!Character.isLetter((char) ch))
                    {
                       /* to test add(char ch) */
                       for (int c = 0; c < j; c++) s.add(w[c]);

                       /* or, to test add(char[] w, int j) */
                       /* s.add(w, j); */

                       s.stem();
                       {  String u;

                          /* and now, to test toString() : */
                          u = s.toString();

                          /* to test getResultBuffer(), getResultLength() : */
                          /* u = new String(s.getResultBuffer(), 0, s.getResultLength()); */

                          System.out.print(u);
                       }
                       break;
                    }
                 }
              }
              if (ch < 0) break;
              System.out.print((char)ch);
           }
         }
         catch (IOException e)
         {  System.out.println("error reading " + args[i]);
            break;
         }
      }
      catch (FileNotFoundException e)
      {  System.out.println("file " + args[i] + " not found");
         break;
      }
   }
}

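The measure m() drives most of the suffix rules in the listing above, so it may help to see it in isolation. The following is a minimal standalone sketch, not part of the original Stemmer class: `PorterMeasure`, `isConsonant` and `measure` are names chosen for illustration, and unlike m(), which only looks at b[0..j], this version measures a whole lower-case string. It counts how often a vowel run is followed by a consonant run, which is the m in the form [C](VC)^m[V].

```java
final class PorterMeasure {
    private PorterMeasure() {}

    // Same rule as cons(i) in the listing: a, e, i, o, u are vowels;
    // y is a consonant at index 0 or when the previous letter is a vowel.
    static boolean isConsonant(String word, int i) {
        switch (word.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !isConsonant(word, i - 1);
            default:
                return true;
        }
    }

    // Counts vowel-to-consonant transitions, i.e. the number of VC pairs.
    static int measure(String word) {
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < word.length(); i++) {
            boolean consonant = isConsonant(word, i);
            if (consonant && previousWasVowel) {
                m++;
            }
            previousWasVowel = !consonant;
        }
        return m;
    }
}
```

For example, `PorterMeasure.measure("tree")` gives 0, `measure("trouble")` gives 1 and `measure("troubles")` gives 2, which agrees with the examples in Porter's paper.
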
浅暮の光 2024-10-13 23:29:18

It is available as a part of MG4J.

See the documentation for EnglishStemmer, which implements Porter2. Use the method processTerm(MutableString ms).

MG4J also gives you Java versions of other stemmers. See the snowball package. All these stemmers can be used independently.

沒落の蓅哖 2024-10-13 23:29:18

Maybe not a direct answer, but there are stemmers in many NLP toolkits - see http://en.wikipedia.org/wiki/Natural_language_processing_toolkits.
There's a related question here, "Tokenizer, Stop Word Removal, Stemming in Java", with several answers that might be useful.

We use OpenNLP, which is written in Java and may provide the functionality. I wouldn't expect the variation between stemmers to be critical if you are working in English.

活泼老夫 2024-10-13 23:29:18

Lucene seems to integrate some stemming algorithms in one form or another. You may find what you're looking for starting at the package org.apache.lucene.analysis. However, I fear the stemming code is deeply integrated into the analysis components, which would make extracting it quite hard.

何处潇湘 2024-10-13 23:29:18

The following link contains the Snowball stemmer API. It has the Porter2 stemmer implementation.

疯了 2024-10-13 23:29:18

Here is a lightweight wrapper I made. It is easy to reuse and available on Maven Central.

