Is there a stemmer available for Indian languages?

Posted on 2024-09-29 13:11:36

梦回梦里 2024-10-06 13:11:36

Lucene provides a Hindi analyzer that includes a stemmer. It is based on this algorithm (pdf).

黎歌 2024-10-06 13:11:36

hindi_stemmer is an implementation of "A Lightweight Stemmer for Hindi" (http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf) by Ananthakrishnan Ramanathan and Durgesh D Rao.

吃颗糖壮壮胆 2024-10-06 13:11:36

import java.util.Map;
import java.util.WeakHashMap;

/**
 * Hindi light stemmer: removes number, gender and case suffixes from nouns and adjectives.
 */
public class HindiStemmerLight {

    /** A cache of words and their stems. */
    private static final Map<String, String> cache = new WeakHashMap<String, String>();

    /** Suffixes grouped by length, longest first. */
    private static final String[][] SUFFIXES = {
        { "िया", "ियो" },
        { "ाए", "ाओ", "ुआ", "ुओ", "ये", "ेन", "ेण", "ीय", "टी", "ार", "ाई" },
        { "ा", "े", "ी", "ो", "ि", "अ" }
    };

    /** A suffix group is tried only when the word is longer than this many characters. */
    private static final int[] MIN_LENGTH = { 5, 4, 3 };

    /** A buffer of the current word being stemmed. */
    private final StringBuilder sb = new StringBuilder();

    public String stem(String word) {
        String result = cache.get(word);

        if (result != null)
            return result;

        // Reset the buffer and load the new word.
        sb.setLength(0);
        sb.append(word);

        /* remove the case endings from nouns and adjectives */
        remove_suffix(sb);

        result = sb.toString();
        cache.put(word, result);

        return result;
    }

    /** Strips the first matching suffix from the buffer, trying longer suffixes first. */
    private void remove_suffix(StringBuilder word) {
        String w = word.toString();

        for (int i = 0; i < SUFFIXES.length; i++) {
            if (w.length() <= MIN_LENGTH[i])
                continue;

            for (String suffix : SUFFIXES[i]) {
                if (w.endsWith(suffix)) {
                    word.setLength(w.length() - suffix.length());
                    return;
                }
            }
        }
    }

}
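
Several of the suffixes above begin with a combining vowel sign (matra), and such literals are easy to damage when code is pasted: a stray space or a dropped mark makes a comparison fail silently. The sketch below is one way to check a literal by listing its code points. The class name SuffixCodePoints and the helper describe are made up for this illustration, and the literals are written as Unicode escapes so the example itself cannot be corrupted by pasting.

```java
public class SuffixCodePoints {

    /** Lists the code points of s in U+XXXX form, separated by spaces. */
    public static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String clean = "\u093E\u0913";   // vowel sign AA followed by letter O
        String padded = " \u093E\u0913"; // the same suffix with a stray leading space

        System.out.println(describe(clean));  // U+093E U+0913
        System.out.println(describe(padded)); // U+0020 U+093E U+0913

        // A word that ends in the clean suffix does not end in the padded one.
        String word = "\u0932\u093E\u0913";
        System.out.println(word.endsWith(clean));  // true
        System.out.println(word.endsWith(padded)); // false
    }
}
```

If a pasted suffix shows an unexpected U+0020 or is missing a vowel sign, replacing the literal with its escaped form avoids the problem.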

护你周全 2024-10-06 13:11:36

We created a Java version of the original Python Hindi Stemmer code.

It is not clear how the authors of the original Hindi stemmer used the variable L, but here is complete code that should work:

import java.util.Arrays;
import java.util.List;

public class SimpleHindiStemmer {

    /* These characters may not survive pasting; if they look wrong, copy them from suffixes[1] in the Python code. */
    private static final String[] stem1 = new String[] { "ो", "े", "ू", "ु", "ी", "ि", "ा" };

    /* These characters may not survive pasting; if they look wrong, copy them from suffixes[2] in the Python code. */
    private static final String[] stem2 = new String[] { "कर", "ाओ", "िए", "ाई", "ाए", "ने", "नी", "ना", "ते", "ीं", "ती", "ता", "ाँ", "ां", "ों", "ें" };

    private static final String[] stem3 = new String[] { "ाकर", "ाइए", "ाईं", "ाया", "ेगी", "ेगा", "ोगी", "ोगे", "ाने", "ाना", "ाते", "ाती", "ाता", "तीं", "ाओं", "ाएं", "ुओं", "ुएं", "ुआं" };

    private static final String[] stem4 = new String[] { "ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां" };

    private static final String[] stem5 = new String[] { "ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां" };

    /* Suffix groups in the order they are tried: longest first. */
    private static final List<String[]> stemList = Arrays.asList(stem5, stem4, stem3, stem2, stem1);

    /* cut[n] is the number of characters removed when a suffix of length n matches. */
    private static final int[] cut = new int[] { 0, 1, 1, 1, 2, 2 };

    public static void main(String[] argv) {
        SimpleHindiStemmer sm = new SimpleHindiStemmer();
        String word = "रास्ते";
        System.out.println(sm.stemprocess(word));
    }

    public String stemprocess(String word) {
        int wlen = word.length();
        // Length scaled by 3; a Devanagari character takes 3 bytes in UTF-8.
        int wordlen = wlen * 3;
        // Length of the suffixes in the group currently being tried.
        int icnt = 5;

        for (String[] stemwords : stemList) {
            if (wordlen > (icnt + 1)) {
                for (String sw : stemwords) {
                    if (word.endsWith(sw)) {
                        return word.substring(0, wlen - cut[icnt]);
                    }
                }
            }
            icnt--;
        }

        return word;
    }
}

Note that some UTF-8 characters in the SimpleHindiStemmer listing may not have survived pasting. Look at the original Python code and copy the suffix values from there.
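
On the variable L mentioned above: one plausible reading, assumed here and not checked against the Python source, is that L is the suffix length in characters. The stemmer would then try the longest suffixes first and, for a word longer than L + 1 characters that ends in a length-L suffix, remove exactly L characters. A minimal sketch of that reading follows, with a hypothetical class name LengthBasedStemmer and only a small subset of the suffix lists (lengths 1 to 3), written as Unicode escapes:

```java
public class LengthBasedStemmer {

    // SUFFIXES[L] holds an illustrative subset of the suffixes that are L characters long.
    // Index 0 is unused so that the index equals the suffix length.
    private static final String[][] SUFFIXES = {
        {},
        { "\u0947", "\u0940", "\u093E" },             // े, ी, ा
        { "\u0924\u0947", "\u094B\u0902" },           // ते, ों
        { "\u093E\u0915\u0930", "\u093E\u0928\u093E" } // ाकर, ाना
    };

    /** Removes the longest matching suffix, assuming L is the suffix length in characters. */
    public static String stem(String word) {
        for (int len = SUFFIXES.length - 1; len >= 1; len--) {
            // Skip this group when the word is too short to lose len characters.
            if (word.length() <= len + 1) {
                continue;
            }
            for (String suffix : SUFFIXES[len]) {
                if (word.endsWith(suffix)) {
                    return word.substring(0, word.length() - len);
                }
            }
        }
        return word;
    }

    public static void main(String[] args) {
        // रास्ते, the sample word used in SimpleHindiStemmer
        String word = "\u0930\u093E\u0938\u094D\u0924\u0947";
        System.out.println(stem(word));
    }
}
```

Under this reading रास्ते loses the whole two-character suffix ते, whereas the cut table in SimpleHindiStemmer removes only one character for the same match.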
