Simple tool to find the most common terms in a text

Posted on 2024-09-06 13:46:53


I have a text and I would like to extract the most recurrent terms, even if they are made up of more than one word (e.g.: managing director, position, salary, web developer).

I would need a library or an installable executable, rather than a web service.

I came across some complex tools (such as Topia's Term Extraction, MAUI) that require training. They are overcomplicated for my purpose, and I find them difficult to use.

I just need a piece of software that extracts the most recurrent terms in a text.

Thanks.


Comments (1)

夏尔 2024-09-13 13:46:53

Do you use Linux? I use these shell functions:

# copyright by Werner Rudolph <werner (at) artistoex (dot) net>
# copying and distributing of the following source code 
# is permitted, as long as this note is preserved.

# ftr CHAR1 CHAR2
# translate delimiter char in frequency list
#

ftr()
{
    sed -r 's/^( *[0-9]+)'"$1"'/\1'"$2"'/'
}


# valid-collocations -- find valid collocations in inputstream
# reads records COUNT<SPC>COLLOCATION from inputstream
# writes records with existing collocations to stdout. 

valid-collocations () 
{ 
    #sort -k 2 -m - "$coll" |uniq -f 1 -D|accumulate
    local delimiter="_"
    ftr ' ' $delimiter |
    join -t $delimiter -o 1.1 0 -1 2 -2 1 - /tmp/wordsets-helper-collocations |
    ftr $delimiter ' '

}

# ngrams MAX [MIN]
#   
#   Generates all n-grams (for each MIN <= n <= MAX, where MIN defaults to 2)
#   from inputstream
#
#   reads word list, as generated by 
# 
#     $  words < text 
#

#   from stdin.  For each WORD in wordlist, it writes MAX-1 records
#
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>
#                            : 
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>...<SPC>SUCC_MAX-2
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>...<SPC>SUCC_MAX-1
#
#   to stdout, where word SUCC follows word WORD, and SUCC_n follows
#   SUCC_n-1 in input stream COUNT times.

ngrams () 
{ 
    local max=$1
    local min=${2:-2};
    # prepend the previous record's first word to each record
    # (word list -> bigrams, bigrams -> trigrams, ...), then recurse
    awk 'FNR > 1 {print old " " $0} {old=$1}' | if (( $max - 1 > 1 )); then
        if (( $min <= 2 )); then
            tee >( ngrams $(( $max - 1 )) $(( $min - 1 )) );
        else
            ngrams $(( $max - 1 )) $(( $min - 1 ));
        fi;
    else
        cat;
    fi
}

# words [MINLEN]
# extract lower-case words of at least MINLEN letters (default 3; apostrophes
# allowed) from stdin, one per line; words containing capitals are dropped
#

words() {
    grep -Eo '\<([a-zA-Z]'"'"'?){'${1:-3}',}\>'|grep -v "[A-Z]"
}

# parse-collocations [FREQ] [LENGTH]
# read text from stdin and print the dictionary-checked terms of two up to
# LENGTH words (default 4) that occur more than FREQ times (default 0)
#

parse-collocations() {
    local freq=${1:-0}
    local length=${2:-4}

    words | ngrams $length | sort | uniq -c |
    awk '$1 > '"$freq"' { print $0; }' |
    valid-collocations
}
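
One way to make these functions available (my assumption; the answer doesn't spell it out) is to paste them into a file and source that file from bash:

$ source ./term-tools.sh   # term-tools.sh is a hypothetical file name
$ type parse-collocations  # confirm the function is now defined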

Where parse-collocations is the actual function to use. It accepts two optional parameters: the first sets a frequency threshold, so that terms recurring that many times or fewer are skipped from the result (it defaults to 0, i.e. all terms are considered). The second sets the maximum term length, in words, to search for (it defaults to 4). The function reads the text from stdin and prints the terms to stdout line by line. It requires a dictionary file at /tmp/wordsets-helper-collocations (download one here).
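
For instance, to make both optional parameters explicit (the values below are my own illustration, not from the original answer):

$ parse-collocations 1 3 < some-text

would keep only terms of at most three words that occur more than once.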

Usage example:

$ parse-collocations < some-text

would be pretty much what you want. However, if you don't want terms to be matched against a dictionary, you can use this one:

$ words < some-text | ngrams 4 3 | sort | uniq -c | sort -nr

ngrams's first parameter sets the maximum term length, whereas its second (optional) parameter sets the minimum term length (it defaults to 2); the command above therefore lists all terms of three or four words, most frequent first.
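
As a quick sanity check (my own toy input; it assumes the functions above are defined in the current bash session), a four-word list piped through ngrams 3 prints every two- and three-word sequence exactly once (modulo whitespace in the count column):

$ printf '%s\n' the quick brown fox | ngrams 3 | sort | uniq -c
      1 brown fox
      1 quick brown
      1 quick brown fox
      1 the quick
      1 the quick brown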
