Simple tool to find the most common terms in a text

Posted on 2024-09-06 13:46:53


I have a text and I would like to extract the most recurrent terms, even if they are made up of more than one word (e.g.: managing director, position, salary, web developer).

I would need a library or an installable executable, rather than a web service.

I came across some complex tools (such as Topia's Term Extraction, MAUI) that require training. They are overcomplicated for my purpose, and I find them difficult to use.

I just need a piece of software that extracts the most recurrent terms in a text.

Thanks.


Comments (1)

夏尔 2024-09-13 13:46:53

Do you use Linux? I use these shell functions:

# copyright by Werner Rudolph <werner (at) artistoex (dot) net>
# copying and distributing of the following source code 
# is permitted, as long as this note is preserved.

# ftr CHAR1 CHAR2
# translate delimiter char in frequency list
#

ftr()
{
    sed -r 's/^( *[0-9]+)'"$1"'/\1'"$2"'/'
}


# valid-collocations -- find valid collocations in inputstream
# reads records COUNT<SPC>COLLOCATION from inputstream
# writes records with existing collocations to stdout. 

valid-collocations () 
{ 
    #sort -k 2 -m - "$coll" |uniq -f 1 -D|accumulate
    local delimiter="_"
    ftr ' ' $delimiter |
    join -t $delimiter -o 1.1 0 -1 2 -2 1 - /tmp/wordsets-helper-collocations |
    ftr $delimiter ' '

}

# ngrams MAX [MIN]
#   
#   Generates all n-grams (for each MIN <= n <= MAX, where MIN defaults to 2)
#   from inputstream
#
#   reads word list, as generated by 
# 
#     $  words < text 
#

#   from stdin.  For each WORD in wordlist, it writes MAX-1 records
#
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>
#                            : 
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>...<SPC>SUCC_MAX-2
#   COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>...<SPC>SUCC_MAX-1
#
#   to stdout, where word SUCC follows word WORD, and SUCC_n follows
#   SUCC_n-1 in input stream COUNT times.

ngrams () 
{ 
    local max=$1
    local min=${2:-2};
    # prepend the previous record's first word to each record
    # (word list -> bigrams, bigrams -> trigrams, ...), then recurse
    awk 'FNR > 1 {print old " " $0} {old=$1}' | if (( $max - 1 > 1 )); then
        if (( $min <= 2 )); then
            tee >( ngrams $(( $max - 1 )) $(( $min - 1 )) );
        else
            ngrams $(( $max - 1 )) $(( $min - 1 ));
        fi;
    else
        cat;
    fi
}

# words [MINLEN]
# extract lower-case words of at least MINLEN letters (default 3; apostrophes
# allowed) from stdin, one per line; words containing capitals are dropped
#

words() {
    grep -Eo '\<([a-zA-Z]'"'"'?){'${1:-3}',}\>'|grep -v "[A-Z]"
}

# parse-collocations [FREQ] [LENGTH]
# read text from stdin and print the dictionary-checked terms of two up to
# LENGTH words (default 4) that occur more than FREQ times (default 0)
#

parse-collocations() {
    local freq=${1:-0}
    local length=${2:-4}

    words | ngrams $length | sort | uniq -c |
    awk '$1 > '"$freq"' { print $0; }' |
    valid-collocations
}
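
One way to make these functions available (my assumption; the answer doesn't spell it out) is to paste them into a file and source that file from bash:

$ source ./term-tools.sh   # term-tools.sh is a hypothetical file name
$ type parse-collocations  # confirm the function is now defined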

Where parse-collocations is the actual function to use. It accepts two optional parameters: the first sets a frequency threshold, so that terms recurring that many times or fewer are skipped from the result (it defaults to 0, i.e. all terms are considered). The second sets the maximum term length, in words, to search for (it defaults to 4). The function reads the text from stdin and prints the terms to stdout line by line. It requires a dictionary file at /tmp/wordsets-helper-collocations (download one here).
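
For instance, to make both optional parameters explicit (the values below are my own illustration, not from the original answer):

$ parse-collocations 1 3 < some-text

would keep only terms of at most three words that occur more than once.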

Usage example:

$ parse-collocations < some-text

would be pretty much what you want. However, if you don't want terms to be matched against a dictionary, you can use this one:

$ words < some-text | ngrams 4 3 | sort | uniq -c | sort -nr

ngrams's first parameter sets the maximum term length, whereas its second (optional) parameter sets the minimum term length (it defaults to 2); the command above therefore lists all terms of three or four words, most frequent first.
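
As a quick sanity check (my own toy input; it assumes the functions above are defined in the current bash session), a four-word list piped through ngrams 3 prints every two- and three-word sequence exactly once (modulo whitespace in the count column):

$ printf '%s\n' the quick brown fox | ngrams 3 | sort | uniq -c
      1 brown fox
      1 quick brown
      1 quick brown fox
      1 the quick
      1 the quick brown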
