What is a simple way to generate keywords from a text?

Posted 2024-07-11 13:17:23

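I suppose I could take a text and remove the high-frequency English words from it. By keywords, I mean the words that best characterize the content of the text (tags). It doesn't have to be perfect; a good approximation is enough for my needs.

Has anyone done something like this? Do you know of a Perl or Python library that does it?

Lingua::EN::Tagger is exactly what I asked for, but I need a library that also works for French text.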

人间☆小暴躁 2024-07-18 13:17:23

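The name for "high-frequency English words" is stop words, and many lists are available. I'm not aware of any Python or Perl libraries, but you could encode your stop word list in a binary tree or a hash (or use Python's frozenset). Then, as you read each word from the input text, check whether it is in your stop list and filter it out.

Note that after you remove the stop words, you'll need to do some stemming to normalize the resulting text (remove plurals, -ings, -eds), and then remove all the duplicate "keywords".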
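The three steps above (stop-word filtering, stemming, de-duplication) could be sketched in Python roughly as follows. This is only an illustration: the `STOP_WORDS` list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm.

```python
import re

# Illustrative stop list only; a real list would be far longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "are", "of",
                        "to", "in", "on", "for", "with"})

def crude_stem(word):
    """Strip a few common English suffixes (a rough stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def keywords(text):
    """Return unique stemmed non-stop-words, in order of first appearance."""
    seen = set()
    result = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in STOP_WORDS:
            continue
        stem = crude_stem(word)
        if stem not in seen:
            seen.add(stem)
            result.append(stem)
    return result

print(keywords("The dogs are chasing the dog and jumped on boxes"))
# prints ['dog', 'chas', 'jump', 'box']
```

The truncated stem "chas" shows why a naive suffix stripper is only an approximation; a proper stemmer handles such cases more carefully.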

带上头具痛哭 2024-07-18 13:17:23

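You could try the Perl module Lingua::EN::Tagger for a quick and easy solution.

A more complicated module, Lingua::EN::Semtags::Engine, uses Lingua::EN::Tagger together with a WordNet database to produce more structured output. Both are easy to use; just check the documentation on CPAN, or use perldoc after you install the module.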

他不在意 2024-07-18 13:17:23

To find the most frequently-used words in a text, do something like this:

#!/usr/bin/perl -w

use strict;
use warnings 'all';

# Read the text:
open my $ifh, '<', 'text.txt'
  or die "Cannot open file: $!";
local $/;
my $text = <$ifh>;

# Find all the words, and count how many times they appear:
my %words = ( );
map { $words{$_}++ }
  grep { length > 1 && $_ =~ m/^[\@a-z-']+$/i }
    map { s/[",\.]//g; $_ }
      split /\s/, $text;

print "Words, sorted by frequency:\n";
my (@data_line);
format FMT = 
@<<<<<<<<<<<<<<<<<<<<<<...     @########
@data_line
.
local $~ = 'FMT';

# Sort them by frequency:
map { @data_line = ($_, $words{$_}); write(); }
  sort { $words{$b} <=> $words{$a} }
    grep { $words{$_} > 2 }
      keys(%words);

Example output looks like this:

john@ubuntu-pc1:~/Desktop$ perl frequency.pl 
Words, sorted by frequency:
for                                   32
Jan                                   27
am                                    26
of                                    21
your                                  21
to                                    18
in                                    17
the                                   17
Get                                   13
you                                   13
OTRS                                  11
today                                 11
PSM                                   10
Card                                  10
me                                     9
on                                     9
and                                    9
Offline                                9
with                                   9
Invited                                9
Black                                  8
get                                    8
Web                                    7
Starred                                7
All                                    7
View                                   7
Obama                                  7
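For readers who prefer Python, a loosely equivalent sketch is shown below. It is not a line-for-line port: the function name `word_frequencies` and the sample text are made up for illustration, and unlike the Perl script it folds case, so "Get" and "get" are counted together.

```python
import re
from collections import Counter

def word_frequencies(text, min_count=3):
    """Count case-folded words of two or more characters, most frequent first."""
    # Same character set as the Perl regex: letters, '@', apostrophe, hyphen.
    words = re.findall(r"[a-z@'-]{2,}", text.lower())
    counts = Counter(words)
    # Keep words seen at least min_count times (the Perl script uses "> 2").
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

sample = "Get the card. get the card today. Get it, get THE card!"
for word, count in word_frequencies(sample):
    print(f"{word:<24}{count:>8}")
```

Note that the output still contains stop words such as "the", just like the Perl output above, so a raw frequency count needs a stop-word filter before it yields useful keywords.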

扶醉桌前 2024-07-18 13:17:23

In Perl, there's Lingua::EN::Keywords.

权谋诡计 2024-07-18 13:17:23

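I think the most accurate approach that still stays simple is to count the word frequencies in your source, then weight them according to their frequencies in common English (or whatever other language) usage.

Words that appear less frequently in common use, like "coffeehouse", are more likely to be keywords than words that appear more often, like "dog". Still, if your source mentions "dog" 500 times and "coffeehouse" twice, "dog" is more likely to be a keyword, even though it is a common word.

Deciding on the weighting scheme would be the difficult part.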
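One possible weighting, offered purely as an assumption since the answer leaves the scheme open, is to multiply each word's count in the source by the logarithm of its rarity in general usage. The `BACKGROUND_FREQ` values below are invented numbers for illustration, not real corpus statistics.

```python
import math
from collections import Counter

# Hypothetical relative frequencies in general English (made-up numbers).
BACKGROUND_FREQ = {"dog": 1e-4, "coffeehouse": 1e-7, "the": 5e-2}
DEFAULT_FREQ = 1e-6  # assumed frequency for words missing from the table

def weighted_scores(words):
    """Score each word by its count times the log of its rarity in general usage."""
    counts = Counter(words)
    return {
        w: n * math.log(1.0 / BACKGROUND_FREQ.get(w, DEFAULT_FREQ))
        for w, n in counts.items()
    }

# The scenario from the text: "dog" 500 times, "coffeehouse" twice.
words = ["dog"] * 500 + ["coffeehouse"] * 2
scores = weighted_scores(words)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 1))
```

With these assumed numbers, "dog" outranks "coffeehouse" when it appears 500 times to 2, but "coffeehouse" outranks "dog" when both appear equally often, which matches the behaviour described above.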

爱的十字路口 2024-07-18 13:17:23

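TF-IDF (term frequency-inverse document frequency) is designed for exactly this.

Basically it asks: which words are frequent in this document, compared to all documents?

It gives a lower score to words that appear in all documents, and a higher score to words that appear frequently in a given document.

You can see a worksheet of the calculations here:

https://docs.google.com/spreadsheet/ccc?key=0AreO9JhY28gcdFMtUFJrc0dRdkpiUWlhNHVGS1h5Y2c&usp=sharing

(switch to the TFIDF tab at the bottom)

Here is a Python library: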

https://github.com/hrs/python-tf-idf
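To make the calculation concrete, here is a minimal plain-Python sketch of one common TF-IDF variant, tf × log(N / df). It does not use or reproduce the API of the library linked above; the function name `tf_idf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            w: (count / len(tokens)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        })
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
for doc_scores in tf_idf(docs):
    top = max(doc_scores, key=doc_scores.get)
    print(top, round(doc_scores[top], 3))
```

Because "the" occurs in every sample document, its score is log(3/3) = 0, so it never surfaces as a keyword even though it is the most frequent word.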

二智少女 2024-07-18 13:17:23

The simplest way to do what you want is this...

>>> text = "this is some of the sample text"
>>> # Keep unique words longer than three characters; sort for a stable order.
>>> words = sorted(word for word in set(text.split()) if len(word) > 3)
>>> words
['sample', 'some', 'text', 'this']

I don't know of any standard module that does this, but it wouldn't be hard to replace the limit on three letter words with a lookup into a set of common English words.
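A sketch of that replacement might look like the following, with one stop-word set per language so the same function also covers the French requirement from the question. The `STOP_WORDS` entries are a tiny illustrative sample, not a complete list, and the `lang` parameter is an assumption of this example.

```python
# Tiny illustrative stop lists; real lists contain hundreds of entries per language.
STOP_WORDS = {
    "en": frozenset({"this", "is", "some", "of", "the", "a", "and", "to", "in"}),
    "fr": frozenset({"le", "la", "les", "un", "une", "de", "des", "et", "est", "dans"}),
}

def keywords(text, lang="en"):
    """Return the unique words of text that are not stop words, in first-seen order."""
    stop = STOP_WORDS[lang]
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique = dict.fromkeys(text.lower().split())
    return [word for word in unique if word not in stop]

print(keywords("this is some of the sample text"))
# prints ['sample', 'text']
print(keywords("le chat est dans la maison", lang="fr"))
# prints ['chat', 'maison']
```

Compared with the length-based filter, the lookup drops "this" and "some" while still allowing short but meaningful words through.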

还如梦归 2024-07-18 13:17:23

A one-liner solution (words longer than two characters that occur more than twice):

perl -ne'$h{$1}++while m/\b(\w{3,})\b/g}{printf"%-20s %5d\n",$_,$h{$_}for sort{$h{$b}<=>$h{$a}}grep{$h{$_}>2}keys%h'

EDIT: To sort words with the same frequency alphabetically, use this enhanced version:

perl -ne'$h{$1}++while m/\b(\w{3,})\b/g}{printf"%-20s %5d\n",$_,$h{$_}for sort{$h{$b}<=>$h{$a}or$a cmp$b}grep{$h{$_}>2}keys%h'

About the author

一曲琵琶半遮面シ