当前位置：文江博客话题详情

Postgres 中字符串的词频？

发布于 2024-10-20 21:23:15 字数 50 浏览 5 评论 0原文

是否可以从 Postgres 中包含文本字符串的字段中识别不同的单词和每个单词的计数？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

╭⌒浅淡时光〆 2024-10-27 21:23:15

像这样的东西吗？

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

那么获取不同的单词很容易：

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

或者获取每个单词的计数：

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word

Something like this?

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

Getting the distinct words is easy then:

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

or getting the count for each word:

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word

回复收藏 0 原文

ぃ双果 2024-10-27 21:23:15

您还可以使用 PostgreSQL 文本搜索功能来实现此目的，例如：

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

将产生：

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

（PostgreSQL 应用依赖于语言的词干提取和停用词删除，这可能是您想要的，也可能不是。停用词删除和词干提取可以通过使用 simple 而不是 english 字典来禁用，请参见下文。）

嵌套的 SELECT 语句可以是生成 tsvector 的任何 select 语句列，因此您可以替换一个函数，将 to_tsvector 函数应用于任意数量的文本字段，并将它们连接到文档的任何子集上的单个 tsvector 中，例如示例：

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) from my_documents id < 500') ORDER BY nentry DESC;

将生成从前 500 个文档的 title 和 body 字段中获取的总字数矩阵，按出现次数降序排列。对于每个单词，您还将获得该单词出现的文档数量（ndoc 列）。

有关更多详细信息，请参阅文档： http://www.postgresql.org/docs/当前/static/textsearch.html

You could also use the PostgreSQL text-searching functionality for this, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

will yield:

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

(PostgreSQL applies language-dependent stemming and stop-word removal, which could be what you want, or maybe not. Stop-word removal and stemming can be disabled by using the simple instead of the english dictionary, see below.)

The nested SELECT statement can be any select statement that yields a tsvector column, so you could substitute a function that applies the to_tsvector function to any number of text fields, and concatenates them into a single tsvector, over any subset of your documents, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) from my_documents id < 500') ORDER BY nentry DESC;

Would yield a matrix of total word counts taken from the title and body fields of the first 500 documents, sorted by descending number of occurrences. For each word, you'll also get the number of documents it occurs in (the ndoc column).

See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html

回复收藏 0 原文

下雨或天晴 2024-10-27 21:23:15

单词之间应该用空格 ' ' 或其他分隔符号分隔；不使用“s”，除非有意这样做，例如，将“myWordshere”视为“myWord”和“here”。

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word

Should be split by a space ' ' or other delimit symbol between words; not by an 's', unless intended to do so, e.g., treating 'myWordshere' as 'myWord' and 'here'.

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word

回复收藏 0 原文

~没有更多了~

关于作者

温馨耳语

暂无简介

0 文章

0 评论

24 人气

关注发私信

友情链接

文江博客

Postgres 中字符串的词频？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

missyouangeled

三生一梦

压抑⊿情绪

天涯离梦残月幽梦

指尖微凉心微凉

☆獨立☆

友情链接

Postgres 中字符串的词频？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

missyouangeled

三生一梦

压抑⊿情绪

天涯离梦残月幽梦

指尖微凉心微凉

☆獨立☆

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。