Postgres 中字符串的词频?

发布于 2024-10-20 21:23:15 字数 50 浏览 5 评论 0原文

是否可以从 Postgres 中包含文本字符串的字段中识别不同的单词和每个单词的计数?

Is it possible to identify distinct words and a count for each, from fields containing text strings in Postgres?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

╭⌒浅淡时光〆 2024-10-27 21:23:15

像这样的东西吗?

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

那么获取不同的单词很容易:

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

或者获取每个单词的计数:

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word

Something like this?

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

Getting the distinct words is easy then:

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

or getting the count for each word:

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word
ぃ双果 2024-10-27 21:23:15

您还可以使用 PostgreSQL 文本搜索功能来实现此目的,例如:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

将产生:

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

(PostgreSQL 应用依赖于语言的词干提取和停用词删除,这可能是您想要的,也可能不是。停用词删除和词干提取可以通过使用 simple 而不是 english 字典来禁用,请参见下文。)

嵌套的 SELECT 语句可以是生成 tsvector 的任何 select 语句列,因此您可以替换一个函数,将 to_tsvector 函数应用于任意数量的文本字段,并将它们连接到文档的任何子集上的单个 tsvector 中,例如示例:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) from my_documents id < 500') ORDER BY nentry DESC;

将生成从前 500 个文档的 titlebody 字段中获取的总字数矩阵,按出现次数降序排列。对于每个单词,您还将获得该单词出现的文档数量(ndoc 列)。

有关更多详细信息,请参阅文档: http://www.postgresql.org/docs/当前/static/textsearch.html

You could also use the PostgreSQL text-searching functionality for this, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

will yield:

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

(PostgreSQL applies language-dependent stemming and stop-word removal, which could be what you want, or maybe not. Stop-word removal and stemming can be disabled by using the simple instead of the english dictionary, see below.)

The nested SELECT statement can be any select statement that yields a tsvector column, so you could substitute a function that applies the to_tsvector function to any number of text fields, and concatenates them into a single tsvector, over any subset of your documents, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) from my_documents id < 500') ORDER BY nentry DESC;

Would yield a matrix of total word counts taken from the title and body fields of the first 500 documents, sorted by descending number of occurrences. For each word, you'll also get the number of documents it occurs in (the ndoc column).

See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html

下雨或天晴 2024-10-27 21:23:15

单词之间应该用空格 ' ' 或其他分隔符号分隔;不使用“s”,除非有意这样做,例如,将“myWordshere”视为“myWord”和“here”。

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word

Should be split by a space ' ' or other delimit symbol between words; not by an 's', unless intended to do so, e.g., treating 'myWordshere' as 'myWord' and 'here'.

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文