Word frequencies from strings in Postgres?
Is it possible to identify distinct words and a count for each, from fields containing text strings in Postgres?
Something like this?
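A minimal sketch of this idea, assuming the built-in `regexp_split_to_table` function and a made-up data set (the `docs` CTE and its column names are hypothetical stand-ins for your own table):

```sql
-- Hypothetical sample rows; replace the CTE with your own table and text column.
WITH docs(id, body) AS (
    VALUES (1, 'the quick brown fox'::text),
           (2, 'the lazy dog'::text)
)
-- Emit one row per word by splitting each string on runs of whitespace.
SELECT id, regexp_split_to_table(body, '\s+') AS word
FROM docs;
```

The pattern `'\s+'` relies on `standard_conforming_strings` being on, which is the default in current PostgreSQL versions.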
Getting the distinct words is easy then:
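For instance, under the same assumptions (a hypothetical `docs` data set, whitespace as the only word separator), the distinct words could be selected like this:

```sql
-- Hypothetical sample rows; replace the first CTE with your own table.
WITH docs(id, body) AS (
    VALUES (1, 'the quick brown fox'::text),
           (2, 'the lazy dog'::text)
),
words AS (
    -- One row per word, lower-cased so 'The' and 'the' count as the same word.
    SELECT regexp_split_to_table(lower(body), '\s+') AS word
    FROM docs
)
SELECT DISTINCT word
FROM words
ORDER BY word;
```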
or getting the count for each word:
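A sketch of the counting variant, again with a hypothetical `docs` data set standing in for the real table:

```sql
-- Hypothetical sample rows; replace the first CTE with your own table.
WITH docs(id, body) AS (
    VALUES (1, 'the quick brown fox'::text),
           (2, 'the lazy dog'::text)
),
words AS (
    -- One row per word occurrence.
    SELECT regexp_split_to_table(lower(body), '\s+') AS word
    FROM docs
)
-- Group identical words and count how often each one occurs.
SELECT word, count(*) AS occurrences
FROM words
GROUP BY word
ORDER BY occurrences DESC, word;
```

Leading or trailing whitespace in a string produces empty words with this approach, so real data may need a `WHERE word <> ''` filter.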
You could also use the PostgreSQL text-searching functionality for this, for example:
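A minimal sketch of such a query, assuming the built-in `ts_stat` function and a made-up sample sentence (both are illustrative choices, not necessarily what the answer originally used):

```sql
-- ts_stat executes the quoted query, which must return a single tsvector
-- column, and reports statistics for every distinct lexeme it contains.
-- The sample sentence is made up for illustration.
SELECT word, ndoc, nentry
FROM ts_stat($$
    SELECT to_tsvector('english', 'hello world hello hello ridiculous')
$$)
ORDER BY nentry DESC, word;
```

Here `nentry` is the total number of occurrences of a word and `ndoc` is the number of documents (rows of the inner query) that contain it.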
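This yields one row per distinct word (after stemming), along with its document count and its total number of occurrences.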
(PostgreSQL applies language-dependent stemming and stop-word removal, which may or may not be what you want. Both can be disabled by using the `simple` configuration instead of the `english` one, see below.)

The nested `SELECT` statement can be any select statement that yields a `tsvector` column, so you could substitute a query that applies the `to_tsvector` function to any number of text fields and concatenates them into a single `tsvector`, over any subset of your documents. For example, doing this for the `title` and `body` fields of the first 500 documents would yield a matrix of total word counts, sorted by descending number of occurrences. For each word, you also get the number of documents it occurs in (the `ndoc` column).

See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html
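The 500-document case described above might be sketched as follows. The table `my_documents` and its columns are hypothetical, and the `simple` configuration is used here to skip stemming and stop-word removal; swap in `english` to get them back.

```sql
-- Hypothetical table and rows, created only so the example can be run as-is.
CREATE TEMP TABLE my_documents (
    id    serial PRIMARY KEY,
    title text,
    body  text
);

INSERT INTO my_documents (title, body) VALUES
    ('Word counts', 'Counting words in Postgres'),
    ('More words',  'Words words words');

-- Concatenate the tsvectors of both fields for the first 500 documents,
-- then let ts_stat aggregate the statistics per word.
SELECT word, ndoc, nentry
FROM ts_stat($$
    SELECT to_tsvector('simple', coalesce(title, '')) ||
           to_tsvector('simple', coalesce(body, ''))
    FROM my_documents
    ORDER BY id
    LIMIT 500
$$)
ORDER BY nentry DESC, ndoc DESC, word;
```

The `coalesce` calls guard against `NULL` fields, which would otherwise turn the whole concatenated `tsvector` for that row into `NULL`.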
The text should be split on a space ' ' or another delimiter between words, not on the letter 's', unless that is intended, e.g., treating 'myWordshere' as 'myWord' and 'here'.
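The difference is easy to see by running both patterns against the same string. A small sketch, using a made-up input that extends the example above:

```sql
-- Splitting on the literal letter 's' cuts words apart.
SELECT regexp_split_to_table('myWordshere today', 's') AS piece;
-- pieces: 'myWord', 'here today'

-- Splitting on runs of whitespace keeps each word intact.
SELECT regexp_split_to_table('myWordshere today', '\s+') AS piece;
-- pieces: 'myWordshere', 'today'
```

If `standard_conforming_strings` is off, the backslash in `'\s+'` is consumed as a string escape and the pattern degrades to `s+`; writing it as `E'\\s+'` avoids that.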