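How do I extract the nth word from a string in MySQL and count word occurrences?
I would like to have a MySQL query like this:
select <second word in text> word, count(*) from table group by word;
All the regex examples in MySQL are used to query whether the text matches the expression, but not to extract the text that matched. Is there such a syntax?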
The following is a proposed solution for the OP's specific problem (extracting the second word of a string). Note that, as mc0e's answer states, extracting regex matches is not supported out of the box in MySQL. If you really need this, your choices are basically to 1) do it in post-processing on the client, or 2) install a MySQL extension to support it.
BenWells has it almost correct. Working from his code, here's a slightly adjusted version:
As a working example, I used:
This successfully extracts the word
IS
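A minimal sketch of how such an expression can be put together with LOCATE and SUBSTRING is shown here. It is an illustration under assumptions rather than the exact code from this answer: the derived table alias `temp`, the column name `sentence`, and the sample string are made up for the example, and words are assumed to be separated by single spaces.

```sql
-- Sketch: extract the second word of a space-separated string.
SELECT SUBSTRING(
         sentence,
         -- start just after the first space
         LOCATE(' ', sentence) + 1,
         -- length = position of the second space minus the start position
         LOCATE(' ', sentence, LOCATE(' ', sentence) + 1)
           - (LOCATE(' ', sentence) + 1)
       ) AS second_word
FROM (SELECT 'THIS IS A TEST' AS sentence) AS temp;  -- hypothetical sample data
```

For the sample string the first space is at position 5 and the second at position 8, so the call reduces to SUBSTRING(sentence, 6, 2).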
Shorter option to extract the second word in a sentence:
MySQL docs for SUBSTRING_INDEX
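As a sketch of the shorter option: the inner SUBSTRING_INDEX call keeps everything before the second space, and the outer call keeps what follows the last space of that result. The table name `my_table` and column name `text_col` in the second query are placeholders, and single-space separators are assumed.

```sql
-- Second word of a literal string.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('THIS IS A TEST', ' ', 2), ' ', -1) AS word;

-- The same expression applied to the grouped query from the question.
-- my_table and text_col are placeholder names.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(text_col, ' ', 2), ' ', -1) AS word,
       COUNT(*) AS occurrences
FROM my_table
GROUP BY word;
```

When a value contains only one word, this expression returns that word, so such rows may need to be filtered out separately.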
According to http://dev.mysql.com/, the SUBSTRING function takes the start position and then the length, so surely the function for the second word would be:
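As a minimal illustration of that argument order (a sketch using a literal sample string, not the expression originally posted in this answer): SUBSTRING(str, pos, len) counts positions from 1 and returns len characters starting at pos.

```sql
-- Position 6 is the first character after the first space; 2 is the word length.
SELECT SUBSTRING('THIS IS A TEST', 6, 2) AS second_word;

-- LOCATE returns the position of the space itself, hence the + 1 for the start.
SELECT LOCATE(' ', 'THIS IS A TEST') + 1 AS start_position;
```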
No, there isn't a syntax for extracting text using regular expressions. You have to use the ordinary string manipulation functions.
Alternatively, select the entire value from the database (or the first n characters if you are worried about too much data transfer) and then use a regular expression on the client.
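This holds for older MySQL releases. MySQL 8.0 added a REGEXP_SUBSTR function, so on those versions a sketch like the following may be enough; the sample string and the pattern `[^ ]+` (a run of non-space characters) are chosen for illustration only.

```sql
-- Requires MySQL 8.0 or later.
-- Arguments: subject, pattern, start position, occurrence.
SELECT REGEXP_SUBSTR('THIS IS A TEST', '[^ ]+', 1, 2) AS second_word;
```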
As others have said, MySQL does not provide regex tools for extracting substrings. That's not to say you can't have them, though, if you're prepared to extend MySQL with user-defined functions:
https://github.com/mysqludf/lib_mysqludf_preg
That may not be much help if you want to distribute your software, since it adds an obstacle to installation, but for an in-house solution it may be appropriate.
I used Brendan Bullen's answer as a starting point for a similar issue I had, which was to retrieve the value of a specific field in a JSON string. However, as I commented on his answer, it is not entirely accurate. If your left boundary isn't just a space as in the original question, the discrepancy increases.
Corrected solution:
The two differences are the +1 in the SUBSTRING index parameter and the -1 in the length parameter.
For a more general solution to "find the first occurrence of a string between two provided boundaries":
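One way such a general expression could look is sketched here. It is an assumption-laden illustration, not the answer's original code: `haystack`, `left_b`, `right_b`, the alias `sample`, and the JSON-like sample value are all hypothetical.

```sql
-- Sketch: first substring found between a left and a right boundary.
-- haystack, left_b and right_b are illustrative names with sample values.
SELECT SUBSTRING(
         haystack,
         -- start right after the end of the left boundary
         LOCATE(left_b, haystack) + CHAR_LENGTH(left_b),
         -- length = position of the next right boundary minus the start
         LOCATE(right_b, haystack, LOCATE(left_b, haystack) + CHAR_LENGTH(left_b))
           - (LOCATE(left_b, haystack) + CHAR_LENGTH(left_b))
       ) AS between_boundaries
FROM (
  SELECT '{"name":"alice","age":"30"}' AS haystack,
         '"name":"' AS left_b,
         '"' AS right_b
) AS sample;
```

If the left boundary does not occur at all, LOCATE returns 0 and the result is not meaningful, so a real query would need to guard against that case.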
I don't think such a thing is possible. You can use the SUBSTRING function to extract the part you want.
My home-grown regular expression replace function can be used for this.
Demo
See this DB-Fiddle demo, which returns the second word ("I") from a famous sonnet and the number of occurrences of it (1).
SQL
Assuming MySQL 8 or later is being used (to allow use of a Common Table Expression), the following will return the second word and the number of occurrences of it:
Explanation
A few tricks are used in the SQL above, and some attribution is needed. First, the regular expression replacer is used to replace all contiguous blocks of non-word characters, each block being replaced by a single tilde (~) character. Note: a different character could be chosen instead if there is any possibility of a tilde appearing in the text. The technique from this answer is then used to transform a string with delimited values into separate row values. It is combined with the clever technique from this answer for generating a table consisting of a sequence of incrementing numbers: 0 - 10,000 in this case.
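The approach described above can be sketched as follows. This is a hedged approximation, not the author's SQL: it swaps the home-grown replace function for the built-in REGEXP_REPLACE of MySQL 8.0, generates the number sequence with a recursive CTE capped at 100 instead of a 0 - 10,000 table, and uses a made-up sample sentence. All CTE and column names are illustrative.

```sql
-- Requires MySQL 8.0 or later (CTEs and REGEXP_REPLACE).
WITH RECURSIVE
  src AS (
    -- hypothetical sample text
    SELECT 'Shall I compare thee to a summer''s day?' AS txt
  ),
  cleaned AS (
    -- collapse each run of non-alphanumeric characters into one tilde,
    -- then drop any leading or trailing tilde
    SELECT TRIM(BOTH '~' FROM REGEXP_REPLACE(txt, '[^A-Za-z0-9]+', '~')) AS txt
    FROM src
  ),
  numbers AS (
    -- 1, 2, ..., 100: an upper bound on the number of words handled
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 100
  ),
  words AS (
    -- one row per word: the n-th tilde-delimited piece of the cleaned text
    SELECT numbers.n AS idx,
           SUBSTRING_INDEX(SUBSTRING_INDEX(cleaned.txt, '~', numbers.n), '~', -1) AS word
    FROM cleaned
    JOIN numbers
      ON numbers.n <= CHAR_LENGTH(cleaned.txt)
                      - CHAR_LENGTH(REPLACE(cleaned.txt, '~', '')) + 1
  )
-- count how often the second word occurs among all words
SELECT word, COUNT(*) AS occurrences
FROM words
WHERE word = (SELECT word FROM words WHERE idx = 2)
GROUP BY word;
```

Because the pattern treats the apostrophe as a separator, a word such as "summer's" is split into two pieces; a different character class may suit other texts better.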
The field's value is:
Result is: