Get the first N words of a string
How do I only get the first 10 words from a string?
To add support for other word breaks such as commas and dashes, preg_match offers a quick way that doesn't require splitting the string. As Pebbl mentions, PHP doesn't handle UTF-8 or Unicode all that well, so if that is a concern you can replace \w with [^\s,\.;\?\!] and \W with [\s,\.;\?\!].
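
The original snippet for this answer did not survive on this page, so the following is only a minimal sketch of the preg_match idea it describes. The function name firstWordsByRegex and the exact pattern are assumptions, not the answerer's code.

```php
<?php
// Hypothetical helper: returns the leading part of $text that contains
// at most $limit words, keeping the original separators between them.
function firstWordsByRegex(string $text, int $limit = 10): string
{
    if ($limit < 1) {
        return '';
    }
    // \w+ is one word; \W+ is the gap (spaces, commas, dashes, ...) before the next.
    $pattern = '/^\W*\w+(?:\W+\w+){0,' . ($limit - 1) . '}/';
    if (preg_match($pattern, $text, $matches) === 1) {
        return $matches[0];
    }
    return ''; // no word characters found
}

echo firstWordsByRegex('one, two - three; four five', 3), PHP_EOL; // one, two - three
```

Because the match is taken straight from the input, punctuation between the kept words is preserved and no array of words is built.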
Simply splitting on spaces will function incorrectly if there is an unexpected character in place of a space in the sentence structure, or if the sentence contains multiple consecutive spaces.
A version that splits on the gaps between words works no matter what kind of "space" you use between words and can easily be extended to handle other characters; as written it supports any whitespace character plus , . ; ? !
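
The code for this version is missing from the page. A rough sketch of splitting on the gaps between words, assuming a hypothetical helper named firstWordsBySplit, might look like this:

```php
<?php
// Hypothetical helper: splits on runs of whitespace or , . ; ? !
// and returns the first $limit words joined by single spaces.
function firstWordsBySplit(string $text, int $limit = 10): string
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by leading
    // or trailing separators.
    $words = preg_split('/[\s,.;?!]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return ''; // regex error
    }
    return implode(' ', array_slice($words, 0, max(0, $limit)));
}

echo firstWordsBySplit("one,two;  three\tfour! five", 3), PHP_EOL; // one two three
```

Note that this variant discards the original separators, so punctuation between words is lost in the result. To treat more characters as word breaks, add them to the character class.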
Regular expressions are perfect for this issue, because you can easily make the code as flexible or strict as you like. You do have to be careful, however. I specifically targeted the gaps between words, rather than the words themselves, because it is rather difficult to state unequivocally what defines a word.
Take the \w word-character class, or its inverse \W. I rarely rely on these, mainly because, depending on the software you are using (like certain versions of PHP), they don't always include UTF-8 or Unicode characters. In regular expressions it is better to be specific at all times, so that your expressions can handle input containing such characters no matter where it is rendered.
Avoiding splitting could be worthwhile in terms of performance, however. So you could use Kelly's updated approach but switch \w for [^\s,\.;\?\!]+ and \W for [\s,\.;\?\!]+. Personally, though, I like the simplicity of the splitting expression used above; it is easier to read and therefore to modify. The stack of PHP functions, however, is a bit ugly :)
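
As an illustration of that substitution, here is a sketch that matches the first words with the explicit character classes instead of \w and \W. The helper name firstWordsExplicit is hypothetical, and the /u modifier is an addition of this sketch (it makes PCRE treat the input as UTF-8); it is not stated in the answer.

```php
<?php
// Hypothetical helper using explicit "word" and "gap" classes.
function firstWordsExplicit(string $text, int $limit = 10): string
{
    if ($limit < 1) {
        return '';
    }
    $word = '[^\s,\.;\?\!]+'; // anything that is not a separator
    $gap  = '[\s,\.;\?\!]+';  // whitespace or , . ; ? !
    $pattern = '/^(?:' . $gap . ')?' . $word
             . '(?:' . $gap . $word . '){0,' . ($limit - 1) . '}/u';
    // preg_match returns false for invalid UTF-8 input when /u is used.
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : '';
}

echo firstWordsExplicit('naïve café, déjà vu; encore une fois', 3), PHP_EOL;
// naïve café, déjà
```

Since a word is defined as "anything that is not a separator", accented and non-Latin characters need no special handling.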
http://snipplr.com/view/8480/a-php-function-to-return-the-first-n-words-from-a-string/
I suggest using str_word_count, then using a loop to get the words you want.
Source: http://php.net/str_word_count
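
The example and its output were lost from this answer. A small sketch of the idea, with a hypothetical function name, could be:

```php
<?php
// Hypothetical helper: collect the first $limit words reported by str_word_count.
function firstWordsCounted(string $text, int $limit = 10): string
{
    // Format 1 makes str_word_count return a list of the words it found.
    $words = str_word_count($text, 1);
    $picked = [];
    foreach ($words as $i => $word) {
        if ($i >= $limit) {
            break;
        }
        $picked[] = $word;
    }
    return implode(' ', $picked);
}

echo firstWordsCounted("Hello fri3nd, you're looking good today!", 4), PHP_EOL;
// Hello fri nd you're
```

The output shows a caveat: by default str_word_count treats only letters, apostrophes and hyphens as word characters, so "fri3nd" is split at the digit. It is also locale-dependent and not multibyte-aware.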
To select the first 10 words of a given text, you can implement the following function:
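
The function itself is not preserved here. One plausible shape for it, offered only as an assumption, is the common explode / array_slice / implode combination:

```php
<?php
// Hypothetical reconstruction; the name limitWords is made up for this sketch.
function limitWords(string $text, int $limit = 10): string
{
    $words = explode(' ', $text);                  // split on single spaces
    $words = array_slice($words, 0, max(0, $limit)); // keep the first $limit pieces
    return implode(' ', $words);                   // glue them back together
}

echo limitWords('The quick brown fox jumps over the lazy dog', 4), PHP_EOL;
// The quick brown fox
```

This is the simplest option, but consecutive spaces produce empty "words" that count toward the limit.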
This can easily be done using str_word_count().
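
The answer gives no code, so this is just one way it might be done. The sketch uses format 2 of str_word_count, which keys each word by its byte offset in the string, to cut the original text where the next word would start. The name truncateAtWord is an assumption.

```php
<?php
// Hypothetical helper: keep the original text up to the start of word $limit + 1.
function truncateAtWord(string $text, int $limit = 10): string
{
    if ($limit < 1) {
        return '';
    }
    // Format 2 returns [offset => word]; only the offsets are needed here.
    $offsets = array_keys(str_word_count($text, 2));
    if (count($offsets) <= $limit) {
        return $text; // nothing to cut
    }
    return rtrim(substr($text, 0, $offsets[$limit]));
}

echo truncateAtWord('The quick brown fox jumps', 3), PHP_EOL; // The quick brown
```

Unlike rebuilding the string from a word list, this keeps the original spacing and punctuation between the kept words, though punctuation directly after the last kept word stays too.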
This might help you: a function that returns the first N words.
Try this.
I know it is late to answer, but let newcomers choose their own answers.
Use it like this:
Output:
Lorem ipsum dolor sit amet
This function also works very well with Unicode characters such as Arabic.
Output:
نموذج لنص عربي الغرض منه توضيح كيف يمكن استخلاص أول عدد معين من الكلمات الموجودة فى نص معين.
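
The function this answer refers to is not shown on the page. As a hedged illustration of Unicode-aware word extraction, the sketch below matches words with Unicode property classes under the /u modifier. Both the name firstWordsUnicode and the choice of pattern are assumptions.

```php
<?php
// Hypothetical helper: words are runs of Unicode letters, marks and digits.
function firstWordsUnicode(string $text, int $limit = 10): string
{
    if ($limit < 1) {
        return '';
    }
    // /u makes PCRE read pattern and subject as UTF-8;
    // \p{L}, \p{M}, \p{N} are letters, combining marks and numbers.
    if (preg_match_all('/[\p{L}\p{M}\p{N}]+/u', $text, $matches) === false) {
        return ''; // e.g. invalid UTF-8 input
    }
    return implode(' ', array_slice($matches[0], 0, $limit));
}

echo firstWordsUnicode('Lorem ipsum dolor sit amet, consectetur', 5), PHP_EOL;
// Lorem ipsum dolor sit amet
echo firstWordsUnicode('نموذج لنص عربي الغرض منه توضيح', 3), PHP_EOL;
// نموذج لنص عربي
```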
This is exactly what we were looking for.
Just cut and paste it into your program and run it, then call the function from your own block of code.
I do it this way.
It's UTF-8 compatible...
This might help you: a function that returns the first 10 words.
Instead of generating an array of N words, truncating the array, and then re-imploding the words, just truncate the input string after the Nth word.
The pattern searches for N sequences of zero or more whitespace characters followed by one or more non-whitespace characters; \K then restarts the full-string match (effectively "releasing" the matched characters), and .* matches the rest of the string. Whatever is matched is replaced with an empty string.
This solution ensures that the output string has no more than N words. The string may have fewer than N words, in which case no mutation takes place, and if that string has trailing whitespace, the whitespace is not removed.
To ensure that leading and trailing whitespace is removed, adjust the pattern to capture zero to N words delimited by whitespace.
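
The linked demos are not available here, so the two patterns below are reconstructions from the description above, not the answerer's exact code. The function names are hypothetical, and the possessive quantifiers (*+ and ++) are an addition of this sketch to stop the regex engine from splitting a word to reach N sequences.

```php
<?php
// Variant 1: cut everything after the Nth word, leaving the kept part untouched.
function truncateAfterNthWord(string $text, int $n): string
{
    $n = max(0, $n);
    // (?:\s*+\S++){n} consumes n words, each with any whitespace before it;
    // \K drops them from the reported match; .* (with /s) is the remainder.
    $pattern = '/^(?:\s*+\S++){' . $n . '}\K.*/s';
    return preg_replace($pattern, '', $text) ?? $text;
}

// Variant 2: match up to N whitespace-delimited words, which also drops
// leading and trailing whitespace.
function firstWordsTrimmed(string $text, int $n): string
{
    if ($n < 1) {
        return '';
    }
    $pattern = '/\S++(?:\s++\S++){0,' . ($n - 1) . '}/';
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : '';
}

echo truncateAfterNthWord('one two three four', 2), PHP_EOL; // one two
echo firstWordsTrimmed("  one  two three ", 2), PHP_EOL;     // one  two
```

In the first variant a string with fewer than N words does not match at all and is returned unchanged, trailing whitespace included. The second variant keeps the whitespace between words exactly as it appeared in the input.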