Create an array of words from a text string
I would like to split a text into single words using PHP. Do you have any idea how to achieve this?
My approach:
function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}

$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));
Is this a good approach? Do you have any ideas for improvement?
Thanks in advance!
Comments (6)
Tokenize - strtok.
I would first convert the string to lowercase before splitting it. That would make the i modifier and the array processing afterwards unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.

Edit: Use Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would match the intended characters more specifically than \W.
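A minimal sketch of the lowercase-then-split idea with Unicode properties. The function name tokenize_unicode is made up for illustration, and the code assumes UTF-8 input and that the mbstring extension is available. The u modifier is needed so that PCRE treats the pattern and the subject as UTF-8.

```php
<?php
// Hypothetical helper; the name and the exact character class are assumptions.
function tokenize_unicode(string $text): array
{
    // Lowercase first; mb_strtolower also handles non-ASCII letters such as Ä.
    $text = mb_strtolower($text, 'UTF-8');

    // Split on runs of Unicode punctuation (\p{P}) or separators (\p{Z}).
    // The trailing + collapses neighbouring delimiters into one split point.
    $words = preg_split('/[\p{P}\p{Z}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure (for example, invalid UTF-8 input).
    return $words === false ? [] : $words;
}

print_r(tokenize_unicode('Exclamation marks, too! Größe zählt?'));
```

Two things to keep in mind with this sketch: the hyphen counts as punctuation, so "full-stops" becomes two words (the pattern in the question keeps it as one), and \p{Z} does not include tabs or newlines, so adding \s to the class may be necessary for multi-line input.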
You can also use PHP's strtok() function to fetch tokens from a large string. You can use it like this:
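A rough sketch of what strtok()-based tokenizing can look like. The function name tokenize_strtok and the delimiter list are assumptions chosen for illustration, not taken from the answer.

```php
<?php
// Hypothetical helper; the delimiter list is an assumption for illustration.
function tokenize_strtok(string $text): array
{
    // Every single character in this string acts as a delimiter.
    $delimiters = " \t\r\n.,;:!?\"()";
    $words = [];

    // The first call takes the string; later calls pass only the delimiters
    // and continue where the previous call stopped.
    $token = strtok($text, $delimiters);
    while ($token !== false) {
        $words[] = $token;
        $token = strtok($delimiters);
    }
    return $words;
}

print_r(tokenize_strtok('Exclamation marks, too! Question marks?'));
```

Note that strtok() works on bytes, so this only suits single-byte delimiters such as ASCII punctuation, and it keeps internal state between calls, so two tokenizing loops cannot be interleaved.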
See the PHP documentation for strtok() for more details.
Do:
Or, if you need Unicode support:
You can also use the explode() function: http://php.net/manual/en/function.explode.php
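For comparison, a small sketch of the explode() route. explode() splits on one fixed delimiter string only, so punctuation stays attached to the words; the cleanup step with trim() is an assumption added here for illustration and is not part of the answer.

```php
<?php
$text = 'Exclamation marks, too!';

// Split on single spaces; the pieces still carry their punctuation.
$parts = explode(' ', $text);
print_r($parts);

// Assumed cleanup step: strip common punctuation from both ends of each
// piece and drop pieces that end up empty.
$words = array_values(array_filter(
    array_map(fn($p) => trim($p, ".,;:!?\"'()"), $parts),
    fn($p) => $p !== ''
));
print_r($words);
```

This is fine for simple space-separated input, but a regular expression is usually less work once tabs, newlines, or repeated spaces are involved.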
Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class.
This will split on a group of one or more whitespace characters, but also absorb any surrounding punctuation characters. It also matches punctuation characters at the beginning or end of the string. This distinguishes between cases such as "don't" and "he said 'ouch!'".
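One pattern that is consistent with this description is sketched below. The exact regular expression and the function name tokenize_punct are assumptions; UTF-8 input is assumed as well, hence the u modifier.

```php
<?php
// Hypothetical helper; the pattern is one possible reading of the description.
function tokenize_punct(string $text): array
{
    // Three alternatives:
    //   ^\p{P}+           punctuation at the start of the string
    //   \p{P}*\s+\p{P}*   whitespace plus any punctuation touching it
    //   \p{P}+$           punctuation at the end of the string
    $pattern = '/^\p{P}+|\p{P}*\s+\p{P}*|\p{P}+$/u';

    $words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure (for example, invalid UTF-8 input).
    return $words === false ? [] : $words;
}

print_r(tokenize_punct("He said 'ouch!' and don't stop."));
```

Because punctuation is only consumed when it touches whitespace or a string boundary, the apostrophe inside "don't" survives while the quotes around 'ouch!' are removed.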