Creating an array of words from a text string

Posted on 2024-07-17 20:01:47

I would like to split a text into single words using PHP. Do you have any idea how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

Is this a good approach? Do you have any ideas for improvement?

Thanks in advance!

花辞树 2024-07-24 20:01:48

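Tokenize with strtok:

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
// Double quotes are required so that \n and \t are interpreted as newline and tab.
// In single quotes they would be literal characters, and the letters n and t would act as delimiters.
$delim = " \n\t,.!?:;";

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>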

爱你是孤单的心事 2024-07-24 20:01:48

I would first convert the string to lowercase before splitting it. That would make the i modifier and the subsequent array processing unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

Edit: Use Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would target the characters more specifically than \W.
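
The Unicode-property variant could look roughly like the sketch below. It assumes UTF-8 input and the mbstring extension; the function name tokenize_unicode is hypothetical and not part of the original answer. The u modifier makes PCRE treat the pattern and subject as UTF-8, and \s is added because tabs and newlines are not covered by \p{Z}.

```php
<?php
// Hypothetical helper (an illustration, not the answer's code): split UTF-8
// text into lowercase words on runs of Unicode punctuation and separators.
function tokenize_unicode(string $text): array
{
    // mb_strtolower also lowercases non-ASCII letters such as "Ü".
    $lower = mb_strtolower($text, 'UTF-8');

    // \p{P} = punctuation, \p{Z} = separators, \s = tabs/newlines.
    $words = preg_split('/[\p{P}\p{Z}\s]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on error (for example, invalid UTF-8 input).
    return $words === false ? [] : $words;
}

print_r(tokenize_unicode('Größe, Maß und Übung. Full-stops, too!'));
```

Note that the hyphen is itself a punctuation character (\p{Pd}), so "full-stops" is split into "full" and "stops" here; the character class would need adjusting if hyphenated words should stay intact.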

阿楠 2024-07-24 20:01:48

You can also use the PHP strtok() function to fetch tokens from a large string. You can use it like this:

 $result = array();
 // Your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // Pass strtok() the string and a delimiter that specifies how tokens are separated. Here, words are separated by a space.
 // Punctuation stays attached to the words unless it is added to the delimiter string.
 $word = strtok($text, ' ');
 while ($word !== false) {
     $result[] = $word;
     // Subsequent calls take only the delimiter and continue with the same string.
     $word = strtok(' ');
 }

See the PHP documentation for strtok() for more details.

落在眉间の轻吻 2024-07-24 20:01:48

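Use:

str_word_count($text, 1);

Or, if you need Unicode support:

function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();

    // Match runs of letters, combining marks, dashes, apostrophes, and any extra characters given in $search.
    // The (string) cast avoids passing null to preg_quote(), which is deprecated since PHP 8.1.
    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }

    // Format 0 returns the number of words; any other value returns the words themselves.
    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}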

抱着落日 2024-07-24 20:01:48

You can also use the explode function: http://php.net/manual/en/function.explode.php

$words = explode(" ", $sentence);
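
Splitting on a single space leaves punctuation attached to the words ("marks," or "too!"). The sketch below shows one possible way to clean that up; the helper name explode_words and the list of punctuation characters are assumptions for illustration, and the arrow functions require PHP 7.4 or later.

```php
<?php
// Hypothetical helper: explode on spaces, then strip leading and trailing
// whitespace and common ASCII punctuation from each piece.
function explode_words(string $sentence): array
{
    $pieces = explode(' ', $sentence);

    // trim() with a character list removes those characters from both ends only.
    $trimmed = array_map(fn($p) => trim($p, " \t\n\r,.!?:;\"'()"), $pieces);

    // Drop empty strings left by repeated spaces or punctuation-only pieces,
    // then reindex the array from 0.
    return array_values(array_filter($trimmed, fn($p) => $p !== ''));
}

print_r(explode_words('Exclamation marks, too!  Question marks?'));
```

Punctuation inside a word, such as the apostrophe in "don't", is left untouched because trim() only works on the ends of each piece.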

無心 2024-07-24 20:01:47

Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class. The u modifier is needed so that the pattern and the text are treated as UTF-8.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY);

This splits on a group of one or more whitespace characters and also absorbs any surrounding punctuation characters. It also matches punctuation characters at the beginning or end of the string. This correctly handles cases such as "don't", where the apostrophe inside the word is kept, and "he said 'ouch!'", where the quotes and the exclamation mark are stripped.
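
To see the behavior on both cases, the pattern can be wrapped in a small function. This is only a sketch: the name split_words is hypothetical, and the u modifier assumes UTF-8 input.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function split_words(string $text): array
{
    // Three alternatives: punctuation at the start of the string, whitespace
    // with any adjacent punctuation, and punctuation at the end of the string.
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on error (for example, invalid UTF-8 input).
    return $parts === false ? [] : $parts;
}

print_r(split_words("I don't know, he said 'ouch!'"));
// Yields the words: I, don't, know, he, said, ouch
```

Punctuation that is not next to whitespace or the string boundaries, such as the hyphen in "full-stops", stays inside the word.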
