Merging two regular expressions to truncate words in a string

Posted on 2024-08-29 20:45:27

I'm trying to come up with the following function, which truncates a string to whole words (if possible; otherwise it should truncate to characters):

function Text_Truncate($string, $limit, $more = '...')
{
    $string = trim(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));

    if (strlen(utf8_decode($string)) > $limit)
    {
        $string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)~su', '$1', $string);

        if (strlen(utf8_decode($string)) > $limit)
        {
            $string = preg_replace('~^(.{' . intval($limit) . '}).*~su', '$1', $string);
        }

        $string .= $more;
    }

    return trim(htmlentities($string, ENT_QUOTES, 'UTF-8', true));
}

Here are some tests:

// Iñtërnâtiônàlizætiøn and then the quick brown fox... (49 + 3 chars)
echo Text_Truncate('Iñtërnâtiônàlizætiøn and then the quick brown fox jumped overly the lazy dog and one day the lazy dog humped the poor fox down until she died.', 50, '...');

// Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_...  (50 + 3 chars)
echo Text_Truncate('Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day the lazy dog humped the poor fox down until she died.', 50, '...');

They both work as is; however, if I drop the second preg_replace() I get the following:

Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day the lazy dog humped the poor fox down until she died....

I can't use substr() because it only works at the byte level, and I don't have access to mb_substr() at the moment. I've made several attempts to join the second regex with the first one, but without success.

Please help S.M.S., I've been struggling with this for almost an hour.

EDIT: I'm sorry, I've been awake for 40 hours and I shamelessly missed this:

$string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)?~su', '$1', $string);

Still, if someone has a more optimized regex (or one that ignores the trailing space) please share:

"Iñtërnâtiônàlizætiøn and then "
"Iñtërnâtiônàlizætiøn_and_then_"

EDIT 2: I still can't get rid of the trailing whitespace, can someone help me out?

EDIT 3: Okay, none of my edits really worked, I was being fooled by RegexBuddy - I should probably leave this for another day and get some sleep now. Off for today.

把时间冻结 2024-09-05 20:45:27

Perhaps I can give you a happy morning after a long night of RegExp nightmares:

'~^(.{1,' . intval($limit) . '}(?<=\S)(?=\s)|.{'.intval($limit).'}).*~su'

Boiling it down:

^      # Start of String
(       # begin capture group 1
 .{1,x} # match 1 - x characters
 (?<=\S)# lookbehind, match must end with non-whitespace 
 (?=\s) # lookahead, if the next char is whitespace, match
 |      # otherwise test this:
 .{x}   # got to x chars anyway.
)       # end cap group
.*     # match the rest of the string (since you were using replace)

You could always add |$ to the end of the lookahead, giving (?=\s|$), but since your code already checks that the string is longer than $limit, I didn't feel that case would be necessary.
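For what it's worth, here is a minimal sketch of how that single pattern could replace the two preg_replace() calls from the question. The helper name truncate_at_word() is made up for illustration, and the HTML entity decoding/encoding from the original function is left out to keep it short. Since utf8_decode() is deprecated as of PHP 8.2, this sketch counts characters with preg_match_all() instead; that is my own substitution, not something from the question.

```php
<?php
// Hypothetical helper: the name and the length check are assumptions for
// illustration; only the truncation pattern comes from the answer above.
function truncate_at_word($string, $limit, $more = '...')
{
    $limit = max(1, intval($limit)); // .{1,0} would be an invalid quantifier

    // Count UTF-8 characters rather than bytes (no mbstring needed).
    $length = preg_match_all('~.~su', $string);

    if ($length === false || $length <= $limit) {
        return $string; // short enough (or not valid UTF-8): leave untouched
    }

    // First alternative: up to $limit characters, ending on a non-whitespace
    // character that is followed by whitespace (a word boundary).
    // Second alternative: no such boundary exists, so cut at $limit characters.
    $pattern = '~^(.{1,' . $limit . '}(?<=\S)(?=\s)|.{' . $limit . '}).*~su';

    return preg_replace($pattern, '$1', $string) . $more;
}

// Iñtërnâtiônàlizætiøn and then the quick brown fox...
echo truncate_at_word('Iñtërnâtiônàlizætiøn and then the quick brown fox jumped overly the lazy dog and one day the lazy dog humped the poor fox down until she died.', 50), "\n";

// Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_...
echo truncate_at_word('Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day the lazy dog humped the poor fox down until she died.', 50), "\n";
```

Because the first alternative has to end on a non-whitespace character, the word-boundary case never leaves a trailing space before $more, which was the problem in EDIT 2. The hard-cut alternative can still end on whatever character happens to sit at position $limit.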
