Matching and replacing emoticons in strings - what is the most efficient way?
Wikipedia defines a lot of possible emoticons people can use. I want to match this list to words in a string. I now have this:
$string = "Lorem ipsum :-) dolor :-| samet";
$emoticons = array(
    '[HAPPY]' => array(' :-) ', ' :) ', ' :o) '), //etc...
    '[SAD]'   => array(' :-( ', ' :( ', ' :-| ')
);
foreach ($emoticons as $emotion => $icons) {
    $string = str_replace($icons, " $emotion ", $string);
}
echo $string;
Output:
Lorem ipsum [HAPPY] dolor [SAD] samet
So in principle this works. However, I have a few questions:
1. As you can see, I'm putting spaces around each emoticon in the array, such as ' :-) ' instead of ':-)'. This makes the array less readable in my opinion. Is there a way to store the emoticons without the spaces, but still match them against $string with spaces around them (and as efficiently as the code does now)?
2. Or is there perhaps a way to put the emoticons in one variable, and explode on space to check against $string? Something like:
$emoticons = array(
    '[HAPPY]' => ">:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^)",
    '[SAD]'   => ":'-( :'( :'-) :')" //etc...
);
3. Is str_replace the most efficient way of doing this?
I'm asking because I need to check millions of strings, so I'm looking for the most efficient way to save processing time :)
Comments (5)
Here’s the idea using the Perl 3rd-party Regexp::Assemble module from CPAN. For example, given this program:
It will output this:
There's a bit of extra stuff there that you probably don't need, so those would reduce to just:
or so. You could build that into your Perl program to trim the extra bits. Then you could place the right-hand sides straight into your preg_replace.
The reason I did the use utf8 was so I could use ¡ as my qw// delimiter, because I didn't want to mess with escaping things inside there.
You wouldn't need to do this if the whole program were in Perl, because modern versions of Perl already know to do this for you automatically. But it's still useful to know how to use the module so you can generate patterns to use in other languages.
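The same idea of generating the patterns from a readable list can be sketched directly in PHP. This is only a minimal sketch, not the Regexp::Assemble output: it builds a plain alternation with preg_quote and does not merge common prefixes the way that module does. The helper name buildEmoticonPatterns and the space-separated input format are assumptions made for illustration.

```php
<?php
// Hypothetical helper: turn readable, space-separated emoticon lists
// into one alternation pattern per label.
function buildEmoticonPatterns(array $definitions): array
{
    $patterns = [];
    foreach ($definitions as $label => $list) {
        // Split the list into individual emoticons.
        $icons = preg_split('/\s+/', trim($list));
        // Longer emoticons first, so ":-))" would be tried before ":-)".
        usort($icons, fn($a, $b) => strlen($b) <=> strlen($a));
        // Escape regex metacharacters, including the "~" delimiter.
        $escaped = array_map(fn($icon) => preg_quote($icon, '~'), $icons);
        $patterns[$label] = '~(?:' . implode('|', $escaped) . ')~';
    }
    return $patterns;
}

$patterns = buildEmoticonPatterns([
    '[HAPPY]' => ':-) :) :o)',
    '[SAD]'   => ':-( :( :-|',
]);

$string = 'Lorem ipsum :-) dolor :-| samet';
foreach ($patterns as $label => $pattern) {
    $string = preg_replace($pattern, $label, $string);
}
echo $string, PHP_EOL; // Lorem ipsum [HAPPY] dolor [SAD] samet
```

Because the emoticons are stored without surrounding spaces, the spaces in the input are left untouched by the replacement.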
This sounds like a good application for regular expressions, which are a tool for fuzzy text matching and replacement. str_replace is a tool for exact text search and replace; regexps will let you search for entire classes of "text that looks something like this", where the "this" is defined in terms of what kinds of characters you will accept, how many of them, in what order, etc.
If you use regular expressions, then...
The \s wildcard will match whitespace, so you can match \s$emotion\s. (Also consider the case where the emoticon occurs at the end of a string, e.g. "that was funny lol :)": you can't always assume emoticons will have spaces around them. You can write a regexp that handles this.)
You can write a regular expression that will match any of the emoticons in the list. You do this using the alternation symbol |, which you can read as an OR symbol. The syntax is (a|b|c) to match pattern a OR b OR c. For example, (:\)|:-\)|:o\)) will match any of :), :-), :o). Note that I had to escape the )'s because they have a special meaning inside regexps (parentheses are used as a grouping operator).
Premature optimisation is the root of all evil. Try the most obvious thing first. If that doesn't work, you can optimise it later (after you profile the code to ensure this is really going to give you a tangible performance benefit).
If you want to learn regular expressions, try Chapter 8 of the TextWrangler manual. It's a very accessible introduction to the uses and syntax of regular expressions.
Note: my advice is programming-language independent. My PHP-fu is much weaker than my Python-fu, so I can't provide sample code. :(
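Since the answer gives no sample code, here is a minimal PHP sketch of the alternation it describes. Instead of matching \s on both sides, it assumes lookarounds, (?<!\S) and (?!\S), so that an emoticon at the very start or end of the string still counts. The variable names and test inputs are made up for illustration.

```php
<?php
// Alternation of ":)", ":-)" and ":o)"; each ")" is escaped.
// (?<!\S): not preceded by a non-space character (start of string is fine).
// (?!\S):  not followed by a non-space character (end of string is fine).
$happy = '/(?<!\S)(?::\)|:-\)|:o\))(?!\S)/';

$inputs = [
    'that was funny lol :)',          // emoticon at the end of the string
    ':-) at the very start',          // emoticon at the start of the string
    'two in a row :) :o) here',       // neighbours sharing a single space
    'not an emoticon: foo(bar:)baz',  // ":)" glued to other text
];

$results = [];
foreach ($inputs as $input) {
    $results[] = preg_replace($happy, '[HAPPY]', $input);
}
print_r($results);
```

The lookarounds do not consume the surrounding spaces, so two emoticons separated by one space are both replaced. Search strings that include a space on each side would miss the second one, because the shared space is consumed by the first match.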
If the $string in which you want to replace emoticons is provided by a visitor to your site (I mean user input, such as a comment), then you should not rely on there being a space before or after the emoticon. Also, there are at least a couple of emoticons that are very similar but different, like :-) and :-)).
So I think you will achieve a better result if you define your emoticon array like this:
Once you have filled in all the find/replace definitions, you should reorder the array so that there is no chance of :-)) being replaced as :-). I believe sorting the array values by length, longest first, will be enough. This applies if you are going to use str_replace(); strtr() does this ordering by length automatically!
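The ordering problem can be seen in a small sketch. The map below is an assumption for illustration (the answer's own array is not shown), with emoticons as keys, and [VERY_HAPPY] is a made-up label.

```php
<?php
// Assumed find => replace map; the shorter emoticon is listed first on purpose.
$map = [
    ':-)'  => '[HAPPY]',
    ':-))' => '[VERY_HAPPY]', // hypothetical label
];
$string = 'good :-) better :-))';

// str_replace() works through the search array in order, so ":-)" is
// replaced first and the longer ":-))" can no longer be found.
$naive = str_replace(array_keys($map), array_values($map), $string);

// Sorting the emoticons by length, longest first, avoids that.
$sorted = $map;
uksort($sorted, fn($a, $b) => strlen($b) <=> strlen($a));
$bySorted = str_replace(array_keys($sorted), array_values($sorted), $string);

// strtr() with an array tries the longest key first on its own.
$byStrtr = strtr($string, $map);

echo $naive, PHP_EOL;    // good [HAPPY] better [HAPPY])
echo $bySorted, PHP_EOL; // good [HAPPY] better [VERY_HAPPY]
echo $byStrtr, PHP_EOL;  // good [HAPPY] better [VERY_HAPPY]
```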
If you are concerned about performance, you can look up comparisons of strtr vs str_replace, but I would suggest doing your own testing (you may get different results depending on your $string length and your find/replace definitions).
The easiest way is for your "find definitions" not to contain trailing spaces:
From what I can see in your code, you do some string processing twice that you could save, specifically wrapping the replacement in spaces. You could unroll it with your definition first:
This will save you some fraction of a microsecond each time you call it, which will give you a performance gain you probably won't notice. Which brings me to the point that you should probably write this in C and compile it.
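One way to read "unrolling" is to flatten the grouped definitions into two parallel arrays once, so each string needs only a single str_replace call and " $emotion " is not rebuilt every time. This is a sketch of that reading, not the answer's original code; the names $search, $replace and $strings are assumptions.

```php
<?php
$emoticons = [
    '[HAPPY]' => [' :-) ', ' :) ', ' :o) '],
    '[SAD]'   => [' :-( ', ' :( ', ' :-| '],
];

// Done once, up front: one search entry and one padded replacement per emoticon.
$search  = [];
$replace = [];
foreach ($emoticons as $emotion => $icons) {
    foreach ($icons as $icon) {
        $search[]  = $icon;
        $replace[] = " $emotion ";
    }
}

// Done per string: a single call, with no loop over emotions.
$strings = [
    'Lorem ipsum :-) dolor :-| samet',
    'no emoticons here',
];
$out = array_map(fn($s) => str_replace($search, $replace, $s), $strings);
print_r($out);
```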
A bit closer to C would be using a regular expression that is compiled once and then reused, which has already been suggested in another answer. The benefit here is that this may be the fastest way to do it in PHP if you run the same expression many times, and you can generate the regular expression up front, so you can store the definitions in a format that is easier for you to edit. You could then cache the regular expression in case you need to tune performance that hard.
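A minimal sketch of that approach, assuming the space-separated format from the question as the editable source: the pattern and a lookup table are generated once, and the returned closure is reused for every string. The helper name makeEmoticonReplacer is hypothetical. PHP's PCRE extension also keeps its own per-process cache of compiled patterns, so reusing the same pattern string should avoid recompiling it on each call.

```php
<?php
// Hypothetical helper: build one combined pattern plus a lookup table once,
// and return a closure that can be applied to any number of strings.
function makeEmoticonReplacer(array $definitions): callable
{
    $lookup = [];
    foreach ($definitions as $label => $list) {
        foreach (explode(' ', $list) as $icon) {
            $lookup[$icon] = $label; // emoticon => label
        }
    }

    // Longest emoticons first, each one escaped for use in the pattern.
    $icons = array_keys($lookup);
    usort($icons, fn($a, $b) => strlen($b) <=> strlen($a));
    $alternatives = array_map(fn($icon) => preg_quote($icon, '~'), $icons);

    // Lookarounds require whitespace or a string boundary on both sides.
    $pattern = '~(?<!\S)(?:' . implode('|', $alternatives) . ')(?!\S)~';

    return fn(string $s): string => preg_replace_callback(
        $pattern,
        fn(array $m) => $lookup[$m[0]],
        $s
    );
}

$replaceEmoticons = makeEmoticonReplacer([
    '[HAPPY]' => ">:] :-) :) :o) :] =)",
    '[SAD]'   => ":'-( :'( :-( :(",
]);

echo $replaceEmoticons('Lorem ipsum :-) dolor :( samet'), PHP_EOL;
```

All emoticons are handled in one pass over each string, and the definitions stay readable because they carry no padding spaces.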
1. Yes, this is possible, but not more efficient, in the sense that you would need to process the configuration data further into the replacement data. I don't know which kind of efficiency you really mean, but I assume the latter, so the answer is: possible, but not suitable for your very specific use case. Normally I would prefer something that's easier to edit, so that you are more efficient in dealing with it, instead of caring about processing speed, because processing time can be cut fairly well by distributing the work across multiple computers.
2. Sure, that's possible, but you run into the same issues as in 1.
3. Well, right now, going by the code you've offered, it's the only way you ask about. As you haven't told us about any alternative, and it is at least working for you, at this point in time it is the most efficient way of doing that for you. So right now, yes.
I would start by trying out the simplest implementation first, using str_replace and those arrays with spaces. If the performance is unacceptable, try a single regular expression per emotion. That compresses things quite a bit:
If performance is still unacceptable, you can use something fancier, like a suffix tree (see: http://en.wikipedia.org/wiki/Suffix_tree ), which allows you to scan the string only once for all emoticons. The concept is simple: you have a tree whose root is a space (since you want to match a space before the emoticon), the first children are ':' and '=', then the children of ':' are ']', ')', '-', etc. You have a single loop that scans the string, char by char. When you find a space, you move to the next level in the tree, then see if the next character is one of the nodes at that level (':' or '='); if so, you move to the next level, and so on. If, at any point, the current char is not a node in the current level, you go back to the root.
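The scan described above can be sketched as follows. Note that the structure built here is a trie (prefix tree) keyed by emoticon characters, which is what the description amounts to, rather than a true suffix tree. As a further assumption, the sketch treats the start of the string as a boundary as well as a space, and requires a space or the end of the string after the emoticon. The function names buildTrie and replaceWithTrie are hypothetical.

```php
<?php
// Build a trie as nested arrays: 'next' maps a character to a child node,
// 'label' marks a node where a complete emoticon ends.
function buildTrie(array $definitions): array
{
    $root = [];
    foreach ($definitions as $label => $icons) {
        foreach ($icons as $icon) {
            $node = &$root;
            foreach (str_split($icon) as $ch) {
                if (!isset($node['next'][$ch])) {
                    $node['next'][$ch] = [];
                }
                $node = &$node['next'][$ch];
            }
            $node['label'] = $label;
            unset($node); // break the reference before the next emoticon
        }
    }
    return $root;
}

// Scan the string once, walking the trie whenever a boundary is seen.
function replaceWithTrie(string $s, array $trie): string
{
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        if ($i === 0 || $s[$i - 1] === ' ') {
            $node = $trie;
            $j = $i;
            $matchEnd = -1;
            $matchLabel = null;
            while ($j < $len && isset($node['next'][$s[$j]])) {
                $node = $node['next'][$s[$j]];
                $j++;
                // Keep the longest emoticon that ends at a space or end of string.
                if (isset($node['label']) && ($j === $len || $s[$j] === ' ')) {
                    $matchEnd = $j;
                    $matchLabel = $node['label'];
                }
            }
            if ($matchLabel !== null) {
                $out .= $matchLabel;
                $i = $matchEnd;
                continue;
            }
        }
        $out .= $s[$i]; // no emoticon starts here: copy the character as-is
        $i++;
    }
    return $out;
}

$trie = buildTrie([
    '[HAPPY]' => [':-)', ':)', ':o)', '=)'],
    '[SAD]'   => [':-(', ':(', ':-|'],
]);
echo replaceWithTrie('Lorem ipsum :-) dolor :-| samet', $trie), PHP_EOL;
```

A failed walk only re-reads as many characters as the longest emoticon, so the work per string stays close to a single pass. Whether this beats str_replace or preg_replace in PHP would need to be measured, since those run in compiled C code.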