(Updated a little)
It must be said that I'm not very experienced with internationalization in PHP, and a good deal of searching didn't really provide the answers I was looking for.
I need to work out a reliable way to convert only 'relevant' text to Unicode to send in an SMS message, using PHP (just temporarily, whilst the service is rewritten in C#) - obviously, messages are currently sent as plain text.
I could conceivably convert everything to the Unicode charset (as opposed to using the standard GSM charset), but that would mean that all messages would be limited to 70 characters (instead of 160).
So, I guess my real question is: what is the most reliable way to detect the requirement for a message to be Unicode-encoded, so I only have to do it when it's absolutely necessary (e.g. for non-Latin-language characters)?
Added Info:
Okay, so I've spent the morning working on this, and I'm still no further on than when I started (certainly due to my complete lack of competency when it comes to charset conversion). So here's the revised scenario:
I have text SMS messages coming from an external source, which provides the responses to me as plain text plus Unicode slash-escaped characters. E.g. the 'displayed' text:
Let's test öäü éàè אין תמיכה בעברית
Returns:
Let's test \u00f6\u00e4\u00fc \u00e9\u00e0\u00e8 \u05d0\u05d9\u05df \u05ea\u05de\u05d9\u05db\u05d4 \u05d1\u05e2\u05d1\u05e8\u05d9\u05ea
Now, I can send on to my SMS provider in plaintext, GSM 03.38 or Unicode. Obviously, sending the above as plaintext results in a lot of missing characters (my provider replaces them with spaces) - I need to adapt according to what the content is. What I want to do with this is the following:
- If all text is within the GSM 03.38 codepage, send it as-is. (All but the Hebrew characters above fit into this category, but need to be converted.)
- Otherwise, convert it to Unicode, and send it over multiple messages (as the Unicode limit is 70 chars not 160 for an SMS).
As I said above, I'm stumped on doing this in PHP (C# wasn't much of an issue due to some simple conversion functions built-in), but it's quite probable I'm just missing the obvious, here. I couldn't find any pre-made conversion classes for 7-bit encoding in PHP, either - and my attempts to convert the string myself and send it on seemed futile.
Any help would be greatly appreciated.
Comments (6)
To deal with it conceptually before getting into mechanisms (and apologies if any of this is obvious): a string can be defined as a sequence of Unicode characters, Unicode being a database that gives an id number, known as a code point, to every character you might need to work with. GSM 03.38 contains a subset of the Unicode characters, so what you're doing is extracting the set of code points from your string and checking whether that set is contained in GSM 03.38.
That leaves the definition of the function codepoints($string), which isn't built in to PHP. PHP understands a string to be a sequence of bytes rather than a sequence of Unicode characters. The best way of bridging the gap is to get your strings into UTF-8 as quickly as you can and keep them in UTF-8 as long as you can - you'll have to use other encodings when dealing with external systems, but isolate the conversion to the interface to that system and deal only with UTF-8 internally.
The functions you need to convert between PHP strings in UTF-8 and sequences of code points can be found at http://hsivonen.iki.fi/php-utf8/ , so that's your codepoints() function.
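For illustration, here is a minimal sketch of what such a codepoints() function could look like in plain PHP. It is not the hsivonen.iki.fi code; the function body is my own assumption of one way to do it, and it relies on PCRE's /u modifier to split a valid UTF-8 string into characters.

```php
<?php
// Hypothetical codepoints() helper: returns the Unicode code points of a UTF-8 string.
// Assumes the input is valid UTF-8; throws if PCRE rejects it.
function codepoints(string $utf8): array
{
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $points = [];
    foreach ($chars as $ch) {
        // unpack() returns a 1-indexed array of byte values; re-index from 0.
        $b = array_values(unpack('C*', $ch));
        switch (count($b)) {
            case 1: // 0xxxxxxx
                $points[] = $b[0];
                break;
            case 2: // 110xxxxx 10xxxxxx
                $points[] = (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
                break;
            case 3: // 1110xxxx 10xxxxxx 10xxxxxx
                $points[] = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
                break;
            default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                $points[] = (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                          | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
        }
    }
    return $points;
}
```

With the code points in hand, the GSM 03.38 check becomes a set-membership test against whatever table of allowed code points you maintain.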
If you're taking data from an external source that gives you Unicode slash-escaped characters ("Let's test \u00f6\u00e4\u00fc..."), that string escape format should be converted to UTF-8. I don't know offhand of a function to do this; if one can't be found, it's a matter of string/regex processing plus the hsivonen.iki.fi functions. For example, when you hit \u00f6, replace it with the UTF-8 representation of the code point 0xf6.
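The regex approach could be sketched roughly as below. Both function names are made up for this example, and it assumes every escape is exactly \u followed by four hex digits for a character in the Basic Multilingual Plane (surrogate pairs are not recombined).

```php
<?php
// Hypothetical helper: encode a single code point as UTF-8 bytes.
function utf8FromCodepoint(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace each \uXXXX escape with its UTF-8 bytes.
// Assumption: BMP characters only; surrogate pairs are not handled.
function unescapeUnicode(string $escaped): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/', // a literal backslash, "u", then four hex digits
        function (array $m): string {
            return utf8FromCodepoint((int) hexdec($m[1]));
        },
        $escaped
    );
}
```

Calling unescapeUnicode() on the provider's response once, right at the interface, keeps everything downstream in UTF-8.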
PHP6 will have better Unicode support, but there are a few functions you can use.
My first thought was mb_convert_encoding, but as you said this will shorten messages to 70 chars - so perhaps you can use it in conjunction with mb_detect_encoding? See: Multibyte Functions
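As a rough sketch of the mb_convert_encoding side of this: the function name toUnicodeSmsParts and the 70-character default are my own choices (the limit is taken from the question), and whether the provider wants raw UTF-16BE bytes or something else is an assumption to check against its documentation. It requires the mbstring extension.

```php
<?php
// Sketch only: assumes mbstring is loaded and $utf8 is valid UTF-8.
// Converts the text to UTF-16BE (two bytes per BMP character) and cuts it
// into parts of at most $charsPerPart 16-bit units each.
// Caveat: a character outside the BMP (a surrogate pair) could be split
// across two parts; this sketch does not guard against that.
function toUnicodeSmsParts(string $utf8, int $charsPerPart = 70): array
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    return str_split($utf16, $charsPerPart * 2);
}
```

For example, a 75-character Hebrew message would come back as two parts of 140 and 10 bytes, ready to be hex-encoded with bin2hex() if the provider expects that.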
or
I know this isn't PHP code, but I think it might help anyway. This is how I do it in an app I wrote, to detect whether it's possible to send a message as GSM 03.38 (you could do something similar for plain text). It has two translation tables, one for the normal GSM character set and one for the extended set, and then a function that loops through all characters, checking whether each one can be converted.
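A PHP take on the same two-table idea might look like the sketch below. This is not the code from the app mentioned above; the constant and function names are invented for the example, the tables are typed in from the GSM 03.38 default alphabet and its extension table and should be double-checked against the spec, and the source file must be saved as UTF-8. As an extra, it counts septets, since extended characters take two.

```php
<?php
// Basic GSM 03.38 alphabet (each character costs one septet).
const GSM_BASIC = '@£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !"#¤%&\'()*+,-./0123456789:;<=>?'
    . '¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà' . "\n\r";
// Extension table (each character is sent as ESC + code, so two septets).
const GSM_EXTENDED = '^{}[~]|€' . "\\\f";

// Returns the number of septets needed to send $utf8 as GSM 03.38,
// or null if any character is in neither table (or the input is not valid UTF-8).
function gsmSeptetCount(string $utf8): ?int
{
    static $cost = null;
    if ($cost === null) {
        $cost = [];
        foreach (preg_split('//u', GSM_BASIC, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $cost[$ch] = 1;
        }
        foreach (preg_split('//u', GSM_EXTENDED, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $cost[$ch] = 2;
        }
    }
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return null;
    }
    $total = 0;
    foreach ($chars as $ch) {
        if (!isset($cost[$ch])) {
            return null; // not representable in GSM 03.38
        }
        $total += $cost[$ch];
    }
    return $total;
}
```

A null result means the message has to go out as Unicode; otherwise the count tells you whether it fits in a single 160-septet message.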
Although this is an old thread, I recently had to solve a very similar problem and wanted to post my answer. The PHP code is fairly simple. It starts with a painstakingly large array of valid GSM character codes, then simply checks whether the current character is in that array using the ord($string) function, which returns the value of the first byte of the string passed (the ASCII value for single-byte characters). Here is the code I use to validate whether a string is GSM-worthy.