Encoding trouble - one format to another
I have a scraper that collects data from a source I have no control over. The source data contains all sorts of interesting Unicode characters, but it converts them to a pretty unhelpful format, so I get
\u00e4
for a small 'a' with umlaut (sans the double quotes that I think are supposed to be there)*. Of course, this gets rendered in my HTML as plain text.
Is there any realistic way to convert the Unicode source into proper characters that doesn't involve manually crunching out every single escape sequence and replacing it during the scrape?
*Here is a sample of the JSON that it spits out:
({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n})
Comments (2)
Considering that \u00e4 is the JavaScript representation of a Unicode character, one possibility is to use the
json_decode()
PHP function to decode it into a PHP string. The valid JSON string would be:
And this:
would give you the right output:
(It's one character, but two bytes long.)
Still, it feels a bit hackish ^^
And it might not work too well, depending on the kind of string you get as input...
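A minimal sketch of the idea described above, assuming the scraper hands over the bare escape sequence with no quotes around it. The variable names are made up for illustration; the only library call is PHP's json_decode().

```php
<?php
// The bare escape sequence as scraped (assumption: no surrounding quotes).
// In a single-quoted PHP string, \u is not an escape, so this is the
// literal six characters \u00e4.
$raw = '\u00e4';

// Wrapping it in double quotes makes it a valid JSON string literal.
$json = '"' . $raw . '"';

// json_decode() resolves the \uXXXX escape and returns a UTF-8 string.
$decoded = json_decode($json);

echo $decoded, "\n";          // the character itself
echo strlen($decoded), "\n";  // byte length of the UTF-8 encoding
```

strlen() counts bytes rather than characters, which is why a single character can report a length of two.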
[Edit] I've just seen your comment, where you seem to indicate that you get JSON as input. If so,
json_decode()
might really be the right tool for the job ;-)
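If the input really is JSON, the whole payload can be decoded in one go instead of one escape at a time. The sketch below is only an illustration: the sample in the question is cut off, so the closing quote and braces in $scraped are assumed, and stripping the wrapping parentheses with trim() is a guess at what the scraper needs.

```php
<?php
// Hypothetical, completed version of the scraped payload. The closing
// quote and braces are assumed, since the original sample is truncated.
$scraped = '({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n"}})';

// The wrapping parentheses are not part of JSON, so strip them first.
$json = trim($scraped, "() \t\r\n");

// Decode into associative arrays; all \uXXXX and \/ escapes are resolved.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

// The HTML fragment now contains the real UTF-8 character.
$html = $data['content']['pagelet_tab_content'];
echo $html;
```

The page still has to be served as UTF-8 for the decoded character to display correctly.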
The accepted answer won't work if you need to use JSON encoding somewhere in the middle of page execution (e.g. as a plugin for some CMS) or cannot set the header information. Of course, the page header should always be set correctly.
You can pass additional parameters to the json_encode / json_decode functions to "force" them to use UTF-8. I'm building a simple class for this and using static methods to get my results.
The key to this is the flag JSON_UNESCAPED_UNICODE.
Use it like this:
Data Class
Usage
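A rough sketch of what such a helper could look like. The class name Data and its methods encode() and decode() are hypothetical. Note that JSON_UNESCAPED_UNICODE only changes what json_encode() emits; json_decode() already returns UTF-8 strings without any flag.

```php
<?php
// Hypothetical helper class; names are illustrative only.
final class Data
{
    // Encode to JSON, keeping non-ASCII characters as literal UTF-8
    // instead of \uXXXX escapes.
    public static function encode($value): string
    {
        $json = json_encode($value, JSON_UNESCAPED_UNICODE);
        if ($json === false) {
            throw new RuntimeException('Encode failed: ' . json_last_error_msg());
        }
        return $json;
    }

    // Decode JSON into scalars and associative arrays.
    public static function decode(string $json)
    {
        $value = json_decode($json, true);
        if ($value === null && json_last_error() !== JSON_ERROR_NONE) {
            throw new RuntimeException('Decode failed: ' . json_last_error_msg());
        }
        return $value;
    }
}

// Usage: decode the escaped form, then re-encode without escaping.
$name = Data::decode('"D\u00e4vid"');
echo Data::encode(['name' => $name]), "\n";
```

Throwing on failure is a design choice made here so that a bad payload does not silently turn into null or false.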