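How do I decode a string with escaped unicode?
I'm not sure what this is called, so I'm having trouble searching for it. Using JavaScript, how can I decode a string with unicode escapes from http\u00253A\u00252F\u00252Fexample.com to http://example.com? I tried unescape, decodeURI, and decodeURIComponent, so I guess the only thing left is string replacement.
Edit: The string is not typed in; it is a substring taken from another piece of code. So to solve the problem, you have to start with something like this:
var s = 'http\\u00253A\\u00252F\\u00252Fexample.com';
I hope this shows why unescape() doesn't work.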
Edit (2017-10-12):
@MechaLynx and @Kevin-Weber note that unescape() is deprecated outside browser environments and does not exist in TypeScript. decodeURIComponent is a drop-in replacement. For broader compatibility, use decodeURIComponent on the result of JSON.parse instead.
Original answer:
You can offload all the work to JSON.parse.
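The answer's original snippet did not survive on this page, so the following is only a minimal sketch of the approach it describes, using the string from the question. The variable names are made up for illustration.

```javascript
// The raw string holds a literal backslash followed by "u0025", not a "%".
const raw = 'http\\u00253A\\u00252F\\u00252Fexample.com';

// Step 1: wrapping the text in double quotes turns it into a JSON string
// literal, so JSON.parse resolves each \u0025 to "%".
const percentEncoded = JSON.parse('"' + raw + '"');

// Step 2: decodeURIComponent resolves the %3A and %2F sequences.
const decoded = decodeURIComponent(percentEncoded);

console.log(percentEncoded); // http%3A%2F%2Fexample.com
console.log(decoded);        // http://example.com
```

This assumes the input contains nothing that is invalid inside a JSON string literal (such as an unescaped double quote); otherwise JSON.parse throws a SyntaxError.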
UPDATE: Please note that this solution should apply to older browsers or non-browser platforms, and is kept alive for instructional purposes. Please refer to @radicand's answer for a more up-to-date answer.
This is a unicode-escaped string. First the string was escaped, then encoded with unicode. To convert it back to normal, reverse those two steps.
To explain: I use a regular expression to look for \u0025. However, since I need only a part of this string for my replace operation, I use parentheses to isolate the part I'm going to reuse, 0025. This isolated part is called a group.
The gi part at the end of the expression denotes that it should match all instances in the string, not just the first one, and that the matching should be case insensitive. This might look unnecessary given the example, but it adds versatility.
Now, to convert from one string to the next, I need to execute some steps on each group of each match, and I can't do that by simply transforming the string. Helpfully, the String.replace operation can accept a function, which will be executed for each match. The return value of that function replaces the match itself in the string.
I use the second parameter this function accepts, which is the group I need, and transform it to the equivalent utf-8 sequence (here, \u0025 becomes %, giving http%3A%2F%2Fexample.com), then use the built-in unescape function to decode the string to its proper form.
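The regex-and-callback steps described in this answer can be sketched roughly as follows. This is an illustrative reconstruction, not the answer's original code: the variable names are assumptions, and decodeURIComponent is swapped in for the deprecated unescape in the final step.

```javascript
const input = 'http\\u00253A\\u00252F\\u00252Fexample.com';

// Match a literal backslash, "u", and four hex digits. The parentheses
// capture the digits as group 1; "g" matches every occurrence and "i"
// makes the match case-insensitive.
const unicodeEscape = /\\u([0-9a-f]{4})/gi;

// The callback runs once per match. Its second parameter is group 1,
// which is parsed as base 16 and turned into the character it encodes.
const percentForm = input.replace(unicodeEscape, (match, hex) =>
  String.fromCharCode(parseInt(hex, 16))
);

const url = decodeURIComponent(percentForm);

console.log(percentForm); // http%3A%2F%2Fexample.com
console.log(url);         // http://example.com
```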
Note that the use of unescape() is deprecated and doesn't work with the TypeScript compiler, for example.
Based on radicand's answer and the comments section, an updated solution that avoids unescape() produces:
http://example.com
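The updated code itself is missing from this page. One plausible shape for it, offered only as a sketch, wraps the JSON.parse approach in a helper and escapes double quotes first so they cannot end the JSON literal early. The helper name decodeEscapedUrl is hypothetical.

```javascript
// Hypothetical helper: resolves \uXXXX escapes via JSON.parse, then
// percent-escapes via decodeURIComponent.
function decodeEscapedUrl(raw) {
  // Escape bare double quotes so they cannot terminate the JSON literal.
  const quoted = '"' + raw.replace(/"/g, '\\"') + '"';
  return decodeURIComponent(JSON.parse(quoted));
}

console.log(decodeEscapedUrl('http\\u00253A\\u00252F\\u00252Fexample.com'));
// http://example.com
```

The helper still throws on input that is not valid inside a JSON string (a SyntaxError, for example on \x45) or that contains malformed percent sequences (a URIError), so callers may want a try/catch around it.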
Using JSON.parse for this comes with significant drawbacks that you must be aware of. Passing any of the following to JSON.parse (after wrapping them in double quotes) will error even though these are all valid:
\\n, \n, \\0, a"a
\\x45
\\u{045}
There are other caveats as well. Essentially, using JSON.parse for this purpose is a hack and doesn't always work the way you might expect. You should stick with using the JSON library to handle JSON, not for string operations.
I recently ran into this issue myself and wanted a robust decoder, so I ended up writing one. It's complete and thoroughly tested and is available here: https://github.com/iansan5653/unraw. It mimics the JavaScript standard as closely as possible.
Explanation:
The source is about 250 lines, so I won't include it all here, but essentially it uses a regex to find all escape sequences, parses them using parseInt(string, 16) to decode the base-16 numbers, and then uses String.fromCodePoint(number) to get the corresponding character.
NOTE: This regex matches all escape sequences, including invalid ones. If the string would throw an error in JS, it throws an error in my library (i.e., '\x!!' will error).
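To make the regex, parseInt, and String.fromCodePoint pipeline concrete, here is a much-reduced sketch. It is not the unraw library's code or API: the regex, the function name decodeHexEscapes, and the decision to handle only \u{...}, \uXXXX, and \xXX are all assumptions made for illustration.

```javascript
// Matches \u{H...} (code point), \uHHHH, and \xHH. Other escapes such as
// \n or octal sequences are deliberately left untouched in this sketch.
const ESCAPE = /\\u\{([0-9a-fA-F]+)\}|\\u([0-9a-fA-F]{4})|\\x([0-9a-fA-F]{2})/g;

function decodeHexEscapes(raw) {
  // Exactly one of the three groups is defined for any given match.
  return raw.replace(ESCAPE, (match, braced, four, two) =>
    String.fromCodePoint(parseInt(braced || four || two, 16))
  );
}

console.log(decodeHexEscapes('\\x45 \\u0045 \\u{045}')); // E E E
console.log(decodeHexEscapes('http\\u00253A'));          // http%3A
```

Unlike a full decoder, this sketch does not reject invalid sequences, and it does not treat an escaped backslash specially (a doubled backslash followed by u0041 would still be decoded), which is the kind of edge case a tested library is meant to cover.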
I don't have enough rep to put this under comments to the existing answers:
unescape is only deprecated for working with URIs (or any encoded UTF-8), which is probably the case for most people's needs. encodeURIComponent converts a JS string to escaped UTF-8, and decodeURIComponent only works on escaped UTF-8 bytes. It throws an error for something like decodeURIComponent('%a9'); // error
because extended ASCII isn't valid UTF-8 (even though that's still a unicode value), whereas unescape('%a9'); // ©
So you need to know your data when using decodeURIComponent.
decodeURIComponent won't work on "%C2" or any lone byte over 0x7f, because in UTF-8 such a byte is only valid as part of a multi-byte sequence. However, decodeURIComponent("%C2%A9") // gives you ©
unescape wouldn't work properly on that (it returns Â© rather than ©), AND it wouldn't throw an error, so unescape can lead to buggy code if you don't know your data.
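The difference between the two functions can be checked directly. The snippet below is a small sketch that assumes an environment where the legacy global unescape is still available (browsers and Node.js); the variable names are illustrative.

```javascript
// A lone byte above 0x7F is not valid UTF-8, so decodeURIComponent throws.
let loneByteError = null;
try {
  decodeURIComponent('%a9');
} catch (err) {
  loneByteError = err;
}
console.log(loneByteError instanceof URIError); // true

// The two-byte UTF-8 encoding of U+00A9 decodes cleanly.
const viaDecode = decodeURIComponent('%C2%A9');
console.log(viaDecode); // ©

// unescape maps each %XX straight to one code unit and never throws.
const latin1Style = unescape('%a9');
const mangled = unescape('%C2%A9');
console.log(latin1Style); // ©
console.log(mangled);     // Â©
```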
This is not an answer to this exact question, but for those who are hitting this page via a search result and who are trying to (like I was) construct a single Unicode character given a sequence of escaped codepoints, note that you can pass multiple arguments to String.fromCodePoint(). You can of course also parse your string to extract the hex codepoint strings, convert each one to a number, and pass the results in the same way.
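The answer's own examples are missing here, so this is a stand-in sketch. The code points are an arbitrary choice (two regional indicator symbols that render together as one flag), and the array of hex strings is assumed to have been extracted from the input already.

```javascript
// Several code points passed together produce one string. U+1F1EF and
// U+1F1F5 are regional indicators "J" and "P", shown as a single flag.
const flag = String.fromCodePoint(0x1F1EF, 0x1F1F5);
console.log(flag); // 🇯🇵

// Starting from hex strings instead: parse each as base 16, then spread
// the resulting numbers into String.fromCodePoint.
const hexCodePoints = ['1F1EF', '1F1F5'];
const fromHex = String.fromCodePoint(
  ...hexCodePoints.map((hex) => parseInt(hex, 16))
);
console.log(fromHex === flag); // true
```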
In my case, I was trying to unescape an HTML file.