How do I decode a string with escaped unicode?

Posted on 2024-12-11 17:51:53

I'm not sure what this is called so I'm having trouble searching for it. How can I decode a string with unicode from http\u00253A\u00252F\u00252Fexample.com to http://example.com with JavaScript? I tried unescape, decodeURI, and decodeURIComponent so I guess the only thing left is string replace.

EDIT: The string is not typed, but rather a substring from another piece of code. So to solve the problem you have to start with something like this:

var s = 'http\\u00253A\\u00252F\\u00252Fexample.com';

I hope that shows why unescape() doesn't work.

以酷 2024-12-18 17:51:53

Edit (2017-10-12):

@MechaLynx and @Kevin-Weber note that unescape() is deprecated in non-browser environments and does not exist in TypeScript. decodeURIComponent works as a replacement here. For broader compatibility, use the following instead:

decodeURIComponent(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'

Original answer:

unescape(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'

You can offload the work of resolving the \uXXXX escapes to JSON.parse; unescape() then decodes the remaining percent-encoding.

智商已欠费 2024-12-18 17:51:53

UPDATE: Please note that this is a solution that should apply to older browsers or non-browser platforms, and is kept alive for instructional purposes. Please refer to @radicand's answer below for a more up-to-date answer.

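This is a percent-encoded string whose % characters have themselves been written as Unicode escape sequences. First the string was percent-encoded (http:// became http%3A%2F%2F), then each % was replaced by the escape \u0025. To convert back to normal, undo the two steps in reverse order: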

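var x = "http\\u00253A\\u00252F\\u00252Fexample.com";
// Match a literal backslash, "u", then exactly four hex digits (captured).
var r = /\\u([0-9a-f]{4})/gi;
x = x.replace(r, function (match, grp) {
    // Convert the captured hex digits to the matching UTF-16 code unit.
    return String.fromCharCode(parseInt(grp, 16));
});
console.log(x);  // http%3A%2F%2Fexample.com
x = unescape(x);
console.log(x);  // http://example.com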

To explain: I use a regular expression to look for \u0025. However, since I need only a part of this string for my replace operation, I use parentheses to isolate the part I'm going to reuse, 0025. This isolated part is called a group.

The gi part at the end of the expression denotes it should match all instances in the string, not just the first one, and that the matching should be case insensitive. This might look unnecessary given the example, but it adds versatility.

Now, to convert from one string to the next, I need to execute some steps on each group of each match, and I can't do that by simply transforming the string. Helpfully, the String.replace operation can accept a function, which will be executed for each match. The return of that function will replace the match itself in the string.

I use the second parameter this function accepts, which is the group I need, parse it as a hexadecimal number, and convert that number to the character with that code. I then use the built-in unescape function to decode the remaining percent-escapes, which gives the string in its proper form.
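The same two-step idea can be sketched without the deprecated unescape(). This is a minimal sketch rather than a definitive implementation: the helper name decodeUnicodeEscapes is an assumption made for illustration, and it only handles the four-digit \uXXXX form.

```javascript
// Hypothetical helper (not from the original answer): replaces every \uXXXX
// escape (exactly four hex digits) with the UTF-16 code unit it names.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const escaped = 'http\\u00253A\\u00252F\\u00252Fexample.com';

// Step 1: resolve the \u0025 escapes, leaving a percent-encoded string.
const percentEncoded = decodeUnicodeEscapes(escaped);
console.log(percentEncoded); // http%3A%2F%2Fexample.com

// Step 2: resolve the percent-escapes.
const url = decodeURIComponent(percentEncoded);
console.log(url); // http://example.com
```

Because JavaScript strings are sequences of UTF-16 code units, a surrogate pair written as two consecutive \uXXXX escapes is reassembled correctly by this per-escape replacement.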

世界等同你 2024-12-18 17:51:53

Note that the use of unescape() is deprecated and doesn't work with the TypeScript compiler, for example.

Based on radicand's answer and the comments section below, here's an updated solution:

var string = "http\\u00253A\\u00252F\\u00252Fexample.com";
decodeURIComponent(JSON.parse('"' + string.replace(/\"/g, '\\"') + '"'));

http://example.com
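If the input comes from code you do not control, it may help to wrap this in a function that does not throw. The sketch below assumes a hypothetical helper name, decodeEscapedUrl, and assumes that returning null on failure is acceptable; neither comes from the answers here.

```javascript
// Hypothetical helper, not part of any library.
function decodeEscapedUrl(raw) {
  try {
    // Escape bare double quotes so they cannot end the JSON string early,
    // then let JSON.parse resolve the \uXXXX escapes.
    const percentEncoded = JSON.parse('"' + raw.replace(/"/g, '\\"') + '"');
    // Resolve the remaining percent-escapes (%3A, %2F, ...).
    return decodeURIComponent(percentEncoded);
  } catch (err) {
    // SyntaxError from JSON.parse or URIError from decodeURIComponent.
    return null;
  }
}

console.log(decodeEscapedUrl('http\\u00253A\\u00252F\\u00252Fexample.com')); // http://example.com
console.log(decodeEscapedUrl('say \\u0022hi\\u0022 "now"')); // say "hi" "now"
console.log(decodeEscapedUrl('bad \\x41 escape')); // null
```

The last call returns null because JSON has no \x escape, which is one of the limits of this approach.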

爱殇璃 2024-12-18 17:51:53

Using JSON.parse for this comes with significant drawbacks that you must be aware of:

  • You must wrap the string in double quotes
  • Many characters are not supported and must be escaped themselves. For example, passing any of the following to JSON.parse (after wrapping them in double quotes) will error even though these are all valid: \\n, \n, \\0, a"a
  • It does not support hexadecimal escapes: \\x45
  • It does not support Unicode code point sequences: \\u{045}

There are other caveats as well. Essentially, using JSON.parse for this purpose is a hack and doesn't always work the way you might expect. You should stick with using the JSON library to handle JSON, not for string operations.
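A few of these limits can be checked directly. In this sketch the helper parsesAsJsonString is a hypothetical name introduced for illustration; it only reports whether JSON.parse accepts a piece of text as the body of a double-quoted JSON string.

```javascript
// Hypothetical helper: true if JSON.parse accepts the text once it is
// wrapped in double quotes, false if it throws.
function parsesAsJsonString(body) {
  try {
    JSON.parse('"' + body + '"');
    return true;
  } catch (err) {
    return false;
  }
}

console.log(parsesAsJsonString('\\u0041'));      // true: \uXXXX is supported
console.log(parsesAsJsonString('\\x45'));        // false: no hexadecimal escapes
console.log(parsesAsJsonString('\\u{45}'));      // false: no code point escapes
console.log(parsesAsJsonString('\\0'));          // false: no \0 escape
console.log(parsesAsJsonString('a"a'));          // false: unescaped double quote
console.log(parsesAsJsonString('line1\nline2')); // false: raw newline character
```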


I recently ran into this issue myself and wanted a robust decoder, so I ended up writing one myself. It's complete and thoroughly tested and is available here: https://github.com/iansan5653/unraw. It mimics the JavaScript standard as closely as possible.

Explanation:

The source is about 250 lines so I won't include it all here, but essentially it uses the following Regex to find all escape sequences and then parses them using parseInt(string, 16) to decode the base-16 numbers and then String.fromCodePoint(number) to get the corresponding character:

/\\(?:(\\)|x([\s\S]{0,2})|u(\{[^}]*\}?)|u([\s\S]{4})\\u([^{][\s\S]{0,3})|u([\s\S]{0,4})|([0-3]?[0-7]{1,2})|([\s\S])|$)/g

Commented (NOTE: This regex matches all escape sequences, including invalid ones. If the string would throw an error in JS, it throws an error in my library, i.e., '\x!!' will error):

/
\\ # All escape sequences start with a backslash
(?: # Starts a group of 'or' statements
(\\) # If a second backslash is encountered, stop there (it's an escaped slash)
| # or
x([\s\S]{0,2}) # Match valid hexadecimal sequences
| # or
u(\{[^}]*\}?) # Match valid code point sequences
| # or
u([\s\S]{4})\\u([^{][\s\S]{0,3}) # Match surrogate code points which get parsed together
| # or
u([\s\S]{0,4}) # Match non-surrogate Unicode sequences
| # or
([0-3]?[0-7]{1,2}) # Match deprecated octal sequences
| # or
([\s\S]) # Match anything else ('.' doesn't match newlines)
| # or
$ # Match the end of the string
) # End the group of 'or' statements
/g # Match as many instances as there are
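To make the overall approach concrete, here is a much smaller sketch of the same idea. It is not the unraw source: the function name decodeBasicEscapes is hypothetical, the pattern is deliberately stricter than the one above, and only \\, \xHH, \uHHHH and \u{H...} are handled. Any other backslash sequence is left untouched instead of raising an error.

```javascript
// Simplified sketch, NOT the unraw implementation.
function decodeBasicEscapes(raw) {
  // One alternative per supported escape; each has its own capture group.
  const pattern =
    /\\(?:(\\)|x([0-9a-fA-F]{2})|u\{([0-9a-fA-F]+)\}|u([0-9a-fA-F]{4}))/g;
  return raw.replace(pattern, (match, backslash, hex, codePoint, unicode) => {
    if (backslash !== undefined) return '\\'; // escaped backslash
    if (hex !== undefined) return String.fromCharCode(parseInt(hex, 16));
    // String.fromCodePoint throws a RangeError above 0x10FFFF.
    if (codePoint !== undefined) return String.fromCodePoint(parseInt(codePoint, 16));
    // \uHHHH names a single UTF-16 code unit, so surrogate halves are kept as-is.
    return String.fromCharCode(parseInt(unicode, 16));
  });
}

console.log(decodeBasicEscapes('\\x45'));         // E
console.log(decodeBasicEscapes('\\u0041'));       // A
console.log(decodeBasicEscapes('http\\u00253A')); // http%3A
console.log(decodeBasicEscapes('\\\\u0041'));     // \u0041 (the backslash was escaped)
```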

Example

Using that library:

import unraw from "unraw";

let step1 = unraw('http\\u00253A\\u00252F\\u00252Fexample.com');
// yields "http%3A%2F%2Fexample.com"
// Then you can use decodeURIComponent to further decode it:
let step2 = decodeURIComponent(step1);
// yields http://example.com
星光不落少年眉 2024-12-18 17:51:53

I don't have enough rep to put this under comments to the existing answers:

unescape is only deprecated for working with URIs (or any encoded UTF-8), which is probably the case for most people's needs. encodeURIComponent converts a JS string to percent-escaped UTF-8, and decodeURIComponent only works on escaped UTF-8 bytes. It throws an error for something like decodeURIComponent('%a9'), because a lone extended-ASCII byte isn't valid UTF-8 (even though A9 is still a valid Unicode value), whereas unescape('%a9') returns ©. So you need to know your data when using decodeURIComponent.

decodeURIComponent won't work on "%C2" or any lone byte over 0x7f, because in UTF-8 such a byte is only valid as part of a multi-byte sequence. However, decodeURIComponent("%C2%A9") gives you ©. unescape wouldn't work properly on that input (it returns Â©), and it wouldn't throw an error, so unescape can lead to buggy code if you don't know your data.
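The difference is easy to see side by side. This sketch assumes an environment where the legacy global unescape is still available (browsers and Node.js provide it); the variable name loneByteError is only for illustration.

```javascript
// decodeURIComponent treats percent-escapes as UTF-8 bytes.
console.log(decodeURIComponent('%C2%A9')); // ©

// A lone byte above 0x7F is not valid UTF-8, so this throws a URIError.
let loneByteError = null;
try {
  decodeURIComponent('%a9');
} catch (err) {
  loneByteError = err;
}
console.log(loneByteError instanceof URIError); // true

// unescape treats each %XX as one character code and never throws here.
console.log(unescape('%a9'));    // ©
console.log(unescape('%C2%A9')); // Â©
```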

拥有 2024-12-18 17:51:53

This is not an answer to this exact question, but for those who are hitting this page via a search result and who are trying to (like I was) construct a single Unicode character given a sequence of escaped codepoints, note that you can pass multiple arguments to String.fromCodePoint() like so:

String.fromCodePoint(parseInt("1F469", 16), parseInt("200D", 16), parseInt("1F4BC", 16)) // 👩‍💼

You can of course parse your string to extract the hex codepoint strings and then do something like:

let codePoints = hexCodePointStrings.map(s => parseInt(s, 16));
let str = String.fromCodePoint(...codePoints);
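One way to obtain hexCodePointStrings is sketched below. It assumes the input uses the \u{...} code point form; the sample string and the pattern are illustrative assumptions, not part of the answer above.

```javascript
// Assumed input: three code points written as \u{...} escapes
// (literal backslashes, as they would appear in raw text).
const escaped = '\\u{1F469}\\u{200D}\\u{1F4BC}';

// Collect the hex digits captured from each escape: 1F469, 200D, 1F4BC.
const hexCodePointStrings = Array.from(
  escaped.matchAll(/\\u\{([0-9a-fA-F]+)\}/g),
  (m) => m[1]
);

const codePoints = hexCodePointStrings.map((s) => parseInt(s, 16));
const str = String.fromCodePoint(...codePoints);

console.log(str);        // 👩‍💼
console.log(str.length); // 5 (two surrogate pairs plus the zero-width joiner)
```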
瀟灑尐姊 2024-12-18 17:51:53

In my case, I was trying to unescape an HTML file, something like

"\u003Cdiv id=\u0022app\u0022\u003E\r\n    \u003Cdiv data-v-269b6c0d\u003E\r\n        \u003Cdiv data-v-269b6c0d class=\u0022menu\u0022\u003E\r\n    \u003Cdiv data-v-269b6c0d class=\u0022faux_column\u0022\u003E\r\n        \u003Cdiv data-v-269b6c0d class=\u0022row\u0022\u003E\r\n            \u003Cdiv data-v-269b6c0d class=\u0022col-md-12\u0022\u003E\r\n"  

to

<div id="app">
    <div data-v-269b6c0d>
        <div data-v-269b6c0d class="menu">
    <div data-v-269b6c0d class="faux_column">
        <div data-v-269b6c0d class="row">
            <div data-v-269b6c0d class="col-md-12">

The following works in my case:

const jsEscape = (str: string) => {
  return str.replace(new RegExp("'", 'g'),"\\'");
}

export const decodeUnicodeEntities = (data: any) => {
  return unescape(jsEscape(data));
}

// Use it
const data = ".....";
const unescaped = decodeUnicodeEntities(data); // Unescaped html
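Note that unescape() only decodes %XX and %uXXXX sequences, so it does not resolve \uXXXX escapes whose backslashes are literal characters in the data. If the text arrives that way (for example, read from a file or an HTTP response) and is already wrapped in double quotes as shown above, it is a valid JSON string, and one option is to hand it to JSON.parse. The sketch below uses a shortened sample and assumed variable names.

```javascript
// Raw text with literal backslashes, wrapped in double quotes
// (shortened from the sample above).
const rawText =
  '"\\u003Cdiv id=\\u0022app\\u0022\\u003E\\r\\n    \\u003Cdiv class=\\u0022menu\\u0022\\u003E\\r\\n"';

// JSON.parse resolves \uXXXX, \r and \n in one pass.
const html = JSON.parse(rawText);
console.log(html);
// <div id="app">
//     <div class="menu">
```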
