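How to convert Cyrillic into UTF-16
tl;dr Is there a way to convert Cyrillic text stored in a hashtable into UTF-16 escape sequences?
Like кириллица
into \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
I need to import a file, parse it into id and value, and then convert it to .json. Right now I'm struggling to find a way to convert value into UTF-16 codes.
And yes, it needs to be this way.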
cyrillic.txt:
1 кириллица
PS:
Clear-Host
$temp = @()  # initialize as an array so that += appends objects
foreach ($line in (Get-Content C:\Users\users\Downloads\cyrillic.txt)) {
    $nline = $line.Split(' ', 2)
    $properties = @{
        'id'    = $nline[0]  # stores "1" from file
        'value' = $nline[1]  # stores "кириллица" from file
    }
    $temp += New-Object PSObject -Property $properties
}
$temp | ConvertTo-Json | Out-File "C:\Users\user\Downloads\data.json"
Output:
[
{
"id": "1",
"value": "кириллица"
},
]
Needed:
[
{
"id": "1",
"value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
},
]
At this point, being new to PS, I don't even know how to search for this properly.
Answers (3)
Building on Jeroen Mostert's helpful comment, the following approach works robustly, assuming that the input file contains no NUL characters (which is usually a safe assumption for text files): convert the value into its UTF-16 code units, emit each one as a NUL character followed by 4 hex digits, pass the result to ConvertTo-Json, and then replace each \u0000 in the JSON text with \u.
The output then contains the desired \uXXXX escape sequences; pipe it to ConvertFrom-Json to verify that it works.
Explanation:
[uint16[]] [char[]] $nline[1] converts the [char] instances of the string stored in $nline[1] into the underlying UTF-16 code units (a .NET [char] is an unsigned 16-bit integer holding one UTF-16 code unit).
Note that this works even with Unicode characters whose code points are above 0xFFFF, i.e. that are too large to fit into a [uint16]. Such characters, which lie outside the so-called BMP (Basic Multilingual Plane), are simply represented as pairs of UTF-16 code units, so-called surrogate pairs, which a JSON processor should recognize (ConvertFrom-Json does).
The call to the .ForEach() array method processes each resulting code unit:
"`0{0:x4}" -f $_ uses an expandable string to create a string that starts with a NUL character ("`0"), followed by a 4-digit hex representation (x4) of the code unit at hand, created via -f, the format operator.
Replacing what will ultimately be a verbatim \u prefix temporarily with a NUL character is needed, because a verbatim \ embedded in a string value would invariably be doubled in its JSON representation, given that \ acts as the escape character in JSON.
The result is something like "<NUL>043a", which ConvertTo-Json transforms into "\u0000043a", given that it must escape each NUL character as \u0000.
The result from ConvertTo-Json can then be transformed into the desired escape sequences simply by replacing \u0000 (escaped as \\u0000 for use with the regex-based -replace operator) with \u, e.g. "\u0000043a" becomes "\u043a".
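The NUL-placeholder approach from the first answer might be sketched as follows. This is a reconstruction from the explanation, not necessarily the answerer's exact code; the sample $line (standing in for one line of cyrillic.txt) and the $json variable are assumptions made for illustration.

```powershell
# Assumed sample input, standing in for one line read from cyrillic.txt.
$line = '1 кириллица'

$nline = $line.Split(' ', 2)

$properties = [ordered] @{
    id    = $nline[0]
    # Each UTF-16 code unit becomes NUL + 4 hex digits, e.g. "<NUL>043a".
    value = -join (([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ }))
}

# ConvertTo-Json escapes each NUL as \u0000; swapping that for \u
# leaves the desired \uXXXX escape sequences in the JSON text.
$json = ([pscustomobject] $properties | ConvertTo-Json) -replace '\\u0000', '\u'
$json
```

Piping $json to ConvertFrom-Json should give back an object whose value property is the original Cyrillic string, which is a quick way to check that the escapes are valid.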
Here's a way that simply saves the string to a UTF-16BE file, then reads back the bytes and formats them, skipping the first 2 bytes, which are the BOM (\ufeff). $_ didn't work by itself. Note that there are two UTF-16 encodings with different byte orders, big-endian and little-endian. The basic Cyrillic block is U+0400..U+04FF. Added -NoNewline.
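A minimal sketch of the file-based idea, under some assumptions: the scratch file path in $path is hypothetical, and the bytes are read with [System.IO.File]::ReadAllBytes so that the same code can run in both Windows PowerShell and PowerShell 7. It is not necessarily the answerer's exact code.

```powershell
# Hypothetical scratch file; any writable path would do.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'utf16be.txt'

# Write the string as big-endian UTF-16, without a trailing newline.
'кириллица' | Set-Content -LiteralPath $path -Encoding BigEndianUnicode -NoNewline

$bytes = [System.IO.File]::ReadAllBytes($path)

# Skip the 2-byte BOM (FE FF) if the file starts with one.
$start = if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFE -and $bytes[1] -eq 0xFF) { 2 } else { 0 }

# Format each big-endian byte pair as one \uXXXX escape sequence.
$escaped = -join $(
    for ($i = $start; $i -lt $bytes.Length; $i += 2) {
        '\u{0:x2}{1:x2}' -f $bytes[$i], $bytes[$i + 1]
    }
)
$escaped
```

Big-endian order matters here because the high byte of each code unit has to come first for the two hex pairs to read as the code point, e.g. 04 3a for U+043A.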
There must be a simpler way of doing this, but this could work for you:
Simpler, using .ToCharArray():
The value "кириллица" will be converted to \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
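A minimal sketch of the .ToCharArray() variant, not necessarily the answerer's exact code; the $value and $escaped variable names are assumptions made for illustration.

```powershell
# Assumed sample value, as parsed from the input file.
$value = 'кириллица'

# Format each [char] (one UTF-16 code unit) as \u plus 4 hex digits.
$escaped = -join ($value.ToCharArray() | ForEach-Object { '\u{0:x4}' -f [int] $_ })
$escaped
```

Note that passing $escaped through ConvertTo-Json would double each backslash, so this string is suited to being written into the JSON text directly rather than being stored in an object that is serialized afterwards.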