How to convert Cyrillic to UTF-16

Posted 2025-01-17 21:59:21

tl;dr Is there a way to convert Cyrillic text stored in a hashtable into UTF-16 escape sequences?
Like кириллица into \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430

I need to import a file, parse it into id and value, then convert it to .json, and now I'm struggling to find a way to convert value into \u escape codes.

And yes, it is needed that way.

cyrillic.txt:

1 кириллица

PowerShell:

Clear-Host
$temp = @()  # initialize as an array so += appends one object per line
foreach ($line in (Get-Content C:\Users\users\Downloads\cyrillic.txt)){
    $nline = $line.Split(' ', 2)
    $properties = @{
        'id'    = $nline[0] # stores "1" from file
        'value' = $nline[1] # stores "кириллица" from file
    }
    $temp += New-Object PSObject -Property $properties
}
$temp | ConvertTo-Json | Out-File "C:\Users\user\Downloads\data.json"

Output:

[
    {
        "id":  "1",
        "value":  "кириллица"
    }
]

Needed:

[
    {
        "id":  "1",
        "value":  "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
    }
]

At this point, as a newcomer to PowerShell, I have no idea how to even search for this properly.
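
For orientation, a minimal sketch (not from the original post) of the mapping being asked about: each \uXXXX escape is just the character's UTF-16 code unit written as four hex digits, which PowerShell can show with a couple of casts:

# Sketch: 'к' has UTF-16 code unit 0x043A, which is what the \u043a escape encodes.
[uint16] [char] 'к'               # -> 1082 (0x043A)
'{0:x4}' -f [uint16] [char] 'к'   # -> 043a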

Comments (3)

如此安好 2025-01-24 21:59:21

Building on Jeroen Mostert's helpful comment, the following works robustly, assuming that the input file contains no NUL characters (which is usually a safe assumption for text files):

# Sample value pair; loop over file lines omitted for brevity.
$nline = '1 кириллица'.Split(' ', 2)

$properties = [ordered] @{
  id = $nline[0]
  # Insert aux. NUL characters before the 4-digit hex representations of each
  # code unit, to be removed later.
  value = -join ([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ })
}

# Convert to JSON, then remove the escaped representations of the aux. NUL chars.,
# resulting in proper JSON escape sequences.
# Note: ... | Out-File ... omitted.
(ConvertTo-Json @($properties)) -replace '\\u0000', '\u'

Output (pipe to ConvertFrom-Json to verify that it works):

[
  {
    "id": "1",
    "value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
  }
]

Explanation:

  • [uint16[]] [char[]] $nline[1] converts the [char] instances of the string stored in $nline[1] into the underlying UTF-16 code units (a .NET [char] is an unsigned 16-bit integer representing a UTF-16 code unit).

    • Note that this works even with Unicode characters whose code points lie above 0xFFFF, i.e. that are too large to fit into a [uint16]. Such characters outside the so-called BMP (Basic Multilingual Plane), e.g. most emoji, are simply represented as pairs of UTF-16 code units, so-called surrogate pairs, which a JSON processor should recognize (ConvertFrom-Json does); see the short sketch after this list.
    • However, on Windows such characters may not render correctly, depending on your console window's font. The safest option is to use Windows Terminal, available from the Microsoft Store.
  • The call to the .ForEach() array method processes each resulting code unit:

    • "`0{0:x4}" -f $_ uses an expandable string to create a string that starts with a NUL character ("`0"), followed by a 4-digit hex. representation (x4) of the code unit at hand, created via -f, the format operator.

      • This trick of temporarily replacing what should ultimately be a verbatim \u prefix with a NUL character is needed because a verbatim \ embedded in a string value would invariably be doubled in its JSON representation, given that \ acts as the escape character in JSON.
    • The result is something like "<NUL>043a", which ConvertTo-Json transforms as follows, given that it must escape each NUL character as \u0000:

      "\u0000043a"
      
  • The result from ConvertTo-Json can then be transformed into the desired escape sequences simply by replacing \u0000 (escaped as \\u0000 for use with the regex-based -replace operator) with \u, e.g.:

      "\u0000043a" -replace '\\u0000', '\u' # -> "\u043a", i.e. к
    
南七夏 2025-01-24 21:59:21

Here's a way that simply saves the string to a UTF-16 BE file and then reads the bytes back and formats them, skipping the first 2 bytes, which are the BOM (\ufeff). Using $_ by itself in the format string didn't work, hence the explicit $_[0], $_[1] indexing. Note that there are two UTF-16 encodings with different byte orders, big endian and little endian. The Cyrillic range is U+0400..U+04FF. -NoNewline is added so that no trailing newline bytes end up in the file.

'кириллица' | set-content utf16be.txt -encoding BigEndianUnicode -nonewline
$list = get-content utf16be.txt -Encoding Byte -readcount 2 | 
  % { '\u{0:x2}{1:x2}' -f $_[0],$_[1] } | select -skip 1
-join $list

\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
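
As a side note (not part of the original answer), Get-Content -Encoding Byte only exists in Windows PowerShell; PowerShell 7+ uses -AsByteStream instead. The same byte-pair formatting can also be done in memory, a sketch that skips the temp file and therefore has no BOM to strip:

# In-memory variant (sketch): get the UTF-16 BE bytes directly, then format each
# 2-byte pair as a \uXXXX escape.
$bytes = [System.Text.Encoding]::BigEndianUnicode.GetBytes('кириллица')
-join (0..($bytes.Count / 2 - 1) | ForEach-Object {
    '\u{0:x2}{1:x2}' -f $bytes[2 * $_], $bytes[2 * $_ + 1]
})
# -> \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430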
旧街凉风 2025-01-24 21:59:21

There must be a simpler way of doing this, but this could work for you:

$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')){
    $nline = $line.Split(' ', 2)
    # output an object straight away so it gets collected in variable $temp
    [PsCustomObject]@{
        id    = $nline[0]   #stores "1" from file
        value = (([system.Text.Encoding]::BigEndianUnicode.GetBytes($nline[1]) | 
                ForEach-Object {'{0:x2}' -f $_ }) -join '' -split '(.{4})' -ne '' | 
                ForEach-Object { '\u{0}' -f $_ }) -join ''
    }
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'
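
To see how the value expression above works, here is a sketch of the intermediate stages for a shorter, two-character input ('кл', chosen only for brevity):

# Stage-by-stage sketch of the GetBytes / hex / split / format chain.
$bytes = [System.Text.Encoding]::BigEndianUnicode.GetBytes('кл')       # 4 bytes: 04 3a 04 3b
($bytes | ForEach-Object { '{0:x2}' -f $_ }) -join ''                  # '043a043b'
'043a043b' -split '(.{4})' -ne ''                                      # '043a', '043b'
('043a', '043b' | ForEach-Object { '\u{0}' -f $_ }) -join ''           # '\u043a\u043b'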

Simpler using .ToCharArray():

$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')){
    $nline = $line.Split(' ', 2)
    # output an object straight away so it gets collected in variable $temp
    [PsCustomObject]@{
        id    = $nline[0]   #stores "1" from file
        value = ($nline[1].ToCharArray() | ForEach-Object {'\u{0:x4}' -f [uint16]$_ }) -join ''
    }
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'

Value "кириллица" will be converted to \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
