Can't get Ruby to accept UTF-8 input

Posted 2025-01-16 03:49:45 · 5903 characters · 0 views · 0 comments

I've had this problem for several Ruby versions now, and in the meantime I've changed both my computer and my OS. Still, I can't get past it. I'm now using Ruby to produce graphical overlays for my professional streaming services, so I really need to solve this once and for all.

Consider this thread a gigantic update to an old question I posted 1 year and 8 months ago about what was then the current version of Ruby. I'm now working on Windows 10 with Ruby 3.1.1.

Here's an MWE:

puts "Write something with accents such as àòèùì, or €"
asd = gets
puts asd

Here's what happens if I type any of the accented letters:

Here's what happens if I type "€":

In the old thread I mentioned above I used two commands that shouldn't be needed anymore. But let's try them for the sake of argument:

`chcp 65001`

puts "Write something with accents such as àòèùì, or €"
asd = gets
puts asd

chcp 65001 should switch the terminal's encoding to UTF-8, which really should be the default as of 2022. With that line, though, something does change... for the worse.

If I type an accented letter, I have to press Return twice after typing it, and I get two broken glyphs instead of one.

If I type the "€" symbol instead, the program crashes instantly, before I even hit Return.
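For anyone trying to reproduce this, it may help to look at the bytes Ruby actually receives rather than at the glyphs the console draws. The following is only a diagnostic sketch: `describe_input` is a hypothetical helper name, and the sample string stands in for whatever `gets` returns. For comparison, a correctly received "à" in UTF-8 is the two bytes C3 A0, and "€" is the three bytes E2 82 AC.

```ruby
# Diagnostic sketch: report how Ruby sees a string without printing the
# string itself to a console that may be misconfigured.
# `describe_input` is a hypothetical helper, not part of Ruby.
def describe_input(str)
  {
    encoding: str.encoding.name,   # encoding Ruby tagged the string with
    valid: str.valid_encoding?,    # do the bytes form valid text in it?
    bytes: str.bytes.map { |b| format("%02X", b) }.join(" ")
  }
end

# Process-wide settings that influence how input is decoded.
puts "default_external: #{Encoding.default_external}"
puts "default_internal: #{Encoding.default_internal.inspect}"
puts "stdin external:   #{$stdin.external_encoding.inspect}"

# In the MWE this would be called as: p describe_input(gets.chomp)
p describe_input("à€")
```

If the reported bytes differ from the UTF-8 values above, the text was already altered before Ruby's `puts` was involved, which would point at the console or code page rather than at the output step.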

Adding # encode: utf-8 has no effect at all on the MWE, with or without the chcp 65001 command.

The issue is that this little thing has deep consequences for every other program I write that has to handle user input that might include accented letters.
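As a side note, and as far as I know, the spelling Ruby recognises for this magic comment is `# encoding: utf-8` (or `# coding: utf-8`). It declares only the encoding of the source file itself, that is, of its string literals. It does not change how `gets` decodes console input, and since Ruby 2.0 the source encoding defaults to UTF-8 anyway. A minimal sketch of what the comment does and does not cover:

```ruby
# encoding: utf-8
# The line above is the recognised spelling of the magic comment. It must be
# the first line of the file (or the second, after a shebang) to take effect,
# and it only affects how this file's own literals are interpreted.

literal = "àòèùì €"

puts __ENCODING__       # source encoding of this file
puts literal.encoding   # literals inherit the source encoding
puts literal.length     # number of characters
puts literal.bytesize   # number of bytes, larger for non-ASCII text
```

Strings returned by `gets` are tagged according to the IO's external encoding instead, so they are unaffected by this comment.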

For instance, here's what happens if I try to get the user input via tty-prompt.

require "tty-prompt"

prompt = TTY::Prompt.new
asd = prompt.ask("Write something with accents such as àòèùì, or €")
puts asd

Accented letters appear as the same broken glyph while being typed, then disappear instead of being shown after I hit Return:

The "€" symbol, as usual, is just shown as a question mark:

The issue even extends to characters I don't type myself. For instance, Ruby can't properly display the characters used by the tty-spinner gem:

require "tty-spinner"

spinner = TTY::Spinner.new("[:spinner] Loading ...", format: :pulse_2)
spinner.auto_spin
sleep(2)
spinner.stop("Done!")

As you can see, the characters aren't shown while it runs:

Finally, Ruby actually can read accented letters written in UTF-8 encoded text files, and it should be able to produce a UTF-8 encoded HTML file. However, OBS, which I use to display that file, can't read it properly. That makes me wonder whether the file really is encoded in UTF-8, since OBS should be able to read it in that case.
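One way to settle whether a file really is UTF-8 is to inspect its raw bytes. The sketch below assumes nothing about the setup beyond a file path passed on the command line; `utf8_report` is a hypothetical helper name. For reference, a file saved in a legacy code page such as Windows-1252 stores "à" as the single byte E0, which is not valid UTF-8 on its own.

```ruby
# Sketch: check whether a file's raw bytes are valid UTF-8.
# `utf8_report` is a hypothetical helper, not part of Ruby.
def utf8_report(path)
  raw = File.binread(path)                        # bytes exactly as stored on disk
  has_bom = raw.start_with?("\xEF\xBB\xBF".b)     # UTF-8 byte order mark, if any
  text = raw.dup.force_encoding(Encoding::UTF_8)  # reinterpret only, no conversion
  {
    valid_utf8: text.valid_encoding?,
    bom: has_bom,
    ascii_only: text.ascii_only?
  }
end

# Usage (file name is a placeholder): ruby check_utf8.rb riquadro.html
p utf8_report(ARGV[0]) if ARGV[0]
```

If `valid_utf8` is true and `ascii_only` is false, the accented letters were written as UTF-8, and the rendering problem would have to lie in how the file is consumed rather than in how it was written.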

This program...

def indent (indentazione, stringa)
    unless indentazione == 0
        for cont in 1..indentazione
            stringa.prepend("\t")
        end
    end
    return stringa
end

testo = File.open('C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.txt', "r").readlines[0].chomp
pagina = File.open('C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro.html', "w:UTF-8")

pagina.puts(indent(0, "<html>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "<head>"))
pagina.puts(indent(1, "<link rel=\"stylesheet\" href=\"../stile.css\">"))
pagina.puts(indent(0, "</head>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "<body>"))
pagina.puts(indent(1, "<div id=\"riquadro\">"))
pagina.puts(indent(2, "<p id=\"riquadro_testo\">" + testo + "</p>"))
pagina.puts(indent(1, "</div>"))
pagina.puts(indent(0, "</body>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "</html>"))

puts "Operazione completata"

...will read this text file...

...created by this batch script...

@ECHO OFF
chcp 65001

SET /P data1= "Inserisci il testo del riquadro: "
ECHO %data1%> "C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.txt"

"C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.rb"

...and produce this HTML page...

<html>

<head>
    <link rel="stylesheet" href="../stile.css">
</head>

<body>
    <div id="riquadro">
        <p id="riquadro_testo">La magia nera della narrazione: età dei personaggi</p>
    </div>
</body>

</html>

...which will be correctly rendered by Opera...

...but not by OBS, which should be able to read UTF-8 encoded pages.

Luckily, I can work around this last problem by converting all accented letters to their respective HTML entities. Still, it'd be nice if everything just worked.

To me, it clearly looks like Ruby has some issue handling UTF-8 encoded files. It may well be that I'm missing something about how to deal with them, or that I've configured something incorrectly. All suggestions are welcome.
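The conversion workaround can be sketched as follows. This is an illustration rather than the code actually used: `to_numeric_entities` is a hypothetical helper name, and it emits numeric character references (such as `&#224;` for "à") instead of named entities, which keeps the generated page pure ASCII.

```ruby
# Sketch: replace every non-ASCII character with a numeric character
# reference so the generated HTML contains only ASCII bytes.
# `to_numeric_entities` is a hypothetical helper, not part of Ruby.
def to_numeric_entities(text)
  text.encode(Encoding::UTF_8).gsub(/[^\x00-\x7F]/) do |ch|
    format("&#%d;", ch.ord)   # ch.ord is the Unicode code point
  end
end

puts to_numeric_entities("La magia nera della narrazione: età dei personaggi")
```

Because the output is ASCII only, it renders the same regardless of which encoding the consumer assumes for the page.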

UPDATE

As @Holger Just pointed out, the issue seems to be mostly caused by the default Windows 10 terminal. I solved the problem by installing its sort-of-updated version, "Windows Terminal", from the Microsoft Store.

If I run the first MWE in that terminal, I can type accented letters without hassle and correctly get them back as output:

It still doesn't work with the EUR symbol though (screenshot "Round 2, #2": https://i.sstatic.net/Yg0Fr.png):

If I include the chcp 65001 part, the program shows similar issues as before. If I type an accented letter, I need to press Return twice, and I then get these two symbols as output:

It will crash if I type the EUR symbol.
