Unable to get Ruby to accept UTF-8 input
I have had this problem for several Ruby versions now, and in the meantime I have even changed both computer and OS. Still, I can't get past it. I'm now using Ruby to produce graphical overlays for my professional streaming services, so I really need to solve this once and for all.
Let's consider this thread a gigantic update to this old question I posted 1 year and 8 months ago, pertaining to what was then the current version of Ruby. Now I'm working on Windows 10, with Ruby at version 3.1.1.
Here's an MWE:
puts "Write something with accents such as àòèùì, or €"
asd = gets
puts asd
Here's what happens if I type any of the accented letters:
Here's what happens if I type "€":
In the old thread I mentioned above I used two commands that shouldn't be needed anymore. But let's try them for the sake of argument:
`chcp 65001`
puts "Write something with accents such as àòèùì, or €"
asd = gets
puts asd
chcp 65001 should switch the terminal's encoding to UTF-8, which should be the default as of 2022. However, if I use that line, something does change... for the worse.
If I type any accented letter I'll have to press return twice after typing the characters. And I'll get two broken glyphs instead of one.
If I instead type the "€" symbol the program will instantly crash, even before I hit return.
Adding # encode: utf-8 has no effect at all on the MWE, with or without the chcp 65001 command.
The issue here is that this little thing has deep consequences for any other program I write where I have to handle user input that might include accented letters.
For instance, here's what happens if I try to get the user input via tty-prompt.
require "tty-prompt"
prompt = TTY::Prompt.new
asd = prompt.ask("Write something with accents such as àòèùì, or €")
puts asd
Accented letters appear as that broken glyph while being inserted, then disappear instead of being shown after I hit return:
The "€" symbol instead is just shown as a question mark, as usual:
This issue extends even to characters that I don't type myself. For instance, Ruby isn't able to properly show the characters used by the gem tty-spinner. Here:
require "tty-spinner"
spinner = TTY::Spinner.new("[:spinner] Loading ...", format: :pulse_2)
spinner.auto_spin
sleep(2)
spinner.stop("Done!")
As you can see, it won't show the characters while being executed:
Finally, Ruby actually is able to read accented letters written in UTF-8 encoded text files, and it should be able to produce a UTF-8 encoded HTML file. However, I'm using OBS to access that file and OBS is unable to read it correctly, which makes me wonder whether the file really is encoded in UTF-8, since OBS should be able to read it in that case.
This program...
# Prefix the string with the given number of tab characters.
def indent(indentazione, stringa)
  ("\t" * indentazione) + stringa
end

# Read the first line of the text file written by the batch script.
testo = File.readlines('C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.txt', encoding: "UTF-8")[0].chomp

# Write the HTML page, encoded as UTF-8.
pagina = File.open('C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro.html', "w:UTF-8")
pagina.puts(indent(0, "<html>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "<head>"))
pagina.puts(indent(1, "<link rel=\"stylesheet\" href=\"../stile.css\">"))
pagina.puts(indent(0, "</head>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "<body>"))
pagina.puts(indent(1, "<div id=\"riquadro\">"))
pagina.puts(indent(2, "<p id=\"riquadro_testo\">" + testo + "</p>"))
pagina.puts(indent(1, "</div>"))
pagina.puts(indent(0, "</body>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "</html>"))
pagina.close # flush and release the file so other programs see the full page
puts "Operazione completata"
...will read this text file...
...created by this batch script...
@ECHO OFF
chcp 65001
SET /P data1= "Inserisci il testo del riquadro: "
ECHO %data1%> "C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.txt"
"C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.rb"
...and produce this HTML page...
<html>
<head>
	<link rel="stylesheet" href="../stile.css">
</head>
<body>
	<div id="riquadro">
		<p id="riquadro_testo">La magia nera della narrazione: età dei personaggi</p>
	</div>
</body>
</html>
...which will be correctly rendered by Opera...
...but not by OBS, which should be able to read UTF-8 encoded pages.
Luckily I can solve this latter problem by converting all accented letters to their respective HTML code. Still, it'd be nice if everything just worked.
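The entity workaround mentioned above could look roughly like the following sketch. The helper name to_html_entities is hypothetical, and the sketch assumes the input string is valid UTF-8. It replaces every non-ASCII character with a numeric character reference, so the resulting page is pure ASCII and no longer depends on how the reader guesses the file's encoding.

```ruby
# Hypothetical helper (not part of the original program): replace every
# non-ASCII character with a numeric HTML character reference.
# Assumes `text` is a valid UTF-8 string.
def to_html_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

puts to_html_entities("La magia nera della narrazione: età dei personaggi")
puts to_html_entities("10 €")
```

In the generator, this would be applied to testo before it is concatenated into the paragraph tag.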
To me it clearly looks like Ruby has some issue in managing UTF-8 encoded files. It might well be that I'm missing something about how to deal with them, or that I set something incorrectly. All suggestions are welcome.
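One thing that may be worth checking, and this is an assumption rather than something confirmed in this thread, is that the generated page never declares its own encoding. Without a charset declaration, a browser has to guess, and different browsers (or embedded browser components) can guess differently. The sketch below is a hypothetical rewrite of the generator: the names build_page and write_page are made up for illustration. It adds a meta charset tag and uses the block form of File.open, so the file is closed as soon as writing finishes.

```ruby
# Hypothetical rewrite of the page generator: build the HTML as one string
# and declare the encoding inside the page itself.
def build_page(testo)
  <<~HTML
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <link rel="stylesheet" href="../stile.css">
    </head>
    <body>
      <div id="riquadro">
        <p id="riquadro_testo">#{testo}</p>
      </div>
    </body>
    </html>
  HTML
end

# Write the page as UTF-8. The block form closes (and flushes) the file
# as soon as the block ends.
def write_page(path, testo)
  File.open(path, "w:UTF-8") { |file| file.write(build_page(testo)) }
end

puts build_page("La magia nera della narrazione: età dei personaggi")
```

Whether this changes what OBS displays is untested here; it only removes one source of guessing.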
UPDATE
As indicated by @Holger Just, the issue seems to be mostly caused by the default Windows 10 terminal. I solved the problem by downloading its sort-of-updated version from the Microsoft Store, "Windows Terminal".
If I run the first MWE I provided in said terminal, I can type the accented letters without hassle and correctly receive them back as output:
It still doesn't work with the EUR symbol though:
If I include the chcp 65001 part, the program shows similar issues as before. If I type an accented letter I need to press return twice, and then I receive these two symbols as output:
It will crash if I type the EUR symbol.
Comments (1)
This is probably related to most Windows shells NOT using UTF-8 encoding on their own. Thus, if an external program (such as your Ruby program) reads data from the shell it is likely not encoded in UTF-8 (as expected by Ruby) but in some other encoding depending on your system.
However, Ruby has no way to actually know the encoding of the data. You may have to tell it. Since Ruby 3.0, Ruby defaults to assuming UTF-8 as the external encoding on Windows (see Feature #16604 for details). Previous versions used the "native" encoding of your Windows version, which could cause all kinds of issues when writing data to e.g. files.
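To make the mismatch concrete, here is a small sketch. It assumes, purely for illustration, that the console sends bytes in code page 850 (IBM850); the code page on any given machine may well be different. In that code page the letter è is the single byte 0x8A, which is not valid UTF-8 on its own.

```ruby
# Assumption for illustration: the console delivers bytes in code page 850.
# In that code page, "è" is the single byte 0x8A.
raw = "\x8A\n".b

# Labeling the bytes as UTF-8 without converting them leaves a broken string.
assumed_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts assumed_utf8.valid_encoding?

# Labeling the bytes with their real encoding and then transcoding to UTF-8
# recovers the intended text.
decoded = raw.dup.force_encoding(Encoding::IBM850).encode(Encoding::UTF_8)
puts decoded.valid_encoding?
puts decoded.codepoints.inspect
```

The first string is what a program ends up with when the label is wrong; the second shows that the bytes themselves were fine all along.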
Now, what happens in your example is that Ruby reads from the shell with gets. The shell provides some data which Ruby assumes to be in UTF-8 because of its Encoding.default_external setting, but which is not. Depending on how the shell interprets the data sent by Ruby, things could be unexpected...
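One way to tell Ruby about the real encoding is IO#set_encoding, which takes an external encoding (what the bytes are) and an internal encoding (what you want to work with). The sketch below is only an illustration: gets_as_utf8 is a hypothetical helper, code page 850 is an assumed console encoding, and a pipe stands in for the console so the example runs anywhere.

```ruby
# Hypothetical helper: read one line from `io`, treating the incoming bytes
# as `source_encoding` and transcoding them to UTF-8 while reading.
def gets_as_utf8(io, source_encoding)
  io.set_encoding(source_encoding, Encoding::UTF_8)
  io.gets
end

# Simulate a console that sends code page 850 bytes (an assumption) by
# writing raw bytes into a pipe.
reader, writer = IO.pipe
writer.binmode
writer.write("\x85\x95\x8A\n".b) # the letters "àòè" in code page 850
writer.close

line = gets_as_utf8(reader, Encoding::IBM850)
reader.close

puts line.encoding
puts line.valid_encoding?
```

With a real console, the call would be gets_as_utf8($stdin, ...) with whatever encoding the console actually uses, which has to be found out separately.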
The only actual solution would be to make sure that your shell agrees with Ruby about the encoding of the data they exchange. For that, you likely need to adjust the settings of your shell.
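Before changing any shell settings, it may help to see what Ruby itself currently assumes. The following is a minimal diagnostic sketch; encoding_report is a hypothetical helper name, and it only reports Ruby's side of the exchange, not the shell's code page.

```ruby
# Hypothetical diagnostic: collect the encodings Ruby currently assumes.
def encoding_report
  {
    default_external: Encoding.default_external,
    default_internal: Encoding.default_internal, # nil unless explicitly set
    locale: Encoding.find("locale"),
    stdin_external: $stdin.external_encoding,
    stdout_external: $stdout.external_encoding
  }
end

encoding_report.each { |name, value| puts "#{name}: #{value.inspect}" }
```

Comparing this output with the code page reported by the shell (for example via chcp) shows whether the two sides agree.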