Determining the input encoding by examining the input bytes
I'm getting console input from the user and want to encode it as UTF-8. My understanding is that C++ has no standard encoding for input streams, and that the encoding instead depends on the compiler, the runtime environment, the locale, and so on.
How can I determine the input encoding by examining the bytes of the input?
Comments (5)
In general, you can't. If I fire a stream of randomly generated bytes at your app, how can it determine their "encoding"? You simply have to specify that your application accepts certain encodings, or assume that whatever the OS hands you will be suitably encoded.
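To illustrate the "specify what you accept" route: if the application declares that its input is ISO-8859-1 (Latin-1), converting to UTF-8 becomes a mechanical step with no guessing involved. The sketch below assumes exactly that; the Latin-1 assumption and the helper name latin1_to_utf8 are mine, not something from the question, and any other declared encoding would need its own conversion table or a library.

```cpp
#include <string>

// Hypothetical helper: assumes every input byte is one ISO-8859-1 code point
// (U+0000..U+00FF) and re-encodes it as UTF-8. This is only correct if the
// input really is Latin-1, which the application has to declare, not detect.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range is identical in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF need a two-byte sequence: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the single Latin-1 byte 0xE9 (é) comes out as the two bytes 0xC3 0xA9, while plain ASCII passes through unchanged.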
Generally, checking whether input is UTF is a matter of heuristics: no definitive algorithm can give you a yes/no answer. The more elaborate the heuristic, the fewer false positives and false negatives you will get, but there is no "sure" way.
For an example of such heuristics, take a look at this library: http://utfcpp.sourceforge.net/
You can either use it or read its source to see how it is done.
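One common heuristic of this kind is to test whether the bytes form well-formed UTF-8 at all. Here is a minimal from-scratch sketch of that check; it is not taken from the library linked above, and the function name is_valid_utf8 is just an illustrative choice. Keep in mind what a positive result means: the input could be UTF-8, not that it is. Pure ASCII, for instance, passes this test and is equally valid in many other encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the whole string is well-formed UTF-8
// (no stray or missing continuation bytes, no overlong forms, no surrogates,
// nothing above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;       // total bytes in this sequence
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest code point allowed for this length

        if (b0 < 0x80) { ++i; continue; }                                         // ASCII
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is truncated

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

In practice this works reasonably well as a filter, because non-ASCII text in a legacy single-byte encoding rarely happens to form valid multi-byte UTF-8 sequences. It still cannot tell you which legacy encoding you have when the check fails.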
Use the operating system's built-in facilities. These vary from one OS to another. On Windows, it is always better to use the WideChar APIs and not think about encoding at all.
And if your input comes from a file rather than a real console, all bets are off.
Jared Oberhaus answered this well on a related, Java-specific question.
Basically, there are a few steps you can take to make a reasonable guess, but without an explicit indication it is ultimately just guesswork. (Hence the (in)famous BOM marker in UTF-8 files.)
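Checking for a BOM is usually the first of those steps, since it is the closest thing to an explicit indication the bytes themselves can carry. A rough sketch follows; the enum Bom and the function detect_bom are names made up for this example. Console input will typically have no BOM at all, in which case this returns Bom::None and you are back to guessing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: looks for a Unicode byte order mark at the start of s.
Bom detect_bom(const std::string& s) {
    const std::size_t n = s.size();
    auto b = [&s](std::size_t i) { return static_cast<unsigned char>(s[i]); };

    // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with the
    // UTF-16 LE mark FF FE. The case is genuinely ambiguous (it could be
    // UTF-16 LE text starting with U+0000); preferring UTF-32 is an assumption.
    if (n >= 4 && b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00) return Bom::Utf32LE;
    if (n >= 4 && b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF) return Bom::Utf32BE;
    if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return Bom::Utf8;
    if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE) return Bom::Utf16LE;
    if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF) return Bom::Utf16BE;
    return Bom::None;  // no mark found: the encoding is still unknown
}
```

Note that the absence of a BOM proves nothing: plenty of UTF-8 data is written without one.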
As has already been said in response to the question John Weldon pointed to, a number of libraries do character encoding recognition. You could also take a look at the source of the Unix file command and see what tests it uses to determine file encoding. From the man page of file:
PCRE provides a function to test whether a given string is entirely valid UTF-8.