Determining the input encoding by examining the input bytes

Posted on 2024-08-17 08:27:49

I'm getting console input from the user and want to encode it as UTF-8. My understanding is that C++ has no standard encoding for input streams, and that the encoding instead depends on the compiler, the runtime environment, the locale, and so on.

How can I determine the input encoding by examining the bytes of the input?

Comments (5)

浅忆流年 2024-08-24 08:27:49

In general, you can't. If I fire a stream of randomly generated bytes at your app, how can it determine their "encoding"? You simply have to specify that your application accepts certain encodings, or assume that what the OS hands you is suitably encoded.

幻梦 2024-08-24 08:27:49

In general, checking whether input is UTF-8 is a matter of heuristics: no algorithm can give you a definitive yes or no. The more elaborate the heuristic, the fewer false positives and false negatives you will get, but there is no sure way.

For an example of such heuristics, take a look at this library: http://utfcpp.sourceforge.net/

#include <fstream>
#include <iterator>
#include "utf8.h" // utfcpp header

// Returns true if the whole file is a valid UTF-8 byte sequence.
bool valid_utf8_file(const char* file_name)
{
    std::ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here

    std::istreambuf_iterator<char> it(ifs.rdbuf());
    std::istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}

You can either use it or read its source to see how it is done.
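For readers who want to see what such a validity check involves without pulling in the library, here is a minimal hand-written sketch. The function name looks_like_utf8 is hypothetical and not part of utfcpp, and this is not how the library itself is implemented. It only checks that the bytes are well-formed UTF-8, which is evidence for the encoding rather than proof of it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of utfcpp): returns true if `s` is
// well-formed UTF-8. Rejects overlong forms, surrogates, truncated
// sequences, and code points above U+10FFFF.
bool looks_like_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;

        // The lead byte determines the sequence length.
        if (b0 < 0x80)                     { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else return false; // stray continuation byte or invalid lead byte

        if (len > n - i) return false; // sequence cut off at end of input

        // Every following byte must have the form 10xxxxxx.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }

        if (len == 3 && cp < 0x800) return false;       // overlong 3-byte form
        if (len == 4 && cp < 0x10000) return false;     // overlong 4-byte form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        i += len;
    }
    return true;
}
```

Pure ASCII input always passes this check, so a true result does not distinguish ASCII from UTF-8. A false result, on the other hand, reliably says the input is not UTF-8.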

甩你一脸翔 2024-08-24 08:27:49

Use the means built into the operating system. These vary from one OS to another. On Windows, it is always better to use the WideChar APIs and not think about encoding at all.

And if your input comes from a file rather than a real console, all bets are off.
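One way to read "use the means built into the operating system" is to ask the OS which encoding it reports instead of guessing from the bytes. The sketch below is an assumption about what that could look like, not the answerer's code. The helper name console_input_encoding is hypothetical; it calls GetConsoleCP on Windows and nl_langinfo(CODESET) on POSIX systems, and it does not take the wide-character route recommended above for Windows.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
#include <clocale>
#include <langinfo.h>
#endif

// Hypothetical helper: returns a name for the encoding the OS reports for
// console input. It asks the OS; it does not inspect any input bytes.
std::string console_input_encoding()
{
#ifdef _WIN32
    // GetConsoleCP returns the input code page of the attached console,
    // for example 65001 for UTF-8. It returns 0 on failure.
    return "cp" + std::to_string(GetConsoleCP());
#else
    // Adopt the user's locale from the environment (this changes the
    // process-wide locale), then ask for that locale's codeset name.
    std::setlocale(LC_ALL, "");
    const char* codeset = nl_langinfo(CODESET);
    return codeset ? codeset : "";
#endif
}
```

The result only describes what the OS believes about the console. If the input is redirected from a file, the bytes may well be in a different encoding.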

丘比特射中我 2024-08-24 08:27:49

Jared Oberhaus answered this well on a related, Java-specific question.

Basically, there are a few steps you can take to make a reasonable guess, but without an explicit indication it is ultimately just guesswork. (Hence the (in)famous BOM at the start of some UTF-8 files.)
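A byte order mark is the one case where the bytes carry an explicit indication. As a rough illustration, the sketch below checks the start of a buffer for the standard Unicode BOM signatures. The helper name encoding_from_bom is hypothetical, and an empty result only means that no BOM was found, not that the text is in some other particular encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding implied by a leading byte order
// mark, or returns an empty string when no BOM is present.
std::string encoding_from_bom(const std::string& bytes)
{
    // Compares the first `len` bytes of the input against a signature.
    auto starts_with = [&bytes](const char* sig, std::size_t len) {
        return bytes.size() >= len && bytes.compare(0, len, sig, len) == 0;
    };

    // The UTF-32 checks must come before UTF-16: the UTF-32LE BOM begins
    // with the same two bytes (FF FE) as the UTF-16LE BOM.
    if (starts_with("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (starts_with("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (starts_with("\xEF\xBB\xBF", 3))     return "UTF-8";
    if (starts_with("\xFE\xFF", 2))         return "UTF-16BE";
    if (starts_with("\xFF\xFE", 2))         return "UTF-16LE";
    return "";
}
```

Even this is not airtight: UTF-16LE text whose first character after the BOM is U+0000 has the same leading bytes as the UTF-32LE BOM, and console input typically carries no BOM at all.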

坠似风落 2024-08-24 08:27:49

As has already been said in the answers to the question John Weldon pointed to, a number of libraries do character encoding recognition. You could also look at the source of the Unix file command and see which tests it uses to determine a file's encoding. From the file man page:

ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set.

PCRE provides a function that tests whether a given string is entirely valid UTF-8.
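To give a feel for what a test on byte ranges looks like, here is a much-simplified sketch. It is not the logic of the file command, and the helper name classify_8bit_text is hypothetical. It assumes UTF-8 and UTF-16 have already been ruled out, and it relies on one fact: the ISO-8859 parts place printable characters in 0xA0..0xFF and leave 0x80..0x9F to control codes, while many other 8-bit sets (Macintosh and IBM PC code pages, for example) put printable characters in 0x80..0x9F.

```cpp
#include <string>

// Hypothetical, simplified classifier. Assumes UTF-8 and UTF-16 have
// already been ruled out by other tests; ignores EBCDIC and ASCII control
// characters entirely.
std::string classify_8bit_text(const std::string& bytes)
{
    bool has_high = false; // saw a byte in 0xA0..0xFF
    bool has_c1 = false;   // saw a byte in 0x80..0x9F

    for (unsigned char b : bytes) {
        if (b >= 0xA0)      has_high = true;
        else if (b >= 0x80) has_c1 = true;
    }

    // Bytes in 0x80..0x9F are not printable in ISO-8859-x, so text that
    // contains them is more plausibly in some other 8-bit character set.
    if (has_c1)   return "non-ISO extended-ASCII";
    if (has_high) return "ISO-8859-x";
    return "ASCII";
}
```

The result is a family of encodings at best. Byte ranges alone cannot tell ISO-8859-1 from ISO-8859-2, for instance; that takes statistics about the language of the text or outside knowledge.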
