How can I identify different encodings without using a BOM?

I have a file watcher that grabs content from a growing file encoded in UTF-16LE. The first bit of data written to it has the BOM available -- I was using this to tell the encoding apart from UTF-8 (which most of my incoming files are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that since it's a growing file, not every bit of data has the BOM in it.

Here's my question -- without prepending the BOM bytes to each set of data I have (because I don't have control over the source), can I just look for the null bytes (\000) that are inherent in UTF-16, and use those as my identifier instead of the BOM? Will this cause me headaches down the road?

My architecture involves a Ruby web application logging the received data to a temporary file, which my parser, written in Java, then picks up.

Right now my identification/re-encoding code looks like this:

  // guess encoding: if UTF-16 then convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length - 1]);
    // available() is only a hint, but is adequate for a local file read in one shot
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    fis.close();

    // a UTF-16LE BOM is the byte pair 0xFF 0xFE
    if (contents.length >= 2 && contents[0] == (byte) 0xFF && contents[1] == (byte) 0xFE) {
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF-8");
      FileOutputStream fos = new FileOutputStream(args[args.length - 1]);
      fos.write(newBytes);
      fos.close();
    }
  } catch (Exception e) {
    e.printStackTrace();
  }

UPDATE

I want to support things like euro signs, em-dashes, and other such characters. I modified the above code to look like this, and it seems to pass all my tests for those characters:

  // guess encoding: if UTF-16 then convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length - 1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    fis.close();

    byte[] real = null;
    int found = 0;

    // if we found a BOM then skip the heuristic... we just need to convert
    if (contents.length >= 2 && contents[0] == (byte) 0xFF && contents[1] == (byte) 0xFE) {
      found = 3;
      real = contents;

    // no BOM detected, but it could still be UTF-16
    } else {
      // count null bytes among the first few bytes; ASCII-range UTF-16LE
      // text has a 0x00 high byte in every other position
      int limit = Math.min(10, contents.length);
      for (int cnt = 0; cnt < limit; cnt++) {
        if (contents[cnt] == (byte) 0x00) { found++; }
      }

      // tack a UTF-16LE BOM onto the front and copy the data in after it
      real = new byte[contents.length + 2];
      real[0] = (byte) 0xFF;
      real[1] = (byte) 0xFE;
      for (int ib = 2; ib < real.length; ib++) {
        real[ib] = contents[ib - 2];
      }
    }

    if (found >= 2) {
      String asString = new String(real, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF-8");
      FileOutputStream fos = new FileOutputStream(args[args.length - 1]);
      fos.write(newBytes);
      fos.close();
    }
  } catch (Exception e) {
    e.printStackTrace();
  }

What do you all think?

眼眸里的快感 2024-08-10 08:47:40

In general, you cannot identify the character encoding of a data stream with 100% accuracy. The best you can do is try to decode using a limited set of expected encodings, and then apply some heuristics to the decoded result to see if it "looks like" text in the expected language. (But any heuristic will give false positives and false negatives for certain data streams.) Alternatively, put a human in the loop to decide which decoding makes the most sense.
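For example, the "try to decode and see if it complains" part can be done with a strict CharsetDecoder; this is a minimal sketch of that idea, not the answerer's code:

  import java.nio.ByteBuffer;
  import java.nio.charset.CharacterCodingException;
  import java.nio.charset.Charset;
  import java.nio.charset.CodingErrorAction;

  class DecodeProbe {
    // True if the bytes decode cleanly under the named charset; malformed
    // input raises an exception instead of being silently replaced.
    static boolean decodesAs(byte[] data, String charsetName) {
      try {
        Charset.forName(charsetName).newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(data));
        return true;
      } catch (CharacterCodingException e) {
        return false;
      }
    }
  }

Bear in mind that a clean decode is necessary but not sufficient: nearly any even-length byte stream decodes "successfully" as UTF-16, which is exactly why the language-level heuristics (or a human) are needed on top.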

A better solution is to redesign your protocol so that whatever is supplying the data also has to supply the encoding scheme used for the data. (And if you cannot, blame whoever is responsible for designing / implementing a system that cannot give you an encoding scheme!)

EDIT: From your comments on the question, the data files are being delivered via HTTP. In that case, you should arrange for your HTTP server to capture the "Content-Type" header of the POST requests delivering the data, extract the charset / encoding from it, and save it in a way / place that your file parser can use.
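As an illustration, pulling the charset parameter out of a Content-Type value on the Java side might look like the sketch below (a naive hand-rolled parser; inside a servlet you would more likely rely on request.getCharacterEncoding()):

  // Pull "charset=..." out of a Content-Type header value, e.g.
  // "text/plain; charset=UTF-16LE" -> "UTF-16LE". Returns the
  // fallback when no charset parameter is present.
  static String charsetOf(String contentType, String fallback) {
    if (contentType != null) {
      for (String part : contentType.split(";")) {
        String p = part.trim();
        if (p.regionMatches(true, 0, "charset=", 0, 8)) {
          return p.substring(8).trim().replace("\"", "");
        }
      }
    }
    return fallback;
  }

charsetOf("text/plain; charset=utf-16le", "UTF-8") would then hand your parser "utf-16le" directly, with no guessing.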

只是偏爱你 2024-08-10 08:47:40

This will cause you headaches down the road, no doubt about it. You can check for alternating zero bytes in the simplistic cases (ASCII-only content in UTF-16, either byte order), but the minute you start getting a stream of characters above the 0x7F code point, that method becomes useless.

If you have the file handle, the best bet is to save the current file pointer, seek to the start, read the BOM then seek back to the original position.
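For instance, a minimal sketch of that save/seek/restore dance, assuming the watcher holds a RandomAccessFile on the growing file:

  import java.io.IOException;
  import java.io.RandomAccessFile;

  class BomPeek {
    // Peek at the first two bytes without disturbing the current read
    // position; true if the file starts with a UTF-16LE BOM (0xFF 0xFE).
    static boolean hasUtf16LeBom(RandomAccessFile raf) throws IOException {
      long mark = raf.getFilePointer(); // save the current position
      try {
        raf.seek(0);                    // jump to the start of the file
        int b0 = raf.read();
        int b1 = raf.read();
        return b0 == 0xFF && b1 == 0xFE;
      } finally {
        raf.seek(mark);                 // restore the original position
      }
    }
  }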

Either that or remember the BOM somehow.

Relying on the data contents is a bad idea unless you're absolutely certain the character range will be restricted for all inputs.

俯瞰星空 2024-08-10 08:47:40

This question contains a few options for character detection which don't appear to require a BOM.

My project is currently using jCharDet but I might need to look at some of the other options listed there as jCharDet is not 100% reliable.
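For reference, detection with jCharDet looks roughly like the sketch below. This follows the classic jchardet nsDetector sample usage, and the class and method names should be verified against whatever version you actually pull in:

  import org.mozilla.intl.chardet.nsDetector;
  import org.mozilla.intl.chardet.nsPSMDetector;

  class DetectDemo {
    // Feed a sample buffer to jchardet and return its ranked guesses,
    // or null when the input is pure ASCII (nothing to detect).
    static String[] guessCharsets(byte[] buf, int len) {
      nsDetector det = new nsDetector(nsPSMDetector.ALL);
      boolean ascii = det.isAscii(buf, len);
      if (!ascii) {
        det.DoIt(buf, len, false); // third argument false, as in the library's sample
      }
      det.DataEnd(); // flush the detector
      return ascii ? null : det.getProbableCharsets();
    }
  }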
