Creating gettext binary MO files with Java

Posted on 2025-01-02 11:14:01

I am trying to write a utility that parses gettext PO files and generates binary MO files. The parser is simple (my company does not use fuzzy entries, plurals, and so on, just msgid/msgstr), but the generator does not work.

I am working from the description of the MO file format and the original generator source (it is C), and I also found a PHP script (https://github.com/josscrowcroft/php.mo/blob/master/php-mo.php).

My code:

public void writeFile(String filename, Map<String, String> polines) throws FileNotFoundException, IOException {

  DataOutputStream os = new DataOutputStream(new FileOutputStream(filename));
  HashMap<String, String> bvc = new HashMap<String, String>();
  TreeMap<String, String> hash = new TreeMap(bvc);
  hash.putAll(polines);


  StringBuilder ids = new StringBuilder();
  StringBuilder strings = new StringBuilder();
  ArrayList<ArrayList> offsets = new ArrayList<ArrayList>();
  ArrayList<Integer> key_offsets = new ArrayList<Integer>();
  ArrayList<Integer> value_offsets = new ArrayList<Integer>();
  ArrayList<Integer> temp_offsets = new ArrayList<Integer>();

  for (Map.Entry<String, String> entry : hash.entrySet()) {
    String id = entry.getKey();
    String str = entry.getValue();

    ArrayList<Integer> offsetsItems = new ArrayList<Integer>();
    offsetsItems.add(ids.length());
    offsetsItems.add(id.length());
    offsetsItems.add(strings.length());
    offsetsItems.add(str.length());
    offsets.add((ArrayList) offsetsItems.clone());

    ids.append(id).append('\0');
    strings.append(str).append('\0');
  }
  Integer key_start = 7 * 4 + hash.size() * 4 * 4;
  Integer value_start = key_start + ids.length();

  Iterator e = offsets.iterator();
  while (e.hasNext()) {
    ArrayList<Integer> offEl = (ArrayList<Integer>) e.next();
    key_offsets.add(offEl.get(1));
    key_offsets.add(offEl.get(0) + key_start);
    value_offsets.add(offEl.get(3));
    value_offsets.add(offEl.get(2) + value_start);
  }

  temp_offsets.addAll(key_offsets);
  temp_offsets.addAll(value_offsets);


  os.writeByte(0xde);
  os.writeByte(0x12);
  os.writeByte(0x04);
  os.writeByte(0x95);

  os.writeByte(0x00);
  os.writeInt(hash.size() & 0xff);
  os.writeInt((7 * 4) & 0xff);
  os.writeInt((7 * 4 + hash.size() * 8) & 0xff);
  os.writeInt(0x00000000);
  os.writeInt(key_start & 0xff);

  Iterator offi = temp_offsets.iterator();
  while (offi.hasNext()) {
    Integer off = (Integer) offi.next();
    os.writeInt(off & 0xff);
  }
  os.writeUTF(ids.toString());
  os.writeUTF(strings.toString());

  os.close();
}

The line os.writeInt(key_start); seems OK; the differences from the MO file generated by the original tool start after those bytes.

What's wrong? (Aside from my scary English.)

Comments (1)

-小熊_ 2025-01-09 11:14:01

When comparing your implementation with the documentation I noticed two things:

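  1. The revision, directly after the magic number, should be an int. Your version appears to work only by coincidence: writeByte emits a single byte with no padding, which shifts every following big-endian writeInt by one byte, so values below 256 happen to land in little-endian position. Writing the revision as a real little-endian int would be clearer.
  2. The & 0xFF part in the writeInt calls is wrong. That mask is meant for converting a signed byte to its unsigned integer value; applied to an int it discards everything above the low 8 bits, so any count, length, or offset above 255 is truncated.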
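Both points can be checked in isolation. The following is a minimal sketch, not part of the original code: the class name WriteIntDemo and the offset value 300 are made up for illustration. It dumps the bytes that DataOutputStream.writeInt produces with and without the mask.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteIntDemo {
    /** Bytes produced by DataOutputStream.writeInt, which always writes big-endian. */
    static byte[] bigEndian(int value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(value);
        }
        return buf.toByteArray();
    }

    /** Space-separated lowercase hex dump of a byte array. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        int offset = 300; // hypothetical string offset larger than 255
        System.out.println(hex(bigEndian(offset)));        // 00 00 01 2c
        System.out.println(hex(bigEndian(offset & 0xff))); // 00 00 00 2c
    }
}
```

The second line shows the masked value silently turning 300 into 44, which is why a file with more than a handful of short strings would come out with wrong offsets.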

To parse the PO files, you could also have a look at the zanata/tennera project on GitHub.

Edit: The writeUTF call is also problematic: it prefixes the output with a two-byte length and, because it uses Java's modified UTF-8 encoding, writes each '\0' as the two bytes 0xC0 0x80 instead of a single zero byte. You could replace it with:

os.write(ids.toString().getBytes("utf-8"));
os.write(strings.toString().getBytes("utf-8"));
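To see the difference between the two approaches, here is a small sketch (the class name WriteUtfDemo and the sample string are assumptions made for the demonstration) that encodes the same NUL-terminated string both ways and returns the raw bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteUtfDemo {
    /** Bytes written by DataOutputStream.writeUTF: 2-byte length, then modified UTF-8. */
    static byte[] viaWriteUTF(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    /** Plain UTF-8 bytes with no prefix; '\0' stays a single zero byte. */
    static byte[] viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String ids = "a\0"; // one msgid followed by its terminator
        for (byte b : viaWriteUTF(ids)) {
            System.out.printf("%02x ", b & 0xFF); // 00 03 61 c0 80
        }
        System.out.println();
        for (byte b : viaGetBytes(ids)) {
            System.out.printf("%02x ", b & 0xFF); // 61 00
        }
        System.out.println();
    }
}
```

Only the second form matches what an MO reader expects: the string bytes followed by exactly one zero byte.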

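Another edit: I could not let go of this code. There were further problems: string lengths were counted in chars rather than UTF-8 bytes, and DataOutputStream writes big-endian instead of little-endian. I think the following code should work; the only difference is that the file produced by msgfmt contains an optional hash table to speed up access:

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.TreeMap;

// writes a 32-bit int in little-endian byte order (least significant byte first)
public static void writeInt(OutputStream os, int i) throws IOException {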
    os.write((i) & 0xFF);
    os.write((i >>> 8) & 0xFF);
    os.write((i >>> 16) & 0xFF);
    os.write((i >>> 24) & 0xFF);
}

public static void writeFile(String filename, TreeMap<String, String> polines) throws IOException {
    OutputStream os = new BufferedOutputStream(new FileOutputStream(filename));
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    int size = polines.size();
    int[] indices = new int[size*2];
    int[] lengths = new int[size*2];
    int idx = 0;
    // write the strings and translations to a byte array and remember offsets and length in bytes
    for (String key : polines.keySet()) {
        byte[] utf = key.getBytes("utf-8");
        indices[idx] = bos.size();
        lengths[idx] = utf.length;
        bos.write(utf);
        bos.write(0);
        idx++;
    }
    for (String val : polines.values()) {
        byte[] utf = val.getBytes("utf-8");
        indices[idx] = bos.size();
        lengths[idx] = utf.length;
        bos.write(utf);
        bos.write(0);
        idx++;
    }

    try {
        int headerLength = 7*4;
        int tableLength = size*2*2*4;
        writeInt(os, 0x950412DE);                   // magic
        writeInt(os, 0);                            // file format revision
        writeInt(os, size);                         // number of strings
        writeInt(os, headerLength);                 // offset of table with original strings
        writeInt(os, headerLength + tableLength/2); // offset of table with translation strings
        writeInt(os, 0);                            // size of hashing table
        writeInt(os, headerLength + tableLength);   // offset of hashing table, not used since length is 0

        for (int i=0; i<size*2; i++) {
            writeInt(os, lengths[i]);
            writeInt(os, headerLength + tableLength + indices[i]);
        }

        // copy keys and translations
        bos.writeTo(os);

    } finally {
        os.close();
    }
}
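One way to check a generated file without msgfmt is to read it back. The sketch below is not part of the answer above and only assumes the layout that writeFile produces (little-endian, 28-byte header, two tables of length/offset pairs). The class name MoReadBack and the hand-built sample image are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MoReadBack {
    /** Parses a little-endian MO image into msgid -> msgstr; the hash table is ignored. */
    static Map<String, String> read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != 0x950412DE) {
            throw new IllegalArgumentException("not a little-endian MO file");
        }
        int count = buf.getInt(8);       // number of strings
        int origTable = buf.getInt(12);  // offset of table with original strings
        int transTable = buf.getInt(16); // offset of table with translations
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            result.put(entry(data, buf, origTable + 8 * i),
                       entry(data, buf, transTable + 8 * i));
        }
        return result;
    }

    /** Each table entry is a (length, offset) pair of 32-bit ints. */
    private static String entry(byte[] data, ByteBuffer buf, int pos) {
        int length = buf.getInt(pos);
        int offset = buf.getInt(pos + 4);
        return new String(data, offset, length, StandardCharsets.UTF_8);
    }

    /** Hand-built one-entry image ("hi" -> "salut"), used only for the demonstration. */
    static byte[] sample() {
        byte[] id = "hi".getBytes(StandardCharsets.UTF_8);
        byte[] str = "salut".getBytes(StandardCharsets.UTF_8);
        int header = 7 * 4;
        int tables = 2 * 2 * 4;              // one entry per table, two ints each
        int idOff = header + tables;
        int strOff = idOff + id.length + 1;  // +1 for the NUL terminator
        ByteBuffer buf = ByteBuffer.allocate(strOff + str.length + 1)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0x950412DE).putInt(0).putInt(1)
           .putInt(header).putInt(header + 8).putInt(0).putInt(idOff);
        buf.putInt(id.length).putInt(idOff);
        buf.putInt(str.length).putInt(strOff);
        buf.put(id).put((byte) 0).put(str).put((byte) 0);
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(read(sample())); // {hi=salut}
    }
}
```

To check a real file, the bytes could be loaded with Files.readAllBytes(Paths.get(filename)) and passed to read; a mismatch between the returned map and the input map would point at a wrong length or offset.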