Creating gettext binary MO files with Java

Posted 2025-01-02 11:14:01

I tried creating a utility to parse a gettext po file and generate a binary mo file. The parser is simple (my company doesn't use fuzzy entries, plurals, etc., just msgid/msgstr), but the generator doesn't work.

Here is the description of the mo file, here is the original generator source (it's C), and I also found a PHP script (https://github.com/josscrowcroft/php.mo/blob/master/php-mo.php).

My code:

public void writeFile(String filename, Map<String, String> polines) throws FileNotFoundException, IOException {

  DataOutputStream os = new DataOutputStream(new FileOutputStream(filename));
  HashMap<String, String> bvc = new HashMap<String, String>();
  TreeMap<String, String> hash = new TreeMap(bvc);
  hash.putAll(polines);


  StringBuilder ids = new StringBuilder();
  StringBuilder strings = new StringBuilder();
  ArrayList<ArrayList> offsets = new ArrayList<ArrayList>();
  ArrayList<Integer> key_offsets = new ArrayList<Integer>();
  ArrayList<Integer> value_offsets = new ArrayList<Integer>();
  ArrayList<Integer> temp_offsets = new ArrayList<Integer>();

  for (Map.Entry<String, String> entry : hash.entrySet()) {
    String id = entry.getKey();
    String str = entry.getValue();

    ArrayList<Integer> offsetsItems = new ArrayList<Integer>();
    offsetsItems.add(ids.length());
    offsetsItems.add(id.length());
    offsetsItems.add(strings.length());
    offsetsItems.add(str.length());
    offsets.add((ArrayList) offsetsItems.clone());

    ids.append(id).append('\0');
    strings.append(str).append('\0');
  }
  Integer key_start = 7 * 4 + hash.size() * 4 * 4;
  Integer value_start = key_start + ids.length();

  Iterator e = offsets.iterator();
  while (e.hasNext()) {
    ArrayList<Integer> offEl = (ArrayList<Integer>) e.next();
    key_offsets.add(offEl.get(1));
    key_offsets.add(offEl.get(0) + key_start);
    value_offsets.add(offEl.get(3));
    value_offsets.add(offEl.get(2) + value_start);
  }

  temp_offsets.addAll(key_offsets);
  temp_offsets.addAll(value_offsets);


  os.writeByte(0xde);
  os.writeByte(0x12);
  os.writeByte(0x04);
  os.writeByte(0x95);

  os.writeByte(0x00);
  os.writeInt(hash.size() & 0xff);
  os.writeInt((7 * 4) & 0xff);
  os.writeInt((7 * 4 + hash.size() * 8) & 0xff);
  os.writeInt(0x00000000);
  os.writeInt(key_start & 0xff);

  Iterator offi = temp_offsets.iterator();
  while (offi.hasNext()) {
    Integer off = (Integer) offi.next();
    os.writeInt(off & 0xff);
  }
  os.writeUTF(ids.toString());
  os.writeUTF(strings.toString());

  os.close();
}

The line os.writeInt(key_start); seems OK; the differences from the mo file generated by the original tool start after these bytes.

What's wrong? (aside from my scary English..)

-小熊_ 2025-01-09 11:14:01

When comparing your implementation with the documentation I noticed two things:

~~The revision, directly after the magic number, should be an int.~~ This seems to work, probably because the single byte written by writeByte shifts the following big-endian ints by one position, so values below 256 happen to line up as little-endian. Using writeInt would be clearer, however.
The & 0xFF part in the writeInt calls is probably wrong. This operation is needed to convert a signed byte to its unsigned integer value; for positive integers it should not be needed, and it truncates any value above 255.
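
To make the second point concrete, here is a small sketch of what the mask does in each case. The class and method names (MaskDemo, toUnsigned, maskInt) are made up for illustration and are not part of the question's code.

```java
class MaskDemo {
    // Widen a signed byte to its unsigned value (0..255); this is the case where & 0xFF is needed.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Masking an int keeps only its lowest 8 bits, so values above 255 are silently truncated.
    static int maskInt(int value) {
        return value & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(toUnsigned((byte) 0xDE)); // 222
        System.out.println(maskInt(28));             // 28, unchanged
        System.out.println(maskInt(300));            // 44, truncated
    }
}
```

With only a handful of short strings every offset stays below 256 and the mask goes unnoticed; a larger catalog would presumably produce wrong offsets.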

For parsing of the po files you could also have a look at the zanata/tennera project on github.

Edit: The writeUTF call is also problematic, since it prefixes the output with a two-byte length and mangles '\0' characters using Java's modified UTF-8 encoding. You could replace it with:

os.write(ids.toString().getBytes("utf-8"));
os.write(strings.toString().getBytes("utf-8"));
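
The difference can be seen by comparing the bytes both approaches produce for a short string that ends in a NUL terminator. This is only an illustrative sketch; WriteUtfDemo and its helper methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class WriteUtfDemo {
    // Bytes produced by DataOutputStream.writeUTF: 2-byte length prefix + modified UTF-8.
    static byte[] viaWriteUTF(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeUTF(s);
        dos.flush();
        return bos.toByteArray();
    }

    // Plain UTF-8 bytes, which is what the MO format expects.
    static byte[] viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String s = "a\0";
        System.out.println(Arrays.toString(viaWriteUTF(s))); // [0, 3, 97, -64, -128]
        System.out.println(Arrays.toString(viaGetBytes(s))); // [97, 0]
    }
}
```

The first two bytes of the writeUTF output are the length prefix, and the NUL character becomes the two-byte sequence C0 80 instead of a single zero byte.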

Another Edit: I could not let go of this code; there were further problems concerning string length in chars vs. UTF-8 bytes, and DataOutputStream writing in big-endian instead of little-endian. I think the following code should work; the difference is that the file produced by msgfmt contains an optional hash table to speed up access:

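import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.TreeMap;

// write an int in little-endian byte order (lowest byte first)
public static void writeInt(OutputStream os, int i) throws IOException {
    os.write((i) & 0xFF);
    os.write((i >>> 8) & 0xFF);
    os.write((i >>> 16) & 0xFF);
    os.write((i >>> 24) & 0xFF);
}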

public static void writeFile(String filename, TreeMap<String, String> polines) throws IOException {
    OutputStream os = new BufferedOutputStream(new FileOutputStream(filename));
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    int size = polines.size();
    int[] indices = new int[size*2];
    int[] lengths = new int[size*2];
    int idx = 0;
    // write the strings and translations to a byte array and remember offsets and length in bytes
    for (String key : polines.keySet()) {
        byte[] utf = key.getBytes("utf-8");
        indices[idx] = bos.size();
        lengths[idx] = utf.length;
        bos.write(utf);
        bos.write(0);
        idx++;
    }
    for (String val : polines.values()) {
        byte[] utf = val.getBytes("utf-8");
        indices[idx] = bos.size();
        lengths[idx] = utf.length;
        bos.write(utf);
        bos.write(0);
        idx++;
    }

    try {
        int headerLength = 7*4;
        int tableLength = size*2*2*4;
        writeInt(os, 0x950412DE);                   // magic
        writeInt(os, 0);                            // file format revision
        writeInt(os, size);                         // number of strings
        writeInt(os, headerLength);                 // offset of table with original strings
        writeInt(os, headerLength + tableLength/2); // offset of table with translation strings
        writeInt(os, 0);                            // size of hashing table
        writeInt(os, headerLength + tableLength);   // offset of hashing table, not used since length is 0

        for (int i=0; i<size*2; i++) {
            writeInt(os, lengths[i]);
            writeInt(os, headerLength + tableLength + indices[i]);
        }

        // copy keys and translations
        bos.writeTo(os);

    } finally {
        os.close();
    }
}
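
To check a generated file without relying on msgfmt, one option is to read it back. The following is a minimal sketch of such a reader, assuming a little-endian file with the layout written above (magic, revision, string count, two table offsets) and UTF-8 strings; MoCheck, parse, and read are hypothetical names, and the hash table is ignored.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class MoCheck {
    // Parse a little-endian MO image and return msgid -> msgstr in table order.
    static Map<String, String> parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != 0x950412DE) {
            throw new IllegalArgumentException("not a little-endian MO file");
        }
        int count = buf.getInt(8);       // number of strings
        int origTable = buf.getInt(12);  // offset of table with original strings
        int transTable = buf.getInt(16); // offset of table with translation strings
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String id = read(data, buf, origTable + 8 * i);
            String str = read(data, buf, transTable + 8 * i);
            result.put(id, str);
        }
        return result;
    }

    // Each table entry is a (length, offset) pair; the length excludes the trailing NUL.
    static String read(byte[] data, ByteBuffer buf, int entry) {
        int length = buf.getInt(entry);
        int offset = buf.getInt(entry + 4);
        return new String(data, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        parse(data).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Running it with the path of the generated file as the only argument should print every msgid/msgstr pair; if the lengths or offsets are off, the strings come out shifted or cut short.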
