How do I convert MathType equations to MathML format?
I want to convert MathType equations saved in GIF format to MathML. First, I opened the GIF files and re-saved them in MathType 6.7. As a result, MathML text was appended to the end of each GIF file. However, when I extracted the MathML text from these GIF files with a Perl script, I found garbled characters in it, as in the following line:
<mn>xxx</mn>
In the line above, a garbled character is inserted before the 'mn' tag. Is this a MathType bug? How can I work around this problem? I have uploaded my test GIF files here: http://ubuntuone.com/p/1352/
Update:
I tried to paste the full block of MathML here, but the formatting of the MathML text got mangled, so I pasted it on GitHub instead: https://gist.github.com/1068723.
There is a garbled character in the seventh line of MathML text: " ?#x00A0;".
The original GIF file, which doesn't contain MathML text: http://ubuntuone.com/p/13Ba/
The Perl script that extracts MathML from a GIF image generated by MathType: https://gist.github.com/1068749
Thanks,
thinkhy
Comments (1)
Thanks thinkhy. It could be that you are extracting the data incorrectly (we haven't looked at your script yet). Only one of your GIFs had MathML: the one whose file name starts with 106R. In that one, if you just grab all the bytes from the first bit that looks like MathML until the end, you do periodically get odd bytes in there, mostly 255's except the last one. (This, however, doesn't appear to be the junk character you're seeing.) The reason for the 255's is that the MathML is distributed over multiple comment records, each of which starts with a count of the bytes in the record. From the MathType SDK (free download; link below):
GIF Image Files
MathML text is embedded into a GIF file as an Application Extension Record, which consists of a 14-byte header (Application Extension Descriptor), followed by the MTEF data. The header contains:
The data follows this header and is written as a series of blocks each containing 255 bytes or less. Each block starts with a single byte count followed by the data. The end is marked as a block with length 0.
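The block layout described above can be sketched in Python as follows. This is a minimal illustration on synthetic data, not MathType's or the SDK's own code: the names `read_sub_blocks` and `write_sub_blocks` and the sample payload are made up for the example. It shows why a naive grab of the raw bytes picks up stray 255 (0xFF) count bytes, and how honoring the counts removes them.

```python
from typing import Tuple


def read_sub_blocks(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """Join a chain of length-prefixed blocks starting at offset pos.

    Returns the concatenated payload and the offset just past the
    zero-length block that terminates the chain.
    """
    out = bytearray()
    while True:
        if pos >= len(buf):
            raise ValueError("truncated block chain: no terminator found")
        size = buf[pos]  # single-byte count in front of every block
        pos += 1
        if size == 0:  # a block of length 0 marks the end
            return bytes(out), pos
        chunk = buf[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated block: fewer bytes than the count")
        out += chunk
        pos += size


def write_sub_blocks(payload: bytes) -> bytes:
    """Hypothetical helper: split payload into blocks of at most 255 bytes."""
    out = bytearray()
    for i in range(0, len(payload), 255):
        chunk = payload[i:i + 255]
        out.append(len(chunk))
        out += chunk
    out.append(0)  # terminator block
    return bytes(out)


# Synthetic stand-in for embedded MathML (313 bytes, so it needs two blocks).
sample = b"<math>" + b"<mn>1</mn>" * 30 + b"</math>"
encoded = write_sub_blocks(sample)

# Reading the raw bytes directly would include the count bytes (255, then 58)
# in the middle of the markup; decoding the chain strips them.
decoded, end = read_sub_blocks(encoded, 0)
```

In a real file, `pos` would be the offset of the first count byte after the 14-byte header rather than 0.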
The header is unique enough that the easiest way to extract the data might be to scan the file for the 14-byte header and then expect the MathML data blocks to follow. Properly decoding the GIF records isn't that hard either, but it does require reading the GIF specification.
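For the second approach, decoding the GIF records properly, a rough Python sketch based on the GIF89a layout might look like the one below. It walks the file block by block and yields every Application Extension it finds. The function names are illustrative, and the identifier `EXAMPLE1` with code `0.0` in the demo is a placeholder, not the identifier MathType actually writes (that value is not given here, so check the SDK documentation or inspect the yielded identifiers).

```python
from typing import Iterator, Tuple


def _sub_blocks(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """Join length-prefixed sub-blocks; stop at the zero-length block."""
    out = bytearray()
    while True:
        if pos >= len(buf):
            raise ValueError("truncated sub-block chain")
        size = buf[pos]
        pos += 1
        if size == 0:
            return bytes(out), pos
        out += buf[pos:pos + size]
        pos += size


def iter_application_extensions(gif: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (11-byte identifier + auth code, payload) per extension."""
    if gif[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    flags = gif[10]  # packed field of the logical screen descriptor
    pos = 13  # 6-byte signature + 7-byte logical screen descriptor
    if flags & 0x80:  # global color table: 3 * 2^(N+1) bytes
        pos += 3 * (2 << (flags & 0x07))
    while pos < len(gif):
        marker = gif[pos]
        if marker == 0x3B:  # trailer: end of the GIF data stream
            return
        if marker == 0x2C:  # image descriptor (10 bytes with separator)
            img_flags = gif[pos + 9]
            pos += 10
            if img_flags & 0x80:  # optional local color table
                pos += 3 * (2 << (img_flags & 0x07))
            pos += 1  # LZW minimum code size byte
            _, pos = _sub_blocks(gif, pos)  # skip the compressed pixels
        elif marker == 0x21:  # extension introducer
            label = gif[pos + 1]
            pos += 2
            if label == 0xFF:  # application extension
                size = gif[pos]  # 11: 8-byte identifier + 3-byte code
                ident = gif[pos + 1:pos + 1 + size]
                pos += 1 + size
                data, pos = _sub_blocks(gif, pos)
                yield ident, data
            else:  # comment, graphic control, plain text: skip
                _, pos = _sub_blocks(gif, pos)
        else:
            raise ValueError("unexpected block type 0x%02X" % marker)


# Tiny synthetic 1x1 GIF carrying one application extension (placeholder id).
payload = b"<math><mn>1</mn></math>"
demo = (
    b"GIF89a" + b"\x01\x00\x01\x00\x80\x00\x00" + bytes(6)      # header, GCT
    + b"\x2c" + bytes(4) + b"\x01\x00\x01\x00\x00"              # image desc.
    + b"\x02" + b"\x02\x4c\x01\x00"                             # pixel data
    + b"\x21\xff\x0b" + b"EXAMPLE1" + b"0.0"                    # 14-byte header
    + bytes([len(payload)]) + payload + b"\x00"                 # data blocks
    + b"\x3b"                                                   # trailer
)
found = list(iter_application_extensions(demo))
```

With a real file, one way to pick out the MathML without knowing the identifier in advance would be to keep only the payloads that contain `b"<math"`. The sketch does little bounds checking, so a damaged file may raise `IndexError` rather than a tidy error.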
You may already be using the SDK, but you didn't say whether you were or not, so here's the link: http://www.dessci.com/en/reference/sdk/.