Translating source code into a foreign language
I run an educational website that teaches programming to kids (12-15 years old).
Since not all of them speak English, the source code of our solutions uses French variable and function names.
However, we are planning to translate the content into other languages (German, Spanish, English). To do so, I would like to translate the source code as quickly as possible.
We mostly have C/C++ code.
The solution I'm planning to use:
- extract all variable/function names from the source code, with their positions in the file (where they are declared, used, called...)
- remove all language keywords and library functions
- ask the translator to provide translations for the remaining names
- replace the names in the file
Is there already an open-source project that can do that? (For points 1, 2 and 4.)
If there isn't, the most difficult point is the first one: using a C/C++ parser to build a syntax tree and then extracting the variables with their positions seems the way to go. Do you have other ideas?
Thank you for any advice.
Edit:
As noted in a comment, I will also need to take care of the comments, but there are only a few of them: the complete solution is already explained in plain text, and then we show the source code with self-explanatory variable/function names. The source code is rarely more than 30-40 lines long, and good names should make it understandable without comments if you already know what the code is doing.
Additional info: for those interested, the website is a training platform for the International Olympiad in Informatics, and C/C++ (at least the minimum needed for programming contests) is not so difficult for a 12-year-old to learn.
Comments (4)
Are you sure you need a full syntax tree for this? I think it would be enough to do lexical analysis to find the identifiers, which is much easier. Then exclude keywords and identifiers that also appear in the header files being included.
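The lexical approach described above can be sketched roughly as follows. This is only an illustration: the `extractIdentifiers` name and the `excluded` set (which would hold the keywords plus whatever identifiers were collected from the included headers) are assumptions, and the scanner does not recognize string literals or comments.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Returns every identifier found in `source`, mapped to the byte offsets
// where it occurs. Names listed in `excluded` (keywords, identifiers from
// the included headers) are skipped.
// Simplification: string literals and comments are not recognized, so words
// inside them are reported as identifiers too.
std::map<std::string, std::vector<std::size_t>> extractIdentifiers(
    const std::string& source, const std::set<std::string>& excluded) {
    std::map<std::string, std::vector<std::size_t>> positions;
    std::size_t i = 0;
    while (i < source.size()) {
        unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isalpha(c) || c == '_') {
            std::size_t start = i;
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) ||
                    source[i] == '_')) {
                ++i;
            }
            std::string name = source.substr(start, i - start);
            if (excluded.count(name) == 0) {
                positions[name].push_back(start);
            }
        } else if (std::isdigit(c)) {
            // Skip a whole numeric literal so that suffixes such as the
            // "u" in "10u" are not mistaken for identifiers.
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) ||
                    source[i] == '.')) {
                ++i;
            }
        } else {
            ++i;
        }
    }
    return positions;
}
```

The keys of the returned map are the list to hand to the translators, and the recorded offsets say where each name has to be replaced afterwards.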
In principle it is possible that you want different variables with the same English name to be translated to different words in French/German -- but for educational use the risk of this arising is probably small enough to ignore at first. You could sidestep the issue by writing the original sources with some disambiguating quasi-Hungarian prefixes and then remove these with the same translation mechanism for display to English-speaking end users.
Be sure to let translators see the name they are translating with full context before they choose a translation.
I really think you can use clang (libclang) to parse your sources and do what you want (see here for more information). The good news is that it has Python bindings, which will make your life easier if you want to access a translation service or something like that.
You don't really need a C/C++ parser, just a simple lexer that gives you elements of the code one by one. Then you get a lot of tokens such as `{`, `[`, `213`, `)`, etc. that you simply ignore and write to the result file. You translate whatever consists of only letters (except keywords) and you put them in the output. Now that I think about it, it's as simple as this:
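A minimal sketch of such a loop, written here for illustration only: the `translateSource` name, the `dictionary` parameter and the abbreviated `kKeywords` list are assumptions. Like the description, it treats only letters as part of a word, so identifiers containing digits or underscores would be split, and words inside string literals or comments are translated as well.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>

// Abbreviated keyword list (assumption: a real tool would list every C/C++
// keyword plus the library names that must stay untouched).
const std::set<std::string> kKeywords = {
    "int", "char", "double", "void", "if", "else", "for", "while", "return"};

// Copies `source` to the result, replacing each word found in `dictionary`.
// Keywords and words without a translation are copied unchanged.
std::string translateSource(
    const std::string& source,
    const std::map<std::string, std::string>& dictionary) {
    std::string result;
    std::string word;
    // Emits the pending word, translated when appropriate.
    auto flushWord = [&]() {
        if (word.empty()) return;
        auto it = dictionary.find(word);
        if (kKeywords.count(word) == 0 && it != dictionary.end()) {
            result += it->second;
        } else {
            result += word;
        }
        word.clear();
    };
    for (char c : source) {
        if (std::isalpha(static_cast<unsigned char>(c))) {
            word += c;    // still inside a word
        } else {
            flushWord();  // the word (if any) just ended
            result += c;  // punctuation, digits, whitespace: unchanged
        }
    }
    flushWord();          // the input may end in the middle of a word
    return result;
}
```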
I wrote the code in the editor, so there may be minor errors. Tell me if there are any and I'll fix it.
Edit: Explanation:
The code simply reads the input character by character, outputting every non-letter character it reads (including spaces, tabs and newlines) unchanged. When it sees a letter, it collects that letter and all following letters into one string, until it reaches a non-letter. If the string is a keyword, it outputs the keyword itself; if not, it translates the string and outputs the translation.
The output would have the exact same format as the input.
I don't think replacing identifiers in the code is a good idea.
First, you are not going to get decent translations. A very important point here is that translation (especially automatic or pretty dumb translation) loses and distorts information. You may actually end up with something that's worse than the original.
Second, if the code is meant to be compiled again, the compiler may not be able to compile code containing non-English letters in the translated identifiers.
Third, if you replace identifiers with something else, you need to make sure you don't replace 2 or more different identifiers with the same word. That'll either make the code non-compilable or ruin its logic.
Fourth, you must make sure you don't translate reserved words or identifiers coming from the standard library of the language. Translating those will make the code non-compilable and unreadable. Distinguishing the identifiers that the programmer has defined from those provided by the language and its standard library may not be a trivial task.
What I'd do instead of replacing identifiers with their translations is provide the translations as comments next to them, for example:
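A hypothetical illustration of this style (the French identifiers and their English glosses are made up for this example, not taken from the site in question):

```cpp
#include <cstddef>

// Returns the sum of the first `taille` elements of `tableau`.
int sommeTableau /* arraySum */ (const int* tableau /* array */,
                                 std::size_t taille /* size */) {
    int somme = 0;                       // somme: sum
    for (std::size_t indice = 0;         // indice: index
         indice < taille; ++indice) {
        somme += tableau[indice];
    }
    return somme;
}
```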
This way you lose no information due to incorrect translation and don't break the code.
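If identifiers are replaced after all, the third point above can be checked mechanically before any replacement is made. A minimal sketch, assuming the translations are kept in a `std::map` from original name to translated name (the `findCollisions` helper is hypothetical):

```cpp
#include <map>
#include <set>
#include <string>

// Returns every translated name that is the target of two or more
// different original identifiers.
std::set<std::string> findCollisions(
    const std::map<std::string, std::string>& dictionary) {
    std::map<std::string, int> uses;
    for (const auto& entry : dictionary) {
        ++uses[entry.second];
    }
    std::set<std::string> collisions;
    for (const auto& entry : uses) {
        if (entry.second > 1) collisions.insert(entry.first);
    }
    return collisions;
}
```

A fuller check would also reject any translation that equals a keyword or an identifier that is left untranslated in the same file.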