Automated way to do string extraction for Perl/Mason i18n?
I'm currently working on internationalizing a very large Perl/Mason web application, as a team of one (does that make this a death march??). The application is nearing 20 years old and is written in a relatively old-school Perl style; it doesn't use Moose or any other OO module. I'm currently planning to use Locale::Maketext::Gettext for message lookups, with GNU Gettext catalog files.
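For context, the lookup side would look roughly like this (a minimal sketch following the Locale::Maketext::Gettext synopsis; the "myapp" text domain, the locale directory, and the 'de' locale are placeholders):

    # In real code this package would live in its own module file.
    package MyApp::L10N;
    use base qw(Locale::Maketext::Gettext);

    package main;
    use strict;
    use warnings;

    # Get a language handle, then point it at the .mo files
    # compiled from the Gettext catalogs.
    my $lh = MyApp::L10N->get_handle('de')
        or die "No language handle available";
    $lh->bindtextdomain('myapp', '/usr/share/locale');  # placeholder path
    $lh->textdomain('myapp');                           # placeholder domain

    print $lh->maketext('Welcome back'), "\n";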
I've been trying to develop some tools to aid in string extraction from our bigass codebase. Currently, all I have is a relatively simple Perl script that parses the source looking for string literals, prompts the user with some context, asks whether the string should be marked for translation, and marks it if so.
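For the curious, the script is basically this shape (a stripped-down sketch, not the real tool; the quoting regex is deliberately naive and misses heredocs, q//qq// forms, and multi-line strings):

    use strict;
    use warnings;

    my $file = shift or die "usage: $0 <source-file>\n";
    open my $fh, '<', $file or die "Can't open $file: $!";

    while (my $line = <$fh>) {
        # Naively match single- or double-quoted literals on one line.
        while ($line =~ /(['"])((?:\\.|(?!\1).)*)\1/g) {
            my $str = $2;
            next unless length $str;
            print "$file:$.: \"$str\"\n";
            print "Mark for translation? [y/N] ";
            chomp(my $answer = <STDIN>);
            # The real tool rewrites the source to wrap the string in a
            # translation call; here we just record the decision.
            print "MARKED: $str\n" if lc($answer) eq 'y';
        }
    }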
There's way too much noise: the strings I can ignore far outnumber the strings I need to mark. A lot of strings in the source aren't user-facing, such as hash keys, or type comparisons like
if (ref($db_obj) eq 'A::Type::Of::Db::Module')
I do apply some heuristics to each proposed string to see whether I can ignore it off the bat (e.g., I ignore strings used for hash lookups, since 99% of the time in our codebase these aren't user-facing). However, despite all that, around 90% of the strings my program shows me are ones I don't care about.
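The heuristics are all simple pattern checks along these lines (illustrative guesses, meant as drop-in filters for the inner loop of the scanner above, with $line and $str in scope):

    # Skip strings used as hash keys: $h{'key'}.
    next if $line =~ /\{\s*(['"])\Q$str\E\1\s*\}/;

    # Skip fat-comma keys: 'key' => $value.
    next if $line =~ /(['"])\Q$str\E\1\s*=>/;

    # Skip strings compared with eq/ne -- usually type or flag checks,
    # like: if (ref($db_obj) eq 'A::Type::Of::Db::Module')
    next if $line =~ /\b(?:eq|ne)\s*(['"])\Q$str\E\1/;

    # Skip strings with no letters (separators, formats, etc.).
    next if $str !~ /[A-Za-z]/;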
Is there a better way to automate my string-extraction task (i.e., something more intelligent than grabbing every string literal from the source)? Are there any commercial programs that do this and can handle both Perl and Mason source?
ALSO, I had a (rather silly) idea for a superior tool, whose workflow I put below. Would it be worth the effort implementing something like this (which would probably take care of 80% of the work very quickly), or should I just submit to an arduous, annoying, manual string extraction process?
- Start by extracting EVERY string literal from the source, and putting it into a Gettext PO file.
- Then, write a Mason plugin to parse the HTML for each page served by the application, with the goal of noting the strings the user actually sees (see the sketch after this list).
- Use the hell out of the application and try to cover all use cases, building up a store of user facing strings.
- Given this store of strings the user saw, do fuzzy matches against strings in the catalog file, and keep track of catalog entries that have a match from the UI.
- At the end, anything in the catalog file that didn't get matched would likely not be user facing, so delete those from the catalog.
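For step 2, HTML::Mason does have a plugin API with per-request hooks, so the logging piece might look something like this (a sketch; I'm assuming end_request_hook's context exposes the output buffer as a scalar ref, and the log path and package name are placeholders):

    package MyApp::I18N::StringLogger;
    use base qw(HTML::Mason::Plugin);
    use HTML::Strip;

    # Called once per request, after the page has been rendered.
    sub end_request_hook {
        my ($self, $context) = @_;
        my $html = ${ $context->output };             # assumed scalar ref
        my $text = HTML::Strip->new->parse($html);    # drop the markup
        open my $log, '>>', '/tmp/seen-strings.log' or return;
        for my $chunk (split /\n+/, $text) {
            $chunk =~ s/^\s+|\s+$//g;
            print {$log} $chunk, "\n" if length $chunk;
        }
    }

    1;

The plugin would be enabled through the interpreter's plugins parameter (e.g. plugins => ['MyApp::I18N::StringLogger']). For step 4, the fuzzy matching could be driven by Text::Fuzzy from CPAN, roughly like this (@seen and @catalog are hypothetical arrays of logged strings and PO msgids):

    use Text::Fuzzy;

    my %matched;
    for my $seen (@seen) {
        my $tf = Text::Fuzzy->new($seen);
        $tf->set_max_distance(3);            # tolerate small differences
        my $i = $tf->nearest(\@catalog);     # index of best match, or undef
        $matched{ $catalog[$i] } = 1 if defined $i;
    }
    # Anything not in %matched is a candidate for deletion.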
2 Answers
There are no Perl tools I know of that will intelligently separate strings that might need internationalization from ones that won't. You're supposed to mark them in the code as you write them, but as you said, that wasn't done.
You can use PPI to do the string extraction intelligently.
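A minimal sketch of what that looks like (PPI::Token::Quote covers all four quote token classes, so this catches '', "", q{}, and qq{} literals):

    use strict;
    use warnings;
    use PPI;

    my $file = shift or die "usage: $0 <file>\n";
    my $doc  = PPI::Document->new($file)
        or die "Can't parse $file: " . PPI::Document->errstr;

    # Find every quoted string literal in the document.
    my $strings = $doc->find('PPI::Token::Quote') || [];
    for my $token (@$strings) {
        printf "%s:%d: %s\n", $file, $token->line_number, $token->string;
    }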
You can do more sophisticated examination of the tokens surrounding the strings to determine whether a string is significant. See PPI::Element and PPI::Node for more details. Or you can examine the content of the string itself to decide whether it's significant.
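For instance, the hash-key and eq-comparison cases from the question can be detected structurally rather than with regexes (a sketch; the nesting checks reflect how PPI happens to parse hash subscripts, so verify them against your own tree dumps):

    # Returns true for string tokens we can probably skip.
    sub is_insignificant {
        my $token = shift;

        # Hash subscript key: $hash{'key'} parses as an expression
        # inside a PPI::Structure::Subscript.
        my $parent = $token->parent;
        return 1 if $parent
                 && $parent->isa('PPI::Statement::Expression')
                 && $parent->parent
                 && $parent->parent->isa('PPI::Structure::Subscript');

        # Fat-comma key: 'key' => $value.
        my $next = $token->snext_sibling;
        return 1 if $next
                 && $next->isa('PPI::Token::Operator')
                 && $next->content eq '=>';

        # Type/flag comparison: ref($x) eq 'Some::Class'.
        my $prev = $token->sprevious_sibling;
        return 1 if $prev
                 && $prev->isa('PPI::Token::Operator')
                 && $prev->content =~ /^(?:eq|ne)\z/;

        return 0;
    }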
I can't go much further because "significant" is up to you.
Our Source Code Search Engine is normally used to efficiently search large code bases, using indexes constructed from the lexemes of the languages it knows. That list of languages is pretty broad, including Java, C#, COBOL, and ... Perl. The lexeme extractors are language-precise (because they are "stolen" from our DMS Software Reengineering Toolkit, a language-agnostic program transformation system, where precision is fundamental).
Given an indexed code base, one can then enter queries to find arbitrary sequences of lexemes in spite of language-specific white space; one can log the hits of such queries and their locations.
The extremely short query:

    S

to the Search Engine finds all lexical elements which are classified as strings (keywords, variable names, and comments are all ignored; just strings!). (Normally people write more complex queries with regular-expression constraints, such as S=*Hello to find strings that end with "Hello".)
The relevance here is that the Source Code Search Engine has precise knowledge of lexical syntax of strings in Perl (including specifically elements of interpolated strings and all the wacky escape sequences). So the query above will find all strings in Perl; with logging on, you get all the strings and their locations logged.
This stunt actually works for any language the Search Engine understands, so it is a rather general way to extract strings for such internationalization tasks.