Twitter 文本压缩挑战

发布于 2024-07-25 03:01:15 字数 2672 浏览 7 评论 0原文

规则

  1. 您的程序必须有两种模式:编码解码
  2. 编码时:

    1. 您的程序必须将一些人类可读的 Latin1 文本(大概是英语)作为输入。
      • 忽略标点符号也没关系。
      • 您只需要担心实际的英语单词,而不是 L337。
      • 任何带重音的字母都可以转换为简单的 ASCII。
      • 您可以选择处理数字的方式。
      • 123
        • 一二三
        • 一百二十三
        • 123
        • 1 2 3
      • 一百二十三
        • 一二三
        • 一百二十三
        • 123
        • 1 2 3
    2. 您的程序必须输出一条可以用以下形式表示的消息

      • U+0000U+10FFFF范围内有 140 个代码点

        排除非字符:

        • U+FFFE
        • U+FFFF
        • U+nFFFEU+nFFFF,其中 n110十六进制
        • U+FDD0U+FDEF
        • U+D800U+DFFF(代理代码点)。

    它可以以您选择的任何合理的编码输出; GNU iconv 支持的任何编码都将被视为合理, 并且您的平台本机编码或区域设置编码可能是一个不错的选择。

  3. 解码时:

    1. 您的程序应将编码模式的输出作为输入。
    2. 文本输出应该是输入文本的近似值。
      • 越接近原文越好。
      • 不需要任何标点符号。
    3. 输出文本应该是人类可读的,同样是英语。

      • 可以是 L337,或者哈哈。
    4. 解码过程可能无法访问编码过程的任何其他输出 除了上面指定的输出之外; 也就是说,你不能将文本上传到某处并输出 URL 用于下载解码过程,或类似的愚蠢的事情。
  4. 为了用户界面的一致性,您的程序必须表现如下:
    1. 您的程序必须是一个可以设置为在具有适当解释器的平台上可执行的脚本, 或可以编译为可执行文件的程序。
    2. 您的程序必须将encodedecode 作为其第一个参数来设置模式。
    3. 您的程序必须至少通过以下方式之一获取输入:
      • 从标准输入获取输入并在标准输出上生成输出。
        • 我的程序编码output.utf
        • 我的程序解码output.txt
      • 从第二个参数中指定的文件中获取输入,并在第三个参数中指定的文件中生成输出。
        • 我的程序编码输入.txt 输出.utf
        • 我的程序解码output.utf output.txt
  5. 对于您的解决方案,请发布:
    1. 您的完整代码和/或其他地方托管的代码链接 (如果它很长,或者需要编译很多文件,或者其他什么)。
    2. 解释它是如何工作的(如果从代码中看不出来) 或者如果代码很长并且人们会对摘要感兴趣。
    3. 示例文本,包含原始文本、压缩后的文本以及解码后的文本。
    4. 如果您的想法是基于别人的想法,请注明出处。 尝试改进别人的想法是可以的,但你必须归因于他们。

这些规则是 Twitter 图像编码挑战规则的变体。

Rules

  1. Your program must have two modes: encoding and decoding.
  2. When encoding:

    1. Your program must take as input some human readable Latin1 text, presumably English.
      • It doesn't matter if you ignore punctuation marks.
      • You only need to worry about actual English words, not L337.
      • Any accented letters may be converted to simple ASCII.
      • You may choose how you want to deal with numbers.
      • 123
        • one two three
        • one hundred twenty three
        • 123
        • 1 2 3
      • one hundred twenty three
        • one two three
        • one hundred twenty three
        • 123
        • 1 2 3
    2. Your program must output a message which can be represented in

      • 140 code points in the range U+0000U+10FFFF

        Excluding non-characters:

        • U+FFFE
        • U+FFFF
        • U+nFFFE, U+nFFFF where n is 110 hexadecimal
        • U+FDD0U+FDEF
        • U+D800U+DFFF (surrogate code points).

    It may be output in any reasonable encoding of your choice;
    any encoding supported by GNU iconv will be considered reasonable,
    and your platform native encoding or locale encoding would likely be a good choice.

  3. When decoding:

    1. Your program should take as input the output of your encoding mode.
    2. The text output should be an approximation of the input text.
      • The closer you can get to the original text, the better.
      • Doesn't need to have any punctuation.
    3. The output text should be readable by a human, again presumably English.

      • Can be L337, or lol.
    4. The decoding process may have no access to any other output of the encoding process
      other than the output specified above;
      that is, you can't upload the text somewhere and output the URL
      for the decoding process to download, or anything silly like that.
  4. For the sake of consistency in user interface, your program must behave as follows:
    1. Your program must be a script that can be set to executable on a platform with the appropriate interpreter,
      or a program that can be compiled into an executable.
    2. Your program must take as its first argument either encode or decode to set the mode.
    3. Your program must take input in at least one of the following ways:
      • Take input from standard in and produce output on standard out.
        • my-program encode <input.txt >output.utf
        • my-program decode <output.utf >output.txt
      • Take input from a file named in the second argument, and produce output in the file named in the third.
        • my-program encode input.txt output.utf
        • my-program decode output.utf output.txt
  5. For your solution, please post:
    1. Your code, in full, and/or a link to it hosted elsewhere
      (if it's very long, or requires many files to compile, or something).
    2. An explanation of how it works, if it's not immediately obvious from the code
      or if the code is long and people will be interested in a summary.
    3. An example text, with the original text, the text it compresses down to, and the decoded text.
    4. If you are building on an idea that someone else had, please attribute them.
      It's OK to try to do a refinement of someone else's idea, but you must attribute them.

The rules are a variation on the rules for Twitter image encoding challenge.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

放手` 2024-08-01 03:01:15

不确定我是否有时间/精力用实际代码来跟进,但这是我的想法:

  • 任何特定长度下的任意 LATIN 1 字符串都可以简单地编码(甚至压缩)没有丢失 140 个字符。 天真的估计是 280 个字符,尽管由于比赛规则中的代码点限制,它可能比这个短一点。
  • 比上述长度稍长的字符串(让我们估计在 280 到 500 个字符之间)很可能可以使用标准压缩技术缩小为足够短的字符串以允许上述编码。

如果超过这个长度,我们就会开始丢失文本中的信息。 因此,执行以下最少数量的步骤,将字符串减少到可以使用上述方法进行压缩/编码的长度。 另外,如果仅在子字符串上执行这些替换将使字符串足够短(我可能会向后遍历字符串),则不要对整个字符串执行这些替换。

  1. 将 127 以上的所有 LATIN 1 字符(主要是重音字母和时髦符号)替换为最接近的非重音字母字符,或者可能使用通用符号替换(如“#”)
  2. 将所有大写字母替换为其等效的小写形式
  3. 替换所有非重音字母字母数字(任何剩余的符号或标点符号)带有空格
  4. 将所有数字替换为 0

好的,现在我们已经消除了尽可能多的多余字符。 现在我们要做一些更戏剧性的减少:

  1. 用单个字母(balon)替换所有双字母(balloon)。 看起来很奇怪,但仍然希望读者能够理解。
  2. 将其他常见字母组合替换为较短的等效字母(CK 替换为 K,WR 替换为 R 等)

好的,这就是我们可以做到的让文本可读的程度。 除此之外,让我们看看是否可以想出一种方法,使文本类似于原始文本,即使它最终无法可解读(再次执行此操作)从字符串末尾开始一次一个字符,当足够短时停止):

  1. 将所有元音 (aeiouy) 替换为 a
  2. 将所有“高”字母 (bdfhklt) 替换为 l
  3. 将所有“短”字母 (cmnrsvwxz) 替换为 n
  4. 将所有“悬挂”字母 (gjpq) 替换为 p

这应该会留下一个由 5 个可能值(a、l、n、p 和空格)组成的字符串,这应该允许我们编码相当长的字符串。

除此之外,我们只需要截断即可。

我能想到的唯一其他技术是对常见单词或字母组进行基于字典的编码。 这可能会给我们一些正确句子的好处,但可能不会给任意字符串带来好处。

Not sure if I'll have the time/energy to follow this up with actual code, but here's my idea:

  • Any arbitrary LATIN 1 string under a certain length could be simply encoded (not even compressed) with no loss into 140 characters. The naive estimate is 280 characters, although with the code point restrictions in the contest rules, its probably a little shorter than that.
  • Strings slightly longer than the above length (lets guestimate between 280 and 500 characters) can most likely be shrunk using standard compression techniques, into a string short enough to allow the above encoding.

Anything longer than that, and we're starting lose information in the text. So execute the minimum number of the following steps to reduce the string to a length that can then be compressed/encoded using the above methods. Also, don't perform these replacements on the entire string if just performing them on a substring will make it short enough (I would probably walk through the string backwards).

  1. Replace all LATIN 1 characters above 127 (primarily accented letters and funky symbols) with their closest equivalent in non-accented alphabetic characters, or possibly with a generic symbol replacement like "#"
  2. Replace all uppercase letters with their equivalent lowercase form
  3. Replace all non-alphanumerics (any remaining symbols or punctuation marks) with a space
  4. Replace all numbers with 0

Ok, so now we've eliminated as many excess characters as we can reasonably get rid of. Now we're going to do some more dramatic reductions:

  1. Replace all double-letters (balloon) with a single letter (balon). Will look weird, but still hopefully decipherable by the reader.
  2. Replace other common letter combinations with shorter equivalents (CK with K, WR with R, etc)

Ok, that's about as far as we can go and have the text be readable. Beyond this, lets see if we can come up with a method so that the text will resemble the original, even if it isn't ultimately deciperable (again, perform this one character at a time from the end of the string, and stop when it is short enough):

  1. Replace all vowels (aeiouy) with a
  2. Replace all "tall" letters (bdfhklt) with l
  3. Replace all "short" letters (cmnrsvwxz) with n
  4. Replace all "hanging" letters (gjpq) with p

This should leave us with a string consisting of exactly 5 possible values (a, l, n, p, and space), which should allow us to encode pretty lengthy strings.

Beyond that, we'd simply have to truncate.

Only other technique I can think of would be to do dictionary-based encoding, for common words or groups of letters. This might give us some benefit for proper sentences, but probably not for arbitrary strings.

夕色琉璃 2024-08-01 03:01:15

这是我的实际英语变体。

每个代码点都有大约 1100000 种可能的状态。 嗯,空间很大。

因此,我们对所有原始文本进行词干并从中获取 Wordnet 同义词集。 数字被转换成英文名称(“四十二”)。 1,1M 个状态将允许我们保存 synset id(可以在 0 到 82114 之间)、synset 内的位置(我想大概有 10 个变体)和 synset 类型(这是四种类型之一 - 名词、动词、形容词、副词) 。 我们甚至可能有足够的空间来存储单词的原始形式(例如动词时态 ID)。

解码器只是将同义词集提供给 Wordnet 并检索相应的单词。

源文本:

A white dwarf is a small star composed mostly of electron-degenerate matter. Because a
white dwarf's mass is comparable to that of the Sun and its volume is comparable to that 
of the Earth, it is very dense.

变为:(

A white dwarf be small star composed mostly electron degenerate matter because white
dwarf mass be comparable sun IT volume be comparable earth IT be very dense

使用 Online Wordnet 测试)。 这个“代码”应该有 27 个代码点。
当然,所有像“lol”和“L33T”这样的“胡言乱语”都将永远丢失。

Here is my variant for actual English.

Each code point have something like 1100000 possible states. Well, that's a lot of space.

So, we stem all original text and get Wordnet synsets from it. Numbers are cast into english names ("fourty two"). 1,1M states will allow us to hold synset id (which can be between 0 and 82114), position inside synset(~10 variants, i suppose) and synset type (which is one of four - noun, verb, adjective, adverb). We even may have enough space to store original form of word (like verb tense id).

Decoder just feeds synsets to Wordnet and retrieves corresponding words.

Source text:

A white dwarf is a small star composed mostly of electron-degenerate matter. Because a
white dwarf's mass is comparable to that of the Sun and its volume is comparable to that 
of the Earth, it is very dense.

Becomes:

A white dwarf be small star composed mostly electron degenerate matter because white
dwarf mass be comparable sun IT volume be comparable earth IT be very dense

(tested with Online Wordnet). This "code" should take 27 code points.
Ofcourse all "gibberish" like 'lol' and 'L33T' will be lost forever.

七色彩虹 2024-08-01 03:01:15

PAQ8O10T《》 FTW

PAQ8O10T << FTW

给妤﹃绝世温柔 2024-08-01 03:01:15

这是一个简单的示例,它采用输入文件并删除所有非单词字符。

#! perl
use strict;
use warnings;
use 5.010;


use Getopt::Long;
use Pod::Usage;
use autodie;

my %opts = (
  infile  => '-',
  outfile => '-',
);
GetOptions (
  'encode|e'    => \$opts{encode},
  'decode|d'    => \$opts{decode},
  'infile|i=s'  => \$opts{infile},
  'outfile|o=s' => \$opts{outfile},
  'help|h'      => \&help,
  'man|m'       => \&man,
);

unless(
  # exactly one of these should be set
  $opts{encode} xor $opts{decode}
){
  help();
}


{
  my $infile;
  if( $opts{infile} ~~ ['-', '&0'] ){
    $infile = *STDIN{IO};
  }else{
    open $infile, '<', $opts{infile};
  }

  my $outfile;
  if( $opts{outfile} ~~ ['-', '&1'] ){
    $outfile = *STDOUT{IO};
  }elsif( $opts{outfile} ~~ '&2' ){
    $outfile = *STDERR{IO};
  }else{
    open $outfile, '>', $opts{outfile};
  }

  if( $opts{decode} ){
    while( my $line = <$infile> ){
      chomp $line;

      say {$outfile} $line;
    }
  }elsif( $opts{encode} ){
    while( my $line = <$infile> ){
      chomp $line;

      $line =~ s/[\W_]+/ /g;

      say {$outfile} $line;
    }
  }else{
    die 'How did I get here?';
  }
}

sub help{
  pod2usage();
}
sub man{
  pod2usage(1);
}
__END__

=head1 NAME

sample.pl - Using GetOpt::Long and Pod::Usage

=head1 SYNOPSIS

sample.pl [options] [file ...]

 Options:
   --help     -h      brief help message
   --man      -m      full documentation
   --encode   -e      encode text
   --decode   -d      decode text
   --infile   -i      input  filename
   --outfile  -o      output filename

=head1 OPTIONS

=over 8

=item B<--help>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=item B<--encode>

Removes any character other than /\w/.

=item B<--decode>

Just reads from one file, and writes to the other.

=item B<--infile>

Input filename. If this is '-' or '&0', then read from STDIN instead.
If you use '&0', you must pass it in with quotes.

=item B<--outfile>

Output filename. If this is '-' or '&1', then write to STDOUT instead.
If this is '&2', then write to STDERR instead.
If you use '&1' or '&2', you must pass it in with quotes.

=back

=head1 DESCRIPTION

B<This program> will read the given input file(s) and do something
useful with the contents thereof.

=cut
echo Hello, this is, some text | perl sample.pl -e
Hello this is some text

Here is a simple example that takes an input file and removes any non-word characters.

#! perl
use strict;
use warnings;
use 5.010;


use Getopt::Long;
use Pod::Usage;
use autodie;

my %opts = (
  infile  => '-',
  outfile => '-',
);
GetOptions (
  'encode|e'    => \$opts{encode},
  'decode|d'    => \$opts{decode},
  'infile|i=s'  => \$opts{infile},
  'outfile|o=s' => \$opts{outfile},
  'help|h'      => \&help,
  'man|m'       => \&man,
);

unless(
  # exactly one of these should be set
  $opts{encode} xor $opts{decode}
){
  help();
}


{
  my $infile;
  if( $opts{infile} ~~ ['-', '&0'] ){
    $infile = *STDIN{IO};
  }else{
    open $infile, '<', $opts{infile};
  }

  my $outfile;
  if( $opts{outfile} ~~ ['-', '&1'] ){
    $outfile = *STDOUT{IO};
  }elsif( $opts{outfile} ~~ '&2' ){
    $outfile = *STDERR{IO};
  }else{
    open $outfile, '>', $opts{outfile};
  }

  if( $opts{decode} ){
    while( my $line = <$infile> ){
      chomp $line;

      say {$outfile} $line;
    }
  }elsif( $opts{encode} ){
    while( my $line = <$infile> ){
      chomp $line;

      $line =~ s/[\W_]+/ /g;

      say {$outfile} $line;
    }
  }else{
    die 'How did I get here?';
  }
}

sub help{
  pod2usage();
}
sub man{
  pod2usage(1);
}
__END__

=head1 NAME

sample.pl - Using GetOpt::Long and Pod::Usage

=head1 SYNOPSIS

sample.pl [options] [file ...]

 Options:
   --help     -h      brief help message
   --man      -m      full documentation
   --encode   -e      encode text
   --decode   -d      decode text
   --infile   -i      input  filename
   --outfile  -o      output filename

=head1 OPTIONS

=over 8

=item B<--help>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=item B<--encode>

Removes any character other than /\w/.

=item B<--decode>

Just reads from one file, and writes to the other.

=item B<--infile>

Input filename. If this is '-' or '&0', then read from STDIN instead.
If you use '&0', you must pass it in with quotes.

=item B<--outfile>

Output filename. If this is '-' or '&1', then write to STDOUT instead.
If this is '&2', then write to STDERR instead.
If you use '&1' or '&2', you must pass it in with quotes.

=back

=head1 DESCRIPTION

B<This program> will read the given input file(s) and do something
useful with the contents thereof.

=cut
echo Hello, this is, some text | perl sample.pl -e
Hello this is some text
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文