Regex for the ID3v2 unsynchronisation scheme in mp3 files?

Posted on 2024-11-02 08:04:56

I'm writing a piece of code that checks the mp3 files on my server and reports whether any of them contain false syncs. In short, I load each file in PHP with fread() and keep the stream in a variable. After splitting that stream into separate streams for ID3v1 (not relevant here, since it is not subject to unsynchronisation), ID3v2 (the main problem) and audio, I have to apply the scheme to the ID3v2 stream.

According to the official ID3v2 documentation:

The only purpose of the 'unsynchronisation scheme' is to make the ID3v2 tag as compatible as possible with existing software. There is no use in 'unsynchronising' tags if the file is only to be processed by new software. Unsynchronisation may only be made with MPEG 2 layer I, II and III and MPEG 2.5 files.

Whenever a false synchronisation is found within the tag, one zeroed byte is inserted after the first false synchronisation byte. The format of a correct sync that should be altered by ID3 encoders is as follows:

%11111111 111xxxxx

And should be replaced with:

%11111111 00000000 111xxxxx

This has the side effect that all $FF 00 combinations have to be altered, so they won't be affected by the decoding process. Therefore all the $FF 00 combinations have to be replaced with the $FF 00 00 combination during the unsynchronisation.

To indicate usage of the unsynchronisation, the first bit in 'ID3 flags' should be set (note: I've found that bit). This bit should only be set if the tag contains a, now corrected, false synchronisation. The bit should only be clear if the tag does not contain any false synchronisations.

Do bear in mind, that if a compression scheme is used by the encoder, the unsynchronisation scheme should be applied afterwards. When decoding a compressed, 'unsynchronised' file, the 'unsynchronisation scheme' should be parsed first, decompression afterwards.
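As an aside on the flag mentioned in the quoted text: a minimal sketch of reading it, assuming the usual 10-byte ID3v2 header layout ("ID3", two version bytes, one flags byte, four size bytes) with the unsynchronisation flag in the top bit of the flags byte. The function name id3v2UsesUnsync is made up for illustration.

```php
<?php
// Hypothetical helper: reports whether an ID3v2 header declares unsynchronisation.
// Assumes $header holds at least the first 10 bytes of the file.
function id3v2UsesUnsync(string $header): bool
{
    // A tag is only present if the data starts with the "ID3" marker.
    if (strlen($header) < 10 || substr($header, 0, 3) !== 'ID3') {
        return false;
    }
    // Byte 5 is the flags byte; its highest bit (0x80) is the unsync flag.
    $flags = ord($header[5]);
    return ($flags & 0x80) !== 0;
}

var_dump(id3v2UsesUnsync("ID3\x03\x00\x80\x00\x00\x00\x0A"));
```

Only when this returns true does the tag body need to be reverted before its frames are parsed.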

My questions are:

How do I search for the bit pattern %11111111 111xxxxx and replace it with %11111111 00000000 111xxxxx?
Vice versa, how do I search for the bit pattern %11111111 00000000 111xxxxx and replace it with %11111111 111xxxxx?

...using preg_replace().

The code I've written so far works fine; I only need one more line (well, two, to be exact).

<?php

  // some basic checks here, such as 'does the file exist'
  // and 'is it readable'

  $f = fopen('test.mp3', 'rb'); // 'b' requests binary mode explicitly

  // ...rest of my code...  

  $pattern1 = '?????'; // pattern from 1st question
  $id3stream = preg_replace($pattern1, 'something1', $id3stream);

  // ...extracting frames...

  $pattern2 = '?????'; // pattern from 2nd question
  $id3stream = preg_replace($pattern2, 'something2', $id3stream);

  // ..do more job...

  fclose($f);

?>

How can I make those two preg_replace() lines work?

P.S. I know how to do it by reading byte after byte in a loop, but I'm sure it is possible with regular expressions (btw, to be honest, I suck at regex).

Let me know if you need more details.

One more thing...

At the moment I'm using this pattern

$pattern0 = '/[\x00].*/';
echo preg_replace($pattern0, '', $input_string);

to cut off the part of the string from the first zero byte to the end. Is that the correct way to do it?

Update

(@mario's answer).

In the first couple of tests, this code returned the correct result.

  // print original stream
  printStreamHex($stream_original, 'ORIGINAL STREAM');

  // adding zero pads on unsync scheme
  $stream_1 = preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $stream_original);
  printStreamHex($stream_1, 'AFTER ADDING ZEROS');

  // reversing process
  $stream_2 = preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $stream_1);
  printStreamHex($stream_2, 'AFTER REMOVING ZEROS');


  // strict comparison, so the two binary strings are compared byte for byte
  echo "Status: <b>" . ($stream_original === $stream_2 ? "OK" : "Failed") . "</b>";

But a few minutes later I found a specific case where everything looks like the expected result, yet there are still FFE0+ pairs in the stream.

ORIGINAL STREAM
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

AFTER ADDING ZEROS
+-----------------------------------------------------------------+
| FF  00  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  |
| 00  FA  84  E0  A9  99  1F  39  B5  E1  54  FF  00  E7  ED  B8  |
| B1  3A  36  88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  |
| 1A  FF  00  FF  FF  00  F8  21  F9  2F  FF  00  F7  17  67  EB  |
| 2A  EB  6E  41  82  FF                                          |
+-----------------------------------------------------------------+

AFTER REMOVING ZEROS
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

Status: OK

If the stream contains something like FF FF FF FF, it is replaced with FF 00 FF FF 00 FF, but it should be FF 00 FF 00 FF 00 FF. The remaining FF FF pair would again produce a false mp3 sync, so my goal is to eliminate every FFE0+ pattern before the audio stream (that is, in the ID3v2 tag stream, because mp3 audio starts with an FFE0+ byte pair, and its first occurrence should be at the beginning of the audio data). I figured out that I can loop the same regex until the stream contains no FFE0+ byte pair. Is there a solution that doesn't require a loop?
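One possible loop-free approach, sketched here as an assumption rather than as the answer that was actually given: match only the FF byte and test the following byte with a lookahead, so the second byte is not consumed and can itself start the next match. The same pattern also covers the FF 00 to FF 00 00 rule from the quoted specification. The function names id3v2Unsync and id3v2Resync are made up for illustration.

```php
<?php
// Sketch: ID3v2 unsynchronisation without a loop (illustrative names).

function id3v2Unsync(string $tag): string
{
    // Match every 0xFF that is followed by 0x00 or by 0xE0-0xFF.
    // The lookahead (?=...) does not consume the following byte, so in a
    // run such as FF FF FF each FF is still examined on its own.
    return preg_replace('/\xFF(?=[\x00\xE0-\xFF])/', "\xFF\x00", $tag);
}

function id3v2Resync(string $tag): string
{
    // In an unsynchronised tag every FF 00 carries one inserted zero byte,
    // so dropping that byte restores the original data.
    return str_replace("\xFF\x00", "\xFF", $tag);
}

$original = "\xFF\xFF\xFF\xFF\x41\xFF\x00\xFF\xE0";
$encoded  = id3v2Unsync($original);

echo bin2hex($encoded), "\n";                  // ff00ff00ff00ff41ff0000ff00e0
var_dump(id3v2Resync($encoded) === $original); // bool(true)
```

The reverse direction needs no regex at all, because after encoding an FF is never followed by an original 00 without an inserted 00 in between.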

Great job @mario, thanks a lot!

北城孤痞 2024-11-09 08:04:56

Binary strings are not quite the turf of regular expressions. But you already had the right approach with \x00.

3. Cutting off the part of the string from the first zero byte to the end

$pattern0 = '/[\\x00].*$/';

You were just missing the $ here.
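One caveat that may be worth checking on binary data: without the s modifier, "." does not match a newline, so ".*" stops at the first "\n" after the zero byte. The sketch below compares that behaviour with the s modifier and with a plain strpos/substr version; the helper name cutAtFirstNul and the sample string are made up for illustration.

```php
<?php
$input = "title\x00junk\nmore junk";

// Without the s modifier, ".*" stops at the newline, so the tail survives.
$partial = preg_replace('/\x00.*/', '', $input);

// With the s modifier, "." also matches "\n" and ".*" runs to the end.
$viaRegex = preg_replace('/\x00.*/s', '', $input);

// The same result without a regex (hypothetical helper name).
function cutAtFirstNul(string $s): string
{
    $pos = strpos($s, "\x00");
    return $pos === false ? $s : substr($s, 0, $pos);
}

var_dump($partial === "title\nmore junk");
var_dump($viaRegex === cutAtFirstNul($input));
```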

1. How to search & replace this bit-pattern %11111111 111xxxxx with %11111111 00000000 111xxxxx?

Use the sequences FF and E0 for these bit strings.

preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $id3stream);

The replacement string uses $2 because the second byte you search for is variable. Otherwise a simpler str_replace would work.

2. Vice versa, how to search & replace this bit-pattern %11111111 00000000 111xxxxx with %11111111 111xxxxx?

Same trick.

preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $id3stream);

I would only watch out to always use the double backslash \\, so that it is PCRE that interprets the \x00 hex sequences, not the PHP parser. (Otherwise it would end up as a C string terminator before it reaches libpcre.)
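A quick way to see which layer interprets the escape is to look at the string lengths. This small sketch uses only standard PHP string literals; the variable names are made up for illustration.

```php
<?php
// Single quotes: PHP leaves \x00 alone, so PCRE receives the four
// characters backslash, x, 0, 0 and decodes the escape itself.
$single  = '\x00';
$escaped = '\\x00'; // the doubled backslash spells the same four characters

// Double quotes: PHP itself converts \x00 into one NUL byte.
$double = "\x00";

var_dump(strlen($single));     // int(4)
var_dump($single === $escaped); // bool(true)
var_dump(strlen($double));     // int(1)
```

So the doubled backslash matters in double-quoted patterns, while in single-quoted patterns both spellings reach PCRE unchanged.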
