Replace all "\" characters which are *not* inside "<code>" tags

Posted 2024-08-12 07:47:04 · 2680 characters · 14 views · 0 comments

First things first: Neither this, this, this nor this answered my question. So I'll open a new one.

Please read

Okay okay. I know that regexes are not the way to parse general HTML. Please take note that the created documents are written using a limited, controlled HTML subset. And people writing the docs know what they're doing. They are all IT professionals!

Given the controlled syntax it is possible to parse the documents I have here using regexes.

I am not trying to download arbitrary documents from the web and parse them!

And if the parsing does fail, the document is edited so that it will parse. The problem I am addressing here is more general than that (i.e. not replacing a pattern when it occurs between two other patterns).

A little bit of background (you can skip this...)

In our office we are supposed to "pretty print" our documentation. Hence why some came up with putting it all into Word documents. So far we're thankfully not quite there yet. And, if I get this done, we might not need to.

The current state (... and this)

The main part of the docs is stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must-have features of the selected wiki system was a WYSIWYG editor, which, as expected, leaves us with documents with a less than formal DOM.

Consequently, I am transliterating the document using "simple" regexes. It all works (mostly) fine so far, but I encountered one problem I haven't figured out on my own yet.

The problem

Some special characters need to be replaced by LaTeX markup. For example, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).

Except while in a verbatim block!

I do replace <code> tags with verbatim sections. But if this code block contains backslashes (as is the case for Windows folder names), the script still replaces these backslashes.

I reckon I could solve this using negative LookBehinds and/or LookAheads. But my attempts did not work.

Granted, I would be better off with a real parser. In fact, it is something on my "in-brain-roadmap", but it is currently out of the scope. The script works well enough for our limited knowledge domain. Creating a parser would require me to start pretty much from scratch.

My attempt

Example Input

The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>

Expected output

The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}

This is the best I could come up with so far:

<?php
$patterns = array(
    "special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

Note that this is only an excerpt, and the [^$] is another LaTeX requirement.

Another attempt which seemed to work:

<?php
$patterns = array(
    "special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

... in other words: leaving out the negative lookbehind.

But this looks more error-prone than with both lookbehind and lookahead.

A related question

As you may have noticed, the pattern is ungreedy (/.../U). So will this match as little as possible inside a <code> block, considering the look-arounds?

Comments (6)

月下客 2024-08-19 07:47:04

If it were me, I would try to find an HTML parser and use that.

Another option is to split the string into <code>.*?</code> chunks and everything else,

update only the other parts, and then recombine the pieces.

$x="The Hello \ World document is located in:\n<br>
<code>C:\documents\hello_world.txt</code>";

// With PREG_SPLIT_DELIM_CAPTURE the captured <code> blocks land at the odd indexes.
$r=preg_split("/(<code>.*?<\/code>)/", $x,-1,PREG_SPLIT_DELIM_CAPTURE);

// The even indexes hold the text outside <code>, so only those are rewritten.
for($i=0;$i<count($r);$i+=2)
    $r[$i]=str_replace("\\","$\\backslash$",$r[$i]);

$x=implode($r);

echo $x;

Here is the result (as rendered in a browser, so the <br> and <code> tags are not visible):

The Hello $\backslash$ World document is located in: 
C:\documents\hello_world.txt

Sorry if my approach is not suitable for you.
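
A variation on the same idea, offered only as a rough sketch: one pass with preg_replace_callback, where the pattern matches either a whole <code> block or a single backslash, and the callback decides what to emit. The function name texEscapeOutsideCode is made up for this example, and it assumes <code> tags carry no attributes and are never nested.

```php
<?php
// Hypothetical helper, not from the original script.
// Assumes plain <code>...</code> tags without attributes or nesting.
function texEscapeOutsideCode(string $html): string
{
    // Branch 1 captures the content of a <code> block, branch 2 is a lone backslash.
    // The "s" modifier lets "." match newlines inside a code block.
    $pattern = '~<code>(.*?)</code>|\\\\~s';

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1])) {
            // A whole <code> block matched: emit it verbatim, backslashes untouched.
            return '\\begin{verbatim}' . $m[1] . '\\end{verbatim}';
        }
        // Otherwise a backslash outside any <code> block matched.
        return '$\\backslash$';
    }, $html);
}
```

Because the <code> alternative is tried first and consumes the whole block, backslashes inside it are never seen by the second alternative. With the example input from the question this should produce the expected output, including the verbatim environment.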

执着的年纪 2024-08-19 07:47:04

I reckon I could solve this using negative LookBehinds and/or LookAheads.

You reckon wrong. Regular expressions are not a replacement for a parser.

I would suggest that you pipe the HTML through htmltidy, then read it with a DOM parser and then transform the DOM to your target output format. Is there anything preventing you from taking this route?

梦里泪两行 2024-08-19 07:47:04

Parser FTW, ok. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:

  1. Find <code>.*?</code> sections of your file (probably need to turn on dot-matches-newlines mode).
  2. Replace all backslashes inside that section with something unique like #?#?#?#
  3. Replace the section found in 1 with that new section
  4. Replace all backslashes with $\backslash$
  5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
  6. Replace #?#?#?# with \
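
The six steps above could be sketched roughly like this. The function name protectAndEscape is invented for the example, and the sketch assumes that the placeholder #?#?#?# never occurs in the real documents and that <code> tags have no attributes and are not nested.

```php
<?php
// Hypothetical helper following the numbered steps; not part of the original script.
function protectAndEscape(string $html): string
{
    $placeholder = '#?#?#?#'; // assumed to be absent from the input

    // Steps 1-3: inside every <code> section, swap backslashes for the placeholder.
    $html = preg_replace_callback(
        '~<code>.*?</code>~s', // "s": dot also matches newlines
        function (array $m) use ($placeholder): string {
            return str_replace('\\', $placeholder, $m[0]);
        },
        $html
    );

    // Step 4: every backslash that is left sits outside <code>.
    $html = str_replace('\\', '$\\backslash$', $html);

    // Step 5: turn the tags into a verbatim environment.
    $html = str_replace(
        ['<code>', '</code>'],
        ['\\begin{verbatim}', '\\end{verbatim}'],
        $html
    );

    // Step 6: restore the protected backslashes.
    return str_replace($placeholder, '\\', $html);
}
```

The order matters: the verbatim markers are inserted only after step 4, so their own backslashes are not escaped.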

FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.

分分钟 2024-08-19 07:47:04

Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your PHP scraping scripts?

With your "expected input" and the command pandoc -o text.tex test.html the output is:

The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!

pandoc can read from stdin, write to stdout or pipe right into a file.

请别遗忘我 2024-08-19 07:47:04

Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.

((?:^|</code>)(?:(?!<code>).)+?)\\
    |            |              |
    |            |              \-- backslash
    |            \-- least amount of anything not followed by <code>
    \-- start-of-string or </code>

And replace it with:

$1$\backslash$

You'd have to run this regex in "singleline" mode, so that . matches newlines. You'd also have to run it multiple times; specifying global replacement is not enough. Each pass will only replace the first eligible backslash after start-of-string or </code>.
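
A rough PHP sketch of running such a pattern repeatedly. It deviates from the regex above in two ways, both of which are assumptions made here: the quantifier is *? instead of +? so that a backslash directly at the start or right after </code> is also found, and a lookahead skips backslashes that are already part of $\backslash$ (otherwise the replacement text itself would be matched again on the next pass and the loop would never end). The function name is made up.

```php
<?php
// Hypothetical helper; loops until a pass replaces nothing.
function escapeBackslashesOutsideCode(string $text): string
{
    // Group 1: start-of-string or </code>, then as little as possible of
    // anything that does not run into a <code> tag.
    // Then: a backslash that is not followed by "backslash$".
    // "s" is the singleline mode, so "." matches newlines.
    $pattern = '~((?:^|</code>)(?:(?!<code>).)*?)\\\\(?!backslash\$)~s';

    do {
        $text = preg_replace_callback(
            $pattern,
            function (array $m): string {
                return $m[1] . '$\\backslash$';
            },
            $text,
            -1,
            $count // number of replacements made in this pass
        );
    } while ($count > 0);

    return $text;
}
```

A caveat of this guard: a literal backslash in the source that happens to be followed by the text backslash$ would be left alone. This version also leaves the <code> tags themselves in place.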

你与清晨阳光 2024-08-19 07:47:04

Write a parser based on an HTML or XML parser like DOMDocument. Traverse the parsed DOM, replace the \ in every text node that is not a descendant of a code node with $\backslash$, and replace every code node with \begin{verbatim} … \end{verbatim}.
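
A minimal sketch of that traversal with PHP's DOM extension. The function names nodeToLatex and htmlToLatex are invented for illustration. The sketch assumes ASCII input and simply drops every tag other than code while keeping its text, which a real converter would of course have to handle per tag.

```php
<?php
// Hypothetical helper: renders a DOM node and its children as LaTeX-flavoured text.
function nodeToLatex(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Text outside <code>: escape backslashes for LaTeX.
        return str_replace('\\', '$\\backslash$', $node->nodeValue);
    }
    if ($node instanceof DOMElement && strtolower($node->nodeName) === 'code') {
        // A code node: emit its text content untouched inside verbatim.
        return '\\begin{verbatim}' . $node->textContent . '\\end{verbatim}';
    }
    // Any other node: ignore the tag itself and recurse into the children.
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= nodeToLatex($child);
    }
    return $out;
}

// Hypothetical entry point: parses an HTML fragment and converts its body.
function htmlToLatex(string $html): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : nodeToLatex($body);
}
```

Since the code node is handled before recursing, the text nodes below it are never visited, so nothing inside a code block gets escaped.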
