Can I use a regex to copy a heading onto each entry until the next heading? (Hyperlinking endnotes in an ebook)

Posted 2024-11-27 22:32:36

Okay, regex ninjas. I'm trying to devise a pattern to add hyperlinks to endnotes in an ePub ebook XHTML file. The problem is that numbering restarts within each chapter, so I need to add a unique identifier to the anchor name in order to hash link to it.

Given a (much simplified) list like this:

<h2>Introduction</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

I need to turn it into something like this:

<h2>Introduction</h2>
<a name="endnote-introduction-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-introduction-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-introduction-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-introduction-4"></a><p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<a name="endnote-chapter-1-the-beginning-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-chapter-1-the-beginning-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-chapter-1-the-beginning-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-chapter-1-the-beginning-4"></a><p> 4 Endnote entry number four.</p>

Obviously there will need to be a similar search in the actual text of the book, where each endnote will be linked to endnotes.xhtml#endnote-introduction-1 etc.
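For the body-text side described above, here is a minimal Python sketch. It assumes, purely for illustration, that note references in the book text appear as `<sup>N</sup>` markers and that each chapter begins with an `<h2>` heading on its own line; neither detail comes from the original post. The names `slugify` and `link_references` are hypothetical.

```python
import re

def slugify(title):
    """Lowercase a heading and collapse runs of non-alphanumerics to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def link_references(text, notes_file="endnotes.xhtml"):
    """Wrap each <sup>N</sup> marker (an assumed format) in a link to its endnote."""
    chapter = ""
    out = []
    for line in text.splitlines():
        heading = re.match(r"<h2>(.*?)</h2>", line)
        if heading:
            # Remember the current chapter so later markers can use it.
            chapter = slugify(heading.group(1))
        else:
            line = re.sub(
                r"<sup>(\d+)</sup>",
                lambda m: f'<a href="{notes_file}#endnote-{chapter}-{m.group(1)}">{m.group(0)}</a>',
                line,
            )
        out.append(line)
    return "\n".join(out)

body = (
    "<h2>Introduction</h2>\n"
    "<p>Some claim.<sup>1</sup></p>\n"
    "<h2>Chapter 1: The Beginning</h2>\n"
    "<p>Another claim.<sup>1</sup></p>"
)
linked = link_references(body)
print(linked)
```

The chapter slug is built the same way as the anchor names above, so both sides of the link agree as long as the headings match.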

The biggest obstacle is that each match search begins AFTER the previous search ends, so unless you use recursion, you can't match the same bit (in this case, the title) for more than one entry. My attempts with recursion have so far yielded only endless loops, however.

I'm using TextWrangler's grep engine, but if you have a solution in a different editor (such as vim), that's fine too.

Thanks!

一人独醉 2024-12-04 22:32:36

A bit of awk might do the trick:

Create the following script (I've named it add_endnote_tags.awk):

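# Heading line: reset the note counter and derive a slug from the title.
/^<h2>/ {
    i = 0;
    chapter_name = $0;
    gsub(/<[^>]+>/, "", chapter_name);        # strip the <h2> tags
    chapter_name = tolower(chapter_name);
    # Keep digits too, so "Chapter 1" becomes "chapter-1" rather than "chapter".
    gsub(/[^a-z0-9]+/, "-", chapter_name);
    gsub(/^-|-$/, "", chapter_name);          # drop any leading/trailing dash
    print;
}

# Endnote line: number it within the current chapter and prepend the anchor.
/^<p>/ {
    i = i + 1;
    printf("<a name=\"endnote-%s-%d\"></a>%s\n", chapter_name, i, $0);
}

# Everything else passes through unchanged.
$0 !~ /^<h2>/ && $0 !~ /^<p>/ {
    print;
}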

And then use it to parse your file:

awk -f add_endnote_tags.awk < source_file.xml > dest_file.xml

Hope that helps. If you are on Windows, you may need to install awk first, either by installing Cygwin with its awk package or by downloading gawk for Windows.
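If awk is not available, the same line-by-line idea can be sketched in Python. This is only an illustration of the approach above, not a replacement for it: the function names `slugify` and `add_endnote_anchors` are hypothetical, and like the awk script it assumes every heading and endnote sits on its own line starting with `<h2>` or `<p>`.

```python
import re

def slugify(title):
    """Strip tags, lowercase, and collapse runs of non-alphanumerics to '-'."""
    text = re.sub(r"<[^>]+>", "", title)
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def add_endnote_anchors(lines):
    """Yield each line, prefixing <p> lines with an anchor for the current chapter."""
    chapter, count = "", 0
    for line in lines:
        if line.startswith("<h2>"):
            # New chapter: remember its slug and restart the numbering.
            chapter, count = slugify(line), 0
        elif line.startswith("<p>"):
            count += 1
            line = f'<a name="endnote-{chapter}-{count}"></a>{line}'
        yield line

sample = [
    "<h2>Introduction</h2>",
    "<p> 1 Endnote entry number one.</p>",
    "",
    "<h2>Chapter 1: The Beginning</h2>",
    "<p> 1 Endnote entry number one.</p>",
    "<p> 2 Endnote entry number two.</p>",
]
result = list(add_endnote_anchors(sample))
print("\n".join(result))
```

Because the chapter slug is kept in a variable between lines, the "match the same heading for several entries" problem from the question never arises.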

清风疏影 2024-12-04 22:32:36

I think this would be difficult to accomplish in a text editor, as it requires a two-step process: first section the file into chapters, then process the contents of each chapter. Assuming that "endnote paragraphs" (where you wish to add the anchors) are defined as paragraphs whose first word is an integer, this PHP script will do what you need.

<?php
$data = file_get_contents('testdata.txt');
$output = processBook($data);
file_put_contents('testdata_out.txt', $output);
echo $output;

// Main function to process book adding endnote anchors.
function processBook($text) {
    $re_chap = '%
        # Regex 1: Get Chapter.
        <h2>([^<>]+)</h2>  # $1: Chapter title.
        (                  # $2: Chapter contents.
          .+?              # Contents are everything up to
          (?=<h2>|$)       # next chapter or end of file.
        )                  # End $2: Chapter contents.
        %six';
    // Match and process each chapter using callback function.
    $text = preg_replace_callback($re_chap, '_cb_chap', $text);
    return $text;
}
// Callback function to process each chapter.
function _cb_chap($matches) {
    // Build ID from H2 title contents.
    // Trim leading and trailing whitespace from title.
    $baseid = trim($matches[1]);
    // Strip everything except spaces and alphanumerics.
    // (Operate on the trimmed $baseid, not $matches[1], so the trim is kept.)
    $baseid = preg_replace('/[^ A-Za-z0-9]/', '', $baseid);
    // Append prefix and convert runs of spaces to a single dash.
    $baseid = 'endnote-'. preg_replace('/ +/', '-', $baseid);
    // Convert to lowercase.
    $baseid = strtolower($baseid);
    $text = preg_replace(
                '/(<p>\s*)(\d+)\b/',
                '<a name="'. $baseid .'-$2"></a>$1$2',
                $matches[2]);
    return '<h2>'. $matches[1] .'</h2>'. $text;

}
?>

This script correctly processes your example data.
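The same two-step idea (match each chapter, then rewrite the notes inside it from a callback) can be sketched in Python with `re.sub` and a replacement function. This is a rough equivalent, not a port verified against the PHP: the names `chapter_id`, `process_chapter` and `process_book` are hypothetical, and the chapter body uses `.*?` instead of `.+?` so that an empty chapter does not swallow the next heading.

```python
import re

# Step 1: a chapter is an <h2> title plus everything up to the next <h2> or end of text.
CHAPTER_RE = re.compile(r"<h2>([^<>]+)</h2>(.*?)(?=<h2>|\Z)", re.S | re.I)
# Step 2: an endnote paragraph starts with <p>, optional whitespace, then an integer.
NOTE_RE = re.compile(r"(<p>\s*)(\d+)\b")

def chapter_id(title):
    """Build 'endnote-<slug>' from a heading, keeping only letters, digits and spaces."""
    cleaned = re.sub(r"[^ A-Za-z0-9]", "", title.strip())
    return "endnote-" + re.sub(r" +", "-", cleaned).lower()

def process_chapter(match):
    """Callback: prepend an anchor to every endnote paragraph in one chapter."""
    title, body = match.group(1), match.group(2)
    base = chapter_id(title)
    body = NOTE_RE.sub(
        lambda m: f'<a name="{base}-{m.group(2)}"></a>{m.group(0)}', body
    )
    return f"<h2>{title}</h2>{body}"

def process_book(text):
    return CHAPTER_RE.sub(process_chapter, text)

text = (
    "<h2>Introduction</h2>\n"
    "<p> 1 One.</p>\n"
    "<p> 2 Two.</p>\n"
    "\n"
    "<h2>Chapter 1: The Beginning</h2>\n"
    "<p> 1 One.</p>\n"
)
out = process_book(text)
print(out)
```

Unlike a running counter, this version reuses the number already printed in each note, so the anchor stays correct even if some paragraphs in the chapter are not endnotes.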
