Extracting text from "para" elements nested inside a "para"?

Posted 2024-11-09 18:09:03

I'm using Altova's command-line XML processor on Windows to process a Help & Manual XML file. Help & Manual is help-authoring software.

I'm extracting the text content from it using the following XSLT. Specifically, I'm having an issue with the final para rule:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />
  <xsl:template match="para[@styleclass='Heading1']">
    <xsl:text>====== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> ======

</xsl:text>
  </xsl:template>
  <xsl:template match="para[@styleclass='Heading2']">
    <xsl:text>===== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> =====

</xsl:text>
  </xsl:template>
  <xsl:template match="para">
    <xsl:value-of select="." />
    <xsl:text>

</xsl:text>
  </xsl:template>
  <xsl:template match="toggle">
    <xsl:text>**</xsl:text>
    <xsl:apply-templates />
    <xsl:text>**

</xsl:text>
  </xsl:template>
  <xsl:template match="title" />
  <xsl:template match="topic">
    <xsl:apply-templates select="body" />
  </xsl:template>
  <xsl:template match="body">
    <xsl:text>Content-Type: text/x-zim-wiki
Wiki-Format: zim 0.4

</xsl:text>
    <xsl:apply-templates />
  </xsl:template>
</xsl:stylesheet>

I've run into a problem extracting text from certain paragraph elements. Take this XML, for example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../helpproject.xsl" ?>
<topic template="Default" lasteditedby="tlilley" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../helpproject.xsd">
  <title translate="true">New Installs</title>
  <keywords>
    <keyword translate="true">Regional and Language Options</keyword>
  </keywords>
  <body>
    <header>
      <para styleclass="Heading1"><text styleclass="Heading1" translate="true">New Installs</text></para>
    </header>
    <para styleclass="Normal"><table rowcount="1" colcount="2" style="width:100%; cell-padding:6px; cell-spacing:0px; page-break-inside:auto; border-width:1px; border-spacing:0px; cell-border-width:0px; border-color:#000000; border-style:solid; background-color:#fffff0; head-row-background-color:none; alt-row-background-color:none;">
      <tr style="vertical-align:top">
        <td style="vertical-align:middle; width:96px; height:103px;">
          <para styleclass="Normal" style="text-align:center;"><image src="books.png" scale="100.00%" styleclass="Image Caption"></image></para>
        </td>
        <td style="vertical-align:middle; width:1189px; height:103px;">
          <para styleclass="Callouts"><text styleclass="Callouts" style="font-weight:bold;" translate="true">Documentation Convention</text></para>
          <para styleclass="Callouts"><text styleclass="Callouts" translate="true">To make the examples concrete, we refer to the </text><var styleclass="Callouts">Add2Exchange</var><text styleclass="Callouts" translate="true"> Service Account as "zAdd2Exchange" throughout this document.  If your Service Account name is different, substitute that value for "zAdd2Exchange" in all commands and examples.  If you have named your account according to the recommended "zAdd2Exchange", then you may cut and paste any given commands as is.</text></para>
        </td>
      </tr>
    </table></para>
  </body>
</topic>

When the XSLT is run on that topic, it pulls the text out, but it does so at the outermost para element, because value-of takes the string value of the whole element, nested paragraphs included. The transform is supposed to add a pair of newlines after every extracted paragraph, but it never gets the chance to do so for the embedded <para> elements, because their text has already been extracted at the parent para element.

Note that I don't care about the table tags; I just want to strip them.

Is there a way to write the para rule so that it extracts the text a para element owns directly, as well as the text of any nested para elements, with each extracted chunk getting the rule's newlines in the output text?

Comments (1)

你如我软肋 2024-11-16 18:09:03

I think I've found the answer. In the last para rule I replaced value-of with apply-templates, and that seems to catch them all.
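For reference, here is a minimal sketch of what that change might look like, assuming the rest of the stylesheet stays as posted in the question. The only intended difference is the body of the last para rule; the `&#xA;` character references are just a compact way of writing the same newlines. It works because XSLT's built-in rules recurse through elements that have no template of their own (table, tr, td, text, var) and copy text nodes to the output, so each nested para is matched by the para rule in its own right.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Heading rules are unchanged: value-of flattens the heading to one string. -->
  <xsl:template match="para[@styleclass='Heading1']">
    <xsl:text>====== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> ======&#xA;&#xA;</xsl:text>
  </xsl:template>
  <xsl:template match="para[@styleclass='Heading2']">
    <xsl:text>===== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> =====&#xA;&#xA;</xsl:text>
  </xsl:template>

  <!-- Changed rule: recurse into the children instead of taking the string
       value, so a para nested inside a table cell is processed by this same
       rule and gets its own pair of newlines. -->
  <xsl:template match="para">
    <xsl:apply-templates />
    <xsl:text>&#xA;&#xA;</xsl:text>
  </xsl:template>

  <xsl:template match="toggle">
    <xsl:text>**</xsl:text>
    <xsl:apply-templates />
    <xsl:text>**&#xA;&#xA;</xsl:text>
  </xsl:template>
  <xsl:template match="title" />
  <xsl:template match="topic">
    <xsl:apply-templates select="body" />
  </xsl:template>
  <xsl:template match="body">
    <xsl:text>Content-Type: text/x-zim-wiki&#xA;Wiki-Format: zim 0.4&#xA;&#xA;</xsl:text>
    <xsl:apply-templates />
  </xsl:template>
</xsl:stylesheet>
```

One thing to watch for with this approach: a para that contributes no text of its own, such as the image-only paragraph or the outer paragraph that merely wraps the table, still emits its pair of newlines, so the output may contain a few extra blank lines. If that matters, an additional, more specific rule for such paragraphs could suppress them.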
