Extracting text from "para" tags embedded as children of "para"?

Posted on 2024-11-09 18:09:03

I'm using Altova's command-line XML processor on Windows to process a Help & Manual XML file. Help & Manual is help authoring software.

I'm extracting the text content from it using the following XSLT. Specifically, I'm having an issue with the final para rule:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />
  <xsl:template match="para[@styleclass='Heading1']">
    <xsl:text>====== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> ======

</xsl:text>
  </xsl:template>
  <xsl:template match="para[@styleclass='Heading2']">
    <xsl:text>===== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> =====

</xsl:text>
  </xsl:template>
  <xsl:template match="para">
    <xsl:value-of select="." />
    <xsl:text>

</xsl:text>
  </xsl:template>
  <xsl:template match="toggle">
    <xsl:text>**</xsl:text>
    <xsl:apply-templates />
    <xsl:text>**

</xsl:text>
  </xsl:template>
  <xsl:template match="title" />
  <xsl:template match="topic">
    <xsl:apply-templates select="body" />
  </xsl:template>
  <xsl:template match="body">
    <xsl:text>Content-Type: text/x-zim-wiki
Wiki-Format: zim 0.4

</xsl:text>
    <xsl:apply-templates />
  </xsl:template>
</xsl:stylesheet>

I've run into an issue with the extraction of text from certain paragraph elements. Take this XML, for example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../helpproject.xsl" ?>
<topic template="Default" lasteditedby="tlilley" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../helpproject.xsd">
  <title translate="true">New Installs</title>
  <keywords>
    <keyword translate="true">Regional and Language Options</keyword>
  </keywords>
  <body>
    <header>
      <para styleclass="Heading1"><text styleclass="Heading1" translate="true">New Installs</text></para>
    </header>
    <para styleclass="Normal"><table rowcount="1" colcount="2" style="width:100%; cell-padding:6px; cell-spacing:0px; page-break-inside:auto; border-width:1px; border-spacing:0px; cell-border-width:0px; border-color:#000000; border-style:solid; background-color:#fffff0; head-row-background-color:none; alt-row-background-color:none;">
      <tr style="vertical-align:top">
        <td style="vertical-align:middle; width:96px; height:103px;">
          <para styleclass="Normal" style="text-align:center;"><image src="books.png" scale="100.00%" styleclass="Image Caption"></image></para>
        </td>
        <td style="vertical-align:middle; width:1189px; height:103px;">
          <para styleclass="Callouts"><text styleclass="Callouts" style="font-weight:bold;" translate="true">Documentation Convention</text></para>
          <para styleclass="Callouts"><text styleclass="Callouts" translate="true">To make the examples concrete, we refer to the </text><var styleclass="Callouts">Add2Exchange</var><text styleclass="Callouts" translate="true"> Service Account as "zAdd2Exchange" throughout this document.  If your Service Account name is different, substitute that value for "zAdd2Exchange" in all commands and examples.  If you have named your account according to the recommended "zAdd2Exchange", then you may cut and paste any given commands as is.</text></para>
        </td>
      </tr>
    </table></para>
  </body>
</topic>

When the XSLT is run on this document, the final para rule matches the outer para element, and its xsl:value-of pulls out the text of all descendants in one go. The transform is supposed to add a pair of newlines after every extracted paragraph, but it never gets the chance to do so for the embedded <para> elements, because their text has already been extracted at the parent para element.

Note that I don't care about the table tags; I just want to strip them.

Is there a way to construct the para rule so that it properly extracts the directly owned text of a para element, as well as the text of any child para elements, such that each extracted chunk gets the rule's newlines in the output text?
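
One possible approach, offered only as a sketch rather than a definitive answer: treat any para that contains other para elements as a pure container and recurse into it with xsl:apply-templates, so that only the innermost para elements call xsl:value-of and emit the trailing newlines. The table, tr and td elements then fall through to XSLT's built-in rules, which output no markup. The explicit priority values and the rule that silences text-less paras (such as the image-only one) are assumptions added for illustration; they are not part of the original stylesheet.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Heading rules, unchanged from the question. -->
  <xsl:template match="para[@styleclass='Heading1']">
    <xsl:text>====== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> ======&#xA;&#xA;</xsl:text>
  </xsl:template>
  <xsl:template match="para[@styleclass='Heading2']">
    <xsl:text>===== </xsl:text>
    <xsl:value-of select="." />
    <xsl:text> =====&#xA;&#xA;</xsl:text>
  </xsl:template>

  <!-- Assumption: a para with no text at all (for example one that
       holds only an image) should produce no output. -->
  <xsl:template match="para[not(normalize-space())]" priority="2" />

  <!-- Container para: it has nested paras, so recurse instead of
       flattening everything with value-of. It emits no newlines of
       its own; table, tr and td use the built-in rules. -->
  <xsl:template match="para[.//para]" priority="1">
    <xsl:apply-templates />
  </xsl:template>

  <!-- Leaf para: no nested paras, so value-of is safe here and each
       paragraph gets its own pair of newlines. -->
  <xsl:template match="para">
    <xsl:value-of select="." />
    <xsl:text>&#xA;&#xA;</xsl:text>
  </xsl:template>

  <!-- Remaining rules, unchanged from the question. -->
  <xsl:template match="toggle">
    <xsl:text>**</xsl:text>
    <xsl:apply-templates />
    <xsl:text>**&#xA;&#xA;</xsl:text>
  </xsl:template>
  <xsl:template match="title" />
  <xsl:template match="topic">
    <xsl:apply-templates select="body" />
  </xsl:template>
  <xsl:template match="body">
    <xsl:text>Content-Type: text/x-zim-wiki&#xA;Wiki-Format: zim 0.4&#xA;&#xA;</xsl:text>
    <xsl:apply-templates />
  </xsl:template>
</xsl:stylesheet>
```

With the sample topic above, this should give the heading and each of the two Callouts paragraphs its own block followed by a blank line. One limitation of the sketch: text that sits directly inside a container para, next to its nested paras, is still copied by the built-in rules but does not get trailing newlines; the sample has no such text, so handling it would need a further rule.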
