Converting XML to plain text - how should I ignore/handle whitespace in XSLT?

Posted 2024-07-06 17:26:12

I'm trying to convert an XML file into the markup used by dokuwiki, using XSLT. This actually works to some degree, but the indentation in the XSL file is getting inserted into the results. At the moment, I have two choices: abandon this XSLT thing entirely, and find another way to convert from XML to dokuwiki markup, or delete about 95% of the whitespace from the XSL file, making it nigh-unreadable and a maintenance nightmare.

Is there some way to keep the indentation in the XSL file without passing all that whitespace on to the final document?

Background: I'm migrating an autodoc tool from static HTML pages over to dokuwiki, so the API developed by the server team can be further documented by the applications team whenever the apps team runs into poorly-documented code. The logic is to have a section of each page set aside for the autodoc tool, and to allow comments anywhere outside this block. I'm using XSLT because we already have the XSL file to convert from XML to XHTML, and I'm assuming it will be faster to rewrite the XSL than to roll my own solution from scratch.

Edit: Ah, right, foolish me, I neglected the indent attribute. (Other background note: I am new to XSLT.) On the other hand, I still have to deal with newlines. Dokuwiki uses pipes to differentiate between table columns, which means that all of the data in a table row must be on one line. Is there a way to suppress newlines from being output (just occasionally), so I can do some fairly complex logic for each table cell in a somewhat readable fashion?

Comments (4)

心如荒岛 2024-07-13 17:26:12

There are three reasons for getting unwanted whitespace in the result of an XSLT transformation:

  1. whitespace that comes from between nodes in the source document
  2. whitespace that comes from within nodes in the source document
  3. whitespace that comes from the stylesheet

I'm going to talk about all three because it can be hard to tell where whitespace comes from, so you might need to use several strategies.

To address the whitespace that is between nodes in your source document, you should use <xsl:strip-space> to strip out any whitespace that appears between two nodes, and then use <xsl:preserve-space> to preserve the significant whitespace that might appear within mixed content. For example, if your source document looks like:

<ul>
  <li>This is an <strong>important</strong> <em>point</em></li>
</ul>

then you will want to ignore the whitespace between the <ul> and the <li> and between the </li> and the </ul>, which is not significant, but preserve the whitespace between the <strong> and <em> elements, which is significant (otherwise you'd get "This is an **important***point*"). To do this, use

<xsl:strip-space elements="*" />
<xsl:preserve-space elements="li" />

The elements attribute on <xsl:preserve-space> should basically list all the elements in your document that have mixed content.

Aside: using <xsl:strip-space> also reduces the size of the source tree in memory, and makes your stylesheet more efficient, so it's worth doing even if you don't have whitespace problems of this sort.
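To see the two declarations working together, here is a minimal sketch of a complete stylesheet for the <ul> example above. The dokuwiki markers it emits ("  * " for a list item, ** for bold, // for italics) and the choice of templates are illustrative assumptions, not part of the original answer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Drop whitespace-only text nodes everywhere in the source... -->
  <xsl:strip-space elements="*"/>
  <!-- ...except directly inside <li>, which holds mixed content. -->
  <xsl:preserve-space elements="li"/>

  <!-- One bullet per list item, terminated by an explicit newline. -->
  <xsl:template match="li">
    <xsl:text>  * </xsl:text>
    <xsl:apply-templates/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Assumed mapping: <strong> becomes dokuwiki bold. -->
  <xsl:template match="strong">
    <xsl:text>**</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>**</xsl:text>
  </xsl:template>

  <!-- Assumed mapping: <em> becomes dokuwiki italics. -->
  <xsl:template match="em">
    <xsl:text>//</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>//</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With the sample <ul> as input, the indentation around the <li> should disappear, while the single space between the <strong> and <em> elements survives because it sits inside an <li>.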

To address the whitespace that appears within nodes in your source document, you should use normalize-space(). For example, if you have:

<dt>
  a definition
</dt>

and you can be sure that the <dt> element won't hold any elements that you want to do something with, then you can do:

<xsl:template match="dt">
  ...
  <xsl:value-of select="normalize-space(.)" />
  ...
</xsl:template>

The leading and trailing whitespace will be stripped from the value of the <dt> element (and any internal runs of whitespace collapsed to single spaces), so you will just get the string "a definition".

Whitespace coming from the stylesheet, which is perhaps what you're experiencing, arises when you have text within a template like this:

<xsl:template match="name">
  Name:
  <xsl:value-of select="." />
</xsl:template>

XSLT stylesheets are parsed in the same way as the source documents that they process, so the above XSLT is interpreted as a tree that holds an <xsl:template> element with a match attribute whose first child is a text node and whose second child is an <xsl:value-of> element with a select attribute. The text node has leading and trailing whitespace (including line breaks); since it's literal text in the stylesheet, it gets literally copied over into the result, with all the leading and trailing whitespace.

But some whitespace in XSLT stylesheets gets stripped automatically, namely text nodes that consist only of whitespace and sit between elements. That is why the line break between the <xsl:value-of> and the close of the <xsl:template> does not show up in your result.

To get only the text you want in the result, use the <xsl:text> element like this:

<xsl:template match="name">
  <xsl:text>Name: </xsl:text>
  <xsl:value-of select="." />
</xsl:template>

The XSLT processor will ignore the line breaks and indentation that appear between nodes, and only output the text within the <xsl:text> element.
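The same idea applies to the dokuwiki table rows mentioned in the question: as long as every piece of literal output is wrapped in <xsl:text>, the logic for each cell can be spread over as many indented lines as you like and the row still comes out on one line. The sketch below assumes a hypothetical source element <param> with <name>, <type> and <desc> children; those names are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Hypothetical input: <param><name/><type/><desc/></param> -->
  <xsl:template match="param">
    <xsl:text>| </xsl:text>
    <xsl:value-of select="normalize-space(name)"/>
    <xsl:text> | </xsl:text>
    <!-- Multi-line logic for one cell; the whitespace-only text nodes
         between these instructions are stripped from the stylesheet. -->
    <xsl:choose>
      <xsl:when test="type">
        <xsl:value-of select="normalize-space(type)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>(untyped)</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text> | </xsl:text>
    <xsl:value-of select="normalize-space(desc)"/>
    <!-- The only newline in the row is the explicit one at the end. -->
    <xsl:text> |&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Using normalize-space() on each cell also guards against line breaks that come from the source document rather than from the stylesheet.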

幻想少年梦 2024-07-13 17:26:12

Are you using indent="no" in your output tag?

<xsl:output method="text" indent="no" />

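Also, if you're using xsl:value-of, you can try disable-output-escaping="yes", though that attribute controls the escaping of characters such as < and & rather than whitespace, so it may not help with this particular problem.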

一枫情书 2024-07-13 17:26:12

@JeniT's answer is great; I just want to point out a trick for managing whitespace. I'm not certain it's the best way (or even a good way), but it works for me for now.

("s" for space, "e" for empty, "n" for newline.)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:transform [
  <!ENTITY s "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'> </xsl:text>" >
  <!ENTITY s2 "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>  </xsl:text>" >
  <!ENTITY s4 "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>    </xsl:text>" >
  <!ENTITY s6 "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>      </xsl:text>" >
  <!ENTITY e "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'></xsl:text>" >
  <!ENTITY n "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
</xsl:text>" >
]>

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:output method="text"/>
<xsl:template match="/">
  &e;Flush left, despite the indentation.&n;
  &e;  This line will be output indented two spaces.&n;

      <!-- the blank lines above/below won't be output -->

  <xsl:for-each select="//foo">
    &e;  Starts with two blanks: <xsl:value-of select="@bar"/>.&n;
    &e;  <xsl:value-of select="@baz"/> The 'e' trick won't work here.&n;
    &s2;<xsl:value-of select="@baz"/> Use s2 instead.&n;
    &s2;    <xsl:value-of select="@abc"/>    <xsl:value-of select="@xyz"/>&n;
    &s2;    <xsl:value-of select="@abc"/>&s;<xsl:value-of select="@xyz"/>&n;
  </xsl:for-each>
</xsl:template>
</xsl:transform>

Applied to:

<?xml version="1.0" encoding="UTF-8"?>
<foo bar="bar" baz="baz" abc="abc" xyz="xyz"></foo>

Outputs:

Flush left, despite the indentation.
  This line will be output indented two spaces.
  Starts with two blanks: bar.
baz The 'e' trick won't work here.
  baz Use s2 instead.
  abcxyz
  abc xyz

The 'e' trick works prior to a text node containing at least one non-whitespace character because it expands to this:

<xsl:template match="/">
  <xsl:text></xsl:text>Flush left, despite the indentation.<xsl:text>
</xsl:text>

Since the rules for stripping whitespace say that whitespace-only text nodes get stripped, the newline and indentation between the <xsl:template> and <xsl:text> get stripped (good). Since the rules say a text node with at least one non-whitespace character is preserved, the implicit text node containing "  This line will be output indented two spaces." keeps its leading whitespace (but I guess this also depends on the settings for strip/preserve/normalize). The "&n;" at the end of the line inserts a newline, but it also ensures that any following whitespace is ignored, because it appears between two nodes.

The trouble I have is when I want to output an indented line that begins with an <xsl:value-of>. In that case, the "&e;" won't help, because the indentation whitespace isn't "attached" to any non-whitespace characters. So for those cases, I use "&s2;" or "&s4;", depending on how much indentation I want.

It's an ugly hack I'm sure, but at least I don't have the verbose "<xsl:text>" tags littering my XSLT, and at least I can still indent the XSLT itself so it's legible. I feel like I'm abusing XSLT for something it was not designed for (text processing) and this is the best I can do.


Edit:
In response to comments, this is what it looks like without the "macros":

<xsl:template match="/">
  <xsl:text>Flush left, despite the indentation.</xsl:text>
  <xsl:text>  This line will be output indented two spaces.</xsl:text>
  <xsl:for-each select="//foo">
    <xsl:text>  Starts with two blanks: </xsl:text><xsl:value-of select="@bar"/>.<xsl:text>
</xsl:text>
    <xsl:text>    </xsl:text><xsl:value-of select="@abc"/><xsl:text> </xsl:text><xsl:value-of select="@xyz"/><xsl:text>
</xsl:text>
  </xsl:for-each>
</xsl:template>

I think that makes it less clear to see the intended output indentation, and it screws up the indentation of the XSL itself because the </xsl:text> end tags have to appear at column 1 of the XSL file (otherwise you get undesired whitespace in the output file).
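One way around the column-1 problem, offered here as a sketch rather than as part of the original answer, is to write the newline as the character reference &#10; instead of a literal line break. The end tag can then stay on the same line and the stylesheet can be indented normally. The template below is wrapped in a full stylesheet only so that it can be run on its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- &#10; is a newline; no literal line break inside xsl:text. -->
    <xsl:text>Flush left, despite the indentation.&#10;</xsl:text>
    <xsl:text>  This line will be output indented two spaces.&#10;</xsl:text>
    <xsl:for-each select="//foo">
      <xsl:text>  Starts with two blanks: </xsl:text>
      <xsl:value-of select="@bar"/>
      <xsl:text>.&#10;</xsl:text>
      <!-- Whitespace inside xsl:text is always kept, so this indents. -->
      <xsl:text>    </xsl:text>
      <xsl:value-of select="@abc"/>
      <xsl:text> </xsl:text>
      <xsl:value-of select="@xyz"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

It is still more verbose than the entity "macros", but each instruction can sit on its own indented line without adding whitespace to the output.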

情痴 2024-07-13 17:26:12

Regarding your edit about new lines, you can use this template to recursively replace one string within another string, and you can use it for line breaks:

<xsl:template name="replace.string.section">
  <xsl:param name="in.string"/>
  <xsl:param name="in.characters"/>
  <xsl:param name="out.characters"/>
  <xsl:choose>
    <xsl:when test="contains($in.string,$in.characters)">
      <xsl:value-of select="concat(substring-before($in.string,$in.characters),$out.characters)"/>
      <xsl:call-template name="replace.string.section">
        <xsl:with-param name="in.string" select="substring-after($in.string,$in.characters)"/>
        <xsl:with-param name="in.characters" select="$in.characters"/>
        <xsl:with-param name="out.characters" select="$out.characters"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$in.string"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template> 

Call it as follows (this example replaces line breaks in the $some.string variable with a space):

    <xsl:call-template name="replace.string.section">
        <xsl:with-param name="in.string" select="$some.string"/>
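        <!-- Use the character reference &#10;: a literal line break inside an
             attribute value is normalized to a space by the XML parser. -->
        <xsl:with-param name="in.characters" select="'&#10;'"/>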
        <xsl:with-param name="out.characters" select="' '"/>
    </xsl:call-template>
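
When the thing being replaced is a single character, as with line breaks, the built-in XPath 1.0 function translate() may be a simpler option than a recursive template. The sketch below is an alternative, not the answer's own method; the <cell> element is a hypothetical name, and $some.string is set up only so that the example is self-contained.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical element whose text may contain line breaks. -->
  <xsl:template match="cell">
    <xsl:variable name="some.string" select="string(.)"/>
    <!-- Map line feed (&#10;) and carriage return (&#13;) to a space each;
         the third argument holds two spaces, one per mapped character. -->
    <xsl:value-of select="translate($some.string, '&#10;&#13;', '  ')"/>
  </xsl:template>
</xsl:stylesheet>
```

translate() works character by character, so the recursive template is still the tool to reach for when either the search or the replacement is a longer string.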