Convert XML to plain text
My goal is to build an engine that takes the latest HL7 3.0 CDA documents and makes them backward compatible with HL7 2.5, which is a radically different beast.
The CDA document is an XML file which, when paired with its matching XSL file, renders an HTML document fit for display to the end user.
For HL7 2.5, I need to get the rendered text, devoid of any markup, and fold it into a text stream (or similar) that I can write out in 80-character lines to populate the HL7 2.5 message.
So far, I'm using XslCompiledTransform to transform my XML document with XSLT and produce a resultant HTML document.
My next step is to take that document (or perhaps intervene at a step before this) and render the HTML as text. I have searched for a while, but can't figure out how to accomplish this. I'm hoping it's something easy that I'm just overlooking, or that I just can't find the magical search terms. Can anyone offer some help?
FWIW, I've read the 5 or 10 other questions on SO which embrace or admonish using RegEx for this, and I don't think I want to go down that road. I need the rendered text.
    using System;
    using System.IO;
    using System.Xml;
    using System.Xml.Xsl;
    using System.Xml.XPath;

    public class TransformXML
    {
        public static void Main(string[] args)
        {
            try
            {
                string sourceDoc = "C:\\CDA_Doc.xml";
                string resultDoc = "C:\\Result.html";
                string xsltDoc = "C:\\CDA.xsl";

                XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
                XslCompiledTransform myXslTransform = new XslCompiledTransform();
                myXslTransform.Load(xsltDoc);

                // Create the writer from the stylesheet's own output settings so that
                // xsl:output (e.g. method="html") is honoured; a bare XmlTextWriter ignores it.
                using (XmlWriter writer = XmlWriter.Create(resultDoc, myXslTransform.OutputSettings))
                {
                    myXslTransform.Transform(myXPathDocument, null, writer);
                }

                using (StreamReader stream = new StreamReader(resultDoc))
                {
                    // TODO: render the HTML as plain text (this is the open question).
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("Exception: {0}", e.ToString());
            }
        }
    }
Since you have the XML source, consider writing an XSL that will give you the output you want without the intermediate HTML step. It would be far more reliable than trying to transform the HTML.
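To make the suggestion concrete, here is a minimal sketch of a stylesheet that emits text directly, run through XslCompiledTransform. The class name `CdaToText`, the inline stylesheet, and the choice of `title` and `paragraph` as the elements worth printing are assumptions for illustration; a real CDA narrative block also contains lists, tables, and other elements that would need their own handling.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class CdaToText
{
    // Hypothetical stylesheet: prints each title and paragraph on its own line.
    // The element selection is an assumption, not a complete CDA mapping.
    private const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:cda='urn:hl7-org:v3'>
  <xsl:output method='text'/>
  <xsl:template match='/'>
    <xsl:for-each select='//cda:title | //cda:paragraph'>
      <xsl:value-of select='normalize-space(.)'/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string cdaXml)
    {
        var xslt = new XslCompiledTransform();
        using (var xslReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(xslReader);
        }

        using (var input = XmlReader.Create(new StringReader(cdaXml)))
        using (var output = new StringWriter())
        {
            // Writing to a TextWriter lets the transform apply xsl:output method='text'.
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling `CdaToText.Transform(File.ReadAllText("C:\\CDA_Doc.xml"))` would then return the text with no HTML step in between. The line terminator in the result may depend on the platform's writer settings, so split on both `\r` and `\n` when post-processing.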
This will leave you with just the text:
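One way to do that, sketched under the assumption that the transform output is well-formed XHTML (no DOCTYPE, no HTML-only entities such as `&nbsp;`), is to load it as XML and read the concatenated text nodes. `MarkupStripper` and `TextOnly` are hypothetical names.

```csharp
using System;
using System.Xml;

public static class MarkupStripper
{
    // Returns the concatenated text nodes of a well-formed XML/XHTML string.
    // Assumes the input parses as XML; ordinary tag-soup HTML will throw.
    public static string TextOnly(string wellFormedMarkup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup);
        return doc.DocumentElement.InnerText;
    }
}
```

Note that InnerText inserts nothing between elements, so the text of adjacent block elements (a heading followed by a paragraph, say) runs together. If line breaks matter, they have to come from the stylesheet or from a walk over the nodes.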
Or you can use a regular expression:
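A minimal sketch of the regular-expression route, with the usual caveats: the pattern below is an assumption, not a full HTML parser, and it will mishandle script or style content, comments, and attribute values that contain `>`. `TagStripper` is a hypothetical name.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string StripTags(string html)
    {
        // Replace every tag with a space so adjacent elements do not run together.
        string noTags = Regex.Replace(html, "<[^>]+>", " ");
        // Turn entities such as &amp; back into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse runs of whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

This flattens everything to a single line, which is acceptable if the text is going to be re-folded into fixed-width lines afterwards anyway.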
Could you use something like this, which uses lynx and Perl to render the HTML and then convert it to plain text?
This is a great use case for XSL-FO and FOP. FOP isn't just for PDF output; one of the other major outputs it supports is text. You should be able to construct a simple XSLT + FO stylesheet that has the specifications (e.g. line width) that you want.
This solution is a bit more heavyweight than just using xml->xslt->text as ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indenting), they will be much easier to express in FO than to mock up in XSLT.
I would avoid regexes for extracting the text. That's too low-level and guaranteed to be brittle. If you just want text and 80-character lines, the default XSLT templates print only the text content of elements. Once you have only the text, you can apply whatever text processing is necessary.
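As an example of that last text-processing step, here is a minimal sketch of folding plain text into lines of at most 80 characters. `LineFolder` and `Fold` are hypothetical names, and the policy of breaking on whitespace and hard-splitting any single token longer than the width is an assumption; the HL7 2.5 side may call for different rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineFolder
{
    // Greedy word wrap: packs whitespace-separated words into lines of at most `width` chars.
    public static List<string> Fold(string text, int width = 80)
    {
        var lines = new List<string>();
        var current = new StringBuilder();

        // Passing a null separator array splits on any whitespace.
        foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string rest = word;

            // Start a new line if the word (plus a separating space) would not fit.
            if (current.Length > 0 && current.Length + 1 + rest.Length > width)
            {
                lines.Add(current.ToString());
                current.Clear();
            }

            // A single token longer than the width is cut into width-sized pieces.
            while (rest.Length > width)
            {
                lines.Add(rest.Substring(0, width));
                rest = rest.Substring(width);
            }

            if (current.Length > 0) current.Append(' ');
            current.Append(rest);
        }

        if (current.Length > 0) lines.Add(current.ToString());
        return lines;
    }
}
```

For instance, `LineFolder.Fold("aaa bbb ccc", 7)` yields the two lines `aaa bbb` and `ccc`.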
Incidentally, I work for a company that produces CDAs as part of our product (voice recognition for dictations). I would look into an XSLT that transforms the 3.0 directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route will probably be your easiest bet if what you really want to achieve is conversion between the formats. That's what XSLT was built to do.