Is there a specific format that all applications can understand (in particular readers for formats like doc and pdf)?
Well, I ran into a lot of problems converting the HTML data on a page to PDF and DOC while making sure the images also appear in the converted article, and I failed.
I understand that XML is something like a foundation.
So is it?
And how do I use it?
I mean, is there any guide on how to generate the XML of the page and then change its extension to the one needed (pdf, doc)?
Using VS08, ASP.NET, C#.
The short answer is no.
If there were such a format, why wouldn't all applications use it in the first place?
A note on different formats
Almost all document applications understand plain text (image applications and the like do not). The problem with plain text is that it contains no formatting: no pictures, no font sizes, no margins, nothing except text. Formatting is also the root cause of there being so many different formats.
Take HTML, for example. HTML is good for flowing text on web sites: a continuous block of text that is navigated with a scrollbar. There are no page breaks, and the text can adapt to different column widths depending on screen size. HTML is also very dynamic; pages can expand sections, replace content and react to user input.
PDF, in contrast, is page oriented, with a fixed width and height for each page. It is also targeted at viewing only. Text wrapping is fixed with explicit line breaks. (Copy text from a PDF into a Word document and insert some text in the middle of a line, and the line breaking will be a real mess.) PDF emulates a printed page with margins and everything.
Somewhere in the middle is the Word document. It is page oriented like PDF, but its shape is not as fixed as a PDF document's, to support a nice editing experience. Sections of text reflow nicely when text is inserted in the middle. It is quite flexible while editing, but the final result is as strict in form as PDF. When you print a Word document, the printout will look exactly like it did on the screen.
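The difference between fixed line breaks (PDF) and reflowing text (Word, HTML) can be sketched in a few lines. This is only an illustration in Python, chosen for brevity; the function names and the 40-character page width are made up for the example and have nothing to do with how a real PDF or Word file is stored.

```python
import textwrap

WIDTH = 40  # assumed page width in characters, for illustration only
PARAGRAPH = (
    "PDF emulates a printed page, so every line break is stored "
    "explicitly instead of being computed when the text is shown."
)

def fixed_lines(text, width=WIDTH):
    """Break the text once and keep the breaks, as a page-oriented format does."""
    return textwrap.wrap(text, width)

def insert_fixed(lines, new_text):
    """Insert text into the first line without touching any other line break."""
    return [new_text + " " + lines[0]] + lines[1:]

def insert_flowing(text, new_text, width=WIDTH):
    """Insert the same text, then recompute every line break (reflow)."""
    return textwrap.wrap(new_text + " " + text, width)

if __name__ == "__main__":
    print("\n".join(insert_fixed(fixed_lines(PARAGRAPH), "Note that")))
    print()
    print("\n".join(insert_flowing(PARAGRAPH, "Note that")))
```

In the fixed version the edited line overflows the page width while every other line stays where it was; in the flowing version all lines are rebroken to fit.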
XML
XML is a very general format; you can think of it as a format for formats. XML in itself does not say anything about the content, so you need a separate description of how to interpret the XML for a given application. There are specifications like DocBook that specify how to describe a document in XML. But that is not an exact description of how the document will look: it separates content from layout. You need to apply a layout or template to generate a visible output format. From a DocBook XML source you can generate PDF, HTML, etc.
There is no given way of converting a given document format to XML, not even to a given XML format like DocBook. XML-based formats can be used as a source format to generate different viewable formats.
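To make the "format for formats" idea concrete, here is a minimal sketch in Python (used only for brevity; the same idea applies in C#). The element names `article`, `section`, `title`, `para` and `image` are invented for this example and are not DocBook. The point is that the XML only records what the pieces are and how they nest; nothing in it says how they should look.

```python
import xml.etree.ElementTree as ET

# Hypothetical document vocabulary, invented for this example (not DocBook).
SOURCE = """
<article>
  <title>Publishing to several formats</title>
  <section>
    <title>Why plain text is not enough</title>
    <para>Plain text carries no formatting.</para>
    <image src="formats.png" alt="Format comparison"/>
  </section>
</article>
"""

def outline(xml_text):
    """Return the document structure as (depth, kind, value) tuples."""
    root = ET.fromstring(xml_text.strip())
    rows = []

    def walk(node, depth):
        if node.tag in ("title", "para"):
            rows.append((depth, node.tag, node.text))
        elif node.tag == "image":
            rows.append((depth, "image", node.get("src")))
        for child in node:
            walk(child, depth + 1)

    walk(root, 0)
    return rows

if __name__ == "__main__":
    for depth, kind, value in outline(SOURCE):
        print("  " * depth + f"{kind}: {value}")
```

Renaming the file from `.xml` to `.pdf` or `.doc` would not change any of this; a separate step still has to decide fonts, margins and page breaks.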
A note on conversion
The problem of converting formats to each other comes from the different purposes and strengths of each format. One format is often simply not suited to describing the properties of another correctly, or not able to describe them at all. There is no general method of converting between formats, because formats like PDF do not reveal the document structure in a reusable way.
How to publish to different formats
The key to success when publishing to different formats is to separate content from layout. You need to specify what text you have, how it is structured (headers, sections, etc.), what images you have and how they relate to your sections of text. The text and structure description may be in XML, in a database or somewhere else.
Then you need a tool that generates each output format from that content using a template.
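As a rough sketch of that approach, the Python below (again only for brevity) keeps the content in one plain data structure and renders it through two different layouts. The `ARTICLE` structure and the `render_html` and `render_text` functions are hypothetical names made up for this example, not part of any library.

```python
from html import escape

# Hypothetical content model: structure and text only, no layout decisions.
ARTICLE = {
    "title": "Quarterly report",
    "sections": [
        {
            "heading": "Summary",
            "paragraphs": ["Sales grew in Q3 & Q4."],
            "images": [{"src": "chart.png", "alt": "Sales chart"}],
        },
    ],
}

def render_html(article):
    """Layout 1: HTML, where images become <img> tags."""
    parts = [f"<h1>{escape(article['title'])}</h1>"]
    for section in article["sections"]:
        parts.append(f"<h2>{escape(section['heading'])}</h2>")
        for text in section["paragraphs"]:
            parts.append(f"<p>{escape(text)}</p>")
        for image in section["images"]:
            parts.append(
                f'<img src="{escape(image["src"])}" alt="{escape(image["alt"])}">'
            )
    return "\n".join(parts)

def render_text(article):
    """Layout 2: plain text, where an image can only be mentioned by name."""
    lines = [article["title"], "=" * len(article["title"]), ""]
    for section in article["sections"]:
        lines += [section["heading"], "-" * len(section["heading"]), ""]
        lines += section["paragraphs"]
        lines += [f"[image: {image['alt']}]" for image in section["images"]]
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_html(ARTICLE))
    print()
    print(render_text(ARTICLE))
```

A PDF or Word output would be one more renderer over the same data, written with whatever document library you choose; the content itself would not need to change.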
Side note on image formats
Image formats, on the other hand, are much easier to convert between (as long as you convert pixel-based formats to pixel-based formats and vector-based formats to vector-based formats), since the end result is the same picture. The difference between image formats is mainly the compression algorithm used. When an image is decompressed, the original image with all of its information is restored (except for minor compression artifacts in lossy formats).