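What is the best way to convert a PDF file to HTML in ASP.NET?
All my users need to do is select a PDF document on their computer and upload it to my website, where I will convert it to an HTML document for display on the site. The converted document will be stored in a database.
What is the best way to convert PDF to HTML?
I have a requirement where a user will create a "news" story as a PDF and then upload it to the server, where it will be converted to HTML and displayed on the website.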
Comments (6)
Most document creation software that can save documents as PDF can also save them as HTML. I'm assuming the issue is that your users will be creating rich documents (lots of embedded images), which results in multiple files, and that your requirement stems from a desire to make uploading these documents as simple as possible for the user.
There are numerous conversion packages that can probably do this for you. However, when you're talking about rich content, you are talking about text plus images. Those images have to be stored somewhere and served somehow, and whatever conversion method you use will require you to examine all image sources to make sure they point to valid locations on your server.
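To make the image-source check concrete, here is a minimal sketch of rewriting every `<img src>` in converted HTML so it points at one folder on your server. It is written in Python only for brevity; the same idea ports directly to C#. The base URL and the function name are hypothetical, and a regex is used for illustration rather than a full HTML parser.

```python
import re
from urllib.parse import urljoin

# Hypothetical base URL where the converted document's images are served.
IMAGE_BASE = "https://example.com/uploads/news-42/"

# Captures: everything up to src=, the quote character, and the src value.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2', re.IGNORECASE)

def rewrite_image_sources(html: str, base: str = IMAGE_BASE) -> str:
    """Point every <img src> at `base`, keeping only the file name."""
    def fix(match):
        prefix, quote, src = match.groups()
        if src.startswith("data:"):
            return match.group(0)  # leave inline (embedded) images untouched
        # Drop any local or relative directory part, Windows or POSIX style.
        file_name = src.replace("\\", "/").rsplit("/", 1)[-1]
        return f"{prefix}{quote}{urljoin(base, file_name)}{quote}"
    return IMG_SRC.sub(fix, html)
```

You would still need to copy the image files themselves into that folder; this only fixes the references.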
I would like to suggest an alternate way of doing this that you can take to your team: implement one of the many blog APIs for publishing content. There are free and commercial software packages that use these APIs to publish content directly to a website, such as Windows Live Writer and Microsoft Word. Your users can simply create their content and upload it directly to your website without having to publish it as a PDF first and then upload it. The process becomes much smoother for your users, and you get the posts in a form that doesn't require you to spend thousands of dollars on developing or buying conversion code.
The two most common APIs are the MetaWeblog API and the Movable Type API. Both are very simple and easy to implement. I think this way would be a MUCH better alternative than what you're thinking about doing.
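For a sense of how small the MetaWeblog API is, here is a rough sketch of the XML-RPC request a client sends for `metaWeblog.newPost`. It uses Python's standard `xmlrpc.client` purely to build the payload, with no network call; the blog id, credentials, and post content are made-up values, and your ASP.NET endpoint would be the thing that parses this request.

```python
import xmlrpc.client

def build_new_post_request(blog_id, username, password, title, html_body, publish=True):
    """Serialize a metaWeblog.newPost call to its XML-RPC request body."""
    # In MetaWeblog the post is a struct; "description" carries the post body.
    post = {"title": title, "description": html_body}
    params = (blog_id, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")

# Hypothetical values, for illustration only.
request_xml = build_new_post_request(
    "1", "editor", "secret", "Launch day", "<p>We shipped.</p>"
)
```

The post body arrives as HTML already, which is the point: there is nothing to convert.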
I don't think converting a PDF to an HTML string is necessarily the best idea, especially if you want to export it back as a PDF. PDF files often contain binary elements such as images, so you may be best off converting the file to ASCII via an encoding such as Base64. That way you will have an ASCII string that you can save into a text field in the DB and convert back later. Could you expand more on the main requirement?
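A minimal sketch of that round trip, in Python for brevity (in C# the equivalents would be `Convert.ToBase64String` and `Convert.FromBase64String`). The function names and the sample bytes are made up for illustration.

```python
import base64

def pdf_to_text_field(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII string that fits a text column."""
    return base64.b64encode(pdf_bytes).decode("ascii")

def text_field_to_pdf(stored: str) -> bytes:
    """Decode the stored string back into the original PDF bytes."""
    return base64.b64decode(stored)

# Stand-in for an uploaded file; a real PDF would be read from the upload stream.
sample = b"%PDF-1.4\n\x00\xff binary payload"
encoded = pdf_to_text_field(sample)
restored = text_field_to_pdf(encoded)
```

Keep in mind that Base64 emits 4 characters for every 3 bytes, so the stored string is about a third larger than the file.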
My recommendation would be to not do it that way IF POSSIBLE (but we all know what managers are like) so...
I would recommend that you stay away from converting the PDF to/from HTML (because unless you can find a commercial solution it will be nigh on impossible). Instead, do as has already been mentioned: store it in the database as a Base64-encoded string, a BLOB, or some other binary format, and then display it to the user with some sort of PDF viewer plugin for the browser.
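For the display side, the main thing is to send the stored bytes back with headers that let the browser's PDF viewer render them in place. Here is a small sketch as a Python WSGI app (an ASP.NET handler would set the same headers); `make_pdf_app` and the default file name are hypothetical.

```python
def make_pdf_app(pdf_bytes: bytes, file_name: str = "story.pdf"):
    """Build a WSGI app that serves one stored PDF for in-browser viewing."""
    def app(environ, start_response):
        headers = [
            ("Content-Type", "application/pdf"),
            # "inline" asks the browser to display the file rather than download it.
            ("Content-Disposition", f'inline; filename="{file_name}"'),
            ("Content-Length", str(len(pdf_bytes))),
        ]
        start_response("200 OK", headers)
        return [pdf_bytes]
    return app
```

To try it locally you could pass the app to `wsgiref.simple_server.make_server`, after loading the bytes from wherever you stored them.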
All it took was a simple Google search for "PDF to HTML": http://www.gnostice.com/pdf2manyOverview_x.asp. I'm sure there are others.
So while it's 'possible', you may want to explain to your manager that this isn't the best content management solution.
Why not use iTextSharp to read the PDF content? Then you could save both the binary PDF and the text content to the database. You could then let users search the content and download the PDF.
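Here is a rough sketch of the storage half of that idea, using Python and SQLite for brevity. It assumes the text has already been extracted (by iTextSharp or anything else); the table layout and function names are made up for illustration.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) stories table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, "
        "pdf BLOB NOT NULL, body_text TEXT NOT NULL)"
    )
    return conn

def add_story(conn, title, pdf_bytes, body_text):
    """Store the original PDF next to its already-extracted text."""
    cur = conn.execute(
        "INSERT INTO stories (title, pdf, body_text) VALUES (?, ?, ?)",
        (title, pdf_bytes, body_text),
    )
    conn.commit()
    return cur.lastrowid

def search_stories(conn, term):
    """Return (id, title) for stories whose text contains the term."""
    rows = conn.execute(
        "SELECT id, title FROM stories WHERE body_text LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()

def fetch_pdf(conn, story_id):
    """Return the stored PDF bytes for download, or None if missing."""
    row = conn.execute(
        "SELECT pdf FROM stories WHERE id = ?", (story_id,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

A plain `LIKE` scan is fine for a handful of news stories; a full-text index would be the next step if the volume grows.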
You should look into DynamicPDF. They have a converter (currently in beta) that serves exactly this purpose. We have used their products with great success (especially for dumping Reporting Services reports directly to PDF).
Ref: http://www.dynamicpdf.com/