Cross-reference streams are not supported yet
I'm new to the Zend Framework so my apologies if I'm missing something simple. However, I would have thought that code taken directly from the documentation would work. Instead I'm getting an uncaught exception.
Fatal error: Uncaught exception 'Zend_Pdf_Exception' with message 'Cross-reference streams are not supported yet.' in C:\xampp\php\zend\library\Zend\Pdf\Parser.php:318
Stack trace:
#0 C:\xampp\php\zend\library\Zend\Pdf\Parser.php(460): Zend_Pdf_Parser->_loadXRefTable('116')
#1 C:\xampp\php\zend\library\Zend\Pdf.php(318): Zend_Pdf_Parser->__construct('PDF/Current...', Object(Zend_Pdf_ElementFactory_Proxy), true)
#2 C:\xampp\php\zend\library\Zend\Pdf.php(267): Zend_Pdf->__construct('PDF/Current...', NULL, true)
#3 C:\xampp\htdocs\test\test.php(7): Zend_Pdf::load('PDF/Current...')
#4 {main}
thrown in C:\xampp\php\zend\library\Zend\Pdf\Parser.php on line 318
I've been reading around looking for a possible solution to this, but have had little luck. This is the most similar question I found, and it does not solve my problem. From what I've read there and in other sources, PDF versions 1.4 and older should work fine, but that is not the case here, and that post is years old. My PDFs are all version 1.4, so I'm not even sure how accurate that post is anyway. The code works for the PDF included in the demo, but not on any of the existing ones I'm trying to use. I would upload the PDFs, but they are all confidential.
I'm only trying to get the metadata, but I am not even able to load the document. I started using a framework so I wouldn't have to create my own parser. If there is a simpler way to do this, or if someone can shed some light on this, I would be much obliged.
Edit: for clarification, I've tried both methods from the linked documentation page. Neither works.
Comments (4)
In my case, it worked when I converted the PDF to version 1.4 (from 1.6). I used the command from here: https://superuser.com/questions/25598/linux-pdf-version-converter
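The command from the linked question is not reproduced here. One widely used way to rewrite a PDF at an older compatibility level is Ghostscript's pdfwrite device, so the sketch below builds such a command from PHP. It assumes Ghostscript is installed and reachable as `gs` (on Windows the console executable is usually named `gswin64c`); the helper name `buildPdf14Command` and the file names are made up for illustration.

```php
<?php
// Hypothetical helper: build a Ghostscript command line that rewrites a PDF
// at compatibility level 1.4. Assumes the "gs" binary is on the PATH.
function buildPdf14Command(string $input, string $output, string $binary = 'gs'): string
{
    return $binary
        . ' -sDEVICE=pdfwrite -dCompatibilityLevel=1.4'
        . ' -dNOPAUSE -dQUIET -dBATCH'
        . ' -sOutputFile=' . escapeshellarg($output)   // quote paths for the shell
        . ' ' . escapeshellarg($input);
}

// Usage (not executed here, since it needs Ghostscript and a real file):
// exec(buildPdf14Command('in.pdf', 'out-1.4.pdf'), $outputLines, $exitCode);
```

Whether the rewritten file then loads in Zend_Pdf depends on the document, so treat this as something to try rather than a guaranteed fix.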
I ended up having to create my own parser for this. If anyone finds this and has any further suggestions or questions about how I did it, just add a comment.
Solution
I'm not going to upload the whole code, as it's really long, very messy, and inefficient. I've grown a bit as a developer since the initial post and have been meaning to go back and take another swing at it. So I'll use this post to explain what I have, point out some of the problems and solutions I have found, and make some comments on how to make it more efficient. Hopefully this will make it easier for you, and hopefully it will inspire me to make some changes. Disclaimer: it has been months since I last looked at this code, so don't expect me to remember everything. However, I was pretty good about documenting my code and findings (for once), so what I'm not remembering is mostly minor.
The most important thing I can tell you is to look at the raw XML, take notes, and compare a few of your files. Adobe apparently couldn't make up their mind when creating the metadata syntax, so you will end up having to add multiple checks for all the different revisions (I'll give an example later). Actually finding the metadata in the document is pretty easy. Adobe gives you a nice set of begin/end tags, so you just iterate over the document until you find them. Here's a cleaned up and generalized sample from one of the PDFs I'm parsing.
The best way to view the raw XML data is to download Notepad++ (though you could use any Notepad-like program) and open the PDFs in that. The first thing you will see is the PDF version, "%PDF-1.4" in this case, followed by a lot of confusing-looking characters. Ignore those, but note the PDF version. Notice the "xpacket" tags in the sample above; that's what you are going to need to look for every time you want to find the metadata. Just Ctrl+F to find "xmpmeta"; the first occurrence should be your metadata.
Word of caution: don't attempt to use password-protected documents. Everything is obfuscated, including the metadata, which means PHP can't read it either. I believe there is an option to allow reading the metadata in password-protected PDFs, but I can't remember for sure, nor do I know if it actually works for PHP.
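Checking the version by hand can also be done in code. The following is a minimal sketch, assuming the file starts with the usual "%PDF-x.y" header line; the function name `readPdfVersion` is hypothetical and not part of Zend Framework.

```php
<?php
// Hypothetical helper: read the "%PDF-x.y" header from the start of a file.
// Returns the version string (for example "1.4"), or null if no header is found.
function readPdfVersion(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $firstLine = fgets($handle, 32);   // the header fits well within 32 bytes
    fclose($handle);

    if ($firstLine !== false && preg_match('/^%PDF-(\d+\.\d+)/', $firstLine, $m)) {
        return $m[1];
    }
    return null;
}
```

Note that the header only states the declared version; as the question shows, a file declaring 1.4 can still fail to load.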
Just as you can Ctrl+F to find the metadata in Notepad++, you can do the same thing in PHP with fgets() and a while loop. Something I didn't do, but which would probably be a good idea, is to determine which end of the document to start from. This isn't universal across PDF versions, but files of the same version seem to place the metadata similarly. For instance, in PDF 1.4 it appears to be closer to the bottom of the document, while in PDF 1.6 it is closer to the top. Again, you can check the PDF version from the first line.
Reading the document with PHP should be pretty simple to set up, so I'm going to skip this bit of code. I will point out, though, that it is a good idea to quit the loop once you have found the entire metadata: this is a very processing-intensive operation, so you'll want to save time where you can. I would also suggest only running this on groups of 10-20 files at a time, fewer if the documents are large. Setting up a caching system helped me quite a bit with timeout errors.
After you've got the metadata in a string, you'll want to clean it up a bit. The first thing to do is make sure your metadata is wrapped in a single root node so that the XML parser can read it. There were a couple of instances where it wasn't. The best/easiest way to fix this is to add a common wrapper. I would suggest using the most common one available to you. For me, that was the "xmpmeta" tag with an inner "rdf" wrapper. Ensuring that each piece of metadata starts the same way is important for navigating the document. There might be a better way of doing this, but this works and isn't too inefficient (at least now, after I removed the two loops).
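The scan and the wrapper step described above could be sketched roughly as follows. This is not the code from the answer: the function names `extractXmpPacket` and `ensureXmpRoot` are hypothetical, the sketch assumes the metadata is stored uncompressed and unencrypted between the "xpacket" begin and end markers, and it only adds the outer "xmpmeta" wrapper (with the standard "adobe:ns:meta/" namespace), not an inner "rdf" one.

```php
<?php
// Hypothetical helper: return the first XMP packet found in a file, or null.
// The loop stops as soon as the end marker is complete, as suggested above.
function extractXmpPacket(string $path): ?string
{
    if (!is_readable($path) || ($handle = fopen($path, 'rb')) === false) {
        return null;
    }
    $packet = null;
    while (($line = fgets($handle)) !== false) {
        if ($packet === null) {
            $start = strpos($line, '<?xpacket begin');
            if ($start === false) {
                continue;                    // still before the metadata
            }
            $packet = '';
            $line = substr($line, $start);   // the marker may share a line with other data
        }
        $packet .= $line;
        $end = strpos($packet, '<?xpacket end');
        $close = ($end === false) ? false : strpos($packet, '?>', $end);
        if ($close !== false) {
            fclose($handle);
            return substr($packet, 0, $close + 2);
        }
    }
    fclose($handle);
    return null;
}

// Hypothetical helper: drop the xpacket markers and make sure there is one root node.
function ensureXmpRoot(string $packet): string
{
    $xml = trim(preg_replace('/<\?xpacket\b.*?\?>/s', '', $packet));
    if (strpos($xml, '<x:xmpmeta') === false) {
        $xml = '<x:xmpmeta xmlns:x="adobe:ns:meta/">' . $xml . '</x:xmpmeta>';
    }
    return $xml;
}
```

Taking the substring from the begin marker onward is one way to cope with a marker that is not on its own line; documents where that still loses data may be easier to convert than to support.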
Afterwards you are going to want to remove the namespaces. I tried using them, but it's kind of hard to do so when the URLs keep changing in each implementation and you don't know for sure which ones you have. Besides, it was already starting to run slowly, and adding all that extra XML parsing would only have made it worse. It was just much simpler to remove them.
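A minimal sketch of what this kind of namespace stripping might look like is shown here. It is a reconstruction under assumptions rather than the original snippet: the function name `stripNamespaces` and the prefix list passed to it are illustrative, and the regular expressions work on the raw string, so they could also touch matching text inside element content.

```php
<?php
// Hypothetical helper: strip the given namespace prefixes, their declarations,
// and any default namespace declaration from an XMP string.
function stripNamespaces(string $xml, array $nodesToRemove): string
{
    // Sort, then reverse, so that "xmpMM" is handled before "xmp".
    rsort($nodesToRemove);

    foreach ($nodesToRemove as $prefix) {
        $p = preg_quote($prefix, '/');
        // Remove the declaration, e.g. xmlns:dc="..."
        $xml = preg_replace('/\s+xmlns:' . $p . '\s*=\s*(["\']).*?\1/', '', $xml);
        // Remove the prefix from opening and closing element names.
        $xml = preg_replace('/(<\/?)' . $p . ':/', '$1', $xml);
        // Remove the prefix from attribute names, e.g. rdf:about="..."
        $xml = preg_replace('/(\s)' . $p . ':(?=[\w.-]+\s*=)/', '$1', $xml);
    }
    // Remove default namespace declarations: xmlns="..."
    return preg_replace('/\s+xmlns\s*=\s*(["\']).*?\1/', '', $xml);
}

// Illustrative prefix list; the set found in real files will vary.
$nodesToRemove = ['x', 'rdf', 'dc', 'pdf', 'xmp', 'xmpMM'];
```

Because each pattern matches the prefix together with its colon, "xmp:" cannot match inside "xmpMM:", so the ordering is only a precaution in this version.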
The $nodesToRemove list might be a little different for you. Those are just all the namespaces I ran across. Note: I was having issues where the order in which you remove the nodes was important. I'm not sure why, but it would remove the "xmp" from "xmpMM" and I would be stuck with an "MM" namespace. The code above doesn't appear to have that issue, so I'm not sure whether it is still a problem, but be wary just in case. Either way, it isn't too hard to fix: just have PHP sort the list and then reverse it.
The REGEX removes default namespace declarations. I tried a number of different ways to go about this, but this was the only one I could find that consistently worked. There's probably a way to combine those two REGEX functions, but I'm completely lost when it comes to REGEX, and my attempts just left it broken. I'm not sure why I'm then removing the namespaces again with XML. This appears to be one of my more recent attempts at cleaning this up a bit; however, it comes from a working solution, so it doesn't hurt (at least not the functionality). The first bit, besides the REGEX, can probably be removed and replaced with the XML solution, though I've not verified this.
It's still necessary to remove the default namespaces before loading the string into XML, because the XML parsers do not consider the "xmlns" attribute to be an actual attribute. The only reason the namespaced version "xmlns:$prefix" works is that those are not considered "xmlns" attributes but "xmlns:$prefix" attributes. Subtleties.
Don't be like me. Don't try to implement every version of PDF ever created. It CAN'T be done. Well... it probably can, but it's more hassle than it's worth. Luckily for me, these were all in-house documents, so when I reached my limit and was tired of tweaking it just to break something else, or lose compatibility that I previously had, I just had those last few documents converted. Find the most common versions and handle those, then the next most common and set up conditions for those, and so on. Once you get to a point where you have only a few left, have them updated, or just announce that you don't support that version, especially if they are older. There's no sense in adding functionality for something that will only ever be used for a few documents.
One of the big ones I can remember is a situation where the "xpacket" was not always on its own line. Sometimes it shared space with a few metadata tags. This caused "missing" data, because I did not start recording the metadata until after the "xpacket" was found. It seemed like a simple fix, but it uncovered a whole lot of issues, so I ended up just scrapping that revision altogether and having those files updated. Luckily those were the last 3-4 files.
Once you have cleaned the metadata, then you are ready to parse it as XML. For example, here's how I get the description.
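As a rough illustration of this step, the sketch below tries several candidate XPath expressions against the cleaned, namespace-free XML and returns the first non-empty match. It uses PHP's SimpleXML extension; the function name `firstXpathValue` and the three paths are assumptions based on common XMP layouts, not the list from the answer.

```php
<?php
// Hypothetical helper: return the first non-empty value matched by any of the
// candidate XPath expressions, or null if none of them match.
function firstXpathValue(SimpleXMLElement $doc, array $xpaths): ?string
{
    foreach ($xpaths as $xpath) {
        $nodes = $doc->xpath($xpath);
        if (empty($nodes)) {
            continue;                      // no match, or the expression was rejected
        }
        $value = trim((string) $nodes[0]);
        if ($value !== '') {
            return $value;
        }
    }
    return null;
}

// Illustrative candidates for the description field, most specific first.
$descriptionPaths = [
    '//Description/description/Alt/li',   // value stored as a language-alternative list
    '//Description/description',          // value stored as a plain element
    '//Description/@description',         // value stored as an attribute
];
```

Keeping one such list per field makes it straightforward to add another variant when a new revision turns up.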
There are a few things to note about this. The first is the array of XPATHs. These are those multiple conditions I was talking about earlier. You may also notice the commented-out XPATH. That's one I am either still working on compatibility for, or have given up on. I don't remember; it's been a while since I've had to look at this, and no one has complained about errors, so I'm assuming it's not an issue.
Another thing to notice is the number of deviations for just this ONE field. The metadata changed quite a bit between revisions, and sometimes reverted. So you have to check for each case, make sure there were no other deviations, and then add any other conditions that may have occurred. Something to look into would be keeping separate parsers per version and loading the proper one, which may cut down on inefficiency. Looking back on this now, perhaps the easier way would have been to look up the standardization docs for each revision, but instead I ended up doing this mostly through trial and error. So, while this works for me, there may be some things I missed because they weren't an issue in any of my documents.
The other thing to note is how similar the tags are between the revisions. I wasn't, and still am not, all that great with advanced XPATH, so maybe there is a better way to do this; I don't know.
I hope this helps somewhat. I know it's given me a few ideas. If you have any other specific questions, let me know.
I encountered the same problem with PDFs generated by OpenOffice Writer's export to PDF function. In Acrobat or other PDF readers they open without problems but ZF can't handle them.
I saved the OpenOffice files as .docs and exported them to .pdf with MS Word. Now they are displayed...
I had the same problem with a PDF document created with Adobe.
I re-saved the document, this time not with Adobe's standard saving options but as an "Optimized PDF" (another Adobe preset under Save As).
Now Zend can open the file and it works fine.
I'm not quite sure which options differ between the presets, but I think the original was some kind of streamed/parted web version which Zend can't handle.