How to get the number of pages in a Word document on Linux?
I saw the question "PHP - Get number of pages in a Word document". I also need to determine the page count of a given Word file (doc/docx). I tried to investigate phplivedocx/ZF (@hobodave linked to those in the answers to the original post), but I got completely lost there. I can't use any external web service either (such as DOC2PDF sites, then counting the pages in the PDF version, or similar).
Simply put: is there any PHP code (using ZF or anything else in PHP, excluding COM objects or external executables such as AbiWord; I'm on a shared Linux server without exec or similar functions) that finds the page count of a Word file?
EDIT: The Word versions to be supported are Microsoft Word 2003 and 2007.
Comments (4)
Getting the number of pages for docx files is very easy:
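A minimal sketch of one way to do it, assuming the zip and SimpleXML extensions are enabled on the server. A docx file is a zip archive, and the extended properties part `docProps/app.xml` normally carries a `Pages` element; the function name `get_num_pages_docx` is just illustrative.

```php
<?php
// Sketch: read the cached page count from a .docx file.
// Assumes the zip and SimpleXML extensions are available.
function get_num_pages_docx(string $filename)
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false; // not a readable zip container
    }
    // docProps/app.xml holds the extended document properties.
    $data = $zip->getFromName('docProps/app.xml');
    $zip->close();
    if ($data === false) {
        return false; // part is missing
    }
    $xml = simplexml_load_string($data);
    if ($xml === false || !isset($xml->Pages)) {
        return false; // unparsable XML or no Pages element
    }
    return (int) $xml->Pages;
}
```

Note that this is the value cached by whichever application last saved the file, so it can be missing or out of date; the function returns `false` when it cannot find one.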
For the 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation section of the document, but the OLE format of the files makes it a pain to find. The structure is defined extremely thoroughly (though badly, imo) here and more simply here. I looked at this for an hour today but didn't get very far (not a level of abstraction I'm used to); I did output the hex to better understand the structure:
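Something along these lines would produce such a dump. It is only a sketch: `hex_dump` is a hypothetical helper name, and the 512-byte default is an arbitrary choice. An OLE compound file should begin with the signature bytes `d0 cf 11 e0 a1 b1 1a e1`, which is a quick sanity check on the first row.

```php
<?php
// Sketch: build an offset / hex / ASCII dump of the first bytes of a file.
function hex_dump(string $filename, int $length = 512): string
{
    $bytes = file_get_contents($filename, false, null, 0, $length);
    if ($bytes === false || $bytes === '') {
        return '';
    }
    $out = '';
    // 16 bytes per row, like the usual hex viewers.
    foreach (str_split($bytes, 16) as $i => $chunk) {
        $hex = implode(' ', str_split(bin2hex($chunk), 2));
        // Show non-printable bytes as dots in the ASCII column.
        $ascii = preg_replace('/[^\x20-\x7E]/', '.', $chunk);
        $out .= sprintf("%08x  %-47s  %s\n", $i * 16, $hex, $ascii);
    }
    return $out;
}
```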
The hex output lets you locate sections such as:
That in turn lets you see the referencing info, such as:
From there you can determine the properties described:
That will let you find the relevant section, unpack it and get the page count. Of course this is the hard bit that I just don't have time for, but it should set you in the right direction.
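For the final unpacking step, here is a rough sketch. It assumes you have already extracted the raw bytes of the SummaryInformation stream from the OLE container (walking the compound-file directory to find it is the hard part and is not shown). The layout follows my reading of the property set stream format: a 28-byte header, a 16-byte format id and a 4-byte offset to the first property set, then a table of (property id, offset) pairs. The page count is property id 14, stored as a 32-bit integer. `read_u32` and `page_count_from_summary` are hypothetical helper names.

```php
<?php
const PIDSI_PAGECOUNT = 0x0E; // property id of the page count
const VT_I4 = 0x0003;         // type tag: 32-bit integer

// Read a little-endian 32-bit value, or null if out of bounds.
function read_u32(string $s, int $pos)
{
    if ($pos < 0 || $pos + 4 > strlen($s)) {
        return null;
    }
    return unpack('V', substr($s, $pos, 4))[1];
}

function page_count_from_summary(string $stream)
{
    // Byte-order mark 0xFFFE, stored little-endian as FE FF.
    if (substr($stream, 0, 2) !== "\xFE\xFF") {
        return false;
    }
    // Offset of the first property set sits at byte 44 of the stream.
    $setStart = read_u32($stream, 44);
    $count = $setStart === null ? null : read_u32($stream, $setStart + 4);
    if ($count === null) {
        return false;
    }
    for ($i = 0; $i < $count; $i++) {
        $entry = $setStart + 8 + $i * 8;
        $id = read_u32($stream, $entry);
        $offset = read_u32($stream, $entry + 4); // relative to $setStart
        if ($id === null || $offset === null) {
            return false;
        }
        if ($id === PIDSI_PAGECOUNT) {
            // Typed value: 2-byte type, 2 bytes padding, then the data.
            $type = read_u32($stream, $setStart + $offset);
            $value = read_u32($stream, $setStart + $offset + 4);
            if ($type === null || $value === null || ($type & 0xFFFF) !== VT_I4) {
                return false;
            }
            return $value; // read as unsigned; page counts are non-negative
        }
    }
    return false; // no page-count property present
}
```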
M$ don't make it easy!
Have a look at PhpWord on Microsoft CodePlex: http://phpword.codeplex.com/
It will allow you to open and read Word-formatted files in PHP and do whatever processing you require.
To get metadata properties of doc, docx, ppt and pptx files, such as the number of pages or slides, using PHP, I followed the process below. It worked like a charm, so I hope it helps someone:
Once that's done, try executing the following command; it will give you all the metadata about your file:
Once tested, you can execute this command from a PHP script. Thanks.
Excluding Abiword or OpenOffice? Impossible. The number of pages depends on the number of words/letters, fonts used, justification and kerning, margin size, line spacing, paragraph spacing, number of paragraphs, columns, size of graphics / embedded objects, and page / column breaks.
You need something that can understand all of these.
Even if you use OpenOffice or Abiword, reflowing the text may change the number of pages. Indeed, in some cases opening the same document on a different instance of MSWord may result in a difference.
The best you could probably manage would be a statistical approach based on a representation of the document, but you'll still see huge variance.
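To give an idea of what such a statistical approach might look like for docx, here is a very crude sketch that counts the words in `word/document.xml` and divides by an assumed average. The 500 words per page figure and the name `estimate_pages_docx` are my own assumptions, not values from any specification, and the sketch ignores fonts, images, tables and explicit page breaks entirely.

```php
<?php
// Assumed average; tune it against documents with known page counts.
const WORDS_PER_PAGE = 500;

// Sketch: estimate the page count of a .docx from its word count alone.
function estimate_pages_docx(string $filename, int $wordsPerPage = WORDS_PER_PAGE)
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }
    // Add a space at each paragraph end so neighbouring paragraphs
    // do not fuse into one word, then drop the markup.
    $text = strip_tags(str_replace('</w:p>', ' </w:p>', $xml));
    $words = (int) preg_match_all('/\S+/u', $text);
    // Never report fewer than one page.
    return max(1, (int) ceil($words / $wordsPerPage));
}
```

Expect the result to be off by a wide margin for anything other than plain running text.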