What application does Google use to display PDF attachments in Gmail?

Posted on 2024-07-17 07:00:46

I watched the traffic when Google displays a PDF attachment from Gmail in a new window. The content of each PDF page is served as a PNG image, yet its text can still be selected. What does Google use on the server side to generate a PNG file for a particular page of a PDF file? And how does text selection on a PNG image work? Any ideas?

Comments (6)

人生百味 2024-07-24 07:00:46

By default, attachments are viewed securely using https://docs.google.com/gview. However, it turns out you are also allowed to request files over plain HTTP, which makes it a little easier to figure out what is going on using Wireshark.

As you noted, the PDF is clearly converted to PNG on the server side (ImageMagick is indeed a reasonable tool for this purpose). The obvious reason is to preserve the exact layout while still letting users view the file without a PDF viewer.
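The server-side rasterization step could be sketched roughly as follows, assuming ImageMagick (with Ghostscript as its PDF delegate) is installed. This only illustrates the ImageMagick approach mentioned here; it is not known to be what Google runs. The function names `build_convert_command` and `render_page` and the default of 150 DPI are made up for this example.

```python
import shutil
import subprocess


def build_convert_command(pdf_path, page_index, png_path, dpi=150):
    """Build an ImageMagick command that rasterizes one PDF page to a PNG."""
    # "file.pdf[N]" selects page N (zero-based) in ImageMagick's input syntax.
    # -density sets the rasterization resolution and must come before the input.
    return [
        "convert",
        "-density", str(dpi),
        f"{pdf_path}[{page_index}]",
        png_path,
    ]


def render_page(pdf_path, page_index, png_path, dpi=150):
    """Run the command. Assumes ImageMagick's 'convert' is on PATH."""
    cmd = build_convert_command(pdf_path, page_index, png_path, dpi)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("ImageMagick 'convert' not found on PATH")
    subprocess.run(cmd, check=True)
```

On ImageMagick 7 the executable is usually `magick` rather than `convert`, so the first list element may need adjusting.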

However, from looking at the traffic I found that the entire PDF is also converted to a custom XML format when /gview?a=gt&docid=&chan=&thid= is called (this happens as soon as you request the document). As I couldn't use Wireshark to copy the XML, I resorted to the Firefox extension Live HTTP Headers. Here's an excerpt:

<pdf2xml>
    <meta name="Author" content="Bruce van der Kooij"/>
    <meta name="Creator" content="Writer"/>
    <meta name="Producer" content="OpenOffice.org 3.0"/>
    <meta name="CreationDate" content="20090218171300+01'00'"/>
    <page t="0" l="0" w="595" h="842">
        <text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
        <text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
    </page>
</pdf2xml>

I'm not yet sure what all the attributes on the text element stand for (apart from w and h), but they are evidently the coordinates of the text and possibly its length. Because the JavaScript Google uses is minified (or possibly obfuscated, though that seems unlikely), working out precisely how the client-side selection function works is not easy. Most likely it uses this XML file to determine which text the user is selecting and then copies that text to the user's clipboard.
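To make that guess concrete, here is a minimal sketch of how a viewer could map a pointer position over the PNG to text from the XML. It assumes that l, t, w and h mean left, top, width and height in page units, which is an interpretation and not something Google documents. The helper names `load_text_boxes` and `text_at` are hypothetical, and Google's actual client code is JavaScript; Python is used here only to show the idea.

```python
import xml.etree.ElementTree as ET

# Excerpt of the XML captured from the viewer.
SAMPLE = """<pdf2xml>
    <page t="0" l="0" w="595" h="842">
        <text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
        <text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
    </page>
</pdf2xml>"""


def load_text_boxes(xml_string):
    """Return (left, top, width, height, text) for every <text> element."""
    root = ET.fromstring(xml_string)
    boxes = []
    for text in root.iter("text"):
        # Assumption: l/t/w/h are left/top/width/height in page units.
        left, top = int(text.get("l")), int(text.get("t"))
        width, height = int(text.get("w")), int(text.get("h"))
        boxes.append((left, top, width, height, text.text or ""))
    return boxes


def text_at(boxes, x, y):
    """Return the text whose bounding box contains the point, or None."""
    for left, top, width, height, content in boxes:
        if left <= x < left + width and top <= y < top + height:
            return content
    return None


boxes = load_text_boxes(SAMPLE)
print(text_at(boxes, 200, 110))
```

A real viewer would also have to scale pointer coordinates from the displayed image size back to the page size given on the page element (595 by 842 in this excerpt).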

Note that there is an open-source (GPL-licensed) tool called pdf2xml whose output is similar but not quite the same. Here's the example from its homepage:

<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
  <title>My Title</title>
  <page width="780" height="1152">
    <font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
      <text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
      <img x="324" y="232" width="277" height="340" src="text_pic0001.png"/>
      <link x="324" y="232" width="277" height="340" dest_page="2" dest_x="141" dest_y="187"/>
    </font>
    <font size="12" face="AGaramond-Regular" italic="true" bold="true">
      <text x="509" y="68" width="121" height="12">This is a test PDF file</text>
      <link x="509" y="68" width="121" height="12" href="www.mobipocket.com"/>
    </font>
  </page>
</pdf2xml>

I hope this information is useful. As one of the other posters mentioned, though, the only way to be sure what Google does is to ask them. It's a shame Google doesn't have an official IRC channel, but they do have a forum for Google Docs support questions.

Good luck.

友欢 2024-07-24 07:00:46

Google uses a closed-source PDF converter application developed in-house. You're better off looking into the links posted in the other answers, since you can't get your hands on Google's version. Sorry!

離殇 2024-07-24 07:00:46

If you have the text, you can of course do whatever you want with it.

More specifically, you should check out this link: pdf to png using php

So ImageMagick will be needed.

Edit: another interesting link.

Edit: I found this via Google and it looks interesting, so you could use the Google API: Google Document List Data API. There is also a blog post about it: Google API Now Lets You Get Documents in Many Formats

Of course, to be sure what Google uses, you would need an answer from them. :)

Good luck!

赤濁 2024-07-24 07:00:46

To see what a PDF was created with, right-click it and open Document Properties (in Adobe Reader). The generating tool is listed as the "PDF Producer". I think Google uses both Prince and iText (though not in combination for creating PDFs), and has made some major modifications to those toolkits to create the end product.

缱绻入梦 2024-07-24 07:00:46

Well, this might just be the pdf2xml tool that Google is using. They only changed the full words width, height, etc. (to abbreviations such as w and h) and added the p attribute, which turns out to contain the coordinates of the words within the line. I just played with it and found out. :) I'm going to use this pdf2xml from Google: upload, let them convert, then use the XML to transform to... EPUB? :P
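As a rough check of that reading, the values in the captured sample are consistent with p being a flat list of (left, width) pairs, one pair per whitespace-separated word: 85+117 ends just before 209, 209+61 just before 277, and so on. The sketch below is written under that assumption, which is an inference from one sample and not a documented format; `word_boxes` is a hypothetical helper name.

```python
def word_boxes(line_text, p_attr, top, height):
    """Split a line into per-word boxes using the p attribute.

    Assumes p is "left,width,left,width,..." with one pair per word.
    """
    values = [int(v) for v in p_attr.split(",")]
    pairs = list(zip(values[0::2], values[1::2]))
    words = line_text.split()
    if len(pairs) != len(words):
        raise ValueError("p attribute does not match the number of words")
    # Every word shares the top and height of the line it belongs to.
    return [
        (word, left, top, width, height)
        for word, (left, width) in zip(words, pairs)
    ]


line = "Nederland Open in Verbinding (NOiV)"
p = "85,117,209,61,277,21,305,124,436,75"
for box in word_boxes(line, p, top=127, height=27):
    print(box)
```

With per-word boxes like these, a client could highlight individual words over the PNG instead of selecting whole lines.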

記柔刀 2024-07-24 07:00:46

You may also want to investigate using Lucene to index those big PDF files and serve the relevant pages to your users.

See http://www.jguru.com/faq/view.jsp?EID=1074237 for more ideas.
