What's the best "file format" for saving complete web pages (images, etc.) in a single archive?

Posted 2024-07-08 11:52:36

Comments (7)

依 靠 2024-07-15 11:52:36

My favourite is the ZIP format, because:

  • It is very well suited to the purpose
  • It is well documented
  • There are a lot of implementations available for creating or reading ZIP files
  • A user can easily extract single files, change them and put them back in the archive
  • Almost every major operating system (Windows, Mac and most Linux distributions) has a ZIP program built in

The alternatives all have some flaw:

  • With MHTML, you cannot easily edit the contents.
  • With data URIs, I don't know how difficult the implementation would be. (With ZIP, even I could do it in PHP, 3 years ago...)
  • Storing things as separate files just has far too many things that could go wrong and mess up your archive.
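
To give a feel for how little code the ZIP route needs, here is a minimal Python sketch (the answer mentions PHP; Python is used here only for illustration). The function name `pack_page` and the directory layout are assumptions, not something from the answer, and the archive should be written outside the page directory so it does not try to include itself.

```python
import zipfile
from pathlib import Path


def pack_page(page_dir: str, archive_path: str) -> list[str]:
    """Bundle every file under page_dir into one ZIP and return the stored names."""
    root = Path(page_dir)
    stored = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the page root so index.html sits at the top level.
                name = path.relative_to(root).as_posix()
                zf.write(path, arcname=name)
                stored.append(name)
    return stored
```

Calling `pack_page("saved_page", "saved_page.zip")` on a hypothetical `saved_page` directory would produce one archive that any ZIP tool can open, which is the "extract, change, put back" workflow described above.
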
牵你的手,一向走下去 2024-07-15 11:52:36

PDFs are supported on nearly all browsers on nearly all platforms and store content and images in a single file. They can be edited with the right tools. This is almost definitely not ideal, but it's an option to consider.

迷荒 2024-07-15 11:52:36

It is not only a question of file format. Another crucial question is what exactly you want to store. Is it:

  1. to store the whole page as it is, with all referenced resources (images,
    CSS and JavaScript)?

  2. to capture the page as it was rendered at some point in time; a static
    image of some rendered state of the web page's DOM?

Most current "save page as" functionality in browsers, whether it saves to MAF, MHTML or file+directory, attempts the first. This is ultimately a flawed approach.

Don't forget that web pages these days are more like local applications than static documents you can easily store. Potential issues:

  1. One page is in fact several pages built dynamically by JS; user interaction is needed
    to get it to the desired state.

  2. AJAX applications communicate with a remote service, which renders the saved page
    unusable for offline viewing.

  3. Hidden links in JavaScript code. Such resources are then not part of the stored page.
    Even parsing the JS code may not discover them; you need to run the code.

  4. Even the positions of basic HTML elements may be computed dynamically by
    JS, and it is not always possible or easy to recreate them locally.

  5. You would need some sort of JS memory dump, and a way to load it, to get the page
    to the state you hoped to store.

And many, many more issues...

Check out the Chrome SingleFile extension. It stores a web page in one HTML file, with images inlined using the data URIs already mentioned. I haven't tested it much, so I cannot say how well it handles "volatile" AJAX pages.
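
The data URI inlining mentioned above can be sketched roughly as follows. This is not how SingleFile is implemented; it is a simplified Python illustration with hypothetical function names (`to_data_uri`, `inline_images`), and it uses a regular expression on `<img src="...">` rather than a real HTML parser, so it will miss CSS backgrounds, `srcset`, and anything added by scripts.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Naive pattern for double-quoted src attributes on <img> tags (an assumption of this sketch).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data URI."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html: str, base_dir: str) -> str:
    """Replace local image references in html with inline data URIs."""
    base = Path(base_dir)

    def replace(match: re.Match) -> str:
        src = match.group(2)
        target = base / src
        # Leave remote URLs, existing data URIs, and missing files untouched.
        if re.match(r"[a-z][a-z0-9+.-]*:", src, re.IGNORECASE) or not target.is_file():
            return match.group(0)
        return match.group(1) + to_data_uri(target) + match.group(3)

    return IMG_SRC.sub(replace, html)
```

Note that this only covers the first storage goal (page plus referenced resources); it does nothing about the dynamic-state issues listed above.
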

錯遇了你 2024-07-15 11:52:36

Use a zip file.

You could always make a program/script that extracts the zip file to a temp directory and loads the index.html file in your browser. You could even use an index.ini/txt file to specify the file that should be loaded when extracting.
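
A minimal Python sketch of such a launcher script might look like the following. The function names and the `index.html` default are assumptions for illustration; the temp directory is left in place so the browser can keep loading resources from it.

```python
import tempfile
import webbrowser
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, entry: str = "index.html") -> Path:
    """Extract the archive into a fresh temp directory and return the entry page path."""
    target = Path(tempfile.mkdtemp(prefix="webarchive-"))
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
    page = target / entry
    if not page.is_file():
        raise FileNotFoundError(f"{entry} not found in {archive_path}")
    return page


def open_archive(archive_path: str) -> None:
    """Extract the archive and hand the entry page to the default browser."""
    page = extract_archive(archive_path)
    webbrowser.open(page.as_uri())
```

The `entry` parameter plays the role of the index.ini/txt idea: a different start page could be read from a small file inside the archive and passed in instead of the default.
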

Basically, you want something like the Mozilla Archive format, but without the unnecessary rdf crap just to specify what file to load.

MHT files are good, but they usually use base64 to embed files, which makes the embedded data roughly a third larger than the original binary (data URIs have the same overhead). You can add attachments as binary, but you'll have to do that manually with a hex editor or create a tool, and client support for it might not be as good.
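
The size penalty is easy to check. The short Python snippet below uses random bytes as a stand-in for an image; base64 emits 4 output characters for every 3 input bytes, and MIME line breaks in an MHT file add a little more on top of that.

```python
import base64
import os

raw = os.urandom(30_000)          # stand-in for a binary image
encoded = base64.b64encode(raw)   # what a data URI or a base64 MHT part carries

ratio = len(encoded) / len(raw)
print(f"{len(raw)} bytes -> {len(encoded)} bytes ({ratio:.2f}x)")
# prints "30000 bytes -> 40000 bytes (1.33x)"
```
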

Of course, if you want to use what browsers generate, MHT (Opera and IE at least) might be better.

痴情换悲伤 2024-07-15 11:52:36

I see no excuse to use anything other than a zip file.

染墨丶若流云 2024-07-15 11:52:36

Well, if browser support and ease of editing are the biggest concerns, I think you are stuck with the file+directory approach, unless you are willing to provide an editor for the single-file format and live with not very good support in browsers.

You can create a single file by compressing the contents. You can also create a parent directory to ease handling.

神也荒唐 2024-07-15 11:52:36

The problem is that HTML is bottom-up, not top-down. Look at the file name for this page, which was saved on my box as "What's the best "file format" for saving complete web pages (images, etc.) in a single archive? - Stack Overflow.html".

Just add a '|' and one has trouble doing copy-and-paste backups to a spare drive. In the end you end up chopping the file name in order to save it. Dozens, perhaps hundreds, of identical index.html or index.php files are cluttering my drives.

The partial solution is to write your own CMS and use scripts to map all relevant files to a flat-file database, then use file name, size, mtime and md5 to get a unique ID for each file. Create a flat-file index permitting 100k or 1000k records. The goal is to write once and use many times. So you need a real CMS, and you need a unique ID based on content (e.g. index8765432.html) that goes in your files_archive. Ditto for the others. Then you can non-destructively symlink from the saved original HTML to the files_archive and just recreate the file using a PHP or alternative script if need be. I don't know if it will work, as I'm at the same point you're at; maybe in a week I will know for sure.

The more useful approach is to have a top-down structure based on your business or personal wants and related tasks. So your files might be organized top-down but external ones bottom-up to preserve the original content. My interest is in Web 3.0 services, and the closer you get to machine-to-machine interaction, the greater the need to structure the information. Maybe it is time to rethink the idea of bundling everything into a single file. If you have hundreds of copies of main.css, why bundle, when a top-down solution might let you modify one file instead of hundreds?
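
The content-based ID and flat-file index described above could be sketched like this in Python. It is only one possible reading of the idea: the function name `archive_file`, the tab-separated index layout, and the choice to derive the ID from the md5 of the content alone (recording size and mtime in the index rather than in the ID) are all assumptions of this sketch.

```python
import hashlib
import shutil
from pathlib import Path


def archive_file(src: Path, store: Path, index: Path) -> Path:
    """Copy src into store under a content-derived name and log it in a flat-file index."""
    data = src.read_bytes()
    digest = hashlib.md5(data).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    dest = store / f"{src.stem}-{digest}{src.suffix}"
    if not dest.exists():
        # Identical content maps to the same name, so duplicates are stored once.
        shutil.copy2(src, dest)
    stat = src.stat()
    with index.open("a", encoding="utf-8") as fh:
        # One tab-separated record per sighting: id, size, mtime, original path.
        fh.write(f"{dest.name}\t{stat.st_size}\t{int(stat.st_mtime)}\t{src}\n")
    return dest
```

With this layout, hundreds of identical index.html files collapse into a single stored copy, while the index still records where each one originally lived.
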
