Apply DOM manipulations to HTML and save the result?

Posted on 2024-11-27 00:21:52

I have about 100 static HTML pages that I want to apply some DOM manipulations to. They all follow the same HTML structure. I want to apply some DOM manipulations to each of these files, and then save the resulting HTML.

These are the manipulations I want to apply:

# [start]
$("h1.title, h2.description", this).wrap("<hgroup>");
if ( $("h1.title").height() < 200 ) {
  $("div.content").addClass('tall');
}
# [end]
# SAVE NEW HTML

I could easily do the first line (.wrap()) with a find and replace, but it gets tricky when I have to determine the computed height of an element, which can't easily be determined without JavaScript.

Does anyone know how I can achieve this? Thanks!

Comments (3)

嘦怹 2024-12-04 00:21:52

While the first part could indeed be solved in "text mode" using regular expressions or a more complete DOM implementation in JavaScript, for the second part (the height calculation), you'll need a real, full browser or a headless engine like PhantomJS.

From the PhantomJS homepage:

PhantomJS is a command-line tool that packs and embeds WebKit.
Literally it acts like any other WebKit-based web browser, except that
nothing gets displayed to the screen (thus, the term headless). In
addition to that, PhantomJS can be controlled or scripted using its
JavaScript API.


A schematic instruction (which I admit is not tested) follows.

In your modification script (say, modify-html-file.js), open an HTML page, modify its DOM tree, and console.log the HTML of the root element:

var page = new WebPage();

page.open(encodeURI('file://' + phantom.args[0]), function (status) {
    if (status === 'success') {
        var html = page.evaluate(function () {
            // your DOM manipulation here
            return document.documentElement.outerHTML;
        });
        console.log(html);
    }
    phantom.exit();
});

Next, save the new HTML by redirecting your script's output to a file:

#!/bin/bash

mkdir -p modified
for i in *.html; do
    # "$i" is the loop variable; "$1" would be the script's first argument
    phantomjs modify-html-file.js "$i" > modified/"$i"
done
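
As a rough sketch of what could go in place of the `// your DOM manipulation here` placeholder, the two manipulations from the question can be written with plain DOM calls, so no jQuery is needed inside the page. The function name `applyManipulations` is hypothetical, and `offsetHeight` is used here as an approximation of jQuery's `.height()` (it also counts padding and borders), which is an assumption you may want to adjust.

```javascript
// Hypothetical helper: `doc` is expected to be the page's `document`.
function applyManipulations(doc) {
  // Like jQuery's .wrap(), give each matched element its own <hgroup>.
  var targets = doc.querySelectorAll('h1.title, h2.description');
  for (var i = 0; i < targets.length; i++) {
    var el = targets[i];
    var wrapper = doc.createElement('hgroup');
    el.parentNode.insertBefore(wrapper, el); // put the wrapper where el was
    wrapper.appendChild(el);                 // then move el inside it
  }

  // Add the 'tall' class when the rendered title is shorter than 200px.
  // Assumption: offsetHeight is close enough to jQuery's .height() here.
  var title = doc.querySelector('h1.title');
  var content = doc.querySelector('div.content');
  if (title && content && title.offsetHeight < 200) {
    content.className = (content.className + ' tall').replace(/^\s+/, '');
  }
}
```

Note that the function passed to page.evaluate runs inside the page and cannot see variables from the outer script, so the body above would need to be pasted into that callback (with `doc` replaced by `document`) rather than called by name.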

翻了热茶 2024-12-04 00:21:52

I tried PhantomJS as in katspaugh's answer, but ran into several issues when trying to manipulate pages. My use case was modifying the static HTML output of Doxygen without modifying Doxygen itself. The goal was to reduce the delivered file size by removing unnecessary elements from the page and converting it to HTML5. I also wanted to use jQuery to access and modify elements more easily.

Loading the page in PhantomJS

The APIs appear to have changed drastically since the accepted answer. Additionally, I used a different approach (derived from this answer), which will be important in mitigating one of the major issues I encountered.

var system = require('system');
var fs = require('fs');
var page = require('webpage').create();

// Reading the page's content into your "webpage"
// This automatically refreshes the page
page.content = fs.read(system.args[1]);

// Make all your changes here

fs.write(system.args[2], page.content, 'w');
phantom.exit();

Preventing JavaScript from Running

My pages use Google Analytics in the footer, and the saved page ended up modified beyond what I intended, presumably because the page's own JavaScript was run. Disabling JavaScript altogether isn't an option, because then jQuery can't be used to modify the page. I tried temporarily changing the tag, but then every special character was replaced with its HTML-escaped equivalent, destroying all JavaScript code on the page. Then I came across this answer, which gave me the following idea.

var rawPageString = fs.read(system.args[1]);
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");

page.content = rawPageString;

// Make all your changes here

rawPageString = page.content;
rawPageString = rawPageString.replace(/<script type='foo\/bar'/g, "<script");
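
The same idea can be packaged as two small string helpers, which makes the round trip easy to check outside PhantomJS. This is only a sketch: the names `disableScripts` and `restoreScripts` are hypothetical, and the restore pattern accepts either quote style on the assumption that the serialized page may come back with `"foo/bar"` instead of `'foo/bar'`, which I have not verified.

```javascript
// Give every <script> tag a placeholder type so the engine will not run it.
// The placeholder 'foo/bar' is the one used in the answer above.
function disableScripts(html) {
  return html
    .replace(/<script type="text\/javascript"/g, "<script type='foo/bar'")
    .replace(/<script>/g, "<script type='foo/bar'>");
}

// Strip the placeholder type again. Either quote style is accepted
// (assumption: the serializer may rewrite single quotes as double quotes).
function restoreScripts(html) {
  return html.replace(/<script type=(['"])foo\/bar\1/g, "<script");
}

var sample = '<script>var a = 1 < 2;</script>' +
             '<script type="text/javascript" src="x.js"></script>';
var disabled = disableScripts(sample);
var restored = restoreScripts(disabled);
```

As in the original snippet, a script that started out as type="text/javascript" comes back as a bare <script> tag, which is fine for HTML5 output.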

Adding jQuery

There's actually an example of how to use jQuery. However, I thought an offline copy would be more appropriate. Initially I tried using page.includeJs as in the example, but found that page.injectJs was more suitable for this use case. Unlike includeJs, no <script> tag is added to the page context, and the call blocks execution, which simplifies the code. jQuery was placed in the same directory I was executing my script from.

page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {

  // Make all changes here

  // Remove the foo/bar type more easily here
  $("script[type^=foo]").removeAttr("type");
});

fs.write(system.args[2], page.content, 'w');
phantom.exit();

Putting it All Together

var system = require('system');
var fs = require('fs');
var page = require('webpage').create();

var rawPageString = fs.read(system.args[1]);
// Prevent in-page javascript execution
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");

page.content = rawPageString;

page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {

  // Make all changes here

  // Remove the foo/bar type
  $("script[type^=foo]").removeAttr("type");
});

fs.write(system.args[2], page.content, 'w');
phantom.exit();

Using it from the command line:

phantomjs modify-html-file.js "input_file.html" "output_file.html"

Note: This was tested and works with PhantomJS 2.0.0 on Windows 8.1.

Pro tip: If speed matters, you should consider iterating the files from within your PhantomJS script rather than a shell script. This will avoid the latency that PhantomJS has when starting up.
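
A minimal sketch of that tip, assuming a hypothetical script name (say, modify-all.js) that takes an input directory and an output directory as its two arguments. The helper `planJobs` is made up for illustration; the PhantomJS part only runs when the `phantom` global exists, and it reuses the synchronous page.content approach from above.

```javascript
// Hypothetical helper: keep only *.html names and pair each with an output path.
function planJobs(names, inDir, outDir, sep) {
  var jobs = [];
  for (var i = 0; i < names.length; i++) {
    if (/\.html$/i.test(names[i])) {
      jobs.push({ input: inDir + sep + names[i], output: outDir + sep + names[i] });
    }
  }
  return jobs;
}

// Only runs under PhantomJS, so the engine is started once for all files.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var fs = require('fs');
  var page = require('webpage').create();

  var inDir = system.args[1];   // assumption: first argument is the input directory
  var outDir = system.args[2];  // assumption: second argument is the output directory
  fs.makeDirectory(outDir);

  var jobs = planJobs(fs.list(inDir), inDir, outDir, fs.separator);
  for (var j = 0; j < jobs.length; j++) {
    page.content = fs.read(jobs[j].input);
    // Make all your changes here
    fs.write(jobs[j].output, page.content, 'w');
  }
  phantom.exit();
}
```

It would be invoked once, for example as phantomjs modify-all.js "input_dir" "output_dir", instead of once per file from a shell loop.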

秋心╮凉 2024-12-04 00:21:52

You can get the modified content with $('html').html() (or a more specific selector if you don't want things like the head tags), then submit it as one big string to your server and write the file server-side.
