Nokogiri fails to read/parse the structure of an HTML file on CentOS

Posted 2025-01-03 15:05:15

I've written a script to parse out some needed code in HTML files that are uploaded to our app. On OS X, this process works fine. However, when I upload to our testing server, it doesn't. When I go into the console on the test server and attempt to parse the file, Nokogiri won't see the structure - each time I get a single line of output instead of the whole document structure. The rest of my script isn't being executed because Nokogiri isn't traversing the document. Looking for some help on how to resolve the issue.

Here's the requisite code I'm using to open the file and feed it to Nokogiri:

html = Nokogiri::HTML(File.open("index.html", "r"))

Here's what html equates to:

#<Nokogiri::HTML::Document:0x10d9bbf0 name="document" children=[#<Nokogiri::XML::DTD:0x10d9b81c name="html">]>

In OS X, I get the entire tree, as expected.

Here are the contents of index.html:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<link rel="stylesheet" href="zero.css" type="text/css" charset="utf-8" />
</head>
<body class="fullpage-vert" onunload="javascript:clearInterval(audioLoop);">
<div id="container">
    <div id="danceHolder">
        <img id="danceVid" src="1-1.jpg" width="320" height="480" alt="" />
    </div>
    <div id="introHolder">
        <img id="introVid" src="0-1.jpg" width="320" height="480" alt="" />
        <div id="ctabg"></div>
        <div id="cta1"></div>
        <div id="cta2"></div>
        <div id="cta3"></div>
        <div id="phone"></div>
        <div id="logo"></div>
    </div>
</div>
<a href="mmbridge:*">bridge test</a>
<frameset cols="25%,75%">
   <frame src="frame_a.htm" />
   <frame src="frame_b.htm" />
</frameset>
</body>
</html>

When I try to search for the frameset, for example, I get nothing:

html.css("frameset").size
0

I know Nokogiri has problems with the default libxml2 version installed on CentOS (2.6.2), but I followed the instructions to build it against a newer version (2.7.8). Here's the output of nokogiri -v:

# Nokogiri (1.5.0)
    --- 
    warnings: []

    nokogiri: 1.5.0
    ruby: 
      version: 1.9.2
      platform: x86_64-linux
      description: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
      engine: ruby
    libxml: 
      binding: extension
      compiled: 2.7.8
      loaded: 2.7.8

Has anyone else seen behavior like this?

Comments (1)

ゃ懵逼小萝莉 2025-01-10 15:05:15

For some reason, swapping

html = Nokogiri::HTML(File.open("index.html", "r"))

for

html = Nokogiri::HTML(File.read("index.html"))

works, although now it won't calculate line numbers properly (everything is line number 0).
