How can I parse an HTML file on a Solaris 10 UNIX machine and write all the values in <td> elements to a CSV file?
I'm pretty familiar with PHP, including the command line, and semi-familiar with BASH scripting. I have no experience with Perl or other languages, but I'm willing to use whatever works.
The HTML file I am trying to parse is 700,000+ lines, 61MB. I cannot change the source that builds the HTML table, only download the entire table via wget http://10.1.1.2/file.pl.
Here's an example format of the HTML code that I'm trying to parse:
<HTML>
<HEAD>
<TITLE>Objects</TITLE>
<STYLE type="text/css">
a:hover
{
color:red
}
</STYLE>
</HEAD>
<BODY>
<IMG src="http://10.1.1.2/images/logo.gif"/>
<BR/><BR/>
<TABLE border="0">
<TR>
<TH>Objects</TH>
</TR>
<TR>
<TD><HR style="width:227px"></TD>
</TR>
</TABLE>
<table border=1 cellpadding=5 cellspacing=0><tr><th><b>Subtype</b></th><th><b>Object</b> </th></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/10/0/0</td></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/13/0/0</td></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/13/3/0</td></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/3/0/0</td></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/3/0/0-5</td></tr>
... 700,000 more lines ...
</table> </BODY>
</HTML>
What I'd like in the CSV:
Subtype,Object
10GigEthernet,SNFCCAMK34T-TenGigE0/10/0/0
10GigEthernet,SNFCCAMK34T-TenGigE0/13/0/0
10GigEthernet,SNFCCAMK34T-TenGigE0/13/3/0
10GigEthernet,SNFCCAMK34T-TenGigE0/3/0/0
10GigEthernet,SNFCCAMK34T-TenGigE0/3/0/0-5
I'd appreciate any help you can give! Thanks in advance.
Result from @shellter's code:
# wget http://10.1.1.2/reports/file.pl
--2012-01-19 06:56:59-- http://10.1.1.2/reports/file.pl
Connecting to 10.1.1.2... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: `file.pl'
[ <=> ] 61,000,000 1.01M/s in 58s
2012-01-19 06:58:00 (1.01 MB/s) - `file.pl' saved [61000000]
# sed -n '/<\/td>/{
> s@<tr><td>@@;
> s@</td>@XaYbZc@;
> s@<td>@@;
> s@</td></tr>@@;
> s/XaYbZc/,/
> s/^ //
> p
> }' file.pl > routerList.csv
# ls -l
total 203408
-rw-r--r-- 1 root root 61000000 Jan 19 06:58 file.pl
-rw-r--r-- 1 root root 42708247 Jan 19 06:58 routerList.csv
# head routerList.csv
10GigEthernetn,SNFCCAMK34T-TenGigE0/10/0/0
10GigEthernetn,SNFCCAMK34T-TenGigE0/13/0/0
10GigEthernetn,SNFCCAMK34T-TenGigE0/13/3/0
10GigEthernetn,SNFCCAMK34T-TenGigE0/3/0/0
10GigEthernetn,SNFCCAMK34T-TenGigE0/3/0/0-5
Quick and dirty with Perl and its XML::LibXML module (which doesn't come standard with Perl, but is usually easy to install once you know how to install CPAN modules):
Here, xpath is a simple Perl script I wrote to select content from XML/HTML documents using XPath.
The second Perl command is a quick and dirty way to reformat the results into two columns; it will fail if your document has other kinds of <td/>s that you don't want in the output.
So this probably won't do exactly what you need right now, but especially if you anticipate having to do more of these kinds of selections in the future, you probably want to write a script you can adjust later, and in that case this is a possible starting point.
All the answers so far say "you should do it the Right Way" and then show how to do it the "Wrong Way". Here is an example of the Right Way. This version uses a DOM parser (specifically Mojo::DOM, though others will work similarly) and Text::CSV.
This produces output much like the others, but handles all kinds of edge cases. In my opinion, with modern DOM (or even XPath) parsers, doing it the right way is easier than crafting a regex anyway, and you avoid all of the pitfalls that come from doing it the wrong way. So why not just do it the right way first?
While I have to agree with most of the comments like 'use a DOM, or XPATH, etc.', you are lucky in this case that all the data you want to process is on one line. If there are ever line breaks anywhere in that data, then this will not work AND it will be essentially impossible to get a working solution in sed. So, forewarned of these issues, try this:
The sed script uses the '@' character as the match/replace delimiter.
First we take the first <tr><td> on the line and delete it.
We then take the first </td> and replace it with XaYbZc as a temporary marker.
Remove the remaining opening <td>.
Remove the trailing </td></tr>.
Replace the temporary XaYbZc with ','.
Remove the 4 spaces at the front of the line.
Print the buffer. (Done!)
I hope this helps.
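The steps above match the sed script shown in the question's transcript. A minimal sketch of it follows, run against a small stand-in file so the result is easy to check. The file name sample.html is an assumption for illustration, and `s/^ *//` (strip any leading spaces) replaces the fixed four-space removal because the exact indentation of the real file is not known here. The header line is written by hand since the `<th>` row contains no `</td>` and is therefore skipped.

```shell
#!/bin/sh
# Stand-in input (hypothetical file name) in the same one-row-per-line layout
# as the table in the question.
cat > sample.html <<'EOF'
<table border=1 cellpadding=5 cellspacing=0><tr><th><b>Subtype</b></th><th><b>Object</b> </th></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/10/0/0</td></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/3/0/0-5</td></tr>
</table> </BODY>
EOF

# The header row uses <th>, so sed never prints it; write it explicitly.
echo 'Subtype,Object' > routerList.csv

# Only lines containing </td> are processed and printed (-n plus p).
sed -n '/<\/td>/{
  s@<tr><td>@@
  s@</td>@XaYbZc@
  s@<td>@@
  s@</td></tr>@@
  s/XaYbZc/,/
  s/^ *//
  p
}' sample.html >> routerList.csv

cat routerList.csv
```

Because the address `/<\/td>/` is case-sensitive, the upper-case `<TD>` row in the first table of the real page is not matched either.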
I would abandon using the Right Way (using a real parser) and just process it with a regex.
This (in Perl) is fragile and error prone, but ought to be about as fast as you can get...
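The Perl code for this answer is not shown here. As a rough illustration of the same regex-only idea, swapped into awk rather than Perl, the sketch below splits each data row on the `</td><td>` boundary and strips the tags at either end. The file names sample.html and objects.csv are assumptions for illustration, and the approach shares the fragility described above: it only works while every row sits on one line and has exactly two cells.

```shell
#!/bin/sh
# Stand-in input (hypothetical file name) with a few non-data lines mixed in.
cat > sample.html <<'EOF'
<TD><HR style="width:227px"></TD>
<table border=1 cellpadding=5 cellspacing=0><tr><th><b>Subtype</b></th><th><b>Object</b> </th></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/13/0/0</td></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/13/3/0</td></tr>
</table> </BODY>
EOF

{
  echo 'Subtype,Object'
  # The field separator is the boundary between the two cells, so a data row
  # splits into exactly two fields; everything else is ignored.
  awk -F'</td><td>' '
    /^<tr><td>.*<\/td><\/tr>/ && NF == 2 {
      sub(/^<tr><td>/, "", $1)          # drop the opening tags
      sub(/<\/td><\/tr>.*$/, "", $2)    # drop the closing tags
      print $1 "," $2
    }' sample.html
} > objects.csv

cat objects.csv
```

On Solaris 10 the default /usr/bin/awk is the old awk without `sub()`, so `nawk` or `/usr/xpg4/bin/awk` would likely be needed. Values that themselves contain commas or quotes would need proper CSV quoting, which this sketch does not attempt.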
This might work for you:
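The command this answer offered is not shown here. As a stand-in, and not the answerer's own code, one possible single-substitution sed sketch captures both cells with back-references and prints only the lines where the substitution succeeds. The file names sample.html and one_pass.csv are assumptions for illustration.

```shell
#!/bin/sh
# Stand-in input (hypothetical file name).
cat > sample.html <<'EOF'
<TD><HR style="width:227px"></TD>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/3/0/0</td></tr>
<tr><td>10GigEthernet</td><td>SNFCCAMK34T-TenGigE0/3/0/0-5</td></tr>
</table> </BODY>
EOF

# \1 is the first cell and \2 the second; the trailing p prints a line only
# when the pattern matched, so non-data lines are dropped by -n.
sed -n 's@^ *<tr><td>\(.*\)</td><td>\(.*\)</td></tr>.*$@\1,\2@p' sample.html > one_pass.csv

cat one_pass.csv
```

As with the other regex approaches, this assumes two-cell rows that each fit on one line.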