Export an entire HTML table to a text document using Watir

Posted on 2024-12-18 23:10:05

Basically all I would like to do is export a whole html table to a .txt file (notepad document).

So far I have learnt how to instruct the browser to find the html page with the table.

require 'rubygems' 
require 'hpricot' 
require "watir-webdriver" 
url = "http://www.example.com"
browser = Watir::Browser.new 
browser.goto url

After running the above in cmd I can now see the html table in the browser.

This is where I am stuck. How do I use Watir to:

1. Find the <table> tag.
2. Collect everything (i.e. the HTML tags and the text) between <table> and </table>.
3. Extract those results to a .txt file (Notepad document) and save it in a specific folder.

FYI the html table looks like this...

<table border="1" cellpadding="2">
<tr>
<th> Address </th>
<th> Council tax band </th>
<th> Annual council tax </th>
</tr>

<tr>
<td> 2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ </td>
<td align="center"> F </td>
<td align="center"> £2125 </td>
</tr>

....... The above row is repeated many times ......

</table>

Then the table is closed.

So, to recap my situation: I can use Watir to navigate the browser to the page containing the HTML table, but I am unsure how to extract the results (everything within the <table> tag, including the HTML) to a .txt file and then save that .txt file onto my computer.

I would prefer to take small steps with Watir. I am new to it, so I would just like to learn how to extract the table and save everything I have extracted into a .txt file. I have seen a couple of examples online using hpricot. However, most of them seem to leave out the code showing how the array (if that is the correct approach) is written to a .txt file.

Could you help by demonstrating how to write a simple piece of code which will extract the HTML table (the <table> tag and everything inside it, including the <tr>, <th> and <td> tags and everything in between) to a .txt Notepad file?

Many thanks for your time.

怪异←思 2024-12-25 23:10:05

To get HTML of the entire table (if it is the only table on the page):

browser.table.html

You will get something like this:

=> "<table border=\"1\" cellpadding=\"2\">\n<tbody><tr>\n<th> Address </th>\n<th> Council tax band </th>\n<th> Annual council tax </th>\n</tr>\n\n<tr>\n<td> 2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ </td>\n<td align=\"center\"> F </td>\n<td align=\"center\"> £2125 </td>\n</tr>\n\n</tbody></table>"
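Since the question asks for the table markup itself in a .txt file inside a chosen folder, one way to finish the job is to write that string to disk. The following is a minimal sketch rather than a tested recipe: `save_table_html`, the `exports` folder name and the sample string are made up for illustration, and the string stands in for whatever `browser.table.html` returns in a live session.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: writes a string of HTML to <folder>/<filename>,
# creating the folder first if it does not exist. Returns the full path.
def save_table_html(html, folder, filename = "table.txt")
  FileUtils.mkdir_p(folder)
  path = File.join(folder, filename)
  File.write(path, html)
  path
end

# Stand-in for the value of browser.table.html in a real Watir session.
sample_html = "<table border=\"1\" cellpadding=\"2\">\n" \
              "<tr>\n<th> Address </th>\n</tr>\n" \
              "</table>"

# Any folder works; a temp-dir subfolder keeps this sketch runnable anywhere.
target_folder = File.join(Dir.tmpdir, "exports")
saved_path = save_table_html(sample_html, target_folder)
puts "Saved table HTML to #{saved_path}"
```

In a live session the first argument would be `browser.table.html` and the second would be whichever folder you want the file to land in.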

To get HTML of each row and put it in an array:

browser.table.trs.collect {|tr| tr.html}

=> ["<tr>\n<th> Address </th>\n<th> Council tax band </th>\n<th> Annual council tax </th>\n</tr>",
    "<tr>\n<td> 2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ </td>\n<td align=\"center\"> F </td>\n<td align=\"center\"> £2125 </td>\n</tr>"]

To get the text of each cell and put it in an array of arrays (one inner array per row):

browser.table.trs.collect {|tr| [tr[0].text, tr[1].text, tr[2].text]}
=> [["Address", "Council tax band", "Annual council tax"],
    ["2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ", "F", "£2125"]]

To write text of each cell to file:

content = browser.table.trs.collect {|tr| [tr[0].text, tr[1].text, tr[2].text]}
File.open("table.txt", "w") {|file| file.puts content}

The file will look like this:

Address
Council tax band
Annual council tax
2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ
F
£2125
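
Because `puts` flattens nested arrays, the file above ends up with every cell on its own line. If one row per line is preferable, the cells of each row can be joined first. This is a sketch under that assumption: `rows_to_lines` is a hypothetical helper, the tab separator is an arbitrary choice, and the `rows` array stands in for the result of the `collect` call above.

```ruby
require "tmpdir"

# Hypothetical helper: turns an array of rows (each an array of cell
# strings) into one line of text per row, cells separated by tabs.
def rows_to_lines(rows, separator = "\t")
  rows.map { |cells| cells.join(separator) }
end

# Stand-in for the nested array built from the table rows
# (pound sign omitted to keep the sample ASCII-only).
rows = [
  ["Address", "Council tax band", "Annual council tax"],
  ["2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ", "F", "2125"]
]

lines = rows_to_lines(rows)
output_path = File.join(Dir.tmpdir, "table_rows.txt")

# puts writes each array element followed by a newline, so each row
# ends up on its own line in the file.
File.open(output_path, "w") { |file| file.puts lines }
```

A tab-separated file like this also opens cleanly in a spreadsheet, which may or may not matter for your use.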

我的黑色迷你裙 2024-12-25 23:10:05

There are a lot of ways to approach this. If we knew a bit more about what you are specifically trying to accomplish, we could give you answers that are more specific instead of general.

You can use .collect as Zeljko has shown if you want to convert stuff to arrays. If you just want to work with the data or iterate over the rows and cells in the table then .each or .each_with_index may be what you want.

I suspect you really want the text from the table, not the HTML. So here's something to try (untested but it should work)

# :how => what is a placeholder; substitute a real locator (for example an :id or :index)
browser.table(:how => what).rows.each_with_index do |row, r|
  row.cells.each_with_index do |cell, c|
    puts "Row:#{r} Cell:#{c} text is: #{cell.text}"
  end
end

If .rows or .cells does not work (unknown method) in the above, try replacing them with .trs and .tds respectively (not all versions of Watir have the friendly aliases for those methods).

See if that spits out what you are interested in. If so, you should be able to easily modify to write what you want to a file instead of putting it to the screen.
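
One possible way to make that change is to open the file first and call `puts` on the file handle instead of printing to the screen. This is a sketch only: the nested `rows` array stands in for the Watir rows and cells (where you would call `cell.text`), and the output filename is arbitrary.

```ruby
require "tmpdir"

# Stand-in data: in a Watir session the outer loop would run over the
# table's rows and the inner loop over each row's cells.
rows = [
  ["Address", "Council tax band"],
  ["2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ", "F"]
]

report_path = File.join(Dir.tmpdir, "table_cells.txt")

File.open(report_path, "w") do |file|
  rows.each_with_index do |row, r|
    row.each_with_index do |cell_text, c|
      # Same line format as the loop above, sent to the file instead of stdout.
      file.puts "Row:#{r} Cell:#{c} text is: #{cell_text}"
    end
  end
end
```

The block form of `File.open` closes the file automatically when the loop finishes, so nothing is left unflushed.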

However if verification is your goal, then it might be easier to have the automation code look things up in the db and do the comparison for you.
