Screen scraping records not importing correctly

Posted on 2025-01-03 18:01:08

I have the following section of code from my screen scraping script (in a Rails 3.1 application):

# Add each row to a new call record
page = agent.page.search("table tbody tr").each do |row|
  next if (!row.at('td'))
  time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
  call = Call.find_or_create_by_time(time)
  call.update_attributes({:time => time, :source => source, :destination => destination, :duration => duration})
end

This was working but I think a few changes have been made on the remote site (they don't currently have an API).

The new HTML code is as follows:

<tr class='o'>
<td class='checkbox'><input class="bulk-check" id="recordings_13877" name="recordings[13877]" type="checkbox" value="1" /></td>
<td>09 Feb 11:37</td>
<td>Danny McClelland</td>
<td>01772123573</td>
<td>00:00:28</td>
<td></td>
<td class='opt recording'>
<a href="/unit/27/logs/recording/13877"><img alt="" class="icon recordings" src="/images/icons/recordings.png?1313703677" title="" /></a>
<a href="/unit/27/logs/recording/13877" data-confirm="Are you sure you wish to delete this recording?" data-method="delete" rel="nofollow"><img alt="" class="icon recording-remove" src="/images/icons/recording-remove.png?1317304112" title="" /></a>
</td>
</tr>

Since the suspected changes, the data is being imported into the wrong fields or missed completely. Currently the only part of the data I want/need is:

<td>09 Feb 11:37</td>
<td>Danny McClelland</td>
<td>01772123573</td>
<td>00:00:28</td>

Sadly, those rows don't have any unique identifiers.

Any help/advice is appreciated!
Is there a better way to write the script that is more 'future' proof?

Comments (1)

终止放荡 2025-01-10 18:01:08

The first td is a checkbox now, so just skip it by changing the assignment to:

time, source, destination, duration = row.search('td')[1..4].map{ |td| td.text.strip }

There's really no way to future-proof a scraper (unless you're psychic).
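
That said, a scraper can at least refuse to import rows that no longer look right, rather than silently writing values into the wrong fields. The following is only a sketch under assumptions: the helper name extract_call_fields and both patterns are hypothetical, inferred from the single sample row in the question, and cells is assumed to be the array returned by row.search('td').map { |td| td.text.strip }. It finds the timestamp cell by its format instead of by a fixed position, takes the three cells after it, and gives up if the duration does not match.

```ruby
# Patterns inferred from the sample row in the question (assumptions, adjust as needed).
TIME_PATTERN     = /\A\d{2} [A-Z][a-z]{2} \d{2}:\d{2}\z/  # e.g. "09 Feb 11:37"
DURATION_PATTERN = /\A\d{2}:\d{2}:\d{2}\z/               # e.g. "00:00:28"

# Hypothetical helper, not part of Mechanize or Nokogiri.
# cells: array of stripped cell texts for one table row.
# Returns a hash of call attributes, or nil if the row does not look like a call.
def extract_call_fields(cells)
  # Locate the timestamp by its format, so extra leading cells do not matter.
  start = cells.index { |text| text =~ TIME_PATTERN }
  return nil if start.nil?

  time, source, destination, duration = cells[start, 4]
  # Reject the row if the fourth value is missing or is not a duration.
  return nil unless duration.to_s =~ DURATION_PATTERN

  { :time => time, :source => source, :destination => destination, :duration => duration }
end
```

In the loop from the question, this would be called once per row, skipping the row (and ideally logging it) when the result is nil, and passing the hash to update_attributes otherwise. It still breaks if the site reorders the columns or changes the date format, but it fails visibly instead of storing bad data.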
