Screen scraping records not importing correctly
I have the following section of code from my screen scraping script (in a Rails 3.1 application):
# Add each row to a new call record
page = agent.page.search("table tbody tr").each do |row|
  next if (!row.at('td'))
  time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
  call = Call.find_or_create_by_time(time)
  call.update_attributes({:time => time, :source => source, :destination => destination, :duration => duration})
end
This was working but I think a few changes have been made on the remote site (they don't currently have an API).
The new HTML code is as follows:
<tr class='o'>
  <td class='checkbox'><input class="bulk-check" id="recordings_13877" name="recordings[13877]" type="checkbox" value="1" /></td>
  <td>09 Feb 11:37</td>
  <td>Danny McClelland</td>
  <td>01772123573</td>
  <td>00:00:28</td>
  <td></td>
  <td class='opt recording'>
    <a href="/unit/27/logs/recording/13877"><img alt="" class="icon recordings" src="/images/icons/recordings.png?1313703677" title="" /></a>
    <a href="/unit/27/logs/recording/13877" data-confirm="Are you sure you wish to delete this recording?" data-method="delete" rel="nofollow"><img alt="" class="icon recording-remove" src="/images/icons/recording-remove.png?1317304112" title="" /></a>
  </td>
</tr>
Since the suspected changes, the data is being imported into the wrong fields or missed completely. Currently, the only part of the data I want/need is:
<td>09 Feb 11:37</td>
<td>Danny McClelland</td>
<td>01772123573</td>
<td>00:00:28</td>
Sadly, those rows don't have any unique identifiers.
Any help/advice is appreciated!
Is there a better way to write the script that is more 'future' proof?
Comments (1)
The first td is a checkbox now.
So just change it to:
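A minimal sketch of that change, assuming the four wanted cells keep positions 1 to 4, right after the new checkbox cell. The helper name `call_fields` is hypothetical, and it takes plain strings so it can run without Mechanize; inside the scraper loop the same slice would be `row.search('td').map { |td| td.text.strip }[1..4]`.

```ruby
# Hypothetical helper: takes the stripped text of every <td> in one row
# and returns the attributes for the Call record.
def call_fields(cells)
  # Index 0 is now the checkbox cell, so take cells 1 through 4 only.
  time, source, destination, duration = cells[1..4]
  { :time => time, :source => source, :destination => destination, :duration => duration }
end

# Cell texts as they would come out of the sample row in the question.
cells = ["", "09 Feb 11:37", "Danny McClelland", "01772123573", "00:00:28", "", ""]
fields = call_fields(cells)
puts fields.inspect
```

The trailing empty cell and the recording links are ignored because the slice stops at index 4.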
There's really no way to future-proof a scraper (unless you're psychic).
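That said, a scraper can at least fail loudly instead of saving data into the wrong fields. The sketch below is one way to do that, not a guarantee against future changes: it checks the time and duration cells against patterns guessed from the single sample row in the question. The helper name `parse_call_row` and both patterns are assumptions.

```ruby
# Hypothetical helper: returns the Call attributes for one row, or raises
# if the cells no longer look like the sample row from the question.
def parse_call_row(cells)
  time_format     = /\A\d{2} [A-Z][a-z]{2} \d{2}:\d{2}\z/  # e.g. "09 Feb 11:37"
  duration_format = /\A\d{2}:\d{2}:\d{2}\z/                # e.g. "00:00:28"

  time, source, destination, duration = cells[1..4]

  # to_s guards against nil when a row has fewer cells than expected.
  unless time.to_s =~ time_format && duration.to_s =~ duration_format
    raise ArgumentError, "Unexpected row layout: #{cells.inspect}"
  end

  { :time => time, :source => source, :destination => destination, :duration => duration }
end

# Current layout: the checkbox cell comes first, so this row parses.
good = parse_call_row(["", "09 Feb 11:37", "Danny McClelland", "01772123573", "00:00:28", "", ""])
puts good.inspect

# Old layout (no checkbox cell): the columns are shifted, so this raises.
begin
  parse_call_row(["09 Feb 11:37", "Danny McClelland", "01772123573", "00:00:28"])
rescue ArgumentError => e
  puts "Layout changed: #{e.message}"
end
```

Raising on a mismatch means the next markup change shows up as an error rather than as quietly misfiled records.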