How to fail gracefully in Ruby on Rails and get notified when screen scraping fails

Posted on 2024-09-25 00:11:34


I am working on a Rails 3 project that relies heavily on screen scraping to collect data, mainly using Nokogiri. I'm aggregating essentially the same data, but I'm grabbing it from many different sources, and as time goes on I will be adding more and more. However, I am acutely aware that screen scraping can be notoriously unreliable.

As such, I am interested in how other people have handled the problem of verifying the data and getting notified when scraping fails.

My current plan is as follows:

1. I am going to have validations on my model for most of the fields. If they fail, bad data won't get into my system, although logging the failure in a meaningful way is still a problem.
2. I was thinking of some kind of counter, so that after a certain number of failures from a particular source I somehow turn it off. I'm not sure how to keep track of that. I guess the only way is to have a field on my Source model that counts failures and can be reset.
3. Logging is the 800-pound gorilla that I'm not sure how to deal with. I could just write to the standard logs, but if something fails I'd like to store the entire HTML so I can figure out what happened. I also need to notify myself somehow so I can address the issues. I thought of maybe creating a model for all this and storing it in the database. If I did that, I'd probably have to store the HTML on S3 or something similar. I'm running this on Heroku, so that influences what I can do.
4. Set up begin and rescue blocks around every field. I was trying to figure out a way to code this in a nicer Ruby way so I don't just have a page of them. Although some fields are just a straight-up doc.at_css("#whatever"), quite a number require various formatting or calculations, so I think it makes sense to rescue those so I can log what went wrong. The other option is to let the exception bubble up and catch it when I try to create the model.
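As a rough illustration of points 2 and 4, here is a minimal sketch that wraps each field's extraction in a single rescuing helper and tracks consecutive failures per source. FieldExtractor and SourceHealth are hypothetical names, not part of Rails or Nokogiri, and a plain Hash stands in for the parsed page; with Nokogiri, the blocks would presumably call something like doc.at_css instead.

```ruby
# Hypothetical helper: runs one extraction block per field and records
# failures instead of raising, so one bad field does not hide the others.
class FieldExtractor
  attr_reader :values, :errors

  def initialize(doc)
    @doc = doc
    @values = {}
    @errors = {}
  end

  # Yields the document to the block; stores the result or the error message.
  def field(name)
    @values[name] = yield(@doc)
  rescue StandardError => e
    @errors[name] = "#{e.class}: #{e.message}"
  end

  def ok?
    @errors.empty?
  end
end

# Hypothetical stand-in for a failure counter column on the Source model.
class SourceHealth
  attr_reader :failures

  def initialize(max_failures)
    @max_failures = max_failures
    @failures = 0
  end

  # A success resets the counter; a failure increments it.
  def record(success)
    @failures = success ? 0 : @failures + 1
  end

  def disabled?
    @failures >= @max_failures
  end
end

# Usage, with a plain Hash standing in for a parsed page (assumption).
doc = { "price" => "$1,234.50", "title" => nil }
extractor = FieldExtractor.new(doc)
extractor.field(:price) { |d| Float(d.fetch("price").delete("$,")) }
extractor.field(:title) { |d| d.fetch("title").strip }

health = SourceHealth.new(3)
health.record(extractor.ok?)
```

The errors hash could then be logged or stored alongside the raw HTML, and the model would only be created when ok? returns true.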

Anyway, I'm sure I'm not thinking of everything, which is why I'm trying to find out how other people have handled this problem.


少女净妖师 2024-10-02 00:11:34


Our team does something similar to this, so here are some ideas:

We use a very high-level begin/rescue around a transaction to make sure we don't get into weird half-loaded states:

begin
  # Everything inside the block is rolled back if an exception is raised.
  ActiveRecord::Base.transaction do
    # ...try to load a data source...
  end
rescue => e
  # ...error handling...
end

Email or page yourself when certain errors occur. We use exception_notifier, but if you're on Heroku the Exceptional plugin also seems like a good option. I've also heard of people having success with Hoptoad.

Capturing state is VERY important for troubleshooting issues. Something that has worked quite well for us is Gmail. Our loaders effectively have two phases:
1. Capture the data and send it to our Gmail account.
2. Log into Gmail, download the latest data, and parse it.

The second phase is the complex one, and if it fails a developer can simply log into the Gmail account and inspect the failed message. This process has some limitations (per-email and per-mailbox storage limits, a two-phase pipeline, etc.), and we started out doing it because we had no other option, but it has proven shockingly resilient and convenient. Keep email in mind as a cheap and easy way to store noncritical state. We didn't start out thinking of using it that way and are now really glad we do. Logging into Gmail feels better than digging through log files.
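The two-phase idea might be sketched roughly as follows. This is only an illustration: CaptureStore is a hypothetical in-memory stand-in for the mailbox (or an S3 bucket), and parse_capture and the sample HTML are made up for the example. The point is that the raw payload is saved before any parsing, so a failed parse still leaves the original input available for inspection.

```ruby
# Hypothetical in-memory stand-in for the mailbox that holds raw captures.
class CaptureStore
  Capture = Struct.new(:source, :fetched_at, :body, :error)

  def initialize
    @captures = []
  end

  # Phase 1: save the raw payload untouched, before any parsing happens.
  def save(source, body, fetched_at = Time.now.utc)
    capture = Capture.new(source, fetched_at, body, nil)
    @captures << capture
    capture
  end

  # Captures whose parse failed, with the raw body still attached.
  def failed
    @captures.select(&:error)
  end
end

# Phase 2: parse a stored capture; on failure, keep the error next to the raw body.
def parse_capture(capture)
  yield(capture.body)
rescue StandardError => e
  capture.error = "#{e.class}: #{e.message}"
  nil
end

store = CaptureStore.new
good = store.save("source_a", "<span id='price'>42</span>")
bad  = store.save("source_b", "<span id='price'>n/a</span>")

# Toy parser (assumption): pulls the text between the first ">" and the next "<".
parser = ->(html) { Integer(html[/>(.*?)</, 1]) }
good_value = parse_capture(good, &parser)
bad_value  = parse_capture(bad, &parser)
```

After a run, store.failed returns the captures that could not be parsed, each with its raw body and error message.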

Build a dashboard UI. We have a simple dashboard with a grid of sources by day. Each box is colored either red or green based on whether the load for that source on that day succeeded. You can go one step further and set up a monitor on this UI (mon.itor.us or equivalent) that raises an alarm if some error threshold is met.
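The data behind such a grid could be built along these lines. This is a sketch under assumptions: the result tuples, status_grid, and alarm? are all hypothetical names and sample data, and a real dashboard would read load results from the database instead.

```ruby
require "date"

# Each load result is [source name, date, succeeded?]; the data is made up.
results = [
  ["source_a", Date.new(2024, 10, 1), true],
  ["source_a", Date.new(2024, 10, 2), false],
  ["source_b", Date.new(2024, 10, 1), true],
  ["source_b", Date.new(2024, 10, 2), true],
]

# Builds { source => { date => :green or :red } }.
def status_grid(results)
  results.each_with_object({}) do |(source, day, ok), grid|
    (grid[source] ||= {})[day] = ok ? :green : :red
  end
end

# True when the share of red boxes on the given day reaches the threshold.
def alarm?(grid, day, threshold)
  statuses = grid.values.map { |days| days[day] }.compact
  return false if statuses.empty?
  statuses.count(:red).fdiv(statuses.size) >= threshold
end

grid = status_grid(results)
```

A view would render one colored box per entry in the grid, and alarm? gives an external monitor a single yes/no value to poll.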
