Scraping tweets - is it better to use the website or the API?
I'm using the twitter gem to build a Twitter bot in Ruby. I'm trying to make it self-sustaining, so I want it to generate its own content to tweet by scraping tweets from users outside its social circle (and then perhaps garbling them with a Markov chain generator).
Which one is the better strategy?
- Search for tweets via the API
- Load Twitter pages and scrape tweets with Hpricot or Nokogiri
Also, how can I try to ensure the base tweets come from outside my bot's followers' friends so it's harder to tell it's a bot?
At the moment I use a .yml file with tweets I generated by hand, which is far from ideal.
Comments (2)
There are two questions here.
It's always better to use an API where one is available. This will future-proof you against the bot randomly breaking when a simple HTML element is changed, and it will also allow the website (i.e., Twitter) to rate limit your searches in case you put too high a load on the service. Although this is unlikely for Twitter, it's good practice.
Sometimes, the information you want is unobtainable via the API. In this case, you should consider whether you really need to scrape it and, if so, how to throttle your requests so that you stay polite.
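If you do end up scraping, one way to stay polite is to enforce a minimum delay between requests. This is only a minimal sketch: the `PoliteThrottle` class, its parameters, and the two-second interval are hypothetical names and values chosen for illustration, not part of any library.

```ruby
# Hypothetical helper: enforces a minimum delay between consecutive requests.
class PoliteThrottle
  # min_interval: seconds to leave between two calls.
  # clock and sleeper are injectable so the class can be tested without waiting.
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_run = nil
  end

  # Runs the block, first waiting if the previous call was too recent.
  # Returns whatever the block returns.
  def call
    if @last_run
      wait = @min_interval - (@clock.call - @last_run)
      @sleeper.call(wait) if wait > 0
    end
    @last_run = @clock.call
    yield
  end
end

# Assumed interval of two seconds; pick whatever the site can comfortably bear.
throttle = PoliteThrottle.new(2.0)
```

Each page fetch would then be wrapped as `throttle.call { ... }`, so that no two requests go out less than two seconds apart.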
Basically, if the API allows you to do what you want, use it for maintainability.
As for your second question, I do not have any experience with the Twitter API. Is there a method to get the Twitter IDs of all your followers, and of the accounts they follow? If not, you'll be forced to scrape as mentioned earlier, if you really do need this information.
Once you have a list of those who your followers follow, you can check if the ID of the poster of what you want to repost falls inside this set.
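The membership check could look roughly like this. It is a sketch that assumes the ID lists have already been fetched somehow; `friends_by_follower`, `excluded_ids`, `outside_circle`, and the tweet hashes are hypothetical names and sample data, not anything returned by the Twitter API.

```ruby
require "set"

# Collects every account ID followed by at least one of the bot's followers.
# friends_by_follower is a hypothetical Hash of follower_id => [friend_ids].
def excluded_ids(friends_by_follower)
  friends_by_follower.values.flatten.to_set
end

# Keeps only the tweets whose author is outside the excluded set.
# Each tweet is assumed to be a Hash with a :user_id key.
def outside_circle(tweets, excluded)
  tweets.reject { |tweet| excluded.include?(tweet[:user_id]) }
end

# Made-up sample data for illustration.
friends_by_follower = { 1 => [10, 11], 2 => [11, 12] }
tweets = [
  { user_id: 10, text: "from a follower's friend" },
  { user_id: 99, text: "from a stranger" }
]

usable = outside_circle(tweets, excluded_ids(friends_by_follower))
```

A Set is used so that each lookup stays cheap even when the followers collectively follow a large number of accounts.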
Would you consider retweeting for this aspect of the bot?
Another thing to note is performance. If you were to scrape the website, you would have to download the entire page and then parse it (which is processor intensive in itself), whereas hitting the API returns only JSON/XML data.
So from strictly a performance standpoint, I would go with the API.
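To illustrate how little work a structured response needs, here is a small sketch using Ruby's standard json library. The response body is a made-up example and its field names are assumptions; a real API response will differ.

```ruby
require "json"

# Hypothetical response body, shaped loosely like a tweet search result.
body = '{"statuses":[{"text":"hello world","user":{"id":42}}]}'

# One parse call yields plain Ruby hashes and arrays; no HTML traversal needed.
tweets = JSON.parse(body).fetch("statuses").map do |status|
  { text: status["text"], user_id: status.dig("user", "id") }
end
```

Compared with scraping, there is no markup to download or walk, and the code does not depend on the page layout.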