How to lazily read a web page in Clojure
A friend and I recently implemented link grabbing in my Clojure IRC bot. When it sees a link, it slurps the page and grabs the title from it. The problem is that it has to slurp the ENTIRE page just to grab the title.
How does one go about reading a page lazily until the first </title>?
Comments (2)
Use line-seq, but don't forget to close the underlying stream when done.
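A minimal sketch of the line-seq approach, assuming the closing </title> tag shows up on some line of the response. The names lines-through-title and page-title are made up for illustration, and the regex extraction is only a rough stand-in for real HTML parsing. with-open closes the reader as soon as the title has been extracted.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn lines-through-title
  "Hypothetical helper: returns the lines read from rdr up to and
  including the first line that contains </title>. The result is
  fully realized (doall) so it is safe to use after rdr is closed."
  [rdr]
  (let [[before after] (split-with #(not (re-find #"(?i)</title>" %))
                                   (line-seq rdr))]
    (doall (concat before (take 1 after)))))

(defn page-title
  "Hypothetical helper: source is anything clojure.java.io/reader
  accepts (URL string, File, Reader, InputStream). Returns the title
  text, or nil if no <title> element was found."
  [source]
  (with-open [rdr (io/reader source)]
    (second (re-find #"(?is)<title[^>]*>(.*?)</title>"
                     (str/join "\n" (lines-through-title rdr))))))
```

Calling (page-title "http://example.com/") would open the URL through clojure.java.io/reader. If the server sends the whole page as a single line, this still reads all of it.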
I wouldn't count on the HTML necessarily being split into lines in a sensible way; without looking outside of our own backyard, e.g. Compojure (or Hiccup currently, I guess) doesn't bother inserting line breaks, I believe (update: just checked Hiccup -- no line breaks).
What I'd suggest instead is lazy XML parsing (with clojure.contrib.lazy-xml) on top of a java.io.BufferedInputStream.
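clojure.contrib.lazy-xml belongs to the old monolithic clojure.contrib and may not be available on a current Clojure setup, so the sketch below swaps in the JDK's StAX pull parser (javax.xml.stream) to illustrate the same idea: parse events are pulled one at a time from a BufferedInputStream, and parsing stops at the first title element. The name title-via-stax is made up for illustration, and the sketch assumes the page is well-formed XML at least up to the end of the title, which real-world HTML often is not.

```clojure
(import '(java.io BufferedInputStream)
        '(javax.xml.stream XMLInputFactory XMLStreamConstants))

(defn title-via-stax
  "Hypothetical helper: pulls parse events from the input stream until
  the first <title> start tag, then returns that element's text.
  Returns nil if the document ends without a title. The caller is
  responsible for closing the underlying stream."
  [in]
  (let [factory (doto (XMLInputFactory/newInstance)
                  ;; deliver adjacent text chunks as one event
                  (.setProperty XMLInputFactory/IS_COALESCING true)
                  ;; do not process DTDs
                  (.setProperty XMLInputFactory/SUPPORT_DTD false))
        reader  (.createXMLStreamReader factory (BufferedInputStream. in))]
    (try
      (loop []
        (when (.hasNext reader)
          (if (and (= XMLStreamConstants/START_ELEMENT (.next reader))
                   (.equalsIgnoreCase "title" (.getLocalName reader)))
            (.getElementText reader)
            (recur))))
      ;; closes the parser only, not the underlying stream
      (finally (.close reader)))))

;; Usage sketch (the URL is a placeholder):
;; (with-open [in (clojure.java.io/input-stream "http://example.com/")]
;;   (title-via-stax in))
```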