How to scrape a page title in Racket?
I got to the point of obtaining the html of the page with the following code:
#!/usr/bin/env racket
#lang racket/base
(require net/url racket/port)
(require (planet neil/html-parsing:3:0))
(define p (get-pure-port (string->url "https://www.rosettacode.org/wiki/Web_scraping")))
(define my-html (port->string p))
(close-input-port p)
How do I get the title, i.e. the text inside the <title> tag, from my-html?
I prefer working with XML tooling over HTML, or in Racket (and Scheme in general), SXML. That lets you use XPath-like queries to easily extract data from the document. Luckily, it's simple to parse HTML into an SXML expression: html->xexp takes either a string holding an HTML document or an input port, so in your case it can be given either my-html or the port p.

The interesting bit is sxpath, which takes an SXPath string and returns a new procedure that, when called in turn with an SXML argument, returns a list of all matches. If you're going to be looking for the same thing repeatedly, it's worth defining a new function instead of using a temporary.

The html-parsing and sxml packages should be installed via the DrRacket package manager or from the command line with raco pkg install html-parsing sxml, whichever you prefer.
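The approach described above might look roughly like the following. This is a minimal sketch, not the answer's original code: it assumes the html-parsing and sxml packages are installed, uses a small literal string in place of my-html so that it runs without network access, and the names sample-html, get-titles and page-title are made up for illustration. The query "//title/text()" is one plausible way to select the title text.

```racket
#lang racket/base
;; Sketch: pull the <title> text out of an HTML document via SXML.
;; Assumes: raco pkg install html-parsing sxml
(require html-parsing sxml)

;; Hypothetical stand-in for the page source; in the question this
;; would be my-html (or the input port p itself).
(define sample-html
  "<html><head><title>Web scraping</title></head><body><p>Hello</p></body></html>")

;; html->xexp accepts a string or an input port and returns an SXML tree.
(define sample-sxml (html->xexp sample-html))

;; sxpath turns the query into a procedure once, so it can be reused
;; on many documents instead of being rebuilt as a temporary each time.
(define get-titles (sxpath "//title/text()"))

;; The query returns a list of matching text nodes; join them into one
;; string (an empty list yields the empty string).
(define (page-title doc)
  (apply string-append (get-titles doc)))

(displayln (page-title sample-sxml))
```

To use it with the code from the question, replace sample-html with my-html, or pass the port p to html->xexp before closing it.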