How do I scrape a page title in Racket?

Posted 2025-02-13 00:20:07, 403 characters, 1 view, 0 comments

I have gotten as far as fetching the page's HTML with the following code:

#!/usr/bin/env racket
#lang racket/base

(require net/url racket/port)
(require (planet neil/html-parsing:3:0))

(define p (get-pure-port (string->url "https://www.rosettacode.org/wiki/Web_scraping")))
(define my-html (port->string p))
(close-input-port p)

How do I get the title, i.e. the text inside the <title> tag, from my-html?

Comments (1)

楠木可依 2025-02-20 00:20:07

I prefer to work with HTML through XML tooling, which in Racket (and Scheme in general) means SXML. That lets you use XPath-like queries to easily extract data from the document. Luckily, it's simple to parse HTML into an SXML expression:

#!/usr/bin/env racket
#lang racket
(require html-parsing)
(require sxml/sxpath)

(define my-html "<!doctype html><html><head><title>Title text here</title></head><body><p>a paragraph of text</p></body></html>")
(define document (html->xexp my-html))
; Returns a list of strings
(display-lines ((sxpath "/html/head/title/text()") document))

or in your case

(require net/url) ; provides string->url, call/input-url and get-pure-port
(define document (call/input-url (string->url "https://www.rosettacode.org/wiki/Web_scraping")
                                 get-pure-port html->xexp))

(html->xexp takes either a string holding an HTML document or an input port.)
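As a small illustration of that note, the sketch below parses the same markup once from a string and once from an input port created with open-input-string. The sample markup and the names sample-html, from-string and from-port are made up for this example, and it assumes the html-parsing package is installed.

```racket
#lang racket
(require html-parsing)

;; Made-up sample document for illustration only.
(define sample-html
  "<html><head><title>From a port</title></head><body><p>hello</p></body></html>")

;; Parse directly from the string.
(define from-string (html->xexp sample-html))

;; Parse from an input port wrapping the same string.
(define from-port (html->xexp (open-input-string sample-html)))

;; Both calls should yield the same SXML tree.
(equal? from-string from-port)
```

Because a port works as well as a string, there should be no need to read the whole response into a string with port->string before parsing.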

The interesting bit is sxpath, which takes an SXPath string and returns a new procedure that, when called in turn with an SXML argument, returns a list of all matches. If you're going to be looking for the same thing repeatedly, it's worth defining a new function instead of using a temporary:

(define get-title-text (sxpath "/html/head/title/text()"))
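Since the procedure returned by sxpath gives back a list, a caller usually still has to turn that list into a single value. The following is only a sketch of one way to do that, not the only one: page-title is a hypothetical helper name, and returning #f for a page without a title is an assumption made for this example. It joins the matched pieces in case the parser hands back the title text as more than one string.

```racket
#lang racket
(require html-parsing sxml/sxpath)

;; Reusable query, as defined in the answer.
(define get-title-text (sxpath "/html/head/title/text()"))

;; page-title is a hypothetical helper: it takes an HTML string (or input
;; port) and returns the title as one trimmed string, or #f when the
;; query matches nothing.
(define (page-title html)
  (define matches (get-title-text (html->xexp html)))
  (if (null? matches)
      #f
      (string-trim (string-append* matches))))

(page-title "<html><head><title>Example</title></head><body></body></html>")
(page-title "<html><head></head><body><p>no title here</p></body></html>")
```

For a fetched page, the same helper could be passed the input port obtained from get-pure-port instead of a string.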

The html-parsing and sxml packages should be installed via the DrRacket package manager or from the command line with raco pkg install html-parsing sxml, whichever you prefer.
