Python HTML scraping

Posted 2024-08-12 11:25:16

It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example:

<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">

I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code?
I'm guessing html scraping libs, such as BeautifulSoup, are a bit of overkill just for this...

Huge thanks!

路弥 2024-08-19 11:25:16

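Regex is usually a bad idea; try using BeautifulSoup.

Quick example:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Sample markup from the question; replace with the fetched page source.
html = '<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">link</a>'
soup = BeautifulSoup(html, 'html.parser')
# Class matching is case-sensitive, so use 'myClass' exactly as in the markup.
links = soup.find_all('a', attrs={'class': 'myClass'})
for link in links:
    print(link['href'])  # process link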

白云悠悠 2024-08-19 11:25:16

Argh, not regex for parsing HTML!

Luckily, in Python we have BeautifulSoup or lxml to do that job for us.
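
If installing either library feels heavy for a one-off lookup, the standard library's `html.parser` module follows the same parse-instead-of-regex approach. This is a swapped-in alternative, not one of the two libraries named above, and only a minimal sketch: the names `HrefCollector` and `find_hrefs` are hypothetical, and the sample markup is the one from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags that carry a given class (hypothetical helper)."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if self.class_name in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_hrefs(html, class_name):
    """Return the href of every <a> tag whose class list contains class_name."""
    parser = HrefCollector(class_name)
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = '<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">x</a>'
print(find_hrefs(sample, "myClass"))
```

Because the parser hands over attributes as name/value pairs, attribute order and quoting style in the page do not matter to the lookup.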

夜血缘 2024-08-19 11:25:16

Regex would be a poor choice. HTML is not a regular language. How about Beautiful Soup?
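
To make the practical side of that objection concrete (this is not the formal-language argument itself), here is a small illustration of how brittle a hand-written pattern can be. The pattern `NAIVE`, the helper `naive_hrefs`, and the three sample strings are made up for this sketch; only the first string has the shape shown in the question.

```python
import re

# Naive pattern: assumes class comes first, double quotes, and single spaces.
NAIVE = re.compile(r'<a class="myClass" href="([^"]*)"')


def naive_hrefs(html):
    """Return whatever the naive pattern manages to capture."""
    return NAIVE.findall(html)


as_in_question = '<a class="myClass" href="/url/1">'
reordered = '<a href="/url/2" class="myClass">'    # same tag, attributes swapped
single_quoted = "<a class='myClass' href='/url/3'>"  # same tag, other quote style

print(naive_hrefs(as_in_question))  # ['/url/1']
print(naive_hrefs(reordered))       # []
print(naive_hrefs(single_quoted))   # []
```

All three strings are equivalent to a browser or an HTML parser, yet the pattern only finds the first one. Each such variation needs another special case in the regex.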

渔村楼浪 2024-08-19 11:25:16

Regexes should not be used to parse HTML. See the first answer to this question for an explanation :)

+1 for BeautifulSoup.

浪荡不羁 2024-08-19 11:25:16

If your task is really this simple, just use string manipulation (no regex needed):

with open("htmlfile") as f:
    for line in f:
        # Only handles tags that fit on one line with double-quoted attributes.
        if "<a class" in line and "myClass" in line and "href" in line:
            s = line[line.index("href") + len('href="'):]
            print(s[:s.index('"')])  # everything up to the closing quote of href

An HTML parser is not a must for such cases.
