Python HTML scraping
It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example:
<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">
I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code?
I'm guessing html scraping libs, such as BeautifulSoup, are a bit of overkill just for this...
Huge thanks!
Comments (7)
Regex is usually a bad idea; try using BeautifulSoup.
Quick example:
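Something along these lines should work. It is a minimal sketch, assuming the `bs4` package is installed and the page is already in a string; the sample HTML and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample page; in practice this would be the downloaded HTML.
page = """
<p>
  <a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">wanted</a>
  <a class="other" href="/url/skip-me">not wanted</a>
</p>
"""

# "html.parser" is the parser bundled with Python, so no extra dependency.
soup = BeautifulSoup(page, "html.parser")

# class_ matches any <a> whose class list contains "myClass";
# href=True skips anchors that have no href attribute at all.
hrefs = [a["href"] for a in soup.find_all("a", class_="myClass", href=True)]
print(hrefs)
```

Because `class_` matches against the list of classes, this would also pick up a tag written as `class="myClass extra"`, which a naive text search would miss.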
Aargh, not regex for parsing HTML!
Luckily in Python we have BeautifulSoup or lxml to do that job for us.
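For the lxml route, a rough sketch could look like the following. It assumes the `lxml` package is installed; the sample document is invented for illustration, and the XPath compares the whole `class` attribute, so it only matches tags whose class is exactly `myClass`.

```python
from lxml import html

# Invented sample document standing in for the real page.
page = """
<html><body>
  <a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">wanted</a>
  <a class="other" href="/url/skip-me">not wanted</a>
</body></html>
"""

tree = html.fromstring(page)

# Select the href attribute of every <a> whose class attribute is exactly "myClass".
hrefs = tree.xpath('//a[@class="myClass"]/@href')
print(hrefs)
```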
Regex would be a bad choice. HTML is not a regular language. How about Beautiful Soup?
Regex should not be used to parse HTML. See the first answer to this question for an explanation :)
+1 for BeautifulSoup.
If your task really is this simple, plain string manipulation (no regex needed) will do.
An HTML parser is not a must for such cases.
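As a rough illustration of the string-manipulation idea, here is a sketch using only `str.find`. The helper name `find_hrefs` and the sample string are hypothetical, and it assumes the page always writes the tag exactly as `<a class="myClass" href="...">`, with the same attribute order, spacing, and double quotes.

```python
def find_hrefs(page, marker='<a class="myClass" href="'):
    """Return every href value that directly follows the exact marker text."""
    hrefs = []
    start = page.find(marker)
    while start != -1:
        value_start = start + len(marker)
        value_end = page.find('"', value_start)
        if value_end == -1:  # unterminated attribute; stop scanning
            break
        hrefs.append(page[value_start:value_end])
        # Continue searching after the value just extracted.
        start = page.find(marker, value_end)
    return hrefs


sample = (
    '<p><a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">x</a> '
    '<a class="other" href="/skip">y</a></p>'
)
print(find_hrefs(sample))
```

This breaks as soon as the markup changes shape (single quotes, reordered attributes, an extra class), which is the trade-off against a real parser.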
The thing is, I know the structure of the HTML page, and I just want to find that specific kind of link (where class="myclass"). BeautifulSoup anyway?
Read "Parsing Html The Cthulhu Way": https://blog.codinghorror.com/parsing-html-the-cthulhu-way/