Extracting meta keywords from a web page?

Posted on 2024-09-08 12:39:20

I need to extract the meta keywords from a web page using Python. I was thinking that this could be done using urllib or urllib2, but I'm not sure. Anyone have any ideas?

I am using Python 2.6 on Windows XP

Comments (3)

倒带 2024-09-15 12:39:20

lxml is faster than BeautifulSoup (I think) and has much better functionality, while remaining relatively easy to use. Example:

from urllib import urlopen  # Python 2; in Python 3 this lives in urllib.request
from lxml import etree

f = urlopen("http://www.google.com").read()
tree = etree.HTML(f)
m = tree.xpath("//meta")  # every <meta> element in the page

for i in m:
    print etree.tostring(i)

# Output:
# <meta http-equiv="content-type" content="text/html; charset=ISO-8859-2"/>

Edit: another example.

f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
tree = etree.HTML(f)
tree.xpath("//meta[@name='Keywords']")[0].get("content")

# Output:
# "xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql,colors,soap,php,authoring,programming,training,learning,beginner's guide,primer,lessons,school,howto,reference,examples,samples,source code,tags,demos,tips,links,FAQ,tag list,forms,frames,color table,w3c,cascading style sheets,active server pages,dynamic html,internet,database,development,Web building,Webmaster,html guide"

BTW: XPath is worth knowing.

Another edit:

Alternatively, you can just use a regular expression:

import re

f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
re.search("<meta name=\"Keywords\".*?content=\"([^\"]*)\"", f).group(1)

# Output:
# "xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql, ...etc...

...but I find it less readable and more error-prone (though it involves only standard modules and still fits on one line).
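
For readers who want to stay within the standard library without relying on a regular expression, here is a minimal sketch using a real HTML parser. It is not part of the original answer. It assumes Python 3, where the module is named `html.parser` (in Python 2 it was `HTMLParser`). The class name `MetaKeywordsParser`, the helper `extract_meta_keywords`, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collect the keywords listed in every <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute
        # values, so the value of "name" is compared case-insensitively here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                part.strip() for part in content.split(",") if part.strip()
            )


def extract_meta_keywords(html):
    """Return the meta keywords found in an HTML string as a list."""
    parser = MetaKeywordsParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical sample page, used here instead of a live download.
sample_html = (
    '<html><head><META content="xml, tutorial,html" NAME="Keywords">'
    "</head><body></body></html>"
)
print(extract_meta_keywords(sample_html))  # ['xml', 'tutorial', 'html']
```

Because the parser normalizes tag and attribute names, the sketch does not care about attribute order, quoting style, or upper-case markup, which are the cases where a hand-written pattern tends to break.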

思念绕指尖 2024-09-15 12:39:20

BeautifulSoup is a great way to parse HTML with Python.

Particularly, check out the findAll method:
http://www.crummy.com/software/BeautifulSoup/documentation.html
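
As a rough illustration of that suggestion, the sketch below assumes the newer `bs4` package (BeautifulSoup 4) on Python 3, where the method is spelled `find_all`; `findAll` is the older spelling used in the documentation linked above. The function name `meta_keywords` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a live download.
sample_html = """
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta name="Keywords" content="xml, tutorial, html">
  </head>
  <body></body>
</html>
"""


def meta_keywords(html):
    """Return the keywords from <meta name="keywords"> tags as a list."""
    # "html.parser" selects the parser bundled with Python.
    soup = BeautifulSoup(html, "html.parser")
    keywords = []
    for tag in soup.find_all("meta"):
        # Compare the name case-insensitively: pages use both
        # "keywords" and "Keywords".
        if (tag.get("name") or "").lower() == "keywords":
            content = tag.get("content") or ""
            keywords.extend(
                part.strip() for part in content.split(",") if part.strip()
            )
    return keywords


print(meta_keywords(sample_html))  # ['xml', 'tutorial', 'html']
```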

七分※倦醒 2024-09-15 12:39:20

Why not use a regular expression?

import re

# html is assumed to already hold the page source as a string
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')

keywordlist = keywordregex.findall(html)
if len(keywordlist) > 0:
    keywordlist = keywordlist[0]
    keywordlist = keywordlist.split(", ")
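
The pattern above only matches a lowercase `keywords` value, with `name` placed directly before `content` and the tag closed by ` />`, and `split(", ")` expects a space after every comma. A slightly more tolerant sketch, assuming Python 3 and using made-up names (`META_TAG`, `ATTRIBUTE`, `keywords_from_html`), could look like this. It is still a regular-expression heuristic, not a full HTML parser, and it will miss tags whose attribute values contain `>`.

```python
import re

# Any <meta ...> tag, regardless of letter case.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
# name="value" or name='value' pairs inside a tag.
ATTRIBUTE = re.compile(r"""([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def keywords_from_html(html):
    """Return the keywords of the first <meta name="keywords"> tag, or []."""
    for tag in META_TAG.findall(html):
        attrs = {}
        for name, double_quoted, single_quoted in ATTRIBUTE.findall(tag):
            attrs[name.lower()] = double_quoted or single_quoted
        if attrs.get("name", "").lower() == "keywords":
            content = attrs.get("content", "")
            # Split on commas and trim, so "a,b" and "a, b" both work.
            return [part.strip() for part in content.split(",") if part.strip()]
    return []


# Hypothetical sample markup with content before name and no self-closing slash.
sample_html = "<head><META content='python, html,meta' NAME='Keywords'></head>"
print(keywords_from_html(sample_html))  # ['python', 'html', 'meta']
```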