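3.4 CSS Selectors
CSS stands for Cascading Style Sheets; its selectors are a language for locating parts of an HTML document.
CSS selector syntax is somewhat simpler than XPath, but less powerful. In fact, when we call a Selector object's css method, the Python library cssselect is used internally to translate the CSS selector expression into an XPath expression, and the Selector object's xpath method is then called with the result.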
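Table 3-2 lists some of the basic syntax of CSS selectors.
Table 3-2 CSS selectors
As with XPath, we demonstrate the use of CSS selectors through a few examples.
First, create an HTML document and construct an HtmlResponse object: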
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> body = '''
... <html>
...   <head>
...     <base href='http://example.com/' />
...     <title>Example website</title>
...   </head>
...   <body>
...     <div id='images-1'>
...       <a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg' /></a>
...       <a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg' /></a>
...       <a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg' /></a>
...     </div>
...
...     <div id='images-2' class='small'>
...       <a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg' /></a>
...       <a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg' /></a>
...     </div>
...   </body>
... </html>
... '''
>>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
E: selects E elements.
# Select all img elements
>>> response.css('img')
[<Selector xpath='descendant-or-self::img' data='<img src="image1.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image2.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image3.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image4.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image5.jpg">'>]
E1,E2: selects both E1 and E2 elements.
# Select all base and title elements
>>> response.css('base,title')
[<Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<base href="http://example.com/">'>,
 <Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<title>Example website</title>'>]
E1 E2: selects E2 elements that are descendants of an E1 element.
# img elements that are descendants of a div
>>> response.css('div img')
[<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image1.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image2.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image3.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image4.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image5.jpg">'>]
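E1>E2: selects E2 elements that are direct children of an E1 element.
# div elements that are children of body
>>> response.css('body>div')
[<Selector xpath='descendant-or-self::body/div' data='<div id="images-1" ...'>,
 <Selector xpath='descendant-or-self::body/div' data='<div id="images-2" class="small">\n '>]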
[ATTR]: selects elements that have the ATTR attribute.
# Select elements that have a style attribute
# (the output assumes the first div carries a style attribute; the listing above shows none, in which case the result is an empty list)
>>> response.css('[style]')
[<Selector xpath='descendant-or-self::*[@style]' data='<div id="images-1" ...'>]
[ATTR=VALUE]: selects elements whose ATTR attribute has the value VALUE.
# Select elements whose id attribute is images-1
>>> response.css('[id=images-1]')
[<Selector xpath="descendant-or-self::*[@id = 'images-1']" data='<div id="images-1" ...'>]
E:nth-child(n): selects E elements that are the n-th child of their parent.
# Select the first a in each div
>>> response.css('div>a:nth-child(1)')
[<Selector xpath="descendant-or-self::div/*[name() = 'a' and (position() = 1)]" data='<a href="image1.html">Name: Image 1 <br>'>,
 <Selector xpath="descendant-or-self::div/*[name() = 'a' and (position() = 1)]" data='<a href="image4.html">Name: Image 4 <br>'>]
# Select the first a in the second div
>>> response.css('div:nth-child(2)>a:nth-child(1)')
[<Selector xpath="descendant-or-self::*/*[name() = 'div' and (position() = 2)]/*[name() = 'a' and (position() = 1)]" data='<a href="image4.html">Name: Image 4 <br>'>]
E:first-child: selects E elements that are the first child of their parent.
E:last-child: selects E elements that are the last child of their parent.
# Select the last a in the first div
>>> response.css('div:first-child>a:last-child')
[<Selector xpath="descendant-or-self::*/*[name() = 'div' and (position() = 1)]/*[name() = 'a' and (position() = last())]" data='<a href="image3.html">Name: Image 3 <br>'>]
E::text: selects the text nodes of E elements.
# Select the text of all a elements
>>> sel = response.css('a::text')
>>> sel
[<Selector xpath='descendant-or-self::a/text()' data='Name: Image 1 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 2 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 3 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 4 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 5 '>]
>>> sel.extract()
['Name: Image 1 ', 'Name: Image 2 ', 'Name: Image 3 ', 'Name: Image 4 ', 'Name: Image 5 ']
That concludes this introduction to CSS selectors. For more detail, see the CSS selectors specification: https://www.w3.org/TR/css3-selectors/.