I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over those websites, so I have to take what I'm given. They use a variety of casings for the tags and attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are so stubborn as to insist on full standards compliance when it rules out so much of the use of their library.
I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent), but this will not catch <meta name="Description" Content="..."> tags due to the capital D.
I'm currently using this as a workaround, but it's horrible!
for meta in doc.cssselect('meta'):
    name = meta.get('name')
    content = meta.get('content')
    if name and content:
        if name.lower() == 'keywords':
            keywords = content
        if name.lower() == 'description':
            description = content
It seems that the tag name meta is treated case-insensitively, but the attributes are not. It would be even more annoying if meta were case-sensitive too!
Values of attributes must be case-sensitive.
You can use an arbitrary regular expression to select an element.
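The regular-expression approach can be sketched roughly as follows. This is not the answer's original code; it assumes lxml's support for the EXSLT regular-expressions namespace in XPath, and the sample markup and the helper name meta_content are made up for illustration.

```python
from lxml import html

# EXSLT regular-expressions namespace, which lxml's XPath engine understands.
EXSLT_RE_NS = "http://exslt.org/regular-expressions"

# Hypothetical sample markup with mixed casing in the name attribute values.
SAMPLE = """
<html><head>
<meta name="Description" content="A sample page">
<meta name="KEYWORDS" content="lxml, xpath">
<meta name="author" content="someone">
</head><body></body></html>
"""


def meta_content(doc, name):
    """Return the content of the first <meta> whose name matches, ignoring case."""
    # re:test(string, pattern, flags): the "i" flag makes the match
    # case-insensitive. The pattern is passed in as an XPath variable.
    # `name` is used verbatim as a regex, so it should be a plain word.
    hits = doc.xpath(
        '//meta[re:test(@name, $pattern, "i")]/@content',
        namespaces={"re": EXSLT_RE_NS},
        pattern="^%s$" % name,
    )
    return hits[0] if hits else None


doc = html.fromstring(SAMPLE)
description = meta_content(doc, "description")
keywords = meta_content(doc, "keywords")
print(description)
print(keywords)
```

Returning None when nothing matches avoids the undefined-variable problem that the loop in the question has when a page lacks one of the tags.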
lxml is an XML parser. XML is case-sensitive. You are parsing HTML, so you should use an HTML parser. BeautifulSoup is very popular. Its only drawback is that it can be slow.
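To illustrate what a lenient HTML parser gives you, here is a minimal sketch. It uses Python's standard-library html.parser rather than BeautifulSoup so that it runs without extra packages; the class name MetaCollector and the sample markup are assumptions made for this example. The parser lowercases tag and attribute names itself, so only the value of name needs normalising.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs, keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased;
        # attribute values keep their original casing.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            # Keep the first occurrence of each name.
            self.meta.setdefault(name.lower(), content)


# Hypothetical markup with mixed casing in tags, attributes and values.
SAMPLE_HTML = (
    '<HTML><HEAD>'
    '<META Name="Description" Content="A sample page">'
    '<meta NAME="KeyWords" content="lxml, html">'
    '</HEAD><BODY></BODY></HTML>'
)

collector = MetaCollector()
collector.feed(SAMPLE_HTML)
collector.close()

description = collector.meta.get("description")
keywords = collector.meta.get("keywords")
print(description)
print(keywords)
```

With BeautifulSoup the idea would be the same: let the parser normalise tag and attribute names, then compare the lowercased name value.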
You can use XPath's translate() function. It translates the value of "name" to lowercase and then matches.
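A minimal sketch of the lowercase-then-match idea, assuming the XPath 1.0 translate() function as evaluated by lxml. The answer's original snippet is not shown here, so the helper name meta_content_ci and the sample markup are hypothetical.

```python
from lxml import html

# XPath 1.0 has no lower-case() function, so translate() maps each
# uppercase ASCII letter to its lowercase counterpart instead.
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical sample markup.
SAMPLE = """
<html><head>
<meta name="Description" content="A sample page">
<meta name="KeyWords" content="lxml, xpath">
</head><body></body></html>
"""


def meta_content_ci(doc, name):
    """Return the content of the first <meta> whose lowercased name equals `name`."""
    hits = doc.xpath(
        "//meta[translate(@name, $upper, $lower) = $wanted]/@content",
        upper=UPPER,
        lower=LOWER,
        wanted=name.lower(),
    )
    return hits[0] if hits else None


doc = html.fromstring(SAMPLE)
description = meta_content_ci(doc, "description")
keywords = meta_content_ci(doc, "keywords")
print(description)
print(keywords)
```

Note that this only folds ASCII letters, which is normally enough for meta names such as description and keywords.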