Is it possible for lxml to work in a case-insensitive manner?

Posted on 2024-08-11 04:03:23

I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over those websites, so I have to take what I'm given. They use a variety of casings for the tag and attributes, which means I need to work case-insensitively. I can't believe the lxml authors are so stubborn as to insist on fully enforced standards compliance when it rules out so much of what their library could be used for.

I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent), but this will not catch <meta name="Description" Content="..."> tags due to the capital D.

I'm currently using this as a workaround, but it's horrible!

for meta in doc.cssselect('meta'):
    name = meta.get('name')
    content = meta.get('content')

    if name and content:
        if name.lower() == 'keywords':
            keywords = content
        if name.lower() == 'description':
            description = content

It seems that the tag name meta is treated case-insensitively, but the attributes are not. It would be even more annoying if meta were case-sensitive too!
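
The workaround above can be tidied slightly by collecting the values into a dict keyed by the lowercased name. This is only a sketch: collect_meta is a hypothetical helper name, and the demo parses a small well-formed snippet with the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml. The function only relies on .get(), which lxml elements also provide.

```python
import xml.etree.ElementTree as ET


def collect_meta(elements, wanted=("keywords", "description")):
    """Return {lowercased name: content} for the wanted <meta> names.

    Hypothetical helper; works with any element objects exposing .get().
    """
    found = {}
    for meta in elements:
        name = meta.get("name")
        content = meta.get("content")
        # Compare the attribute value case-insensitively by lowercasing it.
        if name and content and name.lower() in wanted:
            found[name.lower()] = content
    return found


# Stand-in document (assumption: well-formed markup, lowercase attribute names).
# With lxml, pass doc.cssselect('meta') instead of head.iter("meta").
head = ET.fromstring(
    "<head>"
    '<meta name="Description" content="An example page"/>'
    '<meta name="KEYWORDS" content="lxml, scraping"/>'
    '<meta name="author" content="nobody"/>'
    "</head>"
)
result = collect_meta(head.iter("meta"))
print(result)
```

Unlike the original loop, a missing tag simply leaves its key out of the dict instead of leaving keywords or description undefined.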

Comments (3)

相思故 2024-08-18 04:03:23

Values of attributes must be case-sensitive.

You can use an arbitrary regular expression to select an element:

#!/usr/bin/env python
from lxml import html

doc = html.fromstring('''
    <meta name="Description">
    <meta name="description">
    <META name="description">
    <meta NAME="description">
''')
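# re:test() comes from the EXSLT regular-expressions extension; the "i" flag
# makes the match case-insensitive.
for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                      namespaces={"re": "http://exslt.org/regular-expressions"}):
    print(html.tostring(meta, pretty_print=True, encoding="unicode"), end="")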

Output:

<meta name="Description">
<meta name="description">
<meta name="description">
<meta name="description">

最冷一天 2024-08-18 04:03:23

lxml is an XML parser. XML is case-sensitive. You are parsing HTML, so you should use an HTML parser. BeautifulSoup is very popular. Its only drawback is that it can be slow.
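
For illustration only, here is what the "use an HTML parser" route might look like. Note that this sketch swaps in the standard library's html.parser rather than BeautifulSoup, purely to stay dependency-free. HTMLParser hands tag and attribute names to handle_starttag already lowercased, so only the attribute value needs lowering by hand. MetaCollector is a made-up class name.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs, keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values keep their case.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and content:
            self.meta[name.lower()] = content


collector = MetaCollector()
collector.feed(
    '<META NAME="Description" Content="An example page">'
    '<meta name="KEYWORDS" content="lxml, scraping">'
)
collector.close()
print(collector.meta)
```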

独孤求败 2024-08-18 04:03:23

You can use

doc.xpath("//meta[translate(@name, "
          "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='description']")

It translates the value of "name" to lowercase and then matches.
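
To avoid typing the two alphabets by hand, the expression can be generated. A minimal sketch, assuming ASCII-only attribute values: ci_attr_xpath is a hypothetical helper name (not part of lxml), and it only builds the XPath 1.0 string.

```python
import string


def ci_attr_xpath(tag, attr, value):
    """Build an XPath 1.0 expression matching <tag> elements whose
    attribute `attr` equals `value`, ignoring ASCII case.

    Hypothetical helper for illustration; not part of lxml.
    Assumes `value` contains no single quote.
    """
    if "'" in value:
        raise ValueError("value must not contain a single quote")
    # translate() maps each uppercase ASCII letter to its lowercase
    # counterpart, so the comparison is made against a lowercased value.
    return "//{tag}[translate(@{attr}, '{upper}', '{lower}')='{value}']".format(
        tag=tag,
        attr=attr,
        upper=string.ascii_uppercase,
        lower=string.ascii_lowercase,
        value=value.lower(),
    )


expr = ci_attr_xpath("meta", "name", "Description")
print(expr)
```

With lxml, the resulting string would then be passed to doc.xpath(expr).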

See also:
