lxml: the difference between attributes and style tags

Posted on 2024-08-06 06:13:33

I am trying to learn lxml after having used BeautifulSoup. However, I am not a strong programmer in general.

I have the following code in some source html:

<p style="font-family:times;text-align:justify"><font size="2"><b><i> The reasons to eat pickles include:  </i></b></font></p>

Because the text is bold, I want to extract it, but I can't work out how to detect that this particular line is bold.

When I started on this earlier this evening, I was working with a document that had the word bold in the style attribute, like the following:

<p style="font-style:italic;font-weight:bold;margin:0pt 0pt 6.0pt;text-indent:0pt;"><b><i><font size="2" face="Times New Roman" style="font-size:10.0pt;">The reason I like tomatoes include:</font></i></b></p>

I should add that the document I am working from is a fragment: I read the file in as lines, joined the lines together, and then passed the result to the html.fromstring function:

from lxml import html

txtFile = open(r'c:\myfile.htm', 'r').readlines()
strHTM = ''.join(txtFile)
newHTM = html.fromstring(strHTM)

So the first line of HTML I showed above is newHTM[19].

Hmm, this seems to be getting me closer:

newHTM.cssselect('b')

I don't fully understand it yet, but here is the solution:

for each in newHTM:
    if each.cssselect('b'):  # a non-empty result means the element contains a <b>
        print(each.text_content())
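
The two snippets in the question mark bold text in two different ways: with a `<b>` element, and with `font-weight:bold` in the `style` attribute. A minimal sketch of checking for both is shown here. The helper names `style_is_bold` and `bold_paragraphs` are hypothetical, and the third sample paragraph is a trimmed variant of the second snippet with its `<b>` tag removed so that only the style check can match it. This parses the inline `style` string naively and does not handle stylesheets or inherited styles.

```python
from lxml import html

# Sample markup based on the snippets in the question, wrapped in a <div>.
# The third <p> is a trimmed variant with no <b> tag (an assumption for illustration).
SAMPLE = """
<div>
<p style="font-family:times;text-align:justify"><font size="2"><b><i> The reasons to eat pickles include:  </i></b></font></p>
<p style="font-family:times">Plain paragraph, not bold.</p>
<p style="font-style:italic;font-weight:bold;margin:0pt 0pt 6.0pt;text-indent:0pt;"><font size="2">The reason I like tomatoes include:</font></p>
</div>
"""


def style_is_bold(element):
    """Return True if the element's inline style declares font-weight:bold."""
    style = element.get("style", "")
    for declaration in style.split(";"):
        name, _, value = declaration.partition(":")
        if name.strip().lower() == "font-weight" and value.strip().lower() == "bold":
            return True
    return False


def bold_paragraphs(root):
    """Collect the text of every <p> that is bold by tag or by inline style."""
    found = []
    for p in root.iter("p"):
        has_b_tag = p.find(".//b") is not None  # any <b> descendant
        if has_b_tag or style_is_bold(p):
            found.append(p.text_content().strip())
    return found


root = html.fromstring(SAMPLE)
bold_texts = bold_paragraphs(root)
for text in bold_texts:
    print(text)
```

The `style` attribute is just a string as far as lxml is concerned, so it has to be inspected with `element.get("style")`, whereas `<b>` is a real element in the tree and can be found with `find`, `findall`, `iter`, or `cssselect`.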

Comments (1)

我们的影子 2024-08-13 06:13:33

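You don't really need the CSS API for this. If you want to find all b elements, search the tree directly:

from lxml import html

with open(r'c:\myfile.htm', 'r') as f:
    strHTM = f.read()  # no need to split it into lines first
newHTM = html.fromstring(strHTM)  # note: fromstring is all lowercase
bElements = newHTM.findall('.//b')  # './/b' matches at any depth; plain 'b' matches only direct children
for b in bElements:
    print(b.text_content())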
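
One detail worth checking when using `findall`: a bare tag name such as `'b'` only matches direct children of the element it is called on, while `'.//b'` (or `iter('b')`) matches at any depth. The following is a small sketch on a made-up fragment, not the asker's actual file, that shows the difference.

```python
from lxml import html

# A made-up fragment: one <b> nested inside <p><font>, one <b> directly under the root <div>.
fragment = html.fromstring(
    '<div><p><font size="2"><b><i>nested bold</i></b></font></p>'
    '<b>direct child bold</b></div>'
)

# Bare tag name: direct children of the root only.
direct = [b.text_content() for b in fragment.findall("b")]

# Descendant path: every <b> below the root, in document order.
anywhere = [b.text_content() for b in fragment.findall(".//b")]

# iter() also walks the whole subtree (and would include the root itself if it were a <b>).
via_iter = [b.text_content() for b in fragment.iter("b")]

print(direct)
print(anywhere)
print(via_iter)
```

Since the bold text in the question sits inside `<p><font>`, the descendant form is the one that would reach it.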