Parsing HTML data with lxml

Published 2024-12-22 21:53:29

I'm a beginner in coding and a friend of mine told me to use BeautifulSoup instead of htmlparser. After running into some problems I got a tip to use lxml instead of BeautifulSoup because its performance is 10x better.

I'm hoping someone can give me a hint how to scrape the text I'm looking for.

What I want is to find a table with the following rows and data:

<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam2.com">spam2</a></td>
</tr>

How do I scrape the URLs together with info 1 and info 2, but without the spam, using lxml, to get the following result?

[['url', 'info1', 'info2'], ['url', 'info1', 'info2']]

Comments (3)

暖心男生 2024-12-29 21:53:29
import lxml.html as lh

# your_html is assumed to hold the HTML fragment from the question.
# fromstring() wraps the bare <tr> elements in a container element,
# so the rows can be selected with the relative path "tr".
tree = lh.fromstring(your_html)

result = []
for row in tree.xpath("tr"):
    # Keep only the first three cells; the fourth (spam) cell is dropped.
    url, info1, info2 = row.xpath("td")[:3]
    result.append([url.xpath("a")[0].attrib['href'],  # href of the link in cell 1
                   info1.text_content(),
                   info2.text_content()])

Result:

[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
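
A quick way to run this end to end, a minimal sketch assuming your_html simply holds the fragment from the question as a plain string:

import lxml.html as lh

# Hypothetical setup: the two rows from the question, inlined as a string.
your_html = """
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam2.com">spam2</a></td>
</tr>
"""

tree = lh.fromstring(your_html)
result = []
for row in tree.xpath("tr"):
    url, info1, info2 = row.xpath("td")[:3]
    result.append([url.xpath("a")[0].attrib['href'],
                   info1.text_content(),
                   info2.text_content()])
print(result)  # [['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
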
九八野马 2024-12-29 21:53:29

I use the XPath: td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()

$ python3
>>> import lxml.html
>>> doc = lxml.html.parse('data.xml')
>>> [[j for j in i.xpath('td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()')] for i in doc.xpath('//tr')]
[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
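
To reproduce the session above, a minimal sketch that first writes the question's rows to data.xml, the file name the session assumes:

import lxml.html

# Hypothetical setup: save the question's rows (inside a <table>) to the
# file that the interactive session reads.
fragment = """<table>
<tr><td><a href="website1.com">website1</a></td><td>info1</td>
    <td>info2</td><td><a href="spam1.com">spam1</a></td></tr>
<tr><td><a href="website2.com">website2</a></td><td>info1</td>
    <td>info2</td><td><a href="spam2.com">spam2</a></td></tr>
</table>"""
with open('data.xml', 'w') as f:
    f.write(fragment)

doc = lxml.html.parse('data.xml')
# The union XPath keeps non-spam hrefs plus the text of link-free cells.
print([row.xpath('td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()')
       for row in doc.xpath('//tr')])
# [['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
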
苏佲洛 2024-12-29 21:53:29
import lxml.html as LH

# content is assumed to hold the HTML fragment from the question.
doc = LH.fromstring(content)
# For each row, take the href of the link in the first cell plus the
# text of the second and third cells; the fourth (spam) cell is skipped.
print([tr.xpath('td[1]/a/@href | td[position()=2 or position()=3]/text()')
       for tr in doc.xpath('//tr')])

The long XPath has the following meaning:

td[1]                                   find the first <td>  
  /a                                    find the <a>
    /@href                              return its href attribute value
|                                       or
td[position()=2 or position()=3]        find the second or third <td>
  /text()                               return its text value
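
For a quick check, a self-contained sketch with the question's rows inlined (content is just the fragment as a string):

import lxml.html as LH

# Hypothetical setup: the question's rows, collapsed onto fewer lines.
content = ('<tr><td><a href="website1.com">website1</a></td>'
           '<td>info1</td><td>info2</td>'
           '<td><a href="spam1.com">spam1</a></td></tr>'
           '<tr><td><a href="website2.com">website2</a></td>'
           '<td>info1</td><td>info2</td>'
           '<td><a href="spam2.com">spam2</a></td></tr>')

doc = LH.fromstring(content)
print([tr.xpath('td[1]/a/@href | td[position()=2 or position()=3]/text()')
       for tr in doc.xpath('//tr')])
# Prints: [['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]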