Printing certain HTML with Python Mechanize

Posted on 2024-12-09 16:04:57


I'm making a small Python script for auto-logon to a website, but I'm stuck.

I'm looking to print to the terminal a small part of the HTML, located within this tag in the site's HTML file:

<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>

But how do I extract and print just the name, John Appleseed?

I'm using Python's Mechanize on a Mac, by the way.


3 Answers

暖风昔人 2024-12-16 16:04:57


Mechanize is only good for fetching the HTML. Once you want to extract information from the HTML, you could use, for example, BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
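
For the fetching/login step itself, a rough mechanize sketch might look like the following; the URL and form field names are placeholders, not taken from the question, and it assumes the login form is the first form on the page:

import mechanize

br = mechanize.Browser()
br.open("http://example.com/login.php")  # hypothetical login URL
br.select_form(nr=0)                     # assumes the login form is the first form
br["username"] = "john"                  # hypothetical field names
br["password"] = "secret"
response = br.submit()
html = response.read()                   # this is the html you've fetched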

Depending on where the <td> is located in the HTML (it's unclear from your question), you could use the following code:

# Python 2 with the old BeautifulSoup 3 module
from BeautifulSoup import BeautifulSoup

html = ...  # this is the html you've fetched

soup = BeautifulSoup(html)
# use this (gets all <td> elements)
cols = soup.findAll('td')
# or this (gets only <td> elements with class='h3')
cols = soup.findAll('td', attrs={"class": 'h3'})
print cols[0].renderContents()  # print content of first <td> element
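
On current Python 3 with BeautifulSoup 4 (bs4), a minimal equivalent sketch, using the snippet from the question as the input, would be:

from bs4 import BeautifulSoup  # BeautifulSoup 4 on Python 3

html = """<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td>"""
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", class_="h3")  # first <td> with class="h3"
print(cell.get_text(strip=True))     # -> John Appleseed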
很酷又爱笑 2024-12-16 16:04:57


As you have not provided the full HTML of the page, the only option right now is either using string.find() or regular expressions.

But the standard way of finding this is using XPath. See this question: How to use Xpath in Python?

You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.

For example, if you want to find the XPath for the username on the Stack Overflow site:

  • Open Firefox, log in to the website, right-click the username (shadyabhi in my case) and select "Inspect Element".
  • Hover your mouse over the tag, or right-click it and choose "Copy XPath".

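As a concrete illustration of the XPath route, here is a rough sketch using lxml on Python 3 (the table wrapper is an assumption added so the HTML parser keeps the <td> elements in place; it is not part of the question's snippet):

from lxml import html

# The question's cells, wrapped in a minimal (hypothetical) table.
snippet = """<table><tr>
<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td>
<td>&nbsp;<a href="members_myaccount.php">My Account</a></td>
</tr></table>"""

doc = html.fromstring(snippet)
# Take the text of the first <td> whose class attribute is "h3".
name = doc.xpath("//td[@class='h3']/text()")[0]
print(name.strip())  # -> John Appleseed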

还在原地等你 2024-12-16 16:04:57


You can use a parser to extract any information in a document. I suggest you use the lxml module.

Here is an example:

from lxml import etree
from StringIO import StringIO  # Python 2 location of StringIO

parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'>  John Appleseed</td><td> <a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>"""), parser)

>>> tree.xpath("string()").strip()
u'John Appleseed'

More information about lxml here
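
The snippet above is Python 2 (the StringIO module and the u'' literal). A minimal sketch of the same idea on Python 3, assuming lxml is installed:

from io import StringIO  # Python 3 home of StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'>  John Appleseed</td>"""), parser)
print(tree.xpath("string()").strip())  # -> John Appleseed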
