Python 3 HTML parser

Posted 2024-12-27 06:25:02

I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:

curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'

All I have in python3 so far is:

import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')

for lines in f.readlines():
    print(lines)

f.close()

Seriously, any suggestions? (Please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html, as I have been learning Python for 1 day and get easily confused.) A simple example would be amazing!

Comments (3)

梦魇绽荼蘼 2025-01-03 06:25:03

This is based on larsmans's answer.

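import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
    # the response yields bytes, so compare against a bytes literal
    if b'align="center">' in line:
        # print the line after the match, decoded and without its newline
        print(next(f).decode().rstrip())
f.close()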

Explanation:

for line in f iterates over the lines in the file-like object, f. Python lets you iterate over the lines in a file the same way you would over the items in a list.

if b'align="center">' in line looks for the byte sequence 'align="center">' in the current line. The b prefix makes this a bytes literal rather than a string. urllib.request.urlopen returns the response as binary data rather than unicode strings, and an unadorned 'align="center">' would be interpreted as a unicode string. (That was the source of the TypeError above.)

next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line. The decode method (bytes objects, like strings, have methods in Python) takes the binary data and converts it to a printable unicode string. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line).
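If you want to try the loop without hitting the network, here is a minimal sketch that swaps the HTTP response for an io.BytesIO over made-up markup. The sample HTML, the 203.0.113.7 address, and the helper name line_after are all assumptions for illustration, not what the real page returns.

```python
import io

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE = (
    b'<table>\n'
    b'<td align="center">\n'
    b'203.0.113.7\n'
    b'</td>\n'
    b'</table>\n'
)

def line_after(stream, marker):
    """Return the decoded line following the first line that contains marker."""
    for line in stream:
        if marker in line:
            # next() advances the same iterator the for loop is using
            return next(stream, b'').decode().rstrip()
    return None

result = line_after(io.BytesIO(SAMPLE), b'align="center">')
print(result)

# Searching bytes for a str pattern is what raises the TypeError.
try:
    'align="center">' in SAMPLE
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```

Because line_after accepts any binary file-like object, the same function should also work when handed the object returned by urllib.request.urlopen.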

愿与i 2025-01-03 06:25:03
# no need for .readlines here
for ln in f:
    # f yields bytes, so the pattern must be a bytes literal
    if b'align="center">' in ln:
        print(ln)

But be sure to read the Python tutorial.

凉墨 2025-01-03 06:25:03

I would probably use regular expressions to get the IP itself:

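import re
import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
# read() returns bytes; decode to str so it can be matched against a str pattern
html_text = f.read().decode()
f.close()
print(re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', html_text)[0])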

This prints the first string of the form: 1-3 digits, period, 1-3 digits, period, and so on.

I take it you were looking for the whole line; you could simply extend the pattern in the findall() call to take care of that (see the Python docs for re for more details).
By the way, the r in front of the pattern makes it a raw string, so you don't need to escape Python escape characters inside it (but you still need to escape regex special characters).
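As a rough sketch of both points, here is the same idea run against a made-up snippet instead of the live page. The sample text, the 203.0.113.7 address, and the names IP_RE and LINE_RE are assumptions for illustration; the real page layout may need a different pattern.

```python
import re

# Hypothetical page text, already decoded to str.
html_text = '<td align="center">\n203.0.113.7\n</td>\n'

# Raw string: the backslashes reach the regex engine untouched.
IP_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

first_ip = IP_RE.search(html_text)
print(first_ip.group(0) if first_ip else 'no match')

# Extended pattern: capture whatever is on the line after align="center">
LINE_RE = re.compile(r'align="center">[^\n]*\n([^\n]*)')
m = LINE_RE.search(html_text)
next_line = m.group(1).strip() if m else None
print(next_line)

# Without the r prefix, each backslash has to be doubled.
same_pattern = (r'\d{1,3}\.' == '\\d{1,3}\\.')
print(same_pattern)
```

Using search() rather than findall()[0] avoids an IndexError when nothing matches, which seems safer for a page whose layout can change.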

Hope that helps
