Parsing fixed-format data embedded in HTML in Python

Posted 2024-07-10 18:15:26

I am using Google's App Engine API

from google.appengine.api import urlfetch

to fetch a webpage. The result of

result = urlfetch.fetch("http://www.example.com/index.html")

is a string of the HTML content (in result.content). The problem is that the data I want to parse is not really in HTML form, so I don't think a Python HTML parser will work for me. I need to parse all of the plain text in the body of the HTML document. The only problem is that urlfetch returns the entire HTML document as a single string, with all newlines and extra spaces removed.

EDIT:
Okay, I tried fetching a different URL and apparently urlfetch does not strip the newlines; it was the original webpage I was trying to parse that served the HTML file that way...
END EDIT

If the document is something like this:

<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A       288        AAA
</body></html>

result.content will be this, after urlfetch fetches it:

'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA</body></html>'

Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data. But as you can see, the last part of one line gets combined with the first part of the next line, and I don't know how to split it. I tried

result.content.split('\n')

and

result.content.split('\r')

but in both cases the resulting list had just one element. I don't see any option in Google's urlfetch function to keep the newlines.

Any ideas how I can parse this data? Maybe I need to fetch it differently?

Thanks in advance!
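
The single-element result is consistent with the EDIT above: if the served page contains no line-break characters at all, there is nothing for split to split on. A minimal check, using the sample string from the question as a stand-in for result.content (the variable name `content` is only illustrative):

```python
# Sample string from the question; it stands in for result.content.
content = ('<html><head></head><body>AAA 123 888 2008-10-30 ABC'
           'BBB 987     2009-01-02 JSE...'
           'A4A     288            AAA</body></html>')

# str.split returns the whole string as a single element
# when the separator never occurs in it.
print(len(content.split('\n')))
print(len(content.split('\r')))

# Confirm directly that no line-break characters are present.
print('\n' in content or '\r' in content)
```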

眼泪淡了忧伤 2024-07-17 18:15:26

I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.

I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like

import re
# result.content holds the fetched page; [^<]* matches everything up to the next tag
data = re.findall(r'<body>([^<]*)</body>', result.content)[0]

then, it should be as easy as:

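start = 0
end = 5    # 5 is only a placeholder chunk size; use the real record width
while end < len(data):
    print(data[start:end])
    start = end    # continue exactly where the previous chunk ended
    end = end + 5
print(data[start:])    # whatever remains after the last full chunk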

(note: I did not check this code against boundary cases, and I do expect it to fail. It is only here to show the generic idea)
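
A slightly more defensive version of the extraction step might look like the sketch below. It is not tested against App Engine; it runs on the sample string from the question (with the "..." placeholder dropped), and the names `content` and `data` are illustrative.

```python
import re

# Sample from the question; in the App Engine case this would be result.content.
content = ('<html><head></head><body>AAA 123 888 2008-10-30 ABC'
           'BBB 987     2009-01-02 JSE'
           'A4A     288            AAA</body></html>')

# re.DOTALL lets "." match newlines in case the page does contain them,
# re.IGNORECASE tolerates <BODY>, and the non-greedy group stops at the
# first closing body tag.
match = re.search(r'<body[^>]*>(.*?)</body>', content,
                  re.DOTALL | re.IGNORECASE)

# Fall back to an empty string instead of raising when there is no body.
data = match.group(1) if match else ''
print(data)
```

Using re.search with an explicit check avoids the IndexError that `findall(...)[0]` raises when the page has no body element.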

浮华 2024-07-17 18:15:26

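The only suggestion I can think of is to parse it as if it had fixed-width columns. HTML does not take newlines into account.

If you have control of the source data, put it into a text file rather than HTML.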
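
As a rough sketch of the fixed-width idea: the column offsets below are inferred from the 26-character sample records in the question, and the field names are hypothetical, so both would need adjusting to the real data.

```python
# Hypothetical field names; offsets are inferred from the sample records.
FIELDS = [
    ('code', 0, 3),
    ('first', 4, 7),
    ('second', 8, 11),
    ('date', 12, 22),
    ('tag', 23, 26),
]

def parse_record(record):
    """Slice one fixed-width record into a dict; blank columns become None."""
    parsed = {}
    for name, start, end in FIELDS:
        value = record[start:end].strip()
        parsed[name] = value or None
    return parsed

print(parse_record('AAA 123 888 2008-10-30 ABC'))
```

Slicing by position keeps working when a column is blank, which splitting on whitespace would not.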

心房的律动 2024-07-17 18:15:26

Once you have the body text as a single, long string, you can break it up as follows.
This presumes that each record is 26 characters.

body= "AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA"
for i in range(0,len(body),26):
    line= body[i:i+26]
    # parse the line
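
One possible way to fill in the "parse the line" step is the standard struct module, which can describe a fixed-width layout in a single format string. The layout below (3, 3, 3, 10 and 3 character fields separated by single pad bytes) is an assumption read off the sample data, and `parse_line` is a hypothetical helper.

```python
import struct

# Assumed layout: fields of 3, 3, 3, 10 and 3 characters,
# each separated by one pad byte ("x"). Total size is 26 bytes.
RECORD = struct.Struct('3s x 3s x 3s x 10s x 3s')

def parse_line(line):
    """Unpack one 26-character record into a tuple of stripped strings."""
    raw = RECORD.unpack(line.encode('ascii'))
    return tuple(field.decode('ascii').strip() for field in raw)

body = ('AAA 123 888 2008-10-30 ABC'
        'BBB 987     2009-01-02 JSE'
        'A4A     288            AAA')

for i in range(0, len(body), RECORD.size):
    line = body[i:i + RECORD.size]
    if len(line) == RECORD.size:  # skip a trailing partial record
        print(parse_line(line))
```

Blank columns come back as empty strings, so the caller can decide how to treat missing values.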

叹沉浮 2024-07-17 18:15:26

EDIT: Reading comprehension is a desirable thing. I missed the bit about the lines being run together with no separator between them, which would kinda be the whole point of this, wouldn't it? So, never mind my answer, it's not actually relevant.

If you know that each line is 5 space-separated columns, then (once you've stripped out the html) you could do something like (untested):

def generate_lines(datastring):
    while datastring:
        splitresult = datastring.split(' ', 5)
        if len(splitresult) > 5:  # a sixth element exists only if text remains
            datastring = splitresult[5]
        else:
            datastring = None
        yield splitresult[:5]

for line in generate_lines(data):
    process_data_line(line)

Of course, you can change the split character and number of columns as needed (possibly even passing them into the generator function as additional parameters), and add error handling as appropriate.

书信已泛黄 2024-07-17 18:15:26

Further suggestions for splitting the string s into 26-character blocks:

As a list:

>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
 'BBB 987     2009-01-02 JSE',
 'A4A     288            AAA']

As a generator:

>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print line
AAA 123 888 2008-10-30 ABC
BBB 987     2009-01-02 JSE
A4A     288            AAA

Replace range() with xrange() in Python 2.x if s is very long.
