Parsing fixed-format data embedded in HTML in Python
I am using Google's App Engine API
from google.appengine.api import urlfetch
to fetch a webpage. The result of
result = urlfetch.fetch("http://www.example.com/index.html")
is a string of the html content (in result.content). The problem is the data that I want to parse is not really in HTML form, so I don't think using a python HTML parser will work for me. I need to parse all of the plain text in the body of the html document. The only problem is that urlfetch returns a single string of the entire HTML document, removing all newlines and extra spaces.
EDIT:
Okay, I tried fetching a different URL, and apparently urlfetch does not strip the newlines; it was the original webpage I was trying to parse that served the HTML file that way...
END EDIT
If the document is something like this:
<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A 288 AAA
</body></html>
result.content will be this, after urlfetch fetches it:
'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE...A4A 288 AAA</body></html>'
Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data, but as you can see, the last part of one line gets combined with the first part of the next line, and I don't know how to split it. I tried
result.content.split('\n')
and
result.content.split('\r')
but in both cases the resulting list had just 1 element. I don't see any option in Google's urlfetch function to keep the newlines.
Any ideas how I can parse this data? Maybe I need to fetch it differently?
Thanks in advance!
Comments (5)
I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.
I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like
then, it should be as easy as:
(note: I did not check this code against boundary cases, and I do expect it to fail. It is only here to show the generic idea)
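One way this idea could look, as a minimal sketch rather than the answer's own code. It assumes the fetched content is already a text string and that every record has the shape shown in the question (three 3-character fields, a date, and a three-letter uppercase code). The names `BODY_RE`, `RECORD_RE`, and `parse_records` are illustrative only.

```python
import re

# Hypothetical sample standing in for result.content (decoded to text).
html = ("<html><head></head><body>AAA 123 888 2008-10-30 ABC"
        "BBB 987 332 2009-01-02 JSE</body></html>")

# Grab everything between the BODY tags (case-insensitive, may span lines).
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)

# Assumed record layout: three 3-character fields, a date, a 3-letter code.
RECORD_RE = re.compile(
    r"(\w{3}) (\d{3}) (\d{3}) (\d{4}-\d{2}-\d{2}) ([A-Z]{3})"
)

def parse_records(document):
    """Return a list of 5-tuples, one per record found in the body."""
    match = BODY_RE.search(document)
    if match is None:
        return []
    return RECORD_RE.findall(match.group(1))

records = parse_records(html)
for record in records:
    print(record)
```

Because the pattern describes a whole record, it does not need a separator between records. It would skip any record that deviates from the assumed layout, so real data may need a looser pattern.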
The only suggestion I can think of is to parse it as if it has fixed-width columns. HTML does not treat newlines as significant.
If you have control of the source data, put it into a text file rather than HTML.
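As a rough illustration of the fixed-width idea, the sketch below slices one record by character position. The column boundaries are inferred from the sample line in the question, and the field names in `FIELDS` are hypothetical, since the question does not name them.

```python
# Assumed column layout for one 26-character record, as (name, start, end).
# The field names are hypothetical; the question does not name them.
FIELDS = [
    ("code", 0, 3),
    ("first", 4, 7),
    ("second", 8, 11),
    ("date", 12, 22),
    ("suffix", 23, 26),
]

def parse_fixed_width(record):
    """Slice one fixed-width record into a dict keyed by field name."""
    return {name: record[start:end] for name, start, end in FIELDS}

row = parse_fixed_width("AAA 123 888 2008-10-30 ABC")
print(row["date"], row["suffix"])
```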
Once you have the body text as a single, long string, you can break it up as follows.
This presumes that each record is 26 characters.
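A possible sketch of that splitting step, assuming the body text has already been extracted and that every record really is 26 characters long. The helper name `split_records` and the length check are additions for illustration, not part of the answer.

```python
RECORD_LENGTH = 26  # taken from the answer; adjust if the real layout differs

def split_records(body, width=RECORD_LENGTH):
    """Cut body into consecutive width-character records.

    Raises ValueError when the text does not divide evenly, which usually
    means the fixed-width assumption is wrong for this data.
    """
    if len(body) % width != 0:
        raise ValueError("body length %d is not a multiple of %d"
                         % (len(body), width))
    records = []
    start = 0
    while start < len(body):
        records.append(body[start:start + width])
        start += width
    return records

body = "AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE"
records = split_records(body)
print(records)
```

Failing loudly on a ragged tail is a design choice here: a short final record would otherwise pass through unnoticed.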
EDIT: Reading comprehension is a desirable thing. I missed the bit about the lines being run together with no separator between them, which would kinda be the whole point of this, wouldn't it? So, never mind my answer; it's not actually relevant.
If you know that each line is 5 space-separated columns, then (once you've stripped out the html) you could do something like (untested):
Of course, you can change the split character and number of columns as needed (possibly even passing them into the generator function as additional parameters), and add error handling as appropriate.
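A minimal sketch of such a generator, with the split character and column count passed as parameters. The name `iter_rows` is hypothetical. As the EDIT above notes, this only works when records are separated from each other by the same character, which is not the case for the run-together data in the question.

```python
def iter_rows(text, sep=" ", ncols=5):
    """Yield lists of ncols fields from sep-separated text.

    Assumes records are themselves separated by sep; a trailing partial
    record is silently dropped.
    """
    # Drop empty strings produced by repeated separators.
    fields = [f for f in text.split(sep) if f]
    for i in range(0, len(fields) - ncols + 1, ncols):
        yield fields[i:i + ncols]

# Sample input with a space between the two records (an assumption).
rows = list(iter_rows("AAA 123 888 2008-10-30 ABC BBB 987 332 2009-01-02 JSE"))
for row in rows:
    print(row)
```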
Further suggestions for splitting the string s into 26-character blocks:
As a list:
As a generator:
Replace range() with xrange() in Python 2.x if s is very long.
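Both forms might look like the sketch below, assuming `s` holds the run-together body text. The sample value of `s` is made up from the question's example, and the code uses Python 3's `range()`.

```python
s = "AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE"

# As a list: every block is built up front.
blocks = [s[i:i + 26] for i in range(0, len(s), 26)]

# As a generator: blocks are produced lazily, one per iteration.
lazy_blocks = (s[i:i + 26] for i in range(0, len(s), 26))
first = next(lazy_blocks)

print(blocks)
print(first)
```

The generator form avoids holding every block in memory at once, which matters only when `s` is large.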