Extracting strings in Python
Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD"... from a text file.
...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
....(more text).....
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
......(more text).....
I want something like:
data = foo("file.txt")
and get:
data = ['AAA','BBB','CCC','DDD']
What's the best way to do this? My file isn't large...
Comments (5)
You could write a regex, but it would be "parsing" the HTML to some extent. The problem with writing regular expressions for HTML is that HTML is a mess. It's rarely perfect, and this causes problems when you rely on it for data.
I would personally use BeautifulSoup. It does more than you're asking for, but at a fraction of the effort.
You want BeautifulSoup:
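The snippet that originally followed is not preserved in this copy, so here is a minimal sketch of what a BeautifulSoup-based version could look like. It assumes the `bs4` package (BeautifulSoup 4) is installed; the names `extract_textfont` and `foo` are hypothetical, and `foo` only mirrors the call shown in the question.

```python
from bs4 import BeautifulSoup


def extract_textfont(html_text):
    """Return the text of every <font class='textfont'> element."""
    # "html.parser" is the parser bundled with Python's standard library.
    soup = BeautifulSoup(html_text, "html.parser")
    fonts = soup.find_all("font", class_="textfont")
    return [font.get_text(strip=True) for font in fonts]


def foo(path, encoding="utf-8"):
    # Hypothetical wrapper matching the question's foo("file.txt").
    # The encoding is an assumption; adjust it to the actual file.
    with open(path, encoding=encoding) as f:
        return extract_textfont(f.read())
```

Because the parser tolerates the surrounding non-HTML text, the whole file can be passed in unchanged.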
In a case like this you can attempt a regex for it (which will be really hard), use a prewritten library, or do it yourself with
f = open() f.read()
and your own parser.
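For the regex route mentioned in this answer, a narrow pattern can work when the input is as regular as the sample in the question. This is only a sketch under that assumption: the pattern, `extract_from_text`, and `foo` are illustrative names, and the approach will break if the markup varies (extra attributes, nested tags), which is the caveat the other answers raise.

```python
import re

# Captures the text between <font class='textfont'> and </font>.
# Accepts single or double quotes around the class value.
FONT_TEXT = re.compile(
    r"<font\s+class=['\"]textfont['\"]\s*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)


def extract_from_text(text):
    """Return the captured strings with surrounding whitespace removed."""
    return [match.strip() for match in FONT_TEXT.findall(text)]


def foo(path, encoding="utf-8"):
    # Reads the whole file at once, which is fine for a small file.
    # The encoding is an assumption; adjust it to the actual file.
    with open(path, encoding=encoding) as f:
        return extract_from_text(f.read())
```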
If you just want to get the data from inside all of the tags in the HTML document, while dropping all the tags themselves, you could do something like this:
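The code that accompanied this answer is missing from this copy. As a stand-in, here is a minimal sketch of the tag-stripping idea using `html.parser` from the standard library; this is a swapped-in technique, not necessarily what the answer used, and `TextCollector` and `strip_tags` are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-empty run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text outside of tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html_text):
    """Return the text content of html_text with all tags dropped."""
    parser = TextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.chunks
```

Note that this keeps all text, so any filler between the table cells would be returned as well; it fits best when the input contains only the tagged lines.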
If your selection criteria are more complex, though, and/or if the input is malformed, you'd probably be better off with a library like lxml.
You do NOT want to use regular expressions to "parse" html. See here.