Extracting strings in Python

Posted on 2024-08-26 06:46:35

Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD" from a text file...

...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
....(more text).....
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
......(more text).....

I want something such that if I call:

data = foo("file.txt")

I get:

data = ['AAA','BBB','CCC','DDD']

What is the best way to do this? My file is not big.

Basically, I want to extract the "remaining upload data transfer" value from this file, which in HTML looks like this.

Comments (5)

-残月青衣踏尘吟 2024-09-02 06:46:35

You could write a regex, but it would be "parsing" the HTML to some extent. The problem with writing regular expressions for HTML is that HTML is a mess: it is rarely well-formed, and that causes problems when you rely on it for data.

I would personally use BeautifulSoup. It does more than you're asking for, but at a fraction of the effort.

薆情海 2024-09-02 06:46:35

You want BeautifulSoup:

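from bs4 import BeautifulSoup  # BeautifulSoup 4; the older BeautifulSoup 3 import was `from BeautifulSoup import BeautifulSoup`

with open("file.txt") as your_file:
    soup = BeautifulSoup(your_file, "html.parser")

# find() returns only the first match; find_all() returns every <font class="textfont"> tag
data = [tag.get_text() for tag in soup.find_all("font", "textfont")]

究竟谁懂我的在乎 2024-09-02 06:46:35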
def foo(path="myfile.txt"):
    # Read the whole file at once; the question says the file is small
    with open(path, 'r') as input_file:
        text = input_file.read()

    # This only works when the strings to look for are known in advance
    looking_for = ['AAA', 'BBB', 'CCC', 'DDD']
    return [thing for thing in looking_for if thing in text]

兔小萌 2024-09-02 06:46:35

In a case like this you can attempt a regex for it (which will be really hard), use a prewritten library, or do it yourself with f = open(...), f.read(), and your own parser.
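
For reference, the regex option might look roughly like the sketch below. It assumes Python 3 and that every target cell is written exactly like the sample rows in the question. The names FONT_TEXT and extract_with_regex are made up for illustration. A pattern like this breaks as soon as the markup changes (extra attributes, nested tags), so treat it as a quick hack, not a parser.

```python
import re

# Assumption: each value sits in <font class='textfont'>...</font>, as in the question's sample.
FONT_TEXT = re.compile(
    r"<font\s+class=['\"]textfont['\"]\s*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)

def extract_with_regex(html):
    """Return the stripped text of every <font class='textfont'> element found."""
    return [match.strip() for match in FONT_TEXT.findall(html)]

sample = """
...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
"""

print(extract_with_regex(sample))
```

On the sample above this prints ['AAA', 'BBB'].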

゛时过境迁 2024-09-02 06:46:35

If you just want to get the data from inside all of the tags in the HTML document, while dropping all the tags themselves, you could do something like this:

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class DataOnlyParser(HTMLParser):
    def parse(self, text):
        # Feed the whole document and return every text chunk that was collected
        self.result = []
        self.feed(text)
        self.close()
        return self.result

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs
        data = data.strip()
        if data:
            self.result.append(data)

p = DataOnlyParser()

data = """
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
<TD align="left" class=texttd><font class='textfont'>CCC</font></TD>
<TD align="left" class=texttd><font class='textfont'>DDD</font></TD>
"""

print(p.parse(data))
# ['AAA', 'BBB', 'CCC', 'DDD']

If your selection criteria are more complex, though, and/or if the input is malformed, you'd probably be better off with a library like lxml.
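
As a middle ground before reaching for a third-party library, a slightly stricter selection can still be sketched with the standard library alone. The version below assumes Python 3 and keeps only text inside <font> tags whose class includes textfont, so the surrounding "other text" in the real file is ignored. TextFontParser and extract_textfont are hypothetical names, and the utf-8 encoding in foo is an assumption about the file.

```python
from html.parser import HTMLParser

class TextFontParser(HTMLParser):
    """Collect text that appears inside <font class="textfont"> elements only."""

    def __init__(self):
        super().__init__()
        self.result = []
        self._inside = False  # True while we are between <font class="textfont"> and </font>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; a valueless attribute has value None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "font" and "textfont" in classes:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "font":
            self._inside = False

    def handle_data(self, data):
        data = data.strip()
        if self._inside and data:
            self.result.append(data)

def extract_textfont(html):
    """Return the text of every <font class="textfont"> element in an HTML string."""
    parser = TextFontParser()
    parser.feed(html)
    parser.close()
    return parser.result

def foo(path):
    # Assumption: the file is small and utf-8 encoded, so reading it whole is fine.
    with open(path, encoding="utf-8") as f:
        return extract_textfont(f.read())

sample = """
...... (other text goes here).....
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
..... (useless text here).....
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
"""

print(extract_textfont(sample))
```

With the sample above this prints ['AAA', 'BBB'], and foo("file.txt") would apply the same filter to a file on disk. Nested <font> tags are not handled, which is one more reason to switch to lxml or BeautifulSoup once the markup gets messy.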

You do NOT want to use regular expressions to "parse" HTML. See here.
