Parsing items from a text file

Posted on 2024-09-05 15:01:05

I have a text file that includes data inside {[]} tags. What would be the suggested way to parse that data so I can just use the data inside the tags?

Example text file would look like this:

'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

I would like to end up with 'really', 'way', 'get', 'from' in a list. I guess I could use split to do it, but it seems like there might be a better way out there. I have seen a ton of parsing libraries; is there one that would be perfect for what I want to do?

Comments (4)

薄荷梦 2024-09-12 15:01:05

I would use regular expressions. This answer assumes that none of the tag characters {}[] appear within other tag characters.

import re
text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

for s in re.findall(r'\{\[(.*?)\]\}', text):
    print(s)

Using verbose mode in Python regular expressions:

re.findall(r'''
    \{   # opening curly brace
    \[   # followed by an opening square bracket
    (    # capture the next pattern
    .*?  # followed by shortest possible sequence of anything
    )    # end of capture
    \]   # followed by closing square bracket
    \}   # followed by a closing curly brace
    ''', text, re.VERBOSE)
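
Since the question is about a text file rather than a string literal, here is a minimal sketch of applying the same pattern to a file's contents. The names `TAG_PATTERN`, `extract_tags` and `extract_tags_from_file` are hypothetical, and `re.DOTALL` is an added assumption so that a tag's content may span a line break.

```python
import re

# Same pattern as above, compiled once. re.DOTALL is an assumption added here
# so that '.' also matches newlines inside a tag.
TAG_PATTERN = re.compile(r'\{\[(.*?)\]\}', re.DOTALL)


def extract_tags(text):
    """Return the contents of every {[...]} tag in text, in order."""
    return TAG_PATTERN.findall(text)


def extract_tags_from_file(path, encoding='utf-8'):
    """Read the whole file at path (hypothetical) and extract its tags."""
    with open(path, encoding=encoding) as handle:
        return extract_tags(handle.read())
```

Reading the whole file at once is fine for small inputs; for very large files it may be worth processing line by line, at the cost of missing tags that span lines.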

唯憾梦倾城 2024-09-12 15:01:05

This is a job for regex:

>>> import re
>>> text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> re.findall(r'\{\[(\w+)\]\}', text)
['really', 'way', 'get', 'from']
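
One thing to keep in mind with `\w+`: it matches only letters, digits and underscores, so a tag containing spaces or punctuation is skipped entirely, whereas a non-greedy `.*?` capture would return it. A small sketch of the difference, using a made-up sample string:

```python
import re

# Made-up sample: one plain word, one tag with a space, one with punctuation.
sample = 'keep {[this]} but note {[two words]} and {[semi-colon;]}'

word_only = re.findall(r'\{\[(\w+)\]\}', sample)   # word characters only
anything = re.findall(r'\{\[(.*?)\]\}', sample)    # shortest run of anything

print(word_only)  # ['this']
print(anything)   # ['this', 'two words', 'semi-colon;']
```

Which one is preferable depends on whether the tags are guaranteed to hold single words.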

〆凄凉。 2024-09-12 15:01:05

slower, bigger, no regular expressions

the old school way :P

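def f(s):
    result = []
    stack = []  # brackets that are currently open
    tmp = ''
    for c in s:
        if c in '{[':
            stack.append(c)
        elif c in ']}':
            stack.pop()
            if c == ']':
                # a closing ']' ends the current item
                result.append(tmp)
                tmp = ''
        elif stack and stack[-1] == '[':
            # only collect characters while inside '[...]'
            tmp += c
    return result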

>>> s
'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> f(s)
['really', 'way', 'get', 'from']

千仐 2024-09-12 15:01:05

Another way

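def between_strings(source, start='{[', end=']}'):
    words = []
    while True:
        start_index = source.find(start)
        if start_index == -1:
            break
        # search for the closing marker only after the opening one
        end_index = source.find(end, start_index + len(start))
        if end_index == -1:
            break  # unterminated tag: stop instead of slicing with -1
        words.append(source[start_index+len(start):end_index])
        source = source[end_index+len(end):]
    return words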


text = "this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it."
assert between_strings(text) == ['really', 'way', 'get', 'from']
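
For comparison, the same configurable-delimiter idea can be sketched with a regular expression. This assumes `re.escape` is used to neutralise any special characters in the markers; the name `between_markers` is hypothetical.

```python
import re


def between_markers(source, start='{[', end=']}'):
    """Return the text found between each start/end marker pair."""
    # re.escape makes the markers safe to embed in a pattern, since
    # characters such as '{' and '[' are special in regular expressions.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, source)


text = "this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it."
assert between_markers(text) == ['really', 'way', 'get', 'from']
```

Without `re.DOTALL`, this version does not match content that spans a line break, which is one behavioural difference from the `find`-based loop.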