How to parse marked up text for further processing?

Posted 2024-07-26 10:59:02

See updated input and output data at Edit-1.

What I am trying to accomplish is turning

+ 1
 + 1.1
  + 1.1.1
   - 1.1.1.1
   - 1.1.1.2
 + 1.2
  - 1.2.1
  - 1.2.2
 - 1.3
+ 2
- 3

into a Python data structure such as

[{'1': [{'1.1': {'1.1.1': ['1.1.1.1', '1.1.1.2']}, '1.2': ['1.2.1', '1.2.2']}, '1.3'], '2': {}}, ['3',]]

I've looked at many different wiki markup languages, Markdown, reStructuredText, etc., but their implementations are far too complicated for me to follow, since they must cover a large number of tags and syntax rules. (I would only need the "list" parts of most of these, converted to Python instead of HTML of course.)

I've also taken a look at tokenizers, lexers and parsers, but again they are much more complicated than what I need and than what I can understand.

I have no idea where to begin and would appreciate any help possible on this subject. Thanks

Edit-1: Yes, the character at the beginning of the line matters. From the required output before and now, it can be seen that * denotes a root node with children, + has children, and - has no children (root or otherwise) and is just extra information pertaining to that node. The * is not important and can be interchanged with + (I can get root status in other ways).

Therefore the new requirement is to use only * to denote a node with or without children, while - cannot have children. I've also changed it so the key isn't the text after the *, since that will no doubt change later to an actual title.

For example

* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root

would give

[{'title': '1', 'children': [{'title': '1.1', 'children': []}, {'title': '1.2', 'children': []}]}, {'title': '2', 'children': [], 'notes': ['Note for 1.2', ]}, {'title': '3', 'children': []}, 'Note for root']

Or, if you have another idea for representing the outline in Python, then bring it forward.

又爬满兰若 2024-08-02 10:59:02

Edit: thanks to the clarification and change in the spec, I've edited my code, still using an explicit Node class as an intermediate step for clarity. The logic is to turn the list of lines into a list of nodes, then turn that list of nodes into a tree (by using their indent attribute appropriately), then print that tree in a readable form, and finally build the desired Python structure and print it. The tree printout is just a "debug-help" step to check that the tree is well constructed; it can of course be commented out in the final version of the script, which will equally of course read the lines from a file rather than having them hardcoded for debugging. As we'll see after the code, the result is almost what the OP specifies, with one exception. But first, the code:

import sys

class Node(object):
  def __init__(self, title, indent):
    self.title = title
    self.indent = indent
    self.children = []
    self.notes = []
    self.parent = None
  def __repr__(self):
    return 'Node(%s, %s, %r, %s)' % (
        self.indent, self.parent, self.title, self.notes)
  def aspython(self):
    result = dict(title=self.title, children=topython(self.children))
    if self.notes:
      result['notes'] = self.notes
    return result

def print_tree(node):
  print ' ' * node.indent, node.title
  for subnode in node.children:
    print_tree(subnode)
  for note in node.notes:
    print ' ' * node.indent, 'Note:', note

def topython(nodelist):
  return [node.aspython() for node in nodelist]

def lines_to_tree(lines):
  nodes = []
  for line in lines:
    indent = len(line) - len(line.lstrip())
    marker, body = line.strip().split(None, 1)
    if marker == '*':
      nodes.append(Node(body, indent))
    elif marker == '-':
      nodes[-1].notes.append(body)
    else:
      print>>sys.stderr, "Invalid marker %r" % marker

  tree = Node('', -1)
  curr = tree
  for node in nodes:
    while node.indent <= curr.indent:
      curr = curr.parent
    node.parent = curr
    curr.children.append(node)
    curr = node

  return tree


data = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
""".splitlines()

def main():
  tree = lines_to_tree(data)
  print_tree(tree)
  print
  alist = topython(tree.children)
  print alist

if __name__ == '__main__':
  main()

When run, this emits:

 1
  1.1
  1.2
  Note: 1.2
 2
 3
 Note: 3

[{'children': [{'children': [], 'title': '1.1'}, {'notes': ['Note for 1.2'], 'children': [], 'title': '1.2'}], 'title': '1'}, {'children': [], 'title': '2'}, {'notes': ['Note for root'], 'children': [], 'title': '3'}]

Apart from the ordering of keys (which is immaterial and not guaranteed in a dict, of course), this is almost as requested -- except that here all notes appear as dict entries with a key of notes and a value that's a list of strings (but the notes entry is omitted if the list would be empty, roughly as done in the example in the question).

In the current version of the question, how to represent the notes is slightly unclear; one note appears as a stand-alone string, others as entries whose value is a string (instead of a list of strings as I'm using). It's not clear what's supposed to imply that the note must appear as a stand-alone string in one case and as a dict entry in all others, so this scheme I'm using is more regular; and if a note (if any) is a single string rather than a list, would that mean it's an error if more than one note appears for a node? In the latter regard, this scheme I'm using is more general (lets a node have any number of notes from 0 up, instead of just 0 or 1 as apparently implied in the question).

Having written so much code (the pre-edit answer was about as long and helped clarify and change the specs) to provide (I hope) 99% of the desired solution, I hope this satisfies the original poster, since the last few tweaks to code and/or specs to make them match each other should be easy for him to do!
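
For readers on Python 3, the same three steps (lines to nodes, nodes to tree, tree to plain dicts and lists) might be sketched as below. This is an adaptation under assumptions rather than the code above verbatim: the names `parse_outline` and `to_python` are hypothetical, blank lines are skipped, and an invalid marker or a note that appears before any `*` line raises `ValueError` instead of being reported on stderr.

```python
class Node:
    """One '*' line of the outline, plus any '-' notes that follow it."""

    def __init__(self, title, indent):
        self.title = title
        self.indent = indent
        self.children = []
        self.notes = []
        self.parent = None


def parse_outline(lines):
    """Build a tree from outline lines and return a synthetic root Node."""
    # Step 1: turn lines into a flat list of nodes, attaching notes as we go.
    nodes = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines (an assumption of this sketch)
        indent = len(line) - len(line.lstrip())
        marker, _, body = line.strip().partition(' ')
        if marker == '*':
            nodes.append(Node(body.strip(), indent))
        elif marker == '-':
            if not nodes:
                raise ValueError('note before any node: %r' % line)
            nodes[-1].notes.append(body.strip())
        else:
            raise ValueError('invalid marker %r' % marker)

    # Step 2: link nodes into a tree by comparing indents.
    root = Node('', -1)  # indent -1 keeps the root above every real node
    curr = root
    for node in nodes:
        while node.indent <= curr.indent:
            curr = curr.parent
        node.parent = curr
        curr.children.append(node)
        curr = node
    return root


def to_python(nodes):
    """Step 3: convert a list of Node objects into plain dicts and lists."""
    result = []
    for node in nodes:
        item = {'title': node.title, 'children': to_python(node.children)}
        if node.notes:
            item['notes'] = list(node.notes)
        result.append(item)
    return result


outline = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
"""

result = to_python(parse_outline(outline.splitlines()).children)
print(result)
```

As in the code above, a note is attached to the most recent `*` node regardless of the note's own indentation, so "Note for root" ends up in the notes of node 3.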

不回头走下去 2024-08-02 10:59:02

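Since you're dealing with an outline, you can simplify things by using a stack. Basically, you want a stack of dicts whose height corresponds to the current depth of the outline. When you parse a new line and the depth of the outline has increased, you push a new dict onto the stack, and that dict is referenced by the previous dict at the top of the stack. When you parse a line with a lower depth, you pop the stack to get back to the parent. When you encounter a line with the same depth, you add it to the dict at the top of the stack.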
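
The push/pop idea described above could be sketched roughly as follows. The function name `outline_to_dicts` and the choice to store `(depth, node)` pairs on the stack are assumptions made for illustration; the sketch also assumes the `*` and `-` markers from Edit-1 and measures depth by counting leading spaces.

```python
def outline_to_dicts(text):
    """Parse an indented outline using a stack of (depth, node) pairs."""
    root = {'title': None, 'children': []}
    stack = [(-1, root)]  # depth -1 means the root is never popped
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        depth = len(line) - len(line.lstrip(' '))
        marker, _, content = stripped.partition(' ')
        if marker not in ('*', '-'):
            raise ValueError('invalid marker %r' % marker)
        # Pop until the top of the stack is shallower than this line;
        # whatever remains on top is the line's parent.
        while stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1]
        if marker == '*':
            node = {'title': content, 'children': []}
            parent['children'].append(node)
            stack.append((depth, node))  # deeper lines will attach here
        else:
            # Notes cannot have children, so they are never pushed.
            parent.setdefault('notes', []).append(content)
    return root


sample = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
"""

tree = outline_to_dicts(sample)
print(tree)
```

Storing the depth next to each node means the indentation does not have to grow by exactly one space per level; any deeper indent counts as a child of the node above it.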

无人接听 2024-08-02 10:59:02

Stacks are a really useful data structure when parsing trees. You just keep the path from the last added node up to the root on the stack at all times, so you can find the correct parent from the length of the indent (this assumes one space of indentation per level). Something like this should work for parsing your last example:

import re

# Leading spaces (the indent), a marker (* or -), one space, then the content.
line_tokens = re.compile(r'( *)(\*|-) (.*)')

def parse_tree(data):
    # stack[0] is a synthetic root; stack[-1] is the most recently added node.
    stack = [{'title': 'Root node', 'children': []}]
    for line in data.splitlines():
        match = line_tokens.match(line)
        if match is None:
            continue # Skip blank or malformed lines
        indent, symbol, content = match.groups()
        while len(indent) + 1 < len(stack):
            stack.pop() # Remove everything up to current parent
        if symbol == '-':
            stack[-1].setdefault('notes', []).append(content)
        elif symbol == '*':
            node = {'title': content, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node) # Add as the current deepest node
    return stack[0]
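
To see what the function returns for the Edit-1 example, one could run something like the following. The function is repeated in Python 3 form, skipping any line the regex does not match, so that the snippet runs on its own; the variable names `sample` and `tree` and the use of `json.dumps` for display are illustrative choices, not part of the answer.

```python
import json
import re

# Leading spaces (the indent), a marker (* or -), one space, then the content.
line_tokens = re.compile(r'( *)(\*|-) (.*)')


def parse_tree(data):
    stack = [{'title': 'Root node', 'children': []}]
    for line in data.splitlines():
        match = line_tokens.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        indent, symbol, content = match.groups()
        while len(indent) + 1 < len(stack):
            stack.pop()  # drop nodes deeper than the current parent
        if symbol == '-':
            stack[-1].setdefault('notes', []).append(content)
        elif symbol == '*':
            node = {'title': content, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node)  # this node is now the deepest one
    return stack[0]


sample = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
"""

tree = parse_tree(sample)
print(json.dumps(tree, indent=2))
```

Because a note is attached to whichever node sits at the top of the stack after popping, the unindented "Note for root" line lands in the `notes` list of the root dict rather than on node 3.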
