How to use a function to recursively traverse a txt or html file and return each individual character

Posted 01-15 22:33

I am trying to create an input stream for the tokenization stage of an HTML parser I am building. Here is some context:

The input stream consists of the characters pushed into it as the input byte stream is decoded.

Before the tokenization stage, the input stream must be preprocessed by normalizing newlines. Thus, newlines in HTML DOMs are represented by U+000A LF characters, and there are never any U+000D CR characters in the input to the tokenization stage.
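The newline preprocessing described above could be sketched as follows. This is a minimal illustration, not part of the original code; the function name `normalize_newlines` and the sample string are assumptions made for the example.

```python
def normalize_newlines(text: str) -> str:
    """Replace CRLF pairs and lone CR characters with a single LF."""
    # Order matters: handle CRLF first so it does not turn into two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical sample input mixing CRLF and a lone CR.
sample = "<!DOCTYPE html>\r\n<head>Hi</head>\r"
print(repr(normalize_newlines(sample)))
```

Note that Python's `open()` in text mode with the default `newline=None` already translates `\r\n` and `\r` to `\n` when reading, so an explicit step like this mainly matters when the file is opened with `newline=""` or the text comes from decoded bytes.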

The next input character is the first character in the input stream that has not yet been consumed. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.
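One possible way to model the "next input character" and "current input character" is a small class that keeps an index into the decoded text. The class name `InputStream` and its members are hypothetical, chosen only for this sketch.

```python
from typing import Optional


class InputStream:
    """Hypothetical helper tracking the current and next input characters."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0  # index of the next input character

    @property
    def next_char(self) -> Optional[str]:
        """First character not yet consumed, or None at end of input."""
        return self._text[self._pos] if self._pos < len(self._text) else None

    @property
    def current_char(self) -> Optional[str]:
        """Last character consumed, or None if nothing has been consumed."""
        return self._text[self._pos - 1] if self._pos > 0 else None

    def consume(self) -> Optional[str]:
        """Consume the next input character and return it (None at end of input)."""
        char = self.next_char
        if char is not None:
            self._pos += 1
        return char


stream = InputStream("<p>")
print(stream.next_char)  # initially the first character of the input
stream.consume()
print(stream.current_char, stream.next_char)
```

Keeping the position in the object, rather than in a loop inside a function, means the tokenizer can ask for one character at a time without the stream losing its place.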

My test.html file:

< !DOCTYPE html > on line 0

< head >Hi< /head > on line 1

My code:

with open('test.html', 'r') as f:
    file = f.readlines()
    file = [item.replace('\n', '\f') for item in file]
    file = [str(item) for item in file]

def input_stream():
    for line_no, line in enumerate(file):  # the whole line
        eof_no = len(file[line_no]) - 1

        for char_no, char in enumerate(line):  # each character in that line
            eof_no = len(file[line_no]) - 1
            if char_no == eof_no:
                eof = True
                return eof
            return char

def run():
    eof = False
    while eof == False:
        result = input_stream()
        if result == True:
            break
        else:
            return result
print(run())

def state_machine(input):  # Output of run() is to be passed in here
    pass  # Statements..

So far I believe I have managed to include everything apart from the recursive portion of it. I need to return one character at a time, pass it into the state_machine function for it to perform certain operations and eventually return token(s) - all until the end of the file.

I know that returning anything in a function will end/break out of any loops but I do not know how else to model it.

Recap: returning the result one character at a time does not work, because the first return ends the loop.
Any ideas?
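One common way to hand out one character at a time without losing the loop position is a generator, which uses `yield` instead of `return`. The following is a minimal sketch under assumptions: `state_machine` here is only a placeholder that records characters, the `tokens` list and the function signatures are invented for the example, and an in-memory `io.StringIO` stands in for test.html.

```python
import io
from typing import Iterable, Iterator, List


def input_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield characters one at a time from an iterable of lines."""
    for line in lines:
        for char in line:
            yield char  # execution pauses here until the next character is requested


def state_machine(char: str, tokens: List[str]) -> None:
    """Placeholder for the tokenizer step; it only records what it receives."""
    tokens.append(char)


def run(lines: Iterable[str]) -> List[str]:
    """Feed every character to state_machine until the input is exhausted."""
    tokens: List[str] = []
    for char in input_stream(lines):  # the loop stops at end of input
        state_machine(char, tokens)
    return tokens


# An open file is an iterable of lines, so open("test.html") could be passed
# instead of the in-memory stand-in used here.
demo = io.StringIO("<!DOCTYPE html>\n<head>Hi</head>\n")
print(run(demo)[:5])
```

With this shape no recursion and no explicit end-of-file flag are needed: the `for` loop in `run` ends on its own when the generator has no more characters to yield.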
