Parsing a string on a certain keyword to split on (outside of string literals), but not splitting inside string literals, in Python

Posted 2025-01-20 14:22:57

May I ask a question about a problem I've been running into these days? I would really appreciate it if you guys could help me solve it :)

So, I have this simple string that I want to parse on the '@' keyword (only splitting on '@' when it's outside of a string literal, not inside one). The reason behind this is that I'm trying to learn how to parse/split strings based on certain keywords, because I'm trying to implement my own 'simple programming language'...

Here's an example that I've made using regex (the spaces after the '@' keyword don't really matter):

import re

# Ignore the 'println(' thing, it's basically a builtin print statement that I made, so
# you can just focus on the string itself :)

# (?!\B"[^"]*)@(?![^"]*"\B)
# While looking up how to do this with regex, I found this pattern, which basically
# splits the string into elements on the '@' keyword, but doesn't split if the '@' is
# found inside a string literal. Here's what I mean:

# '"[email protected]"'     --- found '@' inside a string, so don't split on it
# '"[email protected]" @ x' --- found '@' outside a string, so after being parsed it would look like this:
# ['"[email protected]"', x]
print_args = re.split(r'(?!\B"[^"]*)@(?![^"]*"\B)', codes[x].split('println(')[-1].removesuffix(')\n' or ')'))
vars: list[str] = []
result_for_testing: list[str] = []
            
for arg in range(0, len(print_args)):
    # I don't know if this works, because it splits the string on every space; if
    # there are spaces inside a string literal, they get treated as spaces to
    # split on too, even though they shouldn't be split on, because those spaces
    # are inside a string literal, not outside of one.

    # Example 1: '"Hello, World!" @   x @     y' => ['"Hello, World!"', x, y]
    # Example 2: '"Hello,      World!      " @    x @   y' => ['"Hello,      World!      "', x, y]
    # At this point, the parsing shouldn't have to worry about the extra spaces inside a string literal, just like in example 2...
    compare: list[str] = print_args[arg].split()

    # This basically checks whether '"' is not in the piece that has been parsed (in
    # this case a word that doesn't contain '"'). Otherwise, append the whole thing,
    # i.e. the rest of the comparison elements joined back together.
    
    # Here's the string: '"Value of a is: " @ a @ "String"' [for example 1]
    # Example 1: ['"Value of a is: "', 'a', '"String"'] (This one is correct)

    # Here's the string: '"   Value of a is: " @ a @ "   String"'
    # Example 2: ['" Value of a is: " @ a @ " String"'] (This one is incorrect)
    vars.append(compare[0]) if '"' not in compare[0] else vars.append(" ".join(compare[0:]))
    
    for v in range(0, len(vars)):
        # This thing is just doing its job, appending the same elements in 'vars'
        # to the 'result_for_testing'
        result_for_testing.append(vars[v])

print(result_for_testing)

After these operations, the output I get for basic input without unnecessary spaces looks like this:

string_to_be_parsed: str = '"Value of a is: " @ a @ "String"'
Output > ['"Value of a is: "', 'a', '"String"'] # As what I'm expected to be...

But somehow it breaks with something like this (with unnecessary spaces):

string_to_be_parsed: str = '"   Value    of  a  is:     "    @     a   @  "   String  "'
Output > ['" Value of a is: " @ a @ " String "']
# Incorrect result and I'm hoping the result will be like this:

Expected Output > ["   Value    of  a  is:     ", a, "   String  "]
# If there are spaces inside a string, it just has to be ignored, but I don't know how to do it
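
For reference, here is a minimal, self-contained reproduction of just the split step, with the println( handling and codes[x] left out (the names pattern, ok and broken are only used in this snippet):

import re

# The same regex as above, applied directly to the two example strings, so the
# split step can be tested on its own.
pattern = r'(?!\B"[^"]*)@(?![^"]*"\B)'

ok = '"Value of a is: " @ a @ "String"'
broken = '"   Value    of  a  is:     "    @     a   @  "   String  "'

print(re.split(pattern, ok))      # splits on both '@', as described above
print(re.split(pattern, broken))  # stays in one piece, which is where the broken output comes from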

Alright guys, those are the problems I've encountered, and in conclusion:

  1. How do I split the string on the '@' keyword, but not split when the '@' is found inside a string literal?
Example: '"@ in a string inside a string" @ is_out_from_a_string'
The result should be: ['"@ in a string inside a string"', is_out_from_a_string]
  2. While parsing, how do I ignore (i.e., not split on) all the spaces inside a string literal?
Example: '"    unnecessary      spaces  here      too" @ x @ y @ z "   this   one     too"'
The result should be: ['"    unnecessary      spaces  here      too"', x, y, z, '"   this   one     too"']

Once again, I would really appreciate your help in finding solutions to these problems, and if I've done something wrong or have a misconception, please tell me where and how I should fix it :)

Thank you :)

Comments (1)

感性 answered 2025-01-27 14:22:57

When talking about programming languages, string.split() and nested loops aren't going to be enough. Programming languages usually split this work into two steps: the tokenizer (or lexer) and the parser. The tokenizer takes the input string (code in your language) and returns a list of tokens that represent keywords, identifiers, etc. In your code, this is each element in the result.

Either way, you're probably going to want to restructure your code a bit. For a tokenizer, here's some python-ish pseudocode:

yourcode = input
tokens = []
cursor = 0
while cursor < len(yourcode):
    match a token regex from the list of regexes at position cursor
    if the match is a token:
        add a token of the matched type to tokens
        cursor += len(matched string)
    elif the match is whitespace:
        cursor += len(matched whitespace)
    else:
        throw an error: invalid token

This uses a cursor to advance through your input string and extract tokens, as a direct answer to your question. For the list of regexes, simply use a list of pairs, where each pair includes a regex and a string describing the token type.
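
To make that concrete, here is a minimal runnable sketch of the same idea in Python (the token names STRING, AT, IDENT and the tokenize function are just placeholders for this example; adjust the spec to whatever your language actually needs):

import re

# Hypothetical token spec: (token type, compiled regex) pairs, tried in order.
# A quoted string is matched as a single token, so any '@' or spaces inside
# the quotes never get a chance to be split on.
TOKEN_SPEC = [
    ("STRING", re.compile(r'"[^"]*"')),      # string literal, inner spaces and '@' included
    ("AT",     re.compile(r'@')),            # the '@' separator
    ("IDENT",  re.compile(r'[A-Za-z_]\w*')), # identifiers such as variable names
    ("WS",     re.compile(r'\s+')),          # whitespace between tokens, skipped below
]

def tokenize(code):
    """Scan `code` left to right and return a list of (type, text) pairs."""
    tokens = []
    cursor = 0
    while cursor < len(code):
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(code, cursor)
            if match:
                if kind != "WS":          # whitespace is consumed but not kept
                    tokens.append((kind, match.group()))
                cursor = match.end()      # advance past the matched text
                break
        else:
            raise SyntaxError(f"invalid token at position {cursor}: {code[cursor]!r}")
    return tokens

print(tokenize('"   Value    of  a  is:     "    @     a   @  "   String  "'))
# [('STRING', '"   Value    of  a  is:     "'), ('AT', '@'),
#  ('IDENT', 'a'), ('AT', '@'), ('STRING', '"   String  "')]

The spaces inside the string literal survive untouched because the STRING pattern consumes the whole quoted region in a single match, which is exactly the behaviour asked about in the question.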

However, for a first programming-language project, building a tokenizer and parser by hand is probably not the way to go, as it can get extremely complex very quickly, though it is a great learning experience once you're comfortable with the basics. I would consider looking at a parser generator. I have used one called SLY with Python, as well as PLY (SLY's predecessor), with good results. Parser generators take a grammar, a description of your language in a specific format, and output a program that can parse your language, so that you can worry about the functionality of the language itself more than about how you parse the text/code input.

It may also be worth doing some more research before beginning your implementation. Specifically, I would recommend reading about abstract syntax trees (ASTs) and parsing algorithms, in particular recursive descent, which is what you would be writing if you built a parser by hand, and LALR(1) (look-ahead, left-to-right), which is what SLY generates.

ASTs are the output of a parser (what the parser generator builds for you) and are used to interpret or compile your language. They are fundamental to the construction of programming languages, so I would start there. This video explains syntax trees, and there are many Python-specific videos on parsing as well. This series also covers using SLY to create a simple language in Python.

EDIT: Regarding the specific parsing of the @ sign before a string, I would recommend using one token type for the @ sign and another for your string literal. In your parser, you can check whether the next token is a string literal when the parser encounters an @ symbol. This will decrease complexity by splitting up your regexes, and it also lets you reuse the tokens if you implement functionality that also uses @ or string literals in the future.
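
As a quick illustration with the sketch above (again, the token names are only placeholders), an '@' inside the quotes never becomes a separate AT token, because the STRING pattern consumes the whole literal first; the parser then only has to look at what kind of token follows each AT:

print(tokenize('"@ in a string inside a string" @ is_out_from_a_string'))
# [('STRING', '"@ in a string inside a string"'), ('AT', '@'),
#  ('IDENT', 'is_out_from_a_string')]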
