Parse a string by splitting on a keyword outside string literals, but not inside them, in Python
May I ask a question about a problem I've been running into lately? I would really appreciate it if you could help me solve it :)
So, I have a simple string that I want to parse using the '@' keyword (splitting only where '@' is outside a string literal within that string). The reason is that I'm trying to learn how to parse/split strings on certain keywords, because I'm trying to implement my own 'simple programming language'...
Here's an example that I've made using regex (spaces after the '@' keyword don't really matter):
    import re

    # Ignore the 'println(' thing, it's basically a builtin print statement that I made, so
    # you can focus only on the string itself :)
    # (?!\B"[^"]*)@(?![^"]*"\B)
    # While looking up how to do this with regex, I found this one, which basically
    # splits the string into elements by the '@' keyword, but doesn't split if '@' is found
    # inside a string. Here's what I mean:
    # '"[email protected]"' --- found '@' inside a string, so don't split it
    # '"[email protected]" @ x' --- found '@' outside a string, so after being parsed it would look like this:
    # ['"[email protected]"', x]
    print_args = re.split(r'(?!\B"[^"]*)@(?![^"]*"\B)', codes[x].split('println(')[-1].removesuffix(')\n' or ')'))
    vars: list[str] = []
    result_for_testing: list[str] = []
    for arg in range(0, len(print_args)):
        # I don't know if this works, because it splits the string on every space, so
        # spaces inside a string literal get treated as spaces that should be split on,
        # even though they shouldn't be, because those spaces are inside a string
        # literal, not outside one.
        # Example 1: '"Hello, World!" @ x @ y' => ['"Hello, World!"', x, y]
        # Example 2: '"Hello, World! " @ x @ y' => ['"Hello, World! "', x, y]
        # At this point, the parsing shouldn't have to worry about unnecessary spaces inside a string, as in example 2...
        compare: list[str] = print_args[arg].split()
        # This basically checks whether '"' is absent from the parsed piece (in this
        # case, a word that doesn't have '"'). Otherwise, it appends the whole thing,
        # joining the rest of the compare elements.
        # Here's the string: '"Value of a is: " @ a @ "String"' [for example 1]
        # Example 1: ['"Value of a is: "', 'a', '"String"'] (This one is correct)
        # Here's the string: '" Value of a is: " @ a @ " String"'
        # Example 2: ['" Value of a is: " @ a @ " String"'] (This one is incorrect)
        vars.append(compare[0]) if '"' not in compare[0] else vars.append(" ".join(compare[0:]))
    for v in range(0, len(vars)):
        # This just does its job, appending the same elements in 'vars'
        # to 'result_for_testing'
        result_for_testing.append(vars[v])
    print(result_for_testing)
After these operations, the output I get for basic input without unnecessary spaces looks like this:
    string_to_be_parsed: str = '"Value of a is: " @ a @ "String"'
    Output > ['"Value of a is: "', 'a', '"String"'] # As I expected...
But somehow it breaks with input like this (with unnecessary spaces):
    string_to_be_parsed: str = '" Value of a is: " @ a @ " String "'
    Output > ['" Value of a is: " @ a @ " String "']
    # Incorrect result; I'm hoping for a result like this:
    Expected Output > [" Value of a is: ", a, " String "]
    # Spaces inside a string should just be ignored by the splitting, but I don't know how to do that
Alright, guys, those are the problems I've encountered. To sum up:
- How do I split the string on the '@' keyword, without splitting where '@' appears inside a string literal within that string?
Example: '"@ in a string inside a string" @ is_out_from_a_string'
The result should be: ['"@ in a string inside a string"', is_out_from_a_string]
- While parsing, how do I ignore all the spaces inside a string literal within the string?
Example: '" unnecessary spaces here too" @ x @ y @ z @ " this one too"'
The result should be: ['" unnecessary spaces here too"', x, y, z, '" this one too"']
Once again, I would really appreciate your help finding solutions to these problems, and if there's something I did wrong or misunderstood, please tell me where and how I should fix it :)
Thank you :)
Comments (1)
When talking about programming languages, string.split() and nested loops aren't going to be enough. Programming languages usually split this work into two steps: the tokenizer (or lexer) and the parser. The tokenizer takes the input string (code in your language) and returns a list of tokens that represent keywords, identifiers, etc. In your code, this is each element in the result.
Either way, you're probably going to want to restructure your code a bit. For a tokenizer, here's some python-ish pseudocode:
This uses a cursor to advance through your input string and extract tokens, as a direct answer to your question. For the list of regexes, simply use a list of pairs, where each pair includes a regex and a string describing the token type.
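The cursor-based loop described above could look roughly like the following. This is a minimal sketch rather than a definitive implementation: the token names (`STRING`, `AT`, `IDENT`, `SKIP`), their patterns, and the `tokenize` function are assumptions made for illustration, and string literals are assumed to contain no escape sequences.

```python
import re

# Hypothetical token types for the question's mini-language. Each entry is a
# (type name, compiled regex) pair, tried in order at the cursor position.
TOKEN_SPEC = [
    ("STRING", re.compile(r'"[^"]*"')),        # double-quoted literal, no escapes
    ("AT",     re.compile(r"@")),              # the '@' separator
    ("IDENT",  re.compile(r"[A-Za-z_]\w*")),   # variable names
    ("SKIP",   re.compile(r"\s+")),            # whitespace between tokens
]


def tokenize(text):
    """Walk a cursor through text and return a list of (type, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)   # only matches starting at the cursor
            if match:
                if kind != "SKIP":
                    tokens.append((kind, match.group()))
                pos = match.end()              # advance the cursor past the token
                break
        else:
            # No pattern matched at this position.
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens


tokens = tokenize('" Value of a is: " @ a @ " String "')
print([value for kind, value in tokens if kind != "AT"])
# prints ['" Value of a is: "', 'a', '" String "']
```

Because the `STRING` pattern is tried first and consumes the whole literal, any `@` or spaces inside the quotes never reach the `AT` or `SKIP` rules.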
However, for a first programming language project, building a manual tokenizer and parser is probably not the way to go, as it can get extremely complex very quickly, though it is a great learning experience once you're comfortable with the basics. I would consider using a parser generator. I have used one called SLY with Python, as well as PLY (SLY's predecessor), with good results. Parser generators take a grammar, a description of your language in a specific format, and output a program that can parse your language, so that you can worry about the functionality of the language itself more than how you parse the text/code input.
It also may be worth doing some more research before beginning your implementation. Specifically, I would recommend reading about Abstract Syntax Trees and parsing algorithms, specifically recursive descent, which is what you would be writing if you built a parser manually, and LALR(1) (Look-Ahead LR: left-to-right scan, rightmost derivation, one token of lookahead), which is what SLY generates.
ASTs are the output of a parser (what the parser generator does for you) and are used to interpret or compile your language. They are fundamental to the construction of programming languages, so I would start there. This video explains syntax trees, and there are many Python-specific videos on parsing as well. This series also covers using SLY to create a simple language in Python.
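To make the AST and recursive-descent ideas concrete, here is a small hand-written sketch. It assumes a grammar of the form `expr := operand ('@' operand)*`, a token list of `(type, value)` pairs, and that `@` means concatenation when the tree is evaluated. The node classes `StringLit`, `Name`, and `Concat` are hypothetical names, and none of this is SLY's API.

```python
from dataclasses import dataclass


@dataclass
class StringLit:
    value: str      # literal text without the surrounding quotes


@dataclass
class Name:
    ident: str      # a variable reference


@dataclass
class Concat:
    parts: list     # operands that were separated by '@'


class Parser:
    """Recursive descent for the assumed grammar:
    expr := operand ('@' operand)*    operand := STRING | IDENT
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Return the current token, or a synthetic EOF token at the end.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_expr(self):
        parts = [self.parse_operand()]
        while self.peek()[0] == "AT":
            self.advance()                       # consume the '@'
            parts.append(self.parse_operand())
        if self.peek()[0] != "EOF":
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return Concat(parts)

    def parse_operand(self):
        kind, value = self.advance()
        if kind == "STRING":
            return StringLit(value[1:-1])        # strip the quotes
        if kind == "IDENT":
            return Name(value)
        raise SyntaxError(f"expected a string or name, got {kind}")


def evaluate(node, env):
    """Interpret the AST; env maps variable names to their values."""
    if isinstance(node, StringLit):
        return node.value
    if isinstance(node, Name):
        return str(env[node.ident])
    return "".join(evaluate(part, env) for part in node.parts)


tokens = [("STRING", '"Value of a is: "'), ("AT", "@"), ("IDENT", "a")]
tree = Parser(tokens).parse_expr()
print(evaluate(tree, {"a": 42}))   # prints Value of a is: 42
```

Each grammar rule becomes one method, which is the defining trait of recursive descent; the tree that comes out can be interpreted, as here, or handed to a compiler stage.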
EDIT: In regards to the specific parsing of the @ sign before a string, I would recommend using one token type for the @ sign and another for your string literal. In your parser, you can check if the next token is a string literal when the parser encounters an @ symbol. This will decrease complexity by splitting up your regexes, and also allow you to reuse the tokens if you implement functionality that also uses @ or string literals in the future.
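One way to sketch this separation is a single pattern with one named group per token type, followed by a check of what comes after each `@`. Here the check accepts a name as well as a string literal, since the question's examples use both. The group names and the `split_args` helper are hypothetical, and string literals are again assumed to have no escape sequences.

```python
import re

# One named group per token type; match.lastgroup reports which one matched.
TOKEN_RE = re.compile(r'''
    (?P<STRING>"[^"]*")        # string literal; any @ inside is plain text
  | (?P<AT>@)                  # the separator gets its own token type
  | (?P<IDENT>[A-Za-z_]\w*)    # variable names
  | (?P<SKIP>\s+)              # whitespace outside literals
  | (?P<ERROR>.)               # anything else is reported as an error
''', re.VERBOSE)


def split_args(source):
    """Return the operands of an '@'-separated argument list."""
    tokens = [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)
              if m.lastgroup != "SKIP"]
    args = []
    expect_operand = True      # true at the start and right after an '@'
    for kind, value in tokens:
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {value!r}")
        if expect_operand:
            if kind not in ("STRING", "IDENT"):
                raise SyntaxError(f"expected a string or name, got {value!r}")
            args.append(value)
        elif kind != "AT":
            raise SyntaxError(f"expected '@' before {value!r}")
        expect_operand = not expect_operand
    if expect_operand:
        raise SyntaxError("argument list is empty or ends with '@'")
    return args


print(split_args('" Value of a is: " @ a @ " String "'))
# prints ['" Value of a is: "', 'a', '" String "']
```

Keeping `STRING` and `AT` as separate token types means the same tokens can be reused if `@` or string literals later appear in other constructs; only the checking loop would need to change.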