Regular expression to find paragraph names in COBOL
I need a regular expression to match paragraph names while reading a COBOL file in Java. The following are examples of paragraph names:
9800-WRITE-SCREEN-A.
C70-WRITE-ABFGRPPARM.
FGH0-REWRITE-ABFGRPPARM.
8100-FILE-ERROR.
thanks
Comments (3)
Many think that because COBOL is old it must be simple... Bad assumption. In fact, parsing COBOL is anything but trivial. You might think that scanning a COBOL program to identify only PARAGRAPH names should not require a full-blown parser, but it will have its challenges. Regex alone is not up to the task.
Here are a few tips and things to be aware of:
PARAGRAPH names may occur in places other than the PROCEDURE DIVISION. Based on the names given in your question, I suspect you should only analyze the PROCEDURE DIVISION of the program. This is the last DIVISION of a traditional COBOL program (assuming the program does not contain nested programs). If you need to analyze OO COBOL or nested COBOL programs you will need more advanced parsing techniques than Regex can provide.
It is possible to code multiple independent programs in a single source 'deck', but this is not commonly done, so realize that you probably will not handle this gracefully.
PARAGRAPH names will begin somewhere between columns 8 through 11. Ignore everything in columns 1 through 6 and from column 73 to the end of line.
Lines with a non-blank character in column 7 should be ignored (this is a comment or debug line).
If there are COPY or REPLACE directives in the PROCEDURE DIVISION, your analysis is going to be incomplete and/or inaccurate. COPY can potentially bring in additional source code containing paragraph names, and the REPLACE directive can change the names of subsequent paragraphs during the text manipulation phase of the compile (i.e. the compiled program may have names different from the ones you detect). This is not a common practice but one you need to be aware of.
A COBOL word (e.g. a paragraph name) may be split over multiple source lines. However, in the case of paragraph names it is not a common occurrence for them to span multiple lines.
anywhere a space can occur (at least within the PROCEDURE DIVISION). You might want to replace these with spaces to simplify subsequent analysis.
Quoted text may span multiple source lines. Text quoting and continuation rules for COBOL are unlike those of any other language you may be familiar with, and present real headaches for parsing. I'm not even going to begin to explain them here!
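As a rough illustration of the column rules in the tips above, here is a minimal Java sketch. The class and method names (`CobolSourceNormalizer`, `normalize`) are made up for this example. It assumes fixed-format source with no tab characters, and it simply drops every line that has a non-blank column 7, which also discards continuation lines; that is a simplification, not a full treatment.

```java
import java.util.ArrayList;
import java.util.List;

public class CobolSourceNormalizer {

    /**
     * Keeps only columns 8 through 72 of each fixed-format line.
     * Lines with a non-blank column 7 (comment, debug, and also
     * continuation lines) are skipped entirely.
     */
    public static List<String> normalize(List<String> rawLines) {
        List<String> result = new ArrayList<>();
        for (String line : rawLines) {
            // Too short to contain anything from column 8 onward.
            if (line.length() <= 7) {
                continue;
            }
            // Column 7 is index 6; a non-blank indicator means "ignore this line".
            if (line.charAt(6) != ' ') {
                continue;
            }
            // Drop columns 1-6 and everything from column 73 onward.
            int end = Math.min(line.length(), 72);
            result.add(line.substring(7, end));
        }
        return result;
    }
}
```

Each returned string starts at column 8, so a paragraph name that begins in columns 8 through 11 shows up with zero to three leading spaces.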
How to recognize a PARAGRAPH name in the PROCEDURE DIVISION of a COBOL program? Simple, just look for single "words" delimited by periods ("."). A paragraph name is a single word (it may contain hyphens, alpha and/or numeric characters) and is always preceded by a period and followed by a period. There may (or may not) be blank spaces before or after each of the periods.
Now it seems to me that if you want to identify PARAGRAPH names, you probably want to identify SECTION names too. A SECTION name is similar to a PARAGRAPH name except that it is followed by the mandatory reserved word SECTION and optionally followed by a PRIORITY NUMBER. The PRIORITY NUMBER is not much used any more (in fact it is obsolete) so you might not have to deal with it.
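The "single word between periods" idea can be sketched in Java roughly as follows. Everything here is an assumption for illustration: the class name `CobolNameScanner`, the `scan` method, and the small list of one-word statements to skip (`EXIT`, `GOBACK`, `CONTINUE`) are not from the answer. The input is expected to be PROCEDURE DIVISION text that has already been stripped to columns 8 through 72 and joined into one string. Splitting on every period will misfire on periods inside quoted literals or decimal numbers, so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CobolNameScanner {

    // A single COBOL-style word: letters, digits and hyphens.
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z0-9][A-Za-z0-9-]*");

    // A word followed by SECTION and an optional priority number.
    private static final Pattern SECTION = Pattern.compile(
            "([A-Za-z0-9][A-Za-z0-9-]*)\\s+SECTION(\\s+\\d+)?",
            Pattern.CASE_INSENSITIVE);

    // One-word statements that would otherwise look like paragraph names.
    // Illustrative only; a real list would be longer.
    private static final Set<String> ONE_WORD_STATEMENTS =
            new HashSet<>(Arrays.asList("EXIT", "GOBACK", "CONTINUE"));

    public static List<String> scan(String procedureText) {
        List<String> found = new ArrayList<>();
        // Each chunk between two periods is treated as one "sentence".
        for (String sentence : procedureText.split("\\.")) {
            String s = sentence.trim();
            Matcher section = SECTION.matcher(s);
            if (section.matches()) {
                found.add("SECTION " + section.group(1));
            } else if (WORD.matcher(s).matches()
                    && !ONE_WORD_STATEMENTS.contains(s.toUpperCase())) {
                found.add("PARAGRAPH " + s);
            }
        }
        return found;
    }
}
```

Note that a name referenced inside a statement, such as `PERFORM 9800-WRITE-SCREEN-A.`, is not reported, because that sentence contains more than one word.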
A somewhat flawed but reasonable process to identify COBOL paragraph names
This is not a single Regex, but a process that involves multiple Regex and/or text manipulations.
When continuation lines are involved it gets quite difficult. If COPY/REPLACE directives are involved, forget it!
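A sentence that consists of a single word is a PARAGRAPH name. If the word is followed by the reserved word SECTION, the first word is a SECTION name.
The above is not foolproof, but should be good enough to identify paragraph and section names in most "garden variety" COBOL programs.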
I wish you luck.
Paragraph names begin in columns 8-11. $1 will be the name.
Some rules:
Regex = ^[ ]{7,10}([-\w]+)\.\n
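For the Java side of the question, a line-based regex like this one could be applied as in the sketch below. This is a variation rather than the exact expression: the capture group stops before the period so that group 1 holds only the name, and `Pattern.MULTILINE` is used so that `^` and `$` match at each line. The class and method names are hypothetical, and it assumes columns 1 through 7 are blank, as the pattern requires 7 to 10 leading spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphRegexDemo {

    // 7-10 leading spaces put the name in columns 8-11; group 1 is the name.
    private static final Pattern NAME = Pattern.compile(
            "^[ ]{7,10}([-\\w]+)\\.[ ]*$", Pattern.MULTILINE);

    public static List<String> paragraphNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(source);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String source =
                "       9800-WRITE-SCREEN-A.\n"
              + "           MOVE A TO B.\n"
              + "       C70-WRITE-ABFGRPPARM.\n"
              + "       8100-FILE-ERROR.\n";
        System.out.println(paragraphNames(source));
    }
}
```

Be aware that `\w` also accepts underscores, and that sources with sequence numbers in columns 1 through 6 will not match unless those columns are blanked first.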