Regular expression to find paragraph names in COBOL

Posted 2024-12-25 03:57:29


I need a regular expression to match paragraph names while reading a COBOL file in Java. The following are examples of paragraph names:

9800-WRITE-SCREEN-A.
C70-WRITE-ABFGRPPARM.
FGH0-REWRITE-ABFGRPPARM. 
8100-FILE-ERROR.

Thanks


俏︾媚 2025-01-01 03:57:29


Many people think that because COBOL is old it must be simple. That is a bad
assumption: parsing COBOL is anything but trivial. You might think that scanning a
COBOL program to identify only PARAGRAPH names should not require a full-blown
parser, but it has its challenges. Regex alone is not up to the task.

Here are a few tips and things to be aware of:

- PARAGRAPH names may occur in places other than the PROCEDURE DIVISION. Based
  on the names given in your question, I suspect you should only analyze the
  PROCEDURE DIVISION of the program. This is the last DIVISION of a traditional
  COBOL program (assuming the program does not contain nested programs).
- If you need to analyze OO COBOL or nested COBOL programs, you will need
  more advanced parsing techniques than Regex can provide.
- Limit your analysis to text files containing single programs. It is possible to
  code multiple independent programs in a single source 'deck', but this is not
  commonly done, so accept that you probably will not handle it gracefully.
- For Fixed Format COBOL programs (older style coding) you can rely on the fact that
  PARAGRAPH names begin somewhere in columns 8 through 11.
- For Fixed Format COBOL, you need to ignore any text appearing in columns 1 through
  6 and from column 73 to the end of the line.
- For Fixed Format COBOL, any line containing a character other than space or hyphen
  in column 7 should be ignored (it is a comment or debug line).
- If the program contains COPY or REPLACE directives in the PROCEDURE DIVISION,
  your analysis is going to be incomplete and/or inaccurate. COPY can bring in
  additional source code containing paragraph names, and the REPLACE directive can
  change the names of subsequent paragraphs during the text manipulation phase of
  the compile (i.e. the compiled program may have names different from the ones
  you detect). This is not a common practice, but one you need to be aware of.
- Continuation lines can really mess up a simple text scanner because a single COBOL
  word (e.g. a paragraph name) may be split over multiple source lines. However, it
  is uncommon for paragraph names to span multiple lines.
- The comma (",") and semicolon (";") characters are "noise" and can appear almost
  anywhere a space can occur (at least within the PROCEDURE DIVISION). You
  might want to replace these with spaces to simplify subsequent analysis.
- Quoted text: COBOL has some interesting quoting conventions, particularly when
  quoted text spans multiple source lines. The text quoting and continuation rules
  for COBOL are unlike those of any other language you may be familiar with, and
  they present real headaches for parsing. I'm not even going to begin to explain
  them here!
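
The fixed-format column rules in the tips above could be applied line by line before any regex is run. The following is a minimal sketch, not a definitive implementation: the class name `FixedFormatLine` and the method `codeArea` are made up for illustration, and it assumes untabbed source lines and that quoted literals do not matter to the caller (commas and semicolons inside literals are replaced too).

```java
import java.util.Optional;

final class FixedFormatLine {
    /**
     * Returns the code area (columns 8-72) of a fixed-format source line,
     * or empty if the line is a comment/debug line or has no code area.
     * Assumption: the line contains no tab characters.
     */
    static Optional<String> codeArea(String line) {
        if (line.length() < 8) {
            return Optional.empty();            // nothing beyond the indicator column
        }
        char indicator = line.charAt(6);        // column 7 is index 6
        if (indicator != ' ' && indicator != '-') {
            return Optional.empty();            // comment or debug line
        }
        int end = Math.min(line.length(), 72);  // drop column 73 onward
        String code = line.substring(7, end);   // drop columns 1 through 7
        // Commas and semicolons are noise, so treat them as spaces.
        // Assumption: text inside quoted literals is not needed afterwards.
        return Optional.of(code.replace(',', ' ').replace(';', ' '));
    }
}
```

For example, `FixedFormatLine.codeArea("000100 9800-WRITE-SCREEN-A.")` yields the text `9800-WRITE-SCREEN-A.`, while a line with `*` in column 7 yields an empty result.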

How do you recognize a PARAGRAPH name in the PROCEDURE DIVISION of a COBOL program?
Simple: look for single "words" delimited by periods ("."). A paragraph name is a single
word (it may contain hyphens, letters and/or digits) that is always preceded by
a period and followed by a period. There may (or may not) be blank spaces before or after
each of the periods.
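
As a rough illustration of that rule in Java, the sketch below finds every single word that sits between two periods. The class name `ParagraphNameRegex` and the method `candidates` are hypothetical, and the input is assumed to be PROCEDURE DIVISION text that has already been cleaned of comments, sequence columns and quoted literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphNameRegex {
    // A COBOL word: letters, digits and hyphens, not starting or ending with a hyphen.
    private static final String WORD = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";

    // A single word preceded by a period and followed by a period,
    // with optional whitespace on either side of the word.
    private static final Pattern CANDIDATE =
            Pattern.compile("(?<=\\.)\\s*(" + WORD + ")\\s*\\.");

    /** Returns every single word found between two periods, in source order. */
    static List<String> candidates(String procedureText) {
        List<String> names = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(procedureText);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

Note that one-word sentences such as `EXIT.` or `GOBACK.` have exactly the same shape as a paragraph name, so the results are only candidates and still need filtering.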

Now it seems to me that if you want to identify PARAGRAPH names, you probably want to
identify SECTION names too. A SECTION name is similar to a PARAGRAPH name except that
it is followed by the mandatory reserved word SECTION and optionally by a PRIORITY
NUMBER. PRIORITY NUMBERs are not much used any more (in fact they are obsolete), so
you might not have to deal with them.

A somewhat flawed but reasonable process to identify COBOL paragraph names

This is not a single Regex, but a process that involves multiple regexes and/or
text manipulations.

Assume Fixed Format COBOL, then:

1. Eliminate all quoted text. This is not difficult for simple text, but
   when continuation lines are involved it gets quite difficult. If COPY/REPLACE
   directives are involved, forget it!
2. Eliminate comment lines (i.e. column 7 contains an asterisk).
3. Strip out columns 1 through 7 and column 73 through to the end of the line.
4. Drop all text prior to the words "PROCEDURE DIVISION".
5. Replace all occurrences of comma and semicolon with a single space character.
6. Extract all text between periods (".").
7. If the extracted text contains a single word, then it is a PARAGRAPH name.
8. If the extracted text contains two words, and the second word is "SECTION", then
   the first word is a SECTION name.
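
The process above might look roughly like this in Java. It is a sketch under several assumptions rather than a finished tool: `CobolNameScanner` and its members are hypothetical names, quoted literals are assumed to start and end on the same line, a word split across a continuation line is not rejoined, COPY/REPLACE is ignored, and the small `ONE_WORD_STATEMENTS` set (needed because sentences like `EXIT.` are also a single word between periods) is illustrative, not complete. Comment lines are dropped before quoted text is removed so that apostrophes in comments cannot confuse the literal matching. `Set.of` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CobolNameScanner {
    private static final Pattern PROC_DIV =
            Pattern.compile("PROCEDURE\\s+DIVISION", Pattern.CASE_INSENSITIVE);
    // Assumption: a literal opens and closes on the same source line.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"|'[^']*'");
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?");
    // Illustrative only: one-word sentences that are statements, not names.
    private static final Set<String> ONE_WORD_STATEMENTS =
            Set.of("EXIT", "GOBACK", "CONTINUE");

    final List<String> paragraphs = new ArrayList<>();
    final List<String> sections = new ArrayList<>();

    static CobolNameScanner scan(List<String> sourceLines) {
        StringBuilder text = new StringBuilder();
        for (String line : sourceLines) {
            if (line.length() < 8) continue;                    // no code area
            char indicator = line.charAt(6);                    // column 7
            if (indicator != ' ' && indicator != '-') continue; // comment/debug line
            // Keep columns 8-72 only, then blank out literals and noise characters.
            String code = line.substring(7, Math.min(line.length(), 72));
            code = QUOTED.matcher(code).replaceAll(" ");
            text.append(code.replace(',', ' ').replace(';', ' ')).append(' ');
        }
        CobolNameScanner result = new CobolNameScanner();
        Matcher header = PROC_DIV.matcher(text);
        if (!header.find()) return result;                      // no PROCEDURE DIVISION
        // Keep only the text after the header and cut it at every period.
        String[] chunks = text.substring(header.end()).split("\\.", -1);
        // Skip the first chunk (rest of the header) and the last (after the final period).
        for (int i = 1; i < chunks.length - 1; i++) {
            String[] words = chunks[i].trim().split("\\s+");
            if (!WORD.matcher(words[0]).matches()) continue;
            if (words.length == 1) {
                if (!ONE_WORD_STATEMENTS.contains(words[0].toUpperCase(Locale.ROOT))) {
                    result.paragraphs.add(words[0]);
                }
            } else if (words.length == 2 && words[1].equalsIgnoreCase("SECTION")) {
                result.sections.add(words[0]);
            }
        }
        return result;
    }
}
```

Calling `CobolNameScanner.scan(lines)` with the lines of one source file returns an object whose `paragraphs` and `sections` lists hold the detected names in source order.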

The above is not foolproof, but it should be good enough to identify paragraph and section names in most "garden variety" COBOL programs.

I wish you luck.
