Regular expression to find paragraph names in COBOL

Posted on 2024-12-25 03:57:29

I need a regular expression to match paragraph names while reading a COBOL file in Java. The following are examples of paragraph names:

9800-WRITE-SCREEN-A.
C70-WRITE-ABFGRPPARM.
FGH0-REWRITE-ABFGRPPARM. 
8100-FILE-ERROR.

thanks

Comments (3)

俏︾媚 2025-01-01 03:57:29

Many people think that because COBOL is old it must be simple. Bad assumption. In fact,
parsing COBOL is anything but trivial. You might think that scanning a COBOL program to
identify only PARAGRAPH names should not require a full-blown parser, but it will have its
challenges. Regex alone is not up to the task.

Here are a few tips and things to be aware of:

  • PARAGRAPH names may occur in places other than the PROCEDURE DIVISION. Based
    on the names given in your question, I suspect you should only analyze the
    PROCEDURE DIVISION of the program. This is the last DIVISION of a traditional
    COBOL program (assuming the program does not contain nested programs).
    If you need to analyze OO COBOL or nested COBOL programs, you will need
    more advanced parsing techniques than Regex can provide.
  • Limit your analysis to text files containing a single program. It is possible to
    code multiple independent programs in a single source 'deck', but this is not commonly
    done, so be aware that you probably will not handle that case gracefully.
  • For Fixed Format COBOL programs (older style coding) you can rely on the fact that
    PARAGRAPH names begin in Area A, that is, somewhere in columns 8 through 11.
  • For Fixed Format COBOL, you need to ignore any text appearing in columns 1 through
    6 and from column 73 to the end of line.
  • For Fixed Format COBOL, any line containing a character other than space or hyphen
    in column 7 should be ignored (this is a comment or debug line).
  • If the program contains COPY or REPLACE directives in the PROCEDURE DIVISION,
    your analysis is going to be incomplete and/or inaccurate. COPY can
    bring in additional source code containing paragraph names, and the REPLACE
    directive can change the names of subsequent paragraphs during the text manipulation
    phase of the compile (i.e. the compiled program may have names different from
    the ones you detect). This is not a common practice, but it is one you need to be aware of.
  • Continuation lines can really mess up a simple text scanner because a single COBOL
    word (e.g. paragraph name) may be split over multiple source lines. However, in the
    case of paragraph names it is not a common occurrence for them to span multiple
    lines.
  • The comma (",") and semi-colon (";") characters are "noise" and can appear almost
    anywhere a space can occur (at least within the PROCEDURE DIVISION). You
    might want to replace these with spaces to simplify subsequent analysis.
  • Quoted text. COBOL has some interesting quoting conventions, particularly when
    quoted text spans multiple source lines. Text quoting and continuation rules
    for COBOL are unlike those of any other language you may be familiar with, and they
    present real headaches for parsing. I'm not even going to begin to explain them here!
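The column rules from the tips above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class and method names (`FixedFormatLine`, `codeArea`) are made up for this example, the input is assumed to be fixed-format source with no tab characters, and commas or semicolons inside quoted literals are blanked out along with the rest.

```java
import java.util.Optional;

final class FixedFormatLine {
    private FixedFormatLine() {}

    /**
     * Returns columns 8-72 of a fixed-format source line, or empty when the
     * line is too short or column 7 marks it as a comment/debug line.
     * (Hypothetical helper, not part of any library.)
     */
    static Optional<String> codeArea(String rawLine) {
        if (rawLine.length() < 8) {
            return Optional.empty();              // nothing past the indicator column
        }
        char indicator = rawLine.charAt(6);       // column 7 is index 6
        if (indicator != ' ' && indicator != '-') {
            return Optional.empty();              // comment or debug line
        }
        int end = Math.min(rawLine.length(), 72); // drop columns 73 onward
        String code = rawLine.substring(7, end);  // drop columns 1-7
        // Commas and semicolons are noise, so treat them as spaces.
        return Optional.of(code.replace(',', ' ').replace(';', ' '));
    }
}
```

Calling `FixedFormatLine.codeArea(line)` on each source line and keeping only the non-empty results gives text that is much easier to scan for names.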

How do you recognize a PARAGRAPH name in the PROCEDURE DIVISION of a COBOL program?
Simple: just look for single "words" delimited by periods ("."). A paragraph name is a single
word (it may contain hyphens, alphabetic and/or numeric characters) and is always preceded by
a period and followed by a period. There may (or may not) be blank spaces before or after
each of the periods.
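A minimal Java sketch of that "single word between two periods" rule is shown here. The class name `PeriodDelimitedWords` is hypothetical, and the input is assumed to be PROCEDURE DIVISION text that has already had comments, sequence columns, and quoted literals removed. Note that a one-word sentence such as `EXIT.` satisfies the same rule, so it shows up in the result as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PeriodDelimitedWords {
    private PeriodDelimitedWords() {}

    // A period, optional whitespace, one word, optional whitespace, then a
    // lookahead for the next period so adjacent names can share a period.
    private static final Pattern SINGLE_WORD =
            Pattern.compile("\\.\\s*([A-Za-z0-9-]+)\\s*(?=\\.)");

    /** Returns every single word that sits alone between two periods. */
    static List<String> find(String procedureDivisionText) {
        List<String> words = new ArrayList<>();
        Matcher m = SINGLE_WORD.matcher(procedureDivisionText);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }
}
```

For the text `PROCEDURE DIVISION. 9800-WRITE-SCREEN-A. MOVE X TO Y. EXIT. 8100-FILE-ERROR. STOP RUN.` this returns `9800-WRITE-SCREEN-A`, `EXIT`, and `8100-FILE-ERROR`.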

Now it seems to me that if you want to identify PARAGRAPH names, you probably want to
identify SECTION names too. A SECTION name is similar to a PARAGRAPH name except that
it is followed by the mandatory reserved word SECTION, optionally followed by a
PRIORITY NUMBER. PRIORITY NUMBERs are not much used any more (in fact
they are obsolete), so you might not have to deal with them.

A somewhat flawed but reasonable process to identify COBOL paragraph names

This is not a single Regex, but a process that involves multiple Regexes and/or
text manipulations.

  • Assume Fixed Format COBOL
  • Eliminate all quoted text. This is not difficult for simple text, but
    when continuation lines are involved it gets quite difficult. If COPY/REPLACE directives
    are involved, forget it!
  • Eliminate comment lines (i.e. column 7 contains an asterisk)
  • Strip out columns 1 through 7 and column 73 through to the end of the line
  • Drop all text prior to the words "PROCEDURE DIVISION"
  • Replace all occurrences of comma and semi-colon with a single space character
  • Extract all text between periods (".")
  • If the extracted text contains a single word, then it is a PARAGRAPH name.
  • If the extracted text contains two words, and the second word is "SECTION", then
    the first word is a SECTION name.

The above is not foolproof, but should be good enough to identify paragraph and section names in most "garden variety" COBOL programs.
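The steps above could be strung together in Java along these lines. Treat it as a sketch under several assumptions rather than a finished tool: the names `CobolNameScanner`, `Result`, and `scan` are invented for this example, only literals that open and close on the same line are removed, continuation lines and COPY/REPLACE are not handled, and a section header carrying a priority number is not recognized. One-word sentences such as `EXIT.` are reported as paragraph names too, which is one of the flaws the process accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CobolNameScanner {
    private CobolNameScanner() {}

    private static final Pattern PROC_DIV =
            Pattern.compile("PROCEDURE\\s+DIVISION", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME = Pattern.compile("[A-Za-z0-9-]+");

    /** Names found in the PROCEDURE DIVISION, in source order. */
    static final class Result {
        final List<String> paragraphs = new ArrayList<>();
        final List<String> sections = new ArrayList<>();
    }

    static Result scan(List<String> sourceLines) {
        StringBuilder text = new StringBuilder();
        for (String line : sourceLines) {
            if (line.length() < 8) {
                continue;                                   // no code area
            }
            char indicator = line.charAt(6);                // column 7
            if (indicator != ' ' && indicator != '-') {
                continue;                                   // comment or debug line
            }
            String code = line.substring(7, Math.min(line.length(), 72));
            // Blank out literals that open and close on the same line.
            code = code.replaceAll("\"[^\"]*\"|'[^']*'", " ");
            text.append(code).append(' ');
        }

        Result result = new Result();
        Matcher start = PROC_DIV.matcher(text);
        if (!start.find()) {
            return result;                                  // no PROCEDURE DIVISION
        }
        String body = text.substring(start.start())
                .replace(',', ' ')
                .replace(';', ' ');

        // With limit -1 the last element is whatever follows the final period,
        // so it is not "between periods" and is skipped.
        String[] chunks = body.split("\\.", -1);
        for (int i = 0; i < chunks.length - 1; i++) {
            String chunk = chunks[i].trim();
            if (chunk.isEmpty()) {
                continue;
            }
            String[] words = chunk.split("\\s+");
            if (words.length == 1 && NAME.matcher(words[0]).matches()) {
                result.paragraphs.add(words[0]);
            } else if (words.length == 2 && words[1].equalsIgnoreCase("SECTION")) {
                result.sections.add(words[0]);
            }
        }
        return result;
    }
}
```

The source lines can be read with something like `java.nio.file.Files.readAllLines(path)` and passed to `CobolNameScanner.scan`, after which `paragraphs` and `sections` hold the detected names.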

I wish you luck.

末骤雨初歇 2025-01-01 03:57:29

"^[ ]{7,10}([-\\w]+)"

Paragraph names begin in columns 8-11. $1 will be the name.
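For illustration, here is one way that pattern might be applied from Java. The class name `AreaARegexDemo` is hypothetical, and `Pattern.MULTILINE` is an added assumption so that `^` matches at the start of every line when the whole file is held in one string. Two things to keep in mind: the pattern expects columns 1-7 to be blank, so lines carrying sequence numbers will not match, and it does not check for a trailing period, so the first word of any Area A line (for example `PROCEDURE` in `PROCEDURE DIVISION.`) is captured as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AreaARegexDemo {
    private AreaARegexDemo() {}

    // 7 to 10 leading spaces put the first word in columns 8-11 (Area A).
    private static final Pattern AREA_A_WORD =
            Pattern.compile("^[ ]{7,10}([-\\w]+)", Pattern.MULTILINE);

    /** Returns the first word of every line that starts in Area A. */
    static List<String> firstWords(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = AREA_A_WORD.matcher(source);
        while (m.find()) {
            names.add(m.group(1));   // group 1 is the "$1" mentioned above
        }
        return names;
    }
}
```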

纸伞微斜 2025-01-01 03:57:29

Some rules:

  1. Paragraph names begin in Area A (columns 8-11).
  2. Can contain letters, digits, or hyphens.
  3. Ends with a period (.).
  4. Contains no whitespace characters.

Regex = ^[ ]{7,10}([-\w]+\.\n)
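As written, group 1 of that regex also captures the period and the newline, and a line with trailing spaces after the period (like the third sample in the question) will not match. A possible variation in Java, offered as an assumption rather than as the answer's own code, keeps the period outside the group and allows trailing spaces before the end of the line. The class name `ParagraphHeaderRegex` is made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphHeaderRegex {
    private ParagraphHeaderRegex() {}

    // Area A word, a period, optional trailing spaces, then end of line.
    // MULTILINE lets ^ and $ match at every line boundary.
    private static final Pattern HEADER =
            Pattern.compile("^[ ]{7,10}([-\\w]+)\\.[ ]*$", Pattern.MULTILINE);

    /** Returns words that stand alone on an Area A line and end with a period. */
    static List<String> paragraphNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = HEADER.matcher(source);
        while (m.find()) {
            names.add(m.group(1));   // name only, without the period
        }
        return names;
    }
}
```

Because the period must directly follow the first word, a line such as `PROCEDURE DIVISION.` is skipped, although a one-word sentence written in Area A would still match.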
