Where is the actual character set for Ada program text defined?

Posted 2025-02-05 17:22:32

I'm trying to make a tree-sitter parser, so that IDEs (in this case, Vim) can parse and do more advanced manipulation of Ada program text, such as extract-subprogram and rename-variable. But there seem to be some problems defining the character set.

In the Ada 2012 Reference Manual, I found a list of vague category descriptions, of the form 'Any character whose General Category is X' which means that for instance, besides the underscore, all of these ( ‿ ⁀ ⁔ ︳︴﹍﹎﹏＿) are also allowed in an identifier, which seems absurd, and GNAT rejects with 'illegal character'. The list is prefaced by this statement:

"The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified."

Does that really mean there's no way to know which characters should be accepted?

Two pages on, these examples are explicitly given as valid identifiers, and yet GNAT 2021 rejects them:

procedure Main is
   Πλάτων  : constant := 12;     -- Plato
   Чайковский : constant := 12;  -- Tchaikovsky
   θ, φ : constant := 12;        -- Angles
begin
   null;
end Main;

$ gprbuild
using project file foo.gpr
Compile
   [Ada]          main.adb
main.adb:2:04: error: declaration expected
main.adb:2:05: error: illegal character
main.adb:3:04: error: declaration expected
main.adb:3:05: error: illegal character
main.adb:4:05: error: illegal character
gprbuild: *** compilation phase failed

Where is the actual character set for Ada programs defined? Has GNAT 2021 got it wrong?

An example program using Unicode characters in identifiers is below for your experimentation. Note that the use of wide characters in the literal string is outside the scope of the question.

main.adb:

with Ada.Wide_Text_IO; use Ada.Wide_Text_IO;

procedure Main is
   δεδομένα_πράμα : constant Wide_String := "Ο Πλάτων θα ενέκρινε";
begin
   Put_Line (Δεδομένα_πράμα);
end Main;

foo.gpr

project foo is

   for Source_Dirs use (".");
   for Main use ("main.adb");

   package Compiler is
      for Default_Switches ("ada") use ("-gnatW8", "-gnatiw");
   end Compiler;

end foo;

To build & run:

gprbuild
./main

纵山崖 2025-02-12 17:22:32

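All versions of Ada since Ada 2005 require implementations to support UTF-8 source code, but for compatibility with Ada 83 & 95 it is not required to be the default encoding. GNAT's default source encoding is Latin-1, although it helpfully switches to UTF-8 if a byte-order mark is found. To specify the file encoding explicitly, you can pass the -gnatW8 flag.

However, although -gnatW8 allows UTF-8 in source files, identifiers are still restricted to Latin-1 in GNAT; you must also pass the -gnatiw flag to allow wide characters in identifiers. GNAT seems not to enable this by default, not because you can make really weird identifiers (as you noted), but because identifiers are then no longer properly case-insensitive: GNAT does only minimal case folding on wide characters, other than those that also exist in the other encodings it supports.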
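For tooling that reads Ada sources outside the compiler, the default behaviour described above (Latin-1 unless a byte-order mark is found) could be mimicked along these lines. This is only a sketch in Python: `decode_ada_source` is a hypothetical helper name, and it models just the BOM rule as stated here, not GNAT's actual implementation or the effect of -gnatW8.

```python
import codecs


def decode_ada_source(raw: bytes) -> str:
    """Decode source bytes the way the answer describes GNAT's default:
    UTF-8 when a byte-order mark is present, Latin-1 otherwise.
    (Hypothetical helper; the -gnatW8 switch is not modelled.)"""
    if raw.startswith(codecs.BOM_UTF8):
        # Strip the BOM, then decode the rest as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # Latin-1 maps every byte to one character, so this never fails.
    return raw.decode("latin-1")


line = "θ : constant := 12;"
with_bom = codecs.BOM_UTF8 + line.encode("utf-8")
without_bom = line.encode("utf-8")

print(decode_ada_source(with_bom) == line)     # BOM present: text survives
print(decode_ada_source(without_bom) == line)  # no BOM: bytes misread as Latin-1
```

Without the mark, each two-byte UTF-8 sequence such as `θ` is read as two Latin-1 characters.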

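ARM §2.3 specifies the requirements for identifiers:
identifier ::= identifier_start {identifier_start | identifier_extend}
where identifier_start can be summarised as anything in Unicode general category L (plus letter numbers, category Nl), and the remaining characters can also be decimal digits, punctuation_connectors, and non-spacing or spacing combining marks. In addition, "An identifier shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category."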
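As a rough illustration of the ARM §2.3 rule, here is a sketch in Python that checks a candidate identifier against the Unicode general categories using the standard `unicodedata` module. The names `is_ada_identifier`, `START` and `EXTEND` are made up for this example. It covers only the category and punctuation_connector rules; reserved words and normalization are not handled, and it says nothing about what any particular compiler accepts.

```python
import unicodedata

# identifier_start: letters (Lu, Ll, Lt, Lm, Lo) and letter numbers (Nl).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# identifier_extend: combining marks, decimal digits, connector punctuation.
EXTEND = {"Mn", "Mc", "Nd", "Pc"}


def is_ada_identifier(text: str) -> bool:
    """Return True if text matches the identifier syntax summarised above."""
    if not text or unicodedata.category(text[0]) not in START:
        return False
    prev_is_connector = False
    for ch in text[1:]:
        cat = unicodedata.category(ch)
        if cat not in START and cat not in EXTEND:
            return False
        is_connector = cat == "Pc"
        # No two consecutive punctuation_connector characters.
        if is_connector and prev_is_connector:
            return False
        prev_is_connector = is_connector
    # Must not end with a punctuation_connector character.
    return not prev_is_connector


for candidate in ["Πλάτων", "Чайковский", "θ", "a\u203fb", "a__b", "a_", "1a"]:
    print(candidate, is_ada_identifier(candidate))
```

Under this reading, `a‿b` (U+203F, category Pc) passes just as `a_b` does, which matches the observation in the question that the manual's categories admit connectors other than the underscore.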

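Going beyond your question, note that despite all these flags, String is still encoded as Latin-1 (paradoxically, string literals are UTF-8, but the underlying String is not :/). You'll need to use VSS for Unicode string handling.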