Why does a PROC FCMP function always return 33 bytes and no more?

Posted 2024-07-25 16:39:50 · 992 words · 8 views · 0 comments

I have the following function defined via PROC FCMP. The point of the code should be pretty obvious and relatively straightforward. I'm returning the value of an attribute from a line of XHTML. Here's the code:

proc fcmp outlib=library.funcs.crawl;
    function getAttr(htmline $, Attribute $) $;

       /*-- Find the position of the match --*/
    Pos = index( htmline , strip( Attribute )||"=" );

       /*-- Now do something about it --*/
       if pos > 0 then do;
          Value = scan( substr( htmline, Pos + length( Attribute ) + 2), 1, '"');
       end;
       else Value = "";
       return( Value);
    endsub;
run;

No matter what I do with a LENGTH or ATTRIB statement to explicitly declare the returned data type, it ALWAYS returns at most 33 bytes of the requested string, regardless of how long the actual return value is. This happens no matter which attribute I am searching for. The same code hard-coded into a DATA step returns the correct results, so this is related to PROC FCMP.

Here is the DATA step I'm using to test it (where PageSource.html is any HTML file that has XHTML-compliant, fully quoted attributes):

data TEST;
length href $200;
infile "F:\PageSource.html";

input;

htmline = _INFILE_;

href = getAttr( htmline, "href");
x = length(href);

run;

UPDATE: This seems to work properly after upgrading to SAS 9.2 Release 2.

Comments (4)

梦里°也失望 2024-08-01 16:39:50

I think the problem (though I don't know why) is in the scan function - it seems to be truncating input from substr(). If you pull the substr function out of scan(), assign the result of the substr function to a new variable that you then pass to scan, it seems to work.

Here is what I ran:

proc fcmp outlib=work.funcs.crawl;
    function getAttr(htmline $, Attribute $) $;
    length y $200;
       /*-- Find the position of the match --*/
    Pos = index( htmline , strip( Attribute )||"=" );

       /*-- Now do something about it --*/
       if pos > 0 then do;
          y=substr( htmline, Pos + length( Attribute ) + 2);
          Value = scan( y, 1, '"');       
       end;
       else Value = "";
       return( Value);
    endsub;
run;

options cmplib=work.funcs;

data TEST;
length href $200;
infile "PageSource.html";

input;

htmline = _INFILE_;
href = getAttr( htmline, "href");
x = length(href);
run;

过去的过去 2024-08-01 16:39:50

In this case, an input pointer control should be enough. Hope this helps.

/* create a test input file */
data _null_;
  file "f:\pageSource.html";
  input;
  put _infile_;
cards4;
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="w3.org/StyleSheets/TR/W3C-REC.css"; type="text/css"?>
;;;;
run;

/* extract the href attribute value, if any.                          */
/* assuming that the value and the attribute name occurs in one line. */
/* and max length is 200 chars.                                       */
data one;
  infile "f:\pageSource.html" missover;
  input @("href=") href :$200.;
  href = scan(href, 1, '"'); /* unquote */
run;

/* check */
proc print data=one;
run;
/* on lst
Obs                  href
 1
 2     w3.org/StyleSheets/TR/W3C-REC.css
*/

肥爪爪 2024-08-01 16:39:50

It seems like uninitialized variables in PROC FCMP get a default length of 33 bytes. Consider the following demonstration code:

OPTIONS INSERT = (CMPLIB = WORK.FCMP);

PROC FCMP
    OUTLIB = WORK.FCMP.FOO
;

    FUNCTION FOO(
        BAR $
    );

        * Assign the value of BAR to the uninitialised variable BAZ;
        BAZ = BAR;

        * Diagnostics;
        PUT 'BAR IS ' BAR;
        PUT 'BAZ IS ' BAZ;  

        * Return error code;
        IF
            LENGTH(BAZ) NE LENGTH(BAR)
        THEN
            RETURN(0)
        ; ELSE
            RETURN(1)
        ;

    ENDSUB;

RUN;

DATA _NULL_;

    X = 'shortstring';
    Y = 'exactly 33 characters long string';
    Z = 'this string is somewhat longer than 33 characters';

    ARRAY STRINGS{*} _CHARACTER_;
    ARRAY RC{3} 8 _TEMPORARY_;

    DO I = 1 TO DIM(STRINGS);

        RC[I] = FOO(STRINGS[I]);

    END;

RUN;

With my site's installation (Base SAS 9.4 M2), this prints the following lines to the log:

BAR IS  shortstring
BAZ IS  shortstring
BAR IS  exactly 33 characters long string
BAZ IS  exactly 33 characters long string
BAR IS  this string is somewhat longer than 33 characters
BAZ IS  this string is somewhat longer th

This is likely related to the fact that PROC FCMP, like the DATA step, cannot allocate variable lengths dynamically at runtime. However, it's a little confusing, because it does dynamically allocate lengths for parameters. I'm assuming that there is a separate "initialization" phase for PROC FCMP subroutines, during which the lengths of the values passed as arguments are determined and the parameter variables that must hold those values are initialized to the required length. The lengths of variables defined only within the body of the subroutine, however, can only be discovered at runtime, when memory has already been allocated. So prior to runtime (whether at compile time or in my hypothetical "initialization" phase), memory is allocated to these variables according to an explicit LENGTH statement if one is present, and otherwise falls back to a default of 33 bytes.

Now what's really interesting is that PROC FCMP is as smart as can be about this -- within the strict separation of initialization/runtime stages. If, in the body of the subroutine, a variable A has an explicitly defined LENGTH, and then another uninitialized variable B is assigned a function of A, then B is set to the same length as A. Consider this modification of the above function, in which the value of BAR is not assigned directly to BAZ, but rather via the third variable QUX, which has an explicitly defined LENGTH of 50 bytes:

OPTIONS INSERT = (CMPLIB = WORK.FCMP);

PROC FCMP
    OUTLIB = WORK.FCMP.FOO
;

    FUNCTION FOO(
        BAR $
    );


        LENGTH QUX $ 50;
        QUX = BAR;
        * Assign the value of BAR to the uninitialised variable BAZ;
        BAZ = QUX;

        * Diagnostics;
        PUT 'BAR IS ' BAR;
        PUT 'BAZ IS ' BAZ;  

        * Return error code;
        IF
            LENGTH(BAZ) NE LENGTH(BAR)
        THEN
            RETURN(0)
        ; ELSE
            RETURN(1)
        ;

    ENDSUB;

RUN;

DATA _NULL_;

    X = 'shortstring';
    Y = 'exactly 33 characters long string';
    Z = 'this string is somewhat longer than 33 characters';

    ARRAY STRINGS{*} _CHARACTER_;
    ARRAY RC{3} 8 _TEMPORARY_;

    DO I = 1 TO DIM(STRINGS);

        RC[I] = FOO(STRINGS[I]);

    END;

RUN;

The log shows:

BAR IS  shortstring
BAZ IS  shortstring
BAR IS  exactly 33 characters long string
BAZ IS  exactly 33 characters long string
BAR IS  this string is somewhat longer than 33 characters
BAZ IS  this string is somewhat longer than 33 characters

It's likely that this "helpful" behavior is the cause of confusion and differences in the previous answers. I wonder if this behavior is documented?
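
Applied to the function from the question, the explicit-LENGTH approach might look like the sketch below. It is untested here, and the asker reports that LENGTH statements did not help on their installation before the upgrade, so treat it as an illustration for releases that behave like the 9.4 M2 log above. The 200-byte length is an assumption chosen to match the length of href in the question's test step, and the WORK.FUNCS library is borrowed from the first answer.

```sas
proc fcmp outlib=work.funcs.crawl;
    function getAttr(htmline $, Attribute $) $;

        /* Assumption: 200 bytes is enough for any attribute value.   */
        /* Without this statement Value may default to 33 bytes.      */
        length Value $ 200;

        /*-- Find the position of the match --*/
        Pos = index( htmline, strip( Attribute )||"=" );

        /*-- Skip past name=" and keep the text up to the next quote --*/
        if Pos > 0 then
            Value = scan( substr( htmline, Pos + length( Attribute ) + 2 ), 1, '"' );
        else
            Value = "";

        return( Value );
    endsub;
run;

options cmplib=work.funcs;
```

The receiving variable in the calling DATA step still needs its own LENGTH statement, as href has in the question.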

I'll leave it as an exercise to the reader to investigate exactly how smart SAS tries to get about this. For example, if an uninitialized variable gets assigned the concatenated values of two other variables with explicitly assigned lengths, is its length set to the sum of those of the other two?
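
One way to start on that experiment is sketched below. The function name CATTEST, the variables A, B, C and N, and the unused DUMMY argument are all hypothetical, made up for this illustration; the result is deliberately not stated here because it may differ between releases.

```sas
options insert = (cmplib = work.fcmp);

proc fcmp outlib = work.fcmp.foo;

    /* DUMMY is an unused numeric argument, present only so the    */
    /* function can be called the same way as FOO above.           */
    function cattest(dummy);

        /* Two variables with explicit lengths, filled completely */
        length a $ 20 b $ 30;
        a = 'aaaaaaaaaaaaaaaaaaaa';
        b = 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb';

        /* C has no LENGTH statement, so its length is up to FCMP */
        c = a || b;

        /* Report how many bytes of the concatenation survived */
        n = length(c);
        put 'LENGTH OF C IS ' n;

        return(n);

    endsub;

run;

data _null_;
    rc = cattest(1);
run;
```

If the log reports 50, the length was derived from both operands; 33 would point to the default, and 20 or 30 to the length of a single operand.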

茶底世界 2024-08-01 16:39:50

I ended up backing out of using FCMP defined data step functions. I don't think they're ready for primetime. Not only could I not solve the 33 byte return issue, but it started regularly crashing SAS.

So back to the good old (decades old) technology of macros. This works:

/*********************************/
/*= Macro to extract Attribute  =*/
/*= from XHTML string           =*/
/*********************************/
%macro getAttr( htmline, Attribute, NewVar );
   if index( &htmline , strip( &Attribute )||"=" ) > 0 then do;
      &NewVar = scan( substr( &htmline, index( &htmline , strip( &Attribute )||"=" ) + length( &Attribute ) + 2), 1, '"' );
   end;
%mend;
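
For reference, a sketch of how the macro might be called from the test DATA step in the question, assuming the macro above has already been compiled in the session. The file path and variable names are taken from the question.

```sas
data TEST;
    length href $200;
    infile "F:\PageSource.html";

    input;

    htmline = _INFILE_;

    /* Expands inline to an IF ... THEN DO block that assigns href */
    %getAttr( htmline, "href", href );
    x = length(href);
run;
```

Because the macro generates ordinary DATA step code, href takes its length from the LENGTH statement and no FCMP default applies.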