How can I use SAS to remove a generational suffix from a last name (if present)?

Posted on 2025-01-22 08:55:51

In my dataset, the last name (lname) occasionally has a generational suffix attached. Regarding the generational suffix:

  • there are no spaces or other possible delimiters between the lname variable and the suffix
  • the suffix ranges between 2 and 4 characters in length
  • the suffix is a mix of lowercase, uppercase, and proper case
  • the suffix sometimes includes a combination of integers and characters

I tried to think of simple solutions first. I couldn't come up with one in Excel, because all of its string functions require the value being removed to sit at a consistent position.

In SAS, PARSE requires a delimiter, and TRIM requires a consistent position.

The syntax below shows four different approaches I tried. None of them worked, and I fully accept that this is user error. I'm not familiar with any of these functions other than COMPRESS, and I have only used that one for removing blanks.

Is there a way I can make a new variable for last name that doesn't have the generational suffix attached?

Thank you so much!

This first piece applies to each of my attempts.

data have;
    input id lname $ fname $;
    datalines;
        123456  Smith       John
        234567  SMITH       ANDREW
        345678  SmithJr     Alan
        456789  SMITHSR     SAM
        789012  smithiii    robert
        890123  smithIIII   william
        901234  Smith4th    Tim
        ;
run;

My attempts start here.

/* COMPRESS */
data want;
    set have;
    lname2 = compress(lname,'Jr');
    put string=;
run;

/* TRANWRD */
data want;
    set have;
    lname2 = tranwrd(lname,"Jr", "");
    lname2 = tranwrd(lname,"Sr", "");
    lname2 = tranwrd(lname,"III", "");
run;

/* PRXCHANGE */
data want;
    set have;
    lname2 = lname;
    lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/i',1,trim(lname));
run;

/* PRXMATCH */
data want;
    set have;
    if prxmatch('/Jr|Sr|III/',lname) then lname2 = '';
run;

Comments (2)

感情洁癖 2025-01-29 08:55:51
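  1. You cannot use COMPRESS for this purpose at all: it removes individual characters, not a substring, so compress(lname,'Jr') deletes every 'J' and every 'r' in the name.
  2. TRANWRD replaces every occurrence of a substring wherever it appears, and it is case-sensitive, so it cannot limit the replacement to the end of the name. TRANSLATE works character by character and has the same limitation. Neither solves the problem of the pattern also being replaced at the beginning or in the middle of the word.
  3. An example using PRXMATCH is below.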
    data have;
        input id lname $ fname $;
        datalines;
            123456  Smith       John
            234567  SMITH       ANDREW
            345678  SmithJr     Alan
            456789  SMITHSR     SAM
            789012  smithiii    robert
            890123  smithIIII   william
            901234  Smith4th    Tim
            901235  SRith4th    Tim
            ;
    run;
         
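    data want;
        set have;
        /* Default: keep the name unchanged (this also gives lname2 the length of lname). */
        lname2 = lname;

        /* Use PRXPARSE to compile the Perl regular expression. */
        patternID = prxparse('/(JR$)|(SR$)|(III$)/');
        /* Use PRXMATCH to find the position of the pattern match. */
        position = prxmatch(patternID, compress(upcase(lname)));
        put position=;
        /* On a match, keep only the characters before the suffix. */
        if position > 1 then do;
            put lname=;
            lname2 = substr(lname, 1, position - 1);
        end;
    run;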
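
To make points 1 and 2 concrete, here is a minimal sketch of what COMPRESS and TRANWRD do to a few values. The dataset name `demo`, the variables `via_compress` and `via_tranwrd`, and the input values `JrgensenJr` and `Parker` are made up for illustration and are not part of the original data.

```sas
data demo;
    length lname via_compress via_tranwrd $ 12;
    input lname $;

    /* COMPRESS deletes every 'J' and every 'r', wherever they occur,
       so 'Parker' loses both of its r's.                            */
    via_compress = compress(lname, 'Jr');

    /* TRANWRD replaces the substring 'Jr' at any position, with an
       exact-case comparison, so the leading 'Jr' in 'JrgensenJr' is
       replaced as well as the trailing one.                         */
    via_tranwrd = tranwrd(lname, 'Jr', '');

    /* Write the three values to the log for comparison. */
    put lname= via_compress= via_tranwrd=;
    datalines;
SmithJr
JrgensenJr
Parker
;
run;
```

Neither function can anchor the match to the end of the value, which is why the regular-expression functions are a better fit here.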
墨落画卷 2025-01-29 08:55:51

I think your PRXCHANGE approach is fine; to me it is the most reliable and the easiest to maintain. I would change just two things:

  1. Use the 'o' modifier so that the regex is compiled only once.
  2. Use STRIP instead of TRIM (STRIP removes both leading and trailing blanks).
data want;
    set have;
    attrib lname2 format=$50.;
    lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/oi', 1, strip(lname));
run;
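
The pattern above covers Jr, Sr, III and IV, while the sample data in the question also contains IIII and 4th. One possible extension, offered as a sketch rather than a definitive rule, is shown below. The suffix list (Jr, Sr, two to four I's, IV, and a number followed by st, nd, rd or th) is an assumption about which suffixes occur in the data, and the `length` statement is my own choice for sizing `lname2`.

```sas
data want;
    set have;
    length lname2 $ 50;

    /* Assumed suffix list: jr, sr, ii to iiii, iv, or digits followed by
       st/nd/rd/th. The $ anchors the match to the end of the stripped
       value and the i modifier ignores case. An empty replacement
       deletes the matched suffix.                                      */
    lname2 = prxchange('s/(jr|sr|i{2,4}|iv|\d+(st|nd|rd|th))$//i', 1, strip(lname));
run;
```

With the sample rows from the question, this should leave only the Smith part of each name in its original case. A genuine surname that happens to end in one of these letter combinations would be shortened too, so it is worth reviewing the rows where `lname2` differs from `lname` before relying on the result.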