How can I use SAS to remove a generational suffix from a last name (if present)?
In my dataset, the last name (lname) occasionally has the generational suffix attached. Regarding the generational suffix:
- there are no spaces or other possible delimiters between the lname variable and the suffix
- the suffix ranges between 2 and 4 characters in length
- the suffix is a mix of lowercase, uppercase, and proper case
- the suffix sometimes includes a combination of integers and characters
I tried to think simple solutions first. I couldn't think of any using Excel because all of their string solutions require having a consistent position of the values to be removed.
In SAS, PARSE requires a delimiter, and TRIM requires a consistent position.
In the syntax I've attached are four different approaches I tried. None of them were successful, and I totally admit user error. I'm not familiar with any of them other than COMPRESS, and then only for removing blanks.
Is there a way I can make a new variable for last name that doesn't have the generational suffix attached?
Thank you so much!
This first piece applies to each of my attempts.
data have;
/* :$20. reads up to 20 characters; the default length of 8 would truncate smithIIII */
input id lname :$20. fname :$20.;
datalines;
123456 Smith John
234567 SMITH ANDREW
345678 SmithJr Alan
456789 SMITHSR SAM
789012 smithiii robert
890123 smithIIII william
901234 Smith4th Tim
;
run;
My attempts start here.
/* COMPRESS */
data want;
set have;
lname2 = compress(lname,'Jr');
put string=;
run;
/* TRANWRD */
data want;
set have;
lname2 = tranwrd(lname,"Jr", "");
lname2 = tranwrd(lname,"Sr", "");
lname2 = tranwrd(lname,"III", "");
run;
/* PRXCHANGE */
data want;
set have;
lname2 = lname;
lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/i',1,trim(lname));
run;
/* PRXMATCH */
data want;
set have;
if prxmatch('/Jr|Sr|III/',lname) then lname2 = '';
run;
Comments (2)
I think you're fine with your prxchange method; for me it's the most reliable and the easiest to maintain. I would just change 2 things:
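The two changes themselves are not reproduced here. Independently of them, here is a minimal sketch of what a PRXCHANGE-based step could look like for the sample data in the question. The suffix list (jr, sr, two to four I's, iv, and ordinals such as 4th), the variable name lname2, and the length of 20 are all assumptions made for illustration, not something taken from the answer.

```sas
/* Sample data from the question; :$20. avoids the default 8-character truncation */
data have;
    input id lname :$20. fname :$20.;
    datalines;
123456 Smith John
345678 SmithJr Alan
456789 SMITHSR SAM
789012 smithiii robert
890123 smithIIII william
901234 Smith4th Tim
;
run;

data want;
    set have;
    length lname2 $ 20;
    /* Assumed suffix list: jr, sr, ii/iii/iiii, iv, or an ordinal like 4th.   */
    /* $ anchors the match at the end of the string; /i ignores case.          */
    /* STRIP removes the trailing blanks that would otherwise sit before $.    */
    lname2 = prxchange('s/(jr|sr|i{2,4}|iv|\d+(st|nd|rd|th))$//i', 1, strip(lname));
run;
```

Because there is no delimiter between name and suffix, any pattern of this kind will also trim real surnames that happen to end in one of the listed strings (for example a name ending in "sr"), so the output is worth reviewing against the original lname before relying on it.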