Substring extraction with multiple options

Posted 2025-01-11 17:48:17

I have a variable in Stata in my dataset that looks like this:

city
Washington city
Boston city
El Paso city
Nashville-Davidson metropolitan government (balance)
Lexington-Fayette urban county

And I want it to look like:

city
Washington
Boston
El Paso
Nashville-Davidson
Lexington-Fayette

"city," "county," and "urban county" are the only three words that follow after a city name.
In other words, I want to extract the substring from the left up to the space before city, county, or urban.

The only approach I can think of uses subinstr:

replace city = subinstr(city, " city", "", .)

I don't think, however, that I can add multiple options here.

活雷疯 2025-01-18 17:48:17

I used subinstr to replace the desired words with empty strings, and trim to remove additional spaces.

input str60(city)

"Washington city"
"Boston city"
"El Paso city"
"Lexington-Fayette urban county"
"Audacity"

end

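* Remove the longest label first, so that removing " county" does not leave "urban" behind
gen     wanted = subinstr(city,"urban county","",1)
replace wanted = subinstr(wanted," county","",1)
replace wanted = subinstr(wanted," city","",1)

* Strip the leading and trailing blanks left over after the removals
replace wanted = trim(wanted)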

list

     +----------------------------------------------------+
     |                           city              wanted |
     |----------------------------------------------------|
  1. |                Washington city          Washington |
  2. |                    Boston city              Boston |
  3. |                   El Paso city             El Paso |
  4. | Lexington-Fayette urban county   Lexington-Fayette |
  5. |                       Audacity            Audacity |
     +----------------------------------------------------+

Edit: As suggested, I have incorporated a leading space so that places with "city" in their name (e.g. Audacity) are not inadvertently replaced. The same for "county" (although this seems less likely).
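
The same sequence of replacements can also be written as a loop, which may be easier to extend if more labels turn up. This is only a sketch: the local macro name suffix is illustrative, and the list holds just the three labels named in the question. The order matters, because removing " county" before " urban county" would leave "urban" behind.

```stata
clear
input str60 city
"Washington city"
"Boston city"
"El Paso city"
"Lexington-Fayette urban county"
"Audacity"
end

gen wanted = city

* Longest label first; the leading space is added inside subinstr()
* so that names such as "Audacity" are left alone
foreach suffix in "urban county" "county" "city" {
    replace wanted = subinstr(wanted, " `suffix'", "", 1)
}

* Remove any blanks left at either end
replace wanted = trim(wanted)

list city wanted
```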

我们只是彼此的过ke 2025-01-18 17:48:17

split could be one way to do this.

split city, parse(" city" " urban" " county") limit(1)
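
As far as I know, split leaves the original variable untouched and writes the pieces to new variables named after it, so with limit(1) the cleaned name should end up in city1. A minimal sketch on the sample data, where wanted is just an illustrative name for the result:

```stata
clear
input str60 city
"Washington city"
"Boston city"
"El Paso city"
"Lexington-Fayette urban county"
"Audacity"
end

* Keep only the piece before the first separator found
split city, parse(" city" " urban" " county") limit(1)

* The piece is stored in city1; rename it for readability
rename city1 wanted

list city wanted
```

Because each separator starts with a space, a name like "Audacity" contains no separator and should be carried over unchanged.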

酒解孤独 2025-01-18 17:48:17

I think using regular expression replacement to search for a space followed by a relevant substring would be the most flexible option here. For example:

clear
input str60(city)

"Washington city"
"Boston city"
"El Paso city"
"Lexington-Fayette urban county"
"Audacity"
"Salt Lake City city"

end

gen clean_city = ustrregexra(city, "\s(city|county|urban county)","")


     +----------------------------------------------------+
     |                           city          clean_city |
     |----------------------------------------------------|
  1. |                Washington city          Washington |
  2. |                    Boston city              Boston |
  3. |                   El Paso city             El Paso |
  4. | Lexington-Fayette urban county   Lexington-Fayette |
  5. |                       Audacity            Audacity |
  6. |            Salt Lake City city      Salt Lake City |
     +----------------------------------------------------+
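
A possible refinement, if the label always sits at the very end of the string, is to anchor the pattern with $ so that a match in the middle of a name is never removed. The sketch below also adds an alternative for the "metropolitan government (balance)" row from the question's sample data; that extra alternative is my assumption about what the asker wants for that row, not part of the answer above.

```stata
clear
input str60 city
"Washington city"
"Boston city"
"El Paso city"
"Nashville-Davidson metropolitan government (balance)"
"Lexington-Fayette urban county"
"Audacity"
"Salt Lake City city"
end

* \s+ : one or more whitespace characters before the label
* $   : the label must end the string
* \( and \) match the literal parentheses in "(balance)"
gen clean_city = ustrregexra(city, "\s+(urban county|county|city|metropolitan government \(balance\))$", "")

list city clean_city
```

The match is case-sensitive, so the capitalised "City" in "Salt Lake City city" is kept and only the trailing lowercase " city" is dropped.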
