Match everything that is NOT a number followed by a letter

Posted 2024-09-11 15:42:33

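I apologize if this has been answered elsewhere. I did some searching but couldn't find an answer.

Say I have a text file with a bunch of content. Somewhere in that content is an occupation code, which is always in the format of a number followed by a capital letter.

How would I extract ONLY the occ codes from the file? In plain English, I want to delete everything in the file that does not match the number-then-capital-letter pattern.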

谈情不如逗狗 2024-09-18 15:42:33

You can match the codes with /(\d+[A-Z])/
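As a minimal sketch of that pattern in use (Python is my choice here, since the question does not name a language), `re.findall` returns every run of digits followed by one uppercase letter. The helper name `extract_codes` and the sample text are made up for illustration, and `[0-9]` is used instead of `\d` so that only ASCII digits match.

```python
import re

# One or more ASCII digits followed by a single uppercase ASCII letter.
CODE_RE = re.compile(r"[0-9]+[A-Z]")

def extract_codes(text):
    """Return every occupation code found in text, in order of appearance."""
    return CODE_RE.findall(text)

# Hypothetical sample content; plain numbers without a letter are ignored.
sample = "id 4471 clerk 123A, rate 3.5, welder 45B"
print(extract_codes(sample))  # ['123A', '45B']
```

Collecting the matches sidesteps having to describe "everything else" in a regex: whatever is not returned is effectively discarded.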

ι不睡觉的鱼゛ 2024-09-18 15:42:33

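A simple solution is to write a script that scans the file line by line or word by word, depending on how the occ codes appear in it, checks for matches (possibly with a regex), and writes them to another file.

You could instead run a single regex match over the whole document and iterate over the results, but that may cause problems depending on the size of the file.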
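The line-by-line variant might look like the sketch below. It assumes a code never spans a line break, and the function name `copy_codes` is hypothetical. It takes any two file-like objects, so the demo uses in-memory `io.StringIO` buffers; with real files you would pass the results of `open(...)` instead.

```python
import io
import re

CODE_RE = re.compile(r"[0-9]+[A-Z]")

def copy_codes(src, dst):
    """Write each code found in src to dst, one per line.

    Returns the number of codes written.
    """
    count = 0
    for line in src:  # only one line is held in memory at a time
        for match in CODE_RE.finditer(line):
            dst.write(match.group(0) + "\n")
            count += 1
    return count

# In-memory stand-ins for the input and output files.
src = io.StringIO("clerk 123A\nno code here\nwelder 45B and 9C\n")
dst = io.StringIO()
print(copy_codes(src, dst))  # 3
print(dst.getvalue(), end="")
```

Because the file is consumed one line at a time, memory use stays flat regardless of file size, which is the concern with matching against the whole document at once.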

梦在深巷 2024-09-18 15:42:33

Here's a crude attempt to remove everything except the desired codes using sed. (Note that I interpret "number" to mean a string of one or more digits, no decimal point or leading minus sign.)

sed -e 's/\([A-Z]\)[^0-9]*/\1/g' -e 's/[0-9]*[^0-9A-Z][^0-9A-Z]*//g' -e 's/[0-9]*$//' -e '/^$/d' < filename

The first command removes anything after a capital letter that isn't a number (and therefore perhaps the beginning of another code), the second removes any number followed by something other than a capital letter, the third removes trailing numbers and the fourth removes blank lines.

I've run some tests and this seems to work pretty well. I'll happily amend it if anyone can find a case where it fails.
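For anyone who wants to hunt for failing cases, the four steps described above can be traced in Python with `re.sub`. The patterns below are my reading of the prose description, so treat them as an assumption rather than a transcription of the sed command. The function name `keep_codes` and the sample lines are made up.

```python
import re

def keep_codes(line):
    """Apply the described deletions in order to one line (no trailing newline)."""
    # 1. After a capital letter, drop everything up to the next digit.
    line = re.sub(r"([A-Z])[^0-9]*", r"\1", line)
    # 2. Drop optional digits followed by at least one character that is
    #    neither a digit nor a capital letter.
    line = re.sub(r"[0-9]*[^0-9A-Z]+", "", line)
    # 3. Drop digits left at the end of the line.
    line = re.sub(r"[0-9]*$", "", line)
    return line

for text in ["abc 12A def 34 ghi 56B 78", "Hello 12A", "no codes 99"]:
    result = keep_codes(text)
    if result:  # 4. skip lines that end up empty
        print(result)
```

With these inputs the first line comes out as 12A56B, because nothing is inserted between neighbouring codes, and "Hello 12A" comes out as H12A, because a capital letter with no digits in front of it is never removed. The third line is dropped entirely.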
