Removing strings from C source code

Posted on 2024-08-01 17:52:25

落花随流水 2024-08-08 17:52:25

All of the tokens in C (and most other programming languages) are "regular". That is, they can be matched by a regular expression.

A regular expression for C strings:

"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"

The regex isn't too hard to understand. Basically a string literal is a pair of double quotes surrounding a bunch of:

non-special (non-quote/backslash/newline) characters
escapes, which start with a backslash and then consist of one of:
- a simple escape character
- 1 to 3 octal digits
- x and 1 or more hex digits
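
As a quick check of that breakdown, here is a minimal sketch that rewrites the same regex with re.VERBOSE so each alternative carries a label, then tries it on a few lines. The name C_STRING and the sample lines are made up for illustration; the pattern itself is the one given above, with non-capturing groups.

```python
import re

# The regex from above, spelled out with re.VERBOSE so each part is labeled.
C_STRING = re.compile(r'''
    "                           # opening quote
    (?: [^"\\\n]                # ordinary character
      | \\ (?: ['"?\\abfnrtv]   # simple escape
             | [0-7]{1,3}       # octal escape: 1 to 3 digits
             | x[0-9a-fA-F]+    # hex escape: x plus 1 or more digits
           )
    )*
    "                           # closing quote
''', re.VERBOSE)

# Hypothetical sample lines; the last one uses an escape C89 does not define.
samples = [
    r'printf("hello\n");',
    r'puts("say \"hi\"");',
    r'x = "\x41\101";',
    r'bad = "\q";',
]

for s in samples:
    m = C_STRING.search(s)
    print(s, '->', m.group(0) if m else None)
```

The last sample finds no match because \q is not one of the listed escapes, so the regex refuses to treat that text as a string literal.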

This is based on sections 6.1.4 and 6.1.3.4 of the C89/C90 spec. If anything else crept in with C99, this won't catch it, but that shouldn't be hard to fix.
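
One thing C99 did add is universal character names, written \u plus four hex digits or \U plus eight. A possible fix, sketched under the assumption that those are the only escapes you need beyond the C89 set, is to add two alternatives to the escape group. The names ESCAPE, C99_STRING and strip_strings are invented for this example.

```python
import re

# C89 escapes from the regex above, plus C99 universal character names
# (\uXXXX and \UXXXXXXXX). Whether your sources need these is an assumption.
ESCAPE = (r'''\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+'''
          r'''|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})''')
C99_STRING = re.compile(r'"(?:[^"\\\n]|' + ESCAPE + r')*"')

def strip_strings(line):
    """Remove every string literal the extended regex recognizes."""
    return C99_STRING.sub('', line)

print(strip_strings(r'puts("caf\u00e9");'))
print(strip_strings(r'puts("\U0001F600 hi");'))
```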

Here's a Python script to filter a C source file, removing string literals:

import re, sys
# Matches one complete C string literal, escapes included.
regex = re.compile(r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')
for line in sys.stdin:
  # print() is a function in Python 3.
  print(regex.sub('', line.rstrip('\n')))

EDIT:

It occurred to me after I posted the above that while it is true that all C tokens are regular, by not tokenizing everything we leave an opening for trouble. In particular, if a double quote shows up in what should be another token, we can be led down the garden path. You mentioned that comments have already been stripped, so the only other thing we really need to worry about is character literals (though the approach I'm going to use can easily be extended to handle comments as well). Here's a more robust script that handles character literals:

import re, sys
str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

# Match either kind of literal so a quote inside one cannot start the other.
regex = re.compile('|'.join([str_re, chr_re]))

def repl(m):
  m = m.group(0)
  if m.startswith("'"):
    # Character literal: keep it unchanged.
    return m
  else:
    # String literal: drop it.
    return ''
for line in sys.stdin:
  print(regex.sub(repl, line.rstrip('\n')))

Essentially we're finding string and character literal tokens, then leaving char literals alone but stripping out string literals. The char literal regex is very similar to the string literal one.
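
To illustrate both the garden-path problem and the extension to comments mentioned above, here is a rough sketch. It assumes the whole file is processed at once rather than line by line (a block comment can span lines) and that // comments should be recognized too. The names cmt_re, token_re, naive_re and strip_strings are invented for this example.

```python
import re

str_re = r'''"(?:[^"\\\n]|\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'(?:[^'\\\n]|\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""
# Block comments and // line comments (the latter are an assumption here).
cmt_re = r'/\*.*?\*/|//[^\n]*'

# DOTALL lets ".*?" in the block-comment pattern cross line boundaries.
token_re = re.compile('|'.join([str_re, chr_re, cmt_re]), re.DOTALL)
# The string-only regex, kept for comparison.
naive_re = re.compile(str_re)

def strip_strings(source):
    """Drop string literals; keep char literals and comments untouched."""
    def repl(m):
        tok = m.group(0)
        return '' if tok.startswith('"') else tok
    return token_re.sub(repl, source)

src = 'char q = \'"\'; /* a "quoted" word */ puts("hi"); // say "bye"\n'
print(strip_strings(src), end='')

# The string-only regex starts matching at the quote inside the char literal.
print(naive_re.sub('', 'char q = \'"\'; puts("hi");'))
```

Because the leftmost match wins, a quote inside a comment or char literal is consumed as part of that token and never gets a chance to open a string.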

晨光如昨 2024-08-08 17:52:25

You can download the source code to StripCmt (.tar.gz - 5kB). It's trivially small, and shouldn't be too difficult to adapt to stripping strings instead (it's released under the GPL).

You might also want to investigate the official lexical language rules for C strings. I found this very quickly, but it might not be definitive. It defines a string as:

stringcon ::= "{ch}", where ch denotes any printable ASCII character (as specified by isprint()) other than " (double quotes) and the newline character.
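
Taken literally, that rule has no notion of escape sequences, so it is worth checking against a string that contains \" before relying on it. The sketch below is one reading of the rule as a regex, assuming isprint() means the printable ASCII range 0x20 to 0x7E; the name STRINGCON is made up for the example.

```python
import re

# ch: printable ASCII (0x20-0x7E) except the double quote (0x22).
# Newline is already outside that range, so it needs no special case.
STRINGCON = re.compile(r'"[\x20\x21\x23-\x7E]*"')

# A plain string is removed as expected.
print(STRINGCON.sub('', 'puts("hi");'))

# An escaped quote ends the match early, leaving debris behind.
print(STRINGCON.sub('', r'puts("say \"hi\"");'))
```

The second call leaves puts(hi\); behind, which suggests the rule describes a simplified string syntax rather than full C.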

旧人哭 2024-08-08 17:52:25

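In Python, using pyparsing:

import sys
from pyparsing import dblQuotedString

# Read the C source file named on the command line.
with open(sys.argv[1]) as f:
  source = f.read()
# Replace every double-quoted string with the empty string.
dblQuotedString.setParseAction(lambda: "")
print(dblQuotedString.transformString(source))

This also prints to stdout.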

无边思念无边月 2024-08-08 17:52:25

In Ruby:

#!/usr/bin/ruby
f=open(ARGV[0],"r")
s=f.read
puts(s.gsub(/"(\\(.|\n)|[^\\"\n])*"/,""))
f.close

This prints to standard output.
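
For comparison with the Python answers in this thread, here is the same pattern sketched in Python. It is looser than a strict C89 regex: any character may follow a backslash, including a newline, so a string continued across lines with backslash-newline is removed in one piece as long as the whole file is processed at once. The names LOOSE_STRING and strip_strings are invented for the example.

```python
import re

# Same shape as the Ruby pattern: a backslash followed by any character
# (newline included), or any character other than backslash, quote, newline.
LOOSE_STRING = re.compile(r'"(?:\\(?:.|\n)|[^\\"\n])*"')

def strip_strings(source):
    """Remove string literals from a whole source text."""
    return LOOSE_STRING.sub('', source)

# A string literal continued onto the next line with backslash-newline.
src = 'puts("one \\\ntwo");\n'
print(strip_strings(src), end='')

# Unknown escapes such as \q are accepted by this looser pattern.
print(strip_strings(r'x = "\q";'))
```

The trade-off is that this version does not validate escapes, and like the Ruby one-liner it does not protect character literals such as '"'.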
