How can I use grep to unescape non-US, non-English, non-ASCII characters?
I am using grep to parse a friend list obtained via the Facebook Open Graph API. I am mostly able to do what I want with the following command, issued in bash:
grep -aiPo '"name":"(.*?)","id":"[[:digit:]]*"' friends?blahblah-access-token-stuff
which yields a list that looks like:
"name":"John Day","id":"--id omitted--"
"name":"Andria Cast\u00f1eda","id":"--id omitted--" // let me draw your attention here
"name":"Jane Doe","id":"--id omitted--"
Names were changed above to preserve privacy.
If you look, there is an escape sequence in the middle entry that has not been decoded; it corresponds to an n with a tilde (ñ). Is there an easy way to feed such characters into a Java program (my primary intention) so that Java understands that \u00f1 is Unicode-speak for ñ?
I would prefer not to solve this problem by parsing the string in Java and manually unescaping the Unicode. I would very much prefer to instruct grep to handle this situation, or another GNU or open source tool that is widely available for bash.
At that point, I would feed the entire input as a file to a Java program without having to worry about "OMG, is that a Unicode escape sequence!?" Java would naturally detect the Unicode characters and map them to their corresponding internal representation.
Thanks in advance!
Java understands Unicode. You write Unicode escapes in Java as \uXXXX, where XXXX is four hexadecimal digits.
So if you pass a string such as
"Andria Cast\u00f1eda"
which contains an escape sequence, it should be handled correctly without any additional handling. Here's also a very brief but easy-to-understand introduction:
Unicode in Java
If you're still not convinced, try this class:
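A minimal sketch of such a class, not necessarily the one originally posted; the names UnicodeEscapeDemo and sampleName are made up for illustration. It shows that an escape written in a Java source literal is translated by the compiler into the single character ñ. Text read from a file at runtime does not pass through the compiler, so this sketch does not cover that case.

```java
import java.io.PrintStream;

// Hypothetical demo class; the name is an assumption made for illustration.
class UnicodeEscapeDemo {

    // The escape in this literal is translated at compile time,
    // so the returned string holds the single character U+00F1.
    static String sampleName() {
        return "Andria Cast\u00f1eda";
    }

    public static void main(String[] args) throws Exception {
        String name = sampleName();

        // Print through a UTF-8 stream so the output does not depend on
        // the platform default encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(name);

        // Number of chars in the string; the escape counts as one char.
        out.println(name.length());

        // Numeric value of the character that follows "Cast".
        out.println((int) name.charAt(11));
    }
}
```

If the terminal is set to UTF-8, the first line printed should show the name with a real ñ rather than the six-character escape.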