How do I handle escaped non-US, non-English, non-ASCII characters with grep?

Posted 10-14 07:33

I am using grep to parse a friend list obtained via the Facebook Open Graph API. The following command, issued in bash, does most of what I want:

grep -aiPo '"name":"(.*?)","id":"[[:digit:]]*"' friends?blahblah-access-token-stuff

It yields a list that looks like this:

"name":"John Day","id":"--id omitted--"
"name":"Andria Cast\u00f1eda","id":"--id omitted--" // let me draw your attention here
"name":"Jane Doe","id":"--id omitted--"

(Names above were changed to preserve privacy.)
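For readers who end up doing the extraction on the Java side, here is a rough sketch of the same pattern using java.util.regex. The class name FriendNameSketch, the names method, and the sample JSON shape with made-up ids are assumptions for illustration only, not part of the original command or of the Facebook API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FriendNameSketch {

    // Same shape as the grep pattern; \d* stands in for [[:digit:]]*,
    // and CASE_INSENSITIVE mirrors grep's -i flag.
    private static final Pattern ENTRY =
            Pattern.compile("\"name\":\"(.*?)\",\"id\":\"(\\d*)\"", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects the captured name of every match.
    public static List<String> names(String json) {
        List<String> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(json);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input; the ids are invented placeholders.
        String sample = "{\"data\":[{\"name\":\"John Day\",\"id\":\"1\"},"
                + "{\"name\":\"Jane Doe\",\"id\":\"2\"}]}";
        System.out.println(names(sample)); // prints [John Day, Jane Doe]
    }
}
```

Like the grep version, this treats the JSON as plain text, so any escape sequence inside a name comes through unchanged.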

Notice that the middle entry contains an escape sequence, \u00f1, which corresponds to ñ (n with a tilde). Is there an easy way to feed such characters into a Java program (my primary goal) so that Java understands that \u00f1 is Unicode-speak for ñ?

I would prefer not to solve this problem by parsing the string in Java and manually unescaping the Unicode. I would much rather instruct grep to handle this situation, or use another GNU or open source tool that is widely available for bash.

At that point, I would feed the entire input as a file to the Java program without having to worry about "OMG, is that a Unicode escape sequence?!" Java would naturally detect the Unicode characters and map them to their corresponding internal representation.

Thanks in advance!

守望孤独2024-10-21 07:33:47

Java understands Unicode. In Java source code, you write a Unicode escape like this:

String str = "\u00F6";

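So a string literal such as "Andria Cast\u00f1eda" in Java source is handled correctly without any additional work. Note that the compiler performs this translation at compile time; the same characters read from a file at run time stay a literal backslash sequence until something decodes them.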

Here's also a very brief, but easy to understand introduction:

Unicode in Java

If you're still not convinced, try this class:

public class UnicodeExample {

    public static void main(String[] args) {
        // The compiler turns the escape into a single character.
        String escaped = "\u00f1";
        // The literal character; the source file encoding must match the compiler's.
        String unescaped = "ñ";
        System.out.println(escaped);
        System.out.println(unescaped);

        if (escaped.equals(unescaped)) {
            System.out.println("The strings are the same!");
        } else {
            System.out.println("The strings are different!");
        }
    }
}
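
One caveat is worth illustrating: the compiler translates Unicode escapes only in source code, so text that arrives at run time (for instance, grep output saved to a file) still contains the literal backslash sequence. Below is a minimal sketch of decoding it, assuming only simple \uXXXX escapes appear in the input. The class name UnescapeSketch and the unescape helper are hypothetical, and the sketch does not handle other JSON escapes or an escaped backslash followed by "u".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeSketch {

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape found in run-time text
    // with the UTF-16 code unit it names.
    public static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Two backslashes in source yield one backslash at run time,
        // which mimics what a program would read from the grep output.
        String fromFile = "Andria Cast\\u00f1eda";
        // Here the compiler itself translates the escape.
        String literal = "Andria Cast\u00f1eda";

        System.out.println(fromFile.equals(literal));           // prints false
        System.out.println(unescape(fromFile).equals(literal)); // prints true
    }
}
```

A full JSON parser would cover the remaining escape forms; the regex approach is only meant to show where the decoding step has to happen.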
