How do I anonymise the data in selected XML tags?
My question is as follows:
I have to read a big XML file (50 MB) and anonymise some tags/fields that hold private data, such as name, surname, address, email, phone number, etc.
I know exactly which tags in XML are to be anonymised.
s|<a>alpha</a>|MD5ed(alpha)|e;
s|<h>beta</h>|MD5ed(beta)|e;
where alpha and beta stand for any characters inside the tags, which will be hashed, probably with an algorithm like MD5.
I will only convert the tag value, not the tags themselves.
I hope I have described my problem clearly enough. How do I achieve this?
You have to do something like the following in Python.
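A minimal sketch of one way to do this with Python's standard library (not necessarily the code this answer had in mind). The tag names in `SENSITIVE_TAGS` and the helper names are placeholders chosen for illustration, and the sketch assumes each sensitive element contains plain text only.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical tag names; replace them with the tags you need to anonymise.
SENSITIVE_TAGS = {"name", "email", "phone"}


def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def anonymise(xml_text: str) -> str:
    """Hash the text of every sensitive element, leaving the tags intact."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in SENSITIVE_TAGS and elem.text:
            elem.text = md5_hex(elem.text)
    return ET.tostring(root, encoding="unicode")


sample = "<person><name>alpha</name><city>Paris</city></person>"
result = anonymise(sample)
print(result)
```

For a file on disk, `ET.parse(path)` followed by `tree.write(out_path)` follows the same pattern; note that it loads the whole document into memory.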
Using regexps is indeed dangerous unless you know the exact format of the file, that format is easy to parse with regexps, and you are sure it will not change in the future.
Otherwise you could indeed use XML::Twig, as below. An alternative would be to use XML::LibXML, although the file might be a bit too big to load entirely into memory (then again, maybe not; memory is cheap these days), so you might have to use the pull mode, which I don't know much about.
Compact XML::Twig code:
Bottom line: don't parse XML using regex.
Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).
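As a rough illustration of the XPath idea, here is a sketch in Python. It assumes a hypothetical document layout (`order`, `customer`, `product`) and relies on `xml.etree.ElementTree`, which supports only a subset of XPath. The point is that a path expression can target `name` under `customer` while leaving `name` under `product` alone.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical paths for illustration; adjust them to your own schema.
SENSITIVE_PATHS = [".//customer/name", ".//customer/email"]


def hash_by_xpath(root, paths):
    """Replace the text of every element matched by the given paths with its MD5 digest."""
    for path in paths:
        for elem in root.findall(path):
            if elem.text:
                elem.text = hashlib.md5(elem.text.encode("utf-8")).hexdigest()


doc = ET.fromstring(
    "<order>"
    "<customer><name>alpha</name><email>a@example.com</email></customer>"
    "<product><name>widget</name></product>"
    "</order>"
)
hash_by_xpath(doc, SENSITIVE_PATHS)
output = ET.tostring(doc, encoding="unicode")
print(output)
```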
As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.
Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.
Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).
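To give an idea of the extra bookkeeping a stream-oriented approach needs, here is a sketch using Python's `xml.sax` rather than the Perl modules named above. `HashingFilter`, `anonymise_stream`, and the tag names are hypothetical, and the sketch assumes sensitive elements contain only text (no child elements). Text is buffered because a SAX parser may deliver one element's text in several `characters` calls.

```python
import hashlib
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

SENSITIVE_TAGS = {"name", "email"}  # hypothetical tag names


class HashingFilter(XMLFilterBase):
    """Pass SAX events through unchanged, except text inside sensitive tags."""

    def __init__(self, parent, sensitive_tags):
        super().__init__(parent)
        self._sensitive = sensitive_tags
        self._buffer = None  # list of text pieces while inside a sensitive tag

    def startElement(self, name, attrs):
        if name in self._sensitive:
            self._buffer = []
        super().startElement(name, attrs)

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)  # hold the text back until the tag closes
        else:
            super().characters(content)

    def endElement(self, name):
        if self._buffer is not None and name in self._sensitive:
            text = "".join(self._buffer)
            self._buffer = None
            if text:
                digest = hashlib.md5(text.encode("utf-8")).hexdigest()
                super().characters(digest)
        super().endElement(name)


def anonymise_stream(source, out):
    """Read XML from `source` and write the anonymised XML to the text stream `out`."""
    reader = HashingFilter(xml.sax.make_parser(), SENSITIVE_TAGS)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(source)


out = io.StringIO()
anonymise_stream(io.BytesIO(b"<p><name>alpha</name><city>Paris</city></p>"), out)
streamed = out.getvalue()
print(streamed)
```

Because events are written out as they arrive, memory use stays small regardless of file size; `source` can also be a file name or an open file.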