How do I anonymise the XML data of selected tags?

Posted on 2024-07-13 17:09:37

My question is as follows:

I have to read a large XML file (50 MB) and anonymise some tags/fields that hold private data, such as name, surname, address, email, phone number, etc.

I know exactly which tags in XML are to be anonymised.

 s|<a>alpha</a>|MD5ed(alpha)|e;
 s|<h>beta</h>|MD5ed(beta)|e;

where alpha and beta stand for any characters inside the tags, which will be hashed, probably with an algorithm like MD5.

I will only convert the tag values, not the tags themselves.

I hope I am clear enough about my problem. How do I achieve this?

暖阳 2024-07-20 17:09:37

You have to do something like the following in Python.

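import hashlib
import xml.etree.ElementTree as ET  # or lxml, which offers a compatible API

the_doc = ET.parse("sample.xml")
for alpha_tag in the_doc.findall("xpath/to/tag"):
    print(alpha_tag.tag, alpha_tag.text)
    if alpha_tag.text is not None:
        # hashlib.md5() needs bytes, so encode the text first
        alpha_tag.text = hashlib.md5(alpha_tag.text.encode("utf-8")).hexdigest()
ET.dump(the_doc)  # writes the modified document to stdout
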
伪心 2024-07-20 17:09:37

Using regexps is indeed dangerous unless you know the exact format of the file, that format is easy to parse with regexps, and you are sure it will not change in the future.

Otherwise you could indeed use XML::Twig, as below. An alternative would be XML::LibXML, although the file might be a bit big to load entirely into memory (then again, maybe not; memory is cheap these days), so you might have to use its pull mode, which I don't know much about.

Compact XML::Twig code:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;
use Digest::MD5 'md5_base64';

my @tags_to_anonymize= qw( name surname address email phone);

# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers= map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;

XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
         ->parsefile( "my_big_file.xml")
         ->flush;
幻想少年梦 2024-07-20 17:09:37

Bottom line: don't parse XML using regex.

Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).
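
As a rough illustration of that approach in Python, the sketch below parses the document with the standard-library ElementTree, selects elements with its limited XPath support, and replaces their text with an MD5 hex digest. The tag names and the sample document are made up for the example, and it assumes the selected elements contain only text (no child elements).

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical tag names; replace them with the tags you need to anonymise.
TAGS_TO_ANONYMISE = ("name", "email")


def anonymise(xml_text, tags=TAGS_TO_ANONYMISE):
    """Return xml_text with the text of every matching element MD5-hashed."""
    root = ET.fromstring(xml_text)
    for tag in tags:
        # ".//tag" selects every descendant element with that name
        for elem in root.findall(".//" + tag):
            if elem.text:
                elem.text = hashlib.md5(elem.text.encode("utf-8")).hexdigest()
    return ET.tostring(root, encoding="unicode")


# Made-up sample document
sample = (
    "<people><person><name>Alice</name>"
    "<email>a@example.com</email><age>30</age></person></people>"
)
result = anonymise(sample)
print(result)
```

This loads the whole tree into memory, which should be manageable for a 50 MB file on most machines but will not scale to much larger inputs.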

国际总奸 2024-07-20 17:09:37

As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.
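
To make the risk concrete, here is a small Python sketch (the sample documents are invented) that hashes the content of `<a>` once with a naive regex and once with a parser. The three inputs mean the same thing to an XML parser, but the regex only matches the first spelling, so the other two keep the private value in clear text.

```python
import hashlib
import re
import xml.etree.ElementTree as ET


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# The same element written three ways that an XML parser treats alike.
variants = [
    "<r><a>alpha</a></r>",
    '<r><a id="1">alpha</a></r>',  # attribute added
    "<r><a >alpha</a ></r>",       # whitespace inside the tags
]

pattern = re.compile(r"<a>(.*?)</a>")

results = []
for doc in variants:
    # Regex approach: only matches the exact spelling "<a>...</a>"
    by_regex = pattern.sub(lambda m: "<a>" + md5_hex(m.group(1)) + "</a>", doc)

    # Parser approach: finds the element however it is written
    root = ET.fromstring(doc)
    for elem in root.iter("a"):
        elem.text = md5_hex(elem.text)
    by_parser = ET.tostring(root, encoding="unicode")

    results.append(("alpha" in by_regex, "alpha" in by_parser))
    print("regex leaked:", "alpha" in by_regex,
          "| parser leaked:", "alpha" in by_parser)
```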

Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.

Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).
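
For comparison, a stream-oriented version can be sketched with Python's standard `xml.sax` module rather than XML::SAX::Machines. The `HashingFilter` class and `anonymise_stream` function are names invented for this example, and the sketch assumes the selected tags hold only text, with no nested elements. It shows the extra bookkeeping mentioned above: the filter has to collect the text chunks itself before it can hash them.

```python
import hashlib
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class HashingFilter(XMLFilterBase):
    """Pass SAX events through, replacing text in selected tags with its MD5."""

    def __init__(self, parent, tags):
        super().__init__(parent)
        self._tags = set(tags)
        self._buffer = None  # list of text chunks while inside a selected tag

    def startElement(self, name, attrs):
        if name in self._tags:
            self._buffer = []
        super().startElement(name, attrs)

    def characters(self, content):
        if self._buffer is not None:
            # SAX may deliver one text node in several chunks, so collect them
            self._buffer.append(content)
        else:
            super().characters(content)

    def endElement(self, name):
        if self._buffer is not None and name in self._tags:
            text = "".join(self._buffer)
            self._buffer = None
            if text:
                digest = hashlib.md5(text.encode("utf-8")).hexdigest()
                super().characters(digest)
        super().endElement(name)


def anonymise_stream(source, out, tags):
    """Read XML from source and write the anonymised XML to out."""
    filt = HashingFilter(xml.sax.make_parser(), tags)
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(source)


# Made-up sample; for a real file, pass a filename and an open output file.
src = io.BytesIO(b"<people><person><name>Alice</name><age>30</age></person></people>")
out = io.StringIO()
anonymise_stream(src, out, ["name"])
result = out.getvalue()
print(result)
```

Because events are written out as they arrive, memory use stays roughly constant regardless of file size.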
