How do I anonymise the data in selected XML tags?

Posted 2024-07-13 17:09:37

My question is as follows:

I have to read a big XML file (50 MB) and anonymise some tags/fields that hold private data, such as name, surname, address, email, phone number, etc.

I know exactly which tags in XML are to be anonymised.

 s|<a>alpha</a>|MD5ed(alpha)|e;
 s|<h>beta</h>|MD5ed(beta)|e;

where alpha and beta stand for whatever text is inside the tags; that text is what gets hashed, probably with an algorithm like MD5.

I will only convert the tag value, not the tags themselves.

I hope I am clear enough about my problem. How do I achieve this?

暖阳 2024-07-20 17:09:37

You can do something like the following in Python.

import hashlib
import xml.etree.ElementTree as xml # or lxml or whatever

theDoc = xml.parse("sample.xml")
for alphaTag in theDoc.findall("xpath/to/tag"):
    if alphaTag.text is None:
        continue  # empty element, nothing to hash
    print(alphaTag, alphaTag.text)
    # md5() takes bytes in Python 3, so encode the text first
    alphaTag.text = hashlib.md5(alphaTag.text.encode("utf-8")).hexdigest()
xml.dump(theDoc)
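
The same idea can be wrapped in a small function that handles several tag names at once. This is only a sketch, assuming Python 3 and that the sensitive elements can be picked out by tag name alone; the function name `anonymise_tags` and the sample tag names are made up for illustration and do not come from the question.

```python
import hashlib
import xml.etree.ElementTree as ET


def anonymise_tags(xml_text, tag_names):
    """Return xml_text with the text of each listed element replaced by its MD5 hex digest.

    Hypothetical helper: assumes elements are selected purely by tag name.
    """
    root = ET.fromstring(xml_text)
    wanted = set(tag_names)
    # iter() walks the whole tree, so matching elements are found at any depth
    for elem in root.iter():
        if elem.tag in wanted and elem.text:
            elem.text = hashlib.md5(elem.text.encode("utf-8")).hexdigest()
    # Tags and untouched elements are serialised unchanged
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<people><person><name>Alice</name><age>30</age></person></people>"
    print(anonymise_tags(sample, ["name", "email"]))
```

Only the element text is replaced, so the tags themselves and every element not in the list come through as they were. Note that this version loads the whole document into memory, which should still be workable for a 50 MB file on most machines.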

伪心 2024-07-20 17:09:37

Using regexps is indeed dangerous unless you know the exact format of the file, that format is easy to parse with regexps, and you are sure it will not change in the future.

Otherwise you could indeed use XML::Twig, as below. An alternative would be XML::LibXML, although the file might be a bit too big to load entirely into memory (then again, maybe not; memory is cheap these days), so you might have to use its pull mode, which I don't know much about.

Compact XML::Twig code:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;
use Digest::MD5 'md5_base64';

my @tags_to_anonymize= qw( name surname address email phone);

# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers= map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;

XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
         ->parsefile( "my_big_file.xml")
         ->flush;
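
For readers working in Python rather than Perl, a comparable streaming approach can be sketched with the standard library's SAX modules. This is not XML::Twig or XML::LibXML's pull mode, just a similar idea: events are copied to the output as they arrive, so the whole file is never held in memory. The names `HashingGenerator` and `anonymise_stream` are hypothetical, and the sketch assumes the selected elements contain only text (no child elements).

```python
import hashlib
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class HashingGenerator(XMLGenerator):
    """Copy SAX events to the output, hashing the text of selected elements.

    Assumption: selected elements hold plain text only, no nested elements.
    """

    def __init__(self, out, tag_names):
        super().__init__(out, encoding="utf-8")
        self._tags = set(tag_names)
        self._buffer = None  # collects text chunks while inside a selected element

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name in self._tags:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            # Hold the text back; the parser may deliver it in several chunks
            self._buffer.append(content)
        else:
            super().characters(content)

    def endElement(self, name):
        if self._buffer is not None and name in self._tags:
            text = "".join(self._buffer)
            self._buffer = None
            if text:
                digest = hashlib.md5(text.encode("utf-8")).hexdigest()
                super().characters(digest)
        super().endElement(name)


def anonymise_stream(source, out, tag_names):
    """Parse source (file name or binary file object) and write the result to out."""
    xml.sax.parse(source, HashingGenerator(out, tag_names))


if __name__ == "__main__":
    src = io.BytesIO(b"<people><name>Alice</name><age>30</age></people>")
    dst = io.StringIO()
    anonymise_stream(src, dst, ["name", "surname", "address", "email", "phone"])
    print(dst.getvalue())
```

With a real file, `source` could be the path to the input and `out` a file opened for writing with `encoding="utf-8"`. The hex digest is used here instead of the base64 form in the Perl code; either works as long as it is applied consistently.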
