Parsing numbered transcripts into XML
I want to build a scraper that parses transcripts from the Leveson Inquiry, which are plain text in the following format:
1 Thursday, 2 February 2012
2 (10.00 am)
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies,
6 Mr Blendis from Everything Everywhere, Mr Hughes from
7 Vodafone and Mr Gorham from Telefonica.
8 LORD JUSTICE LEVESON: Very good.
9 MR BARR: We're going to listen to them all together, sir.
10 Can I ask that the gentlemen are sworn in, please.
11 MR JAMES BLENDIS (affirmed)
12 MR ADRIAN GORHAM (sworn)
13 MR MARK HUGHES (sworn)
14 Questions by MR BARR
15 MR BARR: Can I start, please, Mr Hughes, with you. Could
16 you tell us the position that you hold and a little bit
17 about your professional background, please?
18 MR HUGHES: Yes, sure. I'm currently head of fraud risk and
19 security for Vodafone UK. I have been in that position
20 since August 2011 and I've worked in the fraud risk and
21 security department in Vodafone since October 2006.
22 Q. Mr Gorham, if I could ask you the same question, please.
23 MR GORHAM: I'm the head of fraud and security for
24 Telefonica O2, I've been in that role for ten years and
25 have been in the industry for 13.
1
Ultimately I want to build an XML file structured as follows:
<hearing date="2012-02-02" time="10:00">
<quote speaker="Lord Justice Leveson" page="1" line="3">Good morning.</quote>
<quote speaker="Mr Barr" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</quote>
<quote speaker="Lord Justice Leveson" page="1" line="8">Very good.</quote>
[... and on ...]
</hearing>
Any help?
(Also note that "MR BARR:" changes to simply "Q." at a certain point.)
Many thanks!
Comments (2)
This is generally a very hard problem, and it is well out of scope for Stack Overflow. That said, if I had to do this I'd take an iterative approach:
As to the details of those steps, only you can decide whether you're getting out what you want. Also, any solution is going to require manual intervention, either beforehand or afterwards, to clean up low-frequency inconsistencies.
Let me start by saying this is not a foolproof script; there may well be something I forgot or overlooked.
It is a proof of concept for you to improve and expand upon, or just to get an idea from.
There are enough regularities in the text layout for us to work with. What the script does is split the
transcript into an array of lines and match those lines against a few patterns, in an attempt to identify the
regularities and determine the role of each piece of data.
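As a rough illustration of the line-by-line pattern matching described above (this is not the answerer's script), here is a minimal Python sketch. It assumes that every transcript line starts with a one- or two-digit line number, that a bare number on its own line is the footer of the page that just ended, that speaker labels are upper case and followed by a colon, and that "Q." refers to whoever was named in the most recent "Questions by ..." line. All regex and function names (`parse_transcript`, `SPEAKER_RE`, and so on) are made up for this example, and "A." answers and other stage directions are not handled.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """
1 Thursday, 2 February 2012
2 (10.00 am)
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies,
6 Mr Blendis from Everything Everywhere, Mr Hughes from
7 Vodafone and Mr Gorham from Telefonica.
8 LORD JUSTICE LEVESON: Very good.
9 MR BARR: We're going to listen to them all together, sir.
10 Can I ask that the gentlemen are sworn in, please.
11 MR JAMES BLENDIS (affirmed)
12 MR ADRIAN GORHAM (sworn)
13 MR MARK HUGHES (sworn)
14 Questions by MR BARR
15 MR BARR: Can I start, please, Mr Hughes, with you. Could
16 you tell us the position that you hold and a little bit
17 about your professional background, please?
18 MR HUGHES: Yes, sure. I'm currently head of fraud risk and
19 security for Vodafone UK. I have been in that position
20 since August 2011 and I've worked in the fraud risk and
21 security department in Vodafone since October 2006.
22 Q. Mr Gorham, if I could ask you the same question, please.
23 MR GORHAM: I'm the head of fraud and security for
24 Telefonica O2, I've been in that role for ten years and
25 have been in the industry for 13.
1
"""

MONTHS = ("January February March April May June July "
          "August September October November December").split()
NAME = r"[A-Z][A-Z .'-]+"  # assumed shape of an upper-case label such as "MR BARR"

PAGE_RE = re.compile(r"^\s*(\d+)\s*$")               # bare page number closing a page
LINE_RE = re.compile(r"^\s*(\d{1,2})\s+(.*\S)\s*$")  # "<line number> <text>"
DATE_RE = re.compile(r"^[A-Za-z]+day, (\d{1,2}) (%s) (\d{4})$" % "|".join(MONTHS))
TIME_RE = re.compile(r"^\((\d{1,2})\.(\d{2}) (am|pm)\)$")
EXAMINER_RE = re.compile(r"^Questions by (%s)$" % NAME)
SWORN_RE = re.compile(r"^%s \((?:affirmed|sworn)\)$" % NAME)
SPEAKER_RE = re.compile(r"^(%s):\s*(.*)$" % NAME)
QUESTION_RE = re.compile(r"^Q\.\s+(.*)$")


def parse_transcript(text):
    """Return a <hearing> element with one <quote> child per speaker turn."""
    hearing = ET.Element("hearing")
    page, examiner, current = 1, None, None
    for raw in text.splitlines():
        footer = PAGE_RE.match(raw)
        if footer:  # page N just ended, so following lines belong to page N + 1
            page = int(footer.group(1)) + 1
            continue
        numbered = LINE_RE.match(raw)
        if not numbered:
            continue  # blank or unnumbered line
        line_no, body = numbered.groups()
        date, time = DATE_RE.match(body), TIME_RE.match(body)
        by = EXAMINER_RE.match(body)
        turn, question = SPEAKER_RE.match(body), QUESTION_RE.match(body)
        if date:
            day, month, year = date.groups()
            hearing.set("date", "%s-%02d-%02d" % (year, MONTHS.index(month) + 1, int(day)))
            current = None
        elif time:  # only the first time marker becomes the hearing's start time
            hour = int(time.group(1)) % 12 + (12 if time.group(3) == "pm" else 0)
            hearing.attrib.setdefault("time", "%02d:%s" % (hour, time.group(2)))
            current = None
        elif by:  # remember who "Q." stands for from here on
            examiner, current = by.group(1).title(), None
        elif SWORN_RE.match(body):  # procedural line: ends the current turn
            current = None
        elif turn or question:  # a new speaker turn starts on this line
            speaker = turn.group(1).title() if turn else (examiner or "Q")
            current = ET.SubElement(hearing, "quote", speaker=speaker,
                                    page=str(page), line=line_no)
            current.text = turn.group(2) if turn else question.group(1)
        elif current is not None:  # continuation of the current turn
            current.text = (current.text + " " + body).strip()
    return hearing


if __name__ == "__main__":
    print(ET.tostring(parse_transcript(SAMPLE), encoding="unicode"))
```

Keeping each rule as its own named pattern makes it easier to add cases one at a time as new inconsistencies turn up in the real transcripts; anything the patterns do not recognise is appended to the current speaker's text, so it is worth spot-checking the output by hand.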
Example Script:
I will update the comments and script later today.
Output Sample:
By the way, just out of curiosity, what do you need this for?