How can I extract U.S. Department of Defense contract information for statistical analysis?
I'm trying to scrape and analyze the contracts the defense department gets, correlating it with other economic data I've already got. It's all publicly available on Defense.gov.
However, they don't list it in a table; instead, the relevant information (Contractor, Date, Name, Contract ID, etc.) is written in paragraph form. I've been trying to get the data into a CSV so I can run it through R.
Normally I'd just extract based on the tags around the data, but can anyone recommend a simpler way of getting at this data? I've already pulled the data using wget, but I'm just trying to extract it.
This is an example of a typical paragraph:
Booz Allen Hamilton, Inc., Herndon, Va., is being awarded a $9,450,189 cost-plus-fixed-fee, indefinite-delivery, requirements contract for research and development in order to complete/deliver the assessment of army warfighting challenges and integrated learning plans, the experiment final reports, and experiment-to-action plans. The U.S. Army will use these reports to develop and revise Army concepts and contribute to other services and joint concepts; make recommendations for the development of Army and joint capabilities development scenarios; research current and future warfare through experimentation; and build models and simulations to test new warfighting ideas. ESG/PKS DTIC, Offut Air Force Base, Neb., is the contracting activity (SP0700-03-D-1380, Delivery Order: 0452).
I started with a Perl script, but the extraction isn't working out so well. I'm curious whether anyone has built a more dynamic script that I can build on rather than rebuilding from scratch.
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::WriteExcel;

# Create a new workbook called Dec4_min.xls and add a worksheet.
my $workbook  = Spreadsheet::WriteExcel->new('Dec4_min.xls');
my $worksheet = $workbook->add_worksheet();
my $row = 0;

my @files = glob('~/Def_Contracts/*.*');
foreach my $file (@files) {    # open each file in folder
    open(my $fh, '<', $file) or die "Can't open $file: $!";
    my @fullpage = <$fh>;
    close($fh);
    print "fullpage array size = ", scalar(@fullpage), "\n";

    my $date = '';    # date of the current page, set once the meta tag is seen
    foreach my $curr (@fullpage) {
        # Get the date; looking for this: content="8/29/1995"
        if ($curr =~ m/content="([0-9]+\/[0-9]+\/[0-9]{4})/) {
            $date = $1;
            print "$date\n";
        }

        # CLEAN UP: strip leading non-word characters
        $curr =~ s/^\W+//;

        # Only use lines that contain a dollar amount ($number).
        next unless $curr =~ m/\$[0-9]/;

        # Now we've got what we need; output relevant parts into Excel.
        my $firstcom = index($curr, ',');
        my $name = $firstcom >= 0 ? substr($curr, 0, $firstcom) : $curr;
        $worksheet->write($row, 0, $name);    # contractor name, column 0
        $worksheet->write($row, 1, $date);    # date, column 1

        # Cost, column 2. PROBLEM: there may be more than one; this takes the first.
        if ($curr =~ m/\$([0-9,]+)/) {
            $worksheet->write($row, 2, $1);
        }
        # Contract reference number, column 3; takes a form like SP0700-03-D-1380
        if ($curr =~ m/([A-Za-z0-9][A-Z0-9]{4}[A-Z0-9]?-[0-9]+-[A-Z]-[A-Z0-9]{4})/) {
            $worksheet->write($row, 3, $1);
        }
        # 2nd attempt to get the reference number, column 4: parenthesized text with two hyphens
        if ($curr =~ m/\((.*-.*-.*)\)/) {
            $worksheet->write($row, 4, $1);
        }
        $worksheet->write($row, 5, $curr);    # full record (for verification!), column 5
        $row++;
    }    # close foreach line of HTML page
}    # end of foreach file

$workbook->close();
print "The end.\n";
Comments (2)
Obviously, very incomplete, but then, normally it takes a significant amount of cash to convince me to deal with this kind of mess (VIEWSTATE, really?):
Output for No. 1001-11:
Looking at a few entries, I suspect these paragraphs are entered manually using a bunch of boilerplate templates. (The different branches / agencies seem to have their own formats; for example, the Air Force and the Navy write "is being awarded", while the Army and the DLA use "was awarded", and some other agencies have their own peculiar variants.)
Thus, it seems unlikely that you can write code to parse all the entries reliably. The best you can do is probably to write a bunch of regexps to parse most (say, 99% or so) of them, and flag the rest for manual processing.
I'm too tired to write a more detailed answer right now, but I'd suggest starting with something like this:
Then go through the entries being rejected, add new regexps (or adjust existing ones) to handle the most common types among them, and repeat. Also remember to check the parsed output to see if the regexps are working correctly, of course.
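A minimal sketch of that parse, reject, refine loop, assuming Perl 5.10 or later for named captures. The two patterns, the assumed contract-ID shape, and the helper names parse_entry and csv_line are illustrative guesses rather than a tested grammar for Defense.gov. The first sample entry is abridged from the question and the second is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed contract-ID shape, e.g. SP0700-03-D-1380 (an assumption, not a spec).
my $id_re = qr/[A-Z0-9]{6}-[0-9]{2}-[A-Z]-[A-Z0-9]{4}/;

# One regexp per boilerplate template (hypothetical; extend as rejects show new variants).
my @patterns = (
    qr/^(?<name>.+?), (?<place>[^,]+, [^,]+), is being awarded an? \$(?<amount>[0-9,]+).*?\((?<id>$id_re)/s,
    qr/^(?<name>.+?), (?<place>[^,]+, [^,]+), was awarded an? \$(?<amount>[0-9,]+).*?\((?<id>$id_re)/s,
);

# Try each pattern in turn; return a hashref of fields, or nothing if none match.
sub parse_entry {
    my ($text) = @_;
    for my $re (@patterns) {
        if ($text =~ $re) {
            my %fields = %+;              # copy the named captures
            $fields{amount} =~ tr/,//d;   # "9,450,189" becomes "9450189"
            return \%fields;
        }
    }
    return;    # caller should flag this entry for manual processing
}

# Join fields into one CSV line, quoting any field that needs it.
sub csv_line {
    my @fields = @_;
    for (@fields) {
        if (/[",\n]/) { s/"/""/g; $_ = qq("$_"); }
    }
    return join(',', @fields);
}

my @entries = (
    # Abridged from the example paragraph in the question.
    'Booz Allen Hamilton, Inc., Herndon, Va., is being awarded a $9,450,189 '
      . 'cost-plus-fixed-fee, indefinite-delivery, requirements contract for '
      . 'research and development. ESG/PKS DTIC, Offut Air Force Base, Neb., is '
      . 'the contracting activity (SP0700-03-D-1380, Delivery Order: 0452).',
    # Made-up entry in the "was awarded" style.
    'Example Corp., Springfield, Va., was awarded a $1,000,000 firm-fixed-price '
      . 'contract for widgets. Some Command, Anytown, Md., is the contracting '
      . 'activity (W12345-11-C-0001).',
    # Matches no template, so it ends up in the reject pile.
    'This entry follows no known template.',
);

my (@parsed, @rejected);
for my $entry (@entries) {
    my $rec = parse_entry($entry);
    if ($rec) { push @parsed, $rec } else { push @rejected, $entry }
}

my @columns = qw(name place amount id);
print csv_line(@columns), "\n";
print csv_line(@{$_}{@columns}), "\n" for @parsed;
print STDERR "REJECTED: $_\n" for @rejected;
```

The CSV goes to standard output and the rejects to standard error, so the two can be redirected to separate files. The CSV can then be read into R, while the reject file shows which templates still need a pattern.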