Parsing SEC Edgar XML files with Ruby and Nokogiri
I'm having problems parsing the SEC Edgar files.
Here is an example of this file.
The end result is that I want the stuff between <XML> and </XML> in a format I can access.
Here is my code so far that doesn't work:
scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)
Comments (3)
Ok, there are a couple of things wrong:
Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:
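The snippet that originally followed is not preserved here, so what follows is only a minimal sketch of the steps just described, not the answerer's code. Two things in the question's code are known to trip people up: `open` only accepts URLs once `open-uri` is required (on current Rubies the call is `URI.open`), and `.` does not match newlines unless the regex carries the `/m` flag, so `/<XML>(.*)<\/XML>/` cannot span a multi-line block. The method names `extract_xml` and `fetch_filing_xml` and the `user_agent` parameter are hypothetical; sending a User-Agent header is an assumption about what the SEC server expects.

```ruby
require 'open-uri'

# Returns the text between the first <XML> tag and its closing </XML>,
# or nil if the filing has no such block.
# /m lets `.` match newlines; `.*?` stops at the first </XML>.
def extract_xml(filing_text)
  match = filing_text.match(%r{<XML>(.*?)</XML>}m)
  match && match[1].strip
end

# Downloads a filing and parses its embedded XML with Nokogiri.
# `user_agent` is a hypothetical parameter (assumption: the server
# wants a descriptive User-Agent header).
def fetch_filing_xml(url, user_agent: 'example-script you@example.com')
  require 'nokogiri' # loaded lazily so extract_xml works without the gem
  text = URI.open(url, 'User-Agent' => user_agent, &:read)
  xml = extract_xml(text)
  raise ArgumentError, "no <XML> block found in #{url}" if xml.nil?

  Nokogiri::XML(xml)
end
```

Calling `fetch_filing_xml` with the URL from the question should return a `Nokogiri::XML::Document` that can be explored in IRB with `doc.root.name`, `doc.xpath(...)`, and so on. If the document declares namespaces, `doc.remove_namespaces!` can make ad hoc XPath queries simpler.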
I recommend practicing in IRB and reading the Nokogiri docs; that should get you going.
Given that this was asked a year ago, the answer is probably OBE (overtaken by events), but what the asker should do is examine all of the documents on the site and notice that the actual filing details can be found at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm
On that page, you will see that the XML document he is after is already split out, ready for further manipulation, at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml
Be warned, however, that the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being named 'primary_doc.xml'.
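Because the file name cannot be relied on, one option is to read the index page and collect whatever `.xml` links it lists. The sketch below builds the index URL from the pattern visible in the two URLs above and then filters the page's links. The method names `filing_index_url` and `xml_hrefs` are hypothetical, and it is an assumption that the index page links to the XML document through an ordinary `<a href>` ending in `.xml`.

```ruby
# Builds the filing index URL from a CIK and a dashed accession number,
# following the pattern of the URLs quoted above: the folder name is the
# accession number with its dashes removed.
def filing_index_url(cik, accession)
  folder = accession.delete('-')
  "http://sec.gov/Archives/edgar/data/#{cik}/#{folder}/#{accession}-index.htm"
end

# Returns every href on the index page that ends in .xml, whatever the
# submitter chose to call the file.
def xml_hrefs(index_html)
  require 'nokogiri' # loaded lazily so filing_index_url works without the gem
  Nokogiri::HTML(index_html)
    .css('a[href]')
    .map { |a| a['href'] }
    .select { |href| href.downcase.end_with?('.xml') }
end
```

For the filing discussed here, `filing_index_url(1475481, '0001475481-09-000001')` reproduces the index URL given above. The hrefs returned by `xml_hrefs` may be relative, so they would still need to be resolved against the site root before fetching.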