Parsing SEC Edgar XML files with Nokogiri in Ruby

Posted on 2024-11-04 01:54:59

I'm having problems parsing the SEC Edgar files

Here is an example of this file.

The end result is I want the stuff between <XML> and </XML> into a format I can access.

Here is my code so far that doesn't work:

scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)

Comments (3)

人生戏 2024-11-11 01:54:59

Ok, there are a couple of things wrong:

  1. sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
  2. You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.

Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

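require 'nokogiri'
require 'open-uri'

url = 'http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'

# URI.open is needed on Ruby 3.0+, where plain open no longer accepts URLs.
raw = URI.open(url).read

# Strip everything up to the opening <XML> line and from </XML> onward.
xml = raw.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, '')

doc = Nokogiri::XML(xml)
puts doc.at('//schemaVersion').text
# >> X0603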
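
The attempt in the question most likely returned nil because, in Ruby, `.` does not match newlines unless the `m` flag is given, so `(.*)` cannot span the lines between the two tags. Below is a minimal sketch of just the extraction step, assuming the filing text is already in a string. The `extract_xml` helper and the sample text are made up for illustration, and only the first <XML> block is returned.

```ruby
# Hypothetical helper: returns the text between the first <XML> tag and the
# next </XML> tag, or nil when no such block exists.
def extract_xml(filing_text)
  # m: let "." match newlines; i: accept <xml> as well as <XML>.
  # .*? is non-greedy, so the match stops at the first closing tag.
  match = filing_text.match(%r{<XML>\s*(.*?)\s*</XML>}mi)
  match && match[1]
end

# Made-up sample that imitates the non-XML wrapper around the payload.
sample = <<~FILING
  <SEC-DOCUMENT>header text
  <DOCUMENT>
  <TEXT>
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  </TEXT>
  </DOCUMENT>
FILING

xml = extract_xml(sample)
puts xml
```

The returned string can then be handed to Nokogiri::XML in place of the two gsub calls.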

这个俗人 2024-11-11 01:54:59

I recommend practicing in IRB and reading the docs for Nokogiri

> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>] 

That should get you going.

嘴硬脾气大 2024-11-11 01:54:59

Given that this was asked a year ago, the answer has probably been overtaken by events (OBE), but what the asker should do is examine all of the documents on the site and notice that the actual filing details can be found at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm

On that page, you will see that the XML document he is after is already extracted and ready for further manipulation at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml

Be warned, however, the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being 'primary_doc.xml'.
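
The index URL above follows a visible pattern: the folder name is the accession number with its dashes removed, and the page name is the dashed accession number plus `-index.htm`. Here is a small sketch that rebuilds that URL from its parts. It assumes the pattern holds for other filings, which is inferred from this single example rather than taken from SEC documentation; `filing_index_url` is a hypothetical helper name.

```ruby
# Hypothetical helper: builds the filing index URL from a CIK and a dashed
# accession number, following the pattern of the example URL in this answer.
def filing_index_url(cik, accession)
  folder = accession.delete('-') # "0001475481-09-000001" -> "000147548109000001"
  "http://sec.gov/Archives/edgar/data/#{cik}/#{folder}/#{accession}-index.htm"
end

puts filing_index_url(1475481, '0001475481-09-000001')
# >> http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm
```

Since the submitter chooses the name of the primary document, the index page is the place to look up the actual XML file name rather than guessing it.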
