Parsing SEC Edgar XML files with Nokogiri in Ruby

Posted 2024-11-04 01:54:59


I'm having problems parsing the SEC Edgar files.

Here is an example of this file.

Ultimately, I want the content between <XML> and </XML> in a format I can access.

Here is my code so far, which doesn't work:

scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)


人生戏 2024-11-11 01:54:59


OK, there are a couple of things wrong:

1. sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
2. You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.

Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

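require 'nokogiri'
require 'open-uri'

# URI.open replaces the bare Kernel#open for URLs (required on Ruby 3.0+).
raw = URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read

# Drop everything up to and including the opening <XML> line,
# then everything from the closing </XML> tag to the end of the file.
xml = raw.sub(/\A.+<xml>\n/im, '').sub(/<\/xml>.+/mi, '')

doc = Nokogiri::XML(xml)
puts doc.at('//schemaVersion').text
# >> X0603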
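
As a side note on the regex in the question: in Ruby, `.` does not match newlines unless the `/m` flag is set, so `/<XML>(.*)<\/XML>/` cannot span a multi-line payload. Below is a minimal sketch of the extraction step on its own, using only core Ruby. The helper name `extract_xml_blocks` and the sample text are made up for illustration; the real file's header lines may look different.

```ruby
# Hypothetical helper (not part of any library): returns the text inside
# every <XML>...</XML> wrapper found in an EDGAR-style text file.
def extract_xml_blocks(text)
  # /m lets "." match newlines, /i accepts <xml> as well as <XML>,
  # and the non-greedy ".*?" keeps separate wrappers from being merged.
  text.scan(%r{<XML>\s*(.*?)\s*</XML>}mi).map(&:first)
end

# Made-up sample that only imitates the layout: non-XML lines around an XML payload.
sample = <<~EDGAR
  <SEC-DOCUMENT>0001475481-09-000001.txt
  <DOCUMENT>
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  </DOCUMENT>
  </SEC-DOCUMENT>
EDGAR

blocks = extract_xml_blocks(sample)
puts blocks.length
puts blocks.first
```

Each returned string can then be handed to `Nokogiri::XML` as in the answer above. Using `scan` instead of a single `match` is a design choice so that files containing more than one `<XML>` wrapper yield one string per wrapper.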


这个俗人 2024-11-11 01:54:59


I recommend practicing in IRB and reading the Nokogiri docs:

> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]

That should get you going.
