Parsing XML in Haskell
I'm trying to get data from a webpage that periodically serves an XML file with stock market quotes (sample data). The structure of the XML is very simple, something like this:
<?xml version="1.0"?>
<Contents>
<StockQuote Symbol="PETR3" Date="21-12-2010" Time="13:20" Price="23.02" />
</Contents>
(it's more than that but this suffices as an example).
I'd like to parse it to a data structure:
data Quote = Quote { symbol :: String,
                     date   :: Data.Time.Calendar.Day,
                     time   :: Data.Time.LocalTime.TimeOfDay,
                     price  :: Float }
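As a rough illustration of the target structure (not part of the original question's code), the sketch below shows one way the four attribute strings could be turned into a `Quote`, using `parseTimeM` from the time package. The helper name `mkQuote` and the format strings `%d-%m-%Y` and `%H:%M` are assumptions based on the sample data above.

```haskell
import Data.Time.Calendar (Day)
import Data.Time.Format (defaultTimeLocale, parseTimeM)
import Data.Time.LocalTime (TimeOfDay)
import Text.Read (readMaybe)

-- Same fields as the Quote record in the question.
data Quote = Quote
  { symbol :: String
  , date   :: Day
  , time   :: TimeOfDay
  , price  :: Float
  } deriving (Show, Eq)

-- Hypothetical helper: build a Quote from the four attribute strings.
-- The result is Nothing if the date, time, or price fails to parse.
mkQuote :: String -> String -> String -> String -> Maybe Quote
mkQuote sym d t p =
  Quote sym
    <$> parseTimeM True defaultTimeLocale "%d-%m-%Y" d  -- e.g. "21-12-2010"
    <*> parseTimeM True defaultTimeLocale "%H:%M" t     -- e.g. "13:20"
    <*> readMaybe p                                     -- e.g. "23.02"
```

With something like this in place, whichever XML library is used only has to hand over the attribute values as strings.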
I understand more or less how Parsec works (at the level of the Real World Haskell book), and I tried the Text.XML library a bit, but all I could come up with was code that works but is too big for such a simple task; it looks like a half-baked hack rather than the best one could do.
I don't know much about parsers or XML (basically just what I read in the RWH book; I have never used parsers before). I just do statistical and numerical programming; I'm not a computer scientist. Is there an XML parsing library where I could just describe the model and extract the information right away, without having to parse each element by hand and without having to parse the raw string?
I'm thinking about something like:
myParser = do cont <- openXMLElem "Contents"
              quote <- openXMLElem "StockQuote"
              symb <- getXMLElemField "Symbol"
              date <- getXMLElemField "Date"
              (...)
              closequote <- closeXMLElem "StockQuote"
              closecont <- closeXMLElem "Contents"
              return (symb, date)
results = parse myParser "" myXMLString
where I wouldn't have to deal with the pure string and create the combinators myself (I suck at it).
EDIT: I probably need to read a bit (just enough to get this done the right way) about parsers in general (not only Parsec) and the minimum about XML. Do you guys recommend something?
The real string I have to parse is this:
stringTest = "<?xml version=\"1.0\"?>\r\n<ComportamentoPapeis><Papel Codigo=\"PETR3\" \
             \Nome=\"PETROBRAS ON\" Ibovespa=\"#\" Data=\"05/01/201100:00:00\" \
             \Abertura=\"29,80\" Minimo=\"30,31\" Maximo=\"30,67\" Medio=\"30,36\" \
             \Ultimo=\"30,45\" Oscilacao=\"1,89\" Minino=\"29,71\"/></ComportamentoPapeis>\r\n"
EDIT2:
I tried the following (readFloat, readQuoteTime, etc. are just functions that read values from strings).
bvspaParser :: (ArrowXml a) => a XmlTree Quote
bvspaParser = hasName "ComportamentoPapeis" /> hasName "Papel" >>> proc x -> do
    (hour,date) <- readQuoteTime ^<< getAttrValue "Data" -< x
    quoteCode <- getAttrValue "Codigo" -< x
    openPrice <- readFloat ^<< getAttrValue "Abertura" -< x
    minim <- readFloat ^<< getAttrValue "Minimo" -< x
    maxim <- readFloat ^<< getAttrValue "Maximo" -< x
    ultimo <- readFloat ^<< getAttrValue "Ultimo" -< x
    returnA -< Quote quoteCode (LocalTime date hour) openPrice minim maxim ultimo
docParser :: String -> IO [Quote]
docParser str = runX $ readString [] str >>> (parseXmlDocument False) >>> bvspaParser
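The helper functions mentioned above are not shown. As a minimal sketch of what a number reader for this feed might look like, assuming the prices always use a decimal comma as in "29,80", the following uses only base. The name `readCommaFloat` is hypothetical and is not the `readFloat` used in the code above.

```haskell
import Text.Read (readMaybe)

-- Hypothetical helper: parse a number written with a decimal comma,
-- such as "29,80", by swapping the comma for a dot before reading it.
-- Returns Nothing when the string is not a valid number.
readCommaFloat :: String -> Maybe Float
readCommaFloat = readMaybe . map commaToDot
  where
    commaToDot ',' = '.'
    commaToDot c   = c
```

Returning a `Maybe` keeps a malformed attribute from crashing the whole parse; the caller decides what to do with a missing price.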
When I call it in ghci:
*Main> docParser stringTest >>= print
[]
Is anything wrong?
Comments (5)
There are plenty of XML libraries written for Haskell that can do the parsing for you. I recommend the library called xml (see http://hackage.haskell.org/package/xml). With it, you can simply write e.g.:
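The answer's own snippet is not preserved in this copy. What follows is a minimal sketch of the same idea, assuming the `Text.XML.Light` module of the xml package; the names `quoteSymbols` and `sampleXml` are made up for illustration.

```haskell
import Text.XML.Light (findAttr, findElements, parseXMLDoc, unqual)

-- The sample document from the question.
sampleXml :: String
sampleXml = unlines
  [ "<?xml version=\"1.0\"?>"
  , "<Contents>"
  , "<StockQuote Symbol=\"PETR3\" Date=\"21-12-2010\" Time=\"13:20\" Price=\"23.02\" />"
  , "</Contents>"
  ]

-- Collect the Symbol attribute of every StockQuote element.
-- The outer Maybe is Nothing when no root element can be parsed;
-- each inner Maybe is Nothing when an element lacks the attribute.
quoteSymbols :: String -> Maybe [Maybe String]
quoteSymbols src = do
  doc <- parseXMLDoc src
  let quotes = findElements (unqual "StockQuote") doc
  pure (map (findAttr (unqual "Symbol")) quotes)
```

In GHCi, `mapM_ print (quoteSymbols sampleXml)` prints the inner list.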
This snippet prints
[Just "PETR3"]
as a result for your example XML, and it's easy to extend for collecting all the data you need. To write the program in the style you describe you should use the Maybe monad, as the xml lookup functions often return a Maybe String, signaling whether the tag, element or attribute could be found. Also see a related question: Which Haskell XML library to use?
For simple XML parsing, you can't go wrong with tagsoup. http://hackage.haskell.org/package/tagsoup
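The answer gives no code, so here is a minimal sketch of how tagsoup could be applied to the sample document, assuming the `Text.HTML.TagSoup` module. The function name `quoteAttrs` and the choice of the `Symbol` and `Price` attributes are illustrative only.

```haskell
import Text.HTML.TagSoup (fromAttrib, isTagOpenName, parseTags)

-- Pull the Symbol and Price attributes out of every <StockQuote> tag.
-- tagsoup works on a flat list of tags, so no tree navigation is needed.
-- fromAttrib yields "" when the attribute is absent.
quoteAttrs :: String -> [(String, String)]
quoteAttrs src =
  [ (fromAttrib "Symbol" tag, fromAttrib "Price" tag)
  | tag <- parseTags src
  , isTagOpenName "StockQuote" tag
  ]
```

Because tagsoup does not check well-formedness, this approach trades validation for brevity, which may or may not be acceptable for a data feed.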
The following snippet uses xml-enumerator. It leaves date and time as text (parsing those is left as an exercise to the reader):
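The xml-enumerator snippet itself is not preserved in this copy. As a stand-in, the sketch below uses xml-conduit, the package that later replaced xml-enumerator, so it is a swapped-in library rather than the answer's original code. It assumes the `Text.XML` and `Text.XML.Cursor` modules; `quotesAsText` is a hypothetical name. Like the answer, it leaves date and time as text.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Text.XML (def, parseText_)
import Text.XML.Cursor (attribute, element, fromDocument, ($//))

-- Return (symbol, date, time) for every StockQuote element.
-- parseText_ throws an exception if the input is not well-formed XML.
-- A StockQuote missing any of the three attributes is skipped, because
-- attribute returns an empty list in that case.
quotesAsText :: TL.Text -> [(T.Text, T.Text, T.Text)]
quotesAsText src = do
  c <- fromDocument (parseText_ def src) $// element "StockQuote"
  (,,) <$> attribute "Symbol" c <*> attribute "Date" c <*> attribute "Time" c
```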
I've used Haskell XML Toolbox in the past. Something along the lines of
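The answer's code is not preserved in this copy. A minimal sketch with HXT, assuming `Text.XML.HXT.Core`, could look like the following; `papelAttrs` is a hypothetical name and the two attributes are picked only as an example. A likely reason the parser in the question returns `[]` is that `readString` yields the document root node rather than the `ComportamentoPapeis` element, so a `hasName` test applied at the top never matches; `deep` searches down the tree instead.

```haskell
import Control.Arrow ((&&&), (>>>))
import Text.XML.HXT.Core

-- Read the Codigo and Ultimo attributes of every Papel element.
-- withValidate no turns off DTD validation, since the feed has no DTD.
-- deep descends from the document root until it finds matching elements.
papelAttrs :: String -> IO [(String, String)]
papelAttrs src =
  runX $
    readString [withValidate no] src
      >>> deep (isElem >>> hasName "Papel")
      >>> (getAttrValue "Codigo" &&& getAttrValue "Ultimo")
```

The attribute values come back as plain strings, so converting them to numbers or dates is a separate step.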
There are other ways to use this library, but for something simple like this I threw together a sax parser.
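The SAX parser from the answer is not preserved in this copy. Judging by the module names mentioned below, the library is hexpat; under that assumption, a minimal sketch using `Text.XML.Expat.SAX` might be the following, where `papelEvents` is a made-up name.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Text.XML.Expat.SAX (SAXEvent (..), defaultParseOptions, parse)

-- Walk the lazy stream of SAX events and keep the attribute list of
-- every opening <Papel> tag. Each attribute list is a list of
-- (name, value) pairs, so single values can be fetched with lookup.
papelEvents :: L.ByteString -> [[(String, String)]]
papelEvents src =
  [ attrs
  | StartElement "Papel" attrs <- parse defaultParseOptions src
  ]
```

For example, `map (lookup "Codigo") (papelEvents input)` gives the stock code of each `Papel` element as a `Maybe String`.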
From there you can figure out where to go. You could also use Text.XML.Expat.Annotated in order to parse it into a structure that is more like what you are looking for above:
And then use Text.XML.Expat.Proc to surf the structure.