Parsing XML in Haskell

Posted on 2024-10-10 21:30:09

I'm trying to get data from a webpage that periodically serves an XML file with stock market quotes (sample data). The structure of the XML is very simple, something like this:

<?xml version="1.0"?>
<Contents>
  <StockQuote Symbol="PETR3" Date="21-12-2010" Time="13:20" Price="23.02" />
</Contents>

(There's more to it than that, but this suffices as an example.)

I'd like to parse it to a data structure:

 data Quote = Quote { symbol :: String, 
                      date   :: Data.Time.Calendar.Day, 
                      time   :: Data.Time.LocalTime.TimeOfDay,
                      price  :: Float}

I understand more or less how Parsec works (at the level of the Real World Haskell book), and I tried the Text.XML library a bit, but all I could come up with was code that works but is too big for such a simple task and looks like a half-baked hack rather than the best one could do.

I don't know a lot about parsers and XML (I know basically what I read in the RWH book and have never used parsers before; I just do statistical and numerical programming, I'm not a computer scientist). Is there an XML parsing library where I could just describe the model and extract the information right away, without having to parse each element by hand and without having to parse raw strings?

I'm thinking about something like:

  myParser = do cont  <- openXMLElem "Contents"
                quote <- openXMLElem "StockQuote" 
                symb <- getXMLElemField "Symbol"
                date <- getXMLElemField "Date"
                (...) 
                closequote <- closeXMLElem "StockQuote"
                closecont  <- closeXMLElem "Contents"
                return (symb, date)


  results = parse myParser "" myXMLString

where I wouldn't have to deal with the raw string and create the combinators myself (I suck at it).

EDIT: I probably need to read a bit (just enough to get this done the right way) about parsers in general (not only Parsec) and the minimum about XML. Do you guys recommend anything?

The real string I have to parse is this:

 stringTest = "<?xml version=\"1.0\"?>\r\n<ComportamentoPapeis><Papel Codigo=\"PETR3\" \
              \Nome=\"PETROBRAS ON\" Ibovespa=\"#\" Data=\"05/01/201100:00:00\" \
              \Abertura=\"29,80\" Minimo=\"30,31\" Maximo=\"30,67\" Medio=\"30,36\" \
              \Ultimo=\"30,45\" Oscilacao=\"1,89\" Minino=\"29,71\"/></ComportamentoPapeis>\r\n"
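
Note that the prices in this string use a comma as the decimal separator, so a plain `read` would not accept them. A minimal sketch of one way to read such values, assuming no thousands separators appear in the data; `readDecimalComma` is a hypothetical helper name, not something from the original post:

```haskell
import Text.Read (readMaybe)

-- Hypothetical helper: parse a number written with a decimal comma,
-- e.g. "29,80", by swapping ',' for '.' before reading it.
-- Assumes the input contains no thousands separators.
readDecimalComma :: String -> Maybe Double
readDecimalComma = readMaybe . map swap
  where
    swap ',' = '.'
    swap c   = c

main :: IO ()
main = do
  print (readDecimalComma "29,80")  -- Just 29.8
  print (readDecimalComma "abc")    -- Nothing
```

Returning `Maybe Double` instead of calling `read` directly means a malformed attribute shows up as `Nothing` rather than as a runtime exception.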

EDIT2:

I tried the following (readFloat, readQuoteTime, etc. are just functions to read things from strings).

bvspaParser :: (ArrowXml a) => a XmlTree Quote
bvspaParser = hasName "ComportamentoPapeis" /> hasName "Papel" >>> proc x -> do
   (hour,date) <- readQuoteTime ^<< getAttrValue "Data" -< x
   quoteCode   <- getAttrValue "Codigo" -< x
   openPrice   <- readFloat ^<< getAttrValue "Abertura" -< x
   minim       <- readFloat ^<< getAttrValue "Minimo" -< x
   maxim       <- readFloat ^<< getAttrValue "Maximo" -< x
   ultimo      <- readFloat ^<< getAttrValue "Ultimo" -< x
   returnA     -< Quote quoteCode (LocalTime date hour) openPrice minim maxim ultimo

docParser :: String -> IO [Quote]
docParser  str = runX $ readString [] str >>> (parseXmlDocument False) >>> bvspaParser

When I call it in ghci:

*Main> docParser stringTest >>= print
[]

Is anything wrong?

Comments (5)

夏花。依旧 2024-10-17 21:30:09

There are plenty of XML libraries written for Haskell that can do the parsing for you. I recommend the library called xml (see http://hackage.haskell.org/package/xml). With it, you can simply write e.g.:

-- Needs `import Text.XML.Light` (xml package); `source` holds the XML text.
let contents = parseXML source
    quotes   = concatMap (findElements $ unqual "StockQuote") (onlyElems contents)
    symbols  = map (findAttr $ unqual "Symbol") quotes
print symbols

This snippet prints [Just "PETR3"] as a result for your example XML, and it's easy to extend for collecting all the data you need. To write the program in the style you describe you should use the Maybe monad, as the xml lookup functions often return a Maybe String, signaling whether the tag, element or attribute could be found. Also see a related question: Which Haskell XML library to use?
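
To illustrate the Maybe monad style mentioned above, here is a minimal sketch that uses only `base`. It assumes the attributes have already been collected into an association list; with the xml package the lookups would be `findAttr` calls on an element instead. The `Attrs` type, the `quoteFromAttrs` function, and the reduced `Quote` record are made up for this illustration:

```haskell
import Text.Read (readMaybe)

-- Simplified stand-in for the Quote type in the question
-- (date and time are left out to keep the sketch short).
data Quote = Quote { symbol :: String, price :: Double }
  deriving (Show, Eq)

-- Hypothetical representation of an element's attributes.
type Attrs = [(String, String)]

-- Each step may fail; the Maybe monad stops at the first Nothing.
quoteFromAttrs :: Attrs -> Maybe Quote
quoteFromAttrs attrs = do
  s <- lookup "Symbol" attrs               -- Nothing if the attribute is missing
  p <- lookup "Price" attrs >>= readMaybe  -- Nothing if missing or not a number
  return (Quote s p)

main :: IO ()
main = do
  print (quoteFromAttrs [("Symbol", "PETR3"), ("Price", "23.02")])
  print (quoteFromAttrs [("Symbol", "PETR3")])  -- no Price, so Nothing
```

The first call yields a `Just` with the quote; the second yields `Nothing` because the `Price` attribute is absent.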

仅冇旳回忆 2024-10-17 21:30:09

For simple XML parsing, you can't go wrong with tagsoup. http://hackage.haskell.org/package/tagsoup

梅倚清风 2024-10-17 21:30:09

The following snippet uses xml-enumerator. It leaves date and time as text (parsing those is left as an exercise to the reader):

{-# LANGUAGE OverloadedStrings #-}
import Text.XML.Enumerator.Parse
import Data.Text.Lazy (Text, unpack)

data Quote = Quote { symbol :: Text
                   , date   :: Text
                   , time   :: Text
                   , price  :: Float}
  deriving Show

main = parseFile_ "test.xml" (const Nothing) $ parseContents

parseContents = force "Missing Contents" $ tag'' "Contents" parseStockQuote
parseStockQuote = force "Missing StockQuote" $ flip (tag' "StockQuote") return $ do
    s <- requireAttr "Symbol"
    d <- requireAttr "Date"
    t <- requireAttr "Time"
    p <- requireAttr "Price"
    return $ Quote s d t (read $ unpack p)

情愿 2024-10-17 21:30:09

I've used the Haskell XML Toolbox (HXT) in the past. Something along the lines of:

{-# LANGUAGE Arrows #-}

quoteParser :: (ArrowXml a) => a XmlTree Quote
quoteParser =
    hasName "Contents" /> hasName "StockQuote" >>> proc x -> do
    symbol <- getAttrValue "Symbol" -< x
    date <- readTime defaultTimeLocale "%d-%m-%Y" ^<< getAttrValue "Date" -< x
    time <- readTime defaultTimeLocale "%H:%M" ^<< getAttrValue "Time" -< x
    price <- read ^<< getAttrValue "Price" -< x
    returnA -< Quote symbol date time price

parseQuoteDocument :: String -> IO (Maybe Quote)
parseQuoteDocument xml =
    liftM listToMaybe . runX . single $
    readString [] xml >>> getChildren >>> quoteParser
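
The date and time formats above can be tried on their own. This is a minimal sketch that assumes a version of the time package that provides `parseTimeM` (where, as far as I know, `readTime` is deprecated); unlike `readTime` it returns `Nothing` on bad input instead of throwing. The names `parseQuoteDate` and `parseQuoteTime` are made up for the example:

```haskell
import Data.Time (Day, TimeOfDay, defaultTimeLocale, parseTimeM)

-- Hypothetical helpers using the same format strings as the parser above.
-- The first argument (True) tells parseTimeM to accept surrounding whitespace.
parseQuoteDate :: String -> Maybe Day
parseQuoteDate = parseTimeM True defaultTimeLocale "%d-%m-%Y"

parseQuoteTime :: String -> Maybe TimeOfDay
parseQuoteTime = parseTimeM True defaultTimeLocale "%H:%M"

main :: IO ()
main = do
  print (parseQuoteDate "21-12-2010")  -- Just 2010-12-21
  print (parseQuoteTime "13:20")       -- Just 13:20:00
  print (parseQuoteDate "not a date")  -- Nothing
```

Inside the arrow code, a `Maybe` result like this would need to be handled explicitly, for example by dropping quotes whose date fails to parse.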

绳情 2024-10-17 21:30:09

There are other ways to use this library (hexpat), but for something simple like this I threw together a SAX parser.

import Prelude as P
import Text.XML.Expat.SAX
import Data.ByteString.Lazy as L

parsexml txt = parse defaultParseOptions txt :: [SAXEvent String String]

main = do
  xml <- L.readFile "stockinfo.xml"
  return  $ P.filter stockquoteelement (parsexml xml)

  where
    stockquoteelement (StartElement "StockQuote" attrs) = True
    stockquoteelement _ = False

From there you can figure out where to go. You could also use Text.XML.Expat.Annotated in order to parse it into a structure that is more like what you are looking for above:

parsexml txt = parse defaultParseOptions txt :: (LNode String String, Maybe XMLParseError)

And then use Text.XML.Expat.Proc to surf the structure.
