Parsing a large XML file in Haskell with low memory usage
So, I've played around with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the IO chapter in Real World Haskell (http://book.realworldhaskell.org/read/io.html), I was under the impression that if I ran the following code, the input would be garbage collected as I went through it.
However, when I run it on a big file, memory usage keeps climbing as it runs.
runghc parse.hs bigfile.xml
What am I doing wrong? Is my assumption wrong? Does the map/filter force it to evaluate everything?
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.UTF8 as U
import Prelude hiding (readFile)
import Text.XML.Expat.SAX
import System.Environment (getArgs)
main :: IO ()
main = do
  args <- getArgs
  contents <- BSL.readFile (head args)
  -- putStrLn $ U.toString contents
  let events = parse defaultParseOptions contents
  mapM_ print $ map getTMSId $ filter isEvent events

isEvent :: SAXEvent String String -> Bool
isEvent (StartElement "event" as) = True
isEvent _ = False

getTMSId :: SAXEvent String String -> Maybe String
getTMSId (StartElement _ as) = lookup "TMSId" as
My end goal is to parse a huge XML file with a simple SAX-like interface. I don't want to have to be aware of the whole structure to get notified that I've found an "event".
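For comparison, here is a standalone sketch (no XML involved) of the streaming behaviour I'm assuming: mapM_ consuming a mapped/filtered lazy list runs in constant space, so map and filter by themselves shouldn't force the whole list.

-- If filter/map forced the entire (infinite) list, this would never finish;
-- instead it prints 4, 8, 12, 16, 20 and exits.
main :: IO ()
main = mapM_ print . take 5 . map (* 2) . filter even $ [1 :: Integer ..]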
2 Answers
I'm the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for drawing it to my attention.
The bug is new on ghc-7.2.1. It comes from an interaction I didn't expect between a where clause that binds a triple and unsafePerformIO, which I need in order to make the calls into the C code appear pure in Haskell.
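For context, here is a minimal, purely illustrative sketch of the shape of binding described above (a hypothetical stand-in, not hexpat's actual code): a where clause that pattern-matches a triple produced through unsafePerformIO, which is the kind of site where, per this answer, GHC 7.2.1 retained more than expected.

import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical example; in hexpat the IO action wraps calls into the C expat library.
summarize :: [Int] -> (Int, Int, Int)
summarize xs = (total, count, biggest)
  where
    (total, count, biggest) =
      unsafePerformIO (return (sum xs, length xs, maximum xs))

main :: IO ()
main = print (summarize [1 .. 10])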
This appears to be an issue with hexpat. Running compiled, with optimization, and just for a simple task such as length, results in linear memory use. Looking at hexpat, I think there is excessive caching going on (see the parseG function). I suggest contacting the hexpat maintainer(s) and asking if this is expected behavior. It should have been mentioned in the haddocks either way, but resource consumption seems to get ignored too often in library documentation.