Parsing large XML files in Haskell with low memory usage

Posted on 2024-12-14 19:50:22

So, I've played around with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the IO chapter in Real World Haskell (http://book.realworldhaskell.org/read/io.html), I was under the impression that if I run the following code, the input would be read lazily and the parts already processed would be garbage collected as I go through it.

However, when I run it on a big file, memory usage keeps climbing as it runs.

runghc parse.hs bigfile.xml

What am I doing wrong? Is my assumption wrong? Does the map/filter force it to evaluate everything?

import qualified Data.ByteString.Lazy as BSL
import Text.XML.Expat.SAX
import System.Environment (getArgs)

main :: IO ()
main = do
    args <- getArgs
    -- Lazy read: the file is expected to be consumed chunk by chunk.
    contents <- BSL.readFile (head args)
    let events = parse defaultParseOptions contents
    -- Print the TMSId attribute of every <event> start tag.
    mapM_ print $ map getTMSId $ filter isEvent events

-- True only for the start tag of an <event> element.
isEvent :: SAXEvent String String -> Bool
isEvent (StartElement "event" _) = True
isEvent _ = False

-- Look up the TMSId attribute of a start tag; Nothing for any other event.
getTMSId :: SAXEvent String String -> Maybe String
getTMSId (StartElement _ attrs) = lookup "TMSId" attrs
getTMSId _ = Nothing

My end goal is to parse a huge XML file with a simple SAX-like interface. I don't want to have to be aware of the whole structure to get notified that I've found an "event".

Comments (2)

未蓝澄海的烟 2024-12-21 19:50:22

I'm the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for drawing it to my attention.

The bug is new on ghc-7.2.1, and it's to do with an interaction that I didn't expect between a where clause binding to a triple, and unsafePerformIO, which I need to make the interaction with the C code appear pure in Haskell.

岛徒 2024-12-21 19:50:22

This appears to be an issue with hexpat. Running compiled, with optimization, and just for a simple task such as length, results in linear memory use.
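
The check described above can be reproduced with something like the following. This is only a minimal sketch, not the exact program that was run: the file name CountEvents.hs and the helper countEvents are made up for illustration, and it assumes hexpat and bytestring are installed. It does nothing but count the SAX events, so any growth in memory comes from the parser and not from the map/filter pipeline in the question.

```haskell
-- Hypothetical file name: CountEvents.hs
-- Build with optimisation and RTS options enabled, then print GC statistics:
--   ghc -O2 -rtsopts CountEvents.hs
--   ./CountEvents bigfile.xml +RTS -s
import qualified Data.ByteString.Lazy.Char8 as BSL
import System.Environment (getArgs)
import Text.XML.Expat.SAX (SAXEvent, defaultParseOptions, parse)

-- Count SAX events without keeping a reference to them. If the parser
-- streams properly, memory use should stay roughly constant.
countEvents :: BSL.ByteString -> Int
countEvents contents = length events
  where
    -- The annotation fixes the tag and text types, which length alone
    -- would leave ambiguous.
    events :: [SAXEvent String String]
    events = parse defaultParseOptions contents

main :: IO ()
main = do
    args <- getArgs
    case args of
        [path] -> BSL.readFile path >>= print . countEvents
        _      -> putStrLn "usage: CountEvents FILE.xml"
```

Comparing the "maximum residency" figure reported by +RTS -s for inputs of different sizes should show whether memory grows with the file size.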

Looking at hexpat, I think there is excessive caching going on (see the parseG function). I suggest contacting the hexpat maintainer(s) and asking if this is expected behavior. It should have been mentioned in the haddocks either way, but resource consumption seems to get ignored too often in library documentation.
