Parsing numbers out of large text, possibly without regex (performance-critical)
Before you all start answering with variations of \d+: I'm extremely familiar with regex.
I want to know if there are alternatives to regex for parsing numbers out of a large text file.
I'm parsing through tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where I also need to find groups of numbers that sit close to my content of interest. I want to avoid regex if at all possible because this needs to be a speedy process.
It is possible to take chunks of a file and inspect them for the numbers of interest. That, however, would require more work and add hard-coded limits for searching. (I'd like to avoid this.)
I'm open to any suggestions.
UPDATE
Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.
A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of everyone who posted an answer to a question. This also means that commas (,) inside the numbers need to be captured as well. I can't remove the HTML to simplify the content because I'm using some density analysis to weed out unrelated content. Removing the HTML would push the remaining content too close together.
Comments (1)
Unless the file is some sort of SGML, I don't know of any method (which is not to say there isn't one, I just don't know of it).
However, that's not to say you can't create your own parser; you could eliminate some of the overhead of the .NET regex library by writing something that only finds runs of digits.
Fundamentally, I guess that's all any library would do at the most basic level.
It might help if you could post a sample of the sort of data you'll be processing.
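To make the "write something that only finds runs of digits" idea concrete, here is a minimal sketch of a single-pass scanner that reports each number together with its position, since the question needs locations for its density analysis. Python is used only for readability; the answer refers to .NET, where the same loop would run over a string or char buffer. The names `scan_numbers` and `sample` are hypothetical, and the comma rule (a comma is kept only when it sits between two digits) is an assumption based on the reputation-score example in the question. Whether a loop like this actually beats a compiled regex would need to be measured on real files.

```python
from typing import Iterator, Tuple


def scan_numbers(text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, token) for each run of ASCII digits in text.

    Assumption: a comma belongs to a number only when it sits between
    two digits, so "12,345" is one token while "1, 2" is two.
    """
    i, n = 0, len(text)
    while i < n:
        if not ("0" <= text[i] <= "9"):
            i += 1  # skip anything that cannot start a number
            continue
        start = i
        i += 1
        while i < n:
            ch = text[i]
            if "0" <= ch <= "9":
                i += 1
            elif ch == "," and i + 1 < n and "0" <= text[i + 1] <= "9":
                i += 2  # consume the comma and the digit right after it
            else:
                break
        yield start, i, text[start:i]


# Hypothetical stand-in for a fragment of a question page's HTML.
sample = '<span class="reputation-score">12,345</span> answered 3 times, 2 accepted'
hits = list(scan_numbers(sample))
for start, end, token in hits:
    print(start, end, token)
```

The scanner does not check that commas group digits in threes, so a comma-separated list without spaces such as "1,2,3" comes back as a single token; if that matters for the data, the comma branch is the place to tighten the rule.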