I am parsing an InputStream for certain patterns to extract values from it, e.g. I would have something like
<span class="filename"><a href="http://example.com/foo">foo</a>
I don't want to use a full-fledged HTML parser, as I am not interested in the document structure but only in some well-defined bits of information. (Only their order is important.)
Currently I am using a very simple approach: I have an object for each pattern that contains a char[] of the opening and closing 'tag' (in the example, the opening would be <span class="filename"><a href=" and the closing would be " to get the URL) and a position marker. For each character read from the InputStream, I iterate over all patterns and call the match(char) function, which returns true once the opening pattern matches. From then on I collect the following chars in a StringBuilder until match() on the now active pattern returns true again. I then call a function with the ID of the pattern and the string read, to process it further.
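For readers who want something concrete, the approach described above could look roughly like the sketch below. It is not the asker's actual code: the names Delim, DelimiterScanner and scan are made up for illustration, a Reader stands in for the InputStream (wrap the stream in an InputStreamReader), and the mismatch handling is deliberately naive.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pattern object: literal opening/closing delimiters plus a position marker. */
class Delim {
    final int id;
    final char[] open;
    final char[] close;
    int pos;          // number of chars of the current target matched so far
    boolean opened;   // true once the opening delimiter has been seen

    Delim(int id, String open, String close) {
        this.id = id;
        this.open = open.toCharArray();
        this.close = close.toCharArray();
    }

    /** Feeds one char; returns true when the current target (open or close) is complete. */
    boolean match(char c) {
        char[] target = opened ? close : open;
        if (c == target[pos]) {
            pos++;
        } else {
            pos = (c == target[0]) ? 1 : 0;   // naive restart, no real backtracking
        }
        if (pos == target.length) {
            pos = 0;
            opened = !opened;
            return true;
        }
        return false;
    }
}

class DelimiterScanner {
    /** Returns "id:text" for every span of text found between a pattern's delimiters. */
    static List<String> scan(Reader in, List<Delim> patterns) throws IOException {
        List<String> hits = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        Delim active = null;
        int r;
        while ((r = in.read()) != -1) {
            char c = (char) r;
            if (active == null) {
                for (Delim p : patterns) {
                    if (p.match(c)) { active = p; break; }
                }
                if (active != null) {
                    // Drop partial progress of the other patterns and start collecting.
                    for (Delim p : patterns) if (p != active) p.pos = 0;
                    buf.setLength(0);
                }
            } else {
                buf.append(c);
                if (active.match(c)) {
                    buf.setLength(buf.length() - active.close.length); // strip closing delimiter
                    hits.add(active.id + ":" + buf);
                    active = null;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        List<Delim> patterns = Arrays.asList(
            new Delim(1, "<span class=\"filename\"><a href=\"", "\""));
        String html = "<span class=\"filename\"><a href=\"http://example.com/foo\">foo</a>";
        System.out.println(scan(new StringReader(html), patterns));
    }
}
```

Because the delimiters are literal char arrays, the variant with an extra id attribute shown below is not matched by this sketch, which is the limitation the question is about.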
While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like
<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>
At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.
Unfortunately, the Scanner class only matches one pattern, not a list of patterns. What alternatives could I use? It should be lightweight and work on Android.
Comments (3)
You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:
The file "test.txt" contains the text of your question, and the output is:
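The code and output that belonged to this answer are not reproduced here. As a stand-in only, here is a minimal sketch of the idea using java.util.regex; the class name SpanLinkExtractor, the method extractUrls and the exact expression are assumptions, not the answerer's code. It allows any other attributes inside the span tag and captures the href value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanLinkExtractor {
    // <span ...class="filename"...> followed by <a href="...">; group 1 is the URL.
    // [^>]* on both sides of the class attribute tolerates extra attributes such as id.
    private static final Pattern LINK = Pattern.compile(
        "<span\\b[^>]*\\bclass=\"filename\"[^>]*>\\s*<a\\s+href=\"([^\"]*)\"");

    /** Returns every captured href, in document order. */
    static List<String> extractUrls(CharSequence html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<span class=\"filename\"><a href=\"http://example.com/foo\">foo</a>\n"
          + "<span class=\"filename\" id=\"234217\"><a href=\"http://example.com/foo\">foo</a>";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

This works on text that is already in memory. For a large stream, one assumption-laden option is to feed it line by line, which only helps if a match never spans a line break.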
The Scanner.useDelimiter(Pattern) API seems to be what you're looking for. You would have to use a pattern string whose alternatives are separated by OR (|).
This pattern can get really complicated really quickly, though.
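A rough sketch of what that might look like, assuming the opening and closing 'tags' from the question are used as the two alternatives of the delimiter. The class DelimiterTokens and its tokens method are hypothetical names; Scanner, useDelimiter, hasNext and next are the standard java.util.Scanner calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class DelimiterTokens {
    // Two alternatives joined with '|':
    //  1. the opening "tag", tolerating extra attributes on the span
    //  2. the closing quote followed by '>'
    static final Pattern DELIMS = Pattern.compile(
        "<span class=\"filename\"[^>]*><a href=\"" + "|" + "\">");

    /** Returns everything the Scanner yields between two delimiter matches. */
    static List<String> tokens(Readable source) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(source)) {
            sc.useDelimiter(DELIMS);
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }
}
```

Note that the tokens are not only the URLs: the text between a closing delimiter and the next opening one (for example the link text and the closing tag) also comes back as a token, so the caller still has to work out which token belongs to which pattern. That bookkeeping is part of why the delimiter expression and the surrounding code grow quickly.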
You are right to think this has all been done before :) What you are talking about is a problem of tokenizing and parsing, and I therefore suggest you consider JavaCC.
There is something of a learning curve with JavaCC as you learn to understand its grammar, so below is an implementation to get you started.
The grammar is a chopped-down version of the standard JavaCC grammar for HTML. You can add more productions for matching other patterns.