解析输入流以获取多种模式

发布于 2024-11-01 18:03:33 字数 901 浏览 10 评论 0 原文

我正在解析 InputStream 的某些模式以从中提取值，例如我不想

<span class="filename"><a href="http://example.com/foo">foo</a>

使用完整的 html 解析器，因为我对文档结构不感兴趣，而只对一些定义明确的信息位感兴趣。（只有他们的顺序很重要）
目前我正在使用一种非常简单的方法，我为每个模式都有一个对象，其中包含开始和结束“标签”的 char[] （在示例中，开始将是以获取 url) 和位置标记。对于 InputStream 读取的每个字符，我会迭代所有模式并调用 match(char) 函数，一旦开始模式匹配，该函数就会返回 true，从那时起，我会在 StringBuilder 中收集以下字符直到现在活动的模式再次 match() 。然后，我使用模式 ID 和读取的字符串调用一个函数，以进一步处理它。虽然这在大多数情况下工作得很好，但我想在模式中包含正则表达式，所以我也可以匹配类似的东西，

<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>

在这一点上，我确信我会重新发明轮子，因为这肯定是以前做过的，而且我不'首先我真的不想编写自己的正则表达式解析器。但是，我找不到任何可以满足我需求的东西。
不幸的是，Scanner 类仅匹配一种模式，而不是模式列表，我可以使用哪些替代方案？它不应该很重并且可以与 Android 一起使用。

原文

I am parsing an InputStream for certain patterns to extract values from it, e.g. I would have something like

<span class="filename"><a href="http://example.com/foo">foo</a>

I don't want to use a full fledged html parser as I am not interested in the document structure but only in some well defined bits of information. (Only their order is important)
Currently I am using a very simple approach, I have an Object for each Pattern that contains a char[] of the opening and closing 'tag' (in the example opening would be <span class="filename"><a href="and closing " to get the url) and a position marker. For each character read by of the InputStream, I iterate over all Patterns and call the match(char) function that returns true once the opening pattern does match, from then on I collect the following chars in a StringBuilder until the now active pattern does match() again. I then call a function with the ID of the Pattern and the String read, to process it further.
While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like

<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>

At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.
Unfortunately the Scanner class only matches one pattern, not a list of patterns, what alternatives could I use? It should not be heavy and work with Android.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梦晓ヶ微光ヅ倾城 2024-11-08 18:03:33

您的意思是您想要将任何元素与给定的 class 属性相匹配，而不考虑它可能具有的其他属性？这很简单：

Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
Pattern p = Pattern.compile(
    "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
while (sc.findWithinHorizon(p, 0) != null)
{
  MatchResult m = sc.match();
  System.out.println(m.group(1));
}

文件“test.txt”包含您的问题的文本，输出为：

http://example.com/foo
and closing
http://example.com/foo

You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:

Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
Pattern p = Pattern.compile(
    "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
while (sc.findWithinHorizon(p, 0) != null)
{
  MatchResult m = sc.match();
  System.out.println(m.group(1));
}

The file "test.txt" contains the text of your question, and the output is:

http://example.com/foo
and closing
http://example.com/foo

回复收藏 0 原文

一枫情书 2024-11-08 18:03:33

Scanner.useDelimiter(Pattern) API 似乎就是您正在寻找的。您必须使用 OR (|) 分隔的模式字符串。

不过，这种模式很快就会变得非常复杂。

回复收藏 0 原文

甜心 2024-11-08 18:03:33

你认为这一切以前都已经完成了，这是正确的:)你所谈论的是标记化和解析的问题，因此我建议你考虑JavaCC。

当您学习理解 JavaCC 的语法时，有一个学习曲线，因此下面是一个帮助您入门的实现。

该语法是HTML 标准 JavaCC 语法的简化版本。您可以添加更多作品来匹配其他图案。

options {
  JDK_VERSION = "1.5";
  static = false;
}

PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
  private String currentTag;
  private String currentSpanClass;
  private String currentHref;

  public static void main(String args []) throws ParseException {
    System.out.println("Starting parse");
    eg1 parser = new eg1(System.in);
    parser.parse();
    System.out.println("Finishing parse");
  }
}

PARSER_END(eg1)

SKIP :
{
    <       ( " " | "\t" | "\n" | "\r" )+   >
|   <       "<!" ( ~[">"] )* ">"            >
}

TOKEN :
{
    <STAGO:     "<"                 >   : TAG
|   <ETAGO:     "</"                >   : TAG
|   <PCDATA:    ( ~["<"] )+         >
}

<TAG> TOKEN [IGNORE_CASE] :
{
    <A:      "a"              >   : ATTLIST
|   <SPAN:   "span"           >   : ATTLIST
|   <DONT_CARE: (["a"-"z"] | ["0"-"9"])+  >   : ATTLIST
}

<ATTLIST> SKIP :
{
    <       " " | "\t" | "\n" | "\r"    >
|   <       "--"                        >   : ATTCOMM
}

<ATTLIST> TOKEN :
{
    <TAGC:      ">"             >   : DEFAULT
|   <A_EQ:      "="             >   : ATTRVAL

|   <#ALPHA:    ["a"-"z","A"-"Z","_","-","."]   >
|   <#NUM:      ["0"-"9"]                       >
|   <#ALPHANUM: <ALPHA> | <NUM>                 >
|   <A_NAME:    <ALPHA> ( <ALPHANUM> )*         >

}

<ATTRVAL> TOKEN :
{
    <CDATA:     "'"  ( ~["'"] )* "'"
        |       "\"" ( ~["\""] )* "\""
        | ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
                            >   : ATTLIST
}

<ATTCOMM> SKIP :
{
    <       ( ~["-"] )+         >
|   <       "-" ( ~["-"] )+         >
|   <       "--"                >   : ATTLIST
}



void attribute(Map<String,String> attrs) :
{
    Token n, v = null;
}
{
    n=<A_NAME> [ <A_EQ> v=<CDATA> ]
    {
        String attval;
        if (v == null) {
            attval = "#DEFAULT";
        } else {
            attval = v.image;
            if( attval.startsWith("\"") && attval.endsWith("\"") ) {
              attval = attval.substring(1,attval.length()-1);
            } else if( attval.startsWith("'") && attval.endsWith("'") ) {
              attval = attval.substring(1,attval.length()-1);
            }
        }
        if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
    }
}

void attList(Map<String,String> attrs) : {}
{
    ( attribute(attrs) )+
}


void tagAStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <A> [ attList(attrs) ] <TAGC>
    {
      currentHref=attrs.get("href");    
      if( currentHref != null && "filename".equals(currentSpanClass) )
      {
        System.out.println("Found URL: "+currentHref);
      }
    }
}

void tagAEnd() : {}
{
    <ETAGO> <A> <TAGC>
    {
      currentHref=null;
    }
}

void tagSpanStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <SPAN> [ attList(attrs) ] <TAGC>
    {
      currentSpanClass=attrs.get("class");
    }
}

void tagSpanEnd() : {}
{
    <ETAGO> <SPAN> <TAGC>
    {
      currentSpanClass=null;
    }
}

void tagDontCareStart() : {}
{
   <STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}

void tagDontCareEnd() : {}
{
   <ETAGO> <DONT_CARE> <TAGC>
}

void parse() : {}
{
    (
      LOOKAHEAD(2) tagAStart() |
      LOOKAHEAD(2) tagAEnd() |
      LOOKAHEAD(2) tagSpanStart() |
      LOOKAHEAD(2) tagSpanEnd() |
      LOOKAHEAD(2) tagDontCareStart() |
      LOOKAHEAD(2) tagDontCareEnd() |
      <PCDATA>
    )*
}

You are right to think this has all been done before :) What you are talking about is a problem of tokenizing and parsing and I therefore suggest you consider JavaCC.

There is something of a learning curve with JavaCC as you learn to understand it's grammar, so below is an implementation to get you started.

The grammar is a chopped down version of the standard JavaCC grammar for HTML. You can add more productions for matching other patterns.

options {
  JDK_VERSION = "1.5";
  static = false;
}

PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
  private String currentTag;
  private String currentSpanClass;
  private String currentHref;

  public static void main(String args []) throws ParseException {
    System.out.println("Starting parse");
    eg1 parser = new eg1(System.in);
    parser.parse();
    System.out.println("Finishing parse");
  }
}

PARSER_END(eg1)

SKIP :
{
    <       ( " " | "\t" | "\n" | "\r" )+   >
|   <       "<!" ( ~[">"] )* ">"            >
}

TOKEN :
{
    <STAGO:     "<"                 >   : TAG
|   <ETAGO:     "</"                >   : TAG
|   <PCDATA:    ( ~["<"] )+         >
}

<TAG> TOKEN [IGNORE_CASE] :
{
    <A:      "a"              >   : ATTLIST
|   <SPAN:   "span"           >   : ATTLIST
|   <DONT_CARE: (["a"-"z"] | ["0"-"9"])+  >   : ATTLIST
}

<ATTLIST> SKIP :
{
    <       " " | "\t" | "\n" | "\r"    >
|   <       "--"                        >   : ATTCOMM
}

<ATTLIST> TOKEN :
{
    <TAGC:      ">"             >   : DEFAULT
|   <A_EQ:      "="             >   : ATTRVAL

|   <#ALPHA:    ["a"-"z","A"-"Z","_","-","."]   >
|   <#NUM:      ["0"-"9"]                       >
|   <#ALPHANUM: <ALPHA> | <NUM>                 >
|   <A_NAME:    <ALPHA> ( <ALPHANUM> )*         >

}

<ATTRVAL> TOKEN :
{
    <CDATA:     "'"  ( ~["'"] )* "'"
        |       "\"" ( ~["\""] )* "\""
        | ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
                            >   : ATTLIST
}

<ATTCOMM> SKIP :
{
    <       ( ~["-"] )+         >
|   <       "-" ( ~["-"] )+         >
|   <       "--"                >   : ATTLIST
}



void attribute(Map<String,String> attrs) :
{
    Token n, v = null;
}
{
    n=<A_NAME> [ <A_EQ> v=<CDATA> ]
    {
        String attval;
        if (v == null) {
            attval = "#DEFAULT";
        } else {
            attval = v.image;
            if( attval.startsWith("\"") && attval.endsWith("\"") ) {
              attval = attval.substring(1,attval.length()-1);
            } else if( attval.startsWith("'") && attval.endsWith("'") ) {
              attval = attval.substring(1,attval.length()-1);
            }
        }
        if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
    }
}

void attList(Map<String,String> attrs) : {}
{
    ( attribute(attrs) )+
}


void tagAStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <A> [ attList(attrs) ] <TAGC>
    {
      currentHref=attrs.get("href");    
      if( currentHref != null && "filename".equals(currentSpanClass) )
      {
        System.out.println("Found URL: "+currentHref);
      }
    }
}

void tagAEnd() : {}
{
    <ETAGO> <A> <TAGC>
    {
      currentHref=null;
    }
}

void tagSpanStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <SPAN> [ attList(attrs) ] <TAGC>
    {
      currentSpanClass=attrs.get("class");
    }
}

void tagSpanEnd() : {}
{
    <ETAGO> <SPAN> <TAGC>
    {
      currentSpanClass=null;
    }
}

void tagDontCareStart() : {}
{
   <STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}

void tagDontCareEnd() : {}
{
   <ETAGO> <DONT_CARE> <TAGC>
}

void parse() : {}
{
    (
      LOOKAHEAD(2) tagAStart() |
      LOOKAHEAD(2) tagAEnd() |
      LOOKAHEAD(2) tagSpanStart() |
      LOOKAHEAD(2) tagSpanEnd() |
      LOOKAHEAD(2) tagDontCareStart() |
      LOOKAHEAD(2) tagDontCareEnd() |
      <PCDATA>
    )*
}

回复收藏 0 原文

~没有更多了~