Java regex or an XML parser?

Posted 2025-01-02 13:36:13 · 340 characters · 1 view · 0 comments

I want to remove any tags such as

<p>hello <namespace:tag : a>hello</namespace:tag></p>

to become

 <p> hello hello </p>

What is the best way to do this? If the answer is a regex: for some reason mine is not working. Can anyone help?

(<|</)[:]{1,2}[^</>]>

edit:
added

回忆躺在深渊里 2025-01-09 13:36:13

Definitely use an XML parser. Regular expressions should not be used to parse *ML.

烦人精 2025-01-09 13:36:13

You should not use regex for this; use a parser such as lxml or BeautifulSoup:

>>> import lxml.html as lxht
>>> myString = '<p>hello <namespace:tag : a>hello</namespace:tag></p>'
>>> lxht.fromstring(myString).text_content()
'hello hello'

Here is a reason why you should not parse HTML/XML with regex.
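
One concrete illustration, as a minimal Java sketch (the class name NaiveTagRegexDemo and the sample input are my own assumptions, not from the thread, and this may not be the reason the answer originally pointed to): a typical "strip every tag" pattern goes wrong as soon as an attribute value contains a `>` character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample inputs are illustrative only.
class NaiveTagRegexDemo {
    // A common "strip every tag" pattern: '<', any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String markup) {
        return TAG.matcher(markup).replaceAll("");
    }
}
```

For `<p>hello</p>` this returns `hello`, but for `<a title="x > y">link</a>` the match ends at the `>` inside the attribute value, so the result is ` y">link` instead of `link`. A parser knows that the `>` sits inside a quoted value; the pattern does not.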

彻夜缠绵 2025-01-09 13:36:13

If you're just trying to pull the plain text out of some simple XML, the best approach (fastest, with the smallest memory footprint) is to run a for loop over the data:

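For example, in Java:

// Assumes the whole input has already been read into a String.
static String stripTags(String data) {
    boolean inMarkup = false;
    StringBuilder text = new StringBuilder();
    for (int i = 0; i < data.length(); i++) {
        char c = data.charAt(i);
        if (c == '<') inMarkup = true;          // entering a tag
        else if (c == '>') inMarkup = false;    // leaving a tag
        else if (!inMarkup) text.append(c);     // keep only characters outside tags
    }
    return text.toString();
}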

Note: This will break if you encounter things like CDATA, JavaScript, or CSS in your parsing.

So, to sum up: if it's simple, do something like the above rather than a regular expression. If it isn't that simple, listen to the other guys and use an advanced parser.
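
The loop above drops every tag, whereas the question wants to keep `<p>` and remove only the prefixed `namespace:tag` elements. A minimal sketch of that variant, using the same single-pass idea and only the Java standard library, might look like the following. The class name NamespacedTagStripper and the rule "a tag is removed when its name contains a colon" are my assumptions about what the asker wants, not something stated in the thread.

```java
// Hypothetical helper; class and method names are illustrative, not from the thread.
class NamespacedTagStripper {
    /** Drops tags whose name contains ':' and keeps every other tag and all text. */
    static String strip(String data) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < data.length()) {
            char c = data.charAt(i);
            if (c != '<') {                 // plain text: copy it through
                out.append(c);
                i++;
                continue;
            }
            int close = data.indexOf('>', i);
            if (close < 0) {                // unterminated '<': keep the rest verbatim
                out.append(data, i, data.length());
                break;
            }
            String tag = data.substring(i, close + 1);
            if (!hasPrefixedName(tag)) {
                out.append(tag);            // ordinary tag such as <p> or </p>
            }
            i = close + 1;
        }
        return out.toString();
    }

    // The tag name runs from just after '<' or '</' to the first whitespace, '/' or '>'.
    private static boolean hasPrefixedName(String tag) {
        int start = tag.startsWith("</") ? 2 : 1;
        int end = start;
        while (end < tag.length()
                && !Character.isWhitespace(tag.charAt(end))
                && tag.charAt(end) != '/'
                && tag.charAt(end) != '>') {
            end++;
        }
        return tag.substring(start, end).indexOf(':') >= 0;
    }
}
```

Calling `NamespacedTagStripper.strip("<p>hello <namespace:tag : a>hello</namespace:tag></p>")` returns `<p>hello hello</p>`. The same caveats apply as for the plain loop: CDATA sections, comments, script or style content, and a `>` inside an attribute value will all confuse it, so for anything beyond simple input a real parser is still the safer choice.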

网白 2025-01-09 13:36:13

This is a solution I personally used for a similar problem in Java. The library used is Jsoup: http://jsoup.org/.

In my particular case I had to unwrap tags that had an attribute with a particular value. You can see that reflected in this code; it's not the exact solution to this problem, but it could put you on your way.

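  // Assumed imports: the jsoup classes are certain; Validate and StringUtils are
  // presumably Apache Commons Lang (the lang3 package is a guess).
  import org.apache.commons.lang3.StringUtils;
  import org.apache.commons.lang3.Validate;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Document.OutputSettings;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  public static String unWrapTag(String html, String tagName, String attribute, String matchRegEx) {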
    Validate.notNull(html, "html must be non null");
    Validate.isTrue(StringUtils.isNotBlank(tagName), "tagName must be non blank");
    if (StringUtils.isNotBlank(attribute)) {
      Validate.notNull(matchRegEx, "matchRegEx must be non null when an attribute is provided");
    }    
    Document doc = Jsoup.parse(html);
    OutputSettings outputSettings = doc.outputSettings();
    outputSettings.prettyPrint(false);
    Elements elements = doc.getElementsByTag(tagName);
    for (Element element : elements) {
      if(StringUtils.isBlank(attribute)){
        element.unwrap();
      }else{
        String attr = element.attr(attribute);
        if(!StringUtils.isBlank(attr)){
          String newData = attr.replaceAll(matchRegEx, "");
          if(StringUtils.isBlank(newData)){
            element.unwrap();
          }
        }        
      }
    }
    return doc.html();
  }
