Using a regular expression to convert tag content to lipsum

Posted 2024-11-24 01:50:08

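I'm de-branding a microsite so I can use it as a portfolio piece. It's built in static HTML, and I need to replace the content of every non-script tag with lipsum or even scrambled text, but the replacement has to have the same number of characters as the current text so the formatting stays intact. I'd also much rather do this with a GUI grep editor than write a script, because there may be some tags whose content I need to keep.

I'm using the regex \>([^$]+?)\< to find them (all of the scripts begin with $, so it skips script tags), but I can't find any way to count the characters in a match and replace it with the corresponding number of lipsum or random characters.

Thanks for any help!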


猫烠⑼条掵仅有一顆心 2024-12-01 01:50:08

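I was able to do this successfully, though I ended up having to use a Java program. It turns out regex is fine here because I'm not parsing the whole document, just a few parts of it. There are a few quirks, but this got the job done.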

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// StdIn, StdOut and LoremIpsum4J are not part of the JDK; they come from
// helper libraries that must be on the classpath.
public class Debrander {

    public static void main(String[] args) {

        // Read the whole HTML page from standard input
        String htmlPage = StdIn.readAll();

        // Match the text between '>' and the next '<', unless that '<'
        // begins a closing </script> or </style> tag
        Pattern tagContentRegex = Pattern.compile("\\>(.*?)\\<(?!/script)(?!/style)");
        Matcher myMatcher = tagContentRegex.matcher(htmlPage);

        // Separate regex that finds a NON-WHITESPACE character
        Pattern whiteRegex = Pattern.compile("[^\\s]");

        StringBuffer sb = new StringBuffer();

        LoremIpsum4J loremIpsum = new LoremIpsum4J();
        loremIpsum.setStartWithLoremIpsum(false);

        // Loop through all matches
        while (myMatcher.find()) {
            String tagContent = myMatcher.group(1);
            // Only replace content that has at least one non-whitespace character
            if (whiteRegex.matcher(tagContent).find()) {
                int charCount = tagContent.length();

                String[] lipsum = loremIpsum.getBytes(charCount);
                StringBuilder replaceString = new StringBuilder(">");
                for (String piece : lipsum) {
                    replaceString.append(piece);
                }
                replaceString.append("<");
                // quoteReplacement keeps any '$' or '\' in the filler literal
                myMatcher.appendReplacement(sb, Matcher.quoteReplacement(replaceString.toString()));
            }
        }
        myMatcher.appendTail(sb);
        StdOut.println(sb.toString());
    }
}
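
For anyone without those helper libraries, here is a rough sketch of the same idea using only the JDK (Java 9 or later). It is not the code I actually ran. The class name SameLengthFiller, the methods fill and debrand, and the LETTERS filler string are all made up for illustration. Instead of asking a lipsum library for a given number of characters, it swaps each non-whitespace character for a filler letter and leaves whitespace where it was, so the length and word boundaries stay identical. It also skips whole script and style blocks with a regex alternative rather than a lookahead, which is a different approach from the program above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SameLengthFiller {

    // Hypothetical filler source (lipsum with the spaces removed); any text would do.
    private static final String LETTERS =
        "loremipsumdolorsitametconsecteturadipiscingelitseddoeiusmodtempor";

    // Alternative 1 swallows whole <script>/<style> blocks so they are left alone.
    // Alternative 2 captures the text between a '>' and the next '<'.
    private static final Pattern TEXT = Pattern.compile(
        "<(script|style)\\b.*?</\\1\\s*>|>([^<>]+)(?=<)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replaces every non-whitespace character with a filler letter and keeps
    // whitespace in place, so the result has exactly the same length.
    static String fill(String original, int offset) {
        StringBuilder out = new StringBuilder(original.length());
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            out.append(Character.isWhitespace(c)
                ? c
                : LETTERS.charAt((offset + i) % LETTERS.length()));
        }
        return out.toString();
    }

    // Rewrites the text between tags; script and style blocks pass through unchanged.
    static String debrand(String html) {
        Matcher m = TEXT.matcher(html);
        StringBuilder sb = new StringBuilder(html.length());
        int offset = 0; // advances so successive tags do not all start with the same letters
        while (m.find()) {
            String text = m.group(2);
            String replacement = m.group(); // script/style block: keep as matched
            if (text != null) {
                replacement = ">" + fill(text, offset);
                offset += text.length();
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello world</p><script>if (a < b) { x(); }</script>";
        System.out.println(debrand(html));
    }
}
```

This still has quirks of its own: HTML entities such as &amp; get overwritten letter by letter, and a '>' inside an attribute value or a comment will confuse it. To keep the content of particular tags, you would have to add them to the first alternative alongside script and style.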

