How to create a regular expression for parsing Arabic dates

Posted 2024-09-11 00:12:12

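I'm working on a program that runs a series of regexes to try to find a date within the DOM of a webpage. For example, in www.engadget.com/2010/07/19/windows-phone-7-in-depth-preview/, I would match "Jul 19th 2010" with my regex. Things were going fine across multiple formats and languages until I hit an Arabic webpage. As an example, consider http://islammaktoob.maktoobblog.com/. The date July 18, 2010 appears in Arabic at the top of the post, but I can't figure out how to match it. Does anyone have experience matching Arabic dates? If someone could post an example, or the regex they would use to match that Arabic date, it would be very helpful. Thank you!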

Update:

Getting closer:

    String fromTheSite = "كتبها اسلام مكتوب ، في 18 تموز 2010 الساعة: 09:42 ص";
    NamedMatcher infoMatcher = NamedPattern.compile("(?<Day>[0-3]?[0-9]) (?<Month>يناير|فبراير|مارس|أبريل|إبريل|مايو|يونيو|يونيه|يوليو|يوليه|أغسطس|سبتمبر|أكتوبر|نوفمبر|ديسمبر|كانون الثاني|شباط|آذار|نيسان|أيار|حزيران|تموز|آب|أيلول|تشرين الأول|تشرين الثاني|كانون الأول) (?<Year>[1-2][0-9][0-9][0-9]) ", Pattern.CANON_EQ).matcher(fromTheSite);
    while(infoMatcher.find()){
        System.out.println(infoMatcher.group());
        System.out.println(infoMatcher.group("Day"));
        System.out.println(infoMatcher.group("Month"));
        System.out.println(infoMatcher.group("Year"));
    }

Gives me

18 تموز 2010
18
تموز
2010

Why does the match appear to be out of order?
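
One way to investigate this, offered as a sketch rather than a confirmed diagnosis: Java stores the string in logical (reading) order, while a console or editor that applies bidirectional text rendering may draw the Arabic run right to left, so the printed line can look reversed even when the match itself is fine. The example below uses the standard java.util.regex named groups (Java 7 and later) in place of the NamedPattern helper from the question, and compares the start offsets of the three groups. The class name, the method names, the shortened sample string, and the two-entry month list are all assumptions made for illustration; the month names are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
final class ArabicDateSketch {
    // Shortened sample: the words "fi 18 Tammuz 2010" from the question's string.
    static final String SAMPLE = "\u0641\u064A 18 \u062A\u0645\u0648\u0632 2010";

    // Only two spellings of July are listed here (Tammuz and Yulyu);
    // the full alternation from the question would go in the Month group.
    // The trailing space after the year is dropped so a date at the end matches.
    private static final Pattern DATE = Pattern.compile(
        "(?<Day>[0-3]?[0-9]) "
        + "(?<Month>\u062A\u0645\u0648\u0632|\u064A\u0648\u0644\u064A\u0648) "
        + "(?<Year>[12][0-9]{3})");

    /** Returns {day, month, year} for the first date found, or null if none. */
    static String[] firstDate(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("Day"), m.group("Month"), m.group("Year") };
    }

    /** True if day, month, and year appear in that order in the stored string. */
    static boolean inLogicalOrder(String text) {
        Matcher m = DATE.matcher(text);
        return m.find()
            && m.start("Day") < m.start("Month")
            && m.start("Month") < m.start("Year");
    }

    public static void main(String[] args) {
        String[] parts = firstDate(SAMPLE);
        // Printing each group on its own line avoids mixing directions in one line.
        for (String part : parts) {
            System.out.println(part);
        }
        System.out.println("logical order: " + inLogicalOrder(SAMPLE));
    }
}
```

If the offsets come out ascending, the regex is matching day, month, and year in the order they were written, and any reversal in the combined line is most likely a display effect of the terminal or IDE rather than a matching problem.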
