Recommended way to parse dates when presented in a variety of formats

Posted on 2024-10-18 02:05:13

I have a collection of dates stored as strings, entered by users over a period of time. Since they came from humans with little or no validation, the formats vary widely. Below are some examples (the leading numbers are for reference only):

  1. 20th, 21st August 1897
  2. 31st May, 1st June 1909
  3. 29th January 2007
  4. 10th, 11th, 12th May 1954
  5. 26th, 27th, 28th, 29th, 30th March 2006
  6. 27th, 28th, 29th, 30th November, 1st December 2006

I would like to parse these dates in C# to end up with sets of DateTime objects, one DateTime object per day. So (1) above would result in 2 DateTime objects and (6) would result in 5 DateTime objects.

Comments (2)

影子是时光的心 2024-10-25 02:05:13

I would recommend generalizing them first (basically replace the numbers and names with placeholders), then grouping by similar format so you have a sample group to work with.

For example, 20th, 21st August 1897 becomes [number][postfix], [number][postfix] [month] [year] (given that <number><st|nd|rd|th> is recognized as a number plus postfix, months are obvious, and years are 4-digit numbers).

From there, you can find out how many entries follow each pattern, and how many unique patterns you need to match. Then you at least have a sample to test whatever algorithm you want to use (regex is probably going to be your best bet, since it can detect repeated patterns (#th[, $th[, ...]]) and day names).
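The generalization step described above could be sketched roughly as follows. This is only an illustration in Python (picked for brevity, not because the thread uses it); the function name generalize and the placeholder spellings are assumptions, and it presumes English month names and ordinal-suffixed day numbers.

```python
import re
from collections import Counter

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

def generalize(text: str) -> str:
    """Replace concrete values with placeholders, keeping the punctuation."""
    # Years first, so a 4-digit year is never mistaken for a day number.
    text = re.sub(r"\b[0-9]{4}\b", "[year]", text)
    # Day numbers with an ordinal suffix: 1st, 2nd, 3rd, 20th, ...
    text = re.sub(r"\b[0-9]{1,2}(?:st|nd|rd|th)\b", "[number][postfix]",
                  text, flags=re.IGNORECASE)
    # Full English month names.
    text = re.sub(rf"\b(?:{MONTHS})\b", "[month]", text, flags=re.IGNORECASE)
    return text

samples = [
    "20th, 21st August 1897",
    "31st May, 1st June 1909",
    "29th January 2007",
    "10th, 11th, 12th May 1954",
]

# Count how many inputs share each generalized pattern.
patterns = Counter(generalize(s) for s in samples)
for pattern, count in patterns.most_common():
    print(count, pattern)
```

Each distinct line printed is one pattern that the parser has to handle. Runs of days of different lengths show up as separate patterns here; collapsing repeated "[number][postfix], " groups into one would reduce the count further.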


It appears you probably want to break it down by pattern (given what you've provided). So, for instance, first break out the yearly information:

(.*?)([0-9]{4})(?:, |$)

Then you need to break it down into months:

(.*?)(January|February|...)(?:, |$)

Then you want days contained within that month:

(?:([0-9]{1,2})(?:st|nd|rd|th)(?:, )?)*(?:, |$)

Then it's a matter of compiling the information. But again, that's just going by what you've put in front of me. Ultimately you need to know what kind of data you're working with and how you want to tackle it.


Updated

So, I couldn't help but try to tackle this on my own. I wanted to prove that the method I was suggesting was somewhat accurate and that I wasn't blowing smoke up your skirt. Having said that, this is what I came up with. Note that this is in PHP for a couple of reasons:

  1. PHP was easier for me to get my hands on
  2. I felt that if this was a viable solution, you should have to work at porting it over. :grin:

Anyways, here's the source and demo output. Enjoy.

<?php
  $samples = array(
    '20th, 21st August 1897',
    '31st May, 1st June 1909',
    '29th January 2007',
    '10th, 11th, 12th May 1954',
    '26th, 27th, 28th, 29th, 30th March 2006',
    '27th, 28th, 29th, 30th November, 1st December 2006',
    '30th, 31st, December 2010, 1st, 2nd January 2011'
  );

  //header('Content-Type: text/plain');

  $months = array('january','february','march','april','may','june','july','august','september','october','november','december');

  foreach ($samples as $sample)
  {
    $dates = array();

    // find yearly information first
    $yearly = null;
    if (preg_match_all('/(?:^|\s)(?<month>.*?)\s?(?<year>[0-9]{4})(?:$|,)/',$sample,$yearly))
    {//var_dump($yearly);
      for ($y = 0; $y < count($yearly[0]); $y++)
      {
        $year = $yearly['year'][$y];
        //echo "year: {$year}\r\n";

        $monthly = null;
        if (preg_match_all('/(?<days>(?:(?:^|\s)[0-9]{1,2}(?:st|nd|rd|th),?)*)\s?(?<month>'.implode('|',$months).')$/i',$yearly['month'][$y],$monthly))
        {//var_dump($monthly);
          for ($m = 0; $m < count($monthly[0]); $m++)
          {
            $month = $monthly['month'][$m];
            //echo "month: {$month}\r\n";

            $daily = null;
            if (preg_match_all('/(?:^|\s)(?<day>[0-9]{1,2})(?:st|nd|rd|th)(?:,|$)/i',$monthly['days'][$m],$daily))
            {//var_dump($daily);
              for ($d = 0; $d < count($daily[0]); $d++)
              {
                $day = $daily['day'][$d];
                //echo "day: {$day}\r\n";

                $dates[] = sprintf("%d-%d-%d", array_search(strtolower($month),$months)+1, $day, $year);
              }
            }
          }
        }
        $data = $yearly[1];
      }
    }

    echo "<p><b>{$sample}</b> was parsed to include:</p><ul>\r\n";
    foreach ($dates as $date)
      echo "<li>{$date}</li>\r\n";
    echo "</ul>\r\n";
  }
?>

20th, 21st August 1897 was parsed to include:

  • 8-20-1897
  • 8-21-1897

31st May, 1st June 1909 was parsed to include:

  • 6-1-1909

29th January 2007 was parsed to include:

  • 1-29-2007

10th, 11th, 12th May 1954 was parsed to include:

  • 5-10-1954
  • 5-11-1954
  • 5-12-1954

26th, 27th, 28th, 29th, 30th March 2006 was parsed to include:

  • 3-26-2006
  • 3-27-2006
  • 3-28-2006
  • 3-29-2006
  • 3-30-2006

27th, 28th, 29th, 30th November, 1st December 2006 was parsed to include:

  • 12-1-2006

30th, 31st, December 2010, 1st, 2nd January 2011 was parsed to include:

  • 12-30-2010
  • 12-31-2010
  • 1-1-2011
  • 1-2-2011

And to prove there's nothing up my sleeve, http://www.ideone.com/GGMaH
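In the demo output above, the two samples that list two months before a single year ("31st May, 1st June 1909" and "27th, 28th, 29th, 30th November, 1st December 2006") yield only the dates of the last month. As a rough sketch of one way to pick up every month group, here is the same year, then month, then day breakdown in Python (used only for brevity). The names parse_dates, YEAR_SEGMENT and MONTH_GROUP are illustrative and not part of the answer, and the sketch assumes English month names and ordinal-suffixed days.

```python
import re
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# One year segment: everything up to and including the next 4-digit year.
YEAR_SEGMENT = re.compile(r"(?P<body>.*?)\b(?P<year>[0-9]{4})\b")

# One month group inside a segment: a run of day tokens, then a month name.
MONTH_GROUP = re.compile(
    rf"(?P<days>(?:[0-9]{{1,2}}(?:st|nd|rd|th)[\s,]*)+)"
    rf"(?P<month>{'|'.join(MONTHS)})",
    re.IGNORECASE,
)

DAY = re.compile(r"[0-9]{1,2}")

def parse_dates(text):
    """Return one date per day mentioned, in the order written."""
    dates = []
    for segment in YEAR_SEGMENT.finditer(text):
        year = int(segment.group("year"))
        # finditer (rather than a single end-anchored match) visits
        # every month group that precedes this year.
        for group in MONTH_GROUP.finditer(segment.group("body")):
            month = MONTHS.index(group.group("month").lower()) + 1
            for day in DAY.findall(group.group("days")):
                # date() raises ValueError for impossible days (e.g. 31 June).
                dates.append(date(year, month, int(day)))
    return dates

for d in parse_dates("27th, 28th, 29th, 30th November, 1st December 2006"):
    print(d.isoformat())
```

Under these assumptions the last sample in the question produces five dates, four in November and one in December, which is the count the question asks for.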

长途伴 2024-10-25 02:05:13

I thought some more about this and the solution became obvious: tokenize the string and parse the tokens in reverse order. This retrieves the year, then the month, then the day(s). Here is my solution:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// **** Start definition of the class MyGlobals ****
public static class MyGlobals
{
    // Maps an English month name to its 1-based month number.
    static Dictionary<string, int> _month2Int = new Dictionary<string, int>
    {
        {"January", 1},
        {"February", 2},
        {"March", 3},
        {"April", 4},
        {"May", 5},
        {"June", 6},
        {"July", 7},
        {"August", 8},
        {"September", 9},
        {"October", 10},
        {"November", 11},
        {"December", 12}
    };
    static public int GetMonthAsInt(string month)
    {
        return _month2Int[month];
    }
}


public class MyClass
{
    static char[] gDateSeparators = new char[2] { ',', ' ' };

    // Anchored so that a token has to match in full, not merely contain a match.
    static Regex gDayRegex = new Regex("^[0-9][0-9]?(st|nd|rd|th)$");
    static Regex gMonthRegex = new Regex("^(January|February|March|April|May|June|July|August|September|October|November|December)$");
    static Regex gYearRegex = new Regex("^[0-9]{4}$");

    public void ParseMatchDate(string matchDate)
    {
        Stack<DateTime> matchDateTimes = new Stack<DateTime>();
        string[] tokens = matchDate.Split(gDateSeparators,StringSplitOptions.RemoveEmptyEntries);
        // int.MinValue marks "not seen yet"; new DateTime(...) throws if a day
        // token is reached before a year and a month have been read.
        int curYear = int.MinValue;
        int curMonth = int.MinValue;
        int curDay = int.MinValue;

        // Walk the tokens from last to first: year, then month, then day(s).
        for (int pos = tokens.Length-1; pos >= 0; --pos)
        {
            if (gYearRegex.IsMatch(tokens[pos]))
            {
                curYear = int.Parse(tokens[pos]);
            }
            else if (gMonthRegex.IsMatch(tokens[pos]))
            {
                curMonth = MyGlobals.GetMonthAsInt(tokens[pos]);
            }
            else if (gDayRegex.IsMatch(tokens[pos]))
            {
                string tok = tokens[pos];
                // Strip the two-letter ordinal suffix (st, nd, rd, th).
                curDay = int.Parse(tok.Substring(0,(tok.Length-2)));
                // Dates are in reverse order, so using a stack means we'll pull em off in the correct order
                matchDateTimes.Push(new DateTime(curYear, curMonth, curDay));
            }
        }

        // Now get the datetimes
        while (matchDateTimes.Count > 0)
        {
            // Pop each date; without this the loop would never terminate.
            DateTime date = matchDateTimes.Pop();
            // Do something with date here
        }
    }

}
