Formatting sentences in a string using C#

Posted on 2024-08-19 15:39:08

I have a string with multiple sentences. How do I capitalize the first letter of the first word in every sentence? Something like paragraph formatting in Word.

E.g. "this is some code. the code is in C#. "
The output must be "This is some code. The code is in C#."

One way would be to split the string on '.', capitalize the first letter of each piece, and then rejoin.

Is there a better solution?
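
For reference, the split-and-rejoin idea described in the question might look roughly like the sketch below. The class and method names (SplitSentenceCase, Capitalise, CapitaliseFirstLetter) are made up for illustration, and only '.' is treated as a sentence terminator, so abbreviations such as "e.g." would be capitalised incorrectly.

```csharp
using System.Linq;

public static class SplitSentenceCase
{
    // Hypothetical helper: split on '.', upper-case the first letter
    // of each piece, then rejoin the pieces with '.'.
    public static string Capitalise(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var parts = input.Split('.').Select(p => CapitaliseFirstLetter(p));
        return string.Join(".", parts);
    }

    private static string CapitaliseFirstLetter(string part)
    {
        // Skip leading whitespace so " the code" becomes " The code".
        int i = 0;
        while (i < part.Length && char.IsWhiteSpace(part[i])) i++;
        if (i == part.Length) return part; // empty or whitespace-only piece

        return part.Substring(0, i) + char.ToUpper(part[i]) + part.Substring(i + 1);
    }
}
```

Every piece allocates a few intermediate strings, which is one reason the answers below look at regex and StringBuilder alternatives.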

燕归巢 2024-08-26 15:39:08

In my opinion, when it comes to potentially complex rules-based string matching and replacing, you can't get much better than a Regex-based solution (despite the fact that they are so hard to read!). I'd expect it to offer the best performance and memory efficiency; you'll be surprised at just how fast this'll be.

I'd use the Regex.Replace overload that accepts an input string, regex pattern and a MatchEvaluator delegate. A MatchEvaluator is a function that accepts a Match object as input and returns a string replacement.

Here's the code:

public static string Capitalise(string input)
{
  //now the first character
  return Regex.Replace(input, @"(?<=(^|[.;:])\s*)[a-z]",
    (match) => { return match.Value.ToUpper(); });
}

The regex uses the (?<=) construct (zero-width positive lookbehind) to restrict matches to a-z characters preceded by the start of the string or by one of the punctuation marks you want, optionally followed by whitespace. In the [.;:] bit you can add any extra ones you need (e.g. [.;:?""] to add the ? and " characters; the " is doubled because the pattern is a verbatim string).

This means, also, that your MatchEvaluator doesn't have to do any unnecessary string joining (which you want to avoid for performance reasons).
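
As a sketch of extending the character class as described above, the variant below also treats ?, ! and the ASCII double quote as triggers. The class name LookbehindSentenceCase and the chosen set of marks are assumptions for illustration, not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindSentenceCase
{
    // Assumed trigger set for illustration: . ; : ? ! and the ASCII double quote.
    // The quote is written as "" because this is a verbatim string.
    private const string Pattern = @"(?<=(^|[.;:?!""])\s*)[a-z]";

    public static string Capitalise(string input)
    {
        // The lookbehind is zero-width, so each match is a single letter
        // and the evaluator only has to upper-case that one character.
        return Regex.Replace(input, Pattern, m => m.Value.ToUpperInvariant());
    }
}
```

Because [a-z] only matches unaccented lower-case ASCII letters, text in other scripts would need a broader class such as \p{Ll}.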

All the other stuff mentioned by one of the other answerers about using the RegexOptions.Compiled is also relevant from a performance point of view. The static Regex.Replace method does offer very similar performance benefits, though (there's just an additional dictionary lookup).

Like I say - I'll be surprised if any of the other non-regex solutions here will work better and be as fast.

EDIT

I have put this solution up against Ahmad's, as he quite rightly pointed out that a look-around might be less efficient than doing it his way.

Here's the crude benchmark I did:

public string LowerCaseLipsum
{
  get
  {
    //went to lipsum.com and generated 10 paragraphs of lipsum
    //which I then initialised into the backing field with @"[lipsumtext]".ToLower()
    return _lowerCaseLipsum;
  }
}

[TestMethod]
public void CapitaliseAhmadsWay()
{
  List<string> results = new List<string>();
  DateTime start = DateTime.Now;
  Regex r = new Regex(@"(^|\p{P}\s+)(\w+)", RegexOptions.Compiled);
  for (int f = 0; f < 1000; f++)
  {
    results.Add(r.Replace(LowerCaseLipsum, m => m.Groups[1].Value
                     + m.Groups[2].Value.Substring(0, 1).ToUpper()
                     + m.Groups[2].Value.Substring(1)));
  }
  TimeSpan duration = DateTime.Now - start;
  Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}

[TestMethod]
public void CapitaliseLookAroundWay()
{
  List<string> results = new List<string>();
  DateTime start = DateTime.Now;
  Regex r = new Regex(@"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);
  for (int f = 0; f < 1000; f++)
  {
    results.Add(r.Replace(LowerCaseLipsum, m => m.Value.ToUpper()));
  }
  TimeSpan duration = DateTime.Now - start;
  Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}

In a release build, my solution was about 12% faster than Ahmad's (1.48 seconds as opposed to 1.68 seconds).

Interestingly, however, if it was done through the static Regex.Replace method, both were about 80% slower, and my solution was slower than Ahmad's.
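
Given that result, one way to avoid the static-method overhead is to build the Regex once and reuse the instance. This is only a sketch; the class name CachedSentenceCase and the field name FirstLetter are assumptions, and whether it helps in your application is worth measuring rather than taking from the crude numbers above.

```csharp
using System.Text.RegularExpressions;

public static class CachedSentenceCase
{
    // Constructed once per process and reused on every call.
    // RegexOptions.Compiled trades a higher construction cost for faster matching.
    private static readonly Regex FirstLetter =
        new Regex(@"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);

    public static string Capitalise(string input)
    {
        return FirstLetter.Replace(input, m => m.Value.ToUpperInvariant());
    }
}
```

Regex instances are safe to share between threads, so a static readonly field like this can be used from concurrent callers.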

川水往事 2024-08-26 15:39:08

Here's a regex solution that uses the punctuation category to avoid having to specify .!?" etc., although you should certainly check whether it covers your needs or set the characters explicitly. Read up on the "P" category under the "Supported Unicode General Categories" section of the MSDN Character Classes page.

string input = @"this is some code. the code is in C#? it's great! In ""quotes."" after quotes.";
string pattern = @"(^|\p{P}\s+)(\w+)";

// compiled for performance (might want to benchmark it for your loop)
Regex rx = new Regex(pattern, RegexOptions.Compiled);

string result = rx.Replace(input, m => m.Groups[1].Value
                                + m.Groups[2].Value.Substring(0, 1).ToUpper()
                                + m.Groups[2].Value.Substring(1));

If you decide not to use the \p{P} class you would have to specify the characters yourself, similar to:

string pattern = @"(^|[.?!""]\s+)(\w+)";

EDIT: below is an updated example to demonstrate 3 patterns. The first shows how all punctuations affect casing. The second shows how to pick and choose certain punctuation categories by using class subtraction. It uses all punctuations while removing specific punctuation groups. The third is similar to the 2nd but using different groups.

The MSDN link doesn't spell out what some of the punctuation categories refer to, so here's a breakdown:

  • P: all punctuations (comprises all of the categories below)
  • Pc: underscore _
  • Pd: dash -
  • Ps: open parenthesis, brackets and braces ( [ {
  • Pe: closing parenthesis, brackets and braces ) ] }
  • Pi: initial single/double quotes (MSDN says it "may behave like Ps/Pe depending on usage")
  • Pf: final single/double quotes (MSDN Pi note applies)
  • Po: other punctuation such as commas, colons, semi-colons and slashes ,, :, ;, \, /
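
If you are unsure which of the categories above a given character belongs to, you can ask the framework directly. The sketch below maps char.GetUnicodeCategory results to the short names used in \p{...}; the class and method names (PunctuationCategories, ShortName) are made up for this example.

```csharp
using System.Globalization;

public static class PunctuationCategories
{
    // Returns the short \p{...} name for a punctuation character,
    // or null when the character is not punctuation at all.
    public static string ShortName(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.ConnectorPunctuation:    return "Pc";
            case UnicodeCategory.DashPunctuation:         return "Pd";
            case UnicodeCategory.OpenPunctuation:         return "Ps";
            case UnicodeCategory.ClosePunctuation:        return "Pe";
            case UnicodeCategory.InitialQuotePunctuation: return "Pi";
            case UnicodeCategory.FinalQuotePunctuation:   return "Pf";
            case UnicodeCategory.OtherPunctuation:        return "Po";
            default:                                      return null;
        }
    }
}
```

One thing this makes visible: the straight ASCII quotes ' and " are classified as Po, while Pi and Pf cover typographic quotes such as “ and ”, so excluding Po also excludes the straight quotes.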

Carefully compare how the results are affected by these groups. This should grant you a great degree of flexibility. If this doesn't seem desirable then you may use specific characters in a character class as shown earlier.

string input = @"foo ( parens ) bar { braces } foo [ brackets ] bar. single ' quote & "" double "" quote.
dash - test. Connector _ test. Comma, test. Semicolon; test. Colon: test. Slash / test. Slash \ test.";

string[] patterns = { 
    @"(^|\p{P}\s+)(\w+)", // all punctuation chars
    @"(^|[\p{P}-[\p{Pc}\p{Pd}\p{Ps}\p{Pe}]]\s+)(\w+)", // all punctuation chars except Pc/Pd/Ps/Pe
    @"(^|[\p{P}-[\p{Po}]]\s+)(\w+)" // all punctuation chars except Po
};

// RegexOptions.Compiled is left out here to keep the example short
foreach (string pattern in patterns)
{
    Console.WriteLine("*** Current pattern: {0}", pattern);
    string result = Regex.Replace(input, pattern,
                            m => m.Groups[1].Value
                                 + m.Groups[2].Value.Substring(0, 1).ToUpper()
                                 + m.Groups[2].Value.Substring(1));
    Console.WriteLine(result);
    Console.WriteLine();
}

Notice that "Dash" is not capitalized using the last pattern and it's on a new line. One way to make it capitalized is to use the RegexOptions.Multiline option. Try the above snippet with that to see if it meets your desired result.

Also, for the sake of example, I didn't use RegexOptions.Compiled in the above loop. To use both options OR them together: RegexOptions.Compiled | RegexOptions.Multiline.
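
A minimal sketch of combining the two options with the last pattern (all punctuation except Po) is shown below. The class name MultilineSentenceCase and the field name are assumptions for illustration.

```csharp
using System.Text.RegularExpressions;

public static class MultilineSentenceCase
{
    // Multiline makes ^ match at the start of every line, not just the string,
    // so the first word on each line is capitalised even though '.' (Po) is excluded.
    private static readonly Regex WordAfterBreak = new Regex(
        @"(^|[\p{P}-[\p{Po}]]\s+)(\w+)",
        RegexOptions.Compiled | RegexOptions.Multiline);

    public static string Capitalise(string input)
    {
        return WordAfterBreak.Replace(input, m =>
            m.Groups[1].Value
            + char.ToUpperInvariant(m.Groups[2].Value[0])
            + m.Groups[2].Value.Substring(1));
    }
}
```

With this, "dash" at the start of the second line of the sample input becomes "Dash", and the word after the dash is still capitalised because - belongs to Pd.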

要走就滚别墨迹 2024-08-26 15:39:08

You have a few different options:

  1. Your approach of splitting the string, capitalizing and then re-joining
  2. Using regular expressions to perform a replace of the expressions (which can be a bit tricky for case)
  3. Write a C# iterator that iterates over each character and yields a new IEnumerable<char> with the first letter after a period in upper case. May offer benefit of a streaming solution.
  4. Loop over each char and upper-case those that appear immediately after a period (whitespace ignored) - a StringBuilder may make this easier.

The code below takes the last approach, looping over the characters and appending to a StringBuilder:

// requires: using System.Text;
public static string ToSentenceCase( string someString )
{
  var sb = new StringBuilder( someString.Length );
  bool wasPeriodLastSeen = true; // We want first letter to be capitalized
  foreach( var c in someString )
  {
      // IsWhiteSpace and ToUpper are static methods on char, not instance members
      if( wasPeriodLastSeen && !char.IsWhiteSpace( c ) )
      {
          sb.Append( char.ToUpper( c ) );
          wasPeriodLastSeen = false;
      }
      else
      {
          if( c == '.' )  // you may want to expand this to other punctuation
              wasPeriodLastSeen = true;
          sb.Append( c );
      }
  }

  return sb.ToString();
}
绳情 2024-08-26 15:39:08

I don't know why, but I decided to give yield return a try, based on what LBushkin had suggested. Just for fun.

static IEnumerable<char> CapitalLetters(string sentence)
{
    //capitalize first letter
    bool capitalize = true;
    char lastLetter;
    for (int i = 0; i < sentence.Length; i++)
    {
        lastLetter = sentence[i];
        yield return (capitalize) ? Char.ToUpper(sentence[i]) : sentence[i];

        if (Char.IsWhiteSpace(lastLetter) && capitalize == true)
            continue;

        capitalize = false;
        if (lastLetter == '.' || lastLetter == '!') //etc
            capitalize = true;
    }
}

To use it (ToArray needs using System.Linq):

string sentence = new String(CapitalLetters("this is some code. the code is in C#.").ToArray());
夏见 2024-08-26 15:39:08
  1. Do your work in a StringBuilder.
  2. Lowercase the whole thing.
  3. Loop through and uppercase leading chars.
  4. Call ToString.
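
The four steps above can be sketched as follows, using .NET's StringBuilder. The class name BuilderSentenceCase and the choice of '.', '!' and '?' as sentence terminators are assumptions for illustration.

```csharp
using System.Text;

public static class BuilderSentenceCase
{
    public static string Capitalise(string input)
    {
        // Steps 1 and 2: work in a StringBuilder holding the lower-cased text.
        var sb = new StringBuilder(input.ToLowerInvariant());
        bool atSentenceStart = true; // the very first character starts a sentence

        // Step 3: upper-case the first non-whitespace character of each sentence.
        for (int i = 0; i < sb.Length; i++)
        {
            char c = sb[i];
            if (c == '.' || c == '!' || c == '?')
            {
                atSentenceStart = true;
            }
            else if (atSentenceStart && !char.IsWhiteSpace(c))
            {
                sb[i] = char.ToUpperInvariant(c);
                atSentenceStart = false;
            }
        }

        // Step 4: materialise the result.
        return sb.ToString();
    }
}
```

Note that step 2 lower-cases everything first, so "C#" in the question's example comes out as "c#"; skip that step if existing capitals should be preserved.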