What character can Java use to parse paragraphs?

Posted on 2024-08-20 18:13:59

I'm sure folks will get a good laugh out of this one, but for the life of me I cannot find a separator that will indicate when a new paragraph has begun in a string of text. Word and line? Easy peasy, but paragraph seems to be much harder to find. I've tried two line breaks in a row and the Unicode representations of paragraph break and line break, with no luck.

EDIT: I apologize for the vagueness of my original question. To answer some of the questions: it is a basic text file originally created on Windows. I'm testing some code for opening and analyzing its contents with the BlackBerry JDE 4.5 using the RIM Eclipse plugin. While the source of the files will be Windows (at least for the foreseeable future) and they will be basic text, I have no control over how they are created (they come from a third-party source, and I don't have access to the way they are created).

Comments (6)

月亮是我掰弯的 2024-08-27 18:13:59

There is no such paragraph break character in common usage.

You might be able to get away with assuming that two or more line breaks in a row (with optional horizontal whitespace between them) indicate a paragraph break. But there are numerous exceptions to this "rule". For example, when a paragraph

  • is interrupted by a floating figure, or
  • contains bullet points

and then continues on ... like this one. For that kind of thing, there is probably no solution.
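The "two or more line breaks in a row" heuristic could be sketched roughly as follows. This assumes a Java SE runtime with java.util.regex, which the asker's Java ME environment does not provide; the class name BlankLineSplitter and its split method are made up for illustration. Atomic groups keep a single \r\n from being mistaken for two separate breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: splits text wherever two or more line breaks occur
// in a row, allowing spaces or tabs on the otherwise empty lines.
final class BlankLineSplitter {

    // One line break (\r\n, \r or \n), followed by one or more further line
    // breaks, each optionally preceded by horizontal whitespace. The atomic
    // groups (?>...) stop the engine from splitting a lone \r\n into \r + \n.
    private static final Pattern PARAGRAPH_BREAK =
            Pattern.compile("(?>\\r\\n|\\r|\\n)(?:[ \\t]*(?>\\r\\n|\\r|\\n))+");

    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : PARAGRAPH_BREAK.split(text)) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {   // skip leading or whitespace-only chunks
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First paragraph,\r\nsecond line.\r\n \t\r\nSecond paragraph.\n\n\nThird.";
        for (String paragraph : split(text)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

Single line breaks inside a paragraph are left untouched, so a paragraph interrupted by a figure or a bullet list would still come out as several pieces, as described above.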

EDIT per @Aiden's comment below. (It is now clear that this is not relevant to the OP, but it may be relevant to others who find the question via Google, etc)

Instead of trying to reverse engineer paragraphs from text, perhaps you should consider specifying that your input should be in (for example) Markdown syntax, as supported by Stack Overflow. The Markdown Wiki includes links to Markdown parser implementations in many languages, including Java.

(This assumes that you have some control over the input format of the text you are trying to parse into paragraphs, etc.)

落墨 2024-08-27 18:13:59

Paragraphs in plain text documents are usually separated by two or more line separators. A line separator may be a linefeed (\n), a carriage-return (\r), or a carriage-return followed by a linefeed (\r\n). These three kinds of separator are typically associated with operating systems, but any application is free to write text using any kind of line separator. In fact, text that's been assembled from diverse sources (like a web page) may well contain two or more kinds of separator. When your app reads text, no matter what platform it's running on, it should always check for all three kinds of line separator.

BufferedReader#readLine() does that, but of course it only reads one line at a time. Simple prose will usually be returned as an alternating sequence of non-empty lines representing paragraphs, and empty lines representing the spaces between them. But don't count on it; watch for multiple empty lines, and be aware that "empty" lines may in fact contain whitespace characters like space (\u0020) and TAB (\u0009).
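A minimal sketch of that line-by-line approach, assuming a platform where java.io.BufferedReader is available. The class name ParagraphReader and the choice to join the lines of a paragraph with a single space are illustrative assumptions, not something the answer prescribes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups the lines returned by readLine() into
// paragraphs, treating empty and whitespace-only lines as separators.
final class ParagraphReader {

    static List<String> readParagraphs(Reader source) throws IOException {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        BufferedReader reader = new BufferedReader(source);
        String line;
        // readLine() recognises \n, \r and \r\n as line terminators.
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();   // also strips spaces and tabs
            if (trimmed.isEmpty()) {
                // One or more "empty" lines end the current paragraph.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        if (current.length() > 0) {         // last paragraph, no trailing blank line
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) throws IOException {
        String text = "Line one\r\nline two\r\n\r\n \t\r\nNext paragraph\rstill next\n";
        for (String paragraph : readParagraphs(new StringReader(text))) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

Because several consecutive blank lines only flush the buffer once, multiple empty lines do not produce empty paragraphs.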

If you choose not to go with a BufferedReader, you may have to write the detection code from scratch. Java ME doesn't include regex support, so split() and java.util.Scanner are not available; and StringTokenizer makes no distinction between a single delimiter character and several in a row unless you use the returnDelims option. Then it returns the delimiters one character at a time, so you still have to write your own code to figure out what kind of separator you're looking at, if any.
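Hand-written detection along these lines might look like the sketch below. It deliberately sticks to String, StringBuffer and a raw java.util.Vector, on the assumption that these are the kinds of classes a Java ME target offers; that assumption should be checked against the actual device profile. The class name ManualParagraphScanner is hypothetical. A \r\n pair counts as one line break, and two or more line breaks (with optional spaces or tabs between them) end a paragraph.

```java
import java.util.Vector;

// Hypothetical scanner that uses no regex, no generics and no Scanner.
final class ManualParagraphScanner {

    /** Returns the paragraphs of the text as a Vector of String. */
    static Vector split(String text) {
        Vector paragraphs = new Vector();
        StringBuffer current = new StringBuffer();
        int lineBreaks = 0;   // line breaks seen since the last visible character
        int n = text.length();
        for (int i = 0; i < n; i++) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                // Treat the pair \r\n as a single line break.
                if (c == '\r' && i + 1 < n && text.charAt(i + 1) == '\n') {
                    i++;
                }
                lineBreaks++;
            } else if ((c == ' ' || c == '\t') && lineBreaks > 0) {
                // Whitespace at the start of a line: skip it, so that
                // "empty" lines holding only spaces or tabs stay empty.
            } else {
                if (lineBreaks >= 2) {
                    flush(current, paragraphs);   // blank line: paragraph ended
                } else if (lineBreaks == 1 && current.length() > 0
                        && current.charAt(current.length() - 1) != ' ') {
                    current.append(' ');          // single break: same paragraph
                }
                lineBreaks = 0;
                current.append(c);
            }
        }
        flush(current, paragraphs);
        return paragraphs;
    }

    // Stores the buffered paragraph, if it has any visible content.
    private static void flush(StringBuffer current, Vector out) {
        String paragraph = current.toString().trim();
        if (paragraph.length() > 0) {
            out.addElement(paragraph);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        Vector result = split("One\r\ntwo\r\n \t\r\nThree\r\rFour\n");
        for (int i = 0; i < result.size(); i++) {
            System.out.println("[" + result.elementAt(i) + "]");
        }
    }
}
```

On a Java SE compiler the raw Vector only triggers an unchecked-operations note; it is left raw here to stay close to older source levels.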

吃素的狼 2024-08-27 18:13:59

It is possible that instead of a line feed you need to look for a CR LF sequence (\r\n); obviously the answer depends on the text format.

浅唱々樱花落 2024-08-27 18:13:59
String lineSeparator = System.getProperty("line.separator");

This returns the platform's default line separator.

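Thus, for example, the following should work when the text uses the platform's line separator (note that it splits at every line break, so each line is treated as a paragraph):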

String[] paragraphs = text.split(lineSeparator);
七分※倦醒 2024-08-27 18:13:59

I assume you have a text file and not a complex document like MS-Word or RTF.

The concept of a paragraph in a text document is not well defined. In most cases a new paragraph is recognized by the fact that, when you open the document in a text editor, you see the next block of text starting on a new line.

There are two special characters, namely new-line (LF - '\n') and carriage-return (CR - '\r'), that cause the text to start on the next line. Which character is used depends on the operating system. Furthermore, sometimes a combination of both is used, such as CRLF ('\r\n').

In Java you can determine the character or set of characters used to separate lines/paragraphs using System.getProperty("line.separator");. But this brings in a new problem. What if you create a text file in MS Windows and then open it in Unix? The line separator in the text file is then that of Windows, but Java is running on Unix.

My recommendation is:

IF the length of the text (document) is zero, THEN paragraphs = 0.

IF the length of the text (document) is NOT zero, THEN

  • Consider '\n' and '\r' as line
    break characters.
  • Scan your text for the above line break
    characters.
  • Any run of consecutive line break characters,
    in any order, should be considered
    one paragraph break.
  • Number of paragraphs = 1 + (count of
    paragraph breaks)

Note that the exceptions pointed out by Stephen apply here as well.

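import java.util.StringTokenizer;

public class ParagraphTest {

    public static void main(String[] args) {
        // Sample text mixing LF, CR and LF+CR line breaks.
        String document =
                    "Hello world.\n" + 
                    "This is line 2.\n\r" + 
                    "Line 3 here.\r" + 
                    "Yet another line 4.\n\r\n\r" + 
                    "Few more lines 5.\r";
        printParaCount(document);
    }

    // Prints the number of non-empty chunks between runs of '\r' and '\n'.
    public static void printParaCount(String document) {
        // Every character in this string is a delimiter; consecutive
        // delimiters are collapsed, so no empty tokens are returned.
        String lineBreakCharacters = "\r\n";
        StringTokenizer st = new StringTokenizer(
                    document, lineBreakCharacters);
        System.out.println("ParaCount: " + st.countTokens());
    }

}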

Output

ParaCount: 5
超可爱的懒熊 2024-08-27 18:13:59

First, your best bet would be to define what a paragraph is: whether it is delimited by a line break, a double line break, or a line break followed by a tab. Assuming that you have no control over the input and want to determine the number of paragraphs in various samples of text, any of these situations may exist. Furthermore, they might be used for the same purpose within the same document. So some analysis is needed, and keep in mind that it won't be 100% accurate all the time.

Start by initializing the various possible paragraph breaks:

  • "\r"
  • "\r\n"
  • "\n"
  • System.getProperty("line.separator")

and all of those, but twice, and all those variations with an additional tab character ('\t') on the end.

The inefficient way to do this would be to load the input into a string and then call buffer.split().length to determine how many paragraphs there were. The efficient, scalable way would be to use a stream and go over the input, taking into account how long the paragraph is, and throwing out those paragraphs beneath a given "threshold". A more advanced algorithm might even switch what it considers to be a paragraph after it encounters a switch in the way line breaks are handled (several very short lines, or several very long ones, for example).
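The streaming variant could be sketched as below. It assumes that a blank line (two or more line breaks, optionally with spaces or tabs between them) ends a paragraph, and that "length" means the number of non-whitespace characters; both are illustrative choices, as are the class name StreamingParagraphCounter and the minLength parameter standing in for the "threshold".

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical counter: reads the input one character at a time and counts
// only paragraphs with at least minLength non-whitespace characters.
final class StreamingParagraphCounter {

    static int count(Reader in, int minLength) throws IOException {
        int paragraphs = 0;
        int length = 0;       // non-whitespace characters in the current paragraph
        int lineBreaks = 0;   // line breaks seen since the last visible character
        int prev = -1;
        int c;
        // For files, pass a buffered reader so read() is not a system call each time.
        while ((c = in.read()) != -1) {
            if (c == '\n' && prev == '\r') {
                // Second half of a \r\n pair: already counted with the \r.
            } else if (c == '\r' || c == '\n') {
                lineBreaks++;
            } else if (c == ' ' || c == '\t') {
                // Horizontal whitespace never counts towards the length.
            } else {
                if (lineBreaks >= 2) {
                    // A blank line separated this character from the previous text.
                    if (length > 0 && length >= minLength) {
                        paragraphs++;
                    }
                    length = 0;
                }
                lineBreaks = 0;
                length++;
            }
            prev = c;
        }
        if (length > 0 && length >= minLength) {   // final paragraph
            paragraphs++;
        }
        return paragraphs;
    }

    public static void main(String[] args) throws IOException {
        String text = "Title\n\nA real paragraph with some text.\r\n\r\nok\n\n"
                + "Another full paragraph here.";
        System.out.println(count(new StringReader(text), 10));
    }
}
```

Only a few counters are kept in memory, so the input size does not matter; the threshold discards short fragments such as titles, at the cost of also discarding genuinely short paragraphs.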

And all of this assumes that you are dealing with unformatted text without section titles, etc. What it comes down to is that asking how many paragraphs are in a particular piece of text is like asking how many weeks are in a year. It's not exactly 52, but it's around there.
