StringTokenizer tokenization problem

Posted on 2024-08-25 17:13:06

String a ="the STRING TOKENIZER CLASS ALLOWS an APPLICATION to BREAK a STRING into TOKENS.  ";

StringTokenizer st = new StringTokenizer(a);
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}

Given the code above, the output is as follows:

the
STRING TOKENIZER CLASS
ALLOWS
an
APPLICATION
to
BREAK
a
STRING
into
TOKENS.

My only question is: why has "STRING TOKENIZER CLASS" been combined into one token?

When I run this code,

System.out.println("STRING TOKENIZER CLASS".contains(" "));

it printed a strange result:

false

That doesn't sound logical, right? I have no idea what went wrong.

I found out the reason: somehow Java does not recognize that space as a regular space. But I don't know how it ended up like that between the front-end processing and the code I've posted.

I should point out that the code below runs before the code above:

if (!suspectedContentCollector.isEmpty()) {
    // Assumes suspectedContentCollector is a collection of String.
    Iterator<String> i = suspectedContentCollector.iterator();
    String temp = "";
    while (i.hasNext()) {
        temp += i.next().toLowerCase() + " ";
    }
    StringTokenizer st = new StringTokenizer(temp);

    while (st.hasMoreTokens()) {
        temp = st.nextToken();
        temp = StopWordsRemover.remove(temp);
        analyzedSentence = analyzedSentence.replace(temp, temp.toUpperCase());
    }
}

Hence, once the text has been changed to UPPERCASE, something seems to have gone wrong somewhere, and I realized that only certain spaces were not recognized. Could the cause be the way the text is retrieved from the document?

The following code,

String a ="the STRING TOKENIZER CLASS ALLOWS an APPLICATION to BREAK a STRING into TOKENS. ";
for (int i : a.toCharArray()) {
    System.out.print(i + " ");
}

produced the following output:

116 104 101 32 83 84 82 73 78 71 160 84 79 75 69 78 73 90 69 82 160 67 76 65 83 83 32 65 76 76 79 87 83 32 97 110 32 65 80 80 76 73 67 65 84 73 79 78 32 116 111 32 66 82 69 65 75 32 97 32 83 84 82 73 78 71 32 105 110 116 111 32 84 79 75 69 78 83 46 160 32


一笔一画续写前缘 2024-09-01 17:13:06

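The answer is in the code snippet you added. The listed integers show that the space after the word STRING is character 160, which is a non-breaking space (&nbsp; in HTML), rather than character 32, which is a regular space. Edit your original string and replace the spaces in STRING TOKENIZER CLASS with real spaces instead of shifted spaces.

Just a side note from the 1.4.2 Javadoc:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.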
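As a rough sketch of what following that Javadoc advice might look like for this string, the snippet below uses String.split with a character class that covers both ordinary whitespace and U+00A0. The class name SplitDemo and the helper tokens are made up for illustration, and the sample string is a shortened stand-in for the one in the question.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SplitDemo {
    // Hypothetical helper: splits on runs of ordinary whitespace or U+00A0.
    static List<String> tokens(String text) {
        // Strip separators at both ends first so split() does not yield an empty first token.
        String trimmed = text.replaceAll("^[\\s\\u00A0]+|[\\s\\u00A0]+$", "");
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("[\\s\\u00A0]+"));
    }

    public static void main(String[] args) {
        // Shortened sample with non-breaking spaces written as explicit escapes.
        String a = "the STRING\u00A0TOKENIZER\u00A0CLASS ALLOWS an APPLICATION ";
        System.out.println(tokens(a));
        // prints [the, STRING, TOKENIZER, CLASS, ALLOWS, an, APPLICATION]
    }
}
```

Note that a plain split("\\s+") would not be enough here, because \s in java.util.regex does not match U+00A0 by default.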


奢欲 2024-09-01 17:13:06

Looking at the character codes, the 'space' in question is 0xA0, which is intended to be a non-breaking space. My guess is that it was entered deliberately so that 'STRING TOKENIZER CLASS' is treated as one word.

The solution (if you indeed deem it correct to break up 'STRING TOKENIZER CLASS' into three words) would be to add the non-breaking space as a delimiter for the StringTokenizer class (or, respectively, to the pattern passed to String.split()). E.g.

  new StringTokenizer(string, " \t\n\r\f\240")
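To see the effect of that delimiter string, here is a small self-contained sketch. The class name NbspTokenizerDemo and the helper tokenize are invented for the example, the sample string is a shortened version of the one in the question, and the non-breaking space is written as a Unicode escape, which denotes the same character as the octal escape above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class NbspTokenizerDemo {
    // The default StringTokenizer delimiters plus U+00A0 (non-breaking space).
    static final String DELIMS = " \t\n\r\f\u00A0";

    // Hypothetical helper: collects all tokens for the given delimiter set.
    static List<String> tokenize(String text, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "the STRING\u00A0TOKENIZER\u00A0CLASS ALLOWS";
        // Default delimiters: the three words joined by U+00A0 stay one token.
        System.out.println(tokenize(a, " \t\n\r\f").size()); // prints 3
        // With U+00A0 added, every word becomes its own token.
        System.out.println(tokenize(a, DELIMS).size());      // prints 5
    }
}
```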


就此别过 2024-09-01 17:13:06

Is it possible that you used something other than a normal ASCII space in "STRING TOKENIZER CLASS"? Maybe you were holding down Shift and got a shifted space there?
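One way to check this without reading raw character codes by eye is sketched below. It reports every character that Unicode classifies as a space separator but that is not the plain U+0020 space. The class name OddSpaceFinder and the method findOddSpaces are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.List;

class OddSpaceFinder {
    // Returns "index:U+XXXX" for each space-like character other than U+0020.
    static List<String> findOddSpaces(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.isSpaceChar is true for U+00A0; Character.isWhitespace is not.
            if (c != ' ' && Character.isSpaceChar(c)) {
                hits.add(i + ":" + String.format("U+%04X", (int) c));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String s = "STRING\u00A0TOKENIZER\u00A0CLASS";
        System.out.println(s.contains(" "));  // prints false, as in the question
        System.out.println(findOddSpaces(s)); // prints [6:U+00A0, 16:U+00A0]
    }
}
```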


蒗幽 2024-09-01 17:13:06

Do us all a favor and copy and paste the output of this snippet:

    for (int i : a.toCharArray()) {
        System.out.print(i + " ");
    }

OK, now looking at the output, it confirms what we've all been suspecting: those "spaces" are character 160 (U+00A0), the non-breaking space. It's a different character from the regular space, ASCII 32.

You can have the tokenizer (which is obsolete, as others have said) include character 160 as a delimiter, or you can filter it out of the input string if it's not supposed to be there in the first place.

For now, a = a.replace((char) 160, (char) 32); before tokenizing is a quick fix.
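A minimal sketch of that quick fix in context, assuming the text can simply be normalized once before it reaches the tokenizer. NbspNormalizeDemo and normalizeSpaces are illustrative names, not part of the original code.

```java
import java.util.StringTokenizer;

class NbspNormalizeDemo {
    // Replaces every U+00A0 (character 160) with a regular space (character 32).
    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        String a = "the STRING\u00A0TOKENIZER\u00A0CLASS ALLOWS";
        String cleaned = normalizeSpaces(a);
        System.out.println(cleaned.contains("STRING TOKENIZER CLASS")); // prints true
        System.out.println(new StringTokenizer(cleaned).countTokens()); // prints 5
    }
}
```

If the non-breaking spaces come from the document the text was extracted from, doing this normalization right after extraction would probably keep the later replace and tokenize steps consistent with each other.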
