I wrote some code to read a long text file. The file contains 10000 English words. I then want to use split() to get all the words so I can train on them. The code looks like this:

with open('/train.txt', 'r') as fin:
    text = fin.read()  # read the whole file into one string
len(text)  # result is 10000
len(text.split())  # result is 2800

It only gets 2800 words of the text when using split(), but I think it should cover the whole text, and both len() results should be the same, 10000.
Why? Is it due to a limitation of my computer, or is there a problem with my text?

The output of text is as follows:

The output of text.split() is as follows:


len(text) is the total number of characters in the file 'train.txt', not the number of words (assuming ASCII text, this will be the same as your file size in bytes).

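len(text.split(...)) is the total number of tokens in the file, as determined by your delimiter. Called with no argument, split() splits on runs of whitespace, so 2800 is the number of words in your file.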

Side note: assuming your delimiter is \n, you can cross-check this on Unix with cat train.txt | wc -l (or use wc -w to count whitespace-separated words).
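
To make the difference concrete, here is a minimal sketch using a small made-up string in place of the contents of train.txt (the sample text and variable names are assumptions for illustration, not taken from the question). It shows that len() on the string counts characters, while len() on the result of split() counts tokens:

```python
# Hypothetical sample standing in for the contents of train.txt.
text = "the quick brown fox\njumps over the lazy dog\n"

# Every character counts, including spaces and newlines.
num_chars = len(text)

# With no argument, split() breaks on runs of whitespace, giving words.
num_words = len(text.split())

# Number of newline characters, which is what `wc -l` reports.
num_newlines = text.count("\n")

print(num_chars, num_words, num_newlines)
```

The character count is always much larger than the word count for ordinary prose, which is why the two len() calls in the question disagree.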

