cut: stdin: Illegal byte sequence on Mac OS
I'm new to bash and regex. I'm using the TIC dataset from UCI to practice.
https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/
I'm trying to extract the column names from the file TicDataDescr.txt which looks something like this:
TICEVAL2000.txt:
Dataset for predictions (4000 customer records). It has the same format as TICDATA2000.txt, only the target is missing. Participants are supposed to return the list of predicted targets only. All datasets are in tab delimited format.
The meaning of the attributes and attribute values is given below.
TICTGTS2000.txt
Targets for the evaluation set.
DATADICTIONARY
Nr Name Description Domain
1 MOSTYPE Customer Subtype see L0
2 MAANTHUI Number of houses 1 ñ 10
3 MGEMOMV Avg size household 1 ñ 6
4 MGEMLEEF Avg age see L1
5 MOSHOOFD Customer main type see L2
6 MGODRK Roman catholic see L3
...
I want to extract only the second column (Name) of the data dictionary:
MOSTYPE
MAANTHUI
MGEMOMV
MGEMLEEF
MOSHOOFD
MGODRK
...
I'm trying to do it with the following code:
egrep "^[0-9]+\s[A-Z][A-Z]+" TicDataDescr.txt | cut -d' ' -f2
But I'm encountering the following error:
MOSTYPE
cut: stdin: Illegal byte sequence
I've tried converting the file using dos2unix:
dos2unix TicDataDescr.txt
but I still get the same error.
Why am I running into this error? Also, is there a way around it using regex? Thanks for the help.
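For reference, a minimal sketch of one possible workaround. It assumes the shell runs in a UTF-8 locale and that the file contains a byte that is not valid UTF-8 (the "ñ" in the Domain column suggests a dash stored in a single-byte encoding). The sample file built here, and the 0x96 byte in it, are hypothetical stand-ins for TicDataDescr.txt, not its actual contents.

```shell
# Build a small sample containing a raw 0x96 byte
# (assumption: the real file has a similar non-UTF-8 byte where "ñ" shows up).
tmp=$(mktemp)
printf '1 MOSTYPE Customer Subtype see L0\n' > "$tmp"
printf '2 MAANTHUI Number of houses 1 \226 10\n' >> "$tmp"
printf '3 MGEMOMV Avg size household 1 \226 6\n' >> "$tmp"

# LC_ALL=C makes grep and cut handle the input as plain bytes instead of
# decoding it as UTF-8; -a tells grep to treat the file as text.
result=$(LC_ALL=C grep -a -E '^[0-9]+[[:space:]]+[A-Z][A-Z]+' "$tmp" | LC_ALL=C cut -d' ' -f2)
printf '%s\n' "$result"

rm -f "$tmp"
```

If this assumption holds, running the original pipeline on TicDataDescr.txt with LC_ALL=C set in the same way should get past the failing line, since cut no longer tries to interpret the stray byte as part of a multibyte character.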