cut: stdin: Illegal byte sequence on Mac OS

Posted on 2025-01-10 22:56:10

I'm new to bash and regex. I'm using the TIC dataset from UCI to practice.
https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/

I'm trying to extract the column names from the file TicDataDescr.txt which looks something like this:

TICEVAL2000.txt: 
Dataset for predictions (4000 customer records). It has the same format as TICDATA2000.txt, only the target is missing. Participants are supposed to return the list of predicted targets only. All datasets are in tab delimited format. 
The meaning of the attributes and attribute values is given below.

TICTGTS2000.txt
Targets for the evaluation set.

DATADICTIONARY

Nr Name Description Domain

1 MOSTYPE Customer Subtype see L0

2 MAANTHUI Number of houses 1 ñ 10

3 MGEMOMV Avg size household 1 ñ 6

4 MGEMLEEF Avg age see L1

5 MOSHOOFD Customer main type see L2

6 MGODRK Roman catholic see L3
...

I want to take only the second column of the description:

MOSTYPE
MAANTHUI
MGEMOMV
MGEMLEEF
MOSHOOFD
MGODRK
...

I'm trying to do it with the following code:

egrep "^[0-9]+\s[A-Z][A-Z]+" TicDataDescr.txt | cut -d' ' -f2

But I'm encountering the following error:

MOSTYPE
cut: stdin: Illegal byte sequence

I've tried converting the file using dos2unix

dos2unix TicDataDescr.txt

but I still get the same error.
Why am I running into such an error? Also, is there a way around it using regex? Thanks for the help.
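
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the "ñ" in lines such as "1 ñ 10" looks like a single byte from a legacy one-byte encoding (for example 0x96, which is an en dash in Windows-1252 and "ñ" in Mac Roman). Such a byte is not valid UTF-8, and the BSD `cut` shipped with Mac OS stops on it when the shell locale is UTF-8. `dos2unix` only changes line endings, so it would not affect this. The sketch below builds a hypothetical `sample.txt` containing that byte and forces the C locale so the tools work on raw bytes; the file name and the 0x96 byte are assumptions made for illustration.

```shell
# Hypothetical sample file (an assumption, not the real TicDataDescr.txt).
# \226 is the octal escape for byte 0x96, which is not valid UTF-8 on its own.
printf '1 MOSTYPE Customer Subtype see L0\n' > sample.txt
printf '2 MAANTHUI Number of houses 1 \226 10\n' >> sample.txt
printf '3 MGEMOMV Avg size household 1 \226 6\n' >> sample.txt

# LC_ALL=C makes grep and cut treat the input as plain bytes,
# so no multibyte decoding (and no "Illegal byte sequence") takes place.
LC_ALL=C grep -E '^[0-9]+[[:space:]][A-Z][A-Z]+' sample.txt | LC_ALL=C cut -d' ' -f2

# Regex-only alternative: awk matches the line and prints the second field,
# which also works if the columns are separated by tabs or repeated spaces.
LC_ALL=C awk '/^[0-9]+[[:space:]]+[A-Z][A-Z]+/ { print $2 }' sample.txt
```

Both commands print MOSTYPE, MAANTHUI and MGEMOMV for this sample. `[[:space:]]` is used in place of `\s` because `\s` is not part of POSIX extended regular expressions and may not be supported by every grep. If the file's real encoding is known, converting it to UTF-8 with `iconv` first would be another route.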
