cut: stdin: Illegal byte sequence on Mac OS

Posted on 2025-01-10 22:56:10


I'm new to bash and regex. I'm using the TIC dataset from UCI to practice:
https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/

I'm trying to extract the column names from the file TicDataDescr.txt which looks something like this:

TICEVAL2000.txt: 
Dataset for predictions (4000 customer records). It has the same format as TICDATA2000.txt, only the target is missing. Participants are supposed to return the list of predicted targets only. All datasets are in tab delimited format. 
The meaning of the attributes and attribute values is given below.

TICTGTS2000.txt
Targets for the evaluation set.

DATADICTIONARY

Nr Name Description Domain

1 MOSTYPE Customer Subtype see L0

2 MAANTHUI Number of houses 1 ñ 10

3 MGEMOMV Avg size household 1 ñ 6

4 MGEMLEEF Avg age see L1

5 MOSHOOFD Customer main type see L2

6 MGODRK Roman catholic see L3
...

I want to take only the second column of the description:

MOSTYPE
MAANTHUI
MGEMOMV
MGEMLEEF
MOSHOOFD
MGODRK
...

I'm trying to do it with the following command:

egrep "^[0-9]+\s[A-Z][A-Z]+" TicDataDescr.txt | cut -d' ' -f2

But I'm encountering the following error:

MOSTYPE
cut: stdin: Illegal byte sequence

I've tried converting the file using dos2unix

dos2unix TicDataDescr.txt

but I still get the same error.
Why am I running into this error? Also, is there a way around it using regex? Thanks for the help.
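
A likely explanation, offered as a sketch rather than a confirmed diagnosis: the file is probably not UTF-8. The "ñ" in "1 ñ 10" looks like a single byte 0x96, which is an en dash in Windows-1252 and "ñ" in Mac Roman. That byte on its own is not valid UTF-8, and the cut shipped with macOS rejects it under a UTF-8 locale, which would explain why the first row prints and the second fails. dos2unix mainly rewrites line endings, so it would leave that byte untouched. The example below builds a small hypothetical sample file (the 0x96 byte is an assumption about the real data) and shows two workarounds that set LC_ALL=C so the tools treat the input as raw bytes.

```shell
# Hypothetical sample resembling the data dictionary. \226 (octal) is byte 0x96,
# assumed here to be what the real file contains where "ñ" is displayed.
sample=$(mktemp)
printf '1 MOSTYPE Customer Subtype see L0\n' > "$sample"
printf '2 MAANTHUI Number of houses 1 \226 10\n' >> "$sample"
printf '3 MGEMOMV Avg size household 1 \226 6\n' >> "$sample"

# Workaround 1: run grep and cut in the C locale so bytes are not decoded as UTF-8.
names=$(LC_ALL=C grep -E '^[0-9]+[[:space:]]+[A-Z]+' "$sample" | LC_ALL=C cut -d' ' -f2)
printf '%s\n' "$names"

# Workaround 2: one awk call that keeps rows whose first field is a number
# and whose second field is all uppercase letters, then prints that field.
names_awk=$(LC_ALL=C awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[A-Z]+$/ { print $2 }' "$sample")
printf '%s\n' "$names_awk"

rm -f "$sample"
```

Against the real file, the same idea would be LC_ALL=C in front of both commands in the original pipeline. If the file really is Windows-1252, converting it first with iconv -f WINDOWS-1252 -t UTF-8 TicDataDescr.txt may also work, but that encoding is a guess and the output should be checked.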


Author: 你曾走过我的故事
