numpy recfunctions join_by bug
There seems to be a problem with the join_by function in numpy.lib.recfunctions when doing an outer join on multiple keys. The equivalent matplotlib.mlab function works correctly. The recfunctions version seems to mix and match parts of the keys: my data has two gvkey values, 001258 and 001678, but the recfunctions result contains the keys 001270 and 001658 in addition to those two. Has anyone run into this issue?
I have two text files, test.csv and test2.csv, that contain the following:
test.csv:
gvkey,fyr,ogpoilq,datadate,cusip
001258,12,,03/31/2002,13916P209
001258,12,,06/30/2002,13916P209
001258,12,,09/30/2002,13916P209
001258,12,31.0000,12/31/2002,13916P209
001678,12,74968.0000,12/31/2003,037411105
001678,12,,03/31/2004,037411105
001678,12,,06/30/2004,037411105
001678,12,,09/30/2004,037411105
001678,12,84736.0000,12/31/2004,037411105
001678,12,,03/31/2005,037411105
001678,12,,06/30/2005,037411105
001678,12,,09/30/2005,037411105
001678,12,85434.0000,12/31/2005,037411105
001678,12,,03/31/2006,037411105
001678,12,,06/30/2006,037411105
001678,12,,09/30/2006,037411105
001678,12,81971.0000,12/31/2006,037411105
test2.csv:
gvkey,datadate,fyearq,fqtr,ciderglq,cisecglq
001258,12/31/2001,2001,4,,
001258,03/31/2002,2002,1,,
001258,06/30/2002,2002,2,,
001258,09/30/2002,2002,3,,
001258,12/31/2002,2002,4,,
001258,03/31/2003,2003,1,,
001258,06/30/2003,2003,2,,
001678,03/31/2004,2004,1,,
001678,06/30/2004,2004,2,,
001678,09/30/2004,2004,3,,
001678,12/31/2004,2004,4,,
001678,03/31/2005,2005,1,-136.9970,0.0000
001678,06/30/2005,2005,2,-7.8000,0.0000
001678,09/30/2005,2005,3,-164.6470,0.0000
001678,12/31/2005,2005,4,73.3180,0.0000
001678,03/31/2006,2006,1,71.6100,0.0000
001678,06/30/2006,2006,2,5.5850,0.0000
The following code produces the correct and incorrect merged tables:
import datetime
import numpy as np
import numpy.lib.recfunctions as rf
import matplotlib.mlab as ml

# Convert an MM/DD/YYYY string into a datetime.date.
date_converter = lambda x: datetime.date(int(x[-4:]), int(x[:2]), int(x[3:5]))

# Load gvkey, production (ogpoilq) and date from test.csv.
prod_df = np.genfromtxt("../data/test.csv", filling_values=np.nan, converters={3:date_converter}, dtype="S10, f8, O4", names="gvkey, prod, date", delimiter=",", usecols=(0,2,3), skip_header=1)
# Load gvkey, date and hedge P&L (ciderglq) from test2.csv.
hedge_df = np.genfromtxt("../data/test2.csv", filling_values=np.nan, converters={1:date_converter}, dtype="S10, O4, f8", names="gvkey, date, hedgepnl", delimiter=",", usecols=(0,1,4), skip_header=1)

# Outer join on both keys: the mlab result is correct, the recfunctions result is not.
correct_outer_merge = ml.rec_join(["gvkey", "date"], prod_df, hedge_df, "outer")
incorrect_outer_merge = rf.rec_join(["gvkey", "date"], prod_df, hedge_df, "outer")
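For reference, here is a rough sketch of what I would expect an outer join on (gvkey, date) to return, written in plain Python so it does not depend on either library. The helper names `parse_date` and `outer_join` are made up for this illustration and are not part of numpy or matplotlib. It assumes both inputs are non-empty, that each key pair appears at most once per input, and that the non-key field names do not collide. The rows are a small subset of the two files above.

```python
import datetime


def parse_date(text):
    """Parse an MM/DD/YYYY string into a datetime.date (hypothetical helper)."""
    return datetime.datetime.strptime(text, "%m/%d/%Y").date()


def outer_join(left, right, keys):
    """Outer-join two lists of dicts on a tuple of key names (hypothetical helper).

    Assumes non-empty inputs, unique key combinations per input, and
    distinct non-key field names. Missing fields are filled with None.
    """
    def index(rows):
        return {tuple(row[k] for k in keys): row for row in rows}

    left_idx, right_idx = index(left), index(right)
    left_fields = [f for f in left[0] if f not in keys]
    right_fields = [f for f in right[0] if f not in keys]

    joined = []
    # The union of key tuples is exactly the key set an outer join should have.
    for key in sorted(set(left_idx) | set(right_idx)):
        row = dict(zip(keys, key))
        for f in left_fields:
            row[f] = left_idx.get(key, {}).get(f)
        for f in right_fields:
            row[f] = right_idx.get(key, {}).get(f)
        joined.append(row)
    return joined


# A few rows taken from test.csv and test2.csv.
prod = [
    {"gvkey": "001258", "date": parse_date("12/31/2002"), "prod": 31.0},
    {"gvkey": "001678", "date": parse_date("12/31/2003"), "prod": 74968.0},
    {"gvkey": "001678", "date": parse_date("03/31/2005"), "prod": None},
]
hedge = [
    {"gvkey": "001258", "date": parse_date("12/31/2001"), "hedgepnl": None},
    {"gvkey": "001678", "date": parse_date("03/31/2005"), "hedgepnl": -136.997},
]

merged = outer_join(prod, hedge, ("gvkey", "date"))

# Every gvkey in the result must already exist in one of the inputs.
input_gvkeys = {r["gvkey"] for r in prod + hedge}
spurious = {r["gvkey"] for r in merged} - input_gvkeys
print(len(merged), sorted(spurious))
```

The same spurious-key check can be pointed at the gvkey column of either merged table: any value that is not in the input files indicates that key parts were mixed during the join.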