How to deal with missing data? The information will be used for data visualization
How does everyone deal with missing values in a dataframe? I created a dataframe by pulling data from a Census web API. The 'GTCBSA' variable provides the city information, which I need for my visualizations (Plotly and Dash), and I found that there are a lot of missing values in the data. Do I just leave them blank and continue with my data visualization? The following is my variable:
Example data for 2004 = https://api.census.gov/data/2004/cps/basic/jun?get=GTCBSA,PEFNTVTY&for=state:*
Variable description = https://api.census.gov/data/2022/cps/basic/jan/variables/GTCBSA.json
Comments (1)
There are different ways of dealing with missing data, depending on the use case and the type of data that is missing. For example, for a near-continuous time series signal with some missing values, you can attempt to fill the gaps from nearby values by performing some type of interpolation (linear interpolation, for example).
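As a minimal sketch of that interpolation idea (it does not apply to the city column, only to signal-like data), the snippet below fills gaps in a small made-up daily series with pandas. The values and dates are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily signal with two gaps (NaN marks a missing reading).
signal = pd.Series(
    [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    index=pd.date_range("2004-06-01", periods=6, freq="D"),
)

# Linear interpolation fills each gap from its neighbouring values,
# treating the observations as equally spaced.
filled = signal.interpolate(method="linear")
print(filled)
```

This only makes sense when neighbouring rows are related, which is why it works for a time series but not for independent survey respondents.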
However, in your case, the missing values are cities and the rows are all independent (each row is a different respondent). As far as I can tell, you have no way to reasonably infer the city for the rows where it is missing, so you'll have to drop those rows from any city-level analysis.
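Dropping those rows could look roughly like the sketch below. The dataframe is made up, using the column names from the question's API call (`GTCBSA`, `PEFNTVTY`, `state`). The placeholder code `"0"` for an unidentified area is an assumption; check the variable description to see how missing values are actually encoded in your data.

```python
import pandas as pd

# Made-up rows; real data would come from the Census API response.
df = pd.DataFrame({
    "GTCBSA": ["35620", "0", "31080", None],
    "PEFNTVTY": ["57", "57", "303", "57"],
    "state": ["36", "36", "06", "48"],
})

# Assumption: "0" is a placeholder meaning "no city identified".
# Turn it into a real missing value so pandas can treat it as one.
df["GTCBSA"] = df["GTCBSA"].mask(df["GTCBSA"] == "0")

# Keep only the rows that have a city for the city-level charts.
with_city = df.dropna(subset=["GTCBSA"])
print(with_city)
```

Keeping the original `df` around, rather than overwriting it, leaves the full set of rows available for anything that does not need the city.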
I am not an expert in the data collection methods used by the US Census, but from this source it seems that multiple methods are used, so I can see how the respondent's city might not be known (the online tool might not be able to obtain it, or perhaps the respondent declined to state it). Missing data is a very common issue.
However, before dropping all rows with missing cities, you might do a brief check to see whether there is any pattern (e.g. are the rows with missing cities predominantly from one state?). If you are doing any state-level analysis, you could keep the rows with missing cities, since they still carry a valid state.
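One way to run that check is to compute the share of rows with a missing city per state. This is a rough sketch on made-up data; the column names follow the question's API call, and it assumes missing cities have already been converted to real missing values (`None`/`NaN`).

```python
import pandas as pd

# Made-up rows; None marks a respondent with no city recorded.
df = pd.DataFrame({
    "state": ["36", "36", "06", "48", "48", "48"],
    "GTCBSA": ["35620", None, "31080", None, None, "19100"],
})

# Fraction of rows per state where the city is missing, highest first.
missing_share = (
    df["GTCBSA"].isna()
    .groupby(df["state"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)

# State-level counts can use every row, including those without a city.
state_counts = df.groupby("state").size()
print(state_counts)
```

If one or two states account for most of the missing cities, a city-level chart built from the remaining rows will under-represent them, which is worth noting alongside the visualization.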