如何在 Stata 中创建 group by 语句(来自 MySQL)?
我不是统计人员,但必须处理相当多的数据。在大多数情况下,这些数据集来自在线调查;因此我确实有一个 MySQL 数据库并且知道如何从中获取一些结果。
然而,现在我得到了一个Stata文件,我需要做一些分析。在 MySQL 中我知道如何做到这一点,但我被困在 Stata 中并请求你的帮助。
我有一个不太小的表(大约 50k 行),包含以下列(还有更多列,但这些是我必须使用的列):
id - 对象 ID、唯一值
name - 对象名称、字符串值
class - 类对象的来源,整数范围 1 - 6
origin - 对象的来源,整数范围 1 - 2
在 50k 行中,只有大约 7k 个不同的名称。在 Stata 中,我可以检索带有 list name
的所有名称,甚至可以将其限制为带有 list name if class == 2
的单个类。
现在我想要一个所有不同名称的列表以及具有该名称的对象的计数并按计数对列表进行排序。 在 MySQL 中,我会查询 SELECT name, COUNT(*) AS cnt FROM objects GROUP BY name ORDER BY cnt DESC。 但是在 Stata 中如何做到这一点?
接下来的步骤是为每个类或两个来源获取这样的列表,即SELECT name, COUNT(*) AS cnt FROM objects WHERE class = 2 GROUP BY name ORDER BY cnt DESC
,Stata也可以吗?
ps:我不知道 stackoverflow 是否是正确的地方,因为 Stata 并不是真正的编程语言,是吗?但我在这里发现了一些Stata相关的问题;这就是我将其发布在这里的原因。如果有更好的地方可以这样做,请指出我正确的地方。
I am not a statistics guy but have to deal with quite some data. In most cases these data sets come from an online survey; hence I do have a MySQL database and know how to get some results out of that.
However, now I got a Stata file and I am required to do some analysis. In MySQL I'd know how to do that, but I am stuck in Stata and ask for your help.
I have a not too small table (roughly 50k rows) containing following columns (there are more cols but these are the ones I have to work with):
id - Object ID, unique values
name - Name of object, string value
class - Class of object, integer range 1 - 6
origin - Origin of object, integer range 1 - 2
Within the 50k rows there are only about 7k different names. In Stata I can retrieve all names with list name
and could even restrict it to a single class with list name if class == 2
.
Now I want a list of all different names along with a count of objects having that name and have the list sorted by count.
In MySQL I'd query SELECT name, COUNT(*) AS cnt FROM objects GROUP BY name ORDER BY cnt DESC
. But how would that be done in Stata?
Next steps would be to get such lists for each class or for both origins, i.e. SELECT name, COUNT(*) AS cnt FROM objects WHERE class = 2 GROUP BY name ORDER BY cnt DESC
, is that possible with Stata, too?
ps: I don't know if stackoverflow is the right place as Stata is not really a programming language, is it? But I found some Stata-related questions here; that's why I posted it here. If there's a better place to do so, please point me to the right place.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
请记住,Stata 仅适用于固定长度的矩形表,因此您只能添加跨越整个 50k 行的列。在此设置中,您可以执行以下操作。
对于第一个问题(名称和频率列表),您可以
注意,
collapse
将删除现有数据并替换为摘要。 (通常,我讨厌这个命令用于数据管理的这方面,但它应该在这里工作。)如果您不希望这种情况发生,这里有一个更复杂的技巧:(说明:
_N
是by
-group 中的观测值数量,_n
是by
-group 中当前观测值的数量。)然后您可以将其子集化使用
if 到感兴趣的类class==#
,正如您所知。Keep in mind that Stata only works with rectangular tables of fixed length, so you can only add columns that span the whole 50k rows. Within this setup, this is what you can do.
For the first problem (the list of names and frequencies), you can
Note that
collapse
will delete the existing data and replace with the summary. (Usually, I hate this command for this aspect of data management, but it should work here.) If you don't want this to happen, here's a more sophisticated trick:(Explanation:
_N
is the number of observations inby
-group, and_n
is the number of the current observation within theby
-group.)You can then subset this to the class of interest with
if class==#
, as you already know.另请查看可使用
ssc inst groups
下载的groups
命令。Also check out the
groups
command downloadable usingssc inst groups
.