Frequently Asked Questions
General
What is scipeds?
scipeds is a Python library for working with IPEDS data.
Who made scipeds and why did they make it?
scipeds was created by Science for America. As part of Science for America's work on STEM equity, we started working with IPEDS data and found ourselves writing code to make our work easier. We realized that most of the code we'd written wasn't specific to our own research questions, and decided to create scipeds to make it easier for ourselves and others to work with IPEDS data.
Data
How does scipeds query the data?
scipeds pre-processes raw data from IPEDS into a duckdb database file, which users download the first time they set up scipeds. Each QueryEngine then provides functions that aggregate the data in common ways, so you don't have to write a lot of SQL queries yourself.
Can I reproduce the pre-processed database for myself?
Yes! You can do so by cloning the GitHub repository and running the data pipeline.
What data is currently included in scipeds?
scipeds currently incorporates the following IPEDS survey components:
- IPEDS Completions Survey (1984-2023)
- IPEDS Directory Information (2011-2023)
Warning
Race/ethnicity is unavailable for completions data from 1984-1994. All race/ethnicity columns have been set to "unknown" during this time period.
In addition, race/ethnicity encoding changed between 2010 and 2011 data.
Which survey variables are included in scipeds?
It depends on which survey the variable belongs to. If the variable is part of one of the surveys currently included in the package, it is available. If it's part of a survey that is not currently included (e.g. Fall Enrollment), it is not.
scipeds doesn't have the data I need. What should I do?
We're always looking for contributors and would love for you to add it to the package! Check out the contributing guide to get started, or start a new discussion to share your idea and we'll be happy to work with you to make it happen!
Completions
What completions data is currently available?
scipeds uses data from the "A" series of aggregated IPEDS Completions Survey data, which contains completers by university (6-digit UNITID), year, field of study (6-digit CIP code), award level, race/ethnicity, and gender. This data is available in the ipeds_completions_a table in the pre-processed duckdb.
In addition to the 6-digit CIP code, higher-level field taxonomies are also provided. The following is a complete schema for the ipeds_completions_a table:
ipeds_completions_a
column_name | comment | data_type |
---|---|---|
year | IPEDS survey year | USMALLINT |
unitid | UNITID of institution | UINTEGER |
cipcode | The originally recorded CIP code | ENUM |
awlevel | Level of award | ENUM |
majornum | Major number | UTINYINT |
cip2020 | Crosswalked 2020 CIP code | ENUM |
race_ethnicity | Race / ethnicity of completers | ENUM |
gender | Gender of completers | ENUM |
n_awards | Number of completions | UINTEGER |
ncses_sci_group | NCSES Science and Engineering Alternate Classification | ENUM |
ncses_field_group | NCSES Broad Fields Alternate Classification | ENUM |
ncses_detailed_field_group | NCSES Detailed Fields Classification | ENUM |
nsf_broad_field | NSF Diversity and STEM Report Broad Field Classification | ENUM |
dhs_stem | DHS STEM Classification | BOOLEAN |
The queries in the Engine don't cover my use case. What should I do?
You can write your own SQL query and run it using the engine's get_df_from_query function.
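As a rough sketch of what such a query could look like, the snippet below sums awards in DHS STEM fields by year, using only column names from the ipeds_completions_a schema above. To keep it runnable without the scipeds database, it executes the query against an in-memory sqlite3 table filled with made-up rows; that table, its values, and the use of sqlite3 are assumptions for illustration only. With scipeds you would instead pass a query string to get_df_from_query (check the API reference for its exact signature).

```python
import sqlite3

# Query text written against columns documented for ipeds_completions_a.
QUERY = """
SELECT year, SUM(n_awards) AS stem_awards
FROM ipeds_completions_a
WHERE dhs_stem
GROUP BY year
ORDER BY year
"""

# Stand-in table (a subset of the real columns) with made-up rows,
# so the sketch runs without downloading the pre-processed duckdb file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipeds_completions_a "
    "(year INTEGER, unitid INTEGER, n_awards INTEGER, dhs_stem BOOLEAN)"
)
conn.executemany(
    "INSERT INTO ipeds_completions_a VALUES (?, ?, ?, ?)",
    [
        (2022, 100001, 10, True),
        (2022, 100002, 5, False),
        (2023, 100001, 7, True),
        (2023, 100002, 3, True),
    ],
)

stem_awards_by_year = conn.execute(QUERY).fetchall()
conn.close()
print(stem_awards_by_year)  # [(2022, 10), (2023, 10)]
```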
I've written a super useful query that I think should be part of the engine. What should I do?
You can add a wrapper for your query to the engine definition. See the contributing guide for how to contribute to the codebase.
How do query filters work?
Query filters allow you to specify the data you want to include for your analysis.
For example, if you only want to look at Associate's degrees and want to exclude non-resident aliens from your analysis, you can specify the race/ethnicity groups in your QueryFilter as all groups except RaceEthn.nonres and the award levels as just AwardLevel.associates. By default, query filters will include all years of data, all race/ethnicity groups, both first and second majors, and all award levels.
Warning
Excluding items from your queries will change the values you see in any totals columns. If you do filter your data, make sure you double-check or cross-reference the numbers you get back to make sure you are seeing what you would expect.
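The effect of filtering on totals can be seen in a small, self-contained sketch. It does not use the scipeds QueryFilter API: the rows are made up, the strings stand in for enum members such as AwardLevel.associates and RaceEthn.nonres, and total_awards is a hypothetical helper written only for this illustration.

```python
# Made-up completions rows: (award_level, race_ethnicity, n_awards).
rows = [
    ("associates", "group_a", 40),
    ("associates", "group_b", 25),
    ("associates", "nonres", 10),
    ("bachelors", "group_a", 60),
]


def total_awards(rows, award_levels=None, race_ethns=None):
    """Sum n_awards over rows passing both filters (None keeps everything)."""
    return sum(
        n_awards
        for level, race, n_awards in rows
        if (award_levels is None or level in award_levels)
        and (race_ethns is None or race in race_ethns)
    )


# No filters: every row counts toward the total.
unfiltered_total = total_awards(rows)  # 135

# Restricting to Associate's degrees shrinks the total.
associates_total = total_awards(rows, award_levels={"associates"})  # 75

# Also excluding the "nonres" group shrinks it again.
associates_excluding_nonres = total_awards(
    rows, award_levels={"associates"}, race_ethns={"group_a", "group_b"}
)  # 65
```

Each added filter lowers the total, which is why a filtered total should be compared against an unfiltered one before it is interpreted.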
I ran a query - what are the different columns and what do they mean?
It depends a little bit on the query, but in general each value returned by a completions query counts the number of awards or degrees received by members within a particular group. The number of awards is subject to any specified QueryFilters:
- rollup_degrees_within_group is the number of degrees awarded to a particular group (specified by your grouping) across all the fields specified in your TaxonomyRollup
- rollup_degrees_total is the number of degrees across all groups, across all the fields specified in your TaxonomyRollup
- field_degrees_within_group is the number of degrees awarded to a particular group (specified by your grouping) within a field specified by a FieldTaxonomy
- field_degrees_total is the number of degrees awarded across all groups within a field specified by a FieldTaxonomy
- uni_degrees_within_group is the number of degrees awarded to a particular group, across all fields
- uni_degrees_total is the total number of degrees awarded across all groups and all fields
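To make these definitions concrete, here is a minimal sketch that computes each quantity by hand on made-up rows. It mirrors the column definitions above and is not how scipeds computes them internally; the field names, group labels, and the rollup_fields set (standing in for a TaxonomyRollup) are all hypothetical.

```python
from collections import defaultdict

# Made-up rows: (field, group, n_awards).
rows = [
    ("Physics", "women", 30),
    ("Physics", "men", 70),
    ("Chemistry", "women", 50),
    ("Chemistry", "men", 50),
    ("History", "women", 60),
    ("History", "men", 40),
]
# Hypothetical rollup: the set of fields being combined.
rollup_fields = {"Physics", "Chemistry"}

field_degrees_within_group = defaultdict(int)   # keyed by (field, group)
field_degrees_total = defaultdict(int)          # keyed by field
uni_degrees_within_group = defaultdict(int)     # keyed by group
rollup_degrees_within_group = defaultdict(int)  # keyed by group
uni_degrees_total = 0
rollup_degrees_total = 0

for field, group, n_awards in rows:
    field_degrees_within_group[(field, group)] += n_awards
    field_degrees_total[field] += n_awards
    uni_degrees_within_group[group] += n_awards
    uni_degrees_total += n_awards
    if field in rollup_fields:
        rollup_degrees_within_group[group] += n_awards
        rollup_degrees_total += n_awards

print(field_degrees_within_group[("Physics", "women")])  # 30
print(field_degrees_total["Physics"])                    # 100
print(rollup_degrees_within_group["women"])              # 80
print(rollup_degrees_total)                              # 200
print(uni_degrees_within_group["women"])                 # 140
print(uni_degrees_total)                                 # 300
```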
What's the difference between a query that does something by_grouping as opposed to within_grouping?
Queries that aggregate data by_grouping will have columns containing totals across all groupings. Queries that aggregate data within_grouping will always be indexed by the intersectional grouping, but the total column will be the number of degrees within that particular group. For example, aggregating within gender will lead to totals that correspond to the number of degrees awarded to all women rather than to all students.
Why don't my sub-group totals add to the total value?
Queries generally return only non-zero values. If your query doesn't return a non-zero value for each group, then the sub-totals won't add up to the values in the totals columns.
What is the deal with first and second majors?
IPEDS records students who double major, and records one of these majors as secondary. By default, both first and second majors are included in aggregates (so one student may be counted more than once). You can aggregate only by first majors by choosing the appropriate QueryFilter values.
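The difference between the two counting choices can be sketched on made-up data. The rows below use the majornum column from the schema (1 for a first major, 2 for a second major); the field names and counts are invented, and the filtering is done by hand rather than through the scipeds QueryFilter API.

```python
# Made-up aggregated rows: (majornum, field, n_awards).
rows = [
    (1, "Mathematics", 100),
    (2, "Mathematics", 10),
    (1, "Physics", 50),
    (2, "Physics", 5),
]

# Default behaviour described above: first and second majors both count,
# so a double major contributes to two rows.
all_majors_total = sum(n_awards for _, _, n_awards in rows)

# Counting first majors only counts each completer's award once.
first_majors_total = sum(n_awards for majornum, _, n_awards in rows if majornum == 1)

print(all_majors_total)    # 165
print(first_majors_total)  # 150
```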