Skip to content

Frequently Asked Questions

General

What is scipeds?

scipeds is a Python library for working with IPEDS data.

Who made scipeds and why did they make it?

scipeds was created by Science for America. As part of Science for America's work on STEM equity, we started working with IPEDS data and found ourselves writing code to make our work easier. We realized that most of the code we'd written wasn't specific to our own research questions, and decided to create scipeds to make it easier for ourselves and others to work with IPEDS data.

Data

How does scipeds query the data?

scipeds pre-processes raw data from IPEDS into a duckdb database file, which users download the first time they get set up with scipeds. Functions are then implemented in various QueryEngines to make it easy to aggregate data in common ways without having to write a ton of SQL queries.

Can I re-produce the pre-processed database for myself?

Yes! By cloning the GitHub repository and running the data pipeline.

What data is currently included in scipeds?

scipeds currently incorporates the following IPEDS survey components:

  • IPEDS Completions Survey (1984-2023)
  • IPEDS Directory Information (2011-2023)

Warning

Race/ethnicity is unavailable for completions data from 1984-1994. All race/ethnicity columns have been set to "unknown" during this time period.

In addition, race/ethnicity encoding changed between 2010 and 2011 data.

Which survey variables are included in scipeds?

It depends. If the variable is part of one of the surveys currently included in the package, then yes. If it's part of a different survey that's not currently part of the package (e.g. Fall Enrollment), then no.

scipeds doesn't have the data I need. What should I do?

We're always looking for contributors and would love for you to add it to the package! Check out the contributing guide to get started, or start a new discussion to share your idea and we'll be happy to work with you to make it happen!

Completions

What completions data is currently available?

scipeds uses data from the "A" series of aggregated IPEDS Completions Suvey data, which contains completers by university (6-digit UNITID), year, field of study (6-digit CIP code), award level, race/ethnicity, and gender. This data is available in the ipeds_completions_a table in the pre-processed duckdb.

In addition to the 6-digit CIP code, higher-level field taxonomies are also provided. The following is a complete schema for the ipeds_completions_a table:

ipeds_completions_a

column_name comment data_type
year IPEDS survey year USMALLINT
unitid UNITID of institution UINTEGER
cipcode The originally recorded CIP code ENUM
awlevel Level of award ENUM
majornum Major number UTINYINT
cip2020 Crosswalked 2020 CIP code ENUM
race_ethnicity Race / ethnicity of completers ENUM
gender Gender of completers ENUM
n_awards Number of completions UINTEGER
ncses_sci_group NCSES Science and Engineering Alternate Classification ENUM
ncses_field_group NCSES Broad Fields Alternate Classification ENUM
ncses_detailed_field_group NCSES Detailed Fields Classification ENUM
nsf_broad_field NSF Diversity and STEM Report Broad Field Classification ENUM
dhs_stem DHS STEM Classification BOOLEAN

The queries in the Engine don't cover my use case. What should I do?

You can write your own SQL query and run it using the engine's get_df_from_query function.

I've written a super useful query that I think should be part of the engine. What should I do?

You can add a wrapper to your query in the engine definition. See the contributing guide for how to contribute to the codebase.

How do query filters work?

Query filters allow you to specify the data you want to include for your analysis.

For example, if you only want to look at Associate's degrees and want to exclude non-resident aliens from your analysis, you can specify the race/ethnicity groups in your QueryFilter as all groups except RaceEthn.nonres and the award levels as just AwardLevel.associates. By default, query filters will include all years of data, all race/ethnicity groups, both first and second majors, and all award levels.

Warning

Excluding items from your queries will change the values you see in any totals columns. If you do filter your data, make sure you double-check or cross-reference the numbers you get back to make sure you are seeing what you would expect.

I ran a query - what are the different columns and what do they mean?

It depends a little bit on the query, but in general each value returned by a completions query counts the number of awards or degrees received by members within a particular group. The number of awards is subject to any specified QueryFilters:

  • rollup_degrees_within_group is the number of degrees awarded to a particular group (specified by your grouping) across all the fields specified in your TaxonomyRollup
  • rollup_degrees_total is the number of degrees across all groups, across all the fields specified in your TaxonomyRollup
  • field_degrees_within_group is the number of degrees awarded to a particular group (specified by your grouping) within a field specified by a FieldTaxonomy
  • field_degrees_total is the number of degrees within a specified field across all groups within a field specified by a FieldTaxonomy
  • uni_degrees_within_group is the number of degrees awarded to a particular group, across all fields
  • uni_degrees_total is the total number of degrees awarded across all groups and all fields

What's the difference between a query that does something by_grouping as opposed to within_grouping?

Queries that aggregate data by_grouping will have columns containing totals across all groupings. Queries that aggregate data within_grouping will always be indexed by the intersectional grouping, but the total column will be the number of degrees within that particular group. For example, aggregating within gender will lead to totals that correspond to the number of degrees awarded to all women rather than to all students.

Why don't my sub-group totals add to the total value?

Queries generally return only non-zero values. If your query doesn't return a non-zero value for each group, then the sub-totals won't add up to the values in the totals columns.

What is the deal with first and second majors?

IPEDS records students who double major, and records one of these majors as secondary. By default, both first and second majors are included in aggregates (so one student may be counted more than once). You can aggregate only by first majors by choosing the appropriate QueryFilter values.