Usage¶
This page documents basic concepts and functionality within scipeds
.
The Data¶
First things first: scipeds
is all about IPEDS data. While IPEDS provides an incredibly rich set of data across multiple decades, the data is split out across different files and formats on the IPEDS data portal, making it challenging to do anything but the most basic analyses.
One key thing we've done with scipeds
is to standardize and centralize this data across multiple years into a single, query-able database. That database file is critical to using scipeds
, so downloading it is the first step.
For this tutorial, we'll be focusing on IPEDS completions data. That said, we've written scipeds
in a way that should make it easy to add other survey components like admissions and enrollment. As the package expands, approaches and concepts similar to the ones we share here will apply to these new IPEDS survey components.
import scipeds
# Download the pre-processed database. You can control
# the file location by setting the SCIPEDS_CACHE_DIR
# environment variable or by specifying an output_path parameter.
# If nothing is provided (as in the example here),
# this path Defaults to platformdirs' user_cache_dir.
scipeds.download_db()
Downloading pre-processed IPEDS db to /opt/render/.cache/.scipeds/scipeds_0_0_6.duckdb
Download complete. scipeds has 3 files and is using 923MB in /opt/render/.cache/.scipeds
Now that the database is downloaded, we can use the IPEDS Completions Query Engine to query it.
The Query Engine¶
scipeds
is really just a thin wrapper for aggregating the pre-processed IPEDS data using SQL queries. This thin wrapper is implemented in the CompletionsQueryEngine
, which is what we'll use to work with the pre-processed database file that we just downloaded.
from scipeds.data.completions import CompletionsQueryEngine
We also provide a few other tools in scipeds
that makes it possible to construct different kinds of queries of the completions data.
In particular, we provide tools for filtering data and aggregating data across fields in the scipeds.data.queries
submodule, and convenience enumerations in scipeds.data.enums
that make it easier to specify the values you want to filter on without having to remember how every variable is specifically spelled.
We'll start by importing these helper tools and initializing the completions query engine.
# Convenience tools for specifying different variables
from scipeds.data.enums import (
AwardLevel,
FieldTaxonomy,
Grouping,
NCSESSciGroup,
RaceEthn,
)
# Tools for filtering data and aggregating fields within a taxonomy
from scipeds.data.queries import (
QueryFilters,
TaxonomyRollup,
)
# If you used a non-default location when you downlaoded
# the pre-processed database, you should specify that location
# with the db_path parameter here.
engine = CompletionsQueryEngine()
Filtering the underlying data¶
First, we'll walk through how to use our pre-written wrappers to query the completions data.
All of the pre-written queries take a QueryFilter
as input. This defines the set of filters that are applied to the underlying data before any aggregations are done. For example, you can use a QueryFilter
to specify which years of data you want and to limit your results to just Bachelor's degrees.
filters = QueryFilters(
# The first year you want to include in your aggregation
# (default: 1984)
start_year=2022,
# The last year you want to include in your aggregation
# (default: 2023)
end_year=2022,
# The racial/ethnic groups you want to include
# (default: all racial / ethnic groups)
race_ethns=list(RaceEthn),
# The award levels you want to include
# (default: all award levels)
award_levels=AwardLevel.bachelors,
# The major numbers you want to include
# (default: both first and second majors)
majornums=[1, 2],
)
There's a few enumerations we use and provide that make it so you don't have to remember the exact string values for different variables. For instance, RaceEthn
is an enumeration of the different race/ethnicity groups present in the dataset, Degree
is an enumeration of all the award levels available, etc.
Rolling up over fields¶
The underlying completions table that we're looking at provides the total number of awards for each individual CIP code. Because we're usually interested in high-level field groupings (rather than individual CIP codes), we created a TaxonomyRollup
which you use to tell scipeds
the fields within a type of field grouping that you want to aggregate together.
There are two main concepts in the TaxonomyRollup
:
- The
taxonomy_name
, which defines which approach for grouping fields you want to use. For example, NCSES provides taxonomies for grouping CIP codes into broad field groups (like "Math & Computer Sciences") and detailed field groups (like "Computer Science"). You can see the options available in the enumerationFieldTaxonomy
. Enumerated values are defined for some columns like the NCSES Science Group (NCSESSciGroup
), but not for others (like CIP code) because there are simply too many values. - The
taxonomy_values
, which is a list of fields from the taxonomy you've chosen that you want to aggregate over. These are also available as enumerations, with each type ofFieldTaxonomy
having their own set of enumerated fields.
For the following examples, we'll be looking at the NCSES broadest categorization, aggregating over all fields that NCSES has categorized as "Science & Engineering". We do this by specifying the ncses_sci_group
taxonomy and including any fields that are marked as NCSESSciGroup.sci
.
ncses_stem = TaxonomyRollup(
taxonomy_name=FieldTaxonomy.ncses_sci_group,
taxonomy_values=[NCSESSciGroup.sci],
)
If you want to see which CIP codes are included in each FieldTaxonomy
, you can get a table with all of the CIP codes and their respective categorizations with the .get_cip_table()
function:
engine.get_cip_table()
cip_title | ncses_sci_group | ncses_field_group | ncses_detailed_field_group | nsf_broad_field | dhs_stem | |
---|---|---|---|---|---|---|
cip2020 | ||||||
01.0000 | Agriculture, General | Science and engineering | Life Sciences | Agricultural Sciences | Agricultural and biological sciences | False |
01.0101 | Agricultural Business and Management, General | Non-science and engineering | Business and Management | Business and Management | Non-science and engineering | False |
01.0102 | Agribusiness/Agricultural Business Operations | Non-science and engineering | Business and Management | Business and Management | Non-science and engineering | False |
01.0103 | Agricultural Economics | Science and engineering | Social Sciences | Economics | Social and behavioral sciences | False |
01.0104 | Farm/Farm and Ranch Management | Non-science and engineering | Business and Management | Business and Management | Non-science and engineering | False |
... | ... | ... | ... | ... | ... | ... |
61.2707 | Unknown | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NSF broad field classification | False |
61.2801 | Unknown | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NSF broad field classification | False |
80.9999 | Unknown | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NSF broad field classification | False |
95.0000 | Unknown | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NCSES crosswalk | Not categorized in NSF broad field classification | False |
95.9500 | Undesignated Field Of Study | Non-science and engineering | Other Non-sciences or Unknown Disciplines | Other Non-sciences or Unknown Disciplines | Non-science and engineering | False |
2015 rows × 6 columns
Pre-written queries to roll up across fields¶
Now that we've specified how we want to filter the raw data and which fields we want to aggregate over, we can run through some of the queries we've pre-written.
Aggregate by gender¶
First, we can calculate the total number of completions by gender by using the rollup_by_grouping
function provided by the CompletionsQueryEngine
. Remember: this will return the total completions across all of the fields we've specified (the rollup
parameter), subject to the filters we've specified (the query_filters
parameter).
# Aggregate total bachelor's degree completions in 2022 in STEM fields
# by gender, including all races and both first and second majors
stem_completions_by_gender = engine.rollup_by_grouping(
grouping=Grouping.gender,
rollup=ncses_stem,
query_filters=filters,
)
stem_completions_by_gender
rollup_degrees_within_gender | rollup_degrees_total | uni_degrees_within_gender | uni_degrees_total | |
---|---|---|---|---|
gender | ||||
men | 381882 | 798298 | 884320 | 2138714 |
women | 416416 | 798298 | 1254394 | 2138714 |
The returned dataframe contains four columns:
rollup_degrees_within_gender
contains the number of degrees awarded across the fields specified by theTaxonomyRollup
to members of the group specified in the rowrollup_degrees_total
contains the total number of degrees awarded across the fields specified by theTaxonomyRollup
across all groups in the groupinguni_degrees_within_gender
contains the total number of degrees awarded in all fields to members of the group specified in the rowuni_degrees_total
contains the total number of degrees awarded in all fields across all groups in the grouping.
In other words, the columns that start with rollup
provide total completions across just the fields specified in the TaxonomyRollup
, while columns that start with uni
provide total completions across all fields. Similarly, columns that end with _within_gender
provide these totals for each gender separately, while columns that end with _total
provide totals across all students in the query.
Note that all these values are subject to the specified QueryFilters
.
The query returns some redundant data (and more data than needed) because that makes it easier to compute various percentages. For example, from the returned dataframe it's easy for us to compute the percentage of STEM Bachelor's degrees awarded to men and women as well as the percentage of all university Bachelor's degrees awarded to men and women:
import pandas as pd
stem_pct = (
stem_completions_by_gender[
"rollup_degrees_within_gender"
]
/ stem_completions_by_gender["rollup_degrees_total"]
)
uni_pct = (
stem_completions_by_gender["uni_degrees_within_gender"]
/ stem_completions_by_gender["uni_degrees_total"]
)
pct_df = pd.concat([stem_pct, uni_pct], axis=1)
pct_df.columns = ["stem_pct", "uni_pct"]
pct_df
stem_pct | uni_pct | |
---|---|---|
gender | ||
men | 0.47837 | 0.413482 |
women | 0.52163 | 0.586518 |
Aggregate by race/ethnicity¶
If instead we wanted to aggregate by race/ethnicity instead of gender, the query engine lets us do that easily:
# Aggregate to race/ethnicity-level within NCSES STEM
engine.rollup_by_grouping(
grouping=Grouping.race_ethnicity,
rollup=ncses_stem,
query_filters=filters,
)
rollup_degrees_within_race_ethnicity | rollup_degrees_total | uni_degrees_within_race_ethnicity | uni_degrees_total | |
---|---|---|---|---|
race_ethnicity | ||||
American Indian or Alaska Native | 2696 | 798298 | 8850 | 2138714 |
Asian | 90626 | 798298 | 173760 | 2138714 |
Black or African American | 61561 | 798298 | 197290 | 2138714 |
Hispanic or Latino | 128678 | 798298 | 344957 | 2138714 |
Native Hawaiian or Other Pacific Islander | 1342 | 798298 | 4529 | 2138714 |
Non-resident alien | 55293 | 798298 | 103282 | 2138714 |
Two or more races | 34614 | 798298 | 83073 | 2138714 |
Unknown | 22065 | 798298 | 71277 | 2138714 |
White | 401423 | 798298 | 1151696 | 2138714 |
You can see that we follow a similar naming convention for the columns, except that this time the grouping is _within_race_ethnicity
rather than _within_gender
.
Aggregate by intersectional group¶
Aggregating across intersectional race/ethnicity and gender groups is also straightforward:
# Aggregate to intersectional groups
engine.rollup_by_grouping(
grouping=Grouping.intersectional,
rollup=ncses_stem,
query_filters=filters,
)
rollup_degrees_intersectional | rollup_degrees_total | uni_degrees_intersectional | uni_degrees_total | ||
---|---|---|---|---|---|
race_ethnicity | gender | ||||
American Indian or Alaska Native | men | 1122 | 798298 | 3105 | 2138714 |
women | 1574 | 798298 | 5745 | 2138714 | |
Asian | men | 46038 | 798298 | 77163 | 2138714 |
women | 44588 | 798298 | 96597 | 2138714 | |
Black or African American | men | 23340 | 798298 | 68367 | 2138714 |
women | 38221 | 798298 | 128923 | 2138714 | |
Hispanic or Latino | men | 53325 | 798298 | 129063 | 2138714 |
women | 75353 | 798298 | 215894 | 2138714 | |
Native Hawaiian or Other Pacific Islander | men | 655 | 798298 | 1868 | 2138714 |
women | 687 | 798298 | 2661 | 2138714 | |
Non-resident alien | men | 31495 | 798298 | 53380 | 2138714 |
women | 23798 | 798298 | 49902 | 2138714 | |
Two or more races | men | 15583 | 798298 | 33469 | 2138714 |
women | 19031 | 798298 | 49604 | 2138714 | |
Unknown | men | 11161 | 798298 | 30582 | 2138714 |
women | 10904 | 798298 | 40695 | 2138714 | |
White | men | 199163 | 798298 | 487323 | 2138714 |
women | 202260 | 798298 | 664373 | 2138714 |
Note that the output dataframe here is multi-indexed on both race_ethnicity
and gender
.
Pre-written queries to returning results for each field within a taxonomy¶
In addition to the rollup queries above, which provide results rolled up across a specified set of fields, we also provide queries for returning data for each field within a particular taxonomy.
# Compute totals by gender for each NCSES Field Group
engine.field_totals_by_grouping(
grouping=Grouping.gender,
taxonomy=FieldTaxonomy.ncses_field_group,
query_filters=filters,
)
field_degrees_within_gender | field_degrees_total | uni_degrees_within_gender | uni_degrees_total | ||
---|---|---|---|---|---|
ncses_field_group | gender | ||||
Architecture and Environmental Design | men | 4629 | 9535 | 884320 | 2138714 |
women | 4906 | 9535 | 1254394 | 2138714 | |
Arts and Music | men | 34274 | 94139 | 884320 | 2138714 |
women | 59865 | 94139 | 1254394 | 2138714 | |
Business and Management | men | 216242 | 404518 | 884320 | 2138714 |
women | 188276 | 404518 | 1254394 | 2138714 | |
Communication and Librarianship | men | 32387 | 90183 | 884320 | 2138714 |
women | 57796 | 90183 | 1254394 | 2138714 | |
Education | men | 30601 | 118451 | 884320 | 2138714 |
women | 87850 | 118451 | 1254394 | 2138714 | |
Engineering | men | 95159 | 127300 | 884320 | 2138714 |
women | 32141 | 127300 | 1254394 | 2138714 | |
Geosciences | men | 3362 | 6419 | 884320 | 2138714 |
women | 3057 | 6419 | 1254394 | 2138714 | |
Humanities | men | 32672 | 82213 | 884320 | 2138714 |
women | 49541 | 82213 | 1254394 | 2138714 | |
Interdisciplinary or Other Sciences | men | 8275 | 22359 | 884320 | 2138714 |
women | 14084 | 22359 | 1254394 | 2138714 | |
Law | men | 1218 | 3916 | 884320 | 2138714 |
women | 2698 | 3916 | 1254394 | 2138714 | |
Life Sciences | men | 60441 | 176774 | 884320 | 2138714 |
women | 116333 | 176774 | 1254394 | 2138714 | |
Life Sciences (Non-S&E) | men | 47381 | 272793 | 884320 | 2138714 |
women | 225412 | 272793 | 1254394 | 2138714 | |
Math and Computer Sciences | men | 108053 | 148007 | 884320 | 2138714 |
women | 39954 | 148007 | 1254394 | 2138714 | |
Other Non-sciences or Unknown Disciplines | men | 33459 | 82243 | 884320 | 2138714 |
women | 48784 | 82243 | 1254394 | 2138714 | |
Physical Sciences | men | 13042 | 23698 | 884320 | 2138714 |
women | 10656 | 23698 | 1254394 | 2138714 | |
Psychology | men | 27437 | 137620 | 884320 | 2138714 |
women | 110183 | 137620 | 1254394 | 2138714 | |
Religion and Theology | men | 7942 | 12378 | 884320 | 2138714 |
women | 4436 | 12378 | 1254394 | 2138714 | |
Science and Engineering Technologies | men | 21132 | 43381 | 884320 | 2138714 |
women | 22249 | 43381 | 1254394 | 2138714 | |
Social Sciences | men | 74388 | 178480 | 884320 | 2138714 |
women | 104092 | 178480 | 1254394 | 2138714 | |
Social Service Professions | men | 2666 | 23432 | 884320 | 2138714 |
women | 20766 | 23432 | 1254394 | 2138714 | |
Vocational Studies and Home Economics | men | 29560 | 80875 | 884320 | 2138714 |
women | 51315 | 80875 | 1254394 | 2138714 |
We've tried to keep the column names fairly self-explanatory, but you'll notice that for these queries, the rollup_
columns have been replaced by columns that start with field_
. That's because this data is being returning for each pre-defined field in the taxonomy, rather than rolled up across a user-specified set of fields.
# Compute totals for each race/ethnicity for each NCSES Field Group
engine.field_totals_by_grouping(
grouping=Grouping.race_ethnicity,
taxonomy=FieldTaxonomy.ncses_field_group,
query_filters=filters,
)
field_degrees_within_race_ethnicity | field_degrees_total | uni_degrees_within_race_ethnicity | uni_degrees_total | ||
---|---|---|---|---|---|
ncses_field_group | race_ethnicity | ||||
Architecture and Environmental Design | American Indian or Alaska Native | 34 | 9535 | 8850 | 2138714 |
Asian | 814 | 9535 | 173760 | 2138714 | |
Black or African American | 560 | 9535 | 197290 | 2138714 | |
Hispanic or Latino | 1758 | 9535 | 344957 | 2138714 | |
Native Hawaiian or Other Pacific Islander | 11 | 9535 | 4529 | 2138714 | |
... | ... | ... | ... | ... | ... |
Vocational Studies and Home Economics | Native Hawaiian or Other Pacific Islander | 267 | 80875 | 4529 | 2138714 |
Non-resident alien | 1085 | 80875 | 103282 | 2138714 | |
Two or more races | 2912 | 80875 | 83073 | 2138714 | |
Unknown | 3027 | 80875 | 71277 | 2138714 | |
White | 39000 | 80875 | 1151696 | 2138714 |
189 rows × 4 columns
# Aggregate across each intersectional group for each field
engine.field_totals_by_grouping(
grouping=Grouping.intersectional,
taxonomy=FieldTaxonomy.ncses_field_group,
query_filters=filters,
)
field_degrees_intersectional | field_degrees_total | uni_degrees_intersectional | uni_degrees_total | |||
---|---|---|---|---|---|---|
ncses_field_group | race_ethnicity | gender | ||||
Architecture and Environmental Design | American Indian or Alaska Native | men | 17 | 9535 | 3105 | 2138714 |
women | 17 | 9535 | 5745 | 2138714 | ||
Asian | men | 332 | 9535 | 77163 | 2138714 | |
women | 482 | 9535 | 96597 | 2138714 | ||
Black or African American | men | 274 | 9535 | 68367 | 2138714 | |
... | ... | ... | ... | ... | ... | ... |
Vocational Studies and Home Economics | Two or more races | women | 1950 | 80875 | 49604 | 2138714 |
Unknown | men | 1511 | 80875 | 30582 | 2138714 | |
women | 1516 | 80875 | 40695 | 2138714 | ||
White | men | 15086 | 80875 | 487323 | 2138714 | |
women | 23914 | 80875 | 664373 | 2138714 |
378 rows × 4 columns
# Aggregate by gender for each NSF Broad Field
# and each university each year
filters = QueryFilters(start_year=2011, end_year=2013)
df = engine.uni_field_totals_by_grouping(
grouping=Grouping.gender,
taxonomy=FieldTaxonomy.nsf_broad_field,
query_filters=filters,
by_year=True,
)
df
field_degrees_within_gender | field_degrees_total | uni_degrees_within_gender | uni_degrees_total | |||||
---|---|---|---|---|---|---|---|---|
institution_name | unitid | nsf_broad_field | gender | year | ||||
Community College of the Air Force | 100636 | Agricultural and biological sciences | men | 2011 | 11 | 19 | 14659 | 19263 |
women | 2011 | 8 | 19 | 4604 | 19263 | |||
Mathematical and computer sciences | men | 2011 | 770 | 984 | 14659 | 19263 | ||
women | 2011 | 214 | 984 | 4604 | 19263 | |||
Non-science and engineering | men | 2011 | 13736 | 18073 | 14659 | 19263 | ||
... | ... | ... | ... | ... | ... | ... | ... | ... |
Arizona State University Digital Immersion | 483124 | Social and behavioral sciences | women | 2013 | 177 | 260 | 870 | 1220 |
Louisiana Delta Community College | 483212 | Mathematical and computer sciences | men | 2013 | 13 | 24 | 240 | 590 |
women | 2013 | 11 | 24 | 350 | 590 | |||
Non-science and engineering | men | 2013 | 227 | 566 | 240 | 590 | ||
women | 2013 | 339 | 566 | 350 | 590 |
99786 rows × 4 columns
Custom queries¶
If you have a query you want to run that's not covered by the pre-written queries, you can write the SQL yourself and execute it using the engine's get_df_from_query
function.
You can run any SQL query using this approach, which means you can use SQL to figure out which tables and columns are available in the pre-processed duckdb database file.
query = "SHOW TABLES"
# Using the engine
engine.get_df_from_query(query)
name | |
---|---|
0 | cip_info |
1 | ipeds_completions_a |
2 | ipeds_directory_info |
from scipeds.constants import COMPLETIONS_TABLE
query = f"DESCRIBE {COMPLETIONS_TABLE}"
engine.get_df_from_query(query)
column_name | column_type | null | key | default | extra | |
---|---|---|---|---|---|---|
0 | year | USMALLINT | YES | None | None | None |
1 | unitid | UINTEGER | YES | None | None | None |
2 | cipcode | ENUM('01.0000', '01.0101', '01.0102', '01.0103... | YES | None | None | None |
3 | awlevel | ENUM('Associate''s degree', 'Award of at least... | YES | None | None | None |
4 | majornum | UTINYINT | YES | None | None | None |
5 | cip2020 | ENUM('01.0000', '01.0101', '01.0102', '01.0103... | YES | None | None | None |
6 | race_ethnicity | ENUM('American Indian or Alaska Native', 'Asia... | YES | None | None | None |
7 | gender | ENUM('men', 'women') | YES | None | None | None |
8 | n_awards | UINTEGER | YES | None | None | None |
9 | ncses_sci_group | ENUM('Non-science and engineering', 'Not categ... | YES | None | None | None |
10 | ncses_field_group | ENUM('Architecture and Environmental Design', ... | YES | None | None | None |
11 | ncses_detailed_field_group | ENUM('Aerospace Engineering', 'Agricultural Sc... | YES | None | None | None |
12 | nsf_broad_field | ENUM('Agricultural and biological sciences', '... | YES | None | None | None |
13 | dhs_stem | BOOLEAN | YES | None | None | None |
You can then use these table and column names to write your own custom SQL:
# Which CIP codes have the most awards over all recorded years?
query = f"""
SELECT cip2020, SUM(n_awards) as total_awards
FROM {COMPLETIONS_TABLE}
GROUP BY cip2020
ORDER BY total_awards DESC;
"""
top_cips = (
engine.get_df_from_query(query)
.head(10)
.set_index("cip2020")
)
top_cips
total_awards | |
---|---|
cip2020 | |
52.0201 | 9393674.0 |
24.0101 | 8794187.0 |
51.3801 | 5832908.0 |
42.0101 | 3749035.0 |
52.0301 | 2615257.0 |
26.0101 | 2348858.0 |
52.0101 | 2298456.0 |
13.1202 | 2291154.0 |
24.0102 | 2283145.0 |
51.0801 | 2089676.0 |