Public data API¶

Our mission is to make research and data on the world's biggest problems accessible and understandable to the public. As part of this work, we provide an experimental API to our datasets.

When using the API, you have access to the public catalog of data processed by our data team. The catalog indexes tables of data, rather than datasets or individual indicators. To learn more, read about our data model.

At the moment, we only support Python.

Our API is in beta

We currently provide only a Python API. We hope to extend this to other languages in the future. Please report any issues you find.

Python

(see example notebook)

owid-catalog¶

A Pythonic API for working with OWID's data catalog.

Status: experimental, APIs likely to change

Overview¶

Our World in Data is building a new data catalog so that our datasets are reproducible and transparent to the general public. That project is our etl, which going forward will contain the recipes for all the datasets we republish.

This library allows you to query our data catalog programmatically and get back data as pandas data frames, which suits data pipelines and Jupyter notebook explorations.

graph TB

etl -->|reads| walden[upstream datasets]
etl -->|generates| s3[data catalog]
catalog[owid-catalog-py] -->|queries| s3

We would love feedback on how we can make this library and the overall data catalog better. Feel free to send us an email at info@ourworldindata.org, or start a discussion on GitHub.

Quickstart¶

Install with pip install owid-catalog. Then you can get data in two different ways.

Charts catalog¶

This API attempts to give you exactly the data you see in a chart on our site.

from owid.catalog import charts

# get the data for one chart by URL
df = charts.get_data('https://ourworldindata.org/grapher/life-expectancy')

Notice that the last part of the URL is the chart's slug, its identifier, in this case life-expectancy. Using the slug alone also works.

df = charts.get_data('life-expectancy')
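
Since both forms are accepted, you rarely need to convert between them. If you keep a list of chart URLs and want the bare slugs, a minimal sketch using only the standard library could look like this. Note that `chart_slug` is a hypothetical helper written for illustration, not part of owid-catalog.

```python
from urllib.parse import urlparse


def chart_slug(url_or_slug: str) -> str:
    """Return the chart slug for a grapher URL, or the input if it is already a slug.

    Hypothetical helper for illustration; not part of owid-catalog.
    """
    # urlparse drops the query string and fragment; a bare slug has no
    # scheme, so it comes back unchanged as the path.
    path = urlparse(url_or_slug).path
    # Drop any trailing slash, then keep the last path segment.
    return path.rstrip("/").rsplit("/", 1)[-1]


print(chart_slug("https://ourworldindata.org/grapher/life-expectancy"))
print(chart_slug("life-expectancy"))
```

Both calls print `life-expectancy`, which is the value you would pass to `charts.get_data`.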

To see what charts are available, you can list them all.

>>> slugs = charts.list_charts()
>>> slugs[:5]
['above-ground-biomass-in-forest-per-hectare',
 'above-or-below-extreme-poverty-line-world-bank',
 'abs-change-energy-consumption',
 'absolute-change-co2',
 'absolute-gains-in-mean-female-height']
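
The full list is long, so you will usually want to narrow it down. The sketch below assumes only that `charts.list_charts()` returns a list of slug strings, as in the output above. `filter_slugs` is a hypothetical helper, and the sample list stands in for the real result so the snippet runs offline.

```python
from typing import Iterable, List


def filter_slugs(slugs: Iterable[str], keyword: str) -> List[str]:
    """Keep the slugs that contain the keyword, ignoring case.

    Hypothetical helper for illustration; not part of owid-catalog.
    """
    needle = keyword.lower()
    return [slug for slug in slugs if needle in slug.lower()]


# Sample copied from the listing above; in practice, pass charts.list_charts().
sample = [
    "above-ground-biomass-in-forest-per-hectare",
    "above-or-below-extreme-poverty-line-world-bank",
    "abs-change-energy-consumption",
    "absolute-change-co2",
    "absolute-gains-in-mean-female-height",
]

print(filter_slugs(sample, "change"))
# → ['abs-change-energy-consumption', 'absolute-change-co2']
```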

Data science API¶

We also curate much more data than is available on our site. To access it in an efficient binary (Feather) format, use our data science API.

This API is designed for use in Jupyter notebooks.

from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World in Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

There may be multiple versions of the same dataset in a catalog; each has a unique path. To load the same dataset again later, record its path and load it this way:

from owid import catalog

path = 'garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate'

rc = catalog.RemoteCatalog()
df = rc[path]
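
When recording paths, it can help to break them into their parts. The sketch below assumes a path has five segments in the order channel/namespace/version/dataset/table, which matches the example above but is not stated as a guarantee here. It also assumes versions are ISO dates when comparing them. `TablePath`, `parse_table_path` and `latest_path` are hypothetical names, and the 2022-01-01 path is made up for the comparison.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TablePath:
    """Assumed layout of a catalog path (illustrative, not the library's API)."""

    channel: str
    namespace: str
    version: str
    dataset: str
    table: str


def parse_table_path(path: str) -> TablePath:
    """Split a catalog path into its five assumed segments."""
    parts = path.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path segments, got {len(parts)}: {path!r}")
    return TablePath(*parts)


def latest_path(paths: Iterable[str]) -> str:
    """Pick the path with the greatest version string.

    Assumes versions are ISO dates (YYYY-MM-DD), which sort
    chronologically when compared as plain strings.
    """
    return max(paths, key=lambda p: parse_table_path(p).version)


path = "garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate"
# Made-up older version of the same table, used only to show the comparison.
older = path.replace("2023-05-15", "2022-01-01")

parsed = parse_table_path(path)
print(parsed.channel, parsed.namespace, parsed.version)
print(latest_path([older, path]) == path)
```

Storing the full path, rather than the search term, pins your analysis to one specific version of the table.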

Development¶

You need Python 3.8+, poetry and make installed. Clone the repo, then you can simply run:

# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch

Changelog¶

v0.3.10
Add experimental chart data API in owid.catalog.charts
v0.3.9
Switch from isort & black & flake8 to ruff
v0.3.8
Pin dataclasses-json==0.5.8 to fix error with Python 3.9
v0.3.7
Fix bugs.
Improve metadata propagation.
Improve metadata YAML file handling, to have common definitions.
Remove DatasetMeta.origins.
v0.3.6
Fixed tons of bugs
processing.py module with pandas-like functions that propagate metadata
Support for Dynamic YAML files
Support for R2 alongside S3
v0.3.5
Remove catalog.frames; use owid-repack package instead
Relax dependency constraints
Add optional channel argument to DatasetMeta
Stop supporting metadata in Parquet format, load JSON sidecar instead
Fix errors when creating new Table columns
v0.3.4
Bump pyarrow dependency to enable Python 3.11 support
v0.3.3
Add more arguments to Table.__init__ that are often used in ETL
Add Dataset.update_metadata function for updating metadata from YAML file
Python 3.11 support via update of pyarrow dependency
v0.3.2
Fix a bug in Catalog.__getitem__()
Replace mypy type checker by pyright
v0.3.1
Sort imports with isort
Change black line length to 120
Add grapher channel
Support path-based indexing into catalogs
v0.3.0
Update OWID_CATALOG_VERSION to 3
Support multiple formats per table
Support reading and writing parquet files with embedded metadata
Optional repack argument when adding tables to dataset
Underscore |
Get version field from DatasetMeta init
Resolve collisions of underscore_table function
Convert version to str and load json dimensions
v0.2.9
Allow multiple channels in catalog.find function
v0.2.8
Update OWID_CATALOG_VERSION to 2
v0.2.7
Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
Add .find_latest method to Catalog
v0.2.6
Add flag is_public for public/private datasets
Enforce snake_case for table, dataset and variable short names
Add fields published_by and published_at to Source
Added a list of supported and unsupported operations on columns
Updated pyarrow
v0.2.5
Fix ability to load remote CSV tables
v0.2.4
Update the default catalog URL to use a CDN
v0.2.3
Fix methods for finding and loading data from a LocalCatalog
v0.2.2
Repack frames to compact dtypes on Table.to_feather()
v0.2.1
Fix key typo used in version check
v0.2.0
Copy dataset metadata into tables, to make tables more traceable
Add API versioning, and a requirement to update if your version of this library is too old
v0.1.1
Add support for Python 3.8
v0.1.0
Initial release, including searching and fetching data from a remote catalog