owid-catalog: Data Structures and Processing
Enhanced pandas data structures with rich metadata support for OWID's data processing pipelines.
Quick Reference
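import pandas as pd
from owid.catalog import Dataset, Table, Variable
from owid.catalog import processing as pr
# Build a small DataFrame (illustrative values only)
df = pd.DataFrame({"country": ["USA"], "year": [2020], "population": [1]})
# Create a table with metadata (short_name is a shortcut for metadata.short_name)
tb = Table(df, short_name="population")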
Metadata Hierarchy
Dataset
├── metadata: DatasetMeta (sources, licenses, title)
└── Tables
    ├── metadata: TableMeta (table-level info)
    └── Variables (columns)
        └── metadata: VariableMeta (unit, description, sources)
Metadata Propagation
As the table is processed, metadata is preserved and propagated to resulting tables and variables.
# Slicing
tb_filtered = tb[tb["year"] > 2000] # Keeps metadata
# Filtering
tb_loc = tb.loc[tb["country"] == "USA"] # Keeps metadata
# Sorting
tb_sorted = tb.sort_values("gdp_per_capita") # Keeps metadata
# Column operations
tb["gdp_per_capita_usd"] = tb["gdp_per_capita"] * 2
# Merging
tb_merged = pr.merge(tb1, tb2, on="country") # Merges metadata
# Concatenating
tb_concat = pr.concat([tb1, tb2]) # Combines metadata
# Pivoting
tb_pivot = pr.pivot(tb, index="year", ...) # Adjusts metadata
# Melting
tb_melted = pr.melt(tb, ...)
File Formats
Tables support multiple formats with automatic detection: feather, parquet, and CSV. Metadata is stored separately in .meta.json files.
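As a rough illustration of the sidecar convention described above, the sketch below derives the metadata file name from a data file name. The helper `sidecar_path` and the `SUPPORTED_SUFFIXES` set are hypothetical and not part of owid-catalog; the sketch only assumes the `<name>.meta.json` naming mentioned in this section.

```python
from __future__ import annotations

from pathlib import Path

# Formats named in this section; the set itself is an assumption of this sketch.
SUPPORTED_SUFFIXES = {".feather", ".parquet", ".csv"}


def sidecar_path(data_path: str | Path) -> Path:
    """Return the assumed location of the .meta.json file next to a data file."""
    path = Path(data_path)
    if path.suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported format: {path.suffix!r}")
    # Swap the data suffix for ".meta.json", keeping the directory and stem.
    return path.with_suffix(".meta.json")


print(sidecar_path("data/population.feather").name)
```

The data file and its metadata file share a stem, so a reader only needs the data path to locate both.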
Reference
Metadata-aware alternatives to pandas functions.
owid.catalog.core.processing
Common operations performed on tables and variables.
Functions:
-
ignore_warnings–Ignore warnings. You can pass a list of specific warnings to ignore, such as MetadataWarning or StepWarning.
-
keep_metadata–Decorator that turns a function that works on DataFrame or Series into a function that works on Table or Variable and preserves metadata.
-
multi_merge–Merge multiple tables.
-
read–Read a file based on extension, dispatching to the appropriate reader.
-
read_custom–Read data using a custom reader function and return a Table with metadata.
-
read_df–Create a Table (with metadata and an origin) from a DataFrame.
ignore_warnings
Ignore warnings. You can pass a list of specific warnings to ignore, such as MetadataWarning or StepWarning.
Usage
with ignore_warnings():
    ds_garden = create_dataset(...)
Source code in lib/catalog/owid/catalog/core/warnings.py
keep_metadata
Decorator that turns a function that works on DataFrame or Series into a function that works on Table or Variable and preserves metadata. If the decorated function renames columns, their metadata won't be copied.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
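To make the decorator idea concrete without depending on the library, here is a minimal pure-Python sketch. The `Labeled` class is a hypothetical stand-in for Table/Variable, and `keep_metadata_sketch` is not the library's implementation; it only shows the general pattern of lifting a function on plain data to one on a metadata-carrying object.

```python
from __future__ import annotations

import functools
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Labeled:
    """Hypothetical stand-in for Table/Variable: plain values plus metadata."""

    values: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def keep_metadata_sketch(func: Callable[..., list[float]]) -> Callable[..., Labeled]:
    """Lift a function on plain lists to one on Labeled objects."""

    @functools.wraps(func)
    def wrapper(obj: Labeled, *args, **kwargs) -> Labeled:
        # Run the original function on the bare values.
        result = func(obj.values, *args, **kwargs)
        # Re-attach a copy of the input's metadata to the plain result.
        return Labeled(result, dict(obj.metadata))

    return wrapper


@keep_metadata_sketch
def scale(values: list[float], factor: float) -> list[float]:
    return [v * factor for v in values]


gdp = Labeled([1.0, 2.0], {"unit": "USD"})
scaled = scale(gdp, 10)
print(scaled.values, scaled.metadata)
```

The wrapped function never sees the metadata, which is why a function that renames things internally gives the wrapper no way to know which metadata belongs to which output.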
multi_merge
Merge multiple tables.
This is a helper function when merging more than two tables on common columns.
Parameters:
Returns:
-
combined(Table) –Merged table.
Source code in lib/catalog/owid/catalog/core/tables.py
read
read(
filepath_or_buffer: str | Path | IO[AnyStr],
*args: Any,
file_extension: str | None = None,
metadata: TableMeta | None = None,
origin: Origin | None = None,
underscore: bool = False,
**kwargs: Any,
) -> Table
Read a file based on extension, dispatching to the appropriate reader.
Parameters:
-
filepath_or_buffer(str | Path | IO[AnyStr]) –Path to the file or file-like object to read.
-
*args(Any, default:()) –Additional positional arguments passed to the format-specific reader.
-
file_extension(str | None, default:None) –File extension (without dot). If None, inferred from filepath.
-
metadata(TableMeta | None, default:None) –Table metadata.
-
origin(Origin | None, default:None) –Origin of the table data.
-
underscore(bool, default:False) –True to make all column names snake case.
-
**kwargs(Any, default:{}) –Additional keyword arguments passed to the format-specific reader.
Returns:
-
Table–Table with data and metadata.
Note
For reading ZIP files, use Snapshot.extracted() context manager instead. See etl/snapshot.py for the recommended approach to handling archives.
Source code in lib/catalog/owid/catalog/core/tables.py
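The extension-based dispatch described for read can be sketched with the standard library alone. Everything below is an assumption made for illustration: `read_records`, the `READERS` mapping, and the two toy readers are hypothetical and return plain lists of dicts rather than a Table.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path
from typing import Any, Callable


def _read_csv(path: Path) -> list[dict[str, Any]]:
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def _read_json(path: Path) -> list[dict[str, Any]]:
    with path.open() as f:
        return json.load(f)


# Extension (without dot) -> reader; these two readers are illustrative only.
READERS: dict[str, Callable[[Path], list[dict[str, Any]]]] = {
    "csv": _read_csv,
    "json": _read_json,
}


def read_records(filepath: str | Path, file_extension: str | None = None) -> list[dict[str, Any]]:
    """Pick a reader from the extension, or from file_extension if given."""
    path = Path(filepath)
    # Infer the extension from the path unless the caller provides one.
    ext = (file_extension or path.suffix.lstrip(".")).lower()
    if ext not in READERS:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return READERS[ext](path)
```

Passing `file_extension` explicitly covers files whose name does not reveal their format, which mirrors the role of the parameter of the same name documented above.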
read_custom
read_custom(
read_function: Callable,
filepath_or_buffer: str | Path | IO[AnyStr],
metadata: TableMeta,
origin: Origin | None = None,
underscore: bool = False,
*args: Any,
**kwargs: Any,
) -> Table
Read data using a custom reader function and return a Table with metadata.
This function allows using any custom data reading function while automatically attaching metadata and origin information to the resulting Table. Useful when standard read functions (read_csv, read_excel, etc.) don't meet specific needs.
Parameters:
-
read_function(Callable) –Custom function to read the data. Must accept filepath_or_buffer as first argument and return a DataFrame or Table.
-
filepath_or_buffer(str | Path | IO[AnyStr]) –Path to the file or file-like object to read.
-
metadata(TableMeta) –Table metadata.
-
origin(Origin | None, default:None) –Origin of the table data.
-
underscore(bool, default:False) –True to make all column names snake case.
-
*args(Any, default:()) –Additional positional arguments to pass to read_function.
-
**kwargs(Any, default:{}) –Additional keyword arguments to pass to read_function.
Returns:
-
Table(Table) –Data read by the custom function as a Table with attached metadata and origin.
Source code in lib/catalog/owid/catalog/core/tables.py
read_df
read_df(
df: DataFrame,
metadata: TableMeta | None = None,
origin: Origin | None = None,
underscore: bool = False,
) -> Table
Create a Table (with metadata and an origin) from a DataFrame.
Parameters:
-
df(DataFrame) –Input DataFrame.
-
metadata(TableMeta | None, default:None) –Table metadata (with a title and description).
-
origin(Origin | None, default:None) –Origin of the table.
-
underscore(bool, default:False) –True to ensure all column names are snake case.
Returns:
-
Table(Table) –Original data as a Table with metadata and an origin.
Source code in lib/catalog/owid/catalog/core/tables.py
Container for multiple tables with shared metadata.
owid.catalog.core.datasets
Classes:
-
Dataset–A dataset is a folder containing data tables with metadata.
Functions:
-
checksum_file–Calculate MD5 checksum of a single file.
Dataset
dataclass
A dataset is a folder containing data tables with metadata.
A Dataset represents a collection of related data tables stored in a directory. Each dataset has an index.json file containing metadata about the dataset and references to its tables.
Attributes:
-
path(str) –Path to the dataset directory.
-
metadata(DatasetMeta) –Dataset-level metadata (title, description, sources, etc).
Example
Load an existing dataset:
Create a new dataset:
Initialize a Dataset from a directory path.
Parameters:
Methods:
-
add–Add a table to this dataset.
-
checksum–Calculate MD5 checksum of all data and metadata in the dataset.
-
index–Generate an index DataFrame describing all tables in this dataset.
-
read–Read a table from the dataset with performance options.
-
update_metadata–Update dataset and table metadata from a YAML file.
Source code in lib/catalog/owid/catalog/core/datasets.py
add
Add a table to this dataset.
Saves the table to the dataset's directory in the specified format(s). By default, saves in multiple formats for compatibility.
Parameters:
-
table(Table) –The table to add to the dataset.
-
formats(list[FileFormat], default:DEFAULT_FORMATS) –List of file formats to save (feather, parquet, csv). Defaults to DEFAULT_FORMATS (usually ["feather"]).
-
repack(bool, default:True) –If True, optimize column dtypes to reduce file size (e.g. float64 -> float32). Set to False for very large dataframes if repacking fails or is too slow.
Raises:
-
PrimaryKeyMissing–If table has no primary key and OWID_STRICT is set.
-
NonUniqueIndex–If table index has duplicates and OWID_STRICT is set.
Example
Source code in lib/catalog/owid/catalog/core/datasets.py
checksum
checksum() -> str
Calculate MD5 checksum of all data and metadata in the dataset.
Generates a checksum that includes the dataset's index file and all data files. Useful for detecting changes to the dataset.
Returns:
-
str–MD5 checksum as a hexadecimal string.
Source code in lib/catalog/owid/catalog/core/datasets.py
index
Generate an index DataFrame describing all tables in this dataset.
Creates a summary DataFrame with one row per table, including metadata like namespace, version, checksum, dimensions, and file paths.
Parameters:
-
catalog_path(Path, default:Path('/')) –Base path for calculating relative paths. Defaults to "/".
Returns:
-
DataFrame–DataFrame with columns: namespace, dataset, version, table, checksum, is_public, title, description, dimensions, path, channel, and formats.
Source code in lib/catalog/owid/catalog/core/datasets.py
read
read(
name: str | None = None,
reset_index: bool = True,
safe_types: bool = True,
reset_metadata: Literal[
"keep", "keep_origins", "reset"
] = "keep",
load_data: bool = True,
) -> Table
Read a table from the dataset with performance options.
This is an alternative to ds[table_name] with more control over loading behavior for performance optimization.
Parameters:
-
name(str | None, default:None) –Name of the table to read. If None and dataset has only one table, reads that table automatically.
-
reset_index(bool, default:True) –If True, don't set primary keys. This can make loading large multi-index datasets much faster. Default is True.
-
safe_types(bool, default:True) –If True, convert numeric columns to nullable types (Float64, Int64) and categorical to string[pyarrow]. This increases memory usage but prevents type issues. Default is True.
-
reset_metadata(Literal['keep', 'keep_origins', 'reset'], default:'keep') –Controls variable metadata reset behavior:
- "keep": Leave metadata unchanged (default)
- "keep_origins": Reset metadata but retain origins attribute
- "reset": Reset all variable metadata
-
load_data(bool, default:True) –If False, only load metadata without actual data. Useful when you only need to inspect metadata. Default is True.
Returns:
-
Table–The loaded table with data and metadata.
Raises:
-
ValueError–If name is None but dataset contains multiple tables.
-
KeyError–If the specified table name doesn't exist.
Example
Read single table with safe defaults
Keep index
Faster, less memory
Only metadata
Source code in lib/catalog/owid/catalog/core/datasets.py
update_metadata
update_metadata(
metadata_path: Path,
yaml_params: dict[str, Any] | None = None,
if_origins_exist: SOURCE_EXISTS_OPTIONS = "replace",
errors: Literal["ignore", "warn", "raise"] = "raise",
extra_variables: Literal["raise", "ignore"] = "raise",
) -> None
Update dataset and table metadata from a YAML file.
Loads metadata from a .meta.yml file and updates the dataset's metadata and all referenced tables. This is the primary way to add rich metadata to datasets in the ETL workflow.
Parameters:
-
metadata_path(Path) –Path to the .meta.yml file with metadata definitions. See existing metadata files for examples of the expected structure.
-
yaml_params(dict[str, Any] | None, default:None) –Additional parameters to pass to the YAML loader.
-
if_origins_exist(SOURCE_EXISTS_OPTIONS, default:'replace') –How to handle existing origins:
- "replace" (default): Replace existing origin with new one
- "append": Append new origin to existing origins
- "fail": Raise exception if origin already exists
-
errors(Literal['ignore', 'warn', 'raise'], default:'raise') –How to handle errors during update:
- "raise" (default): Raise exception on errors
- "warn": Issue warning but continue processing
- "ignore": Silently ignore errors
-
extra_variables(Literal['raise', 'ignore'], default:'raise') –How to handle variables in metadata not in dataset:
- "raise" (default): Raise exception
- "ignore": Skip extra variables
Example
Source code in lib/catalog/owid/catalog/core/datasets.py
checksum_file
Calculate MD5 checksum of a single file.
Reads the file in chunks to handle large files efficiently.
Parameters:
-
filename(str) –Path to the file to checksum.
Returns:
-
Any–MD5 hash object (use .hexdigest() to get string representation).
Source code in lib/catalog/owid/catalog/core/datasets.py
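The chunked approach described for checksum_file can be sketched with hashlib. This is not the library's source: the function name `md5_of_file` and the 1 MiB default chunk size are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(filename, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    digest = hashlib.md5()
    with open(filename, "rb") as f:
        # iter() with a sentinel stops at the empty bytes returned at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # Return the hash object; call .hexdigest() for the string form.
    return digest
```

Because `update()` is incremental, the result is identical to hashing the whole file at once, whatever chunk size is used.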
pandas DataFrame with column-level metadata.
owid.catalog.core.tables
Classes:
-
Table–Enhanced pandas DataFrame with rich metadata support.
-
VariableGroupBy–
Functions:
-
align_categoricals–Align categorical columns if possible. If not, return originals. This is necessary for
-
copy_metadata–Copy metadata from a different table to self.
-
keep_metadata–Decorator that turns a function that works on DataFrame or Series into a function that works on Table or Variable and preserves metadata.
-
multi_merge–Merge multiple tables.
-
read–Read a file based on extension, dispatching to the appropriate reader.
-
read_custom–Read data using a custom reader function and return a Table with metadata.
-
read_df–Create a Table (with metadata and an origin) from a DataFrame.
-
update_variable_dimensions–Update a variable's dimensions metadata.
Table
Table(
*args: Any,
metadata: TableMeta | None = None,
short_name: str | None = None,
underscore: bool = False,
camel_to_snake: bool = False,
like: Table | None = None,
**kwargs: Any,
)
Bases: DataFrame
Enhanced pandas DataFrame with rich metadata support.
Table extends pandas DataFrame to include metadata at both the table level and individual column level. It's the primary data structure for ETL operations.
Attributes:
-
metadata(TableMeta) –Table-level metadata (title, description, sources, etc).
-
_fields(dict[str, VariableMeta]) –Dictionary mapping column names to their VariableMeta objects.
-
DEBUG–Set to True to enable metadata validation debugging.
Example
Create a table from a DataFrame:
Create with metadata:
Copy metadata from another table:
Initialize a Table with data and metadata.
Parameters:
-
*args(Any, default:()) –Positional arguments passed to pandas.DataFrame.__init__.
-
metadata(TableMeta | None, default:None) –TableMeta object with table-level metadata. Creates empty metadata if not provided.
-
short_name(str | None, default:None) –Shortcut to set metadata.short_name. Alternative to passing metadata=TableMeta(short_name="my_name").
-
underscore(bool, default:False) –If True, convert column and index names to snake_case.
-
camel_to_snake(bool, default:False) –If True, convert camelCase column names to snake_case. Only applies when underscore=True.
-
like(Table | None, default:None) –Copy metadata from this Table (including column metadata). Alternative to manually copying metadata for all columns.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.DataFrame.__init__.
Example
Methods:
-
astype–Cast table columns to specified dtype(s).
-
check_metadata–Check that all variables in the table have origins.
-
copy–Create a copy of the table with all metadata.
-
copy_metadata–Copy metadata from another table to this table.
-
drop–Drop specified labels from rows or columns.
-
equals_table–Check if two tables are equal including metadata.
-
fillna–Usual fillna, but, if the object given to fill values with is a table, transfer its metadata to the filled table.
-
filter–Subset rows or columns based on their labels.
-
format–Format the table according to OWID standards.
-
from_records–Calling Table.from_records returns a Table, but does not call __init__ and misses metadata.
-
get_column_or_index–Get a variable by name from either columns or index.
-
groupby–Groupby that preserves metadata. It uses observed=True by default.
-
join–Join tables while preserving metadata.
-
melt–Unpivot table from wide to long format.
-
merge–Merge with another DataFrame or Table.
-
pivot–Reshape table from long to wide format.
-
prune_metadata–Remove metadata for columns no longer in the table.
-
read–Read a table from disk in any supported format.
-
read_csv–Read table from CSV file with accompanying metadata.
-
read_feather–Read table from Feather file with accompanying metadata.
-
read_json–Read the table from a JSON file plus accompanying JSON sidecar.
-
read_parquet–Read table from Parquet file with accompanying metadata.
-
reindex–Conform table to new index with optional filling logic.
-
rename–Rename columns while preserving their metadata.
-
rename_index_names–Rename the names of index levels.
-
reset_index–Reset the index to default integer index.
-
rolling–Rolling operation that preserves metadata.
-
set_index–Set the DataFrame index using specified columns.
-
to–Save this table to disk in a supported format.
-
to_csv–Save table as CSV with accompanying metadata file.
-
to_excel–Save table to Excel file with optional metadata codebook.
-
to_feather–Save table as Feather file with accompanying metadata.
-
to_json–Save this table as a JSON file plus accompanying JSON metadata file.
-
to_parquet–Save table as Parquet file with metadata sidecar.
-
underscore–Convert column and index names to underscore format.
-
update_metadata–Update table-level metadata fields.
-
update_metadata_from_yaml–Update table and variable metadata from a YAML file.
Source code in lib/catalog/owid/catalog/core/tables.py
all_columns
property
Get names of all columns including index levels.
Returns both regular columns and index names in a single list, useful for iterating over all variables in the table.
Returns:
Example
codebook
property
Generate a human-readable codebook for this table.
Creates a DataFrame summarizing all variables in the table with their titles, descriptions, units, and source attributions.
Returns:
-
DataFrame–DataFrame with columns:
- column: Column name (including index columns)
- title: Title from metadata (title_public > display.name > title)
- description: Short description of the indicator
- unit: Unit of measurement with short unit in parentheses
- source: Formatted source attribution with URLs
primary_key
property
Get the table's primary key column names.
Returns the names of index levels, which serve as the table's primary key for identifying unique rows.
Returns:
astype
Cast table columns to specified dtype(s).
Convert one or more columns to a specified data type. Wrapper around pandas astype that returns a Table.
Parameters:
-
*args(Any, default:()) –Positional arguments passed to pandas.DataFrame.astype.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.DataFrame.astype.
Returns:
-
Table–Table with columns cast to specified types.
Example
Cast single column:
Cast multiple columns:
Cast all columns:
Source code in lib/catalog/owid/catalog/core/tables.py
check_metadata
Check that all variables in the table have origins.
Source code in lib/catalog/owid/catalog/core/tables.py
copy
Create a copy of the table with all metadata.
Parameters:
-
deep(bool, default:True) –If True (default), make a deep copy of the data and metadata. If False, creates a shallow copy.
Returns:
-
Table–A new Table with copied data and metadata.
Source code in lib/catalog/owid/catalog/core/tables.py
copy_metadata
Copy metadata from another table to this table.
Copies both table-level metadata and variable-level metadata for all matching columns. Useful for preserving metadata after transformations.
Parameters:
-
from_table(Table) –Source table to copy metadata from.
-
deep(bool, default:False) –If True, make a deep copy of the metadata. Default is False.
Returns:
-
Table–Self, for method chaining.
Source code in lib/catalog/owid/catalog/core/tables.py
drop
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and axis. Wrapper around pandas drop that returns a Table.
Parameters:
-
*args(Any, default:()) –Positional arguments passed to pandas.DataFrame.drop.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.DataFrame.drop.
Returns:
-
Table–Table with specified labels dropped.
Example
Drop columns:
Drop rows by index:
Drop columns with axis parameter:
Source code in lib/catalog/owid/catalog/core/tables.py
equals_table
Check if two tables are equal including metadata.
Compares both data and metadata for equality. This is more comprehensive than pandas equals() which only checks data.
Parameters:
-
table(Table) –Table to compare with.
Returns:
Note
NaN values are handled specially to ensure consistent comparison even when NaN values are present.
Source code in lib/catalog/owid/catalog/core/tables.py
fillna
Usual fillna, but, if the object given to fill values with is a table, transfer its metadata to the filled table.
Source code in lib/catalog/owid/catalog/core/tables.py
filter
Subset rows or columns based on their labels.
Filter the table to include only specified rows or columns by name. Wrapper around pandas filter that returns a Table.
Parameters:
-
*args(Any, default:()) –Positional arguments passed to pandas.DataFrame.filter.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.DataFrame.filter. Common kwargs include:
- items: List of axis labels to select
- like: Keep labels that contain this substring
- regex: Keep labels matching this regex pattern
- axis: Axis to filter on (0 for rows, 1 for columns)
Returns:
-
Table–Filtered Table with only selected labels.
Example
Filter columns by exact names:
Filter columns containing pattern:
Filter columns with regex:
Source code in lib/catalog/owid/catalog/core/tables.py
format
format(
keys: str | list[str] | None = None,
verify_integrity: bool = True,
underscore: bool = True,
sort_rows: bool = True,
sort_columns: bool = False,
short_name: str | None = None,
**kwargs: Any,
) -> Table
Format the table according to OWID standards.
Applies standard OWID formatting: underscores column names, sets index, verifies uniqueness, and sorts data. This is a convenience method that chains multiple operations commonly used in ETL workflows.
Note
Underscoring happens first, so use underscored key names in the keys parameter (e.g., use 'country' if original had 'Country').
Parameters:
-
keys(str | list[str] | None, default:None) –Index column name(s). If None, uses ["country", "year"].
-
verify_integrity(bool, default:True) –If True (default), raise error if index has duplicate entries.
-
underscore(bool, default:True) –If True (default), convert column names to snake_case format. Disable if names are already properly formatted.
-
sort_rows(bool, default:True) –If True (default), sort rows by index in ascending order.
-
sort_columns(bool, default:False) –If True, sort columns alphabetically. Default is False.
-
short_name(str | None, default:None) –Optional short name to assign to table metadata.
-
**kwargs(Any, default:{}) –Additional arguments passed to the underscore() method.
Returns:
-
Table–Formatted Table with standardized structure and metadata.
Raises:
-
KeyError–If specified keys are not found in table columns.
-
ValueError–If verify_integrity=True and index has duplicates.
Example
Basic formatting with default country/year index:
Equivalent to:
Custom index columns:
Skip underscoring if already formatted:
Format with custom table name:
Source code in lib/catalog/owid/catalog/core/tables.py
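The note about ordering (underscoring first, key lookup second) can be illustrated with a small standalone sketch. `to_snake_case` is a deliberately simplified stand-in for the library's underscore logic, and `check_keys` is a hypothetical helper; neither is part of owid-catalog.

```python
import re


def to_snake_case(name: str) -> str:
    """Simplified snake_case: lowercase, runs of non-alphanumerics become "_"."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def check_keys(columns, keys=("country", "year")):
    """Underscore column names first, then verify that every key exists."""
    underscored = [to_snake_case(c) for c in columns]
    missing = [k for k in keys if k not in underscored]
    if missing:
        raise KeyError(f"Keys not found after underscoring: {missing}")
    return underscored


print(check_keys(["Country", "Year", "GDP per capita"]))
```

In this sketch, passing `keys=("Country",)` raises KeyError because the lookup happens against the already-underscored names, which is the situation the note warns about.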
from_records
classmethod
Calling Table.from_records returns a Table, but does not call __init__ and misses metadata.
Source code in lib/catalog/owid/catalog/core/tables.py
get_column_or_index
Get a variable by name from either columns or index.
Retrieves a Variable from the table, checking both regular columns and index levels. This is useful when you don't know whether a variable is stored as a column or index.
Parameters:
-
name(str) –Name of the variable to retrieve.
Returns:
-
Indicator–Variable object with data and metadata.
Raises:
-
ValueError–If name is not found in either columns or index.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
groupby
Groupby that preserves metadata. It uses observed=True by default.
Source code in lib/catalog/owid/catalog/core/tables.py
join
Join tables while preserving metadata.
Extends pandas join with proper type signature for Table. Metadata from both tables is preserved in the result.
Parameters:
-
other(DataFrame | Table) –Table or DataFrame to join with.
-
*args(Any, default:()) –Positional arguments passed to pandas.DataFrame.join.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.DataFrame.join. Supports all pandas join parameters.
Returns:
-
Table–Joined table with combined metadata.
Source code in lib/catalog/owid/catalog/core/tables.py
melt
melt(
id_vars: tuple[str] | list[str] | str | None = None,
value_vars: tuple[str] | list[str] | str | None = None,
var_name: str = "variable",
value_name: str = "value",
short_name: str | None = None,
*args: Any,
**kwargs: Any,
) -> Table
Unpivot table from wide to long format.
Converts columns into rows, transforming wide-format data into long-format. Wrapper around pandas melt that preserves metadata. See owid.catalog.tables.melt() for full documentation.
Parameters:
-
id_vars(tuple[str] | list[str] | str | None, default:None) –Column(s) to use as identifier variables (not melted).
-
value_vars(tuple[str] | list[str] | str | None, default:None) –Column(s) to unpivot. If None, uses all columns except id_vars.
-
var_name(str, default:'variable') –Name for the variable column. Default is "variable".
-
value_name(str, default:'value') –Name for the value column. Default is "value".
-
short_name(str | None, default:None) –Optional short name for resulting table metadata.
-
*args(Any, default:()) –Additional positional arguments passed to melt().
-
**kwargs(Any, default:{}) –Additional keyword arguments passed to melt().
Returns:
-
Table–Melted Table in long format with preserved metadata.
Example
Melt all columns except country and year:
>>> long_table = table.melt(id_vars=["country", "year"])
>>> # Melt specific columns:
>>> long_table = table.melt(
... id_vars=["country", "year"],
... value_vars=["gdp", "population"]
... )
>>> # Custom column names:
>>> long_table = table.melt(
... id_vars="country",
... var_name="indicator",
... value_name="measurement"
... )
Source code in lib/catalog/owid/catalog/core/tables.py
merge
Merge with another DataFrame or Table.
Wrapper around pandas merge that preserves Table metadata. See owid.catalog.tables.merge() for full documentation.
Parameters:
-
right(Any) –DataFrame or Table to merge with.
-
*args(Any, default:()) –Positional arguments passed to merge().
-
**kwargs(Any, default:{}) –Keyword arguments passed to merge().
Returns:
-
Table–Merged Table with combined metadata.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
pivot
pivot(
*,
index: str | list[str] | None = None,
columns: str | list[str] | None = None,
values: str | list[str] | None = None,
join_column_levels_with: str | None = None,
short_name: str | None = None,
fill_dimensions: bool = True,
**kwargs: Any,
) -> Table
Reshape table from long to wide format.
Converts rows into columns, transforming long-format data into wide-format. Wrapper around pandas pivot that preserves metadata. See owid.catalog.tables.pivot() for full documentation.
Parameters:
-
index(str | list[str] | None, default:None) –Column(s) to use for the new index. If None, uses existing index.
-
columns(str | list[str] | None, default:None) –Column(s) whose unique values become new columns.
-
values(str | list[str] | None, default:None) –Column(s) to aggregate. If None, uses all remaining columns.
-
join_column_levels_with(str | None, default:None) –If pivoting creates multi-level columns, join them with this separator (e.g., "_").
-
short_name(str | None, default:None) –Optional short name for resulting table metadata.
-
fill_dimensions(bool, default:True) –If True, fill missing dimension values. Default is True.
-
**kwargs(Any, default:{}) –Additional arguments passed to pivot().
Returns:
-
Table–Pivoted Table in wide format with preserved metadata.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
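To show what joining column levels with a separator means in practice, the sketch below flattens tuple-style column labels into single strings. `join_column_levels` is a hypothetical helper written for this illustration; it does not call the library and only mimics the effect described for join_column_levels_with.

```python
def join_column_levels(columns, sep="_"):
    """Flatten multi-level column labels (tuples) into single strings."""
    flat = []
    for col in columns:
        if isinstance(col, tuple):
            # Join every level of the label with the separator.
            flat.append(sep.join(str(level) for level in col))
        else:
            # Single-level labels pass through unchanged (as strings).
            flat.append(str(col))
    return flat


print(join_column_levels([("gdp", "USA"), ("gdp", "FRA"), "year"]))
```

Flat string names are easier to attach per-column metadata to than tuple labels, which is presumably why such an option is useful after a pivot on several value columns.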
prune_metadata
prune_metadata() -> Table
Remove metadata for columns no longer in the table.
Cleans up the internal metadata dictionary to remove entries for columns that have been dropped. Useful after column filtering or selection operations.
Returns:
-
Table–Self, for method chaining.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
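The pruning step can be pictured as a dictionary cleanup. The sketch below uses a plain dict in place of the table's internal column-to-metadata mapping; `prune_fields` is a hypothetical function, not the library's method.

```python
def prune_fields(fields, columns):
    """Drop metadata entries whose column is no longer present (in place)."""
    keep = set(columns)
    # Iterate over a copy of the keys so entries can be deleted safely.
    for name in list(fields):
        if name not in keep:
            del fields[name]
    # Return the same mapping to allow chaining, as the method above does with self.
    return fields


fields = {"gdp": {"unit": "USD"}, "population": {"unit": "people"}}
prune_fields(fields, columns=["gdp"])
print(fields)
```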
read
classmethod
Read a table from disk in any supported format.
Automatically detects the format from file extension and loads the table with its metadata. Supports .csv, .feather, and .parquet.
Parameters:
-
path(str | Path) –Path to the file to read. Extension determines format.
-
**kwargs(Any, default:{}) –Additional arguments passed to format-specific reader.
Returns:
-
Table–Loaded Table with data and metadata.
Raises:
-
ValueError–If file extension is not recognized.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
read_csv
classmethod
Read table from CSV file with accompanying metadata.
Loads a table from a CSV file and its associated .meta.json metadata file. For example, reads both "data.csv" and "data.meta.json".
Parameters:
-
path(str | Path) –Path to the CSV file (must end with .csv).
-
**kwargs(Any, default:{}) –Additional arguments passed to the internal metadata loader.
Returns:
-
Table–Table with data and metadata loaded.
Raises:
-
ValueError–If path doesn't end with .csv.
Source code in lib/catalog/owid/catalog/core/tables.py
read_feather
classmethod
Read table from Feather file with accompanying metadata.
Loads a table from a Feather file and its associated .meta.json metadata file. Supports both local file paths and URLs.
Parameters:
-
path(str | Path) –Path or URL to the Feather file (must end with .feather).
-
load_data(bool, default:True) –If True, load the actual data. If False, only load metadata and column structure (useful for inspecting large files).
-
**kwargs(Any, default:{}) –Additional arguments passed to the internal metadata loader.
Returns:
-
Table–Table with data and metadata loaded.
Raises:
-
ValueError–If path doesn't end with .feather.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
read_json
classmethod
Read the table from a JSON file plus accompanying JSON sidecar.
The path may be a local file path or a URL.
Source code in lib/catalog/owid/catalog/core/tables.py
read_parquet
classmethod
Read table from Parquet file with accompanying metadata.
Loads a table from a Parquet file and its associated .meta.json metadata file. Supports both local file paths and URLs.
Parameters:
-
path(str | Path) –Path or URL to the Parquet file (must end with .parquet).
-
**kwargs(Any, default:{}) –Additional arguments passed to the internal metadata loader.
Returns:
-
Table–Table with data and metadata loaded.
Raises:
-
ValueError–If path doesn't end with .parquet.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
reindex
Conform table to new index with optional filling logic.
Create a new Table with changed index. Missing values are filled according to the specified method. Wrapper around pandas reindex.
Parameters:
-
*args(Any, default:()) –Positional arguments passed to pandas.DataFrame.reindex.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.DataFrame.reindex.
Returns:
-
Table–Table conformed to new index.
Example
Reindex with new labels:
Fill missing values:
Forward fill:
Source code in lib/catalog/owid/catalog/core/tables.py
rename
¶
Rename columns while preserving their metadata.
Extends pandas rename to maintain variable metadata when renaming columns or index levels. Metadata follows the renamed columns automatically.
Parameters:
-
*args(Any, default:()) –Positional arguments passed to pandas.DataFrame.rename.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.DataFrame.rename. Supports all pandas rename parameters including mapper, index, columns, and inplace.
Returns:
-
Table | None–Renamed table if inplace=False (default), None if inplace=True.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
rename_index_names
¶
Rename the names of the index levels.
Source code in lib/catalog/owid/catalog/core/tables.py
reset_index
¶
Reset the index to default integer index.
Extends pandas.DataFrame.reset_index with a proper type signature for Table.
Converts index levels to regular columns.
Parameters:
-
level(Any, default:None) –Index level(s) to reset. If None, resets all levels.
-
inplace(bool, default:False) –If True, modify the table in place. Default is False.
-
**kwargs(Any, default:{}) –Additional arguments passed to pandas.DataFrame.reset_index.
Returns:
-
Table | None–Table with reset index if inplace=False, None if inplace=True.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
rolling
¶
Rolling operation that preserves metadata.
set_index
¶
Set the DataFrame index using specified columns.
Extends pandas set_index to update table metadata with primary key and dimension information. The index columns become the table's identifying dimensions.
Parameters:
-
keys(str | list[str]) –Column name or list of column names to set as index.
-
**kwargs(Any, default:{}) –Additional arguments passed to pandas.DataFrame.set_index.
Returns:
-
Table | None–Table with new index if inplace=False, None if inplace=True.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
to
¶
Save this table to disk in a supported format.
The format is automatically detected from the file extension (.csv, .feather, or .parquet).
Parameters:
-
path(str | Path) –Output file path. Extension determines format.
-
repack(bool, default:True) –If True, optimize column dtypes to reduce file size. Set to False for very large tables if optimization fails.
Example
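As a minimal sketch of extension-based format detection (an illustration only, not the library's implementation; `detect_format` and `SUPPORTED` are hypothetical names), the path suffix can be mapped to one of the three formats listed above:

```python
from pathlib import Path

# Hypothetical lookup table; the extensions are the ones named in this section.
SUPPORTED = {".csv": "csv", ".feather": "feather", ".parquet": "parquet"}


def detect_format(path: str) -> str:
    """Return the format name implied by the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return SUPPORTED[suffix]


print(detect_format("data/population.parquet"))  # parquet
```

Lowercasing the suffix is a choice made for this sketch; whether the library treats extensions case-insensitively is not stated here.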
Source code in lib/catalog/owid/catalog/core/tables.py
to_csv
¶
Save table as CSV with accompanying metadata file.
Saves both the data as CSV and metadata as a separate JSON file. For example, "mytable.csv" will have metadata at "mytable.meta.json".
Parameters:
-
path(Any | None, default:None) –Output CSV path. If None, returns CSV as string.
-
**kwargs(Any, default:{}) –Additional arguments passed to pandas.DataFrame.to_csv. By default, includes index only if table has a primary key.
Returns:
-
None | str–CSV string if path is None, otherwise None.
Example
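The sidecar naming rule described above ("mytable.csv" is paired with "mytable.meta.json") can be sketched with the standard library. `sidecar_path` is a hypothetical helper for illustration, not a function of the library:

```python
from pathlib import Path


def sidecar_path(data_path: str) -> Path:
    """Return the .meta.json path that sits next to a data file."""
    # Swap the data suffix (.csv, .feather, ...) for ".meta.json".
    return Path(data_path).with_suffix(".meta.json")


print(sidecar_path("mytable.csv").name)  # mytable.meta.json
```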
Source code in lib/catalog/owid/catalog/core/tables.py
to_excel
¶
to_excel(
excel_writer: Any,
with_metadata: bool = True,
sheet_name: str = "data",
metadata_sheet_name: str = "metadata",
**kwargs: Any,
) -> None
Save table to Excel file with optional metadata codebook.
Exports the table data to an Excel file, optionally including a separate sheet with the codebook metadata.
Parameters:
-
excel_writer(Any) –File path or ExcelWriter object to save to.
-
with_metadata(bool, default:True) –If True, include a metadata codebook sheet. Default is True.
-
sheet_name(str, default:'data') –Name for the data sheet. Default is "data".
-
metadata_sheet_name(str, default:'metadata') –Name for the metadata sheet. Default is "metadata".
-
**kwargs(Any, default:{}) –Additional arguments passed to pandas.DataFrame.to_excel.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
to_feather
¶
to_feather(
path: Any,
repack: bool = True,
compression: Literal[
"zstd", "lz4", "uncompressed"
] = "zstd",
**kwargs: Any,
) -> None
Save table as Feather file with accompanying metadata.
Saves the table in Apache Arrow Feather format with a separate JSON metadata file. For example, "mytable.feather" will have metadata at "mytable.meta.json".
Note
Feather format cannot store indexes, so the index is reset before saving and restored when reading.
Parameters:
-
path(Any) –Output file path (must end with .feather).
-
repack(bool, default:True) –If True, optimize column dtypes to reduce file size. Set to False for very large tables if repacking is slow.
-
compression(Literal['zstd', 'lz4', 'uncompressed'], default:'zstd') –Compression algorithm to use. Options are: - "zstd" (default): High compression ratio - "lz4": Faster compression - "uncompressed": No compression
-
**kwargs(Any, default:{}) –Additional arguments passed to pandas.DataFrame.to_feather.
Raises:
-
ValueError–If path doesn't end with .feather or if index names overlap with column names.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
to_json
¶
Save this table as a JSON file plus accompanying JSON metadata file. If the table is stored at "mytable.json", the metadata will be at "mytable.meta.json".
By default, uses orient="records" which outputs a simple array of objects without schema information. The index is reset and included as regular columns.
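To illustrate what a "records" layout looks like, here is a small standard-library sketch. The rows and column names are made up for illustration, and the snippet does not call the library:

```python
import json

# Hypothetical rows; "country" and "year" stand in for index levels
# that have been reset into regular columns.
rows = [
    {"country": "A", "year": 2020, "value": 1.0},
    {"country": "A", "year": 2021, "value": 2.0},
]

# A records layout is a plain JSON array with one object per row,
# with no schema block around it.
payload = json.dumps(rows, indent=2)
print(payload)
```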
Source code in lib/catalog/owid/catalog/core/tables.py
to_parquet
¶
Save table as Parquet file with metadata sidecar.
Saves the table in Apache Parquet format with a separate JSON metadata file. Parquet provides efficient columnar storage and compression.
Note
Metadata is stored in a separate .meta.json file rather than embedded in the Parquet schema to enable efficient partial reading of large files.
Parameters:
-
path(Any) –Output file path (must end with .parquet).
-
repack(bool, default:True) –If True, optimize column dtypes to reduce file size. Set to False for very large tables if repacking is slow.
Raises:
-
ValueError–If path doesn't end with .parquet.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
underscore
¶
underscore(
collision: Literal[
"raise", "rename", "ignore"
] = "raise",
inplace: bool = False,
camel_to_snake: bool = False,
) -> Table
Convert column and index names to underscore format.
Converts all column names and index names to snake_case format. In rare cases where two columns map to the same underscored name, the collision parameter controls the behavior.
Parameters:
-
collision(Literal['raise', 'rename', 'ignore'], default:'raise') –How to handle naming collisions: - "raise" (default): Raise ValueError if collision occurs - "rename": Append numbered suffix to duplicates - "ignore": Keep first occurrence
-
inplace(bool, default:False) –If True, modify the table in place. Default is False.
-
camel_to_snake(bool, default:False) –If True, convert camelCase to snake_case. Default is False (only converts spaces and special chars).
Returns:
-
Table–Table with underscored names (or None if inplace=True).
Example
Basic underscoring
Convert camelCase
Handle collisions
Modify in place
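The collision behavior can be sketched without the library. This is a simplified illustration under stated assumptions: `to_snake` and `underscore_names` are hypothetical helpers, the conversion rule is much cruder than a real underscoring routine, the numbered-suffix format is invented for the sketch, and only the "raise" and "rename" modes are modelled.

```python
import re


def to_snake(name: str) -> str:
    """Simplified conversion: lowercase, then collapse other characters to "_"."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def underscore_names(names: list, collision: str = "raise") -> list:
    """Convert names, handling duplicates according to `collision`."""
    if collision not in ("raise", "rename"):
        raise ValueError(f"Mode not modelled in this sketch: {collision!r}")
    result = []
    seen = {}  # converted name -> number of duplicates seen so far
    for name in names:
        new = to_snake(name)
        if new in seen:
            if collision == "raise":
                raise ValueError(f"Name collision on {new!r}")
            # "rename": the suffix format is an assumption of this sketch.
            seen[new] += 1
            new = f"{new}_{seen[new]}"
        else:
            seen[new] = 0
        result.append(new)
    return result


print(underscore_names(["GDP per capita", "Life expectancy"]))
print(underscore_names(["GDP (USD)", "gdp usd"], collision="rename"))
```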
Source code in lib/catalog/owid/catalog/core/tables.py
update_metadata
¶
Update table-level metadata fields.
Convenience method to update multiple metadata fields at once.
Parameters:
-
**kwargs(Any, default:{}) –Metadata field names and values to update. Must be valid TableMeta attributes.
Returns:
-
Table–Self, for method chaining.
Raises:
-
AssertionError–If any field name is not a valid TableMeta attribute.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
update_metadata_from_yaml
¶
update_metadata_from_yaml(
path: Path | str,
table_name: str,
yaml_params: dict[str, Any] | None = None,
extra_variables: Literal["raise", "ignore"] = "raise",
if_origins_exist: SOURCE_EXISTS_OPTIONS = "replace",
) -> None
Update table and variable metadata from a YAML file.
Loads metadata definitions from a .meta.yml file and updates both table-level and variable-level metadata. This is the primary way to add rich metadata in the ETL workflow.
Parameters:
-
path(Path | str) –Path to the .meta.yml file with metadata definitions.
-
table_name(str) –Name of the table in the YAML file to load metadata from. Also updates the table's short_name to this value.
-
yaml_params(dict[str, Any] | None, default:None) –Additional parameters to pass to the YAML loader.
-
extra_variables(Literal['raise', 'ignore'], default:'raise') –How to handle variables in YAML not in table: - "raise" (default): Raise exception - "ignore": Skip extra variables
-
if_origins_exist(SOURCE_EXISTS_OPTIONS, default:'replace') –How to handle existing origins: - "replace" (default): Replace existing origin with new one - "append": Append new origin to existing origins - "fail": Raise exception if origin already exists
Example
Source code in lib/catalog/owid/catalog/core/tables.py
VariableGroupBy
¶
VariableGroupBy(
groupby: SeriesGroupBy,
name: str,
metadata: VariableMeta,
table_metadata: TableMeta,
)
Methods:
-
rolling–Apply rolling window function and return a new VariableGroupBy with proper metadata.
Source code in lib/catalog/owid/catalog/core/tables.py
rolling
¶
rolling(*args: Any, **kwargs: Any) -> VariableGroupBy
Apply rolling window function and return a new VariableGroupBy with proper metadata.
Source code in lib/catalog/owid/catalog/core/tables.py
align_categoricals
¶
align_categoricals(
left: SeriesOrVariable, right: SeriesOrVariable
) -> tuple[SeriesOrVariable, SeriesOrVariable]
Align categorical columns if possible. If not, return originals. This is necessary for efficient merging.
Source code in lib/catalog/owid/catalog/core/tables.py
copy_metadata
¶
Copy metadata from a different table to self.
Source code in lib/catalog/owid/catalog/core/tables.py
keep_metadata
¶
Decorator that turns a function that works on DataFrame or Series into a function that works on Table or Variable and preserves metadata. If the decorated function renames columns, their metadata won't be copied.
Example
Source code in lib/catalog/owid/catalog/core/tables.py
multi_merge
¶
Merge multiple tables.
This is a helper function when merging more than two tables on common columns.
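Conceptually, merging more than two tables amounts to folding a pairwise merge over the list. The sketch below shows that idea on plain lists of row dictionaries; `merge_two` and `merge_many` are hypothetical names, the inner-join behavior and unique keys are assumptions of the sketch, and none of this is the library's code.

```python
from functools import reduce


def merge_two(left: list, right: list, on: str) -> list:
    """Inner-join two lists of row dicts on a shared key (keys assumed unique)."""
    lookup = {row[on]: row for row in right}
    return [{**row, **lookup[row[on]]} for row in left if row[on] in lookup]


def merge_many(tables: list, on: str) -> list:
    """Fold pairwise merges over any number of tables."""
    return reduce(lambda acc, tb: merge_two(acc, tb, on), tables)


# Hypothetical rows for illustration.
pop = [{"country": "A", "pop": 1}, {"country": "B", "pop": 2}]
gdp = [{"country": "A", "gdp": 10}, {"country": "B", "gdp": 20}]
area = [{"country": "A", "area": 100}]

print(merge_many([pop, gdp, area], on="country"))
```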
Parameters:
Returns:
-
combined(Table) –Merged table.
Source code in lib/catalog/owid/catalog/core/tables.py
read
¶
read(
filepath_or_buffer: str | Path | IO[AnyStr],
*args: Any,
file_extension: str | None = None,
metadata: TableMeta | None = None,
origin: Origin | None = None,
underscore: bool = False,
**kwargs: Any,
) -> Table
Read a file based on extension, dispatching to the appropriate reader.
Parameters:
-
filepath_or_buffer(str | Path | IO[AnyStr]) –Path to the file or file-like object to read.
-
*args(Any, default:()) –Additional positional arguments passed to the format-specific reader.
-
file_extension(str | None, default:None) –File extension (without dot). If None, inferred from filepath.
-
metadata(TableMeta | None, default:None) –Table metadata.
-
origin(Origin | None, default:None) –Origin of the table data.
-
underscore(bool, default:False) –True to make all column names snake case.
-
**kwargs(Any, default:{}) –Additional keyword arguments passed to the format-specific reader.
Returns:
-
Table–Table with data and metadata.
Note
For reading ZIP files, use Snapshot.extracted() context manager instead. See etl/snapshot.py for the recommended approach to handling archives.
Source code in lib/catalog/owid/catalog/core/tables.py
read_custom
¶
read_custom(
read_function: Callable,
filepath_or_buffer: str | Path | IO[AnyStr],
metadata: TableMeta,
origin: Origin | None = None,
underscore: bool = False,
*args: Any,
**kwargs: Any,
) -> Table
Read data using a custom reader function and return a Table with metadata.
This function allows using any custom data reading function while automatically attaching metadata and origin information to the resulting Table. Useful when standard read functions (read_csv, read_excel, etc.) don't meet specific needs.
Parameters:
-
read_function(Callable) –Custom function to read the data. Must accept filepath_or_buffer as first argument and return a DataFrame or Table.
-
filepath_or_buffer(str | Path | IO[AnyStr]) –Path to the file or file-like object to read.
-
metadata(TableMeta) –Table metadata.
-
origin(Origin | None, default:None) –Origin of the table data.
-
underscore(bool, default:False) –True to make all column names snake case.
-
*args(Any, default:()) –Additional positional arguments to pass to read_function.
-
**kwargs(Any, default:{}) –Additional keyword arguments to pass to read_function.
Returns:
-
Table(Table) –Data read by the custom function as a Table with attached metadata and origin.
Source code in lib/catalog/owid/catalog/core/tables.py
read_df
¶
read_df(
df: DataFrame,
metadata: TableMeta | None = None,
origin: Origin | None = None,
underscore: bool = False,
) -> Table
Create a Table (with metadata and an origin) from a DataFrame.
Parameters:
-
df(DataFrame) –Input DataFrame.
-
metadata(TableMeta | None, default:None) –Table metadata (with a title and description).
-
origin(Origin | None, default:None) –Origin of the table.
-
underscore(bool, default:False) –True to ensure all column names are snake case.
Returns:
-
Table(Table) –Original data as a Table with metadata and an origin.
Source code in lib/catalog/owid/catalog/core/tables.py
update_variable_dimensions
¶
Update a variable's dimensions metadata.
Parameters:
-
variable(Indicator) –The variable to update with dimension information
-
dimensions_data(dict[str, Any]) –Dictionary containing dimension information
Source code in lib/catalog/owid/catalog/core/tables.py
pandas Series with metadata.
owid.catalog.core.indicators
¶
Classes:
-
Indicator–Enhanced pandas Series with indicator-level metadata support.
-
IndicatorRolling–Wrapper for pandas rolling window operations that preserves Indicator metadata.
Functions:
-
combine_indicators_metadata–Combine metadata from multiple indicators based on an operation.
-
copy_metadata–Copy metadata from one indicator to another.
-
get_unique_description_key_points_from_indicators–Get unique description key points from a list of indicators.
-
get_unique_licenses_from_indicators–Get unique licenses from a list of indicators.
-
get_unique_origins_from_indicators–Get unique origins from a list of indicators.
-
is_nullable_series–Check if a series has a nullable pandas dtype.
Indicator
¶
Indicator(
data: Any = None,
index: Any = None,
name: str | None = None,
_fields: dict[str, VariableMeta] | None = None,
metadata: VariableMeta | None = None,
**kwargs: Any,
)
Bases: Series
Enhanced pandas Series with indicator-level metadata support.
Indicator is a pandas Series subclass that stores rich metadata about individual indicators. It serves as the column type in Table objects and automatically propagates metadata through operations.
Note
This class was formerly called Variable. The old name is still available
as an alias for backwards compatibility.
Key features:
- Automatic metadata propagation through arithmetic operations
- Processing log tracking for data provenance
- Integration with OWID catalog metadata system
- Support for rich metadata including sources, origins, licenses
Attributes:
-
_name(str | None) –Internal name storage for metadata mapping.
-
_fields(dict[str, VariableMeta]) –Dictionary mapping indicator names to their VariableMeta objects.
-
metadata(VariableMeta) –Indicator-level metadata accessible via the .metadata or .m property.
Example
Create an indicator with metadata:
from owid.catalog import Indicator, VariableMeta
ind = Indicator(
[1, 2, 3],
name="gdp",
metadata=VariableMeta(
title="GDP",
unit="trillion USD",
description="Gross Domestic Product"
)
)
Access metadata using shortcuts:
print(ind.metadata.title) # Full property access
print(ind.m.title) # Shorthand alias
print(ind.title) # Direct property access
Metadata propagates through operations:
Initialize an Indicator with data and metadata.
Parameters:
-
data(Any, default:None) –Array-like data for the indicator (list, numpy array, pandas Series, etc.).
-
index(Any, default:None) –Index labels for the data. If None, uses default integer index.
-
name(str | None, default:None) –Name of the indicator. Required if metadata is provided.
-
_fields(dict[str, VariableMeta] | None, default:None) –Internal metadata dictionary. Don't use directly; use the metadata parameter instead.
-
metadata(VariableMeta | None, default:None) –VariableMeta object with indicator-level metadata (title, unit, sources, etc.).
-
**kwargs(Any, default:{}) –Additional arguments passed to pandas.Series.__init__.
Raises:
-
AssertionError–If both metadata and _fields are provided, or if metadata is provided without a name.
Example
Create a simple indicator:
Create with metadata:
Methods:
-
copy_metadata–Copy metadata from another indicator.
-
rolling–Create a rolling window operation that preserves metadata.
-
to_frame–Convert Indicator to a Table (single-column table).
Source code in lib/catalog/owid/catalog/core/indicators.py
m
property
¶
m: VariableMeta
Metadata alias for shorter access.
Provides convenient shorthand access to indicator metadata.
Returns:
-
VariableMeta–The indicator's VariableMeta object.
copy_metadata
¶
Copy metadata from another indicator.
Parameters:
-
from_variable(Indicator) –Source indicator to copy metadata from.
-
inplace(bool, default:False) –If True, modifies the current indicator. If False, returns a new indicator.
Returns:
-
Indicator | None–New indicator with copied metadata if inplace=False, otherwise None.
Example
Create new indicator with copied metadata
Copy metadata in-place
Source code in lib/catalog/owid/catalog/core/indicators.py
rolling
¶
rolling(*args: Any, **kwargs: Any) -> IndicatorRolling
Create a rolling window operation that preserves metadata.
This method wraps pandas rolling operations while maintaining the indicator's metadata.
Parameters:
-
*args(Any, default:()) –Arguments passed to pandas.Series.rolling.
-
**kwargs(Any, default:{}) –Keyword arguments passed to pandas.Series.rolling.
Returns:
-
IndicatorRolling–IndicatorRolling object that applies operations while preserving metadata.
Example
Calculate 7-day rolling average
The result retains the original indicator's metadata
Source code in lib/catalog/owid/catalog/core/indicators.py
to_frame
¶
Convert Indicator to a Table (single-column table).
When a new name is given, the indicator's metadata is copied to the renamed column so that origins are not lost.
Source code in lib/catalog/owid/catalog/core/indicators.py
IndicatorRolling
¶
IndicatorRolling(
rolling: Rolling,
metadata: VariableMeta,
name: str | None = None,
)
Wrapper for pandas rolling window operations that preserves Indicator metadata.
This class intercepts rolling window operations (mean, sum, std, etc.) and ensures that the resulting Indicator retains the original metadata.
Note
This class was formerly called VariableRolling.
Attributes:
-
rolling–The underlying pandas Rolling object.
-
metadata–Indicator metadata to preserve through operations.
-
name–Indicator name to preserve through operations.
Example
Create a rolling average
Metadata is preserved
Note
You typically don't instantiate this class directly. Use Indicator.rolling() instead.
Initialize an IndicatorRolling wrapper.
Parameters:
-
rolling(Rolling) –The pandas Rolling object to wrap.
-
metadata(VariableMeta) –Metadata to preserve through operations.
-
name(str | None, default:None) –Indicator name to preserve through operations.
Source code in lib/catalog/owid/catalog/core/indicators.py
combine_indicators_metadata
¶
combine_indicators_metadata(
indicators: list[Any] | None = None,
operation: OPERATION | None = None,
name: str = UNNAMED_INDICATOR,
*,
variables: list[Any] | None = None,
) -> VariableMeta
Combine metadata from multiple indicators based on an operation.
This function intelligently merges metadata from multiple indicators when they are combined through operations like addition, division, etc. The logic varies by field:
- If all indicators have identical values for a field, that value is preserved
- For lists (sources, origins, licenses), all unique values are combined
- For some operations (e.g., division), only the first indicator's metadata is kept
- Processing logs are merged and a new entry is added for the operation
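The first two rules in the list above can be sketched on a toy metadata class. This is an illustration under stated assumptions, not the library's logic: `Meta` and `combine` are hypothetical, dropping a scalar field to None when the inputs disagree is an assumption of the sketch, and the operation-specific rules and processing log are not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Meta:
    """Hypothetical stand-in with one scalar field and one list field."""
    unit: Optional[str] = None
    origins: list = field(default_factory=list)


def combine(metas: list) -> Meta:
    # Keep a scalar field only when every input agrees on it
    # (falling back to None otherwise is an assumption of this sketch).
    units = {m.unit for m in metas}
    unit = units.pop() if len(units) == 1 else None
    # Merge list fields, keeping unique values in order of first occurrence.
    origins = []
    for m in metas:
        for origin in m.origins:
            if origin not in origins:
                origins.append(origin)
    return Meta(unit=unit, origins=origins)


a = Meta(unit="people", origins=["Origin A"])
b = Meta(unit="people", origins=["Origin A", "Origin B"])
c = Meta(unit="km")

print(combine([a, b]))  # Meta(unit='people', origins=['Origin A', 'Origin B'])
print(combine([a, c]).unit)  # None
```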
Parameters:
-
indicators(list[Any] | None, default:None) –List of indicators (or other objects) to combine metadata from. Non-Indicator objects are automatically filtered out.
-
operation(OPERATION | None, default:None) –Type of operation being performed ("+", "-", "*", "/", etc.). Affects how metadata fields are combined.
-
name(str, default:UNNAMED_INDICATOR) –Name for the resulting indicator. Defaults to UNNAMED_INDICATOR.
-
variables(list[Any] | None, default:None) –Deprecated alias for indicators parameter (for backwards compatibility).
Returns:
-
VariableMeta–Combined VariableMeta object with merged metadata from all indicators.
Example
Metadata from addition
Metadata from division (keeps first indicator's metadata)
Note
This function is typically called automatically by Indicator arithmetic operations. You rarely need to call it directly.
Source code in lib/catalog/owid/catalog/core/indicators.py
copy_metadata
¶
copy_metadata(
from_variable: Indicator,
to_variable: Indicator,
inplace: bool = False,
) -> Indicator | None
Copy metadata from one indicator to another.
Parameters:
-
from_variable(Indicator) –Source indicator to copy metadata from.
-
to_variable(Indicator) –Target indicator to copy metadata to.
-
inplace(bool, default:False) –If True, modifies to_variable in place. If False, returns a new indicator.
Returns:
-
Indicator | None–New indicator with copied metadata if inplace=False, otherwise None.
Example
Create new indicator with copied metadata
Copy metadata in-place
Source code in lib/catalog/owid/catalog/core/indicators.py
get_unique_description_key_points_from_indicators
¶
Get unique description key points from a list of indicators.
Collects all unique key points from the description_key field of multiple indicators, preserving order of first occurrence.
Parameters:
Returns:
Example
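"Preserving order of first occurrence" can be sketched on plain lists of strings. `unique_in_order` is a hypothetical helper that works on lists of key points rather than on indicators, so it only illustrates the de-duplication rule:

```python
def unique_in_order(groups: list) -> list:
    """Flatten groups of key points, dropping repeats but keeping first-seen order."""
    # A dict preserves insertion order, so its keys act as an ordered set.
    return list(dict.fromkeys(point for group in groups for point in group))


print(unique_in_order([["a", "b"], ["b", "c"], ["a"]]))  # ['a', 'b', 'c']
```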
Source code in lib/catalog/owid/catalog/core/indicators.py
get_unique_licenses_from_indicators
¶
Get unique licenses from a list of indicators.
Collects all unique License objects from the metadata of multiple indicators, preserving order of first occurrence.
Parameters:
Returns:
Example
Source code in lib/catalog/owid/catalog/core/indicators.py
get_unique_origins_from_indicators
¶
Get unique origins from a list of indicators.
Collects all unique Origin objects from the metadata of multiple indicators, preserving order of first occurrence.
Parameters:
Returns:
Example
Source code in lib/catalog/owid/catalog/core/indicators.py
is_nullable_series
¶
Check if a series has a nullable pandas dtype.
Determines whether a pandas Series uses one of the nullable integer, float, or boolean dtypes (as opposed to traditional numpy dtypes).
Parameters:
-
s(Any) –Any object to check. Typically a pandas Series.
Returns:
-
bool–True if the object has a nullable pandas dtype, False otherwise.
Example
import pandas as pd
from owid.catalog.core.indicators import is_nullable_series
# Nullable integer dtype
s1 = pd.Series([1, 2, None], dtype="Int64")
assert is_nullable_series(s1)
# Traditional numpy dtype
s2 = pd.Series([1, 2, 3], dtype="int64")
assert not is_nullable_series(s2)
# Nullable boolean dtype
s3 = pd.Series([True, False, None], dtype="boolean")
assert is_nullable_series(s3)
Note
Nullable dtypes (capitalized like Int64) differ from numpy dtypes (int64)
in that they can represent missing values using pd.NA instead of np.nan.
Source code in lib/catalog/owid/catalog/core/indicators.py
owid.catalog.core.meta
¶
Classes:
-
DatasetMeta–The metadata for this entire dataset kept in JSON (e.g. mydataset/index.json).
-
License–License information for data products.
-
MetaBase–Base class for all metadata objects in the catalog.
-
Origin–Comprehensive metadata about the origin of a data product.
-
TableMeta–
-
VariableMeta–Allowed fields for the display attribute used for grapher:
Functions:
-
is_year_or_date–Matches dates in "yyyy-mm-dd" format or years in "yyyy" format.
-
update_variable_metadata–Post-process variable metadata and fix issues before rendering or exporting to grapher.
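A check for the two formats named for is_year_or_date can be sketched with a regular expression. The pattern and the name `looks_like_year_or_date` are assumptions made for illustration; the library's own pattern may differ, and this sketch checks only the shape of the string, not whether the date is a valid calendar date.

```python
import re

# Assumed pattern: four digits, optionally followed by -mm-dd.
YEAR_OR_DATE = re.compile(r"\d{4}(-\d{2}-\d{2})?")


def looks_like_year_or_date(value: str) -> bool:
    """Return True for strings shaped like "yyyy" or "yyyy-mm-dd"."""
    return YEAR_OR_DATE.fullmatch(value) is not None


print(looks_like_year_or_date("2023"))        # True
print(looks_like_year_or_date("2023-05-17"))  # True
print(looks_like_year_or_date("2023-05"))     # False
```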
DatasetMeta
dataclass
¶
DatasetMeta(
channel: str | None = None,
namespace: str | None = None,
short_name: str | None = None,
title: str | None = None,
description: str | None = None,
licenses: list[License] = list(),
is_public: bool = True,
additional_info: dict[str, Any] | None = None,
version: str | None = None,
update_period_days: int | None = None,
non_redistributable: bool = False,
source_checksum: str | None = None,
)
Bases: MetaBase
The metadata for this entire dataset kept in JSON (e.g. mydataset/index.json).
The number of fields is limited, but it should cover everything we get from a Snapshot. Richer metadata can be stored at the table and variable levels.
Methods:
-
copy–Create a copy of the metadata object.
-
from_dict–Create metadata object from dictionary.
-
load–Load metadata from a JSON file.
-
save–Save metadata to a JSON file.
-
to_dict–Convert metadata object to dictionary.
-
update–Update metadata fields with new values.
-
update_from_yaml–The main reason for wanting to do this is to manually override what goes into Grapher before an export.
Attributes:
copy
¶
copy(deep: bool = True) -> Self
Create a copy of the metadata object.
Parameters:
-
deep(bool, default:True) –If True, creates a deep copy (copies nested objects). If False, creates a shallow copy.
Returns:
-
Self–Copy of the metadata object.
Example
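The difference between deep and shallow copies can be shown with the standard `copy` module on a toy dataclass. `Meta` below is a hypothetical stand-in, and the snippet illustrates general Python copy semantics rather than calling the library's `copy` method:

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Meta:
    """Hypothetical metadata object holding a nested, mutable list."""
    title: str = ""
    licenses: list = field(default_factory=list)


original = Meta(title="GDP Data", licenses=["CC BY 4.0"])
shallow = copy.copy(original)   # nested list is shared with the original
deep = copy.deepcopy(original)  # nested list is duplicated

original.licenses.append("MIT")

print(shallow.licenses)  # ['CC BY 4.0', 'MIT']
print(deep.licenses)     # ['CC BY 4.0']
```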
Source code in lib/catalog/owid/catalog/core/meta.py
from_dict
classmethod
¶
Create metadata object from dictionary.
Parameters:
Returns:
-
T–New metadata object of the appropriate type.
Note
This uses a custom implementation that's significantly faster than the default dataclasses_json method.
Source code in lib/catalog/owid/catalog/core/meta.py
load
classmethod
¶
load(filename: str) -> Self
Load metadata from a JSON file.
Parameters:
-
filename(str) –Path to the JSON file containing metadata.
Returns:
-
Self–Metadata object loaded from the file.
Source code in lib/catalog/owid/catalog/core/meta.py
save
¶
Save metadata to a JSON file.
Parameters:
Source code in lib/catalog/owid/catalog/core/meta.py
to_dict
¶
Convert metadata object to dictionary.
Parameters:
-
encode_json(bool, default:False) –If True, encodes values for JSON serialization.
Returns:
Example
Source code in lib/catalog/owid/catalog/core/meta.py
update
¶
Update metadata fields with new values.
Parameters:
-
**kwargs(dict[str, Any], default:{}) –Field names and their new values. None values are ignored.
Example
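The "None values are ignored" rule can be sketched on a toy dataclass. `Meta` and the free function `update` are hypothetical and only mirror the behavior described above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Meta:
    """Hypothetical metadata object with two optional fields."""
    title: Optional[str] = None
    description: Optional[str] = None


def update(meta: Meta, **kwargs) -> None:
    """Set each given field, skipping any whose value is None."""
    for key, value in kwargs.items():
        if value is not None:
            setattr(meta, key, value)


meta = Meta(title="GDP Data", description="Old text")
update(meta, title=None, description="New text")
print(meta)  # Meta(title='GDP Data', description='New text')
```

One consequence of this rule is that a field cannot be cleared by passing None through such an update; it has to be reset by direct assignment.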
Source code in lib/catalog/owid/catalog/core/meta.py
update_from_yaml
¶
The main reason for wanting to do this is to manually override what goes into Grapher before an export.
Source code in lib/catalog/owid/catalog/core/meta.py
License
dataclass
¶
Bases: MetaBase
License information for data products.
Stores licensing details for datasets and variables, including the license name and URL to the full license text.
Attributes:
-
name(str | None) –License name (e.g., "CC BY 4.0", "MIT", "Public Domain").
-
url(str | None) –URL to the full license text or information page.
Example
Methods:
-
copy–Create a copy of the metadata object.
-
from_dict–Create metadata object from dictionary.
-
load–Load metadata from a JSON file.
-
save–Save metadata to a JSON file.
-
to_dict–Convert metadata object to dictionary.
-
update–Update metadata fields with new values.
copy
¶
copy(deep: bool = True) -> Self
Create a copy of the metadata object.
Parameters:
-
deep(bool, default:True) –If True, creates a deep copy (copies nested objects). If False, creates a shallow copy.
Returns:
-
Self–Copy of the metadata object.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
from_dict
classmethod
¶
Create metadata object from dictionary.
Parameters:
Returns:
-
T–New metadata object of the appropriate type.
Note
This uses a custom implementation that's significantly faster than the default dataclasses_json method.
Source code in lib/catalog/owid/catalog/core/meta.py
load
classmethod
¶
load(filename: str) -> Self
Load metadata from a JSON file.
Parameters:
-
filename(str) –Path to the JSON file containing metadata.
Returns:
-
Self–Metadata object loaded from the file.
Source code in lib/catalog/owid/catalog/core/meta.py
save
¶
Save metadata to a JSON file.
Parameters:
Source code in lib/catalog/owid/catalog/core/meta.py
to_dict
¶
Convert metadata object to dictionary.
Parameters:
-
encode_json(bool, default:False) –If True, encodes values for JSON serialization.
Returns:
Example
Source code in lib/catalog/owid/catalog/core/meta.py
update
¶
Update metadata fields with new values.
Parameters:
-
**kwargs(dict[str, Any], default:{}) –Field names and their new values. None values are ignored.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
MetaBase
¶
Bases: DataClassJsonMixin
Base class for all metadata objects in the catalog.
Provides common functionality for metadata serialization, hashing, comparison, and persistence. All metadata classes (DatasetMeta, TableMeta, VariableMeta, etc.) inherit from this base class.
Key features:
- JSON serialization/deserialization
- Deterministic hashing for deduplication
- Deep copying support
- File persistence (save/load)
- Dictionary conversion
Example
from owid.catalog import DatasetMeta
# Create metadata
meta = DatasetMeta(title="GDP Data", short_name="gdp")
# Save to file
meta.save("metadata.json")
# Load from file
loaded = DatasetMeta.load("metadata.json")
# Convert to dictionary
d = meta.to_dict()
# Create deep copy
meta_copy = meta.copy(deep=True)
Methods:
-
copy–Create a copy of the metadata object.
-
from_dict–Create metadata object from dictionary.
-
load–Load metadata from a JSON file.
-
save–Save metadata to a JSON file.
-
to_dict–Convert metadata object to dictionary.
-
update–Update metadata fields with new values.
copy
¶
copy(deep: bool = True) -> Self
Create a copy of the metadata object.
Parameters:
-
deep(bool, default:True) –If True, creates a deep copy (copies nested objects). If False, creates a shallow copy.
Returns:
-
Self–Copy of the metadata object.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
from_dict
classmethod
¶
Create metadata object from dictionary.
Parameters:
Returns:
-
T–New metadata object of the appropriate type.
Note
This uses a custom implementation that's significantly faster than the default dataclasses_json method.
Source code in lib/catalog/owid/catalog/core/meta.py
load
classmethod
¶
load(filename: str) -> Self
Load metadata from a JSON file.
Parameters:
-
filename(str) –Path to the JSON file containing metadata.
Returns:
-
Self–Metadata object loaded from the file.
Source code in lib/catalog/owid/catalog/core/meta.py
save
¶
Save metadata to a JSON file.
Parameters:
Source code in lib/catalog/owid/catalog/core/meta.py
to_dict
¶
Convert metadata object to dictionary.
Parameters:
-
encode_json(bool, default:False) –If True, encodes values for JSON serialization.
Returns:
Example
Source code in lib/catalog/owid/catalog/core/meta.py
update
¶
Update metadata fields with new values.
Parameters:
-
**kwargs(dict[str, Any], default:{}) –Field names and their new values. None values are ignored.
Example
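The "None values are ignored" rule can be sketched as follows. `MetaSketch` is a hypothetical stand-in dataclass, and the loop is an assumption about how such an update might work, not the library's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaSketch:
    # Hypothetical stand-in for a metadata class.
    title: Optional[str] = None
    description: Optional[str] = None

    def update(self, **kwargs) -> None:
        # Set each provided field, skipping None so existing values survive.
        for key, value in kwargs.items():
            if value is not None:
                setattr(self, key, value)


meta = MetaSketch(title="GDP Data", description="Original description")
meta.update(title="GDP per capita", description=None)
print(meta)
```

Under this rule, passing `None` cannot be used to clear a field; the existing value is kept.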
Source code in lib/catalog/owid/catalog/core/meta.py
Origin
dataclass
¶
Origin(
producer: str,
title: str,
description: str | None = None,
title_snapshot: str | None = None,
description_snapshot: str | None = None,
citation_full: str | None = None,
attribution: str | None = None,
attribution_short: str | None = None,
version_producer: str | None = None,
url_main: str | None = None,
url_download: str | None = None,
date_accessed: str | None = None,
date_published: YearDateLatest | None = None,
license: License | None = None,
)
Bases: MetaBase
Comprehensive metadata about the origin of a data product.
Origin provides detailed provenance information for datasets, including
producer details, citations, URLs, publication dates, and licensing. This is
the modern replacement for the legacy Source class.
Attributes:
-
producer(str) –Name of the institution or author(s) that produced the data (e.g., "World Bank", "United Nations").
-
title(str) –Title of the original data product.
-
description(str | None) –Description of the data product and its methodology.
-
title_snapshot(str | None) –Title of the specific data subset extracted from the product. Only use if different from title.
-
description_snapshot(str | None) –Description of the snapshot subset. Use when the snapshot differs from the full data product.
-
citation_full(str | None) –Complete citation for the data product in academic format.
-
attribution(str | None) –Name to use for attribution (e.g., "V-Dem Institute" instead of individual authors). Defaults to producer if not provided.
-
attribution_short(str | None) –Short form of attribution for space-constrained contexts.
-
version_producer(str | None) –Version number or identifier from the data producer (e.g., "v12", "2023.1").
-
url_main(str | None) –Authoritative URL for the dataset's main page.
-
url_download(str | None) –Direct URL to download the dataset.
-
date_accessed(str | None) –ISO-format date when the dataset was accessed (YYYY-MM-DD).
-
date_published(YearDateLatest | None) –Publication date (YYYY-MM-DD), year (YYYY), or "latest" for continuously updated datasets.
-
license(License | None) –License information for the data product.
Example
from owid.catalog import Origin, License
# Comprehensive origin metadata
origin = Origin(
producer="World Bank",
title="World Development Indicators",
description="Annual indicators of development",
attribution_short="World Bank",
version_producer="2024",
url_main="https://datatopics.worldbank.org/world-development-indicators/",
url_download="https://databank.worldbank.org/data/download/WDI_CSV.zip",
date_accessed="2024-01-15",
date_published="2024",
license=License(
name="CC BY 4.0",
url="https://creativecommons.org/licenses/by/4.0/"
)
)
# Minimal origin (only required fields)
origin_minimal = Origin(
producer="UN",
title="Population Data"
)
Raises:
-
ValueError–If date_published is not a valid year, date, or "latest".
Methods:
-
copy–Create a copy of the metadata object.
-
from_dict–Create metadata object from dictionary.
-
load–Load metadata from a JSON file.
-
save–Save metadata to a JSON file.
-
to_dict–Convert metadata object to dictionary.
-
update–Update metadata fields with new values.
copy
¶
copy(deep: bool = True) -> Self
Create a copy of the metadata object.
Parameters:
-
deep(bool, default:True) –If True, creates a deep copy (copies nested objects). If False, creates a shallow copy.
Returns:
-
Self–Copy of the metadata object.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
from_dict
classmethod
¶
Create metadata object from dictionary.
Parameters:
Returns:
-
T–New metadata object of the appropriate type.
Note
This uses a custom implementation that's significantly faster than the default dataclasses_json method.
Source code in lib/catalog/owid/catalog/core/meta.py
load
classmethod
¶
load(filename: str) -> Self
Load metadata from a JSON file.
Parameters:
-
filename(str) –Path to the JSON file containing metadata.
Returns:
-
Self–Metadata object loaded from the file.
Source code in lib/catalog/owid/catalog/core/meta.py
save
¶
Save metadata to a JSON file.
Parameters:
Source code in lib/catalog/owid/catalog/core/meta.py
to_dict
¶
Convert metadata object to dictionary.
Parameters:
-
encode_json(bool, default:False) –If True, encodes values for JSON serialization.
Returns:
Example
Source code in lib/catalog/owid/catalog/core/meta.py
update
¶
Update metadata fields with new values.
Parameters:
-
**kwargs(dict[str, Any], default:{}) –Field names and their new values. None values are ignored.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
TableMeta
dataclass
¶
TableMeta(
short_name: str | None = None,
title: str | None = None,
description: str | None = None,
dataset: DatasetMeta | None = None,
primary_key: list[str] = list(),
dimensions: list[TableDimension] | None = None,
)
Bases: MetaBase
Methods:
-
copy–Create a copy of the metadata object.
-
from_dict–Create metadata object from dictionary.
-
load–Load metadata from a JSON file.
-
save–Save metadata to a JSON file.
-
to_dict–Convert metadata object to dictionary.
-
update–Update metadata fields with new values.
Attributes:
copy
¶
copy(deep: bool = True) -> Self
Create a copy of the metadata object.
Parameters:
-
deep(bool, default:True) –If True, creates a deep copy (copies nested objects). If False, creates a shallow copy.
Returns:
-
Self–Copy of the metadata object.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
from_dict
classmethod
¶
Create metadata object from dictionary.
Parameters:
Returns:
-
T–New metadata object of the appropriate type.
Note
This uses a custom implementation that's significantly faster than the default dataclasses_json method.
Source code in lib/catalog/owid/catalog/core/meta.py
load
classmethod
¶
load(filename: str) -> Self
Load metadata from a JSON file.
Parameters:
-
filename(str) –Path to the JSON file containing metadata.
Returns:
-
Self–Metadata object loaded from the file.
Source code in lib/catalog/owid/catalog/core/meta.py
save
¶
Save metadata to a JSON file.
Parameters:
Source code in lib/catalog/owid/catalog/core/meta.py
to_dict
¶
Convert metadata object to dictionary.
Parameters:
-
encode_json(bool, default:False) –If True, encodes values for JSON serialization.
Returns:
Example
Source code in lib/catalog/owid/catalog/core/meta.py
update
¶
Update metadata fields with new values.
Parameters:
-
**kwargs(dict[str, Any], default:{}) –Field names and their new values. None values are ignored.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
VariableMeta
dataclass
¶
VariableMeta(
title: str | None = None,
description: str | None = None,
description_short: str | None = None,
description_from_producer: str | None = None,
description_key: list[str] = list(),
origins: list[Origin] = list(),
licenses: list[License] = list(),
unit: str | None = None,
short_unit: str | None = None,
display: dict[str, Any] | None = None,
additional_info: dict[str, Any] | None = None,
processing_level: PROCESSING_LEVELS | None = None,
presentation: VariablePresentationMeta | None = None,
description_processing: str | None = None,
license: License | None = None,
type: VARIABLE_TYPE | None = None,
sort: list[str] = list(),
dimensions: dict[str, Any] | None = None,
original_short_name: str | None = None,
original_title: str | None = None,
)
Bases: MetaBase
Allowed fields for display attribute used for grapher:
name
zeroDay
yearIsDay
includeInTable
numDecimalPlaces
conversionFactor
entityAnnotationsMap
The display fields unit and shortUnit are copied from the unit and short_unit attributes of the VariableMeta object.
NOTE: consider using its own object for display instead of dict and also possibly
underscoring fields and converting them back to camelCase before inserting to grapher
Methods:
-
from_dict–Create metadata object from dictionary.
-
load–Load metadata from a JSON file.
-
render–Render Jinja in all fields of VariableMeta. Return a new VariableMeta object.
-
save–Save metadata to a JSON file.
-
to_dict–Convert metadata object to dictionary.
-
update–Update metadata fields with new values.
Attributes:
-
schema_version(int) –Schema version is used to easily understand everywhere what metadata standard was used
schema_version
property
¶
schema_version: int
Indicates which metadata standard was used to author this variable's metadata. Defaults to 1 for legacy variables. "Modern" variables that fill in the presentation key and use origins should record 2.
from_dict
classmethod
¶
Create metadata object from dictionary.
Parameters:
Returns:
-
T–New metadata object of the appropriate type.
Note
This uses a custom implementation that's significantly faster than the default dataclasses_json method.
Source code in lib/catalog/owid/catalog/core/meta.py
load
classmethod
¶
load(filename: str) -> Self
Load metadata from a JSON file.
Parameters:
-
filename(str) –Path to the JSON file containing metadata.
Returns:
-
Self–Metadata object loaded from the file.
Source code in lib/catalog/owid/catalog/core/meta.py
render
¶
render(
dim_dict: dict[str, Any], remove_dods: bool = False
) -> VariableMeta
Render Jinja in all fields of VariableMeta. Return a new VariableMeta object.
Parameters:
-
dim_dict(dict[str, Any]) –Dictionary of dimensions to render.
-
remove_dods(bool, default:False) –If True, removes references to details on demand from the text.
Usage
from owid.catalog import Dataset
from etl import paths
ds = Dataset(paths.DATA_DIR / "garden/emissions/2025-02-12/ceds_air_pollutants")
tb = ds['ceds_air_pollutants']
tb.emissions.m.render({'pollutant': 'CO', 'sector': 'Transport'})
Source code in lib/catalog/owid/catalog/core/meta.py
save
¶
Save metadata to a JSON file.
Parameters:
Source code in lib/catalog/owid/catalog/core/meta.py
to_dict
¶
Convert metadata object to dictionary.
Parameters:
-
encode_json(bool, default:False) –If True, encodes values for JSON serialization.
Returns:
Example
Source code in lib/catalog/owid/catalog/core/meta.py
update
¶
Update metadata fields with new values.
Parameters:
-
**kwargs(dict[str, Any], default:{}) –Field names and their new values. None values are ignored.
Example
Source code in lib/catalog/owid/catalog/core/meta.py
is_year_or_date
¶
Matches dates in "yyyy-mm-dd" format or years in "yyyy" format.
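A minimal sketch of such a format check, assuming a plain regular expression. The pattern and the name `is_year_or_date_sketch` are illustrative, and the real function may differ.

```python
import re

# Assumed pattern: a four-digit year, optionally followed by "-mm-dd".
YEAR_OR_DATE = re.compile(r"\d{4}(-\d{2}-\d{2})?")


def is_year_or_date_sketch(s: str) -> bool:
    # fullmatch rejects strings with extra leading or trailing characters.
    return YEAR_OR_DATE.fullmatch(s) is not None


print(is_year_or_date_sketch("2024"), is_year_or_date_sketch("2024-01-15"))
```

This sketch checks only the shape of the string, not whether the month and day form a real calendar date.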
Source code in lib/catalog/owid/catalog/core/meta.py
update_variable_metadata
¶
update_variable_metadata(
meta: VariableMeta,
) -> VariableMeta
Post-process variable metadata and fix issues before rendering or exporting to grapher. This includes converting strings to numbers, removing empty fields, and post-processing Jinja rendering.
Source code in lib/catalog/owid/catalog/core/meta.py
owid.catalog.core.utils
¶
Functions:
-
dataclass_from_dict–Recursively create an instance of a dataclass from a dictionary. We've implemented custom
-
dynamic_yaml_load–Load YAML file with dynamic parameter substitution.
-
dynamic_yaml_to_dict–Convert dynamic YAML object to plain dictionary.
-
hash_any–Return a unique, deterministic hash for an arbitrary object.
-
parse_numeric_list–Parse a string representation of a numeric list.
-
prune_dict–Remove private keys and empty values from a dictionary recursively.
-
pruned_json–Decorator that modifies a class's to_dict method to prune empty values.
-
remove_details_on_demand–Remove details-on-demand references from markdown text.
-
underscore–Convert arbitrary string to snake_case format.
-
underscore_table–Convert column and index names to underscore format.
-
validate_underscore–Validate that a name follows snake_case convention.
dataclass_from_dict
¶
Recursively create an instance of a dataclass from a dictionary. We implemented a custom method because the original dataclasses_json.from_dict was too slow; the custom version is more than 2x faster. See https://github.com/owid/etl/pull/3517#issuecomment-2468084380 for more details.
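The recursive idea can be sketched with the standard library alone. The helper names (`from_dict_sketch`, `convert_sketch`) and the three example dataclasses are hypothetical, and the sketch handles only nested dataclasses, `Optional[...]`, and `List[...]` fields; the real function is not shown here and likely covers more cases.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, List, Optional, get_args, get_origin, get_type_hints


def convert_sketch(tp: Any, value: Any) -> Any:
    if value is None:
        return None
    args = get_args(tp)
    if type(None) in args:
        # Optional[X]: unwrap and convert against X.
        non_none = [a for a in args if a is not type(None)]
        return convert_sketch(non_none[0], value) if len(non_none) == 1 else value
    if get_origin(tp) is list:
        # List[X]: convert each element against X.
        return [convert_sketch(args[0], v) for v in value]
    if is_dataclass(tp):
        return from_dict_sketch(tp, value)
    return value


def from_dict_sketch(cls: Any, data: dict) -> Any:
    hints = get_type_hints(cls)
    # Only keys present in the dictionary are passed; defaults fill the rest.
    kwargs = {
        f.name: convert_sketch(hints[f.name], data[f.name])
        for f in fields(cls)
        if f.name in data
    }
    return cls(**kwargs)


@dataclass
class LicenseSketch:
    name: str
    url: Optional[str] = None


@dataclass
class OriginSketch:
    producer: str
    title: str
    license: Optional[LicenseSketch] = None


@dataclass
class VariableSketch:
    title: str
    origins: List[OriginSketch] = field(default_factory=list)


data = {
    "title": "Population",
    "origins": [{"producer": "UN", "title": "Population Data", "license": {"name": "CC BY 4.0"}}],
}
var = from_dict_sketch(VariableSketch, data)
print(var)
```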
Source code in lib/catalog/owid/catalog/core/utils.py
dynamic_yaml_load
¶
Load YAML file with dynamic parameter substitution.
Loads a YAML file and updates it with provided parameters for dynamic interpolation. Supports loading from file paths, path strings, or file-like objects.
Parameters:
-
source(Path | str | TextIO) –File path (Path or str) or file-like object (e.g., StringIO).
-
params(dict, default:{}) –Dictionary of parameters to substitute in the YAML. Defaults to empty dict.
Returns:
-
dict–Parsed YAML data as dictionary with parameters applied.
Example
Source code in lib/catalog/owid/catalog/core/utils.py
dynamic_yaml_to_dict
¶
Convert dynamic YAML object to plain dictionary.
Dynamic YAML objects can cause issues when unpacking into dataclass constructors. This function converts them to standard Python dictionaries for safe usage.
Parameters:
-
yd(Any) –Dynamic YAML object to convert.
Returns:
-
dict–Plain Python dictionary.
Example
Problem: Dynamic YAML can cause errors
Solution: Convert to dict first
Note
Always use this conversion before unpacking into dataclass constructors to avoid unexpected behavior with dynamic YAML objects.
Source code in lib/catalog/owid/catalog/core/utils.py
hash_any
¶
Return a unique, deterministic hash for an arbitrary object.
This function is especially useful when working with mutable objects, such as dataclasses that
can't be made frozen, but where you still need to use operations like set, dict keys, or
deduplication with unique. A standard Python hash() is not suitable in such cases because Python's
hash() function for strings is randomized across different interpreter sessions for security reasons
(via PYTHONHASHSEED), which can result in non-deterministic hash values.
This function handles common Python data structures, such as dataclasses, lists, dicts, strings, and None,
and ensures that the returned hash is always deterministic across different runs. For strings, it uses an MD5
hash truncated to 64 bits to maintain consistent behavior across different runs of the program.
The function is recursive, so it can handle nested objects like lists of dataclasses, dicts with list values, etc.
Parameters:
-
x(Any) –The object to be hashed. It can be of any type: dataclass, list, dict, string, or other.
Returns:
-
int(int) –A deterministic integer hash value for the object.
Special cases:
- Dataclasses: It recursively hashes each field of the dataclass by generating a tuple of (field_name_hash, field_value_hash) and then hashes that tuple.
- Lists: It recursively hashes each element in the list, converts the list to a tuple (because tuples are hashable), and then hashes the tuple.
- Dictionaries: It hashes the keys and values of the dictionary, sorting them by key to ensure consistency, then generates a tuple of (key_hash, value_hash) pairs and hashes that tuple.
- Strings: Instead of the built-in hash(), it uses the MD5 hash algorithm to generate a consistent 64-bit hash (by truncating the result) that remains the same across interpreter runs.
- None: Always returns 0 as the hash for None.
- Other types: Falls back on Python's built-in hash() function for all other types of objects.
Example
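The special cases listed above can be sketched roughly as follows. `hash_any_sketch` and `LicenseSketch` are hypothetical names, and details such as which 64 bits of the MD5 digest are kept are assumptions, so the values will not necessarily match those of the real `hash_any`.

```python
import hashlib
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


def hash_any_sketch(x: Any) -> int:
    if is_dataclass(x) and not isinstance(x, type):
        # Dataclass instance: hash (field name, field value) pairs.
        return hash(tuple((hash_any_sketch(f.name), hash_any_sketch(getattr(x, f.name))) for f in fields(x)))
    if isinstance(x, list):
        return hash(tuple(hash_any_sketch(v) for v in x))
    if isinstance(x, dict):
        # Sort by key so insertion order does not change the result.
        return hash(tuple((hash_any_sketch(k), hash_any_sketch(v)) for k, v in sorted(x.items())))
    if isinstance(x, str):
        # MD5 is stable across interpreter sessions; keep the low 64 bits (assumed).
        return int(hashlib.md5(x.encode("utf-8")).hexdigest(), 16) & 0xFFFFFFFFFFFFFFFF
    if x is None:
        return 0
    return hash(x)


@dataclass
class LicenseSketch:
    # Mutable dataclass: not hashable with the built-in hash().
    name: str
    tags: list


a = LicenseSketch(name="CC BY 4.0", tags=["open"])
b = LicenseSketch(name="CC BY 4.0", tags=["open"])
print(hash_any_sketch(a) == hash_any_sketch(b))
```

Because tuples of integers hash deterministically in Python, the only step that needs a replacement for the built-in `hash()` is the string case.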
Source code in lib/catalog/owid/catalog/core/utils.py
parse_numeric_list
¶
Parse a string representation of a numeric list.
Converts a comma-separated string of numbers (optionally wrapped in brackets) into a Python list of integers and floats.
Parameters:
-
val(list | str) –String representation of a numeric list or an existing list. If already a list, returns it unchanged.
Returns:
Example
# String with brackets
parse_numeric_list("[10, 20, 30]")
# Returns: [10, 20, 30]
# String without brackets
parse_numeric_list("1.5, 2.5, 3.0")
# Returns: [1.5, 2.5, 3.0]
# Mixed integers and floats
parse_numeric_list("10, 20.5, 30")
# Returns: [10, 20.5, 30]
# Already a list (no-op)
parse_numeric_list([1, 2, 3])
# Returns: [1, 2, 3]
Note
Numbers with decimal points are parsed as floats, others as integers.
Source code in lib/catalog/owid/catalog/core/utils.py
prune_dict
¶
Remove private keys and empty values from a dictionary recursively.
Removes all keys starting with underscore (private fields) and all empty
values (None, empty lists, empty dicts) from a dictionary and its nested
structures — except keys listed in KEEP_IF_EMPTY, where an explicit
empty value is meaningful and must round-trip through serialization.
Inside lists, only empty dicts and empty lists are filtered — None is
preserved so that positional arrays (e.g. customNumericColors,
customNumericLabels in grapher_config) keep their alignment, where None
means "fall back to default" at that index.
Parameters:
-
d(dict) –Dictionary to prune.
Returns:
-
dict–New dictionary with private keys and empty values removed.
Example
d = {
"title": "Dataset",
"_internal": "hidden",
"count": 0, # Kept (not empty)
"empty_list": [],
"chartTypes": [], # Kept — in KEEP_IF_EMPTY
"nested": {"value": 1, "null": None},
"positional": [None, None, "#bc8e5a"], # None preserved inside list
}
result = prune_dict(d)
# Returns: {"title": "Dataset", "count": 0, "chartTypes": [],
# "nested": {"value": 1}, "positional": [None, None, "#bc8e5a"]}
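The pruning rules described above can be sketched as follows. This is an illustration under assumptions, not the library's code: the contents of `KEEP_IF_EMPTY` are assumed from the example (only `chartTypes`), and the `_sketch` helper names are hypothetical.

```python
from typing import Any

# Assumption: the real KEEP_IF_EMPTY may list other or additional keys.
KEEP_IF_EMPTY = {"chartTypes"}


def is_empty_sketch(value: Any) -> bool:
    # 0, False, and "" are not treated as empty in this sketch.
    return value is None or value == [] or value == {}


def prune_value_sketch(value: Any) -> Any:
    if isinstance(value, dict):
        return prune_dict_sketch(value)
    if isinstance(value, list):
        pruned = [prune_value_sketch(item) for item in value]
        # Inside lists, drop only empty dicts and lists; None keeps its position.
        return [item for item in pruned if item != [] and item != {}]
    return value


def prune_dict_sketch(d: dict) -> dict:
    out = {}
    for key, value in d.items():
        if isinstance(key, str) and key.startswith("_"):
            continue  # private key
        value = prune_value_sketch(value)
        if is_empty_sketch(value) and key not in KEEP_IF_EMPTY:
            continue
        out[key] = value
    return out


result = prune_dict_sketch(
    {
        "title": "Dataset",
        "_internal": "hidden",
        "count": 0,
        "empty_list": [],
        "chartTypes": [],
        "nested": {"value": 1, "null": None},
        "positional": [None, None, "#bc8e5a"],
    }
)
print(result)
```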
Source code in lib/catalog/owid/catalog/core/utils.py
pruned_json
¶
Decorator that modifies a class's to_dict method to prune empty values.
Wraps a dataclass's to_dict method to automatically remove private fields
(starting with underscore) and empty values when serializing to JSON.
Parameters:
-
cls(T) –Dataclass to decorate.
Returns:
-
T–The same class with a modified to_dict method.
Example
from dataclasses import dataclass
from owid.catalog.utils import pruned_json
@pruned_json
@dataclass
class Config:
name: str
_internal: str = "hidden"
optional: str | None = None
config = Config(name="test", _internal="secret", optional=None)
d = config.to_dict()
# Returns: {"name": "test"} (no _internal or optional)
Note
This decorator is commonly used with metadata classes to keep JSON output clean by removing None values and private fields.
Source code in lib/catalog/owid/catalog/core/utils.py
remove_details_on_demand
¶
Remove details-on-demand references from markdown text.
Strips out special markdown links that reference details-on-demand content, keeping only the link text. This is useful for generating plain text versions of content that contains interactive elements.
Parameters:
-
text(str) –Markdown text containing details-on-demand references.
Returns:
-
str–Text with details-on-demand references removed, keeping only link text.
Example
text = "This is a [description](#dod:something) of the data."
result = remove_details_on_demand(text)
# Returns: "This is a description of the data."
Multiple references
Note
The regex pattern matches [text](#dod:keyword) and replaces it with just text.
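A minimal sketch of that substitution, assuming a simple pattern for `[text](#dod:keyword)`. The exact regular expression used by the library is not shown here, so the pattern and the name `remove_dod_sketch` are illustrative.

```python
import re

# Assumed pattern: capture the link text, drop the "(#dod:keyword)" target.
DOD_LINK = re.compile(r"\[([^\]]+)\]\(#dod:[^)]*\)")


def remove_dod_sketch(text: str) -> str:
    return DOD_LINK.sub(r"\1", text)


single = remove_dod_sketch("This is a [description](#dod:something) of the data.")
multiple = remove_dod_sketch("[GDP](#dod:gdp) is adjusted for [inflation](#dod:inflation).")
print(single)
print(multiple)
```

Ordinary markdown links that do not target `#dod:` are left untouched by this pattern.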
Source code in lib/catalog/owid/catalog/core/utils.py
underscore
¶
Convert arbitrary string to snake_case format.
Transforms strings into valid Python identifiers using snake_case convention. Handles special characters, punctuation, and optionally converts camelCase. Originally fine-tuned for World Bank WDI column names.
Parameters:
-
name(str | None) –String to format. Returns None if input is None.
-
validate(bool, default:True) –If True, validates the result is valid snake_case and raises NameError if not. Defaults to True.
-
camel_to_snake(bool, default:False) –If True, converts camelCase to snake_case before other transformations. Defaults to False.
Returns:
-
str | None–String in snake_case format, or None if input was None.
Raises:
-
NameError–If validate is True and the result is not valid snake_case.
Example
Warning
This function may evolve in the future. For critical use cases, either add tests or manually underscore your column names.
Source code in lib/catalog/owid/catalog/core/utils.py
underscore_table
¶
Convert column and index names to underscore format.
Warning
DEPRECATED: Use table.underscore() method instead. This function
exists only for backward compatibility.
Parameters:
-
t(Any) –Table object to underscore.
-
*args(Any, default:()) –Positional arguments passed to table.underscore().
-
**kwargs(Any, default:{}) –Keyword arguments passed to table.underscore().
Returns:
-
Any–Table with underscored column and index names.
Example
Deprecated usage
Preferred usage
Source code in lib/catalog/owid/catalog/core/utils.py
validate_underscore
¶
Validate that a name follows snake_case convention.
Parameters:
-
name(str | None) –String to validate. If None, validation is skipped.
-
object_name(str, default:'Name') –Name of the object being validated, used in error messages. Defaults to "Name".
Raises:
-
NameError–If name is not valid snake_case (lowercase letters, digits, and underscores only, must start with letter or underscore).
Example
Valid names pass silently
Invalid names raise NameError
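The rule stated under Raises (lowercase letters, digits, and underscores only, starting with a letter or underscore) can be sketched with a regular expression. The pattern, the error message, and the name `validate_underscore_sketch` are assumptions for illustration.

```python
import re
from typing import Optional

# Assumed pattern derived from the rule in the Raises section.
SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]*")


def validate_underscore_sketch(name: Optional[str], object_name: str = "Name") -> None:
    if name is None:
        return  # validation is skipped for None
    if SNAKE_CASE.fullmatch(name) is None:
        # Hypothetical message; the real wording may differ.
        raise NameError(f"{object_name} must be snake_case. Got: {name}")


validate_underscore_sketch("gdp_per_capita")  # passes silently

try:
    validate_underscore_sketch("GDP per capita", "Column")
except NameError as error:
    message = str(error)
    print(message)
```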
Source code in lib/catalog/owid/catalog/core/utils.py
owid.catalog.s3_utils
¶
Classes:
-
MissingCredentialsError–Raised when R2 credentials are not found.
-
UploadError–Raised when S3 upload or download operations fail.
Functions:
-
connect_r2–Create a connection to Cloudflare R2 storage.
-
connect_r2_cached–Create a cached, thread-safe connection to Cloudflare R2.
-
download–Download a file from S3 to local filesystem.
-
download_s3_folder–Download all files from an S3 folder to a local directory.
-
list_s3_objects–List all objects in an S3 folder.
-
s3_bucket_key–Extract bucket name and key from an S3 URL.
-
upload–Upload the file at the given local filename to the S3 URL.
MissingCredentialsError
¶
Bases: Exception
Raised when R2 credentials are not found.
This exception is raised when neither environment variables nor rclone configuration contain the required R2 credentials.
UploadError
¶
Bases: Exception
Raised when S3 upload or download operations fail.
This exception wraps boto3 ClientError exceptions that occur during S3 operations like upload, download, or file listing.
connect_r2
¶
Create a connection to Cloudflare R2 storage.
Creates a boto3 S3 client configured for Cloudflare R2. Credentials are loaded from environment variables or rclone configuration file.
Credential sources (in priority order):
- Environment variables: R2_ACCESS_KEY, R2_SECRET_KEY, R2_ENDPOINT, R2_REGION_NAME
- rclone config file: ~/.config/rclone/rclone.conf (section: owid-r2)
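The priority order above can be sketched with the standard library. Everything here is an assumption for illustration: the helper `resolve_r2_credentials_sketch`, the exception name, and the rclone option names (`access_key_id`, `secret_access_key`, `endpoint`, `region`, which are the usual option names for rclone S3 remotes). The real function also builds a boto3 client, which this sketch leaves out.

```python
import configparser
from typing import Mapping, Optional

ENV_VARS = {
    "access_key": "R2_ACCESS_KEY",
    "secret_key": "R2_SECRET_KEY",
    "endpoint": "R2_ENDPOINT",
    "region": "R2_REGION_NAME",
}
# Assumed rclone option names for the owid-r2 section.
RCLONE_OPTIONS = {
    "access_key": "access_key_id",
    "secret_key": "secret_access_key",
    "endpoint": "endpoint",
    "region": "region",
}


class MissingCredentialsSketch(Exception):
    """Hypothetical stand-in for MissingCredentialsError."""


def resolve_r2_credentials_sketch(env: Mapping[str, str], rclone_conf: Optional[str] = None) -> dict:
    # 1. Environment variables take priority.
    if "R2_ACCESS_KEY" in env and "R2_SECRET_KEY" in env:
        return {name: env.get(var) for name, var in ENV_VARS.items()}
    # 2. Fall back to the owid-r2 section of the rclone config text.
    if rclone_conf:
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(rclone_conf)
        if parser.has_section("owid-r2"):
            section = parser["owid-r2"]
            return {name: section.get(option) for name, option in RCLONE_OPTIONS.items()}
    raise MissingCredentialsSketch("R2 credentials not found")


conf_text = "[owid-r2]\ntype = s3\naccess_key_id = abc\nsecret_access_key = def\nendpoint = https://r2.example.invalid\n"
from_conf = resolve_r2_credentials_sketch({}, conf_text)
from_env = resolve_r2_credentials_sketch({"R2_ACCESS_KEY": "x", "R2_SECRET_KEY": "y"}, conf_text)
print(from_conf["access_key"], from_env["access_key"])
```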
Returns:
-
BaseClient–Boto3 S3 client configured for R2.
Example
Note
For cached connections that reuse the same client across calls, use
connect_r2_cached() instead. This is more efficient for multiple operations.
See Also
- connect_r2_cached(): Thread-safe cached version
- Cloudflare R2 docs: https://developers.cloudflare.com/r2/
Source code in lib/catalog/owid/catalog/s3_utils.py
connect_r2_cached
¶
Create a cached, thread-safe connection to Cloudflare R2.
Returns a cached R2 client that is reused across calls, which is more efficient than creating a new connection for every request. A lock makes the cache thread-safe.
Returns:
-
BaseClient–Cached boto3 S3 client configured for R2.
Example
Use cached connection for multiple operations
Note
The connection is cached indefinitely. If credentials change during runtime, the application needs to be restarted.
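The caching-with-a-lock idea can be sketched generically as below. `CachedClientSketch` and `fake_connect` are hypothetical; the real function caches a boto3 client, and its internals may differ.

```python
import threading
from typing import Any, Callable


class CachedClientSketch:
    """Create a client once and hand the same object to every caller."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._client: Any = None

    def get(self) -> Any:
        # The lock ensures two threads cannot both see "no client yet".
        with self._lock:
            if self._client is None:
                self._client = self._factory()
            return self._client


calls = []


def fake_connect() -> object:
    # Stand-in for an expensive connection setup.
    calls.append(1)
    return object()


cache = CachedClientSketch(fake_connect)
threads = [threading.Thread(target=cache.get) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(calls))
```

Since the cached object is never refreshed, changed credentials are only picked up by a new process, which matches the note above.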
See Also
connect_r2(): Non-cached version for one-time connections
Source code in lib/catalog/owid/catalog/s3_utils.py
download
¶
download(
s3_url: str,
filename: str,
quiet: bool = False,
client: BaseClient | None = None,
) -> None
Download a file from S3 to local filesystem.
Parameters:
-
s3_url(str) –S3 URL of the file to download (e.g., s3://bucket/path/file.csv).
-
filename(str) –Local path where the file should be saved.
-
quiet(bool, default:False) –If True, suppresses log messages. Defaults to False.
-
client(BaseClient | None, default:None) –Optional boto3 S3 client. If None, connects to R2 automatically.
Raises:
-
UploadError–If the download fails due to S3 client errors.
Example
Download a file
Download quietly (no logs)
Source code in lib/catalog/owid/catalog/s3_utils.py
download_s3_folder
¶
download_s3_folder(
s3_folder: str,
local_dir: Path,
exclude: list[str] = [],
include: list[str] = [],
client: BaseClient | None = None,
max_workers: int = 20,
delete: bool = False,
) -> None
Download all files from an S3 folder to a local directory.
Downloads all objects from an S3 folder using parallel threads for efficiency. Supports filtering with include/exclude patterns and optional deletion of local files not present in S3.
Parameters:
-
s3_folder(str) –S3 folder URL. Must end with a slash (e.g., s3://bucket/folder/).
-
local_dir(Path) –Local directory path where files will be downloaded.
-
exclude(list[str], default:[]) –List of patterns to exclude from download. Files containing any of these patterns will be skipped.
-
include(list[str], default:[]) –List of patterns to include in download. If specified, only files containing one of these patterns will be downloaded.
-
client(BaseClient | None, default:None) –Optional boto3 S3 client. If None, connects to R2 automatically.
-
max_workers(int, default:20) –Maximum number of parallel download threads. Defaults to 20.
-
delete(bool, default:False) –If True, deletes local files that don't exist in the S3 folder. Defaults to False.
Raises:
-
AssertionError–If s3_folder doesn't end with a slash.
-
UploadError–If any download fails.
Example
Download entire folder
Download only CSV files
Download and sync (delete local files not in S3)
Exclude backup files
Note
The local_dir is created automatically if it doesn't exist.
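The include/exclude filtering and the parallel download can be sketched without any S3 access. `select_keys_sketch`, `download_all_sketch`, and `fake_download` are hypothetical, and the substring matching follows the parameter descriptions above; how the real function combines the two filters is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence


def select_keys_sketch(keys: Iterable[str], include: Sequence[str] = (), exclude: Sequence[str] = ()) -> List[str]:
    selected = []
    for key in keys:
        # With include patterns, keep only keys containing at least one of them.
        if include and not any(pattern in key for pattern in include):
            continue
        # Skip keys containing any exclude pattern.
        if any(pattern in key for pattern in exclude):
            continue
        selected.append(key)
    return selected


def download_all_sketch(keys: Sequence[str], download_one: Callable[[str], str], max_workers: int = 20) -> List[str]:
    # executor.map preserves input order and re-raises worker exceptions.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(download_one, keys))


def fake_download(key: str) -> str:
    # Stand-in for a real download; returns the local path it would write.
    return f"local/{key}"


keys = ["data/a.csv", "data/a.csv.bak", "data/b.parquet", "data/c.csv"]
csv_keys = select_keys_sketch(keys, include=[".csv"], exclude=[".bak"])
paths = download_all_sketch(csv_keys, fake_download, max_workers=4)
print(paths)
```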
Source code in lib/catalog/owid/catalog/s3_utils.py
list_s3_objects
¶
List all objects in an S3 folder.
Recursively lists all objects within an S3 folder, handling pagination automatically. Excludes folder markers (keys ending with '/').
Parameters:
-
s3_folder(str) –S3 folder URL (e.g., s3://bucket/path/to/folder/).
-
client(BaseClient | None, default:None) –Optional boto3 S3 client. If None, connects to R2 automatically.
Returns:
Example
List all objects in a folder
Use custom client
Note
This function handles pagination automatically for folders with more than 1000 objects.
Source code in lib/catalog/owid/catalog/s3_utils.py
s3_bucket_key
¶
Extract bucket name and key from an S3 URL.
Parses both s3:// and https:// S3 URLs to extract the bucket name
and object key.
Parameters:
-
url(str) –S3 URL in either format:
- s3://bucket-name/path/to/object
- https://bucket-name.s3.region.amazonaws.com/path/to/object
Returns:
Example
Source code in lib/catalog/owid/catalog/s3_utils.py
upload
¶
upload(
s3_url: str,
filename: str | Path,
public: bool = False,
quiet: bool = False,
downloadable: bool = False,
) -> None
Upload the file at the given local filename to the S3 URL.
Parameters:
-
s3_url(str) –S3 URL to upload to
-
filename(str | Path) –Local file to upload
-
public(bool, default:False) –Whether to make the file publicly readable
-
quiet(bool, default:False) –Whether to suppress log messages
-
downloadable(bool, default:False) –If True, force browsers to download the file instead of displaying it inline. Sets Content-Disposition header to 'attachment; filename="..."'
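How the `public` and `downloadable` flags might translate into upload options can be sketched as follows. `ACL` and `ContentDisposition` are real keys accepted in boto3's upload `ExtraArgs`, but the helper `build_extra_args_sketch`, the use of the file's base name, and the mapping itself are assumptions rather than the library's code.

```python
from pathlib import Path
from typing import Dict, Union


def build_extra_args_sketch(filename: Union[str, Path], public: bool = False, downloadable: bool = False) -> Dict[str, str]:
    extra_args: Dict[str, str] = {}
    if public:
        # Canned ACL that makes the object publicly readable.
        extra_args["ACL"] = "public-read"
    if downloadable:
        # Ask browsers to download the file instead of displaying it inline.
        extra_args["ContentDisposition"] = f'attachment; filename="{Path(filename).name}"'
    return extra_args


extra = build_extra_args_sketch("exports/population.csv", public=True, downloadable=True)
print(extra)
```

A dictionary like this would typically be passed as `ExtraArgs` to a boto3 `upload_file` call.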