Datautils Library¶
The owid-datautils library provides utility functions for common data processing tasks in ETL workflows.
You can install it via pip:
owid.datautils.dataframes
¶
Objects related to pandas dataframes.
Classes:
-
DataFramesHaveDifferentLengths–Dataframes cannot be compared because they have a different number of rows.
-
ObjectsAreNotDataframes–Given objects are not dataframes.
Functions:
-
apply_on_categoricals–Apply a function across multiple categorical Series efficiently.
-
are_equal–Check if two DataFrames are equal with detailed comparison report.
-
combine_two_overlapping_dataframes–Combine two DataFrames with overlapping columns, prioritizing the first.
-
compare–Compare two DataFrames element-wise for equality.
-
concatenate–Concatenate while preserving categorical columns.
-
count_missing_in_groups–Count the number of missing values in each group.
-
groupby_agg–Group DataFrame with intelligent NaN handling during aggregation.
-
has_index–Check if a DataFrame has a meaningful index.
-
map_series–Map Series values with performance optimization and flexible NaN handling.
-
multi_merge–Merge multiple DataFrames on common columns.
-
rename_categories–Alternative to pd.Series.cat.rename_categories which supports non-unique categories.
-
to_file–Save a dataframe in any format.
DataFramesHaveDifferentLengths
¶
ObjectsAreNotDataframes
¶
apply_on_categoricals
¶
Apply a function across multiple categorical Series efficiently.
High-performance operation that applies a function to categorical Series without converting to strings first. Uses category codes internally to prevent memory explosion and significantly improve speed.
Parameters:
-
cat_series(list[Series]) –List of Series with categorical dtype.
-
func(Callable[..., str]) –Function that takes N arguments (one per Series) and returns a string. Called for each unique combination of category codes.
Returns:
-
Series–New categorical Series with the function applied.
Example
import pandas as pd
from owid.datautils.dataframes import apply_on_categoricals
# Combine country and region categories
countries = pd.Series(["USA", "UK", "USA"], dtype="category")
regions = pd.Series(["Americas", "Europe", "Americas"], dtype="category")
# Concatenate with separator
result = apply_on_categoricals(
[countries, regions],
lambda c, r: f"{c} ({r})"
)
# Result: ["USA (Americas)", "UK (Europe)", "USA (Americas)"]
# Still categorical dtype, much faster than string operations
Note
This is significantly faster than converting categories to strings, especially for large DataFrames with repeated category values.
Source code in lib/datautils/owid/datautils/dataframes.py
are_equal
¶
are_equal(
df1: DataFrame,
df2: DataFrame,
absolute_tolerance: float = 1e-08,
relative_tolerance: float = 1e-08,
verbose: bool = True,
) -> tuple[bool, DataFrame]
Check if two DataFrames are equal with detailed comparison report.
Comprehensive equality check that compares structure, dtypes, and values with tolerance for floating-point numbers. Treats all NaN values as equal. Optionally prints a detailed summary of differences.
Parameters:
-
df1(DataFrame) –First DataFrame to compare.
-
df2(DataFrame) –Second DataFrame to compare.
-
absolute_tolerance(float, default:1e-08) –Maximum absolute difference for numeric equality: abs(a - b) <= absolute_tolerance.
-
relative_tolerance(float, default:1e-08) –Maximum relative difference for numeric equality: abs(a - b) / abs(b) <= relative_tolerance.
-
verbose(bool, default:True) –If True, print detailed comparison summary showing all differences found.
Returns:
-
tuple[bool, DataFrame]–Tuple of (equality_flag, comparison_dataframe) where: - equality_flag: True if DataFrames are equal within tolerance - comparison_dataframe: Boolean DataFrame showing element-wise equality. Empty if DataFrames have incompatible shapes.
Example
import pandas as pd
from owid.datautils.dataframes import are_equal
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
equal, comparison = are_equal(df1, df2, verbose=True)
# Prints: "Dataframes are identical..."
# Returns: (True, DataFrame of all True values)
df3 = pd.DataFrame({"a": [1, 2], "c": [5, 6]})
equal, comparison = are_equal(df1, df3, verbose=True)
# Prints differences: missing columns, etc.
# Returns: (False, DataFrame)
Source code in lib/datautils/owid/datautils/dataframes.py
combine_two_overlapping_dataframes
¶
combine_two_overlapping_dataframes(
df1: DataFrame,
df2: DataFrame,
index_columns: list[str] | None = None,
keep_column_order: bool = False,
) -> DataFrame
Combine two DataFrames with overlapping columns, prioritizing the first.
Intelligent merge that combines DataFrames with potentially identical columns, prioritizing values from df1 but filling its NaN values with data from df2. Avoids creating duplicate columns (e.g., "col_x", "col_y") that result from standard merges.
Why not use standard operations
- pd.merge(): Creates duplicate columns with "_x" and "_y" suffixes
- pd.concat() + drop_duplicates(): Would keep NaN values from df1 instead of filling them with df2 values
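For comparison, plain pandas has `DataFrame.combine_first`, which also fills gaps in one frame from another after aligning on the index. The sketch below reuses the data from the example further down; it is only an illustration of the fill-and-union behaviour and makes no claim about how `combine_two_overlapping_dataframes` is implemented.

```python
import pandas as pd

df1 = pd.DataFrame({
    "country": ["USA", "UK"],
    "gdp": [20, None],
    "population": [330, 67],
})
df2 = pd.DataFrame({
    "country": ["USA", "UK", "France"],
    "gdp": [21, 3, 2.7],
    "area": [9.8, 0.24, 0.64],
})

# Align both frames on "country", keep df1 values where present,
# and fall back to df2 where df1 has NaN or no row at all.
combined = df1.set_index("country").combine_first(df2.set_index("country"))

print(combined)
```

No "_x"/"_y" columns appear, and the UK row takes its GDP from df2 because df1 had NaN there.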
Parameters:
-
df1(DataFrame) –First DataFrame (higher priority for values).
-
df2(DataFrame) –Second DataFrame (used to fill NaN values in df1).
-
index_columns(list[str] | None, default:None) –Column names to use as index for alignment (e.g., ["country", "year"]). Must exist in both DataFrames as regular columns. If None, uses existing DataFrame indices.
-
keep_column_order(bool, default:False) –If True, preserve original column order (df1 columns first, then new df2 columns). If False, sort columns alphabetically.
Returns:
-
DataFrame–Combined DataFrame with union of rows and columns, prioritizing df1 values.
Example
import pandas as pd
from owid.datautils.dataframes import combine_two_overlapping_dataframes
df1 = pd.DataFrame({
"country": ["USA", "UK"],
"gdp": [20, None],
"population": [330, 67]
})
df2 = pd.DataFrame({
"country": ["USA", "UK", "France"],
"gdp": [21, 3, 2.7],
"area": [9.8, 0.24, 0.64]
})
result = combine_two_overlapping_dataframes(
df1, df2,
index_columns=["country"]
)
# country gdp population area
# 0 USA 20.0 330 9.80 # GDP from df1
# 1 UK 3.0 67 0.24 # GDP from df2 (was NaN in df1)
# 2 France 2.7 NaN 0.64 # New row from df2
Source code in lib/datautils/owid/datautils/dataframes.py
compare
¶
compare(
df1: DataFrame,
df2: DataFrame,
columns: list[str] | None = None,
absolute_tolerance: float = 1e-08,
relative_tolerance: float = 1e-08,
) -> DataFrame
Compare two DataFrames element-wise for equality.
Performs element-by-element comparison of two DataFrames, treating NaN values as equal and allowing tolerance for floating-point comparisons.
Parameters:
-
df1(DataFrame) –First DataFrame to compare.
-
df2(DataFrame) –Second DataFrame to compare.
-
columns(list[str] | None, default:None) –List of column names to compare (must exist in both DataFrames). If None, all common columns are compared.
-
absolute_tolerance(float, default:1e-08) –Maximum absolute difference allowed for values to be considered equal: abs(a - b) <= absolute_tolerance.
-
relative_tolerance(float, default:1e-08) –Maximum relative difference allowed for values to be considered equal: abs(a - b) / abs(b) <= relative_tolerance.
Returns:
-
DataFrame–DataFrame of booleans with the same shape as the comparison. Each element is True if the corresponding values in df1 and df2 are equal (within tolerance).
Raises:
-
ObjectsAreNotDataframes–If either input is not a DataFrame.
-
DataFramesHaveDifferentLengths–If DataFrames have different row counts.
Example
Note
DataFrames must have the same number of rows to be compared.
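The two tolerance criteria listed under Parameters can be sketched in plain Python. This is an illustration only, not the library's code: the helper names `within_absolute` and `within_relative` are hypothetical, the zero-denominator handling is a choice made for this sketch, and how `compare` combines the two checks is not specified here.

```python
import math


def within_absolute(a: float, b: float, absolute_tolerance: float = 1e-08) -> bool:
    """Absolute criterion: abs(a - b) <= absolute_tolerance."""
    if math.isnan(a) and math.isnan(b):
        return True  # Two NaN values are treated as equal.
    return abs(a - b) <= absolute_tolerance


def within_relative(a: float, b: float, relative_tolerance: float = 1e-08) -> bool:
    """Relative criterion: abs(a - b) / abs(b) <= relative_tolerance."""
    if math.isnan(a) and math.isnan(b):
        return True  # Two NaN values are treated as equal.
    if b == 0:
        return a == b  # Sketch-only choice to avoid dividing by zero.
    return abs(a - b) / abs(b) <= relative_tolerance


print(within_absolute(1.0, 1.0 + 1e-10))
print(within_relative(1000.0, 1000.001, relative_tolerance=1e-05))
print(within_relative(1000.0, 1000.001))
```

The last two calls differ only in the relative tolerance, which shows why loosening it matters for large values that differ by small amounts.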
Source code in lib/datautils/owid/datautils/dataframes.py
concatenate
¶
Concatenate while preserving categorical columns.
Original source code from https://stackoverflow.com/a/57809778/1275818.
Source code in lib/datautils/owid/datautils/dataframes.py
count_missing_in_groups
¶
Count the number of missing values in each group.
This is equivalent but faster than:
num_nans_detected = df.groupby(groupby_columns, **groupby_kwargs).agg(
lambda x: pd.isnull(x).sum()
)
Source code in lib/datautils/owid/datautils/dataframes.py
groupby_agg
¶
groupby_agg(
df: DataFrame,
groupby_columns: list[str] | str,
aggregations: dict[str, Any] | None = None,
num_allowed_nans: int | None = None,
frac_allowed_nans: float | None = None,
min_num_values: int | None = None,
) -> DataFrame
Group DataFrame with intelligent NaN handling during aggregation.
Enhanced version of pandas.DataFrame.groupby().agg() that provides control over
how NaN values are treated during aggregation. By default, pandas ignores NaNs,
which can produce misleading results (e.g., treating NaNs as zeros in sums).
This function supports weighted aggregations using the special syntax
mean_weighted_by_<column_name> for any aggregation.
Behavior
- When all NaN parameters are None: behaves like standard pandas groupby
- When any NaN parameter is set: applies sequential validation rules
NaN Handling Rules (applied in order):
1. If `num_allowed_nans` is set: group becomes NaN if it has more NaNs than this number
2. If `frac_allowed_nans` is set: group becomes NaN if NaN fraction exceeds threshold
3. If `min_num_values` is set: group becomes NaN if valid values < threshold
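The three rules above can be sketched for a single group in plain Python. The helper `sum_with_nan_rules` is hypothetical and only mirrors the rules as described on this page (including the "at least one NaN" condition given for `min_num_values` under Parameters); it is not the library's implementation.

```python
import math


def sum_with_nan_rules(
    values,
    num_allowed_nans=None,
    frac_allowed_nans=None,
    min_num_values=None,
):
    """Sum one group's values, returning NaN when a rule is violated."""
    valid = [v for v in values if not math.isnan(v)]
    num_nans = len(values) - len(valid)

    # Rule 1: too many NaNs in absolute terms.
    if num_allowed_nans is not None and num_nans > num_allowed_nans:
        return math.nan
    # Rule 2: too large a fraction of NaNs.
    if frac_allowed_nans is not None and values:
        if num_nans / len(values) > frac_allowed_nans:
            return math.nan
    # Rule 3: too few valid values (only relevant if the group has NaNs).
    if min_num_values is not None and num_nans > 0:
        if len(valid) < min_num_values:
            return math.nan
    # No rule triggered: NaNs are ignored, as in a standard pandas sum.
    return sum(valid)


nan = math.nan
print(sum_with_nan_rules([100.0, nan], min_num_values=1))
print(sum_with_nan_rules([1.0, nan, nan], num_allowed_nans=1))
print(sum_with_nan_rules([1.0, 2.0, nan, nan], frac_allowed_nans=0.25))
```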
Parameters:
-
df(DataFrame) –Source DataFrame to group and aggregate.
-
groupby_columns(list[str] | str) –Column name(s) to group by. Can be a single string or list.
-
aggregations(dict[str, Any] | None, default:None) –Dictionary mapping column names to aggregation functions. If None, applies 'sum' to all columns. Supports weighted means with syntax: {'col': 'mean_weighted_by_weight_col'}.
-
num_allowed_nans(int | None, default:None) –Maximum number of NaN values allowed in a group before the aggregate becomes NaN.
-
frac_allowed_nans(float | None, default:None) –Maximum fraction of NaN values allowed (0.0-1.0). Group becomes NaN if NaN fraction exceeds this threshold.
-
min_num_values(int | None, default:None) –Minimum number of non-NaN values required. Group becomes NaN if it has fewer valid values (and at least one NaN).
Returns:
-
DataFrame–Grouped and aggregated DataFrame with NaN handling applied.
Example
Basic groupby with NaN control
import pandas as pd
from owid.datautils.dataframes import groupby_agg
df = pd.DataFrame({
"country": ["USA", "USA", "UK", "UK"],
"year": [2020, 2021, 2020, 2021],
"value": [100, None, 200, 300]
})
# Standard pandas sum treats NaN as 0
# result = df.groupby("country").sum() # USA: 100
# With min_num_values=1, NaN if all values are NaN
result = groupby_agg(
df,
groupby_columns="country",
aggregations={"value": "sum"},
min_num_values=1
)
# USA: 100 (has 1 valid value), UK: 500 (has 2 valid values)
Weighted mean aggregation
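A minimal sketch of what a `mean_weighted_by_<column_name>` aggregation presumably computes, written in plain pandas. The column names and the helper `weighted_mean_by_group` are made up for illustration, NaN handling is left out, and the library's own result format may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "USA", "UK", "UK"],
    "gdp_per_capita": [10.0, 20.0, 30.0, 50.0],
    "population": [1.0, 3.0, 1.0, 1.0],
})


def weighted_mean_by_group(df, group_col, value_col, weight_col):
    """Per-group sum(value * weight) / sum(weight)."""
    weighted = df[value_col] * df[weight_col]
    numerator = weighted.groupby(df[group_col]).sum()
    denominator = df[weight_col].groupby(df[group_col]).sum()
    return numerator / denominator


result = weighted_mean_by_group(df, "country", "gdp_per_capita", "population")
print(result)
```

With these numbers the USA value is (10 * 1 + 20 * 3) / 4 = 17.5, whereas an unweighted mean would give 15.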
Note
Does not support multiple aggregations for the same column
(e.g., {'a': ('sum', 'mean')}).
Source code in lib/datautils/owid/datautils/dataframes.py
has_index
¶
has_index(df: DataFrame) -> bool
Check if a DataFrame has a meaningful index.
Determines whether a DataFrame has an actual index set, or just the default dummy integer index created by pandas.
Parameters:
-
df(DataFrame) –DataFrame to check for index.
Returns:
-
bool–True if DataFrame has a non-dummy (single or multi) index, False otherwise.
Example
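One way to tell a dummy index from a meaningful one is to check whether any index level has a name. The sketch below uses that criterion as an assumption; the helper `has_named_index` is hypothetical and the library's `has_index` may apply a different test.

```python
import pandas as pd


def has_named_index(df: pd.DataFrame) -> bool:
    # Assumption for this sketch: the default integer index has no name,
    # while an index set via set_index() carries the column name(s).
    return any(name is not None for name in df.index.names)


df = pd.DataFrame({
    "country": ["USA", "UK"],
    "year": [2020, 2020],
    "value": [1, 2],
})

plain = has_named_index(df)
single = has_named_index(df.set_index("country"))
multi = has_named_index(df.set_index(["country", "year"]))
print(plain, single, multi)
```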
Source code in lib/datautils/owid/datautils/dataframes.py
map_series
¶
map_series(
series: Series,
mapping: dict[Any, Any],
make_unmapped_values_nan: bool = False,
warn_on_missing_mappings: bool = False,
warn_on_unused_mappings: bool = False,
show_full_warning: bool = False,
) -> Series
Map Series values with performance optimization and flexible NaN handling.
Enhanced version of pandas.Series.map() that:
- Preserves unmapped values instead of converting to NaN (optional)
- Much faster than Series.replace() for large DataFrames
- Supports categorical Series with automatic category management
- Provides warnings for missing or unused mappings
Behavior differences from pandas.Series.map():
- Default: unmapped values keep original values (not NaN)
- With `make_unmapped_values_nan=True`: same as `Series.map()`
Parameters:
-
series(Series) –Series to map values from.
-
mapping(dict[Any, Any]) –Dictionary mapping old values to new values.
-
make_unmapped_values_nan(bool, default:False) –If True, unmapped values become NaN. If False, they retain original values.
-
warn_on_missing_mappings(bool, default:False) –If True, warn about values in Series that don't exist in mapping.
-
warn_on_unused_mappings(bool, default:False) –If True, warn about mapping entries not used by any value in Series.
-
show_full_warning(bool, default:False) –If True, print full list of missing/unused values in warnings.
Returns:
-
Series–Series with mapped values.
Example
Basic mapping
import pandas as pd
from owid.datautils.dataframes import map_series
series = pd.Series(["usa", "uk", "france"])
mapping = {"usa": "United States", "uk": "United Kingdom"}
# Default: unmapped values preserved
result = map_series(series, mapping)
# ["United States", "United Kingdom", "france"]
# With NaN for unmapped
result = map_series(series, mapping, make_unmapped_values_nan=True)
# ["United States", "United Kingdom", NaN]
With warnings
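The warning options can be illustrated with a plain-Python sketch that works on a list instead of a Series. The helper `map_values` is hypothetical and the warning texts are invented; only the parameter names and their described behaviour come from this page.

```python
import math
import warnings


def map_values(
    values,
    mapping,
    make_unmapped_values_nan=False,
    warn_on_missing_mappings=False,
    warn_on_unused_mappings=False,
):
    """Map values through a dict, optionally warning about mismatches."""
    missing = sorted({v for v in values if v not in mapping})
    unused = sorted(set(mapping) - set(values))
    if warn_on_missing_mappings and missing:
        warnings.warn(f"{len(missing)} value(s) missing in mapping: {missing}")
    if warn_on_unused_mappings and unused:
        warnings.warn(f"{len(unused)} unused mapping entries: {unused}")
    # Unmapped values either keep their original value or become NaN.
    return [
        mapping[v] if v in mapping
        else (math.nan if make_unmapped_values_nan else v)
        for v in values
    ]


values = ["usa", "uk", "france"]
mapping = {"usa": "United States", "uk": "United Kingdom", "de": "Germany"}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = map_values(
        values,
        mapping,
        warn_on_missing_mappings=True,
        warn_on_unused_mappings=True,
    )

print(result)
print([str(w.message) for w in caught])
```

Here "france" triggers the missing-mapping warning and "de" triggers the unused-mapping warning.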
Source code in lib/datautils/owid/datautils/dataframes.py
multi_merge
¶
Merge multiple DataFrames on common columns.
Convenience function for merging more than two DataFrames sequentially.
Equivalent to chaining multiple pd.merge() calls.
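The chaining described above can be written with `functools.reduce` and `pd.merge`. This is a sketch of the equivalence, not the library's source; the name `chained_merge` is hypothetical.

```python
from functools import reduce

import pandas as pd


def chained_merge(dfs, on, how="inner"):
    """Merge a list of DataFrames pairwise, left to right."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=on, how=how),
        dfs,
    )


df1 = pd.DataFrame({"country": ["USA", "UK"], "gdp": [20, 3]})
df2 = pd.DataFrame({"country": ["USA", "UK"], "pop": [330, 67]})
df3 = pd.DataFrame({"country": ["USA", "UK"], "area": [9.8, 0.24]})

merged = chained_merge([df1, df2, df3], on="country")
print(merged)
```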
Parameters:
-
dfs(list[DataFrame]) –List of DataFrames to merge.
-
on(list[str] | str) –Column name(s) to merge on. Must exist in all DataFrames with the same name.
-
how(str, default:'inner') –Type of merge to perform. Options: 'inner', 'outer', 'left', 'right'. Default is 'inner'.
Returns:
-
DataFrame–Merged DataFrame containing all input DataFrames joined on specified columns.
Example
import pandas as pd
from owid.datautils.dataframes import multi_merge
df1 = pd.DataFrame({"country": ["USA", "UK"], "gdp": [20, 3]})
df2 = pd.DataFrame({"country": ["USA", "UK"], "pop": [330, 67]})
df3 = pd.DataFrame({"country": ["USA", "UK"], "area": [9.8, 0.24]})
result = multi_merge([df1, df2, df3], on="country")
# country gdp pop area
# 0 USA 20 330 9.80
# 1 UK 3 67 0.24
Source code in lib/datautils/owid/datautils/dataframes.py
rename_categories
¶
Alternative to pd.Series.cat.rename_categories which supports non-unique categories.
We do that by replacing non-unique categories first and then mapping with unique categories. Unused categories are removed during the process. It should be as fast as pd.Series.cat.rename_categories if there are no non-unique categories.
Source code in lib/datautils/owid/datautils/dataframes.py
to_file
¶
Save a dataframe in any format.
Will be deprecated. Use owid.datautils.io.df.to_file instead.
Source code in lib/datautils/owid/datautils/dataframes.py
owid.datautils.io
¶
Input/Output methods.
Modules:
-
archive–Input/Output functions for local files.
-
df–DataFrame io operations.
-
json–Input/Output functions for local files.
Functions:
-
decompress_file–Extract a zip or tar file.
-
df_from_file–Load a file as a pandas DataFrame with URL and compression support.
-
df_to_file–Save DataFrame to file with automatic format detection and smart defaults.
-
load_json–Load data from JSON file with optional duplicate key detection.
-
save_json–Save data to a JSON file.
decompress_file
¶
decompress_file(
input_file: str | Path,
output_folder: str | Path,
overwrite: bool = False,
) -> None
Extract a zip or tar file.
It can be a local or a remote file.
Parameters:
-
input_file(str | Path) –Path to local zip file, or URL of a remote zip file.
-
output_folder(str | Path) –Path to local folder.
-
overwrite(bool, default:False) –Overwrite decompressed content if it already exists (otherwise raises an error if content already exists).
Source code in lib/datautils/owid/datautils/io/archive.py
df_from_file
¶
df_from_file(
file_path: str | Path,
file_type: str | None = None,
**kwargs: Any,
) -> DataFrame | list[DataFrame]
Load a file as a pandas DataFrame with URL and compression support.
Enhanced wrapper around pandas read_* functions that adds:
- Automatic format detection from file extension
- URL download support (via @enable_file_download decorator)
- Compressed file reading (with explicit file_type)
The function infers the file type from the extension after the last dot. For example: "file.csv" reads as CSV, "https://example.com/data.xlsx" reads as Excel.
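The extension rule can be sketched as follows. The helper `infer_file_type` is hypothetical and the error messages are invented; the set of compression extensions is taken from the Note further down, and the real function may treat edge cases differently.

```python
COMPRESSION_EXTENSIONS = {"gz", "bz2", "zip", "xz", "zst", "tar"}


def infer_file_type(file_path, file_type=None):
    """Return the format implied by the text after the last dot."""
    name = str(file_path)
    if "." not in name:
        raise ValueError(f"Cannot infer file type of {name!r}")
    extension = name.rsplit(".", 1)[1].lower()
    if extension in COMPRESSION_EXTENSIONS:
        # The extension only names the compression, so the content
        # format has to be given explicitly.
        if file_type is None:
            raise ValueError("file_type is required for compressed files")
        return file_type
    return extension


print(infer_file_type("file.csv"))
print(infer_file_type("https://example.com/data.xlsx"))
print(infer_file_type("data.csv.gz", file_type="csv"))
```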
Parameters:
-
file_path(str | Path) –Local path or URL to the file. Supports local files and HTTP(S) URLs.
-
file_type(str | None, default:None) –Explicit file type when reading compressed files (e.g., "csv", "dta", "json"). Only needed for compressed files. Specifies the format of the compressed content, not the compression format itself.
-
**kwargs(Any, default:{}) –Additional arguments passed to the underlying pandas.read_* function.
Returns:
-
DataFrame | list[DataFrame]–DataFrame loaded from the file. Some formats (like HTML) may return a list of DataFrames.
Raises:
-
ValueError–If file extension is unknown or file_type is not provided for compressed files.
-
FileNotFoundError–If the file path doesn't exist.
Example
Load from local file
from owid.datautils.io.df import from_file
# CSV file
df = from_file("data.csv")
# Excel with specific sheet
df = from_file("data.xlsx", sheet_name="Sheet1")
Load from URL
Load compressed file
Note
Supported formats: csv, dta, feather, hdf, html, json, parquet, pickle, xlsx, xml.
Compression formats: gz, bz2, zip, xz, zst, tar (with explicit file_type).
Source code in lib/datautils/owid/datautils/io/df.py
df_to_file
¶
Save DataFrame to file with automatic format detection and smart defaults.
Enhanced wrapper around pandas to_* methods that provides:
- Automatic format selection from file extension
- Auto-creation of parent directories
- Intelligent index handling (omits dummy indices)
- Optional overwrite protection
Format is determined by file extension: "data.csv" creates CSV, "data.parquet" creates Parquet, etc.
Parameters:
-
df(DataFrame) –DataFrame to save.
-
file_path(str | Path) –Output file path. Parent directories are created if needed.
-
overwrite(bool, default:True) –If True, overwrite existing files. If False, raise error if file already exists.
-
**kwargs(Any, default:{}) –Additional arguments passed to the underlying pandas.to_* method (e.g., na_rep, sep, compression).
Raises:
-
ValueError–If file extension is not supported.
-
FileExistsError–If file exists and overwrite=False.
Example
Basic usage
from owid.datautils.io.df import to_file
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# Save as CSV
to_file(df, "output.csv")
# Save as Parquet
to_file(df, "output.parquet")
# Save with custom parameters
to_file(df, "output.csv", na_rep="N/A", sep=";")
Auto-create directories
Overwrite protection
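Both behaviours (creating parent directories and refusing to overwrite) can be sketched with the standard library alone. The helper `write_text_safely` is hypothetical and writes plain text rather than a DataFrame; it only mirrors the behaviour described for `overwrite` and `file_path` on this page.

```python
import tempfile
from pathlib import Path


def write_text_safely(text, file_path, overwrite=True):
    """Write text to file_path, creating parent folders as needed."""
    path = Path(file_path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} already exists and overwrite=False")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "folder" / "output.csv"

    # The "nested/folder" directories do not exist yet and are created.
    write_text_safely("a,b\n1,3\n2,4\n", target)
    created = target.exists()

    # A second write with overwrite=False is rejected.
    try:
        write_text_safely("a,b\n", target, overwrite=False)
        blocked = False
    except FileExistsError:
        blocked = True

print(created, blocked)
```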
Note
Supported formats: csv, dta, feather, hdf, html, json, md, parquet, pickle, tex, txt, xlsx, xml.
Index handling: Automatically omits dummy indices (default integer index)
but preserves meaningful indices. Override with index=True/False in kwargs.
Source code in lib/datautils/owid/datautils/io/df.py
load_json
¶
Load data from JSON file with optional duplicate key detection.
If the JSON file contains duplicated keys, a warning is optionally raised, and only the value of the latest duplicated key is kept.
Parameters:
-
json_file(str | Path) –Path to JSON file. Supports local files and URLs (via decorator).
-
warn_on_duplicated_keys(bool, default:True) –If True, warn about duplicate keys in JSON file.
Returns:
-
Any–Data loaded from JSON file (typically a dict or list).
Example
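Duplicate-key detection of this kind is commonly done with the `object_pairs_hook` argument of the standard `json` module. The sketch below parses a string rather than a file; `load_json_text` is a hypothetical helper and the warning text is invented, so treat it as an illustration of the described behaviour rather than the library's code.

```python
import json
import warnings


def load_json_text(text, warn_on_duplicated_keys=True):
    """Parse JSON text, optionally warning about duplicated keys."""

    def check_pairs(pairs):
        keys = [key for key, _ in pairs]
        duplicated = sorted({key for key in keys if keys.count(key) > 1})
        if duplicated and warn_on_duplicated_keys:
            warnings.warn(f"Duplicated JSON keys: {duplicated}")
        # dict() keeps the value of the last occurrence of each key.
        return dict(pairs)

    return json.loads(text, object_pairs_hook=check_pairs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    data = load_json_text('{"a": 1, "b": 2, "a": 3}')

print(data)
print(len(caught))
```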
Source code in lib/datautils/owid/datautils/io/json.py
save_json
¶
Save data to a JSON file.
Parameters:
-
data(Any) –Data to be stored in JSON file (typically a dict or list).
-
json_file(str | Path) –Path to output JSON file. Parent directories are created if needed.
-
**kwargs(Any, default:{}) –Additional keyword arguments for json.dump() (e.g., indent=4, sort_keys=True).
Example
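A minimal sketch of the described behaviour using only the standard library: parent directories are created and extra keyword arguments are forwarded to `json.dump()`. The name `save_json_sketch` is hypothetical and this is not the library's source.

```python
import json
import tempfile
from pathlib import Path


def save_json_sketch(data, json_file, **kwargs):
    """Write data as JSON, creating parent folders if needed."""
    path = Path(json_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "exports" / "data.json"
    save_json_sketch({"b": 2, "a": 1}, target, indent=4, sort_keys=True)
    written = target.read_text()
    reloaded = json.loads(written)

print(written)
```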
Source code in lib/datautils/owid/datautils/io/json.py
owid.datautils.google
¶
Google utils.
Modules:
-
api–Google API class.
-
config–Google configuration functions.
-
sheets–Google Sheet utils.
Classes:
-
GoogleApi–API for Google Drive.
GoogleApi
¶
GoogleApi(clients_secrets_file: str | None = None)
API for Google Drive.
Initialise Google API.
To obtain client_secrets_file, follow the instructions from:
https://medium.com/analytics-vidhya/how-to-connect-google-drive-to-python-using-pydrive-9681b2a14f20
Note
- Additionally, make sure to add yourself in Test users, as noted in: https://stackoverflow.com/questions/65980758/pydrive-quickstart-and-error-403-access-denied
- Select Desktop App instead of Web Application as the application type.
Parameters:
-
clients_secrets_file(str | None, default:None) –Path to client_secrets file.
Example
First time calling the function should look similar to:
New calls can then be made as follows:
Methods:
-
download_file–Download a file from Google Drive.
-
download_folder–Download a folder from Google Drive.
-
list_files–List files in a Google Drive folder.
Attributes:
-
drive(GoogleDrive) –Google Drive object.
Source code in lib/datautils/owid/datautils/google/api.py
download_file
classmethod
¶
download_file(
output: str,
url: str | None = None,
file_id: str | None = None,
quiet: bool = True,
**kwargs: Any,
) -> None
Download a file from Google Drive.
The file must be public, otherwise this function won't work.
Parameters:
-
output(str) –Local path to save the downloaded file.
-
url(str | None, default:None) –URL to the file on Google Drive (the file must be public).
-
file_id(str | None, default:None) –ID of the file on Google Drive (the file must be public), by default None.
-
quiet(bool, default:True) –Suppress terminal output. Default is True.
Raises:
-
ValueError–If neither url nor file_id is provided.
Source code in lib/datautils/owid/datautils/google/api.py
download_folder
classmethod
¶
Download a folder from Google Drive.
The folder must be public, otherwise this function won't work.
Parameters:
-
url(str) –URL to the folder on Google Drive (must be public).
-
output(str) –Local path to save the downloaded folder.
-
quiet(bool, default:True) –Suppress terminal output. Default is True.
Source code in lib/datautils/owid/datautils/google/api.py
list_files
¶
list_files(parent_id: str) -> GoogleDriveFileList
List files in a Google Drive folder.
Parameters:
-
parent_id(str) –Google Drive folder ID.
Returns:
-
GoogleDriveFileList–List of files in the folder.
Source code in lib/datautils/owid/datautils/google/api.py
owid.datautils.format
¶
Utils for the processing of different data formats.
Modules:
-
numbers–Numeric formatting.
Functions:
-
format_number–Format number string to integer.
format_number
¶
Format number string to integer.
Only supports integer conversion. Handles various formats including separators and numeric words.
Parameters:
Returns:
-
int–Formatted number as integer.
Example
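The parameters and accepted formats of `format_number` are not listed on this page, so the sketch below is purely illustrative. It shows one way a string with thousands separators or a numeric word could be turned into an integer; the helper `parse_int`, the comma separator, and the supported words are all assumptions.

```python
import re

# Assumed set of numeric words for this sketch.
MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def parse_int(text: str) -> int:
    """Convert strings such as "1,234" or "2 million" to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]*)", cleaned)
    if match is None:
        raise ValueError(f"Cannot parse {text!r}")
    number, word = match.groups()
    if word and word not in MULTIPLIERS:
        raise ValueError(f"Unknown numeric word in {text!r}")
    value = float(number) * MULTIPLIERS.get(word, 1)
    if value != int(value):
        raise ValueError(f"{text!r} is not an integer")
    return int(value)


print(parse_int("1,234"))
print(parse_int("2 million"))
print(parse_int("1.5 million"))
```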
Source code in lib/datautils/owid/datautils/format/numbers.py
owid.datautils.decorators
¶
Library decorators.
Functions:
-
enable_file_download–Decorator that allows functions expecting local file paths to accept URLs and S3 paths.
enable_file_download
¶
Decorator that allows functions expecting local file paths to accept URLs and S3 paths.
This decorator automatically downloads remote files to temporary storage before calling the decorated function, making any file-processing function work transparently with:
- Local file paths (unchanged behavior)
- HTTP/HTTPS URLs (downloaded via web request)
- S3 paths (downloaded via S3 client)
Parameters:
-
path_arg_name(str | None, default:None) –Name of the parameter containing the file path. If None, uses the first positional argument.
Example
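from owid.datautils.decorators import enable_file_download
# Wrap a function that expects a local path so it also accepts remote paths.
@enable_file_download(path_arg_name="file_path")
def load_data(file_path):
    with open(file_path, 'r') as f:
        return f.read()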
# Now works with all of these:
load_data("/local/file.txt") # Local file
load_data("https://example.com/data.txt") # HTTP download
load_data("s3://bucket/data.txt") # S3 download
Warning
Downloads entire files to temporary storage on every call. For large files or frequent access, consider explicit caching or streaming approaches.