Omnipy v0.17.0 Release Notes
Pre-release. Release date: Nov 7, 2024
v0.17.0 is another major release of Omnipy, focused on features for dynamically building URLs
and loading datasets asynchronously from APIs. As a whole, the release is a major step towards
dependable communication with APIs and the ability to handle large datasets in a concurrent and
efficient manner.
New features in v0.17.0

- Dynamic building of URLs

  A new model, `HttpUrlModel`, has been added to support dynamic building of URLs from parts. It is
  more flexible than similar solutions in the standard Python library, Pydantic, or other
  libraries, supporting the following features:

  - All parts can be easily edited at any time, using built-in types such as `dict` and `Path`
  - Automatic data type conversion (generic Omnipy feature)
  - Continuous validation after each change (generic Omnipy feature)
  - Error recovery: revert to the last valid snapshot after an invalid change (generic Omnipy feature)
  - Whenever the `HttpUrlModel` object is converted to a string, e.g. by insertion into a
    `StrModel`/`StrDataset` or by being used to fetch data, the URL string is automatically
    constructed from the parts.
  - Builds on top of `Url` from `pydantic_core`, which provides basic validation, URL encoding, as
    well as punycode encoding of international domain names for increased security

  With `HttpUrlDataset`, dynamic URLs are scaled up to operate in batch mode, e.g. for building
  URLs for repeated API calls to be fetched concurrently and asynchronously.
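  The general idea of building a URL from editable parts can be sketched with the standard
  library alone. Note that this is a minimal illustration of the concept, not Omnipy's
  `HttpUrlModel` API; the part names used below (`scheme`, `host`, `path`, `query`) are
  illustrative, and the validation, type conversion, and snapshot features described above are
  not shown:

  ```python
  from urllib.parse import urlencode, urlunsplit

  # URL parts held in a plain dict, editable at any time before the
  # string is constructed (HttpUrlModel adds validation on each edit)
  parts = {
      'scheme': 'https',
      'host': 'api.example.com',
      'path': '/v1/records',
      'query': {'page': 1, 'per_page': 50},
  }

  # Edit a single part; the URL string is only built on demand
  parts['query']['page'] = 2

  url = urlunsplit((
      parts['scheme'],
      parts['host'],
      parts['path'],
      urlencode(parts['query']),  # 'page=2&per_page=50'
      '',  # no fragment
  ))
  print(url)  # https://api.example.com/v1/records?page=2&per_page=50
  ```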
- Dataset upgraded to support state info for per-item tasks

  To support per-item asynchronous tasks, the `Dataset` class has been upgraded to support state
  information for pending and failed tasks, on a per-item basis. This includes storing
  exceptions and other relevant info for each item that has failed or is pending. Dataset
  visualisation has been updated to relay this info to the user in a clear and concise way.

- Job modifier iterate_over_data_files now supports asynchronous iteration

  The `iterate_over_data_files` job modifier has been upgraded to support asynchronous iteration
  over data files. This allows for more efficient handling of large datasets, and is especially
  useful when combined with the new `Dataset` state information for pending and failed tasks
  (see item above).
- Automatic handling of asynchronous tasks based on runtime environment

  Through the new `auto_async` job modifier, Omnipy now automatically detects whether the code is
  being run in an asynchronous runtime environment, such as a Jupyter notebook, and adjusts the
  execution of asynchronous tasks accordingly:

  - Technically, if `auto_async` is set to `True` (the default), the existing event loop is
    detected and used to run an asynchronous Omnipy `Task` as an `asyncio.Task`, allowing tasks to
    be run in the background if run from, e.g., a Jupyter notebook.
  - If no event loop is detected, Omnipy will create a new event loop and close it after the task
    is finished, allowing the task to be run synchronously in a regular Python script or from the
    console.
  - The `auto_async` feature alleviates the complexity of running asynchronous tasks in different
    environments, and simplifies the combined use of asynchronous and synchronous tasks.

  Note 1: Omnipy does not yet support asynchronous flows, so asynchronous tasks currently need to
  be run independently.

  Note 2: `auto_async` does not support the opposite functionality, that is, running blocking
  synchronous tasks in the background in an asynchronous environment. This would require running
  the blocking tasks in threads, but Omnipy runtime objects (such as configs) are not (yet)
  thread-safe. Hence, synchronous tasks will block the event loop and any asynchronous tasks
  running on it.
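  The detection mechanism described above can be sketched with the standard library alone. This
  is an illustration of the general pattern, not Omnipy's implementation; the function name
  `run_auto` and the task body are hypothetical:

  ```python
  import asyncio

  async def my_task() -> int:
      # Stand-in for an asynchronous Omnipy task
      await asyncio.sleep(0)
      return 42

  def run_auto(coro):
      """Run a coroutine in the background if an event loop is already
      running (e.g. in a Jupyter notebook); otherwise create a fresh loop,
      run the coroutine to completion, and close the loop again."""
      try:
          asyncio.get_running_loop()
      except RuntimeError:
          # No running loop: synchronous context (script or console).
          # asyncio.run() creates a new event loop and closes it afterwards.
          return asyncio.run(coro)
      else:
          # Running loop found: schedule as a background asyncio.Task
          return asyncio.ensure_future(coro)

  result = run_auto(my_task())
  print(result)  # 42 when called from a synchronous context
  ```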
- Dataset now supports asynchronous loading of data from URLs

  The `Dataset` class has been upgraded to support asynchronous loading of data from URLs. This
  makes use of the new `HttpUrlDataset` class for building URLs, the new state information for
  failed and pending per-item tasks, and the asynchronous iteration over data files. The fetching
  is implemented in the new `get_*_from_api_endpoint` tasks (where `*` is `json`, `bytes`, or
  `str`), built on top of the asynchronous `aiohttp` library, and supports the following features:

  - Automatic retry of HTTP requests, building on the `aiohttp_retry` library. Retries are
    configurable to retry for particular HTTP response codes, to retry a specified number of
    times, and to use a specified algorithm to calculate the delay between retries.
  - Rate limiting of HTTP requests, building on the `aiolimiter` library. Rate limiting is
    configurable to limit the number of requests per time period and to specify the time period
    used for calculation, indirectly also controlling the burst size. Adding to what is provided
    by the `aiolimiter` library, Omnipy ensures that the maximum rate limit is not exceeded even
    for the initial burst of requests.
  - Automatic reset of rate limiter counting and delays for subsequent batches of requests
  - Retries and rate limiting are configured individually for each domain. Omnipy ensures that
    HTTP requests in the same batch (e.g. provided in the same `HttpUrlDataset`) are coordinated
    according to their domain.
  - The default values for retries and rate limiting are set to reasonable values, so that this
    functionality is provided seamlessly to the users. However, these default values can easily
    be changed if needed.
  - `Dataset.load()` now supports lists and dicts of paths or URLs (strings or `HttpUrlModel`
    objects) as input, as well as `HttpUrlDataset` objects.
  - Due to the asynchronous nature of the `get_*_from_api_endpoint` tasks, users in an
    asynchronous environment such as a Jupyter notebook can inspect the status of the download
    tasks while the download is in progress, by inspecting the `Dataset` object.
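  The rate-limiting behaviour described above can be illustrated with a small standard-library
  sketch. This is not Omnipy's or `aiolimiter`'s implementation: the class below simply spaces
  requests evenly, which is one (stricter-than-leaky-bucket) way to ensure that the maximum rate
  is never exceeded, even for the initial burst:

  ```python
  import asyncio
  import time

  class SimpleRateLimiter:
      """Minimal sketch: at most max_rate acquisitions per time_period
      seconds, enforced by spacing acquisitions evenly over time."""

      def __init__(self, max_rate: int, time_period: float = 1.0) -> None:
          self._interval = time_period / max_rate
          self._next_slot = 0.0
          self._lock = asyncio.Lock()

      async def acquire(self) -> None:
          async with self._lock:
              now = time.monotonic()
              wait = self._next_slot - now
              self._next_slot = max(now, self._next_slot) + self._interval
          if wait > 0:
              await asyncio.sleep(wait)

  async def fetch(limiter: SimpleRateLimiter, i: int) -> int:
      await limiter.acquire()
      return i  # stand-in for an actual HTTP request

  async def main() -> list[int]:
      # At most 20 requests per second; 5 concurrent requests get paced
      limiter = SimpleRateLimiter(max_rate=20, time_period=1.0)
      return await asyncio.gather(*(fetch(limiter, i) for i in range(5)))

  results = asyncio.run(main())
  print(results)  # [0, 1, 2, 3, 4]
  ```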
- Other new features / bug fixes / refactorings

  - Refactored `Model` and `Dataset` repr to make use of the IPython pretty printer. Drops support
    for the plain Python console for automatic pretty prints
  - Implemented `NestedSplitToItemsModel` and `NestedJoinItemsModel` for parsing nested structures
    of any level to/from strings (e.g. "param1=true&param2=42")
  - Implemented `MatchItemsModel`, which allows for filtering of items in a list based on a
    user-defined function
  - Implemented task `create_row_index_from_column()` and basic table datasets
  - Added support for optional fields in `PydanticRecordModel`
  - Fixed lack of `to_data()` conversion when importing mappings and iterators of models to a
    dataset
  - Refactored models and datasets for split and join, to reduce duplication and allow adjustment
    of params for all
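
The kind of multi-level parsing that `NestedSplitToItemsModel` performs can be sketched with
plain string splitting. The helper below is hypothetical and only illustrates the idea of
splitting on one delimiter per nesting level; Omnipy's models add validation and the reverse
join:

```python
def nested_split(text: str, delimiters: list[str]) -> list:
    """Recursively split a string on each delimiter in turn,
    producing one level of nesting per delimiter."""
    head, *rest = delimiters
    items = text.split(head)
    if not rest:
        return items
    return [nested_split(item, rest) for item in items]

parsed = nested_split('param1=true&param2=42', ['&', '='])
print(parsed)  # [['param1', 'true'], ['param2', '42']]
```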