xbatcher: Batch Generation from Xarray Datasets
Xbatcher is a small library for iterating over xarray DataArrays and Datasets in batches. The goal is to make it easy to feed xarray data to machine learning libraries such as Keras.
Installation
Xbatcher can be installed from PyPI as:
pip install xbatcher
Or via Conda as:
conda install -c conda-forge xbatcher
Or from source as:
pip install git+https://github.com/pangeo-data/xbatcher.git
Basic Usage
Let’s say we have an xarray DataArray:
In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: da = xr.DataArray(np.random.rand(1000, 100, 100), name='foo',
   ...:                   dims=['time', 'y', 'x']).chunk({'time': 1})
   ...:
In [4]: da
Out[4]:
<xarray.DataArray 'foo' (time: 1000, y: 100, x: 100)>
dask.array<xarray-<this-array>, shape=(1000, 100, 100), dtype=float64, chunksize=(1, 100, 100), chunktype=numpy.ndarray>
Dimensions without coordinates: time, y, x
and we want to create batches of 10 steps along the time dimension. We can do this with BatchGenerator:
In [5]: import xbatcher
In [6]: bgen = xbatcher.BatchGenerator(da, {'time': 10})
In [7]: for batch in bgen:
   ...:     pass
   ...: batch
   ...:
Out[7]:
<xarray.Dataset>
Dimensions: (time: 10, sample: 10000)
Coordinates:
* sample (sample) MultiIndex
- y (sample) int64 0 0 0 0 0 0 0 0 0 0 ... 99 99 99 99 99 99 99 99 99
- x (sample) int64 0 1 2 3 4 5 6 7 8 9 ... 91 92 93 94 95 96 97 98 99
Dimensions without coordinates: time
Data variables:
foo (sample, time) float64 0.8635 0.4634 0.734 ... 0.2086 0.7433 0.7284
or via the batch accessor, which xbatcher registers on xarray objects when it is imported:
In [8]: import xbatcher
In [9]: for batch in da.batch.generator({'time': 10}):
   ...:     pass
   ...: batch
   ...:
Out[9]:
<xarray.Dataset>
Dimensions: (time: 10, sample: 10000)
Coordinates:
* sample (sample) MultiIndex
- y (sample) int64 0 0 0 0 0 0 0 0 0 0 ... 99 99 99 99 99 99 99 99 99
- x (sample) int64 0 1 2 3 4 5 6 7 8 9 ... 91 92 93 94 95 96 97 98 99
Dimensions without coordinates: time
Data variables:
foo (sample, time) float64 0.8635 0.4634 0.734 ... 0.2086 0.7433 0.7284
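To build intuition for what batching along a dimension means, here is a minimal sketch in plain Python that uses no xarray or xbatcher code. The helper name batch_slices is hypothetical and is not part of xbatcher. It assumes batches are consecutive, non-overlapping, and that a trailing partial batch is discarded. Under those assumptions, a time dimension of length 1000 with a batch size of 10 yields 100 index ranges.

```python
from typing import Iterator


def batch_slices(dim_size: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive, non-overlapping slices that cover only full batches.

    Hypothetical helper for illustration; not part of xbatcher.
    """
    # Stop early enough that every slice has exactly batch_size elements;
    # a trailing partial batch is dropped (an assumption of this sketch).
    for start in range(0, dim_size - batch_size + 1, batch_size):
        yield slice(start, start + batch_size)


# Mirror the example above: 1000 time steps, batches of 10.
slices = list(batch_slices(dim_size=1000, batch_size=10))
print(len(slices))   # 100
print(slices[0])     # slice(0, 10, None)
print(slices[-1])    # slice(990, 1000, None)
```

Each slice could be used to index the time axis of an array, for example with da.isel(time=s) on an xarray object. Whether partial batches are kept or dropped is a design choice; this sketch drops them so that every batch has the same shape.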