xbatcher: Batch Generation from Xarray Datasets
Xbatcher is a small library for iterating over Xarray DataArrays and Datasets in batches. The goal is to make it easy to feed Xarray objects to machine learning libraries such as Keras.
Installation
Xbatcher can be installed from PyPI as:
python -m pip install xbatcher
Or via Conda as:
conda install -c conda-forge xbatcher
Or from source as:
python -m pip install git+https://github.com/xarray-contrib/xbatcher.git
Optional Dependencies
Note
Xbatcher's required dependencies are Xarray, Dask, and NumPy, and they are installed along with it. To use the TensorFlow or PyTorch data loaders or the corresponding Xarray accessors, you need to install TensorFlow or PyTorch separately.
To install Xbatcher and PyTorch via Conda:
conda install -c conda-forge xbatcher pytorch
Or via PyPI:
python -m pip install "xbatcher[torch]"
To install Xbatcher and TensorFlow via Conda:
conda install -c conda-forge xbatcher tensorflow
Or via PyPI:
python -m pip install "xbatcher[tensorflow]"
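Before relying on one of the optional data loaders, it can help to confirm that its backend is importable in the current environment. The following is a minimal sketch that uses only the Python standard library; the helper name `available_backends` is hypothetical and not part of Xbatcher.

```python
from importlib.util import find_spec

# Import names of the optional backends mentioned above.
OPTIONAL_BACKENDS = ("torch", "tensorflow")


def available_backends(names=OPTIONAL_BACKENDS):
    """Map each module name to True if it can be found in this environment.

    Hypothetical helper for illustration; it is not part of Xbatcher.
    """
    # find_spec returns None when a top-level module cannot be located.
    return {name: find_spec(name) is not None for name in names}


for name, present in available_backends().items():
    status = "available" if present else "not installed"
    print(f"{name}: {status}")
```

This only checks that the module can be located, not that its version is compatible with Xbatcher.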
Basic Usage
Let’s say we have an Xarray DataArray:
In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: da = xr.DataArray(np.random.rand(1000, 100, 100), name='foo',
...: dims=['time', 'y', 'x']).chunk({'time': 1})
...:
In [4]: da
Out[4]:
<xarray.DataArray 'foo' (time: 1000, y: 100, x: 100)> Size: 80MB
dask.array<xarray-<this-array>, shape=(1000, 100, 100), dtype=float64, chunksize=(1, 100, 100), chunktype=numpy.ndarray>
Dimensions without coordinates: time, y, x
and we want to create batches of 10 steps along the time dimension. We can do it like this:
In [5]: import xbatcher
In [6]: bgen = xbatcher.BatchGenerator(da, {'time': 10})
In [7]: for batch in bgen:
...:     pass
...: batch
...:
Out[7]:
<xarray.DataArray 'foo' (sample: 10000, time: 10)> Size: 800kB
array([[0.35889184, 0.18149453, 0.11764907, ..., 0.31643258, 0.80570048,
0.11053592],
[0.14407245, 0.762631 , 0.01199226, ..., 0.47529933, 0.72232368,
0.11563477],
[0.52798741, 0.02816382, 0.64851972, ..., 0.24144177, 0.65515846,
0.22074101],
...,
[0.91672048, 0.9725373 , 0.18369454, ..., 0.58539207, 0.60820327,
0.5484866 ],
[0.68396518, 0.48891789, 0.04829527, ..., 0.38500821, 0.36687758,
0.34340475],
[0.0905384 , 0.54641467, 0.25411867, ..., 0.92374025, 0.89527988,
0.19894906]])
Coordinates:
* sample (sample) object 80kB MultiIndex
* y (sample) int64 80kB 0 0 0 0 0 0 0 0 0 ... 99 99 99 99 99 99 99 99
* x (sample) int64 80kB 0 1 2 3 4 5 6 7 8 ... 92 93 94 95 96 97 98 99
Dimensions without coordinates: time
Or, equivalently, via the Xarray accessor that Xbatcher registers when it is imported:
In [8]: import xbatcher
In [9]: for batch in da.batch.generator({'time': 10}):
...:     pass
...: batch
...:
Out[9]:
<xarray.DataArray 'foo' (sample: 10000, time: 10)> Size: 800kB
array([[0.35889184, 0.18149453, 0.11764907, ..., 0.31643258, 0.80570048,
0.11053592],
[0.14407245, 0.762631 , 0.01199226, ..., 0.47529933, 0.72232368,
0.11563477],
[0.52798741, 0.02816382, 0.64851972, ..., 0.24144177, 0.65515846,
0.22074101],
...,
[0.91672048, 0.9725373 , 0.18369454, ..., 0.58539207, 0.60820327,
0.5484866 ],
[0.68396518, 0.48891789, 0.04829527, ..., 0.38500821, 0.36687758,
0.34340475],
[0.0905384 , 0.54641467, 0.25411867, ..., 0.92374025, 0.89527988,
0.19894906]])
Coordinates:
* sample (sample) object 80kB MultiIndex
* y (sample) int64 80kB 0 0 0 0 0 0 0 0 0 ... 99 99 99 99 99 99 99 99
* x (sample) int64 80kB 0 1 2 3 4 5 6 7 8 ... 92 93 94 95 96 97 98 99
Dimensions without coordinates: time
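The batch shape printed above, `(sample: 10000, time: 10)`, can look surprising at first: each batch holds 10 consecutive time steps, and the remaining `y` and `x` dimensions (100 × 100) are flattened into a single `sample` dimension of 10000. The following NumPy-only sketch reproduces that reshaping on a smaller array. It is an illustration of the resulting layout, not Xbatcher's implementation; the helper `time_batches` is hypothetical, and dropping leftover time steps that do not fill a batch is an assumption of the sketch.

```python
import numpy as np

# Smaller than the array above so the sketch runs quickly; same (time, y, x) layout.
n_time, n_y, n_x = 40, 3, 4
data = np.arange(n_time * n_y * n_x, dtype=float).reshape(n_time, n_y, n_x)

batch_len = 10  # plays the role of {'time': 10}


def time_batches(arr, size):
    """Yield (sample, time) blocks from a (time, y, x) array.

    Hypothetical helper for illustration; it is not part of Xbatcher.
    """
    n_full = arr.shape[0] // size  # assumption: leftover time steps are dropped
    for i in range(n_full):
        window = arr[i * size:(i + 1) * size]  # shape (size, y, x)
        flat = window.reshape(size, -1)        # shape (size, y * x), y-major order
        yield flat.T                           # shape (y * x, size)


batches = list(time_batches(data, batch_len))
print(len(batches), batches[0].shape)
```

With the dimensions used in the session above (1000 time steps, a 100 × 100 grid, batches of 10 steps), the same arithmetic gives 100 batches of shape (10000, 10).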