Data Providers
All data consumed by Hebel models must be provided in the form of DataProvider objects. DataProviders are classes that provide iterators which return batches for training. Writing a custom DataProvider gives a lot of flexibility about where data can come from and enables any sort of pre-processing on the data. For example, a user could write a DataProvider that receives data from the internet or through a pipe from a different process. Or, when working with text data, a user may define a custom DataProvider to perform tokenization and stemming on the text before returning it.
A DataProvider is defined by subclassing the hebel.data_providers.DataProvider class and must implement at a minimum the special methods __iter__ and next.
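The iterator contract described above can be illustrated with a minimal sketch. The class name TokenizingProvider and its constructor arguments are hypothetical, and the class deliberately does not subclass hebel.data_providers.DataProvider so that it runs without Hebel installed. It also returns only the pre-processed inputs, which is a simplification for illustration and not a statement about what Hebel's own providers return.

```python
class TokenizingProvider(object):
    """Hypothetical provider that tokenizes raw text before returning it.

    A real provider would subclass Hebel's DataProvider base class; this
    plain class only demonstrates the __iter__/next protocol.
    """

    def __init__(self, documents, batch_size):
        self.documents = documents
        self.batch_size = batch_size
        self.i = 0

    def __iter__(self):
        # Restart from the first document on every new pass.
        self.i = 0
        return self

    def next(self):
        if self.i >= len(self.documents):
            raise StopIteration
        batch = self.documents[self.i:self.i + self.batch_size]
        self.i += self.batch_size
        # Pre-processing step: lowercase and split on whitespace.
        return [doc.lower().split() for doc in batch]

    # Python 3 looks up __next__ rather than next.
    __next__ = next


docs = ["The quick brown fox", "jumps over", "the lazy dog"]
provider = TokenizingProvider(docs, batch_size=2)
batches = list(provider)
```

Because __iter__ resets the position, the same provider object can be iterated once per training epoch.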
Abstract Base Class
class hebel.data_providers.DataProvider(data, targets, batch_size)
This is the abstract base class for DataProvider objects. Subclass this class to implement a custom design. At a minimum, you must provide an implementation of the next method.
Minibatch Data Provider
class hebel.data_providers.MiniBatchDataProvider(data, targets, batch_size)
This is the standard DataProvider for mini-batch learning with stochastic gradient descent.
Input and target data may be provided either as numpy.array objects or as pycuda.GPUArray objects. The latter is preferred if the data fits in GPU memory, and it is much faster because the data does not have to be transferred to the GPU for every minibatch. If the data is provided as a numpy.array, every minibatch is automatically converted to a pycuda.GPUArray and transferred to the GPU.
Parameters:
- data – Input data.
- targets – Target data.
- batch_size – The size of mini-batches.
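As a rough illustration of what mini-batching over data and targets means, the sketch below slices both sequences in step. The function iterate_minibatches is hypothetical and is not part of Hebel. It uses plain Python lists so that it runs anywhere, although the same slicing works on the first axis of a numpy array. It does not perform the GPU transfer, and yielding a shorter final batch is an assumption of this sketch, not documented behaviour of MiniBatchDataProvider.

```python
def iterate_minibatches(data, targets, batch_size):
    """Yield (data, targets) slices of at most batch_size rows each."""
    if len(data) != len(targets):
        raise ValueError("data and targets must have the same length")
    for start in range(0, len(data), batch_size):
        stop = start + batch_size
        # Slice inputs and targets with the same bounds so rows stay aligned.
        yield data[start:stop], targets[start:stop]


data = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
targets = [0, 1, 0, 1, 0]
minibatches = list(iterate_minibatches(data, targets, batch_size=2))
```

With five rows and a batch size of two, this sketch produces two full batches followed by one batch holding the single remaining row.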
Multi-Task Data Provider
class hebel.data_providers.MultiTaskDataProvider(data, targets, batch_size=None)
DataProvider for multi-task learning that uses the same training data for multiple targets.
This DataProvider is similar to hebel.data_providers.MiniBatchDataProvider, except that it has not one but multiple targets.
Parameters:
- data – Input data.
- targets – Multiple targets as a list or tuple.
- batch_size – The size of mini-batches.
See also: hebel.models.MultitaskNeuralNet, hebel.layers.MultitaskTopLayer
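The idea of sharing one set of inputs across several targets can be sketched as follows. The function iterate_multitask is hypothetical and only illustrates the concept; the structure of the batches that MultiTaskDataProvider actually returns, and the meaning of its batch_size=None default, are not specified here, so this sketch requires an explicit integer batch size.

```python
def iterate_multitask(data, targets, batch_size):
    """Yield (data_batch, [target_batch, ...]) with one target slice per task."""
    for task_targets in targets:
        if len(task_targets) != len(data):
            raise ValueError("every target must have as many rows as data")
    for start in range(0, len(data), batch_size):
        stop = start + batch_size
        # The same input slice is paired with the matching slice of each task.
        yield data[start:stop], [t[start:stop] for t in targets]


data = ["a", "b", "c", "d"]
targets = (
    [0, 1, 0, 1],      # labels for a first, hypothetical task
    [10, 20, 30, 40],  # labels for a second, hypothetical task
)
multitask_batches = list(iterate_multitask(data, targets, batch_size=2))
```

Each yielded item carries one slice of the inputs together with a list holding the corresponding slice for every task.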
Batch Data Provider
class hebel.data_providers.BatchDataProvider(data, targets)
DataProvider for batch learning. Always returns the full data set.
Parameters:
- data – Input data.
- targets – Target data.
See also:
Dummy Data Provider
class hebel.data_providers.DummyDataProvider(*args, **kwargs)
A dummy DataProvider that does not store any data and always returns None.
MNIST Data Provider
class hebel.data_providers.MNISTDataProvider(array, batch_size=None)
DataProvider that automatically provides data from the MNIST data set of hand-written digits.
Depends on the skdata package.
Parameters:
- array – One of 'train', 'val', or 'test'. Whether to use the official training, validation, or test data split of MNIST.
- batch_size – The size of mini-batches.