Data Providers

All data consumed by Hebel models must be provided in the form of DataProvider objects. A DataProvider is a class that provides an iterator which returns batches for training. Writing custom DataProviders gives great flexibility about where data can come from and enables arbitrary pre-processing of the data. For example, a user could write a DataProvider that receives data from the internet or through a pipe from a different process. Or, when working with text data, a user may define a custom DataProvider that performs tokenization and stemming on the text before returning it.

A DataProvider is defined by subclassing the hebel.data_providers.DataProvider class and must implement, at a minimum, the special methods __iter__ and next.
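The iterator protocol described above can be sketched with a minimal custom provider. The class below is a self-contained stand-in (it batches plain Python lists rather than subclassing hebel.data_providers.DataProvider, and the name ListDataProvider is illustrative), but it implements the same __iter__/next contract:

```python
class ListDataProvider(object):
    """Minimal sketch of a custom DataProvider: yields (data, targets)
    mini-batches from plain Python lists."""

    def __init__(self, data, targets, batch_size):
        assert len(data) == len(targets)
        self.data = data
        self.targets = targets
        self.batch_size = batch_size

    def __iter__(self):
        # Reset the cursor at the start of every pass over the data
        self.i = 0
        return self

    def next(self):
        # Python 2 iterator protocol, as referenced in the docs
        if self.i >= len(self.data):
            raise StopIteration
        batch = (self.data[self.i:self.i + self.batch_size],
                 self.targets[self.i:self.i + self.batch_size])
        self.i += self.batch_size
        return batch

    __next__ = next  # Python 3 compatibility
```

A training loop would then simply iterate over the provider, receiving one (data, targets) pair per mini-batch.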

Abstract Base Class

class hebel.data_providers.DataProvider(data, targets, batch_size)

This is the abstract base class for DataProvider objects. Subclass this class to implement a custom design. At a minimum, you must provide implementations of the __iter__ and next methods.

Minibatch Data Provider

class hebel.data_providers.MiniBatchDataProvider(data, targets, batch_size)

This is the standard DataProvider for mini-batch learning with stochastic gradient descent.

Input and target data may either be provided as numpy.array objects, or as pycuda.GPUArray objects. The latter is preferred if the data fits in GPU memory and will be much faster, as the data won't have to be transferred to the GPU for every minibatch. If the data is provided as a numpy.array, then every minibatch is automatically converted to a pycuda.GPUArray and transferred to the GPU.

Parameters:
  • data – Input data.
  • targets – Target data.
  • batch_size – The size of mini-batches.
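A hypothetical usage sketch, assuming the constructor signature documented above. To keep the example self-contained, the slicing behaviour of the provider is emulated with a plain generator function (minibatches, an illustrative name, not part of Hebel); with Hebel installed you would instead construct hebel.data_providers.MiniBatchDataProvider(data, targets, batch_size) directly:

```python
import numpy as np

def minibatches(data, targets, batch_size):
    """Emulate MiniBatchDataProvider iteration: yield (data, targets)
    slices of at most batch_size rows each."""
    for start in range(0, data.shape[0], batch_size):
        yield (data[start:start + batch_size],
               targets[start:start + batch_size])

data = np.random.rand(10, 4)   # 10 examples, 4 features
targets = np.arange(10)        # one target per example
batches = list(minibatches(data, targets, batch_size=4))
# Three batches: rows 0-3, 4-7, and the final short batch 8-9
```

Note that the last batch may be smaller than batch_size when the number of examples is not an exact multiple of it.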

Multi-Task Data Provider

class hebel.data_providers.MultiTaskDataProvider(data, targets, batch_size=None)

DataProvider for multi-task learning that uses the same training data for multiple targets.

This DataProvider is similar to hebel.data_providers.MiniBatchDataProvider, except that it has not one but multiple targets.

Parameters:
  • data – Input data.
  • targets – Multiple targets as a list or tuple.
  • batch_size – The size of mini-batches.
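The iteration pattern implied above can be sketched as follows: each mini-batch pairs one slice of the shared input data with a list of slices, one per target. This is a pure-Python stand-in (the function name multitask_batches is illustrative), not Hebel's implementation:

```python
def multitask_batches(data, targets, batch_size):
    """Yield (data_batch, [target_batch_task1, target_batch_task2, ...])
    where the same data slice is shared across all tasks."""
    for start in range(0, len(data), batch_size):
        end = start + batch_size
        yield data[start:end], [t[start:end] for t in targets]

data = list(range(6))
targets = ([0, 1, 0, 1, 0, 1],   # labels for task 1
           [9, 8, 7, 6, 5, 4])   # labels for task 2
first = next(multitask_batches(data, targets, batch_size=3))
# first[0] is the shared data batch; first[1] holds one batch per task
```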

See also:

hebel.models.MultitaskNeuralNet, hebel.layers.MultitaskTopLayer

Batch Data Provider

class hebel.data_providers.BatchDataProvider(data, targets)

DataProvider for batch learning. Always returns the full data set.

Parameters:
  • data – Input data.
  • targets – Target data.

See also:

hebel.data_providers.MiniBatchDataProvider
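A sketch of the batch-learning semantics described above, assuming each pass over the provider yields the complete data set as a single batch (the class FullBatchProvider is a self-contained stand-in, not Hebel's code, and the one-batch-per-pass behaviour is an assumption):

```python
class FullBatchProvider(object):
    """Stand-in for BatchDataProvider: every iteration pass returns the
    full data set as one batch."""

    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __iter__(self):
        self._done = False
        return self

    def next(self):
        if self._done:
            raise StopIteration
        self._done = True
        return self.data, self.targets

    __next__ = next  # Python 3 compatibility
```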

Dummy Data Provider

class hebel.data_providers.DummyDataProvider(*args, **kwargs)

A dummy DataProvider that does not store any data and always returns None.

MNIST Data Provider

class hebel.data_providers.MNISTDataProvider(array, batch_size=None)

DataProvider that automatically provides data from the MNIST data set of hand-written digits.

Depends on the skdata package.

Parameters:
  • array – {‘train’, ‘val’, ‘test’} Whether to use the official training, validation, or test data split of MNIST.
  • batch_size – The size of mini-batches.