Data Sets
In the simplest form a data set is just a Task which produces data a long with a storage mechanism. This allows for a number of benefits over just using tasks directly.
Data can be produced once and used many times. For example if a number of Tasks
depend of the same Dataset AME can prepare the dataset once.
Here is an example of what a simple data set configuration looks like:
# ame.yaml
...
dataSets:
- name: mnist
path: ./data # Specifies where the tasks stores data.
# task:
taskRef: fetch_mnist # References a task which produces data.
Configuring a data set
A simple data set cfg is quick to set and can then be progressively enhanced as your needs expand. Here we will walk through the process of first setting up a simple data set and then go through the more advanced options.
The minimum requirements for a dataset is a path
pointer to where data should be saved from and a Task
which will produce data at that path. As shown in the mnist example above.
Lets start with that here:
# ame.yaml
...
dataSets:
- name: mnist
path: ./data # Specifies where the tasks stores data.
task:
taskRef: fetch_mnist # References a task which produces data.
So far so good, we have a path data
and reference a Task
that produces our data.
Dataset size
If a dataset is large it is a good idea to specifiy the storage requirements. This will allow AME to warn you if the object storage is running out.
If you do not specify the size AME will attempt to save the dataset, detect the failure and then produce an alert.
# ame.yaml
...
dataSets:
- name: mnist
path: ./data # Specifies where the tasks stores data.
size: 50Gi
task:
taskRef: fetch_mnist # References a task which produces data.
Interacting with data sets
To see the status of live data sets, use the AME's cli. Current it is only possible to see data sets that are in use, meaning referenced by some running task.
You can also view datasets from AME's dashboard:
TODO: dataset image
Consuming data from object storage
AME does not yet have builtin support for extracing data from object storage, although it will in the near future, see the tracking issue here. It is still quite simplte to accomplish this in pure python, so we shall demonstrate that here.