Create a dataset
A dataset is a named collection of images at the organization level that you label and use for training. Before training, your dataset must meet these minimums:
| Requirement | Minimum |
|---|---|
| Total images | 15 |
| Labeled images | 80% of total |
| Examples per label | 10 |
| Label distribution | Roughly equal |
| Image source | Production environment |
For production use, aim for hundreds of images per label under varied conditions.
1. Create a dataset
You can create a dataset from the web UI, the CLI, or programmatically.
Web UI:
- Go to app.viam.com.
- Click the DATA tab in the top navigation.
- Click the DATASETS subtab.
- Click + Create dataset.
- Enter a descriptive name for your dataset. Use a name that reflects the task, such as inspection-parts-v1 or package-detection. Dataset names must be unique within your organization.
- Click Create.
Your empty dataset now appears in the list.
CLI:
If you have the Viam CLI installed, create a dataset from the command line:
viam dataset create --org-id=YOUR-ORG-ID --name=my-inspection-dataset
The command returns the dataset ID, which you will need for subsequent CLI and SDK operations.
Python:
import asyncio

from viam.rpc.dial import DialOptions
from viam.app.viam_client import ViamClient

API_KEY = "YOUR-API-KEY"
API_KEY_ID = "YOUR-API-KEY-ID"
ORG_ID = "YOUR-ORGANIZATION-ID"


async def connect() -> ViamClient:
    dial_options = DialOptions.with_api_key(API_KEY, API_KEY_ID)
    return await ViamClient.create_from_dial_options(dial_options)


async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    dataset_id = await data_client.create_dataset(
        name="my-inspection-dataset",
        organization_id=ORG_ID,
    )
    print(f"Created dataset: {dataset_id}")

    viam_client.close()


if __name__ == "__main__":
    asyncio.run(main())
Go:
package main

import (
    "context"
    "fmt"

    "go.viam.com/rdk/app"
    "go.viam.com/rdk/logging"
)

func main() {
    apiKey := "YOUR-API-KEY"
    apiKeyID := "YOUR-API-KEY-ID"
    orgID := "YOUR-ORGANIZATION-ID"

    ctx := context.Background()
    logger := logging.NewDebugLogger("create-dataset")

    viamClient, err := app.CreateViamClientWithAPIKey(
        ctx, app.Options{}, apiKey, apiKeyID, logger)
    if err != nil {
        logger.Fatal(err)
    }
    defer viamClient.Close()

    dataClient := viamClient.DataClient()

    datasetID, err := dataClient.CreateDataset(
        ctx, "my-inspection-dataset", orgID)
    if err != nil {
        logger.Fatal(err)
    }
    fmt.Printf("Created dataset: %s\n", datasetID)
}
Replace all placeholder values (YOUR-API-KEY, YOUR-API-KEY-ID,
YOUR-ORGANIZATION-ID) with your actual values. To find your organization ID,
click your organization name in the top navigation bar, then click Settings.
Your organization ID is displayed on the settings page with a copy button.
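Rather than pasting credentials into the script, you might read them from environment variables. The following is a minimal sketch and not part of the Viam SDK: the variable names VIAM_API_KEY, VIAM_API_KEY_ID, and VIAM_ORG_ID and the helper load_credentials are hypothetical, chosen only for illustration.

```python
import os
from typing import Mapping, NamedTuple


class Credentials(NamedTuple):
    api_key: str
    api_key_id: str
    org_id: str


# Hypothetical variable names; use whatever your deployment defines.
REQUIRED_VARS = ("VIAM_API_KEY", "VIAM_API_KEY_ID", "VIAM_ORG_ID")


def load_credentials(env: Mapping[str, str] = os.environ) -> Credentials:
    """Read the three values from env, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing))
    return Credentials(*(env[name] for name in REQUIRED_VARS))
```

With this in place, the Python example could start with API_KEY, API_KEY_ID, ORG_ID = load_credentials() in place of the three hard-coded strings.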
2. Add images to the dataset
With a dataset created, you need to populate it with images.
Web UI:
- Click the DATA tab in the top navigation.
- Use the filters to find the images you want. Filter by machine, component, time range, or tags.
- Select individual images by clicking their checkboxes, or use Select all to select all visible images.
- Click Add to dataset in the action bar that appears.
- Select your dataset from the dropdown.
- Click Add.
The selected images are now part of your dataset.
CLI:
Add images to a dataset using filter criteria:
viam dataset data add filter \
--dataset-id=YOUR-DATASET-ID \
--location-id=YOUR-LOCATION-ID \
--tags=label1,label2
This adds all images matching the filter to the dataset. You can filter by location, machine, component, tags, or time range.
Python:
# Reuses connect() from the create-dataset example above.
async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    await data_client.add_binary_data_to_dataset_by_ids(
        binary_ids=["binary-data-id-1", "binary-data-id-2"],
        dataset_id="YOUR-DATASET-ID",
    )
    print("Images added to dataset.")

    viam_client.close()
You can get binary data IDs by querying for images first using the data client’s
binary_data_by_filter method, which returns objects that include the binary ID.
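Go:
// Continues inside main() from the create-dataset example above,
// where ctx, logger, dataClient, and err are already defined.
err = dataClient.AddBinaryDataToDatasetByIDs(
    ctx,
    []string{"binary-data-id-1", "binary-data-id-2"},
    "YOUR-DATASET-ID",
)
if err != nil {
    logger.Fatal(err)
}
fmt.Println("Images added to dataset.")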
3. Annotate your images
Before training, label the images in your dataset with tags (for classification) or bounding boxes (for object detection). Training requires labels on at least 80% of the images, so label or remove any that remain unlabeled.
See Annotate images for step-by-step instructions on manual labeling, or Automate annotation to use an existing ML model to speed up the process.
4. Verify dataset quality
Before you train a model, check that your dataset meets the requirements.
In the web UI:
- Go to the DATA tab and click the DATASETS subtab.
- Click your dataset to open it.
- Review the dataset summary, which shows:
- Total number of images
- Number of labeled images
- Labels used and their counts
- Check each requirement:
| Check | What to look for |
|---|---|
| Enough images | At least 15 total. More is better. |
| Labeling coverage | At least 80% of images have tags or bounding boxes |
| Examples per label | At least 10 images per label |
| Label balance | No label should have more than 3x the images of any other label |
| Production conditions | Images should represent real operating conditions, not staged or ideal setups |
Common issues to fix before training:
- Too few images in one class: Capture more images of the underrepresented class, or remove the class and merge it with a related one.
- Unlabeled images: Either label them or remove them from the dataset. Unlabeled images do not help training and can confuse the summary statistics.
- Non-representative images: If your model will run on a factory floor but your training images were taken on a clean desk, the model will not generalize. Capture images under production conditions – with the actual lighting, background, camera angle, and distance your machine uses.
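The numeric checks in this section can be sketched as a small helper. This is an illustrative sketch only: it does not call the Viam API, and it assumes you type in the totals and per-label counts shown in the dataset summary. The function name check_dataset and the constant names are hypothetical; the thresholds are the ones stated in the tables on this page.

```python
from typing import Dict, List

# Thresholds copied from the requirement tables on this page.
MIN_TOTAL_IMAGES = 15
MIN_LABELED_PERCENT = 80
MIN_IMAGES_PER_LABEL = 10
MAX_LABEL_RATIO = 3


def check_dataset(total_images: int, labeled_images: int,
                  label_counts: Dict[str, int]) -> List[str]:
    """Return one message per failed check; an empty list means all pass."""
    problems: List[str] = []

    if total_images < MIN_TOTAL_IMAGES:
        problems.append(
            f"{total_images} images total; need at least {MIN_TOTAL_IMAGES}")

    # Integer arithmetic avoids floating-point edge cases at exactly 80%.
    if labeled_images * 100 < total_images * MIN_LABELED_PERCENT:
        problems.append(
            f"{labeled_images} of {total_images} images labeled; "
            f"need at least {MIN_LABELED_PERCENT}%")

    for label, count in sorted(label_counts.items()):
        if count < MIN_IMAGES_PER_LABEL:
            problems.append(
                f"label '{label}' has {count} images; "
                f"need at least {MIN_IMAGES_PER_LABEL}")

    if label_counts:
        smallest = min(label_counts.values())
        largest = max(label_counts.values())
        if largest > MAX_LABEL_RATIO * smallest:
            problems.append(
                f"largest label has {largest} images, more than "
                f"{MAX_LABEL_RATIO}x the smallest ({smallest})")

    return problems


if __name__ == "__main__":
    # Example numbers, invented for illustration.
    for problem in check_dataset(40, 30, {"good": 25, "defect": 5}):
        print(problem)
```

The helper covers only the countable requirements. Whether the images reflect production conditions still needs a visual review.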
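To list the datasets in your organization and their IDs programmatically:
Python:
# Reuses connect() and ORG_ID from the create-dataset example above.
async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    datasets = await data_client.list_datasets_by_organization_id(
        organization_id=ORG_ID,
    )
    for ds in datasets:
        print(f"Dataset: {ds.name}, ID: {ds.id}")

    viam_client.close()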
Go:
datasets, err := dataClient.ListDatasetsByOrganizationID(ctx, orgID)
if err != nil {
    logger.Fatal(err)
}
for _, ds := range datasets {
    fmt.Printf("Dataset: %s, ID: %s\n", ds.Name, ds.ID)
}
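Because later CLI and SDK calls take a dataset ID instead of a name, it can help to look the ID up from the listing. The following is a minimal sketch, assuming each listed dataset exposes name and id attributes as in the Python loop above. find_dataset_id is a hypothetical helper, not an SDK method.

```python
from typing import Any, Iterable, Optional


def find_dataset_id(datasets: Iterable[Any], name: str) -> Optional[str]:
    """Return the ID of the dataset called `name`, or None if there is no match.

    Dataset names are unique within an organization, so the first match
    is the only one.
    """
    for ds in datasets:
        if ds.name == name:
            return ds.id
    return None
```

In the Python listing example, you could call find_dataset_id(datasets, "my-inspection-dataset") after the list call and handle a None result before adding images.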
What’s next
- Annotate images – label your images with tags or bounding boxes for training.
- Automate annotation – use an existing ML model to auto-label images instead of doing it by hand.
- Train a model – use your labeled dataset to train a classification or object detection model.