Train a model
Submit a training job from your labeled dataset. Viam runs the job on cloud infrastructure – no GPU provisioning or framework installation needed. Training logs are available for 7 days after the job completes.
For background on model frameworks (TFLite and TensorFlow), task types, and how deployment works, see the overview.
1. Start a training job from the web UI
- Go to app.viam.com.
- Click the DATA tab in the top navigation.
- Click the DATASETS subtab.
- Click the dataset you want to train on.
- Click Train model.
- Select the model framework:
- TFLite for edge devices (recommended for most use cases)
- TF for general-purpose models requiring more compute
- Enter a name for your model. Use a descriptive name like part-inspector-v1 or package-detector-v1. This name identifies the model in your organization's registry.
- Select the task type:
- Single Label Classification if each image has one tag
- Multi Label Classification if images have multiple tags
- Object Detection if you used bounding box annotations
- Select which labels to include in training. You can exclude labels that have too few examples or that you do not want the model to learn.
- Click Train model.
The training job starts. You will see a confirmation message with the job ID.
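Excluding under-represented labels is a judgment call; labels with very few examples tend to hurt training. A minimal sketch of how you might audit label counts before picking which labels to include — the input format and threshold here are illustrative, not part of the Viam CLI or SDK:

```python
from collections import Counter

def labels_to_train(image_tags: list[list[str]], min_examples: int = 10) -> list[str]:
    """Return labels with at least min_examples tagged images.

    image_tags: one list of tags per image in the dataset (illustrative input;
    in practice you would export this from your Viam dataset).
    """
    counts = Counter(tag for tags in image_tags for tag in tags)
    return sorted(label for label, n in counts.items() if n >= min_examples)

# Example: "scratch" appears on too few images to train on reliably.
tags = [["good-part"]] * 12 + [["defective-part"]] * 11 + [["scratch"]] * 3
print(labels_to_train(tags, min_examples=10))  # ['defective-part', 'good-part']
```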
2. Start a training job from the CLI
If you prefer the command line, use the Viam CLI:
viam train submit managed \
--dataset-id=YOUR-DATASET-ID \
--model-org-id=YOUR-ORG-ID \
--model-name=part-inspector-v1 \
--model-type=single_label_classification \
--model-framework=tflite \
--model-labels=good-part,defective-part
Required flags:
| Flag | Description | Accepted values |
|---|---|---|
| --dataset-id | Dataset to train on | Your dataset ID |
| --model-org-id | Organization to save the model in | Your organization ID |
| --model-name | Name for the trained model | Any string |
| --model-type | Task type | single_label_classification, multi_label_classification, object_detection |
| --model-framework | Model framework | tflite, tensorflow |
| --model-labels | Labels to train on | Comma-separated list of labels from your dataset |
--model-version is optional and defaults to the current timestamp.
The command returns a training job ID that you can use to check status.
3. Monitor training progress
Web UI:
- In the Viam app, click the DATA tab.
- Click the MODELS subtab, then expand Active Training.
- You will see a list of training jobs with their status:
- Pending – the job is queued
- In Progress – training is running
- Completed – the model is ready
- Failed – something went wrong
- Canceled – the job was canceled before completing
- Click a job ID to view detailed logs.
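Of the statuses above, Completed, Failed, and Canceled are terminal; Pending and In Progress are not. A script that polls a job can use that distinction to decide when to stop. This helper is illustrative, not a Viam SDK function:

```python
# Terminal statuses: the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}

def should_keep_polling(status: str) -> bool:
    """True while a training job may still change state."""
    return status not in TERMINAL_STATUSES

print(should_keep_polling("In Progress"))  # True
print(should_keep_polling("Completed"))    # False
```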
CLI:
Check the status of a training job:
viam train get --job-id=YOUR-JOB-ID
View training logs:
viam train logs --job-id=YOUR-JOB-ID
Training logs expire after 7 days. If you need to retain logs for longer, copy them before they expire.
You can also check a job programmatically. With the Python SDK:

async def main():
    viam_client = await connect()  # connect() returns an authenticated ViamClient
    ml_training_client = viam_client.ml_training_client
    job = await ml_training_client.get_training_job(
        id="YOUR-TRAINING-JOB-ID",
    )
    print(f"Status: {job.status}")
    print(f"Model name: {job.model_name}")
    print(f"Created: {job.created_on}")
    viam_client.close()
With the Go SDK:

job, err := mlTrainingClient.GetTrainingJob(ctx, "YOUR-TRAINING-JOB-ID")
if err != nil {
    logger.Fatal(err)
}
fmt.Printf("Status: %v\n", job.Status)
fmt.Printf("Model name: %s\n", job.ModelName)
fmt.Printf("Created: %v\n", job.CreatedOn)
4. Test your model
After training completes, test the model by deploying it to a machine with a vision service and checking its predictions against live or captured data.
- Deploy the model to a machine with a camera.
- Configure a vision service that uses the model.
- On the machine’s CONTROL tab, open the vision service panel to see live classifications or detections.
- Evaluate the results against a variety of conditions:
- Images that clearly belong to each class (should get high confidence)
- Ambiguous images (helps you understand the model’s decision boundary)
- Images from conditions not in the training set (reveals generalization gaps)
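When spot-checking predictions, it helps to tally results per class rather than eyeballing them. A small sketch for scoring predictions against known labels — the data here is made up for illustration:

```python
from collections import defaultdict

def per_class_accuracy(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """pairs: (true_label, predicted_label) for each test image."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for truth, pred in pairs:
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}

results = [
    ("good-part", "good-part"),
    ("good-part", "good-part"),
    ("good-part", "defective-part"),
    ("defective-part", "defective-part"),
]
print(per_class_accuracy(results))
# {'good-part': 0.6666666666666666, 'defective-part': 1.0}
```

A per-class breakdown surfaces exactly which label needs more training data, which feeds directly into the iteration loop in the next step.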
5. Deploy and iterate
When training completes, the model is stored in your organization’s registry. See Deploy a model to a machine to configure the module, ML model service, and vision service on your machine.
After deploying, improve your model by collecting targeted data where it struggles (edge cases, counterexamples, varied conditions), using auto-annotation to label efficiently, and retraining. If your machine is configured to use the model, the new version deploys automatically.
To review past training jobs:
With the Python SDK:

async def main():
    viam_client = await connect()  # connect() returns an authenticated ViamClient
    ml_training_client = viam_client.ml_training_client
    jobs = await ml_training_client.list_training_jobs(
        org_id=ORG_ID,
    )
    for job in jobs:
        print(f"Job: {job.id}, Status: {job.status}, "
              f"Model: {job.model_name}, Created: {job.created_on}")
    viam_client.close()
With the Go SDK:

jobs, err := mlTrainingClient.ListTrainingJobs(
    ctx, orgID, app.TrainingStatusUnspecified)
if err != nil {
    logger.Fatal(err)
}
for _, job := range jobs {
    fmt.Printf("Job: %s, Status: %v, Model: %s, Created: %v\n",
        job.ID, job.Status, job.ModelName, job.CreatedOn)
}
What’s next
- Deploy a model to a machine – configure the module, ML model service, and vision service to run your model.
- Add computer vision – the full guide to configuring vision services and cloud inference.
- Detect objects (2D) – use your object detection model to find and locate objects in camera images.
- Classify images – use your classification model to categorize images from your machine’s camera.