Publishing datasets
Upload, license, and version robotics, vision, and language datasets at any scale.
Supported formats
Datasets are stored as a sharded set of files with a schema and an optional sample index. The platform indexes the manifest, not the rows, so storage costs scale with bytes rather than row count.
- Tabular: Parquet (preferred), Arrow IPC, CSV
- Vision: WebDataset (.tar), Parquet with embedded bytes
- Robotics: LeRobot, RLDS, MCAP, ROS 2 bags
- Audio: Parquet with FLAC bytes, WebDataset
- Text: JSONL, Parquet
Use Parquet wherever possible
Create a dataset
from nodedata import NodeData
node = NodeData()
dataset = node.datasets.create(
    name="panda-arm-grasp-100k",
    modality=["rgb", "depth", "proprioception", "action"],
    license="CC-BY-4.0",
    description_path="./README.md",
    price_usd=199.00,
    tags=["manipulation", "panda", "sim2real"],
)
Upload shards
Shards are content-addressed. Re-uploading an identical file is a no-op; only the manifest is updated. This lets you ship incremental snapshots without rewriting the entire dataset.
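To make the no-op behaviour concrete, here is a minimal sketch of content addressing. It assumes SHA-256 as the digest and uses an in-memory dictionary as a stand-in for the platform's shard store; the hash function and storage layout the platform actually uses are not documented here. The names `shard_digest`, `ShardStore`, and `put` are illustrative and not part of the SDK.

```python
import hashlib


def shard_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest used as the shard's address (assumed hash)."""
    return hashlib.sha256(data).hexdigest()


class ShardStore:
    """Hypothetical in-memory stand-in for a content-addressed shard store."""

    def __init__(self) -> None:
        self.blobs = {}      # digest -> shard bytes
        self.manifest = {}   # shard name -> digest

    def put(self, name: str, data: bytes) -> bool:
        """Record the shard in the manifest; return True only if bytes were stored."""
        digest = shard_digest(data)
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        # The manifest is updated either way, so a new snapshot can reuse old bytes.
        self.manifest[name] = digest
        return is_new


store = ShardStore()
first = store.put("000.parquet", b"episode bytes")    # new content: stored
second = store.put("000.parquet", b"episode bytes")   # identical content: no-op
third = store.put("001.parquet", b"other bytes")      # new content: stored
```

Because the address is derived from the bytes, an incremental snapshot only pays for shards whose content has changed.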
# Stream shards directly from disk; resumes automatically on failure
node.datasets.upload(
    dataset.id,
    files=[
        "./shards/000.parquet",
        "./shards/001.parquet",
        "./shards/002.parquet",
    ],
    schema="./schema.json",
    sample_index="./sample_index.json",
)

node-data dataset push \
  --name panda-arm-grasp-100k \
  --license CC-BY-4.0 \
  --modality rgb,depth,proprioception,action \
  --price 199 \
  --schema ./schema.json \
  ./shards/
Schema
Every dataset has a typed schema. The schema is part of the public listing and is enforced at upload time. If you upload without a schema, one is inferred and the inferred result is written back to your manifest.
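Upload-time enforcement can be pictured as a type and shape check per field. The sketch below is an assumption about how such a check might look, not the platform's validator: it treats `-1` as a wildcard for a variable-length dimension, and `shape_matches` and `check_field` are hypothetical helpers rather than SDK functions.

```python
def shape_matches(declared, actual) -> bool:
    """True if `actual` fits `declared`, where -1 matches any length."""
    if len(declared) != len(actual):
        return False
    return all(d == -1 or d == a for d, a in zip(declared, actual))


def check_field(schema: dict, name: str, dtype: str, shape: tuple) -> list:
    """Return a list of violations for one field (empty if the field is valid)."""
    field = schema["fields"].get(name)
    if field is None:
        return [f"{name}: not declared in schema"]
    errors = []
    if field["type"] != dtype:
        errors.append(f"{name}: expected {field['type']}, got {dtype}")
    if not shape_matches(field["shape"], shape):
        errors.append(f"{name}: shape {shape} does not match {field['shape']}")
    return errors


# A two-field schema in the same layout as a dataset schema file.
schema = {
    "fields": {
        "rgb": {"type": "uint8", "shape": [-1, 224, 224, 3]},
        "action": {"type": "float32", "shape": [-1, 7]},
    }
}

ok = check_field(schema, "rgb", "uint8", (50, 224, 224, 3))   # no violations
bad = check_field(schema, "action", "float64", (50, 6))       # wrong type and shape
```

Running a check like this locally before uploading is a cheap way to catch a mismatched dtype or a transposed dimension.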
{
  "splits": ["train", "val", "test"],
  "fields": {
    "rgb": { "type": "uint8", "shape": [-1, 224, 224, 3] },
    "depth": { "type": "float16", "shape": [-1, 224, 224] },
    "proprioception": { "type": "float32", "shape": [-1, 14] },
    "action": { "type": "float32", "shape": [-1, 7] }
  }
}
Stream datasets
Datasets are designed to stream. The SDK shards downloads across a thread pool, prefetches into a ring buffer, and decompresses Parquet in the background.
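The SDK's internals are not shown here, but the general pattern is straightforward to sketch. The example below uses a thread pool and a bounded window of in-flight downloads in place of a true ring buffer. `prefetch`, `fetch`, `depth`, and `fake_fetch` are illustrative names, not SDK parameters.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def prefetch(
    shard_ids: Iterable[str],
    fetch: Callable[[str], bytes],
    workers: int = 4,
    depth: int = 8,
) -> Iterator[bytes]:
    """Yield shard payloads in order while up to `depth` fetches run ahead."""
    ids = iter(shard_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fill the window with the first `depth` downloads.
        pending = deque(pool.submit(fetch, sid) for sid in islice(ids, depth))
        while pending:
            payload = pending.popleft().result()  # wait for the oldest fetch
            for sid in islice(ids, 1):            # top the window back up
                pending.append(pool.submit(fetch, sid))
            yield payload


def fake_fetch(shard_id: str) -> bytes:
    """Stand-in for a network download."""
    return shard_id.upper().encode()


payloads = list(prefetch(["a", "b", "c", "d", "e"], fake_fetch, workers=2, depth=2))
```

Capping the window bounds memory use: a slow training loop stalls the downloads instead of letting fetched shards pile up.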
for episode in node.datasets.stream("acme/panda-arm-grasp-100k", split="train"):
    rgb = episode["rgb"]  # (T, 224, 224, 3) uint8
    depth = episode["depth"]  # (T, 224, 224) float16
    action = episode["action"]  # (T, 7) float32
    # ... train ...
Splits, versions, and provenance
- Every revision is immutable. Consumers pin a version with revision="1.2.0".
- Splits are declared at the manifest level, so no folder naming conventions are required.
- Each shard records a provenance hash for reproducibility.
- Optional differential privacy reports can be attached to a revision.
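As a rough illustration of the first two points, the sketch below models revisions as read-only manifests that map split names to shard paths. The manifest layout, the `REVISIONS` registry, and the `freeze` and `shards_for` helpers are assumptions made for this example; the platform's actual manifest format is not documented here.

```python
from types import MappingProxyType


def freeze(splits: dict) -> MappingProxyType:
    """Return a read-only view of a split -> shard-paths mapping."""
    return MappingProxyType({name: tuple(paths) for name, paths in splits.items()})


# Hypothetical registry: each published revision is frozen once and never edited.
REVISIONS = {
    "1.1.0": freeze({
        "train": ["shards/000.parquet"],
        "val": ["shards/001.parquet"],
    }),
    "1.2.0": freeze({
        # Split membership comes from the manifest, not from the folder name.
        "train": ["shards/000.parquet", "extra/002.parquet"],
        "val": ["shards/001.parquet"],
    }),
}


def shards_for(revision: str, split: str) -> tuple:
    """Resolve the shard list for a split at a pinned revision."""
    return REVISIONS[revision][split]


pinned = shards_for("1.1.0", "train")   # unaffected by the 1.2.0 release
latest = shards_for("1.2.0", "train")
```

A consumer pinned to 1.1.0 keeps resolving the same shard list after 1.2.0 is published, which is what makes a training run repeatable.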
PII and sensitive data
Datasets containing personal data must declare the data class on creation. Public datasets flagged as containing biometric or identifying data are gated behind an attestation form for buyers.