Publishing datasets

Upload, license, and version robotics, vision, and language datasets at any scale.

Supported formats

Datasets are stored as a sharded set of files with a schema and an optional sample index. The platform indexes the manifest, not the rows, so storage costs scale with bytes rather than row count.

Tabular — Parquet (preferred), Arrow IPC, CSV
Vision — WebDataset (.tar), Parquet with embedded bytes
Robotics — LeRobot, RLDS, MCAP, ROS 2 bags
Audio — Parquet with flac bytes, WebDataset
Text — JSONL, Parquet

Use Parquet wherever possible

Parquet shards typically compress 3–10× better than raw JSONL and support column pruning at query time. Node Data's preview, sampling, and search index all run natively on Parquet.

Create a dataset

python

from nodedata import NodeData

node = NodeData()

dataset = node.datasets.create(
    name="panda-arm-grasp-100k",
    modality=["rgb", "depth", "proprioception", "action"],
    license="CC-BY-4.0",
    description_path="./README.md",
    price_usd=199.00,
    tags=["manipulation", "panda", "sim2real"],
)

Upload shards

Shards are content-addressed. Re-uploading an identical file is a no-op; only the manifest is updated. This lets you ship incremental snapshots without rewriting the entire dataset.

python

# Stream shards directly from disk; resumes automatically on failure
node.datasets.upload(
    dataset.id,
    files=[
        "./shards/000.parquet",
        "./shards/001.parquet",
        "./shards/002.parquet",
    ],
    schema="./schema.json",
    sample_index="./sample_index.json",
)

CLI (bash)

node-data dataset push \
  --name panda-arm-grasp-100k \
  --license CC-BY-4.0 \
  --modality rgb,depth,proprioception,action \
  --price 199 \
  --schema ./schema.json \
  ./shards/
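
The content-addressing behavior described above can be illustrated with a small, self-contained sketch. It is not the Node Data implementation: `ContentStore`, `shard_digest`, and `push_snapshot` are hypothetical names, and SHA-256 is an assumed choice of hash. The point is only to show why re-sending an unchanged shard costs nothing beyond a manifest update.

```python
import hashlib


def shard_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as the shard's content address (assumed hash)."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy in-memory stand-in for a content-addressed shard store (hypothetical)."""

    def __init__(self) -> None:
        self.blobs = {}    # digest -> shard bytes
        self.uploads = 0   # number of blobs actually stored

    def put(self, data: bytes) -> str:
        digest = shard_digest(data)
        if digest not in self.blobs:   # identical bytes are stored only once
            self.blobs[digest] = data
            self.uploads += 1
        return digest


def push_snapshot(store: ContentStore, shards: dict) -> dict:
    """Upload a snapshot and return its manifest (shard name -> digest)."""
    return {name: store.put(data) for name, data in sorted(shards.items())}


store = ContentStore()
v1 = push_snapshot(store, {"000.parquet": b"aaa", "001.parquet": b"bbb"})
# The second snapshot re-sends both old shards plus one new shard.
v2 = push_snapshot(
    store,
    {"000.parquet": b"aaa", "001.parquet": b"bbb", "002.parquet": b"ccc"},
)
print(store.uploads)  # 3: the second push stored only the new shard
```

Each snapshot gets its own manifest, while the two unchanged shards resolve to the same digests in both.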

Schema

Every dataset has a typed schema. The schema is part of the public listing and is enforced at upload time. If you do not supply a schema, one is inferred and written back to your manifest.

schema.json (json)

{
  "splits": ["train", "val", "test"],
  "fields": {
    "rgb":            { "type": "uint8",   "shape": [-1, 224, 224, 3] },
    "depth":          { "type": "float16", "shape": [-1, 224, 224]    },
    "proprioception": { "type": "float32", "shape": [-1, 14]          },
    "action":         { "type": "float32", "shape": [-1, 7]           }
  }
}
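
To make "enforced at upload time" concrete, here is a rough sketch of checking one sample against a schema like the one above. It assumes that `-1` in a `shape` marks a variable-length dimension (for example, episode length), which the page does not state explicitly. `shape_matches` and `validate_sample` are hypothetical helpers, not part of the SDK.

```python
import json

SCHEMA = json.loads("""
{
  "splits": ["train", "val", "test"],
  "fields": {
    "rgb":            { "type": "uint8",   "shape": [-1, 224, 224, 3] },
    "depth":          { "type": "float16", "shape": [-1, 224, 224]    },
    "proprioception": { "type": "float32", "shape": [-1, 14]          },
    "action":         { "type": "float32", "shape": [-1, 7]           }
  }
}
""")


def shape_matches(actual, declared) -> bool:
    """Compare shapes dimension by dimension; -1 is assumed to match any length."""
    if len(actual) != len(declared):
        return False
    return all(d == -1 or a == d for a, d in zip(actual, declared))


def validate_sample(sample, schema) -> list:
    """Check a sample given as {field: (dtype_name, shape)}; return error strings."""
    errors = []
    for name, spec in schema["fields"].items():
        if name not in sample:
            errors.append(f"{name}: missing field")
            continue
        dtype, shape = sample[name]
        if dtype != spec["type"]:
            errors.append(f"{name}: dtype {dtype} != {spec['type']}")
        if not shape_matches(shape, spec["shape"]):
            errors.append(f"{name}: shape {tuple(shape)} != {spec['shape']}")
    return errors


good = {
    "rgb": ("uint8", (120, 224, 224, 3)),
    "depth": ("float16", (120, 224, 224)),
    "proprioception": ("float32", (120, 14)),
    "action": ("float32", (120, 7)),
}
bad = dict(good, action=("float64", (120, 6)))

print(validate_sample(good, SCHEMA))  # no errors expected
print(validate_sample(bad, SCHEMA))   # dtype and shape errors for "action"
```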

Stream datasets

Datasets are designed to stream. The SDK shards downloads across a thread pool, prefetches into a ring buffer, and decompresses Parquet in the background.

python

for episode in node.datasets.stream("acme/panda-arm-grasp-100k", split="train"):
    rgb = episode["rgb"]                # (T, 224, 224, 3) uint8
    depth = episode["depth"]            # (T, 224, 224)   float16
    action = episode["action"]          # (T, 7)          float32
    # ... train ...
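
The prefetching pattern described above (a thread pool feeding a bounded buffer) can be sketched with the standard library alone. This is an illustration of the general technique, not the SDK's internals: `fetch_shard` is a placeholder for a download plus Parquet decode, and a bounded `queue.Queue` stands in for the ring buffer.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the stream


def fetch_shard(shard_id: int) -> list:
    """Placeholder for a network download plus decode; returns fake episodes."""
    return [shard_id * 10 + i for i in range(3)]


def stream_episodes(shard_ids, workers: int = 4, buffer_size: int = 8):
    """Yield episodes in shard order while a background thread keeps fetching."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                # pool.map submits every shard up front and yields results in
                # input order; the bounded queue limits buffered episodes only.
                for episodes in pool.map(fetch_shard, shard_ids):
                    for episode in episodes:
                        buffer.put(episode)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always unblock the consumer, even on error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


for episode in stream_episodes(range(3), workers=2, buffer_size=2):
    print(episode)
```

The bounded queue provides backpressure: if training is slower than download, the producer pauses instead of accumulating episodes in memory.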

Splits, versions, and provenance

Every revision is immutable. Consumers pin a version with revision="1.2.0".
Splits are declared at the manifest level — no folder naming conventions required.
Each shard records a provenance hash for reproducibility.
Optional differential privacy reports can be attached to a revision.
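
As a rough illustration of manifest-level splits and pinned revisions, the sketch below resolves a split to its shard list for a given revision. The manifest layout, the `MANIFESTS` table, and `shards_for` are all hypothetical; the actual manifest format is not shown on this page.

```python
# Hypothetical manifests keyed by revision. Each revision is treated as
# read-only: a new revision adds an entry instead of editing an old one.
MANIFESTS = {
    "1.1.0": {
        "splits": {"train": ["000.parquet"], "val": ["001.parquet"]},
    },
    "1.2.0": {
        "splits": {
            "train": ["000.parquet", "002.parquet"],
            "val": ["001.parquet"],
            "test": ["003.parquet"],
        },
    },
}


def shards_for(revision: str, split: str) -> tuple:
    """Return the shard names that a pinned revision assigns to a split."""
    return tuple(MANIFESTS[revision]["splits"][split])


print(shards_for("1.1.0", "train"))  # ('000.parquet',)
print(shards_for("1.2.0", "train"))  # ('000.parquet', '002.parquet')
```

Because split membership lives in the manifest, the same shard file can stay in place on disk while different revisions assign it to different splits.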

PII and sensitive data

Datasets containing personal data must declare the data class on creation. Public datasets flagged as containing biometric or identifying data are gated behind an attestation form for buyers.
