
Datasets you can browse, audit, and deploy.

Rights-cleared corpora across audio, image, video, and text. Every dataset ships with a card your compliance team will read before your engineering team does.

Most dataset pages sell volume.

We’d rather show you what’s inside each file: the modality, the locales, the licensing terms, the chain of custody, the sample clips. If it passes your review, it ships. If it doesn’t, it’s not for you.

Conversational speech, rights-cleared, globally balanced.

A two-speaker conversational corpus collected through our platform across 18 locales. Each file ships with word-level transcripts, speaker diarisation, and the full consent trail.

Modality
Conversational speech (two-speaker)
Locales
18 locales, globally balanced: en-US, en-GB, zh-CN, pt-BR, pt-PT, es-MX, fr-FR, ja-JP, ko-KR, de-DE, hi-IN, and more
Domain split
Healthcare, meetings, and contact-centre
Format
WAV, PCM, 44.1 kHz, stereo
Transcripts
Word-level timestamps, speaker labels
Licensing
Rights-cleared for commercial model training; derivative works available on request
Samples
Public on HuggingFace and Datarade
Availability
Licensed per locale or as the full corpus
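
For teams that want to confirm the Format row on a sample file before a wider review, here is a minimal sketch using Python's standard `wave` module. The function name `check_wav_format` and the returned field names are illustrative assumptions, not part of any delivery tooling. The listing does not state a bit depth, so the sketch reports it without checking it.

```python
import wave

# Expected values come from the Format row above (44.1 kHz, stereo).
# Bit depth is not stated in the listing, so it is reported, not checked.
EXPECTED_SAMPLE_RATE_HZ = 44_100
EXPECTED_CHANNELS = 2


def check_wav_format(path):
    """Return (ok, details) for a PCM WAV file at `path`.

    Hypothetical helper for illustration. The `wave` module reads only
    PCM WAV and raises wave.Error for other encodings.
    """
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        details = {
            "sample_rate_hz": sample_rate,
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / sample_rate,
        }
    ok = (
        details["sample_rate_hz"] == EXPECTED_SAMPLE_RATE_HZ
        and details["channels"] == EXPECTED_CHANNELS
    )
    return ok, details
```

Calling `check_wav_format("sample.wav")` on a downloaded sample (the file name is a placeholder) returns a pass/fail flag plus the measured properties, which can be compared against the dataset card.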

What’s coming next.

We publish new datasets as they clear QA. Here’s what is in production now.

Image
Object recognition and scene classification across retail, medical, and industrial domains
Video
Egocentric task demonstrations with synced audio narration
Multimodal
Paired voice-image-text datasets for instruction-following models
Accented speech
Extended single-locale corpora with fine-grained accent labeling

If you need one of these sooner than we’ll have it ready, that’s a custom collection. Talk to us.

Nothing in the catalog fits? Scope a project.

TELL US

The modality, language, domain, volume, and delivery timeline you need. Any special requirements for speaker profile, accent coverage, or capture environment.

WE SCOPE

Within 48 hours of the first call, you’ll have a scoped plan covering contributor sourcing, project timeline, pricing, delivery format, and rights scope.

WE COLLECT

Collection runs through the same platform, same consent framework, same multi-layer QA as every dataset in our catalog. The dataset card is built as we go.

YOU REVIEW

Early batches land in a review folder. Approve, flag for rework, or adjust scope. Final delivery happens on your preferred channel.

Sample clips and full corpora, in three places.

HuggingFace
Open samples from most datasets, searchable by modality and locale
Datarade
Full catalog listings with licensing terms and request-quote flow
Direct
Enterprise contracts and custom projects ship via signed direct access

All three paths point to the same source of truth: the dataset card for each corpus.

The ones we get most.

Can I listen to samples before requesting a spec?

Yes. Public samples are on HuggingFace and Datarade. For custom datasets or full-corpus evaluation, we’ll share a sample set directly after a short scoping call.

What does rights-cleared mean for these datasets?

Every contributor signed a consent form authorizing commercial training use for the specific project their data contributed to. The consent scope is documented in each dataset card and retrievable per file. You buy datasets with the rights spelled out, not assumed.

Can I license just a subset of a corpus?

Usually yes. Most datasets are available per-locale, per-domain, or per-speaker-segment. Contact us with the slice you need and we will confirm licensing and pricing for that subset.

Do you sell exclusive licenses?

For custom projects, yes. Catalog datasets are non-exclusive by default because they were collected with that intention. If you need exclusivity on a catalog dataset, we can discuss a carve-out.

What happens if we find an issue with a dataset we licensed?

Tell us. We investigate and, depending on the nature of the issue, replace files, issue a refund, or adjust scope. A dataset is only as good as the way its next complaint is handled.

How are datasets priced?

Per hour for audio, per asset for image and video, per token or per document for text. Exact pricing depends on locale, domain, rights scope, and volume. We quote on request.

How do you handle personally identifiable information?

Consent forms explicitly authorise or restrict identifying features. Anonymisation options include voice masking, face blurring, and metadata redaction. The dataset card specifies the anonymisation level for each file.

Can I buy data I can re-sell?

No. Redistribution rights are not granted by default. If you are a downstream marketplace or data broker and need redistribution, contact us for a separate license tier.

Browse the catalog. Or tell us what you need.