Curated datasets and prompts for real businesses — built with guardrails

We collect, refine, and package high-quality datasets, generate prompt libraries, and train models with strict privacy and safety controls. Designed for SMEs and startup teams who need pragmatic AI, not research projects.

📊

Curated Datasets

High-quality data with provenance notes and versioning

💬

Prompt Packs

Industry-specific prompts with evaluation suites

🛡️

PII Scrubbing

Automatic privacy protection in every pipeline

🤖

LLM Training

LoRA & instruction-tuning with RAG options

🔒

Safety Guardrails

Toxicity, bias & jailbreak screening built-in

What We Do

🌐

Data Collection

Ethical web collection, partner feeds, and customer uploads. Scheduler + retrievers with adaptive rate limits and failover. All sources tracked for provenance and license.
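One way to picture an adaptive rate limit is a per-host delay that grows when the server pushes back and shrinks again on healthy responses. The sketch below is only an illustration under that assumption; the `AdaptiveDelay` class, its thresholds, and its multipliers are hypothetical and not a description of the production scheduler.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDelay:
    """Tracks the wait between requests to one host (hypothetical helper)."""

    delay: float = 1.0       # current wait in seconds
    min_delay: float = 0.5   # never go faster than this
    max_delay: float = 60.0  # never wait longer than this

    def record(self, status_code: int) -> float:
        """Update the delay from the latest HTTP status and return it."""
        if status_code in (429, 503):
            # The server is pushing back: double the wait, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        elif 200 <= status_code < 300:
            # Healthy response: ease back toward the floor.
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay
```

A collector would call `record()` after each response and sleep for the returned number of seconds before the next request to the same host.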

⚙️

Refining & Labeling

Deduplication, language detection, PII scrubbing, toxicity filters, topic/intent tagging, and human-in-the-loop QA for "gold" samples.
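As a rough illustration of the deduplication step, exact duplicates can be dropped by hashing a normalized form of each record. This is a minimal sketch that assumes records are plain strings and that lowercasing plus whitespace collapsing is an acceptable notion of "same"; the function names are hypothetical, and near-duplicate detection would need a different technique.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(records: list[str]) -> list[str]:
    """Keep the first occurrence of each record, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        digest = hashlib.sha256(normalize(record).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)  # keep the original, un-normalized text
    return unique
```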

✍️

Prompt Generation

Bilingual prompt packs per vertical (marketing, support, real estate, retail). Includes style, tone, brand-safety and response-length controls.

📦

Dataset Sales

Single-purchase industry packs with clear licensing and update cadence. Optional RAG-ready chunking + embeddings export.
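"RAG-ready chunking" generally means splitting documents into overlapping windows before they are embedded. The sketch below shows one simple word-based variant; the `chunk_words` name and the default window and overlap sizes are assumptions chosen for illustration, not the settings used in the packs.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of some duplicated tokens in the index.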

🧠

LLM Training

Instruction-tuning & LoRA on curated sets. Metric tracking (helpfulness, safety, grounding), eval suites, and rollback to prior versions.

🚀

Deployment Support

RAG APIs, retrieval schemas, and guardrail middleware. We help you wire it into apps or agents, then monitor drift and quality.

Enterprise-Grade Guardrails

Every pipeline comes with built-in privacy, safety, and compliance controls

🔐

PII Handling

Automatic detection and removal of emails, phone numbers, addresses, IDs, and faces. Optional hashing/tokenization. PII is never included in deliverables.

📜

Consent & Licensing

Only public/partner data with documented terms. We record provenance, timestamps, license notes, and block disallowed reuse.

⚠️

Safety & Brand

Toxicity, bias & jailbreak screens; category allow/deny; Spanish/MX tone controls. Red-team prompts baked into QA runs.

💾

Storage & Access

Encrypted at rest (cloud buckets) and in transit (TLS). Signed links or VPC peering. Access is least-privilege.

✅

Compliance

We align with common principles (GDPR/CCPA-style data minimization & purpose limits). DPA available on request.

📋

Auditability

Dataset cards list sources, filters, and known caveats. Versioned releases with changelogs for reproducibility.
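A dataset card of this kind could be represented as a small structured record. The sketch below is a hypothetical shape based only on the fields named above (sources, filters, caveats, version, changelog); the actual card schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetCard:
    """Hypothetical dataset card; field names are assumptions."""

    dataset_id: str
    version: str
    sources: list[str]
    filters: list[str]
    caveats: list[str] = field(default_factory=list)
    changelog: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the card so it can ship alongside a versioned release."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Storing the card next to each release lets a later reader see which sources and filters produced a given version.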

Pay-Per-GB (No Commitment)

Need data without a monthly plan? Purchase exactly what you need, when you need it.

📁

Raw Data

$2 / GB

✓ Ethically collected web data
✓ Partner feeds & public sources
✓ Provenance & license tracking
✓ JSON/CSV/Parquet formats
✓ No monthly commitment

Ideal for teams with their own processing pipelines who need quality source data.

Best Value

✨

Refined Datasets

$8 / GB

✓ PII scrubbed & deduplicated
✓ Language detection & filtering
✓ Toxicity & bias screening
✓ Topic/intent tagging included
✓ RAG-ready chunking available
✓ No monthly commitment

Production-ready data with all guardrails applied. Perfect for LLM fine-tuning.

Monthly Plans (USD)

Ongoing data processing with full support. Fair-use: GB refers to pre-compression input volume processed within the month.

Starter

$299 /month

25 GB curated data processing

✓ Prompt pack: 1 vertical
✓ 1 eval suite & basic guardrails
✓ Email support

Popular

Growth

$749 /month

100 GB curated data processing

✓ Prompt packs: 3 verticals
✓ 2 eval suites, advanced guardrails
✓ Slack support + quarterly review

Scale

$1,899 /month

500 GB curated data processing

✓ All prompt packs (library access)
✓ Full eval + red-team runs
✓ Training advisory (LoRA planning)

Enterprise

Custom pricing

1+ TB special licensing

✓ Private datasets & SLAs
✓ Dedicated infra / VPC
✓ On-prem or sovereign options

One-Time Purchases

Dataset Packs

$49 Micro-pack

$199 Standard pack

$499 Pro pack

Includes license & provenance notes. Optional embeddings +$49.

Training Add-on

From $1,500 per job

LoRA planning + supervised runs. Metrics, evals, rollback, and model card included. (Excl. cloud compute)

Implementation

$120/hr or fixed-bid

RAG wiring & guardrail setup. Fixed-bid available after a short scoping call.

All currency is USD. Taxes, cloud costs, and marketplace fees billed at actuals where applicable.
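To compare the pay-per-GB tiers with the monthly plans, it can help to work out each plan's effective price per GB. The short calculation below uses only the list prices and volumes quoted above and assumes the full monthly allowance is used; the helper names are hypothetical. The monthly plans also bundle prompt packs, eval suites, and support, so the figures are not a like-for-like comparison.

```python
# Prices and volumes copied from the tiers listed above (USD).
PAY_PER_GB = {"Raw Data": 2.00, "Refined Datasets": 8.00}
MONTHLY_PLANS = {
    "Starter": (299, 25),   # (price per month, GB included)
    "Growth": (749, 100),
    "Scale": (1899, 500),
}


def effective_rate(price: float, included_gb: float) -> float:
    """Price per GB if the whole monthly allowance is used."""
    return price / included_gb


def pay_per_gb_cost(tier: str, gb: float) -> float:
    """Cost of buying `gb` gigabytes without a monthly plan."""
    return PAY_PER_GB[tier] * gb


for name, (price, gb) in MONTHLY_PLANS.items():
    print(f"{name}: ${effective_rate(price, gb):.2f} per GB at full use")
```

At full use this works out to about $11.96 per GB for Starter, $7.49 for Growth, and $3.80 for Scale, against a flat $8 per GB for refined data bought on its own.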

APIs & Models

REST endpoints for dataset manifests, signed downloads, and RAG retrieval. Optional embeddings export (FAISS, pgvector, or cloud vector stores). Webhooks for update events.

We support instruction-tuning & LoRA on top models; RAG stacks with bilingual retrievers; and guardrail middleware for prompts & responses.

curl -X GET \
  "https://api.thedatafactory.dev/v1/datasets" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Accept: application/json"

# Response
{
  "datasets": [
    {
      "id": "mkt-es-2024",
      "vertical": "marketing",
      "language": "es-MX",
      "records": 50000
    }
  ]
}
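The same request can be made from Python with the standard library alone. This is a sketch that assumes the endpoint, the bearer-token header, and the response shape are exactly as in the example above; the function names are hypothetical, and pagination and error handling are left out.

```python
import json
import urllib.request

BASE_URL = "https://api.thedatafactory.dev/v1"


def build_request(api_key: str) -> urllib.request.Request:
    """Build the GET request for the dataset listing shown above."""
    return urllib.request.Request(
        f"{BASE_URL}/datasets",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
        },
        method="GET",
    )


def parse_datasets(body: str) -> list[dict]:
    """Extract the list under the "datasets" key of the JSON response."""
    return json.loads(body)["datasets"]


def list_datasets(api_key: str, timeout: float = 10.0) -> list[dict]:
    """Call the API and return the parsed dataset entries."""
    with urllib.request.urlopen(build_request(api_key), timeout=timeout) as resp:
        return parse_datasets(resp.read().decode("utf-8"))
```

Keeping request construction and parsing separate from the network call makes both easy to check without a live API key.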

Get In Touch

Email

WhatsApp

NextGenAI

Grupo Casa Roca