MATRIX.CORP — 3D GENERATION SERIES

MATRIX VOXEL

Flow Matching · Multi-Modal Input · Task-Specific Decoders · A100 40GB

5 Models · Shared ~2.3B Flow Matching (DiT) Backbone · Triplane Latent Space · 4 Open Source + 1 Closed Source (Prime) · Text + Image + Video + 3D Input · OBJ / GLB / STL / NeRF / 3DGS / USD · A100 40GB Target
01 — Model Family
🗺️
Voxel Atlas
World & environment generation. Terrain, buildings, biomes, interiors.
.vox — Voxel Grid
.obj — Scene
.usd — Stage
Open Source · Planned
⚒️
Voxel Forge
Game-ready mesh & asset generation with PBR textures.
.obj / .glb
.fbx — Game Engine
.usdz — Apple AR
Open Source · Planned
🖨️
Voxel Cast
3D printable generation. Watertight, manifold, structurally valid.
.stl — Universal
.step — CAD
.3mf — Modern
Open Source · Planned
🔭
Voxel Lens
NeRF & Gaussian Splatting. Photorealistic scenes for VR/AR & cinema.
.ply — 3DGS
NeRF weights
.mp4 — Render
Open Source · Planned
Voxel Prime
All-in-one unified generation. All output types in a single API call.
All formats
Pipeline mode
Style transfer
Closed Source · Planned
02 — Input Modalities
💬
Text
CLIP-ViT-L + T5-XXL
Natural language prompt. Primary conditioning signal. Supports detailed descriptions, style directives, material specs.
🖼️
Image
DINOv2-L + Depth Encoder
Single reference image lifted to 3D. Infers geometry from visual cues, shading, perspective.
📷
Multi-View
Multi-View Transformer
2–12 images from different angles. Best geometry accuracy. Triangulates structure from multiple perspectives.
🎬
Video
Video-MAE + Temporal Pool
Extracts frames, infers 3D from camera motion. Enables dynamic / animated 3D scene generation (Lens).
🗿
3D Model
PointNet++
Existing mesh or point cloud as conditioning. Enables retexturing, restyling, format conversion, remeshing.
ALL INPUTS → CROSS-MODAL ATTENTION FUSION → 1024-DIM CONDITIONING VECTOR
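The fusion step above can be sketched as single-query attention pooling over all active modality tokens. This is a hedged, minimal illustration: the actual CrossModalAttention module is not specified here, and the token counts, random weights, and single-query design below are toy assumptions.

```python
# Toy sketch of cross-modal conditioning fusion (illustrative, not the real module):
# a learned query attends over the concatenated tokens of all active modalities,
# and the attention-weighted sum becomes the 1024-dim conditioning vector.
import numpy as np

COND_DIM = 1024  # conditioning width from the spec

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_modalities(modality_tokens, query):
    """Attention-pool tokens from all active modalities into one vector.

    modality_tokens: list of (n_i, COND_DIM) arrays, one per active input.
    query: (COND_DIM,) learned query vector (random here for illustration).
    """
    tokens = np.concatenate(modality_tokens, axis=0)       # (N, 1024)
    weights = softmax(tokens @ query / np.sqrt(COND_DIM))  # (N,)
    return weights @ tokens                                # (1024,) conditioning vector

rng = np.random.default_rng(0)
text_tokens = rng.standard_normal((77, COND_DIM))    # e.g. text encoder output
image_tokens = rng.standard_normal((256, COND_DIM))  # e.g. image patch features
query = rng.standard_normal(COND_DIM)
cond = fuse_modalities([text_tokens, image_tokens], query)
```

In a real multi-head design each modality would also carry a learned type embedding so the fusion can tell text tokens from image patches; the single-query pooling above only shows the shape contract: any number of inputs in, one 1024-dim vector out.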
03 — Core Architecture
Text Encoder (T5-XXL + CLIP-ViT-L)
→ 1024-dim text embedding
~4.7B
Image / MultiView / Video / 3D Encoders
DINOv2 · MultiViewTX · VideoMAE · PointNet++
~1.2B
Cross-Modal Conditioning Fusion
Fuses all active input modalities → unified conditioning
Module 01
Flow Matching DiT Backbone
24 blocks · hidden 1536 · 24 heads · AdaLN-Zero · 3D RoPE
~2.3B
Triplane Latent Space
3 × 256 × 256 × 32 channels · XY / XZ / YZ planes
SHARED
Atlas Decoder
Scene layout + voxelizer
Forge Decoder
Occupancy + mesh refinement
Cast Decoder
SDF → watertight mesh
Lens Decoder
Gaussian param predictor
Flow Matching Config
Method: Optimal Transport FM
Inference steps: 20–50 NFE
vs DDPM: ~50× faster sampling
CFG scale: 5.0–10.0
CFG dropout: 10% during training
Scheduler: RF (Rectified Flow)
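The config above implies a rectified-flow Euler sampler with classifier-free guidance. A minimal sketch, with a toy velocity field standing in for the DiT backbone (the field and all names below are illustrative assumptions, not the real model):

```python
# Sketch: rectified-flow sampling integrates dx/dt = v(x, t, cond) along a
# straight-line schedule from t=0 (noise) to t=1 (data) in `nfe` Euler steps.
# Classifier-free guidance mixes conditional and unconditional velocities.
import numpy as np

def sample_rectified_flow(velocity, x_noise, cond, cfg_scale=7.0, nfe=20):
    x = x_noise
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt
        v_cond = velocity(x, t, cond)
        v_uncond = velocity(x, t, None)  # null condition (the 10% train-time dropout)
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + dt * v                   # Euler step along the straight path
    return x

# Toy velocity field: pulls toward a fixed target only when conditioned.
target = np.ones(4)
def velocity(x, t, cond):
    return (target - x) if cond is not None else np.zeros_like(x)

x0 = np.zeros(4)
x1 = sample_rectified_flow(velocity, x0, cond="on", cfg_scale=1.0, nfe=20)
# With cfg_scale=1.0 each step is x <- 0.95*x + 0.05*target,
# so every component equals 1 - 0.95**20 (about 0.64) after 20 steps.
```

The ~50× speedup over DDPM follows directly from the step counts: 20–50 NFE here versus ~1000 denoising steps in a vanilla DDPM sampler.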
Triplane Latent
Planes: XY · XZ · YZ
Resolution: 256 × 256
Channels: 32 per plane
Total values: ~6M latents
Point query: project → 3 planes → sum
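Point querying and the latent count follow directly from the numbers above. A sketch, assuming bilinear sampling on each plane (a common choice; the spec only states project → 3 planes → sum):

```python
# Triplane point query: project a 3D point onto the XY / XZ / YZ feature
# planes, bilinearly sample each 32-channel plane, and sum the results.
import numpy as np

RES, CH = 256, 32  # per-plane resolution and channels, from the spec

def bilinear(plane, u, v):
    """plane: (RES, RES, CH); u, v in [0, 1]."""
    x, y = u * (RES - 1), v * (RES - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, RES - 1), min(y0 + 1, RES - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[x0, y0] + fx * (1 - fy) * plane[x1, y0]
            + (1 - fx) * fy * plane[x0, y1] + fx * fy * plane[x1, y1])

def query_triplane(planes, p):
    """planes: dict with 'xy', 'xz', 'yz' arrays; p = (x, y, z) in [0, 1]^3."""
    x, y, z = p
    return (bilinear(planes['xy'], x, y)
            + bilinear(planes['xz'], x, z)
            + bilinear(planes['yz'], y, z))  # (32,) feature fed to a decoder MLP

planes = {k: np.zeros((RES, RES, CH)) for k in ('xy', 'xz', 'yz')}
feat = query_triplane(planes, (0.5, 0.25, 0.75))
total_latents = 3 * RES * RES * CH  # 6,291,456 — the "~6M latents" figure
```

This is why triplanes are cheap relative to dense voxels: a full 256³ × 32 grid would hold ~537M values, versus ~6M here, while any 3D point remains queryable.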
Backbone Stats
Architecture: DiT (Diffusion Transformer)
Layers: 24 transformer blocks
Hidden dim: 1536
Attention heads: 24
Conditioning: AdaLN-Zero
Parameters: ~2.3B
04 — Task-Specific Decoder Heads
🗺️ Atlas
⚒️ Forge
🖨️ Cast
🔭 Lens
Voxel Atlas
World & environment generation. Generates complete 3D scenes — terrain, structures, vegetation, sky. Supports infinite world tiling with seamless chunk stitching and biome-aware generation across 8 biome types.
.vox · .obj scene · .usd stage
Scene Layout Transformer
6-layer transformer over 32×32 spatial grid. Divides space into semantic regions: terrain, structures, vegetation, sky, water.
Region-wise NeRF Decoder
Per-region MLP: 3D coords + triplane → density + RGB + semantic label. Marching cubes extraction per region.
Infinite World Tiling
Generates seamless adjacent chunks. Tiling latent conditioning ensures consistent biome transitions at borders.
LOD Generator
Auto-generates 4 levels of detail per scene object. Compatible with Unity/Unreal LOD systems.
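A toy stand-in for the layout stage, using the five semantic regions named above. The horizon-based split is invented purely for illustration; the real Scene Layout Transformer is a learned 6-layer model, not a rule.

```python
# Toy sketch of the Scene Layout Transformer's output contract: a 32x32
# spatial grid where every cell carries one semantic region label.
import numpy as np

REGIONS = ("terrain", "structures", "vegetation", "sky", "water")  # from the spec

def toy_layout(size=32, horizon=12):
    """Rows above a horizon become 'sky'; everything else 'terrain'."""
    grid = np.full((size, size), REGIONS.index("terrain"))
    grid[:horizon, :] = REGIONS.index("sky")
    return grid

layout = toy_layout()
sky_cells = int((layout == REGIONS.index("sky")).sum())  # 12 rows * 32 cols = 384
```

Downstream, each labelled region gets its own NeRF decoder pass, which is what lets marching cubes run per region rather than over the whole scene at once.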
Decoder params: ~400M
Total model size: ~2.7B
VRAM (BF16): ~22GB
Small scene: ~8s on A100
Large chunk (256³): ~35s on A100
Biome types: 8
LOD levels: 4
Max world resolution: 256×256×128 voxels
Voxel Forge
Game-ready 3D asset generation with clean topology and full PBR texture maps. Supports characters, objects, props, vehicles, and architectural elements. Topology-optimized for animation rigging.
.obj + .mtl · .glb / .gltf · .fbx · .usdz
Occupancy Network + Marching Cubes
MLP: 3D point + triplane → occupancy probability. Differentiable marching cubes produces initial raw mesh.
Mesh Refinement GNN
Graph neural network over mesh vertices/edges. 8 message-passing rounds. Predicts vertex position offsets for clean quad-dominant topology.
UV Unwrapper + Texture Diffusion
Learned UV unwrapping (SeamlessUV lineage). 2D flow matching in UV space generates albedo, roughness, metallic, normal maps at 1K–2K resolution.
Topology & Animation Optimizer
Enforces edge loops for rigging. Optional bilateral symmetry. Scale normalized to real-world meters.
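The occupancy stage can be illustrated with a toy analytic occupancy function in place of the learned MLP; the sphere, grid bounds, and resolution below are assumptions for the sketch, and marching cubes itself is omitted.

```python
# Sketch: dense occupancy query over a coarse grid. The real pipeline queries
# an MLP conditioned on triplane features; here a unit sphere stands in.
import numpy as np

def occupancy(points):
    """Toy occupancy: 1 inside the unit sphere, 0 outside."""
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(np.float32)

n = 32
lin = np.linspace(-1.2, 1.2, n)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing='ij'), axis=-1)  # (n, n, n, 3)
occ = occupancy(grid.reshape(-1, 3)).reshape(n, n, n)
# Marching cubes would extract the 0.5-level set of `occ` as the raw mesh,
# which the refinement GNN then cleans into quad-dominant topology.
```

Querying on a regular grid like this is what makes the coarse-to-fine curriculum natural: the same occupancy function can be evaluated at 64³ for previews and 256³ for final extraction.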
Decoder params: ~350M
Total model size: ~2.65B
VRAM (BF16): ~21GB
Low poly (≤5K tris): ~6s on A100
Mid poly (≤50K tris): ~15s on A100
High poly (≤500K tris): ~45s on A100
Texture resolution: 1024 or 2048px
LOD levels: 4 (100/50/25/10%)
Voxel Cast
Physically valid 3D printable model generation. Every Cast output is guaranteed watertight, manifold, zero self-intersections, minimum wall thickness enforced. Supports FDM, SLA, and resin printing workflows.
.stl · .obj (watertight) · .step (CAD) · .3mf
SDF Decoder → Dual Marching Cubes
MLP outputs signed distance field. Dual marching cubes guarantees watertight topology — no holes by construction.
Printability Validator (hard constraints)
Wall thickness ≥ 1.2mm enforced. Overhang > 45° flagged. Manifold checker + auto-repair. All outputs pass validation before delivery.
Hollowing Engine
Auto-hollows solid objects with configurable wall thickness. Adds drain holes. Reduces material use by up to 80%.
Interlocking Part Splitter
Splits large objects into printable parts with generated snap-fit joints. Validates part scale against Bambu, Prusa, and Ender bed sizes.
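The core manifold test such a validator runs is small enough to sketch: a closed triangle mesh is edge-manifold exactly when every edge is shared by two faces. This is a simplified illustration of one check, not the full validator.

```python
# Edge-manifold check: count how many faces reference each undirected edge.
# In a watertight, manifold mesh every edge must appear exactly twice.
from collections import Counter

def is_edge_manifold(faces):
    """faces: list of (a, b, c) vertex-index triangles."""
    edge_counts = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            edge_counts[tuple(sorted(e))] += 1
    return all(count == 2 for count in edge_counts.values())

# A tetrahedron (4 faces) is closed and manifold; removing a face opens
# boundary edges that appear only once, so the check fails.
tet = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(is_edge_manifold(tet))      # True
print(is_edge_manifold(tet[:3]))  # False
```

The "non-manifold edges: 0" guarantee in the stats below is exactly this invariant; edges counted once (holes) or three-plus times (T-junctions) both break slicers.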
Decoder params: ~200M
Total model size: ~2.5B
VRAM (BF16): ~20GB
Guaranteed: watertight + manifold
Min wall thickness: 1.2mm
Hollowing saving: up to 80% material
Supported printers: Bambu · Prusa · Ender
Non-manifold edges: 0 (guaranteed)
Voxel Lens
Photorealistic 3D scene generation via Neural Radiance Fields and 3D Gaussian Splatting. Optimized for VR/AR visualization, cinematic rendering, and dynamic animated scenes from video input.
.ply (3DGS) · NeRF weights · .mp4 render · depth maps
Gaussian Parameter Decoder
Per-Gaussian: position (3) + rotation (4) + scale (3) + opacity (1) + SH coefficients (48). Targets 500K–3M Gaussians per scene. Adaptive densification + pruning.
NeRF Branch (Instant-NGP style)
Hash-grid encoder + tiny MLP. Runs in parallel with 3DGS branch. Used for scenes requiring higher photometric accuracy.
Dynamic Scene Support
Temporal Gaussian sequences for animated scenes. Accepts video input → extracts motion → generates temporally consistent 3DGS.
Compression Module
Reduces 3DGS file size by 60–80% with minimal quality loss. Critical for web and mobile delivery of Gaussian scenes.
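The per-Gaussian layout above fixes the raw storage cost, which is what makes the compression module necessary. A quick arithmetic check (float32 storage assumed):

```python
# Per-Gaussian parameter counts from the spec, and the uncompressed size
# they imply at the 3M-Gaussian ceiling.
PARAMS = {
    "position": 3,       # xyz
    "rotation_quat": 4,  # quaternion
    "scale": 3,          # per-axis
    "opacity": 1,
    "sh_coeffs": 48,     # spherical-harmonic color coefficients
}
floats_per_gaussian = sum(PARAMS.values())            # 59 floats
raw_mb = 3_000_000 * floats_per_gaussian * 4 / 1e6    # float32 bytes -> MB
print(floats_per_gaussian, round(raw_mb))             # 59 708
```

Roughly 708MB raw for a 3M-Gaussian outdoor scene; a 60–80% reduction brings that to ~140–280MB, which is the difference between feasible and infeasible web/mobile delivery.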
Decoder params: ~500M
Total model size: ~2.8B
VRAM (BF16): ~22GB
Object-centric: ~12s on A100
Indoor scene: ~40s on A100
Outdoor scene: ~90s on A100
Max Gaussians: 3M per scene
File compression: 60–80% reduction
05 — 10 Shared Custom Modules
01
Cross-Modal Conditioning Fusion
CrossModalAttention over all active input types. Unified 1024-dim conditioning vector fed to backbone.
02
3D RoPE Encoder
RoPE adapted for triplane 3D spatial positions. Encodes XYZ coordinates with rotary positional embeddings.
03
Geometry Quality Scorer
Rates generated geometry [0–1] before output. Flags low-quality generations for re-sampling at higher NFE.
04
Semantic Label Head
Per-voxel/vertex semantic class prediction: wall, floor, ceiling, tree, water, metal, glass, fabric, etc.
05
Scale & Unit Manager
Enforces consistent real-world scale. All outputs tagged with unit metadata (meters). Validates print scale.
06
Material Property Head
Predicts PBR properties: roughness, metallic, IOR, subsurface scattering. Compatible with Blender/UE material graphs.
07
Confidence & Uncertainty Head
Per-region generation confidence. Flags uncertain areas in output metadata. Drives re-sampling priority.
08
Prompt Adherence Scorer
CLIP-based similarity score: how well the 3D output matches the input text prompt. Exposed in API response.
09
Multi-Resolution Decoder
Coarse-to-fine generation: 64³ → 128³ → 256³. Each stage refines the previous. Enables fast previews at 64³.
10
Style Embedding Module
Encodes style reference images into conditioning vector. Transfers art style, material aesthetic, and visual language to 3D output.
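Module 09's coarse-to-fine schedule can be sketched as two 2× upsamplings. Nearest-neighbour repetition stands in here for the decoder refinement that would run between stages; this is a shape-level illustration only.

```python
# Coarse-to-fine voxel schedule: 64^3 preview -> 128^3 -> 256^3 final.
# Each stage would be refined by the decoder; only the upsampling is shown.
import numpy as np

def upsample2x(vox):
    """Nearest-neighbour 2x upsampling along all three axes."""
    return vox.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

vox = np.zeros((64, 64, 64))  # fast-preview resolution
for _ in range(2):            # 64^3 -> 128^3 -> 256^3
    vox = upsample2x(vox)
print(vox.shape)              # (256, 256, 256)
```

The preview stage is 64× smaller than the final grid (64³ vs 256³ cells), which is what makes fast draft generations practical before committing to a full-resolution pass.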
06 — Voxel Prime — All-In-One
VOXEL PRIME
CLOSED SOURCE · API ONLY · ALL DECODER HEADS · 6 EXCLUSIVE MODULES
🟣 Closed Source API Access Only
Cross-Task Consistency
Ensures Atlas world + Forge assets + Lens scene all match visually when generated together in one call.
Scene Population Engine
Generates a world (Atlas) then auto-populates it with fitting assets (Forge). One prompt → full scene.
Pipeline Orchestrator
Chains Atlas → Forge → Cast → Lens in a single API request. Manages inter-model dependencies automatically.
4× Texture Upscaler
Photorealistic super-resolution on all generated textures. 512px base → 2048px final via flow matching in UV space.
Style Transfer Module
Apply artistic styles (Studio Ghibli, cyberpunk, brutalist, etc.) consistently across all output types in one generation.
Iterative Refinement
Text-guided editing of already-generated 3D content. "Make the roof taller" → re-runs only affected regions.
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],
  "inputs": {
    "image": "base64...",     // optional reference
    "video": "base64..."      // optional for dynamic scenes
  },
  "settings": {
    "quality": "high",            // draft | standard | high
    "style": "realistic",         // realistic | stylized | low-poly | ...
    "scale_meters": 100.0,
    "populate_scene": true     // Atlas → auto-populate with Forge assets
  }
}
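A hedged sketch of calling this endpoint from Python's standard library. Only the path and field names come from the example above; the host name and the auth header are assumptions, and the request is built but deliberately not sent.

```python
# Build (but do not send) a POST to the Prime generate endpoint shown above.
# Host and Authorization scheme are placeholders, not documented values.
import json
import urllib.request

payload = {
    "prompt": "A medieval castle on a cliff at sunset",
    "output_types": ["world", "mesh", "nerf"],
    "settings": {
        "quality": "high",
        "style": "realistic",
        "scale_meters": 100.0,
        "populate_scene": True,  # Atlas world auto-populated with Forge assets
    },
}
req = urllib.request.Request(
    "https://api.matrix.corp/v1/voxel/generate",  # assumed host; path from the spec
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer $MATRIX_API_KEY",  # placeholder credential
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncommenting would perform the call
```

Note that the optional `inputs` block (base64 image/video references) is omitted here; when present it would be merged into `payload` alongside `prompt` exactly as in the JSON example.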
07 — Training Data
Dataset | Size | Content | Used by
Objaverse-XL | 10M+ | Massive diverse 3D objects | Atlas · Forge · Cast · Lens
Objaverse | 800K+ | Diverse annotated 3D assets | Forge · Cast · Lens
ShapeNet | 55K | Common object categories | Forge · Cast
ScanNet / ScanNet++ | 1.5K scenes | Indoor 3D scans (RGB-D) | Atlas · Lens
KITTI / nuScenes | 40K frames | Outdoor driving 3D scenes | Atlas · Lens
ABO (Amazon Berkeley) | 148K | Product meshes + materials | Forge
Thingiverse | 2M+ | 3D printable STL models | Cast
Polycam Scans | ~500K | Real-world 3DGS / NeRF captures | Lens
Synthetic Renders | Generated | Multi-view renders of Objaverse | Atlas · Forge · Cast · Lens
Text-3D Pairs (synthetic) | Generated | GPT-4o descriptions of Objaverse | Atlas · Forge · Cast · Lens
08 — Model Sizes & VRAM
VOXEL ATLAS
Backbone: 2.3B
Decoder: ~400M
Total: ~2.7B
BF16 VRAM: ~22GB
INT8 VRAM: ~11GB
VOXEL FORGE
Backbone: 2.3B
Decoder: ~350M
Total: ~2.65B
BF16 VRAM: ~21GB
INT8 VRAM: ~11GB
VOXEL CAST
Backbone: 2.3B
Decoder: ~200M
Total: ~2.5B
BF16 VRAM: ~20GB
INT8 VRAM: ~10GB
VOXEL LENS
Backbone: 2.3B
Decoder: ~500M
Total: ~2.8B
BF16 VRAM: ~22GB
INT8 VRAM: ~11GB
VOXEL PRIME
Backbone: 2.3B
All Decoders: ~1.4B
Total: ~3.7B
BF16 VRAM: ~30GB
INT8 VRAM: ~15GB
✦  All specialist models (Atlas/Forge/Cast/Lens) fit comfortably on an A100 40GB in BF16.  ·  INT8 quantization brings them all under 15GB, viable on a consumer RTX 4090 (24GB).  ·  Voxel Prime requires an A100 40GB in BF16, or 2× RTX 4090 in INT8.
09 — Training Strategy
01
Backbone Pre-Training
  • Train shared DiT on Objaverse-XL triplane reconstructions
  • Text + single image conditioning only
  • Context: general 3D structure, no task specialization
  • ~100K steps on an A100 cluster
02
Decoder Head Training
  • Freeze backbone, train each head independently
  • Atlas: ScanNet + synthetic world data
  • Forge: ShapeNet + Objaverse + textures
  • Cast: Thingiverse + watertight synthetic meshes
  • Lens: Polycam + synthetic multi-view renders
  • ~50K steps each (parallel runs)
03
Joint Fine-Tuning
  • Unfreeze backbone, full end-to-end per model
  • Add all input modalities (video, multi-view, point cloud)
  • Multi-resolution curriculum: 64³ → 128³ → 256³
  • ~30K steps each
04
Prime Training
  • Init from jointly fine-tuned backbone
  • All 4 decoder heads trained simultaneously
  • Cross-task consistency losses
  • Pipeline orchestrator + style transfer modules
  • ~50K steps