MATRIX.CORP — 3D GENERATION SERIES

MATRIX VOXEL

Flow Matching · Multi-Modal Input · Task-Specific Decoders · A100 40GB

5 Models · Shared ~2.3B Flow Matching (DiT) Backbone · Triplane Latent Space · 4 Open Source + 1 Closed Source (Prime) · Text + Image + Video + 3D Input · OBJ / GLB / STL / NeRF / 3DGS / USD · A100 40GB Target
01 — Model Family
🗺️
Voxel Atlas
World & environment generation. Terrain, buildings, biomes, interiors.
.vox — Voxel Grid
.obj — Scene
.usd — Stage
Open Source · Planned
⚒️
Voxel Forge
Game-ready mesh & asset generation with PBR textures.
.obj / .glb
.fbx — Game Engine
.usdz — Apple AR
Open Source · Planned
🖨️
Voxel Cast
3D printable generation. Watertight, manifold, structurally valid.
.stl — Universal
.step — CAD
.3mf — Modern
Open Source · Planned
🔭
Voxel Lens
NeRF & Gaussian Splatting. Photorealistic scenes for VR/AR & cinema.
.ply — 3DGS
NeRF weights
.mp4 — Render
Open Source · Planned
Voxel Prime
All-in-one unified generation. All output types in a single API call.
All formats
Pipeline mode
Style transfer
Closed Source · Planned
02 — Input Modalities
💬
Text
CLIP-ViT-L + T5-XXL
Natural language prompt. Primary conditioning signal. Supports detailed descriptions, style directives, material specs.
🖼️
Image
DINOv2-L + Depth Encoder
Single reference image lifted to 3D. Infers geometry from visual cues, shading, perspective.
📷
Multi-View
Multi-View Transformer
2–12 images from different angles. Best geometry accuracy. Triangulates structure from multiple perspectives.
🎬
Video
Video-MAE + Temporal Pool
Extracts frames, infers 3D from camera motion. Enables dynamic / animated 3D scene generation (Lens).
🗿
3D Model
PointNet++
Existing mesh or point cloud as conditioning. Enables retexturing, restyling, format conversion, remeshing.
ALL INPUTS → CROSS-MODAL ATTENTION FUSION → 1024-DIM CONDITIONING VECTOR
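The fusion step above can be sketched as single-query attention pooling over all active modality tokens. This is a hedged, minimal illustration: the actual CrossModalAttention module is not specified here, and the token counts, random weights, and single-query design below are toy assumptions.

```python
# Toy sketch of cross-modal conditioning fusion (illustrative, not the real module):
# a learned query attends over the concatenated tokens of all active modalities,
# and the attention-weighted sum becomes the 1024-dim conditioning vector.
import numpy as np

COND_DIM = 1024  # conditioning width from the spec

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_modalities(modality_tokens, query):
    """Attention-pool tokens from all active modalities into one vector.

    modality_tokens: list of (n_i, COND_DIM) arrays, one per active input.
    query: (COND_DIM,) learned query vector (random here for illustration).
    """
    tokens = np.concatenate(modality_tokens, axis=0)       # (N, 1024)
    weights = softmax(tokens @ query / np.sqrt(COND_DIM))  # (N,)
    return weights @ tokens                                # (1024,) conditioning vector

rng = np.random.default_rng(0)
text_tokens = rng.standard_normal((77, COND_DIM))    # e.g. text encoder output
image_tokens = rng.standard_normal((256, COND_DIM))  # e.g. image patch features
query = rng.standard_normal(COND_DIM)
cond = fuse_modalities([text_tokens, image_tokens], query)
```

In a real multi-head design each modality would also carry a learned type embedding so the fusion can tell text tokens from image patches; the single-query pooling above only shows the shape contract: any number of inputs in, one 1024-dim vector out.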
03 — Core Architecture
Text Encoder (T5-XXL + CLIP-ViT-L)
→ 1024-dim text embedding
~4.7B
Image / MultiView / Video / 3D Encoders
DINOv2 · MultiViewTX · VideoMAE · PointNet++
~1.2B
Cross-Modal Conditioning Fusion
Fuses all active input modalities → unified conditioning
Module 01
Flow Matching DiT Backbone
24 blocks · hidden 1536 · 24 heads · AdaLN-Zero · 3D RoPE
~2.3B
Triplane Latent Space
3 × 256 × 256 × 32 channels · XY / XZ / YZ planes
SHARED
Atlas Decoder
Scene layout + voxelizer
Forge Decoder
Occupancy + mesh refinement
Cast Decoder
SDF → watertight mesh
Lens Decoder
Gaussian param predictor
Flow Matching Config
Method: Optimal Transport FM
Inference steps: 20–50 NFE
vs DDPM: ~50× faster sampling
CFG scale: 5.0–10.0
CFG dropout: 10% during training
Scheduler: RF (Rectified Flow)
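The config above implies a rectified-flow Euler sampler with classifier-free guidance. A minimal sketch, with a toy velocity field standing in for the DiT backbone (the field and all names below are illustrative assumptions, not the real model):

```python
# Sketch: rectified-flow sampling integrates dx/dt = v(x, t, cond) along a
# straight-line schedule from t=0 (noise) to t=1 (data) in `nfe` Euler steps.
# Classifier-free guidance mixes conditional and unconditional velocities.
import numpy as np

def sample_rectified_flow(velocity, x_noise, cond, cfg_scale=7.0, nfe=20):
    x = x_noise
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt
        v_cond = velocity(x, t, cond)
        v_uncond = velocity(x, t, None)  # null condition (the 10% train-time dropout)
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + dt * v                   # Euler step along the straight path
    return x

# Toy velocity field: pulls toward a fixed target only when conditioned.
target = np.ones(4)
def velocity(x, t, cond):
    return (target - x) if cond is not None else np.zeros_like(x)

x0 = np.zeros(4)
x1 = sample_rectified_flow(velocity, x0, cond="on", cfg_scale=1.0, nfe=20)
# With cfg_scale=1.0 each step is x <- 0.95*x + 0.05*target,
# so every component equals 1 - 0.95**20 (about 0.64) after 20 steps.
```

The ~50× speedup over DDPM follows directly from the step counts: 20–50 NFE here versus ~1000 denoising steps in a vanilla DDPM sampler.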
Triplane Latent
Planes: XY · XZ · YZ
Resolution: 256 × 256
Channels: 32 per plane
Total values: ~6M latents
Point query: project → 3 planes → sum
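Point querying and the latent count follow directly from the numbers above. A sketch, assuming bilinear sampling on each plane (a common choice; the spec only states project → 3 planes → sum):

```python
# Triplane point query: project a 3D point onto the XY / XZ / YZ feature
# planes, bilinearly sample each 32-channel plane, and sum the results.
import numpy as np

RES, CH = 256, 32  # per-plane resolution and channels, from the spec

def bilinear(plane, u, v):
    """plane: (RES, RES, CH); u, v in [0, 1]."""
    x, y = u * (RES - 1), v * (RES - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, RES - 1), min(y0 + 1, RES - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[x0, y0] + fx * (1 - fy) * plane[x1, y0]
            + (1 - fx) * fy * plane[x0, y1] + fx * fy * plane[x1, y1])

def query_triplane(planes, p):
    """planes: dict with 'xy', 'xz', 'yz' arrays; p = (x, y, z) in [0, 1]^3."""
    x, y, z = p
    return (bilinear(planes['xy'], x, y)
            + bilinear(planes['xz'], x, z)
            + bilinear(planes['yz'], y, z))  # (32,) feature fed to a decoder MLP

planes = {k: np.zeros((RES, RES, CH)) for k in ('xy', 'xz', 'yz')}
feat = query_triplane(planes, (0.5, 0.25, 0.75))
total_latents = 3 * RES * RES * CH  # 6,291,456 — the "~6M latents" figure
```

This is why triplanes are cheap relative to dense voxels: a full 256³ × 32 grid would hold ~537M values, versus ~6M here, while any 3D point remains queryable.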
Backbone Stats
Architecture: DiT (Diffusion Transformer)
Layers: 24 transformer blocks
Hidden dim: 1536
Attention heads: 24
Conditioning: AdaLN-Zero
Parameters: ~2.3B
04 — Task-Specific Decoder Heads
🗺️ Atlas
⚒️ Forge
🖨️ Cast
🔭 Lens
Voxel Atlas
World & environment generation. Generates complete 3D scenes — terrain, structures, vegetation, sky. Supports infinite world tiling with seamless chunk stitching and biome-aware generation across 8 biome types.
.vox · .obj scene · .usd stage
Scene Layout Transformer
6-layer transformer over 32×32 spatial grid. Divides space into semantic regions: terrain, structures, vegetation, sky, water.
Region-wise NeRF Decoder
Per-region MLP: 3D coords + triplane → density + RGB + semantic label. Marching cubes extraction per region.
Infinite World Tiling
Generates seamless adjacent chunks. Tiling latent conditioning ensures consistent biome transitions at borders.
LOD Generator
Auto-generates 4 levels of detail per scene object. Compatible with Unity/Unreal LOD systems.
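A toy stand-in for the layout stage, using the five semantic regions named above. The horizon-based split is invented purely for illustration; the real Scene Layout Transformer is a learned 6-layer model, not a rule.

```python
# Toy sketch of the Scene Layout Transformer's output contract: a 32x32
# spatial grid where every cell carries one semantic region label.
import numpy as np

REGIONS = ("terrain", "structures", "vegetation", "sky", "water")  # from the spec

def toy_layout(size=32, horizon=12):
    """Rows above a horizon become 'sky'; everything else 'terrain'."""
    grid = np.full((size, size), REGIONS.index("terrain"))
    grid[:horizon, :] = REGIONS.index("sky")
    return grid

layout = toy_layout()
sky_cells = int((layout == REGIONS.index("sky")).sum())  # 12 rows * 32 cols = 384
```

Downstream, each labelled region gets its own NeRF decoder pass, which is what lets marching cubes run per region rather than over the whole scene at once.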
Decoder params: ~400M
Total model size: ~2.7B
VRAM (BF16): ~22GB
Small scene: ~8s on A100
Large chunk (256³): ~35s on A100
Biome types: 8
LOD levels: 4
Max world resolution: 256×256×128 voxels
Voxel Forge
Game-ready 3D asset generation with clean topology and full PBR texture maps. Supports characters, objects, props, vehicles, and architectural elements. Topology-optimized for animation rigging.
.obj + .mtl · .glb / .gltf · .fbx · .usdz
Occupancy Network + Marching Cubes
MLP: 3D point + triplane → occupancy probability. Differentiable marching cubes produces initial raw mesh.
Mesh Refinement GNN
Graph neural network over mesh vertices/edges. 8 message-passing rounds. Predicts vertex position offsets for clean quad-dominant topology.
UV Unwrapper + Texture Diffusion
Learned UV unwrapping (SeamlessUV lineage). 2D flow matching in UV space generates albedo, roughness, metallic, normal maps at 1K–2K resolution.
Topology & Animation Optimizer
Enforces edge loops for rigging. Optional bilateral symmetry. Scale normalized to real-world meters.
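The occupancy stage can be illustrated with a toy analytic occupancy function in place of the learned MLP; the sphere, grid bounds, and resolution below are assumptions for the sketch, and marching cubes itself is omitted.

```python
# Sketch: dense occupancy query over a coarse grid. The real pipeline queries
# an MLP conditioned on triplane features; here a unit sphere stands in.
import numpy as np

def occupancy(points):
    """Toy occupancy: 1 inside the unit sphere, 0 outside."""
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(np.float32)

n = 32
lin = np.linspace(-1.2, 1.2, n)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing='ij'), axis=-1)  # (n, n, n, 3)
occ = occupancy(grid.reshape(-1, 3)).reshape(n, n, n)
# Marching cubes would extract the 0.5-level set of `occ` as the raw mesh,
# which the refinement GNN then cleans into quad-dominant topology.
```

Querying on a regular grid like this is what makes the coarse-to-fine curriculum natural: the same occupancy function can be evaluated at 64³ for previews and 256³ for final extraction.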
Decoder params: ~350M
Total model size: ~2.65B
VRAM (BF16): ~21GB
Low poly (≤5K tris): ~6s on A100
Mid poly (≤50K tris): ~15s on A100
High poly (≤500K tris): ~45s on A100
Texture resolution: 1024 or 2048px
LOD levels: 4 (100/50/25/10%)
Voxel Cast
Physically valid 3D printable model generation. Every Cast output is guaranteed watertight, manifold, zero self-intersections, minimum wall thickness enforced. Supports FDM, SLA, and resin printing workflows.
.stl · .obj (watertight) · .step (CAD) · .3mf
SDF Decoder → Dual Marching Cubes
MLP outputs signed distance field. Dual marching cubes guarantees watertight topology — no holes by construction.
Printability Validator (hard constraints)
Wall thickness ≥ 1.2mm enforced. Overhang > 45° flagged. Manifold checker + auto-repair. All outputs pass validation before delivery.
Hollowing Engine
Auto-hollows solid objects with configurable wall thickness. Adds drain holes. Reduces material use by up to 80%.
Interlocking Part Splitter
Splits large objects into printable parts with generated snap-fit joints. Validates part scale against Bambu, Prusa, and Ender bed sizes.
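The core manifold test such a validator runs is small enough to sketch: a closed triangle mesh is edge-manifold exactly when every edge is shared by two faces. This is a simplified illustration of one check, not the full validator.

```python
# Edge-manifold check: count how many faces reference each undirected edge.
# In a watertight, manifold mesh every edge must appear exactly twice.
from collections import Counter

def is_edge_manifold(faces):
    """faces: list of (a, b, c) vertex-index triangles."""
    edge_counts = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            edge_counts[tuple(sorted(e))] += 1
    return all(count == 2 for count in edge_counts.values())

# A tetrahedron (4 faces) is closed and manifold; removing a face opens
# boundary edges that appear only once, so the check fails.
tet = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(is_edge_manifold(tet))      # True
print(is_edge_manifold(tet[:3]))  # False
```

The "non-manifold edges: 0" guarantee in the stats below is exactly this invariant; edges counted once (holes) or three-plus times (T-junctions) both break slicers.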
Decoder params: ~200M
Total model size: ~2.5B
VRAM (BF16): ~20GB
Guaranteed: watertight + manifold
Min wall thickness: 1.2mm
Hollowing saving: up to 80% material
Supported printers: Bambu · Prusa · Ender
Non-manifold edges: 0 (guaranteed)
Voxel Lens
Photorealistic 3D scene generation via Neural Radiance Fields and 3D Gaussian Splatting. Optimized for VR/AR visualization, cinematic rendering, and dynamic animated scenes from video input.
.ply (3DGS) · NeRF weights · .mp4 render · depth maps
Gaussian Parameter Decoder
Per-Gaussian: position (3) + rotation (4) + scale (3) + opacity (1) + SH coefficients (48). Targets 500K–3M Gaussians per scene. Adaptive densification + pruning.
NeRF Branch (Instant-NGP style)
Hash-grid encoder + tiny MLP. Runs in parallel with 3DGS branch. Used for scenes requiring higher photometric accuracy.
Dynamic Scene Support
Temporal Gaussian sequences for animated scenes. Accepts video input → extracts motion → generates temporally consistent 3DGS.
Compression Module
Reduces 3DGS file size by 60–80% with minimal quality loss. Critical for web and mobile delivery of Gaussian scenes.
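The per-Gaussian layout above fixes the raw storage cost, which is what makes the compression module necessary. A quick arithmetic check (float32 storage assumed):

```python
# Per-Gaussian parameter counts from the spec, and the uncompressed size
# they imply at the 3M-Gaussian ceiling.
PARAMS = {
    "position": 3,       # xyz
    "rotation_quat": 4,  # quaternion
    "scale": 3,          # per-axis
    "opacity": 1,
    "sh_coeffs": 48,     # spherical-harmonic color coefficients
}
floats_per_gaussian = sum(PARAMS.values())            # 59 floats
raw_mb = 3_000_000 * floats_per_gaussian * 4 / 1e6    # float32 bytes -> MB
print(floats_per_gaussian, round(raw_mb))             # 59 708
```

Roughly 708MB raw for a 3M-Gaussian outdoor scene; a 60–80% reduction brings that to ~140–280MB, which is the difference between feasible and infeasible web/mobile delivery.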
Decoder params: ~500M
Total model size: ~2.8B
VRAM (BF16): ~22GB
Object-centric: ~12s on A100
Indoor scene: ~40s on A100
Outdoor scene: ~90s on A100
Max Gaussians: 3M per scene
File compression: 60–80% reduction
05 — 10 Shared Custom Modules
01
Cross-Modal Conditioning Fusion
CrossModalAttention over all active input types. Unified 1024-dim conditioning vector fed to backbone.
02
3D RoPE Encoder
RoPE adapted for triplane 3D spatial positions. Encodes XYZ coordinates with rotary positional embeddings.
03
Geometry Quality Scorer
Rates generated geometry [0–1] before output. Flags low-quality generations for re-sampling at higher NFE.
04
Semantic Label Head
Per-voxel/vertex semantic class prediction: wall, floor, ceiling, tree, water, metal, glass, fabric, etc.
05
Scale & Unit Manager
Enforces consistent real-world scale. All outputs tagged with unit metadata (meters). Validates print scale.
06
Material Property Head
Predicts PBR properties: roughness, metallic, IOR, subsurface scattering. Compatible with Blender/UE material graphs.
07
Confidence & Uncertainty Head
Per-region generation confidence. Flags uncertain areas in output metadata. Drives re-sampling priority.
08
Prompt Adherence Scorer
CLIP-based similarity score: how well the 3D output matches the input text prompt. Exposed in API response.
09
Multi-Resolution Decoder
Coarse-to-fine generation: 64³ → 128³ → 256³. Each stage refines the previous. Enables fast previews at 64³.
10
Style Embedding Module
Encodes style reference images into conditioning vector. Transfers art style, material aesthetic, and visual language to 3D output.
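Module 09's coarse-to-fine schedule can be sketched as two 2× upsamplings. Nearest-neighbour repetition stands in here for the decoder refinement that would run between stages; this is a shape-level illustration only.

```python
# Coarse-to-fine voxel schedule: 64^3 preview -> 128^3 -> 256^3 final.
# Each stage would be refined by the decoder; only the upsampling is shown.
import numpy as np

def upsample2x(vox):
    """Nearest-neighbour 2x upsampling along all three axes."""
    return vox.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

vox = np.zeros((64, 64, 64))  # fast-preview resolution
for _ in range(2):            # 64^3 -> 128^3 -> 256^3
    vox = upsample2x(vox)
print(vox.shape)              # (256, 256, 256)
```

The preview stage is 64× smaller than the final grid (64³ vs 256³ cells), which is what makes fast draft generations practical before committing to a full-resolution pass.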
06 — Voxel Prime — All-In-One
VOXEL PRIME
CLOSED SOURCE · API ONLY · ALL DECODER HEADS · 6 EXCLUSIVE MODULES
🟣 Closed Source API Access Only
Cross-Task Consistency
Ensures Atlas world + Forge assets + Lens scene all match visually when generated together in one call.
Scene Population Engine
Generates a world (Atlas) then auto-populates it with fitting assets (Forge). One prompt → full scene.
Pipeline Orchestrator
Chains Atlas → Forge → Cast → Lens in a single API request. Manages inter-model dependencies automatically.
4× Texture Upscaler
Photorealistic super-resolution on all generated textures. 512px base → 2048px final via flow matching in UV space.
Style Transfer Module
Apply artistic styles (Studio Ghibli, cyberpunk, brutalist, etc.) consistently across all output types in one generation.
Iterative Refinement
Text-guided editing of already-generated 3D content. "Make the roof taller" → re-runs only affected regions.
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],
  "inputs": {
    "image": "base64...",     // optional reference
    "video": "base64..."      // optional for dynamic scenes
  },
  "settings": {
    "quality": "high",            // draft | standard | high
    "style": "realistic",         // realistic | stylized | low-poly | ...
    "scale_meters": 100.0,
    "populate_scene": true     // Atlas → auto-populate with Forge assets
  }
}
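A hedged sketch of calling this endpoint from Python's standard library. Only the path and field names come from the example above; the host name and the auth header are assumptions, and the request is built but deliberately not sent.

```python
# Build (but do not send) a POST to the Prime generate endpoint shown above.
# Host and Authorization scheme are placeholders, not documented values.
import json
import urllib.request

payload = {
    "prompt": "A medieval castle on a cliff at sunset",
    "output_types": ["world", "mesh", "nerf"],
    "settings": {
        "quality": "high",
        "style": "realistic",
        "scale_meters": 100.0,
        "populate_scene": True,  # Atlas world auto-populated with Forge assets
    },
}
req = urllib.request.Request(
    "https://api.matrix.corp/v1/voxel/generate",  # assumed host; path from the spec
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer $MATRIX_API_KEY",  # placeholder credential
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncommenting would perform the call
```

Note that the optional `inputs` block (base64 image/video references) is omitted here; when present it would be merged into `payload` alongside `prompt` exactly as in the JSON example.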
07 — Training Data
Dataset | Size | Content | Used by
Objaverse-XL | 10M+ | Massive diverse 3D objects | Atlas · Forge · Cast · Lens
Objaverse | 800K+ | Diverse annotated 3D assets | Forge · Cast · Lens
ShapeNet | 55K | Common object categories | Forge · Cast
ScanNet / ScanNet++ | 1.5K scenes | Indoor 3D scans (RGB-D) | Atlas · Lens
KITTI / nuScenes | 40K frames | Outdoor driving 3D scenes | Atlas · Lens
ABO (Amazon Berkeley) | 148K | Product meshes + materials | Forge
Thingiverse | 2M+ | 3D printable STL models | Cast
Polycam Scans | ~500K | Real-world 3DGS / NeRF captures | Lens
Synthetic Renders | Generated | Multi-view renders of Objaverse | Atlas · Forge · Cast · Lens
Text-3D Pairs (synthetic) | Generated | GPT-4o descriptions of Objaverse | Atlas · Forge · Cast · Lens
08 — Model Sizes & VRAM
VOXEL ATLAS
Backbone: 2.3B
Decoder: ~400M
Total: ~2.7B
BF16 VRAM: ~22GB
INT8 VRAM: ~11GB
VOXEL FORGE
Backbone: 2.3B
Decoder: ~350M
Total: ~2.65B
BF16 VRAM: ~21GB
INT8 VRAM: ~11GB
VOXEL CAST
Backbone: 2.3B
Decoder: ~200M
Total: ~2.5B
BF16 VRAM: ~20GB
INT8 VRAM: ~10GB
VOXEL LENS
Backbone: 2.3B
Decoder: ~500M
Total: ~2.8B
BF16 VRAM: ~22GB
INT8 VRAM: ~11GB
VOXEL PRIME
Backbone: 2.3B
All Decoders: ~1.4B
Total: ~3.7B
BF16 VRAM: ~30GB
INT8 VRAM: ~15GB
✦  All specialist models (Atlas/Forge/Cast/Lens) fit comfortably on an A100 40GB in BF16.  ·  INT8 quantization brings them all under 15GB, viable on a consumer RTX 4090 (24GB).  ·  Voxel Prime requires an A100 40GB in BF16, or 2× RTX 4090 in INT8.
09 — Training Strategy
01
Backbone Pre-Training
  • Train shared DiT on Objaverse-XL triplane reconstructions
  • Text + single image conditioning only
  • Context: general 3D structure, no task specialization
  • ~100K steps on an A100 cluster
02
Decoder Head Training
  • Freeze backbone, train each head independently
  • Atlas: ScanNet + synthetic world data
  • Forge: ShapeNet + Objaverse + textures
  • Cast: Thingiverse + watertight synthetic meshes
  • Lens: Polycam + synthetic multi-view renders
  • ~50K steps each (parallel runs)
03
Joint Fine-Tuning
  • Unfreeze backbone, full end-to-end per model
  • Add all input modalities (video, multi-view, point cloud)
  • Multi-resolution curriculum: 64³ → 128³ → 256³
  • ~30K steps each
04
Prime Training
  • Init from jointly fine-tuned backbone
  • All 4 decoder heads trained simultaneously
  • Cross-task consistency losses
  • Pipeline orchestrator + style transfer modules
  • ~50K steps