## MoonViT - PyTorch

This is an ultra-simple, single-file PyTorch implementation of MoonViT, the native-resolution vision encoder from Kimi-VL. I implemented this model because I think it's a great ViT variant: it can ingest images of arbitrary resolution and aspect ratio at scale, with no resizing or padding.
## Install
```bash
$ pip install open-moonvit
```
Or from source:
```bash
$ git clone https://github.com/kyegomez/open-moonvit
$ cd open-moonvit
$ pip install -e .
```
FlashAttention is optional. If `flash_attn` is importable and you're on CUDA, the var-length kernel is used automatically. Otherwise a block-diagonal SDPA fallback runs on CPU / MPS / CUDA with no extra dependencies.
```bash
$ pip install flash-attn --no-build-isolation # optional
```
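The fallback can be sketched as follows. This is a hypothetical `block_diagonal_sdpa` (illustrative, not the library's actual function): it builds a block-diagonal mask from `cu_seqlens` so that tokens in the packed sequence only attend within their own image, then dispatches to plain SDPA.

```python
import torch
import torch.nn.functional as F

def block_diagonal_sdpa(q, k, v, cu_seqlens):
    """Hypothetical sketch of the SDPA fallback.

    q, k, v: (1, heads, L_total, head_dim) -- one packed sequence.
    cu_seqlens: image boundaries, e.g. [0, 320, 460, 1036].
    """
    L = q.shape[-2]
    # Label every token with the index of the image it came from.
    seq_ids = torch.zeros(L, dtype=torch.long, device=q.device)
    for i in range(len(cu_seqlens) - 1):
        seq_ids[cu_seqlens[i]:cu_seqlens[i + 1]] = i
    # True where query and key belong to the same image -> block-diagonal mask.
    mask = seq_ids[:, None] == seq_ids[None, :]
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

This trades the var-length kernel's memory savings for portability: the `(L, L)` boolean mask is materialized, but no CUDA-only dependency is needed.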
## Usage
```python
import torch
from open_moonvit import MoonViT, MoonViTConfig, MLPProjector
encoder = MoonViT(MoonViTConfig()) # ~413M params, SigLIP-SO-400M defaults
# a batch of images at different resolutions, no padding, no resizing
images = [
    torch.randn(3, 224, 280),
    torch.randn(3, 140, 196),
    torch.randn(3, 336, 336),
]
out = encoder(images)
out.last_hidden_state # (L_total, 1152) packed patch tokens
out.cu_seqlens # (4,) int32 image boundaries in the packed seq
out.grid_shapes # [(16,20), (10,14), (24,24)]
```
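The metadata above follows directly from the patch size, which the `(16, 20)` grid for a 224×280 image implies is 14: each image contributes `(H // 14) × (W // 14)` tokens, and `cu_seqlens` holds their running offsets in the packed sequence. A minimal sketch of that bookkeeping (`packing_metadata` is an illustrative helper, not part of the package):

```python
import torch

def packing_metadata(images, patch_size=14):
    # Per-image token grids and cumulative boundaries of the packed sequence.
    grids, cu = [], [0]
    for img in images:
        _, h, w = img.shape
        gh, gw = h // patch_size, w // patch_size
        grids.append((gh, gw))
        cu.append(cu[-1] + gh * gw)
    return grids, torch.tensor(cu, dtype=torch.int32)

grids, cu = packing_metadata([
    torch.randn(3, 224, 280),
    torch.randn(3, 140, 196),
    torch.randn(3, 336, 336),
])
grids  # [(16, 20), (10, 14), (24, 24)]
cu     # tensor([0, 320, 460, 1036], dtype=torch.int32)
```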
To feed an LLM, compose with the MLP projector (2×2 pixel-shuffle then a two-layer MLP):
```python
projector = MLPProjector(
    vision_hidden_size = 1152,
    llm_hidden_size = 2048,
)
tokens, grids, cu = projector(out.last_hidden_state, out.grid_shapes, out.cu_seqlens)
tokens.shape # (L_total // 4, 2048)
```
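The 4× token reduction comes from the 2×2 pixel shuffle: every 2×2 neighborhood of patch tokens is folded into a single token with 4× the channels before the two-layer MLP projects it to the LLM width. A hedged sketch of the rearrangement for one image (names are illustrative, not the package's API):

```python
import torch

def pixel_shuffle_2x2(tokens, grid):
    # tokens: (gh * gw, C) patch tokens for one image, row-major over the grid.
    gh, gw = grid
    c = tokens.shape[-1]
    # Split rows and columns into 2x2 blocks...
    x = tokens.view(gh // 2, 2, gw // 2, 2, c)
    # ...then emit one token per block, concatenating the 4 neighbors' channels.
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, 4 * c)
    return x

x = torch.randn(16 * 20, 1152)        # the (16, 20) grid from the usage example
y = pixel_shuffle_2x2(x, (16, 20))
y.shape  # torch.Size([80, 4608])
```

So a grid of `gh × gw` tokens at width 1152 becomes `(gh // 2) × (gw // 2)` tokens at width 4608, which the MLP then maps to `llm_hidden_size`; summed over the batch this is the `L_total // 4` above.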
## How it works
```mermaid
flowchart TD
A([Images]) --> B[MoonViTPatchEmbed]
B --> C[AbsolutePosEmbedInterpolator]
subgraph enc[MoonViT]
C --> D[MoonViTEncoderLayer × 27]
R[RotaryEmbedding2D] -.-> D
D --> LN[post LayerNorm]
end
LN --> PS[PixelShuffle2x]
subgraph proj[MLPProjector]
PS --> MLP[Linear · Act · Linear]
end
MLP --> OUT([LLM Tokens])
```
Four things to internalize:
1. **Packing, not padding.** Images of different shapes become one long sequence. No wasted compute on pad tokens.
2. **Two positional embeddings, added together.** The paper i