From Base Model to Aligned Assistant
- Multi-stage alignment: SFT → DPO → GRPO
- Multi-dimensional reward modeling
- Safety alignment using Constitutional AI principles
- Vision-language training: 4-stage process
Figure 17: Tag distribution in VLM instruction data
After pre-training, we employed a multi-stage alignment process: Supervised Fine-Tuning (SFT),
Direct Preference Optimization (DPO), and a modified variant of Group Relative Policy Optimization (GRPO).
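To make the two preference-optimization stages concrete, here is a minimal sketch of the standard DPO pairwise loss and a GRPO-style group-relative advantage computation. This illustrates the textbook forms of both objectives; the text says the GRPO variant was modified, and those modifications are not specified here, so treat this as a generic sketch rather than the actual training code.

```python
import math
import statistics


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]),
    where ref_* are log-probs under the frozen reference policy."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


def group_relative_advantages(rewards):
    """GRPO-style advantages: sample a group of responses per prompt,
    then score each response relative to the group mean, normalized
    by the group's standard deviation (no learned value function)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


# Equal policy/reference log-probs => zero margin => loss = ln 2.
loss = dpo_loss(0.0, 0.0, 0.0, 0.0)

# Four sampled responses for one prompt; advantages center on zero.
adv = group_relative_advantages([1.0, 0.5, 0.0, 1.5])
```

The group-relative normalization is what lets GRPO drop the critic network used in PPO: each response's advantage comes from comparing it against its siblings sampled for the same prompt.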
Reward modeling was multi-dimensional, scoring responses along four axes: correctness, truthfulness, helpfulness, and harmlessness.
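A multi-dimensional reward can be collapsed into a scalar training signal in several ways; a weighted sum is the simplest. The sketch below assumes per-dimension scores in [0, 1] and equal weights, both of which are illustrative assumptions, since the text does not specify how the four dimensions were combined.

```python
# Dimensions named in the text; the equal weights are an assumption
# for illustration, not the actual aggregation used in training.
DIMENSIONS = ("correctness", "truthfulness", "helpfulness", "harmlessness")


def combine_rewards(scores, weights=None):
    """Collapse per-dimension scores (each in [0, 1]) into one scalar
    reward via a weighted sum. Raises KeyError if a dimension is missing."""
    weights = weights or {d: 1.0 / len(DIMENSIONS) for d in DIMENSIONS}
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


reward = combine_rewards({
    "correctness": 1.0,
    "truthfulness": 1.0,
    "helpfulness": 0.5,
    "harmlessness": 1.0,
})
```

In practice the weights can also be tuned per stage, e.g. upweighting harmlessness during safety alignment.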
Our vision-language model was trained in a 4-stage process to integrate visual capabilities.
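Staged VLM training typically unfreezes components progressively, e.g. warming up a vision-to-language projector before tuning the full model. The stage names and trainable-module lists below are entirely hypothetical, since the text does not describe what each of the four stages contains; the sketch only shows the shape such a schedule might take.

```python
# Hypothetical 4-stage schedule for illustration only: the actual stage
# contents are not specified in the source text. "projector" here means
# an assumed vision-to-language adapter module.
VLM_STAGES = [
    {"stage": 1, "name": "projector_warmup",    "trainable": ["projector"]},
    {"stage": 2, "name": "multimodal_pretrain", "trainable": ["projector", "vision_encoder"]},
    {"stage": 3, "name": "instruction_tuning",  "trainable": ["projector", "llm"]},
    {"stage": 4, "name": "joint_finetune",      "trainable": ["projector", "vision_encoder", "llm"]},
]


def modules_to_train(stage_index):
    """Return the module names unfrozen at a given 1-indexed stage."""
    return VLM_STAGES[stage_index - 1]["trainable"]
```

A schedule like this keeps the pretrained language and vision backbones stable early on, limiting gradient noise from a randomly initialized projector.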