Meet DanQing: a 100M‑pair Chinese vision‑language dataset
Chinese AI that can both see and read needs better data. Meet DanQing—an up‑to‑date Chinese vision‑language dataset with 100M image–text pairs. Built from 2024–2025 Common Crawl data and curated with a rigorous pipeline, DanQing aims to power stronger multimodal models.
Why it matters
- Bridges the gap with English-language datasets like LAION and COYO; Chinese vision–language pretraining (VLP) has long lacked comparable scale and quality.
- Fresher semantics: reflects new brands, slang, and events from recent years.
- Higher data quality via stricter selection to reduce noise and mismatches.
- Proven gains: continually pretraining SigLIP2 on DanQing yields consistent improvements on Chinese zero‑shot classification, cross‑modal retrieval, and LMM benchmarks.
- Open and usable: released under CC‑BY 4.0 for broad research and product use.
Paper: https://arxiv.org/abs/2601.10305
#AI #MachineLearning #ComputerVision #Multimodal #VisionLanguage #Chinese #Dataset #OpenSource #CLIP #SigLIP #Research #DanQing