From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation

Meet a new way to write captions and hashtags for fashion photos—grounded in what is actually in the picture.

  • Detects multiple garments in an image with a YOLO-based model.
  • Extracts dominant colors and infers fabric and gender by retrieving similar products with CLIP and FAISS.
  • Packages these facts as an evidence pack that steers a large language model to stay faithful while sounding stylish.
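To make the retrieval step concrete, here is a minimal sketch of how an evidence pack could be assembled: embed the garment crop, find the nearest catalog products, and majority-vote their attributes. This is an illustrative stand-in only — the toy catalog, random embeddings, and numpy nearest-neighbor search are assumptions replacing the paper's actual CLIP encoder and FAISS index.

```python
import numpy as np
from collections import Counter

# Hypothetical mini-catalog; in the real pipeline these entries would be
# products retrieved from a FAISS index built over CLIP image embeddings.
CATALOG = [
    {"fabric": "denim",  "gender": "women"},
    {"fabric": "denim",  "gender": "women"},
    {"fabric": "cotton", "gender": "men"},
    {"fabric": "denim",  "gender": "unisex"},
]

def retrieve_top_k(query, index, k=3):
    """Cosine-similarity top-k over L2-normalized embeddings (numpy stand-in for FAISS)."""
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    sims = index @ query
    return np.argsort(-sims)[:k]

def build_evidence_pack(query_emb, index_embs, catalog, k=3):
    """Majority-vote fabric/gender attributes from the k nearest catalog items."""
    ids = retrieve_top_k(query_emb, index_embs, k)
    fabrics = Counter(catalog[i]["fabric"] for i in ids)
    genders = Counter(catalog[i]["gender"] for i in ids)
    return {
        "fabric": fabrics.most_common(1)[0][0],
        "gender": genders.most_common(1)[0][0],
        "neighbors": [int(i) for i in ids],
    }

# Simulated embeddings: the query is a slightly perturbed copy of catalog item 0,
# so item 0 should be its nearest neighbor.
rng = np.random.default_rng(0)
index_embs = rng.normal(size=(4, 8))
query_emb = index_embs[0] + 0.01 * rng.normal(size=8)

pack = build_evidence_pack(query_emb, index_embs, CATALOG, k=3)
print(pack)
```

The resulting dict (dominant fabric, inferred gender, neighbor IDs) is the kind of structured evidence that can be serialized into the LLM prompt so the generated caption stays tied to retrieved facts rather than free association.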

Why it matters: classic end-to-end captioners often miss garment attributes or hallucinate ones that are not in the image. Retrieval-augmented generation preserves the stylish tone while improving factual grounding.

Results: the detector reached 0.71 mAP across nine garment types. The RAG-LLM pipeline delivered more attribute-aligned captions and hashtags with higher coverage (including full coverage at the 50% threshold), while a fine-tuned BLIP baseline showed higher word overlap but weaker generalization.

Takeaway: blend vision, retrieval, and LLMs to scale accurate, on-brand fashion copy across products and shoots.

Paper: https://arxiv.org/abs/2511.19149v1

Register: https://www.AiFeta.com

#AI #FashionTech #ComputerVision #GenAI #RAG #LLM #YOLO #CLIP #BLIP #ecommerce #arXiv
