
CommunityNews
What If We Recaption Billions of Web Images with LLaMA-3?
What If We Recaption Billions of Web Images with LLaMA-3?.
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users’ text instructions, especially in following complex queries. Our project page is What If We Recaption Billions of Web Images with LLaMA-3?
Read in full here:
This thread was posted by one of our members via one of our news source trackers.
Popular General Dev topics










Other popular topics










Categories:
Sub Categories:
- All
- In The News
- Dev Chat (200)
- Questions (30)
- Resources (118)
- Blogs/Talks (26)
- Jobs (3)
- Events (14)
- Code Editors (58)
- Hardware (57)
- Reviews (4)
- Sales (15)
- Design & UX (4)
- Marketing & SEO (1)
- Industry & Culture (14)
- Ethics & Privacy (19)
- Business (4)
- Learning Methods (4)
- Content Creators (7)
- DevOps & Hosting (9)
Popular Portals
- /elixir
- /rust
- /wasm
- /ruby
- /erlang
- /phoenix
- /keyboards
- /rails
- /js
- /python
- /security
- /go
- /swift
- /vim
- /clojure
- /java
- /haskell
- /emacs
- /svelte
- /onivim
- /typescript
- /crystal
- /c-plus-plus
- /tailwind
- /kotlin
- /gleam
- /react
- /flutter
- /elm
- /ocaml
- /ash
- /vscode
- /opensuse
- /centos
- /php
- /deepseek
- /scala
- /html
- /zig
- /debian
- /nixos
- /lisp
- /agda
- /textmate
- /sublime-text
- /react-native
- /kubuntu
- /arch-linux
- /revery
- /ubuntu
- /manjaro
- /django
- /spring
- /diversity
- /lua
- /nodejs
- /julia
- /slackware
- /c
- /neovim