CommunityNews

CommunityNews

MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Where Next?

Popular General Dev topics Top

First poster: bot
Kinesis Advantage360 Ergonomic Keyboard. Split-adjustable, contoured design that maximizes comfort and boosts productivity. Mechanical s...
New
CommunityNews
…or, “why make programming even harder?” Learning functional programming is an opportunity to discover a new way to represent programs, t...
New
First poster: bot
sqlglot/python_sql_engine.md at main · tobymao/sqlglot. Python SQL Parser and Transpiler. Contribute to tobymao/sqlglot development by c...
New
CommunityNews
SLUM: The Shadow Library Uptime Monitor. This dashboard tracks the availability of popular shadow libraries in real time from a US-based...
New
First poster: AstonJ
Truly independent web browser. Contribute to LadybirdBrowser/ladybird development by creating an account on GitHub.
New
First poster: dyowee
olmOCR is an open-source tool for converting PDFs to text with high accuracy, preserving reading order and supporting tables, equations, ...
New
First poster: alvinkatojr
There are countless articles why developers should not focus on Frameworks too much and instead learn to understand the underlying langua...
New
First poster: alvinkatojr
About accelerationism, NRx, and the intersection of technology, religion, and philosophy: an analysis of the essential ideas in the new A...
New
CommunityNews
Rendering Action Mailer emails with Phlex components and layouts: Clean, Composable, and Completely Ruby - Blog post by Camillo Visini
New
CommunityNews
The French originated the meter in the 1790s as one/ten-millionth of the distance from the equator to the north pole along a meridian thr...
New

Other popular topics Top

brentjanderson
Bought the Moonlander mechanical keyboard. Cherry Brown MX switches. Arms and wrists have been hurting enough that it’s time I did someth...
New
AstonJ
Curious to know which languages and frameworks you’re all thinking about learning next :upside_down_face: Perhaps if there’s enough peop...
New
AstonJ
Do the test and post your score :nerd_face: :keyboard: If possible, please add info such as the keyboard you’re using, the layout (Qw...
New
dimitarvp
Small essay with thoughts on macOS vs. Linux: I know @Exadra37 is just waiting around the corner to scream at me “I TOLD YOU SO!!!” but I...
New
PragmaticBookshelf
Tailwind CSS is an exciting new CSS framework that allows you to design your site by composing simple utility classes to create complex e...
New
DevotionGeo
The V Programming Language Simple language for building maintainable programs V is already mentioned couple of times in the forum, but I...
New
Margaret
Hello everyone! This thread is to tell you about what authors from The Pragmatic Bookshelf are writing on Medium.
1147 28379 760
New
PragmaticBookshelf
Build efficient applications that exploit the unique benefits of a pure functional language, learning from an engineer who uses Haskell t...
New
AstonJ
If you want a quick and easy way to block any website on your Mac using Little Snitch simply… File > New Rule: And select Deny, O...
New
DevotionGeo
I have always used antique keyboards like Cherry MX 1800 or Cherry MX 8100 and almost always have modified the switches in some way, like...
New