CommunityNews

CommunityNews

AudioX: Diffusion Transformer for Anything-to-Audio Generation

Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture.

Read in full here:

Where Next?

Popular General Dev topics Top

Exadra37
As part of our continued goal of helping developers provide safer products for businesses and consumers, we here at McAfee Advanced Threa...
New
First poster: AstonJ
https://permission.site/ This thread was posted by one of our members via one of our news source trackers.
New
First poster: dimitarvp
On Wednesday last week, Google’s Fiona Cicconi wrote to company employees. She announced that Google was bringing forward its timetable ...
New
First poster: bot
Site Fingerprinting google.com Yes youtube.com Yes Amazon.com Yes Yahoo.com Yes Zoom.us No Facebook.com Yes Reddit.com Ye...
New
First poster: bot
It has some interesting features: It’s entirely wireless (the left half speaks Bluetooth to the right half, and the right half speaks B...
New
First poster: dimitarvp
A career ending mistake — Bitfield Consulting. As software engineers, we’re constantly making detailed, elaborate plans for computers to...
New
First poster: dani
The pool of talented C++ developers is running dry. Highly sought after, rarely provided.
New
First poster: FatimaAdamu
Two US lawyers fined for submitting fake court citations from ChatGPT. Law firm also penalised after chatbot invented six legal cases th...
New
First poster: alvinkatojr
There are countless articles why developers should not focus on Frameworks too much and instead learn to understand the underlying langua...
New
New

Other popular topics Top

PragmaticBookshelf
Learn from the award-winning programming series that inspired the Elixir language, and go on a step-by-step journey through the most impo...
New
dasdom
No chair. I have a standing desk. This post was split into a dedicated thread from our thread about chairs :slight_smile:
New
DevotionGeo
I know that -t flag is used along with -i flag for getting an interactive shell. But I cannot digest what the man page for docker run com...
New
AstonJ
In case anyone else is wondering why Ruby 3 doesn’t show when you do asdf list-all ruby :man_facepalming: do this first: asdf plugin-upd...
New
PragmaticBookshelf
Use WebRTC to build web applications that stream media and data in real time directly from one user to another, all in the browser. ...
New
New
New
New
sir.laksmana_wenk
I’m able to do the “artistic” part of game-development; character designing/modeling, music, environment modeling, etc. However, I don’t...
New
AstonJ
Curious what kind of results others are getting, I think actually prefer the 7B model to the 32B model, not only is it faster but the qua...
New