InspireMusic: A Unified Framework for Controlled High-Fidelity Long-Form Music, Song and Audio Generation

[Code] [Studio: Modelscope] [Spaces: HuggingFace] [Models: Modelscope] [Models: HuggingFace] [Technical Report (coming soon)]


Tongyi Lab

Alibaba Group

Abstract We introduce InspireMusic, a unified framework for generating high-fidelity music, songs, and audio that integrates an autoregressive transformer with a super-resolution flow-matching model. The framework generates high-fidelity long-form audio at 48kHz from both text and audio inputs. Unlike previous approaches, InspireMusic uses dual audio tokenizers: a high-bitrate compression audio tokenizer that carries richer semantic information, thereby reducing training costs and improving efficiency, and an acoustic codec that preserves fine-grained acoustic details during flow-matching model training. This combination yields high-quality audio generation with long-form coherence. An autoregressive transformer based on Qwen2.5 first predicts 75Hz audio tokens. A super-resolution flow-matching model then learns the latent features of the audio from a 150Hz music tokenizer, and a vocoder finally produces high-quality audio waveforms. By directly modeling raw audio, the framework targets both diversity and high-fidelity output.

Highlights

    - Long-form music generation.
    - High audio quality, with support for 48kHz and 24kHz sampling rates.
    - A unified, high-efficiency music generation framework.

Overview of InspireMusic

Figure 1. An overview of the InspireMusic framework.

We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing high-quality 48kHz long-form audio. InspireMusic consists of three key components:

    - **Dual Audio Tokenizers**: The framework first converts raw audio waveforms into discrete tokens that are efficiently processed by the autoregressive model. We employ two tokenizers: WavTokenizer converts 24kHz audio into 75Hz discrete tokens, while Hifi-Codec transforms 48kHz audio into 150Hz latent features suited for our flow matching model.
    - **Autoregressive Transformer**: This component is trained using a next-token prediction approach on both text and audio tokens, enabling it to generate coherent and contextually relevant audio sequences.
    - **Super-Resolution Flow Matching Model**: An ODE-based diffusion model, specifically a super-resolution flow matching (SRFM) model, maps the lower-resolution audio tokens to latent features with a higher sampling rate. A vocoder then generates the final audio waveform from these enhanced latent features.

InspireMusic supports a range of tasks including text-to-music, music continuation, music reconstruction, and music super-resolution.
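
The token rates quoted above imply a fixed number of waveform samples per token or latent frame. The following is a small sketch of that arithmetic, using only the figures stated in the overview (24kHz to 75Hz, 48kHz to 150Hz). The function names and the 30-second clip length are illustrative assumptions and are not part of the InspireMusic code base.

```python
def samples_per_frame(sample_rate_hz: int, frame_rate_hz: int) -> int:
    """Number of waveform samples summarized by one token or latent frame."""
    if sample_rate_hz % frame_rate_hz != 0:
        raise ValueError("frame rate must divide the sample rate evenly")
    return sample_rate_hz // frame_rate_hz


def frames_for_duration(frame_rate_hz: int, seconds: float) -> int:
    """Number of tokens or latent frames needed to cover a clip."""
    return round(frame_rate_hz * seconds)


# Figures taken from the overview above.
AR_TOKEN_RATE_HZ = 75    # WavTokenizer tokens, computed from 24kHz audio
LATENT_RATE_HZ = 150     # Hifi-Codec latent features, computed from 48kHz audio

clip_seconds = 30.0      # illustrative clip length (assumption)

print(samples_per_frame(24_000, AR_TOKEN_RATE_HZ))           # 320
print(samples_per_frame(48_000, LATENT_RATE_HZ))             # 320
print(frames_for_duration(AR_TOKEN_RATE_HZ, clip_seconds))   # 2250
print(frames_for_duration(LATENT_RATE_HZ, clip_seconds))     # 4500
print(LATENT_RATE_HZ // AR_TOKEN_RATE_HZ)                    # 2
```

Under these figures both representations cover the same number of samples per frame, and the super-resolution stage doubles the frame rate, so the autoregressive model only has to predict half as many steps as the latent sequence contains.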

Text-to-Music Generation

This section showcases a collection of musical pieces generated by InspireMusic.

The samples below were generated by InspireMusic and MusicGen to compare short-form text-to-music generation.

| Text Description | InspireMusic | InspireMusic w/o CFM | MusicGen-Small | MusicGen-Medium | MusicGen-Large |
| --- | --- | --- | --- | --- | --- |
| Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance. | [audio] | [audio] | [audio] | [audio] | [audio] |

Music Generation w/ Different Sampling Methods

This section showcases a collection of musical pieces generated by InspireMusic, utilizing repetition-aware sampling and top-k sampling methods, respectively.
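
For readers unfamiliar with repetition-aware sampling, here is a minimal sketch of one common formulation: draw a token with nucleus (top-p) sampling, and if that token already makes up too large a share of the recent history, redraw it from the full distribution. Whether InspireMusic applies exactly this rule is an assumption, and the names and default values (`top_p`, `window`, `max_ratio`) are hypothetical choices made for illustration.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of tokens whose total probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    top_p: float = 0.8,      # assumed value
    window: int = 10,        # assumed value
    max_ratio: float = 0.1,  # assumed value
) -> int:
    """Nucleus sampling with a fallback when the drawn token repeats too often."""
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent:
        ratio = sum(1 for t in recent if t == token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: redraw from the full, untruncated distribution.
            token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token


# Usage: a toy, fixed distribution standing in for the model's next-token output.
rng = random.Random(0)
toy_probs = [0.9, 0.05, 0.05]
generated: List[int] = []
for _ in range(20):
    generated.append(repetition_aware_sample(toy_probs, generated, rng))
print(generated)
```

With a sharply peaked distribution, plain nucleus sampling would emit the same token at every step. The fallback gives low-probability tokens a chance once the history becomes repetitive.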

| Text Description | w/ Repetition-Aware Sampling w/ CFM | w/ Repetition-Aware Sampling w/o CFM | w/ Top-K Sampling w/ CFM | w/ Top-K Sampling w/o CFM |
| --- | --- | --- | --- | --- |
| Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance. | [audio] | [audio] | [audio] | [audio] |

Music Generation with & without Conditional Flow Matching

This section showcases a collection of musical pieces generated by InspireMusic with and without Conditional Flow Matching (CFM), utilizing repetition-aware sampling.
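
As background for the comparison, flow matching generates its output by integrating an ODE dx/dt = v(x, t) from noise at t = 0 to the target features at t = 1. The sketch below shows that integration with a fixed-step Euler solver. The velocity field is a hand-written toy that stands in for a trained network, and all names here are hypothetical; the actual InspireMusic model, its conditioning on audio tokens, and its solver settings are not described on this page.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_integrate(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a trained velocity network (assumption).

    On the straight-line path x_t = (1 - t) * x0 + t * target, the velocity
    at x_t equals (target - x_t) / (1 - t), which is what this returns.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Usage: start from Gaussian noise and flow toward a fixed 3-dimensional target.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
target = [1.0, -2.0, 0.5]
result = euler_integrate(make_toy_velocity(target), noise, num_steps=10)
print([round(r, 6) for r in result])
```

In a trained model the velocity comes from a neural network and the path is only approximately straight, so the number of Euler steps trades speed against accuracy.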

| Music Structure | Genre | Text Description | Audio Output w/ Conditional Flow Matching | Audio Output w/o Conditional Flow Matching |
| --- | --- | --- | --- | --- |
| Intro | Instrumental | A delightful collection of classical keyboard music, purely instrumental, exuding a timeless and elegant charm. | [audio] | [audio] |
| Chorus | R&B | A soothing blend of instrumental and R&B rhythms, featuring serene and calming melodies. | [audio] | [audio] |
| Verse | Jazz | Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance. | [audio] | [audio] |
| Outro | Rock | The instrumental rock piece features dynamic oscillations and wave-like progressions, creating an immersive and energetic atmosphere. The music is purely instrumental, with no vocals, and it blends elements of rock and post-rock for a powerful and evocative experience. | [audio] | [audio] |

Long-Form Music Generation

| Text Prompt | InspireMusic-1.5B-Long | InspireMusic-1.5B-Long_no_flow |
| --- | --- | --- |
| The instrumental rock piece features dynamic oscillations and wave-like progressions, creating an immersive and energetic atmosphere. The music is purely instrumental, with no vocals, and it blends elements of rock and post-rock for a powerful and evocative experience. | [audio] | [audio] |

Music Continuation

This section showcases a collection of musical pieces generated by InspireMusic for the music continuation task.

Music Continuation Samples

This section showcases a collection of musical pieces generated by InspireMusic for the music continuation task, with audio output produced with Conditional Flow Matching.

| Music Prompt | InspireMusic-1.5B-Long | InspireMusic-1.5B | InspireMusic-1.5B-24kHz | InspireMusic-Base | InspireMusic-Base-24kHz |
| --- | --- | --- | --- | --- | --- |

This section showcases a collection of musical pieces generated by InspireMusic for the music continuation task, with audio output produced without Conditional Flow Matching.

| Music Prompt | InspireMusic-1.5B-Long_no_flow | InspireMusic-1.5B_no_flow | InspireMusic-Base_no_flow |
| --- | --- | --- | --- |

More Diverse Generation Samples

This section showcases a collection of musical pieces generated by InspireMusic with different music structures and genres.

InspireMusic w/ Repetition-Aware Sampling

This section showcases a collection of musical pieces generated by InspireMusic for short-form text-to-music generation with conditional flow matching, utilizing repetition-aware sampling.

| Music Structure | Genre | Text Description | Audio Output |
| --- | --- | --- | --- |
| Intro | Instrumental | A serene instrumental piece with a nostalgic feel, blending traditional and modern elements for a soothing auditory experience. | [audio] |
| Intro | Metal | Exploring a blend of symphonic metal with a primal, rhythmic instrumental backdrop, this track features powerful, creating an epic and immersive auditory experience. | [audio] |
| Intro | Pop | Featuring a soothing blend of Mandarin pop with a gentle instrumental backdrop, this track offers a serene and reflective ambiance. | [audio] |
| Intro | Instrumental | A serene blend of instrumental modern classical music, evoking a contemplative and tranquil atmosphere. | [audio] |
| Intro | Electronic | Experience an electrifying journey through high-energy electronic beats and mesmerizing trance rhythms. | [audio] |
| Intro | Game | The instrumental track from a video game soundtrack exudes an adventurous and lively atmosphere, characterized by its energetic and playful melodies. | [audio] |
| Verse | Hip-hop, Laid-back | The track exudes a laid-back, instrumental hip-hop vibe with a relaxing beat. | [audio] |
| Verse | Piano | A serene and elegant classical piano piece, evoking a sense of tranquility and nobility, with a purely instrumental arrangement. | [audio] |
| Verse | Game | The instrumental and anime-inspired music evokes a nostalgic journey through classic video game soundscapes, blending energetic and atmospheric elements. | [audio] |
| Verse | Latin | A vibrant blend of Latin and Latin American instrumental music, likely featuring in a heartfelt and rhythmic style. | [audio] |
| Verse | Electronic | The electronic track exudes a warm, wintery vibe with instrumental. | [audio] |
| Verse | Electronic | The track delivers an immersive electronic experience with a hypnotic, psychedelic atmosphere. | [audio] |
| Verse | Electronic | The track delivers an immersive electronic experience with a hypnotic, psychedelic atmosphere, featuring English vocals that add a captivating layer to the progressive soundscape. | [audio] |
| Verse | Punk | The music exudes a gritty, rebellious energy with a raw punk vibe. | [audio] |
| Chorus | Classical Instrumental | A captivating piece of classical instrumental music, characterized by its intricate variations and timeless elegance. | [audio] |
| Chorus | Electronic, Folk | The instrumental track blends electronic and folk elements, creating a serene and atmospheric soundscape. | [audio] |
| Chorus | Reggae | Experience a deep, instrumental journey through reggae and dub, characterized by echoing rhythms and a laid-back, atmospheric vibe. The music is purely instrumental, allowing the rich, resonant basslines and intricate soundscapes to take center stage. | [audio] |
| Chorus | Electronic | The music exudes a serene and atmospheric electronic vibe. | [audio] |
| Chorus | Instrumental, Playful, Whimsical | The instrumental piece exudes a playful and whimsical atmosphere, likely featuring lively and rhythmic elements. The music seems to be inspired by nature and animals, creating an engaging and light-hearted experience. | [audio] |
| Chorus | Electronic | An energetic electronic track with a trance-like feel, featuring instrumental melodies and a dynamic beat. The music likely has a Chinese influence, given the tags, and is purely instrumental without any vocal performance. | [audio] |
| Chorus | Electronic | A soothing blend of instrumental and electronic sounds, likely featuring relaxing and ambient tones. The music is instrumental, suggesting no vocals, and it evokes a serene, space-themed atmosphere. | [audio] |
| Chorus | Instrumental | The instrumental track exudes a serene and reflective ambiance, characterized by its smooth and melodic composition. The absence of vocals allows the music to create a tranquil and contemplative atmosphere, perfect for relaxation or introspection. | [audio] |
| Chorus | Jazz | A serene blend of contemporary jazz and instrumental music, characterized by drifting melodies and a smooth, floating ambiance. The piece likely features instrumental performances without vocals, creating a tranquil and sophisticated atmosphere. | [audio] |
| Chorus | Reggae, Pop | A vibrant blend of reggae and pop, delivering a soulful and energetic performance. | [audio] |
| Outro | World Fusion Music | Exploring the vibrant rhythms and intricate beats of Egyptian percussion, this offers a captivating journey into world fusion music, blending traditional sounds with modern influences. The vocals, likely in Arabic, add an authentic touch, with a dynamic performance that complements the energetic and immersive instrumental backdrop. | [audio] |
| Outro | Electronic | An energetic electronic track with a trance vibe, featuring instrumental elements and a dynamic beat. | [audio] |
| Outro | Electronic | The instrumental track exudes a progressive house vibe with electronic elements, creating a serene and futuristic atmosphere. | [audio] |
| Outro | Rock, Funky | The instrumental rock track exudes a funky, energetic vibe with a touch of rock elements, creating a dynamic and engaging listening experience. | [audio] |
| Outro | Electronic | The instrumental track exudes a serene and atmospheric electronic vibe, blending ambient sounds with a touch of tranquility. | [audio] |

InspireMusic w/ Top-K Sampling

This section presents demos of short-form instrumental music generation with the 30-second pre-trained InspireMusic-Base model, using the Top-K sampling method.
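
For reference, top-k sampling keeps only the k highest-scoring tokens at each step and samples among them in proportion to their softmax weights. The following is a minimal standard-library sketch of that idea. The function name, the temperature parameter, and the example logits are illustrative assumptions, not values taken from InspireMusic.

```python
import math
import random
from typing import Sequence


def top_k_sample(
    logits: Sequence[float],
    k: int,
    rng: random.Random,
    temperature: float = 1.0,
) -> int:
    """Sample a token index from the k largest logits."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k highest-scoring tokens.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in ranked]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ranked, weights=weights, k=1)[0]


# Usage: with k=2, only the two highest-scoring tokens can ever be drawn.
rng = random.Random(0)
example_logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for four tokens
print([top_k_sample(example_logits, k=2, rng=rng) for _ in range(10)])
```

A small k makes the output more predictable, while a larger k admits rarer tokens and increases variety.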

| Music Structure | Genre | Text Description | Audio Output |
| --- | --- | --- | --- |
| Verse | Pop | Dreamy and ethereal, this English-language track blends elements of pop and dream pop, featuring a remix that adds a neon-lit twist. The vocals, likely delivered by a male singer, float effortlessly over the lush, atmospheric soundscape. | [audio] |
| Verse | Pop | A lively and rhythmic instrumental track, characterized by its upbeat and danceable pop style. | [audio] |
| Verse | Classical | A serene instrumental piece blending classical and contemporary elements, performed in English, evoking a timeless and elegant atmosphere. | [audio] |
| Verse | Electronic | A vibrant electronic instrumental experience blending drum and bass elements, likely featuring soothing Chinese vocals. | [audio] |
| Verse | Rock | A dynamic blend of instrumental rock and math rock, featuring intricate rhythms and a powerful, energetic feel. The music is purely instrumental, allowing listeners to immerse themselves in the complex, yet harmonious soundscapes. | [audio] |
| Verse | Electronic House, Dance | The track exudes a captivating electronic house vibe, characterized by its instrumental purity and energetic dance beats. | [audio] |
| Verse | New Age Music | Experience soothing instrumental New Age music designed to balance chakras, enhance happiness, and maintain calm, perfect for yoga sessions and relaxation. | [audio] |
| Verse | Game Music | An energetic blend of Japanese anime and game music, featuring dynamic instrumentals and possibly female vocals, creating a nostalgic and vibrant atmosphere. | [audio] |
| Chorus | R&B | The music exudes a soulful and rhythmic vibe, blending elements of R&B and soul. The performance likely features a smooth, emotive delivery. | [audio] |
| Chorus | Electronic, Anime | A mysterious and atmospheric track, blending elements of dark electronic and anime soundtrack styles. | [audio] |
| Chorus | Jazz | Imagine a lively instrumental jazz piece with a swinging rhythm that evokes a sense of joy and nostalgia. The music, characterized by its upbeat tempo and smooth melodies, creates an atmosphere perfect for dancing or simply enjoying a cheerful moment. | [audio] |
| Chorus | Electronic Dance | The energetic electronic dance track features nostalgic vibes. | [audio] |
| Chorus | Electronic | An energetic electronic track with a trance-like feel, featuring instrumental melodies and a dynamic beat. The music likely has a Chinese influence, given the tags, and is purely instrumental without any vocal performance. | [audio] |
| Chorus | Instrumental | A serene and melodic Chinese ballad featuring gentle, harmonious with an instrumental version that highlights traditional musical elements. | [audio] |
| Chorus | Heavy Metal | The intense live performance delivers a raw and aggressive sound, characterized by heavy, distorted guitars and powerful drumming, delivered with a harsh, guttural style, typical of extreme metal genres. | [audio] |
| Chorus | Energetic, Cardiac Sound | The instrumental track delivers an energetic and rhythmic experience, characterized by a pulsating beat and dynamic soundscapes. | [audio] |
| Chorus | Indian Classical | A soothing blend of traditional Indian classical music featuring intricate flute melodies and rhythmic tabla beats. | [audio] |
| Intro | Electronic Dance | A serene instrumental piece with a nostalgic feel, likely featuring male vocals in Mandarin, blending traditional and modern elements for a soothing auditory experience. | [audio] |
| Intro | Electronic | The track exudes a deeply immersive electronic vibe, characterized by pure instrumental sounds and a rich deep house rhythm. The overall feel is intoxicating and hypnotic. | [audio] |
| Outro | Modern Instrumentation | A soothing blend of traditional Chinese melodies and modern instrumentation, featuring gentle that evoke a serene and tranquil atmosphere. | [audio] |
| Outro | House | The vibrant mix of house and classical crossover creates an energetic and sophisticated atmosphere, featuring with a powerful performance. | [audio] |
| Outro | Hip-hop, Rap style | A dynamic blend of West Coast hip-hop with a storytelling rap style, featuring English vocals that convey a gritty, urban narrative. | [audio] |
| Outro | Latin dance | The vibrant and energetic rhythms of this music blend Latin dance styles with a global influence, featuring lively salsa and merengue beats. The vocals, likely in Spanish, are delivered with passionate intensity, possibly by a male singer. | [audio] |

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.