InspireMusic: A Unified Framework for Controlled High-Fidelity Long-Form Music, Song and Audio Generation
[Code] [Studio: Modelscope] [Spaces: HuggingFace] [Models: Modelscope] [Models: HuggingFace] [Technical Report (coming soon)]
Tongyi Lab
Alibaba Group
Abstract We introduce InspireMusic, a unified framework for generating high-fidelity music, songs, and audio that integrates an autoregressive transformer with a super-resolution flow-matching model. The framework generates high-fidelity long-form audio at 48kHz from both text and audio inputs. Unlike previous approaches, we use dual audio tokenizers: a high-bitrate compression audio tokenizer that contains richer semantic information, which reduces training costs and improves efficiency, and an acoustic codec that preserves fine-grained acoustic details during flow-matching model training. This combination enables high-quality audio generation with long-form coherence. An autoregressive transformer based on Qwen2.5 first predicts 75Hz audio tokens. A super-resolution flow-matching model then learns the latent features of the audio from the 150Hz music tokenizer, and a vocoder finally outputs high-quality audio waveforms. By directly modeling raw audio, the framework supports both diverse and high-fidelity output.
Highlights
- Long-form music generation.
- High audio quality, with support for 48kHz and 24kHz audio.
- A unified, high-efficiency music generation framework.
Contents
- Music Generation Model: InspireMusic
Overview of InspireMusic
Figure 1. An overview of the InspireMusic framework.
We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing high-quality 48kHz long-form audio. InspireMusic consists of three key components:
- **Dual Audio Tokenizers**: The framework first converts raw audio waveforms into discrete tokens that are efficiently processed by the autoregressive model. We employ two tokenizers: WavTokenizer converts 24kHz audio into 75Hz discrete tokens, while Hifi-Codec transforms 48kHz audio into 150Hz latent features suited for our flow matching model.
- **Autoregressive Transformer**: This component is trained using a next-token prediction approach on both text and audio tokens, enabling it to generate coherent and contextually relevant audio sequences.
- **Super-Resolution Flow Matching Model**: An ODE-based diffusion model, specifically a super-resolution flow matching (SRFM) model, maps the lower-resolution audio tokens to latent features with a higher sampling rate. A vocoder then generates the final audio waveform from these enhanced latent features.
InspireMusic supports a range of tasks including text-to-music, music continuation, music reconstruction, and music super-resolution.
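To make the token rates in the overview concrete, the following minimal sketch works out how long each sequence is for a clip of a given duration. The rates (75Hz tokens from 24kHz audio, 150Hz latent features from 48kHz audio) come from the text; the function name, the dictionary keys, and the 30-second clip length are hypothetical and chosen only for illustration.

```python
# Rates stated in the overview above.
TOKEN_RATE_HZ = 75       # discrete tokens per second (WavTokenizer, 24kHz input)
LATENT_RATE_HZ = 150     # latent frames per second (Hifi-Codec, 48kHz input)


def sequence_lengths(duration_s: float) -> dict:
    """Return sequence lengths for a clip of `duration_s` seconds.

    This is plain arithmetic on the published rates, not project code.
    """
    return {
        # Tokens the autoregressive transformer has to predict.
        "ar_tokens": round(duration_s * TOKEN_RATE_HZ),
        # Latent frames the flow-matching model has to produce.
        "latent_frames": round(duration_s * LATENT_RATE_HZ),
        # Temporal upsampling between the two stages.
        "upsampling_factor": LATENT_RATE_HZ // TOKEN_RATE_HZ,
        # Waveform samples covered by one token / one latent frame.
        "samples_per_token_24k": 24_000 // TOKEN_RATE_HZ,
        "samples_per_latent_48k": 48_000 // LATENT_RATE_HZ,
    }


lengths = sequence_lengths(30.0)  # hypothetical 30-second clip
print(lengths)
```

Under these assumptions the flow-matching stage doubles the temporal resolution of the token sequence, while each token and each latent frame covers the same number of waveform samples at its own sampling rate.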
Text-to-Music Generation
This section showcases a collection of musical pieces generated by InspireMusic.
The samples below were generated by InspireMusic and MusicGen and serve as a comparison of short-form text-to-music generation.
Text Description | InspireMusic w/o CFM | MusicGen-Small | MusicGen-Medium | MusicGen-Large | |
---|---|---|---|---|---|
Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance. |
Music Generation w/ Different Sampling Methods
This section showcases a collection of musical pieces generated by InspireMusic, utilizing repetition-aware sampling and top-k sampling methods, respectively.
Text Description | w/ Repetition Aware Sampling w/ CFM | w/ Repetition Aware Sampling w/o CFM | w/ Top-K Sampling w/ CFM | w/ Top-K Sampling w/o CFM |
---|---|---|---|---|
Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance. |
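For readers comparing the two sampling methods, the sketch below shows the general idea behind each. Top-k sampling restricts each draw to the k most likely tokens. The repetition-aware variant here is a simplified illustration: it checks how often the drawn token already appears in a recent window and, if it dominates, redraws from the full distribution. The function names, the window size, and the threshold are assumptions for this example and are not taken from the InspireMusic code.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_sample(logits, k, rng):
    """Sample a token id using only the k highest-scoring logits."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top_ids])
    return rng.choices(top_ids, weights=probs, k=1)[0]


def repetition_aware_sample(logits, history, k, rng, window=10, max_ratio=0.5):
    """Top-k sample, then redraw from the full distribution if the chosen
    token already dominates the last `window` tokens (assumed rule)."""
    token = top_k_sample(logits, k, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > max_ratio:
        all_ids = list(range(len(logits)))
        token = rng.choices(all_ids, weights=softmax(logits), k=1)[0]
    return token


rng = random.Random(0)                    # fixed seed for reproducibility
logits = [4.0, 3.5, 0.1, 0.0, -1.0]       # hypothetical scores for 5 tokens

# Top-k with k=2 can only ever return token 0 or token 1.
top_k_draws = [top_k_sample(logits, 2, rng) for _ in range(200)]

# With a history full of token 0, the repetition-aware sampler falls back
# to the full distribution, so other tokens become reachable even at k=1.
stuck_history = [0] * 10
ras_draws = [repetition_aware_sample(logits, stuck_history, 1, rng) for _ in range(200)]
```

The design intent of the fallback is to break loops: when the restricted sampler keeps returning the same token, widening the candidate set gives the sequence a chance to move on.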
Music Generation with & without Conditional Flow Matching
This section showcases a collection of musical pieces generated by InspireMusic with and without Conditional Flow Matching (CFM), utilizing repetition-aware sampling.
Music Structure | Genre | Text Description | Audio Output w/o Conditional Flow Matching | |
---|---|---|---|---|
Intro | Instrumental | A delightful collection of classical keyboard music, purely instrumental, exuding a timeless and elegant charm. | | |
Chorus | R&B | A soothing blend of instrumental and R&B rhythms, featuring serene and calming melodies. | | |
Verse | Jazz | Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance. | | |
Outro | Rock | The instrumental rock piece features dynamic oscillations and wave-like progressions, creating an immersive and energetic atmosphere. The music is purely instrumental, with no vocals, and it blends elements of rock and post-rock for a powerful and evocative experience. | | |

Text Prompt | InspireMusic-1.5B-Long | InspireMusic-1.5B-Long_no_flow |
---|---|---|
The instrumental rock piece features dynamic oscillations and wave-like progressions, creating an immersive and energetic atmosphere. The music is purely instrumental, with no vocals, and it blends elements of rock and post-rock for a powerful and evocative experience. | | |
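As background for the with/without CFM comparison, flow matching generates a sample by integrating an ordinary differential equation from a starting point toward the data. The sketch below is a toy illustration of that sampling loop using fixed-step Euler integration. In a real system the velocity would come from a trained network conditioned on the audio tokens; here it is replaced by a closed-form straight-line field toward a known target, and every name in the code is hypothetical.

```python
def velocity(x, t, target):
    """Toy velocity field for a straight-line flow toward `target`.

    v(x, t) = (target - x) / (1 - t). A trained model would predict this
    quantity instead; the closed form is an assumption for illustration.
    """
    return [(goal - xi) / (1.0 - t) for xi, goal in zip(x, target)]


def euler_sample(x0, target, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt                      # t stays strictly below 1
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


start = [0.0, 0.0, 0.0]                    # stands in for the initial noise
target = [1.0, -2.0, 0.5]                  # stands in for target latent features
result = euler_sample(start, target, num_steps=10)
print(result)
```

With this particular field, each Euler step moves the point a fixed fraction of the remaining distance, so the final step lands on the target up to floating-point error. A learned velocity field would not have this exact property, which is why the number of integration steps matters in practice.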
Music Continuation
This section showcases a collection of musical pieces generated by InspireMusic for the music continuation task.
Music Continuation Samples
The samples below were generated by InspireMusic for the music continuation task, with audio output using Conditional Flow Matching.
Music Prompt | InspireMusic-1.5B-Long | InspireMusic-1.5B | InspireMusic-1.5B-24kHz | InspireMusic-Base | InspireMusic-Base-24kHz |
---|---|---|---|---|---|

Music Prompt | InspireMusic-1.5B-Long_no_flow | InspireMusic-1.5B_no_flow | InspireMusic-Base_no_flow |
---|---|---|---|
More Diverse Generation Samples
This section showcases a collection of musical pieces generated by InspireMusic with different music structures and genres.
InspireMusic w/ Repetition-Aware Sampling
This section showcases a collection of musical pieces generated by InspireMusic for short-form text-to-music generation with conditional flow matching, utilizing repetition-aware sampling.
Music Structure | Genre | Text Description | Audio Output |
---|---|---|---|
Intro | Instrumental | A serene instrumental piece with a nostalgic feel, blending traditional and modern elements for a soothing auditory experience. | |
Intro | Metal | Exploring a blend of symphonic metal with a primal, rhythmic instrumental backdrop, this track features powerful, creating an epic and immersive auditory experience. | |
Intro | Pop | Featuring a soothing blend of Mandarin pop with a gentle instrumental backdrop, this track offers a serene and reflective ambiance. | |
Intro | Instrumental | A serene blend of instrumental modern classical music, evoking a contemplative and tranquil atmosphere. | |
Intro | Electronic | Experience an electrifying journey through high-energy electronic beats and mesmerizing trance rhythms. | |
Intro | Game | The instrumental track from a video game soundtrack exudes an adventurous and lively atmosphere, characterized by its energetic and playful melodies. | |
Verse | Hip-hop, Laid-back | The track exudes a laid-back, instrumental hip-hop vibe with a relaxing beat. | |
Verse | Piano | A serene and elegant classical piano piece, evoking a sense of tranquility and nobility, with a purely instrumental arrangement. | |
Verse | Game | The instrumental and anime-inspired music evokes a nostalgic journey through classic video game soundscapes, blending energetic and atmospheric elements. | |
Verse | Latin | A vibrant blend of Latin and Latin American instrumental music, likely featuring in a heartfelt and rhythmic style. | |
Verse | Electronic | The electronic track exudes a warm, wintery vibe with instrumental. | |
Verse | Electronic | The track delivers an immersive electronic experience with a hypnotic, psychedelic atmosphere. | |
Verse | Electronic | The track delivers an immersive electronic experience with a hypnotic, psychedelic atmosphere, featuring English vocals that add a captivating layer to the progressive soundscape. | |
Verse | Punk | The music exudes a gritty, rebellious energy with a raw punk vibe. | |
Chorus | Classical Instrumental | A captivating piece of classical instrumental music, characterized by its intricate variations and timeless elegance. | |
Chorus | Electronic, Folk | The instrumental track blends electronic and folk elements, creating a serene and atmospheric soundscape. | |
Chorus | Reggae | Experience a deep, instrumental journey through reggae and dub, characterized by echoing rhythms and a laid-back, atmospheric vibe. The music is purely instrumental, allowing the rich, resonant basslines and intricate soundscapes to take center stage. | |
Chorus | Electronic | The music exudes a serene and atmospheric electronic vibe. | |
Chorus | Instrumental, Playful, Whimsical | The instrumental piece exudes a playful and whimsical atmosphere, likely featuring lively and rhythmic elements. The music seems to be inspired by nature and animals, creating an engaging and light-hearted experience. | |
Chorus | Electronic | An energetic electronic track with a trance-like feel, featuring instrumental melodies and a dynamic beat. The music likely has a Chinese influence, given the tags, and is purely instrumental without any vocal performance. | |
Chorus | Electronic | A soothing blend of instrumental and electronic sounds, likely featuring relaxing and ambient tones. The music is instrumental, suggesting no vocals, and it evokes a serene, space-themed atmosphere. | |
Chorus | Instrumental | The instrumental track exudes a serene and reflective ambiance, characterized by its smooth and melodic composition. The absence of vocals allows the music to create a tranquil and contemplative atmosphere, perfect for relaxation or introspection. | |
Chorus | Jazz | A serene blend of contemporary jazz and instrumental music, characterized by drifting melodies and a smooth, floating ambiance. The piece likely features instrumental performances without vocals, creating a tranquil and sophisticated atmosphere. | |
Chorus | Reggae, Pop | A vibrant blend of reggae and pop, delivering a soulful and energetic performance. | |
Outro | World Fusion Music | Exploring the vibrant rhythms and intricate beats of Egyptian percussion, this offers a captivating journey into world fusion music, blending traditional sounds with modern influences. The vocals, likely in Arabic, add an authentic touch, with a dynamic performance that complements the energetic and immersive instrumental backdrop. | |
Outro | Electronic | An energetic electronic track with a trance vibe, featuring instrumental elements and a dynamic beat. | |
Outro | Electronic | The instrumental track exudes a progressive house vibe with electronic elements, creating a serene and futuristic atmosphere. | |
Outro | Rock, Funky | The instrumental rock track exudes a funky, energetic vibe with a touch of rock elements, creating a dynamic and engaging listening experience. | |
Outro | Electronic | The instrumental track exudes a serene and atmospheric electronic vibe, blending ambient sounds with a touch of tranquility. | |
InspireMusic w/ Top-K Sampling
This section presents demos of short-form instrumental music generation with the 30-second pre-trained InspireMusic-Base model, using the Top-K sampling method.
Music Structure | Genre | Text Description | Audio Output |
---|---|---|---|
Verse | Pop | Dreamy and ethereal, this English-language track blends elements of pop and dream pop, featuring a remix that adds a neon-lit twist. The vocals, likely delivered by a male singer, float effortlessly over the lush, atmospheric soundscape. | |
Verse | Pop | A lively and rhythmic instrumental track, characterized by its upbeat and danceable pop style. | |
Verse | Classical | A serene instrumental piece blending classical and contemporary elements, performed in English, evoking a timeless and elegant atmosphere. | |
Verse | Electronic | A vibrant electronic instrumental experience blending drum and bass elements, likely featuring soothing Chinese vocals. | |
Verse | Rock | A dynamic blend of instrumental rock and math rock, featuring intricate rhythms and a powerful, energetic feel. The music is purely instrumental, allowing listeners to immerse themselves in the complex, yet harmonious soundscapes. | |
Verse | Electronic House, Dance | The track exudes a captivating electronic house vibe, characterized by its instrumental purity and energetic dance beats. | |
Verse | New Age Music | Experience soothing instrumental New Age music designed to balance chakras, enhance happiness, and maintain calm, perfect for yoga sessions and relaxation. | |
Verse | Game Music | An energetic blend of Japanese anime and game music, featuring dynamic instrumentals and possibly female vocals, creating a nostalgic and vibrant atmosphere. | |
Chorus | R&B | The music exudes a soulful and rhythmic vibe, blending elements of R&B and soul. The performance likely features a smooth, emotive delivery. | |
Chorus | Electronic, Anime | A mysterious and atmospheric track, blending elements of dark electronic and anime soundtrack styles. | |
Chorus | Jazz | Imagine a lively instrumental jazz piece with a swinging rhythm that evokes a sense of joy and nostalgia. The music, characterized by its upbeat tempo and smooth melodies, creates an atmosphere perfect for dancing or simply enjoying a cheerful moment. | |
Chorus | Electronic Dance | The energetic electronic dance track features nostalgic vibes. | |
Chorus | Electronic | An energetic electronic track with a trance-like feel, featuring instrumental melodies and a dynamic beat. The music likely has a Chinese influence, given the tags, and is purely instrumental without any vocal performance. | |
Chorus | Instrumental | A serene and melodic Chinese ballad featuring gentle, harmonious with an instrumental version that highlights traditional musical elements. | |
Chorus | Heavy Metal | The intense live performance delivers a raw and aggressive sound, characterized by heavy, distorted guitars and powerful drumming, delivered with a harsh, guttural style, typical of extreme metal genres. | |
Chorus | Energetic, Cardiac Sound | The instrumental track delivers an energetic and rhythmic experience, characterized by a pulsating beat and dynamic soundscapes. | |
Chorus | Indian Classical | A soothing blend of traditional Indian classical music featuring intricate flute melodies and rhythmic tabla beats. | |
Intro | Electronic Dance | A serene instrumental piece with a nostalgic feel, likely featuring male vocals in Mandarin, blending traditional and modern elements for a soothing auditory experience. | |
Intro | Electronic | The track exudes a deeply immersive electronic vibe, characterized by pure instrumental sounds and a rich deep house rhythm. The overall feel is intoxicating and hypnotic. | |
Outro | Modern Instrumentation | A soothing blend of traditional Chinese melodies and modern instrumentation, featuring gentle that evoke a serene and tranquil atmosphere. | |
Outro | House | The vibrant mix of house and classical crossover creates an energetic and sophisticated atmosphere, featuring with a powerful performance. | |
Outro | Hip-hop, Rap style | A dynamic blend of West Coast hip-hop with a storytelling rap style, featuring English vocals that convey a gritty, urban narrative. | |
Outro | Latin dance | The vibrant and energetic rhythms of this music blend Latin dance styles with a global influence, featuring lively salsa and merengue beats. The vocals, likely in Spanish, are delivered with passionate intensity, possibly by a male singer. | |
Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.