This website presents the audio examples accompanying the paper "GANStrument: Adversarial Instrument Sound Synthesis with Pitch-invariant Instance Conditioning."
GANStrument is a novel generative adversarial model for instrument sound synthesis.
Given a one-shot sound as input, it can generate pitched instrument sounds that reflect the timbre of the input in interactive time.
By exploiting instance conditioning, GANStrument achieves better fidelity and diversity of synthesized sounds, as well as better generalization to various inputs.
In addition, we introduce an adversarial training scheme for a pitch-invariant feature extractor that significantly improves pitch accuracy and timbre consistency.
The following examples play the Prelude from J.S. Bach's Suite No. 1 in G major, BWV 1007, using GANStrument-generated sounds.
Two one-shot sounds are used as inputs for each example, and the synthesized timbre is gradually varied over time through interpolation in the latent space, with the interpolation ratio incremented by 0.1 every bar.
The synthesized results demonstrate that GANStrument generates instrument sounds with accurate pitch that reflect the input timbres, and smoothly interpolates between multiple sounds.
Note that, in contrast to end-to-end music synthesis, instrument sound synthesis with GANStrument enables independent control of MIDI notes and timbre, which is compatible with typical production workflows in the music industry.
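As a rough illustration, the timbre blending described above amounts to linearly interpolating between the feature vectors of the two one-shot inputs and stepping the ratio once per bar. The following is a minimal Python sketch; the 512-dimensional feature size and the variable names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def interpolate_features(e1: np.ndarray, e2: np.ndarray, num_bars: int = 11) -> list:
    """Linearly blend two input feature vectors, stepping the ratio by 0.1 per bar."""
    ratios = np.linspace(0.0, 1.0, num_bars)  # 0.0, 0.1, ..., 1.0
    return [(1.0 - r) * e1 + r * e2 for r in ratios]

# Hypothetical usage: e1 and e2 stand in for features extracted from the two one-shot inputs.
e1, e2 = np.random.randn(512), np.random.randn(512)
conditions = interpolate_features(e1, e2)
# Each blended vector would condition the generator for one bar of the rendered MIDI.
```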
Flute to Brass
input 1
input 2
interpolation (input 1 to 2)
Flute to Keyboard
input 1
input 2
interpolation (input 1 to 2)
Organ to Guitar
input 1
input 2
interpolation (input 1 to 2)
Mallet to Reed
input 1
input 2
interpolation (input 1 to 2)
Conditioning Comparison
To validate the proposed approach, we trained two class-conditional GANs as strong baselines: one conditioned on pitch only, and the other conditioned on both pitch and instrument category (using the 11 NSynth instrument categories).
Note that we used the same architecture and training parameters for these models to ensure a fair comparison.
The following examples show the interpolation in the latent space.
For the baseline models, we projected the input into the latent space using hybrid GAN inversion, which combines an encoder with optimization, as described in the original paper.
We trained both our model and the baselines on the NSynth dataset [1]. The Good-sounds dataset [2] was not seen during training.
Note that the NSynth inputs were taken from the validation or test sets.
The following results demonstrate that the baseline models struggle to invert the inputs and completely fail to mix the two sounds.
In contrast, GANStrument smoothly interpolates the two timbres with accurate pitch.
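For readers unfamiliar with hybrid GAN inversion, the sketch below outlines the general encoder-plus-optimization procedure in PyTorch. The generator and encoder interfaces and the plain L1 reconstruction loss are placeholder assumptions for illustration; the baselines' actual inversion follows the original paper.

```python
import torch

def hybrid_inversion(generator, encoder, target_spec, steps=200, lr=0.05):
    """Generic encoder-initialized latent optimization (hybrid GAN inversion sketch)."""
    # The encoder provides a starting latent; optimization then refines it.
    z = encoder(target_spec).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(z)  # re-synthesize a spectrogram from the current latent
        loss = torch.nn.functional.l1_loss(recon, target_spec)  # reconstruction term only, for brevity
        loss.backward()
        opt.step()
    return z.detach()
```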
Keyboard and Brass (NSynth, Fig. 2 in the original paper)
conditioning
input 1
100:0
75:25
50:50
25:75
0:100
input 2
pitch w/ enc.+opt.
pitch + instrument w/ enc.+opt.
pitch + instance (ours)
Bass and Mallet (NSynth)
conditioning
input 1
100:0
75:25
50:50
25:75
0:100
input 2
pitch w/ enc.+opt.
pitch + instrument w/ enc.+opt.
pitch + instance (ours)
Vocal and Guitar (NSynth)
conditioning
input 1
100:0
75:25
50:50
25:75
0:100
input 2
pitch w/ enc.+opt.
pitch + instrument w/ enc.+opt.
pitch + instance (ours)
Clarinet and Cello (Good-sounds, unseen)
conditioning
input 1
100:0
75:25
50:50
25:75
0:100
input 2
pitch w/ enc.+opt.
pitch + instrument w/ enc.+opt.
pitch + instance (ours)
Saxophone and Trumpet (Good-sounds, unseen)
conditioning
input 1
100:0
75:25
50:50
25:75
0:100
input 2
pitch w/ enc.+opt.
pitch + instrument w/ enc.+opt.
pitch + instance (ours)
Flute and Violin (Good-sounds, unseen)
conditioning
input 1
100:0
75:25
50:50
25:75
0:100
input 2
pitch w/ enc.+opt.
pitch + instrument w/ enc.+opt.
pitch + instance (ours)
Feature Extractor Comparison
We conducted an ablation study to evaluate the effectiveness of the proposed pitch-invariant feature extractor.
As a baseline feature extractor, we trained an instrument identity classifier without the pitch-adversarial loss.
The following examples show mel-spectrograms of 88 pitches generated from inputs drawn from the NSynth and Good-sounds datasets.
Note that we fixed the noise vector across all 88 pitches.
These results demonstrate that the feature extractor trained without the adversarial scheme produces inaccurate pitches and inconsistent timbre.
In contrast, the pitch-invariant feature extractor trained with our adversarial scheme produces stable pitches and consistent timbre.
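One common way to realize such a pitch-adversarial objective is a gradient reversal layer: the extracted feature should help instrument identification but hinder pitch prediction. The PyTorch sketch below illustrates this idea; the stand-in backbone, layer sizes, and the use of gradient reversal are assumptions for illustration and may differ from the paper's exact scheme.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the sign of the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class PitchInvariantExtractor(nn.Module):
    def __init__(self, feat_dim=512, n_instruments=11, n_pitches=88):
        super().__init__()
        # Stand-in backbone; the real extractor operates on mel-spectrograms.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.instrument_head = nn.Linear(feat_dim, n_instruments)  # main identity task
        self.pitch_head = nn.Linear(feat_dim, n_pitches)           # adversarial pitch task

    def forward(self, spec):
        feat = self.backbone(spec)
        inst_logits = self.instrument_head(feat)
        # Gradient reversal discourages the feature from carrying pitch information.
        pitch_logits = self.pitch_head(GradReverse.apply(feat))
        return feat, inst_logits, pitch_logits
```

Training would then minimize both classification losses jointly; because the pitch gradient is reversed before reaching the backbone, the backbone learns features that are informative about instrument identity but uninformative about pitch.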
Keyboard (NSynth)
feature extractor
input
88 pitches
w/o adv. training
w/ adv. training
Vocal (NSynth)
feature extractor
input
88 pitches
w/o adv. training
w/ adv. training
Mallet (NSynth)
feature extractor
input
88 pitches
w/o adv. training
w/ adv. training
Clarinet (Good-sounds, unseen)
feature extractor
input
88 pitches
w/o adv. training
w/ adv. training
Cello (Good-sounds, unseen)
feature extractor
input
88 pitches
w/o adv. training
w/ adv. training
Saxophone (Good-sounds, unseen, Fig. 3 in the original paper)
feature extractor
input
88 pitches
w/o adv. training
w/ adv. training
Additional Examples
Non-Instrument Sound Inputs
The following examples show synthesized results for non-instrument sound inputs.
The synthesized sounds reflect the input timbres while producing stable pitch, like musical instruments.
These results demonstrate that GANStrument generalizes to non-instrument sounds to some extent and can exploit a wide variety of sound materials for timbre design, much as traditional samplers do.
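Conceptually, rendering these examples amounts to extracting one timbre feature from the input, fixing the noise vector, and querying the generator at a series of MIDI pitches. A minimal sketch, assuming hypothetical generator and extractor interfaces and a 128-dimensional noise vector:

```python
import torch

@torch.no_grad()
def render_pitches(generator, extractor, input_spec, pitches=(48, 55, 60, 67, 72)):
    """Synthesize several MIDI pitches from one one-shot input, keeping timbre and noise fixed."""
    feat = extractor(input_spec)   # pitch-invariant timbre feature of the input sound
    noise = torch.randn(1, 128)    # single noise vector shared across all pitches
    # generator(noise, feat, pitch=p) is a hypothetical call signature for illustration.
    return {p: generator(noise, feat, pitch=p) for p in pitches}
```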
Rooster Chicken (Fig. 4 in the original paper)
input
pitch 48
pitch 55
pitch 60
pitch 67
pitch 72
Water Drop (Fig. 4 in the original paper)
input
pitch 48
pitch 55
pitch 60
pitch 67
pitch 72
Firework Explosion
input
pitch 36
pitch 43
pitch 48
pitch 55
pitch 60
Cat Meow
input
pitch 60
pitch 67
pitch 72
pitch 79
pitch 84
Drill
input
pitch 60
pitch 67
pitch 72
pitch 79
pitch 84
Latent Interpolation
The following examples show the interpolation in the latent space.
The inputs include harmonic instrument sounds, percussive sounds, and non-instrument sounds.
These results demonstrate that GANStrument smoothly interpolates between multiple sounds with stable pitch, enabling users to freely explore desired timbres in the latent space.
Organ and Guitar
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Reed and Mallet
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Keyboard and Flute
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Brass and Organ
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Reed and Hi-hat
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Bass and Tom
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Guitar and Cat Meow
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Reed and Drill
input 1
100:0
75:25
50:50
25:75
0:100
input 2
Bonus Track
We tried making loop tracks ourselves using GANStrument.
Loop Track 1
The following example consists of three tracks that use GANStrument-generated sounds, plus a drum accompaniment track.
Note that the interpolation ratio of the three main tracks varies over time (0.0, 0.1, 0.2, …, 1.0), and the accompaniment track does not use GANStrument-generated sounds.
track
input 1
input 2
interpolation (input 1 to 2)
Track 1: melody (mallet to reed)
Track 2: arpeggio (flute to keyboard)
Track 3: chord (organ to guitar)
accompaniment (not generated)
N/A
N/A
Mixing result
Loop Track 2
The following example consists of four tracks that use GANStrument-generated sounds, plus an accompaniment track of keyboard and drums.
Note that the interpolation ratio of the four main tracks varies over time (0.1, 0.2, 0.5, and 0.9), and the accompaniment track does not use GANStrument-generated sounds.
track
input 1
input 2
interpolation (input 1 to 2)
Track 1: bass (bass to kick)
Track 2: melody 1 (flute to brass)
Track 3: melody 2 (organ to guitar)
Track 4: arpeggio (string to marimba)
accompaniment (not generated)
N/A
N/A
Mixing result
References
[1] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, "Neural audio synthesis of musical notes with WaveNet autoencoders," in Proc. ICML, 2017.
[2] O. Romani Picas, H. Parra Rodriguez, D. Dabiri, H. Tokuda, W. Hariya, K. Oishi, and X. Serra, "A real-time system for measuring sound goodness in instrumental sounds," in Audio Engineering Society Convention 138, 2015.