Session Loops VocalNet v1.6.0 [WiN-MAC]

Session Loops VocalNet interface showing a triangular vocal morphing control with cyan, magenta, and red nodes connected around a central white node, featuring English and World voice blending options.

Product: VocalNet
Developer: Session Loops
Version: 1.6.0
Format: VST3, AU, AAX
Requirements: Windows 10 or later, macOS 10.13 or later
Source: sessionloops.com/vocalnet

VocalNet is a real-time voice conversion plugin built on a retrieval-based voice conversion (RVC) architecture, running entirely on-device without cloud dependency. The signal path works in three stages: sample import and voice model generation, real-time timbral transformation of an input signal to match the cloned target, and simultaneous morphing across up to three loaded voices via a triangular cursor. It operates as an effect insert in any VST3, AU, or AAX host, and adds ARA support in DAWs that implement it. The primary differentiator is the three-voice cursor blend running in real time: voice character is navigated as a continuous spatial parameter between distinct cloned models rather than toggled as a discrete selection.

Key Takeaway

The three-voice blend delivers something a pitch-shifter or formant processor cannot in two kinds of session: those where a performer needs to audition a different vocal timbre in real time during tracking, and those where a producer is building a lead vocal layer that sits between two distinct character references. It complements rather than replaces a pitch correction chain; transformation quality degrades on heavily processed source material and falls apart on polyphonic input. Producers who work exclusively from pre-rendered audio stems with no intention of re-tracking can cover the same ground with offline RVC tools at lower cost. The Artist Mode voice roster is small; engineers building sessions around a specific licensed artist voice need to confirm that artist’s model is available before purchase.

RVC Architecture and the Sample Import Chain

VocalNet’s conversion engine is built on the same RVC (Retrieval-based Voice Conversion) architecture that underlies the open-source community tools the developer explicitly cites as the quality reference. The founder described the goal as “seamless integration into DAW workflows, while maintaining audio quality comparable to RVC models”; the plugin is a DAW-integrated implementation of that architecture, not a proprietary replacement for it. Model inference runs on the CPU with no GPU requirement, so the plugin is intended to work on standard consumer machines.

The model generation step, which converts imported vocal samples into a target voice, happens inside the plugin after sample import. Transformation quality depends heavily on source sample quality: dry, unprocessed recordings with clean tonal balance and minimal background noise produce more stable and convincing conversions than samples with reverb, compression artifacts, or heavy EQ. This is a structural constraint of RVC-based systems, not a plugin-specific deficiency. The model extracts timbral fingerprints from the sample material, and any processing layered onto those samples becomes part of the fingerprint.

The conversion runs in real time against a live or recorded input signal on a track. VocalNet reads the input signal frame by frame and outputs the transformed audio with a delay determined by the model inference time. The stated latency target is sub-30ms on standard hardware, a figure that would keep real-time headphone monitoring just under the roughly 30ms point that most engineers cite as the boundary where delay becomes disruptive during vocal recording. Some users report latency above the stated specification; hardware configuration and buffer settings in the host DAW affect the actual monitoring delay independently of the plugin’s internal processing time.
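The way host buffer settings stack on top of the plugin's own processing time can be estimated with simple arithmetic. The sketch below is a rough model, not a measurement of VocalNet: it assumes one input buffer plus one output buffer of host latency added to the plugin's internal delay, and the function names are hypothetical. Real audio drivers typically add converter and safety-buffer delay on top of this.

```python
def buffer_latency_ms(buffer_samples: int, sample_rate_hz: int) -> float:
    """Time represented by one audio buffer, in milliseconds."""
    if buffer_samples <= 0 or sample_rate_hz <= 0:
        raise ValueError("buffer size and sample rate must be positive")
    return 1000.0 * buffer_samples / sample_rate_hz


def monitoring_latency_ms(
    buffer_samples: int,
    sample_rate_hz: int,
    plugin_ms: float,
    extra_ms: float = 0.0,
) -> float:
    """Estimate round-trip monitoring delay.

    Assumption: one buffer on the way in, one on the way out, plus the
    plugin's internal processing time and any extra driver/converter delay.
    """
    host_ms = 2 * buffer_latency_ms(buffer_samples, sample_rate_hz)
    return host_ms + plugin_ms + extra_ms


if __name__ == "__main__":
    # Compare a few common buffer sizes against a 30 ms plugin delay.
    for size in (64, 128, 256, 512):
        total = monitoring_latency_ms(size, 48_000, plugin_ms=30.0)
        print(f"{size:>4} samples @ 48 kHz -> {total:.1f} ms total")
```

Under this model, even a plugin that meets a 30ms target lands above 30ms at the monitoring output once the host buffers are included, which is one plausible reason reported latency differs from the stated figure.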

Voice model generation is gated behind the paid license tier. The 14-day trial period includes full access to transformation and morphing features; after the trial expires, voice cloning from imported samples is disabled. The trial allows complete evaluation of the three-voice cursor system and ARA integration before purchase. Because model generation is the core workflow, the 14-day window is the period in which any producer who wants to test conversion quality against their own vocal material needs to do so.

Three-Voice Cursor and the Blend Space Geometry

The triangular cursor interface assigns one cloned or Artist Mode voice to each vertex and maps the cursor position to a weighted blend of all three simultaneously. At the centroid, all three voices contribute equally. Moving toward any vertex increases that voice’s weight continuously while reducing the other two proportionally — the result is a smooth timbral gradient between all three models rather than a hard switch or a crossfade between two endpoints.
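One common way to get this behavior is to treat the cursor position as barycentric coordinates inside the triangle. The sketch below assumes that mapping; the source does not confirm VocalNet's internal formula, and the vertex layout and the `blend_weights` name are illustrative only.

```python
from typing import Tuple

Point = Tuple[float, float]

# Hypothetical vertex layout for illustration: an equilateral triangle.
VERTEX_A: Point = (0.0, 0.0)
VERTEX_B: Point = (1.0, 0.0)
VERTEX_C: Point = (0.5, 3 ** 0.5 / 2)


def blend_weights(
    cursor: Point,
    a: Point = VERTEX_A,
    b: Point = VERTEX_B,
    c: Point = VERTEX_C,
) -> Tuple[float, float, float]:
    """Return barycentric weights (w_a, w_b, w_c) of cursor in triangle abc.

    The weights always sum to one; each is in [0, 1] when the cursor lies
    inside the triangle.
    """
    (x, y), (xa, ya), (xb, yb), (xc, yc) = cursor, a, b, c
    det = (yb - yc) * (xa - xc) + (xc - xb) * (ya - yc)
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((yb - yc) * (x - xc) + (xc - xb) * (y - yc)) / det
    w_b = ((yc - ya) * (x - xc) + (xa - xc) * (y - yc)) / det
    w_c = 1.0 - w_a - w_b
    return w_a, w_b, w_c


if __name__ == "__main__":
    centroid = (
        (VERTEX_A[0] + VERTEX_B[0] + VERTEX_C[0]) / 3,
        (VERTEX_A[1] + VERTEX_B[1] + VERTEX_C[1]) / 3,
    )
    print("centroid:", blend_weights(centroid))
    print("at vertex A:", blend_weights(VERTEX_A))
```

With this mapping the centroid yields equal thirds, a vertex yields full weight for its voice, and every position in between gives a weight triple that sums to one.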

The cursor is automatable in compatible DAWs. A blend shift written as an automation curve during a session produces a continuous timbral change across a phrase, section, or entire track — a parameter move that functions like automating a filter cutoff except the spectral change is driven by the conversion engine rather than a static EQ shape. For live performance or tracking passes where the blend position will be adjusted by hand, the cursor responds to mouse input in real time without the transformation engine dropping out between positions.
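An automated blend move can be pictured as a ramp between two weight triples. The sketch below is illustrative only: it assumes the blend is expressed as three weights that sum to one and that the automation moves linearly between two positions, and the names `lerp_weights` and `automation_ramp` are hypothetical. How a given DAW actually exposes and interpolates the cursor parameter is not specified in the source.

```python
from typing import List, Tuple

Weights = Tuple[float, float, float]


def lerp_weights(start: Weights, end: Weights, t: float) -> Weights:
    """Linearly interpolate between two weight triples, t in [0, 1].

    If both inputs sum to one, the result also sums to one, so every
    intermediate point is still a valid three-voice blend.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be within [0, 1]")
    wa, wb, wc = ((1.0 - t) * s + t * e for s, e in zip(start, end))
    return wa, wb, wc


def automation_ramp(start: Weights, end: Weights, steps: int) -> List[Weights]:
    """Sample a straight-line automation move at evenly spaced points."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [lerp_weights(start, end, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    # Move from voice A alone to voice B alone over five automation points.
    for point in automation_ramp((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 5):
        print(tuple(round(w, 2) for w in point))
```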

The three-voice limit defines the size of the blendable space. Any combination of three models creates a distinct triangular space; switching one vertex to a different model redefines the whole gradient. Producing a blend that incorporates four or more distinct voice characters simultaneously is outside the architecture — the cursor has three vertices and the blend weights sum to one. Producers who want to cycle through larger numbers of voice models quickly, as in auditioning candidates, need to replace vertices manually and re-center the cursor between tests.

Blend Mode and Artist Mode share the cursor interface. In Blend Mode the vertices are populated from imported custom-cloned samples; in Artist Mode they can be populated from the licensed artist model roster. Both modes can be mixed — one vertex holding a custom model and two vertices holding Artist Mode models is a valid configuration. The distinction matters for licensing: Artist Mode models carry per-model purchase terms that govern how the processed output can be used commercially.

ARA Integration and the DAW Timeline Relationship

ARA (Audio Random Access) support changes how VocalNet reads and processes audio relative to the standard plugin insert model. Without ARA, VocalNet processes the input signal as a stream — the conversion engine works on the audio frame by frame as it flows through the insert slot. With ARA in compatible DAWs, the plugin has direct access to the audio file region on the timeline, allowing it to analyze context beyond the current playback position and process non-destructively against the original audio region rather than the real-time stream.

In practice the ARA workflow is faster for producers working on pre-recorded vocal tracks as opposed to live monitoring. The conversion result is attached to the clip rather than re-calculated on each playback pass, and the processed output can be reviewed against the timeline without re-recording or bouncing. The standard insert workflow remains available for live monitoring during recording — ARA is not a requirement for using the plugin, and the plugin functions without it in any VST3, AU, or AAX host.

ARA compatibility is host-dependent. Ableton Live does not support ARA; Logic Pro, Studio One, Cubase, and Reaper do. The non-ARA insert workflow functions identically across all supported hosts; the difference is workflow speed on existing recorded material, not conversion quality. Engineers who primarily use Ableton Live lose no conversion features without the ARA pathway, but they do not get the faster timeline-integrated editing available in ARA hosts.

Artist Mode: Model Roster, Revenue Structure, and Availability

Artist Mode, introduced in v1.5 (February 2026), makes licensed voice models available through the plugin from a roster of collaborating artists. Three models were available at launch; the roster is described as expanding. Artists receive a majority share of the per-model purchase revenue — Session Loops describes the structure as “most of the revenue goes directly to the artist.” Each model is auditionable on a 14-day free trial before purchase; the trial period applies per model independently of the plugin license trial.

The model quality in Artist Mode is determined by the source recordings provided by each collaborating artist, which sets the floor for the conversion. Since Artist Mode models are created by the developer in collaboration with the artist rather than by the user from arbitrary samples, the input sample quality constraint that governs custom model generation is handled on the supply side — the user receives a pre-trained model rather than generating one from their own imports. The conversion quality on a given session still depends on the match between the input source voice and the target model’s register and timbre.

At launch the Artist Mode roster was three models. A voice-specific tool whose licensed catalog is that small presents a meaningful constraint for producers who need a specific vocal character and whose needs may not map to the three available options. The 14-day per-model trial permits evaluation before any purchase commitment, but the combination of a small roster and per-model purchasing — rather than a flat subscription unlocking all artists — means licensing costs accumulate per voice if multiple Artist Mode models are needed in a single project. Subscription tier pricing ($5.99/mo for the bundle) does not include Artist Mode models.

Polyphonic Input, Source-to-Target Register, and Processing Limits

VocalNet processes monophonic signals. The underlying RVC architecture performs timbral conversion on a single melodic line; feeding a chord, a harmonized stack, or a stereo mix with multiple simultaneous voices produces artifacts as the model attempts to interpret the combined waveform as a single source pitch. The Bedroom Producers Blog coverage explicitly notes the plugin is “flexible enough to process other monophonic sounds like synths or leads” — phrasing that identifies both the expansion of the use case and its limit: the signal must carry one pitch at a time regardless of the source type.

Pitch-register proximity between the source input voice and the target model affects conversion quality at extremes. An input voice operating primarily in a register far outside the training register of the target model produces register-transition artifacts — timbral instability or harmonic smearing at notes where the model has less training density. The RVC architecture performs automatic range adaptation for moderate register mismatches, but the adaptation has a practical ceiling. Pushing a bass-range input through a high-register target or vice versa without pre-transposing the input increases artifact density. A pitch transposition step before the VocalNet insert, or the plugin’s own pitch shift parameter, can narrow the register gap before conversion.
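The size of a register gap can be estimated in semitones before choosing a transposition. The sketch below uses the standard equal-temperament relation (12 semitones per doubling of frequency). The representative pitches are values the user would estimate by ear or with a tuner; nothing here reads data from VocalNet, and the function names are hypothetical.

```python
import math


def semitone_gap(source_hz: float, target_hz: float) -> float:
    """Signed distance in semitones from the source pitch to the target pitch."""
    if source_hz <= 0 or target_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(target_hz / source_hz)


def suggested_transpose(source_hz: float, target_hz: float) -> int:
    """Round the gap to whole semitones for a pre-transposition step."""
    return round(semitone_gap(source_hz, target_hz))


if __name__ == "__main__":
    # Example: a low input centred near 110 Hz and a target centred near 220 Hz.
    gap = semitone_gap(110.0, 220.0)
    print(f"gap: {gap:.1f} semitones, transpose by {suggested_transpose(110.0, 220.0)}")
```

A positive result means the input sits below the target and would be shifted up; a negative result means the opposite.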

The conversion engine processes audio end-to-end at the project’s native sample rate and bit depth within the plugin chain — no format conversion or lossy encoding step occurs between the input signal and the processed output. CPU requirements are low enough for standard consumer hardware without dedicated GPU; the founder stated explicitly that the engine “runs smoothly on standard CPU.” Multiple simultaneous instances have not been independently benchmarked for overhead; sessions with several simultaneous VocalNet inserts at different buffer settings should be tested for cumulative CPU load before committing to that configuration.

Where Timbral Conversion Ends

VocalNet transforms the timbral character of a monophonic vocal toward a trained reference voice. It does not perform pitch correction, timing correction, breath removal, noise gate, de-essing, or any other conventional vocal processing. The output of the conversion still carries the pitch, timing, and performance artifacts of the original input — a flat note fed through VocalNet is a flat note in the target voice’s timbre. Stacking conventional vocal processing around VocalNet to clean the input before conversion and shape the output after is standard practice in sessions where the source performance has uncorrected issues, but VocalNet itself does not include those tools.

The voice model library cannot be expanded beyond what is accessible through the plugin interface. Third-party RVC community models — available in large numbers through open-source repositories — are not directly importable into VocalNet. The custom clone workflow requires importing audio samples and generating a model within the plugin. Engineers who regularly pull pre-trained RVC models from community sources to avoid the sample-import step will find that workflow closed; IK Multimedia ReSing, which supports external RVC model imports directly, addresses that specific use case.

The commercial licensing terms for output produced using Artist Mode models are per-model and set by each artist’s agreement with Session Loops. The plugin license itself does not confer blanket commercial clearance on output produced using Artist Mode voices. Before delivering a commercially released track that includes Artist Mode-processed vocals, confirming the specific terms of the relevant artist model’s license is a necessary step — the artist’s share of the revenue structure implies ongoing rights governance rather than a flat buyout at the point of model purchase.

FAQs

Does VocalNet run without an internet connection after the initial license activation?

The conversion engine processes audio entirely on-device: no audio leaves the local machine and no ongoing internet connection is required for operation. License activation requires an internet connection at the point of purchase and initial registration. Artist Mode model downloads require an internet connection to pull each model to the local machine on first access; subsequent sessions using a previously downloaded model run offline. The offline processing commitment means audio under NDA or with label confidentiality obligations stays within the local environment throughout the conversion workflow.

What sample material produces the best voice clone results from imported files?

Dry, unprocessed recordings with minimal reverb, no compression artifacts, and clean tonal balance produce the most stable and convincing conversions. VocalNet’s model generation step extracts timbral fingerprints from the source material, and any processing layered onto the samples becomes encoded into the fingerprint. Intelligible vocal phrases with consistent pitch and minimal background noise give the model cleaner timbral data to extract. The amount of source material needed has not been specified with a precise duration requirement, but longer samples covering a wider range of pitches and dynamics give the model more training density across registers.

Can VocalNet process instruments other than voice?

The engine processes any monophonic signal: single-note synth lines, lead guitar, solo woodwind, or bass lines pass through the conversion chain and take on characteristics of the target voice model’s timbre. Polyphonic signals produce artifacts because the model interprets multi-pitch input as a single source. The conversion behaves differently on non-vocal sources than on vocals because the target model was trained on voice material, so the timbral transfer onto an instrument produces a hybrid character rather than a clean vocal impression. The result is a textural effect rather than a transparent voice transplant.

How does Artist Mode licensing work for commercially released tracks?

Each Artist Mode model carries per-model terms set by the artist’s agreement with Session Loops. The plugin license and the base subscription do not cover Artist Mode commercial use independently; the artist’s revenue share from model purchases implies ongoing rights governance over the voice model. Before releasing music commercially that includes Artist Mode-processed vocals, the specific terms of the relevant model’s license require review. The terms are model-specific and may differ between artists; blanket commercial clearance from the plugin license itself should not be assumed.

What is the practical difference between Blend Mode and Artist Mode in the cursor interface?

Blend Mode populates the three cursor vertices with custom voice models generated from user-imported samples — the voices are built by the user from their own source recordings. Artist Mode populates vertices with pre-trained licensed artist models downloaded from Session Loops’ roster. Both modes use the same triangular cursor and the same conversion engine; the difference is the origin and licensing structure of the voice models at the vertices. Vertices can be mixed between the two modes within a single session, so a configuration with one custom-cloned model and two Artist Mode models sharing the blend space is valid.

Session Loops VocalNet

Price: 99 USD
Operating System: Windows 10, macOS 10.13
Application Category: Multimedia
Editor's Rating: 3.8