A method of modelling extended audio objects for audio rendering in a virtual or augmented reality environment is described. The method comprises obtaining an extent representation indicative of a geometric form of an extended audio object and information relating to one or more first audio sources that are associated with the extended audio object. Furthermore, the method comprises obtaining a relative point on the geometric form of the extended audio object based on a user position in the virtual or augmented reality environment. The method also comprises determining an extent parameter for the extent representation based on the user position and the relative point and determining positions of one or more second audio sources, relative to the user position, for modelling the extended audio object. In addition, the method comprises outputting a modified representation of the extended audio object for modelling the extended audio object.
A method for encoding envelope information is provided. In some implementations, the method involves determining a first downmixed signal associated with a downmixed channel associated with an audio signal to be encoded. In some implementations, the method involves determining energy levels of the first downmixed signal for a plurality of frequency bands. In some implementations, the method involves determining whether to encode information indicative of the energy levels in a bitstream. In some implementations, the method involves encoding the determined energy levels. In some implementations, the method involves generating an energy control value indicating that energy levels are encoded. In some implementations, the method involves generating the bitstream, wherein the energy control value and the information indicative of the energy levels are usable by a decoder to adjust energy levels associated with the first downmixed signal.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
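The per-band energy determination and encode/skip decision described above can be sketched as follows. This is a toy illustration only: the magnitude-spectrum input, the band edges, and the 1.5 dB change threshold are assumptions, not details taken from the abstract.

```python
import math

def band_energies(spectrum, band_edges):
    """Mean energy (dB) per frequency band of a downmixed signal.

    Hypothetical illustration: `band_edges` are (start, stop) bin-index
    pairs into a magnitude spectrum; a real codec would operate on an
    MDCT/QMF analysis rather than this toy magnitude list."""
    energies = []
    for start, stop in band_edges:
        e = sum(m * m for m in spectrum[start:stop]) / max(stop - start, 1)
        energies.append(10.0 * math.log10(e + 1e-12))
    return energies

def should_encode(energies, prev_energies, threshold_db=1.5):
    """Encode envelope info only when it changed audibly since the last frame."""
    if prev_energies is None:
        return True  # first frame: nothing to reuse at the decoder
    return any(abs(a - b) > threshold_db for a, b in zip(energies, prev_energies))

spectrum = [1.0, 0.8, 0.5, 0.4, 0.1, 0.05]  # toy magnitude spectrum
bands = [(0, 2), (2, 4), (4, 6)]
env = band_energies(spectrum, bands)
ctrl = should_encode(env, None)             # energy control value for frame 1
```

The encode/skip decision rule here is a plausible stand-in; the abstract only states that the method "determines whether to encode" the energy information.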
3.
MULTI-BAND DUCKING OF AUDIO SIGNALS
A method for multi-band ducking of audio signals is provided. In some implementations, the method involves receiving, at a decoder, an input audio signal, wherein the input audio signal is a downmixed audio signal. In some implementations, the method involves separating the input audio signal into a first set of frequency bands. In some implementations, the method involves determining a set of ducking gains, a ducking gain corresponding to a frequency band of the first set of frequency bands. In some implementations, the method involves generating at least one broadband decorrelated audio signal, wherein ducking gains of the set of ducking gains are applied to at least one of: 1) a second set of frequency bands prior to generating the at least one broadband decorrelated audio signal; or 2) a third set of frequency bands into which the at least one broadband decorrelated audio signal is separated.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
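The core operation of applying one ducking gain per frequency band can be sketched as follows. The band-signal representation and the example gain values are assumptions for illustration; the abstract does not specify how the gains are derived.

```python
def apply_ducking(band_signals, ducking_gains):
    """Apply one ducking gain per frequency band (hypothetical sketch;
    a real system derives the gains from signal analysis and may apply
    them before or after decorrelation, per the abstract)."""
    assert len(band_signals) == len(ducking_gains)
    return [[g * x for x in band] for band, g in zip(band_signals, ducking_gains)]

# Toy example: two bands, duck the low band to half amplitude.
bands = [[1.0, -1.0, 0.5], [0.2, 0.2, 0.2]]
gains = [0.5, 1.0]
ducked = apply_ducking(bands, gains)
```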
4.
PROJECTION SYSTEM AND METHOD OF DRIVING A PROJECTION SYSTEM WITH FIELD MAPPING
A projection system includes a light source configured to emit light in response to image data, a phase light modulator configured to receive the light from the light source and to apply a spatially-varying phase modulation to the light, thereby generating a projection light and steering the light onto a reconstruction field, wherein the reconstruction field is a complex plane on which a reconstruction image is formed, and a controller configured to control the light source, control the phase light modulator, initialize (401) the reconstruction field to an initial value, and iteratively, for each of a plurality of subframes within a frame of the image data: set (402) the reconstruction field to the initial value for the first iteration or to a subsequent-iteration reconstruction field value for any subsequent iteration, map (403) the reconstruction field to a modulation field, wherein the modulation field is a complex plane of the phase light modulator which modulates a phase of the light, set (404) an amplitude of the modulation field to a predetermined value, and map (405) the modulation field, with the amplitude set to the predetermined value, to a subsequent-iteration reconstruction field, wherein the controller is further configured to provide (408) a phase control signal based on the modulation field mapped in the last iteration to the phase light modulator.
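The iterative loop described above closely resembles Gerchberg-Saxton phase retrieval. A minimal sketch follows, assuming (purely for illustration) that the mapping between the reconstruction field and the modulation field is a discrete Fourier transform pair and that the predetermined modulator amplitude is unity; the actual optical mapping in such a system would differ.

```python
import numpy as np

def phase_retrieval(target_amplitude, n_iters=20, seed=0):
    """Gerchberg-Saxton-style iteration mirroring the described loop:
    initialize the reconstruction field, map it to the modulator plane
    (inverse FFT as an assumed stand-in for the optical mapping), force
    the amplitude to 1 (phase-only modulator), and map back."""
    rng = np.random.default_rng(seed)
    field = target_amplitude * np.exp(
        1j * rng.uniform(0.0, 2.0 * np.pi, target_amplitude.shape))
    for _ in range(n_iters):
        modulation = np.fft.ifft2(field)                # reconstruction -> modulator plane
        modulation = np.exp(1j * np.angle(modulation))  # amplitude set to predetermined value (1)
        field = np.fft.fft2(modulation)                 # modulator -> reconstruction plane
        field = target_amplitude * np.exp(1j * np.angle(field))  # impose target amplitude
    return np.angle(modulation)                         # phase control signal

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0  # toy target reconstruction image
phase = phase_retrieval(target)
```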
A method for performing gain control on audio signals is provided. In some implementations, the method involves determining downmixed signals associated with one or more downmix channels associated with a current frame of an audio signal to be encoded. In some implementations, the method involves determining whether an overload condition exists for an encoder. In some implementations, the method involves determining a gain parameter. In some implementations, the method involves determining at least one gain transition function based on the gain parameter and a gain parameter associated with a preceding frame of the audio signal. In some implementations, the method involves applying the at least one gain transition function to one or more of the downmixed signals. In some implementations, the method involves encoding the downmixed signals in connection with information indicative of gain control applied to the current frame.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
G10L 19/22 - Mode decision, i.e. based on audio signal content versus external parameters
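A gain transition function bridging the preceding frame's gain and the current frame's gain can be sketched as a per-sample ramp. The linear shape and the example values are assumptions; the abstract does not specify the transition shape.

```python
def gain_transition(prev_gain, gain, frame_len):
    """Linear per-sample ramp from the previous frame's gain parameter
    to the current one (one plausible transition function; hypothetical
    illustration, not the codec's actual shape)."""
    if frame_len == 1:
        return [gain]
    step = (gain - prev_gain) / (frame_len - 1)
    return [prev_gain + i * step for i in range(frame_len)]

def apply_gain(samples, prev_gain, gain):
    """Apply the transition function to one downmixed signal frame."""
    ramp = gain_transition(prev_gain, gain, len(samples))
    return [g * x for g, x in zip(ramp, samples)]

# Overload detected: attenuate from unity to 0.5 across a 4-sample frame.
out = apply_gain([1.0, 1.0, 1.0, 1.0], 1.0, 0.5)
```

Ramping across the frame avoids the audible discontinuity that a hard gain switch at the frame boundary would cause.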
Described is a method of audio processing in a HbbTV terminal device. The method includes receiving a decoded broadcast feed including a first audio track, receiving HbbTV content relating to the broadcast feed, the HbbTV content including a second audio track, extracting level-related information from the decoded broadcast feed, wherein the level-related information is embedded in the decoded broadcast feed and enables obtaining an indication of an original audio level of the first audio track, analyzing the first audio track for determining an actual audio level of the first audio track, determining a gain factor based on the actual audio level and the original audio level, and generating a third audio track for output by the HbbTV terminal device based on the first audio track, the second audio track, and the gain factor. Also described is an apparatus for carrying out the method, as well as corresponding programs and computer-readable storage media.
H04N 21/462 - Content or additional data management e.g. creating a master electronic program guide from data received from the Internet and a Head-end or controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
H04N 21/434 - Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams or extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
H04H 20/10 - Arrangements for replacing or switching information during the broadcast or during the distribution
H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to MPEG-4 scene graphs
H04N 21/458 - Scheduling content for creating a personalised stream, e.g. by combining a locally stored advertisement with an incoming stream; Updating operations, e.g. for OS modules
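The gain-factor step described above (restoring the first track toward its original level before combining tracks) can be sketched in a few lines. The dB-domain level measure is an assumption; the abstract does not name a specific loudness measure such as LUFS.

```python
def gain_factor_db(original_level_db, actual_level_db):
    """Gain (dB) that restores the first audio track to its original
    level, from the embedded level-related information and the measured
    actual level (hypothetical sketch of the step in the abstract)."""
    return original_level_db - actual_level_db

def db_to_linear(db):
    """Convert a dB gain to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

# Track measured 6 dB quieter than its embedded original level:
g = db_to_linear(gain_factor_db(-23.0, -29.0))
```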
A projection system and method includes a light source configured to emit light in response to image data; a phase light modulator configured to receive the light from the light source and to apply a spatially-varying phase modulation to the light, thereby steering the light and generating a projection light; and a controller configured to dynamically determine, based on at least one of a user input or a sensor signal, a target geometry of a projection surface on which the projection light is projected, determine, based on the target geometry, a phase configuration for a frame of the image data, and provide a phase control signal to the phase light modulator, the phase control signal configured to cause the phase light modulator to generate the projection light in accordance with the phase configuration for the frame.
Disclosed herein are methods, systems, and computer program products for segmenting a binaural recording of speech into parts containing self-speech and parts containing external speech, and processing each category with different settings, to obtain an enhanced overall presentation. The segmentation is based on a combination of: i) feature-based frame-by-frame classification, and ii) detecting dissimilarity by statistical methods. The segmentation information is then used by a speech enhancement chain, where independent settings are used to process the self- and external speech parts.
G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups specially adapted for particular use for comparison or discrimination
G10L 25/87 - Detection of discrete points within a voice signal
A method of audio processing includes performing spatial analysis on a binaural signal to estimate level differences and phase differences characteristic of a binaural filter of the binaural signal, and performing object extraction on the binaural signal using the estimated level and phase differences to generate a left/right main component signal and a left/right residual component signal. The system may process the left/right main and left/right residual components differently using different object processing parameters, e.g. for repositioning, equalization, compression, upmixing, channel remapping or storage, to generate a processed binaural signal that provides an improved listening experience. Repositioning may be based on head tracking sensor data.
H04S 5/00 - Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
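The spatial-analysis step above, estimating per-bin level and phase differences between the left and right channels, can be sketched as follows. An STFT front end producing complex spectra is assumed; the abstract does not fix the analysis transform.

```python
import cmath
import math

def level_and_phase_differences(left_bins, right_bins):
    """Per-bin interaural level difference (dB) and phase difference
    (radians) from complex spectra of a binaural signal (sketch of the
    spatial-analysis step; the small epsilon guards empty bins)."""
    ilds, ipds = [], []
    for l, r in zip(left_bins, right_bins):
        ilds.append(20.0 * math.log10((abs(l) + 1e-12) / (abs(r) + 1e-12)))
        ipds.append(cmath.phase(l * r.conjugate()))
    return ilds, ipds

# Bin 0: left twice as loud; bin 1: left leads right by 90 degrees.
left = [2.0 + 0.0j, 1.0j]
right = [1.0 + 0.0j, 1.0 + 0.0j]
ilds, ipds = level_and_phase_differences(left, right)
```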
10.
METHOD AND APPARATUS FOR PROCESSING OF AUDIO DATA USING A PRE-CONFIGURED GENERATOR
Described herein is a method for setting up a decoder for generating processed audio data from an audio bitstream, the decoder comprising a Generator of a Generative Adversarial Network, GAN, for processing of the audio data, wherein the method includes the steps of (a) pre-configuring the Generator for processing of audio data with a set of parameters for the Generator, the parameters being determined by training, at training time, the Generator using the full concatenated distribution; and (b) pre-configuring the decoder to determine, at decoding time, a truncation mode for modifying the concatenated distribution and to apply the determined truncation mode to the concatenated distribution. Described are further a method of generating processed audio data from an audio bitstream using a Generator of a Generative Adversarial Network, GAN, for processing of the audio data and a respective apparatus. Moreover, described are also respective systems and computer program products.
G10L 19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
G06N 3/04 - Architecture, e.g. interconnection topology
Some methods may involve receiving a first content stream that includes first audio signals, rendering the first audio signals to produce first audio playback signals, generating first calibration signals, generating first modified audio playback signals by inserting the first calibration signals into the first audio playback signals, and causing a loudspeaker system to play back the first modified audio playback signals, to generate first audio device playback sound. The method(s) may involve receiving microphone signals corresponding to at least the first audio device playback sound and to second through Nth audio device playback sound corresponding to second through Nth modified audio playback signals (including second through Nth calibration signals) played back by second through Nth audio devices, extracting second through Nth calibration signals from the microphone signals and estimating at least one acoustic scene metric based, at least partly, on the second through Nth calibration signals.
Some methods involve causing a plurality of audio devices in an audio environment to reproduce audio data, each audio device of the plurality of audio devices including at least one loudspeaker and at least one microphone, determining audio device location data including an audio device location for each audio device of the plurality of audio devices and obtaining microphone data from each audio device of the plurality of audio devices. Some methods involve determining a mutual audibility for each audio device of the plurality of audio devices relative to each other audio device of the plurality of audio devices, determining a user location of a person in the audio environment, determining a user location audibility of each audio device of the plurality of audio devices at the user location and controlling one or more aspects of audio device playback based, at least in part, on the user location audibility.
Disclosed is an audio signal encoding/decoding method that uses an encoding downmix strategy applied at an encoder that is different from a decoding re-mix/upmix strategy applied at a decoder. Based on the type of downmix coding scheme, the method comprises: computing input downmixing gains to be applied to the input audio signal to construct a primary downmix channel; determining downmix scaling gains to scale the primary downmix channel; generating prediction gains based on the input audio signal, the input downmixing gains and the downmix scaling gains; determining residual channel(s) from the side channels by using the primary downmix channel and the prediction gains to generate side channel predictions and subtracting the side channel predictions from the side channels; determining decorrelation gains based on energy in the residual channels; encoding the primary downmix channel, the residual channel(s), the prediction gains and the decorrelation gains into a bitstream; and sending the bitstream to a decoder.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
H04S 5/00 - Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
G10L 19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
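The prediction/residual step above can be sketched for real-valued sample vectors: predict each side channel as a scaled copy of the primary downmix channel and keep the residual. The least-squares gain choice and the flat (band-free) signal model are illustrative assumptions.

```python
def encode_step(primary, side_channels):
    """Sketch of the prediction/residual step: for each side channel,
    find a prediction gain g (least-squares fit, an assumption), predict
    the side channel as g * primary, and keep the residual."""
    prediction_gains, residuals = [], []
    energy = sum(x * x for x in primary) + 1e-12  # guard against silence
    for side in side_channels:
        g = sum(p * s for p, s in zip(primary, side)) / energy
        prediction_gains.append(g)
        residuals.append([s - g * p for p, s in zip(primary, side)])
    return prediction_gains, residuals

primary = [1.0, 2.0, 3.0]
side = [[2.0, 4.0, 6.0]]  # perfectly predictable side channel
gains, residuals = encode_step(primary, side)
```

When the side channel is well predicted, the residual carries little energy, which is what makes transmitting gains plus residuals cheaper than transmitting the side channel directly.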
A method may involve: receiving direction of arrival (DOA) data corresponding to sound emitted by at least a first smart audio device of the audio environment that includes a first audio transmitter and a first audio receiver, the DOA data corresponding to sound received by at least a second smart audio device of the audio environment that includes a second audio transmitter and a second audio receiver, the DOA data corresponding to sound emitted by at least the second smart audio device and received by at least the first smart audio device; receiving one or more configuration parameters corresponding to the audio environment, to one or more audio devices, or both; and minimizing a cost function based at least in part on the DOA data and the configuration parameter(s), to estimate a position and an orientation of at least the first smart audio device and the second smart audio device.
Method for encoding scene-based audio is provided. In some implementations, the method involves determining, by an encoder, a spatial direction of a dominant sound component in a frame of an input audio signal. In some implementations, the method involves determining rotation parameters based on the determined spatial direction and a direction preference of a coding scheme to be used to encode the input audio signal. In some implementations, the method involves rotating sound components of the frame based on the rotation parameters such that, after being rotated, the dominant sound component has a spatial direction that aligns with the direction preference of the coding scheme. In some implementations, the method involves encoding the rotated sound components of the frame of the input audio signal using the coding scheme in connection with an indication of the rotation parameters or an indication of the spatial direction of the dominant sound component.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
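The rotation step above, aligning the dominant sound direction with the coding scheme's preferred direction, can be sketched in two dimensions on the X/Y first-order ambisonic components. The 2-D restriction and the x-axis as the assumed direction preference are simplifications; a real scene-based codec would use full 3-D rotations.

```python
import math

def rotation_to_align(azimuth):
    """2-D rotation matrix that maps a dominant sound direction at
    `azimuth` (radians) onto the x-axis (the assumed direction
    preference of the coding scheme)."""
    c, s = math.cos(-azimuth), math.sin(-azimuth)
    return [[c, -s], [s, c]]

def rotate_xy(x, y, rot):
    """Apply the rotation to the X/Y sound components of one frame."""
    return (rot[0][0] * x + rot[0][1] * y,
            rot[1][0] * x + rot[1][1] * y)

# Dominant component arriving from 90 degrees rotates onto the x-axis:
rot = rotation_to_align(math.pi / 2)
x, y = rotate_xy(0.0, 1.0, rot)
```

The decoder can invert the rotation from the transmitted rotation parameters (or the signalled spatial direction), so the operation is lossless apart from quantization.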
16.
AUTOMATIC GENERATION AND SELECTION OF TARGET PROFILES FOR DYNAMIC EQUALIZATION OF AUDIO CONTENT
In an embodiment, a method comprises: filtering reference audio content items to separate the reference audio content items into different frequency bands; for each frequency band, extracting a first feature vector from at least a portion of each of the reference audio content items, wherein the first feature vector includes at least one audio characteristic of the reference audio content items; obtaining at least one semantic label from at least a portion of each of the reference audio content items; obtaining a second feature vector consisting of the first feature vectors per frequency band and the at least one semantic label; generating, based on the second feature vector, cluster feature vectors representing centroids of clusters; separating the reference audio content items according to the cluster feature vectors; and computing an average target profile for each cluster based on the reference audio content items in the cluster.
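The clustering step above, grouping per-item feature vectors and averaging a target profile per cluster, can be sketched with a tiny k-means. Initializing centroids from the first k items and the toy 2-D features are assumptions for illustration; the abstract does not specify the clustering algorithm's details.

```python
def kmeans(vectors, k, iters=10):
    """Tiny k-means over per-item feature vectors (audio characteristics
    per band plus an encoded semantic label, per the abstract). The
    returned centroids play the role of cluster feature vectors, and the
    per-cluster mean is the average target profile."""
    centroids = [list(v) for v in vectors[:k]]  # assumed initialization
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            groups[d.index(min(d))].append(v)
        for i, g in enumerate(groups):
            if g:  # centroid = mean of assigned vectors
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids, groups

# Four toy feature vectors forming two obvious clusters:
feats = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]
centroids, clusters = kmeans(feats, 2)
```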
Described herein is a computer-implemented deep-learning-based system for determining an indication of an audio quality of an input audio frame. The system comprises at least one inception block configured to receive at least one representation of an input audio frame and to map the at least one representation of the input audio frame into a feature map; and at least one fully connected layer configured to receive a feature map corresponding to the at least one representation of the input audio frame from the at least one inception block, wherein the at least one fully connected layer is configured to determine the indication of the audio quality of the input audio frame. Described are further respective methods of operating and training said system.
G10L 25/60 - Speech or voice analysis techniques not restricted to a single one of groups specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
G06N 3/04 - Architecture, e.g. interconnection topology
18.
SIGNAL CODING USING A GENERATIVE MODEL AND LATENT DOMAIN QUANTIZATION
The present disclosure provides a decoder configured to receive a finite bitrate stream that includes a quantized latent frame, where the quantized latent frame includes a quantized representation of a current frame of a signal in a latent domain different from a first domain; to generate a reconstructed latent frame from the quantized latent frame; to use a generative neural network model to perform a task for which the generative neural network model has been trained, wherein the task includes to generate parameters for an invertible mapping from the latent domain to the first domain; to reconstruct a current frame of the signal in the first domain, which includes to map the reconstructed latent frame to the first domain by use of the invertible mapping, and to use the reconstructed current frame of the signal in the first domain to update a state of the generative neural network model.
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
G06N 3/04 - Architecture, e.g. interconnection topology
19.
A GENERATIVE NEURAL NETWORK MODEL FOR PROCESSING AUDIO SAMPLES IN A FILTER-BANK DOMAIN
A neural network system is provided, implementing a generative model for autoregressively generating a probability distribution for a plurality of current filter-bank samples of an audio signal, wherein the current samples correspond to a current time slot, and each current sample corresponds to a channel of the filter-bank. The system includes a hierarchy of a plurality of neural network processing tiers ordered from a top to a bottom tier, each tier trained to generate conditioning information based on previous filter-bank samples and, for at least each tier but the top tier, also on the conditioning information from a tier higher up in the hierarchy, and an output stage trained to generate the probability distribution based on previous samples for one or more previous time slots and the conditioning information from the lowest processing tier.
G06N 3/04 - Architecture, e.g. interconnection topology
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
A neural network system for predicting frequency coefficients of a media signal, the neural network system comprising a time predicting portion including at least one neural network trained to predict a first set of output variables representing a specific frequency band of a current time frame given coefficients of one or several previous time frames, and a frequency predicting portion including at least one neural network trained to predict a second set of output variables representing a specific frequency band given coefficients of one or several frequency bands adjacent to the specific frequency band in said current time frame. Such a neural network system forms a predictor capable of capturing both temporal and frequency dependencies occurring in time-frequency tiles of a media signal.
G10L 19/04 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
G10L 21/038 - Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques
Described is a method of training a deep-learning-based system for sound source separation. The system comprises a separation stage for frame-wise extraction of representations of sound sources from a representation of an audio signal, and a clustering stage for generating, for each frame, a vector indicative of an assignment permutation of extracted frames of representations of sound sources to respective sound sources. The representation of the audio signal is a waveform-based representation. The separation stage is trained using frame-level permutation invariant training. Further, the clustering stage is trained to generate embedding vectors for the frames of the audio signal that allow determination of estimates of respective assignment permutations between extracted sound signals and labels of sound sources that had been used for the frames. Also described is a method of using the deep-learning-based system for sound source separation.
G10L 21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
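The frame-level permutation invariant training criterion named above can be sketched as follows: score every assignment of estimated sources to reference sources and keep the best one. The mean-squared-error distance is an assumption; the abstract does not name the per-source loss.

```python
from itertools import permutations

def pit_loss(estimates, references):
    """Permutation invariant training loss for one frame: mean-squared
    error under the best assignment of estimated sources to reference
    sources (sketch of the training criterion named in the abstract).
    Brute-force over permutations, which is fine for few sources."""
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    best = None
    for perm in permutations(range(len(references))):
        loss = sum(mse(estimates[i], references[p]) for i, p in enumerate(perm))
        best = loss if best is None or loss < best else best
    return best / len(references)

refs = [[1.0, 0.0], [0.0, 1.0]]
ests = [[0.0, 1.0], [1.0, 0.0]]  # sources swapped: PIT loss is still zero
loss = pit_loss(ests, refs)
```

Because the loss is minimized over permutations, the network is not penalized for emitting the sources in a different order than the labels, which is exactly the ambiguity the clustering stage then resolves.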
The present disclosure relates to a method and system for performing packet loss concealment using a neural network system. The method comprises obtaining a representation of an incomplete audio signal, inputting the representation of the incomplete audio signal to an encoder neural network and outputting a latent representation of a predicted complete audio signal. The latent representation is input to a decoder neural network which outputs a representation of a predicted complete audio signal comprising a reconstruction of the lost portion of the complete audio signal, wherein said encoder neural network and said decoder neural network have been trained with an adversarial neural network.
G10L 19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
23.
METHOD AND APPARATUS FOR GENERATING AN INTERMEDIATE AUDIO FORMAT FROM AN INPUT MULTICHANNEL AUDIO SIGNAL
Described herein is a method for training a machine learning algorithm. The method may comprise receiving a first input multichannel audio signal. The method may comprise generating, using the machine learning algorithm, an intermediate audio signal based on the first input multichannel audio signal. The method may comprise rendering the intermediate audio signal into a first output multichannel audio signal. Further, the method may comprise improving the machine learning algorithm based on a difference between the first input multichannel audio signal and the first output multichannel audio signal. Described herein are further an apparatus for generating an intermediate audio format from an input multichannel audio signal as well as a respective computer program product comprising a computer-readable storage medium with instructions adapted to carry out said method when executed by a device having processing capability.
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
24.
METHOD AND APPARATUS FOR NEURAL NETWORK BASED PROCESSING OF AUDIO USING SINUSOIDAL ACTIVATION
Described herein is a method of processing an audio signal using a deep-learning-based generator, wherein the method includes the steps of: (a) inputting the audio signal into the generator for processing the audio signal; (b) mapping a time segment of the audio signal to a latent feature space representation, using an encoder stage of the generator; (c) upsampling the latent feature space representation using a decoder stage of the generator, wherein at least one layer of the decoder stage applies sinusoidal activation; and (d) obtaining, as an output from the decoder stage of the generator, a processed audio signal. Described are further a method for training said generator and respective apparatus, systems and computer program products.
In some embodiments, a method comprises: dividing, using at least one processor, an audio input into speech and non-speech segments; for each frame in each non-speech segment, estimating, using the at least one processor, a time-varying noise spectrum of the non-speech segment; for each frame in each speech segment, estimating, using the at least one processor, a speech spectrum of the speech segment; for each frame in each speech segment, identifying one or more non-speech frequency components in the speech spectrum; comparing the one or more non-speech frequency components with one or more corresponding frequency components in a plurality of estimated noise spectra and selecting the estimated noise spectrum from the plurality of estimated noise spectra based on a result of the comparing.
The present invention relates to a method and device for processing a first and a second audio signal representing an input binaural audio signal acquired by a binaural recording device. The present invention further relates to a method for rendering a binaural audio signal on a speaker system. The method for processing a binaural signal comprises extracting audio information from the first audio signal, computing band gains for reducing noise in the first audio signal, and applying the band gains to respective frequency bands of the first audio signal in accordance with a dynamic scaling factor, to provide a first output audio signal. The dynamic scaling factor has a value between zero and one and is selected so as to reduce quality degradation of the first audio signal.
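Applying band gains in accordance with a scaling factor in [0, 1] can be sketched as a blend between unity gain and the full noise-reduction gains. The linear blend is one plausible reading of "in accordance with"; the abstract does not fix the exact rule.

```python
def apply_band_gains(band_signals, band_gains, scaling):
    """Blend noise-reduction band gains with unity according to a
    dynamic scaling factor in [0, 1]: 0 leaves the signal untouched,
    1 applies the full gains (hypothetical sketch of the step in the
    abstract)."""
    assert 0.0 <= scaling <= 1.0
    out = []
    for band, g in zip(band_signals, band_gains):
        eff = (1.0 - scaling) * 1.0 + scaling * g  # effective gain per band
        out.append([eff * x for x in band])
    return out

bands = [[1.0, 1.0], [1.0, 1.0]]
half = apply_band_gains(bands, [0.2, 0.8], 0.5)  # halfway blend
```

Scaling the gains back toward unity trades residual noise for fewer processing artifacts, which matches the stated goal of reducing quality degradation.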
A method comprising receiving a first input bit stream for a first parametrically coded input audio signal, the first input bit stream including data representing a first input core audio signal and a first set including at least one spatial parameter relating to the first parametrically coded input audio signal. A first covariance matrix of the first parametrically coded audio signal is determined based on the spatial parameter(s) of the first set. A modified set including at least one spatial parameter is determined based on the determined first covariance matrix, wherein the modified set is different from the first set. An output core audio signal is determined, which is based on, or constituted by, the first input core audio signal. An output bit stream for a parametrically coded output audio signal is generated, the output bit stream including data representing the output core audio signal and the modified set.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
Described is a method of performing automatic audio enhancement on an input audio signal including at least one speech-articulation noise event. The method comprises: segmenting the input audio signal into a number of audio frames; obtaining at least one feature parameter from the audio frames; and determining, based at least in part on the obtained feature parameter, a respective type of the speech-articulation noise event and a respective time-frequency range associated with the speech-articulation noise event within the input audio signal.
G10L 15/04 - Segmentation; Word boundary detection
G10L 21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
G10L 25/93 - Discriminating between voiced and unvoiced parts of speech signals
G10L 21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
G10L 25/09 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being zero crossing rates
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being power information
G10L 25/24 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being the cepstrum
G10L 25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
G10L 21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
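The front end of the detection pipeline described above (segmentation plus feature parameters) can be sketched as follows; the two features, zero-crossing rate and log power, are chosen to match the parameter types named in the classification codes and are illustrative only:

```python
import numpy as np

def segment(signal, frame_len, hop):
    """Split the input audio signal into overlapping analysis frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def frame_features(frame):
    """Two classic feature parameters a noise-event classifier could use:
    zero-crossing rate and log power (assumed choices)."""
    frame = np.asarray(frame, dtype=float)
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    power = float(np.log10(np.mean(frame ** 2) + 1e-12))
    return zcr, power
```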
29.
HUM NOISE DETECTION AND REMOVAL FOR SPEECH AND MUSIC RECORDINGS
Described are methods of processing audio data for hum noise detection and/or removal. The audio data comprises a plurality of frames. One method includes: classifying frames of the audio data as either content frames or noise frames, using one or more content activity detectors; determining a noise spectrum from one or more frames of the audio data that are classified as noise frames; determining one or more hum noise frequencies based on the determined noise spectrum; generating an estimated hum noise signal based on the one or more hum noise frequencies; and removing hum noise from at least one frame of the audio data based on the estimated hum noise signal. Also described are apparatus for carrying out the methods, as well as corresponding programs and computer-readable storage media.
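The detection and removal steps can be sketched spectrally; averaging the noise-frame spectrum, picking peaks, and zeroing the peak bins are crude stand-ins for the abstract's hum-signal estimation and subtraction:

```python
import numpy as np

def detect_hum_bins(noise_frames, n_fft, n_peaks=2):
    """Average the magnitude spectrum over frames classified as noise and
    pick the strongest bins as hum candidates (simplified sketch)."""
    spec = np.zeros(n_fft // 2 + 1)
    for f in noise_frames:
        spec += np.abs(np.fft.rfft(f, n_fft))
    return np.argsort(spec)[-n_peaks:]

def remove_hum(frame, hum_bins, n_fft):
    """Zero the detected hum bins (a crude stand-in for subtracting an
    estimated hum noise signal) and resynthesize the frame."""
    spec = np.fft.rfft(frame, n_fft)
    spec[list(hum_bins)] = 0.0
    return np.fft.irfft(spec, n_fft)[: len(frame)]
```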
Described are methods of processing an audio signal for packet loss concealment. The audio signal comprises a sequence of frames, each frame containing representations of a plurality of audio channels and reconstruction parameters for upmixing the plurality of audio channels to a predefined channel format. One method includes: receiving the audio signal; and generating a reconstructed audio signal in the predefined channel format based on the received audio signal. Generating the reconstructed audio signal comprises: determining whether at least one frame of the audio signal has been lost; and, if a number of consecutively lost frames exceeds a first threshold, fading the reconstructed audio signal to a predefined spatial configuration. Also described is a method of encoding an audio signal. Yet further described are apparatus for carrying out the methods, as well as corresponding programs and computer-readable storage media.
G10L 19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
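The fade logic of the concealment method above can be sketched as a per-frame gain; the linear fade shape and its length are assumptions, since the abstract only requires fading once the threshold is exceeded:

```python
def fade_gain(consecutive_lost, first_threshold, fade_frames):
    """Per-frame gain toward a predefined spatial configuration: unity
    while the number of consecutively lost frames is at or below the
    first threshold, then a linear fade over fade_frames frames
    (assumed fade shape)."""
    if consecutive_lost <= first_threshold:
        return 1.0
    steps = consecutive_lost - first_threshold
    return max(0.0, 1.0 - steps / fade_frames)
```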
A deep-learning-based system for performing automated multitrack mixing based on a plurality of input audio tracks is described herein. The system comprises one or more instances of a deep-learning-based first network and one or more instances of a deep-learning-based second network. Particularly, the first network is configured to, based on the input audio tracks, generate parameters for use in the automated multitrack mixing. The second network is configured to, based on the parameters, apply signal processing and at least one mixing gain to the input audio tracks, for generating an output mix of the audio tracks.
G10H 1/00 - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE - Details of electrophonic musical instruments
H04H 60/04 - Studio equipment; Interconnection of studios
Described is a method of training a neural-network-based system for determining an indication of an audio quality of an audio input. The method includes obtaining, as input, at least one training set comprising audio samples. The audio samples include audio samples of a first type and audio samples of a second type, wherein each of the first type of audio samples is labelled with information indicative of a respective predetermined audio quality metric, and wherein each of the second type of audio samples is labelled with information indicative of a respective audio quality metric relative to that of a reference audio sample. The method further includes: inputting the training set to the neural-network-based system; and iteratively training the system to predict the respective label information of the audio samples in the training set.
G10L 25/69 - Speech or voice analysis techniques not restricted to a single one of groups specially adapted for particular use for evaluating synthetic or decoded voice signals
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
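The two label types in the training set above suggest two loss terms; the squared-error and pairwise formulations below are assumptions, since the abstract does not specify the loss:

```python
def absolute_loss(pred, target):
    """Squared error against a predetermined audio quality metric
    (first type of labelled audio sample)."""
    return (pred - target) ** 2

def relative_loss(pred, pred_ref, target_delta):
    """For the second label type, penalize mismatch between the predicted
    quality gap to the reference sample and the labelled relative metric
    (assumed pairwise formulation)."""
    return ((pred - pred_ref) - target_delta) ** 2
```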
33.
PERCEPTUAL OPTIMIZATION OF MAGNITUDE AND PHASE FOR TIME-FREQUENCY AND SOFTMASK SOURCE SEPARATION SYSTEMS
A method comprises: obtaining softmask values for frequency bins of time-frequency tiles representing an audio signal; reducing, or expanding and limiting, the softmask values; and applying the reduced, or expanded and limited, softmask values to the frequency bins to create a time-frequency representation of an estimated target source. An alternative method comprises, for each time-frequency tile: obtaining softmask values; applying the softmask values to the frequency bins to create a time-frequency domain representation of an estimated target source; obtaining a panning parameter estimate and a source phase concentration estimate for the target source; determining, using the panning parameter estimate and the softmask values, a magnitude for the time-frequency representation of the estimated target source; determining, using the panning parameter estimate and the source phase concentration estimate, a phase for the time-frequency representation of the estimated target source; and combining the magnitude and the phase.
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
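The mask shaping and application steps of the first method can be sketched as follows; the power-law shaping is an assumption, since the abstract only says the softmask values are "reduced, or expanded and limited":

```python
import numpy as np

def shape_softmask(mask, gamma, limit=1.0):
    """Reduce (gamma > 1) or expand (gamma < 1) softmask values by a
    power law, then limit to 'limit' (assumed shaping rule)."""
    return np.minimum(np.asarray(mask, dtype=float) ** gamma, limit)

def apply_softmask(tile, mask):
    """Multiply each frequency bin of a time-frequency tile by its mask
    value to estimate the target source."""
    return np.asarray(tile) * np.asarray(mask)
```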
34.
FRAME LOSS CONCEALMENT FOR A LOW-FREQUENCY EFFECTS CHANNEL
A method of generating a substitution frame for a lost audio frame of an audio signal is presented. The method may comprise determining an audio filter based on samples of a valid audio frame preceding the lost audio frame. The method may comprise generating the substitution frame based on the audio filter and the samples of the valid audio frame preceding the lost audio frame. The method may be advantageously applied to a low frequency effects (LFE) channel of a multi-channel audio signal.
G10L 19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
G10L 25/12 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being prediction coefficients
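A minimal sketch of the substitution-frame idea above: fit a linear predictor to the last valid frame and extrapolate. The least-squares fit stands in for whatever filter-determination the patent uses (Levinson-Durbin on autocorrelation would be typical):

```python
import numpy as np

def lpc_coeffs(x, order):
    """Least-squares linear predictor fitted to samples of the valid
    frame preceding the lost frame (assumed fitting method)."""
    X = np.array([x[i:i + order] for i in range(len(x) - order)])
    y = np.array(x[order:])
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def substitution_frame(prev_frame, order, n):
    """Extrapolate n samples past the valid frame with the predictor,
    giving a substitution frame for the lost frame."""
    a = lpc_coeffs(prev_frame, order)
    hist = list(prev_frame)
    out = []
    for _ in range(n):
        s = float(np.dot(a, hist[-order:]))
        out.append(s)
        hist.append(s)
    return out
```

Low prediction orders tend to suffice for an LFE channel, whose content is band-limited to low frequencies.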
35.
METHODS, APPARATUS, AND SYSTEMS FOR DETECTION AND EXTRACTION OF SPATIALLY-IDENTIFIABLE SUBBAND AUDIO SOURCES
In an embodiment, a method comprises: transforming one or more frames of a two-channel time domain audio signal into a time-frequency domain representation including a plurality of time-frequency tiles, wherein the frequency domain of the time-frequency domain representation includes a plurality of frequency bins grouped into subbands. For each time-frequency tile, the method comprises: calculating spatial parameters and a level for the time-frequency tile; modifying the spatial parameters using shift and squeeze parameters; obtaining a softmask value for each frequency bin using the modified spatial parameters, the level and subband information; and applying the softmask values to the time-frequency tile to generate a modified time-frequency tile of an estimated audio source. In an embodiment, a plurality of frames of the time-frequency tiles are assembled into a plurality of chunks, wherein each chunk includes a plurality of subbands, and the method described above is performed on each subband of each chunk.
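The shift-and-squeeze modification and the mapping to a softmask value can be sketched per parameter; both the mapping and the Gaussian window are assumptions suggested by the names in the abstract:

```python
import math

def modify_spatial(param, shift, squeeze):
    """Shift a per-tile spatial parameter (e.g. a panning index) and
    squeeze its spread (assumed interpretation of 'shift and squeeze')."""
    return (param - shift) * squeeze

def softmask_from_spatial(param, target, width):
    """Map the modified parameter to a softmask value in (0, 1] with a
    Gaussian window around the target direction (illustrative choice)."""
    return math.exp(-((param - target) ** 2) / (2.0 * width ** 2))
```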
Described herein is a method of determining parameters for a generative neural network for processing an audio signal, wherein the generative neural network includes an encoder stage mapping to a coded feature space and a decoder stage, each stage including a plurality of convolutional layers with one or more weight coefficients, the method comprising a plurality of cycles with sequential processes of: pruning the weight coefficients of either or both stages based on pruning control information, the pruning control information determining the number of weight coefficients that are pruned for respective convolutional layers; training the pruned generative neural network based on a set of training data; determining a loss for the trained and pruned generative neural network based on a loss function; and determining updated pruning control information based on the determined loss and a target loss. Further described are corresponding apparatus, programs, and computer-readable storage media.
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
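The pruning cycle above can be sketched with two pieces: a per-layer pruning step and an update of the pruning control information from the loss. Magnitude pruning and the simple feedback rule are assumptions; the abstract specifies only that control information sets how many weights are pruned and is updated from the determined and target losses:

```python
def prune_smallest(weights, prune_fraction):
    """Zero out the smallest-magnitude fraction of a layer's weight
    coefficients (assumed magnitude criterion)."""
    n_prune = int(len(weights) * prune_fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def update_prune_fraction(fraction, loss, target_loss, step=0.05):
    """Prune more aggressively while the loss stays within target,
    back off otherwise (assumed feedback rule)."""
    if loss <= target_loss:
        return min(1.0, fraction + step)
    return max(0.0, fraction - step)
```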
An audio bitstream is decoded into audio objects and audio metadata for the audio objects. The audio objects include a specific audio object. The audio metadata specifies frame-level gains that include a first gain and a second gain respectively for a first audio frame and a second audio frame. It is determined, based on the first and second gains, whether sub-frame gains are to be generated for the specific audio object. If so, a ramp length is determined for a ramp used to generate the sub-frame gains for the specific audio object. The ramp of the ramp length is used to generate the sub-frame gains for the specific audio object. A sound field represented by the audio objects with the sub-frame gains is rendered by audio speakers.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
H04S 3/00 - Systems employing more than two channels, e.g. quadraphonic
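The ramp above can be sketched as sub-frame gain interpolation between consecutive frame-level gains; the linear shape is an assumed choice, since the abstract specifies only a ramp of some determined length:

```python
def subframe_gains(g_prev, g_next, ramp_length, n_subframes):
    """Interpolate sub-frame gains with a linear ramp of ramp_length
    sub-frames from the first frame's gain to the second frame's gain,
    holding the target gain afterwards."""
    gains = []
    for k in range(1, n_subframes + 1):
        t = min(k / ramp_length, 1.0)
        gains.append(g_prev + (g_next - g_prev) * t)
    return gains
```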
38.
METHOD AND UNIT FOR PERFORMING DYNAMIC RANGE CONTROL
The present document describes a dynamic range control unit (210) configured to apply dynamic range control, referred to as DRC, to an audio signal (211). The DRC unit (210) is configured to downsample a subband signal (212) derived from the audio signal (211), to provide a downsampled subband signal (321), to determine a DRC gain (329) based on the downsampled subband signal (321), and to apply the DRC gain (329) to the subband signal (212), to provide a compressed subband signal (213) of a compressed audio signal (214).
G10L 21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
G10L 21/0364 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
H03G 7/00 - Volume compression or expansion in amplifiers
H03G 9/02 - Combinations of two or more types of control, e.g. gain control and tone control in untuned amplifiers
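The downsample-then-compress structure of the DRC unit can be sketched as follows; the peak-based level estimate and the standard static compression curve are illustrative stand-ins for the patent's gain determination:

```python
import numpy as np

def drc_gain(level_db, threshold_db, ratio):
    """Static compression curve: above threshold, the output level grows
    at 1/ratio of the input rate; returns the gain in dB."""
    if level_db <= threshold_db:
        return 0.0
    return (threshold_db - level_db) * (1.0 - 1.0 / ratio)

def compress_subband(subband, decimation, threshold_db, ratio):
    """Downsample the subband signal, derive a DRC gain from the
    downsampled signal's level, and apply it to the full-rate subband."""
    down = np.asarray(subband)[::decimation]
    level_db = 20.0 * np.log10(np.max(np.abs(down)) + 1e-12)
    g_db = drc_gain(level_db, threshold_db, ratio)
    return np.asarray(subband) * (10.0 ** (g_db / 20.0))
```

Computing the gain on the downsampled signal reduces the cost of the level/gain computation while the gain is still applied at the full subband rate.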
39.
METHODS AND APPARATUS FOR UNIFIED SPEECH AND AUDIO DECODING IMPROVEMENTS
Described herein are methods, apparatus and computer products for decoding an encoded MPEG-D USAC bitstream with reduced computational complexity.
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
Described herein is a method for improving dialogue intelligibility during playback of audio data on a playback device, wherein the audio data comprise dialogue audio data, and at least one of music and effects audio data, the method including the steps of: determining a volume mixing ratio based on a volume value for playback; mixing the dialogue audio data and the at least one of music and effects audio data based on said volume mixing ratio; and outputting the mixed audio data for playback. Described are further a respective playback device and a respective computer program product.
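The volume-dependent mixing can be sketched as follows; the linear mapping, its breakpoints, and the attenuation floor are illustrative assumptions, since the abstract only ties the mixing ratio to the playback volume value:

```python
def mix_for_intelligibility(dialogue, background, volume,
                            low_vol=0.2, high_vol=0.8):
    """Attenuate music-and-effects relative to dialogue as the playback
    volume drops, improving dialogue intelligibility at low volumes
    (assumed mapping and breakpoints)."""
    if volume >= high_vol:
        me_gain = 1.0
    elif volume <= low_vol:
        me_gain = 0.5
    else:
        me_gain = 0.5 + 0.5 * (volume - low_vol) / (high_vol - low_vol)
    return [d + me_gain * b for d, b in zip(dialogue, background)]
```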
Computer-implemented methods and devices for combined audio separation and classification are provided. An estimated separated signal is time gated based on a determination of an audio classifier of, at least in part, the original mix of signals before separation. Combined separation, classification, and time gating of both the estimated signal and a residual signal are also provided.
Described herein is a method of generating, in a dynamic range reduced domain, an enhanced multi-channel audio signal from an audio bitstream including a multi-channel audio signal, wherein the multi-channel audio signal comprises two or more channels, and wherein the method includes jointly enhancing the two or more channels of the dynamic range reduced raw multi-channel audio signal using a multi-channel Generator of a Generative Adversarial Network setting. Described herein are further a method for training a multi-channel Generator in a dynamic range reduced domain in a Generative Adversarial Network setting, an apparatus for generating, in a dynamic range reduced domain, an enhanced multi-channel audio signal from an audio bitstream including a multi-channel audio signal, respective systems and a computer program product.
Described herein is a method of processing audio content for rendering in a three-dimensional audio scene, wherein the audio content comprises a sound source at a source position, the method comprising: obtaining a voxelized representation of the three-dimensional audio scene, wherein the voxelized representation indicates volume elements in which sound can propagate and volume elements by which sound is occluded; generating a two-dimensional projection map for the audio scene based on the voxelized representation by applying a projection operation to the voxelized representation that projects onto a horizontal plane; and determining parameters indicating a virtual source position of a virtual sound source based on the source position, a listener position, and the projection map, to simulate, by rendering a virtual source signal from the virtual source position, an impact of acoustic diffraction by the three-dimensional audio scene on a source signal of the sound source at the source position. Described are moreover a corresponding apparatus as well as corresponding computer program products.
A63F 13/54 - Controlling the output signals based on the game progress involving acoustic signals, e.g. for simulating revolutions per minute [RPM] dependent engine sounds in a driving game or reverberation against a virtual wall
Embodiments are disclosed for automatic leveling of speech content. In an embodiment, a method comprises: receiving, using one or more processors, frames of an audio recording including speech and non-speech content; for each frame: determining, using the one or more processors, a speech probability; analyzing, using the one or more processors, a perceptual loudness of the frame; obtaining, using the one or more processors, a target loudness range for the frame; computing, using the one or more processors, gains to apply to the frame based on the target loudness range and the perceptual loudness analysis, where the gains include dynamic gains that change frame-by-frame and that are scaled based on the speech probability; and applying the gains to the frame so that a resulting loudness range of the speech content in the audio recording fits within the target loudness range.
G10L 21/0364 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
G10L 17/00 - Speaker identification or verification
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
G10L 25/78 - Detection of presence or absence of voice signals
H03G 3/32 - Automatic control in amplifiers having semiconductor devices the control being dependent upon ambient noise level or sound level
A method of audio processing includes generating harmonics in a hybrid complex quadrature mirror filter domain. Generating the harmonics may include multiplication, using a feedback delay loop, and dynamic compression. The harmonics may be generated based on one or more hybrid sub-bands of the complex transform domain signal.
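The multiplication step can be illustrated with a time-domain toy: squaring a tone yields its second harmonic plus a DC term. The patent performs the multiplication per sub-band in the hybrid complex QMF domain, with feedback delay and compression omitted here:

```python
def add_harmonics(x, gain=0.5):
    """Generate harmonics by multiplying the signal with itself and
    mixing the product back with the input (time-domain simplification
    of the patent's per-sub-band processing)."""
    return [s + gain * s * s for s in x]
```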
Described herein is a method for controlling media data playout on a client device, wherein the method includes the steps of: (a) retrieving, by the client device, media data comprising a plurality of segments subdivided into one or more chunks for playout from at least one media server; (b) analyzing a current chunk of the one or more chunks of a current segment; and (c) adapting the playout of the media data in response to the result of the analysis prior to fully retrieving the current chunk. Described herein are further a client device having implemented a media player application configured to perform said method and a computer program product with instructions adapted to cause a device having processing capability to carry out said method.
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
Embodiments are disclosed for noise floor estimation and noise reduction. In an embodiment, a method comprises: obtaining an audio signal; dividing the audio signal into a plurality of buffers; determining time-frequency samples for each buffer of the audio signal; for each buffer and for each frequency, determining a median (or mean) and a measure of an amount of variation of energy based on the samples in the buffer and samples in neighboring buffers that together span a specified time range of the audio signal; combining the median (or mean) and the measure of the amount of variation of energy into a cost function; for each frequency: determining a signal energy of a particular buffer of the audio signal that corresponds to a minimum value of the cost function; selecting the signal energy as the estimated noise floor of the audio signal; and reducing, using the estimated noise floor, noise in the audio signal.
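The per-frequency selection step can be sketched as follows; the one-buffer neighborhood, the standard deviation as the variation measure, and the additive combination are illustrative assumptions:

```python
import numpy as np

def estimate_noise_floor(energies, weight=1.0):
    """For one frequency: combine each buffer's local median energy and
    a measure of variation (here the standard deviation over a small
    neighborhood) into a cost, then return the energy of the buffer
    minimizing that cost as the estimated noise floor."""
    energies = np.asarray(energies, dtype=float)
    costs = []
    for i in range(len(energies)):
        lo, hi = max(0, i - 1), min(len(energies), i + 2)
        nb = energies[lo:hi]
        costs.append(float(np.median(nb) + weight * np.std(nb)))
    i_min = int(np.argmin(costs))
    return float(energies[i_min])
```

Buffers dominated by signal score a high median; buffers on signal boundaries score high variation; steady low-energy buffers, the likely noise floor, minimize the cost.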
A method for adaptive streaming of media content with bitrate switching is described, wherein the media content comprises a plurality of consecutive media segments. The method comprises, at a media streaming server: transmitting a segment of the media content encoded in a first coding mode having a first bitrate; receiving an indication for a coding mode switch to a second coding mode having a second bitrate and, in response, transmitting a transition segment for transitioning between the first coding mode and the second coding mode; and transmitting another segment of the media content encoded in the second coding mode.
H04N 21/2343 - Processing of video elementary streams, e.g. splicing of video streams or manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
H04N 21/845 - Structuring of content, e.g. decomposing content into time segments
H04N 21/8543 - Content authoring using a description language, e.g. MHEG [Multimedia and Hypermedia information coding Expert Group] or XML [eXtensible Markup Language]
49.
PROJECTION SYSTEM AND METHOD OF DRIVING A PROJECTION SYSTEM
A projection system, and a method of driving a projection system, are described. The system includes a light source configured to emit light in response to image data; a phase light modulator configured to receive the light from the light source and to apply a spatially-varying phase modulation to the light; and a controller configured to determine, for a frame of the image data, a plurality of phase configurations, respective ones of the plurality of phase configurations corresponding to solutions of a phase algorithm and representing the same image with a different modulation pattern, and to provide a phase control signal to the phase light modulator, the phase control signal configured to cause the phase light modulator to modulate the plurality of phase configurations in a time-divisional manner within a time period of the frame, thereby projecting a series of subframes within the time period.
Embodiments are disclosed for channel-based audio (CBA) (e.g., 22.2-ch audio) to object-based audio (OBA) conversion. The conversion includes converting CBA metadata to object audio metadata (OAMD) and reordering the CBA channels based on channel shuffle information derived in accordance with channel ordering constraints of the OAMD. The OBA with reordered channels is rendered in a playback device using the OAMD or in a source device, such as a set-top box or audio/video recorder. In an embodiment, the CBA metadata includes signaling that indicates a specific OAMD representation to be used in the conversion of the metadata. In an embodiment, pre-computed OAMD is transmitted in a native audio bitstream (e.g., AAC) for transmission (e.g., over HDMI) or for rendering in a source device. In an embodiment, pre-computed OAMD is transmitted in a transport layer bitstream (e.g., ISO BMFF, MPEG4 audio bitstream) to a playback device or source device.
H04S 3/00 - Systems employing more than two channels, e.g. quadraphonic
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
The present application describes a method (400) for providing personalized audio to a user. The method (400) comprises receiving (401) a manifest file (140) for a media element from which audio is to be rendered, wherein the manifest file (140) comprises a description (141) for a plurality of different presentations (152) of audio content of the media element. In addition, the method (400) comprises selecting (402) a presentation (152) from the plurality of presentations (152) based on the manifest file (140). The method (400) further comprises receiving (403) a list of audio track objects comprised within the media element, and selecting (404) an audio track object from the list of audio track objects, in dependence of the selected presentation (152).
H04N 21/2343 - Processing of video elementary streams, e.g. splicing of video streams or manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
H04N 21/84 - Generation or processing of descriptive data, e.g. content descriptors
H04N 21/45 - Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies
52.
METHODS AND DEVICES FOR PERSONALIZING AUDIO CONTENT
The present document describes a method (400) for personalizing audio content. The method (400) comprises receiving (401) a manifest file (140) for the audio content. The manifest file (140) comprises at least one adaptation set (281, 282) referencing an audio bitstream (121), where the audio bitstream (121) comprises a plurality of audio objects (181), and a plurality of different preselection elements (291, 292, 293) for the adaptation set (281, 282), wherein the different preselection elements (291, 292, 293) specify different combinations of the plurality of audio objects (181). The method (400) further comprises selecting (402) a preselection element (291) from the plurality of different preselection elements (291, 292, 293), and causing (403) rendering of an audio signal which depends on the selected preselection element (291).
H04N 21/485 - End-user interface for client configuration
H04N 21/462 - Content or additional data management e.g. creating a master electronic program guide from data received from the Internet and a Head-end or controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
H04N 21/262 - Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission or generating play-lists
A speech separation server comprises a deep-learning encoder with nonlinear activation. The encoder is programmed to take a mixture audio waveform in the time domain, learn generalized patterns from the mixture audio waveform, and generate an encoded representation that effectively characterizes the mixture audio waveform for speech separation.
Described herein is a method of waveform decoding, the method including the steps of: (a) receiving, by a waveform decoder, a bitstream including a finite bitrate representation of a source signal; (b) waveform decoding the finite bitrate representation of the source signal to obtain a waveform approximation of the source signal; (c) providing the waveform approximation of the source signal to a generative model that implements a probability density function, to obtain a probability distribution for a reconstructed signal of the source signal; and (d) generating the reconstructed signal of the source signal based on the probability distribution. Described are further a method and system for waveform coding and a method of training a generative model.
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
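Step (d) of the method above can be illustrated with a toy stand-in: each reconstructed sample is drawn from a conditional distribution centered on the decoded waveform approximation. A Gaussian replaces the generative model's learned probability density here:

```python
import random

def reconstruct(waveform_approx, scale, seed=0):
    """Sample each reconstructed-signal value from a distribution
    conditioned on the waveform approximation (Gaussian used as a toy
    stand-in for the generative model's density)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, scale) for mu in waveform_approx]
```

With scale 0 the reconstruction collapses to the waveform approximation itself; a trained model's conditional density would instead restore detail lost to the finite-bitrate representation.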
Described herein is a method of encoding an audio signal. The method comprises: generating a plurality of subband audio signals based on the audio signal; determining a spectral envelope of the audio signal; for each subband audio signal, determining autocorrelation information for the subband audio signal based on an autocorrelation function of the subband audio signal; and generating an encoded representation of the audio signal, the encoded representation comprising a representation of the spectral envelope of the audio signal and a representation of the autocorrelation information for the plurality of subband audio signals. Further described are methods of decoding the audio signal from the encoded representation, as well as corresponding encoders, decoders, computer programs, and computer-readable recording media.
G10L 25/06 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being correlation coefficients
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
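The per-subband autocorrelation information named above can be sketched as normalized autocorrelation coefficients up to some maximum lag; the normalization by lag-zero energy is an assumed convention:

```python
import numpy as np

def autocorr(x, max_lag):
    """Normalized autocorrelation of one subband signal up to max_lag,
    the kind of per-band information the encoder would represent."""
    x = np.asarray(x, dtype=float)
    r0 = float(np.dot(x, x))
    return [float(np.dot(x[:-k], x[k:])) / r0 for k in range(1, max_lag + 1)]
```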
56.
METHODS AND DEVICES FOR GENERATION AND PROCESSING OF MODIFIED AUDIO BITSTREAMS
Described herein is a method for generating a modified bitstream on a source device, wherein the method includes the steps of: a) receiving, by a receiver, a bitstream including coded media data; b) generating, by an embedder, payload of additional media data and embedding the payload in the bitstream for obtaining, as an output from the embedder, a modified bitstream including the coded media data and the payload of the additional media data; and c) outputting the modified bitstream to a sink device. Described is further a method for processing said modified bitstream on a sink device. Described are moreover a respective source device and sink device as well as a system of a source device and a sink device and respective computer program products.
H04N 21/2389 - Multiplex stream processing, e.g. multiplex stream encrypting
H04N 21/435 - Processing of additional data, e.g. decrypting of additional data or reconstructing software from modules extracted from the transport stream
H04N 7/24 - Systems for the transmission of television signals using pulse code modulation
57.
METHODS AND DEVICES FOR GENERATION AND PROCESSING OF MODIFIED BITSTREAMS
Described herein is a method for generating a modified bitstream on a source device, wherein the method includes the steps of: a) receiving, by a receiver, a bitstream including coded media data; b) generating, by an embedder, payload of additional media data and embedding the payload in the bitstream for obtaining, as an output from the embedder, a modified bitstream including the coded media data and the payload of the additional media data; and c) outputting the modified bitstream to a sink device. Described is further a method for processing said modified bitstream on a sink device. Described are moreover a respective source device and sink device as well as a system of a source device and a sink device and respective computer program products.
A rendering mode may be determined for received audio data, including audio signals and associated spatial data. The audio data may be rendered for reproduction via a set of loudspeakers of an environment according to the rendering mode, to produce rendered audio signals. Rendering the audio data may involve determining relative activation of a set of loudspeakers in an environment. The rendering mode may be variable between a reference spatial mode and one or more distributed spatial modes. The reference spatial mode may have an assumed listening position and orientation. In the distributed spatial mode(s), one or more elements of the audio data may each be rendered in a more spatially distributed manner than in the reference spatial mode and spatial locations of remaining elements of the audio data may be warped such that they span a rendering space of the environment more completely than in the reference spatial mode.
An audio session management method for an audio environment having multiple audio devices may involve receiving, from a first device implementing a first application and by a device implementing an audio session manager, a first route initiation request to initiate a first route for a first audio session. The first route initiation request may indicate a first audio source and a first audio environment destination. The first audio environment destination may correspond with at least a first person in the audio environment, but in some instances will not indicate an audio device. The method may involve establishing a first route corresponding to the first route initiation request. Establishing the first route may involve determining a first location of at least the first person in the audio environment, determining at least one audio device for a first stage of the first audio session and initiating or scheduling the first audio session.
An audio processing method may involve receiving output signals from each microphone of a plurality of microphones in an audio environment, the output signals corresponding to a current utterance of a person and determining, based on the output signals, one or more aspects of context information relating to the person, including an estimated current proximity of the person to one or more microphone locations. The method may involve selecting two or more loudspeaker-equipped audio devices based, at least in part, on the one or more aspects of the context information, determining one or more types of audio processing changes to apply to audio data being rendered to loudspeaker feed signals for the audio devices and causing one or more types of audio processing changes to be applied. In some examples, the audio processing changes have the effect of increasing a speech to echo ratio at one or more microphones.
H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
H04M 1/60 - Substation equipment, e.g. for use by subscribers including speech amplifiers
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
H04R 3/02 - Circuits for transducers for preventing acoustic reaction
H04R 1/40 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
Methods for rendering audio for playback by two or more speakers are disclosed. The audio includes one or more audio signals, each with an associated intended perceived spatial position. Relative activation of the speakers may be determined by optimizing a cost function that combines a model of perceived spatial position of the audio signals when played back over the speakers, a measure of proximity of the intended perceived spatial position of the audio signals to positions of the speakers, and one or more additional dynamically configurable functions. The dynamically configurable functions may be based on at least one or more properties of the audio signals, one or more properties of the set of speakers and/or one or more external inputs.
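As a rough illustration of cost-function-based speaker activation (a sketch, not the patented method), one can trade off spatial error against a distance-weighted proximity penalty and solve for the gains in closed form. The function name, the quadratic penalty, and `proximity_w` are assumptions for this sketch.

```python
import numpy as np

def speaker_gains(speaker_pos, target_pos, proximity_w=0.1):
    """Minimize ||sum_i g_i p_i - target||^2 + sum_i w_i g_i^2, where
    w_i grows with each speaker's distance from the target position
    (a hypothetical proximity penalty)."""
    speaker_pos = np.asarray(speaker_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    d = np.linalg.norm(speaker_pos - target_pos, axis=1)
    W = np.diag(proximity_w * d ** 2)
    P = speaker_pos.T                       # 3 x N matrix of speaker positions
    A = P.T @ P + W + 1e-9 * np.eye(len(d))  # small ridge for stability
    g = np.linalg.solve(A, P.T @ target_pos)
    return np.clip(g, 0.0, None)            # activations are non-negative
```

A speaker coincident with the target receives nearly all of the activation, while distant speakers are penalized toward zero.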
A multi-stream rendering system and method may render and play simultaneously a plurality of audio program streams over a plurality of arbitrarily placed loudspeakers. At least one of the program streams may be a spatial mix. The rendering of said spatial mix may be dynamically modified as a function of the simultaneous rendering of one or more additional program streams. The rendering of one or more additional program streams may be dynamically modified as a function of the simultaneous rendering of the spatial mix.
Individual loudspeaker dynamics processing configuration data, for each of a plurality of loudspeakers of a listening environment, may be obtained. Listening environment dynamics processing configuration data may be determined, based on the individual loudspeaker dynamics processing configuration data. Dynamics processing may be performed on received audio data based on the listening environment dynamics processing configuration data, to generate processed audio data. The processed audio data may be rendered for reproduction via a set of loudspeakers that includes at least some of the plurality of loudspeakers, to produce rendered audio signals. The rendered audio signals may be provided to, and reproduced by, the set of loudspeakers.
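One plausible way to derive listening-environment dynamics configuration data from the individual loudspeaker data, shown here as a hypothetical "weakest link" policy over per-band limiter thresholds (the abstract does not specify the combination rule; averaging or weighted blends are equally plausible):

```python
def combined_limiter_thresholds(per_speaker_db):
    """Combine per-loudspeaker, per-band limiter thresholds (in dB)
    into one listening-environment configuration by taking, in each
    frequency band, the most restrictive (minimum) threshold."""
    return [min(band) for band in zip(*per_speaker_db)]
```

Performing dynamics processing once with the combined thresholds, before rendering, avoids individual per-speaker limiters acting independently and distorting the spatial image.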
An audio session management method may involve: determining, by an audio session manager, one or more first media engine capabilities of a first media engine of a first smart audio device, the first media engine being configured for managing one or more audio media streams received by the first smart audio device and for performing first smart audio device signal processing for the one or more audio media streams according to a first media engine sample clock; receiving, by the audio session manager and via a first application communication link, first application control signals from the first application; and controlling the first smart audio device according to the first media engine capabilities, by the audio session manager, via first audio session management control signals transmitted to the first smart audio device via a first smart audio device communication link and without reference to the first media engine sample clock.
The present document discloses a method for playback of media content via a delivery channel. The delivery channel may generally refer to the channels through which audio or video programs are delivered (transmitted) to the user (receiver). The media content may generally comprise consecutive media programs. In particular, for a specific media program within the media content, a respective content type for that specific media program is also provided. The method may comprise receiving an indication of the sensitivity of a media program to playback latency. The method may further comprise receiving at least a portion of the media program. The method may yet further comprise adapting the playback of the media program based on the indication of its sensitivity to playback latency.
H04N 21/61 - Network physical structure; Signal processing
H04N 21/24 - Monitoring of processes or resources, e.g. monitoring of server load, available bandwidth or upstream requests
H04N 21/262 - Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission or generating play-lists
H04N 21/235 - Processing of additional data, e.g. scrambling of additional data or processing content descriptors
H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to MPEG-4 scene graphs
66.
PRESENTATION INDEPENDENT MASTERING OF AUDIO CONTENT
A method for generating mastered audio content, the method comprising: obtaining an input audio content comprising a number, M1, of audio signals; obtaining a rendered presentation of the input audio content, the rendered presentation comprising a number, M2, of audio signals; obtaining a mastered presentation generated by mastering the rendered presentation; comparing the mastered presentation with the rendered presentation to determine one or more indications of differences between them; and modifying one or more of the audio signals of the input audio content based on the indications of differences to generate the mastered audio content. With this approach, conventional, typically stereo, channel-based mastering tools can be used to provide a mastered version of any input audio content, including object-based immersive audio content.
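The comparison step above could, for example, measure per-band gain differences between the mastered and rendered presentations, which are then applied back onto the input audio signals. The sketch below computes such offsets; splitting the spectrum into equal FFT-bin bands is an assumption for illustration.

```python
import numpy as np

def band_gain_offsets(rendered, mastered, n_bands=4):
    """Per-band RMS gain of the mastered presentation relative to the
    rendered one; applying these offsets to the input audio signals
    would approximate the mastering on the original content."""
    R = np.abs(np.fft.rfft(rendered)) ** 2
    M = np.abs(np.fft.rfft(mastered)) ** 2
    bands = np.array_split(np.arange(len(R)), n_bands)
    return [float(np.sqrt((M[b].sum() + 1e-12) / (R[b].sum() + 1e-12)))
            for b in bands]
```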
The present disclosure relates to a method of processing audio content including directivity information for at least one sound source, the directivity information comprising a first set of first directivity unit vectors representing directivity directions and associated first directivity gains. The disclosure further relates to corresponding methods of encoding and decoding audio content including directivity information for at least one sound source.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
Dialogue enhancement of an audio signal comprises obtaining a set of time-varying parameters configured to estimate a dialogue component present in said audio signal, estimating the dialogue component from the audio signal, applying a compressor only to the estimated dialogue component to generate a processed dialogue component, and applying a user-determined gain to the processed dialogue component to provide an enhanced dialogue component. The processing of the estimated dialogue component may be performed on the decoder side or the encoder side. The invention enables improved dialogue enhancement.
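The chain of estimate, compress, then apply user gain can be sketched as follows. The static compressor curve, threshold, ratio, and all parameter names are assumptions; the actual parameters are time-varying per the abstract.

```python
import numpy as np

def enhanced_dialogue(x, params, gain_db=6.0, thresh_db=-20.0, ratio=4.0):
    """Estimate the dialogue component as a parametric mix of the input
    channels, compress only that estimate, then apply the user gain."""
    d = params @ x                               # dialogue estimate
    thresh = 10.0 ** (thresh_db / 20.0)
    mag = np.maximum(np.abs(d), 1e-12)
    # static compression above the threshold (hypothetical curve)
    comp = np.where(mag > thresh, (thresh / mag) ** (1.0 - 1.0 / ratio), 1.0)
    return d * comp * 10.0 ** (gain_db / 20.0)
```

Because the compressor acts only on the dialogue estimate, the non-dialogue content is untouched when the enhanced component is mixed back.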
Described herein is a method of generating a media bitstream to transmit parameters for updating a neural network implemented in a decoder, wherein the method includes the steps of: (a) determining at least one set of parameters for updating the neural network; (b) encoding the at least one set of parameters and media data to generate the media bitstream; and (c) transmitting the media bitstream to the decoder for updating the neural network with the at least one set of parameters. Described herein are further a method for updating a neural network implemented in a decoder, an apparatus for generating a media bitstream to transmit parameters for updating a neural network implemented in a decoder, an apparatus for updating a neural network implemented in a decoder and computer program products comprising a computer-readable storage medium with instructions adapted to cause the device to carry out said methods when executed by a device having processing capability.
A system and method comprise a light source; a spatial light modulator including a substantially transparent material layer and a phase modulation layer; an imaging device configured to receive light from the light source as reflected by the spatial light modulator, and to generate image data; and a controller. The controller provides a phase-drive signal to the spatial light modulator and determines an attenuating wavefront of the substantially transparent material layer based on the image data.
G09G 3/00 - Control arrangements or circuits, of interest only in connection with visual indicators other than cathode-ray tubes
G02B 26/06 - Optical devices or arrangements for the control of light using movable or deformable optical elements for controlling the phase of light
G02F 1/01 - Devices or arrangements for the control of the intensity, colour, phase, polarisation or direction of light arriving from an independent light source, e.g. switching, gating or modulating; Non-linear optics for the control of the intensity, phase, polarisation or colour
H03H 1/00 - Constructional details of impedance networks whose electrical mode of operation is not specified or applicable to more than one type of network
G09G 3/36 - Control arrangements or circuits, of interest only in connection with visual indicators other than cathode-ray tubes for presentation of an assembly of a number of characters, e.g. a page, by composing the assembly by combination of individual elements arranged in a matrix by control of light from an independent source using liquid crystals
71.
METHOD, APPARATUS AND SYSTEM FOR HYBRID SPEECH SYNTHESIS
A method of decoding an original speech signal for hybrid adversarial-parametric speech synthesis comprising: (a) receiving quantized original linear prediction coding parameters estimated by applying linear prediction coding analysis filtering to an original speech signal and a quantized compressed representation of a residual of the original speech signal; (b) dequantizing the original linear prediction coding parameters and the compressed representation of the residual; (c) inputting the dequantized compressed representation of the residual into a decoder part of a Generator for applying adversarial mapping from the compressed residual domain to a fake (first) signal domain; (d) outputting, by the decoder part of the Generator, a fake speech signal; (e) applying linear prediction coding analysis filtering to the fake speech signal for obtaining a corresponding fake residual; (f) reconstructing the original speech signal by applying linear prediction coding cross-synthesis filtering to the fake residual and the dequantized original linear prediction coding parameters.
A method of encoding audio content comprises performing a content analysis of the audio content, generating classification information indicative of a content type of the audio content based on the content analysis, encoding the audio content and the classification information in a bitstream, and outputting the bitstream. A method of decoding audio content from a bitstream including audio content and classification information for the audio content, wherein the classification information is indicative of a content classification of the audio content, comprises receiving the bitstream, decoding the audio content and the classification information, and selecting, based on the classification information, a post processing mode for performing post processing of the decoded audio content. Selecting the post processing mode can involve calculating one or more control weights for post processing of the decoded audio content based on the classification information.
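The final step, calculating control weights from classification information, can be illustrated with a hypothetical mapping. The post-processing module names and formulas below are illustrative assumptions, not taken from the abstract.

```python
def control_weights(p_speech, p_music, p_effects):
    """Map decoded content-classification probabilities to control
    weights for downstream post-processing modules (names assumed)."""
    return {
        "dialogue_enhancer": p_speech,
        "intelligent_eq": max(p_music, p_effects),
        "virtualizer": 1.0 - p_speech,
    }
```

Driving post-processing from weights rather than hard decisions lets the decoder cross-fade smoothly as the content classification changes over time.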
The disclosure herein generally relates to capturing, acoustic pre-processing, encoding, decoding, and rendering of directional audio of an audio scene. In particular, it relates to a device adapted to modify a directional property of a captured directional audio in response to spatial data of a microphone system capturing the directional audio. The disclosure further relates to a rendering device configured to modify a directional property of a received directional audio in response to received spatial data.
There is provided encoding and decoding methods for representing spatial audio that is a combination of directional sound and diffuse sound. An exemplary encoding method includes inter alia creating a single- or multi-channel downmix audio signal by downmixing input audio signals from a plurality of microphones in an audio capture unit capturing the spatial audio; determining first metadata parameters associated with the downmix audio signal, wherein the first metadata parameters are indicative of one or more of: a relative time delay value, a gain value, and a phase value associated with each input audio signal; and combining the created downmix audio signal and the first metadata parameters into a representation of the spatial audio.
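The downmix-plus-first-metadata step can be sketched as aligning each microphone signal to a reference, recording the per-input delay and gain, and averaging into a mono downmix. This is a sketch only: phase parameters, multi-channel downmixes, and the actual metadata syntax are omitted, and the cross-correlation delay estimator is an assumption.

```python
import numpy as np

def downmix_with_metadata(inputs, ref=0):
    """Estimate per-input delay (via cross-correlation against a
    reference input) and gain, record them as metadata, and average
    the aligned signals into a mono downmix."""
    meta, aligned = [], []
    for x in inputs:
        xc = np.correlate(x, inputs[ref], mode="full")
        delay = int(np.argmax(xc)) - (len(x) - 1)
        gain = float(np.sqrt(np.sum(inputs[ref] ** 2) /
                             (np.sum(x ** 2) + 1e-12)))
        meta.append({"delay": delay, "gain": gain})
        aligned.append(gain * np.roll(x, -delay))
    return np.mean(aligned, axis=0), meta
```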
Described herein is a method of decoding an audio or speech signal, the method including the steps of: (a) receiving, by a decoder, a coded bitstream including the audio or speech signal and conditioning information; (b) providing, by a bitstream decoder, decoded conditioning information in a format associated with a first bitrate; (c) converting, by a converter, the decoded conditioning information from the format associated with the first bitrate to a format associated with a second bitrate; and (d) providing, by a generative neural network, a reconstruction of the audio or speech signal according to a probabilistic model conditioned by the conditioning information in the format associated with the second bitrate. Described are further an apparatus for decoding an audio or speech signal, a respective encoder, a system of the encoder and the apparatus for decoding an audio or speech signal as well as a respective computer program product.
G10L 19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
The present disclosure relates to the field of audio coding, and in particular to an audio decoder having at least two decoding modes, and associated decoding methods and decoding software for such an audio decoder. In one of the decoding modes, at least one dynamic audio object is mapped to a set of static audio objects, the set of static audio objects corresponding to a predefined speaker configuration. The present disclosure further relates to a corresponding audio encoder, and associated encoding methods and encoding software for such an audio encoder.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
The disclosed embodiments enable converting audio signals captured in various formats by various capture devices into a limited number of formats that can be processed by an audio codec (e.g., an Immersive Voice and Audio Services (IVAS) codec). In an embodiment, a simplification unit of the audio device receives an audio signal captured by one or more audio capture devices coupled to the audio device. The simplification unit determines whether the audio signal is in a format that is supported by an encoding unit of the audio device. Based on the determination, the simplification unit converts the audio signal into a format that is supported by the encoding unit. In an embodiment, if the simplification unit determines that the audio signal is in a spatial format, the simplification unit can convert the audio signal into a spatial "mezzanine" format supported by the encoding unit.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
78.
METHODS AND DEVICES FOR CONTROLLING AUDIO PARAMETERS
A method of controlling headphones having external microphone signal pass-through functionality may involve controlling a display to present a geometric shape on the display and receiving an indication of digit motion from a sensor system associated with the display. The sensor system may include a touch sensor system or a gesture sensor system. The indication may be an indication of a direction of digit motion relative to the display. The method may involve controlling the display to present a sequence of images indicating that the geometric shape either enlarges or contracts, depending on the direction of digit motion and changing a headphone transparency setting according to a current size of the geometric shape. The headphone transparency setting may correspond to an external microphone signal gain setting and/or a media signal gain setting of the headphones.
Described herein is a method of low-bitrate coding of audio data and generating enhancement metadata for controlling audio enhancement of the low-bitrate coded audio data at a decoder side, including the steps of: (a) core encoding original audio data at a low bitrate to obtain encoded audio data; (b) generating enhancement metadata to be used for controlling a type and/or amount of audio enhancement at the decoder side after core decoding the encoded audio data; and (c) outputting the encoded audio data and the enhancement metadata. Described is further an encoder configured to perform said method. Described is moreover a method for generating enhanced audio data from low-bitrate coded audio data based on enhancement metadata and a decoder configured to perform said method.
G10L 19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the analysis technique using neural networks
80.
METHODS, APPARATUS AND SYSTEMS FOR GENERATION, TRANSPORTATION AND PROCESSING OF IMMEDIATE PLAYOUT FRAMES (IPFS)
Described herein is an audio decoder for decoding a bitstream of encoded audio data, wherein the bitstream of encoded audio data represents a sequence of audio sample values and comprises a plurality of frames, wherein each frame comprises associated encoded audio sample values, the audio decoder comprising: a determiner configured to determine whether a frame of the bitstream of encoded audio data is an immediate playout frame comprising encoded audio sample values associated with a current frame and additional information; and an initializer configured to initialize the decoder if the determiner determines that the frame is an immediate playout frame, wherein initializing the decoder comprises decoding the encoded audio sample values comprised by the additional information before decoding the encoded audio sample values associated with the current frame. Described are further a method for decoding said bitstream of encoded audio data as well as an audio encoder, a system of audio encoders and a method for generating said bitstream of encoded audio data with immediate playout frames. Described are moreover also an apparatus for generating immediate playout frames in a bitstream of encoded audio data or for removing immediate playout frames from a bitstream of encoded audio data and respective non-transitory digital storage media.
Embodiments are directed to a companding method and system for reducing coding noise in an audio codec. A method of processing an audio signal includes the following operations. A system receives an audio signal. The system determines that a first frame of the audio signal includes a sparse transient signal. The system determines that a second frame of the audio signal includes a dense transient signal. The system compresses/expands (compands) the audio signal using a companding rule that applies a first companding exponent to the first frame of the audio signal and applies a second companding exponent to the second frame of the audio signal, each companding exponent being used to derive a respective degree of dynamic range compression and expansion for a corresponding frame. The system then provides the companded audio signal to a downstream device.
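A per-frame companding gain derived from an exponent can be sketched as below. The specific exponent values and the gain law `g = rms**(exponent - 1)` are assumptions for illustration; the abstract only states that each exponent determines a degree of compression and expansion.

```python
import numpy as np

def compand_frame(frame, exponent):
    """Companding gain g = rms**(exponent - 1): exponents below 1
    attenuate loud frames and boost quiet ones, shrinking the dynamic
    range before coding (the decoder applies the inverse to expand)."""
    rms = float(np.sqrt(np.mean(frame ** 2))) + 1e-12
    return frame * rms ** (exponent - 1.0)

def compand(signal, frame_len, classify, sparse_exp=0.65, dense_exp=0.5):
    """Choose the exponent per frame from a transient classifier, as the
    abstract does for sparse versus dense transient frames."""
    out = []
    for i in range(0, len(signal), frame_len):
        f = signal[i:i + frame_len]
        e = sparse_exp if classify(f) == "sparse" else dense_exp
        out.append(compand_frame(f, e))
    return np.concatenate(out)
```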
Described herein is a method for creating object-based audio content from a text input for use in audio books and/or audio play, the method including the steps of: a) receiving the text input; b) performing a semantic analysis of the received text input; c) synthesizing speech and effects based on one or more results of the semantic analysis to generate one or more audio objects; d) generating metadata for the one or more audio objects; and e) creating the object-based audio content including the one or more audio objects and the metadata. Described herein are further a computer-based system including one or more processors configured to perform said method and a computer program product comprising a computer-readable storage medium with instructions adapted to carry out said method when executed by a device having processing capability.
G10L 15/18 - Speech classification or search using natural language modelling
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
G10L 13/10 - Prosody rules derived from text; Stress or intonation
A method of playing out media from a media engine run on a receiving apparatus, the method comprising: at the receiving apparatus, receiving a media data structure comprising audio or video content formatted in a plurality of layers, including at least a first layer comprising the audio or video content encoded according to an audio or video encoding scheme respectively, and a second layer encapsulating the encoded content in one or more media containers according to a media container format; determining that at least one of the media containers further encapsulates runnable code for processing at least some of the formatting of the media data structure in order to support playout of the audio or video content by the media engine; running the code on a code engine of the receiving apparatus in order to perform the processing of the media data structure for input to the media engine.
H04N 21/434 - Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams or extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
H04N 21/439 - Processing of audio elementary streams
H04N 21/4402 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
The present document describes a method (500) for generating a bitstream (101), wherein the bitstream (101) comprises a sequence of superframes (400) for a sequence of frames of an immersive audio signal (111). The method (500) comprises, repeatedly for the sequence of superframes (400), inserting (501) coded audio data (206) for one or more frames of one or more downmix channel signals (203) derived from the immersive audio signal (111), into data fields (411, 421, 412, 422) of a superframe (400); and inserting (502) metadata (202, 205) for reconstructing one or more frames of the immersive audio signal (111) from the coded audio data (206), into a metadata field (403) of the superframe (400).
H04S 3/00 - Systems employing more than two channels, e.g. quadraphonic
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
86.
METHODS AND DEVICES FOR ENCODING AND/OR DECODING IMMERSIVE AUDIO SIGNALS
The present document describes a method (700) for encoding a multi-channel input signal (201). The method (700) comprises determining (701) a plurality of downmix channel signals (203) from the multi-channel input signal (201) and performing (702) energy compaction of the plurality of downmix channel signals (203) to provide a plurality of compacted channel signals (404). Furthermore, the method (700) comprises determining (703) joint coding metadata (205) based on the plurality of compacted channel signals (404) and based on the multi-channel input signal (201), wherein the joint coding metadata (205) is such that it allows upmixing of the plurality of compacted channel signals (404) to an approximation of the multi-channel input signal (201). In addition, the method (700) comprises encoding (704) the plurality of compacted channel signals (404) and the joint coding metadata (205).
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
G10L 19/04 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
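The energy compaction of the downmix channels in the abstract above can be illustrated as a KLT/PCA rotation, one classical realization of energy compaction (the abstract does not name the transform). The orthogonal rotation doubles as joint-coding metadata allowing upmixing back to an approximation of the input.

```python
import numpy as np

def compact_channels(downmix):
    """Rotate the downmix channels so energy concentrates in the
    leading compacted channels.  Returns the compacted signals and the
    orthogonal rotation, which serves as metadata for reconstruction."""
    cov = downmix @ downmix.T / downmix.shape[1]
    w, V = np.linalg.eigh(cov)
    V = V[:, np.argsort(w)[::-1]]       # strongest component first
    return V.T @ downmix, V
```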
Methods are described for generating an AV bitstream (e.g., an MPEG-2 transport stream or a bitstream segment having an adaptive streaming format) such that the AV bitstream includes at least one video I-frame synchronized with at least one audio I-frame, e.g., by re-authoring at least one video or audio frame (as a re-authored I-frame or a re-authored P-frame). Typically, a segment of content of the AV bitstream which includes the re-authored frame starts with an I-frame and includes at least one subsequent P-frame. Other aspects are methods for adapting such an AV bitstream, audio/video processing units configured to perform any embodiment of the inventive method, and audio/video processing units which include a buffer memory which stores at least one segment of an AV bitstream generated in accordance with any embodiment of the inventive method.
H04N 21/61 - Network physical structure; Signal processing
H04N 21/647 - Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load or bridging bet
88.
METHODS AND SYSTEMS FOR STREAMING MEDIA DATA OVER A CONTENT DELIVERY NETWORK
The present document describes a method (900) for establishing control information for a control policy of a client (102) for streaming data (103) from at least one server (101, 701). The method (900) comprises performing (901) a message passing process between a server agent of the server (101, 701) and a client agent of the client (102), in order to iteratively establish control information. Furthermore, the method (900) comprises generating (902) a convergence event for the message passing process to indicate that the control information has been established.
A method for decoding an encoded audio bitstream is disclosed. The method includes receiving the encoded audio bitstream and decoding the audio data to generate a decoded lowband audio signal. The method further includes extracting high frequency reconstruction metadata and filtering the decoded lowband audio signal with an analysis filterbank to generate a filtered lowband audio signal. The method also includes extracting a flag indicating whether either spectral translation or harmonic transposition is to be performed on the audio data and regenerating a highband portion of the audio signal using the filtered lowband audio signal and the high frequency reconstruction metadata in accordance with the flag. The high frequency regeneration is performed as a post-processing operation with a delay of 3010 samples per audio channel.
G10L 21/0388 - Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques - Details of processing therefor
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
G10L 19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
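The spectral-translation branch of the high-frequency reconstruction in the abstract above can be sketched as a "copy-up" patch: lowband bins are replicated above the crossover and shaped by an envelope gain. This is a bare-bones illustration; real reconstruction transmits per-band envelope metadata and may use harmonic transposition instead, as selected by the flag.

```python
import numpy as np

def spectral_translation(low_bins, xover, total, gain=1.0):
    """Regenerate a highband by copying lowband bins above the
    crossover bin and applying an envelope gain (single scalar here
    for simplicity; per-band gains in practice)."""
    spec = np.zeros(total, dtype=complex)
    spec[:xover] = low_bins[:xover]
    for k in range(xover, total):
        spec[k] = gain * low_bins[(k - xover) % xover]
    return spec
```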
90.
METHODS, APPARATUS AND SYSTEMS FOR ENCODING AND DECODING OF DIRECTIONAL SOUND SOURCES
Some disclosed methods involve encoding or decoding directional audio data. Some encoding methods may involve receiving a mono audio signal corresponding to an audio object and a representation of a radiation pattern corresponding to the audio object. The radiation pattern may include sound levels corresponding to a plurality of sample times, a plurality of frequency bands and a plurality of directions. The methods may involve encoding the mono audio signal and encoding the source radiation pattern to determine radiation pattern metadata. Encoding the radiation pattern may involve determining a spherical harmonic transform of the representation of the radiation pattern and compressing the spherical harmonic transform to obtain encoded radiation pattern metadata.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
H04S 5/00 - Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
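A minimal sketch of the spherical harmonic transform step from the abstract above, for a single (sample time, frequency band) cell and truncated at order 1. The basis normalization and the least-squares fit are assumptions; the order and the subsequent compression/quantization are omitted.

```python
import numpy as np

def sh_basis_order1(dirs):
    """Real spherical harmonics up to order 1 evaluated at unit
    direction vectors (normalization convention is an assumption)."""
    x, y, z = np.asarray(dirs, dtype=float).T
    c0 = np.sqrt(1.0 / (4.0 * np.pi))
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.stack([np.full(len(dirs), c0), c1 * y, c1 * z, c1 * x], axis=1)

def encode_radiation(dirs, gains):
    """Least-squares fit of SH coefficients to measured directional
    gains; compressing the coefficients would yield the metadata."""
    coeffs, *_ = np.linalg.lstsq(sh_basis_order1(dirs), gains, rcond=None)
    return coeffs
```

An omnidirectional source projects entirely onto the order-0 coefficient, so truncating the expansion costs nothing for such sources.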
91.
METHODS, APPARATUS AND SYSTEMS FOR THREE DEGREES OF FREEDOM (3DOF+) EXTENSION OF MPEG-H 3D AUDIO
Described is a method of processing position information indicative of an object position of an audio object, wherein the object position is usable for rendering of the audio object, that comprises: obtaining listener orientation information indicative of an orientation of a listener's head; obtaining listener displacement information indicative of a displacement of the listener's head; determining the object position from the position information; modifying the object position based on the listener displacement information by applying a translation to the object position; and further modifying the modified object position based on the listener orientation information. Further described is a corresponding apparatus for processing position information indicative of an object position of an audio object, wherein the object position is usable for rendering of the audio object.
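The translate-then-rotate position update described above can be sketched as follows. Representing the head orientation as a 3x3 listener-to-world rotation matrix is an assumption (a quaternion is equally common), as is the function name.

```python
import numpy as np

def to_listener_frame(obj_pos, head_displacement, head_rot):
    """Apply a translation compensating the listener's head
    displacement, then counter-rotate by the head orientation so the
    object position is expressed in the listener's frame."""
    p = np.asarray(obj_pos, dtype=float) - np.asarray(head_displacement,
                                                      dtype=float)
    return head_rot.T @ p
```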
The present disclosure relates to a method of decoding audio scene content from a bitstream by a decoder that includes an audio renderer with one or more rendering tools. The method comprises receiving the bitstream, decoding a description of an audio scene from the bitstream, determining one or more effective audio elements from the description of the audio scene, determining effective audio element information indicative of effective audio element positions of the one or more effective audio elements from the description of the audio scene, decoding a rendering mode indication from the bitstream, wherein the rendering mode indication is indicative of whether the one or more effective audio elements represent a sound field obtained from pre-rendered audio elements and should be rendered using a predetermined rendering mode, and in response to the rendering mode indication indicating that the one or more effective audio elements represent the sound field obtained from pre-rendered audio elements and should be rendered using the predetermined rendering mode, rendering the one or more effective audio elements using the predetermined rendering mode, wherein rendering the one or more effective audio elements using the predetermined rendering mode takes into account the effective audio element information, and wherein the predetermined rendering mode defines a predetermined configuration of the rendering tools for controlling an impact of an acoustic environment of the audio scene on the rendering output. The disclosure further relates to a method of generating audio scene content and a method of encoding audio scene content into a bitstream.
The present disclosure relates to methods, apparatus and systems for encoding an audio signal into a bitstream, in particular at an encoder, comprising: encoding or including audio signal data associated with 3DoF audio rendering into one or more first bitstream parts of the bitstream, and encoding or including metadata associated with 6DoF audio rendering into one or more second bitstream parts of the bitstream. The present disclosure further relates to methods, apparatus and systems for decoding an audio signal and audio rendering based on the bitstream.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
G10L 19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
94.
METHOD AND APPARATUS FOR PROCESSING OF AUXILIARY MEDIA STREAMS EMBEDDED IN A MPEG-H 3D AUDIO STREAM
The disclosure relates to methods, apparatus and systems for side load processing of packetized media streams. In an embodiment, the apparatus comprises: a receiver for receiving a bitstream, and a splitter for identifying a packet type in the bitstream and splitting the bitstream, based on the identified value of the packet type, into a main stream and an auxiliary stream.
H04N 21/439 - Processing of audio elementary streams
H04N 21/435 - Processing of additional data, e.g. decrypting of additional data or reconstructing software from modules extracted from the transport stream
H04N 21/4363 - Adapting the video stream to a specific local network, e.g. a IEEE 1394 or Bluetooth® network
H04N 21/485 - End-user interface for client configuration
H04N 21/434 - Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams or extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
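The splitter behaviour described above amounts to routing packets by their type value. A minimal sketch, assuming a hypothetical (type, payload) packet representation rather than the actual MHAS packet format:

```python
def split_streams(packets, aux_types):
    """Route packets into a main stream and an auxiliary stream based on
    each packet's type value. `packets` is an iterable of (ptype, payload)
    tuples; `aux_types` is the set of type values to side-load."""
    main, aux = [], []
    for ptype, payload in packets:
        (aux if ptype in aux_types else main).append((ptype, payload))
    return main, aux
```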
A method for decoding an encoded audio bitstream is disclosed. The method includes receiving the encoded audio bitstream and decoding the audio data to generate a decoded lowband audio signal. The method further includes extracting high frequency reconstruction metadata and filtering the decoded lowband audio signal with an analysis filterbank to generate a filtered lowband audio signal. The method also includes extracting a flag indicating whether either spectral translation or harmonic transposition is to be performed on the audio data and regenerating a highband portion of the audio signal using the filtered lowband audio signal and the high frequency reconstruction metadata in accordance with the flag.
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
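The flag described above selects between two highband regeneration strategies. The following toy sketch contrasts them on magnitude spectra: spectral translation copies lowband bins upward unchanged, while a (much simplified) second-order harmonic transposition maps bin k to bin 2k, preserving harmonic spacing. Real SBR/HFR processing is considerably more involved; names and the bin-level model are assumptions.

```python
import numpy as np

def regenerate_highband(low_mag, n_high, use_transposition):
    """Toy highband regeneration from a lowband magnitude spectrum,
    switched by a flag as in the decoding method above."""
    high = np.zeros(n_high)
    if use_transposition:
        for k, m in enumerate(low_mag):
            if 2 * k < n_high:          # 2nd-order transposition: k -> 2k
                high[2 * k] += m
    else:
        for i in range(n_high):         # translation: copy bins cyclically
            high[i] = low_mag[i % len(low_mag)]
    return high
```

For a lowband with partials at bins 1 and 3, transposition places energy at bins 2 and 6, whereas translation repeats the 2-bin spacing.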
The present document describes a method (400) for encoding a soundfield representation (SR) input signal (101, 301) describing a soundfield at a reference position, wherein the SR input signal (101, 301) comprises a plurality of channels for a plurality of different directivity patterns of the soundfield at the reference position. The method (400) comprises extracting (401) one or more audio objects (103, 303) from the SR input signal (101, 301). Furthermore, the method (400) comprises determining (402) a residual signal (102, 302) based on the SR input signal (101, 301) and based on the one or more audio objects (103, 303). The method (400) also comprises performing joint coding of the one or more audio objects (103, 303) and/or the residual signal (102, 302). In addition, the method (400) comprises generating (403) a bitstream (701) based on data generated in the context of joint coding of the one or more audio objects (103, 303) and/or the residual signal (102, 302).
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
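The object-extraction and residual steps described above can be sketched for a first-order ambisonic (FOA) input: estimate one dominant direction, beamform the object signal towards it, then subtract the object's re-encoding to obtain the residual. The SN3D normalisation, W/Y/Z/X channel order, and single-object assumption are illustrative choices, not the claimed method.

```python
import numpy as np

def extract_object_and_residual(foa):
    """Toy extraction of one audio object from an FOA signal (channels
    W, Y, Z, X, SN3D-normalised, as an assumption) plus a residual."""
    w, y, z, x = foa
    d = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    d /= np.linalg.norm(d) + 1e-12                    # dominant unit direction
    obj = 0.5 * (w + d[0] * x + d[1] * y + d[2] * z)  # cardioid beam at d
    re_enc = np.stack([obj, d[1] * obj, d[2] * obj, d[0] * obj])
    residual = foa - re_enc                           # what the object missed
    return obj, d, residual
```

For a single plane wave the beam recovers the source signal exactly and the residual vanishes.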
97.
METHOD AND SYSTEM FOR HANDLING LOCAL TRANSITIONS BETWEEN LISTENING POSITIONS IN A VIRTUAL REALITY ENVIRONMENT
A method (910) for rendering an audio signal in a virtual reality rendering environment (180) is described. The method (910) comprises rendering (911) an origin audio signal of an audio source (311, 312, 313) from an origin source position on an origin sphere (114) around an origin listening position (301) of a listener (181). Furthermore, the method (910) comprises determining (912) that the listener (181) moves from the origin listening position (301) to a destination listening position (302). In addition, the method (910) comprises determining (913) a destination source position of the audio source (311, 312, 313) on a destination sphere (114) around the destination listening position (302) based on the origin source position, and determining (914) a destination audio signal of the audio source (311, 312, 313) based on the origin audio signal. Furthermore, the method (910) comprises rendering (915) the destination audio signal of the audio source (311, 312, 313) from the destination source position on the destination sphere (114) around the destination listening position (302).
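The geometry of the local transition above can be sketched as follows: fix the source at its world position on the origin sphere, re-project it onto the sphere around the destination listening position, and derive a distance gain. The 1/r free-field attenuation and the function names are assumptions for illustration.

```python
import numpy as np

def destination_source(origin_listener, dest_listener, origin_dir, radius=1.0):
    """Re-project a source from the origin sphere onto the destination
    sphere and return (destination direction, distance gain)."""
    world = np.asarray(origin_listener, float) + radius * np.asarray(origin_dir, float)
    v = world - np.asarray(dest_listener, float)
    dist = np.linalg.norm(v)
    dest_dir = v / dist          # unit direction on the destination sphere
    gain = radius / dist         # simple 1/r distance attenuation (assumed)
    return dest_dir, gain
```

Moving one radius directly away from the source halves its amplitude in this model.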
The present disclosure relates to an apparatus for decoding an encoded Unified Audio and Speech stream. The apparatus comprises a core decoder for decoding the encoded Unified Audio and Speech stream. The core decoder includes a fast Fourier transform, FFT, module implementation based on a Cooley-Tukey algorithm. The FFT module is configured to determine a discrete Fourier transform, DFT. Determining the DFT involves recursively breaking down the DFT into smaller FFTs based on the Cooley-Tukey algorithm, using radix-4 if the number of points of the FFT is a power of 4 and using mixed radix if the number is not a power of 4. Performing the smaller FFTs involves applying twiddle factors. Applying the twiddle factors involves referring to pre-computed values for the twiddle factors. The present disclosure further relates to an apparatus for decoding an encoded Unified Audio and Speech stream, in which the core decoder is configured to decode an LPC filter that has been quantized using a line spectral frequency, LSF, representation from the Unified Audio and Speech stream. Decoding the LPC filter from the Unified Audio and Speech stream comprises computing a first-stage approximation of an LSF vector, reconstructing a residual LSF vector, if an absolute quantization mode has been used for quantizing the LPC filter, determining inverse LSF weights for inverse weighting of the residual LSF vector by referring to pre-computed values for the inverse LSF weights or their respective corresponding LSF weights, inverse weighting the residual LSF vector by the determined inverse LSF weights, and calculating the LPC filter based on the inversely-weighted residual LSF vector and the first-stage approximation of the LSF vector. The present disclosure further relates to corresponding methods and storage media.
G06F 17/14 - Fourier, Walsh or analogous domain transformations
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
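The FFT strategy described above, radix-4 for power-of-4 lengths and mixed radix otherwise, with cached twiddle factors, can be sketched recursively. This is an illustration of the general Cooley-Tukey decimation-in-time decomposition, not the decoder's actual kernel; the naive O(p·n) combination step is kept for clarity.

```python
import cmath

_twiddles = {}   # pre-computed twiddle-factor cache, keyed by (n, k)

def _twiddle(n, k):
    """Look up (or lazily compute) the twiddle factor e^{-2*pi*i*k/n}."""
    if (n, k) not in _twiddles:
        _twiddles[(n, k)] = cmath.exp(-2j * cmath.pi * k / n)
    return _twiddles[(n, k)]

def _is_power_of_4(n):
    # single set bit in an even position <=> power of 4
    return n > 0 and (n & (n - 1)) == 0 and (n & 0x55555555) != 0

def _smallest_factor(n):
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n

def fft(x):
    """Recursive Cooley-Tukey DFT: radix 4 when the length is a power of 4,
    otherwise mixed radix via the smallest prime factor of the length."""
    n = len(x)
    if n == 1:
        return list(x)
    p = 4 if _is_power_of_4(n) else _smallest_factor(n)
    m = n // p
    sub = [fft(x[q::p]) for q in range(p)]     # p interleaved sub-FFTs
    # X[k] = sum_q W_n^{qk} * Sub_q[k mod m]
    return [sum(_twiddle(n, q * k) * sub[q][k % m] for q in range(p))
            for k in range(n)]
```

A unit impulse transforms to an all-ones spectrum, and a constant signal to a single DC bin, which gives a quick sanity check.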
The present disclosure relates to an apparatus for decoding an encoded Unified Audio and Speech stream. The apparatus comprises a core decoder for decoding the encoded Unified Audio and Speech stream. The core decoder includes an upmixing unit adapted to perform mono to stereo upmixing. The upmixing unit includes a decorrelator unit D adapted to apply a decorrelation filter to an input signal. The decorrelator unit is adapted to determine filter coefficients for the decorrelation filter by referring to pre-computed values. The present disclosure further relates to an apparatus for encoding a Unified Audio and Speech stream, as well as to corresponding methods and storage media.
G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
H04S 3/02 - Systems employing more than two channels, e.g. quadraphonic of the matrix type, i.e. in which input signals are combined algebraically, e.g. after having been phase shifted with respect to each other
G10L 19/025 - Detection of transients or attacks for time/frequency resolution switching
G10H 7/00 - Instruments in which the tones are synthesised from a data store, e.g. computer organs
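A decorrelator that looks up pre-computed filter coefficients, as described above, can be sketched as a delay followed by a cascade of first-order allpass sections. The coefficient table below is a placeholder, not the actual values the decoder would ship with.

```python
import numpy as np

# Hypothetical pre-computed coefficient table (placeholder values).
_PRECOMPUTED = {"a": np.array([0.5, -0.3]), "delay": 3}

def decorrelate(x, table=_PRECOMPUTED):
    """Sketch of a decorrelation filter: delay, then first-order allpass
    sections H(z) = (a + z^-1) / (1 + a z^-1) with looked-up coefficients."""
    y = np.concatenate([np.zeros(table["delay"]), np.asarray(x, float)])[: len(x)]
    for a in table["a"]:
        out = np.empty_like(y)
        x_prev = y_prev = 0.0
        for i, xi in enumerate(y):     # y[n] = a*x[n] + x[n-1] - a*y[n-1]
            out[i] = a * xi + x_prev - a * y_prev
            x_prev, y_prev = xi, out[i]
        y = out
    return y
```

Because the sections are allpass, the output decorrelates from the input while (up to truncation) preserving its energy.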
100.
METHOD AND SYSTEM FOR HANDLING GLOBAL TRANSITIONS BETWEEN LISTENING POSITIONS IN A VIRTUAL REALITY ENVIRONMENT
A method (900) for rendering audio in a virtual reality rendering environment (180) is described. The method (900) comprises rendering (901) an origin audio signal of an origin audio source (113) of an origin audio scene (111) from an origin source position on a sphere (114) around a listening position (201) of a listener (181). Furthermore, the method (900) comprises determining (902) that the listener (181) moves from the listening position (201) within the origin audio scene (111) to a listening position (202) within a different destination audio scene (112). In addition, the method (900) comprises applying (903) a fade-out gain to the origin audio signal to determine a modified origin audio signal, and rendering (903) the modified origin audio signal of the origin audio source (113) from the origin source position on the sphere (114) around the listening position (201, 202).
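The fade-out gain applied during the global transition above can be sketched as one half of a cross-fade between the origin and destination scenes; the linear ramp and complementary fade-in are assumptions for illustration, not the claimed method.

```python
import numpy as np

def global_transition(origin_sig, dest_sig, n_fade):
    """Apply a fade-out gain to the origin scene's signal and a
    complementary fade-in to the destination scene's, summed per sample."""
    g = np.linspace(1.0, 0.0, n_fade)   # fade-out gain for the origin scene
    return origin_sig[:n_fade] * g + dest_sig[:n_fade] * (1.0 - g)
```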