Disclosed herein are computer-implemented method, system, and computer-readable storage-medium embodiments for implementing densification in music search. An embodiment includes processor(s) configured to obtain a first feature set extracted from a first audio recording, and a first fingerprint of the first audio recording; and evaluate, using at least one first machine-learning algorithm, a similarity index corresponding to the first audio recording with respect to at least one second audio recording, considering: the first feature set extracted from the first audio recording, and a second feature set extracted from the at least one second audio recording; or the first fingerprint of the first audio recording, and at least one second fingerprint of the at least one second audio recording. Further embodiments include defining arrangement group(s) including the first audio recording and the at least one second audio recording with a similarity index within a predetermined range, and outputting densified response(s) to a search query.
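As a rough illustration of the grouping step described above, the following minimal sketch collects recordings whose similarity to a query recording falls within a predetermined range. It is a sketch under stated assumptions, not the disclosed method: plain cosine similarity stands in for the machine-learning algorithm, and the names `cosine_similarity` and `arrangement_group`, the feature vectors, and the range `[0.9, 1.0]` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def arrangement_group(query_features, candidates, low=0.9, high=1.0):
    """Return ids of candidate recordings whose similarity index
    with respect to the query lies within [low, high].

    `candidates` maps a recording id to its feature vector.
    The thresholds are illustrative assumptions.
    """
    group = []
    for rec_id, features in candidates.items():
        if low <= cosine_similarity(query_features, features) <= high:
            group.append(rec_id)
    return group


if __name__ == "__main__":
    candidates = {
        "a": [2.0, 0.0, 0.0],   # same direction as the query
        "b": [0.0, 1.0, 0.0],   # orthogonal to the query
        "c": [1.0, 0.1, 0.0],   # close to the query
    }
    print(arrangement_group([1.0, 0.0, 0.0], candidates))
```

A search response could then present one representative per group instead of every near-duplicate recording, which is one plausible reading of a "densified" response.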
User interface techniques provide user vocalists with mechanisms for solo audiovisual capture and for seeding subsequent performances by other users (e.g., joiners). Audiovisual capture may be against a full-length work or seed spanning much or all of a pre-existing audio (or audiovisual) work and in some cases may mix, to seed further contributions of one or more joiners, a user's captured media content for at least some portions of the audio (or audiovisual) work. A short seed or short segment may span less than all (and in some cases, much less than all) of the audio (or audiovisual) work. For example, a verse, chorus, refrain, hook or other limited chunk of an audio (or audiovisual) work may constitute a short seed or short segment. Computational techniques are described that allow a system to automatically identify suitable short seeds or short segments. After audiovisual capture against the short seed or short segment, a resulting, solo or group, full-length or short-form performance may be posted, livestreamed, or otherwise disseminated in a social network.
H04N 21/4402 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
H04N 21/439 - Processing of audio elementary streams
Disclosed herein are computer-implemented method, system, and computer-readable storage-medium embodiments for implementing template-based excerpting and rendering of multimedia performances technologies. An embodiment includes at least one computer processor configured to retrieve a first content instance and corresponding first metadata. The first content instance may include a first plurality of structural elements, with at least one structural element corresponding to at least part of the first metadata. An embodiment may further include selecting a first template comprising a first set of parameters. A parameter of the first set of parameters may be applicable to the at least one structural element. Applicable parameter(s) of the first template may be actively associated with the at least part of the first metadata corresponding to the at least one structural element. The first content instance may be transformed by a rendering engine running on the at least one computer processor.
G10H 1/00 - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE - Details of electrophonic musical instruments
G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
G06F 16/907 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
4.
AUGMENTED REALITY FILTERS FOR CAPTURED AUDIOVISUAL PERFORMANCES
Visual effects, including augmented reality-type visual effects, are applied to audiovisual performances with differing visual effects and/or parameterizations thereof applied in correspondence with computationally determined audio features or elements of musical structure coded in temporally-synchronized tracks or computationally determined therefrom. Segmentation techniques applied to one or more audio tracks (e.g., vocal or backing tracks) are used to compute some of the components of the musical structure. In some cases, applied visual effects are based on an audio feature computationally extracted from a captured audiovisual performance or from an audio track temporally-synchronized therewith.
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
H04N 21/431 - Generation of visual interfaces; Content or additional data rendering
H04N 21/434 - Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams or extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
H04N 21/236 - Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator ] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
H04N 21/2368 - Multiplexing of audio and video streams
User interface techniques provide user vocalists with mechanisms for seeding subsequent performances by other users (e.g., joiners). A seed may be a full-length seed spanning much or all of a pre-existing audio (or audiovisual) work and mixing, to seed further contributions of one or more joiners, a user's captured media content for at least some portions of the audio (or audiovisual) work. A short seed may span less than all (and in some cases, much less than all) of the audio (or audiovisual) work. For example, a verse, chorus, refrain, hook or other limited chunk of an audio (or audiovisual) work may constitute a seed. A seeding user's call invites other users to join the full-length or short form seed by singing along, singing a particular vocal part or musical section, singing harmony or other duet part, rapping, talking, clapping, recording video, adding a video clip from camera roll, etc. The resulting group performance, whether full-length or just a chunk, may be posted, livestreamed, or otherwise disseminated in a social network.
Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual livestream in which aspiring vocalists request or queue particular songs for a live radio show entertainment format. The developed techniques provide a communications latency-tolerant mechanism for synchronizing vocal performances captured at geographically-separated devices (e.g., at globally-distributed, but network-connected mobile phones or tablets or at audiovisual capture devices geographically separated from a live studio).
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
H04N 21/434 - Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams or extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
H04N 21/472 - End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification or for manipulating displayed content
H04N 21/485 - End-user interface for client configuration
User interface techniques provide user vocalists with mechanisms for forward and backward traversal of audiovisual content, including pitch cues, waveform- or envelope-type performance timelines, lyrics and/or other temporally-synchronized content at record-time, during edits, and/or in playback. Recapture of selected performance portions, coordination of group parts, and overdubbing may all be facilitated. Direct scrolling to arbitrary points in the performance timeline, lyrics, pitch cues and other temporally-synchronized content allows a user to conveniently move through a capture or audiovisual edit session. In some cases, a user vocalist may be guided through the performance timeline, lyrics, pitch cues and other temporally-synchronized content in correspondence with group part information such as in a guided short-form capture for a duet. A scrubber allows user vocalists to conveniently move forward and backward through the temporally-synchronized content.
Visual effects schedules are applied to audiovisual performances with differing visual effects applied in correspondence with differing elements of musical structure. Segmentation techniques applied to one or more audio tracks (e.g., vocal or backing tracks) are used to compute some of the components of the musical structure. In some cases, applied visual effects schedules are mood-denominated and may be selected by a performer as a component of his or her visual expression or determined from an audiovisual performance using machine learning techniques.
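The idea of applying differing visual effects in correspondence with elements of musical structure can be pictured as a lookup from a playback time to a section label and from the label to an effect. The sketch below is illustrative only: the section boundaries would come from a segmentation step that is not shown, and the schedule contents, effect names, and the function `effect_at` are hypothetical.

```python
import bisect

# Hypothetical mood-denominated schedule: section label -> visual effect name.
UPBEAT_SCHEDULE = {
    "verse": "soft_glow",
    "chorus": "particle_burst",
    "bridge": "slow_zoom",
}


def effect_at(segments, schedule, t, default="none"):
    """Return the visual effect scheduled at time `t` (seconds).

    `segments` is a list of (start_seconds, section_label) pairs sorted
    by start time; each section runs until the next one starts.
    """
    starts = [start for start, _ in segments]
    # Index of the last segment starting at or before t.
    i = bisect.bisect_right(starts, t) - 1
    if i < 0:
        return default  # t precedes the first segment
    label = segments[i][1]
    return schedule.get(label, default)


if __name__ == "__main__":
    # Assumed output of a segmentation step on a backing track.
    segments = [(0.0, "verse"), (30.0, "chorus"), (55.0, "bridge")]
    for t in (10.0, 31.0, 60.0):
        print(t, effect_at(segments, UPBEAT_SCHEDULE, t))
```

Swapping `UPBEAT_SCHEDULE` for a differently themed mapping changes the look of the whole performance without touching the segmentation, which is the appeal of keeping the schedule separate from the musical structure.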
H04N 21/439 - Processing of audio elementary streams
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
H04N 21/4402 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
H04N 5/272 - Means for inserting a foreground image in a background image, i.e. inlay, outlay
G06N 99/00 - Subject matter not provided for in other groups of this subclass
9.
AUDIOVISUAL COLLABORATION METHOD WITH LATENCY MANAGEMENT FOR WIDE-AREA BROADCAST
Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual livestream in which aspiring vocalists request or queue particular songs for a live radio show entertainment format. The developed techniques provide a communications latency-tolerant mechanism for synchronizing vocal performances captured at geographically-separated devices (e.g., at globally-distributed, but network-connected mobile phones or tablets or at audiovisual capture devices geographically separated from a live studio).
H04N 21/434 - Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams or extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
H04N 21/436 - Interfacing a local distribution network, e.g. communicating with another STB or inside the home
10.
CROWD-SOURCED TECHNIQUE FOR PITCH TRACK GENERATION
Digital signal processing and machine learning techniques can be employed in a vocal capture and performance social network to computationally generate vocal pitch tracks from a collection of vocal performances captured against a common temporal baseline such as a backing track or an original performance by a popularizing artist. In this way, crowd-sourced pitch tracks may be generated and distributed for use in subsequent karaoke-style vocal audio captures or other applications. Large numbers of performances of a song can be used to generate a pitch track. Computationally determined pitch trackings from individual audio signal encodings of the crowd-sourced vocal performance set are aggregated and processed as an observation sequence of a trained Hidden Markov Model (HMM) or other statistical model to produce an output pitch track.
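To make the aggregation step more concrete, the following minimal sketch decodes a single pitch track from several per-performer pitch estimates using Viterbi decoding over a small hidden Markov model. It is a sketch under stated assumptions: the probabilities `p_correct` and `p_stay` are set by hand rather than trained, the emission and transition models are simplified placeholders, and the function name `viterbi_pitch_track` is hypothetical.

```python
import math


def viterbi_pitch_track(observations, states, p_correct=0.7, p_stay=0.9):
    """Decode the most likely pitch sequence from crowd observations.

    observations: list of frames; each frame is a list of per-performer
        pitch estimates (MIDI note numbers, or None for unvoiced).
    states: candidate output pitches (at least two; may include None).
    p_correct, p_stay: hand-set illustrative probabilities, not trained.
    """
    n = len(states)
    log_hit = math.log(p_correct)
    log_miss = math.log((1.0 - p_correct) / (n - 1))
    log_stay = math.log(p_stay)
    log_move = math.log((1.0 - p_stay) / (n - 1))

    def emission(state, frame):
        # Performers are treated as independent observers of the state.
        return sum(log_hit if obs == state else log_miss for obs in frame)

    def transition(i, j):
        return log_stay if i == j else log_move

    # Uniform prior over states for the first frame.
    scores = [emission(s, observations[0]) for s in states]
    backpointers = []
    for frame in observations[1:]:
        new_scores, pointers = [], []
        for j, state in enumerate(states):
            best_i = max(range(n), key=lambda i: scores[i] + transition(i, j))
            pointers.append(best_i)
            new_scores.append(
                scores[best_i] + transition(best_i, j) + emission(state, frame)
            )
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    idx = max(range(n), key=lambda i: scores[i])
    path = [idx]
    for pointers in reversed(backpointers):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [states[i] for i in path]


if __name__ == "__main__":
    frames = [
        [60, 60, 62],
        [60, 60, 60],
        [62, 60, 60],
        [62, 62, 62],
        [62, 62, None],
    ]
    print(viterbi_pitch_track(frames, [None, 60, 62]))
```

The transition term is what distinguishes this from a per-frame majority vote: an isolated disagreement among performers is less likely to flip the output note than a sustained one.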
Embodiments described herein relate generally to systems comprising a display device, a display device-coupled computing platform, a mobile device in communication with the computing platform, and a content server, for which methods and techniques of capturing and/or processing audiovisual performances are described. In particular, techniques are described that are suitable for use in connection with display device-connected computing platforms for rendering vocal performances captured by a handheld computing device.
Notwithstanding practical limitations imposed by mobile device platforms and applications, truly captivating musical instruments may be synthesized in ways that allow musically expressive performances to be captured and rendered in real-time. Synthetic musical instruments that provide a game, grading or instructional mode are described in which one or more qualities of a user's performance are assessed relative to a musical score. By providing a range of modes (from score-assisted to fully user-expressive), user interactions with synthetic musical instruments are made more engaging and tend to capture user interest over generally longer periods of time. Synthetic musical instruments are described in which force dynamics of user gestures (such as finger contact forces applied to a multi-touch sensitive display or surface and/or the temporal extent and applied pressure of sustained contact thereon) are captured and drive the digital synthesis in ways that enhance expressiveness of user performances.
G10H 1/00 - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE - Details of electrophonic musical instruments
G10H 7/00 - Instruments in which the tones are synthesised from a data store, e.g. computer organs
13.
AUTOMATED GENERATION OF COORDINATED AUDIOVISUAL WORK BASED ON CONTENT CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS
Vocal audio of a user together with performance synchronized video is captured and coordinated with audiovisual contributions of other users to form composite duet-style or glee club-style or window-paned music video-style audiovisual performances. In some cases, the vocal performances of individual users are captured (together with performance synchronized video) on mobile devices, television-type display and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track. Contributions of multiple vocalists are coordinated and mixed in a manner that selects for presentation, at any given time along a given performance timeline, performance synchronized video of one or more of the contributors. Selections are in accord with a visual progression that codes a sequence of visual layouts in correspondence with other coded aspects of a performance score such as pitch tracks, backing audio, lyrics, sections and/or vocal parts.
H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
H04N 21/233 - Processing of audio elementary streams
H04N 21/234 - Processing of video elementary streams, e.g. splicing of video streams or manipulating MPEG-4 scene graphs
H04N 21/2343 - Processing of video elementary streams, e.g. splicing of video streams or manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
14.
COMPUTATIONALLY-ASSISTED MUSICAL SEQUENCING AND/OR COMPOSITION TECHNIQUES FOR SOCIAL MUSIC CHALLENGE OR COMPETITION
In an application that manipulates audio (or audiovisual) content, automated music creation technologies may be employed to generate new musical content using digital signal processing software hosted on handheld and/or server (or cloud-based) compute platforms to intelligently process and combine a set of audio content captured and submitted by users of modern mobile phones or other handheld compute platforms. The user-submitted recordings may contain speech, singing, musical instruments, or a wide variety of other sound sources, and the recordings may optionally be preprocessed by the handheld devices prior to submission.
Coordinated audio and video filter pairs are applied to enhance artistic and emotional content of audiovisual performances. Such filter pairs, when applied in audio and video processing pipelines of an audiovisual application hosted on a portable computing device (such as a mobile phone or media player, a compute pad or tablet, a game controller or a personal digital assistant or book reader) can allow user selection of effects that enhance both audio and video coordinated therewith. Coordinated audio and video are captured, filtered and rendered at the portable computing device using camera and microphone interfaces, using digital signal processing software executable on a processor and using storage, speaker and display devices of, or interoperable with, the device. By providing audiovisual capture and personalization on an intimate handheld device, social interactions and postings of a type made popular by modern social networking platforms can now be extended to audiovisual content.
SOCIAL MUSIC SYSTEM AND METHOD WITH CONTINUOUS, REAL-TIME PITCH CORRECTION OF VOCAL PERFORMANCE AND DRY VOCAL CAPTURE FOR SUBSEQUENT RE-RENDERING BASED ON SELECTIVELY APPLICABLE VOCAL EFFECT(S) SCHEDULE(S)
Despite many practical limitations imposed by mobile device platforms and application execution environments, vocal musical performances may be captured and, in some cases or embodiments, pitch-corrected and/or processed in accord with a user selectable vocal effects schedule for mixing and rendering with backing tracks in ways that create compelling user experiences. In some cases, the vocal performances of individual users are captured on mobile devices in the context of a karaoke-style presentation of lyrics in correspondence with audible renderings of a backing track. Such performances can be pitch-corrected in real-time at the mobile device in accord with pitch correction settings. Vocal effects schedules may also be selectively applied to such performances. In these ways, even amateur user/performers with imperfect pitch are encouraged to take a shot at "stardom" and/or take part in a game play, social network or vocal achievement application architecture that facilitates musical collaboration on a global scale and/or, in some cases or embodiments, to initiate revenue generating in-application transactions.
Captured vocals may be automatically transformed using advanced digital signal processing techniques that provide captivating applications, and even purpose-built devices, in which mere novice user-musicians may generate, audibly render and share musical performances. In some cases, the automated transformations allow spoken vocals to be segmented, arranged, temporally aligned with a target rhythm, meter or accompanying backing tracks and pitch corrected in accord with a score or note sequence. Speech-to-song music applications are one such example. In some cases, spoken vocals may be transformed in accord with musical genres such as rap using automated segmentation and temporal alignment techniques, often without pitch correction. Such applications, which may employ different signal processing and different automated transformations, may nonetheless be understood as speech-to-rap variations on the theme.
A synthetic multi-string musical instrument captures a stream of expressive gestures indicated on a multi-touch sensitive display for note/chord soundings and associated performance effects and embellishments. Visual cues in accord with a musical score may be revealed/advanced at a current performance tempo, but it is the user's gestures that actually drive the audible performance rendering via digital synthesis. Opportunities for user expression (or variance from score) include onset and duration of note soundings, tempo changes, as well as uncued string bend effects, vibrato, etc. Gesturing mechanisms are provided to allow user musicians to sound chords without having to register precisely accurate multi-touch screen contacts. This can be especially helpful for mobile phone, media player and game controller embodiments, where there is generally limited real-estate to display six (6) or more strings, and user fingers are generally too fat to precisely contact such strings.
G10H 1/00 - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE - Details of electrophonic musical instruments
Synthetic multi-string musical instruments have been developed for capturing and rendering musical performances on handheld or other portable devices in which a multi-touch sensitive display provides one of the input vectors for an expressive performance by a user or musician. Visual cues may be provided on the multi-touch sensitive display to guide the user in a performance based on a musical score. Alternatively, or in addition, uncued freestyle modes of operation may be provided. In either case, it is not the musical score that drives digital synthesis and audible rendering of the synthetic multi-string musical instrument. Rather, it is the stream of user gestures captured at least in part using the multi-touch sensitive display that drives the digital synthesis and audible rendering.
G10H 1/00 - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE - Details of electrophonic musical instruments
20.
CONTINUOUS SCORE-CODED PITCH CORRECTION AND HARMONY GENERATION TECHNIQUES FOR GEOGRAPHICALLY DISTRIBUTED GLEE CLUB
Despite many practical limitations imposed by mobile device platforms and application execution environments, vocal musical performances may be captured and continuously pitch-corrected for mixing and rendering with backing tracks in ways that create compelling user experiences. In some cases, the vocal performances of individual users are captured on mobile devices in the context of a karaoke-style presentation of lyrics in correspondence with audible renderings of a backing track. Such performances can be pitch-corrected in real-time at a portable computing device (such as a mobile phone, personal digital assistant, laptop computer, notebook computer, pad-type computer or netbook) in accord with pitch correction settings. In some cases, pitch correction settings include a score-coded melody and/or harmonies supplied with, or for association with, the lyrics and backing tracks. Harmony notes or chords may be coded as explicit targets or relative to the score-coded melody or even actual pitches sounded by a vocalist.
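The notion of correcting a sung pitch toward score-coded targets, with harmonies coded relative to the melody, can be illustrated with a small numeric sketch. It assumes equal temperament with A4 = 440 Hz and MIDI note numbering, and it only computes target frequencies; the actual pitch detection and audio resynthesis are not shown. The function names and the semitone offsets are hypothetical.

```python
import math

A4_HZ = 440.0   # assumed tuning reference
A4_MIDI = 69    # MIDI note number of A4


def hz_to_midi(freq_hz):
    """Convert a frequency to a (fractional) MIDI note number."""
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def correct_pitch(detected_hz, target_notes):
    """Snap a detected vocal pitch to the nearest score-coded target note.

    `target_notes` holds the MIDI notes allowed at this point in the score.
    Returns the corrected frequency in Hz.
    """
    detected = hz_to_midi(detected_hz)
    nearest = min(target_notes, key=lambda note: abs(note - detected))
    return midi_to_hz(nearest)


def harmony_targets(melody_note, offsets):
    """Harmony notes coded relative to the melody, as semitone offsets."""
    return [melody_note + offset for offset in offsets]


if __name__ == "__main__":
    # A slightly sharp A4 is pulled back to the score-coded A4.
    print(correct_pitch(450.0, [67, 69, 71]))
    # A major third and a perfect fifth above middle C (MIDI 60).
    print(harmony_targets(60, [4, 7]))
```

Coding harmonies as offsets rather than absolute notes lets the same harmony definition follow either the score-coded melody or the pitch a vocalist actually sounds.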
Techniques have been developed to facilitate (1) the capture and pitch correction of vocal performances on handheld or other portable computing devices and (2) the mixing of such pitch-corrected vocal performances with backing tracks for audible rendering on targets that include such portable computing devices as well as desktops, workstations, gaming stations, and even telephony targets. Implementations of the described techniques employ signal processing techniques and allocations of system functionality that are suitable given the generally limited capabilities of such handheld or portable computing devices and that facilitate efficient encoding and communication of the pitch-corrected vocal performances (or precursors or derivatives thereof) via wireless and/or wired bandwidth-limited networks for rendering on portable computing devices or other targets.