VGGish Embedding Operator (PyTorch)

Author: Jael Gu

Overview

This operator reads the waveform of an audio file and applies VGGish to extract features. The original VGGish model is built on top of TensorFlow.[1] This operator converts VGGish into PyTorch. Given an input, it generates a set of vectors, each representing the features of a non-overlapping clip with a fixed length of 0.96 s; each clip is composed of 64 mel bands and 96 frames. The model is pre-trained on the large-scale audio dataset AudioSet. As suggested by its authors, this model is suitable for extracting high-level features or warming up a larger model.
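Because the clips are non-overlapping and each one yields a 128-dimensional vector, the output shape follows directly from the audio duration. A minimal sketch of that arithmetic (the constants come from the description above; the helper function is illustrative, not part of the operator):

```python
import math

# One 128-d vector per non-overlapping 0.96 s clip, per the overview above.
CLIP_SECONDS = 0.96
EMBEDDING_DIM = 128

def expected_output_shape(duration_seconds: float) -> tuple:
    """Illustrative helper: output shape for an audio of the given duration."""
    num_clips = math.floor(duration_seconds / CLIP_SECONDS)
    return (num_clips, EMBEDDING_DIM)

expected_output_shape(10.0)  # → (10, 128)
```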

Interface

__call__(self, datas: List[NamedTuple('data', [('audio', 'ndarray'), ('sample_rate', 'int')])])

Args:

  • datas:
    • a list of named tuples, each holding audio data as a numpy.ndarray and its sample rate as an int

Returns:

The operator returns a tuple Tuple[('embs', numpy.ndarray)] containing the following field:

  • vec:
    • embeddings of the audio
    • data type: numpy.ndarray
    • shape: (num_clips, 128)
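The input format above can be sketched with a synthetic waveform; the named-tuple layout follows the interface, while the 16 kHz mono tone and the commented operator invocation (class name included) are assumptions for illustration:

```python
from collections import namedtuple
import numpy as np

# Input layout per the interface: named tuples with 'audio' and 'sample_rate'.
AudioData = namedtuple('data', ['audio', 'sample_rate'])

sample_rate = 16000  # VGGish works on 16 kHz mono audio
t = np.linspace(0.0, 2.0, 2 * sample_rate, endpoint=False)
waveform = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)  # 2 s, 440 Hz tone

datas = [AudioData(audio=waveform, sample_rate=sample_rate)]

# Hypothetical invocation (operator class name assumed):
# op = VggishEmbedding()
# embs = op(datas).embs  # numpy.ndarray of shape (num_clips, 128)
```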

Requirements

The required Python packages are listed in requirements.txt; install them with `pip install -r requirements.txt`.

How it works

The towhee/torch-vggish operator implements audio embedding and can be added to a Towhee pipeline. For example, it is the key operator of the audio-embedding-vggish pipeline.
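Downstream of the pipeline, the embeddings are plain 128-dimensional vectors and can be compared with, for example, cosine similarity. A minimal sketch using mock embeddings; the commented pipeline call reflects the pipeline name given above, but its exact API is an assumption and may differ by Towhee version:

```python
import numpy as np

# Hypothetical pipeline invocation (API assumed, not verified):
# from towhee import pipeline
# embed = pipeline('audio-embedding-vggish')
# embs = embed('/path/to/audio.wav')  # shape (num_clips, 128)

# Mock embeddings standing in for two clips' VGGish vectors.
rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine_similarity(a, b)  # in [-1, 1]; identical vectors give 1.0
```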

References

[1] https://github.com/tensorflow/models/tree/master/research/audioset/vggish
[2] https://tfhub.dev/google/vggish/1

More Resources

  • What is a Transformer Model? An Engineer's Guide: A transformer model is a neural network architecture that is proficient at converting a particular type of input into a distinct output. Its core strength lies in its ability to handle inputs and outputs of different sequence lengths. It does this by encoding the input into a matrix with predefined dimensions and then combining that with another attention matrix to decode. This transformation unfolds through a sequence of collaborative layers, which deconstruct words into their corresponding numerical representations.

At its heart, a transformer model is a bridge between disparate linguistic structures, employing sophisticated neural network configurations to decode and manipulate human language input. An example of a transformer model is GPT-3, which ingests human language and generates text output.
