
# VGGish Embedding Operator (PyTorch)

Authors: Jael Gu

## Overview

This operator reads the waveform of an audio file and applies VGGish to extract features. The original VGGish model is built on top of TensorFlow.[1] This operator converts VGGish to **PyTorch**. Given an input, it generates a set of vectors. Each vector represents the features of a non-overlapping clip with a fixed length of 0.96s, and each clip is composed of 64 mel bands and 96 frames. The model is pre-trained on the large-scale audio dataset [AudioSet](https://research.google.com/audioset). As the original authors suggest, the model is suited to extracting high-level features or warming up a larger model.
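The clip arithmetic above can be sketched in a few lines. This is a rough estimate only: the helper names are hypothetical, the 16 kHz sample rate is an assumption for the example, and the operator itself may count clips slightly differently at the edges depending on how it windows the spectrogram.

```python
CLIP_SECONDS = 0.96    # clip length stated in the overview
FRAMES_PER_CLIP = 96   # frames per clip stated in the overview
MEL_BANDS = 64         # mel bands per clip stated in the overview
EMBEDDING_DIM = 128    # embedding width listed under Returns


def estimate_num_clips(num_samples: int, sample_rate: int) -> int:
    """Estimate how many full, non-overlapping 0.96 s clips fit in a waveform."""
    samples_per_clip = round(CLIP_SECONDS * sample_rate)
    return num_samples // samples_per_clip


def expected_shapes(num_samples: int, sample_rate: int):
    """Return (clip feature shape, embedding shape) for the estimated clip count."""
    n = estimate_num_clips(num_samples, sample_rate)
    return (n, FRAMES_PER_CLIP, MEL_BANDS), (n, EMBEDDING_DIM)


# Three seconds of audio at an assumed 16 kHz sample rate
features, embeddings = expected_shapes(3 * 16000, 16000)
print(features, embeddings)  # (3, 96, 64) (3, 128)
```

Audio shorter than one clip yields an estimate of zero here, so it is worth checking how the operator handles very short inputs before relying on it.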

## Interface

```python
__call__(self, datas: List[NamedTuple('data', [('audio', 'ndarray'), ('sample_rate', 'int')])])
```

**Args:**

- datas:
  - a list of named tuples, each holding the audio data as a `numpy.ndarray` and the sample rate as an integer

**Returns:**

The operator returns a tuple `Tuple[('embs', numpy.ndarray)]` containing the following fields:

- vec:
  - embeddings of the audio
  - data type: `numpy.ndarray`
  - shape: (num_clips, 128)
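As a minimal sketch of the contract described above, the snippet below builds an input in the documented form and checks that an array has the documented output shape. `AudioData`, `make_input`, and `looks_like_vggish_output` are hypothetical names introduced for illustration; they are not part of the operator, and the operator call itself is not shown.

```python
from typing import List, NamedTuple

import numpy as np


class AudioData(NamedTuple):
    """Hypothetical stand-in for the 'data' named tuple in the signature."""
    audio: np.ndarray
    sample_rate: int


def make_input(waveform, sample_rate: int) -> List[AudioData]:
    """Wrap a waveform and its sample rate in the list-of-named-tuples form."""
    return [AudioData(audio=np.asarray(waveform), sample_rate=int(sample_rate))]


def looks_like_vggish_output(vec) -> bool:
    """Check for the documented output: a 2-D ndarray of shape (num_clips, 128)."""
    return isinstance(vec, np.ndarray) and vec.ndim == 2 and vec.shape[1] == 128


# One second of silence at an assumed 16 kHz sample rate as a toy input
datas = make_input(np.zeros(16000), 16000)
```

A list built this way is what would be passed as `datas`, and a check like `looks_like_vggish_output` can guard downstream code that expects 128-dimensional embeddings.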

## Requirements

Install the required Python packages listed in [requirements.txt](./requirements.txt).

## How it works

The `towhee/torch-vggish` operator implements audio embedding and can be added to a Towhee pipeline. For example, it is the key operator of the pipeline [audio-embedding-vggish](https://hub.towhee.io/towhee/audio-embedding-vggish).

## Reference

[1]. https://github.com/tensorflow/models/tree/master/research/audioset/vggish

[2]. https://tfhub.dev/google/vggish/1


## More Resources

- [What is a Transformer Model? An Engineer's Guide](https://zilliz.com/glossary/transformer-models): A transformer model is a neural network architecture. It's proficient in converting a particular type of input into a distinct output. Its core strength lies in its ability to handle inputs and outputs of different sequence length. It does this through encoding the input into a matrix with predefined dimensions and then combining that with another attention matrix to decode. This transformation unfolds through a sequence of collaborative layers, which deconstruct words into their corresponding numerical representations. At its heart, a transformer model is a bridge between disparate linguistic structures, employing sophisticated neural network configurations to decode and manipulate human language input. An example of a transformer model is GPT-3, which ingests human language and generates text output.
- [Comparing Different Vector Embeddings - Zilliz blog](https://zilliz.com/blog/comparing-different-vector-embeddings): Learn about the difference in vector embeddings between models and how to use multiple collections of vector data in one Jupyter Notebook.
- [How to Get the Right Vector Embeddings - Zilliz blog](https://zilliz.com/blog/how-to-get-the-right-vector-embeddings): A comprehensive introduction to vector embeddings and how to generate them with popular open-source models.
- [Exploring OpenAI CLIP: The Future of Multi-Modal AI Learning - Zilliz blog](https://zilliz.com/learn/exploring-openai-clip-the-future-of-multimodal-ai-learning): Multimodal AI learning can get input and understand information from various modalities like text, images, and audio together, leading to a deeper understanding of the world. Learn more about OpenAI's CLIP (Contrastive Language-Image Pre-training), a popular multimodal model for text and image data.
- [Audio Retrieval Based on Milvus - Zilliz blog](https://zilliz.com/blog/audio-retrieval-based-on-milvus): Create an audio retrieval system using Milvus, an open-source vector database. Classify and analyze sound data in real time.
- [Sparse and Dense Embeddings: A Guide for Effective Information Retrieval with Milvus | Zilliz Webinar](https://zilliz.com/event/sparse-and-dense-embeddings-webinar): Zilliz webinar covering what sparse and dense embeddings are and when you'd want to use one over the other.
- [Zilliz partnership with PyTorch - View image search solution tutorial](https://zilliz.com/partners/pytorch): Zilliz partnership with PyTorch