# ViT Embedding Operator

Authors: kyle he

## Overview

The Vision Transformer (ViT) is an image classification model that applies a Transformer-like architecture to patches of the image. It uses Multi-Head Attention, Scaled Dot-Product Attention, and other components of the Transformer architecture traditionally used for NLP [1]. The model is trained on the [ImageNet dataset](https://image-net.org/download.php).
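
As a rough illustration of the two ideas named above, the following NumPy sketch splits an image into non-overlapping patches and applies scaled dot-product attention across them. The 16-pixel patch size and 224-pixel input are read off the default model name `vit_large_patch16_224`. The function names `patchify` and `scaled_dot_product_attention` are illustrative only, and the sketch leaves out the learned projections, position embeddings, and multiple heads that the actual model uses.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major patch order.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Compute softmax(q @ k.T / sqrt(d_k)) @ v for 2-D arrays."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Random pixels stand in for a real image (assumption for illustration).
image = np.random.default_rng(0).random((224, 224, 3))
tokens = patchify(image)
# Self-attention without learned projections: q, k and v are all the raw patches.
attended = scaled_dot_product_attention(tokens, tokens, tokens)
```

With a 224x224 RGB input and 16x16 patches, `patchify` yields 14 x 14 = 196 patch tokens, each of length 16 x 16 x 3 = 768 before any learned projection.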

## Interface

```python
__init__(self, model_name: str = 'vit_large_patch16_224',
         framework: str = 'pytorch', weights_path: str = None)
```

**Args:**

- model_name:
  - the model name for embedding
  - supported types: `str`, for example 'vit_large_patch16_224'
- framework:
  - the framework of the model
  - supported types: `str`, default is 'pytorch'
- weights_path:
  - the path to the model weights
  - supported types: `str`, default is None, in which case pretrained weights are used

```python
__call__(self, image: 'towhee.types.Image')
```

**Args:**

- image:
  - the input image
  - supported types: `towhee.types.Image`

**Returns:**

The Operator returns a tuple `Tuple[('embedding', numpy.ndarray)]` containing the following fields:

- feature_vector:
  - the embedding of the image
  - data type: `numpy.ndarray`
  - shape: (dim,)

## Requirements

The required Python packages are listed in [requirements.txt](./requirements.txt).

## How it works

The `towhee/vit-embedding` Operator implements image embedding and can be added to a pipeline. For example, it is the key Operator, named `embedding_model`, within the [image-embedding-vitlarge](https://hub.towhee.io/towhee/image-embedding-vitlarge) pipeline.
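
The sketch below shows, under loud assumptions, how an embedding step can sit inside a pipeline and how its `(dim,)` output might be compared downstream. `StandInEmbeddingOperator`, `Outputs`, `run_pipeline`, `to_float`, and `cosine_similarity` are all hypothetical names invented for this illustration; they are not Towhee classes, and a fixed random projection of the mean colour stands in for the pretrained ViT so the example runs without downloading weights.

```python
from typing import Callable, List, NamedTuple

import numpy as np


class Outputs(NamedTuple):
    # Hypothetical result container mirroring the documented return value.
    feature_vector: np.ndarray


class StandInEmbeddingOperator:
    """Hypothetical stand-in for the ViT operator. NOT the real model."""

    def __init__(self, dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        # A fixed random projection replaces the pretrained ViT weights.
        self._projection = rng.standard_normal((3, dim))

    def __call__(self, image: np.ndarray) -> Outputs:
        mean_color = image.reshape(-1, 3).mean(axis=0)  # shape (3,)
        return Outputs(feature_vector=mean_color @ self._projection)  # shape (dim,)


def to_float(image: np.ndarray) -> np.ndarray:
    """Preprocessing step: scale uint8 pixels to [0, 1]."""
    return image.astype(np.float64) / 255.0


def run_pipeline(steps: List[Callable], value):
    """Feed the output of each step into the next one."""
    for step in steps:
        value = step(value)
    return value


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# The embedding operator is just one step among others in the pipeline.
pipeline = [to_float, StandInEmbeddingOperator(dim=8)]

rng = np.random.default_rng(1)
image_a = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
image_b = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

vec_a = run_pipeline(pipeline, image_a).feature_vector
vec_b = run_pipeline(pipeline, image_b).feature_vector
similarity = cosine_similarity(vec_a, vec_b)
```

Keeping the embedding step behind a plain callable interface means the stand-in could be swapped for the real operator without touching the surrounding steps, provided the real operator returns a one-dimensional `numpy.ndarray` as documented above.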

## Reference

[1] Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", https://arxiv.org/abs/2010.11929


## More Resources

- [What is a Transformer Model? An Engineer's Guide](https://zilliz.com/glossary/transformer-models): A transformer model is a neural network architecture. It's proficient in converting a particular type of input into a distinct output. Its core strength lies in its ability to handle inputs and outputs of different sequence length. It does this through encoding the input into a matrix with predefined dimensions and then combining that with another attention matrix to decode. This transformation unfolds through a sequence of collaborative layers, which deconstruct words into their corresponding numerical representations. At its heart, a transformer model is a bridge between disparate linguistic structures, employing sophisticated neural network configurations to decode and manipulate human language input. An example of a transformer model is GPT-3, which ingests human language and generates text output.
- [How to Get the Right Vector Embeddings - Zilliz blog](https://zilliz.com/blog/how-to-get-the-right-vector-embeddings): A comprehensive introduction to vector embeddings and how to generate them with popular open-source models.
- [The guide to clip-vit-base-patch32 | OpenAI](https://zilliz.com/ai-models/clip-vit-base-patch32): clip-vit-base-patch32: a CLIP multimodal model variant by OpenAI for image and text embedding.
- [What are Vision Transformers (ViT)? - Zilliz blog](https://zilliz.com/learn/understanding-vision-transformers-vit): Vision Transformers (ViTs) are neural network models that use transformers to perform computer vision tasks like object detection and image classification.
- [What Are Vector Embeddings?](https://zilliz.com/glossary/vector-embeddings): Learn the definition of vector embeddings, how to create vector embeddings, and more.
- [What is Detection Transformers (DETR)? - Zilliz blog](https://zilliz.com/learn/detection-transformers-detr-end-to-end-object-detection-with-transformers): DETR (DEtection TRansformer) is a deep learning model for end-to-end object detection using transformers.