
ViT Embedding Operator

Authors: kyle he

Overview

ViT (Vision Transformer) is an image classification model that applies a Transformer-like architecture over patches of the image. It uses Multi-Head Attention, Scaled Dot-Product Attention, and other architectural features from the Transformer architecture traditionally used for NLP [1]. The model is pretrained on the ImageNet dataset.
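
For instance, the default vit_large_patch16_224 model takes 224×224 images and splits them into 16×16 patches, giving 14 × 14 = 196 patch tokens. Below is a minimal sketch of this patch decomposition in PyTorch (illustrative only, not the operator's actual preprocessing code):

import torch

# A 224x224 RGB image viewed as 16x16 patches, as a ViT backbone sees it.
image = torch.randn(1, 3, 224, 224)                    # (batch, channels, H, W)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)    # -> (1, 3, 14, 14, 16, 16)
patches = patches.reshape(1, 3, -1, 16, 16)            # 14 * 14 = 196 patches
print(patches.shape[2])                                # 196 patch tokens per image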

Interface

__init__(self, model_name: str = 'vit_large_patch16_224',
                 framework: str = 'pytorch', weights_path: str = None)

Args:

  • model_name:
    • the model name for embedding
    • supported types: str, for example 'vit_large_patch16_224'
  • framework:
    • the framework of the model
    • supported types: str, default is 'pytorch'
  • weights_path:
    • the weights path
    • supported types: str, default is None, using pretrained weights
__call__(self, image: 'towhee.types.Image')

Args:

  • image:
    • the input image
    • supported types: towhee.types.Image

Returns:

The Operator returns a Tuple[('embedding', numpy.ndarray)] containing the following fields:

  • feature_vector:
    • the embedding of the image
    • data type: numpy.ndarray
    • shape: (dim,)
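
The following is a minimal, self-contained sketch of how an operator with this interface could be used and roughly implemented (an assumption-laden illustration built on the timm library, not the actual towhee/vit-embedding source; the class name VitEmbeddingSketch and the PIL-based preprocessing are hypothetical):

from collections import namedtuple

import numpy as np
import timm
import torch
from PIL import Image
from torchvision import transforms

Outputs = namedtuple('Outputs', ['feature_vector'])

class VitEmbeddingSketch:
    # Hypothetical re-implementation mirroring the documented __init__/__call__.
    def __init__(self, model_name: str = 'vit_large_patch16_224',
                 framework: str = 'pytorch', weights_path: str = None):
        assert framework == 'pytorch', 'only the pytorch path is sketched here'
        # num_classes=0 drops the classification head so the model returns features.
        self.model = timm.create_model(model_name,
                                       pretrained=(weights_path is None),
                                       num_classes=0)
        if weights_path is not None:
            state = torch.load(weights_path, map_location='cpu')
            self.model.load_state_dict(state, strict=False)
        self.model.eval()
        self.tfms = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
        ])

    def __call__(self, image: Image.Image) -> Outputs:
        img_tensor = self.tfms(image).unsqueeze(0)       # (1, 3, 224, 224)
        with torch.no_grad():
            features = self.model(img_tensor)            # (1, dim)
        return Outputs(features.squeeze(0).numpy())      # feature_vector: (dim,)

# Usage:
#   op = VitEmbeddingSketch()
#   embedding = op(Image.open('cat.jpg').convert('RGB')).feature_vector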

Requirements

The required Python packages are listed in requirements.txt and can be installed with pip install -r requirements.txt.

How it works

The towhee/vit-embedding Operator implements image embedding and can be added to a pipeline. For example, it is the key Operator, named embedding_model, in the image-embedding-vitlarge pipeline. A hedged usage sketch follows.
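
This sketch assumes towhee's client-side pipeline API and the pipeline name mentioned above; exact call signatures may differ between towhee versions:

from towhee import pipeline

# Load the pipeline whose embedding_model stage is this operator,
# then embed a local image file.
embedding_pipeline = pipeline('towhee/image-embedding-vitlarge')
embedding = embedding_pipeline('path/to/image.jpg')
print(embedding.shape)   # expected: (dim,) numpy.ndarray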

Reference

[1] https://arxiv.org/abs/2010.11929

More Resources

  • What is a Transformer Model? An Engineer's Guide: A transformer model is a neural network architecture. It's proficient in converting a particular type of input into a distinct output. Its core strength lies in its ability to handle inputs and outputs of different sequence lengths. It does this by encoding the input into a matrix with predefined dimensions and then combining that with another attention matrix to decode. This transformation unfolds through a sequence of collaborative layers, which deconstruct words into their corresponding numerical representations. At its heart, a transformer model is a bridge between disparate linguistic structures, employing sophisticated neural network configurations to decode and manipulate human language input. An example of a transformer model is GPT-3, which ingests human language and generates text output.
  • How to Get the Right Vector Embeddings - Zilliz blog: A comprehensive introduction to vector embeddings and how to generate them with popular open-source models.
  • The guide to clip-vit-base-patch32 | OpenAI: clip-vit-base-patch32: a CLIP multimodal model variant by OpenAI for image and text embedding.
  • What are Vision Transformers (ViT)? - Zilliz blog: Vision Transformers (ViTs) are neural network models that use transformers to perform computer vision tasks like object detection and image classification.
  • What Are Vector Embeddings?: Learn the definition of vector embeddings, how to create vector embeddings, and more.
  • What is Detection Transformers (DETR)? - Zilliz blog: DETR (DEtection TRansformer) is a deep learning model for end-to-end object detection using transformers.
