
ViT Embedding Operator

Authors: kyle he

Overview

ViT (Vision Transformer) is an image classification model that applies a Transformer-like architecture over patches of the image. It uses Multi-Head Attention, Scaled Dot-Product Attention, and other architectural features from the Transformer architecture traditionally used for NLP [1]. The model is pretrained on the ImageNet dataset.
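
For instance, the default vit_large_patch16_224 model takes 224×224 images and splits them into 16×16 patches, giving 14 × 14 = 196 patch tokens. Below is a minimal sketch of this patch decomposition in PyTorch (illustrative only, not the operator's actual preprocessing code):

import torch

# A 224x224 RGB image viewed as 16x16 patches, as a ViT backbone sees it.
image = torch.randn(1, 3, 224, 224)                    # (batch, channels, H, W)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)    # -> (1, 3, 14, 14, 16, 16)
patches = patches.reshape(1, 3, -1, 16, 16)            # 14 * 14 = 196 patches
print(patches.shape[2])                                # 196 patch tokens per image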

Interface

__init__(self, model_name: str = 'vit_large_patch16_224',
                 framework: str = 'pytorch', weights_path: str = None)

Args:

  • model_name:
    • the model name for embedding
    • supported types: str, for example 'vit_large_patch16_224'
  • framework:
    • the framework of the model
    • supported types: str, default is 'pytorch'
  • weights_path:
    • the weights path
    • supported types: str, default is None, using pretrained weights
__call__(self, image: 'towhee.types.Image')

Args:

  • image:
    • the input image
    • supported types: towhee.types.Image

Returns:

The Operator returns a Tuple[('embedding', numpy.ndarray)] containing the following fields:

  • feature_vector:
    • the embedding of the image
    • data type: numpy.ndarray
    • shape: (dim,)
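
The following is a minimal, self-contained sketch of how an operator with this interface could be used and roughly implemented (an assumption-laden illustration built on the timm library, not the actual towhee/vit-embedding source; the class name VitEmbeddingSketch and the PIL-based preprocessing are hypothetical):

from collections import namedtuple

import numpy as np
import timm
import torch
from PIL import Image
from torchvision import transforms

Outputs = namedtuple('Outputs', ['feature_vector'])

class VitEmbeddingSketch:
    # Hypothetical re-implementation mirroring the documented __init__/__call__.
    def __init__(self, model_name: str = 'vit_large_patch16_224',
                 framework: str = 'pytorch', weights_path: str = None):
        assert framework == 'pytorch', 'only the pytorch path is sketched here'
        # num_classes=0 drops the classification head so the model returns features.
        self.model = timm.create_model(model_name,
                                       pretrained=(weights_path is None),
                                       num_classes=0)
        if weights_path is not None:
            state = torch.load(weights_path, map_location='cpu')
            self.model.load_state_dict(state, strict=False)
        self.model.eval()
        self.tfms = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
        ])

    def __call__(self, image: Image.Image) -> Outputs:
        img_tensor = self.tfms(image).unsqueeze(0)       # (1, 3, 224, 224)
        with torch.no_grad():
            features = self.model(img_tensor)            # (1, dim)
        return Outputs(features.squeeze(0).numpy())      # feature_vector: (dim,)

# Usage:
#   op = VitEmbeddingSketch()
#   embedding = op(Image.open('cat.jpg').convert('RGB')).feature_vector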

Requirements

The required Python packages are listed in requirements.txt and can be installed with pip install -r requirements.txt.

How it works

The towhee/vit-embedding Operator implements image embedding and can be added to a pipeline. For example, it is the key Operator, named embedding_model, in the image-embedding-vitlarge pipeline. A hedged usage sketch follows.
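
This sketch assumes towhee's client-side pipeline API and the pipeline name mentioned above; exact call signatures may differ between towhee versions:

from towhee import pipeline

# Load the pipeline whose embedding_model stage is this operator,
# then embed a local image file.
embedding_pipeline = pipeline('towhee/image-embedding-vitlarge')
embedding = embedding_pipeline('path/to/image.jpg')
print(embedding.shape)   # expected: (dim,) numpy.ndarray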

Reference

[1] https://arxiv.org/abs/2010.11929

More Resources

  • What is a Transformer Model? An Engineer's Guide: A transformer model is a neural network architecture. It's proficient in converting a particular type of input into a distinct output. Its core strength lies in its ability to handle inputs and outputs of different sequence lengths. It does this by encoding the input into a matrix with predefined dimensions and then combining that with another attention matrix to decode. This transformation unfolds through a sequence of collaborative layers, which deconstruct words into their corresponding numerical representations. At its heart, a transformer model is a bridge between disparate linguistic structures, employing sophisticated neural network configurations to decode and manipulate human language input. An example of a transformer model is GPT-3, which ingests human language and generates text output.
  • How to Get the Right Vector Embeddings - Zilliz blog: A comprehensive introduction to vector embeddings and how to generate them with popular open-source models.
  • The guide to clip-vit-base-patch32 | OpenAI: clip-vit-base-patch32: a CLIP multimodal model variant by OpenAI for image and text embedding.
  • What are Vision Transformers (ViT)? - Zilliz blog: Vision Transformers (ViTs) are neural network models that use transformers to perform computer vision tasks like object detection and image classification.
  • What Are Vector Embeddings?: Learn the definition of vector embeddings, how to create vector embeddings, and more.
  • What is Detection Transformers (DETR)? - Zilliz blog: DETR (DEtection TRansformer) is a deep learning model for end-to-end object detection using transformers.
