ViT Embedding Operator
Authors: kyle he
Overview
ViT (Vision Transformer) is a model for image classification that applies a Transformer-like architecture over patches of an image. It uses Multi-Head Attention, Scaled Dot-Product Attention, and other architectural features of the Transformer traditionally used for NLP [1]. The model is trained on the ImageNet dataset.
Interface
__init__(self, model_name: str = 'vit_large_patch16_224',
         framework: str = 'pytorch', weights_path: str = None)
Args:
- model_name:
  - the model name for embedding
  - supported types: str, for example 'vit_large_patch16_224'
- framework:
  - the framework of the model
  - supported types: str, default is 'pytorch'
- weights_path:
  - the path to the model weights
  - supported types: str, default is None (pretrained weights are used)
__call__(self, img_path: str)
Args:
- img_path:
  - the path to the input image
  - supported types: str
Returns:
The Operator returns a tuple Tuple[('feature_vector', numpy.ndarray)] containing the following fields:
- feature_vector:
  - the embedding of the image
  - data type: numpy.ndarray
  - shape: (dim,)
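Example: the following is a minimal sketch of calling the Operator directly. The class name VitEmbedding, its import path, and the attribute-style access to the returned tuple are assumptions for illustration, not confirmed by this README.

# Hypothetical import path and class name for this Operator.
from vit_embedding import VitEmbedding

# Instantiate with the documented defaults (pretrained weights).
op = VitEmbedding(model_name='vit_large_patch16_224', framework='pytorch')

# 'test.jpg' is a placeholder path to a local image.
outputs = op('test.jpg')
embedding = outputs.feature_vector  # numpy.ndarray of shape (dim,)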
Requirements
You can install the required Python packages with requirements.txt (for example, pip install -r requirements.txt).
How it works
The towhee/vit-embedding Operator implements image embedding and can be added to a pipeline. For example, it is the key Operator, named embedding_model, within the image-embedding-vitlarge pipeline, as sketched below.
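As an illustration, here is a minimal sketch of using the Operator through a pipeline, assuming the towhee 0.x client API in which a published pipeline is loaded by name; the exact pipeline name and API version are assumptions.

# Load the published pipeline that uses this Operator as embedding_model.
from towhee import pipeline

embedding_pipeline = pipeline('towhee/image-embedding-vitlarge')

# 'test.jpg' is a placeholder path to a local image.
embedding = embedding_pipeline('test.jpg')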
Reference
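[1] Dosovitskiy, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv:2010.11929 (2020).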