# ViT Embedding Operator

Authors: kyle he

## Overview

ViT (Vision Transformer) is a model for image classification that employs a Transformer-like architecture over patches of the image. It uses Multi-Head Attention, Scaled Dot-Product Attention, and other architectural features of the Transformer traditionally used for NLP [1]. The model is pretrained on the [ImageNet dataset](https://image-net.org/download.php).

## Interface

```python
__init__(self, model_name: str = 'vit_large_patch16_224', framework: str = 'pytorch', weights_path: str = None)
```

**Args:**

- model_name:
  - the name of the model used for embedding
  - supported types: `str`, for example `'vit_large_patch16_224'`
- framework:
  - the deep learning framework of the model
  - supported types: `str`, default is `'pytorch'`
- weights_path:
  - the path to custom model weights
  - supported types: `str`, default is `None`, which uses the pretrained weights

```python
__call__(self, img_path: str)
```

**Args:**

- img_path:
  - the path of the input image
  - supported types: `str`

**Returns:**

The Operator returns a tuple `Tuple[('feature_vector', numpy.ndarray)]` containing the following fields:

- feature_vector:
  - the embedding of the image
  - data type: `numpy.ndarray`

## Requirements

You can install the required Python packages from [requirements.txt](./requirements.txt).

## How it works

The `towhee/vit-embedding` Operator implements image embedding and can be added to a pipeline. For example, it is the key Operator, named `embedding_model`, within the [vit-embedding](https://hub.towhee.io/towhee/vit-embedding) pipeline. Minimal usage sketches are given under Example usage below.

## Reference

[1] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929
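
## Example usage

As an illustration, here is a minimal sketch of calling the Operator directly. The import path `vit_embedding.VitEmbedding` and the image path are assumptions made for this sketch; check the repository for the actual class name and entry point.

```python
import numpy as np

# Hypothetical import path; the actual module/class name may differ.
from vit_embedding import VitEmbedding

# Instantiate with the defaults documented in the Interface section.
op = VitEmbedding(model_name='vit_large_patch16_224', framework='pytorch')

# Embed a local image (the path is a placeholder).
result = op('./test.jpg')

# Per the Returns section, the output carries a `feature_vector` field.
embedding = result.feature_vector
print(type(embedding), np.asarray(embedding).shape)
```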
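
The Operator can also be exercised through the published pipeline mentioned in the How it works section. The `pipeline` entry point below is an assumption based on the Towhee hub interface; only the pipeline name `towhee/vit-embedding` comes from this document.

```python
from towhee import pipeline

# Load the published pipeline that wraps this Operator as `embedding_model`.
embedding_pipeline = pipeline('towhee/vit-embedding')

# The pipeline takes an image path and returns the image embedding.
embedding = embedding_pipeline('./test.jpg')
print(embedding)
```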