# ViT Embedding Operator
Authors: kyle he
## Overview
ViT (Vision Transformer) is an image-classification model that applies a Transformer-style architecture to patches of the image. It uses Multi-Head Attention, Scaled Dot-Product Attention, and other components of the Transformer architecture traditionally used for NLP [1], and is trained on the [ImageNet dataset](https://image-net.org/download.php).
## Interface
```python
__init__(self, model_name: str = 'vit_large_patch16_224',
         framework: str = 'pytorch', weights_path: str = None)
```
**Args:**
- model_name:
  - the name of the model used for embedding
  - supported types: `str`, for example 'vit_large_patch16_224'
- framework:
  - the deep learning framework of the model
  - supported types: `str`, default is 'pytorch'
- weights_path:
  - the path to local model weights
  - supported types: `str`, default is None, which uses the pretrained weights
```python
__call__(self, img_path: str)
```
**Args:**
- img_path:
- the input image path
- supported types: `str`
**Returns:**
The Operator returns a tuple `Tuple[('embedding', numpy.ndarray)]` containing the following field:
- embedding:
  - the embedding of the input image
  - data type: `numpy.ndarray`
## Requirements
The required Python packages are listed in [requirements.txt](./requirements.txt).
## How it works
The `towhee/vit-embedding` Operator implements image embedding and can be added to a pipeline. For example, it is the key Operator, named `embedding_model`, in the [image-embedding-vitlarge](https://hub.towhee.io/towhee/image-embedding-vitlarge) pipeline.
## Reference
[1] Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." https://arxiv.org/abs/2010.11929