{"id":2317,"date":"2025-06-11T12:12:33","date_gmt":"2025-06-11T06:42:33","guid":{"rendered":"https:\/\/texpertssolutions.com\/notes\/?p=2317"},"modified":"2025-06-26T14:53:07","modified_gmt":"2025-06-26T09:23:07","slug":"what-are-visual-transformer","status":"publish","type":"post","link":"https:\/\/texpertssolutions.com\/notes\/2025\/06\/11\/what-are-visual-transformer\/","title":{"rendered":"What Are Visual Transformers?"},"content":{"rendered":"\n<p>Let\u2019s break down <strong>Visual Transformers<\/strong> (also called <strong>Vision Transformers<\/strong> or <strong>ViTs<\/strong>) in a super simple and clear way \ud83e\udde0\ud83d\udc41\ufe0f\u200d\ud83d\udde8\ufe0f\ud83d\udcca<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83e\udde0 What Are Visual Transformers?<\/h2>\n\n\n\n<p><strong>Visual Transformers<\/strong> are deep learning models that use the <strong>Transformer architecture<\/strong> (originally made for text \ud83d\udcc4) to <strong>analyze images \ud83d\uddbc\ufe0f<\/strong> instead of relying on the convolutions used by CNNs (Convolutional Neural Networks).<\/p>\n\n\n\n<p>They were introduced in the paper <strong>&#8220;An Image is Worth 16&#215;16 Words: Transformers for Image Recognition at Scale&#8221;<\/strong> by researchers at Google in 2020 \ud83d\udcda\ud83d\udd0d<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udce6 Basic Idea:<\/h3>\n\n\n\n<p>Think of an image as a <strong>puzzle<\/strong> \ud83e\udde9:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We break the image into small <strong>patches<\/strong> (like mini-images)<\/li>\n\n\n\n<li>Each patch is treated like a &#8220;word&#8221; (a token) in NLP \ud83e\uddfe<\/li>\n\n\n\n<li>Then the <strong>Transformer<\/strong> learns relationships between these patches (like it does with words in a sentence!)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udd0d How It Works (Simply Explained):<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Step 
1\ufe0f\u20e3: \ud83e\udde9 <strong>Image to Patches<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Divide the image into small square patches (e.g., 16\u00d716 pixels each)<\/li>\n\n\n\n<li>Flatten each patch into a vector \ud83d\udccf (a 16\u00d716 RGB patch becomes 16 \u00d7 16 \u00d7 3 = 768 numbers)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2\ufe0f\u20e3: \ud83e\udde0 <strong>Embed the Patches<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Just like word embeddings in text, each flattened patch is mapped by a learned linear projection to a fixed-size embedding vector<\/li>\n\n\n\n<li>Add <strong>position info<\/strong> so the model knows <strong>where<\/strong> each patch belongs in the image \ud83e\udded<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3\ufe0f\u20e3: \ud83d\udd01 <strong>Transformer Encoder<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Feed the whole sequence of patch embeddings into the Transformer encoder<\/li>\n\n\n\n<li>Its <strong>self-attention<\/strong> layers let every patch look at every other patch, so the model learns relationships between different parts of the image \ud83d\udc41\ufe0f\u200d\ud83d\udde8\ufe0f<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4\ufe0f\u20e3: \ud83e\uddfe <strong>Classification or Output<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A learnable <strong>classification token<\/strong> (like [CLS] in BERT for text) is added to the patch sequence before the encoder; its final output goes through a small classification head that predicts the image class (e.g., &#8220;cat&#8221; \ud83d\udc31 or &#8220;car&#8221; \ud83d\ude97)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83c\udd9a ViT vs CNN (What\u2019s the Difference?)<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Feature<\/th><th>CNN \ud83e\udde0<\/th><th>ViT \ud83e\udd16<\/th><\/tr><\/thead><tbody><tr><td>Works with<\/td><td>Pixels directly<\/td><td>Image patches<\/td><\/tr><tr><td>Learns using<\/td><td>Convolutions (filters)<\/td><td>Attention (relationships)<\/td><\/tr><tr><td>Position Awareness<\/td><td>Built-in (via structure)<\/td><td>Needs 
positional encoding<\/td><\/tr><tr><td>Data Requirement<\/td><td>Works well even on smaller datasets<\/td><td>Needs <strong>lots of data<\/strong> or pretraining<\/td><\/tr><tr><td>Interpretability<\/td><td>Less clear<\/td><td>Attention maps can give some insight \u2728<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83e\uddea Where Are Visual Transformers Used?<\/h2>\n\n\n\n<p>\u2705 Image classification (e.g., cat vs dog) \ud83d\udc36\ud83d\udc31<br>\u2705 Object detection \ud83e\uddcd\ud83d\udce6<br>\u2705 Image segmentation \ud83e\udde0\ud83e\udde9<br>\u2705 Medical imaging \ud83e\uddec<br>\u2705 Video analysis \ud83c\udfa5<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">\ud83d\udcc8 Pros and Cons<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>\u2705 Pros<\/th><th>\u274c Cons<\/th><\/tr><\/thead><tbody><tr><td>Great at <strong>global understanding<\/strong> \ud83c\udf0d<\/td><td>Needs <strong>more data<\/strong> to train well<\/td><\/tr><tr><td>Benefits greatly from <strong>pretraining<\/strong> \ud83c\udfcb\ufe0f<\/td><td>Often slower than CNNs on small tasks \ud83d\udc0c<\/td><\/tr><tr><td>Easy to scale \ud83c\udfd7\ufe0f<\/td><td>More complex to implement \ud83d\udcbb<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83c\udfaf TL;DR:<\/h3>\n\n\n\n<p><strong>Visual Transformers = Transformer + Images<\/strong><\/p>\n\n\n\n<p>They slice images into patches \ud83e\udde9, treat them like words \ud83d\udcd6, and use attention to &#8220;understand&#8221; the image \ud83e\udde0\ud83d\udc41\ufe0f\u200d\ud83d\udde8\ufe0f, just like Transformers do with sentences!<\/p>\n\n\n\n<hr class=\"wp-block-separator 
has-alpha-channel-opacity\"\/>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let\u2019s break down Visual Transformers (also called Vision Transformers or ViTs) in a super simple and &hellip; <a title=\"What Are Visual Transformers?\" class=\"hm-read-more\" href=\"https:\/\/texpertssolutions.com\/notes\/2025\/06\/11\/what-are-visual-transformer\/\"><span class=\"screen-reader-text\">What Are Visual Transformers?<\/span>Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":2350,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[641],"tags":[],"class_list":["post-2317","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-machine-learning"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/texpertssolutions.com\/notes\/wp-content\/uploads\/2025\/06\/7.png?fit=1280%2C720&ssl=1","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/posts\/2317","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/comments?post=2317"}],"version-history":[{"count":2,"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/posts\/2317\/revisions"}],"predecessor-version":[{"id":2367,"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/posts\/2317\/revisions\/2367"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/media\/2350"}],"wp:attachment":[{"href"
:"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/media?parent=2317"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/categories?post=2317"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/texpertssolutions.com\/notes\/wp-json\/wp\/v2\/tags?post=2317"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}