Mastering Transformer Models for Computer Vision & GenAI
20-hour course spread over 8 sessions. Timing: 11:30 am to 1:30 pm IST
Commencing 8th April 2025 on Tuesdays & Thursdays
Unlock the Power of Large Language Models and Transformer Architectures with Dr. Anand
Course Instructor
Dr. S. Mahesh Anand is a distinguished educator, corporate trainer, keynote speaker, and consultant in the fields of Data Science, Machine Learning, and Artificial Intelligence. With over two decades of experience, Dr. Anand has been instrumental in shaping the learning journey of more than 50,000 students and professionals across India.
Dr. Anand served as a full-time faculty member at VIT University (Vellore) for a decade, where he honed his academic and research skills, before founding his consulting and training firm, Scientific Computing Solutions (SCS-India), in 2012.
His professional footprint includes delivering transformative corporate training sessions for leading organizations such as Great Learning, Chegg, TNQTech, CGI, Mad Street Den, and many startups, alongside conducting over 800 master training sessions for faculty members at higher-education institutions across India.
Among his accolades, Dr. Anand is the recipient of the AT&T Labs Award from IEEE Headquarters and the M.V. Chauhan Award from the IEEE India Council for his pioneering work on an ANN-Fuzzy hybrid AI model for cancer prediction. He was also recognized with the Best Data Science & AI Educator Award by AI Global Media, UK, in 2022.
As the founder of "Learn AI with Anand," a flagship program of SCS-India, he continues to inspire learners through his online cohort courses.
MTM for GenAI LLMs: Course Outline
Session-1: Introduction to Byte Pair Encoding (Tokenization), Word Embeddings, Positional Encoding
Session-2: Visualization and Interpretation of Word Embeddings & Positional Encoding
Session-3: Introduction to the Self-Attention Mechanism in the Encoder: Attention Score vs. Attention Vector, Multi-Head Attention, Latent Attention
Session-4: Role of Feed-Forward Layers, Mixture of Experts (MoE), and Output-Layer Configurations for Encoder-Only BERT/RoBERTa
Session-5: Loading BERT and Running Inference, Transfer Learning, BERT as a Feature Extractor, Full-Model Training for BERT (see the sketch after this outline)
Session-6: Introduction to the Decoder Side of the Transformer, Masked Self-Attention, and Cross Multi-Head Attention
Session-7: End-to-End Encoder-Decoder Transformer for GenAI Tasks, Loading GenAI Models such as the GPT Series, Gemini, Llama, and Gemma for Direct Inference
Session-8: Introduction to RAG, Docstore and VectorDB, LlamaIndex and LangChain Frameworks
Session-9: Advanced RAG Systems: MergerRetriever, MultiVectorRetriever, Cross-Encoder-Based Re-Ranking
Session-10: Advanced Fine-Tuning Techniques for LLMs: Exploring LoRA, PEFT, and QLoRA for Llama & Gemma Models
Session-11: Multi-AI-Agent Systems with CrewAI & Google Gemini, LLM Evaluation Metrics: Faithfulness & Context Relevance using RAGAS
Session-12: Deploying LLMs as APIs: Integration with LangChain and FastAPI, Standalone vs. Cloud-Based Configurations
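To give a feel for the hands-on work in this track (e.g., Sessions 1 and 5), below is a minimal sketch of subword tokenization and BERT-based feature extraction. It assumes the Hugging Face Transformers library; the checkpoint name bert-base-uncased and the example sentence are illustrative choices, not official course material.

```python
# Minimal sketch (illustrative, not official course material): subword
# tokenization and using a pre-trained BERT encoder as a feature extractor.
# Assumes the Hugging Face Transformers library and the public
# "bert-base-uncased" checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# BERT uses WordPiece subword tokenization; GPT-style models use Byte Pair Encoding
inputs = tokenizer("Transformers power modern GenAI.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

# Mean-pool the encoder's final hidden states into a sentence embedding
with torch.no_grad():
    outputs = model(**inputs)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
print(sentence_embedding.shape)
```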
MViT Models for Computer Vision & GenAI: Course Outline
Session-1: Introduction to Vision Transformers (ViT), Overview of Transformers in NLP vs. Computer Vision
Session-2: Vision Transformer Architecture, Image tokenization: Patching and embedding, Multi-Head Self-Attention in Vision
Session-3: Transfer Learning in ViT Models, Fine-Tuning a Pre-Trained ViT Model Using PyTorch (see the sketch after this outline)
Session-4: Object Detection Using Vision Transformers (e.g., DETR), Running a DETR-Style Model on the COCO Dataset
Session-5: Hybrid Models (CNN + Vision Transformers), Implementing a Classification Task Using Swin Transformer and ConvNeXt Models
Session-6: Exploration of DINO and MAE for self-supervised learning. Experimenting with a self-supervised pre-trained model for an image recognition task.
Session-7: Introduction to Multi-Modal Vision-Language Models, How They Integrate Vision and Language for Tasks such as Image Captioning and Visual Question Answering
Session-8: CLIP (Contrastive Language–Image Pre-training): Overview and Applications. DALL-E: Generating Images from Text Prompts Using Transformers. Flamingo: Vision-Language Models for Few-Shot Learning.
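As a taste of the hands-on sessions in this track (e.g., Sessions 1-3), here is a minimal sketch of image classification with a pre-trained Vision Transformer. It assumes the Hugging Face Transformers library; the checkpoint google/vit-base-patch16-224 and the file name sample.jpg are illustrative placeholders, not official course material.

```python
# Minimal sketch (illustrative, not official course material): classifying an
# image with a pre-trained Vision Transformer via Hugging Face Transformers.
# "sample.jpg" is a hypothetical placeholder for any local RGB image.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("sample.jpg").convert("RGB")
# The processor resizes and normalizes the image; the model then splits it
# into 16x16 patches and embeds them as tokens for multi-head self-attention
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])  # ImageNet-1k class label
```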
Provisional Enrollment for April 2025