Speaking Skills Talk - Victor Akinwande

— 2:00pm

Location:
In Person - Newell-Simon 4305

Speaker:
VICTOR AKINWANDE, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://home.victorakinwande.com/

Self-supervised vision-language models trained with contrastive objectives perform better as their scale increases. Typically, the image encoder in such models is larger than the text encoder. At inference time, the cost of the text encoder can often be amortized by precomputing a fixed set of text embeddings, but the image encoder must still run on every input. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments.

In this talk, I will present HyperCLIP, a vision-language architecture that dynamically adapts a small image encoder using a hypernetwork. The hypernetwork learns to produce a subset of the image encoder's parameters conditioned on the text embedding, and the entire model (hypernetwork, image encoder, and text encoder) is trained jointly end-to-end. HyperCLIP increases the zero-shot accuracy of SigLIP models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100, with minimal training throughput overhead.
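
The idea of a hypernetwork producing part of an image encoder's parameters can be illustrated with a toy sketch. This is not the HyperCLIP implementation: the class name TinyHyperEncoder, the layer sizes, the choice to generate only a per-feature scale, and the use of a single text-embedding vector as the conditioning input are all assumptions made for illustration, and training is not shown.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class TinyHyperEncoder:
    """Toy image encoder whose hidden-feature scales come from a hypernetwork.

    Hypothetical illustration only: names and shapes are not from the talk.
    """

    def __init__(self, image_dim, hidden_dim, embed_dim, text_dim, seed=0):
        rng = random.Random(seed)
        # Static image-encoder weights, shared across all tasks.
        self.w_in = random_matrix(hidden_dim, image_dim, rng)
        self.w_out = random_matrix(embed_dim, hidden_dim, rng)
        # Hypernetwork weights: map a text embedding to encoder parameters.
        self.w_hyper = random_matrix(hidden_dim, text_dim, rng)

    def generate_scale(self, text_embedding):
        # The generated "subset of parameters": one scale per hidden feature.
        # A zero text embedding yields a scale of exactly 1.0 (no adaptation).
        return [1.0 + math.tanh(s) for s in matvec(self.w_hyper, text_embedding)]

    def encode(self, image, text_embedding):
        hidden = [max(0.0, h) for h in matvec(self.w_in, image)]  # ReLU layer
        scale = self.generate_scale(text_embedding)
        adapted = [h * s for h, s in zip(hidden, scale)]
        return matvec(self.w_out, adapted)
```

In this sketch the generated scales depend only on the text embedding, so they could be computed once per task and reused for every image, while the small image encoder itself stays fixed in size.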

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

Event Website:
https://csd.cmu.edu/calendar/speaking-skills-talk-victor-akinwande

