{"id":7290,"date":"2022-09-16T08:49:00","date_gmt":"2022-09-16T08:49:00","guid":{"rendered":"https:\/\/panon.rastek.id\/?p=7290"},"modified":"2022-09-16T08:58:04","modified_gmt":"2022-09-16T08:58:04","slug":"scaling-language-image-learning-in-100-languages","status":"publish","type":"post","link":"https:\/\/panon.rastek.id\/id\/scaling-language-image-learning-in-100-languages\/","title":{"rendered":"Scaling Language-Image Learning in 100+ Languages"},"content":{"rendered":"<p>[vc_row][vc_column]<header class=\"kd-section-title col-lg-12 text-left\" ><h3 class=\"separator_off\" >Scaling Language-Image Learning in 100+ Languages<\/h3><\/header>[vc_column_text]Advanced language models (e.g.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2005.14165\">GPT<\/a>,\u00a0<a href=\"https:\/\/ai.googleblog.com\/2021\/12\/more-efficient-in-context-learning-with.html\">GLaM<\/a>,\u00a0<a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">PaLM<\/a>\u00a0and\u00a0<a href=\"https:\/\/ai.googleblog.com\/2020\/02\/exploring-transfer-learning-with-t5.html\">T5<\/a>) have demonstrated diverse capabilities and achieved impressive results across tasks and languages by scaling up their number of parameters. Vision-language (VL) models can benefit from similar scaling to address many tasks, such as\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_generation#Image_Captioning\">image captioning<\/a>, visual\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Question_answering\">question answering<\/a>\u00a0(VQA),\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Outline_of_object_recognition\">object recognition<\/a>, and in-context\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Optical_character_recognition\">optical-character-recognition<\/a>\u00a0(OCR). Increasing the success rates for these practical tasks is important for everyday interactions and applications. Furthermore, for a truly universal system, vision-language models should be able to operate in many languages, not just one.<\/p>\n<p>In \u201c<a href=\"https:\/\/arxiv.org\/abs\/2209.06794\">PaLI: A Jointly-Scaled Multilingual Language-Image Model<\/a>\u201d, we introduce a unified language-image model trained to perform many tasks and in over 100 languages. These tasks span vision, language, and multimodal image and language applications, such as\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1505.00468\">visual question answering<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_generation#Image_Captioning\">image captioning<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Object_detection\">object detection<\/a>,\u00a0<a href=\"https:\/\/paperswithcode.com\/task\/image-classification\">image classification<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Optical_character_recognition\">OCR<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Knowledge_representation_and_reasoning\">text reasoning<\/a>, and others. Furthermore, we use a collection of public images that includes automatically collected annotations in 109 languages, which we call the WebLI dataset. 
The PaLI model pre-trained on WebLI achieves state-of-the-art performance on challenging image and language benchmarks, such as <a href="https://arxiv.org/abs/1504.00325">COCO-Captions</a>, <a href="https://ai.google.com/research/ConceptualCaptions">CC3M</a>, <a href="https://nocaps.org/">nocaps</a>, <a href="https://textvqa.org/textcaps/">TextCaps</a>, <a href="https://visualqa.org/">VQAv2</a>, <a href="https://okvqa.allenai.org/">OK-VQA</a>, <a href="https://textvqa.org/">TextVQA</a> and others. It also outperforms prior models on multilingual visual captioning and visual question answering benchmarks.</p>
<p><b>Overview</b><br />
One goal of this project is to examine how language and vision models interact at scale, and specifically the scalability of language-image models. We explore both per-modality scaling and the resulting cross-modal interactions. We train our largest model to 17 billion (17B) parameters, with the visual component scaled up to 4B parameters and the language model to 13B.</p>
<p>The PaLI model architecture is simple, reusable, and scalable. It consists of a <a href="https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html">Transformer</a> encoder that processes the input text, and an auto-regressive Transformer decoder that generates the output text. To process images, the input to the Transformer encoder also includes "visual words" that represent an image processed by a <a href="https://arxiv.org/abs/2010.11929">Vision Transformer</a> (ViT). A key component of the PaLI model is reuse, in which we seed the model with weights from previously trained uni-modal vision and language models, such as <a href="https://arxiv.org/abs/2010.11934">mT5-XXL</a> and large <a href="https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html">ViTs</a>.
This reuse not only enables the transfer of capabilities from uni-modal training, but also saves computational cost.</p>
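<p>To make this data flow concrete, below is a minimal sketch of the interface described above: a ViT-style vision encoder turns an image into a sequence of "visual words" that are concatenated with the embedded text prompt and fed to a text encoder-decoder, which generates the output text. It is an illustration under assumptions, not the released model: all module names and sizes are invented, and both components are randomly initialized here, whereas PaLI seeds them from pretrained ViT and mT5 checkpoints.</p>

```python
# Minimal PyTorch sketch of a PaLI-style interface (illustrative assumptions only).
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a pretrained ViT: maps an image to a sequence of patch embeddings."""

    def __init__(self, patch_size=16, dim=512):
        super().__init__()
        # Patchify and embed in one strided convolution.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                       # images: (B, 3, H, W)
        x = self.proj(images)                        # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)          # (B, num_patches, dim) = "visual words"


class ToyPaLI(nn.Module):
    """Stand-in for the text encoder-decoder (mT5-like), consuming visual words + text."""

    def __init__(self, vocab_size=32_000, dim=512):
        super().__init__()
        self.vision = ToyVisionEncoder(dim=dim)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.backbone = nn.Transformer(d_model=dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images, input_ids, decoder_ids):
        visual_words = self.vision(images)                        # image tokens
        text_tokens = self.token_embed(input_ids)                 # tokenized text prompt
        encoder_input = torch.cat([visual_words, text_tokens], dim=1)
        hidden = self.backbone(encoder_input, self.token_embed(decoder_ids))
        return self.lm_head(hidden)                               # next-token logits


model = ToyPaLI()
logits = model(
    images=torch.randn(2, 3, 224, 224),
    input_ids=torch.randint(0, 32_000, (2, 16)),    # e.g. a multilingual task prompt
    decoder_ids=torch.randint(0, 32_000, (2, 8)),   # e.g. the target caption or answer
)
print(logits.shape)  # torch.Size([2, 8, 32000])
```

<p>Because every task is expressed as text-in, text-out generation, the same interface can serve captioning, VQA, and OCR-style tasks in any supported language by changing only the text prompt and target, and reusing pretrained uni-modal checkpoints keeps this joint scaling computationally affordable.</p>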