Models, Benchmarks, and Applications
Recent advances in multimodal foundation models—including large vision-language models and multimodal generative models—have significantly reshaped artificial intelligence by enabling unified perception, understanding, and generation across text, images, videos, audio, and other sensory modalities. Beyond perception and generation, a growing body of research indicates that the next frontier of intelligence lies in agentic AI: systems that can autonomously reason, plan, act, and adapt over time in interactive environments. This shift marks a fundamental transition from passive multimodal models to Agentic Multimodal Intelligence (AMI).
This workshop brings together researchers and practitioners across multimedia, vision, natural language processing, robotics, and human-computer interaction to advance agentic intelligence—fostering interdisciplinary discussion on models, data, benchmarks, and evaluation for next-generation intelligent agents. We particularly encourage work that bridges multimodal learning, reinforcement learning, embodied AI, and human-in-the-loop systems toward building scalable, robust, and trustworthy agentic systems.

Adelaide University, Australia
qi.chen04@adelaide.edu.au
Postdoctoral research fellow at AIML, Adelaide University, working with Prof. Anton van den Hengel and A/Prof. Qi Wu. Research focuses on multimodal generative AI and large multimodal models. 30+ publications in IEEE TPAMI, CVPR, NeurIPS, and ICCV.

Adelaide University, Australia
junyan.wang@adelaide.edu.au
Postdoctoral research fellow at AIML, Adelaide University. Ph.D. from UNSW (2024). Research spans computer vision, video understanding, and generative AI. Publications include CVPR, ICLR, and NeurIPS.

Macquarie University, Australia
qykshr@gmail.com
Lecturer at Macquarie University and Adjunct Lecturer at AIML, Adelaide University. Ph.D. from Harbin Institute of Technology (2018). Research spans vision-language navigation, video captioning, image generation, and music generation.

USTC, China
dengjj@ustc.edu.cn
Professor at USTC. Research interests in multimodal understanding, spatial intelligence, and embodied AI. 50+ papers in IEEE TPAMI, NeurIPS, CVPR, ICCV, and ECCV. Area Chair at ACM Multimedia 2024 & 2025.

Fudan University, China
lihao_lh@fudan.edu.cn
Professor at Fudan University. Ph.D. from the University of Chinese Academy of Sciences. 100+ peer-reviewed articles and 20+ patents. Research interests include foundation models and video generation.

SCUT, China
mingkuitan@scut.edu.cn
Professor at South China University of Technology. Ph.D. from Nanyang Technological University (2014). Research in machine learning, sparse analysis, deep learning, and large-scale optimization.

USTC, China
lihq@ustc.edu.cn
Professor at USTC. IEEE Fellow (2021). 200+ papers in top-tier journals and conferences. Research interests include image/video coding, computer vision, and reinforcement learning. Associate Editor of IEEE TMM.

USTC, China
xjchang@ustc.edu.cn
Professor at USTC and Visiting Professor at MBZUAI. ARC DECRA Fellow (2019–2021). Previously at CMU, Monash University, and RMIT. Research on machine learning for multimedia analysis and computer vision.

Adelaide University, Australia
qi.wu01@adelaide.edu.au
Associate Professor at Adelaide University. Ph.D. from the University of Bath (2015). Research interests in image captioning, visual question answering, and vision-to-language. Achieved state-of-the-art results in the MS COCO Captioning Challenge.

Adelaide University, Australia
anton.vandenhengel@adelaide.edu.au
Chief Scientist at AIML and Director of the Centre for Augmented Reasoning, Adelaide University. Fellow of the Australian Academy of Technology and Engineering. Former Director of Applied Science at Amazon.
We invite submissions of original, unpublished research on topics related to agentic multimodal intelligence. All accepted papers will be published in the ACM Digital Library as part of the ACM MM 2026 workshop proceedings.
Submissions are managed through OpenReview. Click the button below to submit your paper.
Submit via OpenReview
Submission deadline: 16 July 2026
Speaker lineup to be announced. Stay tuned for updates.
Affiliation TBD
Talk title TBD
Affiliation TBD
Talk title TBD
Affiliation TBD
Talk title TBD
Affiliation TBD
Talk title TBD
The workshop will be held in a hybrid format, accommodating both on-site and online participation.
| Topic | Duration | Speaker | Affiliation |
|---|---|---|---|
| **Morning Session** | | | |
| Opening of the Workshop | 5 min | Workshop Chairs | — |
| Keynote Talk 1 | 30 min | TBD | TBD |
| Keynote Talk 2 | 30 min | TBD | TBD |
| Coffee Break | 10 min | — | — |
| Round Table Discussion | 30 min | Workshop Host | — |
| Keynote Talk 3 | 30 min | TBD | TBD |
| Keynote Talk 4 | 30 min | TBD | TBD |
| **Afternoon Session** | | | |
| Paper Presentations (Accepted Papers) | 20 min each | TBD | — |