MIT and IBM released ChartNet, a 1.7-million-sample synthetic training dataset that lets compact open-source vision-language ...
Foundation models have made great advances in robotics, enabling the creation of vision-language-action (VLA) models that generalize to objects, scenes, and tasks beyond their training data. However, ...
Crucially, these tests are generated by custom code and don’t rely on pre-existing images or tests that could be found on the public Internet, thereby “minimiz[ing] the chance that VLMs can solve by ...
Figure AI has unveiled HELIX, a pioneering Vision-Language-Action (VLA) model that integrates vision, language comprehension, and action execution into a single neural network. This innovation allows ...
Microsoft announced a new version of its small language model, Phi-3, which can look at images and tell you what’s in them. Phi-3-vision is a multimodal model — aka it can read both text and images — ...
If you would like the ability to run AI vision applications on your home computer you might be interested in a new language model called Moondream. Capable of processing what you say, what you write, ...
Cohere For AI, AI startup Cohere’s nonprofit research lab, this week released a multimodal “open” AI model, Aya Vision, the lab claimed is best-in-class. Aya Vision can perform tasks like writing ...
Microsoft on Tuesday released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that the company says matches or exceeds the performance of systems many times its size — while ...