The Perceiver, unveiled recently in the paper Perceiver: General Perception with Iterative Attention, continues the trend toward universality that has been building for several years: less and less of what goes into an AI program is specific to a single task.
In the past, most natural-language programs were built with a specific language function in mind, such as answering questions or translating between languages. Google's Transformer erased those distinctions, producing a single program that could handle a variety of tasks by building a sufficiently capable representation of language.
Likewise, Perceiver challenges the notion that different types of data, such as sound or image, require different neural network architectures.
Perceiver channels this multitasking approach. It accepts three types of input: images, video, and so-called point clouds, collections of points that describe what a LiDAR sensor on top of a car "sees" of the road.
Once trained, the system achieves meaningful results on benchmark tests, including the classic ImageNet image-recognition test, AudioSet, and ModelNet.
Perceiver achieves the task with two tricks, or perhaps with a trick and a cheat.
The first trick is to reduce the amount of data on which the Transformer's attention must operate directly. Large Transformer networks have been fed gigabytes upon gigabytes of text, but the number of elements in an image, video, audio file, or point cloud is far larger still.
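The mechanism behind this trick is a small learned "latent" array that cross-attends to the large input array, so the expensive attention step scales with the product of latent size and input size rather than with the square of the input size. Below is a minimal NumPy sketch of that idea; the array sizes, variable names, and random projection weights are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of the Perceiver's latent bottleneck: a small latent array (M items)
# cross-attends to a large input "byte" array (N items), costing O(M * N)
# instead of the O(N * N) of full self-attention over the inputs.
# All sizes and weights here are illustrative, not from the released model.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, inputs, rng):
    """latents: (M, D) small learned array; inputs: (N, C) large byte array."""
    M, D = latents.shape
    N, C = inputs.shape
    # Random projections stand in for learned query/key/value weights.
    Wq = rng.standard_normal((D, D)) / np.sqrt(D)
    Wk = rng.standard_normal((C, D)) / np.sqrt(C)
    Wv = rng.standard_normal((C, D)) / np.sqrt(C)
    q = latents @ Wq                 # (M, D)
    k = inputs @ Wk                  # (N, D)
    v = inputs @ Wv                  # (N, D)
    scores = q @ k.T / np.sqrt(D)    # (M, N) -- linear in the input size N
    return softmax(scores) @ v       # (M, D): inputs distilled into M latents

rng = np.random.default_rng(0)
latents = rng.standard_normal((64, 32))    # M=64 latents, far fewer than inputs
inputs = rng.standard_normal((50176, 3))   # e.g. a 224x224 image as 50176 pixels
out = cross_attention(latents, inputs, rng)
print(out.shape)  # (64, 32) -- later self-attention runs on 64 items, not 50176
```

After this distillation step, the heavy self-attention layers only ever see the 64 latent vectors, which is what makes feeding in raw pixels or audio samples tractable.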
The second trick is to give the model some clues about the structure of the data.
The problem is that a Transformer on its own knows nothing about the spatial arrangement of an image or the temporal order of an audio clip; it is inherently insensitive to the structure of each type of data.
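One common way to supply such structural clues, and the one the Perceiver paper uses, is to tag each input element with features that encode its position, for example Fourier features of the coordinate. The sketch below shows the general idea for a 1-D signal; the band count and maximum frequency are arbitrary choices for illustration, not the paper's settings.

```python
# Sketch of position features for a 1-D signal: sin/cos of the normalized
# position at several frequencies, plus the raw position itself. These get
# concatenated to each element's channels so otherwise order-insensitive
# attention can recover spatial/temporal structure.
import numpy as np

def fourier_position_features(n_positions, n_bands=4, max_freq=10.0):
    pos = np.linspace(-1.0, 1.0, n_positions)          # normalized positions
    freqs = np.linspace(1.0, max_freq / 2.0, n_bands)  # frequency bands
    angles = np.pi * pos[:, None] * freqs[None, :]     # (n_positions, n_bands)
    # Concatenate sin bands, cos bands, and the raw normalized position.
    return np.concatenate([np.sin(angles), np.cos(angles), pos[:, None]], axis=1)

feats = fourier_position_features(8)
print(feats.shape)  # (8, 9): 4 sin + 4 cos + 1 raw position per element
```

For an image the same recipe is applied per axis (x and y), so every pixel carries a small vector describing where it sits in the frame.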
There are several issues with Perceiver that may keep it from being the do-everything supermodel Google portrays it as.
One is that the program does not always perform as well as programs that are tailored to a particular modality.
For example, on Audio Set, the Perceiver fell behind a program introduced last year by Haytham M. Fayek and Anurag Kumar of Facebook that merges information about audio and video.
All we know is that Perceiver can learn different kinds of representations.
The paper's authors present a series of attention maps, visualizations showing what the Perceiver focuses on in each batch of training data, which indicate that the Perceiver adapts where it spends its computation.
For more information, read the original story on ZDNet.