One of my hobbies is playing with pre-trained models to see how well they perform. It has almost always been clear to me that pre-trained models, like DeepLab or MaskRCNN, are not particularly good at detecting abstract representations of an object. They are, to some extent, good at detecting (full) silhouettes, but they perform poorly when asked to identify an object from an illustration, such as vector art.
Using MaskRCNN to detect cows from silhouettes.
Trained models have some other severe weaknesses too, for example lighting and viewing angle! Believe it or not, you can turn a cow into a horse just by changing the light color. Or, even funnier, rotate a cow and it will become an airplane!
Effects of rotation and lighting on the classifier (MaskRCNN).
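If you want to reproduce this kind of failure yourself, the perturbations are easy to script. Below is a minimal sketch of a rotation-plus-lighting perturbation using PIL and NumPy; the tint values and the commented-out torchvision detector call are my own illustrative assumptions, not the exact setup used for the figures above.

```python
import numpy as np
from PIL import Image

def perturb(img: Image.Image, angle: float = 90.0,
            tint: tuple = (1.0, 0.8, 0.6)) -> Image.Image:
    """Rotate the image and scale its RGB channels to mimic a
    colored light source; return the perturbed copy."""
    rotated = img.rotate(angle, expand=True)
    arr = np.asarray(rotated).astype(np.float32)
    arr *= np.array(tint, dtype=np.float32)  # per-channel lighting tint
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Hypothetical usage with a pre-trained detector (weights download not shown):
#   import torchvision
#   from torchvision.transforms.functional import to_tensor
#   model = torchvision.models.detection.maskrcnn_resnet50_fpn(
#       weights="DEFAULT").eval()
#   preds = model([to_tensor(perturb(cow_image, angle=90.0))])
```

Feeding the original and the perturbed image to the same detector and comparing the predicted labels is enough to see the cow-to-airplane flips described above.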
The immediate explanation that comes to mind is the training set distribution. Or maybe CNNs have some severe limitations in forming an abstract representation of an object in their layers, and therefore in generalizing well, just from looking at images!
These thoughts were in my head until I decided to conduct some experiments on how good humans are at this kind of task! After choosing some images and asking random people to guess them, I found that humans are good, but not that good! Almost always, some angles make the object very hard to recognize. After these toy experiments, I decided to make a silly video game about detecting objects just from their silhouettes. I grabbed a few 3D models from Google Poly Toolkit and created a straightforward and boring game called “Ee Chie?!” (which loosely translates to “wut is dis!”). It’s currently only in Persian, but I plan to add other languages in the future. You can download it from Google Play.
Here are some videos, just in case you are curious how MaskRCNN performs on some objects from the game.