Playing I-Spy with a Raspberry Pi and Google Cloud Vision

I recently took part in an electronics hackathon, and as usual I cared less about making something useful and on-theme, and more about making a fun toy. My team’s project was essentially getting a computer to play I-spy.

The core technology in this project was Google’s Cloud Vision API - a web interface for a powerful image analysis tool. One of its features is image annotation: you send it an image and it returns a set of labels, each with a confidence score, describing what it thinks is in the image.

The concept for the game was simple enough:

  1. Use a Raspberry Pi and Camera to take an image of something in the surroundings
  2. Get labels for that image from the Google Cloud Vision API
  3. Send the labels up to a cloud server where they can be distributed to players
  4. Using a smartphone app, players will receive hints based on these labels
  5. Based on these hints, players will race to take a photo of what they think the Raspberry Pi saw
  6. The players’ photos get uploaded to the Cloud Vision API and, if the annotations on their photo match the ones from the Raspberry Pi, they score a point on a leaderboard (a rough sketch of this matching step follows the list).
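
The matching in step 6 lived in my teammates’ cloud service, but conceptually it boils down to comparing two sets of labels. A purely hypothetical sketch of that rule - the function and threshold are mine, not the project’s actual logic:

def labels_match(pi_labels, player_labels, min_score=0.6):
    """Hypothetical scoring rule: the player scores if their photo shares
    at least one reasonably confident label with the Raspberry Pi's photo."""
    pi_set = {lab["description"] for lab in pi_labels if lab["score"] >= min_score}
    player_set = {lab["description"] for lab in player_labels if lab["score"] >= min_score}
    return bool(pi_set & player_set)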

We called it Raspberry Pi-Spy. Pun names are the best names.


Servos and Raspberries

The rest of my team were cloud software and mobile app developers, and while I have worn many hats in the past, as soon as they mentioned ASP.NET I decided to put the web-development hat back in the drawer and instead focussed entirely on the Raspberry Pi and hardware side of things.

We needed to get the camera to automatically point in random directions and so I decided to fashion a pan-tilt bracket using cardboard, servomotors and a hot-melt glue gun.

Regular servomotors have a fixed range of motion, typically between 0 and 180°, that is controlled by pulse-width modulation (PWM). The Raspberry Pi only has one pin that supports PWM in hardware, and while this might have been a problem back in 2012, people have since written some excellent tools to get around it and deliver PWM control on the ordinary GPIO pins.

PWM control of a servomotor

For this project I used ServoBlaster, specifically the user-space implementation, along with two SG90 micro servos, as these could happily be run off the RPi’s 5V rail without the need for an additional power supply.

ServoBlaster took no time to set up. With the daemon running you get a new virtual device, /dev/servoblaster, that you write commands to. I wrapped it in Python like so:

import os

class Servo:
    def __init__(self, servo_id, min_width, max_width):
        # Pulse widths (in microseconds) corresponding to 0° and 180°
        self.servo_id = servo_id
        self.min_width = min_width
        self.max_width = max_width
        self.ratio = (max_width - min_width) / 180.0

    def set_angle(self, angle):
        # Map the requested angle onto the servo's pulse-width range
        angle_in_us = int(self.min_width + self.ratio * angle)

        if angle_in_us > self.max_width or angle_in_us < self.min_width:
            print("Requested angle exceeds operating bounds!")
            return

        # Makes a call like: echo 0=1500us > /dev/servoblaster
        command = "echo {s_id}={a_us}us > /dev/servoblaster".format(
                    s_id=self.servo_id, a_us=angle_in_us)
        os.system(command)
        

It irked me to have to use an os.system call like that, but I couldn’t get it to work otherwise. I feel like you should be able to use os.write, but it didn’t work when I tried it, so due to time constraints I stayed with the method above. If anyone reading this has managed it, please email me.
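
My best guess is that a plain Python file write should do the same job, provided the command ends in a newline and the write gets flushed; an untested sketch of that approach:

# Untested sketch: write the ServoBlaster command straight to the device
# file instead of shelling out to echo. The trailing newline and explicit
# flush are my guess at what the echo-based command provided implicitly.
def set_pulse_width(servo_id, pulse_width_us):
    with open("/dev/servoblaster", "w") as dev:
        dev.write("{0}={1}us\n".format(servo_id, pulse_width_us))
        dev.flush()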

It wasn’t long before I had twitching servo motors.

It’s alive!

The design for the pan-tilt bracket was essentially just an L-bracket with two servomotors attached - the camera would be attached to the back of one of the servos and the whole thing would be held together with hot-melt glue. Many other people at the hackathon were making their projects out of acrylic, metal and pre-prepared 3D-printed parts. Instead I opted for an aesthetic that looked like it had been assembled by an earnest 10-year-old, and it worked like a charm.

Needed to adjust the operating range here as it points down too much


Google Vision

Google actually provides a Python client library for its Cloud Vision service. However, after spending the best part of an hour trying to get it to work, it still always failed on import. At first it complained about missing Google libraries, which I was able to resolve with pip install, but I decided to call it quits on hitting an AttributeError in one of its dependencies. I suspect the client library expected a different version of that dependency than the one pip had installed, but I couldn’t find out which version was needed. It’s possible I was being dense and missed some critical step, but by the looks of things Google needs to sort out the dependency management of its Python client library, or at least provide more thorough install instructions.

Truth be told, a full-featured client library was unnecessary for my needs anyway. All I needed to do was fashion a simple JSON request and the result would come back as a JSON response. Stringing this together in Python was straightforward and is a good example of why Python is awesome.

import picamera
import io
import requests
import base64

# The Vision API annotate endpoint, with your API key appended
CLOUD_VISION_URL = "https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY"

image_stream = io.BytesIO()
camera = picamera.PiCamera()

# Capture a JPEG into memory and base64-encode it for the JSON payload
camera.capture(image_stream, 'jpeg')
encoded_image = base64.b64encode(image_stream.getvalue()).decode('utf-8')

msg = {"requests" : [{"image": { "content" : encoded_image },
                      "features" : [{"type" : "LABEL_DETECTION", "maxResults" : 5 }]
                      }]
      }

r = requests.post(CLOUD_VISION_URL, json=msg)
response = r.json()

The response would then be a bunch of JSON that looked something like this:

{
  "responses": [
    {
      "labelAnnotations": [
        {
          "mid": "/m/0n68_",
          "description": "structure",
          "score": 0.7834804
        },
        {
          "mid": "/m/07c1v",
          "description": "technology",
          "score": 0.7664998
        },
        {
          "mid": "/m/097wsh",
          "description": "exercise machine",
          "score": 0.6235316
        },
        {
          "mid": "/m/0k4j",
          "description": "car",
          "score": 0.55102164
        },
        {
          "mid": "/m/04szw",
          "description": "musical instrument",
          "score": 0.50483143
        }
      ]
    }
  ]
}

This response was then repackaged and sent up to the cloud. However, we found that sending the raw result for a whole camera image was not sufficient for the game of I-Spy we wanted to create. The labels in this example are both disparate and wrong. I don’t have the photo that generated this response anymore, but I can assure you there were no exercise machines, cars or musical instruments in the room at the time!
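
The repackaging itself was just a matter of pulling the descriptions and scores out of the response; something along these lines (the helper name is mine, not from the project):

def extract_labels(response):
    """Pull (description, score) pairs out of a Vision API label response."""
    annotations = response["responses"][0].get("labelAnnotations", [])
    return [(a["description"], a["score"]) for a in annotations]

# e.g. [('structure', 0.78), ('technology', 0.77), ...]
labels = extract_labels(response)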


Making it smarter

The main problem is that photos taken by a randomly positioned camera will rarely have an obvious subject, and will be of a larger scene rather than any one specific object. Also, the images of small objects used to train the neural network behind Google’s service were likely clear, well-cropped shots of the subject rather than wide shots. To get better, more accurate labelling we needed to crop the photographs around specific objects.

{
  "labelAnnotations": [
    {
      "mid": "/m/095ki",
      "description": "wire",
      "score": 0.6643218
    },
    {
      "mid": "/m/0l5n_",
      "description": "font",
      "score": 0.5532821
    },
    {
      "mid": "/m/0p45k",
      "description": "electronics",
      "score": 0.5198623
    }
  ]
}
{
  "labelAnnotations": [
    {
      "mid": "/m/07c1v",
      "description": "technology",
      "score": 0.86034685
    },
    {
      "mid": "/m/083kv",
      "description": "wire",
      "score": 0.8227609
    },
    {
      "mid": "/m/0h8mjg9",
      "description": "animal trap",
      "score": 0.773611
    },
    {
      "mid": "/m/0kwn5",
      "description": "breadboard",
      "score": 0.74207425
    },
    {
      "mid": "/m/02ndjp",
      "description": "mousetrap",
      "score": 0.7187619
    },
    {
      "mid": "/m/0h8mzmc",
      "description": "circuit prototyping",
      "score": 0.6879506
    }
  ]
}

I too often mistake breadboard circuits for mousetraps


What I’d stumbled into was the Image Segmentation problem, which is still very much an open problem in Computer Vision. Recent research using Google’s Tango project has produced some very impressive results in real-time image segmentation; however, Tango has a stereoscopic camera, so those sorts of techniques were a bit of a non-starter with our single Pi camera. Very cool though.


Based on some cursory research I figured that a graph-segmentation or region-growing method might produce good enough segmentation to crop the random photographs around probable objects and get better labels. However, by this point it was around 3am and it seemed unlikely I’d be able to grok the maths of either method in time, let alone implement and tune them before the hackathon finished.

Other approaches to Image Segmentation use Convolutional Neural Networks, but again, due to the time constraints we didn’t have time to prepare and train a network. We did, however, find another feature of the Google Cloud Vision service: Crop Hints.

In theory you’re supposed to be able to send an image to the Cloud Vision API, much as you do when requesting labels, and get back the coordinates of a bounding box containing what Google reckons is the most important thing in the image. I figured we could request crop hints for the raw photograph, crop it based on the returned coordinates, and then request labels for the cropped image. The crop hints response looks something like this:

{
  "cropHintsAnnotation": {
    "cropHints": [
      {
        "boundingPoly": {
          "vertices": [
            { "y": 336 },
            { "x": 1100, "y": 336 },
            { "x": 1100, "y": 967 },
            { "y": 967 }
          ]
        },
        "confidence": 0.79999995,
        "importanceFraction": 0.69
      }
    ]
  }
}
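
For reference, the intended flow would have looked roughly like the sketch below, assuming Pillow is available for the cropping step; the function is mine, not code that made it into the project:

from PIL import Image
import base64
import io
import requests

def crop_to_hint(image_path, vision_url):
    """Ask the Vision API for a crop hint, then crop the image to it (sketch)."""
    with open(image_path, "rb") as f:
        raw = f.read()

    msg = {"requests": [{"image": {"content": base64.b64encode(raw).decode("utf-8")},
                         "features": [{"type": "CROP_HINTS"}]}]}
    response = requests.post(vision_url, json=msg).json()

    hint = response["responses"][0]["cropHintsAnnotation"]["cropHints"][0]
    vertices = hint["boundingPoly"]["vertices"]
    # A missing x or y key means 0, as in the example response above
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]

    return Image.open(io.BytesIO(raw)).crop((min(xs), min(ys), max(xs), max(ys)))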

Sadly this didn’t work either: whenever we sent a request for crop hints, the response was always a bounding box the size of the original image. We found this even with photos where the subject was obvious. Maybe it only works on cats?


Last ditch efforts

With only a couple of hours remaining to get better results out of the image labelling, I tried a few different things, based mostly on intuition, to improve the performance of the Raspberry Pi-Spy:

  • Randomly crop the image. Given that we intended the Raspberry Pi camera to be placed in an open area, I figured randomly cropping the image would at least give some chance of focussing on an object in the image. This way we’d be more likely to get labels for an object in the vicinity, rather than labels for the scene as a whole.

  • Reject images lacking a label with a score of more than 0.85. If no label scored high enough, point the camera somewhere else and take another picture. Unfortunately this sometimes rejected correct labels, but these things are a trade-off.

  • Filter out confident labels that are too abstract using Natural Language Processing. Sometimes the labels returned from the Cloud Vision API were adjectives or abstract terms such as “green” or “darkness”. Using the Python Natural Language Toolkit I was able to tokenise the labels and filter out non-nouns (a rough sketch follows this list). I’d also wanted to filter out abstract nouns and only keep labels for concrete things, to avoid labels such as “technology”, but I couldn’t figure out how to do this before the time was up. (If you have a good idea on how to achieve this, please email me!)
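
The noun filter amounted to little more than part-of-speech tagging each label and keeping the ones headed by a noun. A minimal sketch of that idea, assuming the NLTK tokeniser and tagger data have already been downloaded:

import nltk

# One-off downloads, assumed done beforehand:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def noun_labels(labels):
    """Keep labels whose final word is tagged as a noun (NN*)."""
    kept = []
    for label in labels:
        tags = nltk.pos_tag(nltk.word_tokenize(label))
        _, last_tag = tags[-1]
        if last_tag.startswith("NN"):
            kept.append(label)
    return kept

# "green" gets dropped as an adjective, but "darkness" survives - which is
# exactly the abstract-noun problem described above.
print(noun_labels(["breadboard", "green", "musical instrument", "darkness"]))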

Final thoughts

While I’d spent my time hacking away in Python and playing with hot glue and cardboard, the rest of my team had successfully assembled both an Android app tied to a cloud service and a public scoreboard, making for a very playable game that demoed well at the end.

Hackathons force you to come up with solutions quickly and also give you permission to suck. What we produced was far from polished, but I always feel that the experience and lessons learned from these sorts of events are far more valuable than what you actually produce in the end. Also, despite the issues I had with Google Cloud Vision, it’s still a rather impressive tool and I’ll be sure to keep it in mind for future projects.