Every image is searchable with Inception and a crawler on Google Cloud for $0

While attempting the Bosch Kaggle contest, my curiosity about reverse image search was piqued. I had built a prototype face-detection web app a year earlier, and deep learning was beckoning again: with FPGAs, elastic GPU hardware and neural processors arriving, the field is getting hotter by the day, so it was time to get my hands dirty. Deep learning visual frameworks have mushroomed, eclipsing well-established libraries like OpenCV, which still powers niche use cases. Google’s TensorFlow is enjoying its own limelight, and I was curious how a reverse image search engine might work on top of it. While googling, I stumbled on the ViSenze and TinEye web services, which fill different needs: the former is an e-commerce reverse search engine, while the latter tells you where a given image appears across the entire internet. Both can run a search and display results in one or two seconds, excluding the time to upload an image or extract one from a URL. That is pretty impressive given TinEye has indexed more than a billion images.

How do we make our own ViSenze, TinEye, IQnect or GoFind? Searching GitHub, I found a TensorFlow-based reverse image search project (credits to this GitHub project and Akshayu Bhat) and realized it was a great starting point, but a real use case would make it even more compelling. A commercial website selling apparel seemed like a good candidate for real-world images to index and for testing the capability of this reverse search. The experiment had a unique twist: having been a Windows aficionado using Microsoft development tools all along, TensorFlow forced me to switch to a Linux environment, as it was only available on Linux or Mac. (As of 29 Nov 2016, TensorFlow finally added Windows support, but too late for this project.) Being a Windows developer, I naturally gravitated to an Ubuntu desktop on Oracle VirtualBox; I had previously played around with the Ubuntu desktop, albeit only to get the hang of the GUI and some of the Linux tools, and had never done serious development on it.

Now, let’s get practical: set up the dev environment (I’m new to Linux and want to learn), spruce up the code from the fork, add a crawler, and wire in a commercially available API that checks whether an uploaded image (for reverse visual search) is safe and appropriate and detects its content, while the search itself returns the nearest 12 items for a queried image. My claim of $0 rests on the Google Cloud free trial. Before you jump in to test drive, you may want to look at such an implementation running on Google Compute Engine at http://visualsearch.avantprise.com/. Exact search takes 3 to 4 seconds on 70K images, whereas approximate search is a bit faster.

Set up and run the Visual Search Server

Get the latest Oracle VirtualBox here and install it on your Windows machine (mine is Windows 10 build 1439). Then download the Ubuntu 14.04.5 Trusty Tahr desktop image from osboxes.org to get the OS up and running. Using osboxes.org as the user id and the same as the password, you get the desktop up and running, or just install it from scratch in VirtualBox, which is what I did (and suggest), giving the VM an 80 GB disk so it has ample space to grow dynamically up to that limit. Make sure you install Guest Additions in the Ubuntu desktop VM; this is useful for transferring files between Ubuntu and the host OS and also lets the display resize flexibly. Do note that, because I switch between office ethernet and home wireless, I had to change Adapter 1 to the wireless network adapter to get it working at home.


The Ubuntu desktop comes with Python 2.7.6, so you don’t need Anaconda or other Python distributions, and I’m not setting up dedicated Python environments that would make this experiment long-winded. As for the development environment: I’m used to Visual Studio for C# and Python and WebStorm for Node.js, so I wanted to stick with familiar tools, with one difference: this time I went with Visual Studio Code, a great open source editor with fantastic extensions that works like a charm. Log into your Ubuntu desktop, launch a terminal and type python --version to check that the version is 2.7.6. Don’t forget to set the shared clipboard to bidirectional for this VM instance in VirtualBox. Install git, pip, Fabric and the OpenSSH server as follows:

sudo apt-get install git
sudo apt-get install python-pip
sudo pip install fabric
sudo apt-get install openssh-server

Ensure you have an RSA key created to connect to GCE and, if required, to the local dev environment (also try ssh localhost) using the following commands:

ssh-keygen -t rsa (Press enter for each line)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys
ssh localhost 

Type exit and check that the logout message is displayed. Now you’re all set; go ahead and clone the repo. In the terminal (under your home directory) type

sudo git clone https://github.com/dataspring/TensorFlowSearch.git
cd ~/TensorFlowSearch
vi settings.py (to change username, etc. & save)
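For reference, a hedged sketch of the entries you will touch in settings.py for local runs; the variable names come from the repo, but the values here are only placeholders:

# settings.py (excerpt) -- placeholder values, adjust to your own VM
LOCALUSER = 'osboxes'       # the user you created while installing the Ubuntu desktop VM
LOCALHOST = '127.0.0.1'     # or the VM's specific IP if localhost does not work for you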

Open up settings.py and change LOCALUSER to the user you set while creating the Ubuntu desktop VM, and optionally change LOCALHOST to a specific IP address if the default doesn’t work. With the code base on the desktop, we need to set up the local development environment so that we can run, debug and change code. Fabric lets you run sudo commands either locally or remotely, plus tons of other features. With Python Fabric in place, call the setup task from the terminal:

sudo fab --list  (lists all methods)
sudo fab localdevsetup

With a couple of ENTER and Y key presses, this installs all the prerequisites for the Python development environment: TensorFlow, Fabric, SQLite3, Visual Studio Code and SQLite Browser (sqlitebrowser). If all goes well, run the crawler to get a few images from an e-commerce site (carousell.com), index them and start the visual search server as follows:

sudo fab ShopSiteImages
sudo fab index
sudo fab server
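For the curious, the fab targets used above are ordinary Fabric 1.x tasks defined in the repo's fabfile; the sketch below is not the repo's actual code, just a minimal illustration (hypothetical bodies and package lists) of how such tasks run shell commands locally or remotely:

# fabfile sketch -- illustrative only, not the repo's actual tasks
from fabric.api import local, sudo, env, task

env.hosts = ['me@my-gce-ip']          # hypothetical remote host, used by remote tasks

@task
def localdevsetup():
    # install prerequisites on the local machine
    local('sudo apt-get install -y python-dev sqlite3')
    local('sudo pip install flask scipy')

@task
def hostsetup():
    # same idea, but executed on the remote host over ssh
    sudo('apt-get update')
    sudo('apt-get install -y python-pip')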

Open a browser on your Ubuntu desktop and go to http://localhost/, which should bring up a screen like the one shown below; start searching.


Launch Visual Studio Code from the terminal with the command below; running it this way gives VS Code the admin rights it needs for debugging to work properly:

sudo code --user-data-dir="~/.vscode-root"  

Install the Python extension and you are all set to change and play around with the code with nice debugging support. Once launched, point VS Code to the git directory at ~/TensorFlowSearch to open and modify the code.

Detect content appropriateness and type: Clarifai to our rescue

I thought of including a safe-content check, which is vital for visual search since it involves user-uploaded or snapped images. Among the myriad video and image recognition services that offer detection of unsafe content, Clarifai is simple and has a free plan for testing and playing with its REST API. Navigate to their developer site, obtain an API key and you’re all set. In the search form, the uploaded image is sent from Angular to the Clarifai API to check whether it is safe and appropriate; you get back a probability score, which is displayed on the search screen. Another API call is made to detect the content type. The code snippet used in the controller.js file (hosted in the Python Flask web app) is linked below; you may want to get your own API key, as the current one is on the free tier and its quota may run out.

code link for controller.js under angular
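The live call is made from controller.js (linked above); purely for illustration, here is a rough Python sketch of the same kind of request, assuming Clarifai's v2 predict endpoint and a placeholder model ID (check their current docs for the exact endpoint, model IDs and response shape before relying on this):

# Rough sketch of a Clarifai moderation call -- endpoint and model ID are assumptions, verify against their docs
import base64
import requests

API_KEY = 'YOUR_CLARIFAI_API_KEY'     # get your own key from their developer site
MODEL_ID = 'YOUR_NSFW_MODEL_ID'       # placeholder; Clarifai publishes the moderation model ID

def safe_score(image_path):
    with open(image_path, 'rb') as f:
        b64 = base64.b64encode(f.read())
    resp = requests.post(
        'https://api.clarifai.com/v2/models/%s/outputs' % MODEL_ID,
        headers={'Authorization': 'Key ' + API_KEY},
        json={'inputs': [{'data': {'image': {'base64': b64}}}]})
    resp.raise_for_status()
    concepts = resp.json()['outputs'][0]['data']['concepts']
    # e.g. {'sfw': 0.97, 'nsfw': 0.03} -- the probability shown on the search screen
    return {c['name']: c['value'] for c in concepts}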


Design and Implement a Simple Crawler

Getting images for this simple crawler is what makes it fun and useful. For the experiment I selected carousell.com, which sells anything that can be snapped with a phone camera. It’s a great up-and-coming service that lets anyone sell; their tagline is ‘Snap to Sell, Chat to Buy for FREE on the carousell marketplace!’. It would be good if we could get images off their site, which is already meant for the public to browse and buy from, but how do we know what is offered and how to scrape metadata and images? I simply downloaded their Android app and looked at the underlying web traffic that feeds it, to decipher the contract, i.e. the API pattern that powers it. There are nifty ways to configure an Android phone so its internet traffic is proxied through Fiddler on a PC over WiFi; monitoring the traffic in Fiddler while using the app provides enough information to understand the API story behind it: how their wares are categorized, how the metadata is designed and how images are served. With this information you can quickly write a routine in Python to fetch images for our experiment and also define our own metadata to make the search worthwhile: upon performing a visual search, we not only present the nearest 12 items that resemble the given image, but also display additional metadata, such as how much an item costs and where it is available, and refer the user to the actual e-commerce site for purchase, facilitating the buying process.
The crawler hinges on the product categorization and page iteration scheme implemented in the API to fetch images and metadata, which are then persisted in a local SQLite3 database for searching. The idea is to retrieve each image once, extract TensorFlow model features, then discard the image but keep the metadata. This keeps the service from serving its own copies of the images; it points to the image URL at the commerce site instead, avoiding egress costs from the cloud provider. SQLite3 fits the bill as a simple data store, and it can be scaled up later depending on requirements we can’t anticipate now. The crawler is designed to restart wherever it stopped, with a manual reset of a few variables to facilitate re-crawling from where it left off. (A rough sketch of the crawl loop appears after the design notes below.)

Crawler Design

  1. Decide on the product collection number and pagination parameters that are part of the API (figured out from the API pattern)
  2. Iterate over each collection, setting the result count to return, and keep increasing the page count until the max-iteration count is reached
  3. Issue a Python requests.get, parse the returned JSON to get metadata and fill the ‘sellimages’ table of the SQLite3 db
  4. Retrieve the image from the URL
  5. If and when the whole process is rerun, any metadata and image already present are overwritten; crawler re-runs are idempotent as long as the API signature is unchanged (if it changes, the crawler may also fail)
  6. We assume the API’s metadata URLs serve only JPEG images, which holds in practice
  7. Uses simple Python modules: requests, json, sqlite3, urllib
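A minimal sketch of the loop described above; the endpoint, query parameters and column names are hypothetical stand-ins (the real API pattern was reverse-engineered via Fiddler), so treat this as the shape of the crawler rather than working code against Carousell:

# Crawler loop sketch (Python 2) -- API URL, parameters and table columns are hypothetical placeholders
import sqlite3
import urllib
import requests

API_URL = 'https://example-commerce-api/listings'     # stand-in for the real, reverse-engineered endpoint
DB_PATH = '/home/deep/shopsite/sqllite3/shopsite.db'  # assumed file name; see the repo's settings
IMG_DIR = '/home/deep/shopsite/images/'

def crawl(max_collection, max_iter, result_steps):
    db = sqlite3.connect(DB_PATH)
    for collection in range(1, max_collection + 1):
        for page in range(max_iter):
            r = requests.get(API_URL, params={'collection': collection,
                                              'count': result_steps, 'page': page})
            for item in r.json().get('products', []):
                # upsert metadata so re-runs stay idempotent
                db.execute('INSERT OR REPLACE INTO sellimages (id, title, price, image_url) '
                           'VALUES (?, ?, ?, ?)',
                           (item['id'], item['title'], item['price'], item['image']))
                # fetch the image once; features are extracted later and the file discarded
                urllib.urlretrieve(item['image'], IMG_DIR + str(item['id']) + '.jpg')
            db.commit()
    db.close()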


Indexer & Searcher

The gist of indexing is to use TensorFlow to load a pre-trained model, an InceptionV3 network trained on ImageNet, already available as a protobuf file (in our case network.pb). The file is parsed to import the graph definition, which is then used to extract ‘incept/pool_3:0’ features from each image. The indexer emits these feature chunks, concatenates them according to the configured batch size and stores them as index files. KNN search is performed using SciPy’s spatial distance functions.
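To make the flow concrete, here is a hedged sketch of the two steps, assuming the classic Inception protobuf imported under the 'incept' prefix (as the tensor name above suggests) and SciPy for the KNN part; file names and the feature-store format are simplifications of what the indexer actually writes:

# Feature extraction + KNN sketch -- simplified versions of what the index and search steps do
import numpy as np
import tensorflow as tf
from scipy.spatial import distance

# 1) Load the pre-trained InceptionV3 graph from the protobuf file
with tf.gfile.FastGFile('network.pb', 'rb') as graph_file:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(graph_file.read())
tf.import_graph_def(graph_def, name='incept')

def extract_features(jpeg_paths):
    """Return an (N, 2048) array of pool_3 features, one row per image."""
    feats = []
    with tf.Session() as sess:
        pool3 = sess.graph.get_tensor_by_name('incept/pool_3:0')
        for path in jpeg_paths:
            image_data = open(path, 'rb').read()
            feat = sess.run(pool3, {'incept/DecodeJpeg/contents:0': image_data})
            feats.append(np.squeeze(feat))
    return np.array(feats)

def knn(query_feat, index_feats, k=12, metric='cosine'):
    """Distance from the query to every indexed image; return the k nearest row indices."""
    d = distance.cdist(query_feat.reshape(1, -1), index_feats, metric=metric)[0]
    return np.argsort(d)[:k]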

In the next iteration of this article, I want to see which spatial distance metric performs best; many are available: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’. I am also looking at how to incorporate newly available pre-trained models (.pb files) to see which fares better for this KNN-search use case. One site that lists new pre-trained models is GradientZoo; you need to figure out how to generate a protobuf file from these model constant files, and a starting point is here.
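A quick way to get a feel for the speed side of that question is to time SciPy's cdist over the same feature matrix for each candidate metric; a small sketch (random stand-in data, and note this measures runtime only, not search quality):

# Time each distance metric on the same query -- a crude speed benchmark only
import time
import numpy as np
from scipy.spatial import distance

features = np.random.rand(10000, 2048).astype(np.float32)   # stand-in for the real index
query = np.random.rand(1, 2048).astype(np.float32)

for metric in ['cosine', 'euclidean', 'cityblock', 'correlation', 'chebyshev']:
    start = time.time()
    distance.cdist(query, features, metric=metric)
    print('%-12s %.3f s' % (metric, time.time() - start))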

Sqlite3 db

Essentially there are two tables keeping track of ingested data: ‘indeximages’ logs crawler runs, and ‘sellimages’ holds the metadata for each image crawled. You can view the database on the Ubuntu desktop: just launch sqlitebrowser and point it at the db file under /home/deep/shopsite/sqllite3/
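If you prefer the terminal to sqlitebrowser, a small sketch to peek at the same data; the db file name inside that folder is an assumption, so adjust the path to whatever the repo's settings point to:

# List tables and row counts in the crawler's SQLite3 database -- adjust the file name to your setup
import sqlite3

db = sqlite3.connect('/home/deep/shopsite/sqllite3/shopsite.db')   # assumed file name
cur = db.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
for (table,) in cur.fetchall():
    count = db.execute('SELECT COUNT(*) FROM %s' % table).fetchone()[0]
    print('%-15s %d rows' % (table, count))
db.close()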


Take it to Cloud Heights – Setup in GCE

I claim that indexing, crawling and search cost $0 thanks to the generous $300 credit for trying out Google Cloud. The only catch is that GPU hardware is not yet mainstream there, unlike AWS, though Google announced cloud GPUs a few weeks ago. Fire up Google Cloud and set up an account with your credit card; they state clearly that the card will not be charged when the trial period ends, so it is worth a try. Use the Quick Start Guide to Linux VM and follow the screenshots to create an Ubuntu 14.04 server with 4 vCPUs, an 80 GB SSD and 15 GB RAM.

  1. Create your project
  2. Create a VM instance
  3. Select a zone (near your location), machine type (4 vCPUs, 15 GB) and boot disk (80 GB SSD)
  4. Allow HTTP and HTTPS traffic
  5. Click on Networking and choose a static external IP (so the VM retains the same IP across restarts)
  6. Click on SSH Keys, navigate to ~/.ssh/, open the id_rsa.pub file we created earlier, copy its contents and paste them into SSH Keys
  7. You’ll end up with a VM created as follows

From your local Ubuntu desktop, launch a terminal and run

ssh username@externalip

The username is the one in the public key you copied into SSH Keys, and the external IP is the static IP you reserved. It should connect to the remote host; now log out.

Time to set up the GCE VM instance. Open the settings.py file on the local Ubuntu desktop, change HOST (to the static IP) and USER (the SSH Keys user) accordingly and save. Now fire up Fabric to do the setup on the remote host machine:

sudo fab live hostsetup

Once the setup is complete, ssh in and do a test run: crawl images, index them and start the web server. You can access the server by pointing your browser at http://<external ip>/. This confirms that everything works. Next up is a stress test: open settings.py on the remote machine again and change the following to larger values: RESULT_STEPS, MAX_ITER, MAX_COLLECTION, BATCH_SIZE.
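For orientation, the relevant entries in settings.py might look something like this; the variable names are the ones referenced above, but the values are placeholders to scale up from (not the repo's defaults), and the comments describe my reading of what each controls:

# settings.py (excerpt) -- placeholder values for a larger crawl/index run
HOST = '104.155.x.x'        # static external IP of the GCE VM
USER = 'deep'               # the user from the SSH key you pasted into GCE
RESULT_STEPS = 100          # results requested per API call (inferred)
MAX_ITER = 500              # pages to walk per collection (inferred)
MAX_COLLECTION = 30         # product collections to crawl (inferred)
BATCH_SIZE = 1000           # images per index chunk (inferred)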

Now that the process is going to be long-running, launch an ssh session and use the screen command, which lets processes run uninterrupted even when the ssh connection drops. For those from the Windows world used to the command window, there’s a very nice tutorial explaining screen.

sudo apt-get install screen
screen -R crawlrun
cd ~/TensorFlowSearch
sudo fab ShopSiteImages

Press Ctrl+A followed by Ctrl+D to detach from the screen session and log out. Once the crawling process is over, do the same for indexing, and then run another screen session for the web app so the search server stays available on the internet for everyone.

If the image count is very large, in the millions, the best way to count files in /home/deep/shopsite/images/ is not ls but rsync. Also, once an index run is completed, all images move to the /done folder.

rsync --stats --dry-run -ax /home/deep/shopsite/images/ /xxx

Another handy command, similar to a task manager in Linux, to monitor resource utilization:

ps -eo pcpu,pid,user,args | sort -r -k1 | less 
<or simply use> 

Get and install FileZilla, which comes in handy for copying code to the Google VM later. Alternatively, you can use a private GitLab project, which is free.

Future of Visual Search

What the community has to do next is take visual search to the next level; some thoughts:
Just as we have mature Apache products like Solr, a similar open source product is the need of the hour, one that is robust enough to:

  1. ingest images of any type and resolution, in batch and in real time
  2. capture frames at preset intervals from continuous video streams
  3. crawl any site with a pluggable API engine
  4. store images and metadata in different cloud storage services using connectors/plugins
  5. use configurable pre-trained deep learning models for feature extraction from images
  6. store metadata in a Lucene store
  7. search images using KNN and other advanced ML methods
  8. offer faceted search on metadata
  9. etc.

Perhaps a combination of the likes of Apache Spark + Apache Solr + the above features + stream processing + ML/DL methods = Apache Visor, the best open source image search out there!

P.S.:
If you’re interested in a big test data generation framework for SQL, check out my GitHub page.