AzureSql as Json Serializer: Blazing fast microservice

The startup site I was working on, dubbed the educational network, lists courses from partners. To amplify user engagement, we needed an elegant but simple commenting system: logged-in users can comment on a course they have taken and providers can reply to it – akin to the Airbnb or Expedia sites.
Before we shape this requirement into a microservice, let's see Martin Fowler's take on this. To quote him: "One reasonable argument we've heard is that you shouldn't start with a microservices architecture. Instead begin with a monolith, keep it modular, and split it into microservices once the monolith becomes a problem. (Although this advice isn't ideal, since a good in-process interface is usually not a good service interface.) So we write this with cautious optimism. So far, we've seen enough about the microservice style to feel that it can be a worthwhile road to tread. We can't say for sure where we'll end up, but one of the challenges of software development is that you can only make decisions based on the imperfect information that you currently have to hand."

As most startups' backend architectures start as a monolith API with an eye on future scalability, our startup site treads the same path, but with two stern demands.

  1. Design for exit, so that carving a standalone microservice out of the monolith later is an easier transition
  2. Extract the maximum out of the given cloud resources and keep this API scalable both now and later

We'll explore how we accomplished the above two tenets using AzureSql and ASP.NET Core by tweaking 'design & infra' choices. We utilized a 5 eDTU, 2 GB SQL Server DB (costing $5/month, the cheapest hosted DB in Azure) and a spare Windows VM to host this microservice. The backend was EF Core with a hierarchical LINQ query and Newtonsoft as the JSON serializer. Performance was dismal, and this necessitated a redesign to use AzureSql's native JSON capability to hierarchical-ize and serialize the results. Artillery.io proved nifty for load testing the API, triaging the problem areas and verifying that we achieved our goals.

Conclusion: ASP.NET Core 1.1 with Dapper can achieve 300+ API calls in a minute with a total throughput of 4 MB of data returned and a median response time of 101 ms, using merely one 5 eDTU SQL Server database (the very basic entry-level DB in Azure) hosting ~3 million comments and a million users.

This blog article can also be used as a walkthrough to recreate the whole experience yourself – essentially you need an Azure subscription and a local SQL Server! It covers the use case (a Disqus-like comment/reply system), the design methodology, query design, issues encountered, the SQL DB JSON serialization technique, the Artillery.io API testing tool and, all-important, the load test results. The code is hosted @ github

What the business wanted – "Comment & Reply" requirements:

  1. Logged-in users can comment on each course/service
  2. The course/service provider can reply to those comments
  3. Ability to have hierarchical comments, but for now restricted to 1 level
  4. Multiple comments on a course/service by a user are allowed (but no hierarchy – no comment on a comment)
  5. A provider can reply to a comment and alter it (no reply on a reply)
  6. While browsing a course, users can see comments by other users and the providers' replies to them, if any

Entity Design

Table – Utility
Users – Users registered in the system
Courses – Services/courses provided by users (registered as service providers)
Comments – A comment for a course/service: Rating, Title, Remarks, CreatedOn, and the CourseId it belongs to
CommentSnapshots – First and last comment for each User/Course combination

The above ERD depicts a cut-down version of the actual entities involved in the design; most attributes are omitted for confidentiality. These attributes suffice for a base design of the problem we are discussing.

The idea of snapshotting the first and last comment is to provide a quick way to retrieve a user's comment, with intermediate comments retrieved on demand – this is useful when comments per user are viewed either by an admin or by the user. It also limits the check of whether a user has really made at least one comment for a given service, rather than searching the entire history in the Comments table to ascertain that. There could well be an even better design, but we started off with this one, which fulfills all the requirements outlined above.

We avoided having foreign keys because the system is destined to be compartmentalized, with modular microservices as the final implementation, where each entity will live in its own domain and have its own services.
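For reference, here is a minimal sketch of how these entities might look as EF Core POCOs. The property names are inferred from the LINQ query and stored procedure shown later in this post; the real classes carry more attributes (omitted for confidentiality) and the types are assumptions, so treat this purely as illustration.

using System;

public class User
{
    public int UserId { get; set; }
    public string DisplayName { get; set; }
}

public class Comment
{
    public int CommentId { get; set; }
    public int CourseId { get; set; }
    public int UserId { get; set; }
    public int ParentId { get; set; }          // 0 for a top-level comment, parent CommentId for a reply
    public string CommentType { get; set; }    // e.g. "Course" or "CourseReply"
    public string Title { get; set; }
    public int Rating { get; set; }
    public string Remarks { get; set; }
    public DateTime CreatedDate { get; set; }
}

public class CommentSnapShot
{
    public int SnapShotId { get; set; }
    public int CourseId { get; set; }
    public int UserId { get; set; }
    public string CommentType { get; set; }
    public string LastTitle { get; set; }
    public int LastRating { get; set; }
    public string LastRemarks { get; set; }
    public DateTime LastUpdate { get; set; }
    public int LastCommentId { get; set; }
    // ...plus the corresponding First* columns for the first comment
}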

Infra Choice

DB
Being an Azure shop, we decided to use Azure SQL with the very basic offering: a 5 eDTU, 2 GB DB at USD 5 per month.

App Backend
A Windows Server 2016 VM with 2 GB RAM and an SSD, running the ASP.NET Core Web API on IIS. We started with Core 1.1 and also tested the solution using Core 2.0, the latest release at the time, to compare performance.

Data Prep

To test this system realistically, we're looking at 1 million customers, 30k services/courses and ~2.5 to 3M comments. To create the customers, restore BigTestData.bak (from BigTestData.rar) into a local SQL database – refer to my GitHub dataspring/Retail repo and look for "Getting started – Environment: Windows 7 and above with SQL Server Express 2012 and above – Steps to Generate Data".

  1. Create Comments database –> run 01-DataPrep-CreateDb.sql
  2. Create Functions and Indexes –> run 02-DataPrep-CreateFnsAndIndexes.sql
  3. Create ~1M Users –> 03-DataPrep-CopyUsers.sql
  4. Create ~2M Comments & ~1M Replies –> 04-DataPrep-FillData.sql (takes a while…..)

The gist of data generation:

  1. Pick users with user ID < 50000 to be providers (i.e. assumed to be registered as providers)
  2. Create 40000 courses, with providers iterated from a specific range of user IDs
  3. Use a random user ID (between 500K and 1M) to create 15 comments for each of the 40K courses
  4. Capture the first and last comment in CommentSnapShots
  5. Create a reply for each comment
  6. Randomly vary the content of the Title and Remarks to be realistic
  7. Ensure all 15 comments are in proper chronological order

ASP.NET Core & EF Core – Some Thoughts

ASP.NET Core benchmarks are astounding, going by a blog post I read a while back; I'm not sure its use case is relevant here, but the load test we're planning in due course (as explained below) is a practical test with pragmatic data. Micro-ORMs like Dapper keep beating EF Core to the core, as in this blog, and I wanted to try Dapper in the load test as well.

Coding the API

Fire up your VS 2017 Community, look for the ASP.NET Core 1.1 Web API template, create your Web API project – Core1dot1Service – and save the resulting solution as CoreBenchMarks. You can clone the entire code @ github and follow along as well.

I was contemplating the final requirement (point 6) and started off with EF Core and LINQ, but there weren't many examples on the web showing hierarchical queries in EF clearly and succinctly.
So I headed off on my own and created an HTTP GET method with this LINQ query:

[HttpGet]
[Route("method/jsonfromlinq")]
public async Task<List<CommentBlock>> GetFromLinq(string ratingType, int courseId, int? userId = null, int skip = 0, int size = 10, int skipThread = 0, int sizeThread = 10 )
{
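	// Join the snapshot (one row per user/course) to Users, then nest each snapshot's
	// comment thread and each comment's single provider reply, so EF materializes the
	// whole hierarchy in one query.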
	return await 
	_dbContext.CommentSnapShots
		.Where(r => r.CourseId == courseId && r.UserId == (userId ?? r.UserId) && r.CommentType == ratingType)
		.Join(_dbContext.Users,
		 r => r.UserId,
		 u => u.UserId,
		 (r, u) => new CommentBlock
		 {
			 UserDisplayName = u.DisplayName,
			 UserRating = r.LastRating,
			 Comment = r.LastRemarks,
			 UserLastUpdate = r.LastUpdate,
			 Comments = _dbContext.Comments.Where(c => c.CourseId == r.CourseId && c.UserId == r.UserId && c.CommentType == ratingType)
									 .Select(cm => new Comment
									 {
										 CommentId = cm.CommentId,
										 Rating = cm.Rating,
										 Remarks = cm.Remarks,
										 CreatedDate = cm.CreatedDate,
										 Reply = _dbContext.Comments.Where(rp => rp.ParentId == cm.CommentId && rp.CommentType == (ratingType + "Reply"))
												 .Select(ply => new Reply
												 {
													 Remarks = ply.Remarks,
													 CreatedDate = ply.CreatedDate
												 }).FirstOrDefault()
									 })
									   .OrderByDescending(o => o.CreatedDate)
									   .Skip(skipThread)
									   .Take(sizeThread)
									   .ToList()
		 })
		 .OrderByDescending(o => o.UserLastUpdate)
		 .Skip(skip)
		 .Take(size)
		 .ToListAsync();
 }
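The CommentBlock, Comment and Reply types projected above aren't shown in the post. A minimal sketch consistent with the query might look like the following; the property names come from the query itself, the types are assumptions, and the projected Comment DTO here may well be the same class as (or live alongside) the Comments entity in the actual repository.

using System;
using System.Collections.Generic;

public class Reply
{
    public string Remarks { get; set; }
    public DateTime CreatedDate { get; set; }
}

public class Comment              // one comment plus its single provider reply, if any
{
    public int CommentId { get; set; }
    public int Rating { get; set; }
    public string Remarks { get; set; }
    public DateTime CreatedDate { get; set; }
    public Reply Reply { get; set; }
}

public class CommentBlock         // one snapshot row per user/course, with its comment thread
{
    public string UserDisplayName { get; set; }
    public int UserRating { get; set; }
    public string Comment { get; set; }
    public DateTime UserLastUpdate { get; set; }
    public List<Comment> Comments { get; set; }
}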

Hierarchical Design:
For a given Course ID/Service ID and Rating Type (‘Course’) :

  • extract Last Comment from ‘CommentSnapShots’ table (if a user ID is provided, filter by it)
    • and then all Comments reverse chronologically from ‘Comments’ table
      • and replies for every comment if any from service providers

and return the whole set as a hierarchical JSON object. As expected, LINQ queries of this shape are notoriously inefficient, and it so happened that during the load tests no data was returned at all, as we'll see in the next section, which covers load testing.

Load Testing: Abandon VS Load Testing Tool & Embrace Artillery

Since I had VS 2013 Ultimate, I wanted to give its load testing a try to see how good it could be. It's intuitive to record if you have a GUI for your APIs; otherwise you have to issue your GET requests manually and record them in IE. With Windows 10 you have Edge, but VS load test recording still depends on IE, so you have to install additional stuff. There was no great way to make POST API calls easily, and randomizing data inputs, reading data from text files and integrating them into the test was such a pain that I had to abandon the whole exercise and move to the best alternative – open source. Artillery.io fits the bill fantastically, and I was able to learn the whole thing within a few hours. It was a pleasure to load test APIs with a simple, easy-to-understand YAML file and NodeJS.

Ensure you have the latest Node and just follow Artillery.io's getting-started guide. Create a solution folder under the solution called 'Artillery.LoadTests'. Now there are 2 steps: generate random data to use, and create the load test script.

Just generate the data with the query below and copy the output to the folder where the Artillery YAML file is located.

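-- Sample 5000 random snapshot rows with randomized paging parameters;
-- save the output as testData.csv (referenced in the payload section of the YAML below).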
Select Top 5000	
[SnapShotId]
,[CommentType]
,[CourseId]
,[UserId]
,[Skip] = [dbo].[Random_Range](0,3)
,Size = [dbo].[Random_Range](2,10)  
,SkipThread = [dbo].[Random_Range_With_Default](0,1,0,8)
,SizeThread = [dbo].[Random_Range](2,10)
from [dbo].[CommentSnapShots]
ORDER BY NEWID() 

If you're hosting the .NET Core service elsewhere, change the target accordingly.

config:
  environments:
      AzCore11:
        target: "http://comments.avantprise.com"
      AzCore2:
        target: "http://comcore2.avantprise.com"
      local:
        target: "http://localhost:43182"  
  #target: "http://comments.avantprise.com"
  phases:
      - duration: 30
        arrivalCount: 10
        name: "Warm up phase"
      - duration: 60
        arrivalRate: 1
        name: "High load phase"
  processor: "./proc-functions.js"           
  payload:
      path: "./testData.csv"
      fields:
          - "SnapShotId"
          - "CommentType"
          - "CourseId"
          - "UserId"      
          - "Skip"
          - "Size"
          - "SkipThread"
          - "SizeThread"
      #order: "sequence"          
# scenario definitions      
scenarios:
  - name: "Stress Test JsonFromLinq API - where JSON is returned from LINQ"
    flow:
    - get:
          #----------- just for a given course ID -----------------------------
          url: "/api/comments/method/jsonfromlinq?ratingType={{CommentType}}&courseId={{CourseId}}&skip={{Skip}}&size={{Size}}&skipThread={{SkipThread}}&sizeThread={{SizeThread}}"
          afterResponse: "logResponse"
          #think: 5
    - log: "jsonfromLinq api call : ratingType={{CommentType}}, courseId={{CourseId}}, skip={{Skip}}, size={{Size}}, skipThread={{SkipThread}}, sizeThread={{SizeThread}}"     

We’re using a simple loading pattern to start with:

  • A phase which generates a fixed count of new arrivals over a period of time : 10 users in 30 seconds
  • A phase with a duration and a constant arrival rate of a number of new virtual users per second : 1 user / second for 60 seconds
  • In total: 70 requests in 90 seconds (1.5 minutes)

As you can see from the performance snapshot below, EF Core LINQ is quite performant on the laptop (perhaps the spec is good), but when ported to the Azure VM with the 5 DTU Azure SQL database it simply doesn't work!
To mitigate this performance issue, we have to redesign the whole data access, perhaps relinquish the abstraction which LINQ provides, and go bare metal – down to the database level – to unravel how far we can stress the system and keep it performant. The options available, from both code and infra, are:

  1. Scale Azure SQL to 30 or more DTUs
  2. Use a 3rd party JSON serializer with the existing LINQ query
  3. Partition the LINQ query into individual queries on top of option 2
  4. Abandon LINQ and go bare metal on SQL: a stored proc with JSON serialization in SQL Server

We embarked on option 4, which provides a cost-effective solution and can be a quick win if we need to scale within budget.

The LINQ query was redesigned as a stored procedure that uses T-SQL's powerful JSON capability to hierarchical-ize and serialize the result and return JSON text.

Proc Design – Version 1

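	-- Outer query: one row per comment snapshot, joined to Users for the display name.
	-- The correlated subqueries build the nested 'thread' and 'reply' arrays, each with
	-- FOR JSON PATH, so the outermost FOR JSON emits the complete hierarchy as one JSON document.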
	SELECT u.displayName
	,c.courseId
	--,c.UserId
	,c.commentType
	,c.lastTitle
	,c.lastRating
	,c.lastRemarks
	,c.lastUpdate
	,c.lastCommentId
	----------------------------
	,(
		SELECT t.commentId
			,t.title
			,t.rating
			,t.remarks
			,t.createdDate
			---------------------------------
			,(
				SELECT r.commentId
					,(Select top 1 displayName from Users usr where usr.UserId = r.UserId) as displayName 
					,r.title
					,r.remarks
					,r.createdDate
				FROM Comments AS r
				WHERE r.CourseId = t.CourseId
					--AND r.UserId = t.UserId
					AND r.CommentType = t.CommentType + 'Reply'
					AND r.ParentId = t.CommentId
				FOR JSON PATH, INCLUDE_NULL_VALUES
				) AS reply
		----------------------------------
		FROM Comments AS t
		WHERE t.CourseId = c.CourseId
			AND t.UserId = c.UserId
			AND t.CommentType = c.CommentType
			--AND t.ParentId = 0
		ORDER BY t.CreatedDate DESC
		OFFSET @SkipThread ROWS
		FETCH NEXT @SizeThread ROWS ONLY
		FOR JSON PATH, INCLUDE_NULL_VALUES
		) AS thread
	---------------------------
	FROM CommentSnapShots AS c
	INNER JOIN Users AS u ON c.UserId = u.UserId
	WHERE Isnull(c.CourseId, '') = Isnull(COALESCE(@CourseId, c.CourseId), '')	
		AND c.UserId = COALESCE(@UserId, c.UserId)
		AND c.CommentType = @RatingType
	ORDER BY c.LastUpdate DESC 
	OFFSET @Skip ROWS
	FETCH NEXT @Size ROWS ONLY
	FOR JSON PATH, INCLUDE_NULL_VALUES

SQL Server as JSON serializer is achieved using the FOR JSON construct; iterating the design through its options makes the result nearly identical to the LINQ-based hierarchical results serialized by Newtonsoft.

Issues in Version 1 and Mitigation:

  1. T-SQL has a nice feature, the COALESCE function, which comes in handy if any of the filter fields are null or not provided – it keeps the WHERE clause simple – but it hurts performance hugely, so you either have to use dynamic SQL or remove COALESCE from the WHERE clause altogether.
  2. Key lookups are a costly affair in SQL execution, which is evident from peeking into the execution plan; hence you need non-clustered indexes whose key fields match the query's WHERE clause fields and whose INCLUDE columns match the selected fields – a great example here.
  3. Yet another aspect is to accept dirty reads – which I've not tried here, but worth it if a slight margin of error is acceptable (see the sketch after this list). You can use NOLOCK, which is functionally equivalent to an isolation level of READ UNCOMMITTED. If you plan to use NOLOCK on all tables in a complex query, then SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED is easier, because you don't have to apply the hint to every table.
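I didn't try dirty reads in this exercise, but if you wanted to experiment with point 3 from the API side, a minimal Dapper sketch along these lines would set the isolation level on the connection before calling the proc (the method and parameter names here are illustrative, not from the repository):

using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;
using Dapper;

public static class DirtyReadExample
{
    // Illustrative only: reads may include uncommitted rows, in exchange for taking fewer locks.
    public static async Task<string> GetCommentsJsonAsync(string connectionString, DynamicParameters dp)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            await connection.OpenAsync();

            // Session-level equivalent of adding NOLOCK to every table referenced by the proc.
            await connection.ExecuteAsync("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;");

            var chunks = await connection.QueryAsync<string>(
                "GetComments", dp, commandType: CommandType.StoredProcedure);

            return string.Join("", chunks);
        }
    }
}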

Version 2 removed COALESCE and created a handful of non-clustered indexes with INCLUDE columns to remove all key lookups and literally leap-frog query performance by covering the selected columns in the indexes themselves. The final stored proc code is here, and the corresponding Web API passes the JSON string straight through from the DB without attempting any .NET Core level serialization. Check out the code below.

[HttpGet]
[Route("method/jsonfromdapper")]
public async Task<ContentResult> GetFromDapper(string ratingType, int? courseId, int? userId = null, int skip = 0, int size = 10, int skipThread = 0, int sizeThread = 10)
{


	using (var connection = new SqlConnection(ConnectionConfig.DefaultConnection))
	{
		connection.Open();

		DynamicParameters dp = new DynamicParameters();

		dp.Add("@RatingType", ratingType ?? (object)DBNull.Value, DbType.String);
		dp.Add("@CourseId", courseId ?? (object)DBNull.Value, DbType.Int32);
		dp.Add("@UserId", userId ?? (object)DBNull.Value, DbType.Int32);
		dp.Add("@Skip", skip, DbType.Int32);
		dp.Add("@Size", size > 20 ? 20 : size, DbType.Int32);
		dp.Add("@SkipThread", skipThread, DbType.Int32);
		dp.Add("@SizeThread", sizeThread > 20 ? 20 : sizeThread, DbType.Int32);


		// The proc returns the JSON document as one or more nvarchar chunks;
		// concatenate them and pass the string straight through without re-serializing in .NET.
		var results = await connection.QueryAsync<string>("GetComments", dp, commandType: CommandType.StoredProcedure);

		return Content(string.Join("", results), new MediaTypeHeaderValue("application/json"));
	}

}
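ConnectionConfig.DefaultConnection is defined elsewhere in the repository; a hypothetical stand-in, assuming the connection string is read from configuration once at startup, could be as simple as:

public static class ConnectionConfig
{
    public static string DefaultConnection { get; set; }
}

// e.g. in Startup.ConfigureServices:
//   ConnectionConfig.DefaultConnection = Configuration.GetConnectionString("DefaultConnection");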

Load Test Results
Armed with the optimized stored proc that also does the JSON serialization, the next step is to really test this and ascertain how much the minimal infra can withstand once it goes live. In the end the app VM wasn't the bottleneck; the DB was. A simple yet comprehensive load testing regime was used here to compare, contrast and conclude.

Following were the versions tried:

  1. API as such with LINQ query – with ASP.NET Core 1.1 & EF Core 1.1
  2. API with the optimized stored proc (which hierarchical-izes and serializes results within) – with ASP.NET Core 1.1 & EF Core 1.1
  3. API with optimized Stored Proc (that hierarchical-izes and serializes results within) – with ASP.NET Core 1.1 & Dapper (the best ORM out there)
  4. Point 2 & 3 – with ASP.NET Core 2.0 and EF Core 2.0

Load Pattern: Ramp-up with 10 users/calls in 30 sec and add 1 user/call every sec for next 60 seconds

Results are here:

Pattern with random data: ramp up with 10 users/calls in 30 sec, then add 5 users/calls every sec for the next 60 seconds. As you can see from the results, data throughput is more or less the same:

The conclusion is clear: the winner is Core 1.1 with Dapper, which can achieve 300+ API calls in a minute with a total throughput of 4 MB of data returned and a median response time of 101 ms, using merely a 5 eDTU SQL Server database hosting ~3 million comments and a million users.

Test results – console outputs and JSON results from the Artillery load testing – are available @ Github.

 


How well have we mastered ‘the science of teams’?

Startups and established teams face tremendous pressure to deliver. The breakneck speed of change in languages and framework ecosystems, and the need to deliver well-thought-out code written for maintainability and scalability, add another dimension to this pressure. But the factor that stands out most is the level of interaction and collaboration within the team and outside it. We manage this chaos with routine practices – recruiting good talent, enticing with good options, equipping teams with the best infrastructure and software tools, training and orienting in best practices, pair programming and team building to enhance group interactions – yet these may not be sufficient in the short or long term.

Personally, I have been through lots of drab team building activities – paintballs, movie togethers, charity activities, dinner and dances, laugh-it-throughs, corporate self-help trainings, Da Vinci Code themed hunts on taxi hops, etc. Take a moment and name the ones you have attended so far; my question is – have they really improved your outlook on working collaboratively, boosted your output, or deepened your understanding of your peers and sneers?

A stream of questions arises. Does your team have the requisite team dynamics, energy and camaraderie to act in unison and deliver what the company wants from them, orchestrated by the program/project lead? Is team cohesion and communication superb or subpar? Is there a practical, genuine and approachable way to rejuvenate and maintain team spirit? Is our mastery of 'the science of teams' adequate to meet these challenges?

‘The science of teams’ answer needn’t be high octane. Enter the Board Game.


You might think I'm silly – a board game is a family activity for leisure and fun, full of name calling, commotion, laughs, benign barbs, teasing, strategizing and winning. But it's more than that. There's no sense of loss even when you lose the game – all that matters is the journey. It's full of fun, camaraderie, surprise and ample talk, and that's what we need in team building – gentle yet reinforcing. Corporate-driven, high-octane team building activities aren't going to help in any way; they benefit only event companies and drain the 'must be spent' budget.

When I joined this gaming enterprise's data team, with its close-knit group of developers, artists and learning designers, I was skeptical of what playing a 'Board Game' could bring to a team that was already developing 3D games (using Unity) for its bread and butter. I was wrong; it did wonders. We have 'Board Game' time every fortnight, sandwiched between the last hours of work time and the evening. We start our sessions around 5pm and end around 7pm, so team members' routine commute is not disrupted. There's a choice of drinks for everyone – beer, wine, coffee and sodas – and, depending on the budget, some healthy finger foods and pizza. The game master is the one who has supposedly gone through the rule book and is expected to be familiar with the game, either by playing it her/himself or by watching it explained in YouTube videos from the game provider. We take turns being the game master to save time and guide the rest of the team to be effective, extracting maximum fun out of the play. Ideally the board should support at least 4 or 6 players; if the team size is 8, others can join to form groups within. Those who are really not interested in participating can still be spectators supporting a player. Those who were reluctant initially became active players subsequently.

What I noticed and witnessed for myself during game play: people talk freely (over their choice of drink and food), discuss their strategy, question, cheer and tease other players. Every dice throw is fun in anticipation of the desired number that could catapult their position in the game. With constant conversation around rules, booties won and lost, twists and turns, the game takes you through a journey of fun, anticipation and interaction – hence camaraderie and respect develop. This breaks the ice and starts conversations that flow beyond play time into work time. The fun is amplified when the winner decides who the next game master is and which new game to play next time from the repertoire of board games in stock, sometimes even proposing a new board game to buy. Every year our budget is to acquire 6 new games in the $100 to $150 price range, with older ones given away to employees. On a side note, the game master has to really understand the rules of the game, read through the cryptic rule book and decipher the nuances to instill the gaming spirit: be the gatekeeper on errant players, gently nudge them to participate, benevolently whip the procrastinators into action and take the lead in steering the gameplay.

No wonder that, as a 3D gaming company ourselves, emboldened by our physical board game play, we wanted to create our own board game. In the true spirit of startups and experimentation, we recently released our own on Kickstarter, called Avertigos. I humbly encourage you to take a look at it and see whether you can turn your team building activity into something genuinely fun, indoor, refreshingly new and authentic – not only with Avertigos, but with any board game of your choice. Be mindful of the game genre, and give it a try. It can do wonders. It did for us.

This could be one simple step in mastering the science of teams, and for us it certainly helped on the aspects discussed in this 'science of teams' article.

Every image is searchable with Inception & a Crawler in Google Cloud for 0$

As I was attempting a Kaggle contest on Bosch, I was suddenly piqued by reverse image search; having attempted face detection a year ago by building a prototype web app, deep learning was beckoning. It has been making huge strides, with FPGAs, elastic GPU hardware and neural processors on the scene; it's getting hotter by the day, so it was time to get my hands dirty. Deep learning visual frameworks have mushroomed, eclipsing well-established ones like OpenCV, which still powers niche use cases. Google's TensorFlow is getting its own limelight, and I was curious how a reverse image search engine might work using TensorFlow. While googling, I stumbled on the ViSenze and Tineye web services, which fill diverse needs: the former is an e-commerce reverse search engine, while the latter lets you know where a given image is sourced or identified across the entire internet. They can squeeze off a search and display results in 1 or 2 seconds, excluding the time to upload or extract an image off a URL. This is pretty impressive given Tineye has indexed more than a billion images.

How do we make our own ViSenze or Tineye or IQnect or GoFind? Githubbing, I found a TensorFlow based reverse image search project (credits to this GitHub project & Akshayu Bhat) and realized it was a great way to start, but a real use case could make it even more compelling. I thought a commercial website selling apparel would be a good candidate for real-world images to index and test the capability of this reverse search. This experiment had a unique twist: having all along been a Windows aficionado using MS software development tools, TensorFlow forced me to switch to a Linux environment, as it was only available on Linux or Mac. (As of 29 Nov 16, TensorFlow finally added support for Windows – too late for this exercise.) Being a Windows developer, I naturally gravitated to an Ubuntu desktop on Oracle VirtualBox; I had previously played around with the Ubuntu desktop, albeit just to get the hang of the GUI and to use some of the Linux tools, but never did serious development there.

Now, let's get practical: set up the dev environment (I'm new to Linux and want to learn), spruce up the code from the fork, add a crawler, plus a commercially available API to detect whether the uploaded image (for reverse visual search) is safe and appropriate and to detect its content, while returning the nearest 12 items when a visual item is searched. My claim of $0 rests on leveraging the Google Cloud trial; before you jump in to test drive, you may want to take a look at such an implementation running on Google Compute Engine @ http://visualsearch.avantprise.com/. The actual search takes 3 to 4 seconds on 70K images, whereas the approximate search is a bit faster.

Setup and run Visual Search Server

Get the latest Oracle VirtualBox here and install it on your Windows m/c (mine is Windows 10 build 1439). Now proceed to download the Ubuntu 14.04.5 Trusty Tahr Linux desktop from osboxes.org to get the OS up and running. Using the userid osboxes.org and the same as the password, you get the desktop up and running – or just install it from scratch in VirtualBox, which is what I did (also suggested), providing an 80 GB disk size so the VM has ample space to grow dynamically, if needed, up to that set limit. Make sure you install Guest Additions into the VM instance of the Ubuntu desktop; it is useful when you want to transfer files between Ubuntu and the host OS and also lets the display adjust flexibly. Do note that, as I switch between office ethernet and home wireless, I have to change the network adapter to wireless for Adapter 1 to get it going at home.


Ubuntu desktop comes with Python 2.7.6, hence you don't need Anaconda or other Python environments, and I'm not looking at exclusive Python environments that would make this experiment long-winded. As for the development environment: well, I'm used to Visual Studio for C# & Python and WebStorm for NodeJS. Hence I wanted to stick to the same tools, with a slight difference – this time I went with Visual Studio Code, a great open source tool with fantastic extensions that works like a charm. Log into your Ubuntu desktop, launch a terminal and type python --version to check the version and ensure it is 2.7.6. Don't forget to enable the shared clipboard to be bi-directional for this VM instance in VirtualBox. Get git, pip and fabric installed as follows:

sudo apt-get install git
sudo apt-get install python-pip
sudo pip install fabric
sudo apt-get install openssh-server

Ensure you have an RSA key created to connect to GCE and the local dev environment if required (also do an ssh localhost) using the following commands:

ssh-keygen -t rsa (Press enter for each line)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys
ssh localhost 

Type exit to see the logout message displayed. Now you're all set; go ahead and clone the repo. In the terminal (under the home directory) type:

sudo git clone https://github.com/dataspring/TensorFlowSearch.git
cd ~/TensorFlowSearch
vi settings.py (to change username, etc. & save)

Open up settings.py and change LOCALUSER to the user you set while creating the Ubuntu desktop VM, and optionally LOCALHOST to a specific IP address if 127.0.0.1 doesn't work. With the code base on the desktop ready, we need to set up the development environment on the local Ubuntu desktop so that we can run, debug and change code. Fabric allows you to run sudo commands either locally or remotely, plus tons of other features. With Python Fabric in place, run fabric, calling the setup function from the terminal:

sudo fab --list  (lists all methods)
sudo fab localdevsetup

With a couple of ENTER & Y key-presses, this will install all prerequisites for the Python development environment: TensorFlow, Fabric, Sqlite3, Visual Studio Code and SQLite Browser. If all goes well, run the crawler to get a few images from an e-commerce site (carousell.com), and then we can start the visual search server as follows:

sudo fab ShopSiteImages
sudo fab index
sudo fab server

Open a browser in your Ubuntu desktop and type http://localhost/ – which should launch a screen as shown below – and start searching.


Launch Visual Studio Code from the terminal by typing the command below; this allows VS Code to launch with admin rights so that debugging works properly:

sudo code --user-data-dir="~/.vscode-root"  

Install the Python extensions and you're all set to change and play around with the code with nice debugging support! Once launched, point to the git directory @ ~/TensorFlowSearch to open the code and modify it.

Detect content appropriateness and type – Clarifai to our rescue

I thought of including a safe-content check, which is vital when doing a visual search since it involves a user-uploaded/snapped image. Among the myriad video & image recognition services that offer detection of unsafe content, Clarifai is simple and there's a free plan to test and play with its REST API. Navigate to their developer site, obtain an API key and you're all set. In this search form, the uploaded image is sent from Angular to the Clarifai API to check whether the image is safe and appropriate, and you get a probability score which is displayed on the search screen. Another API call is made to detect the 'content type'. The code snippet used in the controller.js file (hosted in the Python Flask web app) is as follows; you may want to get your own API key, as the current key is part of the free tier and may be exhausted.

code link for controller.js under angular


Design and Implement a Simple Crawler

Getting images is what makes this simple crawler fun and useful. For the experiment, I selected carousell.com, which sells anything that can be snapped with your cellphone camera. It's a great, up-and-coming service that allows anyone to sell – their tagline is 'Snap to Sell, Chat to Buy for FREE on the carousell marketplace!'. It would be good if we could get images off their site, which is already meant for the public to consume and buy items – but how do we know what is offered, and how do we scrape metadata and images? Well, I just downloaded their mobile app on Android and started to look at the underlying web traffic that feeds the app, to decipher the contract, i.e. the API pattern that powers it. There are nifty ways to configure your Android mobile to have its internet traffic proxied by Fiddler on a PC over WiFi, and monitoring the ongoing traffic in Fiddler while using their app provides enough information to understand the API story behind it – how their wares are categorized, how the metadata is designed and how images are served. With this info, you can quickly write a routine in Python to get images for our experiment and also define our own metadata to make the search worthwhile: upon performing a visual search, we not only present the nearest 12 items that resemble the given image, but also display additional metadata – how much an item costs, where it is available – and refer users to the actual e-commerce site for purchase if they intend to buy, facilitating the buying process.
The crawler hinges on the product categorization and page iteration technique implemented in the API to get images and metadata, which are then persisted in a local sqlite3 database for searching purposes. The idea here is to retrieve each image once, extract the TensorFlow model features and discard the image but keep the metadata. This prevents the service from serving its own images; instead it points to the image URL at the commerce site, avoiding egress cost from the cloud provider. Sqlite3 fits the bill by providing a simple data store, but this can be scaled depending on future scalability requirements which we can't anticipate now. The crawler is designed to restart wherever it stopped, with a manual intervention to reset the following variables – – to facilitate re-crawling where it left off.

Crawler Design

  1. Decide on Product Collection Number, Pagination Parameters part of the API (figured out from API pattern)
  2. Start iterating on each collection, setting the returning result count and keep increasing the page count until max-iteration count
  3. Issue a python requests.get and parse the returned json results to get meta data and fill ‘sellimages’ table of sqlite3 db
  4. Retrieve the image from the URL
  5. If and when this whole process is rerun, ensure metadata and the image, if already present, are overwritten – crawler re-runs are idempotent as long as the API signature is not changed, in which case the crawler may also fail
  6. We assume only JPEG images are provided from the API's metadata URL, and that is so
  7. Uses simple Python modules like requests, json, sqlite3 and urllib


Indexer & Searcher

The gist of indexing images is to simply use TensorFlow to load a pre-trained model – trained on ImageNet, aka InceptionV3 – which is already available as a protobuffer file, in our case network.pb. We then parse it to import the graph definitions and use this definition set to extract 'incept/pool_3:0' features from each image. The indexer further spits out chunks of these features and concatenates them based on the configured batch size, and they get stored as index files. KNN search is performed using the scipy spatial function.
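The project does this in Python with scipy, but just to make the idea concrete, here is a minimal, language-agnostic sketch of brute-force nearest-neighbour search over extracted feature vectors, written in C# with cosine distance; the class, its methods and the feature dimension mentioned in the comment are illustrative assumptions, not code from the repository.

using System;
using System.Linq;

public static class KnnSketch
{
    // Cosine distance between two feature vectors (e.g. the 2048-dimensional 'incept/pool_3:0' features).
    static double CosineDistance(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return 1.0 - dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }

    // Return the indices of the k indexed images closest to the query image's features.
    public static int[] Nearest(float[] query, float[][] index, int k) =>
        index.Select((features, i) => (idx: i, dist: CosineDistance(query, features)))
             .OrderBy(t => t.dist)
             .Take(k)
             .Select(t => t.idx)
             .ToArray();
}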

In the next iteration of this article, I want to see which spatial distance metric is most performant (many are available: 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'wminkowski', 'yule'). I'm also looking at how to incorporate newly available pre-trained models (.pb files) to see which one fares better for this use case when implementing KNN search. One site that lists new pre-trained models is GradientZoo; you need to figure out how to generate a protobuf file from these model constant files, and a starting point is here.

Sqlite3 db

Essentially there are 2 tables that keep track of the data ingested: 'indeximages' to log crawler runs and 'sellimages' to ingest metadata for each image crawled. You can view the database on the Ubuntu desktop – just launch sqlitebrowser and point it to the db file @ /home/deep/shopsite/sqllite3/


Take it to Cloud Heights – Setup in GCE

I claim that indexing, crawling and search cost $0 thanks to the generous $300 credit for trying out Google Cloud. The only catch is that GPU hardware availability is yet to be mainstream, unlike AWS, though Google announced GPUs in the cloud a few weeks ago. Fire up Google Cloud and set up an account, keying in your credit card – they explicitly say the credit card will never be charged when the trial period completes, so it's worth giving it a try. Use the Quick Start Guide to Linux VM, and follow the screenshots to create an Ubuntu 14.04 server with 4 vCPUs, an 80 GB SSD and 15 GB RAM.

  1. Create your project
  2. Create a VM Instance
  3. Select Zone (near your place), Machine Type (4 vCPU 15 GB) , Boot disk (80 GB SSD)
  4. Allow HTTP and HTTPS traffic
  5. Click on Networking and choose an external IP to be a static IP (so that it retains the same IP on restarts)
  6. Click on SSH keys, navigate to ~/.ssh/, open the id_rsa.pub file we created earlier, copy its contents and paste it into SSH Keys
  7. You'll end up with a VM created as follows

From your local Ubuntu desktop, launch the terminal, just do

ssh username@externalip

The username is the one in the key you copied into SSH Keys, and the external IP is the static IP that you reserved. It should connect to the remote host; now log out.

Time to set up the GCE VM instance. Open the settings.py file on the local Ubuntu desktop, change HOST (to the static IP) and USER (the SSH Keys assigned user) accordingly, and save. Now fire up Fabric to do the setup for us on the remote host m/c:

sudo fab live hostsetup

Once all the setup is complete, ssh in and do a test run to crawl images, index them and start the web server. You can access the server by pointing your browser to http://<external ip>/. This ensures that everything works. Next is the stress test. Open the settings.py file on the remote machine again and change the following to larger values: RESULT_STEPS, MAX_ITER, MAX_COLLECTION, BATCH_SIZE.

Now that the process is going to be long running, you need to launch an ssh window and use the screen command, which allows processes to run uninterrupted even when ssh is disconnected. For those from the Windows world used to the command window, there's a very nice tutorial explaining screen.

sudo apt-get install screen
screen -R crawlrun
cd ~/TensorFlowSearch
sudo fab ShopSiteImages

Ctrl+A followed by Ctrl+D to detach the screen and log out; once the crawling process is over, do the same for indexing, and then run another screen session for the web app so the search server is available on the internet for all.

If the count of image files is very large, in the millions, the best way to check the file count in /home/deep/shopsite/images/ is not ls but rsync. Also, once an index run is completed, all images move to the /done folder.

rsync --stats --dry-run -ax /home/deep/shopsite/images/ /xxx

Another handy utility, similar to Task Manager, to monitor resource utilization in Linux:

ps -eo pcpu,pid,user,args | sort -r -k1 | less 
<or simply use> 
top

Get FileZilla and install it; it comes in handy for deploying code to the Google VM later. Alternatively you can use a private GitLab project, which is free.

Future of Visual Search

What the community has to do next is take visual search to the next level – some thoughts:
As we have mature Apache products like Solr, a similar open source product is the need of the hour – one that is robust enough to:

  1. ingest images of any type and resolution, in batch and in real time
  2. capture frames at preset intervals from continuous video streams
  3. crawl any site with a pluggable API engine
  4. store images & metadata in different cloud storage services using connectors/plugins
  5. use configurable pre-trained deep learning models for feature extraction from images
  6. store metadata in a Lucene store
  7. search visual images using KNN and other advanced ML methods
  8. run faceted search on metadata
  9. etc.

Perhaps a combination of the likes of Apache Spark + Apache Solr + the above features + stream processing + ML/DL methods = Apache Visor – the best open source image search out there!

P.S. :
If you're interested in a big test data generation framework on SQL, check out my GitHub page

The Great Equations

Breakthroughs in Science from Pythagoras to Heisenberg – by Robert P. Crease – is vivid, entertaining and full of science for the inquisitive mind.

Mastering them well enough to explain them to general and scientific audiences is a great skill, one that every science and engineering student should aspire to and keep in their intellectual repertoire.

The equations dealt with are:

  • The Gold standard for mathematical beauty is Euler's equation:
         e^(iπ) + 1 = 0

  • The most significant event of the 19th century – Maxwell's Equations:
        ∇ · E = 4πρ
        ∇ × B − (1/c) ∂E/∂t = (4π/c) J
        ∇ × E + (1/c) ∂B/∂t = 0
        ∇ · B = 0
  • Celebrity Equation by Einstein:
        E = mc²

 

User Interface and the art of seduction

I was very much thinking of participating in a start-up challenge, run by a local start-up accelerator for a large wealth management bank, that posed an open-ended question about organizing the financial data deluge into manageable chunks. But with the question being open ended and no further info forthcoming, I thought I'd leave the fray, lest there was another, already identified start-up whose work this bank wanted to capitalize on... I'm not sure. Still, this interest in participating led me to read a few resources on UI design; they're great and give you a good head start if you need to design a seductive, meaningful and delightful interface and deliver it on time:

  1. Lean UX – Applying Lean Principles to Improve User Experience – Jeff Gothelf and Josh Seiden
  2. Seductive Interaction Design – Stephen P Anderson
  3. Refining Design for Business – Using Analytics, Marketing, and Technology to Inform Customer Centric Design – Michael Krypel
  4. Interface Design for Learning – Design Strategies for Learning Experiences  – Dorian Peters

Each book delves into unique areas, and together they are practical resources to conceive, design and deliver a great UI. Certainly all would agree that "a man is only half of him and the rest is his attire", and so is a UI to a software service.

Naked Statistics – what you need to understand from statistics

A fantastic and informative read in this nascent era of big data. Some excerpts are captured here for a better understanding of stats in prediction.

The reasons to learn statistics were best summarized as follows:

  • Summarize huge quantities of data
  • Make better decisions
  • Answer important social questions
  • Recognize patterns that can refine how we do everything from selling diapers to catching criminals
  • Catch cheaters and prosecute criminals
  • Evaluate the effectiveness of policies, programs, drugs, medical procedures and other innovations

Descriptive Statistics

Mode = the most frequently occurring value
Median = rearrange all numbers in ascending order and take the central value (the 50th percentile)
Mean = average
A better way is to use decile values: if you're in the top decile of earners in the USA, your earnings are more than those of 90% of the population. Percentile scores are better than absolute scores. If 43 correct answers falls into the 83rd percentile, then this student is doing better than most of his peers statewide. If he's in the 8th percentile, then he's really struggling.
Measuring dispersion matters: if the mean score on the SAT math test is 500 with a standard deviation of 100, the bulk of students taking the test will be within one standard deviation of the mean, or between 400 and 600. How many students do you think will score 720 or more? Probably not very many. The most important and common distribution in statistics is the normal distribution.


Deceptive Description

Statistical malfeasance has very little to do with bad math. If anything, impressive calculations can obscure nefarious motives. The fact that you’ve calculated the mean correctly will not alter the fact that the median is a more accurate indicator. Judgment and integrity turn out to be surprisingly important. A detailed knowledge of statistics does not deter wrongdoing any more than a detailed knowledge of the law averts criminal behavior. With both statistics and crime, the bad guys often know exactly what they’re doing.

Correlation

It measures the degree to which 2 phenomena are related to one another. There’s a correlation between summer temperatures and ice-cream sales. When one goes up, so does the other. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as a relationship between height and weight.


A pattern consisting of dots scattered across the page is a somewhat unwieldy tool. If Netflix tried to make film recommendations by plotting ratings for thousands of films by millions of customers, the results would bury HQ in scatter plots. Instead, the power of correlation as a statistical tool is that we can encapsulate an association between two variables in a single descriptive statistic: the correlation coefficient. Its value ranges from -1 to 1; closer to 1 or -1 is a perfect positive or negative association, whereas 0 means no relation at all. There is no unit attached to it.

Basic Probability

The Law of Large Numbers (LLN) explains why casinos always make money in the long run: the probabilities associated with all casino games favor the house. A probability tree can help navigate some problems and decisions, such as an investment decision or widespread screening for a rare disease. The Chicago police department has created an entire predictive analysis unit, in part because gang activity, the source of much of the city's violence, follows certain patterns. In 2011, the New York Times ran the headline "Sending the Police before There's a Crime".

Problems with Probability

Assuming events are independent when they're not: The probability of flipping two heads in a row is (1/2)^2, i.e. 1/4, whereas the probability of both engines of a jet failing during a transatlantic flight is not (1/100,000)^2, because the two failures are not independent.
Not understanding when events ARE independent: If you're in a casino, you'll see people looking longingly at the dice or cards and declaring that they're "due". If the roulette ball has landed on black five times in a row, then clearly now it must turn up red. No, no, no! The probability of the ball landing on a red number remains unchanged: 18/38. The belief otherwise is sometimes called "the gambler's fallacy". In fact, if you flip a coin 1,000,000 times and get 1,000,000 heads in a row, the probability of getting tails on the next flip is still 1/2. Even in sports, the notion of streaks may be illusory.

Clusters happen: A great exercise to show that rare events do happen: take a class of 50 or 100 students (more is better). Everyone stands up and flips a coin; anyone who flips heads must sit down. Assuming we start with 100 students, roughly 50 will sit down after the first flip. Then we do it again, after which 25 or so are still standing, and so on. More often than not, there'll be a student standing at the end who has flipped five or six tails in a row. At that point, I ask the student questions like "How did you do it?", "What is the best training exercise for flipping so many tails in a row?" or "Is there a special diet?" This elicits laughter because the class just watched the whole process unfold; they know the student who flipped six tails has no special talent. When we see an anomalous event like that out of context, we assume that something besides randomness must be responsible.

Reversion to the mean: Have you heard about the Sports Illustrated jinx, whereby individual athletes or teams featured on the cover of Sports Illustrated subsequently see their performance fall off? The more statistically sound explanation is that teams and athletes appear on the cover after some anomalously good stretch (such as a twenty-game winning streak), and their subsequent performance reverts back to what is normal – the mean. This is the phenomenon known as reversion to the mean. Probability tells us that any outlier – an observation that is particularly far from the mean in one direction or the other – is likely to be followed by outcomes that are more consistent with the long-term average.

Importance of Data:

Selection bias: Is your data collected from a sufficiently broad range of sources, or is it confined? A survey of consumers in an airport is going to be biased by the fact that people who fly are likely to be wealthier than the general public.
Publication bias: Positive findings are more likely to be published than negative findings, which can skew the results that we see.
Recall bias: Memory is a fascinating thing – though not always a great source of good data. We have a natural impulse to understand the present as a logical consequence of things that happened in the past – cause and effect. A study of the diets of breast cancer patients found something striking: the women with breast cancer recalled a diet that was much higher in fat than what they actually consumed; the women without cancer did not.
Survivorship bias: If you have a room of people with varying heights and force the short people to leave, the average height in the room rises, but it doesn't make anyone taller.

Central Limit Theorem:
For this to apply, sample sizes need to be relatively large (over 30 as a rule of thumb).

1.   If you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean (regardless of what the distribution of the underlying population looks like)

2.   Most sample means will lie reasonably close to the population mean; the standard error is what defines "reasonably close"

3.   The CLT tells us the probability that a sample mean will lie within a certain distance of the population mean. It is relatively unlikely that a sample mean will lie more than two standard errors from the population mean, and extremely unlikely that it will lie three or more standard errors from it.

4.   The less likely it is that an outcome has been observed by chance, the more confident we can be in surmising that some other factor is in play.
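A quick Monte Carlo sketch (not from the book) makes the theorem tangible: draw repeated samples from a decidedly non-normal population – uniform on (0, 1) – and watch the sample means cluster tightly around the population mean. The sample size and sample count below are arbitrary.

using System;
using System.Linq;

class CltDemo
{
    static void Main()
    {
        var rng = new Random(42);
        const int sampleSize = 40;     // comfortably above the rule-of-thumb 30
        const int samples = 10000;

        // Means of many samples drawn from Uniform(0,1); the population mean is 0.5.
        var sampleMeans = Enumerable.Range(0, samples)
            .Select(_ => Enumerable.Range(0, sampleSize).Average(__ => rng.NextDouble()))
            .ToArray();

        double mean = sampleMeans.Average();
        double stdError = Math.Sqrt(sampleMeans.Average(m => (m - mean) * (m - mean)));

        Console.WriteLine($"Mean of sample means: {mean:F4} (population mean 0.5)");
        Console.WriteLine($"Observed standard error: {stdError:F4} (theory: {1.0 / Math.Sqrt(12.0 * sampleSize):F4})");
    }
}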


Inference

Statistics cannot prove anything with certainty. Instead, the power of statistical inference derives from observing some pattern or outcome and then using probability to determine the most likely explanation for that outcome. Suppose a strange gambler arrives in town and offers you a wager: he wins $1,000 if he rolls a six with a single die; you win $500 if he rolls anything else – a pretty good bet from your standpoint. He then proceeds to roll ten sixes in a row, taking $10,000 from you. One possible explanation is that he was lucky. An alternative explanation is that he cheated somehow. The probability of rolling ten sixes in a row with a fair die is roughly 1 in 60 million. You can't prove that he cheated, but you ought at least to inspect the die. The null hypothesis and Type I and Type II errors are to be explored as well.

Regression Analysis

It allows us to analyze how one variable affects another. A large sample of weight versus height, plotted on a graph, looks like this:


Saying "weight increases with height" may not be very insightful. One step further is to "fit a line" that best describes the linear relationship between the two variables. Regression analysis typically uses a methodology called Ordinary Least Squares (OLS) to do this; it is best explained visually here, and further advanced techniques and concepts are here. Once we have an equation, how do we know whether the results are statistically significant or not?

The standard error is a measure of error in the coefficient computed for the regression equation. If we take 30 different samples of 20 people to arrive at the regression equation, then in each case the coefficient will reflect a value specific to that group, and from the central limit theorem we can infer that these values should cluster around the true association coefficient. With this assumption we can calculate the standard error for the regression coefficient.

One rule of thumb: a coefficient is likely to be statistically significant when it is at least twice the size of the standard error. T-statistic = observed regression coefficient / standard error. P-value = the chance of getting an outcome this extreme if there were no true association between the variables. R² = the total amount of variation explained by the regression equation, i.e. how much of the variation around the mean is due to height differences alone. When the sample size (degrees of freedom) gets large, the t-statistic distribution becomes similar to the normal distribution.
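To make the mechanics concrete, here is a minimal OLS sketch (not from the book) that fits slope, intercept and R² on a tiny made-up height/weight sample; the numbers are purely illustrative.

using System;
using System.Linq;

class OlsDemo
{
    static void Main()
    {
        // Illustrative data only: height in cm (x) and weight in kg (y).
        double[] x = { 150, 160, 165, 170, 175, 180, 185, 190 };
        double[] y = { 52, 58, 63, 66, 72, 75, 80, 86 };

        double xBar = x.Average(), yBar = y.Average();

        // OLS: slope = covariance(x, y) / variance(x); the intercept makes the line pass through the means.
        double slope = x.Zip(y, (xi, yi) => (xi - xBar) * (yi - yBar)).Sum()
                     / x.Sum(xi => (xi - xBar) * (xi - xBar));
        double intercept = yBar - slope * xBar;

        // R^2 = 1 - residual sum of squares / total sum of squares.
        double ssRes = x.Zip(y, (xi, yi) => Math.Pow(yi - (intercept + slope * xi), 2)).Sum();
        double ssTot = y.Sum(yi => Math.Pow(yi - yBar, 2));

        Console.WriteLine($"weight = {intercept:F1} + {slope:F2} * height, R^2 = {1 - ssRes / ssTot:F3}");
    }
}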

Top Seven Regression Mistakes

1.   Using regression to analyze a nonlinear relationship

2.   Correlation does not equal causation

3.   Reverse causality: in a statistical equation between A and B, where A affects B, it's entirely plausible that B also affects A.

4.   Omitted variable bias: this is about omitting an important variable from the regression equation

5.   Highly correlated explanatory variables (multicollinearity): Suppose we want to find the effect of illegal drug use on SAT scores. If we assess heroin and cocaine use as separate variables, using them individually may not yield results as good as a combined variable, because those who use cocaine may not use heroin and vice versa, so the individual data points may be too few to give correct results

6.    Extrapolating beyond the data: you cannot use the weight/height data to predict the weight of a newborn

7.    Data-mining with too many variables

There are two lessons in designing a proper regression model

1.   Figuring out what variables should be examined and where the data should come from is more important than the underlying statistical calculations. This process is referred to as estimating the equation, or specifying a good regression equation. The best researchers are the ones who can think logically about what variables ought to be included in a regression equation, what might be missing, and how the eventual results can and should be interpreted.

2.   Regression analysis builds only a circumstantial case. An association between two variables is like a fingerprint at the scene of the crime: it points us in the right direction, but it's rarely enough to convict (and sometimes a fingerprint at the scene of a crime may not belong to the perpetrator). Any regression analysis needs a theoretical underpinning. Why are the explanatory variables in the equation? What phenomena from other disciplines can explain the observed results? For instance, why do we think that wearing purple shoes would boost performance on the math portion of the SAT, or that eating popcorn can help prevent prostate cancer?

Blogging using Word 2013

[Smart Art diagram created in Word 2013]

I hope I'll get accustomed to Word 2013 & Windows 8 as my blogging tool from now on – it's super simple and more efficient than Windows Live Writer, though it still lacks a couple of things I hope MS can fill in soon. Word 2013's cloud embrace is remarkable; Office is progressing in the cloud direction, and that is good for consumers and MS. The diagram above uses Word 2013's 'smart art' feature, as I wanted to test my skills at creating a simple diagram and publishing it. After publishing, I realized that I may still need Windows Live Writer in some cases, so I installed the 2012 version and corrected the placement of the text below the diagram, since Word 2013 misses the following:
a preview of the post as it will appear in the browser/WordPress site, and source HTML editing. Hence the tilt to upend it may take yet another version, perhaps Word 2016?! Another juicy thing: Windows 8 Metro is solid and very likeable!!
