How well have we mastered ‘the science of teams’?

Startups and established teams face tremendous pressure to deliver. The breakneck pace of change in languages and framework ecosystems, and the need to deliver code that is well thought out and written for maintainability and scalability, add another dimension to this pressure. But the factor that stands out most is the level of interaction and collaboration within the team and outside it. We manage this chaos with routine practices – recruiting good talent, enticing with good options, equipping people with the best infrastructure and software tools, training and orienting them in best practices, pair programming and team building to enhance group interactions – yet these may not be sufficient in the short or long term.

Personally, I have been through plenty of drab team-building activities – paintball, movie outings, charity activities, dinner and dances, laugh-it-through sessions, corporate self-help trainings, Da Vinci Code themed hunts on taxi hops, and so on. Take a moment and name the ones you have attended so far, and ask yourself: have they really improved your outlook on working collaboratively, boosted your output, or deepened your understanding of your peers and sneers?

A stream of questions arises. Does your team have the requisite dynamics, energy and camaraderie to act in unison and deliver what the company wants from it, orchestrated by the program/project lead? Is team cohesion and communication superb or subpar? Is there a practical, genuine and approachable way to rejuvenate and maintain the team spirit? Is our mastery of ‘the science of teams’ adequate to meet these challenges?

The answer to ‘the science of teams’ needn’t be high octane. Enter the board game.


You might think I’m being silly – a board game is a family activity for leisure and fun, full of name calling, commotion, laughs, benign barbs, teasing, strategizing and winning. But it’s more than that. There’s no sense of loss even when you lose the game; all that matters is the journey. It’s full of fun, camaraderie, surprise and ample talk, and that’s exactly what we need in team building – gentle yet reinforcing. Corporate-driven, high-octane team-building activities aren’t going to help in any way; if anything, they benefit only the event companies and drain the ‘must be spent’ budget.

When I joined this gaming enterprise’s data team – a close-knit group of developers, artists and learning designers – I was skeptical of what playing a board game could bring to a team that already develops 3D games (using Unity) for its bread and butter. I was wrong; it did wonders. We have board game time every fortnight, sandwiched between the last hours of work and the evening. We start our sessions around 5pm and end around 7pm, so team members’ routine commute is not disrupted. There’s a choice of drinks – beer, wine, coffee and sodas – and, depending on the budget, some healthy finger food and pizza.

The game master is the one who has gone through the rule book and is expected to be familiar with the game, either by playing it her/himself or by watching the game provider’s explanatory YouTube videos. We take turns being the game master to save time and to guide the rest of the team, extracting maximum fun out of the play. Ideally the board should support at least four to six players. If the team size is eight, the rest can form groups within. Those who are really not interested in participating can still be a spectator supporting a player – and those who were reluctant initially became active players subsequently.

What I noticed and witnessed for myself during game play: people talk freely (over their choice of drink and food), discuss their strategy, question, cheer and tease other players. Every dice throw carries the fun of anticipating the number that will catapult their position in the game. With constant conversation around rules, loot won and lost, twists and turns, the game takes you on a journey of fun, anticipation and interaction – and camaraderie and respect develop. This breaks the ice and starts conversation that flows beyond play time into work time. The fun is amplified when the winner decides who the next game master is and which game to play next time from the repertoire of board games in stock, sometimes even proposing a new board game to buy. Every year our budget is to acquire six new games in the $100 to $150 price range, with the older ones given away to employees. On a side note, the game master has to understand the rules of the game really well, read through the cryptic rule book and decipher the nuances to instill the gaming spirit – be the gatekeeper on errant players, gently nudge them to participate, benevolently whip the procrastinators into action and take the lead in steering the gameplay.

No wonder, being a 3D gaming company ourselves and emboldened by our physical board game play, we wanted to create our own board game. In the true spirit of startup experimentation, we recently released it on Kickstarter as Avertigos. I humbly encourage you to take a look and see whether you can turn your team-building activity into something genuinely fun, indoor, refreshingly new and authentic – not only with Avertigos, but with any board game of your choice. Be mindful of the game genre, and give it a try. It can do wonders. It did for us.

This could be one simple step in mastering the science of teams, and it certainly helped us on the aspects discussed in this article.


Every image is searchable with Inception & a Crawler in Google Cloud for $0

While attempting a Kaggle contest on Bosch, my curiosity was piqued by reverse image search; having built a face detection prototype web app a year earlier, deep learning was beckoning. It has been making huge strides with FPGAs, elastic GPU hardware and neural processors on the scene; it’s getting hotter by the day, and it was time to get my hands dirty. Deep learning visual frameworks have mushroomed, eclipsing well-established ones like OpenCV, which still powers niche use cases. Google’s TensorFlow is getting its own limelight, and I was curious how a reverse image search engine might work using it. While googling, I stumbled on the ViSenze and Tineye web services, which fill diverse needs: the former is an e-commerce reverse search engine, while the latter tells you where a given image is sourced or identified across the entire internet. They can squeeze out search results in 1 or 2 seconds, excluding the time to upload or extract an image from a URL. That is pretty impressive given Tineye has indexed more than a billion images.

How do we make our own ViSenze or Tineye or IQnect or GoFind? Githubbing, I found a TensorFlow-based reverse image search project (credits to this GitHub project & Akshayu Bhat) and realized it was a great way to start, but a real use case would make it even more compelling. I thought a commercial website selling apparel could be a good candidate for real-world images to index and test the capability of this reverse search. This experiment had a unique twist: having been a Windows aficionado using MS software development tools all along, TensorFlow forced me to switch to a Linux environment, as it was only available on Linux or Mac. (As of 29 Nov 16, TensorFlow finally added Windows support – too late for this experiment.) Being a Windows developer, I naturally gravitated to an Ubuntu desktop on Oracle VirtualBox; I had previously played around with the Ubuntu desktop, albeit just to get a hang of the GUI and some of the Linux tools, but had never done serious development on it.

Now, let’s get practical: set up the dev environment (I’m new to Linux and want to learn), spruce up the code from the fork, add a crawler, plus a commercially available API to detect whether an uploaded image (for reverse visual search) is safe and appropriate and to detect its content, while returning the nearest 12 items when a visual item is searched. My claim of $0 rests on leveraging the Google Cloud trial. Before you jump in to test drive, you may want to take a look at such an implementation running on Google Compute Engine @ http://visualsearch.avantprise.com/. The actual search takes 3 to 4 seconds on 70K images, whereas approximate search is a bit faster.

Set up and run the Visual Search Server

Get the latest Oracle VirtualBox here and install it on your Windows machine (mine is Windows 10 build 1439). Now download the Ubuntu 14.04.5 Trusty Tahr desktop image from osboxes.org to get the OS up and running; using the userid osboxes.org and the same as the password, you get the desktop up and running. Or just install it from scratch in VirtualBox, which is what I did (and suggest), providing an 80GB disk size so the VM has ample space to grow dynamically up to that limit. Make sure you install Guest Additions in the Ubuntu desktop VM instance; it is useful when you want to transfer files between Ubuntu and the host OS and also lets the display adjust flexibly. Do note that as I switch between office ethernet and home wireless, I have to change Adapter 1’s network adapter to wireless to get it going at home.


The Ubuntu desktop comes with Python 2.7.6, so you don’t need Anaconda or other Python distributions, and I’m not looking at dedicated virtual environments that would make this experiment long-winded. As for the development environment: I’m used to Visual Studio for C# and Python and WebStorm for NodeJS, so I wanted to stick to similar tools, with a slight difference – this time I went with Visual Studio Code, a great open source tool with fantastic extensions that works like a charm. Log into your Ubuntu desktop, launch a terminal and type python --version to check the version and ensure it is 2.7.6. Don’t forget to set the shared clipboard to bi-directional for this VM instance in VirtualBox. Get git, pip and fabric installed as follows:

sudo apt-get install git
sudo apt-get install python-pip
sudo pip install fabric
sudo apt-get install openssh-server

Ensure you have an RSA key created to connect to GCE and, if required, the local dev environment (also do an ssh localhost), using the following commands:

ssh-keygen -t rsa          # press Enter at each prompt
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys
ssh localhost 

Type exit and confirm the logout message is displayed. Now you’re all set; go ahead and clone the repo. In the terminal (under the home directory) type:

sudo git clone https://github.com/dataspring/TensorFlowSearch.git
cd ~/TensorFlowSearch
vi settings.py             # change username, etc. and save

Open up settings.py and change LOCALUSER to the user you set while creating the Ubuntu desktop VM, and optionally LOCALHOST to a specific IP address if 127.0.0.1 doesn’t work. With the code base on the desktop ready, we need to set up the development environment in the local Ubuntu desktop so that we can run, debug and change code. Fabric lets you run sudo commands either locally or remotely, plus it has tons of other features. With Python Fabric in place, run fabric, calling the setup function from the terminal:

sudo fab --list            # lists all available tasks
sudo fab localdevsetup

A couple of ENTER and Y key-presses later, this installs all the prerequisites for the Python development environment: TensorFlow, Fabric, Sqlite3, Visual Studio Code and SQLite Browser. If all goes well, run the crawler to get a few images from an e-commerce site (carousell.com), and then start the visual search server as follows:

sudo fab ShopSiteImages
sudo fab index
sudo fab server
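
For orientation, here is a minimal sketch of what such a fabfile might look like under Fabric 1.x. The task names mirror the commands above, but the bodies, package lists and the server entry point are illustrative assumptions, not the actual contents of the repo’s fabfile.py.

# fabfile.py -- illustrative sketch only, not the repo's actual fabfile
from fabric.api import env, local, sudo, task

env.hosts = ['127.0.0.1']   # LOCALHOST from settings.py
env.user = 'osboxes'        # LOCALUSER from settings.py

@task
def localdevsetup():
    """Install local development prerequisites (package list assumed)."""
    local('sudo apt-get update')
    local('sudo apt-get install -y python-dev sqlite3 sqlitebrowser')
    local('sudo pip install tensorflow flask numpy scipy requests')

@task
def server():
    """Start the Flask-based visual search web server (entry point assumed)."""
    local('python server.py')

@task
def hostsetup():
    """Run the same setup on the remote GCE host over ssh."""
    sudo('apt-get update')
    sudo('pip install tensorflow flask numpy scipy requests')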

Open a browser in your Ubuntu desktop and go to http://localhost/ – it should show a screen like the one below, and you can start searching.


Launch Visual Studio Code from the terminal with the command below; launching it this way gives VS Code the admin rights it needs for debugging to work properly:

sudo code --user-data-dir="~/.vscode-root"  

Install the Python extensions and you’re all set to change and play around with the code with nice debugging support! Once launched, point VS Code to the git directory @ ~/TensorFlowSearch to open and modify the code.

Detect content appropriateness and type – Clarifai to the rescue

I thought of including a safe-content check, which is vital for visual search since it involves user-uploaded or snapped images. Among the myriad video and image recognition services that offer unsafe-content detection, Clarifai is simple and has a free plan for testing and playing with its REST API. Navigate to their developer site, obtain an API key, and you’re all set. In this search form, the uploaded image is sent from Angular to the Clarifai API to check whether it is safe and appropriate, and the returned probability score is displayed on the search screen. Another API call is made to detect the content type. The code snippet used in the controller.js file (hosted in the Python Flask web app) is linked below; you may want to get your own API key, as the current key is part of the free tier and may get exhausted.

Code link: controller.js under the angular folder
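
The actual call lives in controller.js on the Angular side; for illustration, here is roughly the same check expressed in Python with Clarifai’s v2 client. The API key is a placeholder, the model names (‘nsfw-v1.0’, ‘general-v1.3’) are the publicly documented ones at the time, and the response fields may differ slightly from what the JS code reads – treat it as a sketch.

# Rough Python equivalent of the controller.js moderation check -- illustrative only.
from clarifai.rest import ClarifaiApp

app = ClarifaiApp(api_key='YOUR_API_KEY')     # placeholder; get your own free-tier key

def moderate_image(file_path):
    """Return an NSFW probability and the top predicted content tags for an image."""
    nsfw = app.models.get('nsfw-v1.0').predict_by_filename(file_path)
    general = app.models.get('general-v1.3').predict_by_filename(file_path)

    concepts = nsfw['outputs'][0]['data']['concepts']
    nsfw_score = next(c['value'] for c in concepts if c['name'] == 'nsfw')
    tags = [c['name'] for c in general['outputs'][0]['data']['concepts'][:5]]
    return nsfw_score, tags

score, tags = moderate_image('upload.jpg')
print('NSFW probability: %.2f, content tags: %s' % (score, ', '.join(tags)))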


Design and Implement a Simple Crawler

Getting images is what makes this simple crawler fun and useful. For the experiment, I selected carousell.com, which sells anything that can be snapped with your cellphone camera. It’s a great up-and-coming service that allows anyone to sell – their tagline is ‘Snap to Sell, Chat to Buy for FREE on the Carousell marketplace!’. It would be good if we could get images off their site, which is already meant for the public to browse and buy items – but how do we know what is offered, and how do we scrape the metadata and images? Well, I just downloaded their Android app and started looking at the underlying web traffic that feeds it, to decipher the contract, i.e. the API pattern that powers it. There are nifty ways to configure your Android phone so its internet traffic is proxied by Fiddler on a PC over WiFi; monitoring the ongoing traffic in Fiddler while using their app provides enough information to understand the API story behind it – how their wares are categorized, how metadata is designed and how images are served. With this info, you can quickly write a routine in Python to get images for our experiment and also define our own metadata to make the search worthwhile: upon performing a visual search, we not only present the nearest 12 items resembling the given image, but also display additional metadata – how much it costs, where it is available – and refer users to the actual e-commerce site for purchase if they intend to buy, facilitating the buying process.
This crawler hinges on the product categorization and page iteration technique implemented in the API to get images and metadata, which are then persisted in a local sqlite3 database for searching. The idea is to retrieve each image once, extract TensorFlow model features and discard the image, but keep the metadata. This avoids serving our own copies of the images: the service points to the image URL at the commerce site instead, avoiding egress costs from the cloud provider. Sqlite3 fits the bill by providing a simple data store, and this can be scaled depending on future requirements we can’t anticipate now. The crawler is designed to restart wherever it stopped, with a manual intervention to reset the following variables – – to facilitate re-crawling where it left off. A minimal sketch of the crawl loop follows the design list below.

Crawler Design

  1. Decide on the product collection number and pagination parameters that are part of the API (figured out from the API pattern)
  2. Start iterating over each collection, setting the returned result count and increasing the page count until the max-iteration count
  3. Issue a python requests.get and parse the returned JSON results to get metadata and fill the ‘sellimages’ table of the sqlite3 db
  4. Retrieve the image from the URL
  5. If and when this whole process is rerun, ensure any metadata and image already present are overwritten – crawler re-runs are idempotent as long as the API signature is not changed, in which case the crawler may also fail
  6. We assume only JPEG images are served from the API’s metadata URLs, and that is the case
  7. Uses simple Python modules: requests, json, sqlite3, urllib
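
As mentioned above, here is a minimal sketch of that loop. The endpoint URL, query parameters and JSON field names are placeholders (the real API pattern was reverse-engineered from the app), and the database file name is an assumption; the actual crawler in the repo differs in detail.

# Illustrative crawler loop -- endpoint, parameters and JSON fields are placeholders.
import requests
import sqlite3

db = sqlite3.connect('/home/deep/shopsite/sqllite3/shopsite.db')   # file name assumed
db.execute("""CREATE TABLE IF NOT EXISTS sellimages
              (id TEXT PRIMARY KEY, title TEXT, price TEXT, image_url TEXT)""")

API = 'https://example.com/api/collections/%d/products'            # placeholder URL

def crawl(collection, max_iter=10, page_size=40):
    for page in range(max_iter):
        resp = requests.get(API % collection,
                            params={'count': page_size, 'page': page})
        for item in resp.json().get('products', []):
            # INSERT OR REPLACE keeps re-runs idempotent (design point 5).
            db.execute('INSERT OR REPLACE INTO sellimages VALUES (?,?,?,?)',
                       (str(item['id']), item['title'], item['price'], item['image']))
            # Fetch the image once; features are extracted later and the file discarded.
            with open('/home/deep/shopsite/images/%s.jpg' % item['id'], 'wb') as f:
                f.write(requests.get(item['image']).content)
        db.commit()

crawl(collection=1)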


Indexer & Searcher

The gist of indexing images is to simply use TensorFlow by loading a pre-trained model – InceptionV3, trained on ImageNet and already available as a protobuf file, network.pb in our case. We then parse it to import the graph definitions and use them to extract ‘incept/pool_3:0’ features from each image. The indexer spits out chunks of these features and concatenates them based on the configured batch size, storing them as index files. KNN search is performed using scipy’s spatial functions.
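
Here is a condensed sketch of those two steps against the TensorFlow 1.x API of the time. The ‘incept/pool_3:0’ tensor name comes from the project; the JPEG input tensor name, file paths and the lack of batching are my assumptions, so treat this as a sketch rather than the repo’s actual indexer.

# Sketch of feature extraction with the pre-trained Inception graph plus a KNN lookup.
# Assumes TensorFlow 1.x; the JPEG input tensor name is the usual one for this graph.
import numpy as np
import tensorflow as tf
from scipy.spatial import distance

with tf.gfile.FastGFile('network.pb', 'rb') as f:      # pre-trained InceptionV3 protobuf
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name='incept')

def extract_features(image_paths):
    """Return an (n_images, 2048) array of 'incept/pool_3:0' features."""
    feats = []
    with tf.Session() as sess:
        pool3 = sess.graph.get_tensor_by_name('incept/pool_3:0')
        for path in image_paths:
            image_data = tf.gfile.FastGFile(path, 'rb').read()
            feat = sess.run(pool3, {'incept/DecodeJpeg/contents:0': image_data})
            feats.append(np.squeeze(feat))
    return np.array(feats)

def knn_search(query_feat, index_feats, k=12, metric='euclidean'):
    """Return the indices of the k indexed images nearest to the query feature."""
    dists = distance.cdist(query_feat.reshape(1, -1), index_feats, metric=metric)[0]
    return np.argsort(dists)[:k]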

In the next iteration of this article, I want to see which spatial distance metric performs best (many are available: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’). I’m also looking at how to incorporate newly available pre-trained models (.pb files) to see which one fares better for this KNN-search use case. One site that lists new pre-trained models is GradientZoo; you need to figure out how to generate a protobuf file from these model constant files, and the starting point is here.

Sqlite3 db

Essentially there are two tables keeping track of the ingested data: ‘indeximages’ to log crawler runs and ‘sellimages’ to hold metadata for each crawled image. You can view the database in the Ubuntu desktop – just launch sqlitebrowser and point it to the db file @ /home/deep/shopsite/sqllite3/
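
If you prefer the terminal to sqlitebrowser, the same database can be peeked at from Python; the db file name inside that folder is an assumption.

# Quick inspection of the crawler's SQLite database from Python.
import sqlite3

conn = sqlite3.connect('/home/deep/shopsite/sqllite3/shopsite.db')   # file name assumed
cur = conn.cursor()

# List the tables maintained by the crawler and indexer.
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
print([row[0] for row in cur.fetchall()])          # expect 'indeximages' and 'sellimages'

# Count the crawled items.
cur.execute("SELECT COUNT(*) FROM sellimages")
print('items crawled: %d' % cur.fetchone()[0])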


Take it to Cloud Heights – Setup in GCE

I claim that indexing, crawling and search cost $0, thanks to the generous $300 credit for trying out Google Cloud. The only catch is that GPU hardware availability is not yet mainstream there, unlike AWS, though Google announced cloud GPUs a few weeks ago. Fire up Google Cloud and set up an account, keying in your credit card; they explicitly say the card will never be charged automatically when the trial period ends, so it’s worth a try. Use the Quick Start Guide to Linux VM, and follow the screenshots to create an Ubuntu 14.04 server with 4 vCPUs, an 80 GB SSD and 15 GB RAM.

  1. Create your project
  2. Create a VM instance
  3. Select a zone (near your place), machine type (4 vCPUs, 15 GB) and boot disk (80 GB SSD)
  4. Allow HTTP and HTTPS traffic
  5. Click on Networking and choose an external IP that is static (so it retains the same IP on restarts)
  6. Click on SSH keys, navigate to ~/.ssh/, open the id_rsa.pub file we created earlier, copy its contents and paste them into SSH Keys
  7. You’ll end up with a VM created as follows

From your local Ubuntu desktop, launch the terminal and do:

ssh username@externalip

The username is the one from the key you copied into SSH Keys, and the external IP is the static IP you reserved. It should connect to the remote host; now log out.

Time to set up the GCE VM instance. Open the settings.py file on the local Ubuntu desktop, change HOST (to the static IP) and USER (the user assigned in SSH Keys) accordingly, and save. Now fire up fabric to do the setup for us on the remote host machine:

sudo fab live hostsetup

Once the setup is complete, ssh in and do a test run: crawl images, index them and start the web server. You can access the server by pointing your browser to http://<external ip>/. This confirms that everything works. Next is to stress test: open the settings.py file on the remote machine again and change the following to larger values: RESULT_STEPS, MAX_ITER, MAX_COLLECTION, BATCH_SIZE.

Now that the process is going to be long-running, you need to launch an ssh window and use the screen command, which allows processes to keep running uninterrupted even when ssh is disconnected. For those coming from the Windows command window, there’s a very nice tutorial explaining screen.

sudo apt-get install screen
screen -R crawlrun
cd ~/TensorFlowSearch
sudo fab ShopSiteImages

Ctrl+A followed by Ctrl+D detaches the screen session, after which you can log out. Once the crawling process is over, do the same for indexing, and then run another screen session for the web app so the search server remains available on the internet for all.

If the image file count is very large – in the millions – the best way to count the files in /home/deep/shopsite/images/ is not ls but rsync. Also, once an index run is completed, all images move to the /done folder.

rsync --stats --dry-run -ax /home/deep/shopsite/images/ /xxx

Another handy utility – Linux’s equivalent of Task Manager – to monitor resource utilization:

ps -eo pcpu,pid,user,args | sort -r -k1 | less
# or simply use
top

Get FileZilla and install it; it comes in handy for copying code to the Google VM later. Alternatively, you can use your own private GitLab project, which is free.

Future of Visual Search

What the community has to do next is take visual search to the next level. Some thoughts:
Just as we have mature Apache products like Solr, a similar open source product is the need of the hour – one that is robust enough to

  1. ingest images of any type and resolution, in batch and in real time
  2. capture frames at preset time intervals from continuous video streams
  3. crawl any site with a pluggable API engine
  4. store images and metadata in different cloud storage services using connectors/plugins
  5. support configurable pre-trained deep learning models for feature extraction from images
  6. store metadata in a Lucene store
  7. search visual images using KNN and other advanced ML methods
  8. offer faceted search on metadata
  9. etc.

Perhaps a combination of the likes of Apache Spark + Apache Solr + the above features + stream processing + ML/DL methods = Apache Visor – the best open source image search out there!

P.S.:
If you’re interested in a big test data generation framework for SQL, check out my GitHub page.

The Great Equations

Breakthroughs in Science from Pythagoras to Heisenberg, by Robert P. Crease, is vivid, entertaining and full of science for the inquisitive mind.

Mastering these equations well enough to explain them to general and scientific audiences is a great skill, one every science and engineering student should aspire to and keep in their intellectual repertoire.

The equations covered include:

  • The gold standard for mathematical beauty, Euler’s equation:
         e^(iπ) + 1 = 0

  • The most significant event of the 19th century, Maxwell’s equations:
        ∇ · E = 4πρ
        ∇ × B − (1/c) ∂E/∂t = (4π/c) J
        ∇ × E + (1/c) ∂B/∂t = 0
        ∇ · B = 0

  • The celebrity equation, by Einstein:
        E = mc²

 

User Interface and the art of seduction

I was seriously considering participating in a start-up challenge run by a local start-up accelerator for a large wealth management bank, built around an open-ended question about organizing the financial data deluge into organizable chunks. But with the question being open ended and no further information forthcoming, I decided to leave the fray, lest there was another already-identified start-up whose work this bank wanted to capitalize on – I’m not sure. Still, this interest led me to read a few resources on UI design; they’re great and give you a good head start if you need to design a seductive, meaningful and delightful interface and deliver it on time:

  1. Lean UX – Applying Lean Principles to Improve User Experience – Jeff Gothelf and Josh Seiden
  2. Seductive Interaction Design – Stephen P Anderson
  3. Refining Design for Business – Using Analytics, Marketing, and Technology to Inform Customer Centric Design – Michael Krypel
  4. Interface Design for Learning – Design Strategies for Learning Experiences  – Dorian Peters

Each book delves into a unique area and is a practical resource for conceiving, designing and delivering a great UI. Surely all would agree that “a man is only half of him, and the rest is his attire” – and the same goes for the UI of a software service.

Naked Statistics – what you need to understand about statistics

A fantastic and informative read in the nascent era of big data. Some excerpts, captured for a better understanding of statistics in prediction.

The reasons to learn statistics were best summarized as follows:

  • Summarize huge quantities of data
  • Make better decisions
  • Answer important social questions
  • Recognize patterns that can refine how we do everything from selling diapers to catching criminals
  • Catch cheaters and prosecute criminals
  • Evaluate the effectiveness of policies, programs, drugs, medical procedures and other innovations

Descriptive Statistics

Mode = the most frequently occurring value
Median = rearrange all numbers in ascending order and select the central value (the 50th percentile)
Mean = average
A better way is to use decile values: if you’re in the top decile of earners in the USA, your earnings exceed those of 90% of the population. Percentile scores are better than absolute scores. If 43 correct answers falls into the 83rd percentile, that student is doing better than most of his peers statewide; if he’s in the 8th percentile, he’s really struggling.
Measuring dispersion matters. If the mean score on the SAT math test is 500 with a standard deviation of 100, the bulk of students taking the test will be within one standard deviation of the mean, between 400 and 600. How many students do you think will score 720 or more? Probably not very many. The most important and common distribution in statistics is the normal distribution.
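
As a quick illustration of these ideas in Python (the SAT-like numbers are simulated to match the example above):

# Toy illustration of mean, median, percentiles and spread on SAT-like scores.
import numpy as np

np.random.seed(0)
scores = np.random.normal(loc=500, scale=100, size=10000)   # mean 500, sd 100

print('mean   : %d' % round(scores.mean()))
print('median : %d' % round(np.median(scores)))
print('90th percentile (top-decile cut-off): %d' % round(np.percentile(scores, 90)))

# About 68% of a normal distribution lies within one standard deviation of the mean.
print('share between 400 and 600: %.0f%%' % (100 * np.mean(np.abs(scores - 500) <= 100)))
print('share scoring 720 or more: %.1f%%' % (100 * np.mean(scores >= 720)))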


Deceptive Description

Statistical malfeasance has very little to do with bad math. If anything, impressive calculations can obscure nefarious motives. The fact that you’ve calculated the mean correctly will not alter the fact that the median is a more accurate indicator. Judgment and integrity turn out to be surprisingly important. A detailed knowledge of statistics does not deter wrongdoing any more than a detailed knowledge of the law averts criminal behavior. With both statistics and crime, the bad guys often know exactly what they’re doing.

Correlation

It measures the degree to which 2 phenomena are related to one another. There’s a correlation between summer temperatures and ice-cream sales. When one goes up, so does the other. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as a relationship between height and weight.


A pattern of dots scattered across a page is a somewhat unwieldy tool. If Netflix tried to make film recommendations by plotting ratings for thousands of films by millions of customers, the results would bury headquarters in scatter plots. Instead, the power of correlation as a statistical tool is that we can encapsulate an association between two variables in a single descriptive statistic: the correlation coefficient. Its value ranges from -1 to 1; a value close to 1 or -1 indicates a perfect positive or negative association, whereas 0 means no relationship at all. There is no unit attached to it.
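
A short sketch of the correlation coefficient on simulated height/weight data (the numbers are invented purely for illustration):

# Correlation coefficient on simulated height/weight data.
import numpy as np

np.random.seed(0)
height_cm = np.random.normal(170, 10, 1000)
weight_kg = 0.9 * height_cm - 90 + np.random.normal(0, 8, 1000)   # positively related, with noise

r = np.corrcoef(height_cm, weight_kg)[0, 1]
print('correlation coefficient: %.2f' % r)   # close to +1: strong positive association, unitless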

Basic Probability

The Law of Large Numbers (LLN) explains why casinos always make money in the long run: the probabilities associated with all casino games favor the house. A probability tree can help navigate some problems and decisions – the investment decision, or widespread screening for a rare disease. The Chicago police department has created an entire predictive analysis unit, in part because gang activity, the source of much of the city’s violence, follows certain patterns. In 2011, the New York Times ran the headline “Sending the Police before There’s a Crime”.

Problems with Probability

Assuming events are independent when they’re not: the probability of flipping two heads in a row is (1/2)^2, i.e. 1/4, whereas the probability of both engines of a jet failing during a transatlantic flight is not (1/100,000)^2, because the failures are not independent.
Not understanding when events ARE independent: in a casino, you’ll see people looking longingly at the dice or cards and declaring that they’re “due”. If the roulette ball has landed on black five times in a row, then surely now it must turn up red. No, no, no! The probability of the ball landing on a red number remains unchanged: 16/38. The belief otherwise is sometimes called “the gambler’s fallacy”. In fact, if you flip a coin 1,000,000 times and get 1,000,000 heads in a row, the probability of getting tails on the next flip is still 1/2. Even in sports, the notion of streaks may be illusory.

Clusters happen: a great exercise to show that rare events are possible. Suppose you’re in a class of 50 or 100 students – more is better. Everyone stands up and flips a coin; anyone who flips heads must sit down. Assuming we start with 100 students, roughly 50 will sit down after the first flip. Then we do it again, after which 25 or so are still standing, and so on. More often than not, there’ll be a student standing at the end who has flipped five or six tails in a row. At that point, I ask the student questions like “How did you do it?”, “What is the best training exercise for flipping so many tails in a row?” or “Is there a special diet?” This elicits laughter, because the class just watched the whole process unfold; they know the student who flipped six tails has no special talent. When we see an anomalous event like that out of context, we assume that something besides randomness must be responsible.
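
The classroom exercise is easy to simulate, which drives home the point that a streak of five or six tails needs no explanation beyond chance:

# Simulate the stand-up/flip exercise: how many tails in a row for the last student standing?
import random

def last_student_streak(class_size=100):
    standing, streak = class_size, 0
    while standing > 0:
        # Everyone still standing flips; heads sit down, tails stay standing.
        standing = sum(random.random() < 0.5 for _ in range(standing))
        if standing > 0:
            streak += 1
    return streak

print('tails in a row by the last student standing: %d' % last_student_streak())
# Typically 5 to 7 for a class of 100 -- pure chance, no special talent required.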

Reversion to the mean: have you heard about the Sports Illustrated jinx, whereby individual athletes or teams featured on the cover of Sports Illustrated subsequently see their performance fall off? The more statistically sound explanation is that teams and athletes appear on the cover after some anomalously good stretch (such as a twenty-game winning streak), and their subsequent performance reverts back to what is normal – the mean. This is the phenomenon known as reversion to the mean. Probability tells us that any outlier – an observation that is particularly far from the mean in one direction or the other – is likely to be followed by outcomes that are more consistent with the long-term average.

Importance of Data:

Selection bias: is your data collected from a sufficiently broad range, rather than a confined group? A survey of consumers in an airport, for example, is going to be biased by the fact that people who fly are likely to be wealthier than the general public.
Publication bias: positive findings are more likely to be published than negative findings, which can skew the results that we see.
Recall bias: memory is a fascinating thing – though not always a great source of good data. We have a natural impulse to understand the present as a logical consequence of things that happened in the past – cause and effect. In a study of the diets of breast cancer patients, the striking finding was that the women with breast cancer recalled a diet much higher in fat than what they actually consumed; the women without cancer did not.
Survivorship bias: if you have a room of people with varying heights, forcing the short people to leave will raise the average height in the room, but it doesn’t make anyone taller.

Central Limit Theorem:
For this to apply, sample sizes need to be relatively large (over 30 as a rule of thumb). A simulation sketch follows the list below.

1.   If you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean (regardless of what the distribution of the underlying population looks like)

2.   Most sample means will lie reasonably close to the population mean; the standard error is what defines “reasonably close”

3.   The CLT tells us the probability that a sample mean will lie within a certain distance of the population mean. It is relatively unlikely that a sample mean will lie more than two standard errors from the population mean, and extremely unlikely that it will lie three or more standard errors away.

4.   The less likely it is that an outcome has been observed by chance, the more confident we can be in surmising that some other factor is in play.
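
As a quick simulation of point 1, drawing samples from a deliberately non-normal, skewed population (the sample size and counts are arbitrary):

# Central limit theorem demo: sample means from a skewed population are roughly normal.
import numpy as np

np.random.seed(1)
population = np.random.exponential(scale=10, size=1000000)   # heavily skewed, mean ~10
sample_size = 50

sample_means = np.array([np.random.choice(population, sample_size).mean()
                         for _ in range(5000)])
standard_error = population.std() / np.sqrt(sample_size)

print('population mean      : %.2f' % population.mean())
print('mean of sample means : %.2f' % sample_means.mean())
print('standard error       : %.2f' % standard_error)
# Roughly 95% of sample means fall within two standard errors of the population mean.
print('share within 2 SE    : %.1f%%' %
      (100 * np.mean(np.abs(sample_means - population.mean()) <= 2 * standard_error)))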


Inference

Statistics cannot prove anything with certainty. Instead, the power of statistical inference derives from observing some pattern or outcome and then using probability to determine the most likely explanation for it. Suppose a strange gambler arrives in town and offers you a wager: he wins $1,000 if he rolls a six with a single die; you win $500 if he rolls anything else – a pretty good bet from your standpoint. He then proceeds to roll ten sixes in a row, taking $10,000 from you. One possible explanation is that he was lucky. An alternative explanation is that he cheated somehow. The probability of rolling ten sixes in a row with a fair die is roughly 1 in 60 million. You can’t prove that he cheated, but you ought at least to inspect the die. The null hypothesis and Type I and Type II errors are worth exploring as well.
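
For reference, the “1 in 60 million” figure is just (1/6) raised to the tenth power:

# Probability of rolling ten sixes in a row with a fair die.
p = (1.0 / 6) ** 10
print('probability : %.3g' % p)       # about 1.65e-08
print('one in %d' % round(1 / p))     # 60,466,176 -- roughly 1 in 60 million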

Regression Analysis

It allows us to analyze how one variable affects another. A large sample of weight versus height, plotted on a graph, looks like the figure below:


Saying the pattern is “weight increases with height” is not very insightful. One step further is to “fit a line” that best describes the linear relationship between the two variables. Regression analysis typically does this with a methodology called Ordinary Least Squares (OLS), which is best explained visually here; further advanced techniques and concepts are here. Once we have an equation, how do we tell whether the results are statistically significant or not?

The standard error is a measure of error in the coefficient computed for the regression equation. If we took 30 different samples of 20 people to arrive at the regression equation, the coefficient in each case would reflect that particular group, and from the central limit theorem we can infer that these coefficients should cluster around the true association coefficient. With this assumption we can calculate the standard error for the regression coefficient.

One rule of thumb: a coefficient is likely to be statistically significant when it is at least twice the size of the standard error. The t-statistic = observed regression coefficient / standard error. The p-value = the chance of getting an outcome as extreme as the one observed if there were no true association between the variables. R² = the total amount of variation explained by the regression equation, i.e. how much of the variation around the mean is due to height differences alone. When the sample size (degrees of freedom) gets large, the t-distribution becomes similar to the normal distribution.
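
A compact sketch of those quantities using statsmodels on simulated height/weight data; the data is invented and the library choice is mine, not the book’s.

# OLS regression of weight on height: coefficient, standard error, t-statistic, p-value, R^2.
import numpy as np
import statsmodels.api as sm

np.random.seed(2)
height = np.random.normal(170, 10, 500)
weight = 0.9 * height - 90 + np.random.normal(0, 8, 500)

X = sm.add_constant(height)            # adds the intercept term
model = sm.OLS(weight, X).fit()

coef, se = model.params[1], model.bse[1]
print('coefficient    : %.2f' % coef)
print('standard error : %.3f' % se)
print('t-statistic    : %.1f' % model.tvalues[1])   # roughly coef / se
print('p-value        : %.3g' % model.pvalues[1])
print('R-squared      : %.2f' % model.rsquared)
# Rule of thumb from the text: significant when the coefficient is at least ~2x its standard error.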

Top Seven Regression Mistakes

1.   Using regression to analyze a nonlinear relationship

2.   Correlation does not equal causation

3.   Reverse causality: in a statistical equation between A and B, where A affects B, it’s entirely plausible that B also affects A.

4.   Omitted variable bias: This is about omitting an important variable in the regression equation

5.   Highly correlated explanatory variables (multicollinearity): suppose we want to find the effect of illegal drug use on SAT scores and we assess heroin and cocaine use. Using these variables individually may not yield results as good as a combined one, because those who use cocaine may not use heroin and vice versa, so the individual data points may be few and may not give correct results.

6.    Extrapolating beyond the data: you cannot use the weight/height data to predict the weight of a newborn

7.    Data-mining with too many variables

There are two lessons in designing a proper regression model

1.   Figuring out what variables should be examined and where the data should come from is more important than the underlying statistical calculations. This process is referred to as estimating the equation, or specifying a good regression equation. The best researchers are the ones who can think logically about what variables ought to be included in a regression equation, what might be missing, and how the eventual results can and should be interpreted.

2.   Regression analysis builds only a circumstantial case. An association between two variables is like a fingerprint at the scene of a crime: it points us in the right direction, but it’s rarely enough to convict (and sometimes a fingerprint at the scene of a crime may not belong to the perpetrator). Any regression analysis needs a theoretical underpinning. Why are the explanatory variables in the equation? What phenomena from other disciplines can explain the observed results? For instance, why do we think that wearing purple shoes would boost performance on the math portion of the SAT, or that eating popcorn can help prevent prostate cancer?

Blogging using Word 2013

[SmartArt diagram created in Word 2013]

I hope I’ll get accustomed to Word 2013 on Windows 8 as my blogging tool from now on. It is super simple and more efficient than Windows Live Writer, but it still lacks a couple of things, and I hope MS can fill those gaps soon. Word 2013’s cloud embrace is remarkable; Office is progressing in the cloud direction, which is good for consumers and for MS. The diagram above uses Word 2013’s SmartArt feature, as I wanted to test my skills at creating a simple diagram and publishing it. After publishing, I realized I may still need Windows Live Writer just in case, so I installed the 2012 version and corrected the placement of text below the diagram, because Word 2013 misses the following: a preview of how the post will appear on the browser/WordPress site, and source HTML editing. So the tilt to upend it may take yet another version – perhaps Word 2016?! Another juicy thing: Windows 8 Metro is solid and very likeable!!
