One of my friends at a leading tech company was tasked with hiring a bunch of data scientists.

Curiously, so was I.

So we exchanged notes.

His job posting says -

---

We don't care about your education.

We don't care about your experience.

We don't care about the technologies you already know.

What we care about:

You need to be an expert in coding.

---

My job posting was in fact the polar opposite:

---

Must have a PhD in CS (Machine Learning)/Applied Statistics/Applied Math, or other STEM discipline.

Comfortable with statistical data analysis.

What we DON'T care about:

Your programming expertise.

---

Study the two posts carefully.

To tack on a cultural angle, one of them is this true blue All-American, "a programmer can do anything they set their mind to" approach.

The other is a European/Commonwealth approach - "programming can wait, first show me some credentials in Statistics."

Both of us sifted through a bunch of resumes.

But only one of us had a high SIFT risk.

Guess who?

Dr. David Lowe, a professor working in computer vision, published one of the most popular algorithms in the field in 1999. His algorithm, SIFT, aka Scale-Invariant Feature Transform, essentially extracts feature descriptors from an image. These descriptors are scale invariant, and they are also invariant to rotation and fairly robust to moderate affine distortion. So if the next image was a copy of the previous one, but rotated by 30 degrees, slightly sheared and half the size, the feature descriptors would still work. They'd be able to tag it accurately as the same image they've previously identified.

If there's one piece of insight I'd like you to take away from this fluff piece, it is this - study SIFT!

It's a wonderful algorithm and you'll be doing yourself a huge favor trying to understand its innards.

Feature transforms are crucial not just to CV, but to ML, data science and even data scientists!

Think about a bunch of red and blue points evenly spaced out on the circumference of a circle.

So you have a red point, then a blue, then red, then blue & so on, evenly laid out along the circle.

Can a linear classifier separate them?

It should be obvious that the answer is no.

There's simply no way to draw a straight line separating the reds from the blues.
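If the napkin doodle doesn't convince you, here is a rough sketch that brute-forces the claim. It assumes a hypothetical layout of 12 alternating points, and leans on one geometric fact: a straight line cuts a circle in at most two places, so the points on either side of it always form one contiguous arc. The names (`labels`, `best_linear_accuracy`) are mine, purely for illustration.

```python
# Assumption: 12 points evenly spaced on a circle, listed in angular order,
# even index = red, odd index = blue.
N = 12
labels = ["red" if i % 2 == 0 else "blue" for i in range(N)]


def best_linear_accuracy(labels):
    """Most points any straight line can label correctly.

    A line meets a circle in at most two places, so the points on one
    side of it always form a contiguous run (an arc). Trying every arc,
    with either color assigned to it, covers every linear classifier.
    """
    n = len(labels)
    best = 0
    for start in range(n):
        for length in range(n + 1):
            arc = {(start + k) % n for k in range(length)}
            for inside, outside in (("red", "blue"), ("blue", "red")):
                correct = sum(
                    1
                    for i, lab in enumerate(labels)
                    if lab == (inside if i in arc else outside)
                )
                best = max(best, correct)
    return best


best = best_linear_accuracy(labels)
print(f"best straight line gets {best} of {N} points right")
```

On this layout no arc gets anywhere near all 12 right, which is the whole point: in raw (x, y) space a straight line barely beats a coin flip.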

But what if we came up with a clever feature transform?

Let's make a toy version of this problem - say there were only a dozen points - so 6 red & 6 blue points.

If you have trouble visualizing this, just doodle a bunch of dots on a napkin & you should be convinced. To lay them out evenly on a circle would mean you'd have a 60 degree separation between two consecutive reds. But you'd have a 30 degree separation between a red and its neighboring blue.

This tiny bit of insight is enough to ensure a clean separation.

Simply map each point, i.e. its Cartesian (x, y) coordinates, to its polar coordinates (r, theta).

Then take the angle theta, in degrees, and obtain the remainder when divided by 60.

Boom! All red points get mapped to the line y=0, and the blue ones to the line y = 30.

Clearly a line separates the above two parallel lines, so a linear classifier suffices.

So what we've done is a straightforward feature transform.

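Given an (x, y), we apply the transform f(x, y) = atan2(y, x) % 60, with the angle measured in degrees. (I use atan2 rather than arctan(y/x) so that points with x = 0 don't blow up.)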

The result of this feature transform is then used by our linear classifier, and life is good.
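Here is a minimal sketch of that transform, assuming the same toy layout: 12 points on the unit circle, reds at 0, 60, 120, ... degrees and blues at 30, 90, 150, ... degrees. The function names and the threshold of 15 are my own choices for illustration, nothing more.

```python
import math

# Assumption: 12 points on the unit circle, 30 degrees apart,
# even index = red, odd index = blue.
points = []
for i in range(12):
    angle = math.radians(30 * i)
    color = "red" if i % 2 == 0 else "blue"
    points.append((math.cos(angle), math.sin(angle), color))


def feature(x, y):
    """Polar angle in degrees, folded into the range 0..59."""
    theta = math.degrees(math.atan2(y, x)) % 360.0
    # round() absorbs floating-point noise such as 209.99999999999997
    return round(theta) % 60


def classify(x, y, threshold=15):
    """A one-dimensional linear classifier: a single cut on the new feature."""
    return "red" if feature(x, y) < threshold else "blue"


features = [(feature(x, y), color) for x, y, color in points]
predictions = [classify(x, y) for x, y, _ in points]
print(features)
print(predictions == [color for _, _, color in points])
```

After the transform every red sits at 0 and every blue at 30, so any cut between the two values does the job. The rounding step is only there because these toy points sit exactly on multiples of 30 degrees; noisy real data would need a gentler treatment.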

Feature transforms apply across the board - not just to images in computer vision or feature vectors in ML. The other day, I ran into my former real estate agent while I was shopping for vegetables. We were chatting about this and that, and I was telling him how excited I was about my company's flexible work-from-home policy, which lets me spend more time with my family. Instantly he said, well, if you are going to be working from home more often, you will need an office space - I can show you a larger house in this new subdivision with dedicated home offices!

You see what happened there?

Feature transform!

I am not in the market for a home. I was just making small talk, but my real estate agent picked up on a few features, transformed them into the real estate space, and suddenly made himself a much more valuable commodity, hawking a new house out of the blue!

So that's SIFT risk.

**When you hire a person who is skilled in X, to do job Y, you are setting yourself up for SIFT risk.**

The person will feature transform your y in the Y space into a little x in the X space, so he can then become valuable working on x!

This happens all the time, and is actually quite subtle.

It isn't insidious or out of malice - just plain human nature.

I once worked with a junior data scientist who insisted on visualizing decision trees.

Now, decision trees have good explanatory power, and yes, they can be visualized.

But, and this is a big but, almost nobody builds decision trees so they can visualize them.

That's because, unlike textbook examples where the decision tree is shallow and has a handful of nodes, real-life decision trees tend to be deeply nested with say 300 nodes and a depth of 15. At that point, it is so messy it becomes quite useless to draw it out.

The tree is very useful - that's why you trained it. But it isn't very useful to actually draw the tree. It's sufficient to simply use the tree for its accurate predictive power. There's no need to visualize it. So why was this junior data scientist so keen on drawing the tree?

After a bit of questioning, it became quite obvious.

Before he "became" a data scientist, he was a front-end JavaScript developer!

You know - the guys who build web apps - who use D3 & React and Angular and jQuery - one of those guys. So what was happening here was a classic case of feature transform. He had taken a data science problem - constructing a decision tree - and decided that if visualizing the tree was the actual problem, he could move the problem from the data science space to a JavaScript space! Then he'd be free to mess around with D3 and jQuery, drawing edges and nodes with different colors on an HTML5 canvas! None of this would have any benefit to the actual machine learning problem, but he would be able to keep himself busy for a month or two doing what he loved best - dabbling with JavaScript - instead of figuring out the statistical properties of decision trees, like Gini impurity and entropy and information gain, which were all rather new and foreign to him.
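For what it's worth, those three quantities are less forbidding than they sound. Below is a small sketch using their textbook definitions on a made-up node of 8 labels; the toy data and the function names are hypothetical, and this is not how any particular library implements its splitting criterion.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: chance that two random draws (with replacement) disagree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy node: 4 "yes" and 4 "no" labels, split into two children.
parent = ["yes"] * 4 + ["no"] * 4
left = ["yes"] * 3 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3

print(gini(parent), entropy(parent), information_gain(parent, left, right))
```

A tree learner simply tries candidate splits and keeps the one that lowers impurity the most. That is the part worth a month of study, not the colors of the edges.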

Now, I don't blame the guy. He was new to data science, and he was quite eager. But the material is rather dense and theoretical, so he subconsciously figured, hey, if I just did this little feature transform here, the decision tree problem becomes a JavaScript problem, and JavaScript I can handle!

**When we don't know something, we anchor steadfastly to what we do know, and we try desperately to relate it to our current problem, so we can add value instead of helplessly spinning our wheels.**

I know this data scientist who almost drove a fledgling startup into the ground by convincing the CEO to invest in big-data infrastructure. When they brought him on board, he told them the company would need to provision compute clusters for the incoming petabytes and exabytes of data to be competitive with internet giants in the Valley. The company took him at his word and spent a few million on a spanking new Hadoop cluster with two dozen nodes and gigantic data capacity. When I looked at the actual data they had, it was clear to me that a used dual Xeon with a terabyte of disk and 128GB RAM, which retails for about $3K, i.e. less than a Mac Pro these days, was more than sufficient to run Spark MLlib in local mode and give them the results they wanted. Instead, they over-extended themselves, planning for a future that never came. Guess what that person did before he "became" a data scientist? Yes, he was a systems guy who specialized in hardware and racks and Cisco network switches and the like. He had taken a data science problem of assembling a simple compute server to run machine learning workloads, and feature transformed it into the space he understood best - that of exorbitant rack-mounted hardware!

It is not that hard to spot feature transforms.

Whenever data science is on the table, generally the topics of discussion should involve ML/applied stats - your decision trees, your ensembles, your logistic regression, your matrix factorization, your eigenvectors, things of that nature. I once witnessed a heated discussion on whether Avro or Parquet was the appropriate file format! This is feature transform at work - instead of solving the data science problem at hand, the "programmers", because that is what they are, have transformed your problem into a space they understand best - that of programming and file formats. Another good indication is being religious about ETL tools. If you see constant conflict on whether to use Pig or Spark or Scalding or Redshift or Phoenix or whatnot, these are all symptoms of feature transform.

ETL is a complete non-issue as far as the machine learning algorithm is concerned, yet I see companies investing massively in the so-called data pipeline, and finally using something really silly like naive Bayes on the end result... akin to buying a million-dollar purse to store a few crumpled one-dollar bills.

There are companies out there which put their data scientists through a release cycle - so your data scientist now has to show up at scrum and do story points and epics and check in his code and be conversant with git rebase and other exotic git incantations to be considered a valuable team player! Once again - a feature transform. Data science is exploratory and messy by definition - that's simply what it is. To tidy that up and insist on transforming data science into product engineering, with git repos and agile cycles, turns it into a whole new toothless beast. Furthermore, these same companies insist that data scientists write code with meaningful variable names and self-explanatory function names and so on. So no more grad(x), or min(nabla z).

Instead, you have to spell it all out - minimizeByGradientDescent(targetFunction: Double, firstDerivative: Double, xIntercept: Double)!

Otherwise how do you expect to check it in and get two ship-its and productionize your data insights?!

Now that you've seen a whole bunch of SIFT at work, I hope you recognize SIFT risk for what it is - a real hazard that prevents data scientists from being productive at doing data science, instead turning them loose on some tangentially related domain so they can project a facade of being productive.

**Be wary of SIFT risk.**