TL;DR The number of times the average American has sex in a year, correlated to his/her age, sex and political affiliation, is given by the quadratic polynomial -

MALE | 210 * ((Age - 18)/71), Goodness of Fit 47%^{2} - 276* (Age-18)/71 + 128 |

Conservative Female | 210 * (Age - 18)/71), Goodness of Fit 62%^{2} - 311*(Age - 18)/71 + 133 |

Liberal Female | 210 * (Age - 18)/71), Goodness of Fit 51%^{2} - 290*(Age - 18)/71 + 124 |

__GSS Data Mining with NO TOOLS !!!__Imagine if you've been dropped off on a desert island, with the GSS Data & no statistical tools!

That's right - No R! No Matlab! No SAS! No Stata! No Excel!

Just the dataset & a plain old general purpose programming language.

(Which language ? Well, pick your favorite - Python, Ruby, C, whatever. I'll be using Scala)

Now, why would you want to artificially constrain yourself like this ?

Well, one of my pet peeves is tool-dependence, a.k.a tools-as-a-crutch.

Whenever I interview a candidate, I start with trivial questions, like

"What's the standard deviation of {1,3,5,7,9} ?"

Over 90% of the candidates cannot handle this !!!

They want a powerful stat tool to solve this trivial problem.

This is the familiar "meta knowledge" conundrum - the candidate "knows of" standard deviation, they just don't "know" standard deviation! They know of variance, but don't know variance! They've heard of correlations, but don't how how to compute one! Knowing of X while remaining ignorant about what X really is, is a problem that continues to confound the present generation.

Hence, we shall assume we have no tools whatsoever. We have been dropped into enemy territory and we'll have to build our weapons from scratch. We shall do exactly that!

Linear Regression ie. Ordinary Least Squares, is one of the simplest & most powerful tools we can resort to, in order to compute correlations.

So, the mean of a dataset is simply the sum of the elements divided by the size of the datset.

Now, we simply use this mean function to compute variance, correlation etc!

The variance is a measure of how far off each variate is from the mean. We simply subtract off the mean from each element, get rid of the negatives by squaring, and take the mean of this new dataset.

The variance is a measure of how far off each variate is from the mean. We simply subtract off the mean from each element, get rid of the negatives by squaring, and take the mean of this new dataset.

The standard deviation is simply the square root of the variance.

Scaling: When regressing a variable against another, its best if the two were comparable in magnitude. This is easily achieved by scaling. Find the min & max values the variable can take on, subtract the min and divide by the max. This ensures all values are constrained in the [0-1] range.

Scaling: When regressing a variable against another, its best if the two were comparable in magnitude. This is easily achieved by scaling. Find the min & max values the variable can take on, subtract the min and divide by the max. This ensures all values are constrained in the [0-1] range.

Normalization: Regression is vastly simplified if the dataset has zero mean, unit variance. This is achieved by subtracting off the mean & dividing by the standard deviation of the dataset.

Correlation: Given two normalized variables X & Y, we'd like to find how the variables behave in tandem. To measure this behavior, we take the product of the variates & finding the mean of this product dataset.

Regression: Given two normalized variables X & Y & their correlation corr, linear regression is simply Y = corr * X

And we're done! Well, not quite. We're done with the math. But data analysis isn't just math - its mostly extraction of the data, summary statistics, choosing the dependent & independent variates to build an OLS model, refining the model & so forth.

Histogramming is probably the simplest mechanism to look at an individual column. Contrary to popular opinion, you certainly don't need graphics expertise or plotting libraries to display a histogram. Suppose a column has qualitative data - we count the number of categories, number of instances of each category, and draw a scaled bar to represent each category.

And we're done! Well, not quite. We're done with the math. But data analysis isn't just math - its mostly extraction of the data, summary statistics, choosing the dependent & independent variates to build an OLS model, refining the model & so forth.

Histogramming is probably the simplest mechanism to look at an individual column. Contrary to popular opinion, you certainly don't need graphics expertise or plotting libraries to display a histogram. Suppose a column has qualitative data - we count the number of categories, number of instances of each category, and draw a scaled bar to represent each category.

To build an OLS model, we need choose two quantitative variables. Say we'd like to regress sex frequency against age. Well, "Age" is quantitative, but "Sex Frequency" is not. But we could easily convert the categories in "Sex Frequency" to a number. A simple substitution string, such as

will replace a "Weekly" entry with 52 (since there are 52 weeks in a year), "2-3 Times a Month" with 36 ( 3 * 12 months/year =36) and so forth.

Finally, we'd like filters so we can focus on a subset of the data. A simple filter like

is interpreted as - throw away all the rows except the ones where the person is a Female Conservative.

So now that we have the text-processing & OLS computing pieces ready, what's a good model ?

Consider

Sex Frequency = f(Age)

What sort of a function f would best fit ?

Certainly the frequency of sex peaks after puberty, and then begins dropping as a person ages. The drop isn't linear - once a person is in his/her 30s-40s, gets married,has children etc., various sociological & biological stressors set in, and the frequency drops quite drastically. So a straight line would be a bad model. However, a quadratic can suitably model this relationship.

Can OLS be used to construct a quadratic model ?

Certainly!

We seek -

To pull this off, we can move the x^2 on the left, and regress (y - x^2) against a * x + b

Further, once we scale & normalize the data, the "a" is simply the correlation, and the intercept "b"= 0.

*"Not at All=0&Weekly=52&Once or Twice=2&2-3 Times a Month=36&4+ Per Week=210&2-3 Per Week=156&Once a Month=12"*will replace a "Weekly" entry with 52 (since there are 52 weeks in a year), "2-3 Times a Month" with 36 ( 3 * 12 months/year =36) and so forth.

Finally, we'd like filters so we can focus on a subset of the data. A simple filter like

*"Sex=Female&Liberalism-Conservatism=Conservative"*is interpreted as - throw away all the rows except the ones where the person is a Female Conservative.

So now that we have the text-processing & OLS computing pieces ready, what's a good model ?

Consider

Sex Frequency = f(Age)

What sort of a function f would best fit ?

Certainly the frequency of sex peaks after puberty, and then begins dropping as a person ages. The drop isn't linear - once a person is in his/her 30s-40s, gets married,has children etc., various sociological & biological stressors set in, and the frequency drops quite drastically. So a straight line would be a bad model. However, a quadratic can suitably model this relationship.

Can OLS be used to construct a quadratic model ?

Certainly!

We seek -

*Sex Frequency = y = x^2 + a * x + b, with x = Age*To pull this off, we can move the x^2 on the left, and regress (y - x^2) against a * x + b

Further, once we scale & normalize the data, the "a" is simply the correlation, and the intercept "b"= 0.

The quadratic OLS model for a "Conservative Female" has the following parameters -

*$ scala GSSJob GSS-Data-for-Statwing-Prize.txt job "Age" "Sex Frequency in Last Year" "Not at All=0&Weekly=52&Once or Twice=2&2-3 Times a Month=36&4+ Per Week=210&2-3 Per Week=156&Once a Month=12" "Sex=Female&Liberalism-Conservatism=Conservative"*

*OLS Normalized Poly:*

mux: 0.413

sigmax: 0.246

xmin: 18.000

xscaler: 71.000

muy: 0.022

sigmay: 0.463

ymin: 0.000

yscaler: 210.000

corr: -0.788

rsq: 0.622

mux: 0.413

sigmax: 0.246

xmin: 18.000

xscaler: 71.000

muy: 0.022

sigmay: 0.463

ymin: 0.000

yscaler: 210.000

corr: -0.788

rsq: 0.622

The goodness-of-fit is 62.2 %, quite impressive for a sociological dataset.

Let's now deconstruct the relationship between "Age" and "Sex Frequency", using the parameters above -

Given an Age x, we first scale its magnitude

(x-18)/71

We then normalize the scaled value

[(x-18)/71 - 0.413]/0.246

Given a Sex Frequency y, we first scale its magnitude

(y-0)/210

We then subtract off the quadratic component of the scaled x

y/210 - ((x-18)/71)^2

This quantity is then normalized

(y/210 - ((x-18)/71)^2) - 0.022)/0.463

The normalized quantitities are related by the correlation

(y/210 - ((x-18)/71)^2) - 0.022)/0.463 = -0.788 * [(x-18)/71 - 0.413]/0.246

A little bit of algebra leads to

*Sex Frequency = 210 * ((Age - 18)/71)^2 - 311 * (Age - 18)/71 + 133*Voila!

Similarly, with a 51% goodness of fit, a

**"Liberal Female"**has an annual sex frequency given by

*Sex Frequency = 210 * ((Age - 18)/71)^2 - 290 * (Age - 18)/71 + 124*

For

**Males**, the political affiliation doesn't seem to impact the R^2 as much,

so with a goodness of fit = 47%, we have

*Sex Frequency = 210 * ((Age - 18)/71)^2 - 276 * (Age - 18)/71 + 128*So we've come very far. With zero tools at our disposal,we've constructed a pretty accurate quadratic model of human age vs annual frequency of sex. Our ancestors computed pi to hundreds of digits and computed logarithm tables by hand, so our accomplishments, while heroic, are certainly very doable and should serve as a model for what can be pulled off with paper, pencil & a bit of math.

The entire code ( OLS Model + Data Extraction ) is given below -