Search engine users (i.e. every single one of us) go to a search engine (google.com, bing.com, yandex.com, etc.) and search for a term.

Say you search for "red shoes". You're then shown a bunch of red-shoe ads. Hopefully you click on an ad, which takes you to a "landing page" where these shoes are sold. Maybe you purchase a pair - you put the shoes in the shopping cart, enter your credit card number & boom - you have converted! You saw the ad - an IMPRESSION - following which you gave the ad publisher a CLICK, then a CONVERSION.

There's a monetary cost attached to each IMPRESSION, each CLICK, each CONVERSION. When you hear search marketers talk about CPC, that's the COST PER CLICK. Then there's CPL - COST PER LEAD. Then RPC - REVENUE PER CONVERSION, RPClick - REVENUE PER CLICK & so forth...

Basically, anything you do on a search engine has $$$ attached to it.

Now, the thing you searched for - the "red shoes" - we call that a term.

There are two kinds of terms - headterms and tailterms.

Headterms are terms that yield clicks and revenue every single day. Why? Well, because someone somewhere on the planet wants to buy a red shoe, and the planet is big enough and red shoes are popular enough that they get searched every day. So the "red shoe" term is associated with a daily vector <NUMBER OF IMPRESSIONS, NUMBER OF CLICKS, NUMBER OF CONVERSIONS> with nonzero entries. That means over time you can compute interesting things like median RPC and median CPC, and then build Bidding Models for them. This is what I do. This is my bread and butter. Mostly, it involves math like Lagrangians and Curve Fitting and Brent Optimization - see "Blue Collar Math" for details on what I do. It's like 99% math, 1% engineering (or writing code).

However, what if, instead of searching for "red shoes", you search for "moulinex food processor 8000"? Now, how many people on the planet will want to buy this exact food processor on a given day? Probably zero. So that's a tailterm. If you have a tailterm, your daily vector <NUMBER OF IMPRESSIONS, NUMBER OF CLICKS, NUMBER OF CONVERSIONS> looks like <0,0,0>. But of course, there will be days on which some fucker actually searches for this moulinex, and you end up with a nonzero vector.

Now, let's look at the scale of this problem in real life. Say you are Macy's. You are going to have 5 MILLION terms in your portfolio - these are the things people search for, so you'd like a Macy's ad to show up when they search for them. Of these 5 MILLION, say some 80% are tailterms! That means people very rarely search for them, so their daily vector looks like <0,0,0>, though there will be the odd day when somebody does search for them and you end up with a nonzero vector.

Say Macy's wants to spend $100,000 per day on these 5 million terms. Since their CPC, RPC, etc. are well known for the 20% that are headterms, you can build a nice portfolio optimization model using Lagrange multipliers, and it'll work for those headterms. But what do you do about tailterms? 80% of 5 million is 4 million. You don't have any kind of model for those 4 million tails, because they mostly look like <0,0,0> every day, except sometimes they don't. What do you do?

Think of each tailterm as a bad hen. Instead of laying an egg a day like a good hen, these bad hens are very unpredictable. They lay maybe 3 eggs over 1 full year, and it is not known whether they will lay an egg tomorrow or not! Of course, you have historical data on their egg-laying behavior for as long as you want. So for, say, 1 full year, you know what the daily vector looks like. So you've got 366 daily vectors per bad hen. And you've got 4 million bad hens. Can you put enough hens in a coop so you get an omelette every day?

Some naive approaches (these approaches actually work, but not all the time :)

1. Say a bad hen laid an egg on Day 5, Day 17, and Day 274. The rest of the days in the year, it did not lay an egg. Model this distribution as a frequency of 366/3. This is a horrible model, because you are basically asserting the hen will lay an egg every 122 days - which isn't true at all. But hey, you've got to model it somehow. So let's begin by converting each hen's annual egg distribution into a single number, its frequency - i.e. how many days it takes to lay 1 egg.
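To make that concrete, here's a throwaway sketch (in Python, purely illustrative - the function name is mine) of collapsing a hen's annual egg history into that single frequency number:

```python
def days_per_egg(daily_eggs):
    """Collapse a hen's annual egg history into one number:
    the average number of days it takes to lay one egg."""
    total_eggs = sum(daily_eggs)
    if total_eggs == 0:
        return float("inf")  # never laid an egg all year
    return len(daily_eggs) / total_eggs

# A hen that laid eggs on Day 5, Day 17, and Day 274 of a 366-day year:
history = [0] * 366
for day in (5, 17, 274):
    history[day - 1] = 1

print(days_per_egg(history))  # 366 / 3 = 122.0
```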

Now think back to those algebra problems in middle school - if Tom takes 5 days to paint a wall and Will takes 3 days, how long will they take together? This is a simple variant of the harmonic mean - they will take 5*3/(5+3) = 15/8 days, i.e. about 1.9 days. Now add Oscar, who takes 4 days, into the mix. Then add Adam, who takes 7 days... soon, you can easily assemble a group of people who together take LESS THAN ONE DAY to paint the wall!
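The painters' arithmetic is easy to sanity-check in code - rates add, so the combined time is the reciprocal of the summed reciprocals (again a quick illustrative sketch, names mine):

```python
def days_together(*solo_days):
    """Combined time for painters working together.
    Rates add: combined rate = sum(1/d_i), so time = 1 / combined rate."""
    return 1.0 / sum(1.0 / d for d in solo_days)

print(days_together(5, 3))        # Tom + Will: 15/8 = 1.875 days
print(days_together(5, 3, 4, 7))  # + Oscar + Adam: ~1.08 days, almost there
```

One more painter tips the group under a day - which is exactly the effect we want from stacking hens.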

By the same reasoning, it is possible to group enough hens with different egg-laying frequencies together so the coop gives you an omelette every day - just stack enough hens so their combined harmonic frequency drops below one day per egg. From a programming standpoint, this is ideal, because you can update these harmonic frequencies incrementally as you add hens. So given 4 million hens with their frequencies, writing a function that spits out the optimal number of clusters and the hens in each cluster is not too hard. Should take you about an hour in Scala, tops.
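The post suggests Scala; here is the same idea sketched in Python instead. The greedy packing below is my assumption - the post doesn't commit to an algorithm, and greedily filling a coop until its combined days-per-egg drops below one is just the most obvious way to satisfy the constraint:

```python
def cluster_hens(frequencies, target=1.0):
    """Greedily pack hens into coops so each coop's combined
    days-per-egg (the harmonic combination) drops below `target`.
    `frequencies` = days-per-egg per hen, e.g. 122.0.
    Returns a list of coops, each a list of hen indices."""
    # Fast layers first, so each coop closes with as few hens as possible.
    order = sorted(range(len(frequencies)), key=lambda i: frequencies[i])
    coops, current, rate = [], [], 0.0
    for i in order:
        current.append(i)
        rate += 1.0 / frequencies[i]   # egg-laying rates add harmonically
        if rate > 1.0 / target:        # combined days-per-egg < target
            coops.append(current)
            current, rate = [], 0.0
    if current:                        # leftovers: fold into the last coop
        if coops:
            coops[-1].extend(current)
        else:
            coops.append(current)
    return coops

# 500 hens that each lay one egg every 122 days -> coops of 123+ hens
print([len(c) for c in cluster_hens([122.0] * 500)])
```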

2. If you think of it as a coin-toss problem, where laying the egg means Heads and not laying maps to Tails, each hen gives you a Binomial distribution, i.e. a B(n,p) where n = number of trials = 366 and p = number of heads / 366. So now you ask the question: given a bunch of binomial variables, can we say something about the distribution of their sum? Here's a rather interesting approach to the problem.
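On any single day, each hen is just a Bernoulli draw with its own p, so a coop's daily egg count is a sum of non-identical Bernoullis (a Poisson binomial). The quantity we actually care about - P(at least one egg today) - has a simple closed form, and is easy to cross-check by simulation. A sketch, with illustrative numbers of my choosing:

```python
import random

def p_omelette(ps):
    """Probability the coop yields at least one egg on a given day,
    assuming hens lay independently: 1 - P(every hen skips)."""
    miss = 1.0
    for p in ps:
        miss *= (1.0 - p)
    return 1.0 - miss

# 200 bad hens, each laying ~3 eggs per 366 days:
ps = [3 / 366] * 200
print(p_omelette(ps))  # ≈ 0.807

# Sanity-check with a simulated year:
random.seed(0)
year = [any(random.random() < p for p in ps) for _ in range(366)]
print(sum(year) / 366)  # should land near the number above
```

Note that 200 hens at 122 days-per-egg have a combined harmonic frequency well under one day (122/200), yet the coop still goes eggless about one day in five - the harmonic heuristic in approach #1 guarantees the *expected* eggs per day exceed one, not an egg *every* day.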

3. Finally, suppose you break the training data into 365 clustering days + 7 tuning days. You use the 365 days of data to build k random clusters, say k = 100. So you build 100 coops, into which you stuff roughly equal numbers of hens. A more intelligent clustering approach will simply use the harmonic mean variant in approach #1: build k clusters such that the harmonic frequency per cluster is < 1. Now look at Day 366. If a cluster does not yield an egg, move hens from one cluster to another until it does! Then look at Day 367 & repeat! Do this for all 7 days. At this point, you have fewer than k clusters, but these clusters actually yield an egg on every one of those 7 days. So you've basically overtrained them on 7 days of data, and now you hope you have some stable distribution that will yield an egg per coop on the 8th day. Feel free to increase 7 days to 30 days, etc. The idea is basically to use a year's worth of data to construct frequencies, then build random clusters so each cluster has a harmonic frequency less than 1, then decrease the number of clusters even further by moving hens from one cluster to another in order to guarantee an egg on every one of the 7 (or 30) tuning days. Then you hope everything works out well on your test set (mostly it does, sometimes it doesn't :)
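The tuning step can be sketched as follows. This is my own crude simplification: instead of moving individual hens, merge each barren coop wholesale into a productive one - the bluntest way to guarantee every surviving coop laid on every tuning day:

```python
def tune(coops, tuning_days):
    """Naive tuning step: for each tuning day, merge every coop that
    laid nothing that day into a coop that did.
    `coops` is a list of hen-id lists; `tuning_days` is a list of sets,
    each holding the ids of hens that laid an egg on that day."""
    for layers in tuning_days:
        productive = [c for c in coops if any(h in layers for h in c)]
        barren = [c for c in coops if not any(h in layers for h in c)]
        if not productive:      # nobody laid today; nothing to merge into
            continue
        for i, coop in enumerate(barren):
            # Round-robin the barren coops into the productive ones.
            productive[i % len(productive)].extend(coop)
        coops = productive
    return coops

# Hens 1 and 5 lay on tuning day 1; hens 1 and 3 on tuning day 2:
print(tune([[1, 2], [3, 4], [5]], [{1, 5}, {1, 3}]))  # [[1, 2, 3, 4, 5]]
```

The coop count only shrinks, which matches the post: you end with fewer than k clusters, each of which produced an egg on every tuning day.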

Any suggestions appreciated. Tailterm modeling is an interesting problem with practical monetary value. You can't make an omelette without breaking an egg - and this problem is definitely worth breaking a few eggs for!