Scalding solution in github

You think Scalding, and you think Scala code for word-count, page-rank, classic big-data MR jobs that benefit from distributed parallel computing on a Hadoop cluster. But this here ain't your grandfather's map-reduce :)

Problems of this nature were first formalized and solved by Dr. Harry Markowitz at my alma-mater the University of Chicago, way back in 1952! When Harry defended his PhD. thesis, his dissertation advisor MILTON FRIEDMAN is said to have remarked "This is not economics!". However, cooler heads prevailed & Harry became Dr. Markowitz!

Since then, Portfolio Selection has been studied to death - you could divvy up the $1000 under a second using the frontcon(..) function in Matlab, or using CPLEX, or AMPL. So then, why Scalding ? Well, the other day I was buttering a toast while my wife unwrapped a toy for our 5 month old. The toy needed 2AA batteries, and I needed a screwdriver to pry open the battery compartment. Well, I don't have a screwdriver lying around, so what did I do ?

Trick question! You see, I was buttering the toast, so I used the butter-knife to unscrew the battery compartment and thus disaster was averted and the kid got his toy. Moral of the story: Sometimes, you can use Scalding to do all sorts of weird math-y stuff its probably not intended to do. Its a lot of fun! Besides, trying age-old problems using shiny new tools like Scalding brings all sorts of insights to the table.

Lets define the problem formally:

**Divide $1000 among 4 stocks - Kroger, Abbott Labs, Dollar Tree and Monster Beverage, in the best possible way.**

Why these 4 stocks ? Well, you'll find out towards the very end of this article!

val cash = 1000.0 // money at hand

val error = 1 // its ok if we cannot invest the last dollar

val (kr,abt,dltr,mnst) = (27.0,64.0,41.0,52.0) // share prices

val stocks = IndexedSeq( kr,abt,dltr,mnst)

So the above Scala snippet simply says we have $1000 & 4 stocks priced at $27,$64, $41 and $52, sitting in a sequence.

Lets now invoke the weightedSum function in Scalding. This would give us a pipe containing all tuples that satisfy the constraint ie. all posible combinations that add up to $1000, within an error bound of $1.

weightedSum( stocks, cash,error).write( Tsv("invest.txt"))

Lets examine the content of the sink:

$ cat invest.txt

0 0 13 9

0 1 0 18

0 1 19 3

0 2 1 16

0 2 20 1

0 3 7 10

0 4 8 8

0 5 9 6

0 6 15 0

0 8 3 7

0 9 4 5

0 10 5 3

0 14 0 2

1 0 6 14

1 1 12 8

1 2 13 6

1 3 0 15

1 3 14 4

1 4 1 13

1 5 2 11

1 6 8 5

1 7 9 3

1 8 10 1

1 11 4 2

1 12 5 0

2 0 18 4

2 1 5 13

2 1 19 2

2 2 6 11

2 3 7 9

2 4 13 3

2 5 14 1

2 6 1 10

2 7 2 8

2 8 3 6

2 9 9 0

3 0 11 9

3 1 12 7

3 2 18 1

3 3 0 14

3 4 6 8

3 5 7 6

3 6 8 4

3 9 2 5

3 10 3 3

3 11 4 1

4 0 4 14

4 1 5 12

4 2 11 6

4 3 12 4

4 4 13 2

4 5 0 11

4 6 1 9

4 7 7 3

4 8 8 1

4 12 3 0

5 0 16 4

5 1 17 2

5 2 4 11

5 3 5 9

5 4 6 7

5 5 12 1

5 7 0 8

5 8 1 6

5 9 2 4

6 0 9 9

6 1 10 7

6 2 11 5

6 5 5 6

6 6 6 4

6 7 7 2

6 10 1 3

6 11 2 1

7 0 2 14

7 0 16 3

7 1 3 12

7 2 4 10

7 3 10 4

7 4 11 2

7 7 0 7

7 8 6 1

8 0 9 8

8 1 15 2

8 2 16 0

8 3 3 9

8 4 4 7

8 5 5 5

8 9 0 4

8 10 1 2

9 0 2 13

9 1 8 7

9 2 9 5

9 3 10 3

9 6 4 4

9 7 5 2

9 11 0 1

10 0 14 3

10 1 1 12

10 1 15 1

10 2 2 10

10 3 3 8

10 4 9 2

10 5 10 0

11 0 7 8

11 1 8 6

11 2 14 0

11 4 2 7

11 5 3 5

11 6 4 3

12 0 0 13

12 1 1 11

12 2 7 5

12 3 8 3

12 4 9 1

12 7 3 2

12 8 4 0

13 0 12 3

13 1 13 1

13 2 0 10

13 3 1 8

13 4 2 6

13 5 8 0

14 0 5 8

14 1 6 6

14 2 7 4

14 5 1 5

14 6 2 3

14 7 3 1

15 0 12 2

15 2 0 9

15 3 6 3

15 4 7 1

15 8 2 0

16 0 5 7

16 1 11 1

16 4 0 6

16 5 1 4

17 1 4 6

17 2 5 4

17 3 6 2

17 6 0 3

17 7 1 1

18 0 10 2

18 4 5 1

19 0 3 7

19 1 4 5

19 6 0 2

20 2 3 4

20 3 4 2

21 0 8 2

21 1 9 0

22 0 1 7

22 1 2 5

22 2 3 3

23 0 8 1

23 3 2 2

23 4 3 0

24 0 1 6

24 1 7 0

25 1 0 5

25 2 1 3

25 3 2 1

26 0 6 1

26 4 1 0

27 1 0 4

28 3 0 1

29 0 4 1

34 0 2 0

37 0 0 0

So it looks like we have 169 unique ways to divvy up the cash. Lets take one of the 169 tuples. How about (5,4,6,7) ?

So 5 shares of Kroger @ $27 = $135

4 shares of Abbott Labs @ $64 = $256

6 shares of Dollar Tree @ $41 = $246

7 shares of Monster @ $52 = $468

135+256+246+468 = $1001

But is (5,4,6,7) the best-posible tuple among the 169 choices ?

Lets define the best: We want the tuple that has the smallest risk. The 60-day 4x4 correlation matrix for these stocks is given below:

KR ABT DLTR MNST

0.448 0.177 0 0.017

0.177 0.393 0.177 0.237

0 0.177 0.237 0.06

0.017 0.237 0.06 0.19

What this says is rather simple: We have the variances along the diagonal and the covariances elsewhere. So Kroger has a variance of 0.448, and the 60-day correlation between Kroger and Abbott is 0.177, and so forth. ( Compute 60-day correlations here. )

Given a vector of weights W, the net risk imposed by this matrix = W'CW

To define these matrices & perform matrix transposes & multiplications, lets use COLT.

import cern.colt.matrix.DoubleFactory2D

val data = Array(Array(0.448, 0.177, 0.0, 0.017), Array(0.177, 0.393, 0.177, 0.237), Array(0.0, 0.177, 0.237, 0.06), Array(0.017, 0.237, 0.06, 0.19))

val corr = DoubleFactory2D.dense.make(data)

corr:

4 x 4 matrix

0.448 0.177 0 0.017

0.177 0.393 0.177 0.237

0 0.177 0.237 0.06

0.017 0.237 0.06 0.19

We now take each of the 169 tuples & compute their net risks. The tuple with the least risk is what we are after.

import cern.colt.matrix.linalg.Algebra

import cern.colt.matrix.DoubleFactory1D

import java.util.StringTokenizer

val alg = Algebra.DEFAULT

val file = scala.io.Source.fromFile("invest.txt").getLines.toList

// convert the tab-delimited weights to a row vector

def getWeights(s:String) = {

val stok = new StringTokenizer(s, "\t")

val weights = (1 to 4).map( x=> stok.nextToken.toDouble).toArray

DoubleFactory1D.dense.make(weights)

}

// compute risks per tuple and sort by risk

file.map(line=> {

val w = getWeights(line)

( line, alg.mult( alg.mult(corr,w), w))

}).sortBy(x=>x._2)

res11: List[(String, Double)] = List(

(1 0 6 14,56.776),

(4 0 4 14,56.824),

(

**2 1 5 13,57.544),**

'(4 1 5 12,58.552),

(2 2 6 11,59.646),

.....

(37 0 0 0,613.312)

So the best possible way to invest your $1000 is the (1,0,6,14) tuple.

But if you want to buy atleast 1 share of each stock, (2,1,5,13) is the way to go.

Ofcourse, the absolute worst thing you could do is (37,0,0,0) ie. put all your money in 37 shares of Kroger. With this worst tuple, you get no diversification and take on 10 times the risk of the best tuple.

So ( putting on my Jim Cramer hat here) I am going to recommend the following portfolio mix:

2 shares of Kroger : $54

1 share of Abbott : $64

5 shares of Dollar Tree : $205

13 shares of Monster: $676

Total : $1001

Go forth and invest!

Okay, enough of finance., back to CS.

How does all this work inside of Scalding ?

The secret sauce is obviously the algorithm behind weightedSum(...). This is a homegrown sophisticated-brute-force algorithm that I conjured up out of thin air :)

Seriously, this was the very first Scalding algorithm I coded up, and I have not found any reference to this algorithm elsewhere in the literature, so I am going to claim credit for it :))

Lemme walk you thru my algorithm: Say you want all possible ways to purchase 3 stocks, whose prices are ( $1, $2, $5), to sum to $10.

Make 3 pipes, one per stock -

Pipe 1 has $0, $1, $2, $3, $4, $5,$6,$7,$8,$9, $10

Pipe 2 has $0, $2, $4, $6, $8 , $10

Pipe 3 has $0, $5, $10

So I've just made pipes that hold multiples of the share price.

Now cross the pipes two at a time.

So crossing pipe1 and pipe2 gives us a pipe with 2 columns.

(0,0)

(0,2)

(0,4),

.....

(1,0)

(1,2)

(1,4)

....

.....

(10,10)

a total of 66 tuples.

Now create a column that holds the sum of the two columns, and throw away tuples that exceed 10.

I call this "progressive filtering"

This gets rid of bad tuples like (1,10), (2,9),(2,10),(3,8),....

Now take this pipe with good-tuples, and cross with the third pipe.

Once again, apply progressive filtering.

This gets rid of bad tuples like (1,9,5), (1,9,10),(2,8,5),(2,8,10),....

This combination of "cross-multiply with progressive-filter" is what makes weightedSum sophisticated.

But why is weightedSum(...) brute-force ?

Well, with 4 stocks , the smallest stock being the $27 Kroger, and $1000 at stake, we had 169 possible candidates.

Where did they come from ?

Well, 1000/27 ~ 40.

40^4 = 2,560,000

So in the worst case, we had a total of 2.5 MILLION fourples....but we managed to narrow them down to 169 valid candidates. ( Note that due to Progressive Filtering, we probably generate less than 2.5 million fourples....under a million ballpark )

In general, the problem scales like O( (cash/minWeight)^numStocks )

So it is certainly exponential, and hence the brute-force aspect of it.

However, given Scalding's awesome distributed capabilities, and the parallelized nature of our algorithm ( remember - 1 pipe per stock, so n stocks lead to n pipes crossed & progressively filtered in parallel ), we perform quite well....in fact, this particular problem benchmarks under 2 seconds! Now, as the number of stocks grow, or if we have more cash to play with, this leads to exponentially more bad tuples...but unlike in Scala where we'd simply run out of memory unless we resort to Dantzig, Scalding will happily chug along & compute the best possible tuple !

Mystery: You might be wondering why we picked the 4 stocks - Kroger, Abbott Labs, Dollar Tree and Monster Beverages. One of several criticisms against Portfolio Selection using mean-variance optimization techniques is that it isn't stable. The variance of stocks aren't fixed, and if you run this same algorithm a month from now, the correlation matrix would have surely changed, and hence the tuples. That does not mean it's a useless algorithm. In fact, while in grad school I interviewed with BlackRock, who were using what was essentially a souped-up Markowitz optimization for a $100B fund with 100s of stocks in them. "So won't the prices and matrices change everyday ? " I asked. Well, yes! So Blackrock runs the algorithm throughout the day and rebalances the portfolio twice a day - at 11am and 2pm !!!

For this exercise, I used Google Stock Screener to pick 4 stocks with a low beta ( risk) and high average volume that had correlation data available in AIStockCharts.

The lowest possible betas with corr data were KR,ABT,DLTR and MNST.

Mystery solved!

Scalding solution in github