Andres Perez

Aug 21, 2013

as mentioned by **abhi** in his previous post, **ganitha** is the first open source implementation of machine learning and statistical techniques on **scalding**.

**scalding** is the core API that we use for all our software at tresata and we have quite successfully implemented machine learning algorithms in the same framework. as most of you will already know, **scalding** is a **scala** DSL for cascading and was originally open sourced by twitter. both **scala** and **cascading** are integral components of the technology architecture at tresata.

the core idea behind **ganitha** was to make complex pieces of MapReduce logic available through a much cleaner, simpler and more powerful abstraction that allows you to run real world algorithms at massive scale.

our plan is to continue to add to this library and starting today, we welcome contributions from the OSS community…

the integration of mahout vectors into scalding was driven by our need to have sparse vectors available in scalding with compact in-memory and serializable representations. mahout also has some nice routines for turning data into vector representations, and we would like to be able to use the output of those routines in scalding. to make mahout vectors usable in scala/scalding we did the following:

**pimp mahout vectors**

we used the **pimp-my-library** pattern in scala to make mahout vectors more friendly to use (see `RichVector.scala`). note that we decided to not implement `RichVector` as an `IndexedSeq`, but rather as an `Iterable`, for our first iteration.

we also didn’t implement `IterableLike` (or `IndexedSeqLike`) and `CanBuildFrom`, which would allow operations like `vector.take(3)` and `vector.map(f)` to return new mahout vectors.

the reason for this was that we were not happy with the interaction between the builders and the pimp-my-library pattern. we might still add these features in the future.
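to illustrate the trade-off, here is a minimal sketch of the pattern. it does not use mahout: `PlainVector`, `RichPlainVector` and `enrich` are hypothetical stand-ins, not ganitha's actual `RichVector` code. the wrapper is an `Iterable[Double]`, so collection methods become available on the wrapped class, but without custom builders an operation like `map` hands back a plain scala collection rather than a new vector.

```scala
import scala.language.implicitConversions

object RichVectorSketch {
  // hypothetical stand-in for a java-style vector class we do not control
  // (it plays the role of a mahout vector in this sketch)
  final class PlainVector(val values: Array[Double]) {
    def size: Int = values.length
    def get(i: Int): Double = values(i)
  }

  // the wrapper exposes the vector's elements as a scala Iterable[Double]
  final class RichPlainVector(val underlying: PlainVector) extends Iterable[Double] {
    def iterator: Iterator[Double] = Iterator.tabulate(underlying.size)(underlying.get)
  }

  // the implicit conversion that "pimps" PlainVector with the Iterable methods
  implicit def enrich(v: PlainVector): RichPlainVector = new RichPlainVector(v)

  // usage: map and sum are found through the implicit conversion
  def demo(): (Iterable[Double], Double) = {
    val v = new PlainVector(Array(1.0, 2.0, 3.0))
    val doubled = v.map(_ * 2) // an Iterable[Double], not a PlainVector
    (doubled, v.sum)
  }
}
```

getting `map` to return the vector type itself is what the builder machinery would be for, and that is the part we left out.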

as an alternative, we provided the `vectorMap` methods that return mahout vectors. thanks to `RichVector`, you can now do things like:

```
scala> import com.tresata.ganitha.mahout._
scala> import com.tresata.ganitha.mahout.Implicits._
scala> // create a sparse vector of size 6 with elements 1, 3 and 5 non-zero
scala> val v = RichVector(6, List((1, 1.0), (3, 2.0), (5, 3.0)))
v: org.apache.mahout.math.Vector = {5:3.0,3:2.0,1:1.0}
```

we can see it’s using a mahout `RandomAccessSparseVector` with `v.getClass`. likewise, we can create a dense vector with 3 elements as follows:

```
scala> val v1 = RichVector(Array(1.0,2.0,3.0))
v1: org.apache.mahout.math.Vector = {0:1.0,1:2.0,2:3.0}
```

we can also perform basic math and vector operations (map, fold, etc.) on the vectors. elements inside the vectors can be accessed and set, since mahout vectors are mutable, but this is not encouraged.

```
scala> (v + 2) / 2
res1: org.apache.mahout.math.Vector = {5:2.5,4:1.0,3:2.0,2:1.0,1:1.5,0:1.0}
scala> v.map(x => x * 2).sum
res2: Double = 12.0
scala> v.fold(0.0)(_ + _)
res3: Double = 6.0
scala> v(3)
res4: Double = 2.0
scala> v(3) = 3.0
scala> v
res5: org.apache.mahout.math.Vector = {5:3.0,3:3.0,1:1.0}
scala> v * v
res6: org.apache.mahout.math.Vector = {5:9.0,3:9.0,1:1.0}
```

the `nonZero` method provides access to the non-zero elements as a scala `Iterable`.

```
scala> v.nonZero.toMap
res7: scala.collection.immutable.Map[Int,Double] = Map(5 -> 3.0, 3 -> 3.0, 1 -> 1.0)
```

dense vectors can be converted to sparse, and vice versa.

```
scala> v1.toSparse.getClass
res8: java.lang.Class[_ <: org.apache.mahout.math.Vector] = class org.apache.mahout.math.RandomAccessSparseVector
```

the `vectorMap` operation provides access to the assignment operation on a mahout vector, but as a non-mutating operation (it creates a copy first).

```
scala> v.vectorMap(x => x * 2)
res9: org.apache.mahout.math.Vector = {5:6.0,3:6.0,1:2.0}
```
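the copy-then-assign idea can be sketched without mahout as follows. `MutableVector` and its `copy` and `assign` methods are hypothetical stand-ins for a mutable vector with an in-place assignment, and this is an illustration of the technique rather than ganitha's actual implementation.

```scala
object VectorMapSketch {
  // hypothetical mutable vector with an in-place assign (stand-in for a mahout vector)
  final class MutableVector(private val values: Array[Double]) {
    def copy: MutableVector = new MutableVector(values.clone())

    // applies f to every element in place and returns this same instance
    def assign(f: Double => Double): MutableVector = {
      var i = 0
      while (i < values.length) { values(i) = f(values(i)); i += 1 }
      this
    }

    def toList: List[Double] = values.toList
  }

  // enrichment that adds a non-mutating vectorMap: copy first, then assign on the copy
  implicit class RichMutableVector(val v: MutableVector) {
    def vectorMap(f: Double => Double): MutableVector = v.copy.assign(f)
  }
}
```

because the assignment runs on the copy, the original vector keeps its values and the result is still a vector rather than a generic scala collection.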

**making serialization transparent**

mahout’s vectors come with a separate class called `VectorWritable` that implements `Writable` for serialization within hadoop. the issue with this is that you cannot just register `VectorWritable` as a hadoop serializer and be done with it. if you did this then you would have to constantly wrap your mahout vectors in a `VectorWritable` to make them serializable. to make the serialization transparent we added `VectorSerializer`, a **kryo** serializer that defers to `VectorWritable` for the actual work.
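the delegation idea looks roughly like the sketch below. to keep it dependency-free it uses neither kryo nor hadoop: `SimpleWritable` mirrors the shape of hadoop's `Writable`, and `Point`, `PointWritable` and `PointSerializer` are hypothetical stand-ins for a mahout vector, `VectorWritable` and `VectorSerializer`. the real `VectorSerializer` plugs into kryo instead of exposing `write` and `read` like this.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataInputStream, DataOutput, DataOutputStream}

object WritableDelegationSketch {
  // same shape as hadoop's Writable, redeclared here so the sketch has no dependencies
  trait SimpleWritable {
    def write(out: DataOutput): Unit
    def readFields(in: DataInput): Unit
  }

  // plain value type that knows nothing about serialization (stand-in for a vector)
  final case class Point(coords: List[Double])

  // separate wrapper that does the byte-level work (stand-in for VectorWritable)
  final class PointWritable(var point: Point) extends SimpleWritable {
    def this() = this(null)
    def write(out: DataOutput): Unit = {
      out.writeInt(point.coords.size)
      point.coords.foreach(c => out.writeDouble(c))
    }
    def readFields(in: DataInput): Unit = {
      val n = in.readInt()
      point = Point(List.fill(n)(in.readDouble()))
    }
  }

  // serializer for Point itself: it wraps and unwraps internally,
  // so callers never have to touch PointWritable
  object PointSerializer {
    def write(out: DataOutput, p: Point): Unit = new PointWritable(p).write(out)
    def read(in: DataInput): Point = {
      val w = new PointWritable()
      w.readFields(in)
      w.point
    }
  }

  // serialize to bytes and back, the way a framework would between tasks
  def roundTrip(p: Point): Point = {
    val bytes = new ByteArrayOutputStream()
    PointSerializer.write(new DataOutputStream(bytes), p)
    PointSerializer.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
  }
}
```

the wrapping happens in exactly one place, which is what makes the serialization transparent to the rest of the code.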

all one has to do is register `VectorSerializer` with kryo, and serialization works in scalding. for example, if you are using a `JobConf` you can write:

```
VectorSerializer.register(job)
```

the same applies to a scalding `Config` (which is a `Map[AnyRef, AnyRef]`):

```
VectorSerializer.register(config)
```

with `RichVector` and `VectorSerializer`, it’s now much easier to integrate mahout vectors into our scala projects, and the pains that come with serialization are quelled.

in a forthcoming post, andy will outline our now open-source implementation of K-means where we make use of the additions outlined above!

i look forward to hearing how you make use of ganitha in your own work, and to your contributions to it.