Andres Perez
Aug 21, 2013
as mentioned by abhi in his previous post, ganitha is the first open source implementation of machine learning and statistical techniques on scalding.
scalding is the core API that we use for all our software at tresata and we have quite successfully implemented machine learning algorithms in the same framework. as most of you will already know, scalding is a scala DSL for cascading and was originally open sourced by twitter. both scala and cascading are integral components of the technology architecture at tresata.
the core idea behind ganitha was to make complex pieces of MapReduce logic available in a much cleaner, simpler and more powerful abstraction that allows you to run real world algorithms at massive scale.
our plan is to continue to add to this library and starting today, we welcome contributions from the OSS community…
the integration of mahout vectors into scalding was driven by our need to have sparse vectors available in scalding with compact in-memory and serializable representations. additionally, mahout has some nice routines for turning data into vector representations, and we would like to be able to use the output of those routines in scalding. to make mahout vectors usable in Scala/Scalding we did the following:
pimp mahout vectors
we used the pimp-my-library pattern in scala to make mahout vectors more friendly to use (see RichVector.scala). note that we decided to not implement RichVector as an IndexedSeq, but rather as an Iterable, for our first iteration.
we also didn’t implement IterableLike (or IndexedSeqLike) and CanBuildFrom, which would allow operations like vector.take(3) and vector.map(f) to return new mahout vectors.
the reason for this was that we were not happy with the interaction between the builders and the pimp-my-library pattern. we might still add these features in the future.
as an alternative, we provided the vectorMap methods that return mahout vectors. thanks to RichVector, you can now do things like:
scala> import com.tresata.ganitha.mahout._
scala> import com.tresata.ganitha.mahout.Implicits._
scala> // create a sparse vector of size 6 with elements 1, 3 and 5 non-zero
scala> val v = RichVector(6, List((1, 1.0), (3, 2.0), (5, 3.0)))
v: org.apache.mahout.math.Vector = {5:3.0,3:2.0,1:1.0}
we can see it’s using a mahout RandomAccessSparseVector with v.getClass. likewise, we can create a dense vector with 3 elements as follows:
scala> val v1 = RichVector(Array(1.0,2.0,3.0))
v1: org.apache.mahout.math.Vector = {0:1.0,1:2.0,2:3.0}
we can also perform basic math and vector operations (map, fold, etc.) on the vectors. elements inside the vectors can be accessed and set (since mahout vectors are mutable); however, setting them in place is not encouraged.
scala> (v + 2) / 2
res1: org.apache.mahout.math.Vector = {5:2.5,4:1.0,3:2.0,2:1.0,1:1.5,0:1.0}
scala> v.map(x => x * 2).sum
res2: Double = 12.0
scala> v.fold(0.0)(_ + _)
res3: Double = 6.0
scala> v(3)
res4: Double = 2.0
scala> v(3) = 3.0
scala> v
res5: org.apache.mahout.math.Vector = {5:3.0,3:3.0,1:1.0}
scala> v * v
res6: org.apache.mahout.math.Vector = {5:9.0,3:9.0,1:1.0}
the nonZero method provides access to the non-zero elements as a scala Iterable.
scala> v.nonZero.toMap
res7: scala.collection.immutable.Map[Int,Double] = Map(5 -> 3.0, 3 -> 3.0, 1 -> 1.0)
dense vectors can be converted to sparse, and vice versa.
scala> v1.toSparse.getClass
res8: java.lang.Class[_ <: org.apache.mahout.math.Vector] = class org.apache.mahout.math.RandomAccessSparseVector
the vectorMap operation provides access to the assignment operation on a mahout vector, but as a non-mutating operation (it creates a copy first).
scala> v.vectorMap(x => x * 2)
res9: org.apache.mahout.math.Vector = {5:6.0,3:6.0,1:2.0}
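to make the enrichment idea concrete without pulling in mahout, here is a minimal sketch of the same pattern. note the assumptions: SimpleVector is a hypothetical stand-in for a mutable java-style vector, and RichSimpleVector is an illustrative wrapper, not ganitha's actual RichVector. only the shape of the technique (an implicit wrapper that exposes the elements as an Iterable, plus a copy-first vectorMap) is taken from the description above.

```scala
object VectorEnrichment {
  // hypothetical stand-in for a mutable java-style vector (NOT mahout's Vector)
  final class SimpleVector(private val values: Array[Double]) {
    def size: Int = values.length
    def get(i: Int): Double = values(i)
    def set(i: Int, x: Double): Unit = values(i) = x
    def copy: SimpleVector = new SimpleVector(values.clone())
  }

  // the enrichment: an implicit wrapper that adds scala-friendly methods
  // to SimpleVector without touching the original class
  implicit class RichSimpleVector(val vec: SimpleVector) extends Iterable[Double] {
    // exposing the elements as an Iterable gives us map, fold, sum, etc.
    def iterator: Iterator[Double] =
      Iterator.range(0, vec.size).map(i => vec.get(i))

    // element access with scala's v(i) syntax
    def apply(i: Int): Double = vec.get(i)

    // (index, value) pairs for the non-zero elements only
    def nonZero: Iterable[(Int, Double)] =
      (0 until vec.size).map(i => (i, vec.get(i))).filter(_._2 != 0.0)

    // copy first, then assign in place on the copy, so the original is untouched
    def vectorMap(f: Double => Double): SimpleVector = {
      val result = vec.copy
      for (i <- 0 until result.size) result.set(i, f(result.get(i)))
      result
    }
  }
}
```

with `import VectorEnrichment._` in scope, calls such as `v.map(_ * 2).sum` or `v.vectorMap(_ * 2)` compile against a plain SimpleVector. map returns an ordinary scala Iterable, while vectorMap returns a new SimpleVector, which mirrors the trade-off described above of not implementing the collection builders.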
making serialization transparent
mahout’s vectors come with a separate class called VectorWritable that implements Writable for serialization within hadoop. the issue with this is that you cannot just register VectorWritable as a hadoop serializer and be done with it. if you did this then you would have to constantly wrap your mahout vectors in a VectorWritable to make them serializable. to make the serialization transparent we added VectorSerializer, a kryo serializer that defers to VectorWritable for the actual work.
all one has to do is register VectorSerializer with kryo, and serialization works in scalding. for example, if you are using a JobConf you can write:
VectorSerializer.register(job)
the same applies to a scalding Config (which is a Map[AnyRef, AnyRef]):
VectorSerializer.register(config)
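the delegation idea behind VectorSerializer can be sketched with nothing but java.io. this is an illustration under stated assumptions, not the real implementation: it uses neither kryo nor hadoop, and PlainVector, SimpleWritable, PlainVectorWritable and PlainVectorSerializer are all hypothetical names standing in for a mahout vector, hadoop's Writable, VectorWritable and VectorSerializer respectively. the point is only that the wrapping happens inside the serializer, so callers hand over bare vectors.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{DataInput, DataInputStream, DataOutput, DataOutputStream}

object DelegatingSerialization {
  // hypothetical vector type that knows nothing about serialization
  final case class PlainVector(values: Vector[Double])

  // hypothetical stand-in for a Writable-style contract
  trait SimpleWritable {
    def write(out: DataOutput): Unit
    def readFields(in: DataInput): Unit
  }

  // plays the role of the separate wrapper class that does the byte-level work
  final class PlainVectorWritable(var vector: PlainVector) extends SimpleWritable {
    def write(out: DataOutput): Unit = {
      out.writeInt(vector.values.size)
      vector.values.foreach(x => out.writeDouble(x))
    }
    def readFields(in: DataInput): Unit = {
      val n = in.readInt()
      vector = PlainVector(Vector.fill(n)(in.readDouble()))
    }
  }

  // plays the role of the serializer: it wraps and unwraps internally and
  // defers to the writable for the actual work
  object PlainVectorSerializer {
    def write(out: DataOutput, v: PlainVector): Unit =
      new PlainVectorWritable(v).write(out)

    def read(in: DataInput): PlainVector = {
      val writable = new PlainVectorWritable(PlainVector(Vector.empty))
      writable.readFields(in)
      writable.vector
    }
  }

  // serialize to bytes and back, never exposing the wrapper to the caller
  def roundTrip(v: PlainVector): PlainVector = {
    val bytes = new ByteArrayOutputStream()
    PlainVectorSerializer.write(new DataOutputStream(bytes), v)
    PlainVectorSerializer.read(
      new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
  }
}
```

in this sketch roundTrip takes and returns a PlainVector, and the writable wrapper never shows up in calling code. that is the "transparent" property the section describes: the wrapper still exists, but only the serializer has to know about it.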
with RichVector and VectorSerializer, it’s now much easier to integrate mahout vectors into our scala projects, and the pains that come with serialization are quelled.
in a forthcoming post, andy will outline our now open-source implementation of K-means where we make use of the additions outlined above!
i look forward to hearing how you make use of ganitha in your own work, and to your contributions to it.