Andres Perez
Jun 23, 2014
This post discusses the implementation of Naive-Bayes classification in Ganitha, Tresata’s open-source machine-learning library built on Scalding. A Naive-Bayes classifier is a probabilistic classifier that applies Bayes’ theorem. The model is “naive” because it assumes the attributes are conditionally independent of one another. Despite this simplifying assumption of feature independence, Naive-Bayes learning is surprisingly effective in a wide range of applications. Though not as powerful as decision-tree learning, it is considerably less computationally complex than many other classifiers, and in many cases the naive assumption has little impact on the quality of predictions.
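To make the independence assumption concrete: for a class C and features x_1, …, x_n, Bayes’ theorem combined with the naive assumption factors the class posterior into the class prior times a product of per-feature likelihoods,

P(C | x_1, …, x_n) ∝ P(C) × ∏_i P(x_i | C),

which is exactly what makes the model so cheap to train and evaluate.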
Naive-Bayes Classifying
Ganitha supplies three of the more popular forms of Naive-Bayes classifiers: Gaussian, Multinomial, and Bernoulli. In Gaussian Naive-Bayes, used for continuous data, we assume that the features associated with each class follow a normal distribution. The multinomial and Bernoulli event models deal with discrete features, a common example being the classification of a document given the words (features) that appear in its text. In this case, each word is assigned a score for each label, or class. In multinomial Naive-Bayes, each feature vector records the term frequencies of the words found in the document or class. We make the ‘bag-of-words’ assumption, in which a document is represented as a multiset of its words, disregarding grammar and word order. In Bernoulli Naive-Bayes, features represent binary occurrences, so the absence of a word/feature also affects the calculated probabilities.
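To make the contrast between the three event models concrete, here is a small illustrative sketch (the names are ours, not Ganitha’s internal code) of the per-feature log-likelihood each model contributes to a label’s score:

import scala.math.{ log, Pi, pow }

object NaiveBayesLikelihoodSketch {
  // Gaussian: a continuous feature x under a class with mean mu and variance sigma2
  def gaussianLogLikelihood(x: Double, mu: Double, sigma2: Double): Double =
    -0.5 * log(2 * Pi * sigma2) - pow(x - mu, 2) / (2 * sigma2)

  // Multinomial: a word observed `count` times, with (smoothed) probability theta
  // of drawing that word from documents of the class
  def multinomialLogLikelihood(count: Double, theta: Double): Double =
    count * log(theta)

  // Bernoulli: a binary occurrence; note that the absence of the word also
  // contributes to the score, unlike in the multinomial model
  def bernoulliLogLikelihood(present: Boolean, p: Double): Double =
    if (present) log(p) else log(1 - p)
}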
Each classifier consists of a training phase, in which an NBModel is constructed from the training set, and a classifying, or predicting, phase. In the classifying phase, each data point to be classified is given a probability (a log probability, in our case) for each label, and the label with the highest, or *maximum a posteriori*, probability is assigned to the data point.
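As a rough sketch of what the classifying phase does (illustrative only; these names are ours, not the NBModel API), the score for each label is its log-prior plus the sum of per-feature log-likelihoods, and the prediction is the label with the maximum score. Working in log space turns a product of small probabilities into a sum, which avoids floating-point underflow:

object MapClassifierSketch {
  // logPrior: label => log P(label)
  // logLikelihood: (label, feature index, feature value) => log P(feature | label)
  def classify[L](labels: Seq[L],
                  logPrior: L => Double,
                  logLikelihood: (L, Int, Double) => Double,
                  features: IndexedSeq[Double]): L =
    labels.maxBy { label =>
      logPrior(label) + features.zipWithIndex.map { case (x, i) => logLikelihood(label, i, x) }.sum
    }
}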
Support for Vector Types
Ganitha provides a simple framework for supporting additional vector types. By creating an object extending the VectorHelper or DenseVectorHelper class and implementing the supported methods, you can add support for a custom vector type from an outside library to use with Naive-Bayes. As an example, the code to add support for Jblas, using vectors backed by org.jblas.DoubleMatrix objects, is as follows:
import scala.math.abs
import org.jblas.{ DoubleMatrix => JblasVector }

object JblasVectorHelper extends DenseVectorHelper[JblasVector] {
  def plus(v1: JblasVector, v2: JblasVector) = v1.add(v2)
  def scale(v: JblasVector, k: Double) = { val v2 = new JblasVector(v.data.clone); v2.mmuli(k) }
  def toString(v: JblasVector) = v.toString
  def size(v: JblasVector) = v.rows
  def sum(v: JblasVector) = v.sum
  def dot(v1: JblasVector, v2: JblasVector) = v1.dot(v2)
  def map(v: JblasVector, f: Double => Double) = new JblasVector(v.data.clone.map(f))
  def l1Distance(v1: JblasVector, v2: JblasVector): Double = v1.distance1(v2)
  def euclidean(v1: JblasVector, v2: JblasVector): Double = v1.distance2(v2)
  def cosine(v1: JblasVector, v2: JblasVector): Double = {
    val dotProd = v1.dot(v2)
    if (dotProd < 0.00000001) 1.0 // don't waste calculations on orthogonal vectors or 0
    else {
      val denom = v1.norm2 * v2.norm2
      1.0 - abs(dotProd / denom)
    }
  }
  def iterator(v: JblasVector) = v.data.iterator
}
We’ve added support for some popular vector representations from open-source libraries, including Mahout, Breeze, Jblas, and Saddle.
You can see our code for implementing Naive-Bayes classifiers, and give it a run yourself, at our GitHub page. We welcome contributions to Ganitha, as well as suggestions for what machine-learning applications built on Scalding you’d like to see open-sourced next!