Advanced Analytics with Spark by Sandy Ryza

Author: Sandy Ryza
Language: eng
Format: epub
Publisher: O'Reilly Media
Published: 2017-06-16T04:00:00+00:00


import org.apache.spark.mllib.linalg.{Vectors, Vector => MLLibVector}
import org.apache.spark.ml.linalg.{Vector => MLVector}

val vecRdd = docTermMatrix.select("tfidfVec").rdd.map { row =>
  Vectors.fromML(row.getAs[MLVector]("tfidfVec"))
}

To find the singular value decomposition, we simply wrap an RDD of row vectors in a RowMatrix and call computeSVD:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

vecRdd.cache()
val mat = new RowMatrix(vecRdd)
val k = 1000
val svd = mat.computeSVD(k, computeU=true)

The RDD should be cached in memory beforehand because the computation requires multiple passes over the data. The computation requires O(nk) storage on the driver, O(n) storage for each task, and O(k) passes over the data.

As a reminder, a vector in term space means a vector with a weight on every term, a vector in document space means a vector with a weight on every document, and a vector in concept space means a vector with a weight on every concept. Each term, document, or concept defines an axis in its respective space, and the weight ascribed to the term, document, or concept is a length along that axis. Every term or document vector can be mapped to a corresponding vector in concept space. Many term and document vectors may map to the same concept vector; transforming a concept vector in the reverse direction yields the canonical term vector and document vector for that concept.

V is an n × k matrix in which each row corresponds to a term and each column corresponds to a concept. It defines a mapping between term space (the space where each point is an n-dimensional vector holding a weight for each term) and concept space (the space where each point is a k-dimensional vector holding a weight for each concept).

Similarly, U is an m × k matrix where each row corresponds to a document and each column corresponds to a concept. It defines a mapping between document space and concept space.
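As a rough sketch of what this mapping looks like in code, the snippet below projects a single term into concept space by multiplying a one-hot term-space vector by Vᵀ. It assumes the svd value computed earlier; termId is a hypothetical term index chosen for illustration.

import org.apache.spark.mllib.linalg.DenseVector

val V = svd.V  // n x k local matrix: term space -> concept space

// Hypothetical index of the term we want to project
val termId = 0

// One-hot vector in term space selecting that term
val termVec = new DenseVector(
  Array.tabulate(V.numRows)(i => if (i == termId) 1.0 else 0.0))

// Multiplying by V-transpose yields the term's k-dimensional
// representation in concept space
val conceptVec = V.transpose.multiply(termVec)

Because V is returned as a local (in-memory) matrix rather than a distributed one, this multiplication happens entirely on the driver.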

S is a k × k diagonal matrix that holds the singular values. Each diagonal element in S corresponds to a single concept (and thus a column in V and a column in U). The magnitude of each of these singular values corresponds to the importance of that concept: its power in explaining the variance in the data. An (inefficient) implementation of SVD could find the rank-k decomposition by starting with the rank-n decomposition and throwing away the n–k smallest singular values until there are k left (along with their corresponding columns in U and V). A key insight of LSA is that only a small number of concepts is important to represent the data. The entries in the S matrix directly indicate the importance of each concept. They also happen to be the square roots of the eigenvalues of MMᵀ.
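One way to put the singular values to use is as a diagnostic for choosing k: since the squared singular values measure each concept's share of the variance, we can check what fraction the top few concepts capture. This sketch assumes the svd value computed earlier; svd.s holds the singular values in descending order.

// Singular values, largest first
val singularValues = svd.s.toArray

// Squared singular values are proportional to the variance
// explained by each concept
val squares = singularValues.map(v => v * v)
val total = squares.sum

// Fraction of variance captured by the top 10 concepts
val topFraction = squares.take(10).sum / total

If topFraction is already close to 1.0, a much smaller k would likely suffice.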





