Advanced Analytics with Spark by Sandy Ryza
Author: Sandy Ryza
Language: eng
Format: epub
Publisher: O'Reilly Media
Published: 2017-06-16T04:00:00+00:00
import org.apache.spark.mllib.linalg.{Vectors, Vector => MLLibVector}
import org.apache.spark.ml.linalg.{Vector => MLVector}

// Convert each row's ml.linalg vector to the mllib.linalg type that RowMatrix expects
val vecRdd = docTermMatrix.select("tfidfVec").rdd.map { row =>
  Vectors.fromML(row.getAs[MLVector]("tfidfVec"))
}
To find the singular value decomposition, we simply wrap an RDD of row vectors in a RowMatrix and call computeSVD:
import org.apache.spark.mllib.linalg.distributed.RowMatrix

vecRdd.cache()
val mat = new RowMatrix(vecRdd)
val k = 1000
val svd = mat.computeSVD(k, computeU = true)
The RDD should be cached in memory beforehand because the computation makes multiple passes over the data. With n terms (the columns of the matrix) and k concepts, the computation requires O(nk) storage on the driver, O(n) storage for each task, and O(k) passes over the data.
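As a rough illustration of what computeSVD hands back, the sketch below runs the same call on a tiny made-up matrix and reports the shapes of the three factors. The local SparkSession, the object name SvdShapes, the helper factorShapes, and the 4 × 3 data are assumptions for illustration only; they are not part of the original pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object SvdShapes {
  /** Returns (rows of U, number of singular values, rows of V, columns of V). */
  def factorShapes(spark: SparkSession): (Long, Int, Int, Int) = {
    // Hypothetical data: 4 "documents" (rows) by 3 "terms" (columns)
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 0.0),
      Vectors.dense(4.0, 0.0, 1.0),
      Vectors.dense(0.0, 1.0, 1.0)
    ))
    rows.cache() // the decomposition makes several passes over the rows

    val mat = new RowMatrix(rows)
    val k = 2
    val svd = mat.computeSVD(k, computeU = true)

    // U is a distributed RowMatrix; s (a Vector) and V (a Matrix) are local objects
    (svd.U.numRows(), svd.s.size, svd.V.numRows, svd.V.numCols)
  }

  def main(args: Array[String]): Unit = {
    // Local session, assumed here only so the sketch can run on one machine
    val spark = SparkSession.builder().master("local[*]").appName("svd-shapes").getOrCreate()
    val (uRows, sSize, vRows, vCols) = factorShapes(spark)
    println(s"U has $uRows rows, s has $sSize values, V is $vRows x $vCols")
    spark.stop()
  }
}
```

Note that U comes back as a distributed RowMatrix, while s and V are held locally on the driver, which is consistent with the O(nk) driver storage mentioned above.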
As a reminder, a vector in term space means a vector with a weight on every term, a vector in document space means a vector with a weight on every document, and a vector in concept space means a vector with a weight on every concept. Each term, document, or concept defines an axis in its respective space, and the weight ascribed to the term, document, or concept means a length along that axis. Every term or document vector can be mapped to a corresponding vector in concept space. Every concept vector has possibly many term and document vectors that map to it, including a canonical term and document vector that it maps to when transformed in the reverse direction.
V is an n × k matrix in which each row corresponds to a term and each column corresponds to a concept. It defines a mapping between term space (the space where each point is an n-dimensional vector holding a weight for each term) and concept space (the space where each point is a k-dimensional vector holding a weight for each concept).
Similarly, U is an m × k matrix where each row corresponds to a document and each column corresponds to a concept. It defines a mapping between document space and concept space.
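One way to picture the mapping from term space to concept space is as a vector-matrix product with V. The following is a minimal sketch in plain Scala, without Spark; the object name ConceptSpace, the helper toConceptSpace, and the 4 × 2 matrix are hypothetical and chosen only so the arithmetic is easy to follow.

```scala
object ConceptSpace {
  private val r = 1.0 / math.sqrt(2.0)

  // Hypothetical V: 4 terms (rows) by 2 concepts (columns), with orthonormal columns.
  // These values are made up; they are not the output of a real decomposition.
  val v: Array[Array[Double]] = Array(
    Array(r, 0.0),
    Array(r, 0.0),
    Array(0.0, 1.0),
    Array(0.0, 0.0)
  )

  /** Maps an n-dimensional term-space vector to a k-dimensional concept-space vector: c = t * V. */
  def toConceptSpace(termVec: Array[Double], v: Array[Array[Double]]): Array[Double] = {
    require(termVec.length == v.length, "need one weight per term (row of V)")
    val k = v.head.length
    // Entry j sums each term weight times that term's loading on concept j
    Array.tabulate(k) { j =>
      termVec.indices.map(i => termVec(i) * v(i)(j)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    // A term-space vector with weight on the first two terms only
    val termVec = Array(1.0, 1.0, 0.0, 0.0)
    println(toConceptSpace(termVec, v).mkString(", "))
  }
}
```

Passing a vector with a single 1.0 at position i returns row i of V, which is why each row of V can be read as that term's representation in concept space. The same reading applies to rows of U and documents.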
S is a k × k diagonal matrix that holds the singular values. Each diagonal element in S corresponds to a single concept (and thus a column in V and a column in U). The magnitude of each of these singular values corresponds to the importance of that concept: its power in explaining the variance in the data. An (inefficient) implementation of SVD could find the rank-k decomposition by starting with the rank-n decomposition and throwing away the n–k smallest singular values until there are k left (along with their corresponding columns in U and V). A key insight of LSA is that only a small number of concepts is important to represent the data. The entries in the S matrix directly indicate the importance of each concept. They also happen to be the square roots of the eigenvalues of MMᵀ.
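The "throw away the smallest singular values" idea can be sketched in a few lines of plain Scala. The object name TruncateSvd, both helpers, and the sample values are hypothetical. The "energy retained" ratio (sum of the kept squared singular values over the sum of all squared singular values) is a common diagnostic that is assumed here for illustration; it is not something the text above prescribes.

```scala
object TruncateSvd {
  /** Keeps the k largest singular values, in descending order. */
  def topK(singularValues: Array[Double], k: Int): Array[Double] =
    singularValues.sortWith(_ > _).take(k)

  /** Fraction of the total squared singular-value mass kept by the top k concepts. */
  def energyRetained(singularValues: Array[Double], k: Int): Double = {
    val total = singularValues.map(s => s * s).sum
    topK(singularValues, k).map(s => s * s).sum / total
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical singular values from a full (rank-n) decomposition
    val s = Array(1.0, 4.0, 1.0, 2.0)
    println(topK(s, 2).mkString(", "))
    println(energyRetained(s, 2))
  }
}
```

In a full pipeline the corresponding columns of U and V would be dropped along with the discarded values; the sketch tracks only the singular values to keep the idea visible.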
