Mastering Machine Learning with Spark 2.x by Alex Tellez
Author: Alex Tellez
Language: eng
Format: epub, mobi
Publisher: Packt
Published: 2018-02-26T13:46:27+00:00
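val MIN_TOKEN_LENGTH = 3
val toTokens = (minTokenLen: Int, stopWords: Array[String], review: String) =>
  review.split("""\W+""")                                        // split on runs of non-word characters
    .map(_.toLowerCase.replaceAll("[^\\p{IsAlphabetic}]", ""))  // lowercase and strip non-letters
    .filter(w => w.length > minTokenLen)                         // keep tokens strictly longer than minTokenLen
    .filter(w => !stopWords.contains(w))                         // drop stop words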
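To see what the tokenizer keeps and drops, here is a minimal sketch in plain Scala, with no Spark required. The definition is repeated so that the snippet runs on its own; the sample stop-word list and review string are made up for illustration and are not taken from the dataset.

```scala
val MIN_TOKEN_LENGTH = 3
val toTokens = (minTokenLen: Int, stopWords: Array[String], review: String) =>
  review.split("""\W+""")
    .map(_.toLowerCase.replaceAll("[^\\p{IsAlphabetic}]", ""))
    .filter(w => w.length > minTokenLen)
    .filter(w => !stopWords.contains(w))

// Hypothetical inputs, for illustration only
val sampleStopWords = Array("this", "that", "with")
val sampleReview = "This movie was GREAT, with 2 amazing twists!"

val tokens = toTokens(MIN_TOKEN_LENGTH, sampleStopWords, sampleReview)
println(tokens.mkString(", "))
// prints: movie, great, amazing, twists
```

Note that "was" is dropped because the length filter is strict (only tokens longer than three characters survive), and the digit "2" becomes an empty string after the non-letter characters are stripped, so it is dropped as well.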
With all the building blocks ready, we apply them to the loaded input data, adding a new column, reviewTokens, which holds the list of words extracted from each review:
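import org.apache.spark.sql.functions.udf
// Currying fixes the first two arguments, leaving a String => Array[String] function to wrap as a UDF
val toTokensUDF = udf(toTokens.curried(MIN_TOKEN_LENGTH)(stopWords))
movieReviews = movieReviews.withColumn("reviewTokens", toTokensUDF('reviewText))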
The reviewTokens column is a perfect input for the word2vec model. We can build it using the Spark ML library:
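import org.apache.spark.ml.feature.Word2Vec
val word2vec = new Word2Vec()
  .setInputCol("reviewTokens")
  .setOutputCol("reviewVector")
  .setMinCount(1)
// fit learns one vector per vocabulary word from the token lists
val w2vModel = word2vec.fit(movieReviews)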
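As a rough illustration of what the fitted model offers, the following sketch trains Word2Vec on a tiny in-memory dataset and then calls transform and findSynonyms. It assumes Spark 2.x or later is on the classpath and runs in local mode. The three toy token lists, the vector size of 10, and the seed are assumptions chosen only to keep the example small; vectors learned from so little data are not meaningful.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("word2vec-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical toy data standing in for the reviewTokens column
val toyReviews = spark.createDataFrame(Seq(
  Seq("great", "movie", "great", "acting"),
  Seq("terrible", "movie", "terrible", "plot"),
  Seq("great", "plot", "great", "cast")
).map(Tuple1.apply)).toDF("reviewTokens")

val toyModel = new Word2Vec()
  .setInputCol("reviewTokens")
  .setOutputCol("reviewVector")
  .setVectorSize(10)   // small vectors, for readability only
  .setMinCount(1)      // keep every token in this tiny corpus
  .setSeed(42L)
  .fit(toyReviews)

// transform averages the word vectors of each row into one document vector
val withVectors = toyModel.transform(toyReviews)
withVectors.select("reviewVector").show(truncate = false)

// findSynonyms returns the closest vocabulary words by cosine similarity
val synonyms = toyModel.findSynonyms("great", 2)
synonyms.show()
```

The same two calls apply to w2vModel and movieReviews from the text. The reviewVector column produced by transform is what downstream models would typically consume.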
The Spark implementation has several additional hyperparameters:
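setMinCount: This is the minimum number of times a token must appear in the corpus to be included in the model's vocabulary. It acts as another preprocessing step, so that the model does not train on very rare terms. The default value is 5; the code above lowers it to 1 so that every token is kept.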
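setNumIterations: Typically, a higher number of iterations leads to more accurate word vectors (think of these as the number of epochs in a traditional feed-forward neural network). The default value is 1. This setter name belongs to the RDD-based org.apache.spark.mllib.feature.Word2Vec; on the spark.ml Word2Vec used above, the equivalent setter is setMaxIter.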
setVectorSize: This sets the dimensionality of the word vectors. It can be any positive integer, and the default is 100. Many publicly available pretrained word vectors use larger sizes; however, the right choice is purely application-dependent.
setLearningRate: Just as with the regular neural network we covered in Chapter 2, Detecting Dark Matter - The Higgs-Boson Particle, this setting calls for some discretion on the part of the data scientist: with too small a learning rate, the model will take forever and a day to converge, whereas too large a learning rate risks a non-optimal set of learned weights in the network. The default value is 0.025. This setter name belongs to the RDD-based org.apache.spark.mllib.feature.Word2Vec; on the spark.ml Word2Vec used above, the equivalent setter is setStepSize.
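The hyperparameters above can be set together on the estimator, as in the following sketch. It uses the DataFrame-based spark.ml Word2Vec from the earlier code, where the iteration count and learning rate are configured through setMaxIter and setStepSize. The specific values are illustrative assumptions, not tuned settings from the text.

```scala
import org.apache.spark.ml.feature.Word2Vec

// Illustrative values only (assumptions, not recommendations)
val tunedWord2Vec = new Word2Vec()
  .setInputCol("reviewTokens")
  .setOutputCol("reviewVector")
  .setMinCount(5)      // ignore tokens seen fewer than 5 times
  .setMaxIter(3)       // number of training iterations
  .setVectorSize(200)  // dimensionality of each word vector
  .setStepSize(0.025)  // learning rate

// explainParams lists every parameter with its documentation, default, and current value
println(tunedWord2Vec.explainParams())
```

Printing explainParams is a quick way to confirm which defaults your Spark version actually uses before starting a long training run.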