Mastering Machine Learning with Spark 2.x by Alex Tellez

Author: Alex Tellez, Date: March 21, 2018

Language: eng
Format: epub, mobi
Publisher: Packt
Published: 2018-02-26T13:46:27+00:00

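// Keep only tokens strictly longer than this many characters
val MIN_TOKEN_LENGTH = 3

// Split on non-word characters, lowercase, strip non-alphabetic characters,
// then drop short tokens and stop words
val toTokens = (minTokenLen: Int, stopWords: Array[String], review: String) =>
  review.split("""\W+""")
    .map(_.toLowerCase.replaceAll("[^\\p{IsAlphabetic}]", ""))
    .filter(w => w.length > minTokenLen)
    .filter(w => !stopWords.contains(w))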
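To make the behavior of the tokenizer concrete, here is a minimal sketch that runs the same function on a single string, without Spark. The object name, the stop-word list, and the sample review are hypothetical and exist only for illustration. Note that, because the length filter uses a strict comparison, three-letter words such as "was" are dropped as well.

```scala
object TokenizeExample {
  val MIN_TOKEN_LENGTH = 3

  // Same logic as the toTokens function in the text.
  val toTokens = (minTokenLen: Int, stopWords: Array[String], review: String) =>
    review.split("""\W+""")
      .map(_.toLowerCase.replaceAll("[^\\p{IsAlphabetic}]", ""))
      .filter(w => w.length > minTokenLen)
      .filter(w => !stopWords.contains(w))

  // Hypothetical stop-word list and review text, for illustration only.
  val sampleStopWords = Array("this", "that", "with")
  val sampleReview = "This movie was GREAT, I loved it! 10/10"

  val tokens: Array[String] =
    toTokens(MIN_TOKEN_LENGTH, sampleStopWords, sampleReview)

  def main(args: Array[String]): Unit =
    println(tokens.mkString(", ")) // prints: movie, great, loved
}
```

The digits in "10/10" become empty strings after the non-alphabetic characters are stripped, so the length filter removes them along with the short words.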

With all the building blocks ready, we apply them to the loaded input data, adding a new column, reviewTokens, which holds the list of words extracted from each review:

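import org.apache.spark.sql.functions.udf

// Fix the first two arguments so that the UDF takes only the review text
val toTokensUDF = udf(toTokens.curried(MIN_TOKEN_LENGTH)(stopWords))
movieReviews = movieReviews.withColumn("reviewTokens", toTokensUDF('reviewText))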

The reviewTokens column is a perfect input for the word2vec model. We can build it using the Spark ML library:

import org.apache.spark.ml.feature.Word2Vec

val word2vec = new Word2Vec()
  .setInputCol("reviewTokens")
  .setOutputCol("reviewVector")
  .setMinCount(1)  // keep every token, even those seen only once

val w2vModel = word2vec.fit(movieReviews)
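Once the model is fitted, it can be inspected and applied. The following is a minimal, self-contained sketch rather than the book's own code: it assumes a local SparkSession and uses a tiny, made-up set of pre-tokenized reviews in place of movieReviews. The object name, the `train` helper, the vector size, and the seed are all assumptions chosen for illustration.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object Word2VecUsageSketch {
  // Hypothetical, already-tokenized reviews standing in for movieReviews.
  def sampleReviews(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Seq("great", "movie", "loved", "acting"),
      Seq("terrible", "movie", "hated", "acting"),
      Seq("great", "story", "loved", "ending")
    ).toDF("reviewTokens")
  }

  // Hypothetical helper: fits a small model on the given token column.
  def train(reviews: DataFrame): Word2VecModel =
    new Word2Vec()
      .setInputCol("reviewTokens")
      .setOutputCol("reviewVector")
      .setVectorSize(10) // small vectors are enough for a toy corpus
      .setMinCount(1)
      .setSeed(42L)
      .fit(reviews)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word2vec-usage-sketch")
      .getOrCreate()

    val reviews = sampleReviews(spark)
    val w2vModel = train(reviews)

    // One row per vocabulary word, with columns "word" and "vector".
    w2vModel.getVectors.show(truncate = false)

    // Closest words to "great" by cosine similarity.
    w2vModel.findSynonyms("great", 3).show(truncate = false)

    // Adds reviewVector: the average of the word vectors in each review.
    w2vModel.transform(reviews).show(truncate = false)

    spark.stop()
  }
}
```

With a corpus this small the learned vectors and synonyms are not meaningful; the sketch only shows which calls are available on the fitted model.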

The Spark implementation has several additional hyperparameters:

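setMinCount: This is the minimum number of times a token must appear in the corpus to be included in the model's vocabulary. It acts as another preprocessing step, so that the model is not trained on very rare terms. The default value is 5; we set it to 1 above to keep every token.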

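setNumIterations: Typically, a higher number of iterations leads to more accurate word vectors (think of these as the number of epochs in a traditional feed-forward neural network). The default value is 1. Note that this is the setter name in the RDD-based spark.mllib API; on the spark.ml Word2Vec used above, the equivalent setter is setMaxIter.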

setVectorSize: This is where we declare the size of our word vectors. It can be any positive integer; the default is 100. Many publicly available pretrained word vectors favor larger sizes; however, the right choice is purely application-dependent.

setLearningRate: Just as with the neural networks we learned about in Chapter 2, Detecting Dark Matter - The Higgs-Boson Particle, the data scientist needs to exercise some discretion here: with too small a learning rate, the model will take a very long time to converge; with too large a learning rate, one risks a non-optimal set of learned weights in the network. The default value is 0.025. Note that this is the setter name in the RDD-based spark.mllib API; on the spark.ml Word2Vec used above, the equivalent setter is setStepSize.
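The hyperparameters listed above can be set in a single chained call. The sketch below assumes the spark.ml `Word2Vec` estimator used earlier, where the iteration count and learning rate are exposed as `setMaxIter` and `setStepSize`. The object name and the chosen values are hypothetical and not tuned for the movie review data.

```scala
import org.apache.spark.ml.feature.Word2Vec

object Word2VecParamsSketch {
  // Estimator with every setting left at the library default.
  val defaults = new Word2Vec()

  // Illustrative (not tuned) values for the hyperparameters discussed above.
  val tuned = new Word2Vec()
    .setInputCol("reviewTokens")
    .setOutputCol("reviewVector")
    .setMinCount(5)     // ignore tokens seen fewer than 5 times
    .setMaxIter(10)     // number of training iterations
    .setVectorSize(200) // dimensionality of each word vector
    .setStepSize(0.01)  // learning rate

  def main(args: Array[String]): Unit = {
    // Lists every parameter with its documentation, default, and current value.
    println(tuned.explainParams())
    println(s"default vectorSize = ${defaults.getVectorSize}")
    println(s"default stepSize   = ${defaults.getStepSize}")
  }
}
```

Constructing the estimator does not require a running SparkSession, so calling `explainParams()` is a cheap way to check defaults against the installed Spark version before training.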
