10 Machine Learning Blueprints You Should Know for Cybersecurity by Rajvardhan Oak

Author: Rajvardhan Oak
Language: eng
Format: epub
Publisher: Packt Publishing
Published: 2023-05-31T00:00:00+00:00


Word embeddings

The TF-IDF approach is what we call a bag-of-words approach in machine learning terms: each word is scored based on its presence, irrespective of the order in which it appears. Word embeddings are numeric representations of words assigned such that words that are similar in meaning have similar embeddings; that is, their numeric representations are close to each other in the feature space. The most fundamental technique used to generate word embeddings is called Word2Vec.
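To make this concrete, the following is a minimal sketch of training word embeddings, assuming the gensim library and a toy three-sentence corpus (the library choice, corpus, and hyperparameters here are illustrative, not prescribed by the text):

from gensim.models import Word2Vec

# A toy corpus; in practice this would be a large collection of tokenized text
corpus = [
    ["i", "went", "to", "walk", "the", "dog"],
    ["she", "went", "to", "walk", "the", "cat"],
    ["the", "dog", "chased", "the", "cat"],
]

# vector_size is the embedding dimension, window the context size (gensim 4.x names)
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

# Every word is now mapped to a dense numeric vector...
print(model.wv["dog"].shape)             # (50,)
# ...and words used in similar contexts should end up close together
print(model.wv.similarity("dog", "cat"))

On a toy corpus like this, the similarity scores are not meaningful; with a large corpus, semantically related words end up with nearby vectors.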

Word2Vec embeddings are produced by a shallow neural network. Recall that the last layer of a classification model is a sigmoid or softmax layer that produces an output probability distribution. This softmax layer operates on the features it receives from the pre-final layer; these features can be treated as high-dimensional representations of the input. If we chop off the classification layer, the remaining network can be used to extract these representations as embeddings.
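As a minimal sketch of this idea (assuming PyTorch; the model and sizes below are illustrative, not the book's code), consider a shallow model whose pre-final layer is read off directly once training is done:

import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64   # illustrative sizes

class ShallowWordModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Embedding(vocab_size, embed_dim)    # pre-final layer
        self.classifier = nn.Linear(embed_dim, vocab_size)   # classification (softmax) layer

    def forward(self, word_ids):
        # Logits over the vocabulary; a softmax would turn these into probabilities
        return self.classifier(self.hidden(word_ids))

model = ShallowWordModel()
# ... train the model on a word-prediction task ...

# "Chopping off" the classification layer: read the pre-final representation directly
word_id = torch.tensor([42])
embedding = model.hidden(word_id)   # shape: (1, 64)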

Word2Vec can work in one of two ways:

Continuous Bag of Words (CBOW): A neural network model is trained to predict a target word from the words that surround it. Input sentences are broken down to generate training examples. For example, if the text corpus contains the sentence I went to walk the dog, then X = I went to walk the and Y = dog would be one training example.

Skip-Gram: This is the more widely used technique. Instead of predicting the target word from its context, we train a model to do the reverse: given a word, predict the words that surround it. For example, if the text corpus contains the sentence I went to walk the dog, then our input would be walk and the output would be a prediction (or probabilistic prediction) of the words within a small window around it, typically a few words on each side. Because words that appear in similar contexts produce similar prediction targets, the model learns to generate similar embeddings for similar words. After the model is trained, we pass the word of interest as input and use the features of the pre-final layer as our embedding. A short sketch contrasting the two training setups follows this list.
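The difference between the two setups is easiest to see in the training pairs they generate. The following plain-Python sketch (the window size and variable names are illustrative) builds both kinds of examples from the sentence used above:

sentence = ["i", "went", "to", "walk", "the", "dog"]
window = 2   # number of context words taken on each side

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW: the surrounding words are the input, the target word is the label
    cbow_pairs.append((context, target))
    # Skip-gram: the target word is the input, each surrounding word is a label
    skipgram_pairs.extend((target, c) for c in context)

print(cbow_pairs[3])        # (['went', 'to', 'the', 'dog'], 'walk')
print(skipgram_pairs[:3])   # [('i', 'went'), ('i', 'to'), ('went', 'i')]

In both cases, the embedding we keep afterwards comes from the trained network's pre-final layer, as described earlier.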


