Hands-On Python Natural Language Processing by Aman Kedia Mayank Rasu
Author: Aman Kedia, Mayank Rasu
Language: eng
Format: epub
Tags: COM018000 - COMPUTERS / Data Processing, COM042000 - COMPUTERS / Natural Language Processing, COM037000 - COMPUTERS / Machine Theory
Publisher: Packt Publishing
Published: 2020-06-26T04:39:40+00:00
Exploring fastText
We discussed and built models based on the Word2Vec approach in Chapter 5, Word Embeddings and Distance Measurements for Text, wherein each word in the vocabulary has a vector representation. Word2Vec relies heavily on the vocabulary it was trained on: words encountered at inference time that are not present in that vocabulary are mapped to an unknown-token representation, and there can be a lot of such unseen words.
Can we do better than this?
In certain languages, sub-words, that is, the internal structure of words, carry important morphological information.
Can we capture this information?
To answer the preceding questions: yes, we can, and we will use fastText to capture the information contained in sub-words.
What is fastText and how does it work?
Bojanowski et al., researchers from Facebook, built on top of the Word2Vec Skip-gram model developed by Mikolov et al., which we discussed in Chapter 5, Word Embeddings and Distance Measurements for Text, by representing each word as a bag of character n-grams. Each of these n-grams has a vector representation, and a word's representation is the sum of the vectors of its character n-grams.
What are character n-grams?
Let's see the two- and three-character n-grams for the word language:
la, lan, an, ang, ng, ngu, gu, gua, ua, uag, ag, age, ge
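To make the idea concrete, here is a minimal sketch (not the library's internal code) of how such character n-grams can be extracted in plain Python. Note that the actual fastText implementation also wraps each word in the boundary markers < and > before extracting n-grams, which is omitted here for simplicity:

```python
def char_ngrams(word, min_n, max_n):
    """Return all character n-grams of word with lengths from min_n to max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams("language", 2, 3))
# ['la', 'an', 'ng', 'gu', 'ua', 'ag', 'ge', 'lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```

Each of these n-grams is assigned its own vector during training, and the vector for language is obtained by summing the vectors of its n-grams (together with a vector for the full word itself).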
fastText enables parameter sharing among words that have overlapping n-grams: the morphological information captured in the sub-words contributes to the embedding of the word itself. Also, when a word is missing from the training vocabulary or occurs only rarely, we can still build a representation for it, as long as its n-grams appear in other words.
The authors kept most of the settings similar to the Word2Vec model. They initially trained fastText on Wikipedia corpora in nine different languages. As of March 18, 2020, the fastText GitHub documentation states that fastText models have been built for 157 languages.
Facebook released the fastText library as a standalone implementation that can be directly imported and worked on in Python. Gensim offers its own fastText implementation and has also built a wrapper around Facebook's fastText library. Since we have focused on Gensim for most of our tasks, we will use Gensim's fastText implementation next to build word representations.
We will discuss only the parameters that are new to fastText, since most of them are shared with the Word2Vec and Doc2Vec models, and we will use the same common_texts data to explore fastText.
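As a minimal sketch of what this looks like with Gensim's FastText class (assuming Gensim 4.x parameter names; older releases use size instead of vector_size), the fastText-specific parameters min_n and max_n set the range of character n-gram lengths:

```python
from gensim.models import FastText
from gensim.test.utils import common_texts

# Train a toy fastText model on Gensim's bundled common_texts corpus.
model = FastText(
    sentences=common_texts,
    vector_size=100,   # dimensionality of the word and n-gram vectors
    window=5,
    min_count=1,
    min_n=2,           # smallest character n-gram length (fastText-specific)
    max_n=5,           # largest character n-gram length (fastText-specific)
    epochs=10,
)

# In-vocabulary lookup works just as it does with Word2Vec.
print(model.wv['computer'][:5])

# An out-of-vocabulary word still gets a vector, built from its character n-grams.
print('computation' in model.wv.key_to_index)   # False
print(model.wv['computation'][:5])              # vector assembled from sub-words
```

The last two lines illustrate the point made earlier: because the sub-word vectors are learned, even a word that never appeared in the training data receives a representation as long as its n-grams overlap with words that did.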