Hands-On Python Natural Language Processing by Aman Kedia Mayank Rasu
Author: Aman Kedia, Mayank Rasu
Language: eng
Format: epub
Tags: COM018000 - COMPUTERS / Data Processing, COM042000 - COMPUTERS / Natural Language Processing, COM037000 - COMPUTERS / Machine Theory
Publisher: Packt Publishing
Published: 2020-06-26T04:39:40+00:00
Exploring fastText
We discussed and built models based on the Word2Vec approach in Chapter 5, Word Embeddings and Distance Measurements for Text, wherein each word in the vocabulary has a vector representation. Word2Vec relies heavily on the vocabulary it was trained on: words encountered at inference time that are not present in that vocabulary are mapped to an unknown-token representation, and there can be a lot of such unseen words.
Can we do better than this?
In certain languages, sub-words, that is, the internal structure of words, carry important morphological information.
Can we capture this information?
To answer the preceding questions: yes, we can, and we will use fastText to capture the information contained in sub-words.
What is fastText and how does it work?
Bojanowski et al., researchers from Facebook, built on top of the Word2Vec Skip-gram model developed by Mikolov et al., which we discussed in Chapter 5, Word Embeddings and Distance Measurements for Text, by representing each word as a bag of character n-grams. Each of these n-grams has a vector representation, and a word's representation is the sum of the vectors of its character n-grams.
What are character n-grams?
Let's see the two- and three-character n-grams for the word language:
la, lan, an, ang, ng, ngu, gu, gua, ua, uag, ag, age, ge
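To make the idea concrete, here is a minimal sketch (not the library's internal code) of how such character n-grams can be extracted in plain Python. Note that the actual fastText implementation also wraps each word in the boundary markers < and > before extracting n-grams, which is omitted here for simplicity:

```python
def char_ngrams(word, min_n, max_n):
    """Return all character n-grams of word with lengths from min_n to max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams("language", 2, 3))
# ['la', 'an', 'ng', 'gu', 'ua', 'ag', 'ge', 'lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```

Each of these n-grams is assigned its own vector during training, and the vector for language is obtained by summing the vectors of its n-grams (together with a vector for the full word itself).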
fastText enables parameter sharing among words that have overlapping n-grams: the morphological information captured in the sub-words contributes to the embedding of the word itself. Also, when a word is missing from the training vocabulary or occurs only rarely, we can still build a representation for it, as long as its n-grams appear in other words.
The authors kept most of the settings similar to the Word2Vec model. They initially trained fastText on Wikipedia corpora in nine different languages. As of March 18, 2020, the fastText GitHub documentation states that fastText models have been built for 157 languages.
Facebook released the fastText library as a standalone implementation that can be directly imported and worked on in Python. Gensim offers its own fastText implementation and has also built a wrapper around Facebook's fastText library. Since we have focused on Gensim for most of our tasks, we will use Gensim's fastText implementation next to build word representations.
We will discuss only the parameters that are new to fastText, since most of them are shared with the Word2Vec and Doc2Vec models, and we will use the same common_texts data to explore fastText.
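As a minimal sketch of what this looks like with Gensim's FastText class (assuming Gensim 4.x parameter names; older releases use size instead of vector_size), the fastText-specific parameters min_n and max_n set the range of character n-gram lengths:

```python
from gensim.models import FastText
from gensim.test.utils import common_texts

# Train a toy fastText model on Gensim's bundled common_texts corpus.
model = FastText(
    sentences=common_texts,
    vector_size=100,   # dimensionality of the word and n-gram vectors
    window=5,
    min_count=1,
    min_n=2,           # smallest character n-gram length (fastText-specific)
    max_n=5,           # largest character n-gram length (fastText-specific)
    epochs=10,
)

# In-vocabulary lookup works just as it does with Word2Vec.
print(model.wv['computer'][:5])

# An out-of-vocabulary word still gets a vector, built from its character n-grams.
print('computation' in model.wv.key_to_index)   # False
print(model.wv['computation'][:5])              # vector assembled from sub-words
```

The last two lines illustrate the point made earlier: because the sub-word vectors are learned, even a word that never appeared in the training data receives a representation as long as its n-grams overlap with words that did.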