Text Mining in Practice with R by Ted Kwartler
Author: Ted Kwartler
Language: eng
Format: epub, pdf
ISBN: 9781119282082
Publisher: Wiley
Published: 2017-05-10
5.2.1 What is String Distance?
String distance measures how far one string is from another at the character level: the number of characters that must be changed in some manner to turn one string into the other. For example, the word “cat” has a distance of 1 from the word “bat” because substituting “b” for “c” is the only difference. The distance between “cat” and “bats” is 2 because it takes one substitution plus the insertion of “s.”
All three functions, amatch, stringdist and stringdistmatrix, are passed a method that determines which types of changes are allowed. Each of the five methods dictates which of four operators can be used to change one string into another so that they match: substitution, deletion, insertion and transposition of adjacent characters. The four operators work within the larger character string and act upon substrings. The distance is then simply the minimum number of changes needed to make the two terms match. Another practical example is “book” and “books.” These strings have a distance of 1 because inserting a single “s” makes the two terms match.
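The "minimum number of changes" idea can be sketched in a few lines. The block below is a minimal Python illustration, not the code behind amatch, stringdist or stringdistmatrix. It assumes only substitution, insertion and deletion are allowed (a measure commonly called Levenshtein distance), and the function name edit_distance is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b.

    Assumption: transposition is NOT allowed in this sketch.
    """
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning i characters into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute ch_a with ch_b, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("cat", "bat"))     # 1: one substitution
print(edit_distance("cat", "bats"))    # 2: one substitution, one insertion
print(edit_distance("book", "books"))  # 1: one insertion
```

Keeping only the previous row of the table is a design choice that saves memory; a full matrix would give the same answers.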
Each of the five methods passed to the functions combines one or more of the four operators. Fuzzy matching and string distance measurements can serve as the basis for document spellcheck and for the corrected queries a search engine suggests.
The first of the five methods uses only substitution. As you may expect, this means you can substitute one character for another in order to match the terms. Because nothing can be inserted or deleted, the method applies only to strings of equal length. For example, assume that substitution is your only tool for matching two terms. When comparing “racecarz” and “racecars,” a single substitution makes them match: the letter “z” is replaced with “s.” If the term were further misspelled as “racearzc,” the substitution count increases. In this case there are four substitutions: the second “a” to “c,” the second “r” to “a,” “z” to “r” and the last “c” to “s.” Notice that the substitution operations did not insert a “c” and move the “ar” to the right. The method simply substituted character by character. This distance method is called “Hamming.”
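The position-by-position counting described above can be sketched as follows. This is a minimal Python illustration under the assumption that unequal lengths should raise an error; it is not the stringdist package's implementation, and the function name hamming_distance is hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption of this sketch: refuse unequal lengths, since
        # substitution alone cannot make such strings match.
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("racecarz", "racecars"))  # 1: "z" -> "s"
print(hamming_distance("racearzc", "racecars"))  # 4: positions 5 through 8 differ
```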
Still another method that can be applied to string distance functions is optimal string alignment (OSA). The OSA method allows all four operations: substitution, deletion, insertion and transposition. Armed with more tools, the results of our racecar example change. With OSA, “racecarz” and “racecars” still have a distance of 1, stemming from the substitution operator. However, because OSA can use more operations to match strings, “racearzc” and “racecars” have a distance of 3, not 4. Since OSA can use insertion, it inserts a “c” in front of the “ar” as the first operation. Next, OSA deletes the last letter and substitutes “s” for the “z.” So the total number of operations in this example is only three.
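A common dynamic-programming formulation of OSA is sketched below in Python to make the counts above checkable. It is an illustration under stated assumptions, not the stringdist package's code: the function name osa_distance is hypothetical, and the third test string, “raceacrs,” is an added example that does not appear in the text.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: substitution, deletion,
    insertion and transposition of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between the first i chars of a and first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("racecarz", "racecars"))  # 1: one substitution
print(osa_distance("racearzc", "racecars"))  # 3: insert, substitute, delete
print(osa_distance("raceacrs", "racecars"))  # 1: swap the adjacent "ac"
```

The last call shows what transposition buys: with substitution alone, the swapped “ac” would count as two changes.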