Predictive Analytics, Data Mining and Big Data by Steven Finlay
Author: Steven Finlay
Language: eng
Format: epub
Publisher: PALGRAVE MACMILLAN
The difference in the predictive accuracy of different models is usually pretty small.24 This is true even for problems that are described as highly non-linear or as having a lot of interactions between variables25 as long as suitable data transformations have been applied (e.g. binning and the use of indicator variables26). A classic case is fraud detection. A very widely expressed belief is that you have to use a neural network or support vector machine if you want to produce a decent model, because of the complexities of the relationships in fraud data. This is a misconception, based on the fact that one of the earliest fraud detection systems just happened to be based on a neural network model. I have come across more than one example of industry-leading fraud detection systems based on linear models and/or rule sets that have performed as well as or better than competitors based on more advanced methods. Having said this, one should be careful not to confuse general and specific findings.27 There is a lot of evidence that a wide range of algorithms yield very similar levels of performance on average, but for some specific problems one method may be substantially better than another, and you can't tell whether this is the case until you've built the model. Therefore it often makes sense to develop a number of competing models using different methodologies in order to see which one generates the best model for your particular problem.
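The binning and indicator-variable transformation mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the `AGE_EDGES` cut points and the function names are hypothetical and do not come from the book.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the index of the bin that `value` falls into.

    `edges` holds the ascending interior cut points, so there are
    len(edges) + 1 bins in total.
    """
    return bisect_right(edges, value)


def indicator_variables(value, edges):
    """Encode `value` as one 0/1 flag per bin; exactly one flag is 1."""
    idx = bin_index(value, edges)
    return [1 if i == idx else 0 for i in range(len(edges) + 1)]


# Hypothetical cut points for an "age" predictor (illustrative only).
AGE_EDGES = [25, 40, 60]

for age in (19, 25, 47, 72):
    print(age, indicator_variables(age, AGE_EDGES))
```

Because a linear model then learns a separate weight for each bin, it can represent a relationship that rises and falls across the range of the original variable, which a single weight on the raw value cannot do.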
One drawback of neural networks is that it is notoriously easy to over-fit them to the data, making them appear to perform much better than they really do, i.e. their performance in real-world usage is inferior to their performance on the data used to develop them. They also require a lot more computer power to generate than linear models constructed using linear or logistic regression, or decision trees using C4.5 or CHAID (often 10 to 100 times more), which can cause problems when one is dealing with large samples and lots of predictor variables.
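The gap between performance on the development data and performance on unseen data can be made visible by holding back part of the sample. The sketch below is an assumption-laden illustration, not the author's method: it uses synthetic data and a 1-nearest-neighbour memorizer as a stand-in for any very flexible model (it is not a neural network).

```python
import random


def make_sample(n, rng, noise=0.2):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < noise:
            label = 1 - label
        rows.append((x, label))
    return rows


def predict_1nn(train, x):
    """Predict the label of the closest training point (this memorizes noise)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, sample):
    """Share of cases in `sample` that the memorizer built on `train` gets right."""
    hits = sum(predict_1nn(train, x) == y for x, y in sample)
    return hits / len(sample)


rng = random.Random(0)
development = make_sample(300, rng)
holdout = make_sample(300, rng)

dev_acc = accuracy(development, development)
holdout_acc = accuracy(development, holdout)
print(f"development accuracy: {dev_acc:.2f}")
print(f"holdout accuracy:     {holdout_acc:.2f}")
```

On the development sample every point is its own nearest neighbour, so the model looks perfect. The holdout figure is the more honest estimate of real-world performance and, with noisy labels like these, it will typically be noticeably lower.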
Decision trees, like neural networks, are prone to over-fitting and have some other drawbacks. In particular:
Popular algorithms for deriving decision trees are not very efficient at utilizing data. Consequently, their performance is sometimes (although not always) marginally worse than other types of predictive model for a development sample of a given size.28 This is particularly true when small and medium-sized samples are used to construct the model.29
For classification you need equal numbers of cases that do/do not display the behavior to build good decision trees. If you have lots more examples of behavior or non-behavior in the development sample then model performance will be poor (e.g. the results of a mailing campaign where only 1% of those targeted respond). The greater the degree of imbalance, the worse the model will be. Decision tree algorithms are more sensitive to imbalance than almost any other type of model construction method.30 There are, however, ways of getting around this problem.31
The range of scores is smaller than for many other types of model, resulting in score distributions that are “clumpy.”
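The excerpt does not say which workarounds for class imbalance the author has in mind. One commonly used option, shown here purely as an assumption, is to undersample the majority class so that the development sample contains equal numbers of each outcome. The data and the `undersample` helper below are hypothetical.

```python
import random


def undersample(rows, label_of, rng):
    """Randomly drop majority-class rows until both classes are the same size."""
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    # Order the two groups by size so the smaller one is kept whole.
    minority, majority = sorted((positives, negatives), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


rng = random.Random(42)
# Synthetic mailing campaign: 10 responders out of 1,000 targeted (1% response).
campaign = [(i, 1 if i < 10 else 0) for i in range(1000)]

balanced = undersample(campaign, label_of=lambda r: r[1], rng=rng)
responders = sum(label for _, label in balanced)
print(len(balanced), responders)  # 20 10
```

A model built on a balanced sample like this will overstate the response rate, so its scores would need rescaling before being read as probabilities for the full population.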
