Applied Text Analysis with Python by Benjamin Bengfort Tony Ojeda Rebecca Bilbro

Author: Benjamin Bengfort, Tony Ojeda, Rebecca Bilbro
Language: eng
Format: mobi, epub
Publisher: O'Reilly Media, Inc.
Published: 2016-12-19T05:00:00+00:00


Figure 2-1. The Model Selection Triple

In the feature extraction phase, which we will begin to explore in our discussion of vectorization in this chapter, the goal is to analyze, extract, and select a sufficiently rich set of features with which to model the data. In the second phase, a set of algorithms is selected from a model family; these can then be used, evaluated, and compared in parallel. Finally, we tune the model by adjusting its hyperparameters to identify the combination that results in the most predictive fitted model.

Together, these tasks allow data scientists to define and describe a learning model that effectively leverages specific data (feature engineering) with a specific interaction between variables and the target of interest (algorithm selection), and then to optimize the behavior of that model during learning and prediction (hyperparameter tuning). Applied methodologies for all three workflows usually combine heuristics or rules of thumb for specific algorithms, which can loosely be described as intuition, with automatic optimization and search techniques.
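As a rough illustration of how the three phases combine into a single search, the sketch below enumerates every (feature extractor, algorithm, hyperparameters) triple over a tiny grid and scores each one. Everything in it is an assumption made for illustration: the toy corpus, the two feature functions, the simplified `NaiveBayes` and `NearestNeighbors` classes, and the `SEARCH_SPACE` grid are hypothetical and use only the standard library. It is not the workflow or code from the text, and a real project would likely use cross-validation rather than a single held-out set.

```python
from collections import Counter
from itertools import product
import math

# Toy labeled corpus (hypothetical data, for illustration only).
TRAIN = [
    ("the team won the match", "sports"),
    ("a late goal won the game", "sports"),
    ("the striker scored a goal", "sports"),
    ("the election results are in", "politics"),
    ("voters went to the polls", "politics"),
    ("the senate passed the bill", "politics"),
]
TEST = [
    ("the team scored a late goal", "sports"),
    ("the senate election results", "politics"),
]

# Phase 1: feature extraction. Two candidate representations.
def count_features(text):
    return Counter(text.split())          # token -> frequency

def binary_features(text):
    return Counter(set(text.split()))     # token -> 1 if present

# Phase 2: algorithm selection. Two simplified candidate models.
class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                # additive smoothing, must be > 0

    def fit(self, X, y):
        self.vocab = {w for x in X for w in x}
        self.class_counts = Counter(y)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for x, c in zip(X, y):
            self.word_counts[c].update(x)
        return self

    def _score(self, x, c):
        total = sum(self.word_counts[c].values()) + self.alpha * len(self.vocab)
        score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
        for w, n in x.items():
            if w in self.vocab:           # ignore tokens unseen in training
                score += n * math.log((self.word_counts[c][w] + self.alpha) / total)
        return score

    def predict(self, X):
        return [max(self.class_counts, key=lambda c: self._score(x, c)) for x in X]

class NearestNeighbors:
    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Rank training documents by shared-token overlap with x.
            ranked = sorted(range(len(self.X)),
                            key=lambda i: -sum((x & self.X[i]).values()))
            votes = Counter(self.y[i] for i in ranked[:self.k])
            preds.append(votes.most_common(1)[0][0])
        return preds

# Phase 3: hyperparameter tuning. Each model carries its own grid.
SEARCH_SPACE = {
    "features": [count_features, binary_features],
    "models": [
        (NaiveBayes, {"alpha": [0.1, 1.0, 10.0]}),
        (NearestNeighbors, {"k": [1, 3]}),
    ],
}

def triples(space):
    """Yield every (extractor, model class, params) combination."""
    for extract in space["features"]:
        for model_cls, grid in space["models"]:
            names = sorted(grid)
            for values in product(*(grid[n] for n in names)):
                yield extract, model_cls, dict(zip(names, values))

def accuracy(extract, model_cls, params, train, test):
    X_train = [extract(text) for text, _ in train]
    y_train = [label for _, label in train]
    model = model_cls(**params).fit(X_train, y_train)
    preds = model.predict([extract(text) for text, _ in test])
    return sum(p == label for p, (_, label) in zip(preds, test)) / len(test)

results = [
    (accuracy(f, m, p, TRAIN, TEST), f.__name__, m.__name__, p)
    for f, m, p in triples(SEARCH_SPACE)
]
best = max(results, key=lambda r: r[0])
print(best)
```

Even this small grid yields ten candidate triples (two extractors times five model configurations), which hints at how quickly the space grows once realistic feature sets, model families, and continuous hyperparameters are involved.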

While the workflow it describes is one with which many machine learning practitioners are likely familiar, the model selection triple was first explicitly described in a 2015 SIGMOD paper by Kumar et al.¹ In their paper, which concerns the development of next-generation database systems built to anticipate predictive modeling, the authors argue that such systems are badly needed because machine learning is so highly experimental in practice. “Model selection,” they explain, “is iterative and exploratory because the space of [model selection triples] is usually infinite, and it is generally impossible for analysts to know a priori which [combination] will yield satisfactory accuracy and/or insights.”

Indeed, the process of model selection is complex, iterative, and substantially more intricate than, say, the choice of a support vector machine over a decision tree classifier. Our model selection triple workflow aims to treat these iterations as central to the science of machine learning. It is a workflow that, thanks to the robust and secure foundational data layer, can afford to enable optimization by facilitating rather than limiting those iterations.


