Application of Artificial Intelligence to Assessment by Hong Jiao;

Author: Hong Jiao
Publisher: Information Age Publishing
Published: 2020-03-10


Figure 5.2. Sentence-to-sentence coherence in a Kaggle essay set.

Detecting Bias in Models

Machine learning models can be biased by training on non-construct-relevant features (measurement bias) or by making inappropriate assumptions in the algorithms used to model those features (algorithm bias). The models can also be biased by being trained on biased data (sample bias) or by not sampling subgroups appropriately (prejudice bias). Machine learning can leverage massive amounts of data to draw inferences. By "learning" from examples and extracting patterns, it derives rules about which features of human behavior correspond to performance on complex tasks. But if those examples contain bias, then the machine learning may infer that same bias. Three critical best practices for avoiding bias are to (a) select a broad sample of training examples to cover the range of possible input, (b) evaluate bias across subgroups and set criteria for effective performance, and (c) protect the model from scoring inappropriate input.

As an example, if students are given a writing prompt that asks, "Write an essay about a hero and describe why this person is a hero to you," there may be many ways for a student to describe what kind of person fits their view of a hero. Thus, it is important to capture the range of potential types of answers and not differentially penalize responses for the choice of hero rather than for the quality of the explanation. This can be done by selecting writing samples broadly across the intended population, ensuring that responses from different subgroups are represented, and using both high- and low-quality essays within the training sets.
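A minimal sketch of how such a stratified training sample might be drawn, assuming the essays live in a pandas DataFrame with hypothetical `subgroup` and `human_score` columns; the column names and cell size are illustrative, not taken from the chapter.

```python
import pandas as pd

def build_training_sample(essays: pd.DataFrame, n_per_cell: int = 50,
                          seed: int = 42) -> pd.DataFrame:
    """Draw a training set stratified by subgroup and human score level.

    Assumes hypothetical columns `subgroup` (e.g., demographic or
    topic-choice group) and `human_score` (rubric score), so that every
    combination of subgroup and score level is represented.
    """
    cells = essays.groupby(["subgroup", "human_score"], group_keys=False)
    # Take up to n_per_cell essays from each subgroup x score-level cell,
    # keeping the whole cell when fewer essays are available.
    return cells.apply(
        lambda cell: cell.sample(n=min(n_per_cell, len(cell)), random_state=seed)
    )
```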

Then, when a model is built, it should be tested against held-out data representing the various subgroups to ensure that different subgroups are not scored differently. For each of these subgroups, it is important to calculate various agreement indices (e.g., r, kappa, quadratic-weighted kappa, exact agreement, and standardized mean differences [SMDs]) comparing human–human results with IEA–human results (see Williamson, Xi, & Breyer, 2012). Significant differences between subgroups should be flagged and should determine whether the model is appropriate or needs additional training data before it is used.
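As a sketch of this subgroup check, assuming held-out human and engine scores are available for each subgroup as integer arrays, the indices named above can be computed with scipy and scikit-learn; the SMD flagging threshold in the comment is illustrative rather than a prescribed criterion.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def agreement_indices(human: np.ndarray, engine: np.ndarray) -> dict:
    """Agreement indices for one subgroup's held-out essays.

    `human` and `engine` are rubric scores for the same essays.
    """
    pooled_sd = np.sqrt((human.std(ddof=1) ** 2 + engine.std(ddof=1) ** 2) / 2)
    return {
        "r": pearsonr(human, engine)[0],
        "kappa": cohen_kappa_score(human, engine),
        "quadratic_kappa": cohen_kappa_score(human, engine, weights="quadratic"),
        "exact_agreement": float(np.mean(human == engine)),
        # Standardized mean difference: engine minus human, in pooled-SD units.
        "smd": float((engine.mean() - human.mean()) / pooled_sd),
    }

# Illustrative use: flag any subgroup whose engine-human SMD is large.
# for name, (h, e) in heldout_scores_by_subgroup.items():
#     stats = agreement_indices(h, e)
#     if abs(stats["smd"]) > 0.15:  # threshold chosen for illustration only
#         print(f"Flag subgroup {name}: SMD = {stats['smd']:.2f}")
```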

Protecting the Model

Once a model is finally deployed operationally, it is important to ensure that the model only scores examples that fall within the range of the assumptions made by the model. This can be considered protecting the model from inappropriate input, or perhaps more properly, protecting the results from models that were not built for that kind of input. The features used to analyze the essays provide a means to ascertain how well any essay falls within the distributional confines of the training set. By analyzing the features of the essays in the training set, we can derive an expected range of essay features. The system can then determine whether the value for a particular feature, or the combined values for a group of features, lies beyond the training range. Those that fall outside the confidence interval of this expected range can be flagged and routed to human scoring rather than scored by the model.
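A minimal sketch of such an out-of-range guard, assuming each essay is represented by a numeric feature vector and that per-feature percentile bounds stand in for the confidence interval described above; the cutoffs and class name are assumptions for illustration.

```python
import numpy as np

class FeatureRangeGuard:
    """Flag essays whose features fall outside the training distribution.

    Per-feature bounds come from the training set; the 0.5th/99.5th
    percentile cutoffs are an illustrative stand-in for the confidence
    interval, not the engine's actual rule.
    """

    def __init__(self, training_features: np.ndarray,
                 lower_pct: float = 0.5, upper_pct: float = 99.5):
        self.lower = np.percentile(training_features, lower_pct, axis=0)
        self.upper = np.percentile(training_features, upper_pct, axis=0)

    def out_of_range(self, essay_features: np.ndarray) -> np.ndarray:
        """Boolean mask of features outside the training bounds."""
        return (essay_features < self.lower) | (essay_features > self.upper)

    def should_route_to_human(self, essay_features: np.ndarray,
                              max_violations: int = 0) -> bool:
        """Route to human scoring when too many features are out of range."""
        return int(self.out_of_range(essay_features).sum()) > max_violations
```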





