July 18, 2018 ᛫ 15 min

Classifying colleges with machine learning

This blog post is derived from work that Yash Tanga and I completed during the spring trimester of our senior year at Lawrenceville, in the 'Regression and Machine Learning' math course.

Americans are attending college in ever increasing numbers. As of fall 2017, an estimated 20 million students were expected to attend an American college or university, an increase of about 5 million since 2000.

This project seeks to classify whether a college is private or public based on predictor variables including out of state tuition, acceptance rate, and the percentage of the incoming freshman class who were in the top 10% of their high school class. The response variable therefore has two classes (two levels). These factors may be considered by a college applicant in his or her college search.

A model that depends upon a linearized trend line and a model that uses proximity-based determination produced very similar accuracy figures.

Because the choice between a public and a private university is influenced by several factors, this project may help those trying to decide by assessing how closely a given university conforms to the typical, expected characteristics of a private or public university.

Pre-processing

For all the classification methods, a training set was created using 80% of the original data set, while the remaining 20% was reserved as a test set to see how the models would respond to new information. The original, full data set consisted of 777 colleges, so an 80% split yields a training set of 622 colleges and a test set of 155. Both are acceptable sample sizes, with the larger training set allowing for more refined models.
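The split described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R used for the original analysis; the function name `train_test_split`, the seed, and the stand-in data are assumptions made for the example, not the code we ran.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows, then cut them into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the 777 colleges; real rows would hold the predictor values.
colleges = list(range(777))
train, test = train_test_split(colleges)
print(len(train), len(test))  # 622 155
```

Shuffling before cutting matters: if the file happened to be sorted by state or by school type, taking the first 80% would give a biased training set.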

Decision trees

Decision trees separate data by explanatory variables to generate subsets of the data that are distinct from one another, but internally similar. In other words, decision trees generate homogeneous groups with maximized heterogeneity between them.

One term used frequently in the discussion of trees is ‘pruning,’ or the removal of sub-nodes from their respective decision nodes. Too little pruning can result in overfitting (high variance), while too much pruning leads to underfitting (high bias), as reflected by less homogeneous groups at each node.

We decided to use decision trees first in the production of this report so that we could find the most influential variables for the later classification models, something that random forests (which build on bagging) make possible.

A general tree model is provided below.

Standard decision tree

The tree seeks to maximize the homogeneity of its groups while maximizing separability between groups. The left number indicates the number of public colleges, while the number on the right indicates the number of private colleges (e.g. 132/13 means 132 public colleges and 13 private colleges).

Another tree was generated using a lower complexity parameter (CP), which means less pruning. It is shown below.

Decision tree with reduced pruning

This decision tree has less pruning than the one above. There are more subnodes (13 vs. 9), which provides more insight into the tree’s decision making at the cost of potential overfitting. However, this tree has the lowest error, so the added risk of overfitting may be an acceptable trade-off compared with a higher CP and potential underfitting.

An important extension of decision trees is the random forest, which combines many trees and helps determine which variables have the most impact on the data.

Variable importance plot

MeanDecreaseGini measures how much the splits on a variable reduce node impurity (that is, increase purity, or homogeneity), averaged over all trees in the forest. As you can see, out of state tuition and enrollment are the most impactful variables, and they form the basis of the other classification models below.
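To make the idea of impurity concrete, here is a small Python sketch (the original analysis was done in R) of the Gini impurity of a node and the drop in impurity produced by one split. The 132/13 node comes from the tree caption above; the parent node and the second child are hypothetical counts chosen only for illustration, and the function names are our own.

```python
def gini(public, private):
    """Gini impurity of a node holding the given class counts (0 = pure)."""
    total = public + private
    if total == 0:
        return 0.0
    p = public / total
    return 2 * p * (1 - p)

def gini_decrease(parent, left, right):
    """Weighted drop in impurity when `parent` is split into `left` and `right`.

    Each argument is a (public, private) tuple of counts.
    """
    n = sum(parent)
    children = (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)
    return gini(*parent) - children

# Node from the tree caption: 132 public colleges, 13 private colleges.
print(round(gini(132, 13), 3))  # 0.163

# Hypothetical parent (150/150) split into the node above and a 18/137 node.
print(gini_decrease((150, 150), (132, 13), (18, 137)))
```

A variable importance score of this kind adds up such decreases over every split that uses a given variable and averages the totals across the trees in the forest.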

Linear discriminant analysis

Linear discriminant analysis maximizes separability between k classes via dimension reduction. In our case, this means two classes: private and public colleges.

Linear discriminant analysis maximizes the distance between the means of the classes while minimizing the variation within each class. The process works so long as at least two classes are involved.
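For two classes and two predictors, that trade-off between class means and within-class variation can be written out by hand. The sketch below is in Python rather than R and is not the code behind our results: the data points are made up (tuition in thousands of dollars, enrollment in hundreds of students), the function names are our own, and placing the cutoff midway between the projected class means assumes equal class priors.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, mu):
    """Within-class scatter of 2-D points, returned as (sxx, sxy, syy)."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fit_lda(private, public):
    """Return the discriminant direction w and the cutoff on the projected axis."""
    mu_a, mu_b = mean(private), mean(public)
    axx, axy, ayy = scatter(private, mu_a)
    bxx, bxy, byy = scatter(public, mu_b)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # pooled scatter matrix
    det = sxx * syy - sxy * sxy
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # w = (pooled scatter)^-1 * (difference of class means), 2x2 inverse by hand
    w = [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]
    cutoff = sum(w[i] * (mu_a[i] + mu_b[i]) / 2 for i in range(2))
    return w, cutoff

def classify(point, w, cutoff):
    score = w[0] * point[0] + w[1] * point[1]
    return "private" if score > cutoff else "public"

# Made-up training points: (out of state tuition, enrolled students)
private = [(12, 3), (14, 4), (16, 2), (13, 5)]
public = [(6, 15), (7, 20), (8, 12), (5, 18)]
w, cutoff = fit_lda(private, public)
print(classify((15, 3), w, cutoff), classify((6, 17), w, cutoff))
```

Projecting every college onto `w` collapses the two predictors into a single score, which is the dimension reduction mentioned above.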

Scatterplot of enrolled students vs. out of state tuition

The two most important variables—out of state tuition and number of enrolled students—plotted from the training set without any modification.

Our training error was 0.093, which indicates a fairly accurate model. For comparison, random guessing between the two classes would yield an error of about 0.50.

Scatterplot of enrolled students vs. out of state tuition

The dataset after applying LDA, with a formula that predicts whether or not a college is private from out of state tuition and number of enrolled students. The distributions are well separated, but still exhibit some overlap.

K-nearest neighbor

K-nearest neighbor classifies a sample’s response variable based on the classes of the closest points in the training set. Essentially, the training set divides the predictor space into regions, and the test set is then applied to that model: each test set point is assigned the majority class among its k nearest training set points (with k = 1, simply the class of the single closest point). The test error is 0.096.
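The voting rule is short enough to write out in full. This is an illustrative Python sketch (the original analysis used R), with made-up points and function names of our own. It assumes the predictors are already on comparable scales, because Euclidean distance is otherwise dominated by whichever variable has the largest raw values.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors.

    `train` is a list of ((x, y), label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def error_rate(train, test, k=3):
    """Fraction of test points whose predicted label differs from the true one."""
    wrong = sum(knn_predict(train, point, k) != label for point, label in test)
    return wrong / len(test)

# Made-up, pre-scaled points: (out of state tuition, acceptances)
train = [((0.9, 0.2), "private"), ((0.8, 0.1), "private"), ((0.7, 0.3), "private"),
         ((0.2, 0.8), "public"), ((0.3, 0.9), "public"), ((0.1, 0.7), "public")]
test = [((0.85, 0.15), "private"), ((0.25, 0.85), "public")]
print(error_rate(train, test, k=3))
```

An odd k avoids tied votes when there are only two classes.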

kNN for out of state tuition vs. acceptances

To determine which graphs would be most appropriate to display for kNN, we assembled an array of 21 kNN plots covering various pairs of explanatory variables using the parimat() function. From this, we gathered that out of state tuition and acceptances yielded the lowest error, at 0.054.
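A count of 21 plots is what you get from every pair drawn from seven variables, which is our reading of the plot matrix rather than something stated above. The Python sketch below (not the R we used) enumerates such pairs and picks the one with the lowest error. The variable names are placeholders, and `error_for_pair` is a hypothetical callback standing in for fitting kNN on one pair.

```python
from itertools import combinations

def best_pair(variables, error_for_pair):
    """Return (pair, error) for the variable pair with the lowest error."""
    scored = [(pair, error_for_pair(pair)) for pair in combinations(variables, 2)]
    return min(scored, key=lambda item: item[1])

# Placeholder names for seven explanatory variables.
variables = [f"x{i}" for i in range(1, 8)]
pairs = list(combinations(variables, 2))
print(len(pairs))  # 21
```

In practice `error_for_pair` would refit the classifier on just those two columns and return its error on held-out data.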

Conclusions

The decision tree was first used to determine the best predictor for our response variable. The tree showed that out of state tuition (Outstate) was the best predictor of whether or not a college or university was private. This is consistent with the implications of the other classification methods, specifically kNN. The kNN graphs with the lowest error all had out of state tuition as an explanatory variable, providing further evidence that Outstate is the best predictor for the response variable.

The best classification method was LDA using out of state tuition and enrollment as explanatory variables, with a training error of 0.093 (and, oddly, a test error of 0). This result was followed closely by kNN, which produced a respectable test error of 0.096. The decision trees produced a model with a cross-validated error of 0.198.

The effectiveness of these particular models is somewhat remarkable in retrospect, since a model that depends upon a linearized trend line and a model that uses proximity-based determination produced very similar accuracy figures. However, the data was fairly well separated to begin with, which made for a particularly easy classification task.

If you would like to see the R code for this article, which includes the linked dataset with full variable descriptions, click here.