
Module 3 Assignment – GLM and Logistic Regression
Intermediate Analytics
Ayeshabi W Tigdikar
Master of Science in Project Management, Northeastern University
Professor Richard He
October 10, 2023

Table of Contents

INTRODUCTION
ANALYSIS
CONCLUSION
REFERENCES
APPENDIX

INTRODUCTION

In this study, we examined a dataset of higher education institutions, focusing on the variables that affect whether a university is classified as "Private" or "Not Private." To build and assess a logistic regression model, we first divided the dataset into training and test sets. Using predictors such as "Room.Board," "Books," and "PhD," we fitted a logistic regression model to the training data with the glm() function from the "stats" package. For both the training and test sets, we then created a confusion matrix and computed several classification metrics, including accuracy, precision, recall, and specificity. To assess how well the model separates the two classes overall, we also plotted a ROC (Receiver Operating Characteristic) curve and computed the AUC (area under the curve).

ANALYSIS

1. Importing the dataset and performing EDA and Descriptive Statistics.

a. Structure

The output shows a data frame called "College" that contains details about 777 colleges and universities. Its 18 variables include a categorical variable, "Private," that indicates whether the institution is private, along with many numerical variables such as "Apps" (the number of applications received), "Accept" (the number of applications accepted), and "Enroll" (the number of new students enrolled). Some variables, such as "PhD," "Expend," and "Grad.Rate," have meanings that are not obvious from their names alone, so the dataset documentation is needed to interpret them properly. This dataset provides the basis for the subsequent analysis and modeling of university characteristics.
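The workflow outlined in the introduction (split, fit with glm(), build a confusion matrix, compute metrics and AUC) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used for this assignment: the data is a synthetic stand-in with the same column names, the 70/30 split and the 0.5 cutoff are assumed values, and the helper names `classification_metrics` and `auc_rank` are hypothetical. The AUC is computed with a base-R rank formula rather than a dedicated ROC package.

```r
# Sketch only: synthetic stand-in data, NOT the College dataset.
set.seed(42)
n <- 400
df <- data.frame(
  Room.Board = rnorm(n, mean = 4400, sd = 1100),
  Books      = rnorm(n, mean = 550, sd = 160),
  PhD        = pmin(100, pmax(10, rnorm(n, mean = 72, sd = 16)))
)
# Hypothetical relationship, used only to generate example labels
p <- plogis(1 + 0.0012 * (df$Room.Board - 4400) - 0.03 * (df$PhD - 72))
df$Private <- factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))

# Train/test split (the 70/30 ratio is an assumption)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Logistic regression: binomial family with the default logit link.
# With levels c("No", "Yes"), the model estimates P(Private == "Yes").
fit <- glm(Private ~ Room.Board + Books + PhD, data = train, family = binomial)

# Hypothetical helper: confusion matrix and metrics at a given cutoff
classification_metrics <- function(model, data, cutoff = 0.5) {
  prob <- predict(model, newdata = data, type = "response")
  pred <- factor(ifelse(prob >= cutoff, "Yes", "No"), levels = c("No", "Yes"))
  cm <- table(Predicted = pred, Actual = data$Private)
  tp <- cm["Yes", "Yes"]; tn <- cm["No", "No"]
  fp <- cm["Yes", "No"];  fn <- cm["No", "Yes"]
  list(
    confusion   = cm,
    accuracy    = (tp + tn) / sum(cm),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),   # also called sensitivity
    specificity = tn / (tn + fp)
  )
}

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula
auc_rank <- function(prob, actual, positive = "Yes") {
  n_pos <- sum(actual == positive)
  n_neg <- sum(actual != positive)
  (sum(rank(prob)[actual == positive]) - n_pos * (n_pos + 1) / 2) /
    (n_pos * n_neg)
}

train_metrics <- classification_metrics(fit, train)
test_metrics  <- classification_metrics(fit, test)
test_auc <- auc_rank(predict(fit, newdata = test, type = "response"),
                     test$Private)

print(test_metrics$confusion)
```

To apply the same steps to the real data, `df` would be replaced by the "College" data frame; the numbers produced by this sketch come from simulated data and say nothing about the actual results.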

b. Summary

The output summarizes the "College" dataset statistically, with descriptive statistics for both the categorical and the numerical variables. According to the categorical variable "Private," there are 212 public and 565 private institutions. The remaining columns show summary statistics for the numeric variables. For instance, the number of "Apps" ranges from 81 to 48,094, with a mean of roughly 3,002. Likewise, "Accept" ranges from 72 to 26,330, with a mean of roughly 2,019. The summary gives a brief picture of the distribution and central tendency of each numeric variable through measures such as the minimum, maximum, quartiles, mean, and standard deviation.

c. Histogram

Interpretation: According to the histogram, most of the colleges in the dataset receive a relatively small number of applications, as the tall bars on the left side show. The bars get shorter toward the right, indicating that fewer colleges receive a high volume of applications. Because only a few institutions receive a large volume of applications while most fall into the lower ranges, the distribution is right-skewed, which is typical for application data. In short, the histogram shows how applications are distributed across the dataset and where institutions are concentrated across the application ranges.

Interpretation: This histogram displays the distribution of enrollments among colleges. The tall bars on the left side show that the majority of colleges have enrollments in the lower range. The bars get shorter toward the right, which shows that fewer universities have larger enrollments. As is typical for enrollment data, the distribution is right-skewed: colleges with modest enrollments far outnumber those with large enrollments.

d. Barplot
