# Building Predictive Models for NYC High Schools (Alec Hubel)


## Content

Building Predictive Models for NYC Public High Schools

not necessarily have enough resources for all of the students that it is responsible for. I was also able to collect the average salary for teachers in a given school. My intention was to use this measure as a proxy for the quality of teachers in a given school. A teacher's salary in New York is determined by how much training they have received and how many years of experience they have. I decided to operate under the assumption that a school with a higher average teacher salary has higher-quality teachers.

Drawing 1: NYC School Districts

Drawing 2: Heatmap of Demographic Data by District

Lastly, I collected a data set from the yearly school survey that is administered to parents, teachers, and students. From this survey, the NYC Department of Education is able to extract scores for safety and respect, communication, engagement, and academic expectations. Additionally, it contained information on the extracurricular offerings of a school.

Each school in the above data sets was given a unique identifier called a "DBN". This was extremely useful for two reasons. Firstly, it allowed me to use the Pandas "join" function to easily combine all of my data into a single data frame, without too much extraneous data cleaning. Secondly, the DBN allowed me to extract the district and borough for a given school. For example, Bronx Leadership Academy High School's DBN is 09X525. The first two digits (09) signify that this school is located in district 9 (there are 32 school districts within New York City). The third character (X) corresponds to the Bronx (the other letter/borough pairs are M/Manhattan, Q/Queens, R/Staten Island, and K/Brooklyn).
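The DBN-based join and the district/borough extraction described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names, the helper `parse_dbn`, the second DBN (`02M000`), and every numeric value are hypothetical placeholders, not the author's data or code.

```python
import pandas as pd

# Borough letters as listed in the text above.
BOROUGH_BY_LETTER = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
}


def parse_dbn(dbn):
    """Split a DBN such as '09X525' into (district number, borough name)."""
    district = int(dbn[:2])              # first two digits: school district
    borough = BOROUGH_BY_LETTER[dbn[2]]  # third character: borough letter
    return district, borough


# Two made-up tables indexed by DBN. "02M000" is a hypothetical DBN and
# all values are placeholders, not real school data.
salaries = pd.DataFrame(
    {"avg_teacher_salary": [70000.0, 65000.0]},
    index=pd.Index(["09X525", "02M000"], name="dbn"),
)
survey = pd.DataFrame(
    {"safety_score": [7.5, 8.1]},
    index=pd.Index(["09X525", "02M000"], name="dbn"),
)

# DataFrame.join aligns rows on the shared DBN index.
schools = salaries.join(survey, how="inner")

# Derive district and borough columns from the DBN itself.
parsed = [parse_dbn(dbn) for dbn in schools.index]
schools["district"] = [district for district, _ in parsed]
schools["borough"] = [borough for _, borough in parsed]

print(schools)
```

Indexing every table by DBN before joining keeps the combination step to a single call per data set, which is presumably why little extra cleaning was needed.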

### Methodology

First, I had to narrow down my data set from the 1700+ schools to the 402 high schools in the NYC school system. Fifty of those schools did not report graduation rates and APM measures. This is a result of a regulatory requirement that prevents a school from releasing this information when there are 20 or fewer graduates (generally the smaller schools in the system). After removing those schools with missing data, I employed a randomized 70/30 split to create a training set and a testing set.

For both of my models, I was attempting to predict a continuous value: graduation rate and APM. As such, I decided to use scikit-learn's ridge regression algorithm. I began with a "kitchen sink" approach and threw all of my variables into the model. I then removed variables one by one until I could isolate the factors that most influenced graduation rate and APM. Variables were selected for removal when their p-values indicated a lack of statistical significance and their absence from my model did not substantially detract from my model's accuracy. The accuracy of my model was determined using both the R-squared and mean squared error (MSE).

### Results: Graduation Rate
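As a rough illustration of the modeling procedure described in the methodology (a randomized 70/30 split, a ridge regression fit, and scoring by R-squared and MSE), a minimal scikit-learn sketch might look like the following. The data here is synthetic and entirely made up; the number of features, the coefficients, the noise level, and `alpha=1.0` are assumptions for illustration only. The row count of 352 simply reflects 402 high schools minus the 50 with missing data. scikit-learn's `Ridge` does not report p-values, so the variable-elimination step is not shown.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the school-level table (NOT the author's data):
# 352 schools, 5 made-up standardized features.
n_schools, n_features = 352, 5
X = rng.normal(size=(n_schools, n_features))

# Made-up "graduation rate" target: a linear signal plus noise.
true_coef = np.array([0.08, 0.05, 0.0, -0.03, 0.0])
y = 0.7 + X @ true_coef + rng.normal(scale=0.05, size=n_schools)

# Randomized 70/30 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Ridge regression; alpha=1.0 is scikit-learn's default penalty strength.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# Score on the held-out 30% using the two metrics named in the text.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R-squared: {r2:.3f}, MSE: {mse:.4f}")
```

Scoring on the held-out set rather than the training set is what makes the two metrics a fair estimate of how the model would perform on schools it has not seen.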

The final iteration of my model for APM yielded an r-squared of *.:; and an MSE of 0.0119, incrementally more accurate than my model for graduation rates. Generally speaking, the same drivers that impacted graduation rates impacted a school's aspirational performance measure. Wealthy and safe schools with a high proportion of white students tended to outperform. Extracurricular activities had a strongly positive impact, with music and technology clubs joining the sports and academic clubs as the extracurriculars that had an outsized positive impact.

There was one takeaway from this model that stood out from the crowd: the most predictive variable in this model was the percentage of the student body that was of Asian descent. In fact, when I built this model with that data point as the sole variable, it generated an R-squared of *.-;. While this lends credence to the notion that Asian students tend to outperform on standardized tests, it does not answer the question of why Asian students tend to outperform. A common hypothesis is that Asian students tend to have a stronger work ethic and spend more time studying, naturally yielding better test scores, but I do not have access to data that could confirm or refute that idea.

### Caveats and Future Research

There are a few caveats to consider in order to put the results of this analysis in the proper context. Firstly, the magnet school system in New York City throws a bit of a wrench into the data. At the end of middle school, every New York City public school student takes a standardized test. If they perform well enough on that test, they will be admitted into one of the higher-performing or specialty schools in the city (e.g. Stuyvesant, Bronx Science, etc.). This system creates two issues. Firstly, there will be self-selection bias. Students with higher natural ability will go to better schools, reinforcing their high standing (particularly in terms of graduation rate and aspirational performance measures). Secondly, many students do not attend schools in their home boroughs or districts. This may be one of the reasons that a school's location was not particularly significant in my models.

More data would have been useful as well. Direct metrics for the natural ability of a student body, non-school activities that take up a student's time outside of the classroom, and the quality of teaching staff were unavailable. Some of this may be alleviated soon: in the 2013-2014 school year, teachers will be evaluated on a continuous scale. These data points may prove useful in future research.

Lastly, the results of this study would have been much more intriguing if the data could be collected on a student level, as opposed to a school level. After all, policy recommendations directed at schools are meant to improve the quality of education for individual students. If the school level could be bypassed, it may be easier to identify ways to more directly help students.
