A data set of the attributes of 382 students in secondary education collected from two schools. The goal is to predict the grade in math and Portuguese at the end of the third period. See the cited sources for additional information.

student

Format

382 observations from 13 variables, represented as a list. The list contains y, a binary factor response matrix with two responses, portugese and math, which hold the final scores in period three for the respective subjects. It also contains x, a sparse feature matrix of class 'dgCMatrix' with the following variables:

school_ms

student's primary school, 1 for Mousinho da Silveira and 0 for Gabriel Pereira

sex

sex of student, 1 for male

age

age of student

urban

urban (1) or rural (0) home address

large_family

whether the family size is larger than 3

cohabitation

whether parents live together

Medu

mother's level of education (ordered)

Fedu

father's level of education (ordered)

Mjob_health

whether the mother was employed in health care

Mjob_other

whether the mother was employed as something other than the specified job roles

Mjob_services

whether the mother was employed in the service sector

Mjob_teacher

whether the mother was employed as a teacher

Fjob_health

whether the father was employed in health care

Fjob_other

whether the father was employed as something other than the specified job roles

Fjob_services

whether the father was employed in the service sector

Fjob_teacher

whether the father was employed as a teacher

reason_home

school chosen for being close to home

reason_other

school chosen for another reason

reason_rep

school chosen for its reputation

nursery

whether the student attended nursery school

internet

whether the student has internet access at home

Source

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. http://www3.dsi.uminho.pt/pcortez/student.pdf

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Preprocessing

All of the grade-specific predictors were dropped from the data set. (Note that it is not clear from the source why some of these predictors are specific to each grade, such as which parent is the student's guardian.) The categorical variables were dummy-coded. Only the final grades (G3) were kept as dependent variables, whilst the first and second period grades were dropped.
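
The dummy-coding and column-dropping steps described above can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the code used to build this data set. The records are made up, the helper names dummy_code and preprocess are hypothetical, and the level name at_home is an assumption taken from the UCI data description rather than from this page. The sketch also assumes that the first level in sorted order serves as the reference level and gets no column, which is consistent with the Mjob_* and reason_* variables listed under Format but is not stated in the source.

```python
# Made-up records; field names mimic the UCI student data, values are invented.
raw = [
    {"age": 15, "Mjob": "at_home", "G1": 10, "G2": 11, "G3": 12},
    {"age": 16, "Mjob": "teacher", "G1": 14, "G2": 13, "G3": 15},
    {"age": 17, "Mjob": "health", "G1": 8, "G2": 9, "G3": 9},
]


def dummy_code(rows, column):
    """Return one dict per row with 0/1 indicators for every level but the first."""
    levels = sorted({row[column] for row in rows})
    kept = levels[1:]  # assumption: the first sorted level is the reference level
    return [
        {f"{column}_{level}": int(row[column] == level) for level in kept}
        for row in rows
    ]


def preprocess(rows, categorical, dropped, response):
    """Dummy-code categorical columns, drop unwanted ones, split off the response."""
    encoded = {col: dummy_code(rows, col) for col in categorical}
    features = []
    for i, row in enumerate(rows):
        out = {}
        for key, value in row.items():
            if key in dropped or key == response:
                continue  # earlier-period grades and the response are not features
            if key in categorical:
                out.update(encoded[key][i])
            else:
                out[key] = value
        features.append(out)
    responses = [row[response] for row in rows]
    return features, responses


# Keep only the final grade (G3) as the response; drop the period 1 and 2 grades.
x, y = preprocess(raw, categorical={"Mjob"}, dropped={"G1", "G2"}, response="G3")
```

A student whose mother's job is the reference level has zeros in every Mjob_* column, which is why the feature matrix has one column fewer than the number of job categories.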