Although the educational level of the Portuguese population has improved in the last decades, the statistics keep Portugal at Europe’s tail end due to its high student failure rates. In particular, lack of success in the core classes of Mathematics and the Portuguese language is extremely serious. On the other hand, the field of Machine Learning, which aims at extracting high-level knowledge from raw data, offers interesting automated tools that can aid the education domain. The present work approaches student achievement in secondary education using machine learning techniques. Recent real-world data (e.g. student grades, demographic, social and school related features) were collected using school reports and questionnaires. The two core classes (i.e. Mathematics and Portuguese) were modelled under binary/five-level classification and regression tasks. Also, four data mining (DM) models (i.e. Decision Trees, Random Forest, Logistic Regression and Support Vector Machines) and three input selections (e.g. with and without previous grades) were tested. The results show that a good predictive accuracy can be achieved, provided that the first and/or second school period grades are available. Although student achievement is highly influenced by past evaluations, an explanatory analysis has shown that there are also other relevant features (e.g. number of absences, parents’ jobs and education, alcohol consumption). As a direct outcome of this research, more efficient student prediction tools can be developed, improving the quality of education and enhancing school resource management.
Education is a key factor for achieving long-term economic progress. During the last decades, the Portuguese educational level has improved. However, the statistics keep Portugal at Europe’s tail end due to its high student failure and dropout rates. For example, in 2006 the early school leaving rate in Portugal was 40% for 18- to 24-year-olds, while the European Union average value was just 15%.
- In particular, failure in the core classes of Mathematics and Portuguese (the native language) is extremely serious, since they provide fundamental knowledge for success in the remaining school subjects (e.g. physics or history).
In this work, we will analyze recent real-world data from two Portuguese secondary schools. Two different sources were used: mark reports and questionnaires. Since the former contained scarce information (i.e. only the grades and number of absences were available), it was complemented with the latter, which allowed the collection of several demographic, social and school related attributes (e.g. student’s age, alcohol consumption, mother’s education).
- The aim is to predict student achievement and, if possible, to identify the key variables that affect educational success/failure.
- The two core classes (i.e. Mathematics and Portuguese) will be modeled under three data mining (DM) goals:
i) Binary classification (pass/fail);
ii) Classification with five levels (from I, very good or excellent, to V, insufficient);
iii) Regression, with a numeric output that ranges between zero (0%) and twenty (100%).
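The three goals above can all be derived from the same final grade on the 0 to 20 scale. The following is a minimal sketch of that derivation, not the implementation used in this work. The pass mark of 10 and the level boundaries (I: 16-20, II: 14-15, III: 12-13, IV: 10-11, V: 0-9) are assumptions chosen for illustration, as are all function names.

```python
from typing import Dict, Union

# Assumption: a final grade of 10 or more (out of 20) counts as a pass.
PASS_MARK = 10

# Assumption: lower bounds for levels I to IV; any lower grade is level V.
LEVEL_LOWER_BOUNDS = [(16, "I"), (14, "II"), (12, "III"), (10, "IV")]


def binary_target(grade: int) -> str:
    """Goal i): pass/fail label."""
    return "pass" if grade >= PASS_MARK else "fail"


def five_level_target(grade: int) -> str:
    """Goal ii): level from I (very good or excellent) to V (insufficient)."""
    for lower_bound, level in LEVEL_LOWER_BOUNDS:
        if grade >= lower_bound:
            return level
    return "V"


def regression_target(grade: int) -> float:
    """Goal iii): the numeric grade itself, between 0 and 20."""
    return float(grade)


def make_targets(grade: int) -> Dict[str, Union[str, float]]:
    """Build the three target encodings for one final grade."""
    if not 0 <= grade <= 20:
        raise ValueError("grade must be between 0 and 20")
    return {
        "binary": binary_target(grade),
        "five_level": five_level_target(grade),
        "regression": regression_target(grade),
    }
```

With these assumed thresholds, a grade of 15 would be encoded as a pass, level II, and the numeric value 15.0.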
- For each of these approaches, three input setups (e.g. with and without the school period grades) and four DM algorithms (e.g. Decision Trees, Random Forest) will be tested. Moreover, an explanatory analysis will be performed over the best models, in order to identify the most relevant features.
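The input setups can be pictured as three column selections over the same data set. The sketch below is one possible reading, under stated assumptions: the column names G1, G2 (first and second period grades) and G3 (final grade, the target) are hypothetical, and so is the exact composition of each setup (A: all inputs, B: without the second period grade, C: without any period grade).

```python
from typing import Dict, List

# Hypothetical column names: G1 and G2 are the first and second period
# grades, G3 is the final grade used as the prediction target.
PERIOD_GRADES = ("G1", "G2")
TARGET = "G3"


def input_setups(columns: List[str]) -> Dict[str, List[str]]:
    """Return the input columns for three assumed setups.

    A: every input, including both period grades.
    B: every input except the second period grade.
    C: every input except both period grades.
    """
    features = [c for c in columns if c != TARGET]
    return {
        "A": features,
        "B": [c for c in features if c != "G2"],
        "C": [c for c in features if c not in PERIOD_GRADES],
    }


# Usage with a few illustrative attributes
setups = input_setups(["age", "absences", "G1", "G2", "G3"])
for name, cols in setups.items():
    print(name, cols)
```

Each setup would then be paired with each of the four DM algorithms and each of the three goals, so that the effect of removing the period grades can be compared across models.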
Hardware and Software Requirements:
- Windows 7, 8 or 10 (32- or 64-bit)
- RAM: 4 GB
- Python IDLE
- Anaconda Navigator