EarthRef.org Reference Database (ERR) -- Bowden et al. 2002

The way that available data are divided into training, testing, and validation subsets can have a significant influence on the performance of an artificial neural network (ANN). Despite numerous studies, no systematic approach has been developed for the optimal division of data for ANN models. This paper presents two methodologies for dividing data into representative subsets, namely, a genetic algorithm (GA) and a self-organizing map (SOM). These two methods are compared with the conventional approach commonly used in the literature, which involves an arbitrary division of the data. A case study is presented in which ANN models developed using each data division technique are used to forecast salinity in the River Murray at Murray Bridge (South Australia) 14 days in advance. When tested on a validation data set from July 1992 to March 1998, the models developed using the GA and SOM data division techniques reduced the RMS error by 24.2% and 9.9%, respectively, relative to the conventional data division method. It was found that a SOM could be used to diagnose why an ANN model has performed poorly, provided that the poor performance is primarily related to the data themselves and not to the choice of the ANN's parameters or architecture.
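
The idea of SOM-based data division can be illustrated with a minimal sketch. The abstract does not give the map size, training schedule, or sampling rule used in the paper, so everything below is an assumption made for illustration: a small one-dimensional SOM clusters the samples, and each cluster is then split 60/20/20 so that the training, testing, and validation subsets all draw from every region of the data. The function names (`train_som`, `best_unit`, `divide_by_cluster`) and the synthetic data are hypothetical.

```python
import math
import random


def best_unit(x, weights):
    """Return the index of the SOM unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda u: sum((a - b) ** 2 for a, b in zip(x, weights[u])))


def train_som(samples, n_units=3, n_epochs=30, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOM (assumed settings, not those of the paper)."""
    rng = random.Random(seed)
    # Initialise each unit from a randomly chosen sample.
    weights = [list(s) for s in rng.sample(samples, n_units)]
    for epoch in range(n_epochs):
        decay = 1.0 - epoch / n_epochs          # linear decay schedule
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 1e-3)       # neighbourhood width
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            bmu = best_unit(x, weights)
            for u, w in enumerate(weights):
                # Gaussian neighbourhood around the best-matching unit.
                h = math.exp(-((u - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


def divide_by_cluster(samples, weights, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split every SOM cluster into training/testing/validation indices."""
    rng = random.Random(seed)
    clusters = {}
    for i, x in enumerate(samples):
        clusters.setdefault(best_unit(x, weights), []).append(i)
    train, test, valid = [], [], []
    for unit in sorted(clusters):
        members = clusters[unit]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_test = round(fractions[1] * len(members))
        train += members[:n_train]
        test += members[n_train:n_train + n_test]
        valid += members[n_train + n_test:]     # remainder goes to validation
    return sorted(train), sorted(test), sorted(valid)


# Synthetic two-feature data with three regimes (illustrative only).
data_rng = random.Random(1)
samples = [[data_rng.gauss(m, 0.1), data_rng.gauss(-m, 0.1)]
           for m in (0.0, 1.0, 2.0) for _ in range(50)]

weights = train_som(samples)
train, test, valid = divide_by_cluster(samples, weights)
print(len(train), len(test), len(valid))
```

In contrast, an arbitrary division (for example, taking the first 60% of a time series for training) can leave an entire regime out of one subset, which is the kind of data-related failure the abstract says a SOM can help diagnose.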