Traffic Congestion Analysis using an Autoencoder for Feature Selection and Anomaly Detection - Yazmin Elizabet Martinez, Wolfgang Bein
Road traffic congestion and its causes are heavily researched because of their impact on daily life. The abundance of automatically captured traffic data allows us to analyze large datasets, but labeling that data requires significant resources; this is where unsupervised and semi-supervised machine learning techniques become useful. Our goal was to analyze road traffic congestion to find anomalies of heavy congestion caused by signal issues, so we developed an Autoencoder, a semi-supervised neural network. An Autoencoder has two core components: an encoder that compresses the input data and a decoder that attempts to reconstruct the original input. The encoder learns the latent, or most important, characteristics of the data. For our primary dataset, we used the Regional Transportation Commission's (RTC) open dataset "Trip Status at Signal," which includes 641,807 data points with 13 input features for signal performance metrics captured during September 2024. The input features are 'Signal_Id', 'Speed', 'TotalDelay', 'IntersectionDelay', 'NumberOfStops', 'TotalStopTime', 'LinkSpeed', 'LinkLength', 'LinkSpeedRatio', 'HourOfDay', 'DayOfWeek', 'DayType', and 'PeakHour'. To prepare the data for training the Autoencoder, we used the congestion measure in the dataset, the Speed Performance Index ('LinkSpeedRatio'), to separate normal congestion from abnormal congestion (SPI < 10). After training the Autoencoder, we analyzed the encoder weights, which represent the importance of each input feature. To do so, we generated a correlation matrix of Pearson product-moment correlation coefficients between the encoder weights to find input features that are highly correlated. The correlation matrix allowed us to minimize the number of features in the dataset, namely feature selection.
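The weight-correlation step can be sketched as follows. This is an illustrative example, not the authors' code: the feature names are a subset of the dataset's inputs, and the encoder weight matrix here is randomly generated as a stand-in for trained weights (one row per input feature, one column per latent unit).

```python
import numpy as np

# Hypothetical encoder weight matrix: one row per input feature, one column
# per latent unit. Random values stand in for trained encoder weights.
feature_names = ["Speed", "TotalDelay", "NumberOfStops", "LinkSpeed", "LinkSpeedRatio"]
rng = np.random.default_rng(0)
encoder_weights = rng.normal(size=(len(feature_names), 8))

# Pearson product-moment correlation between the feature weight vectors.
corr = np.corrcoef(encoder_weights)

# Pairs of features whose weight vectors are highly correlated (|r| > 0.9)
# are candidates for removal: the encoder treats them as redundant.
redundant_pairs = [
    (feature_names[i], feature_names[j])
    for i in range(len(feature_names))
    for j in range(i + 1, len(feature_names))
    if abs(corr[i, j]) > 0.9
]
```

Dropping one feature from each highly correlated pair reduces the input dimensionality while keeping the information the encoder actually uses.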
We identified the input features in order of importance (high scores indicate high similarity; scores near zero indicate no correlation) and found that the top five input features the Autoencoder learned for detecting heavy road traffic congestion (SPI below 10) are 'NumberOfStops', 'HourOfDay', 'LinkSpeed', 'LinkSpeedRatio', and 'Speed'. This result came from a simple Autoencoder with one dense layer for the encoder and one dense layer for the decoder. The Autoencoder detected heavy congestion anomalies with 64% accuracy. Our experiments found that a deep Autoencoder with more than two layers (one encoder layer and one decoder layer) performed worse, with lower accuracy, when detecting anomalies. We believe the difficulty comes from four things: some of the input features are independent of each other; we need additional months of data; we need different types of inputs (secondary datasets of signal settings, weather, and traffic incidents); and we need to explore other Autoencoder architectures. We also plan to investigate methods to identify patterns in traffic congestion caused by traffic signals in order to retrain the Autoencoder and improve our results.
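The overall approach (one dense encoder layer, one dense decoder layer, anomalies flagged by reconstruction error) can be sketched in plain NumPy. This is a minimal illustration under assumptions not stated in the abstract: linear activations, synthetic standard-normal data in place of the scaled RTC features, and a 95th-percentile error threshold; the actual model, preprocessing, and threshold may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for scaled traffic features (13 inputs, as in the paper).
n_features, latent_dim = 13, 5
normal = rng.normal(0.0, 1.0, size=(2000, n_features))

# One dense encoder layer and one dense decoder layer; linear activations
# keep the sketch short (the real model likely uses nonlinearities).
W_enc = rng.normal(0, 0.1, size=(n_features, latent_dim))
W_dec = rng.normal(0, 0.1, size=(latent_dim, n_features))
lr = 0.01
for _ in range(200):
    z = normal @ W_enc            # encode: compress to latent_dim
    recon = z @ W_dec             # decode: reconstruct the input
    err = recon - normal          # reconstruction error
    # Gradient descent on mean squared reconstruction error.
    W_dec -= lr * (z.T @ err) / len(normal)
    W_enc -= lr * (normal.T @ (err @ W_dec.T)) / len(normal)

def reconstruction_error(x):
    """Per-sample mean squared reconstruction error."""
    return np.mean((x @ W_enc @ W_dec - x) ** 2, axis=1)

# Flag anomalies: samples reconstructed worse than the 95th percentile of
# errors on normal-congestion data (threshold choice is an assumption).
threshold = np.percentile(reconstruction_error(normal), 95)
abnormal = rng.normal(0.0, 4.0, size=(100, n_features))  # heavy-congestion stand-in
flagged = reconstruction_error(abnormal) > threshold
```

Because the trained model reconstructs normal traffic well, records that reconstruct poorly (error above the threshold) are reported as heavy-congestion anomalies.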
This work was supported by U.S. DOT PSR UTC grant 69A355234309 (Subaward from USC) and by National Science Foundation Grant EAGER 2433820. The research was conducted at UNLV’s Center for Information Technology and Algorithms in collaboration with UNLV’s Transportation Research Center.