Time Series Prediction with Recurrent Neural Networks
Languages and Skills: Python, Artificial Neural Networks, PyTorch, Time
Series Prediction, Data Analysis
Full report pdf
With this project, I set out to create an AI model to forecast the attendance at the campus
gym.
Near the end of my third year, I started going to the campus gym regularly. I eventually learned
that the staff post live attendance updates on Twitter every 30 minutes. At first, with no frame of
reference, I found it hard to tell whether 80 people in the weight room meant busy or not, so I
wrote a script to scrape as much of their tweet data as possible and graph it. The Python package I
was using could scrape the most recent ~800 tweets. Running it once a month for 8 months, I
accumulated over 5000 tuples from May 2nd, 2024 onward.
Usage by Day of the Week
I referred to this as my day of the week average
model because it took every data point for each day of the week and hour and then simply
averaged them together. This is the data I was most interested in initially as it helped
me pick the best day and time to go to the gym. Looking back, I probably could have just
googled "what day is the gym least busy?" but that's beside the point. I ended up
picking early afternoon on Sundays. I also had charts for each month separately so I
could see how different months were more or less busy.
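The averaging behind this kind of chart can be sketched in a few lines. This is a minimal illustration rather than the original script: the `(timestamp, count)` layout, the function name `day_hour_averages`, and the sample readings are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def day_hour_averages(readings):
    """Average attendance for every (day of week, hour) pair.

    `readings` is an iterable of (timestamp, count) pairs; this layout is
    an assumption made for the sketch.
    """
    buckets = defaultdict(list)
    for timestamp, count in readings:
        # weekday(): Monday is 0 and Sunday is 6.
        buckets[(timestamp.weekday(), timestamp.hour)].append(count)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up readings, for illustration only.
sample = [
    (datetime(2023, 4, 16, 13, 0), 30),   # Sunday, 1:00 p.m.
    (datetime(2023, 4, 16, 13, 30), 34),  # Sunday, 1:30 p.m.
    (datetime(2023, 4, 23, 13, 0), 26),   # the following Sunday
    (datetime(2023, 4, 18, 17, 0), 80),   # Tuesday, 5:00 p.m.
]
averages = day_hour_averages(sample)
# The (day of week, hour) slot with the lowest average attendance.
quietest = min(averages, key=averages.get)
```

Looking up the smallest average, as in the last line, is one way to pick the quietest slot; filtering the readings by month first would give the per-month charts.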
I quickly realized I wanted to do more with the data I was accumulating. So, for my AI 2 final
project, I challenged myself to learn about recurrent neural networks and create a long short-term
memory (LSTM) model that could perform time series prediction and forecast how many people there
would be at the gym on any given day. It would do this by taking the previous five hours of
attendance readings as input and then predicting each hour's attendance for the rest of the day.
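The idea of forecasting the rest of the day from five readings can be sketched as a rolling, one-step-at-a-time loop. This is only an illustration under stated assumptions: `forecast_rest_of_day`, `naive_predictor`, and the opening hours are hypothetical, and the stand-in predictor simply averages the window where a trained LSTM would go.

```python
from collections import deque


def forecast_rest_of_day(recent_counts, next_hour, closing_hour, predict_next):
    """Roll a one-step predictor forward until closing time.

    recent_counts: the last five hourly attendance readings, oldest first.
    predict_next:  any callable (window, hour) -> predicted count; a trained
                   model would be plugged in here.
    """
    window = deque(recent_counts, maxlen=5)
    forecast = {}
    for hour in range(next_hour, closing_hour + 1):
        predicted = predict_next(list(window), hour)
        forecast[hour] = predicted
        # Feed the prediction back in; the oldest reading drops out.
        window.append(predicted)
    return forecast


def naive_predictor(window, hour):
    """Stand-in for the trained model: the mean of the current window."""
    return sum(window) / len(window)


# Made-up readings for 7 a.m. to 11 a.m.; forecast noon through 2 p.m.
forecast = forecast_rest_of_day(
    [10, 20, 30, 40, 50], next_hour=12, closing_hour=14,
    predict_next=naive_predictor,
)
```

Because each prediction becomes part of the next input window, re-running the loop later in the day with fresh observations lets the forecast react to an unusually busy or quiet day.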
During my research and development, I discovered a previous student's attempt at solving my problem,
which they referred to as GyMBRo, AKA Gym
Monitoring By Robot. Clever name. Every day this bot would post a morning forecast graph on Twitter
and, throughout the day, fetch tweets from the Rec Centre and plot those points on the graph. The
issue I had with this implementation is that it did not adjust its prediction throughout the day,
say if the gym were abnormally busy. I planned to have mine do so by forecasting from the five most
recent hours of readings. GyMBRo did, however, give me a good comparison for my forecasts
and, when I found its source code, an additional 7 years of attendance data (50 000 tuples) to help
with model training. Because good LSTM models are known to be very dependent on training with lots
of data, I knew this was a big win. Thanks Demetri!
In brief, I engineered my labeled tuples to consist of 5
features: the month (1-12), the week (0-52), the day (1-31), the hour (0-23), and the day of the
week (0-6). The label was the number of people in the weight room. The label also served as an input
feature: each prediction was fed back into the model to get the next one. I also added ~1000 tuples to my
dataset with labels of '0' for when the gym was closed. After feature engineering, I took 80% of the
days and saved them for training, 10% for validation, and the remaining 10% for testing. I then used
PyTorch to train my LSTM model. For each hyperparameter combination, I trained for 100 epochs
and saved the model with the best validation mean squared error (MSE). Then I chose the hyperparameter
combination with the best test MSE. I ended up creating a model with a test MSE of 541.6. This is
noticeably better than my simple day of the week average model, which had an MSE of 752.2
on the same test dataset. For more in-depth background information, methodology, results, and
conclusions of my research, find my full
report here.
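The feature engineering and the day-level split described above could look roughly like this. It is a sketch under assumptions, not the code from the report: the week index is computed as `(day of year - 1) // 7` because that yields 0-52, the day of the week uses Python's Monday = 0 convention, and the days are shuffled before splitting; the original may differ on all three. The function names are hypothetical.

```python
import random
from datetime import date, datetime, timedelta


def make_features(ts):
    """Return (month, week, day, hour, day_of_week) for one timestamp."""
    day_of_year = ts.timetuple().tm_yday
    week = (day_of_year - 1) // 7  # assumption: yields 0-52
    return (ts.month, week, ts.day, ts.hour, ts.weekday())


def split_days(days, seed=0):
    """Shuffle whole days, then split them 80/10/10.

    Splitting by day (not by tuple) keeps every reading from one day
    in the same subset.
    """
    days = sorted(set(days))
    random.Random(seed).shuffle(days)
    n_train = int(len(days) * 0.8)
    n_val = int(len(days) * 0.1)
    train = days[:n_train]
    val = days[n_train:n_train + n_val]
    test = days[n_train + n_val:]
    return train, val, test


features = make_features(datetime(2023, 4, 18, 17, 0))  # a Tuesday, 5 p.m.
all_days = [date(2023, 1, 1) + timedelta(days=i) for i in range(100)]
train_days, val_days, test_days = split_days(all_days)
```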
Hyperparameter Tuning
This chart shows my process of finding the right
hyperparameters for training. The column in light green shows the best combination I
found. Each other column has a grey value that differs from the best combination. The
validation and test mean squared error can be seen for each combination at the bottom,
colour-coded from green (best) to red (worst). The test results also include root mean squared error (RMSE).
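Training and scoring a single column of such a chart might be sketched as follows. Everything specific here is an assumption for illustration: the class name `AttendanceLSTM`, six inputs per time step (the five calendar features plus the attendance count), the Adam optimizer, the hidden size, the learning rate, and full-batch updates on random stand-in data. Only the overall pattern, training with PyTorch and keeping the weights with the best validation MSE, comes from the description above.

```python
import copy

import torch
from torch import nn


class AttendanceLSTM(nn.Module):
    """LSTM over a 5-step window; each step holds 6 values (assumed layout)."""

    def __init__(self, n_features=6, hidden_size=32, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        output, _ = self.lstm(x)            # (batch, steps, hidden)
        return self.head(output[:, -1, :])  # predict from the last step


def train_one_combination(train, val, hidden_size, lr, epochs):
    """Train one hyperparameter combination; keep the best-validation weights."""
    (x_train, y_train), (x_val, y_val) = train, val
    model = AttendanceLSTM(hidden_size=hidden_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    best_mse, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_mse = loss_fn(model(x_val), y_val).item()
        if val_mse < best_mse:
            best_mse = val_mse
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_mse


torch.manual_seed(0)
# Random stand-in data, only to show the tensor shapes.
train = (torch.rand(64, 5, 6), torch.rand(64, 1))
val = (torch.rand(16, 5, 6), torch.rand(16, 1))
model, best_val_mse = train_one_combination(
    train, val, hidden_size=16, lr=0.01, epochs=5
)
prediction = model(val[0])
```

Calling `train_one_combination` once per hyperparameter combination and recording each validation and test MSE would produce one column of the chart per call.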
Through this research, I learned a lot about what it takes to create an AI model,
the steps needed to format training and testing data, and how to communicate my findings.
Ultimately, I considered my model a success. It performed better than my previous attempts at simply
averaging previous
days' data and, often, I found it to perform better than the GyMBRo model. However, I did not find the
model's forecasts to be very accurate at the start of the day, before any attendance
observations had been made. I think the model could be improved by
adding more features, specifically for upcoming holidays and open/closed hours. As well, perhaps
combining two models for a forecast would be beneficial. Using a different architecture for the
morning forecast could counteract the model's tendency to underestimate in the morning.
To run my code yourself, find my repo here and run predict.py in the LSTM folder.
Animated Forecast for April 18, 2023
This is an animation of the hourly forecasts made throughout the day for April
18, 2023. The red line shows the prediction made by the model for the
remainder of the day based on the 5-hour-long input sequence in blue. The green line
shows the actual attendance reported by Rec Centre staff. This green line is what the
red prediction line should, in theory, match or closely follow. The MSE of each
prediction is shown at the top. The initial morning performance is
noticeably worse than later in the day.
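The MSE shown for each forecast is the average squared gap between the predicted and observed values for the remaining hours. A minimal sketch of that calculation, using made-up hourly numbers rather than real data from the animation:

```python
import math


def mse(predicted, actual):
    """Mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Made-up hourly values for the remainder of one day.
forecast = [60, 70, 80, 50]
observed = [62, 75, 70, 50]
forecast_mse = mse(forecast, observed)   # (4 + 25 + 100 + 0) / 4
forecast_rmse = math.sqrt(forecast_mse)  # same units as the attendance count
```

Taking the square root puts the error back in units of people: the test MSE of 541.6 reported above corresponds to an RMSE of about 23.3 people, compared with about 27.4 for the day of the week average model (MSE 752.2).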
Poor Forecasts for March 30, 2023
Here you can see three predictions made for March 30, 2023.
In black, my model's forecast made at 5 a.m. In red, my model's forecast made at 11 a.m.
In orange, the average attendance for a Thursday in March. Green shows actual attendance
observations. This day was abnormally busy, likely due to the end of the term
approaching, and the forecasts were poor. The 11 a.m. forecast predicts a busier
day overall, but converges on a normal day by closing, which was not correct.