Volcanic Eruption Prediction

Suraj Bichkunde

Introduction

Predicting and forecasting environmental activity is a difficult task, but with new algorithms and technologies these forecasts are becoming more accurate over shorter time scales. Weather reports are a very good example of this. Just like a weather report, here we are trying to predict volcanic eruptions.

The traditional way of solving this problem involves a lot of domain knowledge in volcanology, hydrology, thermal geophysics, etc., and also requires a lot of human effort to analyze all the data and make predictions about eruptions. In this case study we try to solve the same problem with the help of techniques and algorithms from data science and machine learning.

Business Problem

Just one unanticipated eruption can result in tens of thousands of lives lost. For very active volcanoes, current approaches can predict eruptions a few minutes in advance, but they usually fail at long-term predictions. So the business problem here is to reliably predict when a volcano will next erupt. If the new system also reduces the human effort required for the prediction, that is a plus. If the system is reliable and accurate over longer periods of time, it can make a big impact: evacuations could be more timely, saving many lives and much of the property surrounding the volcanoes.

ML Formulation

In machine learning terms, this is a supervised learning problem, as we will be using previously recorded, labeled sensor data to train the models. We are predicting, from the current time (the time at which the readings were recorded on the sensors), how long until the volcano is likely to erupt. This time is a continuous random variable, which makes this a regression problem.

Business Constraints

There are no very strict time or resource constraints, but the time and resources taken by the system/model should be reasonable.

Dataset Information

To train the model we will be using the open-source data provided by the National Institute of Geophysics and Volcanology. The data can be found at this link. It contains different segments of sensor data along with their time to eruption. Each segment contains data from 10 different sensors placed at different locations around the volcano, and each sensor contributes 60K readings per segment. As per the sources, each segment corresponds to 10 minutes of logs recorded on the sensors. The training and testing data combined take about 29 GB+ on the file system.

Performance Metric

We want the predictions of the model to be as close to the actual values as possible, so we need to penalize the model for deviating from the actual value. Since this problem falls under the regression setting, mean squared error or mean absolute error can be used to measure the performance of the model. As per the analysis, there are no extreme outliers in the data, so we do not have to worry about unusually large errors. So mean absolute error is a good metric to measure the performance of the model.

Exploratory data analysis

In the Kaggle competition for volcanic eruption prediction, you will find one zip file which contains all the training and testing data, along with sample_submission.csv and train.csv. The train.csv file contains two columns, segment_id and time_to_eruption. Corresponding to each segment_id present in this file, there is one csv file in the train/ folder. That file represents the segment; as stated earlier, each segment is 10 minutes of logs/readings from 10 different sensors arrayed around the volcano. The time_to_eruption in train.csv is the target value which we need to predict given the segment, i.e. a csv file with 10 columns and 60,001 rows.
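
As a minimal sketch of loading these files (the paths follow the competition layout described above; this is illustrative, not the exact notebook code):

```python
import pandas as pd

# train.csv: one row per segment, columns segment_id and time_to_eruption
train = pd.read_csv("train.csv")
print(train.shape, train.columns.tolist())

# each segment lives in its own csv inside train/, one column per sensor
sample_id = train["segment_id"].iloc[0]
segment = pd.read_csv(f"train/{sample_id}.csv")   # ~60,001 rows x 10 sensor columns
print(segment.shape)
```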

Analysis of the Target Value

The target value, time_to_eruption, is the time interval between the segment's readings and the next volcanic eruption. From the given data and the information provided in the Kaggle overview, it is a little unclear to me what exactly the unit of time is here, but that is not a major issue; at least we know that what we are predicting is a continuous random variable. Let's calculate some basic statistics and see how the probability density function and the cumulative distribution function of the target variable look, with the help of seaborn.
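
A quick sketch of how these statistics and plots can be produced with pandas and seaborn (assuming the `train` dataframe loaded earlier; not the exact notebook code):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tte = train["time_to_eruption"]
print(tte.describe())
print("skew:", tte.skew(), "kurtosis:", tte.kurt())

# estimated PDF and CDF of the target
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.kdeplot(tte, ax=axes[0]).set_title("PDF of time_to_eruption")
sns.ecdfplot(tte, ax=axes[1]).set_title("CDF of time_to_eruption")
plt.show()
```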

As we can see, there are no extreme outliers present in the dataset. The value of the skew and the estimated PDF show that the distribution is not exactly symmetric, but the skew is close to zero, which indicates it is close to symmetric. The small negative value of the kurtosis, along with the PDF plot, says the distribution does not have long/fat tails.

Analysis of the Sample Segments

Let's check some sample segments to get an idea of what type of sensor and signal data we are working with. We will use matplotlib's pyplot to go over all the sensors in a csv file, i.e. go column by column through the pandas dataframe.
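
One way to do this, going column by column over the `segment` dataframe loaded earlier (a sketch, not the exact notebook code):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(10, 1, figsize=(14, 20), sharex=True)
for ax, col in zip(axes, segment.columns):
    ax.plot(segment[col].values, linewidth=0.5)   # raw readings for one sensor
    ax.set_ylabel(col)
axes[-1].set_xlabel("reading index")
plt.tight_layout()
plt.show()
```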

While playing with the data, to get good and detailed insights, you should check at least 20 sample segments before drawing any conclusions (this is highly dependent on the size of the dataset; in this case 20 is good). To gain some more insights, we can compare the sensor data of two segments. It is really interesting to see how the sensor data of the segment with the minimum time to eruption compares with that of the segment with the maximum time to eruption. To keep the article shorter I am adding only one comparison here, but I got some amazing insights comparing different segments, such as comparing the minimum time to eruption with the maximum, mean, median, etc., and the same for the maximum one; you can play around there.
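
A sketch of how the minimum vs. maximum time-to-eruption comparison can be set up (paths and variable names as assumed in the earlier snippets):

```python
import pandas as pd
import matplotlib.pyplot as plt

min_id = train.loc[train["time_to_eruption"].idxmin(), "segment_id"]
max_id = train.loc[train["time_to_eruption"].idxmax(), "segment_id"]
seg_min = pd.read_csv(f"train/{min_id}.csv")
seg_max = pd.read_csv(f"train/{max_id}.csv")

# sensor-by-sensor, side-by-side comparison of the two segments
fig, axes = plt.subplots(10, 2, figsize=(14, 20), sharex=True)
for i, col in enumerate(seg_min.columns):
    axes[i, 0].plot(seg_min[col].values, linewidth=0.5)
    axes[i, 1].plot(seg_max[col].values, linewidth=0.5)
    axes[i, 0].set_ylabel(col)
axes[0, 0].set_title("minimum time_to_eruption")
axes[0, 1].set_title("maximum time_to_eruption")
plt.tight_layout()
plt.show()
```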

The important points from the analysis of the sample signals and the comparisons are: there are missing values present in the dataset, and they are of two types. In some segments, some values are missing; in other segments, entire sensors are missing. In some segments a sensor records only a single value the whole time, e.g. all zeros, which might be a recording error or a sensor that was not working at the time of the reading. The comparison of the sample segments showed that there is a significant difference between the sensor readings at the minimum and maximum time to eruption, and the same holds for the other comparisons, which indicates that we can build a machine learning model that can differentiate between these and, given a sample segment, hopefully predict the time to eruption.

Missing Value Analysis

As seen earlier, there are some missing values present in the data. Let's try to find out whether there is any specific pattern in the missing values. Here we will try to find out how frequently each sensor is completely missing from a segment, and, if a sensor is not completely missing, what percentage of its data is missing across different segments. We can also look for things like groups of sensors which go missing together. This information might be very helpful while doing the actual modeling for the predictions. To avoid code duplication, first we will write a generic function to log all the missing-data information in one file. This gives us two advantages: there is no need to write the same code again for the testing data, and it gives us more structured data to perform the analysis on.
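
One possible shape of that generic function (the file names and column naming here are my own assumptions, not the exact notebook code):

```python
import os
import pandas as pd

def log_missing_info(segment_dir, segment_ids, out_path):
    """For every segment, record per-sensor missing counts and write them all to one csv."""
    rows = []
    for seg_id in segment_ids:
        seg = pd.read_csv(os.path.join(segment_dir, f"{seg_id}.csv"))
        info = {"segment_id": seg_id}
        for col in seg.columns:
            n_missing = seg[col].isna().sum()
            info[f"{col}_missing"] = n_missing
            info[f"{col}_all_missing"] = int(n_missing == len(seg))
        rows.append(info)
    missing_df = pd.DataFrame(rows)
    missing_df.to_csv(out_path, index=False)
    return missing_df

missing_df = log_missing_info("train", train["segment_id"], "missing_train.csv")
```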

Now, after calling the function, we have a data frame which contains all the missing-data information. Let's plot simple bar plots to check which sensors are completely missing and in how many segments.
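
For example, counting per sensor the segments where it is entirely missing (an illustrative sketch using the `missing_df` built above):

```python
import matplotlib.pyplot as plt

all_missing_cols = [c for c in missing_df.columns if c.endswith("_all_missing")]
counts = missing_df[all_missing_cols].sum()
counts.index = [c.replace("_all_missing", "") for c in counts.index]
counts.plot(kind="bar", figsize=(10, 4), title="Segments with the sensor completely missing")
plt.show()
```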

Also, let's check whether there are specific groups of sensors which frequently go missing together.
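
A simple way to check this is to look at how the "sensor completely missing" indicators correlate with each other (again a sketch on the assumed `missing_df`):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# high correlation between two indicators means those sensors tend to drop out together
sns.heatmap(missing_df[all_missing_cols].corr(), annot=True, fmt=".2f", cmap="viridis")
plt.show()
```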

We can do the same analysis for the testing data (it is considered bad practice to do any type of analysis on the testing data, and you should generally avoid this); here we just want to be sure that the missing-value and missing-sensor patterns in the training and testing data are alike. This might help us take decisions about handling the missing values. To avoid repeating the code, I am only adding the outputs of the testing-data missing-value analysis.

From the plots it is very clear that some sensors are missing quite frequently. And as we suspected earlier, there are groups of sensors missing together, like sensor 2 and sensor 8. At this point it is still unclear which imputation technique will give better results; we can check that through experiments during modeling.

Feature Engineering

Now comes the best part, the art part of the cycle. The data we have are sensor readings recorded over time, i.e. time series data. We cannot give the complete series or segment as input to the model; if we did, it would be very high dimensional. With that approach we would need a (60K x 10)-dimensional vector to represent one data point, which would not be memory efficient, would require very expensive computational resources, and of course would suffer from the curse of dimensionality. Fortunately, people have done a lot of research on featurizing time series, so there are many options. Tsfresh is a very helpful python library which can help with any type of time series problem, and it is highly efficient and scalable. But we will start with basic features first, and then, depending on how the model performs on those basic features, we will add more.

For the most basic features, we can consider simple statistics of each sensor present in the segment: the sum, mean, median, min, max, standard deviation, skewness, kurtosis, etc. Writing separate functions to calculate different types of features gives a logical structure to the data-preparation pipeline, which also makes it very easy to prepare the testing data. So, let's write the function to calculate the basic features.
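
A sketch of such a function; the feature names are my own and may differ from the notebook:

```python
def basic_features(segment, seg_id):
    """Simple per-sensor statistics for one segment."""
    feats = {"segment_id": seg_id}
    for col in segment.columns:
        s = segment[col]
        feats[f"{col}_sum"] = s.sum()
        feats[f"{col}_mean"] = s.mean()
        feats[f"{col}_median"] = s.median()
        feats[f"{col}_min"] = s.min()
        feats[f"{col}_max"] = s.max()
        feats[f"{col}_std"] = s.std()
        feats[f"{col}_skew"] = s.skew()
        feats[f"{col}_kurtosis"] = s.kurt()
    return feats
```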

In the sample segments, we saw that the sensor data fluctuates around certain values within certain windows. Different quantile values also help us form a rough idea of the values present in different parts/time windows of the complete sensor data. Having different quantile values as features can definitely improve the performance of the model, so let's compute the quantile features as well.
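
A sketch along the same lines; the particular quantile levels here are an illustrative choice:

```python
def quantile_features(segment, seg_id, quantiles=(0.01, 0.05, 0.25, 0.75, 0.95, 0.99)):
    """Per-sensor quantile values for one segment."""
    feats = {"segment_id": seg_id}
    for col in segment.columns:
        for q in quantiles:
            feats[f"{col}_q{q}"] = segment[col].quantile(q)
    return feats
```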

The Fast Fourier Transform is considered one of the most important algorithms in the history of mathematics, and it is at the heart of most modern signal-processing systems. The simplest explanation of the FFT is that it takes one complex signal and decomposes it into a sum of sine and cosine waves. Yes, it sounds magical, but at its core it is a simple and beautiful combination of mathematics, calculus and physics; check out these links (link1, link2) for more details. Taking the coefficients of the fast Fourier transform as features for the model can really help it achieve very good results.
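
A sketch of FFT-based features: the magnitudes of the first few Fourier coefficients of each sensor signal, with the number of coefficients kept being an illustrative choice rather than the notebook's exact setting:

```python
import numpy as np

def fft_features(segment, seg_id, n_coeffs=20):
    """Magnitudes of the leading FFT coefficients for every sensor in a segment."""
    feats = {"segment_id": seg_id}
    for col in segment.columns:
        signal = segment[col].fillna(0).values        # the FFT cannot handle NaNs
        spectrum = np.abs(np.fft.rfft(signal))
        for i in range(n_coeffs):
            feats[f"{col}_fft_{i}"] = spectrum[i]
    return feats
```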

We have done some analysis of the missing data, and we can use that information as features for our model too. The intuition behind using missing data as features is that the model can look at the missing-data information and decide how much importance to give to the values of those sensors; in other words, I am hoping the model will do something like this. The inspiration for using the missing-data information came from the forget gate in LSTM layers, which I am currently learning about.
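
These indicators can be turned into features in the same style as the other feature functions (a sketch):

```python
def missing_features(segment, seg_id):
    """Per-sensor missing fraction and a 'sensor completely missing' flag."""
    feats = {"segment_id": seg_id}
    n = len(segment)
    for col in segment.columns:
        n_missing = segment[col].isna().sum()
        feats[f"{col}_missing_frac"] = n_missing / n
        feats[f"{col}_all_missing"] = int(n_missing == n)
    return feats
```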

Modeling

I will add the link to the Google Colab notebook, which contains all the experiments with different models and hyperparameters that I ran on these features, just to avoid making the blog lengthy. The linear models, linear regression and support vector machine, did not perform well on the features. This might be due to the dimensionality of the final data, or it might simply not be possible for a linear hyperplane to fit the data well.

To check the assumption that dimensionality was causing the problems, I used PCA to reduce the dimensionality and then tried the linear models, but there was still no significant improvement in model performance. The logical progression from here is: if linear models are not working, go for non-linear models, and the most popular choice is ensemble models. But before using random forest or gradient boosted decision trees, one should check how a single decision tree performs on the data. If the decision tree performs reasonably well, then we can go for the ensemble models and expect good performance from them. But if the decision tree completely fails on the data, there is very little chance that random forest or gradient boosted decision trees will perform well. This check also helps with hyperparameter tuning, like max_depth, number of estimators, etc.
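
A sketch of those two sanity checks, assuming `X_train`, `y_train`, `X_val`, `y_val` hold the engineered features and targets (the exact splits and hyperparameters in the notebook may differ):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# linear model on PCA-reduced features
lin = make_pipeline(PCA(n_components=50), LinearRegression()).fit(X_train, y_train)
print("PCA + linear regression MAE:", mean_absolute_error(y_val, lin.predict(X_val)))

# single decision tree as a baseline before trying ensembles
tree = DecisionTreeRegressor(max_depth=8).fit(X_train, y_train)
print("Decision tree MAE:", mean_absolute_error(y_val, tree.predict(X_val)))
```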

In this case, the decision tree worked well, and accordingly the ensemble methods worked significantly better than the linear models. We can do a simple grid search on xgboost to get the best models. To gain a slightly deeper understanding, we also check and compare the feature importances of the two best models, the random forest and the GBDT, and see which of the top important features they have in common. Here are the importances from the random forest model.

We can do a simple random search on the hyperparameters to find the best GBDT model.
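
One way to set that up with xgboost and scikit-learn; the parameter ranges are illustrative, not the exact grid from the notebook:

```python
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
}

search = RandomizedSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_distributions=param_dist,
    n_iter=25,
    scoring="neg_mean_absolute_error",
    cv=3,
    verbose=1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```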

Results

Here are some of the best results I got from the different experiments and hyperparameter tuning with different features.

Deployment

I have also deployed the best GBDT model with Flask on Heroku; you can find it at this link. The web app allows users to upload a csv file, i.e. a segment, to the server. The app validates the file, which must follow the structure of a segment. Along with the predicted time to eruption, the app also reports the most important features the model considered when making the prediction, together with their importances and their calculated values in the provided segment.
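
A minimal sketch of what such a Flask endpoint can look like; `featurize` and `model` are placeholders for the real feature pipeline and the trained GBDT, and the route name is my own:

```python
from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    segment = pd.read_csv(request.files["segment"])
    if segment.shape[1] != 10:                      # must follow the segment structure
        return jsonify({"error": "expected 10 sensor columns"}), 400
    features = featurize(segment)                   # placeholder: feature pipeline from above
    prediction = model.predict(features)[0]         # placeholder: trained GBDT model
    return jsonify({"time_to_eruption": float(prediction)})
```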

Conclusions and Future Work

Compared to the Kaggle leaderboard for this problem, I was able to get a very good score on both the public and private leaderboards with very basic features. Earlier I talked about tsfresh for feature engineering, but due to the limits on my computational resources I was not able to use the tsfresh features. According to my estimates, it would take more than 3 days to featurize the whole training data with tsfresh even on very high-end hardware. I am planning to do that with Google Cloud or AWS. I am also just getting started with deep learning, and I think deep learning models with LSTM layers, together with the rich features generated using tsfresh, could significantly improve the performance.

References

Thanks to the National Institute of Geophysics and Volcanology for providing the dataset and hosting the Kaggle competition. I took help from many sources while doing this case study, starting with the documentation of all the libraries I used and the discussion section of the Kaggle competition. Here are some of the most important sources, other than documentation, which were very helpful in understanding, approaching and solving this problem.

  1. Use of Machine Learning to Analyze and Hopefully Predict Volcano Activity — https://scholarworks.utep.edu/cs_techrep/1053/
  2. TSFRESH — https://tsfresh.readthedocs.io/en/latest/
  3. INGV — Volcanic Eruption Prediction. EDA. Modeling — https://www.kaggle.com/isaienkov/ingv-volcanic-eruption-prediction-eda-modeling

Special thanks to the team at Applied AI Course ( https://www.appliedaicourse.com/ ) for the amazing course.

Thanks for reading, you can find me on linkedin.

Here is the link for github and colab notebook.

