EDICOM Analytics: Path towards machine learning

Pedro Ramírez and Manuel Roselló, Software Engineers, will discuss how to design and train a model to predict and aid in decision-making, based on a dataset and graphs.

    Written by:

    Manuel Roselló

    Software Engineer

    Full Stack Developer specialized in Angular and Python, with great interest in Machine Learning techniques and Artificial Intelligence.

    Pedro Ramírez

    Software Engineer

    Software architect and DevOps specialist. Lover of technology and process optimization and automation as a means for continuous improvement and overall quality.

Introduction

Historical archive data currently provide a wealth of information that can assist us in decision-making and predicting future events.

In this article we will discuss how to design and train a model to predict and aid in decision-making, based on a dataset and graphs. For this purpose, we will use our data storage platform EdicomLta (Edicom Long-Term Archiving) as a data source and EDICOM Analytics as a tool to exploit the information graphically.

Introduction to the experiment

To set up a practical simulation of a possible use case for our data, an experiment was carried out to obtain a prediction of the EDICOMNet application’s data traffic to build an HPA (Horizontal Pod Autoscaler) for horizontal scaling of pods in the system. A pod is the smallest basic unit that can be controlled by the scaling and deployment manager, Kubernetes. Each pod may be composed of one or more application containers, which share storage and resources, as well as their runtime specifications.

EDICOMNet is a service from EDICOM that operates as a VAN (Value-Added Network) and enables the secure and traceable exchange of EDI documents, with 24-hour availability and constant monitoring. This information is stored in the EdicomLta platform, which will serve as a data source for the experiment.

For the experiment, we need to predict the number of documents per hour to supply pods to the platform and meet the resource needs in advance. For this purpose, a scenario was designed, in which each pod is able to receive approximately 10,000 documents per hour efficiently.
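As a rough illustration of this scenario, the number of pods required for a given hour can be derived from the predicted traffic by dividing by the per-pod capacity and rounding up. The following is only a sketch: the function name `pods_needed` and the rule of always keeping at least one pod running are assumptions made for the example, not part of the actual HPA.

```python
import math

DOCS_PER_POD = 10_000  # approximate capacity per pod in the scenario


def pods_needed(predicted_docs: float, docs_per_pod: int = DOCS_PER_POD) -> int:
    """Pods required to absorb the predicted hourly traffic.

    Assumption: at least one pod is always kept running.
    """
    return max(1, math.ceil(predicted_docs / docs_per_pod))


hourly_forecast = [6_091, 14_139, 46_127, 62_337]  # illustrative traffic values
pods = [pods_needed(v) for v in hourly_forecast]
print(pods)
```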

With this information, we will perform an experiment to determine the model’s feasibility: graphically displaying the weekly and hourly traffic prediction and assessing the reliability of the results and their visual representation through the EDICOM Analytics application.

Practical experiment

To carry out the experiment, the code was written using the Python programming language, due to the simplicity of use of its libraries when applied to machine learning. Specifically, we used four very common and popular libraries in this field, which are also well documented:

  • pandas (pd): Data manipulation in table format.
  • NumPy (np): Algebraic operations.
  • scikit-learn (sklearn): Machine learning models and related utilities.
  • Matplotlib (plt): Graphical representations of data.

Automated learning by means of Random Forests

In recent years, machine learning has been a rising trend, and there is an abundance of publications full of terms referring to various learning models: neural networks, clustering techniques, SVMs, etc. However, there are other types of algorithms which, despite being less famous outside the academic context, are as capable as the alternatives mentioned above, or even superior to them in certain contexts.

One of these models is the Random Forest. The name is due to the fact that it consists of a combination of several uncorrelated decision trees. But what is a decision tree? This article does not intend to go into great detail on the algorithms that govern this computational model. However, a brief theoretical framework is presented below to contextualize the experiment.

A decision tree is a very simple structure that every reader has probably used unconsciously on more than one occasion. At each node, we ask a question whose answer can only be yes or no. Depending on the response, we take one branch or another and reach a new node with a different question. And so on, until we reach a “leaf” node that gives us a final answer.

This example could be part of a very simplified line of reasoning for classifying figures. Each decision is based mainly on one feature: straight lines, length of sides, etc. We could even ask about the colour, but in this case it would not be decisive. Likewise, in the experiment we will have a series of features that determine the decisions of the trees, and we will see that some features are more relevant than others for reaching the correct leaf nodes.

In a Random Forest used for regression, each tree makes an individual prediction and the final value is the mean of all of them (or the majority vote in classification problems). The main idea behind this ensemble algorithm is that the combined prediction of all the “weak” learners (the trees) tends to be better than their individual predictions.
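The averaging step can be checked directly with scikit-learn. The sketch below uses synthetic data invented for the example (it is not the EDICOMNet dataset) and compares the forest's output with the mean of the predictions of its individual trees, which are exposed through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))          # synthetic features
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 200)    # synthetic target

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Prediction of every individual tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# The forest's prediction is the mean of the individual tree predictions
matches = np.allclose(per_tree.mean(axis=0), forest.predict(X[:5]))
print(matches)
```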

The key to this mechanism consists of minimizing the correlation between trees. To do this, we use the technique known as bagging (Bootstrap Aggregation), which consists of passing each tree a random sample, drawn with replacement, of the same size as the original dataset. For example, if our data are in a list [a, b, c], one tree might receive [c, a, b] and another [b, c, b] (replacement allows duplicate values). Thanks to this system, the trees differ from each other and produce less correlated structures and outcomes.

In addition, there is typically a configurable parameter that determines the number of features (fields, characteristics) that are considered when determining a split (the decision to follow one branch or another), instead of using all of them every time. This generates more randomness and helps prevent overfitting.
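The sampling step described above can be sketched in a few lines of NumPy. The helper name `bootstrap_sample` is made up for the illustration; a real Random Forest implementation performs this step internally.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.array(["a", "b", "c"])


def bootstrap_sample(values, rng):
    """Draw len(values) items at random, with replacement."""
    idx = rng.integers(0, len(values), size=len(values))
    return values[idx]


# One sample per tree: same size as the original, duplicates allowed
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for sample in samples:
    print(list(sample))
```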

Nevertheless, it is logical to ask: why a Random Forest rather than a neural network? Neural networks are incredibly versatile and have enormous potential. Today, they are one of the most common tools for tackling any type of machine learning problem. But it is precisely this versatility that allows more specialized models to outperform them on specific problems. In the case of our experiment, we know that we are dealing with data in table format: it is not a task related to facial recognition, natural language processing, etc. Random Forests have demonstrated good performance on this type of tabular data and provide us with certain advantages over a neural network:

  • A Random Forest requires less data to come up with a good result. Our dataset may seem relatively large with its thousands of data points, but it is actually quite small by neural network standards.
  • The computational cost is much lower. So much so, that there is no need to resort to GPUs or TPUs to train the model.
  • It is easier to configure and interpret. As we will see in the following sections, we can even obtain an objective measurement of the relevance that each feature has for the model. A neural network is usually much more difficult to “decipher” and often requires some trial and error.

Study and processing of data

For the experiment, we work with a dataset that compiles the document traffic in the EDICOMNet application. Specifically, we have a total of 17,927 hourly traffic data items from January 2019 to the end of February 2021. This is an initial data model with a very simple structure, as it consists of only two fields:

  • key_as_string: Expressing the date and time in UTC format, in accordance with ISO-8601.
  • doc_count: Indicating the number of documents (traffic) recorded during the last hour.

Next, we can see the first five elements of the dataset (where the index field is generated automatically to assign a unique identifier to each element):

import pandas as pd

df = pd.read_json('edicomnet_traffic.json')  # A CSV source could be loaded with pd.read_csv instead
df.head()  # Show the first 5 elements
index  key_as_string             doc_count
0      2019-01-01T00:00:00.000Z  14139
1      2019-01-01T01:00:00.000Z  12432
2      2019-01-01T02:00:00.000Z  19827
3      2019-01-01T03:00:00.000Z  10413
4      2019-01-01T04:00:00.000Z  14284

With this initial data we could already build and train a model. However, its effectiveness would probably leave much to be desired. In any machine learning problem, prior analysis of the data is the most important stage, as it allows us to devise a tailor-made model that fits the specific needs of the dataset. Mainly, we need to understand the data before processing them. As a first step, we can generate a histogram to visualize the distribution of the values:

[Figure: histogram of the hourly document traffic]

It is easy to see how the data follow a bimodal distribution, with quite a few values clustered around 10,000 documents and a large majority in the 40,000-document zone (in fact, the mean is 33,327). This graph lets us see that the dataset has good potential for predictions, as there are few atypical values or outliers.

However, the most interesting aspect of our data is that they form a time series, so it can be very relevant to visualize their evolution over time in order to detect possible patterns. In order for us to observe traffic within a temporal context in an optimal way, we modify our data table by subdividing the date into several more informative fields:

# Convert the timestamp strings to timezone-aware datetimes
df['key_as_string'] = pd.to_datetime(df['key_as_string'], utc=True)
# Extract fields from the date and restructure
df['YEAR'] = df['key_as_string'].dt.year
df['MONTH'] = df['key_as_string'].dt.month
df['DAY'] = df['key_as_string'].dt.day
df['WEEKDAY'] = df['key_as_string'].dt.weekday  # 0 = Monday ... 6 = Sunday
df['HOUR'] = df['key_as_string'].dt.hour
df = df.drop(columns=['key_as_string']).rename(columns={'doc_count': 'TRAFFIC'})
df = df[['YEAR', 'MONTH', 'DAY', 'WEEKDAY', 'HOUR', 'TRAFFIC']]
# Then sort chronologically (including the hour, so the order within each day is kept)
df = df.sort_values(['YEAR', 'MONTH', 'DAY', 'HOUR'])

Here YEAR, MONTH and DAY represent, respectively, the year, month and day; WEEKDAY indicates the day of the week (0 = Monday, 6 = Sunday); and HOUR shows the hour of the day, without minutes, between 0 and 23.

index  YEAR  MONTH  DAY  WEEKDAY  HOUR  TRAFFIC
0      2019  1      1    1        0     14139
1      2019  1      1    1        1     12432
2      2019  1      1    1        2     19827
3      2019  1      1    1        3     10413
4      2019  1      1    1        4     14284
...
17946  2021  2      21   4        19    46127
17947  2021  2      21   4        20    42835
17948  2021  2      21   4        21    44219
17949  2021  2      21   4        22    39986
17950  2021  2      21   4        23    62337


With this new data structure, it is much easier to plot the data as a function of time. For example, we can study its global evolution over the years:

[Figure: hourly traffic over the whole period]

Obviously, there are too many data items for them to be seen clearly, but two conclusions can be drawn from this chart. On the one hand, the traffic has a similar behaviour over time and, on the other hand, it seems to tend to rise and fall periodically. Based on these “clues”, we proceed to construct another chart. This time, grouping the data by day of the month:

With this visualization, the pattern is much more apparent, but there seems to be some shift between the different months. What this shift suggests is that the pattern we are looking for will not necessarily be related to the day of the month, but to the day of the week:

With this last representation, the enigma seems to be resolved: there is a clear pattern in the traffic, whose volume appears to be directly related to the day of the week. This is, of course, what our human eyes see. So, how do we convey it to the model? We mentioned previously that our data structure was quite basic and could be improved. When we talk about “improving” the data model, we are talking about adding, deleting and modifying fields. Our Random Forest is smart, but we can (and should) help it to be even more so by providing it with data in a structure that makes it easy to find relevant information.
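The visual impression of a weekday pattern can also be checked numerically by grouping the traffic by day of the week. The sketch below uses a synthetic series invented for the example (busy working days and quiet weekends are an assumption of the illustration, not a description of the real data); with the article's table, the same `groupby` would be applied to the WEEKDAY and TRAFFIC columns.

```python
import numpy as np
import pandas as pd

# Synthetic two-week hourly series standing in for the real traffic (assumption)
idx = pd.date_range("2021-01-04", periods=24 * 14, freq=pd.Timedelta(hours=1), tz="UTC")
weekday = idx.weekday  # 0 = Monday ... 6 = Sunday
rng = np.random.default_rng(0)
traffic = np.where(weekday < 5, 40_000, 10_000) + rng.integers(-2_000, 2_000, len(idx))

demo = pd.DataFrame({"WEEKDAY": weekday, "HOUR": idx.hour, "TRAFFIC": traffic})

# Mean traffic per day of the week: a clear difference between days
# hints that WEEKDAY is a useful feature for the model
by_weekday = demo.groupby("WEEKDAY")["TRAFFIC"].mean()
print(by_weekday.round(0))
```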

After a series of tests, the final model includes several improvements:

  • The year is ruled out, as it does not provide much information.
  • Three fields are added which may be useful: the traffic at the same hour 24 hours earlier, the accumulated traffic over the last 24 hours, and the difference between this accumulated traffic and its previous value. To this end, we must sacrifice the first day of the dataset (01/01/2019), as it has no previous data from which to generate the new fields.
# Previous day at the same time
df['PREVIOUS'] = df['TRAFFIC'].shift(periods=24)
# Sum of the last 24 hours (the window includes the current hour) and its hourly difference
df['CUMSUM'] = df['TRAFFIC'].rolling(window=24, min_periods=1).sum()
df['CUMSUM_DIFF'] = df['CUMSUM'].diff(periods=1)
# We clean up the data: drop the rows without a previous day and restore integer types
df = df.dropna()
new_fields = ['PREVIOUS', 'CUMSUM', 'CUMSUM_DIFF']
df = df.astype({name: int for name in new_fields})
df = df[['YEAR', 'MONTH', 'DAY', 'WEEKDAY', 'HOUR', 'PREVIOUS', 'CUMSUM', 'CUMSUM_DIFF', 'TRAFFIC']]
# And remove the year
df = df.drop(columns=['YEAR'])
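To make the content of the three new fields easier to follow, the same operations can be applied to a toy series. This is only a sketch: the values are invented and a lag of 3 rows stands in for the 24-hour lag used above.

```python
import pandas as pd

toy = pd.DataFrame({"TRAFFIC": [10, 20, 30, 40, 50, 60]})
LAG = 3  # stands in for the 24-hour lag of the real dataset

toy["PREVIOUS"] = toy["TRAFFIC"].shift(periods=LAG)
toy["CUMSUM"] = toy["TRAFFIC"].rolling(window=LAG, min_periods=1).sum()
toy["CUMSUM_DIFF"] = toy["CUMSUM"].diff(periods=1)

# Rows without a full history contain NaN and are dropped
toy = toy.dropna().astype(int)
print(toy)
```

Because the rolling window includes the current row, CUMSUM_DIFF in this toy example equals TRAFFIC minus PREVIOUS, which is worth keeping in mind when judging how informative these fields are.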

So, the final set has the following format:

index  MONTH  DAY  WEEKDAY  HOUR  PREVIOUS  CUMSUM   CUMSUM_DIFF  TRAFFIC
24     1      2    2        0     14139     228254   -8048        6091
25     1      2    2        1     12432     222503   -5751        6681
26     1      2    2        2     19827     214247   -8256        11571
27     1      2    2        3     10413     210232   -4015        6398
28     1      2    2        4     14284     209572   -660         13624
...
17946  2      26   4        19    43877     1136414  2250         46127
17947  2      26   4        20    37547     1141702  5288         42835
17948  2      26   4        21    42143     1143778  2076         44219
17949  2      26   4        22    41473     1142291  -1487        39986
17950  2      26   4        23    47402     1157226  14935        62337

Preparing the learning model

Once the dataset is ready, we can train our learning model with it. First, we set aside the most recent data (February 2021) to use them in a simulation of predictions once we have the final model. That leaves two years and one month of data to prepare our model, from which we must create a random subset for training and another for testing. The ratio chosen for this division was 80-20: an arbitrary choice, but a very common one in machine learning experiments.

Let’s briefly go over how our data are organized: February 2021 is separated for a subsequent simulation, and the rest of the data have been randomly reordered and 80% will be used to train the model, and 20% will be used for the testing phase.

from sklearn.model_selection import train_test_split

# February 2021 is set aside for forecasting. YEAR is no longer a column, so we
# assume the index follows chronological order: February 2021 is then the block
# of MONTH == 2 rows that comes after the last January row
last_january = df.index[df['MONTH'] == 1].max()
is_feb_2021 = (df['MONTH'] == 2) & (df.index > last_january)
experiment_data = df.loc[~is_feb_2021].copy()
prediction_data = df.loc[is_feb_2021].copy()
# We separate into train 80% and test 20% after a reproducible shuffle
RANDOM_SEED = 42
TEST_SIZE = 0.2
train, test = train_test_split(experiment_data, test_size=TEST_SIZE, shuffle=True, random_state=RANDOM_SEED)
# And construct the datasets
y_train = train['TRAFFIC']
x_train = train.drop(columns=['TRAFFIC'])
y_test = test['TRAFFIC']
x_test = test.drop(columns=['TRAFFIC'])

When adjusting the model’s hyperparameters, we used the Grid Search method, which automatically tests every combination of a manually specified set of parameter values and determines which one yields the best-performing model on the training data set. This process was repeated several times, initially with very wide value ranges (e.g., maximum depth from 10 to 300), and later narrowed down based on the results (the last test had maximum depths between 14 and 20). The next snippet shows an example of Grid Search with highly localized values, as they come from the most recent test.

import multiprocessing

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score

# rmse_score is not defined in the original snippet; it is assumed here to be
# scikit-learn's built-in (negated) RMSE scorer
rmse_score = 'neg_root_mean_squared_error'
# Leave some cores free, but always use at least one
N_JOBS = max(1, multiprocessing.cpu_count() - 4)

# Random Forest with parameter search by GridSearch
model = RandomForestRegressor()
param_search = {
    'bootstrap': [True, False],  # Activates sampling with bootstrap
    'max_depth': [14, 17, 20],  # Maximum depth of trees
    'max_features': [1.0, 'sqrt'],  # Num. of features per split (1.0 = all, formerly 'auto')
    'min_samples_leaf': [1, 2],  # Min. samples per leaf node
    'min_samples_split': [2, 5, 10],  # Min. samples per split
    'n_estimators': [700, 750, 800],  # Number of trees
}
tscv = TimeSeriesSplit(n_splits=12)
gsearch = GridSearchCV(estimator=model,
                       cv=tscv,
                       param_grid=param_search,
                       scoring=rmse_score,
                       n_jobs=N_JOBS,
                       verbose=3)
gsearch.fit(x_train, y_train)
best_score = gsearch.best_score_
best_model = gsearch.best_estimator_
# Cross validation score of model
np.mean(cross_val_score(best_model, x_train, y_train, cv=tscv, scoring='r2', n_jobs=N_JOBS))
# Training
best_model.fit(x_train, y_train)
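The cross-validation object used in the search, TimeSeriesSplit, builds folds in which the training rows always precede the test rows. A minimal sketch on ten consecutive observations (instead of the real training set, and with 3 splits instead of 12) shows how the folds grow:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(10)  # ten consecutive observations
tscv_demo = TimeSeriesSplit(n_splits=3)

folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in tscv_demo.split(samples)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each fold trains on a longer prefix of the data and evaluates on the block that immediately follows it.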

This is a computationally expensive process which, fortunately, can be executed in parallel using several CPU cores. Using eight cores simultaneously, we observed that initial assessments combining multiple, very diverse values could take more than ten hours, while the last searches with few values were completed in a matter of minutes. The optimal model that was finally chosen has the following configuration:

bootstrap  max_depth  max_features  min_samples_leaf  min_samples_split  n_estimators
True       17         auto          1                 2                  750

In contrast to the hyperparameter search, the model can be trained really quickly: in around 35 seconds. We obtained a mean cross-validation R2 score of 0.98.
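For reference, a regressor with the chosen configuration could be instantiated directly, without repeating the search. This is a sketch under two assumptions that are not part of the original experiment: `n_jobs=-1` and a fixed `random_state`. In recent scikit-learn versions the former `'auto'` setting for regressors is written as `max_features=1.0` (all features).

```python
from sklearn.ensemble import RandomForestRegressor

final_model = RandomForestRegressor(
    bootstrap=True,
    max_depth=17,
    max_features=1.0,     # all features per split, formerly 'auto' for regressors
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=750,
    n_jobs=-1,            # assumption: use all available cores
    random_state=42,      # assumption: fixed seed for repeatable runs
)
params = final_model.get_params()
print(params["n_estimators"], params["max_depth"])
```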

Results analysis

It is highly advisable to compute a feature relevance evaluation, which is an extremely useful tool available in the sklearn library.

When it comes to choosing a branch, the model evaluates how relevant each field has been to the decision. Based on these results, we could, for example, dispense with the DAY and MONTH fields in a future version of the model, as they do not seem to provide relevant information. However, this is probably due to the fact that the data cover only two years. If more years were included, it is quite likely that seasonal and holiday traffic patterns would appear, so these fields would become more relevant. It is worth noting that WEEKDAY was clearly the most important field for the model, thus confirming the theory of patterns depending on the day of the week that we had proposed in the data analysis section.
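In scikit-learn, this evaluation is available through the `feature_importances_` attribute of a fitted forest. The sketch below uses synthetic data invented for the example, with a target built to depend mostly on WEEKDAY, so the resulting ranking illustrates the mechanism and not the article's actual results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "WEEKDAY": rng.integers(0, 7, n),
    "HOUR": rng.integers(0, 24, n),
    "NOISE": rng.normal(size=n),  # deliberately irrelevant feature
})
# Synthetic target driven mostly by WEEKDAY and slightly by HOUR (assumption)
y = 5_000 * X["WEEKDAY"] + 100 * X["HOUR"] + rng.normal(0, 500, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one value per column, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```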

To assess the model, as the training was non-deterministic, data were collected from five different runs and the mean values were extracted:

Explained variance  RMSE      R2      Mean difference  Precision (±5,000)
0.9952              1,267.33  0.9951  200.35           98.30%

  • The explained variance represents the proportion by which the model fits the variation in the data, with 1 being the maximum or best possible value.
  • The RMSE (Root Mean Squared Error) measures the differences between the predicted values and the observed values. Its value scales with the range of possible values in the data. In our case, we ranged from 0 to 170,000 documents, so it can be considered a low error.
  • The coefficient of determination R2 estimates the quality of the model for replicating the results in future predictions and the variation in outcomes that can be explained by the model. As in the case of the explained variance, the optimal value is 1.
  • The mean difference is simply a calculation of the average value of the difference, in absolute value, between the predicted values and the actual values, in number of documents. For our project, a difference of 200 documents is perfectly acceptable.
  • The precision was calculated based on a threshold of ±5,000 documents, as a new pod is needed for every 10,000. This is a simple calculation of hits within the threshold over the total predicted data.
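The five measurements above can be computed with scikit-learn and NumPy along the following lines. This is a sketch: the helper name `evaluate` and the sample predictions are invented for the illustration, and counting a prediction exactly on the threshold as a hit is an assumption.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)


def evaluate(y_true, y_pred, threshold=5_000):
    """Return the evaluation measurements for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "explained_variance": explained_variance_score(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": r2_score(y_true, y_pred),
        "mean_difference": mean_absolute_error(y_true, y_pred),
        # Share of predictions within +/- threshold documents of the real value
        "precision": float(np.mean(np.abs(y_true - y_pred) <= threshold)),
    }


actual = [14_139, 12_432, 19_827, 10_413, 14_284]
predicted = [14_000, 12_900, 26_000, 10_400, 14_500]  # made-up predictions
metrics = evaluate(actual, predicted)
print(metrics)
```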

Visualizing the predictions

After training the model and seeing that it obtains good results, it is time to perform the prediction simulation with the February data. Ideally, the model would be used for 24-hour forecasts:

As can be seen in a 24-hour prediction, the predictions are very close to the real values, with an average accuracy of roughly 98% over five runs. With this information we can get a fairly accurate assessment of how to scale pods based on hourly traffic. If we translate these figures into a number of pods, at a maximum of 10,000 documents per pod, we are left with an average accuracy of 94.17%, with a maximum deviation of one pod, which may be acceptable.

We can go further and apply the same prediction to the first three weeks of February as a whole. A review of the results shows that the predictions keep approximately the same accuracy (93.45%) within the margin of ±5,000 documents. It is particularly interesting to see how, although the weeks follow a similar pattern, they do not necessarily reach the same peaks, and the predictions also match these variations.
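The step from documents to pods can be sketched as follows. The conversion rule (rounding up to whole pods) and the forecast values are assumptions made for the example; the article does not detail the exact calculation used.

```python
import numpy as np

DOCS_PER_POD = 10_000


def to_pods(docs):
    """Pods needed for each hourly document count (rounded up)."""
    return np.ceil(np.asarray(docs, dtype=float) / DOCS_PER_POD).astype(int)


actual = np.array([46_127, 42_835, 44_219, 39_986, 62_337])
predicted = np.array([45_800, 43_100, 44_000, 40_300, 61_900])  # made-up forecast

pod_error = np.abs(to_pods(actual) - to_pods(predicted))
pod_accuracy = float(np.mean(pod_error == 0))
max_deviation = int(pod_error.max())
print("pod accuracy:", pod_accuracy, "max deviation:", max_deviation)
```

The example also shows why accuracy measured in pods can be lower than accuracy measured in documents: a prediction that misses by only a few hundred documents can still fall on the other side of a 10,000-document boundary.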

Specifically, for the 504 values, the maximum deviation continues to be only 1 pod:

Based on these results, we can conclude that the model is accurate enough for this prediction and that it could be implemented as a solution to manage the dynamic scaling of pods with very high reliability.

Conclusions

EDICOM manages large amounts of data, so its exploitation through machine learning techniques would be a good formula to improve our services and open new lines of technological development.

The results of this first experiment have allowed us to verify the feasibility of this type of prediction, not only in a theoretical way, but also for a real and practical use, thus opening new horizons for the exploitation of data stored in EdicomLta. In this particular case, we were able to accurately predict the dynamic scaling of pods. Moreover, this type of model and prediction could also be adapted to numerous applications thanks to the volume and diversity of data managed by EDICOM: estimating orders, invoices, hours of consulting or support, etc. In addition, by obtaining structured prediction data, it would be possible to enrich the EDICOM Analytics service by allowing its automated graphic visualization.

In conclusion, we must emphasise yet again that, if we manage to analyse and understand the data correctly, it is viable to build a model that fits the needs of the given problem. This small experiment has enabled us to verify that we have both the data and the tools necessary to carry out future projects in this area.
