What is deep learning: case study in EDICOM. Part II

In this second part, we will describe how to implement, step by step, a deep learning system that solves the problem posed: programming an algorithm that automatically rates the complexity of a new EDICOM technical task.

    Article by:

    Jose Blas Vilata

    Technical Director and founding partner of EDICOM

Introduction

Bear in mind that the ultimate aim of this article is to program an algorithm that automatically rates the complexity of a new EDICOM technical task, thus simulating what is currently done by a human expert, drawing on all the historical data on technical tasks stored in our management systems.

We have already commented that, although one could try to approach the problem from a traditional algorithmic point of view, this problem clearly falls within the scope of artificial intelligence, specifically in the machine learning branch and, more specifically, in its more modern form, deep learning.

In the first part of this article, we reviewed the basic theoretical concepts of deep learning. In this second part, we will describe how to implement, step by step, a deep learning system that solves the problem posed.

Tools to be used and preparing the environment

The programming language we are going to use is Python, so this is the first thing we have to install. Beyond that, all we need are Python libraries, which we shall install as we need them, although we recommend starting with at least the following ones:

Python is one of the predominant programming languages in statistics, data mining and machine learning. Since it is free software, countless users have contributed implementations, giving rise to a very large number of libraries covering practically every machine learning technique in existence.

Scikit-learn (sklearn) is an open source machine learning library that supports both supervised and unsupervised learning. It also provides various tools for model fitting, data pre-processing, model selection and evaluation, among many other uses.

TensorFlow is the main open source deep learning framework, developed and maintained by Google.
Keras is an open source deep learning library written in Python. Keras lets you design, fit, assess and use deep learning models to make predictions in just a few lines of code, accessible and understandable for most developers.

Using TensorFlow directly can be a bit of a challenge for developers, so in 2019 Google released TensorFlow version 2, which integrated the Keras API directly and promoted it as the default or standard interface for deep learning development on the platform. This way, with TensorFlow version 2 there is no need to install Keras, as it comes already built in.

This integration is commonly known as the tf.keras (TensorFlow-Keras) interface or API.

We can install TensorFlow and sklearn directly onto our machine or, if we don’t want to “clutter” our computer with new programs, we can use a Docker image (jvilata/tensorf-sk:v1) with everything pre-installed and ready to run, which I have made available in the Docker Hub public repository. To use this image you naturally need the Docker engine installed; in most Linux distributions it comes pre-installed or is really easy to install, and on Windows we can get it by means of the corresponding installer from the page: https://docs.docker.com/docker-for-windows/install/

To test that our Docker image is working, we run it and enter Bash (Linux Shell):
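The original command appears only as a screenshot; a plausible form of it, given the image name above and the volume mapping explained below, is:

    docker run -it -v d:\docker\:/srv jvilata/tensorf-sk:v1 bash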

Here, the “-v d:\docker\:/srv” is optional and maps a local directory of your machine, in this case “d:\docker”, to a directory inside the Docker container, in this case “/srv”. This makes it easier to edit files without having to install more programs in the container: the edits are made locally in that folder and the changes are seen directly from inside the Docker image.

The Docker image includes the “/Edicom” directory with the data and examples shown here. At this point it does not matter if you are in Docker or in a local installation, as we will do everything from within Python.

Now, let’s check that everything is installed correctly by entering Python and printing the TensorFlow version:
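The original screenshot is not reproduced here; the check amounts to the following, assuming a Python 3 installation:

    $ python3
    >>> import tensorflow as tf
    >>> print(tf.__version__)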

If we have done the installation locally and Python tells us that the “tensorflow” module is not found, it is because we have not installed it yet, as will probably happen with other libraries we will need and which are already installed in the Docker image. As an example, I will only show how to install one Python library, in this case TensorFlow; the same would have to be done for each of the libraries we are going to need. From the system shell we run:
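The standard command for this is:

    pip install tensorflow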

We are going to print the TensorFlow and Keras versions again in a slightly different way, through a Python program file. To do so, we create a text file which we shall call “versions.py”. If we are using the Docker image, this file must be located in the directory that we have mapped to “/srv”, in our case “d:\docker”. We write the following in the file:
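The file contents were shown as an image in the original; a minimal version, using the tf.keras interface described above, would be:

    # versions.py: print the installed TensorFlow and Keras versions
    import tensorflow as tf
    from tensorflow import keras

    print("TensorFlow:", tf.__version__)
    print("Keras:", keras.__version__)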

And now, from the command line, we execute:
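Judging by the redirection mentioned below, the command was along these lines:

    python versions.py 2> null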

I redirect the error output (“2> null”) so as not to see unwanted messages on the screen. Most of them are warnings about hardware that is not present, such as GPUs: TensorFlow notifies us, but it doesn’t matter; it simply means that the full potential of those accelerators will not be taken advantage of.

Life-cycle of a machine learning project

The following steps will usually be common to every machine learning project.

  • Defining the problem: What do we want to predict? What data are available, or which data do we need to obtain?
  • Exploring, understanding and preparing the data to be used to create the model.
  • Separating the observations into a training set and a validation or test set. It is very important to ensure that no information from the test set takes part in the training process of the model.
  • Defining the model, usually of sequential type with at least 2 dense layers, one for input and another for output, and their corresponding activation functions.
  • Compiling and fitting the model by defining a loss function, an optimization algorithm and a metric that lets us know how good our model is in terms of the expected outcomes.
  • Training the model and improving it by incorporating new variables or optimizing the hyperparameters (“epochs”, “batch_size”) based on the evaluation results.
  • Evaluating the model’s capability on the test data set to get an estimate of its ability to predict new observations.
  • Saving the final model so that it can be used in the future to predict results from new data.

Defining the problem

As we have already explained on several occasions, at EDICOM each client requirement gives rise to a technical task in our management system, where various data are gathered: the description of the work to be carried out, the client making the request, traceability dates, the number of working days to be billed to the customer for this work, and so on. An expert manager then manually rates the complexity of this task on a scale from 0 to 5. On the basis of this rating, a technical project manager with the appropriate skills is assigned for its execution. The rating also serves to better balance the workload of technicians based on the number of projects they manage and their complexity.

Our aim is to program an algorithm that automatically rates the complexity of a new task from 0 to 5, thus simulating what a human expert currently does, based on all the technical task data stored in our management system, built up over decades and duly rated by humans.

Exploring and understanding the data

Generally, the data extraction and preparation phase will be the most time-consuming aspect of a machine learning project.

We are going to extract the data on technical tasks from the EDICOM management system. To do so, we ask one of the experts who currently rates the tasks which of the task data are the most significant when it comes to rating task complexity. After a long conversation, the conclusion is as follows: amount of the sale (euros), previous tasks of this same client by complexity type from 0 to 5, number of messages to be integrated, estimated days of work sold, estimated monthly message volume, whether the client belongs to a business group, and the level of support already assigned to the client (basic, preferential).

We write a query to extract the above data from our relational database for the last 5 years and export the result in CSV format (delimiter-separated text). Each row corresponds to a technical task and each column is one of the characteristics that we think may help define the prediction we want to make about the complexity of the task.

We also export, with the same query and into the same CSV file, the column that represents the outcome to be predicted, in our case the estimated complexity of the task (“complejidad”). As we are dealing with historical or archive data, we do have this information, which has been entered by a human over time.

Now, we create a file called “edicom1.py” where we will enter the Python code to explore and manipulate the CSV data and get to know them a little better.

First, we will have to import the Python libraries that we are going to use later to load and view the data:
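The import lines were shown as an image in the original; a plausible minimal set, assuming pandas for the data and matplotlib for the charts, is:

    # Libraries used throughout edicom1.py
    import pandas as pd
    import matplotlib
    matplotlib.use("Agg")          # render charts to files; no display inside Docker
    import matplotlib.pyplot as plt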

Next, we load the Dataset, or CSV data file, into the “datos” variable, with the semicolon “;” as field separator and indicating that the decimal separator is the comma “,”.
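A sketch of the load, assuming a hypothetical file name “tareas.csv”:

    # The file name is illustrative; use the CSV exported from the management system
    datos = pd.read_csv("tareas.csv", sep=";", decimal=",")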

Our CSV file looks more or less like this, where we can see that we have loaded the name of the columns in the first row:
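The original sample is an image; illustratively, with only the columns named in this article and invented values, the first rows might look like this:

    IDTAREA;PAIS_GESTOR;importeventa;mensajesAIntegrar;diasVendidos;complejidad
    10001;ES;12500,50;4;10;2
    10002;FR;3200,00;1;2;0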

Now we are going to explore the Dataset loaded in the “datos” variable:
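The instructions described below, reconstructed (counting the nulls with isnull().sum() is an assumption):

    print(datos.head(4))         # first 4 rows of the Dataset
    datos.info()                 # columns with their types
    print(datos.shape)           # (number of rows, number of columns)
    print(datos.isnull().sum())  # null values found in each column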

datos.head(4) shows us the first 4 rows of the Dataset, datos.info() lists the columns with their types, and datos.shape displays the number of rows and columns. The last instruction shows us the number of null values found in each column.


We see that the Dataset includes the columns “IDTAREA” and “PAIS_GESTOR”, which we are not going to use in our model and can therefore delete. We also notice that the “importeventa”, “complejidad” and “diasVendidos” columns contain null values: 3689, 99 and 2, respectively. Models do not work well with nulls, so we must take one of the following actions: delete the rows with nulls, or change the null values to 0 or some other value, such as the mean. We choose the second option: we set the null “complejidad” values to 1 and fill in the “importeventa” and “diasVendidos” attributes with the mean of the values for their “complejidad”.
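A pandas sketch of that clean-up (the exact code in edicom1.py may differ):

    # Drop the columns the model will not use
    datos = datos.drop(columns=["IDTAREA", "PAIS_GESTOR"])

    # Null complexity becomes 1
    datos["complejidad"] = datos["complejidad"].fillna(1)

    # Fill the remaining nulls with the mean of their complexity group
    for col in ["importeventa", "diasVendidos"]:
        datos[col] = datos[col].fillna(
            datos.groupby("complejidad")[col].transform("mean"))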

To cut a long story short, we will show a couple of graphs about the Dataset as an example. In the “edicom1.py” file of the Docker image there are more examples of graphs.
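For instance, the pie chart described next can be produced like this (a sketch; edicom1.py in the Docker image contains the original):

    # Distribution of tasks by complexity, saved as a PDF
    datos["complejidad"].value_counts().sort_index().plot.pie(autopct="%1.1f%%")
    plt.title("Tasks by complexity")
    plt.savefig("tmp0.pdf")
    plt.close()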

This generates a PDF file “tmp0.pdf” showing the pie chart image with the Dataset complexity types:

Another graph that can help us understand the data depicts the correlations between variables: if we detect that two columns are highly correlated, we can eliminate one of the two. Conversely, we want the columns we choose to have a good correlation with the target column, meaning that they contribute enough to make a good prediction.
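One way to produce such a graph, assuming a recent pandas version (the output file name is hypothetical):

    # Correlation matrix between numeric columns, saved as a PDF
    corr = datos.corr(numeric_only=True)
    plt.matshow(corr)
    plt.colorbar()
    plt.savefig("tmp1.pdf")
    plt.close()

    # Correlation of each column with the target
    print(corr["complejidad"].sort_values(ascending=False))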

We can already see a priori that only the “mensajesAIntegrar” and “diasVendidos” columns have a relevant correlation in terms of complexity. Unfortunately, the rest of the attributes will not contribute much to the prediction, but they will be of some help. In subsequent versions of the model, we shall have to find new variables that provide greater value to the set.

The “complejidad” column, with values from 0 to 5, is of categorical type because it actually indicates a classification; in other words, although we have numbers, it is as if we had strings like “nada complejo” (“not complex”), “poco complejo” (“slightly complex”), and so on. Deep learning is based on statistical algorithms, and statistical algorithms work with numbers. We therefore need to convert the categorical information into numeric columns. There are several ways to do this, but one of the most common approaches is one-hot encoding.

In one-hot encoding, a new column is created for each unique value in the categorical column. In our case, we will create 6 columns of type “comp_0”, “comp_1”, etc. These columns will only have values 0 or 1. For example, if the complexity was 2, the “comp_2” column would have a 1 and all the others a 0.
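With pandas this can be done in one line (an assumption; the original code ships in the Docker image):

    # comp_0 .. comp_5: one column per complexity value, holding 0 or 1
    labels = pd.get_dummies(datos["complejidad"].astype(int), prefix="comp")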

Programming the model

Now we are ready to define our model. To do so, we create a new Python file called “edicom2.py”, where we shall set out from the following content:

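The original listing is an image; a minimal reconstruction of the top of edicom2.py, assuming the cleaned data from edicom1.py is reloaded here (the file name is hypothetical):

    # edicom2.py: define, train and save the complexity model
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.callbacks import EarlyStopping

    # Reload the cleaned dataset; "df" has "complejidad" as its first column
    df = pd.read_csv("tareas_limpio.csv", sep=";", decimal=",")

    # One-hot encode the target into comp_0 .. comp_5
    labels = pd.get_dummies(df["complejidad"].astype(int), prefix="comp")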

Now we separate the Dataset: the data columns go into the X variable, and the column or columns with the labels or results, in our case “complejidad”, go into the y variable:

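Reconstructed from the explanation below:

    # All rows; columns from index 1 onward are the inputs (column 0 is "complejidad")
    X = df.values[:, 1:]
    y = labels.values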

In the instruction df.values[:, 1:] we indicate that all the rows (:) form part of X, but only the columns from the second one onward (1:), since columns are numbered starting at 0. To the y variable we assign all the rows and columns of labels.

Separating training and test observations

It is necessary to split our Dataset into one set of training data and another for test, which are completely separate. This is done by the sklearn “train_test_split” function. In this case, we indicate that we want 33% given over to test:

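Reconstructed as per the text:

    # Hold out 33% of the observations for the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    # Number of input columns, used to size the first layer
    n_features = X_train.shape[1]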

In the “n_features” variable, we save the number of columns of the training set, X_train.shape[1]. Remember that shape returns a tuple where [0] is the number of rows and [1] the number of columns.

Defining the neural network model

We define a “sequential” model with only 2 layers: the input and output layers. We have left a third “hidden” layer commented out in case the reader wants to run their own tests and try to improve the accuracy of the model.

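A reconstruction matching the description below (the size of the commented hidden layer is an assumption):

    model = Sequential()
    # Input layer: 128 rectified linear units fed by the n_features columns
    model.add(Dense(128, activation="relu", input_shape=(n_features,)))
    # Optional hidden layer, left commented out as in the original file
    # model.add(Dense(64, activation="relu"))
    # Output layer: 6 units with softmax, one probability per complexity class
    model.add(Dense(6, activation="softmax"))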

The initial layer consists of 128 units or neurons. It receives the data from the “n_features” columns, and a rectifier activation function is applied to the output of each linear unit in this layer, turning each neuron into a rectified linear unit or “ReLU”. The result of each of these 128 “ReLUs” is passed to the next layer, consisting of 6 neurons or linear units to which we apply the “softmax” activation function, specialized in classification problems. Basically, it assigns a probability between 0 and 1 to each of the 6 neurons; the one with the highest probability is given as the result.

Compiling the model

Model compilation requires the selection of a loss function to be optimized, such as the mean squared error “MSE” for regression problems or “crossentropy” for classification problems.

It also requires that an algorithm be selected to perform the optimization procedure, usually stochastic gradient descent “SGD” or a modern variation such as “Adam”.

Finally, we must select a performance metric to track during the model training process: usually “accuracy” (percentage of correct predictions) for classification problems, and the mean absolute error “MAE” for regression problems.

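Given the choices named here and in the training section below, the compile call was presumably:

    # Adam optimizer, cross-entropy loss for the 6 classes, accuracy as the metric
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])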

Training the model

To train the model, we first have to select the training configuration, such as the number of “epochs” and the size of the “batch”. These concepts were described in greater detail in part 1 of this article.

The training applies the chosen “Adam” optimization algorithm to minimize the selected loss function, “categorical_crossentropy”, and updates the model using the error back-propagation algorithm. This process will run epochs * (sampleSize / batch_size) iterations, and in each iteration batch_size samples are processed.

Fitting the model is the slow part of the entire process and may take anything from seconds to hours or days, depending on the complexity of the model, the hardware being used and the size of the training data set.


We will use 20% of the training sample for result validation and model fitting, and we use an “early stopping” function so as not to keep iterating when the gain falls below the stipulated min_delta of 0.00001; even then, we allow 100 more iterations in case the value improves again (patience).


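Put together, and with epochs and batch_size set to the values that worked best according to the conclusions, the fit stage might look like this:

    # Stop early when validation loss stops improving by at least min_delta,
    # but be patient for another 100 epochs first
    es = EarlyStopping(monitor="val_loss", min_delta=0.00001, patience=100)

    history = model.fit(X_train, y_train,
                        epochs=200, batch_size=50,
                        validation_split=0.2,
                        callbacks=[es], verbose=2)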

Evaluating the model and saving in a file

Once we have selected and fitted the model to our training data set, we can use the test data to estimate the model’s performance on new data, so we can make an estimate of the model’s generalization error.

If we are satisfied with the value of the metric obtained, we can use the model to make predictions with future data.

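A sketch of the evaluation and save steps (the model file name is hypothetical):

    # Estimate generalization performance on the held-out test set
    loss, acc = model.evaluate(X_test, y_test, verbose=0)
    print("Test accuracy: %.3f" % acc)

    # Persist the fitted model for future predictions
    model.save("modelo_edicom.h5")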

Going live and making predictions

Once we have calculated and saved the model, we can make predictions with new data. As an example we have the file “edicom3.py”, where we load the previously calculated model and make the prediction:

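A reconstruction of edicom3.py under the same assumptions; the file name and the example feature values are purely illustrative, and the input must have the same columns, in the same order, as X_train:

    # edicom3.py: load the saved model and rate a new task
    import numpy as np
    from tensorflow.keras.models import load_model

    model = load_model("modelo_edicom.h5")  # hypothetical name from the previous step

    # One new task; invented placeholder values, sized to match n_features
    nueva_tarea = np.array([[12000.0, 3, 1, 0, 0, 0, 0, 4, 10, 50000, 1]],
                           dtype="float32")

    probabilidades = model.predict(nueva_tarea)  # 6 softmax probabilities
    print("Predicted complexity:", int(np.argmax(probabilidades, axis=1)[0]))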

Results and Conclusions

The first conclusion I drew while developing this test is that I did not manage to get the model’s accuracy above 60% in the best of cases, not even by changing the batch and epoch hyperparameters, nor by enlarging the data set from the original 6,000 samples to the current 10,000.

Another conclusion is that I tried different batch sizes and numbers of epochs: the batch size did have some effect, while beyond 200 the number of epochs mattered less. As for the batch size, I tried 32, 50 and 100, and the best outcomes were obtained with 50.

Therefore, as a final conclusion, I consider that either the estimates of our experts do not always follow a “rational” pattern, in which case the model cannot predict the classification decided by the expert, or, more likely, we are missing input variables that would help the system better predict the outcome: information the expert does have but that we were not able to extract in this version.

In any case, the model constructed has a sufficient level of accuracy to serve at this time as a reference for the expert who is currently grading the tasks, but without going so far as to become a 100% autonomous system.
