The mission is to automate job applications for everybody.
I already did the following steps:
In this notebook, the goal is to beat the benchmark we set with the Stochastic Gradient Descent algorithm, which reached an accuracy of about 95 % on the test dataset. We will now implement a neural network for the same task, hoping to reach an even better accuracy.
We use a dataframe that I scraped from www.stepstone.de. The scraping process is documented in a separate Jupyter notebook.
First, we import the libraries and tools we need.
import json
import keras
from sklearn.feature_extraction.text import TfidfVectorizer
Before the neural network can classify the data, the documents have to be turned into vector form, as we already did in previous notebooks. So instead of working with raw text data, we use the following preprocessing pipeline, built on the library I created for these tasks:
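# Load the preprocessed documents and their labels.
with open('preprocessed.json', 'r') as infile:
    X = json.load(infile)
with open('labels.json', 'r') as infile:
    y = json.load(infile)

# json.load returns a JSON-encoded string here, so a second decoding step is needed.
X = json.loads(X)
y = json.loads(y)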
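The two decoding steps above suggest that the files were written with the data serialized twice. That is an assumption about the writing side, which is not shown in this notebook. A minimal sketch of such a round trip, using made-up tokens and an in-memory string instead of a file:

```python
import json

# Hypothetical tokenized documents; the real data lives in preprocessed.json.
docs = [["we", "offer", "flexible", "hours"], ["your", "tasks", "include", "testing"]]

# Assumed writing side: the list is serialized to a JSON string first,
# and that string is serialized once more when it is written out.
file_content = json.dumps(json.dumps(docs))

# Reading side: the first decode yields a str, the second yields the list.
once = json.loads(file_content)
twice = json.loads(once)

print(type(once).__name__)  # str
print(twice == docs)        # True
```

If the files had been written with a single json.dump call, the second decoding step would not be needed.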
To follow the progress of the neural network, we define two plots: one shows the accuracies and the other the losses, each for both the training and the testing set.
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    """Plot training and validation accuracy and loss per epoch."""
    # The Keras version used here stores accuracy under 'acc';
    # newer versions use 'accuracy' and 'val_accuracy' instead.
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))

    # Left panel: accuracy
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()

    # Right panel: loss
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
I created a class that takes the preprocessed (tokenized) text data and their labels as input. Creating an object of this class directly builds a model and displays its summary. That way, you immediately see how many parameters the network is going to train.
There are three important functions:
fitModel trains the network. It uses ModelCheckpoint as a callback to save the best model as soon as an epoch is over. epochs is set to 5 by default: after a few iterations, I noticed that the network kept increasing its training accuracy, but overfitting led to worse results on the validation set.
encoding turns the text labels into one-hot-encoded vectors.
vectorize turns the preprocessed documents into TF-IDF vectors.
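To make the one-hot encoding step concrete, here is a minimal NumPy sketch. It is not the class's own implementation, which uses scikit-learn's LabelEncoder and OneHotEncoder. The sample labels are made up, and the alphabetical class order is chosen to mirror how LabelEncoder sorts its classes.

```python
import numpy as np

# Hypothetical labels using the five category names from this project.
labels = ["company", "tasks", "profile", "offer", "contact", "tasks"]

# Sort the distinct labels so that every class gets a stable integer index.
classes = sorted(set(labels))  # ['company', 'contact', 'offer', 'profile', 'tasks']
index = {name: i for i, name in enumerate(classes)}

# Picking rows of the identity matrix yields one row per label
# with a single 1 in the column of its class.
one_hot = np.eye(len(classes))[[index[name] for name in labels]]

print(one_hot.shape)  # (6, 5)
print(one_hot[0])     # [1. 0. 0. 0. 0.]  -> company
print(one_hot[1])     # [0. 0. 0. 0. 1.]  -> tasks
```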
class DeepLearningModel():
    def __init__(self, X, y):
        from keras.models import Sequential
        from keras import layers
        from keras.layers import Dropout
        from sklearn.model_selection import train_test_split

        # Vectorize the documents, one-hot encode the labels
        # and hold out 25 % of the data for validation.
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.vectorize(X), self.encoding(y), test_size=0.25, random_state=1000)

        # Number of TF-IDF features, i.e. the width of the input layer.
        input_dim = self.X_train.shape[1]

        model = Sequential()
        model.add(layers.Dense(256, input_dim=input_dim, activation='relu'))
        model.add(Dropout(0.2))
        model.add(layers.Dense(128, activation='relu'))
        model.add(Dropout(0.2))
        model.add(layers.Dense(5, activation='sigmoid'))
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])
        self.model = model
        self.model.summary()

    def fitModel(self, epochs = 5):
        from keras.callbacks import ModelCheckpoint

        # Save the model whenever the validation loss improves.
        checkpoint = ModelCheckpoint("model.ks", monitor='val_loss', verbose=1,
                                     save_best_only=True, save_weights_only=False,
                                     mode='auto', period=1)
        history = self.model.fit(x=self.X_train, y=self.y_train, batch_size=None,
                                 epochs=epochs, verbose=1, callbacks=[checkpoint],
                                 validation_split=None,
                                 validation_data=(self.X_test, self.y_test),
                                 shuffle=True, class_weight=None, sample_weight=None,
                                 initial_epoch=0, steps_per_epoch=None,
                                 validation_steps=None)
        plot_history(history)

    def encoding(self, y):
        # company = 1, tasks = 5, profile = 4, offer = 3, contact = 2
        from sklearn.preprocessing import LabelEncoder
        from sklearn.preprocessing import OneHotEncoder

        # Text labels -> integer codes -> one-hot vectors.
        encoder = LabelEncoder()
        y_encoded = encoder.fit_transform(y)
        encoder2 = OneHotEncoder(sparse=False)
        y_encoded = y_encoded.reshape(len(y_encoded), 1)
        y_encoded = encoder2.fit_transform(y_encoded)
        return y_encoded

    def vectorize(self, X):
        # The documents are already tokenized, so tokenizer and
        # preprocessor simply pass their input through.
        vectorizer = TfidfVectorizer(tokenizer = lambda x: x, preprocessor = lambda x: x)
        return vectorizer.fit_transform(X)
The architecture is structured as follows:
The two hidden dense layers (256 and 128 nodes) use the ReLU activation function. It sets negative values to zero and leaves positive values as they are. The biggest advantage of ReLU is that its gradient does not saturate for positive inputs, which greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid and tanh functions (paper by Krizhevsky et al.). In addition, it is computationally cheaper than activation functions like softmax or sigmoid.
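The non-saturation argument can be illustrated with a few numbers. The following is a small sketch with hand-picked inputs, not code taken from the model: it compares the slope of ReLU with the slope of the sigmoid function.

```python
import numpy as np

def relu(x):
    # Negative inputs become 0, positive inputs pass through unchanged.
    return np.maximum(0.0, x)

def relu_grad(x):
    # The slope is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # The slope s * (1 - s) peaks at 0.25 and shrinks towards 0 for large |x|.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.5, 10.0])
relu_out = relu(x)
relu_slopes = relu_grad(x)
sigmoid_slopes = sigmoid_grad(x)

print(relu_out)        # [ 0.   0.5 10. ]
print(relu_slopes)     # [0. 1. 1.]
print(sigmoid_slopes)  # the last entry is close to 0: the sigmoid has saturated
```

For the input 10, ReLU still passes a gradient of 1 back through the layer, while the sigmoid gradient has nearly vanished.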
Dropout layers in between
Each of these dropout layers uses a dropout rate of 20 %.
These layers help to reduce overfitting. Because a randomly chosen 20 % of the nodes do not participate in each training step, the remaining nodes cannot rely on them and are trained to be more robust on their own. "A fully connected layer occupies most of the parameters, and hence, neurons develop co-dependency amongst each other during training which curbs the individual power of each neuron leading to over-fitting of training data." (https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5)
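What a dropout layer does during training can be sketched in a few lines of NumPy. This is an illustration of the commonly used "inverted dropout" variant, assuming a rate of 0.2 as in the model. The function name and the all-ones activations are made up for the example.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the activations and rescale the rest."""
    keep_prob = 1.0 - rate
    # Each activation survives independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 256))  # hypothetical output of the first dense layer
dropped = dropout(activations, rate=0.2, rng=rng)

kept_fraction = (dropped > 0).mean()
print(round(kept_fraction, 2))   # close to 0.8
print(round(dropped.mean(), 2))  # close to 1.0
```

At prediction time no nodes are dropped, which is why the surviving activations are scaled up during training.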
Dense Output layer
The output layer has five nodes, one per category, and uses the sigmoid activation function, which ensures that each output is scaled between 0 and 1. This means that each category is assigned a value indicating how likely it is that a text belongs to that category.
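One detail worth knowing about sigmoid outputs: each of the five values is scaled independently, so together they do not have to add up to 1. The sketch below uses made-up raw outputs to compare this with softmax, a common alternative when every text belongs to exactly one category. It is meant as an illustration, not as a change to the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the maximum avoids overflow without changing the result.
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical raw outputs of the last layer for one text, one per category.
logits = np.array([2.0, -1.0, 0.5, -3.0, 0.0])

sig = sigmoid(logits)
soft = softmax(logits)

print(sig.round(3), sig.sum().round(3))    # every value in (0, 1), sum above 1
print(soft.round(3), soft.sum().round(3))  # values form a distribution, sum is 1
```

Both functions rank the categories in the same order, so the most likely category is the same either way.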
The network is trained with the Adam optimizer. According to its authors, the method "is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters" (https://arxiv.org/pdf/1412.6980.pdf).
Further information: https://arxiv.org/pdf/1412.6980.pdf
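The update rule from the paper fits in a few lines. The following is a simplified sketch for a single scalar parameter, using the default hyperparameters suggested in the paper. It is not the Keras implementation, and the gradient values are made up.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Perform one Adam update for a scalar parameter at time step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# The same parameter updated once with a tiny and once with a huge gradient.
theta_small, _, _ = adam_step(1.0, grad=0.01, m=0.0, v=0.0, t=1)
theta_large, _, _ = adam_step(1.0, grad=10.0, m=0.0, v=0.0, t=1)

print(theta_small, theta_large)  # both very close to 0.999
```

Both updates move the parameter by roughly the learning rate, although the gradients differ by a factor of 1000. This is the invariance to gradient rescaling mentioned above.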
Here's the whole architecture in one view:
Please note: the image does not show all the nodes of the network, because there is no practical way to display all the input nodes (126,741) together with the other ones. It only gives you an idea of how the nodes are connected to each other and how the input is transformed into an output.
dl = DeepLearningModel(X,y)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_4 (Dense)              (None, 256)               32445952
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_5 (Dense)              (None, 128)               32896
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_6 (Dense)              (None, 5)                 645
=================================================================
Total params: 32,479,493
Trainable params: 32,479,493
Non-trainable params: 0
_________________________________________________________________
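The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output node. A short sketch, assuming the 126,741 input features mentioned above (dense_params is a helper name made up for this example):

```python
def dense_params(n_inputs, n_units):
    # One weight per connection plus one bias per unit.
    return n_inputs * n_units + n_units

n_features = 126_741  # width of the TF-IDF input

layer_params = [
    dense_params(n_features, 256),  # first hidden layer
    dense_params(256, 128),         # second hidden layer
    dense_params(128, 5),           # output layer
]
total = sum(layer_params)

print(layer_params)  # [32445952, 32896, 645]
print(total)         # 32479493
```

Almost all of the parameters sit in the first layer, because of the very wide input.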
C:\Users\Thorben\Documents\workspace\GPU_Environment\lib\site-packages\sklearn\preprocessing\_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)
Train on 32661 samples, validate on 10888 samples
Epoch 1/5
32661/32661 [==============================] - 54s 2ms/step - loss: 0.1160 - acc: 0.9636 - val_loss: 0.0718 - val_acc: 0.9794
Epoch 00001: val_loss improved from inf to 0.07183, saving model to model.ks
Epoch 2/5
32661/32661 [==============================] - 53s 2ms/step - loss: 0.0289 - acc: 0.9913 - val_loss: 0.0792 - val_acc: 0.9788
Epoch 00002: val_loss did not improve from 0.07183
Epoch 3/5
32661/32661 [==============================] - 52s 2ms/step - loss: 0.0079 - acc: 0.9976 - val_loss: 0.0958 - val_acc: 0.9782
Epoch 00003: val_loss did not improve from 0.07183
Epoch 4/5
32661/32661 [==============================] - 53s 2ms/step - loss: 0.0047 - acc: 0.9985 - val_loss: 0.1033 - val_acc: 0.9784
Epoch 00004: val_loss did not improve from 0.07183
Epoch 5/5
32661/32661 [==============================] - 53s 2ms/step - loss: 0.0037 - acc: 0.9988 - val_loss: 0.1108 - val_acc: 0.9783
Epoch 00005: val_loss did not improve from 0.07183
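One caveat when reading the accuracy values in this log: to my knowledge, Keras resolves metrics=['accuracy'] to an element-wise binary accuracy when the loss is binary_crossentropy. In that case each of the five outputs is scored separately, and a text counts as four-fifths correct as soon as four of its five outputs are on the right side of 0.5. The toy example below, with made-up predictions, shows how this can differ from the share of texts whose most likely category is the true one.

```python
import numpy as np

# One-hot encoded true labels for four hypothetical texts.
y_true = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
])

# Made-up sigmoid outputs; the last two rows favour the wrong category.
y_prob = np.array([
    [0.9, 0.1, 0.1, 0.1, 0.1],
    [0.2, 0.8, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.3, 0.6],
])

# Element-wise accuracy: every single output is compared after thresholding at 0.5.
binary_acc = ((y_prob > 0.5) == (y_true == 1)).mean()

# Per-text accuracy: only the most likely category is compared.
categorical_acc = (y_prob.argmax(axis=1) == y_true.argmax(axis=1)).mean()

print(binary_acc)       # 0.8
print(categorical_acc)  # 0.5
```

If the element-wise metric is indeed what the log reports, the values are not directly comparable to a per-text accuracy such as the earlier benchmark. Evaluating the saved model with an argmax-based accuracy would settle this.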
As you can see, the training accuracy rises to about 0.998, which is a very good value. However, we have to take a closer look at the validation accuracy: it stays at roughly the same level of about 0.978 and reaches its highest value (0.9794) in the first epoch.
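The losses show the same picture: the validation loss is lowest after the first epoch (0.0718) and rises in every following epoch, while the training loss keeps falling. This gives us more confidence in trusting the earliest epoch. In the end, we could train the net for just one epoch and still get a very satisfying accuracy of nearly 98 %.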
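Picking the best epoch from the logged values can also be done programmatically. A minimal sketch using the validation numbers copied from the training log above:

```python
# Validation metrics per epoch, copied from the training log.
val_loss = [0.0718, 0.0792, 0.0958, 0.1033, 0.1108]
val_acc = [0.9794, 0.9788, 0.9782, 0.9784, 0.9783]

# Epochs are numbered from 1, list indices from 0.
best_by_loss = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_by_acc = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print(best_by_loss)  # 1
print(best_by_acc)   # 1
```

This matches what the ModelCheckpoint callback did during training: with save_best_only=True, only the model from the first epoch was written to model.ks. Keras also offers an EarlyStopping callback that stops training once the monitored value no longer improves, which could save the remaining epochs.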