xLearn Command Line Guide

Once you have built xLearn from source code successfully, you will find two executable files (xlearn_train and xlearn_predict) in your build directory. You can use these two executables to perform training and prediction tasks.

Quick Start

Make sure that you are in the build directory of xLearn, where you can find the demo data files small_train.txt and small_test.txt. Now we can type the following command to train a model:

./xlearn_train ./small_train.txt

Here we show a portion of the output of this task. Note that the loss values shown on your local machine may differ from the following results:

[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss     Time cost (sec)
[   10%      ]     1            0.569292                0.00
[   20%      ]     2            0.517142                0.00
[   30%      ]     3            0.490124                0.00
[   40%      ]     4            0.470445                0.00
[   50%      ]     5            0.451919                0.00
[   60%      ]     6            0.437888                0.00
[   70%      ]     7            0.425603                0.00
[   80%      ]     8            0.415573                0.00
[   90%      ]     9            0.405933                0.00
[  100%      ]    10            0.396388                0.00
[ ACTION     ] Start to save model ...
[------------] Model file: ./small_train.txt.model

By default, xLearn uses logistic regression (LR) and trains the model for 10 epochs.

After that, we can see that a new file called small_train.txt.model has been generated in the current directory. This file stores the trained model checkpoint, and we can use this model file to make a prediction in the future:

./xlearn_predict ./small_test.txt ./small_train.txt.model

After that, we can get a new file called small_test.txt.out in the current directory. This is the output of xLearn’s prediction. Here we show the first five lines of this output by using the following command:

head -n 5 ./small_test.txt.out

-1.9872
-0.0707959
-0.456214
-0.170811
-1.28986

Each of these lines is the prediction score calculated for one example in the test set. A negative score indicates a predicted negative example, and a positive score indicates a predicted positive example. In xLearn, you can convert the scores to probabilities in the range (0, 1) by using the --sigmoid option, or convert the results to binary values (0 and 1) by using the --sign option:

./xlearn_predict ./small_test.txt ./small_train.txt.model --sigmoid
head -n 5 ./small_test.txt.out

0.120553
0.482308
0.387884
0.457401
0.215877

./xlearn_predict ./small_test.txt ./small_train.txt.model --sign
head -n 5 ./small_test.txt.out

0
0
0
0
0
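These two options apply simple post-processing to the raw scores. As a rough illustration (plain Python, not part of xLearn, and assuming the usual 0.5 threshold for --sign), the raw scores above can be converted like this:

import math

def sigmoid(score):
    # Map a raw model score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

# Raw scores taken from small_test.txt.out above.
for score in [-1.9872, -0.0707959, -0.456214]:
    prob = sigmoid(score)               # what --sigmoid reports
    label = 1 if prob > 0.5 else 0      # what --sign reports (0.5 threshold assumed)
    print(round(prob, 6), label)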

Model Output

Users may want to generate different model files (for example, by using different hyper-parameters), so they can set the name and path of the model checkpoint file by using the -m option. By default, the name of the model file is training_data_name + .model:

./xlearn_train ./small_train.txt -m new_model

Users can also save the model in TXT format by using the -t option. For example:

./xlearn_train ./small_train.txt -t model.txt

After that, we can get a new file called model.txt, which stores the trained model in TXT format:

head -n 5 ./model.txt

-0.688182
0.458082
0
0
0

For the linear and bias terms, each parameter is stored on its own line. For FM and FFM, each latent-factor vector is stored on its own line. For example:

Linear:

bias: 0
i_0: 0
i_1: 0
i_2: 0
i_3: 0

FM:

bias: 0
i_0: 0
i_1: 0
i_2: 0
i_3: 0
v_0: 5.61937e-06 0.0212581 0.150338 0.222903
v_1: 0.241989 0.0474224 0.128744 0.0995021
v_2: 0.0657265 0.185878 0.0223869 0.140097
v_3: 0.145557 0.202392 0.14798 0.127928

FFM:

bias: 0
i_0: 0
i_1: 0
i_2: 0
i_3: 0
v_0_0: 5.61937e-06 0.0212581 0.150338 0.222903
v_0_1: 0.241989 0.0474224 0.128744 0.0995021
v_0_2: 0.0657265 0.185878 0.0223869 0.140097
v_0_3: 0.145557 0.202392 0.14798 0.127928
v_1_0: 0.219158 0.248771 0.181553 0.241653
v_1_1: 0.0742756 0.106513 0.224874 0.16325
v_1_2: 0.225384 0.240383 0.0411782 0.214497
v_1_3: 0.226711 0.0735065 0.234061 0.103661
v_2_0: 0.0771142 0.128723 0.0988574 0.197446
v_2_1: 0.172285 0.136068 0.148102 0.0234075
v_2_2: 0.152371 0.108065 0.149887 0.211232
v_2_3: 0.123096 0.193212 0.0179155 0.0479647
v_3_0: 0.055902 0.195092 0.0209918 0.0453358
v_3_1: 0.154174 0.144785 0.184828 0.0785329
v_3_2: 0.109711 0.102996 0.227222 0.248076
v_3_3: 0.144264 0.0409806 0.17463 0.083712
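The layout above can be read back with a few lines of code. The following Python sketch is an illustrative parser based only on the layout shown here (single-value lines for the bias and linear terms, k values per latent-vector line, with an optional "name:" prefix); it is not an official xLearn API:

def load_txt_model(path):
    # Collect single-value lines (bias and linear terms) and multi-value
    # lines (latent-factor vectors), skipping an optional "name:" prefix.
    linear, latent = [], []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            if tokens[0].endswith(":"):
                tokens = tokens[1:]
            values = [float(v) for v in tokens]
            if len(values) == 1:
                linear.append(values[0])
            else:
                latent.append(values)
    return linear, latent

linear, latent = load_txt_model("./model.txt")
print(len(linear), "bias/linear terms,", len(latent), "latent vectors")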

Online Learning

xLearn supports online learning, which means it can continue training on new data starting from a pre-trained model. Users can use the -pre option to specify the file path of the pre-trained model. For example:

./xlearn_train ./small_train.txt -s 0 -pre ./pre_model

Note that xLearn can only load the binary model for online learning, not the TXT model.

Prediction Output

Users can also use the -o option to specify the prediction output file. For example:

./xlearn_predict ./small_test.txt ./small_train.txt.model -o output.txt
head -n 5 ./output.txt

-2.01192
-0.0657416
-0.456185
-0.170979
-1.28849

By default, the name of the output file is test_data_name + .out.

Choose Machine Learning Algorithm

For now, xLearn supports three different machine learning algorithms: the linear model, factorization machines (FM), and field-aware factorization machines (FFM).

Users can choose among them by using the -s option:

-s <type> : Type of machine learning model (default 0)
   for classification task:
       0 -- linear model (GLM)
       1 -- factorization machines (FM)
       2 -- field-aware factorization machines (FFM)
   for regression task:
       3 -- linear model (GLM)
       4 -- factorization machines (FM)
       5 -- field-aware factorization machines (FFM)

For LR and FM, the input data format can be CSV or libsvm. For FFM, the input data should be in the libffm format:

libsvm format:

   label index_1:value_1 index_2:value_2 ... index_n:value_n

CSV format:

   label value_1 value_2 ... value_n

libffm format:

   label field_1:index_1:value_1 field_2:index_2:value_2 ...

xLearn can also use a comma (,) as the separator in the data file. For example:

libsvm format:

   label,index_1:value_1,index_2:value_2 ... index_n:value_n

CSV format:

   label,value_1,value_2 ... value_n

libffm format:

   label,field_1:index_1:value_1,field_2:index_2:value_2 ...

Note that if the CSV file does not contain the label y, users should add a placeholder label to the dataset themselves (and also to the test data). Otherwise, xLearn will treat the first element as the label y.
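To make the formats concrete, the following Python sketch parses a single libffm line into a label and a list of (field, index, value) triples, accepting either spaces or commas as separators; libsvm parsing is the same without the field part. This is an illustrative helper, not part of xLearn:

import re

def parse_libffm_line(line):
    # Split on spaces or commas, as described above.
    tokens = [t for t in re.split(r"[,\s]+", line.strip()) if t]
    label = float(tokens[0])
    features = []
    for token in tokens[1:]:
        field, index, value = token.split(":")
        features.append((int(field), int(index), float(value)))
    return label, features

print(parse_libffm_line("1 0:3:1.0 1:7:0.5"))
# (1.0, [(0, 3, 1.0), (1, 7, 0.5)])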

Users can also give a libffm file to the LR and FM tasks; in that case, xLearn will treat the data as the libsvm format. The following commands show how to use the different machine learning algorithms to solve the binary classification problem:

./xlearn_train ./small_train.txt -s 0  # Linear model (GLM)
./xlearn_train ./small_train.txt -s 1  # Factorization machine (FM)
./xlearn_train ./small_train.txt -s 2  # Field-aware factorization machine (FFM)

Set Validation Dataset

A validation dataset is used to tune the hyper-parameters of a machine learning model. In xLearn, users can use the -v option to set the validation dataset. For example:

./xlearn_train ./small_train.txt -v ./small_test.txt

A portion of xLearn’s output:

Epoch      Train log_loss       Test log_loss     Time cost (sec)
    1            0.575049            0.530560                0.00
    2            0.517496            0.537741                0.00
    3            0.488428            0.527205                0.00
    4            0.469010            0.538175                0.00
    5            0.452817            0.537245                0.00
    6            0.438929            0.536588                0.00
    7            0.423491            0.532349                0.00
    8            0.416492            0.541107                0.00
    9            0.404554            0.546218                0.00

Here we can see that the training loss goes down continuously, while the validation loss (test loss) goes down at first and then goes up. This is because the model has overfitted the training dataset. By default, xLearn calculates the validation loss in each epoch, and users can also set a different evaluation metric by using the -x option. For classification problems, the metric can be acc (accuracy), prec (precision), f1 (F1 score), or auc (AUC score). For example:

./xlearn_train ./small_train.txt -v ./small_test.txt -x acc
./xlearn_train ./small_train.txt -v ./small_test.txt -x prec
./xlearn_train ./small_train.txt -v ./small_test.txt -x f1
./xlearn_train ./small_train.txt -v ./small_test.txt -x auc
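For reference, the two classification quantities used most often in this guide, log loss and accuracy, can be computed from predicted probabilities as in the short sketch below (plain Python, independent of xLearn; the data here is made up for illustration):

import math

def log_loss(y_true, y_prob, eps=1e-15):
    # Average negative log-likelihood; labels are 0/1, probabilities in (0, 1).
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == (y == 1))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.4]
print(log_loss(y_true, y_prob), accuracy(y_true, y_prob))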

For regression problems, the metric can be mae, mape, or rmsd (rmse). For example:

cd demo/house_price/
../../xlearn_train ./house_price_train.txt -s 3 -x rmse --cv
../../xlearn_train ./house_price_train.txt -s 3 -x rmsd --cv

Note that in the above examples we use cross-validation via the --cv option, which will be introduced in the next section.

Cross-Validation

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset. In xLearn, users can enable this technique with the --cv option. For example:

./xlearn_train ./small_train.txt --cv

By default, xLearn uses 3-fold cross-validation, and users can set the number of folds by using the -f option:

./xlearn_train ./small_train.txt -f 5 --cv

Here we set the number of folds to 5. xLearn will report the average validation loss at the end of its output:

 ...
[------------] Average log_loss: 0.549417
[ ACTION     ] Finish Cross-Validation
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 0.03 (sec)
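Conceptually, k-fold cross-validation splits the training data into k parts, trains on k-1 of them, evaluates on the held-out part, and averages the k validation losses. The sketch below shows that splitting logic in plain Python; it is only an illustration of the idea, not xLearn's implementation:

def k_fold_indices(n_examples, k=3):
    # Assign examples to k folds in round-robin order; for each fold, the
    # held-out part is used for validation and the rest for training.
    folds = [[] for _ in range(k)]
    for i in range(n_examples):
        folds[i % k].append(i)
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, folds[held_out]

for train_idx, valid_idx in k_fold_indices(10, k=5):
    print(len(train_idx), "train /", len(valid_idx), "validation examples")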

Choose Optimization Method

In xLearn, users can choose different optimization methods by using the -p option. For now, xLearn supports the sgd, adagrad, and ftrl methods. By default, xLearn uses adagrad. For example:

./xlearn_train ./small_train.txt -p sgd
./xlearn_train ./small_train.txt -p adagrad
./xlearn_train ./small_train.txt -p ftrl

Compared to the traditional sgd method, adagrad adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones. For this reason, it is well suited for dealing with sparse data. In addition, sgd is more sensitive to the learning rate than adagrad.
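The per-parameter adaptation described above is the core of the textbook AdaGrad update, sketched below in plain Python (xLearn's internal implementation may differ in details):

import math

def adagrad_update(w, grad, grad_sq_sum, lr=0.2, eps=1e-8):
    # Keep a running sum of squared gradients per parameter; parameters that
    # receive many large gradients (large sum) get smaller and smaller steps.
    grad_sq_sum += grad * grad
    w -= lr * grad / (math.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum

w, g2 = 0.0, 0.0
for grad in [0.5, 0.5, 0.5]:
    w, g2 = adagrad_update(w, grad, g2)
    print(w)   # step size shrinks on each update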

FTRL (Follow-the-Regularized-Leader) is also a well-known method that has been widely used for large-scale sparse problems. To use FTRL, users need to tune more hyper-parameters than for sgd and adagrad.

Hyper-parameter Tuning

In machine learning, a hyper-parameter is a parameter whose value is set before the learning process begins. By contrast, the value of other parameters is derived via training. Hyper-parameter tuning is the problem of choosing a set of optimal hyper-parameters for a learning algorithm.

First, the learning rate is one of the most important hyper-parameters in machine learning. By default, this value is set to 0.2 in xLearn, and we can tune it by using the -r option:

./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1
./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.5
./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.01

We can also use the -b option to control regularization. By default, xLearn uses L2 regularization, and regular_lambda is set to 0.00002:

./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.001
./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.002
./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.01
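As a rough picture of what -b controls, L2 regularization adds the term reg_lambda * w to each weight's gradient, pulling large weights toward zero. A minimal sketch of one regularized update step (plain Python, for illustration only):

def sgd_step_with_l2(w, grad, lr=0.1, reg_lambda=0.00002):
    # The regularization strength (set by -b) shrinks large weights
    # toward zero at every update.
    return w - lr * (grad + reg_lambda * w)

print(sgd_step_with_l2(w=1.0, grad=0.3))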

For the FTRL method, we also need to tune four additional hyper-parameters: -alpha, -beta, -lambda_1, and -lambda_2. For example:

./xlearn_train ./small_train.txt -p ftrl -alpha 0.002 -beta 0.8 -lambda_1 0.001 -lambda_2 1.0

For FM and FFM, users also need to set the size of the latent factor by using the -k option. By default, xLearn uses 4 for this value:

./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 2
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 4
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 5
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 8

xLearn uses SSE instructions to accelerate its vector operations, so the time cost for k=2 and k=4 is the same.
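To see where -k enters the model, the sketch below computes the standard FM score for one example: each feature i has a k-dimensional latent vector v_i, and every pair of active features contributes the dot product of their vectors scaled by the feature values. This is the textbook FM formula in plain Python, not xLearn's optimized code:

def fm_score(features, bias, w, v):
    # features: list of (index, value); w: linear weights; v: index -> k-dim latent vector.
    score = bias + sum(w[i] * x for i, x in features)
    for a in range(len(features)):
        for b in range(a + 1, len(features)):
            i, xi = features[a]
            j, xj = features[b]
            score += sum(v[i][f] * v[j][f] for f in range(len(v[i]))) * xi * xj
    return score

# Two active features, each with a k = 4 latent vector (made-up numbers).
v = {0: [0.1, 0.2, 0.3, 0.4], 3: [0.2, 0.1, 0.0, 0.1]}
print(fm_score([(0, 1.0), (3, 2.0)], bias=0.0, w={0: 0.5, 3: -0.1}, v=v))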

For FM and FFM, users can also set the hyper-parameter -u to scale the model initialization. By default, this value is 0.66:

./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.80
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.40
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.10

Set Epoch Number and Early-Stopping

For machine learning tasks, one epoch consists of one full training cycle over the training set. In xLearn, users can set the number of epochs for training by using the -e option:

./xlearn_train ./small_train.txt -e 3
./xlearn_train ./small_train.txt -e 5
./xlearn_train ./small_train.txt -e 10

If you set the validation data, xLearn will perform early-stopping by default. For example:

./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10

Here, we set the epoch number to 10, but xLearn stopped at epoch 7 because it obtained the best model at that epoch (you may get a different stopping epoch on your local machine):

 ...
[ ACTION     ] Early-stopping at epoch 7
[ ACTION     ] Start to save model ...

Users can set the window size for early stopping by using the -sw option:

./xlearn_train ./small_train.txt -e 10 -v ./small_test.txt -sw 3

Users can disable early stopping by using the --dis-es option:

./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10 --dis-es

In this case, xLearn performs the complete 10 epochs of training.

By default, xLearn uses the metric value to choose the best epoch if the user has set a metric (-x). Otherwise, xLearn uses the validation (test) loss to choose the best epoch.
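Put together, the early-stopping rule can be read as: keep the model from the best epoch seen so far, and stop once that best value has not improved for a number of consecutive epochs given by the window. A minimal sketch of that logic, assuming a lower-is-better metric such as log loss (the window value here is arbitrary):

def should_stop(val_losses, window=2):
    # Stop when the best (lowest) validation loss occurred more than
    # `window` epochs ago; the model from that best epoch is kept.
    best_epoch = min(range(len(val_losses)), key=lambda e: val_losses[e])
    return len(val_losses) - 1 - best_epoch >= window

losses = [0.530, 0.527, 0.532, 0.538, 0.541]
for epoch in range(1, len(losses) + 1):
    if should_stop(losses[:epoch], window=2):
        print("stop after epoch", epoch)
        break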

Lock-Free Learning

By default, xLearn performs Hogwild! lock-free learning, which takes advantage of the multiple cores of modern CPUs to accelerate training. However, lock-free training is non-deterministic. For example, if we run the following command multiple times, we may get a different loss value at each epoch:

./xlearn_train ./small_train.txt

The 1st time: 0.396352
The 2nd time: 0.396119
The 3rd time: 0.396187
...

Users can set the number of threads for xLearn by using the -nthread option:

./xlearn_train ./small_train.txt -nthread 2

If you don't set this option, xLearn uses all of the CPU cores by default. xLearn will show the number of threads it uses:

[------------] xLearn uses 2 threads for training task.
[ ACTION     ] Read Problem ...

Users can disable lock-free training by using --dis-lock-free:

./xlearn_train ./small_train.txt --dis-lock-free

This time, our results are deterministic:

The 1st time: 0.396372
The 2nd time: 0.396372
The 3rd time: 0.396372

The disadvantage of --dis-lock-free is that it is much slower than lock-free training.

Instance-wise Normalization

For FM and FFM, xLearn uses instance-wise normalization by default. In some scenarios, such as CTR prediction, this technique is very useful, but sometimes it hurts model performance. Users can disable instance-wise normalization by using the --no-norm option:

./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt --no-norm

Note that if you use instance-wise normalization in the training process, you also need to use it in the prediction process.
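Instance-wise normalization generally means scaling each example's feature values by a statistic of that example alone, so that instances with very different magnitudes become comparable. The sketch below shows one common choice (dividing each value by the example's squared L2 norm) in plain Python; xLearn's exact scaling may differ:

def normalize_instance(features):
    # features: list of (index, value) for a single example.
    norm_sq = sum(value * value for _, value in features)
    if norm_sq == 0:
        return features
    return [(index, value / norm_sq) for index, value in features]

print(normalize_instance([(0, 2.0), (3, 1.0)]))
# [(0, 0.4), (3, 0.2)]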

Quiet Training

When using the --quiet option, xLearn will not compute any evaluation metrics during training; it will just train the model quietly:

./xlearn_train ./small_train.txt --quiet

In this way, xLearn can accelerate its training speed significantly.

xLearn also supports a Python API, which we will introduce in the next section.