xLearn Command Line Guide¶
Once you built xLearn from source code successfully, you can get two executable files
xlearn_predict) in your
build directory. Now you can use these
two executable files to perform training and prediction tasks.
Make sure that you are in the
build directory of xLearn, and you can find the demo data
small_train.txt in this directory. Now we can type the following command to train a model:
Here, we show a portion of the output in this task. Note that the loss value shown in your local machine could be different with the following result:
[ ACTION ] Start to train ... [------------] Epoch Train log_loss Time cost (sec) [ 10% ] 1 0.569292 0.00 [ 20% ] 2 0.517142 0.00 [ 30% ] 3 0.490124 0.00 [ 40% ] 4 0.470445 0.00 [ 50% ] 5 0.451919 0.00 [ 60% ] 6 0.437888 0.00 [ 70% ] 7 0.425603 0.00 [ 80% ] 8 0.415573 0.00 [ 90% ] 9 0.405933 0.00 [ 100% ] 10 0.396388 0.00 [ ACTION ] Start to save model ... [------------] Model file: ./small_train.txt.model
By default, xLearn uses the logistic regression (LR) to train the model within 10 epoch.
After that, we can see that a new file called
small_train.txt.model has been generated in the current directory. This file stores the trained model checkpoint, and we can use this model file to make a prediction in the future:
./xlearn_predict ./small_test.txt ./small_train.txt.model
After that, we can get a new file called
small_test.txt.out in the current directory. This is the output of xLearn’s prediction. Here we show the first five lines of this output by using the following command:
head -n 5 ./small_test.txt.out -1.9872 -0.0707959 -0.456214 -0.170811 -1.28986
These lines of data is the prediction score calculated for each example in the test set. The negative data represents the negative example and positive data represents the positive example. In xLearn, you can convert the score to (0-1) by using
--sigmoid option, and also you can convert your result to binary result (0 and 1) by using
./xlearn_predict ./small_test.txt ./small_train.txt.model --sigmoid head -n 5 ./small_test.txt.out 0.120553 0.482308 0.387884 0.457401 0.215877 ./xlearn_predict ./small_test.txt ./small_train.txt.model --sign head -n 5 ./small_test.txt.out 0 0 0 0 0
Users may want to generate different model files (by using different hyper-parameters), and hence users can set the name and path of the model checkpoint file by using
-m option. By default, the name of the model file is
./xlearn_train ./small_train.txt -m new_model
Also, users can save the model in
TXT format by using
-t option. For example:
./xlearn_train ./small_train.txt -t model.txt
After that, we can get a new file called
model.txt, which stores the trained model in
head -n 5 ./model.txt -0.688182 0.458082 0 0 0
For the linear and bias term, we store each parameter in each line. For FM and FFM, we store each vector of the latent factor in each line. For example:
bias: 0 i_0: 0 i_1: 0 i_2: 0 i_3: 0
bias: 0 i_0: 0 i_1: 0 i_2: 0 i_3: 0 v_0: 5.61937e-06 0.0212581 0.150338 0.222903 v_1: 0.241989 0.0474224 0.128744 0.0995021 v_2: 0.0657265 0.185878 0.0223869 0.140097 v_3: 0.145557 0.202392 0.14798 0.127928
bias: 0 i_0: 0 i_1: 0 i_2: 0 i_3: 0 v_0_0: 5.61937e-06 0.0212581 0.150338 0.222903 v_0_1: 0.241989 0.0474224 0.128744 0.0995021 v_0_2: 0.0657265 0.185878 0.0223869 0.140097 v_0_3: 0.145557 0.202392 0.14798 0.127928 v_1_0: 0.219158 0.248771 0.181553 0.241653 v_1_1: 0.0742756 0.106513 0.224874 0.16325 v_1_2: 0.225384 0.240383 0.0411782 0.214497 v_1_3: 0.226711 0.0735065 0.234061 0.103661 v_2_0: 0.0771142 0.128723 0.0988574 0.197446 v_2_1: 0.172285 0.136068 0.148102 0.0234075 v_2_2: 0.152371 0.108065 0.149887 0.211232 v_2_3: 0.123096 0.193212 0.0179155 0.0479647 v_3_0: 0.055902 0.195092 0.0209918 0.0453358 v_3_1: 0.154174 0.144785 0.184828 0.0785329 v_3_2: 0.109711 0.102996 0.227222 0.248076 v_3_3: 0.144264 0.0409806 0.17463 0.083712
xLearn can supoort online learning, which can train new data based on the pre-trained model. User can use the
-pre option to specify the file path of pre-trained model. For example:
./xlearn_train ./small_train.txt -s 0 -pre ./pre_model
Note that, xLearn can only uses the binary model, not the TXT model.
Users can also set
-o option to specify the prediction output file. For example:
./xlearn_predict ./small_test.txt ./small_train.txt.model -o output.txt head -n 5 ./output.txt -2.01192 -0.0657416 -0.456185 -0.170979 -1.28849
By default, the name of the output file is
Choose Machine Learning Algorithm¶
For now, xLearn can support three different machine learning algorithms, including linear model, factorization machine (FM), and field-aware factorization machine (FFM).
Users can choose different machine learning algorithms by using
-s <type> : Type of machine learning model (default 0) for classification task: 0 -- linear model (GLM) 1 -- factorization machines (FM) 2 -- field-aware factorization machines (FFM) for regression task: 3 -- linear model (GLM) 4 -- factorization machines (FM) 5 -- field-aware factorization machines (FFM)
For LR and FM, the input data format can be
libsvm. For FFM, the input data should be the
libsvm format: label index_1:value_1 index_2:value_2 ... index_n:value_n CSV format: label value_1 value_2 .. value_n libffm format: label field_1:index_1:value_1 field_2:index_2:value_2 ...
xLearn can also use
, as the splitor in file. For example:
libsvm format: label,index_1:value_1,index_2:value_2 ... index_n:value_n CSV format: label,value_1,value_2 .. value_n libffm format: label,field_1:index_1:value_1,field_2:index_2:value_2 ...
Note that, if the csv file doesn’t contain the label
y, the user should add a
placeholder to the dataset by themselves (Also in test data). Otherwise, xLearn
will treat the first element as the label
Users can also give a
libffm file to LR and FM task. At that time, xLearn will
treat this data as
libsvm format. The following command shows how to use different
machine learning algorithms to solve the binary classification problem:
./xlearn_train ./small_train.txt -s 0 # Linear model (GLM) ./xlearn_train ./small_train.txt -s 1 # Factorization machine (FM) ./xlearn_train ./small_train.txt -s 2 # Field-awre factorization machine (FFM)
Set Validation Dataset¶
A validation dataset is used to tune the hyper-parameters of a machine learning model.
In xLearn, users can use
-v option to set the validation dataset. For example:
./xlearn_train ./small_train.txt -v ./small_test.txt
A portion of xLearn’s output:
Epoch Train log_loss Test log_loss Time cost (sec) 1 0.575049 0.530560 0.00 2 0.517496 0.537741 0.00 3 0.488428 0.527205 0.00 4 0.469010 0.538175 0.00 5 0.452817 0.537245 0.00 6 0.438929 0.536588 0.00 7 0.423491 0.532349 0.00 8 0.416492 0.541107 0.00 9 0.404554 0.546218 0.00
Here we can see that the training loss continuously goes down. But the validation loss (test loss) goes down
first, and then goes up. This is because the model has already overfitted current training dataset. By default,
xLearn will calculate the validation loss in each epoch, while users can also set different evaluation metrics by
-x option. For classification problems, the metric can be:
f1 (f1 score),
auc (AUC score). For example:
./xlearn_train ./small_train.txt -v ./small_test.txt -x acc ./xlearn_train ./small_train.txt -v ./small_test.txt -x prec ./xlearn_train ./small_train.txt -v ./small_test.txt -x f1 ./xlearn_train ./small_train.txt -v ./small_test.txt -x auc
For regression problems, the metric can be
rmsd (rmse). For example:
cd demo/house_price/ ../../xlearn_train ./house_price_train.txt -s 3 -x rmse --cv ../../xlearn_train ./house_price_train.txt -s 3 -x rmsd --cv
Note that, in the above example we use cross-validation by using
--cv option, which will be
introduced in the next section.
Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing
how the results of a statistical analysis will generalize to an independent dataset. In xLearn, users
can use the
--cv option to use this technique. For example:
./xlearn_train ./small_train.txt --cv
On default, xLearn uses 3-folds cross validation, and users can set the number of fold by using
./xlearn_train ./small_train.txt -f 5 --cv
Here we set the number of folds to
5. The xLearn will calculate the average validation loss at
the end of its output message:
... [------------] Average log_loss: 0.549417 [ ACTION ] Finish Cross-Validation [ ACTION ] Clear the xLearn environment ... [------------] Total time cost: 0.03 (sec)
Choose Optimization Method¶
In xLearn, users can choose different optimization methods by using
-p option. For now, xLearn
ftrl method. By default, xLearn uses the
./xlearn_train ./small_train.txt -p sgd ./xlearn_train ./small_train.txt -p adagrad ./xlearn_train ./small_train.txt -p ftrl
Compared to traditional
adagrad adapts the learning rate to the parameters, performing
larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for
dealing with sparse data. In addition,
sgd is more sensitive to the learning rate compared with
FTRL (Follow-the-Regularized-Leader) is also a famous method that has been widely used in the large-scale
sparse problem. To use FTRL, users need to tune more hyper-parameters compared with
In machine learning, a hyper-parameter is a parameter whose value is set before the learning process begins. By contrast, the value of other parameters is derived via training. Hyper-parameter tuning is the problem of choosing a set of optimal hyper-parameters for a learning algorithm.
learning rate is one of the most important hyper-parameters used in machine learning.
By default, this value is set to
0.2 in xLearn, and we can tune this value by using
./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.5 ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.01
We can also use the
-b option to perform regularization. By default, xLearn uses
L2 regularization, and
the regular_lambda has been set to
./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.001 ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.002 ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.01
FTRL method, we also need to tune another four hyper-parameters, including
-lambda_2. For example:
./xlearn_train ./small_train.txt -p ftrl -alpha 0.002 -beta 0.8 -lambda_1 0.001 -lambda_2 1.0
For FM and FFM, users also need to set the size of latent factor by using
-k option. By default, xLearn
4 for this value:
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 2 ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 4 ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 5 ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 8
xLearn uses SSE instruction to accelerate vector operation, and hence the time cost for
k=4 are the same.
For FM and FFM, users can also set the hyper-parameter
-u for scalling model initialization. By default, this value is
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.80 ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.40 ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.10
Set Epoch Number and Early-Stopping¶
For machine learning tasks, one epoch consists of one full training cycle on the training set.
In xLearn, users can set the number of epoch for training by using
./xlearn_train ./small_train.txt -e 3 ./xlearn_train ./small_train.txt -e 5 ./xlearn_train ./small_train.txt -e 10
If you set the validation data, xLearn will perform early-stopping by default. For example:
./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10
Here, we set epoch number to
10, but xLearn stopped at epoch
7 because we get the best model
at that epoch (you may get different a stopping number on your local machine):
... [ ACTION ] Early-stopping at epoch 7 [ ACTION ] Start to save model ...
Users can set the
window size for early stopping by using
./xlearn_train ./small_train.txt -e 10 -v ./small_test.txt -sw 3
Users can disable early-stopping by using
./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10 --dis-es
At this time, xLearn performed completed 10 epoch for training.
By default, xLearn will use the metric value to choose the best epoch if user has set the metric (
-x). If not, xLearn uses the test_loss to choose the best epoch.
By default, xLearn performs Hogwild! lock-free learning, which takes advantages of multiple cores of modern CPU to accelerate training task. But lock-free training is non-deterministic. For example, if we run the following command multiple times, we may get different loss value at each epoch:
./xlearn_train ./small_train.txt The 1st time: 0.396352 The 2nd time: 0.396119 The 3nd time: 0.396187 ...
Users can set the number of thread for xLearn by using
./xlearn_train ./small_train.txt -nthread 2
If you don’t set this option, xLearn uses all of the CPU cores by default. xLearn will show the number of threads:
[------------] xLearn uses 2 threads for training task. [ ACTION ] Read Problem ...
Users can disable lock-free training by using
./xlearn_train ./small_train.txt --dis-lock-free
In thie time, our result are determinnistic:
The 1st time: 0.396372 The 2nd time: 0.396372 The 3nd time: 0.396372
The disadvantage of
--dis-lock-free is that it is much slower than lock-free training.
For FM and FFM, xLearn uses instance-wise normalizarion by default. In some scenes like CTR prediction, this technique is very
useful. But sometimes it hurts model performance. Users can disable instance-wise normalization by using
./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt --no-norm
Note that if you use Instance-wise Normalization in training process, you also need to use the meachnism in prediction process.
--quiet option, xLearn will not calculate any evaluation information during the training, and
it will just train the model quietly:
./xlearn_train ./small_train.txt --quiet
In this way, xLearn can accelerate its training speed significantly.
xLearn can also support Python API, and we will introduce it in the next section.