Hi Tian Qin,

Oct 1, 2019

Good question — it’s usually both.

The “total number of samples” refers to the number of observations used to train the individual tree.

Random forest models use bagging (bootstrap aggregating): for each tree, a different training set is created by sampling the original data with replacement (i.e. the same data point can be chosen more than once).
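To make the sampling step concrete, here is a minimal sketch in plain Python. It is only an illustration of sampling with replacement, not how any particular library implements it; the `bootstrap_sample` helper, the seed, and the 1,000 made-up row indices are all my own assumptions.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data, with replacement (hypothetical helper)."""
    return rng.choices(data, k=len(data))


rng = random.Random(0)            # fixed seed, chosen only for repeatability
observations = list(range(1000))  # stand-in for the row indices of a data set
sample = bootstrap_sample(observations, rng)

n_total = len(sample)        # always equals len(observations)
n_unique = len(set(sample))  # fewer than n_total, because some rows repeat

print(f"drawn: {n_total}, distinct rows: {n_unique}")
```

On average, roughly 63% (1 - 1/e) of the original rows show up at least once in such a sample, which is why "n samples drawn" and "n distinct rows" are not the same thing.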

With Scikit-Learn, you can set bootstrap to be True or False. If you have n observations and specify bootstrap to be True, for each tree, it will sample, with replacement, n examples from your data set. If you set bootstrap to be False, there is no sampling and each tree will be trained on the same (original) training data.
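If you'd like to check this yourself, a small sketch along these lines should work. The synthetic data set, the forest size, and the `root_counts` helper are assumptions for illustration only. As far as I understand, scikit-learn applies the bootstrap through per-row weights, so a tree's root node reports both a count of distinct rows (`n_node_samples`) and a weighted count (`weighted_n_node_samples`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with n = 1000 observations (made up for this example).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)


def root_counts(bootstrap):
    """Fit a small forest and return the root-node counts of its first tree."""
    forest = RandomForestClassifier(
        n_estimators=5, bootstrap=bootstrap, random_state=0
    )
    forest.fit(X, y)
    tree = forest.estimators_[0].tree_
    # n_node_samples: distinct rows reaching the node
    # weighted_n_node_samples: rows counted with their (bootstrap) weights
    return int(tree.n_node_samples[0]), float(tree.weighted_n_node_samples[0])


with_bootstrap = root_counts(True)
without_bootstrap = root_counts(False)

print("bootstrap=True :", with_bootstrap)
print("bootstrap=False:", without_bootstrap)
```

With bootstrap set to False both numbers should equal n; with bootstrap set to True the weighted count should still be n, while the distinct-row count comes out lower.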

Spark also implements bagging but doesn’t provide an option to prevent it. It includes an argument called subsamplingRate, which allows you to specify the fraction of the training data you wish to use for each decision tree. However, the recommended value is 1.0 (in our example, n observations per tree), as the main reason to lower it is to speed up training.

Best,

Stacey

Written by Stacey Ronaghan
