Stacey Ronaghan
Nov 8, 2018


The “random” part of random forest is that, at each split in a tree, only a random subset of the features is considered as split candidates; the splitting criterion itself (e.g. Gini impurity or entropy) is the same one a single decision tree uses. Most implementations also train each tree on a bootstrap sample of the rows, which is where the “bootstrap” parameter below comes in.
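As a rough sketch (not code from either library; the function names and the Gini criterion are just illustrative), the per-split feature subsetting could look something like this:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, max_features):
    """Find the best threshold split, considering only a random subset of features.

    The random feature subset is the forest-specific part; the splitting
    criterion (Gini impurity here) is the same as in a plain decision tree.
    """
    n_features = X.shape[1]
    candidate_features = np.random.choice(n_features, size=max_features, replace=False)

    best = (None, None, np.inf)  # (feature index, threshold, weighted impurity)
    for f in candidate_features:
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best

# Toy usage: 100 rows, 10 features, label driven by feature 3.
X = np.random.rand(100, 10)
y = (X[:, 3] > 0.5).astype(int)
feature, threshold, impurity = best_split(X, y, max_features=3)
```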

The parameters available for decision trees are also available in the random forest implementations, but each library adds extra, forest-specific parameters.

Spark has two additional parameters: “numTrees” — how many trees to use in the random forest; and “featureSubsetStrategy” — how to select the subset of features at each split.
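For example, a minimal PySpark sketch (the toy DataFrame and column names here are just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("rf-example").getOrCreate()

# A tiny toy DataFrame with a "features" vector column and a "label" column.
data = spark.createDataFrame([
    (Vectors.dense([0.0, 1.0]), 0.0),
    (Vectors.dense([1.0, 0.0]), 1.0),
    (Vectors.dense([1.0, 1.0]), 1.0),
    (Vectors.dense([0.0, 0.0]), 0.0),
], ["features", "label"])

# numTrees sets the size of the forest; featureSubsetStrategy controls how
# many features are considered at each split ("auto", "all", "onethird",
# "sqrt", "log2", or a number/fraction passed as a string).
rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=50,
    featureSubsetStrategy="sqrt",
)
model = rf.fit(data)
```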

Scikit-learn has two parameters that map to those in Spark: “n_estimators”, the number of trees, and “max_features”, the size of the feature subset at each split (the latter is also available in its decision tree algorithm). It also has several forest-specific parameters: “bootstrap”, whether each tree is trained on a bootstrap sample of the data; “oob_score”, whether to also score the model on the out-of-bag examples each tree did not see; “n_jobs”, how many jobs to run in parallel when building trees; “verbose”, how much information to print while fitting and predicting; and “warm_start”, used when adding more estimators (trees) to an already-fitted ensemble (forest).
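A minimal Scikit-learn sketch showing these parameters together (the dataset is synthetic, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators and max_features mirror Spark's numTrees and
# featureSubsetStrategy; the remaining parameters are forest-specific.
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # size of the random feature subset at each split
    bootstrap=True,        # train each tree on a bootstrap sample of the rows
    oob_score=True,        # also score on out-of-bag examples
    n_jobs=-1,             # build trees in parallel across all cores
    verbose=0,             # how much progress information to print
    warm_start=False,      # set to True later to grow the existing forest
    random_state=0,
)
rf.fit(X, y)
print("Out-of-bag accuracy:", rf.oob_score_)

# With warm_start=True, raising n_estimators and calling fit again adds
# trees to the existing forest instead of retraining from scratch.
rf.set_params(warm_start=True, n_estimators=200)
rf.fit(X, y)
```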

Here’s an article for more information on random forest parameter tuning with Scikit-learn: https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/
