Tips To Use Random Forest Algorithm in Python

Random forest is a supervised machine learning algorithm built on ensemble learning. In ensemble learning, you combine several models to form a more powerful prediction model. The models can be of different types, or they can be multiple instances of the same type.

You can use a random forest for both classification and regression tasks. It combines many models of the same type, namely decision trees, into a forest of trees. This is where the name random forest comes from.

How the Random Forest Algorithm Works

The random forest algorithm works in the following steps.

  • Select N random records from the dataset.
  • Build a decision tree on these N records.
  • Choose the number of trees you want in the forest, and repeat steps 1 and 2 for each tree.
  • For a new record, each tree makes its own prediction. In regression, the forest's prediction is the average of the tree predictions. In classification, it is the class chosen by the majority of the trees.
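The steps above can be sketched by hand with Scikit-Learn's DecisionTreeRegressor. This is only a minimal illustration, not how RandomForestRegressor is implemented internally. The helper names (fit_simple_forest, predict_simple_forest), the toy data, and the choice to sample records with replacement are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_simple_forest(X, y, n_trees=10, seed=0):
    """Train n_trees decision trees, each on a random sample of the records."""
    rng = np.random.default_rng(seed)
    n_records = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: select N random records (here: with replacement).
        idx = rng.integers(0, n_records, size=n_records)
        # Step 2: build a decision tree on the selected records.
        tree = DecisionTreeRegressor(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_simple_forest(trees, X):
    """Regression: average the predictions of all trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Hypothetical toy data, only for demonstration.
data_rng = np.random.default_rng(42)
X_demo = data_rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + data_rng.normal(0, 0.5, size=200)

forest = fit_simple_forest(X_demo, y_demo, n_trees=25)
demo_pred = predict_simple_forest(forest, X_demo[:5])
print(demo_pred)
```

In practice you would use RandomForestRegressor or RandomForestClassifier directly, as shown later in this article.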

Advantages of Using Random Forest

Like any algorithm, random forest has both advantages and disadvantages. If you use it for regression or classification, you should know its pros and cons.

  • Combining many trees generally improves accuracy and reduces overfitting compared with a single decision tree.
  • It works for both regression and classification problems.
  • The random forest algorithm is stable. When new data is introduced, it does not affect the whole algorithm much. The new data may influence a single tree, but it is unlikely to influence all the trees.
  • The random forest performs well when you have both numerical and categorical features. Note that in Scikit-Learn, categorical features must first be encoded as numbers.
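To illustrate the last point, here is a small sketch of training a forest on data that mixes a numerical and a categorical feature. The toy table and its column names (income, region, bought) are invented for this example. One-hot encoding with pd.get_dummies is just one possible way to encode categories.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one numerical and one categorical feature.
toy = pd.DataFrame({
    "income": [30, 45, 60, 80, 25, 95, 50, 70],
    "region": ["north", "south", "south", "east",
               "north", "east", "south", "north"],
    "bought": [0, 0, 1, 1, 0, 1, 0, 1],
})

# One-hot encode the categorical column so that every feature is numeric.
X_toy = pd.get_dummies(toy[["income", "region"]], columns=["region"])
y_toy = toy["bought"]

toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(X_toy, y_toy)

print(list(X_toy.columns))
print(toy_model.predict(X_toy.head(2)))
```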

Disadvantages of Using Random Forest

Now that we have discussed the advantages of the random forest, it is time to look at the disadvantages.

  • The first and most significant disadvantage of random forest is its complexity. It builds many trees and then combines their results, so it needs more computing power and resources than a single decision tree.
  • The second is the long training time. Because many trees have to be built, training takes longer than for a single decision tree.

Using Random Forest for Regression

In this section, we use a random forest to solve a regression problem. After that, we use it to solve a classification problem. Many other random forest examples in Python are available as well.

Problem Definition

The task is to predict gas (petrol) consumption in 48 US states. The prediction is based on the petrol tax, the per capita income, the length of paved highways, and the proportion of the population that holds a driving license.

Solution

To solve this problem, we use the random forest algorithm from the Scikit-Learn Python library and follow the usual machine learning pipeline. The steps are:

1. Import Libraries

First, import the required libraries with the code below.

import pandas as pd

import numpy as np

2. Import the Dataset

You can download the dataset from the link below.

https://drive.google.com/file/d/1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_/view

This article assumes that you save the downloaded file in the Datasets folder on the D drive. If you store it somewhere else, change the path accordingly.

To import the dataset, run the following command:

dataset = pd.read_csv(r'D:\Datasets\petrol_consumption.csv')

The values in the dataset may not be on the same scale. If so, we can scale them, as shown in the scaling step.

3. Arrange the Data

In this step, you prepare the data for training. This involves two tasks: dividing the data into attributes and labels, and then dividing the result into training and test sets.

To divide the data into attributes and labels, run the following commands.

X = dataset.iloc[:, 0:4].values

y = dataset.iloc[:, 4].values

Next, divide the data into training and test sets with the following commands.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
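To see what test_size=0.2 does, here is a small sketch that uses stand-in arrays instead of the real file. It assumes the dataset has 48 rows (one per state) and 4 attributes. The names X_stub and y_stub are placeholders for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the assumed shape of the petrol dataset:
# 48 rows (states) and 4 attributes.
X_stub = np.arange(48 * 4, dtype=float).reshape(48, 4)
y_stub = np.arange(48, dtype=float)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_stub, y_stub, test_size=0.2, random_state=0
)

# About 80 percent of the rows go to training, the rest to testing.
print(Xs_train.shape, Xs_test.shape)
```

Fixing random_state makes the split reproducible, so repeated runs give the same training and test rows.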

4. Scaling

Feature scaling is not essential for a random forest. If you still want to scale the dataset, use the commands below.

# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)
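Note that the scaler is fitted on the training set only, and the test set is transformed with the same training statistics. The sketch below illustrates this on randomly generated stand-in data. The arrays train_stub and test_stub are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data for the training and test attributes.
rng = np.random.default_rng(0)
train_stub = rng.normal(loc=100.0, scale=20.0, size=(40, 2))
test_stub = rng.normal(loc=100.0, scale=20.0, size=(10, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_stub)  # learns mean and std from the training data
test_scaled = scaler.transform(test_stub)        # reuses the training mean and std

# After scaling, each training column has mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))

# The test set is scaled with the training statistics, not its own.
manual_test = (test_stub - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual_test))
```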

5. Algorithm Training

After scaling the dataset, train the algorithm and make predictions on the test set with the following commands:

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)

regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

6. Finding the Performance

The last step in solving the regression problem is to evaluate the performance of the algorithm. For regression, the commonly used metrics are the mean absolute error, the mean squared error, and the root mean squared error.

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
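To make the three metrics concrete, the following sketch computes them by hand with NumPy on four made-up values and compares the result with Scikit-Learn. The arrays y_true_demo and y_pred_demo are invented for the example and are not taken from the petrol dataset.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values.
y_true_demo = np.array([500.0, 600.0, 700.0, 650.0])
y_pred_demo = np.array([520.0, 580.0, 690.0, 700.0])

errors = y_true_demo - y_pred_demo  # [-20, 20, 10, -50]

mae = np.mean(np.abs(errors))  # mean of absolute errors: (20 + 20 + 10 + 50) / 4 = 25
mse = np.mean(errors ** 2)     # mean of squared errors: (400 + 400 + 100 + 2500) / 4 = 850
rmse = np.sqrt(mse)            # square root of the MSE, in the same unit as the target

print(mae, mse, rmse)

# The manual values agree with the Scikit-Learn functions.
print(np.isclose(mae, metrics.mean_absolute_error(y_true_demo, y_pred_demo)))
print(np.isclose(mse, metrics.mean_squared_error(y_true_demo, y_pred_demo)))
```

Because the errors are squared before averaging, MSE and RMSE penalize the single large error (50) more heavily than MAE does.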

The output of the above code is:

Mean Absolute Error: 51.765

Mean Squared Error: 4216.16675

Root Mean Squared Error: 64.932016371

With 20 estimators, the root mean squared error is 64.93. This is more than 10 percent of the average petrol consumption, which is 576.77, and suggests that 20 trees may not be enough. With more estimators, for example 200, the results improve:

Mean Absolute Error: 47.9825

Mean Squared Error: 3469.7007375

Root Mean Squared Error: 58.9041657058
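If you want to run this kind of comparison yourself, a loop over n_estimators is enough. The sketch below uses synthetic data from make_regression as a stand-in, so its numbers will differ from the petrol results above. The variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 4 attributes (not the petrol dataset).
X_syn, y_syn = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
Xa, Xb, ya, yb = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

rmse_by_trees = {}
for n_trees in (20, 200):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(Xa, ya)
    mse_value = mean_squared_error(yb, model.predict(Xb))
    rmse_by_trees[n_trees] = float(np.sqrt(mse_value))

for n_trees, rmse_value in rmse_by_trees.items():
    print(n_trees, "trees -> RMSE:", round(rmse_value, 2))
```

More trees usually help up to a point, but the improvement is not guaranteed for every dataset and the training time grows with the number of trees.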

Using Random Forest for Classification

Problem Definition

In the classification problem, the task is to predict whether a currency note is genuine or fake. The prediction is based on four characteristics of the note's image: variance, skewness (asymmetry), kurtosis, and entropy.

Solution

To solve this classification problem, we use the random forest classifier. The steps are given below.

1. Import Libraries

import pandas as pd

import numpy as np

2. Importing Dataset

To download the dataset, use the link below.

https://drive.google.com/file/d/13nw-uRXPY8XIZQxKRNZ3yYlho-CYm_Qt/view

Then import the dataset with the following command.

dataset = pd.read_csv("D:/Datasets/bill_authentication.csv")

3. Data Preparation for Training

The following commands divide the data into attributes and labels.

X = dataset.iloc[:, 0:4].values

y = dataset.iloc[:, 4].values

The following commands divide the data into training and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

4. Scaling

Feature scaling is done in the same way as in the previous problem.

# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

5. Algorithm Training

After scaling the dataset, train the random forest algorithm and make predictions with the following commands:

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=20, random_state=0)

classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

In the regression problem, we used the RandomForestRegressor class of the sklearn.ensemble library. For the classification problem, we use the RandomForestClassifier class instead.

The n_estimators parameter again defines the number of trees in the forest. Here we start with 20 trees.

6. Evaluating the Algorithm

For classification problems, the metrics used to evaluate the algorithm are accuracy, the confusion matrix, and the precision, recall, and F1 values.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))

print(classification_report(y_test,y_pred))

print(accuracy_score(y_test, y_pred))

The output of the above code is:

[[155   2]
 [  1 117]]

             precision    recall  f1-score   support

          0       0.99      0.99      0.99       157
          1       0.98      0.99      0.99       118

avg / total       0.99      0.99      0.99       275

0.989090909091

With 20 trees, the classifier reaches an accuracy of 98.91 percent, which is already very good. For this reason, there is little need to increase the number of estimators. Going beyond 20 trees is unlikely to improve the accuracy much on this dataset.
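The numbers in the report can be recomputed from the confusion matrix alone. The sketch below does this with NumPy, assuming the Scikit-Learn convention that rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class).
cm = np.array([[155, 2],
               [1, 117]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / all that truly are that class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)                  # number of true records per class

print("accuracy:", accuracy)
print("precision:", precision.round(2))
print("recall:", recall.round(2))
print("f1:", f1.round(2))
print("support:", support)
```

Only 3 of the 275 test notes are misclassified: 2 notes of class 0 are predicted as class 1, and 1 note of class 1 is predicted as class 0.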

Frequently Asked Questions (FAQs)

What is a random forest in Python?

In Python, the random forest is available in Scikit-Learn as an estimator that fits a number of decision trees on random sub-samples of the dataset. It combines their predictions to improve accuracy and control overfitting.

How do I run a random forest in Python?

  • Choose random samples from the dataset.
  • Build a decision tree for each sample.
  • Collect the predicted result from each decision tree.
  • Hold a vote over the predicted results.
  • Select the result with the most votes as the final prediction.
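The voting step can be sketched in a few lines of plain Python. The helper name majority_vote and the list of votes are made up for this example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one currency note.
votes = [1, 0, 1, 1, 0]
winner = majority_vote(votes)
print(winner)  # class 1 wins with 3 of 5 votes
```

As far as we know, Scikit-Learn's RandomForestClassifier averages the class probabilities of the trees instead of counting hard votes, so treat this as an illustration of the idea rather than of the library's internals.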

What is the difference between SVM and random forest?

The random forest directly gives you the probability that a record belongs to each class. The SVM, on the other hand, gives you the distance of a record from the decision boundary. If you need probabilities from an SVM, you still have to convert this distance into a probability.
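The difference shows up in the methods the two models offer in Scikit-Learn. The following sketch uses synthetic data from make_classification, which is an assumption for the example and not the banknote dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic two-class data with 4 features (stand-in data).
X_cls, y_cls = make_classification(n_samples=200, n_features=4, random_state=0)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_cls, y_cls)
svm = SVC(kernel="linear").fit(X_cls, y_cls)

# Random forest: one probability per class, each row sums to 1.
rf_proba = rf.predict_proba(X_cls[:3])

# SVM: one signed score per record, related to its distance from the boundary.
svm_scores = svm.decision_function(X_cls[:3])

print(rf_proba)
print(svm_scores)
```

SVC can also return probabilities if you create it with probability=True, but this adds an extra calibration step during training.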

Why is Random Forest the best?

Random forest is one of the most widely used algorithms because it often gives good results even without hyperparameter tuning. It is also simple to use and flexible, since it handles both classification and regression. These features make it a strong choice compared with many other algorithms.

Final Words

That’s all about the tips for using the random forest algorithm in Python. We hope that after this detailed discussion, you will have no trouble using random forest in Python. Let us know in the comments if this guide was helpful for you.