Finding the Best ML Algorithm for House Price Prediction Using K-Fold Cross-Validation and GridSearchCV
In machine learning, we cannot simply fit a model on the training data and assume that it will work accurately on real data. We must make sure that the model has learned the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique. Cross-validation is a technique in which we train our model on a subset of the data-set and then evaluate it on the complementary subset of the data-set.
| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.00 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.00 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.00 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.00 |
area_type
Built-up Area 2418
Carpet Area 87
Plot Area 2025
Super built-up Area 8790
Name: area_type, dtype: int64
| | location | size | total_sqft | bath | price |
| 0 | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.00 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.00 |
Data Cleaning: Handling NA/Null values
location 1
size 16
total_sqft 0
bath 73
price 0
dtype: int64
location 0
size 0
total_sqft 0
bath 0
price 0
dtype: int64
array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
'1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
'7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
'9 BHK', nan, '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
'10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
'12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)
Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning in itself.
<ipython-input-81-4c4c73fbe7f4>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https:
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
array([ 2, 4, 3, 6, 1, 8, 7, 5, 11, 9, 27, 10, 19, 16, 43, 14, 12,
13, 18], dtype=int64)
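The warning above typically appears when the frame being modified was itself produced by slicing or filtering another frame. A minimal sketch of one way to avoid it, assuming a small stand-in for df3 (the rows here are illustrative, not the real data-set): take an explicit copy before adding the column.

```python
import pandas as pd

# Stand-in rows for illustration only; the real df3 comes from the cleaned data-set.
df3 = pd.DataFrame({"size": ["2 BHK", "4 Bedroom", "1 RK", "27 BHK"]})

# Work on an explicit copy so the assignment never targets a view of another frame.
df3 = df3.copy()

# The leading token of `size` is the room count, e.g. "4 Bedroom" -> 4.
df3["bhk"] = df3["size"].apply(lambda x: int(x.split(" ")[0]))

print(df3["bhk"].tolist())
```

The same split-and-convert expression is the one echoed in the warning text; only the explicit copy is added here.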
| | location | size | total_sqft | bath | price | bhk |
| 1718 | 2Electronic City Phase II | 27 BHK | 8000 | 27.0 | 230.0 | 27 |
| 4684 | Munnekollal | 43 Bedroom | 2400 | 40.0 | 660.0 | 43 |
array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
dtype=object)
| | location | size | total_sqft | bath | price | bhk |
| 30 | Yelahanka | 4 BHK | 2100 - 2850 | 4.0 | 186.000 | 4 |
| 122 | Hebbal | 4 BHK | 3067 - 8156 | 4.0 | 477.000 | 4 |
| 137 | 8th Phase JP Nagar | 2 BHK | 1042 - 1105 | 2.0 | 54.005 | 2 |
| 165 | Sarjapur | 2 BHK | 1145 - 1340 | 2.0 | 43.490 | 2 |
| 188 | KR Puram | 2 BHK | 1015 - 1540 | 2.0 | 56.800 | 2 |
| 410 | Kengeri | 1 BHK | 34.46Sq. Meter | 1.0 | 18.500 | 1 |
| 549 | Hennur Road | 2 BHK | 1195 - 1440 | 2.0 | 63.770 | 2 |
| 648 | Arekere | 9 Bedroom | 4125Perch | 9.0 | 265.000 | 9 |
| 661 | Yelahanka | 2 BHK | 1120 - 1145 | 2.0 | 48.130 | 2 |
| 672 | Bettahalsoor | 4 Bedroom | 3090 - 5002 | 4.0 | 445.000 | 4 |
The rows above show that total_sqft can be a range (e.g. 2100 - 2850). In such cases we can take the average of the minimum and maximum values in the range. There are other cases, such as 34.46Sq. Meter, which could be converted to square feet using unit conversion. I am going to drop such corner cases to keep things simple.
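The rule just described can be sketched as a small helper. The function name convert_sqft_to_num is hypothetical; it averages a "min - max" range, passes plain numbers through, and returns None for anything else (such as values with units) so those rows can be dropped afterwards.

```python
def convert_sqft_to_num(value):
    """Return a float for plain numbers and 'min - max' ranges, otherwise None."""
    text = str(value)
    tokens = text.split("-")
    if len(tokens) == 2:
        # A range such as "2100 - 2850": use the midpoint.
        try:
            return (float(tokens[0]) + float(tokens[1])) / 2
        except ValueError:
            return None
    try:
        return float(text)
    except ValueError:
        # Values with units, e.g. "34.46Sq. Meter", are treated as unparseable.
        return None


print(convert_sqft_to_num("2100 - 2850"))
print(convert_sqft_to_num("1056"))
print(convert_sqft_to_num("34.46Sq. Meter"))
```

With pandas, a helper like this could be applied to the total_sqft column and the rows where it returns None removed.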
Add a new feature called price per square foot
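A minimal sketch of this feature, assuming price is quoted in lakh rupees (1 lakh = 100,000). That unit is an assumption inferred from the values in this article, where a price of 39.07 for 1056 sqft corresponds to a price_per_sqft of about 3699.81.

```python
import pandas as pd

# Three rows copied from the tables in this article.
df = pd.DataFrame({
    "total_sqft": [1056.0, 2600.0, 1440.0],
    "price": [39.07, 120.00, 62.00],  # assumed to be in lakh rupees
})

# Convert lakh to rupees, then divide by the area.
df["price_per_sqft"] = df["price"] * 100000 / df["total_sqft"]

print(df)
```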
Examine location, which is a categorical variable. We need to apply a dimensionality reduction technique here to reduce the number of locations.
location
Whitefield 535
Sarjapur Road 392
Electronic City 304
Kanakpura Road 266
Thanisandra 236
...
LIC Colony 1
Kuvempu Layout 1
Kumbhena Agrahara 1
Kudlu Village, 1
1 Annasandrapalya 1
Name: location, Length: 1293, dtype: int64
Dimensionality Reduction
Any location with 10 or fewer data points is tagged as an "other" location. This way the number of categories can be reduced by a huge amount. Later on, when we do one hot encoding, this gives us fewer dummy columns.
location
BTM 1st Stage 10
Basapura 10
Sector 1 HSR Layout 10
Naganathapura 10
Kalkere 10
..
LIC Colony 1
Kuvempu Layout 1
Kumbhena Agrahara 1
Kudlu Village, 1
1 Annasandrapalya 1
Name: location, Length: 1052, dtype: int64
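The grouping step can be sketched as follows. The toy counts are made up for illustration, and the cut-off of 10 rows or fewer is taken from the listing above, where locations with exactly 10 rows appear among those to be merged.

```python
import pandas as pd

# Toy location column; the counts are invented for this example.
locations = pd.Series(["Whitefield"] * 12 + ["Kalkere"] * 10 + ["LIC Colony"])

threshold = 10  # locations with this many rows or fewer become "other"
counts = locations.value_counts()
rare = counts[counts <= threshold].index

# Keep frequent locations as they are and replace the rare ones.
grouped = locations.where(~locations.isin(rare), "other")

print(grouped.value_counts())
```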
| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
| 0 | Electronic City Phase II | 2 BHK | 1056.0 | 2.0 | 39.07 | 2 | 3699.810606 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600.0 | 5.0 | 120.00 | 4 | 4615.384615 |
| 2 | Uttarahalli | 3 BHK | 1440.0 | 2.0 | 62.00 | 3 | 4305.555556 |
| 3 | Lingadheeranahalli | 3 BHK | 1521.0 | 3.0 | 95.00 | 3 | 6245.890861 |
| 4 | Kothanur | 2 BHK | 1200.0 | 2.0 | 51.00 | 2 | 4250.000000 |
Outlier Removal Using Business Logic
| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
| 9 | other | 6 Bedroom | 1020.0 | 6.0 | 370.0 | 6 | 36274.509804 |
| 45 | HSR Layout | 8 Bedroom | 600.0 | 9.0 | 200.0 | 8 | 33333.333333 |
| 58 | Murugeshpalya | 6 Bedroom | 1407.0 | 4.0 | 150.0 | 6 | 10660.980810 |
| 68 | Devarachikkanahalli | 8 Bedroom | 1350.0 | 7.0 | 85.0 | 8 | 6296.296296 |
| 70 | other | 3 Bedroom | 500.0 | 3.0 | 100.0 | 3 | 20000.000000 |
Check the data points above. We have a 6 bhk apartment with 1020 sqft, and another with 8 bhk and a total of 600 sqft. These are clear data errors that can be removed safely.
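One way to express this business rule is a minimum area per bedroom. The threshold of 300 sqft per bedroom used below is a hypothetical value chosen for illustration; the text above does not state the cut-off that was applied.

```python
import pandas as pd

# Two suspicious rows from the table above plus two ordinary ones.
df = pd.DataFrame({
    "total_sqft": [1020.0, 600.0, 1056.0, 1440.0],
    "bhk": [6, 8, 2, 3],
})

min_sqft_per_bhk = 300  # hypothetical threshold, not taken from the source

# Keep only homes with at least the minimum area per bedroom.
mask = df["total_sqft"] / df["bhk"] >= min_sqft_per_bhk
cleaned = df[mask]

print(cleaned)
```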
Outlier Removal Using Standard Deviation and Mean
count 12456.000000
mean 6308.502826
std 4168.127339
min 267.829813
25% 4210.526316
50% 5294.117647
75% 6916.666667
max 176470.588235
Name: price_per_sqft, dtype: float64
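The summary above shows a maximum far above the 75th percentile, which is what this step targets. A minimal sketch of one possible reading of the heading: within each location, keep only rows whose price_per_sqft lies within one standard deviation of that location's mean. Both the per-location grouping and the one-standard-deviation band are assumptions, and the function name is hypothetical.

```python
import pandas as pd


def remove_pps_outliers(df):
    """Keep rows within one standard deviation of their location's mean price_per_sqft."""
    kept = []
    for _, group in df.groupby("location"):
        mean = group["price_per_sqft"].mean()
        std = group["price_per_sqft"].std()
        pps = group["price_per_sqft"]
        # Locations with a single row have an undefined std and are dropped by this comparison.
        kept.append(group[(pps > mean - std) & (pps <= mean + std)])
    return pd.concat(kept, ignore_index=True)


# Toy data: each location has one obvious outlier.
df = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 5,
    "price_per_sqft": [4000, 4100, 4200, 4300, 20000, 6000, 6100, 6200, 6300, 500],
})
cleaned = remove_pps_outliers(df)

print(cleaned)
```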
Outlier Removal Using Bathrooms Feature
array([ 4., 3., 2., 5., 8., 1., 6., 7., 9., 12., 16., 13.])
| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
| 5277 | Neeladri Nagar | 10 BHK | 4000.0 | 12.0 | 160.0 | 10 | 4000.000000 |
| 8486 | other | 10 BHK | 12000.0 | 12.0 | 525.0 | 10 | 4375.000000 |
| 8575 | other | 16 BHK | 10000.0 | 16.0 | 550.0 | 16 | 5500.000000 |
| 9308 | other | 11 BHK | 6000.0 | 12.0 | 150.0 | 11 | 2500.000000 |
| 9639 | other | 13 BHK | 5425.0 | 13.0 | 275.0 | 13 | 5069.124424 |
It is unusual to have 2 more bathrooms than the number of bedrooms in a home.
| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
| 1626 | Chikkabanavar | 4 Bedroom | 2460.0 | 7.0 | 80.0 | 4 | 3252.032520 |
| 5238 | Nagasandra | 4 Bedroom | 7000.0 | 8.0 | 450.0 | 4 | 6428.571429 |
| 6711 | Thanisandra | 3 BHK | 1806.0 | 6.0 | 116.0 | 3 | 6423.034330 |
| 8411 | other | 6 BHK | 11338.0 | 9.0 | 1000.0 | 6 | 8819.897689 |
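The rows listed above all have more than bhk + 2 bathrooms. A minimal sketch of the corresponding filter, assuming the rule is to keep only homes with fewer than bhk + 2 bathrooms (the exact comparison is an assumption):

```python
import pandas as pd

# Three flagged rows from the table above plus one ordinary home.
df = pd.DataFrame({
    "bhk": [4, 3, 2, 6],
    "bath": [7.0, 6.0, 2.0, 9.0],
})

# Assumed rule: keep homes with fewer than bhk + 2 bathrooms.
cleaned = df[df["bath"] < df["bhk"] + 2]

print(cleaned)
```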
| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
| 0 | 1st Block Jayanagar | 4 BHK | 2850.0 | 4.0 | 428.0 | 4 | 15017.543860 |
| 1 | 1st Block Jayanagar | 3 BHK | 1630.0 | 3.0 | 194.0 | 3 | 11901.840491 |
| 2 | 1st Block Jayanagar | 3 BHK | 1875.0 | 2.0 | 235.0 | 3 | 12533.333333 |
| 3 | 1st Block Jayanagar | 3 BHK | 1200.0 | 2.0 | 130.0 | 3 | 10833.333333 |
| 4 | 1st Block Jayanagar | 2 BHK | 1235.0 | 2.0 | 148.0 | 2 | 11983.805668 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10232 | other | 2 BHK | 1200.0 | 2.0 | 70.0 | 2 | 5833.333333 |
| 10233 | other | 1 BHK | 1800.0 | 1.0 | 200.0 | 1 | 11111.111111 |
| 10236 | other | 2 BHK | 1353.0 | 2.0 | 110.0 | 2 | 8130.081301 |
| 10237 | other | 1 Bedroom | 812.0 | 1.0 | 26.0 | 1 | 3201.970443 |
| 10240 | other | 4 BHK | 3600.0 | 5.0 | 400.0 | 4 | 11111.111111 |
7251 rows × 7 columns
| | location | total_sqft | bath | price | bhk |
| 0 | 1st Block Jayanagar | 2850.0 | 4.0 | 428.0 | 4 |
| 1 | 1st Block Jayanagar | 1630.0 | 3.0 | 194.0 | 3 |
| 2 | 1st Block Jayanagar | 1875.0 | 2.0 | 235.0 | 3 |
| 3 | 1st Block Jayanagar | 1200.0 | 2.0 | 130.0 | 3 |
| 4 | 1st Block Jayanagar | 1235.0 | 2.0 | 148.0 | 2 |
Using One Hot Encoding For Location
| | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | 5th Phase JP Nagar | 6th Phase JP Nagar | 7th Phase JP Nagar | 8th Phase JP Nagar | 9th Phase JP Nagar | ... | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur | other |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 242 columns
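The encoding step can be sketched with pd.get_dummies. The toy frame below is illustrative. Dropping the "other" column after encoding is an assumption, although it is consistent with the later tables in this article, where "other" no longer appears among the location columns.

```python
import pandas as pd

# Toy frame with three locations; values are illustrative.
df = pd.DataFrame({
    "location": ["1st Block Jayanagar", "Whitefield", "other"],
    "total_sqft": [2850.0, 1200.0, 900.0],
})

# One 0/1 column per location.
dummies = pd.get_dummies(df["location"], dtype=int)

# Drop one dummy column ("other" here) since it is implied by the rest,
# then drop the original text column.
encoded = pd.concat([df, dummies.drop(columns="other")], axis=1).drop(columns="location")

print(encoded)
```

A row whose location columns are all zero then represents the "other" category.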
| | location | total_sqft | bath | price | bhk | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | ... | Vijayanagar | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur |
| 0 | 1st Block Jayanagar | 2850.0 | 4.0 | 428.0 | 4 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1st Block Jayanagar | 1630.0 | 3.0 | 194.0 | 3 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1st Block Jayanagar | 1875.0 | 2.0 | 235.0 | 3 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1st Block Jayanagar | 1200.0 | 2.0 | 130.0 | 3 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1st Block Jayanagar | 1235.0 | 2.0 | 148.0 | 2 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 246 columns
| | total_sqft | bath | price | bhk | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | 5th Phase JP Nagar | ... | Vijayanagar | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur |
| 0 | 2850.0 | 4.0 | 428.0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1630.0 | 3.0 | 194.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1875.0 | 2.0 | 235.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1200.0 | 2.0 | 130.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1235.0 | 2.0 | 148.0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 245 columns
| | total_sqft | bath | bhk | 1st Block Jayanagar | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | 5th Phase JP Nagar | 6th Phase JP Nagar | ... | Vijayanagar | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur |
| 0 | 2850.0 | 4.0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1630.0 | 3.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1875.0 | 2.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1200.0 | 2.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1235.0 | 2.0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 244 columns
0 428.0
1 194.0
2 235.0
3 130.0
4 148.0
...
10232 70.0
10233 200.0
10236 110.0
10237 26.0
10240 400.0
Name: price, Length: 7251, dtype: float64
Use K Fold cross validation to measure accuracy of our LinearRegression model
In this method, we split the data-set into k subsets (known as folds). We train on k-1 of the folds and hold out the remaining fold to evaluate the trained model. We repeat this k times, with a different fold reserved for testing each time.
Keep in mind that a lower value of k gives a more biased estimate, which is undesirable. A higher value of k is less biased, but can suffer from large variability. A small value of k takes us toward the simple validation-set approach, whereas setting k equal to the number of samples gives leave-one-out cross-validation (LOOCV).
array([0.82430186, 0.77166234, 0.85089567, 0.80837764, 0.83653286])
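The five scores above are one score per fold. A minimal sketch of how such scores can be produced with scikit-learn, using synthetic data in place of the housing features; the KFold splitter with shuffling and the fixed seed are assumptions, not necessarily what produced the numbers above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: price grows linearly with area, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 0.05 * X[:, 0] + rng.normal(0, 5, size=200)

# Five folds: each fold is held out once while the other four are used for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For regressors the default score is R^2, computed on the held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print(scores)
print(scores.mean())
```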
Find best model using GridSearchCV
| | model | best_score | best_params |
| 0 | linear_regression | 0.818354 | {'normalize': False} |
| 1 | lasso | 0.687430 | {'alpha': 2, 'selection': 'random'} |
| 2 | decision_tree | 0.720273 | {'criterion': 'friedman_mse', 'splitter': 'best'} |
Based on the results above, LinearRegression gives the best score, so we will use it.
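A comparison like the one in the table can be sketched by running GridSearchCV once per candidate model and collecting the best score and parameters. The data below is synthetic and the parameter grids are assumptions. In particular, the grid for LinearRegression uses fit_intercept, because the normalize option shown in the table is not accepted by recent scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with two numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 2))
y = 0.05 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 5, size=200)

# Each entry pairs an estimator with the (assumed) grid to search.
candidates = {
    "linear_regression": (LinearRegression(), {"fit_intercept": [True, False]}),
    "lasso": (Lasso(), {"alpha": [1, 2], "selection": ["random", "cyclic"]}),
    "decision_tree": (
        DecisionTreeRegressor(random_state=0),
        {"criterion": ["friedman_mse"], "splitter": ["best", "random"]},
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rows = []
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv)
    search.fit(X, y)
    rows.append({
        "model": name,
        "best_score": search.best_score_,
        "best_params": search.best_params_,
    })

results = pd.DataFrame(rows)
print(results)
```

On this synthetic linear data the linear models will naturally do well; the point is the structure of the search, not the scores.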
Test the model on a few properties
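A minimal sketch of how a single property could be scored against a model trained on the one hot encoded layout shown earlier. The toy training data and the helper names build_feature_row and predict_price are hypothetical; the idea is to fill the numeric columns, set the matching location column to 1, and leave every location column at 0 for a location the model has not seen as its own column.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training frame: three numeric columns followed by location dummy columns.
X = pd.DataFrame({
    "total_sqft": [1000.0, 1500.0, 2000.0, 1200.0, 1800.0, 900.0],
    "bath": [2.0, 2.0, 3.0, 2.0, 3.0, 1.0],
    "bhk": [2.0, 3.0, 3.0, 2.0, 3.0, 1.0],
    "1st Block Jayanagar": [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Whitefield": [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})
y = pd.Series([120.0, 180.0, 110.0, 70.0, 80.0, 35.0])  # made-up prices
model = LinearRegression().fit(X, y)
location_columns = list(X.columns[3:])


def build_feature_row(location, sqft, bath, bhk):
    """Build a one-row frame with the same columns as the training data."""
    row = pd.DataFrame(0.0, index=[0], columns=X.columns)
    row.loc[0, "total_sqft"] = sqft
    row.loc[0, "bath"] = bath
    row.loc[0, "bhk"] = bhk
    if location in location_columns:
        row.loc[0, location] = 1.0
    return row


def predict_price(location, sqft, bath, bhk):
    return float(model.predict(build_feature_row(location, sqft, bath, bhk))[0])


print(predict_price("Whitefield", 1000, 2, 2))
print(predict_price("1st Block Jayanagar", 1000, 2, 2))
```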