Ebuka Ezenwafor
A result-oriented, sound data analyst with over 4 years of industrial experience in collecting, organizing, and interpreting various forms of data. Artistic in finding timely solutions to problems and determining modifications for optimal use of organizational data using machine learning algorithms
CODE.
Firstly, i imported all the libraries that i can use to effortly write the code and model that would solve the problem. At this level, i am trying to clean up th dataset and fill up missing values as they are too many to delete. deletion will affect the outcome of the data and prediction. from the dataset, 'CUT' has over 60% of its data missing, using mode, mean or medium is not a good representation of the missing values thus, it should not be used for the analysis As noticed polish and symmetry are ordinal values with some missing value, i used mode to fill it up because i cant use mean nor medium.
note: from the orginal file, vendor was saved as integer which in think shouldnt be as it just vendors selling the diamond. more so, if i had left it, onehotendcoder will see and read it as an integer.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
import xgboost as xgb
training = pd.read_csv("training.csv")
offer= pd.read_csv('offers.csv')
pd.set_option("display.max_columns", None)
training.fillna(value = {'Depth':training.Depth.mean(), 'Table':training.Table.mean()}, inplace=True)
offer.fillna(value = {'Depth':offer.Depth.mean(), 'Table':offer.Table.mean()}, inplace=True)
offer2 = offer
training['Polish'] = np.where(training['Polish']== ' ', training['Polish'].mode(), training['Polish'])
training['Shape']=np.where(training['Shape'] == 'ROUND', 'Round', training['Shape'])
offer['Shape']=np.where(offer['Shape'] == 'ROUND', 'Round', offer['Shape'])
training['Regions']=np.where(training['Regions'] == 'Other/Unknown',
training['Regions'].mode(), training['Regions'])
offer['Regions']=np.where(offer['Regions'] == 'Other/Unknown',
offer['Regions'].mode(), offer['Regions'])
training['Symmetry']=np.where(training['Symmetry'] == ' ', training['Symmetry'].mode(),
np.where(training['Symmetry'] == 'Execllent', 'Excellent', training['Symmetry']))
offer['Symmetry']=np.where(offer['Symmetry'] == ' ', offer['Symmetry'].mode(), offer['Symmetry'])
mode = training['Cert'].mode().values[0]
training['Cert'].fillna(value=mode, inplace=True)
mode2 = offer['Cert'].mode().values[0]
offer['Cert'].fillna(value=mode2, inplace=True)
mode3 = offer['Known_Conflict_Diamond'].mode().values[0]
offer['Known_Conflict_Diamond'].fillna(value = mode3, inplace=True)
training['Vendor'] = training.Vendor.astype(str)
Here, i started subsetting the dataset for prediction. i removed id, cut, retail, price and log retail as they wont be needed for the supposed machine learning. to reiterate, cut was removed because over 60% of the data is missing and filling it up wont have a good representation of the entire sample.
from the observaiton, logprice was used this is because one of assumptions of regession is that the dataset has to be normally distributed. but price and retail price are skewed to the right and so, log of the price is used for the prediction, the predicted price which is still in log is transformed with actual value.
cleaned = training.drop(columns = ["id","Cut", "Retail", "Price", "LogRetail"])
offer_cleaned = offer.drop(columns = ["id","Cut", "Offers"])
target = cleaned["LogPrice"]
target
predictors= cleaned.drop(columns=["LogPrice"])
from the dataset it is observed that some categorical data has lots of various levels categories. labelling this columns with value 1,2,3,…..n wont work as it contains lots of categories and the model might see them as integers and wont perform optimally to get better prediction and model accuracy. so what is the way out? "one hot encoding". one hot encoding is a method of coverting categorical values into a new categorical column and assign a binary value of 1 or 0 to these columns. From this section, you will notice that one hot encoding was used for both training and offer dataset to keep the data in same shape for prediction.
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(predictors)
categorical_columns = categorical_columns_selector(predictors)
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
preprocessor = ColumnTransformer([
('one-hot-encoder', categorical_preprocessor, categorical_columns),
('standard_scaler', numerical_preprocessor, numerical_columns)])
####offer one-hot-encoder
numerical_columns_selector_offer = selector(dtype_exclude=object)
categorical_columns_selector_offer = selector(dtype_include=object)
numerical_columns_offer = numerical_columns_selector_offer(offer_cleaned)
categorical_columns_offer = categorical_columns_selector_offer(offer_cleaned)
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
preprocessor = ColumnTransformer([
('one-hot-encoder', categorical_preprocessor, categorical_columns_offer),
('standard_scaler', numerical_preprocessor, numerical_columns_offer)])
rfr = make_pipeline(preprocessor, RandomForestRegressor())
x_train, x_test, y_train, y_test = train_test_split(
predictors, target, random_state=42)
rfr.fit(x_train, y_train)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('one-hot-encoder',
OneHotEncoder(handle_unknown='ignore'),
['Cert', 'Clarity', 'Color',
'Measurements', 'Polish',
'Regions', 'Shape',
'Symmetry']),
('standard_scaler',
StandardScaler(),
['Carats', 'Depth',
'Known_Conflict_Diamond',
'Table', 'Vendor'])])),
('randomforestregressor', RandomForestRegressor())])
y_pred = rfr.predict(offer_cleaned)
y_pred
array([ 6.38442209, 6.361169 , 6.39055154, ..., 9.9901546 ,
11.71520498, 8.65394591])
rfr.score(x_test, y_test)
0.9807327067596628
offer['Offers'] = y_pred
offer['Offers'] = np.exp(offer['Offers'])
since i want to select from the predicted offer, i have to create a justification for my selection. to do that, i have to predict the Retailprice as well and find the growth rate difference between the price and the retail price. after which i will sort the column by descending order to enable me select the rows and ensure my offer is less than 5,000,000.
RetailPrice prediction
from the observation, you will notice that logretail was also used to get the predicted value of retailprice
cleaned2 = training.drop(columns = ["id","Cut", "Retail", "Price", "LogPrice"])
offer['Vendor'] = offer.Vendor.astype(str)
retail_cleaned = offer.drop(columns = ["id","Cut", "Offers"])
retail_cleaned
cleaned
target_retail = cleaned2["LogRetail"]
predcitors_retail = cleaned2.drop(columns=["LogRetail"])
numerical_columns_selector_retail = selector(dtype_exclude=object)
categorical_columns_selector_retail = selector(dtype_include=object)
numerical_columns_retail = numerical_columns_selector_retail(predictors_retail)
categorical_columns_retail = categorical_columns_selector_retail(predictors_retail)
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
preprocessor_retail = ColumnTransformer([
('one-hot-encoder', categorical_preprocessor, categorical_columns_retail),
('standard_scaler', numerical_preprocessor, numerical_columns_retail)])
rfrr = make_pipeline(preprocessor_retail, RandomForestRegressor())
x_train_retail, x_test_retail, y_train_retail, y_test_retail = train_test_split(
predictors_retail, target_retail, random_state=42)
rfrr.fit(x_train_retail, y_train_retail)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('one-hot-encoder',
OneHotEncoder(handle_unknown='ignore'),
['Cert', 'Clarity', 'Color',
'Measurements', 'Polish',
'Regions', 'Shape',
'Symmetry', 'Vendor']),
('standard_scaler',
StandardScaler(),
['Carats', 'Depth',
'Known_Conflict_Diamond',
'Table'])])),
('randomforestregressor', RandomForestRegressor())])
rfrr.score(x_test_retail, y_test_retail)
0.9728831513146663
retail_pred = rfrr.predict(retail_cleaned)
offer['RetailPrice'] = retail_pred
offer['RetailPrice'] = np.exp(offer['RetailPrice'])
offer1 = pd.DataFrame(offer)
here, i substracted the predicted price from the predicted retailprice then multipled it by the predicted price to get the actual rate/percentage which is a true estimate of the price.. if i had used just the diff between predicted price and predicted retailprice, the value wont represent the actual difference rate.
offer1['profit']= (offer1['RetailPrice'] - offer1['Offers'])/offer1['Offers']
offer_growth = offer1.sort_values(by=['profit'], ascending=False)
since i have a price constraint of 5,000,000,i have to employ the use of loop to ensure that my total predicted price is not more than 5,000,000
selected = 0
count = 0
budget = 4990756.903125745
for x in offer_growth['Offers']:
selected += x
count +=1
if selected == budget:
break
my_offers = offer_growth.iloc[:620, :]
lastly, i used a merge function to fix my prediction in the orignal offer document.
offer3 = offer2.drop(columns = ["Offers"])
offer3 = pd.DataFrame(offer3)
my_offer2 = my_offers[['id','Offers']]
my_offer2 = pd.DataFrame(my_offer2)
final_offer = pd.merge(offer3, my_offer2, how='left')
id Carats Cert Clarity Color Cut Depth \
0 8051 0.42 AGSL SI2 L 62.700000
1 8052 0.40 GemEx VS2 M 63.800000
2 8053 0.41 AGSL SI2 L 62.200000
3 8054 0.40 AGSL SI1 K 64.200000
4 8055 0.40 AGSL VS2 L 64.300000
... ... ... ... ... ... ... ...
2678 10729 1.65 AGSL VS1 E Excellent 58.400000
2679 10730 2.35 AGSL SI2 E Excellent 62.100000
2680 10731 2.35 AGSL SI2 J Excellent 61.781656
2681 10732 5.02 AGSL I1 J Excellent 59.500000
2682 10733 1.00 AGSL SI2 E Very good 59.700000
Known_Conflict_Diamond Measurements Polish Regions Shape \
0 False 4.81x4.78x3.01 Excellent Australia Round
1 False 4.64x4.61x2.95 Excellent Australia Round
2 True 4.76x4.74x2.95 Excellent DR Congo Round
3 False 4.63x4.59x2.96 Very good Angola Emerald
4 False 4.67x4.62x2.99 Excellent Russia Round
... ... ... ... ... ...
2678 False 7.79x7.83x4.56 Excellent Canada Pear
2679 True 8.48x8.53x5.28 Excellent Angola Round
2680 False 8.64x8.67x5.2 Excellent Russia Pear
2681 False 11.16x11.18x6.64 Excellent Russia Oval
2682 False 6.42x6.45x3.84 Excellent Russia Round
Symmetry Table Vendor RetailPrice profit Offers
0 Excellent 57.000000 1 950.126925 0.603476 NaN
1 Excellent 59.000000 1 943.947796 0.630525 NaN
2 Very good 58.000000 1 118.859820 -0.800633 NaN
3 Very good 58.000000 1 898.241287 0.379981 NaN
4 Very good 58.104324 1 943.378085 0.585112 NaN
... ... ... ... ... ... ...
2678 Excellent 60.000000 4 37564.831487 0.306380 NaN
2679 Excellent 58.000000 2 4537.574374 -0.871937 NaN
2680 Very good 58.104324 2 35508.073094 0.628014 NaN
2681 Excellent 59.000000 2 186142.860581 0.520539 NaN
2682 Excellent 61.000000 2 9197.414939 0.604371 NaN
[2683 rows x 18 columns]
final_offer.to_csv(r'C:\Users\ebukaezenwafor\Desktop\export_dataframe.csv', index = False, header=True)
offer1.to_csv(r'C:\Users\ebukaezenwafor\Desktop\ofer1.csv', index = False, header=True)
QUESTION A &B
a) from the chart, the vendors are selling relatively at a competitive price.
b) there is a positive correlation between carats and retail price sp to say that the higher the carat, the greater the price of the daimond. it is a natural phenomenon as daimond are sold based on their carat.
plt.bar(training['Vendor'], training['Retail'].mean())
plt.show()
plt.plot(training['Carats'], training['Retail'], 'o')
plt.show()