top of page

CODE.

Firstly, i imported all the libraries that i can use to effortly write the code and model that would solve the problem. At this level, i am trying to clean up th dataset and fill up missing values as they are too many to delete. deletion will affect the outcome of the data and prediction. from the dataset, 'CUT' has over 60% of its data missing, using mode, mean or medium is not a good representation of the missing values thus, it should not be used for the analysis As noticed polish and symmetry are ordinal values with some missing value, i used mode to fill it up because i cant use mean nor medium.

note: from the orginal file, vendor was saved as integer which in think shouldnt be as it just vendors selling the diamond. more so, if i had left it, onehotendcoder will see and read it as an integer.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
import xgboost as xgb
training = pd.read_csv("training.csv")
offer= pd.read_csv('offers.csv')
pd.set_option("display.max_columns", None)


training.fillna(value = {'Depth':training.Depth.mean(), 'Table':training.Table.mean()}, inplace=True)
offer.fillna(value = {'Depth':offer.Depth.mean(), 'Table':offer.Table.mean()}, inplace=True)
offer2 = offer
training['Polish'] = np.where(training['Polish']== ' ', training['Polish'].mode(), training['Polish'])

training['Shape']=np.where(training['Shape'] == 'ROUND', 'Round', training['Shape'])
offer['Shape']=np.where(offer['Shape'] == 'ROUND', 'Round', offer['Shape'])

training['Regions']=np.where(training['Regions'] == 'Other/Unknown', 
                            training['Regions'].mode(), training['Regions'])
offer['Regions']=np.where(offer['Regions'] == 'Other/Unknown', 
                         offer['Regions'].mode(), offer['Regions'])

training['Symmetry']=np.where(training['Symmetry'] == ' ', training['Symmetry'].mode(), 
                             np.where(training['Symmetry'] == 'Execllent', 'Excellent', training['Symmetry']))

offer['Symmetry']=np.where(offer['Symmetry'] == ' ', offer['Symmetry'].mode(), offer['Symmetry'])


mode = training['Cert'].mode().values[0]
training['Cert'].fillna(value=mode, inplace=True)
mode2 = offer['Cert'].mode().values[0]
offer['Cert'].fillna(value=mode2, inplace=True)

mode3 = offer['Known_Conflict_Diamond'].mode().values[0]
offer['Known_Conflict_Diamond'].fillna(value = mode3, inplace=True)

training['Vendor'] = training.Vendor.astype(str)

Here, i started subsetting the dataset for prediction. i removed id, cut, retail, price and log retail as they wont be needed for the supposed machine learning. to reiterate, cut was removed because over 60% of the data is missing and filling it up wont have a good representation of the entire sample.

from the observaiton, logprice was used this is because one of assumptions of regession is that the dataset has to be normally distributed. but price and retail price are skewed to the right and so, log of the price is used for the prediction, the predicted price which is still in log is transformed with actual value.

cleaned = training.drop(columns = ["id","Cut", "Retail", "Price", "LogRetail"])

offer_cleaned = offer.drop(columns = ["id","Cut", "Offers"])

target = cleaned["LogPrice"]
target
predictors= cleaned.drop(columns=["LogPrice"])

from the dataset it is observed that some categorical data has lots of various levels categories. labelling this columns with value 1,2,3,…..n  wont work as it contains lots of categories and the model might see them as integers and wont perform optimally to get better prediction and model accuracy. so what is the way out? "one hot encoding". one hot encoding is a method of coverting categorical values into a new categorical column and assign a binary value of 1 or 0 to these columns. From this section, you will notice that one hot encoding was used for both training and offer dataset to keep the data in same shape for prediction.

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(predictors)
categorical_columns = categorical_columns_selector(predictors)

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor = ColumnTransformer([
   ('one-hot-encoder', categorical_preprocessor, categorical_columns),
   ('standard_scaler', numerical_preprocessor, numerical_columns)])


####offer one-hot-encoder
numerical_columns_selector_offer = selector(dtype_exclude=object)
categorical_columns_selector_offer = selector(dtype_include=object)

numerical_columns_offer = numerical_columns_selector_offer(offer_cleaned)
categorical_columns_offer = categorical_columns_selector_offer(offer_cleaned)

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor = ColumnTransformer([
  ('one-hot-encoder', categorical_preprocessor, categorical_columns_offer),
  ('standard_scaler', numerical_preprocessor, numerical_columns_offer)])

rfr = make_pipeline(preprocessor, RandomForestRegressor())

x_train, x_test, y_train, y_test = train_test_split(
   predictors, target, random_state=42)

rfr.fit(x_train, y_train)

Pipeline(steps=[('columntransformer',
                ColumnTransformer(transformers=[('one-hot-encoder',
                                                 OneHotEncoder(handle_unknown='ignore'),
                                                 ['Cert', 'Clarity', 'Color',
                                                  'Measurements', 'Polish',
                                                  'Regions', 'Shape',
                                                  'Symmetry']),
                                                ('standard_scaler',
                                                 StandardScaler(),
                                                 ['Carats', 'Depth',
                                                  'Known_Conflict_Diamond',
                                                  'Table', 'Vendor'])])),
               ('randomforestregressor', RandomForestRegressor())])

y_pred = rfr.predict(offer_cleaned)
y_pred

array([ 6.38442209,  6.361169  ,  6.39055154, ...,  9.9901546 ,
      11.71520498,  8.65394591])

rfr.score(x_test, y_test)

0.9807327067596628

offer['Offers'] = y_pred
offer['Offers'] = np.exp(offer['Offers'])

since i want to select from the predicted offer, i have to create a justification for my selection. to do that, i have to predict the Retailprice as well and find the growth rate difference between the price and the retail price. after which i will sort the column by descending order to enable me select the rows and ensure my offer is less than 5,000,000.

RetailPrice prediction

from the observation, you will notice that logretail was also used to get the predicted value of retailprice

cleaned2 = training.drop(columns = ["id","Cut", "Retail", "Price", "LogPrice"])
offer['Vendor'] = offer.Vendor.astype(str)
retail_cleaned = offer.drop(columns = ["id","Cut", "Offers"])
retail_cleaned
cleaned
target_retail = cleaned2["LogRetail"]
predcitors_retail = cleaned2.drop(columns=["LogRetail"])

numerical_columns_selector_retail = selector(dtype_exclude=object)
categorical_columns_selector_retail = selector(dtype_include=object)

numerical_columns_retail = numerical_columns_selector_retail(predictors_retail)
categorical_columns_retail = categorical_columns_selector_retail(predictors_retail)

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor_retail = ColumnTransformer([
   ('one-hot-encoder', categorical_preprocessor, categorical_columns_retail),
   ('standard_scaler', numerical_preprocessor, numerical_columns_retail)])

rfrr = make_pipeline(preprocessor_retail, RandomForestRegressor())

x_train_retail, x_test_retail, y_train_retail, y_test_retail = train_test_split(
   predictors_retail, target_retail, random_state=42)

rfrr.fit(x_train_retail, y_train_retail)

Pipeline(steps=[('columntransformer',
                ColumnTransformer(transformers=[('one-hot-encoder',
                                                 OneHotEncoder(handle_unknown='ignore'),
                                                 ['Cert', 'Clarity', 'Color',
                                                  'Measurements', 'Polish',
                                                  'Regions', 'Shape',
                                                  'Symmetry', 'Vendor']),
                                                ('standard_scaler',
                                                 StandardScaler(),
                                                 ['Carats', 'Depth',
                                                  'Known_Conflict_Diamond',
                                                  'Table'])])),
               ('randomforestregressor', RandomForestRegressor())])

rfrr.score(x_test_retail, y_test_retail)

0.9728831513146663

retail_pred = rfrr.predict(retail_cleaned)

offer['RetailPrice'] = retail_pred
offer['RetailPrice'] = np.exp(offer['RetailPrice'])

offer1 = pd.DataFrame(offer)

here, i substracted the predicted price from the predicted retailprice then multipled it by the predicted price to get the actual rate/percentage which is a true estimate of the price.. if i had used just the diff between predicted price and predicted retailprice, the value wont represent the actual difference rate.

offer1['profit']= (offer1['RetailPrice'] - offer1['Offers'])/offer1['Offers']

offer_growth = offer1.sort_values(by=['profit'], ascending=False)

since i have a price constraint of 5,000,000,i have to employ the use of loop to ensure that my total predicted price is not more than 5,000,000

selected = 0
count = 0
budget = 4990756.903125745
for x in offer_growth['Offers']:
   selected += x
   count +=1
   if selected == budget:
       break

my_offers = offer_growth.iloc[:620, :]

lastly, i used a merge function to fix my prediction in the orignal offer document.

offer3 = offer2.drop(columns = ["Offers"])
offer3 = pd.DataFrame(offer3)
my_offer2 = my_offers[['id','Offers']]
my_offer2 = pd.DataFrame(my_offer2)

final_offer = pd.merge(offer3, my_offer2, how='left')

         id  Carats   Cert Clarity Color        Cut      Depth  \
0      8051    0.42   AGSL     SI2     L             62.700000   
1      8052    0.40  GemEx     VS2     M             63.800000   
2      8053    0.41   AGSL     SI2     L             62.200000   
3      8054    0.40   AGSL     SI1     K             64.200000   
4      8055    0.40   AGSL     VS2     L             64.300000   
...     ...     ...    ...     ...   ...        ...        ...   
2678  10729    1.65   AGSL     VS1     E  Excellent  58.400000   
2679  10730    2.35   AGSL     SI2     E  Excellent  62.100000   
2680  10731    2.35   AGSL     SI2     J  Excellent  61.781656   
2681  10732    5.02   AGSL      I1     J  Excellent  59.500000   
2682  10733    1.00   AGSL     SI2     E  Very good  59.700000   

     Known_Conflict_Diamond      Measurements     Polish    Regions    Shape  \
0                      False    4.81x4.78x3.01  Excellent  Australia    Round   
1                      False    4.64x4.61x2.95  Excellent  Australia    Round   
2                       True    4.76x4.74x2.95  Excellent   DR Congo    Round   
3                      False    4.63x4.59x2.96  Very good     Angola  Emerald   
4                      False    4.67x4.62x2.99  Excellent     Russia    Round   
...                      ...               ...        ...        ...      ...   
2678                   False    7.79x7.83x4.56  Excellent     Canada     Pear   
2679                    True    8.48x8.53x5.28  Excellent     Angola    Round   
2680                   False     8.64x8.67x5.2  Excellent     Russia     Pear   
2681                   False  11.16x11.18x6.64  Excellent     Russia     Oval   
2682                   False    6.42x6.45x3.84  Excellent     Russia    Round   

      Symmetry      Table Vendor    RetailPrice    profit  Offers  
0     Excellent  57.000000      1     950.126925  0.603476     NaN  
1     Excellent  59.000000      1     943.947796  0.630525     NaN  
2     Very good  58.000000      1     118.859820 -0.800633     NaN  
3     Very good  58.000000      1     898.241287  0.379981     NaN  
4     Very good  58.104324      1     943.378085  0.585112     NaN  
...         ...        ...    ...            ...       ...     ...  
2678  Excellent  60.000000      4   37564.831487  0.306380     NaN  
2679  Excellent  58.000000      2    4537.574374 -0.871937     NaN  
2680  Very good  58.104324      2   35508.073094  0.628014     NaN  
2681  Excellent  59.000000      2  186142.860581  0.520539     NaN  
2682  Excellent  61.000000      2    9197.414939  0.604371     NaN  

[2683 rows x 18 columns]

final_offer.to_csv(r'C:\Users\ebukaezenwafor\Desktop\export_dataframe.csv', index = False, header=True)

offer1.to_csv(r'C:\Users\ebukaezenwafor\Desktop\ofer1.csv', index = False, header=True)

                   QUESTION A &B

a) from the chart, the vendors are selling relatively at a competitive price.

b) there is a positive correlation between carats and retail price sp to say that the higher the carat, the greater the price of the daimond. it is a natural phenomenon as daimond are sold based on their carat.

plt.bar(training['Vendor'], training['Retail'].mean())
plt.show()

plt.plot(training['Carats'], training['Retail'], 'o')
plt.show()

3099970173

  • Facebook
  • Twitter
  • LinkedIn

©2022 by Ebuka Ezenwafor

bottom of page