Feature importance measures from random forest regressor using near-infrared spectra for predicting carbonization characteristics of kraft lignin-derived hydrochar

This study investigated the feature importance of near-infrared spectra from random forest regression models constructed to predict the carbonization characteristics of hydrochars produced by hydrothermal carbonization of kraft lignin. The model achieved high coefficients of determination of 0.989, 0.988, and 0.985 with root mean square errors of 0.254, 0.003, and 0.008 when predicting the carbon content, atomic O/C ratio, and H/C ratio, respectively. The random forest models outperformed the multilayer perceptron models for all predictions. In the feature importance analysis, the spectral regions at 1600–1800 nm, the first overtone of C–H stretching vibrations, and 2000–2300 nm, the combination bands, were highly important for predicting the carbon content and O/C predictions, whereas the region at 1250–1711 nm contributed to predicting H/C. The random forest models trained with the high-importance regions achieved better prediction performances than those trained with the entire spectral range, demonstrating the usefulness of the feature importance yielded by the random forest and the feasibility of selective application of the spectral data.


Introduction
To combat global warming and the resulting climate change, the Intergovernmental Panel on Climate Change (IPCC) has adopted a special report on global temperature increase of 1.5 ℃ [1]. In response, countries around the globe are establishing carbon-neutral strategies to reduce CO 2 emissions, including the use and development of sustainable and renewable sources of energy. Wood is a representative natural resource that is both renewable and sustainable. Lignin, one of the main elements in wood is an aromatic component of lignocellulosic biomass and is carbon rich. The pulp industry mass produces kraft lignin as a waste [2]. Lignin is a bioresource known for maximizing resource efficiency and converting waste into valuable resources, helping in achieving carbon neutrality.
Hydrothermal carbonization (HTC) is a thermochemical process that converts organic compounds, such as lignin, into structured carbon materials called hydrochars with relatively mild temperatures of 130-250 ℃ [3,4]. HTC can convert biomass into carbonaceous solids without an energy-intensive drying process [5]. The HTC yield of lignin is higher than that of cellulose and hemicellulose because of the thermally stable phenolic structures of lignin [6,7].
Lignin-derived hydrochar can be used in various applications, including fuels, batteries, polymer composites, adsorbents, and electrochemical devices [8][9][10][11]. The carbonization characteristics of hydrochar are typically determined using elemental analysis. However, it has recently been reported that rapid and non-destructive evaluation is possible using near-infrared (NIR) spectroscopy with multivariate analysis [12]. Numerous studies have demonstrated that NIR spectroscopy is suitable for capturing the characteristics of biological materials [13][14][15][16][17][18][19][20]. Nevertheless, the evaluation of the spectral bands contributing to predictions by the models has been limited to estimation using indirect methods.
Random forest (RF) is a prediction model and machine learning technique that incorporates the importance of input variables for prediction [21]. RF is a framework for aggregating predictions from multiple tree models, yielding quantified information about the features and their contribution to the prediction. This nature of tree models makes them an ideal choice for studies using biological data [22][23][24][25].
This study established RF regression models trained with NIR spectral data to predict the carbonization characteristics of hydrochars produced by the hydrothermal carbonization of kraft lignin. Furthermore, the NIR spectral regions were classified according to their feature importance as computed by the RF models. The performance evaluation of the RF models trained with each selected spectral region verified the practical usefulness of the feature importance computed by the RF.

Samples and hydrothermal carbonization
Kraft lignin, a raw material for hydrochars, was provided by a domestic pulp manufacturer (Moorim P&P, Ulsan, Korea). Lignin was produced as a byproduct of an industrial-scale pulping process to manufacture bleached hardwood pulp using an alkaline white liquor consisting of sodium hydroxide, chlorine dioxide, and sodium sulfide.
For HTC, 5.6 g of lignin powder was mixed with 140 mL of distilled water to prepare suspensions with a solid-to-liquid ratio of 1:25. A glass liner containing the suspension was placed in a reaction vessel and heated in a heating mantle at target temperatures of 150, 175, 200, 225, and 250 ℃ for retention times of 1, 2, 3, and 5 h, respectively. Retention time is the duration at which the target temperature is maintained. After heating, the reaction vessel was allowed to cool naturally to room temperature (13.4-23.1 ℃). The solid residues generated from the HTC of lignin were vacuum filtered, oven-dried, and pulverized to produce powdered hydrochars.

Elemental analysis
Elemental analysis was performed to investigate the elemental compositions of the hydrochars. The weight percentages of carbon (C), hydrogen (H), nitrogen (N), and sulfur (S) were measured using an elemental analyzer (Flash EA 1112, Thermo Electron Corp., Waltham, MA, USA). The oxygen (O) weight percentage was estimated to be 100 − (C + H + N + S). The C wt%, O/C, and H/C ratios were calculated as indicators for evaluating the carbonization characteristics of the hydrochars.

Spectral dataset
NIR spectral data were used as input variables to build regression models for predicting the carbonization characteristics of the hydrochars. NIR reflectance spectra were collected from the hydrochars using an NIR spectrometer (NIR Quest, Ocean Insight, Orlando, FL, USA) with a reflection probe and tungsten halogen lamp. The optical resolution of the spectrometer was 6.6 nm, and the spectra were collected in the wavelength range of 870-2500 nm.
All spectra were second derivatized by Savitzky-Golay smoothing to 13 points with the fifth-order function [26]. A wavelength range of 1250-2300 nm, excluding noisy regions in the original range, was selected and used to build the prediction models. The selected spectral region corresponded to 165 input variables. Hwang et al. [12] demonstrated the effectiveness of second-derivative transformation and spectral selection in predicting the carbonization characteristics of hydrochars.
Three NIR spectra were acquired for each HTC condition, including the control sample. The dataset thus consisted of 63 spectra. The dataset was independently divided into training and test sets at a ratio of 8-2 and used for the construction and evaluation of the prediction model.

Random forest regressor
The RF model for regression [21], an ensemble learning technique, was used to predict the carbonization characteristics of the hydrochars. Ensemble learning combines predictions from multiple models to produce more accurate results than a single model. This study used decision trees (DT) for regression [27] as the base learner to construct an RF model.
DT is a simple model that predicts the result by performing a split based on the predictor (input variable) that reduces the mean squared error (MSE) the most. As shown in Fig. 1a, when predicting the output from the input variables, DT starts from a single node (root node) and creates branches (decision nodes) based on the input variable (feature) with the smallest MSE (Eq. 1).
where y and y are the measured and predicted values of the samples in a node, respectively, and n is the number of samples in a node. This process is repeated at each decision node to create a tree-shaped decision structure. A node that cannot branch further owing to a nondecreasing MSE is called a leaf node, and the average value of the samples in that node becomes a candidate for the prediction. When unseen data are input into the completed DT model, the data move according to predetermined branching criteria. The value of the leaf node where the data finally arrived was used as the predicted value of the DT. A schematic of the RF model is illustrated in Fig. 1b. RF is a combination of multiple DTs and improves prediction performance by averaging the predictions of multiple DTs and controls overfitting, a chronic weakness of DTs. To increase the randomness of the DTs, the RF model builds them using random sampling without using all input variables. This process is performed to create independent trees. In addition, RF generates bootstrap sets from the training set using random sampling with replacement. In this process, approximately 2/3 (in-bag samples) of the training data are used for DT training. The remaining 1/3 (out-of-bag samples) of the data are used to validate the tree model, similar to the threefold cross-validation. The probability that a sample is not selected from m data points during random sampling with replacement is (m − 1)/m. If this is repeated m times, the out-of-bag (OOB) probability is approximately 36.8%, according to Eq. (2). The RF model was built with multiple DTs generated by learning the in-bag samples, and the final prediction of the model for new data was output as the average value of the predictions of the DTs.
In this study, all DTs constituting the RF were based on the classification and regression tree (CART) [27] algorithm and were independently generated without pruning. For n_feature, which are input variables for DT generation, the square root ('sqrt'), binary logarithm ('log2'), and one-third ('1/3') of all spectral points were tested. An n_tree from 10 to 300 was tested, and the optimal n_feature and n_tree were determined based on the minimum OOB error via grid searches. The coefficient of determination (R 2 ) and root mean square error (RMSE) were used as evaluation metrics for the model performance.

Multilayer perceptron regressor
Multilayer perceptron (MLP) models were employed to compare the prediction performance with RF models. The MLP regressor learns data using backpropagation, with no activation function in the output layer. The squared error was used as the loss function, and the model was optimized using the stochastic gradient descent-based optimizers SGD and Adam. The various network architectures listed in Table 1 were tested with logarithmic learning rates ranging from 0.0001 to 0.1 to optimize the network configuration of the MLP. The maximum number of iterations was set to 3000. Grid searches with threefold cross-validation optimized the hyperparameters of network architectures, optimizers, and learning rates. The prediction performances of the RF and MLP regression models established in this study were compared with those of the reported partial least squares (PLS) regression models, which are a chemometric approach combined with NIR spectroscopy and multivariate analysis [12].

Feature importance measures
The spectral feature importance was measured based on the mean decrease in impurity (MDI) [28] to identify the NIR regions that contribute to predicting the carbonization characteristics of hydrochars. Tree-based models provide information about the contribution of the input variables used in prediction, called feature importance. The feature importance of an ensemble model, such as RF, is also an ensemble of the feature importance of its base models. In this study, variance reduction based on MSE, a criterion for branching nodes, was used to measure the feature importance of the DTs.
When an upper node (parent node) branches into two lower nodes (child nodes) by feature i, the importance of the feature is defined as the difference between the MSE of the parent node and the sum of the MSEs of the child nodes, which is called information gain (Eq. 3).
where G Pj is the information gain of node j. Pj is the parent node j, and Lj and Rj are the left and right child nodes branched from Pj, respectively. w is the weight of the node and is the number of samples in the node relative to the total number of samples. M is the mean squared error of the node. The importance of feature i in a decision tree can be calculated as follows: where I(f i ) DT is the importance of feature i in the tree model and G j is the information gain of node j branched by feature i. The feature importance of the RF model is computed as an ensemble of the importance of all the DTs in the RF. Before the ensemble, the importance of each feature was normalized using Eq. (5).
Then, the final importance of each feature in the RF model was averaged over all DTs, as follows: where I(f i ) RF is the importance of feature i in the RF model and N T is the total number of DTs in the RF.

Elemental analysis
The elemental compositions of the hydrochars are listed in Table 2. The carbon content (C wt%) of the samples increased as the temperature and retention time for HTC, i.e., the reaction severity, increased (Fig. 2a). The C wt% of the control sample was 62.83 wt%, which increased to 69.37% after HTC at 250 ℃ for 5 h. The C wt% values of (5) hydrochars produced at 150 and 175 ℃ were similar at approximately 65%. However, above 200 ℃, the content increased linearly with an increase in HTC temperature and time. The increase in the C wt% was attributed to dehydration and decarboxylation during carbonization [4,29]. Simultaneously, the atomic oxygen/carbon (O/C) ratio and atomic hydrogen/carbon (H/C) ratio gradually decreased (Fig. 2b), which was mainly due to chemical dehydration [30], suggesting that carbon-intensive materials were produced from HTC. The van Krevelen diagram (Fig. 2b) shows that lignin, whose elemental composition is similar to that of lignite brown coal, was converted to brown coal via HTC [7]. The reduced S wt% of the hydrochars was due to their dissolution during HTC [4]. Figure 3 shows the change in OOB errors with the addition of each regression tree during RF training to predict the carbonization characteristics of the hydrochars. At the beginning of the tree addition, the OOB errors were reduced, and the minima were measured for less than 50 trees. The following errors recovered slightly and remained at a similar level from 100 trees or more. Similar trends were observed for the C wt%, O/C, and H/C predictions. The optimal number of features for tree generation was 'log2' for C wt% prediction and 'sqrt' for O/C and H/C predictions. The prediction results of the RF models for the C wt%, O/C, and H/C of the hydrochars are presented in Fig. 4 and Table 3. The scatter plots (Fig. 4) show that the training and test sets had similar trends. The RF models accurately predicted the carbonization performance of the hydrochars (Table 3). For the C wt% prediction, the model achieved the R 2 value of 0.989 on the test set from 43 regression trees built using the 'log2' (7 features) of all input variables. For O/C and H/C predictions, R 2 values of 0.988 and 0.985 were obtained from 43 and 16 trees (n_tree), respectively, with the number of NIR spectral points (n_feature) of 'sqrt' (13 features).

Prediction of carbonization characteristics RF models
The O/C ratio indicates polarity and is related to the adsorbability of the material [31]. Both O/C and H/C ratios are indicators of the stability of the carbonaceous materials [32]. The lower the ratios the materials have, the more stable, inert, and resistant to decomposition, as they resemble the characteristics of graphite [33]. Therefore, the RF models established in this study have the potential to be applied for predicting the carbonization characteristics and evaluating the quality of hydrochars. In addition, the use of NIR spectroscopy supports rapid and non-destructive predictions. Table 4 shows the comparison of the performance of the RF model with that of other regression models in predicting the carbonization characteristics of the hydrochars. Because the DT models for regression yielded good predictions enough, their collaboration RF did not improve the performance except for O/C prediction. However, RF was found to be robust against bias and overfitting, whereas a single DT was vulnerable [34]. The RF models were comparable to the PLS regression models [12], a chemometric approach, and produced better predictions than MLP models. Although MLP is applicable to various non-linear problems, it performed poorer than the other tested models. The relatively low performance of MLPs was attributed to the small scale of the NIR spectral dataset. These results support the methodological validity of RF regression combined with NIR spectroscopy to predict the carbonization characteristics of hydrochars.

Spectral feature importance Mean decrease in impurity-based importance
The feature importance of the RF models for predicting the carbonization characteristics of hydrochars was computed based on the total reduction in the MSE. The second-derivative NIR spectra in the range of 1250-2300 nm of the control and hydrochar samples and their corresponding importance are shown in Fig. 5.
In the C wt% and O/C predictions, the spectral regions with high importance were 1600-1800 and 2000-2300 nm, respectively. The first region was dominated by the first overtone of C-H stretching vibrations [35]. However, the band at 1684 nm may have originated from a combination of carboxyl groups [36]. The decrease in the intensity of the peaks at 1684 nm with increasing temperature can be attributed to decarboxylation by HTC (Fig. 5). This is because removing the carboxyl groups increases the C wt% and decreases the O/C ratio [37]. Consequently, the band at 1684 nm yielded the highest importance for the C wt% and O/C predictions (Fig. 5a, b). Hwang et al. [12] estimated that the band at 1449 nm, assigned to the phenolic group, strongly influenced the prediction of the carbonization characteristics. However, in this study, the RF models suggested that the band had a low contribution in predicting C wt% and O/C with its low feature importance values. The second region comprises the combination bands, and the bands at 2132 nm (coupling of C-H and C=C stretching vibrations) and 2267 nm (coupling of O-H and C-O stretching vibrations) were assigned to lignin [36,37]. The latter has been observed in the second-derivative spectrum of milled hardwood lignin [37].
The feature importance computed in the H/C prediction was different from the others. The importance of the band at 1449 nm, which has low importance for C wt% and O/C predictions, was highest for H/C prediction. This band was assigned to the first overtone of the O-H stretching vibration of the phenolic groups of lignin [38].  In contrast, the bands at 1684 and 2267 nm, which contributed to the C wt% and O/C predictions, respectively, had low importance for H/C prediction. For the spectral region of 1800-1999 nm, assigned to components regrading cellulose and water, low importance values were observed for all predictions.

Feature selection
The spectral regions were classified into high and low importance based on the feature importance values, and the RF models trained with the data from each region were reconstructed. The RF models trained with the high-importance spectral regions outperformed those trained with the low-importance regions and the full spectral range for all predictions. In addition, the models  trained with a combination of high-importance regions yielded the best performance in this study (Fig. 6). Although the number of features available for model building was lower owing to partial selection of the spectral region, the improvement in model performance proves that the feature importance computed from RF is reliable for predicting the carbonization characteristics of the hydrochars. Furthermore, these results suggest that the selection of custom spectral regions according to the output variables may be a better strategy for prediction.

Conclusions
RF regression models combined with NIR spectroscopy predicted the carbonization characteristics of ligninderived hydrochars with R 2 values above 0.98. The MDIbased feature importance computed from RF models indicated that the spectral regions influencing C wt% and O/C predictions differed from those for H/C. The high-importance regions helped improve model performance. These results suggest that the selective application of spectral regions, depending on the prediction target, might be a better strategy for prediction. In rapid and non-destructive prediction using NIR spectroscopy, the ensemble method of tree models is a promising technique and is comparable to chemometrics.