Pahang in Peninsular Malaysia has been chosen as the research site due to its annual monsoon floods, which harm the local population.
Study design and Data Collection Tool
Flood influencing factors.
Based on the data available for Pahang and a comprehensive literature search, nine factors were identified as potential indicators of flood susceptibility for modelling: elevation, slope, curvature, flow direction, flow accumulation, distance from river, rainfall, land use, and geology. Together, these parameters capture the topographical and hydrometeorological conditions that contribute to the region's overall vulnerability to flooding [15,16].
Digital elevation models (DEMs) are essential to the precision of hydrodynamic models [17]. The digital elevation data will be obtained from the 30 m resolution Shuttle Radar Topography Mission (SRTM) DEM Version 3, accessed through the Earth data platform [18]. Flooding is strongly influenced by the slope of the land, as steeper slopes accelerate the flow of water over the surface and hinder its ability to seep into the ground [19]. Curvature describes whether a surface is convex, concave, or flat, and so reflects changes in slope inclination. Concave surfaces tend to collect flood water, increasing the likelihood of flooding [20]. Flow direction determines the path that surface water will take and therefore the potential for flooding [21].
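As an illustration of how slope can be derived from a gridded DEM, the following is a minimal sketch that uses plain central differences on interior cells. It is not the ArcGIS slope tool, which applies its own neighbourhood weighting; the function name `slope_degrees`, the toy grid, and the 30 m cell size are assumptions made for this example only.

```python
import math

def slope_degrees(dem, cell_size=30.0):
    """Slope in degrees for the interior cells of a DEM given as a list of rows.

    Uses simple central differences; border cells are left at 0.0.
    cell_size is the grid spacing in metres (30 m assumed here).
    """
    rows, cols = len(dem), len(dem[0])
    slope = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Elevation change per metre in the x and y directions.
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope

# Hypothetical 3 x 3 elevation grid (metres) rising 30 m per cell eastward.
toy_dem = [
    [0.0, 30.0, 60.0],
    [0.0, 30.0, 60.0],
    [0.0, 30.0, 60.0],
]
toy_slope = slope_degrees(toy_dem)
print(round(toy_slope[1][1], 1))  # prints 45.0
```

A rise of 30 m over a 30 m cell gives a gradient of 1, hence the 45 degree slope at the centre cell.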
Vulnerability to flooding increases with flow accumulation [19]. In this research, the distance from rivers will be estimated using the Euclidean distance tool in ArcGIS, applied to a raster layer depicting the river network. The ArcGIS platform will be used to generate maps of elevation, slope, curvature, flow direction, flow accumulation, and distance from river, which will subsequently be categorized into sub-classes using the natural breaks classification method. Flooding occurs when water levels in rivers, lakes, and reservoirs rise suddenly due to intense rainfall, often in combination with inadequate drainage [22]. Data from 10 precipitation stations in Pahang (Cameron Highlands, Bentong, Bera, Kuantan, Lipis, Maran, Pekan, Raub, Rompin, and Temerloh) will be used to create a rainfall distribution map for the research area. The map will be constructed with the Inverse Distance Weighted (IDW) approach, using a 10-year dataset from 2012 to 2021 [23]. IDW estimates rainfall at unsampled locations as a distance-weighted average of the station values, so that the spatial rainfall pattern across the study area is represented.
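The IDW interpolation described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the ArcGIS tool: the function name `idw_estimate`, the power value of 2, and the station coordinates and rainfall values are all hypothetical and do not come from the study data.

```python
import math

def idw_estimate(stations, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    stations is a list of (station_x, station_y, value) tuples.
    Weights are 1 / distance**power, so closer stations count more.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            # The query point coincides with a station: return its value.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical stations: (x in km, y in km, mean annual rainfall in mm).
toy_stations = [
    (0.0, 0.0, 2000.0),
    (10.0, 0.0, 3000.0),
]
midpoint_rain = idw_estimate(toy_stations, 5.0, 0.0)
at_station_rain = idw_estimate(toy_stations, 0.0, 0.0)
```

At the midpoint both stations receive equal weight, so the estimate is their plain average; at a station location the observed value is returned unchanged.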
The properties of drainage systems are significantly affected by changes in land use and land cover (LULC) in the upstream watersheds. These changes directly affect surface overflow and the land surface's capacity to absorb water, and thereby influence the frequency and intensity of flooding events [24]. The geological and LULC data will be obtained from the worldwide geological maps database provided by the USGS and from global data sources [25]. The LULC map will be created using the ArcGIS platform, delineating seven categories: water bodies, trees, flooded vegetation, crops, built area, bare terrain, and rangeland. The geology map of Pahang will be segmented into nine primary soil features, based on the USGS-USA soil taxonomy [25,27].
Random forest (RF) Embedding classifier.
The random forest (RF) technique offers strong predictive accuracy and handles large datasets well for both regression and classification. By training many decision trees on bootstrap samples and aggregating their predictions (bagging), the RF method is reported to achieve higher accuracy than alternative techniques and to reduce overfitting. Moreover, the RF-embedding model trains quickly while delivering high classification accuracy [28].
Feature selection is crucial for improving model efficiency, eliminating unnecessary data, preventing overfitting, and enhancing generalization on test data. In this study, an embedded feature selection method based on a shuffling algorithm was used to create random probes from the original variables. These probes were combined with the original variables to train a random forest regression model, which yielded an importance Z-score for each variable. Variables with a Z-score higher than the maximum Z-score among the random probes were considered important [29]. In this context, the DML algorithm uses the embedded Mean Decrease Accuracy (MDA) measure. Tree splits are typically chosen using either the "gini" criterion (Gini impurity) or the "entropy" criterion (information gain), where p(x_i) is the probability of each possible value i of the random variable x and c is the number of classes in the dataset (Eqs 1 and 2) [30–32].
Entropy: $H(x) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$ (1)
Gini: $\mathrm{Gini}(E) = 1 - \sum_{i=1}^{c} p_i^{2}$ (2)
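To make the two split criteria concrete, the short sketch below computes entropy and Gini impurity for a list of class labels. The function names and the "flood"/"dry" labels are assumptions chosen for illustration; entropy is taken with a base-2 logarithm.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency of each distinct class in labels."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    """Gini impurity: 1 - sum(p**2) over the classes present."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

# Hypothetical node containing an even mix of two classes.
mixed_node = ["flood", "flood", "dry", "dry"]
mixed_entropy = entropy(mixed_node)  # 1.0 bit
mixed_gini = gini(mixed_node)        # 0.5
```

Both measures are zero for a node containing a single class and reach their maximum when the classes are evenly mixed, which is why a split that lowers them is preferred.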
Because it aggregates many decision trees, the RF learning model is generally more accurate than a single decision tree. It combines random feature selection with bagging for both classification and regression. In this study, a widely used machine learning feature selection (FS) method was used to rank the flood-influencing factors. This algorithm is endorsed by many researchers for its strong predictive performance, high accuracy, and ease of interpretation. It generates rankings iteratively, shuffling features and retaining those that are consistently important [29,33].
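The random-probe idea described above can be sketched with scikit-learn as follows. This is a simplified, single-pass illustration and not the exact procedure used in the study: it assumes impurity-based importances from each tree (rather than MDA), computes the Z-score as the mean importance divided by its standard deviation across trees, and runs on synthetic data. The function name `select_with_random_probes` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_with_random_probes(X, y, n_trees=100, seed=0):
    """Return (z_scores, probe_threshold, keep_mask) for the columns of X."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # One probe per variable: a shuffled copy that keeps the distribution
    # but has no real relationship with the target.
    probes = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(np.hstack([X, probes]), y)
    # Assumption: Z-score = mean / standard deviation of per-tree importances.
    per_tree = np.array(
        [tree.feature_importances_ for tree in forest.estimators_]
    )
    z = per_tree.mean(axis=0) / (per_tree.std(axis=0) + 1e-12)
    # A variable is kept only if it beats the best-scoring random probe.
    threshold = z[n_features:].max()
    return z[:n_features], threshold, z[:n_features] > threshold

# Synthetic data: only column 0 drives the target.
data_rng = np.random.default_rng(42)
X_demo = data_rng.normal(size=(300, 4))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * data_rng.normal(size=300)
z_demo, threshold_demo, keep_demo = select_with_random_probes(X_demo, y_demo)
```

In an iterative scheme the shuffling and fitting would be repeated several times, and only variables that beat the probes consistently would be retained.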