Data Analysis | Mlproject

Using remote sensing, topographic, and socioeconomic data, this project analyzed Urban Heat Island Intensity (UHII) in the Appalachian region. The dataset comprised over 230,000 samples across 30 features, spatially distributed on a 5 km grid. Key variables included Land Surface Temperature (LST), the Normalized Difference Vegetation Index (NDVI), the Normalized Difference Built-up Index (NDBI), elevation, slope, road density, and demographic indicators such as population density and median income. We used Google Earth Engine (GEE) to process these data.

The dataset consists of approximately 230,000 examples with 30 features, spatial coordinates (lat/lon), and the target variable, Urban Heat Island Intensity (UHII). Missing values in the feature matrix were filled using a moving average with a window size of 10, which also smooths the imputed values. After preprocessing, the data were split into training (70%), validation (15%), and test (15%) sets using a stratified random split so that the distribution of UHII stays the same across all subsets. Original row indices were preserved for post-analysis and spatial mapping.
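The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the moving average is assumed to be a trailing window over the preceding rows (the text does not say whether it was trailing or centered), and the stratification is assumed to work by binning the continuous UHII target into quantile bins, which is one common way to stratify on a continuous variable.

```python
import numpy as np

def fill_with_moving_average(column, window=10):
    """Replace each NaN with the mean of the non-missing values among the
    previous `window` rows (trailing window: an assumption)."""
    column = np.asarray(column, dtype=float)
    filled = column.copy()
    for i in np.flatnonzero(np.isnan(column)):
        neighbours = column[max(0, i - window):i]
        valid = neighbours[~np.isnan(neighbours)]
        if valid.size:  # leave the gap if no valid neighbour exists
            filled[i] = valid.mean()
    return filled

def stratified_split(target, fractions=(0.70, 0.15, 0.15), n_bins=10, seed=0):
    """Split row indices into train/validation/test, stratified on quantile
    bins of a continuous target (binning scheme is an assumption)."""
    target = np.asarray(target, dtype=float)
    rng = np.random.default_rng(seed)
    # Interior quantile edges, so each bin holds roughly the same row count.
    edges = np.quantile(target, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(target, edges)
    train, val, test = [], [], []
    for b in np.unique(bins):
        idx = rng.permutation(np.flatnonzero(bins == b))
        n_train = int(round(fractions[0] * idx.size))
        n_val = int(round(fractions[1] * idx.size))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    # The returned values are original row indices, so predictions can be
    # joined back to lat/lon for mapping.
    return np.sort(train), np.sort(val), np.sort(test)
```

Returning index arrays rather than copies of the data keeps the link to the original rows, which is what makes later spatial mapping of residuals straightforward.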

We implemented and evaluated three models:

Gradient Boosting is an algorithm that builds decision trees sequentially, where each new tree corrects the errors of the previous ones. Its strength lies in capturing complex, nonlinear relationships and handling noisy data efficiently, making it a popular choice for structured tabular data.
Random Forest is an ensemble method that constructs many decision trees, each trained on a bootstrap sample of the data (bagging) and restricted to random subsets of features. It’s known for robustness and interpretability, but it does not incorporate spatial context, which can limit its effectiveness in geospatial applications.
Geographically Weighted Regression is a spatial regression technique that models relationships locally, letting the coefficients vary across geographic locations. This allows it to reveal spatial heterogeneity, which is especially useful in regions with diverse topography and land use. However, it is computationally intensive and requires careful spatial preprocessing.
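Of the three models, Geographically Weighted Regression is probably the least familiar, so a small sketch of the idea may help. The code below is an illustrative NumPy implementation under stated assumptions, not the project's implementation: the function name `gwr_fit` is hypothetical, the kernel is assumed to be a fixed-bandwidth Gaussian, and `coords` is assumed to hold projected planar coordinates (raw lat/lon would need projecting first so that distances are meaningful).

```python
import numpy as np

def gwr_fit(coords, X, y, bandwidth):
    """Fit one weighted least-squares regression per location.

    Returns an (n, p + 1) array of local coefficients, intercept first.
    Gaussian kernel and fixed bandwidth are assumptions of this sketch.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(coords), -1)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # add intercept column
    local_coefs = np.empty((n, design.shape[1]))
    for i in range(n):
        # Nearby observations get weights near 1, distant ones near 0.
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / bandwidth ** 2)
        # Weighted least squares: scale rows by sqrt(w), then solve.
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)[0]
        local_coefs[i] = beta
    return local_coefs
```

Each location requires its own regression over all n observations, so the cost grows roughly quadratically with the number of samples. That is the source of the computational burden mentioned above, and with a dataset of this size some form of subsampling or a truncated kernel would likely be needed in practice.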