Labelling data is a crucial step in the process of developing supervised Machine Learning (ML) models. It involves the critical task of assigning relevant labels or categories to different features within the data, such as land cover class (e.g. vegetation, water bodies, urban area, etc.) or other physical characteristics of the Earth’s surface. These labels can be multi-class (e.g., forest, grassland, urban), or binary (e.g., water or non-water).
This post presents User Scenario 2 of the AI/ML Enhancement Project, titled “Alice labels Earth Observation (EO) data”. It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users labelling EO data.
For this User Scenario, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, through the following steps:
- create data labels, using QGIS Software or a Solara / Leafmap application
- load Labels and Sentinel-2 data using STAC API
- sample Sentinel-2 data with Labels and create a dataframe
- validate the labelled data against the Global Surface Water (GSW) dataset
- use the dataframe to train an ML model based on a Random Forest classifier
- perform raster inference on a Sentinel-2 scene to generate a binary water mask
Practical examples and commands are displayed to demonstrate how these new capabilities can be used from a Jupyter Notebook.
Figure: Notebook preview
The process for creating vector (point or polygon) data layers is illustrated with two examples:
- QGIS Software: a dedicated profile on the App Hub is configured for the user to run QGIS Software (more details can be found in the App Hub online User Manual). The steps to create new Shapefile Layers, add classification types for each point / polygon, and save the output in GeoJSON format are illustrated with several screenshots.
- Solara / Leafmap application: an interactive map, built on Solara and Leafmap, has been integrated in the Notebook to give the user the option to manually create and save labels right from the Notebook itself.
After the annotations are created, either from QGIS or from the Solara / Leafmap interactive map, and saved into a .geojson file, the user can create the STAC Item of the EO labels and publish it on the STAC endpoint. This is done with the pystac Python library and an interactive form right in the Notebook.
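As an illustration of what such a label file might contain, the sketch below builds a small GeoJSON FeatureCollection of labelled points with the Python standard library and derives the bounding box that a STAC Item for the labels would typically carry. The property name `CLASSIFICATION` matches the label column used in the Notebook code; the coordinates, the class names and the helper names `make_label_feature` and `bbox_of` are hypothetical.

```python
import json

def make_label_feature(lon, lat, classification):
    """Return one GeoJSON Feature for a labelled point (hypothetical helper)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"CLASSIFICATION": classification},
    }

def bbox_of(features):
    """Bounding box [min_lon, min_lat, max_lon, max_lat] of point features."""
    lons = [f["geometry"]["coordinates"][0] for f in features]
    lats = [f["geometry"]["coordinates"][1] for f in features]
    return [min(lons), min(lats), max(lons), max(lats)]

# Made-up example points
features = [
    make_label_feature(-121.50, 38.10, "water"),
    make_label_feature(-121.20, 38.40, "vegetation"),
    make_label_feature(-120.90, 38.00, "urban"),
]
label_collection = {"type": "FeatureCollection", "features": features}
label_bbox = bbox_of(features)

# Serialise as the content of a .geojson file
geojson_text = json.dumps(label_collection, indent=2)
print(label_bbox)
```

The geometry, bounding box and properties are the pieces a STAC Item needs in addition to an identifier and a datetime.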
Access to Labels and EO data was facilitated through the libraries pystac and pystac_client. These libraries enable users to interact with a STAC catalog by defining specific query parameters, such as time range, area of interest, and data collection preferences. Only the STAC Items that match the provided criteria are then retrieved for the user.
Below is a simplified code snippet for implementing a STAC data search and for displaying results on an interactive map. An upcoming article, dedicated to the STAC format and data access, will be published with more guidance and examples.
# Import libraries
import pystac
from pystac_client import Client

# Access to STAC Catalog
cat = Client.open("https://ai-extensions-stac.terradue.com", ...)

# Define query parameters
start_date = "2023-06-01"
end_date = "2023-06-30"
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
cloud_cover = 30
tile = "10SFH"

# Search Labels by AOI, start/end date
query_sel = cat.search(
    collections=["ai-extensions-svv-dataset-labels"],
    datetime=(start_date, end_date),
    bbox=bbox,
)
labels = query_sel.item_collection()

# Search EO data (Sentinel-2) by AOI, start/end date, cloud cover and tile number
query_sel = cat.search(
    collections=["sentinel-2-l2a"],
    datetime=(start_date, end_date),
    bbox=bbox,
    query={"eo:cloud_cover": {"lt": cloud_cover}},
)
eo_item = [item for item in query_sel.item_collection() if tile in item.id][0]
Once the Label data is loaded, it is converted into a dataframe (gdf) using the geopandas library. The Python library folium is then used to display both the Labels and EO data on an interactive map.
import folium
from folium import GeoJson, LayerControl, plugins

map = folium.Map(location=[x, y], tiles="OpenStreetMap", zoom_start=9)
# Add Labels to map
map = addPoints2Map(gdf, map)
# Add footprint of EO scene
footprint_eo = folium.GeoJson(eo_item.geometry, style_function=lambda x: {...})
footprint_eo.add_to(map)
# Visualise map
map
Figure: interactive map
After loading the data, the Notebook continues with the implementation of a function to iteratively sample the EO data at the location of each labelled point. In addition to sampling a selection of the Sentinel-2 reflectance bands (coastal, red, green, blue, nir, nir08, nir09, swir16, and swir22), three vegetation indices are also calculated (ndvi, ndwi1, and ndwi2). After sampling the EO bands and calculating the vegetation indices, all the data is concatenated into a pandas DataFrame.
import pandas as pd

tmp_gdfs = []
for i, label_item in enumerate(eo_items):
    sampled_data = sample_data(label_item=label_item, common_bands=["coastal", "red", "green", "blue", "nir", "nir08", "nir09", "swir16", "swir22"])
    tmp_gdfs.append(sampled_data)

# Create pandas dataframe
gdf_points = pd.concat(tmp_gdfs)

# Save to file
gdf_points.to_pickle("filename.pkl")
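The post does not spell out how the three indices are defined. The sketch below assumes the commonly used normalised-difference formulas: NDVI from the nir and red bands, and two NDWI variants, one from green and nir (McFeeters) and one from nir and swir16 (Gao). The mapping of `ndwi1` and `ndwi2` to these variants is an assumption, as are the function names and the reflectance values.

```python
def normalised_difference(a, b):
    """(a - b) / (a + b), returning 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def add_indices(sample):
    """Return a copy of a band sample (dict of reflectances) with three indices added."""
    out = dict(sample)
    out["ndvi"] = normalised_difference(sample["nir"], sample["red"])
    # Assumed definitions; the Notebook may use different band pairs
    out["ndwi1"] = normalised_difference(sample["green"], sample["nir"])
    out["ndwi2"] = normalised_difference(sample["nir"], sample["swir16"])
    return out

# Made-up reflectances for a vegetated pixel and a water pixel
vegetation = add_indices({"red": 0.05, "green": 0.08, "nir": 0.45, "swir16": 0.20})
water = add_indices({"red": 0.03, "green": 0.06, "nir": 0.02, "swir16": 0.01})
print(round(vegetation["ndvi"], 3), round(water["ndwi1"], 3))
```

With these made-up values the vegetated pixel has a high NDVI and a negative green/nir NDWI, while the water pixel has a positive green/nir NDWI, which is the kind of contrast the classifier can exploit.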
A comparison against another, independent dataset was performed to show a validation approach for the labelled data. As a validation dataset, we used the Global Surface Water (GSW) dataset, generated by JRC (Citation: Pekel, Jean-François; Cottam, Andrew; Gorelick, Noel; Belward, Alan (2017): Global Surface Water Explorer dataset. European Commission, Joint Research Centre (JRC), http://data.europa.eu/89h/jrc-gswe-global-surface-water-explorer-v1).
The comparison was performed simply by iterating through the generated labels dataframe and counting the number of points labelled as “water” that were also classified as water in the GSW dataset (i.e. with a pixel value greater than 80%).
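A minimal sketch of that counting step, assuming each labelled point already carries the GSW value (in percent) sampled at its location. The field name `gsw_occurrence`, the helper name and the sample values are hypothetical.

```python
def water_agreement(points, threshold=80):
    """Count points labelled 'water' and how many of them GSW also marks as water.

    A point counts as water in GSW when its sampled value exceeds `threshold` (%).
    """
    water_points = [p for p in points if p["CLASSIFICATION"] == "water"]
    confirmed = sum(1 for p in water_points if p["gsw_occurrence"] > threshold)
    return confirmed, len(water_points)

# Made-up labelled points with sampled GSW values
points = [
    {"CLASSIFICATION": "water", "gsw_occurrence": 95},
    {"CLASSIFICATION": "water", "gsw_occurrence": 40},
    {"CLASSIFICATION": "vegetation", "gsw_occurrence": 0},
    {"CLASSIFICATION": "water", "gsw_occurrence": 100},
]
confirmed, total = water_agreement(points)
print(f"{confirmed} of {total} water labels confirmed by GSW")
```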
The dataframe was prepared for the supervised ML task by converting it into a binary classification dataset (i.e. “water” and “no-water”) and by removing unnecessary columns. Further and more detailed analysis of the dataframe can be performed through Exploratory Data Analysis (EDA). See the recently published article dedicated to EDA for more details and guidance.
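One way to carry out this conversion with pandas is sketched below. The original class names, the extra `geometry` column, and the choice to keep the label as the strings "water" / "no-water" are assumptions; only the `CLASSIFICATION` column name comes from the Notebook code.

```python
import pandas as pd

# Made-up multi-class samples with one column that is not needed for training
df = pd.DataFrame({
    "CLASSIFICATION": ["water", "vegetation", "urban", "water"],
    "ndvi": [-0.2, 0.8, 0.1, -0.1],
    "geometry": ["POINT (0 0)", "POINT (1 1)", "POINT (2 2)", "POINT (3 3)"],
})

# Collapse every non-water class into a single "no-water" class
binary = df.copy()
binary["CLASSIFICATION"] = binary["CLASSIFICATION"].where(
    binary["CLASSIFICATION"] == "water", "no-water"
)

# Drop columns that are not used as features or label
binary = binary.drop(columns=["geometry"])
print(binary["CLASSIFICATION"].value_counts().to_dict())
```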
The dataset was then split into train and test sets with the dedicated function train_test_split() from the sklearn package.
from sklearn.model_selection import train_test_split

# columns used as features during training
feature_cols = ['coastal','red','green','blue','nir','nir08','nir09','swir16','swir22', 'ndvi', 'ndwi1', 'ndwi2']
# column name for label
LABEL_NAME = 'CLASSIFICATION'

features = train_dataset[feature_cols] # cols for features
label = train_dataset[LABEL_NAME] # col for labels
X_train, X_test, y_train, y_test = train_test_split(
    features, label,
    random_state=42,
    train_size=0.85,
)
The ML model developed in this Notebook was a Random Forest classifier using k-fold cross validation. Random Forest is a powerful and versatile supervised ML algorithm that grows and combines multiple decision trees to create a “forest.” It can be used for both classification and regression problems. K-Fold Cross-Validation is a technique used in ML to assess the performance and generalisation ability of a model. The steps involved in K-Fold Cross-Validation are:
- The dataset is split into K subsets, or “folds”.
- The model is then trained K times, each time using K-1 folds for training and the remaining fold for validation.
- This process is repeated K times, with each of the K folds used exactly once as the validation data.
- The K results from the K folds are then averaged to produce a single estimate of model performance.
Figure: k-fold cross-validation diagram
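The steps above can be sketched in plain Python as follows. This only illustrates how the folds are formed and rotated; it is not the implementation used in the Notebook, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 10 samples split into 5 folds: each sample is used for validation exactly once
folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)

# Averaging per-fold scores gives the single performance estimate
fold_scores = [0.90, 0.85, 0.95, 0.80, 0.90]  # made-up validation scores
mean_score = sum(fold_scores) / len(fold_scores)
```

In practice the samples are usually shuffled (or stratified by class) before being assigned to folds; the sketch keeps them in order to make the rotation easy to follow.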
The ML hyperparameters are defined and used to train the model with a few simple function calls, provided the helper functions are defined.
hyperparameters = {
    'n_estimators': 200,
    'criterion': 'gini',
    'max_depth': None,
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'min_weight_fraction_leaf': 0.0,
    'max_features': 'sqrt',
    'max_leaf_nodes': None,
    'min_impurity_decrease': 0.0,
    'bootstrap': True,
    'oob_score': False,
    'n_jobs': -1,
    'random_state': 42,
    'verbose': 0,
    'warm_start': True,
    'class_weight': None,
    'ccp_alpha': 0.0,
    'max_samples': None
}
# define model object (the Model class is defined in utils.py)
model = Model(hyperparameters)
# train the model using k-fold cross validation
estimators = model.training(X=X_train, Y=y_train, folds=5)
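The `Model` class lives in the Notebook's utils.py and is not shown in this post. Purely as an illustration of the idea, the sketch below trains a Random Forest with 5-fold cross-validation using scikit-learn's `cross_validate`, keeping the fitted estimator of each fold. The synthetic dataset, the reduced hyperparameter set and the variable names are assumptions, not the Notebook's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic binary dataset standing in for the sampled Sentinel-2 features
X, y = make_classification(n_samples=200, n_features=12, random_state=42)

clf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=42)

# Train one estimator per fold and keep each fitted model
cv_results = cross_validate(clf, X, y, cv=5, scoring="accuracy", return_estimator=True)

estimators = cv_results["estimator"]    # one fitted model per fold
fold_scores = cv_results["test_score"]  # validation accuracy per fold
print(f"mean accuracy over {len(estimators)} folds: {fold_scores.mean():.3f}")
```

Keeping the per-fold estimators makes it possible to pick the best one later by scoring each of them on the held-out test set.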
The model is evaluated on unseen data with the following evaluation metrics:
- Accuracy: calculated as the ratio of correctly predicted instances to the total number of instances in the dataset
- Recall: also known as sensitivity or true positive rate, recall evaluates the ability of a classification model to correctly identify all relevant instances in a dataset
- Precision: it evaluates the accuracy of the positive predictions made by a classification model, i.e. the share of predicted positives that are actually positive
- F1-score: a metric that combines precision and recall into a single value. It is particularly useful when there is an uneven class distribution (imbalanced classes), as it provides a balance between precision and recall
- Confusion Matrix: it provides a detailed breakdown of the model’s performance, highlighting instances of correct and incorrect predictions.
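To make the definitions concrete, the sketch below derives the four scores from the confusion matrix of a small, made-up set of binary predictions (1 = water, 0 = no-water). The helper names are hypothetical; scikit-learn provides equivalent functions in `sklearn.metrics`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
result = scores(tp, fp, fn, tn)
print((tp, fp, fn, tn), result)
```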
The code snippet below shows how the model can be evaluated; the evaluation metrics are calculated and reported during the process.
# evaluate model
best_model = model.evaluation(estimators, X_test, y_test)
Other ways to evaluate the ML model are the distribution of the probability of predicted values, the Receiver Operating Characteristic (ROC) Curve, and the analysis of the permutation feature importance. Each of the three can be derived and plotted from within the Notebook with a single line of code.
# Distribution of probability of predicted values
ml_helper.distribution_of_predicted_val(best_model, X_train, X_test)
# ROC Curve
ml_helper.roc(best_model, X_test, y_test)
# Permutation Importance
ml_helper.p_importance(best_model, X_test, y_test, hyperparameters, MODEL_OUTPUT_DIR)
Figure: the three evaluation plots
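The `ml_helper` functions are part of the Notebook and are not reproduced here. As a reminder of what the ROC curve summarises, the area under it (AUC) equals the probability that a randomly chosen positive sample receives a higher predicted probability than a randomly chosen negative one. The sketch below computes it that way on made-up scores; `roc_auc` is a hypothetical helper.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # made-up predicted water probabilities
auc = roc_auc(y_true, y_score)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.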
Finally, the best ML model can be saved to a file so that it can be loaded and used in the future. The only prerequisite for applying the ML model is that the input dataset has the same format as the training dataset described above.
import joblib
# Save the model to file
model_fname = 'best_rf_model.joblib'
joblib.dump(best_model, model_fname)
Now the user can apply the ML model to a Sentinel-2 image to generate a binary water mask output. After loading the EO data and the ML model into the Notebook, the ML model is applied to make predictions over the entire input EO data. The steps to perform these operations are shown in the simplified code snippet below.
import numpy as np
import rasterio

# Select EO assets from the loaded Sentinel-2 scene (eo_item)
fileList = {}
for f in eo_item.get_assets():
    if (f in feature_cols) or f == 'scl':
        fileList[f] = eo_item.get_assets()[f].href

# Load the ML model classifier
model = joblib.load(model_fname)
# Make predictions
predictions = ml_helper.readRastersToArray(model, fileList, feature_cols)

# Save predictions
df_predict = pd.DataFrame(predictions.ravel(), columns=['predictions'])
df_predict.to_pickle('prediction.pkl')

# Create binary mask
predictions = df_predict['predictions']
predictions = predictions.to_numpy().reshape((10980, 10980))

# Apply sieve operation to remove small features (in pixels)
my_array_uint8 = predictions.astype(rasterio.uint8)
sieved = sieve(my_array_uint8, threshold=1000, connectivity=8)

# Use Scene Classification band to filter out clouds and bad data
with rasterio.open(fileList['scl']) as scl_src:
    scl = scl_src.read(1)
scl = np.where(~np.isin(scl, [4, 5, 6, 7, 11]), np.nan, scl)
mask_out = np.where(~np.isnan(scl), sieved, np.nan)

# Plot the result
import matplotlib.pyplot as plt
plt.imshow(mask_out, interpolation='none'); plt.title("Improved result")
In the figure above, water bodies are plotted in yellow, non-water pixels in dark blue, and clouds are masked out in white (top-right corner of the image).
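The masking step can be illustrated on a toy array. The sketch below keeps predictions only where the Scene Classification (SCL) value is in the list used in the snippet (4, 5, 6, 7, 11) and sets everything else to NaN. The 3×3 arrays are made up, and the sieve step is left out to keep the example dependent on NumPy only.

```python
import numpy as np

# Made-up binary predictions (1 = water, 0 = no-water)
predictions = np.array([[1, 1, 0],
                        [0, 1, 0],
                        [0, 0, 1]], dtype=np.uint8)

# Made-up SCL values: 9 = high-probability cloud, 3 = cloud shadow,
# 6 = water, 4 = vegetation
scl = np.array([[6, 6, 9],
                [4, 6, 9],
                [4, 3, 6]])

valid_classes = [4, 5, 6, 7, 11]
valid = np.isin(scl, valid_classes)

# Keep the prediction where the pixel is valid, NaN elsewhere
mask_out = np.where(valid, predictions.astype(float), np.nan)
print(mask_out)
```

NaN pixels are left unpainted by `plt.imshow`, which is why the masked clouds show up as white areas in the plot.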
This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help an ML practitioner:
- create EO data labels, using QGIS Software or a Solara / Leafmap application
- load Labels and EO data with STAC API
- sample EO data with Labels and create a dataframe
- use the dataframe to train a Random Forest classifier
- perform raster inference on a selected Sentinel-2 scene to generate a binary water mask.
Useful links:
Website: https://www.terradue.com/portal/
Discuss forum: https://discuss.terradue.com/t/announcing-the-launch-of-the-ai-ml-enhancement-project-for-gep-and-urban-tep-exploitation-platforms/1188/3
– Parham Membari
– Pedro Pereira goncalves
– Fabrice Brito
– Simone Vaccari