End to End Machine learning project (Part 2): Interactive EDA using Bokeh and Streamlit | by Pratha Pawar | Jun, 2024

Preparing the data for the plots

if READ_SAVED_FILE_DIRECTLY:
    print('Reading pre-saved processed file..')
    nyc = loc_read_parquet_file(FINAL_FILE_LOCATION)
else:
    # Getting raw NYC taxi data
    nyc = get_raw_data(RAW_URL, QUERY, MONTHS, LIMIT, NYC_SQL_FILE_LOC, fetch_if_exists = FETCH_IF_EXISTS)
    # Processing and adding new features
    nyc = prepare_data(nyc, NYC_RAW_FILE_LOC, fetch_if_exists = FETCH_IF_EXISTS)
    # Adding taxi zones and Mercator-transformed lat/lons
    nyc = process_data(nyc, FINAL_FILE_LOCATION, fetch_if_exists = FETCH_IF_EXISTS,
                       taxi_zones_file = TAXI_ZONES_FILE, transform_merc = True)

Below is the nyc dataset after feature engineering, cleaning, adding the zone reference, and converting to Mercator projections for plotting. Note: this is the final dataset used to create the interactive visualization. The main features used for plotting are: pickup_weekday, pickup_hour, trip_duration_minutes, pickup_MercatorX, pickup_MercatorY, dropoff_MercatorX, dropoff_MercatorY, pickup_borough, dropoff_borough, pickup_zone_name, dropoff_zone_name.

>>> data_for_plotting.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1751081 entries, 0 to 1751080
Data columns (total 29 columns):
#   Column                 Dtype         
---  ------                 -----         
0   vendorid               int64         
1   pickup_datetime        datetime64[ns]
2   dropoff_datetime       datetime64[ns]
3   passenger_count        int64         
4   pickup_longitude       float64       
5   pickup_latitude        float64       
6   store_and_fwd_flag     object        
7   dropoff_longitude      float64       
8   dropoff_latitude       float64       
9   trip_duration          float64       
10  pickup_date            datetime64[ns]
11  pickup_month           int64         
12  pickup_day             int64         
13  pickup_hour            int64         
14  pickup_weekday         category      
15  holiday                int64         
16  distance_hav           float64       
17  bearing                float64       
18  trip_duration_minutes  float64       
19  pickup_taxizone_id     float64       
20  dropoff_taxizone_id    float64       
21  pickup_MercatorX       float64       
22  pickup_MercatorY       float64       
23  dropoff_MercatorX      float64       
24  dropoff_MercatorY      float64       
25  pickup_borough         object        
26  dropoff_borough        object        
27  pickup_zone_name       object        
28  dropoff_zone_name      object        
dtypes: category(1), datetime64[ns](3), float64(14), int64(6), object(5)
memory usage: 375.7+ MB

The borough names, zone IDs, and zone names have been mapped to each location (lat/lon) so that the analysis can be done by zone and borough name. The ../data/external/taxi_zones_shape/taxi_zones.shp file contains the geographic boundaries of each zone, as well as the corresponding borough name. In the code, I've used multiprocessing to speed up this mapping process by 8 to 10 times.

import geopandas as gpd
import pandas as pd
import pyproj
from shapely.geometry import Point

def assign_taxi_zones(df: pd.DataFrame, chunk: int | None = None,
                      lon_var: str = 'pickup_longitude', lat_var: str = 'pickup_latitude',
                      locid_var: str = 'pickup_taxizone_id',
                      taxi_zones_file: str = '../data/external/taxi_zones_shape/taxi_zones.shp',
                      ) -> pd.Series:
    # make a copy since we will modify lats and lons
    localdf = df[[lon_var, lat_var]].copy()
    # missing lat lon info is indicated by nan. Fill with zero
    # which is outside the New York shapefile.
    localdf[lon_var] = localdf[lon_var].fillna(value=0.)
    localdf[lat_var] = localdf[lat_var].fillna(value=0.)
    shape_df = gpd.read_file(taxi_zones_file)
    shape_df.drop(['OBJECTID', "Shape_Area", "Shape_Leng"], axis=1, inplace=True)
    shape_df = shape_df.to_crs(pyproj.CRS('epsg:4326'))
    try:
        print(f"assigning taxi zones to each location: {lon_var}, {lat_var}, chunk = {chunk}")
        local_gdf = gpd.GeoDataFrame(
            localdf, crs = pyproj.CRS('epsg:4326'),
            geometry = [Point(xy) for xy in
                        zip(localdf[lon_var], localdf[lat_var])
                        ]
        )
        # `predicate` replaces the older `op` keyword in recent geopandas releases
        local_gdf = gpd.sjoin(local_gdf, shape_df,
                              how = 'left', predicate = 'within')
        return local_gdf.LocationID.rename(locid_var)
    except ValueError as err:
        # assumption: the excerpt ends before its except clause, so re-raise here
        print(f"zone assignment failed for chunk = {chunk}: {err}")
        raise
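
The multiprocessing speed-up mentioned above usually comes from splitting the rows into chunks and mapping each chunk in its own worker process. The sketch below shows only that chunk-and-recombine pattern, using the standard library. TOY_ZONES, lookup_zone, lookup_chunk, split_into_chunks and map_in_chunks are all hypothetical stand-ins (a bounding-box lookup takes the place of the shapefile join); this is not the project's actual code.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical stand-in for the shapefile:
# zone id -> (min_lon, min_lat, max_lon, max_lat)
TOY_ZONES = {
    1: (-74.05, 40.68, -73.90, 40.88),
    2: (-73.90, 40.55, -73.70, 40.80),
}

def lookup_zone(point):
    """Return the id of the first toy zone containing (lon, lat), else None."""
    lon, lat = point
    for zone_id, (min_lon, min_lat, max_lon, max_lat) in TOY_ZONES.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return zone_id
    return None

def lookup_chunk(points):
    """Map every point of one chunk; this is the unit of work sent to a worker."""
    return [lookup_zone(p) for p in points]

def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks consecutive pieces of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def map_in_chunks(points, n_chunks=4, executor_cls=ProcessPoolExecutor):
    """Look up zones chunk by chunk in parallel and return results in input order.

    Pass executor_cls=ThreadPoolExecutor for quick experiments without processes.
    """
    chunks = split_into_chunks(list(points), n_chunks)
    with executor_cls(max_workers=n_chunks) as pool:
        results = pool.map(lookup_chunk, chunks)  # map() preserves chunk order
        return [zone for chunk_result in results for zone in chunk_result]
```

When using the default process pool from a script, call map_in_chunks under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The same shape applies to a DataFrame: split it into row chunks, run the zone assignment on each chunk, and concatenate the returned Series.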

Two possible backgrounds for the plot

Google Maps: needs your Google Maps API key.
Bokeh tile providers: easy to use and provide all the required information, as long as you don't need Google Maps objects. This option requires converting from the lat/long (EPSG: 4326) format to a Mercator projection (EPSG: 3857). Check out this link for details on the geopandas projections.

I've added code for both. If you decide to use Google Maps as the background, you will need to add your Google Maps API key to an apikey.txt file.

Projections for Bokeh

The coordinate reference system (CRS) is important because the geometric shapes in a GeoSeries or GeoDataFrame object are simply a collection of coordinates in an arbitrary space. A CRS tells Python how those coordinates relate to places on the Earth. (source)

The location information we have for the pickup and dropoff points in the original dataset is in lat-long (EPSG: 4326) format (degrees lat, degrees lon). To use OpenStreetMap tiles in Bokeh plots, we need to convert this to a spherical Mercator projection (EPSG: 3857). See the example code below.

import pyproj

# Converting NYC city limits to Mercator projection
nyc_long_limits = (-74.257159, -73.699215)
nyc_lat_limits = (40.471021, 40.987326)
# Setting coordinate system
inProj = pyproj.CRS('epsg:4326')
outProj = pyproj.CRS('epsg:3857')
# despite its name, this transformer goes from lat/lon (EPSG:4326) to Mercator (EPSG:3857)
transform_to_lat_lon = pyproj.Transformer.from_crs(inProj, outProj, always_xy = True)
nyc_lon1, nyc_lat1 = transform_to_lat_lon.transform(nyc_long_limits[0], nyc_lat_limits[0])
nyc_lon2, nyc_lat2 = transform_to_lat_lon.transform(nyc_long_limits[1], nyc_lat_limits[1])
## The converted coordinates
# >>> nyc_lon1, nyc_lon2
# (-8266269.127635151, -8204159.085663989)
# >>> nyc_lat1, nyc_lat2
# (4934627.716294977, 5010472.431436094)
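
For intuition, the spherical Mercator projection behind EPSG:3857 can also be written in closed form. The sketch below uses only the standard library and is meant as a cross-check on the conversion, not as a replacement for pyproj. lonlat_to_web_mercator is a hypothetical helper name.

```python
import math

# WGS 84 semi-major axis in metres; EPSG:3857 treats the Earth as a sphere of this radius
EARTH_RADIUS_M = 6378137.0

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to EPSG:3857 x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Same NYC limits as in the pyproj example
x1, y1 = lonlat_to_web_mercator(-74.257159, 40.471021)
x2, y2 = lonlat_to_web_mercator(-73.699215, 40.987326)
print(x1, y1)
print(x2, y2)
```

The printed values should agree closely with the converted coordinates from pyproj. Note that x grows linearly with longitude while y stretches with latitude, which is why the two axes need separate limits on the plot.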

I've used the CartoDB Positron tile for Bokeh plots, but you can use anything you like from this list.

Creating the interactive visualization

Bokeh itself has a models module (bokeh.models) that provides everything needed to create the plots, add widgets such as checkboxes and sliders, add glyph renderers for circles, and so on. All these models are collected into a document instance that can be served to the client. This instance is interactive and can react to user inputs (e.g. a slider selecting an integer between 1 and 10). I've implemented this code in this function. You can try running it in a notebook or in a script to get an interactive graph (similar to what we have created using streamlit, but using only bokeh).
However, I decided to use a combination of bokeh and streamlit for easy deployment and sharing. In this case, all the user interactions (selecting hour, weekday, zones, etc.) are handled by streamlit. Bokeh receives the already filtered data and focuses only on the plotting part.

To start the streamlit server and visualization, run the following from your terminal:

> make stream_maps

This will run streamlit_maps.py using streamlit, which in turn will run the streamlit_points or streamlit_lines method in the NYC/streamlit_general_maps.py file, based on the user's selection on the graph.

When running for the first time, use READ_SAVED_FILE_DIRECTLY=False to run all the data fetching and processing functions and save the files.
The next time, you can run the same file with READ_SAVED_FILE_DIRECTLY=True to use the already saved data files.

This makes the above code self-contained: this is the only command you need to run to start the streamlit visualization.

In a nutshell, this is what the code does (for the points graph):

# Selecting the pickup or dropoff zones to visualize
pickup_or_dropoff_ed, pickup_or_dropoff = self.add_sidebar_pickup_or_dropoff()
# Slider for selecting hour of the day
slider_hour = self.add_sidebar_hour()
# Selecting the weekday
checkbox_weekday = self.add_sidebar_weekday()
# Getting a list of the available taxi zones, for the user to select from
zone_name_list = self.add_sidebar_select_zone(pickup_or_dropoff_ed, pickup_or_dropoff)
# Filtering the data based on user selection -- reacting to user interaction
print('Filtering the dataframe based on user selection...')
df_to_plot = self.nyc[(self.nyc[pickup_or_dropoff].isin(zone_name_list))
                      & (self.nyc['pickup_weekday'].isin(checkbox_weekday))
                      & (self.nyc['pickup_hour'].isin(slider_hour))
                      ]
print('Done filtering the data')
print(f"Number of trips selected: {df_to_plot.shape[0]}, out of {self.nyc.shape[0]}")
# Plotting the bokeh plot
print('Plotting the bokeh maps plot now....')
if is_gmaps:
    assert api_key is not None, "in order to use google maps, you need to pass in a valid google maps api key."
    intrc_trips_loc_p = plot_gmaps_streamlit(
        df_to_plot,
        latitude_column = ['pickup_latitude', 'dropoff_latitude'],
        longitude_column = ['pickup_longitude', 'dropoff_longitude'],
        color_column = 'trip_duration_minutes', size_column = 5.0,
        api_key = api_key, map_type = 'roadmap', map_zoom = 10,
        width = 700, height = 600
    )
else:
    # Using Bokeh tile maps
    intrc_trips_loc_p = plot_cartmaps_streamlit(
        data = df_to_plot,
        latitude_column = ['pickup_MercatorY', 'dropoff_MercatorY'],
        longitude_column = ['pickup_MercatorX', 'dropoff_MercatorX'],
        map_tile_type = CARTODBPOSITRON,
        nyc_long_limits = (-74.257159, -73.677215), #(-74.257159, -73.699215),
        nyc_lat_limits = (40.471021, 40.987326), #(40.471021, 40.987326)
        color_column = 'trip_duration_minutes',
        size_column = 6,
        width = 700, height = 600
    )
return intrc_trips_loc_p
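
The filtering step combines three boolean masks with `&`, so a trip is kept only if it matches the zone, weekday, and hour selections at once. Here is a toy illustration with pandas. The column names follow the dataset above, but the rows and the selections are invented for the example.

```python
import pandas as pd

# Made-up trips, just enough to exercise each filter
trips = pd.DataFrame({
    "pickup_zone_name": ["Midtown Center", "JFK Airport", "Midtown Center", "Astoria"],
    "pickup_weekday": ["Monday", "Monday", "Saturday", "Monday"],
    "pickup_hour": [8, 9, 8, 23],
    "trip_duration_minutes": [12.5, 48.0, 9.0, 15.2],
})

# Stand-ins for the values the sidebar widgets would return
selected_zones = ["Midtown Center", "JFK Airport"]
selected_weekdays = ["Monday"]
selected_hours = list(range(7, 10))  # hours 7, 8 and 9

# Each isin() yields a boolean Series; & keeps rows where all three are True
mask = (
    trips["pickup_zone_name"].isin(selected_zones)
    & trips["pickup_weekday"].isin(selected_weekdays)
    & trips["pickup_hour"].isin(selected_hours)
)
filtered = trips[mask]
print(f"Number of trips selected: {filtered.shape[0]}, out of {trips.shape[0]}")
# prints "Number of trips selected: 2, out of 4"
```

The parentheses around each comparison in the original code matter for the same reason: `&` binds more tightly than most comparison operators in Python.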

The lines chart works the same way, except that instead of points we plot lines from one point to the other, and, unlike the points graph, we show the pickups and dropoffs on the same plot.

# ....................
# ....................
# Plotting the lines from other zones to the selected taxi zone/s
lines_p_dropoff = plot_src_to_dest_arrows(data = df_to_plot_dropoff,
    pickup_or_dropoff = 'dropoff_zone_name',
    custom_title = "Trips ending at the selected taxi zone/s (Blue circles are dropoff points) \n"
                   f"Number of trips selected: {df_to_plot_dropoff.shape[0]:,}",
    **common_kwargs
)
# Combining the two plots into a single layout
layout = gridplot([lines_p_pickup, lines_p_dropoff], ncols = 2,
                  sizing_mode = "scale_both", merge_tools = False) # using gridplot helps avoid the large padding applied by row()
return layout
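
For the lines chart, each trip contributes one segment from its pickup point to its dropoff point. The sketch below shows one way to arrange the Mercator coordinates into the x0, y0, x1, y1 columns that a segment-style glyph (such as Bokeh's segment) takes. trips_to_segments is a hypothetical helper, not the project's plot_src_to_dest_arrows, and it assumes missing coordinates are represented as None.

```python
def trips_to_segments(trips):
    """Turn trip dicts into parallel start/end coordinate lists.

    Trips with any missing (None) coordinate are skipped.
    """
    source_keys = ("pickup_MercatorX", "pickup_MercatorY",
                   "dropoff_MercatorX", "dropoff_MercatorY")
    target_keys = ("x0", "y0", "x1", "y1")
    segments = {key: [] for key in target_keys}
    for trip in trips:
        coords = [trip.get(key) for key in source_keys]
        if any(value is None for value in coords):
            continue  # cannot draw a line without both end points
        for key, value in zip(target_keys, coords):
            segments[key].append(value)
    return segments
```

Keeping the four lists the same length means they can be passed together as one data source, with one row per trip.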
