Preparing the data for the plots
if READ_SAVED_FILE_DIRECTLY:
    print('Reading pre-saved processed file..')
    nyc = loc_read_parquet_file(FINAL_FILE_LOCATION)
else:
    # Getting raw NYC taxi data
    nyc = get_raw_data(RAW_URL, QUERY, MONTHS, LIMIT, NYC_SQL_FILE_LOC, fetch_if_exists = FETCH_IF_EXISTS)
    # Processing and adding new features
    nyc = prepare_data(nyc, NYC_RAW_FILE_LOC, fetch_if_exists = FETCH_IF_EXISTS)
    # Adding taxi zones and Mercator-transformed lat/lons
    nyc = process_data(nyc, FINAL_FILE_LOCATION, fetch_if_exists = FETCH_IF_EXISTS,
                       taxi_zones_file = TAXI_ZONES_FILE, transform_merc = True)
- See below the dataset nyc after feature engineering, cleaning, adding the zone reference and converting to Mercator projections for plotting. Note: This is the final dataset that is used for creating the interactive visualization. The main features that have been used for plotting are: pickup_weekday, pickup_hour, trip_duration_minutes, pickup_MercatorX, pickup_MercatorY, dropoff_MercatorX, dropoff_MercatorY, pickup_borough, dropoff_borough, pickup_zone_name, dropoff_zone_name.
>>> data_for_plotting.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1751081 entries, 0 to 1751080
Data columns (total 29 columns):
 #   Column                 Dtype
---  ------                 -----
 0   vendorid               int64
 1   pickup_datetime        datetime64[ns]
 2   dropoff_datetime       datetime64[ns]
 3   passenger_count        int64
 4   pickup_longitude       float64
 5   pickup_latitude        float64
 6   store_and_fwd_flag     object
 7   dropoff_longitude      float64
 8   dropoff_latitude       float64
 9   trip_duration          float64
 10  pickup_date            datetime64[ns]
 11  pickup_month           int64
 12  pickup_day             int64
 13  pickup_hour            int64
 14  pickup_weekday         category
 15  vacation               int64
 16  distance_hav           float64
 17  bearing                float64
 18  trip_duration_minutes  float64
 19  pickup_taxizone_id     float64
 20  dropoff_taxizone_id    float64
 21  pickup_MercatorX       float64
 22  pickup_MercatorY       float64
 23  dropoff_MercatorX      float64
 24  dropoff_MercatorY      float64
 25  pickup_borough         object
 26  dropoff_borough        object
 27  pickup_zone_name       object
 28  dropoff_zone_name      object
dtypes: category(1), datetime64[ns](3), float64(14), int64(6), object(5)
memory usage: 375.7+ MB
- The borough names, zone IDs and zone names have been mapped to each location (lat/lon) so that the analysis can use zone and borough names. The ../data/external/taxi_zones_shape/taxi_zones.shp file contains the geographic boundaries of each zone, as well as the corresponding borough name. In the code, I've used multiprocessing to speed up this mapping process by about 8 to 10 times.
import geopandas as gpd
import pandas as pd
import pyproj
from shapely.geometry import Point

def assign_taxi_zones(df: pd.DataFrame, chunk: int | None = None,
                      lon_var: str = 'pickup_longitude', lat_var: str = 'pickup_latitude',
                      locid_var: str = 'pickup_taxizone_id',
                      taxi_zones_file: str = '../data/external/taxi_zones_shape/taxi_zones.shp',
                      ) -> pd.Series:
    # make a copy since we will modify lats and lons
    localdf = df[[lon_var, lat_var]].copy()
    # missing lat/lon info is indicated by NaN. Fill with zero,
    # which is outside the New York shapefile.
    localdf[lon_var] = localdf[lon_var].fillna(value=0.)
    localdf[lat_var] = localdf[lat_var].fillna(value=0.)
    shape_df = gpd.read_file(taxi_zones_file)
    shape_df.drop(['OBJECTID', "Shape_Area", "Shape_Leng"], axis=1, inplace=True)
    shape_df = shape_df.to_crs(pyproj.CRS('epsg:4326'))
    try:
        print(f"assigning taxi zones to each location: {lon_var}, {lat_var}, chunk = {chunk}")
        local_gdf = gpd.GeoDataFrame(
            localdf, crs = pyproj.CRS('epsg:4326'),
            geometry = [Point(xy) for xy in
                        zip(localdf[lon_var], localdf[lat_var])
                        ]
        )
        # `predicate` replaces the older `op` keyword in geopandas 0.10 and later
        local_gdf = gpd.sjoin(local_gdf, shape_df,
                              how = 'left', predicate = 'within')
        # the zone id of each point, as a Series named after locid_var
        return local_gdf.LocationID.rename(locid_var)
    except ValueError as err:
        # the original error handling is not part of this excerpt; report and re-raise
        print(f"taxi zone assignment failed for chunk {chunk}: {err}")
        raise
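The multiprocessing speed-up mentioned above is not shown in the excerpt, so here is a minimal sketch of the general idea only: split the points into chunks, assign zones to each chunk in a separate worker, and stitch the results back together in order. Everything in it is an assumption made for illustration: the rectangular ZONES table stands in for the shapefile polygons and the spatial join, and the names assign_zone_chunk, split_into_chunks and assign_zones_parallel are hypothetical, not functions from the project.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical rectangular "zones": (zone_id, min_lon, min_lat, max_lon, max_lat).
# They stand in for the real taxi-zone polygons and the spatial join.
ZONES = [
    (1, -74.05, 40.68, -73.90, 40.88),
    (2, -73.90, 40.55, -73.70, 40.80),
]

def assign_zone_chunk(points):
    """Return a zone id (or None) for every (lon, lat) pair in one chunk."""
    result = []
    for lon, lat in points:
        zone_id = None
        for zid, min_lon, min_lat, max_lon, max_lat in ZONES:
            if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
                zone_id = zid
                break
        result.append(zone_id)
    return result

def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def assign_zones_parallel(points, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Run assign_zone_chunk on each chunk in a worker and keep the row order."""
    chunks = split_into_chunks(points, n_workers)
    with executor_cls(max_workers=n_workers) as executor:
        # executor.map yields results in input order, so concatenating
        # the per-chunk lists preserves the original row order
        per_chunk = list(executor.map(assign_zone_chunk, chunks))
    return [zone for chunk in per_chunk for zone in chunk]
```

When running this as a script with process workers, call assign_zones_parallel from under an `if __name__ == "__main__":` guard. A point at (0, 0), like the zero-filled missing coordinates above, falls outside every zone and gets None.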
Two possible backgrounds for the plot
- Google Maps: needs your Google Maps API key
- Bokeh tile providers: easy to use and provide all the required information, as long as you don't care about Google Maps objects. But this option needs us to convert from the lat/long (EPSG:4326) format to a Mercator projection (EPSG:3857). Check out this link for details on the geopandas projections.
I've added code for both. If you decide to use Google Maps as the background, you will need to add your GMaps API key in an apikey.txt file.
Projections for Bokeh
The coordinate reference system (CRS) is important because the geometric shapes in a GeoSeries or GeoDataFrame object are simply a collection of coordinates in an arbitrary space. A CRS tells Python how these coordinates relate to places on the Earth. (source)
The location information we have for the pickup and dropoff points in the original dataset is in lat-long (EPSG:4326) format (degrees lat, degrees lon). But to use OpenStreetMap tiles for Bokeh plots, we need to convert this to a spherical Mercator projection (EPSG:3857). See the example code below.
import pyproj

# Converting NYC city limits to Mercator projection
nyc_long_limits = (-74.257159, -73.699215)
nyc_lat_limits = (40.471021, 40.987326)
# Setting coordinate system
inProj = pyproj.CRS('epsg:4326')
outProj = pyproj.CRS('epsg:3857')
# Despite its name, this transformer goes from lat/lon (EPSG:4326) to Mercator (EPSG:3857).
# always_xy = True means arguments and results are ordered (lon, lat) / (x, y).
transform_to_lat_lon = pyproj.Transformer.from_crs(inProj, outProj, always_xy = True)
nyc_lon1, nyc_lat1 = transform_to_lat_lon.transform(nyc_long_limits[0], nyc_lat_limits[0])
nyc_lon2, nyc_lat2 = transform_to_lat_lon.transform(nyc_long_limits[1], nyc_lat_limits[1])
## The converted coordinates
# >>> nyc_lon1, nyc_lon2
# (-8266269.127635151, -8204159.085663989)
# >>> nyc_lat1, nyc_lat2
# (4934627.716294977, 5010472.431436094)
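To see what the conversion does, the spherical Mercator projection can also be written out by hand. The sketch below is only an illustration and not part of the project code: lon_lat_to_web_mercator is a hypothetical helper name, and it assumes the standard spherical formulas with the WGS84 equatorial radius of 6378137 m. For the NYC corner points it should agree closely with the pyproj values quoted above.

```python
import math

# Radius of the sphere used by spherical (Web) Mercator, EPSG:3857, in metres.
EARTH_RADIUS_M = 6378137.0

def lon_lat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Project a lon/lat pair in degrees to spherical Mercator x/y in metres."""
    # x grows linearly with longitude
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    # y stretches with latitude: R * ln(tan(pi/4 + lat/2))
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# South-west and north-east corners of the NYC limits used above
x1, y1 = lon_lat_to_web_mercator(-74.257159, 40.471021)
x2, y2 = lon_lat_to_web_mercator(-73.699215, 40.987326)
print(x1, y1)
print(x2, y2)
```

In practice pyproj remains the better choice, since it handles any pair of coordinate systems and validates its inputs; the hand-written version is only useful for understanding what the numbers mean.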
I've used the CartoDB Positron tile for the Bokeh plots, but you can use anything you like from this list.
Creating the interactive visualization
- Bokeh itself has a models object which provides all the tools needed to create the plots, add widgets like checkboxes and sliders, add glyph renderers for circles, etc. All these models are collected into a document instance which can be called by the client. This instance is interactive and can react to user inputs (e.g. a slider selecting an integer between 1 and 10). I've implemented this code in this function. You can try running this in a notebook or in a script to get an interactive graph (similar to what we have created using streamlit, but using only bokeh).
- But I decided to use a combination of bokeh and streamlit for easy deployment and sharing. In this case, all the user interactions (like selecting the hour, weekday, zones, etc.) are abstracted away to be handled by streamlit. Bokeh gets the already filtered data and focuses only on the plotting part.
To simply start the streamlit server and visualization, run the following from your terminal:
> make stream_maps
This will run streamlit_maps.py using streamlit, which in turn will run the streamlit_points or streamlit_lines methods in the NYC/streamlit_general_maps.py file, based on the user's selection on the graph.
- When running for the first time, use
READ_SAVED_FILE_DIRECTLY=False
to run all the data fetch and processing functions and save the files.
- The next time, you can run the same file using
READ_SAVED_FILE_DIRECTLY=True
to use the already saved data files.
This makes the above code self-contained, and this is the only command that you need to run to start the streamlit visualization.
In a nutshell, this is what the code does (for the points graph):
# Selecting the pickup or dropoff zones to visualize
pickup_or_dropoff_ed, pickup_or_dropoff = self.add_sidebar_pickup_or_dropoff()
# Slider for selecting the hour of the day
slider_hour = self.add_sidebar_hour()
# Selecting the weekday
checkbox_weekday = self.add_sidebar_weekday()
# Getting a list of the available taxi zones, for the user to select from
zone_name_list = self.add_sidebar_select_zone(pickup_or_dropoff_ed, pickup_or_dropoff)
# Filtering the data based on user selection -- reacting to user interaction
print('Filtering the dataframe based on user selection...')
df_to_plot = self.nyc[(self.nyc[pickup_or_dropoff].isin(zone_name_list))
                      & (self.nyc['pickup_weekday'].isin(checkbox_weekday))
                      & (self.nyc['pickup_hour'].isin(slider_hour))
                      ]
print('Done filtering the data')
print(f"Number of trips selected: {df_to_plot.shape[0]}, out of {self.nyc.shape[0]}")
# Plotting the bokeh plot
print('Plotting the bokeh maps plot now....')
if is_gmaps:
    assert api_key is not None, "in order to use Google Maps, you need to pass in a valid Google Maps API key."
    intrc_trips_loc_p = plot_gmaps_streamlit(
        df_to_plot,
        latitude_column = ['pickup_latitude', 'dropoff_latitude'],
        longitude_column = ['pickup_longitude', 'dropoff_longitude'],
        color_column = 'trip_duration_minutes', size_column = 5.0,
        api_key = api_key, map_type = 'roadmap', map_zoom = 10,
        width = 700, height = 600
    )
else:
    # Using Bokeh tile maps
    intrc_trips_loc_p = plot_cartmaps_streamlit(
        data = df_to_plot,
        latitude_column = ['pickup_MercatorY', 'dropoff_MercatorY'],
        longitude_column = ['pickup_MercatorX', 'dropoff_MercatorX'],
        map_tile_type = CARTODBPOSITRON,
        nyc_long_limits = (-74.257159, -73.677215), #(-74.257159, -73.699215),
        nyc_lat_limits = (40.471021, 40.987326), #(40.471021, 40.987326)
        color_column = 'trip_duration_minutes',
        size_column = 6,
        width = 700, height = 600
    )
return intrc_trips_loc_p
The lines chart works the same way, except that instead of points we plot lines from one point to the other, and, unlike the points graph, we show the pickups and dropoffs on the same plot.
# ....................
# ....................
# Plotting the lines from other zones to the selected taxi zone/s
lines_p_dropoff = plot_src_to_dest_arrows(data = df_to_plot_dropoff,
                    pickup_or_dropoff = 'dropoff_zone_name',
                    custom_title = "Trips ending at the selected taxi zone/s (Blue circles are dropoff points) \n"
                                   f"Number of trips selected: {df_to_plot_dropoff.shape[0]:,}",
                    **common_kwargs
                    )
# Combining the two plots into a single layout
layout = gridplot([lines_p_pickup, lines_p_dropoff], ncols = 2,
                  sizing_mode = "scale_both", merge_tools = False) # using gridplot helps avoid the large padding applied by row()
return layout