Geospatial partitioning of open transit data

Abstract

One of the guiding principles of open data is that anyone can use the raw data for any purpose, enabling third parties to use this data to fill specific niches. Public transit operators often publish their open data as a single data dump, but developers with limited computational resources may not have the means to process all this data. Existing work has already focused on fragmenting the data by departure time, so that data consumers can be more selective in the data they choose to process. However, each fragment still contains data from the entire operator’s service area. Applications that require data from a specific region waste time and bandwidth downloading, parsing, and ultimately discarding irrelevant data. We build upon this idea by fragmenting geospatially as well as by departure time. Unlike the conventional approach of generating discrete sets of stops, we generate and publish geospatial partitions. Our method is robust to changes in the original data, such as the deletion or the addition of stops, which is crucial in scenarios where data publishers themselves do not control the data. In this paper we explore popular clustering methods such as k-means and METIS, alongside two simple domain-specific methods of our own. We compare the effectiveness of each for the use case of client-side route planning, focusing on the ease of use of the data and the cacheability of the data fragments. Interestingly enough, choosing a clustering algorithm seems to be an exercise in moderation. Creating more clusters cannot reduce query times below what is needed to compute the journey itself, and complex methods such as METIS induce more computational overhead on the client. Our results show that simply clustering stops by their proximity to 8 transport hubs yields the most promising results: queries are 2.4 times faster and download 4 times less data. More than anything though, our results show that the difference between clustering methods is small, and that engineers can safely choose practical and simple solutions. We expect that this insight also holds true for publishing other geospatial data such as road networks, sensor data, or points of interest.

Introduction

People who rely on wheelchair-accessible public transportation have very specific information needs when they are looking to buy a house. Real estate websites can include this information in their item listings, but only if they can find and access relevant datasets. Fortunately, many public transit operators publish their offering as open data, often using de facto standards such as the General Transit Feed Specification (GTFS) or official standards such as Network Timetable Exchange (NeTEx). However, these standards result in large data dumps: the combined GTFS feed of the public transit companies that operate in the Brussels area (SNCB, STIB, De Lijn, and Tec) is already over 1 GB of data. Searching for GTFS memory issues on the Web shows that many people have learned the hard way that this is more data than their personal laptops, Raspberry Pis, or VPSs can handle.

A popular use case for open transit data is route planning. The ideal route depends on many factors such as ticket prices, transfer times, walking distances, reliability, and arrival times. The value ascribed to each of these factors is ultimately subjective, and will likely change over time due to external factors such as the weather. However, contemporary route planning services offer little in terms of personalization because they sacrifice flexibility to provide better query time performance [1, 2, 3]. For example, an algorithm that relies on precomputed shortest paths is ill-suited to generate scenic routes. The route planning can also be done directly on the client however, and this has the benefit that more flexible algorithms become viable because users can only saturate their own CPUs. Client-side applications come with their challenges though, and ingesting the data is particularly difficult in this case. The European Commission reported that in 2019 the average price for 2 GB of mobile data within the EU28 was still €10 [4], which means that that client-side route planners have to be conservative in which data they download.

These examples show that the way data is published can restrict how the data can be used. What may be feasible for a corporation may not be feasible for a regular person, even though the Open Definition defines open data as data that can be freely used, modified, and shared by anyone for any purpose. Our goal is thus clear: we want to improve the way open transit data is published, so that more applications become more viable for more people.

Method

Fig. 1: Visualized on the left are the departure and destination locations, based on one week of query logs from the Flemish public transit operator De Lijn. Visualized on the right are the locations of all connections in their network during the same time period. Note that there are many places with a considerable amount of connections that are in low demand.

The Linked Connections publishing scheme enables applications to ingest data for specific point on time, but each fragment still contains data from the entire transit operator’s service area. Fig. 1 shows that some regions served by the Flemish public transit operator, De Lijn, are more popular than others, implying that it makes sense to fragment by location. Existing work has shown that partitioning public transit networks can improve query times of route planning services, so we investigate if similar improvements can be obtained for the publishing of raw data.

However, first we should consider what is necessary to make publishing fragmented data viable in the real world. We make a distinction between data owners and data publishers, with a clear distinction between their responsibilities. A data owner focuses on maintaining the data quality, while a data publisher focuses on making the data accessible. Both roles come with their own challenges, and as such it is not uncommon for data owners to outsource the data publishing to third parties. This means that data publishers may not have control over the actual data – they have to follow when the data changes. For example, public transit operators routinely add and remove temporary stops due to maintenance works, and these changes have to be reflected in the published data with as little friction as possible.

Data

Guided by the insights provided by Fig. 1, we will focus on the Flemish public transit network for the remainder of this paper. To provide some context: Flanders is a small region within Europe, but with 487 inhabitants/km² in 2019, it is also one of the most densely populated [17]. The public transit network is equally dense; at the time of writing there are 35,791 stops spread out over 13,522 km² for a density of 2.6 stops/km². There are roughly 1 million connections on a regular weekday, and the corresponding Linked Connections data results in over 10 million RDF triples per day. We use data from the first whole week of December as the input data for the methods discussed in this section.

Clustering

Existing work has focused on clustering stops, or trips, into discrete sets of objects. This means that every time a new stop gets added by the data owner, the data publisher must explicitly assign this stop to a cluster – or risk publishing incomplete data. For clustering algorithms such as k-means this is relatively easy, but for algorithms such as METIS this involves recomputing the entire clustering. Instead of searching for an algorithm that supports updates, we propose to publish the clusters differently. Instead of creating discrete sets of stops, we partition the physical world. The resulting partitions are published as separate resources, allowing any agent to infer to which cluster every stop belongs. In other words, data publishers do not have to explicitly label every stop themselves – the data speaks for itself.

We start by adapting two clustering methods that are often used to partition transit networks: k-Means and METIS. However, both methods disregard one important feature of transit networks; k-means does not consider network connectivity and METIS does not consider physical locations. This leads us to propose an alternative method, called Hub, which clusters stops by their proximity to important transporation hub. As others have shown good results from hierarchical methods, we also consider a merge-based adaption of Hub, appropriately named Merged. The remainder of this subsection discusses how each method is used to generate a geospatial partitioning.


Fig. 2: The 8 partitions each evaluated method creates. Note that the two methods on the top row create regions of roughly equal sizes, while the approaches at the bottom create regions of varying sizes. The approaches in the left column create regions with simple shapes, while the ones on the right create irregular shapes.

k-means

Although k-means is a simple algorithm, existing work has found it to be competitive with more complex methods [12], so we consider it among the state of the art for this particular use-case. As the name implies, this clustering distributes a given set of points in exactly clusters, where every point belongs to cluster with the nearest cluster mean. Iterative heuristics exist to compute this clustering, we used the implementation from scikit.learn 0.20.3 with default parameters, and using the stops’ WGS84 coordinates as input. To obtain a spatial partitioning from this, we create a Voronoi diagram using the cluster means as seed points. Because the Voronoi cells of two adjacent points on the convex hull share an infinitely long edge, we add some extra padding points that represent the bounding box of the operator’s service area – and then discard all infinite edges.

METIS

METIS is another algorithm that is used to partition public transit networks [11, 12], so we consider it to be among the state of the art as well. Since it is a graph clustering algorithm, we must represent the public transit network as a graph. We follow the conventional approach of creating a vertex for every stop, and connecting them with an edge if they are connected through a single connection. Every edge is assigned a weight that corresponds to how many connections connect those stops. We used a Python wrapper of the reference implementation to compute the clustering, using the contig option to force contiguous partitions.

The METIS algorith only sees the network as a connectivity graph though – it does not know anything about the physical location of the stops. This means that even though it creates contiguous partitions, those partitions are not contiguous in the physical world. We obtain a clean spatial partitioning using an additional post-processing step that 1) creates the Voronoi diagram of all stops, 2) merges all Voronoi cells that belong to the same cluster, and 3) merge isolated rings into the encompassing cluster.

Hub

Hub is the first of our own methods that aims to incorporate both the geospatial and the graph-like nature of public transit networks. It iteratively selects the stops based on which trips pass through it. In the first iteration it selects the stop with the most unique trips, in the subsequent iterations it selects the stop with the most unique trips that the previous stop(s) do not have. After iterations it contains the $k$ most important hubs, which lead us to name this method Hub. These selected stops are then used as seed points to create a Voronoi diagram. To illustrate the simplicity of this approach, Fig. 3 contains all the necessary code to implement this, up until the creation of the Voronoi diagram.

def hub(k):
    done_trips = set()
    selected_stops = []
    for _ in range(k):
        best_stop = None
        best_stop_score = 0
        for stop, trips in stop_to_trips.items():
            stop_score = len(set(trips) - set(done_trips))
            if stop_score > best_stop_score:
                best_stop = stop
                best_stop_score = stop_score
        selected_stops.append(best_stop)
        done_trips.update(stop_to_trips[best_stop])
    return selected_stops
Fig. 3: The Hub method can be implemented in just 14 lines of Python code.

Merged

Instead of stopping the Hub algorithm after iterations we can also let it terminate, and then use the Jaccard similarity coefficient to merge the two most similar adjacent Voronoi regions until only remain. As there is a finite amount of trips, this algorithm has a clear termination condition: it stops when all trips are covered by one of the selected stops. This makes the process more complex, but existing work has shown good results using hierarchical clustering techniques [12].

Hypermedia Controls

As discussed at the beginning of Subsection 3.2, we want our published data to be easy to maintain. Our idea was to publish the partitioning itself, so that clients have all the information they need to decide to which cluster every stop belongs. We have already discussed how to obtain the partitionings, now we discuss how to publish them.

The partitions are published on the Web as stand-alone resources using the Hydra and GeoSPARQL vocabularies. The Hydra is used to describe a partitioning as a collection of regions, and the wktLiteral datatype from the GeoSPARQL vocabulary is used to describe individual regions. GeoJSON is another common way to define geometries, but since GeoJSON polygons are incompatible with JSON-LD we chose to use the simpler string representation: WKT. Fig. 4 contains a JSON-LD snippet of a single partition resource.

{
  "@id": "https:/​/​example.org/shapes/hub_4",
  "hydra:member": [
    {
      "@id": "https:/​/​example.org/shapes/hub_4/1",
      "geo:asWKT": "POLYGON ((4.170761972221639 50.70794391488566, 5.611060049825466 51.96704409851701, 4.233768984483664 54.50221456451739, 3.77479828593018 53.53143994233906, 3.799815048180065 52.18573049093929, 4.170761972221639 50.70794391488566))"
    }, ...
  ], ...
}
Fig. 4: JSON-LD representation of a partitioning. Note that both the partitioning and the individual regions are separate resources, allowing other datasets to refer to them.

These partition resources are then used to fragment Linked Connections data. This two-step approach allows for reusing existing partitions, such as administrative regions. A modified Linked Connections server can ingest a given partitioning, and fragment the data accordingly. The server creates one hydra:PartialCollectionView instance per region, and then creates an index of all generated views using the tree ontology. This ontology is used to link every view to the geospatial area it covers. Fig. 5 contains a JSON-LD snippet of such an index.

{
  "@id": "http:/​/​example.org/connections",
  "@type": "tree:Node",
  "tree:relation": [
    {
      "@type": "tree:GeospatiallyContainsRelation",
      "shacl:path": "geo:contains",
      "tree:node": "http:/​/​example.org/connections?cluster=https%3A//example.org/shapes/hub_4/1",
      "tree:value": {
        "@id": "https:/​/​example.org/shapes/hub_4/1",
        "geo:asWKT": "POLYGON ((4.170761972221639 50.70794391488566, 5.611060049825466 51.96704409851701, 4.233768984483664 54.50221456451739, 3.77479828593018 53.53143994233906, 3.799815048180065 52.18573049093929, 4.170761972221639 50.70794391488566))",
        "@type": "geo:Geometry"
      }
    }, ...
  ], ...
}
Fig. 5: JSON-LD representation of a view index. The tree:node property points to a hydra:PartialCollectionView, which is a connections page from the original Linked Connections specification. The tree:value property defines which geospatial area that view covers.

Evaluation

In our introduction we declared our intent to make more applications viable by improving the way we publish data. We gave client-side route planning as an example of a use case that needs to be conservative in the amounts of data they download, so we focus on this application to evaluate our data.

We have adapted an existing library for client-side route planning that uses Linked Connection data, so that it can interpret our hypermedia controls. This library uses the earliest arrival time variant of the Connection Scan Algorithm. This algorithm, similar to Dijkstra’s algorithm, builds a list of which stops are reachable and how long it takes to reach them. A client that knows the location of each stop can also infer which clusters are reachable, so our adapted route planner simply fetches data for all reachable clusters – slowly growing its list of data sources. We focus on the use-case of client-side route planning because this a relatively demanding application.

As mentioned in Section 3, we use 1 week of Linked Connection as input for the clustering algorithms. We then used each method to create 4, 8, 16, and 32 clusters. A redis-backed server creates an ordered list of all connections within every generated region, and exposes these using the hypermedia controls defined in the Subsection 3.3. The same server also hosts a version of the data with one cluster that contains all the data, i.e. without any geospatial partitioning. Altogether we test 17 (4 partitionings for each of the 4 methods, and the baseline) different partitionings, and each data fragment contains 20 minutes of data.

We make extensive use of letter-value plots [18] because our results have a long tail, which causes visualizations such as box plots to label many results as outliers. These plots show the median value as a black line, and then show the 75%, 87.5%, … quantiles as separate boxes, making it easy to compare these statistics.

Ease of use

As a proxy for how easy the geospatially fragmented data is to use, we measure how many resources a client needed solve the same query. Specifically, the time it takes for the same client to solve the query with a given partitioning, as well as how much data was downloaded. We compare those values to those of the baseline; the unpartitioned data.

5,000 queries were randomly selected from a query log that was given to us by the transit the operator itself. All these queries were received on the same day, but throughout the day. We eliminate as many variables as possible to isolate the impact of the partitioning; the client and server run on two separate machines on the same local network, a constant 20ms of latency is added per response, and the client only processes one query at a time.

Fig. 6: The median query time with just 4 clusters is already 58% that of the original query times, and using 8 clusters further improves this to 45%. Note the diminishing returns as more clusters are added, using 16 and 32 clusters lower the relative query times to 41% and 42%. The Hub and k-means methods yield very similar results, while METIS performs significantly worse.
Fig. 7: Using just 4 clusters is enough to reduce the amount of downloaded data to 45% of the original amount of downloaded data, and adding more clusters consistently improves this metric. Although all methods seem competitive in this metric, the Hub method has a consistently low median and 75% percentile.

Fig. 6 shows that having just a few clusters already significantly improves the query performance, but that adding more clusters has diminishing returns. Even without the overhead of ingesting unnecessary data the client still has to compute the route. The METIS results are somewhat surprising; they are slightly worse across the board, and even become worse when going from 16 to 32 clusters. As Fig. 2 shows, the clusters from METIS are more complex than those from other methods, which creates a heavier workload on the client. Fig. 7 shows that the amount of downloaded data does keep decreasing by adding more clusters, and theoretically we can avoid all unnecessary data by creating a cluster per stop.

Cacheability

Another important feature of Linked Connections is the cache effectiveness of the data fragments, which gives a Linked Connections server its scalability. As we are partitioning Linked Connections data, and thus making it more fine-grained, we have to measure the impact this has on the cache effectiveness. Unfortunately, we do not have access to any form of user ID, which makes it hard to benchmark a real-world scenario where there are client-side and server-side caches. Instead, we measure how fast a cache warms up in every configuration, and what the hit rate of a warm cache is. These two metrics give an indication of how cacheable the partitioned data is, and how this compares to the cacheability of the original data.

While running the benchmarks for the usability metrics, we also record which resources are fetched. We then replay these requests, running them through a simulated LRU cache to measure the hit rates. To measure the hit rates on a warm cache we first run all requests through a cache, and then create 1,000 samples of 500 requests to measure the overall hit rate of each sample. The hit rates on a cold cache are obtained by doing the same starting from a cold cache, and by varying the amount of requests per sample. We set the cache size to 20 MB, and each partitioning results in roughly 70 MB of gzipped data, so we expect many cache evictions.

Fig. 8: Less valuable cache space is wasted on irrelevant data by using a fine-grained partitioning. The median hit rate on a warm cache using the unpartitioned data is 26%, the highest hit rate, 44%, is obtained using the Hub method with 32 clusters. The k-means method scores noticeable worse than the other methods.
Fig. 9: Line plots of the median cache hit rates per configuration, showing that caches take longer to warm up with a fine-grained partitioning. However, any method with 4 or 8 clusters matches the hit rate of the original data on a warm cache (26%) after 350 requests.

Fig. 8 and Fig. 9 show that partitioned data can improve the cache hit rate, but that caches take longer to warm up. The cache effectiveness when using 8 clusters surpasses that of the baseline at around 350 requests, regardless of the clustering method. The average query downloads 9 resources in this case, so that the cache effectiveness is better than the baseline’s if the data is used to answer more than 39 queries per day.

Discussion

Our findings show that results from the field of route planning do not easily translate to publishing data on the Web, because, as stated in Section 3, we want the processed data to stay in sync with the raw data. We move some of the clustering logic to the client, which in return can then avoid downloading and parsing a lot of irrelevant data.

The number of clusters has a noticeable impact on all evaluated metrics. Even a small amount of clusters can make a client-side route planner twice as fast. More clusters do not necessarily lead to better results, as we quickly see diminishing returns in terms of query times. The amount of downloaded data does keep decreasing, but at the cost of cacheability. Interestingly, even when starting from a cold cache the cacheability of a small amount of clusters is on par with the cacheability of the original data.

METIS and k-Means yield good results in the amount of downloaded data metric, but both struggle in other tests. Clusters from METIS have a complex shape because it does not consider the stops’ locations, making it harder for clients to interpret them. As a result, the query times using METIS data are consistently worse than those using other methods. A similar pattern presents itself for the merge-based method, which is also noticeably worse in the query time metric – more so than in the downloaded data metric. The k-Means method on the other hand shows great results in both the query time and downloaded data metrics, but the resulting data fragments are harder to cache. The Hub method is the only method that performs well across all metrics. This method combines the geospatial and the graph-like features of public transit networks. The merge-based approach does this all well, but is burdened by more complex cluster shapes.

More than anything though, our results show that the difference between clustering methods is small, and that engineers can safely choose practical and simple solutions. Partitioning and publishing public transit schedules seems to be an exercise in moderation: do not use too many clusters, and stick to simple shapes.

Conclusion

In this paper we investigated what data publishers can do to make their open transit data easier to use. Based on research from the field of route planning, we explored the idea of geospatially partitioning public transit networks. We evaluated 4 different clustering methods for the use-case of client-side route planning: k-means, METIS, and two domain-specific methods of our own. The partitions were obtained using Voronoi diagrams, and were then published with the appropriate hypermedia controls that clients can use to discover clusters of public transit stops.

Despite being very popular in the field of route planning, METIS has subpar results compared to the other methods because the complex nature of METIS clusters makes them hard to interpret for clients. The other common method, k-eans, does lead to good query times, but at the cost of cacheability – which is an important feature for publishing open data in a scalable fashion. The most promising method, both in terms of user convenience and cacheability, is to just create a Voronoi diagram around the most important transportation hubs. Although simple, this method results in simple clusters, while still incorporating both the geospatial and graph-like nature of transit data. The same less-is-more pattern repeats itself when looking at the role of the number of clusters; adding more clusters has diminishing returns on the query time, while making the data harder to cache.

Our goal was to make open transit data more useful, so that more people can use it in more applications. We focused on the use case of client-side route planning, which have to be conservative in which data they download as mobile data is still expensive. And in that regard, we succeeded. A simple clustering algorithm and 8 clusters is all it takes to download 4 times less data, and to answer queries 2.4 times faster. Preliminary results show that the cacheability, and thus the scalability, of this approach is on par with the existing Linked Connections publishing scheme.

More than anything though, we have found that the difference between clustering methods is small, and that engineers can safely go for simple solutions. Future work can investigate if this insight translates to the publishing of other geospatial data such as road networks, sensor data, or points of interest. Additionally, the scalability should be tested in the real world, comparing it to both route planning services and existing Linked Connections servers.