Development Gateway’s Geocoding Suite has several components, each working in tandem with aid and management information systems to assign precise geospatial data on the locations of development projects. We recently announced the addition of a lightweight, user-friendly automatic geocoding backend tool, aptly called the AutoGeocoder. If you read our last blog post, you’re familiar with some of its highlights, functions, and features. In today’s post, we’ll dive deeper into the tool’s inner workings, as well as into recent changes we’ve made to the open source Open Aid Geocoder tool.
Below, we begin with the AutoGeocoder:
As mentioned in our last post, the AutoGeocoder tool reads through text provided in various document formats (PDF, DOC, TXT) to identify activity locations, and then produces a final list of georeferenced location names. The tool has been fully developed in Python 3, and combines well-known tools and libraries such as NLTK and scikit-learn.
In short, the AutoGeocoder is able to read a text-based document, split it into sentences, classify those sentences using a text classifier, filter classified sentences to include only project implementation-related text, generate a list of named entities from the selected sentences by querying the Stanford NER Server, and finally query the GeoNames API to retrieve the final project location information.
It is composed of three key elements:
- A supervised machine learning-based text classifier that detects which sections of the documents refer specifically to project implementation details;
- A Named Entity Recognizer (NER), currently provided by Stanford NER;
- A Gazetteer Service, allowing the tool to query geographic information, provided by GeoNames.
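The flow described above (split, classify, filter, extract entities, look up in a gazetteer) can be sketched in a few lines of Python. This is an illustration only, not the AutoGeocoder's actual code: every function name here is hypothetical, the regex splitter stands in for a real sentence tokenizer such as NLTK's, and the classifier, NER, and gazetteer are toy stand-ins passed in as plain functions so the sketch runs offline. The one real-world detail is `geonames_search_url`, which builds a request for the GeoNames `searchJSON` web service and assumes you have a GeoNames account username.

```python
import re
from urllib.parse import urlencode


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (stand-in for a real tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def geonames_search_url(name, username, max_rows=1):
    """Build a GeoNames searchJSON request URL (username is your GeoNames account)."""
    query = urlencode({"q": name, "maxRows": max_rows, "username": username})
    return f"http://api.geonames.org/searchJSON?{query}"


def autogeocode(text, is_implementation, extract_entities, lookup):
    """Return located names found in implementation-related sentences.

    is_implementation(sentence) -> bool      (the text classifier)
    extract_entities(sentence)  -> list[str] (the NER step)
    lookup(name)                -> dict or None (the gazetteer step)
    """
    locations = []
    seen = set()
    for sentence in split_sentences(text):
        if not is_implementation(sentence):
            continue  # drop sentences unrelated to project implementation
        for name in extract_entities(sentence):
            if name in seen:
                continue  # query the gazetteer once per distinct name
            seen.add(name)
            match = lookup(name)
            if match is not None:
                # keep the source sentence so each location can be traced back
                locations.append({"name": name, "source": sentence, **match})
    return locations


# Toy stand-ins (assumptions for this sketch, not the real services).
def toy_classifier(sentence):
    return "implemented" in sentence.lower()


def toy_ner(sentence):
    return re.findall(r"\b(?:Kigali|Geneva|Nairobi)\b", sentence)


TOY_GAZETTEER = {"Kigali": {"countryCode": "RW"}, "Nairobi": {"countryCode": "KE"}}


def toy_lookup(name):
    return TOY_GAZETTEER.get(name)


SAMPLE_TEXT = (
    "The donor is headquartered in Geneva. "
    "The project will be implemented in Kigali and Nairobi."
)
locations = autogeocode(SAMPLE_TEXT, toy_classifier, toy_ner, toy_lookup)
```

In this toy run, Geneva never reaches the gazetteer because its sentence is filtered out by the classifier step, which is the behavior the classifier exists to provide.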
The AutoGeocoder can be configured to run in three different modes:
- Firstly, it can be run through the command line interface – which is the tool’s default mode. This mode is useful in extracting project locations from single text documents or from IATI XML files.
- With a bit more configuration, and setup of a PostgreSQL database, the user can interact with the tool through the Micro Web user interface. This allows users to upload files, autogeocode them, download the results, and track all sentences that were used as location sources.
- Finally, plugging the AutoGeocoder into the current Open Aid Geocoder database allows the user to send previously-imported activity records to the AutoGeocoder process queue, see which projects have already been autogeocoded, review locations on the map, and manually edit information.
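To give a feel for what reading an IATI XML file involves in the command line mode, here is a minimal sketch that pulls the free text out of each activity so it could be passed on for geocoding. It assumes the IATI 2.0x layout, where titles and descriptions hold their text in `narrative` child elements. The function name `activity_texts` and the sample record are made up for illustration, and this is not necessarily how the AutoGeocoder itself parses IATI data.

```python
import xml.etree.ElementTree as ET

# Illustrative record only; the identifier and text are invented.
SAMPLE_IATI = """<iati-activities version="2.02">
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-001</iati-identifier>
    <title><narrative>Rural water supply</narrative></title>
    <description><narrative>Wells will be built in Kigali.</narrative></description>
  </iati-activity>
</iati-activities>"""


def activity_texts(xml_string):
    """Map each activity identifier to its title and description narratives."""
    root = ET.fromstring(xml_string)
    texts = {}
    for activity in root.iter("iati-activity"):
        identifier = (activity.findtext("iati-identifier") or "").strip()
        narratives = [
            node.text.strip()
            for path in ("title/narrative", "description/narrative")
            for node in activity.findall(path)
            if node.text and node.text.strip()
        ]
        # one narrative per line, title first, then descriptions
        texts[identifier] = "\n".join(narratives)
    return texts


texts = activity_texts(SAMPLE_IATI)
```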
Figure 1: the AutoGeocoder Command Line Interface – Geocoding a PDF file
Figure 2: AutoGeocoder Full Process Workflow
As mentioned, a key piece of the AutoGeocoder is the pre-trained text classifier. Using machine learning, it tells the tool whether a found location is actually the project implementation location or is an irrelevant location (such as the address of the donor’s headquarters). The default classifier has been trained with a small dataset, so it is recommended that users train their own text classifiers to achieve enhanced precision.
The AutoGeocoder provides a set of features useful for this task:
- An IATI data downloader;
- A text sample generator method, which can be called from the command line tool;
- An interface to manually classify text samples;
- A classifier trainer method – which generates a ready-to-use classifier that can be called from the command line tool.
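To make the training step concrete, here is a rough sketch of how manually classified sentences can be turned into a classifier. The AutoGeocoder builds on scikit-learn, and this post does not say which algorithm it uses; the code below swaps in a small hand-written multinomial naive Bayes with add-one smoothing, purely so the example has no dependencies. The labels, sample sentences, and function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def train(samples):
    """Count words per label from (sentence, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for sentence, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return {
        "word_counts": word_counts,
        "label_counts": label_counts,
        "vocabulary": vocabulary,
    }


def predict(model, sentence):
    """Return the label with the highest naive Bayes log-probability."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, count in model["label_counts"].items():
        counts = model["word_counts"][label]
        denominator = sum(counts.values()) + vocab_size  # add-one smoothing
        score = math.log(count / total)  # log prior
        for word in tokenize(sentence):
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hand-labelled toy samples (invented for this sketch).
SAMPLES = [
    ("The project will be implemented in three northern districts.", "implementation"),
    ("Activities take place in rural villages near the river.", "implementation"),
    ("The donor is headquartered in Geneva.", "other"),
    ("The agency's head office is located in Washington.", "other"),
]

model = train(SAMPLES)
label = predict(model, "Wells will be implemented in rural districts.")
```

With only four samples the model is a toy, but it shows why a larger, domain-specific training set matters: the classifier can only weigh words it has seen under each label.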
Open Aid Geocoder
Another recently-updated Geocoding Suite feature is the Open Aid Geocoder. Originally, this tool was designed to be plugged into an existing IMS, but to keep pace with user demand, we have created an API and a simple database for the interface, allowing it to run as a standalone service. As mentioned in our last post, it now also uses the AutoGeocoder and allows users to manually search for and add location information.
The Open Aid Geocoder is now composed of two main modules, the Geocoder API and the Geocoder UI:
The Geocoder API is a RESTful API, developed entirely in Java and based on the Spring framework, that provides the UI with backend support. It exposes a group of JSON HTTP endpoints, which allow the UI to import new project records, read the current list of projects, read full project information by providing a project identifier, and save edited records. A PostgreSQL database with the PostGIS extension stores geographic project information as well as Global Administrative Boundaries (GADM) records. These records allow the UI to render the administrative boundary layer of the project’s recipient country over the map, and then query administrative names while manually geocoding. We’ve developed the API to provide a default backend to the Open Aid Geocoder, allowing seamless usage of the tool as a standalone web service. For a better look at the Geocoder API, check out the GitHub repository here.
Originally released last year, the Geocoder UI has undergone a major refactoring to adapt to the recently-developed Geocoder API. We’ve also added new capabilities such as multilingual data entry support, integration with the AutoGeocoder, import and export of IATI 2.01 and 2.02 XML files, and a freshly redesigned interface. For a better look at the Geocoder UI, see the corresponding GitHub repository here, and take a look at the User Guide here.
Figure 3: Geocoder Suite Stack Diagram
Figure 4: Elements of the AutoGeocoder
The Development Gateway Geocoder Suite project team includes: Mauricio Bertoli, Anush Martirosyan, Ionut Dobre, Taryn Davis, Sebastian Dimunzio, Galina Kalvatcheva, and Llanco Talamantes.