At Fields Data, we collect data on the presence of organizations in various countries. This data includes the types of activities each organization performs, the IATI (International Aid Transparency Initiative) sectors its work falls under, and its location at the provincial level. For this report, we used hierarchical clustering to group organizations based on similarities in these attributes. Because our data contains many organizations with similar names, such as Caritas Burundi and Caritas Uganda, we chose hierarchical clustering over K-means clustering so that name variations could be nested under the stem name. Hierarchical clustering also allowed us to create nodes based on sector and type of activity. For this first iteration, we focused only on the region of Ruyigi.
We approached this problem by first standardizing the names of organizations and sectors and dropping any null values. Next, we concatenated the data for each organization into a single entry in a list. We then tokenized each entry and vectorized it with TF-IDF. Using the TF-IDF vectors, we calculated the distance between each pair of entries (organization characteristics), which resulted in the visualization below.
Figure 1. Visualization of hierarchical clustering result
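The steps described above (concatenate, tokenize, vectorize with TF-IDF, compute distances, cluster) can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from our repo: the sample records are invented, and the smoothed IDF formula, the cosine distance and the average-linkage merge rule are all assumptions, since the report does not fix those choices.

```python
import math
import re
from collections import Counter

# Hypothetical rows (organization, sector, activity type); not real Fields Data records.
records = [
    ("Caritas Burundi", "Health", "Service delivery"),
    ("Caritas Uganda", "Health", "Service delivery"),
    ("Water Trust", "Water supply", "Infrastructure"),
    ("Water Aid Group", "Water supply", "Training"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {token: weight} vector per tokenized document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # term frequency times a smoothed inverse document frequency (assumed variant)
            tok: (c / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def agglomerate(vectors):
    """Average-linkage agglomerative clustering; returns (members, distance) per merge."""
    clusters = {i: [i] for i in range(len(vectors))}
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for x, a in enumerate(ids):
            for b in ids[x + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = sorted(clusters.pop(a) + clusters.pop(b))
        merges.append((merged, d))
        clusters[next_id] = merged
        next_id += 1
    return merges

# One concatenated entry per organization, then vectorize and cluster.
entries = [" ".join(r) for r in records]
vectors = tfidf_vectors([tokenize(e) for e in entries])
merges = agglomerate(vectors)
for members, dist in merges:
    print([records[i][0] for i in members], round(dist, 3))
```

On this toy input the two Caritas entries merge first, because they share the name stem, sector and activity type, and the two water organizations merge next. The two groups share no tokens, so they only join at the final step. The sequence of merges is what a dendrogram such as Figure 1 draws.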
The challenges we currently face are finding a vectorization method that can capture the nuances of international development data, and finding a way to handle organizations that run several programs with differing characteristics. In the next iterations, we plan to test different combinations of tokenizing, stemming and vectorizing, as well as different forms of vectorization. We will also try to improve results by concatenating all the programs of one organization into a single entry. It will be interesting to see whether this concatenation loses insights or, on the contrary, surfaces them more clearly.
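As a rough idea of what concatenating the programs of one organization could look like, the sketch below groups program rows by organization name and joins their sectors and activity types into one text entry. The rows and the function name `concatenate_programs` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical program rows (organization, sector, activity type); not real data.
programs = [
    ("Caritas Burundi", "Health", "Service delivery"),
    ("Caritas Burundi", "Education", "Training"),
    ("Water Trust", "Water supply", "Infrastructure"),
]

def concatenate_programs(rows):
    """Merge every program of an organization into a single text entry."""
    grouped = defaultdict(list)
    for org, sector, activity in rows:
        grouped[org.strip()].extend([sector, activity])
    # Prefix each entry with the organization name so the name stem still counts.
    return {org: " ".join([org] + parts) for org, parts in grouped.items()}

org_docs = concatenate_programs(programs)
for org, doc in org_docs.items():
    print(org, "->", doc)
```

Each organization then contributes exactly one entry to the vectorizer. The trade-off is that an organization with very different programs gets a blended profile, which is the potential loss of insight mentioned above.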
Your organization can use hierarchical clustering to categorize data that has a natural hierarchy, such as location data spanning continent, country, region and city. It is also useful when you wish to identify large groups based on similarities, such as humanitarian health groups versus developmental health groups.
The code for this report can be found in our GitHub repo at @fieldsdata.