1. The document proposes a framework to automatically structure unstructured crowdsourced biodiversity data provided by volunteers on Facebook groups into semantically structured information.
2. It takes two Taiwanese Facebook biodiversity interest groups as examples and uses natural language processing with Taiwanese geographic and species databases to extract species names and locations from discussion threads.
3. The structured data is published and linked as open data using semantic annotation, content management systems, and browser plugins to provide users with digested feedback while allowing volunteers to continue contributing in familiar ways.
Harvesting crowdsourcing biodiversity data from Facebook groups
Jason Guan-Shuo Mai¹, Cheng-Hsin Hsu¹, Dong-Po Deng², De-En Lin³, Hsu-Hong Lin³, Kwang-Tsao Shao¹
¹ Taiwan Biodiversity Information Facility (TaiBIF), Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
² Institute of Information Science, Academia Sinica, Taipei, Taiwan
³ Taiwan Endemic Species Research Institute, Council of Agriculture, Nantou, Taiwan
The emergence of Web 2.0 enables people to contribute their biodiversity observations on the Web. Such crowdsourced biodiversity data are becoming more
valuable to scientific studies because they can potentially cover broader spatial and temporal scales. However, data provided as plain text hinder retrieval
and analysis. In this study, we propose a framework that automatically structures the loosely formatted text, so that volunteers can keep providing data in
the ways familiar to them, while interested citizens, biodiversity researchers and managers benefit from semantically structured information. We take two
Facebook biodiversity interest groups, Reptile-Road-Mortality and Enjoy-Moths, as examples.
0. Crowdsourcing: participants voluntarily provide unstructured data in Facebook interest groups (Reptile-Road-Mortality, Enjoy-Moths).
1. Crawling data from Facebook via its API.
2. Using natural language processing techniques, with the Taiwan Geographic Name and Taiwan Catalogue of Life (TaiCOL) databases as knowledge bases, to extract species vernacular names and place names from a thread. Our algorithm picks the most related species name appearing in a thread based on social networking characteristics.
3. Introducing the content management system Drupal for easier data management (including error correction) and display.
4. Publishing linked open data via D2R server for open access and usage. Our dataset is linked to other datasets on the linked open data cloud, such as DBpedia, GeoNames and LODE (Linked Open Data of Ecology), so it can benefit from the large amount of meta-information they provide.
5. Developing browser plug-ins to give users digested feedback of structured data. One click on a message recognizes species vernacular names and related database information.
6. Improving source data quality without changing users' own familiar ways. A semantic annotation tool disambiguates toponymic homonyms.

Figure: What a typical discussion thread looks like. A thread consists of a post (post message, post picture) followed by comment messages.

Figure: Algorithms used to recognize abbreviations of vernacular names and place names. For each vernacular name in TaiCOL (example: 細紋南蛇), the flowchart first asks whether the full-matched name occurs in the message. If it does not, the flowchart tests fragments of the name: Prefix3 (細紋南), Postfix2 (南蛇), Prefix2 (細紋) and Postfix1 (蛇), checking whether each occurs in the message or in the thread. The outcome is either "Name doesn't exist in the message" or "Matched abbreviation"; for a match, the confidence score of the name is then calculated.
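The abbreviation check in the flowchart can be illustrated with a minimal Python sketch. It is a simplified reading of the figure, not the authors' implementation: the order of the checks (full name, Prefix3, Postfix2, Prefix2, Postfix1), the restriction to a single message rather than the whole thread, and the confidence score (matched length divided by name length) are all assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    name: str       # full vernacular name from the knowledge base
    fragment: str   # the piece of the name actually found in the text
    score: float    # assumed confidence: matched length / name length


def candidate_fragments(name: str) -> List[str]:
    """Full name first, then Prefix3, Postfix2, Prefix2, Postfix1 (assumed order)."""
    ordered = [name, name[:3], name[-2:], name[:2], name[-1:]]
    unique: List[str] = []
    for fragment in ordered:
        # Short names produce duplicate fragments; keep the first occurrence only.
        if fragment and fragment not in unique:
            unique.append(fragment)
    return unique


def match_name(name: str, message: str) -> Optional[Match]:
    """Return the first fragment of `name` found in `message`, or None."""
    for fragment in candidate_fragments(name):
        if fragment in message:
            return Match(name, fragment, len(fragment) / len(name))
    return None


def best_match(names: List[str], message: str) -> Optional[Match]:
    """Pick the highest-scoring name; ties go to the longer matched fragment."""
    matches = [m for m in (match_name(n, message) for n in names) if m]
    return max(matches, key=lambda m: (m.score, len(m.fragment)), default=None)


if __name__ == "__main__":
    names = ["細紋南蛇", "南蛇"]
    print(best_match(names, "今天在路上看到一隻細紋南"))
    print(best_match(names, "路殺的南蛇"))
```

A single-character suffix such as 蛇 appears in many vernacular names, so a fragment match alone is weak evidence. Under the assumed scoring, shorter fragments receive lower scores, which is one simple way to reflect that uncertainty.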