Supporting the efficient discovery and use of Web APIs is increasingly important as their use and popularity grows. Yet, a simple task like finding potentially interesting APIs and their related documentation turns out to be hard and time-consuming, even when using the best resources currently available on the Web. We describe our research towards an automated Web API documentation crawler and search engine. We have devised and exploited crowdsourcing techniques to generate a curated dataset of Web API documentation. Thanks to this dataset, we have built an engine able to automatically detect documentation pages. Our preliminary experiments show an accuracy of 80% and a precision increase of 15 points over the keyword-based heuristic we used as a baseline.
Harnessing the Crowds for Automating the Identification of Web APIs
1. Harnessing the Crowds for Automating the Identification of Web APIs
Carlos Pedrinaci, Chenghua Lin, Dong Liu, John Domingue
KMi, The Open University
2. Web APIs are the new Web services
• Publicly offering valuable data and functionality
• Widely used and reused
• Although their use is hardly automated
3. Web APIs and RESTful Services
• Services based on a simple(r) stack of technologies than WS-*
• Roughly URL + HTTP + XML/JSON
• Easy way to provide a programmatic interface to existing Web sites
• Seldom adopt REST principles
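The "URL + HTTP + XML/JSON" point above can be sketched in a few lines. This is a minimal illustration, not an API from the talk: the endpoint, parameters, and response payload below are invented for the example.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters -- invented for illustration only.
BASE = "https://api.example.com/v1/geocode"
params = {"q": "Milton Keynes", "format": "json"}

# A Web API call is typically just an HTTP GET against a parameterised URL...
request_url = f"{BASE}?{urlencode(params)}"
print(request_url)

# ...and the response is a JSON (or XML) document the client parses.
sample_response = '{"place": "Milton Keynes", "lat": 52.04, "lng": -0.76}'
data = json.loads(sample_response)
print(data["place"], data["lat"])
```

This simplicity is exactly why there is no standard machine-readable interface description to discover: the "contract" usually lives only in a human-readable documentation page.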
12. Issues for Discovering Web APIs
• There is no simple way to effectively and uniquely identify Web APIs
• No standardised document describing the interface
• URLs are hardly usable for this end
14. Hypothesis
• Every Web API provides one or several public documentation pages
• These pages provide the most relevant information for developers
‣ Web API location can therefore be approached as a documentation discovery problem
15. Web API Identification
• Given a Web page, determine whether it documents an API or not
• Sometimes a hard problem even for humans
16. Collecting Documentation Pages
• Harnessing the crowds for detecting documentation pages
17. Generating a Curated Dataset
• Often the links are obsolete or point to general pages
18. Dataset Generated
• We used API Validator to process 1,872 APIs from ProgrammableWeb
• 43% of the URLs we started with (data from 2010)
• 624 marked as a documentation page
• 929 marked as not a documentation page
• 318 skipped (server down or unclear)
19. Web API Identification Engine
• Web API identification as a binary classification problem
• Extract core features from Web pages
• Use machine learning algorithms to provide an identification engine
20. Preliminary Experiment
• Initially used only Web page words as features
• Trained two classifiers: Naive Bayes (NB) and SVM
• Used a simple keyword-based heuristic as a baseline for comparison (the occurrence of 3 or more keywords)
• api, input, output, GET, PUT, etc.
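The baseline above can be sketched as follows. Note the full keyword list is not given in the slides, so the set below extends the listed examples (api, input, output, GET, PUT) with assumed HTTP-related cue words.

```python
# Assumed keyword set: the slides list only "api, input, output, GET, PUT, etc."
KEYWORDS = {"api", "input", "output", "get", "put", "post", "delete",
            "request", "response", "endpoint"}

def keyword_baseline(page_text, threshold=3):
    """Label a page as API documentation if it contains `threshold`
    or more distinct cue words (the heuristic used as baseline)."""
    words = set(page_text.lower().split())
    return len(words & KEYWORDS) >= threshold

print(keyword_baseline("Send a GET request to the /users endpoint of our API"))
print(keyword_baseline("Welcome to our homepage, read about our team"))
```

Such a heuristic is cheap but brittle: a blog post that merely discusses APIs matches, while tersely written reference pages can miss the threshold, which is consistent with the baseline's lower precision in the next slide.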
21. Evaluation Results

Model     Precision  Recall  F1    Accuracy
Keyword   60.3       75.7    67.0  70.2
NB        71.0       79.2    74.8  78.6
SVM       75.4       70.8    73.1  79.0
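As a reminder of how the four columns in the table relate, the metrics can be computed from a confusion matrix. The counts below are invented for illustration; they are not the experiment's actual confusion matrix.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Made-up counts, for illustration only
p, r, f1, acc = metrics(tp=75, fp=25, fn=30, tn=70)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} Acc={acc:.3f}")
```

Precision matters most for a crawler that auto-indexes documentation pages, since false positives pollute the search engine's index; this is why the 15-point precision gain over the baseline is the headline result.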
22. Evaluation Results
• Although preliminary, the approach already provides promising results
• Both NB and SVM provide good accuracy (about 80%)
• Best precision (75.4%) is achieved by the SVM, 15 points better than the baseline
23. Conclusions and Future Work
• Discovering Web APIs is becoming increasingly important and existing support is not optimal
• Web API identification is a first step that can well be approached as a documentation identification problem
• Crowd input (ProgrammableWeb and API Validator) has been essential
24. Conclusions and Future Work
• Further features are being included to improve the results
• Title, URL, presence of camelCase words
• Current tests have reached an accuracy of 82% using SGD
25. Conclusions and Future Work
• A larger training set is necessary
• Need more validated pages (help!)
• http://iserve-dev.kmi.open.ac.uk/validator/
• A larger experiment will be carried out over a normal Web crawl
A Web API - the offered functionality and data
(example of an invocation and the XML obtained)
Google?
ProgrammableWeb

Issues:
- requires manual registration
- gets out of date (example)
- discovery remains at a very coarse grain (manual categorisation, or some keywords)
- no notion of the operations and resources provided, etc.