Citoid/html-metadata: Scrape metadata from html in your choice of 4 formats
completed by: m4tx
mentors: Andre Klapper, Mvolz
Citoid is a Node.js application (written in JavaScript) that retrieves information about a webpage, book, journal article, etc., given a URL to the webpage or some other identifier, such as a DOI (digital object identifier). Installation instructions and more information are available at https://www.mediawiki.org/wiki/Citoid; however, for the purposes of this project you don't need to install or use Citoid.
We get most of our metadata from another open source project, Zotero's translation-server. However, we also have a native webscraper in citoid, lib/scrape.js, which currently has very limited functionality.
To add more functionality to scrape.js (which currently just gets the contents of <title></title> and a few other properties), we'd like to take advantage of several other metadata standards that exist. These are:
OpenGraph (already supported; don't pick this one!): https://phabricator.wikimedia.org/T1069
HighWire: https://phabricator.wikimedia.org/T76225
Embedded RDF: https://phabricator.wikimedia.org/T7622
COinS: https://phabricator.wikimedia.org/T76223
Dublin Core: https://phabricator.wikimedia.org/T76224
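To give a feel for what two of these formats look like in a page, here is a minimal sketch, not the library's actual code. It assumes the common conventions that Dublin Core appears as <meta name="DC.…"> tags and HighWire as <meta name="citation_…"> tags. The function name scrapeMetaByPrefix and the sample HTML are hypothetical, and a regex is used only to keep the example dependency-free; a real implementation would more likely walk a parsed DOM.

```javascript
// Hypothetical helper (not part of html-metadata): collects the content of
// every <meta name="PREFIX..."> tag, keyed by the name with the prefix removed.
function scrapeMetaByPrefix(html, prefix) {
  const result = {};
  const metaTag = /<meta\b[^>]*>/gi;
  // Matches attr="value" or attr='value'; unquoted values are ignored.
  const attribute = /([\w:.-]+)\s*=\s*(["'])(.*?)\2/g;

  for (const tag of html.match(metaTag) || []) {
    const attrs = {};
    for (const [, attrName, , attrValue] of tag.matchAll(attribute)) {
      attrs[attrName.toLowerCase()] = attrValue;
    }

    const name = attrs.name || '';
    const hasPrefix = name.toLowerCase().startsWith(prefix.toLowerCase());
    if (!hasPrefix || attrs.content === undefined) {
      continue;
    }

    // Names can repeat (for example several authors), so values are
    // always stored in an array.
    const key = name.slice(prefix.length);
    if (!Object.prototype.hasOwnProperty.call(result, key)) {
      result[key] = [];
    }
    result[key].push(attrs.content);
  }
  return result;
}

// Made-up sample page for illustration only.
const sampleHtml = `
<head>
  <title>Example article</title>
  <meta name="DC.title" content="Example article">
  <meta name="DC.creator" content="Doe, Jane">
  <meta name="citation_title" content="Example article">
  <meta name="citation_author" content="Doe, Jane">
  <meta name="citation_author" content="Roe, Richard">
</head>`;

const dublinCore = scrapeMetaByPrefix(sampleHtml, 'DC.');
const highWire = scrapeMetaByPrefix(sampleHtml, 'citation_');
console.log(dublinCore);
console.log(highWire);
```

Storing every value in an array keeps the result shape predictable when a field such as the author repeats. Whether the library should return arrays or single strings is a design decision left to the implementer.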
As such, we're developing a Node.js library that will be able to scrape all of these different types of metadata from HTML: https://github.com/mvolz/html-metadata
You can see in the file https://github.com/mvolz/html-metadata/blob/master/index.js that scrapeCOinS, scrapeHighWire, scrapeEmbeddedRDF, and scrapeDublinCore are not yet implemented.
Choose one of the functions to implement, and once you've chosen, leave a comment on this page saying which one. To help you choose, more details about each data format can be found in the Phabricator link next to each format listed above.
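As an illustration of the kind of work involved, here is a rough sketch of reading COinS data. It is an assumption-laden example, not the library's scrapeCOinS. It assumes the usual COinS convention of a <span class="Z3988"> element whose title attribute holds a URL-encoded list of key/value pairs. The function name parseCoinsSpans and the sample markup are hypothetical, and the regex-based tag matching stands in for a proper HTML parser.

```javascript
// Hypothetical sketch (not the library's scrapeCOinS): returns one object per
// COinS span, mapping each key to an array of decoded values.
function parseCoinsSpans(html) {
  // Opening <span> tags whose class attribute contains the token Z3988.
  const coinsSpan = /<span\b[^>]*\bclass\s*=\s*"[^"]*\bZ3988\b[^"]*"[^>]*>/gi;
  const records = [];

  for (const tag of html.match(coinsSpan) || []) {
    const title = /\btitle\s*=\s*"([^"]*)"/i.exec(tag);
    if (!title) {
      continue;
    }

    // Inside an HTML attribute the pair separator is escaped as "&amp;".
    const query = title[1].replace(/&amp;/g, '&');

    const record = {};
    // URLSearchParams (a Node.js global) percent-decodes keys and values
    // and turns "+" into a space.
    for (const [key, value] of new URLSearchParams(query)) {
      if (!Object.prototype.hasOwnProperty.call(record, key)) {
        record[key] = [];
      }
      record[key].push(value);
    }
    records.push(record);
  }
  return records;
}

// Made-up sample markup for illustration only.
const coinsHtml =
  '<span class="Z3988" title="ctx_ver=Z39.88-2004' +
  '&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal' +
  '&amp;rft.atitle=Example+article' +
  '&amp;rft.au=Doe%2C+Jane&amp;rft.au=Roe%2C+Richard"></span>';

const coinsRecords = parseCoinsSpans(coinsHtml);
console.log(coinsRecords[0]);
```

The sketch only unescapes "&amp;"; other HTML entities in the title attribute would need a real entity decoder, which is one more reason to prefer a DOM parser in the actual implementation.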
(Please note that although the general Wikimedia directions advise you to use Gerrit for version control, this is a Node.js library rather than MediaWiki-specific software and it is not on Gerrit, so you should use github.com to commit your work. html-metadata is not currently published as a Node module but will be at some point.)