Tuesday, January 03, 2017

Wikidata-powered citation lists with citation.js

I don't get as much time with the kids as I would like, but if your son is doing interesting coding projects, that makes it a lot easier. One project he is working on is citation.js, a JavaScript library to edit bibliographies. It has become really powerful and totally awesome! We all hate formatting bibliographies, and we hate that every journal has its own format. LaTeX and Citation Style Language have done wonders here, but it all should be even simpler. As an author I want to be able to just give a DOI and that should be enough.

Or a Wikidata entity identifier.

And citation.js makes that last thing possible, and I spent some time with Lars to implement this for my homepage:

This is more or less what I had before too, but then everything was hard coded. The citation.js way allows me to give just a list of two entity IDs (Q27062312 and Q27062639) and citation.js outputs the above. I just have this snippet in the HTML:

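      <ul class="cite" id="cite1"></ul>
      <script class="code" type="text/javascript">
        // "opt" holds the citation.js output options, defined elsewhere on the page
        var wikidata = new Cite()
        wikidata.set( [ "Q27062312", "Q27062639" ] )
        // the CSL output has its angle brackets escaped; turn them back into HTML
        var htmlOutput = wikidata.get( opt )
                                 .replace( /&(lt|#60);/g, '<' )
                                 .replace( /&(gt|#62);/g, '>' )
        // assumed final step (not in the original snippet): show the list
        document.getElementById( "cite1" ).innerHTML = htmlOutput
      </script>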

The formatting is actually mostly done with a CSL template (though it needs a hack to get it to output HTML), adapted to also output the DOI hyperlink and Altmetric icon (you can find the customized CSL in the HTML source code as CC-BY-SA 3.0). The citation.js library fetches the data from Wikidata and has to deal with the structure there, which includes a mixture of 'author' and 'author name string' fields for author information. Well done!
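To illustrate what that mixture of author fields means in practice, here is a minimal sketch, not how citation.js actually does it. It assumes a simplified, hypothetical claim format in which each 'author' (P50) value is an item ID, each 'author name string' (P2093) value is plain text, and every claim carries its 'series ordinal' (P1545) qualifier as `ordinal`. The function name, the input shape, and the example IDs are all made up for illustration.

```javascript
// Merge 'author' (P50) and 'author name string' (P2093) claims into one
// ordered list of names. Assumes every claim has an ordinal (P1545).
function orderedAuthors(claims, labels) {
  const fromItems = (claims.P50 || []).map(function (c) {
    // P50 points at another item, so a label lookup is needed;
    // fall back to the raw ID if no label is known
    return { name: labels[c.value] || c.value, ordinal: Number(c.ordinal) };
  });
  const fromStrings = (claims.P2093 || []).map(function (c) {
    // P2093 already holds the name as text
    return { name: c.value, ordinal: Number(c.ordinal) };
  });
  return fromItems.concat(fromStrings)
    .sort(function (a, b) { return a.ordinal - b.ordinal; })
    .map(function (a) { return a.name; });
}

// Hypothetical example data (placeholder ID and names)
const exampleClaims = {
  P50: [{ value: "Q-EXAMPLE-2", ordinal: "2" }],
  P2093: [
    { value: "A. Author", ordinal: "1" },
    { value: "C. Author", ordinal: "3" }
  ]
};
const exampleLabels = { "Q-EXAMPLE-2": "B. Author" };
const authors = orderedAuthors(exampleClaims, exampleLabels);
```

Sorting on the ordinal is what keeps the author order intact when some authors have their own Wikidata item and others are only known by name.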

If you like this, make sure to check out Wikicite, OpenCitations, and Scholia, projects that enabled and triggered some of the ideas behind the above citation.js use!

"10 everyday things on the web the EU Commission wants to make illegal" #04

The fourth example is harder than the third, and I hope I translated Julia Reda's example in a good way. The starting point is simple enough: bookmarking things where an image is used. However, I am less sure to what extent we use this in online science.

04. Pinning a photo to an online shopping list

Well, you can see how much trouble I had with finding a good equivalent here. So, what is a science shopping list? The above example shows a Google+ post by Björn Brembs. Now, G+ is not really a shopping list, but then again, literature is what researchers buy. Literally. We pay millions and millions for it. Second, we do have dedicated shopping lists for these products, but they do not always support images. Of course, these shopping lists are our CiteULike, Mendeley, ResearchGate, etc. accounts.

A second limitation of this example is that we would not consider most of our literature to be of journalistic nature. Hence the above example. Blogs are typically a mixture of science writing and a kind of journalism. It's a grey area. Now, under the new laws, Björn would have to ask my permission, and worse, G+ needs to install a monitoring system to see if Björn got a proper license, so as not to infringe my copyright.

So, back to the likes of ResearchGate and ScienceOpen. With the current proposal, any system of this kind with some commercial model in mind (both are set up by SMEs) will have to install this monitoring system (after all, we also happily bookmark Nature News articles). The cost of that investment will have to come from somewhere, so this has an enormous impact on their sustainability.

Even worse, in the wording of the proposal I have seen so far, and to the extent I understand Julia's worries, there are no limitations set on this; few or no words on allowed behavior. So, what about dissemination systems in general? I think later examples (we still have six to go!) will shed more light on that.

(And make sure to read the original article by Julia Reda!)

Monday, January 02, 2017

EPA CompTox Dashboard IDs in Wikidata

After Antony Williams left the ChemSpider team, he moved on to the EPA. Since then, he has set up the EPA CompTox Dashboard (see also doi:10.1007/s00216-016-0139-z [€]). And in August he was kind enough to upload mappings between InChIKeys (doi:10.1186/s13321-015-0068-4) and their identifiers on Figshare (doi:10.6084/m9.figshare.3578313.v1) as a tab-separated values (TSV) file. Because this database is of interest to our pathway and systems biology work, I realized I wanted ID-ID mappings in our BridgeDb identifier mapping files (doi:10.1186/1471-2105-11-5). As I wrote earlier, I have adopted Wikidata (doi:10.3897/rio.1.e7573) as a data source. So, entering these new identifiers in Wikidata is helpful.

Somewhere in the past few months I proposed the needed Wikidata property, P3117 ("DSSTOX substance identifier"), which was approved some time later. For entering the mappings, I have opted to write a Bioclipse script (doi:10.1186/1471-2105-10-397) that uses the Wikidata SPARQL endpoint to get about 150 thousand Wikidata item identifiers (Q-codes) and their InChIKeys. It then parses the lines in the TSV file from Figshare and creates input for Wikidata for each match, based on exact InChIKey string equivalence.

This output is formatted as instructions for QuickStatements, a great tool set up by Magnus Manske. Each line looks like this (here for N6-methyl-deoxy-adenosine-5'-monophosphate, aka Q27456455):

Q27456455 P3117 "DTXSID30678817" S248 Q28061352
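The matching step that produces such lines can be sketched as follows. This is a JavaScript illustration, not the actual Bioclipse script. The function name, the input shapes, the assumed TSV column order (DSSTOX identifier first, InChIKey second), and the example InChIKey are hypothetical. The separator defaults to a tab, on the assumption that QuickStatements wants tab-separated fields.

```javascript
// wikidataRows: [{ qid, inchikey }], e.g. as returned by a SPARQL query
// tsvText: mapping file contents; column order is an assumption
// sourceQid: the item to cite with S248 ("stated in" as reference)
function toQuickStatements(wikidataRows, tsvText, sourceQid, separator) {
  const sep = separator === undefined ? "\t" : separator;

  // Index the Wikidata items by InChIKey for fast exact lookups
  const qidByInchikey = new Map();
  for (const row of wikidataRows) {
    qidByInchikey.set(row.inchikey, row.qid);
  }

  const statements = [];
  for (const line of tsvText.split(/\r?\n/)) {
    if (line.trim() === "") continue; // skip empty lines
    const fields = line.split("\t");
    const dtxsid = fields[0];
    const inchikey = fields[1];
    const qid = qidByInchikey.get(inchikey); // exact string equivalence only
    if (qid !== undefined) {
      statements.push([qid, "P3117", '"' + dtxsid + '"', "S248", sourceQid].join(sep));
    }
  }
  return statements;
}

// Hypothetical input; the InChIKeys below are placeholders, not real ones
const rows = [{ qid: "Q27456455", inchikey: "AAAAAAAAAAAAAA-BBBBBBBBBB-N" }];
const tsv = "DTXSID30678817\tAAAAAAAAAAAAAA-BBBBBBBBBB-N\n" +
            "DTXSID00000000\tCCCCCCCCCCCCCC-DDDDDDDDDD-N\n";
const lines = toQuickStatements(rows, tsv, "Q28061352", " ");
```

Lines in the mapping file without a matching InChIKey in Wikidata are simply skipped, which is why the yield depends so much on which side you start counting from.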

The P248 ("stated in") property is used to link the source information as a reference (hence: S248), which points to the Q28061352 item, which is for the Figshare entry for Tony's mapping data. The result in this Wikidata item looks like:

I entered about 36 thousand such statements into Wikidata. Thus, the yield is about 5%, calculated from the CompTox Dashboard as starting point, with about 720 thousand identifiers. From a Wikidata perspective, the yield is higher. There are about 150 thousand items with an InChIKey, so 24% of them could be mapped.

Based on properties of the property, Wikidata does some automatic validation. For example, it is specified that any Wikidata item can only have one DSSTOX substance identifier, because it can only have one InChIKey too. Similarly, there cannot be two Wikidata items with the same DSSTOX identifier. Normally, that is: because of how Wikidata works, there can be isolated exceptions. With fewer than 25 constraint violations, the quality of the process turned out pretty high (>99.9%).
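The two constraints described here (one identifier per item, one item per identifier) can be checked with a few lines of code. The sketch below is only an illustration of the idea, not how Wikidata implements its constraint reports. The function name, the input format, and the example IDs are hypothetical.

```javascript
// statements: [{ qid, dtxsid }], one entry per P3117 statement
function findConstraintViolations(statements) {
  const idsByItem = new Map(); // qid -> Set of DSSTOX identifiers
  const itemsById = new Map(); // DSSTOX identifier -> Set of qids

  for (const s of statements) {
    if (!idsByItem.has(s.qid)) idsByItem.set(s.qid, new Set());
    idsByItem.get(s.qid).add(s.dtxsid);
    if (!itemsById.has(s.dtxsid)) itemsById.set(s.dtxsid, new Set());
    itemsById.get(s.dtxsid).add(s.qid);
  }

  // Keep only the keys that map to more than one distinct value
  function keysWithSeveralValues(map) {
    const keys = [];
    for (const [key, values] of map) {
      if (values.size > 1) keys.push(key);
    }
    return keys;
  }

  return {
    // items with more than one identifier
    singleValue: keysWithSeveralValues(idsByItem),
    // identifiers used on more than one item
    uniqueValue: keysWithSeveralValues(itemsById)
  };
}

// Hypothetical example data (placeholder IDs)
const violations = findConstraintViolations([
  { qid: "Q-A", dtxsid: "DTXSID-1" },
  { qid: "Q-A", dtxsid: "DTXSID-2" }, // Q-A has two identifiers
  { qid: "Q-B", dtxsid: "DTXSID-3" },
  { qid: "Q-C", dtxsid: "DTXSID-3" }  // DTXSID-3 is used on two items
]);
```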

Some of the issues have been manually inspected. Causes vary. One issue was that the Wikidata item in fact had more than one InChIKey. A possible reason for that is that the item does not distinguish between various forms of a compound. Two Wikidata items have been split up accordingly. Other problems are due to features of the CompTox Dashboard, and some issues have been tweeted to the Dashboard team.

This mashup of the two resources, as anticipated in our H2020 proposal (doi:10.3897/rio.1.e7573), makes it possible to easily make slices of data. For example, we can query for experimental data for compounds in the EPA CompTox Dashboard with a SPARQL query, like this one for the dipole moment:

Importantly, this query shows the source where this data comes from, one of the advantages of Wikidata.
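The query itself is not reproduced here, so the following is only a sketch of what such a query could look like, wrapped in a bit of JavaScript that builds the request URL for the Wikidata SPARQL endpoint. It assumes that the dipole moment is stored with property P2201 and that the source is attached as a "stated in" (P248) reference; the function name and the choice of variables are hypothetical.

```javascript
// Build a SPARQL query for compounds that have both a DSSTOX substance
// identifier (P3117) and a dipole moment (assumed: P2201), including the
// source of each value when one is given as a P248 reference.
function buildDipoleQuery(limit) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  return [
    "SELECT ?compound ?dtxsid ?dipole ?source WHERE {",
    "  ?compound wdt:P3117 ?dtxsid ;",
    "            p:P2201 ?statement .",
    "  ?statement ps:P2201 ?dipole .",
    "  OPTIONAL { ?statement prov:wasDerivedFrom/pr:P248 ?source . }",
    "}",
    "LIMIT " + limit
  ].join("\n");
}

// The URL can be opened in a browser or passed to an HTTP client
const queryUrl = "https://query.wikidata.org/sparql?format=json&query=" +
  encodeURIComponent(buildDipoleQuery(10));
```

Going through the full statement (`p:` and `ps:`) rather than the direct `wdt:` shortcut is what gives access to the references, and thus to the source of each value.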

Friday, December 30, 2016

"10 everyday things on the web the EU Commission wants to make illegal" #03

The third example in this series is not too hard to explain.

03. Posting a blog post to social media

Because many of you are familiar with blogging and many of you blog yourself, you know what this one is about. The way I understood it, it will be legal, and you just have to figure out if and how much I would need to pay Kerstin to share this wonderful story about clinical trials in Wiki{pedia|data} on Google+:

As with all original examples, Julia's post provides a lot of legal detail, which I reshare here for this item, because you may initially think this is just about news from newspapers, but here too, wording matters:

And while I argued a long time ago that there are many kinds of blogging (it's just a medium, like paper), many blogs can certainly be considered of journalistic nature. In fact, some even use their blog for getting press tickets for scientific conferences (but that's another story ;).

Well, if you are still reading this series, maybe it is time to head over to the website.

"10 everyday things on the web the EU Commission wants to make illegal" #01

OK, after moving to the second example, I realized the subtle difference with the first: I got examples 01 and 02 mixed up, and the previous post was really discussing Julia's second example. Example 01 is really about snippets of publications, like quotes. Now, before you argue that quoting is legal, realize that this depends on specifics in various jurisdictions, and, as Julia writes:

"[..] in many EU countries, sharing an extract without further commenting on its substance is not covered by that exception".

So, I hope this post provides enough commenting and substance. But that clearly does not apply to the modern way of disseminating science via Twitter.

01. Sharing what happened 20 years ago

Anyway, now I got a kickstart for the first example too: both tweets were actually about news from close to twenty years ago, as both publications are about 20 years old! So, take the first tweet with the title of the Nature News article, but now with a quote.

This will be illegal for commercial entities, and possibly for me too: there is no significant commenting. In practice it means that covering the news of the past will be illegal, or at least very hard, at least for some, where "some" is ill-defined, because the proposal is very unclear about who can and who cannot.

Oh, and if you're not already freaked out: it's retroactive. That is, happy cleaning up the past 20 years of dissemination you did and figuring out where this example applied. Nice excuse to not do research!