Summer Code Party at CfA

On Saturday, we hosted an event at CfA as part of Mozilla's Webmaker campaign. It began as a continuation of the #OAHack that PLoS hosted (see post below), but the dates aligned with Mozilla's campaign, and we were pleased to bring together the related ideas of the open web, open science, and open access into one event.

The Presentation
John Wilbanks started our day with a presentation explaining that if government is a platform, science is a wiki. In its current state, however, it's a terribly inefficient one. He shared some statistics on references to traditional versus open-published papers, such as the number and variety of citations that result. So why is science such a terribly inefficient wiki, and what can we do to improve sharing, reuse, collaboration, and ultimately progress?

John Wilbanks Presenting
First, open content. John noted the NIH's open access policy and how it has changed the playing field for scientists, and in many ways for publishers too; notably, no publisher has presented data showing adverse effects, suggesting that fears of open publishing destroying traditional business models may be overblown. A few months ago, the OSTP requested information on public access to digital data and scientific publications, and several authors of replies were in attendance (example). We discussed FRPAA, the Federal Research Public Access Act, and some of the political theater surrounding it, including red herrings such as the acquisition of US taxpayer-funded research by foreign governments, which could certainly pay for access if they wanted.

That discussion reminded me of the arguments over the EU's database directive, which grants database producers protection, based on the quality and quantity of information collected and arranged, against extraction of their contents; this is separate from copyright, though copyright may exist in the non-factual aspects of the underlying data. Fortunately, attempts to create sui generis database rights went nowhere in the U.S., and things have worked out just fine without them for years. Indeed, even the EU's 2006 report on the effects of the sui generis right casts a skeptical gaze, noting its effects are "unproven." But that's another story for another day.

Another important point was that there can be no change in outcome without change in stakeholders. Noting the frequency of lobbying activity by organizations opposed to open access, John explored other ways of getting legislators' attention. For example, he kickstarted a campaign on the White House's We the People petition site that reached the threshold of 25,000 signatures in about two weeks, a quick and clear indication that this issue needs attention from policymakers and lawmakers. We all look forward to the White House reply.

The next step in making science more efficient is open data. Using the example of climate data, John pointed to the variety of data collected independently, for different reasons, and how making it openly available is crucial for context and understanding, especially by lay people. For example, there's research on runoff, ocean temperature, land surfaces, clouds and precipitation, solar energy, and much more. At the same time, raw data alone is only a small step toward making use of it: metadata and standards processes, document submission standards, and archives are necessary too. There are also existential questions about data, which led to the final requirement for making science efficient: open consent.

Here, John spoke of the time and expense of organizing participation in clinical and other studies, and the narrow scope of the resulting consent. Yet we're constantly producing a stream of data in everything we do. Why the disconnect? Surely part of it is legal: the absence of clear legislation or guidance (or conservative interpretation of current law, especially where there's no explicit guidance and little case law). John used the example of The Eatery, an app that let users rate how healthy their meals were and let other users rate them too. The millions of ratings demonstrated not only that we overestimate the health value of our food intake, but that in only 5 months, with no grant and no academics involved, 7.68 million data points were collected; that data set is now in high demand from researchers.

After seeing this example, services like 23andMe, interviews, and more case studies, it became clear to John that there is a critical mass of people who prefer sharing as a form of control. However, one of the unintended consequences of informed consent is that data remains limited to a specific purpose. The portable consent John is working on would instead allow one to simply donate data to science; in other words, to donate data to be used in any study by anyone. The Consent to Research project teaches users the core ideas of informed consent, allows review of a consent agreement, and prompts participation by letting users upload data while selecting the permissions granted to researchers: the right to research, to redistribute, to publish the results, and to commercialize products derived from the research. Along the way, users are required to watch a video explaining the potential for harm from sharing. What if, for example, your shared data is used to connect you to a crime? The potential social, legal, and economic issues are limited only by your imagination. Consider things like paternity suits, or analysis by employers or insurance companies.
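That permission set maps naturally to a small data structure. Here's a minimal Python sketch of the four grants described above; the names are illustrative, not the project's actual schema:

```python
from enum import Flag, auto

class ResearchPermission(Flag):
    """Hypothetical model of the grants a data donor can select."""
    RESEARCH = auto()       # right to use the data in research
    REDISTRIBUTE = auto()   # right to redistribute the data
    PUBLISH = auto()        # right to publish the results
    COMMERCIALIZE = auto()  # right to commercialize derived products

# Example: a donor allows research and publication, but not
# redistribution or commercial use.
grants = ResearchPermission.RESEARCH | ResearchPermission.PUBLISH
assert ResearchPermission.RESEARCH in grants
assert ResearchPermission.COMMERCIALIZE not in grants
```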

With open content, permission via informed consent, and the participation of people (who want sharing as a means of control), science can become at least a modestly effective wiki.

The Projects

Wherein I note that, "At CfA, we infiltrate the civil-bureaucratic stack."
The projects spanned a number of open access and open web topics, beginning with the Adopt an Institution project. The team, which included participants from CfA, PLoS, and Creative Commons (not necessarily in their official capacities), developed an outreach strategy for the app, began adding more universities and institutions to it, and modified the database structure to allow multiple participants to "adopt" a single institution while indicating their affiliation (professor, student, administrator) and primary subject area of interest. The project is built in Ruby on Rails, the code is available on GitHub, and it's deployed on Heroku.
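The shape of that database change is essentially a many-to-many join with attributes. The app itself is Rails, but here's a hypothetical sketch of the equivalent model in Python with SQLAlchemy; the names are illustrative, not the project's actual schema:

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Institution(Base):
    __tablename__ = "institutions"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    adoptions = relationship("Adoption", back_populates="institution")

class Participant(Base):
    __tablename__ = "participants"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    adoptions = relationship("Adoption", back_populates="participant")

class Adoption(Base):
    """Join table: many participants can adopt one institution, each
    recording their affiliation and primary subject area."""
    __tablename__ = "adoptions"
    id = Column(Integer, primary_key=True)
    participant_id = Column(Integer, ForeignKey("participants.id"))
    institution_id = Column(Integer, ForeignKey("institutions.id"))
    affiliation = Column(String)   # e.g. professor, student, administrator
    subject_area = Column(String)  # e.g. biology, physics
    participant = relationship("Participant", back_populates="adoptions")
    institution = relationship("Institution", back_populates="adoptions")
```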

The Open Science Hub Team
Another team continued hacking on the Open Science Hub, a website that collects and displays information on the open access movement in a way that's more broadly accessible. They re-themed the site, which is built in Joomla, expanded administrator capabilities for adding new content, updated some of the articles, added a Twitter feed, and discussed strategies for bringing the results of open science to a wider audience, including journalists.

Following up on their work scraping and displaying information about the Open Access petition mentioned above, and a subsequent conversation on Twitter about furthering the work, one team started work on a specification for a whitehouse.gov We The People API. There's a Python scraper here, a proof-of-concept map, and more detailed specs and next steps here. One goal of the project is an easier way to see the time, actual location, and other info about petition signers; the location can't simply be taken from what people entered, since entries like "Ontario, CA" are ambiguous between California and Canada.
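That ambiguity check is easy to illustrate. Here's a minimal Python sketch; the function and abridged code lists are hypothetical, not part of the team's actual scraper:

```python
# Abridged lookup tables for the sketch; a real version would carry the
# full 50-state and ISO 3166-1 alpha-2 lists.
US_STATES = {"CA", "NY", "TX", "WA", "OH"}
COUNTRY_CODES = {"CA", "DE", "FR", "GB", "JP"}

def classify_location(raw: str) -> str:
    """Classify a free-text 'City, XX' entry from a petition signature."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 2:
        return "unparseable"
    code = parts[1].upper()
    in_state = code in US_STATES
    in_country = code in COUNTRY_CODES
    if in_state and in_country:
        return "ambiguous"   # e.g. 'Ontario, CA': California or Canada?
    if in_state:
        return "us_state"
    if in_country:
        return "country"
    return "unknown"

print(classify_location("Ontario, CA"))   # -> ambiguous
print(classify_location("Columbus, OH"))  # -> us_state
```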

Finally, a group set out to survey the landscape of open music notation tools, prompted by the recent success of the Open Goldberg Variations project, which released public domain versions of the scores and recordings of the pieces. We wondered how replicable the project would be, and sought first to identify musicians' needs. Conveniently, we had two classical musicians (and one jazz guy... me). There's a surprising amount of variation in publishers' adaptations, arrangements, editions, and other derivative works of public domain classical music (adding tempo, dynamics, etc.), and there's no easy way of comparing versions. We set out to explore a "GitHub for music" and started by reviewing MusicXML, an XML-based format for music notation, then looked at Music21, a Python-based "set of tools for helping scholars and other active listeners answer questions about music quickly and simply." We also looked at some of the JavaScript and HTML5 goodies that allow for entry and manipulation of music in the browser, converting either to or from MusicXML. For example, the open source web-based music notation rendering API VexFlow is used in this HTML5 Cloud Composer project. For the "sheet music GitHub," the idea would be to input the public domain score and convert it to MusicXML, display it as notation that could be modified in the browser, save the modifications and convert back to MusicXML, then store that result in a user's profile as a separate version from the original public domain score.
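As a taste of what the middle of that pipeline looks like, here's a minimal sketch of the parse, modify, and re-export loop using Music21; the corpus piece and the tempo edit are illustrative stand-ins for a public domain score and an editorial change:

```python
from music21 import corpus, tempo

# Parse a public domain chorale bundled with music21's corpus.
score = corpus.parse("bach/bwv66.6")

# Make an editorial modification: add a metronome marking at the start.
score.insert(0, tempo.MetronomeMark(number=72))

# Export the derived version back to MusicXML for versioning or display.
score.write("musicxml", fp="bwv66_edited.xml")
```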

A number of side conversations I heard were about front-end responsive design, Unglue.it (a site that collects donations for purchasing literary works or reaching agreements with publishers to release existing works under Creative Commons licenses), and the economic implications of open access. All in all, we had a great time, many new connections were made, and lots of follow-up was scheduled. Thanks to all who attended, and thanks to Mozilla, PLoS, and CfA for their support.
