Switchboard for translated scientific articles

From Translate Science

Introduction

Translated scientific articles open science to regular people, science enthusiasts, activists, advisors, educators, trainers, consultants, architects, doctors, journalists, planners, administrators, technicians and scientists. This page describes an idea for a tool to make it easier to find translations, which makes it more worthwhile to produce translations.

In its simplest form, users should be able to search using a Digital Object Identifier (DOI), a title, a reference or an OpenURL and be presented with a list of links to translations.

Searching by topic would also be useful, as translated articles tend to be the more important ones in a field. In addition, a topic directory with static pages for every translation would let search engines crawl the metadata on the translations. The database should also be accessible via an Application Programming Interface (API) so that other tools and webpages can automatically display or add information on any translations.

People or organizations who made or have translations should be able to upload lists with links. There were similar databases during the Cold War to keep up with Soviet research and we want to try to rescue their datasets and upload them to our database. Many research libraries, international organizations and research institutes (WMO, UK Met Office, ...) have translated articles, which should be included.

The expensive organizations maintaining these databases and making translations collapsed after the Cold War. In the internet age, we can maintain large knowledge bases more cost effectively with global volunteers, as Wikipedia has demonstrated, and include many more languages. Also translating has become much easier as a reasonable first draft can often be provided by machine learning. And we can now network people who only occasionally make translations (of their own articles).

Not every contribution will be perfect. Users and editors of such a database should be given moderation tools. With versioning it should be easy to revert vandalism or spamming. Green lists of known scientific repositories and red lists of known spammers could be maintained. Sherpa has an API to access their database of repositories and COAR is working on a new database. Open Access search engines, such as BASE or CORE, may also have lists of scientific repositories. When a certain number of translations from one website (e.g., NOAA) have been accepted, that site could be green listed.
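The green and red list idea can be sketched as a small check on the host of a submitted URL. This is only an illustration: the host names, the threshold of five accepted links and the function name `classify_url` are assumptions, not part of any existing system.

```python
from urllib.parse import urlparse

# Hypothetical example lists; real ones would be maintained by the editors
# or seeded from repository registries.
GREEN_HOSTS = {"repository.example.org"}
RED_HOSTS = {"spam.example.com"}
AUTO_GREEN_THRESHOLD = 5  # assumed number of accepted links before green listing


def classify_url(url, accepted_counts):
    """Return 'green', 'red' or 'review' for a submitted translation URL.

    accepted_counts maps a host name to the number of links from that
    host that editors have already accepted.
    """
    host = (urlparse(url).hostname or "").lower()
    if host in RED_HOSTS:
        return "red"
    if host in GREEN_HOSTS or accepted_counts.get(host, 0) >= AUTO_GREEN_THRESHOLD:
        return "green"
    return "review"  # unknown host: an editor should take a look
```

Submissions classified as "review" would go to the editors; "green" ones could be published directly and still be reverted later.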

If there are multiple translations for a language, editors or users should be able to rank them and indicate which one is best, if only because external systems using our information may be designed to accept only one translation per language, as that will be the most typical case.
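A minimal sketch of how a ranking could be reduced to one translation per language for such external systems. The record keys `language`, `url` and `rank` (1 is best) are hypothetical.

```python
def best_per_language(translations):
    """Map each language code to the URL of its best-ranked translation.

    Each item is a dict with the hypothetical keys 'language', 'url'
    and 'rank', where rank 1 is the translation the editors prefer.
    """
    best = {}
    for item in translations:
        current = best.get(item["language"])
        if current is None or item["rank"] < current["rank"]:
            best[item["language"]] = item
    return {language: item["url"] for language, item in best.items()}
```

For two German translations ranked 2 and 1 and one French translation, this returns the rank 1 German URL and the French URL.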

A "talk page", similar to Wikipedia's, could be useful to allow users to point to problems, to discuss which translations are best and which quality flags need to be set, and possibly even to organize a joint effort to make a better translation. This could be implemented with a commenting or forum system in a background tab. Copying Wikipedia's idea of a page with recent changes can help with quality control. Such a page can be filtered in several ways, e.g., for contributions by new people. If someone made a problematic contribution, a look at their user page may reveal more.

This page mainly describes the technical aspects of such a Translations Switchboard, but there is also a human aspect. We will need a community of editors for every language to check submitted URLs to avoid spam and select the best version in case multiple ones are available. Also we will need publicity so that people know about the service. Part of the advertising could function via integration of our system in others; see below. We will need volunteers who contact possible sources of translations and promote the production of translations in their circles.

Technical details

API

The core of the translations switchboard is a database with an API that allows people to query the existence of translations and upload information on translations. The API of CrossRef could serve as inspiration. Combined queries, e.g., DOI and language, should be possible. To avoid problems with copyright and with hosting large datasets, we will not host the translations ourselves, but give users the URL where they can be found. This API will also be used for our own homepage, where people can search for articles.
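As a rough illustration of a combined query, the sketch below filters an in-memory list of records by DOI and language. The `Translation` record and the `find_translations` function are hypothetical names; a real service would sit on top of a database and a web API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    original_doi: str  # DOI of the original article
    language: str      # language code of the translation
    url: str           # where the translation is hosted (not by us)


def find_translations(records, doi=None, language=None):
    """Return the records matching all criteria that were given."""
    doi_key = doi.lower() if doi else None  # DOIs are case-insensitive
    return [
        record for record in records
        if (doi_key is None or record.original_doi.lower() == doi_key)
        and (language is None or record.language == language)
    ]
```

Only URLs are stored and returned, in line with the decision not to host the translations themselves.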

It would be nice if people could use a hash to ask whether we have a translation. That way they would reveal less private information. This matters especially when our API is used by a browser add-on, which would send a request for every web page on which a DOI is mentioned.
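One possible shape for such a lookup is sketched below. Because DOIs are public, a full hash could be reversed by hashing all known DOIs, so this sketch lets the client send only a short prefix of the hash and compare the returned candidates locally. The choice of SHA-256, the prefix length and all function names are assumptions.

```python
import hashlib

PREFIX_LENGTH = 5  # assumed number of hex characters the client reveals


def doi_hash(doi):
    """Hash a DOI after normalizing case and surrounding whitespace."""
    normalized = doi.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def client_prefix(doi):
    """What the client sends: only the start of the hash."""
    return doi_hash(doi)[:PREFIX_LENGTH]


def server_candidates(known_hashes, prefix):
    """What the server returns: all known hashes sharing that prefix."""
    return {h for h in known_hashes if h.startswith(prefix)}


def client_has_translation(doi, candidates):
    """The client checks locally whether its own DOI is among them."""
    return doi_hash(doi) in candidates
```

The server then only learns that the user is interested in one of the DOIs sharing the prefix, not which one.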

Translation journals

If no translation is found, the homepage will link to this Wiki, which provides advice on finding translations, and the user will be encouraged to search for the ISSN of the journal as well. Many journals were regularly translated and published as translation journals. A search for these journals by ISSN and year could indicate whether there is a translation journal and which library may have the translations, even if they are not online. Data from the Library of Congress would be a good start for this ISSN database.
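A lookup by ISSN and year could work roughly as follows. The table content is a made-up placeholder (including the ISSN "0000-0000"), and the field names are assumptions; real data would have to come from library catalogues.

```python
# Hypothetical index: ISSN of the original journal -> known translation
# journals with the years they covered and a library holding them.
TRANSLATION_JOURNALS = {
    "0000-0000": [
        {
            "first_year": 1958,
            "last_year": 1991,
            "title": "Example Journal (English translation)",
            "holding_library": "Example Research Library",
        },
    ],
}


def find_translation_journals(issn, year, index=TRANSLATION_JOURNALS):
    """Return translation journals covering the given ISSN and year."""
    key = issn.strip().upper()  # the ISSN check character may be 'X'
    return [
        entry for entry in index.get(key, [])
        if entry["first_year"] <= year <= entry["last_year"]
    ]
```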

Languages

The languages of the original and the translation will be stored as ISO language codes (ISO 639). Editors doing the quality control can also indicate the languages they master using these codes. Users with an account could see translations in the languages they can read at the top. For articles with multiple translations, the ISO codes could be used as quick links at the top of the search results page.

Disciplines

There are disciplinary ontologies that categorize the topics of journals. They could be used as a first estimate of the topic of an article. Examples are the open, hierarchical, three-level classification tree of Science-Metrix[1][2], as well as the proprietary systems of ISI, A&HCI and ERA. (Science-Metrix uses the Office Open XML format.) They could provide good estimates for the discipline of a large number of the translated articles and could be matched to the expertise of our editors. That approach has one important limitation: journal-based classification would not be useful for local journals that do not have a set subject. For the topic of an individual article, the Universal Decimal Classification (UDC) is an attractive option; it is free (CC-SA) and available in 57 languages. We should study how well these journal-level classifications map to the article-level classification.

Uploading information

Adding articles to the database should be as user friendly as possible. If the original has a DOI, then that DOI, the URL of the translation and the language of the translation may be the only information we need. (The language of the original is mostly known via CrossRef. Maybe we can make or find a tool that estimates the language of the translation.)
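A minimal submission could therefore consist of just three fields. The sketch below normalizes the DOI and does some basic checks; the field names, the function names and the deliberately loose DOI pattern are assumptions made for illustration.

```python
import re

# Loose pattern for illustration: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def normalize_doi(raw):
    """Strip common prefixes and lower-case the DOI (DOIs are case-insensitive)."""
    doi = raw.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.lower()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def make_submission(original_doi, translation_url, language):
    """Build the minimal record for one translation (hypothetical field names)."""
    if not translation_url.startswith(("http://", "https://")):
        raise ValueError("translation_url must be an http(s) URL")
    return {
        "original_doi": normalize_doi(original_doi),
        "translation_url": translation_url,
        "language": language,
    }
```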

But there are articles and other scientific documents that do not have a DOI, especially in the older literature. It may be possible to allow users to give us references in a free format and parse them into machine-readable bibliographic data with tools such as AnyStyle.

Furthermore, it should be possible to upload multiple translations simultaneously. We should ask research libraries and institutes which method they would prefer for such bulk uploads.
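Which format libraries prefer is still open. Purely as one possibility, a bulk upload could be a CSV file with one translation per line. The column names below are assumptions, and the line numbers reported for rejected rows assume that no field contains a line break.

```python
import csv
import io

REQUIRED_COLUMNS = ("original_doi", "translation_url", "language")


def parse_bulk_upload(text):
    """Split CSV text into accepted records and rejected line numbers."""
    reader = csv.DictReader(io.StringIO(text))
    accepted, rejected = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if all((row.get(column) or "").strip() for column in REQUIRED_COLUMNS):
            accepted.append({c: row[c].strip() for c in REQUIRED_COLUMNS})
        else:
            rejected.append(line_no)  # report back to the uploader
    return accepted, rejected
```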

Integrations with other systems

Integrations with other systems are important: they help the users and spread the word. We should collaborate with the organizations behind reference managers, repositories, publishing systems and peer review systems so that they show translations if they are available.

A WordPress plugin and a browser add-on that automatically alert users to translations of the original articles mentioned on a webpage would be useful.

How to put translations in Wikidata should be discussed with that community. We have started a discussion at Project Source Metadata. The procedures for donating our data should be discussed with the group on Data Donations next. A problem with uploading information is that there may not be a direct link between the information we have and Wiki Items. Especially for authors it can be hard to match names to an actual person (who has a Wiki Item); a workaround is to add authors as Author Name String.

There are still only a handful of translations on Wikidata, but we could download them with the Wikidata API for downloading data.

CrossRef has around two thousand translations* in their database, and regularly checking their API for new ones is worthwhile. CrossRef is considering also including data from non-members (non-publishers) in their database, so in future they could include our data.

We could use the OpenURL resolver to integrate with other software (e.g., reference managers such as Zotero), so that they could show translations if available. There is an implementation of OpenURL at CrossRef, which we could use for inspiration.
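To give an idea of what such a request could look like, the sketch below builds an OpenURL in key/encoded-value form that identifies an article by its DOI. The resolver address and the extra `lang` parameter are hypothetical; `url_ver` and `rft_id` follow the OpenURL 1.0 (Z39.88-2004) convention as far as we understand it.

```python
from urllib.parse import urlencode

# Hypothetical resolver address for the switchboard.
RESOLVER_BASE = "https://switchboard.example.org/openurl"


def openurl_for_doi(doi, language=None):
    """Build an OpenURL asking the resolver about translations of a DOI."""
    params = {
        "url_ver": "Z39.88-2004",    # OpenURL 1.0 version tag
        "rft_id": f"info:doi/{doi}",  # the referent, identified by DOI
    }
    if language:
        params["lang"] = language  # assumed extension, not part of OpenURL
    return RESOLVER_BASE + "?" + urlencode(params)
```

A reference manager that already knows how to build OpenURLs would then only need the resolver address to find translations.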

* To download all translations with the CrossRef API, add to the API URL: "works?filter=relation.type:is-translation-of". I did not make a direct link as downloading the data takes some time, so crawlers should not follow the link. You can do the same for articles that have a translation: "works?filter=relation.type:has-translation". This page specifies a large part of the API.
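The query strings above can be assembled as in this sketch. It only builds the request URL and reads the paging cursor from an already parsed response; it does not contact CrossRef. The base address `https://api.crossref.org/` and the use of the `rows` and `cursor` parameters for paging are our understanding of the CrossRef REST API and should be checked against its documentation.

```python
from urllib.parse import urlencode

CROSSREF_API = "https://api.crossref.org/"  # assumed API base URL


def translation_query_url(relation="is-translation-of", rows=100, cursor="*"):
    """Build the URL for one page of works with the given relation type."""
    params = {
        "filter": f"relation.type:{relation}",
        "rows": rows,      # results per page
        "cursor": cursor,  # "*" asks for the first page
    }
    return CROSSREF_API + "works?" + urlencode(params)


def next_cursor(response_json):
    """Return the cursor for the following page, or None if absent."""
    return response_json.get("message", {}).get("next-cursor")
```

Passing `relation="has-translation"` gives the query for the originals instead of the translations.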

Points to ponder

How hard would it be to make the system distributed, with multiple servers that talk to each other and exchange data if they trust each other? We are doing this for science, but there are groups outside of science who could use a similar system. (Disciplinary) groups within science may be able to use their networks to promote the production of translations. Bulk download of our data would then be a good way to get a new server started, although initially we will not have that much data, so using the API to download the entire dataset would not be that cumbersome.

It could be worthwhile to make a (private) backup of the known translations and regularly check for broken links. The backup can help the editors find the new location of the translation or to upload it elsewhere if the license allows for this.
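A regular link check could look roughly like the sketch below, which sends a HEAD request to every stored URL and reports the ones that fail. The function names are hypothetical, and some servers do not answer HEAD requests correctly, so a real checker would probably retry with GET before flagging a link.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a URL, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with an error status
    except (OSError, ValueError):
        return None  # network failure, timeout or malformed URL


def find_broken_links(urls, get_status=http_status):
    """Return the URLs that are unreachable or answer with status >= 400."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken.append(url)
    return broken
```

The `get_status` argument makes it possible to test the logic without network access by passing in a stand-in function.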

It may be a good idea to have multiple types of links to translations: literal translations, but also related works in another language, for example a PhD thesis in language X and a corresponding article in language Y. Sometimes people may write a summary of an article in another language, which could be valuable if there is no full translation. Links to partial translations can also be valuable, and showing them could promote their completion.
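In a data model, these link types could be a small fixed vocabulary. The names below are only a suggestion, as is the rule of showing full translations first and falling back to other link types when there is none.

```python
from enum import Enum


class LinkType(Enum):
    FULL_TRANSLATION = "full"
    PARTIAL_TRANSLATION = "partial"
    SUMMARY = "summary"       # summary written in another language
    RELATED_WORK = "related"  # e.g. thesis and corresponding article


def links_to_show(links):
    """Prefer full translations; otherwise show whatever else is linked.

    Each link is a dict with the hypothetical keys 'type' and 'url'.
    """
    full = [link for link in links if link["type"] is LinkType.FULL_TRANSLATION]
    return full or list(links)
```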