Switchboard for translated scientific articles

From Translate Science
== Introduction ==
[[Importance of translations|Translated scientific articles open science]] to regular people, science enthusiasts, activists, advisors, educators, trainers, consultants, architects, doctors, journalists, planners, administrators, technicians and scientists. This page describes an idea for a tool that makes it easier to find translations, which in turn makes producing them more worthwhile.
  
In its simplest form, users should be able to search using a Digital Object Identifier ([[w:Digital_object_identifier|DOI]]), a title, a reference or an [[w:OpenURL|OpenURL]] and be presented with a list of links to translations.
  
Searching by topic would also be useful, as translated articles tend to be the more important ones in a field. In addition, by also having a topic directory with static pages for every translation, search engines can crawl the metadata on the translations. The database should also be accessible via an Application Programming Interface (API) so that other tools and webpages can automatically display or add information on any translations.
  
People or organizations who have made or hold translations should be able to upload lists of links. There were [[Obtaining_copies_of_old_translations_published_as_technical_reports|similar databases during the Cold War]] to keep up with Soviet research, and we want to try to rescue their datasets and upload them to our database. Many research libraries, international organizations and research institutes ([[w:World_Meteorological_Organization|WMO]], [[w:Met_Office|UK Met Office]], ...) have translated articles, which should be included.
  
The expensive organizations maintaining these databases and making translations collapsed after the Cold War. In the internet age, we can maintain large knowledge bases more cost-effectively with global volunteers, as Wikipedia has demonstrated, and include many more languages. Translating has also become much easier, as a reasonable first draft can often be provided by machine learning. And we can now network people who only occasionally make translations (of their own articles).
  
Not every contribution will be perfect. Users and editors of such a database should be given moderation tools. With versioning it should be easy to revert vandalism or spamming. Green lists of known scientific repositories and red lists of known spammers could be maintained. [https://v2.sherpa.ac.uk/api/metadata-schema.html Sherpa has an API to access their database of repositories] and [https://www.coar-repositories.org/news-updates/coar-recommendations-for-operations-funding-and-governance-of-an-international-repository-directory/ COAR is working on a new database]. Open Access search engines, such as [[w:BASE_(search_engine)|BASE]] or [[w:COnnecting_REpositories|CORE]], may also have lists of scientific repositories. When a certain number of translations from one website (e.g., [[w:National_Oceanic_and_Atmospheric_Administration|NOAA]]) has been accepted, that site could be green listed.
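
As a rough illustration, such a moderation check could look as follows. This is a minimal Python sketch: the hostnames and the auto-green threshold are invented examples, and the lists could be seeded from the Sherpa or COAR repository databases mentioned above.

<syntaxhighlight lang="python">
# Sketch of the green/red list check described above. The hostnames and the
# auto-green threshold are invented examples; the green list could be seeded
# from the Sherpa or COAR repository databases.
from urllib.parse import urlparse

GREEN_LIST = {"repository.example.edu", "arxiv.org"}  # known scientific repositories
RED_LIST = {"spam.example.com"}                       # known spammers
ACCEPTED_COUNTS: dict[str, int] = {}                  # accepted submissions per host
AUTO_GREEN_THRESHOLD = 25                             # hypothetical threshold


def moderation_status(url: str) -> str:
    """Classify a submitted URL as 'accept', 'reject' or 'review'."""
    host = urlparse(url).netloc.lower()
    if host in RED_LIST:
        return "reject"
    if host in GREEN_LIST:
        return "accept"
    # Hosts with enough accepted submissions graduate to the green list.
    if ACCEPTED_COUNTS.get(host, 0) >= AUTO_GREEN_THRESHOLD:
        GREEN_LIST.add(host)
        return "accept"
    return "review"  # leave for a human editor
</syntaxhighlight>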
  
If there are multiple translations for a language, editors or users should be able to indicate which one is best, to rank them, if only because external systems using our information may be designed to accept only one translation per language, as that will be the most typical case.
  
A "talk page", similar to Wikipedia could be useful to allow users to point to problems, discuss which translations are best and which quality flags need to be set. Possibly even to jointly make a better translation. Copying the idea of Wikipedia of making a page with [https://www.wikidata.org/wiki/Special:RecentChanges recent changes] can help with quality control. Such a page can be filtered in several ways, e.g. for contributions by new people. In case someone made a problematic submissions a look at their user pages may find more.
+
A "talk page", similar to Wikipedia's, could be useful to allow users to point to problems, discuss which translations are best and which quality flags need to be set. Possibly even to organize to jointly make a better translation. This could be implemented with a commenting or forum system in a background tab. Copying the idea of Wikipedia of making a page with [https://www.wikidata.org/wiki/Special:RecentChanges recent changes] can help with quality control. Such a page can be filtered in several ways, e.g., for contributions by new people. In case someone made a problematic contribution a look at their user pages may find more.
  
 
This page mainly describes the technical aspects of such a Translations Switchboard, but there is also a human aspect. We will need a community of editors for every language to check submitted URLs to avoid spam and to select the best version in case multiple ones are available. We will also need publicity so that people know about the service. Part of the advertising could come from integrating our system with others; see below. And we will need volunteers who contact possible sources of translations and promote the production of translations in their circles.
  
== Technical details ==
=== API ===
The core of the translations switchboard is a database with an API that allows people to query the existence of translations and to upload information on translations. The [https://github.com/CrossRef/rest-api-doc API of CrossRef] could serve as inspiration. Combined queries, e.g., DOI and language, should be possible. To avoid problems with copyright and with hosting large datasets, we will not host the translations ourselves but give users the URL where they can be found. This API will also be used for our own homepage, where people can search for articles.
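
As an illustration of the kind of interface we have in mind, a client query could look roughly like this. The endpoint and parameter names in this Python sketch are hypothetical; only the CrossRef-style REST/JSON design is taken from the description above.

<syntaxhighlight lang="python">
# Minimal sketch of a client query. The endpoint and parameter names are
# hypothetical; only the CrossRef-style REST/JSON design is from the text.
import requests

BASE_URL = "https://api.example.org/translations"  # hypothetical endpoint


def find_translations(doi: str, language: str | None = None) -> list:
    """Return metadata records for known translations of the given DOI."""
    params = {"doi": doi}
    if language:
        params["language"] = language  # ISO 639 code, e.g. "de"
    response = requests.get(BASE_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json().get("translations", [])


# Example: list the German translations of one article.
for record in find_translations("10.1000/example.doi", language="de"):
    print(record["url"])
</syntaxhighlight>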
  
 
It would be nice if people could use a [[w:Cryptographic_hash_function|hash]] of the identifier to inquire whether we have a translation. That way they would reveal less private information. This matters especially when our API is used by a browser add-on that sends a request for every webpage on which a DOI is mentioned.
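
A sketch of such a hashed lookup; the endpoint is again hypothetical, and SHA-256 plus the normalization step are our assumptions:

<syntaxhighlight lang="python">
# Sketch of a hashed lookup: the client sends only a SHA-256 digest of the
# normalized DOI, so the request reveals less about what was looked up.
# The endpoint is hypothetical.
import hashlib
import requests


def doi_digest(doi: str) -> str:
    """Hash a DOI after normalizing it (DOIs are case-insensitive)."""
    return hashlib.sha256(doi.strip().lower().encode("utf-8")).hexdigest()


def has_translation(doi: str) -> bool:
    url = f"https://api.example.org/translations/hash/{doi_digest(doi)}"
    return requests.get(url, timeout=10).status_code == 200
</syntaxhighlight>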
  
=== Translation journals ===
If no translation is found, the homepage will link to this Wiki to provide advice on finding translations, and the user will be encouraged to also search for the [[w:International_Standard_Serial_Number|ISSN]] of the journal. For many journals, translations were regularly made and published in translation journals. A search for these journals by ISSN and year could indicate whether there is a translation journal and which library may hold the translations, even if they are not online. Data from the Library of Congress would be a good start for this ISSN database.
  
=== Languages ===
The languages of the original and the translation will be stored using the [[w:ISO_639|ISO systems for languages]]. The editors doing quality control can also indicate the languages they master using the same ISO codes. Users with an account could see the languages they can read at the top. For articles with multiple translations, the ISO codes could be used as quick links at the top of the search results page.
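
For illustration, validating and displaying ISO 639 codes could be done with an existing library; this sketch assumes the third-party pycountry package, which wraps the ISO 639 tables:

<syntaxhighlight lang="python">
# Sketch of validating and displaying ISO 639-1 language codes, assuming the
# third-party pycountry package (which wraps the ISO 639 tables).
import pycountry


def language_name(code: str) -> str:
    """Map an ISO 639-1 code such as 'ru' to its English name."""
    language = pycountry.languages.get(alpha_2=code)
    if language is None:
        raise ValueError(f"unknown ISO 639-1 code: {code}")
    return language.name


print(language_name("ru"))  # Russian
print(language_name("de"))  # German
</syntaxhighlight>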
  
=== Disciplines ===
There are disciplinary ontologies that categorize the topics of journals. They could be used as a first estimate of the topic of an article. Examples are the open, hierarchical, three-level classification tree of [https://www.science-metrix.com Science-Metrix][http://ceur-ws.org/Vol-1155/paper-07.pdf][https://www.scientometrics-school.eu/images/4_1_13Archambault_Journal%20classifications.pdf], as well as the proprietary systems of [[wikipedia:Institute_for_Scientific_Information|ISI]], [[wikipedia:Arts_and_Humanities_Citation_Index|A&HCI]] and [https://www.tandfonline.com/db/era ERA]. (Science-Metrix uses the [[w:Office_Open_XML|Office Open XML format]].) They could provide good estimates for the discipline of a large number of the translated articles and could be matched to the expertise of our editors. That approach has one important limitation: journal-based classification is not useful for ''local'' journals that do not have a set subject. For the topic of an article itself, the [http://www.udcsummary.info/php/index.php?lang=en Universal Decimal Classification (UDC)] is an attractive option; it is free (CC-SA) and available in 57 languages. We should study how well these journal-level classifications map to article-level classification.
  
=== Uploading information ===
Adding articles to the database should be as user friendly as possible. If the original has a DOI, then that DOI, the URL of the translation and the language of the translation may be the only information we need. (The language of the original is mostly known via CrossRef. Maybe we can make or find a tool that estimates the language of the translation.)
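
Such a language-estimation tool already exists in several forms; here is a sketch using the third-party langdetect package, one assumption among several possible choices:

<syntaxhighlight lang="python">
# Sketch of estimating the language of a translation from a text sample,
# assuming the third-party langdetect package; other language identifiers
# would work equally well.
from langdetect import detect


def estimate_language(text_sample: str) -> str:
    """Return the most likely ISO 639-1 code for a snippet of the text."""
    return detect(text_sample)


print(estimate_language("Die Ergebnisse zeigen eine deutliche Erwärmung."))  # de
</syntaxhighlight>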
  
But there are articles and other scientific documents, especially older literature, that do not have a DOI. It may be possible to let users give us references in a free format and parse them into machine-readable bibliographic data with tools such as [https://github.com/inukshuk/anystyle AnyStyle].
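
A sketch of how our server could call AnyStyle for this. It assumes the AnyStyle command-line tool (a Ruby gem) is installed and that its JSON output is selected with <code>-f json</code>; check the AnyStyle documentation for the exact invocation.

<syntaxhighlight lang="python">
# Sketch of parsing free-form references by calling the AnyStyle command-line
# tool (a Ruby gem). It assumes "anystyle -f json parse <file>" prints JSON to
# standard output; check the AnyStyle documentation for the exact invocation.
import json
import subprocess
import tempfile


def parse_references(references: list) -> list:
    """Turn free-form reference strings into structured bibliographic data."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("\n".join(references))
        path = handle.name
    result = subprocess.run(
        ["anystyle", "-f", "json", "parse", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
</syntaxhighlight>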
  
Furthermore, it should be possible to upload multiple translations simultaneously. We should consult research libraries and institutes about what kind of method they prefer for such bulk uploads.
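
One plausible option is a simple CSV format, sketched below; the column names and the upload endpoint are hypothetical illustrations:

<syntaxhighlight lang="python">
# Sketch of one possible bulk-upload route: one CSV row per translation.
# The column names and the upload endpoint are hypothetical illustrations.
import csv
import requests


def bulk_upload(csv_path: str) -> None:
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            record = {
                "original_doi": row["original_doi"],
                "translation_url": row["translation_url"],
                "language": row["language"],  # ISO 639 code
            }
            response = requests.post(
                "https://api.example.org/translations",  # hypothetical endpoint
                json=record, timeout=10,
            )
            response.raise_for_status()
</syntaxhighlight>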
  
== Integrations with other systems ==
Integrations with other systems are important: they help the users and spread the word. We should collaborate with the organizations behind reference managers, repositories, publishing systems and peer review systems so that they show translations if available.
  
 
A WordPress plugin and a browser add-on that automatically alert readers to translations of the original articles mentioned on a webpage would be useful.
  
How to put translations in Wikidata should be discussed with that community. We have started [https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Source_MetaData#Translated_scientific_articles a discussion] at [https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData Project Source Metadata]. The procedures for donating our data should be discussed with the group on [https://www.wikidata.org/wiki/Wikidata:Data_donation Data Donations] next. A problem with uploading information is that there may not be a direct link between the information we have and Wikidata items; especially for authors it can be hard to match names to an actual person (with a Wikidata item). A workaround is to add authors as an [https://www.wikidata.org/wiki/Property:P2093 Author Name String].
  
 
There are still only a [https://w.wiki/yVf handful of translations on Wikidata], but with the [https://www.wikidata.org/wiki/Wikidata:Data_access API of Wikidata for downloading data] we could download them.
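
For illustration, the Wikidata SPARQL endpoint could be queried for such translations roughly as follows. The exact modelling of translations on Wikidata is still being discussed, so the properties used here, P629 ("edition or translation of") and P407 ("language of work or name"), are an assumption, not the settled scheme.

<syntaxhighlight lang="python">
# Sketch of querying the Wikidata SPARQL endpoint for translated scholarly
# articles. The modelling of translations on Wikidata is still being
# discussed, so the properties used here (P629 "edition or translation of",
# P407 "language of work or name") are an assumption, not the settled scheme.
import requests

SPARQL = """
SELECT ?translation ?original ?langLabel WHERE {
  ?translation wdt:P31 wd:Q13442814 ;   # instance of: scholarly article
               wdt:P629 ?original ;     # edition or translation of
               wdt:P407 ?lang .         # language of work or name
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    timeout=60,
)
for row in response.json()["results"]["bindings"]:
    print(row["translation"]["value"], "->", row["original"]["value"],
          f'({row["langLabel"]["value"]})')
</syntaxhighlight>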
  
CrossRef has around two thousand translations* in their database, and regularly checking their [https://github.com/CrossRef/rest-api-doc API] for new ones is worthwhile. CrossRef is considering also including data from non-members (non-publishers) in their database, so in the future they could include our data.
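
A sketch of such a periodic check, using the filter described in the footnote below; the User-Agent contact address is a placeholder:

<syntaxhighlight lang="python">
# Sketch of a periodic check of the CrossRef REST API for translated works,
# using the filter described in the footnote below. The User-Agent contact
# address is a placeholder; sending one is CrossRef's "polite" convention.
import requests


def fetch_crossref_translations(rows: int = 100, cursor: str = "*") -> dict:
    response = requests.get(
        "https://api.crossref.org/works",
        params={
            "filter": "relation.type:is-translation-of",
            "rows": rows,
            "cursor": cursor,  # deep paging through the result set
        },
        headers={"User-Agent": "TranslationSwitchboard/0.1 (mailto:info@example.org)"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["message"]


print(fetch_crossref_translations()["total-results"], "translated works found")
</syntaxhighlight>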
  
We could use the [[w:OpenURL|OpenURL resolver]] to integrate with other software (e.g., reference managers such as [[wikipedia:Zotero|Zotero]]), so that they could show translations if available. There is [https://www.crossref.org/education/retrieve-metadata/openurl/ an implementation of OpenURL at CrossRef], which we could use for inspiration. We could use web annotations to advertise the translations on the webpage and PDF of the original. Adding a note about the existence of a translation on Grassroots Review Journals and on PubPeer could also help, if PubPeer would allow that.
 
  
''<sup>*</sup> To download all articles with the [http://api.crossref.org/ CrossRef API], add to the API URL: "works?filter=relation.type:is-translation-of". I did not make a direct link as downloading the data takes some time, so crawlers should not follow the link. You can do the same for articles that have a translation: "works?filter=relation.type:has-translation". [https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md This page specifies a large part of the API.]''

== Points to ponder ==

How hard would it be to make the system distributed, that is, to have multiple servers that talk to each other and exchange data if they trust each other? We are doing this for science, but there are groups outside of science who could use a similar system, and (disciplinary) groups within science may be able to use their networks to promote the production of translations. Distribution would make a bulk download of our data a good idea to get a new server started, although initially we will not have that much data, so using the API to download the entire dataset would not be that cumbersome.
  
 
It could be worthwhile to make a (private) backup of the known translations and regularly check for broken links. The backup can help the editors find the new location of a translation or upload it elsewhere if the license allows this.
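
A sketch of such a link check in plain Python; the list of URLs would come from our own database:

<syntaxhighlight lang="python">
# Sketch of a periodic broken-link check over the stored translation URLs.
# The list of URLs would come from our own database.
import requests


def find_broken_links(urls: list) -> list:
    """Return the URLs that no longer resolve (errors or HTTP status >= 400)."""
    broken = []
    for url in urls:
        try:
            response = requests.head(url, allow_redirects=True, timeout=10)
            if response.status_code >= 400:
                broken.append(url)
        except requests.RequestException:
            broken.append(url)
    return broken
</syntaxhighlight>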
  
It may be a good idea to have multiple types of links to translations: literal translations, but also related works in another language, for example a PhD thesis in language X and a corresponding article in language Y. Sometimes people write a summary of an article in another language, which could be valuable if there is no full translation. Links to partial translations can also be valuable, and showing them could promote their completion.
