Collaborative translation

From Translate Science
Revision as of 02:47, 5 October 2021 by VictorVenema (talk | contribs) (Still unfinished, more descriptive version.)

One of the ideas to promote the translation of scientific articles is to create a collaborative translation tool. Producing a good translation needs several skills: fluency in two languages and knowledge of the topic. Such a combination is easier to find in a team, and working in a team is motivating. People regularly make partial translations and stop when they know enough for themselves. It would be great if others could finish the job.

There are already many translation tools, but none really fit this use case. There is machine translation; for many language pairs this works quite well nowadays, but for scientific texts the system may trip up, and accuracy is very important. So a machine translation is at best a first draft that an expert should check. Once machine translation quality is good enough, people will no longer need a collaborative translation tool, nor a translation published in a repository and findable via our translations database.

There are computer-aided translation packages that help (teams of) professional translators. These tend to be proprietary and thus cannot be improved to fit our use case; the only exception seems to be OmegaT, a FOSS project coded in Java. They do have many tricks that would be worthwhile to implement in a collaborative system as well, such as databases of phrases that ensure terms are translated consistently. Their data formats can also serve as inspiration. A remaining problem could be that such systems are intended for professionals and may have a steep learning curve.
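The phrase database idea can be illustrated with a small sketch. This is only an illustration under assumptions: the glossary entries, the example segments and the function name `find_inconsistencies` are hypothetical and not taken from OmegaT or any other package, and the matching is a naive case-insensitive substring search.

```python
from typing import Dict, List, Tuple

# Hypothetical glossary: source-language term -> agreed target-language term.
GLOSSARY: Dict[str, str] = {
    "climate model": "Klimamodell",
    "time series": "Zeitreihe",
}


def find_inconsistencies(
    segments: List[Tuple[str, str]], glossary: Dict[str, str]
) -> List[Tuple[int, str, str]]:
    """Return (segment index, source term, expected target term) for each
    segment whose source contains a glossary term while its translation
    lacks the agreed target term."""
    problems = []
    for index, (source, target) in enumerate(segments):
        for term, expected in glossary.items():
            term_used = term.lower() in source.lower()
            translated_as_agreed = expected.lower() in target.lower()
            if term_used and not translated_as_agreed:
                problems.append((index, term, expected))
    return problems


# Hypothetical (source, translation) pairs; the second one deviates from the glossary.
segments = [
    ("The climate model was validated.", "Das Klimamodell wurde validiert."),
    ("We analysed the time series.", "Wir analysierten die Zeitserie."),
]
print(find_inconsistencies(segments, GLOSSARY))
# prints [(1, 'time series', 'Zeitreihe')]
```

A real system would probably need language-aware matching (inflected forms, compound words), which a substring check cannot provide.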

Tools for the translation of software packages (their user interface and documentation) are often collaborative, tend to be easy to use and are sometimes even somewhat gamified. In addition, these systems are often free software and could thus be improved. They may thus come closest to a collaborative tool for scientific articles, but they only work with nicely structured text files for input and output, while scientific articles have equations, tables, references and figures, and will often not even be available in a text format, only as some sort of PDF file.

The core of such a tool would be translating one text into another. This would work best if the text to be translated were in a machine-readable text format. For this first step, the comment on the Open Science Feed below lists some tools, and also this tool may be part of such a system.

* Design a collaborative translation tool?

  • this should work with a group of people with similar interests or
  • a group of researchers working on the same project which needs to translate relevant references or
  • a group of students in a classroom practicing annotation or critical review of a paper. Prior to that they need to translate the paper (as a complete translation or a summary).
  • Ben mentioned he had several unfinished translations
  • A translation requires expertise in two languages and the field of study, which is easier with a team.
  • Working in a team is nicer
  • For the user interface of software there are beautiful collaborative tools, which give feedback on progress.
  • Would a translation tool, with a draft made by deep learning, also help deep learning? The system could see how people correct the translation. Is that useful information?
  • Equations are a problem
  • Also for ontologies (Peter, Ideas Challenge)
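One possible way to approach the equation problem, sketched here under assumptions: if the source text is markdown with `$...$` and `$$...$$` math, the equations could be swapped for placeholders before translation and restored afterwards, so that neither a machine translation nor a translator alters them by accident. The placeholder format `[[EQ0]]` and the function names are hypothetical, and the pattern is deliberately simple (it does not handle escaped dollar signs, for example).

```python
import re
from typing import Dict, Tuple

# Display math ($$...$$) is tried first, then inline math ($...$).
# A simplified pattern for illustration, not a full markdown parser.
MATH_PATTERN = re.compile(r"\$\$.+?\$\$|\$[^$\n]+?\$", re.DOTALL)


def mask_equations(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each equation with a numbered placeholder such as [[EQ0]]."""
    equations: Dict[str, str] = {}

    def _replace(match: "re.Match[str]") -> str:
        key = f"[[EQ{len(equations)}]]"
        equations[key] = match.group(0)
        return key

    return MATH_PATTERN.sub(_replace, text), equations


def unmask_equations(text: str, equations: Dict[str, str]) -> str:
    """Put the original equations back in place of their placeholders."""
    for key, equation in equations.items():
        text = text.replace(key, equation)
    return text


source = "The energy is $E = mc^2$ and $$a^2 + b^2 = c^2$$ holds."
masked, equations = mask_equations(source)
print(masked)
# A translation step (human or machine) would work on `masked` here.
restored = unmask_equations(masked, equations)
print(restored == source)
```

The translators would only see the placeholders, which they can move within the sentence as the target language requires.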

Single Source Publishing https://github.com/singlesourcepub/community/wiki/Announcement-Blog

   * Work on collaborative translation?

       * Weblate, pandoc, OCR(?)

       * Scientific markdown as the translation format (is text, so would work well with existing software for code). This is a nice collaborative scientific markdown pad. https://mur2.co.uk/editor

* Translator acknowledgement


At an Open Access Barcamp I (VV) met the developer of pandoc and Zettlr. Pandoc could do a large part of the output (and maybe some of the input) a collaborative translation tool would need. We have agreed to have a call next month.

  • https://pandoc.org
  • It would be useful to gather more information on when translations are legal. In the US they tend to be fair use.
  • Ben’s post on how to make translations will also be helpful in thinking about what such a system would look like.
  • The user feedback in interactive machine translations is used to improve the system
  • Using machine learning for a first draft translation would require considerable resources: memory, computing power and bandwidth
    • Maybe we could collaborate with the European Open Science Cloud (which is not European but global, not just about open science, and not really a cloud but more a network and standards :-) )
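As a rough idea of how pandoc could handle the output side, the sketch below converts a translated markdown file into other formats by calling the pandoc command line tool from Python. The file names and the helper names `pandoc_command` and `export_translation` are hypothetical; only the pandoc options `--from`, `--standalone` and `--output` are real, and pandoc has to be installed for the export to run.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Sequence


def pandoc_command(source: Path, target: Path, from_format: str = "markdown") -> List[str]:
    """Build the pandoc call; pandoc infers the output format from the
    extension of the target file."""
    return [
        "pandoc",
        "--from", from_format,
        "--standalone",
        "--output", str(target),
        str(source),
    ]


def export_translation(source: Path, extensions: Sequence[str] = ("html", "docx")) -> List[Path]:
    """Convert one translated markdown file into several output formats."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    outputs = []
    for extension in extensions:
        target = source.with_suffix("." + extension)
        subprocess.run(pandoc_command(source, target), check=True)
        outputs.append(target)
    return outputs


# Show the command that would be run for a hypothetical German translation.
print(pandoc_command(Path("article.de.md"), Path("article.de.html")))
```

Calling `export_translation(Path("article.de.md"))` would then write `article.de.html` and `article.de.docx` next to the source file.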

Comment on Open Science Feed

davidpomerenke made a useful comment on the Open Science Feed that would be helpful with the input:

   I've recently coded an unpublished project on scientific citation mining, and for that purpose I had looked a bit into tools for converting PDFs to more useful formats.
   I ended up using Grobid, which converts the PDF to a very detailed XML format. The format is not a word processing format though, but a format specifically for representing scientific documents. I don't know, if it would, for example, contain tags about bold or italicized text. The tool is working really well, but since you probably cannot use the output XML format directly, it will need some postprocessing, which would be relatively simple with XML parsing libraries.
   An alternative is pdfextract by Crossref. They probably use this to build their own large database. It also works really well and gives you some JSON that would probably need less postprocessing than Grobid. I didn't use it for some minor technical reason that I forgot.
   pdffigures2 is from the team behind Semantic Scholar, and they probably use it to extract the figures that they show in their search engine. It only extracts figures and their captions and no other things. I don't recall whether the other tools can also extract figures, but if not, then this will be a perfect supplement.
   Another alternative that's on my list but that I didn't try is Cermine.
   There are some more tools that specialize in mining only the citations, but I found them to be less powerful (although perhaps more performant) than Grobid.
   Many publishers also publish a supplementary HTML version these days, which may be an acceptable format or at least easy to convert to other formats with pandoc. I have also seen that authors upload the Latex source along with the PDF on Arxiv, but I don't know how common that is.
   Another current project which is not directly related to your question but which you may find cool is ScholarPhi, where they try to annotate PDFs with useful semantic information.
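The postprocessing with XML parsing libraries mentioned in the comment could look roughly like the sketch below. Grobid writes TEI XML; the sample here is hand-written and heavily simplified for illustration (real Grobid output contains many more elements), and the function name `extract_segments` is hypothetical. It pulls headings and paragraphs out of the document body as a flat list of segments to translate.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample in the style of TEI; not actual Grobid output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>Translations widen access to <hi>science</hi>.</p>
        <p>Teams can share the work.</p>
      </div>
    </body>
  </text>
</TEI>"""


def extract_segments(tei_xml: str) -> List[str]:
    """Return the text of all headings and paragraphs in the body, in order."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    wanted = {f"{{{TEI_NS['tei']}}}head", f"{{{TEI_NS['tei']}}}p"}
    segments = []
    for element in body.iter():
        if element.tag in wanted:
            # Flatten inline markup and normalise whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                segments.append(text)
    return segments


print(extract_segments(SAMPLE_TEI))
# prints ['Introduction', 'Translations widen access to science.', 'Teams can share the work.']
```

Flattening the inline markup loses formatting such as italics, so a real tool would have to decide how to carry that information through the translation.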