Collaborative translation

From Translate Science
Latest revision as of 15:04, 23 November 2021

One of the ideas to promote the translation of scientific articles is to create a collaborative translation tool. Producing a good translation takes several skills: command of two languages and knowledge of the topic. Such a combination is easier to find in a team, and working in a team is motivating. People regularly make partial translations and stop once they know enough for themselves. It would be great if others could finish the job.

Existing translation tools

There are already many translation tools, but none really fits this use case. There is machine translation; for many language pairs this works quite well nowadays, but scientific texts can trip the system up, and accuracy is very important. So a machine translation is at best a first draft that an expert should check. Where its quality is good enough, people will need neither a collaborative translation tool nor a translation published in a repository and findable via our translations database.

There are computer-aided translation packages that help (teams of) professional translators. These systems work with office file formats and HTML. They tend to be proprietary and thus cannot be improved to fit our use case; the only exception seems to be OmegaT, a FOSS project coded in Java. They break the text up into segments (paragraphs) that are translated piece by piece. They have many tricks that would be worthwhile to implement in a collaborative system as well, such as databases of phrases that ensure terms are translated consistently. Their data formats can also serve as inspiration. A remaining problem could be that such systems are intended for professionals and may have a steep learning curve (https://polyglot.city/@Stoori/106670443229138474).
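
The segment-by-segment workflow and the phrase database described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the blank-line segmentation rule, the simple substring matching, and the one-entry English to Dutch glossary. Real computer-aided translation packages use more elaborate segmentation and matching.

```python
import re


def split_into_segments(text):
    """Split a text into paragraph segments, using blank lines as separators."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def glossary_issues(source_segments, target_segments, glossary):
    """List glossary terms that were not translated as agreed.

    Returns (segment index, source term, expected translation) for every
    glossary term that occurs in a source segment while the agreed
    translation is missing from the matching target segment.
    """
    issues = []
    for index, (src, tgt) in enumerate(zip(source_segments, target_segments)):
        for term, translation in glossary.items():
            if term.lower() in src.lower() and translation.lower() not in tgt.lower():
                issues.append((index, term, translation))
    return issues


# Hypothetical example data: two paragraphs and their Dutch draft translation.
source = "Rainfall was measured daily.\n\nRainfall increased in autumn."
target = "De neerslag werd dagelijks gemeten.\n\nDe regenval nam toe in de herfst."
glossary = {"rainfall": "neerslag"}  # agreed translation of a key term

issues = glossary_issues(
    split_into_segments(source), split_into_segments(target), glossary
)
print(issues)
```

In this example the second segment is flagged, because the translator used a different word than the one agreed in the glossary. A collaborative tool could show such flags next to the segment instead of rejecting the translation.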

Tools for the translation of software packages (their user interface and documentation) are often collaborative, tend to be easy to use and are sometimes even somewhat gamified. In addition, these systems are often free software and could thus be improved. They may therefore come closest to a collaborative tool for scientific articles, but they only work with nicely structured text files for input and output, while scientific articles have equations, tables, references and figures and are often not even available in a text format, only as some sort of PDF file.

Putting it all together

So a collaborative translation tool would live on the internet and combine features of the computer aided translation packages and the software translation packages, while having additional tools to parse article PDFs into a text format and afterwards put the translated article together.

In the future, Single Source Publishing (https://github.com/singlesourcepub/community/wiki/Announcement-Blog) will make input and output easier, and some journals already provide scientific articles in HTML or XML formats, but for now we will also need tools to work with PDFs. For old articles we will even have to work with PDFs that are just scans of paper articles and need OCR.

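An online tool that converts PDFs into a text format is PaperToHTML (https://papertohtml.org/). On the Open Science Feed someone mentioned several tools that may work (https://www.reddit.com/r/Open_Science/comments/pvgj3y/project_to_rebuild_papers_with_plaintext_markup/heci5zd/, archived at https://web.archive.org/web/*/https://www.reddit.com/r/Open_Science/comments/pvgj3y/project_to_rebuild_papers_with_plaintext_markup/heci5zd/). There is Grobid (https://github.com/kermitt2/grobid), which converts the PDF to a very detailed XML format but may not preserve formatting such as italics. Allen AI, the team behind Semantic Scholar, built on Grobid to create their own tool (https://github.com/allenai/s2orc-doc2json/), which may be easier to use. Crossref also has a tool called pdfextract (https://gitlab.com/crossref/pdfextract) that converts PDFs into JSON. A further option is Cermine (https://github.com/CeON/CERMINE). There are also Python tools to extract text from PDFs (https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/).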
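
The XML that a tool such as Grobid produces still needs some postprocessing before it can be offered to translators. The following is a minimal sketch of that step using Python's standard XML parser. It assumes that the output is TEI XML in the namespace http://www.tei-c.org/ns/1.0 with the article text in paragraph elements inside the body; the sample document is hand-written for illustration and is not real Grobid output, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the converter output uses the TEI namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def extract_paragraphs(tei_xml):
    """Return the body paragraphs of a TEI document as plain-text strings."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return []
    paragraphs = []
    for paragraph in body.iter(f"{{{TEI_URI}}}p"):
        # Join the text of nested elements and normalise whitespace.
        text = " ".join("".join(paragraph.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Hand-written minimal sample, only meant to show the shape of the input.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Example</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <div><head>Introduction</head>
        <p>Translations widen the reach of science <ref type="bibr">[1]</ref>.</p>
        <p>Many articles exist only as PDF.</p>
      </div>
    </body>
  </text>
</TEI>"""

paragraphs = extract_paragraphs(SAMPLE_TEI)
for number, text in enumerate(paragraphs, start=1):
    print(number, text)
```

Each extracted paragraph could then become one segment in the translation tool. Headings, references and figures would need similar, separate handling.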

Both for input and output, pandoc (https://pandoc.org/), the Swiss Army knife of document conversion, can be helpful. Zettlr (https://www.zettlr.com/) is a scientific Markdown editor that builds on pandoc. Scientific Markdown could be a good option as the translation format because it is plain text and would thus work well with existing software built for code, while it can handle equations, tables, figures and references natively. There is also already a collaborative scientific Markdown pad (https://mur2.co.uk/editor).
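
As an illustration of how pandoc could be called from a web tool, here is a minimal Python sketch that converts a publisher's HTML version of an article into Markdown. The file names and function names are hypothetical; the sketch only uses pandoc's -f (input format), -t (output format) and -o (output file) options and assumes that pandoc is installed on the server.

```python
import shutil
import subprocess


def pandoc_command(source_path, target_path, from_format, to_format):
    """Build the pandoc command line for converting one file."""
    return ["pandoc", "-f", from_format, "-t", to_format, "-o", target_path, source_path]


def convert(source_path, target_path, from_format="html", to_format="markdown"):
    """Run pandoc; raises an error if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on the PATH")
    subprocess.run(
        pandoc_command(source_path, target_path, from_format, to_format),
        check=True,
    )


# Only build and show the command here; call convert() to actually run pandoc.
command = pandoc_command("article.html", "article.md", "html", "markdown")
print(" ".join(command))
```

The same helper could be used in the other direction, for example from Markdown to HTML, to put the translated article together.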

The collaborative tool should allow for communication between translators in general (for coordination of the work and community building) and for discussions on specific translated sentences. Preferably this communication would work for two people as well as for entire classes jointly translating an article. It should be possible to upload partial translations, and there should be a page listing partial translations where people can help out.
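
One possible data model for partial translations, sentence-level discussion and the overview page is sketched below in Python. All class, field and function names are hypothetical, and the ordering rule (unfinished projects closest to completion first) is just one option among many.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One paragraph of the article with its translation and discussion."""
    source: str
    translation: str = ""
    comments: list = field(default_factory=list)  # (author, remark) pairs


@dataclass
class TranslationProject:
    title: str
    target_language: str
    segments: list

    def progress(self):
        """Fraction of segments that already have a translation."""
        if not self.segments:
            return 0.0
        done = sum(1 for segment in self.segments if segment.translation.strip())
        return done / len(self.segments)


def needs_help(projects):
    """Unfinished projects for the overview page, closest to completion first."""
    partial = [project for project in projects if project.progress() < 1.0]
    return sorted(partial, key=lambda project: project.progress(), reverse=True)


# Hypothetical example data.
half_done = TranslationProject("A", "nl", [Segment("One.", "Een."), Segment("Two.")])
finished = TranslationProject("B", "nl", [Segment("One.", "Een.")])
nearly_done = TranslationProject(
    "C", "nl",
    [Segment("One.", "Een."), Segment("Two.", "Twee."),
     Segment("Three.", "Drie."), Segment("Four.")],
)
half_done.segments[0].comments.append(("reviewer", "Should this be 'Eén'?"))

overview = [project.title for project in needs_help([half_done, finished, nearly_done])]
print(overview)
```

Keeping the comments on the segment keeps the discussion attached to the sentence it is about, while general coordination could live in a separate forum or chat.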

It would save a lot of time for many translations to have a first draft translated by machine learning tools. This draft should be checked by humans for accuracy, but a scientific article does not have to be beautiful prose, only clear. The user feedback in interactive machine translation can be used to improve the system and make it better at translating scientific works. The latter would require running the machine translation ourselves. It was suggested that this may require considerable resources (memory, computing power and bandwidth); maybe these could be obtained by collaborating with the European Open Science Cloud.

Points to ponder

Coming from the academic tradition, I would expect that translators would like to be named. However, we have seen with Wikipedia that many people are willing to work on big projects without getting personal credit, or at least without it being easily noticed. Not connecting your name and reputation to a translation can also lower the barrier to participate. One could try both systems, but I expect that most translators will be academics (which is an assumption we should check) and would like credit. Still, it is a good idea to implement this in a way that makes it easy to opt out of being named.

Another design question is whether to use plain text or Markdown for the translation. Using plain text would mean that words which are printed bold or italic would not have this markup in the translation (or it would have to be added in post-processing, after the system outputs an office, LaTeX, XML or Markdown file). For equations, tables and figures it would mean that the reader of the translation would need to use the original as well and would only have translation support for captions, axis labels, legends and so on as descriptive text. The advantage is that plain text is much easier for the translator to handle.

Using Markdown would tempt the translator to also recreate the equations and tables. The figures would still be as in the original, with the translator describing the axes, legends, etc. Making equations and tables is quite involved and costs time even for someone well versed in Markdown. Maybe we can set up the system so that translators would normally not try to do this, by copying equations, tables and figures into the translation unchanged and asking only for a description. Using Markdown for italics and bold is quite easy to learn, and where they are present in the original they are important to reproduce.
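
One way to copy equations, tables and figures into the translation unchanged is to replace them with placeholders before the text reaches the translator and to restore them afterwards. The Python sketch below illustrates this idea. The placeholder format [[KEEP:n]], the function names and the regular expressions are assumptions for illustration; the patterns cover only dollar-delimited equations, Markdown images and simple pipe tables, and would for instance mistake two dollar amounts on one line for an equation.

```python
import re

# Spans the translator should not have to touch (illustrative patterns only).
PROTECTED = re.compile(
    r"\$\$.+?\$\$"                            # display equations
    r"|\$[^$\n]+\$"                           # inline equations
    r"|!\[[^\]]*\]\([^)]*\)"                  # images (figures)
    r"|(?:^\|[^\n]*\|[ \t]*(?:\n|$))+",       # blocks of pipe-table rows
    re.DOTALL | re.MULTILINE,
)


def mask(markdown):
    """Replace protected spans by numbered placeholders; return text and spans."""
    protected = []

    def stash(match):
        protected.append(match.group(0))
        return f"[[KEEP:{len(protected) - 1}]]"

    return PROTECTED.sub(stash, markdown), protected


def unmask(translated, protected):
    """Put the original equations, figures and tables back into the translation."""
    return re.sub(
        r"\[\[KEEP:(\d+)\]\]",
        lambda match: protected[int(match.group(1))],
        translated,
    )


original = (
    "The energy is $E = mc^2$ as shown in ![Figure 1](fig1.png).\n"
    "\n"
    "| x | y |\n"
    "|---|---|\n"
    "| 1 | 2 |\n"
)
masked, protected = mask(original)
print(masked)

# The translator only sees and moves the placeholders.
translated = "Die Energie ist [[KEEP:0]], wie in [[KEEP:1]] gezeigt.\n\n[[KEEP:2]]"
print(unmask(translated, protected))
```

The translator can still be asked for a description of each protected item, for example as a caption placed next to the placeholder.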