Collaborative translation

From Translate Science
Revision as of 15:29, 26 September 2021 by VictorVenema (talk | contribs) (Stub)

One of the ideas to promote the translation of scientific articles is to create a collaborative translation tool.

A quality translation needs a good understanding of both languages and of the topic. Such a combination is easier to find in a team than in one person. Also, people often make partial translations and stop once they know enough for their own purposes. It would be great if others could continue the work.

Comment on Open Science Feed

davidpomerenke made a useful comment on the Open Science Feed that is relevant to getting articles into such a tool:

I've recently coded an unpublished project on scientific citation mining, and for that purpose I had looked a bit into tools for converting PDFs to more useful formats.

   I ended up using Grobid, which converts the PDF to a very detailed XML format. The format is not a word processing format though, but a format specifically for representing scientific documents. I don't know whether it would, for example, contain tags for bold or italicized text. The tool works really well, but since you probably cannot use the output XML format directly, it will need some postprocessing, which would be relatively simple with XML parsing libraries.
   An alternative is pdfextract by Crossref. They probably use this to build their own large database. It also works really well and gives you some JSON that would probably need less postprocessing than Grobid. I didn't use it for some minor technical reason that I forgot.
   pdffigures2 is from the team behind Semantic Scholar, and they probably use it to extract the figures that they show in their search engine. It only extracts figures and their captions and no other things. I don't recall whether the other tools can also extract figures, but if not, then this will be a perfect supplement.
   Another alternative that's on my list but that I didn't try is Cermine.
   There are some more tools that specialize in mining only the citations, but I found them to be less powerful (although perhaps more performant) than Grobid.

Many publishers also publish a supplementary HTML version these days, which may be an acceptable format or at least easy to convert to other formats with pandoc. I have also seen that authors upload the LaTeX source along with the PDF on arXiv, but I don't know how common that is.

Another current project which is not directly related to your question but which you may find cool is ScholarPhi, where they try to annotate PDFs with useful semantic information.
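
The postprocessing step mentioned in the comment for Grobid's XML output might look roughly like the sketch below. It rests on assumptions: Grobid writes TEI XML (namespace http://www.tei-c.org/ns/1.0), and the sample document here is a simplified, hand-written stand-in, not real Grobid output. The function name extract_sections and the returned structure are hypothetical, chosen only for illustration. The idea is to pull out the title and the section paragraphs as plain text, which is the unit a collaborative translation tool would probably work on.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, hand-written sample (an assumption, not real Grobid output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An example article</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head>
        <p>First paragraph with a <ref type="bibr">[1]</ref> citation.</p>
        <p>Second paragraph.</p>
      </div>
      <div><head>Methods</head>
        <p>Only paragraph.</p>
      </div>
    </body>
  </text>
</TEI>
"""


def plain_text(element):
    """Return the element's text, including nested tags, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def extract_sections(tei_xml):
    """Hypothetical helper: return (title, sections) from a TEI XML string.

    Each section is a dict with a 'heading' and a list of 'paragraphs'.
    """
    root = ET.fromstring(tei_xml)

    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = plain_text(title_el) if title_el is not None else ""

    sections = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        sections.append({
            "heading": plain_text(head) if head is not None else "",
            "paragraphs": [plain_text(p) for p in div.findall("tei:p", TEI_NS)],
        })
    return title, sections


if __name__ == "__main__":
    title, sections = extract_sections(SAMPLE_TEI)
    print(title)
    for section in sections:
        print(section["heading"], len(section["paragraphs"]))
```

Splitting the text into paragraphs keeps each unit small enough for one contributor to translate and for another to pick up where the first stopped. Inline markup such as citation references is flattened to text here; a real tool would probably want to preserve it.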