Just as a tailored dress ensures a better fit, a machine translation engine achieves better results when it is prepared for or adapted to a specific field (domain), such as medicine or engineering.
What does the adaptation of such a machine translation engine look like in practice and what results can you expect?
The process of adapting a general machine translation (MT) engine is called domain adaptation. We usually start by training a translation model on general data for a given language pair. This data includes, for example, translated news articles, application manuals, Wikipedia articles, European Parliament speeches, film subtitles and much more. You can try the resulting basic (general) translation engine at translator.lingea.com.
Next, training data from the given domain has to be obtained. The domain can be quite broad, such as health, tourism or online sale of services, but it can also be more specific, such as washing machine user manuals. Ideally we use parallel data, i.e. data where both the source sentences and their translations are available. In some cases, even texts available only in the target language can be used for effective adaptation. In any case, a large amount of data is needed: at least tens of thousands of sentences, and preferably hundreds of thousands. This data can then be used directly for training, as well as for selecting other suitable translations from general corpora based on text similarity.
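To make the idea of similarity-based selection more concrete, here is a minimal sketch in Python. It is only an illustration, not a description of the selection method actually used: it assumes a simple bag-of-words cosine similarity, and the stop-word list, the threshold of 0.2, the function names and the sample sentences (with placeholder translations) are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "on", "in", "and", "do", "not"}


def tokenize(sentence):
    """Lowercase the sentence, split it into words and drop stop words."""
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(domain_sentences, general_pairs, threshold=0.2):
    """Keep general-corpus pairs whose source side resembles the domain data."""
    # One word-count profile for the whole in-domain sample.
    domain_profile = Counter()
    for sentence in domain_sentences:
        domain_profile.update(tokenize(sentence))

    selected = []
    for source, target in general_pairs:
        score = cosine(Counter(tokenize(source)), domain_profile)
        if score >= threshold:
            selected.append((source, target, score))
    # Most similar pairs first.
    selected.sort(key=lambda item: item[2], reverse=True)
    return selected


# Hypothetical in-domain sample (washing machine manuals).
domain_sentences = [
    "Set the washing machine to the spin cycle.",
    "Clean the detergent drawer of the washing machine regularly.",
]
# Hypothetical general corpus; the target sides are placeholders.
general_pairs = [
    ("Do not overload the washing machine drum.", "<translation 1>"),
    ("The parliament voted on the budget yesterday.", "<translation 2>"),
]

selected = select_similar(domain_sentences, general_pairs)
for source, target, score in selected:
    print(f"{score:.2f}  {source}")
```

With this toy data, only the sentence about the washing machine drum passes the threshold, while the parliament sentence shares no content words with the domain sample and is dropped. A real selection step would likely rely on a stronger similarity measure and a far larger domain sample.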
Once the training corpus is ready, we can carry out the domain adaptation itself. In essence, it consists of additional training of the general model on the selected domain data. A small part of the data, kept aside beforehand, is used to monitor the translation quality obtained. Based on that, we make further changes to the data preparation and model training until satisfactory results are achieved.
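Keeping part of the data aside can be sketched as follows. This is a simplified illustration under stated assumptions: the 5% held-out fraction, the fixed random seed, the function name and the synthetic corpus are hypothetical, and the training and evaluation steps themselves are not shown.

```python
import random


def split_held_out(pairs, held_out_fraction=0.05, seed=13):
    """Split sentence pairs into a training part and a held-out part.

    The held-out part is never used for training, only for checking
    translation quality after each round of adaptation.
    """
    if not 0 < held_out_fraction < 1:
        raise ValueError("held_out_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A fixed seed keeps the split identical across experiments, so that
    # quality scores from different training runs stay comparable.
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Synthetic stand-in for a domain corpus of (source, translation) pairs.
corpus = [(f"source sentence {i}", f"target sentence {i}") for i in range(1000)]
train_pairs, held_out_pairs = split_held_out(corpus)
print(len(train_pairs), len(held_out_pairs))  # prints: 950 50
```

Reusing the same held-out set throughout is what makes it possible to tell whether a change in data preparation or training actually improved the translations.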