Journal of Evaluation Output Quality of Machine Translation


An Evaluation of Output Quality of Machine Translation (Padideh Software vs. Google Translate)


Haniyeh Sadeghi Azer  and Mohammad Bagher Aghayi (Corresponding author) 

Journal: Advances in Language and Literary Studies, ISSN: 2203-4714, Vol. 6 No. 4; August 2015, Australian International Academic Centre, Australia


-    People all over the world need a language to communicate with others, but sometimes they do not know each other’s language, so a person or a tool is needed to translate the source language into the target language.
-    Human translators are not always available or easy to find. Also, the amount of written material that a person can translate in a given time is very limited. The translation process is very time-consuming, and having a human translator is costly. Therefore, searching for alternative methods of translation is crucial.
-    Using computers for translation offers a solution to these costly and time-consuming processes, which otherwise have to be carried out by human translators. The purpose of machine translation (MT) is to reduce the cost of the translation process and increase the quality of the translated material.
-    Translating one language into another by computer is not an easy task. A human language is a very complicated system, so machine translation involves a great deal of complicated analysis and manipulation, and despite the advances made in this field, the task has not yet been fully accomplished.
-    The evaluation of machine translation systems is an important field of research for optimizing the performance and effectiveness of MT systems. There is a range of different approaches for evaluating MT systems; progress in the field of machine translation relies on assessing the quality of a new system through systematic evaluation. The evaluation strategy adopted in this study is human evaluation.


The aim of the research is to find out which program produces relatively better output when dealing with diverse text types in the English-to-Persian translation direction, and to assess the acceptability and usability of that output for end-users.
The study evaluates the translation quality of two machine translation systems in translating six different text types from English to Persian. The evaluation was based on criteria proposed by Van Slype (1979). The proposed model for evaluation is a black-box, comparative and adequacy-oriented evaluation.
To conduct the evaluation, a questionnaire was given to end-users, who evaluated the outputs to determine whether the machine-generated translations are intelligible and acceptable from their point of view, and which of the translations produced by Padideh software and Google Translate is more acceptable and useful from the end-users’ point of view.


-    The focus is on manual corpus analysis and human judgments of machine-generated translation. The study reports an evaluation of the output quality of two prevalent English-Persian MT programs, namely Padideh software and Google Translate.
-    Most of the time, users of MT cannot select an MT system suited to their needs and their purpose for using MT.
-    Arnold et al. (1994) indicate that the purchase of an MT system is in many cases a costly affair and requires careful consideration. It is important to understand the organizational consequences and to be aware of the system’s capacities. Evaluation of MT systems helps to inform users about their usability and acceptability.
-    The research design employed in this study builds on previous work by Van Slype (1979). The evaluation criteria are taken from Georges Van Slype’s (1979) method for evaluating the quality of machine translation from the perspective of acceptability and usability for end-users.
-    The evaluation made in this research focuses on the quality of the output, i.e., the translations produced by two prevalent English-Persian MT programs. The two translation programs are evaluated by implementing Van Slype’s (1979) method for evaluating machine translation.
-    In 1979, Van Slype compiled a comprehensive critical review of MT evaluation methods on behalf of Bureau Marcel van Dijk for the Commission of the European Communities, which had set up a program aimed at “lowering the barriers between the languages of the Community” (Van Slype, 1979, p.11). The purposes of Van Slype’s study were: to document the kinds of methodologies being employed at that time in MT evaluation; to make some recommendations to the Commission, amongst other things, on the methodology it should use when evaluating its machine translation systems; and to conduct research which would help in the long term with the efficiency of these evaluations.
-    The report distinguished between two levels of evaluation: macroevaluation (or total evaluation) determines the acceptability of a system, compares the quality of two systems or two versions of the same system, and assesses the usability of a system; while microevaluation (or detailed evaluation) determines the improvability of a system.
-    Macroevaluation: This level of evaluation concerns itself with the assessment of the system’s overall performance. It aims at examining the acceptability of a translation system, comparing the quality of two translation systems or two versions of the same system, and/or assessing the usability of a translation system (Van Slype, 1979, pp.12 and 21).
-    Van Slype (1979) broke down the various criteria into ten classes, assembled in turn into four groups according to the level at which they approach the quality of the translation.
•    Cognitive level (effective communication of information and knowledge).
  1. Intelligibility
  2. Fidelity
  3. Coherence
  4. Usability
  5. Acceptability
•    Economic level (excluding costs).
  1. Reading time
  2. Correction time
  3. Translation time
•    Linguistic level (conformity with a linguistic model)
•    Operational level (effective operation).

Description of criteria and methods of macroevaluation, used in this study:
 Cognitive level: 

1. Intelligibility: 

Van Slype (ibid) defines the criterion as: Subjective evaluation of the degree of comprehensibility and clarity of the translation. Measurement of intelligibility by rating sentences on a 4-point scale.
o    Method:
  • Submission of a text sample in several versions (original text, MT without and with post-editing, human translation with and without revision) to a group of evaluators; the texts being distributed so that each evaluator:
  1. Receives only one of each of the versions of the texts.
  2. Receives a series of sentences in sequence (sentences in their context).
  • Rating of each sentence according to a 4-point scale.
  • Calculation of the average of the ratings per text and version, with and without weighting as a function of the number of words in each sentence evaluated.
o    Scale of intelligibility:
  • 3: Very intelligible: all the content of the message is comprehensible, even if there are errors of style and/or of spelling, and if certain words are missing, or are badly translated, but close to the target language.
  • 2: Fairly intelligible: the major part of the message passes.
  • 1: Barely intelligible: a part only of the content is understandable, representing less than 50% of the message.
  • 0: Unintelligible: nothing or almost nothing of the message is comprehensible.
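
The averaging step of the method above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `average_rating` and the sample ratings and word counts are hypothetical, and the source does not say how the original study computed its averages.

```python
from typing import Optional, Sequence


def average_rating(ratings: Sequence[int],
                   word_counts: Optional[Sequence[int]] = None) -> float:
    """Average sentence ratings on the 0-3 scale for one text and version.

    Without word_counts the plain mean is returned; with word_counts each
    sentence rating is weighted by the number of words in that sentence.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("ratings must be on the 4-point scale 0..3")
    if word_counts is None:
        return sum(ratings) / len(ratings)
    if len(word_counts) != len(ratings):
        raise ValueError("one word count per rated sentence is required")
    total_words = sum(word_counts)
    if total_words <= 0:
        raise ValueError("total word count must be positive")
    weighted_sum = sum(r * w for r, w in zip(ratings, word_counts))
    return weighted_sum / total_words


# Hypothetical example: three sentences rated 3, 2 and 1,
# containing 10, 5 and 5 words respectively.
unweighted = average_rating([3, 2, 1])
weighted = average_rating([3, 2, 1], [10, 5, 5])
```

Weighting by sentence length lets a long, well-understood sentence count for more than a short unintelligible one, which is why the two averages can differ for the same text.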

2. Fidelity: 

Van Slype (ibid) defines fidelity as: Subjective evaluation of the measure in which the information contained in the sentence of the original text reappears without distortion in the translation. The fidelity rating should, generally, be equal to or lower than the intelligibility rating, since the unintelligible part of the message is of course not found in the translation. Any variation between the intelligibility rating and the fidelity rating is due to additional distortion of the information, which can arise from:
•    A loss of information (silence) (example: word not translated).
•    Interference (noise) (example: word added by the system).
•    A distortion from a combination of loss and interference (example: word badly translated).
Measurement of fidelity by rating on a 4-point scale:
o    Method:
  • Submission of a sample of original texts, with the corresponding translations, to one or more evaluators. 
  • Successive examination of each sentence, in the first place in the translation, then in the original text.
  • Rating of the fidelity, sentence by sentence.
  • Calculation of the average of the fidelity ratings.
o    Scale of fidelity:
  • 3: Completely or almost completely faithful.
  • 2: Fairly faithful: more than 50 % of the original information passes in the translation.
  • 1: Barely faithful: less than 50 % of the original information passes in the translation.
  • 0: Completely or almost completely unfaithful.
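
Since the fidelity rating is generally expected to be equal to or lower than the intelligibility rating, one possible consistency check on a rater's sheet is to list the sentences where fidelity was scored higher. The sketch below is an assumption for illustration only; the function name and the sample ratings are hypothetical and are not part of Van Slype's method or of the study.

```python
from typing import List, Sequence


def fidelity_above_intelligibility(intelligibility: Sequence[int],
                                   fidelity: Sequence[int]) -> List[int]:
    """Return the indices of sentences whose fidelity rating exceeds
    their intelligibility rating (both on the 0-3 scale)."""
    if len(intelligibility) != len(fidelity):
        raise ValueError("both lists must cover the same sentences")
    return [
        index
        for index, (intel, fid) in enumerate(zip(intelligibility, fidelity))
        if fid > intel
    ]


# Hypothetical ratings for three sentences: only the second sentence
# (index 1) has a fidelity rating above its intelligibility rating.
flagged = fidelity_above_intelligibility([3, 2, 1], [3, 3, 0])
```

A flagged sentence is not necessarily an error, because the expectation is stated only as a general rule, but it may be worth a second look by the evaluator.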

3. Coherence: 

One author only, Y. WILKS (cited in Van Slype 1979), proposes this criterion:
o    Definition of the criterion:
The quality of a translation can be assessed by its level of coherence without the need to study its correctness as compared to the original text. Once a sufficiently large sample is available, the probability that the translation should be at the same time coherent and totally wrong is very low.
o    Method of evaluation:
Y. WILKS does not indicate, unfortunately, how in practice it is possible to rate the coherence of a text. He notes that an original text may or may not be coherent; this means that any assessment of the coherence of its MT version may not be absolute, based on the MT alone, but must be relative, as compared to the coherence of the source text. But then one is once again compelled to use bilingual evaluators.

4. Usability: 

Definition of the criterion:
One author, W. LENDERS (cited in Van Slype 1979), defines usability (which he also calls applicability) as the possibility to make use of the translation. Another, P. ARTHERN (cited in Van Slype 1979), defines usability, as far as a translation service is concerned, as revisibility.
o    Method: B.H. Dostert (ibid): Measurement of the quality by direct questioning of the final users.

5. Acceptability: 

Definition of the criterion: Van Slype defines acceptability as “a subjective assessment of the extent to which a translation is acceptable to its final user” (ibid, p.92). Van Slype maintains that acceptability can be effectively measured only by a survey of final users and this is illustrated in his suggested subjective evaluation, the second of two methods for evaluating acceptability in the report:
o    Measurement of acceptability by analysis of user motivation, and
o    Measurement of acceptability by direct questioning of users.
Measurement of acceptability by direct questioning of users:
o    Method:
•    Submission of a sample of MT with the original texts and the corresponding HTs, to a sample of potential users. 
•    Questions asked (among others).
• Do you consider the translation of these documents to be acceptable, knowing that it comes from a computer and that it can be obtained within a very short time, of the order of half a day?
  • In all cases.
  • In certain circumstances (to be specified).
  • Never.
  • For myself.
  • For certain of my colleagues.
• Would you be interested in having access to a system of machine translation providing texts of the quality of those shown to you?

6. Reading time: 

Reading time can be assessed in various ways. Van Slype (ibid) proposes measuring the time spent by the evaluator in reading each text of the sample.

-    The Corpus

The corpus selected for this study consists of six different text types chosen for English-to-Persian MT and evaluation. The text types are: 1) Kid’s Story 2) Political Text 3) Computer Science Text 4) Legal Text 5) A Poem as a Literary Text 6) A Webpage.
The six texts are complete texts that have not been separated from their context. The SL texts were collected from university textbooks and Internet websites. Most of these texts were selected on the basis of being rich in domain-specific terminology. Each of the sample texts was translated once by Padideh software and once by Google Translate.

Subject of the Study:

Two English-Persian machine translation programs (Padideh software and Google Translate) were selected as the subject of this research. The research evaluates only the output quality of these machine translation programs. Different text types were selected in order to examine the translation produced by each program.

Research Questions:

RQ1: Are machine-generated translations of diverse text types intelligible and acceptable from the point of view of end-users?
RQ2: Which one of the machine-generated translations produced by Padideh software and Google Translate is more acceptable and useful from the end-users’ point of view?
The aim of the present study is to establish whether target-language translations of six different text types produced by two prevalent machine translation programs (Google Translate and Padideh translator) are considered intelligible and acceptable from the point of view of end-users (RQ1), and which one of the machine-generated translations produced by them is more acceptable and useful from their point of view (RQ2).
These research questions are investigated through human evaluation of machine translation output. Therefore, in order to meet these aims, the study uses a human evaluation model to conduct end-user evaluations of diverse text types.


This research used a quantitative research design.
The corpus analysis techniques and the interview questionnaire design are intended to be valid and reliable. In order to minimize errors, the analysis of the corpus was conducted systematically, and the design of the interview questionnaire builds on the work of Van Slype (1979).
The proposed model for the functional attributes is a black-box, superficial, comparative and adequacy-oriented evaluation. In other words, there is no interaction with the systems tested and the goal is to determine whether the output is actually helpful to the user groups in question.
On the basis of the tasks relevant to the end-users’ needs in this study, only six functional quality characteristics have been investigated. These are: ‘intelligibility’, ‘fidelity’, ‘coherence’, ‘usability’, ‘acceptability’ and ‘reading time’.
In this work, the black-box evaluation has been chosen due to the fact that commercial MT systems can only be evaluated by this approach (Volk, 2001). Consequently, there has been no access to the inner workings of these systems. Even so, it is desirable to be able to draw from such an evaluation enough conclusions about the various system components.

Data analysis:

Detailed analyses and classifications of the results for the various criteria are presented in tables and charts. The questionnaire used in this study was carefully analysed to ensure that the data gathered was presented clearly.
A detailed analysis, based on a black-box, superficial and adequacy-oriented (declarative) evaluation of six text types for each of the two MT systems, yields the results.
These results are classified and presented on the basis of: 
•    Variation in scores between raters.
•    Comparison of systems for text types.
•    Average of scores of raters.
•    Percentage of scores of raters.
The results of applying the evaluation methods to test the criteria take into consideration the grades on the scoring scale, the total score value, and the average score value with respect to each rater and each of the tested MT systems. The evaluation results are reported in tables, which show the distribution of the scores obtained from the investigation of text types for each of the quality characteristics and MT systems.
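
The classification described above (average and percentage of rater scores, variation between raters, comparison of systems) can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function name `summarise_scores`, the use of the range of per-rater means as the measure of variation, and all scores shown are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, List

MAX_SCORE = 3  # top of the 4-point (0-3) scale


def summarise_scores(scores: Dict[str, Dict[str, List[int]]]) -> Dict[str, dict]:
    """Summarise ratings given as {system: {rater: [scores, ...]}}.

    For each system this returns the mean score of every rater, the
    average of those rater means, that average as a percentage of the
    maximum score, and the spread (max minus min) between rater means.
    """
    summary = {}
    for system, by_rater in scores.items():
        rater_means = {rater: mean(vals) for rater, vals in by_rater.items()}
        values = list(rater_means.values())
        overall = mean(values)
        summary[system] = {
            "rater_means": rater_means,
            "average": overall,
            "percentage": 100 * overall / MAX_SCORE,
            "spread": max(values) - min(values),
        }
    return summary


# Hypothetical scores from two raters for two unnamed systems.
example = summarise_scores({
    "system_a": {"rater_1": [3, 3], "rater_2": [1, 2]},
    "system_b": {"rater_1": [2, 2], "rater_2": [2, 2]},
})
```

Averaging per rater first gives every rater equal weight; if raters scored different numbers of sentences, this differs from pooling all scores into one mean, so the choice should match how the tables are meant to be read.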

Analysis and Classification of Results

This part is the most important step: calculating the human judgments collected through the questionnaire. The evaluators were asked to consider each text and its machine-translated outputs and to examine the parameters provided in the questionnaire. The scores assigned to each parameter by the evaluators are shown in tables, and for better analysis the results are also presented in charts for each parameter.
There are sixteen evaluators.


The findings indicate that the machine-generated translations are intelligible and acceptable to end-users for certain text types, and that Google Translate is more acceptable from the end-users’ point of view.
