Journal Review: Evaluation of the Output Quality of Machine Translation

Title;

An Evaluation of Output Quality of Machine Translation (Padideh Software vs. Google Translate)

Authors;

Haniyeh Sadeghi Azer and Mohammad Bagher Aghayi (Corresponding author)

Journal;

Advances in Language and Literary Studies, ISSN: 2203-4714, Vol. 6 No. 4; August 2015, Australian International Academic Centre, Australia

Background;

-    People all over the world need a language to communicate with others, but sometimes they do not know each other’s language, so a person or a tool is needed to translate the source language into the target language.
-    Human translators are not always available or easy to find. The amount of written material that one person can translate in a given time is also very limited. The translation process is time-consuming, and hiring a human translator is costly. Searching for alternative translation methods is therefore crucial.
-    Using computers for translation offers a solution to the costly and time-consuming work that would otherwise be done by a human translator. The purpose of machine translation (MT) is to reduce the cost of the translation process and increase the quality of the translated material.
-    Translating one language into another by computer is not an easy task. A human language is a very complicated system, so machine translation involves a great deal of complicated analysis and manipulation, and despite the advances made in this field, the problem has not yet been solved.
-    The evaluation of MT systems is an important field of research for optimizing their performance and effectiveness. There is a range of approaches to evaluating MT systems, and progress in the field relies on assessing the quality of each new system through systematic evaluation. The evaluation strategy adopted in this study is human evaluation.

Aim;

The aim of the research is to find out which program produces relatively better output when translating diverse text types from English to Persian, and how acceptable and usable that output is for end-users.
The study evaluates the translation quality of two machine translation systems in translating six different text types from English to Persian. The evaluation is based on criteria proposed by Van Slype (1979). The proposed evaluation model is black-box, comparative, and adequacy-oriented.
To conduct the evaluation, a questionnaire was given to end-users, who rated the outputs to determine whether the machine-generated translations are intelligible and acceptable from their point of view, and which of the translations produced by Padideh software and Google Translate is more acceptable and useful from their point of view.

Theory;

-    The focus is on manual corpus analysis and human judgments of machine-generated translation. The study reports an evaluation of the output quality of two prevalent English-Persian MT programs, namely Padideh software and Google Translate.
-    Most of the time, users of MT cannot select an MT system that matches their needs and their purpose for using MT.
-    Arnold et al. (1994) indicate that the purchase of an MT system is in many cases a costly affair and requires careful consideration. It is important to understand the organizational consequences and to be aware of the system’s capacities. Evaluating MT systems helps to inform users about their usability and acceptability.
-    The research design employed in this study builds on previous work by Van Slype (1979). The evaluation criteria are taken from Georges Van Slype’s (1979) method for evaluating the quality of machine translation from the perspective of acceptance and usability for end-users.
-    The evaluation made in this research focuses on the quality of the output, i.e., the translations produced by two prevalent English-Persian MT programs. These two translation programs are evaluated by applying Van Slype’s (1979) method for evaluating machine translation.
-    In 1979, Van Slype compiled a comprehensive critical review of MT evaluation methods on behalf of Bureau Marcel van Dijk for the Commission of the European Communities, which had set up a program aimed at “lowering the barriers between the languages of the Community” (Van Slype, 1979, p.11). The purposes of that review were: to document the kinds of methodologies being employed at that time in MT evaluation; to make recommendations to the Commission on, among other things, the methodology it should use when evaluating its machine translation systems; and to conduct research that would help in the long term with the efficiency of these evaluations.
-    The report distinguished between two levels of evaluation: macroevaluation (or total evaluation) determines the acceptability of a system, compares the quality of two systems or two versions of the same system, and assesses the usability of a system; microevaluation (or detailed evaluation) determines the improvability of a system.
-    Macroevaluation: this level of evaluation concerns the assessment of the system’s overall performance. It aims at examining the acceptance of a translation system, comparing the quality of two translation systems or two versions of the same system, and/or assessing the usability of a translation system (Van Slype, 1979, pp.12 and 21).
-    Van Slype (1979) broke down the various criteria into ten classes, assembled in turn into four groups according to the level at which they approach the quality of the translation.
•    Cognitive level (effective communication of information and knowledge).

Intelligibility
Fidelity
Coherence
Usefulness
Acceptability

• Economic level (excluding costs).

Reading time
Correction time
Translation time

• Linguistic level (conformity with a linguistic model)
• Operational level (effective operation).

Description of criteria and methods of macroevaluation, used in this study:
Cognitive level:

1. Intelligibility:

Van Slype (ibid) defines the criterion as: subjective evaluation of the degree of comprehensibility and clarity of the translation. Intelligibility is measured by rating sentences on a 4-point scale.
o Method:

Submission of a text sample in several versions (original text, MT without and with post-editing, human translation with and without revision) to a group of evaluators; the texts being distributed so that each evaluator:

Receives only one of each of the versions of the texts.
Receives a series of sentences in sequence (sentences in their context).

Rating of each sentence according to a 4-point scale.

Calculation of the average of the ratings per text and version, with and without weighting as a function of the number of words in each sentence evaluated.

o Scale of intelligibility:

3: Very intelligible: all the content of the message is comprehensible, even if there are errors of style and/or of spelling, and if certain words are missing, or are badly translated, but close to the target language.
2: Fairly intelligible: the major part of the message passes.
1: Barely intelligible: only a part of the content is understandable, representing less than 50% of the message.
0: Unintelligible: nothing or almost nothing of the message is comprehensible.
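
The averaging step in the method above can be sketched in Python as follows. This is only an illustration, not code from the study: the function names, the sample ratings, and the word counts are hypothetical, and the weighted variant assumes that each sentence's rating is weighted by that sentence's word count.

```python
def mean_rating(ratings):
    """Unweighted average of sentence-level ratings (0-3 scale)."""
    return sum(ratings) / len(ratings)


def weighted_mean_rating(ratings, word_counts):
    """Average of sentence ratings, weighted by the number of words per sentence."""
    total_words = sum(word_counts)
    return sum(r * w for r, w in zip(ratings, word_counts)) / total_words


# Hypothetical ratings for four sentences of one text version.
ratings = [3, 2, 0, 1]
word_counts = [10, 20, 5, 15]

unweighted = mean_rating(ratings)
weighted = weighted_mean_rating(ratings, word_counts)
print(unweighted, weighted)
```

In this made-up sample the weighted average is higher than the unweighted one, because the unintelligible sentence is also the shortest and so counts for less.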

2. Fidelity:

Van Slype (ibid) defines fidelity as: subjective evaluation of the extent to which the information contained in a sentence of the original text reappears without distortion in the translation. The fidelity rating should generally be equal to or lower than the intelligibility rating, since the unintelligible part of the message is of course not found in the translation. Any difference between the intelligibility rating and the fidelity rating is due to additional distortion of the information, which can arise from:
•    A loss of information (silence) (example: word not translated).
•    Interference (noise) (example: word added by the system).
•    A distortion from a combination of loss and interference (example: word badly translated).
Measurement of fidelity by rating on a 4-point scale:
o    Method:

Submission of a sample of original texts, with the corresponding translations, to one or more evaluators.
Successive examination of each sentence, in the first place in the translation, then in the original text.
Rating of the fidelity, sentence by sentence.
Calculation of the average of the fidelity ratings.

o Scale of fidelity:

3: Completely or almost completely faithful.
2: Fairly faithful: more than 50 % of the original information passes in the translation.
1: Barely faithful: less than 50 % of the original information passes in the translation.
0: Completely or almost completely unfaithful.
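
As a rough sketch of the fidelity calculation, the Python snippet below averages sentence-level fidelity ratings and flags any sentence whose fidelity rating exceeds its intelligibility rating, which the definition above says should generally not happen. The function names and the ratings are hypothetical and are not taken from the study.

```python
def mean_fidelity(fidelity):
    """Average of sentence-level fidelity ratings (0-3 scale)."""
    return sum(fidelity) / len(fidelity)


def fidelity_above_intelligibility(intelligibility, fidelity):
    """Indices of sentences rated more faithful than intelligible."""
    return [
        i
        for i, (intel, fid) in enumerate(zip(intelligibility, fidelity))
        if fid > intel
    ]


# Hypothetical ratings for the same four sentences.
intelligibility = [3, 2, 1, 0]
fidelity = [3, 1, 2, 0]

average = mean_fidelity(fidelity)
suspicious = fidelity_above_intelligibility(intelligibility, fidelity)
print(average, suspicious)
```

A flagged sentence does not prove that a rating is wrong. It only marks a pair of ratings that an evaluator may want to look at again.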

3. Coherence:

One author only, Y. WILKS (cited in Van Slype 1979), proposes this criterion:
o Definition of the criterion:
The quality of a translation can be assessed by its level of coherence without the need to study its correctness as compared to the original text. Once a sufficiently large sample is available, the probability that the translation is at the same time coherent and totally wrong is very low.
o Method of evaluation:
Unfortunately, Y. WILKS does not indicate how the coherence of a text can be rated in practice. He notes that an original text may itself be incoherent; this means that an assessment of the coherence of its MT version cannot be absolute, based on the MT alone, but must be relative to the coherence of the source text. But then one is once again compelled to use bilingual evaluators.

4. Usability:

o Definition of the criterion:
One author, W. LENDERS (cited in Van Slype 1979), defines usability (which he also calls applicability) as the possibility of making use of the translation. Another, P. ARTHERN (cited in Van Slype 1979), defines usability, as far as a translation service is concerned, as revisability.
o Method: B. H. Dostert (ibid): measurement of the quality by direct questioning of the final users.

5. Acceptability:

o Definition of the criterion: Van Slype defines acceptability as “a subjective assessment of the extent to which a translation is acceptable to its final user” (ibid, p.92). Van Slype maintains that acceptability can be effectively measured only by a survey of final users, and this is illustrated in his suggested subjective evaluation, the second of the two methods for evaluating acceptability in the report:
o    Measurement of acceptability by analysis of user motivation, and
o    Measurement of acceptability by direct questioning of users.
Measurement of acceptability by direct questioning of users:
o    Method:
•    Submission of a sample of MT output, with the original texts and the corresponding human translations (HTs), to a sample of potential users.
•    Questions asked (among others):
• Do you consider the translation of these documents to be acceptable, knowing that it comes from a computer and that it can be obtained within a very short time, of the order of half a day?

In all cases.
In certain circumstances (to be specified).
Never.
For myself.
For certain of my colleagues.

• Would you be interested in having access to a system of machine translation providing texts of the quality of those shown to you?
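
One simple way to summarize answers to the first question is to report the share of respondents choosing each option. The Python sketch below assumes each respondent picks exactly one of the three frequency options listed above; the function name and the responses are hypothetical and do not reproduce the study's data.

```python
from collections import Counter

# Answer options taken from the question above.
OPTIONS = ["In all cases", "In certain circumstances", "Never"]


def answer_percentages(answers):
    """Percentage of respondents choosing each option."""
    counts = Counter(answers)
    total = len(answers)
    return {option: 100.0 * counts[option] / total for option in OPTIONS}


# Hypothetical responses from eight users.
answers = (
    ["In all cases"] * 2
    + ["In certain circumstances"] * 5
    + ["Never"] * 1
)

percentages = answer_percentages(answers)
print(percentages)
```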

6. Reading time:

Reading time can be assessed in various ways. Van Slype (ibid) suggests measuring the time the evaluator spends reading each text of the sample.

- The Corpus

The corpus selected for this study consists of six different text types chosen for English-to-Persian MT and evaluation. The text types are: 1) Kid’s Story 2) Political Text 3) Computer Science Text 4) Legal Text 5) A Poem as a Literary Text 6) A Webpage.
The corpus is made up of six complete texts that have not been separated from their context. The source-language (SL) texts were collected from university textbooks and Internet websites. Most of these texts were selected because they are rich in domain-specific terminology. Each sample text was translated once by Padideh software and once by Google Translate.

Subject Object:

Two English-Persian machine translation programs (Padideh software and Google Translate) were selected as the subject of this research. The research evaluates only the output quality of these programs. Different text types were selected in order to examine the translation produced by each program.

Research Question;

RQ1: Are machine-generated translations of diverse text types intelligible and acceptable from the end-users’ point of view?
RQ2: Which of the machine-generated translations produced by Padideh software and Google Translate is more acceptable and useful from the end-users’ point of view?
The aim of the present study is to establish whether target-language translations of six different text types, produced by two prevalent machine translation programs (Google Translate and Padideh translator), are considered intelligible and acceptable from the end-users’ point of view (RQ1), and which of the machine-generated translations is more acceptable and useful from their point of view (RQ2).
These research questions are investigated through human evaluation of machine translation output. To meet these aims, the study uses a human evaluation model to conduct end-user evaluations of diverse text types.

Methodology;

This research used a quantitative research design.
The corpus analysis techniques and the design of the interview questionnaire are considered valid and reliable. To minimize errors, the analysis was conducted systematically on the corpus, and the questionnaire design builds on the work of Van Slype (1979).
The proposed model for the functional attributes is a black-box, superficial, comparative, and adequacy-oriented evaluation. In other words, there is no interaction with the internals of the systems tested, and the goal is to determine whether the output is actually helpful to the user groups in question.
On the basis of the tasks relevant to the end-users’ needs in this study, only six functional quality characteristics were investigated: ‘intelligibility’, ‘fidelity’, ‘coherence’, ‘usability’, ‘acceptability’ and ‘reading time’.
In this work, black-box evaluation was chosen because commercial MT systems can only be evaluated by this approach (Volk, 2001). Consequently, there was no access to the inner workings of these systems. Even so, it is desirable to be able to draw from such an evaluation enough conclusions about the various system components.

Data analysis;

Detailed analyses and classifications of the results for the various criteria are presented in tables and charts. The questionnaire used in this study was carefully analysed to ensure that the data gathered were presented clearly.
The results come from a detailed black-box, superficial, adequacy-oriented (declarative) evaluation of six text types for each of the two MT systems.
These results are classified and presented on the basis of:
•    Variation in scores between raters.
•    Comparison of systems for text types.
•    Average of scores of raters.
•    Percentage of scores of raters.
The results of applying the evaluation methods to test the criteria take into account the grades on the scoring scale, the total score, and the average score for each rater and each of the tested MT systems. The evaluation results are reported in tables, which show the distribution of the scores obtained for each text type, quality characteristic, and MT system.
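
The kinds of summary listed above (variation between raters, average score, and percentage of each grade) can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's actual analysis: the function name, the system labels, and the scores are hypothetical; the 0-3 scale is the one given above for intelligibility and fidelity; and "variation" is taken here simply as the range between the highest and lowest score.

```python
# Grades of the 4-point scale used for intelligibility and fidelity.
SCALE = (0, 1, 2, 3)


def summarize(scores):
    """Average, range, and grade percentages for one system's rater scores."""
    total = len(scores)
    return {
        "average": sum(scores) / total,
        "range": max(scores) - min(scores),
        "percentages": {g: 100.0 * scores.count(g) / total for g in SCALE},
    }


# Hypothetical scores from four raters for one text type and one criterion.
scores_by_system = {
    "system_a": [1, 2, 2, 3],
    "system_b": [2, 3, 3, 2],
}

summary = {name: summarize(s) for name, s in scores_by_system.items()}
for name, stats in summary.items():
    print(name, stats)
```

Comparing the two summaries side by side gives the per-text-type comparison of systems. A standard deviation could replace the range if a finer measure of disagreement between raters is wanted.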

Analysis and Classification of Results

This part is the most important step: calculating the human judgments collected with the questionnaire. The evaluators were asked to consider each text and its machine-translated outputs and to rate the parameters provided in the questionnaire. The scores assigned to each parameter by the evaluators are shown in tables, and for easier analysis the results are also presented in charts for each parameter.
There were sixteen evaluators.

Conclusion;

The findings indicate that the machine-generated translations are intelligible and acceptable to end-users for certain text types, and that Google Translate is more acceptable from the end-users’ point of view.
