In this essay I will explore the steps taken to develop a psychological assessment measure, along with the steps taken to adapt an assessment measure for cross-cultural application. When developing a measure it is important first to clarify the subject that the test has to cover: what exactly do you want to measure? You also need to decide on the number and type of questions to include, how long the test will take, and the scoring procedures. You should make sure the wording is clear and understandable, and practise administering and marking the test.
The measure needs to be planned carefully, the items need to be decided upon, and the initial version of the measure needs to be administered so that the real effectiveness of each item is discovered. Once this is done you should administer the test to a group of people so that the measure’s validity, reliability and norms can be established. Once you understand why the test is being developed you need to look in more detail at the steps involved. The first step is the planning phase, in which we specify the aim of the measure. This covers the purpose of the measure and the attributes and characteristics that it will measure.
The aim also covers the test outcomes, such as the decisions that will be made based on the results. What is the target population for the measure, for example elderly people over 60 years of age, and which language should be used? You should also consider the type of measure: will it be computer-based or paper-based, and will it be normative, ipsative (the test taker compares desirable options and chooses the most preferred) or criterion-referenced (performance is measured against a fixed set of criteria or learning standards, i.e. what test takers are expected to know or be able to do at that stage)?
Within the planning phase we also need to define the content of the measure, for example by the rational method, which draws on a theoretical study of the different viewpoints on what is being measured. A construct map should be developed to ensure that the concept is clearly established. Keep the purpose in view: why are you making the assessment measure? The test developer should ideally use more than one approach when deciding on the content of the measure. The more detailed and analytical this work is, the better grounded in theory the measure will be. Finally, in the planning phase we need to develop a test plan.
The two aspects that are considered in the test plan are ‘a stimulus to which a test taker responds and a mechanism for response’ (McIntyre and …). Some frequent item formats are open-ended items, multiple choice items, true or false items, sentence completion items, and an essay or oral presentation. When deciding between these item formats you should consider the audience, for example a large or small group, the age group, and whether the test takers are university students or elderly people. For example, it would make more sense to use an open-ended essay format for a small group of university students than for a large group of elderly test participants.
There are many response formats for a test measure. In objective formats, such as multiple choice or true or false, one option is correct. In subjective formats the participant responds verbally or by rating, or pictures are presented and test takers respond by telling a story (as in the Thematic Apperception Test). The time frame and length of the test should be decided, and you should also check whether bias is introduced through any item or response option. This is particularly important if you are using the measure with a multicultural group of respondents.
It is better to offer a choice of languages when administering tests, to avoid any perception of bias in this area. On another note, test participants may be a source of bias themselves, for example by disagreeing with every question or always marking a particular letter, i.e. “D”, in a multiple choice test. This is hard to control but should be minimized by the test developers where possible. The next step in developing a measure is the item writing stage. This stage continuously looks back to the purpose and specifications provided for the measure.
When writing the items there are some do’s and don’ts. Use direct wording that gets straight to the point of the question; we don’t want to unintentionally confuse participants. Use a level of language suited to the audience being tested, i.e. complex words may not be as easily understood by children or teenagers as they would be by university graduates. The position of the correct answer should vary, statements should be approximately the same length, and the nature of each question should be relevant. You should then have the items reviewed by experts, who assess them and give feedback.
Use the feedback given to develop your questionnaire further so that it is better and more effective. The third step is to assemble and pre-test the experimental version of the measure. You should arrange the items sensibly and in a logical order. They should be grouped or arranged on the necessary pages in the test booklet. The length of the test needs to be finalized. The time participants need to read the items must be considered, especially if each question has a time constraint. For paper-based tests you need to decide whether to use a separate answer booklet or have answers recorded in the test booklet itself.
Administration instructions should be written, and finally the experimental version of the measure should be pre-tested. The administration guidelines need to be clear and specific, as we don’t want them to have a negative effect on the results. The test should now be ready to be administered to a sample from the target population. The next phase is the item analysis phase; this phase uses statistical methods to identify any test items that are not working well. For example, if an item is too easy, too hard, fails to show any difference between participants, or is not being scored correctly, the analysis will show it.
There are two common statistics. The first is item difficulty, which looks at the proportion of participants who answered the item correctly (Foxcroft, 2013). The second is discriminating power, which measures how well the item discriminates between participants who are knowledgeable in the area and those who are not (Foxcroft, 2013). An item-total correlation can then be calculated; this is the correlation between the question score and the overall assessment score.
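To make these three statistics concrete, here is a minimal sketch in Python. The response data, the function names and the choice to compare the top-scoring half with the bottom-scoring half are all assumptions made for illustration; they are not taken from the source text, and other definitions of the discrimination index exist.

```python
from math import sqrt

# Hypothetical scored responses, invented for illustration:
# one row per participant, one column per item (1 = correct, 0 = incorrect).
RESPONSES = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]


def item_difficulty(responses, item):
    """Proportion of participants who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def discrimination_index(responses, item):
    """Proportion correct in the top-scoring half minus the bottom-scoring half."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    p_upper = sum(row[item] for row in upper) / half
    p_lower = sum(row[item] for row in lower) / half
    return p_upper - p_lower


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return cov / sqrt(ss_x * ss_y)


def item_total_correlation(responses, item):
    """Correlation between the item score and the total assessment score."""
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    return pearson(item_scores, totals)


if __name__ == "__main__":
    for i in range(len(RESPONSES[0])):
        print(i, item_difficulty(RESPONSES, i), discrimination_index(RESPONSES, i))
```

An item with a difficulty close to 0 or 1, or a discrimination index close to 0, would be a candidate for review in the item analysis phase.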
You would expect a participant who gets a question correct to have a higher overall assessment score, and a participant who gets it wrong to have a lower one. Item Response Theory (IRT) looks at the difficulty level and discriminatory power of an item. The IRT results can be presented as an item characteristic curve, which represents the probability of endorsing the item as a function of the respondent’s ability. When you are developing a measure in a multicultural country like South Africa you should look at item bias in the early stages of developing the measure.
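The item characteristic curve mentioned above can be sketched as a small function. The text does not say which IRT model is meant, so the two-parameter logistic model used here is an assumption, as are the function and parameter names.

```python
import math


def icc_2pl(ability, difficulty, discrimination=1.0):
    """Probability of endorsing an item under a two-parameter logistic model.

    ability        -- the respondent's standing on the trait (often written theta)
    difficulty     -- the ability level at which the probability is 0.5
    discrimination -- how steeply the probability rises around the difficulty
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    # Trace the curve for a hypothetical item of average difficulty.
    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(icc_2pl(theta, difficulty=0.0, discrimination=1.5), 3))
```

A more difficult item shifts the curve to the right, and a more discriminating item makes it steeper around the difficulty point.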
We would then identify items for the final pool. Classical test theory, differential item functioning (DIF) and IRT analyses can be used to decide which items should be removed and which kept in the final version of the test. The next stage is to revise and standardize the final version of the measure. In this phase we revise the items and do final testing. Any items that proved problematic during the item analysis phase need to be looked at in this phase. Items should then be selected for the final version; by this stage information regarding difficulty, discrimination and bias has been gathered, and the selection is based on it.
Any changes to the administration instructions and scoring procedures are then completed. Finally, the final version is administered to a sample of the target population. Technical evaluation and establishing norms is the next phase. It is important in this stage that you test reliability and validity. Test reliability is how consistently a test measures a characteristic; for example, if the same person took the test twice, would they get a similar score or a different one? If the scores are similar or the same, the measure is said to be reliable. In essence it is the extent to which test scores are free from measurement error.
It is important now that the different types of validity and reliability are computed. Norms consist of data that make it possible to determine the relative standing of an individual who has taken the test. A result by itself has little meaning; a test score should almost always be compared with the scores of a similar group (the norm group), for example individuals of a similar sex, age or social class. Some norm scales that can be used are percentiles and standard scores. As an example of percentiles, if an individual’s IQ is at the 20th percentile, it means that only 20% of the standardization sample received scores at or below the score the individual attained.
A standard score shows how far the participant’s raw score deviates from the average of the normative sample. Standard scores are expressed in terms of the mean and standard deviation of the normative group. A cut-score scale sets a point such that scores at or above it are interpreted differently from scores below it. With norm-referenced measures each test taker’s performance is interpreted with reference to a standard norm group. The raw score of an individual is looked at so that we can position it in relation to the normative sample.
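The two norm scales described above can be worked through with a small sketch. The norm sample below is invented for illustration, and the function names are my own; the percentile rank follows the "at or below" definition used in the text, and the standard score is computed as a z-score.

```python
from statistics import mean, stdev

# Hypothetical raw scores from a normative sample, invented for illustration.
NORM_SAMPLE = [85, 90, 95, 100, 100, 105, 110, 115, 120, 130]


def percentile_rank(score, norm_scores):
    """Percentage of the norm sample scoring at or below the given score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


def z_score(score, norm_scores):
    """Distance of the raw score from the norm mean, in standard deviations."""
    return (score - mean(norm_scores)) / stdev(norm_scores)


if __name__ == "__main__":
    print(percentile_rank(90, NORM_SAMPLE))  # 2 of 10 scores are at or below 90
    print(round(z_score(120, NORM_SAMPLE), 2))
```

A raw score of 90 sits at the 20th percentile of this sample, which matches the worked example in the text: only 20% of the sample scored at or below it.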
Two subgroups can be used when establishing norms: the applicant pool and the incumbent population. The final step is to publish the measure and refine it continuously. We must publish a test manual and submit the measure for classification. When preparing the test manual it is important to cover some key points: the purpose of the measure, as established above; who the measure can be given to; the final length of the test and the estimated time for completion; how the scoring will be calculated; and any administration points that need to be outlined.
The manual should also cover the test development process that was followed, information about reliability and validity, and what exactly the findings were. It should explain how bias testing was conducted, and when and how norm groups were established; for example, the sample characteristics might be location, gender, cultural background, etc. You should also include information about cut-off scores, how local norms should be established, and finally how performance on the measure should be interpreted (Foxcroft, 2013, pp. …-80). You should then submit the measure for classification to the Psychometrics Committee of the Professional Board for Psychology so that they can decide whether it should be classified as a psychological measure or not.

Discuss issues related to the reliability of a psychological measure

Reliability is the consistency with which a psychometric test measures whatever it measures. For example, if someone weighed themselves several times throughout the day they would expect to see a similar reading each time. Scales that measured weight differently each time would be of little use. If findings from research are replicated consistently they are reliable. A correlation coefficient can be used to assess the degree of reliability.
If a measure is reliable it should show a positive correlation. This doesn’t mean that exactly the same results should be obtained each time, but a strong positive correlation between results of the same test indicates reliability. We also need to consider that you can never know a person’s ‘true score’. Many factors could influence a person’s score, for example their emotional state, tiredness, the temperature, the room and so on. The simple equation X (observed score) = T (true score) + e (error score) means that the variability of your measure is the sum of the variability due to the true score and the variability due to random error.
It is a simple yet important model for measurement; it reminds us that measurement has an error component, and it is the foundation of reliability theory. There are five types of reliability coefficient. Test-retest reliability – the test is given to the same person on two separate occasions; for example, a test designed to assess students’ learning in psychology is given to them twice, two weeks apart. The reliability coefficient is the correlation between the scores on test 1 and test 2. It does have its limitations: the testing circumstances discussed above may differ between the two occasions, and memory could play a part as well.
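A test-retest coefficient of this kind can be sketched in a few lines. The two sets of scores below are invented for illustration, and the function name is my own; the calculation is an ordinary Pearson correlation between the first and second administration.

```python
from math import sqrt

# Hypothetical scores for the same six students, tested two weeks apart.
TEST_1 = [55, 62, 70, 48, 81, 66]
TEST_2 = [58, 60, 73, 50, 79, 69]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return cov / sqrt(ss_x * ss_y)


if __name__ == "__main__":
    print(round(pearson(TEST_1, TEST_2), 3))
```

The scores are not identical across the two occasions, yet the students keep roughly the same rank order, so the coefficient is high. This is the sense in which a reliable measure need not reproduce exact results.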
Its advantage is that it is a straightforward and obvious method for determining the reliability of a measure. Alternative form reliability – two forms of the test are administered and the correlation between them is calculated; the scoring procedure and number of items should be exactly the same. The advantage of this is that the person’s answers are compared across slightly different versions of the questions. You may reverse the order of the response choices, for example. One limitation is that if the behaviour functions under consideration are subject to a large practice effect, the use of alternative forms will reduce but not eliminate that effect.
It is highly likely that individuals will differ in the amount of improvement, owing to the extent of their previous practice with similar material and their motivation in taking the test. Under these conditions the practice effect represents another source of variance that will reduce the correlation between the two test forms. Alternative forms are also unavailable for many tests because of the practical difficulties of making two equivalent forms. Split-half reliability – you split the measure into two equal halves and compute the correlation coefficient between the scores on the two halves.
One challenge with this method is how to split the test to obtain the most equivalent halves. In some cases the odd-numbered and even-numbered items are used rather than the first and second halves of the test. It also gives you a coefficient based on only half the test, whereas test-retest and alternative form reliability give you a coefficient based on the full number of items in the test. Inter-item consistency – is influenced by two sources: the error variance of the content sampling and the similarity of the behaviour domain sampled. This is used for tests whose items are either right or wrong, or multiple choice.
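The two internal approaches above can be sketched as follows. The response data are invented for illustration. The text does not name specific formulas, so two standard ones are assumed here: the Spearman-Brown formula, which adjusts the half-test correlation to estimate reliability for the full-length test, and the Kuder-Richardson formula 20 (KR-20) for inter-item consistency on right/wrong items.

```python
from math import sqrt

# Hypothetical right/wrong responses: one row per participant, six items.
RESPONSES = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


def half_correlation(responses):
    """Correlation between scores on the odd-numbered and even-numbered items."""
    odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5
    even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6
    return pearson(odd_half, even_half)


def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 1 (right) or 0 (wrong)."""
    n, k = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n  # proportion correct
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


if __name__ == "__main__":
    r_half = half_correlation(RESPONSES)
    print(round(r_half, 3), round(spearman_brown(r_half), 3), round(kr20(RESPONSES), 3))
```

The Spearman-Brown step is one common answer to the limitation noted above, that a split-half correlation reflects only half the test length.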
Inter-scorer reliability – is the degree of agreement between the scores given by different raters. This is used where tests consist of open-ended questions, assignments, etc. Examiners marking school and university exams are assessed on a regular basis to ensure that they all adhere to the same standards. It would be extremely unfair to fail an exam because the marker was having a bad day. The reliability coefficient typically ranges from 0 to 1. A number closer to 1 shows that there is high reliability; a low reliability coefficient shows more error in the assessment results, usually due to the temporary factors that we discussed previously.
Reliability is considered good if it is above .80; this indicates a strong relationship. The statistical significance of the correlation is indicated by a probability value of less than 0.05. This means that the probability of obtaining such a correlation coefficient by chance is less than 5 times out of 100, so a relationship is taken to be present. For example, a result of -0.08 with a probability value below 0.05 would indicate a statistically significant relationship between class size and reading score (http://satisfactoriness. Bloodspot. Awe/2010/04/interpretation-of- relation. HTML). The techniques for assessing the internal consistency of a test are the split-half and inter-item consistency methods described above.

Once the Psychometrics Committee has given its approval, the measure is ready to be published and marketed. Close attention to detail is needed at this stage to ensure there are no misprints. There is no set time frame in which you should revisit the measure; it will depend on the case. There are assessment practitioners available who assess the quality of assessment measures.

Adapting assessment measures is essential in multicultural and multilingual societies. Cross-cultural adaptation looks at language (translation) and at the cultural adaptation issues that arise in preparing a questionnaire for use in another setting.
An example of this is in my workplace, where colleagues are required to take a ‘colleague engagement survey’. With over 40 different nationalities working here, it is essential that the test is available in different languages and takes into account different backgrounds. It is beneficial to adapt assessment measures: it ensures fairness, it reduces costs and saves time compared with developing a new measure, and it aids comparative studies between different cultures and languages. I learnt through my own experience that it is important to ensure that there are no communication issues when translating into different languages. For example, ‘do you have a best friend at work’ was translated into German with a different meaning, and therefore caused confusion amongst the German colleagues. You also need to consider the format of the test and make sure all cultures will be familiar with it, i.e. true or false. Finally, remember to consider time frames and limits; some cultures may view completing the task quickly as better, whereas others take the opposite view.
Setting up this survey required many different skill sets and a lot of expertise, especially for the translation. When adapting an assessment measure for cross-cultural application it is important to follow these steps. Review construct equivalence in the languages and cultures of interest: this involves clarifying the concepts and ensuring they will still be equivalent in the target language and for the target audience. This is essential because the new measure will need to reflect core concepts appropriate to the target culture.
Decide whether test adaptation / translation is the best strategy: adapting and translating a test can raise many challenges. For example, there may not be an equivalent term in the target language, idiomatic expressions may not translate literally, or the use of a negative term can confuse test takers. Some tests translate more easily into certain languages than others. The more similar the target language and culture, the easier the adaptation will be; for example, English to Arabic may be harder than English to Spanish.
When no cross-cultural comparisons are needed it may be easier to produce a new test that meets those cultural parameters than to accept less than satisfactory results. Choose well-qualified translators: you should select translators who are fluent in both languages and familiar with the cultures under study. Prior training of the translators in test construction may aid the process. It is also advisable to use two independent translators, followed by reconciliation of both versions by a third party. Using this strategy increases the chances that problems are discovered before a test adaptation is finalized.
Translate/adapt the test using an appropriate design: for example a judgmental design, which is based on the decision of a group of individuals on the degree to which the measures are similar. These individuals have the relevant expertise and are very familiar with the cultures and language groups under consideration. The two common designs are forward translation and backward translation. In a forward-translation design the translators aim to find an equivalent for the word or phrase, not a literal translation. They should try to be clear and concise in forming the question, and they should aim for the most common target audience.
Backward translation is when the test is translated back into English by an independent translator whose first language is English and who has no previous connection with the survey. The emphasis here is on cultural equivalence rather than linguistic equivalence, and any mistakes found should then be worked through. Review the adapted version and make necessary changes: in a forward-translation design a set of translators examines the adapted version of the test for any errors which may result in differences in interpretation between the two language versions.
Focusing on the quality of the translation is also important. With the backward-translation design, translators take the adapted version of the test and translate it back into the source language; decisions are then based on the similarity of the original and back-translated versions of the test. Conduct a tryout of the adapted version of the test: it is important to remember that a judgmental review is not sufficient evidence to establish the validity of a test in a second language.
A pilot test consists of administering the test as well as interviewing the examinees to obtain their criticisms of the test itself, the instructions, the time limits and other factors. Any variations or issues are tested during this phase. Conduct a validation investigation: although translators are capable of fixing errors in adapted tests, many problems still go unnoticed until the test items are field tested. The test should be field tested using a large sample of individuals representative of the target population (the population with which it will eventually be used).
During this stage you should also check that the items function similarly in both the adapted and source language versions of the test; this can be achieved through an item bias study. If there are items that function differently for each group when the groups are matched on ability, they can be eliminated from the test, or they can be re-administered and re-analysed. The Muñiz, Hambleton and Xing (2001) study highlights the fact that even small samples can be useful in detecting flaws in the translation/adaptation process, because the problems of poor translations are often large and therefore easy to detect (A-porter, pg 69). Place scores of both the translated and original test on a common scale: at this stage a linking design is needed to place the test scores from the different versions of the test on a common scale. The popular linking designs are the bilingual group design, the matched monolingual group design and the monolingual group design. The final step is to document the process and prepare a manual for the test users. The manual includes specifics such as how to administer the test and how to interpret the test scores.
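The item bias check described in the validation step, comparing groups matched on ability, can be sketched as below. The text does not name a statistic, so the Mantel-Haenszel common odds ratio is used here as one widely used option; the counts, the function name and the idea of matching on total-score bands are assumptions made for illustration.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across ability-matched strata.

    Each stratum is a tuple of counts for test takers with similar total scores:
    (source_correct, source_incorrect, adapted_correct, adapted_incorrect).
    A value near 1 suggests the item behaves alike in both language versions.
    """
    numerator = 0.0
    denominator = 0.0
    for src_right, src_wrong, adp_right, adp_wrong in strata:
        n = src_right + src_wrong + adp_right + adp_wrong
        if n == 0:
            continue  # skip empty score bands
        numerator += src_right * adp_wrong / n
        denominator += src_wrong * adp_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


# Hypothetical counts for two score bands (low scorers, high scorers).
UNBIASED_ITEM = [(10, 10, 10, 10), (15, 5, 15, 5)]
SUSPECT_ITEM = [(15, 5, 10, 10), (18, 2, 15, 5)]

if __name__ == "__main__":
    print(mantel_haenszel_odds_ratio(UNBIASED_ITEM))
    print(mantel_haenszel_odds_ratio(SUSPECT_ITEM))
```

In this invented example the second item is easier in the source language than in the adapted version even for test takers of similar ability, which is the pattern that would prompt a review of its translation.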