Reading Mode - Questions

Teaching QUESTION #6767

A testing agency wants to determine whether a new aptitude test produces scores consistent with scores from an already-validated ability test administered simultaneously. Which reliability/validity method is being employed?

Test-retest reliability
Concurrent criterion validity✔️
Predictive criterion validity
Split-half reliability

Correct Answer Logic:

Concurrent validity is a form of criterion validity where two measures are administered at the same time (concurrently) and their scores are correlated. If the new test scores align with the validated test, this is evidence of concurrent validity. It is neither reliability (which involves the same test repeated or split) nor predictive validity (which requires a future criterion).
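
The correlation step described above can be sketched in a few lines of Python. This is only an illustration: the two score lists are hypothetical data for six examinees, and the helper name pearson_r is an assumption, not something taken from the source.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores: the same six examinees sat both tests in one session.
new_aptitude = [52, 61, 47, 70, 65, 58]
validated_ability = [50, 63, 45, 72, 60, 59]

# A high positive coefficient would count as evidence of concurrent validity.
validity_coefficient = pearson_r(new_aptitude, validated_ability)
print(round(validity_coefficient, 2))
```

A coefficient close to 1 means the new test ranks examinees much as the validated test does; how high is "high enough" is a judgment the source does not specify.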

Uploaded by: Fani Warraich

Teaching QUESTION #6768

Question 342

A teacher constructs a 50-item MCQ test. Items 1–25 cover knowledge-level outcomes (p-values around 0.85–0.90) and items 26–50 cover application-level outcomes (p-values around 0.40–0.55). Which statement about this test's suitability for norm-referenced interpretation is MOST accurate?

Both halves are equally suitable for NRT because they cover all Bloom's levels
The application-level items (items 26–50) are more suitable for NRT because their moderate difficulty creates greater score variance, enabling better discrimination among examinees✔️
The knowledge-level items are better for NRT because easier items reduce test anxiety
NRT requires all items to have the same difficulty level, so the test is unsuitable

Correct Answer Logic:

Norm-referenced tests require substantial score variance to rank examinees accurately. Items with p-values between 0.40 and 0.60 are near optimal for NRT because they generate the widest spread of scores. Items with very high p-values (0.85–0.90) produce ceiling effects and minimal variance, reducing the test's discriminating power — making them less suitable for NRT purposes.
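
One way to see why moderate difficulty yields more spread: for an item scored 0 or 1, the item variance is p(1 - p), which peaks at p = 0.50. The sketch below applies that formula to the p-value ranges quoted in the question; the function name item_variance is an illustrative assumption.

```python
def item_variance(p):
    """Variance of a dichotomously scored (0/1) item with difficulty p."""
    return p * (1 - p)

knowledge_items = [0.85, 0.90]    # p-value range quoted for items 1-25
application_items = [0.40, 0.55]  # p-value range quoted for items 26-50

# Print the variance at each quoted p-value, plus p = 0.50 for reference.
for p in knowledge_items + application_items + [0.50]:
    print(f"p = {p:.2f} -> item variance = {item_variance(p):.4f}")
```

The application-level items sit close to the maximum of 0.25, while the knowledge-level items contribute roughly half that variance or less.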

Uploaded by: Fani Warraich

Teaching QUESTION #6769

Question 343

A teacher notices that a student's persistent academic failure is unrelated to motivation or teaching methods and suspects sensory or cognitive processing issues. Which type of educational decision, and what corresponding assessment tool, is MOST appropriate?

Grading decision; teacher-made achievement test
Diagnostic decision; specialized standardized diagnostic battery to identify root causes✔️
Placement decision; aptitude test
Selection decision; criterion-referenced test

Correct Answer Logic:

Diagnostic decisions address the causes of persistent learning difficulties — intellectual, physical, emotional, or environmental. When standard instructional remediation has failed, a diagnostic battery (not a routine achievement test) is needed to pinpoint the underlying cause. This is distinct from placement (where to put the student) or selection (whether to admit).

Uploaded by: Fani Warraich

Teaching QUESTION #6770

Question 344

In developing MCQ distractors, a teacher writes: 'Which country first used nuclear weapons in warfare? a) USA b) Soviet Union c) Germany d) France'. Only option (a) is a plausible choice, even for an uninformed student. What principle of MCQ construction is violated?

The stem should not use negative phrasing
Distractors should be plausible to uninformed students; implausible distractors do not contribute to item functioning and should be revised✔️
The stem should present a definite problem
All options should be grammatically consistent with the stem

Correct Answer Logic:

A core rule of MCQ construction is that all distractors must be plausible to students who lack the relevant knowledge. If a distractor is not selected by any student in a field test, it contributes nothing to the item's measurement function and should be replaced with a more compelling incorrect option.
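
The field-test check mentioned above (a distractor that no student selects) can be sketched as a simple tally. The response data and the helper name unused_distractors are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

def unused_distractors(responses, options, key):
    """Return the incorrect options that no examinee selected."""
    counts = Counter(responses)
    return [opt for opt in options if opt != key and counts[opt] == 0]

# Hypothetical field-test responses from 20 students to the nuclear-weapons item.
responses = list("aaaabaaaaaabaaaaaaaa")

# Options c and d were never chosen, so they are flagged for replacement.
flagged = unused_distractors(responses, options="abcd", key="a")
print(flagged)
```

In practice a reviewer might also flag distractors chosen by only a very small share of examinees; the zero-selection rule used here is the one the explanation states.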

Uploaded by: Fani Warraich

Teaching QUESTION #6771

Question 345

Carey (1988) identified six elements for developing a Table of Specification. Which combination is MOST essential to ensure both content representativeness AND cognitive depth?

Total number of items and test format
Balance among goals selected for the exam AND balance among levels of learning (higher and lower order)✔️
Enabling skills and number of items per goal
Test format and difficulty level

Correct Answer Logic:

Content representativeness is ensured by balancing the weighting across all instructional goals (objectives). Cognitive depth is ensured by explicitly including items at both lower-order (knowledge, comprehension) and higher-order (application, analysis, synthesis) levels of Bloom's Taxonomy. These two elements working together prevent a test from oversampling easy recall items at the expense of complex thinking.

Uploaded by: Fani Warraich

Teaching QUESTION #6772

Question 346

A student at the SOLO Taxonomy's 'Extended Abstract' level is asked: 'How does the concept of formative assessment relate to Vygotsky's Zone of Proximal Development?' Their response correctly links both concepts, generalizes the principle to other pedagogical models, and generates a new theoretical proposition. Which SOLO indicator verbs best describe this performance?

Enumerate, classify, describe
Identify, memorize, do simple procedure
Theorize, generalize, hypothesize, reflect✔️
Compare, contrast, integrate, apply

Correct Answer Logic:

At the Extended Abstract level, students transcend the given content domain, make connections to other areas, and think hypothetically. Key indicator verbs include: theorize, generalize, hypothesize, reflect, and generate. The response described — linking to another theory, generalizing the principle, and proposing a new idea — precisely matches these verbs.

Uploaded by: Fani Warraich

Teaching QUESTION #6773

Question 347

What is the fundamental difference between 'speed tests' and 'power tests' in terms of their design and the construct they measure?

Speed tests use harder items while power tests use easier items, measuring the same construct
Power tests have generous time limits and harder items, measuring maximum depth of knowledge; speed tests have strict time limits and easier items, measuring processing speed and efficiency✔️
Speed tests are used in CRT while power tests are used in NRT
Power tests are always essay-based while speed tests are always MCQ-based

Correct Answer Logic:

By definition: power tests use liberal time limits so virtually all examinees can attempt every item — items are difficult, measuring depth of knowledge. Speed tests use strict time limits that prevent completion — items are easy, measuring how quickly and accurately examinees can process and respond. They measure fundamentally different constructs.

Uploaded by: Fani Warraich

Teaching QUESTION #6774

Question 348

A test developer wants to use the Split-Half method to estimate reliability. After splitting the test into odd and even items and correlating the halves, they get r = 0.70. However, this underestimates the reliability of the full test. Which formula corrects for this, and why is the correction necessary?

The KR-20 formula; because internal consistency must account for item difficulty
The Spearman-Brown prophecy formula; because reliability increases with test length, and the correlation between two halves reflects reliability of a test only half as long✔️
The inter-rater reliability coefficient; because two independent scorers are effectively two test halves
The KR-21 formula; because it handles non-dichotomous scoring

Correct Answer Logic:

Split-half reliability correlates two halves of the test, but a half-test is less reliable than the full test. The Spearman-Brown formula corrects for this by estimating the reliability of the full-length test from the split-half correlation. This is a fundamental principle: longer tests, all else equal, are more reliable because they sample the domain more broadly.
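
As a worked illustration, the Spearman-Brown formula for a test lengthened by a factor k is r_full = k·r / (1 + (k − 1)·r); for a split-half correction k = 2. The sketch below applies it to the r = 0.70 from the question. The function and parameter names are illustrative assumptions.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Estimate reliability of a test length_factor times as long.

    With length_factor=2 this is the usual split-half correction.
    """
    return (length_factor * r_half) / (1 + (length_factor - 1) * r_half)

# Split-half correlation from the question: r = 0.70
full_test_reliability = spearman_brown(0.70)
print(round(full_test_reliability, 3))  # 0.824
```

So a half-test correlation of 0.70 corresponds to an estimated full-test reliability of about 0.82, which is why the uncorrected value underestimates it.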

Uploaded by: Fani Warraich

Teaching QUESTION #6775

Question 349

Which of the following actions MOST directly threatens the consequential validity of a high-stakes examination?

Using moderately difficult items with acceptable discrimination indices
Teaching exclusively to the specific test items used in past exams (teaching to the test), which artificially narrows the curriculum and constrains student learning✔️
Administering equivalent forms of the test across different testing centers
Using a Table of Specification to ensure content balance

Correct Answer Logic:

Consequential validity evaluates the intended and unintended effects of using assessment results. When teaching to the test occurs, the unintended consequence is a narrowed curriculum: students learn test content, not the broader domain. This undermines the fundamental educational purpose of assessment and is a direct threat to consequential validity.

Uploaded by: Fani Warraich

Teaching QUESTION #6776

Question 350

In a criterion-referenced test (CRT) context, a cut score of 70% is set to distinguish 'master' from 'non-master'. After the exam, 85% of students pass. Which statement about this result is MOST consistent with CRT principles?

This result is anomalous because CRT normally produces a normal distribution of scores
This result is expected and acceptable in CRT; mastery tests are designed with relatively easy items, and it is desirable that most students demonstrate mastery✔️
This outcome shows the test lacked discriminating power and should be revised for better spread
The cut score should be raised to 90% to reduce the pass rate to a more appropriate level

Correct Answer Logic:

In CRT, the test is designed to assess mastery of a defined set of skills; it does not aim to spread students along a distribution. It is entirely acceptable — even ideal — for most students to pass if they have mastered the content. Around 80% correct per item is the expected CRT item difficulty. The purpose is mastery verification, not comparison or ranking.

Uploaded by: Fani Warraich

Teaching QUESTION #6777

Question 351

A teacher asks students to sort unseen essay papers by quality (best to worst), then assigns grades based on relative rank. According to holistic rubric theory, which grading philosophy does this approach embody, and what is its key limitation?

Criterion-referenced philosophy; it lacks discriminating power
Norm-referenced philosophy; papers are ranked relative to each other rather than against absolute quality criteria, which makes it unsuitable for large numbers of papers✔️
Absolute standard philosophy; it can be applied consistently to any number of papers
Analytic philosophy; sub-criteria are implicitly weighted differently

Correct Answer Logic:

This is the fourth approach to holistic scoring — ranking papers relative to each other — which aligns with norm-referenced or relative standard grading. Its critical limitation: it cannot be applied to large sets of papers because it requires reading and comparing all papers simultaneously, and scores depend on the composition of the specific group rather than absolute quality.

Uploaded by: Fani Warraich

Teaching QUESTION #6778

Question 352

A researcher applies Item Response Theory to a set of test items and finds that one item has a very low 'a' (discrimination) parameter and a relatively high 'c' (pseudo-guessing) parameter. What practical recommendation follows from this?

Retain the item because low guessing should be prioritized over discrimination
The item likely does not differentiate ability levels effectively and may be susceptible to correct responses through guessing; it should be revised or removed from the bank✔️
Increase the item difficulty to reduce the guessing parameter
Use this item as a warm-up item since it is accessible to all ability levels

Correct Answer Logic:

In IRT's 3-parameter logistic model: 'a' is discrimination (steepness of ICC), 'b' is difficulty, and 'c' is the pseudo-guessing parameter (lower asymptote of the ICC). A low 'a' means the item poorly differentiates ability levels; a high 'c' means even low-ability examinees have a substantial probability of answering correctly — suggesting guessing is inflating scores. Such items should be revised or dropped.
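
The roles of the three parameters can be seen by evaluating the 3PL item characteristic curve, P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))). The sketch below is a minimal illustration: the parameter values for the two items are hypothetical, and the optional scaling constant (often written D = 1.7) that some texts include is omitted.

```python
from math import exp

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (no 1.7 scaling)."""
    return c + (1 - c) / (1 + exp(-a * (theta - b)))

# Hypothetical items: same difficulty, different discrimination and guessing.
weak_item = {"a": 0.3, "b": 0.0, "c": 0.35}  # low a, high c
good_item = {"a": 1.5, "b": 0.0, "c": 0.20}  # higher a, lower c

def spread(item, low=-2.0, high=2.0):
    """Gain in P(correct) when moving from a low-ability to a high-ability examinee."""
    return icc_3pl(high, **item) - icc_3pl(low, **item)

print(f"weak item spread: {spread(weak_item):.3f}")
print(f"good item spread: {spread(good_item):.3f}")
```

The weak item's probability of success barely changes across the ability range and never drops below its guessing floor c, which is the pattern that motivates revising or dropping it.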

Uploaded by: Fani Warraich

Teaching QUESTION #6779

Question 353

A teacher includes the following stem in a test: 'Photosynthesis is the process by which: a) plants absorb water from soil; b) plants produce food using sunlight; c) plants release oxygen at night; d) plants absorb carbon dioxide for cellular respiration.' Upon analysis, option (b) contains a keyword ('food') that also appears in the course definition students memorized. Which MCQ construction flaw does this represent?

Including a distractor that is too difficult
Verbal association between the stem/correct answer and the course definition — the word 'food' provides an irrelevant linguistic clue to the answer without requiring genuine understanding✔️
The stem is stated as an incomplete sentence, which is a poor format
All options are grammatically inconsistent with the stem

Correct Answer Logic:

Suggestion 7 in MCQ construction warns against verbal associations between the stem or correct answer and memorized material. When a keyword in the answer mirrors a specific phrase from the definition, students can identify the correct option through rote linguistic matching rather than conceptual understanding. This undermines the item's validity as a measure of comprehension.

Uploaded by: Fani Warraich

Teaching QUESTION #6780

Question 354

In a Table of Specification, after calculating that 25% of instructional time was devoted to Topic A, a teacher allocates 13 marks out of 50 total marks to Topic A. Is this allocation within acceptable limits?

No, the allocation must be exact — 12.5 marks and no rounding is permitted
Yes, the acceptable tolerance is ±2 percentage points; 13/50 = 26%, which is within ±2% of 25%✔️
No, the allocation should always round down to avoid over-testing any topic
Yes, but only if Topic A contains application-level questions

Correct Answer Logic:

The Table of Specification guideline states: Percent of instruction time = Percent of examination value (within ±2 percent). Topic A received 25% of instructional time. Allocated: 13/50 = 26%. Since 26% is within ±2 percentage points of 25%, this is acceptable. With 50 total marks, each mark is worth 2%, so 14 marks (28%) or 11 marks (22%) would fall outside the 23%–27% band and require revision.
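
The tolerance check is simple arithmetic and can be sketched as follows. The function name within_tolerance and its parameters are illustrative assumptions; the ±2 tolerance is read as percentage points, as in the explanation above.

```python
def within_tolerance(instruction_pct, marks, total_marks, tolerance_pct=2.0):
    """Return (exam share in percent, whether it is within tolerance of instruction share)."""
    exam_pct = 100.0 * marks / total_marks
    return exam_pct, abs(exam_pct - instruction_pct) <= tolerance_pct

# Topic A: 25% of instructional time, 13 of 50 marks.
exam_pct, ok = within_tolerance(25.0, 13, 50)
print(exam_pct, ok)  # 26.0 True

# One more mark (14 of 50 = 28%) would exceed the tolerance.
print(within_tolerance(25.0, 14, 50))  # (28.0, False)
```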

Uploaded by: Fani Warraich

Teaching QUESTION #6781

Question 355

Which of the following BEST explains why portfolios have very low reliability as an assessment tool compared to standardized tests?

Portfolios always favor high-ability students because they choose their own best work
Portfolio scoring lacks the standardized criteria, uniform conditions, and objective scoring procedures that characterize reliable measurement — different assessors apply different criteria to different content✔️
Portfolios contain too many items, which statistically reduces internal consistency
Portfolios are only valid for formative purposes and reliability is not applicable to them

Correct Answer Logic:

Reliability requires consistency across administrations, scorers, and conditions. Portfolios are diverse in content (student-selected), judged subjectively by different assessors using loosely defined criteria, and lack uniform conditions. All of these factors introduce variability into scores, which by definition reduces reliability. This is listed as a known weakness of portfolios.

Uploaded by: Fani Warraich

Teaching QUESTION #6782

Question 356

A test developer notices that one MCQ option is consistently longer and more qualified than the other three options, and it is also the correct answer. What test construction error has occurred and how should it be corrected?

The stem does not present a definite problem; rewrite the stem to be a direct question
The relative length of the correct answer provides an unintentional clue; revise distractors to be approximately equal in length to the correct answer by adding qualifying phrases✔️
There are too few distractors; add a fifth option
The item uses a negative stem; convert to a positive format

Correct Answer Logic:

Suggestion 8 in MCQ construction states: the relative length of alternatives should not provide a clue to the answer. Correct answers tend to require qualification to be unambiguously true, making them longer. The fix is to deliberately add similar qualifying phrases to the distractors to equalize length, removing the length clue while preserving plausibility.

Uploaded by: Fani Warraich

Teaching QUESTION #6783

Question 357

The National Education Assessment System (NEAS) was established in Pakistan primarily with funding from the World Bank and DfID in 2003. What was its PRIMARY assessment purpose — and how does this differ from the purpose of the Board of Intermediate and Secondary Education (BISE)?

NEAS certifies individual student performance for promotion; BISE monitors national education standards
NEAS conducts large-scale national assessments to inform policy, monitor curriculum implementation standards, and identify achievement correlates at the system level; BISE conducts high-stakes individual certification examinations (SSC, HSSC) at grades 10 and 12✔️
NEAS administers competitive entrance examinations for public sector jobs; BISE focuses on diagnostic assessment
Both serve identical purposes but operate at different administrative levels

Correct Answer Logic:

NEAS is a system-level monitoring body. It conducts large-scale assessments to give federal policymakers a picture of education quality, monitor curriculum translation into learning, and identify factors affecting achievement. It does not certify individual students. BISE, in contrast, conducts individual-certification high-stakes examinations (SSC at grade 10, HSSC at grade 12) that directly determine students' academic credentials.

Uploaded by: Fani Warraich

Teaching QUESTION #6784

Question 358

A test item reads: 'Which of the following is NOT an example of formative assessment?' with correct answer option 'Final examination'. A student incorrectly answers 'Weekly quiz'. According to item analysis in CTT, if students with high total test scores systematically choose 'weekly quiz' as their answer, what does the discrimination index likely indicate?

A high positive D value, indicating the item works well
A negative D value, indicating that the item functions in reverse — high achievers are misled while lower achievers answer correctly — suggesting the item needs immediate review✔️
A D value near 0, indicating the item has no differential effect
A D value above 0.40, indicating excellent item quality

Correct Answer Logic:

If high-scoring students are more likely to answer incorrectly than low-scoring students, the item discrimination index D = (Upper Group % Correct) – (Lower Group % Correct) will be negative. Negative D values are the most serious item analysis warning sign, typically indicating a keying error, ambiguous wording, or a misleading clue that favors lower-ability guessers over informed higher-ability students.
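
A minimal sketch of the D calculation, assuming each group's responses are recorded as 1 (correct) or 0 (incorrect). The two groups below are hypothetical data, and D is expressed as a difference of proportions rather than percentages.

```python
def discrimination_index(upper_correct, lower_correct):
    """D = proportion correct in the upper group minus proportion correct in the lower group."""
    p_upper = sum(upper_correct) / len(upper_correct)
    p_lower = sum(lower_correct) / len(lower_correct)
    return p_upper - p_lower

# Hypothetical 0/1 item scores for the top and bottom scorers on the whole test.
upper_group = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # 2 of 10 correct
lower_group = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 6 of 10 correct

d = discrimination_index(upper_group, lower_group)
print(round(d, 2))  # -0.4
```

Here the high scorers do worse than the low scorers on this item, so D is negative and the item would be sent for review.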

Uploaded by: Fani Warraich

Teaching QUESTION #6785

Question 359

Bloom's original (1956) Taxonomy listed 'Synthesis' above 'Analysis' and 'Evaluation' as the highest level. Anderson and Krathwohl's revised taxonomy (2001) made two significant changes. Which accurately describes BOTH changes?

They added a seventh level called 'Innovation' and renamed 'Knowledge' to 'Remembering'
They reversed the two highest categories (Synthesis, renamed Creating, moved above Evaluation) and changed all category names from nouns to verbs✔️
They collapsed the taxonomy to four levels and merged Synthesis with Evaluation
They added the Psychomotor and Affective domains to the Cognitive domain

Correct Answer Logic:

In the Revised Bloom's Taxonomy by Anderson and Krathwohl: (1) All category names were changed from nouns to verbs (Knowledge → Remembering; Comprehension → Understanding; etc.). (2) The two highest levels were reversed — Creating (formerly Synthesis) became the highest level, above Evaluating. This reflects the view that generating new ideas is cognitively more demanding than judging existing ones.

Uploaded by: Fani Warraich

Teaching QUESTION #6786

Question 360

A test packaging decision involves arranging items 'from easy to hard'. Which of the following is the MOST psychologically sound rationale for this arrangement?

Easier items have higher discrimination and should be answered first to maximize test reliability
Beginning with accessible items reduces test anxiety, builds examinee confidence, and motivates engagement — benefiting students who struggle with initial performance anxiety without disadvantaging stronger students✔️
Harder items lose marks if unattempted, so they must be placed last to ensure all items are reached
This arrangement ensures the test follows a norm-referenced scoring pattern

Correct Answer Logic:

The test administration literature supports easy-to-hard arrangement primarily for psychological reasons: it provides a positive start, reduces anxiety, builds confidence, and motivates students to continue engaging with the test. This is particularly beneficial for test-anxious or lower-ability students and does not penalize stronger students who find the early items trivially easy.

Uploaded by: Fani Warraich
