Questions Reading Mode
Study questions platform-wide or filter by specific tests with correct answers revealed.
Correct Answer Logic:
Formative assessment is defined by its purpose, which is to improve the teaching-learning process through ongoing feedback, not by whether grades are assigned. Grades are a hallmark of summative assessment. Measurement generates data; assessment uses that data to enhance instruction. Observation is a recognized tool of formative assessment.
Uploaded by: Fani Warraich
Correct Answer Logic:
A fundamental recommendation for high-stakes testing is protection against high-stakes decisions based on a single test. Important educational decisions require triangulation of evidence from multiple sources to ensure validity and fairness.
Uploaded by: Fani Warraich
Correct Answer Logic:
Error in CTT refers to any factor beyond the student's true ability that affects the observed score. Poor environmental conditions (noise, lighting) and test anxiety are classic sources of construct-irrelevant variance that inflate the error component, making the observed score a less accurate reflection of true ability.
Uploaded by: Fani Warraich
Teaching
QUESTION #6750
Question 324
A test developer calculates item difficulty (p-value) for a four-choice MCQ and obtains p = 0.92. What is the MOST appropriate interpretation and action?
Correct Answer Logic:
For a four-alternative MCQ, the optimal p-value is approximately 0.62. A p-value of 0.92 means 92% of examinees answered correctly, so the item is very easy. In norm-referenced testing (NRT) contexts this reduces discrimination power, and items above p = 0.90 need careful review. A high p-value does not guarantee high discrimination; in fact, when nearly everyone answers correctly there is little room for high and low scorers to differ, so such items often yield near-zero or negative discrimination.
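As a rough illustration of the figures above, the sketch below computes a p-value from scored responses and an "optimal" difficulty for a given number of options. The midpoint rule used here (halfway between the chance score and 1.0) is an assumption: it is one common rule of thumb that reproduces the approximately 0.62 figure, not necessarily the rule the question's author had in mind. The function names and the response data are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees answering correctly (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)


def optimal_difficulty(n_options):
    """Assumed rule of thumb: midpoint between the chance score and 1.0."""
    chance = 1 / n_options
    return (chance + 1.0) / 2


# Hypothetical data: 23 of 25 examinees answer the item correctly.
responses = [1] * 23 + [0] * 2

print(item_difficulty(responses))  # 0.92
print(optimal_difficulty(4))       # 0.625
```

Under this rule, an item at p = 0.92 sits far above the target for a four-option item, which is why it is flagged for review.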
Uploaded by: Fani Warraich
Teaching
QUESTION #6751
Question 325
A test has a Kuder-Richardson reliability of 0.85. A parallel form of the same test yields a test-retest-with-equivalent-forms correlation of 0.68. Which of the following best explains this discrepancy?
Correct Answer Logic:
KR-20 is an internal consistency measure computed from a single test administration, so it cannot capture the variance introduced by time passage or form differences. The equivalent-forms-with-retest method measures both stability and equivalence, capturing additional sources of variance that lower the coefficient. This explains why KR-20 (single administration) tends to exceed test-retest-with-equivalent-forms reliability.
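To make the "single administration" point concrete, here is a minimal sketch of the standard KR-20 formula, KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores), applied to one matrix of 0/1 item scores. The data are hypothetical, and the use of the population variance is an assumption (some texts use the sample variance instead).

```python
from statistics import pvariance


def kr20(scores):
    """KR-20 from one administration.

    scores: list of examinee rows, each a list of 0/1 item scores.
    """
    n = len(scores)          # number of examinees
    k = len(scores[0])       # number of items
    totals = [sum(row) for row in scores]
    var_total = pvariance(totals)  # population variance of total scores (assumed)

    # Sum of item variances p * q, where q = 1 - p.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / var_total)


# Hypothetical data: 4 examinees, 3 items.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(data))  # 0.75
```

Nothing in the calculation involves a second occasion or a second form, so changes over time and differences between forms cannot lower the coefficient.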
Uploaded by: Fani Warraich
Correct Answer Logic:
In Bloom's Revised Taxonomy, Creating is the highest level: it involves generating, planning, or producing new ideas or products. Synthesizing across disciplines and forming novel hypotheses are quintessential Creating-level tasks. Extended-response essay items allow the freedom of expression and length needed for such complex, open-ended performance. Restricted-response and MCQs constrain the response in ways that prevent authentic synthesis and creation.
Uploaded by: Fani Warraich
Teaching
QUESTION #6753
Question 327
A researcher finds that a well-known mathematics aptitude test correlates strongly with students' later success in engineering programs. This evidence most directly supports which type of validity?
Correct Answer Logic:
Criterion validity is established by correlating test scores with an external criterion. When the criterion is measured at a future point in time (engineering success), this is predictive validity, a subtype of criterion validity. Content validity is about domain sampling; construct validity is about the underlying psychological construct. Concurrent validity uses a simultaneously-collected criterion.
Uploaded by: Fani Warraich
Correct Answer Logic:
Inter-rater reliability (inter-scorer reliability) measures the consistency of scores assigned by two or more independent raters. A correlation of 0.55 is low, indicating significant scorer disagreement. This is a reliability problem, specifically related to the consistency-of-ratings dimension, not content or criterion validity.
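A simple way to quantify this kind of agreement is the Pearson correlation between two raters' scores for the same set of responses. The sketch below is only illustrative: the rater scores are hypothetical, and other indices (for example, percent agreement or Cohen's kappa) may be preferred in practice.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical essay scores given by two independent raters to six students.
rater_a = [8, 6, 9, 5, 7, 4]
rater_b = [6, 7, 8, 6, 5, 5]

print(round(pearson_r(rater_a, rater_b), 2))
```

A value close to 1.0 would indicate that the raters rank responses almost identically; a value near 0.55 signals substantial disagreement.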
Uploaded by: Fani Warraich
Teaching
QUESTION #6755
Question 329
In a norm-referenced test, items are deliberately selected to have an average difficulty of around p = 0.50 rather than p = 0.80. What is the PRIMARY measurement rationale for this design decision?
Correct Answer Logic:
The core purpose of a norm-referenced test is to rank examinees along a continuum. Score variance is the statistical engine that enables ranking. Items near p = 0.50 (neither too easy nor too hard) produce maximum score variance. Items with very high or very low p-values reduce variance and thus reduce the test's discriminating power.
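The variance claim can be checked directly: the variance of a single right/wrong item is p * (1 - p), which peaks at p = 0.50. The sketch below evaluates it at a few illustrative p-values (the chosen values are arbitrary). Total score variance also depends on how items covary, so this shows the per-item contribution only.

```python
def item_variance(p):
    """Variance of a dichotomous (0/1) item with proportion correct p."""
    return p * (1 - p)


for p in (0.20, 0.50, 0.80, 0.92):
    print(p, round(item_variance(p), 4))
# 0.2 0.16
# 0.5 0.25
# 0.8 0.16
# 0.92 0.0736
```

An item at p = 0.80 contributes about two thirds of the variance of an item at p = 0.50, and an item at p = 0.92 contributes less than a third.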
Uploaded by: Fani Warraich
Teaching
QUESTION #6756
Question 330
A teacher constructs a test and finds that item discrimination index D = -0.15 for question 7. Which interpretation is MOST accurate?
Correct Answer Logic:
The discrimination index D = (Upper Group % Correct) - (Lower Group % Correct), with both terms expressed as proportions. A negative D means the lower-scoring group outperformed the upper-scoring group on this item. This is a serious red flag: it could indicate a keying error, ambiguous wording, or content that rewards lower-ability guessers. Such items should be removed from scoring or revised.
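The calculation can be sketched as follows. The group data are hypothetical and chosen so that the result matches D = -0.15; the sketch assumes the upper and lower groups have already been formed from total test scores.

```python
def discrimination_index(upper_correct, lower_correct):
    """D = proportion correct in the upper group minus that in the lower group.

    Each argument is a list of 0/1 scores on the item for one group.
    """
    p_upper = sum(upper_correct) / len(upper_correct)
    p_lower = sum(lower_correct) / len(lower_correct)
    return p_upper - p_lower


# Hypothetical groups of 20 examinees each.
upper = [1] * 9 + [0] * 11    # 9 of 20 correct  -> 0.45
lower = [1] * 12 + [0] * 8    # 12 of 20 correct -> 0.60

d = discrimination_index(upper, lower)
print(round(d, 2))  # -0.15
if d < 0:
    print("Negative D: check the answer key and wording before scoring.")
```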
Uploaded by: Fani Warraich
Correct Answer Logic:
Using the Table of Specification formula: Percentage of instruction time = (150/500) × 100 = 30%. Mark allocation = 30% of 50 = 15 marks. The Table of Specification ensures that the proportion of test marks mirrors the proportion of instructional time devoted to each content area.
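The same arithmetic can be written as a small helper. The function name is hypothetical, and the time unit is left generic because the worked figures (150 out of 500) do not state one.

```python
def marks_for_area(area_time, total_time, total_marks):
    """Allocate marks in proportion to the instruction time spent on a content area."""
    return area_time * total_marks / total_time


share_percent = 150 / 500 * 100
print(share_percent)                  # 30.0
print(marks_for_area(150, 500, 50))   # 15.0
```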
Uploaded by: Fani Warraich
Teaching
QUESTION #6758
Question 332
Which of the following BEST distinguishes the SOLO Taxonomy's "Relational" level from its "Multi-structural" level?
Correct Answer Logic:
In SOLO Taxonomy, Multi-structural understanding means several components are known but each remains discrete, so students cannot see the whole. At the Relational level, the components are connected and integrated: students understand cause-effect, compare-contrast, and see how parts contribute to a unified whole. This integration is the key differentiator.
Uploaded by: Fani Warraich
Teaching
QUESTION #6759
Question 333
A test is highly reliable but consistently measures vocabulary skill instead of reading comprehension as intended. According to the framework of validity and reliability, which statement is MOST accurate?
Correct Answer Logic:
This scenario illustrates the classic principle: a test can be reliable without being valid. Reliability (consistency) is a necessary but not sufficient condition for validity. Valid results require that the test measures what it claims to measure. Consistent measurement of the wrong construct produces reliable but invalid scores.
Uploaded by: Fani Warraich
Correct Answer Logic:
One of the most important rules for constructing matching exercises is to include more responses than premises (or allow responses to be used more than once) so that the last premise cannot be answered by elimination. A strict one-to-one correspondence with equal numbers allows students to answer the final item through process of elimination, without any content knowledge.
Uploaded by: Fani Warraich
Correct Answer Logic:
Construct validity is the evidence that a test measures the intended theoretical construct. For a nuanced construct like "understanding" (as opposed to memorization), one must: define the construct, identify its sub-constructs (e.g., conceptual understanding vs. procedural), write items targeting each, and validate through expert judgment and factor analysis. This is the rigorous process described in construct validity.
Uploaded by: Fani Warraich
Correct Answer Logic:
A core rule for constructing true/false items is: avoid including two ideas in one statement unless cause-and-effect relationships are being tested. This item combines (a) natural resources/location and (b) high GDP growth. A student might agree with the premise but disagree with the conclusion, or vice versa, making the item inherently ambiguous.
Uploaded by: Fani Warraich
Teaching
QUESTION #6763
Question 337
Which of the following scenarios represents the MOST appropriate use of diagnostic assessment rather than formative or summative assessment?
Correct Answer Logic:
Diagnostic assessment is specifically designed to identify the causes of persistent learning difficulties, not merely to measure progress or assign grades. The scenario where a student repeatedly fails despite varied instruction, prompting a deeper investigation, precisely matches the definition and purpose of diagnostic assessment.
Uploaded by: Fani Warraich
Correct Answer Logic:
The slope of the ICC reflects discrimination power. A steep slope means: for examinees near the item's difficulty level, even a slight difference in underlying ability dramatically changes the probability of success. This is a highly efficient discriminating item. Flatness means all ability levels have similar success rates, which indicates poor discrimination. Steepness does not dictate difficulty; an item can be steep at any point on the ability scale.
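The relationship between slope and discrimination can be illustrated with a two-parameter logistic curve. The explanation above does not name a specific model, so the 2PL form, the parameter names `a` (slope) and `b` (difficulty), and the parameter values are all assumptions made for this sketch.

```python
import math


def icc(theta, a, b):
    """Probability of a correct response under an assumed 2PL model.

    theta: examinee ability, a: discrimination (slope), b: item difficulty.
    """
    return 1 / (1 + math.exp(-a * (theta - b)))


def gain_near_difficulty(a, b, delta=0.5):
    """Change in success probability across a small ability band centred on b."""
    return icc(b + delta, a, b) - icc(b - delta, a, b)


# Same difficulty, different slopes: the steep item separates examinees far more.
print(round(gain_near_difficulty(a=2.0, b=0.0), 3))
print(round(gain_near_difficulty(a=0.3, b=0.0), 3))

# Same slope, different difficulty: the gain is unchanged, only its location moves.
print(round(gain_near_difficulty(a=2.0, b=-1.0), 3))
print(round(gain_near_difficulty(a=2.0, b=1.0), 3))
```

The last two lines print the same value, which reflects the point that steepness and difficulty are separate properties of an item.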
Uploaded by: Fani Warraich
Correct Answer Logic:
Construct-irrelevant variance occurs when a test format measures something other than the intended construct. Asking whether item types are appropriate for the intended outcomes directly addresses whether the chosen format (e.g., MCQ vs. essay) can actually measure the targeted learning outcome, or whether it introduces irrelevant cognitive demands (e.g., writing skill when testing content knowledge).
Uploaded by: Fani Warraich
Correct Answer Logic:
The described procedure (reading the full response and assigning a single overall score without breaking it into criteria) is holistic scoring. Its key advantages are efficiency (it is quicker to score) and appropriateness for extended-response tasks involving synthesis and evaluation, where performance is a gestalt that is difficult to decompose into discrete point-scoring elements.
Uploaded by: Fani Warraich