
December 11, 2010

More Mandatory Finnish Content

Taksin Nuoret writes:
Ever since December 2001, when the results of the first PISA survey were made public, the Finnish educational system has received a lot of international attention. Foreign delegations are flocking to Finland, in the hope of discovering Finland's secrets.

The explanation widely accepted is that the Finnish educational system is better. For example, the following aspects have been pointed out:
  • Schools routinely provide tutoring for weak students.
  • Each school has a social worker ("koulukuraattori").
  • Substitute teachers are often provided when the teacher is ill.

Damn, why didn't anybody in the U.S. ever think of having substitute teachers? 

Anyway, Nuoret goes on to make the argument that perhaps Finnish is an easier language for kids to learn in than many other languages. He notes that Estonia, the other Finno-Ugric-speaking country, also does better than expected on PISA, and that Swedish-speakers in Finland do a little worse than Finnish-speakers on the PISA, even though Swedish-speakers have larger stock portfolios.

Arguments that one language is better than another for thinking about something have been around a long time. For example, maybe the poor reading performance in Latin American countries has something to do with how it normally takes more letters and syllables to say something in Spanish than in English, as you can see by noting bilingual signs and the like. Puerto Ricans make up for the extra syllables in spoken Spanish by talking faster, but perhaps it's hard to read faster. I don't know.

Over the many years I've been listening to language theories like these, we don't seem to have made much progress at figuring out how to evaluate them. I like Nuoret's argument because he at least comes up with two pieces of evidence from the PISA results. That's only two pieces of data, but it's still about twice the average amount presented in discussions of whether one language is more efficient for thinking than another.

December 7, 2010

PISA scores for Shanghai

Here are the results from the 2009 PISA test of 15-year-olds' school performance, with the first-ever scores from the city of Shanghai. Shanghai swept the three categories. (Shanghai is not the full country of China -- Shanghai is the current favored-son city of the regime, with restrictions on who gets to migrate there internally.) Also, this is the first time Shanghai participated, so they presumably waited until they were ready to rip on the test.

Mean is 500, standard deviation is 100. So, Shanghai students beat the advanced world's international mean by 0.75, 0.56, and 1.00 standard deviations. Pretty good. On an IQ-like scale, that's approaching 112.
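
The "IQ-like scale" arithmetic above can be sketched in a few lines, assuming the usual convention of mapping one PISA standard deviation onto 15 IQ-style points:

```python
def pisa_to_iq(z_above_mean: float) -> float:
    """Map a lead measured in PISA standard deviations onto an
    IQ-style scale (mean 100, SD 15)."""
    return 100 + 15 * z_above_mean

gaps = [0.75, 0.56, 1.00]   # Shanghai's lead in the three subjects, in SDs
print(f"{pisa_to_iq(sum(gaps) / len(gaps)):.2f}")  # -> 111.55, i.e. approaching 112
```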

Having seen the National Merit Semifinalist names for 2009 in California, where there was a single Cohen and 49 Wangs among the honorees, I can't say I was too surprised. Santa Clara County might challenge Shanghai in a heads-up match, though.

Interestingly, the NYT table left out some low-scoring countries, such as Mexico, but then what possible policy implications do the educational attainment and intelligence of Mexicans have for an American audience? None, none, I tell you! Here's an eyeball-frying graphic from The Guardian of the OECD countries (leaving out unrepresentative cities like Shanghai). Check out the bottom line:

Converting to an IQ scale, Mexico scores an 88, although that's probably held down by the low quality of Mexican schools. 

You can look up all the results on the official PISA site here. The crazy colored Guardian graph is somewhat unfair to Mexico by making it the lowest: there are a number of countries that did worse, including Argentina. My general impression is that Mexico has been coming up in these international comparisons over the last decade.

September 22, 2010

Is the SAT getting easier?

It's interesting to look at California SAT scores over time. The farthest back data on the College Board site for the state of California is 1998, so I'll contrast 1998 to 2010. Overall, the mean score in California has dropped 3 points from 1520 to 1517 on a 600 to 2400 point scale. (Because the 1998 test was reported on a 400 to 1600 scale, I'm multiplying 1998 scores by 1.5 to match them up with 2010 scores).
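
The rescaling is simple enough to write down explicitly; this assumes, as described above, that the old two-section composite scales linearly onto the new three-section one:

```python
def rescale_1600_to_2400(score: float) -> float:
    """Lift a score from the old two-section 400-1600 scale onto the
    2010-era three-section 600-2400 scale by multiplying by 1.5."""
    return score * 1.5

print(rescale_1600_to_2400(400), rescale_1600_to_2400(1600))  # -> 600.0 2400.0
```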

A three point drop doesn't sound like much but that stability masks all sorts of things going on beneath the surface. For example, mean test scores have gone up for most ethnic groups: whites up 33 points, Asians up 57, blacks up 29, Mexican Americans up 14. (The total score went down because lower-scoring groups grew so fast.)

And that pattern of rising scores within groups is especially noteworthy because the number of high school seniors taking the test has gone up. (This data looks at each senior in the class of 1998 or 2010 and counts only the last time he or she took the SAT, so taking the test multiple times isn't a factor in these numbers.)

California SAT 1998 v. 2010

College Bound Seniors     1998 #   2010 #   Chg  1998 Mean  2010 Mean  Chg  1998 v. NHW  2010 v. NHW
Total                    142,139  210,926   48%       1520       1517   -3          -89         -124
White                     56,217   69,969   24%       1608       1641   33            0            0
Asian, As-Am, or Pac Isl  29,889   44,932   50%       1557       1614   57          -51          -27
Black or Af Am             8,868   14,476   63%       1292       1320   29         -317         -321
Mex or MA                 18,494   42,380  129%       1341       1355   14         -267         -286
Other Hispanic             6,606   20,735  214%       1359       1325  -34         -249         -316
Puerto Rican                 489      699   43%       1434       1489   55         -174         -152
American Indian            1,415    1,256  -11%       1479       1488    9         -129         -153
Other                      7,863    8,498    8%       1566       1561   -5          -42          -80
No Response               12,298    7,981  -35%       1520       1566   47          -89          -75

For example, 24 percent more whites in the class of 2010 took the SAT than whites in the class of 1998, although I would guess that there were fewer white 17-year-olds in California in 2010 than in 1998. (Correct me if I'm wrong.) Black SAT-takers went up by 63 percent, even though lots of black families left California between 1998 and 2010. Clearly, there is a big push to get more marginal kids to take the SAT. (It's a great era to be in the testing racket!)

All else being equal, the higher the percentage of a group who takes the SAT, the lower the expected mean score (the scraping-the-bottom-of-the-barrel effect). But, except for the "Other Hispanic" category, where test takers exploded by 214%, we don't see that in the big groups here.
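
The scraping-the-bottom-of-the-barrel effect is easy to quantify under a toy assumption: if test-takers were simply the top slice of a normal ability distribution, the expected mean of the slice falls as participation widens. A stdlib-only sketch:

```python
import math

def top_fraction_mean(participation: float) -> float:
    """Mean ability, in SDs above the population mean, of the top
    `participation` fraction of a standard normal population:
    E[X | X > z] = phi(z) / participation, with z found by bisection."""
    lo, hi = -6.0, 6.0
    for _ in range(80):   # bisect for z with P(X > z) = participation
        mid = (lo + hi) / 2
        tail = 0.5 * (1 - math.erf(mid / math.sqrt(2)))
        if tail > participation:
            lo = mid
        else:
            hi = mid
    z = (lo + hi) / 2
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return phi / participation

print(f"top 30% take the test: mean {top_fraction_mean(0.30):+.2f} SD")  # -> +1.16
print(f"top 50% take the test: mean {top_fraction_mean(0.50):+.2f} SD")  # -> +0.80
```

Of course, real test-takers aren't a clean top slice of the ability distribution, which is part of why the rising within-group means above are surprising.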

So, what's really going on with mean SAT scores in California? A few possibilities:

- Scoring is getting easier. 
- People are getting smarter.
- People in California are getting smarter relative to the rest of the country.
- Students are more familiar with test-taking because of all the other tests they take now.
- Students are better test-prepped for the SAT in 2010.

Considering the rank order of the size of the effect -- biggest gain among Asians, next biggest among whites, then blacks, then Mexicans, finally a bad dropoff among Central Americans, most of whom have recently arrived and don't know about SATs -- that would be about what I'd expect from average enthusiasm among parents for test prepping, so I don't think we can rule out the last possibility.

September 21, 2010

SAT scores in California

With all the interest (264 comments and counting) generated by the huge number of Chinese and Korean names among the national merit semifinalists (top 0.5%) on the PSAT in California, here are the latest SAT scores from California. Interestingly, in California, whites average slightly higher than Asians / Pacific Islanders, both on the traditional M+V and the new three part total including Writing: whites 1641 to Asians 1614. (Nationally, however, Asians outscore whites 1636 to 1580.)

However, Asians / Pacific Islanders have higher standard deviations.

California 2010 SAT

                                                Crit Read    Math      Writing
College Bound Seniors          #  Share  Total  Mean   SD  Mean   SD  Mean   SD
Total                    210,926   100%   1517   501  113   516  119   500  113
White                     69,969    33%   1641   546  100   553  102   542  100
Asian, As-Am, or Pac Isl  44,932    21%   1614   518  116   571  121   525  122
Black or Af Am            14,476     7%   1320   444  101   436  102   440   97
Mex or MA                 42,380    20%   1355   449   95   458   96   448   90
Other Hispanic            20,735    10%   1325   440  102   444  102   441   95
Puerto Rican                 699     0%   1489   501  101   495  105   493  100
American Indian            1,256     1%   1488   499  102   504  101   485   98
Other                      8,498     4%   1561   517  113   525  118   519  115
No Response                7,981     4%   1566   523  121   526  123   517  121

Commenter Mitch, a Bay Area testing tutor, has argued that the high number of Asian semifinalists on the PSAT in California is exaggerated relative to the high-stakes SAT: the PSAT is a low-stakes test, which whites tend to treat as the beginning of thinking about studying for the SAT, while Asian parents tend to treat it as an important milestone in a years-long process of boning up for the SAT.

Being lazy, I'll leave it up to interested readers to do the work to evaluate this hypothesis and post their findings in the comments.

For example, questions to consider are: What exactly are the racial percentages of National Merit semifinalists in California? Do a higher percentage of Asian 17-year-olds take the SAT in California than do white 17-year-olds? (One thing not to worry about much in California is the SAT v. ACT divide that confuses things when thinking about SAT scores in, say, Iowa: California is traditionally an SAT state.) What is the nationality makeup of Asian / Pacific Islander 17-year-olds in California? What about taking the SAT multiple times -- how does that affect the numbers? (Okay, I found the answer to this last question: "Students are counted only once, no matter how often they tested, and only their latest scores and most recent SAT Questionnaire responses are summarized.") And so forth and so on.

Good luck!

By the way, this is the first bit of quantitative evidence I can recall to support the common-sense notion that California's white people are smarter than the national average. Considering how damnably expensive it is and all the high-end industries and all the Nobel Prizes, you would think it would have smart white people. But on the NAEP, California non-Hispanic whites always lag badly behind, say, Texan whites. And that was true way back on the big 1960 federal Project Talent test of 15-year-olds, where Texans beat Californians. So, numbers like that got me assuming that most white Californians are less Hewletts and Packards and more Bodines and Spicolis. But, maybe, white people in California just can't be bothered with trying on low-stakes tests?

September 20, 2010

Test scores and home prices

Real estate agents famously keep track of test scores in school districts, although it's not clear which is the leading and which the lagging indicator: public school test scores or home prices. It would seem like a consulting firm might profit by creating a statistical model alerting them to arbitrage opportunities where public school test scores and home prices have gotten out of whack.
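
A minimal sketch of such a screen, on synthetic data (every number below is made up for illustration): regress district home prices on test scores, then flag the districts whose prices sit far from the regression line.

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(500, 50, 200)                      # district mean test scores (synthetic)
prices = 2_000 * scores + rng.normal(0, 60_000, 200)   # home prices loosely tied to scores

slope, intercept = np.polyfit(scores, prices, 1)       # fit price ~ score
z = prices - (slope * scores + intercept)              # residuals from the fit
z = z / z.std()

underpriced = np.flatnonzero(z < -2)   # scores high relative to price
overpriced = np.flatnonzero(z > +2)    # price has outrun the scores
print(len(underpriced), len(overpriced), "districts flagged")
```

Whether such residuals are true arbitrage or just unmodeled neighborhood factors is, of course, the hard part.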

September 7, 2010

Critical Thinking

In the comments to my recent post on the Golden Age of Test Creation, Mitch points me to Linda Darling-Hammond, the prominent Ed Schooler from the Stanford Ed School, explaining how better tests would make American students smarter:
Whereas students in most parts of the United States are typically asked simply to recognize a single fact they have memorized from a list of answers, students in high-achieving countries are asked to apply their knowledge in the ways that writers, mathematicians, historians and scientists do.

In the United States, a typical item on the 12th grade National Assessment of Educational Progress, for example, asks students which two elements from a multiple choice list are found in the Earth's atmosphere. An item from the Victoria, Australia, high school biology test (which resembles those in Hong Kong and Singapore) describes how a particular virus works, asks students to design a drug to kill the virus and explain how the drug operates (complete with diagrams), and then to design and describe an experiment to test the drug - asking students to think and act like scientists. 

This kind of testing would clearly pay for itself just from the patent rights to the anti-viral drugs designed by the high school test-takers. They must be worth billions!
 

July 9, 2010

How did your kid do on the APs?


Scores for the nearly 3 million Advanced Placement tests taken by high school students in May are now arriving in the mail. So, in the interests of helping you parents establish your bragging rights, here's the graph of what AP scores equate to in percentile terms, which I created last year for a VDARE.com article. It shows how your kid did, compared not to all the other kids who took the test, who are a self-selected few, but to all the other kids in the country of his or her age (including those who have already dropped out of high school). The brighter the color, the higher the score. This graph starts at the 90th percentile on the left and goes up. An untruncated graph showing the performance of all kids in the country would be ten times as wide.

AP tests are graded 1 to 5, with a 5 supposed to be equivalent to an A in a typical college's introductory year-long course in the subject, a 4 to a B, and so forth.

So, if your kid took the English Lit test (the top bar in the graph) and got a 4 (the yellow-orange band), he actually scored at the 98th percentile (or higher) out of all kids his age in the country. If he got a 3 (light gray) in US History (the third bar down), he scored at the 94th percentile or higher.
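
The percentile arithmetic is just a share of the whole age cohort, not of test-takers; the counts below are hypothetical round numbers, not College Board figures:

```python
def age_cohort_percentile(n_at_or_above: int, cohort_size: int) -> float:
    """Percentile among ALL kids of that age, including non-takers,
    assuming everyone who didn't take the exam would score below."""
    return 100 * (1 - n_at_or_above / cohort_size)

cohort = 4_000_000        # assumed size of a U.S. single-year age cohort
fours_and_fives = 80_000  # hypothetical count of 4s and 5s on one exam
print(f"{age_cohort_percentile(fours_and_fives, cohort):.0f}th percentile")  # -> 98th percentile
```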

Of course, if all students took the test, the number of people scoring 3s, 4s and even 5s would go up. In particular, Red State students don't take APs as much as Blue State students, and whites don't take anywhere near as many APs as Asians.

My 2009 VDARE.com article has lots of graphs on how students do on the AP, overall and by race.

June 23, 2010

Not this again!

See if you can spot the fallacy in the following. I'll explain what's wrong with this logic afterwards.

From Inside Higher Ed:
New Evidence of Racial Bias on SAT

A new study may revive arguments that the average test scores of black students trail those of white students not just because of economic disadvantages, but because some parts of the test result in differential scores by race for students of equal academic prowess.

The finding -- already being questioned by the College Board -- could be extremely significant as many colleges that continue to rely on the SAT may be less comfortable doing so amid allegations that it is biased against black test-takers.

"The confirmation of unfair test results throws into question the validity of the test and, consequently, all decisions based on its results. All admissions decisions based exclusively or predominantly on SAT performance -- and therefore access to higher education institutions and subsequent job placement and professional success -- appear to be biased against the African American minority group and could be exposed to legal challenge," says the study, which has just appeared in Harvard Educational Review (abstract available here).

The existence of racial patterns on SAT scores is hardly new. The average score on the reading part of the SAT was 429 for black students last year -- 99 points behind the average for white students. And while white students' scores were flat, the average score for black students fell by one. Statistics like these are debated every year when SAT data are released, and when similar breakdowns are offered on other standardized tests.

The standard explanation offered by defenders of the tests is that the large gaps reflect the inequities in American society -- since black students are less likely than white students to attend well-financed, generously-staffed elementary and secondary schools, their scores lag.

In other words, the College Board says that American society is unfair, but the SAT is fair. And while many educators question the fairness of using a test on which wealthier students do consistently better than less wealthy students, research findings that directly isolate race as a factor in the fairness of individual SAT questions have, of late, been few.

The new paper in fact is based on a study that set out to replicate one of the last major studies to do so -- a paper published in the Harvard Educational Review in 2003, strongly attacked by the College Board -- and the new paper confirms those results (but using more recent SAT exams). The new paper is by Maria Santelices, assistant professor of education at the Catholic University of Chile, and Mark Wilson, professor of education at the University of California at Berkeley. The earlier study was by Roy Freedle of the Educational Testing Service.

The focus of both studies is on questions that show "differential item functioning," known by its acronym DIF. A DIF question is one on which students "matched by proficiency" and other factors have variable scores, predictably by race, on selected questions. A DIF question has notable differences between black and white (or, in theory, other subsets of students) whose educational background and skill set suggest that they should get similar scores. The 2003 study and this year's found no DIF issues in the mathematics section.

But what Freedle found in 2003 has now been confirmed independently by the new study: that some kinds of verbal questions have a DIF for black and white students. On some of the easier verbal questions, the two studies found that a DIF favored white students. On some of the most difficult verbal questions, the DIF favored black students. Freedle's theory about why this would be the case was that easier questions are likely reflected in the cultural expressions that are used commonly in the dominant (white) society, so white students have an edge based not on education or study skills or aptitude, but because they are most likely growing up around white people. The more difficult words are more likely to be learned, not just absorbed.

While the studies found gains for both black and white students on parts of the SAT, the white advantage is larger such that the studies suggest scores for black students are being held down by the way the test is scored and that a shift to favor the more difficult questions would benefit black test-takers.

Ready? Here goes:

By definition, blacks and whites are equally good at randomly guessing on multiple choice questions. So, the more difficult the question and thus the higher the percentage of students who randomly guess, the narrower the white-black differential.

If you made all the questions impossibly esoteric, so that everybody would guess on everything, then the white-black gap would disappear. If you made all the questions unbelievably easy, the white-black gap would also disappear. But when you make them a reasonable mix of difficulty in order to maximize the predictive value of the SAT, you wind up with a white-black gap -- because there is also a white-black gap in real world performance.
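
A Monte Carlo sketch makes the guessing argument concrete. Assume, purely for illustration, two groups whose ability is normal with means +0.5 and -0.5 SD, and a five-choice item answered correctly when ability exceeds the item's difficulty, with a random guess otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)

def item_score(group_mean: float, difficulty: float,
               n: int = 100_000, choices: int = 5) -> float:
    """Mean score on one multiple-choice item for a group whose ability
    is Normal(group_mean, 1); non-knowers guess uniformly at random."""
    ability = rng.normal(group_mean, 1.0, n)
    knows = ability > difficulty
    guesses = rng.random(n) < 1.0 / choices
    return float(np.where(knows, True, guesses).mean())

for difficulty in (-2.0, 0.0, 2.0, 4.0):   # very easy -> impossibly hard
    gap = item_score(+0.5, difficulty) - item_score(-0.5, difficulty)
    print(f"difficulty {difficulty:+.0f}: group gap = {gap:.2f}")
```

The simulated gap is widest on items of moderate difficulty and shrinks toward zero at both extremes, which is exactly the pattern the DIF studies read as bias.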

January 17, 2010

Evaluating teachers on value-added test scores: the Regression toward the Mean problem

Can sending star teachers into slum schools close the racial gap in school achievement? Can teachers be fairly evaluated by how much their students' test scores went up from last spring to this spring?

Both ideas are very fashionable these days. I want to evaluate both theoretically, using a simple model with two assumptions:

First, star teachers exist, fortunately. Over the course of one year, some teachers can raise their students' test scores more than one grade level. (There are also dud teachers who can't raise test scores as much as the average teacher can.) In my simplified model, a star teacher is one who raises students 1.5 grade levels per 1.0 year in the classroom.

Second, the positive impact of star teachers is partly reduced over time by regression toward the mean. After nine months under the guidance of Miss Jean Brodie, the kids are well ahead of the average. But when they come back from summer vacation, they aren't as far ahead anymore. Away from Ms. Wonderful, they've regressed toward the mean. There can be a lot of other causes for regression toward the mean. Perhaps after a second year under Miss Jean, some of the students are bored with her tricks and less intimidated by her shtick. Maybe, especially in math and science, the students start getting closer to their intellectual limits.

So, let's assess both questions about teachers with these two concepts in mind. Let's start with something I've always assumed was a good idea: value-added evaluations of teacher performance.

I've long advocated that teachers should not be evaluated on how well their students do on standardized tests, since the impact of the teacher is typically overwhelmed in the results by the differences between students. Those kinds of evaluation systems just augment the natural tendency for the best teachers to wind up with the best students, as everybody scrambles to get hired at the schools with the smartest students. Instead, I've argued for "value-added" evaluations of teachers, measuring how much test scores have gone up under the teacher relative to the students' previous scores. The Obama Administration has come around to this view, too.

Now, though, I've developed a worrisome question about measuring teacher performance on value added. How do you factor the effects of regression toward the mean into formulas for measuring teacher performance? In the real world, you can't always assume that last year's test scores show how smart each teacher's students are on average. Last year's scores were likely driven up or down by the quality of last year's teacher. The really confusing thing is that students whose test scores were unnaturally depressed by a bad teacher last year are likely to go up more this year than students whose test scores were boosted last year by a very good teacher. That's regression toward the mean.

Let's take a sports coaching example. When I was at Notre Dame High School, our archrival Crespi always killed us in pole vaulting during our annual track meet. In fact, Crespi vaulters set a whole bunch of different national age-group and high-school-year records. That's pretty amazing. Strangely enough, it becomes less amazing when you discover that all three star Crespi vaulters were named Curran. It turns out that the Curran brothers had a pole vault track and pit in their backyard, where their father, who had been a pole vaulter, trained them in advanced pole vaulting techniques.

Here's a one minute video from a Super Eight home movie from around 1972 of seventh-grader Anthony Curran clearing 9 feet in his backyard. I had always imagined ever since I read in the 1970s about the Curran family pole vaulting practice ground that they were very rich and had a huge back yard with an Olympic Stadium type set-up, but the video shows it's cramped, ramshackle, and the pit consists of old mattresses right in front of a brick wall. It looks like a good place to break your neck. I'm sure no modern upper middle class mom would put up with Dad and the boys building such a nightmare in the backyard, but Mrs. Curran can be seen waving happily in the home movie as her 13-year-old son hurtles toward his fate.

Not too surprisingly, the Curran Brothers were quite good pole vaulters in college (Anthony Curran, now the pole vault coach at UCLA, has an all-time personal best of 18'-8"), but they weren't the record setters in their subsequent careers that they had been in high school. I don't think any Curran's ever made the U.S. Olympic team. Regression toward the mean set in as they got older and better natural athletes started to catch up to them in hours of lifetime training.

Say you were the college pole vault coach of the Curran Brothers and the athletic director said to you, "Tim Curran set a world age-group record at 15, and Anthony Curran set national-class records in high school in his sophomore, junior, and senior years. We recruited you the two most accomplished high school vaulters in the history of the top pole vaulting state in the Union. But under your coaching, they aren't even winning college national championships. Why are you failing so badly with all this talent we gave you?"

The true answer is that because the Currans started training so much younger than their current competitors in college, they came closer to fulfilling 100% of their natural potential in high school than anybody else in California did. Now, the other kids are catching up and regression toward the mean is kicking in for the Currans. As high schoolers, the Currans had good nature and exceptional nurture to dominate an obscure sport. By college, they were running into competitors with even better nature, and the nurture gap was closing as all the top competitors got the same amount of coaching in college.

Now, let's think about this in a typical school, where children aren't always fully randomly shuffled after each year. For example, at my elementary school in the 1960s, there were 70 children at each grade level, so they were divided up into the Blue and the Red classrooms. They weren't tracked, they were just randomly assigned. If you started out as a Blue, you typically stayed in Blue with your closer friends.

Say that the two 1st grade teachers are wildly different in effectiveness. The Blue 1st Grade teacher's students finish the year a half grade level above the average, while the Red 1st Grade teacher's students finish the year a half grade level below average.

Now, if you are a second grade teacher of perfectly average effectiveness, a teacher who can be expected to raise the grade level of an average class by 1.0 years, which class do you want to inherit, Blue or Red, in order to do best on the teacher effectiveness evaluation at the end of their second grade?

Let's say that the great Blue first grade teacher's benefits have a one-year half life and the bad Red first grade teacher's harms have a one-year half life. In other words, there is regression toward the mean over time in teaching effectiveness, as in so much in life.

If you were just being measured not on value added, but on simple absolute performance at the end of the grade, you'd want to inherit the Blue class that ended last year 0.5 grade levels above average. If you do an average job and the half life is one year, then they'll finish your year averaging grade level 2.25: 0.25 grade levels above average, and you'll be considered a good teacher.

On the other hand, if you are being measured on value added -- calculated as your second graders' grade level at the end of your year minus their grade level at the end of the previous year -- you don't want to inherit the star teacher's overachieving Blue class, because you will only get credit for adding a crummy 0.75 grade levels in value. Sure, after two years, they'll be at grade level 2.25, but they were at 1.5 a year ago, so you only get credit for 2.25 - 1.50 = 0.75 grade levels of value added.

Under value added measurement, you might get fired for, in essence, having inherited the better taught class.

Instead, under value added measurement, you want to inherit the underachieving Red Class from that bad teacher, so that you can get the credit for her students' inevitable upward regression toward the mean. They'll wind up the year going from 0.5 to 1.75, so you'll get credit for adding the value of 1.25 grades. I'm a star! Give me my bonus money, Arne Duncan, gimme it now!
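
The Blue/Red arithmetic above can be written out in a few lines; this assumes, as in the example, an average teacher worth 1.0 grade levels per year and a one-year half life on an inherited class's deviation from average:

```python
def end_of_year_level(start_level: float, grade: int,
                      summer_retention: float = 0.5) -> float:
    """End-of-year grade level for a class taught by an average teacher:
    the class's deviation from grade level halves over the summer, then
    the teacher adds the standard 1.0 years."""
    deviation = start_level - (grade - 1)    # vs. an average class
    return grade + deviation * summer_retention

blue = end_of_year_level(1.5, grade=2)   # inherited from the star teacher
red = end_of_year_level(0.5, grade=2)    # inherited from the dud
print(blue, red)                              # -> 2.25 1.75
print("value added:", blue - 1.5, red - 0.5)  # -> 0.75 1.25
```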

This model where there is partial regression toward the mean after the impact of superstar teachers has interesting implications for the national obsession with closing the racial gaps in school achievement.

Assume you have an elementary school with average students where every teacher is a star capable of pushing students ahead 1.5 grades each year (a Grade Level Boost of 0.5), all else being equal. If there is zero regression toward the mean, a simple Excel model predicts that when the average student graduates at the end of eighth grade, he's performing at the 12th grade level.

Grade   Grade Level Boost   Regress to Mean   Grade Level
  1            0.5                 0%              1.5
  2            0.5                 0%              3.0
  3            0.5                 0%              4.5
  4            0.5                 0%              6.0
  5            0.5                 0%              7.5
  6            0.5                 0%              9.0
  7            0.5                 0%             10.5
  8            0.5                 0%             12.0

On the other hand, if there is 100% regression toward the mean, the average student, after eight years of star teachers, tests at just the 8.5 grade level at the end of 8th grade:
Grade   Grade Level Boost   Regress to Mean   Grade Level
  1            0.5               100%             1.5
  2            0.5               100%             2.5
  3            0.5               100%             3.5
  4            0.5               100%             4.5
  5            0.5               100%             5.5
  6            0.5               100%             6.5
  7            0.5               100%             7.5
  8            0.5               100%             8.5

The discouraging thing is that the results of regression toward the mean aren't symmetrical: you only get the big boosts in grade level by eliminating the last bits of regression toward the mean, but that's very hard to do.

For example, if the regression toward the mean factor is 50 percent per year, then the average student who has benefited from eight consecutive star teachers leaves the school at the end of the 8th grade performing at just the 9.0 grade level. Eight star teachers in a row have gotten him up only one grade level:

Grade   Grade Level Boost   Regress to Mean   Grade Level
  1            0.5                50%              1.5
  2            0.5                50%              2.8
  3            0.5                50%              3.9
  4            0.5                50%              4.9
  5            0.5                50%              6.0
  6            0.5                50%              7.0
  7            0.5                50%              8.0
  8            0.5                50%              9.0

So, you can see where the contemporary obsession in the Obama Administration and the prestige press comes from with trying to reduce regression toward the mean: taking away kids' summer vacations, keeping them at school a dozen hours per day (the celebrated KIPP program), and so forth.

Unfortunately, the big gains only come from eliminating the last bits of regression toward the mean. If you can cut regression toward the mean from 50% to 25%, then the average student's grade level at the end of eighth grade increases from 9.0 to 9.8:

Grade   Grade Level Boost   Regress to Mean   Grade Level
  1            0.5                25%              1.5
  2            0.5                25%              2.9
  3            0.5                25%              4.2
  4            0.5                25%              5.4
  5            0.5                25%              6.5
  6            0.5                25%              7.6
  7            0.5                25%              8.7
  8            0.5                25%              9.8

But, as you can see, in a school of star teachers, reducing annual regression toward the mean from 100% to 25% only boosts grade level upon eighth grade graduation by 1.3 years, from 8.5 to 9.8. In contrast, reducing annual regression toward the mean from 25% to 0% would, theoretically, boost grade level at elementary school graduation by 2.2 years, from 9.8 to 12.0. But, due to diminishing marginal returns, it's probably much harder to reduce regression toward the mean from 25% to 0% than from 100% to 25%.
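
The tables above come from a one-line recurrence: each year's boost adds to an accumulated advantage, and each summer that advantage decays by the regression factor. A sketch of the Excel model in Python:

```python
def final_grade_level(regress: float, boost: float = 0.5, years: int = 8) -> float:
    """Grade level at the end of `years` years of star teachers, where
    the accumulated advantage over grade level decays by `regress`
    each summer before the next 0.5-grade-level boost is added."""
    advantage = 0.0
    for _ in range(years):
        advantage = advantage * (1 - regress) + boost
    return years + advantage

for r in (0.0, 0.25, 0.50, 1.00):
    print(f"regression {r:.0%}: level {final_grade_level(r):.1f} at 8th-grade graduation")
# -> 12.0, 9.8, 9.0, and 8.5 respectively, matching the tables
```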

Since the white-black gap at the end of high school is three to four years, these regression toward the mean calculations can help explain why there is such a Blind Side-like obsession with plugging holes in the environment where NAM students' regression toward the mean might occur. For example, the NYT Magazine ran a feature on a public boarding school in a poor part of Washington DC where the taxpayers pay $35k per student per year for five nights per week at this boarding school. But the article was heavily devoted to worrying about whether the two nights per week that the students spend at home was causing the presumed test score gains of the five nights in the dorm to regress back toward the black mean.

Of course, the real killer in terms of closing the racial gap by eliminating sources of regression toward the mean is that eventually, these individuals turn into adults whom you can't manipulate so much, and then they choose environments for themselves.

My published articles are archived at iSteve.com -- Steve Sailer

How to select a good teacher

Half Sigma points to an Atlantic article by Amanda Ripley, "What Makes a Great Teacher?", discussing the research by the Teach for America charity into how to predict which college-senior applicants for teaching jobs will most boost their kids' test scores:
Superstar teachers had four other tendencies in common: they avidly recruited students and their families into the process; they maintained focus, ensuring that everything they did contributed to student learning; they planned exhaustively and purposefully—for the next day or the year ahead—by working backward from the desired outcome; and they worked relentlessly, refusing to surrender to the combined menaces of poverty, bureaucracy, and budgetary shortfalls. ....

Ideally, schools would hire better teachers to begin with. But this is notoriously difficult. How do you screen for a relentless mind-set?

When Teach for America began, applicants were evaluated on 12 criteria (such as persistence and communication skills), chosen based on conversations with educators. Recruits answered open-ended questions like “What is wind?” Starting in 2000, the organization began to retroactively critique its own judgments. What did the best teachers have in common when they applied for the job?

Once a model for outcomes-based hiring was built, it started churning out some humbling results. “I came into this with a bunch of theories,” says Monique Ayotte-Hoeltzel, who was then head of admissions. “I was proven wrong at least as many times as I was validated.”

Based on her own experience teaching in the Mississippi Delta, Ayotte-Hoeltzel was convinced, for example, that teachers with earlier experience working in poor neighborhoods were more effective. Wrong. An analysis of the data found no correlation.

For years, Teach for America also selected for something called “constant learning.” As Farr and others had noticed, great teachers tended to reflect on their performance and adapt accordingly. So people who tend to be self-aware might be a good bet. “It’s a perfectly reasonable hypothesis,” Ayotte-Hoeltzel says.

But in 2003, the admissions staff looked at the data and discovered that reflectiveness did not seem to matter either. Or more accurately, trying to predict reflectiveness in the hiring process did not work.

What did predict success, interestingly, was a history of perseverance—not just an attitude, but a track record. In the interview process, Teach for America now asks applicants to talk about overcoming challenges in their lives—and ranks their perseverance based on their answers. Angela Lee Duckworth, an assistant professor of psychology at the University of Pennsylvania, and her colleagues have actually quantified the value of perseverance. In a study published in The Journal of Positive Psychology in November 2009, they evaluated 390 Teach for America instructors before and after a year of teaching. Those who initially scored high for “grit”—defined as perseverance and a passion for long-term goals, and measured using a short multiple-choice test—were 31 percent more likely than their less gritty peers to spur academic growth in their students. Gritty people, the theory goes, work harder and stay committed to their goals longer. (Grit also predicts retention of cadets at West Point, Duckworth has found.)

But another trait seemed to matter even more. Teachers who scored high in “life satisfaction”—reporting that they were very content with their lives—were 43 percent more likely to perform well in the classroom than their less satisfied colleagues. These teachers “may be more adept at engaging their pupils, and their zest and enthusiasm may spread to their students,” the study suggested.

In general, though, Teach for America’s staffers have discovered that past performance—especially the kind you can measure—is the best predictor of future performance. Recruits who have achieved big, measurable goals in college tend to do so as teachers. And the two best metrics of previous success tend to be grade-point average and “leadership achievement”—a record of running something and showing tangible results. If you not only led a tutoring program but doubled its size, that’s promising.

Knowledge matters, but not in every case. In studies of high-school math teachers, majoring in the subject seems to predict better results in the classroom. And more generally, people who attended a selective college are more likely to excel as teachers (although graduating from an Ivy League school does not unto itself predict significant gains in a Teach for America classroom). Meanwhile, a master’s degree in education seems to have no impact on classroom effectiveness.

The most valuable educational credentials may be the ones that circle back to squishier traits like perseverance. Last summer, an internal Teach for America analysis found that an applicant’s college GPA alone is not as good a predictor as the GPA in the final two years of college. If an applicant starts out with mediocre grades and improves, in other words, that curve appears to be more revealing than getting straight A’s all along.

Last year, Teach for America churned through 35,000 candidates to choose 4,100 new teachers. Staff members select new hires by deferring almost entirely to the model: they enter more than 30 data points about a given candidate (about twice the number of inputs they considered a decade ago), and then the model spits out a hiring recommendation. Every year, the model changes, depending on what the new batch of student data shows.
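Stripped to its essentials, the kind of outcomes-based hiring model described boils down to scoring a candidate's data points against weights that get re-fit each year to student outcomes. Here is a minimal sketch; the feature names, weights, and threshold are entirely hypothetical, not Teach for America's actual inputs.

```python
# Hypothetical weights; in the real system these would be re-fit every
# year against the newest batch of student-outcome data.
WEIGHTS = {
    "gpa_final_two_years": 0.8,    # features assumed normalized to 0-1
    "leadership_achievement": 0.7,
    "grit": 0.5,
    "life_satisfaction": 0.6,
}

def hire_score(candidate):
    """Weighted sum of a candidate's (normalized) data points."""
    return sum(w * candidate.get(k, 0.0) for k, w in WEIGHTS.items())

def recommend(candidate, threshold=1.6):
    """The model 'spits out a hiring recommendation' from the score."""
    return "hire" if hire_score(candidate) >= threshold else "pass"
```

The real model uses over 30 inputs rather than four, but the mechanism is the same: past measurable performance in, recommendation out.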

But all these traits that correlate with being a good teacher would also likely correlate with being a good senior vice president at a Fortune 500 firm and lots of other tough and high-paid jobs. Heck, Teach for America's ideal high school math teacher would probably also be a good candidate to claw his way up the corporate ladder to be a Chief Financial Officer making 7 or even 8 figures.

It's not hugely enlightening to come up with a test that can determine that, say, Ben Franklin or James Cameron or Steve Jobs or Meryl Streep or John Madden or Steven Spielberg or Lee Kuan Yew has the skill set it takes to be a good schoolteacher. We also need another test to identify people who would be better at schoolteaching than at most other competing careers, or we'll suffer very high attrition from the schoolteacher ranks (as Teach for America does).

My published articles are archived at iSteve.com -- Steve Sailer

Augmenting the MCAT with a Big 5 personality test

The NYTimes describes a study of 600 Belgian college freshmen who entered a seven-year medical training program (i.e., combining what in the U.S. would be undergrad pre-med and medical school). The article focuses on the additional knowledge gained by giving a Big Five personality test on top of a cognitive test:

At the start of the study, the researchers administered a standardized personality test and assessed each student for five different dimensions of personality — extraversion, neuroticism, openness, agreeableness and conscientiousness. They then followed the students through their schooling, taking note of the students’ grades, performance and attrition rates.

The investigators found that the results of the personality test had a striking correlation with the students’ performance. Neuroticism, or an individual’s likelihood of becoming emotionally upset, was a constant predictor of a student’s poor academic performance and even attrition. Being conscientious, on the other hand, was a particularly important predictor of success throughout medical school.

In the U.S. setting, conscientiousness is likely measured well by undergraduate GPA.

And the importance of openness and agreeableness increased over time, though neither did as significantly as extraversion. Extraverts invariably struggled early on but ended up excelling as their training entailed less time in the classroom and more time with patients.

“The noncognitive, personality domain is an untapped area for medical school admissions,” said Deniz S. Ones, a professor of psychology at the University of Minnesota and one of the authors of the study. “We typically address it in a more haphazard way than we do cognitive ability, relying on recommendations, essays and either structured or unstructured interviews. We need to close the loop on all of this.”

Some schools have tried to use a quantitative rating system to evaluate applicant essays and letters of recommendation, but the results remain inconsistent. “Even with these attempts to make the process more sophisticated, there is no standardization,” Dr. Ones said. “Some references might emphasize conscientiousness, and some interviewers might focus on extraversion. That nonstandardization has costs in terms of making wrong decisions based on personality characteristics.”

By using standardized assessments of personality, a medical school admissions committee can get a better sense of how a candidate stands relative to others. “If I know someone is not just stress-prone, but stress-prone at the 95th percentile rather than the 65th,” Dr. Ones said, “I would have to ask myself if that person could handle the stress of medicine.”

This all makes sense. The danger, however, always seems to be that somebody with a high IQ and a low honesty level might be able to figure out what answers are wanted on the Big 5 personality test and just tell the testers what they want to hear. That's an advantage of IQ tests -- if you can figure out the answers the IQ testers want to hear, then you have a high IQ.

While standardized tests like the MCAT and the SAT have been criticized for putting certain population groups at a disadvantage, the particular personality test used in this study has been shown to work consistently across different cultures and backgrounds. “This test shows virtually none or very tiny differences between different ethnic or minority groups,” Dr. Ones noted. Because of this reliability, the test is a potentially invaluable adjunct to more traditional knowledge-based testing. “It could work as an additional predictive tool in the system,” she said.

I find this implausible. Has, for example, Woody Allen been lying to us all these years about Jews scoring higher on Neuroticism?

Keep in mind that Belgians need more than just a cognitive test because they have a single admission point for a seven-year course of study, so a personality test could augment a cognitive test and high school grades. Our four-year medical schools, however, get to use college grades, which are a lot more recent and relevant than high school grades for assessing Conscientiousness and the like.

One perennial question that personality testing could help to answer is whether hard work can make up for differences in cognitive ability. “Some of our data says yes,” Dr. Ones said. “If someone is at the 15th percentile of the cognitive test but at the 95th percentile of conscientiousness, chances are that the student is going to make it.” That student may even eventually outperform peers who have higher cognitive test scores but who are less conscientious or more neurotic and stress-prone.

Yeah, but you don't want to give the 15th percentile on the MCAT guy Dr. House's job.

This is like saying that if you score at the 95th percentile on undergrad GPA, you can make it if you score at only the 15th percentile on the MCAT. Perhaps. But in this situation I would be warier of relying on a single personality test result showing extreme conscientiousness than on four years of outstanding undergraduate grades, since a personality test result showing you're a hard worker is more easily faked than four years of good grades in college.

If you work hard for four years in college, then you probably are a hard worker. Still, it would be nice to have a faster selection method than that, so if the personality test boys can prove their results are reliable, more power to them. But, I'd like to see the proof, first.

My published articles are archived at iSteve.com -- Steve Sailer

January 7, 2010

NYT: "Law School Admissions Lag Among Minorities"

From the New York Times, an extremely typical news story. I'll let you decipher it, and will just point out that this new study is based on one of the same datasets I used in my May 30, 2009 VDARE.com article bringing together the most recent available scores for the Big Five postgraduate tests (LSAT, MCAT, GMAT, DAT, and GRE). Perhaps unsurprisingly, I reached different conclusions:
Law School Admissions Lag Among Minorities
by Tamar Lewin

While law schools added about 3,000 seats for first-year students from 1993 to 2008, both the percentage and the number of black and Mexican-American law students declined in that period, according to a study by a Columbia Law School professor.

What makes the declines particularly troubling, said the professor, Conrad Johnson, is that in that same period, both groups improved their college grade-point averages and their scores on the Law School Admission Test, or L.S.A.T.

“Even though their scores and grades are improving, and are very close to those of white applicants [not true], African-Americans and Mexican-Americans are increasingly being shut out of law schools,” said Mr. Johnson, who oversees the Lawyering in the Digital Age Clinic at Columbia, which collaborated with the Society of American Law Teachers to examine minority enrollment rates at American law schools.

However, Hispanics other than Mexicans and Puerto Ricans made slight gains in law school enrollment.

The number of black and Mexican-American students applying to law school has been relatively constant, or growing slightly, for two decades. But from 2003 to 2008, 61 percent of black applicants and 46 percent of Mexican-American applicants were denied acceptance at all of the law schools to which they applied, compared with 34 percent of white applicants.

“What’s happening, as the American population becomes more diverse, is that the lawyer corps and judges are remaining predominantly white,” said John Nussbaumer, associate dean of Thomas M. Cooley Law School’s campus in Auburn Hills, Mich., which enrolls an unusually high percentage of African-American students.

Mr. Nussbaumer, who has been looking at the same minority-representation numbers, independently of the Columbia clinic, has become increasingly concerned about the large percentage of minority applicants shut out of law schools.

“A big part of it is that many schools base their admissions criteria not on whether students have a reasonable chance of success, but how those L.S.A.T. numbers are going to affect their rankings in the U.S. News & World Report,” Mr. Nussbaumer said. “Deans get fired if the rankings drop, so they set their L.S.A.T. requirements very high.

“We’re living proof that it doesn’t have to be that way, that those students with the slightly lower L.S.A.T. scores can graduate, pass the bar and be terrific lawyers.”

Margaret Martin Barry, co-president of the Society of American Law Teachers, said that while she understood the importance of rankings, law schools must address the issue of diversity. “If you’re so concerned with rankings, you’re going to lose a whole generation,” she said.

The Columbia study found that among the 46,500 law school matriculants in the fall of 2008, there were 3,392 African-Americans, or 7.3 percent, and 673 Mexican-Americans, or 1.4 percent. Among the 43,520 matriculants in 1993, there were 3,432 African-Americans, or 7.9 percent, and 710 Mexican-Americans, or 1.6 percent. The study, whose findings are detailed at the Web site A Disturbing Trend in Law School Diversity, relied on the admission council’s minority categories, which track Mexican-Americans separately from Puerto Ricans and Hispanic/Latino students.

“We focused on the two groups, African-Americans and Mexican-Americans, who did not make progress in law school representation during the period,” Mr. Johnson said. “The Hispanic/Latino group did increase, from 3.1 percent of the matriculants in 1993, to 5.1 percent in 2008.”

Mr. Johnson said he did not have a good explanation for the disparity, particularly since the 2008 LSAT scores among Mexican-Americans were, on average, one point higher than those of the Hispanics, and one point lower in 1993.

Over all, Mr. Johnson said, it is puzzling that minority enrollment in law schools has fallen, even since the United States Supreme Court ruled in 2003, in Grutter v. Bollinger, that race can be taken into account in law school admissions because the diversity of the student body is a compelling state interest.

“Someone told me that things had actually gotten worse since the Grutter decision, and that’s what got us started looking at this,” Mr. Johnson said. “Many people are not aware of the numbers, even among those interested in diversity issues. For many African-American and Mexican-American students, law school is an elusive goal.”

Meanwhile, lawyer Mark Greenbaum complains in the LA Times that, by his estimate, there are 50% more law school grads each year than are needed to fill legal jobs:
From 2004 through 2008, the field grew less than 1% per year on average, going from 735,000 people making a living as attorneys to just 760,000, with the Bureau of Labor Statistics postulating that the field will grow at the same rate through 2016. Taking into account retirements, deaths and that the bureau's data is pre-recession, the number of new positions is likely to be fewer than 30,000 per year. That is far fewer than what's needed to accommodate the 45,000 juris doctors graduating from U.S. law schools each year.

Of course, a lot of people who graduate from law school never pass the Bar Exam, including about 40% of black law school grads and 53% of all blacks who start law school, according to Richard Sander. I don't believe there is affirmative action grading on Bar Exams, so that test traps a lot of blacks who have taken out huge student loans to attend law school. Why is the NYT pushing for playing that kind of dirty trick on blacks who are even less likely to pass the Bar Exam?

My published articles are archived at iSteve.com -- Steve Sailer

December 21, 2009

Advanced Placement Tests

The New York Times holds a discussion on whether too many Advanced Placement courses and/or tests are being offered to high school students.

Leaving aside for the moment the more subtle issues (some of which are explored surprisingly well in the discussion), I noticed in the NYT's comments a "B.P." who makes one helluva case for the basic existence of Advanced Placement testing:
I was the first person in my extended family (35 siblings and first cousins in this generation) to graduate from a 4 year university. My parents both left high school at age 16. My father finished high school by correspondence, my mother has her GED. I was raised in a religious minority with lower U.S. college attendance rates than the Native American population (per Pew research). As late as my last semester of high school, I doubted whether I would be able to attend college upon high school graduation.

I was also the (male) AP State Scholar from AZ for 1994. I qualified for free AP exams based on family income level, and I took all offered AP courses consistent with my schedule as well as taking exams in several other areas where AP courses were not offered. The 63 credits I earned in this fashion allowed me to complete a BS in Electrical Engineering in 3.5 years, while taking a light enough (12-15 semester hour) course load that I could schedule all of my classes for two or three day schedules, allowing me to work 3-4 days per week, while continuing to spend roughly 20 hours per week in religious activities. While supplemented by an AZ tuition waiver (class rank based) to attend a state school, a National Merit Scholarship, and proximity to campus (4 miles from ASU), this course credit was the key factor which allowed me to make the case to my father that I would be able to continue to work in the family business while attending college for an unextended period, and it wouldn't cost him a dime, nor would we incur debt.

Had my high school (with its roughly 50% dropout rate) not had an extensive AP program, I have no doubt that I would not have gone to college. I would currently be a sub-par unemployed electrician, instead of a registered professional engineer for the past 9 years. I would be looking for a job rather than having been employed in 5 progressively more responsible engineering positions at the same utility over the past 11 years. At least three family members would currently not own the houses they are living in, my youngest sister wouldn't have graduated from ASU, and I would currently be worrying about how to support my parents in retirement.

... Denying students opportunity is no service to students or society.

Sounds like the hero of a Heinlein juvenile novel from 1958.

I wonder which "religious minority" is this fellow from? Polygamous Mormon? Jehovah's Witnesses? Syrian Jewish? Shi'ite? Mennonite? There are a lot of clues in his comment (which can be read in full here), but I haven't been able to come up with a good guess.

My published articles are archived at iSteve.com -- Steve Sailer

July 28, 2009

Pilots and g-Force

As I've mentioned, one of the rules of polite journalism in discussing testing firefighters is to assume that paper and pencil tests must be irrelevant to the obviously moronic job of spraying water on burning buildings. Never refer to the voluminous data assembled over the decades by the Pentagon on the relationship between performance on paper and pencil tests and performance on similarly physical jobs.

When researching my 2004 article on John F. Kerry's and George W. Bush's IQ scores, judging from their performance on the Officer Qualification Tests they took in the late 1960s (Bush 120-125, Kerry 115-120, which turned out to fit with their GPAs at Yale), I read a lot of studies from the 1960s by the military's psychometricians documenting the predictive validity of these exams. I then tried to track down the authors to help me understand Kerry's and Bush's scores.

I spent two hours on the phone with a very helpful gentleman, now a college professor of statistics, who had retired after many years as the head psychometrician for one of the major branches of the Armed Services.

Among much else that was interesting, he mentioned that in 1990 he had provided to Charles Murray the U.S. military's scores from the renorming of its AFQT enlistment test. In 1980, the Pentagon had paid the Department of Labor to give the AFQT to all 12,000+ young people in its National Longitudinal Study of Youth database. The middle section of The Bell Curve is devoted to tracking how these ex-youths, now 25 to 33 in 1990, were doing in life in relation to their IQ scores a decade before.

My source had nothing but praise for The Bell Curve.

The psychometric expert said something that seemed puzzling to me. He said that the General Factor of intelligence completely dominated job performance as a pilot to such an extent that it really wasn't worthwhile to give multiple intelligences tests of specific piloting skills, such as the one George W. Bush took in 1968 to measure his 3-d visualization skills.

For example, a question might ask:
Which picture represents how the horizon would look straight-ahead out the cockpit window when you are in the midst of turning from flying north to flying east while banking 60 degrees?

A. _
B. /
C. \
D. |

Bush only scored, I believe, at the 25th percentile on this test, but I don't think this kind of thing came up much in the Oval Office.

My source said that he recommended getting rid of flying-specific tests for admission to pilot training, but the brass wouldn't go along with it because they insisted there had to be pilot-specific skills separate from the g Factor.

Listening to him, I certainly agreed with the brass. After all, I have a decent IQ, but I'd make a terrible pilot during the brief interval before I became a smoking crater due to making some stupid mistake.

And, this is not something I only recently realized. I can vaguely recall being 16 and looking at the catalog from the Air Force Academy and deciding, based on my experience driving a car, riding a bike, playing sports, and generally bumbling about in the physical world, that I wasn't cut out to pilot Air Force jets.

I've wondered about this expert's finding over the years, and I think I've finally started to figure it out: People with high IQs who would be bad pilots generally figure out for themselves that they would be bad pilots; so, they never take the tests to be pilots. Thus, the high correlation between the g Factor and pilot performance: high IQ individuals are already selected for having pilot-specific skills.
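This self-selection story can be checked with a toy simulation (pure illustration, all numbers made up): give everyone an independent g and a pilot-specific skill, let performance depend on both equally, allow only people above some skill threshold to apply, and then see how well each trait predicts performance among the applicants.

```python
import random

random.seed(42)

def corr(xs, ys):
    """Pearson correlation, computed by hand to stay stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Everyone gets independent g and pilot-specific skill, both N(0, 1);
# actual piloting performance depends on both equally.
people = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50_000)]

# Self-selection: people who sense they'd be bad pilots never apply,
# so only those above a skill threshold show up for testing.
applicants = [(g, s) for g, s in people if s > 0.5]

perf = [g + s for g, s in applicants]
g_vals = [g for g, s in applicants]
s_vals = [s for g, s in applicants]

print(round(corr(g_vals, perf), 2))  # g still predicts strongly
print(round(corr(s_vals, perf), 2))  # range-restricted skill predicts weakly
```

In the full population the two traits predict equally well; among the self-selected applicants, the skill's range is restricted, so g dominates the measured correlations, which is the pattern the psychometrician observed.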

Similarly, high IQ guys who would make lousy firemen already know it, so they don't take the firemen's test much.

Thus, hiring tests like the New York ones ruled discriminatory by Judge Garaufis tend to work well. They are combination aptitude and achievement tests in which all the questions are solely about firefighting, but all the information needed to answer them is given on the test itself. Still, under pressure, it's not easy to decipher passages about the technical details of chainsaw maintenance.

Thus, to score perfectly on these kinds of tests, it helps to be both reasonably bright and to have studied firefighting guidebooks. High IQ guys who wouldn't make good firemen tend to figure out while they're studying that this isn't the career for them, and thus don't take the tests. So, these kinds of aptitude/achievement tests work quite well.

My published articles are archived at iSteve.com -- Steve Sailer