Duplicate Data at the University of Chicago

Is this fraud, or incompetence?

Mar 28, 2023

This scandal broke 2 weeks ago, when The Institute for Replication (i4replication.org) released a statement:

At our first Replication Games in Oslo, Montero, [a replication of Cooperative Property Rights and Development: Evidence from Land Reform in El Salvador] was assigned to Anders Kjelsrud [Associate Professor in economics, Oslo Business School], Andreas Kotsadam [Professor at the Psychology Department at University of Oslo,] and Ole Rogeberg [Research economist employed at the Frisch Centre].

In their new [replication] …They find that “three-quarters of the observations are duplicates”.

“Three-quarters of observations are duplicates.”

Ouch.

It’s such a stupid & glaring mistake, made at the highest level in the world, that when EJMR noticed it, Data fraud JPE Montero 2022 became the #4 top thread of March 2023.

Thank you for reading Karlstack. This post is public so feel free to share it.

Here are some EJMR reactions:

This is insane. A JPE paper where you never realize 75% of your observations are duplicates? I'm not sure which is worse. Being a researcher so bad that you make this mistake. Or it being fraud.
— Anonymous
Unbelievable. One works on a job market paper for years. How can't one fail to notice this massive mistake? It takes 15 minutes of data eyeballing.
— Anonymous
How is this fair to everyone else who was honest about their work rather than manipulate data into a JPE? Awful look for Chicago/Harris.
— Anonymous
Whether or not this intentional doesn't have a whole lot to do with the more immediate concern that this paper should be retracted from the JPE immediately.
— Anonymous
I don't get why JPE doesn't retract the paper.
— Anonymous
To me, figure 2 is strong evidence of sloppiness rather than fraud. If you tried to cheat, the last thing you would do is to generates hundreds of identical copies.
— Anonymous
What are the consequences for this?
If one of my students was caught doing this, they'd be kicked from the program.
— Anonymous
All this guys work including the ones with Sara Lowes is sus. I am familiar with their work. Someone should replicate hers too.
— Anonymous
Mistakes happen and I'd be terrified if someone tried to replicate one of my published papers...
— Anonymous
Tbh he needs to lose his job. This is way too egregious.
— Anonymous
This is such an egregious scandal. There is no way that this could have been unnoticed over multiple rounds of revisions. Montero is a fraud.
— Anonymous
To the people saying: "are you 100% sure your papers replicate", etc.
I am sure mistakes can be found in my work, but having a dataset of 1000+ observations suddenly quadrupling in size because of a bad merge and my not noticing... no way, just no way. Whoever is using this as a defense should be ashamed of themselves.
— Anonymous
It would be disheartening but not surprising if JPE doesn't retract. Its a club and most of us aint in it.
— Anonymous
A Top 5 (or several) can make or break someone's career. Whether this was sloppiness or ill-intended....it does not speak highly of his work and the double standards in assessing the work of researchers with different pedigrees.
— Anonymous
It's atrocious. How can one not realize this mistake? For a JPE!
— Anonymous
This is why I am the Taliban to my RAs when it comes to merges, sample sizes, missing data, duplicates, etc. I am extremely exacting about these details because this is exactly what happens.
— Anonymous
This is so egregious that if action is not taken, what reputation should U Chicago or J Pol Econ have?
— Anonymous
Seems a bit excessive to immediately fire him. But this should obviously dent his tenure case quite badly.
— Anonymous
These days it seems like every day I read about a new theory paper in Econometrica that is just wrong, a JPE with sloppy data that is found wrong, an AER that was plagiarized from another discipline and then waved through by an editor with a clear COI, a ReStud retracted. This means people will be richer than me to the tune of a million dollars or more and enjoy a lifetime of prestige, largely on the back of bad research.
We could fix this by ending the cult-like obsession with top 5s that means editors of those journals (clearly not capable of upholding the standards of the profession) can give a lifetime pass to anyone who fits with their mafia norms.
But the top of the profession doesn't care that this is the situation because they are busy using their control of the top 5 to create free rides for their friends and advisees
— Anonymous
I am not 100% sure; none of us can be 100% sure. But I have several people checking my codes (e.g., RAs), doing descriptives on the data, and replicating my results before submitting a paper. It is super high stakes to put a result out there -- especially in print. Having 3/4 of your data be duplicates ... pfff. You are asking the profession to be quite forgiving
— Anonymous
This is atrocious. It looks like someone felt pressure to have a solo pub and so fraudulently manipulated data and lied about it.
— Anonymous
This paper of Montero's is also super sus:
https://emontero23.github.io/emwebsite/montero_yang_festivals_2022.pdf
No way it's true. None. Pure p-hacking or data fraud.
— Anonymous
Will Montero voluntarily retract this paper then?
— Anonymous

The Journal of Political Economy (JPE) is ranked as the 4th best economics journal in the world, so as an untenured economics professor struggling for tenure, Montero’s solo publication in the JPE is like a golden ticket. This solo JPE would be akin to an untenured biology professor scoring a solo Cell, an untenured math professor scoring a solo Inventiones, an untenured physics professor scoring a solo Reviews of Modern Physics, or a medical intern scoring a solo Lancet.

Eduardo Montero is the guy who scored a solo JPE with 3/4 of his data being duplicates. He earned his economics PhD from Harvard in 2018 and is now an assistant professor at the University of Chicago Harris School of Public Policy.

It should be noted that UChicago, Montero’s employer, owns the JPE. The JPE is a Chicago house journal, although according to a recent study it is the only house journal that does not lower its standards for authors with ties to its owner. The same cannot be said of the QJE and AER for authors with ties to Harvard and MIT.

IZA @iza_bonn

New IZA DP on home bias in top economics journals: "... median article quality is lower in the #QJE if authors have ties to Harvard and/or MIT than if authors are from other top-10 universities, but higher in the #JPE if authors have ties to Chicago." 👉iza.org/publications/d…

The statement continues:

We want to make it clear that this is NOT a gotcha moment. Instead, it is a moment to further think about the importance of open science, data sharing & replication in economics.

It is definitely a gotcha moment, haha, they are just being polite.

The duplication of observations was first noted via visual inspection (See Figure X in the DP). Using the raw data from the initial survey, Kjelsrud et al. identified a set of survey variables that were sufficient to uniquely identify individuals and that were also retained in the replication data. This effort reduced the number of observations in the worker-level sample from 4,770 to 1,146. Montero (2022) calculated inequality at the property-year level. Because of the significant reduction of individual observations there is a loss in meaning in this inequality measure.

“The duplication of observations was first noted via visual inspection.” This makes sense, since even a cursory look at this dataframe in RStudio or Jupyter or Stata would immediately reveal that most datapoints are duplicates.
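For a sense of how little effort that check takes, here is a minimal sketch in plain Python. The rows and column meanings are made up for illustration and are not from Montero's replication files. It also treats an entire row as the identity of an observation, which is a simplification: the replication team matched on a set of identifying survey variables from the raw data.

```python
from collections import Counter

# Hypothetical worker-level rows: (property_id, age, years_schooling, earnings).
# These values are invented purely to illustrate the check.
rows = [
    ("P1", 34, 6, 120.0),
    ("P1", 34, 6, 120.0),
    ("P1", 34, 6, 120.0),
    ("P1", 34, 6, 120.0),
    ("P2", 51, 3, 95.5),
    ("P2", 51, 3, 95.5),
    ("P3", 27, 9, 140.0),
    ("P3", 45, 2, 80.0),
]

# Count how many times each exact row appears.
counts = Counter(rows)

# Keep one copy of each distinct row.
unique_rows = list(counts)

# Share of the dataset that is redundant copies.
duplicate_share = 1 - len(unique_rows) / len(rows)

print(f"{len(rows)} rows, {len(unique_rows)} unique")
print(f"duplicate share: {duplicate_share:.0%}")
print("most repeated row:", counts.most_common(1)[0])
```

A duplicate share far above zero in survey microdata is the kind of number that should stop an analysis until it is explained.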

What caused all these duplicates? Dan de Kadt, Assistant Professor of Quantitative Research Methods at the London School of Economics, concludes that “The core problem - 75% of data are duplicates - came from a merge using non-unique identifiers.”

Any economist, data scientist, software engineer, statistician etc. reading this right now is cringing! They know what a “merge” is, and they know how brutal this mistake is.

If you aren’t a shape rotator, the picture below is roughly what a data “merge” looks like. Imagine you are merging multiple Excel sheets together into one big Excel sheet database, except rather than merging these sheets by hand, you write a line of code to do it for you automatically.

How to merge multiple data frames using base R — Blog — Musgrave Analytics

Sometimes, when you write this line of code you misconfigure the code, fat-finger a column, read the names of the columns wrong, mix up the names of the keys, mix up your variables, etc. A million things could go wrong that result in you creating thousands of duplicates. Everyone messes up merges sometimes. It’s a normal part of working with data.
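To make the failure mode concrete, here is a rough sketch of a merge on a non-unique identifier, written in plain Python so it runs without any libraries. The `left_merge` helper, the table names, and the values are all hypothetical; this is not Montero's code or data, only an illustration of the mechanism de Kadt describes.

```python
def left_merge(left, right, key):
    """Left join of two lists of dicts on `key`, keeping every match."""
    matches = {}
    for row in right:
        matches.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        # One output row per matching right-hand row; unmatched rows are kept as-is.
        for match in matches.get(row[key], [{}]):
            merged.append({**row, **match})
    return merged


# Hypothetical worker table: one row per worker.
workers = [
    {"worker_id": 1, "property_id": "P1"},
    {"worker_id": 2, "property_id": "P1"},
    {"worker_id": 3, "property_id": "P2"},
]

# Hypothetical property table that SHOULD have one row per property,
# but property P1 accidentally appears four times.
properties = [
    {"property_id": "P1", "region": "West"},
    {"property_id": "P1", "region": "West"},
    {"property_id": "P1", "region": "West"},
    {"property_id": "P1", "region": "West"},
    {"property_id": "P2", "region": "East"},
]

merged = left_merge(workers, properties, "property_id")

print(len(workers), "workers before the merge")
print(len(merged), "rows after the merge")
```

Every worker on property P1 now appears four times. Nothing crashes and no warning is printed, which is why this kind of error only gets caught if somebody compares the row counts before and after.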

The thing is… it’s such a normal part of working with data that it is a very, very easy mistake to catch. If you merge, and then your dataset is suddenly 4 times bigger than it is supposed to be, this is something any competent programmer would immediately notice, especially if you are Montero, who holds a master’s degree in Statistics from Stanford and just spent 4 years working through multiple drafts of the paper (the first version I can find is from 2018). It’s unfathomable that he could look at his data during those 4 years of working with it and not notice that 3/4 were duplicates.
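Catching it can be automated in a few lines. The sketch below, again in plain Python with hypothetical helper names and made-up tables, refuses to merge when the lookup key is not unique and double-checks that the row count did not change.

```python
from collections import Counter


def assert_unique_key(rows, key):
    """Raise if any value of `key` appears more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    repeated = {k: n for k, n in counts.items() if n > 1}
    if repeated:
        raise ValueError(f"key {key!r} is not unique: {repeated}")


def checked_left_merge(left, right, key):
    """Left join that expects `key` to be unique in `right` (many-to-one)."""
    assert_unique_key(right, key)
    lookup = {row[key]: row for row in right}
    merged = [{**row, **lookup.get(row[key], {})} for row in left]
    # A many-to-one left join must never change the number of rows.
    assert len(merged) == len(left), "row count changed during merge"
    return merged


# Hypothetical tables, invented for illustration.
workers = [
    {"worker_id": 1, "property_id": "P1"},
    {"worker_id": 2, "property_id": "P1"},
    {"worker_id": 3, "property_id": "P2"},
]
properties = [
    {"property_id": "P1", "region": "West"},
    {"property_id": "P2", "region": "East"},
]
bad_properties = properties + [{"property_id": "P1", "region": "West"}]

clean = checked_left_merge(workers, properties, "property_id")
print(len(clean), "rows after a clean merge")

try:
    checked_left_merge(workers, bad_properties, "property_id")
except ValueError as err:
    print("merge refused:", err)
```

Real tools ship with this kind of guard built in: pandas `merge` accepts a `validate` argument, and Stata makes you declare `merge 1:1` or `merge m:1` up front. Whether any such check was used in this paper's code is not something I know.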

So unfathomable, in fact, that some people believe this is a case of malicious, intentional data fraud. While that is possible, I don’t think so… and trust me, I am usually the first to yell about cheating. I am inclined to give him the benefit of the doubt simply because it’s more plausible that this is a brain fart; if he wanted to commit intentional fraud, he would be INSANE to make his fraud so easily verifiable. No rational economics professor is going to risk his career by submitting a paper to the JPE knowing that 3/4 of the data are duplicates, since he would rationally expect it to be caught immediately by the JPE, or anyone reading the JPE. It’s a testament to the JPE’s lack of data due diligence that he wasn’t caught before it was published.

I contacted the JPE editorial team and asked them, “How did this data error squeak through -- are there not procedures in place to perform due diligence and check the data? Were these procedures skipped, not followed properly, or were they inadequate in some way?” This is their response:

Our data policy is described here Journal of Political Economy: Data Policy (uchicago.edu)
As explained here, it is the policy of the Journal of Political Economy to publish papers only if the data used in the analysis are clearly and precisely documented and are readily available to any researcher for purposes of replication. After acceptance, authors are expected to upload their data, programs, and sufficient details to permit replication, in electronic form, to JPE's Dataverse Repository. If some or all of the data are proprietary and an exemption from this requirement has been approved by the Editor, authors must still provide a copy of the programs used to create the final results.
Beyond the usual refereeing process, we do not have a procedure for checking accepting papers for data errors. I believe some economics journals (such as the AER) have recently implemented a procedure for checking that the reported code produces the numbers in the article. While potentially useful, I believe this would not have caught the mistake in Montero's paper where there was a coding error (or address the key concerns about replicability and data/coding errors). We have debated whether to require referees to check code, and while some do, we worry that requiring it would make it even harder for us to find willing and qualified referees.
Another option we discussed is to hire people to carefully check that the code is written in a way that it reflects what the authors intend for it to do (and not simply that it runs or replicates reported results). While this would be ideal, we unfortunately do not have the resources to do this.

So, the JPE didn’t catch this error because they don’t check every line of code in every paper. They do check that the authors upload data and code, but not that it is correct. They mostly seem to operate on the “honor system” and trust that the uploaded code and data is correct.

These issues lead to analyses that show the conclusions of Montero (2022) are incorrect. While the mistakes are confined to one set of results in the original paper, these findings are highlighted in the abstract and conclusions of that work.

The conclusions of Montero’s 2022 JPE paper are wrong. If the conclusions of the paper are wrong due to a fatal coding error… why is the paper not retracted?

Kjelsrud et al.’s work is now accepted at the Journal of Political Economy as a comment. Their replication was submitted late 2022 & accepted within 3 months “after only some very minor and reasonable revisions.”

Rather than retract the erroneous 2022 paper, the JPE is publishing a 2023 comment from the replication team.

I think this is the wrong decision. The JPE knows the old paper is wrong and broken, and yet they leave it published in their journal? Why? Anyone on Google can still stumble upon it and cite it, thinking it is legit!

Moreover, look at Eduardo Montero’s current CV, below. I have highlighted this JPE in red:

Wow! A solo JPE! Impressive!

Nobody ever actually reads any papers (this one is 45 pages long). They only see the solo JPE and assume it is legit, and this one will be plastered on the front page of his CV for the next 40 years.

I4R contacted Montero and he confirmed to us that he “thanks the replication team for their careful and thoughtful replication and for bringing the clear data issue to attention”.

Here is Montero’s generic post-publication note:

https://emontero23.github.io/emwebsite/ESLR_note.pdf

It starts by acknowledging the mistake:

Once the data merging error is corrected, these results are no longer valid.

Then apologizing and taking full responsibility:

I sincerely apologize for this inadvertent mistake.

Then he thanks the replication team:

I thank Kjelsrud et al. (2023) for bringing the data error to my attention. I hope that future research can better shed light on the causal effects of land reforms, cooperatives, and inequality across varied settings.

So what is the final outcome?

Both the replication team’s comment and Montero’s apology note will be published in the August 2023 issue of the JPE… Montero and the JPE shrewdly get away with no retraction, and everybody will forget about this a month from now. Well played, Chicago mafia.

I4R @I4Replication

We are agnostic as to whether a retraction or a lengthy corrigendum is more appropriate. But we are certainly making progress with the recent retraction at @RevEconStudies and this quickly accepted comment at @JPolEcon.

Prior to publishing, I sent a complete draft of this article to Dr. Montero and offered him the chance to point out any factual inaccuracies or anything else he feels is unfair. I also asked him, “Why did you choose to not retract your article?”

That was 36 hours ago… and in that email I said I would give him 48 hours to respond, but here I am at 36 hours publishing anyways. The JPE responded right away. I don’t think Montero is going to respond in the final 12 hours. If he does, I will update this article.
