*The PIM is a woefully inadequate guide for audits leveraging extrapolation.*

**EDITOR’S NOTE:** This is the third in a series of reports on alleged bias the author has uncovered in extrapolation audits.

In my last two articles, I addressed several of the reasons I feel Chapter 8 of the Program Integrity Manual (PIM) should no longer be the standard for government audits.

First and foremost, it is sorely lacking in depth and completeness. In only 20 pages, Chapter 8 attempts to establish guidelines for an inferential statistical process known as extrapolation. It fails to adequately cover the complexities involved in sampling methods, including stratification. It fails to adequately define such important issues as precision, confidence intervals, data distribution, and the appropriate metrics to use in specific situations. It fails to properly elaborate on the pros and cons of the possible sampling units: should we use a claim? A line item? A beneficiary? Currently, the guidelines say you can use any of these, but they don’t say you have to use the one that is best.

Of particular concern is that the PIM mentions, as one of its main references, a book called *Sampling Techniques*, Third Edition, written by former Harvard professor of statistics William Cochran. In many statistical circles, this book is considered the bible for sampling and inferential statistics. In fact, in many of my post-audit extrapolation mitigation engagements, the auditor cites Cochran as a reference for its audit work.

Yet the PIM is almost the antithesis of that book. Nowhere does it engage in a more advanced discussion of the reasoning or the logic behind its guidelines. Let’s talk about specifics – in this article, I want to address one major issue regarding sampling, and that is the issue of stratification.

The main purpose of random statistical sampling is to create a subset of data from some universe or sample frame that is not only randomly selected, but also representative of that larger population of data. When I say “representative,” I mean, in the more colloquial sense, that it looks like the data in the population.

There are statistical ways to test this. One such test is called the two-sample t-test, which focuses on the difference (or similarity) in the means between the sample and the population. I can’t tell you how many times I have had a situation in which the mean paid amount for the sample was significantly higher than the mean paid amount for the population. And if you follow the PIM logic, which implies that the paid amount is correlated to the overpaid amount (it is not), and if you accept the assumption that every unit in the sample has an equal likelihood of being adjudicated as overpaid, then you have a situation that naturally biases the overpaid amount upward. The t-test, however, is most suited for situations in which the data are normally distributed, which almost never happens with paid data. I explained this in more detail in the first article of this series. When that situation arises, you can use what is called a Mann-Whitney test, which, like the two-sample t-test, compares the central tendency of two groups – only in this case, it works from ranks and so speaks to the median rather than the mean.
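
To make the Mann-Whitney idea concrete, here is a minimal sketch in plain Python. It is not the procedure any auditor or the PIM prescribes; it simply ranks the pooled values and uses the normal approximation with a tie correction and no continuity correction. The data are invented (a lognormal stand-in for skewed paid amounts), the function names `average_ranks` and `mann_whitney_u` are my own, and comparing the sample against the rest of the frame is an assumption about how one might set up the check.

```python
import math
import random
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for x, z score, p-value). No continuity correction is applied,
    and the data must not all be identical (the variance would be zero).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, z, p_value


# Hypothetical, right-skewed paid amounts standing in for a sampling frame.
rng = random.Random(42)
frame = [round(rng.lognormvariate(4.5, 1.0), 2) for _ in range(2000)]
sample_idx = set(rng.sample(range(len(frame)), 60))
sample = [frame[i] for i in sample_idx]
remainder = [v for i, v in enumerate(frame) if i not in sample_idx]

u, z, p = mann_whitney_u(sample, remainder)
print(f"U = {u:.1f}, z = {z:.2f}, two-sided p = {p:.3f}")
```

A small p-value would suggest the sample does not look like the rest of the frame with respect to paid amount; a large one says only that this particular test found no evidence of a difference.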

I was the statistical expert in the sentencing portion of a recent case in Gainesville, Fla. (*United States v. Ona Colassante*). In this case, the data was heavily skewed, and in these types of distributions, the mean is all but useless (I will explain more about that later) because it is highly subject to outliers, which were present in droves. In such a case, the median is a more robust measurement of central tendency.

Well, the prosecutor, who was an assistant U.S. attorney (AUSA), walked over to me while I was on the witness stand, slammed Chapter 8 of the PIM down in front of me, and challenged me to find where it mentioned using the median.

Having all but memorized that chapter, I knew that it wasn’t in there anywhere, and that is one of the reasons I am writing these articles. How could you possibly have a set of statistical guidelines that focus on sampling and extrapolation for distributions you know are going to be skewed and not have any mention of using the median instead of the mean? Fortunately, the federal judge in the case sided with me and not with the guidelines, which ultimately resulted in the extrapolations being completely disregarded. Now there’s a precedent I can live with!
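
The point about outliers is easy to see with a toy example. The paid amounts below are invented for illustration and have nothing to do with the case above; the sketch only shows how a single extreme value drags the mean while barely moving the median.

```python
from statistics import mean, median

# Hypothetical paid amounts for nine claims, in dollars.
paid = [62.0, 71.5, 80.0, 84.25, 90.0, 95.5, 101.0, 110.0, 118.75]

# The same claims plus one extreme outlier.
with_outlier = paid + [12500.0]

print(f"without outlier: mean = {mean(paid):.2f}, median = {median(paid):.2f}")
print(f"with outlier:    mean = {mean(with_outlier):.2f}, median = {median(with_outlier):.2f}")
```

Any statistic built on the mean inherits that sensitivity, which is why a skewed frame calls for at least looking at the median.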

But again, I digress. Let’s talk about stratification. The main purpose of stratification is to break the population into more homogeneous groups such that the variation between sampling units within each group is as low as possible. This goes to the idea of precision, and if the government is going to use extrapolation as a method to bankrupt a small business, then it should shoot for as high a level of precision as possible.

At least that is what any reasonable statistician might believe, but that is not a goal listed in the PIM. Imagine you have a population of procedures and services for a multispecialty practice, and you decide to select a random sample of some number of claims. It is likely that those claims will have between one and six line items, and the procedures and services representing those line items are quite often very diverse. You may have a mixture of lab, imaging, evaluation and management (E&M), surgical, etc., all within the same claim. Section 8.4.11.1 states that “generally, one defines strata to make them as internally homogeneous as possible with respect to overpayment amounts, which is equivalent to making the mean overpayments for different strata as different as possible.”
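
The quoted sentence has a simple arithmetic basis: under proportional allocation, only the variation within strata contributes to the variance of the estimated mean, so the gain from stratifying depends entirely on how different the stratum means are. Here is a minimal sketch under simplifying assumptions (population variances, no finite population correction). The overpayment figures and stratum labels are invented and deliberately constructed to show the contrast; they are not evidence about any real audit.

```python
from statistics import pvariance


def srs_variance(values, n):
    """Variance of the sample mean under simple random sampling (no fpc)."""
    return pvariance(values) / n


def stratified_variance(strata, n):
    """Variance of the stratified mean under proportional allocation (no fpc)."""
    total = sum(len(v) for v in strata.values())
    within = sum(len(v) / total * pvariance(v) for v in strata.values())
    return within / n


# Hypothetical overpayment amounts for the same eight sampling units,
# grouped two different ways.
by_code_group = {"E&M": [0.0, 5.0, 0.0, 10.0], "surgical": [400.0, 380.0, 420.0, 410.0]}
by_paid_band = {"low": [0.0, 400.0, 0.0, 420.0], "high": [5.0, 380.0, 10.0, 410.0]}
everything = [v for group in by_code_group.values() for v in group]

print("simple random sample:", srs_variance(everything, 4))
print("strata by code group:", stratified_variance(by_code_group, 4))
print("strata by paid band: ", stratified_variance(by_paid_band, 4))
```

When the grouping variable tracks the overpayment, the variance collapses; when it does not, stratification buys almost nothing.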

The first problem is that, without a probe sample, you have no idea what the overpayment amounts might be, which really negates the relevance of this statement. And when you have a highly diverse set of line items within a series of claims, figuring out how to correlate overpaid amounts is critically important. The government, however, always uses the paid amount to select the strata, as they believe that the paid amount is most likely correlated to the overpaid amount, but anyone who has worked on the revenue cycle of a provider knows that is not true. I can show you, for nearly any practice, that there is a huge variability in the paid amount for any one procedure across any number of payors, for the same provider. This has to do with copay and deductible amounts, edits, disallowances, and any number of other reasons. Even Professor Donald Edwards, likely the poster child for Chapter 8, stated at a conference: “In short, if you stratify by payment amount, you’re stratifying the wrong population.”

This poor practice is supported again, in that same section, as follows: “A common situation is one in which the overpayment amount in a frame of claims is thought to be significantly correlated with the amount of the original payment to the provider or supplier. The frame may then be stratified into a number of distinct groups by the level of the original payment and separate simple random samples are drawn from each stratum.”

In statistics, one does not “think” something to be “significantly correlated;” one tests for that assumption. And again, there is nothing within the PIM that gives any guidance, instruction, restrictions, limitations, best practices, or anything other than “whatever” with respect to the audit design. In essence, it implies that as long as the auditor had good intent and did his or her best, it’s all good.
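
Testing that assumption is not hard. One way, sketched below, is a permutation test on the Pearson correlation, which avoids assuming normality. This is my choice of technique for illustration, not something the PIM or Cochran specifies, and it presumes you have overpayment findings from a probe or prior sample. The paid and overpaid figures are invented, and the function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient; x and y must not be constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided p-value: how often shuffled data correlate at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical probe sample: paid amount and adjudicated overpayment per unit.
paid = [45.0, 60.0, 72.5, 88.0, 95.0, 110.0, 130.0, 150.0, 210.0, 480.0]
overpaid = [0.0, 60.0, 0.0, 0.0, 95.0, 0.0, 20.0, 0.0, 0.0, 35.0]

print(f"r = {pearson_r(paid, overpaid):.3f}")
print(f"permutation p-value = {permutation_p_value(paid, overpaid):.3f}")
```

If the p-value is large, there is no statistical basis for calling the paid amount "significantly correlated" with the overpayment in that sample.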

Section 8.4.4.1.3 states the following: “The stratification scheme should try to ensure that a sampling unit from a particular stratum is more likely to be similar in overpayment amount to others in its stratum than to sampling units in other strata. Although the amount of an overpayment cannot be known prior to review, it may be possible to stratify on an observable variable that is correlated with the overpayment amount of the sampling unit.”

The problem here is that the amount of an overpayment can, in fact, be estimated prior to the review, in contrast to what is stated above. In Cochran, on page 78, the professor gives four ways to obtain advance estimates of the population values needed to design the sample:

- By taking the sample in two steps, the first being a simple random sample of size n₁ from which estimates s₁² or p₁ of S² or P and the required n will be obtained;
- By the results of a pilot survey;
- By previous sampling of the same or a similar population; and
- By guesswork about the structure of the population, assisted by some mathematical results.
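
The first of these options can be sketched in a few lines. The version below uses the standard textbook formula for the sample size needed to estimate a mean within a chosen margin, with a finite population correction. The pilot values, frame size, margin, and the choice of z = 1.645 (a two-sided 90 percent interval under a normal approximation) are all assumptions for illustration, and `required_sample_size` is a hypothetical helper name.

```python
import math
from statistics import stdev


def required_sample_size(pilot_values, frame_size, margin, z=1.645):
    """Sample size needed to estimate the mean within +/- margin.

    Uses the pilot standard deviation as the advance estimate of S,
    then applies the finite population correction.
    """
    s = stdev(pilot_values)
    n0 = (z * s / margin) ** 2          # size needed for an infinite population
    n = n0 / (1 + n0 / frame_size)      # corrected for a frame of finite size
    return math.ceil(n)


# Hypothetical overpayments found in a 12-unit pilot (probe) sample.
pilot = [0.0, 0.0, 12.5, 0.0, 40.0, 0.0, 85.0, 0.0, 0.0, 22.0, 0.0, 150.0]

n_needed = required_sample_size(pilot, frame_size=4000, margin=10.0)
print(f"units needed for a +/- $10 margin on the mean overpayment: {n_needed}")
```

The same pilot data would also show whether the paid amount, the code category, or something else tracks the overpayment, which is exactly the information the stratification decision needs.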

Yet not in one single case has an auditor done this. Granted, it’s not in the PIM, but the PIM is incorrect about this aspect of the audit methodology. It’s simply not true, and shouldn’t we care about that? And since they never take the steps recommended in their own referenced literature, then why wouldn’t an argument against using the paid amount make sense?

So, it would seem appropriate, then, to stratify based on the code category. E&M codes have their own specific guidelines, while surgical codes are coded more on the notes and descriptions. If you really wanted to correlate the overpaid amount to some variable, then the procedure or service would be the best candidate. Over the past 10 years, I have probably worked on more than 300 extrapolated audits, and not a single one used any other variable than the paid amount for stratification.

What does the PIM say about this? Back to 8.4.11.1: “Stratifying on a variable that is a reasonable surrogate for an overpayment can do no harm, and may greatly improve the precision of the estimated overpayment over simple random sampling.” This could, in fact, mean using the line items or code group, but in every single case for which I have suggested this (and even shown it resulted in better precision, meaning a narrower interval), I was rebuffed at the first two levels of appeal. Why? Because the PIM is simply inadequate, and again, in every single case in which I have been the statistical expert, there has never been any explanation as to the logic behind the stratification break points. Even when we have requested this, the answer is that the PIM doesn’t require them to tell us. It’s a joke, and an embarrassment to the general statistical community.

In many of my cases, stratification resulted in higher variability and worse precision, all because the stratification was poorly designed and poorly executed. In more than one of my cases, the average overpaid amount per unit, when multiplied by the number of units in the population, produced a total extrapolated overpayment that was more than the total amount actually paid.

Check that out. The auditor was basically saying that the amount you were overpaid is more than the amount you were paid. Someone help me make sense of that, because to get around this, the auditor uses the lower boundary of a confidence interval and somehow that makes up for lousy statistical methods and inaccurate and incorrect results.
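
Here is a rough sketch of that arithmetic for an unstratified sample, so you can see how the point estimate, the lower bound, and the sanity check against total paid relate to each other. The numbers are invented. The default z = 1.2816 assumes a one-sided 90 percent interval under a normal approximation, which is my assumption for illustration (an actual audit would typically use a t value and its own documented confidence level), and `extrapolate` is a hypothetical helper.

```python
import math
from statistics import mean, stdev


def extrapolate(sample_overpayments, frame_size, frame_total_paid, z=1.2816):
    """Extrapolate a mean overpayment to the frame and flag impossible totals."""
    n = len(sample_overpayments)
    avg = mean(sample_overpayments)
    # Standard error of the estimated total, with finite population correction.
    se_total = (frame_size * stdev(sample_overpayments) / math.sqrt(n)
                * math.sqrt(1 - n / frame_size))
    point = frame_size * avg
    half_width = z * se_total
    return {
        "point_estimate": point,
        "lower_bound": point - half_width,
        "relative_precision": half_width / point if point else float("inf"),
        "exceeds_total_paid": point > frame_total_paid,
    }


# Hypothetical sample: 24 clean units and 6 with overpayments, one very large.
sample_overpaid = [0.0] * 24 + [50.0, 75.0, 120.0, 155.0, 600.0, 2000.0]

result = extrapolate(sample_overpaid, frame_size=5000, frame_total_paid=450_000.0)
for name, value in result.items():
    print(f"{name}: {value}")
```

A point estimate that exceeds total paid, or a relative precision that is a large fraction of the estimate, is a sign that the design is not supporting the conclusion, no matter which bound is ultimately demanded.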

Sometimes, the problem is that the auditor doesn’t stratify when they should. Again, section 8.4.11.1 states: “While it is a good idea to stratify whenever there is a reasonable basis for grouping the sampling units, failure to stratify does not invalidate the sample, nor does it bias the results.”

And again, what a crock. Failure to stratify when stratification is called for, when it will result in a lower variability and better precision, absolutely invalidates the sample and biases the results. I actually have a hard time believing that this is allowed in a section on statistical guidelines, but hey, even though it violates the spirit and letter of statistical standards, I guess it is good enough for government work. I don’t understand how anyone would agree that, when you are about to recoup millions of dollars from a healthcare provider, mediocrity is an acceptable standard.

There are lots of other problems with regard to the PIM and stratification. For example, while the PIM kind of addresses outliers by suggesting the use of a certainty stratum, once again, the wording is so weak and pusillanimous as to render it worthless as a guideline. As with most of the sections, it basically says: “hey, here’s a good way to do this, but don’t worry, you don’t have to pick the best way or even a good way, just any way that works for you.” There are also problems with how the sample is distributed across strata, either proportionately or disproportionately. The PIM states: “Typically, a proportionately stratified design with a given total sample size will yield an estimate that is more precise than a simple random sample of the same size without stratifying.”

But even though it is a better method and yields a better precision, it’s up to the auditor and can rarely be challenged successfully. Again, in Cochran, on pages 122 and 123, he discusses the benefits of proportional selection and says that unless the cost of sampling differs greatly from stratum to stratum, proportional distribution is always the best method. And we know that it doesn’t cost any more to audit a claim in stratum 1 than it does in stratum n.
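
Proportional allocation itself is nothing more than giving each stratum the same share of the sample that it holds in the frame. A minimal sketch follows; the stratum names and sizes are made up, `proportional_allocation` is a hypothetical helper, and the largest-remainder rounding is simply one reasonable way to make the pieces add up to the total.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    frame_size = sum(stratum_sizes.values())
    raw = {name: total_sample * size / frame_size
           for name, size in stratum_sizes.items()}
    allocation = {name: int(share) for name, share in raw.items()}  # round down
    # Hand out the units lost to rounding, largest fractional part first.
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical frame with three strata.
sizes = {"stratum_1": 2600, "stratum_2": 1050, "stratum_3": 350}
print(proportional_allocation(sizes, total_sample=90))
```

A disproportionate design that departs from these shares should come with a stated reason, such as a documented difference in variability or review cost between strata.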

So we now come to the end of this series, and I can only hope that it has motivated the right people to take the right steps to make the right changes happen. My friend Henry P. Shaw used to say: “People change when the pain of where they are becomes greater than the fear of what may come.” I can tell you that I know a lot of providers who are in a lot of pain due to a poorly designed set of standards (if you can even call them that) that are used to recoup hundreds of millions (if not billions) of dollars every year from honest and hardworking healthcare providers.

I am all for audits, and truth be told, as a statistician, I am all for extrapolation. I am a big fan of using inferential statistics to estimate overpayment – but if and only if the statistical methods for sampling and extrapolation are as good as they can be.

Yes, I am talking about best practices. Yes, I am talking about paying attention to and abiding by the standards of statistical practice. Yes, I am talking about a complete rewrite of the PIM to ensure that not only are extrapolations done properly, but that it gives providers a fair shake at defending themselves when best practices are not employed.

Maybe it’s just me, but it seems like the government hates healthcare providers and does whatever it can to make their lives miserable. I don’t think I have ever seen such animus towards any particular group, and I don’t understand why there isn’t more of an outcry. I guarantee you that if this is how attorneys were treated, things would change overnight. In fact, I remember seeing this letter about cancer that said something like this: “If cancer of the penis in men were as prevalent as cancer of the breast is in women, we would have found a cure a long time ago.”

I’m not sure how true that really is, but hey, it makes a lot of sense to me.

The bottom line? All I can say is that it’s time for change.

And that’s the world according to Frank.

**Program Note:**

Listen to Frank Cohen report on this subject on Monitor Monday, June 18, 10-10:30 a.m. ET.