*Projecting results of an audit to a larger population of providers can be a serious step.*

Many of the organizations I work with feel a pressing need to use the results of individual provider audits to infer an overall pass/fail rate, either for everything that provider does or for the organization as a whole.

I get the reason for this: leadership wants a general idea of what their compliance risk and exposure look like, and they often want to know which providers are contributing most to that risk. The problem with inferring those audit results to a larger population has to do with both the sampling methods and the sample sizes for those audits. And much of that is driven by the model used to construct the audit.

For example, some organizations are still caught in the random probe audit cycle, something that should have died years ago. The problem with a probe audit is that, unless it is targeted to a specific code, it misses the overwhelming majority of risk events. Take an internal medicine provider. If you look at just the Medicare submission database, IM docs billed 140 unique procedures/services (based on the code), of which 92 make up the top 80 percent.

So let’s take a look at how this might break down. Suppose an organization using a legacy model, such as a probe audit, has a mandate to audit 10 encounters per provider, per year. If it’s not a focused audit (in essence, a general random probe audit), the maximum number of codes you would be able to review would be 10, or 7 percent of all the unique codes the provider reported. That means that you just ignored 93 percent of the compliance risk events. And by the way, the chance of getting 10 unique codes in a random pull is something like 1 in 1.2 x 10^20, or 12 followed by 19 zeros. You probably have a better chance of getting eaten by a shark while being struck by lightning than getting 10 unique codes. Even then, you would end up with only one encounter per code, and who believes that it is OK to extrapolate from one encounter?

How about “nobody?”
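The coverage arithmetic behind the probe audit example can be sketched in a few lines. This is only an illustration using the figures quoted above (140 unique codes, 10 encounters per year); the function name `probe_coverage` is made up for this example.

```python
def probe_coverage(unique_codes, encounters_audited):
    """Best-case share of unique codes a random probe audit can touch."""
    # Best case: every sampled encounter happens to carry a different code.
    max_codes = min(encounters_audited, unique_codes)
    coverage = max_codes / unique_codes
    # Even in that best case, this is how thin the evidence is per code.
    encounters_per_code = encounters_audited / max_codes
    return max_codes, coverage, encounters_per_code


max_codes, coverage, per_code = probe_coverage(unique_codes=140, encounters_audited=10)
print(f"Codes reviewed at most: {max_codes}")
print(f"Share of unique codes reviewed: {coverage:.0%}")
print(f"Share never reviewed: {1 - coverage:.0%}")
print(f"Encounters per code, best case: {per_code:.0f}")
```

Any real pull will usually do worse than this best case, because popular codes tend to show up more than once in a random sample.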

Maybe it’s a focused audit, so you audit 10 encounters for that provider for a specific code, say, 99233. The results of the audit reveal that three of those 10 encounters failed the audit for one reason or another. It’s easy to say that this provider’s error rate is 30 percent, and if you were talking about the share of audited encounters that failed, you would be correct, because we are not inferring the results (yet), but rather describing them. Where we go wrong is when we infer that 30 percent of all of the provider’s 99233 codes are coded wrong. Why can’t we do that? Well, it comes down to sampling error, which occurs with every single sample, as recognized by the standards of statistical practice. While three out of 10 is 30 percent from a descriptive perspective, three out of 10 is actually somewhere between 6.7 percent and 65.2 percent when inferring the results to a larger population. Many would know this process as extrapolation, and sampling error is an inherent problem with how extrapolations are conducted.
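For readers who want to check that range themselves, here is a minimal sketch. The article does not say which method produced the 6.7 to 65.2 percent figures; they appear consistent with the exact (Clopper-Pearson) binomial interval at 95 percent confidence, so that is what this sketch assumes. The function names are illustrative, and only the Python standard library is used.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(errors, n, confidence=0.95):
    """Exact two-sided interval for an error rate (assumed method, see lead-in)."""
    alpha = 1 - confidence

    def solve(k, target):
        # binom_cdf falls as p rises, so bisect on p until it hits the target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if errors == 0 else solve(errors - 1, 1 - alpha / 2)
    upper = 1.0 if errors == n else solve(errors, alpha / 2)
    return lower, upper


low, high = clopper_pearson(errors=3, n=10)
print(f"Point estimate: {3 / 10:.0%}")
print(f"95% interval: {low:.1%} to {high:.1%}")
```

The same function can be called with other counts to see how the interval changes as the sample grows.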

So, how many encounters would I have to audit in order to be able to extrapolate the results to the population of codes, or claims, or beneficiaries, or whatever unit is being audited? That depends on your tolerance for error. I am a fan of the 95 percent confidence interval. This is a way of measuring the uncertainty around our point estimate (30 percent is the point estimate, and the range on either side of it reflects the margin of error), and it is a commonly accepted metric. Here’s what it means: if I were to pull 100 samples of 10 of the 99233 codes and calculate an interval from each one, about 95 of those 100 intervals (or 95 percent) would contain the actual error rate. For our sample, that interval runs from 6.7 to 65.2 percent. That’s factual, but is it useful? That’s for you to decide, but in my world, a range that large (or sampling error that large) is pretty much useless.

So, back to the question of how many encounters are needed. Well, let’s say we pulled 100 encounters, and 30 of those were in error. In describing the error rate for that sample, we can still say that the point estimate is 30 percent, and we would be right on the mark. Inferentially, however, the 95 percent confidence interval is now 21.2 percent to 39.9 percent, and if you are OK with roughly plus-or-minus 10 percentage points, then you have your number. If not, then you will have to audit more. How many more depends on a few assumptions, and you can use a sample size calculator (many can be found online) to calculate the sample size you need for your purposes.
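To give a feel for what such a calculator does, here is a rough sketch. It assumes the common normal-approximation formula n = z² · p · (1 − p) / e², where p is the error rate you expect and e is the margin of error you can tolerate. It also assumes a large population (no finite population correction). Real calculators may use different formulas, and the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size(expected_rate, margin, confidence=0.95):
    """Approximate encounters needed for a given margin of error.

    Assumption: normal approximation to the binomial, large population.
    """
    # Two-sided critical value, about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * expected_rate * (1 - expected_rate) / (margin ** 2)
    return math.ceil(n)


for margin in (0.10, 0.05, 0.03):
    n = required_sample_size(expected_rate=0.30, margin=margin)
    print(f"Margin of +/- {margin:.0%}: about {n} encounters")
```

Notice that halving the margin of error roughly quadruples the number of encounters, which is why tight estimates get expensive quickly.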

Let’s say that you audited 10 of the 99233 codes for each of 100 providers. Now you have a sample size of 1,000. Let’s say again that 300 of them were found to have been coded in error. Now your range is from 27.1 percent to 32.9 percent, and I wouldn’t have any problem saying that there is a general error rate among all providers for the 99233 code of 30 percent, but I would not be comfortable predicting that for any individual provider, since each provider’s sample size is simply too small. Heck, you could also then analyze the averages of the results from those 100 samples, and having satisfied the Central Limit Theorem, a foundational result behind inferential statistics, you could be quite accurate in your projection. And you would likely have a normal distribution, which can be used for a lot more fun calculations. But more on that in another article.
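A quick simulation can make the contrast between the pooled estimate and the individual provider estimates concrete. This is a sketch on made-up data: it assumes every provider has the same true error rate of 30 percent, which is a simplification for illustration and not a claim about any real audit. The function name and the seed are arbitrary.

```python
import random
from statistics import mean


def simulate_audit(n_providers=100, encounters_each=10,
                   true_error_rate=0.30, seed=2024):
    """Return each provider's observed error rate from a small simulated audit."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    provider_rates = []
    for _ in range(n_providers):
        errors = sum(rng.random() < true_error_rate
                     for _ in range(encounters_each))
        provider_rates.append(errors / encounters_each)
    return provider_rates


rates = simulate_audit()
# Every provider has the same sample size, so the mean of the provider
# rates equals the pooled error rate across all 1,000 encounters.
print(f"Pooled error rate: {mean(rates):.1%}")
print(f"Lowest provider rate: {min(rates):.0%}")
print(f"Highest provider rate: {max(rates):.0%}")
```

Even though every simulated provider has an identical underlying error rate, the individual results scatter widely around it, while the pooled figure lands close to the true value.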

I know that this can get a bit tedious, but projecting the results of an audit to a larger population of codes or providers can be a pretty serious step. For one, you may infer that a given provider has a bigger problem than it really does. Or you may push an organization into a more expensive (yet statistically valid) review of some codes or providers when it wasn’t necessary in the first place.

I am all for using statistics to estimate error rates, but I am not for extrapolating those error rates when doing so is not justified. The risks almost always outweigh the benefits. One of my favorite quotes is from George Box, a famous statistician, who said that “all models are wrong, but some are useful.”

And that’s the world according to Frank.

**Program Note:**

Listen to Frank Cohen report this story live during Monitor Monday, May 20, 10-10:30 a.m. ET.