- This is the second installment in our Quality Measurement 101 series, where we take an in-depth look at key steps and tips for effectively leveraging healthcare data to build a robust quality improvement program.
- Quality improvement administrators use statistical significance to determine whether differences in quality are likely to have occurred by chance.
- Utilizing PINC AI™ Quality Enterprise enables healthcare staff to have a cohesive understanding of statistical significance and helps to ensure reliable research findings.
Healthcare data is accumulating at an astounding rate. For healthcare organizations, this data represents a valuable opportunity. Data has the power to transform the patient-clinician relationship by improving care, reducing costs and enabling healthcare organizations to predict and avoid medical issues. However, using data to fairly assess quality performance often requires an understanding of statistical significance. Statistical significance is a term to describe how certain one is that deviations in quality are meaningful, beyond normal fluctuations in the data.
Quality administrators use statistical significance to determine if observed outcomes are meaningfully different from expected outcomes derived from risk-adjustment models. Risk adjustment provides a tailored quality benchmark based on the unique clinical and demographic characteristics of the evaluated patient population. For measures like mortality, readmissions, complications and length of stay, healthcare data analysts can compare observed against expected outcomes to identify meaningful deviations in performance. While risk adjustment is an essential component of a robust quality improvement program, it is important that the risk-adjusted benchmark be used in conjunction with a measure of statistical significance.
Without a measure of statistical significance, the point at which deviations in performance are considered meaningful would be driven by opinion and would vary by the individual interpreting the results.
For example, would a difference between an average observed and expected length of stay of 0.3 days for a sample of 25 Acute Myocardial Infarction (AMI) patients in the month of May be significant?
- If we increased our timeframe to an entire year, resulting in a sample size of 300, would 0.3 days be any more or less significant?
- What if the difference between observed and expected performance continued to increase to 0.9, 1.3 or 2.4 days?
As one might suspect, we quickly enter a gray area where it is difficult to know the threshold at which we should become concerned about performance. Statistical significance is a formalized, industry-accepted practice that allows analysts to quantify these differences in terms of confidence levels.
Measuring Statistical Significance with PINC AI™ Quality Enterprise
Many healthcare organizations are turning to PINC AI™ Quality Enterprise to better measure statistical significance. PINC AI™ Quality Enterprise is a suite of clinical performance improvement capabilities designed to help health systems improve the quality of patient care, manage healthcare reform program changes, support clinician measurement and control costs – all with actionable, trusted, timely data. Through the analytics capability of PINC AI™ Quality Enterprise, including QualityAdvisor™, members build better comparators specific to their population and benchmark to top performing peers to understand improvement paths.
To determine statistical significance, it’s important to use hypothesis testing to determine whether the proportion (or average) of observed outcomes is different from what was expected. Below, we break down the steps for calculating statistical significance.
1. Setting Up the Hypothesis
We must first define a research question: Is there a difference between our observed and expected outcomes? In hypothesis testing, this research question is framed in two parts: the null and alternate hypothesis. Note that we cannot prove with certainty that the observed and expected outcomes are different, but we can state with some degree of confidence that the values are not the same.
Our null hypothesis will always assume the status quo—no difference exists between the observed and expected performance. The goal is to determine if we have sufficient evidence to reject our null hypothesis, with some level of confidence, and accept our alternate hypothesis (our true research question), that the observed and expected values are different.
2. Determine Confidence Levels
Now that we have framed our null and alternate hypothesis, we must decide the level at which we are willing to reject our null hypothesis. Typically, this threshold is set to a 95 percent confidence level (or a five percent significance level); however, within Premier’s proprietary CareScience risk methodology utilized in the analytics capability of PINC AI™ Quality Enterprise, three confidence levels are provided: 75 percent, 90 percent and 95 percent.
The PINC AI™ Quality Enterprise solution offers statistical significance asterisks on all risk-adjusted outcome reporting to help users identify statistically significant variation in quality. It empowers members to quickly identify where they should focus their performance improvement efforts, alleviating the burden of digging through data looking for areas with opportunity where the variation is not significant. The confidence levels show where the variation in outcomes is not likely due to chance. The 75 percent and 90 percent confidence levels are often used with the CareScience method as early indicators that help identify populations that an organization may want to monitor to ensure there is no increasing systematic variation in care over time.
The Quality Console interactive executive level dashboards within Quality Enterprise display visual indicators to highlight opportunity, and the opportunity is based on statistical significance.
The confidence level determines how much risk we accept of rejecting our null hypothesis by chance. A 95 percent confidence level corresponds to a 5 percent chance that observed versus expected performance is identified as meaningfully different when in fact there is no difference (a false positive). Every hypothesis test is characterized by its acceptable level of error, and it is up to the researcher, or quality administrator, to determine the acceptable level of error for their own organization.
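To make the link between confidence level and acceptable error concrete, the short Python sketch below converts each of the three confidence levels mentioned above into its significance level and the corresponding two-tailed z cutoff. It is an illustration only: the function name two_tailed_critical_z is hypothetical, a standard normal distribution is assumed, and nothing here is taken from the CareScience implementation.

```python
from statistics import NormalDist


def two_tailed_critical_z(confidence: float) -> float:
    """Return the |z| cutoff for a two-tailed test at the given confidence level.

    Hypothetical helper for illustration; assumes a standard normal distribution.
    """
    alpha = 1.0 - confidence  # significance level, i.e. accepted false-positive rate
    # Split alpha evenly across both tails, then look up the upper-tail quantile.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


if __name__ == "__main__":
    for confidence in (0.75, 0.90, 0.95):
        alpha = 1.0 - confidence
        cutoff = two_tailed_critical_z(confidence)
        print(f"confidence={confidence:.0%}  alpha={alpha:.0%}  |z| cutoff={cutoff:.3f}")
```

A lower confidence level yields a smaller cutoff, which is why the 75 percent and 90 percent levels flag variation earlier than the 95 percent level, at the cost of more false positives.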
3. Obtaining the Test Statistic and P-Value
Once the acceptable confidence level is determined, a test is conducted to see if there is sufficient evidence to reject our null hypothesis at the specified confidence level. There are a variety of statistical tests available to answer this question, and the conditions of the data and the nature of the comparison determine the appropriate statistical test. In the case of the CareScience methodology, we utilize a two-tailed paired z-test, meaning a test used to determine the difference between two proportions or means. The two-tailed test is designed to measure positive or negative performance variation between a proportion or average of observed versus expected outcomes. The statistical test conducted by tools within PINC AI™ Quality Enterprise will produce a test statistic that allows us to derive a probability (or p-value), which is the probability, assuming the null hypothesis is true, of obtaining a difference between observed and expected outcomes at least as extreme as the one actually observed.
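As a rough illustration of this step, the Python sketch below computes a two-tailed paired z-test on means from per-patient observed and expected values. It is a textbook version under stated assumptions, not the CareScience calculation: the function name paired_z_test is hypothetical, the standard error is estimated from the sample standard deviation of the paired differences, and the length-of-stay values are invented for demonstration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_z_test(observed, expected):
    """Return (z, p_value) for a two-tailed paired z-test on means.

    Hypothetical helper: treats the per-patient differences as the sample
    and uses a normal approximation for the mean difference.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    diffs = [o - e for o, e in zip(observed, expected)]
    n = len(diffs)
    mean_diff = mean(diffs)
    std_error = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    z = mean_diff / std_error
    # Two-tailed: count deviations in either direction as "extreme".
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Invented length-of-stay values (days) for eight patients, illustration only.
    observed_los = [4.0, 6.5, 5.0, 7.0, 3.5, 6.0, 5.5, 8.0]
    expected_los = [4.2, 5.8, 5.1, 6.1, 3.9, 5.7, 5.0, 7.1]
    z, p = paired_z_test(observed_los, expected_los)
    print(f"z = {z:.3f}, two-tailed p-value = {p:.4f}")
```

With a sample this small the normal approximation is crude; the sketch is meant only to show how the test statistic and p-value relate to each other.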
4. Accepting or Rejecting the Null Hypothesis
To reject the null hypothesis, the p-value produced from our test statistic must be less than our acceptable level of error, also called the significance level (e.g., 1 – 95 percent, or 5 percent, assuming 95 percent confidence). If it is, we can reject our null hypothesis that the observed and expected values are the same and accept our alternate hypothesis that there is a statistically significant difference between observed and expected performance.
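The decision rule above can be sketched in a few lines of Python. The function below takes a p-value and reports the highest of the three confidence levels at which the null hypothesis would be rejected. The name highest_confidence_met and its return convention are assumptions made for illustration; they do not describe how the asterisks in the product are actually computed.

```python
def highest_confidence_met(p_value, levels=(0.95, 0.90, 0.75)):
    """Return the highest confidence level at which the null is rejected, or None.

    Hypothetical helper: the null hypothesis is rejected at a given level
    when the p-value is below that level's significance threshold.
    """
    for confidence in sorted(levels, reverse=True):
        # Round to avoid floating-point noise in values such as 1 - 0.95.
        alpha = round(1.0 - confidence, 10)
        if p_value < alpha:
            return confidence
    return None


if __name__ == "__main__":
    for p in (0.03, 0.08, 0.20, 0.40):
        result = highest_confidence_met(p)
        if result is None:
            print(f"p={p:.2f}: fail to reject the null hypothesis")
        else:
            print(f"p={p:.2f}: reject the null hypothesis at {result:.0%} confidence")
```

Note that failing to reject the null hypothesis does not show that observed and expected performance are equal; it only means the evidence did not meet the chosen threshold.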
Measuring statistical significance is not just for large data sets or populations. Significance is especially important when evaluating smaller populations, as smaller populations will have larger variation and sampling error. The employed statistical test adjusts for both the size of the evaluated population and the dispersion (standard deviation) in the expected values. Therefore, as the size of the evaluated population increases, a given difference between observed and expected values is more likely to be significant.
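The effect of sample size can be illustrated by returning to the earlier length-of-stay question: a 0.3-day difference for 25 patients versus 300 patients. The Python sketch below holds the difference fixed and varies only the sample size. The standard deviation of 1.5 days is an assumed value chosen purely for illustration, and the function name p_value_for_mean_difference is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_value_for_mean_difference(mean_diff, std_dev, n):
    """Return (z, two-tailed p-value) for a mean difference under a z-test.

    Hypothetical helper: std_dev is the assumed standard deviation of the
    per-patient differences, and n is the number of patients.
    """
    std_error = std_dev / sqrt(n)  # shrinks as the sample grows
    z = mean_diff / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    ASSUMED_STD_DEV_DAYS = 1.5  # illustrative assumption, not from real data
    for n in (25, 300):
        z, p = p_value_for_mean_difference(0.3, ASSUMED_STD_DEV_DAYS, n)
        print(f"n={n:>3}: z={z:.2f}, p-value={p:.4f}")
```

Under these assumptions, the 25-patient sample gives a z of 1.0 and a p-value of roughly 0.32, which is not significant even at the 75 percent confidence level, while the same 0.3-day difference across 300 patients produces a p-value well below 0.05.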
In summary, statistical significance is determined through hypothesis testing, a systematic and data-driven method to make decisions. Hypothesis testing requires 1) a clearly articulated research question (a hypothesis statement), 2) an evidence threshold (confidence level), 3) a test statistic to quantify the evidence, and 4) a determination of whether the evidence obtained through our test statistic exceeds our predefined confidence threshold. At this point, a decision can be made to determine if the difference between observed and expected performance is meaningful. Hypothesis testing, as defined by the above steps, is embedded in the PINC AI™ Quality Enterprise tools, thus automating the process for measuring significant deviations in quality.
Although it can be challenging, conducting statistically sound research is worthwhile. Faulty data and analyses may result in poor decisions. Utilizing PINC AI™ Quality Enterprise helps to ensure your research yields reliable findings.