## Abstract

**BACKGROUND:** Quantification cycle (Cq) and amplification efficiency (AE) are parameters mathematically extracted from raw data to characterize quantitative PCR (qPCR) reactions and quantify the copy number in a sample. Little attention has been paid to the effects of preprocessing and the use of smoothing or filtering approaches to compensate for noisy data. Existing algorithms are largely taken for granted, and it is unclear which of the various methods is most informative. We investigated the effect of smoothing and filtering algorithms on amplification curve data.

**METHODS:** We obtained published high-replicate qPCR data sets from standard block thermocyclers and other cycler platforms and statistically evaluated the impact of smoothing on Cq and AE.

**RESULTS:** Our results indicate that selected smoothing algorithms affect estimates of Cq and AE considerably. The commonly used moving average filter performed worst in all qPCR scenarios. The Savitzky–Golay smoother, cubic splines, and Whittaker smoother resulted overall in the least bias in our setting and exhibited low sensitivity to differences in qPCR AE, whereas other smoothers, such as running mean, introduced an AE-dependent bias.

**CONCLUSIONS:** The selection of a smoothing algorithm is an important step in developing data analysis pipelines for real-time PCR experiments. We offer guidelines for selection of an appropriate smoothing algorithm in diagnostic qPCR applications. The findings of our study were implemented in the R packages chipPCR and qpcR as a basis for the implementation of an analytical strategy.

Quantitative PCR (qPCR)^{6} remains the method of choice for precise DNA quantification because of its high sensitivity, robustness, and versatility (1). Isothermal amplification strategies such as recombinase polymerase amplification have been introduced as alternatives (2–4). Diagnostic applications include microRNA profiling, gene expression analysis, pathogen quantification, and cancer diagnostics (2, 3). There is an ongoing development of novel devices on the basis of capillary systems, microfluidics, point-of-care testing, microbeads, and related technologies (6, 7–12).

All these methodologies monitor DNA amplification in real time. The quantification cycle (Cq) is calculated at defined location indices of the amplification curves to determine the sample quantity. Cq is localized on the amplification trajectory in regions where the signal significantly deviates from the background noise. Several Cq methods (13), such as the cycle threshold method (Ct method), the second derivative maximum (SDM) method (14), and the Cy0 method (15), have been introduced. The estimation of Cq values from raw data with high noise is challenging, because the Cq depends on the slope of the amplification curve. It is important to guarantee similar amplification efficiencies (AEs), especially during the exponential phase, for a proper comparison of amplification curves (16). Selected Cq methods such as Cy0 perform a correction on the basis of an estimated AE (14–19). It is accepted that most of the existing models and Cq algorithms affect the Cq substantially (13, 20).
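As a rough illustration of how two of these Cq definitions differ, the following Python sketch computes a threshold-based Ct and a discrete second derivative maximum on a noise-free logistic curve. The curve parameters, the function names, and the use of Python rather than R are assumptions made for illustration; this is not the qpcR implementation, which derives the SDM from a fitted model rather than from finite differences.

```python
import math

def logistic(cycle, plateau=1000.0, c50=20.0, slope=1.5):
    """Hypothetical noise-free amplification curve (baseline already removed)."""
    return plateau / (1.0 + math.exp(-(cycle - c50) / slope))

def ct_threshold(fluo, threshold):
    """Fractional cycle at which fluo first crosses the threshold.

    fluo[0] is cycle 1; linear interpolation between neighbouring cycles.
    """
    for i in range(1, len(fluo)):
        lo, hi = fluo[i - 1], fluo[i]
        if lo < threshold <= hi:
            # fluo[i - 1] belongs to cycle i
            return i + (threshold - lo) / (hi - lo)
    return None

def sdm_cycle(fluo):
    """Integer cycle with the largest discrete second difference."""
    d2 = [fluo[i + 1] - 2.0 * fluo[i] + fluo[i - 1]
          for i in range(1, len(fluo) - 1)]
    best = max(range(len(d2)), key=d2.__getitem__)
    return best + 2  # d2[0] belongs to list index 1, i.e. cycle 2

fluo = [logistic(c) for c in range(1, 41)]
ct = ct_threshold(fluo, threshold=250.0)
sdm = sdm_cycle(fluo)
print(f"Ct at Fq = 250: {ct:.2f}; SDM (integer cycle): {sdm}")
```

Both estimates depend on the local shape of the curve around the onset of the exponential phase, which is why any preprocessing that changes this shape can move the Cq.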

The acquisition of qPCR data is accompanied by noise and bias, and therefore preprocessing of the raw fluorescence data may be indicated. Noise is caused by random effects, the sensor system (e.g., quantization), hardware (e.g., point-of-care devices), and technical variance of the sample or weak dyes. It is present in all PCR phases. In the exponential phase, noise may lead to suboptimal curve fitting and false estimation by regression models, resulting in erroneous estimation of Cq values, AEs, and signal-to-noise ratios (7, 21). Hence, an efficient pipeline for data preprocessing (e.g., smoothing, background subtraction, normalization, and presentation) is of vital importance. In particular, for point-of-care testing, an accurate and reliable data analysis strategy is needed because these devices are operated by nonspecialized personnel.

Smoothing and filtering are typical approaches in biological and analytical sciences. We argue that smoothing and filtering are neglected topics even though these methods may substantially contribute to biased estimates of the Cq and AE. To date, no study has investigated the impact of filters and smoothers in the field of amplification curves. To reduce noise, several methods on the basis of local smoothing or filtering have been proposed in the literature, including robust locally weighted regression (lowess) (22), (weighted) running mean (moving average) (9, 23), cubic splines (24), Kalman smoother (25), Friedman super smoother (26), Savitzky–Golay filter (27), and Whittaker smoother (28, 29). Multiparameter nonlinear curve fitting is an alternative to smoothing or filtering (30). However, the performance of nonlinear fitting is also compromised in the presence of high noise. All operations may affect the estimation of the background, the Cq value, and the slope during the exponential phase. Depending on the method, a bias in Cq values and AEs (which are often deduced by AE = *F*_{Cq}/*F*_{Cq−1} (14)) is potentially introduced. All this may bias the quantification by unrealistic estimations of the underlying parameters.

The aim of this work is to provide a systematic study on how to deal with amplification curve data from commercial and experimental systems in the presence of noise. We share the philosophy of the Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) guidelines to increase experimental transparency for better experimental practice and more reliable interpretation of qPCR results (1, 31, 32). This is the first-of-its-kind study in its comparison of various smoothers and filters widely used in DNA quantification methods. We critically question the habit of implementing smoothing and filtering in a “black box” fashion. We believe that our work is important for the qPCR community because our open-source implementations enable the investigation of smoothing effects.

## Materials and Methods

### SMOOTHING METHODS AND PARAMETER SETTINGS

We implemented 8 different smoothing methods (see Supplemental Table 1, which accompanies the online version of this article at http://www.clinchem.org/content/vol61/issue2) within the qpcR package version 1.4-0 (33) for the R statistical programming environment. All smoothers, except the Kalman filter, have user-definable parameters. The smoothers used, with their parameters in italics, are lowess (weighted quadratic least-squares), *f* (span); Friedman super smoother (cross-validated running lines), *span* (span); cubic spline, *spar* (smoothing parameter); Savitzky–Golay (local polynomial regression), *p*, *f* (polynomial order, frame size); Kalman (state space model), *none*; running mean, *wsize* (window size); Whittaker (recursive penalized B-splines), λ (spline parameter); and exponential moving average (EMA) (weighted recursive exponential smoothing), α (smoothing parameter). All smoothers can be invoked by the function *modlist* (see online Supplemental R Script). The source code can be inspected in the original R functions given in online Supplemental Table 1, “Implementation.” All parameter values were chosen to produce a visible effect without completely destroying the sigmoidal structure of qPCR curves. Parameters can be set as a list item in calls to the *modlist* function, e.g., modlist(data, model = l5, smooth = "ema", smoothPAR = list(α = 0.8)). (For details, additional preprocessor functions, and data sets, see the qpcR and chipPCR package documentation (32–34) and online Supplemental Table 2.) Scripts and routines were developed with the RKWard integrated development environment (35).
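To make two of the simpler smoothers concrete, the sketch below gives minimal Python versions of a running mean and an EMA. These are illustrative stand-ins, not the R code used in qpcR: the function names, the centered window, the shrinking window at the curve edges, and the initialization of the EMA with the first data point are all assumptions.

```python
def running_mean(y, wsize=3):
    """Centered running mean with window size wsize (odd).

    Assumption: near the edges the window is truncated to the available points.
    """
    half = wsize // 2
    out = []
    for i in range(len(y)):
        window = y[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def ema(y, alpha=0.8):
    """Recursive exponential smoothing: s[t] = alpha * y[t] + (1 - alpha) * s[t-1].

    Assumption: the recursion starts at the first observation, s[0] = y[0].
    """
    out = [y[0]]
    for value in y[1:]:
        out.append(alpha * value + (1.0 - alpha) * out[-1])
    return out

raw = [1.0, 2.0, 3.0, 4.0, 5.0]
print(running_mean(raw, wsize=3))
print(ema([0.0, 10.0, 10.0], alpha=0.8))
```

With α close to 1 the EMA follows the raw data closely; smaller values weight the history more strongly and make the smoothed curve lag behind the raw signal.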

### REPLICATE AND DILUTION DATA SETS

The data sets were generated in a block cycler (CFX384 instrument, Bio-Rad) for an evaluation study of qPCR analysis methods (13) (available at http://qPCRDataMethods.hfrc.nl and from qpcR package). For the analysis of smoothing effects, we used the reps384 data set (380 replicates). Single-curve analysis used reaction 1 of the reps384 data set; the Kalman and Savitzky–Golay smoothers used the first 50 reactions of the reps384 data set. Cq values and AE were analyzed on all reactions.

For linear calibration analysis, we used the dil4reps94 set (4 dilution steps, 94 replicates per dilution). Cq values from presmoothed curves were subjected to linear regression against log_{10}(copy number), from which copy numbers at a quantification fluorescence level (Fq) = 250 were estimated by the inverse function.
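The calibration step can be sketched as an ordinary least-squares fit of Cq against log_{10}(copy number), followed by inversion of the fitted line. The dilution series and Cq values below are hypothetical and are not taken from the dil4reps94 data set; the helper names are likewise assumptions.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def copies_from_cq(cq, slope, intercept):
    """Inverse of the calibration line: Cq -> estimated copy number."""
    return 10.0 ** ((cq - intercept) / slope)

# Hypothetical 4-step decadic dilution with one mean Cq per step
copies = [1e5, 1e4, 1e3, 1e2]
cq = [18.1, 21.4, 24.8, 28.0]

slope, intercept = fit_line([math.log10(c) for c in copies], cq)
estimate = copies_from_cq(26.0, slope, intercept)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, "
      f"copies at Cq 26 = {estimate:.0f}")
```

Because the inverse function is exponential in Cq, a small systematic Cq shift introduced by a smoother translates into a multiplicative error in the estimated copy number.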

The raw data values of these 2 qPCR data sets were baselined with a linear regression model (cycles 2–9). The data sets Eff625, Eff750, Eff875, and Eff1000 (chipPCR package) with defined AEs between 62.5% and 100% were used to determine how strongly each smoother depends on the AE.
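A linear baseline correction of this kind can be sketched as follows: a straight line is fitted to the fluorescence of cycles 2–9 and then subtracted from every cycle. The function name and the synthetic curve are assumptions; the sketch is not the baselining code of qpcR.

```python
def baseline_linear(fluo, first=2, last=9):
    """Subtract a line fitted to cycles first..last (1-based) from all cycles."""
    cycles = list(range(first, last + 1))
    values = [fluo[c - 1] for c in cycles]
    n = len(cycles)
    mx, my = sum(cycles) / n, sum(values) / n
    sxx = sum((c - mx) ** 2 for c in cycles)
    sxy = sum((c - mx) * (v - my) for c, v in zip(cycles, values))
    slope = sxy / sxx
    intercept = my - slope * mx
    # list index i corresponds to cycle i + 1
    return [f - (intercept + slope * (i + 1)) for i, f in enumerate(fluo)]

# Synthetic raw curve: drifting background plus a signal that starts after cycle 12
raw = [50.0 + 0.5 * c + (0.0 if c <= 12 else 10.0 * (c - 12) ** 2)
       for c in range(1, 41)]
corrected = baseline_linear(raw)
print([round(v, 3) for v in corrected[:14]])
```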

### qPCR PARAMETERS UNDER INVESTIGATION

We investigated 6 qPCR parameters: (*a*) Cq value derived from the SDM of a 5-parameter sigmoid model (Cq_{SDM}) (13, 30, 33); (*b*) Cq value derived at a fluorescence threshold of Fq = 250 (Ct); (*c*) Cq value as defined by the Cy0 method (15); (*d*) efficiency at SDM by calculating *F*_{Cq}/*F*_{Cq−1} (33); (*e*) efficiency of the sliding-window method (sliwin) (17) (implemented in the *sliwin* function); and (*f*) the takeoff point (TOP) where the fluorescence rises significantly above background (14) (implemented in the *takeoff* function).
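Parameter (*d*) is a simple ratio of two model values one cycle apart. The sketch below evaluates it for a pure exponential, where the ratio returns the per-cycle efficiency exactly, and for a logistic curve at its analytic SDM. Both model functions and their parameter values are hypothetical, and the fluorescence is assumed to be baseline-corrected.

```python
import math

def efficiency_at(model, cq):
    """Efficiency as the ratio F(Cq) / F(Cq - 1) of a fitted model function."""
    return model(cq) / model(cq - 1.0)

def exponential(cycle, f0=1e-3, eff=1.875):
    """Ideal exponential amplification with per-cycle efficiency eff."""
    return f0 * eff ** cycle

def logistic(cycle, plateau=1000.0, c50=20.0, slope=1.5):
    """Hypothetical sigmoidal amplification curve."""
    return plateau / (1.0 + math.exp(-(cycle - c50) / slope))

eff_exp = efficiency_at(exponential, 18.0)

# For this logistic form the second derivative peaks at c50 - slope * ln(2 + sqrt(3))
sdm = 20.0 - 1.5 * math.log(2.0 + math.sqrt(3.0))
eff_sdm = efficiency_at(logistic, sdm)
print(f"exponential: {eff_exp:.3f}; logistic at SDM ({sdm:.2f}): {eff_sdm:.3f}")
```

Since the ratio is taken on the steep part of the curve, any smoother that flattens or shifts this region changes the efficiency estimate directly.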

### MONTE CARLO SIMULATIONS

For the analysis of smoothing effects on qPCR data with defined noise structure, we created 1000 qPCR curves with heteroscedastic noise for all replicates at each cycle with a coefficient of variation (*ĉ*_{v} = *s*/*x̄*) of 2% or 5% through the following pipeline: (*a*) reaction 1 of the reps384 data set was fitted with a 5-parameter sigmoid model; (*b*) either the fitted values *ŷ*_{i} = *f*(*x*_{i}, β), where β = fit parameters, or a simulated curve with defined Cq = 18 at Fq = 250 was used as simulation template; (*c*) to each fitted value, 2% or 5% random heteroscedastic gaussian noise as a function of the value's magnitude was added: *z*_{i} = *ŷ*_{i} + *N*(0, 0.02 · *ŷ*_{i}), and this was repeated 1000 times; (*d*) each of the simulated curves in (*c*) was smoothed by the different methods, refitted as in (*b*), and analyzed with respect to the above-described essential qPCR parameters.
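Step (*c*) of this pipeline can be sketched in a few lines of Python. The template here is a hypothetical logistic curve rather than the fitted reps384 reaction, and the function names, seed, and cycle range are assumptions; the noise term itself is interpreted as a gaussian with standard deviation equal to 2% of each template value.

```python
import math
import random
import statistics

def logistic(cycle, plateau=1000.0, c50=20.0, slope=1.5):
    """Hypothetical template curve standing in for the fitted sigmoid model."""
    return plateau / (1.0 + math.exp(-(cycle - c50) / slope))

def simulate_curves(template, cv=0.02, n_curves=1000, seed=1):
    """Add gaussian noise with standard deviation cv * value to each template point."""
    rng = random.Random(seed)
    return [[y + rng.gauss(0.0, cv * abs(y)) for y in template]
            for _ in range(n_curves)]

template = [logistic(c) for c in range(1, 41)]
curves = simulate_curves(template, cv=0.02)

# Empirical coefficient of variation (s / mean) across replicates at cycle 25
at_cycle_25 = [curve[24] for curve in curves]
cv_hat = statistics.stdev(at_cycle_25) / statistics.mean(at_cycle_25)
print(f"empirical CV at cycle 25: {cv_hat:.4f}")
```

The empirical coefficient of variation at any cycle should be close to the nominal value, which is what makes the noise structure of the simulated curves known in all phases.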

The data have a user-controlled noise structure, aiming for unbiased data with known variance structure in all qPCR phases. All analysis steps and figures can be reproduced with the R script found in online Supplemental Data. We used the VideoScan heating/cooling unit and capillary convective PCR (ccPCR) cyclers (see online Supplemental Figs. 1 and 2) as low-replicate technologies for isothermal amplification [helicase-dependent amplification (HDA)] (for details refer to Chou et al. (10), Rödiger et al. (12), and online Supplemental Data).

## Results

For the quantification of smoothing method influence, we based our analysis on 2 paradigms: (*a*) by measuring the effects on real-world qPCR data consisting of a high number of replicates (the reps384 and dil4reps94 data sets from Ruijter et al. (13)) and (*b*) by evaluating Monte Carlo–simulated data. The real-world data sets were generated with different hardware platforms. Parameters deduced from the analysis were Cq at fluorescence = 250 (Ct), Cq at second derivative maximum of a sigmoidal model (SDM), Cy0, the efficiency at SDM, the efficiency of sliwin (17), and the TOP method as originally described in Tichopad et al. (14). Inspection of the reps384 data set revealed a wide variation of plateau fluorescence values (Fig. 1), as did the dil4reps94 data set (see online Supplemental Fig. 4). The noise was significant in the first 12 cycles of the baseline region. A spread of Cq values at Fq = 250 (σ = 0.115) was observable even after baselining the data. As there have not been any peer-reviewed data previously available with such a high number of technical replicates, it is interesting to note that a substantial variation in baseline, exponential, and plateau regions is evident even in a pure technical replicate setup.

Using default settings (see online Supplemental Data) for the 8 different smoothers (see online Supplemental Table 1), we analyzed the resulting values after smoothing curve 1 of the reps384 data set (see online Supplemental Fig. 5). We focused on the baseline (see online Supplemental Fig. 5B) and the exponential (see online Supplemental Fig. 5C) regions to obtain an overview of the smoothing performance. All smoothers represent the data adequately only in specific regions of the curve. The cubic spline, Savitzky–Golay, and Whittaker smoothers accurately captured important patterns of data. A massive fluctuation (polygonal chain) was introduced by the Kalman smoother in the baseline region (see online Supplemental Data, section 3.2, and online Supplemental Fig. 5B).

An important goal of smoothing is to maintain the original curve trajectory as exactly as possible. Any shifting within the exponential region would introduce significant bias in Cq values and AE. It is vital that the smoothing parameters can be set to ensure smoothing of noise while concomitantly preserving curve structure. For lowess, Friedman super smoother, cubic spline, running mean, Whittaker, and EMA, we found parameters to deliver the mandatory features described above (see online Supplemental Fig. 6). Lowess, Friedman super smoother, running mean, and EMA are very sensitive to their parameters and may result in highly divergent curves. Cubic spline and Whittaker smoothers performed well and did not react oversensitively to altered parameters.

Next, we had a closer look at the Savitzky–Golay and Kalman smoothers by analyzing the baseline and exponential region of the first 50 reactions of the reps384 data set (see online Supplemental Fig. 7). The Savitzky–Golay smoother performed satisfactorily by not introducing any bias, reducing the variance of the data points subtly, and compensating the downward baseline drift at cycle 1 (see online Supplemental Fig. 7A). In the exponential region, this smoother preserved the original curve structure well (see online Supplemental Fig. 7B). In contrast, the Kalman smoother increased the variance of points in the baseline region (see online Supplemental Fig. 7A) and exerted a significant downward shift of data points in the exponential region (see online Supplemental Fig. 7B).
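One way to see why a Savitzky–Golay filter can preserve curve structure better than a plain running mean is to apply both to a curved but noise-free signal. The sketch below uses the standard 5-point quadratic Savitzky–Golay weights; the frame size, polynomial order, and helper name are assumptions and need not match the settings used in this study.

```python
# Standard 5-point Savitzky-Golay weights for a quadratic (or cubic) local fit
SG5_QUADRATIC = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
# Plain 5-point running mean for comparison
MEAN5 = [1 / 5] * 5

def convolve_interior(y, weights):
    """Apply symmetric weights to interior points only (edges are dropped)."""
    half = len(weights) // 2
    return [sum(w * y[i + k - half] for k, w in enumerate(weights))
            for i in range(half, len(y) - half)]

y = [float(x * x) for x in range(10)]   # convex, noise-free test signal
truth = y[2:-2]
sg = convolve_interior(y, SG5_QUADRATIC)
rm = convolve_interior(y, MEAN5)

print("Savitzky-Golay error:", [round(s - t, 6) for s, t in zip(sg, truth)])
print("running mean error:  ", [round(r - t, 6) for r, t in zip(rm, truth)])
```

On this quadratic signal the Savitzky–Golay weights reproduce the input, whereas the running mean adds a constant offset wherever the signal is curved. Real amplification curves are not polynomials, so this only indicates the direction of the effect.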

These preliminary results suggested that smoothing methods can substantially affect the curve structure and the dispersion of data points in the baseline and exponential regions. To inspect this in detail, we used all smoothing methods on all 380 replicate reactions of the reps384 data set. We analyzed Cq at SDM, Cq at Fq = 250, Cy0 value, efficiency at SDM, efficiency by the sliwin method, and TOP (Fig. 2). For running mean, Whittaker, and EMA, we tested different parameter values for *wsize*, λ, and α, respectively. The effect of the smoothing methods was compared with the unsmoothed fluorescence data (“no smoothing”). Lowess, Friedman super smoother, Kalman, running mean, and EMA introduced considerable bias in all analyzed qPCR parameters, whereas cubic spline, Savitzky–Golay, and Whittaker smoothers had no apparent harmful influence on parameter estimates. When analyzing the distribution of the Cq (SDM) values, cubic spline, Savitzky–Golay, and Whittaker introduced no or only slight kurtosis and skewness (see online Supplemental Fig. 8, C, D, and H). Lowess, Friedman super smoother, Kalman, running mean, and EMA caused observable kurtosis, skewness, and Cq value shifting (see online Supplemental Fig. 8, A, B, E–G, and I–L).
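The kind of Cq shift described here can be reproduced qualitatively on an idealized curve. The following sketch smooths a noise-free logistic curve with a centered running mean and compares the threshold cycle at Fq = 250 with the unsmoothed value. The curve, the window handling at the edges, and the function names are assumptions, and the size of the shift does not correspond to the values in Fig. 2.

```python
import math

def logistic(cycle, plateau=1000.0, c50=20.0, slope=1.5):
    """Hypothetical noise-free amplification curve."""
    return plateau / (1.0 + math.exp(-(cycle - c50) / slope))

def running_mean(y, wsize):
    """Centered running mean; the window is truncated at the edges."""
    half = wsize // 2
    return [sum(y[max(0, i - half): i + half + 1]) /
            len(y[max(0, i - half): i + half + 1]) for i in range(len(y))]

def ct_threshold(fluo, threshold):
    """Fractional cycle of the first threshold crossing (fluo[0] is cycle 1)."""
    for i in range(1, len(fluo)):
        lo, hi = fluo[i - 1], fluo[i]
        if lo < threshold <= hi:
            return i + (threshold - lo) / (hi - lo)
    return None

fluo = [logistic(c) for c in range(1, 41)]
ct_raw = ct_threshold(fluo, 250.0)
ct_w3 = ct_threshold(running_mean(fluo, 3), 250.0)
ct_w5 = ct_threshold(running_mean(fluo, 5), 250.0)
print(f"no smoothing: {ct_raw:.3f}; wsize 3: {ct_w3:.3f}; wsize 5: {ct_w5:.3f}")
```

Below the inflection point the curve is convex, so averaging over neighbouring cycles raises the smoothed fluorescence and moves the threshold crossing to an earlier cycle; the shift grows with the window size.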

Our second approach was based on a classic Monte Carlo simulation procedure. We fitted a 5-parameter sigmoidal model to reaction 1 of the reps384 data set and simulated 1000 qPCR curves with 2% heteroscedastic (as a function of the fluorescence magnitude) gaussian noise (Fig. 3A). The Cq value at Fq = 250 for the fitted curve was 18.00 (Fig. 3B). Following the approach in Fig. 2, we analyzed the bias in the Cq values of the simulated curves depending on the applied smoothing methods. The shifts in Cq values agreed with the results from the real-world data. Cubic spline, Savitzky–Golay, and Whittaker smoothers imposed the least bias on the Cq values, compared with the unsmoothed data (Fig. 3C). For comparison, we used homoscedastic noise and found no substantial differences compared with the heteroscedastic setup (not shown).

The linear calibration curve method has found ubiquitous use since its initial description for relative and absolute qPCR quantification (36). Hence, it was important to investigate how smoothing would alter the parameters of calibration curves by affecting their Cq values. This analysis was based on the dil4reps94 data set (13) (94 replicates in 4 steps of a decadic dilution). For all curves, we acquired the Cq values at Fq = 250 and regressed them against log_{10}(copy numbers). All amplification curves used to construct the calibration curves (Fig. 4, B–F), except Fig. 4A, were smoothed. The unsmoothed fit resulted in a calibration curve with *R*^{2} = 0.996 (*P* < 0.001) (Fig. 4A). Predicting an unknown sample with Cq = 26 corresponded to approximately 534 copies. Application of the smoothing methods led, as hypothesized owing to the shift in Cq values, to different copy number values at Cq = 26 with a corresponding bias compared with the unsmoothed sample. Cubic spline (534 copies, approximately 0% difference) and Savitzky–Golay (528 copies, approximately 1.1% difference) displayed the least bias, whereas running mean (457 copies, approximately 14.4% difference) and Friedman super smoother (346 copies, approximately 35.2% difference) exerted the largest shift in Cq values, also evident by the largest change of linear regression parameters (Fig. 4, B and C, text insets). In agreement with previous results, cubic spline, Savitzky–Golay, and Whittaker smoothers displayed the smallest bias from Cq value shifting (Fig. 4, D–F). However, because the changes in slope and intercept affect all Cq values, ratios calculated from 2 Cq values remain similar to the unsmoothed sample for all setups (data not shown).
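The percentage differences quoted above follow directly from the predicted copy numbers at Cq = 26. The short sketch below redoes this arithmetic, taking the relative difference with respect to the unsmoothed estimate of 534 copies; the variable names are illustrative only.

```python
unsmoothed = 534  # predicted copies at Cq = 26 without smoothing

smoothed = {
    "cubic spline": 534,
    "Savitzky-Golay": 528,
    "running mean": 457,
    "Friedman super smoother": 346,
}

# Relative difference to the unsmoothed prediction, in percent
bias_percent = {name: 100.0 * (unsmoothed - copies) / unsmoothed
                for name, copies in smoothed.items()}

for name, bias in bias_percent.items():
    print(f"{name}: {bias:.1f}%")
```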

Fitting complete qPCR curves without sigmoidal models can be accomplished with spline interpolation. Every data point is treated as exact (i.e., the fitted curve intersects with it) by use of piecewise polynomial regression. This is the approach of efficiency estimation in Shain and Clemens (37). We conducted a simulation as in Fig. 3 by spline interpolation (function spl3 in the qpcR package) and compared the smoothing results with Cq = 18 (Fig. 5A). Corroborating our previous findings, the best-performing smoothers were cubic spline, Savitzky–Golay, and Whittaker. Cubic spline and Savitzky–Golay maintained the best accuracy (least deviation from the original Cq value, % bias) (Fig. 5C, gray arrow) and reduced the variance of Cq values slightly (Fig. 5B, gray arrow) compared with the unsmoothed data.

We hypothesized that the smoothers' tendencies to bias Cq values depend on the AE. Because of their ubiquitous use, we focused on the cubic spline, running mean at window sizes 3 and 5, and Savitzky–Golay and Whittaker smoothers. Data were taken from the chipPCR data sets Eff625, Eff750, Eff875, and Eff1000 with AEs between 62.5% and 100%. The Cq values for all curves (n = 1000) at each efficiency were calculated (Fig. 6) and compared with the original Cq = 18. Similar to the findings presented in Figs. 2, 3, and 4, smoothing introduced significant bias in Cq value estimates. However, the bias decreased rapidly with increasing AE. Cubic spline, Savitzky–Golay, and Whittaker exhibited lower sensitivity to differences in AE, whereas running mean displayed high sensitivity.

Finally, we applied different smoother and filter functions to an isothermal amplification (HDA) and the ccPCR data sets. Cubic spline worked best for HDA and ccPCR. Our results indicate that the moving average with window sizes 7–17 also worked reliably (see online Supplemental Figs. 8–11).

## Discussion and Conclusion

We investigated current qPCR analysis methods in light of the smoothing approach applied to the data before analysis. We found benefits and pitfalls with respect to the use of appropriate and robust smoothers. Interestingly, these techniques have not been the subject of previous investigations in the qPCR literature. By contrast, operations such as baseline subtraction, amplitude normalization, and location of the quantification cycle have been thoroughly studied (14). We found that commonly used smoothing and filter approaches can have substantial impacts on data analysis. In most commercial software packages, smoothing is a black box (no information regarding the type of smoothers, no user-definable parameters), whereas peer-reviewed, open-source methods often provide more information. Several authors recently extended a plea for open and transparent data analysis (38). We support this idea and have implemented our findings as open-source software (qpcR and chipPCR) for the freely available R statistical programming environment.

It is important to question whether smoothing is of analytical and quantitative benefit for the estimation of the 2 qPCR parameters Cq value and AE. Both parameters are crucial for linear calibration–derived prediction of copy numbers in unknown samples (36). Using high-replicate real-world qPCR data and simulated data with defined noise structure, we analyzed widely applied smoothing methods with regard to their effect on potential bias in quantification when estimating these 2 qPCR parameters. Our findings indicate that for qPCR analysis, the investigated smoothers can be categorized into 3 groups as follows.

Group 1 encompasses smoothers with a strong influence on the curve trajectory. These methods can alter the original curvature significantly and induce a cycle shift on the original data, thereby rendering the resulting curves nonfunctional for Cq and AE estimation. Methods in this category include lowess, Friedman super smoother, running mean, and Kalman.

Group 2 encompasses smoothers that are sensitive to the parameter value selection. With optimal parameter selection, the smoother will not exert too much influence on the curvature (e.g., EMA, α = 0.8).

Group 3 encompasses smoothers that maintain curvature and are not overly sensitive to changes in parameter values, such as cubic spline, Savitzky–Golay, and Whittaker.

To judge smoothing performance, it is important to emphasize the overall goal of smoothing: to capture the dominant patterns in data while removing small underlying phenomena such as noise or other fine-scale structures. However, there are some inherent complications.

The original real-world data can be transformed into “non-real-world” data, including spuriously added signals that may deviate substantially from the original data (e.g., Kalman smoother in the baseline region). Assuming the data are not predominantly noise, but made of veritable systematic values, smoothing entails a loss of information for downstream applications. Finally, it has been noted that smoothing can introduce significant bias into the data (39), which was clearly evident in some smoothing methods we used (lowess, Friedman super smoother, running mean).

qPCR curves characteristically display substantial noise in the baseline and plateau phases. In the crucial exponential region, noise is less visually recognizable but still present, as can be seen by inspecting the residuals of exponential models. A closer look at the benefits of the well-performing smoothers (in the sense of preserving AE independently of the curve structure), including cubic spline, Savitzky–Golay, and Whittaker, reveals the following. The boxplot widths (25%/75% quantiles) in Figs. 5 and 6 indicate that these 3 smoothers did not introduce much bias into the original Cq and AE values. A variance reduction of these 2 parameters in a technical replicate setting was not observed with a sigmoidal model. Although the smoothers may be useful in compensating for the noise in the baseline/plateau regions, their effect in the exponential region (region for Cq and AE deduction) is minimal. A common problem of qPCRs is that the AE may change in the presence of inhibitors. Therefore, a valuable property shared by the cubic spline and Savitzky–Golay methods is their relative independence from the AE.

Contrasting with the above, a veritable reduction of Cq value variance was observed only when the fitting was done with a spline interpolation with a preceding Savitzky–Golay smoother (Fig. 5); with this combination of methods, an effect on the exponential region was evident. Because the spline function treats every point as is, an optimal smoother may have a beneficial effect when removing noise in scenarios where outliers induce artifacts in quantification.

Thus, we conclude that for most cases, data should be smoothed only for visualization. In addition, it is mandatory to use the same smoothing method for both the calibration curve samples and the unknown sample curves. When low-noise data cannot be obtained, smoothing may be applied before fitting, but any further analysis should take into consideration that the original data structure has changed. When smoothing is used, one should choose the most faithful smoother and compare classic nonlinear models with spline fits. It is quite possible that smoothing parameters can be optimized by methods such as cross-validation (see online Supplemental Data, section 1.2). We considered the cubic spline, Savitzky–Golay, and Whittaker smoothers to be most appropriate in our setting because they were less AE dependent. Other smoothing approaches were problematic because they did not maintain the curve structure. As an alternative to sigmoidal models, combining spline interpolation with Savitzky–Golay smoothing may offer a beneficial approach. The same applies to isothermal amplification. We found isothermal amplification to be less biased by smoothing or filtering methods owing to the higher sample rates (time-based) than in qPCRs (cycle-based). Our data indicate that smoothing and filtering should not be part of an automatic process but should be supervised.

## Footnotes

† Andrej-Nikolai Spiess and Stefan Rödiger contributed equally to the work, and both should be considered as first authors.

^{6} Nonstandard abbreviations: qPCR, quantitative PCR; Cq, quantification cycle; Ct, cycle threshold; SDM, second derivative maximum; AE, amplification efficiency; MIQE, Minimum Information for Publication of Quantitative Real-Time PCR Experiments; lowess, robust locally weighted regression; EMA, exponential moving average; Fq, quantification fluorescence level; sliwin, sliding-window method; TOP, takeoff point; ccPCR, capillary convective PCR; HDA, helicase-dependent amplification.

**Author Contributions:** *All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.*

**Authors' Disclosures or Potential Conflicts of Interest:** *Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:*

**Employment or Leadership:** S. Rödiger, Brandenburg University of Technology Cottbus–Senftenberg.

**Consultant or Advisory Role:** None declared.

**Stock Ownership:** None declared.

**Honoraria:** None declared.

**Research Funding:** A.-N. Spiess, grant Sp721/4-1 of the German Research Foundation; C. Deutschmann, BMBF (Federal Ministry of Education and Research, Germany) project InnoProfile-Transfer 03 IPT 611X; P. Schierack, BMBF (Federal Ministry of Education and Research, Germany) projects InnoProfile-Transfer 03IP611 and 03IPT611X; S. Rödiger, BMBF (Federal Ministry of Education and Research, Germany) project InnoProfile-Transfer 03 IPT 611X.

**Expert Testimony:** None declared.

**Patents:** None declared.

**Role of Sponsor:** The funding organizations played no role in the design of study, choice of enrolled patients, review and interpretation of data, or preparation or approval of manuscript.

- Received for publication July 21, 2014.
- Accepted for publication October 31, 2014.

- © 2014 American Association for Clinical Chemistry