A Meta-Analysis Of High Resolution Audio Perceptual Evaluation Article By Joshua Reiss

July 2016

2.4 Hypotheses and disputed results
Many study results have been disputed, or given very different interpretations by their authors. Oohashi 1991 noted a persistence effect; when full range content (including frequencies beyond 20kHz) is played immediately before low pass filtered content, the subjects incorrectly identified them as the same. Woszcyk 2007 found statistical significance in the different test conditions that were used, and speculated that the complex high resolution signals might have been negatively perceived as artifacts. Both Oohashi 1991 and Woszcyk 2007 may have suffered a form of Simpson's paradox, where these false negatives canceled out a statistically significant discrimination of high resolution audio in other cases. Similar problems may have plagued King 2012, where many participants rated the 'live feed' as sounding least close to live. Indeed, Pras 2010 observed a group of individuals who 'anti-discriminate', and consistently misidentify high resolution audio in ABX tests.

Several studies intentionally considered discrimination of a high resolution format even if the content was not intended to be high resolution. In [62, 64], it was claimed that Nishiguchi 2003 did not have sufficient high frequency content. In one condition for Woszcyk 2007, a 20kHz cut-off filter was used, and in Nishiguchi 2005, the authors stated that they 'used ordinary professional recording microphones and did not intend to extend the frequency range intentionally during the recording sessions... sound stimuli were originally recorded using conventional recording microphones.' These studies were still considered in the meta-analysis of Section 3 since further investigation (e.g., spectrograms and frequency response curves in [58, 64, 68]) shows that they may still have contained high frequency content, and the extent to which one can discriminate a high sample rate format without high frequency content is still a valid question.

Other studies noted conditions which may contribute to high resolution audio discrimination. [25, 60, 61] noted that intermodulation distortion may result in aliasing of high frequency content, and [63] remarked on the audibility of the noise floor for 16 bit formats at high listening levels. [23] had participants blindfolded, in order to eliminate visual distractions, and [56], though finding a null result when comparing two high resolution formats, still noted that the strongest results were amongst participants who conducted the test with headphones.

Together, the observations mentioned in this section provide insight into potential biases or flaws to be assessed for each study, and a set of hypotheses to be validated, if possible, in the following meta-analysis section.

2.5 Risk of bias
Table 2B presents a summary of the risk of bias, or other issues, in the studies. This has been adapted from [77], with a focus on the types of biases common to these tests. In particular, we are concerned with biases that may be introduced due to the methodology (e.g., the test may be biased towards inability to discriminate high resolution content if listeners are asked to select stimuli closest to 'live' without defining 'live', as in [72]), the experimental design (e.g., level imbalance as in [45, 46] or intermodulation distortion as in [25, 60, 61] may result in false positive discrimination), or the choice of stimuli (e.g., stimuli may not have contained high resolution content as in [58], or used test signals that may not capture whatever behaviour might cause perception of high resolution content, as in [26, 59], leading to false negatives). We identified an unclear risk in each category if the risk had not been addressed or discussed, and a high risk if there was strong evidence of a flaw or bias in a category. Potential biases led both to Type I and Type II errors, i.e., to falsely suggesting an ability to discriminate or not to discriminate high resolution content, though Type II errors were more common. Furthermore, biases often existed which might result in Type II errors even when the overall result demonstrated an effect (e.g., [59]).

3. Meta-analysis results
The most common way that results are presented in the studies is as the mean percentage of trials with correct discrimination of stimuli, averaged over all participants. Thus this effect measure, equivalent to a mean difference [77], is used in most of the analysis that follows. The influence of these and other choices will be analyzed in Section 3.7.

3.1 Binomial tests
A simple form of analysis is to consider a null hypothesis, for each experiment, that there is no discernible effect. For all experimental methodologies, this would result in the answer for each trial, regardless of stimuli and subject, having a 50% probability of being correct. Table 2C depicts the number of trials, percentage of correct results for each trial, and the cumulative probability of at least that many correct answers if the experiment were truly random. Significant results at a level of α = 0.05 are given in the last column of Table 2.

Of note, several experiments where the authors concluded that there was not a statistically significant effect (Plenge 1980, Nishiguchi 2003) still appear to suggest that the null hypothesis can be rejected.
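The cumulative probability described above is an upper-tail binomial probability under pure guessing. The following is a minimal sketch of that calculation, not the code used to produce the article's tables. The function name `binomial_tail` and the example figures (60 correct out of 100 trials) are hypothetical and chosen only for illustration.

```python
from math import comb


def binomial_tail(correct: int, trials: int, p: float = 0.5) -> float:
    """Probability of at least `correct` right answers in `trials` independent guesses."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical experiment (not taken from any study): 60 correct out of 100 trials.
p_value = binomial_tail(60, 100)
print(f"P(X >= 60 | n = 100, p = 0.5) = {p_value:.4f}")
print("significant at alpha = 0.05:", p_value < 0.05)
```

The test is one-sided: it asks only whether listeners did better than chance, so a group that consistently 'anti-discriminates' would not register as significant under this calculation.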

3.2 To what extent does training affect results?
Figure 2 depicts a forest plot of all studies where mean and standard deviation per participant can be obtained, divided into subgroups where participants either received detailed training (explanation of what to listen for, examples where artifacts could be heard, pretest with results provided to participants...), or received no or minimal training (explanation of the interface, screening for prior experience in critical listening).

The statistic I² measures the extent of inconsistency among the studies' results, and is interpreted as approximately the proportion of total variation in study estimates that is due to heterogeneity (differences in study design) rather than sampling error. Similarly, a low p value for heterogeneity suggests that the tests differ significantly, which may be due to bias.
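As a rough illustration of how I² is obtained, the sketch below computes Cochran's Q from inverse-variance weights and then applies the usual definition I² = max(0, (Q - df) / Q). The function name `heterogeneity` and the input numbers are hypothetical, not data from the studies, and the p value for heterogeneity (a chi-squared tail probability of Q) is omitted to keep the example dependency-free.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q, its degrees of freedom, and I² (in percent)."""
    weights = [1.0 / se**2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I² is the share of Q in excess of what sampling error alone predicts (df).
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical per-study effects (mean proportion correct minus 0.5) and standard errors.
q, df, i2 = heterogeneity([0.04, 0.06, 0.05, 0.12], [0.02, 0.03, 0.02, 0.03])
print(f"Q = {q:.2f} on {df} df, I² = {i2:.0f}%")
```

When Q is no larger than its degrees of freedom, I² is reported as 0%, which is how a subgroup can show no measurable heterogeneity at all.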

The results are striking. The training subgroup reported an overall strong and significant ability to discriminate high resolution audio. Furthermore, tests for heterogeneity gave I²=0% and p=0.59, suggesting a strong consistency between those studies with training, and that all variation in study estimates could be attributed to sampling error. In contrast, those studies without training had an overall small effect. Heterogeneity tests reveal large differences between these studies (I²=23%), though this may still be attributed to statistical variation (p=0.23). Contrasting the subgroups, the test for subgroup differences gives I²=95.5% and p<10⁻⁵, suggesting that almost all variation in subgroup estimates is due to genuine variation across the 'Training' and 'No training' subgroups rather than sampling error.

3.3 How does duration of stimuli and intervals affect results?
The International Telecommunication Union recommends that sound samples used for sound quality comparison should not last longer than 15–20 s, and intervals between sound samples should be up to 1.5 s [78], partly because of limitations in short-term memory of test subjects. However, the extensive research into brain response to high resolution content suggests that exposure to high frequency content may evoke a response that is both lagged and persistent for tens of seconds, e.g., [22, 48]. This implies that effective testing of high resolution audio discrimination should use much longer samples and intervals than the ITU recommendation implies.

Unfortunately, statistical analysis of the effect of duration of stimuli and intervals is difficult. Of the 18 studies suitable for meta-analysis, only 12 provide information about sample duration and 6 provide information about interval duration, and many other factors may have affected the outcomes. In addition, many experiments allowed test subjects to listen for as long as they wished, thus making these estimates very rough approximations.

Nevertheless, strong results were reported in Theiss 1997, Kanetada 2013A, Kanetada 2013B and Mizumachi 2015, which all had long intervals between stimuli. In contrast, Muraoka 1981 and Pras 2010 had far weaker results with short duration stimuli. Furthermore, Hamasaki 2004 reported statistically significant stronger results when longer stimuli were used, even though participant and stimuli selection had more stringent criteria for the trials with shorter stimuli. This is highly suggestive that duration of stimuli and intervals may be an important factor.

A subgroup analysis was performed, dividing between those studies with stated long duration stimuli and/or long intervals (30 seconds or more) and those which state only short duration stimuli and/or short intervals. The Hamasaki 2004 experiment was divided into the two subgroups based on stimuli duration of either 85-120s or approx. 20s [62, 64].

The subgroup with long duration stimuli reported 57% correct discrimination, whereas the short duration subgroup reported 52%. Though the distinction between these two groups was far less strong than when considering training, the subgroup differences were still significant at a 95% level, p=0.04. This subgroup test also has a small number of studies (14), and many studies in the long duration subgroup also involved training, so one can only say that it is suggestive that long durations for stimuli and intervals may be preferred for discrimination.

3.4 Effect of test methodology
There is considerable debate regarding preferred methodologies for high resolution audio perceptual evaluation. Authors have noted that ABX tests have a high cognitive load [11], which might lead to false negatives (Type II errors). An alternative, the 1IFC same-different task, was used in many tests. In these situations, subjects are presented with a pair of stimuli on each trial, with half the trials containing a pair that is the same and the other half with a pair that is different. Subjects must decide whether the pair represents the same or different stimuli. This test is known to be 'particularly prone to the effects of bias' [79]. A test subject may have a tendency towards one answer, and this tendency may even be prevalent amongst subjects. In particular, a subtle difference may be perceived but still identified as 'same,' biasing this approach towards false negatives as well.

We performed subgroup tests to evaluate whether there are significant differences between those studies where subjects performed a 1 interval forced choice 'same/different' test, and those where subjects had to choose amongst two alternatives (ABX, AXY, or XY 'preference' or 'quality'). For same/different tests, heterogeneity test gave I²=67% and p=0.003, whereas I²=43% and p=0.08 for ABX and variants, thus suggesting that both subgroups contain diverse sets of studies (note that this test has low power, and so more importance is given to the I² value than the p value, and typically, α is set to 0.1 [77]).

A slightly higher overall effect was found for ABX, 0.05 compared to 0.02, but with confidence intervals overlapping those of the 1IFC 'same/different' subgroup. If methodology has an effect, it is likely overshadowed by other differences between studies.

3.5 Effect of quantization
Most of the discrimination studies focus on the effect of sample rate and the use of stimuli with and without high frequency content. It is well-known that the dynamic range of human hearing (when measured over a wide range of frequencies and considering deviations among subjects) may exceed 100 dB. Therefore, it is reasonable to speculate that bit depth beyond 16 bits may be perceived.
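To put the 100 dB figure in context, the sketch below evaluates the textbook signal-to-quantization-noise ratio of an ideal N-bit uniform quantizer driven by a full-scale sine wave, roughly 6.02N + 1.76 dB. This idealization is an assumption brought in for illustration and is not taken from the article; dither, noise shaping and real converter behaviour all change the practical figure. The function name `ideal_sqnr_db` is hypothetical.

```python
import math


def ideal_sqnr_db(bits: int) -> float:
    """Ideal SQNR in dB for a full-scale sine through an N-bit uniform quantizer.

    Signal RMS is 2**(bits - 1) / sqrt(2) steps and quantization noise RMS is
    1 / sqrt(12) steps, so the power ratio is 1.5 * 2**(2 * bits).
    """
    return 20 * math.log10(2**bits) + 10 * math.log10(1.5)


for bits in (16, 24):
    print(f"{bits}-bit: {ideal_sqnr_db(bits):.1f} dB")
```

Under this idealization a 16-bit format falls slightly below 100 dB while a 24-bit format sits well above it, which is consistent with the speculation that the extra bit depth could matter for some listeners and listening levels.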

Only a small number of studies considered perception of high resolution quantization (beyond 16 bits per sample). Theiss 1997 reported 94.1% discrimination for one test subject comparing 96kHz/24-bit to 48kHz/16-bit, and the significantly lower 64.9% discrimination over two subjects comparing 96kHz/16-bit to 48kHz/16-bit. Jackson 2014 compared 192kHz to 44.1kHz and to 48kHz with different quantizers. They found no effect of 24 to 16 bit reduction in addition to the change in sample rate. Kanetada 2013A, Kanetada 2013B and Mizumachi 2015 all found strong results when comparing 16 to 24 bit quantization. Notably, Kanetada 2013B used a 48kHz sample rate for all stimuli and thus focused only on difference in quantization.

However, Kanetada 2013A, Kanetada 2013B and Mizumachi 2015 all used undithered quantization. Dithered quantization is almost universally preferred since, although it increases the noise floor, it reduces noise modulation and distortion. But few have looked at perception of dither. [80] dealt solely with perception of the less commonly used subtractive dither, and only at low bit depths, up to 6 bits per sample. [81] investigated preference for dither for 4 to 12 bit quantizers in two bit increments. Interestingly, they found that at 10 or 12 bits, for all stimuli, test subjects either did not show a significant preference or preferred undithered quantization over rectangular dither and triangular dither for both subtractive and nonsubtractive dither. Jackson 2014 found very little difference (over all subjects and stimuli) in discrimination ability when dither was or was not applied. Thus, based on the evidence available, it is reasonable to include these as valid discrimination experiments even though dither was not applied.
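The trade-off described above (a higher noise floor in exchange for less signal-dependent error) can be seen in a toy simulation. The sketch below quantizes a sine wave whose peak is only 0.4 of one quantization step, with and without nonsubtractive triangular (TPDF) dither. It is an illustration under assumed parameters, not a model of any experiment in the studies, and it says nothing about audibility. All names and values are hypothetical.

```python
import math
import random


def quantize(x: float) -> int:
    """Round to the nearest quantization step (1 step = 1 LSB)."""
    return math.floor(x + 0.5)


def tpdf_dither(rng: random.Random) -> float:
    """Triangular dither spanning -1 to +1 LSB (difference of two uniform draws)."""
    return rng.random() - rng.random()


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
# Low-level test tone: 50 cycles, peak amplitude 0.4 LSB.
signal = [0.4 * math.sin(2 * math.pi * 50 * i / n) for i in range(n)]

undithered = [quantize(s) for s in signal]
dithered = [quantize(s + tpdf_dither(rng)) for s in signal]


def correlation_with_signal(output):
    """Unnormalized correlation between a quantized output and the input tone."""
    return sum(o * s for o, s in zip(output, signal))


print("undithered output is silent:", all(v == 0 for v in undithered))
print("correlation, undithered:", correlation_with_signal(undithered))
print("correlation, dithered:  ", round(correlation_with_signal(dithered), 1))
```

Without dither the tone never crosses a rounding threshold and disappears entirely, whereas the dithered output is noisier but still carries the tone on average.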

3.6 Is there publication bias?
A common concern in meta-analysis is that smaller studies reporting negative or null results may not be published. To investigate potential publication bias, we produced a funnel plot of the 16 studies where a mean difference per participant was obtained, and funnel plots of the two subgroups of studies with and without training (Figure 3). The overall funnel plot shows clear asymmetry, with few studies showing a low mean difference and a high standard error, i.e., few small studies with null results. Several studies also fall outside the 95% confidence interval, further suggesting biases. However, much of the asymmetry disappears when different funnel plots are provided for subgroups with and without training, and all studies fall within their confidence intervals. Though publication bias may still be a factor, it is likely that the additional effort in conducting a study with training was compensated for by fewer participants or fewer trials per participant, which contribute to larger standard errors. This is in full agreement with the cautions described in [82, 83].

3.7 Sensitivity Analysis
This meta-analysis involves various decisions that may be considered subjective or even arbitrary. Most notably, we aimed to include all data from all high resolution perception studies that may be transformed into an average ratio, over all participants, of correct to total discrimination tasks. The choice of included studies, interpretation of data from those studies and statistical approaches may all be questioned. For this reason, Table 5 presents a sensitivity analysis, repeating our analysis and subjecting our conclusions to alternative approaches.

Though the studies are diverse in their approaches, we considered fixed effect models in addition to random effect models. These give diminished (but still significant) results, primarily because large studies without training are weighted heavily under such models.
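The weighting effect mentioned above can be illustrated with a small sketch that pools the same hypothetical studies under both models. The random effect version uses the DerSimonian-Laird estimate of between-study variance, which is an assumption made here for illustration; the article's Appendix is the authority on the methods actually applied. The function names and all numbers are hypothetical.

```python
def pool_fixed(effects, std_errors):
    """Inverse-variance fixed effect estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return estimate, (1.0 / sum(weights)) ** 0.5


def pool_random(effects, std_errors):
    """Random effect estimate using the DerSimonian-Laird tau² (assumed estimator)."""
    weights = [1.0 / se**2 for se in std_errors]
    fixed, _ = pool_fixed(effects, std_errors)
    q = sum(w * (e - fixed) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w**2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, floored at zero
    star = [1.0 / (se**2 + tau2) for se in std_errors]
    estimate = sum(w * e for w, e in zip(star, effects)) / sum(star)
    return estimate, (1.0 / sum(star)) ** 0.5, tau2


# Hypothetical data: one large study with a small effect, three small studies
# with larger effects (effect = mean proportion correct minus 0.5).
effects = [0.02, 0.10, 0.12, 0.09]
std_errors = [0.005, 0.03, 0.04, 0.03]

fixed_est, fixed_se = pool_fixed(effects, std_errors)
random_est, random_se, tau2 = pool_random(effects, std_errors)
print(f"fixed effect:  {fixed_est:.3f} (SE {fixed_se:.3f})")
print(f"random effect: {random_est:.3f} (SE {random_se:.3f}), tau² = {tau2:.4f}")
```

In this made-up example the fixed effect estimate is pulled towards the single large study, while the random effect model spreads weight more evenly once between-study variance is taken into account.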

We also considered treating the studies as yielding dichotomous rather than continuous results. That is, rather than mean and standard error over all participants, we simply consider the number of correctly discriminated trials out of all trials. This approach usually requires an experimental and control group, but due to the nature of the task and the hypothesis, it is clear that the control is random guessing, i.e., 50% correct as number of trials approaches infinity. This knowledge of the expected behavior of the control group allows use of standard meta-analysis approaches for dichotomous outcomes. Treating the data as dichotomous gave stronger results, even though it allowed inclusion of Meyer 2007, which was one of the studies that most strongly supported the null hypothesis. Use of the Mantel-Haenszel (as opposed to Inverse Variance) meta-analysis approach with the dichotomous data had no influence on results.

A full description of the statistical methods used for continuous and dichotomous results, fixed effects and random effects, and the Inverse Variance and Mantel-Haenszel methods, is given in the Appendix.

Many studies involved several conditions, and some authors participated in several studies. Treating each condition as a different study (a valid option since some conditions had quite different stimuli or experimental set-ups) or merging studies with shared authors was performed for dichotomous data only, since it was no longer possible to associate results with unique participants. Treating all conditions as separate studies yielded the strongest outcome. This is partly because some studies had conditions giving opposite results, thus hiding strong results when the different conditions were aggregated. Finally, we considered focusing only on sample rate and bandwidth (removing those studies that involved changes in bit depth) or only those using modern digital formats (removing the pre-2000s studies that used either analogue or DAT systems). Though this excluded some of the studies with the strongest results, it did not change the overall effect.

Though not shown in Table 5, all of the conditions tested gave an overall effect with p<0.01, and all showed far stronger ability to discriminate high resolution audio when the studies involved training.

All contents copyright^©1995 - 2024 Enjoy the Music.com^®
May not be copied or reproduced without permission. All rights reserved.