Format discrimination studies
Other forms of indirect discrimination include studies that ask participants to identify or rate semantic descriptors [44, 52], or to perform a task with or without high resolution audio, e.g., localize a sound source, set listening level, or discriminate timing. Such studies may show, at a high level, which perceptual attributes are most affected. However, the difficulty with subjecting such studies to meta-analysis is that a well-designed experiment may (correctly) give a null result on the indirect discrimination task even if participants can discriminate high resolution audio by other means.
Several studies have focused on tasks involving direct discrimination between competing high resolution audio formats. In one study, test subjects generally did not perceive a difference between DSD (64 x 44.1 kHz, 1-bit) and DVD-A (176.4 kHz/16-bit) in an ABX test, whereas another showed statistically significant discrimination between PCM (192 kHz/24-bit) and DSD. However, in both cases, high resolution audio formats are compared against each other. Certainly in the first case, the null result does not imply that there would be a null result when discriminating between CD quality and a higher resolution format. The second case is intriguing, but closer inspection of the experimental set-up revealed that the two formats were subject to different processing, most notably different filtering of the low frequency content.
Transformation of study data
In King 2012, participants were asked to rate 44.1 kHz, 96 kHz and 192 kHz stimuli, all at 24-bit, and 'live' stimuli in terms of audio quality. This methodology is problematic in that the ranking may be inconclusive even when people hear a difference; for example, some may judge a low sample rate as higher quality due to personal preference, regardless of their ability to discriminate.
We were provided with the full data from the experiment. A priori, the decision was made to treat the 'live' stimuli as a reference, allowing the ranking data to be transformed into a form of A/B/X experiment. For each trial, it was treated as a correct discrimination if the highest sample rate, 192 kHz, was ranked closer to 'live' than the lowest sample rate, 44.1 kHz, and an incorrect discrimination if 44.1 kHz was ranked closer to 'live' than 192 kHz. Other rankings were excluded from analysis since they may have multiple interpretations. Thus if there is an inability to discriminate high resolution content, the probability of a correct answer is 50%.
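The transformation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes each trial is a triple of rank positions, and it interprets "ranked closer to 'live'" as the smaller absolute difference in rank position. The function names and the example trials are hypothetical.

```python
from typing import Iterable, Optional, Tuple

def classify_trial(rank_live: float, rank_192: float, rank_44: float) -> Optional[bool]:
    """Return True (correct), False (incorrect) or None (excluded).

    Assumption: closeness to 'live' is the absolute difference in rank.
    """
    d_hi = abs(rank_192 - rank_live)  # distance of 192 kHz from the reference
    d_lo = abs(rank_44 - rank_live)   # distance of 44.1 kHz from the reference
    if d_hi < d_lo:
        return True
    if d_lo < d_hi:
        return False
    return None  # equidistant: ambiguous, so excluded from the analysis

def tally(trials: Iterable[Tuple[float, float, float]]) -> Tuple[int, int, int]:
    """Count (correct, incorrect, excluded) over (live, 192, 44.1) rank triples."""
    correct = incorrect = excluded = 0
    for rank_live, rank_192, rank_44 in trials:
        outcome = classify_trial(rank_live, rank_192, rank_44)
        if outcome is None:
            excluded += 1
        elif outcome:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, excluded

# Hypothetical rank triples, purely for illustration (not data from King 2012).
example_trials = [(1, 2, 4), (1, 3, 2), (2, 1, 3), (4, 3, 1)]
print(tally(example_trials))
```

Under the null hypothesis the retained trials are correct with probability 50%, so the tallied counts can be fed directly into a binomial test.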
In Repp 2006, participants also provided quality ratings, in this case between 24-bit/192 kHz, 16-bit/44.1 kHz, and lower quality formats. This can be transformed into an XY test by assuming that a correct discrimination is made when 24-bit/192 kHz was rated higher than 16-bit/44.1 kHz, and an incorrect discrimination when 24-bit/192 kHz was rated lower than 16-bit/44.1 kHz. Results where they are rated equal are ignored, since there is no way of knowing if participants perceived a difference but simply considered it too small compared to differences between other formats, and hence these results cannot be categorized. Note also that here, unlike King 2012, there is no reference with which to compare the high resolution and CD formats. Thus, without training, there may be no consistent definition of quality and it may not be possible to identify correct discrimination of formats.
Meyer 2007 revisited
First, much of the high-resolution stimuli may not have actually contained high-resolution content, for three reasons: the encoding scheme on SACD obscures frequency components above 20 kHz and SACD players typically filter above 30 or 50 kHz; the mastering on both the DVD-A and SACD content may have applied additional low pass filters; and the source material may not all have been originally recorded in high resolution. Second, their experimental set-up was not well-described, so it is possible that high resolution content was not presented to the listener even when it was available. However, their experiment was intended to be close to a typical listening experience on a home entertainment system, and one could argue that these same issues may be present in such conditions. Third, their experiment was not controlled. Test subjects performed variable numbers of trials, with varying equipment, and usually (but not always) without training. Trials were not randomized, in the sense that A was always the DVD-A/SACD and B was always CD. In addition, A was on the left and B on the right, so if the content was panned slightly off-center, this might bias the choice between A and B.
Meyer and Moran responded to such issues by stating, "...there are issues with their statistical independence, as well as other problems with the data. We did not set out to do a rigorous statistical study, nor did we claim to have done so..." But all of these conditions may contribute towards Type II errors, i.e., an inability to demonstrate discrimination of high resolution audio.
Although full details of their experiment, methodology and data are not available, some interesting secondary analysis is possible. It has been noted that "the percentage of subjects who correctly identified SACD at least 70% of the time appears to be implausibly low." In trials with at least 55 subjects, only one subject had 8 out of 10 correct and 2 subjects achieved 7 out of 10 correct. The probability of no more than 3 people getting at least 7 out of 10 correct by chance is 0.97%. This suggests that the results were far from the binomial distribution that one would expect if the results were truly random.
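The 0.97% figure can be reproduced with a nested binomial calculation. The sketch below assumes exactly 55 subjects, each performing 10 independent trials with a 50% chance of a correct answer; the text says only "at least 55", so the subject count is an assumption, and the helper names are illustrative.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k: int, n: int, p: float) -> float:
    """Probability of at most k successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Chance that one guessing subject scores at least 7 out of 10.
p_7_plus = 1 - binom_cdf(6, 10, 0.5)  # equals 176/1024

# Chance that no more than 3 of 55 guessing subjects reach that score.
# Assumption: exactly 55 subjects with 10 trials each.
p_at_most_3 = binom_cdf(3, 55, p_7_plus)

print(f"P(>=7 of 10)          = {p_7_plus:.4f}")
print(f"P(<=3 of 55 subjects) = {p_at_most_3:.4f}")  # about 0.0097, i.e. 0.97%
```

With a per-subject probability of roughly 17%, about 9 of 55 guessing subjects would be expected to score 7 or more, which is why observing only 3 is so unlikely.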
If no one was able to distinguish between formats and there were no issues in the experimental design, then all trial results would be independent, regardless of whether the trials were by the same participant, and regardless of how participants are categorized. But a breakdown of correct answers by gender, age, audio experience and hearing ability was also given, depicted in Table 3. Non-audiophiles, in particular, have a very low success rate, 30 out of 87, which has a probability of only 0.25% under chance (p(X<=30) = 0.25%). Chi-squared analysis comparing audiophiles with non-audiophiles gives a p value of 0.18%, suggesting that it is extremely unlikely that the data for these two groups are independent. Similarly, analysis suggests that the results for those with and without strong high frequency hearing also do not appear independent, p = 4.92%. Note, however, that if there was a measurable effect, one would expect some dependency between answers from the same participant. The analysis in Table 3 is based only on total correct answers, not correct answers per participant, since this data was not available.
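The tail probability for the non-audiophile group can be checked with an exact binomial sum. This sketch assumes 87 independent trials, each with a 50% chance of a correct answer; the function name is illustrative.

```python
from math import comb

def lower_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# 30 correct answers out of 87 trials for non-audiophiles, as quoted in the text.
p_non_audiophile = lower_tail(30, 87)
print(f"P(X <= 30 | n = 87, p = 0.5) = {p_non_audiophile:.4f}")  # about 0.0025
```

This is a one-sided test in the direction of fewer correct answers than chance, which is the direction the text highlights.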
In several studies, a small number of participants had some form of evaluation with a p value less than 0.05. This is not necessarily evidence of high resolution audio discrimination, since the more times an experiment is run, the higher the likelihood that any result may appear significant by chance. Several experiments also involved testing several distinct hypotheses, e.g., does high resolution audio sound sharper, does it sound more tense, etc. Given enough hypotheses, some are bound to have statistical significance.
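To make the scale of this effect concrete, the sketch below computes the probability of at least one nominally significant result among m independent tests when no real effect exists. Independence between tests is an assumption, and the values of m are illustrative only.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """P(at least one p value below alpha) across m independent null tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 16, 50):
    rate = familywise_error_rate(0.05, m)
    print(f"m = {m:2d}: P(at least one false positive) = {rate:.3f}")
```

For example, with 16 independent tests at the 0.05 level, the chance of at least one spurious "significant" result is about 56%.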
This well-known multiple comparisons problem was accounted for using the Holm, Holm-Bonferroni and Sidak corrections (see Appendix), which all gave similar results. We also looked at the likelihood of finding a lack of statistically significant results where no or very few low p values were found. This is summarized in Table 4, which also gives the actual significance levels given that each participant has a limited number of trials with dichotomous outcomes. Interestingly, the results in Table 4 agree with the results of retesting statistically significant individuals in Nishiguchi 2003 and Hamasaki 2004, confirm the statistical significance of several results in Yoshikawa 1995, and highlight the implausible lack of seemingly significant results amongst the test subjects in Meyer 2007, as previously noted. For Pras 2010, the results refute the significance of the specific individuals who 'anti-discriminate' (consistently misidentify the high resolution content in an ABX test), but confirm the significance of there being 3 such individuals out of 16, and similarly for the 3 significant results out of 15 stimuli.
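As a rough illustration of the corrections named above, the sketch below implements the Sidak per-test threshold, the Holm-Bonferroni step-down procedure, and the attainable ("actual") significance level for a limited number of dichotomous trials. It is not the paper's Appendix procedure: the p values are hypothetical, and the one-sided test against a 50% chance rate is an assumption.

```python
from math import comb
from typing import List, Sequence

def sidak_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the family-wise rate at alpha for m tests."""
    return 1 - (1 - alpha) ** (1 / m)

def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Step-down Holm-Bonferroni; returns a reject flag for each p value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p values fail too
    return reject

def attainable_level(n: int, alpha: float = 0.05) -> float:
    """Largest one-sided tail P(X >= k) not exceeding alpha for n fair trials."""
    tail, best = 0.0, 0.0
    for k in range(n, -1, -1):
        tail += comb(n, k) * 0.5**n
        if tail > alpha:
            break
        best = tail
    return best

# Hypothetical p values for four participants, for illustration only.
example_p = [0.001, 0.04, 0.03, 0.2]
print(holm_bonferroni(example_p))          # only the smallest p value survives
print(round(sidak_threshold(0.05, 4), 5))  # per-test threshold for 4 tests
print(attainable_level(10))                # 11/1024: needs 9 or more correct of 10
```

The last function shows why the nominal 0.05 level is rarely achieved exactly: with 10 trials, 8 correct gives a tail probability of about 0.055, so the effective level is set by 9 correct, at about 0.011.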