The Tour Magazin tests are by far the best out there (way ahead of anything from GCN, Cycling Weekly, Bikeradar, Bikerumor, etc.). They test at the GST wind tunnel, the same tunnel Swiss Side uses, and they use a moving-leg dummy.
But indeed, you can't take the results at face value. The main problem is that they don't standardize the handlebar, which is a big part of the aero equation. The wheels can also be a problem: sometimes they give an additional figure with the reference wheelset (Zipp 404), but sometimes they don't.
For example, they've tested the 3T Strada (both the 1st and 2nd gen), and it scored 211w and 210w respectively with Zipp 404s. But then you realize that both were equipped with round-ish, fully wrapped bars... that's another ~7w, which makes the 3T Strada just as fast as the fastest mainstream bikes (Aeroad, S5, SystemSix). Yet the Strada is never referred to as such.
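Applying that rough bar correction to the published figures is simple subtraction. A minimal sketch, assuming the ~7w bar penalty above applies equally to both generations (it's an estimate, not a correction Tour publishes); `BAR_PENALTY_W` and `adjusted_score` are hypothetical names used only for illustration.

```python
# Published Tour figures (watts, with Zipp 404s) as quoted above.
published_w = {
    "3T Strada (1st gen)": 211,
    "3T Strada (2nd gen)": 210,
}

# Assumption: ~7 W penalty for round-ish, fully wrapped bars versus a
# one-piece aero cockpit. Rough estimate, not a measured value.
BAR_PENALTY_W = 7


def adjusted_score(score_w, penalty_w=BAR_PENALTY_W):
    """Estimate the score the bike might get with an aero cockpit."""
    return score_w - penalty_w


adjusted = {name: adjusted_score(w) for name, w in published_w.items()}
for name, w in adjusted.items():
    print(f"{name}: ~{w} W")
```

With that assumption, both generations land at roughly 203 to 204w.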
I believe the only open-mold frameset they've tested is the Baldiso Air Flight One, which is the Workswell WCB-R-306. It scored 212w with 50mm wheels and a one-piece aero handlebar, so it's a fair bit slower (~8w) than the mainstream aero bikes. There's also the Myvelo Verona, which I assume is an open-mold frameset (not sure), and it had a terrible score of around 222w with Zipp 404s and a one-piece aero handlebar.
Those numbers are also weighted. They take drag numbers at different yaw angles and use an equation to get a single number from those measurements. The equation favours lower yaw angles, but it still doesn't tell the whole story. For example, the S5 and Aeroad have the same result in Tour tests, but if you look at tests that show the yaw/drag graph, you'll see that the S5 is faster at lower yaw angles (no wind or a full headwind) and the Aeroad is faster at higher yaw angles (crosswinds). Even though the equation tries to balance that out, in real-world conditions the S5 should be faster something like 90% of the time.
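The general idea of a yaw-weighted score can be sketched as a weighted average of drag across the tested yaw angles. This is a minimal sketch under loud assumptions: Tour's actual yaw angles, weights and equation aren't given here, so `YAW_WEIGHTS` and both drag curves are made-up illustrative numbers, not S5 or Aeroad data. It shows two bikes tying on the weighted figure while one is ahead across most of the (weighted) riding time.

```python
from fractions import Fraction

# Hypothetical yaw angles (deg) and weights (percent of riding time).
# Illustrative assumption only, NOT Tour's actual weighting.
YAW_DEG = [0, 5, 10, 15, 20]
YAW_WEIGHTS = [40, 30, 15, 10, 5]  # favours low yaw; sums to 100

# Made-up drag curves in watts (not real S5/Aeroad data).
bike_low_yaw = [204, 203, 203, 210, 213]   # faster at low yaw
bike_high_yaw = [205, 204, 204, 205, 206]  # faster at high yaw


def weighted_drag(drag_w, weights=YAW_WEIGHTS):
    """Weighted average drag in watts, as an exact Fraction."""
    return Fraction(sum(w * d for w, d in zip(weights, drag_w)), sum(weights))


def share_faster(a, b, weights=YAW_WEIGHTS):
    """Weighted share of riding time in which bike `a` has less drag than `b`."""
    faster = sum(w for w, da, db in zip(weights, a, b) if da < db)
    return Fraction(faster, sum(weights))


# Per-angle difference (negative = low-yaw bike has less drag).
for yaw, da, db in zip(YAW_DEG, bike_low_yaw, bike_high_yaw):
    print(f"{yaw:>2} deg: {da - db:+d} W")

score_low = weighted_drag(bike_low_yaw)
score_high = weighted_drag(bike_high_yaw)
share = share_faster(bike_low_yaw, bike_high_yaw)
print(f"weighted drag: {float(score_low):.1f} W vs {float(score_high):.1f} W")
print(f"low-yaw bike faster {float(share):.0%} of the time")
```

With these invented numbers both bikes score 204.6 W, yet the low-yaw bike has less drag 85% of the weighted time; the high-yaw bike makes up for it with bigger margins in the rarer crosswind cases. The two metrics answer different questions: average drag versus how often one bike is ahead.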
Not sure I follow your reasoning. The weighting function is based on the real-world yaw distribution, so if one bike has a lower weighted drag than another, it will be faster most of the time in the real world. If one bike is 5w faster than another at 0° yaw, while the other bike is 5w faster only at 10°+ yaw, this will surely be reflected in the former having a lower weighted drag.
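The scenario in the last paragraph can be checked with a small sketch. The yaw weights and drag curves are hypothetical (Tour's real yaw angles and weighting aren't stated in this discussion); the weights simply favour low yaw, which is the premise being argued.

```python
from fractions import Fraction

# Hypothetical weights (percent of riding time) for yaw angles
# 0, 5, 10, 15, 20 deg. Illustrative assumption, not Tour's weighting.
YAW_WEIGHTS = [40, 30, 15, 10, 5]

# Made-up drag curves (watts): bike A is 5 W faster at 0 deg,
# the bikes tie at 5 deg, and bike B is 5 W faster from 10 deg up.
bike_a = [200, 205, 210, 215, 220]
bike_b = [205, 205, 205, 210, 215]


def weighted_drag(drag_w, weights=YAW_WEIGHTS):
    """Weighted average drag in watts, as an exact Fraction."""
    return Fraction(sum(w * d for w, d in zip(weights, drag_w)), sum(weights))


score_a = weighted_drag(bike_a)
score_b = weighted_drag(bike_b)
print(f"bike A: {float(score_a):.1f} W, bike B: {float(score_b):.1f} W")

# A's 5 W edge is weighted by 40 (0 deg); B's by 15 + 10 + 5 = 30 (10 deg+),
# so A ends up 0.5 W lower on the weighted figure.
```

Under these assumed weights bike A comes out 0.5 W lower (205.5 W vs 206.0 W). The outcome hinges on the weights: the two bikes would tie if 0° carried exactly as much weight as 10° and above combined.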