The bot has actually never seen Destroyer's big drones during training even once, so I found it somewhat surprising that it even works as well as it does!
Completely agree that adding something like the "League" used by AlphaStar would be one of the top priorities if you wanted to push this project further. I don't think CodeCraft is sufficiently complex to really allow for several very distinct strategies in the same way as StarCraft II, but I would still expect training against a larger pool of more diverse agents to increase robustness quite a bit.
What amazes me at the end of the day is that brute-forcing seems to do much better than I initially thought it would.
Trying random stuff just sounds stupid but with enough compute and data, I guess it could overpower smart creatures like us.
I agree that CodeCraft is vastly simpler than StarCraft, but the idea is the same: just try random stuff (sometimes with better logic behind it) until something works, and then optimize it to perfection.
That randomness has to be massively constrained, though. Well over 99.9 percent of inputs are guaranteed to lead to bad results. For example, if we're randomly waypointing a drone, we're almost guaranteed not to be sending it somewhere useful.
Thank you for the kind words! I am also quite excited about the new points in game design space that RL will unlock and am planning to write another blog post on that topic.
CodeCraft is a programming game which you can "play" by writing a Scala/Java program that controls the game units. It's not actively developed anymore but still functional: http://codecraftgame.org/
The constraining factor in this case is CPU throughput. The CPU cannot execute instructions fast enough to keep up with the data being read in from memory.
Of course you are right that this is not the whole picture; it is possible to be bottlenecked on neither memory bandwidth nor CPU throughput.
With respect to memory, two additional considerations are read amplification and latency.
Read amplification:
Read amplification basically means that you are reading data but not actually using it.
Suppose we have a table stored in row format with four columns C1, C2, C3, C4 which store 8 byte integers.
Row format means that the values for the columns are interleaved, so the memory layout would look like this, where cn refers to some value for column Cn:

c1 c2 c3 c4 c1 c2 c3 c4 c1 c2 c3 c4 ...
One property of conventional memory is that we can only read memory in blocks of 64 bytes (the size of a cache line).
So if we have a loop that sums all of the values in C1, we will actually also load all of the values for columns C2, C3, and C4.
This means we have a read amplification of factor 4, and our usable bandwidth will be a quarter of the theoretical maximum.
Latency:
Now suppose we don't have any read amplification and make use of all values in each loaded cache line.
Then after we finish processing a cache line, and load the next cache line, we might have to wait a long time (100s of cycles) before the new cache line actually arrives!
So then we would actually be stalled because of memory latency during some parts of program execution.
One of the big advantages of using columnar layout (which stores all values for a column sequentially) is that it all but eliminates read amplification.
Similarly, by accessing data sequentially we make it possible for the CPU to predict what data will be accessed in the future, and have it loaded into the cache before we even request it.
Of course, there are a number of reasons why things might not work out quite so perfectly in practice, and those may be why performance in this benchmark starts to degrade before the (usable) memory bandwidth reaches the theoretical maximum. But as you can see, it's still possible to get very close, so it's safe to assume that queries which read much less data than the maximum bandwidth are constrained on CPU.
Excellent explanation of the advantages of a columnar layout! I'm still a little doubtful about the exact diagnosis, though.
The constraining factor in this case is CPU throughput.
When you looked at it with VTune, did you find clear evidence of this? That is, were you consistently executing 4 instructions per cycle, or were you maxed out at 100 percent utilization for a particular execution port? While it's possible to be at these limits (say if you were hashing each input) in most cases I think it's rare to get to that point when reading from RAM.
Similarly, by accessing data sequentially we make it possible for the CPU to predict what data will be accessed in the future, and have it loaded into the cache before we even request it.
One important caveat to this (at least on Intel) is that the hardware prefetcher does not cross 4KB page boundaries. So if you are doing an 8B read every cycle, and if you are able to do your other processing in the other 3 µops available that cycle, it's taking you ~500 cycles to process each page. And then you hit a ~200 cycle penalty when you finish each page, which means you are taking a ~25% performance hit because of memory latency, and probably have a CPI closer to 3 rather than 4! So if you (or your compiler) haven't already done so, you might see a useful performance boost by adding in explicit prefetch statements (one per cacheline) for the next page ahead.
Anyway, thanks for your response. I like your approach, and I wish you luck in making it even faster!
Have we really progressed memory speeds to the point that reading memory is faster than execution speed on the processor? That just feels wrong.
Your other points make sense. And those would surface as constraints in a similar fashion as a CPU throughput limitation would. I just expect those well before a CPU constraint.
I confess I have not researched CPU speeds in quite a while. Nor RAM speeds. Just surprised to see that the CPU's throughput would ever really be the bottleneck.
> Have we really progressed memory speeds to the point that reading memory is faster than execution speed on the processor? That just feels wrong.
Sort of. Random main memory access speeds aren't there, but L1 is; as long as your memory accesses are sequential, proper coding can stream sequential main memory into L1 and keep the CPU working.
But it's effectively a lost art - the vast majority of programs spend a significant part of their time stalling on memory access.
To stretch the tub metaphor to include latency, you have to imagine someone turning the faucet off and on (or plugging and unplugging the drain) depending on how full the tub is. And, with memory, there may be one spigot (the bandwidth), but a tremendous number of valve knobs (addresses), such that the someone has to decide the correct knob to turn before the tub empties out completely. If the wrong valve is opened, no water comes out (the wrong data is loaded into cache, which is the same as no data being loaded), so merely guessing is no good.
Main memory has enough bandwidth but not enough latency. If your code is written to account for that, then - yes, your CPU will be your bottleneck. If it isn’t then your CPU will be waiting for memory.
DDR4 can do >20GB/s but the latency is >10ns. If you prefetch enough cycles in advance, data will be in L1 by the time you need it. If you don't, it won't.
And that is still surprising to me. Quickly checking the internet, I see [1], which lists the throughput of an i7 as about 25 GB/s. That said, I see the newer specs report a GT/s figure, which is new to me. Those look much lower, though I suspect I'm just not looking at the right specs.
Regardless, I've learned that RAM has gotten much faster than I would expect. Cool to see.
The 25GB/s corresponds to the fastest speed a DDR4 can send memory.
That said, any nontrivial computation is unlikely to keep up with that speed (e.g. 64-bit values will arrive at ~3G values/s, so whatever computation you do has to fit in an amortized 1 cycle on a 3GHz clock, or the CPU is the bottleneck)
I'm curious how much of a contributing factor this is to why many deep learning pipelines use smaller floats. Specifically, I "knew" that the throughput of some math operations was actually greater than 1 per cycle. That said, I have not kept up with which operations. I would not have expected the floating point units to be the fast ones.
Again, thanks for sticking with me in this thread. I learned a bit. Hoping to come up with ways to play with this knowledge. :)
I don't think it's bytewise (just like nobody does RAID3), but, yes, they can and do. The keyword you're looking for is "interleaving".
I think it's also not even by DIMM slot but by channel, so if the channels are unevenly populated (e.g. 2 slots filled on one channel but only 1 on another), interleaving is disabled.
> The constraining factor in this case is CPU throughput. The CPU cannot execute instructions fast enough to keep up with the data being read in from memory.
If you take any basic computer architecture class, you'd know that this is not the case, unless there have been breakthroughs within the past 5 years I have not read about. DRAM is at least 10x slower than the CPU, unless your prefetcher is so accurate it can feed the i-cache/d-cache without any stalls on the CPU side.
CodeCombat is aimed at people with no, or very little, programming experience.
My game will be targeted towards people who already have at least basic programming knowledge, and are looking for a less restrictive environment.
In that respect, it will resemble CodeCombat's zero sum mode[1].
I am hoping that the focus on slightly more experienced programmers will allow me to create a game with more strategic depth.
You raise a really good point about thinking through the game mechanics before diving into the coding.
As it happens, I have actually spent a lot of time on this and even implemented something similar some years ago.
I probably could have made that clearer in the blog post, but I mainly wanted to give a concise overview of my goals and plans for this project.
My next updates should have a bit more substance :)
I will enthusiastically read and consider any progress you post.
Even if I think there's a low probability that the project will be completed as you describe, don't be discouraged. Any progress made in any direction will be valuable experience, and enjoyable. I have started more than one project that I couldn't finish and I've never regretted any of them in hindsight.
Don't be afraid to fail publicly, by which I mean post about absolutely everything you've done, no matter how unfinished it is, it'll help everyone keep enthusiastic and engaged. If you complete the project, you'll have a valuable log, if not you'll be able to see both 1) what went wrong and 2) the sub-victories you had on the way.
I have never used Clojure myself, so I don't know too much about it.
Can it easily call into Java libraries?
If so, Clojure should work as well.