The construct you're implementing is a counting lock, or counting semaphore. Use one from a library that has a trylock version instead of rolling your own, so you can get optimized OS-supported sleep / wakeup. Or do you always have useful work to do if the trylock (aka take) fails?
Note that you can avoid using any traditional locks while implementing your own, but "lock free" is a technical term with a different meaning than lockless. A consumer almost by definition can't be lock-free in the computer-science sense, because it can be stuck waiting forever if the producer thread(s) are blocked. Related: Lock-free Progress Guarantees
CAS is good. Just make sure that your function doesn't run compare_exchange_weak at all if it sees the counter already 0 with a pure load (typically with memory_order_relaxed). You don't want your CPU hammering on the location while other threads are trying to increment it so your thread will see a non-zero value.
Another option is a signed counter, and change the compare to >= 0. Check for overshoot in the result of fetch_add(-1), and if so correct it. (Threads that see the counter as temporarily negative just see it "locked"). But this is typically not better than a CAS retry loop; failed CAS is rare unless contention is very high. And extra atomic ops to correct overshoot will probably cost about as much (or more) than CAS retries.