> If it's not properly handled, it's also likely not properly logged Then you're...

zzyzxd · on July 11, 2024

> Verbose logs are simply a pain in the arse, unless you have a massive processing system. but even then it just either kneecaps your observation window, or makes your queries take ages.

which is why this blog post brags about their capability. Technologies advances, and something difficult to do today may not be as difficult tomorrow. If your logging infra is overwhelmed, by all means drop some data and protect the system. But if Binance is happily storing and querying their 100PB logs now, that's their choice and it's totally fine. I won't say they are doing anything wrong. Again, we are talking about blue moon scenarios here, which is all about hedging risks and uncertainties. It's fine if Netflix drops a few frames of pictures in a movie, but my bank can't drop my transaction.

Wingy · on July 11, 2024

How about only save the verbose logs if there’s an error?

chhabraamit · on July 11, 2024

yup, nice idea. keep collecting logs in a flow and only log when there is an error. Or

Start logging in a buffer and only flush when there is an error.

growse · on July 13, 2024

I think this works well if you think about sampling traces not logs.

Basically, every log message should be attached to a trace. Then, you might choose to throw away the trace data based on criteria, e.g. throw away 98% of "successful" traces, and 0% of "error" traces.

The (admittedly not particularly hard) challenge then is building the infra that knows how to essentially make one buffer per trace, and keep/discard collections of related logs as required.

mnutt · on July 11, 2024

It sounds nice, but also consider: 1) depending on how your app crashes, are you sure the buffer will be flushed, and 2) if logging is expensive from a performance perspective, your base performance profile may be operating under the assumption that you’re humming along not logging anything. Some errors may beget more errors and have a snowball effect.

yepitsthat · on July 12, 2024

Both solved by having a sidecar (think of as a local ingestion point) that records everything (no waiting for flush on error), and then does tail sampling on the spans where status is non OK - i.e. everything thats non OK gets sent to Datadog, Baselime, your Grafana setup, your custom Clickhouse 100PB storage nodes. Or take your pick of any of 1000+ OpenTelemetry compatible providers. https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...

Pattern is the ~same.

yepitsthat · on July 12, 2024

You're nearly there. Tail sampling on non OK states.

https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...