OpenTelemetry 101
History
Metrics ruled the world
Time-series databases for metrics: InfluxDB and Prometheus
Logs: Logstash
OpenTracing & OpenCensus
Need for one standard to rule them all: OpenTelemetry (the merger of OpenTracing and OpenCensus)
What is OpenTelemetry? An open standard for emitting telemetry signals
Protocol (OTLP) - the most important aspect
- Allows our application to work without knowing where to send data - no vendor or backend knowledge required
- Defines the data model and API contract for the various signals
Many vendors have come out in the observability space because of the stable, open source protocol
SDKs - instrument with open source libraries
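A minimal sketch of that decoupling with the Python SDK - the endpoint below (a local Collector) is an assumed example; the app only needs "speak OTLP to this address", no vendor knowledge:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The app only knows "send OTLP here"; whatever backend sits behind the
# endpoint can change freely. localhost:4317 is an assumed local Collector.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```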
Core Signals
- Traces, Metrics, Logs, Profiles, RUM
- Profiles - Protocol finalized, SDKs coming
- RUM - Real User Monitoring (in design)
Trace
- Debugging
- Complex system overview
- Trace waterfalls, service graphs
- Focus on causality - something happened because of something else
- Focus on performance
- High cardinality and high dimensionality
Span
- We do not create a trace; we create spans. A group of spans sharing the same trace ID forms a trace.
- A span is basically a structured blob of data:
- Unique ID (SpanId)
- Correlation ID (TraceId)
- Start time, end time
- Causality ID (ParentSpanId) - allows us to link all the spans together
- Attributes - semantic-convention names plus custom keys based on the use case
- Can be called fancy logs
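A sketch of creating one such blob with the Python API (the tracer name and the order.id attribute are hypothetical):

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout")  # hypothetical instrumentation-scope name

with tracer.start_as_current_span("charge-card") as span:
    # Attributes: semantic-convention names plus custom keys.
    span.set_attribute("payment.method", "card")
    span.set_attribute("order.id", "o-123")  # hypothetical custom attribute

    # Start/end times are recorded automatically by the context manager.
    ctx = span.get_span_context()
    print(f"trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x}")
```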
Log
- Point-in-time data
- Designed for humans (message templates)
- Ideal for local debugging
- Useful for
  - Startup and crashes
  - Debugging the tracing setup itself
- Easy to do badly
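A quick sketch of the message-template point using Python's stdlib logging (names and values are made up); the last line is the easy-to-do-badly version:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")  # hypothetical component name

order_id, user_id, elapsed_ms = "o-123", "u-42", 87  # sample values

# Good: the template stays constant (greppable, groupable); data rides separately.
logger.info("order %s placed by user %s in %dms", order_id, user_id, elapsed_ms)

# Easy to do badly: interpolation bakes values into the message, so every
# log line becomes unique and the template is lost.
logger.info(f"order {order_id} placed by user {user_id} in {elapsed_ms}ms")
```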
Metric
- Metric vs. metric
  - Metric (the signal) - a time-series aggregation with labels - the data point in OTel - bucketing/aggregation
  - metric (the measurement) - anything we can measure
- Relatively cheap to store and query
- Lacks deep context
- Low cardinality/dimensionality
- If we know the dimensions upfront - that is, we know what we want to measure - metrics are a good fit
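A sketch with the OTel Python metrics API - instrument names and label values are hypothetical; note every label has a small, known set of values:

```python
from opentelemetry import metrics

meter = metrics.get_meter("checkout")  # hypothetical meter name

# Counter with low-cardinality labels that are known upfront.
orders = meter.create_counter("orders.placed", unit="1", description="Orders placed")
orders.add(1, {"payment.method": "card", "region": "eu-west-1"})

# Histogram: measurements are bucketed/aggregated rather than stored individually.
latency = meter.create_histogram("checkout.duration", unit="ms")
latency.record(42.0, {"payment.method": "card"})
```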
What is Propagation?
- The real magic - how different things connect together - the idea of correlation
- Transmits state via the W3C Trace Context headers: TraceId, ParentSpanId, and sampling flags
- Baggage (W3C Baggage) - the footgun we never wanted
  - Carries additional context between services, and is propagated on all downstream calls - even into third-party SDKs
  - Not multi-span attributes: baggage does not appear on every span
  - Should generally be avoided
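In Python, propagation at a service boundary is one call (the URL and tracer name below are placeholders; most HTTP instrumentations do this for you):

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("frontend")  # hypothetical scope name

with tracer.start_as_current_span("call-backend"):
    headers = {}
    inject(headers)  # writes the W3C traceparent header (and baggage, if any is set)
    # traceparent format: 00-<trace-id>-<parent-span-id>-<trace-flags>
    requests.get("http://backend.internal/api", headers=headers)  # placeholder URL
```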
How does it work?
Auto-instrumentation - agent-based, similar to APM agents
- Codeless: configured via environment variables, sideloaded into the app
- Good for getting started quickly
- Becomes quite verbose, and it is difficult to tweak the data or add more context/attributes - you can't do what makes OTel amazing
- Storing everything it emits gets expensive
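With the Python agent, for instance, the zero-code setup looks roughly like this (service name and endpoint are placeholders; the OTEL_* variables are standard SDK configuration):

```bash
export OTEL_SERVICE_NAME=checkout-service          # placeholder name
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_EXPORTER=otlp
opentelemetry-instrument python app.py             # sideloads instrumentation, no code changes
```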
Coded instrumentation - targeted instrumentation with full control
- Decide which spans and which context would be useful, and keep only those
- Being intentional about observing what is important to you
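One intentional pattern, sketched in Python: instead of adding more spans, enrich the span that is already active (e.g., one opened by a framework instrumentation) with just the business context you care about (function and attribute names are hypothetical):

```python
from opentelemetry import trace

def apply_discount(order: dict, code: str) -> None:
    # Attach business context to whatever span is currently active,
    # rather than creating additional noisy spans.
    span = trace.get_current_span()
    span.set_attribute("discount.code", code)          # hypothetical keys
    span.set_attribute("order.total", float(order["total"]))
```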
What is the Collector?
OTel Collector can be deployed as both an agent on a host and a service for collecting, transforming, and exporting telemetry data. The OTel Collector is a vendor-agnostic proxy that can receive telemetry data in multiple formats, transform and process it, and export it in multiple formats to be consumed by multiple backends (such as Jaeger, Prometheus, other open source backends, and many proprietary backends).
Receivers: push- or pull-based components for getting data into the Collector.
Processors: responsible for transforming and filtering data.
Exporters: push- or pull-based components for sending data out to backends.
All-powerful - nobody should run without it.
- Collectors are deployed as a proxy: apps send data to the Collector, which then forwards it to the backend. Everything sends data to the Collector first.
- Can fan out to multiple different backends - useful for multi-vendor management or evaluation.
- Config/exporter centralization can be achieved.
- No need to give backend API keys to applications (the API keys live with the Collector).
- Use the Collector to secure egress. It has redaction and filtering capabilities, plus centralized enrichment: details like pod, node, and cluster are added automatically. Happy security team, happy developers.
- Treat the Collector like an actual application: monitor it and make it autoscale.
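A minimal, hypothetical Collector config tying these pieces together - receive OTLP, redact centrally, fan out to two backends (endpoints, keys, and the redacted attribute are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  attributes/redact:
    actions:
      - key: user.email        # centralized redaction example
        action: delete

exporters:
  otlp/vendor-a:
    endpoint: vendor-a.example.com:4317
    headers:
      api-key: ${env:VENDOR_A_API_KEY}   # keys stay with the Collector, not the apps
  otlp/vendor-b:
    endpoint: vendor-b.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/redact, batch]
      exporters: [otlp/vendor-a, otlp/vendor-b]
```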
What is Sampling?
There is a cost crisis in observability if we try to store all tracing data. The goal is to keep costs low while preserving debugging power.
Head sampling (in the application) - the decision is made at the start of the trace, with access to only limited information about the span context (sketch below).
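Head sampling is configured in the SDK; a Python sketch (the 10% ratio is an arbitrary example value):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Decide at span creation, honoring the parent's decision when there is one.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```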
Tail sampling (in the Collector) - everyone should do this at scale
- Runs after a delay following the first span of the trace; part of the Collector
- Has access to all spans in the trace - errors and retries help drive the decision
- Delays sending spans to the backend and requires applications to send all spans
- Spans crossing availability zones make it super expensive
(A config sketch follows at the end of this section.)
- Retain context
- Reduce storage/ingest
With OTel, you should continuously evaluate what to store and what not to store - what to sample and what not to sample.
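The tail-sampling sketch promised above, using the contrib tail_sampling processor (all values illustrative): keep error traces and slow traces, plus a 10% probabilistic baseline:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # wait for the rest of the trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```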