Introduction

In the rapidly evolving landscape of modern networks, simply knowing that a device is “up” is no longer sufficient. Network engineers leveraging NetDevOps principles require deep, real-time insights into network state, performance, and behavior to proactively identify issues, optimize resources, and ensure application experience. This is where network monitoring, observability, and telemetry become paramount.

This chapter delves into the critical role of modern monitoring and observability in a NetDevOps ecosystem. We’ll explore the shift from traditional pull-based monitoring (like SNMP and Syslog) to advanced push-based streaming telemetry using protocols such as NETCONF, RESTCONF, gRPC, and gNMI, alongside standardized data models like YANG and OpenConfig. You’ll learn how to implement and automate these solutions across multi-vendor networks using Ansible and Python, integrating them into comprehensive observability platforms.

After completing this chapter, you will be able to:

  • Differentiate between traditional monitoring, modern telemetry, and network observability.
  • Understand the architecture and benefits of streaming telemetry using NETCONF/YANG, RESTCONF, gRPC, and gNMI.
  • Configure and verify streaming telemetry subscriptions on Cisco, Juniper, and Arista devices.
  • Utilize Ansible and Python to automate the configuration and collection of telemetry data.
  • Design a basic network observability architecture incorporating collectors, time-series databases, and visualization tools.
  • Identify and mitigate security risks associated with advanced monitoring solutions.
  • Apply best practices for performance optimization and troubleshooting in telemetry-driven environments.

Technical Concepts

The journey from traditional monitoring to full network observability involves a fundamental shift in how network state information is collected, processed, and analyzed.

Traditional Monitoring vs. Modern Observability

Traditional Monitoring often relies on a “pull” model, where a monitoring system periodically queries network devices for specific metrics. Key technologies include:

  • SNMP (Simple Network Management Protocol): A widely used application-layer protocol for managing and monitoring network devices. It uses agents on devices to collect data and a manager to query them. While ubiquitous, it can be chatty, less granular, and often lacks real-time capabilities. (Refer to RFC 3411-3418 for SNMPv3 standards).
  • Syslog: A standard for message logging, allowing network devices to send event notifications (e.g., link up/down, error messages) to a central server. Excellent for event correlation but doesn’t provide granular metric data. (Refer to RFC 5424 for Syslog Protocol).
  • NetFlow/IPFIX (IP Flow Information Export): Provides data on IP traffic flows, enabling analysis of traffic patterns, bandwidth usage, and security incidents. It’s flow-based, not packet-based, offering aggregates. IPFIX (RFC 7011) is the IETF standard derived from NetFlow v9.

Modern Telemetry and Observability adopt a “push” model, where network devices actively stream highly granular, structured data to collectors in near real-time. This shift is driven by:

  • Structured Data: Using data models like YANG for consistent, machine-readable data.
  • High Granularity: Sub-second data collection, crucial for dynamic network behavior.
  • Real-time Insights: Enables faster detection and response to anomalies.
  • Reduced Polling Overhead: Devices push data when changes occur or at set intervals.
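The practical difference can be sketched in a few lines of Python: a toy device that notifies subscribers only when a counter actually changes, next to a poller that samples on a fixed interval regardless. All names here are illustrative, not a real telemetry API.

```python
class Device:
    """Toy device: one counter, plus push-style change notification."""
    def __init__(self):
        self.in_octets = 0
        self._subscribers = []

    def subscribe(self, callback):
        # Push model: the collector registers once; the device
        # calls back only when the counter changes.
        self._subscribers.append(callback)

    def receive_traffic(self, octets):
        if octets == 0:
            return  # nothing changed, nothing pushed
        self.in_octets += octets
        for notify in self._subscribers:
            notify({"path": "/interfaces/interface/state/counters/in-octets",
                    "value": self.in_octets})

device = Device()
pushed = []
device.subscribe(pushed.append)       # push: updates arrive as they happen

polled = []
for octets in (1500, 0, 64, 0):       # pull: sample every "interval",
    device.receive_traffic(octets)    # even when nothing changed
    polled.append(device.in_octets)

print(pushed)   # only the two actual changes
print(polled)   # four samples, two of them redundant
```

The poller transfers the same value twice; the subscriber sees each change exactly once — the essence of the reduced-overhead argument above.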

Observability goes beyond mere monitoring. While monitoring tells you if a system is working, observability helps you understand why it’s not working, or why its performance has changed. It involves collecting diverse data types (metrics, logs, traces) to build a comprehensive understanding of system behavior from external outputs.

Network Observability Architecture

A typical network observability architecture consists of several key components:

  1. Network Devices: The source of telemetry data.
  2. Telemetry Agents: Software running on devices (or built-in) responsible for collecting raw data and formatting it according to a data model.
  3. Telemetry Collectors: Software systems that receive, parse, and often buffer the high volume of streaming data from multiple devices. Examples include Telegraf, OpenNMS, and custom Python scripts.
  4. Time-Series Database (TSDB): Optimized for storing time-stamped data, allowing efficient querying and analysis of metrics over time. Examples include Prometheus, InfluxDB, VictoriaMetrics.
  5. Data Processing & Analytics: Tools that can enrich, filter, aggregate, and analyze the collected data.
  6. Visualization & Alerting: Dashboards (e.g., Grafana) to visualize trends and anomalies, and alerting mechanisms to notify engineers of critical events.

Let’s visualize this architecture:

@startuml
skinparam handwritten true
skinparam style strict

cloud "Internet/WAN" as WAN

package "Network Infrastructure" {
  node "Core Router 1 (Cisco IOS XE)" as CR1
  node "Aggregation Switch 1 (Juniper JunOS)" as AS1
  node "Leaf Switch 1 (Arista EOS)" as LS1
}

package "Observability Platform" {
  cloud "Telemetry Collectors" as Collectors {
    component "gRPC Collector (e.g., Telegraf)" as GRPCC
    component "NETCONF/RESTCONF Listener (e.g., Python app)" as NETCONFC
    component "SNMP Manager" as SNMPM
    component "Syslog Server" as SYSLOGS
  }

  database "Time-Series Database (TSDB)" as TSDB {
    folder "Prometheus"
    folder "InfluxDB"
  }

  node "Visualization & Alerting" as Viz {
    artifact "Grafana Dashboards"
    artifact "Alert Manager"
  }
}

CR1 -[hidden] AS1
AS1 -[hidden] LS1

CR1 -up-> GRPCC : gRPC Streaming Telemetry (YANG)
AS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)
LS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)

CR1 -up-> NETCONFC : NETCONF/RESTCONF (YANG)
AS1 -up-> NETCONFC : NETCONF (YANG)
LS1 -up-> NETCONFC : RESTCONF (YANG/eAPI)

CR1 --> SNMPM : SNMP Traps/Polls
AS1 --> SNMPM : SNMP Traps/Polls
LS1 --> SNMPM : SNMP Traps/Polls

CR1 --> SYSLOGS : Syslog Events
AS1 --> SYSLOGS : Syslog Events
LS1 --> SYSLOGS : Syslog Events

GRPCC --> TSDB : Store Metrics
NETCONFC --> TSDB : Store Metrics
SNMPM --> TSDB : Store Metrics
SYSLOGS --> TSDB : Store Logs/Metrics

TSDB --> Viz : Query Data

Viz .down.> "Network Operations Center (NOC)" as NOC : Alerts/Dashboards

@enduml

Streaming Telemetry Protocols

Streaming telemetry relies on modern, standardized protocols for efficient, structured data transfer.

NETCONF/YANG

  • NETCONF (Network Configuration Protocol): An XML-based protocol designed for configuring and managing network devices. While its primary role is configuration, it can also be used to retrieve operational state data. It operates over secure transport mechanisms like SSH or TLS.
    • RFC 6241: Network Configuration Protocol (NETCONF)
  • YANG (Yet Another Next Generation): A data modeling language used to define the structure and content of configuration and state data for network devices. YANG models provide a formal, machine-readable schema for both configuration and operational data, enabling multi-vendor interoperability.
    • RFC 7950: The YANG 1.1 Data Modeling Language

NETCONF can be used for “pulling” operational state data from devices, similar to SNMP, but with the advantage of structured YANG data.
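To make that structured pull concrete, the sketch below builds the `<get>` RPC a NETCONF client would send over SSH (port 830), using a subtree filter for IETF interface state. In practice a client library such as ncclient manages the session and framing; the filter shown is just one illustrative model path.

```python
import xml.etree.ElementTree as ET

NC_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_get_rpc(message_id: str, filter_xml: str) -> str:
    """Build a NETCONF <get> RPC carrying a subtree filter.

    The device returns only state data matching the filter,
    structured according to its YANG models."""
    rpc = ET.Element(f"{{{NC_NS}}}rpc", attrib={"message-id": message_id})
    get = ET.SubElement(rpc, f"{{{NC_NS}}}get")
    filt = ET.SubElement(get, f"{{{NC_NS}}}filter", attrib={"type": "subtree"})
    filt.append(ET.fromstring(filter_xml))
    return ET.tostring(rpc, encoding="unicode")

# Subtree filter: operational state from the ietf-interfaces model
filter_xml = (
    '<interfaces-state '
    'xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces"/>'
)

rpc = build_get_rpc("101", filter_xml)
print(rpc)
```

The reply comes back as XML validated against the same YANG schema, which is what distinguishes this pull from an SNMP walk over loosely typed OIDs.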

RESTCONF/YANG

  • RESTCONF: A REST-like protocol that uses HTTP(S) to provide a programmatic interface for interacting with network devices. It exposes the YANG data model as a resource tree, allowing clients to perform CRUD (Create, Read, Update, Delete) operations.
    • RFC 8040: RESTCONF Protocol

RESTCONF offers a more web-friendly approach to access YANG-modeled data, which can be useful for integration with web applications and scripting.
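To illustrate the resource-tree mapping, this sketch constructs the RESTCONF URL and media-type header for reading one interface's OpenConfig counters. The device address, credentials, and interface name are placeholders, and the actual HTTP call is left commented since it needs a reachable device.

```python
from urllib.parse import quote

def restconf_url(host: str, yang_path: str) -> str:
    """Map a YANG data path to its RESTCONF resource URL (RFC 8040).

    List keys follow '=' in the path segment, and reserved
    characters in key values must be percent-encoded."""
    return f"https://{host}/restconf/data/{yang_path}"

# Read operational counters for one interface via the OpenConfig model;
# 'GigabitEthernet1' is the key of the /interfaces/interface list.
host = "192.168.10.1"
path = (
    "openconfig-interfaces:interfaces/interface="
    + quote("GigabitEthernet1", safe="")
    + "/state/counters"
)
url = restconf_url(host, path)
print(url)

# RESTCONF uses YANG-specific media types:
headers = {"Accept": "application/yang-data+json"}

# A real query (requires the 'requests' package and a reachable device):
# import requests
# resp = requests.get(url, headers=headers, auth=("admin", "password"))
# print(resp.json())
```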

gRPC and gNMI

  • gRPC (Google Remote Procedure Call): A high-performance, open-source RPC framework that can run in any environment. It uses Protocol Buffers (Protobuf) as its Interface Definition Language (IDL) for defining service methods and message structures. gRPC is efficient due to its binary message format and use of HTTP/2.
  • gNMI (gRPC Network Management Interface): A specification developed under the OpenConfig initiative (led by Google) that defines a gRPC-based service for network management, including streaming telemetry. It allows clients to subscribe to specific data paths (defined by YANG/OpenConfig) and receive updates.

gRPC and gNMI are the preferred methods for high-volume, low-latency streaming telemetry due to their efficiency.
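The efficiency argument can be made tangible with a toy comparison: the same counter update encoded as JSON text versus a packed binary record. The binary layout here is a stand-in for the Protobuf wire format (which uses varints and field tags rather than fixed-width fields), but the size gap it demonstrates is the same in kind.

```python
import json
import struct

# One interface-counter update, as a collector might receive it
update = {"timestamp": 1700000000000000000,
          "path": "in-octets",
          "value": 184512}

# Text encoding (JSON, as NETCONF/RESTCONF-style APIs would carry it)
as_json = json.dumps(update).encode()

# Binary encoding: 8-byte timestamp, 9-byte path, 8-byte counter
# (fixed-width fields standing in for the Protobuf wire format)
as_binary = struct.pack("!Q9sQ", update["timestamp"],
                        update["path"].encode(), update["value"])

print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes packed binary")
```

At sub-second sample intervals across thousands of interfaces, this per-update difference, multiplied out, is why gNMI collectors usually request Protobuf (GPB) encoding.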

Protocol Flow: gRPC Streaming Telemetry

digraph gRPC_Telemetry {
    rankdir=LR;
    node [shape=box];

    Client [label="Telemetry Collector (gNMI Client)"];
    Device [label="Network Device (gNMI Server)"];

    subgraph cluster_0 {
        label="Subscription Request (Client to Device)";
        style=filled;
        color=lightgrey;
        Client -> Device [label="Establish gRPC Channel\n(TLS Encrypted)"];
        Device -> Client [label="Channel Acknowledged"];
        Client -> Device [label="gNMI::SubscribeRequest\n(Path, Mode: periodic/on-change)"];
    }

    subgraph cluster_1 {
        label="Data Stream (Device to Client)";
        style=filled;
        color=lightblue;
        Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
        Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
        Device -> Client [label="... Continuous Stream ..."];
    }
}

Conceptual Packet Structure: gRPC Telemetry (Simplified)

A gRPC packet, particularly over HTTP/2, is complex. Here’s a simplified view focusing on the payload within the context of a gNMI SubscribeResponse carrying an OpenConfig interface counter update.

packetdiag {
  colwidth = 32
  node_height = 72

  // Offsets below are byte positions within the frame (simplified view)

  // Ethernet Header
  0-5: Destination MAC
  6-11: Source MAC
  12-13: EtherType (0x0800 for IPv4)

  // IP Header (IPv4, no options)
  14: Version (4) / IHL (5)
  15: DSCP / ECN
  16-17: Total Length
  18-19: Identification
  20-21: Flags / Fragment Offset
  22: TTL
  23: Protocol (6 for TCP)
  24-25: Header Checksum
  26-29: Source IP Address
  30-33: Destination IP Address

  // TCP Header
  34-35: Source Port (device ephemeral)
  36-37: Destination Port (e.g., 50051 for gRPC)
  38-41: Sequence Number
  42-45: Acknowledgment Number
  46-47: Data Offset / Reserved / Flags (SYN, ACK, PSH, URG, etc.)
  48-49: Window Size
  50-51: Checksum
  52-53: Urgent Pointer

  // HTTP/2 Frame Header (gRPC multiplexes streams over HTTP/2)
  54-56: Length
  57: Type (e.g., DATA)
  58: Flags
  59-62: Stream Identifier

  // gRPC Message Header
  63: Compressed Flag
  64-67: Message Length

  // gNMI SubscribeResponse (Protobuf encoded, variable length)
  68-83: timestamp (uint64)
  84-99: prefix (Path)
  100-115: update (repeated Update messages)
  116-131:   Path (e.g., /interfaces/interface[name=GigabitEthernet1]/state/counters)
  132-147:   Val (TypedValue: counter value)
}

Data Models (OpenConfig and Vendor-Native YANG)

YANG Data Models are crucial for streaming telemetry. They define the structure, syntax, and semantics of data.

  • Vendor-Native YANG Models: Provided by device vendors (e.g., Cisco, Juniper, Arista) and offer granular access to device-specific features and operational data. Examples: Cisco-IOS-XE-interfaces-oper.yang, juniper-smi.yang. You can explore these on Cisco DevNet’s YANG Suite (developer.cisco.com/yangsuite).
  • OpenConfig: An industry-wide initiative to define a common set of vendor-neutral YANG data models for network configuration and operational state. Its goal is to provide a unified approach to managing multi-vendor networks. Using OpenConfig models simplifies automation and monitoring across diverse hardware. (Learn more at openconfig.net).

The use of YANG models, especially OpenConfig, is a cornerstone of effective multi-vendor NetDevOps.
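Because YANG-modeled data is a predictable tree, a collector can flatten any update mechanically into metric paths and values. The sketch below does so for a hand-written sample shaped like the openconfig-interfaces counters subtree (the payload is illustrative, not captured device output):

```python
def flatten(prefix: str, node) -> dict:
    """Recursively flatten a YANG-modeled JSON subtree into
    {path: leaf_value} metrics, e.g.
    '/interfaces/interface[name=Gi1]/state/counters/in-octets': 184512."""
    metrics = {}
    if isinstance(node, dict):
        for key, child in node.items():
            metrics.update(flatten(f"{prefix}/{key}", child))
    else:
        metrics[prefix] = node
    return metrics

# Hand-written sample shaped like openconfig-interfaces counters
sample = {
    "state": {
        "counters": {
            "in-octets": 184512,
            "out-octets": 93321,
            "in-errors": 0,
        }
    }
}

metrics = flatten("/interfaces/interface[name=Gi1]", sample)
for path, value in sorted(metrics.items()):
    print(path, value)
```

The same function works unchanged against any vendor that emits OpenConfig-shaped payloads, which is precisely the multi-vendor payoff described above.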

Collector Architectures

Collectors are vital for handling the ingestion of telemetry data. They typically perform several functions:

  • Ingestion: Receive data via gRPC, NETCONF, SNMP, etc.
  • Parsing: Decode Protobuf/JSON/XML payloads into usable metrics.
  • Tagging/Labeling: Add metadata (e.g., device hostname, interface name) to metrics for easier querying.
  • Buffering: Temporarily store data before writing to a TSDB.
  • Forwarding: Send processed metrics to a TSDB.

Popular open-source collector solutions include:

  • Telegraf: A plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors to various output plugins (including Prometheus, InfluxDB). Excellent for gRPC telemetry.
  • Prometheus Exporters: Prometheus scrapes metrics over HTTP; while Node Exporter targets host metrics, other exporters (e.g., for SNMP) expose network-device data in Prometheus format.
  • Custom Python Applications: For highly specific use cases, a Python script can act as a gNMI client to subscribe, parse, and store data.
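As a sketch of the tagging and forwarding steps listed above, the function below renders one parsed update in InfluxDB line protocol, the write format used by Telegraf and InfluxDB (measurement and tag names here are illustrative choices, not a fixed schema):

```python
def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Render one metric in InfluxDB line protocol:
    measurement,tag=val field=val timestamp
    Commas, spaces, and '=' in tag keys/values must be escaped;
    integer fields carry an 'i' suffix."""
    def esc(s: str) -> str:
        return s.replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")

    tag_str = ",".join(f"{esc(k)}={esc(str(v))}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f"{esc(k)}={v}i" if isinstance(v, int) else f"{esc(k)}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "interface_counters",
    {"device": "CR1", "interface": "GigabitEthernet1"},
    {"in_octets": 184512, "out_octets": 93321},
    1700000000000000000,
)
print(line)
```

Batching many such lines into one HTTP write is the "buffering" function in practice: it amortizes the TSDB write cost across hundreds of metrics.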

Configuration Examples (Multi-vendor)

Here, we’ll demonstrate configuring streaming telemetry (gRPC/gNMI and NETCONF/RESTCONF) on Cisco, Juniper, and Arista devices.

Cisco IOS XE/XR (gRPC Streaming Telemetry)

This example configures a periodic gRPC subscription to stream interface statistics.

! Configure NETCONF/RESTCONF for management access (often a prerequisite for gNMI)
! Enable NETCONF-YANG (runs over SSH, TCP port 830)
netconf-yang

! Enable RESTCONF (served over the secure HTTP server, with local authentication)
restconf
ip http secure-server
ip http authentication local

! Configure gNMI/gRPC telemetry
! Define the telemetry destination (collector)
telemetry ietf
  destination-group TELEMETRY_COLLECTOR
   address 192.168.10.10 port 50051
   protocol grpc tls-enable
   encoding encode-kvgpb
   profile TELEMETRY_PROFILE ! Optional TLS profile if using client certs
  ! For mutual TLS, the TLS profile references a trustpoint holding the
  ! collector's CA certificate so each side can verify the other, e.g.:
  !  trustpoint TELEMETRY_CLIENT_TP
  !  ca-trustpoint TELEMETRY_CA_TP

! Define a sensor group for the data we want to stream (e.g., interface operational state)
sensor-group INTERFACE_OPER_STATE
  ! Use an OpenConfig path for multi-vendor consistency
  ! For Cisco, verify specific YANG paths are supported
  ! Example: openconfig-interfaces:interfaces/interface/state
  ! Example: Cisco-IOS-XE-interfaces-oper:interfaces/interface/state
  ! Path examples:
  ! /interfaces/interface/state/counters
  ! /interfaces/interface[name='GigabitEthernet1']/state/counters
  path openconfig-interfaces:interfaces/interface/state/counters
  
! Define a subscription that links the sensor group to the destination
subscription PERIODIC_INTF_COUNTERS
  sensor-group INTERFACE_OPER_STATE sample-interval 10000 ! 10-second interval
  destination-group TELEMETRY_COLLECTOR
  stream cisco-push
  update-policy periodic

Verification Commands:

show telemetry ietf subscription PERIODIC_INTF_COUNTERS
show telemetry ietf destination-group TELEMETRY_COLLECTOR
show telemetry ietf sensor-group INTERFACE_OPER_STATE
show telemetry ietf connection all

Expected Output (Snippet):

Router# show telemetry ietf subscription PERIODIC_INTF_COUNTERS
Subscription ID: 100
  Type: Dynamic
  State: Enabled
  Source Address: 0.0.0.0
  Source VRF: <default>
  Stream: cisco-push
  Update policy: periodic
  Update interval: 10000 ms
  Sensor Groups:
    Sensor Group: INTERFACE_OPER_STATE (ID: 100)
      Path: openconfig-interfaces:interfaces/interface/state/counters
  Destination Groups: TELEMETRY_COLLECTOR
    Address: 192.168.10.10:50051
    Transport: grpc
    Encoding: encode-kvgpb
    Profile: TELEMETRY_PROFILE
    TLS: Enabled

Router# show telemetry ietf connection all
Telemetry connection 0:
  Peer Address: 192.168.10.10
  Peer Port: 50051
  Local Address: 10.0.0.1
  Local Port: 54321
  State: Connected
  Profile Name: TELEMETRY_PROFILE
  Subscriptions: 100

Juniper JunOS (gRPC Streaming Telemetry)

This example configures gRPC streaming telemetry for interface statistics using OpenConfig models.

# Enable gRPC and specify its listening port
# (prefer 'ssl' with a local certificate over 'clear-text' in production)
set services extension-service request-response grpc clear-text port 50051

# Configure a streaming telemetry sensor (data provider)
set services analytics sensor SENSOR_INTF_STATS
set services analytics sensor SENSOR_INTF_STATS description "Interface Stats"
# Specify the resource paths to collect: a Juniper-native path and an
# OpenConfig path (verify which paths your JunOS release supports)
set services analytics sensor SENSOR_INTF_STATS resource "/junos/system/linecard/interface/"
set services analytics sensor SENSOR_INTF_STATS resource "/interfaces/interface/state/" # OpenConfig path

# Configure a streaming telemetry export profile (collector destination and frequency)
set services analytics export-profile EXPORT_INTF_STATS
set services analytics export-profile EXPORT_INTF_STATS reporting-period 10 # 10 seconds
set services analytics export-profile EXPORT_INTF_STATS format gpb
set services analytics export-profile EXPORT_INTF_STATS transport grpc
set services analytics export-profile EXPORT_INTF_STATS target-address 192.168.10.10
set services analytics export-profile EXPORT_INTF_STATS target-port 50051

# Apply the sensor and export profile to a rule
set services analytics rule RULE_INTF_STATS
set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS
set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS

# Commit the configuration
commit

Verification Commands:

show services analytics status
show services analytics client
show services analytics export-profile EXPORT_INTF_STATS

Expected Output (Snippet):

user@juniper> show services analytics status
Extension service status:
  Current status: Running
  Enabled on: FPC0
  GRPC enabled: Yes, port 50051

user@juniper> show services analytics client
...
Client information:
  Name: EXPORT_INTF_STATS
  Address: 192.168.10.10, Port: 50051
  Protocol: grpc
  Sensor name: SENSOR_INTF_STATS
  Reporting period: 10s
  State: Connected
...

Arista EOS (gRPC Streaming Telemetry)

Arista EOS uses OpenConfig by default for gRPC telemetry.

! Enable eAPI (for RESTCONF-like interaction and generally good practice)
management api http-commands
  no shutdown
  protocol https
  vrf default

! Configure gRPC telemetry
! Define the telemetry receiver (collector)
telemetry
  destination 192.168.10.10:50051
    protocol gRPC
    encoding GPB
    tls profile TELEMETRY_TLS_PROFILE ! Optional TLS profile
    source-interface Management1

! Define a sensor group (path to collect)
  sensor-group INTERFACE_COUNTERS
    path /Sysdb/interface/counters
    path /interfaces/interface/state/statistics ! OpenConfig path for counters

! Define a subscription to push data from the sensor group to the destination
  stream INTERFACE_STREAM
    sensor-group INTERFACE_COUNTERS
    destination 192.168.10.10:50051
    interval 10000 ! 10 seconds

Verification Commands:

show telemetry
show telemetry destination 192.168.10.10:50051
show telemetry stream INTERFACE_STREAM

Expected Output (Snippet):

Arista# show telemetry
Telemetry Receiver State:
  Receiver: 192.168.10.10:50051
    Protocol: gRPC
    Encoding: GPB
    State: Active
    Source-interface: Management1

Telemetry Streams:
  Stream: INTERFACE_STREAM
    Sensor Group: INTERFACE_COUNTERS
    Destination: 192.168.10.10:50051
    Interval: 10000 ms
    Last Push: 00:00:02 ago
    Push Count: 1234
    Status: OK

Network Diagrams

Diagrams are essential for visualizing complex network concepts.

Network Topology: Telemetry Lab Setup (nwdiag)

nwdiag {
  network core_network {
    address = "10.0.0.0/24"
    description = "Core Network Segment"

    CR1 [address = "10.0.0.1"];
    AS1 [address = "10.0.0.2"];
  }

  network mgmt_network {
    address = "192.168.10.0/24"
    description = "Management & Telemetry Network"

    CR1 [address = "192.168.10.1"];
    AS1 [address = "192.168.10.2"];
    LS1 [address = "192.168.10.3"];
    COLLECTOR [address = "192.168.10.10", description = "Telemetry Collector"];
    TSDB [address = "192.168.10.11", description = "Time-Series DB"];
    GRAFANA [address = "192.168.10.12", description = "Grafana / Visualization"];
  }

  // Connections implicitly defined by shared networks
  CR1 -- AS1; // Represents logical connection in core_network
  CR1 -- COLLECTOR; // Represents logical connection in mgmt_network
  AS1 -- COLLECTOR;
  LS1 -- COLLECTOR;
  COLLECTOR -- TSDB;
  TSDB -- GRAFANA;
}

Data Flow for Observability Platform (plantuml)

@startuml
scale 1.5

cloud "Network Devices" as NetDevs {
  component "Cisco IOS XE" as C_DEV
  component "Juniper JunOS" as J_DEV
  component "Arista EOS" as A_DEV
}

rectangle "Telemetry Collection Layer" {
  component "gNMI Collector\n(e.g., Telegraf)" as GNMI_COLLECTOR
  component "SNMP Poller\n(e.g., Prometheus)" as SNMP_POLLER
  component "Syslog Aggregator\n(e.g., Logstash)" as SYSLOG_AGG
}

database "Data Storage" {
  component "Time-Series DB\n(e.g., Prometheus DB, InfluxDB)" as TSDB
  component "Log Storage\n(e.g., Elasticsearch)" as LOG_STORE
}

rectangle "Analysis & Visualization" {
  component "Metrics Dashboards\n(e.g., Grafana)" as GRAFANA
  component "Alerting Engine\n(e.g., Alertmanager)" as ALERTS
  component "Log Analysis\n(e.g., Kibana)" as KIBANA
}

C_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
J_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
A_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)

C_DEV -down-> SNMP_POLLER : SNMPv3
J_DEV -down-> SNMP_POLLER : SNMPv3
A_DEV -down-> SNMP_POLLER : SNMPv3

C_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
J_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
A_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)

GNMI_COLLECTOR -down-> TSDB : Write Metrics
SNMP_POLLER -down-> TSDB : Write Metrics
SYSLOG_AGG -down-> LOG_STORE : Write Logs

TSDB -up-> GRAFANA : Query Metrics
LOG_STORE -up-> KIBANA : Query Logs

GRAFANA -right-> ALERTS : Trigger Alerts
ALERTS -up-> "NetOps Team" : Notifications

@enduml

Automation Examples

Automating the setup of telemetry and the consumption of data is central to NetDevOps.

Python: gNMI Client for Streaming Telemetry

This Python script demonstrates how to subscribe to gNMI telemetry data from a network device using the grpcio library together with Python bindings generated from the public gNMI protobuf definition.

# pip install grpcio grpcio-tools
# The gnmi_pb2 / gnmi_pb2_grpc modules below are generated from the public
# gNMI protobuf definition (github.com/openconfig/gnmi), e.g.:
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. gnmi.proto

import time

import grpc
from google.protobuf import json_format

import gnmi_pb2
import gnmi_pb2_grpc

# Device details
DEVICE_IP = "192.168.10.1"  # IP of your Cisco/Juniper/Arista device
DEVICE_PORT = 50051         # gNMI port, typically 50051 or 57400
USERNAME = "admin"
PASSWORD = "password"

# Path to subscribe to (OpenConfig interface counters)
# Adjust based on your device's supported paths and configured sensor groups
# For Cisco IOS XE: "/openconfig-interfaces:interfaces/interface/state/counters"
# For Juniper JunOS: "/interfaces/interface/state/"
# For Arista EOS: "/interfaces/interface/state/statistics"
GNMI_PATH = "/interfaces/interface/state/statistics"

def build_path(path_str):
    """Convert an xpath-like string into a gNMI Path message, handling
    keyed elements such as interface[name=GigabitEthernet1]."""
    path = gnmi_pb2.Path()
    for p in path_str.strip("/").split("/"):
        elem = path.elem.add()
        if "[" in p and "]" in p:
            # Split the element name from its key-value pair
            name, key_part = p.split("[", 1)
            key_name, key_value = key_part.rstrip("]").split("=", 1)
            elem.name = name
            elem.key[key_name] = key_value
        else:
            elem.name = p
    return path

def stream_telemetry():
    # For simplicity, this example uses an insecure channel; production
    # deployments should use TLS, for example:
    # with open('path/to/server_cert.pem', 'rb') as f:
    #     trusted_certs = f.read()
    # credentials = grpc.ssl_channel_credentials(root_certificates=trusted_certs)
    # channel = grpc.secure_channel(f"{DEVICE_IP}:{DEVICE_PORT}", credentials)

    # Insecure channel (NOT RECOMMENDED FOR PRODUCTION)
    channel = grpc.insecure_channel(f"{DEVICE_IP}:{DEVICE_PORT}")
    stub = gnmi_pb2_grpc.gNMIStub(channel)

    # Most gNMI implementations accept credentials as gRPC metadata
    metadata = [("username", USERNAME), ("password", PASSWORD)]

    subscribe_request = gnmi_pb2.SubscribeRequest()
    subscription_list = subscribe_request.subscribe
    subscription_list.mode = gnmi_pb2.SubscriptionList.STREAM
    subscription_list.encoding = gnmi_pb2.Encoding.JSON_IETF  # or PROTO for protobuf

    # Create a subscription for the target path
    subscription = subscription_list.subscription.add()
    subscription.path.CopyFrom(build_path(GNMI_PATH))
    subscription.mode = gnmi_pb2.SubscriptionMode.SAMPLE
    subscription.sample_interval = 10_000_000_000  # 10 seconds in nanoseconds

    print(f"Subscribing to {GNMI_PATH} on {DEVICE_IP}:{DEVICE_PORT}...")

    try:
        # Subscribe is a bidirectional stream; responses arrive as an iterator
        for response in stub.Subscribe(iter([subscribe_request]), metadata=metadata):
            if response.HasField("update"):
                notification = response.update
                timestamp_ns = notification.timestamp
                timestamp_s = timestamp_ns / 1_000_000_000
                print("\n--- Telemetry Update ---")
                print(f"Timestamp: {time.ctime(timestamp_s)} ({timestamp_ns} ns)")
                if notification.HasField("prefix"):
                    print(f"Prefix: {json_format.MessageToJson(notification.prefix)}")
                for update in notification.update:
                    path = json_format.MessageToJson(update.path)
                    value = json_format.MessageToJson(update.val)
                    print(f"  Path: {path}")
                    print(f"  Value: {value}")
            elif response.HasField("sync_response"):
                print("--- Synchronization complete ---")
            else:
                print(f"Received unknown response: {response}")

    except grpc.RpcError as e:
        print(f"gRPC Error: {e.details()}")
    except KeyboardInterrupt:
        print("Subscription stopped by user.")
    finally:
        channel.close()
        print("gRPC channel closed.")

if __name__ == "__main__":
    stream_telemetry()

Ansible Playbook: Configure Streaming Telemetry

This playbook configures gRPC streaming telemetry on Cisco IOS XE, Juniper JunOS, and Arista EOS devices. It assumes the ansible.netcommon collection and the relevant vendor collections (cisco.ios, junipernetworks.junos, arista.eos) are installed and inventory is set up.

---
- name: Configure Multi-Vendor Streaming Telemetry
  hosts: network_devices
  gather_facts: false
  connection: ansible.netcommon.network_cli

  vars:
    telemetry_collector_ip: "192.168.10.10"
    telemetry_collector_port: 50051
    telemetry_sample_interval_ms: 10000 # 10 seconds

  tasks:
    - name: Ensure NETCONF/RESTCONF is enabled (Cisco IOS XE)
      when: ansible_network_os == 'ios' or ansible_network_os == 'iosxr'
      cisco.ios.ios_config:
        lines:
          - netconf-yang
          - restconf
          - ip http secure-server
          - ip http authentication local
        save_when: modified

    - name: Configure gRPC Streaming Telemetry (Cisco IOS XE)
      when: ansible_network_os == 'ios' or ansible_network_os == 'iosxr'
      cisco.ios.ios_config:
        lines:
          - "telemetry ietf"
          - "  destination-group TELEMETRY_COLLECTOR"
          - "    address {{ telemetry_collector_ip }} port {{ telemetry_collector_port }}"
          - "    protocol grpc tls-enable" # Use 'tls-enable' for production, or 'no tls' for testing
          - "    encoding encode-kvgpb"
          - "  sensor-group INTERFACE_OPER_STATE"
          - "    path openconfig-interfaces:interfaces/interface/state/counters"
          - "  subscription PERIODIC_INTF_COUNTERS"
          - "    sensor-group INTERFACE_OPER_STATE sample-interval {{ telemetry_sample_interval_ms }}"
          - "    destination-group TELEMETRY_COLLECTOR"
          - "    stream cisco-push"
          - "    update-policy periodic"
        save_when: modified

    - name: Configure gRPC Streaming Telemetry (Juniper JunOS)
      when: ansible_network_os == 'junos'
      vars:
        # junos_config requires a NETCONF connection
        ansible_connection: ansible.netcommon.netconf
      junipernetworks.junos.junos_config:
        lines:
          - "set services extension-service request-response grpc clear-text port {{ telemetry_collector_port }}"
          - "set services analytics sensor SENSOR_INTF_STATS resource \"/interfaces/interface/state/\""
          - "set services analytics export-profile EXPORT_INTF_STATS reporting-period {{ telemetry_sample_interval_ms // 1000 }}"
          - "set services analytics export-profile EXPORT_INTF_STATS format gpb"
          - "set services analytics export-profile EXPORT_INTF_STATS transport grpc"
          - "set services analytics export-profile EXPORT_INTF_STATS target-address {{ telemetry_collector_ip }}"
          - "set services analytics export-profile EXPORT_INTF_STATS target-port {{ telemetry_collector_port }}"
          - "set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS"
          - "set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS"

    - name: Configure gRPC Streaming Telemetry (Arista EOS)
      when: ansible_network_os == 'eos'
      arista.eos.eos_config:
        lines:
          - "telemetry"
          - "  destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
          - "    protocol gRPC"
          - "    encoding GPB"
          - "    source-interface Management1" # Adjust as needed
          - "  sensor-group INTERFACE_COUNTERS"
          - "    path /interfaces/interface/state/statistics"
          - "  stream INTERFACE_STREAM"
          - "    sensor-group INTERFACE_COUNTERS"
          - "    destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
          - "    interval {{ telemetry_sample_interval_ms }}"
        save_when: modified

    - name: Verify Telemetry Configuration (Cisco IOS XE)
      when: ansible_network_os == 'ios' or ansible_network_os == 'iosxr'
      cisco.ios.ios_command:
        commands:
          - "show telemetry ietf subscription PERIODIC_INTF_COUNTERS"
          - "show telemetry ietf connection all"
      register: cisco_telemetry_output
      ignore_errors: true # Continue even if command fails
    - ansible.builtin.debug:
        msg: "{{ cisco_telemetry_output.stdout }}"
      when: cisco_telemetry_output.stdout is defined

    - name: Verify Telemetry Configuration (Juniper JunOS)
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_command:
        commands:
          - "show services analytics status"
          - "show services analytics client"
      register: juniper_telemetry_output
      ignore_errors: true
    - ansible.builtin.debug:
        msg: "{{ juniper_telemetry_output.stdout }}"
      when: juniper_telemetry_output.stdout is defined

    - name: Verify Telemetry Configuration (Arista EOS)
      when: ansible_network_os == 'eos'
      arista.eos.eos_command:
        commands:
          - "show telemetry stream INTERFACE_STREAM"
          - "show telemetry destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
      register: arista_telemetry_output
      ignore_errors: true
    - ansible.builtin.debug:
        msg: "{{ arista_telemetry_output.stdout }}"
      when: arista_telemetry_output.stdout is defined

Security Considerations

Network telemetry streams vast amounts of operational data, making their security paramount. Compromised telemetry can lead to:

  • Data Exposure: Sensitive network topology, performance, and traffic data falling into the wrong hands.
  • System Manipulation: If telemetry agents or protocols have configuration capabilities, a breach could allow unauthorized configuration changes.
  • Denial of Service (DoS): An attacker could overwhelm telemetry collectors with fabricated data or exhaust device resources by triggering excessive data streaming.

Attack Vectors and Mitigation Strategies

| Attack Vector | Mitigation Strategies |
| --- | --- |
| Unauthorized Data Access | Authentication: use strong authentication (client certificates for gRPC/gNMI, AAA with NETCONF/RESTCONF, SNMPv3 with authPriv). Authorization: implement granular access control (RBAC) on telemetry paths and data streams. Encryption: always use TLS/SSL for gRPC, NETCONF, RESTCONF, and secure Syslog. |
| Tampering with Telemetry Data | Integrity: TLS/SSL provides data integrity checks; use digital signatures where possible. Secure sources: ensure the network devices sending telemetry are themselves secured and not compromised. |
| DoS on Collector/Device | Rate limiting: implement rate limits on telemetry streams at the device if supported. Collector scaling: design collectors for horizontal scalability and redundancy. Network segmentation: isolate telemetry traffic on dedicated management networks/VLANs. Input validation: collectors should validate incoming data to prevent parsing malformed packets. |
| Compromised Monitoring Infrastructure | Hardening: securely configure operating systems, databases, and applications in the observability stack. Least privilege: run collector services with minimal necessary permissions. Vulnerability management: regularly patch and scan all components. |
| Replay Attacks | Timestamps and nonces: protocols like gRPC incorporate timestamps and request/response matching to prevent replay. TLS session keys: use fresh session keys for each connection. |

Security Best Practices

  • Encrypt All Telemetry: Always use TLS/SSL for streaming telemetry (gRPC, NETCONF over SSH/TLS, RESTCONF over HTTPS). Never transmit sensitive data in plain text.
  • Strong Authentication and Authorization: Implement multi-factor authentication for management interfaces. Use client certificates for gRPC authentication. Employ AAA for programmatic access to devices.
  • Dedicated Management Network: Isolate telemetry traffic on a separate management network or VPN to reduce the attack surface.
  • Principle of Least Privilege: Configure telemetry subscriptions to send only the data absolutely necessary. Limit access to monitoring tools and dashboards.
  • Regular Auditing and Logging: Audit telemetry configurations and logs for suspicious activity. Ensure collectors log their own activity.
  • Software Supply Chain Security: Use trusted sources for libraries and tools (e.g., the pygnmi Python library, Telegraf plugins).
  • Secure API Keys/Credentials: Store API keys and credentials for automation (Ansible, Python) securely using vaults or secret management systems.
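As one concrete approach to the last point, Ansible Vault can keep device and gNMI credentials out of plain-text inventory files. A sketch (the variable and playbook names are illustrative):

```shell
# Encrypt the gNMI password once; paste the output into group_vars/all.yml
ansible-vault encrypt_string 'S3cureP@ss' --name 'gnmi_password'

# Reference it like any other variable in the playbook:
#   password: "{{ gnmi_password }}"
# and supply the vault key at runtime:
ansible-playbook configure_telemetry.yml --ask-vault-pass
```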

Security Configuration Example (Cisco IOS XE - TLS for gRPC)

To enable TLS for gRPC streaming, you typically need to set up a Public Key Infrastructure (PKI) and trustpoints on the device and ensure your collector also presents a trusted certificate.

! Create a crypto key pair for the device
crypto key generate rsa label MY_TELEMETRY_KEY modulus 2048

! Define a trustpoint for the Certificate Authority (CA) that signed your collector's certificate
crypto pki trustpoint TELEMETRY_CA_TP
  enrollment terminal
  revocation-check none
  usage ipsec ikev2 dot1x aaa web-auth tls
  ! You would paste the CA certificate here after "enrollment terminal"
  ! Example:
  ! certificate chain
  !  -----BEGIN CERTIFICATE-----
  !  ... CA Certificate Data ...
  !  -----END CERTIFICATE-----
  ! quit

! Define a trustpoint for the device's own identity certificate (signed by your internal CA)
crypto pki trustpoint TELEMETRY_DEVICE_IDENTITY_TP
  enrollment terminal
  revocation-check none
  usage ipsec ikev2 dot1x aaa web-auth tls
  rsakeypair MY_TELEMETRY_KEY
  ! You would paste the device's certificate here
  ! Example:
  ! certificate chain
  !  -----BEGIN CERTIFICATE-----
  !  ... Device Certificate Data ...
  !  -----END CERTIFICATE-----
  ! quit

! Link these to the telemetry profile
telemetry ietf
  destination-group TELEMETRY_COLLECTOR
    address 192.168.10.10 port 50051
    protocol grpc tls-enable
    encoding encode-kvgpb
    profile TELEMETRY_TLS_PROFILE
  profile TELEMETRY_TLS_PROFILE
    ! Define which certificate the device presents and which CA to trust for the client
    device-identity TELEMETRY_DEVICE_IDENTITY_TP
    peer-trustpoint TELEMETRY_CA_TP

Security Warning: Implementing PKI and TLS requires careful planning and certificate management. Incorrect configurations can lead to connection failures. Always test thoroughly in a lab environment first.
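Before pointing the device at the collector, the collector's TLS endpoint can be sanity-checked from any management host. The following sketch assumes the collector address from the example above and a local copy of the CA certificate (`ca.pem`):

```python
import socket
import ssl
from typing import Optional


def build_verified_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    # create_default_context() enforces certificate validation and hostname
    # checking, mirroring the "encrypt all telemetry" guidance above.
    return ssl.create_default_context(cafile=ca_file)


def collector_cert_cn(host: str, port: int, ca_file: Optional[str] = None,
                      timeout: float = 5.0) -> str:
    """Complete a TLS handshake with the collector and return the
    commonName of its certificate; raises ssl.SSLError on a bad chain."""
    context = build_verified_context(ca_file)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            subject = dict(item[0] for item in tls.getpeercert()["subject"])
            return subject.get("commonName", "")


if __name__ == "__main__":
    # Collector address from the chapter's example topology.
    print(collector_cert_cn("192.168.10.10", 50051, ca_file="ca.pem"))
```

A handshake failure here (rather than on the router console) usually means a trust-chain or hostname mismatch, which is far easier to debug off-box.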

Verification & Troubleshooting

Effective verification and troubleshooting are crucial for maintaining a healthy telemetry pipeline.

Verification Commands

Beyond the vendor-specific show telemetry commands, here are general verification steps:

# Verify basic network connectivity to the collector
ping 192.168.10.10

# Verify TCP port connectivity to the collector's gRPC port
# On Linux:
nc -zv 192.168.10.10 50051
# Expected output: Connection to 192.168.10.10 50051 port [tcp/*] succeeded!

# From the collector, check if the gNMI client is running and connected
# (Specific command depends on the collector, e.g., 'systemctl status telegraf')
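When many devices and collectors are involved, the same reachability probe is easier to script. A minimal Python equivalent of the `nc -zv` check (the ports listed are the chapter's example gRPC, InfluxDB, and Grafana ports):

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (like `nc -zv`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for port in (50051, 8086, 3000):
        state = "open" if tcp_port_open("192.168.10.10", port) else "closed/filtered"
        print(f"192.168.10.10:{port} is {state}")
```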

Expected Output

A healthy telemetry pipeline should show:

  • Device-side: Subscriptions “Active” or “Connected,” counters for pushes increasing.
  • Collector-side: Logs indicating successful connection, parsing, and forwarding of data.
  • TSDB-side: Metrics appearing correctly tagged and indexed.
  • Grafana/Visualization: Dashboards populated with real-time data.
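The pipeline's own health should be alertable too. As an illustration, a Prometheus alerting rule that fires when a device stops pushing (the metric name `device_telemetry_last_push_timestamp_seconds` is hypothetical; substitute whatever your collector actually exports):

```yaml
groups:
  - name: telemetry-pipeline
    rules:
      - alert: TelemetryStreamStale
        # Fires when no push has been recorded for 2+ minutes.
        expr: time() - max by (device) (device_telemetry_last_push_timestamp_seconds) > 120
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "No telemetry from {{ $labels.device }} for 2+ minutes"
```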

Common Issues Table

| Issue | Possible Causes | Resolution Steps |

Technical Concepts

The journey from traditional monitoring to full network observability involves a fundamental shift in how network state information is collected, processed, and analyzed.

Traditional Monitoring vs. Modern Observability

Traditional Monitoring often relies on a “pull” model, where a monitoring system periodically queries network devices for specific metrics. Key technologies include:

  • SNMP (Simple Network Management Protocol): A widely used application-layer protocol for managing and monitoring network devices. It uses agents on devices to collect data and a manager to query them. While ubiquitous, it can be chatty, less granular, and often lacks real-time capabilities. (Refer to RFC 3411-3418 for SNMPv3 standards).
  • Syslog: A standard for message logging, allowing network devices to send event notifications (e.g., link up/down, error messages) to a central server. Excellent for event correlation but doesn’t provide granular metric data. (Refer to RFC 5424 for Syslog Protocol).
  • NetFlow/IPFIX (IP Flow Information Export): Provides data on IP traffic flows, enabling analysis of traffic patterns, bandwidth usage, and security incidents. It’s flow-based, not packet-based, offering aggregates. IPFIX (RFC 7011, which obsoletes RFC 5101) is the IETF standard based on NetFlow v9.

Modern Telemetry and Observability adopt a “push” model, where network devices actively stream highly granular, structured data to collectors in near real-time. This shift is driven by:

  • Structured Data: Using data models like YANG for consistent, machine-readable data.
  • High Granularity: Sub-second data collection, crucial for dynamic network behavior.
  • Real-time Insights: Enables faster detection and response to anomalies.
  • Reduced Polling Overhead: Devices push data when changes occur or at set intervals.

Observability goes beyond mere monitoring. While monitoring tells you if a system is working, observability helps you understand why it’s not working, or why its performance has changed. It involves collecting diverse data types (metrics, logs, traces) to build a comprehensive understanding of system behavior from external outputs.

Network Observability Architecture

A typical network observability architecture consists of several key components:

  1. Network Devices: The source of telemetry data.
  2. Telemetry Agents: Software running on devices (or built-in) responsible for collecting raw data and formatting it according to a data model.
  3. Telemetry Collectors: Software systems that receive, parse, and often buffer the high volume of streaming data from multiple devices. Examples include Telegraf, OpenNMS, Custom Python scripts.
  4. Time-Series Database (TSDB): Optimized for storing time-stamped data, allowing efficient querying and analysis of metrics over time. Examples include Prometheus, InfluxDB, VictoriaMetrics.
  5. Data Processing & Analytics: Tools that can enrich, filter, aggregate, and analyze the collected data.
  6. Visualization & Alerting: Dashboards (e.g., Grafana) to visualize trends and anomalies, and alerting mechanisms to notify engineers of critical events.

Let’s visualize this architecture:

@startuml
skinparam handwritten true
skinparam style strict

cloud "Internet/WAN" as WAN

package "Network Infrastructure" {
  node "Core Router 1 (Cisco IOS XE)" as CR1
  node "Aggregation Switch 1 (Juniper JunOS)" as AS1
  node "Leaf Switch 1 (Arista EOS)" as LS1
}

package "Observability Platform" {
  cloud "Telemetry Collectors" as Collectors {
    component "gRPC Collector (e.g., Telegraf)" as GRPCC
    component "NETCONF/RESTCONF Listener (e.g., Python app)" as NETCONFC
    component "SNMP Manager" as SNMPM
    component "Syslog Server" as SYSLOGS
  }

  database "Time-Series Database (TSDB)" as TSDB {
    folder "Prometheus"
    folder "InfluxDB"
  }

  node "Visualization & Alerting" as Viz {
    artifact "Grafana Dashboards"
    artifact "Alert Manager"
  }
}

CR1 -[hidden] AS1
AS1 -[hidden] LS1

CR1 -up-> GRPCC : gRPC Streaming Telemetry (YANG)
AS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)
LS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)

CR1 -up-> NETCONFC : NETCONF/RESTCONF (YANG)
AS1 -up-> NETCONFC : NETCONF (YANG)
LS1 -up-> NETCONFC : RESTCONF (YANG/eAPI)

CR1 [label="> SNMPM : SNMP Traps/Polls
AS1"] SNMPM : SNMP Traps/Polls
LS1 [label="> SNMPM : SNMP Traps/Polls

CR1"] SYSLOGS : Syslog Events
AS1 [label="> SYSLOGS : Syslog Events
LS1"] SYSLOGS : Syslog Events

GRPCC [label="> TSDB : Store Metrics
NETCONFC"] TSDB : Store Metrics
SNMPM [label="> TSDB : Store Metrics
SYSLOGS"] TSDB : Store Logs/Metrics

TSDB --> Viz : Query Data

Viz .down.> "Network Operations Center (NOC)" as NOC : Alerts/Dashboards

@enduml
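For lab work, the collector, TSDB, and visualization layers shown above can be stood up quickly with containers. A minimal docker-compose sketch (image tags, ports, and the telegraf.conf volume path are illustrative):

```yaml
version: "3.8"
services:
  telegraf:
    image: telegraf:latest
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
  influxdb:
    image: influxdb:2.7
    ports:
      - "8086:8086"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - influxdb
```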

Streaming Telemetry Protocols

Streaming telemetry relies on modern, standardized protocols for efficient, structured data transfer.

NETCONF/YANG

  • NETCONF (Network Configuration Protocol): An XML-based protocol designed for configuring and managing network devices. While its primary role is configuration, it can also be used to retrieve operational state data. It operates over secure transport mechanisms like SSH or TLS.
    • RFC 6241: Network Configuration Protocol (NETCONF)
  • YANG (Yet Another Next Generation): A data modeling language used to define the structure and content of configuration and state data for network devices. YANG models provide a formal, machine-readable schema for both configuration and operational data, enabling multi-vendor interoperability.
    • RFC 7950: The YANG 1.1 Data Modeling Language

NETCONF can be used for “pulling” operational state data from devices, similar to SNMP, but with the advantage of structured YANG data.
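That pull pattern is straightforward with the ncclient library (an external dependency, `pip install ncclient`; host and credentials below are placeholders):

```python
def interfaces_state_filter() -> str:
    # Subtree filter selecting operational state from the standard
    # ietf-interfaces YANG model.
    return ('<interfaces-state '
            'xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces"/>')


def get_interfaces_state(host: str, username: str, password: str,
                         port: int = 830) -> str:
    """Pull interface operational state over NETCONF and return raw XML."""
    from ncclient import manager  # external dependency: pip install ncclient
    with manager.connect(host=host, port=port, username=username,
                         password=password, hostkey_verify=False) as conn:
        reply = conn.get(filter=("subtree", interfaces_state_filter()))
        return reply.xml


if __name__ == "__main__":
    print(get_interfaces_state("192.168.10.1", "admin", "password"))
```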

RESTCONF/YANG

  • RESTCONF: A REST-like protocol that uses HTTP(S) to provide a programmatic interface for interacting with network devices. It exposes the YANG data model as a resource tree, allowing clients to perform CRUD (Create, Read, Update, Delete) operations.
    • RFC 8040: RESTCONF Protocol

RESTCONF offers a more web-friendly approach to access YANG-modeled data, which can be useful for integration with web applications and scripting.
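Because RESTCONF maps the YANG tree onto URLs (`/restconf/data/<module>:<container>/...` per RFC 8040), requests are easy to script. A small sketch using the requests library (host and credentials are placeholders):

```python
def restconf_url(host: str, path: str) -> str:
    """Build a RESTCONF data-resource URL per RFC 8040 (/restconf/data/...)."""
    return f"https://{host}/restconf/data/{path.lstrip('/')}"


def get_yang_data(host: str, path: str, username: str, password: str) -> dict:
    import requests  # external dependency: pip install requests
    response = requests.get(
        restconf_url(host, path),
        headers={"Accept": "application/yang-data+json"},
        auth=(username, password),
        verify=False,  # lab only -- validate certificates in production
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    data = get_yang_data("192.168.10.1",
                         "ietf-interfaces:interfaces-state",
                         "admin", "password")
    print(data)
```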

gRPC and gNMI

  • gRPC (Google Remote Procedure Call): A high-performance, open-source RPC framework that can run in any environment. It uses Protocol Buffers (Protobuf) as its Interface Definition Language (IDL) for defining service methods and message structures. gRPC is efficient due to its binary message format and use of HTTP/2.
  • gNMI (gRPC Network Management Interface): A specification developed within the OpenConfig working group (led by Google) that defines a gRPC-based service for network management, including configuration, state retrieval, and streaming telemetry. It allows clients to subscribe to specific data paths (defined by YANG/OpenConfig models) and receive updates.

gRPC and gNMI are the preferred methods for high-volume, low-latency streaming telemetry due to their efficiency.
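The open-source gnmic CLI is a quick way to exercise a gNMI endpoint before writing any code. A sketch against the lab device from this chapter (`--insecure` is lab-only; use TLS flags in production):

```shell
# Subscribe to OpenConfig interface counters, sampled every 10 s
gnmic -a 192.168.10.1:50051 -u admin -p password --insecure \
  subscribe \
  --path "/interfaces/interface/state/counters" \
  --mode stream --stream-mode sample --sample-interval 10s
```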

Protocol Flow: gRPC Streaming Telemetry

digraph gRPC_Telemetry {
    rankdir=LR;
    node [shape=box];

    Client [label="Telemetry Collector (gNMI Client)"];
    Device [label="Network Device (gNMI Server)"];

    subgraph cluster_0 {
        label="Subscription Request (Client to Device)";
        style=filled;
        color=lightgrey;
        Client -> Device [label="Establish gRPC Channel\n(TLS Encrypted)"];
        Device -> Client [label="Channel Acknowledged"];
        Client -> Device [label="gNMI::SubscribeRequest\n(Path, Mode: periodic/on-change)"];
    }

    subgraph cluster_1 {
        label="Data Stream (Device to Client)";
        style=filled;
        color=lightblue;
        Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
        Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
        Device -> Client [label="... Continuous Stream ..."];
    }
}

Conceptual Packet Structure: gRPC Telemetry (Simplified)

A gRPC packet, particularly over HTTP/2, is complex. Here’s a simplified view focusing on the payload within the context of a gNMI SubscribeResponse carrying an OpenConfig interface counter update.

packetdiag {
  colwidth = 32
  node_height = 72

  // Ethernet header (byte offsets)
  0-5: Destination MAC
  6-11: Source MAC
  12-13: EtherType (0x0800 = IPv4)

  // IPv4 header (20 bytes, no options)
  14: Version / IHL
  15: DSCP / ECN
  16-17: Total Length
  18-19: Identification
  20-21: Flags / Fragment Offset
  22: TTL
  23: Protocol (6 = TCP)
  24-25: Header Checksum
  26-29: Source IP Address
  30-33: Destination IP Address

  // TCP header (20 bytes, no options)
  34-35: Source Port (device ephemeral)
  36-37: Destination Port (e.g., 50051 for gRPC)
  38-41: Sequence Number
  42-45: Acknowledgment Number
  46-47: Data Offset / Flags
  48-49: Window Size
  50-51: Checksum
  52-53: Urgent Pointer

  // HTTP/2 frame header (9 bytes; gRPC multiplexes on streams)
  54-56: Length
  57: Type (e.g., DATA)
  58: Flags
  59-62: Stream Identifier

  // gRPC length-prefixed message header (5 bytes)
  63: Compressed Flag
  64-67: Message Length

  // gNMI SubscribeResponse (Protobuf encoded, variable length)
  68-95: timestamp (uint64), prefix (Path)
  96-127: update: Path (e.g., /interfaces/interface[name=GigabitEthernet1]/state/counters)
  128-159: update: Val (TypedValue: counter value), further updates ...
}

Data Models (OpenConfig and Vendor-Native YANG)

YANG Data Models are crucial for streaming telemetry. They define the structure, syntax, and semantics of data.

  • Vendor-Native YANG Models: Provided by device vendors (e.g., Cisco, Juniper, Arista) and offer granular access to device-specific features and operational data. Examples: Cisco-IOS-XE-interfaces-oper.yang, juniper-smi.yang. You can explore these on Cisco DevNet’s YANG Suite (developer.cisco.com/yangsuite).
  • OpenConfig: An industry-wide initiative to define a common set of vendor-neutral YANG data models for network configuration and operational state. Its goal is to provide a unified approach to managing multi-vendor networks. Using OpenConfig models simplifies automation and monitoring across diverse hardware. (Learn more at openconfig.net).

The use of YANG models, especially OpenConfig, is a cornerstone of effective multi-vendor NetDevOps.

Collector Architectures

Collectors are vital for handling the ingestion of telemetry data. They typically perform several functions:

  • Ingestion: Receive data via gRPC, NETCONF, SNMP, etc.
  • Parsing: Decode Protobuf/JSON/XML payloads into usable metrics.
  • Tagging/Labeling: Add metadata (e.g., device hostname, interface name) to metrics for easier querying.
  • Buffering: Temporarily store data before writing to a TSDB.
  • Forwarding: Send processed metrics to a TSDB.

Popular open-source collector solutions include:

  • Telegraf: A plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors to various output plugins (including Prometheus, InfluxDB). Excellent for gRPC telemetry.
  • Prometheus Node Exporter: While primarily for host metrics, Prometheus itself can scrape metrics, and it has various exporters for network devices.
  • Custom Python Applications: For highly specific use cases, a Python script can act as a gNMI client to subscribe, parse, and store data.
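As a concrete example, a minimal Telegraf configuration pairing the gnmi input plugin with an InfluxDB v2 output might look like the following (addresses, credentials, and the bucket/organization names are assumptions for the lab topology):

```toml
[[inputs.gnmi]]
  addresses = ["192.168.10.1:50051"]
  username = "admin"
  password = "password"
  encoding = "json_ietf"

  [[inputs.gnmi.subscription]]
    name = "intf_counters"
    origin = "openconfig-interfaces"
    path = "/interfaces/interface/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"

[[outputs.influxdb_v2]]
  urls = ["http://192.168.10.11:8086"]
  token = "$INFLUX_TOKEN"
  organization = "netops"
  bucket = "telemetry"
```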

Configuration Examples (Multi-vendor)

Here, we’ll demonstrate configuring streaming telemetry (gRPC/gNMI and NETCONF/RESTCONF) on Cisco, Juniper, and Arista devices.

Cisco IOS XE/XR (gRPC Streaming Telemetry)

This example configures a periodic gRPC subscription to stream interface statistics.

! Configure NETCONF/RESTCONF for management access (often a prerequisite for gNMI)
! Enable NETCONF SSH transport
netconf-yang
  ssh

! Enable RESTCONF HTTPS transport
restconf
  transport https
  ! Use a local authentication method
  authorization local

! Configure gNMI/gRPC telemetry
! Define the telemetry destination (collector)
telemetry ietf
  destination-group TELEMETRY_COLLECTOR
   address 192.168.10.10 port 50051
   protocol grpc tls-enable
   encoding encode-kvgpb
   profile TELEMETRY_PROFILE ! Optional TLS profile if using client certs
  ! security
  !  trustpoint TELEMETRY_CLIENT_TP
  !  pki-enrollment mode auto-client
  !  ! Ensure the collector's certificate is trusted here
  !  ! cryptounique-name TLS_SERVER_IDENTITY
  !  !
  ! encryption aes256-gcm
  ! ! If using client certificates for mutual TLS
  ! ! certificate application telemetry
  ! ! ca-trustpoint TELEMETRY_CA_TP

! Define a sensor group for the data we want to stream (e.g., interface operational state)
sensor-group INTERFACE_OPER_STATE
  ! Use an OpenConfig path for multi-vendor consistency
  ! For Cisco, verify specific YANG paths are supported
  ! Example: openconfig-interfaces:interfaces/interface/state
  ! Example: Cisco-IOS-XE-interfaces-oper:interfaces/interface/state
  ! Path examples:
  ! /interfaces/interface/state/counters
  ! /interfaces/interface[name='GigabitEthernet1']/state/counters
  path openconfig-interfaces:interfaces/interface/state/counters
  
! Define a subscription that links the sensor group to the destination
subscription PERIODIC_INTF_COUNTERS
  sensor-group INTERFACE_OPER_STATE sample-interval 10000 ! 10-second interval
  destination-group TELEMETRY_COLLECTOR
  stream cisco-push
  update-policy periodic

! --- Verification Commands ---

Verification Commands:

show telemetry ietf subscription PERIODIC_INTF_COUNTERS
show telemetry ietf destination-group TELEMETRY_COLLECTOR
show telemetry ietf sensor-group INTERFACE_OPER_STATE
show telemetry ietf connection all

Expected Output (Snippet):

Router# show telemetry ietf subscription PERIODIC_INTF_COUNTERS
Subscription ID: 100
  Type: Dynamic
  State: Enabled
  Source Address: 0.0.0.0
  Source VRF: <default>
  Stream: cisco-push
  Update policy: periodic
  Update interval: 10000 ms
  Sensor Groups:
    Sensor Group: INTERFACE_OPER_STATE (ID: 100)
      Path: openconfig-interfaces:interfaces/interface/state/counters
  Destination Groups: TELEMETRY_COLLECTOR
    Address: 192.168.10.10:50051
    Transport: grpc
    Encoding: encode-kvgpb
    Profile: TELEMETRY_PROFILE
    TLS: Enabled

Router# show telemetry ietf connection all
Telemetry connection 0:
  Peer Address: 192.168.10.10
  Peer Port: 50051
  Local Address: 10.0.0.1
  Local Port: 54321
  State: Connected
  Profile Name: TELEMETRY_PROFILE
  Subscriptions: 100

Juniper JunOS (gRPC Streaming Telemetry)

This example configures gRPC streaming telemetry for interface statistics using OpenConfig models.

# Enable gRPC and specify its listening port
set services extension-service request-response grpc clear-text port 50051

# Configure a streaming telemetry sensor (data provider)
set services analytics sensor SENSOR_INTF_STATS
set services analytics sensor SENSOR_INTF_STATS description "Interface Stats"
# Specify the OpenConfig path to collect data
# Junos supports both native sensor paths (/junos/...) and OpenConfig paths
set services analytics sensor SENSOR_INTF_STATS resource "/junos/system/linecard/interface/"
set services analytics sensor SENSOR_INTF_STATS resource "/interfaces/interface/state/" # OpenConfig path

# Configure a streaming telemetry export profile (collector destination and frequency)
set services analytics export-profile EXPORT_INTF_STATS
set services analytics export-profile EXPORT_INTF_STATS reporting-period 10 # 10 seconds
set services analytics export-profile EXPORT_INTF_STATS format gpb
set services analytics export-profile EXPORT_INTF_STATS transport grpc
set services analytics export-profile EXPORT_INTF_STATS target-address 192.168.10.10
set services analytics export-profile EXPORT_INTF_STATS target-port 50051

# Apply the sensor and export profile to a rule
set services analytics rule RULE_INTF_STATS
set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS
set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS

# Commit the configuration
commit

Verification Commands:

show services analytics status
show services analytics client
show services analytics export-profile EXPORT_INTF_STATS

Expected Output (Snippet):

user@juniper> show services analytics status
Extension service status:
  Current status: Running
  Enabled on: FPC0
  GRPC enabled: Yes, port 50051

user@juniper> show services analytics client
...
Client information:
  Name: EXPORT_INTF_STATS
  Address: 192.168.10.10, Port: 50051
  Protocol: grpc
  Sensor name: SENSOR_INTF_STATS
  Reporting period: 10s
  State: Connected
...

Arista EOS (gRPC Streaming Telemetry)

Arista EOS uses OpenConfig by default for gRPC telemetry.

! Enable eAPI (for RESTCONF-like interaction and generally good practice)
management api http-commands
  no shutdown
  protocol https
  vrf default

! Configure gRPC telemetry
! Define the telemetry receiver (collector)
telemetry
  destination 192.168.10.10:50051
    protocol gRPC
    encoding GPB
    tls profile TELEMETRY_TLS_PROFILE ! Optional TLS profile
    source-interface Management1

! Define a sensor group (path to collect)
  sensor-group INTERFACE_COUNTERS
    path /Sysdb/interface/counters
    path /interfaces/interface/state/statistics ! OpenConfig path for counters

! Define a subscription to push data from the sensor group to the destination
  stream INTERFACE_STREAM
    sensor-group INTERFACE_COUNTERS
    destination 192.168.10.10:50051
    interval 10000 ! 10 seconds

! --- Verification Commands ---

Verification Commands:

show telemetry
show telemetry destination 192.168.10.10:50051
show telemetry stream INTERFACE_STREAM

Expected Output (Snippet):

Arista# show telemetry
Telemetry Receiver State:
  Receiver: 192.168.10.10:50051
    Protocol: gRPC
    Encoding: GPB
    State: Active
    Source-interface: Management1

Telemetry Streams:
  Stream: INTERFACE_STREAM
    Sensor Group: INTERFACE_COUNTERS
    Destination: 192.168.10.10:50051
    Interval: 10000 ms
    Last Push: 00:00:02 ago
    Push Count: 1234
    Status: OK

Network Diagrams

Diagrams are essential for visualizing complex network concepts.

Network Topology: Telemetry Lab Setup (nwdiag)

nwdiag {
  network core_network {
    address = "10.0.0.0/24"
    description = "Core Network Segment"

    CR1 [address = "10.0.0.1"];
    AS1 [address = "10.0.0.2"];
  }

  network mgmt_network {
    address = "192.168.10.0/24"
    description = "Management & Telemetry Network"

    CR1 [address = "192.168.10.1"];
    AS1 [address = "192.168.10.2"];
    LS1 [address = "192.168.10.3"];
    COLLECTOR [address = "192.168.10.10", description = "Telemetry Collector"];
    TSDB [address = "192.168.10.11", description = "Time-Series DB"];
    GRAFANA [address = "192.168.10.12", description = "Grafana / Visualization"];
  }

  // Connections implicitly defined by shared networks
  CR1 -- AS1; // Represents logical connection in core_network
  CR1 -- COLLECTOR; // Represents logical connection in mgmt_network
  AS1 -- COLLECTOR;
  LS1 -- COLLECTOR;
  COLLECTOR -- TSDB;
  TSDB -- GRAFANA;
}

Data Flow for Observability Platform (plantuml)

@startuml
scale 1.5

cloud "Network Devices" as NetDevs {
  component "Cisco IOS XE" as C_DEV
  component "Juniper JunOS" as J_DEV
  component "Arista EOS" as A_DEV
}

rectangle "Telemetry Collection Layer" {
  component "gNMI Collector\n(e.g., Telegraf)" as GNMI_COLLECTOR
  component "SNMP Poller\n(e.g., Prometheus)" as SNMP_POLLER
  component "Syslog Aggregator\n(e.g., Logstash)" as SYSLOG_AGG
}

database "Data Storage" {
  component "Time-Series DB\n(e.g., Prometheus DB, InfluxDB)" as TSDB
  component "Log Storage\n(e.g., Elasticsearch)" as LOG_STORE
}

rectangle "Analysis & Visualization" {
  component "Metrics Dashboards\n(e.g., Grafana)" as GRAFANA
  component "Alerting Engine\n(e.g., Alertmanager)" as ALERTS
  component "Log Analysis\n(e.g., Kibana)" as KIBANA
}

C_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
J_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
A_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)

C_DEV -down-> SNMP_POLLER : SNMPv3
J_DEV -down-> SNMP_POLLER : SNMPv3
A_DEV -down-> SNMP_POLLER : SNMPv3

C_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
J_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
A_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)

GNMI_COLLECTOR -down-> TSDB : Write Metrics
SNMP_POLLER -down-> TSDB : Write Metrics
SYSLOG_AGG -down-> LOG_STORE : Write Logs

TSDB -up-> GRAFANA : Query Metrics
LOG_STORE -up-> KIBANA : Query Logs

GRAFANA -right-> ALERTS : Trigger Alerts
ALERTS -up-> "NetOps Team" : Notifications

@enduml

Automation Examples

Automating the setup of telemetry and the consumption of data is central to NetDevOps.

Python: gNMI Client for Streaming Telemetry

This Python script demonstrates how to subscribe to gNMI telemetry data from a network device using the grpc and gnmic libraries.

# pip install grpcio grpcio-tools gnmic

import grpc
import gnmic_pb2
import gnmic_pb2_grpc
import json
import time
import ssl

# Device details
DEVICE_IP = "192.168.10.1"  # IP of your Cisco/Juniper/Arista device
DEVICE_PORT = 50051       # gNMI port, typically 50051
USERNAME = "admin"
PASSWORD = "password"

# Path to subscribe to (OpenConfig interface counters)
# Adjust based on your device's supported paths and configured sensor groups
# For Cisco IOS XE: "/openconfig-interfaces:interfaces/interface/state/counters"
# For Juniper JunOS: "/interfaces/interface/state/"
# For Arista EOS: "/interfaces/interface/state/statistics"
GNMI_PATH = "/interfaces/interface/state/statistics"

def stream_telemetry():
    # Setup TLS/SSL context if needed (for secure gRPC)
    # If your device uses TLS, replace grpc.insecure_channel with grpc.secure_channel
    # and provide appropriate credentials/certificates.
    # For simplicity, this example uses insecure_channel, but production should use TLS.
    
    # Example for secure_channel (requires server cert for verification or client certs for mutual TLS)
    # with open('path/to/server_cert.pem', 'rb') as f:
    #     trusted_certs = f.read()
    # credentials = grpc.ssl_channel_credentials(root_certificates=trusted_certs)
    # channel = grpc.secure_channel(f"{DEVICE_IP}:{DEVICE_PORT}", credentials)

    # For insecure channel (NOT RECOMMENDED FOR PRODUCTION)
    channel = grpc.insecure_channel(f"{DEVICE_IP}:{DEVICE_PORT}")

    stub = gnmic_pb2_grpc.gNMIStub(channel)

    subscribe_request = gnmic_pb2.SubscribeRequest()
    subscription_list = subscribe_request.subscribe

    # Create a subscription
    subscription = subscription_list.subscription.add()
    
    # The path to subscribe to
    path_elem = subscription.path.elem.add()
    path_elem.name = GNMI_PATH.split('/')[1] # Root element e.g., 'interfaces'
    for p in GNMI_PATH.split('/')[2:]:
        elem = subscription.path.elem.add()
        if '[' in p and ']' in p:
            # Handle key-value pairs in path, e.g., interface[name=GigabitEthernet1]
            key_name = p.split('[')[0]
            key_value = p.split('=')[1].strip(']')
            elem.name = key_name
            elem.key[key_name.rstrip('s')] = key_value # Adjust key based on YANG model
        else:
            elem.name = p

    subscription.mode = gnmic_pb2.SubscriptionList.Mode.STREAM
    subscription.sample_interval = 10_000_000_000 # 10 seconds in nanoseconds
    subscription_list.mode = gnmic_pb2.SubscriptionList.Mode.STREAM
    subscription_list.encoding = gnmic_pb2.Encoding.JSON_IETF # Or PROTO for protobuf

    print(f"Subscribing to {GNMI_PATH} on {DEVICE_IP}:{DEVICE_PORT}...")

    try:
        from google.protobuf import json_format  # protobuf messages have no to_json() method

        # stub.Subscribe takes an iterator of requests and returns an
        # iterator over SubscribeResponse messages
        for response in stub.Subscribe(iter([subscribe_request])):
            if response.HasField("update"):
                timestamp_ns = response.update.timestamp
                timestamp_s = timestamp_ns / 1_000_000_000
                prefix = (json_format.MessageToJson(response.update.prefix)
                          if response.update.HasField("prefix") else "N/A")
                print("\n--- Telemetry Update ---")
                print(f"Timestamp: {time.ctime(timestamp_s)} ({timestamp_ns} ns)")
                print(f"Prefix: {prefix}")

                for update in response.update.update:
                    path = json_format.MessageToJson(update.path)
                    value = json_format.MessageToJson(update.val)
                    print(f"  Path: {path}")
                    print(f"  Value: {value}")
            elif response.HasField("sync_response"):
                print("--- Synchronization complete ---")
            else:
                print(f"Received unexpected response: {response}")

    except grpc.RpcError as e:
        print(f"gRPC Error: {e.code()}: {e.details()}")  # code() and details() are methods
    except KeyboardInterrupt:
        print("Subscription stopped by user.")
    finally:
        channel.close()
        print("gRPC channel closed.")

if __name__ == "__main__":
    stream_telemetry()
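The bracket-aware path splitting used in the script above is worth isolating into a reusable helper. The sketch below returns plain (name, keys) tuples rather than protobuf PathElem messages, so it can be tested without the gRPC bindings:

```python
def parse_gnmi_path(path: str) -> list:
    """Split a gNMI path string into (name, keys) elements.

    "/interfaces/interface[name=Gi1]/state" ->
        [("interfaces", {}), ("interface", {"name": "Gi1"}), ("state", {})]
    """
    elems = []
    for part in path.strip("/").split("/"):
        if not part:
            continue
        if "[" in part and part.endswith("]"):
            name = part[:part.index("[")]
            keys = {}
            # A single element may carry several [key=value] qualifiers
            for expr in part[part.index("[") + 1:-1].split("]["):
                k, _, v = expr.partition("=")
                keys[k] = v
            elems.append((name, keys))
        else:
            elems.append((part, {}))
    return elems
```

Factoring the parsing out this way also makes it easy to unit-test path handling before pointing the subscription at a live device.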

Ansible Playbook: Configure Streaming Telemetry

This playbook configures gRPC streaming telemetry on Cisco IOS XE, Juniper JunOS, and Arista EOS devices. It assumes the ansible.netcommon, cisco.ios, junipernetworks.junos, and arista.eos collections are installed and that the inventory defines ansible_network_os for each device.

---
- name: Configure Multi-Vendor Streaming Telemetry
  hosts: network_devices
  gather_facts: false
  connection: network_cli

  vars:
    telemetry_collector_ip: "192.168.10.10"
    telemetry_collector_port: 50051
    telemetry_sample_interval_ms: 10000 # 10 seconds

  tasks:
    - name: Ensure NETCONF/RESTCONF is enabled (Cisco IOS XE)
      when: ansible_network_os == 'ios' # IOS XR would use the cisco.iosxr collection instead
      cisco.ios.ios_config:
        lines:
          - netconf-yang
          - restconf
          - "restconf transport https"
          - "restconf authorization local"
        save_when: modified

    - name: Configure gRPC Streaming Telemetry (Cisco IOS XE)
      when: ansible_network_os == 'ios'
      cisco.ios.ios_config:
        lines:
          - "telemetry ietf"
          - "  destination-group TELEMETRY_COLLECTOR"
          - "    address {{ telemetry_collector_ip }} port {{ telemetry_collector_port }}"
          - "    protocol grpc tls-enable" # Use 'tls-enable' for production, or 'no-tls' for testing
          - "    encoding encode-kvgpb"
          - "  sensor-group INTERFACE_OPER_STATE"
          - "    path openconfig-interfaces:interfaces/interface/state/counters"
          - "  subscription PERIODIC_INTF_COUNTERS"
          - "    sensor-group INTERFACE_OPER_STATE sample-interval {{ telemetry_sample_interval_ms }}"
          - "    destination-group TELEMETRY_COLLECTOR"
          - "    stream cisco-push"
          - "    update-policy periodic"
        save_when: modified

    - name: Configure gRPC Streaming Telemetry (Juniper JunOS)
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_config:
        lines:
          - "set services extension-service request-response grpc clear-text port {{ telemetry_collector_port }}"
          - "set services analytics sensor SENSOR_INTF_STATS resource \"/interfaces/interface/state/\""
          - "set services analytics export-profile EXPORT_INTF_STATS reporting-period {{ telemetry_sample_interval_ms // 1000 }}" # JunOS expects seconds
          - "set services analytics export-profile EXPORT_INTF_STATS format gpb"
          - "set services analytics export-profile EXPORT_INTF_STATS transport grpc"
          - "set services analytics export-profile EXPORT_INTF_STATS target-address {{ telemetry_collector_ip }}"
          - "set services analytics export-profile EXPORT_INTF_STATS target-port {{ telemetry_collector_port }}"
          - "set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS"
          - "set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS"
        comment: "Configure streaming telemetry"

    - name: Configure gRPC Streaming Telemetry (Arista EOS)
      when: ansible_network_os == 'eos'
      arista.eos.eos_config:
        lines:
          - "telemetry"
          - "  destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
          - "    protocol gRPC"
          - "    encoding GPB"
          - "    source-interface Management1" # Adjust as needed
          - "  sensor-group INTERFACE_COUNTERS"
          - "    path /interfaces/interface/state/statistics"
          - "  stream INTERFACE_STREAM"
          - "    sensor-group INTERFACE_COUNTERS"
          - "    destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
          - "    interval {{ telemetry_sample_interval_ms }}"
        save_when: modified

    - name: Verify Telemetry Configuration (Cisco IOS XE)
      when: ansible_network_os == 'ios'
      cisco.ios.ios_command:
        commands:
          - "show telemetry ietf subscription PERIODIC_INTF_COUNTERS"
          - "show telemetry ietf connection all"
      register: cisco_telemetry_output
      ignore_errors: true # Continue even if command fails
    - ansible.builtin.debug:
        msg: "{{ cisco_telemetry_output.stdout_lines }}"
      when: cisco_telemetry_output.stdout is defined

    - name: Verify Telemetry Configuration (Juniper JunOS)
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_command:
        commands:
          - "show services analytics status"
          - "show services analytics client"
      register: juniper_telemetry_output
      ignore_errors: true
    - ansible.builtin.debug:
        msg: "{{ juniper_telemetry_output.stdout_lines }}"
      when: juniper_telemetry_output.stdout is defined

    - name: Verify Telemetry Configuration (Arista EOS)
      when: ansible_network_os == 'eos'
      arista.eos.eos_command:
        commands:
          - "show telemetry stream INTERFACE_STREAM"
          - "show telemetry destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
      register: arista_telemetry_output
      ignore_errors: true
    - ansible.builtin.debug:
        msg: "{{ arista_telemetry_output.stdout_lines }}"
      when: arista_telemetry_output.stdout is defined

Security Considerations

Network telemetry streams vast amounts of operational data, making their security paramount. Compromised telemetry can lead to:

  • Data Exposure: Sensitive network topology, performance, and traffic data falling into the wrong hands.
  • System Manipulation: If telemetry agents or protocols have configuration capabilities, a breach could allow unauthorized configuration changes.
  • Denial of Service (DoS): An attacker could overwhelm telemetry collectors with fabricated data or exhaust device resources by triggering excessive data streaming.

Attack Vectors and Mitigation Strategies

| Attack Vector | Mitigation Strategies |
|---|---|
| Unauthorized Data Access | Authentication: Use strong authentication (client certificates for gRPC/gNMI, AAA with NETCONF/RESTCONF, SNMPv3 with authPriv). Authorization: Implement granular access control (RBAC) to telemetry paths and data streams. Encryption: Always use TLS/SSL for gRPC, NETCONF, RESTCONF, and secure Syslog. |
| Tampering with Telemetry Data | Integrity: TLS/SSL provides data integrity checks; use digital signatures where possible. Secure Sources: Ensure the network devices sending telemetry are themselves secured and not compromised. |
| DoS on Collector/Device | Rate Limiting: Implement rate limits on telemetry streams at the device if supported. Collector Scaling: Design collectors for horizontal scalability and redundancy. Network Segmentation: Isolate telemetry traffic on dedicated management networks/VLANs. Input Validation: Collectors should validate incoming data to prevent parsing malformed packets. |
| Compromised Monitoring Infrastructure | Hardening: Securely configure operating systems, databases, and applications in the observability stack. Least Privilege: Run collector services with minimal necessary permissions. Vulnerability Management: Regularly patch and scan all components. |
| Replay Attacks | Timestamps & Nonces: Protocols like gRPC incorporate timestamps and request/response matching to prevent replay. TLS Session Keys: Use fresh session keys for each connection. |

Security Best Practices

  • Encrypt All Telemetry: Always use TLS/SSL for streaming telemetry (gRPC, NETCONF over SSH/TLS, RESTCONF over HTTPS). Never transmit sensitive data in plain text.
  • Strong Authentication and Authorization: Implement multi-factor authentication for management interfaces. Use client certificates for gRPC authentication. Employ AAA for programmatic access to devices.
  • Dedicated Management Network: Isolate telemetry traffic on a separate management network or VPN to reduce the attack surface.
  • Principle of Least Privilege: Configure telemetry subscriptions to send only the data absolutely necessary. Limit access to monitoring tools and dashboards.
  • Regular Auditing and Logging: Audit telemetry configurations and logs for suspicious activity. Ensure collectors log their own activity.
  • Software Supply Chain Security: Use trusted sources for libraries and tools (e.g., Python gnmic library, Telegraf plugins).
  • Secure API Keys/Credentials: Store API keys and credentials for automation (Ansible, Python) securely using vaults or secret management systems.
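As a concrete example of keeping credentials out of source code, the device username and password used by the Python gNMI client could be pulled from the environment at runtime. This is a sketch; the GNMI_USERNAME/GNMI_PASSWORD variable names are illustrative, and in production the values would be injected by a vault or secret-management system rather than a shell profile:

```python
import os

def load_gnmi_credentials() -> tuple:
    """Fetch gNMI credentials from the environment instead of hardcoding them.

    GNMI_USERNAME / GNMI_PASSWORD are illustrative names; a secret manager
    or CI vault would typically populate them.
    """
    username = os.environ.get("GNMI_USERNAME")
    password = os.environ.get("GNMI_PASSWORD")
    if not username or not password:
        raise RuntimeError("GNMI_USERNAME and GNMI_PASSWORD must be set")
    return username, password
```

Failing fast when a credential is missing is deliberate: a telemetry client silently falling back to defaults is harder to debug than an immediate error.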

Security Configuration Example (Cisco IOS XE - TLS for gRPC)

To enable TLS for gRPC streaming, you typically need to set up a Public Key Infrastructure (PKI) and trustpoints on the device and ensure your collector also presents a trusted certificate.

! Create a crypto key pair for the device
crypto key generate rsa label MY_TELEMETRY_KEY modulus 2048

! Define a trustpoint for the Certificate Authority (CA) that signed your collector's certificate
crypto pki trustpoint TELEMETRY_CA_TP
  enrollment terminal
  revocation-check none
  usage ipsec ikev2 dot1x aaa web-auth tls
  ! You would paste the CA certificate here after "enrollment terminal"
  ! Example:
  ! certificate chain
  !  -----BEGIN CERTIFICATE-----
  !  ... CA Certificate Data ...
  !  -----END CERTIFICATE-----
  ! quit

! Define a trustpoint for the device's own identity certificate (signed by your internal CA)
crypto pki trustpoint TELEMETRY_DEVICE_IDENTITY_TP
  enrollment terminal
  revocation-check none
  usage ipsec ikev2 dot1x aaa web-auth tls
  rsakeypair MY_TELEMETRY_KEY
  ! You would paste the device's certificate here
  ! Example:
  ! certificate chain
  !  -----BEGIN CERTIFICATE-----
  !  ... Device Certificate Data ...
  !  -----END CERTIFICATE-----
  ! quit

! Link these to the telemetry profile
telemetry ietf
  destination-group TELEMETRY_COLLECTOR
    address 192.168.10.10 port 50051
    protocol grpc tls-enable
    encoding encode-kvgpb
    profile TELEMETRY_TLS_PROFILE
  profile TELEMETRY_TLS_PROFILE
    ! Define which certificate the device presents and which CA to trust for the client
    device-identity TELEMETRY_DEVICE_IDENTITY_TP
    peer-trustpoint TELEMETRY_CA_TP

Security Warning: Implementing PKI and TLS requires careful planning and certificate management. Incorrect configurations can lead to connection failures. Always test thoroughly in a lab environment first.

Verification & Troubleshooting

Effective verification and troubleshooting are crucial for maintaining a healthy telemetry pipeline.

Verification Commands

Beyond the vendor-specific show telemetry commands, here are general verification steps:

# Verify basic network connectivity to the collector
ping 192.168.10.10

# Verify TCP port connectivity to the collector's gRPC port
# On Linux:
nc -zv 192.168.10.10 50051
# Expected output: Connection to 192.168.10.10 50051 port [tcp/*] succeeded!

# From the collector, check if the gNMI client is running and connected
# (Specific command depends on the collector, e.g., 'systemctl status telegraf')

Expected Output

A healthy telemetry pipeline should show:

  • Device-side: Subscriptions “Active” or “Connected,” counters for pushes increasing.
  • Collector-side: Logs indicating successful connection, parsing, and forwarding of data.
  • TSDB-side: Metrics appearing correctly tagged and indexed.
  • Grafana/Visualization: Dashboards populated with real-time data.

Common Issues Table

| Issue | Possible Causes | Resolution Steps |
|---|---|---|
| Connection Failed | Network path between device and collector is down. Firewall blocking. Incorrect IP/port. | Check physical connections, firewalls (e.g., ufw status, firewalld --list-all on Linux, ACLs on the router), and IP addresses/ports. |
| No Data Received by Collector | Incorrect sensor-group path or subscription configuration. Firewall blocking outgoing traffic from the device. Device resource contention. | Verify the exact YANG path on the device. Check device logs (show log on Juniper, show logging on Cisco) for telemetry errors. Ensure device firewalls (if any) permit outbound gRPC traffic. Increase the sample interval temporarily. |
| Data Received but Not Parsed/Stored | Collector configuration error (wrong encoding, incorrect path mapping). TSDB is down or misconfigured. | Check collector logs for parsing errors. Verify the collector's output plugin configuration for the TSDB. Check TSDB status and logs. Ensure the YANG path is correctly mapped to Prometheus metric names or InfluxDB fields. |
| High CPU/Memory on Network Device | Too many subscriptions, too low a sample interval, streaming too much data. | Increase the sample interval. Filter paths to collect only essential data. Use on-change subscriptions for highly dynamic but infrequent data. Optimize YANG paths. |
| TLS/SSL Handshake Failure | Mismatched certificates, incorrect trustpoints, expired certificates, incompatible ciphers. | Verify CA, device identity, and peer trustpoint configurations. Check certificate validity dates. Ensure cipher suites are compatible. Use debug crypto pki (Cisco) or show security pki (Juniper) for certificate status. |
| Inconsistent Data (e.g., missing metrics) | Network congestion, packet loss, collector overload, device buffer overflow. | Check network health between device and collector. Monitor collector resource usage (CPU, memory, disk I/O). Increase the device telemetry buffer size (if configurable). |

Debug Commands

  • Cisco IOS XE/XR:
    • debug telemetry all: Comprehensive debugging for telemetry.
    • debug grpc all: Debug gRPC specific operations.
    • show platform software telemetry state: Shows internal telemetry process state.
  • Juniper JunOS:
    • monitor services analytics: Real-time monitoring of telemetry events.
    • show log messages | grep telemetry: Filter system logs for telemetry-related messages.
  • Arista EOS:
    • debug telemetry agent: Debug the telemetry agent process.
    • show agent logs | grep Telemetry: Filter agent logs for telemetry.

Root Cause Analysis (RCA)

When troubleshooting, follow a systematic approach:

  1. Bottom-Up: Verify physical connectivity, then IP reachability, then port reachability (ping, traceroute, nc).
  2. Configuration Check: Double-check device telemetry configuration (paths, destinations, intervals, security).
  3. Process Status: Ensure telemetry processes are running on the device and collector processes are active.
  4. Log Analysis: Scrutinize device and collector logs for errors or warnings related to telemetry.
  5. Data Flow Validation: Use small, targeted Python scripts (like the example provided) or gnmic CLI tools to test subscriptions directly against a single device.
  6. Security Posture: Confirm firewalls, ACLs, and TLS configurations are correctly implemented and not blocking legitimate traffic.
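Step 1 of this flow can be scripted when nc is unavailable on the collector or jump host. A minimal standard-library sketch of a TCP port probe (host and port values are whatever your environment uses):

```python
import socket

def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Roughly equivalent to `nc -zv host port`.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure, ...
        return False
```

For example, `check_tcp_port("192.168.10.10", 50051)` confirms the collector's gRPC port is reachable before you start chasing subscription-level problems.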

Performance Optimization

Optimizing telemetry performance is crucial to avoid overwhelming network devices or the monitoring infrastructure.

  • Sample Interval Tuning:
    • Periodic subscriptions: Only use low sample intervals (e.g., <5 seconds) for highly critical, volatile metrics (e.g., interface errors, CPU utilization). For most operational data (e.g., routing table size), higher intervals (30-60 seconds) are sufficient.
    • On-change subscriptions: Prefer on-change mode for data that changes infrequently but is critical to capture immediately (e.g., interface status changes, peer state). This reduces unnecessary data pushes.
  • Efficient Data Encoding: Utilize Protocol Buffers (GPB) for gRPC telemetry. GPB is a binary format that is more compact and efficient than JSON or XML for high-volume data.
  • Selective Data Collection (YANG Paths): Subscribe only to the specific YANG paths and leaves required. Avoid subscribing to entire modules or large branches if you only need a few metrics. Use filters where supported.
  • Collector Scaling: Deploy collectors in a horizontally scalable architecture (e.g., multiple instances behind a load balancer). Ensure collectors have sufficient CPU, memory, and disk I/O to handle peak telemetry ingress.
  • Time-Series Database Optimization:
    • Choose a TSDB optimized for your data volume and query patterns (e.g., Prometheus for pull-based, InfluxDB for push-based, VictoriaMetrics for high scale).
    • Implement data retention policies to automatically delete old data.
    • Consider downsampling or aggregation of older data for long-term trends.
  • Network Path Optimization: Ensure the network path between devices and collectors has sufficient bandwidth and low latency, especially for high-frequency telemetry.
  • Device Resource Monitoring: Continuously monitor the CPU and memory utilization of network devices to ensure telemetry processing isn’t causing resource exhaustion. Adjust subscriptions if devices are overloaded.
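To make interval tuning concrete, a back-of-envelope estimate of daily ingest volume helps decide between, say, 10-second and 60-second sampling. The figures below (100 devices, 50 paths each, roughly 200 bytes per GPB update) are purely illustrative assumptions:

```python
def daily_telemetry_volume_mb(devices: int, paths_per_device: int,
                              bytes_per_update: int, sample_interval_s: int) -> float:
    """Rough daily ingest estimate for periodic subscriptions, in MB/day."""
    updates_per_day = 86_400 / sample_interval_s          # seconds in a day / interval
    total_bytes = devices * paths_per_device * bytes_per_update * updates_per_day
    return total_bytes / 1_000_000

# Illustrative assumptions: 100 devices, 50 paths each, ~200 B per update
print(daily_telemetry_volume_mb(100, 50, 200, 10))  # 10 s sampling
print(daily_telemetry_volume_mb(100, 50, 200, 60))  # 60 s sampling
```

Under these assumptions, 10-second sampling produces 8640 MB/day versus 1440 MB/day at 60 seconds, a six-fold difference for the same metric set, which is why low intervals should be reserved for genuinely volatile metrics.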

Hands-On Lab

This lab will guide you through setting up gRPC streaming telemetry on a multi-vendor environment and collecting data.

Lab Topology

nwdiag {
  network mgmt_vlan {
    address = "192.168.100.0/24"
    description = "Dedicated Management/Telemetry VLAN"

    R1 [address = "192.168.100.1", description = "Cisco IOS XE Router"];
    S1 [address = "192.168.100.2", description = "Juniper JunOS Switch"];
    S2 [address = "192.168.100.3", description = "Arista EOS Switch"];
    COLLECTOR_SERVER [address = "192.168.100.10", description = "Linux VM (Telegraf/Grafana)"];
  }

  // Logical connections
  R1 -- COLLECTOR_SERVER;
  S1 -- COLLECTOR_SERVER;
  S2 -- COLLECTOR_SERVER;
}

Objectives

  1. Configure basic gRPC streaming telemetry on Cisco IOS XE, Juniper JunOS, and Arista EOS.
  2. Install and configure Telegraf as a gNMI collector on a Linux VM.
  3. Install and configure Prometheus as a time-series database.
  4. Install and configure Grafana for data visualization.
  5. Observe telemetry data streaming into Grafana dashboards.

Step-by-Step Configuration (Conceptual - requires lab environment setup)

Prerequisites:

  • Three network devices (Cisco IOS XE, Juniper JunOS, Arista EOS) with management interfaces configured and reachable at 192.168.100.1, .2, .3 respectively.
  • A Linux VM (e.g., Ubuntu Server) reachable at 192.168.100.10.
  • Basic network connectivity verified between all devices and the VM.

Step 1: Configure Network Devices for gRPC Telemetry

  • Cisco IOS XE (R1): Use the configuration from the “Cisco IOS XE/XR” section above, replacing 192.168.10.10 with 192.168.100.10. For simplicity, start with protocol grpc no tls if you don’t have PKI set up.
  • Juniper JunOS (S1): Use the configuration from the “Juniper JunOS” section, replacing 192.168.10.10 with 192.168.100.10. Use clear-text for simplicity.
  • Arista EOS (S2): Use the configuration from the “Arista EOS” section, replacing 192.168.10.10 with 192.168.100.10. For simplicity, omit the tls profile line.

Step 2: Install and Configure Telegraf on COLLECTOR_SERVER (192.168.100.10)

# Update package list and install Telegraf
sudo apt update
sudo apt install telegraf

# Generate a sample Telegraf configuration
telegraf --sample-config --input-filter gnmi --output-filter prometheus_client > telegraf.conf

# Edit telegraf.conf (sudo vim telegraf.conf)
# Configure the gNMI input plugin:
[[inputs.gnmi]]
  ## gNMI targets to dial (the plugin calls these "addresses")
  addresses = [
    "192.168.100.1:50051", # Cisco
    "192.168.100.2:50051", # Juniper
    "192.168.100.3:50051"  # Arista
  ]
  username = "your_device_username"
  password = "your_device_password"
  # insecure_skip_verify = true # For labs without proper TLS certs (NOT for production)
  # tls_ca = "/etc/telegraf/certs/ca.pem" # For production with TLS
  # tls_cert = "/etc/telegraf/certs/client.pem"
  # tls_key = "/etc/telegraf/certs/client-key.pem"

  ## One subscription block per path; the paths must match the sensor
  ## groups configured on the devices
  [[inputs.gnmi.subscription]]
    name = "interface_counters" # Measurement name in the TSDB
    origin = "openconfig"
    path = "/interfaces/interface/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"

  [[inputs.gnmi.subscription]]
    name = "junos_interface_stats" # Juniper-native path example
    path = "/junos/system/linecard/interface/"
    subscription_mode = "sample"
    sample_interval = "10s"

# Configure Prometheus client output plugin:
[[outputs.prometheus_client]]
  listen = ":9273" # Telegraf will expose metrics on this port for Prometheus to scrape
  metric_version = 2
  ## Telegraf tags each gNMI metric with a "source" tag identifying the device
Save telegraf.conf to /etc/telegraf/telegraf.d/gnmi.conf or similar.

# Start Telegraf
sudo systemctl enable telegraf
sudo systemctl start telegraf
sudo systemctl status telegraf
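Beyond journalctl, you can sanity-check what Telegraf is actually exposing by fetching its /metrics endpoint and filtering the Prometheus exposition text. The helper below is a sketch; the endpoint URL and the metric-name prefix you search for depend on your Telegraf configuration:

```python
from urllib.request import urlopen

def metric_lines(exposition_text: str, prefix: str) -> list:
    """Return sample lines from Prometheus exposition text whose metric
    name starts with `prefix` (HELP/TYPE comment lines are skipped)."""
    return [line for line in exposition_text.splitlines()
            if line and not line.startswith("#") and line.startswith(prefix)]

# Against the live lab collector (prefix depends on your subscription names):
# text = urlopen("http://192.168.100.10:9273/metrics").read().decode()
# print(metric_lines(text, "interface"))
```

An empty result for your expected prefix usually means the gNMI input is not receiving data, which points the investigation back at the device-side subscription.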

Step 3: Install and Configure Prometheus on COLLECTOR_SERVER

# Download and extract Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.x.x/prometheus-2.x.x.linux-amd64.tar.gz
tar -xvf prometheus-2.x.x.linux-amd64.tar.gz
sudo mv prometheus-2.x.x.linux-amd64 /usr/local/prometheus

# Create prometheus.yml
sudo vim /usr/local/prometheus/prometheus.yml

global:
  scrape_interval: 10s # How frequently Prometheus scrapes targets

scrape_configs:
  - job_name: 'telegraf'
    static_configs:
      - targets: ['localhost:9273'] # Telegraf's Prometheus client endpoint
  - job_name: 'network_devices_ping' # Example for basic reachability monitoring
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 192.168.100.1
          - 192.168.100.2
          - 192.168.100.3
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115 # Replace with your blackbox exporter if used
  • (Optional: install and configure blackbox_exporter if using the ping job).
  • Start Prometheus. You’ll likely want to set it up as a systemd service for production.
/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
# Access Prometheus UI at http://192.168.100.10:9090
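Prometheus can also be checked programmatically through its documented /api/v1/query HTTP endpoint, which is handy for scripted verification. The sketch below separates response parsing (testable offline) from the network call; the base URL and PromQL expression are lab-specific assumptions:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def parse_instant_result(payload: dict) -> list:
    """Turn a Prometheus /api/v1/query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return [(r["metric"], r["value"][1]) for r in payload["data"]["result"]]

def instant_query(base_url: str, promql: str) -> list:
    """Run a PromQL instant query against Prometheus's HTTP API."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    return parse_instant_result(json.loads(urlopen(url).read()))

# Example against the lab collector:
# instant_query("http://192.168.100.10:9090", "up")
```

Querying `up` first is a useful habit: it confirms both that Prometheus is answering and that the Telegraf scrape target is healthy before you debug individual telemetry metrics.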

Step 4: Install and Configure Grafana on COLLECTOR_SERVER

# Install Grafana
sudo apt-get install -y apt-transport-https software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server
# Access Grafana UI at http://192.168.100.10:3000 (default admin/admin)
  • Add Prometheus as Data Source in Grafana:
    • Navigate to Configuration -> Data Sources.
    • Click Add data source, select Prometheus.
    • Set URL to http://localhost:9090.
    • Save & Test.
  • Create a Dashboard:
    • Create a new dashboard.
    • Add a panel.
    • Select your Prometheus data source.
    • Enter a query, e.g., gnmi_telemetry_interface_statistics_in_pkts_total{host="192.168.100.1"} (metric names may vary based on Telegraf’s processing).
    • Watch the data stream in real-time.

Verification Steps

  1. Device-side: Run show telemetry commands on each device to confirm subscriptions are active and connected.
  2. Telegraf: Check Telegraf logs (sudo journalctl -u telegraf -f) for successful gNMI connections and metric collection.
  3. Prometheus: Access http://192.168.100.10:9090/targets to ensure Telegraf’s Prometheus client is being scraped successfully. Use the Prometheus graph explorer to query for metrics like gnmi_telemetry_... to confirm data is present.
  4. Grafana: Create and view dashboards with queries for interface counters, CPU utilization, etc., from your network devices.

Challenge Exercises

  1. Modify a device configuration to stream a different set of metrics (e.g., CPU utilization or routing table size). Update Telegraf and Grafana to visualize this new data.
  2. Change the sample-interval on one device and observe the effect on the Grafana dashboard’s granularity.
  3. Implement basic TLS for gRPC (if you have a simple CA/certificate setup) and update device and Telegraf configurations.
  4. Add SNMP monitoring for sysUpTime from the devices into Prometheus using Telegraf’s SNMP input plugin or Prometheus’s SNMP exporter.

Best Practices Checklist

  • Standardized Data Models: Prioritize OpenConfig YANG models for multi-vendor consistency, falling back to vendor-native YANG when necessary.
  • Secure by Design: Implement TLS/SSL for all telemetry streams. Use strong authentication (client certificates) and granular authorization.
  • Dedicated Management Network: Isolate telemetry traffic on a separate network segment.
  • Scalable Collector Architecture: Design collectors for horizontal scaling and redundancy to handle increasing data volumes.
  • Appropriate Granularity: Tune sample-interval and leverage on-change subscriptions judiciously to avoid overwhelming devices or collectors.
  • Efficient Encoding: Use binary encoding (GPB) for gRPC telemetry.
  • Automated Deployment: Use Ansible or Python to automate the configuration of telemetry subscriptions across all network devices.
  • Version Control: Store all telemetry configurations (device, collector, dashboard) in a version control system (Git) as Infrastructure as Code.
  • Comprehensive Monitoring: Monitor the health and performance of the telemetry pipeline itself (devices’ CPU/memory, collector resources, TSDB health).
  • Actionable Alerting: Configure alerts on significant deviations or anomalies detected from telemetry data.
  • Data Retention Policy: Define and implement data retention policies for your TSDB to manage storage costs and query performance.
  • Documentation: Maintain clear documentation of telemetry paths, data models, collector configurations, and dashboard structures.
  • Regular Audits: Periodically audit telemetry configurations and access controls for security and compliance.

What’s Next

This chapter has equipped you with the foundational knowledge and practical skills to implement modern network monitoring and observability solutions using NetDevOps principles. You’ve seen how streaming telemetry, combined with standardized data models and automation, transforms reactive troubleshooting into proactive network management.

In the next chapter, we will delve into “Advanced Network Analytics and AI/ML for NetDevOps.” Building upon the rich telemetry data you’re now collecting, we will explore techniques for extracting deeper insights, predicting outages, detecting anomalies using machine learning, and integrating these advanced analytics into your continuous improvement pipeline. Get ready to turn data into predictive intelligence!