Skip to main content

Performance Monitoring and Observability

Talawa API includes built-in performance monitoring to help DevOps teams track request-level performance metrics.

Quick Reference

Use the table below to quickly find common commands and access points for performance monitoring:

What You WantHow to Get It
View metrics in browserOpen DevTools → Network tab → Click request → Check Server-Timing header
View metrics via CLIcurl -v http://localhost:4000/graphql → Look for Server-Timing header
Get recent metrics (API)curl http://localhost:4000/metrics/perf
Track custom operationawait ctx.perf?.time('my-op', async () => { ... })
Instrument a mutationwithMutationMetrics({ operationName: "mutation:createUser" }, resolver)
Instrument a querywithQueryMetrics({ operationName: "query:user" }, resolver) or executeWithMetrics(ctx, "query:event", fn)
View DataLoader metricsCheck ops["db:users.byId"] etc. in snapshot
Access aggregated metricsfastify.getMetricsSnapshots(windowMinutes?)

Key Thresholds at a Glance

The following thresholds help you quickly identify whether your system performance is healthy, needs attention, or requires immediate action:

Metric✅ Good⚠️ Warning❌ Critical
totalMs< 500ms500-1000ms> 1000ms
hitRate> 0.70.5-0.7< 0.5
totalOps< 2020-50> 50
ops.db.max< 100ms100-200ms> 200ms
slow arrayEmpty1-5 entries> 5 entries

Acceptable ranges: rationale

The thresholds above are based on typical API and user-experience goals:

ParameterAcceptable rangeWhy
totalMs< 500ms (good), 500–1000ms (warning), > 1000ms (critical)Response time strongly affects perceived performance; under 500ms feels responsive, over 1s risks timeouts and poor UX.
hitRate> 0.7 (good), 0.5–0.7 (warning), < 0.5 (critical)High hit rate reduces database load and latency; below 50% means the cache is not effectively offloading the DB.
totalOps< 20 (good), 20–50 (warning), > 50 (critical)Fewer operations mean less overhead and fewer round-trips; > 50 often indicates N+1 patterns and unnecessary work.
ops.db.max (or per-op max)< 100ms (good), 100–200ms (warning), > 200ms (critical)Single operations over 200ms are treated as "slow" and usually indicate missing indexes, heavy queries, or external latency.
slow array length0 (good), 1–5 (warning), > 5 (critical)Each entry is an operation over the slow threshold; many entries mean multiple bottlenecks to fix.
complexityScore< 100 (simple), 100–500 (moderate), > 1000 (very complex)Higher complexity increases CPU and risk of abuse; very high values may need query depth/limit controls.

Remediation steps to bring parameters to acceptable ranges

When a parameter is outside the acceptable range, use these steps to bring it back:

Parameter out of rangeRemediation steps
totalMs too high1. Identify the main cost: check ops and slow to see which operation types dominate. 2. If DB: add indexes, use DataLoaders, or optimize queries (see ops.db.max and N+1). 3. If cache: improve hit rate (see hitRate). 4. If external APIs: add caching, timeouts, or async processing. 5. Re-run the request and compare totalMs and ops until within target.
hitRate too low1. Confirm cache is enabled and keys are stable (no random or per-request-only keys). 2. Review TTL: increase for read-heavy, rarely changing data. 3. Add cache warming for critical paths after deploy or cold start. 4. Avoid over-invalidation: invalidate only when data actually changes. 5. Re-check cacheHits, cacheMisses, and hitRate after changes.
totalOps too high (e.g. N+1)1. Use DataLoaders for batched lookups by ID (users, organizations, events, etc.). 2. Replace N single-query resolvers with one batched loader call per entity type. 3. Consider JOINs or denormalization where batching is not enough. 4. Add field-level caching for hot, read-heavy fields. 5. Verify with metrics: totalOps and ops.db.count should drop.
ops[name].max or slow entries too high1. For DB: run EXPLAIN ANALYZE on the slow queries, add indexes on filter/sort columns, and simplify or split very heavy queries. 2. For external APIs: add timeouts, response caching, and circuit breakers. 3. Move non-critical work to background jobs where possible. 4. Re-check slow and per-operation max after changes.
complexityScore too high1. Enforce query depth and breadth limits in the GraphQL schema or middleware. 2. Add pagination (e.g. cursor-based) for list fields. 3. Consider cost/complexity analysis and reject or throttle very expensive queries. 4. Re-test with representative queries and confirm score stays within target.

After applying remediation, re-check the same request or workload via /metrics/perf or Server-Timing to confirm parameters are within the acceptable ranges above.

Overview

The Performance Metrics Foundation provides:

  • Server-Timing headers on all HTTP responses for browser-based performance analysis
  • /metrics/perf endpoint returning recent performance snapshots for monitoring dashboards
  • Request-scoped tracking of database operations, cache hits/misses, and total request duration

How to Access Performance Monitoring

Performance monitoring is automatically enabled for all HTTP requests. There are three main ways to access performance metrics:

Server-Timing Headers (Automatic)

Server-Timing headers provide real-time performance breakdown directly in HTTP responses, making them ideal for browser-based debugging and monitoring.

Every HTTP response includes a Server-Timing header with performance metrics. No configuration needed.

Access via Browser DevTools:

  1. Open your browser's Developer Tools (F12 or right-click → Inspect)
  2. Navigate to the Network tab
  3. Make any API request to your Talawa API server
  4. Click on the request in the Network tab
  5. View the Timing or Headers section to see the Server-Timing header

Access via Command Line:

curl -v http://localhost:4000/graphql \
-H "Content-Type: application/json" \
-d '{"query":"{ __typename }"}'

Sample Output:

*   Trying 127.0.0.1:4000...
* Connected to localhost (127.0.0.1) port 4000 (#0)
> POST /graphql HTTP/1.1
> Host: localhost:4000
> Content-Type: application/json
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=utf-8
< Server-Timing: db;dur=45, cache;desc="hit:12|miss:3", total;dur=127
< content-length: 31
<
{"data":{"__typename":"Query"}}

Understanding the Output:

Header ComponentExampleMeaningStatus
db;dur=4545ms database timeTotal time spent in database operations✅ Good (< 100ms)
cache;desc="hit:12|miss:3"12 hits, 3 misses80% cache hit rate✅ Good (> 70%)
total;dur=127127ms totalEnd-to-end request processing time✅ Good (< 500ms)

Server-Timing Header Components:

MetricDescriptionExample
db;dur=XTotal database operation time (ms)db;dur=45
cache;desc="hit:X|miss:Y"Cache hit and miss countscache;desc="hit:12|miss:3"
total;dur=XTotal request duration (ms)total;dur=127

Sample Browser DevTools View (Chrome):

Timing
──────────────────────────────────────
Queueing: 0.52 ms
Stalled: 1.23 ms
DNS Lookup: 0.00 ms
Initial connection: 0.00 ms
SSL: 0.00 ms
Request sent: 0.15 ms
Waiting (TTFB): 127.45 ms ← Server processing time
Content Download: 0.89 ms

Server Timing
──────────────────────────────────────
db: 45.00 ms ← Database operations
cache: hit:12|miss:3
total: 127.00 ms ← Total server time

Viewing in Browser DevTools:

  1. Open your browser's Developer Tools (F12)
  2. Go to the Network tab
  3. Click on any API request
  4. View the Timing tab to see the Server-Timing breakdown

Modern browsers (Chrome 65+, Firefox 61+, Edge 79+) automatically visualize these metrics in the Timing tab.

/metrics/perf Endpoint (API)

The /metrics/perf endpoint provides a JSON feed of performance data, suitable for integration with monitoring dashboards and alerting systems.

Query the metrics endpoint to retrieve recent performance snapshots programmatically:

curl http://localhost:4000/metrics/perf | jq '.'

Sample Output (Healthy System):

{
"recent": [
{
"totalMs": 89,
"totalOps": 4,
"cacheHits": 8,
"cacheMisses": 2,
"hitRate": 0.8,
"ops": {
"db": { "count": 3, "ms": 67.2, "max": 28.5 },
"validation": { "count": 1, "ms": 12.1, "max": 12.1 }
},
"slow": []
}
]
}

Sample Output (System with Performance Issues):

{
"recent": [
{
"totalMs": 3456,
"totalOps": 127,
"cacheHits": 3,
"cacheMisses": 89,
"hitRate": 0.03,
"ops": {
"db": { "count": 124, "ms": 2890.5, "max": 567.8 },
"external-api": { "count": 3, "ms": 456.2, "max": 234.1 }
},
"slow": [
{ "op": "db", "ms": 568 },
{ "op": "db", "ms": 445 },
{ "op": "db", "ms": 389 }
],
"complexityScore": 1250
}
]
}

Field-by-Field Analysis:

FieldHealthy ValueProblematic ValueIssue Detected
totalMs893456❌ Request taking 3.4 seconds
totalOps4127❌ N+1 query pattern (124 DB calls)
hitRate0.8 (80%)0.03 (3%)❌ Cache not working
ops.db.max28.5ms567.8ms❌ Slow database query
slow[]3 entries❌ Multiple slow operations

Response Schema

The /metrics/perf endpoint returns performance data with the following schema:

FieldTypeDescription
recentArrayLast 50 performance snapshots (limited for memory)
totalMsnumberTotal time spent in tracked operations
cacheHitsnumberNumber of cache hits during request
cacheMissesnumberNumber of cache misses during request
opsObjectOperation-level statistics
ops[name].countnumberNumber of times operation was called
ops[name].msnumberTotal milliseconds spent in operation
ops[name].maxnumberMaximum duration for single operation call

Retention

Performance metrics are stored in-memory with the following retention policy:

  • In-Memory Storage: Last 200 snapshots are kept in memory
  • Endpoint Returns: Maximum 50 most recent snapshots
  • No Persistence: Metrics reset on server restart

Programmatic Access (In Code)

Access the performance tracker directly in your GraphQL resolvers or route handlers:

In GraphQL Resolvers:

export const myResolver = async (parent, args, ctx) => {
// Access the performance tracker
const perf = ctx.perf;

// Track a custom operation
const result = await perf?.time('external-api-call', async () => {
return await fetch('https://api.example.com/data');
});

// Get current snapshot for logging
const snapshot = perf?.snapshot();
console.log(`Operations so far: ${snapshot?.totalOps}`);

return result;
};

Sample Console Output:

Operations so far: 3

In Route Handlers:

app.get('/health', async (req, reply) => {
const perf = req.perf;
const snapshot = perf?.snapshot();

return {
status: 'healthy',
metrics: snapshot
};
});

Sample Response:

{
"status": "healthy",
"metrics": {
"totalMs": 45,
"totalOps": 2,
"cacheHits": 5,
"cacheMisses": 1,
"hitRate": 0.83,
"ops": {
"db": { "count": 2, "ms": 38.5, "max": 22.1 }
},
"slow": []
}
}

Available Methods:

MethodDescriptionExample
perf.time(name, asyncFn)Track async operation durationawait perf.time('db-query', async () => {...})
perf.start(name)Start manual timingconst stop = perf.start('compute'); ... stop();
perf.trackDb(ms)Record database operationperf.trackDb(45.2)
perf.trackCacheHit()Record cache hitperf.trackCacheHit()
perf.trackCacheMiss()Record cache missperf.trackCacheMiss()
perf.trackComplexity(score)Track GraphQL complexityperf.trackComplexity(150)
perf.snapshot()Get current metricsconst metrics = perf.snapshot()

Automatic DataLoader Metrics Tracking

DataLoaders automatically track performance metrics when a performance tracker is provided to createDataloaders(). This provides visibility into batch loading operation timing and helps identify N+1 query patterns. Note: This is separate from OpenTelemetry tracing (see Database Operation Tracing); metrics tracking focuses on performance timing, while tracing focuses on distributed request flow.

How It Works:

When you create DataLoaders with a performance tracker:

const dataloaders = createDataloaders(db, cache, ctx.perf);
const user = await dataloaders.user.load(userId);

The DataLoader batch function is automatically wrapped with performance metrics tracking. Each batch operation is tracked with the operation name pattern: db:{loaderName}.byId. These metrics appear in the performance snapshot (accessible via /metrics/perf endpoint or ctx.perf?.snapshot()).

Operation Names:

DataLoaderOperation Name
User Loaderdb:users.byId
Organization Loaderdb:organizations.byId
Event Loaderdb:events.byId
Action Item Loaderdb:actionItems.byId

Example Snapshot with DataLoader Metrics:

{
"totalMs": 156,
"totalOps": 3,
"ops": {
"db:users.byId": { "count": 1, "ms": 45.2, "max": 45.2 },
"db:organizations.byId": { "count": 1, "ms": 32.1, "max": 32.1 },
"db:events.byId": { "count": 1, "ms": 28.7, "max": 28.7 }
}
}

Benefits:

  • Automatic tracking - no manual instrumentation needed
  • Identifies N+1 query patterns (high count values)
  • Detects slow batch operations (high max values)
  • Works seamlessly with cache metrics (cache hits reduce DB calls)

Performance Metrics Reference

Complete Metrics Schema

MetricTypeDescriptionAcceptable Range
totalMsnumberTotal time (ms) in all tracked operations< 500ms (good), < 1000ms (acceptable), > 1000ms (investigate)
totalOpsnumberTotal count of tracked operationsSimple: 1-5, Complex: 5-20, > 50 (N+1 problem)
cacheHitsnumberSuccessful cache retrievalsHigher is better, target > 70% of total
cacheMissesnumberCache lookups requiring database queriesLower is better, target < 30% of total
hitRatenumber (0-1)Cache hit rate: hits / (hits + misses)> 0.7 (good), 0.5-0.7 (acceptable), < 0.5 (optimize)
ops[name].countnumberTimes a specific operation was calledVaries by operation type
ops[name].msnumberTotal time (ms) in a specific operationDB: < 100ms, Cache: < 10ms
ops[name].maxnumberMaximum duration (ms) for single operationDB: < 200ms, Cache: < 50ms
slowArrayOperations exceeding 200ms thresholdEmpty (ideal), < 5 (acceptable), > 10 (investigate)
complexityScorenumberGraphQL query complexity score< 100 (simple), 100-500 (moderate), > 1000 (very complex)

Interpreting Results

✅ Good Performance Indicators:

  • totalMs < 500ms for most requests
  • hitRate > 0.7 (70% cache hit rate)
  • slow array is empty or has < 5 entries
  • ops.db.max < 200ms (no slow database queries)

⚠️ Warning Signs:

  • totalMs consistently > 1000ms
  • hitRate < 0.5 (poor cache efficiency)
  • slow array frequently has > 10 entries
  • ops.db.max > 500ms (very slow database queries)
  • totalOps > 50 (possible N+1 query problem)

❌ Action Required:

  • totalMs > 2000ms (request timeout risk)
  • hitRate < 0.3 (cache not working effectively)
  • complexityScore > 1000 (query may be too expensive)
  • Multiple operations with max > 1000ms

Diagnosing Performance Issues

High Response Time (totalMs > 1000ms)

When response times exceed 1000ms, use the following diagnostic approach to identify the bottleneck:

Diagnostic Command:

curl http://localhost:4000/metrics/perf | jq '.recent[0] | {totalMs, ops, slow}'

Sample Output Indicating Problem:

{
"totalMs": 2340,
"ops": {
"db": { "count": 15, "ms": 1890.5, "max": 456.2 },
"external-api": { "count": 3, "ms": 420.1, "max": 180.3 }
},
"slow": [
{ "op": "db", "ms": 456 },
{ "op": "db", "ms": 312 }
]
}

Analysis:

  • Database operations taking 1890ms (80% of total time)
  • 15 database operations suggest potential N+1 issue
  • Two slow queries (456ms, 312ms) need optimization

Recommendations:

  1. Add database indexes for frequently queried columns
  2. Implement DataLoader to batch the 15 database operations
  3. Use EXPLAIN ANALYZE to identify slow query plans
  4. Cache external API responses to reduce 420ms overhead

Low Cache Hit Rate (hitRate < 0.5)

Diagnostic Command:

curl http://localhost:4000/metrics/perf | jq '.recent[0] | {cacheHits, cacheMisses, hitRate}'

Sample Output Indicating Problem:

{
"cacheHits": 2,
"cacheMisses": 18,
"hitRate": 0.1
}

Analysis:

  • Only 10% cache hit rate (2 hits vs 18 misses)
  • Cache is not being utilized effectively

Recommendations:

  1. Review cache keys - ensure they're consistent and not including request-specific data
  2. Increase TTL - if data doesn't change frequently, extend cache time-to-live
  3. Implement cache warming - pre-populate cache for frequently accessed data on startup
  4. Check cache invalidation - overly aggressive invalidation causes low hit rates

N+1 Query Pattern (totalOps > 50)

Diagnostic Command:

curl http://localhost:4000/metrics/perf | jq '.recent[0] | {totalOps, "db_count": .ops.db.count}'

Sample Output Indicating Problem:

{
"totalOps": 87,
"db_count": 82
}

Analysis:

  • 82 database operations for a single request
  • Classic N+1 query pattern (1 query + N related queries)

Recommendations:

  1. Implement DataLoader for batching related queries
  2. Use JOIN queries to fetch related data in a single query
  3. Review resolver structure - check if resolvers make individual queries for related data
  4. Add field-level caching for frequently accessed fields

Slow Operations in slow Array

When operations appear in the slow array, they have exceeded the 200ms threshold and require investigation:

Diagnostic Command:

curl http://localhost:4000/metrics/perf | jq '.recent[0].slow'

Sample Output Indicating Problem:

[
{ "op": "db", "ms": 567 },
{ "op": "db", "ms": 445 },
{ "op": "external-api", "ms": 312 }
]

Analysis:

  • Three operations exceeded the 200ms slow threshold
  • Two database queries and one external API call need attention

Recommendations:

  1. For slow database queries:

    • Run EXPLAIN ANALYZE to identify missing indexes
    • Add indexes on frequently queried columns
    • Consider query restructuring or denormalization
  2. For slow external API calls:

    • Implement circuit breakers for resilience
    • Add response caching with appropriate TTL
    • Consider async processing for non-critical calls

Real-World Scenarios

The following scenarios demonstrate how to use performance metrics to diagnose and resolve common issues in production environments.

Scenario 1: Optimizing a User List Query

This scenario shows how DataLoader batching can dramatically reduce database load and response times.

Before Optimization:

curl http://localhost:4000/metrics/perf | jq '.recent[0]'
{
"totalMs": 1850,
"totalOps": 52,
"cacheHits": 0,
"cacheMisses": 50,
"hitRate": 0,
"ops": {
"db": { "count": 51, "ms": 1720.5, "max": 89.2 }
},
"slow": []
}

Problem: Fetching 50 users triggers 51 database queries (1 for list + 50 for related data).

After Implementing DataLoader:

{
"totalMs": 156,
"totalOps": 3,
"cacheHits": 48,
"cacheMisses": 2,
"hitRate": 0.96,
"ops": {
"db": { "count": 2, "ms": 89.3, "max": 52.1 }
},
"slow": []
}

Improvement:

  • Response time: 1850ms → 156ms (92% faster)
  • Database queries: 51 → 2 (96% reduction)
  • Cache hit rate: 0% → 96%

Scenario 2: Identifying a Slow Database Query

This scenario demonstrates how to identify and fix a slow database query using performance metrics and database tooling.

Metrics showing the problem:

{
"totalMs": 892,
"ops": {
"db": { "count": 3, "ms": 845.2, "max": 678.5 }
},
"slow": [
{ "op": "db", "ms": 679 }
]
}

Investigation steps:

  1. One query taking 678ms out of 845ms total DB time
  2. Enable query logging to identify the slow query
  3. Run EXPLAIN ANALYZE on the identified query

After adding index:

{
"totalMs": 124,
"ops": {
"db": { "count": 3, "ms": 78.4, "max": 35.2 }
},
"slow": []
}

Improvement:

  • Response time: 892ms → 124ms (86% faster)
  • Max query time: 678ms → 35ms (95% faster)

Scenario 3: Cache Warming on Startup

This scenario shows the impact of cache warming on first-request performance after a cold start.

Before cache warming (cold start):

{
"totalMs": 456,
"cacheHits": 0,
"cacheMisses": 12,
"hitRate": 0,
"ops": {
"db": { "count": 12, "ms": 398.5, "max": 45.2 }
}
}

After cache warming:

{
"totalMs": 67,
"cacheHits": 12,
"cacheMisses": 0,
"hitRate": 1,
"ops": {
"db": { "count": 0, "ms": 0, "max": 0 }
}
}

Improvement:

  • Response time: 456ms → 67ms (85% faster)
  • Database queries: 12 → 0 (100% reduction)
  • Cache hit rate: 0% → 100%

Integration Examples

Monitoring Dashboard

Poll the /metrics/perf endpoint to display real-time performance:

// Fetch performance metrics every 5 seconds
async function monitorPerformance() {
const response = await fetch('http://localhost:4000/metrics/perf');
const data = await response.json();

if (data.recent.length === 0) {
console.log('No metrics available yet');
return;
}

// Calculate averages
const avgTotalMs = data.recent.reduce((sum, s) => sum + s.totalMs, 0) / data.recent.length;
const avgHitRate = data.recent.reduce((sum, s) => sum + s.hitRate, 0) / data.recent.length;
const avgDbTime = data.recent.reduce((sum, s) => sum + (s.ops.db?.ms || 0), 0) / data.recent.length;

console.log(`
Performance Summary (last ${data.recent.length} requests):
─────────────────────────────────────────────────
Avg Response Time: ${avgTotalMs.toFixed(2)}ms ${avgTotalMs < 500 ? '✅' : avgTotalMs < 1000 ? '⚠️' : '❌'}
Avg Cache Hit Rate: ${(avgHitRate * 100).toFixed(1)}% ${avgHitRate > 0.7 ? '✅' : avgHitRate > 0.5 ? '⚠️' : '❌'}
Avg DB Time: ${avgDbTime.toFixed(2)}ms ${avgDbTime < 100 ? '✅' : avgDbTime < 500 ? '⚠️' : '❌'}
`);
}

setInterval(monitorPerformance, 5000);

Sample Console Output:

Performance Summary (last 50 requests):
─────────────────────────────────────────────────
Avg Response Time: 156.78ms ✅
Avg Cache Hit Rate: 76.3% ✅
Avg DB Time: 89.45ms ✅

Custom Alerting Script

The following Python script monitors performance metrics and sends alerts when thresholds are exceeded:

import requests
import time
from datetime import datetime

THRESHOLDS = {
'totalMs': 500,
'hitRate': 0.5,
'totalOps': 50,
'dbMax': 200
}

def check_metrics():
response = requests.get('http://localhost:4000/metrics/perf')
data = response.json()

alerts = []
for req in data.get('recent', []):
if req['totalMs'] > THRESHOLDS['totalMs']:
alerts.append(f"Slow request: {req['totalMs']}ms (threshold: {THRESHOLDS['totalMs']}ms)")

if req['hitRate'] < THRESHOLDS['hitRate']:
alerts.append(f"Low cache hit rate: {req['hitRate']*100:.1f}% (threshold: {THRESHOLDS['hitRate']*100}%)")

if req['totalOps'] > THRESHOLDS['totalOps']:
alerts.append(f"High operation count: {req['totalOps']} (threshold: {THRESHOLDS['totalOps']})")

db_max = req.get('ops', {}).get('db', {}).get('max', 0)
if db_max > THRESHOLDS['dbMax']:
alerts.append(f"Slow DB query: {db_max}ms (threshold: {THRESHOLDS['dbMax']}ms)")

if alerts:
print(f"\n⚠️ ALERTS at {datetime.now().strftime('%H:%M:%S')}:")
for alert in alerts:
print(f" • {alert}")
else:
print(f"✅ All metrics healthy at {datetime.now().strftime('%H:%M:%S')}")

while True:
check_metrics()
time.sleep(10)

Sample Alert Output:

⚠️  ALERTS at 14:32:15:
• Slow request: 1234ms (threshold: 500ms)
• Low cache hit rate: 23.5% (threshold: 50.0%)
• High operation count: 89 (threshold: 50)

✅ All metrics healthy at 14:32:25

⚠️ ALERTS at 14:32:35:
• Slow DB query: 456ms (threshold: 200ms)

Metrics Configuration

Customize the metrics system behavior using environment variables to match your deployment needs.

Environment Variables

The metrics aggregation system is configurable through the following environment variables:

VariableDefaultDescription
API_METRICS_ENABLEDtrueMaster switch to enable or disable metrics collection and aggregation
API_METRICS_AGGREGATION_ENABLEDtrueEnable or disable the metrics aggregation background worker
API_METRICS_AGGREGATION_CRON_SCHEDULE*/5 * * * *Cron schedule for running metrics aggregation (default: every 5 minutes)
API_METRICS_AGGREGATION_WINDOW_MINUTES5Time window in minutes for aggregating snapshots
API_METRICS_SNAPSHOT_RETENTION_COUNT1000Maximum number of performance snapshots to retain in memory
API_METRICS_SLOW_OPERATION_MS200Threshold in milliseconds for considering an operation as slow
API_METRICS_SLOW_REQUEST_MS500Threshold in milliseconds for considering a request as slow
API_METRICS_CACHE_TTL_SECONDS300Time-to-live in seconds for cached aggregated metrics (5 minutes)
API_METRICS_API_KEY(none)API key to protect the /metrics/perf endpoint. When set, requests require Authorization: Bearer <key> header

Example Configuration

# Enable metrics collection (default: true)
API_METRICS_ENABLED=true

# Enable metrics aggregation (default: true)
API_METRICS_AGGREGATION_ENABLED=true

# Run aggregation every 10 minutes
API_METRICS_AGGREGATION_CRON_SCHEDULE="*/10 * * * *"

# Aggregate last 10 minutes of snapshots
API_METRICS_AGGREGATION_WINDOW_MINUTES=10

# Keep last 2000 snapshots in memory
API_METRICS_SNAPSHOT_RETENTION_COUNT=2000

# Configure slow operation threshold (default: 200ms)
API_METRICS_SLOW_OPERATION_MS=300

# Configure slow request threshold (default: 500ms)
API_METRICS_SLOW_REQUEST_MS=1000

# Configure metrics cache TTL (default: 300 seconds)
API_METRICS_CACHE_TTL_SECONDS=600

# Protect metrics endpoint with API key (recommended for production)
API_METRICS_API_KEY=your-secure-api-key-here

Accessing Protected Metrics

When the API_METRICS_API_KEY environment variable is set, all requests to the metrics endpoint require authentication. Use the Authorization header to access protected endpoints:

curl -H "Authorization: Bearer your-secure-api-key-here" \
http://localhost:4000/metrics/perf

Sample Output:

{
"recent": [
{
"totalMs": 89,
"totalOps": 4,
"cacheHits": 8,
"cacheMisses": 2,
"hitRate": 0.8
}
]
}

Metrics Aggregation Output

The background worker automatically aggregates performance snapshots at the configured interval (default: every 5 minutes) and logs the results to standard output. This log appears in your server console without any command - it runs automatically based on the API_METRICS_AGGREGATION_CRON_SCHEDULE setting.

Example Server Console Log:

{
"level": 30,
"time": 1705234567890,
"msg": "Metrics aggregation completed successfully",
"duration": "15ms",
"windowMinutes": 5,
"snapshotCount": 120,
"requestCount": 120,
"operationCount": 4,
"slowOperationCount": 2,
"cacheHitRate": "85.50%"
}

Note: This log appears automatically in your server console. To view it, simply monitor your server logs with pnpm run dev or check your logging system in production.

Key Fields:

  • windowMinutes: Time window used for aggregation
  • snapshotCount: Number of snapshots processed
  • requestCount: Total requests in the window
  • slowOperationCount: Number of operations potentially needing optimization
  • cacheHitRate: Global cache efficiency for the period

Programmatic Access to Aggregated Metrics

Access aggregated metrics programmatically via the Fastify instance:

// Get all recent snapshots
const snapshots = fastify.getMetricsSnapshots();

// Get snapshots from last 10 minutes
const recentSnapshots = fastify.getMetricsSnapshots(10);

// Use in custom endpoints or monitoring
fastify.get("/admin/metrics", async (req, reply) => {
const snapshots = fastify.getMetricsSnapshots(5); // Last 5 minutes
return {
window: "5 minutes",
count: snapshots.length,
snapshots: snapshots.slice(0, 20) // Return top 20
};
});

Use Cases:

  • Custom monitoring dashboards
  • Alerting systems
  • Performance trend analysis
  • Capacity planning

Metrics Caching

Aggregated metrics can be cached to improve performance when accessing metrics data. The metrics cache service provides:

  • Timestamp-based caching: Cache metrics by timestamp identifier
  • Time-windowed caching: Cache hourly and daily aggregations with longer TTLs
  • Configurable TTL: Control cache expiration via API_METRICS_CACHE_TTL_SECONDS

Cache Key Patterns:

  • Timestamp-based: talawa:v1:metrics:aggregated:{timestamp}
  • Hourly: talawa:v1:metrics:aggregated:hourly:{YYYY-MM-DD-HH}
  • Daily: talawa:v1:metrics:aggregated:daily:{YYYY-MM-DD}

Default TTLs:

  • Timestamp-based metrics: 300 seconds (5 minutes)
  • Hourly aggregations: 3600 seconds (1 hour)
  • Daily aggregations: 86400 seconds (24 hours)

The metrics cache service is automatically initialized when the performance plugin starts and is available via fastify.metricsCache. Cache failures are handled gracefully and do not affect metrics collection.

Programmatic Access to Cached Metrics

Access cached aggregated metrics via the metrics cache service (fastify.metricsCache):

// Get by timestamp
const cached = await fastify.metricsCache.getCachedMetrics(timestamp);

// Get by time window (hourly or daily)
const hourly = await fastify.metricsCache.getCachedMetricsByWindow("hourly", "2024-01-15-14");
const daily = await fastify.metricsCache.getCachedMetricsByWindow("daily", "2024-01-15");

// Cache aggregated metrics (metrics object first, then timestamp, then optional TTL)
await fastify.metricsCache.cacheAggregatedMetrics(aggregatedMetrics, timestamp, 600);

// Invalidate: no args = all metrics; pattern = e.g. "aggregated:hourly:*"
await fastify.metricsCache.invalidateMetricsCache();
await fastify.metricsCache.invalidateMetricsCache("aggregated:hourly:*");

Cache Invalidation Strategies:

  • Time-based: Let TTL expire naturally (default)
  • Event-based: Invalidate on significant events (deployments, config changes)
  • Manual: Invalidate specific time windows when data is known to be stale

Extending Performance Tracking

You can extend the built-in performance tracking by adding custom operations and manual timing to monitor application-specific code paths.

Resolver-level instrumentation (withMutationMetrics / withQueryMetrics)

The codebase provides helpers that wrap GraphQL resolvers for consistent operation naming and timing:

  • withMutationMetrics (from ~/src/graphql/utils/withMutationMetrics) — wraps mutation resolvers with ctx.perf?.time(operationName, ...) so each mutation is recorded under a name like mutation:createUser. Options: { operationName: string }.
  • withQueryMetrics — analogous for query resolvers (see Instrumenting GraphQL Queries).

These utilities only add resolver-level timing. They do not implement request-scoped wiring, Server-Timing header integration, aggregation workers, the /metrics/perf endpoint, or DataLoader/cache instrumentation; those are part of the broader performance foundation and are documented elsewhere in this guide. When linking PRs to issues that cover the full foundation, consider either (a) limiting the issue link to the resolver-level change and adding a checklist of remaining tasks, or (b) implementing the remaining pieces before marking the issue resolved.

Mutations: withMutationMetrics and manual patterns

Track performance metrics for GraphQL mutations to identify slow operations.

Using the Helper

The withMutationMetrics helper takes an options object with operationName and the resolver:

import { withMutationMetrics } from "~/src/graphql/utils/withMutationMetrics";

const resolve = withMutationMetrics(
{ operationName: "mutation:createUser" },
async (_parent, args, ctx) => {
// Your mutation logic here
const user = await ctx.drizzleClient.insert(usersTable).values({...});
return user;
},
);
Manual Instrumentation

For more control, instrument mutations manually:

export const createEvent = async (parent, args, ctx) => {
return await ctx.perf?.time("mutation:createEvent", async () => {
// Main mutation logic
const event = await createEventInDb(args);

// Track sub-operations
await ctx.perf?.time("mutation:createEvent:notification:enqueue", async () => {
ctx.notification?.enqueueEventCreated({...});
});

await ctx.perf?.time("mutation:createEvent:notification:flush", async () => {
await ctx.notification?.flush(ctx);
});

return event;
});
};
Operation Naming

Mutations follow the pattern: mutation:{mutationName}

MutationOperation Name
createUsermutation:createUser
createOrganizationmutation:createOrganization
createEventmutation:createEvent
updateOrganizationmutation:updateOrganization
deleteOrganizationmutation:deleteOrganization
Sub-operation Tracking

For complex mutations with multiple steps, track sub-operations:

// Track notification operations separately
"mutation:createEvent:notification:enqueue"
"mutation:createEvent:notification:flush"

// Track validation separately from DB operations
"mutation:updateOrganization:validation"
"mutation:updateOrganization:database"

This provides granular visibility into where time is spent within complex mutations.

Instrumenting GraphQL Queries

Track performance metrics for GraphQL queries to identify slow operations and optimize data fetching.

Using the Helper Utility

The withQueryMetrics helper takes an options object with operationName and the resolver:

import { withQueryMetrics } from "~/src/graphql/utils/withQueryMetrics";

const resolve = withQueryMetrics(
{ operationName: "query:user" },
async (_parent, args, ctx) => {
const user = await ctx.drizzleClient.query.usersTable.findFirst({
where: (f, op) => op.eq(f.id, args.input.id),
});
return user;
},
);

Inline Execution: executeWithMetrics

For inline execution (e.g. inside a resolver with branching logic), use executeWithMetrics:

import { executeWithMetrics } from "~/src/graphql/utils/withQueryMetrics";

// Inside resolver
return await executeWithMetrics(ctx, "query:event", async () => {
return await fetchEvent(args.id, ctx);
});

Manual Instrumentation

For more control, instrument queries manually:

export const organizations = async (parent, args, ctx) => {
return await ctx.perf?.time("query:organizations", async () => {
// Query logic with filtering, pagination
const orgs = await ctx.drizzleClient.query.organizationsTable.findMany({
where: (fields, ops) => ops.ilike(fields.name, `%${args.filter}%`),
limit: args.limit,
offset: args.offset,
});
return orgs;
});
};

Operation Naming Convention

Queries follow the naming pattern: query:{queryName}

QueryOperation Name
userquery:user
organizationquery:organization
eventquery:event
organizationsquery:organizations

Integration with DataLoader Metrics

Query instrumentation works seamlessly with DataLoader metrics. When a query uses DataLoaders, both metrics appear in the snapshot:

{
"totalMs": 234,
"ops": {
"query:organization": { "count": 1, "ms": 234, "max": 234 },
"db:organizations.byId": { "count": 1, "ms": 45, "max": 45 },
"db:users.byId": { "count": 1, "ms": 32, "max": 32 }
}
}

This shows:

  • Total query time: 234ms
  • Organization DataLoader: 45ms
  • User DataLoader (for creator): 32ms
  • Overhead (validation, etc.): ~157ms

Custom Operations

Track custom operations in your code to measure specific functionality. For detailed examples of mutation and query instrumentation, see Resolver-level instrumentation (withMutationMetrics / withQueryMetrics) and Instrumenting GraphQL Queries.

// In a GraphQL resolver
export const myResolver = async (parent, args, ctx) => {
// Track external API call
const result = await ctx.request.perf?.time('external-api', async () => {
return await fetch('https://api.example.com/data');
});

return result;
};

Manual Timing

For non-async operations, use manual timing. For resolver-level examples using perf.time(), see Mutations: withMutationMetrics and manual patterns and Instrumenting GraphQL Queries.

const stopTimer = ctx.request.perf?.start('computation');
// ... expensive computation ...
stopTimer?.();

Production Considerations

Before deploying performance monitoring to production, review the following security, performance, and data retention guidelines.

Security

⚠️ IMPORTANT: The /metrics/perf endpoint supports API key authentication via the API_METRICS_API_KEY environment variable. For production deployments:

  • Set API_METRICS_API_KEY to protect the endpoint (recommended)
  • Use network-level controls (VPN, internal-only access) as additional security
  • If API_METRICS_API_KEY is not set, the endpoint is unprotected (suitable for development only)

Performance Impact

The performance tracking system is designed to have minimal overhead on your application:

AspectImpactDetails
Memory~100KB~200 snapshots × ~500 bytes
CPUNegligibleSimple timestamp math
I/ONoneAll metrics stored in-memory
Latency< 1msMinimal overhead per request

Data Retention

  • In-Memory Storage: Last 200 snapshots kept in memory
  • Endpoint Returns: Maximum 50 most recent snapshots
  • No Persistence: Metrics reset on server restart
  • No PII: Metrics contain only timing data, no user information

Critical validations (mutation instrumentation)

When adding or changing resolver-level mutation instrumentation (e.g. withMutationMetrics, createEvent notification tracking), complete these validations before merge:

ValidationRequired actionsAddressed in repo
1. Transaction boundary (createEvent)Notification enqueue runs outside the database transaction (after commit). If enqueue fails, DB changes are not rolled back (fire-and-forget). Accept and document.Design documented in createEvent.ts (comment above notification enqueue block).
2. Performance overheadRun load tests measuring per-mutation overhead (target: <1–2 ms). Verify no memory leaks; confirm perf.snapshot() does not delay responses.Automated test test/graphql/utils/withMutationMetrics.overhead.test.ts asserts average overhead <2 ms per mutation. Leak/snapshot checks: run under load if needed.
3. Error stack trace preservationVerify perf.time() wrapper does not swallow error stack traces; full error context preserved for debugging.Automated test in withMutationMetrics.test.ts: "should preserve error stack trace when resolver throws" (same error reference, non-empty stack).

Troubleshooting

Use the following diagnostic commands and solutions to resolve common performance monitoring issues.

Metrics Not Appearing

Diagnostic Command:

curl -v http://localhost:4000/graphql \
-H "Content-Type: application/json" \
-d '{"query":"{ __typename }"}'

Expected Output (look for Server-Timing header):

< Server-Timing: db;dur=0, cache;desc="hit:0|miss:0", total;dur=15

If header is missing:

  1. Check plugin registration in src/fastifyPlugins/index.ts
  2. Verify the performance plugin is loaded
  3. Check for errors in server startup logs

Endpoint Returns Empty Array

An empty recent array is normal immediately after server startup since metrics are stored in-memory. Diagnostic Command:

curl http://localhost:4000/metrics/perf | jq '.recent | length'

Sample Output:

0

If output is 0:

  • Normal after server restart (metrics are in-memory)
  • Make some API requests to populate snapshots:
# Populate metrics with test requests
for i in {1..5}; do
curl -s http://localhost:4000/graphql \
-H "Content-Type: application/json" \
-d '{"query":"{ __typename }"}' > /dev/null
done

# Verify metrics are now available
curl http://localhost:4000/metrics/perf | jq '.recent | length'

Sample Output:

5

High totalMs Values

If totalMs is unexpectedly high, use the following diagnostic steps:

  1. Check ops.db.ms for slow database queries
  2. Review cacheMisses count - high misses mean more database hits
  3. Inspect individual operation max times for bottlenecks

Diagnostic Command:

curl http://localhost:4000/metrics/perf | jq '.recent[0] | {totalMs, ops, cacheMisses}'

Sample Output Indicating Issue:

{
"totalMs": 2450,
"ops": {
"db": { "count": 45, "ms": 2100, "max": 890 }
},
"cacheMisses": 42
}

This output shows 42 cache misses causing 45 database queries totaling 2100ms - the root cause of the high response time.

Browser Not Showing Server-Timing

If the Server-Timing header is not visible in your browser's developer tools, verify the following: Requirements:

  • Chrome 65+ / Firefox 61+ / Edge 79+
  • HTTPS connection (some browsers require secure context)
  • No browser extensions blocking headers

Verification:

  1. Open DevTools → Network tab
  2. Click on the request
  3. Check "Headers" tab for Server-Timing
  4. Check "Timing" tab for visual breakdown

Observability with OpenTelemetry

Talawa API uses OpenTelemetry to provide end-to-end visibility into incoming requests, internal processing, and outgoing calls.

Key Capabilities

  • End-to-end request tracing
  • Context propagation across services
  • Configurable sampling
  • Zero-overhead when disabled
  • Console-based tracing for local development
  • OTLP-based exporting for production/development

Configuration

Tracing behavior is controlled through environment variables:

VariableRequiredDescriptionDefault
API_OTEL_ENABLEDYesEnable/disable tracingfalse
API_OTEL_EXPORTER_TYPEYesRuntime environment (console or otlp)console
API_OTEL_TRACE_EXPORTER_ENDPOINTNoOTLP HTTP endpoint for traces
API_OTEL_METRIC_EXPORTER_ENDPOINTNoOTLP HTTP endpoint for metrics
API_OTEL_SAMPLING_RATIONoTrace sampling ratio (0-1)1
API_OTEL_SERVICE_NAMENoService identifiertalawa-api
API_OTEL_EXPORTER_ENABLEDYesEnable spans exportingfalse

Recommended Settings by Environment:

EnvironmentENABLEDSAMPLING_RATIONotes
Local/Devtrue1Capture all traces for debugging
Stagingtrue0.5Balance visibility and overhead
Production (low traffic)true0.3Good visibility
Production (high traffic)true0.1Reduce telemetry cost

Sampling Behavior Example:

API_OTEL_SAMPLING_RATIO=0.1
  • ~10% of new root traces will be sampled
  • Child spans automatically inherit the parent's sampling decision
  • This allows high-volume production systems to reduce telemetry cost without losing trace continuity

Tracing Modes

Talawa API supports two distinct tracing modes: Console Mode for local development and debugging, and External Mode for production monitoring with external dashboard services.

Console Mode (Local Development)

Console mode outputs spans directly to stdout, ideal for local development and debugging without external dependencies.

Configuration:

API_OTEL_ENABLED=true
API_OTEL_EXPORTER_TYPE=console
API_OTEL_EXPORTER_ENABLED=true
API_OTEL_SAMPLING_RATIO=1
API_OTEL_SERVICE_NAME=talawa-api

External Mode (Production Monitoring)

External mode exports spans to an external OpenTelemetry-compatible dashboard service using OTLP protocol, enabling centralized monitoring and analysis.

Configuration:

API_OTEL_ENABLED=true
API_OTEL_EXPORTER_TYPE=otlp
API_OTEL_EXPORTER_ENABLED=true
API_OTEL_TRACE_EXPORTER_ENDPOINT=your-service-endpoint
API_OTEL_METRIC_EXPORTER_ENDPOINT=your-service-endpoint
API_OTEL_SAMPLING_RATIO=0.5
API_OTEL_SERVICE_NAME=talawa-api

W3C Trace Context Propagation

Talawa API supports distributed tracing by automatically propagating trace context across service boundaries using the W3C standard format.

Talawa API automatically propagates trace context using the traceparent header:

Disabling Tracing

To completely disable OpenTelemetry tracing with zero runtime overhead, set the following environment variable:

API_OTEL_ENABLED=false

When disabled:

  • SDK is not initialized
  • No instrumentation is registered
  • No performance impact is introduced

Start the server and make a request:

Terminal 1: Start server

pnpm run dev

Terminal 2: Make request

curl http://localhost:4000/graphql \
-H "Content-Type: application/json" \
-d '{"query":"{ __typename }"}'

Expected Console Output:

{
"traceId": "347cb51fc1fd41662d768bb1142acff1",
"parentId": undefined,
"id": "fd6b2a8046d8cc77",
"name": "POST /graphql",
"kind": 1,
"timestamp": 1705312456789000,
"duration": 19356500,
"attributes": {
"http.method": "POST",
"http.url": "http://localhost:4000/graphql",
"http.target": "/graphql",
"http.status_code": 200
},
"status": { "code": 0 }
}

Understanding the Output:

FieldExampleMeaning
traceId347cb51...Unique identifier for the entire trace
idfd6b2a80...Unique identifier for this span
namePOST /graphqlOperation being traced
kind1SERVER span (incoming request)
duration19356500Duration in nanoseconds (19.36ms)
status.code00=OK, 1=ERROR

W3C Trace Context Propagation

Talawa API supports distributed tracing by automatically propagating trace context across service boundaries using the W3C standard format.

Talawa API automatically propagates trace context using the traceparent header:

curl -H "traceparent: 00-a1b2c3d4e5f67890abcdef1234567890-1234567890abcdef-01" \
-H "Content-Type: application/json" \
-d '{"query":"{ __typename }"}' \
http://localhost:4000/graphql

Expected Behavior:

  • The response will include {"data":{"__typename":"Query"}}
  • In the server console, the trace will show the provided traceId (a1b2c3d4e5f67890abcdef1234567890)
  • Child spans will inherit this trace context, enabling end-to-end distributed tracing

The trace will continue with the provided traceId, enabling distributed tracing across services.

Disabling Tracing

To completely disable OpenTelemetry tracing with zero runtime overhead, set the following environment variable:

API_OTEL_ENABLED=false

When disabled:

  • SDK is not initialized
  • No instrumentation is registered
  • No performance impact is introduced

Database Operation Tracing

Talawa API instruments database operations using the traceable utility, providing visibility into database queries through DataLoaders.

Trace Hierarchy

Database operations create child spans under their parent context:

HTTP Request → GraphQL Resolver → DataLoader → db:users.batchLoad

This enables end-to-end tracing, bottleneck identification, and N+1 query detection.

Usage

Import:

import { traceable } from "~/src/utilities/db/traceableQuery";

Basic Usage:

const users = await traceable("users", "batchLoad", async () => {
return await db.select().from(usersTable).where(inArray(usersTable.id, userIds));
});

DataLoader Example:

export const createUserLoader = (): DataLoader<string, User | null> => {
return new DataLoader(async (userIds) => {
const rows = await traceable("users", "batchLoad", async () => {
return await db.select().from(usersTable).where(inArray(usersTable.id, [...userIds]));
});
const userMap = new Map(rows.map((user) => [user.id, user]));
return userIds.map((id) => userMap.get(id) ?? null);
});
};

Span Attributes:

  • db.system: "postgresql"
  • db.operation: Operation type (e.g., "batchLoad")
  • db.model: Table name (e.g., "users")

Error Handling

Errors are automatically captured with stack traces and marked with ERROR status. The error is then re-thrown for normal application handling.

Sample Output

Local Console Output:

{
"name": "db:users.batchLoad",
"traceId": "347cb51fc1fd41662d768bb1142acff1",
"parentId": "a8f3d9e2c1b4a567",
"duration": 45238.125,
"attributes": {
"db.system": "postgresql",
"db.operation": "batchLoad",
"db.model": "users"
},
"status": { "code": 0 }
}

Production Visualization:

Trace: 347cb51fc1fd41662d768bb1142acff1
├─ POST /graphql (125ms)
├─ db:users.batchLoad (45ms)
├─ db:organizations.batchLoad (32ms)
└─ db:events.batchLoad (28ms)

Best Practices

  1. Use descriptive operation names: "findByEmail" not "query"
  2. Wrap only DB operations: Exclude business logic from traced functions
  3. Match model names to tables: Use "users" for usersTable
  4. Primary use case: DataLoaders batch loading

Security

⚠️ Never include sensitive data in model/operation names

What is captured:

  • ✅ Table names, operation types, timing
  • ❌ Query parameters, result data (not captured)

For production, use TLS/HTTPS for OTLP endpoints and consider sampling (API_OTEL_SAMPLING_RATIO=0.1).

Contributor Guidelines

When to use traceable:

  • ✅ DataLoader batch functions (required)
  • ✅ Complex multi-table queries
  • ❌ Simple single-row lookups

Verify spans locally: Set API_OTEL_ENABLED=true and API_OTEL_ENVIRONMENT=local

Tests: test/utilities/db/traceableQuery.test.ts

Further Reading


References