Skip to content

Feat/webinar llm latencies 2#5002

Open
Tianhui-Han wants to merge 8 commits into
webex:nextfrom
Tianhui-Han:feat/webinar_LLM_Latencies_2
Open

Feat/webinar llm latencies 2#5002
Tianhui-Han wants to merge 8 commits into
webex:nextfrom
Tianhui-Han:feat/webinar_LLM_Latencies_2

Conversation

@Tianhui-Han
Copy link
Copy Markdown
Contributor

@Tianhui-Han Tianhui-Han commented May 26, 2026

This pull request addresses

https://jira-eng-gpk2.cisco.com/jira/browse/SPARK-774846

by making the following changes

added latency telemetry for webinar LLM connect and Locus sync flows, including timing collection, websocket/dataset context in payloads, callback wiring from parser to meeting.

Change Type

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Tooling change
  • Internal code refactor

The following scenarios were tested

< ENUMERATE TESTS PERFORMED, WHETHER MANUAL OR AUTOMATED >

The GAI Coding Policy And Copyright Annotation Best Practices

  • GAI was not used (or, no additional notation is required)
  • Code was generated entirely by GAI
  • GAI was used to create a draft that was subsequently customized or modified
  • Coder created a draft manually that was non-substantively modified by GAI (e.g., refactoring was performed by GAI on manually written code)
  • Tool used for AI assistance (GitHub Copilot / Other - specify)
    • Github Copilot
    • Other - Please Specify
  • This PR is related to
    • Feature
    • Defect fix
    • Tech Debt
    • Automation

I certified that

  • I have read and followed contributing guidelines
  • I discussed changes with code owners prior to submitting this pull request
  • I have not skipped any automated checks
  • All existing and new tests passed
  • I have updated the documentation accordingly

Make sure to have followed the contributing guidelines before submitting.

@aws-amplify-us-east-2
Copy link
Copy Markdown

This pull request is automatically being deployed by Amplify Hosting (learn more).

Access this pull request here: https://pr-5002.d3m3l2kee0btzx.amplifyapp.com

@Tianhui-Han Tianhui-Han marked this pull request as ready for review May 26, 2026 09:06
@Tianhui-Han Tianhui-Han requested review from a team as code owners May 26, 2026 09:06
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 838cc6f657

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

const syncResponseReceivedAt = shouldCollectMetrics ? performance.now() : 0;

// Record pending metrics and complete them when the matching LLM broadcast arrives.
if (shouldCollectMetrics && syncResponse !== null) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Record pending metrics when /sync returns no body

sendSyncRequestToLocus() explicitly returns null when Locus replies with an empty body and says the actual data will arrive via messages, but performSync() only stores pendingSyncMetrics when syncResponse !== null. In the empty-body path, there is no pending entry for tryCompletePendingSyncMetrics() to finish later, so client.locus.sync.complete latency telemetry is silently dropped for those syncs (the exact flow this feature is trying to measure).

Useful? React with 👍 / 👎.

for (const dataSet of message.dataSets) {
const pending = this.pendingSyncMetrics.get(dataSet.name);

if (pending && dataSet.version >= pending.dataSetVersion) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Locus team wanted us to match the received event with the sync request by using the tracking id, see this page:
https://confluence-eng-gpk2.cisco.com/conf/spaces/CAL/pages/801680811/5K+Webinar+Latencies+Tracking

Image

I think it's important for them in cases if there is a sync storm

*/
// @ts-ignore - Fix type
this.locusInfo = new LocusInfo(this.updateMeetingObject.bind(this), this.webex, this.id);
this.locusInfo.syncMetricsCallback = (metrics: {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner to pass the callback in the constructor. We are already passing 1 callback in the constructor, so we could have a callbacks argument object with multiple callbacks. Same for how the callback is passed between LocusInfo and HashTreeParser class.
By requiring the callback in the constructor we can make sure it's always there so we then don't have to check for it when calling it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

breakoutMoveId: eventInfo.breakoutMoveId,
breakoutSessionId: eventInfo?.currentSession?.sessionId,
breakoutGroupId: eventInfo?.currentSession?.groupId,
...(eventInfo?.llmWebsocketUrl && {llmWebsocketUrl: eventInfo.llmWebsocketUrl}),
Copy link
Copy Markdown
Collaborator

@marcin-bazyl marcin-bazyl May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: personally I don't like this syntax, I think it's not very readable. Wouldn't it be better like this?

    const {llmWebsocketUrl, llmLatency} = eventInfo;

    const payload: any = {
      llmLatency,
      identifiers: {
        breakoutMoveId: eventInfo.breakoutMoveId,
        breakoutSessionId: eventInfo?.currentSession?.sessionId,
        breakoutGroupId: eventInfo?.currentSession?.groupId,
        llmWebsocketUrl,
      },
    };

We know that eventInfo must be defined, otherwise we would have returned in line 37

I see same syntax pattern in many other places in this PR, most of them could be made simpler by having the const with the same name as the field in the created object defined earlier.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


dataSet.timer = setTimeout(() => {
dataSet.timer = undefined;
dataSet.lastBackoffTime = delay - dataSet.idleMs;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that it would be cleaner to have:

const randomBackOff = this.getWeightedBackoffTime(dataSet.backoff); 
const delay = dataSet.idleMs + randomBackOff;

dataSet.lastBackoffTime = randomBackOff;

a few lines above instead of doing the subtraction here. Do you agree? Also, feels a bit cleaner to be setting dataSet.lastBackoffTime before the call to setTimeout instead of inside the setTimeout callback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

throw error;
}

hashtreeResponseTime = shouldCollectMetrics
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not exactly the right place to measure the /hashtree response time - this should be done inside getHashesFromLocus() immediately after getting the response. Right now getHashesFromLocus doesn't do much after getting the response, but that's not guaranteed to be always the case, someone might add some async or time consuming stuff to it and then the calculation here won't give the correct result.
Same for hashtreePrepStart that's used for hashtreePrepTime calculations - it should be done inside getHashesFromLocus() immediately before calling this.webexRequest

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

});
}

const syncResponseReceivedAt = shouldCollectMetrics ? performance.now() : 0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having all that code that calculates latencies spread out across the code and mixed in with other logic feels quite messy. It would be better to have a more generic mechanism - a class that maintains the latency values and does the calculations and in the HashTreeParser we would just call that class to tell it about various milestones being reached like "hash tree prep start/end", "hash tree request/response" etc. In fact we already have a class for doing this sort of stuff - CallDiagnosticLatencies - see for example how shareDuration is calculated for client.share.stopped event. You can add more "internal" events to InternalEvent - they are internal to SDK as the name suggests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@marcin-bazyl marcin-bazyl added the validated If the pull request is validated for automation. label May 27, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b82b71b8e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

this.pendingSyncMetrics.set(dataSet.name, {
syncResponseReceivedAt,
totalStartTime: totalStart,
randomBackoffTime: Math.round(dataSet.lastBackoffTime || 0),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Capture syncAll backoff before reporting metrics

When LLM syncs are kicked off by syncAllDatasets({onlyLLM: true}) after connecting the LLM channel, the backoff delay is computed and slept in syncAllDatasets, but lastBackoffTime is never updated on that path; it is only set in the timer callback inside runSyncAlgorithm. As a result, these new client.locus.sync.complete metrics report randomBackoffTime as 0 or a stale value from an unrelated timer, even though the sync just waited for a randomized backoff.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f15504d61e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +42 to +43
if (event === 'client.breakout-session.join.response' && !llmLatency) {
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve breakout join metrics without LLM latencies

In meetings where automatic LLM is disabled (the default in config.ts), Breakouts.updateBreakout() still calls onBreakoutJoinResponse without llmLatency, and no later updateLLMConnection() path emits the replacement metric because that path is gated on enableAutomaticLLM. This new guard therefore drops the existing client.breakout-session.join.response telemetry for normal breakout moves that do not collect LLM timing, instead of only adding latency fields when they are available.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3335dc8100

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +199 to +202
this.locusSyncLatencies.set(dataSetName, {
randomBackoffTime,
syncStart: value,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Key sync latency records by meeting as well

When two Meeting instances sync the same dataset concurrently (for example, two webinars both syncing main), they share the single webex.internal.newMetrics.callDiagnosticLatencies tracker, but this record is keyed only by dataSetName. A second internal.client.locus.sync.start overwrites the first meeting's record, so the first parser's completion can report the other meeting's timings or lose its telemetry after the record is cleared, corrupting/dropping client.locus.sync.complete metrics. Include the meeting/locus identity in the key or use a per-parser tracker record.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35ba5211a3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

key: 'internal.client.locus.sync.start',
options: {
dataSetName: dataSet.name,
randomBackoffTime: Math.round(dataSet.lastBackoffTime || 0),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve watchdog backoff before reporting sync metrics

When a sync is initiated by the heartbeat watchdog, resetHeartbeatWatchdogs() computes a fresh backoffTime and waits heartbeatIntervalMs + backoffTime, but that path enqueues the sync without updating dataSet.lastBackoffTime. As a result, this metric records randomBackoffTime as 0 or a stale value from an earlier root-hash timer even though the watchdog just waited a randomized backoff, corrupting client.locus.sync.complete latency data for watchdog-triggered syncs.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

validated If the pull request is validated for automation.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants