How Cisco accelerated the use of generative AI with Amazon SageMaker Inference

This post is co-authored with Travis Mehlinger and Karthik Raghunathan from Cisco.

Webex by Cisco is a leading provider of cloud-based collaboration solutions, including video meetings, calling, messaging, events, polling, asynchronous video, and customer experience solutions like contact center and purpose-built collaboration devices. Webex's focus on delivering inclusive collaboration experiences fuels its innovation, which uses artificial intelligence (AI) and machine learning (ML) to remove the barriers of geography, language, personality, and familiarity with technology. Its solutions are underpinned with security and privacy by design. Webex works with the world's leading business and productivity apps, including AWS.

Cisco's Webex AI (WxAI) team plays a vital role in enhancing these products with AI-driven features, using large language models (LLMs) to improve user productivity and experiences. In the past year, the team has increasingly focused on building such LLM-powered capabilities. Notably, the team's work extends to Webex Contact Center, a cloud-based omni-channel contact center solution that empowers organizations to deliver exceptional customer experiences. By integrating LLMs, the WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing (NLP), and sentiment analysis, allowing Webex Contact Center to provide more personalized and efficient customer support. However, as these LLMs grew to contain hundreds of gigabytes of data, the WxAI team faced challenges in efficiently allocating resources and starting applications with the embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.

This post highlights how Cisco implemented new functionalities and migrated existing workloads to Amazon SageMaker Inference components for their industry-specific contact center use cases. By integrating generative AI, they can now analyze call transcripts to better understand customer pain points and improve agent productivity. Cisco has also implemented conversational AI experiences, including chatbots and virtual agents that can generate human-like responses, to automate personalized communications based on customer context. Additionally, they are using generative AI to extract key call drivers, optimize agent workflows, and gain deeper insights into customer sentiment. Cisco's adoption of SageMaker Inference has enabled them to streamline their contact center operations and provide more satisfying, personalized interactions that address customer needs.

In this post, we discuss the following:

  • Cisco's business use cases and outcomes
  • How Cisco accelerated the use of generative AI powered by LLMs for its contact center use cases with the help of SageMaker Inference
  • Cisco's generative AI inference architecture, which is built as a robust and secure foundation using various services and features such as SageMaker Inference, Amazon Bedrock, Kubernetes, Prometheus, Grafana, and more
  • How Cisco uses an LLM router and auto scaling to route requests to the appropriate LLMs for different tasks while scaling its models for resiliency and performance efficiency
  • How the solutions in this post impacted Cisco's business roadmap and strategic partnership with AWS
  • How Cisco helped SageMaker Inference build new capabilities to deploy generative AI applications at scale

Enhancing collaboration and customer engagement with generative AI: Webex's AI-powered solutions

In this section, we discuss Cisco's AI-powered use cases.

Meeting summaries and insights

For Webex Meetings, the platform uses generative AI to automatically summarize meeting recordings and transcripts. This extracts the key takeaways and action items, helping distributed teams stay informed even when they missed a live session. The AI-generated summaries provide a concise overview of important discussions and decisions, allowing employees to quickly get up to speed. Beyond summaries, Webex's generative AI capabilities also surface intelligent insights from meeting content. This includes identifying action items, highlighting critical decisions, and generating personalized meeting notes and to-do lists for each participant. These insights help make meetings more productive and hold attendees accountable.


Enhancing contact center experiences

Webex is also applying generative AI to its contact center solutions, enabling more natural, human-like conversations between customers and agents. The AI can generate contextual, empathetic responses to customer inquiries, as well as automatically draft personalized emails and chat messages. This helps contact center agents work more efficiently while maintaining a high level of customer service.


Webex customers realize positive outcomes with generative AI

Webex's adoption of generative AI is driving tangible benefits for customers. Customers using the platform's AI-powered meeting summaries and insights have reported productivity gains. Webex customers using the platform's generative AI for contact centers have handled hundreds of thousands of calls with improved customer satisfaction and reduced handle times, enabling more natural, empathetic conversations between agents and clients. Webex's strategic integration of generative AI is empowering users to work smarter and deliver exceptional experiences.

For more details on how Webex is harnessing generative AI to enhance collaboration and customer engagement, see Webex | Exceptional Experiences for Every Interaction on the Webex blog.

Using SageMaker Inference to optimize resources for Cisco

Cisco's WxAI team is dedicated to delivering advanced collaboration experiences powered by cutting-edge ML. The team develops a comprehensive suite of AI and ML features for the Webex ecosystem, including audio intelligence capabilities like noise removal and optimizing speaker voices, language intelligence for transcription and translation, and video intelligence features like virtual backgrounds. At the forefront of WxAI's innovations is the AI-powered Webex Assistant, a virtual assistant that provides voice-activated control and seamless meeting support in multiple languages. To build these sophisticated capabilities, WxAI uses LLMs, which can contain up to hundreds of gigabytes of training data.

Initially, WxAI embedded LLMs directly into the application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as the models grew larger and more complex, this approach faced significant scalability and resource utilization challenges. Operating the resource-intensive LLMs through the applications required provisioning substantial compute resources, which slowed down processes like allocating resources and starting applications. This inefficiency hampered WxAI's ability to rapidly develop, test, and deploy new AI-powered features for the Webex portfolio. To address these challenges, the WxAI team turned to SageMaker Inference, a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By decoupling the LLM hosting from the Webex applications, WxAI could provision the necessary compute resources for the models without impacting the core collaboration and communication features.
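
To illustrate the decoupled pattern, the following minimal sketch deploys an LLM to its own SageMaker real-time endpoint with the SageMaker Python SDK, so the application calls an endpoint instead of loading the model into its own container image. The model ID, container versions, and instance type here are illustrative assumptions, not Cisco's actual configuration.

```python
# Minimal sketch: hosting an LLM on its own SageMaker endpoint,
# decoupled from the application that calls it. Model ID, versions,
# and instance type are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # assumes an existing SageMaker execution role

model = HuggingFaceModel(
    role=role,
    env={
        "HF_MODEL_ID": "google/flan-t5-large",  # hypothetical model choice
        "HF_TASK": "text2text-generation",
    },
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

# The application now scales independently of the model fleet.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

print(predictor.predict(
    {"inputs": "Summarize: the customer called about a billing error."}
))
```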

"The applications and the models work and scale fundamentally differently, with entirely different cost considerations; by separating them rather than lumping them together, it's much simpler to solve issues independently."

– Travis Mehlinger, Principal Engineer at Cisco.

This architectural shift has enabled Webex to harness the power of generative AI across its suite of collaboration and customer engagement solutions.

Solution overview: Improving efficiency and reducing costs by migrating to SageMaker Inference

To address the scalability and resource utilization challenges faced with embedding LLMs directly into their applications, the WxAI team migrated to SageMaker Inference. By taking advantage of this fully managed service for deploying LLMs, Cisco unlocked significant performance and cost-optimization opportunities. Key benefits include the ability to deploy multiple LLMs behind a single endpoint for faster scaling and improved response latencies, as well as cost savings. Additionally, the WxAI team implemented an LLM proxy to simplify access to LLMs for Webex teams, enable centralized data collection, and reduce operational overhead. With SageMaker Inference, Cisco can efficiently manage and scale their LLM deployments, harnessing the power of generative AI across the Webex portfolio while maintaining optimal performance, scalability, and cost-effectiveness.
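
The snippet below is a hedged sketch of the multiple-models-behind-one-endpoint pattern using SageMaker inference components through boto3. Every name, instance type, and resource size is a placeholder rather than Cisco's actual setup.

```python
# Sketch: two models sharing one SageMaker endpoint via inference
# components. All identifiers and sizes are illustrative placeholders.
import boto3

sm = boto3.client("sagemaker")

# An inference-component endpoint is built from an endpoint config whose
# variant declares managed instance scaling instead of a fixed model.
sm.create_endpoint_config(
    EndpointConfigName="wxai-shared-config",  # hypothetical name
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 1,
        "ManagedInstanceScaling": {
            "Status": "ENABLED", "MinInstanceCount": 1, "MaxInstanceCount": 4,
        },
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }],
)
sm.create_endpoint(
    EndpointName="wxai-shared-endpoint",
    EndpointConfigName="wxai-shared-config",
)

# Each model becomes an inference component that scales independently
# on the shared fleet; repeat for each additional model.
sm.create_inference_component(
    InferenceComponentName="topic-labeling-llama2",  # hypothetical name
    EndpointName="wxai-shared-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "llama2-13b-chat",  # a pre-created SageMaker model
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 26624,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
```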

The following diagram illustrates the WxAI architecture on AWS.

[Diagram: WxAI architecture on AWS]

The architecture is built on a robust and secure AWS foundation:

  • The architecture uses AWS services like Application Load Balancer, AWS WAF, and EKS clusters for seamless ingress, threat mitigation, and containerized workload management.
  • The LLM proxy (a microservice deployed on an EKS pod as part of the Service VPC) simplifies the integration of LLMs for Webex teams, providing a streamlined interface and reducing operational overhead. The LLM proxy supports LLM deployments on SageMaker Inference, Amazon Bedrock, or other LLM providers for Webex teams (see the routing sketch after this list).
  • The architecture uses SageMaker Inference for optimized model deployment, auto scaling, and routing mechanisms.
  • The system integrates Loki for logging, Amazon Managed Service for Prometheus for metrics, and Grafana for unified visualization, seamlessly integrated with Cisco SSO.
  • The Data VPC houses the data layer components, including Amazon ElastiCache for caching and Amazon Relational Database Service (Amazon RDS) for database services, providing efficient data access and management.
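
The following sketch illustrates the core idea behind such an LLM proxy: a single generate() function that forwards a request to either a SageMaker endpoint or Amazon Bedrock based on a routing table. The endpoint names, model IDs, and payload shapes are assumptions for illustration, not Cisco's internal interface.

```python
# Sketch of LLM-proxy routing: one interface in front of multiple
# providers. Identifiers below are placeholders.
import json
import boto3

smr = boto3.client("sagemaker-runtime")
bedrock = boto3.client("bedrock-runtime")

# Hypothetical routing table mapping logical model names to providers.
ROUTES = {
    "call-driver": {"provider": "sagemaker", "endpoint": "flan-t5-call-driver"},
    "topic-label": {"provider": "bedrock", "model_id": "meta.llama2-13b-chat-v1"},
}

def generate(model: str, prompt: str) -> str:
    """Route a generation request to whichever provider hosts the model."""
    route = ROUTES[model]
    if route["provider"] == "sagemaker":
        resp = smr.invoke_endpoint(
            EndpointName=route["endpoint"],
            ContentType="application/json",
            Body=json.dumps({"inputs": prompt}),
        )
        return json.loads(resp["Body"].read())[0]["generated_text"]
    resp = bedrock.invoke_model(
        modelId=route["model_id"],
        body=json.dumps({"prompt": prompt, "max_gen_len": 256}),
    )
    return json.loads(resp["body"].read())["generation"]
```

Centralizing calls behind one function like this is also what makes centralized logging and data collection straightforward: every request crosses a single choke point regardless of provider.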

Use case overview: Contact center topic analytics

A key focus area for the WxAI team is to enhance the capabilities of the Webex Contact Center platform. A typical Webex Contact Center installation has hundreds of agents handling many interactions through various channels like phone calls and digital channels. Webex's AI-powered Topic Analytics feature extracts the key reasons customers are calling by analyzing aggregated historical interactions and clustering them into meaningful topic categories, as shown in the following screenshot. The contact center administrator can then use these insights to optimize operations, enhance agent performance, and ultimately deliver a more satisfying customer experience.

[Screenshot: Topic Analytics in Webex Contact Center]

The Topic Analytics feature is powered by a pipeline of three models: a call driver extraction model, a topic clustering model, and a topic labeling model, as illustrated in the following diagram.

[Diagram: Topic Analytics model pipeline]

The model details are as follows:

  • Call driver extraction – This generative model summarizes the primary reason or intent (referred to as the call driver) behind a customer's call. Accurate automatic tagging of calls with call drivers helps contact center supervisors and administrators quickly understand the primary reason for any historical call. One of the key considerations when solving this problem was selecting the right model to balance quality and operational costs. The WxAI team chose the FLAN-T5 model on SageMaker Inference and instruction fine-tuned it for extracting call drivers from call transcripts. FLAN-T5 is a powerful text-to-text transfer transformer model that performs various natural language understanding and generation tasks. This workload had a global footprint deployed in the us-east-2, eu-west-2, eu-central-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, and ca-central-1 AWS Regions.
  • Topic clustering – Although automatically tagging every contact center interaction with its call driver is a useful feature in itself, analyzing these call drivers in an aggregated fashion over a large batch of calls can uncover even more interesting trends and insights. The topic clustering model achieves this by clustering all the individually extracted call drivers from a large batch of calls into different topic clusters. It does this by creating a semantic embedding for each call driver and employing an unsupervised hierarchical clustering technique that operates on the vector embeddings. This results in distinct and coherent topic clusters where semantically similar call drivers are grouped together (see the clustering sketch after this list).
  • Topic labeling – The topic labeling model is a generative model that creates a descriptive title to serve as the label for each topic cluster. Several LLMs were prompt-tuned and evaluated in a few-shot setting to choose the best model for the label generation task. Ultimately, Llama2-13b-chat, with its ability to better capture contextual nuances and semantics of natural language conversation, was used for its accuracy, performance, and cost-effectiveness. Additionally, Llama2-13b-chat was deployed on SageMaker inference components, maintaining relatively low operating costs compared to other LLMs by using specific hardware like g4dn and g5 instances.
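
As a rough illustration of the clustering step, the sketch below embeds a handful of call drivers and groups them with unsupervised hierarchical clustering over the vector embeddings. The embedding model and distance threshold are assumptions; the post does not disclose which embedding model or clustering parameters Cisco actually uses.

```python
# Sketch of the topic clustering step: embed each extracted call driver,
# then group embeddings with hierarchical (agglomerative) clustering.
# Embedding model and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

call_drivers = [
    "customer cannot log in to the portal",
    "password reset not working",
    "double charge on monthly invoice",
    "billing statement shows wrong amount",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice
embeddings = embedder.encode(call_drivers, normalize_embeddings=True)

# No fixed cluster count: clusters emerge wherever cosine distance
# between groups stays under the threshold, so the number of topics
# adapts to the batch of calls.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.6,  # tunable; controls topic granularity
    metric="cosine",
    linkage="average",
).fit(embeddings)

for driver, label in zip(call_drivers, clustering.labels_):
    print(label, driver)
```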

This solution also used the auto scaling capabilities of SageMaker to dynamically adjust the number of instances based on a desired minimum of 1 and a maximum of 30. This approach provides efficient resource utilization while maintaining high throughput, allowing the WxAI platform to handle batch jobs overnight and scale to hundreds of inferences per minute during peak hours. By deploying the model on SageMaker Inference with auto scaling, the WxAI team was able to deliver reliable and accurate responses to customer interactions for their Topic Analytics use case.
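
The sketch below shows how such bounds are typically expressed with Application Auto Scaling; the endpoint and variant names are placeholders, and only the 1-to-30 instance range is taken from the description above.

```python
# Sketch: registering an endpoint variant with Application Auto Scaling
# so its instance count can float between 1 and 30.
import boto3

aas = boto3.client("application-autoscaling")

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/wxai-topic-analytics/variant/AllTraffic",  # placeholder
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=30,
)
```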

By accurately pinpointing the call driver, the system can suggest appropriate actions, resources, and next steps to the agent, streamlining the customer support process and leading to personalized and accurate responses to customer questions.

To handle fluctuating demand and optimize resource utilization, the WxAI team implemented auto scaling for their SageMaker Inference endpoints. They configured the endpoints to scale from a minimum to a maximum instance count based on GPU utilization. Additionally, the LLM proxy routed requests between the different LLMs deployed on SageMaker Inference. This proxy abstracts the complexities of communicating with various LLM providers and enables centralized data collection and analysis. This led to enhanced generative AI workflows, optimized latency, and personalized use case implementations.
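
Continuing the registration sketch above, a target-tracking policy can scale the same variant on the GPUUtilization metric that SageMaker publishes to CloudWatch. The 60% target and the cooldown values are illustrative assumptions.

```python
# Sketch: target-tracking scaling on GPU utilization for the variant
# registered above. Target value and cooldowns are assumptions.
import boto3

aas = boto3.client("application-autoscaling")

aas.put_scaling_policy(
    PolicyName="gpu-util-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/wxai-topic-analytics/variant/AllTraffic",  # placeholder
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # aim for ~60% average GPU utilization
        "CustomizedMetricSpecification": {
            "MetricName": "GPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": "wxai-topic-analytics"},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,  # scale in slowly to avoid thrashing
        "ScaleOutCooldown": 60,  # scale out quickly under load
    },
)
```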

Benefits

Through the strategic adoption of AWS AI services, Cisco's WxAI team has realized significant benefits, enabling them to build cutting-edge, AI-powered collaboration capabilities more rapidly and cost-effectively:

  • Improved development and deployment cycle time – By decoupling models from applications, the team has streamlined processes like bug fixes, integration testing, and feature rollouts across environments, accelerating their overall development velocity.
  • Simplified engineering and delivery – The clear separation of concerns between the lean application layer and the resource-intensive model layer has simplified engineering efforts and delivery, allowing the team to focus on innovation rather than infrastructure complexities.
  • Reduced costs – By using fully managed services like SageMaker Inference, the team has offloaded infrastructure management overhead. Additionally, capabilities like asynchronous inference and multi-model endpoints have enabled significant cost optimization without compromising performance or availability.
  • Scalability and performance – Services like SageMaker Inference and Amazon Bedrock, combined with technologies like NVIDIA Triton Inference Server on SageMaker, have empowered the WxAI team to scale their AI/ML workloads reliably and deliver high-performance inference for demanding use cases.
  • Accelerated innovation – The partnership with AWS has given the WxAI team access to cutting-edge AI services and expertise, enabling them to rapidly prototype and deploy innovative capabilities like the AI-powered Webex Assistant and advanced contact center AI features.

Cisco's contributions to SageMaker Inference: Enhancing generative AI inference capabilities

Building upon the success of their strategic migration to SageMaker Inference, Cisco has been instrumental in partnering with the SageMaker Inference team to build and enhance key generative AI capabilities within the SageMaker platform. Since the early days of generative AI, Cisco has provided the SageMaker Inference team with valuable inputs and expertise, enabling the introduction of several new features and optimizations:

  • Cost and performance optimizations for generative AI inference – Cisco helped the SageMaker Inference team develop innovative techniques to optimize the use of accelerators, enabling SageMaker Inference to reduce foundation model (FM) deployment costs by 50% on average and latency by 20% on average with inference components. This breakthrough delivers significant cost savings and performance improvements for customers running generative AI workloads on SageMaker.
  • Scaling improvements for generative AI inference – Cisco's expertise in distributed systems and auto scaling has also helped the SageMaker team develop advanced capabilities to better handle the scaling requirements of generative AI models. These improvements reduce auto scaling times by up to 40% and speed up auto scaling detection by six times, so customers can rapidly scale their generative AI workloads on SageMaker to meet spikes in demand without compromising performance.
  • Streamlined generative AI model deployment for inference – Recognizing the need for simplified generative AI model deployment, Cisco collaborated with AWS to introduce the ability to deploy open source LLMs and FMs with just a few clicks. This user-friendly functionality removes the complexity traditionally associated with deploying these advanced models, empowering more customers to harness the power of generative AI.
  • Simplified inference deployment for Kubernetes customers – Cisco's deep expertise in Kubernetes and container technologies helped the SageMaker team develop new Kubernetes Operator-based inference capabilities. These innovations make it straightforward for customers running applications on Kubernetes to deploy and manage generative AI models, reducing LLM deployment costs by 50% on average.
  • Using NVIDIA Triton Inference Server for generative AI – Cisco worked with AWS to integrate the NVIDIA Triton Inference Server, a high-performance model serving container managed by SageMaker, to power generative AI inference on SageMaker Inference. This enabled the WxAI team to scale their AI/ML workloads reliably and deliver high-performance inference for demanding generative AI use cases.
  • Packaging generative AI models more efficiently – To further simplify the generative AI model lifecycle, Cisco worked with AWS to enhance the capabilities in SageMaker for packaging LLMs and FMs for deployment. These improvements make it straightforward to prepare and deploy these generative AI models, accelerating their adoption and integration.
  • Improved documentation for generative AI – Recognizing the importance of comprehensive documentation to support the growing generative AI ecosystem, Cisco collaborated with the AWS team to enhance the SageMaker documentation. This includes detailed guides, best practices, and reference materials tailored specifically for generative AI use cases, helping customers quickly ramp up their generative AI initiatives on the SageMaker platform.

By closely partnering with the SageMaker Inference team, Cisco has played a pivotal role in driving the rapid evolution of generative AI inference capabilities in SageMaker. The features and optimizations introduced through this collaboration are empowering AWS customers to unlock the transformative potential of generative AI with greater ease, cost-effectiveness, and performance.

"Our partnership with the SageMaker Inference product team goes back to the early days of generative AI, and we believe the features we have built in collaboration, from cost optimizations to high-performance model deployment, will broadly help other enterprises rapidly adopt and scale generative AI workloads on SageMaker, unlocking new frontiers of innovation and business transformation."

– Travis Mehlinger, Principal Engineer at Cisco.

Conclusion

By using AWS services like SageMaker Inference and Amazon Bedrock for generative AI, Cisco's WxAI team has been able to optimize their AI/ML infrastructure, enabling them to build and deploy AI-powered solutions more efficiently, reliably, and cost-effectively. This strategic approach has unlocked significant benefits for Cisco in deploying and scaling its generative AI capabilities for the Webex platform. Cisco's own journey with generative AI, as showcased in this post, provides valuable lessons and insights for other users of SageMaker Inference.

Recognizing the impact of generative AI, Cisco has played a vital role in shaping the future of these capabilities within SageMaker Inference. By providing valuable insights and hands-on collaboration, Cisco has helped AWS develop a range of powerful features that are making generative AI more accessible and scalable for organizations. From optimizing infrastructure costs and performance to streamlining model deployment and scaling, Cisco's contributions have been instrumental in enhancing the SageMaker Inference service.

Moving forward, the Cisco-AWS partnership aims to drive further advancements in areas like conversational and generative AI inference. As generative AI adoption accelerates across industries, Cisco's Webex platform is designed to scale and streamline user experiences through the various use cases discussed in this post and beyond. You can expect to see ongoing innovation from this collaboration in SageMaker Inference capabilities, as Cisco and SageMaker Inference continue to push the boundaries of what's possible in the world of AI.

For more information on Webex Contact Center's Topic Analytics feature and related AI capabilities, refer to The Webex Advantage: Navigating Customer Experience in the Age of AI on the Webex blog.


About the Authors

Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps teams develop and operate cloud-centered AI and ML capabilities to support Webex AI features for customers around the world. In his spare time, Travis enjoys cooking barbecue, playing video games, and traveling around the US and UK to race go-karts.

Karthik Raghunathan is the Senior Director for Speech, Language, and Video AI in the Webex Collaboration AI Group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven solutions for the Webex collaboration portfolio. Prior to Cisco, Karthik held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Ravi Thakur is a Senior Solutions Architect at AWS, based in Charlotte, NC. He specializes in solving complex business challenges using distributed, cloud-centered, and well-architected patterns. Ravi's expertise includes microservices, containerization, AI/ML, and generative AI. He empowers AWS strategic customers on digital transformation journeys, delivering bottom-line benefits. In his spare time, Ravi enjoys bike rides, family time, reading, movies, and traveling.

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.

Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin, and playing the guitar.


