Transcribing Video To Text With Google Video Intelligence API Solutions

Case Study


The Client

The client is a next-generation global technology company that helps enterprises reimagine their businesses for the digital age. The company primarily provides software services, business process outsourcing, and infrastructure services. Niveus Solutions helped the client build a speech-to-text solution for transcribing recorded virtual sessions using Google Cloud Video Intelligence.

Project Objective

The client wanted a platform that could upload large recorded video files to the cloud and transcribe their audio into text using AI technologies.

They also needed screenshots taken at different points in the video to capture architecture diagrams and visual demos of applications, with the images attached to the transcribed notes. The screenshots had to appear in logical sequence alongside the corresponding transcribed text.

Data security was the client’s highest priority: all confidential information had to be kept in a secure location.


Business Solution

  • Developed a Video Intelligence solution that transcribes meeting videos of knowledge transfer (KT) sessions at HCL by converting speech to text and capturing screenshots of presentations, architecture diagrams, flowcharts, and similar material.
  • Speakers in the video are identified, and the document is sequenced so that each passage of speech is followed by its related screenshots.
  • The solution can also mask sensitive information in videos and omit courtesy words from the transcription.
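The sequencing described above can be illustrated with a minimal Python sketch. It assumes each speech segment and each screenshot carries a timestamp in seconds. The `Segment` and `Screenshot` types, the speaker tags, and the file names are hypothetical and are not taken from the client's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_s: float  # offset into the video, in seconds
    speaker: str    # hypothetical speaker tag, e.g. "Speaker 1"
    text: str       # transcribed speech for this segment


@dataclass(frozen=True)
class Screenshot:
    at_s: float     # offset at which the frame was captured
    path: str       # hypothetical file name of the saved frame


def build_notes(segments, screenshots):
    """Merge speech segments and screenshots into one list ordered by time."""
    timeline = [(s.start_s, 0, f"{s.speaker}: {s.text}") for s in segments]
    timeline += [(p.at_s, 1, f"[screenshot: {p.path}]") for p in screenshots]
    # Sort by timestamp; on a tie, speech (0) comes before its screenshot (1).
    timeline.sort(key=lambda item: (item[0], item[1]))
    return [line for _, _, line in timeline]


segments = [
    Segment(0.0, "Speaker 1", "Let me walk through the architecture."),
    Segment(42.5, "Speaker 2", "How does the upload service scale?"),
]
screenshots = [
    Screenshot(12.0, "frame_0012.png"),
    Screenshot(42.5, "frame_0042.png"),
]

notes = build_notes(segments, screenshots)
for line in notes:
    print(line)
```

Sorting on a (timestamp, kind) pair keeps the output deterministic when a screenshot and a speech segment share the same offset.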


  • The solution is built on Google’s Cloud Video Intelligence API.
  • The VideoIntelligenceServiceClient API is used to detect labels related to shared screens and architecture diagrams.
  • Videos are uploaded to Google Cloud Storage through REST APIs invoked from the custom solution’s web UI.
  • CSV files containing video URIs and labels are uploaded to the same bucket as the videos for ML training.
  • Cloud Data Loss Prevention (DLP) is used to identify sensitive data in transcripts and screenshots.
  • Identity-Aware Proxy (IAP) in GCP allows only authenticated users to access the application.
  • The web UI is implemented with React JS on the frontend and Java on the backend.
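As an illustration of the CSV files of video URIs and labels mentioned above, the following Python sketch renders such a file using only the standard library. The column layout (URI, label, segment start, segment end), the bucket name, and the label names are assumptions made for this example. The case study does not state the exact format, so the layout should be checked against the AutoML documentation before use.

```python
import csv
import io


def training_csv(rows):
    """Render (video_uri, label, start_s, end_s) rows as CSV text.

    The four-column layout is an assumption for illustration only.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for uri, label, start_s, end_s in rows:
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a Cloud Storage URI, got {uri!r}")
        if end_s <= start_s:
            raise ValueError(f"segment end must follow start for {uri!r}")
        writer.writerow([uri, label, start_s, end_s])
    return buf.getvalue()


# Hypothetical bucket, file, and label names.
rows = [
    ("gs://example-kt-videos/session1.mp4", "architecture_diagram", 120, 150),
    ("gs://example-kt-videos/session1.mp4", "shared_screen", 300, 420),
]

manifest = training_csv(rows)
print(manifest, end="")
```

Validating the URI prefix and segment bounds before writing catches malformed rows locally rather than during a training import.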

The Impact

  • 97% of the words in the sample video transcription are accurate, and all valid screenshots from the sample video are captured.
  • Integration with other cloud platforms such as AWS and Azure, and sync between cloud platforms, can be achieved.
  • Text-to-text translation from one language to another is supported.
  • The application offers multilingual support.
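The case study does not say how the word accuracy figure was measured. One common approach, sketched below in Python as an assumption rather than the method actually used, compares the machine transcript with a human reference using word-level edit distance and reports one minus the word error rate. The sample sentences are invented for the example.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, hypothesis_text):
    """Return 1 - WER, comparing lower-cased, whitespace-split words."""
    ref = reference_text.lower().split()
    hyp = hypothesis_text.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(ref, hyp) / len(ref)


# Invented sample: one of nine reference words is transcribed incorrectly.
reference = "the upload service writes each video to cloud storage"
hypothesis = "the upload service rights each video to cloud storage"

accuracy = word_accuracy(reference, hypothesis)
print(f"{accuracy:.1%}")
```

Because insertions count as errors, this measure can fall below zero for a hypothesis much longer than the reference, so it is best read alongside the raw error count.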

Technology Stack

Cloud Storage
App Engine
Cloud Speech-to-Text
Cloud Translation API
Document API
Video Intelligence
Cloud AutoML
Cloud Pub/Sub
