How to transcribe audio to text – the ultimate guide to transcription

Make audio and video work harder for your business. Speech-to-text transcription repurposes spoken words into precision-targeted, written content.


      Human communication is constantly evolving. From cave paintings to carrier pigeons, the printing press to the internet, telecommunications to text messaging, it seems innovation knows no bounds when it comes to spreading the word.

      Whatever the medium, most forms of communication involve either sounds we can hear or symbols we can see. Both forms have their merits and limitations. But in the modern, digital era, it is visual messaging that dominates.

      Thanks to the Internet, visual content can be seen and shared by enormous numbers of people globally in countless ways. As a result, there are huge potential rewards for creating fresh, multi-purpose, written communication out of audio recordings.

      The process of transforming audio and video content into text is called ‘transcription’.

      What is transcription?

      Confusion sometimes occurs when the word ‘transcription’ is muddled up with similar-sounding words, such as ‘translation’ and ‘transcribe’.

      Although they are alike, ‘translation’, ‘transcribing’ and ‘transcription’ mean different things. In terms of linguistics, translate means to express the meaning of spoken or written words in another language; transcribe means to write out a copy; and a transcription is the written version of the spoken word.

      So, the three terms have different meanings, but they are closely connected (especially when you translate a transcriber’s transcription!).

      Read more: What is transcription? Your questions answered

      What are the benefits of transcription?

      Creating a written version of audio content sounds like a lot of work. So, why bother with transcription?

      Here are five ways transcription can benefit you and your business.

      1. Accelerate workflow

      A written document enables anyone working with audio or video to boost turnaround times. An editor can mark up and add comments to a transcript or captions in a way that would be time-consuming, impractical and maybe even impossible otherwise.

      2. Boost SEO

      Videos beat text every time on social media and online stores. Search engines, however, only respond to written content. Transcription provides the best of both worlds: videos that appeal to humans and text content that appeals to Google bots.

      3. Maximise communications

      Turn meetings, speeches and training sessions into permanent, tangible documents that deliver benefits long after the words were spoken. Transcripts of events can also be repurposed. A speech, for example, may become marketing content, an email to staff, or edited and reworked into a pitch for new business.

      4. Expand accessibility

      Transcribing audio and visual content (in the form of video captions, subtitles and full transcriptions) makes it available to people who may struggle to hear. Indeed, in many countries it is a legal requirement to transcribe all public audio and visual material.

      5. Increase social shares

      Audio files alone lack the social-media impact of text and images. Transcription, however, transforms audio into snippets of text that generate more shares and more web traffic.

      Download our free transcription template

      Get started with transcription. Here you will find templates for both detailed transcription and standard transcription. You can use the formats and examples in your own working document.

      Where is transcription used?

      Demand for transcription services extends across most industries. Here’s a quick guide to how different sectors can use transcription to gain a commercial and operational advantage.

      • Media transcription
        Transcription is used to aid the viewing and editing of spoken content across a wide range of video and audio media. From conference speeches to regular podcasts, any event recorded as audio or video can be transcribed into an editable text document.
      • Academic transcription
        Transcription is used within educational organisations such as schools, colleges and universities to improve the quality of teaching, the accessibility of learning resources and to provide students with searchable versions of lectures and seminars online.
      • Insurance transcription
        Insurers typically take audio statements from claimants, witnesses and other involved parties. As these interviews are legally binding, it is essential that every word is captured accurately. Transcription, therefore, generates suitable legal documentation that can be used to assess and process an insurance claim.
      • Journalism transcription
        Journalists and reporters spend much of their time conducting interviews which they – possibly with help from software or a service provider – later transcribe. With a responsibility to quote people and report events correctly, transcription gives journalists an accurate record to use for their writing and for possible future enquiry.
      • Market research transcription
        Feedback gathered from focus groups, interviews and observations plays a vital role in bringing a successful product to market. To ensure an accurate record is kept of what has been said about what, when and by whom, researchers use transcription to convert spoken words to searchable documents.

      What types of transcription are there?

      With so many applications for transcribed text, it is no surprise there are different types of transcription to choose from. In some cases, a literal transcription is vital, in others readability is the primary objective, or maybe a simple overview will suffice.

      The four most common types of transcription are verbatim, intelligent verbatim, edited and phonetic.

      1. Verbatim transcription

      A verbatim transcription captures every sound and silence that occurs, from coughs and laughter to verbal pauses and fillers (er… um… yeah… you know?). Verbatim transcriptions also record ‘noises off’, such as doors slamming and phones ringing.

      2. Intelligent verbatim transcription

      An intelligent verbatim transcription removes any irrelevant elements from the text, such as fillers and needless repetition. The result is a more concise, readable transcript that remains faithful to the original recording in every other way.

      3. Edited transcription

      An edited transcription is revised to remove any superfluous content, amend any grammatical mistakes and complete any unfinished sentences. The result can be a more formal representation of what was said, but it will be easier to read and understand than the original verbatim transcription.

      4. Phonetic transcription

      Phonetic transcription uses symbols to record the phonemes (the smallest distinct units of sound in a language), rather than the actual words spoken. Phonetic transcription should follow the same process for all languages, with each symbol always representing the same sound. A phonetic transcription is useful when pronunciation is important – such as when comparing speech between different age groups, locations or periods of time.

      Phonetic vs. orthographic transcription

      Unlike phonetic transcription, orthographic transcription follows standardised language rules and is unaffected by changes in pronunciation. For example, the popular Australian TV soap Neighbours is known for introducing the upward phonetic inflexion at the end of sentences to the rest of the English-speaking world. While this would be immediately apparent in a phonetic transcription, an orthographic transcription would not indicate any difference. Orthographic transcriptions are the preferred choice for large-scale corpora, particularly those used for research, where details of pronunciation are immaterial.

      Read more: Different types of transcription – with examples

      Manual transcription vs. automatic transcription

      Human transcribers, such as courtroom stenographers and those dealing with highly sensitive material (such as evidence in ongoing investigations and interrogations), spend years building and honing their skills to be able to deliver transcriptions at an extremely high standard.

      This personal expertise comes at a cost, however, both in terms of time and money. These constraints make manual transcription unsuitable for large-scale projects requiring a fast turnaround.

      Automated transcription reduces time and costs significantly by using software to complete the task. One weakness of this approach occurs when transcribing regional accents. Outcomes vary depending on the AI technology and machine learning capabilities of the product. At the time of writing, even the best transcription ‘machines’ are not 100% accurate.

      The choice between manual transcription and automatic transcription comes down to the sensitivity of the content and the level of accuracy required. Often, absolute precision is less of a priority than speed. A popular compromise is to combine the two methods so that software performs an initial transcription and a skilled human corrects the errors.

      How to transcribe

      Professional transcription is a great skill, often developed over many years of practice. Here are six essential rules for transcribers hoping to progress in this specialist field.

      1. Listen to the entire audio recording before beginning to write. This helps the transcriber’s brain ‘tune in’ to the style and substance of the spoken content. This is especially helpful when unfamiliar accents are involved.

      2. Listen to an entire sentence before transcribing it. This helps the transcriber to appreciate the context and avoid misunderstandings over homophones (‘I don't, no’ and ‘I don't know’, for example).

      3. Edit the completed transcription, looking for mistakes and bad grammar. A skilled transcriber can increase their efficiency by also becoming a skilled editor.

      4. Learn the correct touch-typing technique to maximise speed, accuracy and comfort. Even the best transcriber can be brought to a halt by hand cramp. Some modern transcription software makes use of foot pedals to increase speed and efficiency. A foot pedal allows users to control the audio using their feet, which frees up both hands for typing.

      5. Know any relevant jargon and abbreviations used in the audio being transcribed. This is especially important when working in sectors such as medicine, where specialist terms can make speech sound almost like a foreign language.

      6. Check, check and check again that the terminology used is correct, that each paragraph makes sense and that the transcription is complete and factually correct.

      Use timestamps to mark speakers and aid analysis

      Timestamps typically use the [HH:MM:SS] format to specify the hours, minutes and seconds from the start of the audio recording at which the given text was spoken. This allows editors to jump straight to specific points in an audio file without having to work through the entire recording. Transcribers find timestamps useful too, especially when wishing to locate and review particularly challenging sections of text.
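
      As a minimal sketch of the [HH:MM:SS] format described above, the Python snippet below converts an offset in seconds into a timestamp marker and back again. The function names format_timestamp and parse_timestamp are illustrative choices, not part of any particular transcription tool, and the sketch assumes non-negative offsets.

```python
def format_timestamp(seconds: float) -> str:
    """Return an offset from the start of the recording as an [HH:MM:SS] marker."""
    total = int(seconds)  # fractional seconds are dropped, not rounded
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def parse_timestamp(marker: str) -> int:
    """Convert a marker such as [01:02:05] or (01:02:05) back to whole seconds."""
    hours, minutes, secs = marker.strip("[]()").split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(secs)
```

      An editor's tool could use the parsed value to jump straight to that point in the audio file; parse_timestamp accepts both square brackets and the round brackets used in the examples in this guide.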

      There are several different ways timestamps can be used in transcriptions. For example:

      • Timestamps that indicate when actual speech begins and ends are useful when a recording does not have dialogue right from the start.
      • Timestamps placed whenever a new person speaks can help to locate the key moments in an audio recording.

        Ernie: (00:00:00) Hello everyone and welcome to my podcast.
        Bert: (00:00:02) And mine
        Ernie: (00:00:03) Okay, okay… Welcome to ‘our’ podcast.
      • A more granular marker can be provided by adding a new timestamp every sentence (though this can result in a cluttered transcription).

        (00:00:00) This is a brief example of a transcript. (00:00:02) The time each new sentence begins is displayed in the timestamp. (00:00:05) This is just one of many timestamping options.
      • A more common requirement is the periodic timestamp, which is added at predetermined time intervals (such as every 3 seconds or every 30 minutes).

        (00:00:00) Here is an example of how a time-stamped transcript can incorporate a (00:00:03) timestamp every three seconds by including the (00:00:06) timestamp within the written text.
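
      The periodic option above can be sketched in a few lines of Python. This is an illustration only: it assumes the transcript is already available as a list of (start time in seconds, word) pairs, which is a hypothetical input format, and the function name add_periodic_timestamps is made up for this example.

```python
def stamp(seconds: int) -> str:
    """Format whole seconds as an (HH:MM:SS) marker."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"({hours:02d}:{minutes:02d}:{secs:02d})"


def add_periodic_timestamps(words, interval=3):
    """Insert a timestamp before the first word spoken in each interval.

    `words` is an assumed input: (start_seconds, word) pairs sorted by time.
    """
    pieces = []
    next_mark = 0  # earliest time at which the next marker is due
    for start, word in words:
        if start >= next_mark:
            # Snap the marker to the interval boundary this word falls in.
            mark = int(start // interval) * interval
            pieces.append(stamp(mark))
            next_mark = mark + interval
        pieces.append(word)
    return " ".join(pieces)


timed_words = [(0.0, "Here"), (1.2, "is"), (2.9, "an"), (3.1, "example"), (7.0, "again")]
print(add_periodic_timestamps(timed_words, interval=3))
# (00:00:00) Here is an (00:00:03) example (00:00:06) again
```

      Intervals in which nobody speaks produce no marker in this sketch; a different design could emit a marker for every interval regardless.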

      How to transcribe interviews

      As well as gleaning information during an interview, there is much to be learnt from the post-interview analysis of the transcript.

      The first step in creating an interview transcript is to identify what you want the transcript to achieve. For instance, if all you require are some key quotes, a more targeted approach will suffice, rather than the blanket coverage of a verbatim transcript.

      A transcript is a scannable document that can be searched for specific words. Timestamps allow the reader to listen to the original recording with little effort or delay. A transcript is also quicker and easier to share with collaborators than the much larger audio and video files, and it is much easier for colleagues to edit.

      The written word also provides an opportunity to make a more objective appraisal of what was said, without the distraction of the speaker’s appearance and body language. Additional comments and tags added by the transcriber can aid the evaluation of the text and allow a more quantitative analysis to be undertaken (highlighting instances of certain emotive words, for example).
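
      To illustrate how a timestamped transcript can be searched, here is a small Python sketch. It assumes each line follows the ‘Speaker: (HH:MM:SS) text’ layout used in the timestamp examples earlier in this guide; the function name find_mentions and the regular expression are illustrative assumptions, not a standard.

```python
import re

# Assumed line layout: "Speaker: (HH:MM:SS) spoken text"
LINE_PATTERN = re.compile(
    r"^(?P<speaker>[^:]+): \((?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\) (?P<text>.*)$"
)


def find_mentions(transcript: str, keyword: str):
    """Return (seconds, speaker, text) for every line that mentions the keyword."""
    hits = []
    for line in transcript.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match and keyword.lower() in match["text"].lower():
            seconds = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
            hits.append((seconds, match["speaker"], match["text"]))
    return hits


sample = (
    "Ernie: (00:00:00) Hello everyone and welcome to my podcast.\n"
    "Bert: (00:00:02) And mine\n"
    "Ernie: (00:00:03) Okay, okay… Welcome to ‘our’ podcast."
)
for seconds, speaker, text in find_mentions(sample, "podcast"):
    print(seconds, speaker, text)
```

      The number of hits gives a simple count of how often a word was used, and each returned offset marks where to resume listening in the original recording.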

      Read more: Complete guide: How to transcribe an interview

      How to transcribe group conversations

      Bringing clarity to a transcript in which two or more people are speaking can be challenging, especially when there are frequent interruptions and talking over one another.

      The preferred approach to this scenario is to transcribe what each person says on a separate line. If they speak simultaneously, indicate this by giving both lines the same timestamp. If the commotion makes it impossible to hear what one of the speakers is saying, add an ‘Inaudible’ tag.
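
      The layout rules above can be sketched in Python as follows. The input format is an assumption made for illustration: each segment is a (start, end, speaker, text) tuple with times in seconds, and text is None when the speech could not be made out. The function name format_group_transcript is hypothetical.

```python
def hms(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS (fractions dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_group_transcript(segments):
    """Put each speaker on a separate line; overlapping speech shares a timestamp."""
    lines = []
    prev_end = None    # when the current stretch of speech finishes
    prev_stamp = None  # timestamp given to the previous line
    for start, end, speaker, text in sorted(segments, key=lambda seg: seg[0]):
        overlapping = prev_end is not None and start < prev_end
        stamp_time = prev_stamp if overlapping else start
        spoken = text if text else "[Inaudible]"
        lines.append(f"{speaker}: ({hms(stamp_time)}) {spoken}")
        prev_end = max(prev_end, end) if overlapping else end
        prev_stamp = stamp_time
    return "\n".join(lines)


# Made-up sample data for illustration
segments = [
    (0, 4, "Ernie", "Hello everyone."),
    (2, 5, "Bert", None),  # talks over Ernie and cannot be made out
    (6, 8, "Ernie", "Okay, okay."),
]
print(format_group_transcript(segments))
# Ernie: (00:00:00) Hello everyone.
# Bert: (00:00:00) [Inaudible]
# Ernie: (00:00:06) Okay, okay.
```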

      How to transcribe videos

      Transcribing spoken content can be a lengthy, but very worthwhile, process. Adding a transcript to web pages that contain videos can boost SEO impact and visitor engagement (as well as aid the content’s searchability).

      So, what’s the best way to go about it?

      Here are four key ways to transcribe video content. Each method has its own advantages and disadvantages. Your preferred approach will likely depend on your specific situation.

      1. Transcription apps for mobile phones

      Mobile phones provide an easily portable tool for capturing people’s speech on the go. In addition to most smartphones’ built-in speech-to-text application, there are a variety of transcription apps available for download from the various app stores.

      2. Free, online video transcription

      A simple search-engine enquiry will reveal an assortment of free-to-use transcription tools available for use online. The quality of these free programs can vary tremendously, however, so you should always proofread the captions as they may be peppered with errors. If a video can be uploaded to YouTube, users can take advantage of automatic YouTube captioning, which offers transcriptions with up to 80% accuracy, depending on the sound quality of the video (though not all languages are supported by YouTube).

      3. Transcription software for desktop computers

      As well as having the option to access online transcription tools, Mac and PC users can download software to their desktop computer, which allows the tools to be used without the need for an internet connection.

      4. Captioning services

      Professional captioning services and localization providers may cost more than the free software solutions, but the results are first-rate and they incorporate better options for the security and confidentiality of the content.

      Example tools for transcription

      Recent advances in AI technology and machine learning capabilities have brought about a wide variety of speech-to-text transcription products. These range from mobile apps to desktop software services; stand-alone products to entire operating systems with transcription tools built in. The available choice and range of specialist functions is huge.

      Here’s a brief overview of some of the most popular transcription tools on the market today.

      Dragon Anywhere is for Android and iOS devices and also syncs with the desktop version of the software. The only drawback to its excellent recognition capabilities is the software’s need for an Internet connection, due to its cloud-based nature. Dragon Anywhere is available by subscription, with no one-off purchase option offered.

      Dragon Professional is designed to assist pro users through the entire process. An easy-to-use interface provides access to several powerful features – including tools to dictate and edit documents, create spreadsheets and browse the web by voice. The app’s built-in intelligence allows it to learn voices, words and phrases as it transcribes them.

      Otter is a cloud-based program created with laptops and smartphones in mind. The app’s real-time transcription allows users to search, edit, play and organise data as required. As well as being suited to transcribing interviews and lectures, Otter also facilitates collaboration between teams.

      Verbit specifically targets enterprise and educational establishments. Using neural networks to maximise effectiveness, even when background noise is present, Verbit also offers the option of involving human editors for absolute accuracy.

      Speechmatics offers a comprehensive and flexible speech-to-text service. An example of this is its claim to support all major English accents, regardless of nationality. So, that includes the many American and British English accents, as well as those from South Africa, Jamaica and beyond.

      Braina seamlessly combines transcription services with virtual assistant features within a single, intuitive interface. Braina users can search online, take notes and select music to play all while transcribing text in over 100 different languages with up to 99% accuracy.

      Windows 11, Microsoft's latest operating system, comes with built-in dictation software. When working within almost any text field, users can simply switch on, start speaking and watch the text appear on the screen.

      macOS has Apple's dictation tool built into the operating system itself, making dictation possible in any text field. As the feature learns individual voice attributes, including accents, it gets better with continued use.

      Google Voice Typing for Google Docs gives the online word processor speech-to-text functionality. All that is required is a Google account, Chrome web browser and an Internet connection.

      Read more: What are the best transcription apps and tools?

      Set speech free with audio transcription

      Transcription provides a readily available opportunity to leverage existing audio material to gain a competitive edge and satisfy the online world’s constant craving for fresh content.

      By turning audio into text, transcription opens up endless possibilities to turn that text into blogs, social media posts, marketing material, training resources and much more. Once you start using transcripts to multiply the value of the spoken word, you’ll be surprised how many opportunities there are.

      How will you extract maximum value from your audio and video?