Comparing Machine Transcription Options from Rev and Sonix

As part of our continuing exploration of new options for transcription and captioning, two members of our media production team tested the automated services offered by both Rev and Sonix. We submitted the same audio and video files to each service and compared the results. Overall, both services were surprisingly accurate and easy to use. Sonix, in particular, offers some unique exporting options that could be especially useful to media producers. Below is an outline of our experience and some thoughts on potential uses.

Accuracy

The quality and accuracy of the transcriptions seemed comparable, with both services producing about the same number of errors. Interestingly, though errors occurred at similar rates, they almost always occurred in different places. All of the transcripts would need cleaning up for official use but would work just fine for editing or review purposes. The slight edge might go to Rev here: it did a noticeably better job at identifying individual speakers, punctuating, and generally (but not always) recognizing names and acronyms.

Interface

When it came time to share and edit the transcripts, both services offered similar web-based collaborative tools. The tools feature basic word processing functions and allow multiple users to highlight, strikethrough, and attach notes to sections of text. After its recent updates, the Rev interface is slightly cleaner and more streamlined, but the services are pretty much even in this category.

Export Options

This is where things get interesting. Both services allow users to export transcripts as documents (Microsoft Word, plain text, and, for Sonix, PDF) and captions (SubRip and WebVTT). However, Sonix offers some unique export options. When exporting captions, Rev automatically formats the length and line breaks of the subtitles and produces reliable results. Sonix, on the other hand, provides several options for formatting captions, including character length, time duration, number of lines, and whether or not to include speaker names. The downside is that the default caption export settings in Sonix led to cluttered, clunky results, but the additional options would be useful for those looking for more control over how their captions are displayed.
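
For reference, here's the same (hypothetical) cue in the two caption formats both services export. SubRip (.srt) numbers each cue and uses commas in its timecodes:

```
1
00:00:12,500 --> 00:00:15,000
SPEAKER 1: Thanks, everyone,
for joining us today.
```

WebVTT adds a file header and uses periods instead:

```
WEBVTT

00:00:12.500 --> 00:00:15.000
SPEAKER 1: Thanks, everyone,
for joining us today.
```

Sonix's character-length, line-count, and speaker-name settings control exactly how text like this gets wrapped and split across cues.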

Sonix also offers two completely different kinds of export. First, users can export audio or video files that include only highlighted sections of the transcript or exclude strikethroughs. Basically, you can produce a very basic audio or video edit by editing the transcript text. Unfortunately, it does not allow users to move or rearrange sections of media, and the edits are all hard cuts, so it's a rather blunt instrument, but it could be useful for rough cuts or for those with minimal editing skills.
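
To picture what this transcript-driven editing amounts to, here's a minimal sketch in Python using the pydub library. The file names and "highlighted" time ranges are hypothetical, and this illustrates the idea rather than Sonix's actual implementation:

```python
from pydub import AudioSegment  # pip install pydub (also requires ffmpeg)

# Hypothetical time ranges (in milliseconds) corresponding to the
# highlighted sections of the transcript.
keep_ranges = [(0, 12_500), (31_000, 58_250), (90_000, 122_750)]

audio = AudioSegment.from_file("interview.mp3")

# Concatenate only the kept sections. Every join is a hard cut,
# which mirrors the limitation described above.
rough_cut = sum(
    (audio[start:end] for start, end in keep_ranges),
    AudioSegment.empty(),
)
rough_cut.export("interview_rough_cut.mp3", format="mp3")
```

The point is that editing the transcript is really just editing a cut list.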

Second, Sonix provides the option of exporting XML files that are compatible with Adobe Audition, Adobe Premiere, and Final Cut Pro. When imported into the editing software, these work like edit decision lists that automatically cut and label media in a timeline. We tried this with two different audio files intended for a podcast, and it worked great. This has the potential to be useful for more complicated and collaborative post-production workflows: an online equivalent of an old-school "paper edit." Again, the big drawback here is the inability to rearrange the text. It could save time when cutting down raw footage, but a true paper edit would still require editing the transcript with timecode in a word processing program.

And the winner is…

Everyone. Both Rev and Sonix offer viable and cost-effective alternatives to traditional human transcription. Though the obvious compromise in accuracy exists, it is much less severe than you might expect. Official transcripts or captions could be produced with some light editing, and, from a media production perspective, quick and cheap transcripts can be an extremely useful tool in the post-production process. Whether you're trying a new service or sticking with the one you know, you can be confident that you're getting the highest quality machine transcription available with either company. As more features like those offered by Sonix get added and improved, machine transcription could become a helpful tool throughout the production process.

The Rise and Fall of BYOD

The bring-your-own-device (BYOD) model has been popular for small and medium meeting and teaching spaces. With the rise of inexpensive and ultra-portable laptops and tablets, the traditional "local computer" has slowly lost favor in many spaces. The computer is expensive, requires significant maintenance, and is a prime target for malicious software. Users also generally prefer their own devices, since they already know the ins and outs of their preferred hardware and operating system. The BYOD model worked well when the guest was sharing a presentation or video to a local projector or monitor. But as AV systems have grown to include unified communication (UC) systems (WebEx, Zoom, Skype, etc.), the pain points of BYOD have been magnified.

First, when hosting a meeting on a BYOD device, connecting your device to a projector or monitor is usually straightforward now that the industry has standardized on HDMI. Yes, you may still need a dongle, but that's an easy hurdle in 2019. But as we add UC (Zoom, for example) to the meeting, things get complicated. First, you need to connect the laptop to a local USB connection (this may require yet another dongle). This USB connection may carry the video feed from the in-room camera and the in-room audio feed. That may not sound complicated, but those feeds may not be obvious. For example, the camera feed could be labeled Vaddio, Magewell, or Crestron. With audio, it can be equally difficult to discover the audio input among labels such as USB Audio, Matrox, or Biamp. Sure, many reading this article may be familiar with what these do… but even for a digital media engineer, these labels can mean multiple things.
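
As an aside, if you want to see exactly what a room's USB connection presents to your laptop, ffmpeg can enumerate the capture devices from the command line (device names will vary by room; these are stock ffmpeg commands, not anything room-specific):

```
# Windows: list DirectShow video and audio capture devices
ffmpeg -list_devices true -f dshow -i dummy

# macOS: list AVFoundation capture devices
ffmpeg -f avfoundation -list_devices true -i ""
```

The names that come back are the same Vaddio/Magewell/Biamp-style labels described above.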

But who cares… we are saving money while giving maximum AV flexibility, right? Errr, not really. Yes, those with a technical understanding of how the AV system works will be able to utilize all of its audiovisual capabilities… but for the rest of the world, there might as well not be an AV system in the space. Even worse, anyone who has attended a meeting where it takes 10+ minutes to connect the local laptop to the correct mics, speakers, and camera knows you may be losing money in the form of time, compounded by every person in attendance.

The Solution?
Soft codecs to the rescue! With the rise of UC soft codecs (Zoom Rooms, Microsoft Teams Rooms, BlueJeans Rooms, etc.), you can integrate an inexpensive dedicated computer capable of performing a wide range of tasks. First, all of the in-room AV connects to the soft codec, so there's no fumbling for dongles or figuring out which audio, mic, or speaker input/output is correct. Second, the soft codec monitors the space to ensure the hardware is functioning normally, moving local AV groups out of a break/fix model and into a managed one. Third, with calendar integration, you can schedule meetings with a physical location. The icing on the cake is that most of these UC soft codecs offer wireless sharing… so you can toss your Apple TV, Solstice Pod, etc. out the window (OK, don't do that… but it's one less thing you need to buy during your next refresh). Oh, and don't even get me started about accessibility and lecture capture!

We have a keen eye on soft codec systems as a potential replacement for traditional classroom AV systems in the mid to long term… and so should you.

Remote AV Control

“If only I could be in two places at once!”
– Every AV Technician… Ever.

But… what if you COULD be in two places at once? During a training earlier this year, I discovered that one hardware manufacturer offers a simple method of gaining remote access to the GUI of an AV system. As you build the system, it automatically creates a password-protected HTML5 web page where (assuming you know the correct URL and password) you can control the system.

As organizations demand more from their AV systems, this kind of functionality will be an invaluable resource for small AV groups when providing evening or emergency AV support.

New Machine Caption Options Look Interesting

We wrote in April of last year about the impact of new AI and machine learning advances in the video world, and specifically around captioning. A little less than a year later, we’re starting to see the first packaged services being offered that leverage these technologies and make them available to end users. We’ve recently evaluated a couple options that merit a look:

Syncwords

Syncwords offers machine transcriptions/captions for $0.60 per minute, and human-corrected transcriptions for $1.35 per minute. We tested this service recently, and the quality was impressive. Only a handful of words needed adjustment on the 5-minute test file we used, and none of the errors seemed likely to significantly interfere with comprehension. The recording quality of our test file was fairly high (low noise, words clearly audible and clearly enunciated).

Turnaround time for machine transcriptions averages about one third of the media run time (so roughly 20 minutes for a 60-minute file). For human-corrected transcriptions, the advertised turnaround time is 3-4 business days, but the company says the average is less than 2 days. A rush human transcription option is available for $1.95 per minute, with a guaranteed turnaround of 2 business days and, according to the company, average delivery within a day.

Syncwords also notes that educational and quantity discounts are available for all of these services, so inquire with them if interested.

Sonix.ai

Sonix is a subscription-based service with three tiers: Single-User ($11.25 per month plus $6.00 per recorded hour, or $0.10 per minute), Multi-User ($16.50 per user per month plus $5.00 per recorded hour), and Enterprise ($49.50 per user per month, with further pricing available upon request). You can find information about the differences among the tiers here: https://sonix.ai/pricing

The videos in the folder below show the results of our testing of these two services alongside the built-in speech-to-text engine currently utilized by Panopto. To be fair, the service currently integrated with Panopto is free with our Panopto license, and for Panopto to license the more current technology would likely increase their costs and ours. We do wonder, however, whether it is simply a matter of time before state-of-the-art services like those featured here become more of a commodity:

https://oit.capture.duke.edu/Panopto/Pages/Sessions/List.aspx?folderID=4bd18f0c-e33a-4ab7-b2c9-100d4b33a254


Rev Adds New Rush Option

Rev.com's captioning services have been in wide use at Duke for the last couple of years, in part because of their affordability (basic captioning is a flat $1.00/minute), the generally high accuracy of the captions, and the overall quality of the user experience Rev offers via its well-designed interfaces and quality support. Quick turnaround is another factor Duke users seem to appreciate. While the exact turnaround times Rev promises are based on file length, we've found that most caption files are delivered the same or next day.


For those of you who need guaranteed rush delivery above and beyond what Rev already provides, the company just announced an option that promises files in 5 hours or less from order receipt. There is an additional charge of $1.00/minute for this service. To choose this option, simply select "Rush My Order" during desktop checkout.

If any of you utilize the new rush service, we’d love to hear how it goes. Additionally, if you have any other feedback about your use of Rev or other caption providers, please feel free to reach out to oit-mt-info@duke.edu.

Kaptivo

Let's face it… humans like articulating concepts by drawing on a wall. This behavior dates back over 64,000 years to some of the first cave paintings. While we've improved on the concept over the years, transitioning to clay tablets and eventually blackboards and whiteboards, the basic idea has remained the same. Why do people like chalkboards and whiteboards? Simple: it's a system you don't need to learn (or one you learned as a child); you can quickly add, adjust, and erase content; it's multi-user; it doesn't require power; it never needs a firmware or operating system update; and it lasts for years. While I'll avoid the grand "chalkboard vs. whiteboard" debate, we can all agree that the two systems are nearly identical and very effective in teaching environments. But as classrooms transition from traditional learning environments (one professor teaching a small-to-medium group of students in a single classroom) to distance education and active learning environments, compounded by our rapid transition to digital platforms… the whiteboard has had a difficult time making the leap. There have been many (failed) attempts at digitizing the whiteboard; just check eBay. Most failed for a few key reasons: they were expensive, they required the user to learn a new system, they didn't interface well with other technologies… oh, and did I mention that they were expensive?

Enter Kaptivo, a "short throw" webcam-based platform for capturing and sharing whiteboard content. During our testing (Panopto sample), we found that the device was capable of capturing the whiteboard image, cleaning it up with a bit of Kaptivo processing magic, and converting the content into an HDMI-friendly format. The power of Kaptivo is in its simplicity. From a faculty/staff/student perspective, you don't need to learn anything new… just write on the wall. But that image can now be shared with our lecture capture system or any AV system you can think of (WebEx, Skype, Facebook, YouTube, etc.). It's also worth noting that Kaptivo can share this content through its own Kaptivo software. While we didn't specifically test that product, it looked to be an elegant solution for organizations with limited resources.

The gotchas: every new or interesting technology has a few. First, Kaptivo currently works only with whiteboards (sorry, chalkboard fans). Also, there isn't any way to daisy-chain or "stitch" multiple Kaptivo units together for longer whiteboards (not to mention how you would share such content). Finally, the maximum supported whiteboard size is currently 6′ x 4′, which isn't all that big in a classroom environment.

At the end of the day, I could see this unit working well in a number of small collaborative learning environments, flipped classrooms, and active learning spaces. We received a pre-production unit, so I'm anxious to see what the final product looks like and whether some of the above-mentioned limitations can be overcome. Overall, it's a very slick device.

AV Voice Control – A Fad or the Future?

In June of 2017, Crestron announced that their 3-Series processors were capable of integrating with Amazon's Alexa voice control. While initially viewed with a bit of skepticism, the updates and enhancements Crestron has made to their modules over the past ten months have made it clear that voice control isn't going anywhere in the short term. If anything, Crestron has doubled down on voice control with the addition of Google Assistant integration in January 2018.

One of the most appealing aspects of an AV control system is that a simple button press can trigger a series of actions across a range of hardware and software. The system shields the end user from the complexities of controlling the various aspects of the AV system. While voice control has been integrated into a wide range of simple devices (lights, electrical plugs, thermostats, locks, etc.), integrating voice control with Crestron systems leverages the same advantages as AV system control. "Alexa, turn on the AV system" performs the same complex tasks as the button press, but can be spoken from anywhere within earshot of the Alexa device and doesn't require any understanding of the touch panel's graphical user interface.

How it works (see the code sketch after this list):

  1. The Alexa device hears your command: "Alexa, turn on the lab TV."
  2. The command is sent to Amazon's cloud, which recognizes "lab TV" as a smart device and forwards the request to Crestron's cloud.
  3. Crestron's cloud receives the request and sends it to the in-room Crestron device.
  4. The Crestron device receives the request, sends the command to the TV, and returns a confirmation to Crestron's cloud.
  5. Crestron's cloud relays a "task completed" signal to Amazon's cloud.
  6. Amazon's cloud receives the "task completed" signal and communicates with the local Echo Dot.
  7. Alexa says "OK."
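
For the technically curious, here's that round trip as a toy Python sketch. Every function below is a hypothetical stand-in for a proprietary Amazon or Crestron hop; it illustrates the message flow, not real API code:

```python
# Toy model of the voice-command round trip described above.

def crestron_processor(request: dict) -> bool:
    """Step 4: the in-room processor fires the same logic as a button press."""
    print(f"Powering on device: {request['device']}")
    return True  # confirmation sent back up the chain

def crestron_cloud(request: dict) -> bool:
    """Steps 3 and 5: relay the request down to the room, then relay the ack."""
    return crestron_processor(request)

def amazon_cloud(utterance: str) -> str:
    """Steps 2 and 6: resolve 'lab TV' to a smart device and forward it."""
    if "lab tv" in utterance.lower():
        done = crestron_cloud({"device": "lab_tv", "action": "power_on"})
        return "OK" if done else "Something went wrong."
    return "Sorry, I don't know that device."

def alexa_device(utterance: str) -> None:
    """Steps 1 and 7: the Dot hears the command and speaks the result."""
    print(amazon_cloud(utterance))

alexa_device("Alexa, turn on the lab TV")  # prints the power-on, then "OK"
```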

What does it take to integrate voice control? First, you'll need an Alexa device in the room, Amazon and Crestron accounts, and the room's Crestron code. By adding two voice control modules (which requires some registration/configuration on Crestron's website) to the existing code, you can assign button presses and analog values to specific names and phrases. A quick recompile and upload, and you're off. The hard part is figuring out what you want to control and how.

A very special THANK YOU!!! to the Duke Digital Initiative (DDI) for purchasing the Amazon Echo Dot as part of their 2017-2018 Internet of Things initiative. Without their support… this testing wouldn't have been possible.

A few things to consider:

  • Safety: Some thought should be spent on ensuring that an Alexa voice command (or a misinterpreted one) can't cause injury. This seems obvious, but from audio levels to moving projection screens, movable walls, and thermostats, it's important to ensure the safety of end users.
  • Security Concern: Alexa is always listening (unless you mute Alexa’s mic), and is always sending data to Amazon’s cloud. There are clear security concerns about using such a system, so take that into consideration.
  • It’s still the early days of Crestron/Alexa voice control, and voice integration can break at any point if Amazon updates Alexa. If you’re considering voice control, you should have direct access to the Crestron code and a programmer or technician capable of implementing updates as needed.
  • Alexa’s voice recognition software is far from perfect and has a particularly difficult time with accents. Also, it generally wants you to talk fast, and sometimes that doesn’t work as well with AV systems.
  • Alexa currently doesn’t have any user authentication. If one person can trigger an action, all users can trigger that action.
  • Alexa is easily confused. “Alexa, set the volume to 30%” and “Alexa, set the speakers to 30%” can confuse Alexa. This contextual understanding within Alexa is improving, but still far from perfect.
  • If your Internet goes down, so does Alexa.

This is the demo we created as a proof of concept. Consider it the tip of the iceberg in terms of what this system can do; the future is exciting.


$20 Bluetooth Headphones

When I purchase inexpensive in-ear Bluetooth headphones, I generally have somewhat low expectations for the device. If they last an academic year and are usable (with the expected minor issues), I chalk it up to a great purchase. After all, they were only $20, right?

After a semester of use, I'm still using my $20 Otium Bluetooth headphones on a daily basis. I initially purchased them because I wasn't able to charge my iPhone while listening to music or watching videos, and I wasn't prepared to spend $160 on Apple AirPods. The Otium headphones received good reviews on Amazon and were well within my "just buy it" impulse-buy budget. When they arrived, it only took a few minutes to connect them to my phone and start putting them through their paces.

General Thoughts:
Audio Quality: The audio quality is similar to that of the headphones that come with the iPhone. They aren't the best, but they're generally fine for listening to music and watching movies. I'm sure an audiophile would disagree.
Comfort: I actually prefer how these headphones feel over a number of other in-ear headphones, but after 5-6 hours of constant listening, my ears get a bit tired of the fit.
Mic: The mic on the headphones is fine for basic phone conversations, but that’s about all I’d use them for.
Noise-Canceling: This device is “noise canceling” in the same way cotton balls are “noise canceling.” If this device has electronic noise-canceling technology, I don’t hear it.

Are the $20 Otium headphones a replacement for Apple AirPods? Well, perhaps. If you're someone who regularly misplaces headphones, needs 4+ hours of battery life, and doesn't mind the tethered design, these are a good solution… for $20. But if you want an amazing level of integration with your iOS devices, these may not be the headphones for you. While I routinely use them with my iPhone, iPad, Apple TV, Raspberry Pi, and MacBook, seeing the deep integration of Apple's AirPods does make me a bit jealous.

AV Use Case:
While consulting with Duke's Fuqua School of Business, we came across an interesting use case where these headphones may work well. A faculty member teaches a class in English to a group of international students. Some of the students are fluent in English, some have a limited English background, and others don't speak English at all. The Fuqua AV group has a professional translation system in place to capture the faculty member's comments and translate them into the native language of the students. If the students ask a question in their native language, the service translates the question back to English and relays it to the faculty member. But they are currently using a wired connection for this. The faculty member has asked whether there could be an untethered option, so the group is exploring a similar Bluetooth configuration in the coming months.

AV Tool Bag – Suaoki Laser Measure

If you provide AV consulting services, as our department does for Duke University, you'll find yourself frequently measuring all types of unique spaces. How far away is that projection screen? How long does the HDMI cable need to be? How big is the room? While a traditional 25′ tape measure is fine for many situations, measuring a larger space can be a bit more difficult. Say a room is 80′ across: that's four trips with a traditional tape measure (assuming the space is flat) and a bit of "the maths." But there must be a better, faster, and perhaps less intrusive method, right?

The Suaoki 131′ Laser Measure is an incredibly handy, flexible, and inexpensive device for quickly measuring all types of unique spaces. At around $25, it's about the cost of a professional measuring tape and easily fits in an AV technician's bag. While I've only had the device for roughly a month, I've found the user interface to be self-explanatory. Simply point the device at the object you want to measure to, press and hold the big red button for a few seconds, and the device will start shooting lasers and providing distance feedback. Pressing the red button a second time stops the laser.

In the above example, I received the maximum distance recorded, the minimum distance recorded, and the average of those two numbers. Considering I was hand-holding the unit, the variance over ~62′ was about 1/8th of an inch. That said, some negative reviews indicate that the device isn't amazingly accurate at long distances (that it can float ~0.5″). I didn't find this to be the case, but in my environment, it's not all that important that the device is perfectly accurate.

All in all, I’ve found the Suaoki 131′ Laser Measure to be a solid purchase for the AV or IT technician on the go.

Captions! Captions! Get your FCPX Captions Here!

As a self-proclaimed accessibility nut, I believe offering subtitles/closed captions isn't simply a nicety in 2018… it's a necessity. This is particularly true now that my ears have passed their prime, perhaps due to one too many Guided by Voices concerts in my youth. Now, before we get a flood of "Adobe Premiere did it first!": I acknowledge that a similar feature has been available on that platform for some time, but whenever I dip my toe into Premiere on a quasi-annual basis… I quickly retreat to the warm embrace of Final Cut Pro.


To put this in context, I don't shoot or edit many videos these days. But when I do, my process for captioning is to edit the video in Final Cut Pro, export it, upload it to YouTube (unlisted), and let YouTube work its machine-learning captioning magic. Usually, within a few minutes or so, YouTube has subtitles that are 80%+ accurate. From within YouTube, I then manually edit the captions until they're nearly 100% accurate. Finally, I make the video publicly viewable.

The above method is great… unless you need to re-upload the video to YouTube (or a different service) with a number of edits. Also, the longer and more complex the video becomes, the more cumbersome managing the subtitles becomes.

In a perfect world, you'd caption your footage as it is imported, either manually or by sending it out to a service. This has a number of advantages, especially for larger projects. First, metadata! Searching through hours of footage for a key phrase YOU KNOW your subject said is absolutely frustrating. Wouldn't it be better if you could search your media library for that phrase? When you caption first, this becomes possible (see the sketch below). Second, when you make edits, the captions follow the footage. So when you make dozens of edits… you don't need to touch the subtitles. Very cool…
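
To make the metadata point concrete: if your captions live next to the footage as plain .srt sidecar files, even a few lines of Python get you that search. A minimal sketch, assuming a hypothetical folder with one .srt file per clip:

```python
from pathlib import Path

phrase = "return on investment"  # the line you KNOW your subject said

# Hypothetical folder of .srt sidecar files, one per clip.
for srt in Path("footage_captions").glob("*.srt"):
    for cue in srt.read_text(encoding="utf-8").split("\n\n"):
        lines = cue.strip().splitlines()
        # A well-formed SRT cue is: index, "start --> end", then text lines.
        if len(lines) >= 3 and phrase.lower() in " ".join(lines[2:]).lower():
            start = lines[1].split(" --> ")[0]
            print(f"{srt.name} @ {start}: {' '.join(lines[2:])}")
```

Point it at your caption folder and it prints the clip name and timecode of every cue containing the phrase.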

Final Cut Pro 10.4.1 is only a few days old, but the captioning feature seems well designed and feels very Apple. Of course, it wouldn't be an Apple feature if it didn't use a unique format, in this case Apple iTunes Timed Text (iTT or .itt). Don't worry, this is actually an upgrade from traditional .srt caption files. With .srt, you basically have only the timing and the words to be displayed on the screen. With Apple's .itt format, you can also embed color information and the location of the text on the screen. Also, .itt files import into YouTube with little trouble. If .itt just isn't going to work, you can also select CEA-608, which is ideal for DVD or Blu-ray mastering, but .itt is the more capable format.
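
iTT is Apple's flavor of the W3C TTML standard, so unlike the flat text of .srt, each cue is an XML element that can carry styling. A minimal sketch of a single cue (the text and styling values are made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling" xml:lang="en">
  <body>
    <div>
      <!-- Unlike .srt, a cue can carry styling such as color and alignment -->
      <p begin="00:00:05.000" end="00:00:08.000"
         tts:color="yellow" tts:textAlign="center">
        This caption would render yellow and centered.
      </p>
    </div>
  </body>
</tt>
```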

I'll be keeping an eye on this feature to see if Apple eventually adds its own Siri voice-to-text within Final Cut Pro (perhaps FCPX 10.5?), but for now, this is a great feature for those of us who love captioning.