The Wicked Problem of Processing Terabytes of Videos

Melchor Tatlonghari


What is a Wicked Problem? In software engineering, it is a problem you are forced to solve repeatedly, building incremental solutions each time, with nothing to tell you that you have reached the correct and final solution.

Wicked Problem: how do I convert 2 TB of raw video footage into small, valuable chunks of video that I can upload to YouTube?

End Results:

[Screenshot: YouTube impressions after 2 weeks]
Useful Context

Hardware: 1 MacBook Pro, 16 GB RAM
Data: 2 terabytes of raw gaming footage (around 100 videos)
1 video = 20–40 GB depending on length; the average runtime is 3–4 hours.
Rendering takes around 50–75% of the original video's duration: a 1-hour video takes roughly 30–45 minutes to render, depending on resolution.
1st Problem Iteration — Cleaning the data
The videos were around 4–6 hours long and ranged from 20 GB to 100 GB per file. Uploading each file took hours even on a decent internet connection, and nobody is interested in watching 6 hours of raw gameplay anyway. I needed to trim each video into small, digestible chunks.

Applying metadata is the first step of any data engineering problem. Given the raw videos, I needed some knowledge of which videos were worth keeping, what they were about, and how many smaller videos I could cut from each big one. This step was unavoidable: no software could tell me which parts of a video were valuable, or when a game started and ended.

Solution — Brute Force Process
I quickly learned the keyboard shortcuts of VLC media player and recorded the key data points in an Excel spreadsheet. The important data points were: Video Title, Introduction Start, Introduction End, Game Start, Game End, Game, Genre, Status, and Other Notes.
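For illustration, exported as CSV the spreadsheet rows might look like this (the titles and timestamps below are invented for the example):

```csv
Video Title,Introduction Start,Introduction End,Game Start,Game End,Game,Genre,Status,Other Notes
Ranked grind night,00:00:00,00:03:10,00:03:10,01:12:45,Dota 2,MOBA,keep,good comeback mid-game
Casual weekend run,00:00:00,00:01:40,00:01:40,00:52:30,Hades,Roguelike,keep,clean first attempt
```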
Learning how to cut video was essential. I started with DaVinci Resolve, a free video editing suite; once you have the metadata, you can essentially leave the software rendering the specified clips overnight.

[Screenshot: the metadata spreadsheet for each video]
2nd Problem Iteration — Manual steps
The problem with brute force is, well, that it's manual. Going through terabytes of data by hand is simply not feasible. With maybe 10–20 videos it is fine to push through manually rather than waste time building automation, but in a day or two I was only able to process 3–5 videos because my local machine could not keep up. Rendering video takes a lot of processing power, even on a MacBook Pro.

I understood that doing it manually was not feasible, but I had to do it manually first to understand the mechanics of the problem. Since I had the metadata, I knew I could automate the clipping: given a 4-hour video file, I needed to turn it into 6–7 smaller files. Doing that in a video editor, typing in start and end times by hand, was painfully time-consuming, and waiting for the render was worse. I would leave the laptop open overnight and wake up the next morning with the job still unfinished. Something had to change.

Solution — Python Libraries
I created a Python script built on MoviePy, a library that let me automate clipping videos. I just passed the metadata to the script and never needed to open the video editing software again. I kept a text file with all the metadata, and once I ran the script it would work through the list one entry at a time, spitting out videos that were already clipped and ready to go.
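A minimal sketch of that kind of script is below. The metadata columns, file paths, and helper names are assumptions for illustration; the actual cutting is done by MoviePy's `VideoFileClip.subclip` and `write_videofile` (MoviePy 1.x API, which requires ffmpeg).

```python
import csv


def to_seconds(ts: str) -> int:
    """Convert an 'HH:MM:SS' timestamp from the spreadsheet into seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s


def clip_video(source: str, start: str, end: str, output: str) -> None:
    """Cut one chunk out of a raw video file using MoviePy."""
    from moviepy.editor import VideoFileClip  # imported lazily; needs ffmpeg installed
    with VideoFileClip(source) as video:
        chunk = video.subclip(to_seconds(start), to_seconds(end))
        chunk.write_videofile(output)


def process_all(metadata_path: str) -> None:
    """Walk the metadata file and render every requested clip, one by one."""
    # hypothetical columns: source, start, end, output
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            clip_video(row["source"], row["start"], row["end"], row["output"])
```

With this in place, "editing" a night's footage is just filling in the metadata file and calling `process_all("metadata.csv")`.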
3rd Problem Iteration — One at a Time
Although I had solved having to manually key times into a video editor, and the script could work through my list in the background, it was still processing videos one at a time. My MacBook's processing power was the constraint, and getting through all the videos was still taking a lot of time.

Solution — Cloud or Threads
I first tried to get more processing power from the cloud by turning my Python script into an AWS Lambda function, a serverless service that only charges for processing time. The conversion was simple enough, since my script was standalone and just needed Lambda's handler signature. However, I realised that to run it against a video file I would have to upload 40 GB of video to the cloud for the Lambda function to process, and the resulting output files would also live in the cloud for me to download. The cost of uploading, processing, and downloading would start racking up, and the wait time, depending on my internet speed, would significantly slow the whole process down.
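A quick back-of-envelope calculation shows why the upload alone sinks this approach (the 100 Mbps uplink is an assumption; real home connections are often slower):

```python
# Rough time to push one raw recording to the cloud.
file_size_bytes = 40e9          # one ~40 GB raw video file
uplink_bits_per_second = 100e6  # assumed 100 Mbps upload link

seconds = file_size_bytes * 8 / uplink_bits_per_second
print(f"{seconds / 3600:.1f} hours per video")  # ~0.9 hours, before any retries
```

And that is per video, in each direction, for around 100 videos.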
In the end, I used Python's multiprocessing to process more than one video at a time, converting my script to run several workers in parallel. This made processing noticeably faster: when I left the computer on overnight, I would wake up to more finished videos in the morning than a single-threaded run could produce.
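The parallel version can be sketched as below. Since MoviePy hands the heavy lifting to an ffmpeg subprocess, a thread pool is enough to keep several renders going at once (`multiprocessing.Pool` is a drop-in alternative for purely CPU-bound work). The `render` body here is a placeholder standing in for the real per-video clipping:

```python
from multiprocessing.pool import ThreadPool


def render(job):
    """Placeholder for the real per-video clipping work.

    In the actual script this called MoviePy, which delegates to an
    ffmpeg subprocess, so several threads parallelise well.
    """
    source, start, end = job
    return f"{source}:{start}-{end}"


def render_all(jobs, workers=3):
    """Render several clips concurrently instead of one at a time."""
    with ThreadPool(processes=workers) as pool:
        # pool.map preserves the order of the input jobs
        return pool.map(render, jobs)
```

The worker count is a knob to tune against the machine: too many concurrent renders and a 16 GB laptop starts swapping.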
4th Problem Iteration — Dying Threads
I move around a lot and had only been processing videos on my MacBook. The problem was that when the MacBook went to sleep as I moved between places, the Python script would stop and I would lose all progress on a rendering video. Having to keep track of what had failed, rerun the script, and wait for it to finish was again time- and energy-consuming. To me, the simplest solution looked like this: I just want to drop a video file into a folder and have the script pick it up and start processing; if my MacBook goes to sleep, it should keep processing, or at least resume when it wakes up.

Solution — Macbook Launchctl
This is when I learned about macOS launchctl. Essentially, it lets you run a background process that the OS manages for you, surviving interruptions that would kill a normal terminal session. It is similar to crontab, but uses macOS's own process manager. In effect, it offloads the job from a user process to a system process: if you bootstrap your script into launchctl, macOS treats it as one of its own processes rather than a user-run command, an internal process the OS has to keep running in the background. This let me just drop a video into a folder; my MacBook effectively became a 24/7 video-processing appliance. I didn't have to run any commands. It would just start processing videos and put the finished ones in an output folder, which made the whole pipeline significantly faster, since the laptop was constantly processing for me regardless of whether I put it to sleep.
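A launchd job is defined by a property-list file. A minimal agent for this kind of pipeline might look like the following (the label, script path, and watched folder are hypothetical): `WatchPaths` tells launchd to fire the job whenever a file lands in the folder, and `RunAtLoad` starts it at login.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.videoprocessor</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/python3</string>
    <string>/Users/me/scripts/process_videos.py</string>
  </array>
  <key>WatchPaths</key>
  <array>
    <string>/Users/me/Videos/inbox</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```

Saved under ~/Library/LaunchAgents/ and loaded with `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.example.videoprocessor.plist`, launchd then owns the job's lifecycle.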

[Screenshot: Apple treating my process as an internal process]
5th Problem Iteration — Current Copyright Issues
The current problem is that once I upload a file to YouTube, there is background music that YouTube complains about. They do tell you the start and end times at which the copyrighted music violates their terms and policies, but the timestamps are not accurate and only cover a subset of the music's actual play time.

I tried Python libraries that remove background music, like Spleeter, but it was very complex to get running and wasn't able to remove any of the music I was targeting. It is worth noting that I only wanted to separate the music from the person speaking in the video. Tools for this exist, but they seem to charge per minute of processed video, and with gigabytes of data I wasn't ready to cough up that much money. Overwriting the audio entirely removed the speaker's voice as well, so that was out of the picture.

Closing
I haven't perfected the solution; there will be further iterations of this problem, and I will write about those too. Thanks for reading.