Here's a simplified step-by-step:
Chunk the Video into 1-second Intervals
To divide the video into 1-second chunks, you can use a library such as MoviePy or OpenCV. This example uses OpenCV:
import cv2
video = cv2.VideoCapture('your_video.mp4')
fps = video.get(cv2.CAP_PROP_FPS)  # may be fractional (e.g. 29.97), or 0 if unknown
# Read every frame into memory (fine for short clips)
frames = []
while video.isOpened():
    ret, frame = video.read()
    if not ret:
        break
    frames.append(frame)
video.release()
# Now chunk into 1-second intervals
frames_per_chunk = max(1, round(fps))  # guard against a step of 0
chunks = [frames[i:i + frames_per_chunk] for i in range(0, len(frames), frames_per_chunk)]
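A fixed number of frames per chunk drifts away from true 1-second boundaries when the frame rate is fractional (29.97 fps, for example). One possible alternative, sketched here as an assumption rather than part of the original approach, is to assign each frame to a chunk by its timestamp (frame index divided by fps). The helper name `chunk_by_timestamp` is hypothetical.

```python
def chunk_by_timestamp(frames, fps, chunk_seconds=1.0):
    """Group frames into consecutive chunks using each frame's timestamp."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    chunks = []
    for index, frame in enumerate(frames):
        # Timestamp of this frame in seconds, mapped to the chunk it falls into
        chunk_index = int((index / fps) // chunk_seconds)
        # Pad with empty chunks if the frame rate is low enough to skip one
        while len(chunks) <= chunk_index:
            chunks.append([])
        chunks[chunk_index].append(frame)
    return chunks
```

With `fps = 2.5`, six frames land in chunks of 3, 2 and 1 frames, whereas slicing by a fixed count of 2 would place frame 2 (timestamp 0.8 s) in the second chunk.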
 
Generating the Embeddings
Each 1-second chunk is a series of images (its frames), and an embedding is calculated for each one using the OpenAI CLIP model.
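import cv2
import torch
import clip
from PIL import Image
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)
chunk_features = []
for chunk in chunks:
    # preprocess expects an RGB PIL image; OpenCV frames are BGR NumPy arrays
    images = [preprocess(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))) for frame in chunk]
    # Stack all tensors into one batch on the model's device
    images_input = torch.stack(images).to(device)
    # Generate one embedding per frame
    with torch.no_grad():
        image_features = model.encode_image(images_input)
    # Keep the result instead of overwriting it on the next iteration
    chunk_features.append(image_features)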
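The loop above yields one embedding per frame, while a search index usually needs one vector per chunk. A minimal sketch of one way to get there, assuming simple mean pooling followed by L2 normalization (this pooling choice is an assumption and is not stated in the original; the helper name `mean_pool` is hypothetical). It is written with plain lists so it runs without a GPU or model weights.

```python
import math

def mean_pool(frame_embeddings):
    """Average per-frame vectors into one vector and L2-normalize it."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    count = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    # Component-wise mean across all frames in the chunk
    mean = [sum(vec[d] for vec in frame_embeddings) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero vector cannot be normalized, so return it unchanged
    return [x / norm for x in mean] if norm > 0 else mean
```

On the PyTorch tensors from the loop, the same pooling would be `image_features.mean(dim=0)` followed by division by the vector's norm.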
 
Performing the Search
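You can rank the chunks against a query by cosine similarity. The snippet below assumes that util is sentence_transformers.util, that query_vector is the embedding of the search query (for a text query, CLIP's model.encode_text applied to clip.tokenize output), that corpus_vector holds the chunk embeddings, and that docs lists the corresponding chunks:
 
    # Assumed source of util.cos_sim
    from sentence_transformers import util
    
    # Calculate cosine similarity between the corpus of vectors and the query vector
    scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()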
    
    # Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))
    
    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    
    # Output passages & scores
    for doc, score in doc_score_pairs:
        print(score, doc)
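Since each chunk covers a fixed interval, a ranked list of scores can be turned back into positions in the video. A rough sketch, assuming that score `i` belongs to the chunk starting at `i * chunk_seconds` (the helper name `top_chunk_timestamps` is hypothetical):

```python
def top_chunk_timestamps(scores, k=3, chunk_seconds=1.0):
    """Return (start, end, score) for the k best-scoring chunks, best first."""
    # Pair each score with its chunk index, then sort by decreasing score
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [
        (index * chunk_seconds, (index + 1) * chunk_seconds, score)
        for index, score in ranked[:k]
    ]
```

For example, `top_chunk_timestamps([0.1, 0.9, 0.4, 0.7], k=2)` picks the chunks starting at 1.0 s and 3.0 s.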
The challenge with this approach, however, is that treating each 1-second interval as a series of independent frames does not capture the context of the video. The intervals should be treated as moving images.
Mixpeek offers a managed search API that does this:
GET: https://api.mixpeek.com/v1/search?q=people+experiencing+joy
Response:
[
  {
    "content_id": "6452f04d4c0c0888bdc6b97c",
    "metadata": {
      "file_ext": "mp4",
      "file_id": "ebc289d7-44e1-4672-bf3c-ccfa490b7k2d",
      "file_url": "https://mixpeek.s3.amazonaws.com/<user>/<file>.mp4",
      "filename": "CR-9146f0.mp4"
    },
    "score": 0.636489987373352,
    "timestamps": [
      2.5035398230088495,
      1.2517699115044247,
      3.755309734513274
    ]
  }
]
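A small sketch of how a client might build that request URL and read a response of the shape shown above. Only the endpoint and the field names come from the example; authentication and headers are not shown in the original, so they are left out, and the helper names `build_search_url` and `best_moments` are hypothetical. No network call is made here.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the example request above
SEARCH_ENDPOINT = "https://api.mixpeek.com/v1/search"

def build_search_url(query):
    """URL-encode the query text into the `q` parameter."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query})}"

def best_moments(response_text):
    """Return (filename, score, sorted timestamps) for each hit in the response."""
    hits = json.loads(response_text)
    return [
        (hit["metadata"]["filename"], hit["score"], sorted(hit["timestamps"]))
        for hit in hits
    ]
```

`build_search_url("people experiencing joy")` reproduces the URL in the example, because `urlencode` encodes spaces as `+`.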
Further reading and demo: https://learn.mixpeek.com/what-is-semantic-video-search/