Let's take a look at how YOLO detects objects in a street tour clip.
Clip of the detection running:
The source clip I used for the YOLO detection: https://www.youtube.com/watch?v=WTFoiHu9MAo
Code:
import cv2
import numpy as np
import time

# Load Yolo
net = cv2.dnn.readNet("weights/yolov3-tiny.weights", "cfg/yolov3-tiny.cfg")
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
output_layers = net.getUnconnectedOutLayersNames()
colors = np.random.uniform(0, 255, size=(len(classes), 3))

# Load Video/Clip/Image
cap = cv2.VideoCapture("LondonCityTour.mp4")
font = cv2.FONT_HERSHEY_PLAIN
starting_time = time.time()
frame_id = 0

while True:
    ret, frame = cap.read()
    if not ret:
        # Stop when the video ends or a frame cannot be read
        break
    frame_id += 1
    height, width, channels = frame.shape

    # Detecting objects
    blob = cv2.dnn.blobFromImage(frame, 0.00392, (800, 800), (0, 0, 0), True, crop=False)
    net.setInput(blob)
    outs = net.forward(output_layers)

    # Showing the information on the screen
    class_ids = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.2:
                # Object detected
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                # Rectangle coordinates
                x = int(center_x - w / 1.8)
                y = int(center_y - h / 1.8)
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.4, 0.3)
    detectedValue = ""
    for i in range(len(boxes)):
        if i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            confidence = confidences[i]
            color = colors[class_ids[i]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.putText(frame, label + " " + str(round(confidence, 2)), (x, y + 30), font, 2, color, 2)
            detectedValue = label + " "
    print(detectedValue)

    elapsed_time = time.time() - starting_time
    fps = frame_id / elapsed_time
    cv2.putText(frame, "FPS: " + str(round(fps, 2)), (10, 50), font, 2, (0, 0, 0), 3)
    cv2.imshow("Yolo Real Time Detection Result", frame)

    key = cv2.waitKey(1)
    if key == 27:  # Esc key
        break

cap.release()
cv2.destroyAllWindows()
This code implements real-time object detection using the YOLOv3-tiny model on a video file ("LondonCityTour.mp4"). It processes video frames, detects objects, draws bounding boxes with labels, and displays the results with frames-per-second (FPS) information. Below is a detailed explanation of the code:
1. Importing Libraries
import cv2
import numpy as np
import time
- cv2: OpenCV library for image and video processing, including deep neural network (DNN) support.
- numpy: Used for numerical operations, such as generating random colors and handling arrays.
- time: Used to calculate FPS by tracking elapsed time.
2. Loading the YOLO Model
net = cv2.dnn.readNet("weights/yolov3-tiny.weights", "cfg/yolov3-tiny.cfg")
Loads the YOLOv3-tiny model:
- yolov3-tiny.weights: Pre-trained weights for the YOLOv3-tiny model.
- yolov3-tiny.cfg: Configuration file defining the model architecture.
YOLOv3-tiny is a lightweight version of YOLO, optimized for faster processing but with slightly lower accuracy.
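One optional addition (not in the original script): if your OpenCV build was compiled with CUDA support, you can ask the DNN module to run inference on the GPU. A minimal sketch, assuming such a build; otherwise OpenCV stays on the default CPU backend:
import cv2

net = cv2.dnn.readNet("weights/yolov3-tiny.weights", "cfg/yolov3-tiny.cfg")
# Only effective if OpenCV was built with CUDA support
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)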
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
Reads the coco.names file, which contains the names of the 80 object classes (e.g., "person", "car", "dog") that the YOLO model can detect.
Each class name is stripped of whitespace and stored in the classes list.
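A quick sanity check, assuming coco.names is the standard 80-class file that ships with Darknet:
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
print(len(classes))   # expected: 80 for the standard COCO class list
print(classes[:5])    # typically ['person', 'bicycle', 'car', 'motorbike', 'aeroplane']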
output_layers = net.getUnconnectedOutLayersNames()
Retrieves the names of the output layers in the YOLO model. These layers produce the detection results (bounding boxes, class probabilities, etc.).
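On older OpenCV versions that lack getUnconnectedOutLayersNames(), the same list can be built by hand from the layer indices. A rough equivalent (the return shape of getUnconnectedOutLayers() varies between versions, so this is only a sketch):
import cv2
import numpy as np

net = cv2.dnn.readNet("weights/yolov3-tiny.weights", "cfg/yolov3-tiny.cfg")
layer_names = net.getLayerNames()
out_ids = net.getUnconnectedOutLayers()   # flat or nested depending on the OpenCV version
output_layers = [layer_names[int(i) - 1] for i in np.array(out_ids).flatten()]
print(output_layers)                      # the YOLO detection layers of the network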
colors = np.random.uniform(0, 255, size=(len(classes), 3))
Generates random RGB colors for each class to visually distinguish detected objects when drawing bounding boxes.
3. Loading the Video
cap = cv2.VideoCapture("LondonCityTour.mp4")
Opens the video file "LondonCityTour.mp4" for processing. cap is a video capture object used to read frames from the video.
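The script assumes the file opens successfully. A small defensive variation (the webcam fallback with index 0 is my own assumption, not part of the original code):
import cv2

cap = cv2.VideoCapture("LondonCityTour.mp4")
if not cap.isOpened():
    print("Could not open the video file, falling back to the default webcam")
    cap = cv2.VideoCapture(0)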
4. Setting Up Display Parameters
font = cv2.FONT_HERSHEY_PLAIN
starting_time = time.time()
frame_id = 0
- font: Specifies the font style (HERSHEY_PLAIN) for text annotations on the video frames.
- starting_time: Records the start time, used later to calculate FPS.
- frame_id: Tracks the number of processed frames.
5. Main Processing Loop
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame_id += 1
Enters a loop that processes the video frame by frame. cap.read() returns a boolean (indicating whether a frame was read) and the frame itself; when the video ends or a frame cannot be read, the loop stops. frame_id counts the processed frames.
height, width, channels = frame.shape
Extracts the dimensions of the frame: height, width, and number of color channels (typically 3, in BGR order for OpenCV).
6. Preprocessing the Frame for YOLO
blob = cv2.dnn.blobFromImage(frame, 0.00392, (800, 800), (0, 0, 0), True, crop=False)
Converts the frame into a "blob" (a preprocessed input format for the neural network); see the inspection sketch after this list:
- 0.00392: Scales pixel values by 1/255 so they fall in the 0–1 range.
- (800, 800): Resizes the frame to 800x800 pixels for the network input. YOLO accepts square inputs whose sides are multiples of 32; larger sizes improve accuracy at the cost of speed (yolov3-tiny is usually run at 416x416).
- (0, 0, 0): Subtracts these values from RGB channels (no mean subtraction here).
- True: Swaps the red and blue channels (OpenCV uses BGR, but YOLO expects RGB).
- crop=False: Resizes without cropping the image.
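To see what blobFromImage actually produces, you can inspect a blob built from a dummy frame. A standalone sketch (the 1280x720 frame size is arbitrary):
import cv2
import numpy as np

frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)  # stand-in for a real frame
blob = cv2.dnn.blobFromImage(frame, 0.00392, (800, 800), (0, 0, 0), True, crop=False)
print(blob.shape)               # (1, 3, 800, 800): batch, channels, height, width
print(blob.min(), blob.max())   # pixel values scaled into the 0-1 range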
net.setInput(blob)
outs = net.forward(output_layers)
Sets the blob as the input to the YOLO model. net.forward(output_layers): Runs the model to get detection outputs from the specified output layers. outs contains bounding box coordinates, confidence scores, and class probabilities.
net is an instance of the cv::dnn::Net class. From the OpenCV documentation:
This class allows to create and manipulate comprehensive artificial neural networks.
Neural network is presented as directed acyclic graph (DAG), where vertices are Layer instances, and edges specify relationships between layers inputs and outputs.
Each network layer has unique integer id and unique string name inside its network. LayerId can store either layer name or layer id.
This class supports reference counting of its instances, i.e. copies point to the same instance.
(https://docs.opencv.org/3.4/db/d30/classcv_1_1dnn_1_1Net.html#details)
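If you are curious about the raw output format, this continuation (it reuses the net and blob variables from the snippets above) prints the shape of each output array:
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())
for out in outs:
    print(out.shape)   # (N, 85): N candidate detections from one YOLO output layer
# Row layout: [center_x, center_y, w, h, objectness, class_score_0 ... class_score_79],
# with the box values normalized to the 0-1 range.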
7. Processing Detections
class_ids = []
confidences = []
boxes = []
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.2:
            # Object detected
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            x = int(center_x - w / 1.8)
            y = int(center_y - h / 1.8)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)
Iterates through the detection outputs:
- scores: Extracts the class probabilities (starting from index 5; the first 5 elements are the box coordinates and the objectness score).
- class_id: Identifies the most likely class by finding the index of the highest score.
- confidence: The confidence score for the detected class.
- Filters out detections below a confidence threshold of 0.2.
For valid detections:
- Calculates the bounding box values (center_x, center_y, w, h) by scaling the normalized YOLO outputs (0–1) to the frame's dimensions.
- Computes the top-left corner (x, y) of the box by offsetting the center by roughly half the width and height (the code divides by 1.8 rather than the usual 2; see the sketch after this list).
- Stores the box coordinates, confidence, and class ID in respective lists (boxes, confidences, class_ids).
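For reference, the more common convention offsets the center by exactly half the width and height. A hypothetical helper (decode_box is my own name, not part of the script) showing that variant:
def decode_box(detection, width, height):
    # detection: one row of the YOLO output, values normalized to the 0-1 range
    center_x = int(detection[0] * width)
    center_y = int(detection[1] * height)
    w = int(detection[2] * width)
    h = int(detection[3] * height)
    x = int(center_x - w / 2)    # conventional half-width offset (the script uses 1.8)
    y = int(center_y - h / 2)
    return [x, y, w, h]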
8. Non-Maximum Suppression (NMS)
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.4, 0.3)
Applies Non-Maximum Suppression to eliminate overlapping bounding boxes (a toy example follows this list):
- 0.4: Score threshold; boxes with lower confidence are discarded before suppression.
- 0.3: IoU (Intersection over Union) threshold used to decide whether two boxes overlap too much.
Returns the indexes of the boxes to keep.
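A tiny standalone example makes the effect of NMSBoxes easy to see; the box and confidence values below are made up for illustration:
import cv2

# Two heavily overlapping boxes for the same object, plus one separate box
boxes = [[100, 100, 50, 80], [105, 102, 50, 80], [300, 200, 60, 60]]
confidences = [0.9, 0.6, 0.8]
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.4, 0.3)
print(indexes)   # keeps index 0 (stronger of the overlapping pair) and index 2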
9. Drawing Bounding Boxes and Labels
detectedValue = ""
for i in range(len(boxes)):
    if i in indexes:
        x, y, w, h = boxes[i]
        label = str(classes[class_ids[i]])
        confidence = confidences[i]
        color = colors[class_ids[i]]
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        cv2.putText(frame, label + " " + str(round(confidence, 2)), (x, y + 30), font, 2, color, 2)
        detectedValue = label + " "
Iterates through the filtered boxes (those in indexes):
- Extracts the box coordinates (x, y, w, h), class label, and confidence.
- Draws a rectangle around the detected object using cv2.rectangle with the class-specific color and a thickness of 2 pixels.
- Adds text with the class name and confidence score just inside the top of the rectangle (at y + 30) using cv2.putText.
- Stores the detected class label in detectedValue (though only the last detected label is kept).
print(detectedValue)
Prints the last detected class label to the console.
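As the potential-issues list at the end points out, detectedValue only keeps the last label per frame. A small variation (detected_labels is my own name; it reuses boxes, indexes, class_ids, and classes from the script) that collects every kept detection:
detected_labels = []
for i in range(len(boxes)):
    if i in indexes:
        detected_labels.append(str(classes[class_ids[i]]))
print(", ".join(detected_labels))   # e.g. "car, person, bus" for one frame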
10. Calculating and Displaying FPS
elapsed_time = time.time() - starting_time
fps = frame_id / elapsed_time
cv2.putText(frame, "FPS: " + str(round(fps, 2)), (10, 50), font, 2, (0, 0, 0), 3)
Calculates the frames per second (FPS) by dividing the number of processed frames (frame_id) by the elapsed time. Displays the FPS on the frame at position (10, 50) with black text.
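This is an average over the whole run, so it reacts slowly to changes. If you prefer the instantaneous frame rate, you can time each frame on its own. A minimal sketch (the 30 ms sleep stands in for the real per-frame processing):
import time

frame_start = time.time()
time.sleep(0.03)                     # stand-in for detecting objects in one frame
frame_time = time.time() - frame_start
fps = 1.0 / frame_time if frame_time > 0 else 0.0
print(round(fps, 2))                 # roughly 33 for a 30 ms frame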
11. Displaying the Output
cv2.imshow("Yolo Real Time Detection Result", frame)
key = cv2.waitKey(1)
if key == 27:
    break
- Displays the processed frame with bounding boxes and annotations in a window titled "Yolo Real Time Detection Result".
- cv2.waitKey(1): Waits 1 millisecond for a key press. If the Esc key (ASCII 27) is pressed, the loop breaks, ending the program.
12. Cleanup
cap.release()
cv2.destroyAllWindows()
Releases the video capture object to free up resources. Closes all OpenCV windows.
Summary
This script performs real-time object detection on a video using the YOLOv3-tiny model:
- Loads the YOLO model and class names.
- Processes each video frame by:
- Preprocessing it into a blob.
- Running it through the YOLO model to detect objects.
- Filtering detections and applying NMS to remove redundant boxes.
- Drawing bounding boxes and labels on the frame.
- Displays the frame with annotations and FPS.
- Stops when the Esc key is pressed.
Potential Issues
- The divisor of 1.8 (instead of the usual 2) when computing the top-left corner shifts the bounding boxes slightly off-center.
- The detectedValue variable only stores the last detected label, which may not be useful if multiple objects are detected.
- The confidence threshold (0.2) and NMS parameters (0.4, 0.3) may need tuning for better performance.
- Processing speed depends on hardware; YOLOv3-tiny is faster than full YOLOv3 but may still be slow on weaker systems.
Need Generative AI, Robot Operating System (ROS 2), Computer Vision, Natural Language Processing, Generative AI Chatbot, Machine Learning, Mobile App, or Web App services? Yes, I provide them!
Call me: (+84) 0854147015
WhatsApp: +601151992689
Viber: +84854147015
https://amatasiam.web.app
Email: ThomasTrungVo@Gmail.Com
Facebook: https://www.facebook.com/voduytrung
X: https://x.com/ThomasTrung