People with hearing and speech disabilities continue to face serious challenges in environments that rely primarily on spoken communication, which often limits their independence and social participation. Many existing assistive systems depend on costly wearable devices or lack the accuracy and responsiveness needed for real-time use. To address these limitations, this study introduces DeepSign Vox, a computer vision-based system that converts sign language gestures into written text and synthesized speech. The framework combines hand landmark tracking with MediaPipe and a Convolutional Neural Network (CNN) trained to recognize predefined gesture patterns. Performance analysis indicates an average detection accuracy of 97% in general usage scenarios, rising to 99% under controlled test conditions. An integrated text-to-speech module generates clear, natural audio output, with additional support for regional languages including Telugu. By removing the need for specialized hardware and relying on software-driven recognition, the proposed approach offers an affordable, flexible, and user-friendly solution. These results highlight the potential of DeepSign Vox to improve communication for individuals with hearing and speech impairments and to promote more inclusive interaction between humans and digital systems.
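The abstract does not give implementation details, so the following is only a minimal sketch of the described pipeline (landmark tracking, CNN classification, text-to-speech), assuming the Python MediaPipe Hands solution, a small Keras CNN over the 21 landmark coordinates, and gTTS for Telugu-capable speech output. The gesture vocabulary, model architecture, and helper names (`GESTURES`, `build_cnn`, `extract_landmarks`, `speak`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
import tensorflow as tf
import cv2
import mediapipe as mp
from gtts import gTTS

# Hypothetical gesture vocabulary; the paper's actual label set is not given here.
GESTURES = ["hello", "thank_you", "yes", "no", "help"]


def build_cnn(num_classes: int) -> tf.keras.Model:
    """Small 1-D CNN over the 21 MediaPipe hand landmarks (x, y, z each)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(21, 3)),
        tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
        tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])


def extract_landmarks(frame_bgr, hands):
    """Return a (21, 3) array of normalized landmark coordinates, or None if no hand is found."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    lm = results.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)


def speak(text: str, lang: str = "te") -> None:
    """Synthesize speech with gTTS; 'te' is the Telugu language code."""
    gTTS(text=text, lang=lang).save("output.mp3")


if __name__ == "__main__":
    # In practice the CNN would be loaded with trained weights; this uses an untrained model.
    model = build_cnn(len(GESTURES))
    hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)

    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    if ok:
        landmarks = extract_landmarks(frame, hands)
        if landmarks is not None:
            probs = model.predict(landmarks[np.newaxis, ...], verbose=0)[0]
            label = GESTURES[int(np.argmax(probs))]
            print("Recognized gesture:", label)
            speak(label)
    cap.release()
```

In a real-time setting the single-frame read above would sit inside a capture loop, with the predicted text accumulated into phrases before being passed to the text-to-speech module.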