Death rates worldwide are rising because of deadly diseases, and cardiovascular disease (CVD) is among the hardest to predict. CVD datasets are heterogeneous, combining structured tabular data with unstructured data such as images. The preprocessing pipeline must first address noise, class imbalance, and missing values. To handle these issues, a hybrid approach combining a convolutional neural network (CNN) and a feedforward neural network (FNN) is used. The FNN processes the tabular data with embedded imputation and feature normalization, while the CNN applies adaptive spatial pooling to address resolution variability in the images. A cross-modal attention layer fuses the features from both modalities. Class imbalance is handled with the synthetic minority oversampling technique (SMOTE) as part of the data preprocessing pipeline, and missing data is handled with dropout-augmented training. The model outperforms selected machine learning models and unimodal deep learning models, and it offers a scalable, efficient solution for early CVD prediction in personalized clinical care. Its main advantages are multi-modal fusion, better interpretability through attention weights that may highlight key factors, and robustness to missing data through imputation and dropout augmentation.
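As an illustration of the two mechanisms named above, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that adaptive spatial pooling means average pooling to a fixed output grid, and that the attention layer can be approximated by scaled dot-product attention over one embedding per modality. The function names, the `query` vector (which would be a learned parameter in a trained model), and all numeric values are hypothetical.

```python
import math


def adaptive_avg_pool2d(image, out_h, out_w):
    """Average-pool a 2D grid (list of rows) to a fixed out_h x out_w grid."""
    in_h, in_w = len(image), len(image[0])
    pooled = []
    for i in range(out_h):
        # Bin edges: start = floor(i * in / out), end = ceil((i + 1) * in / out).
        r0 = (i * in_h) // out_h
        r1 = -((-(i + 1) * in_h) // out_h)
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = -((-(j + 1) * in_w) // out_w)
            cells = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(tabular_vec, image_vec, query):
    """Fuse two same-length embeddings with softmax attention weights.

    Assumption: one scaled dot-product score per modality against a
    hypothetical query vector; the source does not specify the layer.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * x for q, x in zip(query, vec)) / scale
        for vec in (tabular_vec, image_vec)
    ]
    weights = softmax(scores)
    fused = [weights[0] * t + weights[1] * v
             for t, v in zip(tabular_vec, image_vec)]
    return fused, weights


# Images of different resolutions map to the same 2 x 2 feature grid.
small = [[float(4 * r + c) for c in range(4)] for r in range(4)]   # 4 x 4
large = [[float(r + c) for c in range(7)] for r in range(10)]      # 10 x 7
small_feat = [v for row in adaptive_avg_pool2d(small, 2, 2) for v in row]
large_feat = [v for row in adaptive_avg_pool2d(large, 2, 2) for v in row]

# Hypothetical tabular embedding and query, chosen only for illustration.
tabular_feat = [0.2, 1.5, 0.7, 3.0]
query = [0.1, 0.1, 0.1, 0.1]
fused, weights = attention_fuse(tabular_feat, small_feat, query)
```

Here `weights` holds one value per modality and sums to 1, so it can be read as a rough indication of how much each modality contributed to the fused representation for a given patient, which is the kind of interpretability the attention weights are meant to support.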