In the era of digital marketing, understanding customer preferences and optimizing campaign strategies is crucial for business growth. This research introduces a novel framework that leverages advanced language models to extract valuable marketing knowledge from large-scale data. By implementing an adaptive prompting technique and a progressive filtering mechanism, the proposed system efficiently identifies customer behavior patterns and optimizes audience targeting. Extensive experiments demonstrate the effectiveness of this approach in improving marketing performance and enhancing customer engagement, providing a scalable solution for intelligent decision-making in competitive online marketplaces.
The burgeoning development of the mobile economy has accelerated the expansion of digital commerce, prompting a surge in online promotional initiatives. Digital platforms such as Alipay facilitate the orchestration of marketing efforts through embedded mini-programs, where efficient data dissemination is pivotal. At the core of such systems lies the necessity to align user preferences with promotional content, wherein a Marketing-centric Knowledge Graph (MoKG) functions as a vital intermediary, enhancing the granularity and adaptability of user intent inference.
While traditional solutions like SupKG offer substantial coverage across product hierarchies and spatiotemporal data, they primarily focus on service-oriented relationships. MoKG complements these by targeting user-merchant interactions central to marketing objectives (refer to Fig. 1). Although SupKG's architecture could theoretically support MoKG's construction via established text-mining strategies (e.g., named entity recognition and relation extraction), these methodologies demand extensive human annotation, rendering them inefficient at scale.
The rise of Large Language Models (LLMs) such as ChatGPT and LLaMA, pretrained on expansive web corpora, presents a viable alternative. These models encapsulate broad general knowledge, making them suitable for knowledge graph population. However, their performance may be suboptimal in domains like marketing due to a lack of familiarity with domain-specific terminology and relational structures.
To bridge this divide, the proposed approach decomposes MoKG construction into three interconnected stages: Knowledge Retrieval, Relation Identification, and Entity Augmentation. While prior domain-specific information helps inject relevance into LLMs, several challenges remain: uncontrolled relation generation, single-prompt limitations, and the impracticality of deploying large-scale LLMs due to resource constraints and data privacy concerns.
To address these, a Progressive Prompting-Augmented mIning fRamework (PAIR) is introduced. PAIR formulates relation generation as a filtered selection over a bounded relation set, leveraging refined prompts. Progressive prompt sequences are then applied to guide entity expansion, and aggregated outputs are assessed using semantic consistency and logical coherence metrics. To facilitate scalable deployment, a lightweight derivative model (LightPAIR) is trained using a high-quality dataset distilled from a full-scale LLM.
Formulation
The knowledge graph population task is modeled probabilistically. Given a source node s, the likelihood of target entity t and relation r is defined as:
$$P(r, t \mid s) = \sum_{\kappa} P(\kappa \mid s)\, P(r \mid s, \kappa)\, P(t \mid s, \kappa, r) \qquad (1)$$
where:
- P(κ | s) denotes the contextual knowledge distribution conditioned on the source entity s;
- P(r | s, κ) represents the probability of selecting a relevant relation r given that knowledge;
- P(t | s, κ, r) captures the conditional generation of target entity t based on s, κ, and r.
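To make the factorization concrete, the following minimal Python sketch enumerates triples by nesting the three factors. All helper names and their toy return values are hypothetical stand-ins for the knowledge retrieval and LLM calls; this is a sketch of the decomposition, not the paper's implementation.

```python
# Toy stand-ins for the three factors in Eq. (1). In the real system these
# would be knowledge retrieval and LLM calls; names and values here are
# hypothetical.

def retrieve_knowledge(s):           # samples from P(kappa | s)
    return ["category: children's cartoon"]

def select_relations(s, kappa):      # samples from P(r | s, kappa)
    return ["Related Media", "Target Audience"]

def generate_targets(s, kappa, r):   # samples from P(t | s, kappa, r)
    return {"Related Media": ["Tom and Jerry"],
            "Target Audience": ["parents of toddlers"]}[r]

def populate(s):
    """Enumerate (s, r, t) triples following
    P(r, t | s) = sum_kappa P(kappa | s) P(r | s, kappa) P(t | s, kappa, r)."""
    return [(s, r, t)
            for kappa in retrieve_knowledge(s)
            for r in select_relations(s, kappa)
            for t in generate_targets(s, kappa, r)]

print(populate("Mi Xiao Quan"))
```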
Framework Overview
LLMs often lack the nuanced understanding required in specialized domains. To compensate, two categories of domain knowledge are integrated into the prompting pipeline described below.
Relation Selection with Bounded Scope
To control the scope of relation generation, PAIR retrieves a reduced set Rs of relation candidates based on the entity's type. An LLM then selects relevant relations RF using structured prompts, producing semantically valid entity-relation pairs.
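A hedged sketch of this bounded selection step follows. The type-to-relation map, the llm_complete stub, and the prompt wording are illustrative assumptions; the key idea is that the output is intersected with the candidate set so the LLM cannot invent out-of-scope relations.

```python
# Illustrative bounded relation selection: candidates are narrowed by entity
# type, then an LLM is asked to keep only the relevant ones.

RELATIONS_BY_TYPE = {
    "brand":  ["Target Audience", "Related Brand", "Related Media"],
    "coupon": ["Product of Prize", "Target Audience"],
}

def llm_complete(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a comma-separated selection.
    return "Target Audience, Related Brand"

def select_relations(source: str, entity_type: str) -> list[str]:
    candidates = RELATIONS_BY_TYPE[entity_type]          # bounded set R_s
    prompt = (
        f"Entity: {source} (type: {entity_type})\n"
        f"Candidate relations: {', '.join(candidates)}\n"
        "List only the relations that are meaningful for this entity."
    )
    answer = llm_complete(prompt)
    # Intersect with candidates so out-of-scope relations are discarded.
    return [r for r in candidates if r in answer]        # filtered set R_F

print(select_relations("Uncle Fruit", "brand"))
```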
Progressive Entity Augmentation
Given a relation r and source entity s, multiple augmented prompts are constructed from combinations of κS, κD, and the inherited knowledge κI. Each prompt yields a candidate target set. An aggregation function then computes the final target set TF by jointly considering semantic relevance and consensus frequency, where semantic relevance is scored by a projection network MLP over x_{s,r,t}, the contextual embedding of the triple.
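The sketch below illustrates one plausible form of such an aggregation, blending normalized consensus frequency with a semantic score. The semantic_score stub, the weighting alpha, and the acceptance threshold are assumptions standing in for the KG-BERT-based scorer described in the experimental setup.

```python
# Illustrative aggregation over candidate targets from multiple progressive
# prompts: consensus frequency plus a semantic-relevance score.

from collections import Counter

def semantic_score(s: str, r: str, t: str) -> float:
    # Stand-in for MLP(x_{s,r,t}); any plausibility score in [0, 1] works here.
    return 0.9 if t != "unknown" else 0.1

def aggregate(s, r, candidate_lists, alpha=0.5, threshold=0.6):
    """Blend normalized consensus frequency with semantic relevance."""
    counts = Counter(t for cands in candidate_lists for t in cands)
    n_prompts = len(candidate_lists)
    final = {}
    for t, c in counts.items():
        score = alpha * (c / n_prompts) + (1 - alpha) * semantic_score(s, r, t)
        if score >= threshold:
            final[t] = round(score, 3)
    return final  # final target set T_F with scores

runs = [["Interstellar", "Star Trek"], ["Interstellar"], ["Interstellar", "unknown"]]
print(aggregate("The Three Body", "Similar Movie", runs))
```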
Scalable Knowledge Mining with LightPAIR
Given the impracticality of utilizing full-scale LLMs for massive knowledge extraction, LightPAIR is introduced as a distilled, fine-tuned variant. It is trained on labeled outputs of PAIR using parameter-efficient strategies such as LoRA. This model enables inference over large datasets with reduced resource overhead.
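A minimal sketch of such parameter-efficient fine-tuning with LoRA (via the Hugging Face peft library) is shown below. The base model name, target modules, and hyperparameters are illustrative assumptions, not the paper's settings.

```python
# Minimal LoRA fine-tuning sketch for LightPAIR-style distillation, assuming
# PAIR's validated triples have been serialized into prompt/completion text.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "baichuan-inc/Baichuan2-7B-Base"  # any small open LLM works here
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adjust per model architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of the full model

# Training then proceeds with any standard causal-LM loop (e.g., the
# transformers Trainer) over the teacher-labeled (prompt, triple) pairs.
```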
Fig. 1. Illustration of the MoKG sample subgraph for marketing-based entity relations.
Experiments
Experimental Configuration
Three groups of methods are compared:
1) KG Completion Models: designed to extend existing knowledge graphs using textual and structural alignment.
2) KG Generation Models: leverage large-scale language models to discover commonsense or open-domain knowledge, expanding the graph through a staged prompting pipeline (Rephrasing → Object Expansion) using a large language model.
3) Variants of PAIR: ablations that successively remove the aggregation module (-Agg), progressive prompting (-Agg&Pr), and relation filtering (-Agg&Pr&Rf).
The PAIR model employs a 175-billion-parameter LLM for task execution. For each progressive prompt, the model was queried three times. For reliable aggregation, a variant of BERT (KG-BERT) with 110 million parameters was utilized.
Evaluation Procedure and Criteria: Three human evaluators assessed the extracted knowledge triplets. A triplet was tagged "valid" if agreed upon by two or more evaluators, and "invalid" if two or more disagreed. To ensure unbiased judgment, tuples from different methods were mixed randomly.
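For clarity, the two-of-three rule reduces to a one-line majority vote; the function below is a trivial sketch, not the authors' tooling.

```python
# Two-of-three validity rule used for human evaluation.

def label(votes: list[bool]) -> str:
    """votes: one boolean per evaluator (True = judged valid)."""
    return "valid" if sum(votes) >= 2 else "invalid"

print(label([True, True, False]))   # valid
print(label([True, False, False]))  # invalid
```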
The mining quality was assessed using:
- AEE (Average Entity Expansion): mean count of target entities derived per seed entity.
- ILAD (Intra-List Average Distance): mean Euclidean distance between target entities in representation space.
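A small sketch of both metrics, assuming NumPy arrays of target-entity embeddings; the toy inputs are illustrative.

```python
# Sketch of the two mining-quality metrics.

import numpy as np

def aee(targets_per_seed: list[list[str]]) -> float:
    """Average Entity Expansion: mean number of targets mined per seed."""
    return float(np.mean([len(t) for t in targets_per_seed]))

def ilad(embeddings: np.ndarray) -> float:
    """Intra-List Average Distance: mean pairwise Euclidean distance."""
    n = len(embeddings)
    dists = [np.linalg.norm(embeddings[i] - embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

print(aee([["a", "b", "c"], ["d"]]))             # 2.0
print(ilad(np.array([[0.0, 0.0], [3.0, 4.0]])))  # 5.0
```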
Performance Evaluation
Table I presents a comparison across different models. Key insights are as follows:
TABLE I
Performance comparison for MoKG mining

| Model      | Accuracy (MoKG-181) | Novelty (MoKG-181) | AEE (MoKG-181) | Accuracy (MoKG-500) | Novelty (MoKG-500) | AEE (MoKG-500) |
|------------|---------------------|--------------------|----------------|---------------------|--------------------|----------------|
| BERT       | 58.4%               | -                  | 43.0           | 57.7%               | -                  | 42.7           |
| TRMP       | 91.1%               | -                  | 13.8           | 91.3%               | -                  | 14.1           |
| LMCRAWL    | 86.3%               | 41.2%              | 36.3           | 85.2%               | 41.7%              | 37.1           |
| COMET      | 86.7%               | 35.9%              | 26.1           | 85.9%               | 34.6%              | 25.3           |
| PAIR       | 90.1%               | 40.4%              | 43.7           | 90.7%               | 43.6%              | 42.8           |
| -Agg       | 88.7%               | 39.6%              | 30.8           | 88.9%               | 36.4%              | 31.3           |
| -Agg&Pr    | 86.9%               | 36.8%              | 30.8           | 87.2%               | 34.2%              | 31.4           |
| -Agg&Pr&Rf | 84.9%               | 39.2%              | 46.3           | 84.3%               | 39.4%              | 47.2           |
TABLE II
Evaluation of LightPAIR with different LLMs

| LLM       | Accuracy | Novelty | AEE  | ILAD | Size |
|-----------|----------|---------|------|------|------|
| GLM       | 89.0%    | 31.0%   | 35.7 | 5.77 | 10B  |
| Baichuan2 | 90.3%    | 31.5%   | 41.1 | 5.96 | 7B   |
| ChatGLM2  | 86.3%    | 28.8%   | 39.2 | 5.82 | 6B   |
| Bloomz    | 80.8%    | 29.0%   | 48.5 | 6.12 | 7B   |
| Qwen2     | 80.6%    | 25.8%   | 25.0 | 5.74 | 7B   |
Fig. 2. Overall architecture of PAIR
Removing all three modules (-Agg&Pr&Rf) achieves the highest ILAD but compromises accuracy and novelty.
LightPAIR Analysis with Smaller LLMs
TABLE III
Case study illustrating the effect of prior knowledge in PAIR. Hallucinated and incorrect entities are emphasized in red and blue, respectively.

| Source Entity    | Relation Type    | Target Entities                                                                      |
|------------------|------------------|--------------------------------------------------------------------------------------|
| Mi Xiao Quan     | Related Media    | w/o knowledge: Journey to the West; w/ knowledge: Tom and Jerry, Boonie Bears        |
| CKA              | Target Audience  | w/o knowledge: System Administrator; w/ knowledge: Karate Enthusiasts, Wushu Master  |
| Uncle Fruit      | Related Brand    | w/o knowledge: Fruit Education, Canon; w/ knowledge: Xianfeng Fruit, Fruitday        |
| The Three Body   | Similar Movie    | w/o knowledge: The Wandering Earth; w/ knowledge: Interstellar, Star Trek            |
| Gas Coupon       | Product of Prize | w/o knowledge: Fuel Card; w/ knowledge: Diesel, Gasoline, Gas Gift Card              |
| Tuxi Living Plus | Related Company  | w/o knowledge: Tuxi Catering; w/ knowledge: Carrefour, CR Vanguard, Walmart          |
Fig. 3. Average novelty comparison between the original SupKG and the PAIR-enhanced MoKG across selected entity types.
TABLE IV
Audience segmentation results. TAC = Target Audiences Covered (in thousands). RI = Relative Improvement over EGL.

| Scenario         | TAC (EGL) | TAC (LightPAIR) | RI (%)  |
|------------------|-----------|-----------------|---------|
| Uncle Fruit      | 7.1k      | 8.7k            | +15.3%  |
| The Three Body   | 3.3k      | 6.6k            | +98.1%  |
| Schwarzkopf      | 2.7k      | 4.9k            | +93.1%  |
| Biscuits Voucher | 1.2k      | 1.3k            | +31.2%  |
| Land Lords       | 9.2k      | 22.2k           | +122.0% |
| Gas Coupon       | 3.8k      | 6.1k            | +89.2%  |
Fig. 4. LightPAIR deployment (Offline A) versus EGL-based TRMP system (Offline B) for audience targeting.
As shown in Fig. 4, the proposed LightPAIR model is deployed as "Offline A" and evaluated against the traditional EGL system using the TRMP framework ("Offline B").
Table IV reports the number of Target Audiences Covered (TAC) in various marketing scenarios. LightPAIR demonstrates significant improvements over the EGL system, with relative performance gains ranging from +15.3% to +122.0%. These improvements validate LightPAIR's practical viability for precision marketing in large-scale deployments.
Conclusion
This study introduces PAIR and its optimized variant, LightPAIR, as an innovative solution for extracting marketing-relevant knowledge using large-scale language models. The proposed approach incorporates adaptive relation filtering, staged prompting strategies for entity generation, and a robust aggregation mechanism that jointly considers coherence and semantic alignment. The lightweight LightPAIR variant further refines this design by leveraging compact models trained via high-fidelity data synthesized by a strong teacher LLM.
Extensive evaluations reveal that both PAIR and LightPAIR yield superior performance in terms of knowledge graph accuracy, novelty, and diversity. Moreover, real-world testing confirms their ability to outperform established marketing frameworks in audience targeting scenarios. As a future extension, it is intended to augment the current framework with metapath-driven entity expansion to enable interpretable and controllable growth of domain-specific knowledge graphs.