Haoteng's HomePage - Projects & Talks

Projects & Talks

[Play] LLMs + Serialization Talk@NAACL'25

How to talk to language models: Serialization strategies for structured entity matching

The performance of language models used for tasks involving structured data heavily depends on how the input is transformed into text, particularly the process termed serialization. This work systematically studies the impact of serialization on structured entity matching tasks by benchmarking commonly used schemes with LLMs of different sizes. Based on our findings, a novel lightweight approach is proposed for serializing knowledge graph entities via random walks and utilizing LLMs to encode sampled semantic walks for matching, which achieves SOTA performance and significant throughput increases over GPT-4-based methods.

Yin, H., Kim, J., Mathur, P., Sarker, K., & Bansal, V. (2025). How to talk to language models: Serialization strategies for structured entity matching. In Findings of the Association for Computational Linguistics: NAACL 2025. [Link, Code, Poster]
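
As a rough illustration of what "serialization" means here, the sketch below turns a structured entity into text in two ways: flattening its attributes, and verbalizing a random walk over a toy knowledge graph. The function names, the toy graph, and the output formats are all hypothetical choices made for illustration; they are not taken from the paper or its code.

```python
import random

def serialize_attributes(entity):
    """Flatten an attribute dict into 'key: value' pairs (keys sorted for determinism)."""
    return ", ".join(f"{key}: {value}" for key, value in sorted(entity.items()))

def sample_walk(graph, start, length, rng):
    """Sample one random walk of at most `length` hops.

    `graph` maps a node to a list of (relation, neighbor) edges.
    The walk alternates node, relation, node, ... and stops early at dead ends.
    """
    walk = [start]
    node = start
    for _ in range(length):
        edges = graph.get(node, [])
        if not edges:
            break
        relation, node = rng.choice(edges)
        walk.extend([relation, node])
    return walk

def serialize_walk(walk):
    """Join the alternating node/relation tokens of a walk into one string."""
    return " ".join(walk)

# Hypothetical toy knowledge graph used only for this sketch.
toy_graph = {
    "Paris": [("capital_of", "France")],
    "France": [("located_in", "Europe")],
}

rng = random.Random(0)
walk_text = serialize_walk(sample_walk(toy_graph, "Paris", 2, rng))
attr_text = serialize_attributes({"name": "Paris", "country": "France"})
```

The resulting strings could then be passed to a language model encoder, and the embeddings of two entities compared to decide whether they match.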

[IBM Digital Health] PvGaLM Talk

Privately Learning from Graphs with Applications in Fine-tuning Large Language Models

Existing privacy-preserving methods, such as DP-SGD, which rely on gradient decoupling assumptions, are unsuited for relational learning due to the inherent dependencies between coupled training samples. We first propose a privacy-preserving relational learning pipeline that decouples dependencies in sampled relations during training, ensuring differential privacy through a tailored application of DP-SGD. We apply this method to fine-tune LLMs (e.g., BERT, Llama2) on sensitive graph data and tackle the associated computational complexities. The results demonstrate significant improvements in relational learning tasks, all while maintaining robust privacy guarantees during training.

Yin, H., Wei, R., Chien, E., & Li, P. (2025). Privately Learning from Graphs with Applications in Fine-tuning Large Language Models. In Main Track of the 2nd Conference on Language Modeling. [PDF(arXiv), Code, Poster]
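
For background, the sketch below shows one step of standard DP-SGD (per-sample gradient clipping followed by Gaussian noise), which is the baseline the abstract refers to. It is not the relational pipeline proposed in the paper, and it assumes each per-sample gradient depends on one training sample only, which is exactly the assumption that coupled relational samples violate. All names and parameters are illustrative.

```python
import math
import random

def clip(grad, max_norm):
    """Rescale a gradient vector so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_sample_grads, lr, max_norm, noise_multiplier, rng):
    """One standard DP-SGD update: clip each sample's gradient, sum, add noise, average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(params)
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_sample_grads)
    return [p - lr * g / batch_size for p, g in zip(params, noisy)]

# Usage with noise disabled, so the effect of clipping alone is visible.
updated = dp_sgd_step(
    params=[0.0, 0.0],
    per_sample_grads=[[3.0, 4.0], [0.0, 1.0]],
    lr=1.0,
    max_norm=1.0,
    noise_multiplier=0.0,
    rng=random.Random(0),
)
```

Clipping bounds how much any single sample can move the parameters, which is what lets the added noise translate into a privacy guarantee.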

[Play] BloomSign Talk@WWW'24

Learning Scalable Structural Representations for Link Prediction with Bloom Signatures

Bloom signatures are hashing-based compact encodings of node neighborhoods, which are used to augment the message-passing framework for structural link representations. GNNs with Bloom signatures are provably more expressive than vanilla MPNNs and more scalable than existing edge-wise models. A neural network that inputs Bloom signatures can estimate any type of neighborhood overlap-based heuristic with guaranteed accuracy.

Zhang, T.*, Yin, H.*, Wei, R., Li, P., & Shrivastava, A. (2024). Learning Scalable Structural Representations for Link Prediction with Bloom Signatures. In Proceedings of the ACM Web Conference 2024. [Link, PDF(arXiv), Code]
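
To make the idea of a hashing-based neighborhood encoding concrete, here is a minimal sketch that builds a Bloom-filter bitmask per node and estimates the number of common neighbors of two nodes from the masks alone. It uses the standard Bloom filter cardinality estimate and inclusion-exclusion; the hash construction, the parameters `m` and `k`, and the function names are assumptions made for this sketch, and the paper's actual signatures and estimators may differ.

```python
import hashlib
import math

def bloom_signature(neighbors, m=256, k=2):
    """Encode a neighbor set as an m-bit mask, setting k hashed bits per neighbor."""
    sig = 0
    for v in neighbors:
        for seed in range(k):
            digest = hashlib.blake2b(f"{seed}:{v}".encode(), digest_size=8).digest()
            sig |= 1 << (int.from_bytes(digest, "big") % m)
    return sig

def estimate_cardinality(sig, m=256, k=2):
    """Standard Bloom filter size estimate from the number of set bits."""
    set_bits = bin(sig).count("1")
    if set_bits >= m:
        return float("inf")  # filter is saturated; no finite estimate
    return -(m / k) * math.log(1 - set_bits / m)

def estimate_common_neighbors(sig_u, sig_v, deg_u, deg_v, m=256, k=2):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|, with the union estimated
    from the bitwise OR of the two signatures."""
    union_est = estimate_cardinality(sig_u | sig_v, m, k)
    return deg_u + deg_v - union_est

# Two toy neighborhoods sharing exactly two nodes (8 and 9).
nbrs_u = set(range(10))
nbrs_v = set(range(8, 18))
sig_u = bloom_signature(nbrs_u, m=1024, k=2)
sig_v = bloom_signature(nbrs_v, m=1024, k=2)
approx_common = estimate_common_neighbors(sig_u, sig_v, len(nbrs_u), len(nbrs_v), m=1024, k=2)
```

Because the OR of two signatures equals the signature of the union, overlap statistics can be estimated without storing or intersecting the full neighbor lists.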

[Play] SUREL+ Talk@Learning on Graphs & Geometry

SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning

SUREL+ is a novel set-based computation framework for scaling subgraph-based graph representation learning (SGRL) to industry-level graphs. It is the first time that SGRL has been successfully deployed on a billion-edge graph (twitter-2010). SUREL+ replaces costly subgraph extraction with node set sampling, where the union of sets obtained via online joining acts as a proxy for the query-induced subgraph when predicting a given query.

Yin, H., Zhang, M., Wang, J., & Li, P. (2023). SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning. In Proceedings of the VLDB Endowment 16 (11): 2939-2948. [Link, PDF(arXiv), Code, Poster].
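
The set-based idea can be sketched as follows: each node gets a sampled node set ahead of time, and at query time the sets of the queried nodes are joined instead of extracting a subgraph. This is a simplified illustration under assumed names and a plain random-walk sampler; the actual SUREL+ samplers, data structures, and structural features are not reproduced here.

```python
import random

def sample_node_set(adj, seed_node, num_walks, walk_len, rng):
    """Collect the set of nodes visited by a few short random walks from `seed_node`."""
    visited = {seed_node}
    for _ in range(num_walks):
        node = seed_node
        for _ in range(walk_len):
            if not adj[node]:
                break
            node = rng.choice(adj[node])
            visited.add(node)
    return visited

def join_sets(node_sets, query):
    """Online join: union the pre-sampled sets of all nodes in the query."""
    joined = set()
    for node in query:
        joined |= node_sets[node]
    return joined

# Toy undirected graph (a 6-cycle) as adjacency lists.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
rng = random.Random(0)

# Offline: sample one node set per node. Online: join sets for a link query (0, 3).
node_sets = {u: sample_node_set(adj, u, num_walks=4, walk_len=2, rng=rng) for u in adj}
proxy_nodes = join_sets(node_sets, query=(0, 3))
```

The per-node sets are sampled once and reused across queries, so the per-query cost is a set union rather than a fresh subgraph extraction.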

[Play] SUREL Talk@VLDB'22

Algorithm and System Co-design for Efficient Subgraph-based Graph Representation Learning

SUREL is a novel framework for efficient Subgraph-based Graph Representation Learning by co-designing the learning algorithm and its system support. It adopts the walk-based decomposition of subgraphs and reuses the walks to form subgraphs, substantially reducing the redundancy of subgraph extraction and enabling parallel computation.

Yin, H., Zhang, M., Wang, Y., Wang, J., & Li, P. (2022). Algorithm and System Co-design for Efficient Subgraph-based Graph Representation Learning. In Proceedings of the VLDB Endowment 15 (11): 2788-2796. [Link, PDF(arXiv), Code].

[Play] GNN+Node DE Talk@DLG-AAAI'21

Revisiting Graph Neural Networks and Distance Encoding From a Practical View

This work reviews GNNs with the Distance Encoding (DE) technique for learning on graphs. It 1) categorizes the labels in node classification tasks into community type and structure type; 2) investigates how DE makes GNNs fit for tasks such as node classification and link prediction; and 3) designs eight variants to identify the mechanism that GNNs adopt to predict the two types of node labels under different graph settings.

Yin, H., Wang, Y., & Li, P. (2020). Revisiting Graph Neural Networks and Distance Encoding From a Practical View. In Proceedings of the 35th AAAI Conference on Artificial Intelligence DLG Workshop. [Link, PDF(arXiv), Code].
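
One common form of distance encoding attaches to every node its shortest-path distance to each node in a target set (for example, the two endpoints of a candidate link) and feeds these as extra node features to the GNN. The sketch below computes such features with breadth-first search; the truncation at `max_dist` and the function names are assumptions for illustration, not details from the paper.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_encoding(adj, targets, max_dist=3):
    """Map each node to a tuple of distances to the target nodes.

    Distances are capped at `max_dist`; unreachable nodes also receive `max_dist`.
    """
    per_target = [bfs_distances(adj, t) for t in targets]
    return {
        node: tuple(min(d.get(node, max_dist), max_dist) for d in per_target)
        for node in adj
    }

# Path graph 0 - 1 - 2 - 3, encoding every node relative to the pair (0, 3).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = distance_encoding(path, targets=(0, 3))
```

In this toy path graph, nodes 1 and 2 receive different encodings even though they are structurally symmetric, which is the kind of extra signal that helps with link-level tasks.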

Graph-Structured Sequence Modeling through Spatio-Temporal U-Network

Designed a novel multi-scale architecture, Spatio-Temporal U-Net (ST-UNet), for graph-structured time series modeling. This U-shaped network uses a pair of sampling operations defined over both space and time: pooling (ST-Pool) and unpooling (ST-Unpool). To better localize the representation from the input, higher-level features retrieved from the pooling part are concatenated with the upsampled output. The final output of ST-UNet can be used to predict node attributes or the entire graph over the next few time steps.

Yu, B.*, Yin, H.*, & Zhu, Z. (2019). ST-UNet: A Spatio-Temporal U-Network for Graph-structured Time Series Modeling. arXiv preprint arXiv:1903.05631. [PDF(arXiv)].

Machine Learning Attacks to Location Privacy

Developed a deep neural network-based attack framework that identifies and models statistical patterns in obfuscated user trajectories from location-based services. The model then leverages learned user mobility patterns to circumvent existing privacy mechanisms and perform re-identification attacks.

This internship project was supported by the international internship program at LIX, École Polytechnique & Inria Saclay.

[Play] STGCN Talk@PKU AAIS Graduate Student Forum

Traffic Prediction with Deep Spatial Temporal Neural Nets

Designed a fully convolutional neural network that jointly models the topology of the road network and its traffic patterns, and then forecasts future traffic conditions (speed, flow, or volume) across the network from space-time series over mid- and long-term horizons.

Yu, B.*, Yin, H.*, & Zhu, Z. (2018). Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (pp. 3634-3640). [Link, PDF (arXiv), Code, Slides].

Neural Artist - Style Transfer in Short Videos

Applied Conditional GANs and Fast Style Transfer to convert short videos into customized styles (e.g., Van Gogh's The Starry Night), using DNN-based texture abstraction and a redesigned loss function to minimize flicker between rendered frames.

This project [Link] was awarded the 'Most Technical Difficulty Award' at Schlumberger HackPKU 2017.

Hotspot Prediction Based on Temporal Trajectory and Social Attributes

Proposed a location-vector embedding framework (Loc2vec) to predict geographic hotspots in a given area and explore their semantic meaning based on temporal trajectories, which are built by linking users' check-ins from location-based social networks in chronological order.

Yin, H., & Liu, Y. (2017). Semantic analysis of spatial temporal trajectory in LBSNs (in Chinese). SCIENTIA SINICA Informationis, 47(8), 1051-1065. [Link, PDF]
