Planning Large Language Models for Enhancing Spatial Cognition and Decision-Making Abilities

PlanGPT-1/1.5

PlanGPT: Enhancing Urban Planning with Tailored
Language Model and Efficient Retrieval

Abstract

In the field of urban planning, general-purpose large language models often fall short of meeting the specific needs of planners. Tasks such as generating urban planning texts, retrieving relevant information, and evaluating planning documents present unique challenges. To enhance the efficiency of urban professionals and overcome these obstacles, we introduce PlanGPT, the first specialized language model tailored for urban and spatial planning. Through collaborative efforts with institutions like the China Academy of Urban Planning and Design, PlanGPT is developed using a customized local database retrieval framework, industry-based foundational model fine-tuning, and advanced tool capabilities. Empirical testing demonstrates that PlanGPT achieves state-of-the-art performance, delivering high-quality responses that accurately adapt to the intricacies of urban planning.
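The local database retrieval mentioned above can be illustrated with a minimal sketch: score candidate planning documents against a user query and hand the best match to the model as context. This bag-of-words cosine retriever is purely illustrative; PlanGPT's actual retrieval framework is more sophisticated, and all names here are assumptions.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document most similar to the query (toy retriever)."""
    q = Counter(query.lower().split())
    return max(docs, key=lambda d: cosine(q, Counter(d.lower().split())))

docs = [
    "land use regulations for residential zoning",
    "transport network maintenance schedule",
]
print(retrieve("residential land use rules", docs))
# -> "land use regulations for residential zoning"
```

A production retriever would replace word counts with dense embeddings, but the interface — query in, best-matching local document out — is the same.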

Technical Architecture


Figure 1: PlanGPT-1/1.5 Architecture.

PlanGPT-1.5: Building upon PlanGPT-1, it incorporates key engineering techniques for practical deployment in the urban planning industry, including insights from real-world use cases, methods that further mitigate hallucinations, and data synthesis techniques that reduce manual annotation costs. The paper was accepted as an oral presentation at the ACL 2025 Industry Track; one of its four reviewers gave it a 9/10 rating and highly recognized PlanGPT's value for industry large models.

“The paper describes a real-life implementation of an LLM-based assistant tailored to a specific domain and highlights the importance of tailoring each component to obtain good usable results. It can serve as a reference for carrying out similar adaptations in other domains and use cases.”

📅 Release Date: September 28, 2023

PlanBench Planning Knowledge Benchmark

A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models

Abstract

Urban planning, as a highly interdisciplinary and practice-oriented field, requires not only simple recall of knowledge but also complex situational judgment, policy understanding, spatial logical reasoning, and value assessment. Planning texts are characterized by dense terminology, complex structures, and long reasoning chains. Constructing benchmarks can help enhance large models' planning adaptation capabilities in the following aspects:

  • Deconstruction of planning texts (e.g., regulation breakdown, indicator interpretation)
  • Multi-level spatial governance logic (national - city - community)
  • Situational policy judgment and plan generation (e.g., site selection, land allocation, industry recommendations)

Text-based benchmarks serve as the linguistic foundation for "multimodal urban intelligence." In subsequent integrations with maps, charts, and spatial models, text comprehension capabilities are fundamental for achieving the three-dimensional linkage of "text-image-policy."
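One way a text-benchmark item covering the aspects listed above might be represented and graded is sketched below. This schema and the toy item are assumptions for illustration, not PlanBench's actual format; the hierarchy in the example comes directly from the "national - city - community" governance levels above.

```python
# Hypothetical benchmark-item schema with exact-match grading.
item = {
    "aspect": "multi-level spatial governance",
    "question": "In the national-city-community hierarchy, which level "
                "sits between national and community?",
    "choices": ["national", "city", "community"],
    "answer": "city",
}

def grade(item: dict, prediction: str) -> bool:
    """Case-insensitive exact-match grading against the gold answer."""
    return prediction.strip().lower() == item["answer"].lower()

print(grade(item, "City"))  # -> True
```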

Technical Architecture


Figure 2: PlanBench-Text Architecture.

📅 Release Date: May 19, 2025

PlanBench Planning Visual Recognition Benchmark

Multimodal Multi-image Understanding for Evaluating Multimodal Large Language Models

Abstract

National spatial planning maps visually present the concepts, goals, strategies, and specific measures of spatial planning, serving as a guide for coordinating various spatial development, protection, and utilization activities. They are not only crucial for planning decisions but also important tools for public participation and oversight of planning implementation. Planning is a highly interdisciplinary and specialized task; understanding planning maps requires grasping detailed elements (symbols, legends, geographic features) and the ability to conduct comprehensive analysis and judgment in conjunction with policies. This complexity makes understanding planning maps challenging. With the rapid development of multimodal large language models (MLLMs), we have established a benchmark for national spatial planning maps to evaluate MLLMs' capabilities in understanding these maps. Our contributions are as follows:

(1) Data: We constructed the Spatial Planning Map Database (SPMD), featuring diverse image content and high-quality annotations provided by experts in the field of planning.
(2) Framework: We proposed a comprehensive framework based on planning disciplines, measuring MLLMs' understanding of planning maps from four perspectives: perception, reasoning, association, and application, including eight subcategories.
(3) Experiments: By constructing question-answer tasks from an authoritative question bank (China's Registered Urban Planner Qualification Examination), we significantly reduced the proportion of hallucinated normative citations produced by models.
(4) Results: All models performed worst in the application dimension, with Qwen2.5-VL-32B-Instruct achieving the highest overall score across all four dimensions.
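Reporting results per dimension, as in (4), amounts to grouping question-level correctness by evaluation dimension. A minimal sketch of that aggregation, where the record format is an assumption rather than the benchmark's actual schema:

```python
from collections import defaultdict

# Toy evaluation records: one entry per benchmark question.
results = [
    {"dimension": "perception",  "correct": True},
    {"dimension": "perception",  "correct": False},
    {"dimension": "reasoning",   "correct": True},
    {"dimension": "application", "correct": False},
]

def dimension_scores(records: list[dict]) -> dict:
    """Accuracy per evaluation dimension (perception, reasoning, ...)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += r["correct"]   # bool counts as 0/1
    return {d: hits[d] / totals[d] for d in totals}

print(dimension_scores(results))
# -> {'perception': 0.5, 'reasoning': 1.0, 'application': 0.0}
```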

Technical Architecture


Figure 3: PlanBench-VL Architecture.

📅 Release Date: May 19, 2025

PlanGPT-VL

PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models

Abstract

Despite the critical importance of urban planning maps to professionals and educators, existing vision-language models (VLMs) often struggle to interpret and evaluate these specialized maps. These planning maps visualize key information such as land use, infrastructure layout, and functional zoning, requiring domain-specific knowledge that general VLMs typically lack. To address this issue, we developed PlanGPT-VL, the first domain-specific vision-language model designed for urban planning maps, featuring three major innovations: (1) PlanAnno-V framework for generating high-quality visual question-answering data for planning maps; (2) Keypoint reasoning mechanism that effectively reduces model hallucinations through structured verification methods; (3) PlanBench-V evaluation benchmark, the first comprehensive testing standard for assessing understanding of planning maps. Experimental results show that compared to open-source and commercial VLMs, PlanGPT-VL achieves an average performance improvement of 59.2% on specialized planning tasks. Notably, despite having only 7 billion parameters, classifying it as a lightweight model, its performance rivals that of larger models with over 72 billion parameters, providing urban planners with a reliable and factually accurate tool for professional map analysis.
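The keypoint reasoning idea, verifying a draft answer against extracted keypoints before it is emitted, can be sketched in miniature. PlanGPT-VL's actual mechanism is model-driven structured verification; this toy version reduces it to string matching, and every name below is a hypothetical stand-in.

```python
def verify(draft_claims: list[str], keypoints: set[str]) -> list[str]:
    """Keep only draft claims supported by at least one extracted keypoint."""
    return [c for c in draft_claims if any(k in c.lower() for k in keypoints)]

# Keypoints extracted from a planning map (illustrative).
keypoints = {"residential", "green space", "arterial road"}
draft = [
    "The plan designates residential blocks in the north.",
    "A subway line runs east-west.",          # unsupported -> dropped
    "Green space buffers the arterial road.",
]
print(verify(draft, keypoints))  # the unsupported subway claim is filtered out
```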

Technical Architecture


Figure 4: PlanGPT-VL Architecture.

📅 Release Date: May 19, 2025

Data Synthesis Techniques Research

FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

Accepted by ACL 2025 Findings

Abstract

Instruction fine-tuning has emerged as a significant advancement in leveraging large language models (LLMs) to enhance task performance. However, the annotation of instruction datasets has traditionally been an expensive and labor-intensive process, often relying on manual labeling or costly proprietary LLM API calls. To address these challenges, we introduce FANNO, a fully autonomous and open-source framework that revolutionizes the annotation process without requiring pre-existing labeled data. FANNO efficiently generates diverse and high-quality datasets through structured processes such as document pre-filtering, instruction generation, and response generation using the Mistral-7b-instruct model. Experimental results on the Open LLM Leaderboard and AlpacaEval benchmarks demonstrate that FANNO can generate high-quality, diverse, and complex data comparable to human annotations.
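The three-stage flow named in the abstract (document pre-filtering, instruction generation, response generation) can be sketched as a pipeline. The stage logic and the `generate` helper below are hypothetical stand-ins for calls to an open-source instruct model such as Mistral-7B-Instruct, not FANNO's actual implementation.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to a local open-source LLM."""
    return f"<model output for: {prompt[:40]}...>"

def prefilter(docs: list[str], min_len: int = 200) -> list[str]:
    """Stage 1: keep only documents substantial enough to seed instructions."""
    return [d for d in docs if len(d) >= min_len]

def synthesize(docs: list[str]) -> list[dict]:
    """Stages 2-3: derive an instruction from each document, then answer it."""
    dataset = []
    for doc in prefilter(docs):
        instruction = generate(f"Write a task grounded in this text:\n{doc}")
        response = generate(f"Answer the task:\n{instruction}")
        dataset.append({"instruction": instruction, "response": response})
    return dataset

corpus = ["too short", "a" * 300]
print(len(synthesize(corpus)))  # -> 1, only the long document survives filtering
```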

Technical Architecture


Figure 5: FANNO Architecture.

📅 Release Date: August 2, 2024


Tag-Instruct: Controlled Instruction Complexity Enhancement through Structure-based Augmentation

Accepted by ACL 2025 Findings

Abstract

High-quality instruction data is crucial for developing large language models (LLMs), yet existing methods struggle to effectively control instruction complexity. We introduce TAG-INSTRUCT, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based approaches that directly handle raw text, TAG-INSTRUCT compresses instructions into a compact tag space and systematically enhances complexity through reinforcement learning-guided tag expansion. Through extensive experiments, we demonstrate that TAG-INSTRUCT outperforms existing methods in instruction complexity enhancement. Our analysis indicates that operating within the tag space provides superior controllability and stability, making it suitable for various instruction synthesis frameworks.
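The compress-augment-expand loop at the core of the tag-space idea can be sketched with toy rules. In the paper, compression and expansion are performed by an LLM and tag selection is RL-guided; everything below is an illustrative simplification.

```python
def compress(instruction: str) -> set[str]:
    """Toy semantic compression: keep content words as tags."""
    stop = {"a", "the", "of", "to", "in", "write"}
    return {w.strip(".,").lower() for w in instruction.split()} - stop

def augment(tags: set[str], extra: set[str]) -> set[str]:
    """Controlled difficulty increase: inject extra constraint tags."""
    return tags | extra

def expand(tags: set[str]) -> str:
    """Re-expand the enriched tag set into a harder instruction (template-based)."""
    return "Write a response that addresses: " + ", ".join(sorted(tags))

tags = compress("Write a summary of the zoning plan")
harder = augment(tags, {"cite-regulations", "compare-alternatives"})
print(expand(harder))
```

Operating on the tag set rather than the raw text is what makes the difficulty increase controllable: each added tag is one discrete, inspectable increment of complexity.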

Technical Architecture


Figure 6: Tag-Instruct Architecture.

📅 Release Date: May 8, 2025

Brain-inspired Spatial Intelligence: Positional Encoding Empowering Large Models

GridPE: Unifying Positional Encoding in Transformers with a Grid Cell-Inspired Framework

Abstract

Understanding spatial positions and relationships is a fundamental capability of modern AI systems. Research on human spatial cognition provides valuable guidance in this area. Discoveries in neuroscience highlight the crucial role of grid cells in spatial representation, including distance computation, path integration, and scale discrimination. This paper introduces a novel positional encoding scheme inspired by Fourier analysis and recent computational neuroscience findings on grid cells. Assuming that grid cells encode spatial positions as a sum of Fourier basis functions, we demonstrate that grid representations exhibit translational invariance during inner product computations. Furthermore, we derive the optimal grid scale ratio for multi-dimensional Euclidean spaces based on principles of biological efficiency. Leveraging these computational principles, we develop a grid cell-inspired positional encoding technique called GridPE for high-dimensional space encoding. We integrate GridPE into pyramid vision transformer architectures. Our theoretical analysis shows that GridPE provides a unified framework for positional encoding in arbitrary high-dimensional spaces. Experimental results indicate that GridPE significantly enhances transformer performance, emphasizing the importance of incorporating neuroscientific insights into AI system design.
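The translational invariance noted above can be checked numerically: if a position is encoded as a sum of Fourier basis functions, the inner product of two codes depends only on the displacement between the positions, so translating both positions by the same vector leaves it unchanged. A minimal NumPy sketch, in which the random wave vectors and normalization are illustrative assumptions rather than GridPE's actual construction:

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 2))          # 16 wave vectors for 2-D positions

def encode(pos: np.ndarray) -> np.ndarray:
    """Grid-cell-style code: a vector of Fourier basis responses exp(i k . x)."""
    return np.exp(1j * K @ pos) / np.sqrt(len(K))

x = np.array([1.0, 2.0])
y = np.array([3.5, -0.5])
t = np.array([10.0, 7.0])             # arbitrary common translation

# <g(x), g(y)> = (1/n) sum_j exp(i k_j . (y - x)) depends only on y - x,
# so shifting both positions by t does not change the inner product.
a = np.vdot(encode(x), encode(y))     # vdot conjugates its first argument
b = np.vdot(encode(x + t), encode(y + t))
print(np.allclose(a, b))              # -> True
```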

Technical Architecture


Figure 7: GridPE Architecture.

📅 Release Date: June 11, 2024

Life Circle Large Model

Large Language Models Empowering Community Life Circle Planning and Governance Research

To be published in 《Shanghai City Planning》

Abstract

In the context of life circle planning and community governance, large language models (LLMs) can play a pivotal role in several key areas. Firstly, through natural language interaction, LLMs can understand residents' genuine needs in various contexts, automatically extracting demand themes and sentiment tendencies from chat records, survey texts, and social media comments. This enables automatic categorization and prioritization of demands, addressing the challenges of high heterogeneity in resident needs and unstructured expression methods, thus facilitating precise service provision. Secondly, LLMs can semantically integrate and model relationships among data from sensor networks, community GIS, demographic statistics, and government service platforms, enhancing the interpretability and operability of data. This supports functional assessment, resource allocation, and spatial optimization of life circles. Additionally, in the process of collaborative community governance, LLMs can act as "intermediary agents," facilitating semantic bridging between residents and diverse entities such as street offices, property management, and enterprises. They assist in policy interpretation, topic negotiation, consensus generation, and other processes, thereby improving collaboration efficiency and satisfaction.
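The first capability described above, turning free-text resident feedback into demand themes and sentiment tendencies, can be sketched as a tagging step. The keyword lists below are illustrative assumptions; a production system would use an LLM classifier rather than keyword matching.

```python
# Toy demand-extraction step: theme + coarse sentiment from resident feedback.
THEMES = {
    "healthcare": {"clinic", "hospital", "doctor"},
    "transport": {"bus", "metro", "parking"},
}
NEGATIVE = {"crowded", "broken", "far", "slow"}

def tag_feedback(text: str) -> dict:
    """Assign a demand theme and a sentiment tendency to one comment."""
    words = set(text.lower().split())
    theme = next((t for t, kw in THEMES.items() if words & kw), "other")
    sentiment = "negative" if words & NEGATIVE else "neutral"
    return {"theme": theme, "sentiment": sentiment}

print(tag_feedback("the clinic is too far and always crowded"))
# -> {'theme': 'healthcare', 'sentiment': 'negative'}
```

Aggregating these tags across chat records, surveys, and social media comments is what enables the automatic categorization and prioritization of demands described above.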

Technical Architecture


Figure 8: Geography Proximity Enhanced Multimodal RAG Framework.

The WeChat mini-program "方元问问" (Fangyuan Wenwen), developed by our lab members Boyang Li and Nuoxian Huang, partially implements these scenarios of large model-assisted life circle planning and governance. It focuses on providing an intelligent assistant for public service facility information to community residents and businesses, and has been piloted in the Nantou Ancient City community in Shenzhen. The application combines natural language interaction in the WeChat mini-program with community geographic information and multimodal data to enable intelligent querying and recommendation of community public service facilities, providing residents with convenient service navigation and consultation. It also supports information sharing and transactions between community residents and businesses, aiding the management and operation of the community's public service facilities.


Figure 9: Example of "Fangyuan Wenwen" Community Application Scenario.

📅 Release Date: August 26, 2024

Members

He Zhu, Minxin Chen, Yijie Deng, Junyou Su, Wen Wang, Yurun Wang, Yulin Wu, Caicheng Niu, Tianhua Lu, Chengcheng Liu, Boyang Li, Nuoxian Huang, Ying'er Cai, Yue Wei, Sizheng Yang, Luyao Niu, Jiayu Gu, Yuhan Zou, Fenghong An, Siqi Cha, Chuang Deng, Hanying Li, Hongzhou Zheng and Qi Wang.

BSAI
CAUP